I don't know what this is called, but I think it is possible.
I am looking to write something (I don't know the exact name) that will
go to a web page, select a value from a drop-down box on that page, and read values from the page after the selection. I am not sure whether it is called a crawler or an activity. I am new to this, but a friend told me a long time ago that it can be done.
Can anyone please give me a head start?
Thanks
You need an HTTP client library (perhaps libcurl in C, or some C# wrapper for it, or some native C# HTTP client library like this).
You also need to parse the retrieved HTML content. So you probably need an HTML parsing library (maybe HTML agility pack).
If the targeted webpage is nearly fixed and has e.g. some comments to ease finding the relevant part, you might use simpler or ad-hoc parsing techniques.
Some sites send a nearly empty static HTML page, with the actual content constructed dynamically by JavaScript (Ajax). In that case you are unlucky: a plain HTTP client does not run those scripts, so the downloaded HTML will not contain the values you are after.
Maybe you want some web service ....
One simple way (but not the most efficient way) is to read the web page as a string using WebClient, for example:
WebClient Web = new WebClient();
String Data = Web.DownloadString("Address");
HTML is not guaranteed to be well-formed XML, but if this particular page is (for example, XHTML), you can parse the string into an XDocument and look up the tag that represents the drop-down box. Parsing the string into an XDocument is done this way:
XDocument xdoc = XDocument.Parse(Data);
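As a rough sketch of the "look up the tag" step, the following reads every option of a drop-down from a well-formed fragment. The class and method names (DropDownReader, ReadOptions) are made up for illustration, and it assumes the markup has no XML namespace; real-world HTML will often make XDocument.Parse throw.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper, not part of any library.
public static class DropDownReader
{
    // Returns (value, text) pairs for every <option> element in the markup.
    // Assumes well-formed XML without a default namespace; XDocument.Parse
    // throws an XmlException on anything that is not well-formed.
    public static List<KeyValuePair<string, string>> ReadOptions(string xhtml)
    {
        XDocument doc = XDocument.Parse(xhtml);
        return doc.Descendants("option")
                  .Select(o => new KeyValuePair<string, string>(
                      (string)o.Attribute("value") ?? string.Empty, // missing attribute becomes ""
                      o.Value.Trim()))                              // visible text of the option
                  .ToList();
    }
}
```

You would pass it the string returned by DownloadString and then pick the entry you want from the list.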
Update:
If you want to read the result for the selected value, and that result is displayed within the page, do this:
Get all the items as I explained.
If the page does not make use of models, then you can pass your selected value as a query-string argument, for example:
www.somepage.com/?Name=YourItem
Read the page again and find the value.
Related
I need to retrieve some info from an HTML document, since the web service that would return JSON or XML is not ready yet. I'm working with C# and using regular expressions to get the data I need from the HTML string. I've managed to get the div I want to work with out of the whole HTML string, but now I'm having trouble getting the text inside the first span tag.
I've attempted to retrieve the data between ; and the first closing span tag, but what I really want is the content inside the first span tag.
Here's the regular expression I've written so far, but it's not working:
".*;(?<Content>(\r|\n|.)*)</span>"
I also tried this, but it didn't work either:
"<span class=""type"">(?<Content>(\r|\n|.)*)</span>"
Here is the div I want to retrieve the data from:
<div class="main">ABASASDFÓ 18/06/2014 17:38h Blabla Balbal <span class="type">15.80€ </span>+1.94 % +0.30€ | HOME <SPAN class="type2">11,398.70</span> +0.65 % +74.10</div>
EDIT: I can't use HtmlAgilityPack since my client does not want us to use any external library. I've also heard about using XmlReader, but I'm not sure the HTML will be valid XML.
This regex will capture the string:
"<span class=\"type\">(?<Content>([^<]*))</span>"
That said, I agree with the other answers: you should use something like XPath instead of regexes for parsing HTML.
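For what it's worth, here is a minimal sketch of applying that pattern from C#. The class and method names (SpanExtractor, TryGetFirstTypeSpan) are invented for the example; it assumes the span has exactly the attribute shown and contains no nested tags.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper, not part of any library.
public static class SpanExtractor
{
    // Captures the text inside the first <span class="type">...</span>.
    // [^<]* stops at the next tag, so the match cannot run past the closing span.
    public static bool TryGetFirstTypeSpan(string html, out string content)
    {
        Match m = Regex.Match(
            html,
            "<span class=\"type\">(?<Content>[^<]*)</span>",
            RegexOptions.IgnoreCase); // tolerate <SPAN> / </SPAN>
        content = m.Success ? m.Groups["Content"].Value.Trim() : string.Empty;
        return m.Success;
    }
}
```

With the div from the question, the captured group is the text of the first span, with the trailing space trimmed.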
Here's how it is done with a regex in JavaScript. You should be able to adapt this for C# pretty easily. The capture group is lazy so that it stops at the first closing span tag rather than the last one.
var inner = html.match( /<span class="type"(?:\s+[a-z]+(?:\s*=\s*(?:"[^"]*"|'[^']*'|[^\s>]+)))*\s*>([\S\s]*?)<\/span>/i)[1];
Fiddle: http://jsfiddle.net/GarryPas/uk32r8vz/
You want to use XPath for that. Something like this:
div/span/text()
I understand not wanting an external third-party library in your solution. The solution to that is to fetch the source code of the entire library:
https://htmlagilitypack.codeplex.com/
Now you don't have an external library, you have an internal library and you can use the right tool for the job!
XmlReader is a fairly low-level tool, it could technically do the job for you but what you're more after is "use XmlReader to do XPath" which is talked about here: https://msdn.microsoft.com/en-us/library/ms950778.aspx
The XPathReader class is the result of all that, which has been superseded by LINQ to XML: https://msdn.microsoft.com/en-ca/library/bb387098.aspx
So another option here is to use LINQ to process your HTML file, but that might be tricky since HTML often isn't well-formed XML. Still, it's another option if you're looking for those.
In the global.asax I measure the request execution time in onbeginrequest (start the stopwatch) and onendrequest (calculate the difference).
Then, in the end request, I do a Response.Write with the result.
However, it writes the result AFTER the closing html tag; basically, it appends to the end.
The current line of code is:
HttpContext.Current.Response.Write(elapsedTime);
Is there an easy way for the Response.Write to REPLACE the string ::actualResult:: within the actual HTML with the actual result string?
I've tried a lot of things, including searching online, but it seems no one needs this or I suck at searching. I thought I could just get the entire response somehow and replace from there, but I'm unsure how to do that... something along the lines of ...Response.GetTheEnitreResponse??.Replace... of course that is just wishful thinking ;)
Thanks
You didn't specify whether you are using Web Forms, MVC, Web Pages, etc., but normally these frameworks buffer the response that is output by whatever page the user is hitting. Your code in onendrequest is coming to the party after all of the page contents (normally closed with an html closing tag) have been output to the buffer. So when you do a Response.Write you are appending to that HTML, and thus it ends up outside the closing html tag.
If you want to have the timing be visual on the page you will have to parse into the response and inject your string. This looks hard to do outside of a Page class in ASP.NET.
Messy, and there are better alternatives. Tracing is usually the way these types of things are handled.
You may want to consider writing this information out to a Glimpse trace or somehow hooking into its display... I can't say enough about Glimpse.
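If you do want to go the "inject your string" route anyway, one possible shape is a stream wrapper that buffers the whole response and swaps the placeholder when it is closed. This is only a sketch: PlaceholderFilter is a made-up class, it has not been tried inside a real ASP.NET pipeline, and it assumes the whole response fits comfortably in memory.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical filter: collects everything written to it, then on Close/Dispose
// replaces the placeholder and forwards the result to the wrapped stream.
public sealed class PlaceholderFilter : Stream
{
    private readonly Stream _inner;
    private readonly MemoryStream _buffer = new MemoryStream();
    private readonly string _placeholder;
    private readonly Func<string> _replacement; // evaluated late, at close time
    private readonly Encoding _encoding;
    private bool _done;

    public PlaceholderFilter(Stream inner, string placeholder, Func<string> replacement, Encoding encoding)
    {
        _inner = inner;
        _placeholder = placeholder;
        _replacement = replacement;
        _encoding = encoding;
    }

    public override bool CanRead => false;
    public override bool CanSeek => false;
    public override bool CanWrite => true;
    public override long Length => throw new NotSupportedException();
    public override long Position
    {
        get => throw new NotSupportedException();
        set => throw new NotSupportedException();
    }

    // Hold the bytes back instead of passing them through.
    public override void Write(byte[] buffer, int offset, int count) => _buffer.Write(buffer, offset, count);
    public override void Flush() { /* nothing is forwarded until the stream is closed */ }
    public override int Read(byte[] buffer, int offset, int count) => throw new NotSupportedException();
    public override long Seek(long offset, SeekOrigin origin) => throw new NotSupportedException();
    public override void SetLength(long value) => throw new NotSupportedException();

    protected override void Dispose(bool disposing)
    {
        if (disposing && !_done)
        {
            _done = true;
            string html = _encoding.GetString(_buffer.ToArray());
            byte[] output = _encoding.GetBytes(html.Replace(_placeholder, _replacement()));
            _inner.Write(output, 0, output.Length);
            _inner.Flush(); // the wrapped stream is left open for its owner to close
        }
        base.Dispose(disposing);
    }
}
```

In ASP.NET the idea would be to assign an instance to Response.Filter in the begin-request handler, wrapping the existing Response.Filter and passing Response.ContentEncoding, with the delegate returning the stopwatch's elapsed time. Treat that wiring as an assumption to verify, not a tested recipe.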
Rather than writing the elapsed value to the response, you could store the result in HttpContext.Items and then access this on the view/page:
HttpContext.Current.Items.Add("elapsed", elapsed);
HttpContext.Items Property
I want my program to automatically download only certain information off a website. After finding out that this is nearly impossible I figured it would be best if the program would just download the entire web page and then find the information that I needed inside of a string.
How can I find certain words/numbers after specific words? The word before the number I want to have is always the same. The number varies and that is the number I need in my program.
Sounds like screen scraping. I recommend using CsQuery https://github.com/jamietre/CsQuery (or HtmlAgilityPack if you want). Get the source, parse it into an object, loop over all text nodes and do your string comparison there. The actual way of doing this depends a LOT on how the source HTML is written.
Maybe something like this untested example written from memory (CsQuery):
var dom = CQ.Create(stringWithHtml);
dom["*"].Each((i, e) =>
{
// handle only text nodes
if (e.NodeType == NodeType.TEXT_NODE) {
// do your check here
}
});
I've used HTML Agility Pack for multiple applications and it works well. Lots of options too.
It's a lovely HTML parser that is commonly recommended for this. It will take malformed HTML and massage it into XHTML and then a traversable DOM, like the XML classes. So it is very useful for the code you find in the wild.
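If the page is simple enough that a parser feels like overkill, the "number after a fixed word" lookup from the question can be sketched with a regular expression on the downloaded string. LabelledNumber and TryFind are made-up names, and the pattern assumes the number follows the word directly (optionally after a colon) and uses a dot as the decimal separator.

```csharp
using System.Globalization;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of any library.
public static class LabelledNumber
{
    // Finds the first occurrence of `label` and parses the number that follows it.
    public static bool TryFind(string text, string label, out decimal number)
    {
        // The label (escaped so it is matched literally), optional whitespace and
        // colon, then an optionally signed number with an optional decimal part.
        string pattern = Regex.Escape(label) + @"\s*:?\s*(-?\d+(?:\.\d+)?)";
        Match m = Regex.Match(text, pattern);
        if (m.Success)
        {
            return decimal.TryParse(m.Groups[1].Value, NumberStyles.Number,
                                    CultureInfo.InvariantCulture, out number);
        }
        number = 0m;
        return false;
    }
}
```

If there is markup between the word and the number, this will not match, and that is the point where a real HTML parser starts to pay off.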
Within a C# project I'm sending a WebRequest to a PHP website, which takes the values, uses a select statement to query the DB, and returns an HTML page. Since only one value comes back from that query, I need to assign this value to a variable in my C# code.
The source of the body tag of the returned HTML page (as read with the StreamReader in my C# code) looks like this:
<table border='1'>
<tr><td>ValueINeed</td></tr>
</table>
How do I access the value inside this in order to assign it to a string in my C# code?
Thank you.
If you are the author of the PHP code as well, I would suggest that you make another page that returns json or something instead, this way you would be able to avoid parsing HTML.
But if this really is what you are stuck with, I would suggest that you take a look at Html Agility Pack. Here is another question on StackOverflow about how to use the Html Agility Pack.
If the structure of the result is always the same, you could just split the string or use a regex.
If not, you may use an HTML parser: http://htmlagilitypack.codeplex.com/
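A minimal sketch of the regex option, assuming the response always contains exactly the single-cell table from the question. TableCell and TryGetFirstCell are names made up for the example.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper, not part of any library.
public static class TableCell
{
    // Returns the trimmed content of the first <td>...</td> in the markup.
    public static bool TryGetFirstCell(string html, out string value)
    {
        Match m = Regex.Match(
            html,
            @"<td[^>]*>(.*?)</td>",                             // lazy: stop at the first </td>
            RegexOptions.IgnoreCase | RegexOptions.Singleline); // let . span line breaks
        value = m.Success ? m.Groups[1].Value.Trim() : string.Empty;
        return m.Success;
    }
}
```

You would feed it the string you read from the StreamReader. If the cell can ever contain nested tags, switch to a parser instead.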
I have an MVC 3 web application project, and in one page I use NicEdit to allow the user to enter formatted text.
When the controller receives the data, the data is in HTML format... perfect. NicEdit itself doesn't allow script tags or onFoo events to be entered directly in elements, but a user with bad intentions can force scripts in, and that would not be safe.
What can I do to ensure the safety of the incoming data... strip out script tags, find and remove onXyz events... what else?
Also, what is the easiest way to do it? Should I use HtmlAgilityPack, or is there a simple function somewhere that will do the whole job with a single call?
Note: just encoding the whole string is not a valid solution. What I want is a way to ensure that the HTML code is safe to render in another page, when someone wants to view the submitted content.
Thanks!
You could use the AntiXss library. Dangerous scripts will be removed:
string input = ...
string safeOutput = AntiXss.GetSafeHtmlFragment(input);