Get Proper XPath for SelectNodes - c#

I just started using HtmlAgilityPack to scrape some text from websites. I have experimented and found that some websites are easier than others in regards to getting the proper XPath when using the SelectNodes method. I believe I am doing something wrong but can't figure it out.
For example, when exploring the DOM in Google Chrome, I am able to copy the XPath //*[@id="page"]/span/table[7]/tbody/tr[1]/td/span[2]/a, and then I do something like:
    var search = doc.DocumentNode.SelectNodes("//*[@id=\"page\"]//span//table//tr//td//span//a");
When I use search in a foreach loop I get a null reference error, and sure enough the debugger says search is null. So I am assuming the XPath is wrong (or I am doing something else totally wrong). My question is: how exactly do I get the proper XPath for HtmlAgilityPack to find these nodes?

Following up on what you requested in your last comment: the HTML is fully rendered only after the initial HTTP GET request has returned.
Several JavaScript calls then insert blocks of HTML into the document.
The one you want is loadCompanyProfileData('ContactInfo'), which generates an HTTP GET request that looks like:
http://financials.morningstar.com/cmpind/company-profile/component.action?component=ContactInfo&t=XNAS:AAPL&region=usa&culture=en-US&cur=&_=1465809033745
This returns the email, which you can extract with code like the following:
    // CssSelect and the one-argument GetAttributeValue are extension methods
    // (this appears to rely on ScrapySharp.Extensions alongside HtmlAgilityPack).
    HtmlWeb w = new HtmlWeb();
    var doc = w.Load("http://financials.morningstar.com/cmpind/company-profile/component.action?component=ContactInfo&t=XNAS:AAPL&region=usa&culture=en-US&cur=&_=1465809033745");
    var emails = doc.DocumentNode.CssSelect("a")
        .Where(a => a.GetAttributeValue("href").StartsWith("mailto:"))
        .Select(a => a.GetAttributeValue("href").Replace("mailto:", string.Empty));
emails ends up containing one element, investor_relations@apple.com.
Your problem is to determine what the "cur" parameter that the loadCompanyProfileData JavaScript function uses should be for each distinct company.
I could not locate where or how this parameter is generated in the code.
One alternative would be to run a browser emulator (such as the Selenium WebDriver port for C#) so that you can execute JavaScript, and run the call to loadCompanyProfileData('ContactInfo') for each company request.
But I could not get this to work either; my WebDriver script execution does not seem to be working.
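One thing that may be worth trying before reaching for a browser emulator is to build the component URL per ticker yourself. The sketch below rests on two unconfirmed assumptions: that the trailing _ value is only a cache-busting timestamp in milliseconds, and that cur can stay empty as it is in the sample URL above. The class name ContactInfoUrl and the second ticker are hypothetical.

```csharp
using System;

public static class ContactInfoUrl
{
    // Builds the ContactInfo component URL for one ticker.
    public static string Build(string ticker, long cacheBuster)
    {
        // The ticker is inserted as-is so the result matches the sample URL.
        return "http://financials.morningstar.com/cmpind/company-profile/component.action"
             + "?component=ContactInfo"
             + "&t=" + ticker
             + "&region=usa&culture=en-US"
             + "&cur="                 // left empty, as in the sample URL (assumption)
             + "&_=" + cacheBuster;    // assumed to be a cache-busting timestamp
    }

    public static void Main()
    {
        long now = DateTimeOffset.UtcNow.ToUnixTimeMilliseconds();

        // "XNAS:MSFT" is a hypothetical second ticker, used only for illustration.
        foreach (string ticker in new[] { "XNAS:AAPL", "XNAS:MSFT" })
        {
            Console.WriteLine(Build(ticker, now));
        }
    }
}
```

Each URL printed this way could then be passed to HtmlWeb.Load as in the snippet above. If the server turns out to require a real value for cur, this approach will not be enough.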

Related

C# Html parsing

I'm trying to parse HTML in my C# project, without success. I am using the HtmlAgilityPack library to do so; I can get some of the HTML body text, but for some reason not all of it.
I need to grab the div with ID of latestPriceSection, and filter to the USD value from https://www.monero.how/widget
My function (doesn't work)
    public void getXMRRate()
    {
        HtmlWeb web = new HtmlWeb();
        HtmlAgilityPack.HtmlDocument document = web.Load("https://www.monero.how/widget");
        HtmlNode[] nodes = document.DocumentNode.SelectNodes("//a").Where(x => x.InnerHtml.Contains("latestPriceSection")).ToArray();
        foreach (HtmlNode item in nodes)
        {
            Console.WriteLine(item.InnerHtml);
        }
    }
Your function doesn't work because the widget is updated via script. The div contains nothing when you load the page, so you can't use HAP to scrape the information from it. Find a web service that can give you the information you need.
Alternatively, you can use Selenium to get the HTML after the page has run the script. Or you can use the WebBrowser class, but that requires a forms application in which a form hosts the WebBrowser control.
You need to retrieve the JSON data from https://www.monero.how/widgetLive.json, because the widget loads this resource in an Ajax request.
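As a rough sketch of that last suggestion, the code below downloads the JSON and lists its top-level properties so you can see which one holds the USD value. The field names in that response are not given here, so nothing is assumed about them; the class name WidgetJson and the "usd" key in the comment are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Net.Http;
using System.Text.Json;

public static class WidgetJson
{
    // Returns one "name = rawValue" line per top-level property of a JSON object.
    public static List<string> DescribeTopLevel(string json)
    {
        var lines = new List<string>();
        using (JsonDocument doc = JsonDocument.Parse(json))
        {
            if (doc.RootElement.ValueKind != JsonValueKind.Object)
            {
                return lines; // not an object, so there is nothing to list
            }
            foreach (JsonProperty property in doc.RootElement.EnumerateObject())
            {
                lines.Add(property.Name + " = " + property.Value.GetRawText());
            }
        }
        return lines;
    }

    public static void Main()
    {
        using (var client = new HttpClient())
        {
            // Blocking call for brevity; prefer await in real code.
            string json = client.GetStringAsync("https://www.monero.how/widgetLive.json")
                                .GetAwaiter().GetResult();
            foreach (string line in DescribeTopLevel(json))
            {
                Console.WriteLine(line);
            }
        }
    }
}
```

Once you know the property name, doc.RootElement.GetProperty("usd") (with the real name in place of the hypothetical "usd") would read the value directly.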

Parse webpage with Fragment identifier in URL, using HTML Agility Pack

I want to parse a web page that has a fragment identifier (#) in its URL, e.g. http://steamcommunity.com/market/search?q=appid%3A570+uncommon#p4
When I use my browser (Google Chrome), I get different results for different identifiers (#p1, #p2, #p3), but when I use HTML Agility Pack, I always get the first page, regardless of the identifier.
    string sURL = "http://steamcommunity.com/market/search?q=appid%3A570+uncommon#p";
    var wClient = new WebClient();
    var html = new HtmlAgilityPack.HtmlDocument();
    html.LoadHtml(wClient.DownloadString(sURL + i)); // i is the page number
I understand that something like Ajax is used here and that in fact only one page exists. How can I fix my problem and get the results from the other pages using C#?
Like David said, use this URL: http://steamcommunity.com/market/search/render/?query=appid%3A570%20uncommon&search_descriptions=0&start=30&count=10
where start is the index of the first item and count is the number of items you want. (The #p4 fragment is never sent to the server, which is why the plain page URL always returns the first page.)
The result is JSON, so, to state the obvious, the only field you want to use is results_html.
Side note: in Chrome, press F12 and click the Network tab to see the request being made and its result.
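A minimal sketch of how that URL could be built per page and how results_html could be pulled out of the response, using System.Text.Json. The class and method names are hypothetical, and the sample JSON in Main is a made-up, heavily shortened body; only the results_html field name comes from the answer above.

```csharp
using System;
using System.Text.Json;

public static class SteamMarketPaging
{
    // Builds the render URL for one page; start is the zero-based item offset.
    public static string BuildRenderUrl(string query, int start, int count)
    {
        return "http://steamcommunity.com/market/search/render/"
             + "?query=" + Uri.EscapeDataString(query)
             + "&search_descriptions=0"
             + "&start=" + start
             + "&count=" + count;
    }

    // Pulls the results_html string out of the JSON response body.
    public static string ExtractResultsHtml(string json)
    {
        using (JsonDocument doc = JsonDocument.Parse(json))
        {
            return doc.RootElement.GetProperty("results_html").GetString();
        }
    }

    public static void Main()
    {
        // With 10 items per page, the page after the first three starts at 30.
        Console.WriteLine(BuildRenderUrl("appid:570 uncommon", 3 * 10, 10));

        // Hypothetical, shortened response body for illustration only.
        string sample = "{\"success\":true,\"results_html\":\"<div>item</div>\"}";
        Console.WriteLine(ExtractResultsHtml(sample));
    }
}
```

The extracted string is an HTML fragment, so it can be handed to HtmlDocument.LoadHtml exactly as the question does with the downloaded page.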

How to read returned xml value from google maps

I am trying to call google maps geocode and am following the example on their webpage to try and apply it to mine
http://code.google.com/apis/maps/documentation/geocoding/index.html
in this example, the Geocoding API requests an xml response for the
identical query shown above for "1600 Amphitheatre Parkway, Mountain
View, CA":
http://maps.googleapis.com/maps/api/geocode/xml?address=1600+Amphitheatre+Parkway,+Mountain+View,+CA&sensor=true_or_false
The XML returned by this request is shown below.
Now I am trying to request that URL like this in my C# WinForms application:
    string url = "http://maps.googleapis.com/maps/api/geocode/xml?address=1600+Amphitheatre+Parkway,+Mountain+View,+CA&sensor=true_or_false";
    WebRequest req = HttpWebRequest.Create(url);
    WebResponse res = req.GetResponse();
    StreamReader sr = new StreamReader(res.GetResponseStream());
    try
    {
        Match coord = Regex.Match(sr.ReadToEnd(), "<coordinates>.*</coordinates>");
        var b = coord.Value.Substring(13, coord.Length - 27);
    }
    finally
    {
        sr.Close();
    }
However, it doesn't seem to be returning anything, and as a result my var b line gives an index out of bounds error. Can anyone point me in the right direction for at least getting the example to work, so I can apply the logic to my own application?
Thanks
If you visit your link "http://maps.googleapis.com/maps/api/geocode/xml?address=1600+Amphitheatre+Parkway,+Mountain+View,+CA&sensor=true_or_false" directly in a browser you can see what it's returning. It's giving me a REQUEST DENIED error.
The problem is caused by the sensor=true_or_false parameter. You have to choose if you want it to be true or false. Google put it this way in their example so that you have to explicitly decide for yourself. This setting indicates if your application is using a location sensor or not. In your case, I'm guessing not, so set it to false.
If you change the link you're using to http://maps.googleapis.com/maps/api/geocode/xml?address=1600%20Amphitheatre%20Parkway,%20Mountain%20View,%20CA&sensor=false, I think you'll get the results you were expecting.
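Once the request succeeds, checking the status element before reading coordinates avoids the out-of-bounds error on a denied request. The sketch below parses the XML with XDocument instead of a regular expression. It assumes the response layout GeocodeResponse/status and GeocodeResponse/result/geometry/location/lat and lng; the two XML strings in Main are shortened, hand-written samples rather than real responses, and the class name is hypothetical.

```csharp
using System;
using System.Xml.Linq;

public static class GeocodeXml
{
    // Returns "lat,lng" for the first result, or null when the status is not OK
    // or no location element is present.
    public static string ReadFirstLocation(string xml)
    {
        XElement root = XDocument.Parse(xml).Root;

        string status = (string)root.Element("status");
        if (status != "OK")
        {
            return null; // e.g. REQUEST_DENIED
        }

        XElement location = root.Element("result")?.Element("geometry")?.Element("location");
        if (location == null)
        {
            return null;
        }

        return (string)location.Element("lat") + "," + (string)location.Element("lng");
    }

    public static void Main()
    {
        // Shortened, hand-written samples following the assumed layout.
        string ok = "<GeocodeResponse><status>OK</status><result><geometry><location>"
                  + "<lat>37.4217550</lat><lng>-122.0846330</lng>"
                  + "</location></geometry></result></GeocodeResponse>";
        string denied = "<GeocodeResponse><status>REQUEST_DENIED</status></GeocodeResponse>";

        Console.WriteLine(ReadFirstLocation(ok));
        Console.WriteLine(ReadFirstLocation(denied) ?? "(no result)");
    }
}
```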

Google Finance, how to get the JSON data streamed?

I tried to explain this earlier, but obviously failed!
So, if you have a google finance graph open, for instance:
http://www.google.com/finance?q=INDEXNASDAQ:.IXIC
I would like to somehow use the HttpWebRequest object in C# so that I can grab the small pieces of data that Google sends to the page to update the graph.
A friend mentioned this was JSON?
I was trying to use the following code example, but even when I set the KeepAlive property to true, it still wouldn't work:
http://msdn.microsoft.com/en-us/library/system.net.httpwebrequest.keepalive.aspx#Y369
You also need to change the example's line that sets the Connection property to Close. Comment out this line (along with keeping the keep-alive property set to true):
myHttpWebRequest2.Connection = "Close";
You do that and your example should run fine.
Regarding getting the data and using HttpWebRequest to work with it, you can do that. The data returned isn't JSON - it looks like straight text and I'm guessing Google's javascript is parsing it out. (I haven't inspected the javascript on Google Finance's page, but that's my guess.)
Using Fiddler, the response from this URL:
http://www.google.com/finance/getprices?q=.IXIC&x=INDEXNASDAQ&i=120&p=10m&f=d,c,v,o,h,l&df=cpct&auto=1&ts=1307994768643
looks like this:
EXCHANGE%3DINDEXNASDAQ
MARKET_OPEN_MINUTE=570
MARKET_CLOSE_MINUTE=960
INTERVAL=120
COLUMNS=DATE,CLOSE,HIGH,LOW,OPEN,VOLUME
DATA=
TIMEZONE_OFFSET=-240
a1307994120,2641.12,2641.12,2639.96,2640.01,0
1,2638.76,2642.14,2638.76,2641.13,0
2,2638.95,2640.54,2638.74,2638.79,0
3,2639.85,2640.01,2638.08,2638.95,0
4,2640.07,2640.87,2639.31,2639.88,0
5,2640.31,2640.48,2639.42,2640.08,0
6,2641.09,2641.09,2640.3,2640.31,0
A little cryptic, but you can see how the COLUMNS line lines up with the data at the bottom. Also, the f querystring parameter seems to be indicating which columns to return (d=date, c=close,v=volume,o=open,h=high,l=low).
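A rough sketch of how that text could be parsed follows. It hard-codes the column order from the COLUMNS line in the sample (DATE, CLOSE, HIGH, LOW, OPEN, VOLUME). It also assumes, which the response itself does not state, that a first field starting with "a" is a Unix timestamp in seconds and that the plain integers on later rows are offsets from it in units of INTERVAL seconds. The class and type names are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public static class GetPricesParser
{
    public sealed class Bar
    {
        public DateTimeOffset Time;
        public decimal Close, High, Low, Open;
        public long Volume;
    }

    public static List<Bar> Parse(string body)
    {
        CultureInfo inv = CultureInfo.InvariantCulture;
        var bars = new List<Bar>();
        int interval = 0;
        long baseTime = 0;

        foreach (string raw in body.Split('\n'))
        {
            string line = raw.Trim();
            if (line.Length == 0) continue;

            // Header lines look like KEY=VALUE; only INTERVAL is needed here.
            if (line.Contains("="))
            {
                if (line.StartsWith("INTERVAL=", StringComparison.Ordinal))
                {
                    interval = int.Parse(line.Substring("INTERVAL=".Length), inv);
                }
                continue;
            }

            string[] f = line.Split(',');
            if (f.Length != 6 || f[0].Length == 0) continue; // e.g. the EXCHANGE line

            long seconds;
            if (f[0][0] == 'a')
            {
                // Assumed: absolute Unix time in seconds, prefixed with 'a'.
                baseTime = long.Parse(f[0].Substring(1), inv);
                seconds = baseTime;
            }
            else
            {
                // Assumed: offset from the last 'a' row, counted in intervals.
                seconds = baseTime + long.Parse(f[0], inv) * interval;
            }

            bars.Add(new Bar
            {
                Time = DateTimeOffset.FromUnixTimeSeconds(seconds),
                Close = decimal.Parse(f[1], inv),
                High = decimal.Parse(f[2], inv),
                Low = decimal.Parse(f[3], inv),
                Open = decimal.Parse(f[4], inv),
                Volume = long.Parse(f[5], inv)
            });
        }
        return bars;
    }

    public static void Main()
    {
        string sample = "INTERVAL=120\nCOLUMNS=DATE,CLOSE,HIGH,LOW,OPEN,VOLUME\n"
                      + "a1307994120,2641.12,2641.12,2639.96,2640.01,0\n"
                      + "1,2638.76,2642.14,2638.76,2641.13,0\n";
        foreach (Bar bar in Parse(sample))
        {
            Console.WriteLine(bar.Time.ToUnixTimeSeconds() + " close=" + bar.Close.ToString(CultureInfo.InvariantCulture));
        }
    }
}
```

If you request a different f parameter, the column order changes, so a more robust version would read the COLUMNS line instead of hard-coding positions.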
EDIT: I should mention that the URL I used is being sent from the finance graph page to get updated data - you can see this URL being requested at regular intervals using a tool like Fiddler. The response data that I pasted above is also output by the sample application from MSDN.
But commenting out that one line in the example from MSDN and a little fiddling with Fiddler should give you the data and clues you need to parse the return that comes from that URL.
I hope this helps!
PS - my first line in my modified MSDN example looks like this:
HttpWebRequest myHttpWebRequest1 = (HttpWebRequest)WebRequest.Create("http://www.google.com/finance/getprices?q=.IXIC&x=INDEXNASDAQ&i=120&p=10m&f=d,c,v,o,h,l&df=cpct&auto=1&ts=1307994768643");
I made a similar change to the other WebRequest call a little further down in the example...other than that, I didn't change anything else in the example.

C# HTTP programming

I want to build a piece of software that will process some HTML forms; it will be a kind of bot that processes some forms on my website automatically.
Can anyone give me some basic steps for doing this? Any tutorials, samples, books or whatever can help me.
Can some of you post working code that uses the POST method?
Check out How to: Send Data Using the WebRequest Class. It gives an example of how to create a page that posts to another page using the HttpWebRequest class.
To fill out the form...
1. Find all of the INPUT or TEXTAREA elements that you want to fill out.
2. Build the data string that you are going to send back to the server. The string is formatted like "name1=value1&name2=value2" (just like in the querystring). Each value will need to be URL encoded.
3. If the form's "method" attribute is "GET", then take the URL in the "action" attribute, add a "?" and the data string, then make a "GET" web request to the URL.
4. If the form's "method" is "POST", then the data is submitted in a different area of the web request. Take a look at this page for the C# code.
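For the POST case, the steps above might be sketched as follows. BuildFormData URL-encodes each name and value and joins them with "&"; Post writes that string into the request body with the application/x-www-form-urlencoded content type. The class name, field names and values are hypothetical, and Post is not called in Main because the target URL depends on your form.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Net;
using System.Text;

public static class FormPoster
{
    // Builds "name1=value1&name2=value2" with every name and value URL-encoded.
    public static string BuildFormData(IEnumerable<KeyValuePair<string, string>> fields)
    {
        var sb = new StringBuilder();
        foreach (KeyValuePair<string, string> field in fields)
        {
            if (sb.Length > 0) sb.Append('&');
            sb.Append(Uri.EscapeDataString(field.Key));
            sb.Append('=');
            sb.Append(Uri.EscapeDataString(field.Value));
        }
        return sb.ToString();
    }

    // Sends the data string in the body of a POST request and returns the response text.
    public static string Post(string actionUrl, string formData)
    {
        byte[] body = Encoding.UTF8.GetBytes(formData);

        var request = (HttpWebRequest)WebRequest.Create(actionUrl);
        request.Method = "POST";
        request.ContentType = "application/x-www-form-urlencoded";
        request.ContentLength = body.Length;

        using (Stream stream = request.GetRequestStream())
        {
            stream.Write(body, 0, body.Length);
        }

        using (var response = (HttpWebResponse)request.GetResponse())
        using (var reader = new StreamReader(response.GetResponseStream()))
        {
            return reader.ReadToEnd();
        }
    }

    public static void Main()
    {
        // Hypothetical field names and values.
        var fields = new List<KeyValuePair<string, string>>
        {
            new KeyValuePair<string, string>("name", "John Smith"),
            new KeyValuePair<string, string>("comment", "a&b=c")
        };
        Console.WriteLine(BuildFormData(fields));
        // To submit: Post(actionUrlFromTheForm, BuildFormData(fields));
    }
}
```

Uri.EscapeDataString encodes a space as %20 rather than "+", which form handlers generally accept as well.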
To expand on David and JP's answers:
Assuming you're working with forms whose contents you're not familiar with, you can probably...
1. Pull the page with the form via an HttpWebRequest.
2. Load it into an XmlDocument.
3. Use XPath to traverse/select the form elements.
4. Build your query string/post data based on the elements.
5. Send the data with HttpWebRequest.
If the form's structure is known in advance, you can really just start at #4.
(untested) example (my XPath is not great so the syntax is almost certainly not quite right):
    // Requires: using System; using System.Net; using System.Text; using System.Xml;
    HttpWebRequest request;
    HttpWebResponse response;
    XmlDocument xml = new XmlDocument();
    string form_url = "http://...."; // you supply this
    string form_submit_url;
    XmlNodeList element_nodes;
    XmlElement form_element;
    StringBuilder query_string = new StringBuilder();
    // #1
    request = (HttpWebRequest)WebRequest.Create(form_url);
    response = (HttpWebResponse)request.GetResponse();
    // #2 (XmlDocument only loads well-formed XML, so this works for XHTML pages only)
    xml.Load(response.GetResponseStream());
    // #3a (if the page declares the XHTML namespace, these XPath queries need
    // an XmlNamespaceManager and prefixed element names)
    form_element = (XmlElement)xml.SelectSingleNode("//form[@name='formname']");
    form_submit_url = form_element.GetAttribute("action");
    // #3b
    element_nodes = form_element.SelectNodes(".//input | .//select | .//textarea");
    // #4
    foreach (XmlElement input_element in element_nodes) {
        if (query_string.Length > 0) { query_string.Append("&"); }
        // MyFormElementValue() is a function/value you need to provide/define.
        string name = input_element.GetAttribute("name");
        query_string.Append(Uri.EscapeDataString(name) + "=" + Uri.EscapeDataString(MyFormElementValue(name)));
    }
    // #5
    // This is a GET request, you can figure out POST as needed, and deduce the submission type via the <form> element's attribute.
    request = (HttpWebRequest)WebRequest.Create(form_submit_url + "?" + query_string.ToString());
References:
http://www.developerfusion.com/forum/thread/26371/
http://msdn.microsoft.com/en-us/library/system.xml.xmlelement.getattribute.aspx
http://msdn.microsoft.com/en-us/library/system.xml.xmlelement.selectnodes.aspx
If you don't want to go the HttpWebRequest route, I would suggest WatiN. Makes it very easy to automate IE or Firefox and not worry about the internals of the HTTP requests.
