Trying to scrape a .pdf from a site but the XPath is being stubborn.
Site I'm trying to get the .pdf from: https://work.alberta.ca/apps/cba/searchresults.asp (the full query URL appears in the answer's code below)
XPath given by Inspect > Copy > Copy XPath:
//*[@id="content"]/div/table[2]/tbody/tr[1]/td[3]/a
For some reason /tbody does nothing but cause an issue. Removing it has worked for every other XPath I'm using, and seems to be the way to go here as well:
//*[@id="content"]/div/table[2]/tr[1]/td[3]/a
This yields the result:
<img width="16" height="16" src="/apps/cba/g_doctype_pdf.gif" border="0"><br><small>Download<br>Agreement</small>
Which seems to be a child node?
In any case, backing the XPath up a bit to:
//*[@id="content"]/div/table[2]/tr[1]/td[3]
gets me
<a target="_blank" href="/apps/cba/docs/1088-CBA6-2017_Redacted.pdf"><img width="16" height="16" src="/apps/cba/g_doctype_pdf.gif" border="0"><br><small>Download<br>Agreement</small></a>
This is nice, since all I need is the value in the href attribute, and then I can reconstruct the URL and so on. I'm not a wizard with XPath, but it seems to me that this final adjustment should get me what I want:
//*[@id="content"]/div/table[2]/tr[1]/td[3]/a/@href
However, it returns the <a> tag again.
I'm stumped on this. Any suggestions?
Edit:
The marked solution made it apparent to me that I was making an assumption. I assumed that I could dereference the href attribute in the same manner that I was dereferencing other nodes. That is not the case, and I had to adjust my dereferencing to something like this:
var node_collection = hdoc.DocumentNode.SelectNodes(@"//*[@id=""content""]/div/table[2]/tr[1]/td[3]/a/@href");
string output = node_collection[0].Attributes["href"].Value;
The problem was not with the XPath at all. The problem was my lack of understanding of the HtmlDocument object I was dealing with. Pasting where I was trying to get at the href attribute would have made this obvious to anyone experienced. Being too self-conscious about copy-pasting my whole block of messy code made it impossible for anyone to help me. Learn from my mistakes, kids: complete sections of code make it easier to accurately identify the problem.
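For reference, a minimal sketch of that dereferencing, assuming HtmlAgilityPack's GetAttributeValue helper and the base URL from the answer below; treat it as illustrative rather than a drop-in:
// SelectNodes/SelectSingleNode can only return elements, so select the <a>
// itself and read the attribute off the resulting HtmlNode.
HtmlNode link = hdoc.DocumentNode.SelectSingleNode(
    @"//*[@id=""content""]/div/table[2]/tr[1]/td[3]/a");
if (link != null)
{
    // The second argument is the fallback if the attribute is missing.
    string href = link.GetAttributeValue("href", string.Empty);
    string url = "https://work.alberta.ca" + href; // reconstruct the absolute URL
}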
You are right, tbody is added by Chrome's Copy XPath and should be removed, since it is not present in the raw HTML code.*
Selecting the href attribute should work as suggested: //*[@id="content"]/div/table[2]/tr[1]/td[3]/a/@href
I could load the first href like this:
HtmlWeb web = new HtmlWeb();
HtmlDocument hdoc = web.Load("https://work.alberta.ca/apps/cba/searchresults.asp?query=&employer=&union=&locality=&local=&effective_fy=&effective_fm=&effective_ty=&effective_tm=&expiry_fy=&expiry_fm=&expiry_ty=&expiry_tm=");
var nav = (HtmlNodeNavigator)hdoc.CreateNavigator();
var val = nav.SelectSingleNode(@"//*[@id=""content""]/div/table[2]/tr[1]/td[3]/a/@href").Value;
Or all of them like this:
XPathNavigator nav2 = hdoc.CreateNavigator();
XPathNodeIterator xiter = nav2.Select(@"//*[@id=""content""]/div/table[2]/tr/td[3]/a/@href");
while (xiter.MoveNext())
{
Console.WriteLine(xiter.Current.Value);
}
* However, some engines do require tbody to be present in the XPath, as demonstrated here; only then do we get a result. See this answer for why tbody is added by Chrome, Firebug, and the like in the first place.
Related
I have the following code that grabs the nodes with text for certain descendants of specific tags/classes. It was working before, but I haven't run this program in a couple of months (nobody else has touched it), so I'm wondering why it's throwing an error now. My node list looks like this:
var nodesList = doc.DocumentNode
.SelectNodes("//article[@class='article-content']//div[@class='article-content-block']//text()[not(parent::script)]")
.Select(node => node.InnerText).ToList();
I look at the web page, and there are multiple paragraph and ul tags that fit that particular XPath query, but the nodesList line is throwing:
System.ArgumentNullException: 'Value cannot be null. (Parameter 'source')'
The DocumentNode has the name #document, which I would expect is normal, and its InnerHtml shows the entirety of the page's HTML; however, the InnerText shows "Javascript must be enabled for the correct page display". Any ideas as to why it would be throwing null? I don't recall seeing that message in the DocumentNode's InnerText before, so I'm wondering if that has something to do with it.
It sounds like the webpage content is being loaded dynamically. That's not a problem for your browser, because it executes Javascript automatically, but the .NET web components don't do any of that. You should be able to use your browser's dev tools to determine which request actually contains the content you're looking for, and then replicate that request in your code.
It could also be that something else about your request isn't playing nice with the server - missing/bad HTTP headers, unexpected TLS version, maybe even firewall stuff - causing it to return a different response.
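As a rough sketch of that approach, assuming the real endpoint and headers have been copied out of the dev tools' Network tab (the URL and User-Agent below are placeholders, not the real values):
using System;
using System.Collections.Generic;
using System.Linq;
using System.Net.Http;
using HtmlAgilityPack;

// Replicate the request the browser makes for the actual content.
var client = new HttpClient();
client.DefaultRequestHeaders.UserAgent.ParseAdd("Mozilla/5.0"); // placeholder UA
string html = await client.GetStringAsync("https://example.com/real-content-endpoint"); // hypothetical URL

var doc = new HtmlDocument();
doc.LoadHtml(html); // parse the fetched markup instead of using web.Load

// SelectNodes returns null when nothing matches - that null is what the
// LINQ .Select() choked on with the ArgumentNullException above.
var nodes = doc.DocumentNode.SelectNodes(
    "//article[@class='article-content']//div[@class='article-content-block']//text()[not(parent::script)]");
var texts = nodes?.Select(n => n.InnerText).ToList() ?? new List<string>();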
I am scraping a web page and I am having trouble accessing the value of an attribute within a <tr> tag in the OuterHTML of a node.
<tr data-descr="Revit+SA+regression+-+Obj" data-ids="2571302">
The above HTML contains the attribute data-ids which I am trying to get the value of.
Below is the code for accessing the web page (I am deeply sorry for the lack of a reproducible example, the web page not being accessible to the public) and reaching the nodes containing certain keywords that I wanted to investigate.
HtmlWeb web = new HtmlWeb();
HtmlDocument document = web.Load("WebPageIsPrivate");
HtmlNode[] nodes = document.DocumentNode.SelectNodes("//tr")
.Where(x => x.InnerHtml.Contains("Revit") && x.InnerHtml.Contains("regression")).ToArray();
At this point, I figured that I could use nodes[0].OuterHtml to get the HTML above. However, this would mean that I'd have to strip away the string's characters until I only have the 2571302 (in this example) left. I am wondering whether there's an easier way of getting to that value.
Please let me know if the post is not clear enough to the reader and requires more details - I will do my best to provide them. Documentation on this subject is also highly welcome.
Thank you.
var ids = new List<string>();
foreach (HtmlNode item in nodes)
{
    // Collect each row's data-ids value.
    ids.Add(item.Attributes["data-ids"].Value);
}
This did the job.
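For what it's worth, the keyword filter can also live in the XPath itself, and GetAttributeValue avoids any string surgery on OuterHtml; a sketch reusing the document variable from the question:
// Match rows by their data-descr attribute instead of scanning InnerHtml.
var rows = document.DocumentNode.SelectNodes(
    "//tr[contains(@data-descr, 'Revit') and contains(@data-descr, 'regression')]");
if (rows != null) // SelectNodes returns null when nothing matches
{
    foreach (HtmlNode row in rows)
    {
        // The second argument is the fallback when the attribute is absent.
        Console.WriteLine(row.GetAttributeValue("data-ids", "")); // e.g. 2571302
    }
}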
This is the line of code I am using, when I look in the watch window, 'c' is null.
HtmlNodeCollection c = doc.DocumentNode.SelectNodes("//*[#id=\"content\"]/table/tbody/tr[2]/td/center[2]/b");
But when I declare 'c' as this, the watch window shows it to be a valid HtmlNodeCollection
HtmlNodeCollection c = new HtmlNodeCollection(doc.DocumentNode.ParentNode);
If I then set 'c' to the first code snippet, it goes back to being null.
I know the XPath is correct, as I obtained it from the Chrome Inspect Element of the element I want to get.
SelectNodes returns null when nothing has been found.
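In code, that means guarding before you touch the result; a minimal sketch reusing the question's query, with tbody dropped for the reason below:
HtmlNodeCollection c = doc.DocumentNode.SelectNodes("//*[@id=\"content\"]/table/tr[2]/td/center[2]/b");
if (c == null)
{
    // Nothing matched - with this page, most likely because the raw
    // markup never contained the tbody the browser showed you.
}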
You think your XPath is OK because you used a browser's (Chrome, Firefox, etc.) constructed XPath, but unfortunately that XPath was not built against the markup you got from the network (or a file, or a raw stream).
Browsers rely on the in-memory DOM they use internally, which can be dramatically different. That's why you see elements such as TBODY that only exist in the DOM, not in the markup (where they are optional).
So, I suggest you get back to the string/stream you give to the Html Agility Pack and check that XPath again. I bet there is no TBODY, for a start.
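One quick way to do that check with the document you already loaded: dump the markup as the Html Agility Pack parsed it and probe for TBODY directly.
// Write out the markup exactly as HtmlAgilityPack parsed it, so you can
// compare it with what the browser's inspector shows.
System.IO.File.WriteAllText("raw.html", doc.DocumentNode.OuterHtml);

// If this is null, the raw markup has no tbody - drop it from the XPath.
bool hasTbody = doc.DocumentNode.SelectSingleNode("//tbody") != null;
Console.WriteLine(hasTbody ? "tbody exists in markup" : "no tbody in markup");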
This is the XPath I tried to use with the HtmlAgilityPack C# parser:
//div[@id = 'sc1']/table/tbody/tr/td/span[@class='blacktxt']
I tried to evaluate the XPath expression with a Firefox XPath add-on and successfully got the required items. But the C# code returns a null reference exception.
HtmlAgilityPack.HtmlNodeCollection node = htmldoc.DocumentNode.SelectNodes("//div[@id ='sc1']/table/tbody/tr/td/span[@class='blacktxt']");
MessageBox.Show(node.ToString());
the node always contains a null value...
Please help me to find the way to get around this problem...
Thank you..
DOM Requires <tbody/> Tags to be Inserted
All common browser extensions for building XPath expressions work on the DOM. Unlike the HTML spec, the DOM spec requires <tr/> elements to be inside <tbody/> elements, so browsers add such elements if they are missing. You can easily see the difference by looking at the HTML source using Firebug (or similar developer tools working on the DOM) versus displaying the raw page source (using wget or similar tools that do not interpret anything).
The Solution
Remove the /tbody axis step, and your XPath expression will probably work.
//div[@id = 'sc1']/table/tr/td/span[@class='blacktxt']
If You Need to Support Both HTML With and Without <tbody/> Tags
For a more general solution, you could replace the /tbody axis step with a descendant-or-self step //, but this could jump into "inner tables":
//div[@id = 'sc1']/table//tr/td/span[@class='blacktxt']
Better would be to use alternative XPath expressions:
//div[@id = 'sc1']/table/tr/td/span[@class='blacktxt'] | //div[@id = 'sc1']/table/tbody/tr/td/span[@class='blacktxt']
A cleaner, XPath 2.0-only solution would be:
//div[@id = 'sc1']/table/(tbody, self::*)/tr/td/span[@class='blacktxt']
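Note that the Html Agility Pack evaluates XPath through .NET's System.Xml.XPath engine, which only speaks XPath 1.0, so with that library the union expression is the one to use; a minimal sketch against the htmldoc from the question:
// XPath 1.0 union: matches whether or not tbody is present in the markup.
var spans = htmldoc.DocumentNode.SelectNodes(
    "//div[@id='sc1']/table/tr/td/span[@class='blacktxt']" +
    " | //div[@id='sc1']/table/tbody/tr/td/span[@class='blacktxt']");
if (spans != null) // SelectNodes returns null when nothing matches
{
    foreach (var span in spans)
        Console.WriteLine(span.InnerText);
}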
It may just be my noobness with XPath, but let me ask to make sure, because I've googled enough.
I have a website and want to get the news headings from it: www.farsnews.com (it is in Persian).
Using the Firebug and FireXPath extensions under Firefox, and by hand, I extracted and tested multiple XPath expressions that match the headings, such as:
* html/body/div[2]/div[2]/div[2]/div[*]/div[2]/a/div[2]
* .//*[@class="topnewsinfotitle "]
* .//div[@class="topnewsinfotitle "]
I also tested these using the XPather extension and they seem to work pretty well, but when I get to test them in code... SelectNodes returns null!
Any clue or hint?
Here is a chunk of the code:
listBox2.ResetText();
HtmlAgilityPack.HtmlWeb w = new HtmlAgilityPack.HtmlWeb();
HtmlAgilityPack.HtmlDocument doc = w.Load("http://www.farsnews.com");
HtmlAgilityPack.HtmlNodeCollection nc = doc.DocumentNode.SelectNodes(".//div[@class=\"topnewsinfotitle \"]");
listBox2.Items.Add(nc.Count+" Items selected!");
foreach (HtmlAgilityPack.HtmlNode node in nc) {
listBox2.Items.Add(node.InnerText);
}
Thanks.
I have tested your expressions, and as mentioned by Dialecticus in a comment, you have a trailing space which shouldn't be there.
//div[#class='topnewsinfotitle ']/text()
Returns 'empty sequence', see evaluation: http://xmltools.dk/EQA-ACA6
//div[#class='topnewsinfotitle']/text()
Returns a list of your headlines, see: http://xmltools.dk/EgA2APAj
However, if there could be other classes, you can use this ( http://xmltools.dk/EwA8AJAW ):
//div[contains(#class, 'topnewsinfotitle')]/text()
(I see there is an encoding issue in the links I've provided; however, it shouldn't matter for the meaning. For all the XPath expressions, you can remove /text() to get the nodes instead of only the text.)
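Applied back to the code from the question, the contains() variant would look roughly like this (same doc and listBox2 as above; the Trim() is just to tidy the extracted text):
// contains() sidesteps the trailing space inside the class attribute.
HtmlAgilityPack.HtmlNodeCollection nc = doc.DocumentNode.SelectNodes(
    "//div[contains(@class, 'topnewsinfotitle')]");
if (nc != null) // SelectNodes returns null when nothing matches
{
    listBox2.Items.Add(nc.Count + " Items selected!");
    foreach (HtmlAgilityPack.HtmlNode node in nc)
    {
        listBox2.Items.Add(node.InnerText.Trim());
    }
}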
BUT, if you own this site, you should provide the headlines as XML (maybe RSS or Atom) or JSON, which will have better performance and, most importantly, be more bullet-proof.