C# HtmlAgilityPack HtmlNodeCollection SelectNodes not working

This is the line of code I am using; when I look in the watch window, 'c' is null.
HtmlNodeCollection c = doc.DocumentNode.SelectNodes("//*[@id=\"content\"]/table/tbody/tr[2]/td/center[2]/b");
But when I declare 'c' as this, the watch window shows it to be a valid HtmlNodeCollection
HtmlNodeCollection c = new HtmlNodeCollection(doc.DocumentNode.ParentNode);
If I then set 'c' to the first code snippet, it goes back to being null.
I know the XPath is correct, as I obtained it from the Chrome Inspect Element of the element I want to get.

SelectNodes returns null when nothing has been found.
You think your XPath is OK because you used a browser-constructed XPath (Chrome, Firefox, etc.), but unfortunately that XPath is built against the browser's DOM, not against the markup you actually got from the network (or a file, or a raw stream).
Browsers rely on the in-memory DOM they use internally, which can be dramatically different. That's why you see elements such as TBODY that only exist in the DOM, not in the markup (where they are optional).
So, I suggest you go back to the string/stream you feed to the Html Agility Pack and check that XPath again. I bet there is no TBODY, for a start.
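For example, here is a minimal sketch of that check (the file name is just a placeholder): dump the markup the Html Agility Pack actually parsed, then try the XPath with the tbody step removed, guarding against the null return.

using System;
using HtmlAgilityPack;

// Load the same string/stream you hand to the Html Agility Pack (placeholder file name).
var doc = new HtmlDocument();
doc.Load("page.html");

// Dump what HAP actually sees; if there is no <tbody> here, drop it from the XPath.
Console.WriteLine(doc.DocumentNode.OuterHtml);

// Same expression as above, minus the tbody step that only exists in the browser DOM.
HtmlNodeCollection c = doc.DocumentNode.SelectNodes("//*[@id=\"content\"]/table/tr[2]/td/center[2]/b");

if (c == null)
    Console.WriteLine("Still no match - re-check the path against the dumped markup.");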

Related

Trying to get word count using HtmlAgilityPack, but node list is returning as null

I have the following code that grabs the nodes with text for certain descendants of specific tags/classes, and it was working before, but I haven't run this program in a couple of months (nobody else has touched it), so I'm wondering why it's throwing an error now. My nodesList looks like this:
var nodesList = doc.DocumentNode
.SelectNodes("//article[#class='article-content']//div[#class='article-content-block']//text()[not(parent::script)]")
.Select(node => node.InnerText).ToList();
I look at the web page, and there are multiple paragraph and ul tags that fit that particular XPath query, but nodesList is returning:
System.ArgumentNullException: 'Value cannot be null. (Parameter 'source')'
The DocumentNode has the name #document, which I would expect is normal, and the InnerHtml shows the entirety of the page's HTML; however, the InnerText shows "Javascript must be enabled for the correct page display". Any ideas as to why it would be throwing null? I don't recall seeing that message for the DocumentNode's InnerText before, so I'm wondering if that has something to do with it.
It sounds like the webpage content is being loaded dynamically. That's not a problem for your browser, because it executes Javascript automatically, but the .NET web components don't do any of that. You should be able to use your browser's dev tools to determine which request actually contains the content you're looking for, and then replicate that request in your code.
It could also be that something else about your request isn't playing nice with the server - missing/bad HTTP headers, unexpected TLS version, maybe even firewall stuff - causing it to return a different response.
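A rough sketch of that approach, assuming the real content URL and the required headers are whatever you find in the Network tab (everything below is illustrative, not the poster's actual endpoint):

using System;
using System.Linq;
using System.Net.Http;
using HtmlAgilityPack;

using var client = new HttpClient();
// Copy whatever headers dev tools shows the browser sending; these values are examples.
client.DefaultRequestHeaders.UserAgent.ParseAdd("Mozilla/5.0");
client.DefaultRequestHeaders.Accept.ParseAdd("text/html");

string html = await client.GetStringAsync("https://example.com/the-real-content-url");

var doc = new HtmlDocument();
doc.LoadHtml(html);

// SelectNodes returns null when nothing matches, hence the ?. guard.
var nodesList = doc.DocumentNode
    .SelectNodes("//article[@class='article-content']//div[@class='article-content-block']//text()[not(parent::script)]")
    ?.Select(node => node.InnerText)
    .ToList();

Console.WriteLine(nodesList == null ? "no match" : string.Join(Environment.NewLine, nodesList));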

Trying to select slippery href attribute with xpath in c#

Trying to scrape a .pdf from a site but the XPath is being stubborn.
Site I'm trying to get the .pdf from
XPath given by Inspect > Copy > Copy XPath:
//*[@id="content"]/div/table[2]/tbody/tr[0]/td[3]/a
For some reason /tbody does nothing but cause an issue. Removing it has worked for all the other XPath expressions I'm using, and seems to be the way to go here as well.
//*[@id="content"]/div/table[2]/tr[0]/td[3]/a
This yields the result:
<img width="16" height="16" src="/apps/cba/g_doctype_pdf.gif" border="0"><br><small>Download<br>Agreement</small>
Which seems to be a child node?
In any case, backing the XPath up a bit to:
//*[@id="content"]/div/table[2]/tr[0]/td[3]
gets me
<a target="_blank" href="/apps/cba/docs/1088-CBA6-2017_Redacted.pdf"><img width="16" height="16" src="/apps/cba/g_doctype_pdf.gif" border="0"><br><small>Download<br>Agreement</small></a>
This is nice since all I need is the value in the href attribute, and I can reconstruct the URL and so on. I'm not a wizard with XPath, but it seems to me that this final adjustment should get me what I want:
//*[@id="content"]/div/table[2]/tr[0]/td[3]/@href
However, it returns the tag again.
I'm stumped on this. Any suggestions?
Edit:
The marked solution made it apparent to me that I was making an assumption. I assumed that I could dereference the href attribute in the same manner that I was dereferencing other nodes. This is not the case, and I had to adjust my dereferencing to something like this:
var node_collection = hdoc.DocumentNode.SelectNodes(@"//*[@id=""content""]/div/table[2]/tr[1]/td[3]/a/@href");
string output = node_collection[0].Attributes["href"].Value;
The problem was not with the XPath at all. The problem was my lack of understanding of the HtmlDocument object that I was dealing with. Pasting where I was trying to get at the href attribute would have made this obvious to anyone experienced. Being too self-conscious about copy-pasting my whole block of messy code made it impossible for anyone to help me. Learn from my mistakes, kids: complete sections of code make it easier to accurately identify the problem.
You are right: tbody is added by Chrome on Copy XPath and should be removed, since it is not present in the raw HTML code.*
Selecting the href attribute should work as suggested: //*[@id="content"]/div/table[2]/tr[1]/td[3]/a/@href
I could load the first href like this:
HtmlWeb web = new HtmlWeb();
HtmlDocument hdoc = web.Load("https://work.alberta.ca/apps/cba/searchresults.asp?query=&employer=&union=&locality=&local=&effective_fy=&effective_fm=&effective_ty=&effective_tm=&expiry_fy=&expiry_fm=&expiry_ty=&expiry_tm=");
var nav = (HtmlNodeNavigator)hdoc.CreateNavigator();
var val = nav.SelectSingleNode(@"//*[@id=""content""]/div/table[2]/tr[1]/td[3]/a/@href").Value;
Or all of them like this:
XPathNavigator nav2 = hdoc.CreateNavigator();
XPathNodeIterator xiter = nav2.Select(@"//*[@id=""content""]/div/table[2]/tr/td[3]/a/@href");
while (xiter.MoveNext())
{
Console.WriteLine(xiter.Current.Value);
}
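If you would rather stay with the HtmlNode API than the XPathNavigator one, a roughly equivalent sketch (same page and structure assumptions as above) reads the attribute off each selected a element:

var links = hdoc.DocumentNode.SelectNodes(@"//*[@id=""content""]/div/table[2]/tr/td[3]/a");
if (links != null)
{
    foreach (var a in links)
    {
        // GetAttributeValue returns the fallback ("" here) when href is missing.
        Console.WriteLine(a.GetAttributeValue("href", ""));
    }
}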
* However, some engines indeed require tbody to be present in the XPath, as demonstrated here; only then do we get a result. See this answer for why tbody is added by Chrome, Firebug, and the like in the first place.

Get end of element with HTML Agility Pack?

I am using the HTML Agility Pack to parse my HTML, and I need to know the position of each element within the HTML. HtmlNode.StreamPosition gives me the location in the HTML, which works great. However, I'd also like the stream position of the end of the element. I can get the StreamPosition and add on the length of the OuterHtml, but this is inaccurate, as the OuterHtml from the HTML Agility Pack will often not match up exactly with the actual HTML text.
I'm also game for using AngleSharp, if it's any easier or better suited to this. So basically, I can get the location of the start of an HTML element; how can I get the location of the end?
There is actually a private _endnode field on HtmlNode, which holds the node for the element's closing tag. So you can either change the HAP source code to expose it, or use System.Reflection to access it.
There is also a similar HAP issue with some example code.
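For instance, here is a hedged sketch of the reflection route (the field name _endnode is an implementation detail and may differ between HAP versions):

using System;
using System.Reflection;
using HtmlAgilityPack;

var doc = new HtmlDocument();
doc.LoadHtml("<html><body><div id='x'>hello <b>world</b></div></body></html>");

HtmlNode div = doc.DocumentNode.SelectSingleNode("//div[@id='x']");
Console.WriteLine($"start tag at {div.StreamPosition}");

// Pull the non-public _endnode field, which holds the node for the closing tag.
FieldInfo endNodeField = typeof(HtmlNode).GetField("_endnode", BindingFlags.NonPublic | BindingFlags.Instance);
var endNode = endNodeField?.GetValue(div) as HtmlNode;

if (endNode != null)
    Console.WriteLine($"closing tag starts at {endNode.StreamPosition}");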

HtmlAgilityPack C#--- Selectnodes Always returns a Null

This is the XPath expression I tried to use with the HtmlAgilityPack C# parser:
//div[@id = 'sc1']/table/tbody/tr/td/span[@class='blacktxt']
I tried to evaluate the XPath expression with a Firefox XPath add-on and successfully got the required items, but the C# code throws a null exception.
HtmlAgilityPack.HtmlNodeCollection node = htmldoc.DocumentNode.SelectNodes("//div[@id ='sc1']/table/tbody/tr/td/span[@class='blacktxt']");
MessageBox.Show(node.ToString());
The node always contains a null value...
Please help me to find the way to get around this problem...
Thank you..
DOM Requires <tbody/> Tags to be Inserted
All common browser extensions for building XPath expressions work on the DOM. Unlike the HTML specs, the DOM specs require <tr/> elements to be inside <tbody/> elements, so browsers add such elements if they are missing. You can easily see the difference by looking at the HTML source in Firebug (or similar developer tools that work on the DOM) versus displaying the raw page source (using wget or similar tools that do not interpret anything).
The Solution
Remove the /tbody axis step, and your XPath expression will probably work.
//div[@id = 'sc1']/table/tr/td/span[@class='blacktxt']
If You Need to Support Both HTML With and Without <tbody/> Tags
For a more general solution, you could replace the /tbody axis step with a descendant-or-self step //, but this could jump into "inner tables":
//div[@id = 'sc1']/table//tr/td/span[@class='blacktxt']
Better would be to use alternative XPath expressions:
//div[@id = 'sc1']/table/tr/td/span[@class='blacktxt'] | //div[@id = 'sc1']/table/tbody/tr/td/span[@class='blacktxt']
A cleaner, XPath 2.0-only solution would be:
//div[@id = 'sc1']/table/(tbody, self::*)/tr/td/span[@class='blacktxt']
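Applied back to the Html Agility Pack call from the question, a sketch of the fix with a null guard (SelectNodes returns null when nothing matches) might look like this:

var nodes = htmldoc.DocumentNode.SelectNodes(
    "//div[@id='sc1']/table/tr/td/span[@class='blacktxt']"
    + " | //div[@id='sc1']/table/tbody/tr/td/span[@class='blacktxt']");

if (nodes == null)
{
    MessageBox.Show("No matching nodes - check the raw HTML, not the browser DOM.");
}
else
{
    foreach (var span in nodes)
        MessageBox.Show(span.InnerText);
}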

XPath problem, getting "expression must evaluate to a node-set." error

I'm having trouble retrieving a single node by an explicit XPath that I have already found by other means. I have the node and I can get its XPath, but when I try to retrieve that same node again, this time via node.XPath, it gives the "expression must evaluate to a node-set" error. Shouldn't this work? I'm using HtmlAgilityPack in C#, by the way, for the HtmlDocument.
HtmlDocument doc = new HtmlDocument();
doc.Load(@"..\..\test1.htm");
HtmlNode node = doc.DocumentNode.SelectSingleNode("(//node()[@id='something'])[1]");
HtmlNode same = doc.DocumentNode.SelectSingleNode(node.XPath);
BTW: this is the value of node.XPath:
"/html[1]/body[1]/table[1]/tr[1]/td[1]/div[1]/div[1]/div[2]/table[1]/tr[1]/td[1]/div[1]/div[1]/table[1]/tr[1]/td[1]/div[1]/div[1]/div[4]/div[2]/div[1]/div[1]/div[4]/#text[2]"
I was able to get it working by replacing #text with the function text(). I'm not sure why it didn't just emit the XPath that way in the first place.
HtmlNode same = doc.DocumentNode.SelectSingleNode(node.XPath.Replace("#text", "text()"));
Your XPath ends in "#text[2]", which means "the second 'text' attribute". Attributes aren't nodes, they're node metadata.
This is a common problem I've had with XPath: wanting the value of an attribute while the XPath operation absolutely has to extract a node.
The solution I've used for this is to wrap my XPath fetching with something that detects and strips off the attribute portion of the string (via a myXPathString.LastIndexOf("#") method call) and then uses the truncated myXPathString to fetch the node and collect the desired attribute value as a second step.
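A rough sketch of that idea for the common case of a trailing attribute step (the helper name and details here are illustrative, not the answerer's actual code):

using System;
using HtmlAgilityPack;

static string GetAttributeByXPath(HtmlDocument doc, string xpath)
{
    // Split "/path/to/element/@attr" into the element path and the attribute name.
    int at = xpath.LastIndexOf("/@", StringComparison.Ordinal);
    if (at < 0)
        throw new ArgumentException("Expected an XPath ending in an attribute step.", nameof(xpath));

    string nodePath = xpath.Substring(0, at);
    string attrName = xpath.Substring(at + 2);

    // Fetch the element first, then read the attribute as a second step.
    HtmlNode node = doc.DocumentNode.SelectSingleNode(nodePath);
    return node?.GetAttributeValue(attrName, "");
}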
Hope that helps,
J
