HtmlAgilityPack | Wrongly retrieved table nodes - c#

Hi there I've just registered on this website, because I need some help.
I want to get results from the nyaa.eu website.
Basically:
Table Node is called <table class="tlist">
Every row node is called <tr class="tlistrow"> also sometimes it's 'trusted tlistrow' etc.
The Nodes I try to retrieve are: <td class="tlistname"> <td class="tlistsize"> <td class="tlistsn"> and <td class="tlistln">
Firstly I'm retrieving a table which contains all the info about torrents:
HtmlNode hnTable = doc.DocumentNode.SelectSingleNode("//table[@class='tlist']");
So, the next thing is retrieving all the rows that contain 'tlistrow' in their class attribute:
HtmlNodeCollection hncRows = hnTable.SelectNodes("//tr[contains(@class,'tlistrow')]");
And finally the problem is when I read every node it's always the same one:
foreach (HtmlNode row in hncRows)
{
    foreach (HtmlNode child in row.ChildNodes)
    {
        if (child.SelectSingleNode("//td[@class='tlistname']") != null)
        {
            MessageBox.Show("Something found!\n\n" + child.SelectSingleNode("//td[@class='tlistname']").InnerText);
            break;
        }
    }
}
The text displayed in the messagebox is always the same, it looks like it only selects one node multiple times.
How can I fix this or if I am doing anything wrong, please correct me.

foreach (HtmlNode child in row.ChildNodes)
{
    if (child.SelectSingleNode("//td[@class='tlistname']") != null)
    {
        MessageBox.Show("Something found!\n\n" + child.SelectSingleNode("//td[@class='tlistname']").InnerText);
        break;
    }
}
You need to understand the difference between relative and absolute XPath expressions.
A relative XPath expression is evaluated starting from a specific node in the XML document, which serves as its initial context node.
An absolute XPath expression is evaluated against the whole XML document, with the document node as its initial context node.
Any XPath expression that starts with the character / is an absolute XPath expression.
Based on the provided code, you want to use a relative XPath expression whose initial context node is the one contained in the variable named child.
The problem is that the expression you are using:
//td[@class='tlistname']
starts with / and is therefore an absolute XPath expression.
Passed to the SelectSingleNode() method, it always selects the first td element in the whole document that has a class attribute with the string value "tlistname".
Solution: Use a relative XPath expression, such as:
.//td[@class='tlistname']
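To see the difference in isolation, here is a minimal sketch, assuming the HtmlAgilityPack NuGet package is referenced. The two-row table and the names XPathContextDemo and NamesPerRow are made up for illustration; only the tlistrow and tlistname class names come from the question.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack;

public static class XPathContextDemo
{
    // Hypothetical two-row table; only the class names come from the question.
    private const string Html =
        "<table class='tlist'>" +
        "<tr class='tlistrow'><td class='tlistname'>first</td></tr>" +
        "<tr class='trusted tlistrow'><td class='tlistname'>second</td></tr>" +
        "</table>";

    // Evaluates cellXPath once per row and collects the text of each match.
    public static List<string> NamesPerRow(string cellXPath)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(Html);

        var names = new List<string>();
        HtmlNodeCollection rows = doc.DocumentNode.SelectNodes("//tr[contains(@class,'tlistrow')]");
        if (rows == null) return names; // SelectNodes returns null by default when nothing matches

        foreach (HtmlNode row in rows)
        {
            HtmlNode cell = row.SelectSingleNode(cellXPath);
            if (cell != null) names.Add(cell.InnerText);
        }
        return names;
    }
}
```

Calling NamesPerRow("//td[@class='tlistname']") should give "first" for both rows, because the absolute expression restarts from the document root each time, while NamesPerRow(".//td[@class='tlistname']") should give "first" and then "second".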

A leading // in the expression makes it look anywhere in the document for matches, no matter which node you call it on. When you only want matches under the current node, make the expression relative: start it with . or with an element name.
So, try something like:
HtmlNode hnTable = doc.DocumentNode.SelectSingleNode("//table[@class='tlist']");
// ".//" keeps the search inside hnTable; a leading "/" or "//" would restart from the document root
HtmlNodeCollection hncRows = hnTable.SelectNodes(".//tr[contains(@class,'tlistrow')]");
foreach (HtmlNode row in hncRows)
{
    // "td[...]" is relative to row and matches only its direct td children
    HtmlNode nameCell = row.SelectSingleNode("td[@class='tlistname']");
    if (nameCell != null)
    {
        MessageBox.Show("Something found!\n\n" + nameCell.InnerText);
    }
}

Related

get all nodes and its content using htmldocument/HtmlAgilityPack

I need to get all nodes from a html, then from that nodes I need to get the text and sub-nodes, and the same thing but from that sub-sub-nodes.
For example, I have this HTML:
<p>This <b>is a Link</b> with <b>bold</b></p>
So I need a way to get the p node, then the non-formatted text (this), the only-bold text (is a), the bolded link (Link) and the rest formatted and not formatted text.
I know that with the htmldocument I can select all nodes and sub-nodes, but, how Can I get the text before the sub-node, then the sub-node, and its text/sub-nodes so I can make the rendered version of the html ("This is a Link with bold")?
Please note that the above example is a simple one. The HTML would have more complex things like list, frames, numbered list, triple-formatted text, etc. Also note that the rendered thing is not a problem. I have already done that but in another way. What I need is the part to get the nodes and its content only.
Also, I can't ignore any node, so I can't filter by nothing. And the main node could start as p, div, frame, ul, etc.
After looking at the htmlDoc and its properties, and thanks to @HungCao's observation, I got a simple working way to interpret HTML code.
My code is a little too complex to post as an example, so I will post a lite version of it.
First of all, the htmlDoc has to be loaded. This can be done in any function:
HtmlDocument htmlDoc = new HtmlDocument();
string html = @"<p>This <b>is a Link</b> with <b>bold</b></p>";
htmlDoc.LoadHtml(html);
Then we need to interpret each "main" node (p in this case) and, depending on its type, call a looping function (InterNode) on its children:
HtmlNodeCollection nodes = htmlDoc.DocumentNode.ChildNodes;
foreach (HtmlNode node in nodes)
{
    if (node.Name.ToLower() == "p") // lower-case the tag name just in case
    {
        Paragraph newPPara = new Paragraph();
        foreach (HtmlNode childNode in node.ChildNodes)
        {
            InterNode(childNode, ref newPPara);
        }
        richTextBlock.Blocks.Add(newPPara);
    }
}
Please note that there is a property called "NodeType", but it does not give you the tag: it only tells you whether the node is a Document, Element, Comment or Text node. So use the "Name" property instead (also note that the Name property of HtmlNode is the tag name, not the name attribute in HTML).
Finally, we have the InterNode function, which adds inlines to the Paragraph passed by reference (ref):
public bool InterNode(HtmlNode htmlNode, ref Paragraph originalPar)
{
    string htmlNodeName = htmlNode.Name.ToLower();
    // Collect the names of all ancestors, because the text could be b(old) and i(talic) at the same time.
    List<string> nodeAttList = new List<string>();
    HtmlNode parentNode = htmlNode.ParentNode;
    while (parentNode != null)
    {
        nodeAttList.Add(parentNode.Name);
        parentNode = parentNode.ParentNode;
    }
    Inline newRun = new Run();
    foreach (string nodeAttStr in nodeAttList) // apply every ancestor's formatting to the inline
    {
        switch (nodeAttStr)
        {
            case "b":
            case "strong":
                newRun.FontWeight = FontWeights.Bold;
                break;
            case "i":
            case "em":
                newRun.FontStyle = FontStyle.Italic;
                break;
        }
    }
    if (htmlNodeName == "#text") // "#text" is the name of a text node, e.g. the text inside <i>...</i>. Thanks @HungCao
    {
        ((Run)newRun).Text = htmlNode.InnerText;
        originalPar.Inlines.Add(newRun); // without this the run never reaches the paragraph
    }
    else // not a text node: recurse instead of reading InnerText, since any text it holds lives in a #text child
    {
        foreach (HtmlNode childNode in htmlNode.ChildNodes)
        {
            InterNode(childNode, ref originalPar);
        }
    }
    return true;
}
Note: I know I said that my app needs to render the HTML differently from how a WebView does, and I know that this example code produces the same thing as a WebView, but, as I said before, this is just a lite version of my final code. My original/full code works as I need it to, and this is just the base.

Getting same value when I'm looking in different nodes HTMLAgility

I'm having a problem using Html Agility Pack.
I stumbled upon an img src / img alt that I have to take data from.
All is good when there is only one element I need to take my data from, but when there are more, it finds everything in the collection like it should, yet the data taken is always from the 1st node in the collection...
HtmlNodeCollection collection = doc.DocumentNode.SelectNodes("//div[@class='listHolder']//article[@class='brochure openBrochureAction']//div[@class='imgBrochure']");
foreach (HtmlNode node in collection)
{
    //Tried these examples:
    NomeFolheto = node.SelectSingleNode("//div[@class='imageRatioHorizontal']//img[@alt]").GetAttributeValue("alt", "none").Trim();
    string testeNome = node.SelectSingleNode("//div[@class='imageRatioHorizontal']//img/@alt").Attributes["alt"].Value;
    string testeimagem = node.SelectSingleNode("//div[@class='imageRatioHorizontal']//img/@src").Attributes["src"].Value;
    imagem = node.SelectSingleNode("//div[@class='imageRatioHorizontal']//img[@src]").GetAttributeValue("src", "none").Trim();
}
Like I said, the collection finds all the nodes it should and gets the 1st value properly, but when it moves on to the other nodes, the values it gets are still from the 1st node.
What am I doing wrong? I checked each node in the collection, and they have the same "alt" attribute and different "src" attributes, as they should, but I know from debugging that it's picking up the 1st node every time.
Thanks in advance
Your xpath expressions are all starting from the root (of the document). Even when you have a reference to a single node, it is still just a reference to that node within the entire tree.
You should use .// for the expressions:
HtmlNodeCollection collection = doc.DocumentNode.SelectNodes("//div[@class='listHolder']//article[@class='brochure openBrochureAction']//div[@class='imgBrochure']");
foreach (HtmlNode node in collection)
{
    //Tried these examples:
    NomeFolheto = node
        .SelectSingleNode(".//div[@class='imageRatioHorizontal']//img[@alt]")
        .GetAttributeValue("alt", "none").Trim();
    string testeNome = node
        .SelectSingleNode(".//div[@class='imageRatioHorizontal']//img/@alt")
        .Attributes["alt"].Value;
    string testeimagem = node
        .SelectSingleNode(".//div[@class='imageRatioHorizontal']//img/@src")
        .Attributes["src"].Value;
    imagem = node
        .SelectSingleNode(".//div[@class='imageRatioHorizontal']//img[@src]")
        .GetAttributeValue("src", "none").Trim();
}

Getting HtmlElement based on HtmlAgilityPack.HtmlNode

I use HtmlAgilityPack to parse the HTML document of a WebBrowser control.
I am able to find my desired HtmlNode, but after getting the HtmlNode, I want to return the corresponding HtmlElement in WebBrowserControl.Document.
In fact, HtmlAgilityPack parses an offline copy of the live document, while I want to access live elements of the WebBrowser control to read some rendered attributes like currentStyle or runtimeStyle.
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(webBrowser1.Document.Body.InnerHtml);
var some_nodes = doc.DocumentNode.SelectNodes("//p");
// this selection could be more sophisticated
// and the answer shouldn't rely on it.
foreach (HtmlNode node in some_nodes)
{
    HtmlElement live_element = CorrespondingElementFromWebBrowserControl(node);
    // CorrespondingElementFromWebBrowserControl is what I am searching for
}
If the element had a specific attribute it would be easy, but I want a solution that works on any element.
Please help; what can I do about it?
In fact, there seems to be no way to change the document directly in the WebBrowser control.
But you can extract the HTML from it, manipulate it and write it back again like this:
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(webBrowser1.DocumentText);
foreach (HtmlAgilityPack.HtmlNode node in doc.DocumentNode.ChildNodes) {
    node.Attributes.Add("TEST", "TEST");
}
StringBuilder sb = new StringBuilder();
using (StringWriter sw = new StringWriter(sb)) {
    doc.Save(sw);
    webBrowser1.DocumentText = sb.ToString();
}
For direct manipulation you can maybe use the unmanaged pointer webBrowser1.Document.DomDocument to the document, but this is outside of my knowledge.
HtmlAgilityPack definitely can't provide access to nodes in live HTML directly. Since you said there is no distinct style/class/id on the element you have to walk through the nodes manually and find matches.
Assuming HTML is reasonably valid (so both browser and HtmlAgilityPack perform normalization similarly) you can walk pairs of elements starting from the root of both trees and selecting the same child node.
Basically, you can build a "position-based" XPath to the node in one tree and select it in the other tree. The XPath would look something like one of the following (depending on whether you want to pay attention to positions only, or to positions and node names):
"/*[1]/*[4]/*[2]/*[7]"
"/body/div[2]/span[1]/p[3]"
Steps:
1. Starting from the HtmlNode you've found, collect all parent nodes up to the root.
2. Get the root element of the HTML in the browser.
3. For each level of children, find the position of the corresponding node from the collection in step 1 within its parent, and then find the live HtmlElement at that position among the children of the current live node.
4. Move to the newly found child and go back to step 3 until you reach the node you are looking for.
The XPath property of HtmlAgilityPack.HtmlNode gives the path from the root to the node, for example /div[1]/div[2]/table[1]. You can traverse this path in the live document to find the corresponding live element. However, this path may not be precise, because HtmlAgilityPack removes some tags like <form>. So before using this solution, add the omitted tags back using
HtmlNode.ElementsFlags.Remove("form");
// Structure to hold the name and position of each node in the path
struct DocNode
{
    public string Name;
    public int Pos;
}
The following method finds the live element according to the XPath
static public HtmlElement GetLiveElement(HtmlNode node, HtmlDocument doc)
{
    // HtmlNode is HtmlAgilityPack's type; HtmlElement and HtmlDocument here are the WebBrowser control's types
    var pattern = @"/(.*?)\[(.*?)\]"; // like div[1]
    // Parse the XPath to extract the nodes on the path
    var matches = Regex.Matches(node.XPath, pattern);
    List<DocNode> PathToNode = new List<DocNode>();
    foreach (Match m in matches) // Make a path of nodes
    {
        DocNode n = new DocNode();
        n.Name = m.Groups[1].Value;
        n.Pos = Convert.ToInt32(m.Groups[2].Value) - 1; // XPath positions are 1-based
        PathToNode.Add(n); // add the node to path
    }
    HtmlElement elem = null; // Traverse to the element using the path
    if (PathToNode.Count > 0)
    {
        elem = doc.Body; // begin from the body
        foreach (DocNode n in PathToNode)
        {
            // Find the corresponding child by its name and position
            elem = GetChild(elem, n);
            if (elem == null) break; // the live tree has no matching child at this level
        }
    }
    return elem;
}
The code for the GetChild method used above:
public static HtmlElement GetChild(HtmlElement el, DocNode node)
{
    // Find the corresponding child of the element
    // based on the name and position of the node
    int childPos = 0;
    foreach (HtmlElement child in el.Children)
    {
        if (child.TagName.Equals(node.Name,
            StringComparison.OrdinalIgnoreCase))
        {
            if (childPos == node.Pos)
            {
                return child;
            }
            childPos++;
        }
    }
    return null;
}

C# parse html with xpath

I'm trying to parse stock exchange information out of an HTML document with a simple piece of C#. The problem is that I cannot get my head around the syntax: the tr class="LomakeTaustaVari" rows get parsed out, but how do I get the second kind of row, which has no class on the tr?
Here's a piece of the HTML; it repeats itself with different values.
<tr class="LomakeTaustaVari">
<td><div class="Ensimmainen">12:09</div></td>
<td><div>MSI</div></td>
<td><div>POH</div></td>
<td><div>42</div></td>
<td><div>64,50</div></td>
</tr>
<tr>
<td><div class="Ensimmainen">12:09</div></td>
<td><div>SRE</div></td>
<td><div>POH</div></td>
<td><div>156</div></td>
<td><div>64,50</div></td>
</tr>
My C# code:
{
    HtmlAgilityPack.HtmlWeb web = new HtmlWeb();
    HtmlAgilityPack.HtmlDocument doc = web.Load("https://www.op.fi/op/henkiloasiakkaat/saastot-ja-sijoitukset/kurssit-ja-markkinat/markkinat?sivu=alltrades.html&sym=KNEBV.HSE&from=10:00&to=19:00&id=32453");
    foreach (HtmlNode row in doc.DocumentNode.SelectNodes("//tr[@class='LomakeTaustaVari']"))
    {
        Console.WriteLine(row.InnerText);
    }
    Console.ReadKey();
}
Try the following XPath: //tr[preceding-sibling::tr[@class='LomakeTaustaVari']]
var nodes = doc.DocumentNode.SelectNodes("//tr[preceding-sibling::tr[@class='LomakeTaustaVari']]");
It should select tr nodes that have a preceding sibling tr with class LomakeTaustaVari.
Just FYI: if no nodes are found, the SelectNodes method returns null.
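Because of that null return, a foreach directly over SelectNodes throws when nothing matches. A minimal sketch of a guard, assuming the HtmlAgilityPack NuGet package is referenced; the SafeSelect class and the RowTexts helper are hypothetical names, not part of the library.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack;

public static class SafeSelect
{
    // Hypothetical helper: returns an empty list instead of null when the XPath matches nothing.
    public static List<string> RowTexts(string html, string xpath)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var texts = new List<string>();
        HtmlNodeCollection rows = doc.DocumentNode.SelectNodes(xpath);
        if (rows == null) return texts; // default HtmlAgilityPack behaviour for "no match"

        foreach (HtmlNode row in rows)
        {
            texts.Add(row.InnerText.Trim());
        }
        return texts;
    }
}
```

The caller can then loop over the result without a separate null check.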
If you manage to get a reference to the <tr class="LomakeTaustaVari"> element, I see two possible solutions.
You can navigate to the parent and then find all its <tr> children:
lomakeTaustaVariElement.ParentNode.SelectNodes("tr"); // iterate over these if needed
You can also use NextSibling to get the next <tr>:
var trWithoutClass = lomakeTaustaVariElement.NextSibling;
Please note that with the second alternative you may run into issues, because whitespace present in the HTML is parsed as a separate text node.
To overcome this, keep calling NextSibling until you encounter a tr element.
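A minimal sketch of that sibling walk, assuming the HtmlAgilityPack NuGet package is referenced; SiblingHelper and NextElementSibling are hypothetical names introduced here for illustration.

```csharp
using System;
using HtmlAgilityPack;

public static class SiblingHelper
{
    // Hypothetical helper: walks forward from 'node' until it reaches an element
    // with the given tag name, skipping whitespace text nodes and comments.
    public static HtmlNode NextElementSibling(HtmlNode node, string tagName)
    {
        HtmlNode current = node.NextSibling;
        while (current != null)
        {
            if (current.NodeType == HtmlNodeType.Element &&
                string.Equals(current.Name, tagName, StringComparison.OrdinalIgnoreCase))
            {
                return current;
            }
            current = current.NextSibling;
        }
        return null; // no following sibling with that tag
    }
}
```

With the element from above, this would be called as SiblingHelper.NextElementSibling(lomakeTaustaVariElement, "tr"); a null result means there is no further row.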
This will iterate over all tr nodes in the document. You will probably also need to be more specific about the starting node, so that you only select the rows you are interested in.
foreach (HtmlNode row in doc.DocumentNode.SelectNodes("//tr"))
{
    Console.WriteLine(row.InnerText);
}
Maybe I'm missing something, but the simplest XPath that selects any tr element should do the job:
doc.DocumentNode.SelectNodes("//tr")
Otherwise, in case you would like to select only elements with specific class attributes, it could be:
doc.DocumentNode.SelectNodes("//tr[@class = 'someClass1' or @class = 'someClass2']")
If you do not want to load the page and would rather use a ready HTML string, e.g. from a WebBrowser element, you can use the following example:
var web = new HtmlAgilityPack.HtmlDocument();
web.LoadHtml(webBrowser1.Document.Body.Parent.OuterHtml);
var q = web.DocumentNode.SelectNodes("/html/body/div[2]/div/div[1]"); // XPath /html/body/div[2]/div/div[1]

xpath issue (HtmlAgilityPack)

I have a "TR" node array. All I want is to get the child "TD" tags of each of its elements.
I don't have any idea of how to do it.
Anyone knows?
Here's my code:
foreach (HtmlNode tr in doc.DocumentNode.SelectNodes("//table[@id=\"ctl00_ContentPlaceHolder1_CustomerByLocation_ViewPanelStandAlone_ViewPanel_Grid_ctl01\"]/tr[position()>1]"))
{
    foreach (HtmlNode td in tr.SelectNodes("//td"))
    {
        w.WriteLine(td.InnerHtml);
    }
    w.WriteLine("***********************");
}
In XPath, "//" means "all nodes starting from the root", so your second search "//td" ignores tr as the parent and searches the whole DOM anyway.
Most likely you are looking for just "td" (instead of "//td").
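A minimal sketch of the difference, assuming the HtmlAgilityPack NuGet package is referenced. The 2x2 table, its id and the CellCounter class are made up for illustration and stand in for the real grid.

```csharp
using HtmlAgilityPack;

public static class CellCounter
{
    // Hypothetical 2x2 table used only for this illustration.
    private const string Html =
        "<table id='grid'>" +
        "<tr><td>a1</td><td>a2</td></tr>" +
        "<tr><td>b1</td><td>b2</td></tr>" +
        "</table>";

    // Counts how many nodes the XPath selects when evaluated from the first row.
    public static int CountFromFirstRow(string xpath)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(Html);

        HtmlNode firstRow = doc.DocumentNode.SelectSingleNode("//table[@id='grid']/tr");
        HtmlNodeCollection cells = firstRow.SelectNodes(xpath);
        return cells == null ? 0 : cells.Count; // SelectNodes returns null by default when nothing matches
    }
}
```

CountFromFirstRow("//td") should count all four cells in the document, while CountFromFirstRow("td") should count only the two cells of the first row.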
