xpath issue (HtmlAgilityPack)

I have an array of "TR" nodes. All I want is to get the child "TD" tags of each element.
I have no idea how to do it.
Does anyone know?
Here's my code:
foreach (HtmlNode tr in doc.DocumentNode.SelectNodes("//table[@id=\"ctl00_ContentPlaceHolder1_CustomerByLocation_ViewPanelStandAlone_ViewPanel_Grid_ctl01\"]/tr[position()>1]"))
{
    foreach (HtmlNode td in tr.SelectNodes("//td"))
    {
        w.WriteLine(td.InnerHtml);
    }
    w.WriteLine("***********************");
}

In XPath, "//" means "all nodes starting from the root", so your second search "//td" ignores tr as the parent and searches the whole DOM anyway.
Most likely you are looking for just "td" (instead of "//td").
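To make the difference concrete, here is a minimal sketch. It uses XmlDocument from System.Xml rather than HtmlAgilityPack so that it runs without extra packages; the XPath 1.0 rule about a leading "//" is the same in both. The class name, method name and sample markup are made up for illustration.

```csharp
using System;
using System.Xml;

public static class XPathContextDemo
{
    // Evaluates the given XPath with the first <tr> as context node
    // and returns how many nodes it matched.
    public static int CountFromFirstRow(string xpath)
    {
        var doc = new XmlDocument();
        // Hypothetical sample: two rows, three cells in total.
        doc.LoadXml("<table><tr><td>a</td><td>b</td></tr><tr><td>c</td></tr></table>");
        XmlNode firstRow = doc.SelectSingleNode("/table/tr[1]");
        return firstRow.SelectNodes(xpath).Count;
    }
}
```

With the first row as the context node, "//td" should count all three cells in the table, while "td" and ".//td" should count only the two cells of that row.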

Related

HtmlAgilityPack remove childnode of childnode

I have a string containing something like this:
string textToFormat = "<p>test <span> <font> here </font> </span> try</p><p> <font> try 2</font> </p>";
What I need is to filter it like this:
Keep the text inside P
Remove Span and its content (font and text)
Keep the text inside font if its direct parent is not a Span*
What I have is:
StringBuilder sbtexttoCorrect = new StringBuilder();
HtmlDocument html = new HtmlDocument();
html.LoadHtml(textToFormat);
var nodes = html.DocumentNode.SelectNodes("//p");
foreach (var line in nodes)
{
    if (line.Name == "SPAN")
    {
        line.RemoveAllChildren();
        line.Remove();
    }
}
foreach (var txt in nodes)
{
    sbtexttoCorrect.Append(txt.InnerText);
}
But at the end sbtexttoCorrect still contains the text of the font child of the span, even after calling RemoveAllChildren and Remove.
What am I missing?
Note: in another post someone suggested:
foreach (var line in nodes
    .Select(node => node.ChildNodes.Where(childNode => childNode.Name != "span"))
    .Select(textNodes => textNodes.Aggregate(String.Empty, (current, node) => current + node.InnerText)))
{
    sbtexttoCorrect.Append(line);
}
But I do not understand all of the syntax, so I wanted to fix my own attempt. It also did not work every time: it still picks up the text inside the font inside the span.
Note 2: I can't find any documentation for HtmlAgilityPack. If someone knows where to find it, I'd like to learn more about this library.
Edit: The real HTML is much more complex, with a number of child nodes that I can't know in advance. They can be TD or DIV; the only thing I know for sure is that when there is a span, I need to skip its content and its child nodes.

I see these problems in your code:
You compare the name against uppercase "SPAN", whereas HtmlAgilityPack reports element names in lowercase => your if block will never be hit
You only loop over the p elements (instead of over the children of the p elements) => your if block will never be hit
Based on your additional explanations, this should work:
It selects all spans with an XPath (so it should work for upper and lower case)
It removes the spans
It strips all remaining HTML elements, keeping only their text
string text = "<p>test <SPAN> <font> here </font> </SPAN> try</p><p><table> <tr><td><span>test</span></td></tr></table><font> try 2</font> </p>";
StringBuilder sbtexttoCorrect = new StringBuilder();
HtmlDocument html = new HtmlDocument();
html.LoadHtml(text);
// SelectNodes returns null when nothing matches, so guard before looping.
var nodes = html.DocumentNode.SelectNodes("//span");
if (nodes != null)
{
    foreach (var node in nodes)
    {
        // Removing a span also removes everything nested inside it.
        node.Remove();
    }
}
// Collect the text of every remaining leaf node.
foreach (var node in html.DocumentNode.DescendantsAndSelf())
{
    if (!node.HasChildNodes)
    {
        string t = node.InnerText;
        if (!string.IsNullOrEmpty(t))
            sbtexttoCorrect.AppendLine(t);
    }
}
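The same "remove the spans first, then read the text" idea can be tried without HtmlAgilityPack. The following is only a sketch under a strong assumption: the input is well-formed XML with lowercase tag names, so it can be parsed with LINQ to XML (System.Xml.Linq) instead. Real-world HTML usually does not meet that assumption, which is why the answer above uses HtmlAgilityPack. The class and method names are hypothetical.

```csharp
using System;
using System.Xml.Linq;

public static class SpanStripper
{
    // Removes every <span> element (and everything nested inside it)
    // and returns the concatenated text that is left.
    public static string TextWithoutSpans(string xml)
    {
        XDocument doc = XDocument.Parse(xml);
        // The Remove() extension copies the sequence before removing,
        // so it is safe to call on a live Descendants() query.
        doc.Descendants("span").Remove();
        return doc.Root.Value;
    }
}
```

Called with the question's sample wrapped in a single root element, the returned text should no longer contain "here" but should still contain "test" and "try 2".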

Getting same value when I'm looking in different nodes HTMLAgility

I'm having a problem using HtmlAgilityPack.
I came across an img src / img alt that I need to take data from.
Everything works when there is only one element to read. When there are more, the collection contains all of them as it should, but the data I read always comes from the first node in the collection...
HtmlNodeCollection collection = doc.DocumentNode.SelectNodes("//div[@class='listHolder']//article[@class='brochure openBrochureAction']//div[@class='imgBrochure']");
foreach (HtmlNode node in collection)
{
    //Tried these examples:
    NomeFolheto = node.SelectSingleNode("//div[@class='imageRatioHorizontal']//img[@alt]").GetAttributeValue("alt", "none").Trim();
    string testeNome = node.SelectSingleNode("//div[@class='imageRatioHorizontal']//img/@alt").Attributes["alt"].Value;
    string testeimagem = node.SelectSingleNode("//div[@class='imageRatioHorizontal']//img/@src").Attributes["src"].Value;
    imagem = node.SelectSingleNode("//div[@class='imageRatioHorizontal']//img[@src]").GetAttributeValue("src", "none").Trim();
}
As I said, the collection finds all the nodes it should and gets the first value properly, but for the other nodes the values it returns are still those of the first node.
What am I doing wrong? I checked each node in the collection: they have the same "alt" attribute, as they should, and different "src" attributes, as they should, but I know from debugging that it picks up the first node every time.
Thanks in advance

Your XPath expressions all start from the root of the document. Even when you call them on a single node, a leading // still searches the entire tree that the node belongs to.
You should start the expressions with .// instead:
HtmlNodeCollection collection = doc.DocumentNode.SelectNodes("//div[@class='listHolder']//article[@class='brochure openBrochureAction']//div[@class='imgBrochure']");
foreach (HtmlNode node in collection)
{
    //Tried these examples:
    NomeFolheto = node
        .SelectSingleNode(".//div[@class='imageRatioHorizontal']//img[@alt]")
        .GetAttributeValue("alt", "none").Trim();
    string testeNome = node
        .SelectSingleNode(".//div[@class='imageRatioHorizontal']//img/@alt")
        .Attributes["alt"].Value;
    string testeimagem = node
        .SelectSingleNode(".//div[@class='imageRatioHorizontal']//img/@src")
        .Attributes["src"].Value;
    imagem = node
        .SelectSingleNode(".//div[@class='imageRatioHorizontal']//img[@src]")
        .GetAttributeValue("src", "none").Trim();
}

Append attribute to each element in XML File

I have an xml and I want to append an attribute to each element in xml file.
IEnumerable<XElement> childList = from el in xml.Elements()
                                  select el;
textBox1.Text = childList.ToString();
foreach (XElement el in childList)
{
    el.Add(new XAttribute("Liczba_Potomkow", "dziesiec"));
    textBox1.Text = el.ToString();
    xml.Save("Employees.xml");
}
Unfortunately, when I open the file only the first line seems to be affected (only the first element gets the new attribute). Why is that?

I assume xml is an XDocument? If so, you're calling Elements() directly on the parent of the root element - so the only element it finds will be the root element itself.
If you want to do something for all elements in the document, you should use the Descendants() method.
Additionally, your query expression is pointless - you might as well just use xml.Elements() - and I really don't think you should be saving in a loop.
I think you just want:
foreach (var element in xml.Descendants())
{
    element.Add(new XAttribute("Liczba_Potomkow", "dziesiec"));
}
xml.Save("Employees.xml");
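The difference between Elements() and Descendants() on an XDocument can be checked with a small sketch like the one below. The sample XML and the class and method names are invented for illustration; only XDocument.Parse, Elements() and Descendants() come from System.Xml.Linq.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class DescendantsDemo
{
    // Hypothetical sample: one root element, two employees, one name each.
    public static XDocument Sample()
    {
        return XDocument.Parse(
            "<Employees>" +
            "<Employee><Name>Ann</Name></Employee>" +
            "<Employee><Name>Bob</Name></Employee>" +
            "</Employees>");
    }

    // Elements() on the document only yields the root element.
    public static int CountElements()
    {
        return Sample().Elements().Count();
    }

    // Descendants() walks the whole tree below the document.
    public static int CountDescendants()
    {
        return Sample().Descendants().Count();
    }
}
```

For this sample, CountElements() should return 1 (just Employees) and CountDescendants() should return 5 (Employees, two Employee elements and two Name elements), which matches the behaviour described in the answer.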

HtmlAgilityPack | Wrongly retrieved table nodes

Hi there I've just registered on this website, because I need some help.
I want to get results from the nyaa.eu website.
Basically:
Table Node is called <table class="tlist">
Every row node is called <tr class="tlistrow"> also sometimes it's 'trusted tlistrow' etc.
The Nodes I try to retrieve are: <td class="tlistname"> <td class="tlistsize"> <td class="tlistsn"> and <td class="tlistln">
First, I retrieve the table which contains all the info about the torrents:
HtmlNode hnTable = doc.DocumentNode.SelectSingleNode("//table[@class='tlist']");
The next step is retrieving all the rows which contain 'tlistrow' in their class attribute:
HtmlNodeCollection hncRows = hnTable.SelectNodes("//tr[contains(@class,'tlistrow')]");
And finally, the problem is that when I read every node, it's always the same one:
foreach (HtmlNode row in hncRows)
{
    foreach (HtmlNode child in row.ChildNodes)
    {
        if (child.SelectSingleNode("//td[@class='tlistname']") != null)
        {
            MessageBox.Show("Something found!\n\n" + child.SelectSingleNode("//td[@class='tlistname']").InnerText);
            break;
        }
    }
}
The text displayed in the messagebox is always the same, it looks like it only selects one node multiple times.
How can I fix this or if I am doing anything wrong, please correct me.

foreach (HtmlNode child in row.ChildNodes)
{
    if (child.SelectSingleNode("//td[@class='tlistname']") != null)
    {
        MessageBox.Show("Something found!\n\n" + child.SelectSingleNode("//td[@class='tlistname']").InnerText);
        break;
    }
}
You need to understand the difference between relative XPath expressions and absolute XPath expressions.
A relative XPath expression is evaluated off (having as an initial context node) a specific node in the XML document.
An absolute XPath expression is evaluated against the whole XML document (having as initial context node the document-node).
Any XPath expression that starts with the character / is an absolute XPath expression.
Based on the provided code, you want to use a relative XPath expression whose initial context node is contained in the variable named child.
The problem is that the expression you are using:
//td[@class='tlistname']
starts with / and is therefore an absolute XPath expression.
Passed to the SelectSingleNode() method, it always selects the first td element in the whole XML document that has a class attribute with the string value "tlistname".
Solution: Use a relative XPath expression, such as:
.//td[@class='tlistname']
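The effect on a row-by-row loop can be reproduced with a short sketch. It assumes a tiny, well-formed stand-in for the table and uses XmlDocument from System.Xml so it runs without HtmlAgilityPack; the cell values and the class and method names are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

public static class PerRowSelection
{
    // Runs the given XPath once per row, with that row as the context node,
    // and collects the text of the first match for each row.
    public static List<string> NamesPerRow(string xpath)
    {
        var doc = new XmlDocument();
        doc.LoadXml(
            "<table class='tlist'>" +
            "<tr class='tlistrow'><td class='tlistname'>first</td><td class='tlistsize'>1 MB</td></tr>" +
            "<tr class='trusted tlistrow'><td class='tlistname'>second</td><td class='tlistsize'>2 MB</td></tr>" +
            "</table>");

        var names = new List<string>();
        foreach (XmlNode row in doc.SelectNodes("//tr[contains(@class,'tlistrow')]"))
        {
            XmlNode cell = row.SelectSingleNode(xpath);
            if (cell != null)
            {
                names.Add(cell.InnerText);
            }
        }
        return names;
    }
}
```

With the absolute expression "//td[@class='tlistname']" every row should yield "first", which is the symptom described in the question. With the relative expression ".//td[@class='tlistname']" the rows should yield "first" and then "second".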

A leading // makes the expression search the whole document, no matter which node you call it on. Drop it, or anchor the expression to the current node, when you only want that node's children or descendants.
So, try something like:
HtmlNode hnTable = doc.DocumentNode.SelectSingleNode("//table[@class='tlist']");
HtmlNodeCollection hncRows = hnTable.SelectNodes(".//tr[contains(@class,'tlistrow')]");
foreach (HtmlNode row in hncRows)
{
    // "td[...]" is relative: it only matches td children of this row.
    HtmlNode name = row.SelectSingleNode("td[@class='tlistname']");
    if (name != null)
    {
        MessageBox.Show("Something found!\n\n" + name.InnerText);
    }
}

HtmlAgilityPack set node InnerText

I want to replace inner text of HTML tags with another text.
I am using HtmlAgilityPack
I use this code to extract all texts
HtmlDocument doc = new HtmlDocument();
doc.Load("some path");
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//text()[normalize-space(.) != '']"))
{
    // How to replace node.InnerText with some text ?
}
But InnerText is readonly. How can I replace texts with another text and save them to file ?

Try the code below. In addition to your XPath expression, this one restricts the search to text inside <body>, selects only leaf nodes, and filters out the text content of <script> tags. You may need to add some additional filtering.
var nodes = doc.DocumentNode.SelectNodes("//body//text()[(normalize-space(.) != '') and not(parent::script) and not(*)]");
foreach (HtmlNode htmlNode in nodes)
{
    htmlNode.ParentNode.ReplaceChild(HtmlTextNode.CreateNode(htmlNode.InnerText + "_translated"), htmlNode);
}

Strange, but I found that InnerHtml isn't readonly. And when I tried to set it like that
aElement.InnerHtml = "sometext";
the value of InnerText also changed to "sometext"

The HtmlTextNode class has a Text property* which works perfectly for this purpose.
Here's an example:
var textNodes = doc.DocumentNode.SelectNodes("//body/text()").Cast<HtmlTextNode>();
foreach (var node in textNodes)
{
    node.Text = node.Text.Replace("foo", "bar");
}
And if we have an HtmlNode that we want to change its direct text, we can do something like the following:
HtmlNode node = //...
var textNode = (HtmlTextNode)node.SelectSingleNode("text()");
textNode.Text = "new text";
Or we can use node.SelectNodes("text()") in case it has more than one.
* Not to be confused with the readonly InnerText property.
