I'm having a problem using Html Agility Pack.
I came across an img element whose src and alt attributes I need to read.
Everything works when there is only one element to read from. When there are more, SelectNodes finds every node in the collection as it should, but the data I read always comes from the first node in the collection.
HtmlNodeCollection collection = doc.DocumentNode.SelectNodes("//div[@class='listHolder']//article[@class='brochure openBrochureAction']//div[@class='imgBrochure']");
foreach (HtmlNode node in collection)
{
    //Tried these examples:
    NomeFolheto = node.SelectSingleNode("//div[@class='imageRatioHorizontal']//img[@alt]").GetAttributeValue("alt", "none").Trim();
    string testeNome = node.SelectSingleNode("//div[@class='imageRatioHorizontal']//img/@alt").Attributes["alt"].Value;
    string testeimagem = node.SelectSingleNode("//div[@class='imageRatioHorizontal']//img/@src").Attributes["src"].Value;
    imagem = node.SelectSingleNode("//div[@class='imageRatioHorizontal']//img[@src]").GetAttributeValue("src", "none").Trim();
}
As I said, the collection contains all the nodes it should, and the first value is read correctly, but for the other nodes the values still come from the first node.
What am I doing wrong? I checked each node in the collection, and they have the same "alt" attribute and different "src" attributes, as they should. I debugged it, so I know it picks up the first node every time.
Thanks in advance
Your XPath expressions all start from the root of the document. Even when you call SelectSingleNode on a single node, an expression beginning with // is still evaluated against the entire tree that node belongs to.
Prefix the expressions with .// so they are evaluated relative to the current node:
HtmlNodeCollection collection = doc.DocumentNode.SelectNodes("//div[@class='listHolder']//article[@class='brochure openBrochureAction']//div[@class='imgBrochure']");
foreach (HtmlNode node in collection)
{
    //Tried these examples:
    NomeFolheto = node
        .SelectSingleNode(".//div[@class='imageRatioHorizontal']//img[@alt]")
        .GetAttributeValue("alt", "none").Trim();
    string testeNome = node
        .SelectSingleNode(".//div[@class='imageRatioHorizontal']//img/@alt")
        .Attributes["alt"].Value;
    string testeimagem = node
        .SelectSingleNode(".//div[@class='imageRatioHorizontal']//img/@src")
        .Attributes["src"].Value;
    imagem = node
        .SelectSingleNode(".//div[@class='imageRatioHorizontal']//img[@src]")
        .GetAttributeValue("src", "none").Trim();
}
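The same root-versus-context behaviour can be reproduced without Html Agility Pack. The sketch below uses System.Xml's XmlDocument as a stand-in (an assumption: it follows the same XPath 1.0 rules for a leading //), and the markup, class name and method name are made up for illustration.

```csharp
using System;
using System.Xml;

public static class XPathScopeDemo
{
    // Hypothetical markup standing in for the brochure list.
    private const string Markup =
        "<root>" +
        "<div class='imgBrochure'><img alt='first' src='a.png'/></div>" +
        "<div class='imgBrochure'><img alt='second' src='b.png'/></div>" +
        "</root>";

    // Evaluates imgXPath once per div and returns the alt text found each time.
    public static string[] ReadAlts(string imgXPath)
    {
        var doc = new XmlDocument();
        doc.LoadXml(Markup);
        XmlNodeList divs = doc.SelectNodes("//div[@class='imgBrochure']");
        var result = new string[divs.Count];
        for (int i = 0; i < divs.Count; i++)
        {
            // The context node is divs[i], but a leading "//" ignores it.
            XmlNode img = divs[i].SelectSingleNode(imgXPath);
            result[i] = img.Attributes["alt"].Value;
        }
        return result;
    }

    public static void Main()
    {
        // Absolute: every lookup restarts at the document root.
        Console.WriteLine(string.Join(", ", ReadAlts("//img[@alt]")));  // first, first
        // Relative: each lookup stays below the current div.
        Console.WriteLine(string.Join(", ", ReadAlts(".//img[@alt]"))); // first, second
    }
}
```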
I use HtmlAgilityPack to parse the HTML document of a WebBrowser control.
I am able to find my desired HtmlNode, but after getting the HtmlNode, I want to return the corresponding HtmlElement in the WebBrowser control's Document.
In fact, HtmlAgilityPack parses an offline copy of the live document, while I want to access live elements of the WebBrowser control to read some rendered attributes like currentStyle or runtimeStyle.
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(webBrowser1.Document.Body.InnerHtml);
var some_nodes = doc.DocumentNode.SelectNodes("//p");
// this selection could be more sophisticated
// and the answer shouldn't rely on it.
foreach (HtmlNode node in some_nodes)
{
HtmlElement live_element = CorrespondingElementFromWebBrowserControl(node);
// CorrespondingElementFromWebBrowserControl is what I am searching for
}
If the element had a specific attribute it would be easy, but I want a solution that works on any element.
What can I do about it?
In fact, there seems to be no way to change the document directly in the WebBrowser control.
But you can extract the HTML from it, manipulate it, and write it back again like this:
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(webBrowser1.DocumentText);
foreach (HtmlAgilityPack.HtmlNode node in doc.DocumentNode.ChildNodes) {
node.Attributes.Add("TEST", "TEST");
}
StringBuilder sb = new StringBuilder();
using (StringWriter sw = new StringWriter(sb)) {
doc.Save(sw);
webBrowser1.DocumentText = sb.ToString();
}
For direct manipulation you can maybe use the unmanaged pointer webBrowser1.Document.DomDocument to the document, but this is outside of my knowledge.
HtmlAgilityPack definitely can't provide access to nodes in the live HTML directly. Since you said there is no distinct style/class/id on the element, you have to walk through the nodes manually and find matches.
Assuming the HTML is reasonably valid (so that the browser and HtmlAgilityPack normalize it similarly), you can walk pairs of elements starting from the root of both trees, selecting the same child node at each level.
Basically, you can build a "position-based" XPath to the node in one tree and select it in the other tree. The XPath would look something like one of these (depending on whether you want to match on positions only, or on position and node name):
"/*[1]/*[4]/*[2]/*[7]"
"/body/div[2]/span[1]/p[3]"
Steps:
1. Starting from the HtmlNode you have found, collect all of its parent nodes up to the root.
2. Get the root element of the live HTML document in the browser.
3. For each level, find the position of the corresponding node from step 1 among its parent's children, then find the live HtmlElement at that position among the children of the current live node.
4. Move to the newly found child and repeat step 3 until you reach the node you are looking for.
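The steps above can be sketched as follows. This is only an illustration under stated assumptions: two XmlDocument instances stand in for the parsed copy and the live tree, and BuildPositionPath is a hypothetical helper. The live WebBrowser DOM cannot evaluate XPath, so there you would follow the same positions through HtmlElement.Children instead of calling SelectSingleNode.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

public static class PositionPathDemo
{
    // Walks from the node up to the root, recording each element's 1-based
    // position among its parent's element children, e.g. "/*[1]/*[2]/*[3]".
    public static string BuildPositionPath(XmlNode node)
    {
        var steps = new List<string>();
        for (XmlNode cur = node;
             cur != null && cur.NodeType == XmlNodeType.Element;
             cur = cur.ParentNode)
        {
            int pos = 1;
            for (XmlNode sib = cur.PreviousSibling; sib != null; sib = sib.PreviousSibling)
            {
                if (sib.NodeType == XmlNodeType.Element) pos++;
            }
            steps.Add("*[" + pos + "]");
        }
        steps.Reverse(); // collected leaf-to-root, needed root-to-leaf
        return "/" + string.Join("/", steps);
    }

    public static void Main()
    {
        // Hypothetical markup, parsed twice to mimic two independent trees.
        const string markup = "<body><div/><div><span/><p>one</p><p>two</p></div></body>";
        var parsedCopy = new XmlDocument();
        parsedCopy.LoadXml(markup);
        var otherTree = new XmlDocument();
        otherTree.LoadXml(markup);

        XmlNode found = parsedCopy.SelectSingleNode("//p[2]");
        string path = BuildPositionPath(found);
        XmlNode twin = otherTree.SelectSingleNode(path);

        Console.WriteLine(path);           // /*[1]/*[2]/*[3]
        Console.WriteLine(twin.InnerText); // two
    }
}
```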
The XPath property of HtmlAgilityPack.HtmlNode gives the path from the root to the node, for example /div[1]/div[2]/table[1]. You can traverse this path in the live document to find the corresponding live element. However, this path may not be precise, because by default HtmlAgilityPack treats some tags, such as <form>, as empty elements. Before using this solution, remove that special handling with:
HtmlNode.ElementsFlags.Remove("form");
// Structure to hold the name and position of each node in the path
struct DocNode
{
    public string Name;
    public int Pos;
}
The following method finds the live element according to the XPath
static public HtmlElement GetLiveElement(HtmlNode node, HtmlDocument doc)
{
var pattern = @"/(.*?)\[(.*?)\]"; // like div[1]
// Parse the XPath to extract the nodes on the path
var matches = Regex.Matches(node.XPath, pattern);
List<DocNode> PathToNode = new List<DocNode>();
foreach (Match m in matches) // Make a path of nodes
{
DocNode n = new DocNode();
n.Name = m.Groups[1].Value;
n.Pos = Convert.ToInt32(m.Groups[2].Value)-1;
PathToNode.Add(n); // add the node to path
}
HtmlElement elem = null; //Traverse to the element using the path
if (PathToNode.Count > 0)
{
elem = doc.Body; //begin from the body
foreach (DocNode n in PathToNode)
{
//Find the corresponding child by its name and position
elem = GetChild(elem, n);
if (elem == null) return null; // the path does not exist in the live document
}
}
return elem;
}
The code for the GetChild method used above:
public static HtmlElement GetChild(HtmlElement el, DocNode node)
{
// Find the corresponding child of the element
// based on the name and position of the node
int childPos = 0;
foreach (HtmlElement child in el.Children)
{
if (child.TagName.Equals(node.Name,
StringComparison.OrdinalIgnoreCase))
{
if (childPos == node.Pos)
{
return child;
}
childPos++;
}
}
return null;
}
string XML1 = "<Root><InsertHere></InsertHere></Root>";
string XML2 = "<Root><child1><childnodes>data</childnodes><childnodes>data1</childnodes></child1><child2><childnodes>data</childnodes><childnodes>data1</childnodes></child2></Root>";
Of the two code samples below, the one that uses ChildNodes doesn't copy all the child nodes from XML2; only <child1> is copied.
code1:
string strXpath = "/Root/InsertHere";
XmlDocument xdxmlChildDoc = new XmlDocument();
XmlDocument ParentDoc = new XmlDocument();
ParentDoc.LoadXml(XML1);
xdxmlChildDoc.LoadXml(XML2);
XmlNode xnNewNode = ParentDoc.ImportNode(xdxmlChildDoc.DocumentElement.SelectSingleNode("/Root"), true);
if (xnNewNode != null)
{
XmlNodeList xnChildNodes = xnNewNode.SelectNodes("/*");
if (xnChildNodes != null)
{
foreach (XmlNode xnNode in xnChildNodes)
{
if (xnNode != null)
{
ParentDoc.DocumentElement.SelectSingleNode(strXpath).AppendChild(xnNode);
}
}
}
}
code2:
if (xnNewNode != null)
{
XmlNodeList xnChildNodes = xnNewNode.ChildNodes;
if (xnChildNodes != null)
{
foreach (XmlNode xnNode in xnChildNodes)
{
if (xnNode != null)
{
ParentDoc.DocumentElement.SelectSingleNode(strXpath).AppendChild(xnNode);
}
}
}
}
ParentDoc.OuterXml after executing the first code sample:
<Root>
<InsertHere>
<child1>
<childnodes>data</childnodes>
<childnodes>data1</childnodes>
</child1>
<child2>
<childnodes>data</childnodes>
<childnodes>data1</childnodes>
</child2>
</InsertHere>
</Root>
ParentDoc.OuterXml after executing the second code sample:
<Root>
<InsertHere>
<child1>
<childnodes>data</childnodes>
<childnodes>data1</childnodes>
</child1>
</InsertHere>
</Root>
I have done some debugging of the code, and it shows that xnNewNode.ChildNodes initially also returns 2 child nodes. After one iteration in the loop, the first child is however removed from ChildNodes, and therefore the loop ends prematurely.
If you want to use the ChildNodes property, one workaround is to "transfer" the child node references to an array or list, like this:
var xnChildNodes = xnNewNode.ChildNodes.Cast<XmlNode>().ToArray();
UPDATE
As Tomer W pointed out in his answer, when using XmlNode.AppendChild the inserted node is also removed from its original location. As stated in the MSDN documentation:
If the newChild is already in the tree, it is removed from
its original position and added to its target position.
With SelectNodes you have already created a new node collection, but with ChildNodes you are accessing the original collection.
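A minimal sketch of that difference, assuming a made-up document in which the children are moved within a single XmlDocument (so no ImportNode is needed). MoveChildren is a hypothetical helper; it returns how many children ended up under InsertHere.

```csharp
using System;
using System.Linq;
using System.Xml;

public static class MoveChildrenDemo
{
    // Moves the children of <Source> under <InsertHere> and returns how many
    // children <InsertHere> has afterwards.
    public static int MoveChildren(bool snapshot)
    {
        var doc = new XmlDocument();
        doc.LoadXml("<Root><Source><child1/><child2/></Source><InsertHere/></Root>");
        XmlNode source = doc.SelectSingleNode("/Root/Source");
        XmlNode target = doc.SelectSingleNode("/Root/InsertHere");

        if (snapshot)
        {
            // Copy the references first, so moving a node cannot disturb the loop.
            foreach (XmlNode child in source.ChildNodes.Cast<XmlNode>().ToArray())
            {
                target.AppendChild(child);
            }
        }
        else
        {
            // Live collection: AppendChild detaches the node being visited,
            // so the enumeration ends after the first child.
            foreach (XmlNode child in source.ChildNodes)
            {
                target.AppendChild(child);
            }
        }
        return target.ChildNodes.Count;
    }

    public static void Main()
    {
        Console.WriteLine(MoveChildren(snapshot: false)); // 1
        Console.WriteLine(MoveChildren(snapshot: true));  // 2
    }
}
```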
This clarifies what Anders G posted, with a more thorough explanation.
I am surprised the foreach does not fail (throw an exception) in this situation.
In code1:
1. SelectNodes creates a NEW COLLECTION of nodes.
2. The selected nodes are placed in it.
3. Appending a node to another parent removes it from the original collection, but not from the newly created one.
4. The new collection is therefore unchanged, and the loop visits every node.
In code2:
1. ChildNodes references the ORIGINAL node collection:
{child1, child2}
2. Appending the first node to another parent removes it from the original collection:
{child2}
3. When the foreach moves on to index 1, it sees that it has passed the end of the collection and exits.
This happens a lot when you change a collection while iterating over it, but most of the time the IEnumerator throws an exception when that happens.
I hope that makes it clear.
I had the same problem and observed that whitespace nodes seem to have a value attached to them, which is not the case for other nodes (at least in my application). This method removes the whitespace nodes from the node.ChildNodes list:
private List<XmlNode> findChildnodes(XmlNode node)
{
List<XmlNode> result = new List<XmlNode>();
foreach (XmlNode childnode in node.ChildNodes)
{
if(childnode.Value == null)
{
result.Add(childnode);
}
}
return result;
}
In answer to your question, Node.ChildNodes is all of the child nodes, whereas Node.SelectNodes("/*") is only the nodes that match /*. Only XML elements match *, so CDATA sections, text nodes, comments, etc. are excluded.
Nevertheless, the problem occurs because you are changing the collection of nodes while iterating over it. You cannot do that. The SelectNodes method returns a separate list of references to nodes, which is why it works.
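To make the first point concrete, here is a small sketch with a made-up element that mixes node types. It uses the relative expression "*" rather than "/*" so that it does not depend on where the root is; the class and method names are hypothetical.

```csharp
using System;
using System.Xml;

public static class ChildKindsDemo
{
    // Returns { number of child nodes of any kind, number of child elements }.
    public static int[] CountChildren()
    {
        var doc = new XmlDocument();
        // Children of Root: text, <a/>, a comment, <b/>, a CDATA section.
        doc.LoadXml("<Root>text<a/><!-- note --><b/><![CDATA[raw]]></Root>");
        XmlNode root = doc.DocumentElement;
        return new[] { root.ChildNodes.Count, root.SelectNodes("*").Count };
    }

    public static void Main()
    {
        int[] counts = CountChildren();
        Console.WriteLine(counts[0]); // 5
        Console.WriteLine(counts[1]); // 2
    }
}
```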
Here is what I have so far:
HtmlAgilityPack.HtmlDocument ht = new HtmlAgilityPack.HtmlDocument();
TextReader reader = File.OpenText(@"C:\Users\TheGateKeeper\Desktop\New folder\html.txt");
ht.Load(reader);
reader.Close();
HtmlNode select= ht.GetElementbyId("cats[]");
List<HtmlNode> options = new List<HtmlNode>();
foreach (HtmlNode option in select.ChildNodes)
{
if (option.Name == "option")
{
options.Add(option);
}
}
Now I have a list of all the "options" for the select element. What properties do I need to access to get the key and the text?
So if for example the html for one option would be:
<option class="level-1" value="1">Funky Town</option>
I want to get as output:
1 - Funky Town
Thanks
Edit: I just noticed something. When I got the child elements of the "select" element, it returned elements of type "option" and elements of type "#text".
Hmm, #text has the string I want, but the option has the value.
I thought HtmlAgilityPack was an HTML parser? Why did it give me confusing values like this?
This is due to the default configuration of the HTML parser: it treats <option> as HtmlElementFlag.Empty (with the comment 'they sometimes contain, and sometimes they don't...'). The <form> tag has a similar setup (CanOverlap + Empty), which causes these elements to appear as empty nodes in the DOM, without any child nodes.
You need to remove that flag before parsing the document.
HtmlNode.ElementsFlags.Remove("option");
Notice that the ElementsFlags property is static and any changes will affect all further parsing.
Edit: You should probably select the option nodes directly via XPath. I think this should work:
var options = select.SelectNodes("option");
That will get your options without the text nodes. Each option should contain the string you want. Waiting for your HTML sample.
foreach (var option in options)
{
int value = int.Parse(option.Attributes["value"].Value);
string text = option.InnerText;
}
You can add some sanity checking on the attribute to make sure it exists.
Hi there I've just registered on this website, because I need some help.
I want to get results from the nyaa.eu website.
Basically:
The table node is <table class="tlist">.
Every row node is <tr class="tlistrow">, although sometimes the class is 'trusted tlistrow' etc.
The nodes I am trying to retrieve are <td class="tlistname">, <td class="tlistsize">, <td class="tlistsn"> and <td class="tlistln">.
Firstly I'm retrieving a table which contains all the info about torrents:
HtmlNode hnTable = doc.DocumentNode.SelectSingleNode("//table[@class='tlist']");
The next thing is retrieving all the rows that contain 'tlistrow' in their class attribute:
HtmlNodeCollection hncRows = hnTable.SelectNodes("//tr[contains(@class,'tlistrow')]");
And finally the problem is when I read every node it's always the same one:
foreach (HtmlNode row in hncRows)
{
foreach (HtmlNode child in row.ChildNodes)
{
if (child.SelectSingleNode("//td[@class='tlistname']") != null)
{
MessageBox.Show("Something found!\n\n" + child.SelectSingleNode("//td[@class='tlistname']").InnerText);
break;
}
}
}
The text displayed in the message box is always the same; it looks like it selects the same node every time.
How can I fix this? If I am doing anything wrong, please correct me.
foreach (HtmlNode child in row.ChildNodes)
{
if (child.SelectSingleNode("//td[@class='tlistname']") != null)
{
MessageBox.Show("Something found!\n\n" + child.SelectSingleNode("//td[@class='tlistname']").InnerText);
break;
}
}
You need to understand the difference between relative XPath expressions and absolute XPath expressions.
A relative XPath expression is evaluated off (having as an initial context node) a specific node in the XML document.
An absolute XPath expression is evaluated against the whole XML document (having as initial context node the document-node).
Any XPath expression that starts with the character / is an absolute XPath expression.
Based on the provided code, you want to use a relative XPath expression whose initial context node is contained in the variable named child.
The problem is that the expression you are using:
//td[@class='tlistname']
starts with / and is therefore an absolute XPath expression.
Passed to the SelectSingleNode() method, it always selects the first td element in the whole document that has a class attribute with the string value "tlistname".
Solution: Use a relative XPath expression, such as:
.//td[@class='tlistname']
A leading // in the expression searches the whole document for matches. Drop it when you only want to search below the current node. Note that a single leading / is still absolute (it starts at the document root), so use a path with no leading slash, or one that starts with a dot.
So, try something like:
HtmlNode hnTable = doc.DocumentNode.SelectSingleNode("//table[@class='tlist']");
// ".//tr" searches only below hnTable.
HtmlNodeCollection hncRows = hnTable.SelectNodes(".//tr[contains(@class,'tlistrow')]");
foreach (HtmlNode row in hncRows)
{
    // "td[...]" is relative to row and matches only its direct td children.
    HtmlNode nameCell = row.SelectSingleNode("td[@class='tlistname']");
    if (nameCell != null)
    {
        MessageBox.Show("Something found!\n\n" + nameCell.InnerText);
    }
}
I have a simple XML
<AllBands>
<Band>
<Beatles ID="1234" started="1962">greatest Band<![CDATA[lalala]]></Beatles>
<Last>1</Last>
<Salary>2</Salary>
</Band>
<Band>
<Doors ID="222" started="1968">regular Band<![CDATA[lalala]]></Doors>
<Last>1</Last>
<Salary>2</Salary>
</Band>
</AllBands>
However, when I want to reach the Doors band and change its ID:
using (var stream = new StringReader(result))
{
XDocument xmlFile = XDocument.Load(stream);
var query = from c in xmlFile.Elements("Band")
select c;
...
query has no results.
But if I write xmlFile.Elements().Elements("Band"), it does find it.
What is the problem?
Is the full path from the root needed?
And if so, why did it work without specifying AllBands?
Does XDocument navigation require me to know the full structure down to the required element?
Elements() only checks direct children. In the first case those are the children of the document (just the root element); in the second case they are the children of the root element, hence you get a match in the second case. If you just want any matching descendant, use Descendants() instead:
var query = from c in xmlFile.Descendants("Band") select c;
I would also suggest restructuring your XML: the band name should be an attribute or element value, not the element name itself. Using it as the element name makes querying (and schema validation, for that matter) much harder. Something like this would be better:
<Band>
<BandProperties Name="Doors" ID="222" started="1968" />
<Description>regular Band<![CDATA[lalala]]></Description>
<Last>1</Last>
<Salary>2</Salary>
</Band>
You can do it this way:
xml.Descendants().SingleOrDefault(p => p.Name.LocalName == "Name of the node to find")
where xml is a XDocument.
Be aware that the property Name returns an object that has a LocalName and a Namespace. That's why you have to use Name.LocalName if you want to compare by name.
You should use Root to refer to the root element:
xmlFile.Root.Elements("Band")
If you want to find elements anywhere in the document use Descendants instead:
xmlFile.Descendants("Band")
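As a rough sketch of the three calls side by side, assuming a trimmed-down version of the XML from the question (the class and method names are made up for illustration). SetDoorsId also shows one way to change the ID once the element has been found.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class BandQueryDemo
{
    // Trimmed-down version of the XML in the question.
    private const string Xml =
        "<AllBands>" +
        "<Band><Beatles ID='1234' started='1962'>greatest Band</Beatles></Band>" +
        "<Band><Doors ID='222' started='1968'>regular Band</Doors></Band>" +
        "</AllBands>";

    // Counts the Band elements found by each navigation method.
    public static int[] CountBands()
    {
        XDocument doc = XDocument.Parse(Xml);
        return new[]
        {
            doc.Elements("Band").Count(),      // children of the document: only AllBands
            doc.Root.Elements("Band").Count(), // children of AllBands
            doc.Descendants("Band").Count()    // any depth
        };
    }

    // Finds the Doors element at any depth and changes its ID attribute.
    public static string SetDoorsId(string newId)
    {
        XDocument doc = XDocument.Parse(Xml);
        XElement doors = doc.Descendants("Doors").First();
        doors.SetAttributeValue("ID", newId);
        return (string)doors.Attribute("ID");
    }

    public static void Main()
    {
        Console.WriteLine(string.Join(", ", CountBands())); // 0, 2, 2
        Console.WriteLine(SetDoorsId("999"));               // 999
    }
}
```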
The problem is that Elements only takes the direct child elements of whatever you call it on. If you want all descendants, use the Descendants method:
var query = from c in xmlFile.Descendants("Band")
            select c;
My experience when working with large & complicated XML files is that sometimes neither Elements nor Descendants seem to work in retrieving a specific Element (and I still do not know why).
In such cases, I found that a much safer option is to manually search for the Element, as described by the following MSDN post:
https://social.msdn.microsoft.com/Forums/vstudio/en-US/3d457c3b-292c-49e1-9fd4-9b6a950f9010/how-to-get-tag-name-of-xml-by-using-xdocument?forum=csharpgeneral
In short, you can create a GetElement function:
private XElement GetElement(XDocument doc,string elementName)
{
foreach (XNode node in doc.DescendantNodes())
{
if (node is XElement)
{
XElement element = (XElement)node;
if (element.Name.LocalName.Equals(elementName))
return element;
}
}
return null;
}
Which you can then call like this:
XElement element = GetElement(doc,"Band");
Note that this will return null if no matching element is found.
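One possible reason a LocalName comparison succeeds where Descendants("Band") finds nothing is an XML namespace: a plain string name only matches elements that are in no namespace. The sketch below assumes a made-up document with a default namespace (urn:example:bands is hypothetical); whether this was the cause in the case described above is only a guess.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class NamespaceLookupDemo
{
    // Hypothetical document whose elements all live in a default namespace.
    private const string Xml =
        "<AllBands xmlns='urn:example:bands'><Band/><Band/></AllBands>";

    // A plain string name refers to "Band" in no namespace.
    public static int CountByPlainName()
    {
        return XDocument.Parse(Xml).Descendants("Band").Count();
    }

    // XNamespace + local name builds the fully qualified XName.
    public static int CountByQualifiedName()
    {
        XNamespace ns = "urn:example:bands";
        return XDocument.Parse(Xml).Descendants(ns + "Band").Count();
    }

    // Comparing LocalName ignores the namespace, like the GetElement function.
    public static int CountByLocalName()
    {
        return XDocument.Parse(Xml).Descendants().Count(e => e.Name.LocalName == "Band");
    }

    public static void Main()
    {
        Console.WriteLine(CountByPlainName());     // 0
        Console.WriteLine(CountByQualifiedName()); // 2
        Console.WriteLine(CountByLocalName());     // 2
    }
}
```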
The Elements() method returns an IEnumerable<XElement> containing all child elements of the current node. For an XDocument, that collection only contains the Root element. Therefore the following is required:
var query = from c in xmlFile.Root.Elements("Band")
select c;
Sebastian's answer was the only answer that worked for me while examining a xaml document. If, like me, you'd like a list of all the elements then the method would look a lot like Sebastian's answer above but just returning a list...
private static List<XElement> GetElements(XDocument doc, string elementName)
{
List<XElement> elements = new List<XElement>();
foreach (XNode node in doc.DescendantNodes())
{
if (node is XElement)
{
XElement element = (XElement)node;
if (element.Name.LocalName.Equals(elementName))
elements.Add(element);
}
}
return elements;
}
Call it thus:
var elements = GetElements(xamlFile, "Band");
or in the case of my xaml doc where I wanted all the TextBlocks, call it thus:
var elements = GetElements(xamlFile, "TextBlock");