Getting HtmlElement based on HtmlAgilityPack.HtmlNode - c#

I use HtmlAgilityPack to parse the HTML document of a WebBrowser control.
I am able to find my desired HtmlNode, but after getting the HtmlNode, I want to return the corresponding HtmlElement in the WebBrowser control's Document.
HtmlAgilityPack parses an offline copy of the live document, while I want to access the live elements of the WebBrowser control in order to read rendered attributes such as currentStyle or runtimeStyle.
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(webBrowser1.Document.Body.InnerHtml);
var some_nodes = doc.DocumentNode.SelectNodes("//p");
// this selection could be more sophisticated
// and the answer shouldn't rely on it.
foreach (HtmlNode node in some_nodes)
{
HtmlElement live_element = CorrespondingElementFromWebBrowserControl(node);
// CorrespondingElementFromWebBrowserControl is what I am searching for
}
If the element had a specific attribute it would be easy, but I want a solution that works on any element.
How can I do this?

There seems to be no direct way to change the document inside the WebBrowser control.
But you can extract the HTML from it, manipulate it and write it back again like this:
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(webBrowser1.DocumentText);
foreach (HtmlAgilityPack.HtmlNode node in doc.DocumentNode.ChildNodes) {
node.Attributes.Add("TEST", "TEST");
}
StringBuilder sb = new StringBuilder();
using (StringWriter sw = new StringWriter(sb)) {
doc.Save(sw);
webBrowser1.DocumentText = sb.ToString();
}
For direct manipulation you can maybe use the unmanaged pointer webBrowser1.Document.DomDocument to the document, but this is outside of my knowledge.

HtmlAgilityPack definitely can't provide access to nodes in live HTML directly. Since you said there is no distinct style/class/id on the element, you have to walk through the nodes manually and find matches.
Assuming the HTML is reasonably valid (so both the browser and HtmlAgilityPack perform normalization similarly), you can walk pairs of elements starting from the root of both trees, selecting the same child node at each level.
Basically you can build a "position-based" XPath to the node in one tree and select it in the other tree. The XPath would look something like one of the following, depending on whether you want to pay attention to positions only or to positions and node names:
"/*[1]/*[4]/*[2]/*[7]"
"/body/div[2]/span[1]/p[3]"
Steps:
1. Using the HtmlNode you have found, collect all parent nodes up to the root.
2. Get the root element of the HTML in the browser.
3. For each level of children, find the position of the corresponding child from the HtmlNode collection of step 1 within its parent, and then find the live HtmlElement among the children of the current live node.
4. Move to the newly found child and go back to step 3 until you reach the node you are looking for.
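The steps above can be sketched roughly as follows. This is only an illustration under assumptions: two System.Xml.XmlDocument trees stand in for the HtmlAgilityPack tree and the live browser DOM so that the sketch runs on its own, and the names PositionPath, Build and Follow are made up for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

// Hypothetical helper: two XmlDocument trees stand in for the parsed copy
// and the live document.
public static class PositionPath
{
    // Steps 1 and 3: record, for every level from the root element down to
    // the node, the node's index among its parent's element children.
    public static List<int> Build(XmlNode node)
    {
        var path = new List<int>();
        while (node.ParentNode != null && node.ParentNode.NodeType == XmlNodeType.Element)
        {
            int index = 0;
            for (XmlNode s = node.PreviousSibling; s != null; s = s.PreviousSibling)
            {
                if (s.NodeType == XmlNodeType.Element) index++;
            }
            path.Insert(0, index); // root-most level goes first
            node = node.ParentNode;
        }
        return path;
    }

    // Steps 2 to 4: start at the other tree's root element and pick the
    // child with the recorded index at each level.
    public static XmlNode Follow(XmlNode root, List<int> path)
    {
        XmlNode current = root;
        foreach (int index in path)
        {
            int seen = 0;
            XmlNode match = null;
            foreach (XmlNode child in current.ChildNodes)
            {
                if (child.NodeType != XmlNodeType.Element) continue;
                if (seen == index) { match = child; break; }
                seen++;
            }
            if (match == null) return null; // the two trees differ at this level
            current = match;
        }
        return current;
    }

    public static void Main()
    {
        const string markup = "<body><div><p>a</p></div><div><span/><p>b</p></div></body>";
        var parsed = new XmlDocument();
        parsed.LoadXml(markup);
        var live = new XmlDocument();
        live.LoadXml(markup);

        XmlNode found = parsed.SelectSingleNode("/body/div[2]/p");
        List<int> path = Build(found);
        XmlNode twin = Follow(live.DocumentElement, path);
        Console.WriteLine(string.Join("/", path) + " -> " + twin.InnerText); // prints "1/1 -> b"
    }
}
```

With the real types, the same walk would use HtmlNode.ParentNode on the parsed side and HtmlElement.Children on the live side.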

The XPath property of HtmlAgilityPack.HtmlNode shows the nodes on the path from the root to the node, for example /div[1]/div[2]/table[1]. You can traverse this path in the live document to find the corresponding live element. However, this path may not be precise, as HtmlAgilityPack removes some tags like <form>, so before using this solution add the omitted tags back using
HtmlNode.ElementsFlags.Remove("form");
// Structure to hold the name and position of each node in the path
struct DocNode
{
public string Name;
public int Pos;
}
The following method finds the live element according to the XPath
static public HtmlElement GetLiveElement(HtmlNode node, HtmlDocument doc)
{
var pattern = @"/(.*?)\[(.*?)\]"; // like div[1]
// Parse the XPath to extract the nodes on the path
var matches = Regex.Matches(node.XPath, pattern);
List<DocNode> PathToNode = new List<DocNode>();
foreach (Match m in matches) // Make a path of nodes
{
DocNode n = new DocNode();
n.Name = m.Groups[1].Value;
n.Pos = Convert.ToInt32(m.Groups[2].Value)-1;
PathToNode.Add(n); // add the node to path
}
HtmlElement elem = null; //Traverse to the element using the path
if (PathToNode.Count > 0)
{
elem = doc.Body; //begin from the body
foreach (DocNode n in PathToNode)
{
//Find the corresponding child by its name and position
elem = GetChild(elem, n);
if (elem == null) break; // no matching child: the two trees differ at this level
}
}
return elem;
}
The code for the GetChild method used above:
public static HtmlElement GetChild(HtmlElement el, DocNode node)
{
// Find the corresponding child of the element
// based on the name and position of the node
int childPos = 0;
foreach (HtmlElement child in el.Children)
{
if (child.TagName.Equals(node.Name,
StringComparison.OrdinalIgnoreCase))
{
if (childPos == node.Pos)
{
return child;
}
childPos++;
}
}
return null;
}

Related

Parse HTML class in individual items with htmlagilitypack

I want to parse HTML. I used the following code, but I get everything in one item instead of getting the items individually:
var url = "https://subscene.com/subtitles/searchbytitle?query=joker&l=";
var web = new HtmlWeb();
var doc = web.Load(url);
IEnumerable<HtmlNode> nodes =
doc.DocumentNode.Descendants()
.Where(n => n.HasClass("search-result"));
foreach (var item in nodes)
{
string itemx = item.SelectSingleNode(".//a").Attributes["href"].Value;
MessageBox.Show(itemx);
MessageBox.Show(item.InnerText);
}
I only receive 1 message for the first item and the second message displays all items
When you search the data from the url based on class 'search-result', there is only one node that is returned. Instead of iterating through its children, you only go through that one div, which is why you are only getting one result.
If you want to get a list of all the links inside the div with class "search-result", then you can do the following.
Code:
string url = "https://subscene.com/subtitles/searchbytitle?query=joker&l=";
HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load(url);
List<string> listOfUrls = new List<string>();
HtmlNode searchResult = doc.DocumentNode.SelectSingleNode("//div[@class='search-result']");
// Iterate through all the child nodes that have the 'a' tag.
foreach (HtmlNode node in searchResult.SelectNodes(".//a"))
{
string thisUrl = node.GetAttributeValue("href", "");
if (!string.IsNullOrEmpty(thisUrl) && !listOfUrls.Contains(thisUrl))
listOfUrls.Add(thisUrl);
}
What does it do?
SelectSingleNode("//div[@class='search-result']") -> retrieves the div that holds all the search results and ignores the rest of the document.
The loop iterates only through the "subnodes" that are a tags, reads the href of each one and adds it to a list. Subnodes are determined by the dot notation in SelectNodes(".//a") (if you use // instead of .//, it will search the entire page, which is not what you want).
The if statement makes sure it only adds unique, non-empty values.
You have all the links now.
Fiddle: https://dotnetfiddle.net/j5aQFp
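The difference between .//a and //a can be checked in isolation. The sketch below is an assumption-laden stand-in: it uses System.Xml.XmlDocument and a made-up three-link page rather than HtmlAgilityPack and the real site, since the relative-path rule is part of XPath itself. The class and method names are hypothetical.

```csharp
using System;
using System.Xml;

public static class RelativeXPathDemo
{
    // Counts the nodes an XPath expression selects when evaluated from a context node.
    public static int Count(XmlNode context, string xpath)
    {
        return context.SelectNodes(xpath).Count;
    }

    public static void Main()
    {
        var doc = new XmlDocument();
        // Made-up sample page: one link outside the result div, two inside it.
        doc.LoadXml(
            "<html><body>" +
            "<a href='/home'>Home</a>" +
            "<div class='search-result'><a href='/one'>1</a><a href='/two'>2</a></div>" +
            "</body></html>");

        XmlNode result = doc.SelectSingleNode("//div[@class='search-result']");
        Console.WriteLine(Count(result, ".//a")); // 2: only links inside the div
        Console.WriteLine(Count(result, "//a"));  // 3: every link in the document
    }
}
```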
I think it's how you're looking up and storing the data. Try:
foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//a[@href]"))
{
string hrefValue = link.GetAttributeValue( "href", string.Empty );
MessageBox.Show(hrefValue);
MessageBox.Show(link.InnerText);
}

get all nodes and its content using htmldocument/HtmlAgilityPack

I need to get all nodes from a html, then from that nodes I need to get the text and sub-nodes, and the same thing but from that sub-sub-nodes.
For example, I have this HTML:
<p>This <b>is a Link</b> with <b>bold</b></p>
So I need a way to get the p node, then the non-formatted text (This), the bold-only text (is a), the bold link (Link), and the rest of the formatted and unformatted text.
I know that with the HtmlDocument I can select all nodes and sub-nodes, but how can I get the text before the sub-node, then the sub-node, and its text/sub-nodes, so I can build the rendered version of the HTML ("This is a Link with bold")?
Please note that the above example is a simple one. The HTML would have more complex things like lists, frames, numbered lists, triple-formatted text, etc. Also note that the rendering itself is not a problem. I have already done that, but in another way. What I need is only the part that gets the nodes and their content.
Also, I can't ignore any node, so I can't filter by anything. And the main node could start as p, div, frame, ul, etc.
After looking at the HtmlDocument and its properties, and thanks to @HungCao's observation, I found a simple working way to interpret HTML code.
My code is a little too complex to add as an example, so I will post a lite version of it.
First of all, the htmlDoc has to be loaded. This could be done in any function:
HtmlDocument htmlDoc = new HtmlDocument();
string html = @"<p>This <b>is a Link</b> with <b>bold</b></p>";
htmlDoc.LoadHtml(html);
Then we need to interpret each "main" node (p in this case) and, depending on its type, call a recursive function (InterNode):
HtmlNodeCollection nodes = htmlDoc.DocumentNode.ChildNodes;
foreach (HtmlNode node in nodes)
{
if(node.Name.ToLower() == "p") //Low the typeName just in case
{
Paragraph newPPara = new Paragraph();
foreach(HtmlNode childNode in node.ChildNodes)
{
InterNode(childNode, ref newPPara);
}
richTextBlock.Blocks.Add(newPPara);
}
}
Please note that there is a property called "NodeType", but it will not return the correct type. So, instead use the "Name" property (Also note that the Name property in htmlNode is not the same as the Name attribute in HTML).
Finally, we have the InterNode function that will add inlines to the referred (ref) Paragraph
public bool InterNode(HtmlNode htmlNode, ref Paragraph originalPar)
{
string htmlNodeName = htmlNode.Name.ToLower();
List<string> nodeAttList = new List<string>();
HtmlNode parentNode = htmlNode.ParentNode;
while (parentNode != null) {
nodeAttList.Add(parentNode.Name);
parentNode = parentNode.ParentNode;
} //collect the names of all ancestors, because the text could be b(old) and i(talic) at the same time.
Inline newRun = new Run();
foreach (string noteAttStr in nodeAttList) //with this we can set all the attributes to the inline
{
switch (noteAttStr)
{
case ("b"):
case ("strong"):
{
newRun.FontWeight = FontWeights.Bold;
break;
}
case ("i"):
case ("em"):
{
newRun.FontStyle = FontStyle.Italic;
break;
}
}
}
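if(htmlNodeName == "#text") //the #text means that it's a text node. Like <i><#text/></i>. Thanks @HungCao
{
((Run)newRun).Text = htmlNode.InnerText;
originalPar.Inlines.Add(newRun); //add the formatted run to the paragraph, otherwise it is discarded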
} else //if it is not a #text, don't load its innertext, as it's another node and it will always have a #text node as a child (if it has any text)
{
foreach (HtmlNode childNode in htmlNode.ChildNodes)
{
InterNode(childNode, ref originalPar);
}
}
return true;
}
Note: I know I said that my app needs to render the HTML differently from how a WebView does, and I know that this example code generates the same thing as a WebView. As I said before, this is just a lite version of my final code. My full code works as I need it to, and this is just the base.

C# XmlDocument select nodes returns empty

I am trying to work with the XML from http://api.met.no/weatherapi/locationforecast/1.9/?lat=49.8197202;lon=18.1673554.
Let's say I want to select the value attribute of each temperature element.
I tried this:
const string url = "http://api.met.no/weatherapi/locationforecast/1.9/?lat=49.8197202;lon=18.1673554";
WebClient client = new WebClient();
string x = client.DownloadString(url);
XmlDocument xml = new XmlDocument();
xml.LoadXml(x);
XmlNodeList nodes = xml.SelectNodes("/weatherdata/product/time/location/temperature");
//XmlNodeList nodes = xml.SelectNodes("temperature");
foreach (XmlNode node in nodes)
{
Console.WriteLine(node.Attributes[0].Value);
}
But I get nothing every time. What am I doing wrong?
The current single slash is targeting weatherdata under the root but the root is weatherdata.
Add a preceding slash to your xpath query to make it a double slash:
XmlNodeList nodes = xml.SelectNodes("//weatherdata/product/time/location/temperature");
A double slash tells XPath to select matching nodes anywhere in the document, no matter where they are.
or remove the preceding slash:
XmlNodeList nodes = xml.SelectNodes("weatherdata/product/time/location/temperature");
which looks for the whole path including the root.
Also, since you apparently want the attribute called value, read it by name:
Console.WriteLine(node.Attributes["value"].Value);
node.Attributes[0] is simply the first attribute on the element, which may not be the one you expect.
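To see why reading the attribute by name is safer than reading it by index, here is a small sketch. The XML string is a made-up sample that only imitates the shape of the feed (the attribute names id, unit and value are assumptions for the illustration), and TemperatureDemo and ReadValues are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

public static class TemperatureDemo
{
    // Made-up sample, not real data from the service.
    public const string Sample =
        "<weatherdata><product>" +
        "<time><location><temperature id='TTT' unit='celsius' value='12.3'/></location></time>" +
        "<time><location><temperature id='TTT' unit='celsius' value='11.8'/></location></time>" +
        "</product></weatherdata>";

    // Collects the value attribute of every temperature element.
    public static List<string> ReadValues(string xmlText)
    {
        var xml = new XmlDocument();
        xml.LoadXml(xmlText);
        var values = new List<string>();
        foreach (XmlNode node in xml.SelectNodes("//weatherdata/product/time/location/temperature"))
        {
            XmlAttribute value = node.Attributes["value"]; // lookup by name, not by position
            if (value != null) values.Add(value.Value);
        }
        return values;
    }

    public static void Main()
    {
        foreach (string v in ReadValues(Sample))
        {
            Console.WriteLine(v); // prints 12.3, then 11.8
        }

        var xml = new XmlDocument();
        xml.LoadXml(Sample);
        XmlNode first = xml.SelectSingleNode("//temperature");
        Console.WriteLine(first.Attributes[0].Value); // prints TTT: index 0 is the id here
    }
}
```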
Are you attempting to loop through each attribute?
foreach (XmlNode node in nodes)
{
//You could grab just the value like below
Console.WriteLine(node.Attributes["value"].Value);
//or loop through each attribute
foreach (XmlAttribute f in node.Attributes)
{
Console.WriteLine(f.Value);
}
}

How to get all XML nodes with the same name without knowing their level?

I have a XML Example:
<Fruits>
<Red_fruits>
<Red_fruits></Red_fruits>
</Red_fruits>
<Yellow_fruits>
<banana></banana>
</Yellow_fruits>
<Red_fruits>
<Red_fruits></Red_fruits>
</Red_fruits>
</Fruits>
I have 4 Red_fruits tags, 2 of them shares the same ParentNode (Fruits), I want to get those which have the same ParentNode.
But I just want those which have the same name (Red_fruits), which means Yellow_fruits tag isn't included.
This is the way I am doing right now using C# language:
XmlDocument doc = new XmlDocument();
string selectedTag = cmbX.Text;
if (File.Exists(txtFile.Text))
{
try
{
//Load
doc.Load(cmbFile.Text);
//Select Nodes
XmlNodeList selectedNodeList = doc.SelectNodes(".//" + selectedTag);
}
catch
{
MessageBox.Show("Some error message here");
}
}
This is returning me all red_fruits, not just the ones that belongs to Fruits.
I can't make XmlNodeList = doc.SelectNodes("/Fruits/Red_fruits") because I want to use this code to read random XML files, so I don't know the exact name that specific node will have, I just need to put all nodes with the same name and same level into a XmlNodeList using C# Language.
Is there a way of achieve this without using LINQ? How to do that?
An understanding of how the single slash / and the double slash // are used can help here.
Let's see how / and // work in relation to the root node. When / is used at the beginning of a path:
/a
it will define an absolute path to node a relative to the root. As such, in this case, it will only find a nodes at the root of the XML tree.
When // is used at the beginning of a path:
//a
it will define a path to node a anywhere within the XML document. As such, in this case, it will find a nodes located at any depth within the XML tree.
These XPath expressions can also be used in the middle of an XPath value to define ancestor-descendant relationships. When / is used in the middle of a path:
/a/b
it will define a path to node b that is an immediate direct descendant (ie. a child) of node a.
When // is used in the middle of a path:
/a//b
it will define a path to node b that is ANY descendant of node a.
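The four cases can be tried against the XML from the question. This is a minimal sketch using System.Xml; the class name SlashDemo and its Count helper are made up for the example.

```csharp
using System;
using System.Xml;

public static class SlashDemo
{
    // The XML from the question, written on fewer lines.
    public const string Sample =
        "<Fruits>" +
        "<Red_fruits><Red_fruits/></Red_fruits>" +
        "<Yellow_fruits><banana/></Yellow_fruits>" +
        "<Red_fruits><Red_fruits/></Red_fruits>" +
        "</Fruits>";

    // Returns how many nodes the XPath expression selects in the sample.
    public static int Count(string xpath)
    {
        var doc = new XmlDocument();
        doc.LoadXml(Sample);
        return doc.SelectNodes(xpath).Count;
    }

    public static void Main()
    {
        Console.WriteLine(Count("/Red_fruits"));         // 0: the root element is Fruits
        Console.WriteLine(Count("//Red_fruits"));        // 4: any depth
        Console.WriteLine(Count("/Fruits/Red_fruits"));  // 2: children of the root only
        Console.WriteLine(Count("/Fruits//Red_fruits")); // 4: any descendant of Fruits
    }
}
```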
Coming back to your question:
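// GetElementsByTagName() returns all elements named Red_fruits, at any depth
XmlDocument doc = new XmlDocument();
doc.Load(cmbFile.Text); // load the file first, as in the question
XmlNodeList allNodes = doc.GetElementsByTagName("Red_fruits");
// Using the SelectNodes() method with an XPath expression
XmlNodeList childNodes = doc.SelectNodes("//Fruits/Red_fruits");
// This selects only the Red_fruits elements that are direct children of a <Fruits> element.
If <Fruits> is the root element, use the XPath /Fruits/Red_fruits (with a single leading slash /). Note that XML names are case-sensitive, so the tag must be spelled Red_fruits exactly as in the file.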
If you're simply trying to find the "next" or "previous" iteration of a single node, you can do the following and then compare it to the name
XmlNode current = doc.SelectSingleNode("Fruits").SelectSingleNode("Red_fruits");
XmlNode previous = current.PreviousSibling;
XmlNode next = current.NextSibling;
and you can iterate until you find the proper sibling
while(next != null && next.Name != current.Name)
{
next = next.NextSibling;
}
or you can even get your list through the ParentNode property
XmlNodeList list = current.ParentNode.SelectNodes(current.Name);
Worst case scenario, you can cycle through the XMLNode items in selectedNodeList and check the ParentNode properties. If necessary you could go recursive on the ParentNode check and count the number of times it takes to get to the root node. This would give you the depth of a node. Or you could compare the ParentNode at each level to see if it is the parent you are interested in, if that parent is not the root.
public void Test(){
XmlDocument doc = new XmlDocument();
string selectedTag = cmbX.Text;
if (File.Exists(txtFile.Text))
{
try
{
//Load
doc.Load(cmbFile.Text);
//Select Nodes
XmlNodeList selectedNodeList = doc.SelectNodes(".//" + selectedTag);
List<XmlNode> result = new List<XmlNode>();
foreach(XmlNode node in selectedNodeList){
if(depth(node) == 2){
result.Add(node);
}
}
// result now has all the selected tags of depth 2
}
catch
{
MessageBox.Show("Some error message here");
}
}
}
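private int depth(XmlNode node) {
// Counts the ancestors of the node, including the XmlDocument itself,
// so children of the root element have depth 2.
int depth = 0;
XmlNode parent = node.ParentNode;
while(parent != null){
parent = parent.ParentNode; // step up from the current parent, not from the original node
depth++;
}
return depth;
}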
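The ParentNode check can also be used to group matches by parent, which avoids hard-coding either a depth or the parent's name. The following is one possible sketch without LINQ; SiblingGroups and GroupByParent are hypothetical names, and the inline XML is the sample from the question.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

public static class SiblingGroups
{
    // Groups every element called tag by its parent node, so each list
    // holds same-named nodes that share one parent (and therefore one level).
    public static Dictionary<XmlNode, List<XmlNode>> GroupByParent(XmlDocument doc, string tag)
    {
        var groups = new Dictionary<XmlNode, List<XmlNode>>();
        foreach (XmlNode node in doc.SelectNodes("//" + tag))
        {
            List<XmlNode> siblings;
            if (!groups.TryGetValue(node.ParentNode, out siblings))
            {
                siblings = new List<XmlNode>();
                groups[node.ParentNode] = siblings;
            }
            siblings.Add(node);
        }
        return groups;
    }

    public static void Main()
    {
        var doc = new XmlDocument();
        doc.LoadXml(
            "<Fruits>" +
            "<Red_fruits><Red_fruits/></Red_fruits>" +
            "<Yellow_fruits><banana/></Yellow_fruits>" +
            "<Red_fruits><Red_fruits/></Red_fruits>" +
            "</Fruits>");

        foreach (KeyValuePair<XmlNode, List<XmlNode>> group in GroupByParent(doc, "Red_fruits"))
        {
            // Only report parents that hold more than one matching child.
            if (group.Value.Count > 1)
            {
                Console.WriteLine(group.Key.Name + ": " + group.Value.Count); // prints "Fruits: 2"
            }
        }
    }
}
```

XmlNode does not override equality, so the dictionary keys compare by reference, which is what is wanted here: two different parents with the same name stay in separate groups.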

HtmlAgilityPack set node InnerText

I want to replace inner text of HTML tags with another text.
I am using HtmlAgilityPack
I use this code to extract all texts
HtmlDocument doc = new HtmlDocument();
doc.Load("some path");
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//text()[normalize-space(.) != '']")) {
// How to replace node.InnerText with some text ?
}
But InnerText is readonly. How can I replace texts with another text and save them to file ?
Try the code below. Compared with your XPath expression, it only looks inside <body>, selects text nodes that have no children, and filters out the text content of <script> tags. You may need to add some additional filtering.
var nodes = doc.DocumentNode.SelectNodes("//body//text()[(normalize-space(.) != '') and not(parent::script) and not(*)]");
foreach (HtmlNode htmlNode in nodes)
{
htmlNode.ParentNode.ReplaceChild(HtmlTextNode.CreateNode(htmlNode.InnerText + "_translated"), htmlNode);
}
Strange, but I found that InnerHtml isn't readonly. And when I tried to set it like that
aElement.InnerHtml = "sometext";
the value of InnerText also changed to "sometext"
The HtmlTextNode class has a Text property* which works perfectly for this purpose.
Here's an example:
var textNodes = doc.DocumentNode.SelectNodes("//body/text()").Cast<HtmlTextNode>();
foreach (var node in textNodes)
{
node.Text = node.Text.Replace("foo", "bar");
}
And if we have an HtmlNode that we want to change its direct text, we can do something like the following:
HtmlNode node = //...
var textNode = (HtmlTextNode)node.SelectSingleNode("text()");
textNode.Text = "new text";
Or we can use node.SelectNodes("text()") in case it has more than one.
* Not to be confused with the readonly InnerText property.
