HTMLAgility:Replace two nodes with a new element - c#

I am looping through a nodes collection. I have to replace the current node and sibling of the current node with a new element.
I have written the below code for doing that:
private void modifyNodes(IEnumerable<HtmlNode> selectedNodes)
{
foreach (var node in selectedNodes)
{
node.NextSibling.SetAttributeValue("style", "font-weight:bold;padding:2px 2px;");
node.SetAttributeValue("style", "float:right;");
var parentNode = node.ParentNode;
var doc = new HtmlDocument();
var newElement = doc.CreateElement("table");
newElement.SetAttributeValue("style", "background-color:#e4ecf8;width:100%");
var sectionRow = doc.CreateElement("tr");
var headerColumn = doc.CreateElement("td");
headerColumn.AppendChild(node.NextSibling);
var weightColumn = doc.CreateElement("td");
weightColumn.AppendChild(node);
sectionRow.AppendChild(headerColumn);
sectionRow.AppendChild(weightColumn);
newElement.AppendChild(sectionRow);
element.ParentNode.RemoveChild(node);
parentNode.ReplaceChild(newElement, node.NextSibling);
}
}
This is adding the new element and removing the passed node. But it's failing to remove the next sibling of the node. What am I doing wrong here.
Please help.

You're explicitly replaced node.NextSibling with the newElement, as you said that the new element was added. The problem may be in the type of the next sibling. Most probably, this is a text node (very often those \r\n which divide the HTML nodes).
So it seems, that your new node just replaced the text node, and the result is a bit unexpected. So if this is a really an issue, you could do a workaround like this:
// next sibling
var next = node.NextSibling;
// get the first non-text node
while (next != null && next is HtmlTextNode)
next = next.NextSibling;
var newNode = doc.CreateElement(...);
// replace the current node with the new one
current.ParentNode.ReplaceChild(newNode, current);
// remove the next node if it was found
if (next != null)
next.Remove();

Related

C# XmlDocument select nodes returns empty

i am trying to work with http://api.met.no/weatherapi/locationforecast/1.9/?lat=49.8197202;lon=18.1673554 XML.
Lets say i want to select all value attribute of each temperature element.
i tried this.
const string url = "http://api.met.no/weatherapi/locationforecast/1.9/?lat=49.8197202;lon=18.1673554";
WebClient client = new WebClient();
string x = client.DownloadString(url);
XmlDocument xml = new XmlDocument();
xml.LoadXml(x);
XmlNodeList nodes = xml.SelectNodes("/weatherdata/product/time/location/temperature");
//XmlNodeList nodes = xml.SelectNodes("temperature");
foreach (XmlNode node in nodes)
{
Console.WriteLine(node.Attributes[0].Value);
}
But i get nothing all the time. What am i doing wrong?
The current single slash is targeting weatherdata under the root but the root is weatherdata.
Add a preceding slash to your xpath query to make it a double slash:
XmlNodeList nodes = xml.SelectNodes("//weatherdata/product/time/location/temperature");
Double slashes tells xpath to select nodes in the document from the current node that match the selection no matter where they are.
or remove the preceding slash:
XmlNodeList nodes = xml.SelectNodes("weatherdata/product/time/location/temperature");
which looks for the whole path including the root.
Also, since you apparently want the value called value add this:
Console.WriteLine(node.Attributes["value"].Value);
Since the value at of node.Attributes[0].Value may not be in the order you expect.
Are you attempting to loop through each attribute?
foreach (XmlNode node in nodes)
{
//You could grab just the value like below
Console.WriteLine(node.Attributes["value"].Value);
//or loop through each attribute
foreach (XmlAttribute f in node.Attributes)
{
Console.WriteLine(f.Value);
}
}

Gettig Htmlelement based on HtmlAgilityPack.HtmlNode

I use HtmlAgilityPack to parse the html document of a webbrowser control.
I am able to find my desired HtmlNode, but after getting the HtmlNode, I want to retun the corresponding HtmlElement in the WebbrowserControl.Document.
In fact HtmlAgilityPack parse an offline copy of the live document, while I want to access live elements of the webbrowser Control to access some rendered attributes like currentStyle or runtimeStyle
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(webBrowser1.Document.Body.InnerHtml);
var some_nodes = doc.DocumentNode.SelectNodes("//p");
// this selection could be more sophisticated
// and the answer shouldn't relay on it.
foreach (HtmlNode node in some_nodes)
{
HtmlElement live_element = CorrespondingElementFromWebBrowserControl(node);
// CorrespondingElementFromWebBrowserControl is what I am searching for
}
If the element had a specific attribute it could be easy but I want a solution which works on any element.
Please help me what can I do about it.
In fact there seems to be no direct possibility to change the document directly in the webbroser control.
But you can extract the html from it, mnipulate it and write it back again like this:
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(webBrowser1.DocumentText);
foreach (HtmlAgilityPack.HtmlNode node in doc.DocumentNode.ChildNodes) {
node.Attributes.Add("TEST", "TEST");
}
StringBuilder sb = new StringBuilder();
using (StringWriter sw = new StringWriter(sb)) {
doc.Save(sw);
webBrowser1.DocumentText = sb.ToString();
}
For direct manipulation you can maybe use the unmanaged pointer webBrowser1.Document.DomDocument to the document, but this is outside of my knowledge.
HtmlAgilityPack definitely can't provide access to nodes in live HTML directly. Since you said there is no distinct style/class/id on the element you have to walk through the nodes manually and find matches.
Assuming HTML is reasonably valid (so both browser and HtmlAgilityPack perform normalization similarly) you can walk pairs of elements starting from the root of both trees and selecting the same child node.
Basically you can build "position-based" XPath to node in one tree and select it in another tree. Xpath would look something like (depending you want to pay attention to just positions or position and node name):
"/*[1]/*[4]/*[2]/*[7]"
"/body/div[2]/span[1]/p[3]"
Steps:
In using HtmlNode you've found collect all parent nodes up to the root.
Get root of element of HTML in browser
for each level of children find position of corresponding child on HtmlNodes collection on step 1 in its parent and than find live HtmlElement among children of current live node.
Move to newly found child and go back to 3 till found node you are looking for.
the XPath attribute of the HtmlAgilityPack.HtmlNode shows the nodes on the path from root to the node. For example \div[1]\div[2]\table[0]. You can traverse this path in the live document to find the corresponding live element. However this path may not be precise as HtmlAgilityPack removes some tags like <form> then before using this solution add the omitted tags back using
HtmlNode.ElementsFlags.Remove("form");
struct DocNode
{
public string Name;
public int Pos;
}
///// structure to hold the name and position of each node in the path
The following method finds the live element according to the XPath
static public HtmlElement GetLiveElement(HtmlNode node, HtmlDocument doc)
{
var pattern = #"/(.*?)\[(.*?)\]"; // like div[1]
// Parse the XPath to extract the nodes on the path
var matches = Regex.Matches(node.XPath, pattern);
List<DocNode> PathToNode = new List<DocNode>();
foreach (Match m in matches) // Make a path of nodes
{
DocNode n = new DocNode();
n.Name = n.Name = m.Groups[1].Value;
n.Pos = Convert.ToInt32(m.Groups[2].Value)-1;
PathToNode.Add(n); // add the node to path
}
HtmlElement elem = null; //Traverse to the element using the path
if (PathToNode.Count > 0)
{
elem = doc.Body; //begin from the body
foreach (DocNode n in PathToNode)
{
//Find the corresponding child by its name and position
elem = GetChild(elem, n);
}
}
return elem;
}
the code for GetChild Method used above
public static HtmlElement GetChild(HtmlElement el, DocNode node)
{
// Find corresponding child of the elemnt
// based on the name and position of the node
int childPos = 0;
foreach (HtmlElement child in el.Children)
{
if (child.TagName.Equals(node.Name,
StringComparison.OrdinalIgnoreCase))
{
if (childPos == node.Pos)
{
return child;
}
childPos++;
}
}
return null;
}

How to get all XML nodes with the same name without knowing their level?

I have a XML Example:
<Fruits>
<Red_fruits>
<Red_fruits></Red_fruits>
</Red_fruits>
<Yellow_fruits>
<banana></banana>
</Yellow_fruits>
<Red_fruits>
<Red_fruits></Red_fruits>
</Red_fruits>
</Fruits>
I have 4 Red_fruits tags, 2 of them shares the same ParentNode (Fruits), I want to get those which have the same ParentNode.
But I just want those which have the same name (Red_fruits), which means Yellow_fruits tag isn't included.
This is the way I am doing right now using C# language:
XmlDocument doc = new XmlDocument();
string selectedTag = cmbX.text;
if (File.Exists(txtFile.text))
{
try
{
//Load
doc.Load(cmbFile.text);
//Select Nodes
XmlNodeList selectedNodeList = doc.SelectNodes(".//" + selectedTag);
}
Catch
{
MessageBox.show("Some error message here");
}
}
This is returning me all red_fruits, not just the ones that belongs to Fruits.
I can't make XmlNodeList = doc.SelectNodes("/Fruits/Red_fruits") because I want to use this code to read random XML files, so I don't know the exact name that specific node will have, I just need to put all nodes with the same name and same level into a XmlNodeList using C# Language.
Is there a way of achieve this without using LINQ? How to do that?
An understanding on the usage of Single Slash / and Double slash // can help here.
Let's see how / and // work in relation to the root node. When / is used at the beginning of a path:
/a
it will define an absolute path to node a relative to the root. As such, in this case, it will only find a nodes at the root of the XML tree.
When // is used at the beginning of a path:
//a
it will define a path to node a anywhere within the XML document. As such, in this case, it will find a nodes located at any depth within the XML tree.
These XPath expressions can also be used in the middle of an XPath value to define ancestor-descendant relationships. When / is used in the middle of a path:
/a/b
it will define a path to node b that is an immediate direct descendant (ie. a child) of node a.
When // used in the middle of a path:
/a//b
it will define a path to node b that is ANY descendant of node a.
Coming back to your question:
// using GetElementsByTagName() return all the Elements having name: Red_Fruits
XmlDocument doc = new XmlDocument();
XmlNodeList nodes= doc.GetElementsByTagName("Red_Fruits");
//Using SelectNodes() method
XmlNodelist nodes = doc.SelectNodes("//Fruits/Red_Fruits");
// This will select all elements that are children of the <Fruits> element.
In case <Fruits> is the root element use the Xpath: /Fruits/Red_Fruits. [ a single slash /]
If you're simply trying to find the "next" or "previous" iteration of a single node, you can do the following and then compare it to the name
XmlNode current = doc.SelectSingleNode("Fruits").SelectSingleNode("Red_fruits");
XmlNode previous = current.NextSibling;
XmlNode next = current.NextSibling;
and you can iterate until you find the proper sibling
while(next.Name != current.Name)
{
next = next.NextSibling;
}
or you can even get your list by invoking the 'Parent' property
XmlNodeList list = current.ParentNode.SelectNodes(current.Name);
Worst case scenario, you can cycle through the XMLNode items in selectedNodeList and check the ParentNode properties. If necessary you could go recursive on the ParentNode check and count the number of times it takes to get to the root node. This would give you the depth of a node. Or you could compare the ParentNode at each level to see if it is the parent you are interested in, if that parent is not the root.
public void Test(){
XmlDocument doc = new XmlDocument();
string selectedTag = cmbX.text;
if (File.Exists(txtFile.text))
{
try
{
//Load
doc.Load(cmbFile.text);
//Select Nodes
XmlNodeList selectedNodeList = doc.SelectNodes(".//" + selectedTag);
List<XmlNode> result = new List<XmlNode>();
foreach(XmlNode node in selectedNodeList){
if(depth(node) == 2){
result.Add(node);
}
}
// result now has all the selected tags of depth 2
}
Catch
{
MessageBox.show("Some error message here");
}
}
}
private int depth(XmlNode node) {
int depth = 0;
XmlNode parent = node.ParentNode;
while(parent != null){
parent = node.ParentNode;
depth++;
}
return depth;
}

What is the difference between Node.SelectNodes(/*) and Node.childNodes?

string XML1 = "<Root><InsertHere></InsertHere></Root>";
string XML2 = "<Root><child1><childnodes>data</childnodes><childnodes>data1</childnodes></child1><child2><childnodes>data</childnodes><childnodes>data1</childnodes></child2></Root>";
Among below mentioned two code samples.. usage of childNodes doesn't copy all the child nodes from XML2. only <child1> is being copied.
string strXpath = "/Root/InsertHere";
XmlDocument xdxmlChildDoc = new XmlDocument();
XmlDocument ParentDoc = new XmlDocument();
ParentDoc.LoadXml(XML1);
xdxmlChildDoc.LoadXml(XML2);
XmlNode xnNewNode = ParentDoc.ImportNode(xdxmlChildDoc.DocumentElement.SelectSingleNode("/Root"), true);
if (xnNewNode != null)
{
XmlNodeList xnChildNodes = xnNewNode.SelectNodes("/*");
if (xnChildNodes != null)
{
foreach (XmlNode xnNode in xnChildNodes)
{
if (xnNode != null)
{
ParentDoc.DocumentElement.SelectSingleNode(strXpath).AppendChild(xnNode);
}
}
}
}
code2:
if (xnNewNode != null)
{
XmlNodeList xnChildNodes = xnNewNode.ChildNodes;
if (xnChildNodes != null)
{
foreach (XmlNode xnNode in xnChildNodes)
{
if (xnNode != null)
{
ParentDoc.DocumentElement.SelectSingleNode(strXpath).AppendChild(xnNode);
}
}
}
}
ParentDoc.OuterXML after executing first sample of code:
<Root>
<InsertHere>
<child1>
<childnodes>data</childnodes>
<childnodes>data1</childnodes>
</child1>
<child2>
<childnodes>data</childnodes>
<childnodes>data1</childnodes>
</child2>
</InsertHere>
</Root>
ParentDoc.OuterXML after executing second sample of Code
<Root>
<InsertHere>
<child1>
<childnodes>data</childnodes>
<childnodes>data1</childnodes>
</child1>
</InsertHere>
</Root>
I have done some debugging of the code, and it shows that xnNewNode.ChildNodes initially also returns 2 child nodes. After one iteration in the loop, the first child is however removed from ChildNodes, and therefore the loop ends prematurely.
If you want to use the ChildNodes property, one workaround is to "transfer" the child node references to an array or list, like this:
var xnChildNodes = xnNewNode.ChildNodes.Cast<XmlNode>().ToArray();
UPDATE
As Tomer W pointed out in his answer, when using XmlNode.AppendChild the inserted node is also removed from its original location. As stated in the MSDN documentation:
If the newChild is already in the tree, it is removed from
its original position and added to its target position.
With SelectNodes you have already created a new node collection, but with ChildNodes you are accessing the original collection.
this is a clearing of what Anders G posted, with more through explanation.
I am surprised the foreach does not fail (Throw Exception) in this situation, but hell.
In code1.
1. Create a NEW COLLECTION of nodes
2. Select nodes to it
3. append to other node => removing from original collection, but not the newly created one.
4 you are removing the node you are adding from the newly collection.
in Code2
1. Reference the ORIGINAL node collection
{child1, child2}
2. append 1st Node away to another collection => removing it from the original collection
{child2}
3. now when the foreach at index 1, it see that it passed the end of the collection. and exit.
this happens a lot when changing a collection that is subject to iteration.
but most the time, the IEnumerator is throwing an Exception when such happens.
hope i made it all clear
I had the same problem and observed, that whitespace nodes seem to have a value attached to the node, which is not the case with other nodes (at least in my application).This method removes the whitespace nodes from the node.ChildNodes list:
private List<XmlNode> findChildnodes(XmlNode node)
{
List<XmlNode> result = new List<XmlNode>();
foreach (XmlNode childnode in node.ChildNodes)
{
if(childnode.Value == null)
{
result.Add(childnode);
}
}
return result;
}
In answer to your question, Node.childNodes is All of the child nodes, whereas Node.SelectNodes(/*) is all of the child nodes that match /*. Only XML elements will match /*, so any attributes, CDATA nodes, text nodes, etc will be excluded.
Nevertheless, the problem occurs because you are changing the collection of nodes while while iterating over them. You cannot do that. The select nodes method returns a list of references to nodes. This is why is works.

HtmlAgilityPack replace node

I want to replace a node with a new node. How can I get the exact position of the node and do a complete replace?
I've tried the following, but I can't figured out how to get the index of the node or which parent node to call ReplaceChild() on.
string html = "<b>bold_one</b><strong>strong</strong><b>bold_two</b>";
HtmlDocument document = new HtmlDocument();
document.LoadHtml(html);
var bolds = document.DocumentNode.Descendants().Where(item => item.Name == "b");
foreach (var item in bolds)
{
string newNodeHtml = GenerateNewNodeHtml();
HtmlNode newNode = new HtmlNode(HtmlNodeType.Text, document, ?);
item.ParentNode.ReplaceChild( )
}
To create a new node, use the HtmlNode.CreateNode() factory method, do not use the constructor directly.
This code should work out for you:
var htmlStr = "<b>bold_one</b><strong>strong</strong><b>bold_two</b>";
var doc = new HtmlDocument();
doc.LoadHtml(htmlStr);
var query = doc.DocumentNode.Descendants("b");
foreach (var item in query.ToList())
{
var newNodeStr = "<foo>bar</foo>";
var newNode = HtmlNode.CreateNode(newNodeStr);
item.ParentNode.ReplaceChild(newNode, item);
}
Note that we need to call ToList() on the query, we will be modifying the document so it would fail if we don't.
If you wish to replace with this string:
"some text <b>node</b> <strong>another node</strong>"
The problem is that it is no longer a single node but a series of nodes. You can parse it fine using HtmlNode.CreateNode() but in the end, you're only referencing the first node of the sequence. You would need to replace using the parent node.
var htmlStr = "<b>bold_one</b><strong>strong</strong><b>bold_two</b>";
var doc = new HtmlDocument();
doc.LoadHtml(htmlStr);
var query = doc.DocumentNode.Descendants("b");
foreach (var item in query.ToList())
{
var newNodesStr = "some text <b>node</b> <strong>another node</strong>";
var newHeadNode = HtmlNode.CreateNode(newNodesStr);
item.ParentNode.ReplaceChild(newHeadNode.ParentNode, item);
}
Have Implemented the following solution to achieve the same.
var htmlStr = "<b>bold_one</b><div class='LatestLayout'><div class='olddiv'><strong>strong</strong></div></div><b>bold_two</b>";
var htmlDoc = new HtmlDocument();
HtmlDocument document = new HtmlDocument();
document.Load(htmlStr);
htmlDoc.DocumentNode.SelectSingleNode("//div[#class='olddiv']").Remove();
htmlDoc.DocumentNode.SelectSingleNode("//div[#class='LatestLayout']").PrependChild(newChild)
htmlDoc.Save(FilePath); // FilePath .html file with full path if need to save file.
so selecting an object and removing respective HTML object
and appending it as chile. of respective object.

Categories