Remove node of single child parent in html agility pack


I'm using Html Agility Pack (1.4.9.5) to remove nodes that have a specified class:
var document = new HtmlDocument();
document.LoadHtml("<p><div class=\"remove-it\"></div></p>");
var nodesToRemove = document.QuerySelectorAll(".remove-it");
if (nodesToRemove != null)
{
    foreach (var node in nodesToRemove)
    {
        node.Remove();
    }
}
var res = document.DocumentNode.OuterHtml;
The problem is that at the end res is equal to:
<p>
but it should be:
<p></p>
How can I fix this?

Almost there! You are missing this line:
HtmlNode.ElementsFlags["p"] = HtmlElementFlag.Closed;
It has to run before document.LoadHtml("<p><div class=\"remove-it\"></div></p>");.
With that flag set, the p element is automatically closed when the document is parsed.
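Put together, the fix might look like the sketch below. It is only an illustration: the class and method names are made up for this example, and it uses an XPath class test in place of QuerySelectorAll so that it depends on nothing but Html Agility Pack itself. Note that ElementsFlags is a static dictionary, so the change affects every document parsed afterwards in the same process.

```csharp
using System.Linq;
using HtmlAgilityPack;

public static class RemoveNodeExample
{
    // Hypothetical helper: removes every element that carries the given CSS
    // class and returns the markup of what is left.
    public static string RemoveByClass(string html, string cssClass)
    {
        // Must be set before the document is parsed (static, process-wide setting).
        HtmlNode.ElementsFlags["p"] = HtmlElementFlag.Closed;

        var document = new HtmlDocument();
        document.LoadHtml(html);

        // XPath stand-in for the ".remove-it" style CSS selector.
        var xpath = "//*[contains(concat(' ', normalize-space(@class), ' '), ' "
                    + cssClass + " ')]";

        // SelectNodes returns null when nothing matches.
        var nodesToRemove = document.DocumentNode.SelectNodes(xpath);
        if (nodesToRemove != null)
        {
            // Iterate over a copy so removal cannot disturb the enumeration.
            foreach (var node in nodesToRemove.ToList())
            {
                node.Remove();
            }
        }

        return document.DocumentNode.OuterHtml;
    }
}
```

Calling RemoveNodeExample.RemoveByClass("<p><div class=\"remove-it\"></div></p>", "remove-it") returns the remaining markup as a string.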

Related

HTML Agility Issue

I have some HTML that I would like to parse.
I have written the code below:
HtmlAgilityPack.HtmlWeb web5 = new HtmlAgilityPack.HtmlWeb();
HtmlAgilityPack.HtmlDocument doc5 = web5.Load("http://www.analytics4.co.uk/pdf.js/web/viewer.html?file=http://www.analytics4.co.uk/pdf.js/pdf/w15639.pdf");
//var divs5 = doc5.DocumentNode.SelectNodes("//div[id='viewerContainer']").SelectMany(x => x.Descendants("div"));
// HtmlAgilityPack.HtmlDocument doc5 = web5.Load("http://google.co.uk");
HtmlNodeCollection tl = doc5.DocumentNode.SelectNodes("//div[@id='viewerContainer']//div[@id='viewer']//");
foreach (HtmlAgilityPack.HtmlNode node in tl)
{
    Console.WriteLine(node.InnerHtml);
    Console.WriteLine(node.OuterHtml);
}
The result I get for Inner HTML is just
<div id="viewer" class="pdfViewer"></div>
and that doesn't make sense to me. Could anyone explain how I can go deeper into the inner divs and so on? Please, I need your help.

To go deeper, you can use these techniques:
foreach (var node in tl)
{
    var a = node.ChildNodes[2]; // a is the third child of node
    var b = node.SelectSingleNode("./div[3]"); // b is the third "div"
    // element among node's children. The "./" in XPath means "from the current node"
}
Good luck!

Wrap in an element with HtmlAgilityPack?

I have an HtmlDocument that may or may not have a proper <head> and <body> section, or might just be an HTML fragment. Either way, I want to run it through a function that will ensure that it has a (more) proper HTML structure.
I know that I can check if it has a body by seeing if
doc.DocumentNode.SelectSingleNode("//body");
is null. If it does not have a body, how would I wrap the contents of doc.DocumentNode in a <body> element and assign it back to the HtmlDocument?
Edit: There seems to be some confusion about what I want to do. In jquery terms:
$doc = $(document);
if( !$doc.has('body') ) {
    $doc.wrapInner('body');
}
Basically, if there is no body element, put a body element around everything.

You could do something like this:
HtmlDocument doc = new HtmlDocument();
doc.Load(MyTestHtm);
HtmlNode body = doc.DocumentNode.SelectSingleNode("//body");
if (body == null)
{
    HtmlNode html = doc.DocumentNode.SelectSingleNode("//html");
    // we presume html exists
    body = CloneAsParentNode(html.ChildNodes, "body");
}
static HtmlNode CloneAsParentNode(HtmlNodeCollection nodes, string name)
{
    // snapshot of the original nodes, so they can be enumerated while the tree changes
    List<HtmlNode> clones = new List<HtmlNode>(nodes);
    HtmlNode parent = nodes[0].ParentNode;
    // create a new parent with the given name
    HtmlNode newParent = nodes[0].OwnerDocument.CreateElement(name);
    // insert before the first node in the selection
    parent.InsertBefore(newParent, nodes[0]);
    // clone all sub nodes
    foreach (HtmlNode node in clones)
    {
        HtmlNode clone = node.CloneNode(true);
        newParent.AppendChild(clone);
    }
    // remove all sub nodes
    foreach (HtmlNode node in clones)
    {
        parent.RemoveChild(node);
    }
    return newParent;
}

XPath Select Siblings where sibling parent is root document

I need to select the siblings at the root of the document, and only the first level.
The HTML I'm selecting from can be anything...
e.g.:
<p><img...></p>
<p><img...></p>
<ul><li><a>somelink</a></li></ul>
<a>...
What I tried was /*[following-sibling::*], but that only selects the two <p><img...></p>...
What I want is to select the first level, as in p p ul a.
I'm using Html Agility Pack and my code looks like this:
var nodeCollection = new List<HtmlNode>();
var document = new HtmlDocument();
document.LoadHtml(html);
if (document.DocumentNode != null)
{
    foreach (var node in document.DocumentNode.SelectNodes("/*[following-sibling::*]"))
    {
        nodeCollection.Add(node);
    }
}
Does anyone know what I'm doing wrong?

I think I found a working XPath, and it was not so hard xD
The following worked: /child::*
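For comparison, the same first-level selection can probably be done without XPath by filtering the root's ChildNodes. The sketch below is an assumption-laden illustration, not part of the original answer: the class and method names are invented, and it filters on HtmlNodeType.Element so that whitespace text nodes between the tags are skipped.

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class TopLevelNodes
{
    // Hypothetical helper: returns the element nodes that sit directly under
    // the document root, skipping text nodes (such as whitespace between
    // tags) and comments.
    public static List<HtmlNode> GetTopLevelElements(string html)
    {
        var document = new HtmlDocument();
        document.LoadHtml(html);

        return document.DocumentNode.ChildNodes
            .Where(node => node.NodeType == HtmlNodeType.Element)
            .ToList();
    }
}
```

Unlike SelectNodes, this returns an empty list rather than null when the fragment has no top-level elements, so the caller can loop over the result without a null check.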

Get all attribute values of given tag with Html Agility Pack

I want to get all values of the 'id' attribute of 'span' tags with Html Agility Pack.
But instead of the attributes I get the tags themselves. Here's the code:
private static IEnumerable<string> GetAllID()
{
    HtmlDocument sourceDocument = new HtmlDocument();
    sourceDocument.Load(FileName);
    var nodes = sourceDocument.DocumentNode.SelectNodes(
        @"//span/@id");
    return nodes.Nodes().Select(x => x.Name);
}
I'd appreciate it if someone could tell me what's wrong here.

Try:
var nodes = sourceDocument.DocumentNode.SelectNodes("//span[@id]");
List<string> ids = new List<string>();
// SelectNodes returns null when nothing matches, so check before touching nodes
if (nodes != null)
{
    foreach (var node in nodes)
    {
        if (node.Id != null)
            ids.Add(node.Id);
    }
}
return ids;
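The loop above could also be written with LINQ. The following is a minimal sketch under a few assumptions: the class and method names are invented, the HTML is passed in as a string instead of being loaded from FileName, and Descendants is used in place of the XPath query so that no null check is needed.

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class SpanIds
{
    // Hypothetical helper: collects the id attribute of every span that has one.
    public static List<string> GetAllIds(string html)
    {
        var document = new HtmlDocument();
        document.LoadHtml(html);

        return document.DocumentNode
            .Descendants("span")
            // GetAttributeValue returns the supplied default when the attribute is missing.
            .Select(span => span.GetAttributeValue("id", string.Empty))
            .Where(id => id.Length > 0)
            .ToList();
    }
}
```

For input such as "<div><span id=\"a\"></span><span></span><span id=\"b\"></span></div>", the span without an id is skipped and the list contains "a" and "b".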

Selecting attribute values with html Agility Pack

I'm trying to retrieve a specific image from an HTML document, using Html Agility Pack and this XPath:
//div[@id='topslot']/a/img/@src
As far as I can see, it finds the src attribute, but it returns the img tag. Why is that?
I would expect InnerHtml/InnerText or something similar to be set, but both are empty strings. OuterHtml is set to the complete img tag.
Is there any documentation for Html Agility Pack?

You can grab the attribute directly if you use the HtmlNodeNavigator instead.
//Load document from some html string
HtmlDocument hdoc = new HtmlDocument();
hdoc.LoadHtml(htmlContent);
//Load navigator for current document
HtmlNodeNavigator navigator = (HtmlNodeNavigator)hdoc.CreateNavigator();
//Get value from given xpath
string xpath = "//div[@id='topslot']/a/img/@src";
string val = navigator.SelectSingleNode(xpath).Value;

Html Agility Pack does not support attribute selection.

You may use the method "GetAttributeValue".
Example:
//[...] code before needs to load a html document
HtmlAgilityPack.HtmlDocument htmldoc = e.Document;
//get all nodes "a" matching the XPath expression
HtmlNodeCollection AllNodes = htmldoc.DocumentNode.SelectNodes("*[@class='item']/p/a");
//show a messagebox for each node found that shows the content of attribute "href"
foreach (var MensaNode in AllNodes)
{
    string url = MensaNode.GetAttributeValue("href", "not found");
    MessageBox.Show(url);
}

Html Agility Pack will support it soon.
http://htmlagilitypack.codeplex.com/Thread/View.aspx?ThreadId=204342

Reading and Writing Attributes with Html Agility Pack
You can both read and set attributes in HtmlAgilityPack. This example selects the <html> tag, checks whether it has a 'lang' (language) attribute, and then reads and writes that attribute.
In the example below, "this.All" in doc.LoadHtml(this.All) is a string representation of an HTML document.
Read and write:
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(this.All);
string language = string.Empty;
var nodes = doc.DocumentNode.SelectNodes("//html");
for (int i = 0; i < nodes.Count; i++)
{
    if (nodes[i] != null && nodes[i].Attributes.Count > 0 && nodes[i].Attributes.Contains("lang"))
    {
        language = nodes[i].Attributes["lang"].Value; //Get attribute
        nodes[i].Attributes["lang"].Value = "en-US"; //Set attribute
    }
}
Read only:
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(this.All);
string language = string.Empty;
var nodes = doc.DocumentNode.SelectNodes("//html");
foreach (HtmlNode a in nodes)
{
    if (a != null && a.Attributes.Count > 0 && a.Attributes.Contains("lang"))
    {
        language = a.Attributes["lang"].Value;
    }
}

I used the following to obtain an attribute of an image:
var MainImageString = MainImageNode.Attributes.Where(i=> i.Name=="src").FirstOrDefault();
You can specify the attribute name to get its value; if you don't know the attribute name, set a breakpoint after you have fetched the node and inspect its attributes by hovering over it.
Hope this helps.

I just faced this problem and solved it using the GetAttributeValue method.
//Selecting all tbody elements
IList<HtmlNode> nodes = doc.QuerySelectorAll("div.characterbox-main")[1]
    .QuerySelectorAll("div table tbody");
//Iterating over them and getting the src attribute value of img elements.
var data = nodes.Select((node) =>
{
    return new
    {
        name = node.QuerySelector("tr:nth-child(2) th a").InnerText,
        imageUrl = node.QuerySelector("tr td div a img")
            .GetAttributeValue("src", "default-url")
    };
});
