Splitting HTML string into two parts with HtmlAgilityPack

I'm looking for the best way to split an HTML document over some tag in C# using HtmlAgilityPack. I want to preserve the intended markup as I'm doing the split. Here is an example.
If the document is like this:
<p>
<div>
<p>
Stuff
</p>
<p>
<ul>
<li>Bullet 1</li>
<li><a href="#">link</a></li>
<li>Bullet 3</li>
</ul>
</p>
<span>Footer</span>
</div>
</p>
Once it's split, it should look like this:
Part 1
<p>
<div>
<p>
Stuff
</p>
<p>
<ul>
<li>Bullet 1</li>
</ul>
</p>
</div>
</p>
Part 2
<p>
<div>
<p>
<ul>
<li>Bullet 3</li>
</ul>
</p>
<span>Footer</span>
</div>
</p>
What would be the best way of doing something like that?

Definitely not by regex. (Note: this was originally a tag on the question—now removed.) I'm usually not one to jump on The Pony is Coming bandwagon, but this is one case in which regular expressions would be particularly bad.
First, I would write a recursive function, call it RemoveSiblingsAfter(node), that removes all siblings following a node and then calls itself on the node's parent, so that all siblings following the parent are removed as well (and all siblings following the grandparent, and so on). You can use XPath to find the node(s) on which you want to split, e.g. doc.DocumentNode.SelectNodes("//a[@href='#']"), and call the function on that node. When done, remove the splitting node itself, and you have the first part. For the second part, repeat these steps on a copy of the original document, this time with a RemoveSiblingsBefore(node) that removes the siblings preceding a node.
In your example, RemoveSiblingsBefore would act as follows:
<a href="#"> has no siblings, so recurse on parent, <li>.
<li> has a preceding sibling—<li>Bullet 1</li>—so remove, and recurse on parent, <ul>.
<ul> has no siblings, so recurse on parent, <p>.
<p> has a preceding sibling—<p>Stuff</p>—so remove, and recurse on parent, <div>.
and so on.
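The steps above can be sketched roughly as follows. This is a minimal sketch, not a tested implementation: the class name HtmlSplitter and the Split method are made up for illustration, the loops are iterative rather than recursive, and the HTML is simply parsed twice to get the "copy" of the document. Whitespace text nodes count as siblings here, and elements left empty by the split are not cleaned up.

```csharp
using System;
using HtmlAgilityPack;

// Hypothetical helper; the names are illustrative only.
static class HtmlSplitter
{
    // Removes every sibling that follows `node`, then does the same
    // for its parent, grandparent, and so on up to the document node.
    static void RemoveSiblingsAfter(HtmlNode node)
    {
        for (var current = node; current != null; current = current.ParentNode)
        {
            while (current.NextSibling != null)
                current.ParentNode.RemoveChild(current.NextSibling);
        }
    }

    // Mirror image: removes every sibling that precedes `node` and its ancestors.
    static void RemoveSiblingsBefore(HtmlNode node)
    {
        for (var current = node; current != null; current = current.ParentNode)
        {
            while (current.PreviousSibling != null)
                current.ParentNode.RemoveChild(current.PreviousSibling);
        }
    }

    // Splits `html` at the first node matched by `xpath`.
    public static (string First, string Second) Split(string html, string xpath)
    {
        // Parse twice so each half can be trimmed independently.
        var firstDoc = new HtmlDocument();
        firstDoc.LoadHtml(html);
        var secondDoc = new HtmlDocument();
        secondDoc.LoadHtml(html);

        var firstSplit = firstDoc.DocumentNode.SelectSingleNode(xpath);
        var secondSplit = secondDoc.DocumentNode.SelectSingleNode(xpath);
        if (firstSplit == null || secondSplit == null)
            throw new ArgumentException("The XPath did not match any node.", nameof(xpath));

        RemoveSiblingsAfter(firstSplit);
        firstSplit.ParentNode.RemoveChild(firstSplit);

        RemoveSiblingsBefore(secondSplit);
        secondSplit.ParentNode.RemoveChild(secondSplit);

        return (firstDoc.DocumentNode.OuterHtml, secondDoc.DocumentNode.OuterHtml);
    }
}
```

Calling HtmlSplitter.Split(html, "//a[@href='#']") would then give the two halves as strings.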

Here is what I came up with. This does the split and removes the "empty" elements of the element where the split happens.
private static void SplitDocument()
{
    var doc = new HtmlDocument();
    doc.Load("HtmlDoc.html");
    // Split on the first anchor that has an href attribute.
    var links = doc.DocumentNode.SelectNodes("//a[@href]");
    var firstPart = GetFirstPart(doc.DocumentNode, links[0]).DocumentNode.InnerHtml;
    var secondPart = GetSecondPart(links[0]).DocumentNode.InnerHtml;
}
private static HtmlDocument GetFirstPart(HtmlNode currNode, HtmlNode link)
{
    var nodeStack = new Stack<Tuple<HtmlNode, HtmlNode>>();
    var newDoc = new HtmlDocument();
    var parent = newDoc.DocumentNode;
    nodeStack.Push(new Tuple<HtmlNode, HtmlNode>(currNode, parent));
    while (nodeStack.Count > 0)
    {
        var curr = nodeStack.Pop();
        var copyNode = curr.Item1.CloneNode(false);
        curr.Item2.AppendChild(copyNode);
        if (curr.Item1 == link)
        {
            var nodeToRemove = NodeAndEmptyAncestors(copyNode);
            nodeToRemove.ParentNode.RemoveChild(nodeToRemove);
            break;
        }
        for (var i = curr.Item1.ChildNodes.Count - 1; i >= 0; i--)
        {
            nodeStack.Push(new Tuple<HtmlNode, HtmlNode>(curr.Item1.ChildNodes[i], copyNode));
        }
    }
    return newDoc;
}
private static HtmlDocument GetSecondPart(HtmlNode link)
{
    var nodeStack = new Stack<HtmlNode>();
    var newDoc = new HtmlDocument();
    var currNode = link;
    while (currNode.ParentNode != null)
    {
        currNode = currNode.ParentNode;
        nodeStack.Push(currNode.CloneNode(false));
    }
    var parent = newDoc.DocumentNode;
    while (nodeStack.Count > 0)
    {
        var node = nodeStack.Pop();
        parent.AppendChild(node);
        parent = node;
    }
    var newLink = link.CloneNode(false);
    parent.AppendChild(newLink);
    currNode = link;
    var newParent = newLink.ParentNode;
    while (currNode.ParentNode != null)
    {
        var foundNode = false;
        foreach (var child in currNode.ParentNode.ChildNodes)
        {
            if (foundNode) newParent.AppendChild(child.Clone());
            if (child == currNode) foundNode = true;
        }
        currNode = currNode.ParentNode;
        newParent = newParent.ParentNode;
    }
    var nodeToRemove = NodeAndEmptyAncestors(newLink);
    nodeToRemove.ParentNode.RemoveChild(nodeToRemove);
    return newDoc;
}
private static HtmlNode NodeAndEmptyAncestors(HtmlNode node)
{
    var currNode = node;
    while (currNode.ParentNode != null && currNode.ParentNode.ChildNodes.Count == 1)
    {
        currNode = currNode.ParentNode;
    }
    return currNode;
}

Related

selecting href from <a> node using HtmlAgilityPack

I'm trying to learn web scraping and want to get the href value from the "a" node using HtmlAgilityPack in C#. There are multiple grid cells within the grid view, each containing an article with a smaller cell, and I want the href value of the "a" node from all of them.
<div class="Tabpanel">
    <div class="GridW">
        <div class="GridCell">
            <article>
                <div class="smallerCell">
                    <a href="..........">
                </div>
            </article>
        </div>
    </div>
    <div class="random">
    </div>
    <div class="random">
    </div>
</div>
This is what I have come up with so far, feels like I'm making it way more complicated than it has to be. Where do I go from here? Or is there an easier way to do this?
httpclient = new HttpClient();
var html = await httpclient.GetStringAsync(Url);
var htmldoc = new HtmlDocument();
htmldoc.LoadHtml(html);
var ReceptLista = new List<HtmlNode>();
ReceptLista = htmldoc.DocumentNode.Descendants("div")
    .Where(node => node.GetAttributeValue("class", "")
        .Equals("GridW")).ToList();
var finalList = new List<HtmlNode>();
finalList = ReceptLista[0].Descendants("article").ToList();
var finalList2 = new List<List<HtmlNode>>();
for (int i = 0; i < finalList.Count; i++) {
    finalList2.Add(finalList[i].DescendantNodes().Where(node => node.GetAttributeValue("class", "").Equals("RecipeTeaser-content")).ToList());
}
var finalList3 = new List<List<HtmlNode>>();
for (int i = 0; i < finalList2.Count; i++) {
    finalList3.Add(finalList2[i].Where(node => node.GetAttributeValue("class", "").Equals("RecipeTeaser-link js-searchRecipeLink")).ToList());
}

You can probably make things a lot simpler by using XPath.
If you want all the links inside article tags, you can do the following.
var anchors = htmldoc.DocumentNode.SelectNodes("//article//a");
var links = anchors.Select(a => a.Attributes["href"].Value).ToList();
Value is the property that holds the attribute's text. Note that SelectNodes returns null by default when nothing matches, so check for that before calling Select.
If you want only the anchor tags that sit inside a div with class smallerCell directly under an article, you can change the XPath to //article/div[@class='smallerCell']/a.
You get the idea. I think you're just missing XPath knowledge. Also note that HtmlAgilityPack has plugins that add CSS selectors, so that's an option too if you don't want to use XPath.
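For the markup in the question, the whole chain of Descendants/Where calls can probably be collapsed into a single XPath query. The following is a sketch under assumptions: LinkExtractor and GetArticleLinks are names invented for this example, and the class value GridW is taken from the question's code.

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

// Hypothetical helper; the names are illustrative only.
static class LinkExtractor
{
    // Returns the href of every <a> found inside an <article> under the GridW div.
    public static List<string> GetArticleLinks(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // "//" between steps means "any descendant", so the intermediate
        // divs do not have to be spelled out.
        var anchors = doc.DocumentNode.SelectNodes("//div[@class='GridW']//article//a[@href]");

        // SelectNodes returns null (not an empty collection) when nothing matches.
        if (anchors == null)
            return new List<string>();

        return anchors.Select(a => a.GetAttributeValue("href", "")).ToList();
    }
}
```

The exact-match test @class='GridW' only works when the class attribute contains nothing else; for elements with several classes, contains(@class, 'GridW') is a common (if slightly loose) alternative.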

Simplest way I'd go about it would be this...
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(text);
var nodesWithARef = doc.DocumentNode.Descendants("a");
foreach (HtmlNode node in nodesWithARef)
{
    Console.WriteLine(node.GetAttributeValue("href", ""));
}
Reasoning: Descendants("a") gives you a sequence of every link in the entire HTML. You can go over the nodes and do what you need; I am simply printing the href.
Another way to go about it is to look up all the nodes that have the class 'smallerCell'. Then, for each of those nodes, look up the links under it and print each href (or do something else with it).
var nodesWithSmallerCells = doc.DocumentNode.SelectNodes("//div[@class='smallerCell']");
if (nodesWithSmallerCells != null)
    foreach (HtmlNode node in nodesWithSmallerCells)
    {
        HtmlNodeCollection children = node.SelectNodes(".//a");
        if (children != null)
            foreach (HtmlNode child in children)
                Console.WriteLine(child.GetAttributeValue("href", ""));
    }

HtmlAgilityPack filtering HTML based on a query

I have a block of two HTML elements which look like this:
<div class="a-row">
<a class="a-size-small a-link-normal a-text-normal" href="/Chemical-Guys-CWS-107-Extreme-Synthetic/dp/B003U4P3U0/ref=sr_1_1_sns?s=automotive&ie=UTF8&qid=1504525216&sr=1-1">
<span aria-label="$19.51" class="a-color-base sx-zero-spacing">
<span class="sx-price sx-price-large">
<sup class="sx-price-currency">$</sup>
<span class="sx-price-whole">19</span>
<sup class="sx-price-fractional">51</sup>
</span>
</span>
<span class="a-letter-space"></span>Subscribe & Save
</a>
</div>
And next block of HTML:
<div class="a-row a-spacing-none">
<a class="a-link-normal a-text-normal" href="https://rads.stackoverflow.com/amzn/click/com/B003U4P3U0" rel="nofollow noreferrer">
<span aria-label="$22.95" class="a-color-base sx-zero-spacing">
<span class="sx-price sx-price-large">
<sup class="sx-price-currency">$</sup>
<span class="sx-price-whole">22</span>
<sup class="sx-price-fractional">95</sup>
</span>
</span>
</a>
<span class="a-letter-space"></span>
<i class="a-icon a-icon-prime a-icon-small s-align-text-bottom" aria-label="Prime">
<span class="a-icon-alt">Prime</span>
</i>
</div>
Both of these blocks are quite similar in structure, but the trick is that I want to extract the price from the block that also contains, next to the link, an i element with aria-label="Prime".
This is how I currently extract the price, but it's not good:
if (htmlDoc.DocumentNode.SelectNodes("//span[@class='a-color-base sx-zero-spacing']") != null)
{
    var span = htmlDoc.DocumentNode.SelectSingleNode("//span[@class='a-color-base sx-zero-spacing']");
    price = span.Attributes["aria-label"].Value;
}
This basically selects the first matching element, since there is more than one. But I would like to select the span in the block that contains the Prime marker, as in the second piece of HTML shown above.
If no block with the Prime marker exists, I would simply fall back to the first method above.
Can someone help me out with this ? =)
I've also tried something like this:
var pr = htmlDoc.DocumentNode.SelectNodes("//a[@class='a-link-normal a-text-normal']")
    .Where(x => x.SelectSingleNode("//i[@class='a-icon a-icon-prime a-icon-small s-align-text-bottom']") != null)
    .Select(x => x.SelectSingleNode("//span[@class='a-color-base sx-zero-spacing']").Attributes["aria-label"].Value);
But it's still returning first element xD
New version guys:
var pr = htmlDoc.DocumentNode.SelectNodes("//a[@class='a-link-normal a-text-normal']");
string prrrrrr = "";
for (int i = 0; i < pr.Count; i++)
{
    if (pr.ElementAt(i).SelectNodes("//i[@class='a-icon a-icon-prime a-icon-small s-align-text-bottom']").ElementAt(i) != null)
    {
        prrrrrr = pr.ElementAt(i).SelectNodes("//span[@class='a-color-base sx-zero-spacing']").ElementAt(i).Attributes["aria-label"].Value;
    }
}
So the idea is that I take all the "a" elements from the HTML file into a node collection, then loop through them to see which one actually contains the element I'm looking for, and match on that.
The problem here is that this if statement always passes:
if (pr.ElementAt(i).SelectNodes("//i[@class='a-icon a-icon-prime a-icon-small s-align-text-bottom']").ElementAt(i) != null)
How can I loop through each individual element in the node collection?

I think you should start at the div level, with the divs whose class starts with a-row. Then loop over them and check whether the div contains an i element whose aria-label attribute equals 'Prime'. Finally, get the span with the a-color-base sx-zero-spacing class and read the value of its aria-label attribute, like this:
HtmlNodeCollection nodes = htmlDoc.DocumentNode.SelectNodes("//div[starts-with(@class,'a-row')]");
foreach (HtmlNode node in nodes)
{
    HtmlNode i = node.SelectSingleNode("i[@aria-label='Prime']");
    if (i != null)
    {
        HtmlNode span = node.SelectSingleNode(".//span[@class='a-color-base sx-zero-spacing']");
        if (span != null)
        {
            string currentValue = span.Attributes["aria-label"].Value;
        }
    }
}
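As for why the attempts in the question always matched: an XPath expression that starts with // is evaluated from the document root, even when SelectNodes or SelectSingleNode is called on a child node. Starting the expression with .// (or with a plain element name, as in the code above) keeps the search inside the current node. Here is a small sketch of the difference; the class name RelativeXPathDemo and the two-div sample markup are made up for illustration.

```csharp
using HtmlAgilityPack;

// Hypothetical demo class; the names and markup are illustrative only.
static class RelativeXPathDemo
{
    public static (string Absolute, string Relative) Compare()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(
            "<div id='first'><span aria-label='$19.51'></span></div>" +
            "<div id='second'><span aria-label='$22.95'></span></div>");

        var second = doc.DocumentNode.SelectSingleNode("//div[@id='second']");

        // "//span" ignores the context node and searches the whole document,
        // so it finds the span inside the FIRST div.
        var absolute = second.SelectSingleNode("//span").GetAttributeValue("aria-label", "");

        // ".//span" searches only below the context node (the second div).
        var relative = second.SelectSingleNode(".//span").GetAttributeValue("aria-label", "");

        return (absolute, relative);
    }
}
```

Compare() returns ("$19.51", "$22.95"), which is the same effect that made the if statement in the question pass for every anchor.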

Why can't I remove tag name using linq to xml

Why can't I delete the tag name and retain its value when the tag I'm going to delete doesn't have a child node?
Here is the XML file:
<p>
<li>
<BibUnstructured>Some text</BibUnstructured>
</li>
<li>
<BibUnstructured>another text</BibUnstructured>
</li>
</p>
and this must be the output:
<p>
<li>
Some text
</li>
<li>
another text
</li>
</p>
and here is my code as of now:
XElement rootBook = XElement.Load("try.xml");
IEnumerable<XElement> Book =
    from el in rootBook.Descendants("BibUnstructured").ToList()
    select el;
foreach (XElement el in Book)
{
    if (el.HasElements)
    {
        el.ReplaceWith(el.Elements());
    }
    Console.WriteLine(el);
}
Console.WriteLine(rootBook.ToString());
If I remove the if statement, it deletes both the tag and its content.

Your BibUnstructured elements don't have child elements, but they do have child nodes (text nodes, in this case). Try this:
foreach (var book in doc.Descendants("BibUnstructured").ToList())
{
    if (book.Nodes().Any())
    {
        book.ReplaceWith(book.Nodes());
    }
}
See this fiddle for a working demo.
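To make the element/node distinction concrete, here is a small self-contained sketch using only System.Xml.Linq. The class name UnwrapDemo and the Unwrap method are made up for this example; unlike the loop above it has no Nodes().Any() check, so an empty element is simply removed.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper; the names are illustrative only.
static class UnwrapDemo
{
    // Replaces every element called `name` with its own child nodes,
    // so the text content survives while the wrapping tag disappears.
    public static string Unwrap(string xml, string name)
    {
        var root = XElement.Parse(xml);

        // ToList() takes a snapshot so the tree can be modified while iterating.
        foreach (var el in root.Descendants(name).ToList())
        {
            // Nodes() includes text nodes; Elements() would return nothing here.
            el.ReplaceWith(el.Nodes());
        }

        return root.ToString(SaveOptions.DisableFormatting);
    }
}
```

For an element such as <BibUnstructured>Some text</BibUnstructured>, HasElements is false while Nodes().Any() is true, which is why the original if block never ran. Unwrap("<p><li><BibUnstructured>Some text</BibUnstructured></li></p>", "BibUnstructured") returns <p><li>Some text</li></p>.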

Charles already explained why it is not working. Alternatively, you could do this:
XElement element = XElement.Load("try.xml");
element.Descendants("li").ToList().ForEach(x =>
{
    var item = x.Element("BibUnstructured");
    if (item != null)
    {
        x.Add(item.Value);
        item.Remove();
    }
});
Check this Demo

You have to set the value of parent node to the value of the child node you are going to delete.
Try the following:
XElement rootBook = XElement.Load("try.xml");
IEnumerable<XElement> Book =
    from el in rootBook.Descendants("BibUnstructured").ToList()
    select el;
foreach (XElement el in Book)
{
    if (!el.HasElements)
    {
        XElement parent = el.Parent;
        string value = el.Value;
        el.Remove();
        parent.Value = value;
        Console.WriteLine(parent);
    }
}
Console.WriteLine(rootBook.ToString());
And the output is:
<li>Some text</li>
<li>another text</li>
<p>
<li>Some text</li>
<li>another text</li>
</p>

C# HtmlAgilityPack Select table from specific h2

I have some html:
<h2>Results</h2>
<div class="box">
<table class="tFormat">
<th>Head</th>
<tr>1</tr>
</table>
</div>
<h2>Grades</h2>
<div class="box">
<table class="tFormat">
<th>Head</th>
<tr>1</tr>
</table>
</div>
I was wondering how would I get the table under "Results"
I've tried:
var nodes = doc.DocumentNode.SelectNodes("//h2");
foreach (var o in nodes)
{
if (o.InnerText.Equals("Results"))
{
foreach (var c in o.SelectNodes("//table"))
{
Console.WriteLine(c.InnerText);
}
}
}
It works but it also gets the table under Grades h2

Note that the div is not hierarchically inside the header, so it doesn't make sense to look for it there.
This can work for you - it finds the next element after the title:
if (o.InnerText.Equals("Results"))
{
    var nextDiv = o.NextSibling;
    while (nextDiv != null && nextDiv.NodeType != HtmlNodeType.Element)
        nextDiv = nextDiv.NextSibling;
    // nextDiv should be correct here.
}
You can also write a more specific xpath to find just that div:
doc.DocumentNode.SelectNodes("//h2[text()='Results']/following-sibling::div[1]");
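Going one step further, the same XPath can be extended to return the table itself. This is only a sketch: TableUnderHeading and Find are names invented here, the heading text is pasted into the XPath unescaped (so it must not contain an apostrophe), and following-sibling::div[1] simply takes the first div after the heading without checking whether another h2 comes in between.

```csharp
using HtmlAgilityPack;

// Hypothetical helper; the names are illustrative only.
static class TableUnderHeading
{
    // Returns the first table inside the div that follows the given h2,
    // or null when nothing matches.
    public static HtmlNode Find(string html, string heading)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // normalize-space() tolerates whitespace around the heading text.
        var xpath = "//h2[normalize-space()='" + heading + "']" +
                    "/following-sibling::div[1]//table";

        return doc.DocumentNode.SelectSingleNode(xpath);
    }
}
```

TableUnderHeading.Find(html, "Results") would then give only the table under the Results heading, and the Grades table is left out.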

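var nodes = doc.DocumentNode.SelectNodes("//h2");
if (nodes != null && nodes.FirstOrDefault() != null)
{
    var o = nodes.FirstOrDefault();
    if (o.InnerText.Equals("Results"))
    {
        // "//table" would search the whole document again, so restrict
        // the search to the first div that follows this h2.
        var tables = o.SelectNodes("following-sibling::div[1]//table");
        if (tables != null)
        {
            foreach (var c in tables)
            {
                Console.WriteLine(c.InnerText);
            }
        }
    }
}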

Recursive searching a pattern in a string

I am using c#.
I have following string
<li>
P1
<ul>
<li>P11</li>
<li>P12</li>
<li>P13</li>
<li>P14</li>
</ul>
</li>
<li>
P2
<ul>
<li>P21</li>
<li>P22</li>
<li>P23</li>
</ul>
</li>
<li>
P3
<ul>
<li>P31</li>
<li>P32</li>
<li>P33</li>
<li>P34</li>
</ul>
</li>
<li>
P4
<ul>
<li>P41</li>
<li>P42</li>
</ul>
</li>
My aim is to fill the following list from the above string.
List<class1>
class1 has two properties,
string parent;
List<string> children;
It should fill P1 in parent and P11,P12,P13,P14 in children, and make a list of them.
Any suggestion will be helpful.
Edit
Sample
public List<class1> getElements()
{
    List<class1> temp = new List<class1>();
    // foreach (<a> element in string)   <-- pseudocode for the loop I need
    {
        // in the recursive loop
        List<string> str = new List<string>();
        str.Add("P11");
        str.Add("P12");
        str.Add("P13");
        str.Add("P14");
        class1 obj = new class1("P1", str);
        temp.Add(obj);
    }
    return temp;
}
The values are hard-coded here, but they would be dynamic.

What you want is a recursive descent parser. All the other suggestions of using libraries are basically suggesting that you use a recursive descent parser for HTML or XML that has been written by others.
The basic structure of a recursive descent parser is to do a linear search of a list of tokens (in your case a string) and upon encountering a token that delimits a sub entity call the parser again to process the sublist of tokens (substring).
You can Google the term "recursive descent parser" and find plenty of useful results. Even the Wikipedia article is fairly good in this case and includes an example of a recursive descent parser in C.
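To give a feel for the structure, here is a toy recursive descent parser for exactly the kind of string in the question. It is a sketch under strong assumptions: it only understands the literal tags <li>, </li>, <ul> and </ul> with no attributes, and the class names Item and ListParser are invented for this example. Real-world HTML needs a real parser.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical types; the names are illustrative only.
class Item
{
    public string Text = "";
    public List<Item> Children = new List<Item>();
}

class ListParser
{
    private readonly string _s;
    private int _pos;

    public ListParser(string s) { _s = s; }

    // items := ( "<li>" item-body )*
    public List<Item> ParseItems()
    {
        var items = new List<Item>();
        while (TryConsume("<li>"))
            items.Add(ParseItemBody());
        return items;
    }

    // item-body := text [ "<ul>" items "</ul>" ] "</li>"
    private Item ParseItemBody()
    {
        var item = new Item { Text = ReadText() };
        if (TryConsume("<ul>"))
        {
            // A nested list is parsed by calling the parser again on the sub-entity.
            item.Children = ParseItems();
            Expect("</ul>");
        }
        Expect("</li>");
        return item;
    }

    // Reads everything up to the next tag and trims surrounding whitespace.
    private string ReadText()
    {
        int start = _pos;
        while (_pos < _s.Length && _s[_pos] != '<') _pos++;
        return _s.Substring(start, _pos - start).Trim();
    }

    // Skips whitespace, then consumes `token` if it is next in the input.
    private bool TryConsume(string token)
    {
        while (_pos < _s.Length && char.IsWhiteSpace(_s[_pos])) _pos++;
        if (string.CompareOrdinal(_s, _pos, token, 0, token.Length) != 0)
            return false;
        _pos += token.Length;
        return true;
    }

    private void Expect(string token)
    {
        if (!TryConsume(token))
            throw new FormatException("Expected " + token + " at position " + _pos);
    }
}
```

new ListParser(input).ParseItems() gives one Item per top-level <li>, with the nested entries in Children; mapping that to class1 is then a matter of copying Text into parent and the children's Text values into the children list.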

If you can't use a third-party tool such as the Html Agility Pack, which I'd recommend, you could use the WebBrowser class and the HtmlDocument class to parse the HTML:
WebBrowser wbc = new WebBrowser();
wbc.DocumentText = "foo"; // necessary to create the document
HtmlDocument doc = wbc.Document.OpenNew(true);
doc.Write((string)html); // insert your html-string here
List<class1> elements = wbc.Document.GetElementsByTagName("li").Cast<HtmlElement>()
    .Where(li => li.Children.Count == 2)
    .Select(outerLi => new class1
    {
        parent = outerLi.FirstChild.InnerText,
        children = outerLi.Children.Cast<HtmlElement>()
            .Last().Children.Cast<HtmlElement>()
            .Select(innerLi => innerLi.FirstChild.InnerText).ToList()
    }).ToList();

You can also use XmlDocument, provided the markup is well-formed XML. The string has several top-level <li> elements, so wrap it in a single root element before loading it:
XmlDocument doc = new XmlDocument();
doc.LoadXml("<root>" + yourInputString + "</root>");
XmlNodeList colNodes = doc.DocumentElement.SelectNodes("li");
foreach (XmlNode node in colNodes)
{
    // ... your logic here
    // for example
    // string parentName = node.SelectSingleNode("a").InnerText;
    // string parentHref = node.SelectSingleNode("a").Attributes["href"].Value;
    // XmlNodeList children =
    //     node.SelectSingleNode("ul").SelectNodes("li");
    // foreach (XmlNode child in children)
    // {
    //     ......
    // }
}
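Filled in for the string as posted, the loop body might look like the sketch below. ListEntry stands in for the question's class1, XmlListReader and Parse are names made up here, and the parent label is taken to be whatever the outer <li> contains apart from its nested <ul>, which is an assumption about the input.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Xml;

// Stand-in for the question's class1; the names are illustrative only.
class ListEntry
{
    public string Parent = "";
    public List<string> Children = new List<string>();
}

static class XmlListReader
{
    public static List<ListEntry> Parse(string fragment)
    {
        var doc = new XmlDocument();
        // XML allows only one root element, so wrap the fragment.
        doc.LoadXml("<root>" + fragment + "</root>");

        var result = new List<ListEntry>();
        foreach (XmlNode li in doc.DocumentElement.SelectNodes("li"))
        {
            // Parent label: the text of every child node except the nested list.
            var label = string.Concat(
                li.ChildNodes.Cast<XmlNode>()
                  .Where(n => n.Name != "ul")
                  .Select(n => n.InnerText)).Trim();

            // Children: the text of each <li> inside the nested <ul>.
            var children = li.SelectNodes("ul/li").Cast<XmlNode>()
                             .Select(c => c.InnerText.Trim())
                             .ToList();

            result.Add(new ListEntry { Parent = label, Children = children });
        }
        return result;
    }
}
```

This only works while the input stays well-formed XML; for HTML found in the wild, an HTML parser is the safer choice.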
