I am using c#.
I have following string
<li>
P1
<ul>
<li>P11</li>
<li>P12</li>
<li>P13</li>
<li>P14</li>
</ul>
</li>
<li>
P2
<ul>
<li>P21</li>
<li>P22</li>
<li>P23</li>
</ul>
</li>
<li>
P3
<ul>
<li>P31</li>
<li>P32</li>
<li>P33</li>
<li>P34</li>
</ul>
</li>
<li>
P4
<ul>
<li>P41</li>
<li>P42</li>
</ul>
</li>
My aim is to fill the following list from the above string.
List<class1>
class1 has two properties,
string parent;
List<string> children;
It should fill P1 in parent and P11,P12,P13,P14 in children, and make a list of them.
Any suggestion will be helpful.
Edit
Sample
public List<class1> getElements()
{
List<class1> temp = new List<class1>();
foreach(// <a> element in string)
{
//in the recursive loop
List<string> str = new List<string>();
str.add("P11");
str.add("P12");
str.add("P13");
str.add("P14");
class1 obj = new class1("P1",str);
temp.add(obj);
}
return temp;
}
the values are hard coded here, but it would be dynamic.
What you want is a recursive descent parser. All the other suggestions of using libraries are basically suggesting that you use a recursive descent parser for HTML or XML that has been written by others.
The basic structure of a recursive descent parser is to do a linear search of a list of tokens (in your case a string) and upon encountering a token that delimits a sub entity call the parser again to process the sublist of tokens (substring).
You can Google for the term "recursive descent parser" and find plenty of useful result. Even the Wikipedia article is fairly good in this case and includes an example of a recursive descent parser in C.
If you can't use a third party tool like my recommended Html Agility Pack you could use the Webbrowser class and the HtmlDocument class to parse the HTML:
WebBrowser wbc = new WebBrowser();
wbc.DocumentText = "foo"; // necessary to create the document
HtmlDocument doc = wbc.Document.OpenNew(true);
doc.Write((string)html); // insert your html-string here
List<class1> elements = wbc.Document.GetElementsByTagName("li").Cast<HtmlElement>()
.Where(li => li.Children.Count == 2)
.Select(outerLi => new class1
{
parent = outerLi.FirstChild.InnerText,
children = outerLi.Children.Cast<HtmlElement>()
.Last().Children.Cast<HtmlElement>()
.Select(innerLi => innerLi.FirstChild.InnerText).ToList()
}).ToList();
Here's the result in the debugger window:
You can also use XmlDocument:
XmlDocument doc = new XmlDocument();
doc.LoadXml(yourInputString);
XmlNodeList colNodes = xmlSource.SelectNodes("li");
foreach (XmlNode node in colNodes)
{
// ... your logic here
// for example
// string parentName = node.SelectSingleNode("a").InnerText;
// string parentHref = node.SelectSingleNode("a").Attribures["href"].Value;
// XmlNodeList children =
// node.SelectSingleNode("ul").SelectNodes("li");
// foreach (XmlNode child in children)
// {
// ......
// }
}
Related
Suppose I have the following HTML code:
<div class="MyDiv">
<h2>Josh</h2>
</div>
<div class="MyDiv">
<h2>Anna</h2>
</div>
<div class="MyDiv">
<h2>Peter</h2>
</div>
And I want to get the names, so this is what I did (C#):
string url = "https://...";
var web = new HtmlWeb();
HtmlNode[] nodes = null;
HtmlDocument doc = null;
doc = web.Load(url);
nodes = doc.DocumentNode.SelectNodes("//div[#class='MyDiv").ToArray() ?? null;
foreach (HtmlNode n in nodes){
var name = n.SelectSingleNode("//h2");
Console.WriteLine(name.InnerHtml);
}
Output:
Josh
Josh
Josh
and it is so strange because n contains only the desired <div>. How can I resolve this issue?
Fixed by writing .//h2 instead of //h2
It's because of your XPath statement "//h2". You should change this simply to "h2". When you start with the two "//" the path starts at the top. And then it selects "Josh" every time, because that is the first h2 node.
You could also do like this:
List<string> names =
doc.DocumentNode.SelectNodes("//div[#class='MyDiv']/h2")
.Select(dn => dn.InnerText)
.ToList();
foreach (string name in names)
{
Console.WriteLine(name);
}
Im trying to learn webscraping and to get the href value from the "a" node using Htmlagilitypack in C#. There is multiple Gridcells within the gridview that has articles with smallercells and I want the "a" node href value from all of them
<div class=Tabpanel>
<div class=G ridW>
<div class=G ridCell>
<article>
<div class=s mallerCell>
<a href="..........">
</div>
</article>
</div>
</div>
<div class=r andom>
</div>
<div class=r andom>
</div>
</div>
This is what I have come up with so far, feels like I'm making it way more complicated than it has to be. Where do I go from here? Or is there an easier way to do this?
httpclient = new HttpClient();
var html = await httpclient.GetStringAsync(Url);
var htmldoc = new HtmlDocument();
htmldoc.LoadHtml(html);
var ReceptLista = new List < HtmlNode > ();
ReceptLista = htmldoc.DocumentNode.Descendants("div")
.Where(node => node.GetAttributeValue("class", "")
.Equals("GridW")).ToList();
var finalList = new List < HtmlNode > ();
finalList = ReceptLista[0].Descendants("article").ToList();
var finalList2 = new List < List < HtmlNode >> ();
for (int i = 0; i < finalList.Count; i++) {
finalList2.Add(finalList[i].DescendantNodes().Where(node => node.GetAttributeValue("class", "").Equals("RecipeTeaser-content")).ToList());
}
var finalList3 = new List < List < HtmlNode >> ();
for (int i = 0; i < finalList2.Count; i++) {
finalList3.Add(finalList2[i].Where(node => node.GetAttributeValue("class", "").Equals("RecipeTeaser-link js-searchRecipeLink")).ToList());
}
If you can probably make things a lot simpler by using XPath.
If you want all the links in article tags, you can do the following.
var anchors = htmldoc.SelectNodes("//article/a");
var links = anchors.Select(a=>a.attributes["href"].Value).ToList();
I think it is Value. Check with docs.
If you want only the anchor tags that are children of article, and also with class smallerCell, you can change the xpath to //article/div[#class='smallerClass']/a.
you get the idea. I think you're just missing xpath knowledge. Also note that HtmlAgilityPack also has plugins that can add CSS selectors, so that's also an option if you don't want to do xpath.
Simplest way I'd go about it would be this...
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(text);
var nodesWithARef = doc.DocumentNode.Descendants("a");
foreach (HtmlNode node in nodesWithARef)
{
Console.WriteLine(node.GetAttributeValue("href", ""));
}
Reasoning: Using the Descendants function would give you an array of all the links that you're interested in from the entire html. You can go over the nodes and do what you need ... i am simply printing the href.
Another Way to go about it would be to look up all the nodes that have the class named 'smallerCell'. Then, for each of those nodes, look up the href if it exists under that and print it (or do something with it).
var nodesWithSmallerCells = doc.DocumentNode.SelectNodes("//div[#class='smallerCell']");
if (nodesWithSmallerCells != null)
foreach (HtmlNode node in nodesWithSmallerCells)
{
HtmlNodeCollection children = node.SelectNodes(".//a");
if (children != null)
foreach (HtmlNode child in children)
Console.WriteLine(child.GetAttributeValue("href", ""));
}
I have a the HTML code which I would like to parse.
I have written the code below:
HtmlAgilityPack.HtmlWeb web5 = new HtmlAgilityPack.HtmlWeb();
HtmlAgilityPack.HtmlDocument doc5 = web5.Load("http://www.analytics4.co.uk/pdf.js/web/viewer.html?file=http://www.analytics4.co.uk/pdf.js/pdf/w15639.pdf");
//var divs5 = doc5.DocumentNode.SelectNodes("//div[id='viewerContainer']").SelectMany(x => x.Descendants("div"));
// HtmlAgilityPack.HtmlDocument doc5 = web5.Load("http://google.co.uk");
HtmlNodeCollection tl = doc5.DocumentNode.SelectNodes("//div[#id='viewerContainer']//div[#id='viewer']//");
foreach (HtmlAgilityPack.HtmlNode node in tl)
{
Console.WriteLine(node.InnerHtml);
Console.WriteLine(node.OuterHtml);
}
The result I get for Inner HTML is just
<div id="viewer" class="pdfViewer"></div>
and it doesn't make sense. Could anyone explain me how can I go deeper and deeper to the inner divs and so on? Please guys...I need your help.
To go deeper you can use this techniques:
foreach (var node in tl){
var a = node.ChildNodes[2]; // a is the third child of node
var b = node.SelectSingleNode("./div[3]"); // b is the third "div"
// element in node children. The "./" in XPath means "from current node"
}
Good luck!
I use HtmlAgilityPack to parse the html document of a webbrowser control.
I am able to find my desired HtmlNode, but after getting the HtmlNode, I want to retun the corresponding HtmlElement in the WebbrowserControl.Document.
In fact HtmlAgilityPack parse an offline copy of the live document, while I want to access live elements of the webbrowser Control to access some rendered attributes like currentStyle or runtimeStyle
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(webBrowser1.Document.Body.InnerHtml);
var some_nodes = doc.DocumentNode.SelectNodes("//p");
// this selection could be more sophisticated
// and the answer shouldn't relay on it.
foreach (HtmlNode node in some_nodes)
{
HtmlElement live_element = CorrespondingElementFromWebBrowserControl(node);
// CorrespondingElementFromWebBrowserControl is what I am searching for
}
If the element had a specific attribute it could be easy but I want a solution which works on any element.
Please help me what can I do about it.
In fact there seems to be no direct possibility to change the document directly in the webbroser control.
But you can extract the html from it, mnipulate it and write it back again like this:
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(webBrowser1.DocumentText);
foreach (HtmlAgilityPack.HtmlNode node in doc.DocumentNode.ChildNodes) {
node.Attributes.Add("TEST", "TEST");
}
StringBuilder sb = new StringBuilder();
using (StringWriter sw = new StringWriter(sb)) {
doc.Save(sw);
webBrowser1.DocumentText = sb.ToString();
}
For direct manipulation you can maybe use the unmanaged pointer webBrowser1.Document.DomDocument to the document, but this is outside of my knowledge.
HtmlAgilityPack definitely can't provide access to nodes in live HTML directly. Since you said there is no distinct style/class/id on the element you have to walk through the nodes manually and find matches.
Assuming HTML is reasonably valid (so both browser and HtmlAgilityPack perform normalization similarly) you can walk pairs of elements starting from the root of both trees and selecting the same child node.
Basically you can build "position-based" XPath to node in one tree and select it in another tree. Xpath would look something like (depending you want to pay attention to just positions or position and node name):
"/*[1]/*[4]/*[2]/*[7]"
"/body/div[2]/span[1]/p[3]"
Steps:
In using HtmlNode you've found collect all parent nodes up to the root.
Get root of element of HTML in browser
for each level of children find position of corresponding child on HtmlNodes collection on step 1 in its parent and than find live HtmlElement among children of current live node.
Move to newly found child and go back to 3 till found node you are looking for.
the XPath attribute of the HtmlAgilityPack.HtmlNode shows the nodes on the path from root to the node. For example \div[1]\div[2]\table[0]. You can traverse this path in the live document to find the corresponding live element. However this path may not be precise as HtmlAgilityPack removes some tags like <form> then before using this solution add the omitted tags back using
HtmlNode.ElementsFlags.Remove("form");
struct DocNode
{
public string Name;
public int Pos;
}
///// structure to hold the name and position of each node in the path
The following method finds the live element according to the XPath
static public HtmlElement GetLiveElement(HtmlNode node, HtmlDocument doc)
{
var pattern = #"/(.*?)\[(.*?)\]"; // like div[1]
// Parse the XPath to extract the nodes on the path
var matches = Regex.Matches(node.XPath, pattern);
List<DocNode> PathToNode = new List<DocNode>();
foreach (Match m in matches) // Make a path of nodes
{
DocNode n = new DocNode();
n.Name = n.Name = m.Groups[1].Value;
n.Pos = Convert.ToInt32(m.Groups[2].Value)-1;
PathToNode.Add(n); // add the node to path
}
HtmlElement elem = null; //Traverse to the element using the path
if (PathToNode.Count > 0)
{
elem = doc.Body; //begin from the body
foreach (DocNode n in PathToNode)
{
//Find the corresponding child by its name and position
elem = GetChild(elem, n);
}
}
return elem;
}
the code for GetChild Method used above
public static HtmlElement GetChild(HtmlElement el, DocNode node)
{
// Find corresponding child of the elemnt
// based on the name and position of the node
int childPos = 0;
foreach (HtmlElement child in el.Children)
{
if (child.TagName.Equals(node.Name,
StringComparison.OrdinalIgnoreCase))
{
if (childPos == node.Pos)
{
return child;
}
childPos++;
}
}
return null;
}
I'm looking for the best way to split an HTML document over some tag in C# using HtmlAgilityPack. I want to preserve the intended markup as I'm doing the split. Here is an example.
If the document is like this:
<p>
<div>
<p>
Stuff
</p>
<p>
<ul>
<li>Bullet 1</li>
<li>link</li>
<li>Bullet 3</li>
</ul>
</p>
<span>Footer</span>
</div>
</p>
Once it's split, it should look like this:
Part 1
<p>
<div>
<p>
Stuff
</p>
<p>
<ul>
<li>Bullet 1</li>
</ul>
</p>
</div>
</p>
Part 2
<p>
<div>
<p>
<ul>
<li>Bullet 3</li>
</ul>
</p>
<span>Footer</span>
</div>
</p>
What would be the best way of doing something like that?
Definitely not by regex. (Note: this was originally a tag on the question—now removed.) I'm usually not one to jump on The Pony is Coming bandwagon, but this is one case in which regular expressions would be particularly bad.
First, I would write a recursive function that removes all siblings of a node that follow that node—call it RemoveSiblingsAfter(node)—and then calls itself on its parent, so that all siblings following the parent are removed as well (and all siblings following the grandparent, and so on). You can use an XPath to find the node(s) on which you want to split, e.g. doc.DocumentNode.SelectNodes("//a[#href='#']"), and call the function on that node. When done, you'd remove the splitting node itself, and that's it. You'd repeat these steps for a copy of the original document, except you'd implement RemoveSiblingsBefore(node) to remove siblings that precede a node.
In your example, RemoveSiblingsBefore would act as follows:
<a href="#"> has no siblings, so recurse on parent, <li>.
<li> has a preceding sibling—<li>Bullet 1</li>—so remove, and recurse on parent, <ul>.
<ul> has no siblings, so recurse on parent, <p>.
<p> has a preceding sibling—<p>Stuff</p>—so remove, and recurse on parent, <div>.
and so on.
Here is what I came up with. This does the split and removes the "empty" elements of the element where the split happens.
private static void SplitDocument()
{
var doc = new HtmlDocument();
doc.Load("HtmlDoc.html");
var links = doc.DocumentNode.SelectNodes("//a[#href]");
var firstPart = GetFirstPart(doc.DocumentNode, links[0]).DocumentNode.InnerHtml;
var secondPart = GetSecondPart(links[0]).DocumentNode.InnerHtml;
}
private static HtmlDocument GetFirstPart(HtmlNode currNode, HtmlNode link)
{
var nodeStack = new Stack<Tuple<HtmlNode, HtmlNode>>();
var newDoc = new HtmlDocument();
var parent = newDoc.DocumentNode;
nodeStack.Push(new Tuple<HtmlNode, HtmlNode>(currNode, parent));
while (nodeStack.Count > 0)
{
var curr = nodeStack.Pop();
var copyNode = curr.Item1.CloneNode(false);
curr.Item2.AppendChild(copyNode);
if (curr.Item1 == link)
{
var nodeToRemove = NodeAndEmptyAncestors(copyNode);
nodeToRemove.ParentNode.RemoveChild(nodeToRemove);
break;
}
for (var i = curr.Item1.ChildNodes.Count - 1; i >= 0; i--)
{
nodeStack.Push(new Tuple<HtmlNode, HtmlNode>(curr.Item1.ChildNodes[i], copyNode));
}
}
return newDoc;
}
private static HtmlDocument GetSecondPart(HtmlNode link)
{
var nodeStack = new Stack<HtmlNode>();
var newDoc = new HtmlDocument();
var currNode = link;
while (currNode.ParentNode != null)
{
currNode = currNode.ParentNode;
nodeStack.Push(currNode.CloneNode(false));
}
var parent = newDoc.DocumentNode;
while (nodeStack.Count > 0)
{
var node = nodeStack.Pop();
parent.AppendChild(node);
parent = node;
}
var newLink = link.CloneNode(false);
parent.AppendChild(newLink);
currNode = link;
var newParent = newLink.ParentNode;
while (currNode.ParentNode != null)
{
var foundNode = false;
foreach (var child in currNode.ParentNode.ChildNodes)
{
if (foundNode) newParent.AppendChild(child.Clone());
if (child == currNode) foundNode = true;
}
currNode = currNode.ParentNode;
newParent = newParent.ParentNode;
}
var nodeToRemove = NodeAndEmptyAncestors(newLink);
nodeToRemove.ParentNode.RemoveChild(nodeToRemove);
return newDoc;
}
private static HtmlNode NodeAndEmptyAncestors(HtmlNode node)
{
var currNode = node;
while (currNode.ParentNode != null && currNode.ParentNode.ChildNodes.Count == 1)
{
currNode = currNode.ParentNode;
}
return currNode;
}