HtmlAgilityPack select subnodes value - c#

HTML string contains the exact string
<div class="XKa d-k-l"><span class="VTb d-k-l"></span></div><div class="pha d-k-l"><div ><div>Hello World </div>
I want to retrieve the Hello World from a div.
I am using HtmlAgilityPack
var item =
HTMLContent.DocumentNode
.SelectSingleNode("//div[#class='XKa d-k-l']//span[#class='VTb d-k-l']//div[#class='pha d-k-l']")
.InnerHtml;
Exception: Object reference not set to an instance of an object. Cant figure out the correct syntax Appreciate your help

div[#class='pha d-k-l'] is not descendant of div[#class='XKa d-k-l'], the relation is siblings instead ancestor-descendant. You can try using following-sibling axis like so :
//div[#class='XKa d-k-l']/following-sibling::div[#class='pha d-k-l']
working demo example :
var html = #"<div class='XKa d-k-l'><span class='VTb d-k-l'></span></div><div class='pha d-k-l'><div><div>Hello World </div></div></div>";
var HTMLContent = new HtmlDocument();
HTMLContent.LoadHtml(html);
var item = HTMLContent.DocumentNode
.SelectSingleNode("//div[#class='XKa d-k-l']/following-sibling::div[#class='pha d-k-l']").InnerHtml;
Console.WriteLine(item);
output :
<div><div>Hello World </div></div>
update :
You can add the span check like this :
//div[#class='XKa d-k-l'][span/#class='VTb d-k-l']/following-sibling::div[#class='pha d-k-l']

Related

HtmlAgilityPack cannot find node

I am trying to get Start called span from Here
Chrome gives me this xPath: //*[#id="guide-pages"]/div[2]/div[1]/div/div[1]/div/div/div[2]/div/div[3]/div[2]/div[1]/h2
But HtmlAgilityPack returns null, after I tried remove them one by one; this works: //*[#id="guide-pages"]/div[2]/div[1] , but not the rest of them.
My full Code:
HtmlDocument doc = new HtmlDocument();
var text = await ReadUrl();
doc.LoadHtml(text);
Console.WriteLine($"Getting Data From: {doc.DocumentNode.SelectSingleNode("//head/title").InnerText}"); //Works fine
Console.WriteLine(doc.DocumentNode.SelectSingleNode("//*[#id='guide-pages']/div[2]/div[1]/div/div[1]/div/div/div[2]/div/div[3]/div[2]/div[1]/h2") == null);
Output:
Getting Data From: Miss Fortune Build Guide : [7.11] KOREAN MF Build - Destroy the Carry! [Added Support] :: League of Legends Strategy Builds
True
Don't use xpath from Chrome. Use LINQ in HtmlAgilityPack instead. For example
.Descendants("div") will give you all the div under 1 html node. Each html node will have meta data like id, attributes(classes...), and you can query your wanted div from there.
This is one handy method to check if a HtmlNode has classes or not.
public static bool HasClass(this HtmlNode node, params string[] classValueArray)
{
var classValue = node.GetAttributeValue("class", "");
var classValues = classValue.Split(' ');
return classValueArray.All(c => classValues.Contains(c));
}

Remove whole div with specific class name

Is it possible to remove the whole div with a specific class name? For example;
<body>
<div class="head">...</div>
<div class="container">...</div>
<div class="foot">...</div>
</body>
I would like to remove the div with the "container" class.
A C# code example would be verry useful, thank you.
The proper way (I suppose) to do this is via built in Gecko DOM classes and methods.
So, in your case something like:
var containers = yourDocument.GetElementsByClassName("container");
//this returns an IEnumerable of elements with this class. If you only ever gonna have one, you can do it like that:
var yourContainer = containers.FirstOrDefault();
yourContainer.Parent.RemoveChild(yourContainer);
Obviously, you can also do loops etc.
If you want to parse html in c# the best way is to use Html agility pack :
https://htmlagilitypack.codeplex.com/
HtmlDocument document = new HtmlDocument();
document.Load(#"C:\yourfile.html")
HtmlNode nodesToRemove= document .DocumentNode.SelectNodes("//div[#class='container']").ToList();
foreach (var node in nodesToRemove)
node.Remove();
Well, with the help of regex, you can remove your desired div
var data = "<body>\n<div class=\"head\">...</div>\n" +
"<div class=\"container\">...</div>\n" +
"<div class=\"foot\">...</div>\n</body>";
var rxStr = "<div[^<]+class=([\"'])container\\1.*</div>";
var rx = new System.Text.RegularExpressions.Regex (rxStr,
System.Text.RegularExpressions.RegexOptions.IgnoreCase);
var nStr = rx.Replace (data, "");
Console.WriteLine (nStr);
This will reduce your string to
<body>
<div class="head">...</div>
<div class="foot">...</div>
</body>

Get text that lies after pattern without class or id

I am using the HtmlAgiityPack.
It is an excellent tool for parsing data, however every instance I have used it, I have always had either a class or id to aim at, i.e. -
string example = doc.DocumentNode.SelectSingleNode("//div[#class='target']").InnerText.Trim();
However I have come across a piece of text that isn't nested in any particular pattern with a class or id I can aim at. E.g. -
<p>Example Header</p>: This is the text I want!<br>
However the example given does always following the same patter i.e. the text will always be after </p>: and before <br>.
I can extract the text using a regular expression however would prefer to use the agility pack as the rest of the code follows suit. Is there a means of doing this using the pack?
This XPath works for me :
var html = #"<div class=""target"">
<p>Example Header</p>: This is the text I want!<br>
</div>";
var doc = new HtmlDocument();
doc.LoadHtml(html);
var result = doc.DocumentNode.SelectSingleNode("/div[#class='target']/text()[(normalize-space())]").OuterHtml;
Console.WriteLine(result);
/text() select all text nodes that is direct child of the <div>
[(normalize-space())] exclude all text nodes those contain only white
spaces (there are 2 new lines excluded from this html sample : one before <p> and the other after <br>)
Result :
UPDATE I :
All element must have a parent, like <div> in above example. Or if it is the root node you're talking about, the same approach should still work. The key is to use /text() XPath to get text node :
var html = #"<p>Example Header</p>: This is the text I want!<br>";
var doc = new HtmlDocument();
doc.LoadHtml(html);
var result = doc.DocumentNode.SelectSingleNode("/text()[(normalize-space())]").OuterHtml;
Console.WriteLine(result);
UPDATE II :
Ok, so you want to select text node after <p> element and before <br> element. You can use this XPath then :
var result =
doc.DocumentNode
.SelectSingleNode("/text()[following-sibling::br and preceding-sibling::p]")
.OuterHtml;

Retrieving value of element using HTMLAgility pack

I am using HTMLAgility pack to parse html and then using xpath retrieve a table column with a specific class.
HtmlAgilityPack.HtmlWeb web = new HtmlWeb();
HtmlAgilityPack.HtmlDocument doc = web.Load("www.url.com");
foreach (HtmlNode row in doc.DocumentNode.SelectNodes("(//td[#class='titleColumn'])[2]"))
{
Response.Write(row.InnerHtml + "<br />");
}
I retrieve the data and row.Innerhtml looks like this.
<a>Title</a> <span>Year</span><br />
I want to save the value of a and span element in separate string variables. Please help
Your xpath expression selects the second <td> that has the class titleColumn. According to the node's inner html, this <td> hode has two child nodes: <a> and <span>. So you could easily find these nodes, and then put inner text (or inner html) into your string variables. See, this:
foreach (var row in doc.DocumentNode.SelectNodes("(//td[#class='titleColumn'])[2]"))
{
var a = row.SelectSingleNode("a");
var span = row.SelectSingleNode("span");
Console.WriteLine(a.InnerText);
Console.WriteLine(span.InnerText);
}
will output:
Title
Year

HtmlAgilityPack Get all links inside a DIV

I want to be able to get 2 links from inside a div.
Currently I can select one but whene there's more it doesn't seem to work.
HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load(url);
HtmlNode node = doc.DocumentNode.SelectSingleNode("//div[#class='myclass']");
if (node != null)
{
foreach (HtmlNode type in node.SelectNodes("//a#href"))
{
recipe.type += type.InnerText;
}
}
else
recipe.type = "Error fetching type.";
Trying to get it from this piece of HTML:
<div class="myclass">
<h3>Not Relevant Header</h3>
This text,
and this text
</div>
Any help is appreciated, Thanks in advance.
var div = doc.DocumentNode.SelectSingleNode("//div[#class='myclass']");
if(div!=null)
{
var links = div.Descendants("a")
.Select(a => a.InnerText)
.ToList();
}
Use this XPath:
//div[#class = 'myclass']//a
It grabs all descendant a elements in div with class = 'myclass'.
And //a#href is incorrect XPath.
Use:
//div[contains(concat(' ', #class, ' '), ' myclass ')]//a
This selects any a element that is a descendant of any div whose class attribute contains a classname of "myclass".
The classname may be single, or the attribute may also contain other classnames. In this case the classname may be the starting one, or the last one or may be surrounded by other classnames -- the above XPath expression correctly selects the wanted nodes in all of these different cases.

Categories