Html Agility Pack Xpath - c#

How can I use this XPath with Html Agility Pack?
XPath:
//div[@class='test']/(text())[last()]
I've tried this code:
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//div[@class='test']/(text())[last()]"))
{
    test = node.InnerText;
}
Html:
<div class="test">
<ul>
<li><b>Test1</b>Test1 Text</li>
<li><b>Test2</b>Test2 Text</li>
</ul>
</div>
I need to extract "Test2 Text" without specifying the ul tag in the XPath.

You can try this XPath:
(//div[@class='test']//text()[normalize-space()])[last()]
//div[@class='test']//text()[normalize-space()] finds all non-empty text nodes within the div. The outer [last()] then returns only the last node of that whole set.
Working demo:
var html = @"<div class='test'>
<ul>
<li><b>Test1</b>Test1 Text</li>
<li><b>Test2</b>Test2 Text</li>
</ul>
</div>";
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);
HtmlNode node = doc.DocumentNode.SelectSingleNode("(//div[@class='test']//text()[normalize-space()])[last()]");
Console.WriteLine(node.InnerText);
Output:
Test2 Text
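To see why the parentheses matter, here is a minimal sketch that evaluates the expression with and without them. It assumes System.Xml's XmlDocument as a stand-in for Html Agility Pack (so no NuGet package is needed) and a well-formed copy of the markup; LastTextDemo, Markup and Select are hypothetical names, not part of either library.

```csharp
using System;
using System.Xml;

public static class LastTextDemo
{
    // Well-formed copy of the question's markup (assumption: no whitespace
    // between tags, so there are no whitespace-only text nodes).
    public const string Markup =
        "<div class='test'><ul>" +
        "<li><b>Test1</b>Test1 Text</li>" +
        "<li><b>Test2</b>Test2 Text</li>" +
        "</ul></div>";

    // Returns the value of every node matched by the XPath, in document order.
    public static string[] Select(string xpath)
    {
        var doc = new XmlDocument();
        doc.LoadXml(Markup);
        XmlNodeList nodes = doc.SelectNodes(xpath);
        var result = new string[nodes.Count];
        for (int i = 0; i < nodes.Count; i++)
        {
            result[i] = nodes[i].Value;
        }
        return result;
    }
}
```

Without parentheses, LastTextDemo.Select("//div[@class='test']//text()[last()]") applies [last()] per parent element and returns four nodes here (the last text child of each b and li). With parentheses, LastTextDemo.Select("(//div[@class='test']//text())[last()]") applies [last()] to the combined set and returns only "Test2 Text".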

Related

Html Agility Pack - Remove element by id

I'm trying to remove a specific piece of markup by element id with the help of Html Agility Pack. Html:
<div id="id00">
<h1>Title</h1>
</div>
<div id="id10">
<div id="id11">
<h2>Title 2</h2>
<p>Some text</p>
</div>
<a id="idToRemove" href="#">Anchor text</a>
</div>
My method:
public static string RemoveElement(string html, string elementId)
{
    elementId = "idToRemove";
    HtmlDocument htmlDoc = new HtmlDocument();
    htmlDoc.LoadHtml(html);
    var node = htmlDoc.GetElementbyId(elementId);
    node.Remove();
    html = htmlDoc.Text;
    return html;
}
Unfortunately it's not working at all.

The node is removed, but htmlDoc.Text is the wrong property to read the result from. Use:
return htmlDoc.DocumentNode.OuterHtml;

fetching span value from html document

I have the following XPath, fetched using a Firefox XPath plugin:
id('some_id')/x:ul/x:li[4]/x:span
Using Html Agility Pack I'm able to fetch id('some_id')/x:ul/x:li[4]:
htmlDoc.DocumentNode.SelectNodes(@"//div[@id='some_id']/ul/li[4]").FirstOrDefault();
but I don't know how to get the span value.
update
<div id="some_id">
<ul>
<li></li>
<li></li>
<li></li>
<li>
Some text
<span>text I want to grab</span>
</li>
</ul>
</div>

You don't need to parse the HTML with LINQ2XML. HtmlAgilityPack is built for this, and it is easier to get the node this way:
var html = @" <div id=""some_id"">
<ul>
<li></li>
<li></li>
<li></li>
<li>
Some text
<span>text I want to grab</span>
</li>
</ul>
</div>";
var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html);
var value = doc.DocumentNode.SelectSingleNode("div[@id='some_id']/ul/li/span").InnerText;
Console.WriteLine(value);

An alternative approach (without html-agility-pack) would be to use LINQ2XML. You can use the XDocument.Descendants method to find the span element and read its value:
var xml = @" <div id=""some_id"">
<ul>
<li></li>
<li></li>
<li></li>
<li>
Some text
<span>text I want to grab</span>
</li>
</ul>
</div>";
var doc = XDocument.Parse(xml);
Console.WriteLine(doc.Root.Descendants("span").FirstOrDefault().Value);
The code can be extended to check that the div element has the matching id, using the XElement.Attribute method:
var doc = XDocument.Parse(xml);
Console.WriteLine(doc.Elements("div").Where(e => e.Attribute("id").Value == "some_id").Descendants("span").FirstOrDefault().Value);
One drawback of this solution is that the XML structure (HTML, XHTML) needs to be properly closed or else the parsing will fail.
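The drawback above can be handled explicitly. The following is a minimal sketch, not a definitive implementation: it catches the XmlException that XDocument.Parse throws on markup that is not well-formed, and it casts the attribute to string so a div without an id does not cause a NullReferenceException. SpanReader and FirstSpanText are hypothetical names introduced for illustration.

```csharp
using System;
using System.Linq;
using System.Xml;
using System.Xml.Linq;

public static class SpanReader
{
    // Returns the text of the first <span> under <div id="divId">,
    // or null if the markup is not well-formed XML or nothing matches.
    public static string FirstSpanText(string markup, string divId)
    {
        XDocument doc;
        try
        {
            doc = XDocument.Parse(markup);
        }
        catch (XmlException)
        {
            // Unclosed or mismatched tags end up here.
            return null;
        }

        return doc.Descendants("div")
                  // The cast yields null when the id attribute is missing.
                  .Where(d => (string)d.Attribute("id") == divId)
                  .Descendants("span")
                  .Select(s => s.Value)
                  .FirstOrDefault();
    }
}
```

For example, SpanReader.FirstSpanText(xml, "some_id") returns "text I want to grab" for the markup above, and returns null for markup with an unclosed li element instead of throwing.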

How to remove all children nodes of selected node - html-agility-pack

I want to remove all child nodes of a particular node.
Here is the node's source:
<div class="Price fs30 clr8">
7,
<span class="PriceCurrency">73 TL
<span class="kdv">KDV Dahil</span>
</span>
<div class="SaleDiv">
%15
<span>İndirim</span>
</div>
</div>
So I want to remove all span and div children, in fact every child element under the node.
After removing these children, the InnerText of the selected node should be "7,".
Thanks very much for any answers.
c# .net 4.5 wpf

If you want to keep only the text nodes directly within the outer <div>, you can select all of its child elements with the star XPath selector (*) and remove them. Here is an example as a console application:
var html = @"<div class=""Price fs30 clr8"">
7,
<span class=""PriceCurrency"">73 TL
<span class=""kdv"">KDV Dahil</span>
</span>
<div class=""SaleDiv"">
%15
<span>İndirim</span>
</div>
</div>";
var doc = new HtmlDocument();
doc.LoadHtml(html);
var div = doc.DocumentNode.SelectSingleNode("//div[@class='Price fs30 clr8']");
foreach (HtmlNode node in div.SelectNodes("*"))
{
    node.Remove();
}
var innerText = div.InnerText.Trim();
Console.WriteLine(innerText);
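If the goal is only to read the text rather than to change the document, a non-destructive option is to select the div's direct text-node children with text() and concatenate them. The sketch below uses System.Xml's XmlDocument as a stand-in and assumes well-formed markup; the same relative XPath should also work with HtmlNode.SelectNodes, but that is an assumption not verified here. DirectText and Of are hypothetical names.

```csharp
using System;
using System.Text;
using System.Xml;

public static class DirectText
{
    // Concatenates only the text nodes that are direct children of the
    // element matched by elementXPath, leaving the document untouched.
    // Returns null when no element matches.
    public static string Of(string markup, string elementXPath)
    {
        var doc = new XmlDocument();
        doc.LoadXml(markup);

        XmlNode element = doc.SelectSingleNode(elementXPath);
        if (element == null)
        {
            return null;
        }

        var sb = new StringBuilder();
        // "text()" is relative to the element, so nested spans and divs
        // contribute nothing.
        foreach (XmlNode text in element.SelectNodes("text()"))
        {
            sb.Append(text.Value);
        }
        return sb.ToString().Trim();
    }
}
```

Called with the markup above and "//div[@class='Price fs30 clr8']", this returns "7," without removing any nodes.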

how to remove div content from html and place the same on top of all divs using html agility pack

I have a problem with Html Agility Pack: I am unable to remove a div from the HTML and place the same content above all the other divs. For example:
<body>
<div class="1">...</div>
<div class="2">...</div>
<div class="3">...</div>
</body>
Now I want to remove the third div and place it above the first div.
Any help would be great. Thanks!

You should try this code:
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml("<html><body><div class=\"1\">...</div><div class=\"2\">...</div><div class=\"3\">...</div></body></html>");
HtmlNode body = doc.DocumentNode.SelectSingleNode("/html/body");
HtmlNode div = body.SelectSingleNode("div[@class='3']");
if (div != null) {
    div.Remove();
    body.InsertBefore(div, body.FirstChild);
}

HtmlAgilityPack extracts text from all divs in a page and not just from the one div specified in the code

I am seeing strange behaviour with an XPath expression in HtmlAgilityPack.
I'm trying to use HtmlAgilityPack to extract all the values within a div declared as <div class='cont'>. However, when I use the code below I get all values within <div class='cont'> AND <div class='button'>. Does anyone know why this is happening?
Here is the full code to reproduce it:
using System;
using System.Xml.XPath;
using HtmlAgilityPack;
namespace ConsoleApplication1
{
    class Program
    {
        static void Main(string[] args)
        {
            const string text1 = @"<div class=""cont"">
<h3>content</h3>
<div style=""margin: 0cm 0cm 0pt"" class=""Normal"">content1</div><div style=""margin: 0cm 0cm 0pt"" class=""Normal""> content2</div>
<div style=""margin: 0cm 0cm 0pt"" class=""Normal"">content3 </div>
<div>content4 </div><strong>content5
<div>content6 </div><ul type=""disc"">
<div>content7 </div>
<div>content8 </div> </ul>
<p class='margin10'><font size=""2"">
<div>
<p><span style=""font-family: Arial"">content9</span></p>
</div>
<div>content10</font><u><font color=""#0000ff"" size=""2""><font color=""#0000ff"" size=""2""> content11 </u></font></font><font size=""2""> content12
<div>content13</div>
</div>
</font>
</p>
</div>
<div class=""button"">
<span class=""applybtn""><a class=""buttonGlobal buttonAlpha"" href=""/uk/job/apply/(id)/608735"">content14</a></span>
</div>";
            foreach (XPathNavigator node in SearchInPage(text1, "//div[@class='cont']"))
            {
                Console.WriteLine("option " + node.Value);
            }
        }
        private static XPathNodeIterator SearchInPage(string text, string xpath)
        {
            HtmlDocument htmlDocument = new HtmlDocument();
            htmlDocument.LoadHtml(text);
            XPathNavigator xpathNavigator = htmlDocument.CreateNavigator();
            XPathNodeIterator nodes = xpathNavigator.Select(xpath);
            return nodes;
        }
    }
}
The code returns:
'content', 'content1-13' PLUS 'content14' which exists within <div class='button'>

So if I'm understanding correctly, you want the values of only the child nodes of <div class="cont">?
Try this:
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html); // html is the markup string
HtmlNode node = doc.DocumentNode.SelectSingleNode(".//div[@class='cont']");
foreach (HtmlNode childNode in node.ChildNodes)
{
    Console.WriteLine(childNode.InnerText);
}
I don't have a way to debug this in front of me, but it should work. The expression ".//div[@class='cont']" should select only the specified node and its children and ignore anything outside it. The rest is just LINQ and HtmlAgilityPack. HtmlAgilityPack implements XPath, so look through its available methods before writing XPath by hand, and remember that XML and HTML are different languages: what works for one won't necessarily work for the other.
