How to remove all children nodes of selected node - html-agility-pack

I want to remove all child nodes of this particular node.
Here is the node's source:
<div class="Price fs30 clr8">
7,
<span class="PriceCurrency">73 TL
<span class="kdv">KDV Dahil</span>
</span>
<div class="SaleDiv">
%15
<span>İndirim</span>
</div>
</div>
I want to remove all span children and div children; in fact, every child element under the node.
After removing these children, I should get 7, as the InnerText of the selected node.
Thank you very much for any answers.
c# .net 4.5 wpf

If you mean to keep only the text nodes within the outer <div>, you can select all child elements using the star XPath selector (*) and remove them. Here is an example as a console application:
var html = @"<div class=""Price fs30 clr8"">
7,
<span class=""PriceCurrency"">73 TL
<span class=""kdv"">KDV Dahil</span>
</span>
<div class=""SaleDiv"">
%15
<span>İndirim</span>
</div>
</div>";
var doc = new HtmlDocument();
doc.LoadHtml(html);
var div = doc.DocumentNode.SelectSingleNode("//div[@class='Price fs30 clr8']");
foreach (HtmlNode node in div.SelectNodes("*"))
{
node.Remove();
}
var innerText = div.InnerText.Trim();
Console.WriteLine(innerText);

Related

Html agility pack Addressing

In this HTML:
<div class="contacts-list">
<h4 class="title">Contact</h4>
<div class="contact-phone">
<span class="icon"><i class="ee-phone"></i></span><span class="type">تلفن</span>
<span class="contact-data">
<a dir='auto' href='tel:05138946697'>05138946697</a> </span>
</div>
I have to extract the value of the "a" tag, but I must be sure it is inside a "div" tag with the "contact-phone" class.
I don't really understand how to do this. Can someone help me?

I got the value I need using the HTML Agility Pack and XPath, like this:
foreach (HtmlNode node in htmlDocument.DocumentNode.SelectNodes("//div[@class='contact-phone']/span[@class='contact-data']/a"))
{
value = node.InnerText;
}

Html Agility Pack Xpath

How can I use this XPath with Html Agility Pack?
XPath:
//div[@class='test']/(text())[last()]
I've tried this code:
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//div[@class='test']/(text())[last()]"))
{
test = node.InnerText();
}
Html:
<div class="test">
<ul>
<li><b>Test1</b>Test1 Text</li>
<li><b>Test2</b>Test2 Text</li>
</ul>
</div>
I need to extract "Test2 Text" without specifying the ul tag in the XPath.

You can try using this XPath:
(//div[@class='test']//text()[normalize-space()])[last()]
//div[@class='test']//text()[normalize-space()] finds all non-empty text nodes within the div. Then [last()] returns only the last node from all the text nodes found.
Working demo example:
var html = @"<div class='test'>
<ul>
<li><b>Test1</b>Test1 Text</li>
<li><b>Test2</b>Test2 Text</li>
</ul>
";
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);
HtmlNode node = doc.DocumentNode.SelectSingleNode("(//div[@class='test']//text()[normalize-space()])[last()]");
Console.WriteLine(node.InnerText);
Output:
Test2 Text

fetching span value from html document

I have the following XPath, obtained with a Firefox XPath plugin:
id('some_id')/x:ul/x:li[4]/x:span
Using Html Agility Pack I am able to fetch id('some_id')/x:ul/x:li[4]:
htmlDoc.DocumentNode.SelectNodes(@"//div[@id='some_id']/ul/li[4]").FirstOrDefault();
but I don't know how to get the value of this span.
Update:
<div id="some_id">
<ul>
<li></li>
<li></li>
<li></li>
<li>
Some text
<span>text I want to grab</span>
</li>
</ul>
</div>

You don't need LINQ to XML to parse HTML; Html Agility Pack is made for this, and it makes getting the node easier:
var html = @" <div id=""some_id"">
<ul>
<li></li>
<li></li>
<li></li>
<li>
Some text
<span>text I want to grab</span>
</li>
</ul>
</div>";
var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html);
var value = doc.DocumentNode.SelectSingleNode("div[@id='some_id']/ul/li/span").InnerText;
Console.WriteLine(value);

An alternative approach (without html-agility-pack) would be to use LINQ to XML. You can use the XDocument.Descendants method to find the span element and read its value:
var xml = @" <div id=""some_id"">
<ul>
<li></li>
<li></li>
<li></li>
<li>
Some text
<span>text I want to grab</span>
</li>
</ul>
</div>";
var doc = XDocument.Parse(xml);
Console.WriteLine(doc.Root.Descendants("span").FirstOrDefault().Value);
The code can be extended to check that the div element has the matching id, using the XElement.Attribute method:
var doc = XDocument.Parse(xml);
// Casting the attribute to string yields null instead of throwing when a div has no id.
Console.WriteLine(doc.Elements("div").Where(e => (string)e.Attribute("id") == "some_id").Descendants("span").FirstOrDefault().Value);
One drawback of this solution is that the markup (HTML, XHTML) must be well-formed XML, with every tag properly closed, or else the parsing will fail.
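To illustrate that drawback, here is a minimal sketch showing that XDocument.Parse throws an XmlException on markup with an unclosed tag, which an HTML parser would normally tolerate. It assumes C# 9 or later (top-level statements); the helper name IsWellFormed and the sample markup are made up for illustration.

```csharp
using System;
using System.Xml;
using System.Xml.Linq;

// Hypothetical helper: returns true when the markup parses as well-formed XML.
static bool IsWellFormed(string markup)
{
    try
    {
        XDocument.Parse(markup);
        return true;
    }
    catch (XmlException)
    {
        // Thrown for unclosed or mismatched tags, among other syntax errors.
        return false;
    }
}

string closed = "<ul><li>one</li><li>two</li></ul>";
string unclosed = "<ul><li>one<li>two</ul>"; // typical "HTML style" omission of </li>

Console.WriteLine(IsWellFormed(closed));   // True
Console.WriteLine(IsWellFormed(unclosed)); // False
```

If the input comes from real-world web pages, checking it like this first (or simply using Html Agility Pack) avoids an unhandled exception.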

Cannot find specific XML elements in XML Document

I just ran into a head scratcher, I'm not quite sure why this does not work. I want to find all the elements with the attribute "video".
My XML document looks like this:
<MainMenu>
<div id="BroughtInMenu">
<div class="menuItem0">
Menu Item
<div class="subMenu0">
<div class="menuItem1">
Dictation
<div class="subMenu1">
<div class="menuItem2" video="1">Fee Earner</div>
<div class="menuItem2" video="1">Secretary</div>
<div class="menuItem2" video="1">View File History</div>
</div>
</div>
<div class="menuItem1">
PM Advanced Agenda
<div class="subMenu1">
<div class="menuItem2">
Help
<div class="subMenu2">
<div class="menuItem3" video="1">Release Notes</div>
</div>
</div>
<div class="menuItem2">
System Maintenance
<div class="subMenu2">
<div class="menuItem3" video="1">Additional Field Setup</div>
<div class="menuItem3" video="1">Role Permission Maintenance</div>
<div class="menuItem3" video="1">Shared Diary Permissions</div>
</div>
</div>
<div class="menuItem2">
Utilities
<div class="subMenu2">
<div class="menuItem3" video="1">Change Entity Subtype</div>
<div class="menuItem3" video="1">Field Maintenance</div>
<div class="menuItem3" video="1">Move Client and Files to Fee Earner</div>
<div class="menuItem3" video="1">Reallocate Files</div>
</div>
</div>
</div>
</div> . . . . . . . . . . . . . . . . ..
This is very similar to HTML. It is for a website, so in the end I want to get all the elements with the attribute "video".
If I can do this, I will grab only the div elements with the "video" attribute and use them for something else, such as a search where I query the XML document and return the matching div.
Because the video attribute is going to point to a location, it will be very useful to jump straight to the video when the div is clicked.
So far I have tried this, but I am not getting any elements at all:
XElement xDoc = XElement.Load(Server.MapPath("automation/xml/mainMenu.xml"));
IEnumerable<XElement> list = from el in xDoc.Elements("div") where el.Attribute("video") != null select el;
foreach (XElement element in list)
{
//Nothing found?
}
I also thought about regex. Maybe a regex could pull out the divs I want, already in text form, so that I can just push them into an HTML element on the website?
Any help will be greatly appreciated!

Use Descendants instead of Elements. Elements returns only the immediate children.
var xDoc = XElement.Load(Server.MapPath("automation/xml/mainMenu.xml"));
var list = from el in xDoc.Descendants("div")
where el.Attribute("video") != null
select el;
foreach (XElement element in list)
{
// Each element here is a div that has a "video" attribute.
}

You can select elements where a particular attribute is present with XPath. To use the XPath extension methods, you need to include the namespace.
using System.Xml.XPath;
An XPath such as "//div[@video]" matches "div" elements at any level but keeps only those with a "video" attribute, so you are not looping through lots of elements just to check for the presence of an attribute.
var xDoc = XElement.Load(Server.MapPath("automation/xml/mainMenu.xml"));
foreach (var divWithVideo in xDoc.XPathSelectElements("//div[@video]")) {
Console.WriteLine(divWithVideo);
}
Here you are only iterating on the elements with a "video" attribute.
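As a self-contained sketch of the two approaches above, the snippet below runs both the Descendants query and the XPath query against a small inline document, so it needs no file or Server.MapPath. The inline XML is a shortened, made-up stand-in for mainMenu.xml, and the code assumes C# 9 or later (top-level statements).

```csharp
using System;
using System.Linq;
using System.Xml.Linq;
using System.Xml.XPath;

// Shortened, illustrative stand-in for the menu file from the question.
var menu = XElement.Parse(@"<MainMenu>
  <div class=""menuItem0"">
    <div class=""menuItem1"" video=""1"">Fee Earner</div>
    <div class=""menuItem1"">Help
      <div class=""menuItem2"" video=""1"">Release Notes</div>
    </div>
  </div>
</MainMenu>");

// LINQ to XML: walk every descendant div and keep those with a video attribute.
var viaLinq = menu.Descendants("div")
                  .Where(e => e.Attribute("video") != null)
                  .ToList();

// XPath: the [@video] predicate does the filtering inside the query.
var viaXPath = menu.XPathSelectElements("//div[@video]").ToList();

// Elements("div") only sees the single top-level div, which has no video attribute.
var viaElements = menu.Elements("div")
                      .Where(e => e.Attribute("video") != null)
                      .ToList();

Console.WriteLine(viaLinq.Count);     // 2
Console.WriteLine(viaXPath.Count);    // 2
Console.WriteLine(viaElements.Count); // 0
```

The zero from Elements mirrors the original problem: the video divs are nested several levels down, so only a descendant search finds them.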

HtmlAgilityPack extracts text from all divs in a page and not just from the one div specified in the code

I am seeing strange behaviour with an XPath expression in HtmlAgilityPack.
I'm trying to use HtmlAgilityPack to extract all the values within a div declared as <div class='cont'>. However, when I use the code below I get all the values within both <div class='cont'> and <div class='button'>. Does anyone know why this is happening?
Here is the full code to reproduce it:
using System;
using System.Xml.XPath;
using HtmlAgilityPack;
namespace ConsoleApplication1
{
class Program
{
static void Main(string[] args)
{
const string text1 = @"<div class=""cont"">
<h3>content</h3>
<div style=""margin: 0cm 0cm 0pt"" class=""Normal"">content1</div><div style=""margin: 0cm 0cm 0pt"" class=""Normal""> content2</div>
<div style=""margin: 0cm 0cm 0pt"" class=""Normal"">content3 </div>
<div>content4 </div><strong>content5
<div>content6 </div><ul type=""disc"">
<div>content7 </div>
<div>content8 </div> </ul>
<p class='margin10'><font size=""2"">
<div>
<p><span style=""font-family: Arial"">content9</span></p>
</div>
<div>content10</font><u><font color=""#0000ff"" size=""2""><font color=""#0000ff"" size=""2""> content11 </u></font></font><font size=""2""> content12
<div>content13</div>
</div>
</font>
</p>
</div>
<div class=""button"">
<span class=""applybtn""><a class=""buttonGlobal buttonAlpha"" href=""/uk/job/apply/(id)/608735"">content14</a></span>
</div>";
foreach (XPathNavigator node in SearchInPage(text1, "//div[@class='cont']"))
{
Console.WriteLine("option " + node.Value);
}
}
private static XPathNodeIterator SearchInPage(string text, string xpath)
{
HtmlDocument htmlDocument = new HtmlDocument();
htmlDocument.LoadHtml(text);
XPathNavigator xpathNavigator = htmlDocument.CreateNavigator();
XPathNodeIterator nodes = xpathNavigator.Select(xpath);
return nodes;
}
}
}
The code returns:
'content', 'content1-13' PLUS 'content14' which exists within <div class='button'>

So if I'm understanding correctly, you want the values of only the child nodes of <div class="cont">?
Try this:
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html); // html is the markup string (text1 in the question)
HtmlNode node = doc.DocumentNode.SelectSingleNode(".//div[@class='cont']");
foreach (HtmlNode childNode in node.ChildNodes)
{
Console.WriteLine(childNode.InnerText);
}
I don't have a way to debug this in front of me, but it should work. The XPath ".//div[@class='cont']" should select only the specified node and its children and ignore anything that lives outside it. The rest is just LINQ and HtmlAgilityPack. HtmlAgilityPack implements XPath, so look through its available methods before writing XPath by hand, and remember that XML and HTML are different languages: what works for one won't necessarily work for the other.
