Html Agility Pack - Remove element by id - c#

I'm trying to remove a specific piece of markup by element id with the help of Html Agility Pack. HTML:
<div id="id00">
<h1>Title</h1>
</div>
<div id="id10">
<div id="id11">
<h2>Title 2</h2>
<p>Some text</p>
</div>
<a id="idToRemove" href="#">Anchor text</a>
</div>
My method:
public static string RemoveElement(string html, string elementId)
{
elementId = "idToRemove";
HtmlDocument htmlDoc = new HtmlDocument();
htmlDoc.LoadHtml(html);
var node = htmlDoc.GetElementbyId(elementId);
node.Remove();
html = htmlDoc.Text;
return html;
}
Unfortunately it's not working at all.

It works, but htmlDoc.Text is the wrong property: it holds the source text as it was loaded, so the removal does not show up there. Use:
return htmlDoc.DocumentNode.OuterHtml;
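Putting the pieces together, a minimal sketch of the method might look like the following. The class name HtmlCleaner is hypothetical, and the null check is an addition that the original code does not have; it is there because GetElementbyId returns null when no element carries the requested id.

```csharp
using HtmlAgilityPack;

// Hypothetical wrapper class, named only for this illustration.
public static class HtmlCleaner
{
    public static string RemoveElement(string html, string elementId)
    {
        var htmlDoc = new HtmlDocument();
        htmlDoc.LoadHtml(html);

        // GetElementbyId returns null when no element has the given id,
        // so guard before calling Remove().
        HtmlNode node = htmlDoc.GetElementbyId(elementId);
        if (node != null)
        {
            node.Remove();
        }

        // Serialize the modified DOM instead of returning the loaded source text.
        return htmlDoc.DocumentNode.OuterHtml;
    }
}
```

Note that this version uses the elementId argument as passed in, rather than overwriting it inside the method.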

Html agility pack Addressing

Given this HTML:
<div class="contacts-list">
<h4 class="title">Contact</h4>
<div class="contact-phone">
<span class="icon"><i class="ee-phone"></i></span><span class="type">تلفن</span>
<span class="contact-data">
<a dir='auto' href='tel:05138946697'>05138946697</a> </span>
</div>
I have to extract the value of the "a" tag, but I must be sure it is inside a "div" tag with the "contact-phone" class.
I don't really understand how to do this. Can someone help me?
I ended up getting the value I need like this, using Html Agility Pack and XPath:
foreach (HtmlNode node in htmlDocument.DocumentNode.SelectNodes("//div[@class='contact-phone']/span[@class='contact-data']/a"))
{
value = node.InnerText;
}
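When only one value is expected, a variant with SelectSingleNode avoids the loop. This is a sketch under assumptions: the class and method names (ContactParser, ExtractPhone) are made up for illustration, and the null handling is an addition, since SelectSingleNode returns null when the XPath matches nothing.

```csharp
using HtmlAgilityPack;

// Hypothetical helper, named only for this illustration.
public static class ContactParser
{
    // Returns the text of the first matching <a>, or null when nothing matches.
    public static string ExtractPhone(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // The path only matches an <a> that sits inside span.contact-data,
        // which in turn sits directly inside div.contact-phone.
        HtmlNode link = doc.DocumentNode.SelectSingleNode(
            "//div[@class='contact-phone']/span[@class='contact-data']/a");

        return link?.InnerText.Trim();
    }
}
```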

using HtmlAgilityPack to select innerHtml

Let's say I have the following HTML document:
<div class=" wrap_body text_align_left" style="">
<div class="some"> hello </div>
<div class="someother"> world </div>
hello world
</div>
I want to extract this:
<div class="some"> hello </div>
<div class="someother"> world </div>
hello world
What is the best way to extract it using HtmlAgilityPack with C# or VB.NET?
This is my code so far, but I'm stuck.
Thanks!
For Each no As HtmlAgilityPack.HtmlNode In docs.DocumentNode.SelectNodes("//div[contains(@class,'wrap_body')]")
Dim attr As String = no.GetAttributeValue("wrap_body", "")
Next
Below is a sample for getting the inner HTML:
var html =
@"<body>
<div class='wrap_body text_align_left' style=''>
<div class='some'> hello </div>
<div class='someother'> world </div>
hello world
</div>
</body>";
var htmlDoc = new HtmlDocument();
htmlDoc.LoadHtml(html);
var htmlNodes = htmlDoc.DocumentNode.SelectNodes("//body/div");
foreach (var node in htmlNodes)
{
Console.WriteLine(node.InnerHtml);
}
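You can use the SelectNodes method of DocumentNode to retrieve specific nodes from the HTML.
using System;
using System.IO;
using System.Linq;
using HtmlAgilityPack;
class Program
{
static void Main(string[] args)
{
string htmlContent = File.ReadAllText(@"Your path to html file");
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(htmlContent);
// "/div" matches a div at the root of the document.
// SelectNodes returns null when nothing matches, so this assumes the div exists.
var innerContent = doc.DocumentNode.SelectNodes("/div").FirstOrDefault().InnerHtml;
Console.WriteLine(innerContent);
}
}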

Html Agility Pack Xpath

How can I use this xPath with Html Agility Pack?
XPath:
//div[@class='test']/(text())[last()]
I've tried this code:
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//div[@class='test']/(text())[last()]"))
{
test = node.InnerText();
}
Html:
<div class="test">
<ul>
<li><b>Test1</b>Test1 Text</li>
<li><b>Test2</b>Test2 Text</li>
</ul>
</div>
I need to extract "Test2 Text" without specifying the ul tag in the XPath.
You can try this XPath:
(//div[@class='test']//text()[normalize-space()])[last()]
//div[@class='test']//text()[normalize-space()] finds all non-empty text nodes within the div. Then [last()], applied to the parenthesized expression, returns only the last of those text nodes.
Working demo example:
var html = @"<div class='test'>
<ul>
<li><b>Test1</b>Test1 Text</li>
<li><b>Test2</b>Test2 Text</li>
</ul>
";
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);
HtmlNode node = doc.DocumentNode.SelectSingleNode("(//div[@class='test']//text()[normalize-space()])[last()]");
Console.WriteLine(node.InnerText);
Output:
Test2 Text
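The parentheses in that expression matter. The sketch below contrasts the two forms; the class and method names are hypothetical. In XPath 1.0, a [last()] predicate written directly on a step is evaluated once per parent node, while wrapping the whole path in parentheses applies it to the combined result.

```csharp
using HtmlAgilityPack;

// Hypothetical demo class contrasting the two placements of [last()].
public static class XPathLastDemo
{
    private static HtmlDocument Load(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        return doc;
    }

    // [last()] applied to the whole node-set: at most one node comes back.
    public static HtmlNode LastTextOverall(string html)
    {
        return Load(html).DocumentNode.SelectSingleNode(
            "(//div[@class='test']//text()[normalize-space()])[last()]");
    }

    // [last()] applied per step: selects the last non-empty text child
    // of every element under the div, so several nodes can come back.
    public static HtmlNodeCollection LastTextPerParent(string html)
    {
        return Load(html).DocumentNode.SelectNodes(
            "//div[@class='test']//text()[normalize-space()][last()]");
    }
}
```

Comparing the two results on the markup from the question is a quick way to see why the unparenthesized form does not narrow the selection to a single text node.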

HtmlAgilityPack extracts text from all divs in a page and not just from the one div specified in the code

I am seeing strange behaviour with an XPath expression in HtmlAgilityPack.
I'm trying to use HtmlAgilityPack to extract all the values within a div declared as
<div class='cont'>. However, when I use the code below I get all values within
<div class='cont'> AND <div class='button'>. Does anyone know why this is happening?
Here is the full code to reproduce it:
using System;
using System.Xml.XPath;
using HtmlAgilityPack;
namespace ConsoleApplication1
{
class Program
{
static void Main(string[] args)
{
const string text1 = @"<div class=""cont"">
<h3>content</h3>
<div style=""margin: 0cm 0cm 0pt"" class=""Normal"">content1</div><div style=""margin: 0cm 0cm 0pt"" class=""Normal""> content2</div>
<div style=""margin: 0cm 0cm 0pt"" class=""Normal"">content3 </div>
<div>content4 </div><strong>content5
<div>content6 </div><ul type=""disc"">
<div>content7 </div>
<div>content8 </div> </ul>
<p class='margin10'><font size=""2"">
<div>
<p><span style=""font-family: Arial"">content9</span></p>
</div>
<div>content10</font><u><font color=""#0000ff"" size=""2""><font color=""#0000ff"" size=""2""> content11 </u></font></font><font size=""2""> content12
<div>content13</div>
</div>
</font>
</p>
</div>
<div class=""button"">
<span class=""applybtn""><a class=""buttonGlobal buttonAlpha"" href=""/uk/job/apply/(id)/608735"">content14</a></span>
</div>";
foreach (XPathNavigator node in SearchInPage(text1, "//div[@class='cont']"))
{
Console.WriteLine("option " + node.Value);
}
}
private static XPathNodeIterator SearchInPage(string text, string xpath)
{
HtmlDocument htmlDocument = new HtmlDocument();
htmlDocument.LoadHtml(text);
XPathNavigator xpathNavigator = htmlDocument.CreateNavigator();
XPathNodeIterator nodes = xpathNavigator.Select(xpath);
return nodes;
}
}
}
The code returns:
'content', 'content1-13' PLUS 'content14' which exists within <div class='button'>
So if I'm understanding correctly, you want the values only for the child nodes of <div class="cont">?
Try this:
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html); // html holds the markup, e.g. text1 from the question
HtmlNode node = doc.DocumentNode.SelectSingleNode(".//div[@class='cont']");
foreach (HtmlNode childNode in node.ChildNodes)
{
Console.WriteLine(childNode.InnerText);
}
I don't have a way to debug this in front of me, but this should work. The XPath .//div[@class='cont'] should select only the specified node and its children, and ignore anything that lives outside it. The rest is plain HtmlAgilityPack. Remember that HtmlAgilityPack implements XPath itself, so look through its available methods before reaching for System.Xml.XPath. Also remember that XML and HTML are different languages, and what works for one won't necessarily work for the other.

how to get html div element innertext by id using regular expression in C#

I'm getting the full HTML of a page using WebClient, but I need to get a specified div out of it using a regular expression.
For example:
<body>
<div id="main">
<div id="left" style="float:left">this is a <b>left</b> side:<div style='color:red'> 1 </div>
</div>
<div id="right" style="float:left"> main side</div>
<div>
</body>
If I need the div with id 'main', the function should return:
<div id="left" style="float:left">this is a <b>left</b> side:<div style='color:red'> 1 </div>
</div>
<div id="right" style="float:left"> main side</div>
If I need the div with id 'left', the function should return:
this is a <b>left</b> side:<div style='color:red'> 1 </div>
If I need the div with id 'right', the function should return:
main side
How can I do this?
Why do people insist on trying to use regex to parse html? You can probably do it if you exclude a whole host of edge-cases... but just use HTML Agility Pack and you're done:
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(...); // or Load
string main = doc.DocumentNode.SelectSingleNode("//div[@id='main']").InnerHtml;
(note I'm assuming it is not xhtml; if it is xhtml, use XmlDocument or XDocument, and very similar code to the above)
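The question asks for a function that takes an id and returns that div's inner HTML. A minimal sketch of such a helper follows, assuming Html Agility Pack is referenced; the names DivReader and GetInnerHtmlById are hypothetical, and returning null for a missing id is a design choice made here, not something the question specifies.

```csharp
using HtmlAgilityPack;

// Hypothetical helper, named only for this illustration.
public static class DivReader
{
    // Returns the inner HTML of the element with the given id,
    // or null when the document contains no such element.
    public static string GetInnerHtmlById(string html, string id)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // GetElementbyId returns null when the id is not found.
        HtmlNode node = doc.GetElementbyId(id);
        return node?.InnerHtml;
    }
}
```

Because the parser builds a tree, nested divs inside the requested element come back as part of its inner HTML.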
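string divname = "somename";
// Allow a quoted id value and let '.' span line breaks.
Match m = Regex.Match(htmlContent, "<div[^>]*id=[\"']" + Regex.Escape(divname) + "[\"'][^>]*>(.*?)</div", RegexOptions.Singleline);
string content = m.Groups[1].Value;
Note that this won't work if you have nested divs inside the desired div.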
