Using HtmlAgilityPack to select InnerHtml - C#

Let's say I have the following HTML document:
<div class=" wrap_body text_align_left" style="">
<div class="some"> hello </div>
<div class="someother"> world </div>
hello world
</div>
I want to extract this:
<div class="some"> hello </div>
<div class="someother"> world </div>
hello world
What is the best way to extract it using HtmlAgilityPack with C# or VB.NET?
This is my code so far, but I am stuck.
Thanks!
For Each no As HtmlAgilityPack.HtmlNode In docs.DocumentNode.SelectNodes("//div[contains(@class,'wrap_body')]")
Dim attr As String = no.GetAttributeValue("wrap_body", "")
Next

Below is a sample for getting the inner HTML:
var html =
@"<body>
<div class='wrap_body text_align_left' style=''>
<div class='some'> hello </div>
<div class='someother'> world </div>
hello world
</div>
</body>";
var htmlDoc = new HtmlDocument();
htmlDoc.LoadHtml(html);
var htmlNodes = htmlDoc.DocumentNode.SelectNodes("//body/div");
foreach (var node in htmlNodes)
{
Console.WriteLine(node.InnerHtml);
}

You can use the SelectNodes method of DocumentNode to retrieve specific nodes from the HTML.
class Program
{
static void Main(string[] args)
{
string htmlContent = File.ReadAllText(@"Your path to html file");
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(htmlContent);
var innerContent = doc.DocumentNode.SelectNodes("/div").FirstOrDefault().InnerHtml;
Console.WriteLine(innerContent);
}
}
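By default SelectNodes returns null when nothing matches, so chaining FirstOrDefault() straight onto it can throw. Below is a minimal sketch that selects the wrapper div by its class and guards against a missing node. The names InnerHtmlExample and GetWrapBodyInnerHtml are made up for illustration, and the HtmlAgilityPack NuGet package is assumed to be referenced.

```csharp
using HtmlAgilityPack;

public static class InnerHtmlExample
{
    // Returns the inner HTML of the first div whose class attribute contains
    // "wrap_body", or null when no such div exists.
    // (Hypothetical helper; not part of HtmlAgilityPack.)
    public static string GetWrapBodyInnerHtml(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectSingleNode returns null when the XPath matches nothing.
        HtmlNode wrapper = doc.DocumentNode.SelectSingleNode(
            "//div[contains(@class,'wrap_body')]");

        // The null-conditional operator avoids a NullReferenceException.
        return wrapper?.InnerHtml;
    }
}
```

A caller could write Console.WriteLine(InnerHtmlExample.GetWrapBodyInnerHtml(htmlContent) ?? "(no match)"); to print either the extracted markup or a fallback message.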

Related

Html Agility Pack - Remove element by id

I'm trying to remove a specific piece of markup by element id with the help of Html Agility Pack. HTML:
<div id="id00">
<h1>Title</h1>
</div>
<div id="id10">
<div id="id11">
<h2>Title 2</h2>
<p>Some text</p>
</div>
<a id="idToRemove" href="#">Anchor text</a>
</div>
My method:
public static string RemoveElement(string html, string elementId)
{
elementId = "idToRemove";
HtmlDocument htmlDoc = new HtmlDocument();
htmlDoc.LoadHtml(html);
var node = htmlDoc.GetElementbyId(elementId);
node.Remove();
html = htmlDoc.Text;
return html;
}
Unfortunately it's not working at all.
It works, but htmlDoc.Text is the wrong property: it holds the source text as it was loaded, so the removal does not show up there. Use:
return htmlDoc.DocumentNode.OuterHtml;
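Putting the fix together, here is a minimal sketch of the whole method. It also guards against GetElementbyId returning null when no element has the requested id, which the original method does not handle. The class and method names are illustrative only, and the HtmlAgilityPack NuGet package is assumed to be referenced.

```csharp
using HtmlAgilityPack;

public static class RemoveElementExample
{
    // Removes the element with the given id (if present) and returns the
    // re-serialized document. (Hypothetical helper for illustration.)
    public static string RemoveElementById(string html, string elementId)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // GetElementbyId returns null when no element carries that id,
        // so the removal is skipped in that case.
        HtmlNode node = doc.GetElementbyId(elementId);
        node?.Remove();

        // OuterHtml is generated from the current node tree, so it
        // reflects the removal.
        return doc.DocumentNode.OuterHtml;
    }
}
```

Unlike the method in the question, this sketch does not overwrite the elementId argument, so the caller decides which element to remove.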

HTMLAgilityPack C#, How to extract text from nested Tags in DIV

I have this HTML code where I want to extract the date from:
<div id="footer">
<div style="font-size:smaller">
Added in:
<strong>
07/06/2021 2:15:36 PM
</strong>
</div>
</div>
This is my C# HtmlAgilityPack code:
doc.DocumentNode.SelectSingleNode("//div[@id='footer']").InnerText
doc.DocumentNode.SelectSingleNode("//div[@id='footer']/div/strong").InnerText
Update:
Full code:
var html ="<div id=\"footer\"><div style=\"font-size:smaller\"> Added in:<strong> 07/06/2021 2:15:36 PM </strong></div></div>";
var doc = new HtmlDocument();
doc.LoadHtml(html);
var time = doc.DocumentNode.SelectSingleNode("//div[@id='footer']/div/strong").InnerText;
With that, I extracted the date.

XPath query not working (need to find by text)

Hello, I am working with HtmlAgilityPack and have this problem: all the elements I need have the same structure and the same class, except for the text of the span. As in the code below, I have spans with the text Amount and Date, so I need to build an XPath like this:
"//span(with text=Amount)[div and contains(@class,'detailsValue ')]");
I need to get 1,700,000.00 from the div next to the span with the text 'Amount', and 14.04.2014 from the div next to the span with the text 'Date'.
Any ideas?
This is what I have now:
List<string> OriginalAmount = GetListDataFromHtmlSourse(PageData, "//span[div and contains(@class,'detailsValue ')]");
private static List<string> GetListDataFromHtmlSourse(string HtmlSourse, string link)
{
List<string> data = new List<string>();
HtmlAgilityPack.HtmlDocument DocToParse = new HtmlAgilityPack.HtmlDocument();
DocToParse.LoadHtml(HtmlSourse);
foreach (HtmlNode node in DocToParse.DocumentNode.SelectNodes(link))
{
if (node.InnerText != null) data.Add(node.InnerText);
}
return data;
}
<div class=" depositDetails cellHeight float " style="height: 37px;">
<span class=" detailsName darkgray ">Amount</span>
<br>
<div class="detailsValue float" style="direction:rtl">1,700,000.00 </div>
</div>
</div>
<div class="BoxCellHeight float">
<div class="cellHeight separatorvertical float" style="height: 46px;"> </div>
<div class=" depositDetails cellHeight float " style="height: 40px;">
<span class=" detailsName darkgray ">Date</span>
<br>
<div class="detailsValue float">14.04.2014</div>
</div>
</div>
Actually, the question is not very clear. How about this:
//span[.='Amount']/following-sibling::div[contains(@class,'detailsValue')]
The XPath above finds the <span> element whose text equals "Amount", then selects its following <div> sibling whose class contains "detailsValue".
UPDATE:
According to your comment, if I understand it correctly, you want both values (the div after the Amount span and the div after the Date span). Try this XPath:
//span[.='Amount' or .='Date']/following-sibling::div[contains(@class, 'detailsValue')]
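The combined XPath returns the values without saying which label each one belongs to. A minimal sketch of one way to keep them paired is to run the following-sibling query once per label. It swaps in normalize-space(.) for the plain .= comparison so that stray whitespace inside the span does not break the match. The class and method names are hypothetical, and the HtmlAgilityPack NuGet package is assumed to be referenced.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack;

public static class DetailsValueExample
{
    // Maps each requested label (e.g. "Amount") to the trimmed text of the
    // detailsValue div that follows the span carrying that label.
    // (Hypothetical helper for illustration.)
    public static Dictionary<string, string> GetValuesByLabel(string html, params string[] labels)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var result = new Dictionary<string, string>();
        foreach (string label in labels)
        {
            // Assumption: labels never contain a single quote, so they can
            // be embedded in the XPath string literal as-is.
            string xpath = "//span[normalize-space(.)='" + label + "']" +
                "/following-sibling::div[contains(@class,'detailsValue')]";

            // SelectSingleNode returns null when the label is not found;
            // such labels are simply left out of the result.
            HtmlNode valueNode = doc.DocumentNode.SelectSingleNode(xpath);
            if (valueNode != null)
            {
                result[label] = valueNode.InnerText.Trim();
            }
        }
        return result;
    }
}
```

Calling DetailsValueExample.GetValuesByLabel(PageData, "Amount", "Date") returns a dictionary keyed by label, so the amount and the date can be read separately.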

Linq to XML - Render CDATA as HTML

I have the following XML:
<stories>
<story id="1234">
<title>This is a title</title>
<date>1/1/1980</date>
<article>
<![CDATA[<p>This is an article.</p>]]>
</article>
</story>
</stories>
And the following Linq to XML code in C#:
@{
XDocument xmlDoc = XDocument.Load("foo.xml");
var stories = from story in xmlDoc.Descendants("stories")
.Descendants("story")
.OrderByDescending(s => (string)s.Attribute("id"))
select new
{
title = story.Element("title").Value,
date = story.Element("date").Value,
article = story.Element("article").Value,
};
foreach (var story in stories)
{
<text><div class="news_item">
<span class="title">@story.title</span>
<span class="date">@story.date</span>
<div class="story">@story.article</div>
</div></text>
}
}
The rendered HTML is output to the browser as:
<div class="news_item">
<span class="title">This is a title</span>
<span class="date">1/1/1980</span>
<div class="story">&lt;p&gt;This is an article.&lt;/p&gt;</div>
</div>
I want the <p> tag rendered as HTML to the browser, not encoded. How do I accomplish this?
Razor HTML-encodes values by default. Use the Html.Raw helper to avoid that (see "Html.Raw() in ASP.NET MVC Razor view"):
<div class="story">@Html.Raw(story.article)</div>

HtmlAgilityPack extracts text from all divs in a page and not just from the one div specified in the code

I am seeing strange behaviour with an XPath expression in HtmlAgilityPack.
I'm trying to use HtmlAgilityPack to extract all the values within a div declared as <div class='cont'>. However, when I use the code below I get all the values within <div class='cont'> AND <div class='button'>. Does anyone know why this is happening?
Here is the full code to reproduce it:
using System;
using System.Xml.XPath;
using HtmlAgilityPack;
namespace ConsoleApplication1
{
class Program
{
static void Main(string[] args)
{
const string text1 = @"<div class=""cont"">
<h3>content</h3>
<div style=""margin: 0cm 0cm 0pt"" class=""Normal"">content1</div><div style=""margin: 0cm 0cm 0pt"" class=""Normal""> content2</div>
<div style=""margin: 0cm 0cm 0pt"" class=""Normal"">content3 </div>
<div>content4 </div><strong>content5
<div>content6 </div><ul type=""disc"">
<div>content7 </div>
<div>content8 </div> </ul>
<p class='margin10'><font size=""2"">
<div>
<p><span style=""font-family: Arial"">content9</span></p>
</div>
<div>content10</font><u><font color=""#0000ff"" size=""2""><font color=""#0000ff"" size=""2""> content11 </u></font></font><font size=""2""> content12
<div>content13</div>
</div>
</font>
</p>
</div>
<div class=""button"">
<span class=""applybtn""><a class=""buttonGlobal buttonAlpha"" href=""/uk/job/apply/(id)/608735"">content14</a></span>
</div>";
foreach (XPathNavigator node in SearchInPage(text1, "//div[@class='cont']"))
{
Console.WriteLine("option " + node.Value);
}
}
private static XPathNodeIterator SearchInPage(string text, string xpath)
{
HtmlDocument htmlDocument = new HtmlDocument();
htmlDocument.LoadHtml(text);
XPathNavigator xpathNavigator = htmlDocument.CreateNavigator();
XPathNodeIterator nodes = xpathNavigator.Select(xpath);
return nodes;
}
}
}
The code returns:
'content', 'content1-13' PLUS 'content14' which exists within <div class='button'>
So if I'm understanding correctly, you want the values only for the child nodes of <div class="cont">?
Try this:
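HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html); // html is the markup string to parse
HtmlNode node = doc.DocumentNode.SelectSingleNode(".//div[@class='cont']");
// Iterate over the direct children of the selected div
foreach (HtmlNode childNode in node.ChildNodes)
{
Console.WriteLine(childNode.InnerText);
}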
I don't have a way to debug this in front of me, but this should work. The XPath ".//div[@class='cont']" should select only the specified node and its children, and ignore anything that lives outside it. The rest is just LINQ and HtmlAgilityPack. Remember that HtmlAgilityPack implements XPath, so look through its available methods before using XPath directly. Also remember that XML and HTML are different languages, and what works for one won't necessarily work for the other.
