How to use HTML Agility pack - c#

How do I use the HTML Agility Pack?
My XHTML document is not completely valid. That's why I wanted to use it. How do I use it in my project? My project is in C#.

First, install the HTMLAgilityPack nuget package into your project.
Then, as an example:
HtmlAgilityPack.HtmlDocument htmlDoc = new HtmlAgilityPack.HtmlDocument();
// There are various options, set as needed
htmlDoc.OptionFixNestedTags=true;
// filePath is a path to a file containing the html
htmlDoc.Load(filePath);
// Use: htmlDoc.LoadHtml(xmlString); to load from a string (was htmlDoc.LoadXML(xmlString)
// ParseErrors is an ArrayList containing any errors from the Load statement
if (htmlDoc.ParseErrors != null && htmlDoc.ParseErrors.Count() > 0)
{
// Handle any parse errors as required
}
else
{
if (htmlDoc.DocumentNode != null)
{
HtmlAgilityPack.HtmlNode bodyNode = htmlDoc.DocumentNode.SelectSingleNode("//body");
if (bodyNode != null)
{
// Do something with bodyNode
}
}
}
(NB: This code is an example only and not necessarily the best/only approach. Do not use it blindly in your own application.)
The HtmlDocument.Load() method also accepts a stream which is very useful in integrating with other stream oriented classes in the .NET framework. While HtmlEntity.DeEntitize() is another useful method for processing html entities correctly. (thanks Matthew)
HtmlDocument and HtmlNode are the classes you'll use most. Similar to an XML parser, it provides the selectSingleNode and selectNodes methods that accept XPath expressions.
Pay attention to the HtmlDocument.Option?????? boolean properties. These control how the Load and LoadXML methods will process your HTML/XHTML.
There is also a compiled help file called HtmlAgilityPack.chm that has a complete reference for each of the objects. This is normally in the base folder of the solution.

I don't know if this will be of any help to you, but I have written a couple of articles which introduce the basics.
HtmlAgilityPack Article Series
Introduction To The HtmlAgilityPack Library
Easily extracting links from a snippet of html with HtmlAgilityPack
The next article is 95% complete, I just have to write up explanations of the last few parts of the code I have written. If you are interested then I will try to remember to post here when I publish it.

HtmlAgilityPack uses XPath syntax, and though many argues that it is poorly documented, I had no trouble using it with help from this XPath documentation: https://www.w3schools.com/xml/xpath_syntax.asp
To parse
<h2>
Jack
</h2>
<ul>
<li class="tel">
81 75 53 60
</li>
</ul>
<h2>
Roy
</h2>
<ul>
<li class="tel">
44 52 16 87
</li>
</ul>
I did this:
string url = "http://website.com";
var Webget = new HtmlWeb();
var doc = Webget.Load(url);
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//h2//a"))
{
names.Add(node.ChildNodes[0].InnerHtml);
}
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//li[#class='tel']//a"))
{
phones.Add(node.ChildNodes[0].InnerHtml);
}

Main HTMLAgilityPack related code is as follows
using System;
using System.Net;
using System.Web;
using System.Web.Services;
using System.Web.Script.Services;
using System.Text.RegularExpressions;
using HtmlAgilityPack;
namespace GetMetaData
{
/// <summary>
/// Summary description for MetaDataWebService
/// </summary>
[WebService(Namespace = "http://tempuri.org/")]
[WebServiceBinding(ConformsTo = WsiProfiles.BasicProfile1_1)]
[System.ComponentModel.ToolboxItem(false)]
// To allow this Web Service to be called from script, using ASP.NET AJAX, uncomment the following line.
[System.Web.Script.Services.ScriptService]
public class MetaDataWebService: System.Web.Services.WebService
{
[WebMethod]
[ScriptMethod(UseHttpGet = false)]
public MetaData GetMetaData(string url)
{
MetaData objMetaData = new MetaData();
//Get Title
WebClient client = new WebClient();
string sourceUrl = client.DownloadString(url);
objMetaData.PageTitle = Regex.Match(sourceUrl, #
"\<title\b[^>]*\>\s*(?<Title>[\s\S]*?)\</title\>", RegexOptions.IgnoreCase).Groups["Title"].Value;
//Method to get Meta Tags
objMetaData.MetaDescription = GetMetaDescription(url);
return objMetaData;
}
private string GetMetaDescription(string url)
{
string description = string.Empty;
//Get Meta Tags
var webGet = new HtmlWeb();
var document = webGet.Load(url);
var metaTags = document.DocumentNode.SelectNodes("//meta");
if (metaTags != null)
{
foreach(var tag in metaTags)
{
if (tag.Attributes["name"] != null && tag.Attributes["content"] != null && tag.Attributes["name"].Value.ToLower() == "description")
{
description = tag.Attributes["content"].Value;
}
}
}
else
{
description = string.Empty;
}
return description;
}
}
}

public string HtmlAgi(string url, string key)
{
var Webget = new HtmlWeb();
var doc = Webget.Load(url);
HtmlNode ourNode = doc.DocumentNode.SelectSingleNode(string.Format("//meta[#name='{0}']", key));
if (ourNode != null)
{
return ourNode.GetAttributeValue("content", "");
}
else
{
return "not fount";
}
}

Getting Started - HTML Agility Pack
// From File
var doc = new HtmlDocument();
doc.Load(filePath);
// From String
var doc = new HtmlDocument();
doc.LoadHtml(html);
// From Web
var url = "http://html-agility-pack.net/";
var web = new HtmlWeb();
var doc = web.Load(url);

try this
string htmlBody = ParseHmlBody(dtViewDetails.Rows[0]["Body"].ToString());
private string ParseHmlBody(string html)
{
string body = string.Empty;
try
{
var htmlDoc = new HtmlDocument();
htmlDoc.LoadHtml(html);
var htmlBody = htmlDoc.DocumentNode.SelectSingleNode("//body");
body = htmlBody.OuterHtml;
}
catch (Exception ex)
{
dalPendingOrders.LogMessage("Error in ParseHmlBody" + ex.Message);
}
return body;
}

Related

Extracting full line of text using partial text [duplicate]

How do I use the HTML Agility Pack?
My XHTML document is not completely valid. That's why I wanted to use it. How do I use it in my project? My project is in C#.
First, install the HTMLAgilityPack nuget package into your project.
Then, as an example:
HtmlAgilityPack.HtmlDocument htmlDoc = new HtmlAgilityPack.HtmlDocument();
// There are various options, set as needed
htmlDoc.OptionFixNestedTags=true;
// filePath is a path to a file containing the html
htmlDoc.Load(filePath);
// Use: htmlDoc.LoadHtml(xmlString); to load from a string (was htmlDoc.LoadXML(xmlString)
// ParseErrors is an ArrayList containing any errors from the Load statement
if (htmlDoc.ParseErrors != null && htmlDoc.ParseErrors.Count() > 0)
{
// Handle any parse errors as required
}
else
{
if (htmlDoc.DocumentNode != null)
{
HtmlAgilityPack.HtmlNode bodyNode = htmlDoc.DocumentNode.SelectSingleNode("//body");
if (bodyNode != null)
{
// Do something with bodyNode
}
}
}
(NB: This code is an example only and not necessarily the best/only approach. Do not use it blindly in your own application.)
The HtmlDocument.Load() method also accepts a stream which is very useful in integrating with other stream oriented classes in the .NET framework. While HtmlEntity.DeEntitize() is another useful method for processing html entities correctly. (thanks Matthew)
HtmlDocument and HtmlNode are the classes you'll use most. Similar to an XML parser, it provides the selectSingleNode and selectNodes methods that accept XPath expressions.
Pay attention to the HtmlDocument.Option?????? boolean properties. These control how the Load and LoadXML methods will process your HTML/XHTML.
There is also a compiled help file called HtmlAgilityPack.chm that has a complete reference for each of the objects. This is normally in the base folder of the solution.
I don't know if this will be of any help to you, but I have written a couple of articles which introduce the basics.
HtmlAgilityPack Article Series
Introduction To The HtmlAgilityPack Library
Easily extracting links from a snippet of html with HtmlAgilityPack
The next article is 95% complete, I just have to write up explanations of the last few parts of the code I have written. If you are interested then I will try to remember to post here when I publish it.
HtmlAgilityPack uses XPath syntax, and though many argues that it is poorly documented, I had no trouble using it with help from this XPath documentation: https://www.w3schools.com/xml/xpath_syntax.asp
To parse
<h2>
Jack
</h2>
<ul>
<li class="tel">
81 75 53 60
</li>
</ul>
<h2>
Roy
</h2>
<ul>
<li class="tel">
44 52 16 87
</li>
</ul>
I did this:
string url = "http://website.com";
var Webget = new HtmlWeb();
var doc = Webget.Load(url);
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//h2//a"))
{
names.Add(node.ChildNodes[0].InnerHtml);
}
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//li[#class='tel']//a"))
{
phones.Add(node.ChildNodes[0].InnerHtml);
}
Main HTMLAgilityPack related code is as follows
using System;
using System.Net;
using System.Web;
using System.Web.Services;
using System.Web.Script.Services;
using System.Text.RegularExpressions;
using HtmlAgilityPack;
namespace GetMetaData
{
/// <summary>
/// Summary description for MetaDataWebService
/// </summary>
[WebService(Namespace = "http://tempuri.org/")]
[WebServiceBinding(ConformsTo = WsiProfiles.BasicProfile1_1)]
[System.ComponentModel.ToolboxItem(false)]
// To allow this Web Service to be called from script, using ASP.NET AJAX, uncomment the following line.
[System.Web.Script.Services.ScriptService]
public class MetaDataWebService: System.Web.Services.WebService
{
[WebMethod]
[ScriptMethod(UseHttpGet = false)]
public MetaData GetMetaData(string url)
{
MetaData objMetaData = new MetaData();
//Get Title
WebClient client = new WebClient();
string sourceUrl = client.DownloadString(url);
objMetaData.PageTitle = Regex.Match(sourceUrl, #
"\<title\b[^>]*\>\s*(?<Title>[\s\S]*?)\</title\>", RegexOptions.IgnoreCase).Groups["Title"].Value;
//Method to get Meta Tags
objMetaData.MetaDescription = GetMetaDescription(url);
return objMetaData;
}
private string GetMetaDescription(string url)
{
string description = string.Empty;
//Get Meta Tags
var webGet = new HtmlWeb();
var document = webGet.Load(url);
var metaTags = document.DocumentNode.SelectNodes("//meta");
if (metaTags != null)
{
foreach(var tag in metaTags)
{
if (tag.Attributes["name"] != null && tag.Attributes["content"] != null && tag.Attributes["name"].Value.ToLower() == "description")
{
description = tag.Attributes["content"].Value;
}
}
}
else
{
description = string.Empty;
}
return description;
}
}
}
public string HtmlAgi(string url, string key)
{
var Webget = new HtmlWeb();
var doc = Webget.Load(url);
HtmlNode ourNode = doc.DocumentNode.SelectSingleNode(string.Format("//meta[#name='{0}']", key));
if (ourNode != null)
{
return ourNode.GetAttributeValue("content", "");
}
else
{
return "not fount";
}
}
Getting Started - HTML Agility Pack
// From File
var doc = new HtmlDocument();
doc.Load(filePath);
// From String
var doc = new HtmlDocument();
doc.LoadHtml(html);
// From Web
var url = "http://html-agility-pack.net/";
var web = new HtmlWeb();
var doc = web.Load(url);
try this
string htmlBody = ParseHmlBody(dtViewDetails.Rows[0]["Body"].ToString());
private string ParseHmlBody(string html)
{
string body = string.Empty;
try
{
var htmlDoc = new HtmlDocument();
htmlDoc.LoadHtml(html);
var htmlBody = htmlDoc.DocumentNode.SelectSingleNode("//body");
body = htmlBody.OuterHtml;
}
catch (Exception ex)
{
dalPendingOrders.LogMessage("Error in ParseHmlBody" + ex.Message);
}
return body;
}

Parsing innertext of html

This is part of html that i am parsing
<li>http://some.link.com/4DFR6DJ43Y/sessionid?ticket=ASDSIDFK32423421</li>
I want to get http://some.link.com/4DFR6DJ43Y/sessionid?ticket=ASDSIDFK32423421 as an output.
So far i have tried
HtmlDocument document = new HtmlDocument();
document.LoadHtml(responseFromServer);
var link = document.DocumentNode.SelectSingleNode("//a");
if (link != null)
{
if(link.innerText.Contains("ticket"))
{
Console.WriteLine(link.InnerText);
}
}
... but output is null (no inner texts are found).
That's probably because the first link in your HTML document as returned by SelectSingleNode(), doesn't contains text "ticket". You can check for the target text in XPath directly , like so :
var link = document.DocumentNode.SelectSingleNode("//a[contains(.,'ticket')]");
if (link != null)
{
Console.WriteLine(link.InnerText);
}
or using LINQ style if you like :
var link = document.DocumentNode
.SelectNodes("//a")
.OfType<HtmlNode>()
.FirstOrDefault(o => o.InnerText.Contains("ticket"));
if (link != null)
{
Console.WriteLine(link.InnerText);
}
You provided a piece of code that won't compile because innerText is not defined. If you try this code, you'll probably get what you asked for:
HtmlDocument document = new HtmlDocument();
document.LoadHtml(html);
var link = document.DocumentNode.SelectSingleNode("//a");
if (link != null)
{
if(link.InnerText.Contains("ticket"))
{
Console.WriteLine(link.InnerText);
}
}
You can use HTML Agility Pack instead of HTML Document then you can do deep parsing in HTML. for more information please see the following information.
See the following link.
How to use HTML Agility pack

Use Microsoft Translator API to translate a whole web page with C#

translate a web page with Microsoft Translator API (SOAP) with C#. I want to make my websites translated, but using the Translator Widget is not good for me as I need google to crawl my translated pages as well. So I'll need to translate it before sending it to the browser.
So far there's no API (I tried finding it, I couldn't, If you happen to know one please mention) where you could just pass a url and it'll send you the translated response like this : http://www.microsofttranslator.com/bv.aspx?from=&to=nl&a=http%3A%2F%2Fwww.imdb.com%2F
These are the attempts I've taken so far:
1. Download string from Url, pass to Client.Translate(..).
The formatter threw an exception while trying to deserialize the
message: Error in deserializing body of request message for operation
'Translate'. The maximum string content length quota (30720) has been
exceeded while reading XML data. This quota may be increased by
changing the MaxStringContentLength property on the
XmlDictionaryReaderQuotas object used when creating the XML reader.
Line 516, position 48.
2.
private static void processDocument(HtmlAgilityPack.HtmlDocument html, LanguageServiceClient Client)
{
HtmlNodeCollection coll = html.DocumentNode.SelectNodes("//text()[normalize-space(.) != '']");
foreach (HtmlNode node in coll)
{
if (node.InnerText == node.InnerHtml)
{
//node.InnerHtml = translateText(node.InnerText);
node.InnerHtml = Client.Translate("", node.InnerText, "en", "fr", "text/html", "general");
}
}
}
This one this taking way too much time. And in the end I'm getting a Bad request (400) exception.
What would be the best way to tackle this problem? I also plan to save the documents so that I don't have to translate every time.
This C# example translates HTML from a local file:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using System.IO;
using HtmlAgilityPack;
namespace TranslationAssistant.Business
{
class HTMLTranslationManager
{
public static int DoTranslation(string htmlfilename, string fromlanguage, string tolanguage)
{
string htmldocument = File.ReadAllText(htmlfilename);
string htmlout = string.Empty;
HtmlDocument htmlDoc = new HtmlDocument();
htmlDoc.LoadHtml(htmldocument);
htmlDoc.DocumentNode.SetAttributeValue("lang", TranslationServices.Core.TranslationServiceFacade.LanguageNameToLanguageCode(tolanguage));
var title = htmlDoc.DocumentNode.SelectSingleNode("//head//title");
if (title != null) title.InnerHtml = TranslationServices.Core.TranslationServiceFacade.TranslateString(title.InnerHtml, fromlanguage, tolanguage, "text/html");
var body = htmlDoc.DocumentNode.SelectSingleNode("//body");
if (body != null)
{
if (body.InnerHtml.Length < 10000)
{
body.InnerHtml = TranslationServices.Core.TranslationServiceFacade.TranslateString(body.InnerHtml, fromlanguage, tolanguage, "text/html");
}
else
{
List<HtmlNode> nodes = new List<HtmlNode>();
AddNodes(body.FirstChild, ref nodes);
Parallel.ForEach(nodes, (node) =>
{
if (node.InnerHtml.Length > 10000)
{
throw new Exception("Child node with a length of more than 10000 characters encountered.");
}
node.InnerHtml = TranslationServices.Core.TranslationServiceFacade.TranslateString(node.InnerHtml, fromlanguage, tolanguage, "text/html");
});
}
}
htmlDoc.Save(htmlfilename, Encoding.UTF8);
return 1;
}
/// <summary>
/// Add nodes of size smaller than 10000 characters to the list, and recurse into the bigger ones.
/// </summary>
/// <param name="rootnode">The node to start from</param>
/// <param name="nodes">Reference to the node list</param>
private static void AddNodes(HtmlNode rootnode, ref List<HtmlNode> nodes)
{
string[] DNTList = { "script", "#text", "code", "col", "colgroup", "embed", "em", "#comment", "image", "map", "media", "meta", "source", "xml"}; //DNT - Do Not Translate - these nodes are skipped.
HtmlNode child = rootnode;
while (child != rootnode.LastChild)
{
if (!DNTList.Contains(child.Name.ToLowerInvariant())) {
if (child.InnerHtml.Length > 10000)
{
AddNodes(child.FirstChild, ref nodes);
}
else
{
if (child.InnerHtml.Trim().Length != 0) nodes.Add(child);
}
}
child = child.NextSibling;
}
}
}
}
This is HTMLTranslationManager.cs in http://github.com/microsofttranslator/documenttranslator, and it uses the helper function TranslateString() in TranslationServiceFacade.cs. You could simplify and just insert the translation service call right here in place of TranslateString().

Select elements added to the DOM by a script

I've been trying to get either an <object> or an <embed> tag using:
HtmlNode videoObjectNode = doc.DocumentNode.SelectSingleNode("//object");
HtmlNode videoEmbedNode = doc.DocumentNode.SelectSingleNode("//embed");
This doesn't seem to work.
Can anyone please tell me how to get these tags and their InnerHtml?
A YouTube embedded video looks like this:
<embed height="385" width="640" type="application/x-shockwave-flash"
src="http://s.ytimg.com/yt/swf/watch-vfl184368.swf" id="movie_player" flashvars="..."
allowscriptaccess="always" allowfullscreen="true" bgcolor="#000000">
I got a feeling the JavaScript might stop the swf player from working, hope not...
Cheers
Update 2010-08-26 (in response to OP's comment):
I think you're thinking about it the wrong way, Alex. Suppose I wrote some C# code that looked like this:
string codeBlock = "if (x == 1) Console.WriteLine(\"Hello, World!\");";
Now, if I wrote a C# parser, should it recognize the contents of the string literal above as C# code and highlight it (or whatever) as such? No, because in the context of a well-formed C# file, that text represents a string to which the codeBlock variable is being assigned.
Similarly, in the HTML on YouTube's pages, the <object> and <embed> elements are not really elements at all in the context of the current HTML document. They are the contents of string values residing within JavaScript code.
In fact, if HtmlAgilityPack did ignore this fact and attempted to recognize all portions of text that could be HTML, it still wouldn't succeed with these elements because, being inside JavaScript, they're heavily escaped with \ characters (notice the precarious Unescape method in the code I posted to get around this issue).
I'm not saying my hacky solution below is the right way to approach this problem; I'm just explaining why obtaining these elements isn't as straightforward as grabbing them with HtmlAgilityPack.
YouTubeScraper
OK, Alex: you asked for it, so here it is. Some truly hacky code to extract your precious <object> and <embed> elements out from that sea of JavaScript.
class YouTubeScraper
{
public HtmlNode FindObjectElement(string url)
{
HtmlNodeCollection scriptNodes = FindScriptNodes(url);
for (int i = 0; i < scriptNodes.Count; ++i)
{
HtmlNode scriptNode = scriptNodes[i];
string javascript = scriptNode.InnerHtml;
int objectNodeLocation = javascript.IndexOf("<object");
if (objectNodeLocation != -1)
{
string htmlStart = javascript.Substring(objectNodeLocation);
int objectNodeEndLocation = htmlStart.IndexOf(">\" :");
if (objectNodeEndLocation != -1)
{
string finalEscapedHtml = htmlStart.Substring(0, objectNodeEndLocation + 1);
string unescaped = Unescape(finalEscapedHtml);
var objectDoc = new HtmlDocument();
objectDoc.LoadHtml(unescaped);
HtmlNode objectNode = objectDoc.GetElementbyId("movie_player");
return objectNode;
}
}
}
return null;
}
public HtmlNode FindEmbedElement(string url)
{
HtmlNodeCollection scriptNodes = FindScriptNodes(url);
for (int i = 0; i < scriptNodes.Count; ++i)
{
HtmlNode scriptNode = scriptNodes[i];
string javascript = scriptNode.InnerHtml;
int approxEmbedNodeLocation = javascript.IndexOf("<\\/object>\" : \"<embed");
if (approxEmbedNodeLocation != -1)
{
string htmlStart = javascript.Substring(approxEmbedNodeLocation + 15);
int embedNodeEndLocation = htmlStart.IndexOf(">\";");
if (embedNodeEndLocation != -1)
{
string finalEscapedHtml = htmlStart.Substring(0, embedNodeEndLocation + 1);
string unescaped = Unescape(finalEscapedHtml);
var embedDoc = new HtmlDocument();
embedDoc.LoadHtml(unescaped);
HtmlNode videoEmbedNode = embedDoc.GetElementbyId("movie_player");
return videoEmbedNode;
}
}
}
return null;
}
protected HtmlNodeCollection FindScriptNodes(string url)
{
var doc = new HtmlDocument();
WebRequest request = WebRequest.Create(url);
using (var response = request.GetResponse())
using (var stream = response.GetResponseStream())
{
doc.Load(stream);
}
HtmlNode root = doc.DocumentNode;
HtmlNodeCollection scriptNodes = root.SelectNodes("//script");
return scriptNodes;
}
static string Unescape(string htmlFromJavascript)
{
// The JavaScript has escaped all of its HTML using backslashes. We need
// to reverse this.
// DISCLAIMER: I am a TOTAL Regex n00b; I make no claims as to the robustness
// of this code. If you could improve it, please, I beg of you to do so. Personally,
// I tested it on a grand total of three inputs. It worked for those, at least.
return Regex.Replace(htmlFromJavascript, #"\\(.)", UnescapeFromBeginning);
}
static string UnescapeFromBeginning(Match match)
{
string text = match.ToString();
if (text.StartsWith("\\"))
{
return text.Substring(1);
}
return text;
}
}
And in case you're interested, here's a little demo I threw together (super fancy, I know):
class Program
{
static void Main(string[] args)
{
var scraper = new YouTubeScraper();
HtmlNode davidAfterDentistEmbedNode = scraper.FindEmbedElement("http://www.youtube.com/watch?v=txqiwrbYGrs");
Console.WriteLine("David After Dentist:");
Console.WriteLine(davidAfterDentistEmbedNode.OuterHtml);
Console.WriteLine();
HtmlNode drunkHistoryObjectNode = scraper.FindObjectElement("http://www.youtube.com/watch?v=jL68NyCSi8o");
Console.WriteLine("Drunk History:");
Console.WriteLine(drunkHistoryObjectNode.OuterHtml);
Console.WriteLine();
HtmlNode jessicaDailyAffirmationEmbedNode = scraper.FindEmbedElement("http://www.youtube.com/watch?v=qR3rK0kZFkg");
Console.WriteLine("Jessica's Daily Affirmation:");
Console.WriteLine(jessicaDailyAffirmationEmbedNode.OuterHtml);
Console.WriteLine();
HtmlNode jazzerciseObjectNode = scraper.FindObjectElement("http://www.youtube.com/watch?v=VGOO8ZhWFR4");
Console.WriteLine("Jazzercise - Move your Boogie Body:");
Console.WriteLine(jazzerciseObjectNode.OuterHtml);
Console.WriteLine();
Console.Write("Finished! Hit Enter to quit.");
Console.ReadLine();
}
}
Original Answer
Why not try using the element's Id instead?
HtmlNode videoEmbedNode = doc.GetElementbyId("movie_player");
Update: Oh man, you're searching for HTML tags that are themselves within JavaScript? That's definitely why this isn't working. (They aren't really tags to be parsed from the perspective of HtmlAgilityPack; all of that JavaScript is really one big string inside a <script> tag.) Maybe there's some way you can parse the <script> tag's inner text itself as HTML and go from there.

html agility pack remove children

I'm having difficulty trying to remove a div with a particular ID, and its children using the HTML Agility pack. I am sure I'm just missing a config option, but its Friday and I'm struggling.
The simplified HTML runs:
<html><head></head><body><div id='wrapper'><div id='functionBar'><div id='search'></div></div></div></body></html>
This is as far as I have got. The error thrown by the agility pack shows it cannot find a div structure:
<div id='functionBar'></div>
Here's the code so far (taken from Stackoverflow....)
HtmlAgilityPack.HtmlDocument htmlDoc = new HtmlAgilityPack.HtmlDocument();
// There are various options, set as needed
//htmlDoc.OptionFixNestedTags = true;
// filePath is a path to a file containing the html
htmlDoc.LoadHtml(Html);
string output = string.Empty;
// ParseErrors is an ArrayList containing any errors from the Load statement
if (htmlDoc.ParseErrors != null && htmlDoc.ParseErrors.Count > 0)
{
// Handle any parse errors as required
}
else
{
if (htmlDoc.DocumentNode != null)
{
HtmlAgilityPack.HtmlNode bodyNode = htmlDoc.DocumentNode.SelectSingleNode("//body");
if (bodyNode != null)
{
HtmlAgilityPack.HtmlNode functionBarNode = bodyNode.SelectSingleNode ("//div[#id='functionBar']");
bodyNode.RemoveChild(functionBarNode,false);
output = bodyNode.InnerHtml;
}
}
}
bodyNode.RemoveChild(functionBarNode,false);
But functionBarNode is not a child of bodyNode.
How about functionBarNode.ParentNode.RemoveChild(functionBarNode, false)? (And forget the bit about finding bodyNode.)
You can simply call:
var documentNode = document.DocumentNode;
var functionBarNode = documentNode.SelectSingleNode("//div[#id='functionBar']");
functionBarNode.Remove();
It is much simpler, and does the same as:
functionBarNode.ParentNode.RemoveChild(functionBarNode, false);
This will work for multiple:
HtmlDocument d = this.Download(string.Format(validatorUrl, Url));
foreach (var toGo in QuerySelectorAll(d.DocumentNode, "p[class=helpwanted]").ToList())
{
toGo.Remove();
}

Categories