HTML Agility Pack Node Selection - C#

I'm brand new to HTML Agility Pack (as well as network-based programming in general). I am trying to extract a specific line of HTML, but I don't know enough about HTML Agility Pack's syntax to understand what I'm not writing correctly (and am lost in their documentation). URLs here are modified.
string html;
using (WebClient client = new WebClient())
{
html = client.DownloadString("https://google.com/");
}
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);
foreach (HtmlNode img in doc.DocumentNode.SelectNodes("//div[@class='ngg-gallery-thumbnail-box']//div[@class='ngg-gallery-thumbnail']//a"))
{
Debug.Log(img.GetAttributeValue("href", null));
}
return null;
This is what the HTML looks like
<div id="ngg-image-3" class="ngg-gallery-thumbnail-box" >
<div class="ngg-gallery-thumbnail">
<a href="https://urlhere.png"
// More code here
</a>
</div>
</div>
The problem occurs on the foreach line. I've tried matching examples online the best I can but am missing it. TIA.

HtmlAgilityPack uses XPath syntax to query nodes - HAP effectively converts the HTML document into an XML-like document tree. So the trick is learning XPath querying so you can get the right combination of tags and attributes for the result you need.
The HTML snippet you pasted isn't well formed (there's no closing > on the anchor tag). Assuming that it is closed, then
//div[@class='ngg-gallery-thumbnail-box']//div[@class='ngg-gallery-thumbnail']//a[@href]
will return an HtmlNodeCollection of only those <a> tags that have an href attribute. Note that attribute tests in XPath use @, not #.
If no nodes match your criteria, SelectNodes returns null, so nothing will be written - and a foreach over that null result will throw a NullReferenceException.
For debugging purposes, perhaps log a less specific query's node count or OuterHtml to see what you're getting, e.g.
Debug.Log(doc.DocumentNode.SelectNodes("//div[@class='ngg-gallery-thumbnail-box']//div[@class='ngg-gallery-thumbnail']")[0].OuterHtml);
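Putting that together, a defensive version of the original loop might look like the sketch below. The URL and Unity's Debug.Log are taken from the question; outside Unity, substitute Console.WriteLine.

```csharp
using System.Net;
using HtmlAgilityPack;

// Sketch: download the page, then guard against SelectNodes returning
// null when the XPath matches nothing.
string html;
using (WebClient client = new WebClient())
{
    html = client.DownloadString("https://google.com/");
}

HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);

// Note @class / @href, not #class - XPath attribute tests use '@'.
var links = doc.DocumentNode.SelectNodes(
    "//div[@class='ngg-gallery-thumbnail-box']//div[@class='ngg-gallery-thumbnail']//a[@href]");

if (links == null)
{
    Debug.Log("No matching nodes found");
}
else
{
    foreach (HtmlNode a in links)
    {
        Debug.Log(a.GetAttributeValue("href", null));
    }
}
```

The null check matters because SelectNodes does not return an empty collection on no match; it returns null.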

Related

How to get an element using C#

I'm new with C#, and I'm trying to access an element from a website using webBrowser. I wondered how can I get the "Developers" string from the site:
<div id="title" style="display: block;">
<b>Title:</b> **Developers**
</div>
I tried to use webBrowser1.Document.GetElementById("title"), but I have no idea how to keep going from here.
Thanks :)
You can download the source code using the WebClient class,
then look within the file for <b>Title:</b>**Developers**</div> and omit everything besides "Developers".
HtmlAgilityPack and CsQuery are the routes many people have taken to work with HTML pages in .NET, and I'd recommend them too.
But in case your task is limited to this simple requirement, and you have <div> markup that is valid XHTML (like the markup sample you posted), then you can treat it as XML. That means you can use a native .NET API such as XDocument or XmlDocument to parse the HTML and perform an XPath query to get a specific part from it, for example:
var xml = @"<div id=""title"" style=""display: block;""> <b>Title:</b> Developers</div>";
//or according to your code snippet, you may be able to do as follow :
//var xml = webBrowser1.Document.GetElementById("title").OuterHtml;
var doc = new XmlDocument();
doc.LoadXml(xml);
var text = doc.DocumentElement.SelectSingleNode("//div/b/following-sibling::text()");
Console.WriteLine(text.InnerText);
//above prints " Developers"
The XPath above selects the text node ("Developers") that follows the <b> node.
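The same idea works with XDocument, the other native API mentioned above. A minimal sketch, assuming the same XHTML-valid markup:

```csharp
using System;
using System.Xml.Linq;

class Demo
{
    static void Main()
    {
        var xml = @"<div id=""title"" style=""display: block;""> <b>Title:</b> Developers</div>";
        var div = XElement.Parse(xml);

        // The text we want is the text node immediately following <b>.
        var b = div.Element("b");
        var text = ((XText)b.NextNode).Value;

        Console.WriteLine(text.Trim()); // "Developers"
    }
}
```

XElement.Parse only succeeds because the snippet is well-formed XML; for arbitrary HTML you would still want HtmlAgilityPack.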
You can use HtmlAgilityPack (as mentioned by Giannis: http://htmlagilitypack.codeplex.com/). Using a web browser control is too much for this task:
HtmlAgilityPack.HtmlWeb web = new HtmlWeb();
HtmlAgilityPack.HtmlDocument doc = web.Load("http://www.google.com");
var el = doc.GetElementbyId("title");
string s = el.InnerHtml; // get the : <b>Title:</b> **Developers**
I haven't tried this code but it should be very close to working.
There is an InnerText property in HtmlAgilityPack as well, allowing you to do this:
string s = el.InnerText; // get the : Title: **Developers**
You can also remove the Title: by removing the appropriate node:
el.SelectSingleNode("//b").Remove();
string s = el.InnerText; // get the : **Developers**
If for some reason you want to stick to the web browser control, I think you can do this:
var el = webBrowser1.Document.GetElementById("title");
string s = el.InnerText; // get the : Title: **Developers**
UPDATE
Note that the //b above is XPath syntax which may be interesting for you to learn:
http://www.w3schools.com/XPath/xpath_syntax.asp
http://www.freeformatter.com/xpath-tester.html

How can I stop HtmlAgilityPack changing the source of the loaded page?

I'm using HtmlAgilityPack to load an HTML file like this:
var doc = new HtmlAgilityPack.HtmlDocument();
doc.OptionOutputOriginalCase = true;
doc.Load(#"c:\ftp\file3.html");
Then I'm using XPath to select a node and get its OuterHtml, but the problem is that I get a modified page source. For example I get:
<font class="hello" id="price">
when in the real page source it's
<font class=hello id=price>
How do I avoid that?
You don't. At least not when using a DOM parser.
The HTML Agility Pack in this case is taking the string input and doing its best to create a valid DOM from that input. Unquoted attribute values are not well formed in the XML/XHTML sense:
<font class=hello id=price>
So it translates them into something that is:
<font class="hello" id="price">
It will attempt to do the same for any and all invalid markup in the HTML. If you don't want to use valid markup, then a DOM parser probably isn't the right tool for the job. At that point you're working with a custom string input and you'd have to parse it yourself.
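A minimal sketch of that normalization, matching what the question reports (the exact output formatting may vary between HtmlAgilityPack versions):

```csharp
using System;
using HtmlAgilityPack;

class Demo
{
    static void Main()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml("<font class=hello id=price>text</font>");

        // On re-serialization the pack emits quoted attribute values,
        // e.g. class="hello" id="price", regardless of the input form.
        Console.WriteLine(doc.DocumentNode.OuterHtml);
    }
}
```

There is no option to round-trip the unquoted form, because the parser builds a DOM and then serializes that DOM rather than echoing the original bytes.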

HtmlAgilityPack invalid markup

I am using the HtmlAgilityPack from codeplex.
When I pass a simple html string into it and then get the resulting html back,
it cuts off tags.
Example:
string html = "<select><option>test</option></select>";
HtmlDocument document = new HtmlDocument();
document.LoadHtml(html);
var result = document.DocumentNode.OuterHtml;
// result gives me:
<select><option>test</select>
So the closing tag for the option is missing. Am I missing a setting or using this wrong?
I fixed this by commenting out line 92 of HtmlNode.cs in the source, compiled and it worked like a charm.
ElementsFlags.Add("option", HtmlElementFlag.Empty); // comment this out
Found the answer on this question
In HTML the <option> tag has no end tag.
In XHTML the <option> tag must be properly closed.
http://www.w3schools.com/tags/tag_option.asp
"There is also no adherence to XHTML or XML" - HTML Agility Pack.
This could be why? My guess is that if the tag is optional, the Agility Pack will leave it off. Hope this helps!
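For what it's worth, you can get the same effect without recompiling the library: HtmlNode.ElementsFlags is a static dictionary, so removing the "option" entry at runtime (once, before loading any document) disables that behavior. A sketch:

```csharp
using System;
using HtmlAgilityPack;

class Demo
{
    static void Main()
    {
        // Tell the parser that <option> is not an "empty" element,
        // so its closing tag is preserved. Do this before any parsing.
        HtmlNode.ElementsFlags.Remove("option");

        var document = new HtmlDocument();
        document.LoadHtml("<select><option>test</option></select>");

        // Should now round-trip with the closing </option> intact.
        Console.WriteLine(document.DocumentNode.OuterHtml);
    }
}
```

Because ElementsFlags is global state, the change affects every HtmlDocument in the process for its lifetime.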

Running into an issue trying to extract the text from a snippet of HTML

I am using the HTML Agility Pack to convert
<font size="1">This is a test</font>
to
This is a test
using this code:
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);
string stripped = doc.DocumentNode.InnerText;
but I ran into an issue where I have this:
<font size="1">This is a test &amp; this is a joke</font>
and the code above converted this to
This is a test &amp; this is a joke
but I wanted it to convert it to:
This is a test & this is a joke
Does the HTML Agility Pack support what I am trying to do? Why doesn't it do this by default, or am I doing something wrong?
You can run HttpUtility.HtmlDecode() on the output.
However, note that InnerText will include HTML tags that may be contained inside the outermost tag. If you want to remove all tags, you will have to walk the document tree and retrieve all the text bit by bit.
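A sketch of that combination: HttpUtility.HtmlDecode lives in System.Web, and on newer targets WebUtility.HtmlDecode in System.Net is the equivalent without the System.Web reference.

```csharp
using System;
using System.Net;
using HtmlAgilityPack;

class Demo
{
    static void Main()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml("<font size=\"1\">This is a test &amp; this is a joke</font>");

        // InnerText leaves entities such as &amp; encoded...
        string raw = doc.DocumentNode.InnerText;

        // ...so decode them in a second step.
        string decoded = WebUtility.HtmlDecode(raw);

        Console.WriteLine(decoded); // This is a test & this is a joke
    }
}
```

Decoding is a deliberate separate step: InnerText is still HTML text, so entities stay encoded until you ask for them to be resolved.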

How to parse this piece of HTML?

Good morning!
I am using C# (.NET Framework 3.5 SP1) and want to parse the following piece of HTML via regex:
<h1>My caption</h1>
<p>Here will be some text</p>
<hr class="cs" />
<h2 id="x">CaptionX</h2>
<p>Some text</p>
<hr class="cs" />
<h2 id="x">CaptionX</h2>
<p>Some text</p>
<hr class="cs" />
<h2 id="x">CaptionX</h2>
<p>Some text</p>
I need the following output:
group 1: content of h1
group 2: content of h1-following text
group 3-n: content of subcaptions + text
What I have at the moment:
<hr.*?/>
<h2.*?>(.*?)</h2>
([\W\S]*?)
<hr.*?/>
This gives me only every odd subcaption + content (e.g. 1, 3, ...) due to the trailing <hr/>. For parsing the h1 caption I have another pattern (<h1.*?>(.*?)</h1>), which only gives me the caption but not the content - I'm fine with that for now.
Does anybody have a hint/solution for me, or any alternative logic (e.g. parsing the HTML via a reader and assigning it that way)?
Edit:
As some answers brought in HtmlAgilityPack, I was curious about this nice tool and managed to get the content of the <h1> tag.
But my problem is parsing the rest. This is caused by the fact that the tags for the content may vary - from <p> to <div> and <ul>...
At the moment this seems to mean more or less iterating over the whole document, parsing tag by tag...?
Any hints?
You will really need an HTML parser for this.
Don't use regex to parse HTML. Consider using the HTML Agility Pack.
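As a sketch of how the HtmlAgilityPack route could look for the structure above - the element names come from the question, but the "collect siblings until the next <hr>" logic is my own illustration, chosen because the content tags may vary:

```csharp
using System;
using HtmlAgilityPack;

class Demo
{
    static void Main()
    {
        var html = @"<h1>My caption</h1><p>Here will be some text</p>
                     <hr class=""cs"" /><h2 id=""x"">CaptionX</h2><p>Some text</p>
                     <hr class=""cs"" /><h2 id=""x"">CaptionX</h2><p>Some text</p>";
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Group 1: the <h1> caption.
        Console.WriteLine(doc.DocumentNode.SelectSingleNode("//h1").InnerText);

        // Groups 3..n: each <h2> caption plus whatever sibling content
        // follows it, up to the next <hr> - works for <p>, <div>, <ul>, etc.
        foreach (var h2 in doc.DocumentNode.SelectNodes("//h2"))
        {
            Console.WriteLine("Caption: " + h2.InnerText);
            for (var node = h2.NextSibling; node != null && node.Name != "hr"; node = node.NextSibling)
            {
                Console.Write(node.InnerText.Trim());
            }
            Console.WriteLine();
        }
    }
}
```

Walking NextSibling sidesteps the "every odd match" problem the regex had, since each <hr> is consumed as a boundary rather than matched by two overlapping patterns.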
There are some possibilities:
Regex - fast but not reliable; it can't deal with malformed HTML.
HtmlAgilityPack - good, but it has some memory leaks. If you only want to deal with a few files, that's no problem.
SGMLReader - really good, but there is one problem: sometimes it can't find the default namespace needed to get other nodes, and then it is impossible to parse the HTML.
http://developer.mindtouch.com/SgmlReader
Majestic-12 - good, but not as fast as SGMLReader.
http://www.majestic12.co.uk/projects/html_parser.php
Example for SGMLReader (VB.NET):
Dim sgmlReader As New Sgml.SgmlReader()
Public htmldoc As New System.Xml.Linq.XDocument
sgmlReader.DocType = "HTML"
sgmlReader.WhitespaceHandling = System.Xml.WhitespaceHandling.All
sgmlReader.CaseFolding = Sgml.CaseFolding.ToLower
sgmlReader.InputStream = New System.IO.StringReader(vSource)
htmldoc = XDocument.Load(sgmlReader)
Dim XNS As XNamespace
' In this part you can have a bug, sometimes it cant get the Default Namespace*********
Try
XNS = htmldoc.Root.GetDefaultNamespace
Catch
XNS = "http://www.w3.org/1999/xhtml"
End Try
If XNS.NamespaceName.Trim = "" Then
XNS = "http://www.w3.org/1999/xhtml"
End If
'use it with the linq commands
For Each link In htmldoc.Descendants(XNS + "script")
Scripts &= link.Value
Next
Majestic-12 is different: you have to walk to every tag with a "Next" command. You can find example code shipped with the DLL.
As others have mentioned, use the HtmlAgilityPack. However, if you like jQuery/CSS selectors, I just found a fork of the HtmlAgilityPack called Fizzler:
http://code.google.com/p/fizzler/
Using this you could find all <p> tags using:
var pTags = doc.DocumentNode.QuerySelectorAll("p").ToList();
Or find a specific div like <div id="myDiv"></div>:
var myDiv = doc.DocumentNode.QuerySelectorAll("#myDiv");
It can't get any easier than that!
