I have continually had problems with Html Agility Pack; my XPath queries only ever work when they are extremely simple:
//*[@id='some_id']
or
//input
However, any time they get more complicated, Html Agility Pack can't handle them.
Here's an example demonstrating the problem. I'm using WebDriver to navigate to Google and return the page source, which is passed to Html Agility Pack; then both WebDriver and Html Agility Pack attempt to locate the same element/node (C#):
//The XPath query
const string xpath = "//form//tr[1]/td[1]//input[@name='q']";
//Navigate to Google and get page source
var driver = new FirefoxDriver(new FirefoxProfile()) { Url = "http://www.google.com" };
Thread.Sleep(2000);
//Can WebDriver find it? (note: FindElementByXPath throws NoSuchElementException if it can't)
var e = driver.FindElementByXPath(xpath);
Console.WriteLine(e != null ? "Webdriver success" : "Webdriver failure");
//Can Html Agility Pack find it?
var source = driver.PageSource;
var htmlDoc = new HtmlDocument { OptionFixNestedTags = true };
htmlDoc.LoadHtml(source);
var nodes = htmlDoc.DocumentNode.SelectNodes(xpath);
Console.WriteLine(nodes!=null ? "Html Agility Pack success" : "Html Agility Pack failure");
driver.Quit();
In this case, WebDriver successfully located the item, but Html Agility Pack did not.
I know, I know, in this case it's very easy to change the XPath to one that will work (//input[@name='q']), but that only fixes this specific example, which isn't the point. I need something that exactly, or at least closely, mirrors the behavior of WebDriver's XPath engine, or even the FirePath or FireFinder add-ons for Firefox.
If WebDriver can find it, then why can't Html Agility Pack find it too?
The issue you're running into is with the FORM element. HTML Agility Pack handles that element differently - by default, it will never report that it has children.
In the particular example you gave, this query does find the target element:
.//div/div[2]/table/tr/td/table/tr/td/div/table/tr/td/div/div[2]/input
However, this does not, so it's clear the form element is tripping up the parser:
.//form/div/div[2]/table/tr/td/table/tr/td/div/table/tr/td/div/div[2]/input
That behavior is configurable, though. If you place this line prior to parsing the HTML, the form will give you child nodes:
HtmlNode.ElementsFlags.Remove("form");
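Putting it together with the original scenario, a minimal sketch might look like this (assuming the page source is already in a string, as in the question):

```csharp
// ElementsFlags is a static, process-wide setting, so remove the
// Empty flag for <form> once, before any document is parsed.
HtmlNode.ElementsFlags.Remove("form");

var htmlDoc = new HtmlDocument { OptionFixNestedTags = true };
htmlDoc.LoadHtml(source);

// With the flag removed, <form> keeps its children, so the
// original form-relative query can now match.
var nodes = htmlDoc.DocumentNode.SelectNodes("//form//tr[1]/td[1]//input[@name='q']");
Console.WriteLine(nodes != null ? "Html Agility Pack success" : "Html Agility Pack failure");
```

Because the flag is static, removing it affects every HtmlDocument parsed afterwards in the same process.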
Related
I'm brand new to HTML Agility Pack (as well as network-based programming in general). I am trying to extract a specific line of HTML, but I don't know enough about HTML Agility Pack's syntax to understand what I'm not writing correctly (and am lost in their documentation). URLs here are modified.
string html;
using (WebClient client = new WebClient())
{
html = client.DownloadString("https://google.com/");
}
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);
foreach (HtmlNode img in doc.DocumentNode.SelectNodes("//div[@class='ngg-gallery-thumbnail-box']//div[@class='ngg-gallery-thumbnail']//a"))
{
Debug.Log(img.GetAttributeValue("href", null));
}
return null;
This is what the HTML looks like
<div id="ngg-image-3" class="ngg-gallery-thumbnail-box" >
<div class="ngg-gallery-thumbnail">
<a href="https://urlhere.png"
// More code here
</a>
</div>
</div>
The problem occurs on the foreach line. I've tried matching examples online the best I can but am missing it. TIA.
HTML Agility Pack uses XPath syntax to query nodes - HAP effectively converts the HTML document into an XML-like document tree. So the trick is learning XPath querying so you can get the right combination of tags and attributes for the result you need.
The HTML snippet you pasted isn't well formed (there's no closing > on the anchor tag). Assuming that it is closed, then
//div[@class='ngg-gallery-thumbnail-box']//div[@class='ngg-gallery-thumbnail']//a[@href]
will return an HtmlNodeCollection of only those a tags that have href attributes.
If there are none that meet your criteria, nothing will be written.
For debugging purposes, perhaps log the node count of a less specific query, or a node's OuterHtml, to see what you're getting, e.g.
Debug.Log(doc.DocumentNode.SelectNodes("//div[@class='ngg-gallery-thumbnail-box']//div[@class='ngg-gallery-thumbnail']")[0].OuterHtml);
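One more thing worth guarding against: SelectNodes returns null (not an empty collection) when nothing matches, so the foreach in your code will throw if the query finds nothing. A minimal sketch, assuming the same Debug.Log logger as in your code:

```csharp
// SelectNodes returns null when there are no matches - check before iterating.
var anchors = doc.DocumentNode.SelectNodes(
    "//div[@class='ngg-gallery-thumbnail-box']//div[@class='ngg-gallery-thumbnail']//a[@href]");

if (anchors == null)
{
    Debug.Log("No matching anchors - compare the XPath against doc.DocumentNode.OuterHtml");
}
else
{
    foreach (var a in anchors)
        Debug.Log(a.GetAttributeValue("href", null));
}
```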
I'm parsing an HTML file using HTML Agility Pack. I want to get
<title>Some title </title>
As you see, title doesn't have a class. So I couldn't catch it no matter what I have tried. I couldn't find the solution on the web either. How can I catch this HTML tag which doesn't have a class? Thanks.
This might do the trick for you
doc.DocumentNode.SelectSingleNode("//head/title");
or
doc.DocumentNode.SelectSingleNode("//title");
or
doc.DocumentNode.Descendants("title").FirstOrDefault()
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(htmlContent);
var result = doc.DocumentNode.SelectNodes("//title").FirstOrDefault();
I am using the HtmlAgilityPack from codeplex.
When I pass a simple html string into it and then get the resulting html back,
it cuts off tags.
Example:
string html = "<select><option>test</option></select>";
HtmlDocument document = new HtmlDocument();
document.LoadHtml(html);
var result = document.DocumentNode.OuterHtml;
// result gives me:
<select><option>test</select>
So the closing tag for the option is missing. Am I missing a setting or using this wrong?
I fixed this by commenting out line 92 of HtmlNode.cs in the source, compiled and it worked like a charm.
ElementsFlags.Add("option", HtmlElementFlag.Empty); // comment this out
Found the answer on this question
In HTML the <option> end tag is optional.
In XHTML the tag must be properly closed.
http://www.w3schools.com/tags/tag_option.asp
"There is also no adherence to XHTML or XML" - HTML Agility Pack.
This could be why: my guess is that because the end tag is optional in HTML, the Agility Pack leaves it off by default. Hope this helps!
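Rather than editing and recompiling the library source, the same fix can be applied at runtime, since ElementsFlags is a public static dictionary. A minimal sketch:

```csharp
// Clear the Empty flag for <option> before parsing, instead of
// patching HtmlNode.cs. This is a static, process-wide setting.
HtmlNode.ElementsFlags.Remove("option");

var document = new HtmlDocument();
document.LoadHtml("<select><option>test</option></select>");

// With the flag removed, the closing </option> tag is preserved
// in the round-tripped output.
Console.WriteLine(document.DocumentNode.OuterHtml);
```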
I am using the HTML Agility Pack to convert
<font size="1">This is a test</font>
to
This is a test
using this code:
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);
string stripped = doc.DocumentNode.InnerText;
but I ran into an issue where I have this:
<font size="1">This is a test &amp; this is a joke</font>
and the code above converted this to
This is a test &amp; this is a joke
but I wanted it to convert it to:
This is a test & this is a joke
Does the HTML Agility Pack support what I am trying to do? Why doesn't it do this by default, or am I doing something wrong?
You can run HttpUtility.HtmlDecode() on the output.
However, note that InnerText will include HTML tags that may be contained inside the outermost tag. If you want to remove all tags, you will have to walk the document tree and retrieve all the text bit by bit.
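A minimal sketch of that tree walk, assuming the HtmlAgilityPack and System.Web namespaces are available (HttpUtility.HtmlDecode handles the entity decoding):

```csharp
using System.Text;
using System.Web;
using HtmlAgilityPack;

static string ExtractText(HtmlDocument doc)
{
    var sb = new StringBuilder();
    foreach (var node in doc.DocumentNode.DescendantsAndSelf())
    {
        // Keep only text nodes, and skip script/style contents,
        // which InnerText would otherwise include.
        if (node.NodeType == HtmlNodeType.Text &&
            node.ParentNode.Name != "script" &&
            node.ParentNode.Name != "style")
        {
            sb.Append(HttpUtility.HtmlDecode(node.InnerText));
        }
    }
    return sb.ToString();
}
```

For the simple <font> example above this gives the same result as HtmlDecode on InnerText; the difference shows up on documents with embedded scripts or styles.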
I want to download all the images in an HTML page. I don't know in advance how many images there will be, and I don't want to use HTML Agility Pack.
I searched on Google, but every site just made me more confused.
I tried a regex, but it only gave one result...
People are giving you the right answer - you can't be picky and lazy, too. ;-)
If you use a half-baked solution, you'll deal with a lot of edge cases. Here's a working sample that gets all links in an HTML document using HTML Agility Pack (it's included in the HTML Agility Pack download).
And here's a blog post that shows how to grab all images in an HTML document with HTML Agility Pack and LINQ
// Bing Image Result for Cat, First Page
string url = "http://www.bing.com/images/search?q=cat&go=&form=QB&qs=n";
// For speed of dev, I use a WebClient
WebClient client = new WebClient();
string html = client.DownloadString(url);
// Load the Html into the agility pack
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);
// Now, using LINQ to get all Images
List<HtmlNode> imageNodes = (from HtmlNode node in doc.DocumentNode.SelectNodes("//img")
                             where node.Attributes["class"] != null
                                   && node.Attributes["class"].Value.StartsWith("img_")
                             select node).ToList();
foreach(HtmlNode node in imageNodes)
{
Console.WriteLine(node.Attributes["src"].Value);
}
First of all I just can't leave this phrase alone:
images stored in html
That phrase is probably a big part of the reason your question was down-voted twice. Images are not stored in html. Html pages have references to images that web browsers download separately.
This means you need to do this in three steps: first download the html, then find the image references inside the html, and finally use those references to download the images themselves.
To accomplish this, look at the System.Net.WebClient class. It has a .DownloadString() method you can use to get the html. Then you need to find all the <img /> tags. You're on your own here, but it's straightforward enough. Finally, you use WebClient's .DownloadData() or .DownloadFile() methods to retrieve the images.
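The three steps can be sketched as follows. This is a rough example, not a robust solution: the URL is a placeholder, and the regex only catches straightforward <img src="..."> tags (a real parser handles far more edge cases):

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

// Step 1: download the html.
var pageUri = new Uri("http://example.com/");
using (var client = new WebClient())
{
    string html = client.DownloadString(pageUri);

    // Step 2: find the image references (naive regex; quotes required).
    var matches = Regex.Matches(html, "<img[^>]+src=[\"']([^\"']+)[\"']",
                                RegexOptions.IgnoreCase);

    // Step 3: download each image, resolving relative URLs against the page.
    int i = 0;
    foreach (Match m in matches)
    {
        var imageUri = new Uri(pageUri, m.Groups[1].Value);
        client.DownloadFile(imageUri, $"image{i++}{System.IO.Path.GetExtension(imageUri.AbsolutePath)}");
    }
}
```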
You can use a WebBrowser control and extract the HTML from that e.g.
System.Windows.Forms.WebBrowser objWebBrowser = new System.Windows.Forms.WebBrowser();
objWebBrowser.Navigate(new Uri("your url of html document"));
// Note: Navigate is asynchronous - wait for the DocumentCompleted
// event before reading the Document property.
System.Windows.Forms.HtmlDocument objDoc = objWebBrowser.Document;
System.Windows.Forms.HtmlElementCollection aColl = objDoc.All.GetElementsByName("IMG");
...
or directly invoke the IHTMLDocument family of COM interfaces
In general terms
You need to fetch the html page
Search for img tags and extract the src="..." portion out of them
Keep a list of all these extracted image urls.
Download them one by one.
Maybe this question about C# HTML parser will help you a little bit more.