I am using the HtmlAgilityPack from CodePlex.
When I pass a simple html string into it and then get the resulting html back,
it cuts off tags.
Example:
string html = "<select><option>test</option></select>";
HtmlDocument document = new HtmlDocument();
document.LoadHtml(html);
var result = document.DocumentNode.OuterHtml;
// result gives me:
<select><option>test</select>
So the closing tag for the option is missing. Am I missing a setting or using this wrong?
I fixed this by commenting out line 92 of HtmlNode.cs in the source, compiled and it worked like a charm.
ElementsFlags.Add("option", HtmlElementFlag.Empty); // comment this out
Found the answer on this question
In HTML the <option> end tag is optional.
In XHTML the <option> tag must be properly closed.
http://www.w3schools.com/tags/tag_option.asp
"There is also no adherence to XHTML or XML" - HTML Agility Pack.
This could be why? My guess is that if the tag is optional, the Agility Pack will leave it off. Hope this helps!
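As an alternative to editing HtmlNode.cs, the same flag can likely be removed at run time through the static HtmlNode.ElementsFlags dictionary. This is only a minimal sketch: it assumes the HtmlAgilityPack NuGet package is referenced, and the class and method names (OptionTagDemo, RoundTrip) are made up for illustration.

```csharp
using System;
using HtmlAgilityPack;

public static class OptionTagDemo
{
    // Parses the HTML and returns it re-serialized by HtmlAgilityPack.
    public static string RoundTrip(string html)
    {
        // Drop the special handling of <option> before parsing.
        // Remove is harmless if the key is not present.
        HtmlNode.ElementsFlags.Remove("option");

        var document = new HtmlDocument();
        document.LoadHtml(html);
        return document.DocumentNode.OuterHtml;
    }

    public static void Main()
    {
        Console.WriteLine(RoundTrip("<select><option>test</option></select>"));
    }
}
```

Note that ElementsFlags is static, so the change affects every document parsed afterwards in the same process.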
I'm brand new to HTML Agility Pack (as well as network-based programming in general). I am trying to extract a specific line of HTML, but I don't know enough about HTML Agility Pack's syntax to understand what I'm not writing correctly (and am lost in their documentation). URLs here are modified.
string html;
using (WebClient client = new WebClient())
{
html = client.DownloadString("https://google.com/");
}
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);
foreach (HtmlNode img in doc.DocumentNode.SelectNodes("//div[@class='ngg-gallery-thumbnail-box']//div[@class='ngg-gallery-thumbnail']//a"))
{
Debug.Log(img.GetAttributeValue("href", null));
}
return null;
This is what the HTML looks like
<div id="ngg-image-3" class="ngg-gallery-thumbnail-box" >
<div class="ngg-gallery-thumbnail">
<a href="https://urlhere.png"
// More code here
</a>
</div>
</div>
The problem occurs on the foreach line. I've tried matching examples online the best I can but am missing it. TIA.
HtmlAgilityPack uses XPath syntax to query nodes. HAP effectively treats the HTML document as an XML document, so the trick is learning enough XPath to combine tags and attributes into the query you need.
The HTML snippet you pasted isn't well formed (there's no closing > on the anchor tag). Assuming that it is closed, then
//div[@class='ngg-gallery-thumbnail-box']//div[@class='ngg-gallery-thumbnail']//a[@href]
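will return an HtmlNodeCollection containing only those <a> tags that have href attributes.
If none meet your criteria, SelectNodes returns null rather than an empty collection, so check for null before the foreach; otherwise nothing will be written and the loop will throw.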
For debugging purposes, try logging the node count or the OuterHtml of a less specific query to see what you're getting, e.g.
Debug.Log(doc.DocumentNode.SelectNodes("//div[@class='ngg-gallery-thumbnail-box']//div[@class='ngg-gallery-thumbnail']")[0].OuterHtml);
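Putting the query and a null check together might look like the sketch below. It assumes the HtmlAgilityPack NuGet package is referenced; the class and method names (ThumbnailLinks, Extract) are hypothetical, and the results are returned as a list instead of being logged through Unity's Debug.Log.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class ThumbnailLinks
{
    // Returns the href of every <a> inside the thumbnail divs,
    // or an empty list when nothing matches.
    public static List<string> Extract(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var hrefs = new List<string>();
        HtmlNodeCollection anchors = doc.DocumentNode.SelectNodes(
            "//div[@class='ngg-gallery-thumbnail-box']" +
            "//div[@class='ngg-gallery-thumbnail']//a[@href]");

        // SelectNodes yields null when no node matches, so guard the loop.
        if (anchors == null)
        {
            return hrefs;
        }

        foreach (HtmlNode anchor in anchors)
        {
            hrefs.Add(anchor.GetAttributeValue("href", string.Empty));
        }
        return hrefs;
    }
}
```

If Extract comes back empty for the downloaded page, the page probably does not contain those divs at all (for example because they are generated by script), which is worth checking before adjusting the XPath further.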
I'm parsing an HTML file using HTML Agility Pack. I want to get
<title>Some title</title>
As you see, title doesn't have a class. So I couldn't catch it no matter what I have tried. I couldn't find the solution on the web either. How can I catch this HTML tag which doesn't have a class? Thanks.
This might do the trick for you
doc.DocumentNode.SelectSingleNode("//head/title");
or
doc.DocumentNode.SelectSingleNode("//title");
or
doc.DocumentNode.Descendants("title").FirstOrDefault()
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(htmlContent);
var result = doc.DocumentNode.SelectSingleNode("//title"); // null if the document has no <title>
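For completeness, here is a small self-contained sketch of reading the title text while tolerating a missing <title>. It assumes the HtmlAgilityPack NuGet package is referenced; TitleReader and GetTitle are illustrative names only.

```csharp
using System;
using HtmlAgilityPack;

public static class TitleReader
{
    // Returns the trimmed text of the first <title>, or null if there is none.
    public static string GetTitle(string htmlContent)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(htmlContent);

        // No class or id is needed: the element name alone is a valid XPath step.
        HtmlNode title = doc.DocumentNode.SelectSingleNode("//title");
        return title == null ? null : title.InnerText.Trim();
    }

    public static void Main()
    {
        Console.WriteLine(GetTitle("<html><head><title>Some title</title></head><body></body></html>"));
    }
}
```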
I have a string that has HTML formatted content.
Now I want to convert that string to HTML. Can I use HtmlElementCollection for this?
Is it possible? If yes, then how?
Kindly explain. Thanks!
Take a look at the HtmlAgilityPack. More information can be found in the answers to other similar questions.
A string is handled as HTML when you push it into an environment that renders the content as HTML.
When you push the string into an environment that doesn't handle HTML, or you explicitly say that you don't want it rendered as HTML, it will be rendered as plain text.
Use HtmlAgilityPack with the following:
var st1 = stringdata; // your html formatted string
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(st1); // this is now html doc
If someone is still looking for this:
Just put the code below in the place where you want to show the HTML
@((MarkupString)htmlString)
// htmlString => string that has HTML formatted content
// tested with Blazor Server on the latest version of .NET (6.0 at the time of writing)
I have continually had problems with Html Agility Pack; my XPath queries only ever work when they are extremely simple:
//*[@id='some_id']
or
//input
However, anytime they get more complicated, then Html Agility Pack can't handle it.
Here's an example demonstrating the problem, I'm using WebDriver to navigate to Google, and return the page source, which is passed to Html Agility Pack, and both WebDriver and HtmlAgilityPack attempt to locate the element/node (C#):
//The XPath query
const string xpath = "//form//tr[1]/td[1]//input[@name='q']";
//Navigate to Google and get page source
var driver = new FirefoxDriver(new FirefoxProfile()) { Url = "http://www.google.com" };
Thread.Sleep(2000);
//Can WebDriver find it?
var e = driver.FindElementByXPath(xpath);
Console.WriteLine(e!=null ? "Webdriver success" : "Webdriver failure");
//Can Html Agility Pack find it?
var source = driver.PageSource;
var htmlDoc = new HtmlDocument { OptionFixNestedTags = true };
htmlDoc.LoadHtml(source);
var nodes = htmlDoc.DocumentNode.SelectNodes(xpath);
Console.WriteLine(nodes!=null ? "Html Agility Pack success" : "Html Agility Pack failure");
driver.Quit();
In this case, WebDriver successfully located the item, but Html Agility Pack did not.
I know, in this case it's very easy to change the XPath to one that will work: //input[@name='q']. But that only fixes this specific example, which isn't the point. I need something that exactly, or at least closely, mirrors the behavior of WebDriver's XPath engine, or even the FirePath or FireFinder add-ons for Firefox.
If WebDriver can find it, then why can't Html Agility Pack find it too?
The issue you're running into is with the FORM element. HTML Agility Pack handles that element differently - by default, it will never report that it has children.
In the particular example you gave, this query does find the target element:
.//div/div[2]/table/tr/td/table/tr/td/div/table/tr/td/div/div[2]/input
However, this does not, so it's clear the form element is tripping up the parser:
.//form/div/div[2]/table/tr/td/table/tr/td/div/table/tr/td/div/div[2]/input
That behavior is configurable, though. If you place this line prior to parsing the HTML, the form will give you child nodes:
HtmlNode.ElementsFlags.Remove("form");
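A reduced sketch of that fix, using a small hand-written form instead of the Google page source, could look like this. It assumes the HtmlAgilityPack NuGet package is referenced; FormChildrenDemo and FindInForm are hypothetical names.

```csharp
using System;
using HtmlAgilityPack;

public static class FormChildrenDemo
{
    // Returns true when an <input name='q'> can be reached through its <form> ancestor.
    public static bool FindInForm(string html)
    {
        // Must run before LoadHtml so the parser nests children under <form>.
        HtmlNode.ElementsFlags.Remove("form");

        var htmlDoc = new HtmlDocument();
        htmlDoc.LoadHtml(html);

        HtmlNode input = htmlDoc.DocumentNode.SelectSingleNode("//form//input[@name='q']");
        return input != null;
    }

    public static void Main()
    {
        Console.WriteLine(FindInForm("<form action='/search'><div><input name='q'></div></form>"));
    }
}
```

As with any ElementsFlags change, this is a process-wide setting, so it is probably best done once at startup.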
I am using the HTML Agility Pack to convert
<font size="1">This is a test</font>
to
This is a test
using this code:
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);
string stripped = doc.DocumentNode.InnerText;
but I ran into an issue where I have this:
<font size="1">This is a test &amp; this is a joke</font>
and the code above converted this to
This is a test &amp; this is a joke
but I wanted it converted to:
This is a test & this is a joke
Does the HTML Agility Pack support what I am trying to do? Why doesn't it do this by default, or am I doing something wrong?
You can run HttpUtility.HtmlDecode() on the output.
However, note that InnerText will include HTML tags that may be contained inside the outermost tag. If you want to remove all tags, you will have to walk the document tree and retrieve all the text bit by bit.
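A minimal sketch of the decode step is shown below. It uses System.Net.WebUtility.HtmlDecode as a stand-in for HttpUtility.HtmlDecode so that no System.Web reference is needed; that swap, the HtmlAgilityPack NuGet reference, and the names TextStripper and StripAndDecode are assumptions made for this example.

```csharp
using System;
using System.Net;
using HtmlAgilityPack;

public static class TextStripper
{
    // Strips the markup and then decodes entities such as &amp; in the remaining text.
    public static string StripAndDecode(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // InnerText leaves entities encoded, so decode them in a second step.
        string stripped = doc.DocumentNode.InnerText;
        return WebUtility.HtmlDecode(stripped);
    }

    public static void Main()
    {
        Console.WriteLine(StripAndDecode("<font size=\"1\">This is a test &amp; this is a joke</font>"));
    }
}
```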