HTML
<html>
<head>
<title>Sample Page</title>
</head>
<body>
<form action="demo_form.asp" id="form1" method="get">
First name: <input type="text" name="fname"><br>
Last name: <input type="text" name="lname"><br>
<input type="submit" value="Submit">
</form>
</body>
</html>
Code
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(File.ReadAllText(@"C:\sample.html"));
HtmlNode nd = doc.DocumentNode.SelectSingleNode("//form[@id='form1']");
//nd.InnerHtml is "".
//nd.InnerText is "".
Problem
nd.ChildNodes //Collection(to get all nodes in form) is always null.
nd.SelectNodes("/input") //returns null.
nd.SelectNodes("./input") //returns null.
"//form[@id='form1']/input" //returns null.
What I want is to access the child nodes of the form tag with id="form1" one by one, in order of occurrence. I tried the same XPath in the Chrome developer console and it works exactly the way I wanted. Does HtmlAgilityPack have a problem reading HTML from a file or the web?
Your HTML is invalid and may be preventing the Html Agility Pack from working properly.
Try adding a doctype (and an XML namespace) to the start of your document, and change your input elements' closing tags from > to />.
Try adding the following statement before loading the document:
HtmlNode.ElementsFlags.Remove("form");
HtmlAgilityPack's default behaviour adds all the form's inner elements as siblings instead of children. The statement above alters that behaviour so that they (meaning the input tags) will appear as child nodes.
Your code would look like this:
HtmlNode.ElementsFlags.Remove("form");
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(File.ReadAllText(@"C:\sample.html"));
HtmlNode nd = doc.DocumentNode.SelectSingleNode("//form[@id='form1']");
etc...
references:
bug issue & fix: http://htmlagilitypack.codeplex.com/workitem/23074
codeplex forum post: http://htmlagilitypack.codeplex.com/discussions/247206
Related
I have some HTML code stored in a string variable, resulting from an HttpWebRequest:
<html>
<head>
<div>Lots of scripts and libraries</div>
</head>
<body>
<div>Some very useful data</div>
</body>
<footer>
<div>Not interesting struff</div>
</footer>
</html>
How can I remove all the unnecessary nodes and get to this:
<body>
<div>Some very useful data</div>
</body>
The easiest way is to use HtmlAgilityPack to grab just the body tag.
var document = new HtmlAgilityPack.HtmlDocument();
document.LoadHtml(html);
HtmlNode body = document.DocumentNode.SelectSingleNode("//body");
From there, you can use HtmlAgilityPack to further parse the body node for more detail.
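For instance, a minimal sketch of drilling further into the extracted body node (the sample markup below is made up for illustration; note that a relative XPath with a leading dot searches only within that node):

```csharp
using System;
using HtmlAgilityPack;

class Demo
{
    static void Main()
    {
        string html = "<html><head><div>Lots of scripts and libraries</div></head>" +
                      "<body><div>Some very useful data</div></body></html>";

        var document = new HtmlDocument();
        document.LoadHtml(html);
        HtmlNode body = document.DocumentNode.SelectSingleNode("//body");

        // ".//div" is relative to body, so the div in <head> is not matched.
        foreach (HtmlNode div in body.SelectNodes(".//div"))
        {
            Console.WriteLine(div.InnerText); // prints: Some very useful data
        }
    }
}
```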
Having read so much about not using regexes to strip HTML, I am wondering how to get some links into my RichTextBox without all the messy HTML that comes with the content I download from a newspaper site.
What I have: HTML from a newspaper website.
What I want: the article as plain text in a RichTextBox, but with the links preserved (that is, replacing <a href="foo">bar</a> with <Hyperlink NavigateUri="foo">bar</Hyperlink>).
HtmlAgilityPack gives me HtmlNode.InnerText (stripped of all HTML tags) and HtmlNode.InnerHtml (with all tags). I can get the URL and text of the link(s) with articlenode.SelectNodes(".//a"), but how do I know where to insert them in the plain text of HtmlNode.InnerText?
Any hint would be appreciated.
Here is how you can do it (with a sample console app but the idea is the same for Silverlight):
Let's suppose you have this HTML:
<html>
<head></head>
<body>
Link 1: <a href="foo1">bar</a>
Link 2: <a href="foo2">bar2</a>
</body>
</html>
Then this code:
HtmlDocument doc = new HtmlDocument();
doc.Load(myFileHtm);
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//a"))
{
// replace the HREF element in the DOM at the exact same place
// by a deep cloned one, with a different name
HtmlNode newNode = node.ParentNode.ReplaceChild(node.CloneNode("Hyperlink", true), node);
// modify some attributes
newNode.SetAttributeValue("NavigateUri", newNode.GetAttributeValue("href", null));
newNode.Attributes.Remove("href");
}
doc.Save(Console.Out);
will output this:
<html>
<head></head>
<body>
Link 1: <hyperlink navigateuri="foo1">bar</hyperlink>
Link 2: <hyperlink navigateuri="foo2">bar2</hyperlink>
</body>
</html>
Hi, I am using the XML file given below, and I want to parse the HTML inside it.
<Description>
<Fullcontent>
<div id="container" class="cf">
<link rel="stylesheet" href="http://dev2.mercuryminds.com/imageslider/css/demo.css" type="text/css" media="screen" />
<ul class="slides">
<li>Sonam Kapoor<img src="http://deys.jpeg"/></li>
<li>Amithab<img src="http://deysAmithab.jpeg"/></li>
<li>sridevi<img src="http://deyssridevi.jpeg"/></li>
<li>anil-kapoor<img src="http://deysanil-kapoor.jpeg"/></li>
</ul>
</div>
</Fullcontent>
</Description>
I want to bind each image with its name.
You can install HtmlAgilityPack from NuGet (just search for "agility"). Parsing is also simple. Here is a way to select the image tags and take their src attributes:
HtmlDocument html = new HtmlDocument();
html.Load(path_to_file);
var urls = html.DocumentNode.SelectNodes("//ul[@class='slides']/li/img")
.Select(node => node.Attributes["src"].Value);
By the way, it looks like direct selection of attributes (returning attribute values straight from the XPath query) is not supported yet.
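To cover the "bind image with name" part: each <li> holds a text node (the name) followed by the <img>, so you can pair them up. A minimal sketch, using a shortened version of the XML rather than the full file:

```csharp
using System;
using HtmlAgilityPack;

class Demo
{
    static void Main()
    {
        string xml = "<Description><Fullcontent><ul class=\"slides\">" +
                     "<li>Sonam Kapoor<img src=\"http://deys.jpeg\"/></li>" +
                     "<li>Amithab<img src=\"http://deysAmithab.jpeg\"/></li>" +
                     "</ul></Fullcontent></Description>";

        var html = new HtmlDocument();
        html.LoadHtml(xml);

        foreach (HtmlNode li in html.DocumentNode.SelectNodes("//ul[@class='slides']/li"))
        {
            string name = li.InnerText;                     // the text node, e.g. "Sonam Kapoor"
            string src = li.SelectSingleNode("./img")
                           .GetAttributeValue("src", null); // e.g. "http://deys.jpeg"
            Console.WriteLine("{0} -> {1}", name, src);
        }
    }
}
```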
I have a requirement to extract all the text present in the <body> of an HTML document. Sample HTML input:
<html>
<title>title</title>
<body>
<h1> This is a big title.</h1>
How are doing you?
<h3> I am fine </h3>
<img src="abc.jpg"/>
</body>
</html>
The output should be :-
This is a big title. How are doing you? I am fine
I want to use only HtmlAgility for this purpose. No regular expressions please.
I know how to load an HtmlDocument and get the body contents with an XPath query like '//body'. But how do I strip the HTML, as shown in the output above?
Thanks in advance :)
You can use the body's InnerText:
string html = @"
<html>
<title>title</title>
<body>
<h1> This is a big title.</h1>
How are doing you?
<h3> I am fine </h3>
<img src=""abc.jpg""/>
</body>
</html>";
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);
string text = doc.DocumentNode.SelectSingleNode("//body").InnerText;
Next, you may want to collapse spaces and new lines:
text = Regex.Replace(text, @"\s+", " ").Trim();
Note, however, that while it works in this case, markup such as hello<br>world or hello<i>world</i> will be converted by InnerText to helloworld - the tags are simply removed. That issue is difficult to solve in general, as display is often determined by the CSS, not just by the markup.
How about using the XPath expression '//body//text()' to select all text nodes?
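A minimal sketch of that approach: selecting the individual text nodes keeps them separated, so joining them with spaces also sidesteps the helloworld problem mentioned above.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

class Demo
{
    static void Main()
    {
        string html = "<html><title>title</title><body>" +
                      "<h1> This is a big title.</h1>How are doing you?" +
                      "<h3> I am fine </h3><img src=\"abc.jpg\"/></body></html>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Each text node comes back as a separate item, already split at tag boundaries.
        var parts = doc.DocumentNode
                       .SelectNodes("//body//text()")
                       .Select(node => node.InnerText.Trim())
                       .Where(text => text.Length > 0);

        Console.WriteLine(string.Join(" ", parts));
        // -> This is a big title. How are doing you? I am fine
    }
}
```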
You can use NUglify that supports text extraction from HTML:
var result = Uglify.HtmlToText("<div> <p>This is <em> a text </em></p> </div>");
Console.WriteLine(result.Code); // prints: This is a text
As it uses a custom HTML5 parser, it should be quite robust (especially if the document doesn't contain any errors) and very fast: no regex involved, just a pure recursive-descent parser, faster than HtmlAgilityPack and more GC friendly.
Normally I would recommend an HTML parser for parsing HTML, but since you want to remove all HTML tags, a simple regex should work.
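For example, a rough sketch with two passes, one to drop tags and one to collapse the leftover whitespace. Note this naive pattern will mis-handle script blocks, comments, or a > inside an attribute value, which is exactly why parsers are usually preferred:

```csharp
using System;
using System.Text.RegularExpressions;

class Demo
{
    static void Main()
    {
        string html = "<body><h1> This is a big title.</h1>How are doing you?" +
                      "<h3> I am fine </h3><img src=\"abc.jpg\"/></body>";

        // Replace every tag with a space so adjacent words don't fuse together...
        string text = Regex.Replace(html, "<[^>]+>", " ");
        // ...then collapse runs of whitespace into single spaces.
        text = Regex.Replace(text, @"\s+", " ").Trim();

        Console.WriteLine(text);
        // -> This is a big title. How are doing you? I am fine
    }
}
```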
Using the html agility pack; how would I extract an html "fragment" from a full html document? For my purposes, an html "fragment" is defined as all content inside of the <body> tags.
For example:
Sample Input:
<html>
<head>
<title>blah</title>
</head>
<body>
<p>My content</p>
</body>
</html>
Desired Output:
<p>My content</p>
Ideally, I'd like to return the content unaltered if it didn't contain an <html> or <body> element (e.g. assume that I was passed a fragment in the first place if it wasn't a full HTML document).
Can anyone point me in the right direction?
I think you need to do it in pieces.
You can call SelectSingleNode on the document for the body (or html) element as follows:
doc.DocumentNode.SelectSingleNode("//body") // returns the body with its entire contents :)
Then you can check the result for null: if no body was found, your fragment criterion is met and you can take the string as it is.
Hope it helps :)
The following will work:
public string GetFragment(HtmlDocument doc)
{
    HtmlNode body = doc.DocumentNode.SelectSingleNode("//body");
    return body == null ? doc.DocumentNode.InnerHtml : body.InnerHtml;
}