HTML
<html>
<head>
<title>Sample Page</title>
</head>
<body>
<form action="demo_form.asp" id="form1" method="get">
First name: <input type="text" name="fname"><br>
Last name: <input type="text" name="lname"><br>
<input type="submit" value="Submit">
</form>
</body>
</html>
Code
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(File.ReadAllText(@"C:\sample.html"));
HtmlNode nd = doc.DocumentNode.SelectSingleNode("//form[@id='form1']");
//nd.InnerHtml is "".
//nd.InnerText is "".
Problem
nd.ChildNodes //Collection(to get all nodes in form) is always null.
nd.SelectNodes("/input") //returns null.
nd.SelectNodes("./input") //returns null.
"//form[@id='form1']/input" //returns null.
What I want is to access the child nodes of the form tag with id="form1" one by one, in order of occurrence. I tried the same XPath in the Chrome developer console and it works exactly the way I wanted. Does HtmlAgilityPack have a problem reading HTML from a file or the web?
Your HTML is invalid and may be preventing the Html Agility Pack from working properly.
Try adding a doctype (and an XML namespace) to the start of your document, and change your input elements' closing tags from > to />.
Try adding the following statement before loading the document:
HtmlNode.ElementsFlags.Remove("form");
HtmlAgilityPack's default behaviour adds all the form's inner elements as siblings instead of children. The statement above alters that behaviour so that they (meaning the input tags) will appear as child nodes.
Your code would look like this:
HtmlNode.ElementsFlags.Remove("form");
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(File.ReadAllText(@"C:\sample.html"));
HtmlNode nd = doc.DocumentNode.SelectSingleNode("//form[@id='form1']");
etc...
references:
bug issue & fix: http://htmlagilitypack.codeplex.com/workitem/23074
codeplex forum post: http://htmlagilitypack.codeplex.com/discussions/247206
Related
I have some HTML code stored in a string variable, resulting from an HttpWebRequest:
<html>
<head>
<div>Lots of scripts and libraries</div>
</head>
<body>
<div>Some very useful data</div>
</body>
<footer>
<div>Not interesting struff</div>
</footer>
</html>
How can I remove all the unnecessary nodes and get to this:
<body>
<div>Some very useful data</div>
</body>
The easiest way is to use HtmlAgilityPack to grab just the body tag.
var document = new HtmlAgilityPack.HtmlDocument();
document.LoadHtml(html);
HtmlNode body = document.DocumentNode.SelectSingleNode("//body");
From there, you can use HtmlAgilityPack to further parse the body node for more detail.
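For instance, a minimal sketch of drilling further into the extracted body node (the sample markup below is made up for illustration; note that a relative XPath with a leading dot searches only within that node):

```csharp
using System;
using HtmlAgilityPack;

class Demo
{
    static void Main()
    {
        string html = "<html><head><div>Lots of scripts and libraries</div></head>" +
                      "<body><div>Some very useful data</div></body></html>";

        var document = new HtmlDocument();
        document.LoadHtml(html);
        HtmlNode body = document.DocumentNode.SelectSingleNode("//body");

        // ".//div" is relative to body, so the div in <head> is not matched.
        foreach (HtmlNode div in body.SelectNodes(".//div"))
        {
            Console.WriteLine(div.InnerText); // prints: Some very useful data
        }
    }
}
```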
Having read so much about not using regexes to strip HTML, I am wondering how to get some links into my RichTextBox without all the messy HTML that comes with the content I download from a newspaper site.
What I have: HTML from a newspaper website.
What I want: the article as plain text in a RichTextBox, but with the links preserved (that is, replacing <a href="foo">bar</a> with <Hyperlink NavigateUri="foo">bar</Hyperlink>).
HtmlAgilityPack gives me HtmlNode.InnerText (stripped of all HTML tags) and HtmlNode.InnerHtml (with all tags). I can get the URL and text of the link(s) with articlenode.SelectNodes(".//a"), but how do I know where to insert them in the plain text of HtmlNode.InnerText?
Any hint would be appreciated.
Here is how you can do it (with a sample console app but the idea is the same for Silverlight):
Let's suppose you have this HTML:
<html>
<head></head>
<body>
Link 1: <a href="foo1">bar</a>
Link 2: <a href="foo2">bar2</a>
</body>
</html>
Then this code:
HtmlDocument doc = new HtmlDocument();
doc.Load(myFileHtm);
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//a"))
{
// replace the HREF element in the DOM at the exact same place
// by a deep cloned one, with a different name
HtmlNode newNode = node.ParentNode.ReplaceChild(node.CloneNode("Hyperlink", true), node);
// modify some attributes
newNode.SetAttributeValue("NavigateUri", newNode.GetAttributeValue("href", null));
newNode.Attributes.Remove("href");
}
doc.Save(Console.Out);
will output this:
<html>
<head></head>
<body>
Link 1: <hyperlink navigateuri="foo1">bar</hyperlink>
Link 2: <hyperlink navigateuri="foo2">bar2</hyperlink>
</body>
</html>
Hi, I am using the XML file given below, and I want to parse the HTML inside it.
<Description>
<Fullcontent>
<div id="container" class="cf">
<link rel="stylesheet" href="http://dev2.mercuryminds.com/imageslider/css/demo.css" type="text/css" media="screen" />
<ul class="slides">
<li>Sonam Kapoor<img src="http://deys.jpeg"/></li>
<li>Amithab<img src="http://deysAmithab.jpeg"/></li>
<li>sridevi<img src="http://deyssridevi.jpeg"/></li>
<li>anil-kapoor<img src="http://deysanil-kapoor.jpeg"/></li>
</ul>
</div>
</Fullcontent>
</Description>
I want to bind each image with its name.
You can install HtmlAgilityPack from NuGet (just search for "agility"). Parsing is also simple. Here is a way to select the image tags and take their src attributes:
HtmlDocument html = new HtmlDocument();
html.Load(path_to_file);
var urls = html.DocumentNode.SelectNodes("//ul[@class='slides']/li/img")
.Select(node => node.Attributes["src"].Value);
By the way, it looks like direct selection of attributes (returning attribute values straight from the XPath query) is not supported yet.
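To cover the "bind image with name" part: each <li> holds a text node (the name) followed by the <img>, so you can pair them up. A minimal sketch, using a shortened version of the XML rather than the full file:

```csharp
using System;
using HtmlAgilityPack;

class Demo
{
    static void Main()
    {
        string xml = "<Description><Fullcontent><ul class=\"slides\">" +
                     "<li>Sonam Kapoor<img src=\"http://deys.jpeg\"/></li>" +
                     "<li>Amithab<img src=\"http://deysAmithab.jpeg\"/></li>" +
                     "</ul></Fullcontent></Description>";

        var html = new HtmlDocument();
        html.LoadHtml(xml);

        foreach (HtmlNode li in html.DocumentNode.SelectNodes("//ul[@class='slides']/li"))
        {
            string name = li.InnerText;                     // the text node, e.g. "Sonam Kapoor"
            string src = li.SelectSingleNode("./img")
                           .GetAttributeValue("src", null); // e.g. "http://deys.jpeg"
            Console.WriteLine("{0} -> {1}", name, src);
        }
    }
}
```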
I have a requirement to extract all the text present in the <body> of an HTML document. Sample HTML input:
<html>
<title>title</title>
<body>
<h1> This is a big title.</h1>
How are doing you?
<h3> I am fine </h3>
<img src="abc.jpg"/>
</body>
</html>
The output should be :-
This is a big title. How are doing you? I am fine
I want to use only HtmlAgility for this purpose. No regular expressions please.
I know how to load an HtmlDocument and get the body contents with an XPath query like '//body'. But how do I strip the HTML, as shown in the output above?
Thanks in advance :)
You can use the body's InnerText:
string html = @"
<html>
<title>title</title>
<body>
<h1> This is a big title.</h1>
How are doing you?
<h3> I am fine </h3>
<img src=""abc.jpg""/>
</body>
</html>";
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);
string text = doc.DocumentNode.SelectSingleNode("//body").InnerText;
Next, you may want to collapse spaces and new lines:
text = Regex.Replace(text, @"\s+", " ").Trim();
Note, however, that while it works in this case, markup such as hello<br>world or hello<i>world</i> will be converted by InnerText to helloworld - the tags are simply removed. That issue is difficult to solve in general, as display is often determined by the CSS, not just by the markup.
How about using the XPath expression '//body//text()' to select all text nodes?
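A minimal sketch of that approach: selecting the individual text nodes keeps them separated, so joining them with spaces also sidesteps the helloworld problem mentioned above.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

class Demo
{
    static void Main()
    {
        string html = "<html><title>title</title><body>" +
                      "<h1> This is a big title.</h1>How are doing you?" +
                      "<h3> I am fine </h3><img src=\"abc.jpg\"/></body></html>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Each text node comes back as a separate item, already split at tag boundaries.
        var parts = doc.DocumentNode
                       .SelectNodes("//body//text()")
                       .Select(node => node.InnerText.Trim())
                       .Where(text => text.Length > 0);

        Console.WriteLine(string.Join(" ", parts));
        // -> This is a big title. How are doing you? I am fine
    }
}
```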
You can use NUglify that supports text extraction from HTML:
var result = Uglify.HtmlToText("<div> <p>This is <em> a text </em></p> </div>");
Console.WriteLine(result.Code); // prints: This is a text
As it uses a custom HTML5 parser, it should be quite robust (especially if the document doesn't contain any errors) and very fast: no regex involved, just a pure recursive-descent parser, faster than HtmlAgilityPack and more GC friendly.
Normally I would recommend an HTML parser for parsing HTML, but since you want to remove all HTML tags, a simple regex should work.
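For example, a rough sketch with two passes, one to drop tags and one to collapse the leftover whitespace. Note this naive pattern will mis-handle script blocks, comments, or a > inside an attribute value, which is exactly why parsers are usually preferred:

```csharp
using System;
using System.Text.RegularExpressions;

class Demo
{
    static void Main()
    {
        string html = "<body><h1> This is a big title.</h1>How are doing you?" +
                      "<h3> I am fine </h3><img src=\"abc.jpg\"/></body>";

        // Replace every tag with a space so adjacent words don't fuse together...
        string text = Regex.Replace(html, "<[^>]+>", " ");
        // ...then collapse runs of whitespace into single spaces.
        text = Regex.Replace(text, @"\s+", " ").Trim();

        Console.WriteLine(text);
        // -> This is a big title. How are doing you? I am fine
    }
}
```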
Using the html agility pack; how would I extract an html "fragment" from a full html document? For my purposes, an html "fragment" is defined as all content inside of the <body> tags.
For example:
Sample Input:
<html>
<head>
<title>blah</title>
</head>
<body>
<p>My content</p>
</body>
</html>
Desired Output:
<p>My content</p>
Ideally, I'd like to return the content unaltered if it didn't contain an <html> or <body> element (e.g. assume that I was passed a fragment in the first place if it wasn't a full HTML document).
Can anyone point me in the right direction?
I think you need to do it in pieces.
You can call SelectSingleNode on the document for the body (or html) element as follows:
doc.DocumentNode.SelectSingleNode("//body") // returns the body with its entire contents :)
Then you can check the result for null: if no body was found, your fragment criterion is met and you can take the string as it is.
Hope it helps :)
The following will work:
public string GetFragment(HtmlDocument doc)
{
    HtmlNode body = doc.DocumentNode.SelectSingleNode("//body");
    return body == null ? doc.DocumentNode.InnerHtml : body.InnerHtml;
}