Remove HTML nodes from HTTP Request - c#

I have some HTML code stored in a string variable, obtained from an HttpWebRequest:
<html>
<head>
<div>Lots of scripts and libraries</div>
</head>
<body>
<div>Some very useful data</div>
</body>
<footer>
<div>Not interesting stuff</div>
</footer>
</html>
How can I remove all the unnecessary nodes and end up with this:
<body>
<div>Some very useful data</div>
</body>

The easiest way is to use HtmlAgilityPack to grab just the body tag.
var document = new HtmlAgilityPack.HtmlDocument();
document.LoadHtml(html);
HtmlNode body = document.DocumentNode.SelectSingleNode("//body");
From there, you can use HtmlAgilityPack to further parse the body node for more detail.
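As a minimal sketch of the approach above, the following wraps the lookup in a helper that returns the body markup as a string. The class name `BodyExtractor`, the method name `ExtractBody`, and the sample HTML are made up for illustration; the null check covers pages that have no body element.

```csharp
using System;
using HtmlAgilityPack;

class BodyExtractor
{
    // Hypothetical helper: returns the <body> element's markup,
    // or null when the parsed document contains no <body> element.
    static string ExtractBody(string html)
    {
        var document = new HtmlDocument();
        document.LoadHtml(html);

        HtmlNode body = document.DocumentNode.SelectSingleNode("//body");
        return body?.OuterHtml;
    }

    static void Main()
    {
        // Sample input standing in for the HttpWebRequest response.
        string html = "<html><head><title>t</title></head>" +
                      "<body><div>Some very useful data</div></body></html>";

        Console.WriteLine(ExtractBody(html) ?? "(no body element found)");
    }
}
```

Use `InnerHtml` instead of `OuterHtml` if the surrounding body tags should be dropped as well.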

CsQuery replace tags

I am using CsQuery to parse HTML documents. What I'm trying to do is replace all the "br" HTML tags with a "." character.
Assuming that this is my input HTML:
<html>
<body>
Hello
<br>
World
</body>
</html>
The desired output is:
<html>
<body>
Hello
.
World
</body>
</html>
Pseudo code:
CQ dom = CQ.CreateFromUrl("http://my.url");
dom.ReplaceTag("<br>", ".");
Is this possible?
Thanks for any advice.
That's pretty easy: replace the <br> elements by setting their OuterHTML.
The relevant selector is just "br":
foreach (var br in dom["br"])
br.OuterHTML = ".";
Call dom.Render() to see the result.

HTML agility parsing error

HTML
<html>
<head>
<title>Sample Page</title>
</head>
<body>
<form action="demo_form.asp" id="form1" method="get">
First name: <input type="text" name="fname"><br>
Last name: <input type="text" name="lname"><br>
<input type="submit" value="Submit">
</form>
</body>
</html>
Code
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(File.ReadAllText(@"C:\sample.html"));
HtmlNode nd = doc.DocumentNode.SelectSingleNode("//form[@id='form1']");
//nd.InnerHtml is "".
//nd.InnerText is "".
Problem
nd.ChildNodes //Collection(to get all nodes in form) is always null.
nd.SelectNodes("/input") //returns null.
nd.SelectNodes("./input") //returns null.
"//form[@id='form1']/input" //returns null.
What I want is to access the child nodes of the form tag with id=form1 one by one, in order of occurrence. I tried the same XPath in the Chrome developer console and it works exactly the way I want. Does HtmlAgilityPack have a problem reading HTML from a file or from the web?
Your HTML is invalid, which may be preventing HtmlAgilityPack from working properly.
Try adding a doctype (and an XML namespace) to the start of your document, and change your input elements' closing tags from > to />.
Try adding the following statement before loading the document:
HtmlNode.ElementsFlags.Remove("form");
HtmlAgilityPack's default behaviour adds all of the form's inner elements as siblings instead of children. The statement above alters that behaviour so that they (meaning the input tags) appear as child nodes.
Your code would look like this:
HtmlNode.ElementsFlags.Remove("form");
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(File.ReadAllText(@"C:\sample.html"));
HtmlNode nd = doc.DocumentNode.SelectSingleNode("//form[@id='form1']");
etc...
references:
bug issue & fix: http://htmlagilitypack.codeplex.com/workitem/23074
codeplex forum post: http://htmlagilitypack.codeplex.com/discussions/247206
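To visit the form's children one by one in document order, something along these lines may work once the flag has been removed. This is a sketch only: the class name `FormChildren` and the inline sample markup are assumptions for illustration. XPath attribute tests use `@`, as in `[@id='form1']`. `ElementsFlags` is a dictionary, so `Remove` does nothing if the "form" entry is not present.

```csharp
using System;
using HtmlAgilityPack;

class FormChildren
{
    static void Main()
    {
        // Must run before the document is parsed, since it changes how
        // the parser treats <form> elements.
        HtmlNode.ElementsFlags.Remove("form");

        var doc = new HtmlDocument();
        // Inline sample markup standing in for C:\sample.html.
        doc.LoadHtml(
            "<form action=\"demo_form.asp\" id=\"form1\" method=\"get\">" +
            "First name: <input type=\"text\" name=\"fname\"><br>" +
            "Last name: <input type=\"text\" name=\"lname\"><br>" +
            "<input type=\"submit\" value=\"Submit\"></form>");

        HtmlNode form = doc.DocumentNode.SelectSingleNode("//form[@id='form1']");
        if (form == null)
        {
            Console.WriteLine("form1 not found");
            return;
        }

        // ChildNodes preserves document order; skip text nodes and
        // report only element children.
        foreach (HtmlNode child in form.ChildNodes)
        {
            if (child.NodeType == HtmlNodeType.Element)
            {
                Console.WriteLine(child.Name + " name=" + child.GetAttributeValue("name", ""));
            }
        }
    }
}
```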

Insert HTML Code Just After the Opening BODY Tag

Does anyone have any samples of injecting an HTML snippet just after the opening BODY tag in an ASP.NET Web Forms page? The positioning of this code is very specific.
The beginning HTML might look like this:
</head>
<body>
<div id="header">
The resulting HTML should look like this:
</head>
<body>
<div id="new-div"></div>
<div id="header">
This is a scenario where the HTML cannot be manipulated directly, and javascript would do this too late for the additional HTML to be useful. It must be done with server-side code and in place before the HTML makes it to the web browser.
You can do it this way in your aspx markup:
<html>
<head>
</head>
<body>
<%= FunctionTheOutputsString() %>
The <%= is short for Response.Write(), which is a function that writes directly into the page.
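The markup above assumes that the page's code-behind class exposes a method with that name. A minimal sketch of what that could look like follows; the class name `SamplePage` is hypothetical, and the returned markup is just the div from the question. The method has to be `protected` or `public` so that the expression in the .aspx markup can call it.

```csharp
using System;
using System.Web.UI;

// Hypothetical code-behind class for the .aspx page shown above.
public partial class SamplePage : Page
{
    // Called from the markup via <%= FunctionTheOutputsString() %>;
    // the returned string is written into the response at that position.
    protected string FunctionTheOutputsString()
    {
        return "<div id=\"new-div\"></div>";
    }
}
```

Because the string is written as-is, any dynamic values placed inside it should be HTML-encoded first.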
With jQuery, you can use prepend() like so:
$(function(){
$('body').prepend('<div id="new-div">Div content</div>');
});
Update: Besides the answer above, a server-side solution could also be:
<html>
<head>
</head>
<body>
<asp:PlaceHolder id="divPlaceHolder" Visible="False" runat="server">
<div id="new-div">
</div>
</asp:PlaceHolder>
On Page_Load()...
if(SomeCondition)
divPlaceHolder.Visible=true;
Because non-visible controls aren't rendered, the new-div element will only be sent to the browser if SomeCondition is true.

Html Agility Pack - Get html fragment from an html document

Using the Html Agility Pack, how would I extract an HTML "fragment" from a full HTML document? For my purposes, an HTML "fragment" is defined as all content inside the <body> tags.
For example:
Sample Input:
<html>
<head>
<title>blah</title>
</head>
<body>
<p>My content</p>
</body>
</html>
Desired Output:
<p>My content</p>
Ideally, I'd like to return the content unaltered if it didn't contain an <html> or <body> element (eg. assume that I was passed a fragment in the first place if it wasn't a full html document)
Can anyone point me in the right direction?
I think you need to do it in pieces.
You can select the body node from the document as follows:
doc.DocumentNode.SelectSingleNode("//body") // returns body with its entire contents :)
Then check the result for null; if there is no body node, you were passed a fragment and can take the string as it is.
Hope it helps :)
The following will work:
public string GetFragment(HtmlDocument document)
{
    // Fall back to the whole document when there is no <body> element.
    HtmlNode body = document.DocumentNode.SelectSingleNode("//body");
    return body == null ? document.DocumentNode.InnerHtml : body.InnerHtml;
}
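Since the question starts from a string rather than an already parsed document, a string-in, string-out variant might be more convenient. The sketch below assumes that shape; the class name `FragmentHelper` and the sample inputs are made up for illustration.

```csharp
using System;
using HtmlAgilityPack;

static class FragmentHelper
{
    // Returns the inner markup of <body> when one is present; otherwise
    // treats the whole input as a fragment and returns it as parsed.
    public static string GetFragment(string html)
    {
        var document = new HtmlDocument();
        document.LoadHtml(html);

        HtmlNode body = document.DocumentNode.SelectSingleNode("//body");
        return body != null ? body.InnerHtml : document.DocumentNode.InnerHtml;
    }

    static void Main()
    {
        // A full document and a bare fragment, both passed through the same helper.
        Console.WriteLine(GetFragment("<html><head><title>blah</title></head><body><p>My content</p></body></html>"));
        Console.WriteLine(GetFragment("<p>Already a fragment</p>"));
    }
}
```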

How to add <link> or <meta> tags to <head> with HtmlAgilityPack?

The link to download documentation from http://htmlagilitypack.codeplex.com is returning an error and I can't figure this out by trying the code.
I'm trying to insert various tags into the <head> section of a HtmlDocument that I've loaded from a HTML string. The original issue I'm having is described here.
Can somebody give me an idea of how to achieve this? Thanks
Maybe a bit late :-) Suppose I have this test.htm HTML file:
<html>
<head>
<title>Hello World!</title>
</head>
<body>
Hello World
</body>
</html>
Here is how to add a LINK element underneath the HEAD element. You will note that the semantics are very close to those of System.Xml, on purpose:
HtmlDocument doc = new HtmlDocument();
doc.Load("test.htm");
HtmlNode head = doc.DocumentNode.SelectSingleNode("/html/head");
HtmlNode link = doc.CreateElement("link");
head.AppendChild(link);
link.SetAttributeValue("rel", "shortcut icon");
link.SetAttributeValue("href", "http://www.mysite.com/favicon.ico");
The result will be:
<html>
<head>
<title>Hello World!</title>
<link rel="shortcut icon" href="http://www.mysite.com/favicon.ico"></head>
<body>
Hello World
</body>
</html>
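The same pattern should carry over to META elements, which the question also asks about. The following is a sketch under that assumption: the class name `HeadEditor` and the attribute values are placeholders, the HTML is loaded from a string (as in the question) rather than from test.htm, and a null check guards against input that has no head element.

```csharp
using System;
using HtmlAgilityPack;

class HeadEditor
{
    static void Main()
    {
        var doc = new HtmlDocument();
        // Loaded from a string, as in the question, instead of from a file.
        doc.LoadHtml("<html><head><title>Hello World!</title></head>" +
                     "<body>Hello World</body></html>");

        HtmlNode head = doc.DocumentNode.SelectSingleNode("/html/head");
        if (head == null)
        {
            Console.WriteLine("No <head> element found; nothing added.");
            return;
        }

        // Placeholder attribute values, for illustration only.
        HtmlNode meta = doc.CreateElement("meta");
        meta.SetAttributeValue("name", "description");
        meta.SetAttributeValue("content", "Sample page");
        head.AppendChild(meta);

        // Print the modified document to inspect the result.
        Console.WriteLine(doc.DocumentNode.OuterHtml);
    }
}
```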
