Extracting an html fragment from an html document - c#

I'm looking for an efficient means of extracting an html "fragment" from an html document. My first implementation of this used the Html Agility Pack. This appeared to be a reasonable way to attack this problem, until I started running the extraction on large html documents - performance was very poor for something so trivial (I'm guessing due to the amount of time it was taking to parse the entire document).
Can anyone suggest a more efficient means of achieving my goal?
To summarize:
For my purposes, an html "fragment" is defined as all content inside of the <body> tags of an html document.
Ideally, I'd like to return the content unaltered if it didn't contain an <html> or <body> tag (I'll assume I was passed an html fragment to begin with).
I have the entire html document available in memory (as a string), I won't be streaming it on demand - so a potential solution won't need to worry about that.
Performance is critical, so a potential solution should account for this.
Sample Input:
<html>
<head>
<title>blah</title>
</head>
<body>
<p>My content</p>
</body>
</html>
Desired Output:
<p>My content</p>
A solution in C# or VB.NET would be welcome.

Most html is not going to be XHTML compliant. I would do an HTTP GET request and search the resultant text for the positions of "<body>" and "</body>" (IndexOf rather than .Contains, since you need the locations). You can use those two locations as your start and stop indexes for a reader stream. Outside the body tag you really don't need to worry about XML compliance.
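A minimal sketch of that index-based approach, assuming the whole document is already in a string (ExtractBodyFragment and its parameter are illustrative names, not anything from a library):
// Returns the markup between <body ...> and </body>, or the original
// string unaltered if no body tag is found (i.e. it was already a fragment).
static string ExtractBodyFragment(string html)
{
    int bodyOpen = html.IndexOf("<body", StringComparison.OrdinalIgnoreCase);
    if (bodyOpen < 0)
        return html;                 // no <body>: assume we were handed a fragment

    int openEnd = html.IndexOf('>', bodyOpen);
    int bodyClose = html.IndexOf("</body>", openEnd + 1, StringComparison.OrdinalIgnoreCase);
    if (openEnd < 0 || bodyClose < 0)
        return html;                 // malformed document: give the input back unaltered

    return html.Substring(openEnd + 1, bodyClose - openEnd - 1);
}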

You could hack it using a WebBrowser control and take advantage of the webBrowser1.Document property (though I'm not sure exactly what you're trying to accomplish).

If I remember correctly, I did something similar in the past with an XPathNavigator. I think it looked something like this:
XPathDocument xDoc = new System.Xml.XPath.XPathDocument(new StringReader(content));
XPathNavigator xNav = xDoc.CreateNavigator();
XPathNavigator node = xNav.SelectSingleNode("/body");
where you could change /body to whatever you need to look for.
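Note that since the document element is <html>, the path would typically be /html/body (or //body) rather than just /body, and the fragment can then be read straight off the navigator. A minimal sketch, assuming the input is well-formed XHTML (XPathDocument will reject anything else):
XPathNavigator bodyNode = xNav.SelectSingleNode("//body");
string fragment = bodyNode != null ? bodyNode.InnerXml : content;  // fall back to the original string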

Related

HTML Agility Pack (C#) malforms my code

I'm currently coding a desktop application in c# which also has to handle XHTML document manipulation. For that purpose I'm using the Html Agility Pack which seemed to be okay so far. After carefully checking the output from Html Agility Pack I found out that the code isn't well formed xhtml any more.
It removes self-closing tags (slash) and overwrites other proprietary code elements...
eg. input html code:
<input autocapitalize="off" id="username" name="username" placeholder="Benutzername" type="text" value="$(username)" />
eg. output html code
<input autocapitalize="off" id="username" name="username" placeholder="Benutzername" type="text" value="$(username)">
(removed the trailing slash...)
Another example is with proprietary code elements (for Mikrotik hotspot devices):
eg input html code
<form action="$(link-login-only)" method="post" name="login" $(if chap-id) onSubmit="return doLogin()"$(endif)>
The $(if chap-id), $(endif) and $(link-login-only) parts are custom code fragments interpreted from the Mikrotik device.
eg. output html code after Html Agility Pack (which transforms it to unusable code)
<form action="$(link-login-only)" method="post" name="login" $(if="" chap-id)="" onsubmit="return doLogin()" $(endif)="">
Has someone an idea how to "instruct" Html Agility Pack to output well formed XHTML and to ignore "custom code" fragments (is this possibly via Regex)?
Thanks in advance! :-)
In your first example, HTML Agility Pack is actually fixing your markup. The input element is a void element. Since there is no content inside, it needs no closing tag.
HTML Agility Pack is made for parsing valid HTML markup, not markup embedded with custom code. In your first example, the custom markup is inside quotes therefore isn't an issue. In your second example, the variables are outside quotes.
HTML Agility Pack tries to parse them as regular (but malformed) attributes of the element. There's no way to fix that. You'll have to find another way to parse your markup if you need support for custom code inside the markup.
Necromancing.
Problem 1 is because you probably didn't specify OptionOutputAsXml = true, meaning HtmlAgilityPack outputs HTML instead of XHTML.
Actually, doing this is rather clever, as it reduces the file size.
If you need XHTML, you need to specifically instruct HtmlAgilityPack to output XHTML (XML), not HTML (SGML).
SGML allows tags with no closing tag (e.g. <br>), while XML requires every element to be closed, either with a closing tag or the self-closing />.
To fix this:
public static void BeautifyHtml()
{
    string input = "<html><body><p>This is some test test<br ><ul><li>item 1<li>item2<</ul></body>";
    HtmlAgilityPack.HtmlDocument test = new HtmlAgilityPack.HtmlDocument();
    // Parsing options need to be set before LoadHtml; OptionOutputAsXml controls how Save writes the result.
    test.OptionOutputAsXml = true;
    test.OptionCheckSyntax = true;
    test.OptionFixNestedTags = true;
    test.LoadHtml(input);
    System.Text.StringBuilder sb = new System.Text.StringBuilder();
    using (System.IO.TextWriter stringWriter = new System.IO.StringWriter(sb))
    {
        test.Save(stringWriter);
    }
    string beautified = sb.ToString();
    System.Console.WriteLine(beautified);
}
An alternative is CsQuery which, at least for the simple cases you've got here, will leave your pre-processor tags alone by nature of just treating them like valueless attributes. That is, HAP appears to convert any attribute someattribute without a value to someattribute="". CsQuery won't do this.
However, the observations @Justin Niessner makes about your markup are going to be true for any parser that is not specifically designed to parse the templating code you have in there. Just because this one example makes it through CsQuery is no guarantee that some other format will still come out as a valid attribute name, or at least as something an HTML5 parser will accept.
If you need to manipulate something as HTML, then do it after templating. If you need to manipulate it before the templating engine has at it, then you're in a catch-22, since it's not HTML yet. Or alternatively you could use a templating system that uses valid HTML markup for its keywords (example: Knockout).

Repairing malformatted html attributes using c#

I have a web application with an upload functionality for HTML files generated by chess software to be able to include a javascript player that reproduces a chess game.
I don't want to load the uploaded files in a frame, so I reconstruct the HTML and javascript generated by the software by parsing the dynamic parts of the file.
The problem with the HTML is that all attribute values are surrounded with an apostrophe instead of a quotation mark. I am looking for a way to fix this using a library or a regex replace in c#.
The html looks like this:
<DIV class='pgb'><TABLE class='pgbb' CELLSPACING='0' CELLPADDING='0'><TR><TD>
and I would transform it into:
<DIV class="pgb"><TABLE class="pgbb" CELLSPACING="0" CELLPADDING="0"><TR><TD>
I'd say your best option is to use something like HTML Agility Pack to parse the generated HTML, and then ask it to re-serialize it to string (hopefully correcting any formatting problems in the process). Any attempt at Regexes or other direct string manipulation of HTML is going to be difficult, fragile and broken...
Example (when your HTML is stored in a file on the hard disk):
HtmlDocument doc = new HtmlDocument();
doc.Load("file.htm");
doc.Save("file.htm");
It is also possible to do this directly in memory from a string or Stream of input HTML.
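For example, a sketch of the in-memory variant (inputHtml here is just a placeholder for the uploaded markup):
var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(inputHtml);                          // parse the uploaded markup from a string
var sb = new System.Text.StringBuilder();
using (var writer = new System.IO.StringWriter(sb))
{
    doc.Save(writer);                             // re-serialize, hopefully normalizing the attribute quoting
}
string repaired = sb.ToString();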
you could use something like:
string outputString = Regex.Replace(inputString, @"(?<=\<[^<>]*)\'(?=[^<>]*\>)", "\"");
I changed it after Oded's remark; this leaves the body HTML intact. But I agree, regex is a bad idea for parsing HTML. Mark's answer is better.

Remove JavaScript with Regex

I am having trouble removing all javascript from a HTML page with C#. I have three regex expressions that remove a lot but miss a lot too. Parsing the javascript with the MSHTML DOM parser causes the javascript to actually run, which is what I am trying to avoid by using the regex.
"<script.*/>"
"<script[^>]*>.*</script>"
"<script.*?>[\\s\\S]*?</.*?script>"
Does anyone know what I am missing that is causing these three regex expressions to miss blocks of JavaScript?
An example of what I am trying to remove:
<script src="do_files/page.js" type="text/javascript"></script>
<script src="do_files/page.js" type="text/javascript" />
<script type="text/javascript">
<!--
var Time=new Application('Time')
//-->
</script>
<script type="text/javascript">
if(window['com.actions']) {
window['com.actions'].approvalStatement = "",
window['com.actions'].hasApprovalStatement = false
}
</script>
I assume you are trying to simply sanitize the input of JavaScript. Frankly I'm worried that this is too simple of a solution, 'cuz it seems so incredibly simple. See below for reasoning, after the expression (in a C# string):
#"(?s)<script.*?(/>|</script>)"
That's it - I hope! (It certainly works for your examples!)
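In C#, applying that pattern would look something like this (a small sketch; html stands for the page markup, and the IgnoreCase flag is my addition since script tags can be upper-case):
// Requires: using System.Text.RegularExpressions;
string pattern = @"(?s)<script.*?(/>|</script>)";
string withoutScripts = Regex.Replace(html, pattern, "", RegexOptions.IgnoreCase);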
My reasoning for the simplicity is that the primary issue with trying to parse HTML with regex is the potential for nested tags - it's not so much the nesting of DIFFERENT tags, but the nesting of SYNONYMOUS tags.
For example,
<b> bold <i> AND italic </i></b>
...is not so bad, but
<span class='BoldText'> bold <span class='ItalicText'> AND italic </span></span>
would be much harder to parse, because the ending tags are IDENTICAL.
However, since it is invalid to nest script tags, the next instance of />(<-is this valid?) or </script> is the end of this script block.
There's always the possibility of HTML comments or CDATA tags inside the script tag, but those should be fine if they don't contain </script>. HOWEVER: if they do, it would definitely be possible to get some 'code' through. I don't think the page would render, but some HTML parsers are amazingly flexible, so ya never know. To handle a little extra possible whitespace, you could use:
@"(?s)<\s?script.*?(/\s?>|<\s?/\s?script\s?>)"
Please let me know if you can figure out a way to break it that will let through VALID HTML code with runnable JavaScript (I know there are a few ways to get some stuff through, but it should be broken in one of many different ways if it does get through, and should not be runnable JavaScript code).
It is generally agreed upon that trying to parse HTML with regex is a bad idea and will yield bad results. Instead, you should use a DOM parser. jQuery wraps nicely around the browser's DOM and would allow you to very easily remove all <script> tags.
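Since this question is C#, the same idea with a DOM parser such as the HTML Agility Pack would be roughly the following (a sketch, not code from the answer; html stands for the page markup):
// Requires: using System.Linq; and the HtmlAgilityPack package.
var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html);
// Remove every <script> element from the parsed DOM, then write the document back out.
foreach (var script in doc.DocumentNode.Descendants("script").ToList())
    script.Remove();
string cleaned = doc.DocumentNode.OuterHtml;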
OK, I have faced a similar case, where I needed to clean "rich text" (text with HTML formatting) of any possible javascript.
There are several ways to add javascript to HTML:
by using the <script> tag, with javascript inside it, or by loading a javascript file using the "src" attribute.
ex: <script>maliciousCode();</script>
by using an event on an HTML element, such as "onload" or "onmouseover"
ex: <img src="a.jpg" onload="maliciousCode()">
by creating a hyperlink that calls javascript code
ex: <a href="javascript:maliciousCode()">...
This is all I can think of for now.
So the submitted HTML Code needs to be cleaned from these 3 cases. A simple solution would be to look for these patterns using Regex, and replace them by "" or do whatever else you want.
This is a simple code to do this:
public static string CleanHTMLFromScript(string str)
{
    // Requires: using System.Text.RegularExpressions;
    // 1. Strip opening <script ...> tags.
    Regex re = new Regex("<script[^>]*>", RegexOptions.IgnoreCase);
    str = re.Replace(str, "");
    // 2. Strip tags that carry inline event handlers (onload, onmouseover, ...).
    re = new Regex("<[a-z][^>]*on[a-z]+=\"?[^\"]*\"?[^>]*>", RegexOptions.IgnoreCase);
    str = re.Replace(str, "");
    // 3. Strip anchors whose href starts with javascript:.
    re = new Regex("<a\\s+href\\s*=\\s*\"?\\s*javascript:[^\"]*\"[^>]*>", RegexOptions.IgnoreCase);
    str = re.Replace(str, "");
    return str;
}
This code takes care of any spaces and quotes that may or may not be added. It seems to work fine; not perfect, but it does the trick. Any improvements are welcome.
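A quick usage example (the sample markup is just made up for illustration):
string submitted = "<p>Hi</p><img src=\"a.jpg\" onload=\"maliciousCode()\"><a href=\"javascript:maliciousCode()\">x</a>";
string safe = CleanHTMLFromScript(submitted);
// safe is now "<p>Hi</p>x</a>" -- the img tag and the anchor's opening tag have been stripped.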
Creating your own HTML parser or script detector is a particularly bad idea if this is being done to prevent cross-site scripting. Doing this by hand is a Very Bad Idea, because there are any number of corner cases and tricks that can be used to defeat such an attempt. This is termed "black listing", as it attempts to remove the unsafe items from HTML, and it's pretty much doomed to failure.
Much safer to use a white list processor (such as AntiSamy), which only allows approved items through by automatically escaping everything else.
Of course, if this isn't what you're doing then you should probably edit your question to give some more context...
Edit:
Now that we know you're using C#, try the HTMLAgilityPack as suggested here.
Which language are you using? As a general statement, Regular Expressions are not suitable for parsing HTML.
If you are on the .NET platform, the HTML Agility Pack offers a much better parser.
You should use a real html parser for the job. That being said, for simple stripping of script blocks you could use a rudimentary regex like the one below.
The idea is that you will need a callback to determine whether capture group 1 matched. If it did, the callback should pass things that hide html (like comments) back through unchanged; the script blocks are passed back as an empty string.
This won't substitute for an html processor though. Good luck!
Search Regex: (modifiers - expanded, global, include newlines in dot, callback func)
(?:
<script (?:\s+(?:".*?"|\'.*?\'|[^>]*?)+)? \s*> .*? </script\s*>
| </?script (?:\s+(?:".*?"|\'.*?\'|[^>]*?)+)? \s*/?>
)
|
( # Capture group 1
<!(?:DOCTYPE.*?|--.*?--)> # things that hide html, add more constructs here ...
)
Replacement func pseudo code:
string callback () {
if capture buffer 1 matched
return capt buffer 1
else return ''
}
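In C# that pseudo-code maps onto Regex.Replace with a MatchEvaluator delegate. A sketch along those lines (html stands for the input markup; the options correspond to the modifiers listed above, and Regex.Replace is already global in .NET):
// Requires: using System.Text.RegularExpressions;
string pattern = @"
    (?:
        <script (?:\s+(?:"".*?""|'.*?'|[^>]*?)+)? \s*> .*? </script\s*>
      | </?script (?:\s+(?:"".*?""|'.*?'|[^>]*?)+)? \s*/?>
    )
    |
    (                                # capture group 1
        <!(?:DOCTYPE.*?|--.*?--)>    # things that hide html, add more constructs here ...
    )";

string result = Regex.Replace(html, pattern,
    m => m.Groups[1].Success ? m.Groups[1].Value : string.Empty,
    RegexOptions.IgnorePatternWhitespace | RegexOptions.Singleline);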

Most efficient way to add missing alt tags for images in a large html document

In order to comply with accessibility standards, I need to ensure that all images in some dynamically-generated html (which I don't control) have an empty alt tag if none is specified.
Example input:
<html>
<body>
<img src="foo.gif" />
<p>Some other content</p>
<img src="bar.gif" alt="" />
<img src="blah.gif" alt="Blah!" />
</body>
</html>
Desired output:
<html>
<body>
<img src="foo.gif" alt="" />
<p>Some other content</p>
<img src="bar.gif" alt="" />
<img src="blah.gif" alt="Blah!" />
</body>
</html>
The html could be quite large and the DOM heavily-nested, so using something like the Html Agility Pack is out.
Can anyone suggest an efficient way to accomplish this?
Update:
It is a safe assumption that the html I'm dealing with is well-formed, so a potential solution need not account for that at all.
Your problem seems very specific: you need to alter some output, but you don't want to parse the whole thing with (something general-purpose like) HTMLAgilityPack for performance reasons. The best solution would seem to be to do it the hard way.
I would just brute force it. It would be hard to do it more efficiently than something like this (completely untested and almost guaranteed not to work exactly as-is, but logic should be fine, if missing a "+1" or "-1" somewhere):
string addAltTag(string html)
{
    // Requires: using System; using System.Text;
    StringBuilder sb = new StringBuilder();
    int pos = 0;
    int lastPos = 0;
    while (pos >= 0)
    {
        pos = html.IndexOf("<img", pos, StringComparison.OrdinalIgnoreCase);
        if (pos < 0)
            break;
        // Images can't have children, and there should not be any angle
        // brackets anywhere in the attributes, so the next '>' closes the tag.
        int nextPos = html.IndexOf(">", pos);
        if (nextPos < 0)
        {
            // Unclosed image tag -- just quit.
            break;
        }
        // Back up over the '/' if the tag is XML-style self-closing.
        int insertPos = nextPos;
        if (html[insertPos - 1] == '/')
            insertPos--;
        // Output everything from the last position up to (but not including)
        // the point where the alt attribute would go.
        sb.Append(html, lastPos, insertPos - lastPos);
        // Can't just look for "alt" -- it could be in the image url or a class
        // name -- so only search this tag's attributes.
        if (html.Substring(pos, nextPos - pos).IndexOf(" alt=\"", StringComparison.OrdinalIgnoreCase) < 0)
        {
            sb.Append(" alt=\"\"");
        }
        lastPos = insertPos;
        pos = nextPos;
    }
    sb.Append(html, lastPos, html.Length - lastPos);
    return sb.ToString();
}
You may need to do things like convert to lowercase before testing, or parse or test for variants, e.g. alt = " (that is, with spaces), etc., depending on the consistency you can expect from your HTML.
By the way, there is no way this would be faster, but if you want to use something a little more general for some reason, you can also give a shot to CsQuery. This is my own C# implementation of jQuery which would do something like this very easily, e.g.
obj.Select("img").Not("[alt]").Attr("alt",String.Empty);
Since you say that HTML agility pack performs badly on deeply-nested HTML, this may work better for you, because the HTML parser I use is not recursive and should perform linearly regardless of nesting. But it would be far slower than just coding to your exact need since it does, of course, parse the entire document into an object model. Whether that is fast enough for your situation, who knows.
I just tested this on an 8 MB HTML file with about 250,000 lines. It did take a few seconds for the document to load, but the select method was very fast. Not sure how big your file is or what you are expecting. I even edited the HTML file to include some missing tags, such as </body> and some random </div>. It still was able to parse correctly.
HtmlDocument doc = new HtmlDocument();
doc.Load(@"c:\test.html");
HtmlNodeCollection col = doc.DocumentNode.SelectNodes("//img[not(@alt)]");
I had a total of 54,322 nodes. The select took milliseconds.
If the above will not work, and you can reliably predict the output, it is possible for you to stream the file in and break it into manageable chunks.
pseduo-code
stream file in
parse in HtmlAgilityPack
loop until end of stream
I imagine you could incorporate Parallel.ForEach() in there as well, although I can't find documentation on whether this is safe with HtmlAgilityPack.
Well, if I review your content for Section 508 compliance, I will fail your web site or content, unless the blank alt text is used only for decorative images (ones not needed for comprehension of the content).
Blank alt text is only for decoration. Inserting it might fool some automated reporting tools, but you certainly are not meeting Section 508 compliance.
From a project management standpoint, you are better off leaving it failing so the end-users creating the content become responsible and the automated tool accurately reports it as non-compliant.
Assuming you are generating the Html markup wherever you need it, here is a quick trick to find out, for SEO purposes, whether any images are missing the ALT attribute, without too much struggle.
private static bool HasImagesWithoutAltTags(string htmlContent)
{
    // Requires: using System.Linq; and HtmlAgilityPack.
    var doc = new HtmlDocument();
    doc.LoadHtml(htmlContent);
    // SelectNodes returns null when nothing matches, so guard against that.
    var missing = doc.DocumentNode.SelectNodes("//img[not(@alt)]");
    return missing != null && missing.Any();
}
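For example (the sample markup is just for illustration):
bool anyMissing = HasImagesWithoutAltTags("<img src=\"foo.gif\" /><img src=\"bar.gif\" alt=\"\" />");
// anyMissing == true, because the first image has no alt attribute at all.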

Regex to get the tags

I have a html like this :
<h1> Headhing </h>
<font name="arial">some text</font></br>
some other text
In C#,
I want to get the output as below: simply the content inside the font start tag and end tag.
<font name="arial">some text</font>
First off, your html is wrong. You should close a <h1> with a </h1>, not </h>. This one thing is why regex is inappropriate for parsing tags.
Second, there are hundreds of questions on SO talking about parsing html with regex. The answer is don't. Use something like the Html Agility Pack.
I wouldn't recommend to try it with regex.
I use the HTML Agility Pack to parse HTML and get what I want.
It's a lovely HTML parser that is commonly recommended for this. It will take malformed HTML and massage it into XHTML and then a traversable DOM, like the XML classes. So it is very useful for the code you find in the wild.
There's also an HTML parser from Microsoft MSHTML but I haven't tried it.
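With the Html Agility Pack, grabbing that element would look something like this (a sketch; it assumes the markup is in a string called html):
var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html);
var font = doc.DocumentNode.SelectSingleNode("//font[@name='arial']");
string result = font != null ? font.OuterHtml : null;   // <font name="arial">some text</font>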
Regex regExfont = new Regex(#"<font name=""arial""[^>]*>.*</font>");
MatchCollection rows = regExfont.Matches(string);
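And then, for example:
foreach (Match m in rows)
    Console.WriteLine(m.Value);   // prints each matched <font ...>...</font> block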
A good website for testing is http://www.regexlib.com/RETester.aspx
