I'm currently coding a desktop application in C# which also has to handle XHTML document manipulation. For that purpose I'm using the Html Agility Pack, which seemed to be okay so far. After carefully checking the output from Html Agility Pack, I found out that the code isn't well-formed XHTML any more.
It removes the trailing slash from self-closing tags and rewrites other proprietary code elements...
E.g., input HTML code:
<input autocapitalize="off" id="username" name="username" placeholder="Benutzername" type="text" value="$(username)" />
E.g., output HTML code:
<input autocapitalize="off" id="username" name="username" placeholder="Benutzername" type="text" value="$(username)">
(removed the trailing slash...)
Another example is with proprietary code elements (for Mikrotik hotspot devices):
E.g., input HTML code:
<form action="$(link-login-only)" method="post" name="login" $(if chap-id) onSubmit="return doLogin()"$(endif)>
The $(if chap-id), $(endif) and $(link-login-only) parts are custom code fragments interpreted by the Mikrotik device.
E.g., output HTML code after Html Agility Pack (which transforms it into unusable code):
<form action="$(link-login-only)" method="post" name="login" $(if="" chap-id)="" onsubmit="return doLogin()" $(endif)="">
Does anyone have an idea how to "instruct" Html Agility Pack to output well-formed XHTML and to ignore "custom code" fragments (is this possible via Regex)?
Thanks in advance! :-)
In your first example, HTML Agility Pack is actually fixing your markup. The input element is a void element. Since there is no content inside, it needs no closing tag.
HTML Agility Pack is made for parsing valid HTML markup, not markup embedded with custom code. In your first example, the custom markup is inside quotes and therefore isn't an issue. In your second example, the variables are outside quotes.
HTML Agility Pack tries to parse them as regular (but malformed) attributes of the element. There's no way to fix that. You'll have to find another way to parse your markup if you need support for custom code inside the markup.
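One possible workaround, sketched here under assumptions rather than as a confirmed fix, is to hide the $(...) fragments behind placeholders before parsing and put them back after serializing. The class name TemplateMask and the placeholder prefix hap-tpl- are made up for illustration. The optional ="" in the restore pattern is an assumption based on the output shown in the question, where valueless attributes come back with ="" appended; how a given parser treats the placeholders would still need to be verified.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper: masks $(...) template fragments before HTML parsing
// and restores them afterwards. Not part of Html Agility Pack.
public static class TemplateMask
{
    // Matches fragments such as $(if chap-id) or $(endif).
    // Nested parentheses inside a fragment are not handled.
    private static readonly Regex Token = new Regex(@"\$\([^)]*\)");

    // Matches a placeholder, optionally followed by the ="" that a parser
    // may append when it treats the placeholder as a valueless attribute.
    private static readonly Regex Placeholder = new Regex(@"hap-tpl-(\d+)(?:="""")?");

    // Replaces every fragment with hap-tpl-N and records the original in 'saved'.
    public static string Mask(string markup, List<string> saved)
    {
        return Token.Replace(markup, m =>
        {
            saved.Add(m.Value);
            return "hap-tpl-" + (saved.Count - 1);
        });
    }

    // Replaces every placeholder with the fragment recorded at that index.
    public static string Unmask(string markup, List<string> saved)
    {
        return Placeholder.Replace(markup, m => saved[int.Parse(m.Groups[1].Value)]);
    }
}
```

Usage would be Mask, then parse, manipulate and save with the HTML library, then Unmask on the saved string. This assumes the text hap-tpl- never occurs in the real markup.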
Necromancing.
Problem 1 is because you probably didn't specify OptionOutputAsXml = true, meaning HtmlAgilityPack outputs HTML instead of XHTML.
Actually, doing this is rather clever, as it reduces the file size.
If you need XHTML, you need to specifically instruct HtmlAgilityPack to output XHTML (XML), not HTML (SGML).
SGML allows for tags with no closing tag (/>), while XML does not.
To fix this:
public static void BeautifyHtml()
{
    // Deliberately malformed input to show what gets repaired.
    string input = "<html><body><p>This is some test test<br ><ul><li>item 1<li>item2<</ul></body>";
    HtmlAgilityPack.HtmlDocument test = new HtmlAgilityPack.HtmlDocument();
    // Options are set before LoadHtml so that the parsing-related ones can take effect while the input is parsed.
    test.OptionOutputAsXml = true; // serialize as XML (XHTML) instead of HTML
    test.OptionCheckSyntax = true;
    test.OptionFixNestedTags = true;
    test.LoadHtml(input);
    System.Text.StringBuilder sb = new System.Text.StringBuilder();
    using (System.IO.TextWriter stringWriter = new System.IO.StringWriter(sb))
    {
        test.Save(stringWriter);
    }
    string beautified = sb.ToString();
    System.Console.WriteLine(beautified);
}
An alternative is CsQuery which, at least for the simple cases you've got here, will leave your pre-processor tags alone by nature of just treating them like valueless attributes. That is, HAP appears to convert any attribute someattribute without a value to someattribute="". CsQuery won't do this.
However, the observations @Justin Niessner makes about your markup are going to be true for any parser that is not specifically designed to parse the templating code you have in there. Just because this one example makes it through CsQuery is no guarantee some other format won't result in something that's not a valid attribute name, or if not valid, at least acceptable to an HTML5 parser.
If you need to manipulate something as HTML, then do it after templating. If you need to manipulate it before the templating engine has at it, then you're in a catch 22, since it's not HTML yet. Or alternatively you could use a templating system that uses valid HTML markup for its keywords (example: Knockout).
Related
I have a web application with an upload functionality for HTML files generated by chess software to be able to include a javascript player that reproduces a chess game.
I do not like to load the uploaded files in a frame so I reconstruct the HTML and javascript generated by the software by parsing the dynamic parts of the file.
The problem with the HTML is that all attribute values are surrounded with apostrophes instead of quotation marks. I am looking for a way to fix this using a library or a regex replace in C#.
The html looks like this:
<DIV class='pgb'><TABLE class='pgbb' CELLSPACING='0' CELLPADDING='0'><TR><TD>
and I would transform it into:
<DIV class="pgb"><TABLE class="pgbb" CELLSPACING="0" CELLPADDING="0"><TR><TD>
I'd say your best option is to use something like HTML Agility Pack to parse the generated HTML, and then ask it to re-serialize it to string (hopefully correcting any formatting problems in the process). Any attempt at Regexes or other direct string manipulation of HTML is going to be difficult, fragile and broken...
Example (when your HTML is stored in a file on the hard disk):
HtmlDocument doc = new HtmlDocument();
doc.Load("file.htm");
doc.Save("file.htm");
It is also possible to do this directly in memory from a string or Stream of input HTML.
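A rough sketch of the in-memory variant, assuming the HtmlAgilityPack NuGet package is referenced. Since a plain load and save may keep the original quote character, this sketch also sets the QuoteType property on every attribute before saving; whether that step is needed with your version of the library is something to check. The class name QuoteNormalizer is made up for illustration.

```csharp
using System.IO;
using HtmlAgilityPack;

// Hypothetical helper: re-serializes HTML from a string and asks
// Html Agility Pack to write attribute values in double quotes.
public static class QuoteNormalizer
{
    public static string Normalize(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html); // parse from a string instead of a file

        // Request double quotes for every attribute of every node.
        foreach (HtmlNode node in doc.DocumentNode.Descendants())
        {
            foreach (HtmlAttribute attribute in node.Attributes)
            {
                attribute.QuoteType = AttributeValueQuote.DoubleQuote;
            }
        }

        // Save to a writer so the document is actually re-serialized.
        using (var writer = new StringWriter())
        {
            doc.Save(writer);
            return writer.ToString();
        }
    }
}
```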
You could use something like:
string outputString = Regex.Replace(inputString, @"(?<=\<[^<>]*)\'(?=[^<>]*\>)", "\"");
Changed it after Oded's remark, this leaves the body HTML intact. But I agree, Regex is a bad idea for parsing HTML. Mark's answer is better.
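For reference, a small self-contained sketch of that regex approach using only System.Text.RegularExpressions. The class and method names are made up. The pattern replaces an apostrophe only when it sits between a < and a > with no other angle bracket in between, so it would still go wrong if an attribute value itself contains an apostrophe or an angle bracket.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper illustrating the lookbehind/lookahead replacement.
public static class AttributeQuotes
{
    // An apostrophe preceded by "<" plus non-bracket characters and
    // followed by non-bracket characters plus ">", i.e. inside a tag.
    private static readonly Regex QuoteInsideTag =
        new Regex(@"(?<=<[^<>]*)'(?=[^<>]*>)");

    public static string ToDoubleQuotes(string html)
    {
        return QuoteInsideTag.Replace(html, "\"");
    }
}
```

With the input <DIV class='pgb'><P>don't</P> this returns <DIV class="pgb"><P>don't</P>: the attribute quotes change while the apostrophe in the body text is left alone.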
I have a string containing HTML and I need to be able to access a specific element to get the text from it (the element has no id or class or name so regex is out of the question).
For example, let's say I needed to access: "/html/body/div/div[3]/div/table[0]/div/ul/li[12]/a/".
How could I go about doing this?
If the HTML is well-formed (i.e. valid XML), you can parse it with an XmlDocument.
Also as Maxim mentioned, the HTML Agility Pack can probably do what you need.
Here's a recent article from 4guysfromrolla on parsing HTML with the HTML Agility Pack
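For the well-formed case, the XmlDocument route might look roughly like this. The class name XPathLookup is made up. Two things to keep in mind: XPath positions start at 1, so an index such as [0] never matches, and a document that declares the XHTML default namespace would need an XmlNamespaceManager and prefixed paths, which this sketch does not handle.

```csharp
using System;
using System.Xml;

// Hypothetical helper: evaluates an XPath against well-formed markup
// and returns the text of the first matching node.
public static class XPathLookup
{
    public static string GetText(string xhtml, string xpath)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xhtml); // throws XmlException if the markup is not well-formed

        XmlNode node = doc.SelectSingleNode(xpath);
        return node == null ? null : node.InnerText;
    }
}
```

For example, GetText("<html><body><ul><li><a>one</a></li><li><a>two</a></li></ul></body></html>", "/html/body/ul/li[2]/a") returns "two".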
I have HTML like this:
<h1> Headhing </h>
<font name="arial">some text</font></br>
some other text
In C#, I want to get the output below, i.e. simply the font element from its start tag to its end tag:
<font name="arial">some text</font>
First off, your HTML is wrong. You should close an <h1> with </h1>, not </h>. Mistakes like this one are why regex is inappropriate for parsing tags.
Second, there are hundreds of questions on SO talking about parsing html with regex. The answer is don't. Use something like the html agility pack.
I wouldn't recommend to try it with regex.
I use the HTML Agility Pack to parse HTML and get what I want.
It's a lovely HTML parser that is commonly recommended for this. It will take malformed HTML and massage it into XHTML and then a traversable DOM, like the XML classes. So it is very useful for the code you find in the wild.
There's also an HTML parser from Microsoft, MSHTML, but I haven't tried it.
Regex regExfont = new Regex(@"<font name=""arial""[^>]*>.*?</font>");
MatchCollection rows = regExfont.Matches(input); // input holds the HTML string
A good website for testing regular expressions is http://www.regexlib.com/RETester.aspx
I'm looking for an efficient means of extracting an html "fragment" from an html document. My first implementation of this used the Html Agility Pack. This appeared to be a reasonable way to attack this problem, until I started running the extraction on large html documents - performance was very poor for something so trivial (I'm guessing due to the amount of time it was taking to parse the entire document).
Can anyone suggest a more efficient means of achieving my goal?
To summarize:
For my purposes, an html "fragment" is defined as all content inside of the <body> tags of an html document.
Ideally, I'd like to return the content unaltered if it didn't contain an <html> or <body> tag (I'll assume I was passed an html fragment to begin with).
I have the entire html document available in memory (as a string), I won't be streaming it on demand - so a potential solution won't need to worry about that.
Performance is critical, so a potential solution should account for this.
Sample Input:
<html>
<head>
<title>blah</title>
</head>
<body>
<p>My content</p>
</body>
</html>
Desired Output:
<p>My content</p>
A solution in C# or VB.NET would be welcome.
Most HTML is not going to be XHTML compliant. I would do an HTTP GET request and search the resultant text with .IndexOf("<body>") and .IndexOf("</body>"). You can use these two locations as your start and stop indexes for a reader stream. Outside the body tag you really don't need to worry about XML compliance.
You could hack it using a WebBrowser control and take advantage of the webBrowser1.Document property (though I'm not sure what you're trying to accomplish).
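A minimal sketch of that index-based idea for a document that is already in memory as a string, using only the base class library. The class name HtmlFragment is made up. It assumes the first "<body" in the text is the real opening tag (not one inside a comment or script) and that the tag's attributes contain no ">" character. It also trims surrounding whitespace, which is a small deviation from returning the content byte for byte.

```csharp
using System;

// Hypothetical helper: returns the text between <body ...> and </body>,
// or the input unchanged when no body element is found.
public static class HtmlFragment
{
    public static string ExtractBody(string html)
    {
        // Find the opening tag, allowing attributes and any letter case.
        int open = html.IndexOf("<body", StringComparison.OrdinalIgnoreCase);
        if (open < 0) return html; // assume a fragment was passed in

        // The content starts right after the ">" that ends the opening tag.
        int contentStart = html.IndexOf('>', open);
        if (contentStart < 0) return html;
        contentStart++;

        // The content ends where the last closing tag begins.
        int close = html.LastIndexOf("</body", StringComparison.OrdinalIgnoreCase);
        if (close < contentStart) return html;

        return html.Substring(contentStart, close - contentStart).Trim();
    }
}
```

No DOM is built here, so the cost is a couple of linear scans over the string, which is the point when parsing the whole document is too slow.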
If I remember correctly, I did something similar in the past with an XPathNavigator. I think it looked something like this:
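// Requires well-formed XML input; uses System.IO, System.Xml and System.Xml.XPath.
XPathDocument xDoc = new System.Xml.XPath.XPathDocument(new StringReader(content));
XPathNavigator xNav = xDoc.CreateNavigator();
XPathNavigator node = xNav.SelectSingleNode("/html/body");
where you could change /html/body to whatever you need to look for (node.InnerXml then holds the markup inside the selected element).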
I need to write an application for a friend's site which parses hidden fields. I've downloaded the Html Agility Pack library, but I'm kinda confused because there are not really any examples. The HTML field looks like this:
<input type = "hidden" autocomplete="off" value="randomvalue" name="foo">
How would I go about getting the value from this field?
From memory, something like:
// docroot is the root node of the loaded document, e.g. htmlDocument.DocumentNode
var value = docroot.SelectSingleNode("//input[@type='hidden' and @name='foo']")
    .Attributes["value"].Value;