Repairing malformatted html attributes using c# - c#

I have a web application with an upload functionality for HTML files generated by chess software to be able to include a javascript player that reproduces a chess game.
I do not like to load the uploaded files in a frame so I reconstruct the HTML and javascript generated by the software by parsing the dynamic parts of the file.
The problem with the HTML is that all attribute values are enclosed in apostrophes (single quotes) instead of quotation marks. I am looking for a way to fix this with a library or a regex replace in C#.
The html looks like this:
<DIV class='pgb'><TABLE class='pgbb' CELLSPACING='0' CELLPADDING='0'><TR><TD>
and I would transform it into:
<DIV class="pgb"><TABLE class="pgbb" CELLSPACING="0" CELLPADDING="0"><TR><TD>

I'd say your best option is to use something like HTML Agility Pack to parse the generated HTML, and then ask it to re-serialize it to string (hopefully correcting any formatting problems in the process). Any attempt at Regexes or other direct string manipulation of HTML is going to be difficult, fragile and broken...
Example (when your HTML is stored in a file on the hard disk):
HtmlDocument doc = new HtmlDocument();
doc.Load("file.htm");
doc.Save("file.htm");
It is also possible to do this directly in memory from a string or Stream of input HTML.

You could use something like:
string outputString = Regex.Replace(inputString, @"(?<=\<[^<>]*)\'(?=[^<>]*\>)", "\"");
Changed it after Oded's remark, this leaves the body HTML intact. But I agree, Regex is a bad idea for parsing HTML. Mark's answer is better.
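To make the effect of that pattern concrete, here is a minimal sketch using only the base class library, written as C# top-level statements. The helper name FixAttributeQuotes is made up for illustration, and the trailing "it's" was added to the question's sample to show that apostrophes outside tags are left alone. It assumes no attribute value contains a double quote or an angle bracket; an input such as title='say "hi"' would come out broken.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: swaps ' for " only when the apostrophe sits
// between a '<' and the next '>' with no other angle bracket in between.
static string FixAttributeQuotes(string html) =>
    Regex.Replace(html, @"(?<=<[^<>]*)'(?=[^<>]*>)", "\"");

string input = "<DIV class='pgb'><TABLE class='pgbb' CELLSPACING='0' CELLPADDING='0'><TR><TD>it's";
string output = FixAttributeQuotes(input);

// The apostrophe in the text "it's" is outside any tag and stays as it is.
Console.WriteLine(output);
```

The lookbehind is variable-length, which .NET supports but many other regex engines do not, so the pattern may not port directly.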

Related

Extract strings using Regex

I want to download a html source, then search for the username and other information, and then display this in my program.
I'm pretty new to programming, but a straight noob when it comes to things like this (Regex) so I hope you can explain it to me.
I used Regex before extracting a K/D ratio from a html source, for that I used this code:
string pattern = @"<span class=""kdratio"">\d+\.\d+";
But I have no idea how to start on this one...
This is the line of the source that contains the information:
<section class="profile-header" profile="true" motto="user's motto" user="User" figure="hr-3322-45.hd-190-1.ch-3342-64-66.lg-285-64.sh-3068-82-66.ea-1404-64">
I only need the parts user="User" and figure="x", I couldn't try anything because I really wouldn't know how to start, because the html line looks so different from what I have experience with.
Regular expressions are not a good idea for matching HTML unless it's very simple, single, tag matching. See here: RegEx match open tags except XHTML self-contained tags
I recommend using an HTML DOM-parsing library and use XPath or CSS selectors to get the information you want. For .NET, HtmlAgilityPack is recommended. For CSS Selectors you'll want Fizzler (an add-on for HtmlAgilityPack).
In JavaScript (easily rewritten to C# and HtmlAgilityPack) it would be this:
document.querySelector(
"section[class=profile-header][profile=true][user=User]"
).textContent
HtmlAgilityPack: http://html-agility-pack.net
Fizzler: https://www.nuget.org/packages/Fizzler.Systems.HtmlAgilityPack/
Generally for parsing HTML, Regex is not a good choice! HTML tends to be so complicated and it is so hard to write a single Regex to be able to match everything! Instead use a parser like Html Agility Pack.
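If a parser really is not an option and the tag always has the shape shown in the question, a narrow pattern per attribute can serve as a stopgap. This is only a sketch under that assumption: GetAttribute is a hypothetical helper, it expects double-quoted values, and it will break as soon as the markup changes, which is exactly the fragility the answers above warn about.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: returns the value of name="..." inside a single tag,
// or an empty string when the attribute is not present.
static string GetAttribute(string markup, string name)
{
    // Require whitespace before the name so that the word "user" inside
    // motto="user's motto" is not mistaken for the user attribute.
    Match m = Regex.Match(markup, @"\s" + Regex.Escape(name) + "=\"([^\"]*)\"");
    return m.Success ? m.Groups[1].Value : string.Empty;
}

string sectionTag = "<section class=\"profile-header\" profile=\"true\" motto=\"user's motto\" user=\"User\" figure=\"hr-3322-45.hd-190-1.ch-3342-64-66.lg-285-64.sh-3068-82-66.ea-1404-64\">";

string user = GetAttribute(sectionTag, "user");
string figure = GetAttribute(sectionTag, "figure");

Console.WriteLine(user);
Console.WriteLine(figure);
```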

How to retrieve data from an html string from a span tag by using Regular Expressions?

I need to retrieve some info from an HTML doc, since the web service that will return JSON or XML is still not ready. I'm working with C# and using regular expressions to get the data I need from the HTML string. I've managed to get the div I want to work with out of the whole HTML string, but now I'm having trouble getting the info inside the first span tag.
I've attempted to retrieve the data between ; and the first closing span tag, but what I really want is the content of the first span tag.
Here's the regular expression i've written so far, but it's not working:
".*;(?<Content>(\r|\n|.)*)</span>"
I also tried this, but it didn't work either:
"<span class=""type"">(?<Content>(\r|\n|.)*)</span>"
Here is the div i want to retrieve the data from:
<div class="main">ABASASDFÓ 18/06/2014 17:38h Blabla Balbal <span class="type">15.80€ </span>+1.94 % +0.30€ | HOME <SPAN class="type2">11,398.70</span> +0.65 % +74.10</div>
EDIT: I can't use Htmlagilitypack since my client does not want us to use any external library. I've also heard about using the XmlReader but i'm not sure the structure of the html will match an xml one accordingly.
This regex will capture the string:
"<span class=\"type\">(?<Content>([^<]*))</span>"
That said, I agree with the other answers: you should use something like XPath instead of regexes for parsing HTML.
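For reference, here is a small sketch of how that pattern could be applied in C# to the div from the question, using only System.Text.RegularExpressions. The variable names are illustrative. Because [^<]* stops at the first '<', the match cannot run on to the last closing span the way the greedy (\r|\n|.)* in the question does.

```csharp
using System;
using System.Text.RegularExpressions;

string html = "<div class=\"main\">ABASASDFÓ 18/06/2014 17:38h Blabla Balbal <span class=\"type\">15.80€ </span>+1.94 % +0.30€ | HOME <SPAN class=\"type2\">11,398.70</span> +0.65 % +74.10</div>";

// IgnoreCase tolerates <SPAN ...> as well as <span ...>; the closing quote
// after "type" keeps class="type2" from matching.
Match match = Regex.Match(
    html,
    "<span class=\"type\">(?<Content>[^<]*)</span>",
    RegexOptions.IgnoreCase);

// Trim removes the trailing space inside the span.
string content = match.Success ? match.Groups["Content"].Value.Trim() : string.Empty;

Console.WriteLine(content);
```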
Here's how it is done with a regex in Javascript. You should be able to adapt this for C# pretty easily.
var inner = html.match( /<span class="type"(?:\s+[a-z]+(?:\s*=\s*(?:"[^"]*"|'[^']*'|[^\s>]+)))*\s*>([\S\s]*)<\/span>/i)[1];
Fiddle: http://jsfiddle.net/GarryPas/uk32r8vz/
You want to use XPath for that. Something like this:
div/span/text()
I understand not wanting some external 3rd party library in your solution, the solution to that is to go fetch the source code of the entire library:
https://htmlagilitypack.codeplex.com/
Now you don't have an external library, you have an internal library and you can use the right tool for the job!
XmlReader is a fairly low-level tool, it could technically do the job for you but what you're more after is "use XmlReader to do XPath" which is talked about here: https://msdn.microsoft.com/en-us/library/ms950778.aspx
The XPathReader class is the result of all that, which has been superseded by LINQ to XML: https://msdn.microsoft.com/en-ca/library/bb387098.aspx
So another option here is to try to use some LINQ to process your HTML file, but that might be tricky since HTML isn't good XML. Still, it's another option if you're looking for those.
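As a rough sketch of the LINQ to XML route, the following uses only System.Xml.Linq and System.Xml.XPath. It assumes a shortened version of the question's div in which the tag case has already been normalized; that normalization is an assumption, not something the question's HTML provides. The second part shows why the original fragment is a problem: XML is case-sensitive, so <SPAN> closed by </span> is rejected.

```csharp
using System;
using System.Xml;
using System.Xml.Linq;
using System.Xml.XPath;

// Assumed input: a well-formed variant of the div, with consistent tag case.
string wellFormed = "<div class=\"main\">HOME <span class=\"type\">15.80€ </span>+1.94 % <span class=\"type2\">11,398.70</span> +0.65 %</div>";

XElement div = XElement.Parse(wellFormed);

// The XPath is evaluated relative to the div element.
string first = div.XPathSelectElement("span[@class='type']")?.Value.Trim() ?? string.Empty;
Console.WriteLine(first);

// The mixed-case tags from the question are not well-formed XML.
bool originalParses;
try
{
    XElement.Parse("<div><SPAN class=\"type2\">11,398.70</span></div>");
    originalParses = true;
}
catch (XmlException)
{
    originalParses = false;
}
Console.WriteLine(originalParses);
```

So this approach only works if the HTML is cleaned up first or happens to be valid XHTML.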

HTML Agility Pack (C#) malforms my code

I'm currently coding a desktop application in c# which also has to handle XHTML document manipulation. For that purpose I'm using the Html Agility Pack which seemed to be okay so far. After carefully checking the output from Html Agility Pack I found out that the code isn't well formed xhtml any more.
It removes self-closing tags (slash) and overwrites other proprietary code elements...
eg. input html code:
<input autocapitalize="off" id="username" name="username" placeholder="Benutzername" type="text" value="$(username)" />
eg. output html code
<input autocapitalize="off" id="username" name="username" placeholder="Benutzername" type="text" value="$(username)">
(removed the trailing slash...)
Another example is with proprietary code elements (for Mikrotik hotspot devices):
eg input html code
<form action="$(link-login-only)" method="post" name="login" $(if chap-id) onSubmit="return doLogin()"$(endif)>
The $(if chap-id), $(endif) and $(link-login-only) parts are custom code fragments interpreted from the Mikrotik device.
eg. output html code after Html Agility Pack (which transforms it to unuseable code)
<form action="$(link-login-only)" method="post" name="login" $(if="" chap-id)="" onsubmit="return doLogin()" $(endif)="">
Has someone an idea how to "instruct" Html Agility Pack to output well formed XHTML and to ignore "custom code" fragments (is this possibly via Regex)?
Thanks in advance! :-)
In your first example, HTML Agility Pack is actually fixing your markup. The input element is a void element. Since there is no content inside, it needs no closing tag.
HTML Agility Pack is made for parsing valid HTML markup, not markup embedded with custom code. In your first example, the custom markup is inside quotes therefore isn't an issue. In your second example, the variables are outside quotes.
HTML Agility Pack tries to parse them as regular (but malformed) attributes of the element. There's no way to fix that. You'll have to find another way to parse your markup if you need support for custom code inside the markup.
Necromancing.
Problem 1 is because you probably didn't specify OptionOutputAsXml = true, meaning HtmlAgilityPack outputs HTML instead of XHTML.
Actually, doing this is rather clever, as it reduces the file size.
If you need XHTML, you need to specifically instruct HtmlAgilityPack to output XHTML (XML), not HTML (SGML).
SGML allows for tags with no closing tag (/>), while XML does not.
To fix this:
public static void BeautifyHtml()
{
    string input = "<html><body><p>This is some test test<br ><ul><li>item 1<li>item2<</ul></body>";
    HtmlAgilityPack.HtmlDocument test = new HtmlAgilityPack.HtmlDocument();
    // Set the parsing options before LoadHtml so they apply while the input is parsed.
    test.OptionCheckSyntax = true;
    test.OptionFixNestedTags = true;
    test.LoadHtml(input);
    // Serialize as XML (XHTML) instead of HTML when saving.
    test.OptionOutputAsXml = true;
    System.Text.StringBuilder sb = new System.Text.StringBuilder();
    using (System.IO.TextWriter stringWriter = new System.IO.StringWriter(sb))
    {
        test.Save(stringWriter);
    }
    string beautified = sb.ToString();
    System.Console.WriteLine(beautified);
}
An alternative is CsQuery which, at least for the simple cases you've got here, will leave your pre-processor tags alone by nature of just treating them like valueless attributes. That is, HAP appears to convert any attribute someattribute without a value to someattribute="". CsQuery won't do this.
However, the observations @Justin Niessner makes about your markup are going to be true for any parser that is not specifically designed to parse the templating code you have in there. Just because this one example makes it through CsQuery is no guarantee some other format won't result in something that's not a valid attribute name, or if not valid, at least acceptable to an HTML5 parser.
If you need to manipulate something as HTML, then do it after templating. If you need to manipulate it before the templating engine has at it, then you're in a catch 22, since it's not HTML yet. Or alternatively you could use a templating system that uses valid HTML markup for its keywords (example: Knockout).

How to strip all tags from wikipedia pages or make page more readable

I want to strip all tags, remove the [show][Hide] stuffs from wikipedia, or is there some website that makes pages in more readable format.
I am aware of the Wikipedia printable version, but I don't want any tags in the output, as I have another use for it. So please answer the original question only: is there a website, web service, or code snippet in PHP/C# that removes the tags from a web page?
Also, when I copy a list from Firefox, it replaces <li> with *. Is it possible to configure Firefox to use some other non-readable character instead, such as some kind of dot?
You can start by taking a look at the strip_tags function.
You could use an HTML parser, BeautifulSoup (Python) or Simple HTML DOM for example. Or you could try using an XML parser.
I want to strip all tags, remove the
[show][Hide] stuffs from wikipedia, or
is there some website that makes pages
in more readable format.
You should take a look at DBpedia: it is Wikipedia, but just the data.
http://dbpedia.org/About
What about HtmlAgilityPack?
Similar thread available in stackoverflow
Is there a Wikipedia API?
Try this function.
Dim pattern As String = "<(.|\n)*?>"
Return System.Text.RegularExpressions.Regex.Replace(strHtmlString, pattern, String.Empty).Trim()
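The same idea in C#, as a minimal sketch with only base class library calls. StripTags is a hypothetical helper name, and the entity decoding via WebUtility.HtmlDecode is an addition that is not in the VB snippet above. Like any regex-based stripper, it does not understand scripts, comments, or a literal '<' in text, so treat it as a rough cleanup rather than a parser.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper: removes anything that looks like a tag, then decodes
// entities such as &amp; so the result reads as plain text.
static string StripTags(string html)
{
    string withoutTags = Regex.Replace(html, @"<(.|\n)*?>", string.Empty);
    return WebUtility.HtmlDecode(withoutTags).Trim();
}

string text = StripTags("<p>Fish &amp; <b>chips</b></p>");
Console.WriteLine(text);
```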

What is the best way to search through HTML in a C# string for specific text and mark the text?

What would be the best way to search through HTML inside a C# string variable to find a specific word/phrase and mark (or wrap) that word/phrase with a highlight?
Thanks,
Jeff
I like using Html Agility Pack. It is very easy to use, and although there haven't been many updates lately, it is still usable. For example, grabbing all the links:
HtmlWeb client = new HtmlWeb();
HtmlDocument doc = client.Load("http://yoururl.com");
// Select every <a> element that has an href attribute.
HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//a[@href]");
// SelectNodes returns null when nothing matches.
if (nodes != null)
{
    foreach (var link in nodes)
    {
        Console.WriteLine(link.Attributes["href"].Value);
    }
}
Regular Expression would be my way. ;)
If the HTML you're using is XHTML compliant, you could load it as an XML document, and then use XPath/XSL - long winded but kind of elegant?
An approach I used in the past is to use HTMLTidy to convert messy HTML to XHTML, and then use XSL/XPath for screen scraping content into a database, to create a reverse content management system.
Regular expressions would do it, but could be complicated once you try stripping out tags, image names etc, to remove false positives.
In simple cases, regular expressions will do.
string input = "ttttttgottttttt";
string output = Regex.Replace(input, "go", "<strong>$0</strong>");
will yield: "tttttt<strong>go</strong>ttttttt"
But when you say HTML, if you're referring to final text rendered, that's a bit of a mess. Say you've got this HTML:
<span class="firstLetter">B</span>ook
To highlight the word 'Book', you would need the help of a proper HTML renderer. To simplify, one can first remove all tags and leave only contents, and then do the usual replace, but it doesn't feel right.
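A middle ground between a blind replace and a full renderer is to leave the tags alone and replace only in the text between them. The sketch below is one way to do that with the base class library; Highlight is a hypothetical helper, and it deliberately does not solve the 'Book' case above, where the word is split across elements. It also assumes tags contain no '>' inside attribute values.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper: wraps each occurrence of phrase in <strong>,
// skipping anything that is itself a tag.
static string Highlight(string html, string phrase)
{
    // Splitting on a capturing group keeps the tags in the resulting array.
    string[] parts = Regex.Split(html, "(<[^>]+>)");
    string pattern = Regex.Escape(phrase);

    return string.Concat(parts.Select(part =>
        part.StartsWith("<", StringComparison.Ordinal)
            ? part                                                  // a tag: copy unchanged
            : Regex.Replace(part, pattern, "<strong>$0</strong>"))); // text: highlight
}

string result = Highlight("<a href=\"go.html\">go home</a>", "go");

// The "go" inside the href attribute is untouched; only the link text changes.
Console.WriteLine(result);
```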
You could look at using Html DOM, an open source project on SourceForge.net.
This way you could programmatically manipulate your text instead of relying on regular expressions.
Searching for strings, you'll want to look up regular expressions. As for marking it, once you have the position of the substring it should be simple enough to use that to add in something to wrap around the phrase.
