Clear raw HTML from malicious data in C# - c#

I'm writing an ASP.NET MVC app. Some pieces of HTML come from users and some from third-party sources. Is there an easy and reasonably fast way to clean HTML without heavy artillery like HAP (Html Agility Pack) or Tidy?
I just need to remove scripts, styles, <object>/<embed>, href="javascript:", style=, and onclick, and I don't think removing them manually via .Remove/.Replace is a good approach, even with StringBuilder.
For example, given the following input:
<html>
<style src="http://harmyourpage.com"/>
<script src="http://killyourdog.com"/>
<div>
Good link
Bad link
<p>Some text <b>to</b> test</p><br/>
<h1 style="position:absolute;">Damage your layout</h1>
And an image there <img src="http://co.com/a.jpg"/><br>
<span onclick="harm()">Good span with bad attribute</span>
<object>Your lovely java can be there</object>
</div>
</html>
which must be converted into the following:
<div>
Good link
<a>Bad link</a>
<p>Some text <b>to</b> test</p><br/>
<h1>Damage your layout</h1>
And an image there <img src="http://co.com/a.jpg"/><br>
<span>Good span with bad attribute</span>
</div>
So, what is the right way to do this with a whitelist of tags and attributes?
UPD: I tried the StackExchange HtmlHelpers library, but it removes tags I need, such as div, a, and img.

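The fastest way to achieve this is with a regular expression:
var regex = new Regex(
    "(\\<script(.+?)\\</script\\>)|(\\<style(.+?)\\</style\\>)|(\\<object(.+?)\\</object\\>)",
    RegexOptions.Singleline | RegexOptions.IgnoreCase
);
string output = regex.Replace(input, "");
Note that this pattern only removes paired script, style, and object elements; it leaves self-closing tags, href="javascript:" links, and attributes such as style= and onclick untouched.
You can also use the Microsoft Web Protection Library (http://wpl.codeplex.com/) for the same purpose: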
Sanitizer.GetSafeHtmlFragment(input);
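For the whitelist of tags and attributes that the question asks about, a parser-based pass tends to be more predictable than pattern matching. The following is only a minimal sketch, assuming that a dependency on a recent version of Html Agility Pack is acceptable after all. The class name HtmlWhitelist, the method name Sanitize, and the particular tag and attribute lists are illustrative assumptions, not part of any library.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

// Hypothetical helper: not part of Html Agility Pack or ASP.NET.
public static class HtmlWhitelist
{
    // Assumed whitelist; adjust to the markup you actually want to allow.
    private static readonly HashSet<string> AllowedTags =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase)
        { "div", "p", "b", "br", "h1", "span", "a", "img" };

    // Elements that are removed together with everything inside them.
    private static readonly HashSet<string> DroppedWithContent =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase)
        { "script", "style", "object", "embed" };

    // Only URL-carrying attributes are kept, and only with http/https values.
    private static readonly HashSet<string> AllowedAttributes =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "href", "src" };

    public static string Sanitize(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Snapshot the node list first, because the tree is modified while walking it.
        foreach (var node in doc.DocumentNode.Descendants().ToList())
        {
            if (node.NodeType == HtmlNodeType.Comment) { node.Remove(); continue; }
            if (node.NodeType != HtmlNodeType.Element) continue;

            if (DroppedWithContent.Contains(node.Name)) { node.Remove(); continue; }

            if (!AllowedTags.Contains(node.Name))
            {
                // Unknown tag: drop the tag itself but keep its children.
                node.ParentNode.RemoveChild(node, true);
                continue;
            }

            foreach (var attribute in node.Attributes.ToList())
            {
                if (!AllowedAttributes.Contains(attribute.Name) || !IsHttpUrl(attribute.Value))
                    attribute.Remove();
            }
        }
        return doc.DocumentNode.OuterHtml;
    }

    // Accepts absolute http/https URLs only; relative URLs are rejected too.
    private static bool IsHttpUrl(string value)
    {
        Uri uri;
        return Uri.TryCreate(value, UriKind.Absolute, out uri)
            && (uri.Scheme == Uri.UriSchemeHttp || uri.Scheme == Uri.UriSchemeHttps);
    }
}
```

Calling HtmlWhitelist.Sanitize(input) returns the cleaned markup as a string. Rejecting everything that is not on the list, including relative URLs and any URL scheme other than http and https, is a deliberately strict design choice: it avoids having to enumerate every dangerous construct, at the cost of occasionally dropping harmless markup.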

Related

Edit HTML file using MVC w/ ACE Editor

I am trying to build a page which can read the contents of an HTML file and output its data to the screen.
To get the HTML file's data I am doing:
ViewBag.PageHtml = System.IO.File.ReadAllText(@"W:\1.html");
Then in the view, I have the following:
@Html.Raw(ViewBag.PageHtml)
The HTML data is this:
<html>
<head>
<title>Test Title</title>
</head>
<body>
<p>The body</p>
</body>
</html>
The result of the Html.Raw call is this; somehow the <html>, <head>, etc. tags are being removed:
<title>Test Title</title>
<p>The body</p>
Can someone please explain to me why this is, and how I can prevent it from happening?
Thanks in advance
I managed to solve this myself. The first step was to add a hidden textarea field:
<textarea id="templateHtml" style="display: none">@ViewBag.PageHtml</textarea>
I left the div empty, like this:
<div id="txtArea"></div>
Then I just used the value of the text area as the value of the ACE Editor.
var el = document.getElementById("txtArea");
editor = ace.edit(el);
editor.session.setValue($("#templateHtml").val());
editor.setTheme("ace/theme/github");
editor.getSession().setMode("ace/mode/html");
editor.setOption("showPrintMargin", false);
This is interesting to see.
It may look as if @Html.Raw sanitizes the input, but Html.Raw writes the string out unchanged. The more likely cause is that the view is rendered inside a page that already has its own <html>, <head>, and <body> elements, so the browser's parser discards the nested ones.
Having multiple <html> and <body> tags is not allowed. See this question: Multiple <html><body> </html></body> in same file
If you know that these tags are removed, you could simply re-add them. With that said, I think you are taking the wrong approach to this.
If all you want to do is return a static HTML file, then you could just serve it "as is". If you are looking to do this "dynamically" somehow (maybe from a database or similar), then you should probably do it directly from the controller using a ContentResult.
using System.Web.Mvc;

namespace Project.Controllers
{
    public class HomeController : Controller
    {
        // Returns the file's contents as the raw response body.
        public ContentResult ServePureHtml()
        {
            string htmlData = System.IO.File.ReadAllText(@"W:\1.html");
            return Content(htmlData, "text/html");
        }
    }
}
You could use this as a partial result as well, using Html.Action or Html.RenderAction.
For instance:
<!-- ... -->
<div class="editor-content">
@{
    Html.RenderAction("ServePureHtml", "Home");
}
</div>
<!-- ... -->
Also, be cautious about using the ViewBag. It is dynamically typed, so a misspelled property name or a wrong type only shows up at run time; a strongly typed view model is usually safer.

AntiXSS doesn't sanitize unclosed html tag

Why are unclosed html tags not sanitized with Microsoft AntiXSS?
string untrustedHtml = "<img src=x onmouseover=confirm(foo) y=";
string trustedHtml = AntiXSS.Sanitizer.GetSafeHtmlFragment(untrustedHtml); // returns "<img src=x onmouseover=confirm(foo) y="
Closed tags are sanitized:
string untrustedHtml = "<img src=x onmouseover=confirm(foo) y=a>";
string trustedHtml = AntiXSS.Sanitizer.GetSafeHtmlFragment(untrustedHtml); // returns ""
Use HTML encoding instead of HTML sanitization whenever possible.
Sanitization should only be used if you actually need to allow some HTML but want to remove any unsafe code. 99% of the time you don't need any HTML to be inserted by your users, and that case should be handled with encoding.
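A minimal sketch of the encoding approach, assuming the value is only ever meant to be displayed as text. It uses System.Net.WebUtility.HtmlEncode from the base class library; the class name EncodingExample and the method name EncodeForDisplay are illustrative.

```csharp
using System.Net;

public static class EncodingExample
{
    // Encodes markup-significant characters so the browser shows the input as text.
    public static string EncodeForDisplay(string untrusted)
    {
        return WebUtility.HtmlEncode(untrusted);
    }
}
```

For the unclosed tag from the question, EncodeForDisplay("<img src=x onmouseover=confirm(foo) y=") yields &lt;img src=x onmouseover=confirm(foo) y=, which the browser renders as literal text rather than as an element.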
Having said that, if you still want to perform sanitization, AntiXSS is not the best solution, both because of the example above and because it also removes perfectly safe HTML that it falsely recognizes as unsafe, which makes the AntiXSS sanitizer impractical.
The Ajax Control Toolkit has a better internal sanitizer you can use, but note that it is less secure because it partly works with blacklists (searching for dangerous code instead of permitting only safe code).
If you still want to use AntiXSS sanitization, you can check whether the submitted HTML is well formed before you send it to the sanitizer. You can do that, for example, with an XML document class of some kind, provided the input is written as well-formed, XHTML-style markup; ordinary HTML such as an unclosed <br> is not valid XML.
Hope this helps.
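The pre-check described above could be sketched as follows, assuming users are expected to submit XHTML-style markup. The class name MarkupCheck and the method name IsWellFormedXml are hypothetical. Be aware that this check rejects legitimate HTML such as <br> or &nbsp;, so it only suits input that is meant to be well-formed XML.

```csharp
using System.Xml;
using System.Xml.Linq;

public static class MarkupCheck
{
    // Returns true when the fragment parses as well-formed XML.
    // The wrapper element lets fragments with several top-level nodes pass.
    public static bool IsWellFormedXml(string fragment)
    {
        try
        {
            XDocument.Parse("<root>" + fragment + "</root>");
            return true;
        }
        catch (XmlException)
        {
            return false;
        }
    }
}
```

The unclosed tag from the question, "<img src=x onmouseover=confirm(foo) y=", fails this check because its attribute values are unquoted and the tag never ends, so it would be rejected before it reaches the sanitizer.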
What version of the AntiXss library are you using?
I used version 4.3.0.0, and when I ran this through GetSafeHtmlFragment() the output was "&lt;img src=x onmouseover=test(1) y=".
As you can see, it automatically encoded the non-HTML values.
Here is the code I used:
protected void Page_Load(object sender, EventArgs e)
{
    var testValue = "<img src=x onmouseover=test(1) y=";
    litFirst.Text = testValue;
    litSecond.Text = Sanitizer.GetSafeHtml(testValue);
    litThird.Text = Sanitizer.GetSafeHtmlFragment(testValue);
}
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml">
<head runat="server">
<title></title>
<script>
function test(x) {
alert(x);
}
</script>
</head>
<body>
<form id="form1" runat="server">
<div>
First: <asp:Literal ID="litFirst" runat="server"/>
<br/>
Second: <asp:Literal ID="litSecond" runat="server"/>
<br/>
Third: <asp:Literal ID="litThird" runat="server"/>
</div>
</form>
</body>
</html>
But I also agree with Gil Cohen that you really should not allow users to enter HTML.
Like Gil Cohen, I would recommend that instead of allowing them to enter HTML directly, you use an intermediate language such as Markdown, Textile, or wiki markup, to name a few. This has the advantage of giving users more control over their output without letting them write HTML directly.
There are JavaScript WYSIWYG editors that output the markup and a preview for the user and let you save the markup language for later use (to be converted to HTML when rendering the output, not before you save it to your data store).

Regular Expressions select whole outer DIV

I've been trying for hours to solve this problem. I want to use regular expressions to select whole divs, including nested divs. See the example string below:
AA <div> Text1 </div> BB <div style=\"vertical-align : middle;\"> Text2 <div>Text 3</div> </div> CC
I want to return the following values:
<div> Text1 </div>
<div style=\"vertical-align : middle;\"> Text2 <div>Text 3</div> </div>
The closest I've got is the following pattern, but it just gives me each div tag separately:
(?<BeginTag><\s*div.*?>)|(?<EndTag><\s*/\s*div.*?>)
Any help would be great.
To expand on my rather snarky comment, a Regex is not a good tool for parsing any kind of HTML. Only in the simplest of scenarios will it be feasible, and even then, I would not recommend it.
What you need is a good tool for parsing HTML. In the .NET world, a nice library for this is the Html Agility Pack, or perhaps the SGMLReader project.
You do need to invest a little bit of time in learning the API, but it will be worth it.
For the little fragment you are showing, I think the easiest API for you will be SGMLReader. It can read HTML as if it were XML, which means you can convert it to an XDocument and use a much nicer API. The code for that could look like this:
using System;
using System.IO;
using System.Xml.Linq;

string markup = "<html>AA <div> Text1 </div> BB <div style=\"vertical-align : middle;\"> Text2 <div>Text 3</div> </div> CC</html>";
XDocument doc;
using (var reader = Sgml.SgmlReader.Create(new StringReader(markup)))
    doc = XDocument.Load(reader);
// Direct children of the root element are the outermost divs.
var rootLevelDivs = doc.Root.Elements("div");
foreach (var div in rootLevelDivs)
    Console.WriteLine(div);
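The same selection can be sketched with the Html Agility Pack, the other library mentioned above. This is an illustrative alternative rather than the answer's own method; the class name OuterDivs and the method name Find are hypothetical. The XPath expression picks every div that has no div ancestor, which is what "outermost" means here.

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class OuterDivs
{
    // Returns the markup of every div that is not nested inside another div.
    public static List<string> Find(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var nodes = doc.DocumentNode.SelectNodes("//div[not(ancestor::div)]");
        if (nodes == null) return new List<string>();
        return nodes.Select(n => n.OuterHtml).ToList();
    }
}
```

For the example string from the question, OuterDivs.Find(markup) should return two entries, with the nested "Text 3" div contained inside the second one. Unlike the XDocument approach, this does not depend on the divs being direct children of the root element.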

How do I output html from my database

I am displaying my 'news' page and I want the customer to be able to output some simple html.
My view looks like this:
@using SuburbanCustPortal.SuburbanService
<br />
@foreach (var item in (IEnumerable<TokenNews>) ViewBag.News)
{
    <div class="fulldiv">
        <fieldset>
            <legend>@item.Title</legend>
            <div>
                @item.Body
            </div>
        </fieldset>
    </div>
}
When I do it this way, the HTML isn't rendered by the browser; it's just shown as text.
Is there any way to output the HTML so that the browser will render it?
You didn't exactly specify what part is being shown as text, but if it's item.Body, use @Html.Raw(item.Body) instead. That turns the string into an IHtmlString, whose purpose is to tell Razor that the value is safe to output as-is and does not contain nasties like XSS attacks (ensuring this when using Html.Raw is your job). Everything that is not an IHtmlString is escaped automatically by Razor.

How can I extract just text from the html

I have a requirement to extract all the text that is present in the <body> of the html. Sample Html input :-
<html>
<title>title</title>
<body>
<h1> This is a big title.</h1>
How are doing you?
<h3> I am fine </h3>
<img src="abc.jpg"/>
</body>
</html>
The output should be :-
This is a big title. How are doing you? I am fine
I want to use only Html Agility Pack for this purpose. No regular expressions, please.
I know how to load an HtmlDocument and then get the body contents using an XPath query like '//body'. But how do I strip the HTML as shown in the output?
Thanks in advance :)
You can use the body's InnerText:
string html = @"
<html>
<title>title</title>
<body>
<h1> This is a big title.</h1>
How are doing you?
<h3> I am fine </h3>
<img src=""abc.jpg""/>
</body>
</html>";
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);
string text = doc.DocumentNode.SelectSingleNode("//body").InnerText;
Next, you may want to collapse spaces and new lines:
text = Regex.Replace(text, @"\s+", " ").Trim();
Note, however, that while this works in this case, markup such as hello<br>world or hello<i>world</i> will be converted by InnerText to helloworld, with the tags simply removed. That issue is difficult to solve, as the display is often determined by the CSS, not just by the markup.
How about using the XPath expression '//body//text()' to select all text nodes?
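A minimal sketch of that idea with Html Agility Pack, under the assumption that every text node should be separated from the next by a single space. The class name BodyText and the method name Extract are hypothetical. Text inside script or style elements within the body would be included too, so those would need to be filtered out separately if they can occur.

```csharp
using System.Linq;
using System.Text.RegularExpressions;
using HtmlAgilityPack;

public static class BodyText
{
    // Joins all non-empty text nodes under <body> with single spaces.
    public static string Extract(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null when the XPath matches nothing.
        var textNodes = doc.DocumentNode.SelectNodes("//body//text()");
        if (textNodes == null) return string.Empty;

        var parts = textNodes
            .Select(n => HtmlEntity.DeEntitize(n.InnerText).Trim())
            .Where(s => s.Length > 0);

        // Collapse any whitespace runs left inside individual text nodes.
        return Regex.Replace(string.Join(" ", parts), @"\s+", " ");
    }
}
```

With this approach hello<br>world becomes "hello world" instead of "helloworld". The trade-off is that inline markup such as hello<i>world</i> also gains a space, even though a browser would render it without one.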
You can use NUglify, which supports text extraction from HTML:
var result = Uglify.HtmlToText("<div> <p>This is <em> a text </em></p> </div>");
Console.WriteLine(result.Code); // prints: This is a text
As it uses a custom HTML5 parser, it should be quite robust (especially if the document doesn't contain any errors) and is very fast (no regex involved, just a pure recursive descent parser; faster than HtmlAgilityPack and more GC friendly).
Normally for parsing HTML I would recommend an HTML parser; however, since you want to remove all HTML tags, a simple regex should work.
