Why does the MS AntiXss library (v4) remove HTML5 data attributes? - C#

The AntiXss library seems to strip out HTML5 data attributes; does anyone know why?
I need to retain this input:
<label class='ui-templatefield' data-field-name='P_Address3' data-field-type='special' contenteditable='false'>[P_Address3]</label>
The main reason for using the AntiXss library (v4.0) is to ensure unrecognized style attributes are not parsed; is this even possible?
code:
var result = Sanitizer.GetSafeHtml(html);
EDIT:
The input below results in the style attribute contents being removed entirely
Input:
var input = "<p style=\"width:50px;height:10px;alert('evilman')\"/> Not sure why it is null for some weird reason!<br><p></p>";
Output:
var result = "<p style=\"\"/> Not sure why it is null for some weird reason!<br><p></p>";
Which is fine if anyone messes around with my code on the client side, but I also need the data attributes to work!

I assume you mean the sanitizer, rather than the encoder. It's doing what it's supposed to - it simply doesn't understand HTML5 or recognise the attributes, so it strips them. There are ways to XSS via styles.
It's not possible to customise the safe list either, I'm afraid; the code base simply doesn't allow for it. I know a large number of people want this, but it would take a complete rewrite to support it.
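For reference, a minimal sketch (assuming the Sanitizer class from the AntiXSS 4.x package, namespace Microsoft.Security.Application) that reproduces the stripping behaviour described above:
using System;
using Microsoft.Security.Application;

class SanitizerDemo
{
    static void Main()
    {
        var html = "<label class='ui-templatefield' data-field-name='P_Address3' " +
                   "data-field-type='special' contenteditable='false'>[P_Address3]</label>";

        // Only attributes on the sanitizer's built-in safe list survive, so the
        // data-* and contenteditable attributes are removed from the output.
        string safe = Sanitizer.GetSafeHtmlFragment(html);
        Console.WriteLine(safe);
    }
}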

Related

Is there a standard way to check for HTML content with Fluent Validation

I have some text entry fields on a form and I want to prevent the user from submitting any HTML content, thus reducing chances of XSS attacks or just breaking the layout.
Is there any standard way to do this check with Fluent Validation or do I need to roll my own using a Regex. I'd prefer to use a tried and tested method rather than write my own and risk missing something subtle.
I'm using it with .NET 6 and ASP.NET for Web APIs. We intend to update to .NET 7 in the next few months, so anything that release brings could be useful.
My source for all of this is this page.
First of all you would need to replace the & character with &amp;
Then replace < with &lt;
Finally replace > with &gt;.
You could also surround your html with <pre> tags, so that it preserves line returns and spaces.
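Not part of the original answer, but here is a small sketch of the replacement order described above, alongside the built-in encoder (System.Net.WebUtility) that does the same job:
using System;
using System.Net;

class EncodeDemo
{
    static void Main()
    {
        string input = "<b>5 > 3 && 2 < 4</b>";

        // Manual replacement: & must be handled first, otherwise the & inside
        // the freshly inserted &lt; and &gt; would get encoded a second time.
        string manual = input.Replace("&", "&amp;")
                             .Replace("<", "&lt;")
                             .Replace(">", "&gt;");

        // The framework encoder covers these characters (plus quotes) for you.
        string encoded = WebUtility.HtmlEncode(input);

        Console.WriteLine(manual);
        Console.WriteLine(encoded);
    }
}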

How to display data generated by Rich Text Editor in ASP.NET MVC? [duplicate]

I have a controller which generates a string containing html markup. When it displays on views, it is displayed as a simple string containing all tags.
I tried to use an Html helper to encode/decode to display it properly, but it is not working.
string str= "seeker has applied to Job floated by you.</br>";
On my views,
@Html.Encode(str)
You are close; you want to use @Html.Raw(str).
@Html.Encode takes strings and ensures that all the special characters are handled properly. These include characters like spaces.
You should be using IHtmlString instead:
IHtmlString str = new HtmlString("seeker has applied to Job floated by you.</br>");
Whenever you have model properties or variables that need to hold HTML, I feel this is generally a better practice. First of all, it is a bit cleaner. For example:
@Html.Raw(str)
Compared to:
@str
I also think it's a bit safer than using @Html.Raw(), as the concern of whether your data is HTML is kept in your controller. In an environment where you have front-end vs. back-end developers, your back-end developers may be more in tune with what data can hold HTML values, thus keeping this concern in the back-end (controller).
I generally try to avoid using Html.Raw() whenever possible.
One other thing worth noting, is I'm not sure where you're assigning str, but a few things that concern me with how you may be implementing this.
First, this should be done in a controller, regardless of your solution (IHtmlString or Html.Raw). You should avoid any logic like this in your view, as it doesn't really belong there.
Additionally, you should be using your ViewModel for getting values to your view (and again, ideally using IHtmlString as the property type). Seeing something like @Html.Encode(str) is a little concerning, unless you were doing this just to simplify your example.
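As a rough illustration of the ViewModel approach (the names below are made up for the example), the controller owns the decision about what is HTML and the view just renders the property:
using System.Web;
using System.Web.Mvc;

public class JobNotificationViewModel
{
    // Declared as IHtmlString so Razor renders it without encoding.
    public IHtmlString Message { get; set; }
}

public class JobsController : Controller
{
    public ActionResult Notification()
    {
        var model = new JobNotificationViewModel
        {
            Message = new HtmlString("seeker has applied to Job floated by you.<br/>")
        };
        return View(model);
    }
}
Then in the view (Notification.cshtml):
@model JobNotificationViewModel
@Model.Message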
You can use
@Html.Raw(str)
See MSDN for more
Returns markup that is not HTML encoded.
This method wraps HTML markup using the IHtmlString class, which
renders unencoded HTML.
I had a similar problem with HTML input fields in MVC. The web page only showed the first keyword of the field.
Example: input field: "The quick brown fox" Displayed value: "The"
The resolution was to put the variable in quotes in the value statement as follows:
<input class="ParmInput" type="text" id="respondingRangerUnit" name="respondingRangerUnit"
onchange="validateInteger(this.value)" value="@ViewBag.respondingRangerUnit">
I had a similar problem recently, and google landed me here, so I put this answer here in case others land here as well, for completeness.
I noticed that when I had badly formatted HTML, I was actually having all my HTML tags stripped out, with just the non-tag content remaining. I particularly had a table with a missing opening table tag, and then all my HTML tags from the entire string were ripped out completely.
So, if the above doesn't work, and you're still scratching your head, then also check your HTML for being valid.
I notice even after I got it working, MVC was adding tbody tags where I had none. This tells me there is clean up happening (MVC 5), and that when it can't happen, it strips out all/some tags.

How to extend Html.Raw in order to Sanitize dangerous HTML data before displaying

I inherited a web-app which already has some input fields accepting plain Html from user. (you may understand that the XSS (Cross Site Scripting) bell rings here...! )
The same input is displayed on specific view pages with the use of #Html.Raw (... the bell now rings louder)
And, to be able to do that work, the [ValidateInput(false)] decorator on the Controller and [AllowHtml] on the Model field, comes to fill the picture... (what can i say about the bell!!!)
Now, before someone convicts some programmer to death :-) let me make clear that this dangerous input functionality is allowed to users of specific-admin-role. So this is kind of controlled situation.
Lately, though, we decided to add some control to this situation, as this functionality creates risk from inside, in case of malicious behavior of the admin user himself.
The easily implementable option would be to disable this whole functionality and add some Markdown editor instead, which would store harmless rich-text input, BUT I would still have to transform all the existing data to Markdown so that it displays correctly.
What I need, though, is to be able to lower the risk from inside - not eliminate it - by adding some sort of filter for script tags and other dangerous tags, as an extension of the existing Html.Raw helper.
Can anyone suggest a way to extend or wrap the existing HtmlHelper, please?
Here is the Metadata info:
// Summary:
// Returns markup that is not HTML encoded.
//
// Parameters:
// value:
// The HTML markup.
//
// Returns:
// The HTML markup without encoding.
public IHtmlString Raw(string value);
Using the Microsoft AntiXSS library you can avoid Cross Site Scripting attacks. Install AntiXSS 4.3.0 from NuGet: Install-Package AntiXSS.
@Html.Raw(Microsoft.Security.Application.Sanitizer.GetSafeHtmlFragment(value))
If this doesn't work, then try the AjaxControlToolkit's HtmlAgilityPackSanitizerProvider. Using this you can whitelist some tags and attributes.
You can check this SO link.
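One possible way to wrap it, sketched here with a hypothetical helper name (SanitizedRaw) and assuming the AntiXSS 4.3 package from the answer above, is an HtmlHelper extension method that sanitizes before returning the markup unencoded:
using System.Web;
using System.Web.Mvc;
using Microsoft.Security.Application;

public static class HtmlHelperExtensions
{
    // Sanitizes the fragment first, then returns it unencoded.
    public static IHtmlString SanitizedRaw(this HtmlHelper helper, string value)
    {
        if (string.IsNullOrEmpty(value))
        {
            return MvcHtmlString.Empty;
        }

        string safe = Sanitizer.GetSafeHtmlFragment(value);
        return helper.Raw(safe);
    }
}
Usage in a Razor view:
@Html.SanitizedRaw(Model.UserSuppliedHtml)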

populating textboxes with less than / greater than symbols (< and >)

So I've been running into some problems where, in various parts of the website I'm developing, I'm displaying some logs that contain < and > symbols in various spots. When I display the log it works fine. Of course, any time I navigate away I get an error of:
A potentially dangerous Request.Form value was detected from the client ...
Now I understand it's because < and > are html special characters which I get. But, is there any way to disable or somehow allow the page to display / process those? I know I could strip those characters out of anyplace they may appear, but I'd rather not if I don't have to.
Any suggestions?
You didn't post any code, so I will assume you want something along the lines of:
<textbox><</textbox>
It's simple really, HTML encode your content:
<textbox>&lt;</textbox>
You can use HttpUtility.HtmlEncode to do this.
Replace ">" with ">" and "<" with "<"
Read this to see a list of HTML's special entities.
If you simply want your web application to allow form input to contain potentially dangerous characters there are a few ways to do this depending on framework. I mostly use MVC myself, where you use the [ValidateInput(false)] attribute on your controller actions.
For WebForms, I'll direct you here instead.. http://msdn.microsoft.com/en-us/library/ie/bt244wbb.aspx :)
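For MVC, a minimal sketch (the controller and action names are invented for the example) of combining [ValidateInput(false)] with HtmlEncode so the log text can be posted and then displayed literally:
using System.Web;
using System.Web.Mvc;

public class LogsController : Controller
{
    [HttpPost]
    [ValidateInput(false)]   // request validation is skipped for this action only
    public ActionResult Save(string logText)
    {
        // Encode before writing the value back into the page so that < and >
        // are displayed literally instead of being treated as markup.
        ViewBag.LogText = HttpUtility.HtmlEncode(logText);
        return View();
    }
}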
To answer your question, put ValidateRequest="false" in the <%@ Page... directive.
Be careful, as you are now responsible for preventing script attacks.

Suggestions on how to build an HTML Diff tool?

In this post I asked if there were any tools that compare the structure (not actual content) of 2 HTML pages. I ask because I receive HTML templates from our designers, and frequently miss minor formatting changes in my implementation. I then waste a few hours of designer time sifting through my pages to find my mistakes.
The thread offered some good suggestions, but there was nothing that fit the bill. "Fine, then", thought I, "I'll just crank one out myself. I'm a halfway-decent developer, right?".
Well, once I started to think about it, I couldn't quite figure out how to go about it. I can crank out a data-driven website easily enough, or do a CMS implementation, or throw documents in and out of BizTalk all day. Can't begin to figure out how to compare HTML docs.
Well, sure, I have to read the DOM, and iterate through the nodes. I have to map the structure to some data structure (how??), and then compare them (how??). It's a development task like none I've ever attempted.
So now that I've identified a weakness in my knowledge, I'm even more challenged to figure this out. Any suggestions on how to get started?
clarification: the actual content isn't what I want to compare -- the creative guys fill their pages with lorem ipsum, and I use real content. Instead, I want to compare structure:
<div class="foo">lorem ipsum<div>
is different than
<div class="foo"><p>lorem ipsum<p><div>
The DOM is a data structure - it's a tree.
Run both files through the following Perl script, then use diff -iw to do a case-insensitive, whitespace-ignoring diff.
#! /usr/bin/perl -w
use strict;
undef $/;
my $html = <STDIN>;
while ($html =~ /\S/) {
    if ($html =~ s/^\s*<//) {
        $html =~ s/^(.*?)>// or die "malformed HTML";
        print "<$1>\n";
    } else {
        $html =~ s/^([^<]+)//;
        print "(text)\n";
    }
}
@Mike - that would compare everything, including the content of the page, which isn't what the original poster wanted.
Assuming that you have access to the browser's DOM (by writing a Firefox/IE plugin or whatever), I would probably put all of the HTML elements into a tree, then compare the two trees. If the tag name is different, then the node is different. You might want to stop enumerating at a certain point (you probably don't care about span, bold, italic, etc. - maybe only worry about divs?), since some tags are really the content, rather than the structure, of the page.
If I were to tackle this issue I would do this:
Plan for some kind of a DOM for HTML pages. Start lightweight and then add more as needed. I would use the composite pattern for the data structure, i.e. every element has a children collection of the base class type.
Create a parser to parse HTML pages.
Using the parser, load the HTML elements into the DOM.
After the pages have been loaded into the DOM, you have a hierarchical snapshot of your HTML pages' structure.
Keep iterating through every element on both sides till the end of the DOM. You'll find the diff in the structure when you hit a mismatched element type.
In your example you would have only a div element object loaded on one side; on the other side you would have a div element object loaded with one child element of type paragraph. Fire up your iterator: on the first pass you'll match up the div elements, on the second pass you'll match up the paragraph with nothing. You've got your structural difference.
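A rough sketch of that walk, assuming both pages are well-formed XHTML so they can be loaded with LINQ to XML (a real-world version would need an HTML parser for tag-soup input):
using System;
using System.Linq;
using System.Xml.Linq;

class StructureDiff
{
    static bool SameStructure(XElement a, XElement b, string path)
    {
        if (a.Name.LocalName != b.Name.LocalName)
        {
            Console.WriteLine($"Mismatch at {path}: <{a.Name.LocalName}> vs <{b.Name.LocalName}>");
            return false;
        }

        var childrenA = a.Elements().ToList();
        var childrenB = b.Elements().ToList();
        if (childrenA.Count != childrenB.Count)
        {
            Console.WriteLine($"Different child count under {path}/{a.Name.LocalName}");
            return false;
        }

        // Text content is deliberately ignored; only the element tree matters.
        return childrenA.Zip(childrenB, (x, y) => SameStructure(x, y, path + "/" + a.Name.LocalName))
                        .All(same => same);
    }

    static void Main(string[] args)
    {
        var left = XDocument.Load(args[0]).Root;
        var right = XDocument.Load(args[1]).Root;
        Console.WriteLine(SameStructure(left, right, "") ? "Structures match" : "Structures differ");
    }
}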
I think some of the suggestions above don't take into account that there are other tags in the HTML between two pages which would be textually different, but the resulting HTML markup is functionally equivalent. Danimal lists control IDs as an example.
The following two markups are functionally identical, but would show up as different if you simply compared tags:
<div id="ctl00_TopNavHome_DivHeader" class="header4">foo</div>
<div class="header4">foo</div>
I was going to suggest Danimal write an HTML translation which looks for the HTML tags and converts both docs into a simplified version of both which omits ID tags and any other tags you designate as irrelevant. This’d likely have to be a work in progress, as you ignore certain attributes/tags and then run into new ones which you also want to ignore.
However, I like the idea of using the XmlSchemaInterface to boil it down to the XML schema, then use a diff tool which understands XML rules.
See http://www.semdesigns.com/Products/SmartDifferencer/index.html for a tool that is parameterized by language grammar, and produces deltas in terms of language elements (identifiers, expressions, statements, blocks, methods, ...) inserted, deleted, moved, replaced, or has identifiers substituted across it consistently. This tool ignores whitespace reformatting (e.g., different linebreaks or layouts) and semantically indistinguishable values (e.g., it knows that 0x0F and 15 are the same value).
This can be applied to HTML using an HTML parser.
EDIT: 9/12/2009. We've built an experimental SmartDiff tool using an HTML editor.
http://www.mugo.ca/Products/Dom-Diff
Works with FF 3.5. I haven't tested FF 3.6 yet.
This has been an excellent start. A few more clarifications/comments:
I probably don't care about IDs, since .net will mangle them
some of the structure will be in a repeater or other such control, so I might end up having more or fewer repeating elements
further thought:
I think a good start would be to assume the HTML is XHTML compliant. I could then infer the schema (using the new .NET XmlSchemaInference methods), then diff the schemata. I can then look at the differences and consider whether or not they're significant.
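A small sketch of that idea, assuming the pages really are XHTML compliant (the file names are hypothetical); each page is run through XmlSchemaInference and the resulting .xsd files can be compared with any diff tool:
using System.IO;
using System.Xml;
using System.Xml.Schema;

class SchemaDump
{
    static void InferTo(string htmlPath, string schemaPath)
    {
        var inference = new XmlSchemaInference();
        using (var reader = XmlReader.Create(htmlPath))
        using (var writer = new StreamWriter(schemaPath))
        {
            // Infer a schema from the document and write it out for diffing.
            XmlSchemaSet schemas = inference.InferSchema(reader);
            foreach (XmlSchema schema in schemas.Schemas())
            {
                schema.Write(writer);
            }
        }
    }

    static void Main()
    {
        InferTo("designer.html", "designer.xsd");
        InferTo("implementation.html", "implementation.xsd");
    }
}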
My suggestion is just the basic way of doing it... Of course, to tackle the issue you mentioned, additional rules must be applied. In your case: once we have a matching div element, we then apply attribute/property matching rules and what not...
To be honest, there are many complicated rules that need to be applied for the comparison; it's not just simple matching of one element to another. For example, what happens if you have duplicates?
e.g. one div element on one side, and two div elements on the other side. How are you going to match up which div elements belong together?
There are a lot of other complicated issues that you will find in the comparison world. I'm speaking based on experience (part of my job is to maintain my company's text comparison engine).
Take a look at Beyond Compare. It has an XML comparison feature that can help you out.
You may also have to consider that the 'content' itself could contain additional mark-up so it's probably worth stripping out everything within certain elements (like <div>s with certain IDs or classes) before you do your comparison. For example:
<div id="mainContent">
<p>lorem ipsum etc..</p>
</div>
and
<div id="mainContent">
<p>Here is some real content<img class="someImage" src="someImage.jpg" /></p>
<ul>
<li>and</li>
<li>some</li>
<li>more..</li>
</ul>
</div>
Pretty Diff can do this. It will compare the code structure only, regardless of differences in white space, comments, or even content. Just be sure to check the option "Normalize Content and String Literals".
http://prettydiff.com/
I would use (or contribute to) html5lib and its SAX output. Just zip through the 2 SAX streams looking for mismatches and highlight the whole corresponding subtree.
I don't know any tool but I know there is a simple way to do this:
First, use a regular expression tool to strip off all the text in your HTML file. You can use the regular expression (?<=^|>)[^><]+?(?=<|$) to find the text and replace each match with an empty string (""), i.e. delete all the text. After this step, you will have all the HTML markup tags. There are a lot of free regular expression tools out there.
Then, you repeat the first step for the original HTML file.
Last, you use a diff tool to compare the two sets of HTML markups. This will show what is missing between one set and the other.
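A quick sketch of those steps in C# (the file names are hypothetical), using Regex.Replace with the expression above to strip the text and leave only the markup for diffing:
using System.IO;
using System.Text.RegularExpressions;

class StripText
{
    static void Strip(string inputPath, string outputPath)
    {
        string html = File.ReadAllText(inputPath);

        // Remove everything between tags, leaving only the markup itself.
        string tagsOnly = Regex.Replace(html, @"(?<=^|>)[^><]+?(?=<|$)", "");

        File.WriteAllText(outputPath, tagsOnly);
    }

    static void Main()
    {
        Strip("page-a.html", "page-a.tags.txt");
        Strip("page-b.html", "page-b.tags.txt");
    }
}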
If I were to do this, first I would learn HTML. (^-^) Then I would build a tool that strips out all of the actual content and then saves that as a file so it can be piped through WinDiff (or another merge tool).
Open each page in the browser and save them as .htm files. Compare the two using windiff.
