c# encode string to prevent XSS injection (RegEx??) - c#

I have a string that contains valid HTML tags like <table> or <br/> and that is displayed in a tooltip, including content from the database. The security team considers this a critical issue, as it could allow XSS attacks from a user who inserts a <script> alert(...)
Everywhere on the web, I'm told to use HttpUtility.HtmlEncode(). The problem is that this also encodes my valid tags.
What I'm looking for and I can't find is a RegEx that would allow me to filter xss injections without stripping my valid html tags.
Does something like this exist?

Should be pretty simple: HTML-encode the entire thing, then use a regex to replace all instances of &lt;table&gt; with <table>, etc. An example regex would be "&lt;(/?(table|span|p|br|tr|td|th|thead|tbody|tfoot|b|i)\s*/?)&gt;", with "<$1>" as the replacement.
That should get you pretty close. Of course it won't allow complex tags, like <table id=...> etc, but you will have to decide if that's a requirement or not. Or use a markdown editor instead.
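The encode-then-restore idea above can be sketched as follows. This is a minimal sketch, not a vetted sanitizer: the class and method names are hypothetical, the tag list is the one from the answer's regex, and it assumes System.Net.WebUtility.HtmlEncode is an acceptable stand-in for HttpUtility.HtmlEncode.

```csharp
using System.Net;
using System.Text.RegularExpressions;

public static class TooltipSanitizer
{
    // Hypothetical allow-list of bare tags (no attributes), taken from the
    // regex in the answer. It matches the *encoded* form of each tag.
    private static readonly Regex AllowedTag = new Regex(
        @"&lt;(/?(?:table|span|p|br|tr|td|th|thead|tbody|tfoot|b|i)\s*/?)&gt;",
        RegexOptions.IgnoreCase);

    public static string Sanitize(string html)
    {
        if (string.IsNullOrEmpty(html)) return string.Empty;

        // Step 1: encode everything, so no markup survives.
        string encoded = WebUtility.HtmlEncode(html);

        // Step 2: turn only the allow-listed tags back into real markup.
        return AllowedTag.Replace(encoded, "<$1>");
    }
}
```

Anything that is not an exact, attribute-free match for an allowed tag stays encoded, so a <script> element or a <table onclick=...> is rendered as text rather than executed.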

Related

using a string format with parameters to set onclick

Why doesn't this work?
<input type="button" id="btnAccept" value="Accept" onclick='<%# String.Format("accept('{0}','{1}','{2}','{3}-{4}');", Container.DataItem("PositionID"), Container.DataItem("ApplicantID"), Container.DataItem("FullName"), Container.DataItem("DepartmentName"), Container.DataItem("PositionTitle"))%>' />
The onclick doesn't do anything.
Your best bet is to look at the generated HTML. I think it's a really good habit to check the generated HTML in text format and how it renders on-screen, all the time. Besides errors such as this (which can easily be spotted in the generated HTML), it will help you catch other possible invalid uses of HTML which may render as intended in one browser while rendering terribly in another. HTML rendering engines employ many tricks to try and make invalid HTML look okay.
Anyway, all things aside (such as, assuming accept(...) exists, and all other calls in the tag are correct) I think the issue you are having is as follows:
onclick='<%# String.Format("accept('{0}','{1}','{2}','{3}-{4}');", ... )%>'
This line is probably going to evaluate to look something like this:
onclick='accept('{0}','{1}','{2}','{3}-{4}');'
With all single quotes, all the onclick attribute will see is onclick='accept(' which is not a valid javascript method call. You're going to want to use the "" strings, which you can embed in the format string by escaping them.
String.Format("accept(\"{0}\",\"{1}\",\"{2}\",\"{3}-{4}\");", ... )
Then, you should be able to get the correct combination of ' and " within the attribute:
onclick='accept("{0}","{1}","{2}","{3}-{4}");'
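As a quick check of the quoting, here is a small sketch of the same format string in plain C#. The class name, method name, and sample values are hypothetical, and it assumes the values themselves contain no quote characters (otherwise they would also need escaping).

```csharp
using System;

public static class OnClickBuilder
{
    // Builds the JavaScript call with double-quoted arguments, so the result
    // can sit inside a single-quoted onclick='...' attribute.
    public static string Build(string positionId, string applicantId,
        string fullName, string departmentName, string positionTitle)
    {
        return String.Format("accept(\"{0}\",\"{1}\",\"{2}\",\"{3}-{4}\");",
            positionId, applicantId, fullName, departmentName, positionTitle);
    }
}
```

For example, Build("7", "42", "Ann Lee", "Sales", "Manager") yields accept("7","42","Ann Lee","Sales-Manager"); which no longer collides with the attribute's single quotes.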

How can I check for any html <script> tags in C#, plus anything else nasty?

A user is allowed to format their html in a textbox. This then gets sent to the backend where it will be validated. Other users may then see this textbox.
I want to check for any tags in the backend. I know this can be done with a relatively simple regex. I would just do something like <\s*?script\s*?>
My issue though is if someone does something like this:
test
This would pass validation. I could also make the regex check for onClick, but I'm sure there are other ways around this.
My question: Is there a good way to do this? Am I just going to have to rely on regexes and my own research to figure out how else they could run a script?
EDIT
I suppose I could create a whitelist of what they can enter. It's primarily meant for formatting text, so <b>, <i>, <h> etc. This may or may not be an acceptable solution though, I need to look and see what the actual use case is. I'm hoping there's a different solution to this.
Really you should use white-list validation (i.e. allow only specific examples that you know are safe) rather than trying to detect and remove potentially hazardous input.
One really nice way to do this is to use Markdown rather than just allowing HTML input.
There are OWASP Guidelines for HTML injection.
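A simple regex for removing all HTML tags from content:
// Requires: using System.Text.RegularExpressions;
public string Strip(string text)
{
    // Lazily match everything from '<' to the next '>', including newlines.
    return Regex.Replace(text, @"<(.|\n)*?>", string.Empty);
}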
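The white-list approach recommended above can be sketched as a validator that rejects input instead of trying to clean it. This is only an illustration under assumptions: the allowed tags (<b>, <i>, <h1> to <h6>) are guessed from the question's edit, the names are hypothetical, and plain text containing a literal < or > is rejected too.

```csharp
using System.Text.RegularExpressions;

public static class FormattingWhitelist
{
    // Hypothetical allow-list: bare <b>, <i>, <h1>..<h6> tags, no attributes.
    private static readonly Regex AllowedTag = new Regex(
        @"</?(?:b|i|h[1-6])\s*/?>", RegexOptions.IgnoreCase);

    public static bool IsSafe(string input)
    {
        if (string.IsNullOrEmpty(input)) return true;

        // Remove every allowed tag; if any angle bracket is left over, the
        // input contained something that is not on the list.
        string remainder = AllowedTag.Replace(input, string.Empty);
        return remainder.IndexOf('<') < 0 && remainder.IndexOf('>') < 0;
    }
}
```

Because the check rejects rather than rewrites, tricks such as nesting an allowed tag inside a forbidden one still fail validation: the leftover brackets give them away.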

Html Encoding of output in legacy ASP.NET site

I have a legacy ASP.Net site (recently upgraded to .NET 4.0) which never had Request Validation turned on and it doesn't Html encode any user input at all.
My solution was to turn on request validation and to catch the HttpRequestValidationException in Global.asax and redirect the user to an error page. I don't Html Encode the user input as I'll have to do it in thousands of places. I hope my approach will stop any XSS vectors getting saved into database.
However, in case there are already XSS vectors stored in the database, I reckon I should also HTML-encode all output. Unfortunately, I have very limited dev and test resources to achieve this. I came up with a list of changes I need to go through:
Search and Replace all <%= %> with <%: %>.
Search and Replace all Labels with Literals and add Mode="Encode".
Wrap all eval() with HtmlEncode.
My questions are:
Is there any simpler way of turning on all output to be automatically Html encoded?
Am I missing anything from above list?
Any pitfalls I should be careful about?
Thanks.
Search and Replace all <%= %> with <%: %>.
Don't forget the <%# and Response.Write which will be harder to replace
Search and Replace all Labels with Literals and add Mode="Encode".
But you will lose all formatting on the previously generated spans, and break the DOM and the corresponding JS/CSS.
You would also have to search for all Literals with Mode="PassThrough" and set them to Encode.
Wrap all eval() with HtmlEncode.
Yes, it seems like a subset of the <%# matter above
Also, you could have some custom controls with funky render methods.
Assuming there is "only" a relational DB as the back end, and if I had access to it, I would first identify the problematic tables and columns whose values contain markup.
I would then:
cleanup as best as I can those values in DB.
ensure HtmlEncoding of the corresponding outputs in my pages
I could then go for a basic global replace, with <%= becoming <%:, and sanitize outputs in the long run.
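For the "wrap all Eval() calls" step, one option is a single null-safe helper, so the encoding rule lives in one place. This is a hedged sketch: the class and method names are hypothetical, and it uses System.Net.WebUtility.HtmlEncode on the assumption that it is an acceptable substitute for HttpUtility.HtmlEncode here.

```csharp
using System.Net;

public static class OutputEncoding
{
    // Hypothetical helper for data-binding expressions: HTML-encodes any
    // bound value and treats null as an empty string.
    public static string Enc(object value)
    {
        return value == null
            ? string.Empty
            : WebUtility.HtmlEncode(value.ToString());
    }
}
```

A binding expression could then call it as <%# OutputEncoding.Enc(Eval("FullName")) %>, where FullName is only an example field name.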

Repairing malformatted html attributes using c#

I have a web application with an upload functionality for HTML files generated by chess software to be able to include a javascript player that reproduces a chess game.
I do not like to load the uploaded files in a frame so I reconstruct the HTML and javascript generated by the software by parsing the dynamic parts of the file.
The problem with the HTML is that all attributes values are surrounded with an apostrophe instead of a quotation mark. I am looking for a way to fix this using a library or a regex replace using c#.
The html looks like this:
<DIV class='pgb'><TABLE class='pgbb' CELLSPACING='0' CELLPADDING='0'><TR><TD>
and I would transform it into:
<DIV class="pgb"><TABLE class="pgbb" CELLSPACING="0" CELLPADDING="0"><TR><TD>
I'd say your best option is to use something like HTML Agility Pack to parse the generated HTML, and then ask it to re-serialize it to string (hopefully correcting any formatting problems in the process). Any attempt at Regexes or other direct string manipulation of HTML is going to be difficult, fragile and broken...
Example (when your HTML is stored in a file on the hard disk):
HtmlDocument doc = new HtmlDocument();
doc.Load("file.htm");
doc.Save("file.htm");
It is also possible to do this directly in memory from a string or Stream of input HTML.
You could use something like:
string outputString = Regex.Replace(inputString, @"(?<=\<[^<>]*)\'(?=[^<>]*\>)", "\"");
Changed it after Oded's remark, this leaves the body HTML intact. But I agree, Regex is a bad idea for parsing HTML. Mark's answer is better.
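To see what that regex does on the sample input, here is a small self-contained sketch (the class and method names are hypothetical). It relies on .NET's support for variable-length lookbehind, and like any regex over HTML it is fragile, for instance if an attribute value contains < or >.

```csharp
using System.Text.RegularExpressions;

public static class QuoteFixer
{
    // Replaces an apostrophe only when it sits between a '<' and the next '>'
    // with no other angle bracket in between, i.e. inside a tag.
    private static readonly Regex ApostropheInTag = new Regex(
        @"(?<=<[^<>]*)'(?=[^<>]*>)");

    public static string Fix(string html)
    {
        return ApostropheInTag.Replace(html, "\"");
    }
}
```

Apostrophes in body text, such as the one in "it's", are left alone because they are not enclosed by a tag's angle brackets.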

stripping all attributes from an html tag using regex

I've been trying to formulate a regular expression to remove any attributes that may be present in html tags but I'm having trouble doing this and Google doesn't seem to provide any answers either.
Basically my input string looks something like
<p style="font-family:Arial;" class="x" onclick="doWhatever();">this text</p>
<img style="border:0px" src="pic.gif" />
and I would like to remove any attributes inside the tag to produce a string like:
<p>this text</p>
<img src="pic.gif" />
Does anybody know a regex for doing this? I'm using Regex.Replace in C# by the way.
There are really excellent tools for handling this sort of task in .NET without having to resort to the regex hammer. This will also be more reliable than a regular expression based solution.
I'd suggest that you take a look at HTML Agility Pack.
HTML is most easily handled through a DOM, but if you really want to do this with a regex, you could probably take advantage of the fact that you want to remove all attributes, i.e. leave nothing but the tag. IMO you should use a DOM parser instead.
Either that, or use jQuery's each to go through all HTML elements and remove the attributes, or do so for a particular element. Why would you be doing that anyway?
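If a regex really is unavoidable, a rough sketch for the sample input could look like this. It is an illustration only, with hypothetical names: it assumes double-quoted attribute values, no > inside any value, and that src is the one attribute worth keeping, as in the question's desired output. A DOM parser remains the more reliable route.

```csharp
using System.Text.RegularExpressions;

public static class AttributeStripper
{
    // Opening tag: name, raw attribute text, optional self-closing slash.
    private static readonly Regex OpeningTag = new Regex(@"<(\w+)([^>]*?)(/?)>");

    // A double-quoted src attribute preceded by whitespace.
    private static readonly Regex SrcAttribute = new Regex(
        @"(?<=\s)src\s*=\s*""[^""]*""", RegexOptions.IgnoreCase);

    public static string Strip(string html)
    {
        return OpeningTag.Replace(html, m =>
        {
            string tag = m.Groups[1].Value;
            Match src = SrcAttribute.Match(m.Groups[2].Value);
            string kept = src.Success ? " " + src.Value : string.Empty;
            string close = m.Groups[3].Value == "/" ? " />" : ">";
            // Rebuild the tag from scratch so nothing else survives.
            return "<" + tag + kept + close;
        });
    }
}
```

Closing tags such as </p> do not match the pattern and pass through unchanged.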
