Detect if html string contains javascript [duplicate] - c#

Is there a library or acceptable method for sanitizing the input to an html page?
In this case I have a form with just a name, phone number, and email address.
Code must be C#.
For example:
"<script src='bobs.js'>John Doe</script>" should become "John Doe"

We are using the HtmlSanitizer .NET library, which:
Is open source (MIT) - GitHub link
Is fully customizable, e.g. you can configure which elements should be removed (see wiki)
Is actively maintained
Doesn't have the problems of the Microsoft Anti-XSS library
Is unit tested with the OWASP XSS Filter Evasion Cheat Sheet
Is purpose-built for this (in contrast to HTML Agility Pack, which is a parser, not a sanitizer)
Doesn't use regular expressions (HTML isn't a regular language!)
Is also available on NuGet

Based on the comment you made to this answer, you might find some useful info in this question:
https://stackoverflow.com/questions/72394/what-should-a-developer-know-before-building-a-public-web-site
Here's a parameterized query example. Instead of this:
string sql = "UPDATE UserRecord SET FirstName='" + txtFirstName.Text + "' WHERE UserID=" + UserID;
Do this:
SqlCommand cmd = new SqlCommand("UPDATE UserRecord SET FirstName = @FirstName WHERE UserID = @UserID");
cmd.Parameters.Add("@FirstName", SqlDbType.VarChar, 50).Value = txtFirstName.Text;
cmd.Parameters.Add("@UserID", SqlDbType.Int).Value = UserID;
Edit: Since there was no injection, I removed the portion of the answer dealing with that. I left the basic parameterized query example, since that may still be useful to anyone else reading the question.
--Joel

It sounds like you have users who submit content that you cannot fully trust, and yet you still want to render what they provide as safe HTML. Here are three techniques: HTML encode everything, HTML encode and/or remove just the evil parts, or use a DSL that compiles to HTML you are comfortable with.
1. Should it become "John Doe"? I would HTML encode that string and let the user, "John Doe" (if indeed that is his real name...), have the stupid-looking name <script src='bobs.js'>John Doe</script>. He shouldn't have wrapped his name in script tags or any tags in the first place. This is the approach I use in all cases unless there is a really good business case for one of the other techniques.
2. Accept HTML from the user and then sanitize it (on output) using a whitelist approach like the sanitization method @Bryant mentioned. Getting this right is (extremely) hard, and I defer pulling that off to greater minds. Note that some sanitizers will HTML encode evil where others would have removed the offending bits completely.
3. Another approach is to use a DSL that "compiles" to HTML. Make sure to whitehat your DSL compiler because some (like MarkdownSharp) will allow arbitrary HTML like <script> tags and evil attributes through unencoded (which by the way is perfectly reasonable but may not be what you need or expect). If that is the case you will need to use technique #2 and sanitize what your compiler outputs.
Closing thoughts:
If there is not a strong business case for technique #2 or #3, then reduce risk and save yourself the effort and the worry: go with technique #1.
Don't assume you're safe because you used a DSL. For example: the original implementation of Markdown allows HTML through, unencoded. "For any markup that is not covered by Markdown’s syntax, you simply use HTML itself. There’s no need to preface it or delimit it to indicate that you’re switching from Markdown to HTML; you just use the tags."
Encode when you output. You can also encode input, but doing so can put you in a bind. If you encoded incorrectly and saved that, how will you get the original input back so that you can re-encode it after fixing the faulty encoder?
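As a minimal sketch of technique #1 (store the raw input, encode when you output), assuming the value is written into HTML element content: the class and method names here are hypothetical, and WebUtility.HtmlEncode from System.Net is used as the encoder (HttpUtility.HtmlEncode in System.Web plays the same role in ASP.NET).

```csharp
using System;
using System.Net;

public static class OutputEncodingSketch
{
    // Hypothetical helper: the raw name is kept as submitted and is
    // HTML-encoded only at the moment it is written into the page.
    public static string RenderNameCell(string rawName)
    {
        return "<td>" + WebUtility.HtmlEncode(rawName ?? string.Empty) + "</td>";
    }
}
```

With this approach the user who typed script tags around his name simply sees those tags displayed as text; nothing is executed and the original input is still available unchanged.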

If by sanitize you mean REMOVE the tags entirely, the RegEx example referenced by Bryant is the type of solution you want.
If you just want to ensure that the code DOESN'T mess with your design and is rendered to the user as plain text, you can use the HttpUtility.HtmlEncode method to guard against that!

What about using Microsoft Anti-Cross Site Scripting Library?

Related

Is XSS possible through the MailAddress class?

Consider that I parse user input, which is supposed to be an email address, into the MailAddress class:
var mailString = Request.QueryString["mail"];
var mail = new MailAddress(mailString);
Is there any possibility left for a cross-site-scripting attack if I output the MailAddress object later in any way? For example through a Literal control in WebForms:
litMessage.Text = "Your mail address is " + mail.Address;
Is it necessary to sanitize the output even though I made sure that the address is a valid email address by parsing the string?
From what I could gather, the RFC for mail addresses is pretty complicated, so I am unsure whether cross-site scripts can be hidden in a mail address considered valid by .NET.
EDIT:
MSDN says that > and < brackets are allowed in an email address:
The address parameter can contain a display name and the associated e-mail address if you enclose the address in angle brackets. For example: "Tom Smith <tsmith@contoso.com>"
So the question remains if this is enough for an XSS attack and/or if the MailMessage class does anything to escape dangerous parts.
Generally speaking, you shouldn't need to validate the output later. However, I always recommend that you do so for the following reasons:
There may be a hole somewhere in your app that doesn't validate the input properly. This could be discovered by an attacker and used for XSS. This is especially possible when many different devs are working on the app.
There may be old data in the database that was stored before implementing/updating your filter on the input. This could contain malicious code that could be used for XSS.
Attackers are very clever and can usually figure out a way to beat a filter. Microsoft puts a lot of attention on preventing this, but it's never going to be perfect. It makes the attacker's job that much harder if they face an outgoing filter as well as an incoming filter.
I know it's a pain to constantly filter, but there is a lot of value in doing so. A Defense-in-Depth strategy is necessary in today's world.
Edit:
Sorry I didn't really answer the second part of your question. Based on the documentation I don't get the impression that the API is focused on sanitizing as much as it is on verifying valid formatting. Therefore I don't know that it is safe to rely on it for security purposes.
However, writing your own sanitizer isn't terribly hard, and you can update it immediately if you find flaws. First run the address through a good RegEx filter (see: Regex Email validation), then recursively remove every nonvalid character in an email address (these shouldn't get through at this point but do this for comprehensiveness and in case you want to reuse the class elsewhere), then escape every character with HTML meaning. I emphasize the recursive application of the filter because attackers can take advantage of a non-recursive filter with stuff like this:
<scr<script>ipt>
Notice that a non-recursive filter would remove the middle occurrence of <script> and leave the outer occurrence intact.
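A minimal sketch of the repeated application described above, assuming a simple token-removal filter: the class and method names are hypothetical, and the filter is applied until the string stops changing rather than by literal recursion. On its own this kind of blacklist is not a sufficient defense; the result should still be HTML-encoded on output.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RepeatedFilterSketch
{
    // Removes every case-insensitive occurrence of 'token' and repeats
    // until a pass makes no change, so fragments that reassemble into
    // the token after one removal are caught by the next pass.
    public static string RemoveUntilStable(string input, string token)
    {
        if (string.IsNullOrEmpty(input) || string.IsNullOrEmpty(token))
            return input;

        string pattern = Regex.Escape(token);
        string current = input;
        string previous;
        do
        {
            previous = current;
            current = Regex.Replace(previous, pattern, string.Empty, RegexOptions.IgnoreCase);
        } while (current != previous);

        return current;
    }
}
```

For the input <scr<script>ipt>, a single pass leaves <script> behind, while the loop runs a second pass that removes it.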
"Is it necessary to sanitize the output?"
You don't 'sanitise' output, you encode it. Every string that you output into an HTML document needs to be HTML-encoded, so if there was a < character in the mail address it wouldn't matter - you'd get &lt; in the HTML source as a result and that would display correctly as a literal < on the page.
Many ASP.NET controls automatically take care of HTML-escaping for you, but Literal does not by default because it can be used to show markup. But if you set the Mode property of the Literal control to Encode then setting the Text like you're doing is perfectly fine.
You should make sure you always use safe HTML-encoded output every time you put content into an HTML page, regardless of whether you think the values you're using will ever be able to include a < character. This is a separation-of-concerns issue: HTML output code knows all about HTML formatting, but it shouldn't know anything about what characters are OK in an e-mail address or other application field.
Leaving out an escape because you think the value is 'safe' introduces an implicit and fragile coupling between the output stage and the input stage, making it difficult to verify that the code is safe and easy to make it unsafe when you make changes.

How can I check for any html <script> tags in C#, plus anything else nasty?

A user is allowed to format their html in a textbox. This then gets sent to the backend where it will be validated. Other users may then see this textbox.
I want to check for any tags in the backend. I know this can be done with a relatively simple regex. I would just do something like <\s*?script\s*?>
My issue though is if someone does something like this:
<a onClick="alert('hi');">test</a>
This would pass validation. I could also make the regex check for onClick, but I'm sure there are other ways around this.
My question: Is there a good way to do this? Am I just going to have to rely on regexes and my own research to figure out how else they could run a script?
EDIT
I suppose I could create a whitelist of what they can enter. It's primarily meant for formatting text, so <b>, <i>, <h> etc. This may or may not be an acceptable solution though, I need to look and see what the actual use case is. I'm hoping there's a different solution to this.
Really you should use white-list validation (i.e. allow only specific examples that you know are safe) rather than trying to detect and remove potentially hazardous input.
One really nice way to do this is to use Markdown rather than just allowing HTML input.
There are OWASP Guidelines for HTML injection.
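One way to sketch the white-list idea for a handful of plain formatting tags, without writing a tag-detecting regex: HTML-encode the whole input, then restore only the exact, attribute-free forms of the tags you allow. The class name, method name, and tag list below are assumptions for illustration, and this is not a substitute for a full sanitizer library when richer HTML is needed.

```csharp
using System;
using System.Net;

public static class AllowListSketch
{
    // Tags that are allowed through, only in their bare form (no attributes).
    private static readonly string[] AllowedTags = { "b", "i", "em", "strong" };

    public static string EncodeExceptAllowedTags(string input)
    {
        // Everything is encoded first, so nothing is markup at this point.
        string encoded = WebUtility.HtmlEncode(input ?? string.Empty);

        // Only exact matches such as "&lt;b&gt;" are turned back into tags;
        // a tag carrying attributes (onclick, style, ...) stays encoded.
        foreach (string tag in AllowedTags)
        {
            encoded = encoded
                .Replace("&lt;" + tag + "&gt;", "<" + tag + ">")
                .Replace("&lt;/" + tag + "&gt;", "</" + tag + ">");
        }
        return encoded;
    }
}
```

Because only exact strings are restored, anything not on the list is displayed as text rather than removed, which also makes it obvious to the author that the markup was not accepted. Unbalanced tags are not repaired by this sketch.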
A simple regex for removing all HTML tags from content:
public string Strip(string text)
{
    return Regex.Replace(text, @"<(.|\n)*?>", string.Empty);
}

URL from user, XSS security? [duplicate]

We have a high security application and we want to allow users to enter URLs that other users will see.
This introduces a high risk of XSS hacks - a user could potentially enter javascript that another user ends up executing. Since we hold sensitive data it's essential that this never happens.
What are the best practices in dealing with this? Is any security whitelist or escape pattern alone good enough?
Any advice on dealing with redirections ("this link goes outside our site" message on a warning page before following the link, for instance)
Is there an argument for not supporting user entered links at all?
Clarification:
Basically our users want to input:
stackoverflow.com
And have it output to another user:
<a href="http://stackoverflow.com">stackoverflow.com</a>
What I really worry about is them using this in a XSS hack. I.e. they input:
alert('hacked!');
So other users get this link:
<a href="javascript:alert('hacked!');">stackoverflow.com</a>
My example is just to explain the risk - I'm well aware that javascript and URLs are different things, but by letting them input the latter they may be able to execute the former.
You'd be amazed how many sites you can break with this trick - HTML is even worse. If they know to deal with links do they also know to sanitise <iframe>, <img> and clever CSS references?
I'm working in a high security environment - a single XSS hack could result in very high losses for us. I'm happy that I could produce a Regex (or use one of the excellent suggestions so far) that could exclude everything that I could think of, but would that be enough?
If you think URLs can't contain code, think again!
https://owasp.org/www-community/xss-filter-evasion-cheatsheet
Read that, and weep.
Here's how we do it on Stack Overflow:
/// <summary>
/// returns "safe" URL, stripping anything outside normal charsets for URL
/// </summary>
public static string SanitizeUrl(string url)
{
    return Regex.Replace(url, @"[^-A-Za-z0-9+&@#/%?=~_|!:,.;\(\)]", "");
}
The process of rendering a link "safe" should go through three or four steps:
Unescape/re-encode the string you've been given (RSnake has documented a number of tricks at http://ha.ckers.org/xss.html that use escaping and UTF encodings).
Clean the link up: Regexes are a good start - make sure to truncate the string or throw it away if it contains a " (or whatever you use to close the attributes in your output). If you're using the links only as references to other information, you can also force the protocol at the end of this process - if the portion before the first colon is not 'http' or 'https' then prepend 'http://' to the start. This allows you to create usable links from incomplete input as a user would type into a browser and gives you a last shot at tripping up whatever mischief someone has tried to sneak in.
Check that the result is a well formed URL (protocol://host.domain[:port][/path][/[file]][?queryField=queryValue][#anchor]).
Possibly check the result against a site blacklist or try to fetch it through some sort of malware checker.
If security is a priority I would hope that the users would forgive a bit of paranoia in this process, even if it does end up throwing away some safe links.
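The clean-up, force-the-protocol, and well-formedness steps above can be sketched roughly as follows. The class and method names are hypothetical, System.Uri stands in for the well-formedness check, and the unescape/re-encode and blacklist steps are left out.

```csharp
using System;

public static class LinkCleanerSketch
{
    // Returns a normalized http/https URL, or null when the input is rejected.
    public static string NormalizeLink(string input)
    {
        // Throw the link away if it is empty or contains the character
        // that would close the href attribute in the output.
        if (string.IsNullOrWhiteSpace(input) || input.Contains("\""))
            return null;

        string candidate = input.Trim();

        // Force the protocol: unless the part before the first colon is
        // http or https, prepend "http://".
        int colon = candidate.IndexOf(':');
        string prefix = colon < 0 ? string.Empty : candidate.Substring(0, colon).ToLowerInvariant();
        if (prefix != "http" && prefix != "https")
            candidate = "http://" + candidate;

        // Check that the result parses as a well-formed absolute URL.
        Uri uri;
        if (!Uri.TryCreate(candidate, UriKind.Absolute, out uri))
            return null;
        if (uri.Scheme != Uri.UriSchemeHttp && uri.Scheme != Uri.UriSchemeHttps)
            return null;

        return uri.AbsoluteUri;
    }
}
```

Forcing the prefix means an input such as javascript:alert('hacked!'); is reinterpreted as a host and port under http, which then fails the well-formedness check instead of surviving as a script URL. The returned value still has to be HTML-encoded when it is written into the page.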
Use a library, such as OWASP-ESAPI API:
PHP - http://code.google.com/p/owasp-esapi-php/
Java - http://code.google.com/p/owasp-esapi-java/
.NET - http://code.google.com/p/owasp-esapi-dotnet/
Python - http://code.google.com/p/owasp-esapi-python/
Read the following:
https://www.golemtechnologies.com/articles/prevent-xss#how-to-prevent-cross-site-scripting
https://www.owasp.org/
http://www.secbytes.com/blog/?p=253
For example:
$url = "http://stackoverflow.com"; // e.g., $_GET["user-homepage"];
$esapi = new ESAPI( "/etc/php5/esapi/ESAPI.xml" ); // Modified copy of ESAPI.xml
$sanitizer = ESAPI::getSanitizer();
$sanitized_url = $sanitizer->getSanitizedURL( "user-homepage", $url );
Another example is to use a built-in function. PHP's filter_var function is an example:
$url = "http://stackoverflow.com"; // e.g., $_GET["user-homepage"];
$sanitized_url = filter_var($url, FILTER_SANITIZE_URL);
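Note that filter_var with FILTER_SANITIZE_URL only strips characters that are not legal in a URL. It still allows javascript: URLs through and does not restrict the scheme to http or https, so the scheme needs a separate check. Using the OWASP ESAPI Sanitizer is probably the best option.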
Still another example is the code from WordPress:
http://core.trac.wordpress.org/browser/tags/3.5.1/wp-includes/formatting.php#L2561
Additionally, since there is no way of knowing where the URL links (i.e., it might be a valid URL, but the contents of the URL could be mischievous), Google has a safe browsing API you can call:
https://developers.google.com/safe-browsing/lookup_guide
Rolling your own regex for sanitation is problematic for several reasons:
Unless you are Jon Skeet, the code will have errors.
Existing APIs have many hours of review and testing behind them.
Existing URL-validation APIs consider internationalization.
Existing APIs will be kept up-to-date with emerging standards.
Other issues to consider:
What schemes do you permit (are file:/// and telnet:// acceptable)?
What restrictions do you want to place on the content of the URL (are malware URLs acceptable)?
Just HTMLEncode the links when you output them. Make sure you don't allow javascript: links. (It's best to have a whitelist of protocols that are accepted, e.g., http, https, and mailto.)
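A minimal sketch of that advice, assuming an allow-list of http, https, and mailto: the class and method names are hypothetical, System.Uri is used to read the scheme, and WebUtility.HtmlEncode encodes both the href value and the link text.

```csharp
using System;
using System.Net;

public static class LinkRenderingSketch
{
    // Schemes that may appear in a rendered link; everything else is refused.
    private static readonly string[] AllowedSchemes = { "http", "https", "mailto" };

    public static string RenderLink(string url, string text)
    {
        string safeText = WebUtility.HtmlEncode(text ?? string.Empty);

        Uri uri;
        bool allowed = Uri.TryCreate(url, UriKind.Absolute, out uri)
            && Array.IndexOf(AllowedSchemes, uri.Scheme) >= 0;

        // Fall back to plain encoded text when the URL is not on the allow-list.
        if (!allowed)
            return safeText;

        return "<a href=\"" + WebUtility.HtmlEncode(uri.AbsoluteUri) + "\">" + safeText + "</a>";
    }
}
```

Falling back to plain text rather than throwing keeps the page rendering even when a stored value is bad, which also covers old data saved before the check existed.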
You don't specify the language of your application, so I will presume ASP.NET; for that you can use the Microsoft Anti-Cross Site Scripting Library.
It is very easy to use: all you need is an include and that is it :)
While you're on the topic, why not give Design Guidelines for Secure Web Applications a read?
If you are using another language: since there is a library for ASP.NET, one should be available for other languages as well (PHP, Python, RoR, etc.).
For Pythonistas, try Scrapy's w3lib.
OWASP ESAPI pre-dates Python 2.7 and is archived on the now-defunct Google Code.
How about not displaying them as a link? Just use the text.
Combined with a warning to proceed at your own risk may be enough.
addition - see also Should I sanitize HTML markup for a hosted CMS? for a discussion on sanitizing user input
There is a library for javascript that solves this problem
https://github.com/braintree/sanitize-url
Try it =)
In my project written in JavaScript I use this regex as white list:
url.match(/^((https?|ftp):\/\/|\.{0,2}\/)/)
the only limitation is that you need to put ./ in front for files in same directory but I think I can live with that.
Using regular expressions to prevent XSS vulnerabilities becomes complicated and hard to maintain over time, and can still leave some vulnerabilities behind. URL validation with a regular expression is helpful in some scenarios, but is better not mixed with vulnerability checks.
A solution is probably to use a combination of an encoder like AntiXssEncoder.UrlEncode for the query portion of the URL and UriBuilder for the rest:
using System;
using System.Web.Security.AntiXss; // AntiXssEncoder (System.Web, .NET Framework 4.5+)

public sealed class AntiXssUrlEncoder
{
    public string EncodeUri(Uri uri, bool isEncoded = false)
    {
        // Encode the query portion of the URL to prevent XSS if it is not already encoded;
        // UriBuilder takes care of the rest.
        var query = uri.Query.TrimStart('?');
        var encodedQuery = isEncoded ? query : AntiXssEncoder.UrlEncode(query);
        var encodedUri = new UriBuilder
        {
            Scheme = uri.Scheme,
            Host = uri.Host,
            Path = uri.AbsolutePath,
            Query = encodedQuery.Trim(),
            // On .NET Framework the Fragment setter prepends '#' itself, so strip the existing one.
            Fragment = uri.Fragment.TrimStart('#')
        };
        // Keep the port only when it is not a default HTTP/HTTPS port.
        if (uri.Port != 80 && uri.Port != 443)
        {
            encodedUri.Port = uri.Port;
        }
        return encodedUri.ToString();
    }

    public static string Encode(string uri)
    {
        var baseUri = new Uri(uri);
        var antiXssUrlEncoder = new AntiXssUrlEncoder();
        return antiXssUrlEncoder.EncodeUri(baseUri);
    }
}
You may need to add a white list to exclude some characters from encoding. That could be helpful for particular sites.
HTML-encoding the page that renders the URL is another thing you may need to consider.
BTW, please note that encoding the URL may break Web Parameter Tampering, so the encoded link may appear not to work as expected.
Also, you need to be careful about double encoding.
P.S. AntiXssEncoder.UrlEncode would be better named AntiXssEncoder.EncodeForUrl to be more descriptive. Basically, it encodes a string for use in a URL; it does not take a given URL and return a usable URL.
You could hex-encode the entire URL before sending it to your server. That way the client would not understand the content at first glance. After reading the content, you could decode the URL and send it to the browser.
Allowing a URL and allowing JavaScript are 2 different things.

why does MS anti xss library (v4) remove html 5 data attributes

AntiXss library seems to strip out html 5 data attributes, does anyone know why?
I need to retain this input:
<label class='ui-templatefield' data-field-name='P_Address3' data-field-type='special' contenteditable='false'>[P_Address3]</label>
The main reason for using the anti xss library (v4.0) is to ensure unrecognized style attributes are not parsed, is this even possible?
code:
var result = Sanitizer.GetSafeHtml(html);
EDIT:
The input below would result in the entire style attributes removed
Input:
var input = "<p style=\"width:50px;height:10px;alert('evilman')\"/> Not sure why is is null for some wierd reason!<br><p></p>";
Output:
var input = "<p style=\"\"/> Not sure why is is null for some wierd reason!<br><p></p>";
Which is fine, if anyone messes around with my code on client side, but I also need the data attribute tags to work!
I assume you mean the sanitizer, rather than the encoder. It's doing what it's supposed to - it simply doesn't understand HTML5 or recognise the attributes, so it strips them. There are ways to XSS via styles.
It's not possible to customise the safe list either I'm afraid, the code base simply doesn't allow for this - I know a large number of people want those, but it would take a complete rewrite to support it.
