Escape text for HTML - c#

How do I escape text for use in HTML in C#? I want to do
sample="<span>blah<span>"
and have
<span>blah<span>
show up as plain text, instead of only "blah" being shown with the tags treated as part of the HTML :(.
Using C# not ASP

using System.Web;
var encoded = HttpUtility.HtmlEncode(unencoded);

Also, you can use this if you don't want to use the System.Web assembly:
var encoded = System.Security.SecurityElement.Escape(unencoded)
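Per this article, the difference between System.Security.SecurityElement.Escape() and System.Web.HttpUtility.HtmlEncode() is that the former also encodes apostrophe (') characters, as &apos;. Note that this describes older framework versions: from .NET 4 on, HtmlEncode encodes the apostrophe as well, as &#39;.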

If you're using .NET 4 or above and you don't want to reference System.Web, you can use WebUtility.HtmlEncode from the System.Net namespace:
var encoded = WebUtility.HtmlEncode(unencoded);
This has the same effect as HttpUtility.HtmlEncode and should be preferred over System.Security.SecurityElement.Escape.
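To make the comparison between the two non-System.Web options concrete, here is a minimal sketch. The class and method names (HtmlEscapeDemo, ViaWebUtility, ViaSecurityElement) are made up for illustration; only WebUtility.HtmlEncode and SecurityElement.Escape come from the framework.

```csharp
using System.Net;
using System.Security;

// Hypothetical helper class, for illustration only.
public static class HtmlEscapeDemo
{
    // HTML-encodes text without referencing the System.Web assembly.
    public static string ViaWebUtility(string text) => WebUtility.HtmlEncode(text);

    // XML-style escaping; apostrophes come out as &apos;.
    public static string ViaSecurityElement(string text) => SecurityElement.Escape(text);
}
```

With the sample from the question, HtmlEscapeDemo.ViaWebUtility("<span>blah<span>") returns &lt;span&gt;blah&lt;span&gt;. For the input It's, ViaWebUtility returns It&#39;s while ViaSecurityElement returns It&apos;s, which matters if the consumer of the output does not understand &apos;.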

In ASP.NET 4.0 there's new syntax to do this. Instead of
<%= HttpUtility.HtmlEncode(unencoded) %>
you can simply do
<%: unencoded %>
Read more here:
New <%: %> Syntax for HTML Encoding Output in ASP.NET 4 (and ASP.NET MVC 2)

.NET 4.0 and above:
using System.Web.Security.AntiXss;
//...
var encoded = AntiXssEncoder.HtmlEncode("input", useNamedEntities: true);

You can use the actual HTML tags <xmp> and </xmp> to output the string as is, showing all of the tags between the xmp tags. Be aware that <xmp> is obsolete in HTML5.
Or, on the server, you can use Server.UrlEncode or HttpUtility.HtmlEncode.

For a simple way to do this in Razor pages, use the following:
In .cshtml:
@Html.Raw(Html.Encode("<span>blah<span>"))
In .cshtml.cs:
string rawHtml = Html.Raw(Html.Encode("<span>blah<span>"));

You can use:
System.Web.HttpUtility.JavaScriptStringEncode("Hello, this is Satan's Site")
It was the only thing that worked (ASP.NET 4.0+) when dealing with HTML like this. The &apos; gets rendered as ' (using htmldecode) in the HTML content, causing it to fail:
It's Allstars

There are some special quote characters, like ” and “, which are not encoded by HtmlEncode and will not be displayed correctly in Edge or Internet Explorer. You can handle them by replacing these characters first, with something like the function below.
private string RemoveJunkChars(string input)
{
    // Replace typographic quotes with plain double quotes, then HTML-encode.
    return HttpUtility.HtmlEncode(input.Replace("”", "\"").Replace("“", "\""));
}

Related

Razor - write something unencoded without using html helper

I'm looking for a way to write a value in my razor view, without encoding it, AND without using the Html Helper.
I'm rendering the view in a hybrid website, where I parse my View programmatically, like this:
string html = "<html>@(\"Write something <strong>unencoded</strong>\")</html>";
html = Razor.Parse<TModel>(html, model);
So essentially, my html variable contains a template containing razor c# code. Because I am compiling my view like this, I have no access to the Html helper (the accepted answer in this post implies this is indeed the case: How to render a Razor View to a string in ASP.NET MVC 3?)
However, my html variable also contains a statement like this:
@Html.Raw("<strong>This should be printed unencoded</strong>")
This does not work but gives "Html is not available in this context".
How can I achieve the same behavior? Using Response.Write gives the exact same error.
Are there any other ways?
Note: this is a hybrid website, containing both classic ASP webforms and some newer Web API and MVC stuff. The View I'm using is NOT accessed through conventional MVC ways.
The HtmlString type should work for you. From its documentation:
Represents an HTML-encoded string that should not be encoded again.
Here is a sample that creates such a string inline (normally you'd have such values in the Model, set by the controller):
@(new HtmlString("<strong>This should be printed unencoded</strong>"))
It took me a while to figure out, but this is the final solution (which works for me, but has some security implications, so do read on!)
Compile your view like this
var config = new TemplateServiceConfiguration();
config.EncodedStringFactory = new RawStringFactory();
var service = RazorEngineService.Create(config);
html = service.RunCompile(html, "templateNameInTheCache", null, model);
As you can see, I employed the RawStringFactory to make sure no HTML at all gets encoded. Of course, MVC automatically encodes HTML as a safety precaution, so doing this isn't very safe. Only do this if you're 100% sure that all of the output in the entire Razor view is safe!

Bold Text in Html.FormatValue using Razor

I want to have the following result. Username has to be bold:
Blabla Username Bla.
I have the format in a resource file:
Blabla {0} Bla.
And in the view I do the following:
@Html.FormatValue(User.Identity.Name, Resources.MyFormatString)
How can I make the Username bold and use Html.FormatValue? Or is there another method to achieve this?
You could simply change your resource to contain the bold-tag, strong-tag or a style.
Like "Blabla <b>{0}</b> Bla.".
[edit]
Indeed, I checked Html.FormatValue for escaping functionality and did not see any, but apparently it does escape its output :)
In that case, using @Html.Raw and string.Format will work.
@Html.Raw(string.Format(Resources.MyFormatString, "SomeName"))
(tested in MVC 5, but @Html.Raw is also available in 4)
Also a small note: storing HTML in resources is probably not the best idea, mixing UI & content.
[/edit]
I wanted to solve your example in a way that includes HTML tags, stays safe with HTML characters in the resources, and safely includes user input or HTML tags. My solution for your example is
@(Resources.MyFormatString.FormatWithHtml(
    "<b>" + HttpUtility.HtmlEncode(User.Identity.Name) + "</b>"))
using my function FormatWithHtml
/// Encodes to MvcHtmlString and includes HTML tags or already encoded strings, placeholder is the '|' character
public static MvcHtmlString FormatWithHtml(this string format, params string[] htmlIncludes)
{
    var result = new StringBuilder();
    int i = -1;
    foreach (string part in format.Split('|'))
    {
        // Text from the format string is always encoded.
        result.Append(HttpUtility.HtmlEncode(part));
        // The includes are appended as-is, so they must already be safe HTML.
        if (++i < htmlIncludes.Length)
            result.Append(htmlIncludes[i]);
    }
    return new MvcHtmlString(result.ToString());
}
One more example, this
@("Resource is safe to html characters <&> and will include |format tags| or any | user input."
    .FormatWithHtml("<b>", "</b>", "<b id='FromUser'>" + HttpUtility.HtmlEncode("<a href='crack.me'>click</a>") + "</b>"))
will write to your razor page
Resource is safe to html characters <&> and will include format tags or any <a href='crack.me'>click</a> user input.

Does using razor in WebMatrix mitigate an XSS threat?

I have purposefully (for testing) assigned the following variable in WebMatrix C#:
string val = "<script type='text/javascript'>alert('XSS Vector')</script>";
Later in the page I have used razor to write that value directly to the page.
<p>
@val
</p>
It writes the text, but in a safe manner (i.e., no alert scripts run)
Also, if 'val' contains an HTML entity (e.g., &lt;), it writes exactly "&lt;" and not "<", as I would have expected the page to render.
Is this because C# runs first, then html is rendered?
More importantly, is using razor in this fashion a suitable replacement for html encoding, when used like this?
The @Variable syntax will HtmlEncode any text you pass to it, which is why you see literally what you set the string value to. You are correct that this is for XSS protection. It is Razor itself that does this, through the @Variable syntax.
So basically, using the @Variable syntax is not so much a 'replacement' for HTML encoding; it is HTML encoding.
Related: If you ever want some string to render as HTML, you would use this syntax in Razor:
@Html.Raw(Variable)
That causes the Html Encoding not to be done. Obviously, this is dangerous to do with user-supplied input.
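As a rough illustration of what that encoding step does to the test string from the question, here is a small sketch. It uses WebUtility.HtmlEncode as a stand-in for Razor's built-in encoder, which is an assumption (the exact entities Razor emits may differ), and RazorEncodingSketch and Render are hypothetical names.

```csharp
using System.Net;

// Hypothetical stand-in for the encoding Razor applies to @value output.
public static class RazorEncodingSketch
{
    // HTML-encodes the value so that markup inside it is displayed as text
    // instead of being interpreted by the browser.
    public static string Render(string value) => WebUtility.HtmlEncode(value);
}
```

Calling RazorEncodingSketch.Render("<script type='text/javascript'>alert('XSS Vector')</script>") gives a string that starts with &lt;script and contains no raw < or > characters, so the browser shows the text rather than running the script.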

Stripping script tags from HTML input

public static string MakeWebSafe(this string x) {
    const string RegexRemove = @"(<\s*script[^>]*>)|(<\s*/\s*script[^>]*>)";
    return Regex.Replace(x, RegexRemove, string.Empty, RegexOptions.IgnoreCase);
}
Is there any reason this implementation isn't good enough? Can you break it? Is there anything I haven't considered? If you use or have used something different, what are its advantages?
I'm aware this leaves the body of the script in the text, but that's okay for this project.
UPDATE
Don't do the above! I went with this in the end: HTML Agility Pack strip tags NOT IN whitelist.
Have you considered this kind of scenario?
<scri<script>pt type="text/javascript">
causehavoc();
</scr</script>ipt>
The best thing to do is remove all tags, encode things, or use BBCode.
Yes, your regex can be circumvented by Unicode-encoding the script tags. I would suggest you look to more robust libraries when it comes to security. Take a look at the Microsoft Web Protection Library.
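The nested-tag scenario above can be checked directly. The sketch below reuses the pattern from the question unchanged; the class name ScriptStripDemo is made up, and the method is written as a plain static method rather than an extension method to keep the example short.

```csharp
using System.Text.RegularExpressions;

// Hypothetical wrapper around the regex from the question.
public static class ScriptStripDemo
{
    // Same pattern as in the question's MakeWebSafe.
    private const string RegexRemove = @"(<\s*script[^>]*>)|(<\s*/\s*script[^>]*>)";

    // Single-pass removal of script open/close tags, as in the question.
    public static string MakeWebSafe(string x) =>
        Regex.Replace(x, RegexRemove, string.Empty, RegexOptions.IgnoreCase);
}
```

ScriptStripDemo.MakeWebSafe("<scri<script>pt>alert(1)</scr</script>ipt>") returns <script>alert(1)</script>: removing the inner tags in one pass stitches the outer fragments back together into a working script element.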

Regex to get src value from an img tag

I am using the following regex to get the src value of the first img tag in an HTML document.
string match = "src=(?:\"|\')?(?<imgSrc>[^>]*[^/].(?:jpg|png))(?:\"|\')?";
Right now it captures the whole src attribute, which I don't need. I just need the URL inside the src attribute. How can I do that?
Parse your HTML with something else. HTML is not regular and thus regular expressions aren't at all suited to parsing it.
Use an HTML parser, or an XML parser if the HTML is strict. It's a lot easier to get the src attribute's value using XPath:
//img/@src
XML parsing is built into the System.Xml namespace. It's incredibly powerful. HTML parsing is a bit more difficult if the HTML isn't strict, but there are lots of libraries around that will do it for you.
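A minimal sketch of the XPath approach with System.Xml follows. It assumes the markup is well-formed XML without a default namespace (real XHTML documents usually declare one, which would require an XmlNamespaceManager); ImgSrcXPath and FirstImgSrc are hypothetical names.

```csharp
using System.Xml;

// Hypothetical helper showing the //img/@src XPath query.
public static class ImgSrcXPath
{
    // Returns the src value of the first img element, or null if none exists.
    // LoadXml throws XmlException when the input is not well-formed XML.
    public static string FirstImgSrc(string xhtml)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xhtml);
        XmlNode attr = doc.SelectSingleNode("//img/@src");
        return attr?.Value;
    }
}
```

For example, ImgSrcXPath.FirstImgSrc("<html><body><img src=\"a.png\" /></body></html>") returns a.png. For HTML that is not strict, an HTML parsing library would take the place of XmlDocument.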
see When not to use Regex in C# (or Java, C++ etc) and Looking for C# HTML parser
PS, how can I put a link to a StackOverflow question in a comment?
Your regex should (in English) match any run of characters after a quote, up to the next quote, in the src attribute of an img tag.
In perl regex, it would be like this:
/src=[\"\']([^\"\']+)/
The URL will be in $1 after running this.
Of course, this assumes that the urls in your src attributes are quoted. You can modify the values in the [] brackets accordingly if they are not.
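Since the question is about C#, here is a sketch of the same pattern using System.Text.RegularExpressions. The names ImgSrcRegex and FirstSrc are made up, and the same caveat applies: it only works when the src value is quoted.

```csharp
using System.Text.RegularExpressions;

// Hypothetical C# port of the Perl pattern above.
public static class ImgSrcRegex
{
    // Captures everything between the opening quote after src= and the next quote.
    // Returns null when no quoted src attribute is found.
    public static string FirstSrc(string html)
    {
        Match m = Regex.Match(html, "src=[\"']([^\"']+)");
        return m.Success ? m.Groups[1].Value : null;
    }
}
```

ImgSrcRegex.FirstSrc("<img src=\"images/a.png\" alt=\"x\">") returns images/a.png; Groups[1] plays the role of $1 in the Perl version.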
