Why is this not HTMLEncoding - "<" or "&" - c#

Can anyone tell me why HtmlEncode is not encoding this?
Any string that has < or & directly in front of the text, e.g.
<something or &something
is not displayed on the HTML page. When I look at the encoded output, the < and & have not been encoded. I would have expected these characters to be encoded to &lt; and &amp;.
edit: this is the code I use to encode the string:
var replacedHtml = Regex.Replace(html,
    @"</?(\w*)[^>]*>",
    me => AllowedTags.Any(s => s.Equals(me.Groups[1].Value, StringComparison.OrdinalIgnoreCase))
        ? me.Value
        : HttpUtility.HtmlEncode(me.Value), RegexOptions.Singleline);
return replacedHtml;
Edit: I think the issue is not on the server side but rather on the Angular side, in the ng-bind-html binding:
<span ng-bind-html="ctl.linkGroup.Notes | TextToHtmlSafe">
angular.module('CPSCore.Filters')
    .filter('TextToHtmlSafe', ['$sce', function ($sce) {
        return function (text) {
            if (!text)
                return text;
            var htmlText = text.replace(/\n/g, '<br />');
            return $sce.trustAsHtml(htmlText);
        };
    }]);
It seems to decide that
<something
without a closing tag is not safe, and therefore removes it from the view.
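If the goal is to show the text literally, one option is to escape the HTML control characters inside the filter before the <br /> tags are added. The following is a minimal sketch, assuming that Notes is plain text at this point and that no markup in it should survive. escapeHtml and textToHtmlSafe are hypothetical names, and the $sce.trustAsHtml call is left out so that the snippet runs on its own:

```javascript
// Hypothetical helper: escape the characters that HTML treats specially.
function escapeHtml(text) {
  return text
    .replace(/&/g, '&amp;')   // must run first, or the entities below get double-escaped
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
    .replace(/'/g, '&#39;');
}

// Body of the filter function without the Angular wiring (an assumption for illustration).
function textToHtmlSafe(text) {
  if (!text) return text;
  return escapeHtml(text).replace(/\n/g, '<br />');
}

console.log(textToHtmlSafe('<something\n&something'));
// &lt;something<br />&amp;something
```

Note that this would double-encode text the server has already encoded and would also escape the allowed tags, so it only fits if the encoding is done in one place.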

Try System.Net.WebUtility.HtmlDecode to properly decode special characters. Using this, &lt; changes to < and &amp; changes to &, which are then displayed properly on HTML pages.

In HTML, the ampersand character (“&”) declares the beginning of an entity reference (a special character). If you want one to appear in text on a web page you should use the encoded named entity “&amp;” (more technical mumbo-jumbo at w3c.org). While most web browsers will let you get away without encoding them, stuff can get dicey in weird edge cases and fail completely in XML.
The other main characters to remember to encode are < (&lt;) and > (&gt;); you don’t want to confuse your browser about where HTML tags start and end.

Related

HtmlAgilityPack treats everything after < (less than sign) as attributes

I have some input I get via a textarea and I convert that input into a html document, that is later parsed into a PDF document.
When my users input the less than sign (<), everything breaks in my HtmlDocument. HtmlAgilityPack suddenly handles everything after the less than sign as an attribute. See the output:
Within this Character Data block I can use double dashes as much as I want (along with <, &,="" ',="" and="" ')="" *and="" *="" %="" myparamentity;="" will="" be="" expanded="" to="" the="" text="" 'has="" been="" expanded'...however,="" i="" can't="" use="" the="" cend="" sequence(if="" i="" need="" to="" use="" it="" i="" must="" escape="" one="" of="" the="" brackets="" or="" the="" greater-than="" sign).="">
It gets a little better if I just add the
htmlDocument.OptionOutputOptimizeAttributeValues = true;
which gives me:
Within this Character Data block I can use double dashes as much as I want (along with <, &,= ',= and= ')= *and= *= %= myparamentity;= will= be= expanded= to= the= text= 'has= been= expanded'...however,= i= can't= use= the= cend= sequence(if= i= need= to= use= it= i= must= escape= one= of= the= brackets= or= the= greater-than= sign).=>
I have tried all of the options on the HtmlDocument and none of them lets me specify that the parser should not be strict. On the other hand, I might be able to live with it stripping away the <, but adding all the equal signs doesn't really work for me.
void Main()
{
var input = @"Within this Character Data block I can use double dashes as much as I want (along with <, &, ', and ') *and * % MyParamEntity; will be expanded to the text 'Has been expanded'...however, I can't use the CEND sequence(if I need to use it I must escape one of the brackets or the greater-than sign).";
var htmlDoc = WrapContentInHtml(input);
htmlDoc.DocumentNode.OuterHtml.ToString().Dump();
}
private HtmlDocument WrapContentInHtml(string content)
{
var htmlBuilder = new StringBuilder();
htmlBuilder.AppendLine("<!DOCTYPE html>");
htmlBuilder.AppendLine("<html>");
htmlBuilder.AppendLine("<head>");
htmlBuilder.AppendLine("<title></title>");
htmlBuilder.AppendLine("</head>");
htmlBuilder.AppendLine("<body><div id='sagsfremstillingContainer'>");
htmlBuilder.AppendLine(content);
htmlBuilder.AppendLine("</div></body></html>");
var htmlDocument = new HtmlDocument();
htmlDocument.OptionOutputOptimizeAttributeValues = true;
var htmlDoc = htmlBuilder.ToString();
htmlDocument.LoadHtml(htmlDoc);
return htmlDocument;
}
Does anybody have an idea how I can solve this problem?
The closest question I can find is this:
Losing the 'less than' sign in HtmlAgilityPack loadhtml
Where he actually complains about the < disappearing which would be fine for me. Of course fixing the parsing error is the best solution.
EDIT:
I am using HtmlAgilityPack 1.4.9
Your content is blatantly wrong. This is not about "strictness", it's really about the fact that you're pretending a piece of text is valid HTML. In fact, the results you are getting are exactly because the parser is not strict.
When you need to insert plain text into HTML, you need to encode it first, so that all the various HTML control characters are converted to HTML properly - for example, < must be changed to &lt; and & to &amp;.
One way to handle this is to use the DOM - use InnerText on the target div, instead of slapping strings together and pretending they're HTML. Another is to use some explicit encoding method - for example HttpUtility.HtmlEncode.
You can use System.Net.WebUtility.HtmlEncode, which works even without a reference to System.Web.dll (which also has HttpServerUtility.HtmlEncode):
var input = @"Within this Character Data block I can use double dashes as much as I want (along with <, &, ', and ') *and * % MyParamEntity; will be expanded to the text 'Has been expanded'...however, I can't use the CEND sequence(if I need to use it I must escape one of the brackets or the greater-than sign).";
var htmlDocument = new HtmlDocument();
htmlDocument.LoadHtml(System.Net.WebUtility.HtmlEncode(input));
Debug.Assert(!htmlDocument.ParseErrors.Any());
Result:
Within this Character Data block I can use double dashes as much as I want (along with <, &, ', and ') *and * % MyParamEntity; will be expanded to the text 'Has been expanded'...however, I can't use the CEND sequence(if I need to use it I must escape one of the brackets or the greater-than sign).

Bold Text in Html.FormatValue using Razor

I want to have the following result. Username has to be bold:
Blabla Username Bla.
I have the format in a resource file:
Blabla {0} Bla.
And in the view I do the following:
@Html.FormatValue(User.Identity.Name, Resources.MyFormatString)
How can I make the Username bold and use Html.FormatValue? Or is there another method to achieve this?
You could simply change your resource to contain the bold-tag, strong-tag or a style.
Like "Blabla <b>{0}</b> Bla.".
[edit]
Indeed, I checked Html.FormatValue for escaping functionality and did not see any, but apparently it does escape :)
In that case, using @Html.Raw and string.Format will work.
@Html.Raw(string.Format(Resources.MyFormatString, "SomeName"))
(tested in MVC 5, but @Html.Raw is also available in 4)
Also a small note: storing HTML in resources is probably not the best idea, mixing UI & content.
[/edit]
I wanted to solve your example with including html tags, be safe with html characters in the resources, and safely include user input or html tags. My solution of your example is
@(Resources.MyFormatString.FormatWithHtml(
    "<b>" + HttpUtility.HtmlEncode(User.Identity.Name) + "</b>"))
using my function FormatWithHtml
/// Encodes to MvcHtmlString and includes HTML tags or already encoded strings, placeholder is the '|' character
public static MvcHtmlString FormatWithHtml (this string format, params string[] htmlIncludes)
{
var result = new StringBuilder();
int i = -1;
foreach(string part in format.Split('|')) {
result.Append(HttpUtility.HtmlEncode(part));
if (++i < htmlIncludes.Length)
result.Append(htmlIncludes[i]);
}
return new MvcHtmlString(result.ToString());
}
One more example, this
@("Resource is safe to html characters <&> and will include |format tags| or any | user input."
    .FormatWithHtml("<b>", "</b>", "<b id='FromUser'>" + HttpUtility.HtmlEncode("<a href='crack.me'>click</a>") + "</b>"))
will write to your razor page
Resource is safe to html characters <&> and will include format tags or any <a href='crack.me'>click</a> user input.

HTML encode string in MVC Razor except for one tag

I have a string which contains HTML in a Razor view.
var s = "A string.<br> It has <abbr>HTML</abbr>."
I want to HTML encode everything except for the <br>
s == "A string.<br> It has &lt;abbr&gt;HTML&lt;/abbr&gt;."
This does not work:
@s.Replace("&lt;br&gt;", "<br>")
The strings come from a database of user created content and the only tags they should contain are <br>s, but in practice they might contain just about anything, and if they do I must keep it that way but display it in a safe way. Of course, the data should never have been saved this way to begin with, but it is, so I have to deal with it.
I can't just use Html.Raw() because I need to encode everything else. However, Html.Encode() encodes too much, converting "\r" to a character reference such as "&#13;" etc. (Something must have been funny with the data I tested this on the first time; Html.Encode() does not seem to be the issue any more.)
You can do it as follows:
@Html.Raw(Html.Encode(str).Replace("&lt;br&gt;", "<br>"))
We encode the string using Html.Encode, after which we convert the encoded tag &lt;br&gt; back to the <br> tag and use @Html.Raw to output it without being encoded again.
This looks a little ugly to me, but it works.
var splitString = s.Split(new string[] { "<br>" }, StringSplitOptions.None);
foreach (string line in splitString)
{
    @line<br>
}

Detect the Presence of Incorrect HTML tag and Correct It

I have a program in C# which fetches some data from a database. The data can contain HTML tags. Unfortunately, in some circumstances, the LAST closing HTML tag is missing the ">" character.
Can anyone help me find a solution to check for this instance of incorrect HTML and then add the trailing ">" character?
Thank you.
---EDIT---
I was thinking of solving the problem this way:
1) Check for the last occurrence of </tag
2) Check if the character after it is >
3) If not, add >
However, I don't know what regular expression I should use for step 1). Does anyone have an idea? I'm not very good at regex.
---EDIT---
These are some examples of data I could have:
hello <span class=green>Sean</span> Moore
hello <span><span class="green">Roger</span></span
Presumably you get the HTML from the database as a string, in which case, the EndsWith method on string will do the job
if(!html.EndsWith(">"))
{
html += ">";
}
It's a quick and dirty method, so as your code grows, you're likely going to want to move away from quick hacks. In this respect, you might want to start taking a look at things like HtmlAgilityPack
1) If the data has an embracing html tag:
if(Data.StartsWith("<") && !Data.EndsWith(">"))
Data += ">";
This checks whether your data is html (starts with a <) and is incorrect (doesn't end with a >) and if that is true, it adds a >.
2) If there can be text outside html tags:
if (Data.Contains("</") && Data.LastIndexOf(">") < Data.LastIndexOf("</"))
{
    int LastTagPosition = Data.LastIndexOf("</");
    int LastTagEndPosition = Data.IndexOf(" ", LastTagPosition);
    if (LastTagEndPosition < 0)
        Data += ">";
    else
        Data = Data.Insert(LastTagEndPosition, ">"); // Insert returns a new string, so assign it back
}
This checks whether there are closing html tags and whether there is a > after the last </. If not, it adds a > at the next space, or at the end of the data if there is no space.

How to deal with different kinds of encoding in the javascript

I recognized that based on a context in which I want to use some parameters, there are at least 4 kinds of encoding that are necessary to avoid corrupted code being executed :
Javascript encoding when constructing a javascript code, e.g.
var a = "what's up ?"
var b = "alert('" + a + "');"
eval(b); // or anything else that executes b as code
URL encoding when using a string as a parameter into the url, e.g.
var a = "Bonnie & Clyde";
var b = "mypage.html?par=" + a;
window.location.href = b; // or anything else that tries to use b as URL
HTML encoding when using a string as an HTML source of some element, e.g.
var a = "<script>alert('hi');</script>";
b.innerHTML = a; // or anything else that interprets a directly
HTML attribute encoding when using a string as a value of an attribute, e.g.
var a = 'alert("hello")';
var b = '<img onclick="' + a + '" />'; // or anything else that uses a as a (part of) a tag's attribute
While in the ASP.NET codebehind I'm aware of ways to encode the string in all 4 cases (using e.g. DataContractJsonSerializer, HttpUtility.UrlEncode, HttpUtility.HtmlEncode and HttpUtility.HtmlAttributeEncode), it would be quite interesting to know whether there are some utilities that I could use directly from javascript to encode / decode strings in these 4 cases.
Case 2 can be dealt with using encodeURIComponent(), as danp suggested.
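As a small sketch of case 2, the question's own example can be passed through encodeURIComponent() before it is appended to the URL. buildUrl is a hypothetical wrapper added here for illustration only:

```javascript
// Hypothetical wrapper: encode a single query-string value before appending it.
function buildUrl(page, value) {
  return page + '?par=' + encodeURIComponent(value);
}

var a = 'Bonnie & Clyde';
var b = buildUrl('mypage.html', a);
console.log(b); // mypage.html?par=Bonnie%20%26%20Clyde
```

encodeURI() would leave the & untouched, because it is meant for whole URLs rather than for single parameter values.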
Case 3 won't execute the script in most browsers. If you want the output to the document to be <script>...</script>, you should edit the text content of the element instead:
var a = "<script>alert('hi');</script>";
if ("textContent" in b)
b.textContent = a; // W3C DOM
else
b.innerText = a; // Internet Explorer <=8
Cases 1 and 4 aren't really encoding issues, they're sanitization issues. Encoding the strings passed to these functions would probably cause a syntax error or just result in a string value that isn't assigned to anything. Sanitizing usually involves looking for certain patterns and either allowing the action or disallowing it - it's safer to have a whitelist than a blacklist (that sounds terrible!).
Internet Explorer 8 has an interesting function called window.toStaticHTML() that will remove any script content from a HTML string. Very useful for sanitizing HTML before inserting into the DOM. Unfortunately, it's proprietary so you won't find this function in other browsers.
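If a string from case 1 really does have to end up inside generated code, one possible approach (not something proposed above, so treat it as an assumption) is to let JSON.stringify() produce the quoted string literal. buildAlertCode is a hypothetical helper name:

```javascript
// Hypothetical helper: embed a value in generated code as a quoted string literal.
// JSON.stringify adds the surrounding double quotes and escapes quotes and backslashes.
function buildAlertCode(message) {
  return 'alert(' + JSON.stringify(message) + ');';
}

var a = "what's up ?";
var b = buildAlertCode(a);
console.log(b); // alert("what's up ?");
```

This only covers the string literal itself. If the generated code is written into an inline script block, a "</script>" inside the value would still end that block, so it is not a complete answer on its own.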
You can use the javascript function escape(..) for some of these purposes.
Edit: actually, forget that, sorry. It's a deprecated function; encodeURI(), decodeURI() etc. are the way forward! Details:
escape and unescape functions do not work properly for non-ASCII characters and have been deprecated. In JavaScript 1.5 and later, use encodeURI, decodeURI, encodeURIComponent, and decodeURIComponent.
The escape and unescape functions let you encode and decode strings. The escape function returns the hexadecimal encoding of an argument in the ISO Latin character set. The unescape function returns the ASCII string for the specified hexadecimal encoding value.
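The difference for non-ASCII characters can be seen in a short sketch. The sample string is an assumption chosen for illustration, and escape() is used only to show the legacy behaviour:

```javascript
// "café & crème", written with \u escapes so the file encoding does not matter.
var text = 'caf\u00e9 & cr\u00e8me';

// Legacy: one %XX per Latin-1 code unit.
var legacy = escape(text);
// Current: percent-encoded UTF-8 bytes.
var modern = encodeURIComponent(text);
// decodeURIComponent reverses encodeURIComponent exactly.
var roundTrip = decodeURIComponent(modern);

console.log(legacy);             // caf%E9%20%26%20cr%E8me
console.log(modern);             // caf%C3%A9%20%26%20cr%C3%A8me
console.log(roundTrip === text); // true
```

The two outputs are not interchangeable: decodeURIComponent() throws a URIError on the escape() output, because %E9 on its own is not valid UTF-8.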
