What follows is a piece of text that gets HtmlEncoded in C# before being sent to the browser (during a callback). Once it is received, in JavaScript I do myDiv.innerHTML = theStringBelow;
<span xmlns:asp="http://schemas.microsoft.com/ASPNET/20"
xmlns:SharePoint="Microsoft.Sharepoint.WebControls"
xmlns:ext="my_namespace:my_xslt_extension">Some text to be shown.</span>
However, what results is that I simply see the exact text shown above. It isn't treated as an HTML element that got added to the DOM, but as plain text. When I add the exact same text through JavaScript (e.g., I skip the callback and just say myDiv.innerHTML = "exactString") it DOES get added correctly (it gets treated as a span).
What is going on? Do I have to un-encode it? Should I not have encoded to begin with?
Edit
The question still stands for curiosity's sake, but I have fixed the issue simply by not HtmlEncoding the data. An earlier issue must have added onto this one, making me think the HtmlEncoding was still necessary.
You should not HTMLEncode it if it is to become HTML nodes. What HTML encoding will do is turn your string from above into this:
&lt;span xmlns:asp=&quot;http://schemas.microsoft.com/ASPNET/20&quot;
xmlns:SharePoint=&quot;Microsoft.Sharepoint.WebControls&quot;
xmlns:ext=&quot;my_namespace:my_xslt_extension&quot;&gt;Some text to be shown.&lt;/span&gt;
Try passing the string in as it is. You will of course have to escape it so that it forms a valid JavaScript string literal. The escapes disappear when JavaScript parses the literal, so the string in memory is the original markup again. Then you should be able to do the div.innerHTML call and get your expected result. The escaping of the string can probably be accomplished by doing the following:
// in your .cs code-behind/view/whatever.
// myString holds the markup; put a backslash in front of every double quote.
myString = myString.Replace("\"", "\\\"");
Which should produce:
<span xmlns:asp=\"http://schemas.microsoft.com/ASPNET/20\"
xmlns:SharePoint=\"Microsoft.Sharepoint.WebControls\"
xmlns:ext=\"my_namespace:my_xslt_extension\">Some text to be shown.</span>
Which you can then output like so:
// in your webform/view
<script type="text/javascript">
var mystring;
mystring = "<%= myString %>";
</script>
Let me know how that works out for you.
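One thing the quote replacement above does not cover: the sample span contains line breaks, and a raw line break ends a JavaScript string literal early. Here is a minimal sketch of a fuller escape, written in JavaScript only to illustrate the logic (the server would need an equivalent step in C#). The helper name toJsStringLiteral is made up for this example.

```javascript
// Hypothetical helper: turn arbitrary text into a double-quoted JavaScript
// string literal that is safe to write inside a <script> block.
function toJsStringLiteral(text) {
  return JSON.stringify(text)        // escapes quotes, backslashes and line breaks
    .replace(/</g, "\\u003c");       // so the text cannot close the <script> block
}

const markup =
  '<span xmlns:asp="http://schemas.microsoft.com/ASPNET/20"\n' +
  'xmlns:ext="my_namespace:my_xslt_extension">Some text to be shown.</span>';

const literal = toJsStringLiteral(markup);

// Parsing the literal gives back the original markup, ready for innerHTML.
const restored = JSON.parse(literal);
console.log(restored === markup); // true
```

The point of the sketch is that every character with a special meaning inside the literal gets escaped, not only the double quote.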
HTML Encode will turn < into &lt; and so on. This breaks HTML formatting, and it is used so that blocks of text like this:
Insert <name> here
do not turn out like this:
Insert here
If your intent is to have the <span ... inserted into the HTML directly, you either need to NOT encode it on the way out or, if that would disrupt transmission, decode it in JavaScript before you set .innerHTML.
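If you go the decode-in-JavaScript route, a minimal sketch could look like the following. decodeBasicEntities is a made-up helper name, and the sketch assumes the server only produced the five entities listed; a complete decoder has to know many more.

```javascript
// Assumption: only these five entities occur in the encoded text.
const BASIC_ENTITIES = {
  "&lt;": "<",
  "&gt;": ">",
  "&quot;": '"',
  "&#39;": "'",
  "&amp;": "&",
};

function decodeBasicEntities(encoded) {
  // One pass over the text, so "&amp;lt;" becomes "&lt;" and is not decoded twice.
  return encoded.replace(/&(?:lt|gt|quot|#39|amp);/g, (entity) => BASIC_ENTITIES[entity]);
}

const received =
  "&lt;span xmlns:ext=&quot;my_namespace:my_xslt_extension&quot;&gt;Some text to be shown.&lt;/span&gt;";
const decodedMarkup = decodeBasicEntities(received);
// decodedMarkup is: <span xmlns:ext="my_namespace:my_xslt_extension">Some text to be shown.</span>
// In the browser you would then do: myDiv.innerHTML = decodedMarkup;
```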
Related
I have some data from a lookup like this: =winz\ach'dull.
How can I replace single quotes (') with ("").
This is my code =>
<input type="button" id="btnSelect" onclick="Select('<%#Eval("LoginName").ToString().Replace("'", "\'")%>');" value="Select"/>
I'm trying to create code like this:
Select('<%#Eval("LoginName").ToString().Replace("'", "\'")%>');
but it does not work.
Please correct and help me. Thanks.
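In pure JavaScript we could do:
var a = "winz\\ach'dull.";
alert(a.replace(/'/g, '"'));
And that would replace every single quote with a double quote. (The backslash is doubled in the literal so that it survives, and the /g flag is needed because replace with a plain string only changes the first match.)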
Note: Your code is C# not javascript.
You can escape quotes with the "\" character and it works perfectly with HTML. So the answer to exactly what you wrote would be: (this is just to humour you in the future)
"Select('<%#Eval(\"LoginName\").ToString().Replace(\"'\", \"\'\")%>');"
But you have syntax errors in what you are writing and that Eval stuff is not javascript so I don't know why ToString and Replace are attached to it. I've changed it a little based on guessing what you're trying to do:
<input onclick="Select('<%#Eval("LoginName")%>').ToString().Replace(\"'\", \"'\");">
Note that if you're using C# or something similar on the server side, the server-side code itself doesn't need to be escaped: by the time the HTML is parsed into the DOM (typically by a browser), the source no longer contains your server-side code, only its output!
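One likely reason the Replace call in the question has no visible effect: in a C# string literal, \' is itself the escape sequence for a plain single quote, so the call replaces ' with '. The backslash already present in the login name needs escaping as well. The sketch below shows the intended escaping logic in JavaScript, purely as an illustration; escapeForSingleQuotedJs is a hypothetical helper, and on the server the same two replacements would be written in C#.

```javascript
// Hypothetical helper: escape a value for use inside a single-quoted
// JavaScript string. Backslashes go first so that the ones added for the
// quotes are not doubled afterwards.
function escapeForSingleQuotedJs(value) {
  return value.replace(/\\/g, "\\\\").replace(/'/g, "\\'");
}

const loginName = "winz\\ach'dull."; // the text winz\ach'dull.
const call = "Select('" + escapeForSingleQuotedJs(loginName) + "');";
// call is: Select('winz\\ach\'dull.');
```

If the value can also contain double quotes, it would additionally need HTML attribute encoding before being written into onclick="...".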
Why doesn't this work?
<input type="button" id="btnAccept" value="Accept" onclick='<%# String.Format("accept('{0}','{1}','{2}','{3}-{4}');", Container.DataItem("PositionID"), Container.DataItem("ApplicantID"), Container.DataItem("FullName"), Container.DataItem("DepartmentName"), Container.DataItem("PositionTitle"))%>' />
The onclick doesn't do anything.
Your best bet is to look at the generated HTML. I think it's a really good habit to check the generated HTML in text format and how it renders on-screen, all the time. Besides errors such as this (which can easily be spotted in the generated HTML), it will help you catch other possible invalid uses of HTML which may render as intended in one browser while rendering terribly in another. HTML rendering engines employ many tricks to try and make invalid HTML look okay.
Anyway, all things aside (such as, assuming accept(...) exists, and all other calls in the tag are correct) I think the issue you are having is as follows:
onclick='<%# String.Format("accept('{0}','{1}','{2}','{3}-{4}');", ... )%>'
This line is probably going to evaluate to look something like this:
onclick='accept('{0}','{1}','{2}','{3}-{4}');'
With all single quotes, the browser ends the attribute at the second quote, so all the onclick attribute will see is onclick='accept(' which is not a valid JavaScript call. You're going to want to use double-quoted strings for the arguments, which you can embed in the format string by escaping them.
String.Format("accept(\"{0}\",\"{1}\",\"{2}\",\"{3}-{4}\");", ... )
Then, you should be able to get the correct combination of ' and " within the attribute:
onclick='accept("{0}","{1}","{2}","{3}-{4}");'
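Once the placeholders are filled with real data, a name such as O'Neil would still end the single-quoted attribute early. A rough sketch of the full quoting, in JavaScript for illustration only: buildAcceptAttribute is a hypothetical helper that mirrors the String.Format call, and the sample values are made up.

```javascript
// Hypothetical helper mirroring the String.Format call above.
function buildAcceptAttribute(positionId, applicantId, fullName, departmentName, positionTitle) {
  const args = [positionId, applicantId, fullName, departmentName + "-" + positionTitle]
    .map((arg) => JSON.stringify(String(arg))) // double-quoted, with " and \ escaped
    .join(",");
  const call = "accept(" + args + ");";
  // The attribute is delimited by single quotes, so any ' in the data is
  // written as an entity; & is encoded first so existing text stays intact.
  const attributeValue = call.replace(/&/g, "&amp;").replace(/'/g, "&#39;");
  return "onclick='" + attributeValue + "'";
}

const attribute = buildAcceptAttribute(12, 34, "Pat O'Neil", "IT", "Analyst");
// attribute is: onclick='accept("12","34","Pat O&#39;Neil","IT-Analyst");'
```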
I am having trouble removing all JavaScript from an HTML page with C#. I have three regular expressions that remove a lot but miss a lot too. Parsing the page with the MSHTML DOM parser causes the JavaScript to actually run, which is what I am trying to avoid by using the regex.
"<script.*/>"
"<script[^>]*>.*</script>"
"<script.*?>[\\s\\S]*?</.*?script>"
Does anyone know what I am missing that is causing these three regex expressions to miss blocks of JavaScript?
An example of what I am trying to remove:
<script src="do_files/page.js" type="text/javascript"></script>
<script src="do_files/page.js" type="text/javascript" />
<script type="text/javascript">
<!--
var Time=new Application('Time')
//-->
</script>
<script type="text/javascript">
if(window['com.actions']) {
window['com.actions'].approvalStatement = "",
window['com.actions'].hasApprovalStatement = false
}
</script>
I assume you are trying to simply sanitize the input of JavaScript. Frankly I'm worried that this is too simple of a solution, 'cuz it seems so incredibly simple. See below for reasoning, after the expression (in a C# string):
@"(?s)<script.*?(/>|</script>)"
That's it - I hope! (It certainly works for your examples!)
My reasoning for the simplicity is that the primary issue with trying to parse HTML with regex is the potential for nested tags - it's not so much the nesting of DIFFERENT tags, but the nesting of SYNONYMOUS tags
For example,
<b> bold <i> AND italic </i></b>
...is not so bad, but
<span class='BoldText'> bold <span class='ItalicText'> AND italic </span></span>
would be much harder to parse, because the ending tags are IDENTICAL.
However, since it is invalid to nest script tags, the next instance of />(<-is this valid?) or </script> is the end of this script block.
There's always the possibility of HTML comments or CDATA sections inside the script tag, but those should be fine if they don't contain </script>. HOWEVER: if they do, it would definitely be possible to get some 'code' through. I don't think the page would render, but some HTML parsers are amazingly flexible, so ya never know. To handle a little extra possible whitespace, you could use:
@"(?s)<\s?script.*?(/\s?>|<\s?/\s?script\s?>)"
Please let me know if you can figure out a way to break it that will let through VALID HTML code with run-able JavaScript (I know there are a few ways to get some stuff through, but it should be broken in one of many different ways if it does get through, and should not be run-able JavaScript code.)
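For anyone who wants to poke at it, here is the first pattern tried against the samples from the question, as a JavaScript sketch. The s flag stands in for (?s). The i flag and the <p>kept</p> line are additions of mine: without a case-insensitive option, an upper-case <SCRIPT> tag would pass straight through.

```javascript
// JavaScript equivalent of the first C# pattern; "s" plays the role of (?s).
// The "i" flag is an addition so that <SCRIPT> is caught too.
const scriptPattern = /<script.*?(\/>|<\/script>)/gis;

const page = [
  '<script src="do_files/page.js" type="text/javascript"></script>',
  '<script src="do_files/page.js" type="text/javascript" />',
  "<p>kept</p>",
  '<script type="text/javascript">',
  "<!--",
  "var Time=new Application('Time')",
  "//-->",
  "</script>",
].join("\n");

const cleaned = page.replace(scriptPattern, "");
console.log(cleaned.trim()); // <p>kept</p>

const upperCase = "<SCRIPT>x()</SCRIPT>".replace(scriptPattern, "");
console.log(upperCase === ""); // true
```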
It is generally agreed upon that trying to parse HTML with regex is a bad idea and will yield bad results. Instead, you should use a DOM parser. jQuery wraps nicely around the browser's DOM and would allow you to very easily remove all <script> tags.
OK, I have faced a similar case, where I needed to clean "rich text" (text with HTML formatting) of any possible JavaScript.
There are several ways to add JavaScript to HTML:
1. By using the <script> tag, with JavaScript inside it or by loading a JavaScript file using the "src" attribute.
ex: <script>maliciousCode();</script>
2. By using an event on an HTML element, such as "onload" or "onmouseover".
ex: <img src="a.jpg" onload="maliciousCode()">
3. By creating a hyperlink that calls JavaScript code.
ex: <a href="javascript:maliciousCode()">...
This is all I can think of for now.
So the submitted HTML Code needs to be cleaned from these 3 cases. A simple solution would be to look for these patterns using Regex, and replace them by "" or do whatever else you want.
This is a simple code to do this:
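// requires: using System.Text.RegularExpressions;
public static string CleanHTMLFromScript(string str)
{
    // 1. Remove opening <script ...> tags (the script body is left behind as text).
    Regex re = new Regex("<script[^>]*>", RegexOptions.IgnoreCase);
    str = re.Replace(str, "");
    // 2. Remove any tag that contains an on...= event attribute (onload, onmouseover, ...).
    re = new Regex("<[a-z][^>]*on[a-z]+=\"?[^\"]*\"?[^>]*>", RegexOptions.IgnoreCase);
    str = re.Replace(str, "");
    // 3. Remove <a href="javascript:..."> opening tags.
    re = new Regex("<a\\s+href\\s*=\\s*\"?\\s*javascript:[^\"]*\"[^>]*>", RegexOptions.IgnoreCase);
    str = re.Replace(str, "");
    return str;
}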
This code takes care of any spaces and quotes that may or may not be added. It seems to be working fine, not perfect but it does the trick. Any improvements are welcome.
Creating your own HTML parser or script detector is a particularly bad idea if this is being done to prevent cross-site scripting. Doing this by hand is a Very Bad Idea, because there are any number of corner cases and tricks that can be used to defeat such an attempt. This is termed "black listing", as it attempts to remove the unsafe items from HTML, and it's pretty much doomed to failure.
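To make the corner cases concrete, here is a JavaScript port of the three patterns from the CleanHTMLFromScript function posted in another answer (the port and the two inputs are mine, for illustration), together with two inputs that get past it.

```javascript
// JavaScript port of the three patterns from CleanHTMLFromScript.
function cleanHtmlFromScript(html) {
  return html
    .replace(/<script[^>]*>/gi, "")
    .replace(/<[a-z][^>]*on[a-z]+="?[^"]*"?[^>]*>/gi, "")
    .replace(/<a\s+href\s*=\s*"?\s*javascript:[^"]*"[^>]*>/gi, "");
}

// Removing the inner tag splices the outer pieces into a new <script> tag.
const nested = cleanHtmlFromScript("<scr<script>ipt>alert(1)</script>");
console.log(nested); // <script>alert(1)</script>

// A single-quoted href is not covered by the third pattern.
const singleQuoted = cleanHtmlFromScript("<a href='javascript:alert(1)'>x</a>");
console.log(singleQuoted); // <a href='javascript:alert(1)'>x</a>
```

Each hole can be patched, but every patch tends to open the next question, which is the practical problem with this approach.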
Much safer to use a white list processor (such as AntiSamy), which only allows approved items through by automatically escaping everything else.
Of course, if this isn't what you're doing then you should probably edit your question to give some more context...
Edit:
Now that we know you're using C#, try the HTMLAgilityPack as suggested here.
Which language are you using? As a general statement, Regular Expressions are not suitable for parsing HTML.
If you are on the .net Platform, the HTML Agility Pack offers a much better parser.
You should use a real HTML parser for the job. That being said, for simple stripping of script blocks you could use a rudimentary regex like the one below.
The idea is that you need a callback to determine whether capture group 1 matched. If it did, the match is something that hides HTML (like a comment) and the callback passes it back through unchanged; otherwise the match is a script block and the callback returns an empty string.
This won't substitute for an HTML processor, though. Good luck!
Search Regex: (modifiers - expanded, global, include newlines in dot, callback func)
(?:
<script (?:\s+(?:".*?"|\'.*?\'|[^>]*?)+)? \s*> .*? </script\s*>
| </?script (?:\s+(?:".*?"|\'.*?\'|[^>]*?)+)? \s*/?>
)
|
( # Capture group 1
<!(?:DOCTYPE.*?|--.*?--)> # things that hide html, add more constructs here ...
)
Replacement func pseudo code:
string callback () {
if capture buffer 1 matched
return capt buffer 1
else return ''
}
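A runnable take on the callback idea, as a JavaScript sketch. Note that this is a simplified pattern, not a port of the expanded regex above: attribute handling is reduced to [^>]*, a separate alternative for self-closing tags is added, and stripScriptBlocks is a made-up name. It keeps the same structure, though: script blocks match without a capture, and things that hide HTML land in capture group 1.

```javascript
// Simplified sketch of the callback approach; not a port of the regex above.
const SCRIPT_OR_HIDDEN = new RegExp(
  [
    "<script\\b[^>]*/>",                        // self-closing <script ... />
    "<script\\b[^>]*>[\\s\\S]*?</script\\s*>",  // <script ...> ... </script>
    "</?script\\b[^>]*>",                       // stray opening or closing tag
    "(<!--[\\s\\S]*?-->|<!DOCTYPE[^>]*>)",      // capture group 1: passed through
  ].join("|"),
  "gi"
);

function stripScriptBlocks(html) {
  // If group 1 matched, return it unchanged; otherwise drop the script match.
  return html.replace(SCRIPT_OR_HIDDEN, (match, hidden) =>
    hidden !== undefined ? hidden : ""
  );
}

const input =
  "<!DOCTYPE html><!-- <script>x()</script> --><p>hi</p>" +
  '<script type="text/javascript">var a = 1;</script><script src="a.js" />';

console.log(stripScriptBlocks(input));
// <!DOCTYPE html><!-- <script>x()</script> --><p>hi</p>
```

The comment survives with its contents intact because it starts earlier in the text than the script tag inside it, so the regex engine matches the comment first.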
I have a downloader program that downloads pages from the internet.
The encoding of each page is different; some are in UTF-8 and some are Unicode.
For example: &#97; which displays as the 'a' character; the pages are full of such sequences. We need to convert these encodings to normal text.
I used the UnicodeEncoding class in C#, but it did not help.
How can I decode these encodings to real characters? Is there a class or method that does this conversion?
Thanks.
That is html-encoded; try HtmlDecode? (you'll need a reference to System.Web.dll)
Text in HTML pages that starts with & and ends with ; is HTML encoded.
You can decode these by using:
string html = ...; //your html
string decoded = System.Web.HttpUtility.HtmlDecode( html );
Also see Characters in string changed after downloading HTML from the internet for code on how to make sure you download the page in the correct character set.
You're getting confused between HTML/XML escaping and UTF-8/Unicode.
If the page is valid XML, life will be easier - you can just parse it as any other XML document, and then just get the relevant text nodes... all the XML escaping will be "unescaped" when you get the text.
If it's arbitrary - and possibly invalid - HTML then life is a bit harder. You may well want to normalize it into valid HTML first, then parse it and again ask for the text nodes.
If you can give us a more concrete example, it will be easier to advise you.
The HtmlDecode method suggested in other answers may very well be all you need - but you should definitely try to understand what's going on first. For example, you may well want to only decode certain fragments of the HTML - if you decode the whole document, then you could end up with text which looks like it contains HTML tags, but actually just contained text in the original document.
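To illustrate that last point, here is a small JavaScript sketch that decodes only numeric character references such as &#97; or &#x61; (decodeNumericEntities is a hypothetical helper; in C# the HtmlDecode call would play this role). Run over a whole document, it turns escaped text into something that looks like markup.

```javascript
// Decodes only numeric character references such as &#97; or &#x61;.
function decodeNumericEntities(text) {
  return text.replace(/&#(?:x([0-9a-f]+)|(\d+));/gi, (whole, hex, dec) => {
    const codePoint = hex !== undefined ? parseInt(hex, 16) : parseInt(dec, 10);
    // Leave out-of-range references untouched instead of throwing.
    return codePoint <= 0x10ffff ? String.fromCodePoint(codePoint) : whole;
  });
}

console.log(decodeNumericEntities("&#97;&#x62;c")); // abc

// Decoding the whole document: the escaped "<b>" now looks like a real tag.
const source = "<p>Type &#60;b&#62; for bold</p>";
console.log(decodeNumericEntities(source)); // <p>Type <b> for bold</p>
```

Decoding only the text you have already pulled out of the document avoids that ambiguity.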
I have the following setup: a textarea with some data in it that may contain carriage returns, and another textarea that has style='display:none' to keep it hidden, as follows:
<textarea id="myTextarea" onBlur="encryptMyData()"></textarea>
<textarea name="encryptedText" style='display:none'></textarea>
The user enters data in the first textarea, and when that textarea loses focus the encryptMyData() JavaScript function makes an AJAX call that takes whatever the user entered, encrypts it using Rijndael, and puts the result in the encryptedText textarea so that it can be stored in the database later.
What I need now is a way to convert the carriage returns to a tag such as [cr] before encryption, so that all formatting is retained when I retrieve the data. Any idea how to do this? I'm using ASP.NET and C# to perform the encryption.
Your newline characters are likely still present in the encrypted data.
If you absolutely do want to "display" the encrypted data with newlines retained, you likely need to do a stringData.Split(new[] { Environment.NewLine }, StringSplitOptions.None), encrypt each resulting string separately, then String.Join(Environment.NewLine, arrayOfEncryptedDataLines) the strings back together before returning to the webpage.
-edit-
You might be better off not going through the server, though. Have a look at http://www.hanewin.net/encrypt/aes/aes.htm
You can use the JavaScript escape() method to take care of the carriage returns and spaces. Server-side you need to unescape the sequence again, but there is no default unescape method present in C#. You could try using the unescape method of the Microsoft.JScript.GlobalObject class.
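A small sketch of the client side of this idea. It swaps in encodeURIComponent for escape(), since escape() is deprecated; that substitution is mine, not part of the suggestion above. On the server, Uri.UnescapeDataString should be able to reverse this encoding.

```javascript
// Percent-encode the textarea content so line breaks and spaces survive the trip.
const typed = "line one\r\nline two";
const encoded = encodeURIComponent(typed);
console.log(encoded); // line%20one%0D%0Aline%20two

// Decoding restores the original text, line breaks included.
const decoded = decodeURIComponent(encoded);
console.log(decoded === typed); // true
```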
Basically the AJAX call to your service is probably stripping out the new lines, so you'll need to do the conversion before you call the web service.
In your encryptMyData function perform a replace before you send it to the server:
// assuming sometext contains the contents of myTextarea
// Perform a global replace for all occurrences of a new line with [CR]:
sometext = sometext.replace(/\n/g, "[CR]");
Then pass sometext to the ajax call, rather than the straight value of the textarea.
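A slightly fuller sketch of the same replacement, covering Windows (\r\n) and old Mac (\r) line endings and the reverse step used when the text is displayed again. Both function names are made up for this example.

```javascript
// Replace every kind of line break with the [CR] marker before sending.
function encodeLineBreaks(text) {
  return text.replace(/\r\n|\r|\n/g, "[CR]");
}

// Turn the markers back into line breaks when the text is retrieved.
function decodeLineBreaks(text) {
  return text.split("[CR]").join("\n");
}

const marked = encodeLineBreaks("first\r\nsecond\nthird");
console.log(marked);                   // first[CR]second[CR]third
console.log(decodeLineBreaks(marked)); // first, second and third on separate lines
```

One caveat: if a user types the literal text [CR], it will come back as a line break, so a marker that cannot occur in normal input may be a safer choice.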