I am having trouble removing all JavaScript from an HTML page with C#. I have three regular expressions that remove a lot but also miss a lot. Parsing the page with the MSHTML DOM parser causes the JavaScript to actually run, which is what I am trying to avoid by using regex.
"<script.*/>"
"<script[^>]*>.*</script>"
"<script.*?>[\\s\\S]*?</.*?script>"
Does anyone know what I am missing that is causing these three regex expressions to miss blocks of JavaScript?
An example of what I am trying to remove:
<script src="do_files/page.js" type="text/javascript"></script>
<script src="do_files/page.js" type="text/javascript" />
<script type="text/javascript">
<!--
var Time=new Application('Time')
//-->
</script>
<script type="text/javascript">
if(window['com.actions']) {
window['com.actions'].approvalStatement = "",
window['com.actions'].hasApprovalStatement = false
}
</script>
I assume you are trying to simply sanitize the input of JavaScript. Frankly I'm worried that this is too simple of a solution, 'cuz it seems so incredibly simple. See below for reasoning, after the expression (in a C# string):
@"(?s)<script.*?(/>|</script>)"
That's it - I hope! (It certainly works for your examples!)
My reasoning for the simplicity is that the primary issue with trying to parse HTML with regex is the potential for nested tags - it's not so much the nesting of DIFFERENT tags, but the nesting of SYNONYMOUS tags
For example,
<b> bold <i> AND italic </i></b>
...is not so bad, but
<span class='BoldText'> bold <span class='ItalicText'> AND italic </span></span>
would be much harder to parse, because the ending tags are IDENTICAL.
However, since it is invalid to nest script tags, the next instance of />(<-is this valid?) or </script> is the end of this script block.
There's always the possibility of HTML comments or CDATA tags inside the script tag, but those should be fine if they don't contain </script>. HOWEVER: if they do, it would definitely be possible to get some 'code' through. I don't think the page would render, but some HTML parsers are amazingly flexible, so ya never know. To handle a little extra possible whitespace, you could use:
@"(?s)<\s?script.*?(/\s?>|<\s?/\s?script\s?>)"
Please let me know if you can figure out a way to break it that will let through VALID HTML code with run-able JavaScript (I know there are a few ways to get some stuff through, but it should be broken in one of many different ways if it does get through, and should not be run-able JavaScript code.)
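As a rough illustration of the first expression above, here is a minimal C# sketch. The class and method names (ScriptStripper, Strip) are made up for this example, and RegexOptions.IgnoreCase is an addition that the original expression does not include.

```csharp
using System.Text.RegularExpressions;

public static class ScriptStripper
{
    // (?s) lets '.' match newlines; the lazy .*? stops at the first "/>" or "</script>".
    private static readonly Regex ScriptBlock =
        new Regex(@"(?s)<script.*?(/>|</script>)", RegexOptions.IgnoreCase);

    // Returns the input with every matched script block removed.
    public static string Strip(string html)
    {
        return ScriptBlock.Replace(html, string.Empty);
    }
}
```

Calling ScriptStripper.Strip on the examples from the question removes each block. One thing worth checking on real pages: a script body that itself contains the characters "/>" (for example inside a string literal) ends the match early, so the tail of that script is left behind as plain text.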
It is generally agreed upon that trying to parse HTML with regex is a bad idea and will yield bad results. Instead, you should use a DOM parser. jQuery wraps nicely around the browser's DOM and would allow you to very easily remove all <script> tags.
OK, I have faced a similar case, where I needed to clean "rich text" (text with HTML formatting) of any possible JavaScript.
There are several ways to add JavaScript to HTML:
by using the <script> tag, with JavaScript inside it or by loading a JavaScript file using the "src" attribute.
ex: <script>maliciousCode();</script>
by using an event on an HTML element, such as "onload" or "onmouseover"
ex: <img src="a.jpg" onload="maliciousCode()">
by creating a hyperlink that calls javascript code
ex: <a href="javascript:maliciousCode()">...
This is all I can think of for now.
So the submitted HTML Code needs to be cleaned from these 3 cases. A simple solution would be to look for these patterns using Regex, and replace them by "" or do whatever else you want.
This is a simple code to do this:
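public static string CleanHTMLFromScript(string str)
{
    // 1. Remove opening <script ...> tags (the script body and closing tag are left in place).
    Regex re = new Regex("<script[^>]*>", RegexOptions.IgnoreCase);
    str = re.Replace(str, "");
    // 2. Remove any whole tag that contains an on<event>= attribute
    //    (this can also hit attributes that merely contain "on", such as content=).
    re = new Regex("<[a-z][^>]*on[a-z]+=\"?[^\"]*\"?[^>]*>", RegexOptions.IgnoreCase);
    str = re.Replace(str, "");
    // 3. Remove opening <a> tags whose href starts with javascript:.
    re = new Regex("<a\\s+href\\s*=\\s*\"?\\s*javascript:[^\"]*\"[^>]*>", RegexOptions.IgnoreCase);
    str = re.Replace(str, "");
    return str;
}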
This code takes care of any spaces and quotes that may or may not be added. It seems to be working fine, not perfect but it does the trick. Any improvements are welcome.
Creating your own HTML parser or script detector is a particularly bad idea if this is being done to prevent cross-site scripting. Doing this by hand is a Very Bad Idea, because there are any number of corner cases and tricks that can be used to defeat such an attempt. This is termed "black listing", as it attempts to remove the unsafe items from HTML, and it's pretty much doomed to failure.
Much safer to use a white list processor (such as AntiSamy), which only allows approved items through by automatically escaping everything else.
Of course, if this isn't what you're doing then you should probably edit your question to give some more context...
Edit:
Now that we know you're using C#, try the HTMLAgilityPack as suggested here.
Which language are you using? As a general statement, Regular Expressions are not suitable for parsing HTML.
If you are on the .net Platform, the HTML Agility Pack offers a much better parser.
You should use a real html parser for the job. That being said, for simple stripping
of script blocks you could use a rudimentary regex like below.
The idea is that you will need a callback to determine if capture group 1 matched.
If it did, the callback should pass things that hide html (like comments)
through unchanged, while the script blocks are passed back as an empty string.
This won't substitute for an html processor though. Good luck!
Search Regex: (modifiers - expanded, global, include newlines in dot, callback func)
(?:
<script (?:\s+(?:".*?"|\'.*?\'|[^>]*?)+)? \s*> .*? </script\s*>
| </?script (?:\s+(?:".*?"|\'.*?\'|[^>]*?)+)? \s*/?>
)
|
( # Capture group 1
<!(?:DOCTYPE.*?|--.*?--)> # things that hide html, add more constructs here ...
)
Replacement func pseudo code:
string callback () {
if capture buffer 1 matched
return capt buffer 1
else return ''
}
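In C#, the callback idea above maps onto a MatchEvaluator passed to Regex.Replace. The following is only a sketch: the pattern is a simplified stand-in for the one in this answer (it handles comments but not DOCTYPE or quoted attribute values), and the names ScriptBlockFilter and Strip are made up for the example.

```csharp
using System.Text.RegularExpressions;

public static class ScriptBlockFilter
{
    // Simplified pattern: group 1 captures HTML comments; the other
    // alternatives match whole script blocks and stray script tags.
    private static readonly Regex Pattern = new Regex(@"
          ( <!--.*?--> )                      # group 1: comment, passed through
        | <script\b[^>]*> .*? </script\s*>    # complete script block
        | </?script\b[^>]*>                   # unpaired or self-closed script tag
        ",
        RegexOptions.IgnorePatternWhitespace | RegexOptions.Singleline | RegexOptions.IgnoreCase);

    public static string Strip(string html)
    {
        // The callback keeps anything captured by group 1 and drops the rest.
        return Pattern.Replace(html, m => m.Groups[1].Success ? m.Value : string.Empty);
    }
}
```

Because the comment alternative is matched as a unit, a script tag that appears inside a comment is left alone rather than being cut in half, which is the point of routing "things that hide html" through the capture group.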
public static string MakeWebSafe(this string x) {
const string RegexRemove = @"(<\s*script[^>]*>)|(<\s*/\s*script[^>]*>)";
return Regex.Replace(x, RegexRemove, string.Empty, RegexOptions.IgnoreCase);
}
Is there any reason this implementation isn't good enough? Can you break it? Is there anything I haven't considered? If you use or have used something different, what are its advantages?
I'm aware this leaves the body of the script in the text, but that's okay for this project.
UPDATE
Don't do the above! I went with this in the end: HTML Agility Pack strip tags NOT IN whitelist.
Have you considered this kind of scenario??
<scri<script>pt type="text/javascript">
causehavoc();
</scr</script>ipt>
The best thing to do is remove all tags, encode things, or use bbcode
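To see why the nested-tag input above defeats the expression from the question, here is a small C# sketch. RemoveOnce applies the same pattern in a single pass; Encode shows the "encode things" alternative using System.Net.WebUtility.HtmlEncode. The class and method names are hypothetical.

```csharp
using System.Net;
using System.Text.RegularExpressions;

public static class NestedTagDemo
{
    // Same expression as MakeWebSafe in the question.
    private const string RegexRemove = @"(<\s*script[^>]*>)|(<\s*/\s*script[^>]*>)";

    // One pass of tag removal, as in the question.
    public static string RemoveOnce(string html)
    {
        return Regex.Replace(html, RegexRemove, string.Empty, RegexOptions.IgnoreCase);
    }

    // Encoding instead of removing: markup characters become entities.
    public static string Encode(string html)
    {
        return WebUtility.HtmlEncode(html);
    }
}
```

Removing the inner tags splices the outer fragments together, so a single pass over the example rebuilds a working script element. Encoding the whole string avoids this because nothing is deleted and every angle bracket ends up as an entity.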
Yes, your RegEx can be circumvented by unicode encoding the script tags. I would suggest you look to more robust libraries when it comes to security. Take a look at Microsoft Web Protection Library
Need a regular expression to remove the a tag from the following URL (a link whose text is Name) so that the output is only the string "Name". I am using C#.net.
Any help is appreciated
This will do a pretty good job:
str = Regex.Replace(str, @"<a\b[^>]+>([^<]*(?:(?!</a)<[^<]*)*)</a>", "$1");
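A minimal sketch of how that replacement behaves, wrapped in a hypothetical helper class (AnchorStripper is not from the original answer, and the URLs used when calling it are placeholders):

```csharp
using System.Text.RegularExpressions;

public static class AnchorStripper
{
    // Group 1 captures everything between the opening <a ...> and its </a>,
    // including other tags, but never runs past the first "</a".
    private static readonly Regex Anchor =
        new Regex(@"<a\b[^>]+>([^<]*(?:(?!</a)<[^<]*)*)</a>");

    // Replaces each anchor element with its inner content.
    public static string Strip(string html)
    {
        return Anchor.Replace(html, "$1");
    }
}
```

Note that [^>]+ requires at least one character after "<a", so a bare <a> with no attributes is not matched, and without RegexOptions.IgnoreCase neither is <A ...>.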
You should be looking at Html Agility Pack. Regex works in almost all cases, but it fails on some basic or broken HTML. Since the grammar of HTML is not regular, regex cannot cover every case, whereas Html Agility Pack still works fine.
If you only need to handle this particular anchor tag case once, any of the regexes above would work for you, but Html Agility Pack is the solid long-run solution for stripping HTML tags.
Ref: Using C# regular expressions to remove HTML tags
You can try to use this one. It has not been tested under all conditions, but it will return the correct value from your example.
\<[^\>]+\>(.[^\<]+)</[^\>]+\>
Here's a version that will work for only <a> tags.
\<a\s[^\>]+\>(.[^\<]+)</a\>
I tested it on the following HTML and it returned Name and Value only.
Name<label>This is a label</label> Value
Agree with Priyank that using a parser is a safer bet. If you do go the route of using a regex, consider how you want to handle edge cases. It's easy to transform the simple case you mentioned in your question. And if that is indeed the only form the markup will take, a simple regex can handle it. But if the markup is, for example, user generated or from 3rd party source, consider cases such as these:
<a>foo</a> --> foo # a bare anchor tag, with no attributes
# the regexes listed above wouldn't handle this
<b>boldness</b> --> <b>boldness</b>
# stripping out only the anchor tag
<A onClick="javascript:alert('foo')">Upper\ncase</A> --> Upper\ncase
# and obviously the regex should be case insensitive and
# apply to the entire string, not just one line at a time.
<b>bold</b>bar --> <b>bold</b>bar
# cases such as this tend to break a lot of regexes,
# if the markup in question is user generated, you're leaving
# yourself open to the risk of XSS
Following is working for me.
Regex.Replace(inputvalue, "\<[\/]*a[^\>]*\>", "")
I'm looking for a regex that will allow me to get all JavaScript and CSS link tags in a string so that I can strip certain tags from a DotNetNuke (Yeah I know.... ouch!) page on an overridden render event.
I know about the HTML Agility Pack, and I've even read Jeff Atwood's blog entry, but unfortunately I don't have the luxury of a 3rd party library.
Any help would be appreciated.
Edit, I gave this a try to get a javascript entry but it didn't work. Regex's are a dark art to me.
updatedPageSource = Regex.Replace(
pageSource,
String.Format("<script type=\"text/javascript\" src=\".*?{0}\"></script>",
name), "", RegexOptions.IgnoreCase);
I have a few comments on this, your RegEx is close, the following has been tested to work
<script type="text/javascript" src=".*myfile.js"></script>
I used the following test inputs
<script type="text/javascript" src="myfile.js"></script>
<script type="text/javascript" src="/test/myfile.js"></script>
<script type="text/javascript" src="/test/Looky/myfile.js"></script>
However, I would caution on this approach, and it does take time to parse, can be error prone, etc...
DISCLAIMER: Regex + HTML = ouch!
Your problem may be that you are not escaping the Regex metacharacters from name (e.g. the dot metacharacter '.'). You may want to try this:
updatedPageSource = Regex.Replace(
pageSource,
String.Format("<script\\s+type=\"text/javascript\"\\s+src=\".*?{0}\"\\s*>\\s*</script>", Regex.Escape(name)),
"",
RegexOptions.IgnoreCase);
// Just one of the many reasons why you don't mix Regex with HTML:
updatedPageSource = Regex.Replace(
updatedPageSource,
String.Format("<script\\s+src=\".*?{0}\"\\s+type=\"text/javascript\"\\s*>\\s*</script>", Regex.Escape(name)),
"",
RegexOptions.IgnoreCase);
I also added optional whitespace here and there.
Don't forget to account for things like whitespace, other attributes, different orders of attributes (i.e. src="foo" type="bar" vs type="bar" src="foo"), and " vs ' quoting. Maybe this?
@"<\s*script\b.*?\bsrc=(""|').*?{0}\1.*?(/>|>\s*</\s*script\s*>)"
I went ahead and took out the type attribute. If you have the filename, you know what type of script it is anyway; plus, this accounts for tags where the src attribute comes first, or they used the deprecated language attribute, or they omitted type altogether (it's supposed to be there, but it isn't always). Note that I'm using the lazy .*? so that it doesn't match all the way to the last </script> in the page.
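Putting the pieces from these answers together (escaping the file name, tolerating attribute order and either quote style), one possible sketch is below. It is a variation rather than the exact expression above: it uses [^>] instead of . so that a match cannot spill from one script tag into the next. The class and method names are hypothetical.

```csharp
using System.Text.RegularExpressions;

public static class ScriptReferenceRemover
{
    public static string Remove(string pageSource, string fileName)
    {
        // [^>] keeps each part of the match inside a single tag;
        // \1 requires the closing quote to match the opening one.
        string pattern = string.Format(
            @"<\s*script\b[^>]*?\bsrc\s*=\s*([""'])[^""'>]*?{0}\1[^>]*?(/>|>\s*</\s*script\s*>)",
            Regex.Escape(fileName));

        return Regex.Replace(pageSource, pattern, string.Empty, RegexOptions.IgnoreCase);
    }
}
```

With the dot-based version, a page that references another script before the target one can lose both tags, because the lazy quantifier is still free to run across the first tag's closing bracket while searching for the file name.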
I have following string:
<div> text0 </div> prefix <div> text1 <strong>text2</strong> text3 </div> text4
and want to know whether it contains text3 inside divs that go after prefix:
prefix<div>...text3...</div>
but I don't know how to make a regex for that, since I can't use [^<]+ because divs can contain a strong tag inside.
Please help
EDIT:
Div tags after prefix are guaranteed not to be nested
Language is C#
Text4 is very long, so regex must not look after closing div
EDIT2: I don't want to use html parser, it can be easily (and MUCH faster) achieved with Regex. HTML there is simple: no attributes in tags; no nesting div's. And even some % of wrong answers are acceptable in my case.
If you turn off the "greedy" option, you should be able to just use something like prefix<div>.*text3.*</div>. (If the <div> is allowed to have attributes, use prefix<div[^>]*>.*text3.*</div> instead.)
Numerous improvements could be made to this in order to take account of unusual spacing, >s within quotes, </div> within quotes, etc.
Patterns like prefix<div>...<div></div>text3</div> would be more difficult. You might have to capture all of the occurrences of the div tag so that you could count how many div tags were open at a given time.
EDIT: Oops, turning off the greedy option won't always give the right result, even in examples other than the one above. Probably better just to capture all occurrences of the div tag and go from there. As noted above by Peter, HTML is not a regular language and so you can't use regular expressions to do everything you might want with it.
this is my new regex:
prefix<div>([^<]*<(?!/div>))*[^<]*text3([^<]*<(?!/div>))*[^<]*</div>
seems to work ok.
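A minimal C# check of that expression, using a hypothetical wrapper class (DivTextFinder is a made-up name):

```csharp
using System.Text.RegularExpressions;

public static class DivTextFinder
{
    // Inside the div, any '<' is allowed unless it begins "</div>",
    // so inner tags such as <strong> do not stop the match.
    private static readonly Regex Pattern = new Regex(
        @"prefix<div>([^<]*<(?!/div>))*[^<]*text3([^<]*<(?!/div>))*[^<]*</div>");

    public static bool ContainsText3AfterPrefix(string input)
    {
        return Pattern.IsMatch(input);
    }
}
```

The pattern expects <div> to follow prefix immediately. The sample string in the question has a space between them, so something like prefix\s*<div> may be needed for that input.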
For C# + HtmlAgilityPack you can do something like:
InputString = Regex.Replace(InputString,"^(?:[^<]+?|<[^>]*>)*?prefix","");
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(InputString);
HtmlNode node = doc.DocumentNode.SelectSingleNode("//div[contains(., 'text3')]");
The prefix removal is still not a good way of dealing with it. Ideally you'd do something like using HtmlAgilityPack to find where prefix occurs in the DOM, translate that to provide the position in the string, then do a substring(pos,len) (or equivalent) to look at only the relevant text (you can also avoid looking at text4 using similar method).
I'm afraid I can't translate all that into code right now; hopefully someone else can help there.
(original answer, before extra details provided)
Here is a JavaScript + jQuery solution:
var InputString = '<div>text0 </div> prefix <div>text1 <strong>text2</strong> text3 </div> text4';
InputString = InputString.replace(/^.*?prefix/,'');
var MatchingDivs = jQuery('div:contains(text3)','<div>'+InputString+'</div>')
console.log(MatchingDivs.get());
This makes use of jQuery's ability to accept a context as second argument (though it appears this needs to be wrapped in div tags to actually work).
I'm trying to convert all instances of the > character to its HTML entity equivalent, &gt;, within a string of HTML that contains HTML tags. The furthest I've been able to get with a solution for this is using a regex.
Here's what I have so far:
public static readonly Regex HtmlAngleBracketNotPartOfTag = new Regex("(?:<[^>]*(?:>|$))(>)", RegexOptions.Compiled | RegexOptions.Singleline);
The main issue I'm having is isolating the single > characters that are not part of an HTML tag. I don't want to convert any existing tags, because I need to preserve the HTML for rendering. If I don't convert the > characters, I get malformed HTML, which causes rendering issues in the browser.
This is an example of a test string to parse:
"Ok, now I've got the correct setting.<br/><br/>On 12/22/2008 3:45 PM, jproot#somedomain.com wrote:<br/><div class"quotedReply">> Ok, got it, hope the angle bracket quotes are there.<br/>><br/>> On 12/22/2008 3:45 PM, > sbartfast#somedomain.com wrote:<br/>>> Please someone, reply to this.<br/>>><br/>><br/></div>"
In the above string, none of the > characters that are part of HTML tags should be converted to &gt;. So, this:
<div class"quotedReply">>
should become this:
<div class"quotedReply">&gt;
Another issue is that the expression above uses a non-capturing group, which is fine except for the fact that the match is in group 1. I'm not quite sure how to do a replace only on group 1 and preserve the rest of the match. It appears that a MatchEvaluator doesn't really do the trick, or perhaps I just can't envision it right now.
I suspect my regex could do with some lovin'.
Anyone have any bright ideas?
Why do you want to do this? What harm are the > doing? Most parsers I've come across are quite happy with a > on its own without it needing to be escaped to an entity.
Additionally, it would be more appropriate to properly encode the content strings with HttpUtility.HtmlEncode before concatenating them with strings containing HTML markup, so if this is under your control you should consider dealing with it there.
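A sketch of the encode-before-concatenating idea follows. It uses System.Net.WebUtility.HtmlEncode so that no reference to System.Web is needed; that substitution, the QuotedReplyBuilder name, and the wrapper markup are assumptions made for the example.

```csharp
using System.Net;

public static class QuotedReplyBuilder
{
    // Encodes the plain-text part first, then wraps it in trusted markup.
    public static string Wrap(string plainText)
    {
        return "<div class=\"quotedReply\">" + WebUtility.HtmlEncode(plainText) + "</div>";
    }
}
```

Done this way, the > characters from the quoted e-mail are already entities by the time any tags are added, so there is nothing left to tell apart afterwards.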
The trick is to capture everything that isn't the target, then plug it back in along with the changed text, like this:
Regex.Replace(str, @"\G((?>[^<>]+|<[^>]*>)*)>", "$1&gt;");
But Anthony's right: right angle brackets in text nodes shouldn't cause any problems. And matching HTML with regexes is tricky; for example, comments and CDATA can contain practically anything, so a robust regex would have to match them specifically.
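Here is the capture-and-plug-back-in replacement as a small C# sketch. AngleBracketEncoder is a hypothetical name, and the pattern deliberately ignores comments and CDATA, as noted above.

```csharp
using System.Text.RegularExpressions;

public static class AngleBracketEncoder
{
    // \G anchors each match where the previous one ended; group 1 swallows
    // text and whole tags, so the final '>' is one that sits outside any tag.
    private static readonly Regex LooseBracket =
        new Regex(@"\G((?>[^<>]+|<[^>]*>)*)>", RegexOptions.Compiled);

    // Rewrites only the stray '>' characters; tags are copied back via $1.
    public static string Encode(string html)
    {
        return LooseBracket.Replace(html, "$1&gt;");
    }
}
```

Each match consumes everything up to and including one stray bracket, which is why the replacement has to put group 1 back in front of the entity.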
Maybe read your HTML into an XML parser which should take care of the conversions for you.
Are you talking about the > chars inside of an HTML tag (like in JavaScript's innerText), or in the arguments list of an HTML tag?
If you want to just sanitize the text between the opening and closing tag, that should be rather simple. Just locate any > char, and replace it with &gt;. (I'd also do it with the < char), but the HTML render engine SHOULD take care of this for you...
Give an example of what you are trying to sanitize, and maybe we can find the best solution for it.
Larry
Could you read the string into an XML document, look at the values, and replace the > with &gt; in the values? This would require recursively going into each node in the document but that shouldn't be too hard to do.
Steve_C, you may try this RegEx. This will capture any HTML tag name in reference 1, and the text between the tags is stored in capture 2. I didn't fully test this, just throwing it out there in case it might help.
<([A-Z][A-Z0-9]*)[^>]*>(.*?)</\1>
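For reference, a short sketch of reading those two groups in C#. The TagPairMatcher name and the choice of IgnoreCase and Singleline are assumptions; the expression itself is the one above.

```csharp
using System.Text.RegularExpressions;

public static class TagPairMatcher
{
    // Group 1 is the tag name, group 2 the text up to the matching close tag.
    private static readonly Regex TagPair = new Regex(
        @"<([A-Z][A-Z0-9]*)[^>]*>(.*?)</\1>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Returns the first tag pair found; check Success before using the groups.
    public static Match First(string html)
    {
        return TagPair.Match(html);
    }
}
```

The text in group 2 could then be encoded separately and spliced back between the tags, though nested elements with the same name will still confuse the lazy match.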