I am using .NET 4.5, and
HttpUtility.HtmlDecode fails to decode &#39, which is the single-quote character.
Any idea why?
Using C#, .NET 4.5, WPF on Windows 8.1.
Here is the text that fails to decode:
Apple 13'' Z0RA30256 MacBook Pro Retina
Below is the framework version:
#region Assembly System.Web.dll, v4.0.0.0
// C:\Program Files (x86)\Reference Assemblies\Microsoft\Framework\.NETFramework\v4.5\System.Web.dll
#endregion
It's not possible to handle this with the built-in HtmlDecode method; you would have to find/replace it or otherwise work around it.
Below is the source code for HtmlDecode. You can see from the comment that your scenario was explicitly considered and is not supported: HTML entities have to be terminated with a ;, otherwise they are simply not HTML entities. Browsers are forgiving of the incorrect markup and compensate accordingly.
// We found a '&'. Now look for the next ';' or '&'. The idea is that
// if we find another '&' before finding a ';', then this is not an entity,
// and the next '&' might start a real entity (VSWhidbey 275184)
Here is the full source of the .NET HtmlDecode in HttpUtility, if you want to adapt the behaviour.
http://referencesource.microsoft.com/#System/net/System/Net/WebUtility.cs,44d08941e6aeb00d
public static void HtmlDecode(string value, TextWriter output)
{
if (value == null)
{
return;
}
if (output == null)
{
throw new ArgumentNullException("output");
}
if (value.IndexOf('&') < 0)
{
output.Write(value); // good as is
return;
}
int l = value.Length;
for (int i = 0; i < l; i++)
{
char ch = value[i];
if (ch == '&')
{
// We found a '&'. Now look for the next ';' or '&'. The idea is that
// if we find another '&' before finding a ';', then this is not an entity,
// and the next '&' might start a real entity (VSWhidbey 275184)
int index = value.IndexOfAny(_htmlEntityEndingChars, i + 1);
if (index > 0 && value[index] == ';')
{
string entity = value.Substring(i + 1, index - i - 1);
if (entity.Length > 1 && entity[0] == '#')
{
// The # syntax can be in decimal or hex, e.g.
// &#229; --> decimal
// &#xE5; --> same char in hex
// See http://www.w3.org/TR/REC-html40/charset.html#entities
ushort parsed;
if (entity[1] == 'x' || entity[1] == 'X')
{
UInt16.TryParse(entity.Substring(2), NumberStyles.AllowHexSpecifier, NumberFormatInfo.InvariantInfo, out parsed);
}
else
{
UInt16.TryParse(entity.Substring(1), NumberStyles.Integer, NumberFormatInfo.InvariantInfo, out parsed);
}
if (parsed != 0)
{
ch = (char)parsed;
i = index; // already looked at everything until semicolon
}
}
else
{
i = index; // already looked at everything until semicolon
char entityChar = HtmlEntities.Lookup(entity);
if (entityChar != (char)0)
{
ch = entityChar;
}
else
{
output.Write('&');
output.Write(entity);
output.Write(';');
continue;
}
}
}
}
output.Write(ch);
}
}
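As a find/replace workaround along the lines suggested above, one option is to add the missing semicolon to numeric entities before decoding. This is only a minimal sketch: the class and method names are made up for illustration, it only repairs decimal references such as &#39, and it uses WebUtility.HtmlDecode from System.Net (the class the reference-source link points at) rather than HttpUtility.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the framework.
public static class LenientHtmlDecoder
{
    // Matches a decimal numeric reference such as "&#39" that is NOT followed
    // by another digit or by ';', i.e. one whose terminator is missing.
    private static readonly Regex UnterminatedNumericEntity =
        new Regex(@"&#([0-9]+)(?![0-9;])");

    public static string Decode(string value)
    {
        if (string.IsNullOrEmpty(value))
        {
            return value;
        }

        // "&#39" becomes "&#39;"; already well-formed entities are left alone.
        string repaired = UnterminatedNumericEntity.Replace(value, "&#$1;");
        return WebUtility.HtmlDecode(repaired);
    }
}
```

For example, LenientHtmlDecoder.Decode("Apple 13&#39&#39 Z0RA30256 MacBook Pro Retina") should return the text with two single quotes after 13. Whether silently repairing malformed input is acceptable depends on where the data comes from.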
EDIT : Here's my current code (21233664 chars)
string str = myInput.Text;
StringBuilder sb = new StringBuilder();
foreach (char c in str)
{
if ((c >= 'a' && c <= 'z') || c == '_' || c==' ')
{
sb.Append(c);
}
}
output.Text = sb.ToString();
Let's say I have a huge text file which contains special characters and normal expressions with underscores.
Here are a few examples of the strings that I'm looking for :
super_test
test
another_super_test
As you can see, only lower case letters are allowed with underscores.
Now, if I have those strings in a text file that looks like this :
> §> ˜;# ®> l? super_test D>ÿÿÿÿ “G? tI> €[> €? È
The problem I'm facing is that some lonely letters are still saved. In the example given above, the output would be :
l super_test t
To get rid of those chars, I would have to go through the whole file again, but here's my question: how can I know whether a letter is lonely or not?
I'm not sure I understand the possibilities with regex, so if anyone can give me a hint I'd really appreciate it.
You clearly need a regular expression. A simple one would be [a-z_]{2,}, which takes all strings of lowercase a to z letters and underscore that are at least 2 characters long.
Just be careful when you are parsing the big file. Being huge, I imagine you use some sort of buffers. You need to make sure you don't get half of a word in one buffer and the other in the next.
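A minimal sketch of applying the [a-z_]{2,} pattern with Regex.Matches, assuming the text (or one buffer of it) is already in a string. The class and method names are invented for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only.
public static class WordExtractor
{
    // Runs of lowercase a-z or underscore that are at least 2 characters long.
    private static readonly Regex Word = new Regex("[a-z_]{2,}");

    public static List<string> Extract(string text)
    {
        var words = new List<string>();
        foreach (Match m in Word.Matches(text))
        {
            words.Add(m.Value);
        }
        return words;
    }
}
```

With the sample line from the question, string.Join(" ", WordExtractor.Extract(line)) gives super_test, because the lonely l and t are shorter than two characters. Note that a word split across two buffers would be seen as two shorter runs, which is the buffer problem mentioned above.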
You can't treat the space just like the other acceptable characters. In addition to being acceptable, the space also serves as a delimiter for your lonesome characters. (This might be a problem with the proposed regular expressions as well; I couldn't say for sure.) Anyway, this does what (I think) you want:
string str = "> §> ˜;# ®> l? super_test D>ÿÿÿÿ “G? tI> €[> €? È";
StringBuilder sb = new StringBuilder();
char? firstLetterOfWord = null;
foreach (char c in str)
{
if ((c >= 'a' && c <= 'z') || c == '_')
{
int length = sb.Length;
if (firstLetterOfWord != null)
{
// c is the second character of a word
sb.Append(firstLetterOfWord);
sb.Append(c);
firstLetterOfWord = null;
}
else if (length == 0 || sb[length - 1] == ' ')
{
// c is the first character of a word; save for next iteration
firstLetterOfWord = c;
}
else
{
// c is part of a word; we're not first, and prev != space
sb.Append(c);
}
}
else if (c == ' ')
{
// If you want to eliminate multiple spaces in a row,
// this is the place to do so
sb.Append(' ');
firstLetterOfWord = null;
}
else
{
firstLetterOfWord = null;
}
}
Console.WriteLine(sb.ToString());
It works with singletons and full words at both start and end of string.
If your input contains something like one#two, the output will run together (onetwo with no intervening space). Assuming that's not what you want, and also assuming that you have no need for multiple spaces in a row:
StringBuilder sb = new StringBuilder();
bool previousWasSpace = true;
char? firstLetterOfWord = null;
foreach (char c in str)
{
if ((c >= 'a' && c <= 'z') || c == '_')
{
if (firstLetterOfWord != null)
{
sb.Append(firstLetterOfWord).Append(c);
firstLetterOfWord = null;
previousWasSpace = false;
}
else if (previousWasSpace)
{
firstLetterOfWord = c;
}
else
{
sb.Append(c);
}
}
else
{
firstLetterOfWord = null;
if (!previousWasSpace)
{
sb.Append(' ');
previousWasSpace = true;
}
}
}
Console.WriteLine(sb.ToString());
So I have a string that I need to split on semicolons:
Email address: "one@tw;,.'o"@hotmail.com;"some;thing"@example.com
Both of the email addresses are valid.
So I want to have a List<string> of the following:
"one@tw;,.'o"@hotmail.com
"some;thing"@example.com
But the way I am currently splitting the addresses is not working:
var addresses = emailAddressString.Split(new[] { ';' }, StringSplitOptions.RemoveEmptyEntries)
.Select(x => x.Trim()).ToList();
Because of the multiple ; characters I end up with invalid email addresses.
I have tried a few different ways, even going down working out if the string contains quotes and then finding the index of the ; characters and working it out that way, but it's a real pain.
Does anyone have any better suggestions?
Assuming that double-quotes are not allowed, except for the opening and closing quotes ahead of the "at" sign @, you can use this regular expression to capture e-mail addresses:
((?:[^@"]+|"[^"]*")@[^;]+)(?:;|$)
The idea is to capture either an unquoted [^@"]+ or a quoted "[^"]*" part prior to @, and then capture everything up to semicolon ; or the end anchor $.
var input = "\"one@tw;,.'o\"@hotmail.com;\"some;thing\"@example.com;hello@world";
var mm = Regex.Matches(input, "((?:[^@\"]+|\"[^\"]*\")@[^;]+)(?:;|$)");
foreach (Match m in mm) {
Console.WriteLine(m.Groups[1].Value);
}
This code prints
"one@tw;,.'o"@hotmail.com
"some;thing"@example.com
hello@world
If you would like to allow escaped double-quotes inside double-quotes, you could use a more complex expression:
((?:(?:[^@\"]|(?<=\\)\")+|\"([^\"]|(?<=\\)\")*\")@[^;]+)(?:;|$)
Everything else remains the same.
I obviously started writing my anti regex method at around the same time as juharr (Another answer). I thought that since I already have it written I would submit it.
public static IEnumerable<string> SplitEmailsByDelimiter(string input, char delimiter)
{
var startIndex = 0;
var delimiterIndex = 0;
while (delimiterIndex >= 0)
{
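// Use the delimiter that was passed in rather than a hard-coded ';'
delimiterIndex = input.IndexOf(delimiter, startIndex);
string substring = input;
// >= 0 so that a delimiter at position 0 is handled as well
if (delimiterIndex >= 0)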
{
substring = input.Substring(0, delimiterIndex);
}
if (!substring.Contains("\"") || substring.IndexOf("\"") != substring.LastIndexOf("\""))
{
yield return substring;
input = input.Substring(delimiterIndex + 1);
startIndex = 0;
}
else
{
startIndex = delimiterIndex + 1;
}
}
}
Then the following
var input = "blah@blah.com;\"one@tw;,.'o\"@hotmail.com;\"some;thing\"@example.com;hello@world;asdasd@asd.co.uk;";
foreach (var email in SplitEmailsByDelimiter(input, ';'))
{
Console.WriteLine(email);
}
Would give this output
blah@blah.com
"one@tw;,.'o"@hotmail.com
"some;thing"@example.com
hello@world
asdasd@asd.co.uk
You can also do this without using regular expressions. The following extension method will allow you to specify a delimiter character and a character to begin and end escape sequences. Note it does not validate that all escape sequences are closed.
public static IEnumerable<string> SpecialSplit(
this string str, char delimiter, char beginEndEscape)
{
int beginIndex = 0;
int length = 0;
bool escaped = false;
foreach (char c in str)
{
if (c == beginEndEscape)
{
escaped = !escaped;
}
if (!escaped && c == delimiter)
{
yield return str.Substring(beginIndex, length);
beginIndex += length + 1;
length = 0;
continue;
}
length++;
}
yield return str.Substring(beginIndex, length);
}
Then the following
var input = "\"one@tw;,.'o\"@hotmail.com;\"some;thing\"@example.com;hello@world;\"D;D@blah;blah.com\"";
foreach (var address in input.SpecialSplit(';', '"'))
Console.WriteLine(address);
Will give this output
"one@tw;,.'o"@hotmail.com
"some;thing"@example.com
hello@world
"D;D@blah;blah.com"
Here's a version that also supports a single escape character. It assumes that two consecutive escape characters should become one literal escape character. The escape character escapes the beginEndEscape character, so that it will not trigger the beginning or end of an escape sequence, and it also escapes the delimiter. Anything else that follows the escape character is left as is, with the escape character removed.
public static IEnumerable<string> SpecialSplit(
this string str, char delimiter, char beginEndEscape, char singleEscape)
{
StringBuilder builder = new StringBuilder();
bool escapedSequence = false;
bool previousEscapeChar = false;
foreach (char c in str)
{
if (c == singleEscape && !previousEscapeChar)
{
previousEscapeChar = true;
continue;
}
if (c == beginEndEscape && !previousEscapeChar)
{
escapedSequence = !escapedSequence;
}
if (!escapedSequence && !previousEscapeChar && c == delimiter)
{
yield return builder.ToString();
builder.Clear();
continue;
}
builder.Append(c);
previousEscapeChar = false;
}
yield return builder.ToString();
}
Finally, you should probably add null checking for the string that is passed in. Note that both versions will return a sequence containing one empty string if you pass in an empty string.
I am trying to extract information out of a string, a Fortran format string to be specific. The string is formatted like:
F8.3, I5, 3(5X, 2(A20,F10.3)), 'XXX'
with formatting fields delimited by "," and formatting groups inside brackets, with the number in front of the brackets indicating how many consecutive times the formatting pattern is repeated. So, the string above expands to:
F8.3, I5, 5X, A20,F10.3, A20,F10.3, 5X, A20,F10.3, A20,F10.3, 5X, A20,F10.3, A20,F10.3, 'XXX'
I am trying to make something in C# that will expand a string that conforms to that pattern. I have started going about it with lots of switch and if statements, but am wondering if I am not going about it the wrong way?
I was basically wondering if some regex wizard thinks that regular expressions can do this in one fell swoop? I know nothing about regular expressions, but if this could solve my problem I am considering putting in some time to learn how to use them. On the other hand, if regular expressions can't sort this out then I'd rather spend my time looking at another method.
This has to be doable with Regex :)
I've expanded my previous example, and it tests nicely with your example.
// regex to match the innermost patterns of n(X) and capture the values of n and X.
private static readonly Regex matcher = new Regex(@"(\d+)\(([^(]*?)\)", RegexOptions.None);
// create new string by repeating X n times, separated with ','
private static string Join(Match m)
{
var n = Convert.ToInt32(m.Groups[1].Value); // get value of n
var x = m.Groups[2].Value; // get value of X
return String.Join(",", Enumerable.Repeat(x, n));
}
// expand the string by recursively replacing the innermost values of n(X).
private static string Expand(string text)
{
var s = matcher.Replace(text, Join);
return (matcher.IsMatch(s)) ? Expand(s) : s;
}
// parse a string for occurrences of the n(X) pattern and expand them.
// return the string as a tokenized array.
public static string[] Parse(string text)
{
// Check that the number of parentheses is even.
if (text.Sum(c => (c == '(' || c == ')') ? 1 : 0) % 2 == 1)
throw new ArgumentException("The string contains an odd number of parentheses.");
return Expand(text).Split(new[] { ',', ' ' }, StringSplitOptions.RemoveEmptyEntries);
}
I would suggest using a recursive method like the example below (not tested):
ResultData Parse(String value, ref Int32 index)
{
ResultData result = new ResultData();
Int32 startIndex = index; // Used to get substrings
while (index < value.Length)
{
Char current = value[index];
if (current == '(')
{
index++;
result.Add(Parse(value, ref index));
startIndex = index;
continue;
}
if (current == ')')
{
// Push last result
index++;
return result;
}
// Process all other chars here, then advance so the loop terminates
index++;
}
// We can't find the closing bracket
throw new Exception("String is not valid");
}
You may need to modify some parts of the code, but I used this approach when writing a simple compiler. It's not complete, just an example.
Personally, I would suggest using a recursive function instead. Every time you hit an opening parenthesis, call the function again to parse that part. I'm not sure if you can use a regex to match a recursive data structure.
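The recursive idea described above could look roughly like this. It is a sketch under some assumptions: fields are separated by commas, a group may be preceded by an optional repeat count, and quoted literals do not themselves contain commas or parentheses. The class and method names are made up for the example.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical recursive-descent expander for format strings such as
// "F8.3, I5, 3(5X, 2(A20,F10.3)), 'XXX'".
public static class FormatExpander
{
    public static List<string> Expand(string format)
    {
        int pos = 0;
        return ParseList(format, ref pos, false);
    }

    // Parses a comma-separated list of fields. Each '(' starts a recursive
    // call; the matching ')' ends that call and returns the inner fields.
    private static List<string> ParseList(string s, ref int pos, bool insideGroup)
    {
        var result = new List<string>();
        int tokenStart = pos;
        while (pos < s.Length)
        {
            char c = s[pos];
            if (c == '(')
            {
                // Text between the last separator and '(' is the repeat count.
                string prefix = s.Substring(tokenStart, pos - tokenStart).Trim();
                int repeat = prefix.Length == 0 ? 1 : int.Parse(prefix);
                pos++; // skip '('
                List<string> inner = ParseList(s, ref pos, true);
                for (int i = 0; i < repeat; i++)
                {
                    result.AddRange(inner);
                }
                tokenStart = pos;
            }
            else if (c == ',' || c == ')')
            {
                AddField(result, s.Substring(tokenStart, pos - tokenStart));
                pos++; // skip the separator or the closing bracket
                if (c == ')')
                {
                    if (!insideGroup) throw new FormatException("Unmatched ')'.");
                    return result;
                }
                tokenStart = pos;
            }
            else
            {
                pos++;
            }
        }
        if (insideGroup) throw new FormatException("Missing ')'.");
        AddField(result, s.Substring(tokenStart));
        return result;
    }

    private static void AddField(List<string> fields, string text)
    {
        text = text.Trim();
        if (text.Length > 0) fields.Add(text);
    }
}
```

Calling string.Join(", ", FormatExpander.Expand("F8.3, I5, 3(5X, 2(A20,F10.3)), 'XXX'")) yields the eighteen fields of the expanded string from the question. Returning a list rather than a string avoids a second tokenizing pass.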
(Edit: Removed incorrect regex)
Ended up rewriting this today. It turns out that this can be done in one single method:
private static string ExpandBrackets(string Format)
{
int maxLevel = CountNesting(Format);
for (int currentLevel = maxLevel; currentLevel > 0; currentLevel--)
{
int level = 0;
int start = 0;
int end = 0;
for (int i = 0; i < Format.Length; i++)
{
char thisChar = Format[i];
switch (Format[i])
{
case '(':
level++;
if (level == currentLevel)
{
string group = string.Empty;
int repeat = 0;
/// Isolate the number of repeats, if any
/// If there are 0 repeats then set to 1 so the group will be replaced by itself with the brackets removed
for (int j = i - 1; j >= 0; j--)
{
char c = Format[j];
if (c == ',')
{
start = j + 1;
break;
}
if (char.IsDigit(c))
repeat = int.Parse(c + (repeat != 0 ? repeat.ToString() : string.Empty));
else
throw new Exception("Non-numeric character " + c + " found in front of the brackets");
}
if (repeat == 0)
repeat = 1;
/// Isolate the format group
/// Parse until the first closing bracket. Level is decremented as this effectively takes us down one level
for (int j = i + 1; j < Format.Length; j++)
{
char c = Format[j];
if (c == ')')
{
level--;
end = j;
break;
}
group += c;
}
/// Substitute the expanded group for the original group in the format string
/// If the group is empty then just remove it from the string
if (string.IsNullOrEmpty(group))
{
Format = Format.Remove(start - 1, end - start + 2);
i = start;
}
else
{
string repeatedGroup = RepeatString(group, repeat);
Format = Format.Remove(start, end - start + 1).Insert(start, repeatedGroup);
i = start + repeatedGroup.Length - 1;
}
}
break;
case ')':
level--;
break;
}
}
}
return Format;
}
CountNesting() returns the highest level of bracket nesting in the format statement, but could be passed in as a parameter to the method. RepeatString() just repeats a string the specified number of times and substitutes it for the bracketed group in the format string.
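The two helpers are not shown in the answer. A possible sketch, based only on the description above, follows. The comma used to join the repeated copies is an assumption (the surrounding format string is comma-delimited), and the class name is invented.

```csharp
using System;
using System.Linq;

// Hypothetical implementations of the helpers described above.
public static class FormatHelpers
{
    // Returns the deepest level of bracket nesting, e.g. 2 for "3(5X, 2(A20))".
    public static int CountNesting(string format)
    {
        int level = 0;
        int max = 0;
        foreach (char c in format)
        {
            if (c == '(')
            {
                level++;
                if (level > max) max = level;
            }
            else if (c == ')')
            {
                level--;
            }
        }
        return max;
    }

    // Repeats the group the given number of times.
    // Assumption: copies are separated by a comma.
    public static string RepeatString(string group, int count)
    {
        return string.Join(",", Enumerable.Repeat(group, count));
    }
}
```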
Is there a class similar to HttpUtility to encode the content of a custom header? Ideally I would like to keep the content readable.
You can use the HttpEncoder.HeaderNameValueEncode Method in the .NET Framework 4.0 and above.
For previous versions of the .NET Framework, you can roll your own encoder, using the logic noted on the HttpEncoder.HeaderNameValueEncode reference page:
All characters whose Unicode value is less than ASCII character 32,
except ASCII character 9, are URL-encoded into a format of %NN where
the N characters represent hexadecimal values.
ASCII character 9 (the horizontal tab character) is not URL-encoded.
ASCII character 127 is encoded as %7F.
All other characters are not encoded.
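The rules above could be implemented along these lines. This is a sketch, not the framework's code: the class and method names are invented, and lowercase hex digits are an arbitrary choice.

```csharp
using System;
using System.Text;

// Hypothetical stand-alone encoder that follows the four rules listed above.
public static class HeaderEncoder
{
    public static string Encode(string value)
    {
        if (string.IsNullOrEmpty(value))
        {
            return value;
        }

        var sb = new StringBuilder(value.Length);
        foreach (char ch in value)
        {
            // Control characters except tab (9), plus DEL (127), become %NN.
            if ((ch < 32 && ch != 9) || ch == 127)
            {
                sb.Append('%').Append(((int)ch).ToString("x2"));
            }
            else
            {
                sb.Append(ch); // everything else is copied unchanged
            }
        }
        return sb.ToString();
    }
}
```

Because % itself is not encoded under these rules, the result cannot always be decoded unambiguously.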
Update:
As OliverBock points out, the HttpEncoder.HeaderNameValueEncode method is protected and internal. I went to the open-source Mono project and found Mono's implementation:
void HeaderNameValueEncode (string headerName, string headerValue, out string encodedHeaderName, out string encodedHeaderValue)
{
if (String.IsNullOrEmpty (headerName))
encodedHeaderName = headerName;
else
encodedHeaderName = EncodeHeaderString (headerName);
if (String.IsNullOrEmpty (headerValue))
encodedHeaderValue = headerValue;
else
encodedHeaderValue = EncodeHeaderString (headerValue);
}
static void StringBuilderAppend (string s, ref StringBuilder sb)
{
if (sb == null)
sb = new StringBuilder (s);
else
sb.Append (s);
}
static string EncodeHeaderString (string input)
{
StringBuilder sb = null;
for (int i = 0; i < input.Length; i++) {
char ch = input [i];
if ((ch < 32 && ch != 9) || ch == 127)
StringBuilderAppend (String.Format ("%{0:x2}", (int)ch), ref sb);
}
if (sb != null)
return sb.ToString ();
return input;
}
Just FYI, the source is here: https://github.com/mono/mono/blob/master/mcs/class/System.Web/System.Web.Util/HttpEncoder.cs
Uri.EscapeDataString(headerValue) worked for me.
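A small round-trip sketch of that approach, with an invented wrapper class. Note that Uri.EscapeDataString percent-encodes far more than control characters (spaces, semicolons, non-ASCII text), so the header is safe but less readable.

```csharp
using System;

// Hypothetical wrapper around the Uri escaping methods.
public static class HeaderValueCodec
{
    public static string Encode(string headerValue)
    {
        return Uri.EscapeDataString(headerValue);
    }

    public static string Decode(string encoded)
    {
        return Uri.UnescapeDataString(encoded);
    }
}
```

For example, HeaderValueCodec.Encode("a b;c") returns a%20b%3Bc, and Decode restores the original value.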
This does the same job as HeaderNameValueEncode(), but will also encode % characters so the header can be reliably decoded later.
static string EncodeHeaderValue(string value)
{
return Regex.Replace(value, @"[\u0000-\u0008\u000a-\u001f%\u007f]", (m) => "%"+((int)m.Value[0]).ToString("x2"));
}
static string DecodeHeaderValue(string encoded)
{
return Regex.Replace(encoded, @"%([0-9a-f]{2})", (m) => new String((char)Convert.ToInt32(m.Groups[1].Value, 16), 1), RegexOptions.IgnoreCase);
}
Maybe this one?
UrlEncode Function
Sorry, this is off the top of my head, but your request object should have a headers collection you can add to,
i.e. request.headers.add("blah");
That's not spot on, but it should point you in the right direction.
Is anybody aware of a list of exactly what triggers ASP.NET's HttpRequestValidationException? [This is behind the common error: "A potentially dangerous Request.Form value was detected," etc.]
I've checked here, around the Web, and MSDN Library but can't find this documented. I'm aware of some ways to generate the error, but would like to have a complete list so I can guard against and selectively circumvent it (I know how to disable request validation for a page, but this isn't an option in this case).
Is it a case of "security through obscurity"?
Thanks.
[Note: Scripts won't load for me in IE8 (as described frequently in the Meta forum) so I won't be able to "Add comment."]
EDIT 1: Hi Oded, are you aware of a list that documents the conditions used to determine a "potentially malicious input string"? That's what I'm looking for.
EDIT 2: @Chris Pebble: Yeah, what you said. :)
I couldn't find a document outlining a conclusive list, but looking through Reflector and doing some analysis on use of HttpRequestValidationException, it looks like validation errors on the following can cause the request validation to fail:
A filename in one of the files POSTed to an upload.
The incoming request raw URL.
The value portion of the name/value pair from any of the incoming cookies.
The value portion of the name/value pair from any of the fields coming in through GET/POST.
The question, then, is "what qualifies one of these things as a dangerous input?" That seems to happen during an internal method System.Web.CrossSiteScriptingValidation.IsDangerousString(string, out int) which looks like it decides this way:
Look for < or & in the value. If it's not there, or if it's the last character in the value, then the value is OK.
If the & character is in a &# sequence (e.g., &#160; for a non-breaking space), it's a "dangerous string."
If the < character is part of <x (where "x" is any alphabetic character a-z), <!, </, or <?, it's a "dangerous string."
Failing all of that, the value is OK.
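Expressed as code, the rules above amount to something like the following. This is my own sketch of the described behaviour, not the framework's source, and the class and method names are invented.

```csharp
using System;

// Hypothetical re-statement of the rules listed above.
public static class RequestValidationSketch
{
    private static readonly char[] StartingChars = { '<', '&' };

    public static bool LooksDangerous(string s)
    {
        if (string.IsNullOrEmpty(s))
        {
            return false;
        }

        int start = 0;
        while (true)
        {
            int index = s.IndexOfAny(StartingChars, start);

            // No '<' or '&' left, or it is the last character: the value is OK.
            if (index < 0 || index == s.Length - 1)
            {
                return false;
            }

            char next = s[index + 1];
            if (s[index] == '&')
            {
                if (next == '#') return true;   // "&#" sequence
            }
            else if (IsAtoZ(next) || next == '!' || next == '/' || next == '?')
            {
                return true;                    // "<x", "<!", "</" or "<?"
            }

            start = index + 1;                  // keep scanning after this character
        }
    }

    private static bool IsAtoZ(char c)
    {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
    }
}
```

Under these rules "1 < 2" and "R&D" pass, while "<b>" and "&#39;" are rejected.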
The System.Web.CrossSiteScriptingValidation type seems to have other methods in it for determining if things are dangerous URLs or valid JavaScript IDs, but those don't appear, at least through Reflector analysis, to result in throwing HttpRequestValidationExceptions.
Update:
Warning: Some parts of the code in the original answer (below) were removed and marked as OBSOLETE.
Latest source code in Microsoft site (has syntax highlighting):
http://referencesource.microsoft.com/#System.Web/CrossSiteScriptingValidation.cs
After checking the newest code, you will probably agree that the checks Travis Illig described are the only validations used now in 2018 (and there seem to have been no changes since 2014, when the source was released on GitHub). But the old code below may still be relevant if you use an older version of the framework.
Original Answer:
Using Reflector, I did some browsing. Here's the raw code. When I have time I will translate this into some meaningful rules:
The HttpRequestValidationException is thrown by only a single method in the System.Web namespace, so it's rather isolated. Here is the method:
private void ValidateString(string s, string valueName, string collectionName)
{
int matchIndex = 0;
if (CrossSiteScriptingValidation.IsDangerousString(s, out matchIndex))
{
string str = valueName + "=\"";
int startIndex = matchIndex - 10;
if (startIndex <= 0)
{
startIndex = 0;
}
else
{
str = str + "...";
}
int length = matchIndex + 20;
if (length >= s.Length)
{
length = s.Length;
str = str + s.Substring(startIndex, length - startIndex) + "\"";
}
else
{
str = str + s.Substring(startIndex, length - startIndex) + "...\"";
}
throw new HttpRequestValidationException(HttpRuntime.FormatResourceString("Dangerous_input_detected", collectionName, str));
}
}
That method above makes a call to the IsDangerousString method in the CrossSiteScriptingValidation class, which validates the string against a series of rules. It looks like the following:
internal static bool IsDangerousString(string s, out int matchIndex)
{
matchIndex = 0;
int startIndex = 0;
while (true)
{
int index = s.IndexOfAny(startingChars, startIndex);
if (index < 0)
{
return false;
}
if (index == (s.Length - 1))
{
return false;
}
matchIndex = index;
switch (s[index])
{
case 'E':
case 'e':
if (IsDangerousExpressionString(s, index))
{
return true;
}
break;
case 'O':
case 'o':
if (!IsDangerousOnString(s, index))
{
break;
}
return true;
case '&':
if (s[index + 1] != '#')
{
break;
}
return true;
case '<':
if (!IsAtoZ(s[index + 1]) && (s[index + 1] != '!'))
{
break;
}
return true;
case 'S':
case 's':
if (!IsDangerousScriptString(s, index))
{
break;
}
return true;
}
startIndex = index + 1;
}
}
That IsDangerousString method appears to be referencing a series of validation rules, which are outlined below:
private static bool IsDangerousExpressionString(string s, int index)
{
if ((index + 10) >= s.Length)
{
return false;
}
if ((s[index + 1] != 'x') && (s[index + 1] != 'X'))
{
return false;
}
return (string.Compare(s, index + 2, "pression(", 0, 9, true, CultureInfo.InvariantCulture) == 0);
}
-
private static bool IsDangerousOnString(string s, int index)
{
if ((s[index + 1] != 'n') && (s[index + 1] != 'N'))
{
return false;
}
if ((index > 0) && IsAtoZ(s[index - 1]))
{
return false;
}
int length = s.Length;
index += 2;
while ((index < length) && IsAtoZ(s[index]))
{
index++;
}
while ((index < length) && char.IsWhiteSpace(s[index]))
{
index++;
}
return ((index < length) && (s[index] == '='));
}
-
private static bool IsAtoZ(char c)
{
return (((c >= 'a') && (c <= 'z')) || ((c >= 'A') && (c <= 'Z')));
}
-
private static bool IsDangerousScriptString(string s, int index)
{
int length = s.Length;
if ((index + 6) >= length)
{
return false;
}
if ((((s[index + 1] != 'c') && (s[index + 1] != 'C')) || ((s[index + 2] != 'r') && (s[index + 2] != 'R'))) || ((((s[index + 3] != 'i') && (s[index + 3] != 'I')) || ((s[index + 4] != 'p') && (s[index + 4] != 'P'))) || ((s[index + 5] != 't') && (s[index + 5] != 'T'))))
{
return false;
}
index += 6;
while ((index < length) && char.IsWhiteSpace(s[index]))
{
index++;
}
return ((index < length) && (s[index] == ':'));
}
So there you have it. It's not pretty to decipher, but it's all there.
How about this script? Your code can not detect this script, right?
";}alert(1);function%20a(){//
Try this regular expression pattern.
You may need to escape the \ for JavaScript, e.g. \\
var regExpPattern = '[eE][xX][pP][rR][eE][sS][sS][iI][oO][nN]\\(|\\b[oO][nN][a-zA-Z]*\\b\\s*=|&#|<[!/a-zA-Z]|[sS][cC][rR][iI][pP][tT]\\s*:';
// Build the RegExp directly; RegExp.prototype.compile is deprecated
var re = new RegExp(regExpPattern, "gi");
var outString = null;
outString = re.exec(text);
Following on from Travis' answer, the list of 'dangerous' character sequences can be simplified as follows;
&#
<A through to <Z (upper and lower case)
<!
</
<?
Based on this, in an ASP.Net MVC web app the following Regex validation attribute can be used on a model field to trigger client side validation before an HttpRequestValidationException is thrown when the form is submitted;
[RegularExpression(@"^(?![\s\S]*(&#|<[a-zA-Z!\/?]))[\s\S]*$", ErrorMessage = "This field does not support HTML or allow any of the following character sequences; &quot;&amp;#&quot;, &quot;&lt;A&quot; through to &quot;&lt;Z&quot; (upper and lower case), &quot;&lt;!&quot;, &quot;&lt;/&quot; or &quot;&lt;?&quot;.")]
Note that validation attribute error messages are HTML encoded when output by server side validation, but not when used in client side validation, so this one is already encoded as we only intend to see it with client side validation.
From MSDN:
'The exception that is thrown when a potentially malicious input string is received from the client as part of the request data. '
Many times this happens when JavaScript changes the values of a server side control in a way that causes the ViewState to not agree with the posted data.