Why Regex IsMatch() hangs - c#

I have an exepression to validate an email address:
string REGEX_EMAIL = #"^\w+([\.\!\#\$\%\&\'\*\+\-\/\=\?\^\`\{\¦\}\~]*\w+)*#\w+([\.\-]\w+)*\.\w+([\.\-]\w+)*$";
If address is correct IsMatch() method quickly shows true result. But if address string is long and wrong this method hangs.
What can I do to raise speed of this method?
Thanks.

You are experiencing catastrophic backtracking.
The simplified regex:
Regex regexObj = new Regex(#"^\w+([-.!#$%&'*+/=?^`{¦}~]*\w+)*#\w+([.-]\w+)*\.\w+([.-]\w+)*$");
Has potential problems e.g. ([.-]\w+)*\.
If the . is missing for example and you have a long string of characters before it then all possible combinations must be taken into account for your regex to find out that it actually fails.

You have a couple things going on which are hurting the performance in this regular expression.
Catastrophic backtracking
Too many optional statements
You can definitely improve performance by using the + instead of the * in a few key places, but this of course changes what the regular expression will and won't match. So the easiest fix I found is actually covered in the catastrophic backtracking article above. You can use the nonbacktracking subexpression to drastically improve performance in this case, without changing the regular expression's behavior in any way that matters.
The nonbacktracking subexpression looks like this... (?>pattern)
So try this regular expression instead:
^\w+(?>[\.\!\#\$\%\&\'\*\+\-\/\=\?\^\`\{\¦\}\~]*\w+)*#\w+([\.\-]\w+)*\.\w+([\.\-]\w+)*$
On a slightly related topic, my philosophy for checking for a valid email address is a bit different. For one, long regular expressions like this one can potentially have performance problems as you've found.
Secondly, there's the upcoming promise of email address internationalization which complicates all of this even more.
Lastly, the main purpose of any regular expression based email validation is to catch typos and blatant attempts to get through your form without entering a real email address. But to check if an email address is genuine requires that you send an email to that address.
So my philosophy is to err on the side of accepting too much. And that, in fact, is a very simple thing to do...
^.+#.+\..+$
This should match any conceivably valid email address, and some invalid ones as well.

So, there is some backtracking issues. You can reduce those issues with a prudent use of the independent subexpression constructs, but you will still have issues because inner expressions won't have that constraint. The best thing to do is to separate the major parts.
Changing it to this helps alot (expanded):
^
(?>
\w+
(
[\.\!\#\$\%\&\'\*\+\-\/\=\?\^\`\{\¦\}\~]*
\w+
)*
#
\w+
)
(?>
([\.\-]\w+)*
\.
\w+
([\.\-]\w+)*
)
$
However, if you refactor to the equivalent expression by putting some well placed assertions, then re-adding the independent subexpressions grouping, you can virtually eliminate backtracking. Running this through my regex dubugger shows it takes only a very few steps to either pass or fail (expanded):
^
(?>
\w+
[\.\!\#\$\%\&\'\*\+\-\/\=\?\^\`\{\¦\}\~\w]*
(?<=\w)
#
\w+
)
(?=.*\.\w)
(?>
([\.\-]\w+)+
)
$

Related

Solve Catastrophic Backtracking in my regex detecting links?

I have an regex expression to find links in texts:
(?i)\\b((?:https?://|www\\d{0,3}[.]|[a-z0-9.\\-]+[.][a-z]{2,4}/)(?:[^'\"\\n\\r()<>]+|\\(([^'\"\\n\\r()<>]+|(\\([^'\"\\n\\r()<>]+\\)))*\\))+(?:\\(([^'\"\\n\\r()<>]+|(\\([^'\"\\n\\r()<>]+\\)))*\\)|[^'\"\\n\\r`!()\\[\\]{};:'.,<>?\u00AB\u00BB\u201C\u201D\u2018\u2019]))
But some ( in a link is causing an thread lock. Searching the Internet I've found some website suggesting that's a Catastrophic Backtracking problem. I've spent some time to optimize the pattern but it does not work. Any ideas?
Example input link that is causing the problem:
https://subdomain.domain.com/web/?id=-%c3%a1(%c2%81y%e2%80%9a%c3%a5d%e2%80%ba%c3%a8%c2%a7%c2%be.%c3%a9+%c2%a8
You should keep to the principle: all subsequent adjoining subpatterns cannot match at the same location in the string. If you quantify them with * or ?, make sure those obligatory patterns before them do not match the same text. Else, revamp the pattern. Or make use of atomic groupings.
The (?:https?://|www\d{0,3}[.]|[a-z0-9.-]+[.][a-z]{2,4}/) part is an alternation where both can match at the same location in the string. This cannot be avoided, so use an atomic group to prevent backtracking into the pattern.
Look at [a-z0-9.-]+[.], . is present in the + quantified character class. Make it more linear, replace with [a-z0-9-]*(?:\.[a-z0-9-]*)*\..
The (?:[^'"\n\r()<>]+|\(([^'"\n\r()<>]+|(\([^'"\n\r()<>]+\)))*\))+ part is a buggy pattern: [^'"\n\r()<>]+ is + quantified, and again, and it leads to situations when the regex engine reduces it to (?:a+)+, a classical CA scenario. Use atomic groupings if you do not want to re-vamp, although it seems to be a part of a pattern matching balanced parentheses and can be re-written as [^'"\n\r()<>]*(?:\((?>[^()]+|(?<o>\()|(?<-o>\)))*(?(o)(?!))\)[^'"\n\r()<>]*)*.
The ([^'"\n\r()<>]+|(\([^'"\n\r()<>]+\)))* part is similar to the part above, change ( to (?> where you quantify the group and the single obligatory pattern inside it.
The fixed pattern is
var pattern = #"(?i)\b((?>https?://|www\d{0,3}\.|[a-z0-9-]*(?:\.[a-z0-9-]*)*\.[a-z]{2,4}/)(?>[^'""\n\r()<>]+|\((?>[^'""\n\r()<>]+|\([^'""\n\r()<>]+\))*\))+(?:\((?>[^'""\n\r()<>]+|(\([^'""\n\r()<>]+\)))*\)|[^]['""\n\r`!(){};:.,<>?\u00AB\u00BB\u201C\u201D\u2018\u2019]))";
See how it fails gracefully here.

Regex to validate domain name with port

I am new developer and don't have much exposure on Regular Expression. Today I assigned to fix a bug using regex but after lots of effort I am unable to find the error.
Here is my requirement.
My code is:
string regex = "^([A-Za-z0-9\\-]+|[A-Za-z0-9]{1,3}\\.[A-Za-z0-9]{1,3}\\.[A-Za-z0-9] {1,3}\\.[A-Za-z0-9]{1,3}):([0-9]{1,5}|\\*)$";
Regex _hostEndPointRegex = new Regex(regex);
bool isTrue = _hostEndPointRegex.IsMatch(textBox1.Text);
It's throwing an error for the domain name like "nikhil-dev.in.abc.ni:8080".
I am not sure where the problem is.
Your regex is a bit redundant in that you or in some stuff that is already included in the other or block.
I just simplified what you had to
(?:[A-Za-z0-9-]+\.)+[A-Za-z0-9]{1,3}:\d{1,5}
and it works just fine...
I'm not sure why you had \ in the allowed characters as I am pretty sure \ is not allowed in a host name.
Your problem is that your or | breaks things up like this...
[A-Za-z0-9\\-]+
or
[A-Za-z0-9]{1,3}\\.[A-Za-z0-9]{1,3}\\.[A-Za-z0-9]{1,3}\\.[A-Za-z0-9]{1,3}
or
\*
Which as the commentor said was not including "-" in the 2nd block.
So perhaps you intended
^((?:[A-Za-z0-9\\-]+|[A-Za-z0-9]{1,3})\.[A-Za-z0-9]{1,3}\.[A-Za-z0-9]{1,3}\.[A-Za-z0-9]{1,3}):([0-9]{1,5}|\*)$
However the first to two or'ed items would be redundant as + includes {1-3}.
ie. [A-Za-z0-9\-]+ would also match anything that this matches [A-Za-z0-9]{1,3}
You can use this tool to help test your Regex:
http://regexpal.com/
Personally I think every developer should have regexbuddy
The regex above although it works will allow non-valid host names.
it should be modified to not allow punctuation in the first character.
So it should be modified to look like this.
(?:[A-Za-z0-9][A-Za-z0-9-]+\.)(?:[A-Za-z0-9-]+\.)+[A-Za-z0-9]{1,3}:\d{1,5}
Also in theory the host isn't allowed to end in a hyphen.
it is all so complicated I would use the regex only to capture the parts and then use Uri.CheckHostName to actually check the Uri is valid.
Or you can just use the regex suggested by CodeCaster

How does non-backtracking subexpression work "(?>exp)"

I am trying to become better at regular expressions. I am having a hard time trying to understand what does (?> expression ) means. Where can I find more info on non-backtacking subexpressoins? The description of THIS link says:
Greedy subexpression, also known as a non-backtracking subexpression.
This is matched only once and then does not participate in
backtracking.
this other link: http://msdn.microsoft.com/en-us/library/bs2twtah(v=vs.71).aspx has also a definition of non-backtracking subexpression but I still am having a hard time understanding what it means plus I cannot think of an example where I will use (?>exp)
As always, regular-expressions.info is a good place to start.
Use an atomic group if you want to make sure that whatever has once been matched will stay part of the match.
For example, to match a number of "words" that may or may not be separated by spaces, then followed by a colon, a user tried the regex:
(?:[A-Za-z0-9_.&,-]+\s*)+:
When there was a match, everything was fine. But when there wasn't, his PC would become non-responsive with 100% CPU load because of catastrophic backtracking because the regex engine would vainly try to find a matching combination of words that would allow the following colon to match. Which was of course impossible.
By using an atomic group, this could have been prevented:
(?>[A-Za-z0-9_.&,-]+\s*)+:
Now whatever has been matched stays matched - no backtracking and therefore fast failing times.
Another good example I recently came across:
If you want to match all numbers that are not followed by an ASCII letter, you might want to use the regex \d+(?![A-Za-z]). However, this will fail with inputs like 123a because the regex engine will happily return the match 12 by backtracking until the following character is no longer a letter. If you use (?>\d+)(?![A-Za-z]), this won't happen. (Of course, \d+(?![\dA-Za-z]) would also work)
The Regex Tutorial has a page on it here: http://www.regular-expressions.info/atomic.html
Basically what it does is discards backtracking information, meaning that a(?>bc|b)c matches abcc but not abc.
The reason it doesn't match the second string is because it finds a match with bc, and discards backtracking information about the bc|b alternation. It essentially forgets the |b part of it. Therefore, there is no c after the bc, and the match fails.
The most useful method of using atomic groups, as they are called, is to optimize slow regexes. You can find more information on the aforementioned page.
Read up on possessive quantifiers [a-z]*+ make the backtracking engine remember only the previous step that matched not all of the previous steps that matched.
This is useful when a lot of acceptable steps are probable and they will eat up memory if each step is stored for any possible backtracking regression.
Possessive quantifiers are a shorthand for atomic groups.

How to Find All Matches in Regular Expressions when one Overlaps OR Contains the Other?

The question of how to find every match when they might overlap was asked in Overlapping matches in Regex. However, as far as I can see, the answers there does not cover a more general case.
How can we find all substrings that begin with "a" and end with "z"? For example, given "akzzaz", it should find "akz", "akzz", "az" and "akzzaz".
Since there may be more than one match starting at the same position, ("akz" and "akzz") and also there may be more than one match ending at the same position ("az" and "akzzaz") I cannot see how using a lookahead or lookbehind helps as in the mentioned link. (Also, please bear in mind that in the general case "a" and "z" might be more complex regular expressions)
I use C#, so, in case it matters, having any feature specific to .Net Regular Expressions is OK.
Regular expressions are designed to find one match at a time. Even a global match operation is simply repeated applications of the same regex, each starting at the end of the previous match in the target string. So no, regexes are not able to find all matches in this way.
I will stick my neck out and say that I don't believe you can even find "all strings beginning with 'a' in 'akzzaz'" with a regex. /(a.*)/g will find the entire string, while /(a.*?)/g will find just 'a' twice.
The way I would code this would be to locate all 'a's, and search each of the substrings from there to the end of the string for all 'z's. So search 'akzzaz` and 'az' for 'z', giving 'akz', 'akzz', 'akzzaz', and 'az'. That is a fairly simple thing to do, but not a job for a regex unless the actual 'a' and 'z' tokens are complex.
For your current problem, string.startwith and string.endwith would do be a better job. Regular Expression is not necessarily faster in all cases.
Try this regular expression
a[akz]+z - in case a, k and z are the only characters
a[a-z]+z - in case of any alphabet
I think it's worth noting that there is actually a way for a regex to return more than one match at the same time. Although this doesn't answer your question, I think this would be a good place to mention this for others who may run into a similar situation.
The regex below for example would return all the right substrings of a string with a single match and has them in different capturing groups:
(?=(\w+)).
This regex uses capturing groups inside a zero-width assertion and for each match at position i(each character) the capturing group is a substring of length n-i.
Doing anything that would require the regex engine to stay in the same place after a match is probably overkill for a regular expression approach.

A beginner's guide to interpreting Regex?

Greetings.
I've been tasked with debugging part of an application that involves a Regex -- but, I have never dealt with Regex before. Two questions:
1) I know that the regexes are supposed to be testing whether or not two strings are equivalent, but what specifically do the two regex statements, below, mean in plain English?
2) Does anyone have a recommendation on websites / sources where I can learn more about Regexes? (preferably in C#)
if (Regex.IsMatch(testString, #"^(\s*?)(" + tag + #")(\s*?),", RegexOptions.IgnoreCase))
{
result = true;
}
else if (Regex.IsMatch(testString, #",(\s*?)(" + tag + #")(\s*?),", RegexOptions.IgnoreCase))
{
result = true;
}
It's going to be difficult to tell what that regex means, without knowing what's in tag. In fact, it looks like that regex is broken (or, at least, doesn't properly escape inputs).
Roughly speaking, for the first regex:
The ^ says to match at the beginning of the string.
The (...) sets up a capturing group (which is available, although this example apparently doesn't use it).
The \s matches any white space characters (spaces, tabs, etc.)
The *? matches zero or more of the previous character (in this case, whitespace), and because it has a question-mark, it matches the minimum number of characters needed to make the rest of the expression work.
The (" + tag + #") inserts the contents of the tag into the regex. As I mention, that's dangerous, without escaping.
The (\s*?) matches the same as the before (the minimum number of whitespace characters)
The , matches a trailing comma.
The second regex is very similar, but looks for a starting comma (rather than the beginning of the string).
I like the Python documentation for Regular Expressions, but it looks like this site
has a pretty good, basic introduction, with C# examples.
One word - Cribsheet (or is that two?) :)
I'm not c# savvy but I can recommend an awesome guide to regular expressions that I use for Bash and Java programming. It applies to pretty much all languages:
http://www.amazon.com/Mastering-Regular-Expressions-Jeffrey-Friedl/dp/0596528124/ref=tmm_pap_title_0
It is totally worth $30 to own this book. It is VERY thorough and helped my fundamental understanding of Regex a lot.
-Ryan
Since you specifically tagged C#, I recommend the Regex Hero as a tool you can use to play around with them since it's running on .NET. It also lets you toggle the different RegexOptions flags as you would pass them into the constructor when creating a new Regex.
Also, if you're using a version of Visual Studio 2010 that supports extensions, I would take a look at the Regex Editor extension... it will popup whenever you type new Regex( and offer you some guidance and autocomplete for your regex pattern.
Using The Regex Coach
The regular expression is a sequence consisting of the expression '(\s*?)', the expression '(tag)', the expression '(\s*?)', and the character ','.
where (\s*?) is defined as The regular expression is a repetition which matches a whitespace character as often as necessary.
the second one matches a , at the start too
As for good learning websites, I like www.regular-expressions.info/
Super simple version:
At the start of a string 0 or more spaces, whatever Tag is, 0 or More spaces, a comma.
the second one is
a comma, 0 or more spaces, whatever Tag is, 0 or More spaces, a comma.
Once you have the very basic idea about regex (it's full of resources over there) I recommend you to use Expresso for creating your regular expressions.
Expresso editor is equally suitable as a teaching tool for the beginning user of regular expressions or as a full-featured development environment for the experienced programmer or web designer with an extensive knowledge of regular expressions.
Your premise is not correct. Regular expressions are not used to tell if two strings are equivalent, but rather if the input string matches a certain pattern.
The first test above looks for any text that does not contain "zero or more whitespace charaters" searching "non-greedy". Then matches the text of the variable "tag" in the middle, then "zero or more whitespace characters, non greedy" again.
The second one is very similar, except that it allows for beginning whitespace as long as it follows a comma.
It is hard to explain "non-greedy" in this context, especially involving whitespace characters, so look here for more information.
A regular expression is a way to describe a set of strings that have some particular characteristics.
They don't merely need just to compare two strings.. what you usually do it to test if a string matches a particular regular expression. They can also be used to do simple parsing of a string in tokens that respect some patterns..
The good thing about regexps is that they allow you to express certain constraints inside a string keeping it general and able to match a group of strings that respect those constraints.. then they follow a formal specification that doesn't leave ambiguities around..
Here you can find a comparison table of various regular expression languages in many different programming languages and a specific guide for C# if you follow its link.
Usually the implementations for the various languages are quite similar since the syntax is somewhat standardized from the theoretical topics regexps come from, so any tutorial about regexp will be fine, then you'll just need to get into C# API.
1) The first regex is trying to do a case-insensitive match starting at the beginning of the test string. It then matches optional whitespace, followed by whatever is in tag, followed by optional whitespace then finally a comma.
The second matches a string containing a comma, followed by optional whitespace, followed by whatever is in tag, followed by optional whitespace then finally a comma.
Thought it's for C# I recommend picking up the Perl Pocket Reference which has a great Regex syntax reference. It helped my out a lot when I was learning regexes 14 years ago.
http://www.myregextester.com/ is a decent regular expression tester that also has an explain option for C# regexps - For Instance check out this example:
The regular expression:
(?-imsx:^(\s*?)(tagtext)(\s*?),)
matches as follows:
NODE EXPLANATION
----------------------------------------------------------------------
(?-imsx: group, but do not capture (case-sensitive)
(with ^ and $ matching normally) (with . not
matching \n) (matching whitespace and #
normally):
----------------------------------------------------------------------
^ the beginning of the string
----------------------------------------------------------------------
( group and capture to \1:
----------------------------------------------------------------------
\s*? whitespace (\n, \r, \t, \f, and " ") (0
or more times (matching the least amount
possible))
----------------------------------------------------------------------
) end of \1
----------------------------------------------------------------------
( group and capture to \2:
----------------------------------------------------------------------
tagtext 'tagtext'
----------------------------------------------------------------------
) end of \2
----------------------------------------------------------------------
( group and capture to \3:
----------------------------------------------------------------------
\s*? whitespace (\n, \r, \t, \f, and " ") (0
or more times (matching the least amount
possible))
----------------------------------------------------------------------
) end of \3
----------------------------------------------------------------------
, ','
----------------------------------------------------------------------
) end of grouping
----------------------------------------------------------------------
A regular expression does not tell you if two strings match, but rather if a given string matches a pattern.
This site is my favorite for learning and testing regular expressions:
http://gskinner.com/RegExr/
It allows you to interactively test regular expressions as you write them, and provides a built-in tutorial.
Although it doesn't use C#, Rejex is a simple tool for testing and learning about regular expressions which includes a quick reference for the special characters
It looks like that they are trying to match some kind of list of words delimited by colons (UPDATE: commas).
The first one is probably matching first item and the second one some item after the first one excluding the last one. I hope you will understand :).
A good source of information about regular expressions is at http://www.regular-expressions.info/
also a great site to test your regular expressions with extra info: http://regex101.com/

Categories