I have the following REGEX expression (that works) to allow Alpha-Numeric (as well as ' and -) and no double spacing:
^([a-zA-Z0-9'-]+\s?)*$
Due to the nested grouping, this allows Catastrophic Backtracking to happen - which is bad!
How can I simplify this expression to avoid Catastrophic Backtracking??
(Ideally this wouldn't allow white-space in first and last characters either)
Explanation
A nested group doesn't automatically cause catastrophic backtracking. In your case, it happens because your regex degenerates to the classical example of catastrophic backtracking, (a*)*.
Since \s is optional in ^([a-zA-Z0-9'-]+\s?)*$, on input that contains no spaces but does contain characters outside the allowed list, the regex simply degenerates to ^([a-zA-Z0-9'-]+)*$.
You can also think in terms of the expansion of the original regex:
[a-zA-Z0-9'-]+\s?[a-zA-Z0-9'-]+\s?[a-zA-Z0-9'-]+\s?[a-zA-Z0-9'-]+\s?...
Since \s is optional, we can remove it:
[a-zA-Z0-9'-]+[a-zA-Z0-9'-]+[a-zA-Z0-9'-]+[a-zA-Z0-9'-]+...
And we get a series of consecutive [a-zA-Z0-9'-]+, which will try all ways to distribute the characters between themselves and blow up the complexity.
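To get a feel for how fast "all ways to distribute the characters" grows, here is a small Python sketch that counts the ways to cut a run of n allowed characters into consecutive non-empty pieces. The function name count_splits is made up for illustration, and the count is only a rough model of the work a backtracking engine may do on a failing input, not a measurement of any particular engine.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def count_splits(n):
    """Count the ways to cut n characters into consecutive non-empty runs."""
    if n == 0:
        return 1  # nothing left to distribute: exactly one (empty) way
    # The first run takes k characters; the remainder is split recursively.
    return sum(count_splits(n - k) for k in range(1, n + 1))

for n in (5, 10, 20):
    print(n, count_splits(n))  # grows as 2 ** (n - 1)
```

Each extra character doubles the count, which is why even a moderately long invalid input can appear to hang.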
Solution
The standard way to write a regex to match token delimiter token ... delimiter token is token (delimiter token)*. While it is possible to rewrite the regex to avoid repeating token, I'd recommend against it, since it is harder to get right. To avoid repetition, you might want to construct the regex by string concatenation instead.
Following the recipe above:
^[a-zA-Z0-9'-]+(\s[a-zA-Z0-9'-]+)*$
Although you can see repetition in repetition here, there is no catastrophic backtracking, since the regex can only expand to:
[a-zA-Z0-9'-]+\s[a-zA-Z0-9'-]+\s[a-zA-Z0-9'-]+\s[a-zA-Z0-9'-]+...
And \s and [a-zA-Z0-9'-] are mutually exclusive - there is only one way to match any string.
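As a rough illustration of the recipe, the following Python sketch builds the pattern by string concatenation so the token is written only once. The names TOKEN, DELIMITER and is_valid are assumptions for the example, and fullmatch is used in place of the ^ and $ anchors.

```python
import re

# Token and delimiter are written once and concatenated into
# token(delimiter token)*, as suggested above.
TOKEN = r"[a-zA-Z0-9'-]+"
DELIMITER = r"\s"
PATTERN = re.compile(TOKEN + "(?:" + DELIMITER + TOKEN + ")*")

def is_valid(text):
    """True if text is tokens separated by single whitespace characters."""
    return PATTERN.fullmatch(text) is not None

print(is_valid("O'Brien-Smith 42"))   # valid: single space between tokens
print(is_valid("two  spaces"))        # invalid: double space
print(is_valid("a" * 5000 + "!"))     # invalid, and rejected without blowing up
```

Note that, unlike the original pattern, this one rejects the empty string and leading or trailing whitespace, which is what the question asked for.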
Related
I have a regex to find links in text:
(?i)\\b((?:https?://|www\\d{0,3}[.]|[a-z0-9.\\-]+[.][a-z]{2,4}/)(?:[^'\"\\n\\r()<>]+|\\(([^'\"\\n\\r()<>]+|(\\([^'\"\\n\\r()<>]+\\)))*\\))+(?:\\(([^'\"\\n\\r()<>]+|(\\([^'\"\\n\\r()<>]+\\)))*\\)|[^'\"\\n\\r`!()\\[\\]{};:'.,<>?\u00AB\u00BB\u201C\u201D\u2018\u2019]))
But some ( in a link is causing a thread lock. Searching the Internet, I found some websites suggesting that it's a catastrophic backtracking problem. I've spent some time trying to optimize the pattern, but it does not work. Any ideas?
Example input link that is causing the problem:
https://subdomain.domain.com/web/?id=-%c3%a1(%c2%81y%e2%80%9a%c3%a5d%e2%80%ba%c3%a8%c2%a7%c2%be.%c3%a9+%c2%a8
You should keep to the principle: all subsequent adjoining subpatterns cannot match at the same location in the string. If you quantify them with * or ?, make sure those obligatory patterns before them do not match the same text. Else, revamp the pattern. Or make use of atomic groupings.
The (?:https?://|www\d{0,3}[.]|[a-z0-9.-]+[.][a-z]{2,4}/) part is an alternation where both can match at the same location in the string. This cannot be avoided, so use an atomic group to prevent backtracking into the pattern.
Look at [a-z0-9.-]+[.], . is present in the + quantified character class. Make it more linear, replace with [a-z0-9-]*(?:\.[a-z0-9-]*)*\..
The (?:[^'"\n\r()<>]+|\(([^'"\n\r()<>]+|(\([^'"\n\r()<>]+\)))*\))+ part is a buggy pattern: [^'"\n\r()<>]+ is + quantified and sits inside a group that is itself + quantified, so the regex engine can reduce it to (?:a+)+, a classic catastrophic backtracking scenario. Use atomic groupings if you do not want to revamp it, although it seems to be part of a pattern matching balanced parentheses and can be rewritten as [^'"\n\r()<>]*(?:\((?>[^()]+|(?<o>\()|(?<-o>\)))*(?(o)(?!))\)[^'"\n\r()<>]*)*.
The ([^'"\n\r()<>]+|(\([^'"\n\r()<>]+\)))* part is similar to the part above, change ( to (?> where you quantify the group and the single obligatory pattern inside it.
The fixed pattern is
var pattern = @"(?i)\b((?>https?://|www\d{0,3}\.|[a-z0-9-]*(?:\.[a-z0-9-]*)*\.[a-z]{2,4}/)(?>[^'""\n\r()<>]+|\((?>[^'""\n\r()<>]+|\([^'""\n\r()<>]+\))*\))+(?:\((?>[^'""\n\r()<>]+|(\([^'""\n\r()<>]+\)))*\)|[^]['""\n\r`!(){};:.,<>?\u00AB\u00BB\u201C\u201D\u2018\u2019]))";
See how it fails gracefully here.
Not sure if it is possible, I need to improve my RegExp. I have the following:
\b(?:http|https)://www\.domain\.co\.za/.*
It is fine for all my purposes except I would like it to also validate for:
http://www.domain.co.za (no trailing slash)
But should NOT validate for:
http://www.domain.co.zaaaaa
And then This Expression:
\b(?:https?)://[.0-9a-z-]*domain\.co\.za
To validate for (Currently Working)
http://domain.co.za
http://sub1.domain.co.za
http://wwww.domain.co.za
But it should NOT validate for:
http://abcdomain.co.za
That's pretty easy:
\b(?:http|https)://www\.domain\.co\.za/?\b
.* is useless since it always matches, I just removed it, made the / optional and inserted a \b.
The second case is similar:
\b(?:https?)://[.0-9a-z-]*\bdomain\.co\.za
Just use that magic \b :)
Or, if you want a more strict pattern, this would be better:
\b(?:https?)://(?>[0-9a-z-]+\.)*domain\.co\.za
because it enforces runs of characters separated by . for the subdomains. The atomic group ((?>...)) is here to avoid catastrophic backtracking.
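A quick way to check these patterns against the URLs from the question is a small Python sketch like the one below. The helper name accepts is made up, and the strict variant is written with a plain (?:...) group instead of the atomic group so that it runs on Python versions whose re module lacks (?>...); that substitution is my assumption, and it should stay safe here because [0-9a-z-]+ and \. cannot match the same character.

```python
import re

# Patterns from the answer, unchanged.
WWW_ONLY = re.compile(r"\b(?:http|https)://www\.domain\.co\.za/?\b")
ANY_SUBDOMAIN = re.compile(r"\b(?:https?)://[.0-9a-z-]*\bdomain\.co\.za")
# Stricter variant; (?:...) stands in for the atomic group (assumption).
STRICT = re.compile(r"\b(?:https?)://(?:[0-9a-z-]+\.)*domain\.co\.za")

def accepts(pattern, url):
    """True if the pattern finds a match anywhere in url."""
    return pattern.search(url) is not None

for url in ("http://www.domain.co.za", "http://www.domain.co.zaaaaa"):
    print(url, accepts(WWW_ONLY, url))
for url in ("http://sub1.domain.co.za", "http://abcdomain.co.za",
            "http://abc-domain.co.za"):
    print(url, accepts(ANY_SUBDOMAIN, url), accepts(STRICT, url))
```

The last URL shows the difference between the two subdomain patterns: \b treats the hyphen as a boundary, so the \b version accepts abc-domain.co.za while the strict version does not.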
I am trying to become better at regular expressions. I am having a hard time understanding what (?> expression ) means. Where can I find more info on non-backtracking subexpressions? The description of THIS link says:
Greedy subexpression, also known as a non-backtracking subexpression.
This is matched only once and then does not participate in
backtracking.
this other link: http://msdn.microsoft.com/en-us/library/bs2twtah(v=vs.71).aspx has also a definition of non-backtracking subexpression but I still am having a hard time understanding what it means plus I cannot think of an example where I will use (?>exp)
As always, regular-expressions.info is a good place to start.
Use an atomic group if you want to make sure that whatever has once been matched will stay part of the match.
For example, to match a number of "words" that may or may not be separated by spaces, then followed by a colon, a user tried the regex:
(?:[A-Za-z0-9_.&,-]+\s*)+:
When there was a match, everything was fine. But when there wasn't, his PC would become non-responsive with 100% CPU load because of catastrophic backtracking: the regex engine would vainly try to find a matching combination of words that would allow the following colon to match, which was of course impossible.
By using an atomic group, this could have been prevented:
(?>[A-Za-z0-9_.&,-]+\s*)+:
Now whatever has been matched stays matched - no backtracking and therefore fast failing times.
Another good example I recently came across:
If you want to match all numbers that are not followed by an ASCII letter, you might want to use the regex \d+(?![A-Za-z]). However, this will fail with inputs like 123a because the regex engine will happily return the match 12 by backtracking until the following character is no longer a letter. If you use (?>\d+)(?![A-Za-z]), this won't happen. (Of course, \d+(?![\dA-Za-z]) would also work)
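The backtracking effect described here can be reproduced without atomic groups at all. The Python sketch below compares the naive lookahead with the widened one; the sample string and the names NAIVE and FIXED are made up for the example.

```python
import re

text = "123a 456 78b"

# Naive version: \d+ may give back digits until the lookahead succeeds.
NAIVE = re.compile(r"\d+(?![A-Za-z])")
# Widened lookahead: a shortened match is always followed by a digit, so it fails too.
FIXED = re.compile(r"\d+(?![\dA-Za-z])")

print(NAIVE.findall(text))  # also returns partial numbers such as "12"
print(FIXED.findall(text))  # only numbers not followed by a letter
```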
The Regex Tutorial has a page on it here: http://www.regular-expressions.info/atomic.html
Basically, what it does is discard backtracking information, meaning that a(?>bc|b)c matches abcc but not abc.
The reason it doesn't match the second string is because it finds a match with bc, and discards backtracking information about the bc|b alternation. It essentially forgets the |b part of it. Therefore, there is no c after the bc, and the match fails.
The most useful method of using atomic groups, as they are called, is to optimize slow regexes. You can find more information on the aforementioned page.
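The abcc versus abc behaviour can be tried out in Python as well. Python's re module accepts (?>...) directly only from version 3.11, so this sketch instead uses a well-known emulation, a lookahead that captures followed by a backreference; that emulation is my substitution and not something the answers above use. The names PLAIN and ATOMIC_LIKE are illustrative.

```python
import re

# Ordinary group: the engine may go back and try the |b alternative.
PLAIN = re.compile(r"a(?:bc|b)c")
# Emulated atomic group: the lookahead commits to its first successful
# alternative, and \1 then consumes exactly that text.
ATOMIC_LIKE = re.compile(r"a(?=(bc|b))\1c")

for s in ("abcc", "abc"):
    print(s, bool(PLAIN.fullmatch(s)), bool(ATOMIC_LIKE.fullmatch(s)))
```

On abc the ordinary group falls back to b and matches, while the atomic-style version has already committed to bc and fails.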
Read up on possessive quantifiers: [a-z]*+ makes the backtracking engine remember only the previous step that matched, not all of the previous steps that matched.
This is useful when a lot of acceptable steps are probable and they will eat up memory if each step is stored for any possible backtracking regression.
Possessive quantifiers are a shorthand for atomic groups.
I have an expression to validate an email address:
string REGEX_EMAIL = @"^\w+([\.\!\#\$\%\&\'\*\+\-\/\=\?\^\`\{\¦\}\~]*\w+)*@\w+([\.\-]\w+)*\.\w+([\.\-]\w+)*$";
If the address is correct, the IsMatch() method quickly returns true. But if the address string is long and wrong, this method hangs.
What can I do to speed up this method?
Thanks.
You are experiencing catastrophic backtracking.
The simplified regex:
Regex regexObj = new Regex(@"^\w+([-.!#$%&'*+/=?^`{¦}~]*\w+)*@\w+([.-]\w+)*\.\w+([.-]\w+)*$");
Has potential problems e.g. ([.-]\w+)*\.
If the . is missing for example and you have a long string of characters before it then all possible combinations must be taken into account for your regex to find out that it actually fails.
You have a couple things going on which are hurting the performance in this regular expression.
Catastrophic backtracking
Too many optional statements
You can definitely improve performance by using the + instead of the * in a few key places, but this of course changes what the regular expression will and won't match. So the easiest fix I found is actually covered in the catastrophic backtracking article above. You can use the nonbacktracking subexpression to drastically improve performance in this case, without changing the regular expression's behavior in any way that matters.
The nonbacktracking subexpression looks like this... (?>pattern)
So try this regular expression instead:
^\w+(?>[\.\!\#\$\%\&\'\*\+\-\/\=\?\^\`\{\¦\}\~]*\w+)*@\w+([\.\-]\w+)*\.\w+([\.\-]\w+)*$
On a slightly related topic, my philosophy for checking for a valid email address is a bit different. For one, long regular expressions like this one can potentially have performance problems as you've found.
Secondly, there's the upcoming promise of email address internationalization which complicates all of this even more.
Lastly, the main purpose of any regular expression based email validation is to catch typos and blatant attempts to get through your form without entering a real email address. But to check if an email address is genuine requires that you send an email to that address.
So my philosophy is to err on the side of accepting too much. And that, in fact, is a very simple thing to do...
^.+@.+\..+$
This should match any conceivably valid email address, and some invalid ones as well.
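For reference, the permissive check (something, an @ sign, something, a dot, something) could look like this in Python. The names LOOSE_EMAIL and looks_like_email are made up for the sketch, and fullmatch replaces the ^ and $ anchors; it only catches obvious typos and is not a substitute for sending a confirmation email.

```python
import re

# Deliberately loose: at least one character, "@", at least one character,
# a literal dot, at least one character.
LOOSE_EMAIL = re.compile(r".+@.+\..+")

def looks_like_email(address):
    """True if address has the rough shape local@domain.tld."""
    return LOOSE_EMAIL.fullmatch(address) is not None

for candidate in ("user@example.com", "user@example", "example.com"):
    print(candidate, looks_like_email(candidate))
```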
So, there are some backtracking issues. You can reduce those issues with a prudent use of the independent subexpression constructs, but you will still have issues because inner expressions won't have that constraint. The best thing to do is to separate the major parts.
Changing it to this helps a lot (expanded):
^
(?>
\w+
(
[\.\!\#\$\%\&\'\*\+\-\/\=\?\^\`\{\¦\}\~]*
\w+
)*
@
\w+
)
(?>
([\.\-]\w+)*
\.
\w+
([\.\-]\w+)*
)
$
However, if you refactor to the equivalent expression by adding some well-placed assertions, then re-adding the independent subexpression grouping, you can virtually eliminate backtracking. Running this through my regex debugger shows it takes only a few steps to either pass or fail (expanded):
^
(?>
\w+
[\.\!\#\$\%\&\'\*\+\-\/\=\?\^\`\{\¦\}\~\w]*
(?<=\w)
@
\w+
)
(?=.*\.\w)
(?>
([\.\-]\w+)+
)
$