Solve Catastrophic Backtracking in my regex detecting links? - c#

I have a regex to find links in text:
(?i)\\b((?:https?://|www\\d{0,3}[.]|[a-z0-9.\\-]+[.][a-z]{2,4}/)(?:[^'\"\\n\\r()<>]+|\\(([^'\"\\n\\r()<>]+|(\\([^'\"\\n\\r()<>]+\\)))*\\))+(?:\\(([^'\"\\n\\r()<>]+|(\\([^'\"\\n\\r()<>]+\\)))*\\)|[^'\"\\n\\r`!()\\[\\]{};:'.,<>?\u00AB\u00BB\u201C\u201D\u2018\u2019]))
But a ( in some links causes the thread to hang. Searching the Internet, I found a website suggesting that this is a catastrophic backtracking problem. I've spent some time trying to optimize the pattern, but it still does not work. Any ideas?
Example input link that is causing the problem:
https://subdomain.domain.com/web/?id=-%c3%a1(%c2%81y%e2%80%9a%c3%a5d%e2%80%ba%c3%a8%c2%a7%c2%be.%c3%a9+%c2%a8

Keep to this principle: adjacent subpatterns should not be able to match at the same location in the string. If you quantify a subpattern with * or ?, make sure the obligatory patterns before it cannot match the same text. Otherwise, rework the pattern or use atomic groups.
The (?:https?://|www\d{0,3}[.]|[a-z0-9.-]+[.][a-z]{2,4}/) part is an alternation whose alternatives can match at the same location in the string. This cannot be avoided, so use an atomic group to prevent backtracking into the pattern.
Look at [a-z0-9.-]+[.]: the . is also present in the + quantified character class. Make it more linear by replacing it with [a-z0-9-]*(?:\.[a-z0-9-]*)*\. (the trailing \. is part of the pattern).
The (?:[^'"\n\r()<>]+|\(([^'"\n\r()<>]+|(\([^'"\n\r()<>]+\)))*\))+ part is a buggy pattern: [^'"\n\r()<>]+ is + quantified and sits inside a group that is + quantified again, so the regex engine can reduce it to (?:a+)+, a classic catastrophic backtracking scenario. Use atomic groups if you do not want to rework it, although it appears to be part of a pattern matching balanced parentheses and can be rewritten as [^'"\n\r()<>]*(?:\((?>[^()]+|(?<o>\()|(?<-o>\)))*(?(o)(?!))\)[^'"\n\r()<>]*)*.
The ([^'"\n\r()<>]+|(\([^'"\n\r()<>]+\)))* part is similar to the part above: change ( to (?> where you quantify the group and the single obligatory pattern inside it.
The fixed pattern is
var pattern = @"(?i)\b((?>https?://|www\d{0,3}\.|[a-z0-9-]*(?:\.[a-z0-9-]*)*\.[a-z]{2,4}/)(?>[^'""\n\r()<>]+|\((?>[^'""\n\r()<>]+|\([^'""\n\r()<>]+\))*\))+(?:\((?>[^'""\n\r()<>]+|(\([^'""\n\r()<>]+\)))*\)|[^]['""\n\r`!(){};:.,<>?\u00AB\u00BB\u201C\u201D\u2018\u2019]))";
See how it fails gracefully here.
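The balanced-parentheses rewrite mentioned above can be tried in isolation. The following is a minimal sketch, not the full link pattern: the class and method names are hypothetical, and the character classes are simplified to [^()] so that only the balancing-group mechanics remain.

```csharp
using System.Text.RegularExpressions;

public static class BalancedParensDemo
{
    // (?<o>\() pushes onto the "o" stack for every nested "(",
    // (?<-o>\)) pops it for every ")" and fails when the stack is empty,
    // (?(o)(?!)) fails if an opened "(" is still waiting to be closed.
    // Simplified for illustration: quotes, angle brackets and newlines are allowed here.
    private static readonly Regex Balanced = new Regex(
        @"^[^()]*(?:\((?>[^()]+|(?<o>\()|(?<-o>\)))*(?(o)(?!))\)[^()]*)*$");

    public static bool IsBalanced(string text) => Balanced.IsMatch(text);
}
```

Calling BalancedParensDemo.IsBalanced("a(b(c)d)e") should report a match, while "a(b(c)d" should not, because the outer parenthesis is never closed.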

Related

Regex hangs on simple pattern in C# [duplicate]

I have the following REGEX expression (that works) to allow Alpha-Numeric (as well as ' and -) and no double spacing:
^([a-zA-Z0-9'-]+\s?)*$
Due to the nested grouping, this allows Catastrophic Backtracking to happen - which is bad!
How can I simplify this expression to avoid Catastrophic Backtracking??
(Ideally this wouldn't allow white-space in first and last characters either)
Explanation
A nested group doesn't automatically cause catastrophic backtracking. In your case, it happens because your regex degenerates to the classic example of catastrophic backtracking, (a*)*.
Since \s is optional in ^([a-zA-Z0-9'-]+\s?)*$, on input that contains no spaces but does contain a character outside the allowed list, the regex simply degenerates to ^([a-zA-Z0-9'-]+)*$.
You can also think in terms of the expansion of the original regex:
[a-zA-Z0-9'-]+\s?[a-zA-Z0-9'-]+\s?[a-zA-Z0-9'-]+\s?[a-zA-Z0-9'-]+\s?...
Since \s is optional, we can remove it:
[a-zA-Z0-9'-]+[a-zA-Z0-9'-]+[a-zA-Z0-9'-]+[a-zA-Z0-9'-]+...
And we get a series of consecutive [a-zA-Z0-9'-]+, which will try every way to distribute the characters among themselves and blow up the complexity.
Solution
The standard way to write a regex to match token delimiter token ... delimiter token is token (delimiter token)*. While it is possible to rewrite the regex to avoid repeating token, I'd recommend against it, since it is harder to get right. To avoid the repetition in your source code, you might want to construct the regex by string concatenation instead.
Following the recipe above:
^[a-zA-Z0-9'-]+(\s[a-zA-Z0-9'-]+)*$
Although you can see repetition in repetition here, there is no catastrophic backtracking, since the regex can only expand to:
[a-zA-Z0-9'-]+\s[a-zA-Z0-9'-]+\s[a-zA-Z0-9'-]+\s[a-zA-Z0-9'-]+...
And \s and [a-zA-Z0-9'-] are mutually exclusive, so there is only one way to match any string.
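The recipe above, including the suggestion to build the pattern by string concatenation, could look roughly like this in C#. The class, constant, and method names are made up for illustration.

```csharp
using System.Text.RegularExpressions;

public static class NameValidator
{
    // The token is written once and reused, so the two copies cannot drift apart.
    private const string Token = @"[a-zA-Z0-9'-]+";
    private const string Delimiter = @"\s";

    // Expands to ^[a-zA-Z0-9'-]+(?:\s[a-zA-Z0-9'-]+)*$
    private static readonly Regex Pattern =
        new Regex("^" + Token + "(?:" + Delimiter + Token + ")*$");

    public static bool IsValid(string input) => Pattern.IsMatch(input);
}
```

Because every repetition must start with the delimiter, leading whitespace, trailing whitespace, and double spacing are all rejected, and a long invalid input such as a run of letters followed by ! is rejected without the exponential blow-up.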

regex to find incomplete xml tags in c#

I'm trying to use a regular expression to find incomplete XML tags that have no attributes. So far, I've managed to come up with this regex: </?\s*([a-zA-Z0-9]?:\s+)?[a-zA-Z0-9]*(?!>), but it doesn't do the trick.
In an xml like this one:
<abc>
</abc>
<ab>
</ab
<s:ab
I want to match </ab and <s:ab (as they're both lacking ">" at the end). Is there a way to do this using regular expressions in c#?
You are pretty close. Your major problem is that the pattern backtracks when the negative lookahead fails. You can avoid that by putting the part before the lookahead in a non-backtracking atomic group: (?>no backtracking in here).
For example:
(?xi) # turn on eXtended (ignore spaces/comments) and case-Insensitive mode
(?> # don't backtrack
< /? # tag start (no space allowed after it)
[a-z0-9]+ # tag name/space
(?: : [a-z0-9]+ )?
\s* # optional spaces
)
(?! > ) # no ending
Note that this will match <foo in <foo bar>.
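As a rough sketch of how the commented pattern above might be used from C# (the class and method names are hypothetical), RegexOptions.IgnorePatternWhitespace and RegexOptions.IgnoreCase stand in for the inline (?xi) flags. Since \s* sits inside the atomic group, a match can end with whitespace, so the sketch trims it.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class IncompleteTagFinder
{
    private static readonly Regex Incomplete = new Regex(@"
        (?>                      # don't backtrack
            < /?                 # tag start (no space allowed after it)
            [a-z0-9]+            # tag name or namespace prefix
            (?: : [a-z0-9]+ )?   # optional name after the prefix
            \s*                  # optional spaces
        )
        (?! > )                  # no ending",
        RegexOptions.IgnorePatternWhitespace | RegexOptions.IgnoreCase);

    public static List<string> Find(string xml)
    {
        var result = new List<string>();
        foreach (Match m in Incomplete.Matches(xml))
        {
            // The atomic group may have swallowed trailing whitespace or a newline.
            result.Add(m.Value.TrimEnd());
        }
        return result;
    }
}
```

For the sample input from the question, joined with \n line breaks, Find should return </ab and <s:ab. For <foo bar> it returns <foo, as noted above.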
If you are just trying to find errors in a single XML file, try opening it in the Google Chrome web browser; it will show the line where the error is.
But if you have lots of files that you have to process in code, then you need something more powerful than regexes.
As people have said, this is probably a fruitless endeavor, as XML is not a regular language. However, part of your problem is your lookahead. You only ensure that the match is not immediately followed by a closing angle bracket, which means things like <ab in <abc> will match even when you don't want them to. So you need to include the entire tag structure in your lookahead.
To get a match for the exact data you gave, I could use the regular expression:
#</?([a-z]?:)?[a-z]*(?!/?([a-z]?:)?[a-z]*>)#
Which you can see in action here. The key here is to make sure that at no point can the regular expressions engine backtrack (by say, dropping one character) to validate the lookahead. There are other ways to do this - such as possessive quantifiers, which refuse to give up their matched token in a normal backtracking process, but the standard .NET engine doesn't support possessive matching. It does support an atomic group - which behaves the same way, but using a group instead of a quantifier. You can see here that I've wrapped the entire opening of the tag in an atomic group. ((?> ... ))
#(?></?([a-z]?:)?[a-z]*)(?!>)#
You're free to enter your own regular expression for how a tag ought to be formatted, but I must say that this regular expression is already pushing the limits for readable code, and messing about with legal xml tag names is going to push it further in that direction. Nevertheless, I hope this has helped shed some light on the error.

How does non-backtracking subexpression work "(?>exp)"

I am trying to become better at regular expressions. I am having a hard time understanding what (?> expression ) means. Where can I find more info on non-backtracking subexpressions? The description at THIS link says:
Greedy subexpression, also known as a non-backtracking subexpression.
This is matched only once and then does not participate in
backtracking.
This other link, http://msdn.microsoft.com/en-us/library/bs2twtah(v=vs.71).aspx, also has a definition of non-backtracking subexpression, but I am still having a hard time understanding what it means, and I cannot think of an example where I would use (?>exp).
As always, regular-expressions.info is a good place to start.
Use an atomic group if you want to make sure that whatever has once been matched will stay part of the match.
For example, to match a number of "words" that may or may not be separated by spaces, then followed by a colon, a user tried the regex:
(?:[A-Za-z0-9_.&,-]+\s*)+:
When there was a match, everything was fine. But when there wasn't, his PC would become unresponsive with 100% CPU load because of catastrophic backtracking: the regex engine would vainly try every combination of words that might allow the following colon to match, which was of course impossible.
By using an atomic group, this could have been prevented:
(?>[A-Za-z0-9_.&,-]+\s*)+:
Now whatever has been matched stays matched - no backtracking and therefore fast failing times.
Another good example I recently came across:
If you want to match all numbers that are not followed by an ASCII letter, you might want to use the regex \d+(?![A-Za-z]). However, this will fail with inputs like 123a because the regex engine will happily return the match 12 by backtracking until the following character is no longer a letter. If you use (?>\d+)(?![A-Za-z]), this won't happen. (Of course, \d+(?![\dA-Za-z]) would also work)
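The difference between the two number patterns can be checked with a short sketch like the one below; the class and method names are made up for illustration.

```csharp
using System.Text.RegularExpressions;

public static class DigitsDemo
{
    // Without the atomic group, \d+ may give back digits until the lookahead succeeds.
    public static string PlainMatch(string input) =>
        Regex.Match(input, @"\d+(?![A-Za-z])").Value;

    // With the atomic group, \d+ keeps every digit it matched, so the lookahead decides.
    public static bool AtomicIsMatch(string input) =>
        Regex.IsMatch(input, @"(?>\d+)(?![A-Za-z])");
}
```

For the input 123a, PlainMatch returns 12, while AtomicIsMatch returns false. For an input such as "123 ", the atomic version matches as expected.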
The Regex Tutorial has a page on it here: http://www.regular-expressions.info/atomic.html
Basically, it discards backtracking information, meaning that a(?>bc|b)c matches abcc but not abc.
The reason it doesn't match the second string is because it finds a match with bc, and discards backtracking information about the bc|b alternation. It essentially forgets the |b part of it. Therefore, there is no c after the bc, and the match fails.
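To see this behaviour directly, a small sketch (with hypothetical names) can compare the atomic group against an ordinary non-capturing group on the same inputs.

```csharp
using System.Text.RegularExpressions;

public static class AtomicAlternationDemo
{
    // Atomic: once bc has matched, the b alternative is forgotten.
    public static bool Atomic(string input) => Regex.IsMatch(input, @"a(?>bc|b)c");

    // Non-atomic: the engine may go back and try the b alternative.
    public static bool Plain(string input) => Regex.IsMatch(input, @"a(?:bc|b)c");
}
```

Atomic("abcc") and Plain("abcc") both return true, but for abc only Plain succeeds, because it can fall back to the b alternative and leave the final c for the end of the pattern.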
The most useful method of using atomic groups, as they are called, is to optimize slow regexes. You can find more information on the aforementioned page.
Read up on possessive quantifiers: [a-z]*+ makes the backtracking engine remember only the previous step that matched, not all of the previous steps that matched.
This is useful when a lot of acceptable steps are probable and they will eat up memory if each step is stored for any possible backtracking regression.
Possessive quantifiers are a shorthand for atomic groups.

Why Regex IsMatch() hangs

I have an expression to validate an email address:
string REGEX_EMAIL = @"^\w+([\.\!\#\$\%\&\'\*\+\-\/\=\?\^\`\{\¦\}\~]*\w+)*@\w+([\.\-]\w+)*\.\w+([\.\-]\w+)*$";
If the address is correct, the IsMatch() method quickly returns true. But if the address string is long and wrong, this method hangs.
What can I do to speed up this method?
Thanks.
You are experiencing catastrophic backtracking.
The simplified regex:
Regex regexObj = new Regex(@"^\w+([-.!#$%&'*+/=?^`{¦}~]*\w+)*@\w+([.-]\w+)*\.\w+([.-]\w+)*$");
Has potential problems e.g. ([.-]\w+)*\.
If the . is missing, for example, and a long string of characters precedes it, then all possible combinations must be tried before the regex can determine that the match fails.
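Independently of fixing the pattern, .NET lets you bound how long a single match may run. This is not something the answers here propose; it is a defensive sketch, with hypothetical names and an arbitrary 250 ms limit, that uses the Regex constructor overload taking a match timeout.

```csharp
using System;
using System.Text.RegularExpressions;

public static class EmailCheckWithTimeout
{
    // The simplified pattern from above, unchanged, but with a match timeout.
    // The 250 ms limit is an arbitrary value chosen for illustration.
    private static readonly Regex Email = new Regex(
        @"^\w+([-.!#$%&'*+/=?^`{¦}~]*\w+)*@\w+([.-]\w+)*\.\w+([.-]\w+)*$",
        RegexOptions.None,
        TimeSpan.FromMilliseconds(250));

    // Returns null when the match was abandoned because it exceeded the timeout.
    public static bool? TryIsMatch(string address)
    {
        try
        {
            return Email.IsMatch(address);
        }
        catch (RegexMatchTimeoutException)
        {
            return null;
        }
    }
}
```

A timeout does not remove the backtracking problem; it only turns a hang into a result the caller can handle, for example by treating null as invalid.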
You have a couple of things going on that hurt the performance of this regular expression:
Catastrophic backtracking
Too many optional statements
You can definitely improve performance by using the + instead of the * in a few key places, but this of course changes what the regular expression will and won't match. So the easiest fix I found is actually covered in the catastrophic backtracking article above. You can use the nonbacktracking subexpression to drastically improve performance in this case, without changing the regular expression's behavior in any way that matters.
The nonbacktracking subexpression looks like this... (?>pattern)
So try this regular expression instead:
^\w+(?>[\.\!\#\$\%\&\'\*\+\-\/\=\?\^\`\{\¦\}\~]*\w+)*@\w+([\.\-]\w+)*\.\w+([\.\-]\w+)*$
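A minimal sketch of the atomic-group version in use follows. The names are hypothetical, and the character class is written without the redundant escapes, which is assumed to be equivalent to the escaped form above.

```csharp
using System.Text.RegularExpressions;

public static class AtomicEmailValidator
{
    // Same structure as the pattern above; only the first repeated group is atomic.
    private static readonly Regex Email = new Regex(
        @"^\w+(?>[-.!#$%&'*+/=?^`{¦}~]*\w+)*@\w+([.-]\w+)*\.\w+([.-]\w+)*$");

    public static bool IsValid(string address) => Email.IsMatch(address);
}
```

With the atomic group, each repetition keeps the word characters it consumed, so a long invalid input such as 200 letters followed by ! is rejected after a bounded amount of work instead of an exponential search.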
On a slightly related topic, my philosophy for checking for a valid email address is a bit different. For one, long regular expressions like this one can potentially have performance problems as you've found.
Secondly, there's the upcoming promise of email address internationalization which complicates all of this even more.
Lastly, the main purpose of any regular expression based email validation is to catch typos and blatant attempts to get through your form without entering a real email address. But to check if an email address is genuine requires that you send an email to that address.
So my philosophy is to err on the side of accepting too much. And that, in fact, is a very simple thing to do...
^.+@.+\..+$
This should match any conceivably valid email address, and some invalid ones as well.
So, there are some backtracking issues. You can reduce them with prudent use of independent subexpression constructs, but you will still have issues because the inner expressions won't have that constraint. The best thing to do is to separate the major parts.
Changing it to this helps a lot (expanded):
^
(?>
\w+
(
[\.\!\#\$\%\&\'\*\+\-\/\=\?\^\`\{\¦\}\~]*
\w+
)*
@
\w+
)
(?>
([\.\-]\w+)*
\.
\w+
([\.\-]\w+)*
)
$
However, if you refactor to the equivalent expression by adding some well-placed assertions and then re-adding the independent subexpression grouping, you can virtually eliminate backtracking. Running this through my regex debugger shows it takes only a few steps to either pass or fail (expanded):
^
(?>
\w+
[\.\!\#\$\%\&\'\*\+\-\/\=\?\^\`\{\¦\}\~\w]*
(?<=\w)
@
\w+
)
(?=.*\.\w)
(?>
([\.\-]\w+)+
)
$
