Regex IsMatch taking too long to execute - c#

I have one strange issue on my .NET project with RegEx. Please, see C# code below:
const string PATTERN = @"^[a-zA-Z]([-\s\.a-zA-Z]*('(?!'))?[-\s\.a-zA-Z]*)*$";
const string VALUE = "Ingebrigtsen Myre (Øvre)";
System.Text.RegularExpressions.Regex regex = new System.Text.RegularExpressions.Regex(PATTERN);
if (!regex.IsMatch(VALUE)) // <--- Infinite loop here
return string.Empty;
// Some other code
I use this pattern to validate all types of names (first names, last names, middle names, etc.). The value is a parameter, but I provided it as a constant above, because the issue is not reproduced often - only with special symbols: *, (, ), etc. (sorry, but I don't have the full list of these symbols).
Can you help me to fix this infinite loop? Thanks for any help.
Added: this code is placed on the very base level of project and I don't want to do any refactoring there - I just want to have quick fix for this issue.
Added 2: I do know that it technically is not a loop - I meant that "regex.IsMatch(VALUE)" never ends. I waited for about an hour and it was still executing.

Your non-trivial regex: ^[a-zA-Z]([-\s\.a-zA-Z]*('(?!'))?[-\s\.a-zA-Z]*)*$, is better written with comments in free-spacing mode like so:
Regex re_orig = new Regex(@"
^ # Anchor to start of string.
[a-zA-Z] # First char must be letter.
( # $1: Zero or more additional parts.
[-\s\.a-zA-Z]* # Zero or more valid name chars.
( # $2: optional quote.
' # Allow quote but only
(?!') # if not followed by quote.
)? # End $2: optional quote.
[-\s\.a-zA-Z]* # Zero or more valid name chars.
)* # End $1: Zero or more additional parts.
$ # Anchor to end of string.
",RegexOptions.IgnorePatternWhitespace);
In English, this regex essentially says: "Match a string that begins with an alpha letter [a-zA-Z] followed by zero or more alpha letters, whitespaces, periods, hyphens or single quotes, but each single quote may not be immediately followed by another single quote."
Note that your above regex allows oddball names such as: "ABC---...'... -.-.XYZ " which may or may not be what you need. It also allows multi-line input and strings that end with whitespace.
The "infinite loop" problem with the above regex is that catastrophic backtracking occurs when this regex is applied to a long invalid input, such as one containing a disallowed character or two single quotes in a row. Here is an equivalent pattern which matches (and fails to match) the exact same strings, but does not experience catastrophic backtracking:
Regex re_fixed = new Regex(@"
^ # Anchor to start of string.
[a-zA-Z] # First char must be letter.
[-\s.a-zA-Z]* # Zero or more valid name chars.
(?: # Zero or more isolated single quotes.
' # Allow single quote but only
(?!') # if not followed by single quote.
[-\s.a-zA-Z]* # Zero or more valid name chars.
)* # Zero or more isolated single quotes.
$ # Anchor to end of string.
",RegexOptions.IgnorePatternWhitespace);
And here it is in short form in your code context:
const string PATTERN = @"^[a-zA-Z][-\s.a-zA-Z]*(?:'(?!')[-\s.a-zA-Z]*)*$";
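As a rough sketch of how the rewritten pattern might be dropped into the validation code, the snippet below (C# top-level statements) also passes a match timeout to the Regex constructor as a safety net. The helper name IsValidName and the one-second timeout are assumptions for illustration, not part of the original code.

```csharp
using System;
using System.Text.RegularExpressions;

// Rewritten pattern from the answer above. The one-second timeout is an
// assumed value; it only guards against future pathological inputs.
var nameRegex = new Regex(
    @"^[a-zA-Z][-\s.a-zA-Z]*(?:'(?!')[-\s.a-zA-Z]*)*$",
    RegexOptions.None,
    TimeSpan.FromSeconds(1));

// Hypothetical helper: returns false for invalid names and for timeouts.
bool IsValidName(string value)
{
    try
    {
        return nameRegex.IsMatch(value);
    }
    catch (RegexMatchTimeoutException)
    {
        return false; // Treat a timed-out match as a rejected value.
    }
}

Console.WriteLine(IsValidName("Ingebrigtsen Myre (Øvre)")); // False
Console.WriteLine(IsValidName("O'Neil"));                   // True
Console.WriteLine(IsValidName("O''Neil"));                  // False
```

With the timeout in place, even a pattern that does backtrack badly would throw RegexMatchTimeoutException instead of hanging for an hour.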

Look at this part of your regex:
( [-\s\.a-zA-Z]* ('(?!'))? [-\s\.a-zA-Z]* )*$
^ ^              ^         ^              ^
| |              |         |              |
| |              |         |              This group repeats any number of times
| |              |         charclass repeats any number of times
| |              This group is optional
| This character class also repeats any number of times
Outer group (repeated, as seen above)
That means that as soon as your input string contains a character that's not in the character class (like the brackets and non-ASCII letter in your example), the preceding characters will be tried in a lot of permutations whose number increases exponentially with the length of the string.
To avoid that (and to allow a faster failure of the regex), use atomic groups:
const string PATTERN = @"^[a-zA-Z](?>(?>[-\s\.a-zA-Z]*)(?>'(?!'))?(?>[-\s\.a-zA-Z])*)*$";

You've got an "any number of any number" here:
...[-\s\.a-zA-Z]*)*
and because your input doesn't match, the engine backtracks to try all permutations of dividing the input up, and the number of attempts grows exponentially with the length of the input.
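You can fix it simply by adding a "+" to make a possessive quantifier, which once consumed will not backtrack to find other combinations:
const string PATTERN = @"^[a-zA-Z]([-\s\.a-zA-Z]*('(?!'))?[-\s\.a-zA-Z]*+)*$";
                                                                        ^-- added + here
You can see a live demo (on rubular) demonstrating that adding the plus fixed the loop problem, and still matches input that doesn't have the odd characters. Note that rubular runs the Ruby engine; .NET's Regex class does not support possessive quantifiers, so in C# the same effect requires wrapping the character class in an atomic group, (?>[-\s\.a-zA-Z]*), instead.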

Related

How Do I format a telephone number using regex

I need to format my telephone numbers in a specific way. Unfortunately business rules prohibit me from doing this up front. (separate input boxes etc..)
The format needs to be +1-xxx-xxx-xxxx where the "+1" is constant. (We don't do business internationally)
Here is my regex pattern to test the input:
"\\D*([2-9]\\d{2})(\\D*)([2-9]\\d{2})(\\D*)(\\d{4})\\D*"
(which I stole from somewhere else)
Then I perform a regex.Replace() like so:
regex.Replace(telephoneNumber, "+1-$1-$3-$5"); **THIS IS WHERE IT BLOWS UP**
If my telephone number already has the "+1" in the string, it prepends another so that I get +1-+1-xxx-xxx-xxxx
Can someone please help?
You can add (?:\+1\D*)? to catch an optional prefix before the number. As it's caught it will be replaced if it's there.
You don't need to use \D* before and after the number. As they are optional, they don't change anything.
You don't need to capture the parts that you won't use, that makes it easier to see what ends up in the replacement.
str = Regex.Replace(str, @"(?:\+1\D*)?([2-9]\d{2})\D*([2-9]\d{2})\D*(\d{4})", "+1-$1-$2-$3");
You might consider using something more specific than \D* for the separators though, for example [\- /]?. With a too non-specific pattern you risk catching something that's not a phone number, for example changing "I have 234 cats, 528 dogs and 4509 horses." into "I have +1-234-528-4509 horses.".
str = Regex.Replace(str, @"(?:\+1[\- /]?)?([2-9]\d{2})[\- /]?([2-9]\d{2})[\- /]?(\d{4})", "+1-$1-$2-$3");
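To make the difference concrete, here is a small sketch (C# top-level statements) that runs the stricter pattern over a few inputs. The helper name FormatPhone and the sample numbers are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

// Stricter pattern from above: optional "+1" prefix, and at most one
// separator character (hyphen, space or slash) between the digit groups.
var phoneRegex = new Regex(
    @"(?:\+1[\- /]?)?([2-9]\d{2})[\- /]?([2-9]\d{2})[\- /]?(\d{4})");

// Hypothetical helper that rewrites any phone number found in the input.
string FormatPhone(string input) => phoneRegex.Replace(input, "+1-$1-$2-$3");

Console.WriteLine(FormatPhone("234 567 8901"));    // +1-234-567-8901
Console.WriteLine(FormatPhone("+1-234-567-8901")); // +1-234-567-8901

// The sentence is left untouched, because " cats, " is not a single separator.
Console.WriteLine(FormatPhone("I have 234 cats, 528 dogs and 4509 horses."));
```

Inputs that use other separators, such as "(234) 567-8901", do not match this pattern and come back unchanged, so the separator class may need widening for real data.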
Try something like this to make things more readable:
Regex rxPhoneNumber = new Regex( @"
^ # anchor the start-of-match to start-of-text
\D* # - allow and ignore any leading non-digits
1? # - we'll allow (and ignore) a leading 1 (as in 1-800-123-4567)
\D* # - allow and ignore any non-digits following that
(?<areaCode>[2-9]\d\d) # - required 3-digit area code
\D* # - allow and ignore any non-digits following the area code
(?<exchangeCode>[2-9]\d\d) # - required 3-digit exchange code (central office)
\D* # - allow and ignore any non-digits following the C.O.
(?<subscriberNumber>\d\d\d\d) # - required 4-digit subscriber number
\D* # - allow and ignore any non-digits following the subscriber number
$ # - followed by the end-of-text.
" ,
RegexOptions.IgnorePatternWhitespace|RegexOptions.ExplicitCapture
);
string input = "voice: 1 (234) 567/1234 (leave a message)" ;
bool isValid = rxPhoneNumber.IsMatch(input) ;
string tidied = rxPhoneNumber.Replace( input , "+1-${areaCode}-${exchangeCode}-${subscriberNumber}" ) ;
which will give tidied the desired value
+1-234-567-1234
You can use the following regex
\D*(\+1-)?([2-9]\d{2})\D*([2-9]\d{2})\D*(\d{4})\D*
And the replacement string:
+1-$2-$3-$4
Here is a demo
This is a kind of an adjustment of the regex you had. If you need to match the whole numbers, I'd use
(\+1-)?\b([2-9]\d{2})\D*([2-9]\d{2})\D*(\d{4})\b
See demo 2
Also, if the hyphen in \+1- is optional, add a ?: \+1-?.
To make the regex safer, I'd replace \D* (0 or more non-digit symbols) with some character class containing known separators, e.g [ /-]* (matching /, spaces and -s).

Regular expression for conditionally formatting a number string

original question removed
I am looking for a Regular Expression which will format a string consisting of special characters, letters and numbers into a string containing only numbers.
There are special cases in which it’s not enough to only replace all non-numeric characters with “” (empty).
1.) Zero in brackets.
If there are only zeros in a bracket (0) these should be removed if it is the first bracket pair. (The second bracket pair containing only zeros should not be removed)
2.) Leading zero.
All leading zero should be removed (ignoring brackets)
Examples for better understanding:
123 (0) 123 would be 123123 (zero removed)
(0) 123 -123 would be 123123 (zero and all other non-numeric characters removed)
2(0) 123 (0) would be 21230 (first zero in brackets removed)
20(0)123023(0) would be 201230230 (first zero in brackets removed)
00(0)1 would be 1 (leading zeros removed)
001(1)(0) would be 110 (leading zeros removed)
0(0)02(0) would be 20 (leading zeros removed)
123(1)3 would be 12313 (characters removed)
You could use a lookbehind to match (0) only if it's not at the beginning of the string, and replace with empty string as you're doing.
(original solution removed)
Updated again to reflect new requirements
Matches leading zeroes, matches (0) only if it's the first parenthesized item, and matches any non-digit characters:
^[0\D]+|(?<=^[^(]*)\(0\)|\D
Note that most regex engines do not support variable-length lookbehinds (i.e., the use of quantifiers like *), so this will only work in a few regex engines -- .NET's being one of them.
^[0\D]+ # zeroes and non-digits at start of string
| # or
(?<=^[^(]*) # preceded by start of string and only non-"(" chars
\(0\) # "(0)"
| # or
\D # non-digit, equivalent to "[^\d]"
(tested at regexhero.net)
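For reference, a minimal sketch (C# top-level statements) of applying this single pattern with Regex.Replace; the helper name ToDigits is an assumption, and the sample inputs are taken from the question.

```csharp
using System;
using System.Text.RegularExpressions;

// Single-pass pattern from above; relies on .NET's variable-length lookbehind.
var cleanup = new Regex(@"^[0\D]+|(?<=^[^(]*)\(0\)|\D");

// Hypothetical helper: everything the pattern matches is deleted.
string ToDigits(string input) => cleanup.Replace(input, "");

string[] samples = { "123 (0) 123", "2(0) 123 (0)", "00(0)1", "001(1)(0)" };
foreach (string sample in samples)
{
    // Prints 123123, 21230, 1 and 110 respectively.
    Console.WriteLine($"{sample} -> {ToDigits(sample)}");
}
```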
You've changed and added requirements several times now. For multiple rules like this, you're probably better off coding for them individually. It could become complicated and difficult to debug if one condition matches and causes another condition not to match when it should. For example, in separate steps:
Remove parenthesized items as necessary.
Remove non-digit characters.
Remove leading zeroes.
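The three steps above could be sketched roughly like this (C# top-level statements). The helper name NormalizeNumber is made up, and step 1 assumes that "(0)" only needs removing when it is the first parenthesized group in the string.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper applying the three rules one after another.
string NormalizeNumber(string input)
{
    // Step 1: drop "(0)" only when no other "(" appears before it.
    string withoutFirstZero = Regex.Replace(input, @"^([^(]*)\(0\)", "$1");

    // Step 2: drop every remaining non-digit character.
    string digitsOnly = Regex.Replace(withoutFirstZero, @"\D", "");

    // Step 3: drop leading zeroes.
    return Regex.Replace(digitsOnly, @"^0+", "");
}

Console.WriteLine(NormalizeNumber("2(0) 123 (0)")); // 21230
Console.WriteLine(NormalizeNumber("0(0)02(0)"));    // 20
Console.WriteLine(NormalizeNumber("001(1)(0)"));    // 110
```

Each rule can then be tested and changed on its own when the requirements move again.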
But if you absolutely need these three conditions all matched in a single regular expression (not recommended), here it is.
Regexes get much, much simpler if you can use multiple passes. I think you could do a first pass to drop your (0) if it's not the first thing in a string, then follow it with stripping out the non-digits:
var noMidStrParenZero = Regex.Replace(text, @"^([^(]+)\(0\)", "$1");
var finalStr = Regex.Replace(noMidStrParenZero, "[^0-9]", "");
Avoids a lot of regex craziness, and it's also self-documenting to an extent.
EDIT: this version should work with your new examples too.
This regex should be pretty near the one you're searching for.
(^[^\d])|([^\d](0[^\d])?)+
(You can replace everything that is caught by an empty string)
EDIT :
Your request evolved, and is now too complex to be treated with a single pass. Assuming you always have a space before a bracket group, you can use these passes (keep this order):
string[] entries = new string[7] {
"800 (0) 123 - 1",
"800 (1) 123",
"(0)321 123",
"1 (0) 1",
"1 (12) (0) 1",
"1 (0) (0) 1",
"(9)156 (1) (0)"
};
foreach (string entry in entries)
{
var output = Regex.Replace(entry , @"\(0\)\s*\(0\)", "0");
output = Regex.Replace(output, @"\s\(0\)", "");
output = Regex.Replace(output, @"[^\d]", "");
System.Console.WriteLine("---");
System.Console.WriteLine(entry);
System.Console.WriteLine(output);
}
(?: # start grouping
^ # start of string
| # OR
^\( # start of string followed by paren
| # OR
\d # a digit
) # end grouping
(0+) # capture any number of zeros
| # OR
([1-9]) # capture any non-zero digit
This works for all of your example strings, but the entire expression does match the ( followed by the zero. You can use Regex.Matches to get the match collection using a global match and then join all of the matched groups into a string to get numbers only (or just remove any non-numbers).

C# Regex Match 15 Characters, Single Spaces, Alpha-Numeric

I need to match a string under the following conditions using Regex in C#:
Entire string can only be alphanumeric (including spaces).
Must be a maximum of 15 characters or less (including spaces).
First & last characters can only be a letter.
A single space can appear multiple times in anywhere but the first and last characters of the string. (Multiple spaces together should not be allowed).
Capitalization should be ignored.
Should match the WHOLE word(s).
If any one of these preconditions are broken, a match should not follow.
Here is what i currently have:
^\b([A-z]{1})(([A-z0-9 ])*([A-z]{1}))?\b$
And here are some test strings that should match:
Stack OverFlow
Iamthe greatest
A
superman23s
One Two Three
And some that shouldn't match (note the spaces):
Stack [double_space] Overflow Rocks
23Hello
ThisIsOver15CharactersLong
Hello23
[space_here]hey
etc.
Any suggestions would be much appreciated.
You should use lookaheads
                                                              |->matches if all the lookaheads are true
                                                              --
^(?=[a-zA-Z]([a-zA-Z\d\s]*[a-zA-Z])?$)(?=.{1,15}$)(?!.*\s{2,}).*$
------------------------------------- ----------- ------------
                  |                        |           |->checks that there are no two or more consecutive whitespace characters
                  |                        |->checks that the string is 1 to 15 characters long
                  |->checks that the string starts with a letter, optionally followed by any allowed characters and a final letter
you can try it here
Try this regex: -
"^([a-zA-Z]([ ](?=[a-zA-Z0-9])|[a-zA-Z0-9]){0,13}[a-zA-Z])$"
Explanation : -
[a-zA-Z] // Match first character letter
( // Capture group
[ ](?=[a-zA-Z0-9]) // Match either a `space followed by non-whitespace` (Avoid double space, but accept single whitespace)
| // or
[a-zA-Z0-9] // just `non-whitespace` characters
){0,13} // from `0 to 13` character in length
[a-zA-Z] // Match last character letter
Update : -
To handle single characters, you can make the pattern after the 1st character optional, as nicely pointed out by @Rawling in the comments: -
"^([a-zA-Z](([ ](?=[a-zA-Z0-9])|[a-zA-Z0-9]){0,13}[a-zA-Z])?)$"
           ^                                              ^^
           use a capture group                            make it optional
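A quick way to check the updated pattern against the strings from the question is a sketch like the one below (C# top-level statements). The helper name IsValid is an assumption, and the "[double_space]" and "[space_here]" placeholders are written out as real spaces.

```csharp
using System;
using System.Text.RegularExpressions;

// Updated pattern from above: one letter, optionally followed by up to 13
// letters/digits/single spaces and a closing letter (15 characters at most).
var nameRegex = new Regex(
    @"^([a-zA-Z](([ ](?=[a-zA-Z0-9])|[a-zA-Z0-9]){0,13}[a-zA-Z])?)$");

// Hypothetical helper wrapping the match.
bool IsValid(string value) => nameRegex.IsMatch(value);

string[] shouldMatch = { "Stack OverFlow", "Iamthe greatest", "A", "superman23s", "One Two Three" };
string[] shouldFail = { "Stack  Overflow Rocks", "23Hello", "ThisIsOver15CharactersLong", "Hello23", " hey" };

foreach (string s in shouldMatch)
    Console.WriteLine($"'{s}': {IsValid(s)}"); // True for each
foreach (string s in shouldFail)
    Console.WriteLine($"'{s}': {IsValid(s)}"); // False for each
```

Because the character classes list both cases explicitly, this version does not need RegexOptions.IgnoreCase.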
And my version, again using look-aheads:
^(?=.{1,15}$)(?=^[A-Z].*)(?=.*[A-Z]$)(?!.*[ ]{2})[A-Z0-9 ]+$
explained:
^ start of string
(?=.{1,15}$) positive look-ahead: must be between 1 and 15 chars
(?=^[A-Z].*) positive look-ahead: initial char must be alpha
(?=.*[A-Z]$) positive look-ahead: last char must be alpha
(?!.*[ ]{2}) negative look-ahead: string mustn't contain 2 or more consecutive spaces anywhere
[A-Z0-9 ]+ if all the look-aheads agree, select only alpha-numeric chars + space
$ end of string
This will also need the IgnoreCase option setting

Regex match if a string has length 2 and contains 1 letter and 1 number

Guys I hate Regex and I suck at writing.
I have a string that is space separated and contains several codes that I need to pull out. Each code is marked by beginning with a capital letter and ending with a number. Each code is only two characters long.
I'm trying to create an array of strings from the initial string and I can't get the regular expression right.
Here is what I have
String[] test = Regex.Split(originalText, "([a-zA-Z0-9]{2})");
I also tried:
String[] test = Regex.Split(originalText, "([A-Z]{1}[0-9]{1})");
I don't have any experience with Regex as I try to avoid writing them whenever possible.
Anyone have any suggestions?
Example input:
AA2410 F7 A4 Y7 B7 A 0715 0836 E0.M80
I need to pull out F7, A4, B7. E0 should be ignored.
You want to collect the results, not split on them, right?
Regex regexObj = new Regex(@"\b[A-Z][0-9]\b");
MatchCollection allMatchResults = regexObj.Matches(subjectString);
should do this. The \bs are word boundaries, making sure that only entire strings (like A1) are extracted, not substrings (like the A1 in TWA101).
If you also need to exclude "words" with non-word characters in them (like E0.M80 in your comment), you need to define your own word boundary, for example:
Regex regexObj = new Regex(@"(?<=^|\s)[A-Z][0-9](?=\s|$)");
Now A1 only matches when surrounded by whitespace (or start/end-of-string positions).
Explanation:
(?<= # Assert that we can match the following before the current position:
^ # Start of string
| # or
\s # whitespace.
)
[A-Z] # Match an uppercase ASCII letter
[0-9] # Match an ASCII digit
(?= # Assert that we can match the following after the current position:
\s # Whitespace
| # or
$ # end of string.
)
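As a small sketch of how the matches could be turned into the string array the question asks for (C# top-level statements, using LINQ): the helper name ExtractCodes is an assumption. Note that Y7 in the sample input also fits the rule, so it is returned as well.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper: collects every capital-letter-plus-digit code that is
// delimited by whitespace or by the start/end of the string.
string[] ExtractCodes(string text) =>
    Regex.Matches(text, @"(?<=^|\s)[A-Z][0-9](?=\s|$)")
         .Cast<Match>()
         .Select(m => m.Value)
         .ToArray();

string[] codes = ExtractCodes("AA2410 F7 A4 Y7 B7 A 0715 0836 E0.M80");
Console.WriteLine(string.Join(", ", codes)); // F7, A4, Y7, B7
```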
If you also need to find non-ASCII letters/digits, you can use
\p{Lu}\p{N}
instead of [A-Z][0-9]. This finds all uppercase Unicode letters and Unicode digits (like Ä٣), but I guess that's not really what you're after, is it?
Do you mean that each code looks like "A00"?
Then this is the regex:
"[A-Z][0-9][0-9]"
Very simple... By the way, there's no point writing {1} in a regex. [0-9]{1} means "match exactly one digit", which is exactly like writing [0-9].
Don't give up, simple regexes make perfect sense.
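This should be ok:
String[] all_codes = Regex.Matches(originalText, @"\b[A-Z]\d\b").Cast<Match>().Select(m => m.Value).ToArray();
It gives you an array with all codes consisting of a capital letter followed by a digit, delimited by any kind of word boundary (white space etc.). Regex.Split would return the text between the codes rather than the codes themselves, so Regex.Matches is used here; the LINQ calls need a using System.Linq directive.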

Seeking some C# RegEx help

I am trying to create a RegEx expression that will successfully parse the following line:
"57" "testing123" 82 16 # 13 26 blah blah
What I want is to be able to do is identify the numbers in the line. Currently, what I'm using is this:
[0-9]+
which parses fine. However, where it gets tricky is if the number is in quotes, like "57" is or like "testing123" is, I do not want it to match.
In addition to that, I do not want to match anything at all after the hash sign (the '#').
So in this example, the matches I should be getting are "82" and "16". Nothing else should match.
Any help on this would be appreciated.
It should be easier for you to build 3 different regexes, and then create the logic that combines them:
Check, whether the string has #, and ignore everything after it.
Check, for all the matches of "\d+", and ignore all of them
Check everything that's left, whether it matches [0-9]+
.NET regular expressions can parse this string rather easily. The following pattern should match everything until the comment:
\A # Start of the string
(?>
(?<Quoted> # A quoted string
"" # Open quotes
[^""\\]* # non quotes or backslashes
(?:\\.[^""\\]*)* # but allow escaped characters
"" # Close quotes
)
|
(?<Number> # A number
\d+ # some digits
)
|
\s+ # Whitespace separator
)*
If you also want to match the comment, add:
(?<Comment>
\# .*
)?
\z
You can get your numbers in a single Match, using all captures of the "Number" group:
Match parsed = Regex.Match(s, pattern, RegexOptions.IgnorePatternWhitespace);
CaptureCollection numbers = parsed.Groups["Number"].Captures;
Missing from this pattern is mainly unquoted string tokens, such as 4 8 this 15that, which can add some complexity, depending on how we'd want it to work.
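Putting the two pattern fragments together, a self-contained sketch might look like this (C# top-level statements). The helper name ExtractNumbers is an assumption, and returning an empty array when the line does not parse is a design choice made here, not something the answer prescribes.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Both fragments from above, combined into one verbatim string.
// Inside @"..." a doubled quote ("") stands for a single quote character.
const string pattern = @"
\A
(?>
    (?<Quoted> "" [^""\\]* (?:\\.[^""\\]*)* "" )   # quoted string
  | (?<Number> \d+ )                               # bare number
  | \s+                                            # whitespace separator
)*
(?<Comment> \# .* )?                               # optional trailing comment
\z";

// Hypothetical helper: returns the bare numbers, or nothing if the line
// contains tokens the pattern does not know about.
string[] ExtractNumbers(string text)
{
    Match parsed = Regex.Match(text, pattern, RegexOptions.IgnorePatternWhitespace);
    if (!parsed.Success)
        return Array.Empty<string>();

    return parsed.Groups["Number"].Captures
                 .Cast<Capture>()
                 .Select(c => c.Value)
                 .ToArray();
}

string sample = "\"57\" \"testing123\" 82 16 # 13 26 blah blah";
Console.WriteLine(string.Join(", ", ExtractNumbers(sample))); // 82, 16
```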
