Seeking some C# RegEx help - c#

I am trying to create a RegEx expression that will successfully parse the following line:
"57" "testing123" 82 16 # 13 26 blah blah
What I want to be able to do is identify the numbers in the line. Currently, what I'm using is this:
[0-9]+
which parses fine. However, where it gets tricky is if the number is in quotes, like "57" is or like "testing123" is, I do not want it to match.
In addition, I do not want to match anything at all after the hash sign (the '#').
So in this example, the matches I should be getting are "82" and "16". Nothing else should match.
Any help on this would be appreciated.

It should be easier to build three separate regexes and then write the logic that combines them:
Check whether the string contains #, and ignore everything after it.
Find all matches of "\d+" (digits in quotes), and ignore them.
Check whether anything that is left matches [0-9]+
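The three steps above can be sketched roughly as follows. This is only an illustration: the class and method names are made up, and the second step strips any quoted token with "[^"]*" rather than only "\d+", on the assumption that quoted text such as "testing123" should be skipped as well. It also assumes a # never appears inside quotes.

```csharp
using System.Linq;
using System.Text.RegularExpressions;

public static class ThreeStepNumbers
{
    // Hypothetical helper: returns the unquoted numbers that appear before any '#'.
    public static string[] Extract(string line)
    {
        // Step 1: drop the '#' and everything after it.
        string withoutComment = Regex.Replace(line, "#.*", "");

        // Step 2: blank out quoted tokens (assumption: any quoted text, not only digits).
        string withoutQuoted = Regex.Replace(withoutComment, "\"[^\"]*\"", " ");

        // Step 3: whatever digits remain are the numbers we want.
        return Regex.Matches(withoutQuoted, "[0-9]+")
                    .Cast<Match>()
                    .Select(m => m.Value)
                    .ToArray();
    }
}
```

For the line in the question this should leave only 82 and 16.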

.NET regular expressions can parse this string rather easily. The following pattern should match everything up to the comment:
\A # Start of the string
(?>
(?<Quoted> # A quoted string
"" # Open quotes
[^""\\]* # non quotes or backslashes
(?:\\.[^""\\]*)* # but allow escaped characters
"" # Close quotes
)
|
(?<Number> # A number
\d+ # some digits
)
|
\s+ # Whitespace separator
)*
If you also want to match the comment, add:
(?<Comment>
\# .*
)?
\z
You can get your numbers in a single Match, using all captures of the "Number" group:
Match parsed = Regex.Match(s, pattern, RegexOptions.IgnorePatternWhitespace);
CaptureCollection numbers = parsed.Groups["Number"].Captures;
What this pattern mainly lacks is handling of unquoted string tokens, such as 4 8 this 15that, which can add some complexity, depending on how we'd want them to work.
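Put together, the pattern and the two lines of code above might look like the sketch below. The class and method names are hypothetical; the pattern is the one from this answer, including the optional comment part.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class NumberTokens
{
    // Verbatim string: "" inside it stands for a single double quote.
    private const string Pattern = @"
        \A
        (?>
            (?<Quoted> "" [^""\\]* (?:\\.[^""\\]*)* "" )   # a quoted string
          | (?<Number> \d+ )                               # a number
          | \s+                                            # whitespace separator
        )*
        (?<Comment> \# .* )?                               # optional trailing comment
        \z";

    // Returns every capture of the Number group from a single overall match.
    public static string[] Extract(string line)
    {
        Match parsed = Regex.Match(line, Pattern, RegexOptions.IgnorePatternWhitespace);
        if (!parsed.Success)
        {
            return Array.Empty<string>();
        }
        return parsed.Groups["Number"].Captures
                     .Cast<Capture>()
                     .Select(c => c.Value)
                     .ToArray();
    }
}
```

Because the whole line has to match, an input containing an unquoted word (the limitation noted above) yields no numbers at all in this sketch.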

Related

Regex for alpha number string in c# accepting underscore and white spaces

I have already gone through many posts on SO, but I didn't find what I needed for my specific scenario.
I need a regex for an alphanumeric string
where the following conditions are met:
Valid string:
ameya123 (alphabets and numbers)
ameya (only alphabets)
AMeya12(Capital and normal alphabets and numbers)
Ameya_123 (alphabets and underscore and numbers)
Ameya_ 123 (alphabets, underscore and white spaces)
Invalid string:
123 (only numbers)
_ (only underscore)
(only space) (only white spaces)
any special character other than underscore
What I have tried so far:
(?=.*[a-zA-Z])(?=.*[0-9]*[\s]*[_]*)
The above regex works in an online regex editor, but not in a data annotation in C#.
Please suggest.
Based on your requirements and not your attempt, what you are in need of is this:
^(?!(?:\d+|_+| +)$)[\w ]+$
The negative lookahead looks for undesired matches to fail the whole process. Those are strings containing digits only, underscores only or spaces only. If they never happen we want to have a match for ^[\w ]+$ which is nearly the same as ^[a-zA-Z0-9_ ]+$.
See live demo here
Explanation:
^ Start of line / string
(?! Start of negative lookahead
(?: Start of non-capturing group
\d+ Match digits
| Or
_+ Match underscores
| Or
[ ]+ Match spaces
)$ End of non-capturing group immediately followed by end of line / string (none of previous matches should be found)
) End of negative lookahead
[\w ]+$ Match a character inside the character set up to end of input string
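Note: in PCRE-style engines \w is shorthand for [a-zA-Z0-9_] unless the u modifier is set. In .NET, \w also matches Unicode letters and digits unless RegexOptions.ECMAScript is specified.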
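As a quick check, the pattern can be run against the sample strings from the question. The class name and the extra invalid sample "ameya$" are illustrative assumptions.

```csharp
using System.Text.RegularExpressions;

public static class NameLikeCheck
{
    // Pattern from this answer: reject digit-only, underscore-only and space-only input.
    private static readonly Regex Pattern = new Regex(@"^(?!(?:\d+|_+| +)$)[\w ]+$");

    public static bool IsValid(string input) => Pattern.IsMatch(input);
}
```

With this sketch, the five valid strings listed in the question should pass, while 123, _ and a single space should be rejected.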
One problem with your regex is that in annotations, the regex must match and consume the entire string input, while your pattern only contains lookarounds that do not consume any text.
You may use
^(?!\d+$)(?![_\s]+$)[A-Za-z0-9\s_]+$
See the regex demo. Note that \w, when parsed by the .NET regex engine for server-side validation, also matches any Unicode letters, digits and some other characters, so I'd rather stick to [A-Za-z0-9_] to be consistent between server- and client-side validation.
Details
^ - start of string (not necessary here, but good to have when debugging)
(?!\d+$) - a negative lookahead that fails the match if the whole string consists of digits
(?![_\s]+$) - a negative lookahead that fails the match if the whole string consists of underscores and/or whitespace. NOTE: if you only want to disallow inputs made up solely of underscores (like ____) or solely of whitespace, you need to split this lookahead into (?!_+$) and (?!\s+$).
[A-Za-z0-9\s_]+ - 1+ ASCII letters, digits, _ and whitespace chars
$ - end of string (not necessary here, but still good to have).
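The point about annotations needing to consume the whole input can be illustrated with RegularExpressionAttribute from System.ComponentModel.DataAnnotations, which only accepts a value when the match covers the entire string. The wrapper class below is a hypothetical sketch for comparing the two patterns.

```csharp
using System.ComponentModel.DataAnnotations;

public static class AnnotationCheck
{
    // The question's attempt: lookaheads only, so it never consumes any text.
    public const string LookaheadOnly = @"(?=.*[a-zA-Z])(?=.*[0-9]*[\s]*[_]*)";

    // The pattern from this answer: lookaheads plus a consuming character class.
    public const string Consuming = @"^(?!\d+$)(?![_\s]+$)[A-Za-z0-9\s_]+$";

    // Validates a value the same way a [RegularExpression] annotation would.
    public static bool IsValid(string pattern, string value) =>
        new RegularExpressionAttribute(pattern).IsValid(value);
}
```

Here the lookahead-only pattern should reject ameya123 because its match is empty, while the consuming pattern should accept it.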
If I understand your requirements correctly, you need to match one or more letters (uppercase or lowercase), and possibly zero or more of digits, whitespace, or underscore. This implies the following pattern:
^[A-Za-z0-9\s_]*[A-Za-z][A-Za-z0-9\s_]*$
Demo
In the demo, I have replaced \s with \t \r, because \s was matching across all lines.
Unlike the answers given by @revo and @wiktor, I don't have a fancy looking explanation to the regex. I am beautiful even without my makeup on. Honestly, if you don't understand the pattern I gave, you might want to review a good regex tutorial.
This simple RegEx should do it:
[a-zA-Z]+[0-9_ ]*
One or more letters, followed by zero or more digits, underscores and spaces.
This one should be good:
[\w\s_]*[a-zA-Z]+[\w\s_]*

Regular Expressions: Determining if a String is either a number or variable

I am trying to combine two Regular Expression patterns to determine if a String is either a double value or a variable. My restrictions are as follows:
The variable can only begin with an _ or alphabetical letter (A-Z, ignoring case), but it can be followed by zero or more _s, letters, or digits.
Here's what I have so far, but I can't get it to work properly.
String varPattern = @"[a-zA-Z_](?: [a-zA-Z_]|\d)*";
String doublePattern = @"(?: \d+\.\d* | \d*\.\d+ | \d+ ) (?: [eE][\+-]?\d+)?";
String pattern = String.Format("({0}) | ({1})",
varPattern, doublePattern);
Regex.IsMatch(word, varPattern, RegexOptions.IgnoreCase)
It seems that it is capturing both Regular Expression patterns, but I need it to be either/or.
For example, _A2 2 is valid using the code above, but _A2 is invalid.
Some examples of valid variables are as follows:
_X6 , _ , A , Z_2_A
And some examples of invalid variables are as follows:
2_X6 , $2 , T_2$
I guess I just need clarification on the pattern format for the Regular Expression. The format is unclear to me.
As noted, the literal whitespace you've put in your regular expressions is part of the regular expression. You won't get a match unless that same whitespace is in the text being scanned by the regular expression. If you want to use whitespace to make your regex more readable, you'll need to specify RegexOptions.IgnorePatternWhitespace. After that, if you want to match any whitespace, you'll have to do so explicitly, by specifying \s, \x20, etc.
It should be noted that if you do specify RegexOptions.IgnorePatternWhitespace, you can use Perl-style comments (# to end of line) to document your regular expression (as I've done below). For complex regular expressions, someone 5 years from now — who might be you! — will thank you for the kindness.
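A small sketch of the difference, using the number pattern from the question with anchors added (the anchors and the class name are assumptions made for the demonstration):

```csharp
using System.Text.RegularExpressions;

public static class PatternWhitespaceDemo
{
    // The question's number pattern, spaces included, wrapped in ^ and $.
    public const string SpacedPattern =
        @"^ (?: \d+\.\d* | \d*\.\d+ | \d+ ) (?: [eE][\+-]?\d+ )? $";

    // Default options: every space in the pattern must appear in the input.
    public static bool MatchLiteralSpaces(string input) =>
        Regex.IsMatch(input, SpacedPattern);

    // IgnorePatternWhitespace: the spaces are layout only and are not matched.
    public static bool MatchIgnoringSpaces(string input) =>
        Regex.IsMatch(input, SpacedPattern, RegexOptions.IgnorePatternWhitespace);
}
```

With default options 1.5e3 should not match, because the pattern demands literal spaces; with IgnorePatternWhitespace it should.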
Your [presumably intended] patterns are also, I think, more complex than they need be. A regular expression to match the identifier rule you've specified is this:
[a-zA-Z_][a-zA-Z0-9_]*
Broken out into its constituent parts:
[a-zA-Z_] # match an upper- or lower-case letter or an underscore, followed by
[a-zA-Z0-9_]* # zero or more occurrences of an upper- or lower-case letter, decimal digit or underscore
A regular expression to match the conventional style of a numeric/floating-point literal is this:
([+-]?[0-9]+)(\.[0-9]+)?([Ee][+-]?[0-9]+)?
Broken out into its constituent parts:
( # a mandatory group that is the integer portion of the value, consisting of
[+-]? # - an optional plus- or minus-sign, followed by
[0-9]+ # - one or more decimal digits
) # followed by
( # an optional group that is the fractional portion of the value, consisting of
\. # - a decimal point, followed by
[0-9]+ # - one or more decimal digits
)? # followed by,
( # an optional group, that is the exponent portion of the value, consisting of
[Ee] # - The upper- or lower-case letter 'E' indicating the start of the exponent, followed by
[+-]? # - an optional plus- or minus-sign, followed by
[0-9]+ # - one or more decimal digits.
)? # Easy!
Note: Some grammars differ as to whether the sign of the value is a unary operator or part
of the value and whether or not a leading + sign is allowed. Grammars also vary as to whether
something like 123245. is valid (e.g., is a decimal point with no fractional digits valid?)
To combine these two regular expressions:
First, group each of them with parentheses (you might want to name the containing groups, as I've done):
(?<identifier>[a-zA-Z_][a-zA-Z0-9_]*)
(?<number>[+-]?[0-9]+(\.[0-9]+)?([Ee][+-]?[0-9]+)?)
Next, combine with the alternation operator, |:
(?<identifier>[a-zA-Z_][a-zA-Z0-9_]*)|(?<number>[+-]?[0-9]+(\.[0-9]+)?([Ee][+-]?[0-9]+)?)
Finally, enclose the whole shebang in an @"..." literal and you should be good to go.
That's about all there is to it.
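As a rough sketch, the combined expression could be used like this. The class and method names are hypothetical, the ^ and $ anchors are added so that the whole word must match, and the number group is assumed to wrap the entire numeric literal.

```csharp
using System.Text.RegularExpressions;

public static class TokenClassifier
{
    // identifier | number, anchored so that partial matches are rejected.
    private static readonly Regex Token = new Regex(
        @"^(?:(?<identifier>[a-zA-Z_][a-zA-Z0-9_]*)|(?<number>[+-]?[0-9]+(\.[0-9]+)?([Ee][+-]?[0-9]+)?))$");

    // Returns "identifier", "number" or "invalid" for a single word.
    public static string Classify(string word)
    {
        Match m = Token.Match(word);
        if (!m.Success)
        {
            return "invalid";
        }
        return m.Groups["identifier"].Success ? "identifier" : "number";
    }
}
```

The named groups tell you which alternative matched, so one call answers the either/or question from the original post.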
Spaces are not ignored in regular expressions by default, so for each space in your current expressions it is looking for a space in that string. Add the RegexOptions.IgnorePatternWhitespace flag or remove the spaces from your expressions.
You will also want to add some beginning and end of string anchors (^ and $ respectively) so you do not match just part of a string.
You should avoid having spaces in your regular expressions unless you explicitly set IgnorePatternWhitespace. To make sure you get only matches on complete words you should include the beginning of line (^) and end of line ($) characters. I would also suggest you build the entire expression pattern instead of using String.Format("({0}) | ({1})", ...) as you have here.
The below should work given your examples:
string pattern = @"(?:^[a-zA-Z_][a-zA-Z_\d]*$)|(?:^\d+(?:\.\d+){0,1}(?:[Ee][\+-]?\d+){0,1}$)";

Regex IsMatch taking too long to execute

I have one strange issue on my .NET project with RegEx. Please, see C# code below:
const string PATTERN = @"^[a-zA-Z]([-\s\.a-zA-Z]*('(?!'))?[-\s\.a-zA-Z]*)*$";
const string VALUE = "Ingebrigtsen Myre (Øvre)";
System.Text.RegularExpressions.Regex regex = new System.Text.RegularExpressions.Regex(PATTERN);
if (!regex.IsMatch(VALUE)) // <--- Infinite loop here
return string.Empty;
// Some other code
I use this pattern to validate all types of names (first names, last names, middle names, etc.). The value is a parameter, but I provided it as a constant above because the issue is not reproduced often, only with special symbols: *, (, ), etc. (sorry, but I don't have the full list of these symbols).
Can you help me to fix this infinite loop? Thanks for any help.
Added: this code is placed on the very base level of project and I don't want to do any refactoring there - I just want to have quick fix for this issue.
Added 2: I do know that it technically is not a loop - I meant that "regex.IsMatch(VALUE)" never ends. I waited for about an hour and it was still executing.
Your non-trivial regex: ^[a-zA-Z]([-\s\.a-zA-Z]*('(?!'))?[-\s\.a-zA-Z]*)*$, is better written with comments in free-spacing mode like so:
Regex re_orig = new Regex(@"
^ # Anchor to start of string.
[a-zA-Z] # First char must be letter.
( # $1: Zero or more additional parts.
[-\s\.a-zA-Z]* # Zero or more valid name chars.
( # $2: optional quote.
' # Allow quote but only
(?!') # if not followed by quote.
)? # End $2: optional quote.
[-\s\.a-zA-Z]* # Zero or more valid name chars.
)* # End $1: Zero or more additional parts.
$ # Anchor to end of string.
",RegexOptions.IgnorePatternWhitespace);
In English, this regex essentially says: "Match a string that begins with an alpha letter [a-zA-Z] followed by zero or more alpha letters, whitespaces, periods, hyphens or single quotes, but each single quote may not be immediately followed by another single quote."
Note that your above regex allows oddball names such as: "ABC---...'... -.-.XYZ " which may or may not be what you need. It also allows multi-line input and strings that end with whitespace.
The "infinite loop" problem with the above regex is that catastrophic backtracking occurs when this regex is applied to a long invalid input which contains two single quotes in a row. Here is an equivalent pattern which matches (and fails to match) the exact same strings, but does not experience catastrophic backtracking:
Regex re_fixed = new Regex(@"
^ # Anchor to start of string.
[a-zA-Z] # First char must be letter.
[-\s.a-zA-Z]* # Zero or more valid name chars.
(?: # Zero or more isolated single quotes.
' # Allow single quote but only
(?!') # if not followed by single quote.
[-\s.a-zA-Z]* # Zero or more valid name chars.
)* # Zero or more isolated single quotes.
$ # Anchor to end of string.
",RegexOptions.IgnorePatternWhitespace);
And here it is in short form in your code context:
const string PATTERN = @"^[a-zA-Z][-\s.a-zA-Z]*(?:'(?!')[-\s.a-zA-Z]*)*$";
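If you want a safety net in addition to the rewritten pattern, the Regex constructor accepts a match timeout, and exceeding it throws RegexMatchTimeoutException. The sketch below is a minimal illustration; the class name, the method names and the one-second limit are arbitrary assumptions.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NameValidator
{
    // The rewritten pattern from this answer.
    private const string FixedPattern = @"^[a-zA-Z][-\s.a-zA-Z]*(?:'(?!')[-\s.a-zA-Z]*)*$";

    // Returns true or false for a completed match attempt, or null if it timed out.
    public static bool? TryIsMatch(string pattern, string value, TimeSpan timeout)
    {
        try
        {
            return new Regex(pattern, RegexOptions.None, timeout).IsMatch(value);
        }
        catch (RegexMatchTimeoutException)
        {
            return null; // gave up instead of running indefinitely
        }
    }

    public static bool? IsValidName(string value) =>
        TryIsMatch(FixedPattern, value, TimeSpan.FromSeconds(1));
}
```

The same TryIsMatch helper can be pointed at the original pattern to see whether it finishes within the limit on a given input.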
Look at this part of your regex:
( [-\s\.a-zA-Z]* ('(?!'))? [-\s\.a-zA-Z]* )*$
^ ^ ^ ^ ^
| | | | |
| | | | This group repeats any number of times
| | | charclass repeats any number of times
| | This group is optional
| This character class also repeats any number of times
Outer group (repeated, as seen above)
That means that as soon as your input string contains a character that's not in the character class (like the brackets and non-ASCII letter in your example), the preceding characters will be tried in a lot of permutations whose number increases exponentially with the length of the string.
To avoid that (and to allow the regex to fail faster), use atomic groups:
const string PATTERN = @"^[a-zA-Z](?>(?>[-\s\.a-zA-Z]*)(?>'(?!'))?(?>[-\s\.a-zA-Z])*)*$";
You've got an "any number of any number" here:
...[-\s\.a-zA-Z]*)*
and because your input doesn't match, the engine backtracks to try all permutations of dividing the input up, and the number of attempts grows exponentially with the length of the input.
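You can fix it by making the last inner quantifier possessive, so that once it has consumed text it will not backtrack to find other combinations. In engines that support possessive quantifiers (such as Ruby's, which rubular uses), that means adding a "+":
^[a-zA-Z]([-\s\.a-zA-Z]*('(?!'))?[-\s\.a-zA-Z]*+)*$
The .NET engine does not support possessive quantifiers (it rejects *+ as a nested quantifier), so in C# use the equivalent atomic group instead:
const string PATTERN = @"^[a-zA-Z]([-\s\.a-zA-Z]*('(?!'))?(?>[-\s\.a-zA-Z]*))*$";
You can see a live demo (on rubular) demonstrating that adding the plus fixes the loop problem and still matches input that doesn't have the odd characters.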

Regex match if a string has length 2 and contains 1 letter and 1 number

Guys I hate Regex and I suck at writing.
I have a string that is space separated and contains several codes that I need to pull out. Each code begins with a capital letter and ends with a digit. A code is only two characters long.
I'm trying to create an array of strings from the initial string and I can't get the regular expression right.
Here is what I have
String[] test = Regex.Split(originalText, "([a-zA-Z0-9]{2})");
I also tried:
String[] test = Regex.Split(originalText, "([A-Z]{1}[0-9]{1})");
I don't have any experience with Regex as I try to avoid writing them whenever possible.
Anyone have any suggestions?
Example input:
AA2410 F7 A4 Y7 B7 A 0715 0836 E0.M80
I need to pull out F7, A4, B7. E0 should be ignored.
You want to collect the results, not split on them, right?
Regex regexObj = new Regex(@"\b[A-Z][0-9]\b");
allMatchResults = regexObj.Matches(subjectString);
should do this. The \bs are word boundaries, making sure that only entire strings (like A1) are extracted, not substrings (like the A1 in TWA101).
If you also need to exclude "words" with non-word characters in them (like E0.M80 in your comment), you need to define your own word boundary, for example:
Regex regexObj = new Regex(@"(?<=^|\s)[A-Z][0-9](?=\s|$)");
Now A1 only matches when surrounded by whitespace (or start/end-of-string positions).
Explanation:
(?<= # Assert that we can match the following before the current position:
^ # Start of string
| # or
\s # whitespace.
)
[A-Z] # Match an uppercase ASCII letter
[0-9] # Match an ASCII digit
(?= # Assert that we can match the following after the current position:
\s # Whitespace
| # or
$ # end of string.
)
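Applied to the sample input from the question, the whitespace-delimited pattern can be used along these lines (the class and method names are illustrative):

```csharp
using System.Linq;
using System.Text.RegularExpressions;

public static class CodeExtractor
{
    // Collects every two-character code that is delimited by whitespace
    // or by the start/end of the string.
    public static string[] Extract(string text) =>
        Regex.Matches(text, @"(?<=^|\s)[A-Z][0-9](?=\s|$)")
             .Cast<Match>()
             .Select(m => m.Value)
             .ToArray();
}
```

For AA2410 F7 A4 Y7 B7 A 0715 0836 E0.M80 this should return F7, A4, Y7 and B7, and skip E0 because it is followed by a dot.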
If you also need to find non-ASCII letters/digits, you can use
\p{Lu}\p{N}
instead of [A-Z][0-9]. This finds all uppercase Unicode letters and Unicode digits (like Ä٣), but I guess that's not really what you're after, is it?
Do you mean that each code looks like "A00"?
Then this is the regex:
"[A-Z][0-9][0-9]"
Very simple... By the way, there's no point writing {1} in a regex. [0-9]{1} means "match exactly one digit", which is exactly like writing [0-9].
Don't give up, simple regexes make perfect sense.
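This should be OK:
String[] all_codes = Regex.Matches(originalText, @"\b[A-Z]\d\b").Cast<Match>().Select(m => m.Value).ToArray();
It gives you an array of all codes that consist of a capital letter followed by a digit and are delimited by any kind of word boundary (whitespace etc.). Note that this uses Regex.Matches rather than Regex.Split, because Split returns the text between the matches instead of the codes themselves; the Cast and Select calls need using System.Linq.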

Regular expression to find separator dots in formula

The C# expression library I am using does not directly support my table/field parameter syntax.
The following table/field parameter names are not directly supported:
TableName1.FieldName1
[TableName1].[FieldName1]
[Table Name 1].[Field Name 1]
It accepts alphanumeric parameters without spaces, or most characters enclosed within square brackets. I would like to use C# regular expressions to replace the dot separators and neighboring brackets with a different delimiter, so the results would be as follows:
[TableName1|FieldName1]
[TableName1|FieldName1]
[Table Name 1|Field Name 1]
I also need to skip any string literals within single quotes, like:
'TableName1.FieldName1'
And, of course, ignore any numeric literals like:
12345.6789
EDIT: Thank you for your feedback on improving my question. Hopefully it is clearer now.
I've written a completely new answer, now that the problem is clarified:
You can do this in a single regex. It is quite bulletproof, I think, but as you can see, it's not exactly self-explanatory, which is why I've commented it liberally. Hope it makes sense.
You're lucky that .NET allows re-use of named capturing groups, otherwise you would have had to do this in several steps.
resultString = Regex.Replace(subjectString,
@"(?: # Either match...
(?<before> # (and capture into backref <before>)
(?=\w*\p{L}) # (as long as it contains at least one letter):
\w+ # one or more alphanumeric characters,
) # (End of capturing group <before>).
\. # then a literal dot,
(?<after> # (now capture again, into backref <after>)
(?=\w*\p{L}) # (as long as it contains at least one letter):
\w+ # one or more alphanumeric characters.
) # (End of capturing group <after>) and end of match.
| # Or:
\[ # Match a literal [
(?<before> # (now capture into backref <before>)
[^\]]+ # one or more characters except ]
) # (End of capturing group <before>).
\]\.\[ # Match literal ].[
(?<after> # (capture into backref <after>)
[^\]]+ # one or more characters except ]
) # (End of capturing group <after>).
\] # Match a literal ]
) # End of alternation. The match is now finished, but
(?= # only if the rest of the line matches either...
[^']*$ # only non-quote characters
| # or
[^']*'[^']*' # contains an even number of quote characters
[^']* # plus any number of non-quote characters
$ # until the end of the line.
) # End of the lookahead assertion.",
"[${before}|${after}]", RegexOptions.Multiline | RegexOptions.IgnorePatternWhitespace);

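For reference, here is a compact, self-contained sketch of the same replacement with the comments stripped out. The class and method names are hypothetical, and the pattern is otherwise the one from the answer above, including its quote check, which as written looks ahead for either no quotes or exactly one pair of quotes on the rest of the line.

```csharp
using System.Text.RegularExpressions;

public static class FieldReferenceRewriter
{
    private const string Pattern = @"
        (?:
            (?<before>(?=\w*\p{L})\w+) \. (?<after>(?=\w*\p{L})\w+)
          |
            \[ (?<before>[^\]]+) \]\.\[ (?<after>[^\]]+) \]
        )
        (?= [^']*$ | [^']*'[^']*'[^']*$ )";

    // Rewrites Table.Field and [Table].[Field] references to [Table|Field].
    public static string Rewrite(string formula) =>
        Regex.Replace(formula, Pattern, "[${before}|${after}]",
            RegexOptions.Multiline | RegexOptions.IgnorePatternWhitespace);
}
```

Numeric literals such as 12345.6789 are left alone because neither side of the dot contains a letter, and a reference inside single quotes is skipped because an odd number of quotes follows it.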