Pattern Match At Specific Location For Validation - c#

With these data examples:
/test -test/test/2016/April
/test -test/test/2016
How does one pattern match so that it can determine whether or not the number 2016 is located in this exact position?

A regex pattern can do validation or, as you imply, validation of where something sits in the string. The key is to set up pattern anchors based on the text encountered before one reaches the numeric part.
For your case you have a literal /, then text, then a literal -, then more literal /s and text, etc. By following those literal anchors with generic text, you can require a specific position.
But other numbers could spoof the pattern (noise, so to speak), and you appear to be looking for a date. The following makes sure that /{date of 19XX or 20XX}/ is the only valid item for that position.
string pattern = @"
^ # Beginning of line (anchor)
/ # / anchor
[^-]+ # Anything not a dash.
- # Anchor dash
[^/]+ # Anything not a /
/ # / anchor
[^/]+ # Anything not a /
/ # / anchor
[12][90]\d\d # Allow only a `date` field of 19XX or 20XX.
";
// IgnorePatternWhitespace *only* allows us to comment the pattern
// and place it on multiple lines (space ignored)
// it does not affect processing of the data.
// Compiled tells the parser to hold the pattern compilation
// in memory for future processing.
var validator = new Regex(pattern, RegexOptions.IgnorePatternWhitespace |
RegexOptions.Compiled);
validator.IsMatch("/ -test/test/2016/April"); // True
validator.IsMatch("/ -test/test/2016"); // True
validator.IsMatch("/ -test/test/1985/April"); // True
validator.IsMatch("/ -2017/test/1985/April"); // True
// Negative Tests
validator.IsMatch("/ -2017/test/WTF/April"); // False
validator.IsMatch("/jabberwocky/test/1985/April"); // False, no dash!
validator.IsMatch("////April"); // false
validator.IsMatch("///2016/April"); // False because no text between `/`
validator.IsMatch("/ -test/test/ 2016/April"); // False because pattern
// does not allow a space
Pattern Notes
Instead of looking for the date with \d\d\d\d, I am giving the regex parser a specific hint that this is going to be a date that resides in either the twentieth century, 19XX, or the twenty-first century, 20XX. So I spell out the first two places of the \d\d\d\d pattern as sets: the first digit is [12] (1 for a 19XX pattern or 2 for a 20XX pattern) and the second is either a nine or a zero, [90]. Strictly speaking, [12][90] also admits 10XX and 29XX; use (?:19|20)\d\d if only those two centuries should pass. In a modern computer system most dates will fall within these two centuries, so why not craft the regex as such.

Assuming that "exact position" means "third position", the following regex would work:
/(?:[^/]*/){2}(\d{4}).*
In C#, this can be used with the Regex constructor and the @"" verbatim string syntax, which makes escaping the backslashes unnecessary:
var rx = new Regex(@"/(?:[^/]*/){2}(\d{4}).*");
If this regex matches a string, the four digits of the year are captured as a result.
Explanation
/ matches the leading slash character.
[^/]* matches any sequence of characters other than a slash.
/ matches a slash character.
The preceding two parts are wrapped inside a non-capturing group, which is specified with ?: as the first two characters inside the parentheses.
With (?:[^/]*/) now matching a "path segment" like "test/", the group must match exactly two times in a row. That's why the parentheses are followed by the quantifier {2}.
Then the actual number must be matched: it consists of four digits in a row. This is represented as (\d{4}), where \d means "any digit" and, once again, the quantifier defines that there should be 4 in a row.
Finally, there can be arbitrary characters after the number ("the path can continue"): this is specified by the . ("match any character") and the quantifier *, which means "any number of occurrences".
Note: There are many dialects of regular expressions. This one works for the C# regex implementation; however, it should work for many others as well.
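A minimal sketch of how the pattern above might be used to check the position and read the year at the same time. The helper name TryGetYear is hypothetical, and the leading ^ is an addition rather than part of the answer: without it, the match could begin at a later slash, so the digits would not be pinned to the third segment. It is written with top-level statements (C# 9 or later).

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: succeeds when the third path segment starts with four digits.
// The ^ anchor is an assumption added here so the match must start at the first slash.
static bool TryGetYear(string path, out string year)
{
    Match m = Regex.Match(path, @"^/(?:[^/]*/){2}(\d{4})");
    year = m.Success ? m.Groups[1].Value : "";
    return m.Success;
}

Console.WriteLine(TryGetYear("/test -test/test/2016/April", out string first) + " " + first); // True 2016
Console.WriteLine(TryGetYear("/test -test/2016/April", out _));                               // False
```

The trailing .* from the original pattern is dropped in this sketch because Regex.Match does not need to consume the rest of the string to succeed.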

Your regex will be:
\-(?:[^\/]+\/){2}(\d+)
It captures the number that appears after the xx/xx/ pattern; the {2} sets how many xx/ segments are expected and can be adjusted.
Example:
var s1 = "/test -test/test/2016/April";
var s2 = "/test -test/test/2016";
var rx = new Regex ("\\-(?:[^\\/]+\\/){2}(\\d+)");
var m1 = rx.Match(s1);
var m2 = rx.Match(s2);
if (m1.Success && m2.Success) {
if (m1.Groups[1].Value == m2.Groups[1].Value) {
Console.WriteLine ("s1 == s2");
}
}
Based on provided input string s1 and s2, it will print:
s1 == s2

Related

C# Regular expression to match on a character not following pairs of the same character

Objective: Regex Matching
For this example I'm interested in matching a "|" pipe character.
I need to match it if it's alone: "aaa|aaa"
I need to match it (the last pipe) only if it's preceded by pairs of pipe: (2,4,6,8...any even number)
Another way: I want to ignore ALL pipe pairs "||" (right to left)
or I want to select bachelor bars only (the odd man out)
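// The ** pairs appear to be highlighting around the pipe that should match; they are not part of the data.
string twomatches = "aaaaaaaaa||||**|**aaaaaa||**|**aaaaaa";
string onematch = "aaaaaaaaa||**|**aaaaaaa||aaaaaaaa";
string noMatch = "||";
string noMatch2 = "||||";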
I'm trying to select the last "|" only when preceded by an even sequence of "|" pairs or in a string when a single bar exists by itself.
Regardless of the number of "|"
You may use the following regex to select just odd one pipe out:
(?<=(?<!\|)(?:\|{2})*)\|(?!\|)
See regex demo.
The regex breakdown:
(?<=(?<!\|)(?:\|{2})*) - if a pipe is preceded with an even number of pipes ((?:\|{2})* - 0 or more sequences of exactly 2 pipes) from a position that has no preceding pipe ((?<!\|))
\| - match an odd pipe on the right
(?!\|) - if it is not followed by another pipe.
Please note that this regex uses a variable-width look-behind and is very resource-consuming. I'd rather use a capturing group mechanism here, but it all depends on the actual purpose of matching that odd pipe.
Here is a modified version of the regex for removing the odd one out:
var s = "1|2||3|||4||||5|||||6||||||7|||||||";
var data = Regex.Replace(s, @"(?<!\|)(?<even_pipes>(?:\|{2})*)\|(?!\|)", "${even_pipes}");
Console.WriteLine(data);
See IDEONE demo. Here, the quantified part is moved from lookbehind to an even_pipes named capturing group, so that it could be restored with the backreference in the replaced string. Regexhero.net shows 129,046 iterations per second for the version with a capturing group and 69,206 with the original version with variable-width lookbehind.
Only use variable-width look-behind if it is absolutely necessary!
Oh, it's reopened! If you need better performance, also try this negative improved version.
\|(?!\|)(?<!(?:[^|]|^)(?:\|\|)*)
The idea here is to first match the last literal | at right side of a sequence or single | and execute a negated version of the lookbehind just after the match. This should perform considerably better.
\|(?!\|) matches literal | IF NOT followed by another pipe character (right most if sequence).
(?<!(?:[^|]|^)(?:\|\|)*) IF the position right after the matched | IS NOT preceded by (?:\|\|)*, any number of literal ||, going back to a non-| or the ^ start. In other words: if this position is not preceded by an even number of pipe characters.
Btw, there is no performance gain in using \|{2} over \|\|, though it might be more readable.
See demo at regexstorm
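As a quick sanity check, both patterns from the answers above can be run side by side. This is only a sketch: the variable names and the sample strings are made up for illustration, and it is written with top-level statements (C# 9 or later).

```csharp
using System;
using System.Text.RegularExpressions;

// Both patterns are copied from the answers above; the names are hypothetical.
var lookbehindFirst = new Regex(@"(?<=(?<!\|)(?:\|{2})*)\|(?!\|)");
var lookbehindLast = new Regex(@"\|(?!\|)(?<!(?:[^|]|^)(?:\|\|)*)");

// Made-up samples: runs of 1, 2, 3 and 5 pipes.
string[] samples = { "aaa|aaa", "a||b", "a|||b", "a|||||b||c" };
foreach (string s in samples)
{
    // Each pattern should report one match per odd-length run of pipes.
    Console.WriteLine($"{s}: {lookbehindFirst.Matches(s).Count} / {lookbehindLast.Matches(s).Count}");
}
```

Both columns should agree for every input; a mismatch would mean one of the patterns was copied incorrectly.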

.NET Regex - Specific String containing a changing number

I am new to working with Regexs in C# .NET. Say I have a string as follows...
"Working on log #4"
And within this string we can expect to see the number (4) vary. How can I use a Regex to extract only that number from the string.
I want to make sure that the string matches the first part:
"Working on log #"
And then extract the integer from it.
Also - I know that I could do this using string.Split(), or .Substring, etc. I just wanted to know how I might use regex's to do this.
Thanks!
"Working on log #(\d+)"
The () create a match group, so you will be able to extract that section.
The \d matches any digit.
The + says "look at the previous token, match it one or more times" so it will make it match one or more digits.
So overall you're capturing a group containing one or more digits, where that group comes after "Working on log #"
Regex rgx = new Regex("Working on log #[0-9]"); is the pattern you want to use. The first part is a string literal, and [0-9] says that character can be any value 0 through 9. If you allow multiple digits, change it to [0-9]{x}, where x is the number of repetitions, or to [0-9]+, since a + after any character means 1 or more of that character is allowed.
You could also just do string.StartsWith("Working on log #") then split on # and use int.TryParse() with the second value to confirm it is in fact a valid integer.
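Try this: (?<=^Working on log #)\d+$. This matches only the number, so no capture group is needed. The ^ has to sit inside the lookbehind; placed in front of it, the pattern could never match, because nothing precedes the start of the string. Remove ^ and $ if this is within a larger string.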
^ - start of string
(?<=) - positive lookbehind - ensures what is between = and ) is found before
\d+ - at least one digit
$ - end of string
A capturing group is the solution:
"Working on log #(?<Number>[0-9]+)"
Then you can access the matched groups using the Match.Groups property.
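A short sketch of reading the named group through Match.Groups. The helper name ExtractLogNumber is hypothetical; returning null for a non-match is a design choice made here, not something the answers prescribe. It is written with top-level statements (C# 9 or later).

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper; the pattern is the named-group one from the answer above.
static int? ExtractLogNumber(string text)
{
    Match m = Regex.Match(text, @"Working on log #(?<Number>[0-9]+)");
    // int.TryParse guards against digit runs that are too long for an int.
    return m.Success && int.TryParse(m.Groups["Number"].Value, out int n) ? n : (int?)null;
}

Console.WriteLine(ExtractLogNumber("Working on log #4"));   // 4
Console.WriteLine(ExtractLogNumber("Idle") is null);        // True
```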

Regular Expressions: Determining if a String is either a number or variable

I am trying to combine two Regular Expression patterns to determine if a String is either a double value or a variable. My restrictions are as follows:
The variable can only begin with an _ or alphabetical letter (A-Z, ignoring case), but it can be followed by zero or more _s, letters, or digits.
Here's what I have so far, but I can't get it to work properly.
String varPattern = @"[a-zA-Z_](?: [a-zA-Z_]|\d)*";
String doublePattern = @"(?: \d+\.\d* | \d*\.\d+ | \d+ ) (?: [eE][\+-]?\d+)?";
String pattern = String.Format("({0}) | ({1})",
varPattern, doublePattern);
Regex.IsMatch(word, varPattern, RegexOptions.IgnoreCase)
It seems that it is capturing both Regular Expression patterns, but I need it to be either/or.
For example, _A2 2 is valid using the code above, but _A2 is invalid.
Some examples of valid variables are as follows:
_X6 , _ , A , Z_2_A
And some examples of invalid variables are as follows:
2_X6 , $2 , T_2$
I guess I just need clarification on the pattern format for the Regular Expression. The format is unclear to me.
As noted, the literal whitespace you've put in your regular expressions is part of the regular expression. You won't get a match unless that same whitespace is in the text being scanned. If you want to use whitespace to make your regex more readable, you'll need to specify RegexOptions.IgnorePatternWhitespace; after that, if you want to match any whitespace, you'll have to do so explicitly, by specifying \s, \x20, etc.
It should be noted that if you do specify RegexOptions.IgnorePatternWhitespace, you can use Perl-style comments (# to end of line) to document your regular expression (as I've done below). For complex regular expressions, someone 5 years from now — who might be you! — will thank you for the kindness.
Your [presumably intended] patterns are also, I think, more complex than they need be. A regular expression to match the identifier rule you've specified is this:
[a-zA-Z_][a-zA-Z0-9_]*
Broken out into its constituent parts:
[a-zA-Z_] # match an upper- or lower-case letter or an underscore, followed by
[a-zA-Z0-9_]* # zero or more occurrences of an upper- or lower-case letter, decimal digit or underscore
A regular expression to match the conventional style of a numeric/floating-point literal is this:
([+-]?[0-9]+)(\.[0-9]+)?([Ee][+-]?[0-9]+)?
Broken out into its constituent parts:
( # a mandatory group that is the integer portion of the value, consisting of
[+-]? # - an optional plus- or minus-sign, followed by
[0-9]+ # - one or more decimal digits
) # followed by
( # an optional group that is the fractional portion of the value, consisting of
\. # - a decimal point, followed by
[0-9]+ # - one or more decimal digits
)? # followed by,
( # an optional group, that is the exponent portion of the value, consisting of
[Ee] # - The upper- or lower-case letter 'E' indicating the start of the exponent, followed by
[+-]? # - an optional plus- or minus-sign, followed by
[0-9]+ # - one or more decimal digits.
)? # Easy!
Note: Some grammars differ as to whether the sign of the value is a unary operator or part
of the value and whether or not a leading + sign is allowed. Grammars also vary as to whether
something like 123245. is valid (e.g., is a decimal point with no fractional digits valid?)
To combine these two regular expression,
First, group each of them with parentheses (you might want to name the containing groups, as I've done):
(?<identifier>[a-zA-Z_][a-zA-Z0-9_]*)
(?<number>[+-]?[0-9]+)(\.[0-9]+)?([Ee][+-]?[0-9]+)?
Next, combine with the alternation operation, |:
(?<identifier>[a-zA-Z_][a-zA-Z0-9_]*)|(?<number>[+-]?[0-9]+)(\.[0-9]+)?([Ee][+-]?[0-9]+)?
Finally, enclose the whole shebang in an @"..." literal and you should be good to go.
That's about all there is to it.
Spaces are not ignored in regular expressions by default, so for each space in your current expressions it is looking for a space in that string. Add the RegexOptions.IgnorePatternWhitespace flag or remove the spaces from your expressions.
You will also want to add some beginning and end of string anchors (^ and $ respectively) so you do not match just part of a string.
You should avoid having spaces in your regular expressions unless you explicitly set RegexOptions.IgnorePatternWhitespace. To make sure you only get matches on complete words, you should include the beginning-of-line (^) and end-of-line ($) anchors. I would also suggest you build the entire expression pattern instead of using String.Format("({0}) | ({1})", ...) as you have here.
The below should work given your examples:
string pattern = @"(?:^[a-zA-Z_][a-zA-Z_\d]*$)|(?:^\d+(?:\.\d+){0,1}(?:[Ee][\+-]?\d+){0,1}$)";
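The advice in these answers (named groups, alternation, anchors, and RegexOptions.IgnorePatternWhitespace for readability) can be put together roughly as follows. This is a sketch under one assumption: the whole token has to match, so both alternatives sit inside a single ^(?: ... )$ wrapper. The variable name token is hypothetical, and the sample words come from the question plus one made-up number. It is written with top-level statements (C# 9 or later).

```csharp
using System;
using System.Text.RegularExpressions;

// Assumption: the entire input must be one identifier or one number.
var token = new Regex(@"
    ^(?:
        (?<identifier> [a-zA-Z_][a-zA-Z0-9_]* )                      # variable name
      | (?<number> [+-]?[0-9]+ (?:\.[0-9]+)? (?:[Ee][+-]?[0-9]+)? )  # numeric literal
    )$",
    RegexOptions.IgnorePatternWhitespace);

foreach (string word in new[] { "_X6", "_", "Z_2_A", "3.14e-2", "2_X6", "$2", "T_2$", "_A2 2" })
{
    Match m = token.Match(word);
    // The named groups tell which alternative matched.
    string kind = !m.Success ? "invalid" : m.Groups["identifier"].Success ? "variable" : "number";
    Console.WriteLine($"{word} -> {kind}");
}
```

Wrapping the alternation in one anchored group avoids a common slip where ^ or $ ends up binding to only one side of the |.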

Regex IsMatch taking too long to execute

I have one strange issue on my .NET project with RegEx. Please, see C# code below:
const string PATTERN = @"^[a-zA-Z]([-\s\.a-zA-Z]*('(?!'))?[-\s\.a-zA-Z]*)*$";
const string VALUE = "Ingebrigtsen Myre (Øvre)";
System.Text.RegularExpressions.Regex regex = new System.Text.RegularExpressions.Regex(PATTERN);
if (!regex.IsMatch(VALUE)) // <--- Infinite loop here
return string.Empty;
// Some other code
I use this pattern to validate all types of names (first names, last names, middle names, etc.). The value is a parameter, but I provided it as a constant above because the issue is not reproduced often, only with special symbols: *, (, ), etc. (sorry, but I don't have the full list of these symbols).
Can you help me to fix this infinite loop? Thanks for any help.
Added: this code is placed on the very base level of project and I don't want to do any refactoring there - I just want to have quick fix for this issue.
Added 2: I do know that it technically is not a loop - I meant that "regex.IsMatch(VALUE)" never ends. I waited for about an hour and it was still executing.
Your non-trivial regex: ^[a-zA-Z]([-\s\.a-zA-Z]*('(?!'))?[-\s\.a-zA-Z]*)*$, is better written with comments in free-spacing mode like so:
Regex re_orig = new Regex(@"
^ # Anchor to start of string.
[a-zA-Z] # First char must be letter.
( # $1: Zero or more additional parts.
[-\s\.a-zA-Z]* # Zero or more valid name chars.
( # $2: optional quote.
' # Allow quote but only
(?!') # if not followed by quote.
)? # End $2: optional quote.
[-\s\.a-zA-Z]* # Zero or more valid name chars.
)* # End $1: Zero or more additional parts.
$ # Anchor to end of string.
",RegexOptions.IgnorePatternWhitespace);
In English, this regex essentially says: "Match a string that begins with an alpha letter [a-zA-Z] followed by zero or more alpha letters, whitespaces, periods, hyphens or single quotes, but each single quote may not be immediately followed by another single quote."
Note that your above regex allows oddball names such as: "ABC---...'... -.-.XYZ " which may or may not be what you need. It also allows multi-line input and strings that end with whitespace.
The "infinite loop" problem with the above regex is that catastrophic backtracking occurs when this regex is applied to a long invalid input which contains two single quotes in a row. Here is an equivalent pattern which matches (and fails to match) the exact same strings, but does not experience catastrophic backtracking:
Regex re_fixed = new Regex(@"
^ # Anchor to start of string.
[a-zA-Z] # First char must be letter.
[-\s.a-zA-Z]* # Zero or more valid name chars.
(?: # Zero or more isolated single quotes.
' # Allow single quote but only
(?!') # if not followed by single quote.
[-\s.a-zA-Z]* # Zero or more valid name chars.
)* # Zero or more isolated single quotes.
$ # Anchor to end of string.
",RegexOptions.IgnorePatternWhitespace);
And here it is in short form in your code context:
const string PATTERN = @"^[a-zA-Z][-\s.a-zA-Z]*(?:'(?!')[-\s.a-zA-Z]*)*$";
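A sketch of the rewritten pattern in use. The one-second match timeout is an added safeguard rather than part of the answer, and the names nameRegex and IsValidName are hypothetical. It relies on the Regex constructor overload that takes a TimeSpan and on RegexMatchTimeoutException, and is written with top-level statements (C# 9 or later).

```csharp
using System;
using System.Text.RegularExpressions;

// The pattern is the rewritten one from this answer; the timeout is an assumption added here.
var nameRegex = new Regex(
    @"^[a-zA-Z][-\s.a-zA-Z]*(?:'(?!')[-\s.a-zA-Z]*)*$",
    RegexOptions.None,
    TimeSpan.FromSeconds(1));

static bool IsValidName(Regex regex, string value)
{
    try
    {
        return regex.IsMatch(value);
    }
    catch (RegexMatchTimeoutException)
    {
        // Treat a timed-out match as a rejection instead of hanging the caller.
        return false;
    }
}

Console.WriteLine(IsValidName(nameRegex, "Ingebrigtsen Myre (Øvre)")); // False
Console.WriteLine(IsValidName(nameRegex, "O'Neil-Smith Jr."));         // True
```

A timeout does not repair a backtracking-prone pattern, but it does bound the damage if another problematic input turns up later.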
Look at this part of your regex:
( [-\s\.a-zA-Z]* ('(?!'))? [-\s\.a-zA-Z]* )*$
^              ^         ^              ^  ^
|              |         |              |  |
|              |         |              |  This group repeats any number of times
|              |         |              charclass repeats any number of times
|              |         This group is optional
|              This character class also repeats any number of times
Outer group (repeated, as seen above)
That means that as soon as your input string contains a character that's not in the character class (like the brackets and non-ASCII letter in your example), the preceding characters will be tried in a lot of permutations whose number increases exponentially with the length of the string.
To avoid that (and to allow a faster failure of the regex), use atomic groups:
const string PATTERN = @"^[a-zA-Z](?>(?>[-\s\.a-zA-Z]*)(?>'(?!'))?(?>[-\s\.a-zA-Z])*)*$";
You've got an "any number of any number" here:
...[-\s\.a-zA-Z]*)*
and because your input doesn't match, the engine backtracks to try all permutations of dividing the input up, and the number of attempts grows exponentially with the length of the input.
You can fix it simply by adding a "+" to make a possessive quantifier, which once consumed will not backtrack to find other combinations:
const string PATTERN = @"^[a-zA-Z]([-\s\.a-zA-Z]*('(?!'))?[-\s\.a-zA-Z]*+)*$";
                                                                        ^-- added + here
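You can see a live demo (on rubular) demonstrating that adding the plus fixes the loop problem and still matches input that doesn't have the odd characters. Note that rubular runs the Ruby engine; the .NET Regex class does not support possessive quantifiers, so in C# the same effect needs an atomic group, (?>...).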

C# regex to replace a delimiter by another one

I'm working on PL/SQL code where I want to replace every ';' that appears inside a comment with '~'.
e.g.
If I have code such as:
--comment 1 with;
select id from t_id;
--comment 2 with ;
select name from t_id;
/*comment 3
with ;*/
Then I want the result text to be:
--comment 1 with~
select id from t_id;
--comment 2 with ~
select name from t_id;
/*comment 3
with ~*/
Can it be done using regex in C#?
Regular expression:
((?:--|/\*)[^~]*)~(\*/)?
C# code to use it:
string code = "all that text of yours";
Regex regex = new Regex(@"((?:--|/\*)[^~]*)~(\*/)?", RegexOptions.Multiline);
string result = regex.Replace(code, "$1;$2");
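Not tested with C#, but the regular expression and the replacement work in RegexBuddy with your text =) Note that, as written, the pattern looks for ~ and the replacement writes ;, so swap the two characters to convert in the direction the question asks for.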
Note: I am not a very brilliant regular expression writer, so it could probably have been written better. But it works. And handles both your cases with one-liner-comments starting with -- and also the multiline ones with /* */
Edit: Read your comment to the other answer, so removed the ^ anchor, so that it takes care of comments not starting on a new line as well.
Edit 2: Figured it could be simplified a bit. Also found it works fine without the ending $ anchor as well.
Explanation:
// ((?:--|/\*)[^~]*)~(\*/)?
//
// Options: ^ and $ match at line breaks
//
// Match the regular expression below and capture its match into backreference number 1 «((?:--|/\*)[^~]*)»
// Match the regular expression below «(?:--|/\*)»
// Match either the regular expression below (attempting the next alternative only if this one fails) «--»
// Match the characters “--” literally «--»
// Or match regular expression number 2 below (the entire group fails if this one fails to match) «/\*»
// Match the character “/” literally «/»
// Match the character “*” literally «\*»
// Match any character that is NOT a “~” «[^~]*»
// Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
// Match the character “~” literally «~»
// Match the regular expression below and capture its match into backreference number 2 «(\*/)?»
// Between zero and one times, as many times as possible, giving back as needed (greedy) «?»
// Match the character “*” literally «\*»
// Match the character “/” literally «/»
A regex is not really needed - you can iterate on lines, locate the lines starting with "--" and replace ";" with "~" on them.
String.StartsWith("--") - Determines whether the beginning of an instance of String matches a specified string.
String.Replace(";", "~") - Returns a new string in which all occurrences of a specified Unicode character or String in this instance are replaced with another specified Unicode character or String.
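The line-by-line approach described above could be sketched like this. The helper name TildeLineComments is hypothetical. It only handles lines that begin with "--" (after optional indentation), so /* ... */ blocks and comments that trail a statement on the same line are left untouched. It is written with top-level statements (C# 9 or later).

```csharp
using System;
using System.Linq;

// Hypothetical helper: replaces ';' with '~' only on lines that start with "--".
static string TildeLineComments(string code)
{
    var lines = code.Split('\n')
        .Select(line => line.TrimStart().StartsWith("--", StringComparison.Ordinal)
            ? line.Replace(";", "~")
            : line);
    return string.Join("\n", lines);
}

Console.WriteLine(TildeLineComments("--comment 1 with;\nselect id from t_id;"));
// --comment 1 with~
// select id from t_id;
```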
