Match exactly one occurrence with regex - c#

Consider M,T,W,TH,F,S,SU are days of week.
I have regex which is working well except for one scenario when there is no sequence of weekdays, i.e. there is no M, T , W , TH , F , S , SU at the expected location inside the string.
For example, q10MT is valid but q10HT is invalid.
Below is my expression:
string expression = "q(\\d*)(M)?(T(?!H))?(W)?(TH)?(F)?(S(?!U))?(SU)?";
In case of q10MT, the output is q10MT which is correct, but in case of q10HT the output is q10 which is incorrect, my regex should return no value or empty string when there is no match.
What changes do I need to make in order to achieve this?

You can achieve it with a positive look-ahead:
q(\\d*)(?=(?:M|T(?!H)|W|TH|F|S(?!U)|SU))(M)?(T(?!H))?(W)?(TH)?(F)?(S(?!U))?(SU)?
Or, as @Taemyr noted, a shorter equivalent:
q(\\d*)(?=(?:M|TH?|W|F|SU?))(M)?(T(?!H))?(W)?(TH)?(F)?(S(?!U))?(SU)?
The (?=(?:M|TH?|W|F|SU?)) look-ahead makes sure there is at least one required value from the alternation list you have after the look-ahead.
C# regex usage:
var rx = new Regex(@"q(\d*)(?=(?:M|TH?|W|F|SU?))(M)?(T(?!H))?(W)?(TH)?(F)?(S(?!U))?(SU)?");
var result = rx.Match("q10MSUT").Value;
Result: q10MSU

What about the following:
q(\d*)(M|TH?|W|F|SU?)+
See demo with some examples on matches and no-matches. The key change in this regexp is that this one uses the + to require at least one match.
Be aware that this solution doesn't require the days to be in order and allows days to be skipped; the OP said in comments that neither of these matters.
Edit: OP says in comments that he requires only one match for each day, which this solution doesn't account for.

If order does not matter you need to do something like this;
q(?<number>\d+)((?<monday>(?<!M\D*)M)|(?<tuesday>(?<!T(?!H)\D*)T(?!H))|(?<wednesday>(?<!W\D*)W)|(?<thursday>(?<!TH\D*)TH)|(?<friday>(?<!F\D*)F)|(?<saturday>(?<!S(?!U)\D*)S(?!U))|(?<sunday>(?<!SU\D*)SU))+
This matches if q is followed by some number, and then followed by one or more weekdays. Order of weekdays does not matter, and the negative lookbehind ensures that no weekday can occur more than once.
Each weekday is captured in its own named capturing group so that it can be extracted later:
"q10MTsomething" will capture "q10MT", with 10 in the "number" group, M in the "monday" group and T in the "tuesday" group; the other groups will be empty.
"q10TFMother" will capture "q10TFM", with groups as in the previous example plus F in the "friday" group.
"q10TFMT" will capture "q10TFM", with groups as in the previous example.
"q10HT" will not match.
Note that this is the regexp string. If entered in code you might need to escape the backslashes (or use a verbatim @"..." string in C#) to produce the correct string.
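A minimal C# sketch of reading those named groups, assuming the pattern above is used unchanged. The helper name DescribeDays and its output format are made up for illustration and are not part of the answer.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper: reports the matched text, the number, and which day groups matched.
string DescribeDays(string input)
{
    // Verbatim string, so the backslashes in the pattern need no extra escaping.
    var dayMatch = Regex.Match(input,
        @"q(?<number>\d+)((?<monday>(?<!M\D*)M)|(?<tuesday>(?<!T(?!H)\D*)T(?!H))|(?<wednesday>(?<!W\D*)W)|(?<thursday>(?<!TH\D*)TH)|(?<friday>(?<!F\D*)F)|(?<saturday>(?<!S(?!U)\D*)S(?!U))|(?<sunday>(?<!SU\D*)SU))+");
    if (!dayMatch.Success)
    {
        return "(no match)";
    }

    string[] dayNames = { "monday", "tuesday", "wednesday", "thursday", "friday", "saturday", "sunday" };
    // Group.Success is true only for the named groups that took part in the match.
    var present = dayNames.Where(name => dayMatch.Groups[name].Success);
    return $"{dayMatch.Value} number={dayMatch.Groups["number"].Value} days={string.Join(",", present)}";
}

Console.WriteLine(DescribeDays("q10MTsomething")); // q10MT number=10 days=monday,tuesday
Console.WriteLine(DescribeDays("q10TFMT"));        // q10TFM number=10 days=monday,tuesday,friday
Console.WriteLine(DescribeDays("q10HT"));          // (no match)
```

The days are listed in calendar order here because the helper walks a fixed list of group names, not because the regex enforces any order.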

The question is answered already. Even so I want to point to another idea using a variable length lookbehind for maintaining the sequence, which should be fine with .NET
q(\d*)[MTWFSUH]+(?<=q\d*(M)?(T)?(W)?(TH)?(F)?(S)?(SU)?)
[MTWFSUH] is the list of valid characters. At least one is required
Using a lookbehind for matching as long as the sequence is maintained
Test at your test tool
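For anyone who wants to try the lookbehind variant locally, here is a small C# sketch. It assumes the pattern exactly as written above; the helper name MatchInSequence is hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

// The character class consumes candidate letters; the variable-length lookbehind
// then re-reads the text from the q onward and only accepts it if the days are in order.
var sequenceRegex = new Regex(@"q(\d*)[MTWFSUH]+(?<=q\d*(M)?(T)?(W)?(TH)?(F)?(S)?(SU)?)");

// Hypothetical helper: returns the matched text, or an empty string when nothing matches.
string MatchInSequence(string input) => sequenceRegex.Match(input).Value;

Console.WriteLine(MatchInSequence("q10MT"));  // q10MT
Console.WriteLine(MatchInSequence("q10MWF")); // q10MWF
Console.WriteLine(MatchInSequence("q10HT").Length); // 0 (no match)
```

Match.Value is an empty string for a failed match, which is the "no value or empty string" behaviour the question asks for.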


How to match a string, but only if the same string has not already been matched with or without dashes?

I have a case I'm trying to match using regular expressions.
My current expression will match a string in a certain format with or without dashes. I would like to add it to match only if the string has not been matched before, with or without the dashes. For example, take the following cases:
1. 1234-56-789-5555
2. 1234567895555
3. 0000-99-888-3333
4. 1111223334444
If the four examples above appeared in this same order in a list, document, whatever, I would want to capture only (1, 3, 4). I want to skip #2 since it was already captured by #1, but with the dashes. If #2 had come first, I would similarly have wanted to skip #1.
Here's the current expression I'm using:
\d\d\d\d-*\d\d-*\d\d\d-*\d\d\d\d
I tried to read up on look behinds (I'm fairly inexperienced with Regex) but I only really understand that a look behind only checks if certain text is matched previously. I'm not sure if what I want can be combined with this; I only see how to check for specific text, not for the current value with/without dashes.
I'm currently doing this with C# logic, but am trying to see if it can be done purely in Regex. If it can't be done, that's fine; I'm just trying to beef up my Regex knowledge in this case.
Is this possible -- how can I accomplish this?
If you want to obtain just the first occurrence of each number (answering "I want to skip #2 since it was already captured by #1, but with the dashes"), you need a negative look-behind together with the RegexOptions.RightToLeft and RegexOptions.Singleline options:
(?<!\b\1-?\2-?\3-?\4\b.*)\b(\d{4})-?(\d{2})-?(\d{3})-?(\d{4})\b
The \b(\d{4})-?(\d{2})-?(\d{3})-?(\d{4})\b subpattern is the number with capture groups to check for their presence regardless of the hyphens earlier in the string.
The (?<!\b\1-?\2-?\3-?\4\b.*) subpattern look-behind is checking if we have no other occurrences of the same string.
Tested at regexhero.net and in Expresso.
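A short C# sketch of how that pattern might be run, assuming the four sample lines from the question are joined with newlines. The helper name FirstOccurrences is hypothetical. Because RegexOptions.RightToLeft reports matches starting from the end of the input, the sketch reverses them to get document order.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper: returns each number the first time it appears, with or without dashes.
string[] FirstOccurrences(string input)
{
    var firstOnly = new Regex(
        @"(?<!\b\1-?\2-?\3-?\4\b.*)\b(\d{4})-?(\d{2})-?(\d{3})-?(\d{4})\b",
        RegexOptions.RightToLeft | RegexOptions.Singleline);

    // RightToLeft yields matches from the end of the text, so reverse them.
    return firstOnly.Matches(input)
        .Cast<Match>()
        .Select(found => found.Value)
        .Reverse()
        .ToArray();
}

var sample = "1234-56-789-5555\n1234567895555\n0000-99-888-3333\n1111223334444";
foreach (var number in FirstOccurrences(sample))
{
    Console.WriteLine(number);
}
// Prints:
// 1234-56-789-5555
// 0000-99-888-3333
// 1111223334444
```

RightToLeft matters because the four capture groups are then filled before the look-behind at the start of the number is evaluated, so the backreferences \1 to \4 have values to compare against.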
You can easily do this without using regex, but if you still want to use regex for this purpose, you can use the following to match:
(?<=((\d{4})-(\d{2})-(\d{3})-(\d{4})).*?)\2\3\4\5
And replace with '' (empty string)
Explanation:
This will match all those digits without dashes that have already been captured by digits with dashes.
So, of your examples 1, 2, 3 and 4, instead of matching 1, 3 and 4 it matches 2; replacing that match with '' (nothing) leaves you with 1, 3 and 4.
You can use the following regex to do exactly what you want:
((?<=((\d{4})-(\d{2})-(\d{3})-(\d{4})).*?)(?!\3\4\5\6)\d{13})|(((?<=((\d{4})(\d{2})(\d{3})(\d{4})).*?)(?!\10-\11-\12-\13)((\d{4})-(\d{2})-(\d{3})-(\d{4}))))
Explanation:
((?<=((\d{4})-(\d{2})-(\d{3})-(\d{4})).*?)(?!\3\4\5\6)\d{13}) matches all those \d{13} that have not previously occurred with dashes in between them (this excludes strings of type 2 in your case)
((\d{4})-(\d{2})-(\d{3})-(\d{4})) and match all of this pattern
Matches 1, 3 and 4 in your case.

Regex for key value pairs

I am not great with regular expressions and I have a need to parse out key/value pairs from a string. An example of the string would be:
Event Name CallingNumber:+15555555555 CallID:12345 CallingName:Doe, John CallingTime:12-26-2013 14:27:41.645497
The result I'm looking for would be something like this:
CallingNumber=+15555555555
CallID=12345
CallingName=Doe, John
CallingTime=12-26-2013 14:27:41.645497
The key/value pairs are delimited by a space, but the value is allowed to have a space in it (ex: Doe, John). It would be nice if the values were surrounded by quotes or something, but they are not. Essentially I'm trying to match a word without a space followed by a colon and then any character after the colon until it reaches another word without a space followed by a colon.
Your match is impossible as stated: the fields are delimited with : but you also have a date with : in there, and a regex can't distinguish those very easily.
Still, this is what I came up with:
(.+?):(.+?)(?=(?:[^\s]+:)|(?:$))
Again, because of the date, this won't work perfectly.
Here's a fiddle to demonstrate: http://www.rexfiddle.net/Wm3NiK0
Edit: If your "keys" are only letters (not numbers), which avoids the time/date problem, then this will work:
([A-Za-z]+?):(.+?)\s?(?=(?:[A-Za-z]+:)|(?:$))
Here's another fiddle to demonstrate this: http://www.rexfiddle.net/sGQs7YV
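As a rough C# illustration of the letters-only pattern applied to the sample line from the question: the helper name ParsePairs is an assumption for this sketch, and it relies on keys consisting only of letters, as the edit above states.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper: returns the key/value pairs in the order they appear in the line.
List<KeyValuePair<string, string>> ParsePairs(string line)
{
    var pairs = new List<KeyValuePair<string, string>>();
    // Group 1 is the key; group 2 is the value, which stops before the next "letters:" key or at the end.
    foreach (Match pair in Regex.Matches(line, @"([A-Za-z]+?):(.+?)\s?(?=(?:[A-Za-z]+:)|(?:$))"))
    {
        pairs.Add(new KeyValuePair<string, string>(pair.Groups[1].Value, pair.Groups[2].Value));
    }
    return pairs;
}

var sampleLine = "Event Name CallingNumber:+15555555555 CallID:12345 CallingName:Doe, John CallingTime:12-26-2013 14:27:41.645497";
foreach (var entry in ParsePairs(sampleLine))
{
    Console.WriteLine($"{entry.Key}={entry.Value}");
}
// Prints:
// CallingNumber=+15555555555
// CallID=12345
// CallingName=Doe, John
// CallingTime=12-26-2013 14:27:41.645497
```

The leading "Event Name" text is skipped because neither word is followed directly by a colon.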
You can apply the regex repeatedly, with a (.*) to return the "yet to be parsed" remainder
In pseudocode form, this might be:
match string to "^(([^:]*\s)*[^:]*)\s+(.*)$"
should grab "Event Name" and leave the rest as $3
loop:
keep only $3 as new base string
match new base string to "^(\w+)[:](.+?)\s+(\w+[:].*)$"
key = $1, value = $2, new remainder = $3
repeat until no $1, $2 values are returned
"I'm suing .NET (c#)," good idea! :) Microsoft needs to be put in its place!
Do you have a fixed number of fields, or could they vary in number? Do you expect the same fields each time? In the same order? If a fixed number, you could hard code the number of fields in the regexp, but I still think that trying to do it with just one regexp is asking for a headache. Use some scripting code and break it down piece by piece, first of all splitting it on :\s+. The last word in a group is then stripped off as the name of the next group, and the remainder is the value of the previous group. The first and last groups have to have some special treatment. I think that would be a lot easier and more understandable than trying to do it in one ugly regexp. As a bonus, any number of fields in any order could be handled.

Regex to parse formatter string

I am writing a string.Format-like method. In order to do this, I am adopting Regex to determine commands and parameters: e.g. Format(@"\m{0,1,2}", byteArr0, byteArr1, byteArr2)
For the first Regex, return 2 groups:
'\m'
'{0,1,2}'
Another Regex takes the value of '{0,1,2}' and has 3 matches:
0
1
2
These values are the indexes corresponding to the byteArr params.
This command structure is likely to grow so I'm really trying to figure this out and learn enough to be able to modify the Regex for future requirements. I would think that a single Regex would do all of the above but there is value in having 2 separate Regex(es/ices ???) expressions.
Any way, to get the first group '\m' the Regex is:
"(\\)(\w{1,1})" // I want the '{0,1,2}' group also
To get the integer matches '{0,1,2}' I was trying:
"(?<=\{)([^}]*)(?=\})"
I am having difficulty in achieving: (1) 2 groups on the first expression and (2) 3 matches on the integers within the braces delimited by a comma in the second expression.
Your first regex (\\)(\w{1,1}) can be greatly simplified.
You don't want to capture the \ separately to the m so no need to wrap them in their own sets of parenthesis.
\w{1,1} is the same as just \w.
So we have \\\w to match the first part \m.
Now to deal with the second part, really we can ignore everything other than the 0,1,2 in the example since there are no numbers elsewhere so you'd just use: \d+ and iterate through the matches.
But let's assume the example could actually be \9{1,2,3}.
Now \d+ would match the 9 so to avoid this we could use [{,](\d+)[,}]. This says capture a number that has either a , or { on the left of it and a , or } on the right.
You're right in saying that we can match the whole string with a single regex, something like this would do it:
(\\\w){((\d+),?)+}
However the problem with this is when you examine the contents of the capture groups afterwards, the last number caught by the (\d+) overwrites all the other values that were caught in there. So you'd be left with group 1: \m and group 2: 2 for your example.
With that in mind I recommend using 2 regexs:
For the 1st part: \\\w
For the numbers: I'd forget about the [{,](\d+)[,}] (and the many other ways you could do it), the cleanest way might just be to grab whatever is inside the {...} and then match with a simple \d+.
So to do this first use (\\\w)\{([^/}]+)\} to grab the \m into group 1 and the 1,2,3 into group 2, then just use \d+ on that.
FYI, your (?<=\{)([^}]*)(?=\}) works fine, but you can't put anything before the lookbehind, i.e. the \\\w. In the vast majority of cases where a lookbehind can be used, you can do what you want by just using capture groups and ignoring everything else:
My regex \{([^/}]+)\} is pretty much the same as your (?<=\{)([^}]*)(?=\}) except that rather than looking ahead and looking behind for the { and }, I just leave them outside the capture groups that are going to be used.
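The two-regex approach recommended above could look roughly like this in C#. The helper name ParseCommand and the tuple it returns are assumptions for this sketch, not part of the answer.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper: splits a format string such as \m{0,1,2} into its command and indexes.
(string Command, int[] Indexes) ParseCommand(string format)
{
    // First regex: group 1 is the command (\m), group 2 is everything inside the braces.
    var whole = Regex.Match(format, @"(\\\w)\{([^/}]+)\}");
    if (!whole.Success)
    {
        throw new FormatException("No command found in: " + format);
    }

    // Second regex: pull each integer out of the brace contents.
    int[] numbers = Regex.Matches(whole.Groups[2].Value, @"\d+")
        .Cast<Match>()
        .Select(digits => int.Parse(digits.Value))
        .ToArray();
    return (whole.Groups[1].Value, numbers);
}

var parsed = ParseCommand(@"\m{0,1,2}");
Console.WriteLine(parsed.Command);                    // \m
Console.WriteLine(string.Join(",", parsed.Indexes));  // 0,1,2
```

Running \d+ only on the brace contents is what keeps a numeric command such as \9 from being mistaken for an index.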
Consider the following Regexes...
(^.*?)(?={.*})
\d+
Good Luck!

C# Regex Replace weird behavior with multiple captures and matching at the end of string?

I'm trying to write something that formats Brazilian phone numbers, but I want it to do it matching from the end of the string, and not the beginning, so it would turn input strings according to the following pattern:
"5135554444" -> "(51) 3555-4444"
"35554444" -> "3555-4444"
"5554444" -> "555-4444"
Since the beginning portion is what usually changes, I thought of building the match using the $ sign so it would start at the end and then capture backwards (so I thought), replacing the match with the desired end format, and afterwards just getting rid of the parentheses "()" in front if they were empty.
This is the C# code:
string s = "5135554444";
string str = Regex.Replace(s, @"\D", ""); //Get rid of non digits, if any
str = Regex.Replace(str, @"(\d{0,2})(\d{0,4})(\d{1,4})$", "($1) $2-$3");
return Regex.Replace(str, @"^\(\) ", ""); //Get rid of empty () at the beginning
The return value was as expected for a 10 digit number. But for anything less than that, it ended up showing some strange behavior. These were my results:
"5135554444" -> "(51) 3555-4444"
"35554444" -> "(35) 5544-44"
"5554444" -> "(55) 5444-4"
It seems that it ignores the $ at the end to do the match, except that if I test with something less than 7 digits it goes like this:
"554444" -> "(55) 444-4"
"54444" -> "(54) 44-4"
"4444" -> "(44) 4-4"
Notice that it keeps the "minimum" {n} number of times of the third capture group always capturing it from the end, but then, the first two groups are capturing from the beginning as if the last group was non greedy from the end, just getting the minimum... weird or it's me?
Now, if I change the pattern, so instead of {1,4} on the third capture I use {4} these are the results:
str = Regex.Replace(str, @"(\d{0,2})(\d{0,4})(\d{4})$", "($1) $2-$3");
"5135554444" -> "(51) 3555-4444" //As expected
"35554444" -> "(35) 55-4444" //The last four are as expected, but "35" as $1?
"54444" -> "(5) -4444" //Again "4444" in $3, why nothing in $2 and "5" in $1?
I know this is probably some stupidity of mine, but wouldn't it be more reasonable if I want to capture at the end of the string, that all previous capture groups would be captured in reverse order?
I would think that "54444" would turn into "5-4444" in this last example... then it does not...
How would one accomplish this?
(I know maybe there's a better way to accomplish the very same thing using different approaches... but what I'm really curious about is why this particular behavior of the Regex seems odd. So, the answer to this question should focus on explaining why the last capture is anchored at the end of the string, and why the others are not, as demonstrated in this example. So I'm not particularly interested in the actual phone # formatting problem, but in understanding the Regex syntax)...
Thanks...
So you want the third part to always have four digits, the second part zero to four digits, and the first part zero to two digits, but only if the second part contains four digits?
Use
^(\d{0,2}?)(\d{0,4})(\d{4})$
As a C# snippet, commented:
resultString = Regex.Replace(subjectString,
@"^ # anchor the search at the start of the string
(\d{0,2}?) # match as few digits as possible, maximum 2
(\d{0,4}) # match up to four digits, as many as possible
(\d{4}) # match exactly four digits
$ # anchor the search at the end of the string",
"($1) $2-$3", RegexOptions.IgnorePatternWhitespace);
By adding a ? to a quantifier (??, *?, +?, {a,b}?) you make it lazy, i.e. tell it to match as few characters as possible while still allowing an overall match to be found.
Without the ? in the first group, what would happen when trying to match 123456?
First, the \d{0,2} matches 12.
Then, the \d{0,4} matches 3456.
Then, the \d{4} doesn't have anything left to match, so the regex engine backtracks until that's possible again. After four steps, the \d{4} can match 3456. The \d{0,4} gives up everything it had matched greedily for this.
Now, an overall match has been found - no need to try any more combinations. Therefore, the first and third groups will contain parts of the match.
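The backtracking walk-through above can be checked with a small C# comparison of the greedy and lazy first group. The helper name FormatPhone is hypothetical; the replacement string is the one from the question.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: applies the question's replacement string with a given pattern.
string FormatPhone(string digits, string pattern) =>
    Regex.Replace(digits, pattern, "($1) $2-$3");

// Greedy first group: it keeps "12", so the second group ends up empty.
Console.WriteLine(FormatPhone("123456", @"^(\d{0,2})(\d{0,4})(\d{4})$"));      // (12) -3456

// Lazy first group: it takes nothing unless the other two groups cannot cover the digits.
Console.WriteLine(FormatPhone("123456", @"^(\d{0,2}?)(\d{0,4})(\d{4})$"));     // () 12-3456
Console.WriteLine(FormatPhone("5135554444", @"^(\d{0,2}?)(\d{0,4})(\d{4})$")); // (51) 3555-4444
```

The empty "() " prefix in the second result is what the question's final Regex.Replace call strips off afterwards.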
You have to tell it that it's OK if the first matching groups aren't there, but not the last one:
(\d{0,2}?)(\d{0,4}?)(\d{1,4})$
Matches your examples properly in my testing.

In C# regular expression why does the initial match show up in the groups?

So if I write a regex and it matches, I can get the match or I can access its groups. This seems counterintuitive since the groups are defined in the expression with parentheses "(" and ")". It seems like it is not only wrong but redundant. Anyone know why?
Regex quickCheck = new Regex(@"(\D+)\d+");
string source = "abc123";
Match m = quickCheck.Match(source);
m.Value //Equals source
m.Groups.Count //Equals 2
m.Groups[0].Value //Equals source
m.Groups[1].Value //Equals "abc"
I agree - it is a little strange, however I think there are good reasons for it.
A Regex Match is itself a Group, which in turn is a Capture.
But the Match.Value (or Capture.Value as it actually is) is only valid when one match is present in the string - if you're matching multiple instances of a pattern, then by definition it can't return everything. In effect - the Value property on the Match is a convenience for when there is only one match.
But to clarify where this behaviour of passing the whole match into Groups[0] makes sense - consider this (contrived) example of a naive code unminifier:
[TestMethod]
public void UnMinifyExample()
{
string toUnMinify = "{int somevalue = 0; /*init the value*/} /* end */";
string result = Regex.Replace(toUnMinify, @"(;|})\s*(/\*[^*]*?\*/)?\s*", "$0\n");
Assert.AreEqual("{int somevalue = 0; /*init the value*/\n} /* end */\n", result);
}
The regex match will preserve /* */ comments at the end of a statement, placing a newline afterwards - but works for either ; or } line-endings.
Okay - you might wonder why you'd bother doing this with a regex - but humour me :)
If Groups[0] generated by the matches for this regex was not the whole capture - then a single-call replace would not be possible - and your question would probably be asking why doesn't the whole match get put into Groups[0] instead of the other way round!
The documentation for Match says that the first group is always the entire match so it's not an implementation detail.
It's historical is all. In Perl 5, the contents of capture groups are stored in the special variables $1, $2, etc., but C#, Java, and others instead store them in an array (or array-like structure). To preserve compatibility with Perl's naming convention (which has been copied by several other languages), the first group is stored in element number one, the second in element two, etc. That leaves element zero free, so why not store the full match there?
FYI, Perl 6 has adopted a new convention, in which the first capturing group is numbered zero instead of one. I'm sure it wasn't done just to piss us off. ;)
Most likely so that you can use "$0" to represent the match in a substitution expression, and "$1" for the first group match, etc.
I don't think there's really an answer other than the person who wrote this chose that as an implementation detail. As long as you remember that the first group will always equal the source string you should be ok :-)
Not sure why either, but if you use named groups you can then set the option RegexOptions.ExplicitCapture so that unnamed parentheses no longer capture. Groups[0] still holds the entire match, though.
It might be redundant, however it has some nice properties.
For example, it means the capture groups work the same way as other regex engines - the first capture group corresponds to "1", and so on.
Backreferences are one-based, e.g., \1 or $1 is the first parenthesized subexpression, and so on. As laid out, one maps to the other without any thought.
Also of note: m.Groups["0"] gives you the entire matched substring, so be sure to skip "0" if you're iterating over regex.GetGroupNames().
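A brief C# sketch tying these points together, reusing the pattern and input from the question. The replacement string "[$0|$1]" is just an illustration.

```csharp
using System;
using System.Text.RegularExpressions;

var quickCheck = new Regex(@"(\D+)\d+");
Match wholeMatch = quickCheck.Match("abc123");

// Group 0 is the entire match; numbered capture groups start at 1.
Console.WriteLine(wholeMatch.Groups.Count);     // 2
Console.WriteLine(wholeMatch.Groups[0].Value);  // abc123
Console.WriteLine(wholeMatch.Groups[1].Value);  // abc

// GetGroupNames() includes "0", so skip it when iterating over real capture groups.
Console.WriteLine(string.Join(",", quickCheck.GetGroupNames())); // 0,1

// In a replacement string, $0 is the whole match and $1 is the first group.
Console.WriteLine(quickCheck.Replace("abc123", "[$0|$1]")); // [abc123|abc]
```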
