Regex match what did not match - c#

In words, I would describe the problem as "match A or B, match foo, and then match A or B that did NOT match previously".
I can do it with the following regex:
AfooB|BfooA
I'm wondering if there is a more efficient way to do this. I know how to reference a captured group using "\" followed by the group number. In this case I would like to say something like "not the option that matched in the captured group" (while still being restricted to the other possible matches for that group).
The reason I'm looking for something more efficient than simply "AfooB|BfooA" is that in my case "foo" is a very long pattern and I would prefer to reduce duplication if possible.

You may use a negative lookahead with a backreference restriction when matching the second A or B:
(A|B)foo(?!\1)(A|B)
Basically, (A|B) matches and captures the value into Group 1, then foo matches foo, (?!\1) makes sure that the text that follows is not the same as the one captured into the first group, and then it can only match the opposite value with (A|B).
See this regex demo
NOTE: if A and B are single characters, use a character class: ([AB])foo(?!\1)([AB])
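A minimal C# sketch of the lookahead-plus-backreference pattern above, assuming "foo" is a stand-in for the long middle pattern. The variable name pairRegex is made up for illustration:

```csharp
using System;
using System.Text.RegularExpressions;

// "foo" is a placeholder for the long middle pattern from the question.
var pairRegex = new Regex(@"(A|B)foo(?!\1)(A|B)");

foreach (var input in new[] { "AfooB", "BfooA", "AfooA", "BfooB" })
{
    // Mixed pairs should match; repeated pairs are rejected by (?!\1).
    Console.WriteLine($"{input}: {pairRegex.IsMatch(input)}");
}
```

The long pattern now appears only once, and the second group is still limited to A or B.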

Related

Match exactly one occurrence with regex

Consider M,T,W,TH,F,S,SU are days of week.
I have a regex which works well except for one scenario: when there is no sequence of weekdays, i.e. there is no M, T, W, TH, F, S, SU at the expected location inside the string.
For example, q10MT is valid but q10HT is invalid.
Below is my expression:
string expression = "q(\\d*)(M)?(T(?!H))?(W)?(TH)?(F)?(S(?!U))?(SU)?";
In case of q10MT, the output is q10MT, which is correct, but in case of q10HT the output is q10, which is incorrect. My regex should return no value or an empty string when there is no match.
What changes do I need to make in order to achieve this?
You can achieve it with a positive look-ahead:
q(\\d*)(?=(?:M|T(?!H)|W|TH|F|S(?!U)|SU))(M)?(T(?!H))?(W)?(TH)?(F)?(S(?!U))?(SU)?
Or, as @Taemyr noted, a shorter equivalent:
q(\\d*)(?=(?:M|TH?|W|F|SU?))(M)?(T(?!H))?(W)?(TH)?(F)?(S(?!U))?(SU)?
Here is a demo
The (?=(?:M|TH?|W|F|SU?)) look-ahead makes sure there is at least one required value from the alternation list you have after the look-ahead.
C# regex usage:
var rx = new Regex(@"q(\d*)(?=(?:M|TH?|W|F|SU?))(M)?(T(?!H))?(W)?(TH)?(F)?(S(?!U))?(SU)?");
var result = rx.Match("q10MSUT").Value;
Result: q10MSU
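To see the effect of the look-ahead on the two inputs from the question, here is a small comparison sketch. The variable names unguarded and guarded are made up for illustration:

```csharp
using System;
using System.Text.RegularExpressions;

// The question's original pattern: every weekday is optional.
var unguarded = new Regex(@"q(\d*)(M)?(T(?!H))?(W)?(TH)?(F)?(S(?!U))?(SU)?");
// The same pattern with a look-ahead that demands at least one weekday.
var guarded = new Regex(@"q(\d*)(?=(?:M|TH?|W|F|SU?))(M)?(T(?!H))?(W)?(TH)?(F)?(S(?!U))?(SU)?");

foreach (var input in new[] { "q10MT", "q10HT" })
{
    // Match.Value is the empty string when there is no match.
    Console.WriteLine($"{input}: unguarded='{unguarded.Match(input).Value}', guarded='{guarded.Match(input).Value}'");
}
```

For q10HT the unguarded pattern should still return q10, while the guarded one should return an empty string.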
What about the following:
q(\d*)(M|TH?|W|F|SU?)+
See demo with some examples on matches and no-matches. The key change in this regexp is that this one uses the + to require at least one match.
Be aware that this solution doesn't require the days to be in order and allows days to be skipped, which the comments on the question say does not matter.
Edit: OP says in comments that he requires only one match for each day, which this solution doesn't account for.
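A short sketch of what the + version accepts, including the repeated-day case that the edit above mentions. The variable name anyDays is made up for illustration:

```csharp
using System;
using System.Text.RegularExpressions;

var anyDays = new Regex(@"q(\d*)(M|TH?|W|F|SU?)+");

foreach (var input in new[] { "q10MT", "q10TM", "q10MM", "q10HT" })
{
    var m = anyDays.Match(input);
    var shown = m.Success ? m.Value : "(no match)";
    // Out-of-order and repeated days are accepted; q10HT has no day after the digits.
    Console.WriteLine($"{input}: {shown}");
}
```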
If order does not matter you need to do something like this;
q(?<number>\d+)((?<monday>(?<!M\D*)M)|(?<tuesday>(?<!T(?!H)\D*)T(?!H))|(?<wednesday>(?<!W\D*)W)|(?<thursday>(?<!TH\D*)TH)|(?<friday>(?<!F\D*)F)|(?<saturday>(?<!S(?!U)\D*)S(?!U))|(?<sunday>(?<!SU\D*)SU))+
This matches if q is followed by some number, and then followed by one or more weekdays. Order of weekdays does not matter, and the negative lookbehind ensures that no weekday can occur more than once.
Each weekday is captured in its own capturing group, and that group is named so that it can be extracted later. "q10MTsomething" will capture "q10MT" with 10 in the "number" capturing group, M in the "monday" capturing group and T in the "tuesday" capturing group; other capturing groups will be empty. "q10TFMother" will capture "q10TFM" with capturing as in the previous example, plus F in the "friday" capturing group. "q10TFMT" will capture "q10TFM" with capturing groups as in the previous example. "q10HT" will not match.
demo
Note that this is the regexp string. If entered in code you might need to escape the \s to produce the correct string.
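A sketch of how the named groups could be read back in C#. Verbatim strings (@"...") avoid the backslash escaping mentioned above; the pattern is split across several literals only for readability, and the variable names are made up for illustration:

```csharp
using System;
using System.Text.RegularExpressions;

// Same pattern as above, concatenated from verbatim string pieces.
var daysOnce = new Regex(
    @"q(?<number>\d+)(" +
    @"(?<monday>(?<!M\D*)M)|" +
    @"(?<tuesday>(?<!T(?!H)\D*)T(?!H))|" +
    @"(?<wednesday>(?<!W\D*)W)|" +
    @"(?<thursday>(?<!TH\D*)TH)|" +
    @"(?<friday>(?<!F\D*)F)|" +
    @"(?<saturday>(?<!S(?!U)\D*)S(?!U))|" +
    @"(?<sunday>(?<!SU\D*)SU))+");

var match = daysOnce.Match("q10TFMT");
Console.WriteLine($"match: {match.Value}, number: {match.Groups["number"].Value}");

foreach (var day in new[] { "monday", "tuesday", "wednesday", "thursday", "friday", "saturday", "sunday" })
{
    // Groups that did not participate report Success == false and an empty Value.
    Console.WriteLine($"{day}: '{match.Groups[day].Value}'");
}
```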
The question is answered already. Even so I want to point to another idea using a variable length lookbehind for maintaining the sequence, which should be fine with .NET
q(\d*)[MTWFSUH]+(?<=q\d*(M)?(T)?(W)?(TH)?(F)?(S)?(SU)?)
[MTWFSUH] is the list of valid characters. At least one is required
Using a lookbehind for matching as long as the sequence is maintained
Test at your test tool
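A small sketch for trying the lookbehind variant in .NET, which allows variable-length lookbehind. The inputs are chosen for illustration and the variable name ordered is made up:

```csharp
using System;
using System.Text.RegularExpressions;

var ordered = new Regex(@"q(\d*)[MTWFSUH]+(?<=q\d*(M)?(T)?(W)?(TH)?(F)?(S)?(SU)?)");

foreach (var input in new[] { "q10MT", "q10TM", "q10HT" })
{
    var m = ordered.Match(input);
    var shown = m.Success ? m.Value : "(no match)";
    // The character class backtracks until the lookbehind sees days in sequence.
    Console.WriteLine($"{input}: {shown}");
}
```

With q10TM the match should shrink to q10T, because TM is out of sequence; q10HT should not match at all.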

Captured group in optional part of a regular expression

I want to capture a group in an optional part of a string.
For example:
In the string "firstName:Bill-lastName:Gates", I want to capture 2 groups :
Bill
Gates
I use this regex:
firstName:(.*)-lastName:(.*)
But when the lastName-part is optional, I still want to capture the first
group (firstName).
I used this regex, to make the lastName-part optional (in a non-capturing group):
firstName:(.*)(?:-lastName:(.*))?
Using this updated regex, the resulting groups are:
when the lastName part is not present, for example "firstName:Bill" the captured groups are:
Bill
/empty string/
which is correct,
when the firstName and lastName parts are present: "firstName:Bill-lastName:Gates", the groups are not correct:
Bill-lastName:Gates
/empty/
I think it has to do with greediness of the first capturing group, but how to adjust this regex to make the regex work when the lastName-part is optional?
You are right, it is about greediness. Find a delimiter for the first match group. So, if your firstname "never" contains the dash, only match everything but the dash with the first match group.
firstName:([^-]*)(?:-lastName:(.*))?
Debuggex Demo
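The difference between the greedy group and the delimiter-based group can be checked with a sketch like this, assuming first names never contain a dash. The variable names are made up for illustration:

```csharp
using System;
using System.Text.RegularExpressions;

var greedy = new Regex(@"firstName:(.*)(?:-lastName:(.*))?");
var delimited = new Regex(@"firstName:([^-]*)(?:-lastName:(.*))?");

var full = "firstName:Bill-lastName:Gates";
var g = greedy.Match(full);
var d = delimited.Match(full);

// The greedy (.*) swallows the lastName part; [^-]* stops at the first dash.
Console.WriteLine($"greedy:    '{g.Groups[1].Value}' / '{g.Groups[2].Value}'");
Console.WriteLine($"delimited: '{d.Groups[1].Value}' / '{d.Groups[2].Value}'");

// The optional part can still be absent.
var firstOnly = delimited.Match("firstName:Bill");
Console.WriteLine($"first only: '{firstOnly.Groups[1].Value}' / '{firstOnly.Groups[2].Value}'");
```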
If you cannot find such a delimiter you would need to take a different approach. Making the first pattern "lazy" does not help here.
This is because a lazy match group stops at the first string that satisfies the expression (the wording is important), and since the lastName part is optional, that first string is the empty one.
There might be an option with lookarounds, but you could also use an alternation (an "or" statement) instead of optional matches:
firstName:(.*)-lastName:(.*)|firstName:(.*)
This way, the regex engine would match either or, but prefer the pattern with 2 matches since it is listed first. Only if that does not apply, it will try the single match.
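With the alternation, the first name lands in group 1 or group 3 depending on which branch matched, so the calling code has to check both. A sketch, where the helper name Parse is made up for illustration:

```csharp
using System;
using System.Text.RegularExpressions;

var either = new Regex(@"firstName:(.*)-lastName:(.*)|firstName:(.*)");

// Hypothetical helper: returns (first, last), with last empty when it is absent.
(string First, string Last) Parse(string input)
{
    var m = either.Match(input);
    if (!m.Success) return ("", "");
    // Groups 1 and 2 belong to the first alternative, group 3 to the second.
    return m.Groups[1].Success
        ? (m.Groups[1].Value, m.Groups[2].Value)
        : (m.Groups[3].Value, "");
}

Console.WriteLine(Parse("firstName:Bill-lastName:Gates"));
Console.WriteLine(Parse("firstName:Bill"));
```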
Even though you accepted @dognose's answer already, I assure you there are first names with a dash in them (You don't wanna piss off Jean-Claude van Damme). I would advise you to do it like so:
firstName:((?:(?!-lastName:).)*)(?:-lastName:(.*))?
Debuggex Demo
You can see from the visualization that the (?:(?!-lastName:).) says "if the current position is not followed by '-lastName:', capture another character"
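A quick sketch of this pattern on a first name that contains a dash. The sample inputs and the variable name tempered are made up for illustration:

```csharp
using System;
using System.Text.RegularExpressions;

var tempered = new Regex(@"firstName:((?:(?!-lastName:).)*)(?:-lastName:(.*))?");

var both = tempered.Match("firstName:Jean-Claude-lastName:van Damme");
// The dash inside the first name is consumed because "-lastName:" does not follow it.
Console.WriteLine($"'{both.Groups[1].Value}' / '{both.Groups[2].Value}'");

var firstOnly = tempered.Match("firstName:Jean-Claude");
Console.WriteLine($"'{firstOnly.Groups[1].Value}' / '{firstOnly.Groups[2].Value}'");
```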

Regex non-greedy match

If I have the following simplified string:
string331-itemR-icon253,string131-itemA-icon453,
string12131-itemB-icon4535,string22-itemC-icon443
How do I get the following using only regex?
string12131-itemB-icon4535,
All numbers are unknown. The only known parts are
itemA, itemB, itemC, string and icon
I've tried string.+?itemB.+?, but it also picks up from the first occurrence of string rather than the one adjacent to itemB
I've also tried using [^icon] preceding the itemB in various positions but couldn't get it to work.
Try this:
string[^,]+itemB[^,]+,
Try this regex
string\d+-itemB-icon\d+,
The given solutions that use a restricted set of characters instead of a wildcard are simplest, but to get more at the general question: You got the non-greedy quantifier part right, but being non-greedy doesn't prevent the matcher from taking as many characters as it needs to find a match. You might be looking for the atomic group operator, (?>group). Once the group matches something, it will be treated atomically if the matcher needs to backtrack.
(?>string.+?item)B.+?,
In your example, the group matches string331-item, but the B doesn't match R so the whole group is tossed and the search moves to the next string.
You don't mention the commas separating items as a known part but use it in the example regex so I assume it can be used in a solution. Try excluding the comma as a character set instead of matching against ".".
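The three approaches can be compared side by side with a sketch like this, assuming the sample data is one line and commas separate the items. The variable names are made up for illustration:

```csharp
using System;
using System.Text.RegularExpressions;

var data = "string331-itemR-icon253,string131-itemA-icon453,string12131-itemB-icon4535,string22-itemC-icon443";

var lazy = new Regex(@"string.+?itemB.+?,");          // starts at the first "string"
var atomic = new Regex(@"(?>string.+?item)B.+?,");    // atomic group, no backtracking into it
var negated = new Regex(@"string[^,]+itemB[^,]+,");   // never crosses a comma

Console.WriteLine($"lazy:    {lazy.Match(data).Value}");
Console.WriteLine($"atomic:  {atomic.Match(data).Value}");
Console.WriteLine($"negated: {negated.Match(data).Value}");
```

The lazy version should run from string331 through the itemB entry, while the other two should return only the itemB entry.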

How to prevent regex from stopping at the first match of alternatives?

If I have the string hello world, how can I modify the regex world|wo|w so that it matches all of "world", "wo" and "w" rather than just the first match, "world", that it comes to?
If this is not possible directly, is there a good workaround? I'm using C# if it makes a difference:
Regex testRegex = new Regex("world|wo|w");
MatchCollection theMatches = testRegex.Matches("hello world");
foreach (Match thisMatch in theMatches)
{
...
}
I think you're going to need to use three separate regexes and match on each of them. When you specify alternatives, the engine considers each one a successful match and stops looking after matching one of them. The only way I can see to do it is to repeat the search with each of your alternatives in a separate regex. You can create an array or list of Match items and have each search add to the list if you want to be able to iterate through them later.
If you're trying to match (the beginning of) the word world three times, you'll need to use three separate Regex objects; a single Regex cannot match the same character twice.
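A minimal sketch of the separate-regex approach, collecting the results in a list. The variable names are made up, and the alternatives here are plain literals; patterns built from arbitrary text would need Regex.Escape:

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

var text = "hello world";
var found = new List<string>();

// Run each alternative as its own regex so every one gets a chance to match.
foreach (var alternative in new[] { "world", "wo", "w" })
{
    var m = Regex.Match(text, alternative);
    if (m.Success) found.Add(m.Value);
}

Console.WriteLine(string.Join(", ", found));
```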
As SLaks wrote, a regex can't match the same text more than once.
You could "fake it" like this:
\b(w)((?<=w)o)?((?<=wo)rld)?
will match the w, the o only if preceded by w*, and rld only if preceded by wo.
Of course, only parts of the word will actually be matched, but you'll see whether only the first one, the first two or all the parts did match by looking at the captured groups.
So in the word want, the w will match (the rest is optional, so the regex reports overall success).
In work, the wo will match; \1 will contain w, and \2 will contain o. The rld will fail, but since it's optional, the regex still reports success.
I have added a word boundary anchor \b to the start of the regex to avoid matches in the middle of words like reword; if you don't want to exclude those matches, drop the \b.
* The (?<=w) is not actually needed here, but I kept it in for consistency.
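In C#, the "which parts matched" check can be done with Group.Success. A sketch, where the variable name prefixes and the sample words are chosen for illustration:

```csharp
using System;
using System.Text.RegularExpressions;

var prefixes = new Regex(@"\b(w)((?<=w)o)?((?<=wo)rld)?");

foreach (var word in new[] { "hello world", "work", "want" })
{
    var m = prefixes.Match(word);
    // Group 1 = "w", group 2 = "o" (so "wo" matched), group 3 = "rld" (so "world" matched).
    Console.WriteLine($"{word}: w={m.Groups[1].Success}, wo={m.Groups[2].Success}, world={m.Groups[3].Success}");
}
```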

In C# regular expression why does the initial match show up in the groups?

So if I write a regex and it matches, I can get the match or I can access its groups. This seems counterintuitive since the groups are defined in the expression with parentheses "(" and ")". It seems like it is not only wrong but redundant. Anyone know why?
Regex quickCheck = new Regex(@"(\D+)\d+");
string source = "abc123";
Match m = quickCheck.Match(source);
m.Value //Equals source
m.Groups.Count //Equals 2
m.Groups[0].Value //Equals source
m.Groups[1].Value //Equals "abc"
I agree - it is a little strange, however I think there are good reasons for it.
A Regex Match is itself a Group, which in turn is a Capture.
But the Match.Value (or Capture.Value as it actually is) is only valid when one match is present in the string - if you're matching multiple instances of a pattern, then by definition it can't return everything. In effect, the Value property on the Match is a convenience for when there is only one match.
But to clarify where this behaviour of passing the whole match into Groups[0] makes sense - consider this (contrived) example of a naive code unminifier:
[TestMethod]
public void UnMinifyExample()
{
string toUnMinify = "{int somevalue = 0; /*init the value*/} /* end */";
string result = Regex.Replace(toUnMinify, @"(;|})\s*(/\*[^*]*?\*/)?\s*", "$0\n");
Assert.AreEqual("{int somevalue = 0; /*init the value*/\n} /* end */\n", result);
}
The regex match will preserve /* */ comments at the end of a statement, placing a newline afterwards - but works for either ; or } line-endings.
Okay - you might wonder why you'd bother doing this with a regex - but humour me :)
If Groups[0] generated by the matches for this regex was not the whole capture - then a single-call replace would not be possible - and your question would probably be asking why doesn't the whole match get put into Groups[0] instead of the other way round!
The documentation for Match says that the first group is always the entire match so it's not an implementation detail.
It's historical is all. In Perl 5, the contents of capture groups are stored in the special variables $1, $2, etc., but C#, Java, and others instead store them in an array (or array-like structure). To preserve compatibility with Perl's naming convention (which has been copied by several other languages), the first group is stored in element number one, the second in element two, etc. That leaves element zero free, so why not store the full match there?
FYI, Perl 6 has adopted a new convention, in which the first capturing group is numbered zero instead of one. I'm sure it wasn't done just to piss us off. ;)
Most likely so that you can use "$0" to represent the match in a substitution expression, and "$1" for the first group match, etc.
I don't think there's really an answer other than that the person who wrote this chose it as an implementation detail. As long as you remember that the first group (Groups[0]) will always equal the whole matched text you should be ok :-)
Not sure why either, but if you use named groups you can set the option RegexOptions.ExplicitCapture so that unnamed parentheses no longer capture. Note that Groups[0] still holds the whole match even then.
It might be redundant, however it has some nice properties.
For example, it means the capture groups work the same way as other regex engines - the first capture group corresponds to "1", and so on.
Backreferences are one-based, e.g., \1 or $1 is the first parenthesized subexpression, and so on. As laid out, one maps to the other without any thought.
Also of note: m.Groups["0"] gives you the entire matched substring, so be sure to skip "0" if you're iterating over regex.GetGroupNames().
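A short sketch for checking what Groups and GetGroupNames() return, with and without RegexOptions.ExplicitCapture. The second pattern and the group name letters are made up for illustration:

```csharp
using System;
using System.Text.RegularExpressions;

var numbered = new Regex(@"(\D+)\d+");
var m = numbered.Match("abc123");

// Groups[0] is the whole match; Groups[1] is the first parenthesized group.
Console.WriteLine($"{m.Groups.Count}: '{m.Groups[0].Value}', '{m.Groups[1].Value}'");
Console.WriteLine(string.Join(",", numbered.GetGroupNames()));

// With ExplicitCapture only named groups capture, but group "0" is still there.
var explicitOnly = new Regex(@"(?<letters>\D+)(\d+)", RegexOptions.ExplicitCapture);
var e = explicitOnly.Match("abc123");
Console.WriteLine($"{e.Groups.Count}: '{e.Groups[0].Value}', '{e.Groups["letters"].Value}'");
Console.WriteLine(string.Join(",", explicitOnly.GetGroupNames()));
```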
