Regex matching too many groups

I'm using the C# Regex class to split one string into two parts. The source (input) string is constructed in the following way:
The first part must match PO|P|S|[1-5] (in regex syntax).
The second part can be VP|GZ|GAR|PP|NAD|TER|NT|OT|LO (again, regex syntax) and can occur zero or one time.
Acceptable examples are "PO" (one group), "POGAR" (both groups PO+GAR), "POT" (P+OT)...
So I've used the following regular expression:
Regex r = new Regex("^(?<first>PO|P|S|[1-5])(?<second>VP|GZ|GAR|PP|NAD|TER|NT|OT|LO)?$");
Match match = r.Match(potentialToken);
When potentialToken is "PO", it returns 3 groups! How come? I am expecting just one group (first).
match.Groups are {"PO","PO",""}
Named groups are OK - match.Groups["first"] returns 1 instance, while match.Groups["second"].Success is false.

When using the numbered groups, the first group is always the complete matched (sub)string (cf. docs - "the first element of the GroupCollection object returned by the Groups property contains a string that matches the entire regular expression pattern"), i.e. in your case PO.
The second element in Groups is the capture of your first named group, and the third element is the capture of your second named group - just like the two captures you can retrieve by name. If you check Success of the numbered groups, you will see that the last element (the one matching your second named group) has a Success value of false, as well. You can interpret this as "the group exists, but it did not match anything".
To confirm this, have a look at the output of this testing code:
using System;
using System.Text.RegularExpressions;

class Program
{
    static void Main()
    {
        Regex r = new Regex("^(?<first>PO|P|S|[1-5])(?<second>VP|GZ|GAR|PP|NAD|TER|NT|OT|LO)?$");
        Match match = r.Match("PO");
        // Print index, Success and Value of every group, including group 0.
        for (int i = 0; i < match.Groups.Count; i++) {
            Console.WriteLine(string.Format("{0}: {1}; {2}", i, match.Groups[i].Success, match.Groups[i].Value));
        }
    }
}

A regular expression match always has at least one group, "Group 0" at index 0, even if the pattern contains no capturing groups.
"Group 0" equals the whole match the regex has made (Match.Value).
In your case you get 3 groups: "Group 0" + "Group first" + "Group second". As mentioned, "Group second" is optional, so when it doesn't take part in the match the .NET regex engine sets its Success to false. There is nothing surprising here; this is the expected behavior.
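If you only care about the groups that actually took part in the match, one option is to skip group 0 and filter on Success. The following is a minimal sketch using the pattern from the question; the class name GroupCountDemo and the helper MatchedParts are made up for illustration.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

static class GroupCountDemo
{
    // Pattern taken from the question.
    static readonly Regex TokenRegex = new Regex(
        "^(?<first>PO|P|S|[1-5])(?<second>VP|GZ|GAR|PP|NAD|TER|NT|OT|LO)?$");

    // Hypothetical helper: returns the values of the capturing groups that
    // took part in the match, skipping group 0 (the whole match) and any
    // group whose Success is false.
    public static string[] MatchedParts(string token)
    {
        Match match = TokenRegex.Match(token);
        if (!match.Success) return new string[0];
        return match.Groups.Cast<Group>()
            .Skip(1)
            .Where(g => g.Success)
            .Select(g => g.Value)
            .ToArray();
    }

    static void Main()
    {
        foreach (string token in new[] { "PO", "POGAR", "POT" })
            Console.WriteLine(token + " -> " + string.Join("+", MatchedParts(token)));
    }
}
```

For "PO" this yields a single part, while match.Groups.Count stays 3 regardless of the input, because the count reflects the groups defined in the pattern.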

Related

Regex.Replace using regular expression as replacement

I am new to the C# programming language and came across the following problem.
I have a string "avenue 4 TH some more words". I want to remove the space between 4 and TH. I have written a regex which helps in determining whether "4 TH" is present in a string or not.
[0-9]+\s(th|nd|st|rd)
string result = "avenue 4 TH some more words";
var match = Regex.IsMatch(result,"\\b" + item + "\\b",RegexOptions.IgnoreCase) ;
Console.WriteLine(match);//True
Is there anything in C# which will remove the space,
something like Regex.Replace(result, "[0-9]+\\s(th|nd|st|rd)", "[0-9]+(th|nd|st|rd)",RegexOptions.IgnoreCase);
so that end result looks like
avenue 4TH some more words

You may use
var pattern = @"(?i)(\d+)\s*(th|[nr]d|st)\b";
var match = string.Concat(Regex.Match(result, pattern)?.Groups.Cast<Group>().Skip(1));
See the C# demo yielding 4TH.
The regex - (?i)(\d+)\s*(th|[nr]d|st)\b - matches 1 or more digits capturing the value into Group 1, then 0 or more whitespaces are matched with \s*, and then th, nd, rd or st as whole words (as \b is a word boundary) are captured into Group 2.
The Regex.Match(result, pattern)? part tries to match the pattern in the string. If there is a match, the match object's Groups property is accessed and all groups are cast to a Group list with Groups.Cast<Group>(). Since the first group is the whole match value, we .Skip(1) it.
The rest - the values of Group 1 and Group 2 - are concatenated with string.Concat.
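To get the whole string back with the space removed, as the question asks, the same two capturing groups can be used in a Regex.Replace replacement string. This is a sketch, not part of the original answer; the class and method names are illustrative.

```csharp
using System;
using System.Text.RegularExpressions;

static class OrdinalSpaceDemo
{
    // Hypothetical helper: removes the whitespace between a number and
    // its ordinal suffix (th, nd, rd, st), ignoring case.
    public static string JoinOrdinals(string input)
    {
        // $1 is the digit run, $2 the suffix; the whitespace between them is dropped.
        return Regex.Replace(input, @"(\d+)\s+(th|[nr]d|st)\b", "$1$2", RegexOptions.IgnoreCase);
    }

    static void Main()
    {
        Console.WriteLine(JoinOrdinals("avenue 4 TH some more words"));
        // prints: avenue 4TH some more words
    }
}
```

The replacement argument is not a regex; it is a substitution pattern in which $1 and $2 refer to the captured groups.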

Match but exclude a string using C# regular expression [duplicate]

Say I have the string "User Name:firstname.surname" contained in a larger string. How can I use a regular expression to get just the firstname.surname part?
Every method I have tried returns the string "User Name:firstname.surname", and then I have to do a string replace on "User Name:" to an empty string.
Could back references be of use here?
Edit:
The longer string could contain "Account Name: firstname.surname", which is why I want to match the "User Name:" part of the string as well, to get just that value.

I like to use named groups:
Match m = Regex.Match("User Name:first.sur", @"User Name:(?<name>\w+\.\w+)");
if(m.Success)
{
    string name = m.Groups["name"].Value;
}
Putting the ?<something> at the beginning of a group in parentheses (e.g. (?<something>...)) allows you to get the value from the match using something as a key (e.g. from m.Groups["something"].Value)
If you didn't want to go to the trouble of naming your groups, you could say
Match m = Regex.Match("User Name:first.sur", @"User Name:(\w+\.\w+)");
if(m.Success)
{
    string name = m.Groups[1].Value;
}
and just get the first thing that matches. (Note that the first parenthesized group is at index 1; the whole expression that matches is at index 0)

You could also try the concept of "lookaround". This is a kind of zero-width assertion, meaning it checks the surrounding characters without including them in the match.
In your case, we could take a positive lookbehind: we want what's behind the target string "firstname.surname" to be equal to "User Name:".
Positive lookbehind operator: (?<=StringBehind)StringWeWant
This can be achieved like this, for instance (a little Java example, using string replace):
String test = "Account Name: firstname.surname; User Name:firstname.surname";
String regex = "(?<=User Name:)firstname.surname";
String replacement = "James.Bond";
System.out.println(test.replaceAll(regex, replacement));
This replaces only the "firstname.surname" strings that are preceded by "User Name:" without replacing the "User Name:" itself, which is checked by the regex but not returned as part of the match.
OUTPUT: Account Name: firstname.surname; User Name:James.Bond
That is, if the language you're using supports this kind of operation.
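.NET regular expressions do support lookbehind, so the same idea can be written in C#. The sketch below assumes the name has the shape word.word; the class and method names are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

static class LookbehindDemo
{
    // Hypothetical helper: returns the name that directly follows "User Name:",
    // or an empty string if there is none. The lookbehind (?<=...) must match
    // just before the name but is not part of Match.Value.
    public static string ExtractUserName(string input)
    {
        Match m = Regex.Match(input, @"(?<=User Name:)\w+\.\w+");
        return m.Success ? m.Value : string.Empty;
    }

    static void Main()
    {
        string test = "Account Name: firstname.surname; User Name:james.bond";
        Console.WriteLine(ExtractUserName(test));
        // prints: james.bond
    }
}
```

The name after "Account Name: " is skipped because the text directly before it is not "User Name:".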

Make a group with parentheses, then get it from the Match.Groups collection, like this:
string s = "User Name:firstname.surname";
Regex re = new Regex(@"User Name:(.*\..*)");
Match match = re.Match(s);
if (match.Success)
{
    MessageBox.Show(match.Groups[1].Value);
}
(note: the first group, with index 0, is the whole match)

All regular expression libraries I have used allow you to define groups in the regular expression using parentheses, and then access that group from the result.
So, your regexp might look like: User Name:([^.]*\.[^.]*)
The complete match is group 0. The part that matches inside the parentheses is group 1.

c# - regex don't work (match does not preserve the string)

Regex regOrg = new Regex(@"org(?:aniser)?\s+(\d\d):(\d\d)\s?(\d\d)?\.?(\d\d)?", RegexOptions.IgnoreCase);
MatchCollection mcOrg = regOrg.Matches(str);
Match mvOrg = regOrg.Match(str);
dayOrg = mvOrg.Value[4].ToString();
monthOrg = mvOrg.Value[5].ToString();
hourOrg = mvOrg.Value[2].ToString();
minuteOrg = mvOrg.Value[3].ToString();
This regular expression analyzes the string with text
"organiser 23:59" / "organiser 25:59 31.12"
or
"org 23:59" / "org 23:59 31.12"
Day and month are optional parameters.
Accordingly, I want to see the output variables dayOrg, monthOrg, hourOrg, minuteOrg with this data, but I get this:
Query: org 23:59 31.12
The value mcOrg.Count: 1
The value dayOrg: 2
The value monthOrg: 3
The value hourOrg: g
The value minuteOrg: empty
What am I doing wrong? Tried a lot of options, but it's not working.

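You're not accessing the groups correctly: Value[index] returns individual characters of the matched string. Use the Groups collection instead. Note that the pattern has only four capturing groups, numbered 1 to 4 in order of appearance (hour, minute, day, month), because (?:aniser) is non-capturing:
hourOrg = mvOrg.Groups[1].Value;
minuteOrg = mvOrg.Groups[2].Value;
dayOrg = mvOrg.Groups[3].Value;
monthOrg = mvOrg.Groups[4].Value;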

You get that result because you are reading Value[index] from the mvOrg Match.
Match.Value is the matched substring as a whole, so indexing it gives you individual characters of that string instead of the groups. You need to use the Groups property of the Match class to get the actual groups found.
Be sure to check the Success property of the optional groups (day and month) before using their values; a group that did not take part in the match has an empty Value.

I added names to the groups in your pattern, so now it looks like this:
Regex regOrg = new Regex(@"org(?:aniser)?\s+(?<hourOrg>\d{2}):(?<minuteOrg>\d{2})\s?(?<dayOrg>\d{2})?\.?(?<monthOrg>\d{2})?", RegexOptions.IgnoreCase);
and you can access the result like this:
Console.WriteLine(mvOrg.Groups["hourOrg"]);
Console.WriteLine(mvOrg.Groups["minuteOrg"]);
Console.WriteLine(mvOrg.Groups["dayOrg"]);
Console.WriteLine(mvOrg.Groups["monthOrg"]);
Using hard-coded indexes is not good practice: if you change the regex, you then need to change all the indexes.
Is this what you wanted?
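Since the day and month are optional, it can help to test Success on those groups before using them. The following sketch reuses the named-group pattern above; the class OrgParseDemo, the helper Describe, and its output format are assumptions made for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

static class OrgParseDemo
{
    // Named-group pattern from the answer above.
    static readonly Regex OrgRegex = new Regex(
        @"org(?:aniser)?\s+(?<hourOrg>\d{2}):(?<minuteOrg>\d{2})\s?(?<dayOrg>\d{2})?\.?(?<monthOrg>\d{2})?",
        RegexOptions.IgnoreCase);

    // Hypothetical helper: formats the time, and appends the date only
    // when both optional groups took part in the match.
    public static string Describe(string input)
    {
        Match m = OrgRegex.Match(input);
        if (!m.Success) return "no match";

        string time = m.Groups["hourOrg"].Value + ":" + m.Groups["minuteOrg"].Value;
        Group day = m.Groups["dayOrg"];
        Group month = m.Groups["monthOrg"];
        return day.Success && month.Success
            ? time + " on " + day.Value + "." + month.Value
            : time;
    }

    static void Main()
    {
        Console.WriteLine(Describe("org 23:59 31.12"));  // prints: 23:59 on 31.12
        Console.WriteLine(Describe("organiser 23:59"));  // prints: 23:59
    }
}
```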

How to find repeatable characters

I can't understand how to solve the following problem:
I have input string "aaaabaa" and I'm trying to search for string "aa" (I'm looking for positions of characters)
Expected result is
0 1 2 5
aa aabaa
a aa abaa
aa aa baa
aaaab aa
I have already solved this problem using another (non-RegEx) approach.
But I need a RegEx solution. I'm new to RegEx, so a Google search hasn't really helped me.
Any help appreciated! Thanks!
P.S.
I've tried to use (aa)* and "\b(\w+(aa))*\w+" but those expressions are wrong

You can solve this by using a lookahead
a(?=a)
will find every "a" that is followed by another "a".
If you want to do this more generally
(\p{L})(?=\1)
This will find every letter that is followed by the same letter. Every found letter is stored in a capturing group (because of the parentheses around it), and this capturing group is then reused inside the positive lookahead assertion (the (?=...)) by using \1 (\1 holds the character that was matched).
\p{L} matches any Unicode code point in the category "letter".
Code
String text = "aaaabaa";
Regex reg = new Regex(@"(\p{L})(?=\1)");
MatchCollection result = reg.Matches(text);
foreach (Match item in result) {
    Console.WriteLine(item.Index);
}
Output
0
1
2
5

The following code should work with any regular expression without having to change the actual expression:
Regex rx = new Regex(@"(a)\1"); // or any other pattern you're looking for; the verbatim string keeps \1 as a backreference
int position = 0;
string text = "aaaaabbbbccccaaa";
int textLength = text.Length;
Match m = rx.Match(text, position);
while (m != null && m.Success)
{
    Console.WriteLine(m.Index);
    if (m.Index <= textLength)
    {
        // Restart the search one character after the start of the previous match.
        m = rx.Match(text, m.Index + 1);
    }
    else
    {
        m = null;
    }
}
Console.ReadKey();
It uses the option to change the start index of a regex search for each consecutive search. The actual problem comes from the fact that the Regex engine, by default, will always continue searching after the previous match. So it will never find a possible match within another match, unless you instruct it to by using a Look ahead construction or by manually setting the start index.
Another, relatively easy, solution is to just stick the whole expression in a forward look ahead:
string expression = @"(a)\1";
Regex rx2 = new Regex("(?=" + expression + ")");
MatchCollection ms = rx2.Matches(text);
var indexes = ms.Cast<Match>().Select(match => match.Index); // needs using System.Linq
That way the engine will automatically advance the index by one for every match it finds.
From the docs:
When a match attempt is repeated by calling the NextMatch method, the regular expression engine gives empty matches special treatment. Usually, NextMatch begins the search for the next match exactly where the previous match left off. However, after an empty match, the NextMatch method advances by one character before trying the next match. This behavior guarantees that the regular expression engine will progress through the string. Otherwise, because an empty match does not result in any forward movement, the next match would start in exactly the same place as the previous match, and it would match the same empty string repeatedly.
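Applied to the input from the question, the lookahead approach can be sketched as a complete program. The class and method names are illustrative, and Regex.Escape is used on the assumption that the search term is a literal string rather than a pattern.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

static class OverlapDemo
{
    // Hypothetical helper: returns every index at which 'literal' occurs in
    // 'text', including overlapping occurrences. The lookahead produces empty
    // matches, so the engine advances one character after each of them.
    public static int[] OverlappingPositions(string text, string literal)
    {
        Regex rx = new Regex("(?=" + Regex.Escape(literal) + ")");
        return rx.Matches(text).Cast<Match>().Select(m => m.Index).ToArray();
    }

    static void Main()
    {
        int[] positions = OverlappingPositions("aaaabaa", "aa");
        Console.WriteLine(string.Join(" ", positions));
        // prints: 0 1 2 5
    }
}
```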

Try this:
How can I find repeated characters with a regex in Java?
It is in java, but the regex and non-regex way is there. C# Regex is very similar to the Java way.

Regular Expression Groups in C#

I've inherited a code block that contains the following regex and I'm trying to understand how it's getting its results.
var pattern = @"\[(.*?)\]";
var matches = Regex.Matches(user, pattern);
if (matches.Count > 0 && matches[0].Groups.Count > 1)
...
For the input user == "Josh Smith [jsmith]":
matches.Count == 1
matches[0].Value == "[jsmith]"
... which I understand. But then:
matches[0].Groups.Count == 2
matches[0].Groups[0].Value == "[jsmith]"
matches[0].Groups[1].Value == "jsmith" <=== how?
Looking at this question, from what I understand the Groups collection stores the entire match as well as the previous match. But doesn't the regexp above match only [open square bracket] [text] [close square bracket]? So why would "jsmith" match?
Also, is it always the case that the Groups collection will store exactly 2 groups: the entire match and the last match?

match.Groups[0] is always the same as match.Value, which is the entire match.
match.Groups[1] is the first capturing group in your regular expression.
Consider this example:
var pattern = @"\[(.*?)\](.*)";
var match = Regex.Match("ignored [john] John Johnson", pattern);
In this case,
match.Value is "[john] John Johnson"
match.Groups[0] is always the same as match.Value, "[john] John Johnson".
match.Groups[1] is the group of captures from the (.*?).
match.Groups[2] is the group of captures from the (.*).
match.Groups[1].Captures is yet another dimension.
Consider another example:
var pattern = @"(\[.*?\])+";
var match = Regex.Match("[john][johnny]", pattern);
Note that we are looking for one or more bracketed names in a row. You need to be able to get each name separately. Enter Captures!
match.Groups[0] is always the same as match.Value, "[john][johnny]".
match.Groups[1] is the group of captures from the (\[.*?\]). Its Value is the last capture, "[johnny]", not the whole match.
match.Groups[1].Captures[0] is [john]
match.Groups[1].Captures[1] is [johnny], the same as match.Groups[1].Value
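A quick way to see what a repeated group holds is to list its Captures. The sketch below uses the (\[.*?\])+ pattern from this example; the class and method names are made up. In .NET, Captures is zero-indexed and Group.Value returns the last capture.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

static class CapturesDemo
{
    static readonly Regex BracketRegex = new Regex(@"(\[.*?\])+");

    // Hypothetical helper: returns every capture made by group 1,
    // in the order the captures were made.
    public static string[] CaptureValues(string input)
    {
        Match match = BracketRegex.Match(input);
        return match.Groups[1].Captures.Cast<Capture>().Select(c => c.Value).ToArray();
    }

    // Hypothetical helper: returns Groups[1].Value, i.e. the last capture.
    public static string LastCapture(string input)
    {
        return BracketRegex.Match(input).Groups[1].Value;
    }

    static void Main()
    {
        Console.WriteLine(string.Join(", ", CaptureValues("[john][johnny]")));
        // prints: [john], [johnny]
        Console.WriteLine(LastCapture("[john][johnny]"));
        // prints: [johnny]
    }
}
```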

The ( ) acts as a capture group. So the matches array has all of the matches that C# finds in your string, and the sub-array has the values of the capture groups inside each of those matches. If you don't want that extra level of capture, just remove the ( ).

Groups[0] is the entire match (not the entire input string).
Groups[1] is the group captured by the parentheses (.*?). You can configure Regex to capture explicit (named) groups only by passing RegexOptions.ExplicitCapture when you create the regex, or use (?:.*?) to create a non-capturing group.
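The effect of both options on Groups.Count can be checked with a short sketch. The input is the one from the question; the class name and the helper GroupCount are illustrative.

```csharp
using System;
using System.Text.RegularExpressions;

static class ExplicitCaptureDemo
{
    // Hypothetical helper: number of groups (including group 0) reported
    // for the sample input with the given pattern and options.
    public static int GroupCount(string pattern, RegexOptions options)
    {
        return Regex.Match("Josh Smith [jsmith]", pattern, options).Groups.Count;
    }

    static void Main()
    {
        // Plain parentheses capture: group 0 plus group 1.
        Console.WriteLine(GroupCount(@"\[(.*?)\]", RegexOptions.None));            // prints: 2
        // ExplicitCapture turns unnamed parentheses into non-capturing groups.
        Console.WriteLine(GroupCount(@"\[(.*?)\]", RegexOptions.ExplicitCapture)); // prints: 1
        // (?:...) is non-capturing regardless of options.
        Console.WriteLine(GroupCount(@"\[(?:.*?)\]", RegexOptions.None));          // prints: 1
    }
}
```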

The parentheses identify a group as well, so the first group (index 0) is the entire match, and the second group (index 1) is the contents of what was found between the square brackets.

How? The answer is here:
(.*?)
That is a subgroup of @"\[(.*?)\]".
