Unexpected regular expression groups


I want to use regular expressions for analyzing a url, but I can't get the regex groups as I would expect them to be.
My regular expression is:
@"member/filter(.*)(/.+)*"
The strings to match:
"member/filter-one"
"member/filter-two/option"
"member/filter-three/option/option"
I expect to get the following groups:
member/filter-one, /filter-one
member/filter-two/option, /filter-two, /option
member/filter-three/option/option, /filter-three, /option(with 2 captures)
I get the expected result for the first string, but for the other two I get:
member/filter-two/option, /filter-two/option, empty string
member/filter-three/option/option, /filter-three/option/option, empty string
What can be the issue?

Try
@"member/filter([^/]*)(/.+)*"
Another way could be to use the MatchCollection this way:
string url = "member/filter-three/option/option";
url = url.Replace("member/filter-", string.Empty); // cutting static content
MatchCollection matches = new Regex(@"([^/]+)/?").Matches(url);
foreach (Match match in matches)
{
    Console.WriteLine(match.Groups[1].Value);
}
Console.ReadLine();
Here, you first remove the constant part of the string (it could be passed in as a function parameter). Then you match each segment between / characters: [^/] matches one character that is not a /, and the + quantifier after it means one or more such characters. The optional /? consumes the separator that follows a segment.

The pattern "member/filter([^/]*)(/.+)*" looks logical but is impractical, because it accepts empty options (e.g. member/filter1/////////). A stricter, more practical pattern, which also accepts more than one filter with options, is member(/filter[^/]+(/[^/]+)*)*
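To get one capture per option, as the question expects, each repetition of the option group has to stop at the next slash; a greedy (/.+) swallows every remaining segment in a single capture. The following is a minimal sketch, assuming a single filter per URL and a pattern anchored with ^ and $ (both are assumptions, and FilterUrlDemo and Parse are names made up for illustration). It reads the repeated group through the Captures collection.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class FilterUrlDemo
{
    // Group 1 is the filter segment; group 2 repeats once per option segment.
    private static readonly Regex Pattern =
        new Regex(@"^member(/filter[^/]+)(/[^/]+)*$");

    // Returns the filter followed by its options, or an empty array if the URL does not match.
    public static string[] Parse(string url)
    {
        Match m = Pattern.Match(url);
        if (!m.Success) return new string[0];

        // Groups[2].Value holds only the last repetition; Captures keeps all of them.
        var options = m.Groups[2].Captures.Cast<Capture>().Select(c => c.Value);
        return new[] { m.Groups[1].Value }.Concat(options).ToArray();
    }

    public static void Main()
    {
        // prints "/filter-one"
        Console.WriteLine(string.Join(" | ", Parse("member/filter-one")));
        // prints "/filter-three | /option | /option"
        Console.WriteLine(string.Join(" | ", Parse("member/filter-three/option/option")));
    }
}
```

Because every segment requires at least one non-slash character, an input such as member/filter1///////// is rejected and Parse returns an empty array.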

Related

Get Regex.Matches to start the match at Position 0

I am trying to use Regex to count the number of times a certain string appears in another comma-separated string.
I am using Regex.Matches(comma-separated string, certain string).Count to get the number. The only issue is that I want it to count a match only if it lines up with the start of an item in the comma-separated string.
For instance, if I have the comma separated string
string comma_separated = "dog,cat,bird,blackdog,dog(1)";
and want to see how many times the search string matches with the contents of the comma-separated string
string search = "dog";
I use:
int count = Regex.Matches(comma_separated, search).Count;
I would expect it to be 2 since it matches up with
"dog,cat,bird,blackdog,dog(1)",
however it returns a 3 since it is also matching up with the dog part of blackdog.
Is there any way I can get it to only count as a match when it recognizes a match starting at the start of the string? Or am I just using Regex incorrectly?

As noted in the comments, a regex may not be the most logical way for you to achieve your desired result. However, if you would like to use a regex to find your matches, something like this would provide your desired result
(?<=,|^)dog
This will perform a "positive lookbehind" to ensure that the word "dog" is preceded by either a comma or is at the start of the string you are searching.
More info available on lookarounds in Regex here: https://www.regular-expressions.info/lookaround.html

string comma_separated = "dog,cat,bird,blackdog,dog(1)";
int count = Regex.Matches(comma_separated, string.Format(@"\b{0}\b", Regex.Escape("dog")), RegexOptions.IgnoreCase).Count;
Putting \b on either side of the text makes the pattern match only where "dog" stands as a whole word. Here it still counts dog(1), because ( is not a word character.

Try using this pattern: search = @"\bdog";. \b matches a word boundary.
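The lookbehind approach and the non-regex alternative hinted at in the answers can be compared side by side. This is only a sketch: ItemPrefixCount, CountWithRegex and CountWithSplit are hypothetical names, and it assumes items are separated by plain commas with no surrounding whitespace.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class ItemPrefixCount
{
    // Regex version: the lookbehind requires the start of the string or a comma before the term.
    public static int CountWithRegex(string commaSeparated, string search)
    {
        string pattern = "(?<=^|,)" + Regex.Escape(search);
        return Regex.Matches(commaSeparated, pattern).Count;
    }

    // Non-regex version: split into items and test each item's prefix.
    public static int CountWithSplit(string commaSeparated, string search)
    {
        return commaSeparated.Split(',')
            .Count(item => item.StartsWith(search, StringComparison.Ordinal));
    }

    public static void Main()
    {
        string commaSeparated = "dog,cat,bird,blackdog,dog(1)";
        Console.WriteLine(CountWithRegex(commaSeparated, "dog")); // prints 2
        Console.WriteLine(CountWithSplit(commaSeparated, "dog")); // prints 2
    }
}
```

Regex.Escape keeps a search term such as "dog(1)" from being interpreted as a pattern, which matters as soon as the term comes from user input.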

Regex to match for a string but only bringing a part of it by grouping

I have a string on the following form - for fiddle see here.
{[(0.3;0.43)(10;8.2)(0;0)(0.7888;12.345)]:8.56/13.9}
I would like to filter out just the coordinates in the parentheses. The pattern I've set up matches a pair of parentheses (escaped with two backslashes so that they are not treated as grouping). Inside them, it creates a group holding the minimal run of characters that contains a semicolon. According to the debugger I have four matches, which suggests that it's correct. However, when I access Groups, I see two elements: one containing the parentheses and one without.
Regex regex = new Regex("\\((.*?;.*?)\\)");
string full = regex.Matches(figure.ToString())[0].Value;
string with = regex.Matches(figure.ToString())[0].Groups[0].Value;
string sans = regex.Matches(figure.ToString())[0].Groups[1].Value;
I'm not entirely sure where the first group (Groups[0]) gets its information from. I suspect that I haven't phrased the regular expression well enough and that it actually reacts to the escaped parentheses as if they were a grouping as well. Am I right in my suspicion? How should I reformulate the expression?

From https://msdn.microsoft.com/en-us/library/system.text.regularexpressions.match.groups(v=vs.110).aspx:
If the regular expression engine can find a match, the first element of the GroupCollection object (the element at index 0) returned by the Groups property contains a string that matches the entire regular expression pattern.
So Groups[0] has the entire value you matched (e.g. (1;2)), while Groups[1] is the first matched subgroup (e.g. 1;2).
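As an illustration of that indexing, the sketch below walks all four matches in the question's sample string and keeps only Groups[1] of each. CoordinateGroups and InnerPairs are hypothetical names; the pattern is the one from the question, written as a verbatim string.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class CoordinateGroups
{
    // Collects the text inside each pair of parentheses.
    public static List<string> InnerPairs(string figure)
    {
        var pairs = new List<string>();
        foreach (Match m in Regex.Matches(figure, @"\((.*?;.*?)\)"))
        {
            // m.Value and m.Groups[0].Value are the same text, e.g. "(0.3;0.43)".
            // m.Groups[1].Value is the captured part without the parentheses.
            pairs.Add(m.Groups[1].Value);
        }
        return pairs;
    }

    public static void Main()
    {
        string figure = "{[(0.3;0.43)(10;8.2)(0;0)(0.7888;12.345)]:8.56/13.9}";
        // prints "0.3;0.43 | 10;8.2 | 0;0 | 0.7888;12.345"
        Console.WriteLine(string.Join(" | ", InnerPairs(figure)));
    }
}
```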

Anything in unescaped parentheses is considered a group. If you don't want a group to be captured, prefix its contents with "?:"
(?:REGEX)
The group is still matched against, but it is left out of the Groups collection.
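One way to check which parentheses create groups is to ask the Regex object for its group numbers. The sketch below does that for three variants of the question's pattern; GroupCountDemo and CountGroups are names invented for this example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class GroupCountDemo
{
    // Number of groups the pattern defines, including the implicit group 0.
    public static int CountGroups(string pattern)
    {
        return new Regex(pattern).GetGroupNumbers().Length;
    }

    public static void Main()
    {
        Console.WriteLine(CountGroups(@"\(.*?;.*?\)"));       // prints 1: escaped parentheses add no group
        Console.WriteLine(CountGroups(@"\((.*?);(.*?)\)"));   // prints 3: group 0 plus two captures
        Console.WriteLine(CountGroups(@"\((?:.*?);(.*?)\)")); // prints 2: (?:...) is not captured
    }
}
```

So the escaped parentheses in the question are not treated as a group; the extra element is group 0, which always holds the whole match.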

Regex.Replace replaces more than bargained for

I'm writing some test cases for IIS Rewrite rules, but my tests are not matching the same way as IIS is, leading to some false negatives.
Can anyone tell me why the following two lines lead to the same result?
Regex.Replace("v1/bids/aedd3675-a0f2-4494-a2c0-32418cf2476a", ".*v[1-9]/bids/.*", "http://localhost:9900/$0")
Regex.Replace("v1/bids/aedd3675-a0f2-4494-a2c0-32418cf2476a", "v[1-9]/bids/", "http://localhost:9900/$0")
Both return:
http://localhost:9900/v1/bids/aedd3675-a0f2-4494-a2c0-32418cf2476a
But I would expect the last regex to return:
http://localhost:9900/v1/bids/
As the GUID is not matched.
On IIS, the pattern tester yields the result below. Is {R:0} not equivalent to $0?
What I am asking is:
Given the test input of v[1-9]/bids/, how can I match IIS' way of doing Regex replaces so that I get the result http://localhost:9900/v1/bids/, which appears to be what IIS will rewrite to.

The point here is that the pattern you have matches the test strings at the start.
The first regex, .*v[1-9]/bids/.*, matches zero or more characters other than a newline (as many as possible) up to the last v that is followed by a digit from 1 to 9 and by /bids/, and then zero or more characters other than a newline. Since the match starts at the beginning and runs to the end, the whole string is matched and placed into Group 0. In the replacement, you just prepend http://localhost:9900/ to that value.
The second regex replacement returns the same result because the regex matches v1/bids/, stores it in Group 0, and replaces it with http://localhost:9900/ + v1/bids/. What remains is just appended to the replacement result as it does not match.
You need to match that "tail" in order to remove it.
To only get the http://localhost:9900/v1/bids/, use a capturing group around the v[1-9]/bids/ and use the $1 backreference in the replacement part:
(v[1-9]/bids/).*
Replace with http://localhost:9900/$1. Result: http://localhost:9900/v1/bids/
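The difference between the two replacements can be reproduced in a few lines. This is a sketch using the input and patterns from the question; RewriteReplaceDemo, PrefixMatchOnly and DropTail are hypothetical names.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RewriteReplaceDemo
{
    private const string BaseUrl = "http://localhost:9900/";

    // Replaces only the matched part; the unmatched tail of the input stays in the result.
    public static string PrefixMatchOnly(string input)
    {
        return Regex.Replace(input, "v[1-9]/bids/", BaseUrl + "$0");
    }

    // Matches the tail as well, then writes back only group 1, so the tail is dropped.
    public static string DropTail(string input)
    {
        return Regex.Replace(input, "(v[1-9]/bids/).*", BaseUrl + "$1");
    }

    public static void Main()
    {
        string input = "v1/bids/aedd3675-a0f2-4494-a2c0-32418cf2476a";
        // prints "http://localhost:9900/v1/bids/aedd3675-a0f2-4494-a2c0-32418cf2476a"
        Console.WriteLine(PrefixMatchOnly(input));
        // prints "http://localhost:9900/v1/bids/"
        Console.WriteLine(DropTail(input));
    }
}
```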
Update
The IIS keeps the base URL and then adds the parts you match with the regex. So, in your case, you have http://localhost:9900/ as the base URL and then you match v1/bids/ with the regex. So, to simulate this behavior, just use Regex.Match:
var rx = Regex.Match("v1/bids/aedd3675-a0f2-4494-a2c0-32418cf2476a", "v[1-9]/bids/");
var res = rx.Success ? string.Format("http://localhost:9900/{0}", rx.Value) : string.Empty;

How do I exclude a regex value in a replace

I have a regular expression which searches for strings using a Prefix and a Suffix. In its simplest form, \$\$\w+\$\$ will find $$My_Name$$ (in this case the Prefix and Suffix are both equal to $$). Another example would be \[\#\w+\#\] to match [#My_Name#].
The Prefix and Suffix will always be a specific string of 0 to n characters which I can always safely escape for a direct character match.
I extract the Matches in my C# code so I can work with them, but my match contains $$My_Name$$, whereas what I want is just the inner string between the Prefix and Suffix: My_Name.
How do I exclude the Prefix and Suffix from the result?

Change your regex to \$\$(\w+)\$\$ and read group 1 (Groups[1] in code, $1 in a replacement string) to get the matching (inner) string.
For example
string pattern = @"\$\$(\w+)\$\$";
string input = "$$My_Name$$";
Regex rgx = new Regex(pattern);
Match result = rgx.Match(input);
Console.WriteLine(result.Groups[1]);
Outputs: "My_Name"
P.S - There's no need to use explicitly typed local variables, but I just wanted the types to be clear.
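Since the question says the Prefix and Suffix can be any fixed strings, the same idea can be generalized by escaping both with Regex.Escape and capturing what lies between them. The sketch below is one possible shape for that; TokenExtractor and Inner are hypothetical names, and it assumes the inner text consists of word characters only, as in the question's patterns.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TokenExtractor
{
    // Returns the text between prefix and suffix, or null if no token is found.
    public static string Inner(string input, string prefix, string suffix)
    {
        // Regex.Escape turns the delimiters into literals; group 1 is the inner name.
        string pattern = Regex.Escape(prefix) + @"(\w+)" + Regex.Escape(suffix);
        Match m = Regex.Match(input, pattern);
        return m.Success ? m.Groups[1].Value : null;
    }

    public static void Main()
    {
        Console.WriteLine(Inner("$$My_Name$$", "$$", "$$"));        // prints "My_Name"
        Console.WriteLine(Inner("Hello [#My_Name#]!", "[#", "#]")); // prints "My_Name"
    }
}
```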

You can wrap \w+ in a capturing group, like this: (\w+). Then, when you retrieve the match, you can ask for that subgroup.
I may be wrong (you didn't provide any code), but I think this is how it is done: .Groups[1].Value on the result of Regex.Match.

How about the regex below? It captures the first character into a named group, then captures that character plus any repeats of it into a named group called first_group, which it then uses to match the end of the string. It will work with any number of repeated characters, so long as they are repeated at the end of the word.
'(?<first_group>(?<first_char>.)\k<first_char>+)(?<word>\w+)\k<first_group>+'
You just need to then extract the capture group named word like so:
String sample = "$$My_Name$$";
// The pattern must be a verbatim string (@"..."), otherwise \k and \w are invalid C# escapes.
Regex regex = new Regex(@"(?<first_group>(?<first_char>.)\k<first_char>+)(?<word>\w+)\k<first_group>+");
Match match = regex.Match(sample);
if (match.Success)
{
    Console.WriteLine(match.Groups["word"].Value);
}

You can use a named group like this:
(\$\$)(?<group1>.+?)\1 -- pattern 1 (first case)
\[(#)(?<group2>.+?)\1\] -- pattern 2 (second case)
The combined representation would be:
(\$\$)(?<group1>.+?)\1|\[(#)(?<group2>.+?)\2\]
In .NET, unnamed groups are numbered before named ones, so (#) is group 2 in the combined pattern.
I would suggest using .+? so that the group takes as few characters as possible and stops at the first occurrence of the suffix.

C# Regular expressions, retrieving two words separated by a comma, parenthesis operator

I've been playing around with retrieving data from a string using regular expression, mostly as an exercise for myself. The pattern that I'm trying to match looks like this:
"(SomeWord,OtherWord)"
After reading some documentation and looking at a cheat sheet I came to the conclusion that the following regex should give me 2 matches:
"\((\w),(\w)\)"
Because according to the documentation the parenthesis should do the following:
(pattern) Matches pattern and remembers the match. The matched
substring can be retrieved from the resulting Matches collection,
using Item [0]...[n]. To match parentheses characters ( ), use "\ (" or
"\ )".
However, the following code (error checking removed for conciseness) matches something quite different:
string line = "(A,B)";
string pattern = @"\((\w),(\w)\)";
MatchCollection matches = Regex.Matches(line, pattern);
string left = matches[0].Value;
string right = matches[1].Value;
Now I would expect left to become "A" and right to become "B". However left becomes "(A,B)" and there is no second match at all. What am I missing here?
(I know this example is trivial to solve without regexes but to learn how to properly use regexes I should be able to make something simple as this work)

You want the Groups member of the first match. In your example there is only one match, which is the whole string. Its Groups collection has three items. Try this sample code: left should be A and right should be B. Groups[0].Value is the whole string.
string line = "(A,B)";
string pattern = @"\((\w),(\w)\)";
MatchCollection matches = Regex.Matches(line, pattern);
GroupCollection groups = matches[0].Groups;
string left = groups[1].Value;
string right = groups[2].Value;

\w matches only one word character. If words have to contain at least one character, the expression should be:
string pattern = @"\((\w+),(\w+)\)";
If words may be empty:
string pattern = @"\((\w*),(\w*)\)";
+: means one or more repetitions.
*: means zero, one or more repetitions.
In any case, you will get one match with three groups, the first containing the whole string including the left and right parentheses, the two others the two words.

I think the problem is that you're confusing the concept of a match and a group.
A MatchCollection contains a list of strings that matched your entire regex, not just the parenthetical groups inside that Regex. For example, if the string you searched looked like this...
(A,B)(C,D)
...then you would have two matches: (A,B) and (C,D).
However, there's good news: you can get the groups from each match very easily, like so:
string line = "(A,B)";
string pattern = @"\((\w),(\w)\)";
MatchCollection matches = Regex.Matches(line, pattern);
string left = matches[0].Groups[1].Value;
string right = matches[0].Groups[2].Value;
That Groups variable is a collection of parenthetical groups from a single match.
Edit:
Olivier Jacot-Descombes made a very good point: we all got so hung up explaining match vs. group that we forgot to notice a second problem: \w will only match a SINGLE character. You need to add a quantifier (such as +) in order to grab more than one character at a time. Olivier's answer should explain that part clearly.
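Putting both points together (one match per parenthesized pair, two groups per match, and a + quantifier so whole words are captured), a rough sketch might look like the following. PairParser and Parse are made-up names, and the two-pair input is just an illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class PairParser
{
    // Returns one two-element array per "(left,right)" occurrence in the line.
    public static List<string[]> Parse(string line)
    {
        var pairs = new List<string[]>();
        // Each Match is one whole "(left,right)"; Groups[1] and Groups[2] are the two words.
        foreach (Match m in Regex.Matches(line, @"\((\w+),(\w+)\)"))
        {
            pairs.Add(new[] { m.Groups[1].Value, m.Groups[2].Value });
        }
        return pairs;
    }

    public static void Main()
    {
        foreach (string[] pair in Parse("(SomeWord,OtherWord)(C,D)"))
        {
            // prints "SomeWord / OtherWord", then "C / D"
            Console.WriteLine(pair[0] + " / " + pair[1]);
        }
    }
}
```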

First off, it's one "match", with 2 "groups"...
I would recommend you name the groups anyway...
string pattern = @"\((?<FirstWord>\w+),(?<SecondWord>\w+)\)";
Then you could do...
Match m = Regex.Match(line, pattern);
string firstWord = m.Groups["FirstWord"].Value;

Since all you are looking for are the characters separated by a comma, you can simply use \w as your pattern. The matches will be A and B.
A handy site for testing your Regex is http://gskinner.com/RegExr/
