Odd regexp behaviour - matches only first and last capture group - c#

I am trying to write a regexp that matches a comma-separated list of words and captures every word. For example, the line "   apple , banana ,orange,peanut  " should match, and the captures should be apple, banana, orange and peanut. To do that I use the following regexp:
^\s*([a-z_]\w*)(?:\s*,\s*([a-z_]\w*))*\s*$
It successfully matches the string, but only apple and peanut are captured. I see this behaviour in both C# and Perl, so I assume I am missing something about how regexp matching works. Any ideas? :)

The value given by match.Groups[2].Value is just the last value captured by the second group.
To find all the values, look at match.Groups[2].Captures[i].Value where in this case i ranges from 0 to 2. (As well as match.Groups[1].Value for the first group.)
(+1 for question, I learned something today!)
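A minimal sketch of the Captures approach described in this answer, assuming the regexp from the question is used unchanged. The names CaptureDemo and Words are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class CaptureDemo
{
    // Returns every word of a comma-separated list,
    // or an empty list when the input does not match the pattern.
    public static List<string> Words(string input)
    {
        var match = Regex.Match(input, @"^\s*([a-z_]\w*)(?:\s*,\s*([a-z_]\w*))*\s*$");
        var words = new List<string>();
        if (!match.Success)
        {
            return words;
        }

        // Group 1 is not repeated, so it holds exactly one value.
        words.Add(match.Groups[1].Value);

        // Group 2 is repeated; .NET keeps every iteration in Captures.
        foreach (Capture capture in match.Groups[2].Captures)
        {
            words.Add(capture.Value);
        }
        return words;
    }

    public static void Main()
    {
        Console.WriteLine(string.Join("|", Words("   apple , banana ,orange,peanut  ")));
    }
}
```

Note that the Captures collection is specific to .NET; Perl only exposes the last iteration of a repeated group.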

Try this:
// Requires: using System.Linq; using System.Text.RegularExpressions;
string text = " apple , banana ,orange,peanut";
var matches = Regex.Matches(text, @"\s*(?<word>\w+)\s*,?")
    .Cast<Match>()
    .Select(x => x.Groups["word"].Value)
    .ToList();

You are repeating your capturing group; at each repetition the previously captured content is overwritten, so only the last match of your second capturing group is available at the end.
You can change your second capturing group to
^\s*([a-z_]\w*)((?:\s*,\s*(?:[a-z_]\w*))*)\s*$
Then the second group would contain " , banana ,orange,peanut". I am not sure whether that is what you want.
If you want to check that the string has that pattern and also extract each word, I would do it in two steps:
Check the pattern with your regex.
If the pattern is correct, remove leading and trailing whitespace and split on \s*,\s*.
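The two steps could look roughly like this. It is only a sketch: the validation pattern is the question's regexp with the capturing groups dropped, and the names TwoStepSplit and Words are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TwoStepSplit
{
    // Step 1: validate the whole line. Step 2: trim and split on the separators.
    // Returns an empty array when the line does not have the expected shape.
    public static string[] Words(string input)
    {
        if (!Regex.IsMatch(input, @"^\s*[a-z_]\w*(?:\s*,\s*[a-z_]\w*)*\s*$"))
        {
            return new string[0];
        }
        return Regex.Split(input.Trim(), @"\s*,\s*");
    }

    public static void Main()
    {
        Console.WriteLine(string.Join("|", Words("   apple , banana ,orange,peanut  ")));
    }
}
```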

Simple regexp:
(?:^| *)(.+?)(?:,|$)
Explanation:
(?:^| *) # Non-capturing group: start of line, or zero or more spaces
(.+?) # Capture the word in the list, lazily
(?:,|$) # Non-capturing group: a comma or end of line
Note: the captured text can still carry surrounding spaces, so trim each capture. Rubular is a nice website for testing this kind of thing.

Related

Regex c# obtain subgroup of a captured group

It seems a simple question, but I don't think it is so easy.
From the example string AAACARACBBBBBDZAAAAEE, I want to extract the first 8 characters (= AAACARAC) and from this resulting 8-char long string, I want to extract everything except the leading 'A' characters (= CARAC).
I tried with this regex (?^[A]<WORD>\w{8}), but I don't know how to apply another regex to the captured group named WORD.
This is the regex you want:
(?=^.{8}(.*)$)A*(?<WORD>.*?)\1$
The lookahead first skips the first eight characters and captures everything that comes after them (the "tail") in the first capturing group. Matching then restarts from the beginning of the string, skips all the leading As, and matches as few characters as possible such that they are followed by exactly the content of the first capturing group.
Using C#, you might also use a positive lookbehind to assert 8 chars to the left, matching optional A's and capture the chars that follow in a group.
^A*(?<WORD>[^\sA].*)(?<=^.{8})
^ Start of string
A* match optional repetitions of A
(?<WORD> Named group WORD
[^\sA].* Match a non-whitespace char other than A, then the rest of the line
) Close named group WORD
(?<=^.{8}) Assert 8 chars to the left of the current position
If you only want to match word characters:
^A*(?<WORD>[^\WA]\w*)(?<=^\w{8})
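For illustration, here is a small sketch that runs both suggested patterns against the string from the question. The class and method names are assumptions, and an empty string is returned when nothing matches.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SubgroupDemo
{
    // Lookahead variant: group 1 holds the tail after the first 8 characters.
    public static string ViaLookahead(string input)
    {
        var m = Regex.Match(input, @"(?=^.{8}(.*)$)A*(?<WORD>.*?)\1$");
        return m.Success ? m.Groups["WORD"].Value : "";
    }

    // Lookbehind variant: the match must end exactly 8 characters from the start.
    public static string ViaLookbehind(string input)
    {
        var m = Regex.Match(input, @"^A*(?<WORD>[^\sA].*)(?<=^.{8})");
        return m.Success ? m.Groups["WORD"].Value : "";
    }

    public static void Main()
    {
        const string input = "AAACARACBBBBBDZAAAAEE";
        Console.WriteLine(ViaLookahead(input));
        Console.WriteLine(ViaLookbehind(input));
    }
}
```

Both calls should yield CARAC for this input. The lookbehind variant relies on .NET allowing quantified content inside a lookbehind, which many other regex engines reject.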

Matching a sequence of characters split by spaces after a prefix

I have the following strings:
-prefix <#141222969505480701> where the second part e.g. <#141222969505480701> can be repeated unlimited times (only the numbers change).
-prefix 141222969505480701 which should behave the same as above.
-prefix 141222969505480701 <#141222969505480702> which would still be able to repeat itself forever.
The last one should have groups containing 141222969505480701 and 141222969505480702.
So a few bits of information:
The digit sequences are always 18 digits long, so I use \d{18} in my regex
I would like to have the numbers in groups for me to use them afterwards.
What I tried
First off, I tried to match the first of my example strings.
-prefix(\s<#\d{18}>)\1* which would match the entire string, but I would like to have the digits themselves in their own group. Also, this method only matches repetitions of the same part, e.g. <#141222969505480701> <#141222969505480701> <#141222969505480701> would match, but any other number in between wouldn't.
What would sound logical in my head
-prefix (\d{18})+ but it would only match the first one of the 'digit parts'.
While I was testing it on regex101 it told me the following:
A repeated capturing group will only capture the last iteration. Put a capturing group around the repeated group to capture all iterations or use a non-capturing group instead if you're not interested in the data.
I tried to adjust the regex to the following -prefix ((\d{18})+), but with the same result.
With the help of @madreflection in the comments, I was able to come up with this solution:
-prefix([\s]*(<#|)(?<digits>[0-9]{18})>?)+
This is exactly what I needed, and it even ignores the spaces in between. Using match.Groups["digits"].Captures also made the whole thing a lot easier.
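A short sketch of how that solution can be consumed, assuming the pattern above is used as-is. PrefixDemo and Digits are illustrative names, not part of any library.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class PrefixDemo
{
    // Collects every 18-digit id that follows "-prefix", with or without <# >.
    public static List<string> Digits(string input)
    {
        var match = Regex.Match(input, @"-prefix([\s]*(<#|)(?<digits>[0-9]{18})>?)+");
        var ids = new List<string>();
        if (match.Success)
        {
            // The named group sits inside a repeated group, so read all iterations.
            foreach (Capture capture in match.Groups["digits"].Captures)
            {
                ids.Add(capture.Value);
            }
        }
        return ids;
    }

    public static void Main()
    {
        var ids = Digits("-prefix 141222969505480701 <#141222969505480702>");
        Console.WriteLine(string.Join(", ", ids));
    }
}
```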
You could use an alternation to list the 3 different allowed formats. .NET supports reusing the same group name.
-prefix\s*(?:(?<digits>[0-9]{18})\s*<#(?<digits>[0-9]{18})>|(?<digits>[0-9]{18})|<#(?<digits>[0-9]{18}))
Pattern parts
-prefix\s* Match literally followed by 0+ whitespace characters
(?: Non capturing group
(?<digits>[0-9]{18})\s*<#(?<digits>[0-9]{18})> 2 named capturing groups which will match the digits
| Or
(?<digits>[0-9]{18}) Named capturing group, match digits only
| Or
<#(?<digits>[0-9]{18}) Named capturing group, match digits between brackets only
)
You could also use 2 named capturing groups, 1 for each format. For example:
-prefix\s*(?:(?<digits>[0-9]{18})\s*<#(?<digitsBrackets>[0-9]{18})>|(?<digits>[0-9]{18})|<#(?<digitsBrackets>[0-9]{18}))

Get the nth word of a line

With this code:
var regex = new Regex(@"^(?:\S+\s){2}(\S+)");
var match = regex.Match("one two three four five");
if (match.Success)
{
    Console.WriteLine(match.Value);
}
I want to retrieve the third word of the line --> "three".
But instead, I get "one two three".
Edit:
I know that I could do it with s.Split(' ')[2] but I want to do it with regex.
If you want to use the Match method only, without referring to groups, then you have to use a look-behind. Basically you say: find a word that is preceded by two words. Your current regex says "find 2 words + 1 word", so you just have to change the "find 2 words" part to "preceded by 2 words", i.e. ^(?:\S+\s){2} becomes (?<=^(\S+\s){2}):
(?<=^(\S+\s){2})\S+
match.Value returns the entire matched substring, which includes the non-capturing parts of your regex. You should instead use match.Groups[1].Value to get the value of the first capturing group.
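Both answers can be compared side by side. The sketch below uses made-up names (NthWord, ByGroup, ByLookbehind) and returns an empty string when the line has fewer than three words.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NthWord
{
    // Original pattern, but reading the first capturing group instead of match.Value.
    public static string ByGroup(string line)
    {
        var m = Regex.Match(line, @"^(?:\S+\s){2}(\S+)");
        return m.Success ? m.Groups[1].Value : "";
    }

    // Look-behind pattern: the two leading words are asserted, not consumed,
    // so match.Value is the third word itself.
    public static string ByLookbehind(string line)
    {
        var m = Regex.Match(line, @"(?<=^(\S+\s){2})\S+");
        return m.Success ? m.Value : "";
    }

    public static void Main()
    {
        Console.WriteLine(ByGroup("one two three four five"));
        Console.WriteLine(ByLookbehind("one two three four five"));
    }
}
```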

C# Regular Expression: Search the first 3 letters of each name

Does anyone know how I can write a regex (C#) that searches by the first 3 letters of each part of a full name?
Without the use of (.*)
I used (.*), but it runs through the text far beyond the requested name; or,
if it finds the first condition and then finds the second condition 100 words later, it returns text that is not what I am looking for, so I have to limit the number of words.
Example: \s*(?:\s+\S+){0,2}\s*
I would like to ignore names with fewer than 3 characters if they exist in the name.
Search for any name whose parts start with these 3-character prefixes:
'Mar Jac Rey' (regex that performs search)
Should match:
Marck Jacobs L. S. Reynolds
Marcus Jacobine Reys
Maroon Jacqueline by Reyils
Can anyone help me?
The zero-or-more quantifier (*) is 'greedy' by default; that is, it will consume as many characters as possible while still allowing the remainder of the pattern to match. This is why Mar.*Jac will match from the first Mar in the input to the last Jac, including everything in between.
One potential solution is just to make your pattern 'non-greedy' (*?). This will make it consume as few characters as possible in order to match the remainder of the pattern.
Mar.*?Jac.*?Rey
However, this is not a great solution because it would still match the various name parts regardless of what other text appears in between—e.g. Marcus Jacobine Should Not Match Reys would be a valid match.
To allow only whitespace or at most 2 consecutive non-whitespace characters to appear between each name part, you'd have to get more fancy:
\bMar\w*(\s+\S{0,2})*\s+Jac\w*(\s+\S{0,2})*\s+Rey\w*
The pattern (\s+\S{0,2})*\s+ will match any number of non-whitespace tokens of at most two characters each, separated and surrounded by whitespace. The \w* after each name part ensures that the entire name is included in that part of the match (you might want to use \S* instead here, but that's not entirely clear from your question). And I threw in a word boundary (\b) at the beginning to ensure that the match does not start in the middle of a 'word' (e.g. OMar would not match).
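To see the difference between the lazy pattern and the stricter one, here is a small sketch run against the sample names from the question. NameSearch and its two methods are hypothetical names.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NameSearch
{
    // Lazy version: accepts any text between the name parts.
    public static bool Loose(string text)
    {
        return Regex.IsMatch(text, @"Mar.*?Jac.*?Rey");
    }

    // Strict version: only tokens of at most two characters may sit between the parts.
    public static bool Strict(string text)
    {
        return Regex.IsMatch(text, @"\bMar\w*(\s+\S{0,2})*\s+Jac\w*(\s+\S{0,2})*\s+Rey\w*");
    }

    public static void Main()
    {
        string[] samples =
        {
            "Marck Jacobs L. S. Reynolds",
            "Marcus Jacobine Reys",
            "Maroon Jacqueline by Reyils",
            "Marcus Jacobine Should Not Match Reys",
        };
        foreach (var s in samples)
        {
            Console.WriteLine("{0}: loose={1}, strict={2}", s, Loose(s), Strict(s));
        }
    }
}
```

The first three samples satisfy both patterns; the last one satisfies only the lazy pattern, which is the weakness described above.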
I think what you want is this regular expression, which checks whether the input starts with one of the prefixes (pass RegexOptions.IgnoreCase to make it case-insensitive):
@"^(?:Mar|Jac|Rey)"
Less specific:
@"^\w{3}"
If you want to capture the first three letters of every word longer than three characters, you could use something like:
((?<name>[\w]{3})\w+)+
And enable ExplicitCapture when initializing your Regex.
It returns a series of matches; in each one, the group named "name" holds the result.
Code sample:
Regex regex = new Regex(@"((?<name>[\w]{3})\w+)+", RegexOptions.ExplicitCapture | RegexOptions.IgnoreCase);
var matches = regex.Matches("Marck Jacobs L. S. Reynolds");
If you also want to capture three-character words, replace the trailing \w+ with \w*.
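As a rough sketch, the matches can be turned into a list of prefixes like this. It assumes the pattern and options from the code sample; PrefixExtract and Prefixes are illustrative names.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class PrefixExtract
{
    // Returns the first three characters of every word longer than three characters.
    public static List<string> Prefixes(string text)
    {
        var regex = new Regex(@"((?<name>[\w]{3})\w+)+",
            RegexOptions.ExplicitCapture | RegexOptions.IgnoreCase);
        var result = new List<string>();
        foreach (Match match in regex.Matches(text))
        {
            result.Add(match.Groups["name"].Value);
        }
        return result;
    }

    public static void Main()
    {
        Console.WriteLine(string.Join("|", Prefixes("Marck Jacobs L. S. Reynolds")));
    }
}
```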

Parsing a regex

I am having trouble writing a regular expression in C#; its purpose is to extract all words that start with '#' from a given string so they can be stored in some type of data structure.
If the string is "The quick #brown fox jumps over the lazy #dog", I'd like to get an array that contains two elements: brown and dog. It needs to handle the edge cases properly. For example, if it's ##brown, it should still produce 'brown' not '#brown'.
something like this
C#:
string quick = "The quick #brown fox jumps over the lazy #dog ##dog";
MatchCollection results = Regex.Matches(quick, "#\\w+");
foreach (Match m in results)
{
Literal1.Text += m.Value.Replace("#", "");
}
takes care of your edge case too. (##dog => dog)
#[\w\d]+ should work for you.
Tested using http://www.regextester.com/.
This works by matching the #, followed by one or more word characters. The \w represents any "word character", the \d represents any digit, and the + (repetition) indicates one or more. Wrapping \w and \d in brackets allows either of them, although \d is redundant here because \w already includes digits.
To exclude the # you could use str.Substring(1) to ignore the first character, or use the regex #([\w\d]+) and extract the first group.
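The capturing-group variant might be used as follows. This is a sketch only; HashWords and Extract are names chosen for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class HashWords
{
    // Returns the text after each '#', without the '#' itself.
    public static List<string> Extract(string input)
    {
        var words = new List<string>();
        foreach (Match match in Regex.Matches(input, @"#([\w\d]+)"))
        {
            words.Add(match.Groups[1].Value);
        }
        return words;
    }

    public static void Main()
    {
        var words = Extract("The quick #brown fox jumps over the lazy #dog");
        Console.WriteLine(string.Join(", ", words));
    }
}
```

For ##brown, the first # cannot start a match because it is followed by another #, so the match begins at the second # and the group still contains brown.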
Depending on your definition of "word" (\w is closer to the C-language definition of a character valid in an identifier or keyword: [A-Za-z0-9_]), you might try the following. I'm defining "word" here as a sequence of non-whitespace characters:
(^|\s)(#+(?<atword>[^\s]+))(\s|$)
The above matches the following:
Match start-of-string or a whitespace character, followed by
1 or more # characters, followed by
1 or more non-whitespace characters, in group named 'atword', followed by
a whitespace character or end-of-string.
For successful matches, the named group atword will contain the text following the lead-in # sign(s).
So:
This ## foo matches too, but only because #+ backtracks to a single #, so atword contains just #.
This #foo bar will match.
###foobarbat is kind of silly will match.
###foobar#bazabat will match.
silly.#rabbit, tricks are for kids won't match, but
silly #rabbit, tricks are for kids will match and you'll get "rabbit," (with the trailing comma) rather than "rabbit" (like I said, you need to think about how you define 'word').
etc.
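The examples can be checked with a short sketch like the one below, which assumes the pattern is used exactly as written. AtWordDemo and Extract are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class AtWordDemo
{
    // Returns the contents of the 'atword' group for every match in the input.
    public static List<string> Extract(string input)
    {
        var words = new List<string>();
        foreach (Match match in Regex.Matches(input, @"(^|\s)(#+(?<atword>[^\s]+))(\s|$)"))
        {
            words.Add(match.Groups["atword"].Value);
        }
        return words;
    }

    public static void Main()
    {
        string[] samples =
        {
            "This #foo bar",
            "###foobar#bazabat",
            "silly.#rabbit, tricks are for kids",
            "silly #rabbit, tricks are for kids",
        };
        foreach (var s in samples)
        {
            Console.WriteLine("{0} -> [{1}]", s, string.Join("; ", Extract(s)));
        }
    }
}
```

One thing to keep in mind: the pattern consumes the whitespace on both sides of a word, so in input such as "#a #b" the second word is skipped because its leading space already belongs to the first match. Replacing the outer groups with lookarounds would avoid that, at the cost of a less readable pattern.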
