Include bad groups in the matches - c#

I have a string like this:
&l&mmabc&od&l&r&mef&lg&l&e&j&rh
I want to get the following matches and groups. (Consider the parentheses to be groups and lines to be matches)
(&l&m)(mabc)
(&o)(d)
(&l&r&m)(ef)
(&l)(g)
(&l)(&e&j)
(&r)(h)
So far, I have got this:
(&[lmnor])+(\w+)
The results of the match are as follows:
You can see that the substring &l&e&j is not included in the matches. I know it's the problem with \w+ but I can't seem to figure out how to include those matches. The first group should only contain anything which matches &[lmnor] (it can contain multiple of those if they are close together, which is the reason I used +). The second group should contain anything other than those letters.
(&[lmnor])+(.*) doesn't work. (&[lmnor])+^(&[lmnor])+ doesn't either.

You can see that the substring &l&e&j is not included in the matches. I know it's the problem with \w+ but I can't seem to figure out how to include those matches.
& is not a word character, so \w+ cannot match it. That is why the substring containing & symbols is neither matched nor captured.
The first group should only contain anything which matches &[lmnor] (it can contain multiple of those if they are close together).
This is a case for a non-capturing group with a quantifier inside a capturing group: ((?:&[lmnor])+). The inner group matches each &[lmnor] sequence, and the outer group captures the whole run as group 1.
The second group should contain anything other than those letters.
This is a job for a tempered greedy token: (?:(?!&[lmnor]).)*. It matches any character, one at a time, as long as that character does not start an &[lmnor] sequence. A negated character class cannot be used here because the sequence to stop at is two characters long, not one.
So, you can use the following regex:
((?:&[lmnor])+)((?:(?!&[lmnor]).)*)
Another regex follows the same logic, but uses lazy dot matching and a boundary expressed as a positive lookahead that checks for either the end of the string or the next &[lmnor] sequence:
((?:&[lmnor])+)(.*?)(?=$|&[lmnor])
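As a rough check, the two patterns above can be run side by side in C#. This is only a minimal sketch: the class name GroupDemo and the helper Extract are made up for illustration and are not part of the original answer.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class GroupDemo
{
    // Tempered greedy token: consume characters that do not start an &[lmnor] sequence.
    public const string Tempered = @"((?:&[lmnor])+)((?:(?!&[lmnor]).)*)";

    // Lazy dot, bounded by a lookahead for end of string or the next &[lmnor] sequence.
    public const string Lazy = @"((?:&[lmnor])+)(.*?)(?=$|&[lmnor])";

    // Hypothetical helper: returns "group1|group2" for every match.
    public static List<string> Extract(string input, string pattern)
    {
        var result = new List<string>();
        foreach (Match m in Regex.Matches(input, pattern))
            result.Add(m.Groups[1].Value + "|" + m.Groups[2].Value);
        return result;
    }

    public static void Main()
    {
        const string input = "&l&mmabc&od&l&r&mef&lg&l&e&j&rh";
        foreach (string pattern in new[] { Tempered, Lazy })
            Console.WriteLine(string.Join("  ", Extract(input, pattern)));
    }
}
```

Both patterns should give the same six pairs for the sample string, including the pair whose second group is &e&j.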

This can be done more simply with the capture collections built into C# regular expressions.
The example below reads the Captures collection of each group, and I believe it is much faster.
C#
Match aM = Regex.Match(
    @"&l&mmabc&od&l&r&mef&lg&l&e&j&rh",
    @"^(?:((?:&[lmnor])+)(.*?))+$");
if (aM.Success)
{
    // Each repetition of the outer group adds one capture to groups 1 and 2.
    CaptureCollection cc1 = aM.Groups[1].Captures;
    CaptureCollection cc2 = aM.Groups[2].Captures;
    for (int i = 0; i < cc1.Count; i++)
        Console.WriteLine("[{0}] = {1} {2}", i, cc1[i].Value, cc2[i].Value);
}
Output:
[0] = &l&m mabc
[1] = &o d
[2] = &l&r&m ef
[3] = &l g
[4] = &l &e&j
[5] = &r h

Related

How to match any repeated chunks of characters?

I've seen many questions similar to this but none quite like it.
I have strings like this:
HF-01-HF-01-01
FBC-FBC-04
OZYA-03A-OZYA-03A-03
QC-QC-02
and want them to be returned like so:
HF-01-01
FBC-04
OZYA-03A-03
QC-02
I can't figure this out and the other questions I've seen don't apply because 1) the repeated chunk is more than one character, 2) There are no spaces between the repetition.
Or is regex not the best way to do this?
EDIT:
Rules
Alpha chunks are never repeated more than one time.
Some chunks can be alphanumeric but are also never repeated more than one time.
The part that can be repeated starts at the beginning of the string and may include additional hyphen-separated chunks.
So you would never have something like HF-HF-01-01. But in this case using the above rules, it would become HF-01-01 since HF is the only part repeated from the beginning of the string.
Perhaps something like this would work:
Scan string to first hyphen, see if that matches anywhere else after first hyphen, if so scan to second hyphen, see if that matches anywhere else, if not, take the first scan and remove one instance of it from the string, if so, scan to third, etc.
But I don't know how to do that in regex.
I'm not sure if RegExp is the right tool here.
Using MoreLinq's RunLengthEncode method (which implements run-length encoding), you can achieve it like this:
string RemoveDuplicate(string input)
{
    var chunks = input.Split('-')       // cut at each hyphen
        .RunLengthEncode()              // group adjacent equal chunks and count them
        .Select(kvp => kvp.Key);        // keep only the chunk value
    return string.Join("-", chunks);    // rejoin with hyphens
}
Edit
Doesn't work for:
OZYA-03A-OZYA-03A-03
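A hedged alternative that also handles a repeated prefix of several chunks, such as OZYA-03A, is to compare the first k chunks with the next k chunks. This is a sketch, not part of the original answer: the names PrefixDedup and RemoveRepeatedPrefix are invented here, and it assumes the repeated part always sits at the very start of the string and is immediately followed by its copy, as in the question's examples.

```csharp
using System;
using System.Linq;

public static class PrefixDedup
{
    // Drops one copy of a chunk sequence that is repeated at the start of the string.
    public static string RemoveRepeatedPrefix(string input)
    {
        string[] chunks = input.Split('-');

        // Try the longest candidate prefix first, then shorter ones.
        for (int k = chunks.Length / 2; k >= 1; k--)
        {
            bool repeated = chunks.Take(k).SequenceEqual(chunks.Skip(k).Take(k));
            if (repeated)
                return string.Join("-", chunks.Skip(k));
        }

        return input; // no repeated prefix found
    }

    public static void Main()
    {
        string[] samples = { "HF-01-HF-01-01", "FBC-FBC-04", "OZYA-03A-OZYA-03A-03", "QC-QC-02" };
        foreach (string s in samples)
            Console.WriteLine(RemoveRepeatedPrefix(s));
    }
}
```

Under those assumptions the four sample strings come out as HF-01-01, FBC-04, OZYA-03A-03 and QC-02, and a string without a repeated prefix is returned unchanged.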
I guess,
([^-\r\n]+-|[^-\r\n]+-[^-\r\n]+-)(\1.*)
or with start/end anchors,
^([^-\r\n]+-|[^-\r\n]+-[^-\r\n]+-)(\1.*)$
might work to some extent and the desired output is in the last capturing group:
(\1.*)
Test
using System;
using System.Text.RegularExpressions;

public class Example
{
    public static void Main()
    {
        string pattern = @"([^-\r\n]+-|[^-\r\n]+-[^-\r\n]+-)(\1.*)";
        string input = @"HF-01-HF-01-01
FBC-FBC-04
OZYA-03A-OZYA-03A-03
QC-QC-02
and want them to be returned like so:
HF-01-01
FBC-04
OZYA-03A-03
QC-02";
        RegexOptions options = RegexOptions.Multiline;
        // Print every match together with its position in the input.
        foreach (Match m in Regex.Matches(input, pattern, options))
        {
            Console.WriteLine("'{0}' found at index {1}.", m.Value, m.Index);
        }
    }
}
If you wish to simplify, modify, or explore the expression, it is explained in the top-right panel of regex101.com, where you can also watch how it matches some sample inputs.
I'm not sure if regex is the right tool here, but at least it can be partly done with this short pattern:
^([A-Z0-9]+)-.*(\1.*)$
Explanation:
^ start of string
( group 1 start
[A-Z0-9]+ one or more capital letters or digits
) end group 1
- literal
.* any number of any chars
( group 2 start
\1 anything that was matched in group 1
.* any number of any chars
) end group 2 (this group will be used as the result)
$ end of string
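The pattern explained above could be applied in C# roughly as follows. This is a sketch only: the class name ShortPattern and the method Clean are invented for illustration, and returning the input unchanged when nothing matches is an assumption rather than something the answer specifies.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ShortPattern
{
    // Group 1 is the first chunk; group 2 starts at its last repetition.
    private static readonly Regex Repeated = new Regex(@"^([A-Z0-9]+)-.*(\1.*)$");

    // Returns group 2 when the first chunk occurs again, otherwise the input itself.
    public static string Clean(string input)
    {
        Match m = Repeated.Match(input);
        return m.Success ? m.Groups[2].Value : input;
    }

    public static void Main()
    {
        string[] samples = { "HF-01-HF-01-01", "FBC-FBC-04", "OZYA-03A-OZYA-03A-03", "QC-QC-02" };
        foreach (string s in samples)
            Console.WriteLine("{0} -> {1}", s, Clean(s));
    }
}
```

Because the middle .* is greedy, \1 ends up on the last occurrence of the first chunk. Note that \1 is matched as plain text, so a first chunk that merely appears inside a later chunk would also trigger a match.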

C# Regex to obtain string up until a pattern

I've always been really bad when it comes to using regular expressions but it is something I want to seriously understand because as we all know, it is quite useful.
This is for a personal project, to keep my folders organized and neat.
I have a bunch of folders with the following naming pattern XXXXXXXX.XXXXXXX.XXXXXX.SYY.EYY.SOMETHINGELSE
There can be any amount of X repeating separated by ".", but the SYY.EYY is always there. So what I want is a regular expression to retrieve all the text represented by XXX without the "." if possible up until the SYY.EYY pattern.
I managed to detect the pattern because YY are always numbers, so doing something like \d{2} will detect it, but I'm wondering if it's possible to also add the rest of the pattern to that \d{2}.
Any help is appreciated :)
If the YY is as you stated 2 digits and you want to get the text except the . up until for example S11.E22 you could make use of the \G anchor and a capturing group to get the text without a dot.
The value is in the Match.Groups property.
\G(?!S[0-9]{2}\.E[0-9]{2})([^.]+)\.
In parts
\G Assert position at the end of previous match (start at the beginning)
(?! Negative lookahead, assert what is directly to the right is not
S[0-9]{2}\.E[0-9]{2} Match S, 2 digits, a dot, E and 2 digits
) Close lookahead
( Capture group 1
[^.]+ Match 1+ times any char except a dot
) Close group 1
\. Match dot literal
For example
string pattern = @"\G(?!S[0-9]{2}\.E[0-9]{2})([^.]+)\.";
string input = @"XXXXXXXX.XXXXXXX.XXXXXX.S11.E22.SOMETHINGELSE";
foreach (Match m in Regex.Matches(input, pattern))
{
    // Group 1 holds one dot-free part of the name.
    Console.WriteLine(m.Groups[1].Value);
}
Output
XXXXXXXX
XXXXXXX
XXXXXX
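If a single string is wanted rather than one line per part, the captured values can be joined. The following is a minimal sketch assuming a space as the separator; the names FolderName and Title are hypothetical.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class FolderName
{
    // Collects every dot-free part that precedes the SYY.EYY marker and joins them.
    public static string Title(string folder)
    {
        var parts = Regex.Matches(folder, @"\G(?!S[0-9]{2}\.E[0-9]{2})([^.]+)\.")
            .Cast<Match>()
            .Select(m => m.Groups[1].Value);
        return string.Join(" ", parts);
    }

    public static void Main()
    {
        Console.WriteLine(Title("XXXXXXXX.XXXXXXX.XXXXXX.S11.E22.SOMETHINGELSE"));
    }
}
```

Because \G ties each match to the end of the previous one, matching stops for good at the first part that looks like SYY.EYY.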
You can "replace/cut" the "." with C#.
The regex to get up until the SYY.EYY can be like this:
.SYY.EYY$
Line ends with word -> Regex: ExampleWord$
I would do something like:
// SYY.EYY is written out here as S, two digits, a dot, E, two digits.
var leftPart = Regex.Match(x, @"^.*?(?=S\d{2}\.E\d{2})").Value;
// leftPart now holds "XXXXXXXX.XXXXXXX.XXXXXX."
// The dots can then be replaced:
var left = leftPart.Replace(".", " "); // or any other character

Odd regexp behaviour - matches only first and last capture group

I am trying to write a regexp that matches a comma-separated list of words and captures all of the words. This line should be matched:    apple , banana ,orange,peanut  and the captures should be apple, banana, orange, peanut. To do that I use the following regexp:
^\s*([a-z_]\w*)(?:\s*,\s*([a-z_]\w*))*\s*$
It successfully matches the string but all of a sudden only apple and peanut are captured. This behaviour is seen in both C# and Perl. Thus I assume I am missing something about how regexp matching works. Any ideas? :)
The value given by match.Groups[2].Value is just the last value captured by the second group.
To find all the values, look at match.Groups[2].Captures[i].Value where in this case i ranges from 0 to 2. (As well as match.Groups[1].Value for the first group.)
(+1 for question, I learned something today!)
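The Captures approach described above might look like this in practice. It is a sketch using the asker's pattern; the names CaptureDemo and Words are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class CaptureDemo
{
    // Returns every word: group 1 first, then each capture recorded by group 2.
    public static List<string> Words(string line)
    {
        var words = new List<string>();
        Match match = Regex.Match(line, @"^\s*([a-z_]\w*)(?:\s*,\s*([a-z_]\w*))*\s*$");
        if (!match.Success)
            return words; // the line does not have the expected shape

        words.Add(match.Groups[1].Value);
        foreach (Capture c in match.Groups[2].Captures)
            words.Add(c.Value);
        return words;
    }

    public static void Main()
    {
        Console.WriteLine(string.Join("|", Words("   apple , banana ,orange,peanut  ")));
    }
}
```

Group 2's Value still holds only the last word, while its Captures collection keeps every word the repeated group matched, in order.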
Try this:
string text = " apple , banana ,orange,peanut";
var matches = Regex.Matches(text, @"\s*(?<word>\w+)\s*,?")
    .Cast<Match>()
    .Select(x => x.Groups["word"].Value)
    .ToList();
You are repeating your capturing group, and at every repetition the previously captured content is overwritten. So only the last match of your second capturing group is available at the end.
You can change your second capturing group to
^\s*([a-z_]\w*)((?:\s*,\s*(?:[a-z_]\w*))*)\s*$
Then the result would be " , banana ,orange,peanut" in your second group. I am not sure if you want this.
If you want to check that the string has that pattern and also extract each word, I would do it in two steps:
Check the pattern with your regex.
If the pattern is correct, remove leading and trailing whitespace and split on \s*,\s*.
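The two steps above could be sketched like this. The validation pattern is the asker's own; the names TwoStep and Parse are hypothetical, and returning an empty array for a line that fails validation is an assumption.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TwoStep
{
    // Step 1: the asker's pattern, used only to validate the overall shape.
    private static readonly Regex Shape =
        new Regex(@"^\s*([a-z_]\w*)(?:\s*,\s*([a-z_]\w*))*\s*$");

    public static string[] Parse(string line)
    {
        if (!Shape.IsMatch(line))
            return new string[0];

        // Step 2: trim the ends, then split on commas with optional surrounding whitespace.
        return Regex.Split(line.Trim(), @"\s*,\s*");
    }

    public static void Main()
    {
        Console.WriteLine(string.Join("|", Parse("   apple , banana ,orange,peanut  ")));
    }
}
```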
Simple regexp:
(?:^| *)(.+?)(?:,|$)
Explanation:
(?:    # Non-capturing group
^| *   # Match start of line or zero or more spaces
(.+?)  # Capture the word in the list, lazily
(?:    # Non-capturing group
,|$    # Match a comma or end of line
Note: Rubular is a nice website for testing this kind of thing.

C# Regular expressions, retrieving two words separated by a comma, parenthesis operator

I've been playing around with retrieving data from a string using regular expression, mostly as an exercise for myself. The pattern that I'm trying to match looks like this:
"(SomeWord,OtherWord)"
After reading some documentation and looking at a cheat sheet I came to the conclusion that the following regex should give me 2 matches:
"\((\w),(\w)\)"
Because according to the documentation the parenthesis should do the following:
(pattern) Matches pattern and remembers the match. The matched
substring can be retrieved from the resulting Matches collection,
using Item [0]...[n]. To match parentheses characters ( ), use "\(" or
"\)".
However using the following code (removed error checking for conciseness) matches quite something different:
string line = "(A,B)";
string pattern = @"\((\w),(\w)\)";
MatchCollection matches = Regex.Matches(line, pattern);
string left = matches[0].Value;
string right = matches[1].Value;
Now I would expect left to become "A" and right to become "B". However left becomes "(A,B)" and there is no second match at all. What am I missing here?
(I know this example is trivial to solve without regexes but to learn how to properly use regexes I should be able to make something simple as this work)
You want the Groups member of the first match. In your example case there is only 1 match, which is the whole string. In the Groups collection you will have 3 items. Try this sample code, left should be A, and right should be B. If you look at the group[0] value it will be the whole string.
string line = "(A,B)";
string pattern = @"\((\w),(\w)\)";
MatchCollection matches = Regex.Matches(line, pattern);
GroupCollection groups = matches[0].Groups;
string left = groups[1].Value;
string right = groups[2].Value;
\w matches only one word character. If words have to contain at least one character, the expression should be:
string pattern = @"\((\w+),(\w+)\)";
if words may be empty:
string pattern = @"\((\w*),(\w*)\)";
+ means one or more repetitions.
* means zero or more repetitions.
In any case, you will get one match with three groups, the first containing the whole string including the left and right parentheses, the two others the two words.
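To see the effect of the quantifier, the three variants can be tried on a couple of inputs. This is a small sketch; the names QuantifierDemo and Words are invented for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class QuantifierDemo
{
    // Returns the two captured words, or null when the pattern does not match.
    public static string[] Words(string line, string pattern)
    {
        Match m = Regex.Match(line, pattern);
        return m.Success ? new[] { m.Groups[1].Value, m.Groups[2].Value } : null;
    }

    public static void Main()
    {
        string line = "(SomeWord,OtherWord)";

        // A bare \w allows exactly one character per word, so this does not match.
        Console.WriteLine(Words(line, @"\((\w),(\w)\)") == null);

        // \w+ accepts words of one or more characters.
        Console.WriteLine(string.Join(" / ", Words(line, @"\((\w+),(\w+)\)")));

        // \w* additionally accepts an empty word.
        Console.WriteLine(string.Join(" / ", Words("(,B)", @"\((\w*),(\w*)\)")));
    }
}
```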
I think the problem is that you're confusing the concept of a match and a group.
A MatchCollection contains one Match for each substring that matched your entire regex, not just the parenthetical groups inside that regex. For example, if the string you searched looked like this...
(A,B)(C,D)
...then you would have two matches: (A,B) and (C,D).
However, there's good news: you can get the groups from each match very easily, like so:
string line = "(A,B)";
string pattern = @"\((\w),(\w)\)";
MatchCollection matches = Regex.Matches(line, pattern);
string left = matches[0].Groups[1].Value;
string right = matches[0].Groups[2].Value;
That Groups variable is a collection of parenthetical groups from a single match.
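The difference between matches and groups shows up clearly on the (A,B)(C,D) string mentioned earlier. The sketch below walks every match and reads its two groups; the names MatchVsGroup and Pairs are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class MatchVsGroup
{
    // One entry per match; each entry is "left,right" built from groups 1 and 2.
    public static List<string> Pairs(string text)
    {
        var pairs = new List<string>();
        foreach (Match m in Regex.Matches(text, @"\((\w+),(\w+)\)"))
        {
            // Groups[0] is the whole match; Groups[1] and Groups[2] are the words.
            pairs.Add(m.Groups[1].Value + "," + m.Groups[2].Value);
        }
        return pairs;
    }

    public static void Main()
    {
        foreach (string pair in Pairs("(A,B)(C,D)"))
            Console.WriteLine(pair);
    }
}
```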
Edit:
Olivier Jacot-Descombes made a very good point: we all got so hung up explaining match vs. group that we forgot to notice a second problem: \w will only match a SINGLE character. You need to add a quantifier (such as +) in order to grab more than one character at a time. Olivier's answer should explain that part clearly.
First off, it's one "match", with 2 "groups"...
I would recommend you name the groups anyway...
string pattern = @"\((?<FirstWord>\w+),(?<SecondWord>\w+)\)";
Then you could do...
Match m = Regex.Match(line, pattern);
string firstWord = m.Groups["FirstWord"].Value;
Since all you are looking for are the characters separated by a comma, you can simply use \w as your pattern. The matches will be A and B.
A handy site for testing your Regex is http://gskinner.com/RegExr/

Regex to restrict only second occurrence of open and close brackets using C#

I have a string "(BETA) (Feb 27, 2011)"
I need to get the second occurrence of the open and close brackets using C#
It is probably easiest to match all (...) tokens and take the second:
MatchCollection matches = Regex.Matches(str, @"\(([^)]*)\)");
Getting the second match:
String second = matches[1].Groups[1].Value;
The regex assumes valid pairs of parentheses, and no nesting. It is pretty basic:
\( - Opening.
(...) - Capturing group, to easily extract the value.
[^)]* - Content of the group - characters that are not ).
\) - Closing.
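Indexing matches[1] throws if the input has fewer than two parenthesized tokens, so a guarded version may be safer. This is a sketch: the names SecondGroup and SecondParenthesized are made up, and returning null for "not found" is an assumption.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SecondGroup
{
    // Returns the content of the second (...) token, or null if there is none.
    public static string SecondParenthesized(string text)
    {
        MatchCollection matches = Regex.Matches(text, @"\(([^)]*)\)");
        return matches.Count >= 2 ? matches[1].Groups[1].Value : null;
    }

    public static void Main()
    {
        Console.WriteLine(SecondParenthesized("(BETA) (Feb 27, 2011)"));
    }
}
```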
Do you want it in regex? If not regex:
int n = text.IndexOf("(");
if (n >= 0) {
    n = text.IndexOf("(", n + 1);
}
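The snippet above only locates the second opening bracket. A fuller non-regex sketch that also finds the closing bracket and extracts the text could look like this; the names IndexOfDemo and SecondParenthesized are hypothetical, and like the regex version it assumes no nested parentheses.

```csharp
using System;

public static class IndexOfDemo
{
    // Returns the text inside the second (...) pair, or null if it cannot be found.
    public static string SecondParenthesized(string text)
    {
        int first = text.IndexOf('(');
        if (first < 0) return null;

        int second = text.IndexOf('(', first + 1);
        if (second < 0) return null;

        int close = text.IndexOf(')', second + 1);
        if (close < 0) return null;

        // Take everything between the second "(" and its ")".
        return text.Substring(second + 1, close - second - 1);
    }

    public static void Main()
    {
        Console.WriteLine(SecondParenthesized("(BETA) (Feb 27, 2011)"));
    }
}
```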
Regex:
\(.+?\)\s*(\(.+?\))
Notice the "?" after each quantifier, which forces lazy (non-greedy) matching. The pattern also requires at least one character within each pair of parentheses.
