Using Regular Expressions to extract groups of numbers from a string - c#

I need to convert a string like,
"[1,2,3,4][5,6,7,8]"
into groups of integers, adjusted to be zero based rather than one based:
{0,1,2,3} {4,5,6,7}
The following rules also apply:
The string must contain at least 1 group of numbers with enclosing square brackets.
Each group must contain at least 2 numbers.
Every number must be unique (not something I'm attempting to achieve with the regex).
0 is not valid, but 10, 100 etc are.
Since I'm not that experienced with regular expressions, I'm currently using two:
@"^(?:\[(?:[1-9]+[\d]*,)+(?:[1-9]+[\d]*){1}\])+$";
and
@"\[(?:[1-9]+[\d]*,)+(?:[1-9]+[\d]*){1}\]";
I'm using the first one to check the input and the second to get all matches of a set of numbers inside square brackets.
I'm then using .NET string manipulation to trim off the square brackets and extract the numbers, parsing them and subtracting 1 to get the result I need.
I was wondering if I could get at the numbers better by using captures, but I'm not sure how they work.
Final Solution:
In the end I used the following regular expression to validate the input string
@"^(?<set>\[(?:[1-9]\d{0,7}(?:]|,(?=\d))){2,})+$"
agent-j's pattern is fine for capturing the information needed but also matches a string like "[1,2,3,4][5]" and would require me to do some additional filtering of the results.
I access the captures via the named group 'set' and use a second simple regex to extract the numbers.
The '[1-9]\d{0,7}' simplifies parsing ints by limiting numbers to 99,999,999 and avoiding overflow exceptions.
MatchCollection matches = new Regex(@"^(?<set>\[(?:[1-9]\d{0,7}(?:]|,(?=\d))){2,})+$").Matches(inputText);
if (matches.Count != 1) return;
CaptureCollection captures = matches[0].Groups["set"].Captures;
var resultJArray = new int[captures.Count][];
var numbersRegex = new Regex(@"\d+");
for (int captureIndex = 0; captureIndex < captures.Count; captureIndex++)
{
    string capture = captures[captureIndex].Value;
    MatchCollection numberMatches = numbersRegex.Matches(capture);
    resultJArray[captureIndex] = new int[numberMatches.Count];
    for (int numberMatchIndex = 0; numberMatchIndex < numberMatches.Count; numberMatchIndex++)
    {
        string number = numberMatches[numberMatchIndex].Value;
        int numberAdjustedToZeroBase = Int32.Parse(number) - 1;
        resultJArray[captureIndex][numberMatchIndex] = numberAdjustedToZeroBase;
    }
}
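The uniqueness rule is left outside the regex in the post above. A minimal sketch of how that check might look once the numbers are parsed, assuming "unique" means unique across all groups (the post does not say whether it applies per group instead); `AllUnique` is a hypothetical helper name:

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: true when no number appears twice across all groups.
static bool AllUnique(int[][] groups)
{
    var seen = new HashSet<int>();
    foreach (int[] row in groups)
        foreach (int number in row)
            if (!seen.Add(number))   // Add returns false for a duplicate
                return false;
    return true;
}

Console.WriteLine(AllUnique(new[] { new[] { 0, 1 }, new[] { 2, 3 } })); // True
Console.WriteLine(AllUnique(new[] { new[] { 0, 1 }, new[] { 1, 2 } })); // False
```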

string input = "[1,2,3,4][5,6,7,8][534,63433,73434,8343434]";
string pattern = @"\G(?:\[(?:(\d+)(?:,|(?=\]))){2,}\])";
MatchCollection matches = Regex.Matches (input, pattern);
To start out, any (regex) in plain parentheses is a capturing group. This means that the regex engine will capture it (store the positions matched by that group). To avoid this when you don't need it, use (?:regex). I did that above.
Index 0 of the Groups collection is special: it means the whole match. I.e. match.Groups[0].Value is always the same as match.Value and match.Groups[0].Captures[0].Value. So you can consider the Groups collection to start at index 1. The Captures collection of a group, however, starts at index 0.
As you can see below, each match contains a bracketed digit group. You'll want to use captures 0 to n-1 from Group 1 of each match.
foreach (Match match in matches)
{
    // [1,2]
    // use every capture from the first group.
    for (int i = 0; i < match.Groups[1].Captures.Count; i++)
    {
        int number = int.Parse(match.Groups[1].Captures[i].Value);
        if (number == 0)
            throw new Exception("Cannot be 0.");
    }
}
Match[0] => [1,2,3,4]
  Group[0] => [1,2,3,4]
    Capture[0] => [1,2,3,4]
  Group[1] => 4
    Capture[0] => 1
    Capture[1] => 2
    Capture[2] => 3
    Capture[3] => 4
Match[1] => [5,6,7,8]
  Group[0] => [5,6,7,8]
    Capture[0] => [5,6,7,8]
  Group[1] => 8
    Capture[0] => 5
    Capture[1] => 6
    Capture[2] => 7
    Capture[3] => 8
Match[2] => [534,63433,73434,8343434]
  Group[0] => [534,63433,73434,8343434]
    Capture[0] => [534,63433,73434,8343434]
  Group[1] => 8343434
    Capture[0] => 534
    Capture[1] => 63433
    Capture[2] => 73434
    Capture[3] => 8343434
The \G anchors each match to the position where the previous match ended (or to the start of the string for the first match), so for "[1,2] [3,4]" only the first group is matched. The {2,} satisfies your requirement that there be at least 2 numbers per group.
The expression will match even if there is a 0. I suggest that you put that validation in with the other non-regex stuff. It will keep the regex simpler.
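Putting the pieces together, here is a minimal sketch of how the \G-anchored pattern and the Captures collection might be combined to produce the zero-based groups from the question. `TryParseGroups` is a hypothetical helper name, and the pattern is a variation that borrows `[1-9]\d{0,7}` from the question's final solution so that 0 is rejected by the regex itself. Treat it as one possible arrangement, not the answer's exact code.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the answer above.
static bool TryParseGroups(string input, out List<int[]> groups)
{
    groups = new List<int[]>();
    // One match per bracketed group; group 1 captures each number in it.
    var pattern = new Regex(@"\G\[(?:([1-9]\d{0,7})(?:,(?=\d)|(?=\]))){2,}\]");
    int consumed = 0;

    foreach (Match match in pattern.Matches(input))
    {
        CaptureCollection captures = match.Groups[1].Captures;
        var numbers = new int[captures.Count];
        for (int i = 0; i < captures.Count; i++)      // captures are 0-indexed
            numbers[i] = int.Parse(captures[i].Value) - 1;
        groups.Add(numbers);
        consumed = match.Index + match.Length;
    }

    // \G stops matching at the first gap, so leftovers mean invalid input.
    return groups.Count > 0 && consumed == input.Length;
}

if (TryParseGroups("[1,2,3,4][5,6,7,8]", out var parsed))
    foreach (int[] row in parsed)
        Console.WriteLine(string.Join(",", row));     // 0,1,2,3 then 4,5,6,7
```

Checking that the matches cover the whole input replaces the separate anchored validation regex, at the cost of a slightly longer helper.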

The following regex will validate the input and also yield match groups for each bracketed [] group and, inside that, each number:
(?:([1-9][0-9]*)\,?){2,}
[1][5] - fail
[1] - fail
[] - fail
[a,b,c][5] - fail
[1,2,3,4] - pass
[1,2,3,4,5,6,7,8][5,6,7,8] - pass
[1,2,3,4][5,6,7,8][534,63433,73434,8343434] - pass

What about \d+ and a global flag?

Related

LINQ conditional selection and formatting

I have a string of characters and I'm trying to set up a query that'll substitute a specific sequence of similar characters into a character count. Here's an example of what I'm trying to do:
agd69dnbd555bdggjykbcx555555bbb
In this case, I'm trying to isolate and count ONLY the occurrences of the number 5, so my output should read:
agd69dnbd3bdggjykbcx6bbb
My current code is the following, where GroupAdjacentBy is a function that groups and counts the character occurrences as above.
var res = text
.GroupAdjacentBy((l, r) => l == r)
.Select(x => new { n = x.First(), c = x.Count()})
.ToArray();
The problem is that the above function groups and counts EVERY SINGLE character in my string, not just the one character I'm after. Is there a way to conditionally perform that operation on ONLY the character I need counted?
Regex is a better tool for this job than LINQ.
Have a look at this:
string input = "agd69dnbd555bdggjykbcx555555bbb";
string pattern = @"5+"; // Match 1 or more adjacent 5 characters
string output = Regex.Replace(input, pattern, match => match.Length.ToString());
// output = agd69dnbd3bdggjykbcx6bbb
Not sure if you're intending to replace every 5 character, or just when there is more than one adjacent 5.
If it's the latter, just change your pattern to:
string pattern = @"5{2,}"; // Match 2 or more adjacent 5's
The best answer was already given by Johnathan Barclay. But in case you need something similar using LINQ, here is an alternative solution:
var charToCombine = '5';
var res = text
.GroupAdjacentBy((l, r) => l == r )
.SelectMany(x => x.Count() > 1 && x.First() == charToCombine ? x.Count().ToString() : x)
.ToArray();

Retrieve different groups of values in a regex

I have this following string :
((1+2)*(4+3))
I would like to get the parenthesized values separately through a Regex. These values must end up in an array, such as a string array.
For example :
Group 1 : ((1+2)*(4+3))
Group 2 : (1+2)
Group 3 : (4+3)
I have tried this Regex :
(?<content>\(.+\))
But it doesn't work, because it only returns group 1.
Do you have a solution that would let me handle this recursively?
You may get all overlapping substrings starting with ( and ending with ) and having any amount of balanced nested parentheses inside using
var result = Regex.Matches(s, @"(?=(\((?>[^()]+|(?<o>)\(|(?<-o>)\))*(?(o)(?!)|)\)))").Cast<Match>().Select(x => x.Groups[1].Value);
See the regex demo online.
Regex details
The regex is a positive lookahead ((?=...)) that checks each position within a string and finds a match if its pattern matches. Since the pattern is enclosed with a capturing group ((...)) the value is stored in match.Groups[1] that you may retrieve once the match is found. \((?>[^()]+|(?<o>)\(|(?<-o>)\))*(?(o)(?!)|)\) is a known pattern that matches nested balanced parentheses.
C# demo:
var str = "((1+2)*(4+3))";
var pattern = @"(?=(\((?>[^()]+|(?<o>)\(|(?<-o>)\))*(?(o)(?!)|)\)))";
var result = Regex.Matches(str, pattern)
.Cast<Match>()
.Select(x => x.Groups[1].Value);
Console.WriteLine(string.Join("\n", result));
Output:
((1+2)*(4+3))
(1+2)
(4+3)

RegEx string between N and (N+1)th Occurrence

I am attempting to find nth occurrence of sub string between two special characters. For example.
one|two|three|four|five
Say I am looking for the string between the nth and (n+1)th occurrences of the '|' character, e.g. the 2nd and 3rd, which turns out to be 'three'. I want to do it using RegEx. Could someone guide me?
My Current Attempt is as follows.
string subtext = "zero|one|two|three|four";
Regex r = new Regex(@"(?:([^|]*)|){3}");
var m = r.Match(subtext).Value;
If you have full access to C# code, you should consider a mere splitting approach:
var idx = 2; // Might be user-defined
var subtext = "zero|one|two|three|four";
var result = subtext.Split('|').ElementAtOrDefault(idx);
Console.WriteLine(result);
// => two
A regex can be used if you have no access to code (if you use some tool that is powered with .NET regex):
^(?:[^|]*\|){2}([^|]*)
See the regex demo. It matches
^ - start of string
(?:[^|]*\|){2} - exactly 2 (adjust the number as you need) sequences of:
[^|]* - zero or more chars other than |
\| - a | symbol
([^|]*) - Group 1 (access via .Groups[1]): zero or more chars other than |
C# code to test:
var pat = $@"^(?:[^|]*\|){{{idx}}}([^|]*)";
var m = Regex.Match(subtext, pat);
if (m.Success) {
Console.WriteLine(m.Groups[1].Value);
}
// => two
See the C# demo
If a tool does not let you access captured groups, turn the initial part into a non-consuming lookbehind pattern:
(?<=^(?:[^|]*\|){2})[^|]*
^^^^^^^^^^^^^^^^^^^^
See this regex demo. The (?<=...) positive lookbehind only checks for a pattern presence immediately to the left of the current location, and if the pattern is not matched, the match will fail.
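A minimal C# sketch of the lookbehind variant, for cases where only the whole match value is available. `FieldAfter` is a hypothetical helper name, and returning an empty string when there are too few pipes is an assumption made for brevity:

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: returns the field that follows `skip` pipes,
// or an empty string when the text has fewer pipes than that.
static string FieldAfter(string text, int skip)
{
    // The lookbehind asserts "start of string plus `skip` pipe-terminated
    // fields" without consuming them, so Match.Value is only the wanted field.
    var pattern = $@"(?<=^(?:[^|]*\|){{{skip}}})[^|]*";
    Match m = Regex.Match(text, pattern);
    return m.Success ? m.Value : string.Empty;
}

Console.WriteLine(FieldAfter("zero|one|two|three|four", 2)); // two
```

Note that an empty result is ambiguous here: it can mean "no such field" or "the field is empty", so a caller that cares about the difference would need to inspect `Match.Success` directly.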
Use this:
(?:.*?\|){n}(.[^|]*)
where n is the number of times you need to skip your special character. The first capturing group will contain the result.
Demo for n = 2
Use this regex and then select the n-th match (in this case 2) from the Matches collection:
string subtext = "zero|one|two|three|four";
Regex r = new Regex(@"(?<=\|)[^\|]*");
var m = r.Matches(subtext)[2];

Regexp find position of different characters in string

I have a string conforming to the following pattern:
(cc)-(nr).(nr)M(nr)(cc)whitespace(nr)
where cc is an arbitrary number of letter characters, nr is an arbitrary number of numerical characters, and M is the actual letter M.
For example:
ASF-1.15M437979CA 100000
EU-12.15M121515PO 1145
I need to find the positions of -, . and M within the string. The problem is, the leading characters and the ending characters can contain the letter M as well, but I need only the one in the middle.
As an alternative, the subtraction of the first characters (until -) and the first two numbers (as in (nr).(nr)M...) would be enough.
If you need a regex-based solution, you just need to use 3 capturing groups around the required patterns, and then access the Groups[n].Index property:
var rxt = new Regex(@"\p{L}*(-)\d+(\.)\d+(M)\d+\p{L}*\s*\d+");
// Collect matches
var matches = rxt.Matches(@"ASF-1.15M437979CA 100000 or EU-12.15M121515PO 1145");
// Now, we can get the indices
var posOfHyphen = matches.Cast<Match>().Select(p => p.Groups[1].Index);
var posOfDot = matches.Cast<Match>().Select(p => p.Groups[2].Index);
var posOfM = matches.Cast<Match>().Select(p => p.Groups[3].Index);
Output:
posOfHyphen => [3, 30]
posOfDot => [5, 33]
posOfM => [8, 36]
Regex:
string pattern = @"[A-Z]+(-)\d+(\.)\d+(M)\d+[A-Z]+";
string value = "ASF-1.15M437979CA 100000 or EU-12.15M121515PO 1145";
var match = Regex.Match(value, pattern);
if (match.Success)
{
int sep1 = match.Groups[1].Index;
int sep2 = match.Groups[2].Index;
int sep3 = match.Groups[3].Index;
}

How can I get a regex match to only be added once to the matches collection?

I have a string which has several html comments in it. I need to count the unique matches of an expression.
For example, the string might be:
var teststring = "<!--X1-->Hi<!--X1-->there<!--X2-->";
I currently use this to get the matches:
var regex = new Regex("<!--X.-->");
var matches = regex.Matches(teststring);
The results of this is 3 matches. However, I would like to have this be only 2 matches since there are only two unique matches.
I know I can probably loop through the resulting MatchCollection and remove the extra Match, but I'm hoping there is a more elegant solution.
Clarification: The sample string is greatly simplified from what is actually being used. There can easily be an X8 or X9, and there are likely dozens of each in the string.
I would just use the Enumerable.Distinct Method for example like this:
string subjectString = "<!--X1-->Hi<!--X1-->there<!--X2--><!--X1-->Hi<!--X1-->there<!--X2-->";
var regex = new Regex(@"<!--X\d-->");
var matches = regex.Matches(subjectString);
var uniqueMatches = matches
.OfType<Match>()
.Select(m => m.Value)
.Distinct();
uniqueMatches.ToList().ForEach(Console.WriteLine);
Outputs this:
<!--X1-->
<!--X2-->
For regular expression, you could maybe use this one?
(<!--X\d-->)(?!.*\1.*)
Seems to work on your test string in RegexBuddy at least =)
// (<!--X\d-->)(?!.*\1.*)
//
// Options: dot matches newline
//
// Match the regular expression below and capture its match into backreference number 1 «(<!--X\d-->)»
// Match the characters “<!--X” literally «<!--X»
// Match a single digit 0..9 «\d»
// Match the characters “-->” literally «-->»
// Assert that it is impossible to match the regex below starting at this position (negative lookahead) «(?!.*\1.*)»
// Match any single character «.*»
// Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
// Match the same text as most recently matched by capturing group number 1 «\1»
// Match any single character «.*»
// Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
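For reference, a minimal sketch of how this lookahead pattern might be used from C#. It assumes the RegexBuddy option "dot matches newline" maps to `RegexOptions.Singleline`, and it drops the trailing `.*` inside the lookahead, which does not change what is matched:

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

var sample = "<!--X1-->Hi<!--X1-->there<!--X2-->";

// Singleline makes '.' match newlines too ("dot matches newline" above).
var lastOccurrences = new Regex(@"(<!--X\d-->)(?!.*\1)", RegexOptions.Singleline);

// A comment matches only if the same text does not occur again later,
// so each distinct comment is matched exactly once (at its last occurrence).
string[] unique = lastOccurrences.Matches(sample)
    .Cast<Match>()
    .Select(m => m.Value)
    .ToArray();

Console.WriteLine(unique.Length);             // 2
Console.WriteLine(string.Join(" ", unique));  // <!--X1--> <!--X2-->
```

Because the lookahead rescans the rest of the string for every candidate, this is likely to be slower than `Distinct()` on long inputs with many comments.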
It appears you're doing two different things:
Matching comments like /<!--X.-->/
Finding the set of unique comments
So it is fairly logical to handle these as two different steps:
var regex = new Regex("<!--X.-->");
var matches = regex.Matches(teststring);
var uniqueMatches = matches.Cast<Match>().Distinct(new MatchComparer());
class MatchComparer : IEqualityComparer<Match>
{
    public bool Equals(Match a, Match b)
    {
        return a.Value == b.Value;
    }
    public int GetHashCode(Match match)
    {
        return match.Value.GetHashCode();
    }
}
Extract the comments and store them in an array. Then you can filter out the unique values.
But I don’t know how to implement this in C#.
Depending on how many Xn's you have you might be able to use:
(\<!--X1--\>){1}.*(\<!--X2--\>){1}
That will only match each occurrence of the X1, X2 etc. once provided they are in order.
Capture the inner portion of the comment as a group. Then put those strings into a hashtable(dictionary). Then ask the dictionary for its count, since it will self weed out repeats.
var teststring = "<!--X1-->Hi<!--X1-->there<!--X2-->";
var tokens = new Dictionary<string, string>();
// (.*?) is lazy, so each comment is matched separately
Regex.Replace(teststring, @"<!--(.*?)-->",
    match => {
        tokens[match.Groups[1].Value] = match.Groups[1].Value;
        return "";
    });
var uniques = tokens.Keys.Count;
By using the Regex.Replace construct you get to have a lambda called on each match. Since you are not interested in the replace, you don't set it equal to anything.
You must use Groups[1] because Groups[0] is the entire match.
I'm only repeating the same thing on both sides so that it's easier to put into the dictionary, which only stores unique keys.
If you want a distinct Match list from a MatchCollection without converting to string, you can use something like this:
var distinctMatches = matchList.OfType<Match>().GroupBy(x => x.Value).Select(x =>x.First()).ToList();
I know it has been 12 years but sometimes we need this kind of solutions, so I wanted to share. C# evolved, .NET evolved, so it's easier now.
