Find multiple matching groups in a specific substring - C#

I would like to capture the bold values in the string below that starts with the word "need", while the values in the other strings, which start with "skip" and "ignore", must be ignored. I tried the pattern
need.+?(:"(?'index'\w+)"[,}])
but it found only the first (emphasised) value. How can I get the needed result using RegEx only?
"skip" : {"A":"ABCD123","B":"ABCD1234","C":"ABCD1235"}
"need" : {"A":"ZABCD123","B":"ZABCD1234","C":"ZABCD1235"}
"ignore" : {"A":"SABCD123","B":"SABCD1234","C":"SABCD1235"}

We are going to find need and collect what we find into named match groups and their captures. There will be two groups: one named Index, which holds A | B | C, and one named Data, which holds the matching values.
The match will hold our data in the Captures collection of each group, and from there we will join them into a dictionary.
Here is the code to do that magic:
string data =
@"""skip"" : {""A"":""ABCD123"",""B"":""ABCD1234"",""C"":""ABCD1235""}
""need"" : {""A"":""ZABCD123"",""B"":""ZABCD1234"",""C"":""ZABCD1235""}
""ignore"" : {""A"":""SABCD123"",""B"":""SABCD1234"",""C"":""SABCD1235""}";
string pattern = @"
\x22need\x22\s*:\s*{       # Find need
(                          # Beginning of Captures
   \x22                    # Quote is \x22
   (?<Index>[^\x22]+)      # A into index.
   \x22\:\x22              # ':'
   (?<Data>[^\x22]+)       # 'Z...' Data
   \x22,?                  # ',(maybe)
)+                         # End of 1 to many Captures";
var mt = Regex.Match(data,
                     pattern,
                     RegexOptions.IgnorePatternWhitespace | RegexOptions.ExplicitCapture);
// Get the data capture into a List<string>.
var captureData = mt.Groups["Data"].Captures.OfType<Capture>()
                    .Select(c => c.Value).ToList();
// Join the index capture data and project it into a dictionary.
var asDictionary = mt.Groups["Index"]
                     .Captures.OfType<Capture>()
                     .Select((cp, iIndex) => new KeyValuePair<string,string>
                                             (cp.Value, captureData[iIndex]) )
                     .ToDictionary(kvp => kvp.Key, kvp => kvp.Value );

If the number of fields is fixed, you can write it like this:
^"need"\s*:\s*{"A":"(\w+)","B":"(\w+)","C":"(\w+)"}
Demo
If the tags came after the values, like this:
{"A":"ABCD123","B":"ABCD1234","C":"ABCD1235"} : "skip"
{"A":"ZABCD123","B":"ZABCD1234","C":"ZABCD1235"} : "need"
{"A":"SABCD123","B":"SABCD1234","C":"SABCD1235"} : "ignore"
then you could use an unbounded positive lookahead:
"\w+?":"(\w+?)"(?=.*"need")
Demo
However, unbounded positive lookbehinds are prohibited in PCRE (the * and + quantifiers cannot be used inside a lookbehind), so this is not very useful in your situation.
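Since the question is tagged C#, it may be worth noting that the .NET regex engine, unlike PCRE, does accept unbounded quantifiers inside a lookbehind. The following is a minimal sketch, not a definitive solution: it assumes the three entries are separated by newlines, and it reuses the group name index from the question's pattern.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Sample input from the question, one entry per line (newline separators are an assumption).
string data =
    "\"skip\" : {\"A\":\"ABCD123\",\"B\":\"ABCD1234\",\"C\":\"ABCD1235\"}\n" +
    "\"need\" : {\"A\":\"ZABCD123\",\"B\":\"ZABCD1234\",\"C\":\"ZABCD1235\"}\n" +
    "\"ignore\" : {\"A\":\"SABCD123\",\"B\":\"SABCD1234\",\"C\":\"SABCD1235\"}";

// The lookbehind requires "need" earlier on the same line,
// because '.' does not match '\n' without RegexOptions.Singleline.
string pattern = @"(?<=""need"".*):""(?<index>\w+)""[,}]";

var values = Regex.Matches(data, pattern)
    .Cast<Match>()
    .Select(m => m.Groups["index"].Value)
    .ToList();

Console.WriteLine(string.Join(", ", values));
```

Each value becomes its own match here, so Regex.Matches plays the role of the 'match_all' function and no Captures collection is needed.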

You can't capture a dynamically set number of groups, so I'd run something like this regex
"need".*{.*,?".*?":(".+?").*}
[Demo]
with a 'match_all' function, or use Agnius' suggestion

Related

Negative lookahead in Regex to exclude two words

I have the following regex:
(?!SELECT|FROM|WHERE|AND|OR|AS|[0-9])(?<= |^|\()([a-zA-Z0-9_]+)
that I'm matching against a string like this:
SELECT Static AS My_alias FROM Table WHERE Id = 400 AND Name = 'Something';
This already does 90% of what I want. What I'd also like to do is exclude AS My_alias, where the alias can be any word.
I tried to add this to my regex, but this didn't work:
(?!SELECT|FROM|WHERE|AND|OR|AS [a-zA-Z0-9_]+|[0-9])(?<= |^|\()([a-zA-Z0-9_]+)
                            ^^^^^^^^^^^^^^^^
                            this is the new part
How can I exclude this part of the string using my regex?
Demo of the regex can be found here
This excludes the AS and gets the tokens you seek. It also handles multiple select values, along with zero to many WHERE conditions.
The idea is to use named explicit captures and tell the regex engine to disregard any non-named capture groups (a "match but don't capture" feature).
We will also put all the wanted "tokens" into a single named group, (?<Token> ... ), so that every token ends up in that group's captures.
var data = "SELECT Static AS My_alias FROM Table WHERE Id = 400 AND Name = 'Something';";
var pattern = @"
^
SELECT\s+
(
    (?<Token>[^\s]+)
    (\sAS\s[^\s]+)?
    [\s,]+
)+                      # One to many statements
FROM\s+
(?<Token>[^\s]+)        # Table name
(
    \s+WHERE\s+
    (
        (?<Token>[^\s]+)
        (.+?AND\s+)?
    )+                  # One to many conditions
)?                      # Optional Where
";
var tokens =
    Regex.Matches(data, pattern,
                  RegexOptions.IgnorePatternWhitespace // Lets us space out/comment pattern
                  | RegexOptions.ExplicitCapture)      // Only consume named groups.
         .OfType<Match>()
         .SelectMany(mt => mt.Groups["Token"].Captures // Get the captures inserted into `Token`
                             .OfType<Capture>()
                             .Select(cp => cp.Value.ToString()))
    ;
tokens contains these strings: { "Static", "Table", "Id", "Name" }
This should cover most of the cases you will encounter. Use similar logic if you need to process selects with joins; regardless, this is a good base to work from going forward.
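The two ideas this answer relies on, reusing one group name in several places and RegexOptions.ExplicitCapture, can be seen in isolation in a smaller sketch. The input string and the comma-separated format are assumptions made only for illustration.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Two groups share the name Token; the unnamed parentheses only group.
var listPattern = @"^(?<Token>\w+)(\s*,\s*(?<Token>\w+))*$";
var listMatch = Regex.Match("Id, Name, Price", listPattern, RegexOptions.ExplicitCapture);

// Every value captured by either Token group lands in the same Captures collection.
var names = listMatch.Groups["Token"].Captures
    .Cast<Capture>()
    .Select(c => c.Value)
    .ToList();

Console.WriteLine(string.Join(" | ", names));
// With ExplicitCapture only group 0 and Token exist.
Console.WriteLine(listMatch.Groups.Count);
```

Without ExplicitCapture the unnamed parentheses would become a numbered group of their own, which is harmless here but adds noise when a pattern has many grouping-only parentheses.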

Retrieve different groups of values in a regex

I have the following string:
((1+2)*(4+3))
I would like to get the values enclosed in parentheses separately through a Regex. These values must end up in an array, such as a string array.
For example:
Group 1 : ((1+2)*(4+3))
Group 2 : (1+2)
Group 3 : (4+3)
I have tried this Regex :
(?<content>\(.+\))
But it doesn't work, because it only returns group 1.
Do you have a solution that would let me handle this recursively?
You may get all overlapping substrings starting with ( and ending with ) and having any amount of balanced nested parentheses inside using
var result = Regex.Matches(s, @"(?=(\((?>[^()]+|(?<o>)\(|(?<-o>)\))*(?(o)(?!)|)\)))").Cast<Match>().Select(x => x.Groups[1].Value);
See the regex demo online.
Regex details
The regex is a positive lookahead ((?=...)) that checks each position within a string and finds a match if its pattern matches. Since the pattern is enclosed with a capturing group ((...)) the value is stored in match.Groups[1] that you may retrieve once the match is found. \((?>[^()]+|(?<o>)\(|(?<-o>)\))*(?(o)(?!)|)\) is a known pattern that matches nested balanced parentheses.
C# demo:
var str = "((1+2)*(4+3))";
var pattern = @"(?=(\((?>[^()]+|(?<o>)\(|(?<-o>)\))*(?(o)(?!)|)\)))";
var result = Regex.Matches(str, pattern)
    .Cast<Match>()
    .Select(x => x.Groups[1].Value);
Console.WriteLine(string.Join("\n", result));
Output:
((1+2)*(4+3))
(1+2)
(4+3)
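To see how the balancing part of that pattern works on its own, here is a minimal sketch that drops the lookahead and the outer parentheses and only checks whether a whole string is balanced. The variable name balanced and the test inputs are illustrative, not part of the answer.

```csharp
using System;
using System.Text.RegularExpressions;

// (?<o>) pushes an empty capture onto the stack named o before each '('.
// (?<-o>) pops one capture before each ')' and fails if the stack is empty.
// (?(o)(?!)|) fails the match if anything is still on the stack at the end.
var balanced = new Regex(@"^(?>[^()]+|(?<o>)\(|(?<-o>)\))*(?(o)(?!)|)$");

foreach (var s in new[] { "((1+2)*(4+3))", "(1+2", ")(" })
    Console.WriteLine($"{s}: {balanced.IsMatch(s)}");
```

Only the first input is reported as balanced; the second leaves an unmatched '(' on the stack, and the third tries to pop from an empty stack.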

C# check if characters occur in a fixed order in a string

I need to check if a user input resembles a parameter or not. It comes as a string (not changeable) and has to look like the following examples:
p123[2] -> writable array index
r23[12] -> read only array index
p3[7].5 -> writable bit in word
r1263[13].24 -> read only bit in word
15 -> simple value
The user is allowed to input any of them and my function has to distinguish them in order to call the proper function.
An idea would be to check for characters in a specific order e.g. "p[]", "r[]", "p[]." etc.
But I am not sure how to achieve that without checking each single character and using multiple cases...
Any other idea of how to make sure that the user input is correct is also welcome.
If you just need to validate user input that should come in 1 of the 5 provided formats, use a regex check:
Regex.IsMatch(str, @"^(?:(?<p>[pr]\d+)(?:\[(?<idx>\d+)])?(?:\.(?<inword>\d+))?|(?<simpleval>\d+))$")
See the regex demo
Description:
^ - start of string
(?: - start of the alternation group
(?<p>[pr]\d+) - Group "p" capturing p or r and 1 or more digits after
(?:\[(?<idx>\d+)])? - an optional sequence of [, 1 or more digits (captured into Group "idx") and then ]
(?:\.(?<inword>\d+))? - an optional sequence of a literal ., then 1 or more digits captured into Group "inword"
| - or (then comes the second alternative)
(?<simpleval>\d+) - Group "simpleval" capturing 1 or more digits
) - end of the outer grouping
$ - end of string.
If the p or r can be any ASCII letters, use [a-zA-Z] instead of [pr].
C# demo:
var strs = new List<string> { "p123[2]","r23[12]","p3[7].5","r1263[13].24","15"};
var pattern = @"^(?:(?<p>[pr]\d+)(?:\[(?<idx>\d+)])?(?:\.(?<inword>\d+))?|(?<simpleval>\d+))$";
foreach (var s in strs)
    Console.WriteLine("{0}: {1}", s, Regex.IsMatch(s, pattern));
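Because the question also asks how to distinguish the formats in order to call the proper function, the named groups of the pattern above can be inspected after a match. This is only a sketch: Classify and the labels it returns are hypothetical names, and "unlisted form" covers inputs such as p123 or p3.5 that the pattern accepts even though the question does not list them.

```csharp
using System;
using System.Text.RegularExpressions;

var paramPattern = @"^(?:(?<p>[pr]\d+)(?:\[(?<idx>\d+)])?(?:\.(?<inword>\d+))?|(?<simpleval>\d+))$";

// Hypothetical helper: returns a label describing which format the input has.
string Classify(string input)
{
    var m = Regex.Match(input, paramPattern);
    if (!m.Success) return "invalid";
    if (m.Groups["simpleval"].Success) return "simple value";

    // The first character of group p tells writable (p) from read only (r).
    string access = m.Groups["p"].Value[0] == 'p' ? "writable" : "read only";
    bool hasIdx = m.Groups["idx"].Success;
    bool hasBit = m.Groups["inword"].Success;
    if (hasIdx && hasBit) return access + " bit in word";
    if (hasIdx) return access + " array index";
    return "unlisted form"; // matched by the pattern, but not one of the five formats
}

foreach (var s in new[] { "p123[2]", "r23[12]", "p3[7].5", "r1263[13].24", "15", "x1" })
    Console.WriteLine($"{s} -> {Classify(s)}");
```

In real code the returned label would typically be replaced by an enum or by a direct call to the matching handler.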
You can check whether the input matches a regex pattern:
1) Regex.IsMatch(input, @"^p\d+\[\d+\]$"); // matches p123[2]
2) Regex.IsMatch(input, @"^r\d+\[\d+\]$"); // matches r23[12]
3) Regex.IsMatch(input, @"^p\d+\[\d+\]\.\d+$"); // matches p3[7].5
4) Regex.IsMatch(input, @"^r\d+\[\d+\]\.\d+$"); // matches r1263[13].24
5) Regex.IsMatch(input, @"^\d+$"); // matches a simple value

Regex problems with equal sign?

In C# I'm trying to validate a string that looks like:
I#paramname='test'
or
O#paramname=2827
Here is my code:
string t1 = "I#parameter='test'";
string r = @"^([Ii]|[Oo])#\w=\w";
var re = new Regex(r);
If I take the "=\w" off the end of variable r I get True. If I add an "=\w" after the \w it's False. I want the characters between # and = to be able to be any alphanumeric value. Anything after the = sign can have alphanumeric characters and ' (single quotes). What am I doing wrong here? I have very rarely used regular expressions and can normally find an example, but this is a custom format and even with cheat sheets I'm having issues.
^([Ii]|[Oo])#\w+=(?<q>'?)[\w\d]+\k<q>$
Regular expression:
^ start of line
([Ii]|[Oo]) either (I or i) or (O or o)
\w+ 1 or more word characters
= equals sign
(?<q>'?) capture 0 or 1 quotes in named group q
[\w\d]+ 1 or more word or digit characters
\k<q> repeat of what was captured in named group q
$ end of line
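A short sketch of how this pattern behaves on the question's inputs may help. The variable name paramRegex and the third, unbalanced input are additions for illustration only.

```csharp
using System;
using System.Text.RegularExpressions;

// \w+ accepts a name of any length (the original \w matched exactly one character),
// and \k<q> forces the closing quote to mirror the optional opening quote.
var paramRegex = new Regex(@"^([Ii]|[Oo])#\w+=(?<q>'?)[\w\d]+\k<q>$");

foreach (var s in new[] { "I#parameter='test'", "O#paramname=2827", "I#parameter='test" })
    Console.WriteLine($"{s}: {paramRegex.IsMatch(s)}");
```

The first two inputs match; the third does not, because the opening quote captured in q has no matching closing quote.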
Use \w+ instead of \w to match one or more characters, or \w* to match zero or more:
Try this: Live demo
^([Ii]|[Oo])#\w+=\'*\w+\'*
If you are being a bit more strict with using paramname:
^([Ii]|[Oo])#paramname=[']?[\w]+[']?
Here is a demo
You could try something like this:
Regex rx = new Regex( @"^([IO])#(\w+)=(.*)$" , RegexOptions.IgnoreCase ) ;
Match group 1 will give you the value of I or O (the parameter direction?)
Match group 2 will give you the name of the parameter
Match group 3 will give you the value of the parameter
You could be stricter about the 3rd group and match it as
(([^']+)|('(('')|([^']+))*'))
The first alternative matches 1 or more non-quote characters; the second alternative matches a quoted string literal in which any internal (embedded) quotes are escaped by doubling them, so it would match things like
'' (the empty string)
'foo bar'
'That''s All, Folks!'

Regular Expression Groups in C#

I've inherited a code block that contains the following regex and I'm trying to understand how it's getting its results.
var pattern = @"\[(.*?)\]";
var matches = Regex.Matches(user, pattern);
if (matches.Count > 0 && matches[0].Groups.Count > 1)
...
For the input user == "Josh Smith [jsmith]":
matches.Count == 1
matches[0].Value == "[jsmith]"
... which I understand. But then:
matches[0].Groups.Count == 2
matches[0].Groups[0].Value == "[jsmith]"
matches[0].Groups[1].Value == "jsmith" <=== how?
Looking at this question, from what I understand the Groups collection stores the entire match as well as the previous match. But doesn't the regexp above match only [open square bracket] [text] [close square bracket], so why would "jsmith" match?
Also, is it always the case that the Groups collection will store exactly 2 groups: the entire match and the last match?
match.Groups[0] is always the same as match.Value, which is the entire match.
match.Groups[1] is the first capturing group in your regular expression.
Consider this example:
var pattern = @"\[(.*?)\](.*)";
var match = Regex.Match("ignored [john] John Johnson", pattern);
In this case,
match.Value is "[john] John Johnson"
match.Groups[0] is always the same as match.Value, "[john] John Johnson".
match.Groups[1] is the group of captures from the (.*?).
match.Groups[2] is the group of captures from the (.*).
match.Groups[1].Captures is yet another dimension.
Consider another example:
var pattern = @"(\[.*?\])+";
var match = Regex.Match("[john][johnny]", pattern);
Note that we are looking for one or more bracketed names in a row. You need to be able to get each name separately. Enter Captures!
match.Groups[0] is always the same as match.Value, "[john][johnny]".
match.Groups[1] is the group of captures from the (\[.*?\])+. Its Value is the last capture, "[johnny]".
match.Groups[1].Captures[0] is "[john]"
match.Groups[1].Captures[1] is "[johnny]", the same as match.Groups[1].Value
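The relationship between a group's Value and its Captures can be checked with a few lines. This is a minimal sketch that reuses the pattern and input of the example above; the variable names are illustrative.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

var namesMatch = Regex.Match("[john][johnny]", @"(\[.*?\])+");

// A repeated group keeps every iteration in Captures; Value is the last one.
var captured = namesMatch.Groups[1].Captures
    .Cast<Capture>()
    .Select(c => c.Value)
    .ToList();

Console.WriteLine(namesMatch.Value);           // the entire match
Console.WriteLine(namesMatch.Groups[1].Value); // the last capture of group 1
Console.WriteLine(string.Join(" ", captured)); // every capture of group 1, in order
```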
The ( ) acts as a capture group. So the matches array has all of the matches that C# finds in your string, and the sub array has the values of the capture groups inside of those matches. If you don't want that extra level of capture, just remove the ( ).
Groups[0] is the entire match.
Groups[1] is your group captured by parentheses (.*?). You can configure Regex to capture Explicit groups only (there is an option for that when you create a regex), or use (?:.*?) to create a non-capturing group.
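Both options mentioned here can be compared by looking at Groups.Count for the pattern from the question. This is a small sketch; the variable names are illustrative.

```csharp
using System;
using System.Text.RegularExpressions;

string user = "Josh Smith [jsmith]";

// Default behaviour: group 0 (whole match) plus the numbered group from ( ).
int withCapture = Regex.Match(user, @"\[(.*?)\]").Groups.Count;

// ExplicitCapture turns unnamed parentheses into grouping-only constructs.
int explicitOnly = Regex.Match(user, @"\[(.*?)\]", RegexOptions.ExplicitCapture).Groups.Count;

// (?: ) does the same for a single pair of parentheses.
int nonCapturing = Regex.Match(user, @"\[(?:.*?)\]").Groups.Count;

Console.WriteLine($"{withCapture} {explicitOnly} {nonCapturing}");
```

In the last two cases only Groups[0] remains, so the inherited check matches[0].Groups.Count > 1 would no longer pass.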
The parentheses identify a group as well, so the first group is the entire match, and the second group is the contents of what was found between the square brackets.
How? The answer is here
(.*?)
That is a subgroup of @"\[(.*?)\]".
