capturing specific or group in regex C# - c#

I'm trying to match a file name like xxxxSystemCheckedOut.png, where xxxx can be any prefix to the file name and System and CheckedOut are keywords to identify.
EDIT: I wasn't being clear on all the possible file names and their results. So filenames can be
xxxxSystem.png produces (group 1: xxxx group 2: System)
xxxxSystemCheckedOut.png produces (group 1: xxxx group 2: System group 3: CheckedOut)
xxxxCheckedOut.png produces (group 1: xxxx group 2: CheckedOut)
This is my current regex. It matches the file name like I want it to, but I can't get it to group in the right way.
Using the previous example I'd like the groups to be like this:
xxxx
System
CheckedOut
.png
(?:([\w]*)(CheckedOut|System)+(\.[a-z]*)\Z)

[EDIT]
Give this a try.
Pattern: (.*?)(?:(System)|(CheckedOut)|(Cached))+(.png)\Z
String: xxxxTESTSystemCached.png
Groups:
xxxxTEST
System
Cached
.png
https://regex101.com/r/jE5eA4/1

UPDATE - Based on comments to other answers:
This should work for all combinations of System/CheckedOut/Cached:
(\w+?)(System)?(CheckedOut)?(Cached)?(.png)
https://regex101.com/r/qT2sX9/1
Note that the groups for missing keywords will still exist, so for example:
"abcdSystemCached.png" gives:
Match 1 : "abcd"
Match 2 : "System"
Match 3 :
Match 4 : "Cached"
Match 5 : ".png"
And "1234CheckedOutCached.png" gives:
Match 1 : "1234"
Match 2 :
Match 3 : "CheckedOut"
Match 4 : "Cached"
Match 5 : ".png"
This is kinda nice as you know a particular keyword will always be a certain position, so it becomes like a flag.
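A minimal sketch of that flag idea in C#, assuming an anchored variant of the pattern above with the dot escaped. The class name FileNameFlags and the tuple field names are made up for illustration; Group.Success is what tells you whether an optional keyword group took part in the match.

```csharp
using System.Text.RegularExpressions;

public static class FileNameFlags
{
    // Anchored variant of the pattern above (assumption: only .png files matter).
    private static readonly Regex Pattern =
        new Regex(@"^(\w+?)(System)?(CheckedOut)?(Cached)?(\.png)$");

    // Returns the prefix plus one flag per keyword, or null when the name does not match.
    public static (string Prefix, bool IsSystem, bool IsCheckedOut, bool IsCached)? Parse(string fileName)
    {
        Match m = Pattern.Match(fileName);
        if (!m.Success)
            return null;

        // Each keyword always sits in the same group, so Success acts as a flag.
        return (m.Groups[1].Value,
                m.Groups[2].Success,
                m.Groups[3].Success,
                m.Groups[4].Success);
    }
}
```

Calling FileNameFlags.Parse("abcdSystemCached.png") should give the prefix "abcd" with the System and Cached flags set and the CheckedOut flag cleared.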

From the comments: I actually need the groups separately so I know how to operate on the image, each keyword ends in different operations on the image
You really don't need to use separate capture buffers on the keywords.
If you need the order of the matched keywords relative to one another,
you'd use the below code. Even if you didn't need the order it could be
done like that.
( .*? ) # (1)
( System | CheckedOut )+ # (2)
\.png $
C#:
string fname = "xxxxSystemCheckedOutSystemSystemCheckedOutCheckedOut.png";
Regex RxFname = new Regex( @"(.*?)(System|CheckedOut)+\.png$" );
Match fnameMatch = RxFname.Match( fname );
if ( fnameMatch.Success )
{
Console.WriteLine("Group 0 = {0}", fnameMatch.Groups[0].Value);
Console.WriteLine("Group 1 = {0}", fnameMatch.Groups[1].Value);
Console.WriteLine("Last Group 2 = {0}\n", fnameMatch.Groups[2].Value);
CaptureCollection cc = fnameMatch.Groups[2].Captures;
Console.WriteLine("Array and order of group 2 matches (collection):\n");
for (int i = 0; i < cc.Count; i++)
{
Console.WriteLine("[{0}] = '{1}'", i, cc[i].Value);
}
}
Output:
Group 0 = xxxxSystemCheckedOutSystemSystemCheckedOutCheckedOut.png
Group 1 = xxxx
Last Group 2 = CheckedOut
Array and order of group 2 matches (collection):
[0] = 'System'
[1] = 'CheckedOut'
[2] = 'System'
[3] = 'System'
[4] = 'CheckedOut'
[5] = 'CheckedOut'

I'm no Regex wizard, so if this can be shortened/tidied I'd love to know, but this groups like you want based on the keywords you gave:
Edited based on OPs clarification of the file structure
/(\w+?)(system)?(checkedout)?(cached)?(.png)/ig
Regex101 Demo
Edit: beercohol and jon have me beat ;-)

I read somewhere (can't remember where) the more precise your pattern is, the better performance you'll get from it.
So try this pattern
"(\\w+?)(?:(System)|(CheckedOut))+(.png)"
Code Sample:
List<string> fileNames = new List<string>
{
"xxxxSystemCheckedOut.png", // Good
"SystemCheckedOut.png", // Good
"1afweiljSystemCheckedOutdgf.png", // Bad - Garbage characters before .png
"asdf.png", // Bad - No System or CheckedOut
"xxxxxxxSystemCheckedOut.bmp", // Bad - Wrong file extension
"xxSystem.png", // Good
"xCheckedOut.png" // Good
};
foreach (Match match in fileNames.Select(fileName => Regex.Match(fileName, "(\\w+?)(?:(System)|(CheckedOut))+(.png)")))
{
List<Group> matchedGroups = match.Groups.Cast<Group>().Where(group => !String.IsNullOrEmpty(group.Value)).ToList();
if (matchedGroups.Count > 0)
{
matchedGroups.ForEach(Console.WriteLine);
Console.WriteLine();
}
}
Results:
xxxxSystemCheckedOut.png
xxxx
System
CheckedOut
.png
SystemCheckedOut.png
System
CheckedOut
.png
xxSystem.png
xx
System
.png
xCheckedOut.png
x
CheckedOut
.png

Related

Find multiply groups matching in specific substring

I would like to capture the values in the line below that starts with the word "need" (shown in bold in the original post), while the lines that start with "skip" and "ignore" must be ignored. I tried the pattern
need.+?(:"(?'index'\w+)"[,}])
but it found only the first (emphasised) value. How can I get the needed result using regex only?
"skip" : {"A":"ABCD123","B":"ABCD1234","C":"ABCD1235"}
"need" : {"A":"ZABCD123","B":"ZABCD1234","C":"ZABCD1235"}
"ignore" : {"A":"SABCD123","B":"SABCD1234","C":"SABCD1235"}
We are going to find need and group what we find into named match groups and their captures. There will be two groups, one named Index which holds the A | B | C and then one named Data.
The match holds our data in the Captures collections of those two groups.
From there we will join them into a dictionary.
Here is the code to do that magic:
string data =
@"""skip"" : {""A"":""ABCD123"",""B"":""ABCD1234"",""C"":""ABCD1235""}
""need"" : {""A"":""ZABCD123"",""B"":""ZABCD1234"",""C"":""ZABCD1235""}
""ignore"" : {""A"":""SABCD123"",""B"":""SABCD1234"",""C"":""SABCD1235""}";
string pattern = @"
\x22need\x22\s *:\s *{ # Find need
( # Beginning of Captures
\x22 # Quote is \x22
(?<Index>[^\x22] +) # A into index.
\x22\:\x22 # ':'
(?<Data>[^\x22] +) # 'Z...' Data
\x22,? # ',(maybe)
)+ # End of 1 to many Captures";
var mt = Regex.Match(data,
pattern,
RegexOptions.IgnorePatternWhitespace | RegexOptions.ExplicitCapture);
// Get the data capture into a List<string>.
var captureData = mt.Groups["Data"].Captures.OfType<Capture>()
.Select(c => c.Value).ToList();
// Join the index capture data and project it into a dictionary.
var asDictionary = mt.Groups["Index"]
.Captures.OfType<Capture>()
.Select((cp, iIndex) => new KeyValuePair<string,string>
(cp.Value, captureData[iIndex]) )
.ToDictionary(kvp => kvp.Key, kvp => kvp.Value );
If number of fields is fixed - you can code it like:
^"need"\s*:\s*{"A":"(\w+)","B":"(\w+)","C":"(\w+)"}
Demo
If tags would be after values - like that:
{"A":"ABCD123","B":"ABCD1234","C":"ABCD1235"} : "skip"
{"A":"ZABCD123","B":"ZABCD1234","C":"ZABCD1235"} : "need"
{"A":"SABCD123","B":"SABCD1234","C":"SABCD1235"} : "ignore"
Then you could employ infinite positive look ahead with
"\w+?":"(\w+?)"(?=.*"need")
Demo
But infinite positive lookbehinds are prohibited in PCRE (the * and + operators may not be used inside a lookbehind), so this is not very useful in your situation.
You can't capture a dynamically set number of groups, so I'd run something like this regex
"need".*{.*,?".*?":(".+?").*}
[Demo]
with a 'match_all' function, or use Agnius' suggestion

C# check if characters occur in a fixed order in a string

I need to check if a user input resembles a parameter or not. It comes as a string (not changeable) and has to look like the following examples:
p123[2] -> writable array index
r23[12] -> read only array index
p3[7].5 -> writable bit in word
r1263[13].24 -> read only bit in word
15 -> simple value
The user is allowed to input any of them and my function has to distinguish them in order to call the proper function.
An idea would be to check for characters in a specific order e.g. "p[]", "r[]", "p[]." etc.
But I am not sure how to achieve that without checking each single character and using multiple cases...
Any other idea of how to make sure that the user input is correct is also welcomed.
If you just need to validate user input that should come in 1 of the 5 provided formats, use a regex check:
Regex.IsMatch(str, @"^(?:(?<p>[pr]\d+)(?:\[(?<idx>\d+)])?(?:\.(?<inword>\d+))?|(?<simpleval>\d+))$")
See the regex demo
Description:
^ - start of string
(?: - start of the alternation group
(?<p>[pr]\d+) - Group "p" capturing p or r and 1 or more digits after
(?:\[(?<idx>\d+)])? - an optional sequence of [, 1 or more digits (captured into Group "idx") and then ]
(?:\.(?<inword>\d+))? - an optional sequence of a literal ., then 1 or more digits captured into Group "inword"
| - or (then comes the second alternative)
(?<simpleval>\d+) - Group "simpleval" capturing 1 or more digits
) - end of the outer grouping
$ - end of string.
If the p or r can be any ASCII letters, use [a-zA-Z] instead of [pr].
C# demo:
var strs = new List<string> { "p123[2]","r23[12]","p3[7].5","r1263[13].24","15"};
var pattern = @"^(?:(?<p>[pr]\d+)(?:\[(?<idx>\d+)])?(?:\.(?<inword>\d+))?|(?<simpleval>\d+))$";
foreach (var s in strs)
Console.WriteLine("{0}: {1}", s, Regex.IsMatch(s, pattern));
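Since the function also has to distinguish the formats, not only validate them, the named groups can be inspected after a single match. The following is only a sketch under that assumption: the ParameterKind enum and the ParameterClassifier class are hypothetical names, and the Unlisted value covers inputs such as p123 or p3.5 that the pattern accepts (because the index part is optional) but that are not among the five shapes in the question.

```csharp
using System.Text.RegularExpressions;

// Hypothetical result type, introduced only for this illustration.
public enum ParameterKind { Invalid, SimpleValue, ArrayIndex, BitInWord, Unlisted }

public static class ParameterClassifier
{
    private static readonly Regex Pattern = new Regex(
        @"^(?:(?<p>[pr]\d+)(?:\[(?<idx>\d+)])?(?:\.(?<inword>\d+))?|(?<simpleval>\d+))$");

    public static ParameterKind Classify(string input)
    {
        Match m = Pattern.Match(input);
        if (!m.Success)
            return ParameterKind.Invalid;
        if (m.Groups["simpleval"].Success)
            return ParameterKind.SimpleValue;

        bool hasIndex = m.Groups["idx"].Success;
        bool hasBit = m.Groups["inword"].Success;

        if (hasIndex && hasBit)
            return ParameterKind.BitInWord;   // e.g. p3[7].5
        if (hasIndex)
            return ParameterKind.ArrayIndex;  // e.g. r23[12]

        // Accepted by the pattern, but not one of the shapes listed in the question.
        return ParameterKind.Unlisted;
    }
}
```

Whether the parameter is writable or read only can then be read from the first character of m.Groups["p"].Value.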
You can check whether the input matches a regex pattern:
1 ) Regex.IsMatch(input,@"^p\d+\[\d+\]$"); // match p123[2]
2 ) Regex.IsMatch(input,@"^r\d+\[\d+\]$"); // match r23[12]
3 ) Regex.IsMatch(input,@"^p\d+\[\d+\]\.\d+$"); // match p3[7].5
4 ) Regex.IsMatch(input,@"^r\d+\[\d+\]\.\d+$"); // match r1263[13].24
5 ) Regex.IsMatch(input,@"^\d+$") ;// match simple value

Remove substring if number exists before keyword

I have a strings with the form:
5 dogs = 1 medium size house
4 cats = 2 small houses
one bird = 1 bird cage
What I am trying to do is remove the substring that exists before the equals sign, but only if the substring contains a keyword and the data before that keyword is an integer.
So in this example my key words are:
dogs,
cats,
bird
In the above example, the ideal output of my process would be:
1 medium size house
2 small houses
one bird = 1 bird cage
My code so far looks like this (I am hard coding the keyword values/strings for now)
var originalstring = "5 dogs = 1 medium size house";
int equalsindex = originalstring.IndexOf('=');
var prefix = originalstring.Substring(0, equalsindex);
if (prefix.Contains("dogs"))
{
var modifiedstring = originalstring.Replace(prefix, string.Empty).Replace("=", string.Empty);
return modifiedstring;
}
return originalstring;
The issue here is that I am removing the whole substring regardless of whether or not the data preceding the keyword is a number.
Would somebody be able to help me with this additional logic?
Thanks so much as always for anybody who takes a few minutes to read this question.
Mick
You can do it with a simple regex of the form
\d+\s+(?:kw1|kw2|kw3|...)\s*=\s*
where kwX is the corresponding keyword.
var data = new[] {
"5 dogs = 1 medium size house",
"4 cats = 2 small houses",
"one bird = 1 bird cage"
};
var keywords = new[] {"dogs", "cats", "bird"};
var regexStr = string.Format( @"\d+\s+(?:{0})\s*=\s*", string.Join("|", keywords));
var regex = new Regex(regexStr);
foreach (var s in data) {
Console.WriteLine("'{0}'", regex.Replace(s, string.Empty));
}
In the example above the call of string.Format pastes the list of keywords joined by | into the "template" of the expression at the top of the post, i.e.
\d+\s+(?:dogs|cats|bird)\s*=\s*
This expression matches
One or more digits \d+, followed by
One or more space \s+, followed by
A keyword from the list: dogs, cats, bird (?:dogs|cats|bird), followed by
Zero or more spaces \s*, followed by
An equal sign =, followed by
Zero or more spaces \s*
The rest is easy: since this regex matches the part that you wish to remove, you need to call Replace and pass it string.Empty.
Demo.
You can use regex (System.Text.RegularExpressions) to identify whether or not there is a number in the string.
Regex r = new Regex("[0-9]"); //Look for a number between 0 and 9
bool hasNumber = r.IsMatch(prefix);
This Regex simply searches for any digit in the string. If you want to search for a digit-space-letter sequence you could use [0-9] [a-zA-Z]. The character class [a-zA-Z] covers both upper and lower case letters, so either results in a match.
You can try something like this:
int i;
if(int.TryParse(prefix.Substring(0, 1), out i)) //try to get an int from first char of prefix
{
//remove prefix
}
This will only work for single-digit integers, however.
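One way around the single-digit limit, sketched here without regex: split the prefix on spaces and hand the whole first token to int.TryParse. The PrefixStripper class and its Strip method are hypothetical names, and the sketch assumes the prefix is exactly "<integer> <keyword>". Note that int.TryParse also accepts a leading sign, so "-5 dogs" would count as an integer prefix here.

```csharp
using System;

public static class PrefixStripper
{
    // Removes "<integer> <keyword> =" from the start of the line, otherwise returns the line unchanged.
    public static string Strip(string line, string[] keywords)
    {
        int equalsIndex = line.IndexOf('=');
        if (equalsIndex < 0)
            return line;

        // Tokens before the equals sign, e.g. ["5", "dogs"].
        string[] tokens = line.Substring(0, equalsIndex)
            .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);

        bool removable = tokens.Length == 2
            && int.TryParse(tokens[0], out _)            // whole token, so multi-digit numbers work
            && Array.IndexOf(keywords, tokens[1]) >= 0;  // second token must be a known keyword

        return removable ? line.Substring(equalsIndex + 1).TrimStart() : line;
    }
}
```

With the keywords dogs, cats and bird, "5 dogs = 1 medium size house" becomes "1 medium size house", while "one bird = 1 bird cage" is returned unchanged.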

Run multiple RegEx patterns on single string

I need to run a C# RegEx match on a string.
Problem is, I'm looking for more than one pattern on a single string, and I cannot find a way to do that with a single run.
For example, in the string
The dog has jumped
I'm looking for "dog" and for "dog has".
I don't know how can I get those two results with one pass.
I've tried to concatenate the patterns with the alternation symbol (|), like this:
(dog|dog has)
But it returned only the first match.
What can I use to get back both the matches?
Thanks!
The regex engine will return the first substring that satisfied the pattern. If you write (dog|dog has), it won't ever be able to match dog has because dog has starts with dog, which is the first alternative. Furthermore, the regex engine won't return overlapping matches.
Here's a convoluted method:
var patterns = new[] { "dog", "dog has" };
var sb = new StringBuilder();
for (var i = 0; i < patterns.Length; i++)
sb.Append(@"(?=(?<p").Append(i).Append(">").Append(patterns[i]).Append("))?");
var regex = new Regex(sb.ToString(), RegexOptions.Compiled);
Console.WriteLine("Pattern: {0}", regex);
var input = "a dog has been seen with another dog";
Console.WriteLine("Input: {0}", input);
foreach (var match in regex.Matches(input).Cast<Match>())
{
for (var i = 0; i < patterns.Length; i++)
{
var group = match.Groups["p" + i];
if (!group.Success)
continue;
Console.WriteLine("Matched pattern #{0}: '{1}' at index {2}", i, group.Value, group.Index);
}
}
This produces the following output:
Pattern: (?=(?<p0>dog))?(?=(?<p1>dog has))?
Input: a dog has been seen with another dog
Matched pattern #0: 'dog' at index 2
Matched pattern #1: 'dog has' at index 2
Matched pattern #0: 'dog' at index 33
Yes, this is an abuse of the regex engine :)
This works by building a pattern using optional lookaheads, which capture the substrings as a side effect, but the pattern otherwise always matches an empty string. So there are n+1 total matches, n being the input length. The patterns cannot contain numbered backreferences, but you can use named backreferences instead.
Also, this can return overlapping matches, as it will try to match all patterns at all string positions.
But you definitely should benchmark this against a manual approach (looping over the patterns and matching each of them separately). I don't expect this to be fast...
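For comparison, the manual approach mentioned above could look roughly like this. It is a sketch, not a benchmark: MultiPatternSearch and FindAll are hypothetical names, and each pattern is simply run over the input on its own, so matches of different patterns may overlap while matches of the same pattern still may not.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class MultiPatternSearch
{
    // Runs every pattern separately and collects (pattern number, matched text, position).
    public static List<(int Pattern, string Value, int Index)> FindAll(string input, string[] patterns)
    {
        var results = new List<(int Pattern, string Value, int Index)>();
        for (int i = 0; i < patterns.Length; i++)
        {
            foreach (Match m in Regex.Matches(input, patterns[i]))
            {
                results.Add((i, m.Value, m.Index));
            }
        }
        return results;
    }
}
```

For the input "a dog has been seen with another dog" and the patterns "dog" and "dog has", this should find the same three matches as the lookahead version, grouped by pattern rather than by position.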
You can use one regex pattern to do both.
Pattern: (dog\b has\b)|(dog\b)
I figured out this pattern using an online regex builder.
Then you can use it in C# with the Regex class by doing something like
Regex reg = new Regex(@"(dog\b has\b)|(dog\b)", RegexOptions.IgnoreCase);
if (reg.IsMatch(input)){ // input is the string being searched
//found dog or dog has
}

Regular Expression Pattern C#

I have the following string that would require me to parse it via Regex in C#.
Format: rec_mnd.rate.current_rate.sum.QWD.RET : 214345
I would like to extract the bold chars as group objects in a GroupCollection.
QWD = 1 group
RET = 1 group
214345 = 1 group
what would the message pattern be like?
It would be something like this:
string s = "Format: rec_mnd.rate.current_rate.sum.QWD.RET : 214345";
Match m = Regex.Match(s, @"^Format: rec_mnd\.rate\.current_rate\.sum\.(.+?)\.(.+?) : (\d+)$");
if( m.Success )
{
Console.WriteLine(m.Groups[1].Value);
Console.WriteLine(m.Groups[2].Value);
Console.WriteLine(m.Groups[3].Value);
}
The question mark in the first two groups makes that quantifier lazy: it will capture the least possible amount of characters. In other words, it captures until the first . it sees. Alternatively, you could use ([^.]+) in those groups, which explicitly captures everything except a period.
The last group explicitly only captures decimal digits. If your expression can have other values on the right side of the : you'd have to change that to .+ as well.
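The ([^.]+) alternative mentioned above can be sketched as follows. RateLineParser and Parse are hypothetical names, and the sketch assumes the line always starts with the same fixed "Format: rec_mnd.rate.current_rate.sum." text as in the question.

```csharp
using System.Text.RegularExpressions;

public static class RateLineParser
{
    // Same pattern as above, with negated character classes instead of lazy quantifiers.
    private static readonly Regex Pattern = new Regex(
        @"^Format: rec_mnd\.rate\.current_rate\.sum\.([^.]+)\.([^.]+) : (\d+)$");

    // Returns the three captured values, or null when the line does not match.
    public static string[] Parse(string line)
    {
        Match m = Pattern.Match(line);
        return m.Success
            ? new[] { m.Groups[1].Value, m.Groups[2].Value, m.Groups[3].Value }
            : null;
    }
}
```

For the sample line this should yield QWD, RET and 214345, the same groups as the lazy version.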
Please, make it a lot easier on yourself and label your groups to make it easier to understand what is going on in code.
Regex myRegex = new Regex(@"rec_mnd\.rate\.current_rate\.sum\.(?<code>[A-Z]{3})\.(?<subCode>[A-Z]{3})\s*:\s*(?<number>\d+)");
var matches = myRegex.Matches(sourceString);
foreach(Match match in matches)
{
//do stuff
Console.WriteLine("Match");
Console.WriteLine("Code: " + match.Groups["code"].Value);
Console.WriteLine("SubCode: " + match.Groups["subCode"].Value);
Console.WriteLine("Number: " + match.Groups["number"].Value);
}
This should give you what you want regardless of what's between the .'s.
@"(?:.+\.){4}(.\w+)\.(\w+)\s?:\s?(\d+)"
