C# Regular Expression Capturing Empty String [duplicate] - c#

This question already has answers here:
C# Regex.Split: Removing empty results
(9 answers)
Closed 5 years ago.
I'm trying to create a simple regular expression in C# to split a string into tokens. The problem I'm running into is that the pattern I'm using captures an empty string, which throws off my expected results. What can I do to change my regular expression so it doesn't capture an empty string?
var input = "ID=123&User=JohnDoe";
var pattern = "(?:id=)|(?:&user=)";
var tokens = Regex.Split(input, pattern, RegexOptions.IgnoreCase);
// Expected Results
// tokens[0] == "123"
// tokens[1] == "JohnDoe"
// Actual Results
// tokens[0] == ""
// tokens[1] == "123"
// tokens[2] == "JohnDoe"

While the comments to your OP regarding using a different approach may have merit, they don't address your specific question regarding the RegEx behavior.
The empty string does not come from a capture group. Regex.Split includes an empty string in the result whenever a match occurs at the very beginning (or end) of the input, and here id= matches at position 0, so the first token is the empty text in front of it.
Edit:
Changing the group type, for example to an atomic group as in (?>id=)|(?:&user=), still matches the same text at the same position, so it should not be expected to remove the leading empty string.
The simplest remedy is to tack a predicate onto the tokens list:
tokens.Where(x => !string.IsNullOrWhiteSpace(x))

I don't think you can solve this problem with Regex.Split to be honest. One brute force way to do this is to remove every "":
var input = "ID=123&User=JohnDoe";
var pattern = "(?:id=)|(?:&user=)";
var tokens = Regex.Split(input, pattern, RegexOptions.IgnoreCase).Where(x => x != "");
I think you should use regex that actually captures the tokens in groups.
var input = "ID=123&User=JohnDoe";
var pattern = "id=(.+)&user=(.+)";
var match = Regex.Match(input, pattern, RegexOptions.IgnoreCase);
// match.Groups[1].Value == "123"
// match.Groups[2].Value == "JohnDoe"
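As a minimal sketch of the filtering approach discussed in this thread: the class name SplitDemo and the helper Tokens are hypothetical, and the pattern is the question's pattern with the non-capturing groups dropped, since they are not needed for the alternation.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class SplitDemo
{
    // Hypothetical helper: split on the key prefixes, then drop empty entries.
    public static string[] Tokens(string input)
    {
        return Regex.Split(input, "id=|&user=", RegexOptions.IgnoreCase)
                    .Where(t => t.Length > 0)
                    .ToArray();
    }

    public static void Main()
    {
        // The match at position 0 leaves an empty string in front of it.
        var raw = Regex.Split("ID=123&User=JohnDoe", "id=|&user=", RegexOptions.IgnoreCase);
        Console.WriteLine(raw.Length); // 3

        foreach (var token in Tokens("ID=123&User=JohnDoe"))
            Console.WriteLine(token);  // 123, then JohnDoe
    }
}
```

Filtering after the split keeps the pattern simple; the capture-group approach is probably the better fit when the keys are known in advance.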

Related

Use C# RegEx to retrieve a list of matching strings found in a source string? [duplicate]

This question already has an answer here:
Simple and tested online regex containing regex delimiters does not work in C# code
(1 answer)
Closed 3 years ago.
I'm a RegEx novice, so I'm hoping someone out there can give me a hint.
I want to find a straightforward way (using RegEx?) to extract a list/array of values that match a pattern from a string.
If source string is "Hello #bob and #mark and #dave", I want to retrieve a list containing "#bob", "#mark" and "#dave" or, even better, "bob", "mark" and "dave" (without the # symbol).
So far, I have something like this (in C#):
string note = "Hello, #bob and #mark and #dave";
var r = new Regex(@"/(#)\w+/g");
var listOfFound = r.Match(note);
I'm hoping listOfFound will be an array or a List containing the three values.
I could do this with some clever string parsing, but it seems like this should be a piece of cake for RegEx, if I could only come up with the right pattern.
Thanks for any help!
Regexes in C# don't need delimiters and options must be supplied as the second argument to the constructor, but are not required in this case as you can get all your matches using Regex.Matches. Note that by using a lookbehind for the # ((?<=#)) we can avoid having the # in the match:
string note = "Hello, #bob and #mark and #dave";
Regex r = new Regex(@"(?<=#)\w+");
foreach (Match match in r.Matches(note))
    Console.WriteLine("Found '{0}' at position {1}", match.Value, match.Index);
Output:
Found 'bob' at position 8
Found 'mark' at position 17
Found 'dave' at position 27
To get all the values into a list/array you could use something like:
string note = "Hello, #bob and #mark and #dave";
Regex r = new Regex(@"(?<=#)\w+");
// list of matches
List<String> Matches = new List<String>();
foreach (Match match in r.Matches(note))
    Matches.Add(match.Value);
// array of matches
String[] listOfFound = Matches.ToArray();
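The same collection step can also be written with LINQ. This is only a sketch: the class name MentionDemo and the method ExtractNames are made up for illustration, and Cast<Match>() is used so that it also works with the non-generic MatchCollection of older frameworks.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class MentionDemo
{
    // Hypothetical helper: returns every word that directly follows a '#'.
    public static List<string> ExtractNames(string note)
    {
        return Regex.Matches(note, @"(?<=#)\w+")
                    .Cast<Match>()          // MatchCollection -> IEnumerable<Match>
                    .Select(m => m.Value)   // keep only the matched text
                    .ToList();
    }

    public static void Main()
    {
        var names = ExtractNames("Hello, #bob and #mark and #dave");
        Console.WriteLine(string.Join(", ", names)); // bob, mark, dave
    }
}
```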
You could do it without Regex, for example:
var listOfFound = note.Split().Where(word => word.StartsWith("#"));
Replace
var listOfFound = r.Match(note);
by
var listOfFound = r.Matches(note);

Regex not able to parse last group [duplicate]

This question already has answers here:
What is the difference between .*? and .* regular expressions?
(3 answers)
Closed 3 years ago.
This is the test. I expect the last group to be ".png", but this pattern returns "" instead.
var inputStr = @"C:\path\to\dir\[yyyy-MM-dd_HH-mm].png";
var pattern = @"(.*?)\[(.*?)\](.*?)";
var regex = new Regex(pattern);
var match = regex.Match(inputStr);
var thirdGroupValue = match.Groups[3].Value;
// ✓ EXPECTED: ".png"
// ✗ CURRENT: ""
The 1st and 2nd groups work fine.
This is because you made the * in Group 3 lazy:
(.*?)\[(.*?)\](.*?)
                 ^
                 here
This means it will match as little as possible. What's the least .* can match? An empty string!
You can learn more about lazy vs greedy here.
You can fix this either by removing the ?, which makes the quantifier greedy, or by putting a $ at the end, which forces the match to run to the end of the string:
(.*?)\[(.*?)\](.*)
or
(.*?)\[(.*?)\](.*?)$
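To see the difference side by side, here is a small sketch that runs the three patterns against the question's input. The class name LazyGreedyDemo and the helper ThirdGroup are illustrative names, not part of the original posts.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LazyGreedyDemo
{
    // Hypothetical helper: returns the text captured by group 3 for a pattern.
    public static string ThirdGroup(string pattern, string input)
    {
        return Regex.Match(input, pattern).Groups[3].Value;
    }

    public static void Main()
    {
        var input = @"C:\path\to\dir\[yyyy-MM-dd_HH-mm].png";

        // Lazy with nothing after it: the group is satisfied with zero characters.
        Console.WriteLine("[" + ThirdGroup(@"(.*?)\[(.*?)\](.*?)", input) + "]");  // []

        // Greedy: the group takes everything that is left.
        Console.WriteLine("[" + ThirdGroup(@"(.*?)\[(.*?)\](.*)", input) + "]");   // [.png]

        // Lazy but anchored: the group must grow until $ can match.
        Console.WriteLine("[" + ThirdGroup(@"(.*?)\[(.*?)\](.*?)$", input) + "]"); // [.png]
    }
}
```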

Split a string by Regex [duplicate]

This question already has answers here:
Regular expression to extract text between square brackets
(15 answers)
Closed 5 years ago.
I'm trying to work out how to split this kind of string using a regex in C#.
[01,01,01][02,03,00][03,07,00][04,06,00][05,02,00][06,04,00][07,08,00][08,05,00]
Can someone knowledgeable about regex point me to how to achieve this?
Sample regex pattern that doesn't work:
[\dd,\dd,\dd]
sample output:
[01,01,01]
[02,03,00]
[03,07,00]
[04,06,00]
[05,02,00]
[06,04,00]
[07,08,00]
[08,05,00]
This will do the job in C# (\[.+?\]), e.g.:
var s = @"[01,01,01][02,03,00][03,07,00][04,06,00][05,02,00][06,04,00][07,08,00][08,05,00]";
var reg = new Regex(@"(\[.+?\])");
var matches = reg.Matches(s);
foreach (Match m in matches)
{
    Console.WriteLine($"{m.Value}");
}
EDIT This is how the expression (\[.+?\]) works:
first, the outer parentheses, ( and ), capture whatever the pattern inside them matches
then the escaped square brackets, \[ and \], match the literal [ and ] in the source string
finally, .+? matches one or more characters, but as few as possible, so that a single match does not run from the first [ to the last ]
I know you stipulated Regex; however, it's worth looking at Split again, if only for academic purposes:
Code
var input = "[01,01,01][02,03,00][03,07,00][04,06,00][05,02,00][06,04,00][07,08,00][08,05,00]";
var output = input.Split(']', StringSplitOptions.RemoveEmptyEntries)
    .Select(x => x + "]") // add the bracket back
    .ToList();
foreach (var o in output)
    Console.WriteLine(o);
Output
[01,01,01]
[02,03,00]
[03,07,00]
[04,06,00]
[05,02,00]
[06,04,00]
[07,08,00]
[08,05,00]
The Regex solution below is restricted to three values of exactly two digits each, separated by commas. Inside the foreach loop you can access the matched text via match.Value.
Remember to include using System.Text.RegularExpressions;
var input = "[01,01,01][02,03,00][03,07,00][04,06,00][05,02,00][06,04,00][07,08,00][08,05,00]";
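// No trailing + on the pattern: with it, adjacent groups would merge into a single match
foreach (Match match in Regex.Matches(input, @"\[\d{2},\d{2},\d{2}\]"))
{
    // do stuff with match.Value, e.g. "[01,01,01]"
}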
Thanks all for the answers. I also got it working with this code:
string pattern = @"\[\d\d,\d\d,\d\d]";
Regex rgx = new Regex(pattern, RegexOptions.IgnoreCase);
MatchCollection matches = rgx.Matches(myResult);
Debug.WriteLine(matches.Count);
foreach (Match match in matches)
    Debug.WriteLine(match.Value);
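If the individual numbers are needed rather than the bracketed text, capture groups can pull them out of each match. The following is a rough sketch under that assumption; TripleDemo and ParseTriples are hypothetical names, and the pattern assumes exactly three two-digit values per bracket as in the sample input.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class TripleDemo
{
    // Hypothetical helper: turns "[01,02,03][04,05,06]" into a list of int triples.
    public static List<int[]> ParseTriples(string input)
    {
        var result = new List<int[]>();
        foreach (Match m in Regex.Matches(input, @"\[(\d\d),(\d\d),(\d\d)\]"))
        {
            // Groups[0] is the whole bracketed match; Groups[1..3] are the values.
            result.Add(new[]
            {
                int.Parse(m.Groups[1].Value),
                int.Parse(m.Groups[2].Value),
                int.Parse(m.Groups[3].Value)
            });
        }
        return result;
    }

    public static void Main()
    {
        foreach (var triple in ParseTriples("[01,01,01][02,03,00][03,07,00]"))
            Console.WriteLine(string.Join(" ", triple));
    }
}
```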

Easy Regex capture [duplicate]

This question already has answers here:
Regular Expression Groups in C#
(5 answers)
Closed 6 years ago.
New to using C# Regex, I am trying to capture two comma separated integers from a string into two variables.
Example: 13,567
I tried variations on
Regex regex = new Regex(@"(\d+),(\d+)");
var matches = regex.Matches("12,345");
foreach (var itemMatch in matches)
    Debug.Print(itemMatch.Value);
This just captures 1 variable, which is the entire string. I did work around this by changing the capture pattern to "(\d+)", but that ignores the middle comma entirely, so I would still get a match if there were any other text between the integers.
How do I get it to extract both integers and also ensure there is a comma between them?
Can do this with String.Split
Why not just use a split and parse?
var results = "123,456".Split(',').Select(int.Parse).ToArray();
var left = results[0];
var right = results[1];
Alternatively, you can use a loop and use int.TryParse to handle failures but for what you're looking for this should cover it
If you're really committed to a Regex
You can do this with a Regex too, just need to use groups of the match
Regex r = new Regex(@"(\d+)\,(\d+)", RegexOptions.Compiled);
var r1 = r.Match("123,456");
//first is total match
Console.WriteLine(r1.Groups[0].Value);
//Then first and second groups
var left = int.Parse(r1.Groups[1].Value);
var right = int.Parse(r1.Groups[2].Value);
Console.WriteLine("Left "+ left);
Console.WriteLine("Right "+right);
Made a dotnetfiddle you can test the solution in as well
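To also reject input that has anything other than a single comma between the two integers, the pattern can be anchored and the groups parsed defensively. This is a sketch only; PairDemo and TryParsePair are hypothetical names, and the anchors ^ and $ are an addition that the answers above do not use.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PairDemo
{
    // Hypothetical helper: succeeds only for input of the form "<digits>,<digits>".
    public static bool TryParsePair(string text, out int left, out int right)
    {
        left = 0;
        right = 0;
        var m = Regex.Match(text, @"^(\d+),(\d+)$");
        // Groups[1] and Groups[2] hold the two integers; TryParse guards against overflow.
        return m.Success
            && int.TryParse(m.Groups[1].Value, out left)
            && int.TryParse(m.Groups[2].Value, out right);
    }

    public static void Main()
    {
        int a, b;
        Console.WriteLine(TryParsePair("12,345", out a, out b));     // True
        Console.WriteLine(a + " " + b);                              // 12 345
        Console.WriteLine(TryParsePair("12 and 345", out a, out b)); // False
    }
}
```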
With Regex, you can use this:
Regex regex = new Regex(@"\d+(?=,)|(?<=,)\d+");
var matches = regex.Matches("12,345");
foreach (Match itemMatch in matches)
    Console.WriteLine(itemMatch.Value);
prints:
12
345
This uses a look-ahead and a look-behind for the comma:
\d+(?=,)   <---- // Match digits followed by a ,
|          <---- // OR
(?<=,)\d+  <---- // Match digits preceded by a ,

How can I get a regex match to only be added once to the matches collection?

I have a string which has several html comments in it. I need to count the unique matches of an expression.
For example, the string might be:
var teststring = "<!--X1-->Hi<!--X1-->there<!--X2-->";
I currently use this to get the matches:
var regex = new Regex("<!--X.-->");
var matches = regex.Matches(teststring);
The results of this is 3 matches. However, I would like to have this be only 2 matches since there are only two unique matches.
I know I can probably loop through the resulting MatchCollection and remove the extra Match, but I'm hoping there is a more elegant solution.
Clarification: The sample string is greatly simplified from what is actually being used. There can easily be an X8 or X9, and there are likely dozens of each in the string.
I would just use the Enumerable.Distinct Method for example like this:
string subjectString = "<!--X1-->Hi<!--X1-->there<!--X2--><!--X1-->Hi<!--X1-->there<!--X2-->";
var regex = new Regex(@"<!--X\d-->");
var matches = regex.Matches(subjectString);
var uniqueMatches = matches
    .OfType<Match>()
    .Select(m => m.Value)
    .Distinct();
uniqueMatches.ToList().ForEach(Console.WriteLine);
Outputs this:
<!--X1-->
<!--X2-->
For regular expression, you could maybe use this one?
(<!--X\d-->)(?!.*\1.*)
Seems to work on your test string in RegexBuddy at least =)
// (<!--X\d-->)(?!.*\1.*)
//
// Options: dot matches newline
//
// Match the regular expression below and capture its match into backreference number 1 «(<!--X\d-->)»
//    Match the characters “<!--X” literally «<!--X»
//    Match a single digit 0..9 «\d»
//    Match the characters “-->” literally «-->»
// Assert that it is impossible to match the regex below starting at this position (negative lookahead) «(?!.*\1.*)»
//    Match any single character «.*»
//       Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
//    Match the same text as most recently matched by capturing group number 1 «\1»
//    Match any single character «.*»
//       Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
It appears you're doing two different things:
Matching comments like <!--X.-->
Finding the set of unique comments
So it is fairly logical to handle these as two different steps:
var regex = new Regex("<!--X.-->");
var matches = regex.Matches(teststring);
var uniqueMatches = matches.Cast<Match>().Distinct(new MatchComparer());
class MatchComparer : IEqualityComparer<Match>
{
    public bool Equals(Match a, Match b)
    {
        return a.Value == b.Value;
    }
    public int GetHashCode(Match match)
    {
        return match.Value.GetHashCode();
    }
}
Extract the comments and store them in an array. Then you can filter out the unique values.
But I don’t know how to implement this in C#.
Depending on how many Xn's you have you might be able to use:
(\<!--X1--\>){1}.*(\<!--X2--\>){1}
That will only match each occurrence of the X1, X2 etc. once provided they are in order.
Capture the inner portion of the comment as a group. Then put those strings into a hashtable(dictionary). Then ask the dictionary for its count, since it will self weed out repeats.
var teststring = "<!--X1-->Hi<!--X1-->there<!--X2-->";
var tokens = new Dictionary<string, string>();
// Lazy (.*?) so that each comment is matched on its own rather than as one long match
Regex.Replace(teststring, @"<!--(.*?)-->",
    match => {
        tokens[match.Groups[1].Value] = match.Groups[1].Value;
        return "";
    });
var uniques = tokens.Keys.Count;
By using the Regex.Replace construct you get to have a lambda called on each match. Since you are not interested in the replace, you don't set it equal to anything.
You must use Groups[1] because Groups[0] is the entire match.
I'm only repeating the same thing on both sides, so that its easier to put into the dictionary, which only stores unique keys.
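The same counting idea can be sketched with a HashSet<string>, which stores each value once without needing a key/value pair. UniqueCommentDemo and CountUnique are hypothetical names; the lazy (.*?) is an assumption made here so that each comment is matched separately.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class UniqueCommentDemo
{
    // Hypothetical helper: counts distinct comment bodies such as "X1" and "X2".
    public static int CountUnique(string text)
    {
        var seen = new HashSet<string>();
        foreach (Match m in Regex.Matches(text, @"<!--(.*?)-->"))
            seen.Add(m.Groups[1].Value); // Add ignores values already in the set
        return seen.Count;
    }

    public static void Main()
    {
        var teststring = "<!--X1-->Hi<!--X1-->there<!--X2-->";
        Console.WriteLine(CountUnique(teststring)); // 2
    }
}
```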
If you want a distinct Match list from a MatchCollection without converting to string, you can use something like this:
var distinctMatches = matchList.OfType<Match>().GroupBy(x => x.Value).Select(x =>x.First()).ToList();
I know it has been 12 years but sometimes we need this kind of solutions, so I wanted to share. C# evolved, .NET evolved, so it's easier now.
