Catastrophic backtracking; regular expression for extracting values in nested brackets - c#

I would like to extract aaa, bb b={{b}}bb bbb and {ccc} ccc from the following string using a regular expression:
zyx={aaa}, yzx={bb b={{b}}bb bbb}, xyz={{ccc} ccc}
Note: aaa represents an arbitrary sequence of characters, with no fixed length or pattern. For instance, {ccc} ccc could be {cccccccccc}cc {cc} cccc cccc, or any other combination.
I have written the following regular expression:
(?<a>[^{}]*)\s*=\s*{((?<v>[^{}]+)*)},*
This expression extracts aaa, but fails to parse the rest of the input with catastrophic backtracking failure, because of the nested curly-brackets.
Any thoughts on how I can update the regex to process the nested brackets correctly?
(Just in case, I am using C# .NET Core 3.0, if you need engine-specific options. Also, I would rather not do anything clever in code, but work with the regex pattern only.)
Similar question
The question "Regular expression to match balanced parentheses" is similar to this one, with one difference: here the parentheses are not necessarily balanced; rather, they follow the x={y} pattern.
Update 1
Inputs such as the following are also possible:
yzx={bb b={{b}},bb bbb,},
Note the , after {{b}} and after bbb.
Update 2
I wrote the following pattern; it can match anything but aaa from the first example:
(?<A>[^{}]*)\s*=\s*{(?<V>(?<S>([^{}]?)\{(?:[^}{]+|(?&S))+\}))}(,|$)

Regex.Matches, pretty good
"={(.*?)}(, |$)" could work.
string input = "zyx={aaa}, yzx={bb b={{b}}bb bbb}, yzx={bb b={{b}},bb bbb,}, xyz={{ccc} ccc}";
string pattern = "={(.*?)}(, |$)";
var matches = Regex.Matches(input, pattern)
.Select(m => m.Groups[1].Value)
.ToList();
foreach (var m in matches) Console.WriteLine(m);
Output
aaa
bb b={{b}}bb bbb
bb b={{b}},bb bbb,
{ccc} ccc
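As a rough sketch of how the lazy pattern above could be wrapped for reuse: the helper below also captures the key in front of each value and passes a match timeout, so a pattern that backtracks badly throws a RegexMatchTimeoutException instead of hanging. The class name, the method name, the named groups key and value, and the one-second limit are all assumptions made for illustration; it also assumes keys are made of word characters and pairs are separated by exactly ", ".

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class KeyValueSketch
{
    // Hypothetical helper: the lazy pattern from the answer plus a named key group.
    // The timeout (an assumed value) bounds the time spent on a single match.
    private static readonly Regex Pair = new Regex(
        @"(?<key>\w+)=\{(?<value>.*?)\}(?:, |$)",
        RegexOptions.None,
        TimeSpan.FromSeconds(1));

    public static List<KeyValuePair<string, string>> Extract(string input)
    {
        var result = new List<KeyValuePair<string, string>>();
        foreach (Match m in Pair.Matches(input))
        {
            result.Add(new KeyValuePair<string, string>(
                m.Groups["key"].Value, m.Groups["value"].Value));
        }
        return result;
    }

    public static void Main()
    {
        var input = "zyx={aaa}, yzx={bb b={{b}}bb bbb}, xyz={{ccc} ccc}";
        foreach (var pair in Extract(input))
        {
            Console.WriteLine($"{pair.Key} -> {pair.Value}");
        }
    }
}
```

Like the original pattern, this stops at the first "}" that is followed by ", " or the end of the input, so a value that itself contains "}, " would be cut short.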
Regex.Split, really good
I think for this job Regex.Split may be a better tool.
string input = "zyx={aaa}, yzx={bb b={{b}}bb bbb}, yzx={bb b={{b}},bb bbb,}, ttt={nasty{t, }, }, xyz={{ccc} ccc}, zzz={{{{{{{huh?}";
var matches2 = Regex.Split(input, "(^|, )[a-zA-Z]+=", RegexOptions.ExplicitCapture); // Or "(?:^|, )[a-zA-Z]+=" without the flag
Console.WriteLine("-------------------------"); // Adding this to show the empty element (see note below)
foreach (var m in matches2) Console.WriteLine(m);
Console.WriteLine("-------------------------");
Output
-------------------------

{aaa}
{bb b={{b}}bb bbb}
{bb b={{b}},bb bbb,}
{nasty{t, }, }
{{ccc} ccc}
{{{{{{{huh?}
-------------------------
Note: The first element of the result is an empty string (it prints as a blank line), because:
If a match is found at the beginning or the end of the input string, an empty string is included at the beginning or the end of the returned array.
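If only the values are wanted, the split result can be post-processed. This is a minimal sketch, not part of the original answer: the class and method names are made up, and it assumes every value is wrapped in one outer pair of braces that should be removed.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class SplitSketch
{
    // Hypothetical helper: split on the "key=" prefixes, drop the empty
    // leading element, then strip one outer pair of braces from each part.
    public static string[] ExtractValues(string input)
    {
        return Regex.Split(input, "(?:^|, )[a-zA-Z]+=")
            .Where(part => part.Length > 0)
            .Select(part => StripOuterBraces(part))
            .ToArray();
    }

    private static string StripOuterBraces(string part)
    {
        bool wrapped = part.Length >= 2
            && part[0] == '{'
            && part[part.Length - 1] == '}';
        return wrapped ? part.Substring(1, part.Length - 2) : part;
    }

    public static void Main()
    {
        var input = "zyx={aaa}, yzx={bb b={{b}}bb bbb}, yzx={bb b={{b}},bb bbb,}, xyz={{ccc} ccc}";
        foreach (var value in ExtractValues(input))
        {
            Console.WriteLine(value);
        }
    }
}
```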
Case 3
string input = "AAA={aaa}, BBB={bbb, bb{{b}}, bbb{b}}, CCC={ccc}, DDD={ddd}, EEE={00-99} ";
var matches2 = Regex.Split(input, "(?:^|, )[a-zA-Z]+="); // Or drop '?:' and use RegexOptions.ExplicitCapture
foreach (var m in matches2) Console.WriteLine(m);
{aaa}
{bbb, bb{{b}}, bbb{b}}
{ccc}
{ddd}
{00-99}

Related

Regex c# does not give the same result with https://regex101.com/ [duplicate]

I'm trying to extract values from a string which are between << and >>. But they could happen multiple times.
Can anyone help with the regular expression to match these;
this is a test for <<bob>> who like <<books>>
test 2 <<frank>> likes nothing
test 3 <<what>> <<on>> <<earth>> <<this>> <<is>> <<too>> <<much>>.
I then want to foreach the GroupCollection to get all the values.
Any help greatly received.
Thanks.
Use a positive look ahead and look behind assertion to match the angle brackets, use .*? to match the shortest possible sequence of characters between those brackets. Find all values by iterating the MatchCollection returned by the Matches() method.
Regex regex = new Regex("(?<=<<).*?(?=>>)");
foreach (Match match in regex.Matches(
"this is a test for <<bob>> who like <<books>>"))
{
Console.WriteLine(match.Value);
}
LiveDemo in DotNetFiddle
While Peter's answer is a good example of using lookarounds for left- and right-hand context checking, I'd like to also add a LINQ (lambda) way to access matches/groups and show the use of simple numeric capturing groups that come in handy when you want to extract only a part of the pattern:
using System.Linq;
using System.Collections.Generic;
using System.Text.RegularExpressions;
// ...
var results = Regex.Matches(s, @"<<(.*?)>>", RegexOptions.Singleline)
.Cast<Match>()
.Select(x => x.Groups[1].Value);
Same approach with Peter's compiled regex where the whole match value is accessed via Match.Value:
var results = regex.Matches(s).Cast<Match>().Select(x => x.Value);
Note:
<<(.*?)>> is a regex matching <<, then capturing any 0 or more chars as few as possible (due to the non-greedy *? quantifier) into Group 1 and then matching >>
RegexOptions.Singleline makes . match newline (LF) chars, too (it does not match them by default)
Cast<Match>() casts the match collection to an IEnumerable<Match> that you may further access using a lambda
Select(x => x.Groups[1].Value) only returns the Group 1 value from the current x match object
Note you may further create a list or array of the obtained values by adding .ToList() or .ToArray() after Select.
In the demo C# code, string.Join(", ", results) generates a comma-separated string of the Group 1 values:
var strs = new List<string> { "this is a test for <<bob>> who like <<books>>",
"test 2 <<frank>> likes nothing",
"test 3 <<what>> <<on>> <<earth>> <<this>> <<is>> <<too>> <<much>>." };
foreach (var s in strs)
{
var results = Regex.Matches(s, @"<<(.*?)>>", RegexOptions.Singleline)
.Cast<Match>()
.Select(x => x.Groups[1].Value);
Console.WriteLine(string.Join(", ", results));
}
Output:
bob, books
frank
what, on, earth, this, is, too, much
You can try one of these:
(?<=<<)[^>]+(?=>>)
(?<=<<)\w+(?=>>)
However you will have to iterate the returned MatchCollection.
Something like this:
(<<(?<element>[^>]*)>>)*
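One caveat with the pattern above: because the whole group is wrapped in (...)*, it can succeed with an empty match at the start of an input that does not begin with <<. A hedged sketch that keeps the named group but drops the outer repetition and iterates with Matches instead (class and method names are invented for the example):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class ElementSketch
{
    // Hypothetical helper: one match per <<...>> pair, read via the named group.
    public static List<string> Elements(string input)
    {
        return Regex.Matches(input, @"<<(?<element>[^>]*)>>")
            .Cast<Match>()
            .Select(m => m.Groups["element"].Value)
            .ToList();
    }

    public static void Main()
    {
        var line = "this is a test for <<bob>> who like <<books>>";
        Console.WriteLine(string.Join(", ", Elements(line)));
    }
}
```

Note that [^>]* will not match a value that itself contains a > character.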
This program might be useful:
http://sourceforge.net/projects/regulator/

REGEX help needed in c#

I am very new to regex and I am not sure what's going on with this one. A friend gave me this to solve my issue, but somehow it is not working.
string: department_name:womens AND item_type_keyword:base-layer-underwear
reg-ex: (department_name:([\\w-]+))?(item_type_keyword:([\\w-]+))?
desired output: array OR group
1st element should be: department_name:womens
2nd should be: womens
3rd: item_type_keyword:base-layer-underwear
4th: base-layer-underwear
Strings can contain department_name or item_type_keyword; neither is mandatory, and they can appear in any order.
C# Code
Regex regex = new Regex(@"(department_name:([\w-]+))?(item_type_keyword:([\w-]+))?");
Match match = regex.Match(query);
if (match.Success)
if (!String.IsNullOrEmpty(match.Groups[4].ToString()))
d1.ItemType = match.Groups[4].ToString();
This C# code only returns a string array with 3 elements:
1: department_name:womens
2: department_name:womens
3: womens
Somehow it duplicates the 1st and 2nd elements, and I don't know why. It also does not return the other elements that I expect.
Can someone help me, please?
When I test the regex online, it looks fine to me:
http://fiddle.re/crvw1
Thanks
You can use something like this to get the output you have in your question:
string txt = "department_name:womens AND item_type_keyword:base-layer-underwear";
var reg = new Regex(@"(?:department_name|item_type_keyword):([\w-]+)", RegexOptions.IgnoreCase);
var ms = reg.Matches(txt);
ArrayList results = new ArrayList();
foreach (Match match in ms)
{
results.Add(match.Groups[0].Value);
results.Add(match.Groups[1].Value);
}
// results is your final array containing all results
foreach (string elem in results)
{
Console.WriteLine(elem);
}
Prints:
department_name:womens
womens
item_type_keyword:base-layer-underwear
base-layer-underwear
match.Groups[0].Value gives the part that matched the pattern, while match.Groups[1].Value will give the part captured in the pattern.
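In your original code, match.Groups[0] (the whole match) and match.Groups[1] (the first capture group) both contain department_name:womens, which is why it appears twice. The item_type_keyword groups stay empty because nothing in the pattern can match the " AND " between the two parts, so the match ends after womens.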
Once you get the different elements, you should be able to put them in an array/list for further processing. (Added this part in edit)
The loop lets you iterate over every match, which you cannot do with if and .Match() (that combination is better suited to a single match). Because every match is processed, neither the order nor the number of matches matters.
ideone demo
(?:
department_name # Match department_name
| # Or
item_type_keyword # Match item_type_keyword
)
:
([\w-]+) # Capture \w and - characters
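The commented layout above can be used almost verbatim in C# by passing RegexOptions.IgnorePatternWhitespace, which makes the engine skip unescaped whitespace and treat # as the start of a comment. A minimal sketch; the class and method names are assumptions for illustration:

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class CommentedPatternSketch
{
    // The pattern from the breakdown, kept in its commented multi-line form.
    private static readonly Regex Field = new Regex(@"
        (?:
            department_name     # Match department_name
          |                     # Or
            item_type_keyword   # Match item_type_keyword
        )
        :
        ([\w-]+)                # Capture \w and - characters",
        RegexOptions.IgnorePatternWhitespace | RegexOptions.IgnoreCase);

    // Hypothetical helper: returns the captured value of every field found.
    public static List<string> Values(string query)
    {
        var values = new List<string>();
        foreach (Match m in Field.Matches(query))
        {
            values.Add(m.Groups[1].Value);
        }
        return values;
    }

    public static void Main()
    {
        var query = "department_name:womens AND item_type_keyword:base-layer-underwear";
        foreach (var value in Values(query))
        {
            Console.WriteLine(value);
        }
    }
}
```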
It's better to use the alternation (or logical OR) operator | because we don't know the order of the input string.
(department_name:([\w-]+))|(item_type_keyword:([\w-]+))
DEMO
String input = @"department_name:womens AND item_type_keyword:base-layer-underwear";
Regex rgx = new Regex(@"(?:(department_name:([\w-]+))|(item_type_keyword:([\w-]+)))");
foreach (Match m in rgx.Matches(input))
{
Console.WriteLine(m.Groups[1].Value);
Console.WriteLine(m.Groups[2].Value);
Console.WriteLine(m.Groups[3].Value);
Console.WriteLine(m.Groups[4].Value);
}
IDEONE
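With the alternation, each match fills only one pair of groups, so the loop above prints empty lines for the pair that did not take part. A possible variation, sketched here with invented class and method names, checks Group.Success and keeps only the groups that actually matched:

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class AlternationSketch
{
    private static readonly Regex Rgx = new Regex(
        @"(department_name:([\w-]+))|(item_type_keyword:([\w-]+))");

    // Hypothetical helper: collects only the groups that participated in a match.
    public static List<string> Parts(string input)
    {
        var parts = new List<string>();
        foreach (Match m in Rgx.Matches(input))
        {
            // Group 0 is the whole match, so start at 1.
            for (int i = 1; i < m.Groups.Count; i++)
            {
                if (m.Groups[i].Success)
                {
                    parts.Add(m.Groups[i].Value);
                }
            }
        }
        return parts;
    }

    public static void Main()
    {
        var input = "department_name:womens AND item_type_keyword:base-layer-underwear";
        foreach (var part in Parts(input))
        {
            Console.WriteLine(part);
        }
    }
}
```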
Another idea using a lookahead for capturing and getting all groups in one match:
^(?!$)(?=.*(department_name:([\w-]+))|)(?=.*(item_type_keyword:([\w-]+))|)
as a .NET String
"^(?!$)(?=.*(department_name:([\\w-]+))|)(?=.*(item_type_keyword:([\\w-]+))|)"
test at regexplanet (click on .NET); test at regex101.com
(add the m multiline modifier if the input is multiline: "(?m)^...")
If the parts are always separated by AND, OR, etc., you can use
(department_name:(.*?)) AND (item_type_keyword:(.*?)$)
•1: department_name:womens
•2: womens
•3: item_type_keyword:base-layer-underwear
•4: base-layer-underwear
(?=(department_name:\w+)).*?:([\w-]+)|(?=(item_type_keyword:.*)$).*?:([\w-]+)
Try this. This uses a lookahead to capture, then backtracks and captures again. See demo.
http://regex101.com/r/lS5tT3/52

Regex to find all matching statements

I have a string:
put 1 in pot put 2 in pot put 3 in pot...
up to
put n in pot
How can I use C# regex to obtain all put statements like:
"put 1 in pot"
"put 2 in pot"
"put 3 in pot"
...
"put n in pot"
for n statements?
Thanks
I probably shouldn't answer this as your question shows no effort at all, but I think a possible regex would be:
string regex = @"put (?<number>\d+) in pot";
Then you can match using:
var matches = Regex.Matches("Put 1 in pot put 2 in pot", @"put (?<number>\d+) in pot", RegexOptions.IgnoreCase);
foreach (Match match in matches)
{
Console.WriteLine(match.Value);
}
To find the actual number, you can use
int matchNumber = Convert.ToInt32(match.Groups["number"].Value);
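Putting the two pieces together, the numbers can be collected in one pass. This is only a sketch: the class and method names are made up, and it assumes every captured number fits in an int.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;
using System.Text.RegularExpressions;

public static class PutNumbersSketch
{
    // Hypothetical helper: returns the number from every "put N in pot" statement.
    public static List<int> Numbers(string input)
    {
        return Regex.Matches(input, @"put (?<number>\d+) in pot", RegexOptions.IgnoreCase)
            .Cast<Match>()
            .Select(m => int.Parse(m.Groups["number"].Value, CultureInfo.InvariantCulture))
            .ToList();
    }

    public static void Main()
    {
        var numbers = Numbers("Put 1 in pot put 2 in pot put 30 in pot");
        Console.WriteLine(string.Join(", ", numbers));
    }
}
```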
You can also do this
var reg=@"put.*?(?=put|$)";
List<string> puts=Regex.Matches(inp,reg,RegexOptions.Singleline)
.Cast<Match>()
.Select(x=>x.Value)
.ToList();
put.*?(?=put|$)
------ -------
   |      |
   |      |-> checks that `.*?` (0 to many characters) is followed by `put` or the end of the input
   |-> matches put followed by 0 to many characters
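Because the lazy .*? runs right up to the next put, every match except the last keeps the separating whitespace at its end. A small sketch that trims each statement (the class and method names are assumptions, not part of the answer):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class PutSplitSketch
{
    // Hypothetical helper: same lookahead pattern, with each match trimmed.
    public static List<string> Statements(string input)
    {
        return Regex.Matches(input, @"put.*?(?=put|$)", RegexOptions.Singleline)
            .Cast<Match>()
            .Select(m => m.Value.Trim())
            .ToList();
    }

    public static void Main()
    {
        foreach (var statement in Statements("put 1 in pot put 2 in pot put 3 in pot"))
        {
            Console.WriteLine("\"" + statement + "\"");
        }
    }
}
```

This approach splits on any occurrence of the letters put, so it would also break inside a word such as "input"; the more specific pattern from the previous answer avoids that.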

C# regular expression match

18.jun. 7 noči od 515,00 EUR
here I would like to get 515,00 with a regular expression.
Regex regularExpr = new Regex(@rule.RegularExpression,
RegexOptions.Compiled | RegexOptions.Multiline |
RegexOptions.IgnoreCase | RegexOptions.Singleline |
RegexOptions.IgnorePatternWhitespace);
tagValue.Value = "18.jun. 7 noči od 515,00 EUR";
Match match = regularExpr.Match(tagValue.Value);
object value = match.Groups[2].Value;
regex is: \d+((.\d+)+(,\d+)?)?
but I always get an empty string (""). If I try this regex in Expresso I get an array of 3 values and the third is 515,00.
What is wrong with my C# code that I get an empty string?
Your regex matches the 18 (since the decimal parts are optional), and match.Groups[2] refers to the second capturing parenthesis (.\d+) which should correctly read (\.\d+) and hasn't participated in the match, therefore the empty string is returned.
You need to correct your regex and iterate over the results:
StringCollection resultList = new StringCollection();
Regex regexObj = new Regex(@"\d+(?:[.,]\d+)?");
Match matchResult = regexObj.Match(subjectString);
while (matchResult.Success) {
resultList.Add(matchResult.Value);
matchResult = matchResult.NextMatch();
}
resultList[2] will then contain your match.
Make sure you escaped everything properly when you created the regular expression.
Regex re = new Regex("\d+((.\d+)+(,\d+)?)?")
is very different from
Regex re = new Regex(@"\d+((.\d+)+(,\d+)?)?")
You probably want the second.
I suspect the result you're getting in Expresso is equivalent to this:
string s = "18.jun. 7 noči od 515,00 EUR";
Regex r = new Regex(@"\d+((.\d+)+(,\d+)?)?");
foreach (Match m in r.Matches(s))
{
Console.WriteLine(m.Value);
}
In other words, it's not the contents of the second capturing group you're seeing, it's the third match. This code shows it more clearly:
Console.WriteLine("{0,10} {1,10} {2,10} {3,10}",
@"Group 0", @"Group 1", @"Group 2", @"Group 3");
Regex r = new Regex(@"\d+((.\d+)+(,\d+)?)?");
foreach (Match m in r.Matches(s))
{
Console.WriteLine("{0,10} {1,10} {2,10} {3,10}",
m.Groups[0].Value, m.Groups[1].Value, m.Groups[2].Value, m.Groups[3].Value);
}
output:
   Group 0    Group 1    Group 2    Group 3
        18
         7
    515,00        ,00        ,00
On to the regex itself. If you want to match only the price and not those other numbers, you need to be more specific. For example, if you know the ,00 part will always be present, you can use this regex:
@"(?n)\b\d+(\.\d+)*(,\d+)\b"
(?n) is the inline form of the ExplicitCapture option, which turns those two capturing groups into non-capturing groups. Of the RegexOptions you did specify, the only one that has any effect is Compiled, which speeds up matching of the regex slightly, at the expense of slowing down its construction and hogging memory. \b is a word boundary.
It looks like you're applying all those modifiers blindly to every regex when you construct them, which is not a good idea. If a particular regex needs a certain modifier, you should try to specify it in the regex itself with an inline modifier, like I did with (?n).
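To go from the matched text to a number, the (?n) pattern above can be combined with a number format that uses a comma as the decimal separator. This is a minimal sketch under assumptions: the class and method names are invented, and the hand-built NumberFormatInfo (comma for decimals, dot for thousands) is a guess at the input's convention rather than anything stated in the question.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

public static class PriceSketch
{
    // Assumed number convention: "1.234,50" means 1234.50.
    private static readonly NumberFormatInfo Format = new NumberFormatInfo
    {
        NumberDecimalSeparator = ",",
        NumberGroupSeparator = "."
    };

    // Hypothetical helper: returns the first price found, or null if there is none.
    public static decimal? FindPrice(string text)
    {
        Match m = Regex.Match(text, @"(?n)\b\d+(\.\d+)*(,\d+)\b");
        if (!m.Success)
        {
            return null;
        }
        return decimal.Parse(m.Value, NumberStyles.Number, Format);
    }

    public static void Main()
    {
        decimal? price = FindPrice("18.jun. 7 noči od 515,00 EUR");
        Console.WriteLine(price.HasValue
            ? price.Value.ToString(CultureInfo.InvariantCulture)
            : "no price found");
    }
}
```

Because the pattern requires the ",digits" part, the 18 and the 7 in the sample text are skipped and only 515,00 is matched.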
