Greedy regex finding latest brace - c#

I'm trying to parse some variable definitions to extract documentation automatically, but I'm having trouble skipping the } characters that can appear in the default value.
Here's a sample...
variable "a" {
type = string
description = "A desc"
default = ""
}
variable "b" {
type = map()
description = "B desc"
default = {}
}
variable "c" {
type = list(string)
description = "C desc"
default = []
}
And the regex I'm using
variable.\"(?<name>\w+)\"(.*?)description.=."(?<desc>[^"\\]*(?:\\.[^"\\]*)*)"(.*?)}
with a replace of
* `${name}`: ${desc}
This gives the output
* `a`: A desc
* `b`: B desc
}
* `c`: C desc
I need the regex to be in single line mode and non-greedy so it stays within each variable definition, but then I can't seem to stop it matching on the first trailing } it finds. It would be good if it could match ^}, but again we are in single line mode, so that doesn't apply.

See if this will work for your dataset:
variable.\"(?<name>\w+)\"(.*?)description.=."(?<desc>[^"\\]*(?:\\.[^"\\]*)*)"(.*?)(?=[^}]+?variable|$)
Here I replaced } at the end with (?=[^}]+?variable|$). This should ensure that the last capturing group will keep consuming characters until there are no more closing braces before the next variable (or the end of the input).
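As a rough illustration, the pattern above could be applied to the sample from the question as follows. This is only a sketch: the class name VariableDocs and the method ToBulletList are made up for the example, and the input is rebuilt with "\n" line breaks and no trailing newline.

```csharp
using System;
using System.Text.RegularExpressions;

public static class VariableDocs
{
    // The pattern from the answer above. Inside a verbatim string, "" is one literal quote.
    private const string Pattern =
        @"variable.""(?<name>\w+)""(.*?)description.=.""(?<desc>[^""\\]*(?:\\.[^""\\]*)*)""(.*?)(?=[^}]+?variable|$)";

    // Hypothetical helper: replaces each variable block with a "* `name`: description" bullet.
    public static string ToBulletList(string source)
    {
        // Singleline lets '.' match newlines, so one match can span a whole block.
        return Regex.Replace(source, Pattern, "* `${name}`: ${desc}", RegexOptions.Singleline);
    }

    public static void Main()
    {
        string input = string.Join("\n", new[]
        {
            "variable \"a\" {", "type = string", "description = \"A desc\"", "default = \"\"", "}",
            "variable \"b\" {", "type = map()", "description = \"B desc\"", "default = {}", "}",
            "variable \"c\" {", "type = list(string)", "description = \"C desc\"", "default = []", "}"
        });

        Console.WriteLine(ToBulletList(input));
        // * `a`: A desc
        // * `b`: B desc
        // * `c`: C desc
    }
}
```

Note that the lookahead relies on the literal word variable, so a description or default value containing that word could end a match early.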

You can match the variable and the description values and match all lines in between that do not start with } using a negative lookahead.
variable\s*"(?<name>\w+)"\s*{(?:(?!\r?\n}|\bdescription\b).)*description\s*=\s*"(?<desc>[^"]*(?:\\.[^"]*)*)"(?:(?!\r?\n}).)*\r?\n}
Explanation
variable\s*" Match variable and then "
(?<name>\w+) Group name, match 1+ word chars
"\s*{ Match optional whitespace chars and {
(?:(?!\r?\n}|\bdescription\b).)* Match any char using the dot (Single line mode) when what is directly to the right is not a newline and } or description
description\s*=\s*" match description=" with optional whitespace chars around the = and then "
(?<desc>[^"]*(?:\\.[^"]*)*) Named group desc to capture the description
" Match the closing "
(?:(?!\r?\n}).)* Match any char (Using the Single line mode) when what is directly to the right is not a newline and }
\r?\n} Match a newline and }
It's quite verbose, but a bit more optimized pattern might be
variable\s*"(?<name>\w+)"\s*{[^}d]*(?>}(?<!\r?\n.)[^}]*|(?!\bdescription\s*=\s*"[^"]*")d[^d]*)*\bdescription\s*=\s*"(?<desc>[^"]*(?:\\.[^"]*)*)"[^}]*(?:}(?<!\r?\n.)[^}]*)*\r?\n}
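For reference, here is a small sketch of how the first (verbose) pattern from this answer might be used to collect name/description pairs instead of doing a replace. The names VariableScanner and Scan are invented for the example, and the sample assumes the flat layout from the question, where the only } at the start of a line is the one closing the block.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class VariableScanner
{
    // The verbose pattern from the answer, split over two verbatim strings for readability.
    private static readonly Regex Block = new Regex(
        @"variable\s*""(?<name>\w+)""\s*{(?:(?!\r?\n}|\bdescription\b).)*" +
        @"description\s*=\s*""(?<desc>[^""]*(?:\\.[^""]*)*)""(?:(?!\r?\n}).)*\r?\n}",
        RegexOptions.Singleline);

    // Hypothetical helper: returns one (name, description) pair per variable block.
    public static List<KeyValuePair<string, string>> Scan(string source)
    {
        var found = new List<KeyValuePair<string, string>>();
        foreach (Match m in Block.Matches(source))
        {
            found.Add(new KeyValuePair<string, string>(
                m.Groups["name"].Value, m.Groups["desc"].Value));
        }
        return found;
    }

    public static void Main()
    {
        string input = string.Join("\n", new[]
        {
            "variable \"a\" {", "type = string", "description = \"A desc\"", "default = \"\"", "}",
            "variable \"b\" {", "type = map()", "description = \"B desc\"", "default = {}", "}",
            "variable \"c\" {", "type = list(string)", "description = \"C desc\"", "default = []", "}"
        });

        foreach (var pair in Scan(input))
        {
            Console.WriteLine(pair.Key + ": " + pair.Value);
        }
        // a: A desc
        // b: B desc
        // c: C desc
    }
}
```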


Regex to match string between curly braces (that allows to escape them via 'doubling')

I was using the regex from Extract values within single curly braces:
(?<!{){[^{}]+}(?!})
However, it does not cover use case #3 (see below).
I would like to know if it's possible to define a regular expression that satisfies the use cases below.
Use case 1
Given:
Hola {name}
It should match {name} and capture name
Use case 2
I would also like to be able to escape curly braces when needed by doubling them, like C# does for interpolated strings. So, in a string like:
Hola {name}, this will be {{unmatched}}
The {{unmatched}} part should be ignored because it uses them doubled. Notice the {{ and }}.
Use case 3
In the last, most complex case, a text like this:
Buenos {{{dias}}}
The text {dias} should be a match (and capture dias) because the first outer-most doubled curly braces should be interpreted just like another character (they are escaped) so it should match: {{{dias}}}
My ultimate goal is to replace the matches later with another string, like a variable.
EDIT
This 4th use case pretty much summarized the whole requirements:
Given:
Hola {name}, buenos {{{dias}}}
Results in:
Match 1:
Matched text: {name}
Captured text: name
Match 2:
Matched text: {dias}
Captured text: dias
To optionally match doubled curly braces, you could use a conditional (if clause) and take the value from capture group 2.
(?<!{)({{)?{([^{}]+)}(?(1)}})(?!})
Explanation
(?<!{) Assert not { directly to the left
({{)? Optionally capture {{ in group 1
{([^{}]+)} Match from { till } without matching { and } in between
(?(1)}}) If clause, if group 1 exists, match }}
(?!}) Assert not } directly to the right
string pattern = @"(?<!{)({{)?{([^{}]+)}(?(1)}})(?!})";
string input = @"Hola {name}
Hola {name}, this will be {{unmatched}}
Buenos {{{dias}}}";
foreach (Match m in Regex.Matches(input, pattern))
{
    // Group 2 holds the text between the single pair of braces
    Console.WriteLine(m.Groups[2].Value);
}
Output
name
name
dias
If the doubled curly braces should be balanced, you might use this approach:
(?<!{){(?>(?<={){{(?<c>)|([^{}]+)|}}(?=})(?<-c>))*(?(c)(?!))}(?!})
You can use
(?<!{)(?:{{)*{([^{}]*)}(?:}})*(?!})
In C#, you can use
var results = Regex.Matches(text, @"(?<!{)(?:{{)*{([^{}]*)}(?:}})*(?!})").Cast<Match>().Select(x => x.Groups[1].Value).ToList();
Alternatively, to get full matches, wrap the left- and right-hand contexts in lookarounds:
(?<=(?<!{)(?:{{)*{)[^{}]*(?=}(?:}})*(?!}))
In C#:
var results = Regex.Matches(text, @"(?<=(?<!{)(?:{{)*{)[^{}]*(?=}(?:}})*(?!}))")
    .Cast<Match>()
    .Select(x => x.Value)
    .ToList();
Regex details
(?<=(?<!{)(?:{{)*{) - immediately to the left, there must be zero or more {{ substrings not immediately preceded with a { char and then {
[^{}]* - zero or more chars other than { and }
(?=}(?:}})*(?!})) - immediately to the right, there must be }, zero or more }} substrings not immediately followed with a } char.
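Since the stated goal is to replace the matches with other strings later, here is a hedged sketch of how that might look. It uses a variation of the pattern above with two extra capture groups (my addition, not part of the answers) so that the doubled braces around a placeholder are kept as they are. The class name Placeholders, the method Fill and the sample values are all hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class Placeholders
{
    // Group 1: leading doubled braces, group 2: placeholder name, group 3: trailing doubled braces.
    private static readonly Regex Placeholder =
        new Regex(@"(?<!{)((?:{{)*){([^{}]*)}((?:}})*)(?!})");

    // Hypothetical helper: substitutes known placeholders and leaves unknown ones untouched.
    public static string Fill(string template, IDictionary<string, string> values)
    {
        return Placeholder.Replace(template, m =>
        {
            string replacement;
            if (!values.TryGetValue(m.Groups[2].Value, out replacement))
            {
                return m.Value; // unknown name: keep the original text
            }
            // Keep the escaped (doubled) braces around the substituted value.
            return m.Groups[1].Value + replacement + m.Groups[3].Value;
        });
    }

    public static void Main()
    {
        var values = new Dictionary<string, string>
        {
            { "name", "Ana" },
            { "dias", "Monday" }
        };

        Console.WriteLine(Fill("Hola {name}, this will be {{unmatched}} and buenos {{{dias}}}", values));
        // Hola Ana, this will be {{unmatched}} and buenos {{Monday}}
    }
}
```

The sketch leaves the doubled braces in the output; turning {{ and }} back into single braces would be a separate step.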

Using RegEx in c# to check for valid characters

I'm having a hard time understanding regex. I have a scenario where valid characters are a-z, A-Z, 0-9 and a space. So when I try to create a regex for invalid characters, I have this: [^a-zA-Z0-9 ].
Then I have strings that I want to search based on the RegEx and when it finds an invalid character, it checks if the character before it is invalid.
for example, "test test +?test"
So what I want to happen is if there are two invalid characters, one after the other, do nothing otherwise insert a '£'. So the string above will be fine, no changes. However, the string, "test test £test", should be changed to "test test ££test".
This is my code..
public string HandleInvalidChars(string message)
{
    const string methodName = "HandleInvalidChars";
    Regex specialChars = new Regex("[^a-zA-Z0-9 ]");
    string strSpecialChars = specialChars.ToString();
    //prev character in string which we are going to check
    string prevChar;
    Match match = specialChars.Match(message);
    while (match.Success)
    {
        //get position of special character
        int position = match.Index;
        // get character before special character
        prevChar = message.Substring(position - 1, 1);
        //check if next character is a special character, if not insert ? escape character
        try
        {
            if (!Regex.IsMatch(prevChar, strSpecialChars))
            {
                message = message.Insert(position, "?");
            }
        }
        catch (Exception ex)
        {
            _logger.ErrorFormat("{0}: ApplicationException: {1}", methodName, ex);
            return message;
        }
        match = match.NextMatch();
        //loop through remainder of string until last character
    }
    return message;
}
When I test it on the first string it handles the first invalid char, '+', ok but it falls over when it reaches '£'.
Any help is really appreciated.
Thanks :)
What if you changed the regex to something like the one below, to check only for those cases with one special character and not with two?
[a-zA-Z0-9 ]{0,1}[^a-zA-Z0-9 ][a-zA-Z0-9 ]{0,1}
Also, I would create a new variable for the return value. As far as I can see, you keep changing the original string in which you are looking for matches.
I believe you have overthought it a bit. All you need is to find a forbidden char that is neither preceded nor followed by another forbidden char.
Declare
public string HandleInvalidChars(string message)
{
    var pat = @"(?<![^A-Za-z0-9 ])[^A-Za-z0-9 ](?![^A-Za-z0-9 ])";
    return Regex.Replace(message, pat, "£$&");
}
and use:
Console.WriteLine(HandleInvalidChars("test test £test"));
// => test test ££test
Console.WriteLine(HandleInvalidChars("test test +?test"));
// => test test +?test
Details
(?<![^A-Za-z0-9 ]) - a negative lookbehind that fails the match if there is a char other than an ASCII letter/digit or space immediately to the left of the current location
[^A-Za-z0-9 ] - a char other than an ASCII letter/digit or space
(?![^A-Za-z0-9 ]) - a negative lookahead that fails the match if there is a char other than an ASCII letter/digit or space immediately to the right of the current location.
The replacement string contains $&, a substitution that refers to the whole match value. Thus, using "£$&" we insert a £ before the match.

Regex pattern for splitting a delimited string in curly braces

I have the following string
{token1;token2;token3@somewhere.com;...;tokenn}
I need a Regex pattern, that would give a result in array of strings such as
token1
token2
token3@somewhere.com
...
...
...
tokenn
I would also appreciate a suggestion on whether the same pattern can be used to confirm the format of the string, meaning the string should start and end with curly braces and contain at least 2 values between them.
You may use an anchored regex with named repeated capturing groups:
\A{(?<val>[^;]*)(?:;(?<val>[^;]*))+}\z
\A - start of string
{ - a {
(?<val>[^;]*) - Group "val" capturing 0+ (due to * quantifier, if the value cannot be empty, use +) chars other than ;
(?:;(?<val>[^;]*))+ - 1 or more occurrences (thus, requiring at least 2 values inside {...}) of the sequence:
; - a semi-colon
(?<val>[^;]*) - Group "val" capturing 0+ chars other than ;
} - a literal }
\z - end of string.
.NET regex keeps each capture in a CaptureCollection stack, which is why all the values captured into the "val" group can be accessed after a match is found.
C# demo:
var s = "{token1;token2;token3;...;tokenn}";
var pat = @"\A{(?<val>[^;]*)(?:;(?<val>[^;]*))+}\z";
var caps = new List<string>();
var result = Regex.Match(s, pat);
if (result.Success)
{
    // Every capture of the repeated "val" group is kept, not just the last one
    caps = result.Groups["val"].Captures.Cast<Capture>().Select(t => t.Value).ToList();
}
Read this (it is similar to your problem): How to keep the delimiters of Regex.Split?
For your regex testing, use this: http://www.regexlib.com/RETester.aspx?AspxAutoDetectCookieSupport=1.
But a regex is a comparatively resource-intensive, slow operation for a task this simple.
In your case it would be better to use the Split method of the string class, for example: "token1;token2;token3;...;tokenn".Split(';'). It returns an array containing the strings you want to obtain.
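A minimal sketch of that Split-based approach, extended with the format check the question asks for (curly braces at both ends and at least two values). The names TokenParser and TryParseTokens are made up for illustration.

```csharp
using System;

public static class TokenParser
{
    // Hypothetical helper: validates "{a;b;...}" and returns the values between the braces.
    public static bool TryParseTokens(string s, out string[] tokens)
    {
        tokens = new string[0];

        // The string must start with '{' and end with '}'.
        if (s == null || s.Length < 2 || s[0] != '{' || s[s.Length - 1] != '}')
        {
            return false;
        }

        // Strip the braces and split on the delimiter.
        string[] parts = s.Substring(1, s.Length - 2).Split(';');

        // Require at least two values.
        if (parts.Length < 2)
        {
            return false;
        }

        tokens = parts;
        return true;
    }

    public static void Main()
    {
        string[] tokens;
        if (TryParseTokens("{token1;token2;token3@somewhere.com}", out tokens))
        {
            foreach (string token in tokens)
            {
                Console.WriteLine(token);
            }
        }
        // token1
        // token2
        // token3@somewhere.com

        Console.WriteLine(TryParseTokens("{single}", out tokens)); // False
    }
}
```

Like the [^;]* groups in the regex answer, this sketch accepts empty values such as "{a;;b}"; rejecting them would need an extra check.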

Regex to find special pattern

I have a string to parse. First I have to check if string contains special pattern:
I wanted to know if there is substrings which starts with "$(",
and end with ")",
and between those start and end special strings,there should not be
any white-empty space,
it should not include "$" character inside it.
I have a little regex for it in C#
string input = "$(abc)";
string pattern = @"\$\(([^$][^\s]*)\)";
Regex rgx = new Regex(pattern, RegexOptions.IgnoreCase);
MatchCollection matches = rgx.Matches(input);
foreach (var match in matches)
{
    Console.WriteLine("value = " + match);
}
It works for many cases but fails for the input $(a$(), where the inner expression is empty. I do NOT want a match when the input is $() (that is, when there is nothing between the start and end identifiers).
What is wrong with my regex?
Note: [^$] matches any single character except $
Use the below regex if you want to match $()
\$\(([^\s$]*)\)
Use the below regex if you don't want to match $(),
\$\(([^\s$]+)\)
* repeats the preceding token zero or more times.
+ Repeats the preceding token one or more times.
Your regex \$\(([^$][^\s]*)\) is wrong. It won't allow $ as the first character inside () but it allows it as the second, third, and so on. You need to combine the negated classes in your regex in order to match any character that is neither whitespace nor $.
Your current regex does not match $() because the [^$] matches at least 1 character. The only way I can think of where you would have this match would be when you have an input containing more than one parens, like:
$()(something)
In those cases, you will also need to exclude at least the closing paren:
string pattern = @"\$\(([^$\s)]+)\)";
The above matches for example:
abc in $(abc) and
abc and def in $(def)$()$(abc)(something).
Simply replace the * with a + and merge the options.
string pattern = @"\$\(([^$\s]+)\)";
+ means 1 or more
* means 0 or more
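To make the difference between the suggested patterns concrete, the following sketch runs two of them over the same inputs: the version with [^$\s]+ and the version that also excludes the closing paren. The class DollarParens and the helper Extract are names invented for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class DollarParens
{
    // Hypothetical helper: returns the content of capture group 1 for every match.
    public static List<string> Extract(string input, string pattern)
    {
        var values = new List<string>();
        foreach (Match m in Regex.Matches(input, pattern))
        {
            values.Add(m.Groups[1].Value);
        }
        return values;
    }

    public static void Main()
    {
        const string loose = @"\$\(([^$\s]+)\)";   // excludes $ and whitespace
        const string strict = @"\$\(([^$\s)]+)\)"; // additionally excludes ')'

        string input = "$(def)$()$(abc)(something)";

        // Without ')' in the negated class, the greedy + runs past the first closing paren.
        Console.WriteLine(string.Join(" | ", Extract(input, loose)));  // def | abc)(something
        Console.WriteLine(string.Join(" | ", Extract(input, strict))); // def | abc

        // The input from the question no longer produces a match.
        Console.WriteLine(Extract("$(a$()", strict).Count);            // 0
    }
}
```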

Trouble creating a Regex expression

I'm trying to create a regex that will accept a certain format of command. The pattern is as follows:
It can start with a $ followed by two characters from 0-9, A-F, a-f (i.e. $00 - $FF)
or
It can be any value except for "&<>'/"
*If the value starts with $, the next two characters need to be a valid hex value from 00-ff
So far I have this
Regex correctValue = new Regex("($[0-9a-fA-F][0-9a-fA-F])");
Any help will be greatly appreciated!
You just need to add a "\" before your "$" and it works:
string input = "$00";
Match m = Regex.Match(input, @"^\$[0-9a-fA-F][0-9a-fA-F]$");
if (m.Success)
{
    foreach (Group g in m.Groups)
        Console.WriteLine(g.Value);
}
else
    Console.WriteLine("Didn't match");
If I'm following you correctly, the net result you're looking for is any value that is not in the list "&<>'/", since any combination of $ and two alphanumeric characters would also not be in that list. Thus you could make your expression:
Regex correctValue = new Regex("[^&<>'/]");
Update: But just in case you do need to know how to properly match the $00 - $FF, this would do the trick:
Regex correctValue = new Regex(@"\$[0-9A-Fa-f]{2}");
In a regular expression, $ is an anchor assertion and means:
The match must occur at the end of the string or before \n at the end of the line or string.
Try using [$] (a character class with a single character) or \$ (a character escape) instead.
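The effect of escaping the $ can be checked with a short sketch. The combined pattern in IsValidCommand is only one possible reading of the rules in the question (either $ plus two hex digits, or text containing none of $ & < > ' /); the class and method names are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CommandValidator
{
    // Assumed interpretation of the rules: "$" + two hex digits, or text without $ & < > ' /
    private static readonly Regex Valid =
        new Regex(@"^(?:\$[0-9A-Fa-f]{2}|[^$&<>'/]+)$");

    public static bool IsValidCommand(string value)
    {
        return Valid.IsMatch(value);
    }

    public static void Main()
    {
        // Unescaped $ is an end-of-input anchor, so nothing can follow it.
        Console.WriteLine(Regex.IsMatch("$00", "$[0-9a-fA-F][0-9a-fA-F]"));   // False
        // Escaped with a backslash or wrapped in a character class, it is a literal $.
        Console.WriteLine(Regex.IsMatch("$00", @"\$[0-9a-fA-F][0-9a-fA-F]")); // True
        Console.WriteLine(Regex.IsMatch("$00", "[$][0-9a-fA-F][0-9a-fA-F]")); // True

        Console.WriteLine(IsValidCommand("$FF"));   // True
        Console.WriteLine(IsValidCommand("$FG"));   // False
        Console.WriteLine(IsValidCommand("hello")); // True
        Console.WriteLine(IsValidCommand("a<b"));   // False
    }
}
```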
