Regex recursive substitutions - c#

I have 3 case of data:
{{test_data}}
{{!test_data}}
{{test_data1&&!test_data2}} // test_data2 might not have the !
and I need to translate those strings with:
mystring.test_data
!mystring.test_data
mystring.test_data1 && !mystring.test_data2
I'm fiddling around with the super-useful regex101.com and i managed to cover almost all of 3 cases with Regex.Replace(str, "{{2}(?:(!?)(\w*)(\|{2}|&{2})?)}{2}", "$1mystring.$2 $3");
I can't figure out how to use regex recursion to re-apply the (?: ) part until the }} and join together all the matches using the specified substitution pattern
Is that even possible??
edit: here's the regex101 page -> https://regex101.com/r/vIBVkQ/2

I would advise to use a more generic solution here, with smaller, easier to read and maintain regexps here: one (the longest) will be used to find the substrings you need (the longest one), then a simple \w+ pattern will be used to add the my_string. part and the other will add spaces around logical operators. The smaller regexps will be used inside a match evaluator, to manipulate the values found by the longest regex:
Regex.Replace(input, #"{{!?\w+(?:\s*(?:&&|\|\|)\s*!?\w+)*}}", m =>
Regex.Replace(
Regex.Replace(m.Value, #"\s*(&&|\|\|)\s*", " $1 "),
#"\w+",
"mystring.$&"
)
)
See the C# demo
The main regex matches:
{{ - a {{ substring
!? - an optional ! sign
\w+ - 1 or more word chars
(?:\s*(?:&&|\|\|)\s*!?\w+)* - 0+ sequences of:
\s* - 0+ whitespace chars
(?:&&|\|\|) - a && or || substring
\s* - 0+ whitespaces
!? - an optional !
\w+ - 1 or more word chars
}} - a }} substring.

Regex: (?:{{2}|[^|]{2}|[^&]{2})\!?(\w+)(?:}{2})?
Regex demo
C# code:
List<string> list = new List<string>() { "{{test_data}}", "{{!test_data}}", "{{test_data1&&!test_data2}}" };
foreach(string s in list)
{
string t = Regex.Replace(s, #"(?:{{2}|[^|]{2}|[^&]{2})\!?(\w+)(?:}{2})?",
o => o.Value.Contains("!") ? "!mystring." + o.Groups[1].Value : "mystring." + o.Groups[1].Value);
Console.WriteLine(t);
}
Console.ReadLine();
Output:
mystring.test_data
!mystring.test_data
mystring.test_data1&&!mystring.test_data2

I don't think you can use recursion, but with a different representation of your input pattern, you can use sub-groups. Note I used named captures to slightly limit the confusion in this example:
var test = #"{{test_data}}
{{!test_data}}
{{test_data1&&!test_data2&&test_data3}}
{{test_data1&&!test_data2 fail test_data3}}
{{test_data1&&test_data2||!test_data3}}";
// (1:!)(2:word)(3:||&&)(4:repeat)
var matches = Regex.Matches(test, #"\{{2}(?:(?<exc>!?)(?<word>\w+))(?:(?<op>\|{2}|&{2})(?<exc2>!?)(?<word2>\w+))*}{2}");
foreach (Match match in matches)
{
Console.WriteLine("Match: {0}", match.Value);
Console.WriteLine(" exc: {0}", match.Groups["exc"].Value);
Console.WriteLine(" word: {0}", match.Groups["word"].Value);
for (int i = 0; i < match.Groups["op"].Captures.Count; i++)
{
Console.WriteLine(" op: {0}", match.Groups["op"].Captures[i].Value);
Console.WriteLine(" exc2: {0}", match.Groups["exc2"].Captures[i].Value);
Console.WriteLine("word2: {0}", match.Groups["word2"].Captures[i].Value);
}
}
The idea is to read the first word in each group unconditionally and then possibly read N combinations of (|| or &&)(optional !)(word) as separate groups with sub-captures.
Example output:
Match: {{test_data}}
exc:
word: test_data
Match: {{!test_data}}
exc: !
word: test_data
Match: {{test_data1&&!test_data2&&test_data3}}
exc:
word: test_data1
op: &&
exc2: !
word2: test_data2
op: &&
exc2:
word2: test_data3
Match: {{test_data1&&test_data2||!test_data3}}
exc:
word: test_data1
op: &&
exc2:
word2: test_data2
op: ||
exc2: !
word2: test_data3
Note the line {{test_data1&&!test_data2 fail test_data3}} is not part of the result groups because it doesn't comply with the syntax rules.
So you can build your desired result the same way from the matches structure:
foreach (Match match in matches)
{
var sb = new StringBuilder();
sb.Append(match.Groups["exc"].Value).Append("mystring.").Append(match.Groups["word"].Value);
for (int i = 0; i < match.Groups["op"].Captures.Count; i++)
{
sb.Append(' ').Append(match.Groups["op"].Captures[i].Value).Append(' ');
sb.Append(match.Groups["exc2"].Value).Append("mystring.").Append(match.Groups["word2"].Value);
}
Console.WriteLine("Result: {0}", sb.ToString());
}

Related

Problem with brackets in regular expression in C#

can anybody help me with regular expression in C#?
I want to create a pattern for this input:
{a? ab 12 ?? cd}
This is my pattern:
([A-Fa-f0-9?]{2})+
The problem are the curly brackets. This doesn't work:
{(([A-Fa-f0-9?]{2})+)}
It just works for
{ab}
I would use {([A-Fa-f0-9?]+|[^}]+)}
It captures 1 group which:
Match a single character present in the list below [A-Fa-f0-9?]+
Match a single character not present in the list below [^}]+
If you allow leading/trailing whitespace within {...} string, the expression will look like
{(?:\s*([A-Fa-f0-9?]{2}))+\s*}
See this regex demo
If you only allow a single regular space only between the values inside {...} and no space after { and before }, you can use
{(?:([A-Fa-f0-9?]{2})(?: (?!}))?)+}
See this regex demo. Note this one is much stricter. Details:
{ - a { char
(?:\s*([A-Fa-f0-9?]{2}))+ - one or more occurrences of
\s* - zero or more whitespaces
([A-Fa-f0-9?]{2}) - Capturing group 1: two hex or ? chars
\s* - zero or more whitespaces
} - a } char.
See a C# demo:
var text = "{a? ab 12 ?? cd}";
var pattern = #"{(?:([A-Fa-f0-9?]{2})(?: (?!}))?)+}";
var result = Regex.Matches(text, pattern)
.Cast<Match>()
.Select(x => x.Groups[1].Captures.Cast<Capture>().Select(m => m.Value))
.ToList();
foreach (var list in result)
Console.WriteLine(string.Join("; ", list));
// => a?; ab; 12; ??; cd
If you want to capture pairs of chars between the curly's, you can use a single capture group:
{([A-Fa-f0-9?]{2}(?: [A-Fa-f0-9?]{2})*)}
Explanation
{ Match {
( Capture group 1
[A-Fa-f0-9?]{2} Match 2 times any of the listed characters
(?: [A-Fa-f0-9?]{2})* Optionally repeat a space and again 2 of the listed characters
) Close group 1
} Match }
Regex demo | C# demo
Example code
string pattern = #"{([A-Fa-f0-9?]{2}(?: [A-Fa-f0-9?]{2})*)}";
string input = #"{a? ab 12 ?? cd}
{ab}";
foreach (Match m in Regex.Matches(input, pattern))
{
Console.WriteLine(m.Groups[1].Value);
}
Output
a? ab 12 ?? cd
ab

Replacing mutiple occurrences of string using string builder by regex pattern matching

We are trying to replace all matching patterns (regex) in a string builder with their respective "groups".
Firstly, we are trying to find the count of all occurrences of that pattern and loop through them (count - termination condition). For each match we are assigning the match object and replace them using their respective groups.
Here only the first occurrence is replaced and the other matches are never replaced.
*str* - contains the actual string
Regex - ('.*')\s*=\s*(.*)
To match pattern:
'nam_cd'=isnull(rtrim(x.nam_cd),''),
'Company'=isnull(rtrim(a.co_name),'')
Pattern : created using https://regex101.com/
*matches.Count* - gives the correct count (here 2)
String pattern = #"('.*')\s*=\s*(.*)";
MatchCollection matches = Regex.Matches(str, pattern);
StringBuilder sb = new StringBuilder(str);
Match match = Regex.Match(str, pattern);
for (int i = 0; i < matches.Count; i++)
{
String First = String.Empty;
Console.WriteLine(match.Groups[0].Value);
Console.WriteLine(match.Groups[1].Value);
First = match.Groups[2].Value.TrimEnd('\r');
First = First.Trim();
First = First.TrimEnd(',');
Console.WriteLine(First);
sb.Replace(match.Groups[0].Value, First + " as " + match.Groups[1].Value) + " ,", match.Index, match.Groups[0].Value.Length);
match = match.NextMatch();
}
Current output:
SELECT DISTINCT
isnull(rtrim(f.fleet),'') as 'Fleet' ,
'cust_clnt_id' = isnull(rtrim(x.cust_clnt_id),'')
Expected output:
SELECT DISTINCT
isnull(rtrim(f.fleet),'') as 'Fleet' ,
isnull(rtrim(x.cust_clnt_id),'') as 'cust_clnt_id'
A regex solution like this is too fragile. If you need to parse any arbitrary SQL, you need a dedicated parser. There are examples on how to parse SQL properly in Parsing SQL code in C#.
If you are sure there are no "wild", unbalaned ( and ) in your input, you may use a regex as a workaround, for a one-off job:
var result = Regex.Replace(s, #"('[^']+')\s*=\s*(\w+\((?>[^()]+|(?<o>\()|(?<-o>\)))*\))", "\n $2 as $1");
See the regex demo.
Details
('[^']+') - Capturing group 1 ($1): ', 1 or more chars other than ' and then '
\s*=\s* - = enclosed with 0+ whitespaces
(\w+\((?>[^()]+|(?<o>\()|(?<-o>\)))*\)) - Capturing group 2 ($2):
\w+ - 1+ word chars
\((?>[^()]+|(?<o>\()|(?<-o>\)))*\) - a (...) substring with any amount of balanced (...)s inside (see my explanation of this pattern).

How to capitalize 1st letter (ignoring non a-z) with regex in c#?

There are tons of posts regarding how to capitalize the first letter with C#, but I specifically am struggling how to do this when ignoring prefixed non-letter characters and tags inside them. Eg,
<style=blah>capitalize the word, 'capitalize'</style>
How to ignore potential <> tags (or non-letter chars before it, like asterisk *) and the contents within them, THEN capitalize "capitalize"?
I tried:
public static string CapitalizeFirstCharToUpperRegex(string str)
{
// Check for empty string.
if (string.IsNullOrEmpty(str))
return string.Empty;
// Return char and concat substring.
// Start # first char, no matter what (avoid <tags>, etc)
string pattern = #"(^.*?)([a-z])(.+)";
// Extract middle, then upper 1st char
string middleUpperFirst = Regex.Replace(str, pattern, "$2");
middleUpperFirst = CapitalizeFirstCharToUpper(str); // Works
// Inject the middle back in
string final = $"$1{middleUpperFirst}$3";
return Regex.Replace(str, pattern, final);
}
EDIT:
Input: <style=foo>first non-tagged word 1st char upper</style>
Expected output: <style=foo>First non-tagged word 1st char upper</style>
You may use
<[^<>]*>|(?<!\p{L})(\p{L})(\p{L}*)
The regex does the following:
<[^<>]*> - matches <, any 0+ chars other than < and > and then >
| - or
(?<!\p{L}) - finds a position not immediately preceded with a letter
(\p{L}) - captures into Group 1 any letter
(\p{L}*) - captures into Group 2 any 0+ letters (that is necessary if you want to lowercase the rest of the word).
Then, check if Group 2 matched, and if yes, capitalize the first group value and lowercase the second one, else, return the whole value:
var result = Regex.Replace(s, #"<[^<>]*>|(?<!\p{L})(\p{L})(\p{L}*)", m =>
m.Groups[1].Success ?
m.Groups[1].Value.ToUpper() + m.Groups[2].Value.ToLower() :
m.Value);
If you do not need to lowercase the rest of the word, remove the second group and the code related to it:
var result = Regex.Replace(s, #"<[^<>]*>|(?<!\p{L})(\p{L})", m =>
m.Groups[1].Success ?
m.Groups[1].Value.ToUpper() : m.Value);
To only replace the first occurrence using this approach, you need to set a flag and reverse it once the first match is found:
var s = "<style=foo>first non-tagged word 1st char upper</style>";
var found = false;
var result = Regex.Replace(s, #"<[^<>]*>|(?<!\p{L})(\p{L})", m => {
if (m.Groups[1].Success && !found) {
found = !found;
return m.Groups[1].Value.ToUpper();
} else {
return m.Value;
}
});
Console.WriteLine(result); // => <style=foo>First non-tagged word 1st char upper</style>
See the C# demo.
Using look-behind regex feature you can match the first 'capitalize' without > parenthesis and then you can capitalize the output.
The regex is the following:
(?<=<.*>)\w+
It will match the first word after the > parenthesis

c# regex matches exclude first and last character

I know nothing about regex so I am asking this great community to help me out.
With the help of SO I manage to write this regex:
string input = "((isoCode=s)||(isoCode=a))&&(title=s)&&((ti=2)&&(t=2))||(t=2&&e>5)";
string pattern = #"\((?>\((?<DEPTH>)|\)(?<-DEPTH>)|.?)*(?(DEPTH)(?!))\)|&&|\|\|";
MatchCollection matches = Regex.Matches(input, pattern);
foreach (Match match in matches)
{
Console.WriteLine(match.Groups[0].Value);
}
And the result is:
((isoCode=s)||(isoCode=a))
&&
(title=s)
&&
((ti=2)&&(t=2))
||
(t=2&&e>5)
but I need result like this (without first/last "(", ")"):
(isoCode=s)||(isoCode=a)
&&
title=s
&&
(ti=2)&&(t=2)
||
t=2&&e>5
Can it be done? I know I can do it with substring (removing first and last character), but I want to know if it can be done with regex.
You may use
\((?<R>(?>\((?<DEPTH>)|\)(?<-DEPTH>)|[^()]+)*(?(DEPTH)(?!)))\)|(?<R>&&|\|\|)
See the regex demo, grab Group "R" value.
Details
\( - an open (
(?<R> - start of the R named group:
(?> - start of the atomic group:
\((?<DEPTH>)| - an open ( and an empty string is pushed on the DEPTH group stack or
\)(?<-DEPTH>)| - a closing ) and an empty string is popped off the DEPTH group stack or
[^()]+ - 1+ chars other than ( and )
)* - zero or more repetitions
(?(DEPTH)(?!)) - a conditional construct that checks if the number of close and open parentheses is balanced
) - end of R named group
\) - a closing )
| - or
(?<R>&&|\|\|) - another occurrence of Group R matching either of the 2 subpatterns:
&& - a && substring
| - or
\|\| - a || substring.
C# code:
var pattern = #"\((?<R>(?>\((?<DEPTH>)|\)(?<-DEPTH>)|[^()]+)*(?(DEPTH)(?!)))\)|(?<R>&&|\|\|)";
var results = Regex.Match(input, pattern)
.Cast<Match>()
.Select(m => m.Groups[1].Value)
.ToList();
Brief
You can use the regex below, but I'd still strongly suggest you write a proper parser for this instead of using regex.
Code
See regex in use here
\(((?>\((?<DEPTH>)|\)(?<-DEPTH>)|.?)*(?(DEPTH)(?!)))\)|&{2}|‌​\|{2}
Usage
See regex in use here
using System;
using System.Text.RegularExpressions;
public class Test
{
public static void Main()
{
string input = "((isoCode=s)||(isoCode=a))&&(title=s)&&((ti=2)&&(t=2))||(t=2&&e>5)";
string pattern = #"\(((?>\((?<DEPTH>)|\)(?<-DEPTH>)|.?)*(?(DEPTH)(?!)))\)|&{2}|\|{2}";
MatchCollection matches = Regex.Matches(input, pattern);
foreach (Match match in matches)
{
Console.WriteLine(match.Groups[1].Success ? match.Groups[1].Value : match.Groups[0].Value);
}
}
}
Result
(isoCode=s)||(isoCode=a)
&&
title=s
&&
(ti=2)&&(t=2)
||
t=2&&e>5

How can I use lookbehind in a C# Regex in order to skip matches of repeated prefix patterns?

How can I use lookbehind in a C# Regex in order to skip matches of repeated prefix patterns?
Example - I'm trying to have the expression match all the b characters following any number of a characters:
Regex expression = new Regex("(?<=a).*");
foreach (Match result in expression.Matches("aaabbbb"))
MessageBox.Show(result.Value);
returns aabbbb, the lookbehind matching only an a. How can I make it so that it would match all the as in the beginning?
I've tried
Regex expression = new Regex("(?<=a+).*");
and
Regex expression = new Regex("(?<=a)+.*");
with no results...
What I'm expecting is bbbb.
Are you looking for a repeated capturing group?
(.)\1*
This will return two matches.
Given:
aaabbbb
This will result in:
aaa
bbbb
This:
(?<=(.))(?!\1).*
Uses the above principal, first checking that the finding the previous character, capturing it into a back reference, and then asserting that that character is not the next character.
That matches:
bbbb
I figured it out eventually:
Regex expression = new Regex("(?<=a+)[^a]+");
foreach (Match result in expression.Matches(#"aaabbbb"))
MessageBox.Show(result.Value);
I must not allow the as to me matched by the non-lookbehind group. This way, the expression will only match those b repetitions that follow a repetitions.
Matching aaabbbb yields bbbb and matching aaabbbbcccbbbbaaaaaabbzzabbb results in bbbbcccbbbb, bbzz and bbb.
The reason the look-behind is skipping the "a" is because it is consuming the first "a" (but no capturing it), then it captures the rest.
Would this pattern work for you instead? New pattern: \ba+(.+)\b
It uses a word boundary \b to anchor either ends of the word. It matches at least one "a" followed by the rest of the characters till the word boundary ends. The remaining characters are captured in a group so you can reference them easily.
string pattern = #"\ba+(.+)\b";
foreach (Match m in Regex.Matches("aaabbbb", pattern))
{
Console.WriteLine("Match: " + m.Value);
Console.WriteLine("Group capture: " + m.Groups[1].Value);
}
UPDATE: If you want to skip the first occurrence of any duplicated letters, then match the rest of the string, you could do this:
string pattern = #"\b(.)(\1)*(?<Content>.+)\b";
foreach (Match m in Regex.Matches("aaabbbb", pattern))
{
Console.WriteLine("Match: " + m.Value);
Console.WriteLine("Group capture: " + m.Groups["Content"].Value);
}

Categories