Regex to split and ignore brackets

I need to split the text on commas, but the text also contains commas inside brackets, and those must be ignored.
Input text : Selectroasted peanuts,Sugars (sugar, fancymolasses),Hydrogenatedvegetable oil (cottonseed and rapeseed oil),Salt.
Expected output:
Selectroasted peanuts
Sugars (sugar, fancymolasses)
Hydrogenatedvegetable oil (cottonseed and rapeseed oil)
Salt
My code:
string pattern = @"\s*(?:""[^""]*""|\([^)]*\)|[^, ]+)";
string input = "Selectroasted peanuts,Sugars (sugar, fancymolasses),Hydrogenatedvegetable oil (cottonseed and rapeseed oil),Salt.";
foreach (Match m in Regex.Matches(input, pattern))
{
Console.WriteLine("{0}", m.Value);
}
The output I am getting:
Selectroasted
peanuts
Sugars
(sugar, fancymolasses)
Hydrogenatedvegetable
oil
(cottonseed and rapeseed oil)
Salt
Please help.

You can use
string pattern = @"(?:""[^""]*""|\([^()]*\)|[^,])+";
string input = "Selectroasted peanuts,Sugars (sugar, fancymolasses),Hydrogenatedvegetable oil (cottonseed and rapeseed oil),Salt.";
foreach (Match m in Regex.Matches(input.TrimEnd(new[] {'!', '?', '.', '…'}), pattern))
{
Console.WriteLine("{0}", m.Value);
}
// => Selectroasted peanuts
// Sugars (sugar, fancymolasses)
// Hydrogenatedvegetable oil (cottonseed and rapeseed oil)
// Salt
The pattern matches one or more occurrences of:
"[^"]*" - ", zero or more chars other than " and then a "
| - or
\([^()]*\) - a (, then any zero or more chars other than ( and ) and then a ) char
| - or
[^,] - a char other than a ,.
Note that the .TrimEnd(new[] {'!', '?', '.', '…'}) call in the code snippet removes the trailing sentence punctuation; if you can afford Salt. in the output, you can drop that call.
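If you prefer a Split call over Matches, one alternative (a different technique from the pattern above) is to split on every comma that is not followed by a closing bracket before the next opening one. This is only a minimal sketch: it assumes brackets are balanced and never nested, it does not handle the quoted strings that the pattern above supports, and the helper name SplitOutsideBrackets is made up for illustration. It is written with top-level statements (C# 9 or later).

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: splits on commas that are outside round brackets.
// Assumes brackets are balanced and not nested.
static string[] SplitOutsideBrackets(string text)
{
    // ,(?![^()]*\)) - a comma that is NOT followed by
    // zero or more non-bracket chars and then a ')'
    return Regex.Split(text.TrimEnd('.'), @",(?![^()]*\))");
}

string input = "Selectroasted peanuts,Sugars (sugar, fancymolasses),Hydrogenatedvegetable oil (cottonseed and rapeseed oil),Salt.";
foreach (string item in SplitOutsideBrackets(input))
{
    Console.WriteLine(item);
}
```

For the sample input this should print the same four items as the Matches version.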

Related

Regex - Get digits after a colon

I have a regex:
var topPayMatch = Regex.Match(result, @"(?<=Top Pay)(\D*)(\d+(?:\.\d+)?)", RegexOptions.IgnoreCase);
And I have to convert this to int which I did
topPayMatch = Convert.ToInt32(topPayMatchString.Groups[2].Value);
Now, if the input is Top Pay: 1,000,000, it currently grabs only the first digit, which is 1. I want the whole number, 1000000.
If Top Pay: 888,888 then I want all 888888.
What should I add to my regex?

You can use something as simple as @"(?<=Top Pay: )([0-9,]+)". Note that decimals are ignored by this regex.
It matches the digits, together with their commas, that follow Top Pay: , and you can then parse the match to an integer.
Example:
Regex rgx = new Regex(@"(?<=Top Pay: )([0-9,]+)");
string str = "Top Pay: 1,000,000";
Match match = rgx.Match(str);
if (match.Success)
{
string val = match.Value;
int num = int.Parse(val, System.Globalization.NumberStyles.AllowThousands);
Console.WriteLine(num);
}
Console.WriteLine("Ended");
Source:
Convert int from string with commas

If you use the lookbehind, you don't need the capture groups and you can move the \D* into the lookbehind.
To get the values, you can match 1+ digits followed by optional repetitions of , and 1+ digits.
Note that your example data contains commas and no dots, and that ? as a quantifier means 0 or 1 time.
(?<=Top Pay\D*)\d+(?:,\d+)*
The pattern matches:
(?<=Top Pay\D*) Positive lookbehind, assert what is to the left is Top Pay and optional non digits
\d+ Match 1+ digits
(?:,\d+)* Optionally repeat a , and 1+ digits
See a .NET regex demo and a C# demo
string pattern = @"(?<=Top Pay\D*)\d+(?:,\d+)*";
string input = @"Top Pay: 1,000,000
Top Pay: 888,888";
RegexOptions options = RegexOptions.IgnoreCase;
foreach (Match m in Regex.Matches(input, pattern, options))
{
var topPayMatch = int.Parse(m.Value, System.Globalization.NumberStyles.AllowThousands);
Console.WriteLine(topPayMatch);
}
Output
1000000
888888
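One thing to watch for: when no format provider is passed, int.Parse with NumberStyles.AllowThousands uses the group separator of the current culture, so "1,000,000" may fail to parse on a machine whose culture groups digits with something other than a comma. A minimal sketch that passes CultureInfo.InvariantCulture and uses int.TryParse instead of throwing; the helper name TryGetTopPay is hypothetical, and the code uses top-level statements (C# 9 or later).

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

// Hypothetical helper: extracts the number that follows "Top Pay" as an int.
// Returns false when there is no match or the digits do not fit into an int.
static bool TryGetTopPay(string text, out int value)
{
    value = 0;
    Match m = Regex.Match(text, @"(?<=Top Pay\D*)\d+(?:,\d+)*", RegexOptions.IgnoreCase);
    // InvariantCulture uses ',' as the group separator regardless of machine settings.
    return m.Success
        && int.TryParse(m.Value, NumberStyles.AllowThousands, CultureInfo.InvariantCulture, out value);
}

if (TryGetTopPay("Top Pay: 1,000,000", out int pay))
{
    Console.WriteLine(pay);
}
```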

Tokenize a string using multiple conditions

For the string below:
var str = "value0 'value 1/5' 'x ' value2";
Is there a way I can parse that string such that I get
arr[0] = "value0";
arr[1] = "value 1/5";
arr[2] = "x ";
arr[3] = "value2";
The order of values that might come with single quotes is arbitrary. Case does not matter.
I can get all values between single quotes using a regex like
"'(.*?)'"
but I need to keep the order of those values relative to the other, non-quoted values.

Use
'(?<val>.*?)'|(?<val>\S+)
See regex proof
EXPLANATION
' - a literal single quote
(?<val>.*?) - capture into group val: any character except \n, 0 or more times, matching the least amount possible
' - a literal single quote
| - OR
(?<val>\S+) - capture into group val: non-whitespace (all but \n, \r, \t, \f, and " "), 1 or more times, matching the most amount possible
C# code:
using System;
using System.Text.RegularExpressions;
public class Example
{
public static void Main()
{
string pattern = @"'(?<val>.*?)'|(?<val>\S+)";
string input = @"value0 'value 1/5' 'x ' value2";
foreach (Match m in Regex.Matches(input, pattern))
{
Console.WriteLine(m.Groups["val"].Value);
}
}
}

In C# you can reuse the same named capture group, so you could use an alternation | using the same group name for both parts.
'(?<val>[^']+)'|(?<val>\S+)
The pattern matches:
' Match a single quote
(?<val>[^']+) Capture in group val matching 1+ times any char except ' to not match an empty string
' Match a single quote
| Or
(?<val>\S+) Capture in group val matching 1+ times any non whitespace char
See a .NET regex demo or a C# demo
For example
string pattern = @"'(?<val>[^']+)'|(?<val>\S+)";
var str = "value0 'value 1/5' 'x ' value2";
foreach (Match m in Regex.Matches(str, pattern))
{
Console.WriteLine(m.Groups["val"].Value);
}
Output
value0
value 1/5
x
value2
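Since the question asks for an array, the matches can also be collected with LINQ instead of printed. This is a small sketch under the same pattern; the helper name Tokenize is made up for illustration, and the code uses top-level statements (C# 9 or later).

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper: returns the tokens in the order they appear in the input.
static string[] Tokenize(string text)
{
    return Regex.Matches(text, @"'(?<val>[^']+)'|(?<val>\S+)")
        .Cast<Match>()                       // needed on .NET Framework, harmless on newer runtimes
        .Select(m => m.Groups["val"].Value)  // both alternatives fill the same named group
        .ToArray();
}

string[] arr = Tokenize("value0 'value 1/5' 'x ' value2");
for (int i = 0; i < arr.Length; i++)
{
    Console.WriteLine($"arr[{i}] = \"{arr[i]}\"");
}
```

The quotes in the printed lines make the trailing space of the third token visible.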

Regex recursive substitutions

I have 3 case of data:
{{test_data}}
{{!test_data}}
{{test_data1&&!test_data2}} // test_data2 might not have the !
and I need to translate those strings with:
mystring.test_data
!mystring.test_data
mystring.test_data1 && !mystring.test_data2
I'm fiddling around with the super-useful regex101.com and I managed to cover almost all of the 3 cases with Regex.Replace(str, @"{{2}(?:(!?)(\w*)(\|{2}|&{2})?)}{2}", "$1mystring.$2 $3");
I can't figure out how to use regex recursion to re-apply the (?: ) part until the }} is reached and join all the matches together using the specified substitution pattern.
Is that even possible??
edit: here's the regex101 page -> https://regex101.com/r/vIBVkQ/2

I would advise a more generic solution built from smaller regexps that are easier to read and maintain. The longest one finds the substrings you need; inside a match evaluator, one smaller regex adds spaces around the logical operators and a simple \w+ pattern adds the mystring. prefix:
Regex.Replace(input, @"{{!?\w+(?:\s*(?:&&|\|\|)\s*!?\w+)*}}", m =>
    Regex.Replace(
        // Trim the {{ and }} delimiters so they do not end up in the result
        Regex.Replace(m.Value.Trim('{', '}'), @"\s*(&&|\|\|)\s*", " $1 "),
        @"\w+",
        "mystring.$&"
    )
)
See the C# demo
The main regex matches:
{{ - a {{ substring
!? - an optional ! sign
\w+ - 1 or more word chars
(?:\s*(?:&&|\|\|)\s*!?\w+)* - 0+ sequences of:
\s* - 0+ whitespace chars
(?:&&|\|\|) - a && or || substring
\s* - 0+ whitespaces
!? - an optional !
\w+ - 1 or more word chars
}} - a }} substring.
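The nested replacements described above can be wrapped into one function and run against the three inputs from the question. This is a sketch, not the only way to do it: the helper name Translate is hypothetical, and stripping the {{ and }} delimiters with Trim is an assumption based on the expected output in the question. It uses top-level statements (C# 9 or later).

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper wrapping the nested Regex.Replace calls.
static string Translate(string input)
{
    return Regex.Replace(input, @"{{!?\w+(?:\s*(?:&&|\|\|)\s*!?\w+)*}}", m =>
    {
        // Assumption: the {{ }} delimiters should not appear in the result.
        string inner = m.Value.Trim('{', '}');
        // Put exactly one space on each side of && and ||.
        string spaced = Regex.Replace(inner, @"\s*(&&|\|\|)\s*", " $1 ");
        // Prefix every identifier with "mystring.".
        return Regex.Replace(spaced, @"\w+", "mystring.$&");
    });
}

Console.WriteLine(Translate("{{test_data}}"));
Console.WriteLine(Translate("{{!test_data}}"));
Console.WriteLine(Translate("{{test_data1&&!test_data2}}"));
```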

Regex: (?:{{2}|[^|]{2}|[^&]{2})\!?(\w+)(?:}{2})?
Regex demo
C# code:
List<string> list = new List<string>() { "{{test_data}}", "{{!test_data}}", "{{test_data1&&!test_data2}}" };
foreach(string s in list)
{
string t = Regex.Replace(s, @"(?:{{2}|[^|]{2}|[^&]{2})\!?(\w+)(?:}{2})?",
o => o.Value.Contains("!") ? "!mystring." + o.Groups[1].Value : "mystring." + o.Groups[1].Value);
Console.WriteLine(t);
}
Console.ReadLine();
Output:
mystring.test_data
!mystring.test_data
mystring.test_data1&&!mystring.test_data2

I don't think you can use recursion, but with a different representation of your input pattern, you can use sub-groups. Note I used named captures to slightly limit the confusion in this example:
var test = @"{{test_data}}
{{!test_data}}
{{test_data1&&!test_data2&&test_data3}}
{{test_data1&&!test_data2 fail test_data3}}
{{test_data1&&test_data2||!test_data3}}";
// (1:!)(2:word)(3:||&&)(4:repeat)
var matches = Regex.Matches(test, @"\{{2}(?:(?<exc>!?)(?<word>\w+))(?:(?<op>\|{2}|&{2})(?<exc2>!?)(?<word2>\w+))*}{2}");
foreach (Match match in matches)
{
Console.WriteLine("Match: {0}", match.Value);
Console.WriteLine(" exc: {0}", match.Groups["exc"].Value);
Console.WriteLine(" word: {0}", match.Groups["word"].Value);
for (int i = 0; i < match.Groups["op"].Captures.Count; i++)
{
Console.WriteLine(" op: {0}", match.Groups["op"].Captures[i].Value);
Console.WriteLine(" exc2: {0}", match.Groups["exc2"].Captures[i].Value);
Console.WriteLine("word2: {0}", match.Groups["word2"].Captures[i].Value);
}
}
The idea is to read the first word in each group unconditionally and then possibly read N combinations of (|| or &&)(optional !)(word) as separate groups with sub-captures.
Example output:
Match: {{test_data}}
exc:
word: test_data
Match: {{!test_data}}
exc: !
word: test_data
Match: {{test_data1&&!test_data2&&test_data3}}
exc:
word: test_data1
op: &&
exc2: !
word2: test_data2
op: &&
exc2:
word2: test_data3
Match: {{test_data1&&test_data2||!test_data3}}
exc:
word: test_data1
op: &&
exc2:
word2: test_data2
op: ||
exc2: !
word2: test_data3
Note the line {{test_data1&&!test_data2 fail test_data3}} is not part of the result groups because it doesn't comply with the syntax rules.
So you can build your desired result the same way from the matches structure:
foreach (Match match in matches)
{
var sb = new StringBuilder();
sb.Append(match.Groups["exc"].Value).Append("mystring.").Append(match.Groups["word"].Value);
for (int i = 0; i < match.Groups["op"].Captures.Count; i++)
{
sb.Append(' ').Append(match.Groups["op"].Captures[i].Value).Append(' ');
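// Use Captures[i]: Group.Value only returns the last capture of a repeated group
sb.Append(match.Groups["exc2"].Captures[i].Value).Append("mystring.").Append(match.Groups["word2"].Captures[i].Value);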
}
Console.WriteLine("Result: {0}", sb.ToString());
}
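For a repeated group, Group.Value holds only the last capture, so the loop has to index into Captures to pick up each repetition. A compact, runnable sketch of the same idea for a single expression; the helper name BuildExpression is hypothetical, and returning the input unchanged when it does not match is an assumption. It uses top-level statements (C# 9 or later).

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical helper: rebuilds one {{...}} expression from its captures.
static string BuildExpression(string text)
{
    Match match = Regex.Match(text,
        @"\{{2}(?:(?<exc>!?)(?<word>\w+))(?:(?<op>\|{2}|&{2})(?<exc2>!?)(?<word2>\w+))*}{2}");
    if (!match.Success)
    {
        return text; // assumption: leave non-matching input untouched
    }

    var sb = new StringBuilder();
    sb.Append(match.Groups["exc"].Value).Append("mystring.").Append(match.Groups["word"].Value);

    CaptureCollection ops = match.Groups["op"].Captures;
    for (int i = 0; i < ops.Count; i++)
    {
        // Captures[i] is the i-th repetition; the three groups stay aligned
        // because each repetition records one capture per group (possibly empty).
        sb.Append(' ').Append(ops[i].Value).Append(' ');
        sb.Append(match.Groups["exc2"].Captures[i].Value)
          .Append("mystring.")
          .Append(match.Groups["word2"].Captures[i].Value);
    }
    return sb.ToString();
}

Console.WriteLine(BuildExpression("{{test_data1&&test_data2||!test_data3}}"));
```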

Splitting a String with conditions

Given a string of:
"S1 =F A1 =T A2 =T F3 =F"
How can I split it so that the result is an array of 4 strings, where the individual strings look like this:
"S1=F"
"A1=T"
"A2=T"
"F3=F"
Thank you

You can try matching all Name = (T|F) conditions with a regular expression and then remove the white space from each match with the help of LINQ:
using System.Linq;
using System.Text.RegularExpressions;
..
string source = "S1 \t = F A1 = T A2 = T F3 = F";
string[] result = Regex
.Matches(source, @"[A-Za-z][A-Za-z0-9]*\s*=\s*[TF]")
.OfType<Match>()
.Select(match => string.Concat(match.Value.Where(c => !char.IsWhiteSpace(c))))
.ToArray();
Console.WriteLine(string.Join(Environment.NewLine, result));
Outcome:
S1=F
A1=T
A2=T
F3=F
Edit: What's going on. The first part is the regular expression matching:
... Regex
.Matches(source, @"[A-Za-z][A-Za-z0-9]*\s*=\s*[TF]")
.OfType<Match>() ...
We are looking for fragments that match the pattern
[A-Za-z] - Letter A..Z or a..z
[A-Za-z0-9]* - followed by zero or many letters or digits
\s* - zero or more white spaces (spaces, tabulations etc.)
= - =
\s* - zero or more white spaces (spaces, tabulations etc.)
[TF] - either T or F
The second part cleans up each match: for every match found, e.g. S1 \t = F, we want to obtain the string "S1=F":
...
.Select(match => string.Concat(match.Value.Where(c => !char.IsWhiteSpace(c))))
.ToArray();
We use Linq here: for each character in the match we filter out all white spaces (take character c if and only if it's not a white space):
match.Value.Where(c => !char.IsWhiteSpace(c))
then combine (Concat) filtered characters (IEnumerable<char>) of each match back to string and organize these strings as an array (materialization):
.Select(match => string.Concat(...))
.ToArray();
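An alternative to filtering out white space character by character is to capture the name and the value in groups and join them with =. This is a different technique from the one above, shown only as a sketch; the helper name SplitConditions is made up for illustration, and the code uses top-level statements (C# 9 or later).

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper: returns "Name=T" / "Name=F" strings built from capture groups.
static string[] SplitConditions(string source)
{
    return Regex.Matches(source, @"([A-Za-z][A-Za-z0-9]*)\s*=\s*([TF])")
        .Cast<Match>()
        // Group 1 is the name, group 2 is the T/F value; white space is never captured.
        .Select(m => m.Groups[1].Value + "=" + m.Groups[2].Value)
        .ToArray();
}

Console.WriteLine(string.Join(Environment.NewLine, SplitConditions("S1 =F A1 =T A2 =T F3 =F")));
```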

REGEX Expression C#. Split string by whitespace outside the quotation marks

I'm trying to define a regular expression for the Split function that splits a string on whitespace, ignoring any whitespace inside single quotation marks.
Example:
key1:value1 key2:'value2 value3'
I need these separated values:
key1:value1
key2:'value2 value3'
I tried to do this in different ways:
Regex.Split(q, @"(\s)^('\s')").ToList();
Regex.Split(q, @"(\s)(^'.\s.')").ToList();
Regex.Split(q, @"(?=.*\s)").ToList();
What is wrong with this code?
Could you please help me with this?
Thanks in advance

A working example:
(\w+):(?:(\w+)|'([^']+)')
(\w+) # key: 1 or more word chars (captured)
: # literal
(?: # non-captured grouped alternatives
(\w+) # value: 1 or more word chars (captured)
| # or
'([^']+)' # 1 or more not "'" enclosed by "'" (captured)
) # end of group
Demo
Your try:
(\s)^('\s')
^ means the beginning of the line, and \s is a white-space character. If you want negation, ^ only works that way inside a character class: [^\s] matches 1 character that is not white-space.

var st = "key1:value1 key2:'value2 value3'";
var result = Regex.Matches(st, @"\w+:\w+|\w+:\'[^']+\'");
foreach (var item in result)
Console.WriteLine(item);
The result should be:
key1:value1
key2:'value2 value3'

Try the following:
static void Main(string[] args)
{
string input = "key1:value1 key2:'value2 value3'";
string pattern = @"\s*(?'key'[^:]+):((?'value'[^'][^\s]+)|'(?'value'[^']+))";
MatchCollection matches = Regex.Matches(input, pattern);
foreach (Match match in matches)
{
Console.WriteLine("Key : '{0}', Value : '{1}'", match.Groups["key"].Value, match.Groups["value"].Value);
}
Console.ReadLine();
}
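If you do want to stay with Regex.Split, as in the question, one option is to split on whitespace that is followed by an even number of single quotes up to the end of the string, which means the whitespace is outside any quoted part. This is a sketch under the assumption that quotes are balanced and never escaped; the helper name SplitOutsideQuotes is hypothetical, and the code uses top-level statements (C# 9 or later).

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: splits on whitespace that is outside single quotes.
// Assumes quotes are balanced and there are no escaped quotes.
static string[] SplitOutsideQuotes(string q)
{
    // \s+                      - one or more whitespace chars ...
    // (?=(?:[^']*'[^']*')*     - ... followed by zero or more complete '...' pairs
    //    [^']*$)               - and then no further quote until the end
    return Regex.Split(q, @"\s+(?=(?:[^']*'[^']*')*[^']*$)");
}

foreach (string part in SplitOutsideQuotes("key1:value1 key2:'value2 value3'"))
{
    Console.WriteLine(part);
}
```

The pattern contains no capturing groups, so Regex.Split does not add the separators to the result.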
