Regex, match string ending with ) and ignore any () inbetween [duplicate] - c#

This question already has answers here:
Using RegEx to balance match parenthesis
(4 answers)
Closed 9 years ago.
I want to select a part of a string, but the problem is that the last character I want to select can have multiple occurrences.
I want to select 'Aggregate(' and end at the matching ')', any () in between can be ignored.
Examples:
string: Substr(Aggregate(SubQuery, SUM, [Model].Remark * [Object].Shortname + 10), 0, 1)
should return: Aggregate(SubQuery, SUM, [Model].Remark * [Object].Shortname + 10)
string: Substr(Aggregate(SubQuery, SUM, [Model].Remark * ([Object].Shortname + 10)), 0, 1)
should return: Aggregate(SubQuery, SUM, [Model].Remark * ([Object].Shortname + 10))
string: Substr(Aggregate(SubQuery, SUM, ([Model].Remark) * ([Object].Shortname + 10) ), 0, 1)
should return: Aggregate(SubQuery, SUM, ([Model].Remark) * ([Object].Shortname + 10) )
Is there a way to solve this with a regular expression? I'm using C#.

This is a little ugly, but you could use something like
Aggregate\(([^()]+|\(.*?\))*\)
It passes all your tests, but it can only match one level of nested parentheses.
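To make the one-level limitation concrete, here is a small sketch that runs the pattern above on a singly nested and a doubly nested call. The class name OneLevelDemo, the Find helper and both sample strings are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

static class OneLevelDemo
{
    // The single-level pattern from above, unchanged.
    static readonly Regex Pattern = new Regex(@"Aggregate\(([^()]+|\(.*?\))*\)");

    // Returns the matched text, or null when nothing matches.
    public static string Find(string input)
    {
        Match m = Pattern.Match(input);
        return m.Success ? m.Value : null;
    }

    static void Main()
    {
        // One level of nesting: the whole call is returned.
        Console.WriteLine(Find("Substr(Aggregate(a * (b + 10)), 0, 1)"));
        // Two levels of nesting: the match stops one ')' too early.
        Console.WriteLine(Find("Substr(Aggregate(a(b(c))), 0, 1)"));
    }
}
```

With two levels the lazy .*? stops at the first closing parenthesis it meets, so the returned text is missing its final ).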

This solution works with any level of nested parentheses by using .NET's balancing groups:
(?x) # allow comments and ignore whitespace
Aggregate\(
(?:
[^()] # anything but ( and )
| (?<open> \( ) # ( -> open++
| (?<-open> \) ) # ) -> open--
)*
(?(open) (?!) ) # fail if open > 0
\)
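A minimal sketch of how the balancing-group pattern could be used from C#. RegexOptions.IgnorePatternWhitespace stands in for the inline (?x), and the class name AggregateFinder and the Find helper are assumptions made for this example.

```csharp
using System;
using System.Text.RegularExpressions;

static class AggregateFinder
{
    // Same pattern as above; IgnorePatternWhitespace plays the role of (?x).
    static readonly Regex Pattern = new Regex(@"
        Aggregate\(
        (?:
            [^()]              # anything but ( and )
          | (?<open> \( )      # ( -> push onto the 'open' stack
          | (?<-open> \) )     # ) -> pop from the 'open' stack
        )*
        (?(open) (?!) )        # fail if something is still on the stack
        \)",
        RegexOptions.IgnorePatternWhitespace);

    // Returns the matched Aggregate(...) call, or null when there is none.
    public static string Find(string input)
    {
        Match m = Pattern.Match(input);
        return m.Success ? m.Value : null;
    }

    static void Main()
    {
        Console.WriteLine(Find(
            "Substr(Aggregate(SubQuery, SUM, [Model].Remark * ([Object].Shortname + 10)), 0, 1)"));
    }
}
```

The loop stops at the first ) that has no matching ( on the stack, and that ) is then consumed by the final \).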
I'm not sure how much the input varies but for the string examples in the question something as simple as this would work:
Aggregate\(.*\)(?=,)

If you would eventually consider avoiding regular expressions, here's an alternative for parsing, which uses the System.Xml.Linq namespace:
using System;
using System.IO;
using System.Linq;
using System.Xml.Linq;
class Program
{
static void Main()
{
var input = File.ReadAllLines("input.txt");
input.ToList().ForEach(item => {
Console.WriteLine(item.GetParameter("Aggregate"));
});
}
}
static class X
{
public static string GetParameter(this string expression, string element)
{
XDocument doc;
var input1 = "<root>" + expression
.Replace("(", "<n1>")
.Replace(")", "</n1>")
.Replace("[", "<n2>")
.Replace("]", "</n2>") +
"</root>";
try
{
doc = XDocument.Parse(input1);
}
catch
{
return null;
}
var agg=doc.Descendants()
.Where(d => d.FirstNode != null && d.FirstNode.ToString() == element) // FirstNode is null for empty elements such as "()"
.FirstOrDefault();
if (agg == null)
return null;
var param = agg
.Elements()
.FirstOrDefault();
if (param == null)
return null;
return element +
param
.ToString()
.Replace("<n1>", "(")
.Replace("</n1>", ")")
.Replace("<n2>", "[")
.Replace("</n2>", "]");
}
}

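This regex copes with several pairs of brackets inside Aggregate(...), including the nested ones in the example below. It does not actually count nesting depth, though, so it is not a general balanced-parentheses matcher:
Aggregate\(([^(]*\([^)]*\))*[^()]*\)
For example, in the string below it matches everything from Aggregate( up to the ) just before ", 0, 1)":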
Substr(Aggregate(SubQuery, SUM(foo(bar), baz()), ([Model].Remark) * ([Object].Shortname + 10) ), 0, 1)
Notice the SUM(foo(bar), baz()) in there.
See a live demo on rubular.

Related

Find 3 or more whitespaces with regex in C# [duplicate]

This question already has answers here:
Regex to validate string for having three non white-space characters
(2 answers)
Closed 3 years ago.
As said above, I want to find 3 or more whitespaces with regex in C#. So far I have tried
\s{3,} and [ ]{3,} on Somestreet 155/ EG 47. Neither worked. What did I do wrong?
The pattern \s{3,} matches three or more whitespace characters in a row. To match a string that contains three whitespace characters anywhere, you need a pattern such as \s.*\s.*\s.
So this would match:
a b c d
a b c
a b
abc d e f
a
a b // ends in 1 space
// just 3 spaces
a // ends in 3 spaces
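The difference between the two patterns can be sketched as follows. The class name WhitespaceChecks, the two helper names and the sample strings are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

static class WhitespaceChecks
{
    // True when the input contains a run of three or more consecutive whitespace characters.
    public static bool HasRunOfThree(string s) => Regex.IsMatch(s, @"\s{3,}");

    // True when the input contains at least three whitespace characters anywhere on one line.
    public static bool HasThreeAnywhere(string s) => Regex.IsMatch(s, @"\s.*\s.*\s");

    static void Main()
    {
        Console.WriteLine(HasRunOfThree("a b c d"));     // False: the spaces are not adjacent
        Console.WriteLine(HasThreeAnywhere("a b c d"));  // True: three separate spaces
        Console.WriteLine(HasRunOfThree("a   b"));       // True: three spaces in a row
    }
}
```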
Linq is an alternative way to count spaces:
string source = "Somestreet 155/ EG 47";
bool result = source
.Where(c => c == ' ') // spaces only
.Skip(2) // skip 2 of them
.Any(); // do we have at least 1 more (i.e. a 3rd space)?
Edit: If you want to count any whitespace rather than just spaces, the Where clause should be
...
.Where(c => char.IsWhiteSpace(c))
...
You could count the whitespace matches:
if (Regex.Matches(yourString, @"\s+").Count >= 3) {...}
The + makes sure that consecutive matches to \s only count once, so "Somestreet 155/ EG 47" has three matches but "Somestreet 155/ EG47" only has two.
If the string is long, then it could take more time than necessary to get all the matches then count them. An alternative is to get one match at a time and bail out early if the required number of matches has been met:
static bool MatchesAtLeast(string s, Regex re, int matchCount)
{
bool success = false;
int startPos = 0;
while (!success)
{
Match m = re.Match(s, startPos);
if (m.Success)
{
matchCount--;
success = (matchCount <= 0);
startPos = m.Index + m.Length;
if (startPos > s.Length - 2) { break; }
}
else { break; }
}
return success;
}
static void Main(string[] args)
{
Regex re = new Regex(@"\s+");
string s = "Somestreet 155/ EG\t47";
Console.WriteLine(MatchesAtLeast(s, re, 3)); // outputs True
Console.ReadLine();
}
Try ^\S*\s\S*\s\S*\s\S*$ instead.
\S matches non-whitespace characters, ^ matches the beginning of a string and $ matches the end of a string.

Match content inside brackets [duplicate]

This question already has an answer here:
How to get text between nested parentheses?
(1 answer)
Closed 5 years ago.
I have a regex that works well for most of my cases:
"\(.*\)"
This regex matches nested brackets which is good: "ABC ( DEF (GHI) JKL ) MNO"
But there is a tricky case: "This is ABC (XXX) DEF (XXX) (XXX)". Here the regex also matches DEF, which it should not.
Any ideas of how I can adjust my regex?
If you don't insist on regular expressions, you can use a simple stack-based implementation:
using System.Collections.Generic;
using System.Linq;
...
private static IEnumerable<string> EnumerateEnclosed(string value) {
if (null == value)
yield break;
Stack<int> positions = new Stack<int>();
for (int i = 0; i < value.Length; ++i) {
char ch = value[i];
if (ch == '(')
positions.Push(i);
else if (ch == ')')
if (positions.Any()) {
int from = positions.Pop();
if (!positions.Any()) // <- outmost ")"
yield return value.Substring(from, i - from + 1);
}
}
}
Test:
// Let's combine both examples into one and elaborate it a bit further:
string test = "ABC (DEF (GHI) J(RT(123)L)KL) MNO (XXX1) DEF (XXX2) (XXX3)";
Console.WriteLine(string.Join(Environment.NewLine, EnumerateEnclosed(test)));
Outcome:
(DEF (GHI) J(RT(123)L)KL)
(XXX1)
(XXX2)
(XXX3)
Regex: \([^)]+\)[^(]+\)|\([^)]+\)
Details:
[^(] Match a single character not present in the list "("
+ Matches between one and unlimited times
| or

Validate a Boolean expression with brackets in C#

I want to validate a string in C# that contains a Boolean expression with brackets.
The string should only contain numbers 1-9, round brackets, "OR" , "AND".
Examples of good strings:
"1 AND 2"
"2 OR 4"
"4 AND (3 OR 5)"
"2"
And so on...
I am not sure whether regular expressions are flexible enough for this task.
Is there a nice short way of achieving this in C# ?
It's probably simpler to do this with a simple parser. But you can do this with .NET regex by using balancing groups and realizing that if the brackets are removed from the string you always have a string matched by a simple expression like ^\d+(?:\s+(?:AND|OR)\s+\d+)*\z.
So all you have to do is use balancing groups to make sure that the brackets are balanced (and are in the right place in the right form).
Rewriting the expression above a bit:
(?x)^
OPENING
\d+
CLOSING
(?:
\s+(?:AND|OR)\s+
OPENING
\d+
CLOSING
)*
BALANCED
\z
((?x) makes the regex engine ignore all whitespace and comments in the pattern, so it can be made more readable.)
Where OPENING matches any number (0 included) of opening brackets:
\s* (?: (?<open> \( ) \s* )*
CLOSING matches any number of closing brackets also making sure that the balancing group is balanced:
\s* (?: (?<-open> \) ) \s* )*
and BALANCED performs a balancing check, failing if there are more open brackets than closed:
(?(open)(?!))
Giving the expression:
(?x)^
\s* (?: (?<open> \( ) \s* )*
\d+
\s* (?: (?<-open> \) ) \s* )*
(?:
\s+(?:AND|OR)\s+
\s* (?: (?<open> \( ) \s* )*
\d+
\s* (?: (?<-open> \) ) \s* )*
)*
(?(open)(?!))
\z
If you do not want to allow random spaces remove every \s*.
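As a rough sketch, the final expression can be assembled from the OPENING and CLOSING fragments and wrapped in a helper. The class name BoolExprRegex and the method IsValid are assumptions, and RegexOptions.IgnorePatternWhitespace replaces the inline (?x).

```csharp
using System;
using System.Text.RegularExpressions;

static class BoolExprRegex
{
    // The OPENING and CLOSING fragments from the answer above.
    const string Opening = @"\s* (?: (?<open> \( ) \s* )*";
    const string Closing = @"\s* (?: (?<-open> \) ) \s* )*";

    static readonly Regex Pattern = new Regex(
        "^" + Opening + @"\d+" + Closing +
        @"(?: \s+(?:AND|OR)\s+ " + Opening + @"\d+" + Closing + ")*" +
        @"(?(open)(?!)) \z",   // BALANCED check, then end of input
        RegexOptions.IgnorePatternWhitespace);

    // True when the input is a well-formed expression with balanced brackets.
    public static bool IsValid(string s) => Pattern.IsMatch(s);

    static void Main()
    {
        Console.WriteLine(IsValid("4 AND (3 OR 5)")); // True
        Console.WriteLine(IsValid("(1 AND 2"));       // False
    }
}
```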
Example
See demo at IdeOne. Output:
matched: '2'
matched: '1 AND 2'
matched: '12 OR 234'
matched: '(1) AND (2)'
matched: '(((1)) AND (2))'
matched: '1 AND 2 AND 3'
matched: '1 AND (2 OR (3 AND 4))'
matched: '1 AND (2 OR 3) AND 4'
matched: ' ( 1 AND ( 2 OR ( 3 AND 4 ) )'
matched: '((1 AND 7) OR 6) AND ((2 AND 5) OR (3 AND 4))'
matched: '(1)'
matched: '(((1)))'
failed: '1 2'
failed: '1(2)'
failed: '(1)(2)'
failed: 'AND'
failed: '1 AND'
failed: '(1 AND 2'
failed: '1 AND 2)'
failed: '1 (AND) 2'
failed: '(1 AND 2))'
failed: '(1) AND 2)'
failed: '(1)() AND (2)'
failed: '((1 AND 7) OR 6) AND (2 AND 5) OR (3 AND 4))'
failed: '((1 AND 7) OR 6) AND ((2 AND 5 OR (3 AND 4))'
failed: ''
If you consider a boolean expression as generated by a formal grammar, writing a parser is easier.
I made an open source library to interpret simple boolean expressions. You can take a look at it on GitHub, in particular look at the AstParser class and Lexer.
If you just want to validate the input string, you can write a simple parser.
Each method consumes a certain kind of input (digit, brackets, operator) and returns the remaining string after matching. An exception is thrown if no match can be made.
using System;
public class ParseException : Exception { }
public static class ExprValidator
{
public static bool Validate(string str)
{
try
{
string term = Term(str);
string stripTrailing = Whitespace(term);
return stripTrailing.Length == 0;
}
catch(ParseException) { return false; }
}
static string Term(string str)
{
if(str == string.Empty) throw new ParseException(); // an empty term is not a valid expression
char current = str[0];
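if(current == '(')
{
string term = LBracket(str);
string rBracket = Term(term);
string temp = Whitespace(rBracket);
string rest = RBracket(temp);
try
{
//possibly match op term after the closing bracket, e.g. "(3 OR 5) AND 4"
string op = Op(rest);
return Term(op);
}
catch(ParseException) { return rest; }
}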
else if(Char.IsDigit(current))
{
string rest = Digit(str);
try
{
//possibly match op term
string op = Op(rest);
return Term(op);
}
catch(ParseException) { return rest; }
}
else if(Char.IsWhiteSpace(current))
{
string temp = Whitespace(str);
return Term(temp);
}
else throw new ParseException();
}
static string Op(string str)
{
string t1 = Whitespace_(str);
string op = MatchOp(t1);
return Whitespace_(op);
}
static string MatchOp(string str)
{
if(str.StartsWith("AND")) return str.Substring(3);
else if(str.StartsWith("OR")) return str.Substring(2);
else throw new ParseException();
}
static string LBracket(string str)
{
return MatchChar('(')(str);
}
static string RBracket(string str)
{
return MatchChar(')')(str);
}
static string Digit(string str)
{
return MatchChar(Char.IsDigit)(str);
}
static string Whitespace(string str)
{
if(str == string.Empty) return str;
int i = 0;
while(i < str.Length && Char.IsWhiteSpace(str[i])) { i++; }
return str.Substring(i);
}
//match at least one whitespace character
static string Whitespace_(string str)
{
string stripFirst = MatchChar(Char.IsWhiteSpace)(str);
return Whitespace(stripFirst);
}
static Func<string, string> MatchChar(char c)
{
return MatchChar(chr => chr == c);
}
static Func<string, string> MatchChar(Func<char, bool> pred)
{
return input => {
if(input == string.Empty) throw new ParseException();
else if(pred(input[0])) return input.Substring(1);
else throw new ParseException();
};
}
}
Pretty simple:
First, determine the lexemes (digit, bracket or operator) with simple string comparison.
Second, keep a counter of unclosed brackets (bracketPairs), updated for each lexeme as follows:
if the current lexeme is '(', then bracketPairs++;
if the current lexeme is ')', then bracketPairs--;
otherwise, leave bracketPairs unchanged.
At the end, if all lexemes are known and bracketPairs == 0, the input expression is valid.
The task is a bit more complex if it is necessary to build an AST.
What you want are "balancing groups": with them you can capture all the bracket definitions, and then you only need simple string parsing.
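The two stages above could be sketched like this. The class name BracketCounter and the method LooksValid are hypothetical, and the early exit when the counter goes negative is an extra guard that the description does not mention. Like the description, the sketch does not check where operators are placed.

```csharp
using System;

static class BracketCounter
{
    // Checks only that every lexeme is known and that the brackets balance.
    public static bool LooksValid(string expr)
    {
        // Stage 1: pad brackets with spaces so a plain split yields the lexemes.
        string[] lexemes = expr.Replace("(", " ( ").Replace(")", " ) ")
            .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);

        // Stage 2: count brackets and reject unknown lexemes.
        int bracketPairs = 0;
        foreach (string lexeme in lexemes)
        {
            if (lexeme == "(") bracketPairs++;
            else if (lexeme == ")")
            {
                bracketPairs--;
                if (bracketPairs < 0) return false; // extra guard: ')' before its '('
            }
            else if (lexeme != "AND" && lexeme != "OR" && !IsDigits(lexeme))
                return false; // unknown lexeme
        }
        return lexemes.Length > 0 && bracketPairs == 0;
    }

    // True when every character is a decimal digit.
    static bool IsDigits(string s)
    {
        foreach (char c in s)
            if (c < '0' || c > '9') return false;
        return true;
    }
}
```

Calling BracketCounter.LooksValid("4 AND (3 OR 5)") returns true, while "(1 AND 2" is rejected because a bracket is left open. An input such as "1 2" still passes, which is why a real validator also needs a grammar check.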
http://blog.stevenlevithan.com/archives/balancing-groups
http://msdn.microsoft.com/en-us/library/bs2twtah.aspx#balancing_group_definition
How about the ANTLR parser generator?
"a short way of achieving this in C#"
Although it may be overkill if it's just numbers plus OR and AND.

Regex to find inner if conditions

I had a regex to find single if-then-else condition.
string pattern2 = @"if( *.*? *)then( *.*? *)(?:else( *.*? *))?endif";
Now, I need to extend this & provide looping if conditions. But the regex is not suitable to extract the then & else parts properly.
Example Looped IF condition:
if (2 > 1) then ( if(3>2) then ( if(4>3) then 4 else 3 endif ) else 2 endif) else 1 endif
Expected Result with Regex:
condition = (2>1)
then part = ( if(3>2) then ( if(4>3) then 4 else 3 endif ) else 2 endif)
else part = 1
I can check if else & then part have real values or a condition. Then i can use the same regex on this inner condition until everything is resolved.
The current regex returns result like:
condition = (2 > 1)
then part = ( if( 3>2) then ( if(4>3) then 3
else part = 3
Meaning, it returns the value after first "else" found. But actually, it has to extract from the last else.
Can someone help me with this?
You can adapt the solution on answer Can regular expressions be used to match nested patterns? ( http://retkomma.wordpress.com/2007/10/30/nested-regular-expressions-explained/ ).
That solution shows how to match content between html tags , even if it contains nested tags. Applying the same idea for parenthesis pairs should solve your problem.
EDIT:
using System;
using System.Text.RegularExpressions;
namespace ConsoleApplication1
{
class Program
{
static void Main(string[] args)
{
String matchParenthesis = @"
(?# line 01) \((
(?# line 02) (?>
(?# line 03) \( (?<DEPTH>)
(?# line 04) |
(?# line 05) \) (?<-DEPTH>)
(?# line 06) |
(?# line 07) .?
(?# line 08) )*
(?# line 09) (?(DEPTH)(?!))
(?# line 10) )\)
";
//string source = "if (2 > 1) then ( if(3>2) then ( if(4>3) then 4 else 3 endif ) else 2 endif) else 1 endif";
string source = "if (2 > 1) then 2 else ( if(3>2) then ( if(4>3) then 4 else 3 endif ) else 2 endif) endif";
string pattern = @"if\s*(?<condition>(?:[^(]*|" + matchParenthesis + @"))\s*";
pattern += @"then\s*(?<then_part>(?:[^(]*|" + matchParenthesis + @"))\s*";
pattern += @"else\s*(?<else_part>(?:[^(]*|" + matchParenthesis + @"))\s*endif";
Match match = Regex.Match(source, pattern,
RegexOptions.IgnorePatternWhitespace | RegexOptions.IgnoreCase);
Console.WriteLine(match.Success.ToString());
Console.WriteLine("source: " + source );
Console.WriteLine("condition = " + match.Groups["condition"]);
Console.WriteLine("then part = " + match.Groups["then_part"]);
Console.WriteLine("else part = " + match.Groups["else_part"]);
}
}
}
If you replace endif with end, you get
if (2 > 1) then ( if(3>2) then ( if(4>3) then 4 else 3 end) else 2 end) else 1 end
and you also have a perfectly fine Ruby expression. Download IronRuby and add references to IronRuby, IronRuby.Libraries, and Microsoft.Scripting to your project. You will find them in C:\Program Files\IronRuby 1.0v4\bin. Then add
using Microsoft.Scripting;
using Microsoft.Scripting.Hosting;
using IronRuby;
and in your code
var engine = Ruby.CreateEngine();
int result = engine.Execute("if (2 > 1) then ( if(3>2) then ( if(4>3) then 4 else 3 end ) else 2 end) else 1 end");

How to parse a comma delimited string when comma and parenthesis exists in field

I have this string in C#
adj_con(CL2,1,3,0),adj_cont(CL1,1,3,0),NG, NG/CL, 5 value of CL(JK), HO
I want to use a RegEx to parse it to get the following:
adj_con(CL2,1,3,0)
adj_cont(CL1,1,3,0)
NG
NG/CL
5 value of CL(JK)
HO
In addition to the above example, I tested with the following, but am still unable to parse it correctly.
"%exc.uns: 8 hours let # = ABC, DEF", "exc_it = 1 day" , " summ=graffe ", " a,b,(c,d)"
The new text will be in one string
string mystr = @"""%exc.uns: 8 hours let # = ABC, DEF"", ""exc_it = 1 day"" , "" summ=graffe "", "" a,b,(c,d)""";
string str = "adj_con(CL2,1,3,0),adj_cont(CL1,1,3,0),NG, NG/CL, 5 value of CL(JK), HO";
var resultStrings = new List<string>();
int? firstIndex = null;
int scopeLevel = 0;
for (int i = 0; i < str.Length; i++)
{
if (str[i] == ',' && scopeLevel == 0)
{
resultStrings.Add(str.Substring(firstIndex.GetValueOrDefault(), i - firstIndex.GetValueOrDefault()));
firstIndex = i + 1;
}
else if (str[i] == '(') scopeLevel++;
else if (str[i] == ')') scopeLevel--;
}
resultStrings.Add(str.Substring(firstIndex.GetValueOrDefault()));
Even faster:
([^,]*\x28[^\x29]*\x29|[^,]+)
That should do the trick. Basically, look for either a "function thumbprint" or anything without a comma.
adj_con(CL2,1,3,0),adj_cont(CL1,1,3,0),NG, NG/CL, 5 value of CL(JK), HO
^ ^ ^ ^ ^
The Carets symbolize where the grouping stops.
Just this regex:
[^,()]+(\([^()]*\))?
A test example:
var s= "adj_con(CL2,1,3,0),adj_cont(CL1,1,3,0),NG, NG/CL, 5 value of CL(JK), HO";
Regex regex = new Regex(@"[^,()]+(\([^()]*\))?");
var matches = regex.Matches(s)
.Cast<Match>()
.Select(m => m.Value);
returns
adj_con(CL2,1,3,0)
adj_cont(CL1,1,3,0)
NG
NG/CL
5 value of CL(JK)
HO
If you simply must use Regex, then you can split the string on the following:
, # match a comma
(?= # that is followed by
(?: # either
[^\(\)]* # no parens at all
| # or
(?: #
[^\(\)]* # ...
\( # (
[^\(\)]* # stuff in parens
\) # )
[^\(\)]* # ...
)+ # any number of times
)$ # until the end of the string
)
It breaks your input into the following:
adj_con(CL2,1,3,0)
adj_cont(CL1,1,3,0)
NG
NG/CL
5 value of CL(JK)
HO
You can also use .NET's balanced grouping constructs to create a version that works with nested parens, but you're probably just as well off with one of the non-Regex solutions.
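For reference, a sketch of how the split pattern above might be applied with Regex.Split. The class name ParenAwareSplit and its Split method are made up, and the Trim call is an addition to drop the spaces that follow some commas in the sample input.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

static class ParenAwareSplit
{
    // A comma counts as a separator only if the rest of the string consists of
    // text with no parentheses, or text containing only complete, non-nested pairs.
    static readonly Regex TopLevelComma = new Regex(@"
        ,
        (?=
            (?: [^()]*
              | (?: [^()]* \( [^()]* \) [^()]* )+
            )$
        )",
        RegexOptions.IgnorePatternWhitespace);

    // Splits on top-level commas and trims each field.
    public static string[] Split(string input) =>
        TopLevelComma.Split(input).Select(part => part.Trim()).ToArray();

    static void Main()
    {
        string str = "adj_con(CL2,1,3,0),adj_cont(CL1,1,3,0),NG, NG/CL, 5 value of CL(JK), HO";
        foreach (string field in Split(str))
            Console.WriteLine(field);
    }
}
```

A comma inside a pair of parentheses is followed by a ) with no preceding (, so the lookahead fails there and the comma is kept.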
Another way to implement what Snowbear was doing:
public static string[] SplitNest(this string s, char src, string nest, string trg)
{
int scope = 0;
if (trg == null || nest == null) return null;
if (trg.Length == 0 || nest.Length < 2) return null;
if (trg.IndexOf(src) >= 0) return null;
if (nest.IndexOf(src) >= 0) return null;
for (int i = 0; i < s.Length; i++)
{
if (s[i] == src && scope == 0)
{
s = s.Remove(i, 1).Insert(i, trg);
}
else if (s[i] == nest[0]) scope++;
else if (s[i] == nest[1]) scope--;
}
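return s.Split(new[] { trg }, StringSplitOptions.None); // the Split(string) overload is not available on .NET Framework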
}
The idea is to replace any non-nested delimiter with another delimiter that you can then use with an ordinary string.Split(). You can also choose what type of bracket to use - (), <>, [], or even something weird like \/, ][, or `'. For your purposes you would use
string str = "adj_con(CL2,1,3,0),adj_cont(CL1,1,3,0),NG, NG/CL, 5 value of CL(JK), HO";
string[] result = str.SplitNest(',',"()","~");
The function would first turn your string into
adj_con(CL2,1,3,0)~adj_cont(CL1,1,3,0)~NG~ NG/CL~ 5 value of CL(JK)~ HO
then split on the ~, ignoring the nested commas.
Assuming non nested, matching parentheses, you can easily match the tokens you want instead of splitting the string:
MatchCollection matches = Regex.Matches(data, @"(?:[^(),]|\([^)]*\))+");
var s = "adj_con(CL2,1,3,0),adj_cont(CL1,1,3,0),NG, NG/CL, 5 value of CL(JK), HO";
var result = string.Join("\n", Regex.Split(s, @"(?<=\)),|,\s"));
The pattern splits on a comma that is preceded by ) (the lookbehind keeps the ) out of the match),
or
on a comma that is followed by a whitespace character.
result =
adj_con(CL2,1,3,0)
adj_cont(CL1,1,3,0)
NG
NG/CL
5 value of CL(JK)
HO
The TextFieldParser (msdn) class seems to have the functionality built-in:
TextFieldParser Class: - Provides methods and properties for parsing structured text files.
Parsing a text file with the TextFieldParser is similar to iterating over a text file, while the ReadFields method to extract fields of text is similar to splitting the strings.
The TextFieldParser can parse two types of files: delimited or fixed-width. Some properties, such as Delimiters and HasFieldsEnclosedInQuotes are meaningful only when working with delimited files, while the FieldWidths property is meaningful only when working with fixed-width files.
See the article which helped me find that
Here's a stronger option, which parses the whole text, including nested parentheses:
string pattern = @"
\A
(?>
(?<Token>
(?:
[^,()] # Regular character
|
(?<Paren> \( ) # Opening paren - push to stack
|
(?<-Paren> \) ) # Closing paren - pop
|
(?(Paren),) # If inside parentheses, match comma.
)*?
)
(?(Paren)(?!)) # If we are not inside parentheses,
(?:,|\Z) # match a comma or the end
)*? # lazy just to avoid an extra empty match at the end,
# though it removes a last empty token.
\Z
";
Match match = Regex.Match(data, pattern, RegexOptions.IgnorePatternWhitespace);
You can get all matches by iterating over match.Groups["Token"].Captures.
