Tokenizing with RegEx when delimiter can be in token - c#

I'm parsing some input in C#, and I'm hitting a wall with RegEx processing.
A disclaimer: I'm not a regular expression expert, but I'm learning more.
I have an input string that looks like this:
ObjectType [property1=value1, property2=value2, property3=AnotherObjectType [property4=some value4]]
(a contrived value, but the important thing is that these can be nested).
I'm doing the following to tokenize the string:
Regex Tokenizer = new Regex(@"([=\[\]])|(,\s)");
string[] tokens = Tokenizer.Split(s);
This gets me about 98% of the way. It splits the string on the known separators and on commas followed by whitespace.
The tokens in the above example are:
ObjectType
[
property1
=
value1
,
property2
=
value2
,
property3
=
AnotherObjectType
[
property4
=
some value4
]
]
But I have two issues:
1) The property values can contain commas. This is a valid input:
ObjectType [property1=This is a valid value, and should be combined,, property2=value2, property3=AnotherObjectType [property4=value4]]
I would like the token after property1= to be:
This is a valid value, and should be combined,
And I'd like the whitespace inside the token to be preserved. Currently, it's split when a comma is found.
2) When split, the comma tokens contain whitespace. I'd like to get rid of this if possible, but this is a much less important priority.
I've tried various options, and they have all gotten me partially there. The closest that I've had is this:
Regex Tokenizer = new Regex(@"([=\[\]])|(,\s)|([\w]*\s*(?=[=\[\]]))|(.[^=]*(?=,\s))");
This matches the separators, a comma followed by whitespace, word characters followed by whitespace before a separator, and text before a comma and whitespace (that doesn't include the = sign).
When I get the matches instead of calling split, I get this:
ObjectType
[
property1
=
value1
,
property2
=
value2
,
property3
=
AnotherObjectType
[
property4
=
value4
]
]
Notice the missing information from property4. More complex inputs sometimes have the close brackets included in the token, like this: value4]
I'm not sure why that's happening. Any ideas on how to improve upon this?
Thanks,
Phil

This is easiest to answer with a lexer and parser tool. Many argue that they are too complex for these "simple" use cases, though I have always found them clearer and easier to reason about. You don't get bogged down in stupid if logic.
For C#, GPLEX and GPPG seem to be some good ones. See here for why you might want to use them.
In your case you have two things. First, a grammar, which defines how different tokens interact based on context. Second, the details of implementing that grammar in your language and toolchain of choice. The grammar is relatively easy to define; you have informally done so already. The details are the tricky part. Wouldn't it be nice if you had a framework that could read some defined way of writing out the grammar bit and just generate the code to actually do it?
That is how these tools work in a nutshell. The docs are pretty short, so read through all of them; taking the time up front will help immensely.
In essence, you would declare a scanner and parser. The scanner takes in a text stream/file and compares it to various regular expressions until it has a match. That match is passed up to the parser as a token. Then the next token is matched and passed up, round and round until the text stream empties out.
Each matched token can have arbitrary C# code attached to it, and the same with each of the rules in the parser.
I don't normally use C#, but I've written quite a few lexers and parsers. The principles are the same across languages. This is the best solution for your problem, and will help you again and again throughout your career.
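To make the scanner idea concrete without any generator tool, here is a minimal hand-rolled sketch of the loop described above: try each rule at the current position, emit the first match as a token, advance, repeat. The class name SimpleScanner, the rule names, and the choice of rules are my own assumptions for illustration, not GPLEX output. Note that it deliberately emits commas and words as separate tokens; deciding whether a comma belongs to a value is left to the parser, which knows the context.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class SimpleScanner
{
    // Hypothetical token rules, tried in order at the current position.
    // \G anchors each pattern at the start index passed to Match(input, pos).
    private static readonly (string Kind, Regex Pattern)[] Rules =
    {
        ("Whitespace", new Regex(@"\G\s+")),
        ("Open",       new Regex(@"\G\[")),
        ("Close",      new Regex(@"\G\]")),
        ("Equals",     new Regex(@"\G=")),
        ("Comma",      new Regex(@"\G,")),
        ("Text",       new Regex(@"\G[^\[\]=,\s]+")),
    };

    // Returns the tokens in input order; whitespace is matched but not emitted.
    public static List<(string Kind, string Text)> Scan(string input)
    {
        var tokens = new List<(string Kind, string Text)>();
        int pos = 0;
        while (pos < input.Length)
        {
            bool matched = false;
            foreach (var rule in Rules)
            {
                Match m = rule.Pattern.Match(input, pos);
                if (!m.Success) continue;
                if (rule.Kind != "Whitespace")
                    tokens.Add((rule.Kind, m.Value));
                pos += m.Length;
                matched = true;
                break;
            }
            // With the rules above every character is covered, so this only
            // triggers if a rule is removed or narrowed.
            if (!matched)
                throw new FormatException("Unexpected character at position " + pos);
        }
        return tokens;
    }
}
```

Calling SimpleScanner.Scan("A [p=v]") yields six tokens: Text "A", Open "[", Text "p", Equals "=", Text "v", Close "]". A parser on top of this would rejoin Text and Comma tokens into a value until it sees the start of the next "name =" pair or a closing bracket.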

You can do this with two regular expressions and a recursive function with one caveat: special characters must be escaped. From what I can see, "=", "[" and "]" have special meaning, so you must insert a "\" before those characters if you want them to appear as part of your property value. Note that commas are not considered "special". A comma before a "property=" string is ignored, but otherwise they are treated in no special way (and, in fact, are optional between properties).
Input
ObjectType
[
property1=value1,val\=value2
property2=value2 \[property2\=this is not an object\], property3=
AnotherObjectType [property4=some
value4]]
Regular Expressions
The regex for discovering "complex" types (beginning with a type name followed by square brackets). The regex includes a mechanism for balancing square brackets to make sure that each open bracket is paired with a close bracket (so that the match does not end too soon or too late):
^\s*(?<TypeName>\w+)\s*\[(?<Properties>([^\[\]]|\\\[|\\\]|(?<!\\)\[(?<Depth>)|(?<!\\)\](?<-Depth>))*(?(Depth)(?!)))\]\s*$
The regex for discovering properties within a complex type. Note that this also includes balanced square brackets to ensure that the properties of a sub-complex type are not accidentally consumed by the parent.
(?<PropertyName>\w+)\s*=\s*(?<PropertyValue>([^\[\]]|\\\[|\\\]|(?<!\\)\[(?<Depth>)|(?<!\\)\](?<-Depth>))*?(?(Depth)(?!))(?=$|(?<!\\)\]|,?\s*\w+\s*=))
Code
private static Regex ComplexTypeRegex = new Regex( @"^\s*(?<TypeName>\w+)\s*\[(?<Properties>([^\[\]]|\\\[|\\\]|(?<!\\)\[(?<Depth>)|(?<!\\)\](?<-Depth>))*(?(Depth)(?!)))\]\s*$" );
private static Regex PropertyRegex = new Regex( @"(?<PropertyName>\w+)\s*=\s*(?<PropertyValue>([^\[\]]|\\\[|\\\]|(?<!\\)\[(?<Depth>)|(?<!\\)\](?<-Depth>))*?(?(Depth)(?!))(?=$|(?<!\\)\]|,?\s*\w+\s*=))" );
private static string Input =
@"ObjectType" + "\n" +
@"[" + "\n" +
@" property1=value1,val\=value2 " + "\n" +
@" property2=value2 \[property2\=this is not an object\], property3=" + "\n" +
@" AnotherObjectType [property4=some " + "\n" +
@"value4]]";
static void Main( string[] args )
{
Console.Write( Process( 0, Input ) );
Console.WriteLine( "\n\nPress any key..." );
Console.ReadKey( true );
}
private static string Process( int level, string input )
{
var l_complexMatch = ComplexTypeRegex.Match( input );
var l_indent = string.Join( "", Enumerable.Range( 0, level * 3 ).Select( i => " " ).ToArray() );
var l_output = new StringBuilder();
l_output.AppendLine( l_indent + l_complexMatch.Groups["TypeName"].Value );
foreach ( var l_match in PropertyRegex.Matches( l_complexMatch.Groups["Properties"].Value ).Cast<Match>() )
{
l_output.Append( l_indent + "#" + l_match.Groups["PropertyName"].Value + " = " );
var l_value = l_match.Groups["PropertyValue"].Value;
if ( Regex.IsMatch( l_value, @"(?<!\\)\[" ) )
{
l_output.AppendLine();
l_output.Append( Process( level + 1, l_value ) );
}
else
{
l_output.AppendLine( "\"" + l_value + "\"" );
}
}
return l_output.ToString();
}
Output
ObjectType
#property1 = "value1,val\=value2 "
#property2 = "value2 \[property2\=this is not an object\]"
#property3 =
AnotherObjectType
#property4 = "some value4"
If you cannot escape the delimiters, then I doubt even a human could parse such a string. For example, how would a human reliably know whether the value of property 3 should be considered a literal string or a complex type?

Related

Replace a part of string containing Password

Slightly similar to this question, I want to replace argv contents:
string argv = "-help=none\n-URL=(default)\n-password=look\n-uname=Khanna\n-p=100";
to this:
"-help=none\n-URL=(default)\n-password=********\n-uname=Khanna\n-p=100"
I have tried very basic string find and search operations (using IndexOf, Substring, etc.). I am looking for a more elegant solution to replace this part of the string:
-password=AnyPassword
to:
-password=*******
And keep other part of string intact. I am looking if String.Replace or Regex replace may help.
What I've tried (not much of error-checks):
var pwd_index = argv.IndexOf("--password=");
string converted;
if (pwd_index >= 0)
{
var leftPart = argv.Substring(0, pwd_index);
var pwdStr = argv.Substring(pwd_index);
var rightPart = pwdStr.Substring(pwdStr.IndexOf("\n") + 1);
converted = leftPart + "--password=********\n" + rightPart;
}
else
converted = argv;
Console.WriteLine(converted);
Solution
Similar to Rubens Farias' solution but a little bit more elegant:
string argv = "-help=none\n-URL=(default)\n-password=\n-uname=Khanna\n-p=100";
string result = Regex.Replace(argv, @"(password=)[^\n]*", "$1********");
It matches password= literally, stores it in capture group $1 and then keeps matching until a \n is reached.
This yields a constant number of *'s, though. But revealing how many characters a password has might already convey too much information to hackers, anyway.
Working example: https://dotnetfiddle.net/xOFCyG
Regular expression breakdown
( // Store the following match in capture group $1.
password= // Match "password=" literally.
)
[ // Match one from a set of characters.
^ // Negate a set of characters (i.e., match anything not
// contained in the following set).
\n // The character set: consists only of the new line character.
]
* // Match the previously matched character 0 to n times.
This code replaces the password value by several "*" characters:
string argv = "-help=none\n-URL=(default)\n-password=look\n-uname=Khanna\n-p=100";
string result = Regex.Replace(argv, @"(password=)([\s\S]*?\n)",
match => match.Groups[1].Value + new String('*', match.Groups[2].Value.Length - 1) + "\n");
You can also remove the new String() part and replace it with a string constant.
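Both patterns above match password= anywhere in the string, so a hypothetical -oldpassword=... argument would be masked as well. If that is not wanted, a slightly stricter variant can anchor on the start of a line and accept one or two leading dashes (the question's sample uses -password= while its attempt searches for --password=). This is only a sketch; the ArgMasker and Mask names are made up for the example, and it assumes arguments are separated by \n as in the question.

```csharp
using System.Text.RegularExpressions;

public static class ArgMasker
{
    // (?m) makes ^ match at the start of every line, not just the string.
    // Group 1 keeps the "-password=" or "--password=" prefix; the value up
    // to the next \n is replaced by a fixed-length mask.
    private static readonly Regex PasswordArg =
        new Regex(@"(?m)^(-{1,2}password=)[^\n]*");

    public static string Mask(string argv)
    {
        return PasswordArg.Replace(argv, "${1}********");
    }
}
```

For the string from the question, ArgMasker.Mask(argv) returns "-help=none\n-URL=(default)\n-password=********\n-uname=Khanna\n-p=100", and an argument such as -oldpassword=x is left untouched.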

Parsing mathematical expressions in C#

As a project, I want to write a parser for mathematical expressions in C#. I know there are libraries for this, but want to create my own to learn about this topic.
As an example, I have the expression
min(3,4) + 2 - abs(-4.6)
I then create tokens from this string by specifying regular expressions and going through the user's expression, trying to match one of the regexes. This is done from the front to the back:
private static List<string> Tokenize(string expression)
{
List<string> result = new List<string>();
List<string> tokens = new List<string>();
tokens.Add("^\\(");// matches opening bracket
tokens.Add("^([\\d.\\d]+)"); // matches floating point numbers
tokens.Add("^[&|<=>!]+"); // matches operators and other special characters
tokens.Add("^[\\w]+"); // matches words and integers
tokens.Add("^[,]"); // matches ,
tokens.Add("^[\\)]"); // matches closing bracket
while (0 != expression.Length)
{
bool foundMatch = false;
foreach (string token in tokens)
{
Match match = Regex.Match(expression, token);
if (false == match.Success)
{
continue;
}
result.Add(match.Value);
expression = Regex.Replace(expression, token, "");
foundMatch = true;
break;
}
if (false == foundMatch)
{
break;
}
}
return result;
}
This works quite well. Now I want the user to be able to enter strings into the expression. I found a related question, "Regex tokenize issue", but the regexes in its answers match the text anywhere in the expression. I need a pattern that matches only the first occurrence at the front of the expression, so I can keep the order of the tokens.
As an example see this:
5 + " is smaller than " + 10
should give me the tokens
5 + " is smaller than " + 10
If possible I would also like to be able to enter escape characters so the user is able to use the character " in strings, like "This is an apostrophe \" " gives me the token "This is an apostrophe " "
The answer from Wiktor Stribiżew at that question looked really good, but I couldn't modify it so it only matches at the beginning and only one word. Help is appreciated!
Funny that you reference that question. I actually adapted (yet again) my answer from there to work for you here ;)
Here's a fiddle showing the solution.
The regex is
(?!\+)(?:"((?:\\"|[^"])*)"?)
I changed the code to use capture groups, which makes it simple to leave out the surrounding quotes. The loop also removes the + sign separating the tokens.
Regards
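Since the fiddle may not be reachable, here is a separate small sketch of a string rule that fits the question's "match only at the front" tokenizer. It is not the code from the fiddle and uses a different pattern: it is anchored with ^, requires a closing quote, and treats backslash plus any character as an escape. The StringTokenRule and TryRead names are hypothetical.

```csharp
using System.Text.RegularExpressions;

public static class StringTokenRule
{
    // ^ anchors at the front of the remaining expression. After the opening
    // quote, accept either an escaped character (\\.) or any character that
    // is neither a quote nor a backslash, then require the closing quote.
    private static readonly Regex LeadingString =
        new Regex(@"^""((?:\\.|[^""\\])*)""");

    // On success, value holds the contents with \" unescaped and length is
    // the number of characters to strip from the front of the expression.
    public static bool TryRead(string expression, out string value, out int length)
    {
        Match m = LeadingString.Match(expression);
        value = m.Success ? m.Groups[1].Value.Replace("\\\"", "\"") : string.Empty;
        length = m.Success ? m.Length : 0;
        return m.Success;
    }
}
```

In the question's loop this rule would be tried before the word and operator rules; on success, add value to the result and remove the first length characters from the expression. For an expression that starts with "a \" b" the call succeeds with value a " b and length 8; for 5 + "x" it fails, because the string is not at the front.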

Why is one character missing in the query result?

Take a look at the code:
string expression = "x & ~y -> (s + t) & z";
var exprCharsNoWhitespace = expression.Except( new[]{' ', '\t'} ).ToList();
var exprCharsNoWhitespace_2 = expression.Replace( " ", "" ).Replace( "\t", "" ).ToList();
// output for examination
Console.WriteLine( exprCharsNoWhitespace.Aggregate( "", (a,x) => a+x ) );
Console.WriteLine( exprCharsNoWhitespace_2.Aggregate( "", (a,x) => a+x ) );
// Output:
// x&~y->(s+t)z
// x&~y->(s+t)&z
I want to remove all whitespace from the original string and then get the individual characters.
The result surprised me.
The variable exprCharsNoWhitespace contains, as expected, no whitespace, but unexpectedly, only almost all of the other characters. The last occurrence of '&' is missing; the Count of the list is 12.
Whereas exprCharsNoWhitespace_2 is completely as expected: Count is 13, all characters other than whitespace are contained.
The framework used was .NET 4.0.
I also just pasted this to csharppad (web-based IDE/compiler) and got the same results.
Why does this happen?
EDIT:
All right, I was unaware that Except is, as pointed out by Ryan O'Hara, a set operation. I hadn't used it before.
// So I'll continue just using something like this:
expression.Where( c => c!=' ' && c!='\t' )
// or for more characters this can be shorter:
expression.Where( c => ! new[]{'a', 'b', 'c', 'd'}.Contains(c) )
Except produces a set difference. Your expression isn’t a set, so it’s not the right method to use. As to why the & specifically is missing: it’s because it’s repeated. None of the other characters is.
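A side-by-side sketch of the two behaviours, using the expression from the question. The ExceptDemo class and its method names are made up for the illustration.

```csharp
using System;
using System.Linq;

public static class ExceptDemo
{
    // Set difference: each distinct character is returned at most once,
    // so a repeated character such as '&' appears only one time.
    public static string ViaExcept(string s)
    {
        return new string(s.Except(new[] { ' ', '\t' }).ToArray());
    }

    // Plain filter: every character that is not a space or tab is kept,
    // duplicates included.
    public static string ViaWhere(string s)
    {
        return new string(s.Where(c => c != ' ' && c != '\t').ToArray());
    }
}
```

For "x & ~y -> (s + t) & z", ViaExcept returns x&~y->(s+t)z (12 characters) and ViaWhere returns x&~y->(s+t)&z (13 characters), matching the output in the question.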
Ryan already answered your question as asked, but I'd like to provide you an alternative solution to the problem you are facing. If you need to do a lot of string manipulation, you may find regular expression pattern matching to be helpful. The examples you've given would work something like this:
string expression = "x & ~y -> (s + t) & z";
string pattern = @"\s";
string replacement = "";
string noWhitespace = new Regex(pattern).Replace(expression, replacement);
Or for the second example, keep everything the same except the pattern:
string pattern = "[abcd]";
Keep the Regex object stored somewhere rather than creating it each time if you're going to use the same pattern a lot.
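One common way to keep the Regex around is a static readonly field, roughly like the sketch below. The WhitespaceStripper name is invented for the example, and RegexOptions.Compiled is optional; whether it pays off depends on how often the pattern is used.

```csharp
using System.Text.RegularExpressions;

public static class WhitespaceStripper
{
    // Constructed once when the class is first used and then reused
    // for every call.
    private static readonly Regex Whitespace =
        new Regex(@"\s", RegexOptions.Compiled);

    public static string Strip(string s)
    {
        return Whitespace.Replace(s, "");
    }
}
```

WhitespaceStripper.Strip("x & ~y -> (s + t) & z") returns x&~y->(s+t)&z.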
As already mentioned .Except(...) is a set operation so it drops duplicates.
Try just using .Where(...) instead:
string expression = "x & ~y -> (s + t) & z";
var exprCharsNoWhitespace =
String.Join(
"",
expression.Where(c => !new[] { ' ', '\t' }.Contains(c)));
This gives:
x&~y->(s+t)&z

Regex to find embedded quotes in a quotes string

Original string:
11235485|56987|0|2010|05|"This is my sample
"text""|"01J400B"|""|1|"Sample "text" number two"|""sample text number
three""|""|""|
Desired string:
11235485|56987|0|2010|05|"This is my sample
""text"""|"01J400B"|""|1|"Sample ""text"" number two"|"""sample text
number three"""|""|""|
The desired string unfortunately is a requirement that is out of my control, all nested quotes MUST be qualified with quotes (I KNOW).
Try as I might I have not been able to create the desired string from the original.
A regex match/replace seems to be the way to go, I need help. Any help is appreciated.
I'd actually split the string and evaluate each piece:
public string Escape(string input)
{
string[] pieces = input.Split('|');
for (int i = 0; i < pieces.Length; i++)
{
string piece = pieces[i];
if (piece.StartsWith("\"") && piece.EndsWith("\""))
{
pieces[i] = "\"" + piece.Trim('\"').Replace("\"", "\"\"") + "\"";
}
}
return string.Join("|", pieces);
}
This is making several assumptions about the input:
Items are delimited by pipes (|)
Items are well formed and will begin and end with quotation marks
This will also break if you have |s inside of quoted strings.
You may be able to just use the normal string.Replace() method. You know that | is what starts the column, so you can replace all " to "" and then fix the column start and end by replacing |"" to |" and ""| to "|.
It'd look like this:
var input = YOUR_ORIGINAL_STRING;
input = input.Replace("\"", "\"\"").Replace("|\"\"", "|\"").Replace("\"\"|", "\"|");
It's not pretty, but it gets the job done.
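Since the question asked for a regex, the same idea can also be sketched with lookarounds: a quote is "embedded" when it has a non-pipe character on both sides, and only those quotes get doubled. This is my own variant, not taken from either answer, and it carries the same assumptions: fields are delimited by |, there are no pipes inside quoted fields, embedded quotes are not already doubled, and the input does not start or end with stray characters around the outer quotes. The QuoteEscaper name is hypothetical.

```csharp
using System.Text.RegularExpressions;

public static class QuoteEscaper
{
    // (?<=[^|]) requires a preceding character that is not a pipe, and
    // (?=[^|]) requires a following character that is not a pipe, so the
    // quotes that open or close a field (next to | or at the string ends)
    // are left alone.
    private static readonly Regex EmbeddedQuote =
        new Regex(@"(?<=[^|])""(?=[^|])");

    public static string Escape(string input)
    {
        return EmbeddedQuote.Replace(input, "\"\"");
    }
}
```

For the field "Sample "text" number two" this produces "Sample ""text"" number two", for ""sample text number three"" it produces """sample text number three""", and empty fields written as "" between pipes stay as they are.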

Regex for matching Functions and Capturing their Arguments

I'm working on a calculator that takes string expressions and evaluates them. I have a function that searches the expression for math functions using Regex, retrieves the arguments, looks up the function name, and evaluates it. The problem is that I can only do this if I know how many arguments there are going to be; I can't get the Regex right. And if I just split the contents between the ( and ) characters on the , character, then I can't have other function calls in an argument.
Here is the function matching pattern: \b([a-z][a-z0-9_]*)\((..*)\)\b
It only works with one argument. How can I create a group for every argument, excluding the ones inside of nested functions? For example, it would match: func1(2 * 7, func2(3, 5)) and create capture groups for: 2 * 7 and func2(3, 5)
Here the function I'm using to evaluate the expression:
/// <summary>
/// Attempts to evaluate and store the result of the given mathematical expression.
/// </summary>
public static bool Evaluate(string expr, ref double result)
{
expr = expr.ToLower();
try
{
// Matches for result identifiers, constants/variables objects, and functions.
MatchCollection results = Calculator.PatternResult.Matches(expr);
MatchCollection objs = Calculator.PatternObjId.Matches(expr);
MatchCollection funcs = Calculator.PatternFunc.Matches(expr);
// Parse the expression for functions.
foreach (Match match in funcs)
{
System.Windows.Forms.MessageBox.Show("Function found. - " + match.Groups[1].Value + "(" + match.Groups[2].Value + ")");
int argCount = 0;
List<string> args = new List<string>();
List<double> argVals = new List<double>();
string funcName = match.Groups[1].Value;
// Ensure the function exists.
if (_Functions.ContainsKey(funcName)) {
argCount = _Functions[funcName].ArgCount;
} else {
Error("The function '"+funcName+"' does not exist.");
return false;
}
// Create the pattern for matching arguments.
string argPattTmp = funcName + "\\(\\s*";
for (int i = 0; i < argCount; ++i)
argPattTmp += "(..*)" + ((i == argCount - 1) ? ",":"") + "\\s*";
argPattTmp += "\\)";
// Get all of the argument strings.
Regex argPatt = new Regex(argPattTmp);
// Evaluate and store all argument values.
foreach (Group argMatch in argPatt.Matches(match.Value.Trim())[0].Groups)
{
string arg = argMatch.Value.Trim();
System.Windows.Forms.MessageBox.Show(arg);
if (arg.Length > 0)
{
double argVal = 0;
// Check if the argument is a double or expression.
try {
argVal = Convert.ToDouble(arg);
} catch {
// Attempt to evaluate the arguments expression.
System.Windows.Forms.MessageBox.Show("Argument is an expression: " + arg);
if (!Evaluate(arg, ref argVal)) {
Error("Invalid arguments were passed to the function '" + funcName + "'.");
return false;
}
}
// Store the value of the argument.
System.Windows.Forms.MessageBox.Show("ArgVal = " + argVal.ToString());
argVals.Add(argVal);
}
else
{
Error("Invalid arguments were passed to the function '" + funcName + "'.");
return false;
}
}
// Parse the function and replace with the result.
double funcResult = RunFunction(funcName, argVals.ToArray());
expr = new Regex("\\b"+match.Value+"\\b").Replace(expr, funcResult.ToString());
}
// Final evaluation.
result = Program.Scripting.Eval(expr);
}
catch (Exception ex)
{
Error(ex.Message);
return false;
}
return true;
}
////////////////////////////////// ---- PATTERNS ---- \\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\
/// <summary>
/// The pattern used for function calls.
/// </summary>
public static Regex PatternFunc = new Regex(@"([a-z][a-z0-9_]*)\((..*)\)");
As you can see, there is a pretty bad attempt at building a Regex to match the arguments. It doesn't work.
All I am trying to do is extract 2 * 7 and func2(3, 5) from the expression func1(2 * 7, func2(3, 5)) but it must work for functions with different argument counts as well. If there is a way to do this without using Regex that is also good.
There is both a simple solution and a more advanced solution (added after edit) to handle more complex functions.
To achieve the example you posted, I suggest doing this in two steps, the first step is to extract the parameters (regexes are explained at the end):
\b[^()]+\((.*)\)$
Now, to parse the parameters.
Simple solution
Extract the parameters using:
([^,]+\(.+?\))|([^,]+)
Here are some C# code examples (all asserts pass):
string extractFuncRegex = @"\b[^()]+\((.*)\)$";
string extractArgsRegex = @"([^,]+\(.+?\))|([^,]+)";
//Your test string
string test = @"func1(2 * 7, func2(3, 5))";
var match = Regex.Match( test, extractFuncRegex );
string innerArgs = match.Groups[1].Value;
Assert.AreEqual( innerArgs, @"2 * 7, func2(3, 5)" );
var matches = Regex.Matches( innerArgs, extractArgsRegex );
Assert.AreEqual( matches[0].Value, "2 * 7" );
Assert.AreEqual( matches[1].Value.Trim(), "func2(3, 5)" );
Explanation of regexes. The arguments extraction as a single string:
\b[^()]+\((.*)\)$
where:
[^()]+ chars that are not an opening, closing bracket.
\((.*)\) everything inside the brackets
The args extraction:
([^,]+\(.+?\))|([^,]+)
where:
([^,]+\(.+?\)) character that are not commas followed by characters in brackets. This picks up the func arguments. Note the +? so that the match is lazy and stops at the first ) it meets.
|([^,]+) If the previous does not match then match consecutive chars that are not commas. These matches go into groups.
More advanced solution
Now, there are some obvious limitations with that approach; for example, it matches the first closing bracket, so it doesn't handle nested functions very well. For a more comprehensive solution (if you require it), we need to use balancing group definitions (as I mentioned before this edit). For our purposes, balancing group definitions allow us to keep track of the instances of the open brackets and subtract the closing bracket instances. In essence, opening and closing brackets will cancel each other out in the balancing part of the search until the final closing bracket is found. That is, the match will continue until the brackets balance and the final closing bracket is found.
So, the regex to extract the parms is now (func extraction can stay the same):
(?:[^,()]+((?:\((?>[^()]+|\((?<open>)|\)(?<-open>))*\)))*)+
Here are some test cases to show it in action:
string extractFuncRegex = @"\b[^()]+\((.*)\)$";
string extractArgsRegex = @"(?:[^,()]+((?:\((?>[^()]+|\((?<open>)|\)(?<-open>))*\)))*)+";
//Your test string
string test = @"func1(2 * 7, func2(3, 5))";
var match = Regex.Match( test, extractFuncRegex );
string innerArgs = match.Groups[1].Value;
Assert.AreEqual( innerArgs, @"2 * 7, func2(3, 5)" );
var matches = Regex.Matches( innerArgs, extractArgsRegex );
Assert.AreEqual( matches[0].Value, "2 * 7" );
Assert.AreEqual( matches[1].Value.Trim(), "func2(3, 5)" );
//A more advanced test string
test = @"someFunc(a,b,func1(a,b+c),func2(a*b,func3(a+b,c)),func4(e)+func5(f),func6(func7(g,h)+func8(i,(a)=>a+2)),g+2)";
match = Regex.Match( test, extractFuncRegex );
innerArgs = match.Groups[1].Value;
Assert.AreEqual( innerArgs, @"a,b,func1(a,b+c),func2(a*b,func3(a+b,c)),func4(e)+func5(f),func6(func7(g,h)+func8(i,(a)=>a+2)),g+2" );
matches = Regex.Matches( innerArgs, extractArgsRegex );
Assert.AreEqual( matches[0].Value, "a" );
Assert.AreEqual( matches[1].Value.Trim(), "b" );
Assert.AreEqual( matches[2].Value.Trim(), "func1(a,b+c)" );
Assert.AreEqual( matches[3].Value.Trim(), "func2(a*b,func3(a+b,c))" );
Assert.AreEqual( matches[4].Value.Trim(), "func4(e)+func5(f)" );
Assert.AreEqual( matches[5].Value.Trim(), "func6(func7(g,h)+func8(i,(a)=>a+2))" );
Assert.AreEqual( matches[6].Value.Trim(), "g+2" );
Note especially that the method is now quite advanced:
someFunc(a,b,func1(a,b+c),func2(a*b,func3(a+b,c)),func4(e)+func5(f),func6(func7(g,h)+func8(i,(a)=>a+2)),g+2)
So, looking at the regex again:
(?:[^,()]+((?:\((?>[^()]+|\((?<open>)|\)(?<-open>))*\)))*)+
In summary, it starts out with characters that are not commas or brackets. Then if there are brackets in the argument, it matches and subtracts the brackets until they balance. It then tries to repeat that match in case there are other functions in the argument. It then goes onto the next argument (after the comma). In detail:
[^,()]+ matches anything that is not ',()'
?: means non-capturing group, i.e. do not store matches within brackets in a group.
\( means start at an open bracket.
?> means atomic grouping - essentially, this means it does not remember backtracking positions. This also helps to improve performance because there are fewer backtracking steps to try different combinations.
[^()]+| means anything but an opening or closing bracket. This is followed by | (or)
\((?<open>)| This is the good stuff and says match '(' or
\)(?<-open>) This is the better stuff that says match a ')' and balance out the '('. This means that this part of the match (everything after the first bracket) will continue until all the internal brackets match. Without the balancing expressions, the match would finish on the first closing bracket. The crux is that the engine does not match this ')' against the final ')', instead it is subtracted from the matching '('. When there are no further outstanding '(', the -open fails so the final ')' can be matched.
The rest of the regex contains the closing parenthesis for the group and the repetitions (*, * and +), which are respectively: repeat the inner bracket match 0 or more times, repeat the full bracket search 0 or more times (0 allows arguments without brackets) and repeat the full match 1 or more times (allows foo(1)+foo(2))
One final embellishment:
If you add (?(open)(?!)) to the regex:
(?:[^,()]+((?:\((?>[^()]+|\((?<open>)|\)(?<-open>))*(?(open)(?!))\)))*)+
The (?!) will always fail if open has captured something (that hasn't been subtracted), i.e. it will always fail if there is an opening bracket without a closing bracket. This is a useful way to test whether the balancing has failed.
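To see the (?(open)(?!)) test in isolation, here is a stripped-down sketch that only checks whether the parentheses in a string balance. It is a simplified pattern built from the same balancing-group constructs, not the argument-extraction regex above; the BalanceCheck name is made up.

```csharp
using System.Text.RegularExpressions;

public static class BalanceCheck
{
    // Each '(' pushes a capture onto the "open" stack, each ')' pops one
    // (and cannot match if the stack is empty). At the end, (?(open)(?!))
    // forces a failure if any '(' is still on the stack.
    private static readonly Regex Balanced = new Regex(
        @"^(?:[^()]|(?<open>\()|(?<-open>\)))*(?(open)(?!))$");

    public static bool IsBalanced(string text)
    {
        return Balanced.IsMatch(text);
    }
}
```

BalanceCheck.IsBalanced("func1(2 * 7, func2(3, 5))") is true, while "func1(2 * 7" and "a)b(" both give false: the first leaves an unclosed '(' on the stack, the second hits a ')' with nothing to pop.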
Some notes:
\b will not match when the last character is a ')' because it is not a word character and \b tests for word character boundaries so your regex would not match.
While regex is powerful, unless you are a guru among gurus it is best to keep the expressions simple because otherwise they are hard to maintain and hard for other people to understand. That is why it is sometimes best to break up the problem into subproblems and simpler expressions and let the language do some of the non search/match operations that it is good at. So, you may want to mix simple regexes with more complex code or vice versa, depending on where you are comfortable.
This will match some very complex functions, but it is not a lexical analyzer for functions.
If you can have strings in the arguments and the strings themselves can contain brackets, e.g. "go(..." then you will need to modify the regex to take strings out of the comparison. Same with comments.
Some links for balancing group definitions: here, here, here and here.
Hope that helps.
This regex does what you want:
^(?<FunctionName>\w+)\((?>(?(param),)(?<param>(?>(?>[^\(\),"]|(?<p>\()|(?<-p>\))|(?(p)[^\(\)]|(?!))|(?(g)(?:""|[^"]|(?<-g>"))|(?!))|(?<g>")))*))+\)$
Don't forget to escape backslashes and double quotes when pasting it in your code.
It will match correctly arguments in double quotes, inner functions and numbers like this one:
f1(123,"df""j"" , dhf",abc12,func2(),func(123,a>2))
The param stack will contain
123
"df""j"" , dhf"
abc12
func2()
func(123,a>2)
I'm sorry to burst the RegEx bubble, but this is one of those things that you just can't do effectively with regular expressions alone.
What you're implementing is basically an Operator-Precedence Parser with support for sub-expressions and argument lists. The statement is processed as a stream of tokens - possibly using regular expressions - with sub-expressions processed as high-priority operations.
With the right code you can do this as an iteration over the full token stream, but recursive parsers are common too. Either way you have to be able to effectively push state and restart parsing at each of the sub-expression entry points - a (, , or <function_name>( token - and pushing the result up the parser chain at the sub-expression exit points - ) or , token.
Regular expressions aren't going to get you completely out of trouble with this...
Since you have nested parentheses, you need to modify your code to count ( against ). When you encounter an (, you need to take note of the position then look ahead, incrementing a counter for each extra ( you find, and decrementing it for each ) you find. When your counter is 0 and you find a ), that is the end of your function parameter block, and you can then parse the text between the parentheses. You can also split the text on , when the counter is 0 to get function parameters.
If you reach the end of the string before finding that closing ), you have a "(" without ")" error.
You then take the text block(s) between the opening and closing parentheses and any commas, and repeat the above for each parameter.
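The counting approach can be sketched in a few lines. This version takes the text that sits between a call's outer parentheses and splits it on commas at depth 0; the ArgumentSplitter and SplitTopLevel names are assumptions for the example, and it does not handle quoted strings that contain parentheses or commas.

```csharp
using System;
using System.Collections.Generic;

public static class ArgumentSplitter
{
    // args is the text between the outer '(' and ')' of a function call.
    public static List<string> SplitTopLevel(string args)
    {
        var result = new List<string>();
        int depth = 0;   // number of currently open nested parentheses
        int start = 0;   // start index of the argument being read
        for (int i = 0; i < args.Length; i++)
        {
            char c = args[i];
            if (c == '(')
            {
                depth++;
            }
            else if (c == ')')
            {
                depth--;
                if (depth < 0)
                    throw new FormatException("')' without matching '(' at position " + i);
            }
            else if (c == ',' && depth == 0)
            {
                // A comma outside any nested call ends the current argument.
                result.Add(args.Substring(start, i - start).Trim());
                start = i + 1;
            }
        }
        if (depth != 0)
            throw new FormatException("'(' without matching ')'");
        result.Add(args.Substring(start).Trim());
        return result;
    }
}
```

SplitTopLevel("2 * 7, func2(3, 5)") returns the two strings 2 * 7 and func2(3, 5). Each returned argument can then be evaluated recursively, the same way the question's Evaluate already recurses into arguments that are not plain numbers. Note that an empty argument list comes back as a single empty string, so callers may want to check for that.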
There are some new (relatively very new) language-specific enhancements to regex that make it possible to match context free languages with "regex", but you will find more resources and more help when using the tools more commonly used for this kind of task:
It'd be better to use a parser generator like ANTLR, LEX+YACC, FLEX+BISON, or any other commonly used parser generator. Most of them come with complete examples on how to build simple calculators that support grouping and function calls.
