How can I normalize/canonize a regular expression pattern? - c#

I have a complex regular expression I've built with code. I want to normalize it to the simplest (canonical) form that will be an equivalent regular expression but without the extra brackets and so on.
I want it to be normalized so I can understand if it's correct and find bugs in it.
Here is an example for a regular expression I want to normalize:
^(?:(?:(?:\r\n(?:[ \t]+))*)(<transfer-coding>(?:chunked|(?:(?:[\x21\x23-\x27\x2A\x2B\x2D\x2E0-9A-Z\x5E\x7A\x7C\x7E-\xFE]+)(?:(?:;(?:(?:[\x21\x23-\x27\x2A\x2B\x2D\x2E0-9A-Z\x5E\x7A\x7C\x7E-\xFE]+)=(?:(?:[\x21\x23-\x27\x2A\x2B\x2D\x2E0-9A-Z\x5E\x7A\x7C\x7E-\xFE]+)|(?:"(?:(?:(?:|[^\x00-\x31\x127\"])|(?:\\[\x00-\x127]))*)))))*))))(?:(?:(?:\r\n(?:[ \t]+))*),(?:(?:\r\n(?:[ \t]+))*)(<transfer-coding>(?:chunked|(?:(?:[\x21\x23-\x27\x2A\x2B\x2D\x2E0-9A-Z\x5E\x7A\x7C\x7E-\xFE]+)(?:(?:;(?:(?:[\x21\x23-\x27\x2A\x2B\x2D\x2E0-9A-Z\x5E\x7A\x7C\x7E-\xFE]+)=(?:(?:[\x21\x23-\x27\x2A\x2B\x2D\x2E0-9A-Z\x5E\x7A\x7C\x7E-\xFE]+)|(?:"(?:(?:(?:|[^\x00-\x31\x127\"])|(?:\\[\x00-\x127]))*)))))*))))*))$

I'm with the other answers and comments so far. Even if you could define a reduced form, it's unlikely that the reduced form is going to be any more understandable than this thing, which resembles line noise on a 1200 baud modem.
If you did want to find a canonical form for regular expressions, I'd start by defining precisely what you mean by "canonical form". For example, suppose you have the regular expression [ABCDEF-I]. Is the canonical form (1) [ABCDEF-I], (2) [ABCDEFGHI] or (3) [A-I]?
That is, for purposes of canonicalization, do you want to (1) ignore this subset of regular expressions for the purposes of canonicalization, (2) eliminate all "-" operators, thereby simplifying the expression, or (3) make it shorter?
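To make options (2) and (3) concrete, here is a minimal sketch for a simple character class body. The names CharClassCanonicalizer, Expand, Compress and Canonicalize are hypothetical, and the sketch assumes the body has no escapes, no negation, and no literal '-' or ']' members.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

public static class CharClassCanonicalizer
{
    // Option (2): expand a body such as "ABCDEF-I" into its member characters.
    // Assumption: '-' only ever appears between two ordinary characters.
    public static SortedSet<char> Expand(string body)
    {
        var members = new SortedSet<char>();
        for (int i = 0; i < body.Length; i++)
        {
            if (i + 2 < body.Length && body[i + 1] == '-')
            {
                for (int c = body[i]; c <= body[i + 2]; c++) members.Add((char)c);
                i += 2; // skip the '-' and the range end
            }
            else
            {
                members.Add(body[i]);
            }
        }
        return members;
    }

    // Option (3): emit the shortest form, using a range for runs of 3 or more.
    public static string Compress(IEnumerable<char> members)
    {
        char[] sorted = members.Distinct().OrderBy(c => c).ToArray();
        var sb = new StringBuilder("[");
        int i = 0;
        while (i < sorted.Length)
        {
            int j = i;
            while (j + 1 < sorted.Length && sorted[j + 1] == sorted[j] + 1) j++;
            sb.Append(sorted[i]);
            if (j - i >= 2) sb.Append('-').Append(sorted[j]); // run of 3+: range
            else if (j > i) sb.Append(sorted[j]);             // run of 2: list both
            i = j + 1;
        }
        return sb.Append(']').ToString();
    }

    public static string Canonicalize(string body) => Compress(Expand(body));
}
```

Under these assumptions, Canonicalize("ABCDEF-I") yields [A-I], and string.Concat(Expand("ABCDEF-I")) yields ABCDEFGHI. Whichever form you pick, the point is to pick one and apply it consistently.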
The simplest way would be to go through every part of the regular expression specification and work out which subexpressions are logically equivalent to another form, and decide which of the two is "more canonical". Then write a recursive regular expression analyzer that goes through a regular expression and replaces each subexpression with its canonical form. Keep doing that in a loop until you find the "fixed point", the regular expression that doesn't change when you put it in canonical form.
That, however, will not necessarily do what you want. If what you want is to reorganize the regular expression to minimize the complexity of grouping or some such thing then what you might want to do is to canonicalize the regular expression so that it is in a form such that it only has grouping, union and Kleene star operators. Once it is in that form you can easily translate it into a deterministic finite automaton, and once it is in DFA form then you can run a graph simplification algorithm on the DFA to form an equivalent simpler DFA. Then you can turn the resulting simplified DFA back into a regular expression.
Though that would be fascinating, like I said, I don't think it would actually solve your problem. Your problem, as I understand it, is a practical one. You have this mess, and you want to understand that it is right.
I would approach that problem by a completely different tack. If the problem is that the literal string is hard to read, then don't write it as a literal string. I'd start "simplifying" your regular expression by making it read like a programming language instead of reading like line noise:
Func<string, string> group = s => "(?:" + s + ")";
Func<string, string> capture = s => "(" + s + ")";
Func<string, string> anynumberof = s => s + "*";
Func<string, string> oneormoreof = s => s + "+";
var beginning = "^";
var end = "$";
var newline = @"\r\n";
var tab = @"\t";
var space = " ";
var semi = ";";
var comma = ",";
var equal = "=";
var chunked = "chunked";
var transfer = "<transfer-coding>";
var backslash = @"\\";
var escape = group(backslash + @"[\x00-\x7f]");
var or = "|";
var whitespace =
    group(
        anynumberof(
            group(
                newline +
                group(
                    oneormoreof(@"[ \t]")))));
var legalchars =
    group(
        oneormoreof(@"[\x21\x23-\x27\x2A\x2B\x2D\x2E0-9A-Z\x5E\x7A\x7C\x7E-\xFE]"));
var re =
    beginning +
    group(
        whitespace +
        capture(
            transfer +
            group(
                chunked +
                or +
                group(
                    legalchars +
                    group(
                        group(
                            semi +
                            anynumberof(
                                group(
                                    legalchars +
                                    equal +
                                    ...
Once it looks like that it'll be a lot easier to understand and optimize.

I think you're getting ahead of yourself; the problems with that regex are not just cosmetic. Many of the parentheses can simply be dropped, as in (?:[ \t]+), but I suspect some of them are changing the meaning of the regex in ways you didn't intend.
For example, what's (?:|[^\x00-\x31\x127\"]) supposed to mean? With that pipe at the beginning, it's equivalent to [^\x00-\x31\x127\"]??--zero or one, reluctantly, of whatever the character class matches. Is that really what you intended?
The character class itself is highly suspect as well. It's obviously meant to match anything other than an ASCII control character or a quotation mark, but the numbers are decimal where they should be hexadecimal: [^\x00-\x1F\x7F\"]
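A small sketch can show the difference between the class as written and the presumably intended one. In .NET, \x takes exactly two hex digits, so \x31 is the character '1' and \x127 is \x12 followed by a literal '7'. The class name HexEscapeCheck and its members are hypothetical, and the "intended" pattern is my reading of the intent, not something the question confirms.

```csharp
using System.Text.RegularExpressions;

public static class HexEscapeCheck
{
    // The class as it appears in the question (decimal values written as hex).
    public const string AsWritten = @"^[^\x00-\x31\x127""]$";

    // Assumed intent: exclude control characters (0x00-0x1F, 0x7F) and the quote.
    public const string Intended = @"^[^\x00-\x1F\x7F""]$";

    // True if the single character c is accepted by the given anchored class.
    public static bool Accepts(string pattern, char c) =>
        Regex.IsMatch(c.ToString(), pattern);
}
```

With these two patterns, Accepts(AsWritten, '!') is false because '!' (0x21) falls inside \x00-\x31, while Accepts(Intended, '!') is true. Likewise '7' is rejected by the class as written but accepted by the intended one, and DEL (0x7F) is accepted as written but rejected by the intended one.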

I am not aware of any tool that can do this. I even strongly doubt there is something like a canonical form for regular expressions - they are complex enough that there are usually several and vastly different solutions.
If this expression is the output of a generator, it seems much more promising to me to (unit) test the code generator.

I'd just write it in an expanded form:
^
(?:
  (?: (?: \r\n (?:[ \t]+) )* )
  (<transfer-coding>
    (?: chunked
      | (?:
          (?:[\x21\x23-\x27\x2A\x2B\x2D\x2E0-9A-Z\x5E\x7A\x7C\x7E-\xFE]+)
          (?:
            (?:
              ;
              (?:
                (?:[\x21\x23-\x27\x2A\x2B\x2D\x2E0-9A-Z\x5E\x7A\x7C\x7E-\xFE]+)
                =
                (?: (?:[\x21\x23-\x27\x2A\x2B\x2D\x2E0-9A-Z\x5E\x7A\x7C\x7E-\xFE]+)
                  | (?:
                      "
                      (?:
                        (?:
                          (?:
                            | [^\x00-\x31\x127\"]
                          )
                          | (?:\\[\x00-\x127])
                        )*
                      )
                    )
                )
              )
            )*
          )
        )
    )
  )
  (?:
    (?: (?: \r\n (?:[ \t]+) )* )
    ,
    (?: (?: \r\n (?:[ \t]+) )* )
    (<transfer-coding>
      (?: chunked
        | (?:
            (?:[\x21\x23-\x27\x2A\x2B\x2D\x2E0-9A-Z\x5E\x7A\x7C\x7E-\xFE]+)
            (?:
              (?:
                ;
                (?:
                  (?:[\x21\x23-\x27\x2A\x2B\x2D\x2E0-9A-Z\x5E\x7A\x7C\x7E-\xFE]+)
                  =
                  (?: (?:[\x21\x23-\x27\x2A\x2B\x2D\x2E0-9A-Z\x5E\x7A\x7C\x7E-\xFE]+)
                    | (?:
                        "
                        (?:
                          (?:
                            (?:
                              | [^\x00-\x31\x127\"]
                            )
                            | (?:\\[\x00-\x127])
                          )*
                        )
                      )
                  )
                )
              )*
            )
          )
      )
    )*
  )
)
$
You can quickly locate unnecessary grouping, and locate some errors. Some errors I saw:
Missing ? for the named groups. It should be (?<name> ).
No closing double quote (").
You can even use the regex in this form. If you supply the flag RegexOptions.IgnorePatternWhitespace when constructing the Regex object, any whitespace or comments (#) in the pattern will be ignored.
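As a small illustration of that flag, the sketch below uses a deliberately tiny, made-up pattern (VerbosePattern and its name/value groups are hypothetical and unrelated to the transfer-coding regex). It shows the corrected (?<name> ...) group syntax, inline # comments, and the fact that whitespace inside a character class stays literal even in this mode.

```csharp
using System.Text.RegularExpressions;

public static class VerbosePattern
{
    // Hypothetical "key = number" pattern, laid out across several lines.
    public static readonly Regex Assignment = new Regex(@"
        ^
        (?<name> [A-Za-z]+ )   # a named group needs the leading ?
        [ \t]* = [ \t]*        # whitespace inside [...] is still literal
        (?<value> \d+ )
        $",
        RegexOptions.IgnorePatternWhitespace);
}
```

For example, VerbosePattern.Assignment.Match("width = 42") succeeds with Groups["name"].Value equal to "width" and Groups["value"].Value equal to "42".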

Proving correctness is not a good motivation for normalization, because the normal form can be very obscure and totally unrecognizable.
To get correctness, you either 1) run a lot of tests on it or 2) obtain the state machine and prove correctness by induction.

Related

Regex: Matching all words EXCEPT those inside of parenthesis (C#)

So given:
COLUMN_1, COLUMN_2, COLUMN_3, ((COLUMN_1) AS SOME TEXT) AS COLUMN_4, COLUMN_5
How would I go about getting my matches as:
COLUMN_1
COLUMN_2
COLUMN_3
COLUMN_4
COLUMN_5
I've tried:
(?<!(\(.*?\)))(\w+)(,\s*\w+)*?
But I feel like I'm way off base :( I'm using regexstorm.net for testing.
Appreciate any help :)
You need a regex that keeps track of opening and closing parentheses and makes sure that a word is only matched if a balanced set of parentheses (or no parentheses at all) follow:
Regex regexObj = new Regex(
    @"\w+                 # Match a word
    (?=                   # only if it's possible to match the following:
     (?>                  # Atomic group (used to avoid catastrophic backtracking):
      [^()]+              # Match any characters except parens
     |                    # or
      \( (?<DEPTH>)       # a (, increasing the depth counter
     |                    # or
      \) (?<-DEPTH>)      # a ), decreasing the depth counter
     )*                   # any number of times.
     (?(DEPTH)(?!))       # Then make sure the depth counter is zero again
     $                    # at the end of the string.
    )                     # (End of lookahead assertion)",
    RegexOptions.IgnorePatternWhitespace);
I tried to provide a test link to regexstorm.net, but it was too long for StackOverflow. Apparently, SO also doesn't like URL shorteners, so I can't link this directly, but you should be able to recreate the link easily: http://bit[dot]ly/2cNZS0O
This should work:
(?<!\()COLUMN_[\d](?!\))
Try it: https://regex101.com/r/bC4D7n/1
Update:
Ok, then try to use this regular expression:
[\(]+[\w\s\W]+[\)]+
Demo here: https://regex101.com/r/bC4D7n/2
Matching all words except some set of them is one of the most difficult exercises you can do with regular expressions. The textbook way is: construct the finite automaton that accepts your original, non-negated predicate, swap the accepting and non-accepting states, and finally construct a regular expression equivalent to the resulting automaton. That is hard to do by hand, so the easiest way to deal with it is to write the regexp for the predicate you want to negate and pass your string through the regexp matcher; if it matches, just reject it.
The main problem is that this is easy for computers, but constructing a regular expression from an automaton description is tedious and normally does not give you the result you want (and in fact gives a huge one). Let me illustrate with an example:
You have asked for matching words, but only those that are not in a given set. Suppose we take the automaton that matches precisely one word of that set, and suppose we have matched the first n-1 letters of that word. The string should be accepted, but only if the final letter does not come next. So the regexp needs a branch that matches all the letters of the word but the last, followed by anything other than the last letter. We need the same kind of branch for a string that matches all the letters but the last two, and so on, back to the first letter (obviously, if the input doesn't begin with the first letter of the word, it differs from the word anyway). Let's suppose the word is BEGIN. A good regexp matching things that are not equal to BEGIN is something like this:
[^B]|B[^E]|BE[^G]|BEG[^I]|BEGI[^N]
A different scenario (one that complicates things more) is to find a regexp that matches the string if the word BEGIN is not contained in it. Let's start from the opposite predicate, finding a string that contains the word BEGIN:
^.*BEGIN.*$
and let's construct its finite automaton:
(0)---B--->(1)---E--->(2)---G--->(3)---I--->(4)---N--->((5))
 ^ \        |          |          |          |           ^ \
 | |        |          |          |          |           | |
 `-+<-------+<---------+<---------+<---------'           `-+
where the double parentheses indicate an accepting state. If you just swap the accepting and non-accepting states, you'll get an automaton that accepts all the strings the first one didn't, and vice versa.
((0))--B-->((1))--E-->((2))--G-->((3))--I-->((4))--N-->(5)
 ^ \         |          |          |          |         ^ \
 | |         |          |          |          |         | |
 `-+<--------+<---------+<---------+<---------'         `-+
But converting this into a simple regular expression is far from easy (you can try, if you don't believe me).
And this is with only one word, so think about how to match any of the words: construct the automaton, and then switch the accepting/non-accepting status of each state.
In your case there is something more to deal with: your predicate is not equivalent to the one I have formulated. My predicate is for matching whole strings that contain one word (which is the use for which regexps were conceived), but yours is for matching groups inside the string. If you try my example, you will find that a string as simple as "" (the empty string) is accepted by the second automaton, as the starting state ((0)) is an accepting state (well, the empty string doesn't contain the word BEGIN). But you want your regexp to match words (and "" isn't a word), so we first need to define what a word is for you and construct the regular expression that matches a word:
[a-zA-Z][a-zA-Z]*
should be a good candidate. It corresponds to an automaton like this:
(0)---[a-zA-Z]--->((1))---[a-zA-Z]--.
 ^ \               |  ^             |
 |  *              *  |             |
 `--+<-------------'  `-------------'
and you want an automaton that accepts both conditions: (1) it must be a word, and (2) it is not in the set of words. Not being in the set of words is the same as not being the first word, and not being the second, and not being the third, and so on; you can build that by first constructing an automaton that matches if the input is the first word, or the second, or the third, ..., and then negating it. Construct the first automaton, then the second, and then construct an automaton that accepts only what both accept. This, again, is easy to do for computers, but not for people.
As I said, constructing an automaton from a regexp is an easy and direct thing for a computer, but not for a person. Constructing a regexp from an automaton is also mechanical, but it results in huge regular expressions. Because of this problem, most implementations have added extended operators that match when some regexp doesn't, and the opposite.
CONCLUSION
Use the negation operators that let you express the opposite predicate about the set of strings your regexp must accept, or simply construct a regexp that does the simple things and use Boolean algebra to do the rest.
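Both options from the conclusion can be sketched in a few lines for the "does not contain BEGIN" predicate. This is only an illustration: the class and method names are hypothetical, and the lookahead variant assumes an engine with negative lookahead, which .NET provides.

```csharp
using System.Text.RegularExpressions;

public static class NegationDemo
{
    // Negation operator: the negative lookahead succeeds only when
    // "BEGIN" cannot be found anywhere ahead of the start of the input.
    private static readonly Regex NoBegin =
        new Regex(@"\A(?!.*BEGIN).*\z", RegexOptions.Singleline);

    public static bool LacksBeginByLookahead(string s) => NoBegin.IsMatch(s);

    // Boolean algebra: match the simple positive predicate, negate in code.
    public static bool LacksBeginByNegation(string s) => !Regex.IsMatch(s, "BEGIN");
}
```

The two methods should agree on every input: both accept "" and "BEGI", and both reject "xBEGINy". The second form is usually the easier one to read and maintain.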
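Since you have nested parentheses, things get trickier. Although the .NET regex engine provides balancing group constructs, which use a stack, I went with a more general approach called recursive matching. Note that (?R) is PCRE-style syntax that the .NET Regex class does not support, so this pattern only works in engines that offer recursion.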
Regex:
\((?(?!\(|\)).|(?R))*\)|(\w+)
Live demo
All you need is in first capturing group.
Explanation of left side of alternation:
\(                # Match an opening bracket
(?(?!\(|\))       # If next character is not `(` or `)`
    .             # Then match it
    |             # Otherwise
    (?R)          # Recurse into the whole pattern
)*                # As much as possible
\)                # Up to corresponding closing bracket

Regex nested parentheses

I have the following string:
a,b,c,d.e(f,g,h,i(j,k)),l,m,n
Could someone tell me how I could build a regex that returns only the "first level" of parentheses, something like this:
[0] = a,b,c,
[1] = d.e(f,g,h,i.j(k,l))
[2] = m,n
The goal is to keep each section, together with its nested parentheses, under a single index so that I can manipulate it later.
Thank you.
EDIT
Trying to improve the example...
Imagine I have this string
username,TB_PEOPLE.fields(FirstName,LastName,TB_PHONE.fields(num_phone1, num_phone2)),password
My goal is to turn a string into a dynamic query.
The fields that do not begin with "TB_" I know are fields of the main table; otherwise, I know the fields listed within the parentheses belong to another, related table.
But I am having difficulty retrieving all the "first level" fields. Once I can separate them from the related tables, I can recover the remaining fields recursively.
In the end, would have something like:
[0] = username,password
[1] = TB_PEOPLE.fields(FirstName,LastName,TB_PHONE.fields(num_phone1, num_phone2))
I hope I have explained a little better, sorry.
You can use this:
(?>\w+\.)?\w+\((?>\((?<DEPTH>)|\)(?<-DEPTH>)|[^()]+)*\)(?(DEPTH)(?!))|\w+
With your example you obtain:
0 => username
1 => TB_PEOPLE.fields(FirstName,LastName,TB_PHONE.fields(num_phone1, num_phone2))
2 => password
Explanation:
(?>\w+\.)? \w+ \(      # the opening parenthesis (with the function name)
(?>                    # open an atomic group
    \( (?<DEPTH>)      # when an opening parenthesis is encountered,
                       # then increment the stack named DEPTH
  |                    # OR
    \) (?<-DEPTH>)     # when a closing parenthesis is encountered,
                       # then decrement the stack named DEPTH
  |                    # OR
    [^()]+             # content that is not parenthesis
)*                     # close the atomic group, repeat zero or more times
\)                     # the closing parenthesis
(?(DEPTH)(?!))         # conditional: if the stack named DEPTH is not empty
                       # then fail (ie: parenthesis are not balanced)
You can try it with this code:
string input = "username,TB_PEOPLE.fields(FirstName,LastName,TB_PHONE.fields(num_phone1, num_phone2)),password";
string pattern = @"(?>\w+\.)?\w+\((?>\((?<DEPTH>)|\)(?<-DEPTH>)|[^()]+)*\)(?(DEPTH)(?!))|\w+";
MatchCollection matches = Regex.Matches(input, pattern);
foreach (Match match in matches)
{
    Console.WriteLine(match.Groups[0].Value);
}
I suggest a new strategy, R2 - do it algorithmically. While you can build a Regex that will eventually come close to what you're asking, it'll be grossly unmaintainable, and hard to extend when you find new edge cases. I don't speak C#, but this pseudo code should get you on the right track:
function parenthetical_depth(some_string):
    open = count '(' in some_string
    close = count ')' in some_string
    return open - close
function smart_split(some_string):
    bits = split some_string on ','
    new_bits = empty list
    bit = empty string
    while bits has next:
        bit = fetch next from bits
        while parenthetical_depth(bit) != 0:
            bit = bit + ',' + fetch next from bits
        place bit into new_bits
    return new_bits
This is the easiest way to understand it, the algorithm is currently O(n^2) - there's an optimization for the inner loop to make it O(n) (with the exception of String copying, which is kind of the worst part of this):
depth = parenthetical_depth(bit)
while depth != 0:
    nbit = fetch next from bits
    depth = depth + parenthetical_depth(nbit)
    bit = bit + ',' + nbit
The string copying can be made more efficient with clever use of buffers and buffer size, at the cost of space efficiency, but I don't think C# gives you that level of control natively.
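For readers who do want C#, the pseudocode above might translate roughly as follows. This is a sketch of the optimized variant (running depth counter), not the answerer's own code; SmartSplitter and its method names are hypothetical, and a StringBuilder stands in for the buffer handling mentioned above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

public static class SmartSplitter
{
    // Net number of unclosed '(' in a fragment: count of '(' minus count of ')'.
    public static int ParentheticalDepth(string s) =>
        s.Count(c => c == '(') - s.Count(c => c == ')');

    // Splits on ',' but glues fragments back together while parentheses are open.
    public static List<string> SmartSplit(string input)
    {
        var result = new List<string>();
        var current = new StringBuilder();
        int depth = 0;
        foreach (string bit in input.Split(','))
        {
            if (current.Length > 0) current.Append(','); // restore the removed comma
            current.Append(bit);
            depth += ParentheticalDepth(bit);
            if (depth == 0)
            {
                result.Add(current.ToString());
                current.Clear();
            }
        }
        // Assumption: an unbalanced tail is returned as-is rather than rejected.
        if (current.Length > 0) result.Add(current.ToString());
        return result;
    }
}
```

Called on the question's string, SmartSplit returns three items: username, the whole TB_PEOPLE.fields(...) section, and password. The nested section can then be fed back into the same routine after stripping its outer name and parentheses.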
If I understood your example correctly, you are looking for something like this:
(?<head>[a-zA-Z._]+\,)*(?<body>[a-zA-Z._]+[(].*[)])(?<tail>.*)
For given string:
username,TB_PEOPLE.fields(FirstName,LastName,TB_PHONE.fields(num_phone1, num_phone2)),password
This expression will match
username, for group head
TB_PEOPLE.fields(FirstName,LastName,TB_PHONE.fields(num_phone1, num_phone2)) for group body
,password for group tail

Regular expression for 1(2),3,4(5,6(7,8),9),10

Prologue:
Input string: 1(2),3,4(5,6(7,8),9),10
I am using C# and I would want to eventually get a List<foo> from the above expression
public class foo
{
    public int bar { get; set; }
    public List<foo> listOfFoo { get; set; }
}
I can achieve the task by writing some validations and parsing character by character, but would like to know a better way. Lesser the code, lesser the bugs they say ;)
Query
I am looking for a regular expression for validating and possibly capturing information in a string like
1(2),3,4(5,6(7,8),9),10
The string is basically a set of numbers separated by comma. But a number can have some sub expressions for it using parenthesis ( )
What I want to fetch from the string is a graph like
1
    2
3
4
    5
    6
        7
        8
    9
10
I have very little idea about reg-ex. I can read & understand most of them, but writing one I find really tough
Looking for someone to tell me if something like this is at all achievable using RegEx. If so, what should be the approach? I can see that I would need a recursive expression, any links or examples would be of great help. Someone willing to give me the RegEx itself would be icing on the cake :)
.NET regex has balancing groups, which allow you to count and match balanced parentheses, as in this case.
For that you can use an expression like this:
(?x)                     # ignore spaces and comments
^
(?:
  (?<open> \( )*         # open++
  \d+
  (?<-open> \) )*        # open--
  (?:
    , (?!\z)             # match a , but not at end of string
    | \z                 # or end of string
  )
)+
\z
(?(open) (?!) )          # fail if unbalanced (open > 0)
This would validate but not parse the string. To build a tree like you desire you have to use a parser I believe.
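A small recursive-descent parser is one way to do that. The sketch below assumes the grammar implied by the example (list := item (',' item)*, item := digits ['(' list ')']) with no whitespace; Foo, Bar and ListOfFoo mirror the question's foo class with PascalCase names, and NumberTreeParser is a hypothetical name.

```csharp
using System;
using System.Collections.Generic;

public class Foo
{
    public int Bar { get; set; }
    public List<Foo> ListOfFoo { get; } = new List<Foo>();
}

public static class NumberTreeParser
{
    // Parses the whole input; throws FormatException on malformed text.
    public static List<Foo> Parse(string text)
    {
        int pos = 0;
        List<Foo> nodes = ParseList(text, ref pos);
        if (pos != text.Length)
            throw new FormatException($"Unexpected '{text[pos]}' at position {pos}.");
        return nodes;
    }

    // list := item (',' item)*
    private static List<Foo> ParseList(string text, ref int pos)
    {
        var nodes = new List<Foo> { ParseItem(text, ref pos) };
        while (pos < text.Length && text[pos] == ',')
        {
            pos++; // skip ','
            nodes.Add(ParseItem(text, ref pos));
        }
        return nodes;
    }

    // item := digits ['(' list ')']
    private static Foo ParseItem(string text, ref int pos)
    {
        int start = pos;
        while (pos < text.Length && text[pos] >= '0' && text[pos] <= '9') pos++;
        if (pos == start)
            throw new FormatException($"Expected a number at position {pos}.");
        var node = new Foo { Bar = int.Parse(text.Substring(start, pos - start)) };
        if (pos < text.Length && text[pos] == '(')
        {
            pos++; // skip '('
            node.ListOfFoo.AddRange(ParseList(text, ref pos));
            if (pos >= text.Length || text[pos] != ')')
                throw new FormatException($"Expected ')' at position {pos}.");
            pos++; // skip ')'
        }
        return node;
    }
}
```

NumberTreeParser.Parse("1(2),3,4(5,6(7,8),9),10") returns four top-level nodes (1, 3, 4, 10), with 2 under 1, and 5, 6, 9 under 4, where 6 in turn holds 7 and 8. Because malformed input throws, the same routine doubles as the validator.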

How can I improve the performance of a .NET regular expression?

I have a regular expression which parses a (very small) subset of the Razor template language. Recently, I added a few more rules to the regex which dramatically slowed its execution. I'm wondering: are there certain regex constructs that are known to be slow? Is there a restructuring of the pattern I'm using that would maintain readability and yet improve performance? Note: I've confirmed that this performance hit occurs post-compilation.
Here's the pattern:
new Regex(
    @" (?<escape> \@\@ )"
    + @"| (?<comment> \@\* ( ([^\*]\@) | (\*[^\@]) | . )* \*\@ )"
    + @"| (?<using> \@using \s+ (?<namespace> [\w\.]+ ) (\s*;)? )"
    // captures expressions of the form "foreach (var [var] in [expression]) { <text>"
    /* ---> */ + @"| (?<foreach> \@foreach \s* \( \s* var \s+ (?<var> \w+ ) \s+ in \s+ (?<expressionValue> [\w\.]+ ) \s* \) \s* \{ \s* <text> )"
    // captures expressions of the form "if ([expression]) { <text>"
    /* ---> */ + @"| (?<if> \@if \s* \( \s* (?<expressionValue> [\w\.]+ ) \s* \) \s* \{ \s* <text> )"
    // captures the close of a razor text block
    + @"| (?<endBlock> </text> \s* \} )"
    // an expression of the form @([(int)] a.b.c)
    + @"| (?<parenAtExpression> \@\( \s* (?<castToInt> \(int\)\s* )? (?<expressionValue> [\w\.]+ ) \s* \) )"
    + @"| (?<atExpression> \@ (?<expressionValue> [\w\.]+ ) )"
    /* ---> */ + @"| (?<literal> ([^\@<]+|[^\@]) )",
    RegexOptions.IgnorePatternWhitespace | RegexOptions.Multiline | RegexOptions.ExplicitCapture | RegexOptions.Compiled);
/* ---> */ indicates the new "rules" that caused the slowdown.
As you are not anchoring the expression the engine will have to check each alternative sub-pattern at every position of the string before it can be sure that it can't find a match. This will always be time-consuming, but how can it be made less so?
Some thoughts:
I don't like the sub-pattern on the second line that tries to match comments and I don't think it will work correctly.
I can see what you're trying to do with the ( ([^\*]\@) | (\*[^\@]) | . )* sub-pattern: allow @ and * within the comments as long as they are not preceded by * or followed by @ respectively. But because of the group's * quantifier and the third option ., the sub-pattern will happily match *@, therefore rendering the other options redundant.
And assuming that the subset of Razor you are trying to match does not allow multiline comments, I suggest for the second line
+ @"| (?<comment> @\*.*?\*@ )"
i.e. lazily match any characters (but newlines) until the first *@ is encountered.
You are using RegexOptions.ExplicitCapture meaning only named groups are being captured, so the lack of () should not be a problem.
I also do not like the ([^\@<]+|[^\@]) sub-pattern in the last line, which equates to ([^\@<]+|<). The [^\@<]+ will greedily match to the end of the string unless it comes across a @ or <.
I do not see any adjacent sub-patterns that will match the same text, which are the usual culprits for excessive backtracking, but all the \s* seem suspect because of their greed and flexibility, including matching nothing and newlines. Perhaps you could change some of the \s* to [ \t]* where you know you don't want to match newlines, for example, perhaps before the opening bracket following an if.
I notice that nhahtdh has suggested you use atomic grouping to prevent the engine backtracking into the previously matched text, and that is certainly something worth experimenting with, as it is almost certainly the excessive backtracking caused when the engine can no longer find a match that is causing the slow-down.
What are you trying to achieve with the RegexOptions.Multiline option? You do not look to be using ^ or $ so it will have no effect.
The escaping of the @ is unnecessary.
As others have mentioned, you can improve the readability by removing unnecessary escapes (such as escaping @ or escaping characters aside from \ inside a character class; for example, using [^*] instead of [^\*]).
Here are some ideas for improving performance:
Order your different alternatives so that the most likely ones come first.
The regex engine will attempt to match each alternative in the order that they appear in the regex. If you put the ones that are more likely up front, then the engine will not have to waste time attempting to match against unlikely alternatives for the majority of cases.
Remove unnecessary backtracking
Note the ending of your "using" alternative: @"| (?<using> \@using \s+ (?<namespace> [\w\.]+ ) (\s*;)? )"
If for some reason you have a large amount of whitespace, but no closing ; at the end of a using line, the regex engine must backtrack through each whitespace character until it finally decides that it can't match (\s*;). In your case, (\s*;)? can be replaced with \s*;? to prevent backtracking in these scenarios.
In addition, you could use atomic groups (?>...) to prevent backtracking through quantifiers (e.g. * and +). This really helps improve performance when you don't find a match. For example, your "foreach" alternative contains \s* \( \s*. If you find the text "foreach var...", the "foreach" alternative will greedily match all of the whitespace after foreach, and then fail when it doesn't find an opening (. It will then backtrack, one whitespace-character at a time, and try to match ( at the previous position until it confirms that it cannot match that line. Using an atomic group (?>\s*)\( will cause the regex engine to not backtrack through \s* if it matches, allowing the regex to fail more quickly.
Be careful when using them though, as they can cause unintended failures when used in the wrong place (for instance, (?>.*); will never match anything, because the greedy .* matches all characters (including ;) and the atomic grouping (?>...) prevents the regex engine from backtracking one character to match the ending ;).
"Unroll the loop" on some of your alternatives, such as your "comment" alternative (also useful if you plan on adding an alternative for strings).
For example: @"| (?<comment> \@\* ( ([^\*]\@) | (\*[^\@]) | . )* \*\@ )"
Could be replaced with @"| (?<comment> @\* [^*]* (\*+[^@*][^*]*)* \*+@ )"
The new regex boils down to:
@\*: Find the beginning of a comment @*
[^*]*: Read all "normal characters" (anything that's not a * because that could signify the end of the comment)
(\*+[^@*][^*]*)*: include any non-terminal * inside the comment
(\*+[^@*]: If we find a run of *s, ensure that it is followed by something other than @ (excluding * as well keeps the run from being split by backtracking)
[^*]*: Go back to reading all "normal characters"
)*: Loop back to the beginning if we find another *
\*+@: Finally, grab the end of the comment *@ being careful to include any extra *
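The unrolled comment pattern can be tried on its own. The sketch below is an isolated illustration rather than the full Razor regex: RazorComment and FirstComment are hypothetical names, and the class after the star run is written as [^@*], a deliberate choice so that a comment ending in **@ cannot be extended into a later comment through backtracking.

```csharp
using System.Text.RegularExpressions;

public static class RazorComment
{
    // @* ... *@ with the loop unrolled: normal* (special normal*)* end
    public static readonly Regex Unrolled =
        new Regex(@"@\*[^*]*(?:\*+[^@*][^*]*)*\*+@");

    // Returns the first comment found, or an empty string when there is none.
    public static string FirstComment(string s) => Unrolled.Match(s).Value;
}
```

For instance, FirstComment("x @* a **@ b @* c *@") returns "@* a **@", stopping at the first terminator, and FirstComment("@* a * b *@") returns the whole input. Since [^*] also matches newlines, multi-line comments are covered without any extra options.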
You can find many more ideas for improving the performance of your regular expressions from Jeffrey Friedl's Mastering Regular Expressions (3rd Edition).

C# Regex - How to remove multiple paired parentheses from string

I am trying to figure out how to use C# regular expressions to remove all instances of paired parentheses from a string. The parentheses and all text between them should be removed. The parentheses aren't always on the same line. Also, there might be nested parentheses. An example of the string would be
This is a (string). I would like all of the (parentheses
to be removed). This (is) a string. Nested ((parentheses) should) also
be removed. (Thanks) for your help.
The desired output should be as follows:
This is a . I would like all of the . This a string. Nested also
be removed. for your help.
Fortunately, .NET can match nested constructs in regexes by using balancing groups (see Balancing Group Definitions):
Regex regexObj = new Regex(
    @"\(              # Match an opening parenthesis.
      (?>             # Then either match (possessively):
       [^()]+         # any characters except parentheses
      |               # or
       \( (?<Depth>)  # an opening paren (and increase the parens counter)
      |               # or
       \) (?<-Depth>) # a closing paren (and decrease the parens counter).
      )*              # Repeat as needed.
     (?(Depth)(?!))   # Assert that the parens counter is at zero.
     \)               # Then match a closing parenthesis.",
    RegexOptions.IgnorePatternWhitespace);
In case anyone is wondering: The "parens counter" may never go below zero ((?<-Depth>) will fail otherwise), so even if the parentheses are "balanced" but aren't correctly matched (like ()))((()), this regex will not be fooled.
For more information, read Jeffrey Friedl's excellent book "Mastering Regular Expressions" (p. 436)
You can repetitively replace /\([^\)\(]*\)/g with the empty string till no more matches are found, though.
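In C#, that repeated replacement might look like the following sketch. ParenStripper and RemoveParenthesized are hypothetical names; each pass removes the innermost pairs, so nested groups disappear from the inside out.

```csharp
using System.Text.RegularExpressions;

public static class ParenStripper
{
    // An innermost pair: '(' and ')' with no other parentheses in between.
    // [^()] also matches newlines, so pairs may span several lines.
    private static readonly Regex Innermost = new Regex(@"\([^()]*\)");

    public static string RemoveParenthesized(string s)
    {
        string previous;
        do
        {
            previous = s;
            s = Innermost.Replace(s, string.Empty);
        } while (s != previous); // stop once a pass changes nothing
        return s;
    }
}
```

RemoveParenthesized("a(b(c)d)e") returns "ae" after two passes. Unpaired parentheses are left in place, and the whitespace around removed groups is not collapsed.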
How about this: Regex Replace seems to do the trick.
string Remove(string s, char begin, char end)
{
    Regex regex = new Regex(string.Format("\\{0}.*?\\{1}", begin, end));
    return regex.Replace(s, string.Empty);
}
string s = "Hello (my name) is (brian)";
s = Remove(s, '(', ')');
Output would be:
"Hello  is "
Normally, it is not an option. However, Microsoft does have some extensions to standard regular expressions. You may be able to achieve this with Grouping Constructs even if it is faster to code as an algorithm than to read and understand Microsoft's explanation of their extension.
