How can I improve the performance of a .NET regular expression? - c#

I have a regular expression which parses a (very small) subset of the Razor template language. Recently, I added a few more rules to the regex which dramatically slowed its execution. I'm wondering: are there certain regex constructs that are known to be slow? Is there a restructuring of the pattern I'm using that would maintain readability and yet improve performance? Note: I've confirmed that this performance hit occurs post-compilation.
Here's the pattern:
new Regex(
    @" (?<escape> \@\@ )"
    + @"| (?<comment> \@\* ( ([^\*]\@) | (\*[^\@]) | . )* \*\@ )"
    + @"| (?<using> \@using \s+ (?<namespace> [\w\.]+ ) (\s*;)? )"
    // captures expressions of the form "foreach (var [var] in [expression]) { <text>"
    /* ---> */ + @"| (?<foreach> \@foreach \s* \( \s* var \s+ (?<var> \w+ ) \s+ in \s+ (?<expressionValue> [\w\.]+ ) \s* \) \s* \{ \s* <text> )"
    // captures expressions of the form "if ([expression]) { <text>"
    /* ---> */ + @"| (?<if> \@if \s* \( \s* (?<expressionValue> [\w\.]+ ) \s* \) \s* \{ \s* <text> )"
    // captures the close of a razor text block
    + @"| (?<endBlock> </text> \s* \} )"
    // an expression of the form @([(int)] a.b.c)
    + @"| (?<parenAtExpression> \@\( \s* (?<castToInt> \(int\)\s* )? (?<expressionValue> [\w\.]+ ) \s* \) )"
    + @"| (?<atExpression> \@ (?<expressionValue> [\w\.]+ ) )"
    /* ---> */ + @"| (?<literal> ([^\@<]+|[^\@]) )",
    RegexOptions.IgnorePatternWhitespace | RegexOptions.Multiline | RegexOptions.ExplicitCapture | RegexOptions.Compiled);
/* ---> */ indicates the new "rules" that caused the slowdown.

As you are not anchoring the expression the engine will have to check each alternative sub-pattern at every position of the string before it can be sure that it can't find a match. This will always be time-consuming, but how can it be made less so?
Some thoughts:
I don't like the sub-pattern on the second line that tries to match comments and I don't think it will work correctly.
I can see what you're trying to do with the ( ([^\*]\@) | (\*[^\@]) | . )* - allow @ and * within the comments as long as they are not preceded by * or followed by @ respectively. But because of the group's * quantifier and the third option ., the sub-pattern will happily match *@, therefore rendering the other options redundant.
And assuming that the subset of Razor you are trying to match does not allow multiline comments, I suggest for the second line
+ @"| (?<comment> @\*.*?\*@ )"
i.e. lazily match any characters (but newlines) until the first *# is encountered.
You are using RegexOptions.ExplicitCapture meaning only named groups are being captured, so the lack of () should not be a problem.
I also do not like the ([^\@<]+|[^\@]) sub-pattern in the last line, which equates to ([^\@<]+|<). The [^\@<]+ will greedily match to the end of the string unless it comes across a @ or <.
I do not see any adjacent sub-patterns that will match the same text, which are the usual culprits for excessive backtracking, but all the \s* seem suspect because of their greed and flexibility, including matching nothing and newlines. Perhaps you could change some of the \s* to [ \t]* where you know you don't want to match newlines, for example, perhaps before the opening bracket following an if.
I notice that nhahtdh has suggested you use atomic grouping to prevent the engine backtracking into previously matched text, and that is certainly worth experimenting with, as the slow-down is almost certainly caused by excessive backtracking when the engine can no longer find a match.
What are you trying to achieve with the RegexOptions.Multiline option? You do not look to be using ^ or $ so it will have no effect.
The escaping of the @ is unnecessary.
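To see why the `.` alternative makes the original comment rule over-match, here is a minimal sketch comparing it with the lazy version. It is written with Razor's `@* ... *@` comment delimiters and assumes single-line comments; the sample input and variable names are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical sample: two separate comments on one line.
string input = "@* one *@ text @* two *@";
var options = RegexOptions.IgnorePatternWhitespace | RegexOptions.ExplicitCapture;

// Original rule: the '.' alternative lets the greedy loop run past the first "*@".
var greedy = new Regex(@"@\* ( ([^\*]@) | (\*[^@]) | . )* \*@", options);
// Suggested rule: lazily stop at the first "*@".
var lazy = new Regex(@"@\*.*?\*@", options);

string greedyFirst = greedy.Match(input).Value;
int lazyCount = lazy.Matches(input).Count;

Console.WriteLine(greedyFirst); // the whole line: both comments merged into one match
Console.WriteLine(lazyCount);   // 2
```

The greedy rule returns one match spanning both comments, while the lazy rule finds them separately.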

As others have mentioned, you can improve the readability by removing unnecessary escapes (such as escaping @ or escaping characters aside from \ inside a character class; for example, using [^*] instead of [^\*]).
Here are some ideas for improving performance:
Order your different alternatives so that the most likely ones come first.
The regex engine will attempt to match each alternative in the order that they appear in the regex. If you put the ones that are more likely up front, then the engine will not have to waste time attempting to match against unlikely alternatives for the majority of cases.
Remove unnecessary backtracking
Note the ending of your "using" alternative: @"| (?<using> \@using \s+ (?<namespace> [\w\.]+ ) (\s*;)? )"
If for some reason you have a large amount of whitespace, but no closing ; at the end of a using line, the regex engine must backtrack through each whitespace character until it finally decides that it can't match (\s*;). In your case, (\s*;)? can be replaced with \s*;? to prevent backtracking in these scenarios.
In addition, you could use atomic groups (?>...) to prevent backtracking through quantifiers (e.g. * and +). This really helps improve performance when you don't find a match. For example, your "foreach" alternative contains \s* \( \s*. If you find the text "foreach var...", the "foreach" alternative will greedily match all of the whitespace after foreach, and then fail when it doesn't find an opening (. It will then backtrack, one whitespace-character at a time, and try to match ( at the previous position until it confirms that it cannot match that line. Using an atomic group (?>\s*)\( will cause the regex engine to not backtrack through \s* if it matches, allowing the regex to fail more quickly.
Be careful when using them though, as they can cause unintended failures when used in the wrong place (for instance, (?>.*); will never match anything, due to the greedy .* matching all characters (including ;), and the atomic grouping (?>...) preventing the regex engine from backtracking one character to match the ending ;).
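Both sides of the atomic-group trade-off can be checked with a few calls to `Regex.IsMatch`. This is only a sketch; the input strings and variable names are invented for the example.

```csharp
using System;
using System.Text.RegularExpressions;

// Atomic group: once \s* has matched, the engine may not give whitespace back.
var atomicForeach = new Regex(@"foreach(?>\s*)\(");
bool matchesCall = atomicForeach.IsMatch("foreach   (var x in xs)");
bool matchesOther = atomicForeach.IsMatch("foreach   var");

// The pitfall: .* swallows the ';' and the atomic group refuses to return it.
bool atomicDotStar = Regex.IsMatch("abc;", @"(?>.*);");
bool plainDotStar = Regex.IsMatch("abc;", @".*;");

Console.WriteLine($"{matchesCall} {matchesOther} {atomicDotStar} {plainDotStar}");
// prints "True False False True"
```

The atomic group leaves the result unchanged wherever `\s*` is followed by something whitespace can never match, which is why it is safe before `\(` but not around `.*`.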
"Unroll the loop" on some of your alternatives, such as your "comment" alternative (also useful if you plan on adding an alternative for strings).
For example: @"| (?<comment> \@\* ( ([^\*]\@) | (\*[^\@]) | . )* \*\@ )"
Could be replaced with @"| (?<comment> @\* [^*]* (\*+[^*@][^*]*)* \*+@ )"
The new regex boils down to:
@\*: Find the beginning of a comment @*
[^*]*: Read all "normal characters" (anything that's not a * because that could signify the end of the comment)
(\*+[^*@][^*]*)*: include any non-terminal * inside the comment
(\*+[^*@]: If we find a run of *s, make sure it is not followed by a @
[^*]*: Go back to reading all "normal characters"
)*: Loop back to the beginning if we find another *
\*+@: Finally, grab the end of the comment *@ being careful to include any extra *
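The unrolled comment rule can be tried on a comment that contains inner stars and ends in a run of stars. This is a sketch: it uses Razor's `@* ... *@` delimiters, and it writes the inner class as `[^*@]` (an assumption of this example) so that a run of `*` can never be split to step over the terminator. The input string is made up.

```csharp
using System;
using System.Text.RegularExpressions;

var options = RegexOptions.IgnorePatternWhitespace | RegexOptions.ExplicitCapture;
// Unrolled comment rule; [^*@] keeps a run of stars together.
var unrolled = new Regex(@"@\* [^*]* (\*+[^*@][^*]*)* \*+@", options);

// Hypothetical input: a comment with inner stars, closed by "**@", then more text.
string input = "@* a * b ** c **@ rest *@";
string comment = unrolled.Match(input).Value;

Console.WriteLine(comment); // prints "@* a * b ** c **@"
```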
You can find many more ideas for improving the performance of your regular expressions from Jeffrey Friedl's Mastering Regular Expressions (3rd Edition).

Related

Regex - how to match a block comment

When it comes to regex I'm always lost. I have an editor written in C# for working with Papyrus scripts. Users have asked me to style block comments (";/ ... /;"); styling already works for single-line comments, which start with ";".
Here is the code I have so far
var inputData = @"comment test and this line not suppose to show
;/
comment line 1
comment line 2
comment line 3
/;
Not suppose to show";
PapyrusCommentRegex1 = new Regex(@"(;/\*.*?\/;)|(.*\/;)", RegexOptions.Singleline);
foreach (Match match in PapyrusCommentRegex1.Matches(inputData))
{
    if (match.Success)
    {
        textBox1.AppendText(match.Value + Environment.NewLine);
    }
}
The result I get is
comment test and this line not suppose to show
;/
comment line 1
comment line 2
comment line 3
/;
All the lines before the ";/" are shown as well.
My question is: what am I doing wrong in my regex?
Thanks in advance to all
Edit:
To make it clearer: I need a regex pattern in C# that finds every block comment that starts with ";/" and ends with "/;", and the match needs to include the ";/" and "/;" delimiters.
Since you said you need to do this with regex in a .NET library I guess you may want a regex that is using balancing groups to match the block comment
(?x) # ignore spaces and comments
(
    ;/ # open block comment
    (?:
        (?<open> ;/ )* # open++
        .+
        (?<-open> /; )* # open--
    )+
    /; # close
    (?(open)(?!)) # fail if unbalanced: open > 0
)
This should give you what you want.
Some mentioned the problem of block comments in strings (and vice versa?!). This makes things a lot harder, especially since the (*SKIP)(*FAIL) backtracking verbs and \K are not available in .NET's regex engine. I would try to match and capture what you need but only match what you do not need:
This matches your block comments and "..." strings. The trick is to only look at the blockcomment capture group:
(?x) # ignore spaces and comments
(
    ;/ # open block comment
    (?:
        (?<open> ;/ )* # open++
        .+
        (?<-open> /; )* # open--
    )+
    /; # close
    (?(open)(?!)) # fail if unbalanced: open > 0
)
|
(?:(?<openq>")
    [^"]*?
    (?<-openq>")*
)+(?(openq)(?!))
I hope you can apply this in your code.
Try this:
(;/.*?/;)|(;.*?(?=$|[\r\n]))
Note that I'm still using the SingleLine mode.
The part before the | matches multiline comments; the part after the | matches single-line comments (comments that end at the end of the text, $, or at a new line, \r or \n). Note that the regex won't capture the end-of-line at the end of a single-line comment, so in
;xyz\n
the \n won't be captured. To capture it:
(;/.*?/;)|(;.*?(?:$|\r\n?|\n))
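As a quick check of the first of these patterns, the sketch below runs it over a small Papyrus-like input and reports which alternative produced each match. The input text and variable names are invented for the example, and line endings are written as explicit `\n` so the result does not depend on the source file's encoding.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical input with one block comment and one single-line comment.
string input = "before\n;/\nline 1\nline 2\n/;\nx = 1 ; trailing\nafter";

// Group 1: block comment ";/ ... /;". Group 2: single-line comment up to the line end.
var comments = new Regex(@"(;/.*?/;)|(;.*?(?=$|[\r\n]))", RegexOptions.Singleline);
MatchCollection found = comments.Matches(input);

foreach (Match m in found)
{
    string kind = m.Groups[1].Success ? "block" : "line";
    Console.WriteLine($"[{kind}] {m.Value}");
}
```

Checking `Groups[1].Success` is one way to tell the two comment kinds apart if they need different styling.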

C# Regex - How to remove multiple paired parentheses from string

I am trying to figure out how to use C# regular expressions to remove all instances paired parentheses from a string. The parentheses and all text between them should be removed. The parentheses aren't always on the same line. Also, their might be nested parentheses. An example of the string would be
This is a (string). I would like all of the (parentheses
to be removed). This (is) a string. Nested ((parentheses) should) also
be removed. (Thanks) for your help.
The desired output should be as follows:
This is a . I would like all of the . This a string. Nested also
be removed. for your help.
Fortunately, .NET regexes can match nested constructs by way of balancing groups (see Balancing Group Definitions):
Regex regexObj = new Regex(
    @"\(              # Match an opening parenthesis.
    (?>               # Then either match (possessively):
     [^()]+           # any characters except parentheses
    |                 # or
     \( (?<Depth>)    # an opening paren (and increase the parens counter)
    |                 # or
     \) (?<-Depth>)   # a closing paren (and decrease the parens counter).
    )*                # Repeat as needed.
    (?(Depth)(?!))    # Assert that the parens counter is at zero.
    \)                # Then match a closing parenthesis.",
    RegexOptions.IgnorePatternWhitespace);
In case anyone is wondering: The "parens counter" may never go below zero ((?<-Depth>) will fail otherwise), so even if the parentheses are "balanced" but aren't correctly matched (like ()))((()), this regex will not be fooled.
For more information, read Jeffrey Friedl's excellent book "Mastering Regular Expressions" (p. 436)
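To actually remove the parenthesized text, the pattern can be passed to `Regex.Replace`. The following is a minimal sketch of that step; the sample input and the variable names `balanced`, `input` and `cleaned` are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

var balanced = new Regex(@"
    \(                    # opening parenthesis
    (?>                   # atomic: never backtrack into a finished iteration
        [^()]+            # run of non-parenthesis characters (newlines included)
      | \( (?<Depth>)     # nested open: push onto the Depth stack
      | \) (?<-Depth>)    # nested close: pop, fails when the stack is empty
    )*
    (?(Depth)(?!))        # fail if a nested open is still unclosed
    \)                    # closing parenthesis",
    RegexOptions.IgnorePatternWhitespace);

// Hypothetical input with nesting and a group that spans two lines.
string input = "Nested ((parentheses) should) also\nbe (re\nmoved).";
string cleaned = balanced.Replace(input, string.Empty);

Console.WriteLine(cleaned);
```

Only the parenthesized text is removed, so the spaces that surrounded each group stay behind and may need a separate clean-up pass.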
You can repetitively replace /\([^\)\(]*\)/g with the empty string till no more matches are found, though.
How about this: Regex Replace seems to do the trick.
string Remove(string s, char begin, char end)
{
    Regex regex = new Regex(string.Format("\\{0}.*?\\{1}", begin, end));
    return regex.Replace(s, string.Empty);
}
string s = "Hello (my name) is (brian)";
s = Remove(s, '(', ')');
Output would be:
"Hello  is "
Normally, it is not an option. However, Microsoft does have some extensions to standard regular expressions. You may be able to achieve this with Grouping Constructs even if it is faster to code as an algorithm than to read and understand Microsoft's explanation of their extension.

Regular expression for matching php's constant definition

I wrote a regular expression for matching php's constant definition.
Example:
define('Symfony∞DI', SYS_DIRECTORY_PUBLIC . SYS_DIRECTORY_INCLUDES . SYS_DIRECTORY_CLASSES . SYS_DIRECTORY_EXTERNAL . 'symfony/di/');
Here is the regular expression:
define\((\"|\')+([\w-\.-∞]+)+(\"|\')+(,)+((\s)+(\"|\')+([\w-(\')-\\"-\.-∞-\s-(\\)-\/]+)+(\"|\')|(([\w-\s-\.-∞-(\\)-\/]+)))\);
When I executed with ActionScript it works fine. But when I executed with C# it gives me the following error:
parsing "define\((\"|\')+([\w-\.-∞]+)+(\"|\')+(,)+((\s)+(\"|\')+([\w-(\')-\\"-\.-∞-\s-(\\)-\/]+)+(\"|\')|(([\w-\s-\.-∞-(\\)-\/]+)))\);" - Cannot include class \s in character range.
Could you help me resolve this issue?
You seem to be using regexes in a completely convoluted way:
1. Character classes: the - is special and is there to define a range; I guess you have an ordering inversion which .NET doesn't handle whereas ActionScript does (or maybe the collating order is different there). Your character class should read [\w.∞] instead of [\w-\.-∞], just to quote the first example;
2. No need to put a group around \s: \s+, not (\s)+; similarly, , instead of (,);
3. ' is not special in a regex, and if you want to match one of two characters, use a character class, not a group plus alternation: ['\"] instead of (\'|\") -- and note that the " is escaped only because you are in a double-quoted string;
4. Your regex is not anchored at the beginning, and it looks like you want to match define at the beginning of the input: ^define and not define.
Item 1 is probably the source of your problems.
Rewriting your regex with all of the above gives this (in double quotes):
"^define\(([\"'][\w.∞]+[\"'],(\s+[\"']+[\w'\".∞\s\\/]+)+[\"']|([\w\s.∞\\/]+))\);"
which definitely doesn't look like it will ever match your input...
Try this instead:
"^define\(\s*(['\"])[\w.∞]+\1\s*,\s*[\w/]+(\s*\.\s*[\w/]+)*\s*\);$"
See fge's answer for the error you're having. Without knowing what you're trying to do and without deviating too much from your original, here is an alternative regex:
define\(\s*(["'])\s*[\w.∞]+\s*\1(?:\s*[.,]\s*(["']?)\s*[\w/]+\s*\2)*\s*\);
define
\(
\s* (["'])
\s* [\w.∞]+
\s* \1
(?:
\s* [.,]
\s* (["']?)
\s* [\w/]+
\s* \2
)*
\s*
\);
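The alternative regex above can be tried from C# as follows. This is a sketch under a few assumptions: the pattern is placed in a verbatim string (so `"` is doubled), the infinity sign is written as `\u221E` to avoid source-encoding issues, and the test lines are shortened versions of the question's example.

```csharp
using System;
using System.Text.RegularExpressions;

// \u221E is the infinity sign used in the question's constant name.
var definePattern = new Regex(
    @"define\(\s*([""'])\s*[\w.\u221E]+\s*\1(?:\s*[.,]\s*([""']?)\s*[\w/]+\s*\2)*\s*\);");

string line = "define('Symfony\u221EDI', SYS_DIRECTORY_PUBLIC . SYS_DIRECTORY_INCLUDES . 'symfony/di/');";
bool ok = definePattern.IsMatch(line);

// A missing closing quote in the last operand is rejected by the \2 backreference.
bool broken = definePattern.IsMatch("define('A', B . 'c/d);");

Console.WriteLine($"{ok} {broken}");
```

The backreferences `\1` and `\2` are what force each operand to close with the same quote character it opened with (or with none).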

Seeking some C# RegEx help

I am trying to create a RegEx expression that will successfully parse the following line:
"57" "testing123" 82 16 # 13 26 blah blah
What I want to be able to do is identify the numbers in the line. Currently, what I'm using is this:
[0-9]+
which parses fine. However, where it gets tricky is if the number is in quotes, like "57" is or like "testing123" is, I do not want it to match.
In addition to that, I do not want to match anything at all after the hash sign (the '#').
So in this example, the matches I should be getting are "82" and "16". Nothing else should match.
Any help on this would be appreciated.
It should be easier for you to build 3 different regexes, and then create the logic that combines them:
Check, whether the string has #, and ignore everything after it.
Check, for all the matches of "\d+", and ignore all of them
Check everything that's left, whether it matches [0-9]+
.Net regular expression can rather easily parse this string. The following pattern should match everything until the comment:
\A # Start of the string
(?>
(?<Quoted> # A quoted string
"" # Open quotes
[^""\\]* # non quotes or backslashes
(?:\\.[^""\\]*)* # but allow escaped characters
"" # Close quotes
)
|
(?<Number> # A number
\d+ # some digits
)
|
\s+ # Whitespace separator
)*
If you also want to match the comment, add:
(?<Comment>
\# .*
)?
\z
You can get your numbers in a single Match, using all captures of the "Number" group:
Match parsed = Regex.Match(s, pattern, RegexOptions.IgnorePatternWhitespace);
CaptureCollection numbers = parsed.Groups["Number"].Captures;
Missing from this pattern is mainly unquoted string tokens, such as 4 8 this 15that, which can add some complexity, depending on how we'd want it to work.
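Putting the pieces together, a complete version might look like the sketch below. It joins the main pattern and the optional comment part into one verbatim string and collects the "Number" captures with LINQ; the variable names are illustrative only.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

string pattern = @"
    \A
    (?>
        (?<Quoted> "" [^""\\]* (?:\\.[^""\\]*)* "" )   # quoted string, escapes allowed
      | (?<Number> \d+ )                               # bare number
      | \s+                                            # separator
    )*
    (?<Comment> \# .* )?                               # optional trailing comment
    \z";

string s = "\"57\" \"testing123\" 82 16 # 13 26 blah blah";
Match parsed = Regex.Match(s, pattern, RegexOptions.IgnorePatternWhitespace);

// Every repetition of the Number group is kept in its Captures collection.
string[] numbers = parsed.Groups["Number"].Captures
    .Cast<Capture>()
    .Select(c => c.Value)
    .ToArray();

Console.WriteLine(string.Join(",", numbers)); // prints "82,16"
```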

How can I normalize/canonize a regular expression pattern?

I have a complex regular expression I've built with code. I want to normalize it to the simplest (canonical) form that will be an equivalent regular expression but without the extra brackets and so on.
I want it to be normalized so I can understand if it's correct and find bugs in it.
Here is an example for a regular expression I want to normalize:
^(?:(?:(?:\r\n(?:[ \t]+))*)(<transfer-coding>(?:chunked|(?:(?:[\x21\x23-\x27\x2A\x2B\x2D\x2E0-9A-Z\x5E\x7A\x7C\x7E-\xFE]+)(?:(?:;(?:(?:[\x21\x23-\x27\x2A\x2B\x2D\x2E0-9A-Z\x5E\x7A\x7C\x7E-\xFE]+)=(?:(?:[\x21\x23-\x27\x2A\x2B\x2D\x2E0-9A-Z\x5E\x7A\x7C\x7E-\xFE]+)|(?:"(?:(?:(?:|[^\x00-\x31\x127\"])|(?:\\[\x00-\x127]))*)))))*))))(?:(?:(?:\r\n(?:[ \t]+))*),(?:(?:\r\n(?:[ \t]+))*)(<transfer-coding>(?:chunked|(?:(?:[\x21\x23-\x27\x2A\x2B\x2D\x2E0-9A-Z\x5E\x7A\x7C\x7E-\xFE]+)(?:(?:;(?:(?:[\x21\x23-\x27\x2A\x2B\x2D\x2E0-9A-Z\x5E\x7A\x7C\x7E-\xFE]+)=(?:(?:[\x21\x23-\x27\x2A\x2B\x2D\x2E0-9A-Z\x5E\x7A\x7C\x7E-\xFE]+)|(?:"(?:(?:(?:|[^\x00-\x31\x127\"])|(?:\\[\x00-\x127]))*)))))*))))*))$
I'm with the other answers and comments so far. Even if you could define a reduced form, it's unlikely that the reduced form is going to be any more understandable than this thing, which resembles line noise on a 1200 baud modem.
If you did want to find a canonical form for regular expressions, I'd start by defining precisely what you mean by "canonical form". For example, suppose you have the regular expression [ABCDEF-I]. Is the canonical form (1) [ABCDEF-I], (2) [ABCDEFGHI] or (3) [A-I] ?
That is, for purposes of canonicalization, do you want to (1) ignore this subset of regular expressions for the purposes of canonicalization, (2) eliminate all "-" operators, thereby simplifying the expression, or (3) make it shorter?
The simplest way would be to go through every part of the regular expression specification and work out which subexpressions are logically equivalent to another form, and decide which of the two is "more canonical". Then write a recursive regular expression analyzer that goes through a regular expression and replaces each subexpression with its canonical form. Keep doing that in a loop until you find the "fixed point", the regular expression that doesn't change when you put it in canonical form.
That, however, will not necessarily do what you want. If what you want is to reorganize the regular expression to minimize the complexity of grouping or some such thing then what you might want to do is to canonicalize the regular expression so that it is in a form such that it only has grouping, union and Kleene star operators. Once it is in that form you can easily translate it into a deterministic finite automaton, and once it is in DFA form then you can run a graph simplification algorithm on the DFA to form an equivalent simpler DFA. Then you can turn the resulting simplified DFA back into a regular expression.
Though that would be fascinating, like I said, I don't think it would actually solve your problem. Your problem, as I understand it, is a practical one. You have this mess, and you want to understand that it is right.
I would approach that problem by a completely different tack. If the problem is that the literal string is hard to read, then don't write it as a literal string. I'd start "simplifying" your regular expression by making it read like a programming language instead of reading like line noise:
Func<string, string> group = s => "(?:" + s + ")";
Func<string, string> capture = s => "(" + s + ")";
Func<string, string> anynumberof = s => s + "*";
Func<string, string> oneormoreof = s => s + "+";
var beginning = "^";
var end = "$";
var newline = @"\r\n";
var tab = @"\t";
var space = " ";
var semi = ";";
var comma = ",";
var equal = "=";
var chunked = "chunked";
var transfer = "<transfer-coding>";
var backslash = @"\\";
var escape = group(backslash + @"[\x00-\x7f]");
var or = "|";
var whitespace =
    group(
        anynumberof(
            group(
                newline +
                group(
                    oneormoreof(@"[ \t]")))));
var legalchars =
    group(
        oneormoreof(@"[\x21\x23-\x27\x2A\x2B\x2D\x2E0-9A-Z\x5E\x7A\x7C\x7E-\xFE]"));
var re =
    beginning +
    group(
        whitespace +
        capture(
            transfer +
            group(
                chunked +
                or +
                group(
                    legalchars +
                    group(
                        group(
                            semi +
                            anynumberof(
                                group(
                                    legalchars +
                                    equal +
                                    ...
Once it looks like that it'll be a lot easier to understand and optimize.
I think you're getting ahead of yourself; the problems with that regex are not just cosmetic. Many of the parentheses can simply be dropped, as in (?:[ \t]+), but I suspect some of them are changing the meaning of the regex in ways you didn't intend.
For example, what's (?:|[^\x00-\x31\x127\"]) supposed to mean? With that pipe at the beginning, it's equivalent to [^\x00-\x31\x127\"]??--zero or one, reluctantly, of whatever the character class matches. Is that really what you intended?
The character class itself is highly suspect as well. It's obviously meant to match anything other than an ASCII control character or a quotation mark, but the numbers are decimal where they should be hexadecimal: [^\x00-\x1F\x7F\"]
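The effect of the decimal-for-hex slip is easy to observe. In .NET, `\x` consumes exactly two hex digits, so `\x31` is the character '1' and `\x127` is `\x12` followed by a literal '7'. The sketch below compares the class from the question with a hex version that covers the ASCII control characters 0x00-0x1F and 0x7F; the variable names are made up.

```csharp
using System;
using System.Text.RegularExpressions;

// The class from the question rejects ordinary characters such as '1' and '7'.
bool digitOneExcluded = !Regex.IsMatch("1", @"^[^\x00-\x31\x127\""]$");
bool sevenExcluded = !Regex.IsMatch("7", @"^[^\x00-\x31\x127\""]$");

// Hex bounds for the ASCII control characters: 0x00-0x1F plus 0x7F (DEL).
var intended = new Regex(@"^[^\x00-\x1F\x7F""]$");
bool digitAllowed = intended.IsMatch("1");
bool tabRejected = !intended.IsMatch("\t");
bool quoteRejected = !intended.IsMatch("\"");

Console.WriteLine($"{digitOneExcluded} {sevenExcluded} {digitAllowed} {tabRejected} {quoteRejected}");
// prints "True True True True True"
```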
I am not aware of any tool that can do this. I even strongly doubt there is something like a canonical form for regular expressions - they are complex enough that there are usually several and vastly different solutions.
If this expression is the output of a generator, it seems much more promising to me to (unit) test the code generator.
I'd just write it in an expanded form:
^
(?:
  (?: (?: \r\n (?:[ \t]+) )* )
  (<transfer-coding>
    (?: chunked
    | (?:
        (?:[\x21\x23-\x27\x2A\x2B\x2D\x2E0-9A-Z\x5E\x7A\x7C\x7E-\xFE]+)
        (?:
          (?:
            ;
            (?:
              (?:[\x21\x23-\x27\x2A\x2B\x2D\x2E0-9A-Z\x5E\x7A\x7C\x7E-\xFE]+)
              =
              (?: (?:[\x21\x23-\x27\x2A\x2B\x2D\x2E0-9A-Z\x5E\x7A\x7C\x7E-\xFE]+)
              | (?:
                  "
                  (?:
                    (?:
                      (?:
                      | [^\x00-\x31\x127\"]
                      )
                    | (?:\\[\x00-\x127])
                    )*
                  )
                )
              )
            )
          )*
        )
      )
    )
  )
  (?:
    (?: (?: \r\n (?:[ \t]+) )* )
    ,
    (?: (?: \r\n (?:[ \t]+) )* )
    (<transfer-coding>
      (?: chunked
      | (?:
          (?:[\x21\x23-\x27\x2A\x2B\x2D\x2E0-9A-Z\x5E\x7A\x7C\x7E-\xFE]+)
          (?:
            (?:
              ;
              (?:
                (?:[\x21\x23-\x27\x2A\x2B\x2D\x2E0-9A-Z\x5E\x7A\x7C\x7E-\xFE]+)
                =
                (?: (?:[\x21\x23-\x27\x2A\x2B\x2D\x2E0-9A-Z\x5E\x7A\x7C\x7E-\xFE]+)
                | (?:
                    "
                    (?:
                      (?:
                        (?:
                        | [^\x00-\x31\x127\"]
                        )
                      | (?:\\[\x00-\x127])
                      )*
                    )
                  )
                )
              )
            )*
          )
        )
      )
    )
  )
)
$
You can quickly locate unnecessary grouping, and locate some errors. Some errors I saw:
Missing ? for the named groups. It should be (?<name> ).
No closing double quote (").
You can even use the regex in this form. If you supply the flag RegexOptions.IgnorePatternWhitespace when constructing the Regex object, any whitespace or comments (#) in the pattern will be ignored.
Proving correctness is not a good motivation for doing normalization because the normal form can be very obscure and totally unrecognizable.
To get correctness, you either 1) run a lot of tests on it 2) obtain the state machine and prove correctness by induction.
