Large string regex replace performance in C# - c#

I'm facing a problem with Regex performance in C#.
I need to replace on a very large string (270k charachters, don't ask why..). The regex matches about 3k times.
private static Regex emptyCSSRulesetRegex = new Regex(#"[^\};\{]+\{\s*\}", RegexOptions.Compiled | RegexOptions.Singleline);
public string ReplaceEmptyCSSRulesets(string css) {
return emptyCSSRulesetRegex.Replace(css, string.Empty);
}
The string I pass to the method looks something like this:
.selector-with-statements{border:none;}.selector-without-statements{}.etc{}
Currently the replace process takes up 1500ms in C#, but when I do exactly the same in Javascript it only takes 100ms.
The Javascript code I used for timing:
console.time('reg replace');
myLargeString.replace(/[^\};\{]+\{\s*\}/g,'');
console.timeEnd('reg replace');
I also tried to do the replacing by looping over the matches in reverse order and replace the string in a StringBuilder. That was not helping.
I'm surprised by the performance difference between C# and Javascript in this case, and I think there I'm doing something wrong but I cannot think of anything.

I can't really explain the difference of time between Javascript and C#(*). But you can try to improve the performance of your pattern (that produces a lot of backtracking):
private static Regex emptyCSSRulesetRegex = new Regex(#"(?<keep>[^};{]+)(?:{\s*}(?<keep>))?", RegexOptions.Compiled);
public string ReplaceEmptyCSSRulesets(string css) {
return emptyCSSRulesetRegex.Replace(css, #"${keep}");
}
One of the problems of your original pattern is that when curly brackets are not empty (or not filled with whitespaces), the regex engine will continue to test each positions before the opening curly bracket (with always the same result). Example: with the string abcd{1234} your pattern will be tested starting on a, then b ...
The pattern I suggests will consume abcd even if it is not followed by empty curly brackets, so the positions of bcd are not tested.
abcd is captured in the group named keep but when empty curly brackets are found, the capture group is overwritten by an empty capture group.
You can have an idea of the number of steps needed for the two patterns (check the debugger):
original pattern
new pattern
Note: your original pattern can be improved if you enclose [^}{;]+ in an atomic group. This change will divide the number of steps needed by 2 (compared to the original), but even with that, the number of steps stays high for the previously explained reason.
(*) it's possible that the javascript regex engine is smart enough to not retry all these positions, but it's only an assumption.

Related

C# Use REGEX to sort out Parentheses inside eachother

im using REGEX right now to sort out lines like:
string a = "mine(hello, this())"
string b = "mine(hello, me.he)
string c = "mine(hello, this(this2())
Im trying to get this() by itself.
Regex keeps messing up on this as the regex statement was made to get text inside of () how can I correct my regex statement to fix this.
Code:
string result = Regex.Match(a, #"\(([^)]*)\)").Groups[1].Value;
To get everything inside the outermost set of parenthesis, use a greedy operator to get everything until the last parenthesis. The following pattern does not attempt to match sets of parenthesis or nested parenthesis. (To do that is more involved and there are already many questions and online discussions about that.)
Regex.Match(a, #"\((.*)\)").Groups[1].Value;
That pattern will match the opening parenthesis, then everything it possibly can until the closing parenthesis. This will require backtracking, but should be reasonable if matches are limited to single lines. Depending on current regex options, it could match across multiple lines and it has no other constraints, so it could match much more than desired. If that is a concern, add additional operators to limit the extent that it matches. For example, to keep from matching across lines:
Regex.Match(a, #"\(([^\r\n]*)\)").Groups[1].Value;

Regex.Split returning whitespaces

I want to export a View as a HTML-Document to the User on my ASP.NET page. I want to give the option to only get a part of the view.
Because of that I want to split the output with Regex.Split(). I wrote a Regex that matches the part I want to cut out. After splitting I put the 2 output parts together again.
The problem is that I get a list of 3 parts, of which the second contains " ". How can I change the code that the output contains only 2 strings?
My Code:
textParts = Regex.Split(text, #"<!--Graphic2-->(.|\n)*<!--EndDiscarded-->");
text = textParts[0] + textParts[1];
text contains HTML, CSS and jQuery Code. I wrote comments like <!--Graphic2--> around the blocks I want to cut out.
EDIT
I got it working now by using the Regex.Replace() Method. But I still don't know why Split isn't working how I expected.
You should consider parsing HTML with the proper tools, like HtmlAgilityPack.
The current question is about why Regex.Split returned 3 values. That is due to the presence of a capturing group in your pattern. Regex.Split returns the chunks between start/end of string and the matched chunks, and all captured substrings:
If capturing parentheses are used in a Regex.Split expression, any captured text is included in the resulting string array. For example, if you split the string "plum-pear" on a hyphen placed within capturing parentheses, the returned array includes a string element that contains the hyphen.
So, Regex.Split(text, #"<!--Graphic2-->(.|\n)*<!--EndDiscarded-->") matches <!--Graphic2--> substring, then matches and captures into Group 1 any 0+ occurrences of any char, as many as possible, and then matches <!--EndDiscarded-->") - these matches are removed and substrings that are not matched are returned, but the last char captured into the repeated capturing group is also returned.
So, if you plan to use regex for this task, you should consider re-writing it to #"(?s)<!--Graphic2-->.*?<!--EndDiscarded-->" or #"<!--Graphic2-->[^<]*(?:<(?!!--EndDiscarded)[^<]*)*<!--EndDiscarded-->" that will be much more efficient, or even #"<!--Graphic2-->[^<]*(?:<(?!!--(?:EndDiscarded|Graphic2))[^<]*)*<!--EndDiscarded-->" that will ensure no nested Graphic2 comments are matched.
See, the complexity of the regexps rises when you want to make sure your patterns work more efficiently and safer. However, even these longer versions do not guarantee 100% safety.

C# .NET Regex remove all quotes of quotes excluding one instance in a sentance

I have description field which is:
16" Alloy Upgrade
In CSV format it appears like this:
"16"" Alloy Upgrade "
What would be the best use of regex to maintain the original format? As I'm learning I would appreciate it being broke down for my understanding.
I'm already using Regex to split some text separating 2 fields which are: code, description. I'm using this:
,(?=(?:[^\"]*\"[^\"]*\")*(?![^\"]*\"))
My thoughts are to remove the quotes, then remove the delimiter excluding use in sentences.
Thanks in advance.
If you don't want to/can't use a standard CSV parser (which I'd recommend), you can strip all non-doubled quotes using a regex like this:
Regex.Replace(text, #"(?!="")""(?!"")",string.Empty)
That regex will match every " character not preceded or followed by another ".
I wouldn't use regex since they are usually confusing and totally unclear what they do (like the one in your question for example). Instead this method should do the trick:
public string CleanField(string input)
{
if (input.StartsWith("\"") && input.EndsWith("\""))
{
string output = input.Substring(1,input.Length-2);
output = output.Replace("\"\"","\"");
return output;
}
else
{
//If it doesn't start and end with quotes then it doesn't look like its been escaped so just hand it back
return input;
}
}
It may need tweaking but in essence it checks if the string starts and ends with a quote (which it should if it is an escaped field) and then if so takes the inside part (with the substring) and then replaces double quotes with single quotes. The code is a bit ugly due to all the escaping but there is no avoiding that.
The nice thing is this can be used easily with a bit of Linq to take an existing array and convert it.
processedFieldArray = inputfieldArray.Select(CleanField).ToArray();
I'm using arrays here purely because your linked page seems to use them where you are wanting this solution.

Regex matching only when condition has not been met

I have kind of a weird problem that I am attempting to resolve with some elegant regular expressions.
The system I am working on was initially designed to accept an incoming string and through a pattern matching method, alter the string which it then returns. A very simplistic example is:
Incoming string:
The dog & I went to the park and had a great time...
Outgoing string:
The dog {&} I went to the park and had a great time {...}
The punctuation mapper wraps key characters or phrases and wraps them in curly braces. The original implementation was a one way street and was never meant for how it is currently being applied and as a result, if it is called incorrectly, it is very easy for the system to "double" wrap a string as it is just doing a simple string replace.
I spun up Regex Hero this morning and started working on some pattern matches and having not written a regular expression in nearly a year, quickly hit a wall.
My first idea was to match a character (i.e. &) but only if it wasn't wrapped in braces and came up with [^\{]&[^\}], which is great but of course catches any instance of the ampersand so long as it is not preceded by a curly brace, including white spaces and would not work in a situation where there were two ampersands back to back (i.e. && would need to be {&}{&} in the outgoing string. To make matters more complicated, it is not always a single character as ellipsis (...) is also one of the mapped values.
Every solution I noodle over either hits a barrier because there is an unknown number of occurrences of a particular value in the string or that the capture groups will either be too greedy or finally, cannot compensate for multiple values back to back (i.e. a single period . vs ellipsis ...) which the original dev handled by processing ellipsis first which covered the period in the string replace implementation.
Are there any regex gurus out there that have any ideas on how I can detect the undecorated (unwrapped) values in a string and then perform their replacements in an ungreedy fashion that can also handle multiple repeated characters?
My datasource that I am working against is a simple key value pair that contains the value to be searched for and the value to replace it with.
Updated with example strings:
Undecorated:
Show Details...
Default Server:
"Smart" 2-Way
Show Lender's Information
Black & White
Decorated:
Show Details{...}
Default Server{:}
{"}Smart{"} 2-Way
Show Lender{'}s Information
Black {&} White
Updated With More Concrete Examples and Datasource
Datasource (SQL table, can grow at any time):
TaggedValue UntaggedValue
{:} :
{&} &
{<} <
{$} $
{'} '
{} \
{>} >
{"} "
{%} %
{...} ...
{...} …
{:} :
{"} “
{"} ”
{'} `
{'} ’
Broken String: This is a string that already has stuff {&} other stuff{!} and {...} with {_} and {#} as well{.} and here are the same characters without it & follow by ! and ... _ & . &&&
String that needs decoration: Show Details... Default Server: "Smart" 2-Way Show Lender's Information Black & White
String that would pass through the method untouched (because it was already decorated): The dog {&} I went to the park and had a great time {...}
The other "gotcha" in moving to regex is the need to handle escaping, especially of backslashes elegantly due to their function in regular expressions.
Updated with output from #Ethan Brown
#Ethan Brown,
I am starting think that regex, while elegant might not be the way to go here. The updated code you provided, while closer still does not yield correct results and the number of variables involved may exceed the regex logics capability.
Using my example above:
'This is a string that already has stuff {&} other stuff{!} and {...} with {_} and {#} as well{.} and here are the same characters without it & follow by ! and ... _ & . &&&'
yields
This is a string that already has stuff {&} other stuff{!} and {...} with {_} and {#} as well{.} and here are the same characters without it {&} follow by {!} and {...} {_} {&} . {&&}&
Where the last group of ampersands which should come out as {&}{&}{&} actually comes out as {&&}&.
There is so much variability here (i.e. need to handle ellipsis and wide ellipsis from far east languages) and the need to utilize a database as the datasource is paramount.
I think I am just going to write a custom evaluator which I can easily enough write to perform this type of validation and shelve the regex route for now. I will grant you credit for your answer and work as soon as I get in front of a desktop browser.
This kind of problem can be really tough, but let me give you some ideas that might help out. One thing that's really going to give you headaches is handling the case where the punctuation appears at the beginning or end of the string. Certainly that's possible to handle in a regex with a construct like (^|[^{])&($|[^}]), but in addition to that being painfully hard to read, it also has efficiency issues. However, there's a simple way to "cheat" and get around this problem: just pad your input string with a space on either end:
var input = " " + originalInput + " ";
When you're done you can just trim. Of course if you care about preserving input at the beginning or end, you'll have to be more clever, but I'm going to assume for argument's sake that you don't.
So now on to the meat of the problem. Certainly, we can come up with some elaborate regular expressions to do what we're looking for, but often the answer is much much simpler if you use more than one regular expression.
Since you've updated your answer with more characters, and more problem inputs, I've updated this answer to be more flexible: hopefully it will meet your needs better as more characters get added.
Looking over your input space, and the expressions you need quoted, there are really three cases:
Single-character replacements (! becomes {!}, for example).
Multi-character replacements (... becomes {...}).
Slash replacement (\ becomes {})
Since the period is included in the single-character replacements, order matters: if you replace all the periods first, then you will miss ellipses.
Because I find the C# regex library a little clunky, I use the following extension method to make this more "fluent":
public static class StringExtensions {
public static string RegexReplace( this string s, string regex, string replacement ) {
return Regex.Replace( s, regex, replacement );
}
}
Now I can cover all of the cases:
// putting this into a const will make it easier to add new
// characters in the future
const string normalQuotedChars = #"\!_\\:&<\$'>""%:`";
var output = s
.RegexReplace( "(?<=[^{])\\.\\.\\.(?=[^}])", "{$&}" )
.RegexReplace( "(?<=[^{])[" + normalQuotedChars + "](?=[^}])", "{$&}" )
.RegexReplace( "\\\\", "{}" );
So let's break this solution down:
First we handle the ellipses (which will keep us from getting in trouble with periods later). Note that we use a zero-width assertions at the beginning and end of the expression to exclude expressions that are already quoted. The zero-width assertions are necessary, because without them, we'd get into trouble with quoted characters right next to each other. For example, if you have the regex ([^{])!([^}]), and your input string is foo !! bar, the match would include the space before the first exclamation point and the second exclamation point. A naive replacement of $1!$2 would therefore yield foo {!}! bar because the second exclamation point would have been consumed as part of the match. You'd have to end up doing an exhaustive match, and it's much easier to just use zero-width assertions, which are not consumed.
Then we handle all of the normal quoted characters. Note that we use zero-width assertions here for the same reasons as above.
Finally, we can find lone slashes (note we have to escape it twice: once for C# strings and again for regex metacharacters) and replace that with empty curly brackets.
I ran all of your test cases (and a few of my own invention) through this series of matches, and it all worked as expected.
I'm no regex god, so one simple way:
Get / construct the final replacement string(s) - ex. "{...}", "{&}"
Replace all occurrences of these in the input with a reserved char (unicode to the rescue)
Run your matching regex(es) and put "{" or whatever desired marker(s).
Replace reserved char(s) with the original string.
Ignoring the case where your original input string has a { or } character, a common way to avoid re-applying a regex to an already-escaped string is to look for the escape sequence and remove it from the string before applying your regex to the remainders. Here's an example regex to find things that are already escaped:
Regex escapedPattern = new Regex(#"\{[^{}]*\}"); // consider adding RegexOptions.Compiled
The basic idea of this negative-character class pattern comes from regular-expressions.info, a very helpful site for all thing regex. The pattern works because for any inner-most pair of braces, there must be a { followed by non {}'s followed by a }
Run the escapedPattern on the input string, find for each Match get the start and end indices in the original string and substring them out, then with the final cleaned string run your original pattern match again or use something like the following:
Regex punctPattern = new Regex(#"[^\w\d\s]+"); // this assumes all non-word,
// digit or space chars are punctuation, which may not be a correct
//assumption
And replace Match.Groups[1].Value for each match (groups are a 0 based array where 0 is the whole match, 1 is the first set of parentheses, 2 is the next etc.) with "{" + Match.Groups[1].Value + "}"

C# program freezes for a minute when running a bunch of RegExes

I have a program that runs a large number of regular expressions (10+) on a fairly long set of texts (5-15 texts about 1000 words each)
Every time that is done I feel like I forgot a Thread.Sleep(5000) in there somewhere. Are regular expressions really processor-heavy or something? It'd seem like a computer should crank through a task like that in a millisecond.
Should I try and group all the regular expressions into ONE monster expression? Would that help?
Thanks
EDIT: Here's a regex that runs right now:
Regex rgx = new Regex(#"((.*(\w+([-+.]\w+)*#\w+([-.]\w+)*\.\w+([-.]\w+)*).*)|(.*(keyword1)).*|.*(keyword2).*|.*(keyword3).*|.*(keyword4).*|.*(keyword5).*|.*(keyword6).*|.*(keyword7).*|.*(keyword8).*|.*(count:\n[0-9]|count:\n\n[0-9]|Count:\n[0-9]|Count:\n\n[0-9]|Count:\n).*|.*(keyword10).*|.*(summary: \n|Summary:\n).*|.*(count:).*)", RegexOptions.Compiled | RegexOptions.IgnoreCase);
Regex regex = new Regex(#".*(\.com|\.biz|\.net|\.org|\.co\.uk|\.bz|\.info|\.us|\.cm|(<a href=)).*", RegexOptions.Compiled | RegexOptions.IgnoreCase);
It's pretty huge, no doubt about it. The idea is if it gets to any of the keywords or the link it will just take out the whole paragraph surrounding it.
Regexes don't kill CPU's, regex authors do. ;)
But seriously, if regexes always ran as slowly as you describe, nobody would be using them. Before you start loading up silver bullets like the Compiled option, you should go back to your regex and see if it can be improved.
And it can. Each keyword is in its own branch/alternative, and each branch starts with .*, so the first thing each branch does is consume the remainder of the current paragraph (i.e., everything up to the next newline). Then it starts backtracking as it tries to match the keyword. If it gets back to the position it started from, the next branch takes over and does the same thing.
When all branches have reported failure, the regex engine bumps ahead one position and goes through all the branches again. That's over a dozen branches, times the number of characters in the paragraph, times the number of paragraphs... I think you get the point. Compare that to this regex:
Regex re = new Regex(#"^.*?(\w+([-+.]\w+)*#\w+([-.]\w+)*\.\w+([-.]\w+)*|keyword1|keyword2|keyword3|keyword4|keyword5|keyword6|keyword7|keyword8|count:(\n\n?[0-9]?)?|keyword10|summary: \n).*$",
RegexOptions.Multiline | RegexOptions.IgnoreCase | RegexOptions.ExplicitCapture);
There are three major changes:
I factored out the leading and trailing .*
I changed the leading one to .*?, making it non-greedy
I added start-of-line and end-of-line anchors (^ and $ in Multiline mode)
Now it only makes one match attempt per paragraph (pass or fail), and it practically never backtracks. I could probably make it even more efficient if I knew more about your data. For example, if every keyword/token/whatever starts with a letter, a word boundary would have an appreciable effect (e.g. ^.*?\b(\w+...).
The ExplicitCapture option makes all the "bare" groups ((...)) act like non-capturing groups ((?:...)), reducing the overhead a little more without adding clutter to the regex. If you want to capture the token, just change that first group to a named group (e.g.(?<token>\w+...).
First of all you may try Compiled option RegexOptions.Compiled
Here is good article about regex performance: regex-performance
Second one: regex performance depended on pattern:
Some pattern are much quicker than others. You have to cpecify pattern as strict as possible.
And the third one.
I had some troubles with regex perfomance. And in that case I used string.contains method.
For example:
bool IsSomethingValid(stging source, string shouldContain, string pattern)
{
bool res = source.Contains(shouldContain);
if (res)
{
var regex = new Regex(pattern, RegexOptions.Compiled);
res = regex.IsMatch(source);
}
return res;
}
If you gave us an examples of your script we may try to improve them.
Never assume anything about what and for which reason your application is slow. Instead, always measure it.
Use a decent performance profiler (like e.g. Red Gate's ANTS Performance Profiler - they offer a free 14 days trial version) and actually see the performance bottlenecks.
My own experience is that I always was wrong on what was the real cause for a bad performance. With the help of profiling, I could find the slow code segments and adjust them to increase performance.
If the profiler confirms your assumption regarding the Regular Expressions, you then could try to optimize them (by modifying or pre-compiling them).
If the pattern doesn't deal with case...don't have the ignore case option. See
Want faster regular expressions? Maybe you should think about that IgnoreCase option...
Otherwise as mentioned but put in here:
Put all operations on seperate background threads. Don't hold up a gui.
Examine each pattern. To many usages of * (zero to many) if swapped out to a + (one to many) can give the regex engine an real big hint and not require as much fruitless backtracking. See (Backtracking)
Compiling a pattern can take time, use Compile option to avoid it being parsed again...but if one uses the static methods to call the regex parser, the pattern is cached automatically.
See Regex hangs with my expression [David Gutierrez] for some other goodies.

Categories