Regex extremely slow on large documents

When running the following code the CPU load goes way up and it takes a long time on larger documents:
string pattern = @"\w+([-+.]\w+)*@\w+([-.]\w+)*\.\w+([-.]\w+)*";
Regex regex = new Regex(
    pattern,
    RegexOptions.None | RegexOptions.Multiline | RegexOptions.IgnoreCase);
MatchCollection matches = regex.Matches(input); // Here is where it takes time
MessageBox.Show(matches.Count.ToString());
foreach (Match match in matches)
{
...
}
Any idea how to speed it up?

Changing RegexOptions.None | RegexOptions.Multiline | RegexOptions.IgnoreCase to RegexOptions.Compiled yields the same results (since your pattern does not include any literal letters or ^/$).
On my machine this reduces the time taken on the sample document you linked from 46 seconds to 21 seconds (which still seems slow to me, but might be good enough for you).
EDIT: So I looked into this some more and have discovered the real issue.
The problem is with the first half of your regex: \w+([-+.]\w+)*@. This works fine when matching sections of the input that actually contain the @ symbol, but for sections that match just \w+([-+.]\w+)* and are not followed by @, the regex engine wastes a lot of time backtracking and retrying from each position in the sequence (and failing again because there is still no @!).
You can fix this by forcing the match to start at a word boundary using \b:
string pattern = @"\b\w+([-+.]\w+)*@\w+([-.]\w+)*\.\w+([-.]\w+)*";
On your sample document, this produces the same 10 results in under 1 second.
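As a rough illustration of the word-boundary fix, here is a minimal sketch that runs both patterns over a small, made-up input and returns the matched addresses. The class name, the helper Find and the sample text are assumptions for illustration only; the asker's real document is not reproduced here, so this shows that the results agree, not the timing difference.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class WordBoundaryDemo
{
    // Original pattern from the question.
    public const string Unanchored = @"\w+([-+.]\w+)*@\w+([-.]\w+)*\.\w+([-.]\w+)*";

    // Same pattern, but a match may only begin at a word boundary.
    public const string Anchored = @"\b" + Unanchored;

    // Returns the text of every match of the pattern in the input.
    public static string[] Find(string pattern, string input)
    {
        return Regex.Matches(input, pattern)
                    .Cast<Match>()
                    .Select(m => m.Value)
                    .ToArray();
    }

    public static void Main()
    {
        // Hypothetical sample text: two addresses surrounded by ordinary words.
        string input = "Contact john.doe@example.com or jane+tag@mail.example.org today.";
        foreach (string address in Find(Anchored, input))
            Console.WriteLine(address);
    }
}
```

With \b in front, the engine no longer retries the match from the middle of every word, which is where the wasted attempts described above come from.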

Try running the regex over a stream, or try the Mono project's regex implementation. For .NET, this article may be useful:
Building a Regular Expression Stream with the .NET Framework
Also try to improve the performance of the regex itself.

To answer how to change it, you need to tell us, what it should match.
The problem is probably in the last part, @\w+([-.]\w+)*\.\w+([-.]\w+)*. On a string like "bla@a.b.c.d.e-f.g.h" it has to try many possibilities until it finds a match.
This could be a mild case of catastrophic backtracking.
So you need to define your pattern in a stricter, more "unique" way. Do you really need "dash/dot - dot - dash/dot"?

Related

Regular expression performance issue

I've got a long string which contains about 100 parameters (string parameterName) matching the following pattern:
parameterName + "(Whitespace | CarriageReturn | Tabulation | Period | Underline | Digit | Letter | QuotationMark | Slash)* DATA Whitespace Hexadecimal"
I've tried to use this regular expression, but it takes far too long:
parameterName + "[\\s\\S]*DATA\\s0x[0-9A-F]{4,8}"
This messy one works a little better:
parameterName + "(\\s|\r|\n|\t|\\.|[_0-9A-z]|\"|/)*DATA\\s0x[0-9A-F]{4,8}"
I'd use ".*", however, it doesn't match "\n".
I've tried "(.|\n)", but it works even slower than "[\s\S]".
Is there any way to improve this regular expression?

You can use something like
(?>(?>[^D]+|D(?!ATA))*)DATA\\s0x[0-9A-F]{4,8}
(?>                # atomic grouping (no backtracking)
    (?>            # atomic grouping (no backtracking)
        [^D]+      # anything but a D
    |              # or
        D(?!ATA)   # a D not followed by ATA
    )*             # zero or more times
)
The idea
The idea is to get to the DATA without asking ourselves any question, and to not go any further and then backtrack to it.
If you use .*DATA on a string like DATA321, see what the regex engine does:
.* eats up all the string
There's no DATA to be found, so step by step the engine will backtrack and try these combinations: .* will eat only DATA32, then DATA3, then DATA... then nothing and that's when we find our match.
Same thing happens if you use .*?DATA on 123DATA: .*? will try to match nothing, then 1, then 12...
On each try we have to check there is no DATA after the place where .* stopped, and this is time consuming. With the [^D]+|D(?!ATA) we ensure we stop exactly when we need to - not before, not after.
Beware of backtracking
So why not use (?:[^D]|D(?!ATA)) instead of these weird atomic groups?
This is all good and works fine when there is a match to be found. But what happens when there isn't? Before declaring failure, the regex engine has to try ALL possible combinations. And when you have something like (.*)*, at each character the regex engine can use either the inner * or the outer one.
This means the number of combinations very rapidly becomes huge. We don't want to try all of them: we know that we stopped at the right place, so if we didn't find a match right away we never will. Hence the atomic grouping (apparently .NET doesn't support possessive quantifiers).
You can see what I mean over here: 80'000 steps to check a 15 character long string that will never match.
This is discussed more in depth (and better put than what I could ever do) in this great article by Friedl, regex guru, over here
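To make the atomic-group approach concrete, here is a small sketch that prepends a parameter name to the pattern above and runs it over a short, invented input. The class and method names, the parameter name "Speed" and the sample text are hypothetical. Regex.Escape is added on the assumption that parameter names might contain regex metacharacters.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AtomicGroupDemo
{
    // Skips ahead to the first "DATA" without leaving backtracking positions behind.
    const string SkipToData = @"(?>(?>[^D]+|D(?!ATA))*)DATA\s0x[0-9A-F]{4,8}";

    public static Match FindParameter(string parameterName, string input)
    {
        // Regex.Escape guards against metacharacters in the parameter name.
        return Regex.Match(input, Regex.Escape(parameterName) + SkipToData);
    }

    public static void Main()
    {
        // Hypothetical input: the name, some filler spanning lines, then the value.
        string input = "Speed \"rpm/Drive_1\"\r\n\tDATA 0x1A2B\r\nOther DATA 0xFFFF";
        Match m = FindParameter("Speed", input);
        Console.WriteLine(m.Success ? m.Value : "no match");
    }
}
```

Because [^D] also matches newlines, this version does not need Singleline or [\s\S]. When no DATA follows the name, the atomic groups let the attempt fail in one pass instead of retrying every split.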

Regex For Anchor Tag Pegging CPU

So I have broken my regex in half and found that this half is causing the CPU to peg under load. What are some performance savers that could help? I can't really find any spots to shave off.
private string pattern =
    @"(<a\s+href\s*=\s*(\""|')http(s)?://(www.)?([a-z0-9]+\-?[a-z0-9]+\.)?wordpress.org(\""|'))";
Regex wordPressPattern = new Regex(pattern, RegexOptions.Compiled | RegexOptions.IgnoreCase | RegexOptions.Singleline);

The only thing that jumps out at me as a performance sink is this part:
[a-z0-9]+\-?[a-z0-9]+
The idea is to match hyphenated words like luxury-yacht or THX-1138, while still allowing words without hyphens. Trouble is, if there's no hyphen, the regex engine still has to choose how to distribute the characters between the first [a-z0-9]+ and the second one. If it tries to match word as w-o-r-(no hyphen)-d, and something later in the regex fails to match, it has to come back and try w-o-(no hyphen)-r-d, and so on. These efforts are pointless, but the regex engine has no way to know that. You need to give it a little help, like so:
[a-z0-9]+(-[a-z0-9]+)?
Now you're saying, "If you run out of alphanumerics, and the next character is a hyphen, try to match some more alphanumerics. Otherwise, go on to the next part." But you don't need to be so specific in this case; you're trying to find the URLs, not validate them. I recommend you replace that part with:
[a-z0-9-]+
This also allows it to match words with more than one hyphen (e.g., james-bond, but also james-bond-007).
You also have a lot of unnecessary capturing groups. You don't seem to be using the captures, so you might as well use the ExplicitCapture option to improve performance a little more. But most of the groups don't seem to be needed even for pure grouping purposes. I suggest you try this regex:
@"<a\s+href\s*=\s*[""']https?://([a-z0-9-]+\.)+wordpress\.org[""']"
...with these options:
RegexOptions.Compiled | RegexOptions.IgnoreCase | RegexOptions.ExplicitCapture
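A minimal sketch of how the suggested regex and options might be wired up, using invented HTML fragments as input. The class and method names are assumptions. One thing worth checking against your own data: as written, the + after the subdomain group requires at least one label before wordpress.org, so a bare http://wordpress.org link is not matched; using * instead would allow it.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AnchorTagDemo
{
    // Built once and reused; ExplicitCapture turns the unnamed group into a non-capturing one.
    static readonly Regex WordPressLink = new Regex(
        @"<a\s+href\s*=\s*[""']https?://([a-z0-9-]+\.)+wordpress\.org[""']",
        RegexOptions.Compiled | RegexOptions.IgnoreCase | RegexOptions.ExplicitCapture);

    public static bool IsWordPressLink(string html)
    {
        return WordPressLink.IsMatch(html);
    }

    public static void Main()
    {
        // Hypothetical fragments for illustration.
        Console.WriteLine(IsWordPressLink("<a href=\"http://www.wordpress.org\">"));
        Console.WriteLine(IsWordPressLink("<A HREF = 'https://my-blog.wordpress.org'>"));
        Console.WriteLine(IsWordPressLink("<a href=\"http://example.com\">"));
    }
}
```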

C# program freezes for a minute when running a bunch of RegExes

I have a program that runs a large number of regular expressions (10+) on a fairly long set of texts (5-15 texts about 1000 words each)
Every time that is done I feel like I forgot a Thread.Sleep(5000) in there somewhere. Are regular expressions really processor-heavy or something? It'd seem like a computer should crank through a task like that in a millisecond.
Should I try and group all the regular expressions into ONE monster expression? Would that help?
Thanks
EDIT: Here's a regex that runs right now:
Regex rgx = new Regex(@"((.*(\w+([-+.]\w+)*@\w+([-.]\w+)*\.\w+([-.]\w+)*).*)|(.*(keyword1)).*|.*(keyword2).*|.*(keyword3).*|.*(keyword4).*|.*(keyword5).*|.*(keyword6).*|.*(keyword7).*|.*(keyword8).*|.*(count:\n[0-9]|count:\n\n[0-9]|Count:\n[0-9]|Count:\n\n[0-9]|Count:\n).*|.*(keyword10).*|.*(summary: \n|Summary:\n).*|.*(count:).*)", RegexOptions.Compiled | RegexOptions.IgnoreCase);
Regex regex = new Regex(@".*(\.com|\.biz|\.net|\.org|\.co\.uk|\.bz|\.info|\.us|\.cm|(<a href=)).*", RegexOptions.Compiled | RegexOptions.IgnoreCase);
It's pretty huge, no doubt about it. The idea is if it gets to any of the keywords or the link it will just take out the whole paragraph surrounding it.

Regexes don't kill CPUs, regex authors do. ;)
But seriously, if regexes always ran as slowly as you describe, nobody would be using them. Before you start loading up silver bullets like the Compiled option, you should go back to your regex and see if it can be improved.
And it can. Each keyword is in its own branch/alternative, and each branch starts with .*, so the first thing each branch does is consume the remainder of the current paragraph (i.e., everything up to the next newline). Then it starts backtracking as it tries to match the keyword. If it gets back to the position it started from, the next branch takes over and does the same thing.
When all branches have reported failure, the regex engine bumps ahead one position and goes through all the branches again. That's over a dozen branches, times the number of characters in the paragraph, times the number of paragraphs... I think you get the point. Compare that to this regex:
Regex re = new Regex(@"^.*?(\w+([-+.]\w+)*@\w+([-.]\w+)*\.\w+([-.]\w+)*|keyword1|keyword2|keyword3|keyword4|keyword5|keyword6|keyword7|keyword8|count:(\n\n?[0-9]?)?|keyword10|summary: \n).*$",
    RegexOptions.Multiline | RegexOptions.IgnoreCase | RegexOptions.ExplicitCapture);
There are three major changes:
I factored out the leading and trailing .*
I changed the leading one to .*?, making it non-greedy
I added start-of-line and end-of-line anchors (^ and $ in Multiline mode)
Now it only makes one match attempt per paragraph (pass or fail), and it practically never backtracks. I could probably make it even more efficient if I knew more about your data. For example, if every keyword/token/whatever starts with a letter, a word boundary would have an appreciable effect (e.g. ^.*?\b(\w+...).
The ExplicitCapture option makes all the "bare" groups ((...)) act like non-capturing groups ((?:...)), reducing the overhead a little more without adding clutter to the regex. If you want to capture the token, just change that first group to a named group (e.g.(?<token>\w+...).
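Here is a cut-down sketch of the restructured regex with the first group turned into a named group, so the token that flagged each paragraph can be read back. The keyword list is shortened, and the class name, method name and sample text are invented for illustration. The sample uses \n line endings, since $ in Multiline mode matches before \n.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class ParagraphFilterDemo
{
    // One attempt per line: anchored at ^, lazy lead-in, named token group, rest of line.
    static readonly Regex Flagged = new Regex(
        @"^.*?(?<token>\w+([-+.]\w+)*@\w+([-.]\w+)*\.\w+([-.]\w+)*|keyword1|keyword2|count:).*$",
        RegexOptions.Multiline | RegexOptions.IgnoreCase | RegexOptions.ExplicitCapture);

    // Returns the token that caused each flagged line to match.
    public static List<string> FlaggedTokens(string text)
    {
        var tokens = new List<string>();
        foreach (Match m in Flagged.Matches(text))
            tokens.Add(m.Groups["token"].Value);
        return tokens;
    }

    public static void Main()
    {
        // Hypothetical text with one paragraph per line.
        string text = "Nothing to see here.\nWrite to bob@example.com now.\nThe Count: 5\nAlso clean.";
        foreach (string token in FlaggedTokens(text))
            Console.WriteLine(token);
    }
}
```

Each Match also exposes the whole paragraph through m.Value, which is what you would remove or replace.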

First of all, you may try the RegexOptions.Compiled option.
Here is a good article about regex performance: regex-performance
Second, regex performance depends on the pattern:
some patterns are much quicker than others. You have to specify the pattern as strictly as possible.
And third:
I had some trouble with regex performance, and in that case I used the string.Contains method as a cheap pre-check.
For example:
bool IsSomethingValid(string source, string shouldContain, string pattern)
{
    // Cheap substring test first; only run the regex if it can possibly match.
    bool res = source.Contains(shouldContain);
    if (res)
    {
        var regex = new Regex(pattern, RegexOptions.Compiled);
        res = regex.IsMatch(source);
    }
    return res;
}
If you give us some examples of your script, we can try to improve them.
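One possible refinement of the pre-check idea, sketched under the assumption that the pattern is fixed and known in advance: build the Regex once in a static field instead of constructing a Compiled instance on every call, since construction with that option is comparatively expensive. The class name, method name and the choice of the e-mail pattern are illustrative only.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PrefilterDemo
{
    // Built once and reused across calls.
    static readonly Regex Email = new Regex(
        @"\b\w+([-+.]\w+)*@\w+([-.]\w+)*\.\w+([-.]\w+)*",
        RegexOptions.Compiled | RegexOptions.ExplicitCapture);

    public static bool ContainsEmail(string source)
    {
        // Cheap character scan first: without an '@' the regex cannot match.
        return source.IndexOf('@') >= 0 && Email.IsMatch(source);
    }

    public static void Main()
    {
        Console.WriteLine(ContainsEmail("mail bob@example.com please"));
        Console.WriteLine(ContainsEmail("no address here"));
    }
}
```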

Never assume anything about what and for which reason your application is slow. Instead, always measure it.
Use a decent performance profiler (e.g. Red Gate's ANTS Performance Profiler, which offers a free 14-day trial version) and actually look at the performance bottlenecks.
In my own experience, I was always wrong about the real cause of bad performance. With the help of profiling, I could find the slow code segments and adjust them to improve performance.
If the profiler confirms your assumption about the regular expressions, you can then try to optimize them (by modifying or pre-compiling them).

If the pattern doesn't deal with case, don't use the IgnoreCase option. See
Want faster regular expressions? Maybe you should think about that IgnoreCase option...
Otherwise, as already mentioned, but collected here:
Put all operations on separate background threads. Don't hold up the GUI.
Examine each pattern. Too many uses of * (zero or more) can hurt; where a + (one or more) works instead, it gives the regex engine a big hint and avoids a lot of fruitless backtracking. See (Backtracking)
Compiling a pattern can take time; use the Compiled option to avoid it being parsed again. If you use the static methods to call the regex parser, though, the pattern is cached automatically.
See Regex hangs with my expression [David Gutierrez] for some other goodies.

Help writing a regular expression

I asked a very similar question to this one almost a month ago here.
I am trying very hard to understand regular expressions, but not a bit of it makes any sense. SLak's solution in that question worked well, but when I try to use the Regex Helper at http://gskinner.com/RegExr/ it only matches the first comma of -2.2,1.1-6.9,2.3-12.8,2.3 when given the regex ,|(?<!^|,)(?=-)
In other words I can't find a single regex tool that will even help me understand it. Well, enough whining. I'm now trying to re-write this regex so that I can do a Regex.Split() to split up the string 2.2 1.1-6.9,2.3-12.8 2.3 into -2.2, 1.1, -6.9, 2.3, -12.8, and 2.3.
The difference from the aforementioned question is that there can now be leading and/or trailing whitespace, and that whitespace can act as a delimiter, as can a comma.
I tried using \s|,|(?<!^|,)(?=-) but this doesn't work. I tried using this to split 293.46701,72.238185, but C# just tells me "the input string was not in a correct format". Please note that there is leading and trailing whitespace that SO does not display correctly.
EDIT: Here is the code which is executed, and the variables and values after execution of the code.

If it doesn't have to be Regex, and if it doesn't have to be slow :-) this should do it for you:
var components = "2.2 1.1-6.9,2.3-12.8 2.3".Replace("-", ",-").
Split(new[]{' ', ','},StringSplitOptions.RemoveEmptyEntries);
Components would then contain:[2.2 1.1 -6.9 2.3 -12.8 2.3]

Does it need to be split? You could do Regex.Matches(text, @"\-?[\d]+(\.[\d]+)?").
If you need split, Regex.Split(text, @"[^\d.-]+|(?=-)") should work also.
P.S. I used Regex Hero to test on the fly http://regexhero.net

Unless I'm missing the point entirely (it's Sunday night and I'm tired ;) ) I think you need to concentrate more on matching the things you do want and not the things you don't want.
Regex argsep = new Regex(@"\-?[0-9]+\.?[0-9]*");
string text_to_split = "-2.2 1.1-6.9,2.3-12.8 2.3 293.46701,72.238185";
var tmp3 = argsep.Matches(text_to_split);
This gives you a MatchCollection of each of the values you wanted.
To break that down and try and give you an understanding of what it's saying, split it up into parts:
\-? Matches a literal minus sign (\ denotes literal characters) zero or one time (?)
[0-9]+ Matches any character from 0 to 9, one or more times (+)
\.? Matches a literal full stop, zero or one time (?)
[0-9]* Matches any character from 0 to 9 again, but this time it's zero or more times (*)
You don't need to worry about things like \s (spaces) for this regex, as the things you're actually trying to match are the positive/negative numbers.
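To show how the matches might be turned into numbers, here is a short sketch built on the same pattern. The class and method names are made up, and parsing with CultureInfo.InvariantCulture is an assumption so that "." is always read as the decimal separator regardless of the machine's locale.

```csharp
using System;
using System.Globalization;
using System.Linq;
using System.Text.RegularExpressions;

public static class NumberExtractionDemo
{
    // Matches each signed decimal number and parses it culture-independently.
    public static double[] ParseNumbers(string text)
    {
        return Regex.Matches(text, @"-?[0-9]+\.?[0-9]*")
                    .Cast<Match>()
                    .Select(m => double.Parse(m.Value, CultureInfo.InvariantCulture))
                    .ToArray();
    }

    public static void Main()
    {
        // Leading and trailing whitespace is simply skipped, because only numbers are matched.
        double[] values = ParseNumbers(" -2.2 1.1-6.9,2.3-12.8 2.3 ");
        Console.WriteLine(string.Join("; ", values.Select(v => v.ToString(CultureInfo.InvariantCulture))));
    }
}
```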

Consider using the string split function. String operations are way faster than regular expressions and much simpler to use/understand.

If the "Matches" approach doesn't work, you could perhaps hack something together in two steps?
Regex RE = new Regex(@"(-?[\d.]+)|,|\s+");
RE.Split(" -2.2,1.1-6.9,2.3-12.8,2.3 ")
.Where(s=>!string.IsNullOrEmpty(s))
Outputs:
-2.2
1.1
-6.9
2.3
-12.8
2.3

Regex.Replace doesn't seem to work with back-reference

I made an application designed to prepare files for translation using lists of regexes.
It runs each regex on the file using Regex.Replace. There is also an inspector module which allows the user to see the matches for each regex on the list.
It works well, except when a regex contains a back-reference, Regex.Replace does not replace anything, yet the inspector shows the matches properly (so I know the regex is valid and matches what it should).
sSrcRtf = Regex.Replace(sSrcRtf, sTag, sTaggedTag,
RegexOptions.Compiled | RegexOptions.Singleline);
sSrcRtf contains the RTF code of the page. sTag contains the regular expression in between parentheses. sTaggedTag contains $1 surrounded by the tag formating code.
To give an example:
sSrcRtf = Regex.Replace("the little dog", "((e).*?\1)", "$1",
RegexOptions.Compiled | RegexOptions.Singleline);
doesn't work. But
sSrcRtf = Regex.Replace("the little dog", "((e).*?e)", "$1",
RegexOptions.Compiled | RegexOptions.Singleline);
does. (of course, there is some RTF code around $1)
Any idea why this is?

You technically have two match groups there, the outer and the inner parentheses. Why don't you try addressing the inner set as the second capture, e.g.:
((e).*?\2)
Your parser probably thinks the outer capture is \1, and it doesn't make much sense to backreference it from inside itself.
Also note that your replacement won't do anything, since you are asking to replace the portion that you match with itself. I'm not sure what your intended behavior is, but if you are trying to extract just the match and discard the rest of the string, you want something like:
.*((e).*?\2).*

You're using a reference to a group inside the group you're referencing.
"((e).*?\1)" // first capturing group
"(e)" // second capturing group
I'm not 100% certain, but I don't think you can reference a group from within that group. For starters, what would you expect the backreference to match, since it's not even complete yet?

As others have mentioned, there are some additional groups being captured. Your replacement isn't referencing the correct one.
Your current regex should be rewritten as (options elided):
Regex.Replace("the little dog", @"((e).*?\2)", "$2")
// or
Regex.Replace("the little dog", @"(e).*?\1", "$1")
Here's another example that matches doubled words and indicates which backreferences work:
Regex.Replace("the the little dog", @"\b(\w+)\s+\1\b", "$1") // good
Regex.Replace("the the little dog", @"\b((\w+)\s+\2)\b", "$1") // no good
Regex.Replace("the the little dog", @"\b((\w+)\s+\2)\b", "$2") // good
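The doubled-word examples can be wrapped into a small runnable sketch to see which group numbers line up. The class and method names are hypothetical; the patterns and replacements are the ones listed above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BackreferenceDemo
{
    // Group 1 is already closed when \1 is reached, so the backreference is valid.
    public static string CollapseDoubledWords(string s)
    {
        return Regex.Replace(s, @"\b(\w+)\s+\1\b", "$1");
    }

    // With an extra outer group, the repeated word is group 2 and the whole pair is group 1.
    public static string CollapseWithOuterGroup(string s, string replacement)
    {
        return Regex.Replace(s, @"\b((\w+)\s+\2)\b", replacement);
    }

    public static void Main()
    {
        Console.WriteLine(CollapseDoubledWords("the the little dog"));
        Console.WriteLine(CollapseWithOuterGroup("the the little dog", "$2"));
        // "$1" is the whole doubled pair, so the text comes back unchanged.
        Console.WriteLine(CollapseWithOuterGroup("the the little dog", "$1"));
    }
}
```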
