Optimize performance with multiple calls to Regex.IsMatch on large text

Optimize performance with multiple calls to Regex.IsMatch on large text - c#

I have a long text (50-60 KB) and I need to run several regular expressions against it (about 100 rules in total). However, this is so slow that it essentially doesn't work.
All I have done is created a loop around the rules where each rule does a Regex.IsMatch().
Is there a way to optimize this?
UPDATE
Sample code of what each rule is doing:
public class SomeRegexInterceptor : ValidatorBase
{
private readonly Regex _rgx = new Regex("some regex", RegexOptions.Compiled | RegexOptions.IgnoreCase | RegexOptions.Singleline);
public override void Intercept(string html, ValidationResultCollection collection)
{
if (!_rgx.IsMatch(html)) return;
/* do something irrelevant here */
}
}

The most important thing about the usage of Regex replacements is how and where you declare your Regex. Never initialize a Regex object inside a loop.
Create a static class and add public static readonly Regex fields with RegexOptions.Compiled flag set.
Then, use them wherever you need using something like MyRegexClass.LeadingWhitespace.Replace(str, string.Empty).
Note that if you need to use Regex.Replace, you do not need to check if there is a match with Regex.IsMatch before.
Read and follow the recommendations outlined at Best Practices for Regular Expressions in the .NET Framework, namely:
Consider the Input Source
Handle Object Instantiation Appropriately
Take Charge of Backtracking
Use Time-out Values
Capture Only When Necessary
Also, consider processing the file line by line, and avoid regular expressions wherever you can do without them.

Can you combine your rules into one rule? For example: if you are doing a regex for "aaa" and then one for "bbb", combining into "aaa|bbb" improves performance greatly ( rather than running two separate regexes ). I have programatically combined a large number of regexes in this way before and it makes a huge performance difference.

Combining Expressions
If you have the ability to do work prior to running your rules, you could combine the regexes with | and do the search in one pass. Then in your rule you don't check for a match anonymously like you were, but by the name of the group. For example
>((?<Ex1>expression1)|(?<Ex2>expression2)|(?<Ex3>expression3))
So the rule that cares about group Ex1, it checks that group. The rule that cares about Ex2 checks Ex2 and so on.
Additionally, you could possibly analyze the set of regexes and optimize them somehow. That would be quite a programmatic feat unless you're doing something simple like eliminating duplicates.
Caching, Cache Size
One other idea is to cache up and compile all the regexes once somewhere, and change the size of Regex.CacheSize to see if it helps. The documentation states the default value is 15, but you're over that.
More Info
The comments about instantiating and compiling the expression once, not looeven that big one there, are valid. There are additional recommendations from MSDN.

Related

How to match condition IF in C#

I have a string like this.
string strex = "Insert|Update|Delete"
I am retrieving another string as string strex1 = "Insert" (It may retrieve Update or Delete)
I need to match strex1 with strex in "IF" condition in C#.
Do I need to split strex and match with strex1?

The string you posted is a regular expression pattern that matches the words Insert, Update or Delete. Regular expressions are a very common way of specifying validation rules in web applications.
Regular expressions can express far more complex rules than a simple comparison. They're also far faster (think 10x) in validation scenarios than splitting. In a web application, that translates to using fewer servers to serve the same traffic.
You can use .NET's Regex to match strings with that pattern, eg :
var strex = "Insert|Update|Delete";
if (Regex.IsMatch(input,strex))
{
....
}
This will create a new regular expression object each time. You can avoid this by creating a static Regex instance and reuse it. Regex is thread-safe which means there's no problem using the same instance from multiple threads :
static Regex _cmdRegex = new Regex("Insert|Update|Delete");
...
void MyMethod(string input)
{
if(_cmdRegex.IsMatch(input))
{
...
}
}
The Regex class methods will match if the pattern appears anywhere in the pattern. Regex.IsMatch("Insert1",strex) will return True. If you want an exact match, you have to specify that the pattern starts at the beginning of the input with ^ and ends at the end with $ :
static Regex _cmdRegex = new Regex("^(Insert|Update|Delete)$");
With this change, _cmdRegex.IsMatch("Insert1") will return false but _cmdRegex.IsMatch("Insert") will return true.
Performance
In this case a regular expression is a lot faster than splitting and trying exact matches. Think 10-100x over time. There are two reasons for this:
Strings are immutable, so every string modification operation like Split() will generate new temporary strings that have to be allocated and garbage collected. In a busy web application this adds up, eventually using up a lot of RAM and CPU for little or no benefit. One of the reasons ASP.NET Core is 10x times faster than the old ASP.NET is eliminating such substring operations wherever possible.
A regular expression is compiled into a program that performs matching in the most efficient way. When you use Split().Any() the program will compare the input with all the substrings even if it's obvious there's no possible match, eg because the first letter is Z. A Regex program on the other hand would only proceed if the first character was I, U or D

Efficient way I can think of is using string.Contains()
if(strex.Contains($"{strex1}|") || strex.Contains($"|{strex1}"))
{
//Your code goes here
}
Solution using Linq, Split string strex by '|' and check strex1 is present in an array or not, like
Issue with below solution is pointed out by #PanagiotisKanavos in the
comment.
Using .Any(),
if(strex.Split('|').Any(x => x.Equals(strex1)))
{
//Your code goes here
}
or using Contains(),
if(strex.Split('|').Contains(strex1))
{
//Your code goes here
}
if you want to ignore case while comparing string then you can use StringComparison.OrdinalIgnoreCase.
if(strex.Split('|').Any(x => x.Equals(strex1, StringComparison.OrdinalIgnoreCase))
{
//Your code goes here
}
.NETFIDDLE

Regex MatchCollection obj hangs\"Function evuluation timed out" after Regex.Matches

I'm kind of new too C#, and regular expression for that matter, but I've searched a couple of hours to find a solution too this problem so, hopefully this is easy for you guys:)
My application uses a regex to match email addresses in a given string,
then loops throu the matches.:
String EmailPattern = "\\w+([-+.]\\w+)*#\\w+([-.]\\w+)*\\.\\w+([-.]\\w+)*";
MatchCollection mcemail = Regex.Matches(rawHTML, EmailPattern);
foreach (Match memail in mcemail)
Works fine, but, when I downloaded the string from a certain page, http://www.sp.se/sv/index/services/quality/sidor/default.aspx, the MatchCollection(mcemail) object "hangs" the loop. When using a break point and accessing the object, I get "Function evuluation timed out" on everything(.Count etc).
Update
I've tried my pattern and other email patterns on the same string, everyone(regex desingers, python based web pages etc.) fails/timesout when trying too match this particular string.
How can I detect that the matchcollection obj is not "ready" to use?

If you can post the email that's causing the problem (perhaps anonymized in some way), that will give us more information, but I'm thinking the problem is this little guy right here:
([-.]\\w+)*\\.\\w+([-.]\\w+)*
To understand the problem, let's break that into groups:
([-.]\\w+)*
\\.\\w+
([-.]\\w+)*
The strings that will match \\.\\w+ are a subset of those that will match [-.]\\w+. So if part of your input looks like foo.bar.baz.blah.yadda.com, your regex engine has no way of knowing which group is supposed to match it. Does that make sense? So the first ([-.]\\w+)* could match .bar.baz.blah, then the \\.\\w+ could match .yadda, then the last ([-.]\\w+)* could match .com...
...OR the first clause could match .bar.baz, the second could match .blah, and the last could match .yadda.com. Since it doesn't know which one is right, it will keep trying different combinations. It should stop eventually, but that could still take a long time. This is called "catastrophic backtracking".
This issue is compounded by the fact that you're using capturing groups rather than non-capturing groups; i.e. ([-+.]\\w+) instead of (?:[-+.]\\w+). That causes the engine to try and separate and save whatever matches inside the parentheses for later reference. But as I explained above, it's ambiguous which group each substring belongs in.
You might consider replacing everything after the # with something like this:
\\w[-\\w]*\\.[-.\\w]+
That could use some refinement to make it more specific, but you get the general idea. Hope I explained all this well enough; grouping and backreferences are kind of tough to describe.
EDIT:
Looking back at your pattern, there's a deeper issue here, still related to the backtracking/ambiguity problem I mentioned. The clause \\w+([-.]\\w+)* is ambiguous all by itself. Splitting it into parts, we have:
\\w+
([-.]\\w+)*
Suppose you have a string like foobar. Where does the \\w+ end and the ([-.]\\w+)* begin? How many repetitions of ([-.]\\w+) are there? Any of the following could work as matches:
f(oobar)
foo(bar)
f(o)(oba)(r)
f(o)(o)(b)(a)(r)
foobar
etc...
The regex engine doesn't know which is important, so it will try them all. This is the same problem I pointed out above, but it means you have it in multiple places in your pattern.
Even worse, ([-.]\\w+)* is also ambiguous, because of the + after the \\w. How many groups are there in blah? I count 16 possible combinations: (blah), (b)(lah), (bl)(ah)...
The amount of different possible combinations is going to be huge, even for a relatively small input, so your engine is going to be in overdrive. I would definitely simplify it if I were you.

I just did a local test and it appears either the sheer document size or something in the ViewState causes the Regex match evaluation to time out. (Edit: I'm pretty sure it's the size, actually. Removing the ViewState just reduces the size significantly.)
An admittedly crude way to solve this would be something like this:
string[] rawHtmlLines = File.ReadAllLines(#"C:\default.aspx");
string filteredHtml = String.Join(Environment.NewLine,
rawHtmlLines.Where(line => !line.Contains("_VIEWSTATE")).ToArray());
string emailPattern = #"\w+([-+.]\w+)*#\w+([-.]\w+)*\.\w+([-.]\w+)*";
var emailMatches = Regex.Matches(filteredHtml, emailPattern);
foreach (Match match in emailMatches)
{
//...
}
Overall I suspect the email pattern is just not well optimised (or intended) to filter out emails in a large string but just used as validation for user input. Generally it might be a good idea to limit the string you search in to just the parts you are actually interested in and keep it as small as possible - for example by leaving out the ViewState which is guaranteed to not contain any readable email addresses.
If performance is important, it's probably also a better idea to create the filtered HTML using a StringBuilder and IndexOf (etc.) instead of splitting lines and LINQing up the result :)
Edit:
To further minimize the length of the string the Regex needs to check you could only include lines that contain the # character to begin with, like so:
string filteredHtml = String.Join(Environment.NewLine,
rawHtmlLines.Where(line => line.IndexOf('#') >= 0 && !line.Contains("_VIEWSTATE")).ToArray());

From "Function evaluation timed out", I'm assuming you're doing this in the debugger. The debugger has some fairly quick timeouts with regard to how long a method takes. Not eveything happens quickly. I would suggest going the operation in code, storing the result, then viewing that result in the debugger (i.e. let the call to Matches run and put a breakpoint after it).
Now, with regard to detecting whether the string will make Matches take a long time; that's a bit of a black art. You basically have to perform some sort of input validation. Just because you got some value from the internet, doesn't mean that value will work well with Matches. The ultimate validation logic is up to you; but, starting with the length of rawHtmlLines might be useful. (i.e. if the lenght is 1000000 bytes, Matches might take a while) But, you have to decide what to do if the length is too long; e.g give an error to the user.

Regex to parse C/C++ functions declarations

I need to parse and split C and C++ functions into the main components (return type, function name/class and method, parameters, etc).
I'm working from either headers or a list where the signatures take the form:
public: void __thiscall myClass::method(int, class myOtherClass * )
I have the following regex, which works for most functions:
(?<expo>public\:|protected\:|private\:) (?<ret>(const )*(void|int|unsigned int|long|unsigned long|float|double|(class .*)|(enum .*))) (?<decl>__thiscall|__cdecl|__stdcall|__fastcall|__clrcall) (?<ns>.*)\:\:(?<class>(.*)((<.*>)*))\:\:(?<method>(.*)((<.*>)*))\((?<params>((.*(<.*>)?)(,)?)*)\)
There are a few functions that it doesn't like to parse, but appear to match the pattern. I'm not worried about matching functions that aren't members of a class at the moment (can handle that later). The expression is used in a C# program, so the <label>s are for easily retrieving the groups.
I'm wondering if there is a standard regex to parse all functions, or how to improve mine to handle the odd exceptions?

C++ is notoriously hard to parse; it is impossible to write a regex that catches all cases. For example, there can be an unlimited number of nested parentheses, which shows that even this subset of the C++ language is not regular.
But it seems that you're going for practicality, not theoretical correctness. Just keep improving your regex until it catches the cases it needs to catch, and try to make it as stringent as possible so you don't get any false matches.
Without knowing the "odd exceptions" that it doesn't catch, it's hard to say how to improve the regex.

Take a look at Boost.Spirit, it is a boost library that allows the implementation of recursive descent parsers using only C++ code and no preprocessors. You have to specify a BNF Grammar, and then pass a string for it to parse. You can even generate an Abstract-Syntax Tree (AST), which is useful to process the parsed data.
The BNF specification looks like for a list of integers or words separated might look like :
using spirit::alpha_p;
using spirit::digit_p;
using spirit::anychar_p;
using spirit::end_p;
using spirit::space_p;
// Inside the definition...
integer = +digit_p; // One or more digits.
word = +alpha_p; // One or more letters.
token = integer | word; // An integer or a word.
token_list = token >> *(+space_p >> token) // A token, followed by 0 or more tokens.
For more information refer to the documentation, the library is a bit complex at the beginning, but then it gets easier to use (and more powerful).

No. Even function prototypes can have arbitrary levels of nesting, so cannot be expressed with a single regular expression.
If you really are restricting yourself to things very close to your example (exactly 2 arguments, etc.), then could you provide an example of something that doesn't match?

testing for "EndsWith" efficiently with a Regex

I need to build a Regex (.NET syntax) to determine if a string ends with a specific value. Specifically I need to test whether a file has a specific extension (or set of extensions).
The code I'm trying to fix was using:
.*\.(png|jpg|gif)$
which is hideously slow for failed matches in my scenario (presumably due to the backtracking.
Simply removing the .* (which is fine since the API only tests for matches and doesn't extract anything) at the beginning makes the regex much more efficient.
It still feels like it is pretty inefficient. Am I missing something obvious here?
Unfortunately, I don't control the API in question so I need a regex to do this even though I wouldn't normally consider regex to be the right tool for the job.
I also did some tests using the RegexOptions.RightToLeft and found that I could squeeze a little more performance out of my test case with ^.*\.(png|jpg|gif)$, but I can't find a way to specify the RightToLeft option within the string of the regex itself so I don't think I can use it.

I don't have access to C# so I can't try this... but you should be able to avoid too much backtracking by forcing the engine to find the end of the string first, then matching the extensions:
$(?<=\.(gif|png|jpg))
I'm not sure of the effect the look-behind has on performance, though.

Really, you could also just drop Regex altogether, and use String.EndsWidth, with the following :
var extensions = new String[] { ".png", ".jpg", ".gif" };
extensions.Any(ext => "something".EndsWith(ext));
I usually have the feeling that it ends up being faster to use simple string functions for cases like this rather than trying to find a clever way to use an efficient regex, in terms of runtime and/or development time, unless you are comfortable with and know what is efficient in terms of Regex.

Make it look specifically for a period instead of any character preceding the extension:
\.(png|jpg|gif)$
This will make it safer (won't match x.xgif) and it will not have to do any backtracking at all until it found a period (as opposed to backtracking on every character).

If you can change the code, why can't you use something else? You don't control the API, right, but you are changing it anyway. This I really don't understand.
Anyway, why not simply:
var AcceptedExtensions = new List<string>() { "txt", "html", "htm" };
var extension = filename.Substring(filename.LastIndexOf(".") + 1).ToLower();
return AcceptedExtensions.Contains(extension);
The IEnumerable AcceptedExtensions would be loaded from some config, the same way you load your jpg|gif|.... Or it would be a constant, whatever. You just don't need to recreate it each time you are going to use it (I doubt that this would be a bottleneck though).

You probably don't need a regular expression for this... but going with the original question:
Make sure you're using RegexOptions.Compiled to pre-compile the regular expression and then reuse your RegEx object. This avoids setting up the RegEx every time you use it, this will speed things up a lot.

Regex index in matching string where the match failed

I am wondering if it is possible to extract the index position in a given string where a Regex failed when trying to match it?
For example, if my regex was "abc" and I tried to match that with "abd" the match would fail at index 2.
Edit for clarification. The reason I need this is to allow me to simplify the parsing component of my application. The application is an Assmebly language teaching tool which allows students to write, compile, and execute assembly like programs.
Currently I have a tokenizer class which converts input strings into Tokens using regex's. This works very well. For example:
The tokenizer would produce the following tokens given the following input = "INP :x:":
Token.OPCODE, Token.WHITESPACE, Token.LABEL, Token.EOL
These tokens are then analysed to ensure they conform to a syntax for a given statement. Currently this is done using IF statements and is proving cumbersome. The upside of this approach is that I can provide detailed error messages. I.E
if(token[2] != Token.LABEL) { throw new SyntaxError("Expected label");}
I want to use a regular expression to define a syntax instead of the annoying IF statements. But in doing so I lose the ability to return detailed error reports. I therefore would at least like to inform the user of WHERE the error occurred.

I agree with Colin Younger, I don't think it is possible with the existing Regex class. However, I think it is doable if you are willing to sweat a little:
Get the Regex class source code
(e.g.
http://www.codeplex.com/NetMassDownloader
to download the .Net source).
Change the code to have a readonly
property with the failure index.
Make sure your code uses that Regex
rather than Microsoft's.

I guess such an index would only have meaning in some simple case, like in your example.
If you'll take a regex like "ab*c*z" (where by * I mean any character) and a string "abbbcbbcdd", what should be the index, you are talking about?
It will depend on the algorithm used for mathcing...
Could fail on "abbbc..." or on "abbbcbbc..."

I don't believe it's possible, but I am intrigued why you would want it.

In order to do that you would need either callbacks embedded in the regex (which AFAIK C# doesn't support) or preferably hooks into the regex engine. Even then, it's not clear what result you would want if backtracking was involved.

It is not possible to be able to tell where a regex fails. as a result you need to take a different approach. You need to compare strings. Use a regex to remove all the things that could vary and compare it with the string that you know it does not change.
I run into the same problem came up to your answer and had to work out my own solution. Here it is:
https://stackoverflow.com/a/11730035/637142
hope it helps

We Keep Coding

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.

Optimize performance with multiple calls to Regex.IsMatch on large text - c#

Related

How to match condition IF in C#

Regex MatchCollection obj hangs\"Function evuluation timed out" after Regex.Matches

Regex to parse C/C++ functions declarations

testing for "EndsWith" efficiently with a Regex

Regex index in matching string where the match failed

Categories

Resources