testing for "EndsWith" efficiently with a Regex

I need to build a Regex (.NET syntax) to determine if a string ends with a specific value. Specifically I need to test whether a file has a specific extension (or set of extensions).
The code I'm trying to fix was using:
.*\.(png|jpg|gif)$
which is hideously slow for failed matches in my scenario (presumably due to the backtracking).
Simply removing the .* (which is fine since the API only tests for matches and doesn't extract anything) at the beginning makes the regex much more efficient.
It still feels like it is pretty inefficient. Am I missing something obvious here?
Unfortunately, I don't control the API in question so I need a regex to do this even though I wouldn't normally consider regex to be the right tool for the job.
I also did some tests using the RegexOptions.RightToLeft and found that I could squeeze a little more performance out of my test case with ^.*\.(png|jpg|gif)$, but I can't find a way to specify the RightToLeft option within the string of the regex itself so I don't think I can use it.

I don't have access to C# so I can't try this... but you should be able to avoid too much backtracking by forcing the engine to find the end of the string first, then matching the extensions:
$(?<=\.(gif|png|jpg))
I'm not sure of the effect the look-behind has on performance, though.
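A minimal sketch of how the look-behind pattern above could be used from C#. The class and method names are hypothetical, and nothing here measures performance; it only shows that the pattern accepts and rejects the expected file names.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ExtensionCheck
{
    // Match the end of the input first, then look behind for the extension.
    // Note: in .NET, "$" also matches just before a final newline.
    private static readonly Regex EndsWithImageExtension =
        new Regex(@"$(?<=\.(gif|png|jpg))");

    // Hypothetical helper: true when fileName ends in .gif, .png or .jpg.
    public static bool IsImage(string fileName)
    {
        return EndsWithImageExtension.IsMatch(fileName);
    }
}
```

For example, `ExtensionCheck.IsImage("photo.png")` is true, while `ExtensionCheck.IsImage("photo.txt")` is false.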

Really, you could also just drop Regex altogether and use String.EndsWith, as follows:
var extensions = new String[] { ".png", ".jpg", ".gif" };
extensions.Any(ext => "something".EndsWith(ext));
I usually have the feeling that it ends up being faster to use simple string functions for cases like this rather than trying to find a clever way to use an efficient regex, in terms of runtime and/or development time, unless you are comfortable with and know what is efficient in terms of Regex.

Make it look specifically for a period instead of any character preceding the extension:
\.(png|jpg|gif)$
This will make it safer (it won't match x.xgif), and it will not have to do any backtracking at all until it finds a period (as opposed to backtracking on every character).

If you can change the code, why can't you use something else? You don't control the API, right, but you are changing it anyway. This I really don't understand.
Anyway, why not simply:
var AcceptedExtensions = new List<string>() { "txt", "html", "htm" };
var extension = filename.Substring(filename.LastIndexOf(".") + 1).ToLower();
return AcceptedExtensions.Contains(extension);
The IEnumerable AcceptedExtensions would be loaded from some config, the same way you load your jpg|gif|.... Or it would be a constant, whatever. You just don't need to recreate it each time you are going to use it (I doubt that this would be a bottleneck though).
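A possible variation on the snippet above, sketched under a few assumptions: it uses Path.GetExtension instead of LastIndexOf (so a name without any dot, such as "txt", yields an empty extension rather than the whole name), and a case-insensitive HashSet instead of ToLower. The class and method names are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class ExtensionFilter
{
    // Path.GetExtension returns the extension including the leading dot,
    // so the accepted values are stored with the dot as well.
    private static readonly HashSet<string> AcceptedExtensions =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase) { ".txt", ".html", ".htm" };

    // Hypothetical helper: true when the file name has one of the accepted extensions.
    public static bool IsAccepted(string fileName)
    {
        return AcceptedExtensions.Contains(Path.GetExtension(fileName));
    }
}
```

The set is built once and reused, which matches the remark above about not recreating the list on every call.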

You probably don't need a regular expression for this... but going with the original question:
Make sure you're using RegexOptions.Compiled to pre-compile the regular expression, and then reuse your Regex object. This avoids setting up the Regex every time you use it, which will speed things up a lot.
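As a rough sketch of that advice, assuming the calling code is allowed to hold on to a Regex instance (which may not be true for the API in the question): the pattern is compiled once into a static field and reused for every check. The names are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ImageFileFilter
{
    // Built once and shared; RegexOptions.Compiled trades a slower
    // construction for faster matching on repeated use.
    private static readonly Regex ImageExtension =
        new Regex(@"\.(png|jpg|gif)$", RegexOptions.Compiled);

    // Hypothetical helper that reuses the shared Regex instance.
    public static bool IsImage(string fileName)
    {
        return ImageExtension.IsMatch(fileName);
    }
}
```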

Related

How to check the equality of two identical URLs if one of them has "www"?

The following URLs are identical:
http://example.com
http://www.example.com
Can I get the expected result using a method provided in the .NET Framework, like the Compare() method in the Uri class, or should I handle this case manually?

Unfortunately I don't have enough rep to leave a simple comment for you to ponder about so I'll just leave an answer as food for thought.
I can see a few possible solutions which should cover most scenarios (I'm pretty certain I will probably miss quite a few but everyone makes mistakes):
Use a loop to move over both URIs and compare each part/component as you see fit with break early conditions to speed it up
Clean up the URIs as much as you can using arbitrarily defined rules (eg. remove protocol, remove www prefix, trim ending /) then use Equals() to compare
If you're feeling bold you could use something similar to lexical analysis to convert both URIs into objects/tokens/parts and compare the subsequent result (harder to implement but probably the most accurate)
Just keep in mind that the guys in the comments are right. The URLs aren't technically the same and the logic you implement in your ultimate solution is purely defined around your definition of 'identical'.
Also, I wouldn't use the solution that simply replaces "www." with "", since someone crazy could easily put a 'www.' somewhere else in their URL and break that implementation. Performing the replace on both URLs is also quite risky, since one could have more 'www.' instances than the other and they would still be considered 'identical'.
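The "clean up, then compare" option from the list above could be sketched with System.Uri roughly as follows. The rule applied here (strip one leading "www." from the host only, then compare scheme, port, host and path) is just one arbitrary definition of 'identical', and the class and method names are invented for the example.

```csharp
using System;

public static class UrlComparer
{
    // Removes a single leading "www." from the host part only,
    // so a "www." in the path or query is left untouched.
    private static string NormalizeHost(Uri uri)
    {
        const string prefix = "www.";
        string host = uri.Host;
        return host.StartsWith(prefix, StringComparison.OrdinalIgnoreCase)
            ? host.Substring(prefix.Length)
            : host;
    }

    // Hypothetical helper: compares two absolute URLs component by component.
    public static bool AreEquivalent(string a, string b)
    {
        var left = new Uri(a);
        var right = new Uri(b);
        return left.Scheme == right.Scheme
            && left.Port == right.Port
            && string.Equals(NormalizeHost(left), NormalizeHost(right), StringComparison.OrdinalIgnoreCase)
            && left.PathAndQuery == right.PathAndQuery;
    }
}
```

Because only the host is normalized, a stray "www." elsewhere in the URL does not affect the result.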

As mentioned in the comments, these are actually not identical urls and cannot be treated as such.
That aside, it may help you to check the equality after removing the www part:
string urla = @"http://example.com";
string urlb = @"http://www.example.com";
// Strip the "www." part before comparing (string methods are case-sensitive: Contains, Replace)
if (urlb.Contains("www.")) urlb = urlb.Replace("www.", "");
if (urla == urlb) {
    // url matches
}

Looking for simple yet powerful windows wildcards (`*, ?`) matching implementation

I'm looking for simple and powerful way to implement Windows flavoured * and ? wildcards matching in strings.
BeginsWith() and EndsWith() are too simple to cover all cases, while translating wildcard expressions to regexes looks too complex, and I'm not sure about performance.
A happy medium wanted.
EDIT: I'm trying to parse .gitignore file and match the same files, as Git does. This means:
File should be out of repository's index (so I'm checking file's path against one stored in index)
Number of patterns in .gitignore can be large;
Number of files to check might also be large.

The equivalents of the Windows wildcards ? and * in regex are just . and .*.
[Edit] Given your new edit (stating that you're looking for actual files), I would skip the translation altogether and let .Net do the searching using Directory.GetFiles().
(note that, for some reason, passing a ? into Directory.GetFiles() matches "zero or one characters," whereas in Windows it always matches exactly one character)
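If the translation to a regex is wanted after all, the mapping mentioned at the top of this answer (? to . and * to .*) could be sketched like this. It is only an illustration under simple assumptions: everything else in the pattern is treated literally via Regex.Escape, the whole input must match, matching is case-sensitive, and none of Git's .gitignore rules are handled. The names are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class Wildcard
{
    // Escape the whole pattern first, then turn the escaped
    // wildcards back into their regex equivalents.
    public static Regex ToRegex(string pattern)
    {
        string translated = Regex.Escape(pattern)
            .Replace(@"\*", ".*")   // * : any run of characters, possibly empty
            .Replace(@"\?", ".");   // ? : exactly one character
        return new Regex(@"\A" + translated + @"\z");
    }

    // Hypothetical convenience wrapper around ToRegex.
    public static bool IsMatch(string input, string pattern)
    {
        return ToRegex(pattern).IsMatch(input);
    }
}
```

When many files are checked against the same pattern, the Regex returned by ToRegex can be built once and reused.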

To get an exact match including all corner-cases, use
System.IO.Directory.GetFiles(myPath, myPattern)
You may have to create some temp files from your target strings first.
In other words, I think you should keep your patterns dry until it's time to meet the filesystem.

You should go with regex based approach unless your data volume is humungous or you have data-points to say regex will severely impact performance.
If that is the case, any other solution will also likely affect the performance and you will probably need to hand-roll something.

Converting * and ? to regex is quite easy.
For ? replace the "?" with "." (equivalent to ".{1}")
and for * replace the "*" with ".*" (the wildcard can also match an empty run, which ".+?" would not)
That should get you close to the behaviour of wildcard matching on Windows.
EDIT:
boolean PathMatchSpec(input, pattern) will do the job.
Private Declare Auto Function PathMatchSpec Lib "shlwapi" (ByVal pszFileParam As String, ByVal pszSpec As String) As Boolean

Can I add a regular expression into a .Net Assertion?

I'm trying to pull out page source from a set of pages and run an assertion on the results, this is a Test that runs to check that we are crawling specific pages in our site. Sometimes the results come back with a different case for the URL string, I'd like to account for that in the Assertion where I am checking page source. This is probably the wrong way to do this but I was wondering if there is a way to add in the .Net regex commands to the Assertion text. I have this as an assertion:
Assert.IsTrue(driver.PageSource.Contains("/explore"));
But is there a way to be sure that I can capture explore, Explore or EXPLORE? I thought I could use (?i) here, but that doesn't seem to work. I'm more used to Perl and its regex capabilities, but with C# and .NET I'm a little lost on where I can and can't use the inline regex commands.

Anthony's answer is valid; you don't really need regex. But if you do want to use it, you can use
Regex.IsMatch(driver.PageSource, "/explore", RegexOptions.IgnoreCase)

You don't need a regular expression to perform a case-insensitive check. Use IndexOf and compare that the result is greater than -1. IndexOf has overloads that allow you to specify if casing matters. Something like
bool containsExplore = driver.PageSource.IndexOf("/explore", StringComparison.InvariantCultureIgnoreCase) > -1;
Assert.IsTrue(containsExplore);

Try:
Regex.Match("string", "regexp", RegexOptions.IgnoreCase).Success

How about using
StringAssert.Matches(string, regex);
In your case, that would translate to
StringAssert.Matches(driver.PageSource, new Regex("/explore", RegexOptions.IgnoreCase));

Simplifying Regex's - escaping

I want to enable my users to specify the allowed characters in a given string.
So... regexes are great but too tough for my users.
My plan is to enable users to specify a list of allowed characters, for example:
a-z|A-Z|0-9|,
I can transform this into a regex which does the matching as such:
[a-zA-Z0-9,]*
However, I'm a little lost on how to deal with all the escaping. Imagine if a user specified:
a-z|A-Z|0-9| |,|||\|*|[|]|{|}|(|)
Clearly one option is to deal with every case individually, but before I write such a nasty solution: is there some nifty way to do this?
Thanks
David

Forget regex, here is a much simpler solution:
bool isInputValid = inputString.All(c => allowedChars.Contains(c));
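To make the one-liner above concrete, here is a small self-contained sketch. It assumes the allowed characters arrive as a plain string with no separators or ranges, which is not the format proposed in the question; the class and method names are hypothetical. Since no regex is built, characters such as |, \, [ or * need no escaping at all.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class AllowedCharacters
{
    // Every character of input must appear in allowed.
    // An empty input is considered valid.
    public static bool IsValid(string input, string allowed)
    {
        var allowedSet = new HashSet<char>(allowed);
        return input.All(c => allowedSet.Contains(c));
    }
}
```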

You might be right about your customers, but you could provide some introductory regex material and see how they get on - you might be surprised.
If you really need to simplify, you'll probably need to jettison the use of pipe characters too, and provide an alternative such as putting each item on a new line (in a multi-line text box, for instance).

To make it as simple as possible for your users, why don't you ditch the "|" and the concept of character ranges, e.g., "a-z", and get them just to type the complete list of characters they want to allow:
abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ01234567890 *{}()
You get the idea. I think this will be much simpler.

Regex index in matching string where the match failed

I am wondering if it is possible to extract the index position in a given string where a Regex failed when trying to match it?
For example, if my regex was "abc" and I tried to match that with "abd" the match would fail at index 2.
Edit for clarification. The reason I need this is to allow me to simplify the parsing component of my application. The application is an Assembly language teaching tool which allows students to write, compile, and execute assembly-like programs.
Currently I have a tokenizer class which converts input strings into Tokens using regex's. This works very well. For example:
The tokenizer would produce the following tokens given the following input = "INP :x:":
Token.OPCODE, Token.WHITESPACE, Token.LABEL, Token.EOL
These tokens are then analysed to ensure they conform to a syntax for a given statement. Currently this is done using IF statements and is proving cumbersome. The upside of this approach is that I can provide detailed error messages, i.e.:
if(token[2] != Token.LABEL) { throw new SyntaxError("Expected label");}
I want to use a regular expression to define a syntax instead of the annoying IF statements. But in doing so I lose the ability to return detailed error reports. I therefore would at least like to inform the user of WHERE the error occurred.

I agree with Colin Younger, I don't think it is possible with the existing Regex class. However, I think it is doable if you are willing to sweat a little:
Get the Regex class source code (e.g. http://www.codeplex.com/NetMassDownloader to download the .Net source).
Change the code to have a readonly property with the failure index.
Make sure your code uses that Regex rather than Microsoft's.

I guess such an index would only have meaning in some simple case, like in your example.
If you take a regex like "ab*c*z" (where by * I mean any character) and a string "abbbcbbcdd", which index would you be talking about?
It will depend on the algorithm used for matching...
It could fail on "abbbc..." or on "abbbcbbc..."

I don't believe it's possible, but I am intrigued why you would want it.

In order to do that you would need either callbacks embedded in the regex (which AFAIK C# doesn't support) or preferably hooks into the regex engine. Even then, it's not clear what result you would want if backtracking was involved.
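Since the engine exposes no failure position, one alternative (not a regex technique, and only a sketch) is to describe each statement's syntax as an expected token sequence and report the first position where the actual tokens deviate. The Token names come from the question; the SyntaxChecker class and its method are hypothetical, and optional or repeated tokens are not handled.

```csharp
using System;

public enum Token { OPCODE, WHITESPACE, LABEL, EOL }

public static class SyntaxChecker
{
    // Returns the index of the first token that differs from the expected
    // sequence, or -1 when both sequences are identical. If one sequence is
    // a strict prefix of the other, the index just past the shorter one is returned.
    public static int FirstMismatch(Token[] actual, Token[] expected)
    {
        int shared = Math.Min(actual.Length, expected.Length);
        for (int i = 0; i < shared; i++)
        {
            if (actual[i] != expected[i]) return i;
        }
        return actual.Length == expected.Length ? -1 : shared;
    }
}
```

The returned index can then be mapped back to the offending token to build a message such as "Expected label".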

It is not possible to tell where a regex fails, so you need to take a different approach: compare strings. Use a regex to remove all the things that could vary, and compare the result with the string that you know does not change.
I ran into the same problem, came across this question, and had to work out my own solution. Here it is:
https://stackoverflow.com/a/11730035/637142
Hope it helps.
