Looking for simple yet powerful windows wildcards (`*, ?`) matching implementation - c#

I'm looking for simple and powerful way to implement Windows flavoured * and ? wildcards matching in strings.
BeginsWith(), EndsWith() too simple to cover all cases, while translating wildcards expressions to regex'es will look to complex and I'm not sure about performance.
A happy medium wanted.
EDIT: I'm trying to parse .gitignore file and match the same files, as Git does. This means:
File should be out of repository's index (so I'm checking file's path against one stored in index)
Number of patterns in .gitignore can be large;
Number of files to check might also be large.

The equivalents of the Windows wildcards ? and * in regex are just . and .*.
[Edit] Given your new edit (stating that you're looking for actual files), I would skip the translation altogether and let .Net do the searching using Directory.GetFiles().
(note that, for some reason, passing a ? into Directory.GetFiles() matches "zero or one characters," whereas in Windows it always matches exactly one character)

To get an exact match including all corner-cases, use
System.IO.Directory.GetFiles(myPath, myPattern)
You may have to create some tempfiles form your targetstrings first.
In other words, I think you should keep your patterns dry until it's time to meet the filesytem.

You should go with regex based approach unless your data volume is humungous or you have data-points to say regex will severely impact performance.
If that is the case, any other solution will also likely affect the performance and you will probably need to hand-roll something.

Converting * and ? to regex is quite easy.
For ? replace the "?" with ".{1}"
and for * replace the "*" with ".+?"
That should get you the same behaviour as wildcard matching on windows.
EDIT:
boolean PathMatchSpec(input, pattern) will do the job.
Private Declare Auto Function PathMatchSpec Lib "shlwapi" (ByVal pszFileParam As String, ByVal pszSpec As String) As Boolean

Related

C# Regex filter problems

At this moment in time, i posted something earlier asking about the same type of question regarding Regex. It has given me headaches, i have looked up loads of documentation of how to use regex but i still could not put my finger on it. I wouldn't want to waste another 6 hours looking to filter simple (i think) expressions.
So basically what i want to do is filter all filetypes with the endings of HTML extensions (the '*' stars are from a Winforms Tabcontrol signifying that the file has been modified. I also need them in IgnoreCase:
.html, .htm, .shtml, .shtm, .xhtml
.html*, .htm*, .shtml*, .shtm*, .xhtml*
Also filtering some CSS files:
.css
.css*
And some SQL Files:
.sql, .ddl, .dml
.sql*, .ddl*, .dml*
My previous question got an answer to filtering Python files:
.py, .py, .pyi, .pyx, .pyw
Expression would be: \.py[3ixw]?\*?$
But when i tried to learn from the expression above i would always end up with opening a .xhtml only, the rest are not valid.
For the HTML expression, i currently have this: \.html|.html|.shtml|.shtm|.xhtml\*?$ with RegexOptions.IgnoreCase. But the output will only allow .xhtml case sensitive or insensitive. .html files, .htm and the rest did not match. I would really appreciate an explanation to each of the expressions you provide (so i don't have to ask the same question ever again).
Thank you.
For such cases you may start with a simple regex that can be simplified step by step down to a good regex expression:
In C# this would basically, with IgnoreCase, be
Regex myRegex = new Regex("PATTERN", RegexOptions.IgnoreCase);
Now the pattern: The most easy one is simply concatenating all valid results with OR + escaping (if possible):
\.html|\.htm|\.shtml|\.shtm|\.xhtml|\.html*|\.htm*|\.shtml*|\.shtm*|\.xhtml*
With .html* you mean .html + anything, which is written as .*(Any character, 0-infinite times) in regex.
\.html|\.htm|\.shtml|\.shtm|\.xhtml|\.html.*|\.htm.*|\.shtml.*|\.shtm.*|\.xhtml.*
Then, you may take all repeating patterns and group them together. All file endings start with a dot and may have an optional end and ending.* always contains ending:
\.(html|htm|shtml|shtm|xhtml).*
Then, I see htm pretty often, so I try to extract that. Taking all possible characters before and after htm together (? means 0 or 1 appearance):
\.(s|x)?(htm)l?.*
And, I always check if it's still working in regexstorm for .Net
That way, you may also get regular expressions for the other 2 ones and concat them all together in the end.

Regular Expression For File Name in C#

I am trying to find a regular expression to parse two sections out of the file name for the .resx files in my project. There is one main file called "UiText.resx" and then many translation .resx files with convention "UiText.ja-JP.resx". I need both the "UiText" and the "ja-JP" out of the latter string, as we do have other resx files that don't have to be for UiText (e.g. I have some files named "ExceptionText.resx").
The pattern I'm using right now (which works, it just requires a little extra coding after) is "(?<=\.)((.*?)(?=\.resx))". For the example above, "UiText.ja-JP.resx" gets me a match set in C# of "UiText.", "ja-JP.", "ja-JP.", ".resx"
Of course I am able to just take the first occurrence of "ja-JP." and "UiText." from this set and massage it to what I want, but I'd rather just have a cleaner "UiText" "ja-JP" and be done with it.
I figure I'll probably have to have at least two different patterns for this, so that is OK. Thank you in advance!
Since UiText seems to be constant you can use this regex to extract just js-JP into $1:
^UiText\.(.+?)\.resx$
https://regex101.com/r/XKvwHA/1/
If I'm understanding your needs correctly, then the main reason you need "UiText" is not because you have any value for the term itself, but rather because you need to filter your files. The real term you need to play around with is "ja-JP", which changes for the files you need.
If I'm correct, try this regex:
(?<=UiText\.).+(?=\.resx)
Used in C# as follows:
var fileName = "UiText.ja-JP.resx";
var result = new Regex(#"(?<=^UiText\.).+(?=\.resx$)").Match(fileName).Value;
A little explanation:
(?<=^UiText\.) Start of string must begin exactly with "UiText."
.+ Any number of characters (but at least one)
(?=\.resx$) End of string must end with ".resx"
Any file that doesn't meet your criteria will return an empty string for 'result'.

Simplifying Regex's - escaping

I want to enable my users to specify the allowed characters in a given string.
So... Regex's are great but too tough for my users.
my plan is to enable users to specify a list of allowed characters - for example
a-z|A-Z|0-9|,
i can transform this into a regex which does the matching as such:
[a-zA-Z0-9,]*
However i'm a little lost to deal with all the escaping - imagine if a user specified
a-z|A-Z|0-9| |,|||\|*|[|]|{|}|(|)
Clearly one option is to deal with every case individually but before i write such a nasty solution - is there some nifty way to do this?
Thanks
David
Forget regex, here is a much simpler solution:
bool isInputValid = inputString.All(c => allowedChars.Contains(c));
You might be right about your customers, but you could provide some introductory regex material and see how they get on - you might be surprised.
If you really need to simplify, you'll probably need to jetison the use of pipe characters too, and provide an alternative such as putting each item on a new line (in a multi line text box for instance).
To make it as simple as possible for your users, why don't you ditch the "|" and the concept of character ranges, e.g., "a-z", and get them just to type the complete list of characters they want to allow:
abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ01234567890 *{}()
You get the idea. I think this will be much simpler.

testing for "EndsWith" efficiently with a Regex

I need to build a Regex (.NET syntax) to determine if a string ends with a specific value. Specifically I need to test whether a file has a specific extension (or set of extensions).
The code I'm trying to fix was using:
.*\.(png|jpg|gif)$
which is hideously slow for failed matches in my scenario (presumably due to the backtracking.
Simply removing the .* (which is fine since the API only tests for matches and doesn't extract anything) at the beginning makes the regex much more efficient.
It still feels like it is pretty inefficient. Am I missing something obvious here?
Unfortunately, I don't control the API in question so I need a regex to do this even though I wouldn't normally consider regex to be the right tool for the job.
I also did some tests using the RegexOptions.RightToLeft and found that I could squeeze a little more performance out of my test case with ^.*\.(png|jpg|gif)$, but I can't find a way to specify the RightToLeft option within the string of the regex itself so I don't think I can use it.
I don't have access to C# so I can't try this... but you should be able to avoid too much backtracking by forcing the engine to find the end of the string first, then matching the extensions:
$(?<=\.(gif|png|jpg))
I'm not sure of the effect the look-behind has on performance, though.
Really, you could also just drop Regex altogether, and use String.EndsWidth, with the following :
var extensions = new String[] { ".png", ".jpg", ".gif" };
extensions.Any(ext => "something".EndsWith(ext));
I usually have the feeling that it ends up being faster to use simple string functions for cases like this rather than trying to find a clever way to use an efficient regex, in terms of runtime and/or development time, unless you are comfortable with and know what is efficient in terms of Regex.
Make it look specifically for a period instead of any character preceding the extension:
\.(png|jpg|gif)$
This will make it safer (won't match x.xgif) and it will not have to do any backtracking at all until it found a period (as opposed to backtracking on every character).
If you can change the code, why can't you use something else? You don't control the API, right, but you are changing it anyway. This I really don't understand.
Anyway, why not simply:
var AcceptedExtensions = new List<string>() { "txt", "html", "htm" };
var extension = filename.Substring(filename.LastIndexOf(".") + 1).ToLower();
return AcceptedExtensions.Contains(extension);
The IEnumerable AcceptedExtensions would be loaded from some config, the same way you load your jpg|gif|.... Or it would be a constant, whatever. You just don't need to recreate it each time you are going to use it (I doubt that this would be a bottleneck though).
You probably don't need a regular expression for this... but going with the original question:
Make sure you're using RegexOptions.Compiled to pre-compile the regular expression and then reuse your RegEx object. This avoids setting up the RegEx every time you use it, this will speed things up a lot.

Regex index in matching string where the match failed

I am wondering if it is possible to extract the index position in a given string where a Regex failed when trying to match it?
For example, if my regex was "abc" and I tried to match that with "abd" the match would fail at index 2.
Edit for clarification. The reason I need this is to allow me to simplify the parsing component of my application. The application is an Assmebly language teaching tool which allows students to write, compile, and execute assembly like programs.
Currently I have a tokenizer class which converts input strings into Tokens using regex's. This works very well. For example:
The tokenizer would produce the following tokens given the following input = "INP :x:":
Token.OPCODE, Token.WHITESPACE, Token.LABEL, Token.EOL
These tokens are then analysed to ensure they conform to a syntax for a given statement. Currently this is done using IF statements and is proving cumbersome. The upside of this approach is that I can provide detailed error messages. I.E
if(token[2] != Token.LABEL) { throw new SyntaxError("Expected label");}
I want to use a regular expression to define a syntax instead of the annoying IF statements. But in doing so I lose the ability to return detailed error reports. I therefore would at least like to inform the user of WHERE the error occurred.
I agree with Colin Younger, I don't think it is possible with the existing Regex class. However, I think it is doable if you are willing to sweat a little:
Get the Regex class source code
(e.g.
http://www.codeplex.com/NetMassDownloader
to download the .Net source).
Change the code to have a readonly
property with the failure index.
Make sure your code uses that Regex
rather than Microsoft's.
I guess such an index would only have meaning in some simple case, like in your example.
If you'll take a regex like "ab*c*z" (where by * I mean any character) and a string "abbbcbbcdd", what should be the index, you are talking about?
It will depend on the algorithm used for mathcing...
Could fail on "abbbc..." or on "abbbcbbc..."
I don't believe it's possible, but I am intrigued why you would want it.
In order to do that you would need either callbacks embedded in the regex (which AFAIK C# doesn't support) or preferably hooks into the regex engine. Even then, it's not clear what result you would want if backtracking was involved.
It is not possible to be able to tell where a regex fails. as a result you need to take a different approach. You need to compare strings. Use a regex to remove all the things that could vary and compare it with the string that you know it does not change.
I run into the same problem came up to your answer and had to work out my own solution. Here it is:
https://stackoverflow.com/a/11730035/637142
hope it helps

Categories