Regex index in matching string where the match failed

Regex index in matching string where the match failed - c#

I am wondering if it is possible to extract the index position in a given string where a Regex failed when trying to match it?
For example, if my regex was "abc" and I tried to match that with "abd" the match would fail at index 2.
Edit for clarification. The reason I need this is to allow me to simplify the parsing component of my application. The application is an Assmebly language teaching tool which allows students to write, compile, and execute assembly like programs.
Currently I have a tokenizer class which converts input strings into Tokens using regex's. This works very well. For example:
The tokenizer would produce the following tokens given the following input = "INP :x:":
Token.OPCODE, Token.WHITESPACE, Token.LABEL, Token.EOL
These tokens are then analysed to ensure they conform to a syntax for a given statement. Currently this is done using IF statements and is proving cumbersome. The upside of this approach is that I can provide detailed error messages. I.E
if(token[2] != Token.LABEL) { throw new SyntaxError("Expected label");}
I want to use a regular expression to define a syntax instead of the annoying IF statements. But in doing so I lose the ability to return detailed error reports. I therefore would at least like to inform the user of WHERE the error occurred.

I agree with Colin Younger, I don't think it is possible with the existing Regex class. However, I think it is doable if you are willing to sweat a little:
Get the Regex class source code
(e.g.
http://www.codeplex.com/NetMassDownloader
to download the .Net source).
Change the code to have a readonly
property with the failure index.
Make sure your code uses that Regex
rather than Microsoft's.

I guess such an index would only have meaning in some simple case, like in your example.
If you'll take a regex like "ab*c*z" (where by * I mean any character) and a string "abbbcbbcdd", what should be the index, you are talking about?
It will depend on the algorithm used for mathcing...
Could fail on "abbbc..." or on "abbbcbbc..."

I don't believe it's possible, but I am intrigued why you would want it.

In order to do that you would need either callbacks embedded in the regex (which AFAIK C# doesn't support) or preferably hooks into the regex engine. Even then, it's not clear what result you would want if backtracking was involved.

It is not possible to be able to tell where a regex fails. as a result you need to take a different approach. You need to compare strings. Use a regex to remove all the things that could vary and compare it with the string that you know it does not change.
I run into the same problem came up to your answer and had to work out my own solution. Here it is:
https://stackoverflow.com/a/11730035/637142
hope it helps

Related

Regexp: finding multiple matches for a single asterisked group

Suppose we have /(a*b)*/ regexp, and we would like to get
{"aab", "ab", "aaab"}
as a set of matches for the only pair of parenthesis (group) when matching with "aababaaab" string. Sure we could modify this regexp with g modifier, and run it in a loop, but there're a lot of more complicated examples, when such rewrite is unsuitable due to code blowup. .NET has such an interface called a capture.
Is there any way to do the same in JS either using standard library or external libraries?

I don't know if I understood correctly, but if you want to get the minimun expressión of your query you can use a ungreedy match:
'aababaaab'.match(/(a*b)+?/g);
The problem of this is that you will not get all the combinations, but you seem not care about that in your question.

Can I add a regular expression into a .Net Assertion?

I'm trying to pull out page source from a set of pages and run an assertion on the results, this is a Test that runs to check that we are crawling specific pages in our site. Sometimes the results come back with a different case for the URL string, I'd like to account for that in the Assertion where I am checking page source. This is probably the wrong way to do this but I was wondering if there is a way to add in the .Net regex commands to the Assertion text. I have this as an assertion:
Assert.IsTrue(driver.PageSource.Contains("/explore"));
But is there a way to be sure that I can capture explore, Explore or EXPLORE? I though I could use (?i) here but that doesn't seem to work. I'm more used to Perl and it's regex capabilities but with C# and .Net I'm a little lost on where I can and can't use the inline regex commands.

Anthonys answer is valid, you don't really need regex. But if you do want to use it, you can use
Regex.IsMatch(driver.PageSource, "/explore", RegexOptions.IgnoreCase)

You don't need a regular expression to perform a case-insensitive check. Use IndexOf and compare that the result is greater than -1. IndexOf has overloads that allow you to specify if casing matters. Something like
bool containsExplore = driver.PageSource.IndexOf("/explore", StringComparison.InvariantCultureIgnoreCase) > -1;
Assert.IsTrue(containsExplore);

Try:
RegEx.Match("string", "regexp", RegExOptions.IgnoreCase).Success

How about using
StringAssert.Matches(string, regex);
In your case, that would translate to
StringAssert.Matches("drive.PageSource", "\/explore");

Simplifying Regex's - escaping

I want to enable my users to specify the allowed characters in a given string.
So... Regex's are great but too tough for my users.
my plan is to enable users to specify a list of allowed characters - for example
a-z|A-Z|0-9|,
i can transform this into a regex which does the matching as such:
[a-zA-Z0-9,]*
However i'm a little lost to deal with all the escaping - imagine if a user specified
a-z|A-Z|0-9| |,|||\|*|[|]|{|}|(|)
Clearly one option is to deal with every case individually but before i write such a nasty solution - is there some nifty way to do this?
Thanks
David

Forget regex, here is a much simpler solution:
bool isInputValid = inputString.All(c => allowedChars.Contains(c));

You might be right about your customers, but you could provide some introductory regex material and see how they get on - you might be surprised.
If you really need to simplify, you'll probably need to jetison the use of pipe characters too, and provide an alternative such as putting each item on a new line (in a multi line text box for instance).

To make it as simple as possible for your users, why don't you ditch the "|" and the concept of character ranges, e.g., "a-z", and get them just to type the complete list of characters they want to allow:
abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ01234567890 *{}()
You get the idea. I think this will be much simpler.

testing for "EndsWith" efficiently with a Regex

I need to build a Regex (.NET syntax) to determine if a string ends with a specific value. Specifically I need to test whether a file has a specific extension (or set of extensions).
The code I'm trying to fix was using:
.*\.(png|jpg|gif)$
which is hideously slow for failed matches in my scenario (presumably due to the backtracking.
Simply removing the .* (which is fine since the API only tests for matches and doesn't extract anything) at the beginning makes the regex much more efficient.
It still feels like it is pretty inefficient. Am I missing something obvious here?
Unfortunately, I don't control the API in question so I need a regex to do this even though I wouldn't normally consider regex to be the right tool for the job.
I also did some tests using the RegexOptions.RightToLeft and found that I could squeeze a little more performance out of my test case with ^.*\.(png|jpg|gif)$, but I can't find a way to specify the RightToLeft option within the string of the regex itself so I don't think I can use it.

I don't have access to C# so I can't try this... but you should be able to avoid too much backtracking by forcing the engine to find the end of the string first, then matching the extensions:
$(?<=\.(gif|png|jpg))
I'm not sure of the effect the look-behind has on performance, though.

Really, you could also just drop Regex altogether, and use String.EndsWidth, with the following :
var extensions = new String[] { ".png", ".jpg", ".gif" };
extensions.Any(ext => "something".EndsWith(ext));
I usually have the feeling that it ends up being faster to use simple string functions for cases like this rather than trying to find a clever way to use an efficient regex, in terms of runtime and/or development time, unless you are comfortable with and know what is efficient in terms of Regex.

Make it look specifically for a period instead of any character preceding the extension:
\.(png|jpg|gif)$
This will make it safer (won't match x.xgif) and it will not have to do any backtracking at all until it found a period (as opposed to backtracking on every character).

If you can change the code, why can't you use something else? You don't control the API, right, but you are changing it anyway. This I really don't understand.
Anyway, why not simply:
var AcceptedExtensions = new List<string>() { "txt", "html", "htm" };
var extension = filename.Substring(filename.LastIndexOf(".") + 1).ToLower();
return AcceptedExtensions.Contains(extension);
The IEnumerable AcceptedExtensions would be loaded from some config, the same way you load your jpg|gif|.... Or it would be a constant, whatever. You just don't need to recreate it each time you are going to use it (I doubt that this would be a bottleneck though).

You probably don't need a regular expression for this... but going with the original question:
Make sure you're using RegexOptions.Compiled to pre-compile the regular expression and then reuse your RegEx object. This avoids setting up the RegEx every time you use it, this will speed things up a lot.

Regular expression to define format of backup filenames

In the application I am currently working on, I have an option to create automatic backups of a certain file on the hard disk. What I would like to do is offer the user the possibility to configure the name of the file and its extension.
For example, the backup filename could be something like : "backup_month_year_username.bak". I had the idea to save the format in the form of a regular expression. For the example above, the regexp would look like :
"^backup_(?<Month>\d{2})_(?<Year>\d{2})_(?<Username>\w).(?<extension>bak)$"
I thought about using regex because I will also have to browse through the directory of backuped files to delete those older than a certain date. The main trouble I have now is how to create a filename using the regex. In a way I should replace the tags with the information. I could do that using regex.replace and another regex, but I feel it's a big weird doing that and it might be a better way.
Thanks
[Edit] Maybe I wasn't really clear in the first go, but the idea is of course that the user (in this case an admin that will know regex syntax) will have the possibility to modify the form of the filename, that's all the idea behind it[/Edit]

... and if the regex changes, it is next to impossible to reconstruct a string from a given regex.
Edit:
Create some predefined "place-holders": %u could be the user's name, %y could be the year, etc.:
backup_%m_%y_%u.bak
and then simple replace the %? with their actual values.

It sounds like you're trying to use the regular expression to create the file name from a pattern which the user should be able to specify.
Regular expressions can - AFAIK - not be used to create output, but only to validate input, so you'd have the user specify two things:
a file name production pattern like Bart suggested
a validation pattern in form of a regular expression that helps you split the file names into their parts
EDIT
By the way, your sample regex contains an error: The "." is use for "any character", also \w only matches one word character, so I guess you meant to write
"^backup_(?<Month>\d{2})_(?<Year>\d{2})_(?<Username>\w+)\.(?<extension>bak)$"

If the filename is always in this form, there is no reason for a regex, as it's easier to process with string.Split ...

With Bart's solution it is easy enough to split (using string.Split) the generated file name using underscore as the delimiter, to get back the information.

Ok, I think I have found a way to use only the regex. As I am using groups to get the information, I will use another regular expression to match the regular expression and replace the groups with the value:
Regex rgx = new Regex("\(\?\<Month\>.+?\)");
rgx.Replace("^backup_(?<Month>\d{2})_(?<Year>\d{2})_(?<Username>\w+)\.(?<extension>bak)$"
, DateTime.Now.Month.ToString());
Ok, it's really a hack, but at least it works and I have only one pattern defined by the user. It might not work if the regex is too complex, but I think I can deal with that problem.
What do you think?

We Keep Coding

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.

Regex index in matching string where the match failed - c#

I don't believe it's possible, but I am intrigued why you would want it.

In order to do that you would need either callbacks embedded in the regex (which AFAIK C# doesn't support) or preferably hooks into the regex engine. Even then, it's not clear what result you would want if backtracking was involved.

Related

Regexp: finding multiple matches for a single asterisked group

Can I add a regular expression into a .Net Assertion?

Simplifying Regex's - escaping

testing for "EndsWith" efficiently with a Regex

Regular expression to define format of backup filenames

Categories

Resources