Regex.Replace replaces more than bargained for

I'm writing some test cases for IIS Rewrite rules, but my tests are not matching the same way as IIS is, leading to some false negatives.
Can anyone tell me why the following two lines lead to the same result?
Regex.Replace("v1/bids/aedd3675-a0f2-4494-a2c0-32418cf2476a", ".*v[1-9]/bids/.*", "http://localhost:9900/$0")
Regex.Replace("v1/bids/aedd3675-a0f2-4494-a2c0-32418cf2476a", "v[1-9]/bids/", "http://localhost:9900/$0")
Both return:
http://localhost:9900/v1/bids/aedd3675-a0f2-4494-a2c0-32418cf2476a
But I would expect the last regex to return:
http://localhost:9900/v1/bids/
As the GUID is not matched.
On IIS, the pattern tester yields the result below. Is {R:0} not equivalent to $0?
What I am asking is:
Given the pattern v[1-9]/bids/, how can I reproduce IIS's way of doing regex replaces so that I get the result http://localhost:9900/v1/bids/, which appears to be what IIS will rewrite to?

The point here is that both patterns match your test string starting at its very beginning.
The first regex, .*v[1-9]/bids/.*, matches zero or more characters other than a newline (as many as possible) up to the last v that is followed by a digit from 1 to 9 and then /bids/, and then zero or more characters other than a newline. Since the match starts at the beginning of the string and the trailing .* runs to its end, the whole string is matched and placed into Group 0. In the replacement, you just prepend http://localhost:9900/ to that value.
The second regex replacement returns the same result because the regex matches v1/bids/, stores it in Group 0, and replaces it with http://localhost:9900/ + v1/bids/. The rest of the string is not matched, so Regex.Replace leaves it in place right after the replacement.
You need to match that "tail" in order to remove it.
To only get the http://localhost:9900/v1/bids/, use a capturing group around the v[1-9]/bids/ and use the $1 backreference in the replacement part:
(v[1-9]/bids/).*
Replace with http://localhost:9900/$1. Result: http://localhost:9900/v1/bids/
See the regex demo
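To see the difference concretely, here is a minimal sketch (the variable names are only illustrative) that runs the second pattern from the question next to the capturing-group version:

```csharp
using System;
using System.Text.RegularExpressions;

const string input = "v1/bids/aedd3675-a0f2-4494-a2c0-32418cf2476a";
const string prefix = "http://localhost:9900/";

// Only "v1/bids/" is matched and replaced; the unmatched GUID stays in the result.
string partial = Regex.Replace(input, @"v[1-9]/bids/", prefix + "$0");

// The trailing .* consumes the GUID, and only Group 1 is put back.
string trimmed = Regex.Replace(input, @"(v[1-9]/bids/).*", prefix + "$1");

Console.WriteLine(partial);
Console.WriteLine(trimmed);
```

The first line printed still ends with the GUID, while the second one stops at v1/bids/.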
Update
IIS keeps the base URL and then appends the part you match with the regex. In your case, http://localhost:9900/ is the base URL and the regex matches v1/bids/. So, to simulate this behavior, just use Regex.Match:
var rx = Regex.Match("v1/bids/aedd3675-a0f2-4494-a2c0-32418cf2476a", "v[1-9]/bids/");
var res = rx.Success ? string.Format("http://localhost:9900/{0}", rx.Value) : string.Empty;
See the IDEONE demo

Related

How to perform a RegEx replace only if a separate filter is matched using .NET?

Given a string (a path) that matches /dir1/, I need to replace all spaces with dashes.
Ex: /dir1/path with spaces should become /dir1/path-with-spaces.
This could easily be done in 2 steps...
var rgx = new Regex(@"^/dir1/");
var path = "/dir1/path with spaces";
if (rgx.IsMatch(path))
{
    path = new Regex(@" |%20").Replace(path, "-");
}
Unfortunately for me, the application is already built with a simple RegEx replace and cannot be modified, so I need to have the RegEx do the work. I thought I had found the answer here:
regex: how to replace all occurrences of a string within another string, if the original string matches some filter
And was able to create and test (?:\G(?!^)|^(?=\/dir1\/.*$)).*?\K( |\%20), but then I learned it does not work in this app because \K is an unrecognized escape sequence (it is not supported in .NET).
I also tried a positive lookbehind, but I wasn't able to get it to replace all the spaces (only the last if the match was greedy or the first if not greedy). I could put in enough checks to handle the max number of spaces, but as soon as I check for 10 spaces, someone will pass in a path with 11 spaces.
Is there a RegEx only solution for this problem that will work in the .NET engine?

You can leverage the unlimited width lookbehind pattern in .NET:
Regex.Replace(path, @"(?<=^/dir1/.*?)(?: |%20)", "-")
See the regex demo
Regex details
(?<=^/dir1/.*?) - a positive lookbehind that matches a location that is immediately preceded with /dir1/ and then any zero or more chars other than a newline char, as few as possible
(?: |%20) - either a space or %20 substring.
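As a quick check of the lookbehind approach, the sketch below applies the pattern to a few made-up paths (the sample inputs and variable names are assumptions for illustration only):

```csharp
using System;
using System.Text.RegularExpressions;

// A space or %20 is replaced only when /dir1/ sits at the very start of the string.
var rx = new Regex(@"(?<=^/dir1/.*?)(?: |%20)");

string plain = rx.Replace("/dir1/path with spaces", "-");
string encoded = rx.Replace("/dir1/path%20with spaces", "-");
string other = rx.Replace("/dir2/path with spaces", "-");

Console.WriteLine(plain);   // /dir1/path-with-spaces
Console.WriteLine(encoded); // /dir1/path-with-spaces
Console.WriteLine(other);   // /dir2/path with spaces (left unchanged)
```

Every space is replaced in a single Regex.Replace call because the lookbehind is re-evaluated at each candidate position and does not consume any text.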

Regex or Substring to Match Filename With Extension

I have a current situation where I can be given a filename with path that looks like:
C:\\Users\\testUser\\Documents\\testFile.txt.9043632d298f44ad88509c677a8249f8
or
C:\\Users\\testUser\\Documents\\testFile.txt.9043632d298f44ad88509c677a8249f8.enc
I need to be able to extract everything up until the end of the extension (can be any file extension, will always have guid string preceded by a . after the extension)
So an example output would be:
C:\\Users\\testUser\\Documents\\testFile.txt
C:\\Users\\testUser\\Documents\\testFile.pdf
C:\\Users\\testUser\\Documents\\testFile.jpeg
I have tried substrings but cannot seem to get the proper input (though I assume it is a simple task). I can never seem to get the proper result.
An example I tried was along the lines of:
filename.Substring(0,filename.Indexof('.', //what goes here??));
but keep getting stuck.
Any help would be lovely!

You might use:
new Regex(@".*(?=\.[a-f\d]{32})", RegexOptions.IgnoreCase).Match(yourString)
Explanation:
.* match zero or more of any char (other than a newline), as many as possible
(?= ) look ahead, check if the following chars match, but don't include in match
\. match a dot
[a-f\d]{32} match any character a-f or digit exactly 32 times
RegexOptions.IgnoreCase ignores the case
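Here is a small sketch that runs this pattern against the two sample paths from the question; the variable names are illustrative, and the paths are written as verbatim strings with single backslashes:

```csharp
using System;
using System.Text.RegularExpressions;

var rx = new Regex(@".*(?=\.[a-f\d]{32})", RegexOptions.IgnoreCase);

const string withGuid = @"C:\Users\testUser\Documents\testFile.txt.9043632d298f44ad88509c677a8249f8";
const string withEnc = withGuid + ".enc";

// The greedy .* backtracks until a dot followed by 32 hex characters comes next.
string first = rx.Match(withGuid).Value;
string second = rx.Match(withEnc).Value;

Console.WriteLine(first);
Console.WriteLine(second);
```

Both calls should yield the path up to and including .txt, since the lookahead does not care what follows the 32 hex characters.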

Regular expression in RegularExpressionAttribute behavior

I am using this regular expression: @"[ \]\[;\/\\\?:*""<>|+=]|^[.]|[.]$"
First part [ \]\[;\/\\\?:*""<>|+=] should match any of the characters inside the brackets.
Next part ^[.] should match if the string starts with a 'dot'
Last part [.]$ should match if the string ends with a 'dot'
This works perfectly fine if I use Regex.IsMatch() function. However if I use RegularExpressionAttribute in ASP.NET MVC, I always get invalid model. Does anyone have any clue why this behavior occurs?
Examples:
"abcdefg" should not match
".abcdefg" should match
"abc.defg" should not match
"abcdefg." should match
"abc[defg" should match
Thanks in advance!
EDIT:
The RegularExpressionAttribute "specifies that a data field value in ASP.NET Dynamic Data must match the specified regular expression".
Which means I need "abcdefg" to match and ".abcdefg" to not match. Basically, I need to negate the whole expression I have above.

You need to make sure the pattern matches the entire string.
In a general case, you may append/prepend the pattern with .*.
Here, you may use
.*[ \][;/\\?:*"<>|+=].*|^[.].*|.*[.]$
Or, to make it a bit more efficient (that is, to reduce backtracking in the first branch) a negated character class will perform better:
[^ \][;/\\?:*"<>|+=]*[ \][;\/\\?:*"<>|+=].*|^[.].*|.*[.]$
But it is best to put the branches matching text at the start/end of the string as first branches:
^[.].*|.*[.]$|[^ \][;/\\?:*"<>|+=]*[ \][;/\\?:*"<>|+=].*
NOTE: You do not have to escape / and ? chars inside the .NET regex since you can't use regex delimiters there.
C# declaration of the last pattern will look like
@"^[.].*|.*[.]$|[^ \][;/\\?:*""<>|+=]*[ \][;/\\?:*""<>|+=].*"
See this .NET regex demo.
RegularExpressionAttribute:
[RegularExpression(
    @"^[.].*|.*[.]$|[^ \][;/\\?:*""<>|+=]*[ \][;/\\?:*""<>|+=].*",
    ErrorMessage = "Username cannot contain following characters: ] [ ; / \\ ? : * \" < > | + =")
]
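The difference between Regex.IsMatch and a whole-string check can be sketched as follows. MatchesWholeString is a hypothetical helper, not part of any library; it assumes the attribute accepts a value only when the first match starts at index 0 and spans the entire input:

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: true only if the first match covers the entire value.
static bool MatchesWholeString(string pattern, string value)
{
    Match m = Regex.Match(value, pattern);
    return m.Success && m.Index == 0 && m.Length == value.Length;
}

const string original = @"[ \]\[;\/\\\?:*""<>|+=]|^[.]|[.]$";
const string expanded = @"^[.].*|.*[.]$|[^ \][;/\\?:*""<>|+=]*[ \][;/\\?:*""<>|+=].*";

// The original pattern finds the "[" but matches only that single character.
bool foundSomewhere = Regex.IsMatch("abc[defg", original);
bool originalWhole = MatchesWholeString(original, "abc[defg");

// The expanded pattern consumes the whole string whenever a rule is violated.
bool bracket = MatchesWholeString(expanded, "abc[defg");
bool dotStart = MatchesWholeString(expanded, ".abcdefg");
bool dotEnd = MatchesWholeString(expanded, "abcdefg.");
bool clean = MatchesWholeString(expanded, "abcdefg");

Console.WriteLine($"{foundSomewhere} {originalWhole} {bracket} {dotStart} {dotEnd} {clean}");
```

Under that assumption, only the expanded pattern reports the offending inputs as whole-string matches, while a clean value such as abcdefg is not matched at all.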

Your regex is an alternation of three branches, each matching a single character: the first matches one character from a character class, the second a dot at the start of the string, and the third a dot at the end of the string.
With Regex.IsMatch it works fine because one of the branches does match somewhere in the string, but the match does not span the whole string you want to match.
You could use three branches instead. The first matches a dot followed by the character class repeated until the end of the string, and the second is the other way around, with the dot at the end of the string.
The third uses a positive lookahead asserting that the string contains at least one of the characters [\][;\/\\?:*"<>|+=]
^\.[a-z \][;\/\\?:*"<>|+=]+$|^[a-z \][;\/\\?:*"<>|+=]+\.$|^(?=.*[\][;\/\\?:*"<>|+=])[a-z \][;\/\\?:*"<>|+=]+$
Regex demo

Removing commas from numbers with .NET regex

So I'm processing a report that (brilliantly, really) spits out number values with commas in them, in a .csv output. Super useful.
So, I'm using (C#)regex lookahead positive and lookbehind positive expressions to remove commas that have digits on both sides.
If I use only the lookahead, it seems to work. However when I add the lookbehind as well, the expression breaks down and removes nothing. Both ends of the comma can have arbitrary numbers of digits around them, so I just want to remove the comma if the pattern has one or more digits around it.
Here's the expression that works with the lookahead only:
str = Regex.Replace(str, @"[,](?=(\d+))", "");
Here's the expression that doesn't work as I intend it:
str = Regex.Replace(str, @"[,](?=(\d+)?<=(\d+))", "");
What's wrong with my regex? If I had to guess, there's something I'm misunderstanding about how lookbehind works. Any ideas?

You may use any of the solutions below:
var s = "abc,def,2,100,xyz!,:))))";
Console.WriteLine(Regex.Replace(s, @"(\d),(\d)", "$1$2")); // Does not handle 1,2,3,4 cases
Console.WriteLine(Regex.Replace(s, @"(\d),(?=\d)", "$1")); // Handles consecutive matches with capturing group+backreference/lookahead
Console.WriteLine(Regex.Replace(s, @"(?<=\d),(?=\d)", "")); // Handles consecutive matches with lookbehind/lookahead, the most efficient way
Console.WriteLine(Regex.Replace(s, @",(?<=\d,)(?=\d)", "")); // Also handles all cases
See the C# demo.
Explanations:
(\d),(\d) - matches and captures single digits on both sides of , and $1$2 are replacement backreferences that insert captured texts back into the result
(\d),(?=\d) - matches and captures a digit before ,, then a comma is matched and then a positive lookahead (?=\d) requires a digit after ,, but since it is not consumed, only $1 is required in the replacement pattern
(?<=\d),(?=\d) - only such a comma is matched that is enclosed with digits without consuming the digits ((?<=\d) is a positive lookbehind that requires its pattern match immediately to the left of the current location)
,(?<=\d,)(?=\d) - matches a comma and only after matching it, the regex engine checks if there is a digit and a comma immediately before the current location (that is, the location right after the comma), and if the check is true, the next char is checked for a digit. If it is a digit, a match is returned.
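The remark about 1,2,3,4 is easiest to see on that exact input. The following sketch (variable names are illustrative) compares the consuming pattern with the lookaround-based ones:

```csharp
using System;
using System.Text.RegularExpressions;

const string digits = "1,2,3,4";

// Both digits are consumed, so the "2" cannot also start the next match.
string consuming = Regex.Replace(digits, @"(\d),(\d)", "$1$2");

// The digit after the comma is only asserted, so it stays available.
string lookahead = Regex.Replace(digits, @"(\d),(?=\d)", "$1");

// Neither neighbouring digit is consumed.
string lookaround = Regex.Replace(digits, @"(?<=\d),(?=\d)", "");

// On the sample string from above, only the comma inside "2,100" qualifies.
string sample = Regex.Replace("abc,def,2,100,xyz!,:))))", @"(?<=\d),(?=\d)", "");

Console.WriteLine(consuming);  // 12,34
Console.WriteLine(lookahead);  // 1234
Console.WriteLine(lookaround); // 1234
Console.WriteLine(sample);     // abc,def,2100,xyz!,:))))
```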
Bonus:
You may just match a pattern like yours with \d,\d and pass a MatchEvaluator callback to Regex.Replace, where you may manipulate the match further:
Console.WriteLine(Regex.Replace(s, @"\d,\d", m => m.Value.Replace(",",string.Empty))); // Callback method
Here, m is the match object and m.Value holds the whole match value. With .Replace(",",string.Empty), you remove all commas from the match value.

You can always check a website that evaluates regex expressions.
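I think this code might be able to help you. Note that the lookbehind has to come before the comma: placed after it, the lookbehind would inspect the comma itself and never find a digit there.
str = Regex.Replace(str, @"(?<=\d)[,](?=\d)", "");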

How to use regex to match anything from A to B, where B is not preceded by C

I'm having a hard time with this one. First off, here is the difficult part of the string I'm matching against:
"a \"b\" c"
What I want to extract from this is the following:
a \"b\" c
Of course, this is just a substring from a larger string, but everything else works as expected. The problem is making the regex ignore the quotes that are escaped with a backslash.
I've looked into various ways of doing it, but nothing has gotten me the correct results. My most recent attempt looks like this:
"((\"|[^"])+?)"
In various tests online, this works the way it should - but when I build my ASP.NET page, it cuts off at the first ", leaving me with just the letter a, a white space and a backslash.
The logic behind the pattern above is to capture all instances of \" or something that is not ". I was hoping this would search for \", making sure to find those first - but I got the feeling that this is overridden by the second part of the expression, which is only 1 single character. A single backslash does not match 2 characters (\"), but it will match as a non-". And from there, the next character will be a single ", and the matching is completed. (This is just my hypothesis on why my pattern is failing.)
Any pointers on this one? I have tried various combinations with "look"-methods in regex, but I didn't really get anywhere. I also get the feeling that is what I need.

ORIGINAL ANSWER
To match a string like a \"b\" c, you need to use the following regex declaration:
(?:\\"|[^"])+
var rx = new Regex(@"(?:\\""|[^""])+");
See RegexStorm demo
Here is an IDEONE demo:
var str = "a \\\"b\\\" c";
Console.WriteLine(str);
var rx = new Regex(@"(?:\\""|[^""])+");
Console.WriteLine(rx.Match(str).Value);
Please note the @ in front of the string literal: it makes it a verbatim string literal, where we have to double the quotes to match literal quotes but can use single backslashes instead of doubled ones. This makes regexps easier to read and maintain.
If you want to match any escaped entities in your input string, you can use:
var rx = new Regex(@"[^""\\]*(?:\\.[^""\\]*)*");
See demo on RegexStorm
UPDATE
To match the quoted strings, just add quotes around the pattern:
var rx = new Regex(@"""(?<res>[^""\\]*(?:\\.[^""\\]*)*)""");
This pattern yields much better performance than Tim Long's suggested regex, according to a RegexHero test.
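A minimal sketch of the quoted-string pattern in use; the input text is made up for illustration, and the named group res holds the contents without the surrounding quotes:

```csharp
using System;
using System.Text.RegularExpressions;

var rx = new Regex(@"""(?<res>[^""\\]*(?:\\.[^""\\]*)*)""");

// The text being searched is: say "a \"b\" c" now
const string input = "say \"a \\\"b\\\" c\" now";

Match m = rx.Match(input);

// Escaped quotes are consumed by \\. and therefore never end the match.
string inner = m.Groups["res"].Value;

Console.WriteLine(inner); // a \"b\" c
```

Each backslash is consumed together with the character after it, so an escaped quote can never be mistaken for the closing one.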

The following expression worked for me:
"(?<Result>(\\"|.)*)"
The expression matches as follows:
An opening quote (literal ")
A named capture (?<name>pattern) consisting of:
Zero or more occurrences * of literal \" or (|) any single character (.)
A final closing quote (literal ")
Note that the * (zero or more) quantifier is greedy: it first consumes everything up to the end of the input, and the engine then backtracks so that the final quote is matched by the literal " and not by the "any single character" . part.
I used ReSharper 9's built-in Regular Expression validator to develop the expression and verify the results.
I have used the "Explicit Capture" option to reduce cruft in the output (RegexOptions.ExplicitCapture).
One thing to note is that I am matching the whole string, but I am only capturing the substring, using a named capture. Using named captures is a really useful way to get at the results you want. In code, it might look something like this:
static string MatchQuotedString(string input)
{
    const string pattern = @"""(?<Result>(\\""|.)*)""";
    const RegexOptions options = RegexOptions.ExplicitCapture;
    Regex regex = new Regex(pattern, options);
    var match = regex.Match(input);
    var substring = match.Groups["Result"].Value;
    return substring;
}
Optimization: If you are planning on using the regex a lot, you could factor it out into a field and use the RegexOptions.Compiled option. This pre-compiles the expression and gives you faster throughput at the expense of longer initialization.
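One thing worth checking with this pattern is an input that holds more than one quoted string. The * in (\\"|.)* is greedy, so the capture extends to the last quote in the input. The lazy *? variant below is an assumption added for comparison and is not part of the answer above:

```csharp
using System;
using System.Text.RegularExpressions;

// The text being searched is: "a" and "b"
const string input = "\"a\" and \"b\"";

// Greedy: runs to the end of the input, then backtracks to the LAST quote.
string greedy = Regex.Match(input, @"""(?<Result>(\\""|.)*)""",
    RegexOptions.ExplicitCapture).Groups["Result"].Value;

// Lazy (assumed variant): stops at the first quote that can close the string.
string lazy = Regex.Match(input, @"""(?<Result>(\\""|.)*?)""",
    RegexOptions.ExplicitCapture).Groups["Result"].Value;

// The lazy variant still steps over escaped quotes, as \\" is tried before the dot.
string lazyEscaped = Regex.Match("\"a \\\"b\\\" c\"", @"""(?<Result>(\\""|.)*?)""",
    RegexOptions.ExplicitCapture).Groups["Result"].Value;

Console.WriteLine(greedy);      // a" and "b
Console.WriteLine(lazy);        // a
Console.WriteLine(lazyEscaped); // a \"b\" c
```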
