Removing commas from numbers with .NET regex - c#

So I'm processing a report that (brilliantly, really) spits out number values with commas in them, in a .csv output. Super useful.
So, I'm using C# regex positive lookahead and positive lookbehind expressions to remove commas that have digits on both sides.
If I use only the lookahead, it seems to work. However, when I add the lookbehind as well, the expression breaks down and removes nothing. Either side of the comma can have an arbitrary number of digits, so I just want to remove the comma if it has one or more digits on each side.
Here's the expression that works with the lookahead only:
str = Regex.Replace(str, @"[,](?=(\d+))", "");
Here's the expression that doesn't work as I intend it:
str = Regex.Replace(str, @"[,](?=(\d+)?<=(\d+))", "");
What's wrong with my regex? If I had to guess, there's something I'm misunderstanding about how lookbehind works. Any ideas?

You may use any of the solutions below:
var s = "abc,def,2,100,xyz!,:))))";
Console.WriteLine(Regex.Replace(s, @"(\d),(\d)", "$1$2")); // Does not handle 1,2,3,4 cases
Console.WriteLine(Regex.Replace(s, @"(\d),(?=\d)", "$1")); // Handles consecutive matches with capturing group+backreference/lookahead
Console.WriteLine(Regex.Replace(s, @"(?<=\d),(?=\d)", "")); // Handles consecutive matches with lookbehind/lookahead, the most efficient way
Console.WriteLine(Regex.Replace(s, @",(?<=\d,)(?=\d)", "")); // Also handles all cases
See the C# demo.
Explanations:
(\d),(\d) - matches and captures a single digit on each side of the comma; $1$2 in the replacement are backreferences that insert the captured digits back into the result
(\d),(?=\d) - matches and captures a digit before the comma, then matches the comma, and then a positive lookahead (?=\d) requires a digit after it. Since that digit is not consumed, only $1 is required in the replacement pattern
(?<=\d),(?=\d) - matches only a comma that is enclosed by digits, without consuming the digits ((?<=\d) is a positive lookbehind that requires its pattern to match immediately to the left of the current location)
,(?<=\d,)(?=\d) - matches a comma and, only after matching it, the regex engine checks whether there is a digit and a comma immediately before the current location (which is after the comma). If that check succeeds, the next char is checked for a digit. If it is a digit, a match is returned.
Bonus:
You may just match a pattern like yours with \d,\d and pass the match to the MatchEvaluator method where you may manipulate the match further:
Console.WriteLine(Regex.Replace(s, @"\d,\d", m => m.Value.Replace(",", string.Empty))); // Callback method
Here, m is the match object and m.Value holds the whole match value. With .Replace(",",string.Empty), you remove all commas from the match value.
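The difference between the consuming pattern and the lookaround pattern above is easiest to see on consecutive separators. The following is a minimal sketch, not part of the original answer: the class name CommaDemo, both helper names, and the input "1,2,3,4" are assumptions made for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CommaDemo
{
    // Hypothetical helper: both digits are consumed by each match, so a digit
    // cannot be the right side of one comma and the left side of the next.
    public static string StripWithTwoGroups(string input) =>
        Regex.Replace(input, @"(\d),(\d)", "$1$2");

    // Hypothetical helper: the digits are only asserted, never consumed,
    // so every comma between two digits is matched on its own.
    public static string StripWithLookarounds(string input) =>
        Regex.Replace(input, @"(?<=\d),(?=\d)", "");

    public static void Main()
    {
        Console.WriteLine(StripWithTwoGroups("1,2,3,4"));   // 12,34
        Console.WriteLine(StripWithLookarounds("1,2,3,4")); // 1234
        Console.WriteLine(StripWithLookarounds("abc,def,2,100,xyz!,:))))"));
    }
}
```

Commas next to letters or punctuation are left alone by both helpers, because neither side of those commas satisfies \d.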

You can always check a website that evaluates regex expressions.
I think this code might be able to help you:
str = Regex.Replace(str, @"(?<=(\d))[,](?=(\d+))", "");
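As a side note on why the second pattern in the question removes nothing: inside the lookahead, (\d+)? is read as an optional group and <= as two literal characters, so the assertion only succeeds when the comma is followed by optional digits, the text "<=", and more digits. The sketch below checks this under assumed inputs; the class name BrokenLookaheadDemo and the Apply helper are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BrokenLookaheadDemo
{
    // The pattern from the question, unchanged.
    public const string QuestionPattern = @"[,](?=(\d+)?<=(\d+))";

    // Hypothetical helper that applies the question's replacement.
    public static string Apply(string input) =>
        Regex.Replace(input, QuestionPattern, "");

    public static void Main()
    {
        Console.WriteLine(Apply("1,234"));    // 1,234 (nothing is removed)
        Console.WriteLine(Apply("x,12<=34")); // x12<=34 (digits, a literal "<=", digits)
    }
}
```

A lookbehind has to be its own group, (?<=...), placed where the engine should look to the left; it cannot be nested into the lookahead by dropping its parentheses.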

Related

Regex.Replace replaces more than bargained for

I'm writing some test cases for IIS Rewrite rules, but my tests are not matching the same way as IIS is, leading to some false negatives.
Can anyone tell me why the following two lines leads to the same result?
Regex.Replace("v1/bids/aedd3675-a0f2-4494-a2c0-32418cf2476a", ".*v[1-9]/bids/.*", "http://localhost:9900/$0")
Regex.Replace("v1/bids/aedd3675-a0f2-4494-a2c0-32418cf2476a", "v[1-9]/bids/", "http://localhost:9900/$0")
Both return:
http://localhost:9900/v1/bids/aedd3675-a0f2-4494-a2c0-32418cf2476a
But I would expect the last regex to return:
http://localhost:9900/v1/bids/
As the GUID is not matched.
On IIS, the pattern tester yields the result below. Is {R:0} not equivalent to $0?
What I am asking is:
Given the pattern v[1-9]/bids/, how can I match IIS' way of doing Regex replaces so that I get the result http://localhost:9900/v1/bids/, which appears to be what IIS will rewrite to?
The point here is that the pattern you have matches the test strings at the start.
The first .*v[1-9]/bids/.* regex matches 0+ characters other than a newline (as many as possible) up to the last v followed by a digit (other than 0) and then /bids/, and then 0+ characters other than a newline. Since the match starts at the beginning of the string, the whole string is matched and placed into Group 0. In the replacement, you just prepend http://localhost:9900/ to that value.
The second regex replacement returns the same result because the regex matches v1/bids/, stores it in Group 0, and replaces it with http://localhost:9900/ + v1/bids/. What remains is just appended to the replacement result as it does not match.
You need to match that "tail" in order to remove it.
To only get the http://localhost:9900/v1/bids/, use a capturing group around the v[1-9]/bids/ and use the $1 backreference in the replacement part:
(v[1-9]/bids/).*
Replace with http://localhost:9900/$1. Result: http://localhost:9900/v1/bids/
See the regex demo
Update
IIS keeps the base URL and then adds the parts you match with the regex. So, in your case, you have http://localhost:9900/ as the base URL and then you match v1/bids/ with the regex. So, to simulate this behavior, just use Regex.Match:
var rx = Regex.Match("v1/bids/aedd3675-a0f2-4494-a2c0-32418cf2476a", "v[1-9]/bids/");
var res = rx.Success ? string.Format("http://localhost:9900/{0}", rx.Value) : string.Empty;
See the IDEONE demo
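The three approaches discussed in this thread can be compared side by side. This is a minimal sketch under stated assumptions: the class name RewriteDemo, the BaseUrl constant, and the helper names are hypothetical, and it only imitates the observed result; it does not reproduce how IIS evaluates rewrite rules internally.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RewriteDemo
{
    // Base URL taken from the question.
    const string BaseUrl = "http://localhost:9900/";

    // Regex.Replace substitutes only the matched part; the unmatched tail stays.
    public static string ReplaceOnly(string path) =>
        Regex.Replace(path, "v[1-9]/bids/", BaseUrl + "$0");

    // Consuming the tail with .* and re-inserting only group 1 drops it.
    public static string ReplaceAndDropTail(string path) =>
        Regex.Replace(path, "(v[1-9]/bids/).*", BaseUrl + "$1");

    // Regex.Match looks only at the match itself and ignores the rest.
    public static string MatchOnly(string path)
    {
        Match m = Regex.Match(path, "v[1-9]/bids/");
        return m.Success ? BaseUrl + m.Value : string.Empty;
    }

    public static void Main()
    {
        const string path = "v1/bids/aedd3675-a0f2-4494-a2c0-32418cf2476a";
        Console.WriteLine(ReplaceOnly(path));
        Console.WriteLine(ReplaceAndDropTail(path));
        Console.WriteLine(MatchOnly(path));
    }
}
```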

C# Regex match on special characters

I know this stuff has been talked about a lot, but I'm having a problem trying to match the following...
Example input: "test test 310-315"
I need a regex expression that recognizes a number followed by a dash, and returns 310. How do I include the dash in the regex expression, though? The final match result would be: "310".
Thanks a lot - kcross
EDIT: Also, how would I do the same thing but with the dash preceding, taking into account that the number following the dash could be negative? I didn't think of this when I wrote the question. For example: "test test 310--315" returns -315 and "test 310-315" returns 315.
Regex regex = new Regex(@"\d+(?=\-)");
\d+ - Looks for one or more digits
(?=\-) - Makes sure it is followed by a dash
The @ just eliminates the need to escape the backslashes to keep the compiler happy.
Also, you may want this instead:
\d+(?=\-\d+)
This will check for one or more digits, followed by a dash, followed by one or more digits, but only match the first set.
In response to your comment, here's a regex that will check for a number following a -, while accounting for potential negative (-) numbers:
Regex regex = new Regex(@"(?<=\-)\-?\d+");
(?<=\-) - Positive lookbehind which will check and make sure there is a preceding -
\-? - Checks for either zero or one dashes
\d+ - One or more digits
(?'number'\d+)- will work (no need to escape). In this example the group containing the single number is the named group 'number'.
If you want to match both groups with an optional sign, try:
@"(?'first'-?\d+)-(?'second'-?\d+)"
See it working here.
To describe it: nothing complicated, just -? to match an optional - and \d+ to match one or more digits. A literal - matches itself.
here's some documentation that I use:
http://www.mikesdotnetting.com/Article/46/CSharp-Regular-Expressions-Cheat-Sheet
In the comments section of that page, it suggests escaping the dash with '\-'.
Make sure you escape your escape character \
A hyphen only has a special meaning in regex (a range) inside a character class such as [a-z], and there you would escape it with a backslash (\). Outside a character class the escape is optional. Since the backslash is also the escape character in regular C# string literals, each backslash has to be doubled there. So essentially it would be "\\d+\\-" (or @"\d+\-" as a verbatim string).
\b\d*(?=\-) you will want to look ahead for the dash
\b = start at a word boundary
\d = match any decimal digit
* = match the previous as many times as needed
(?=\-) = look ahead for the dash
Edited for Formatting issue with the slash not showing after posting
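The lookahead and lookbehind patterns from this thread can be tried on the inputs given in the question. A minimal sketch follows; the class name DashDemo and both helper names are assumptions introduced for illustration, and the dash is left unescaped since it sits outside a character class.

```csharp
using System;
using System.Text.RegularExpressions;

public static class DashDemo
{
    // Digits that are immediately followed by a dash (the dash is not part of the match).
    public static string NumberBeforeDash(string input) =>
        Regex.Match(input, @"\d+(?=-)").Value;

    // An optionally negative number that is immediately preceded by a dash.
    public static string NumberAfterDash(string input) =>
        Regex.Match(input, @"(?<=-)-?\d+").Value;

    public static void Main()
    {
        Console.WriteLine(NumberBeforeDash("test test 310-315")); // 310
        Console.WriteLine(NumberAfterDash("test test 310--315")); // -315
        Console.WriteLine(NumberAfterDash("test 310-315"));       // 315
    }
}
```

Match.Value is an empty string when nothing matches, so callers that need to tell "no number" apart from a match may prefer to check Match.Success first.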

c# Regex capturing repeated keyword values

I'm trying to capture the value of a keyword that is delimited by either another keyword or the end of the line. The keywords may be repeated, appear in any order, or have no data to capture:
Keywords:
K1,K2
Input data:
somedatahereornotk1capturethis1k2capturethis2k2capturethis3k1k2
I want the captured data to be
1. capturethis1
2. capturethis2
3. capturethis3
4.
5.
I've tried k1|k2(?<Data>.*?)k1|k2, but the captured data is always empty.
Thanks!
First, be aware that the alternation operator | has low precedence, so
k1|k2(?<Data>.*?)k1|k2
is actually looking for k1 or k2(?<Data>.*?)k1 or k2. Use grouping:
(?:k1|k2)(?<Data>.*?)(?:k1|k2)
Second, consider using the zero-width lookahead and lookbehind assertions:
(?<=k1|k2)(?<Data>.*?)(?=k1|k2)
You are on the right track with the alternations. The missing piece is to use look-behind and look-ahead to assert that something must be preceded and followed by the delimiters.
(?<=k1|k2)(?<Data>.*?)(?=k1|k2)
Lookbehind (?<=…) and lookahead (?=…) are zero-width assertions, so they must be satisfied but do not become part of the match.
Your desire to capture instances of consecutive delimiters is a bit trickier, because you can't really capture "nothing" -- the space between two characters. One approach would be to capture the lookbehind (or lookahead):
(?<=(?<Delimiter>k1|k2))(?<Data>.*?)(?=k1|k2)
This will yield 4 results instead of 3, because it will include the consecutive k1k2 at the end of your sample data. You'll just have to ignore the extra data for each match (k1,k2,k2,k1).
string s = "somedatahereornotk1capturethis1k2capturethis2k2capturethis3k1k2";
Regex r = new Regex("(?<=k1|k2).*?(?=k1|k2)");
foreach (Match m in r.Matches(s))
    Console.WriteLine(m.Value);
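The lookaround patterns above stop short of the final keyword, because nothing follows it to satisfy the lookahead. A different technique, not one proposed in the answers, is to split on the keywords with Regex.Split and discard whatever precedes the first keyword. The sketch below assumes that approach; the class name KeywordDemo and the helper ValuesAfterKeywords are hypothetical.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class KeywordDemo
{
    // Splits on either keyword. Regex.Split keeps empty strings between
    // adjacent delimiters and after a trailing delimiter, so keywords
    // without data show up as empty entries. Skip(1) drops the text
    // that comes before the first keyword.
    public static string[] ValuesAfterKeywords(string input) =>
        Regex.Split(input, "k1|k2").Skip(1).ToArray();

    public static void Main()
    {
        string s = "somedatahereornotk1capturethis1k2capturethis2k2capturethis3k1k2";
        string[] values = ValuesAfterKeywords(s);
        for (int i = 0; i < values.Length; i++)
            Console.WriteLine($"{i + 1}. {values[i]}");
    }
}
```

For the sample input this yields five entries, the last two empty. The trade-off is that the split result no longer records which keyword preceded each value; capturing the delimiter, as in the named-group pattern above, keeps that information.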

.NET Regex: negate previous character for the first character in string

Consider following string
"Some" string with "quotes" and \"pre-slashed\" quotes
Using regex, I want to find all the double quotes with no slash before them. So I want the regex to find four matches for the example sentence.
This....
[^\\]"
...would find only three of them. I suppose that's because of the regex's state machine which is first validating the command to negate the presence of the slash.
That means I need to write a regex with some kind of look-behind, but I don't know how to work with these lookaheads and lookbehinds... I'm not even sure that's what I'm looking for.
The following attempt returns 6, not 4 matches...
"(?<!\\)
"(?<!\\")
Is what you're looking for
If you want to match "Some" and "quotes", then
(?<!\\")(?!\\")"[a-zA-Z0-9]*"
will do
Explanation:
(?<!\\") - Negative lookbehind. Specifies a group that can not match before your main expression
(?!\\") - Negative lookahead. Specifies a group that can not match after your main expression
"[a-zA-Z0-9]*" - String to match between regular quotes
Which means - match anything that doesn't come with \" before and \" after, but is contained inside double quotes
You almost got it, move the quote after the lookbehind, like:
(?<!\\)"
Also beware of cases like
"escaped" backslash \\"string\"
You can use an expression like this to handle those:
(?<!\\)(?:\\\\)*"
Try this
(?<!\\)(?<qs>"[^"]+")
Explanation
<!--
(?<!\\)(?<qs>"[^"]+")
Options: case insensitive
Assert that it is impossible to match the regex below with the match ending at this position (negative lookbehind) «(?<!\\)»
Match the character “\” literally «\\»
Match the regular expression below and capture its match into backreference with name “qs” «(?<qs>"[^"]+")»
Match the character “"” literally «"»
Match any character that is NOT a “"” «[^"]+»
Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
Match the character “"” literally «"»
-->
code
try {
if (Regex.IsMatch(subjectString, @"(?<!\\)(?<qs>""[^""]+"")", RegexOptions.IgnoreCase)) {
// Successful match
} else {
// Match attempt failed
}
} catch (ArgumentException ex) {
// Syntax error in the regular expression
}
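The match counts mentioned in this thread can be checked by running the question's two attempts and the corrected lookbehind against the sample sentence. This is a minimal sketch; the class name QuoteDemo, the Sample constant, and the Count helper are assumptions for illustration. In the verbatim strings, each "" stands for one literal double quote.

```csharp
using System;
using System.Text.RegularExpressions;

public static class QuoteDemo
{
    // The sample sentence from the question.
    public const string Sample =
        @"""Some"" string with ""quotes"" and \""pre-slashed\"" quotes";

    // Hypothetical helper: number of matches of a pattern in the sample.
    public static int Count(string pattern) =>
        Regex.Matches(Sample, pattern).Count;

    public static void Main()
    {
        // Needs a real character before the quote, so the quote at index 0 is missed.
        Console.WriteLine(Count(@"[^\\]"""));   // 3
        // Lookbehind placed after the quote inspects the quote itself, so it always passes.
        Console.WriteLine(Count(@"""(?<!\\)")); // 6
        // Lookbehind placed before the quote inspects the preceding character.
        Console.WriteLine(Count(@"(?<!\\)""")); // 4
    }
}
```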

How to suppress whitespace regex and evenly space during Split

I'm trying to create a regex to tokenize a string. An example of the string is
ExeScript.ps1 $1, $0,%1, %3
Or it may be carelessly typed in as . ExeScript.ps1 $1, $0,%1, %3
I use a simple regex string.
Regex RE = new Regex(@"[\s,\,]");
return (RE.Split(ActionItem));
I get a whole bunch of zeros between the script1.ps1 and the $1.
When it is evenly spaced, as in the first example, I get no space between the script1.ps1 and the $1.
What am I doing wrong? How do I suppress whitespace and ensure each array cell has a value in it, rather than the whitespace it holds now?
Bob.
Try this regex:
Regex RE = new Regex(@"[\s,\,]+");
The + makes the delimiter "one or more" of the previous items. This may help, but it won't detect the situation of two commas (it would interpret it as one delimiter, which may not be what you want). Another possibility would be:
Regex RE = new Regex(@"\s*,\s*");
which is zero or more spaces, followed by a comma, followed by zero or more spaces.
You may also have to decide how you want to handle inputs such as:
foo, ,bar
which you might view as a list of three items, the second of which is a single space.
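Putting the first suggestion together, the sketch below treats any run of whitespace and commas as a single delimiter, trims the input, and drops empty entries that a leading or trailing delimiter would otherwise produce. The class name TokenDemo and the Tokenize helper are hypothetical, and the unevenly spaced input is an assumed example of the "carelessly typed" case.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class TokenDemo
{
    // Splits on one or more whitespace characters or commas.
    // The Where filter removes empty entries, for example after a trailing comma.
    public static string[] Tokenize(string actionItem) =>
        Regex.Split(actionItem.Trim(), @"[\s,]+")
             .Where(t => t.Length > 0)
             .ToArray();

    public static void Main()
    {
        foreach (string token in Tokenize("ExeScript.ps1   $1,  $0,%1,   %3"))
            Console.WriteLine(token);
    }
}
```

With this delimiter an input such as foo, ,bar yields two tokens, so it only fits when an empty or whitespace-only item is not meaningful; if it is, splitting on \s*,\s* preserves the empty slot.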
