find specific pattern of digits in a string - c#

Consider the following strings:
"via caporale degli zuavi 278a , 78329"
and
"autostrada a1 km - 47"
I am looking to isolate a specific sequence that may be present (first example) or absent (second example).
In particular, I am looking for a sequence of 1 to 4 digits that can be followed by a single letter; in addition, the string must not contain the substring "km". So in my first example "278a" is valid, but the remaining digit sequences are not.
What I've done so far is the following:
Since I know that any string that contains "km" is not valid, I applied this piece of code:
if(!stripped.ToLower().Contains("km"))
{
// apply Regex
}
else
// string not valid, move on
I know that this Regex will give me all the sequences of digits: Regex.Matches(t, @"\d+"); , but it is not enough. How can I proceed from here?
Edit: for further clarification, when a sequence of digits is followed by a letter, that letter must be the very next char (so no whitespace or anything else in between).
Edit2: note that the sequence of digits may or may not be followed by a letter (so 278a is as valid as 278).

You can assert that km does not occur to the left or to the right, and match 1-4 digits 0-9 optionally followed by a single char a-zA-Z:
(?<!\bkm\b.*)\b[0-9]{1,4}[A-Za-z]?\b(?!.*\bkm)
(?<!\bkm\b.*) Assert not km to the left
\b[0-9]{1,4}[A-Za-z]?\b Match 1-4 digits 0-9 and optionally a single char A-Za-z, between word boundaries
(?!.*\bkm) Assert not km to the right
string pattern = @"(?<!\bkm\b.*)\b[0-9]{1,4}[A-Za-z]?\b(?!.*\bkm)";
string input = @"via caporale degli zuavi 278a , 78329
via caporale degli zuavi 277 , 78329
via caporale degli zuavi 279a , 78329 km
km via caporale degli zuavi 280a , 78329
autostrada a1 km - 47";
foreach (Match m in Regex.Matches(input, pattern))
{
Console.WriteLine(m.Value);
}
Output
278a
277
If there is only 1 match expected, you might also rule out km in the whole string, and use a capture group as well with Regex.Match
^(?!.*\bkm\b).*\b([0-9]{1,4}[A-Za-z]?)\b
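A minimal sketch of this single-match variant, assuming it is wrapped in a helper (the name FindHouseNumber is hypothetical and not part of the answer). Because .* is greedy, the group captures the last candidate in the string:

```csharp
using System;
using System.Text.RegularExpressions;

static class SingleMatchDemo
{
    // Hypothetical helper: returns the captured number, or null when the
    // string contains the word "km" or has no 1-4 digit candidate.
    public static string FindHouseNumber(string input)
    {
        Match m = Regex.Match(input, @"^(?!.*\bkm\b).*\b([0-9]{1,4}[A-Za-z]?)\b");
        return m.Success ? m.Groups[1].Value : null;
    }

    static void Main()
    {
        Console.WriteLine(FindHouseNumber("via caporale degli zuavi 278a , 78329")); // 278a
        Console.WriteLine(FindHouseNumber("autostrada a1 km - 47") ?? "(no match)");  // (no match)
    }
}
```

The pattern is case-sensitive as written; if "KM" should also be rejected, RegexOptions.IgnoreCase could be passed, at the cost of the letter class no longer distinguishing case.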

You can use
^(?!.*(?<!\p{L})km\b)(?:.*\D)?(\d{1,4})(?=\p{L}?\b)
See the .NET regex demo. Details:
^ - start of string
(?!.*(?<!\p{L})km\b) - no km without any letter preceding the word and no alphanumeric/underscore following it is allowed anywhere in the string
(?:.*\D)? - an optional sequence of any zero or more chars other than a newline char, as many as possible, and then a non-digit char
(\d{1,4}) - Group 1: one to four digits
(?=\p{L}?\b) - immediately on the right, there should be an optional letter not followed with any alphanumeric or connector punctuation (like _).
See a C# demo:
var l = new List<string> {"via caporale degli zuavi 278a , 78329","autostrada a1 km - 47"};
foreach (var t in l)
{
var rx = @"^(?!.*(?<!\p{L})km\b)(?:.*\D)?(\d{1,4})(?=\p{L}?\b)";
var match = Regex.Match(t, rx, RegexOptions.ECMAScript)?.Groups[1].Value;
if (!string.IsNullOrEmpty(match))
{
Console.WriteLine($"There is a match in '{t}': {match}");
}
else
{
Console.WriteLine($"There is no match in '{t}'.");
}
}
Output:
There is a match in 'via caporale degli zuavi 278a , 78329': 278
There is no match in 'autostrada a1 km - 47'.
The RegexOptions.ECMAScript option is used to make \d only match ASCII digits (it does not affect \p{L} though).
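The effect of that option on \d can be checked with a small sketch. The helper name IsAllDigits and the Arabic-Indic sample string are assumptions made for illustration:

```csharp
using System;
using System.Text.RegularExpressions;

static class DigitClassDemo
{
    // Hypothetical helper: true when the whole string consists of \d chars
    // under the given options.
    public static bool IsAllDigits(string s, RegexOptions options) =>
        Regex.IsMatch(s, @"^\d+$", options);

    static void Main()
    {
        // ARABIC-INDIC DIGITS TWO, SEVEN, EIGHT (Unicode category Nd)
        string arabicIndic = "\u0662\u0667\u0668";

        Console.WriteLine(IsAllDigits("278", RegexOptions.ECMAScript));       // True
        Console.WriteLine(IsAllDigits(arabicIndic, RegexOptions.None));       // True
        Console.WriteLine(IsAllDigits(arabicIndic, RegexOptions.ECMAScript)); // False
    }
}
```

An alternative that avoids the option altogether is to write [0-9] instead of \d, as the first answer does.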

Related

Regex to match positive and negative numbers and text between "" after a character

I need a regex for an input that contains positive and negative numbers and sometimes a string between " and ". I'm not sure if this can be done in only one pattern. Here's some test cases for the pattern:
*PATH "C:\Users\User\Desktop\Media\SoundBanks\Ambient\WAV_Data\AD_SMP_SFX_WIND0.wav"
*NODECOLOR 0 255 140
*FILEREF -7
*FREQUENCY 22050
The idea would be to use a pattern that returns:
C:\Users\User\Desktop\Media\SoundBanks\Ambient\WAV_Data\AD_SMP_SFX_WIND0.wav
0 255 140
-7
22050
The content always goes after the character *. I've split this into two patterns because I don't know how to do it all in one, but it doesn't work:
MatchCollection NumberMtaches = Regex.Matches(FileLine, @"(?<=[*])-?[0-9]+");
MatchCollection FilePathMatches = Regex.Matches(FileLine, @"/,([^,]*)(?=,)/g");
You may read the file into a string and run the following regex:
var matches = Regex.Matches(filecontents, @"(?m)^\*\w+[\s-[\r\n]]*""?(.*?)""?\r?$")
.Cast<Match>()
.Select(x => x.Groups[1].Value)
.ToList();
Details:
(?m) - RegexOptions.Multiline option on
^ - start of a line
\* - a * char
\w+ - one or more word chars
[\s-[\r\n]]* - zero or more whitespaces other than CR and LF
"? - an optional " char
(.*?) - Group 1: any zero or more chars other than an LF char, as few as possible
"? - an optional " char
\r? - an optional CR
$ - end of a line/string.
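As a usage sketch of the pattern described above, the following runs it over a few lines shaped like the question's test cases. The helper name ExtractValues and the shortened file path are assumptions made for illustration:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

static class StarLineDemo
{
    // Hypothetical helper: returns the value that follows each "*KEYWORD",
    // with surrounding double quotes removed.
    public static List<string> ExtractValues(string fileContents) =>
        Regex.Matches(fileContents, @"(?m)^\*\w+[\s-[\r\n]]*""?(.*?)""?\r?$")
             .Cast<Match>()
             .Select(m => m.Groups[1].Value)
             .ToList();

    static void Main()
    {
        // The path is a shortened, made-up stand-in for the one in the question.
        string fileContents = string.Join("\n", new[]
        {
            @"*PATH ""C:\Media\sample.wav""",
            "*NODECOLOR 0 255 140",
            "*FILEREF -7",
            "*FREQUENCY 22050"
        });

        foreach (string value in ExtractValues(fileContents))
            Console.WriteLine(value);
        // C:\Media\sample.wav
        // 0 255 140
        // -7
        // 22050
    }
}
```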

Regex replace 'whole' decimal numbers not followed by a certain string

I want to replace "whole" decimal numbers not followed by pt with M.
For example, I need to replace 1, 12, and 36.7, but not 45.63 in the following.
string exp = "y=tan^-1(45.63pt)+12sin(-36.7)";
I have already tried
string newExp = Regex.Replace(exp, @"(\d+\.?\d*)(?!pt)", "M");
and it gives
"y=tan^-M(M3pt)+Msin(-M)"
It does make sense to me why it works like this, but I need to get
"y=tan^-M(45.63pt)+Msin(-M)"
The problem with the regex is that it is still matching a portion of the decimal value 45.63, up to the second-to-last decimal digit. One solution is to add a negative lookahead to the pattern to ensure that we only assert (?!pt) at the real end of every decimal value. This version is working:
string exp = "y=tan^-1(45.63pt)+12sin(-36.7)";
string newExp = Regex.Replace(exp, @"(\d+(?:\.\d+)?)(?![\d.])(?!pt)", "M");
Console.WriteLine(newExp);
This prints:
y=tan^-M(45.63pt)+Msin(-M)
Here is an explanation of the regex pattern used:
( match and capture:
\d+ one or more whole number digits
(?:\.\d+)? followed by an optional decimal component
) stop capturing
(?![\d.]) not being followed by another digit or dot
(?!pt) not followed by pt
If you need the output to be
"y=tan^-M(Mpt)+Msin(-M)"
then newExp should be
string newExp = Regex.Replace(exp, @"(\d+\.?\d*)", "M");
If the output should be
"y=tan^-M(45.63pt)+Msin(-M)"
then newExp should be
string newExp = Regex.Replace(exp, @"(\d+\.?\d*)(?![.\d]*pt)", "M");
I think you may assert the point in a string where there are no digits and dots directly followed by "pt":
\b(?![\d.]+pt)\d+(?:\.\d+)?
\b - Match a word-boundary.
(?![\d.]+pt) - Negative lookahead for 1+ digits and dots followed by "pt".
\d+ - 1+ digits.
(?: - Open non-capture group:
\.\d+ - A literal dot and 1+ digits.
)? - Close non-capture group and make it optional.
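A minimal sketch of how this pattern could be plugged into Regex.Replace, using the expression from the question (the helper name ReplaceNumbers is hypothetical):

```csharp
using System;
using System.Text.RegularExpressions;

static class WholeNumberDemo
{
    // Hypothetical helper: replaces every whole decimal number that is not
    // directly followed by "pt" with "M".
    public static string ReplaceNumbers(string exp) =>
        Regex.Replace(exp, @"\b(?![\d.]+pt)\d+(?:\.\d+)?", "M");

    static void Main()
    {
        Console.WriteLine(ReplaceNumbers("y=tan^-1(45.63pt)+12sin(-36.7)"));
        // y=tan^-M(45.63pt)+Msin(-M)
    }
}
```

The word boundary is what keeps the engine from starting a match in the middle of 45.63 after the first attempt has been rejected.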

Get the middle part of a filename using regex

I need a regex that can return up to 10 characters in the middle of a file name.
filename: returns:
msl_0123456789_otherstuff.csv -> 0123456789
msl_test.xml -> test
anythingShort.w1 -> anythingSh
I can capture the beginning and end for removal with the following regex:
Regex.Replace(filename, "(^msl_)|([.][[:alnum:]]{1,3}$)", string.Empty);
but I also need to have only 10 characters when I am done.
Explanation of the regex above:
(^msl_) - match lines that start with "msl_"
| - or
([.] - match a period
[[:alnum:]]{1,3} - followed by 1-3 alphanumeric characters
$) - at the end of the line
Note [[:alnum:]] can't work in a .NET regex, because it does not support POSIX character classes. You may use \w (to match letters, digits, underscores) or [^\W_] (to match letters or digits).
You can use your regex and just keep the first 10 chars in the string:
new string(Regex.Replace(s, @"^msl_|\.\w{1,3}$","").Take(10).ToArray())
See the C# demo online:
var strings = new List<string> { "msl_0123456789_otherstuff.csv", "msl_test.xml", "anythingShort.w1" };
foreach (var s in strings)
{
Console.WriteLine("{0} => {1}", s, new string(Regex.Replace(s, @"^msl_|\.\w{1,3}$","").Take(10).ToArray()));
}
Output:
msl_0123456789_otherstuff.csv => 0123456789
msl_test.xml => test
anythingShort.w1 => anythingSh
Using Replace with the alternation removes either alternative from the start or the end of the string. Note that it also works when the extension is absent, and that by itself it does not limit the number of chars kept in the middle.
If the file extension must be present, you might use a capturing group and make msl_ optional at the beginning.
Then match a word character other than _ 1-10 times, followed by optional word characters up to the dot:
^(?:msl_)?([^\W_]{1,10})\w*\.[^\W_]{2,}$
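A minimal sketch of this capturing-group approach, run over the file names from the question (the helper name MiddlePart is hypothetical):

```csharp
using System;
using System.Text.RegularExpressions;

static class MiddlePartDemo
{
    // Hypothetical helper: returns up to 10 letters/digits after an optional
    // "msl_" prefix, or null when the name has no extension.
    public static string MiddlePart(string fileName)
    {
        Match m = Regex.Match(fileName, @"^(?:msl_)?([^\W_]{1,10})\w*\.[^\W_]{2,}$");
        return m.Success ? m.Groups[1].Value : null;
    }

    static void Main()
    {
        Console.WriteLine(MiddlePart("msl_0123456789_otherstuff.csv")); // 0123456789
        Console.WriteLine(MiddlePart("msl_test.xml"));                  // test
        Console.WriteLine(MiddlePart("anythingShort.w1"));              // anythingSh
        Console.WriteLine(MiddlePart("noextension") ?? "(no match)");   // (no match)
    }
}
```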
A bit broader match could be using \S instead of \w and match until the last dot:
^(?:msl_)?(\S{1,10})\S*\.[^\W_]{2,}$
See another regex demo | C# demo
string[] strings = {"msl_0123456789_otherstuff.csv", "msl_test.xml","anythingShort.w1", "123456testxxxxxxxx"};
string pattern = @"^(?:msl_)?(\S{1,10})\S*\.[^\W_]{2,}$";
foreach (String s in strings) {
Match match = Regex.Match(s, pattern);
if (match.Success)
{
Console.WriteLine(match.Groups[1]);
}
}
Output
0123456789
test
anythingSh

C# [regex] trim spaces before specific word

I want to trim all spaces between numbers before words "usd" and "eur".
I have regex pattern like this:
@"\b(\d\s*)+\s(usd|eur)"
How can I exclude the space and usd|eur from the resulting match?
String example: "sdklfjsd 10 343 usd ds 232 300 eur"
Result should be: "sdklfjsd 10343 usd ds 232300 eur"
string line = "2 300 $ 12 Asdsfd 2 300 530 usd and 2 351 eur";
MatchCollection matches;
Regex defaultRegex = new Regex(@"\b(\d+\s*)+(usd|eur)");
matches = defaultRegex.Matches(line);
WriteLine("Parsing '{0}'", line);
for (int ctr = 0; ctr < matches.Count; ctr++)
WriteLine("={0}){1}", ctr, matches[ctr].Value);
There may be a more elegant way, but it can be done easily with a MatchEvaluator:
new Regex(@"\b(\d+\s*)+(?=\s(usd|eur))").
Replace("sdklfjsd 10 343 usd ds 232 300 eur",
m => string.Join("", m.Groups[1].Captures.Cast<Capture>().Select(c => c.Value.Trim())))
The Regex \b(\d+\s*)+(?=\s(usd|eur)) uses a look-ahead to only match numbers that are followed by \s(usd|eur), and a group whose captures record each consecutive \d+\s* match. (I kept the \b boundary from your question on the assumption that, with abc12 34 56 eur, matching only 34 56 is desired; remove it otherwise.)
Then for each match it gets all of that group's captures, trims them all, and concatenates them together to produce the replacement text.
(Note that currency codes are generally capitalised, so you may have another issue there.)
Try Regex: (\d+) *(\d+)(?= (?:usd|eur))
Assuming there are only two numbers, you can use
\b(\d+)\s*(\d+)(?=\s(usd|eur)) with a replacement string of $1$2
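A minimal sketch of this two-group replacement (the helper name JoinTwoNumbers is hypothetical). The second call shows what happens when the two-number assumption does not hold:

```csharp
using System;
using System.Text.RegularExpressions;

static class TwoNumberDemo
{
    // Hypothetical helper: joins two digit groups that directly precede
    // " usd" or " eur" by replacing the match with "$1$2".
    public static string JoinTwoNumbers(string line) =>
        Regex.Replace(line, @"\b(\d+)\s*(\d+)(?=\s(usd|eur))", "$1$2");

    static void Main()
    {
        Console.WriteLine(JoinTwoNumbers("sdklfjsd 10 343 usd ds 232 300 eur"));
        // sdklfjsd 10343 usd ds 232300 eur

        // With three groups only the last two are joined.
        Console.WriteLine(JoinTwoNumbers("2 300 530 usd"));
        // 2 300530 usd
    }
}
```

For an arbitrary number of groups, one of the capture-based or lookaround-based answers is the safer choice.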
You could also use a positive lookbehind and a positive lookahead to match all the spaces you want to remove:
(?<=\d)\s+(?=(?:\d+\s+)*\d+\s+(?:eur|usd)\b)
Explanation
(?<=\d) Positive lookbehind to assert what is on the left is a digit
\s+ Match 1+ whitespace characters
(?= Positive lookahead to assert what is on the right is
(?:\d+\s+)* Repeat 0+ times matching 1+ digits followed by 1+ whitespace characters
\d+\s+(?:eur|usd)\b match 1+ digits followed by 1+ whitespace characters and eur or usd
) Close positive lookahead
string line = "2 300 $ 12 Asdsfd 2 300 530 usd and 2 351 eur";
string result = Regex.Replace(line , @"(?<=\d)\s+(?=(?:\d+\s+)*\d+\s+(?:eur|usd)\b)", "");
Console.WriteLine(result); // 2 300 $ 12 Asdsfd 2300530 usd and 2351 eur

Basic regex for 16 digit numbers

I currently have a regex that pulls up a 16 digit number from a file e.g.:
Regex:
Regex.Match(l, @"\d{16}")
This would work well for a number as follows:
1234567891234567
Although how could I also include numbers in the regex such as:
1234 5678 9123 4567
and
1234-5678-9123-4567
If all groups are always 4 digits long:
\b\d{4}[ -]?\d{4}[ -]?\d{4}[ -]?\d{4}\b
To be sure the delimiter is the same between groups:
\b\d{4}(| |-)\d{4}\1\d{4}\1\d{4}\b
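A minimal sketch of the backreference version, checking that a mixed-delimiter number is rejected (the helper name IsSixteenDigits is hypothetical):

```csharp
using System;
using System.Text.RegularExpressions;

static class SameDelimiterDemo
{
    // Hypothetical helper: true when the input contains 16 digits written
    // either together or in groups of four with one consistent delimiter.
    public static bool IsSixteenDigits(string s) =>
        Regex.IsMatch(s, @"\b\d{4}(| |-)\d{4}\1\d{4}\1\d{4}\b");

    static void Main()
    {
        Console.WriteLine(IsSixteenDigits("1234567891234567"));    // True
        Console.WriteLine(IsSixteenDigits("1234 5678 9123 4567")); // True
        Console.WriteLine(IsSixteenDigits("1234-5678-9123-4567")); // True
        Console.WriteLine(IsSixteenDigits("1234-5678 9123 4567")); // False
    }
}
```

Group 1 captures the first delimiter (possibly empty), and each \1 then requires exactly the same text between the remaining groups.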
If it's always all together or groups of fours, then one way to do this with a single regex is something like:
Regex.Match(l, @"\d{16}|\d{4}[- ]\d{4}[- ]\d{4}[- ]\d{4}")
You could try something like:
^([0-9]{4}[\s-]?){3}([0-9]{4})$
That should do the trick.
Please note:
This also allows
1234-5678 9123 4567
It's not strict on only dashes or only spaces.
Another option is to just use the regex you currently have, and strip all offending characters out of the string before you run the regex:
var input = fileValue.Replace("-",string.Empty).Replace(" ",string.Empty);
Regex.Match(input, @"\d{16}");
Here is a pattern which will get all the numbers and strip out the dashes or spaces. Note that it also validates that there are exactly 16 digits. The RegexOptions.IgnorePatternWhitespace option is there only so that the pattern can be commented; it doesn't affect the match processing.
string value = "1234-5678-9123-4567";
string pattern = @"
^ # Beginning of line
( # Place into capture groups for 1 match
(?<Number>\d{4}) # Place into named group capture
(?:[\s-]?) # Allow for a space or dash optional
){4} # Get 4 groups
(?!\d) # 17th number, do not match! abort
$ # End constraint to keep int in 16 digits
";
var result = Regex.Match(value, pattern, RegexOptions.IgnorePatternWhitespace)
.Groups["Number"].Captures
.OfType<Capture>()
.Aggregate (string.Empty, (seed, current) => seed + current);
Console.WriteLine ( result ); // 1234567891234567
// Shows False due to 17 numbers!
Console.WriteLine ( Regex.IsMatch("1234-5678-9123-45678", pattern, RegexOptions.IgnorePatternWhitespace));
