How to get whats between two numbers in a string? - c#

I have a lot of movie files and I want to get their production year from their file names. as below:
Input: Kingdom.of.Heaven.2005.720p.Dubbed.Film2media
Output: 2005
This code just splits all the numbers:
string[] result = Regex.Split(str, #"(\d+:)");

You must be more specific about which numbers you want. E.g.
Regex to find the year (not for splitting):
\b(19\d\d)|(20\d\d)\b
19\d\d selects numbers like 1948, 1989.
20\d\d selects numbers like 2001, 2022.
\b specifies the word limits. It excludes numbers or words with 5 or more digits.
| means or
But it is difficult to make a fool proof algorithm without knowing how exactly the filename is constructed. E.g. the movie "2001: A Space Odyssey" was released in 1968. So, 2001 is not a correct result here.
To omit the movie name, you could search backwards like this:
string productionYear =
Regex.Match(str, #"\b(19\d\d)|(20\d\d)\b", RegexOptions.RightToLeft);
If instead of 720p we had a resolution of 2048p for instance, this would not be a problem, because the 2nd \b requires the number to be at the word end.
If the production year was always the 4th item from the right, then a better way to get this year would be:
string[] parts = str.Split('.');
string productionYear = parts[^4]; // C# 8.0+, .NET Core
// or
string productionYear = parts[parts.Length - 4]; // C# < 8 or .NET Framework
Note that the regex expression you specify in Regex.Split designates the separators, not the returned values.

I would not try to split the string, more like match a field. Also, consider matching \d{4} and not \d+ if you want to be sure to get years and not other fields like resolution in your example

You can try this:
string str = "Kingdom.of.Heaven.2005.720p.Dubbed.Film2media";
string year = Regex.Match(str, #"(?<=\.)(\d{4})(?=\.)").Groups[1].Value;
Console.WriteLine("Year: " + year);
Output: Year: 2005
Demo: https://dotnetfiddle.net/KM2PNk
\d{4}: This matches any sequence of four digits.
(?<=\.): This is a positive lookbehind assertion, which means that the preceding pattern must be present, but is not included in the match. In this case, the preceding pattern is a dot, so the regular expression will only match a sequence of four digits if it is preceded by a dot.
(?=\.): This is a positive lookahead assertion, which means that the following pattern must be present, but is not included in the match. In this case, the following pattern is a dot, so the regular expression will only match a sequence of four digits if it is followed by a dot.

Related

c# regex - match either a word or a number (int or float)

I'm currently programming a resolver for math expressions and have the following situation:
A user can either define a formula with parameters (a word) or numbers. Therefor I need to resolve a given string to either a word or a number that can be integer or float and I want to save the result e.g. as "left hand side" (lhs):
Unfortunately I have problems with the regex:
I tried something like:
Match m1 = Regex.Match("variable", #"^(?<lhs>((\w+)|(([0-9]+)(\.([0-9]*))?)))");
Match m2 = Regex.Match("1.2", #"^(?<lhs>((\w+)|(([0-9]+)(\.([0-9]*))?)))");
Expected result would be that I get a group "lhs" that contains the string "variable" for m1 or "1.2" for m2. Actual result is that I get the whole word "variable" in m1, but in case of the number in m2, just the first digit "1" is matched.
If I exlude the possibility for word "1.2" is found:
Match m3 = Regex.Match("1.2", #"^(?<lhs>(([0-9]+)(\.([0-9]*))?))");
Can anyone help me here?
The \w+ pattern matches letters, digits, and some other chars, including _. Since it is the first part in the alternation group, once it matches a digit, it is never retried with the second alternative.
You may swap the branches, and use something like
^(?<lhs>(?<num>[0-9]+(?:\.[0-9]*)?)|(?<word>\w+))
Or, if the whole string should match:
^(?<lhs>(?<num>[0-9]+(?:\.[0-9]*)?)|(?<word>\w+))$
See the .NET regex demo.

Substitute number with 2 digits in regex

I have some input that is an integer stored as a string that may have 1 or 2 digits. I would like to know if it is possible to come up with a regex pattern and substitution string that allows me to add a 0 at the front of any input that has only one digit.
ie. I'd like to find pattern and subst such that:
Regex.Replace("1",pattern,subst); // returns "01"
Regex.Replace("31",pattern,subst); // returns "31"
Edit: the question is specific to C# regex. Please do not answer to provide alternative methods
Using regex you can use word boundaries around a single digit:
string num = "5";
Regex.Replace(num, #"\b\d\b", "0$&");
//=> 05
num = "31";
Regex.Replace(num, #"\b\d\b", "0$&");
//=> 31
Code Demo
Regex \b\d\b will match a single digit with word boundaries on either side to ensure we're only matching a single digit.
More Infor about Word boundary
In case digit can appear in the middle of the word then you can use lookarounds regex like this:
num = Console.WriteLine(Regex.Replace(num, #"(?<!\d)\d(?!\d)", "0$&"));

Regex Substring or Left Equivalent

Greetings beloved comrades.
I cannot figure out how to accomplish the following via a regex.
I need to take this format number 201101234 and transform it to 11-0123401, where digits 3 and 4 become the digits to the left of the dash, and the remaining five digits are inserted to the right of the dash, followed by a hardcoded 01.
I've tried http://gskinner.com/RegExr, but the syntax just defeats me.
This answer, Equivalent of Substring as a RegularExpression, sounds promising, but I can't get it to parse correctly.
I can create a SQL function to accomplish this, but I'd rather not hammer my server in order to reformat some strings.
Thanks in advance.
You can try this:
var input = "201101234";
var output = Regex.Replace(input, #"^\d{2}(\d{2})(\d{5})$", "${1}-${2}01");
Console.WriteLine(output); // 11-0123401
This will match:
two digits, followed by
two digits captured as group 1, followed by
five digits captured as group 2
And return a string which replaces that matched text with
group 1, followed by
a literal hyphen, followed by
group 2, followed by
a literal 01.
The start and end anchors ( ^ / $ ) ensure that if the input string does not exactly match this pattern, it will simply return the original string.
If you can use custom C# scripts, you may want to use Substring instead:
string newStr = string.Format("{0}-{1}01", old.Substring(2,2), old.Substring(4));
I don't think you really need a regex here. Substring would be better. But still if you want regex only, you can use this:
string newString = Regex.Replace(input, #"^\d{2}(\d{2})(\d+)$", "$1-${2}01");
Explanation:
^\d{2} // Match first 2 digits. Will be ignored
(\d{2}) // Match next 2 digits. Capture it in group 1
(\d+)$ // Match rest of the digits. Capture it in group 2
Now, the required digits, are in group 1 and 2, which you use in the replacement string.
Do you even SQL? Pull some levers and stuff.

.NET REGEX Matching matches empty strings

I have this
pattern:
[0-9]*\.?[0-9]*
Target:
X=113.3413475 Y=18.2054775
And i want to match the numbers. It matches find in testing software like http://regexpal.com/ and Regex Coach.
But in Dot net and http://derekslager.com/blog/posts/2007/09/a-better-dotnet-regular-expression-tester.ashx
I get:
Found 11 matches:
1.
2.
3.
4.
5.
6. 113.3413475
7.
8.
9.
10. 18.2054775
11.
String literals for use in programs:
C#
#"[0-9]*[\.]?[0-9]*"
Any one have any idea why i'm getting all these empty matches.
Thanks and Regards,
Kevin
Yes, that will match empty string. Look at it:
[0-9]* - zero or more digits
\.? - an optional period
[0-9]* - zero or more digits
Everything's optional, so an empty string matches.
It sounds like you always want there to be digits somewhere, for example:
[0-9]+\.[0-9]*|\.[0-9]+|[0-9]+
(The order here matters, as you want it to take the most possible.)
That works for me:
using System;
using System.Text.RegularExpressions;
class Test
{
static void Main(string[] args)
{
string x = "X=113.3413475 Y=18.2054775";
Regex regex = new Regex(#"[0-9]+\.[0-9]*|\.[0-9]+|[0-9]+");
var matches = regex.Matches(x);
foreach (Match match in matches)
{
Console.WriteLine(match);
}
}
}
Output:
113.3413475
18.2054775
There may well be better ways of doing it, admittedly :)
Try this one:
[0-9]+(\.[0-9]+)?
It's slightly different that Jon Skeet's answer in that it won't match .45, it requires either a number alone (e.g. 8) or a real decimal (e.g. 8.1 or 0.1)
Another alternative is to keep your original regex, and just assert it must have a number in it (maybe after a dot):
[0-9]*\.?[0-9]*
Goes to:
(?=\.?[0-9])[0-9]*\.?[0-9]*
The key problem is the *, which means "match zero or more of the preceding characters". The empty string matches zero or more digits, which is why you're getting all those matches.
Change your two *s to +s and you'll get what you want.
The problem with this regex is that it is completely optional in all the fields, so an empty string also is matched by it. I would consider adding all the cases. By the regex, I see you want the numbers with or without dot, and with or without a set of decimal digits. You can separate first those that contain only numbers [0-9]+, then those that contain numbers plus only a dot, [0-9]+\. and then join them all with | (or).
The problem with the regex as it is is that it allows cases that are not real numbers, for example, the cases in which the first set of numbers and the last set of numbers are empty (just a dot), so you have to put the valid cases explicitly.
Regex pattern = new Regex( #"[0-9]+[\.][0-9]+");
string info = "X=113.3413475 Y=18.2054775";
MatchCollection matches = pattern.Matches(info);
int count = 1;
foreach(Match match in matches)
{
Console.WriteLine("{0} : {1}", count++, match.Value);
}
//output
//1 : 113.3413475
//2 : 18.2054775
Replace your * with + and remove ? from your period case.
EDIT: from above conversation: #"[0-9]+.[0-9]*|.[0-9]+|[0-9]+", is the better case. catches 123, .123, 123.123 etc

regex for capturing digits and digit ranges

i have the following string
Fat mass loss was 2121,323.222 greater for GPLC (2–2.4kg vs. 0.5kg)
i want to capture
212,323.222
2-2.24
0.5
i.e. i want the above three results from the string,
can any one help me with this regex
I noticed that your hyphen in 2–2.4kg is not really hyphen, its a unicode 0x2013 "DASH".
So, here is another regex in C#
#"[0-9]+([,.\u2013-][0-9]+)*"
Test
MatchCollection matches = Regex.Matches("Fat mass loss was 2121,323.222 greater for GPLC (2–2.4kg vs. 0.5kg)", #"[0-9]+([,.\u2013-][0-9]+)*");
foreach (Match m in matches) {
Console.WriteLine(m.Groups[0]);
}
Here is the results, my console does not support printing unicode char 2013, so its "?" but its properly matched.
2121,323.222
2?2.4
0.5
Okay I didn't notice the C# tag until now. I will leave the answer but I know that's not what you expected, see if you can do something with it. Perhaps the title should have mentioned the programming language?
Sure:
Fat mass loss was (.*) greater for GPLC \((.*) vs. (.*)kg\)
Find your substrings in \1, \2 and \3.
If for Emacs, swap all parentheses and escaped parentheses.
How about something like this:
^.*((?:\d+,)*\d+(?:\.\d+)?).*(\d+(?:\.\d+)?(?:-\d+(?:\.\d+))?).*(\d+(?:\.\d+)).*$
A little more general, I think. I'm a little concerned about .* being greedy.
Fat mass loss was 2121,323.222 greater
for GPLC (2–2.4kg vs. 0.5kg)
a generalized extractor:
/\D+?([\d\,\.\-]+)/g
explanation:
/ # start pattern
\D+ # 1 or more non-digits
( # capture group 1
[\d,.-]+ # character class, 1 or more of digits, comma, period, hyphen
) # end capture group 1
/g # trailing regex g modifier (make regex continue after last match)
sorry I don't know c# well enough for a full writeup, but the pattern should plug right in.
see: http://www.radsoftware.com.au/articles/regexsyntaxadvanced.aspx for some implementation examples.
I came out with something like this atrocity:
-?\d(?:,?\d)*(?:\.(?:\d(?:,?\d)*\d|\d))?(?:[–-]-?\d(?:,?\d)*(?:\.(?:\d(?:,?\d)*\d|\d))?)?
Out of witch -?\d(?:,?\d)*(?:\.(?:\d(?:,?\d)*\d|\d))? is repeated twice, with – in the middle (note that this is a long hyphen).
This should take care of dots and commas outside of numbers, eg: hello,23,45.2-7world - will capture 23,45.2-7.
It looks like you're trying to find all numbers in the string (possibly with commas inside the number), and all ranges of numbers such as "2-2.4". Here is a regex that should work:
\d+(?:[,.-]\d+)*
From C# 3, you can use it like this:
var input = "Fat mass loss was 2121,323.222 greater for GPLC (2-2.4kg vs. 0.5kg)";
var pattern = #"\d+(?:[,.-]\d+)*";
var matches = Regex.Matches(input, pattern);
foreach ( var match in matches )
Console.WriteLine(match.Value);
Hmm, this is a tricky question, especially because the input string contains unicode character – (EN DASH) instead of - (HYPHEN-MINUS). Therefore the correct regex to match the numbers in the original string would be:
\d+(?:[\u2013,.]\d+)*
If you want a more generic approach would be:
\d+(?:[\p{Pd}\p{Pc}\p{Po}]\d+)*
which matches dash punctuation, connecter punctuation and other punctuation. See here for more information about those.
An implementation in C# would look like this:
string input = "Fat mass loss was 2121,323.222 greater for GPLC (2–2.4kg vs. 0.5kg)";
try {
Regex rx = new Regex(#"\d+(?:[\p{Pd}\p{Pc}\p{Po}\p{C}]\d+)*", RegexOptions.IgnoreCase | RegexOptions.Multiline);
Match match = rx.Match(input);
while (match.Success) {
// matched text: match.Value
// match start: match.Index
// match length: match.Length
match = match.NextMatch();
}
} catch (ArgumentException ex) {
// Syntax error in the regular expression
}
Let's try this one :
(?=\d)([0-9,.-]+)(?<=\d)
It captures all expressions containing only :
"[0-9,.-]" characters,
must start with a digit "(?=\d)",
must finish with a digit "(?<=\d)"
It works with a single digit expression and does not include beginning or trailing [.,-].
Hope this helps.
I got the solution to my problem.
The following is the Regex that gave my desired result:
(([0-9]+)([–.,-]*))+

Categories