.NET REGEX Matching matches empty strings - c#

I have this
pattern:
[0-9]*\.?[0-9]*
Target:
X=113.3413475 Y=18.2054775
And I want to match the numbers. It matches fine in testing software like http://regexpal.com/ and Regex Coach.
But in Dot net and http://derekslager.com/blog/posts/2007/09/a-better-dotnet-regular-expression-tester.ashx
I get:
Found 11 matches:
1.
2.
3.
4.
5.
6. 113.3413475
7.
8.
9.
10. 18.2054775
11.
String literals for use in programs:
C#
@"[0-9]*[\.]?[0-9]*"
Does anyone have any idea why I'm getting all these empty matches?
Thanks and Regards,
Kevin

Yes, that will match empty string. Look at it:
[0-9]* - zero or more digits
\.? - an optional period
[0-9]* - zero or more digits
Everything's optional, so an empty string matches.
It sounds like you always want there to be digits somewhere, for example:
[0-9]+\.[0-9]*|\.[0-9]+|[0-9]+
(The order here matters, as you want it to take the most possible.)
That works for me:
using System;
using System.Text.RegularExpressions;
class Test
{
static void Main(string[] args)
{
string x = "X=113.3413475 Y=18.2054775";
Regex regex = new Regex(@"[0-9]+\.[0-9]*|\.[0-9]+|[0-9]+");
var matches = regex.Matches(x);
foreach (Match match in matches)
{
Console.WriteLine(match);
}
}
}
Output:
113.3413475
18.2054775
There may well be better ways of doing it, admittedly :)

Try this one:
[0-9]+(\.[0-9]+)?
It's slightly different than Jon Skeet's answer in that it won't match .45; it requires either a number alone (e.g. 8) or a real decimal (e.g. 8.1 or 0.1).

Another alternative is to keep your original regex, and just assert it must have a number in it (maybe after a dot):
[0-9]*\.?[0-9]*
Goes to:
(?=\.?[0-9])[0-9]*\.?[0-9]*
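To see what the lookahead changes, here is a minimal sketch that runs both patterns over the question's input. The class name LookaheadDemo and the helper AllMatches are made up for illustration; only Regex.Matches comes from .NET.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class LookaheadDemo
{
    // Hypothetical helper: returns every match as a string, empty ones included.
    public static List<string> AllMatches(string input, string pattern)
    {
        var results = new List<string>();
        foreach (Match m in Regex.Matches(input, pattern))
        {
            results.Add(m.Value);
        }
        return results;
    }

    static void Main()
    {
        string input = "X=113.3413475 Y=18.2054775";

        // Original pattern: every part is optional, so empty matches appear.
        List<string> original = AllMatches(input, @"[0-9]*\.?[0-9]*");
        Console.WriteLine("Empty matches with the original pattern: {0}",
            original.FindAll(s => s.Length == 0).Count);

        // Guarded pattern: a digit (optionally after a dot) must come next,
        // so the engine can no longer succeed on an empty string.
        foreach (string s in AllMatches(input, @"(?=\.?[0-9])[0-9]*\.?[0-9]*"))
        {
            Console.WriteLine(s);
        }
    }
}
```

The lookahead consumes nothing, so the rest of the original pattern stays exactly as it was.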

The key problem is the *, which means "match zero or more of the preceding element". The empty string matches zero or more digits, which is why you're getting all those matches.
Change your two *s to +s and you'll get what you want, as long as every number has at least two digits: [0-9]+\.?[0-9]+ no longer matches a single digit such as 8.

The problem with this regex is that it is completely optional in all the fields, so an empty string also is matched by it. I would consider adding all the cases. By the regex, I see you want the numbers with or without dot, and with or without a set of decimal digits. You can separate first those that contain only numbers [0-9]+, then those that contain numbers plus only a dot, [0-9]+\. and then join them all with | (or).
The problem with the regex as it is is that it allows cases that are not real numbers, for example, the cases in which the first set of numbers and the last set of numbers are empty (just a dot), so you have to put the valid cases explicitly.

Regex pattern = new Regex(@"[0-9]+[\.][0-9]+");
string info = "X=113.3413475 Y=18.2054775";
MatchCollection matches = pattern.Matches(info);
int count = 1;
foreach(Match match in matches)
{
Console.WriteLine("{0} : {1}", count++, match.Value);
}
//output
//1 : 113.3413475
//2 : 18.2054775
Replace your * with + and remove ? from your period case.
EDIT: from the conversation above, @"[0-9]+\.[0-9]*|\.[0-9]+|[0-9]+" is the better pattern; it catches 123, .123, 123.123, etc.

Related

How to get whats between two numbers in a string?

I have a lot of movie files and I want to get their production year from their file names. as below:
Input: Kingdom.of.Heaven.2005.720p.Dubbed.Film2media
Output: 2005
This code just splits all the numbers:
string[] result = Regex.Split(str, @"(\d+:)");
You must be more specific about which numbers you want. E.g.
Regex to find the year (not for splitting):
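\b(19\d\d|20\d\d)\b
19\d\d selects numbers like 1948, 1989.
20\d\d selects numbers like 2001, 2022.
\b marks a word boundary. Both alternatives sit inside one group so that the boundaries apply to each of them; this excludes years embedded in longer tokens, such as numbers with 5 or more digits.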
| means or
But it is difficult to make a fool proof algorithm without knowing how exactly the filename is constructed. E.g. the movie "2001: A Space Odyssey" was released in 1968. So, 2001 is not a correct result here.
To omit the movie name, you could search backwards like this:
string productionYear =
Regex.Match(str, @"\b(19\d\d|20\d\d)\b", RegexOptions.RightToLeft).Value;
If instead of 720p we had a resolution of 2048p for instance, this would not be a problem, because the 2nd \b requires the number to be at the word end.
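As a rough sketch of the right-to-left search, the following wraps it in a helper and tries it on a few file names. YearFinder and FindYear are illustrative names, and the second and third file names are invented test inputs, not taken from the question.

```csharp
using System;
using System.Text.RegularExpressions;

static class YearFinder
{
    // Returns the right-most standalone 19xx/20xx token, or null if there is none.
    public static string FindYear(string fileName)
    {
        Match m = Regex.Match(fileName, @"\b(?:19|20)\d\d\b", RegexOptions.RightToLeft);
        return m.Success ? m.Value : null;
    }

    static void Main()
    {
        string[] names =
        {
            "Kingdom.of.Heaven.2005.720p.Dubbed.Film2media",
            "2001.A.Space.Odyssey.1968.1080p", // made-up name: the title contains a year
            "Some.Movie.1999.2048p"            // made-up name: resolution looks like a year
        };

        foreach (string name in names)
        {
            Console.WriteLine("{0} -> {1}", name, FindYear(name) ?? "(none)");
        }
    }
}
```

Searching from the right only helps when the year comes after the title; a file name that ends with another standalone four-digit token in the 1900-2099 range would still fool it.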
If the production year was always the 4th item from the right, then a better way to get this year would be:
string[] parts = str.Split('.');
string productionYear = parts[^4]; // C# 8.0+, .NET Core
// or
string productionYear = parts[parts.Length - 4]; // C# < 8 or .NET Framework
Note that the regex expression you specify in Regex.Split designates the separators, not the returned values.
I would not try to split the string, more like match a field. Also, consider matching \d{4} and not \d+ if you want to be sure to get years and not other fields like resolution in your example
You can try this:
string str = "Kingdom.of.Heaven.2005.720p.Dubbed.Film2media";
string year = Regex.Match(str, @"(?<=\.)(\d{4})(?=\.)").Groups[1].Value;
Console.WriteLine("Year: " + year);
Output: Year: 2005
Demo: https://dotnetfiddle.net/KM2PNk
\d{4}: This matches any sequence of four digits.
(?<=\.): This is a positive lookbehind assertion, which means that the preceding pattern must be present, but is not included in the match. In this case, the preceding pattern is a dot, so the regular expression will only match a sequence of four digits if it is preceded by a dot.
(?=\.): This is a positive lookahead assertion, which means that the following pattern must be present, but is not included in the match. In this case, the following pattern is a dot, so the regular expression will only match a sequence of four digits if it is followed by a dot.

Regular expression only one character and 7 numbers

I want a regex that matches a word made of exactly one letter, in any position, and 7 digits.
match example:
1111111q
2222222q
111e1111
11e11111
I wrote this pattern, but it does not work for all of the examples:
[A-Za-z][0-9]{7}
Regular expressions match patterns. In your case, it would seem that the letter can be at any point in your string, which would mean that you would have a multitude of patterns which would need to be taken into consideration.
I think that for this case, you should not use regular expressions for simplicity's sake. I would recommend you take a look at the Char.IsDigit(Char c) and Char.IsLetter(Char c) methods and use counters to see that the string is in the format you are after.
there are readily available methods in C# for checking the conditions you want. I would use Regex if there is no parser or simple c# solution.
I would do like below
var str = "1111111u";
var isValid = str.Length ==8 &&
str.Where(char.IsDigit).Count() ==7 &&
str.Where(char.IsLetter).Count() ==1;
It is not that difficult in regex:
If the complete string has to match just use:
^(?=.{8}$)\d*[a-zA-Z]\d*$
See it here on regexr.
If this is a word in a larger text use:
\b(?=[a-z0-9]{8}\b)\d*[a-z]\d*\b
See it here on Regexr
\d*[a-z]\d* matches any amount of digits, followed by one letter, then again any amount of digits.
(?=[a-z0-9]{8}\b) is a positive lookahead assertion; it ensures a total length of 8.
Important here is the use of anchors or word boundaries to avoid partial wrong matches.
If you really want to match any letter then use the Unicode property \p{L} instead of the character class:
^(?=.{8}$)\d*\p{L}\d*$
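A small sketch of the whole-string variant in use, checked against the question's examples plus a few invalid inputs. CodeValidator and IsValid are hypothetical names, and the pattern uses [0-9] rather than \d on purpose: by default \d in .NET also matches non-ASCII Unicode digits.

```csharp
using System;
using System.Text.RegularExpressions;

static class CodeValidator
{
    // The lookahead fixes the total length at 8; the rest of the pattern
    // allows exactly one ASCII letter surrounded by ASCII digits.
    private static readonly Regex OneLetterSevenDigits =
        new Regex(@"^(?=.{8}$)[0-9]*[a-zA-Z][0-9]*$");

    public static bool IsValid(string s)
    {
        return s != null && OneLetterSevenDigits.IsMatch(s);
    }

    static void Main()
    {
        string[] samples =
        {
            "1111111q", "111e1111", "11e11111", "q1111111", // expected to pass
            "11111111", "11ee1111", "111e111"               // no letter, two letters, too short
        };

        foreach (string sample in samples)
        {
            Console.WriteLine("{0}: {1}", sample, IsValid(sample));
        }
    }
}
```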
I can only come up with a "brute force" regex method:
foundMatch = Regex.IsMatch(subjectString,
@"\b
(?:[a-z]\d{7}|
\d[a-z]\d{6}|
\d{2}[a-z]\d{5}|
\d{3}[a-z]\d{4}|
\d{4}[a-z]\d{3}|
\d{5}[a-z]\d{2}|
\d{6}[a-z]\d{1}|
\d{7}[a-z])
\b",
RegexOptions.IgnoreCase | RegexOptions.IgnorePatternWhitespace);
Note the word boundary anchors, which you should remove if this pattern is part of a longer string.
Also note the IgnoreCase option, which you can remove if all letters will be lower case.
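Edit: See @stema's answer for a much more concise regex.
This will match what you want, with two caveats: \w also matches digits and underscores (use [a-zA-Z] to require a letter), and the case where the letter comes first is not covered: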
(\d{1}\w\d{6}|\d{2}\w\d{5}|\d{3}\w\d{4}|\d{4}\w\d{3}|\d{5}\w\d{2}|\d{6}\w\d{1}|\d{7}\w)
I generated it like this, in powershell:
$n = 6
for ($i = 1; $i -le 6; $i++) {
Write-Host "\d{$i}\w\d{$n}"
$n--
}
Your example will only work when the character is the first character in the string.
The problem you've got is that you need a total of 7 digits, and absolutely only one character potentially within those 7 digits. This is awkward to express with a plain regular expression: the two groups of digits have to add up to 7, and a regex cannot count across groups, so without a lookahead you would have to spell out every possible position of the letter.
I was wondering if it was possible using a lookahead assertion to ensure there's only one letter, but the best I can do is ensuring there's no instance of two letters in a row, which doesn't cover all possible invalid cases. Thus I think you're going to have to find another method, as npinti suggested. So something like:
public static bool Match(string s) {
return (s.Length == 8) &&
(s.Where(Char.IsDigit).Count() == 7) &&
(s.Where(Char.IsLetter).Count() == 1);
}
But I haven't tested that.
Just use this if you want one letter and 7 digits:
"[A-Za-z][0-9]{7}|[0-9][A-Za-z][0-9]{6}|[0-9]{2}[A-Za-z][0-9]{5}|[0-9]{3}[A-Za-z][0-9]{4}|[0-9]{4}[A-Za-z][0-9]{3}|[0-9]{5}[A-Za-z][0-9]{2}|[0-9]{6}[A-Za-z][0-9]|[0-9]{7}[A-Za-z]"
and here is a code snippet showing how you can iterate through the results:
string st = "1111111q 2222222q 111e1111 11e11111";
string pattS = @"[A-Za-z][0-9]{7}|[0-9][A-Za-z][0-9]{6}|[0-9]{2}[A-Za-z][0-9]{5}|[0-9]{3}[A-Za-z][0-9]{4}|[0-9]{4}[A-Za-z][0-9]{3}|[0-9]{5}[A-Za-z][0-9]{2}|[0-9]{6}[A-Za-z][0-9]|[0-9]{7}[A-Za-z]";
Regex regex = new Regex(pattS);
var res = regex.Matches(st);
foreach (Match re in res)
{
Console.WriteLine(re.Value); // one matched token per line
}
check here on rubular it covers all examples you provide
You can use this pattern:
^([0-9])(?:\1|[a-z](?!.*[a-z])){7}|[a-z]([0-9])\2{6}$
With Regex, you can do it in two steps. First you can remove the character, in whatever position it is:
string input = "111a1111";
Regex rgx = new Regex(@"[a-zA-Z]");
string output=rgx.Replace(input,"",1); // remove only one character
// output = "1111111"
then you can match with [0-9]{7} (if you don't want all digits to be the same)
or with ^(\d)\1{6}$ (if you want 7 occurrences of the same digit)

Regex Substring or Left Equivalent

Greetings beloved comrades.
I cannot figure out how to accomplish the following via a regex.
I need to take this format number 201101234 and transform it to 11-0123401, where digits 3 and 4 become the digits to the left of the dash, and the remaining five digits are inserted to the right of the dash, followed by a hardcoded 01.
I've tried http://gskinner.com/RegExr, but the syntax just defeats me.
This answer, Equivalent of Substring as a RegularExpression, sounds promising, but I can't get it to parse correctly.
I can create a SQL function to accomplish this, but I'd rather not hammer my server in order to reformat some strings.
Thanks in advance.
You can try this:
var input = "201101234";
var output = Regex.Replace(input, @"^\d{2}(\d{2})(\d{5})$", "${1}-${2}01");
Console.WriteLine(output); // 11-0123401
This will match:
two digits, followed by
two digits captured as group 1, followed by
five digits captured as group 2
And return a string which replaces that matched text with
group 1, followed by
a literal hyphen, followed by
group 2, followed by
a literal 01.
The start and end anchors ( ^ / $ ) ensure that if the input string does not exactly match this pattern, it will simply return the original string.
If you can use custom C# scripts, you may want to use Substring instead:
string newStr = string.Format("{0}-{1}01", old.Substring(2,2), old.Substring(4));
I don't think you really need a regex here. Substring would be better. But still if you want regex only, you can use this:
string newString = Regex.Replace(input, @"^\d{2}(\d{2})(\d+)$", "$1-${2}01");
Explanation:
^\d{2} // Match first 2 digits. Will be ignored
(\d{2}) // Match next 2 digits. Capture it in group 1
(\d+)$ // Match rest of the digits. Capture it in group 2
Now, the required digits, are in group 1 and 2, which you use in the replacement string.
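For comparison, here is a minimal sketch that puts the regex and Substring approaches side by side. CaseNumberFormatter, WithRegex and WithSubstring are hypothetical names, and the length guard in the Substring version is an added assumption (it does not check that the characters are digits).

```csharp
using System;
using System.Text.RegularExpressions;

static class CaseNumberFormatter
{
    // Regex version: input that does not fit the 9-digit shape is returned unchanged.
    public static string WithRegex(string input)
    {
        return Regex.Replace(input, @"^\d{2}(\d{2})(\d{5})$", "${1}-${2}01");
    }

    // Substring version: only the length is checked here (an assumption of this sketch).
    public static string WithSubstring(string input)
    {
        if (input == null || input.Length != 9)
        {
            return input;
        }
        return input.Substring(2, 2) + "-" + input.Substring(4) + "01";
    }

    static void Main()
    {
        string input = "201101234";
        Console.WriteLine(WithRegex(input));
        Console.WriteLine(WithSubstring(input));
        Console.WriteLine(WithRegex("2011")); // too short, comes back as-is
    }
}
```

The ${2} form matters in the replacement string: written as $201, the engine would look for a group numbered 201 instead of group 2 followed by the literal 01.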
Do you even SQL? Pull some levers and stuff.

Match pattern of [0-9]-[0-9]-[0-9], but without matching [0-9]-[0-9]

I'm not sure how to accomplish this with a regular expression (or if I can; I'm new to regex). I have an angle value the user will type in and I'm trying to validate the entry. It is in the form degrees-minutes-seconds. The problem I'm having, is that if the user mistypes the seconds portion, I have to catch that error, but my match for degrees-minutes is a success.
Perhaps the method will explain better:
private Boolean isTextValid(String _angleValue) {
Regex _degreeMatchPattern = new Regex("0*[1-9]");
Regex degreeMinMatchPattern = new Regex("(0*[0-9]-{1}0*[0-9]){1}");
Regex degreeMinSecMatchPattern = new Regex("0*[0-9]-{1}0*[0-9]-{1}0*[0-9]");
Match _degreeMatch, _degreeMinMatch, _degreeMinSecMatch;
_degreeMinSecMatch = degreeMinSecMatchPattern.Match(_angleValue);
if (_degreeMinSecMatch.Success)
return true;
_degreeMinMatch = degreeMinMatchPattern.Match(_angleValue);
if (_degreeMinMatch.Success)
return true;
_degreeMatch = _degreeMatchPattern.Match(_angleValue);
if (_degreeMatch.Success)
return true;
return false;
}
}
I want to check for degrees-minutes if the degrees-minutes-seconds match is unsuccessful, but only if the user didn't enter any seconds data. Can I do this via regex, or do I need to parse the string and evaluate each portion separately? Thanks.
EDIT: Sample data would be 45-23-10 as correct data. The problem is 45-23 is also valid data; the 0 seconds is understood. So if the user types 45-23-1= on accident, the degreeMinMatchPattern regex in my code will match succesfully, even though it is invalid.
Second EDIT: Just to make it clear, the minutes and second portions are both optional. The user can type 45 and that is valid.
You can specify "this part of the pattern must match at least m times" using the {m,} syntax. Since there are hyphens between each component, specify the first part separately, and then each hyphen-digit combination can be grouped together after:
[0-9](-[0-9]){2,}
You can also shorten [0-9] to \d: \d(-\d){2,}
First off, a character in a regex is matched once by default, so {1} is redundant.
Second, since you can apparently isolate this value (you prompt for just this value, instead of having to look for it in a paragraph of entered data) you should include ^ and $ in your string, to enforce that the string should contain ONLY this pattern.
Try "^\d{1,3}-\d{1,2}(-\d{1,2})?$".
Breaking it down: ^ matches the beginning of the string. \d matches any single decimal digit, and then behind that you're specifying {1,3} which will match a run of one to three digits. Then you're looking for one dash, then a similar digit pattern but only one or two times. The last term is enclosed in parentheses so we can group the characters. Its form is similar to the first two, then there's a ? which marks the preceding group as optional. The $ at the end indicates that the input should end. Given this, it will match 222-33-44 or 222-33, but not 222-3344 or 222-33-abc.
Keep in mind there are additional rules you might want to incorporate. For instance, seconds can be expressed as a decimal (if you want a resolution smaller than one second). You would need to optionally expect the decimal point and one or more additional digits. Also, you probably have a maximum degree value; the above regex will match the maximum integer DMS value of 359-59-59, however it will also match 999-99-99 which is not valid. You can limit the maximum value using regex (for example "(3[0-5]\d|[1-2]\d{2}|\d{1,2})" will match any number from 0 to 359, by matching a 3, then 0-5, then 0-9, OR any 3-digit number starting with 1 or 2, OR any one- or two-digit number), but as the example shows the regex will get long and messy, so document it well in code as to what you're doing.
Maybe you would do better to just parse the input and check each piece separately.
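A minimal sketch of that parse-and-check approach, assuming degrees run from 0 to 359 and minutes and seconds from 0 to 59, with minutes and seconds optional as the question states. AngleValidator and IsValidDms are illustrative names, and the ranges are assumptions rather than requirements given by the asker.

```csharp
using System;
using System.Globalization;

static class AngleValidator
{
    // Accepts "d", "d-m" or "d-m-s" where each part consists of digits only.
    public static bool IsValidDms(string text)
    {
        if (string.IsNullOrEmpty(text))
        {
            return false;
        }

        string[] parts = text.Split('-');
        if (parts.Length > 3)
        {
            return false;
        }

        // Assumed upper bounds for degrees, minutes and seconds.
        int[] maxima = { 359, 59, 59 };

        for (int i = 0; i < parts.Length; i++)
        {
            int value;
            // NumberStyles.None rejects signs, whitespace and empty parts.
            if (!int.TryParse(parts[i], NumberStyles.None,
                              CultureInfo.InvariantCulture, out value))
            {
                return false;
            }
            if (value > maxima[i])
            {
                return false;
            }
        }
        return true;
    }

    static void Main()
    {
        string[] samples = { "45-23-10", "45-23", "45", "45-23-1=", "999-99-99", "45-23-" };
        foreach (string sample in samples)
        {
            Console.WriteLine("{0}: {1}", sample, IsValidDms(sample));
        }
    }
}
```

Because every part is parsed on its own, a mistyped seconds field such as 1= fails the whole entry instead of letting the degrees-minutes prefix pass.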
I'm not sure I understand correctly, but I think
(?<degrees>0*[0-9])-?(?<minutes>0*[0-9])(?:-?(?<seconds>0*[0-9]))?$
might work.
But this is quite ambiguous; also I'm wondering why you're only allowing single-digit degree/minute/second values. Please show some examples you do and don't want to match.
Maybe you should try something like this and test for empty/invalid groups:
// The regex variable must not share a name with the "degrees" local below.
Regex dmsPattern = new Regex(
@"(?<degrees>\d+)(?:-(?<minutes>\d+))?(?:-(?<seconds>\d+))?");
string[] samples = new []{ "123", "123-456", "123-456-789" };
foreach (var sample in samples)
{
Match m = dmsPattern.Match(sample);
if(m.Success)
{
string degrees = m.Groups["degrees"].Value;
string minutes = m.Groups["minutes"].Value;
string seconds = m.Groups["seconds"].Value;
Console.WriteLine("{0}°{1}'{2}\"", degrees,
String.IsNullOrEmpty(minutes) ? "0" : minutes,
String.IsNullOrEmpty(seconds) ? "0" : seconds
);
}
}

regex for capturing digits and digit ranges

i have the following string
Fat mass loss was 2121,323.222 greater for GPLC (2–2.4kg vs. 0.5kg)
i want to capture
2121,323.222
2–2.4
0.5
i.e. I want the above three results from the string.
Can anyone help me with this regex?
I noticed that the hyphen in 2–2.4kg is not really a hyphen; it's the Unicode character U+2013 (EN DASH).
So, here is another regex in C#
@"[0-9]+([,.\u2013-][0-9]+)*"
Test
MatchCollection matches = Regex.Matches("Fat mass loss was 2121,323.222 greater for GPLC (2–2.4kg vs. 0.5kg)", @"[0-9]+([,.\u2013-][0-9]+)*");
foreach (Match m in matches) {
Console.WriteLine(m.Groups[0]);
}
Here are the results. My console does not support printing the Unicode character U+2013, so it shows as "?", but it is matched properly.
2121,323.222
2?2.4
0.5
Okay I didn't notice the C# tag until now. I will leave the answer but I know that's not what you expected, see if you can do something with it. Perhaps the title should have mentioned the programming language?
Sure:
Fat mass loss was (.*) greater for GPLC \((.*) vs. (.*)kg\)
Find your substrings in \1, \2 and \3.
If for Emacs, swap all parentheses and escaped parentheses.
How about something like this:
^.*((?:\d+,)*\d+(?:\.\d+)?).*(\d+(?:\.\d+)?(?:-\d+(?:\.\d+))?).*(\d+(?:\.\d+)).*$
A little more general, I think. I'm a little concerned about .* being greedy.
Fat mass loss was 2121,323.222 greater
for GPLC (2–2.4kg vs. 0.5kg)
a generalized extractor:
/\D+?([\d\,\.\-]+)/g
explanation:
/ # start pattern
\D+ # 1 or more non-digits
( # capture group 1
[\d,.-]+ # character class, 1 or more of digits, comma, period, hyphen
) # end capture group 1
/g # trailing regex g modifier (make regex continue after last match)
sorry I don't know c# well enough for a full writeup, but the pattern should plug right in.
see: http://www.radsoftware.com.au/articles/regexsyntaxadvanced.aspx for some implementation examples.
I came out with something like this atrocity:
-?\d(?:,?\d)*(?:\.(?:\d(?:,?\d)*\d|\d))?(?:[–-]-?\d(?:,?\d)*(?:\.(?:\d(?:,?\d)*\d|\d))?)?
Out of which -?\d(?:,?\d)*(?:\.(?:\d(?:,?\d)*\d|\d))? is repeated twice, with – in the middle (note that this is a long hyphen).
This should take care of dots and commas outside of numbers, eg: hello,23,45.2-7world - will capture 23,45.2-7.
It looks like you're trying to find all numbers in the string (possibly with commas inside the number), and all ranges of numbers such as "2-2.4". Here is a regex that should work:
\d+(?:[,.-]\d+)*
From C# 3, you can use it like this:
var input = "Fat mass loss was 2121,323.222 greater for GPLC (2-2.4kg vs. 0.5kg)";
var pattern = @"\d+(?:[,.-]\d+)*";
var matches = Regex.Matches(input, pattern);
// Declare the loop variable as Match; with var it is typed as object on .NET Framework.
foreach ( Match match in matches )
Console.WriteLine(match.Value);
Hmm, this is a tricky question, especially because the input string contains unicode character – (EN DASH) instead of - (HYPHEN-MINUS). Therefore the correct regex to match the numbers in the original string would be:
\d+(?:[\u2013,.]\d+)*
If you want a more generic approach would be:
\d+(?:[\p{Pd}\p{Pc}\p{Po}]\d+)*
which matches dash punctuation, connector punctuation and other punctuation. See here for more information about those.
An implementation in C# would look like this:
string input = "Fat mass loss was 2121,323.222 greater for GPLC (2–2.4kg vs. 0.5kg)";
try {
Regex rx = new Regex(@"\d+(?:[\p{Pd}\p{Pc}\p{Po}\p{C}]\d+)*", RegexOptions.IgnoreCase | RegexOptions.Multiline);
Match match = rx.Match(input);
while (match.Success) {
// matched text: match.Value
// match start: match.Index
// match length: match.Length
match = match.NextMatch();
}
} catch (ArgumentException ex) {
// Syntax error in the regular expression
}
Let's try this one :
(?=\d)([0-9,.-]+)(?<=\d)
It captures all expressions containing only :
"[0-9,.-]" characters,
must start with a digit "(?=\d)",
must finish with a digit "(?<=\d)"
It works with a single digit expression and does not include beginning or trailing [.,-].
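Here is a rough sketch of this lookaround pattern in C#. One change is an assumption of the sketch, not part of the answer above: \u2013 (EN DASH) is added to the character class so that the range in the original sentence is kept in one piece. NumberRangeExtractor and Extract are illustrative names.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class NumberRangeExtractor
{
    // A match must start with a digit (lookahead) and end with a digit (lookbehind);
    // in between it may contain digits, commas, dots, hyphens and en dashes.
    private const string Pattern = @"(?=\d)[0-9,.\u2013-]+(?<=\d)";

    public static List<string> Extract(string text)
    {
        var found = new List<string>();
        foreach (Match m in Regex.Matches(text, Pattern))
        {
            found.Add(m.Value);
        }
        return found;
    }

    static void Main()
    {
        string input =
            "Fat mass loss was 2121,323.222 greater for GPLC (2\u20132.4kg vs. 0.5kg)";

        foreach (string value in Extract(input))
        {
            Console.WriteLine(value);
        }
    }
}
```

The lookbehind makes the engine backtrack off any trailing punctuation, so a number at the end of a sentence is captured without its final dot.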
Hope this helps.
I got the solution to my problem.
The following is the Regex that gave my desired result:
(([0-9]+)([–.,-]*))+
