Find all kinds of numbers within a string - c#

The string can contain ints, floats and hexadecimal numbers, for example:
"This a string than can have -345 and 57 and could also have 35.4656 or a subtle 0xF46434 and more"
What could I use to find these numbers in C#?

Use something along these lines: (I wrote it myself, so I'm not going to say it's all-inclusive for whatever sort of numbers you're looking to find, but it works for your example)
var str = "123 This a string than can have -345 and 57 and could also have 35.4656 or a subtle 0XF46434 and more like -0xf46434";
var a = Regex.Matches(str, @"(?<=(^|[^a-zA-Z0-9_^]))(-?\d+(\.\d+)?|-?0[xX][0-9A-Fa-f]+)(?=([^a-zA-Z0-9_]|$))");
foreach (Match match in a)
{
//do something
}
Regex seems to be a write-only language, (i.e. incredibly hard to read) so I'll break it down so you can understand: (?<=(^|[^a-zA-Z0-9_^])) is a lookbehind to break it by a word boundary. I can't use \b because it considers - a boundary character, so it would only match 345 instead of -345. -?\d+(\.\d+)? matches decimal numbers, optionally negative, optionally with fractional digits. -?0[xX][0-9A-Fa-f]+ matches hexadecimal numbers, case insensitive, optionally negative. Finally, (?=([^a-zA-Z0-9_]|$)) is a lookahead, again as a word boundary. Note that in the first boundary, I allowed for the start of the string, and here I allow for the end of the string.
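Once the tokens are matched, they usually still need to be turned into numeric values. The following is a minimal sketch of that step and is not part of the original answer: the class name NumberExtractor, the method name Extract, and the decision to return everything as double are assumptions made for illustration, and hex values are assumed to fit in 64 bits.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text.RegularExpressions;

public static class NumberExtractor
{
    // Same pattern as in the answer above.
    private static readonly Regex NumberPattern = new Regex(
        @"(?<=(^|[^a-zA-Z0-9_^]))(-?\d+(\.\d+)?|-?0[xX][0-9A-Fa-f]+)(?=([^a-zA-Z0-9_]|$))");

    // Returns every matched token converted to a double (hypothetical helper).
    public static List<double> Extract(string text)
    {
        var values = new List<double>();
        foreach (Match match in NumberPattern.Matches(text))
        {
            string token = match.Value;
            bool negative = token.StartsWith("-", StringComparison.Ordinal);
            string unsigned = negative ? token.Substring(1) : token;

            double value;
            if (unsigned.StartsWith("0x", StringComparison.OrdinalIgnoreCase))
            {
                // Assumption: the hex digits fit in a 64-bit integer.
                value = Convert.ToInt64(unsigned.Substring(2), 16);
            }
            else if (!double.TryParse(unsigned, NumberStyles.AllowDecimalPoint,
                                      CultureInfo.InvariantCulture, out value))
            {
                // Skip anything the regex matched but the parser rejects.
                continue;
            }
            values.Add(negative ? -value : value);
        }
        return values;
    }
}
```

Calling NumberExtractor.Extract(str) on the sample string from the answer yields six values; the invariant culture is used so that "35.4656" parses the same way regardless of the machine's regional settings.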

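Just try to parse each word as a double and return the array of doubles. Note that double.TryParse does not recognize the 0x prefix, so hexadecimal values such as 0xF46434 are skipped.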
Here is a way to get array of doubles from a string:
double[] GetNumbers(string str)
{
double num;
List<double> l = new List<double>();
foreach (string s in str.Split(' '))
{
bool isNum = double.TryParse(s, out num);
if (isNum)
{
l.Add(num);
}
}
return l.ToArray();
}
See the documentation for double.TryParse() for more information.

Given your input above this expression matches every number present there
string line = "This a string than can have " +
"-345 and 57 and could also have 35.4656 " +
"or a subtle 0xF46434 and more";
Regex r = new Regex(@"(-?0[Xx][A-Fa-f0-9]+|-?\d+\.\d+|-?\d+)");
var m = r.Matches(line);
foreach(Match h in m)
Console.WriteLine(h.ToString());
EDIT: To replace the matches, use the Replace overload that takes a MatchEvaluator:
string result = r.Replace(line, new MatchEvaluator(replacementMethod));
public string replacementMethod(Match match)
{
return "?????";
}
Explaining the regex pattern
First, the sequence "(pattern1|pattern2|pattern3)" means that we have three possible patterns to find in our string. One of them is enough to have a match.
First pattern -?0[Xx][A-Fa-f0-9]+ means an optional minus followed by a zero followed by an X or x char followed by a series of one or more chars in the range A-F a-f or 0-9
Second pattern -?\d+\.\d+ means an optional minus followed by a series of 1 or more digits followed by the decimal point followed by a series of 1 or more digits
Third pattern -?\d+ means an optional minus followed by a series of 1 or more digits.
The order of the patterns is of utmost importance. The engine takes the first alternative that matches, so if you put the integer pattern before the decimal pattern, 35.4656 is reported as 35 and 4656 instead of a single number.
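To see the effect of the ordering, the two variants can be run side by side. This is a small illustrative sketch; the class name AlternationOrderDemo and the helper MatchAll are made up for the example and are not from the answer.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class AlternationOrderDemo
{
    // Order used in the answer: hex, then decimal, then integer.
    public const string HexFirst = @"-?0[Xx][A-Fa-f0-9]+|-?\d+\.\d+|-?\d+";

    // Same alternatives with the integer pattern moved to the front.
    public const string IntegerFirst = @"-?\d+|-?\d+\.\d+|-?0[Xx][A-Fa-f0-9]+";

    // Returns the text of every match, in order of appearance.
    public static string[] MatchAll(string pattern, string input)
    {
        return Regex.Matches(input, pattern)
                    .Cast<Match>()
                    .Select(m => m.Value)
                    .ToArray();
    }
}
```

For the input "35.4656 and 0xF46434", MatchAll with HexFirst returns 35.4656 and 0xF46434, while IntegerFirst returns 35, 4656, 0 and 46434, because the integer alternative succeeds before the longer alternatives are ever tried.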

Besides regex, which tends to have its own problems, you can build a state machine to do the processing. You can decide on which inputs the machine would accept as 'numbers'. Unlike regex, a state machine will have predictably decent performance, and will also give you predictable results (whereas regex can sometimes match rather surprising things).
It's not really that difficult, when you think about it. There are rather few states, and you can define special cases explicitly.
EDIT: The following is an edit as a response to the comment.
In .NET, Regex is implemented as an NFA (Nondeterministic Finite Automaton). On one hand, it's a very powerful parser, but on the other, it can sometimes backtrack much more than it should. This is especially true when you're accepting unsafe input (input from the user, which can be just about anything). While I'm not sure what sort of Regex expression you'll be using to parse the result, you can induce a performance hit in pretty much anything. Although in most cases performance is a non-issue, Regex performance can scale exponentially with the input. That means that, in some cases, it really can be a bottleneck. And a rather unexpected one.
Another potential problem stemming from the greedy nature of Regex is that sometimes it can match unexpected things. You might use the same Regex expression for days, and it might work fine, waiting for the right combination of overlooked characters to be parsed, and you'll end up writing garbage into your database.
By state machine, I mean parsing the input using a deterministic finite automaton, or something like that. I'll show you what I mean. Here's a small DFA for parsing a positive decimal integer or float within a string. I'm pretty sure you can build a DFA using frameworks like ANTLR, though I'm sure there are also less powerful ones around.
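As a rough idea of what such a hand-written state machine can look like, here is a sketch that recognizes unsigned decimal integers and floats in a single left-to-right pass. It is not the answerer's original DFA; the class name NumberScanner, the state names, and the decision to ignore word boundaries (so "abc123" yields 123) are all assumptions made for the example.

```csharp
using System;
using System.Collections.Generic;

public static class NumberScanner
{
    private enum State { Outside, Integer, Point, Fraction }

    // Scans the text once and returns each unsigned integer or float
    // (digits, optionally followed by '.' and more digits).
    public static List<string> Scan(string text)
    {
        var found = new List<string>();
        var state = State.Outside;
        int start = 0;

        // One extra iteration with a sentinel flushes a number at the end.
        for (int i = 0; i <= text.Length; i++)
        {
            char c = i < text.Length ? text[i] : '\0';
            bool digit = c >= '0' && c <= '9';

            switch (state)
            {
                case State.Outside:
                    if (digit) { start = i; state = State.Integer; }
                    break;
                case State.Integer:
                    if (digit) break;
                    if (c == '.') { state = State.Point; break; }
                    found.Add(text.Substring(start, i - start));
                    state = State.Outside;
                    break;
                case State.Point:
                    if (digit) { state = State.Fraction; break; }
                    // The '.' was not followed by a digit: emit the integer only.
                    found.Add(text.Substring(start, i - 1 - start));
                    state = State.Outside;
                    break;
                case State.Fraction:
                    if (digit) break;
                    found.Add(text.Substring(start, i - start));
                    state = State.Outside;
                    break;
            }
        }
        return found;
    }
}
```

Each character is examined exactly once, so the running time is linear in the input length no matter what the input looks like. Signs and hexadecimal literals would need additional states.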

Related

C# Regular Expression for String matching

I am looking for a regular expression that returns success only if the input string contains following characters:
a-zA-Z0-9~!#$^ ()_-+’:.?
Is this regular expression correct?
^[a-zA-Z0-9~!#$^ ()_-+’:.?]+$
I have understood what ^ means here but not sure about +$. Also are there any alternatives to this? By the way the above regular expression also includes a space character between ^ and (
it only contains the characters listed above
bool invalidCharsExist =
Regex.Replace(input, @"[a-zA-Z0-9~!#\$\^\ \(\)_\-\+’:\.\?]", "").Length != 0;
BTW: This is not fully equivalent to your regex (It will also include non-ascii letters and digits) but I think it is a better way to check
var specialChars = new HashSet<char>("~!#$^ ()_-+’:.?");
var allValid = input.All(c => char.IsLetterOrDigit(c) || specialChars.Contains(c));
Close, but get rid of that dash in the middle of your character class and put it at the beginning:
^[-a-zA-Z0-9~!#$^ ()_+’:.?]+$
And make sure that when you put it in a C# string you use a verbatim string literal (the @ prefix):
@"^[-a-zA-Z0-9~!#$^ ()_+’:.?]+$"
As to whether or not you can do it in other ways, sure, for example a negative look-ahead that doesn't actually match anything. I don't think a proper regex optimizer would leave one better than the other, it's just a matter of preference. Do you want something that looks to succeed (selects the entire string if valid), or something that looks to fail (negative look-ahead).
Honestly if performance is at all important, you should write a good old for and loop over the characters (or the equivalent LINQ implementation). Regex won't even be in the ballpark.
the regular expression would be: ^[a-zA-Z0-9~!#$^ ()_\-+’:.?]+$
I personally recommend using https://regex101.com to check regular expressions. Note that it did not have C# support when this was written, but JavaScript's RegExp syntax is generally similar to C#'s. What it does give you is a particularly useful explanation of what your expression is doing. Here is this expression's explanation from there:
^ assert position at start of the string
[a-zA-Z0-9~!#$^ ()_\-\+’:.?]+ match a single character present in the list below
Quantifier: + Between one and unlimited times, as many times as possible, giving back as needed [greedy]
a-z a single character in the range between a and z (case sensitive)
A-Z a single character in the range between A and Z (case sensitive)
0-9 a single character in the range between 0 and 9
~!#$^ ()_ a single character in the list ~!#$^ ()_ literally
\- matches the character - literally
+’:.? a single character in the list ’:.? literally
$ assert position at end of the string
the issue with what you put in the OP was literally only forgetting to escape the - as it is reserved in the regular expression pattern to be used for special purposes (i.e in the [] notation the - is reserved to declare a character range like a-z)

Regular expression only one character and 7 numbers

I want a regex that matches exactly one letter, at any position in the word, plus 7 digits.
match example:
1111111q
2222222q
111e1111
11e11111
I tried this pattern, but it does not work for all of the examples:
[A-Za-z][0-9]{7}
Regular expressions match patterns. In your case, it would seem that the letter can be at any point in your string, which would mean that you would have a multitude of patterns which would need to be taken into consideration.
I think that for this case, you should not use regular expressions, for simplicity's sake. I would recommend you take a look at the Char.IsDigit(Char c) and Char.IsLetter(Char c) methods and use counters to check that the string is in the format you are after.
there are readily available methods in C# for checking the conditions you want. I would use Regex if there is no parser or simple c# solution.
I would do like below
var str = "1111111u";
var isValid = str.Length ==8 &&
str.Where(char.IsDigit).Count() ==7 &&
str.Where(char.IsLetter).Count() ==1;
It is not that difficult in regex:
If the complete string has to match just use:
^(?=.{8}$)\d*[a-zA-Z]\d*$
See it here on regexr.
If this is a word in a larger text use:
\b(?=[a-z0-9]{8}\b)\d*[a-z]\d*\b
See it here on Regexr
\d*[a-z]\d* matches any amount of digits, followed by one letter, then again any amount of digits.
(?=[a-z0-9]{8}\b) is a positive lookahead assertion; it ensures a total length of 8.
Important here is the use of anchors or word boundaries to avoid partial wrong matches.
If you really want to match any letter then use the Unicode property \p{L} instead of the character class:
^(?=.{8}$)\d*\p{L}\d*$
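For completeness, this is roughly how the whole-string pattern could be wrapped in C#. The class name OneLetterSevenDigits and the method name IsValid are placeholders chosen for this sketch, not names from the answer.

```csharp
using System;
using System.Text.RegularExpressions;

public static class OneLetterSevenDigits
{
    // The lookahead fixes the total length at 8; the body then allows
    // digits, exactly one ASCII letter, and digits again.
    private static readonly Regex Pattern =
        new Regex(@"^(?=.{8}$)\d*[a-zA-Z]\d*$");

    public static bool IsValid(string candidate)
    {
        return candidate != null && Pattern.IsMatch(candidate);
    }
}
```

IsValid returns true for the four examples in the question and false for inputs such as "11111111" (no letter), "11ee1111" (two letters) or "111e111" (too short). Keep in mind that in .NET $ also matches just before a trailing newline; replacing both $ with \z makes the check strict if that matters.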
I can only come up with a "brute force" regex method:
foundMatch = Regex.IsMatch(subjectString,
@"\b
(?:[a-z]\d{7}|
\d[a-z]\d{6}|
\d{2}[a-z]\d{5}|
\d{3}[a-z]\d{4}|
\d{4}[a-z]\d{3}|
\d{5}[a-z]\d{2}|
\d{6}[a-z]\d{1}|
\d{7}[a-z])
\b",
RegexOptions.IgnoreCase | RegexOptions.IgnorePatternWhitespace);
Note the word boundary anchors, which you should remove if this pattern is part of a longer string.
Also note the IgnoreCase option, which you can remove if all letters will be lower case.
Edit: See @stema's answer for a much more concise regex.
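This will match most of what you want (note that \w also matches digits and underscores, and the case where the letter comes first is not covered):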
(\d{1}\w\d{6}|\d{2}\w\d{5}|\d{3}\w\d{4}|\d{4}\w\d{3}|\d{5}\w\d{2}|\d{6}\w\d{1}|\d{7}\w)
I generated it like this, in powershell:
$n = 6;
for ($i = 1; $i -le 6; $i++) {
write-host "\d{"$i"}\w\d{"$n"}"
$n--
}
Your example will only work when the character is the first character in the string.
The problem you've got is that you need a total of 7 digits, and absolutely only one character potentially within those 7 digits. This is not something that's possible with regular expressions as defined in theory, because you have to have a link between the two groups of digits to see how many are in the other group and regexes can't carry that kind of context around with them.
I was wondering if it was possible using a lookahead assertion to ensure there's only one letter, but the best I can do is ensuring there's no instance of two letters in a row, which doesn't cover all possible invalid cases. Thus I think you're going to have to find another method, as npinti suggested. So something like:
public static bool Match(string s) {
return (s.Length == 8) &&
(s.Where(Char.IsDigit).Count() == 7) &&
(s.Where(Char.IsLetter).Count() == 1);
}
But I haven't tested that.
Just use this if you want one letter and 7 digits:
"[A-Za-z]{1}[0-9]{7}|[0-9]{7}[A-Za-z]{1}|[0-9]{1}[A-Za-z]{1}[0-9]{6}|[0-9]{6}[A-Za-z]{1}[0-9]{1}|[0-9]{2}[A-Za-z]{1}[0-9]{5}|[0-9]{3}[A-Za-z]{1}[0-9]{4}|[0-9]{4}[A-Za-z]{1}[0-9]{3}|[0-9]{5}[A-Za-z]{1}[0-9]{2}"
And here is a code snippet showing how you can iterate through your results:
string st = "1111111q 2222222q 111e1111 11e11111";
string pattS = @"[A-Za-z]{1}[0-9]{7}|[0-9]{7}[A-Za-z]{1}|[0-9]{1}[A-Za-z]{1}[0-9]{6}|[0-9]{6}[A-Za-z]{1}[0-9]{1}|[0-9]{2}[A-Za-z]{1}[0-9]{5}|[0-9]{3}[A-Za-z]{1}[0-9]{4}|[0-9]{4}[A-Za-z]{1}[0-9]{3}|[0-9]{5}[A-Za-z]{1}[0-9]{2}";
Regex regex = new Regex(pattS);
var res = regex.Matches(st);
foreach (Match re in res)
{
// re.Value holds the matched text
}
check here on rubular it covers all examples you provide
You can use this pattern:
^([0-9])(?:\1|[a-z](?!.*[a-z])){7}|[a-z]([0-9])\2{6}$
With Regex, you can do it in two steps. First you can remove the character, in whatever position it is:
string input = "111a1111";
Regex rgx = new Regex(@"[a-zA-Z]");
string output = rgx.Replace(input, "", 1); // remove only one character
// output = "1111111"
then you can match with ^[0-9]{7}$ (if you don't want all digits to be the same)
or with ^(\d)\1{6}$ (if you want 7 occurrences of the same digit)

Regular expression to detect whether input is a formatted number

Regular Expressions have always seemed like black magic to me and I have never been able to get my head around building them.
I am now in need of a Reg Exp (for validation purposes) that checks that the user enters a number according to the following rules.
no alpha characters
can have decimal
can have commas for the thousands, but the commas must be correctly placed
Some examples of VALID values:
1.23
100
1,234
1234
1,234.56
0.56
1,234,567.89
INVALID values:
1.ab
1,2345.67
0,123.45
1.24,687
You can try the following expression
^([1-9]\d{0,2}(,\d{3})+|[1-9]\d*|0)(\.\d+)?$
Explanation:
The part before the point consists of
either 1-3 digits (the first one nonzero) followed by one or more groups of a comma plus three digits
or just digits with no leading zero (at least one digit)
or a single 0
If a dot follows, at least one digit must follow it.
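A quick way to gain confidence in the expression is to run it against the valid and invalid samples from the question. The wrapper below is only a sketch; the class name FormattedNumber and the method IsValid are names invented for the example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class FormattedNumber
{
    // Pattern from the answer above, unchanged.
    private static readonly Regex Pattern =
        new Regex(@"^([1-9]\d{0,2}(,\d{3})+|[1-9]\d*|0)(\.\d+)?$");

    public static bool IsValid(string input)
    {
        return input != null && Pattern.IsMatch(input);
    }
}
```

With this pattern, all seven VALID samples (1.23, 100, 1,234, 1234, 1,234.56, 0.56, 1,234,567.89) pass and all four INVALID samples (1.ab, 1,2345.67, 0,123.45, 1.24,687) are rejected.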
^(((([1-9][0-9]{0,2})(,[0-9]{3})*)|([0-9]+)))?(\.[0-9]+)?$
This works for all of your examples of valid data, and will also accept decimals that start with a decimal point. (I.e. .61, .07, etc.)
I noticed that all of your examples of valid decimals (1.23, 1,234.56, and 1,234,567.89) had exactly two digits after the decimal point. I'm not sure if this is coincidence, or if you actually require exactly two digits after the decimal point. (I.e. maybe you're working with money values.) The regular expression as I've written it works for any number of digits after the decimal point. (I.e. 1.2345 and 1,234.56789 would be considered valid.) If you need there to be exactly two digits after the decimal point, change the end of the regular expression from +)?$ to {2})?$.
try to use this regex
^(\d{1,3}[,](\d{3}[,])*\d{3}(\.\d{1,3})?|\d{1,3}(\.\d+)?)$
I know you asked for a regex but I think it's much saner to just call double.TryParse() and consider your input acceptable if that method returns true.
double dummy;
var isValid=double.TryParse(text, out dummy);
It won't match your testcases exactly; the major difference being that it is very lenient with commas (so it will accept two of your INVALID inputs).
I'm not sure why you care, but if you really do want comma strictness you could do a preprocessing step where you only check the validity of comma placement and then call double.TryParse() only if the string passes the comma placement test. (If you want to be truly careful, you'll have to honor the CultureInfo so you can know what character is used for separators, and how many digits there are between separators, in the environment your program finds itself in)
Either approach results in code that is more "obviously right" than a regex. For example, you won't have to live with the fear that your regex left out some important case, like scientific notation.
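The two-step idea (check comma placement first, then let the framework parse) might look something like the sketch below. The class name StrictNumberParser and the method TryParseStrict are hypothetical, and the invariant culture is hard-coded here as an assumption; a real application would pick the separators from the appropriate CultureInfo.

```csharp
using System;
using System.Globalization;

public static class StrictNumberParser
{
    // Step 1 checks comma placement by hand; step 2 delegates the rest
    // to double.TryParse. Assumes ',' groups thousands and '.' is the
    // decimal point (invariant culture).
    public static bool TryParseStrict(string text, out double value)
    {
        value = 0;
        if (string.IsNullOrEmpty(text)) return false;

        // Only the part before the decimal point may contain commas.
        int point = text.IndexOf('.');
        string integerPart = point >= 0 ? text.Substring(0, point) : text;
        string rest = point >= 0 ? text.Substring(point) : "";
        if (rest.Contains(",")) return false;

        string[] groups = integerPart.Split(',');
        if (groups.Length > 1)
        {
            // First group: 1 to 3 characters, not starting with 0;
            // every later group: exactly 3 characters.
            if (groups[0].Length < 1 || groups[0].Length > 3) return false;
            if (groups[0][0] == '0') return false;
            for (int i = 1; i < groups.Length; i++)
                if (groups[i].Length != 3) return false;
        }

        // TryParse verifies that the characters are actually digits.
        return double.TryParse(text,
            NumberStyles.AllowThousands | NumberStyles.AllowDecimalPoint,
            CultureInfo.InvariantCulture, out value);
    }
}
```

Signs and exponents are deliberately not enabled in the NumberStyles flags, since the question's rules do not mention them; adding NumberStyles.AllowLeadingSign or AllowExponent would extend the accepted formats.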

Match pattern of [0-9]-[0-9]-[0-9], but without matching [0-9]-[0-9]

I'm not sure how to accomplish this with a regular expression (or if I can; I'm new to regex). I have an angle value the user will type in and I'm trying to validate the entry. It is in the form degrees-minutes-seconds. The problem I'm having, is that if the user mistypes the seconds portion, I have to catch that error, but my match for degrees-minutes is a success.
Perhaps the method will explain better:
private Boolean isTextValid(String _angleValue) {
Regex _degreeMatchPattern = new Regex("0*[1-9]");
Regex degreeMinMatchPattern = new Regex("(0*[0-9]-{1}0*[0-9]){1}");
Regex degreeMinSecMatchPattern = new Regex("0*[0-9]-{1}0*[0-9]-{1}0*[0-9]");
Match _degreeMatch, _degreeMinMatch, _degreeMinSecMatch;
_degreeMinSecMatch = degreeMinSecMatchPattern.Match(_angleValue);
if (_degreeMinSecMatch.Success)
return true;
_degreeMinMatch = degreeMinMatchPattern.Match(_angleValue);
if (_degreeMinMatch.Success)
return true;
_degreeMatch = _degreeMatchPattern.Match(_angleValue);
if (_degreeMatch.Success)
return true;
return false;
}
}
I want to check for degrees-minutes if the degrees-minutes-seconds match is unsuccessful, but only if the user didn't enter any seconds data. Can I do this via regex, or do I need to parse the string and evaluate each portion separately? Thanks.
EDIT: Sample data would be 45-23-10 as correct data. The problem is 45-23 is also valid data; the 0 seconds is understood. So if the user types 45-23-1= by accident, the degreeMinMatchPattern regex in my code will match successfully, even though it is invalid.
Second EDIT: Just to make it clear, the minutes and second portions are both optional. The user can type 45 and that is valid.
You can specify "this part of the pattern must match at least 3 times" using the {m,} syntax. Since there are hyphens between each component, specify the first part separately, and then each hyphen-digit combination can be grouped together after:
`[0-9](-[0-9]){2,}`
You also can shorten [0-9] to \d: \d(-\d){2,}
First off, a character in a regex is matched once by default, so {1} is redundant.
Second, since you can apparently isolate this value (you prompt for just this value, instead of having to look for it in a paragraph of entered data) you should include ^ and $ in your string, to enforce that the string should contain ONLY this pattern.
Try "^\d{1,3}-\d{1,2}(-\d{1,2})?$".
Breaking it down: ^ matches the beginning of the string. \d matches any single decimal digit, and after that you're specifying {1,3}, which will match a run of one to three digits. Then you're looking for one dash, then a similar digit pattern but only one or two times. The last term is enclosed in parentheses so we can group the characters. Its form is similar to the first two, then there's a ? which marks the preceding group as optional. The $ at the end indicates that the input should end. Given this, it will match 222-33-44 or 222-33, but not 222-3344 or 222-33-abc.
Keep in mind there are additional rules you might want to incorporate. For instance, seconds can be expressed as a decimal (if you want a resolution smaller than one second). You would need to optionally expect the decimal point and one or more additional digits. Also, you probably have a maximum degree value; the above regex will match the maximum integer DMS value of 359-59-59, however it will also match 999-99-99, which is not valid. You can limit the maximum value using regex (for example "(3[0-5]\d|[1-2]\d{2}|\d{1,2})" will match any number from 0 to 359, by matching a 3, then 0-5, then 0-9, OR any 3-digit number starting with 1 or 2, OR any one- or two-digit number), but as the example shows the regex will get long and messy, so document it well in code as to what you're doing.
Maybe you would do better to just parse the input out and check each piece separately.
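Parsing the pieces separately could be sketched as follows. The class name AngleValidator and the limits (degrees 0-359, minutes and seconds 0-59, whole numbers only) are assumptions for this example; the question itself does not state the allowed ranges.

```csharp
using System;
using System.Globalization;

public static class AngleValidator
{
    // Accepts "D", "D-M" or "D-M-S" with whole-number parts.
    // Assumed limits: degrees 0-359, minutes and seconds 0-59.
    public static bool IsValid(string angle)
    {
        if (string.IsNullOrEmpty(angle)) return false;

        string[] parts = angle.Split('-');
        if (parts.Length > 3) return false;

        int[] max = { 359, 59, 59 };
        for (int i = 0; i < parts.Length; i++)
        {
            int value;
            // NumberStyles.None permits digits only: no sign, spaces or separators.
            if (!int.TryParse(parts[i], NumberStyles.None,
                              CultureInfo.InvariantCulture, out value))
                return false;
            if (value > max[i]) return false;
        }
        return true;
    }
}
```

Because every part must parse on its own, a mistyped entry such as 45-23-1= is rejected rather than silently accepted as 45-23, while 45, 45-23 and 45-23-10 all pass.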
I'm not sure I understand correctly, but I think
(?<degrees>0*[0-9])-?(?<minutes>0*[0-9])(?:-?(?<seconds>0*[0-9]))?$
might work.
But this is quite ambiguous; also I'm wondering why you're only allowing single-digit degree/minute/second values. Please show some examples you do and don't want to match.
Maybe you should try something like this and test for empty/invalid groups:
// Named dmsPattern so it does not clash with the local variable "degrees" below.
Regex dmsPattern = new Regex(
@"(?<degrees>\d+)(?:-(?<minutes>\d+))?(?:-(?<seconds>\d+))?");
string[] samples = new[] { "123", "123-456", "123-456-789" };
foreach (var sample in samples)
{
Match m = dmsPattern.Match(sample);
if (m.Success)
{
// Groups that did not take part in the match have an empty Value.
string degrees = m.Groups["degrees"].Value;
string minutes = m.Groups["minutes"].Value;
string seconds = m.Groups["seconds"].Value;
Console.WriteLine("{0}°{1}'{2}\"", degrees,
String.IsNullOrEmpty(minutes) ? "0" : minutes,
String.IsNullOrEmpty(seconds) ? "0" : seconds
);
}
}

Regular expression to match a decimal value range

Is there an easy way to take a dynamic decimal value and create a validation regular expression that can handle this?
For example, I know that /1[0-9]{1}[0-9]{1}/ should match anything from 100-199, so what would be the best way to programmatically create a similar structure given any decimal number?
I was thinking that I could just loop through each digit and build one from there, but I have no idea how I would go about that.
Ranges are difficult to handle correctly with regular expressions. REs are a tool for text-based analysis or pattern matching, not semantic analysis. The best that you can probably do safely is to recognize a string that is a number with a certain number of digits. You can build REs for the maximum or minimum number of digits for a range using a base 10 logarithm: a number n has floor(log10(n)) digits after its leading digit. For example, to match a number between a and b where b > a, construct the RE by:
re = "[1-9][0-9]{"
re += str(floor(log10(a)))
re += ","
re += str(floor(log10(b)))
re += "}"
Note: the example is in no particular programming language. Sorry, C# not really spoken here.
There are some boundary point issues, but the basic idea is to construct an RE like [1-9][0-9]{2} for anything between 100 and 999 and then if the string matches the expression, convert to an integer and do the range analysis in value space instead of lexical space.
With all of that said... I would go with Mehrdad's solution and use something provided by the language like decimal.TryParse and then range check the result.
^[-]?\d+(\.\d+)?$
will validate a number with an optional decimal point and / or minus sign at the front
No, is the simple answer. Generating the regex that will work correctly would be more complicated than doing the following:
Decimal regex (find the decimal numbers in a string). "^\$?[+-]?[\d,]*(\.\d*)?$"
Convert result to decimal and compare to your range. (decimal.TryParse)
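The parse-then-compare step that several answers here recommend is short enough to show in full. This is a sketch only: the class name RangeValidator, the method IsInRange, and the use of the invariant culture are assumptions for illustration.

```csharp
using System;
using System.Globalization;

public static class RangeValidator
{
    // True when text parses as a decimal and lies in [min, max] inclusive.
    public static bool IsInRange(string text, decimal min, decimal max)
    {
        decimal value;
        return decimal.TryParse(text, NumberStyles.Number,
                                CultureInfo.InvariantCulture, out value)
            && value >= min && value <= max;
    }
}
```

For example, IsInRange("150", 100m, 199m) is true, while "200", "199.5" and "abc" are all rejected for the same bounds. Since the bounds are ordinary decimal arguments, they can be computed at run time without generating any pattern.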
This depends on where and what you want to parse.
Use the RegEx below to parse strings for numbers.
It can handle commas and dots.
[^\d.,](?<number>(\d{1,3}(\.\d{3})*,\d+|\d{1,3}(,\d{3})*\.\d+|\d*[,\.]\d+|\d+))[^\d.,]
