Is there an easy way to take a dynamic decimal value and create a validation regular expression that can handle this?
For example, I know that /1[0-9]{1}[0-9]{1}/ should match anything from 100-199, so what would be the best way to programmatically create a similar structure given any decimal number?
I was thinking that I could just loop through each digit and build one from there, but I have no idea how I would go about that.
Ranges are difficult to handle correctly with regular expressions. REs are a tool for text-based analysis or pattern matching, not semantic analysis. The best that you can probably do safely is to recognize a string that is a number with a certain number of digits. You can build REs for the maximum or minimum number of digits for a range using a base 10 logarithm. For example, to match a number between a and b where b > a, construct the RE by:
re = "[1-9][0-9]{"
re += str(floor(log10(a)))
re += ","
re += str(floor(log10(b)))
re += "}"
Note: the example is in no particular programming language. Sorry, C# not really spoken here.
There are some boundary point issues, but the basic idea is to construct an RE like [1-9][0-9]{2} for anything between 100 and 999 and then, if the string matches the expression, convert it to an integer and do the range analysis in value space instead of lexical space.
With all of that said... I would go with Mehrdad's solution and use something provided by the language like decimal.TryParse and then range check the result.
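A rough C# sketch of the two-step idea above: a digit-count regex first, then the range check in value space. The class and method names are made up for illustration, and it assumes positive integers with min >= 1 that fit in a long.

```csharp
using System.Globalization;
using System.Text.RegularExpressions;

public static class RangeCheck
{
    // Builds a pattern that only constrains the NUMBER OF DIGITS
    // (assumption: 1 <= min <= max). It does not check the value itself.
    public static string DigitCountPattern(long min, long max)
    {
        int minDigits = min.ToString(CultureInfo.InvariantCulture).Length;
        int maxDigits = max.ToString(CultureInfo.InvariantCulture).Length;
        // One leading non-zero digit, then (digits - 1) further digits.
        return "^[1-9][0-9]{" + (minDigits - 1) + "," + (maxDigits - 1) + "}$";
    }

    // Lexical check first, then the real comparison on the parsed value.
    public static bool IsInRange(string text, long min, long max)
    {
        if (!Regex.IsMatch(text, DigitCountPattern(min, max)))
            return false;

        long value;
        return long.TryParse(text, NumberStyles.None,
                             CultureInfo.InvariantCulture, out value)
            && value >= min && value <= max;
    }
}
```

The regex step is really only a pre-filter here; the TryParse plus comparison does the actual work, which is why skipping the regex entirely is usually the simpler choice.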
^-?\d+(\.\d+)?$
will validate a number with an optional decimal point and / or minus sign at the front
No, is the simple answer. Generating the regex that will work correctly would be more complicated than doing the following:
Decimal regex (find the decimal numbers in a string). "^\$?[+-]?[\d,]*(\.\d*)?$"
Convert result to decimal and compare to your range. (decimal.TryParse)
This depends on where and what you want to parse.
Use the RegEx below to parse strings for numbers.
It can handle commas and dots.
[^\d.,](?<number>(\d{1,3}(\.\d{3})*,\d+|\d{1,3}(,\d{3})*\.\d+|\d*[,\.]\d+|\d+))[^\d.,]
Related
I want to write a lexical parser for regular text.
So I need to detect the following tokens:
1) Word
2) Number
3) dot and other punctuation
4) "..." "!?" "!!!" and so on
I think it is not trivial to write an "if else" condition for each item.
So are there any finite state machine generators for C#?
I know of ANTLR and others, but in the time it would take me to learn how to work with these tools I could write my own "if else" FSM.
I hope to find something like:
FiniteStateMachine.AddTokenDefinition(":)","smile");
FiniteStateMachine.AddTokenDefinition(".","dot");
FiniteStateMachine.ParseText(text);
I suggest using Regular Expressions. Something like @"[a-zA-Z\-]+" will pick up words (a-z and dashes), while @"[0-9]*(\.[0-9]+)?" will pick up numbers (including decimal numbers). Dots and such are similar - @"[!\.\?]+" - and you can just add whatever punctuation you need inside the square brackets (escaping special Regex characters with a \).
Poor man's "lexer" for C# is very close to what you are looking for, in terms of being a lexer. I recommend googling regular expressions for words and numbers or whatever else you need, to find out exactly which expressions you need.
EDIT:
Or see Justin's answer for the particular regexes.
We need to know specifics on what you consider a word or a number. That being said, I'll assume "word" means "a C#-style identifier," and "number" means "a string of base-10 numerals, possibly including (but not starting or ending with) a decimal point."
Under those definitions, words would be anything matching the following regex:
@"\b(?!\d)\w+\b"
Note that this would also match unicode. Numbers would match the following:
@"\b\d+(?:\.\d+)?\b"
Note again that this doesn't cover hexadecimal, octal, or scientific notation, although you could add that in without too much difficulty. It also doesn't cover numeric literal suffixes.
After matching those, you could probably get away with this for punctuation:
@"[^\w\d\s]+"
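To show how the three expressions could work together as a tiny tokenizer, here is a minimal sketch. SimpleLexer, Tokenize and the group names are invented for this example, and the punctuation class is written as [^\w\s] because \d is already covered by \w.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class SimpleLexer
{
    // One alternation with a named group per token kind.
    // Alternatives are tried left to right at each position.
    private static readonly Regex TokenPattern = new Regex(
        @"(?<number>\b\d+(?:\.\d+)?\b)|(?<word>\b(?!\d)\w+\b)|(?<punct>[^\w\s]+)");

    // Returns (kind, text) pairs in the order they appear; whitespace is skipped.
    public static List<KeyValuePair<string, string>> Tokenize(string text)
    {
        var tokens = new List<KeyValuePair<string, string>>();
        foreach (Match m in TokenPattern.Matches(text))
        {
            string kind = m.Groups["number"].Success ? "number"
                        : m.Groups["word"].Success ? "word"
                        : "punct";
            tokens.Add(new KeyValuePair<string, string>(kind, m.Value));
        }
        return tokens;
    }
}
```

Runs of punctuation such as "..." or "!?" come back as a single token, which is roughly what the question asked for. Specific tokens like ":)" would need their own alternative placed ahead of the generic punctuation group.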
The string can contain ints, floats and hexadecimal numbers, for example:
"This a string than can have -345 and 57 and could also have 35.4656 or a subtle 0xF46434 and more"
What could I use to find these numbers in C#?
Use something along these lines: (I wrote it myself, so I'm not going to say it's all-inclusive for whatever sort of numbers you're looking to find, but it works for your example)
var str = "123 This a string than can have -345 and 57 and could also have 35.4656 or a subtle 0XF46434 and more like -0xf46434";
var a = Regex.Matches(str, @"(?<=(^|[^a-zA-Z0-9_^]))(-?\d+(\.\d+)?|-?0[xX][0-9A-Fa-f]+)(?=([^a-zA-Z0-9_]|$))");
foreach (Match match in a)
{
    //do something
}
Regex seems to be a write-only language (i.e. incredibly hard to read), so I'll break it down so you can understand:
(?<=(^|[^a-zA-Z0-9_^])) is a lookbehind to break it by a word boundary. I can't use \b because it considers - a boundary character, so it would only match 345 instead of -345.
-?\d+(\.\d+)? matches decimal numbers, optionally negative, optionally with fractional digits.
-?0[xX][0-9A-Fa-f]+ matches hexadecimal numbers, case insensitive, optionally negative.
Finally, (?=([^a-zA-Z0-9_]|$)) is a lookahead, again as a word boundary. Note that in the first boundary, I allowed for the start of the string, and here I allow for the end of the string.
Just try to parse each word to double and return the array of doubles.
Here is a way to get array of doubles from a string:
double[] GetNumbers(string str)
{
    double num;
    List<double> l = new List<double>();
    foreach (string s in str.Split(' '))
    {
        bool isNum = double.TryParse(s, out num);
        if (isNum)
        {
            l.Add(num);
        }
    }
    return l.ToArray();
}
See the double.TryParse() documentation for more information.
Given your input above, this expression matches every number present there:
string line = "This a string than can have " +
              "-345 and 57 and could also have 35.4656 " +
              "or a subtle 0xF46434 and more";
Regex r = new Regex(@"(-?0[Xx][A-Fa-f0-9]+|-?\d+\.\d+|-?\d+)");
var m = r.Matches(line);
foreach (Match h in m)
    Console.WriteLine(h.ToString());
EDIT: To replace the matches, use the Replace overload that takes a MatchEvaluator:
string result = r.Replace(line, new MatchEvaluator(replacementMethod));
public string replacementMethod(Match match)
{
    return "?????";
}
Explaining the regex pattern
First, the sequence "(pattern1|pattern2|pattern3)" means that we have three possible patterns to find in our string. Any one of them is enough to have a match.
First pattern -?0[Xx][A-Fa-f0-9]+ means an optional minus followed by a zero followed by an X or x char followed by a series of one or more chars in the range A-F a-f or 0-9
Second pattern -?\d+\.\d+ means an optional minus followed by a series of 1 or more digits followed by the decimal point followed by a series of 1 or more digits
Third pattern -?\d+ means an optional minus followed by a series of 1 or more digits.
The sequence of patterns is of utmost importance. If you reverse the pattern and put the integer match before the decimal pattern the results will be wrong.
Besides regex, which tends to have its own problems, you can build a state machine to do the processing. You can decide on which inputs the machine would accept as 'numbers'. Unlike regex, a state machine will have predictably decent performance, and will also give you predictable results (whereas regex can sometimes match rather surprising things).
It's not really that difficult, when you think about it. There are rather few states, and you can define special cases explicitly.
EDIT: The following is an edit as a response to the comment.
In .NET, Regex is implemented as an NFA (Nondeterministic Finite Automaton). On one hand, it's a very powerful parser, but on the other, it can sometimes backtrack much more than it should. This is especially true when you're accepting unsafe input (input from the user, which can be just about anything). While I'm not sure what sort of Regex expression you'll be using to parse the result, you can induce a performance hit in pretty much anything. Although in most cases performance is a non-issue, Regex performance can scale exponentially with the input. That means that, in some cases, it really can be a bottleneck. And a rather unexpected one.
Another potential problem stemming from the greedy nature of Regex is that sometimes it can match unexpected things. You might use the same Regex expression for days, and it might work fine, waiting for the right combination of overlooked characters to be parsed, and you'll end up writing garbage into your database.
By state machine, I mean parsing the input using a deterministic finite automaton, or something like that. I'll show you what I mean. Here's a small DFA for parsing a positive decimal integer or float within a string. I'm pretty sure you can build a DFA using frameworks like ANTLR, though I'm sure there are also less powerful ones around.
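The DFA this answer refers to is not reproduced here, so the following is only a hand-written sketch of the kind of machine being described, not the original. The state names, NumberDfa and FindNumbers are invented for this example. It accepts unsigned integers and floats (digits, optionally followed by a dot and more digits) and does not handle signs, hex or exponents.

```csharp
using System.Collections.Generic;

public static class NumberDfa
{
    private enum State { Start, Integer, Dot, Fraction }

    // Scans the text once, left to right, and returns every run that
    // looks like 123 or 123.45. Each character causes exactly one transition.
    public static List<string> FindNumbers(string text)
    {
        var found = new List<string>();
        var state = State.Start;
        int begin = 0;

        // Go one step past the end so a number at the very end is emitted too.
        for (int i = 0; i <= text.Length; i++)
        {
            char c = i < text.Length ? text[i] : '\0';
            bool digit = c >= '0' && c <= '9';

            switch (state)
            {
                case State.Start:
                    if (digit) { begin = i; state = State.Integer; }
                    break;
                case State.Integer:
                    if (digit) break;
                    if (c == '.') { state = State.Dot; break; }
                    found.Add(text.Substring(begin, i - begin));
                    state = State.Start;
                    break;
                case State.Dot:
                    if (digit) { state = State.Fraction; break; }
                    // The dot was not followed by a digit: emit the integer before it.
                    found.Add(text.Substring(begin, i - 1 - begin));
                    state = State.Start;
                    break;
                case State.Fraction:
                    if (digit) break;
                    found.Add(text.Substring(begin, i - begin));
                    state = State.Start;
                    break;
            }
        }
        return found;
    }
}
```

Because there is no backtracking, the running time is linear in the length of the input. Note that this sketch also picks up digits embedded in words (for example, the 123 in abc123), so a real version would probably want an extra state for that.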
Regular Expressions have always seemed like black magic to me and I have never been able to get my head around building them.
I am now in need of a Reg Exp (for validation purposes) that checks that the user enters a number according to the following rules.
no alpha characters
can have decimal
can have commas for the thousands, but the commas must be correctly placed
Some examples of VALID values:
1.23
100
1,234
1234
1,234.56
0.56
1,234,567.89
INVALID values:
1.ab
1,2345.67
0,123.45
1.24,687
You can try the following expression
^([1-9]\d{0,2}(,\d{3})+|[1-9]\d*|0)(\.\d+)?$
Explanation:
The part before the point consists of
either 1-3 digits followed by (one or more) comma plus three digits
or just digits (at least one)
If a dot follows, it must itself be followed by at least one digit.
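A small sketch for trying this expression against the valid and invalid examples from the question. GroupedNumber and IsValid are names made up here. Keep in mind that in .NET, $ also matches just before a trailing newline and \d also matches non-ASCII digits, so the check is exactly as lenient as the pattern.

```csharp
using System.Text.RegularExpressions;

public static class GroupedNumber
{
    // The pattern from the answer above, unchanged.
    private static readonly Regex Pattern =
        new Regex(@"^([1-9]\d{0,2}(,\d{3})+|[1-9]\d*|0)(\.\d+)?$");

    public static bool IsValid(string text)
    {
        return Pattern.IsMatch(text);
    }
}
```

With this, GroupedNumber.IsValid("1,234.56") is true and GroupedNumber.IsValid("1,2345.67") is false, matching the lists in the question.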
^(((([1-9][0-9]{0,2})(,[0-9]{3})*)|([0-9]+)))?(\.[0-9]+)?$
This works for all of your examples of valid data, and will also accept decimals that start with a decimal point. (I.e. .61, .07, etc.)
I noticed that all of your examples of valid decimals (1.23, 1,234.56, and 1,234,567.89) had exactly two digits after the decimal point. I'm not sure if this is coincidence, or if you actually require exactly two digits after the decimal point. (I.e. maybe you're working with money values.) The regular expression as I've written it works for any number of digits after the decimal point. (I.e. 1.2345 and 1,234.56789 would be considered valid.) If you need there to be exactly two digits after the decimal point, change the end of the regular expression from +)?$ to {2})?$.
Try this regex:
^(\d{1,3}[,](\d{3}[,])*\d{3}(\.\d{1,3})?|\d{1,3}(\.\d+)?)$
I know you asked for a regex but I think it's much saner to just call double.TryParse() and consider your input acceptable if that method returns true.
double dummy;
var isValid=double.TryParse(text, out dummy);
It won't match your testcases exactly; the major difference being that it is very lenient with commas (so it will accept two of your INVALID inputs).
I'm not sure why you care, but if you really do want comma strictness you could do a preprocessing step where you only check the validity of comma placement and then call double.TryParse() only if the string passes the comma placement test. (If you want to be truly careful, you'll have to honor the CultureInfo so you can know what character is used for separators, and how many digits there are between separators, in the environment your program finds itself in)
Either approach results in code that is more "obviously right" than a regex. For example, you won't have to live with the fear that your regex left out some important case, like scientific notation.
I'm, quite frankly, completely clueless about regular expressions, and even more so about building them. I am reading in a string that could contain any combination of characters and numbers. What I know for certain is that somewhere in the string there will be a number followed by % (1%, 13%, etc.), and I want to extract that number from the string.
Examples:
[05:37:25] Completed 21% //want to extract 21
[05:32:34] Completed 18000000 out of 50000000 steps (36%). //want to extract 36
I'm guessing I should be using either regex.Replace or regex.Split, but beyond that, I'm not sure. Any help would be appreciated.
You should be able to use something like "(\d+)%". This will match one or more consecutive digit characters, then a percent sign, and will capture the actual number so you can extract and parse it. Use this in Regex.Match(), and read the Groups collection of the resulting Match (the captured number is the second element, index 1; index 0 is the whole match).
If you need a decimal point, use "(\d+(\.\d+)?)%", which will match a string of digits, optionally followed by a decimal point and another set of digits.
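In C# this might look roughly like the sketch below. PercentExtractor and Extract are hypothetical names, and the pattern uses [0-9] instead of \d so that only ASCII digits reach the parser.

```csharp
using System.Globalization;
using System.Text.RegularExpressions;

public static class PercentExtractor
{
    // Returns the number directly in front of the first '%' in the line,
    // or null when the line contains no such number.
    public static decimal? Extract(string line)
    {
        Match m = Regex.Match(line, @"([0-9]+(\.[0-9]+)?)%");
        decimal value;
        // Groups[0] is the whole match ("36%"); Groups[1] is the captured number ("36").
        if (m.Success && decimal.TryParse(m.Groups[1].Value,
                                          NumberStyles.AllowDecimalPoint,
                                          CultureInfo.InvariantCulture,
                                          out value))
        {
            return value;
        }
        return null;
    }
}
```

Regex.Match returns the first match only, so a line with several "number%" occurrences would need Regex.Matches instead.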
The regex you want is:
/(\d+)%/
This will capture any number of digits immediately preceding a percentage sign.
([\d]+)(%)
The parentheses will group the result.
The [\d]+ gives you any digit, repeated one or more times.
The "%" is just a literal.
You will need to make sure you extract only the first grouping. Also, you will need to be sure that there are no other instances of "<number>%" in the line.
I'm not entirely sure how to make this C# specific, but I'm sure you can figure that out. :-P
Most likely you will need to use double-backslashes (\\) where I only had one.
What would be the regular expression for the following strings?
56AAA71064D6
56AAA7105A25
Would the regular expression change if the numbers rolled over? What I mean by this is that the above numbers happen to contain hexadecimal values and I don't know how the value changes once it reaches F. Using the first one as an example: 56AAA71064D6, if this went up to
56AAA71064F6 and then the following one would become 56AAA7106406, this would create a different regular expression, because where a letter was allowed there is now a digit. Does this make the regular expression even more difficult? Suggestions?
A manufacturer is going to enter a range of serial numbers. The problem is that different manufacturers have different formats for serial numbers (some are just numbers, some are alphanumeric, some contain extra characters like dashes, and some contain hexadecimal values, which makes it more difficult because I don't know how they roll over to the next serial number). The rollover issue is the biggest problem because the serial numbers are entered as a range like 5A1B - 6F12, and without knowing how they roll over, storing them in the database is not easy. I was going to give the user the option to input the pattern (expression) and store that in the database, but if one or more characters change from a digit to a letter or vice versa, then the regular expression is no longer valid for certain serial numbers.
Also, the above example I gave is with just one case. There are multitude of serial numbers that would contain different expressions.
There's no single regular expression which is "the" expression to match both of those strings. Instead, there are infinitely many which will do so. Here are two options at opposite ends of the spectrum:
(56AAA71064D6)|(56AAA7105A25)
.*
The first will only match those two strings. The second will match anything. Both satisfy all the criteria you've given.
Now, if you specify more criteria, then we'd be able to give a more reasonable idea of the regular expression to provide - and that will drive the answers to the other questions. (At the moment, the only answer that makes sense is "It depends on what regex you use.")
I think you could do it this way for 12 characters. This will search for a 12 character phrase where each of the characters must be a capital (A or B or C or D or E or F or 1 or 2 or 3 or 4 or 5 or 6 or 7 or 8 or 9 or 0)
[A-F0-9]{12}
If you're wanting to include the possibility of dashes then do this.
[A-F0-9\-]{12}
Or if you want to allow dashes in addition to the 12 characters, do this. Note that it would pick up any 12-15 character item that fits the criteria.
[A-F0-9\-]{12,15}
Or if it's surrounded by spaces (AAAAHHHh...SO is stripping out my spaces!!!)
[A-F0-9\-]{12}
Or if it's surrounded by tabs
\t[A-F0-9\-]{12}\t
This matches a string that contains 12 hexadecimal characters:
[0-9A-F]{12}
Assuming these are all 12-digit hexadecimal numbers, which it looks like they are, the following regex should work:
[0-9A-Fa-f]{12}
Here I'm using a character class to say that I want any digit, OR A-F, OR a-f. As a bonus I'm allowing lowercase letters; if you don't want those just get them out of the regex.
As Jon Skeet and others have said, you really didn't provide enough information, so if you don't like this answer please understand that I was doing the best I can with what information you provided.
So, how about this:
[0-9A-F]{12}
Well it sounds like you're describing a 12 digit hexadecimal number:
^[A-F0-9]{12}$
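On the rollover part of the question: if (and this is purely an assumption, since the question says the manufacturers' schemes are unknown) a serial is simply a fixed-width base-16 number that goes up by one each time, then it is easier to convert it to an integer and compare values than to maintain a pattern per range. HexSerial and its methods are illustrative names only.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

public static class HexSerial
{
    // Exactly 12 uppercase hex digits, nothing before or after.
    public static bool IsValid(string serial)
    {
        return Regex.IsMatch(serial, @"\A[0-9A-F]{12}\z");
    }

    // ASSUMPTION: the next serial is the current value plus one in base 16.
    public static string Next(string serial)
    {
        if (!IsValid(serial))
            throw new ArgumentException("Not a 12-digit hex serial.", "serial");
        long value = Convert.ToInt64(serial, 16);
        // "X12" formats as uppercase hex, zero-padded to at least 12 digits.
        return (value + 1).ToString("X12", CultureInfo.InvariantCulture);
    }

    // Range check in value space, e.g. for a range entered as 5A1B - 6F12.
    public static bool InRange(string serial, string first, string last)
    {
        long v = Convert.ToInt64(serial, 16);
        return v >= Convert.ToInt64(first, 16) && v <= Convert.ToInt64(last, 16);
    }
}
```

Under that assumption, 56AAA71064DF is followed by 56AAA71064E0, so a digit can become a letter and vice versa without any pattern having to change. For manufacturers whose serials are not plain hexadecimal counters, this sketch does not apply.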