Regular expression to match all numeric characters except 5 - c#

When I want to match all numeric characters except 5 I use:
[^\D|5]
or
[^\D5]
or
[0-46-9]
or
[012346789]
When I want to match non-numeric characters I can use:
[^\d]
or
[\D]
All of them work well. But when I use [^^\d5] or [^^\d|5] to match all numeric characters except 5, it doesn't work.
I want to use this in a lot of cases. For example, I want to match all \p{P} but not \:. Is there any way to use ^\d to match all numeric characters except 5?

You could match all digits except 5 using this:
[123467890]
There is no reason to use a shorthand version of everything.
It makes no difference to the regex engine.
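In fact, the | and the second ^ in your attempts do not do what you expect: inside a character class, | is a literal pipe rather than alternation, and ^ negates only when it is the first character (anywhere else it is a literal caret). So [^^\d5] matches any character that is neither a caret nor a digit.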
A shorter version would be:
[0-46-9]
Hyphen/Dash behavior inside character classes []
Hyphens will specify a range inside character classes. You can look up an ASCII table to see what range you are covering, for example: [ -Z] actually matches ASCII 32 (space) to 90 (Z).
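To check what a range inside a character class really covers, one option is to enumerate the ASCII table against it. The following is only a sketch written as a C# top-level program; the helper name AsciiCodesMatchedBy is hypothetical and not part of any library.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper: lists the ASCII codes (0-127) that a character class matches.
int[] AsciiCodesMatchedBy(string charClass) =>
    Enumerable.Range(0, 128)
        .Where(code => Regex.IsMatch(((char)code).ToString(), @"\A" + charClass + @"\z"))
        .ToArray();

int[] codes = AsciiCodesMatchedBy("[ -Z]");
Console.WriteLine($"[ -Z] matches ASCII {codes.First()} to {codes.Last()}"); // 32 to 90

// The two ranges in [0-46-9] are 0-4 and 6-9, so code 53 ('5') is missing.
Console.WriteLine(string.Join(",", AsciiCodesMatchedBy("[0-46-9]")));
```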
Edit:
Ok, now I have a better understanding of your requirements.
You need to be specific about what you need to match up front.
You can do this using negative/positive lookaheads:
(?!.*?5.*?)(?!.*?\p{Alpha}.*?)(\p{P}*?$|\p{L}*?$)
This will match under the following conditions:
There is no number 5
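There is no character from the POSIX class: Alpha (note that \p{Alpha} is Java syntax; .NET does not recognize it, so that part needs adapting, e.g. to [A-Za-z], before the pattern runs in C#)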
Any character with the Unicode property "letter" or "punctuation"

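In Java, \d is just [0-9]; see the Java regex reference for confirmation. (In .NET, \d also matches other Unicode decimal digits unless RegexOptions.ECMAScript is set.)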
Just use [0-46-9]. You can try it in a regex fiddle.
UPDATE:
Based on the requirement to leverage De Morgan's laws and use a logical complement per the OP's comment, here is my interpretation of the logical complement of [^\D5].
[^\D5] essentially means "NOT (a non-digit character OR 5)". Compare this to "NOT (A OR B)" in the referenced Wikipedia article on De Morgan's laws.
What we need then is "(NOT a non-digit character) AND (NOT 5)". Compare this to "(NOT A) AND (NOT B)" in the referenced Wikipedia article.
Here then is my interpretation of logically complementing [^\D5] using a sequence of lookahead expressions for logical ANDing:
(?!\D)(?!5).
No, it does not use double negation via ^^, which does not work, as you have found. But the above logical complement expresses what we want in regexese, "(NOT a non-digit character) AND (NOT 5)", applied to a single character (i.e. the .).
You can see in a follow-on regex fiddle that the above logical complement yields the same results as [^\D5] like it should.
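A quick way to check these claims locally is to compare what each pattern matches over the ASCII range. This is a sketch only: the helper AsciiMatchedBy is hypothetical, the check is limited to ASCII (in .NET, \d also covers non-ASCII digits), and the last pattern, [^\P{P}:], is my assumption of how the same negated-class idea would carry over to "all \p{P} except the colon".

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper: collects the ASCII characters that a one-character pattern matches.
string AsciiMatchedBy(string pattern) =>
    new string(Enumerable.Range(0, 128)
        .Select(code => (char)code)
        .Where(ch => Regex.IsMatch(ch.ToString(), @"\A(?:" + pattern + @")\z"))
        .ToArray());

Console.WriteLine(AsciiMatchedBy(@"[^\D5]"));        // 012346789
Console.WriteLine(AsciiMatchedBy(@"[0-46-9]"));      // 012346789
Console.WriteLine(AsciiMatchedBy(@"(?!\D)(?!5)."));  // 012346789

// [^^\d5] excludes the caret and every digit, so it matches no digit at all.
Console.WriteLine(AsciiMatchedBy(@"[^^\d5]").Contains('4')); // False

// Assumed analogue for punctuation: NOT (non-punctuation OR colon).
Console.WriteLine(AsciiMatchedBy(@"[^\P{P}:]"));
```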

Related

Regex for paragraph numbering in C#

I am looking for Regex expression that will match any of the following:
1.0
2.0
3.1
4.2.1
2.1.1.7
1.3.17.11
12.23.54.18
The nesting level could be higher than 4, and the numbers between the dots are unlikely to exceed 2 digits (as in the last sample).
I tried @"\d.\d+" but in some cases it did not work.
I am also looking for expression that will match ONLY this:
1.0
12.0
4.0
Here also - no more than 2 digits before the dot.
As usual, think about the structure of what you want to match:
A single digit:
\d
A single number of arbitrary length:
\d+
A single number, constrained to at most 2 digits:
\d{1,2}
A number, followed by a dot, followed by another number:
\d{1,2}\.\d{1,2}
A number, followed by a dot, followed by another number, followed by another dot, followed by yet another number:
\d{1,2}\.\d{1,2}\.\d{1,2}
Notice a pattern? Exactly, you can use grouping and repetition to match that pattern to an arbitrary length:
\d{1,2}(\.\d{1,2})+
Note that . is a meta-character in regular expressions, matching (almost) any character, so to match a literal dot, you need to escape it (as shown above).
To match just two levels of nesting you can constrain the repetition after the parentheses in a similar manner:
\d{1,2}(\.\d{1,2}){1}
This means it will have to match exactly once. However, in that case you can also simplify to a regex we've seen before:
\d{1,2}\.\d{1,2}
However, putting an exact number of repetitions at the end can be helpful, if you want to create regexes that match n levels of nesting, for arbitrary n.
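As a rough illustration, the two patterns can be tried against sample input like this. It is a sketch, written as a C# top-level program; the ^ and $ anchors and the function names IsNumbering and IsTwoLevel are my additions, on the assumption that the whole string should be a paragraph number.

```csharp
using System;
using System.Text.RegularExpressions;

// Any depth of at least two levels, e.g. 1.0 or 2.1.1.7 (anchors are an assumption).
bool IsNumbering(string s) => Regex.IsMatch(s, @"^\d{1,2}(\.\d{1,2})+$");

// Exactly two levels, e.g. 12.0.
bool IsTwoLevel(string s) => Regex.IsMatch(s, @"^\d{1,2}\.\d{1,2}$");

foreach (string sample in new[] { "1.0", "4.2.1", "12.23.54.18", "123.4", "7" })
{
    Console.WriteLine($"{sample}: any depth = {IsNumbering(sample)}, two levels = {IsTwoLevel(sample)}");
}
```

Without the anchors, Regex.IsMatch would also accept strings that merely contain a paragraph number somewhere.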
Try using this
(\d{1,2}[.])+\d{1,2}

Character 'e' is not recognized by simple regular expression - why?

I wrote a very simple regular expression that needs to match the following pattern:
word.otherWord
- The word must have at least 2 characters and must not start with a digit.
I wrote the following expression:
[a-zA-Z][a-zA-Z](.[a-zA-Z0-9])+
I tested it using a regex tester and it seems to work in most cases, but when I try some inputs that end with 'e' it does not work.
For example:
Hardware.Make does not work but Hardware.Makee works fine. Why? How can I fix it?
That's because your regex only matches inputs whose length is even.
You have two characters matched by [a-zA-Z][a-zA-Z] and then another two characters matched by (.[a-zA-Z0-9]) as a group which is repeated one or more times (because of +).
You can see it here: http://regex101.com/r/fW2bC1
I think you need this:
[a-zA-Z]+(\.[a-zA-Z0-9]+)+
Actually, the dot is a regex metacharacter, which stands for "any character". You'll need to escape the dot.
For your situation, I'd do this:
[a-zA-Z]{2,}\.[a-zA-Z0-9]+
The {2,} means, at least 2 characters from the previous range.
In regex, the dot (period) is one of the most commonly used metacharacters and, unfortunately, also one of the most commonly misused. The dot matches a single character without caring what that character is...
So you could also rewrite it like this:
[a-zA-Z]+(\.[a-zA-Z0-9]+)+
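The even-length effect and the unescaped dot are easy to reproduce. The sketch below assumes the tester required the whole input to match (hence the added ^ and $), and the "fixed" pattern combines the {2,} quantifier with the escaped-dot group from the answers above; the function names are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

// The question's pattern, anchored (assumption: the tester checked the whole input).
bool OriginalMatches(string s) => Regex.IsMatch(s, @"^[a-zA-Z][a-zA-Z](.[a-zA-Z0-9])+$");

// Escaped dot, at least two leading letters, one or more dotted parts.
bool FixedMatches(string s) => Regex.IsMatch(s, @"^[a-zA-Z]{2,}(\.[a-zA-Z0-9]+)+$");

foreach (string sample in new[] { "Hardware.Make", "Hardware.Makee", "HardwareMakeXY" })
{
    Console.WriteLine($"{sample}: original = {OriginalMatches(sample)}, fixed = {FixedMatches(sample)}");
}
```

The third sample has no dot at all, yet the original pattern accepts it because the unescaped . matches any character.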

Regular expression for non-standard ascii characters

I need a regular expression that checks a string for any non-standard ASCII characters.
You can specify a character's Unicode code point in a C# string: "[\u0080-\uFFFF]" should find any character whose code is 128 or higher.
Does this simple one suit your needs?
[^\x20-\x7E]
Put what you consider the standard characters in a set, then put the negation sign ^ at the start of the set. That will match the nonstandard characters. For example, if I consider the standard to be the letters a-z in either case, my nonstandard match pattern would be
[^A-Za-z]
If that matches, you have a nonstandard character.
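The two suggested classes differ in what they count as "standard", which a small check makes visible. This is a sketch with hypothetical function names: the first treats only printable ASCII (0x20 to 0x7E) as standard, the second flags only characters at 0x80 and above.

```csharp
using System;
using System.Text.RegularExpressions;

// True if the string contains anything outside printable ASCII (space to ~).
bool HasNonPrintableAscii(string s) => Regex.IsMatch(s, @"[^\x20-\x7E]");

// True if the string contains any UTF-16 code unit at 0x80 or above.
bool HasNonAscii(string s) => Regex.IsMatch(s, @"[\u0080-\uFFFF]");

foreach (string sample in new[] { "hello", "caf\u00E9", "tab\there" })
{
    Console.WriteLine($"non-printable = {HasNonPrintableAscii(sample)}, non-ASCII = {HasNonAscii(sample)}");
}
```

A tab is ASCII but not printable, so only the first check flags the last sample.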

Shall this Regex do what I expect from it, that is, matching against "A1:B10,C3,D4:E1000"?

I'm currently writing a library where I wish to allow the user to be able to specify spreadsheet cell(s) under four possible alternatives:
A single cell: "A1";
Multiple contiguous cells: "A1:B10"
Multiple separate cells: "A1,B6,I60,AA2"
A mix of 2 and 3: "B2:B12,C13:C18,D4,E11000"
Then, to validate whether the input respects these formats, I intended to use a regular expression to match against. I have consulted this article on Wikipedia:
Regular Expression (Wikipedia)
And I also found this related SO question:
regex matching alpha character followed by 4 alphanumerics.
Based on the information provided within the above-linked articles, I would try with this Regex:
Default ReadOnly Property Cells(ByVal cellsAddresses As String) As ReadOnlyDictionary(Of String, ICell)
    Get
        Dim validAddresses As Regex = New Regex("A-Za-z0-9:,A-Za-z0-9")
        If (Not validAddresses.IsMatch(cellsAddresses)) Then _
            Throw New FormatException("cellsAddresses")
        ' Proceed with getting the cells from the Interop here...
    End Get
End Property
Questions
1. Is my regular expression correct? If not, please help me understand what expression I could use.
2. Which exception would be more meaningful here: a FormatException or an InvalidExpressionException? I hesitate because the error relates to the format in which the property expects the cells to be given, yet I am using a (regular) expression to match against.
Thank you kindly for your help and support! =)
I would try this one:
[A-Za-z]+[0-9]+([:,][A-Za-z]+[0-9]+)*
Explanation:
Between [] is a possible group of characters for a single position
[A-Za-z] means characters (letters) from 'A' to 'Z' and from 'a' to 'z'
[0-9] means characters (digits) from 0 to 9
A "+" appended to a part of a regex means: repeat that one or more times
A "*" means: repeat the previous part zero or more times.
( ) can be used to define a group
So [A-Za-z]+[0-9]+ matches one or more letters followed by one or more digits for a single cell-address.
Then that same block is repeated zero or more times, with a ',' or ':' separating the addresses.
Assuming that the column for the spreadsheet is any 1- or 2-letter value and the row is any positive number, a more complex but tighter answer still would be:
^[A-Z]{1,2}[1-9]\d*(:[A-Z]{1,2}[1-9]\d*)?(,[A-Z]{1,2}[1-9]\d*(:[A-Z]{1,2}[1-9]\d*)?)*$
"[A-Z]{1,2}[1-9]\d*" is the expression for a single cell reference. If you replace "[A-Z]{1,2}[1-9]\d*" in the above with <cell>, then the complex expression becomes
^<cell>(:<cell>)?(,<cell>(:<cell>)?)*$
which more clearly shows that it is a cell or a range followed by zero or more "cell or range" entries, each preceded by a comma.
The row and column indicators could be further refined to give a tighter still, yet more complex expression. I suspect that the above could be simplified with look-ahead or look-behind assertions, but I admit those are not (yet) my strong suit.
I'd go with this one, I think:
(([A-Z]+[1-9]\d*:)?[A-Z]+[1-9]\d*,)*([A-Z]+[1-9]\d*:)?[A-Z]+[1-9]\d*
This only allows capital letters as the prefix. If you want case insensitivity, use RegexOptions.IgnoreCase.
You could simplify this by replacing [A-Z]+[1-9]\d* with plain old [A-Z]\d+, but that will only allow a one-letter prefix, and it also allows stuff like A0 and B01. Up to you.
EDIT:
Having thought hard about DocMax's mention of lookarounds, and using Hans Kesting's answer as inspiration, it occurs to me that this should work:
^[A-Z]+\d+((,|(?<!:\w*):)[A-Z]+\d+)*$
Or if you want something really twisted:
^([A-Z]+\d+(,|$|(?<!:\w*):))*(?<!,|:)
As in the previous example, replace \d+ with [1-9]\d* if you want to prevent leading zeros.
The idea behind the ,|(?<!:\w*): is that if a group is delimited by a comma, you want to let it through; but if it's a colon, it's only allowed if the previous delimiter wasn't a colon. The (,|$|...) version is madness, but it allows you to do it all with only one [A-Z]+\d+ block.
However! Even though this is shorter, and I'll admit I feel a teeny bit clever about it, I pity the poor fellow who has to come along and maintain it six months from now. It's fun from a code-golf standpoint, but I think it's best for practical purposes to go with the earlier version, which is a lot easier to read.
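For anyone who wants to see the lookbehind in action, here is a small sketch using the first of the two patterns. It relies on .NET's support for variable-length lookbehind; the function name IsCellList is hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

// A colon is accepted only if the previous delimiter was not also a colon.
bool IsCellList(string s) => Regex.IsMatch(s, @"^[A-Z]+\d+((,|(?<!:\w*):)[A-Z]+\d+)*$");

foreach (string sample in new[] { "A1", "A1:B10,C3,D4:E1000", "A1:B2:C3", "A1,,B2" })
{
    Console.WriteLine($"{sample}: {IsCellList(sample)}");
}
```

The third sample fails because the second colon is preceded by ":B2", which is exactly what the negative lookbehind forbids.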
I think your regex is incorrect; try (([A-Za-z0-9]*)[:,]?)*
Edit: to correct the bug pointed out by Baud: (([A-Za-z0-9]*)[:,]?)*([A-Za-z0-9]+)
And finally, the best version: (([A-Za-z]+[0-9]+)[:,]?)*([A-Za-z]+[0-9]+)
// Ah, OK, this probably won't work... but to answer 1: no, I don't think your regex is correct.
( ) form a group
[ ] form a character class (you can use ranges such as A-Z, a-d, 0-9, or just single characters)
? means 0 or 1
* means 0 or more
I'd suggest reading http://www.regular-expressions.info/reference.html .
That's where I learned regexes some time ago ;)
For building expressions I use Rad Software Regular Expression Designer.
Let's build this step by step.
If you are following an Excel addressing format, to match a single-cell entry in your comma-separated list, you would use the regular expression:
[A-Z]{1,2}[1-9]\d*
This matches the following in sequence:
Any character in A to Z once or twice
Any digit in 1 to 9
Any digit zero or more times
The digit expression will prevent inputting a cell address with leading zeros.
To build the expression that allows for a cell address pair, repeat the expression preceded by a colon as optional.
[A-Z]{1,2}[1-9]\d*(:[A-Z]{1,2}[1-9]\d*)?
Now allow for repeating the pattern preceded by a comma zero or more times and add start and end string delimiters.
^[A-Z]{1,2}[1-9]\d*(:[A-Z]{1,2}[1-9]\d*)?(,[A-Z]{1,2}[1-9]\d*(:[A-Z]{1,2}[1-9]\d*)?)*$
Kind of long and obnoxious, I admit, but after trying enough variants, I can't find a way of shortening it.
Hope this is helpful.
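One way to keep the long expression manageable is to assemble it in code from the single-cell fragment, mirroring the steps above. This is a sketch rather than a definitive implementation; BuildPattern and IsValidAddressList are hypothetical names.

```csharp
using System;
using System.Text.RegularExpressions;

// Builds the full pattern from the single-cell fragment, step by step.
string BuildPattern()
{
    const string cell = @"[A-Z]{1,2}[1-9]\d*";            // one cell, no leading zeros
    string cellOrRange = cell + "(:" + cell + ")?";        // optional ":<cell>" makes a range
    return "^" + cellOrRange + "(," + cellOrRange + ")*$"; // comma-separated entries
}

bool IsValidAddressList(string s) => Regex.IsMatch(s, BuildPattern());

Console.WriteLine(BuildPattern());
foreach (string sample in new[] { "A1", "A1:B10,C3,D4:E1000", "A0", "A1:B2:C3" })
{
    Console.WriteLine($"{sample}: {IsValidAddressList(sample)}");
}
```

The assembled string is identical to the long pattern, so the regex engine sees no difference; only the source becomes easier to read.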

C# Regular Expressions with \Uxxxxxxxx characters in the pattern

Regex.IsMatch( "foo", "[\U00010000-\U0010FFFF]" )
Throws: System.ArgumentException: parsing "[-]" - [x-y] range in reverse order.
Looking at the hex values for \U00010000 and \U0010FFFF I get: 0xd800 0xdc00 for the first character and 0xdbff 0xdfff for the second.
So I guess I really have one problem: why are the Unicode characters formed with \U split into two chars in the string?
They're surrogate pairs. Look at the values: they're over 65535. A char is only a 16-bit value. How would you express 65536 in only 16 bits?
Unfortunately it's not clear from the documentation how (or whether) the regular expression engine in .NET copes with characters which aren't in the basic multilingual plane. (The \uxxxx pattern in the regular expression documentation only covers 0-65535, just like \uxxxx as a C# escape sequence.)
Is your real regular expression bigger, or are you actually just trying to see if there are any non-BMP characters in there?
To work around such things with the .NET regex engine, I'm using the following trick:
"[\U00010000-\U0010FFFF]" is replaced with [\uD800-\uDBFF][\uDC00-\uDFFF]
The idea behind this is that, as .NET regexes handle code units instead of code points, we're providing it with surrogate ranges as regular characters. It's also possible to specify narrower ranges by working with the edges, e.g.: [\U00011DEF-\U00013E07] is the same as (?:\uD807[\uDDEF-\uDFFF])|(?:[\uD808-\uD80E][\uDC00-\uDFFF])|(?:\uD80F[\uDC00-\uDE07])
It's harder to read and work with, and it's not that flexible, but it still serves as a workaround.
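The range edges can be read off with char.ConvertFromUtf32, which returns the UTF-16 code units for a code point. The sketch below is illustrative only; the function names CodeUnits, HasNonBmp and InNarrowRange are hypothetical.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Formats the UTF-16 code units of a code point as hex, e.g. "D800 DC00".
string CodeUnits(int codePoint) =>
    string.Join(" ", char.ConvertFromUtf32(codePoint).Select(c => ((int)c).ToString("X4")));

// Any high surrogate followed by any low surrogate, i.e. any code point above U+FFFF.
bool HasNonBmp(string s) => Regex.IsMatch(s, @"[\uD800-\uDBFF][\uDC00-\uDFFF]");

// The narrower range U+11DEF to U+13E07, split at the high-surrogate edges.
bool InNarrowRange(string s) => Regex.IsMatch(s,
    @"(?:\uD807[\uDDEF-\uDFFF])|(?:[\uD808-\uD80E][\uDC00-\uDFFF])|(?:\uD80F[\uDC00-\uDE07])");

Console.WriteLine(CodeUnits(0x10000));  // D800 DC00
Console.WriteLine(CodeUnits(0x10FFFF)); // DBFF DFFF
Console.WriteLine(CodeUnits(0x11DEF));  // D807 DDEF
Console.WriteLine(CodeUnits(0x13E07));  // D80F DE07
Console.WriteLine(HasNonBmp("foo"));    // False
Console.WriteLine(InNarrowRange(char.ConvertFromUtf32(0x12000))); // True
```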
