regular expression - how to make lookahead pattern

regular expression - how to make lookahead pattern - c#

I need to create a regular expression to validate a string containing wildcards. The string must be in the form of a mobile number (xxx-xxx-xxxx), where x is a digit or a question mark. For that case the regexp was straightforward enough, ^([\d?{3}]-[\d?{3}]-[\d?{4}])$, but when the user also requested the * wildcard, I got really confused.
First of all it can be xxx-xxx-*, right? But xxx-xxx-** is invalid, as well as xxx-*-*. I read something about lookahead patterns (I'm writing in C#) but that only confused me more. I tried to compile something like ^(?![\\*\\*])$ - "not two asterisks next to one another" - but it didn't work.
So, any more ideas?

I'm not sure I've understood your requirement exactly, but it sounds to me like you want a pattern that will match:
one to three digits or ?
optionally followed by - and one to three digits or ?
optionally followed by - and one to four digits or ?
this should match
123-456
12?-4??-78??
1-3?-2?0
but not match
1--123
-?-23
1233-23?-234
in which case you have no need for a lookahead
this pattern should work
^([\?\d]{1,3})(\-[\?\d]{1,3}(\-[\?\d]{1,4})?)?$
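The pattern above can be checked against the listed samples with a short script. This is only a sketch: it uses Python's `re` module for illustration (an assumption, since the question is about C#), and the names `PHONE_PREFIX` and `matches` are made up for the example. The constructs used here should behave the same way in .NET's Regex class.

```python
import re

# Pattern copied from the answer above; the sample strings are the ones it lists.
PHONE_PREFIX = re.compile(r"^([\?\d]{1,3})(\-[\?\d]{1,3}(\-[\?\d]{1,4})?)?$")

should_match = ["123-456", "12?-4??-78??", "1-3?-2?0"]
should_fail = ["1--123", "-?-23", "1233-23?-234"]


def matches(text: str) -> bool:
    """Return True when the whole string fits the pattern."""
    return PHONE_PREFIX.match(text) is not None


for sample in should_match + should_fail:
    print(f"{sample!r}: {matches(sample)}")
```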

This would be your expression with some corrections:
^[\d?]{3}-[\d?]{3}-[\d?]{4}$
I moved the closing square brackets, because the quantifier has to be outside of the character class. I also removed the outermost parentheses, as they serve no purpose here.
Now to the lookaheads.
If you want to forbid "**"
^(?!.*\*\*)[\d?]{3}-[\d?]{3}-[\d?]{4}$
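I am not sure about your requirements for the usage of the "*". Is only one allowed in the string? Note that the character classes above do not accept "*" yet, so you would still have to allow it there (for example [\d?*]) before the lookahead has any effect.
Similarly, to disallow "--":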
^(?!.*\*\*)(?!.*--)[\d?]{3}-[\d?]{3}-[\d?]{4}$
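To see the negative lookahead doing real work, the character classes have to accept "*" in the first place. The sketch below assumes a hypothetical variant of the pattern in which "*" is added to each class and the group lengths are relaxed so that a single wildcard can stand in for a whole group; neither change comes from the answer. Python's `re` is used for illustration only, and the lookahead syntax is the same in .NET.

```python
import re

# Hypothetical variant: "*" is allowed in every group and lengths are relaxed.
# The leading (?!.*\*\*) rejects any string that contains two adjacent asterisks.
WITH_LOOKAHEAD = re.compile(r"^(?!.*\*\*)[\d?*]{1,3}-[\d?*]{1,3}-[\d?*]{1,4}$")

# Same pattern without the lookahead, for comparison.
WITHOUT_LOOKAHEAD = re.compile(r"^[\d?*]{1,3}-[\d?*]{1,3}-[\d?*]{1,4}$")


def is_valid(text: str) -> bool:
    return WITH_LOOKAHEAD.match(text) is not None


for sample in ["123-456-*", "12?-4?6-7890", "123-456-**", "123-*-*"]:
    print(f"{sample!r}: {is_valid(sample)}")
```

This sketch only forbids adjacent asterisks, so "123-*-*" still passes; rejecting that case would need a further rule, which depends on requirements the question leaves open.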

Related

Character 'e' is not recognized by simple regular expression - why?

I wrote a very simple regular expression that needs to match the following pattern:
word.otherWord
- Word must have at least 2 characters and must not start with a digit.
I wrote the following expression:
[a-zA-Z][a-zA-Z](.[a-zA-Z0-9])+
I tested it using a regex tester and it seems to work in most cases, but it fails for some inputs that end with 'e'.
For example:
Hardware.Make does not work but Hardware.Makee works fine. Why? How can I fix it?

That's because your regex can only match a whole input whose length is even.
You have two characters matched by [a-zA-Z][a-zA-Z] and then another two characters matched by (.[a-zA-Z0-9]) as a group which is repeated one or more times (because of +).
You can see it here: http://regex101.com/r/fW2bC1
I think you need this:
[a-zA-Z]+(\.[a-zA-Z0-9]+)+
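The even-length effect is easy to reproduce. The following sketch uses Python's `re.fullmatch` as a stand-in for the regex tester (an assumption; the tester presumably required the whole input to match), with both patterns copied from the question and the answer.

```python
import re

# Original pattern: two letters, then repeated PAIRS of "any char + alphanumeric".
ORIGINAL = re.compile(r"[a-zA-Z][a-zA-Z](.[a-zA-Z0-9])+")

# Suggested fix: escaped dot, and "+" on the character classes themselves.
FIXED = re.compile(r"[a-zA-Z]+(\.[a-zA-Z0-9]+)+")

for sample in ["Hardware.Make", "Hardware.Makee"]:
    original_ok = ORIGINAL.fullmatch(sample) is not None
    fixed_ok = FIXED.fullmatch(sample) is not None
    print(f"{sample} (length {len(sample)}): original={original_ok}, fixed={fixed_ok}")
```

Note that the unescaped dot in the original pattern is consumed as part of one of the pairs, which is why an input without any dot, such as "Hardwa", also passes the original.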

Actually, the dot is a regex metacharacter, which stands for "any character". You'll need to escape the dot.
For your situation, I'd do this:
[a-zA-Z]{2,}\.[a-zA-Z0-9]+
The {2,} means at least 2 characters from the preceding character class.

In regex, the dot (period) is one of the most commonly used metacharacters and, unfortunately, also one of the most commonly misused. The dot matches a single character without caring what that character is...
So you could also rewrite it like this:
[a-zA-Z]+(\.[a-zA-Z0-9]+)+

Is my C# Reg-ex correct?

Is this regex correct if I have to match a string which is at least 7 characters long, not more than 20 characters, and has at least 1 number and at least 1 letter? It has no other constraints.
[0-9]+[A-Za-z]+{7,20}
Thanks

No, it's not. The quantifier {7,20} doesn't apply to a token (repetition in regexes is done with quantifiers, like *, +, ? or the more general {n,m} – you cannot use more than one quantifier on a single token [in this case [a-zA-Z]]; *? is a quantifier on its own and thus doesn't play by above rules). You'll need something like the following:
^(?=.*\d)(?=.*[a-zA-Z]).{7,20}$
This has two lookaheads making sure of at least one digit and at least one letter:
(?=.*\d)
(?=.*[a-zA-Z])
Lookarounds are zero-width assertions; they do not consume characters in the string so they are merely matching a position. But they make sure that the expression inside of them would match at the current point. In this case this expression would match arbitrarily many characters and then would require a digit or a letter, respectively.
The actual match itself,
.{7,20}
just makes sure the length matches. Which characters are used is irrelevant because we already made sure of those constraints above.
Finally the whole expression is anchored in that a start-of-string and end-of-string anchor are inserted at the start and end:
^...$
This makes sure that the match really encompasses the whole string. While not strictly necessary in this case (it would match the whole string anyway in all valid cases) it's often a good idea to include because usually regexes match only substrings and this can lead to subtle problems where validation regexes match even though they should fail. E.g. using \d+ to make sure a string consists only of digits would match the string a4b which puzzles beginners quite often.
I also changed it so that the order of letters and numbers doesn't matter. Your regex looks like it tries to impose a definite order where all numbers need to come before all letters, which usually isn't what's wanted here.
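A quick way to convince yourself of the lookahead-based pattern is to run it against a few strings. This is a minimal sketch using Python's `re` for illustration (an assumption; the question is about C#, where the same pattern should work unchanged). The helper name `is_acceptable` is made up for the example.

```python
import re

# Pattern from the answer: two lookaheads check content, .{7,20} checks length.
VALIDATOR = re.compile(r"^(?=.*\d)(?=.*[a-zA-Z]).{7,20}$")


def is_acceptable(text: str) -> bool:
    return VALIDATOR.match(text) is not None


print(is_acceptable("abc1234"))   # letters and digits, 7 characters
print(is_acceptable("1a23456"))   # order of letters and digits does not matter
print(is_acceptable("abcdefg"))   # no digit
print(is_acceptable("1234567"))   # no letter
print(is_acceptable("a1"))        # too short

# The anchoring pitfall mentioned above: an unanchored \d+ finds "4" inside "a4b".
print(re.search(r"\d+", "a4b") is not None)
print(re.search(r"^\d+$", "a4b") is not None)
```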

How to use non capture groups with the "or" character in regular expressions

So basically I have this giant regular expression pattern, and somewhere in the middle of it is the expression (?:\s(\d\d\d)|(\d\d\d\d)). At this point in the parse I want to capture either 3 digits that follow a space or 4 digits, but I don't want the capture that comes from putting parentheses around the whole thing (doesn't ?: make a group non-capturing?). I have to use parentheses so that the "or" logic works (I think).
So potential example inputs would be something like...
input1= giantexpression 123more characters after
input2= giantexpression1234blahblahblah
I tried (?:\s(\d\d\d)|(\d\d\d\d)) and it gave an extra capture, at least in the case where I have 4 digits. So am I doing this right, or have I messed up somewhere?
Edit:
To go into more detail... here's the current regular expression I'm working with.
pattern = #".?(\d{1,2})\s*(\w{2}).?.?.?(?:\s(\d\d\d)|(\d\d\d\d)).*"
There's a bit of parsing I have to do at the beginning. I think Sean Johnson's answer would still work because I wouldn't need to use "or". But is there a way to do it in which you DO use "or"? I think eventually I'll need that capability.

This should work:
(?:\s(\d{3,4}))
If you aren't doing any logic on that subpattern, you don't even need the parentheses surrounding it if all you want to do is capture the digits. The following pattern:
\s(\d{3,4})
will capture three or four digits directly following a space character.
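On the follow-up about keeping the "or": the non-capturing group itself adds no capture, but each alternative has its own capturing group, and only the one on the branch that matched is filled. That is most likely the "extra capture" seen in the question. The sketch below shows this with Python's `re` (an assumption made for illustration; the helper `captured_digits` is hypothetical). In .NET the group that did not take part should report Success as false rather than None.

```python
import re

# The alternation from the question: group 1 belongs to the first branch,
# group 2 to the second. The outer (?: ... ) does not create a group.
EITHER = re.compile(r"(?:\s(\d\d\d)|(\d\d\d\d))")


def captured_digits(text: str):
    """Return whichever branch captured something, or None if nothing matched."""
    match = EITHER.search(text)
    if match is None:
        return None
    return match.group(1) or match.group(2)


for sample in ["giantexpression 123more characters after",
               "giantexpression1234blahblahblah"]:
    print(EITHER.search(sample).groups(), captured_digits(sample))
```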

I need a Regular Expression allowing user to input numbers, plus, minus and parentheses

I need a Regular Expression allowing user to input numbers, plus, minus and parentheses.
User can only input:
At most one open parenthesis '('.
At most one close parenthesis ')'.
At most one plus '+'
Any number of minus signs '-', but never two in a row.
Exactly 11 digits.
Here are valid inputs:
(0)+12-3-4-56-7890
+)0(12345-678-90
+01234567890
+(01234567890)
01234567890
-01-234+5678-90
(01234567890)
)01234567890(
And following are not valid:
0123456--7890
0((1234567890
01234567890))
++01234567890
123456
++123456789
I'm programming in C#, and if it helps, the order of the open and close parentheses can be made mandatory too, so )01234567890( would not be valid.
Thanks in advance

This regex passes your examples, but might not be exactly what you're looking for. It should point you in the right direction.
^(?!.*-{2,})(?!(?:.*\)){2,})(?!(?:.*\(){2,})(?!\+{2,})(?:\D*\d\D*){11}$
(?!.*-{2,}) Cannot contain two or more consecutive hyphens.
(?!(?:.*\)){2,}) Cannot contain two or more closing parentheses.
(?!(?:.*\(){2,}) Cannot contain two or more opening parentheses.
(?!\+{2,}) Cannot start with two or more plus signs.
(?:\D*\d\D*){11} Must contain exactly 11 digits, each surrounded by any number of non-digit characters.
However, this is very confusing and fairly inefficient. I bet the regex could be rewritten to be much quicker, but it won't be much easier to understand.
I suggest that you follow MisterJack's suggestion instead of pursuing a regex. It'll be easier to maintain.
EDIT
^(?!.*--)(?!.*(\(|\)|\+).*\1)(?:\D*\d\D*){11}$
I've consolidated the parentheses and plus symbol rules into one negative lookahead using a backreference. This also restricts the number of parens and pluses to just one of each. I couldn't get it to restrict to just a certain set of characters, but you might be able to do that in a second pass with another regex.
^ Match from beginning of the string
(?!.*--) Do not allow consecutive hyphens
(?!.*(\(|\)|\+).*\1) Do not allow two or more instances of (, ) or +
(?:\D*\d\D*){11} Must contain 11 digits, allow non-digit characters before and after, such as hyphen.
$ Match to end of string
I tried a negative and positive lookahead to restrict the characters, but couldn't get it to work right. I also tried to replace \D with [()+-] but that didn't work either. Maybe someone else will add a comment to show how to restrict the characters. I'd sure love to see how someone else does it in this regex.
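The consolidated pattern can be checked against every sample from the question with a short script. The sketch below uses Python's `re` for illustration (an assumption; the thread is about C#). `RESTRICTED` is a hypothetical extension that is not part of the answer: it adds one more lookahead, (?=[\d()+-]*$), as one possible way to limit the allowed characters.

```python
import re

# Second (EDIT) pattern from the answer, unchanged.
ANSWER = re.compile(r"^(?!.*--)(?!.*(\(|\)|\+).*\1)(?:\D*\d\D*){11}$")

# Hypothetical extension: the extra lookahead only lets digits, parentheses,
# plus and minus appear anywhere in the string.
RESTRICTED = re.compile(
    r"^(?=[\d()+-]*$)(?!.*--)(?!.*(\(|\)|\+).*\1)(?:\D*\d\D*){11}$"
)

valid = ["(0)+12-3-4-56-7890", "+)0(12345-678-90", "+01234567890",
         "+(01234567890)", "01234567890", "-01-234+5678-90",
         "(01234567890)", ")01234567890("]
invalid = ["0123456--7890", "0((1234567890", "01234567890))",
           "++01234567890", "123456", "++123456789"]

for sample in valid + invalid + ["01234a567890"]:
    print(f"{sample:20} answer={bool(ANSWER.match(sample))} "
          f"restricted={bool(RESTRICTED.match(sample))}")
```

The last sample contains a letter: it has 11 digits, so the original pattern lets it through, while the variant with the extra lookahead rejects it.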

I think that a regular expression isn't your best bet, because it could become too complicated and can easily break.
What I suggest is to parse your input, i.e. to count how many digits, minuses, pluses and parentheses the user entered, and to check whether they appear in the right order. An easy way to do this could be to loop over the characters that compose the string and check if the current char:
is a number (and we keep count of how many numbers we found)
is a minus (and the previous char isn't a minus)
is a plus (and it is the first one)
is a parenthesis (it's the first open parenthesis or it's a closed one and we already found the open parenthesis)
This could do the trick.
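The loop described above might look like the following. This is a minimal sketch in Python rather than C#, under two stated assumptions: parentheses must appear in order (the optional stricter rule from the question), and any character outside digits, '-', '+', '(' and ')' is rejected. The function name `is_valid_number` is made up for the example.

```python
def is_valid_number(text: str) -> bool:
    """Single pass over the characters, tracking what has been seen so far."""
    digits = 0
    seen_plus = seen_open = seen_close = False
    previous = ""
    for ch in text:
        if "0" <= ch <= "9":
            digits += 1
        elif ch == "-":
            if previous == "-":          # two minuses in a row
                return False
        elif ch == "+":
            if seen_plus:                # at most one plus
                return False
            seen_plus = True
        elif ch == "(":
            if seen_open:                # at most one open parenthesis
                return False
            seen_open = True
        elif ch == ")":
            if not seen_open or seen_close:   # must follow "(" and be unique
                return False
            seen_close = True
        else:
            return False                 # assumption: nothing else is allowed
        previous = ch
    return digits == 11


for sample in ["(0)+12-3-4-56-7890", "+01234567890", ")01234567890(",
               "0123456--7890", "123456"]:
    print(f"{sample}: {is_valid_number(sample)}")
```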

Shall this Regex do what I expect from it, that is, matching against "A1:B10,C3,D4:E1000"?

I'm currently writing a library where I wish to allow the user to be able to specify spreadsheet cell(s) under four possible alternatives:
A single cell: "A1";
Multiple contiguous cells: "A1:B10"
Multiple separate cells: "A1,B6,I60,AA2"
A mix of 2 and 3: "B2:B12,C13:C18,D4,E11000"
Then, to validate whether the input respects these formats, I intended to use a regular expression to match against. I have consulted this article on Wikipedia:
Regular Expression (Wikipedia)
And I also found this related SO question:
regex matching alpha character followed by 4 alphanumerics.
Based on the information provided within the above-linked articles, I would try with this Regex:
Default ReadOnly Property Cells(ByVal cellsAddresses As String) As ReadOnlyDictionary(Of String, ICell)
    Get
        Dim validAddresses As Regex = New Regex("A-Za-z0-9:,A-Za-z0-9")
        If (Not validAddresses.IsMatch(cellsAddresses)) Then _
            Throw New FormatException("cellsAddresses")
        ' Proceed with getting the cells from the Interop here...
    End Get
End Property
Questions
1. Is my regular expression correct? If not, please help me understand what expression I could use.
2. Which exception would be more meaningful, a FormatException or an InvalidExpressionException? I hesitate here: the error relates to the format in which the property expects the cells to be given, but on the other hand I'm using a (regular) expression to match against.
Thank you kindly for your help and support! =)

I would try this one:
[A-Za-z]+[0-9]+([:,][A-Za-z]+[0-9]+)*
Explanation:
Between [] is a possible group of characters for a single position
[A-Za-z] means characters (letters) from 'A' to 'Z' and from 'a' to 'z'
[0-9] means characters (digits) from 0 to 9
A "+" appended to a part of a regex means: repeat that one or more times
A "*" means: repeat the previous part zero or more times.
( ) can be used to define a group
So [A-Za-z]+[0-9]+ matches one or more letters followed by one or more digits for a single cell-address.
Then that same block is repeated zero or more times, with a ',' or ':' separating the addresses.
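As a rough check of this pattern, the sketch below runs it over the four address styles from the question. Python's `re.fullmatch` is used for illustration (an assumption; the question uses .NET's Regex, where the pattern would need ^ and $ anchors to get the same whole-string behaviour with IsMatch).

```python
import re

# Pattern from the answer: one cell, then any number of ":cell" or ",cell".
CELLS = re.compile(r"[A-Za-z]+[0-9]+([:,][A-Za-z]+[0-9]+)*")

samples = ["A1", "A1:B10", "A1,B6,I60,AA2", "B2:B12,C13:C18,D4,E11000",
           "A1,", "1A", "A1:B2:C3"]

for sample in samples:
    print(f"{sample}: {CELLS.fullmatch(sample) is not None}")
```

Because ':' and ',' are interchangeable separators in this pattern, a chained range such as "A1:B2:C3" is accepted as well.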

Assuming that the column for the spreadsheet is any 1- or 2-letter value and the row is any positive number, a more complex but tighter answer still would be:
^[A-Z]{1,2}[1-9]\d*(:[A-Z]{1,2}[1-9]\d*)?(,[A-Z]{1,2}[1-9]\d*(:[A-Z]{1,2}[1-9]\d*)?)*$
"[A-Z]{1,2}[1-9]\d*" is the expression for a single cell reference. If you replace "[A-Z]{1,2}[1-9]\d*" in the above with <cell>, then the complex expression becomes
^<cell>(:<cell>)?(,<cell>(:<cell>)?)*$
which more clearly shows that it is a cell or a range followed by zero or more "cell or range" entries with commas in between.
The row and column indicators could be further refined to give a tighter still, yet more complex expression. I suspect that the above could be simplified with look-ahead or look-behind assertions, but I admit those are not (yet) my strong suit.
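The <cell> substitution can also be done in code, which keeps the long expression readable. This is a sketch in Python (an assumption made for illustration; the same string composition works in VB.NET or C#), with the names `CELL`, `CELL_OR_RANGE` and `ADDRESSES` invented for the example.

```python
import re

# Build the long expression from its repeated part instead of typing it out.
CELL = r"[A-Z]{1,2}[1-9]\d*"                  # one cell reference, e.g. AA12
CELL_OR_RANGE = rf"{CELL}(:{CELL})?"          # a cell, optionally ":cell"
ADDRESSES = re.compile(rf"^{CELL_OR_RANGE}(,{CELL_OR_RANGE})*$")

print(ADDRESSES.pattern)

for sample in ["A1:B10,C3,D4:E1000", "A1", "A0", "A1:B2:C3", "AAA1"]:
    print(f"{sample}: {ADDRESSES.match(sample) is not None}")
```

The composed pattern is character-for-character the same as the full expression given above.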

I'd go with this one, I think:
(([A-Z]+[1-9]\d*:)?[A-Z]+[1-9]\d*,)*([A-Z]+[1-9]\d*:)?[A-Z]+[1-9]\d*
This only allows capital letters as the prefix. If you want case insensitivity, use RegexOptions.IgnoreCase.
You could simplify this by replacing [A-Z]+[1-9]\d* with plain old [A-Z]\d+, but that will only allow a one-letter prefix, and it also allows stuff like A0 and B01. Up to you.
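The trade-off described above can be seen with a few inputs. The sketch below uses Python's `re.IGNORECASE` as a stand-in for RegexOptions.IgnoreCase (an assumption made for illustration); `STRICT_CELL` and `SIMPLE_CELL` are hypothetical names for the two cell sub-patterns being compared.

```python
import re

# Full pattern from this answer, compiled with and without case-insensitivity.
FULL = r"(([A-Z]+[1-9]\d*:)?[A-Z]+[1-9]\d*,)*([A-Z]+[1-9]\d*:)?[A-Z]+[1-9]\d*"
CASE_SENSITIVE = re.compile(FULL)
CASE_INSENSITIVE = re.compile(FULL, re.IGNORECASE)

# The two candidate sub-patterns for a single cell.
STRICT_CELL = re.compile(r"[A-Z]+[1-9]\d*")
SIMPLE_CELL = re.compile(r"[A-Z]\d+")

print(CASE_SENSITIVE.fullmatch("a1:b10,c3") is not None)
print(CASE_INSENSITIVE.fullmatch("a1:b10,c3") is not None)

for cell in ["A0", "B01", "AA2"]:
    print(f"{cell}: strict={STRICT_CELL.fullmatch(cell) is not None}, "
          f"simple={SIMPLE_CELL.fullmatch(cell) is not None}")
```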
EDIT:
Having thought hard about DocMax's mention of lookarounds, and using Hans Kesting's answer as inspiration, it occurs to me that this should work:
^[A-Z]+\d+((,|(?<!:\w*):)[A-Z]+\d+)*$
Or if you want something really twisted:
^([A-Z]+\d+(,|$|(?<!:\w*):))*(?<!,|:)
As in the previous example, replace \d+ with [1-9]\d* if you want to prevent leading zeros.
The idea behind the ,|(?<!:\w*): is that if a group is delimited by a comma, you want to let it through; but if it's a colon, it's only allowed if the previous delimiter wasn't a colon. The (,|$|...) version is madness, but it allows you to do it all with only one [A-Z]+\d+ block.
However! Even though this is shorter, and I'll admit I feel a teeny bit clever about it, I pity the poor fellow who has to come along and maintain it six months from now. It's fun from a code-golf standpoint, but I think it's best for practical purposes to go with the earlier version, which is a lot easier to read.

I think your regex is incorrect; try (([A-Za-z0-9]*)[:,]?)*
Edit: to correct the bug pointed out by Baud: (([A-Za-z0-9]*)[:,]?)*([A-Za-z0-9]+)
And finally, the best version: (([A-Za-z]+[0-9]+)[:,]?)*([A-Za-z]+[0-9]+)
Ah, OK, this probably won't work... but to answer 1: no, I don't think your regex is correct.
( ) form a group
[ ] form a character class (you can use ranges such as A-Z, a-d, 0-9, or just single characters)
? means 0 or 1
* means 0 or more
I'd suggest reading http://www.regular-expressions.info/reference.html .
That's where I learned regexes some time ago ;)
And for building expressions I use Rad Software Regular Expression Designer.

Let's build this step by step.
If you are following an Excel addressing format, to match a single-cell entry in your CSL, you would use the regular expression:
[A-Z]{1,2}[1-9]\d*
This matches the following in sequence:
Any character in A to Z once or twice
Any digit in 1 to 9
Any digit zero or more times
The digit expression will prevent inputting a cell address with leading zeros.
To build the expression that allows for a cell address pair, repeat the expression preceded by a colon as optional.
[A-Z]{1,2}[1-9]\d*(:[A-Z]{1,2}[1-9]\d*)?
Now allow for repeating the pattern preceded by a comma zero or more times and add start and end string delimiters.
^[A-Z]{1,2}[1-9]\d*(:[A-Z]{1,2}[1-9]\d*)?(,[A-Z]{1,2}[1-9]\d*(:[A-Z]{1,2}[1-9]\d*)?)*$
Kind of long and obnoxious, I admit, but after trying enough variants, I can't find a way of shortening it.
Hope this is helpful.
