How can I elaborate this fairly complex regex to validate names - c#

I want to add a temporary regex to validate names. The rules are: only a-z, A-Z and spaces are allowed. The name must be more than 3 letters and have one space in this case. It also cannot have two spaces in a row (one after the other). I don't care about spaces at the beginning or end of the string because I can trim them. Only the first word can be a single letter; the others must be two or more. Excuse me if one rule contradicts another; it is very difficult to formulate this question.
ed_vu (valid)
edd_v (invalid) the second word is one character (must be two or more)
e_lui (valid)
e_li (valid)
e_ei_ed (valid) even though it has two spaces it has more than 4 letters
More examples
e_el_ed__uuu (invalid) two spaces in line
e_el_elld_liid_eiii_idid (valid)

Try this one:
/^[a-zA-Z]+(?: [a-zA-Z]{2,})*$/gm
Note: these flags are not necessarily what you need (they're useful for the example though).
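For C#, a minimal sketch of how the pattern above might be exercised. The /…/gm wrapper is regex-literal notation from other languages and is dropped here. The class and method names are made up for illustration, and the underscores in the question's examples are assumed to stand for spaces.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NameCheck
{
    // Pattern from the answer, without the /.../gm literal syntax.
    private static readonly Regex Name =
        new Regex(@"^[a-zA-Z]+(?: [a-zA-Z]{2,})*$");

    // Trims first, since the question says leading/trailing spaces do not matter.
    public static bool IsValid(string name) => Name.IsMatch(name.Trim());

    public static void Main()
    {
        // Underscores from the question's examples written as spaces (assumption).
        string[] samples = { "ed vu", "edd v", "e lui", "e ei ed", "e el ed  uuu" };
        foreach (var s in samples)
            Console.WriteLine($"\"{s}\" -> {IsValid(s)}");
    }
}
```

Because of the trailing *, a single word with no space also passes. If the "must have a space" rule is meant strictly, changing * to + would probably be the smallest adjustment.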

Underscore in regex not validating

How do I add underscore as a part of my regex string.
Here is my string that checks for uppercase, lowercase, numbers and special characters. The rest of the special characters work. Validation isn't working for underscores.
@"^[^\s](?=(.*[A-Za-z]){1,})(?=(.*[\d]){1,})(?=(.*[\W]){1,})(?=(.*[!@#$%^&*()-+=\[{\]};:<>|_.\\/?,\-`'""~]{1,})).*[^\s]$"
Any ideas?
Thanks
This is the regex that AWS Cognito uses; it should apply to your situation:
@"^(?=.*[a-z])(?=.*[A-Z])(?=.*[0-9])(?=.*[\^$*.\[\]{}\(\)?\-“!@#%&\/,><’:;|_~`])\S{8,99}$"
You can check regexes at http://regexstorm.net, which is faster than building your application every time.
I've approached it like this: I took your requirements and made them into separate positive lookaheads:
Check for:
uppercase (?=.*[A-Z])
lowercase (?=.*[a-z]) (note that I broke A-Z and a-z up into separate groups)
numbers (?=.*\d)
special characters (?=.*[!@#$%^&*()-+=\[{\]};:<>|_.\\/?,\-`'""~])
You can then combine them in any order and I've combined them in the same order as I listed them above and anchored it with the beginning of the line using ^. Don't add any extra matches before, in-between or after the groups in your requirement that could cause the regex to enforce a certain ordering of the groups:
The lookahead for any non-word character \W makes it impossible to match Underscore1_ since it will only match on "anything other than a letter, digit or underscore" - which is all Underscore1_ contains.
The starting [^\s] (and ending [^\s]) that consumes one character is likely destroying a lot of good matches. Underscore1_ or _1scoreUnder shouldn't matter, but if you start with _ and consume it with [^\s] like you do, the later lookahead for a special character will fail (unless you have a second special character in the password).
@"^(?=.*[A-Z])(?=.*[a-z])(?=.*\d)(?=.*[!@#$%^&*()-+=\[{\]};:<>|_.\\/?,\-`'""~])"
If you have a minimum length requirement of, say, 7 characters, you just have to add .{7,}$ to the end of the regex, making it:
@"^(?=.*[A-Z])(?=.*[a-z])(?=.*\d)(?=.*[!@#$%^&*()-+=\[{\]};:<>|_.\\/?,\-`'""~]).{7,}$"
Without a minimum length, a password of one character from each group will be enough, and since there are 4 groups, a password with only 4 characters will pass the filter.
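As a rough sketch, the combined pattern with the 7-character minimum could be wrapped like this in C#. The class name, method name and sample passwords are illustrative assumptions, not part of the original answer.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PasswordCheck
{
    // Four independent lookaheads (upper, lower, digit, special) plus .{7,} for length.
    // Inside a verbatim string, "" stands for a single double quote.
    private static readonly Regex Policy = new Regex(
        @"^(?=.*[A-Z])(?=.*[a-z])(?=.*\d)(?=.*[!@#$%^&*()-+=\[{\]};:<>|_.\\/?,\-`'""~]).{7,}$");

    public static bool IsAcceptable(string candidate) => Policy.IsMatch(candidate);

    public static void Main()
    {
        // Hypothetical sample passwords.
        foreach (var s in new[] { "Underscore1_", "_1scoreUnder", "Underscore1", "Aa1_" })
            Console.WriteLine($"{s}: {IsAcceptable(s)}");
    }
}
```

Since the lookaheads all start at the same position and nothing is consumed before them, the underscore can sit at the start or the end of the password and still count as the special character.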
I see no point in putting an upper length limit into the regex. If the user interface has accepted a string that is thousands of characters long, then why reject it for being too long later? The length of what you store is probably going to be much smaller anyway since you'll be storing the bcrypt/scrypt/argon2/... encoded password.
Suggestion: Also add space (or even whitespaces) to the list of special characters.
In your regex, add an underscore to the character class in the third capturing group (see regex101):
@"^[^\s](?=(.*[A-Za-z]){1,})(?=(.*[\d]){1,})(?=(.*[\W_]){1,})(?=(.*[!@#$%^&*()-+=\[{\]};:<>|_.\\/?,\-`'""~]{1,})).*[^\s]$"

Match vocabulary words and phrases

I am writing an application/logic that takes a vocabulary word or phrase as an input parameter. I am having trouble writing validation logic for this parameter's value!
Following are the rules I've come up with:
can be up to 4 words (with hyphens or not)
one apostrophe is allowed
only regular letters are allowed (no special characters like !@#$%^&*()={}[]"";|/>/? ¶ © etc)
numbers are disallowed
case insensitive
multiple languages support (English, Russian, Norwegian, etc..) (so both Unicode and Cyrillic must be supported)
either whole string matches or nothing
Few examples (in 3 languages):
// match:
one two three four
one-two-three-four
one-two-three four
vær så snill
тест регекс
re-read
under the hood
ONe
rabbit's lair
// not-match:
one two three four five
one two three four#
one-two-three-four five
rabbit"s lair
one' two's
one1
1900
Given the expected result provided above - could someone point me to right direction on how to create a validation rule like that? If that matters - I will be writing validation logic in C# so I have more tools than just Regex available at my disposal.
If that is going to be of any help - I have been testing several solutions, like these ^[\p{Ll}\p{Lt}]+$ and (?=\S*['-])([a-zA-Z'-]+)$. The first regex seems to be doing a great job allowing just the letters I need (En, No and Rus), whereas the second rule set is doing great in using the Lookahead concept.
\p{Ll} or \p{Lowercase_Letter}: a lowercase letter that has an uppercase variant.
\p{Lu} or \p{Uppercase_Letter}: an uppercase letter that has a lowercase variant.
\p{Lt} or \p{Titlecase_Letter}: a letter that appears at the start of a word when only the first letter of the word is capitalized.
\p{L&} or \p{Letter&}: a letter that exists in lowercase and uppercase variants (combination of Ll, Lu and Lt).
\p{Lm} or \p{Modifier_Letter}: a special character that is used like a letter.
\p{Lo} or \p{Other_Letter}: a letter or ideograph that does not have lowercase and uppercase variants.
Needless to say, neither of the solutions I have been testing take into account all the rules I defined above..
You can use
\A(?!(?:[^']*'){2})\p{L}+(?:[\s'-]\p{L}+){0,3}\z
See the regex demo. Details:
\A - start of string
(?!(?:[^']*'){2}) - the string cannot contain two apostrophes
\p{L}+ - one or more Unicode letters
(?:[\s'-]\p{L}+){0,3} - zero to three occurrences of
[\s'-] - a whitespace, ' or - char
\p{L}+ - one or more Unicode letters
\z - the very end of string.
In C#, you can use it as
var IsValid = Regex.IsMatch(text, @"\A(?!(?:[^']*'){2})\p{L}+(?:[\s'-]\p{L}+){0,3}\z");
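To see how the pattern behaves on the examples from the question, a small self-contained sketch follows. The class and method names are assumptions made for the illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class VocabularyCheck
{
    // At most one apostrophe, then 1 to 4 runs of Unicode letters
    // separated by a single whitespace, apostrophe or hyphen.
    private static readonly Regex Phrase = new Regex(
        @"\A(?!(?:[^']*'){2})\p{L}+(?:[\s'-]\p{L}+){0,3}\z");

    public static bool IsValid(string text) => Phrase.IsMatch(text);

    public static void Main()
    {
        string[] samples =
        {
            "one-two-three four", "vær så snill", "тест регекс", "rabbit's lair",
            "one two three four five", "rabbit\"s lair", "one' two's", "one1"
        };
        foreach (var s in samples)
            Console.WriteLine($"{s} -> {IsValid(s)}");
    }
}
```

One thing to keep in mind: the part after an apostrophe counts as one of the four letter runs, so "rabbit's lair" uses three of them. A four-word phrase that also contains an apostrophe would be rejected by this pattern.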

Noncapturing along with capturing match

I am trying to capture the subdomain from huge lists of domain names. For example, I want to capture "funstuff" from "funstuff.mysite.com". I do not want to capture ".mysite.com" in the match. These occurrences are in a sea of text, so I cannot depend on them being at the start of a line. I know the subdomain will not include any special characters or numbers. So what I have is:
[a-z]{2,10}(?=\.mysite\.com)
The problem is this will work only if the subdomain is NOT preceded by a number or special character. For example, "asdfbasdasdfdfunstuff.mysite.com" will return "fdfunstuff" but "asdfasf23/funstuff.mysite.com" won't make a match.
I can not depend on there being a special character before the subdomain, like a "/" as in "http://funstuff.mysite.com" so that can not be used as part of the condition.
It is OK if the capture picks up erroneous text before the subdomain, although 99% of the time it will be preceded by something other than a lowercase letter. I have tried
(?<=[^a-z])[a-z]{2,10}(?=\.mysite\.com)
but for some reason this does not capture the text in a situation like:
afb"asdfunstuff.mysite.com
where the quotation mark prevents a match for [a-z]{2,10}. Basically, what I would want to do in that case is capture asdfunstuff.mysite.com. How can this be accomplished?
So you've got two problems to solve: first, you want to match ".mysite.com" but not capture it; second, you want to grab up to 10 alphabetic characters in the "subdomain" position.
First problem can be solved by using a capturing group. The regex
([a-z]{2,10})\.mysite\.com
will capture between 2 and 10 characters, and the returned match object will expose that capture in one of its properties (which one depends on the language). In C#, Regex.Match returns a Match object whose Groups collection holds the capture as Groups[1] (Groups[0] is the whole match).
The second problem can be solved with the word-boundary anchor \b. In .NET, this matches where a word character (\w) is next to a non-word character (\W). Other languages (e.g. ECMAScript / JavaScript) work similarly.
So, I suggest the following regex to solve your problem:
\b([a-z]{2,10})\.mysite\.com
Note that numbers are legal in subdomain names, too, so the following might be generally correct (though perhaps not in your specific case):
\b(\w{2,10})\.mysite\.com
where the "word character" \w is equivalent to [a-zA-Z_0-9] in .NET's ECMAScript-compliant mode. (Further reading.)
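A minimal sketch of pulling the captured subdomain out of a block of text in C#, using the \b pattern suggested above. The class name, method name and sample text are assumptions for the illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class SubdomainFinder
{
    // Group 1 holds the subdomain; ".mysite.com" is matched but not captured.
    private static readonly Regex Pattern =
        new Regex(@"\b([a-z]{2,10})\.mysite\.com");

    public static List<string> Find(string text)
    {
        var found = new List<string>();
        foreach (Match m in Pattern.Matches(text))
            found.Add(m.Groups[1].Value); // Groups[0] would be the whole match
        return found;
    }

    public static void Main()
    {
        // Hypothetical input text.
        var text = "see http://funstuff.mysite.com and asdfasf23/funstuff.mysite.com";
        Console.WriteLine(string.Join(", ", Find(text)));
    }
}
```

Note that with \b in front, a run of more than 10 lowercase letters before ".mysite.com" (such as the 11-letter "asdfunstuff" from the question) produces no match at all rather than a truncated one. Raising the upper bound in {2,10} would be one way to handle that, if longer subdomains are expected.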

Regular Expression for UK postcodes

I have a list of post codes which should be excluded from my shipping methods.
Suppose I have to exclude Scilly Isles, Isle of Man and few others.
For the above 2 areas valid post codes are IM1-IM9, IM86, IM87, IM89. And if it is IM25 or IM85 it is invalid.
I have written the following expression, but it still matches IM25 and IM85.
var regex = new Regex("(PO3[0-9]|PO4[0-1]|GY[1-9]|JE[1-5]|IM[1-9]|TR[1-9])");
If I pass IM85 to my expression, it should return false. For IM1-IM9, IM86, IM87 and IM89 it should return true.
The same goes for TR postcodes: TR1-TR27 are valid. If I give TR28, it should return false.
I am using '|' to separate multiple patterns. Is that the right way to include multiple patterns in one expression?
What do you expect? What should be matched and what not? And please give an example of the string you want to test.
If you match your pattern against "IM25", it will match because your pattern allows IM[1-9], so you get a valid partial match on "IM2". If you want to avoid that (I am not sure what you want to achieve) and really allow only a single digit after the letters, use a word boundary \b and specify exactly what you want to allow, something like this:
(PO3[0-9]|PO4[0-1]|GY[1-9]|JE[1-5]|IM([1-9]|8[6-9])|TR([1-9]|2[0-7]))\b
See it here on Regexr
For the "IM" part, this also allows 6-9 as a second digit when the first digit is an 8.
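Measured against the ranges listed in the question, the pattern above accepts IM88 and misses TR10-TR19. The sketch below is one possible adjustment, not the answerer's pattern: it narrows IM to 86, 87, 89 and adds 1[0-9] for TR. The ^ anchor, the upper-casing and the class name are additional assumptions.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ExcludedAreaCheck
{
    // Two-digit alternatives are listed before the single-digit fallback;
    // \b stops "IM2" from matching inside "IM25".
    private static readonly Regex Excluded = new Regex(
        @"^(PO3[0-9]|PO4[0-1]|GY[1-9]|JE[1-5]|IM(8[679]|[1-9])|TR(1[0-9]|2[0-7]|[1-9]))\b");

    // Assumes the outward code is at the start of the string, e.g. "IM1" or "IM1 1AA".
    public static bool IsExcluded(string postcode) =>
        Excluded.IsMatch(postcode.Trim().ToUpperInvariant());

    public static void Main()
    {
        foreach (var p in new[] { "IM1", "IM86", "IM85", "IM25", "TR15", "TR27", "TR28" })
            Console.WriteLine($"{p}: {IsExcluded(p)}");
    }
}
```

This relies on a non-word character or the end of the string following the outward code, so a postcode written without a space would not be recognised by the \b check.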
Update
It is still not clear what the context of your task is. I assume you have a list of valid postcodes. It would probably be better to extract the postcode, or only the first part of it (you can use a regex for that), and check whether it is in the list.
The actual validation is on the wikipedia site... Google has the answers ;) http://en.wikipedia.org/wiki/Postcodes_in_the_United_Kingdom#Validation
(GIR 0AA)|(((A[BL]|B[ABDFHLNRSTX]?|C[ABFHMORTVW]|D[ADEGHLNTY]|E[HNX]?|F[KY]|G[LUY]?|H[ADGPRSUX]|I[GMPV]|JE|K[ATWY]|L[ADELNSU]?|M[EKL]?|N[EGNPRW]?|O[LX]|P[AEHLOR]|R[GHM]|S[AEGKLMNOPRSTY]?|T[ADFNQRSW]|UB|W[ADFNRSV]|YO|ZE)[1-9]?[0-9]|((E|N|NW|SE|SW|W)1|EC[1-4]|WC[12])[A-HJKMNPR-Y]|(SW|W)([2-9]|[1-9][0-9])|EC[1-9][0-9]) [0-9][ABD-HJLNP-UW-Z]{2})
I still think you need more clarification. As a huge Regex guy, I would like to point out that multi-digit ranges should try to be put into the code side, not the Regex side, just for your sanity. But I personally like to play with Regex in this way. Regex reads one character at a time, so it only recognizes zero through nine. Not ten, not twenty eight. If you want to allow the following:
28 through 347
Then it becomes pretty complicated.
To put it into words, you want to allow:
If Two Digits, allow 2-9 for the first digit, and:
  If the first digit is a Two, then allow 8/9 for the second digit,
  ElseIf the first digit is 3-9, then allow 0-9 for the second digit
ElseIf Three Digits, allow 1-3 for the first digit, and:
  If the first digit is a Three, then allow 0-4 for the second digit, and:
    If the second digit is a Four, then allow 0-7 for the third digit,
    ElseIf the second digit is 0-3, then allow 0-9 for the third digit.
  ElseIf the first digit is 1/2, then allow 0-9 for both the second and third digits.
Then with that, you can write a proper Regex like so, which searches for a word boundary or non-Digit surrounding a 2-pair or 3-pair. With this type of Problem-Solving, you should be able to figure out your Regex issue. Otherwise, let us know more about EXACTLY What you want to Match and NOT Match:
(\b|\D)((2[89]|[3-9][0-9])(\b|\D)|(3(4[0-7]|[0-3][0-9])|[12][0-9][0-9])(\b|\D))
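To illustrate the "code side versus regex side" point, here is a small sketch comparing the two approaches for the 28 through 347 range. It uses the same digit-by-digit alternatives as the pattern above, but anchored with \A and \z instead of (\b|\D), which is a simplification assumed for whole-string input. All names are illustrative.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

public static class RangeCheck
{
    // 28-29, 30-99, 300-347, 100-299, written out digit by digit.
    private static readonly Regex InRange = new Regex(
        @"\A(?:2[89]|[3-9][0-9]|3(?:4[0-7]|[0-3][0-9])|[12][0-9][0-9])\z");

    public static bool ByRegex(string s) => InRange.IsMatch(s);

    // The "code side" alternative: parse plain digits, then compare numerically.
    public static bool ByCode(string s) =>
        int.TryParse(s, NumberStyles.None, CultureInfo.InvariantCulture, out var n)
        && n >= 28 && n <= 347;

    public static void Main()
    {
        // Count how often the two approaches disagree on 0..999.
        int mismatches = 0;
        for (int i = 0; i < 1000; i++)
        {
            var s = i.ToString(CultureInfo.InvariantCulture);
            if (ByRegex(s) != ByCode(s)) mismatches++;
        }
        Console.WriteLine($"mismatches: {mismatches}");
    }
}
```

The numeric comparison states the range in one line and is easy to change. The two versions do differ on input with leading zeros such as "028", which the parse accepts and the regex rejects.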
I have changed my approach.
Instead of going for a regular expression that keeps getting more complex, I am storing all the excluded outward codes of UK postcodes.
If a postcode contains one of those outward codes, I exclude that postcode from the list.
Outward codes are in this format
XX-YYY
XXX-YYY
XXXX-YYY
In all the above formats, X represents the outward code of a UK postcode.
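A minimal sketch of what such a lookup might look like. It assumes the input is a full postcode whose last three characters are the inward part (the YYY above), and the excluded codes listed here are placeholders rather than the real list.

```csharp
using System;
using System.Collections.Generic;

public static class OutwardCodeFilter
{
    // Placeholder entries; the real set would hold every excluded outward code.
    private static readonly HashSet<string> Excluded =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "IM1", "IM86", "TR21" };

    // Assumes a full postcode: drops separators, then the 3-character inward part.
    public static string OutwardCode(string postcode)
    {
        var compact = postcode.Replace(" ", "").Replace("-", "");
        return compact.Length > 3 ? compact.Substring(0, compact.Length - 3) : compact;
    }

    public static bool IsExcluded(string postcode) =>
        Excluded.Contains(OutwardCode(postcode));

    public static void Main()
    {
        foreach (var p in new[] { "IM1 1AA", "im86 1aa", "IM85 1AA", "TR21-0AA" })
            Console.WriteLine($"{p}: {OutwardCode(p)} excluded={IsExcluded(p)}");
    }
}
```

Comparing the whole outward code against a set avoids the partial-match problem from the regex version, because "IM2" and "IM25" are simply different entries.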

Is my C# Reg-ex correct?

Is this regex correct if I have to match a string which is at least 7 characters long, not more than 20 characters, has at least 1 number, and at least 1 letter? It has no other constraints.
[0-9]+[A-Za-z]+{7,20}
Thanks
No, it's not. The quantifier {7,20} has no token to apply to: repetition in regexes is done with quantifiers such as *, +, ? or the more general {n,m}, and you cannot use more than one quantifier on a single token (here [A-Za-z], which already has +). (*? is a quantifier in its own right and thus doesn't fall under that rule.) You'll need something like the following:
^(?=.*\d)(?=.*[a-zA-Z]).{7,20}$
This has two lookaheads making sure of at least one digit and at least one letter:
(?=.*\d)
(?=.*[a-zA-Z])
Lookarounds are zero-width assertions; they do not consume characters in the string so they are merely matching a position. But they make sure that the expression inside of them would match at the current point. In this case this expression would match arbitrarily many characters and then would require a digit or a letter, respectively.
The actual match itself,
.{7,20}
just makes sure the length matches. Which characters are used is irrelevant because the lookaheads above already took care of those constraints.
Finally the whole expression is anchored in that a start-of-string and end-of-string anchor are inserted at the start and end:
^...$
This makes sure that the match really encompasses the whole string. While not strictly necessary in this case (it would match the whole string anyway in all valid cases) it's often a good idea to include because usually regexes match only substrings and this can lead to subtle problems where validation regexes match even though they should fail. E.g. using \d+ to make sure a string consists only of digits would match the string a4b which puzzles beginners quite often.
I also changed that the order of letters and numbers doesn't matter. Your regex looks like it tries to impose a definite order where all numbers need to come before all letters which usually isn't what's wanted here.
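A short sketch of the anchored pattern in use, together with the unanchored \d+ pitfall mentioned above. Class and method names are assumptions for the illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LengthAndMixCheck
{
    // At least one digit, at least one letter, 7 to 20 characters overall.
    private static readonly Regex Rule =
        new Regex(@"^(?=.*\d)(?=.*[a-zA-Z]).{7,20}$");

    public static bool IsValid(string s) => Rule.IsMatch(s);

    public static void Main()
    {
        foreach (var s in new[] { "abc1234", "1234abc", "abcdefg", "1234567", "a1" })
            Console.WriteLine($"{s}: {IsValid(s)}");

        // Without anchors, \d+ finds the "4" inside "a4b"; with anchors it does not.
        Console.WriteLine(Regex.IsMatch("a4b", @"\d+"));
        Console.WriteLine(Regex.IsMatch("a4b", @"^\d+$"));
    }
}
```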
