Match vocabulary words and phrases - c#

I am writing an application/logic that takes a vocabulary word/phrase as an input parameter. I am having trouble writing validation logic for this parameter's value!
Following are the rules I've come up with:
can be up to 4 words (with hyphens or not)
one apostrophe is allowed
only regular letters are allowed (no special characters like !@#$%^&*()={}[]"";|/>/? ¶ © etc.)
numbers are disallowed
case insensitive
multi-language support (English, Russian, Norwegian, etc.), so Unicode letters such as Latin and Cyrillic must be supported
either whole string matches or nothing
A few examples (in 3 languages):
// match:
one two three four
one-two-three-four
one-two-three four
vær så snill
тест регекс
re-read
under the hood
ONe
rabbit's lair
// not-match:
one two three four five
one two three four#
one-two-three-four five
rabbit"s lair
one' two's
one1
1900
Given the expected results above, could someone point me in the right direction on how to create a validation rule like that? If it matters, I will be writing the validation logic in C#, so I have more tools than just Regex at my disposal.
If that is going to be of any help - I have been testing several solutions, like these ^[\p{Ll}\p{Lt}]+$ and (?=\S*['-])([a-zA-Z'-]+)$. The first regex seems to be doing a great job allowing just the letters I need (En, No and Rus), whereas the second rule set is doing great in using the Lookahead concept.
\p{Ll} or \p{Lowercase_Letter}: a lowercase letter that has an uppercase variant.
\p{Lu} or \p{Uppercase_Letter}: an uppercase letter that has a lowercase variant.
\p{Lt} or \p{Titlecase_Letter}: a letter that appears at the start of a word when only the first letter of the word is capitalized.
\p{L&} or \p{Letter&}: a letter that exists in lowercase and uppercase variants (combination of Ll, Lu and Lt).
\p{Lm} or \p{Modifier_Letter}: a special character that is used like a letter.
\p{Lo} or \p{Other_Letter}: a letter or ideograph that does not have lowercase and uppercase variants.
Needless to say, neither of the solutions I have been testing takes into account all the rules I defined above.

You can use
\A(?!(?:[^']*'){2})\p{L}+(?:[\s'-]\p{L}+){0,3}\z
See the regex demo. Details:
\A - start of string
(?!(?:[^']*'){2}) - the string cannot contain two apostrophes
\p{L}+ - one or more Unicode letters
(?:[\s'-]\p{L}+){0,3} - zero to three occurrences of
[\s'-] - a whitespace, ' or - char
\p{L}+ - one or more Unicode letters
\z - the very end of string.
In C#, you can use it as
var IsValid = Regex.IsMatch(text, @"\A(?!(?:[^']*'){2})\p{L}+(?:[\s'-]\p{L}+){0,3}\z");
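As a rough check, the pattern can be run against some of the examples from the question. This is only a minimal sketch: the helper name IsValidEntry and the two sample arrays are assumptions made for illustration, not part of the original answer.

```csharp
using System;
using System.Text.RegularExpressions;

// Pattern from the answer: at most one apostrophe, and one to four runs of
// letters joined by a single whitespace, apostrophe, or hyphen.
var vocabulary = new Regex(@"\A(?!(?:[^']*'){2})\p{L}+(?:[\s'-]\p{L}+){0,3}\z");

// Hypothetical helper name, used only in this sketch.
bool IsValidEntry(string text) => vocabulary.IsMatch(text);

string[] expectedValid =
{
    "one two three four", "one-two-three four", "vær så snill",
    "тест регекс", "re-read", "ONe", "rabbit's lair"
};
string[] expectedInvalid =
{
    "one two three four five", "one two three four#", "rabbit\"s lair",
    "one' two's", "one1", "1900"
};

foreach (var s in expectedValid)
    Console.WriteLine($"{s}: {IsValidEntry(s)}");   // each line ends with True
foreach (var s in expectedInvalid)
    Console.WriteLine($"{s}: {IsValidEntry(s)}");   // each line ends with False
```

Note that the apostrophe acts as a separator in this pattern, so "rabbit's lair" already uses three of the four allowed letter runs.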

Related

Underscore in regex not validating

How do I add underscore as a part of my regex string.
Here is my string that checks for uppercase, lowercase, numbers and special characters. The rest of the special characters work. Validation isn't working for underscores.
@"^[^\s](?=(.*[A-Za-z]){1,})(?=(.*[\d]){1,})(?=(.*[\W]){1,})(?=(.*[!@#$%^&*()-+=\[{\]};:<>|_.\\/?,\-`'""~]{1,})).*[^\s]$"
Any ideas?
Thanks
This is the regex that AWS Cognito uses; it should apply to your situation:
@"^(?=.*[a-z])(?=.*[A-Z])(?=.*[0-9])(?=.*[\^$*.\[\]{}\(\)?\-“!@#%&\/,><’:;|_~`])\S{8,99}$"
You can check regexes at http://regexstorm.net, which is faster than rebuilding your application every time.
I've approached it like this: I took your requirements and made them into separate positive lookaheads:
Check for:
uppercase (?=.*[A-Z])
lowercase (?=.*[a-z]) (note that I broke A-Z and a-z up into separate groups)
numbers (?=.*\d)
special characters (?=.*[!@#$%^&*()-+=\[{\]};:<>|_.\\/?,\-`'""~])
You can then combine them in any order; I've combined them in the order listed above and anchored the result to the beginning of the line with ^. Don't add any extra matches before, in between, or after the groups that could cause the regex to enforce a certain ordering of the groups:
The lookahead for a non-word character \W makes it impossible to match Underscore1_, since \W only matches "anything other than a letter, digit or underscore", and letters, digits and an underscore are all that Underscore1_ contains.
The starting [^\s] (and ending [^\s]) that consumes one character is likely destroying a lot of good matches. Underscore1_ or _1scoreUnder shouldn't matter, but if you start with _ and consume it with [^\s] like you do, the later lookahead for a special character will fail (unless you have a second special character in the password).
@"^(?=.*[A-Z])(?=.*[a-z])(?=.*\d)(?=.*[!@#$%^&*()-+=\[{\]};:<>|_.\\/?,\-`'""~])"
If you have a minimum length requirement of, say, 7 characters, you just have to add .{7,}$ to the end of the regex, making it:
@"^(?=.*[A-Z])(?=.*[a-z])(?=.*\d)(?=.*[!@#$%^&*()-+=\[{\]};:<>|_.\\/?,\-`'""~]).{7,}$"
Without a minimum length, a password of one character from each group will be enough, and since there are 4 groups, a password with only 4 characters will pass the filter.
I see no point in putting an upper length limit into the regex. If the user interface has accepted a string that is thousands of characters long, then why reject it for being too long later? The length of what you store is probably going to be much smaller anyway since you'll be storing the bcrypt/scrypt/argon2/... encoded password.
Suggestion: Also add space (or even whitespaces) to the list of special characters.
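The lookahead approach with a 7-character minimum can be tried out roughly as follows. This is a sketch rather than a definitive implementation: the helper name IsAcceptablePassword and the sample passwords are made up for illustration, and the pattern is written as a C# verbatim string (@"...", with "" standing for a literal double quote).

```csharp
using System;
using System.Text.RegularExpressions;

// Four independent lookaheads, all evaluated at the start of the string,
// followed by a minimum length check of 7 characters.
var password = new Regex(
    @"^(?=.*[A-Z])(?=.*[a-z])(?=.*\d)(?=.*[!@#$%^&*()-+=\[{\]};:<>|_.\\/?,\-`'""~]).{7,}$");

// Hypothetical helper name, used only in this sketch.
bool IsAcceptablePassword(string candidate) => password.IsMatch(candidate);

Console.WriteLine(IsAcceptablePassword("Underscore1_"));  // True
Console.WriteLine(IsAcceptablePassword("_1scoreUnder"));  // True: position of _ is irrelevant
Console.WriteLine(IsAcceptablePassword("underscore1_"));  // False: no uppercase letter
Console.WriteLine(IsAcceptablePassword("Und1_"));         // False: shorter than 7 characters
```

Because every lookahead starts from the same position and consumes nothing, the order of the character groups in the password does not matter.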
In your regex, add an underscore to the character class in the third capturing group, so that [\W] becomes [\W_] (regex101):
@"^[^\s](?=(.*[A-Za-z]){1,})(?=(.*[\d]){1,})(?=(.*[\W_]){1,})(?=(.*[!@#$%^&*()-+=\[{\]};:<>|_.\\/?,\-`'""~]{1,})).*[^\s]$"

Regex to find a string that does not contain a space while matching other conditions

I have a bunch of strings that may contain certain patterns, specifically the following three.
Starts with (- followed by 10 digits followed by ).
E.g.:
(-1234567890)
Starts with (, ends with ), and may contain 1 or more characters, but NO spaces.
E.g.:
(ABC) or (AF33) or (2345)
Starts with (, ends with ), and may contain 1 or more characters, INCLUDING spaces.
E.g.:
(Some string)
The strings I work with may contain zero or more of the patterns above. My requirement is to match ONLY the second one from above in a given string, and I'd like to be able to use Regex class in C#.
For example, let's say following are five different strings I have.
This is some random text.
This is some (ABC) random (-1234567890) text.
This is some (XY12) random (-1234567890) text.
This is some (Contains space) random (-1234567890) text.
This is some () random text.
My Regex should match only the 2nd and 3rd strings from the above list.
So far, I've managed to write this following Regex, which excludes strings 1 and 5.
.*\((?!\-).+\).*
This matches the 2nd, 3rd, AND 4th strings above. Now I'm not sure how I can get it to exclude the 4th one, which contains spaces inside the parentheses. I know that \s matches whitespace (and \S matches non-whitespace), but how can I tell it to match only parentheses that contain no spaces and that don't have a - right after the opening (?
EDIT 1:
There will never be nested parentheses in my strings.
EDIT 2:
Here's a Regex Tester.
.*\(\w+\).*
If you use the regex above, only the second and third strings match.
.* any characters
\( opening parenthesis
\w+ word characters (at least one)
\) closing parenthesis
.* any characters
\(([^- ]+[^ ]*)\)
should work
Explanation:
[^- ]+ first matches one or more characters that are neither - nor a space. This makes sure the parentheses contain at least one character and do not start with -
Then [^ ]* matches zero or more non-space characters
This will work for any char set
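To see how the \w+ approach behaves on the five sample strings from the question, here is a small sketch. The names compactGroup and HasCompactGroup are assumptions for illustration, and the surrounding .* is dropped because Regex.IsMatch and Regex.Match already search anywhere in the string.

```csharp
using System;
using System.Text.RegularExpressions;

// A parenthesised run of word characters: not empty, no spaces, no leading '-'.
var compactGroup = new Regex(@"\((\w+)\)");

// Hypothetical helper name, used only in this sketch.
bool HasCompactGroup(string text) => compactGroup.IsMatch(text);

string[] samples =
{
    "This is some random text.",
    "This is some (ABC) random (-1234567890) text.",
    "This is some (XY12) random (-1234567890) text.",
    "This is some (Contains space) random (-1234567890) text.",
    "This is some () random text."
};

foreach (var s in samples)
{
    Match m = compactGroup.Match(s);
    Console.WriteLine(m.Success ? $"match: {m.Groups[1].Value}" : "no match");
}
// Prints, in order: no match, match: ABC, match: XY12, no match, no match
```

One difference between the two answers: \w+ rejects punctuation inside the parentheses, so (A.B) would not match, whereas \(([^- ]+[^ ]*)\) would accept it.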

How can I elaborate this fairly complex regex to validate names

I want to add a temporary regex to validate names. The rules are: only a-z, A-Z and spaces are allowed. The name must be more than 3 letters and have one space in this case. Also it cannot have two spaces in a row (one after the other). I don't care about spaces at the beginning or end of the string because I can trim them. Only the first word can be a single letter; the others must be two or more. Excuse me if some rule contradicts another, it is very difficult to formulate this question.
ed_vu (valid)
edd_v (invalid) the second word is one character (must be two or more)
e_lui (valid)
e_li (valid)
e_ei_ed (valid) even though it has two spaces it has more than 4 letters
More examples
e_el_ed__uuu (invalid) two spaces in line
e_el_elld_liid_eiii_idid (valid)
Try this one:
/^[a-zA-Z]+(?: [a-zA-Z]{2,})*$/gm
Note: these flags are not necessarily what you need (they're useful for the example though).
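In C# the /.../gm delimiters and flags are not part of the pattern. A minimal sketch of how the expression might be used follows; the helper names are hypothetical, and the second pattern (with + instead of *) is an assumed variant for the case where at least one space is mandatory, which the suggested regex does not enforce.

```csharp
using System;
using System.Text.RegularExpressions;

// Pattern from the answer, without the /.../gm wrapper.
var name = new Regex(@"^[a-zA-Z]+(?: [a-zA-Z]{2,})*$");
// Assumed variant: '+' instead of '*' so that at least one space is required.
var nameWithSpace = new Regex(@"^[a-zA-Z]+(?: [a-zA-Z]{2,})+$");

// Hypothetical helpers; input is trimmed first, as the question suggests.
bool IsValidName(string text) => name.IsMatch(text.Trim());
bool IsValidFullName(string text) => nameWithSpace.IsMatch(text.Trim());

Console.WriteLine(IsValidName("ed vu"));         // True
Console.WriteLine(IsValidName("edd v"));         // False: second word has one letter
Console.WriteLine(IsValidName("e ei ed"));       // True
Console.WriteLine(IsValidName("e el ed  uuu"));  // False: two spaces in a row
Console.WriteLine(IsValidName("ed"));            // True: '*' allows zero extra words
Console.WriteLine(IsValidFullName("ed"));        // False: '+' demands a second word
```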

Foreign language characters in Regular expression in C#

In C# code, I am trying to pass Chinese characters: " 中文ABC123".
When I use alphanumeric in general using "^[a-zA-Z0-9\s]+$",
it doesn't pass for "中文ABC123" and regex validation fails.
What other expressions do I need to add for C#?
To match any letter character from any language use:
\p{L}
If you also want to match numbers:
[\p{L}\p{Nd}]+
\p{L} ... matches a character of the Unicode category letter.
it is the short form for [\p{Ll}\p{Lu}\p{Lt}\p{Lm}\p{Lo}]
\p{Ll} ... matches lowercase letters. (abc)
\p{Lu} ... matches uppercase letters. (ABC)
\p{Lt} ... matches titlecase letters.
\p{Lm} ... matches modifier letters.
\p{Lo} ... matches letters without case. (中文)
\p{Nd} ... matches a character of the Unicode category decimal digit.
Just replace: ^[a-zA-Z0-9\s]+$ with ^[\p{L}0-9\s]+$
Thanks to @Andie2302 for pointing out the right way to do it.
In addition, many languages have combining characters that need a base character to attach to. For example, with the Thai word 'เก็บ', matching only \p{L} picks up just 'เกบ'; the mark above the second letter is lost.
That's why \p{L} alone will not work for every language.
So, to support most languages, allow both categories, for example in one character class:
[\p{L}\p{M}]
NOTE:
L stands for 'Letter' (all letters from all languages, not including marks)
M stands for 'Mark' (a mark cannot be displayed alone; it needs a letter to attach to)
If you also need numbers, use the code below
\p{N}
NOTE:
N stands for 'Number'
Thanks to this website for very useful information
https://www.regular-expressions.info/unicode.html
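The difference between the ASCII class, \p{L}, and \p{L} combined with \p{M} can be illustrated with a short sketch. The variable names are assumptions, and the Thai word is written with \u escapes so the combining mark (U+0E47) is explicit.

```csharp
using System;
using System.Text.RegularExpressions;

var asciiOnly = new Regex(@"^[a-zA-Z0-9\s]+$");
var lettersAnyScript = new Regex(@"^[\p{L}0-9\s]+$");
var lettersAndMarks = new Regex(@"^[\p{L}\p{M}0-9\s]+$");

string chinese = "中文ABC123";
// Thai word from the answer: three letters (Lo) plus U+0E47, a nonspacing mark (Mn).
string thai = "\u0E40\u0E01\u0E47\u0E1A";

Console.WriteLine(asciiOnly.IsMatch(chinese));         // False
Console.WriteLine(lettersAnyScript.IsMatch(chinese));  // True
Console.WriteLine(lettersAnyScript.IsMatch(thai));     // False: the mark is not a letter
Console.WriteLine(lettersAndMarks.IsMatch(thai));      // True
```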

C# Regex for a username with a few restrictions

Similar to this topic.
I am trying to validate a username with the following restrictions:
Must start with a letter or number
Must be 3 to 15 characters in length
Symbols include: . - _ ( ) [ ]
Symbols cannot be adjacent, but letters and numbers can
Edit:
Letters and numbers are a-z A-Z 0-9
Been stumped for a while. I'm new to regex.
As an optimization to Mark's answer:
^(?=.{3,15}$)([A-Za-z0-9][._()\[\]-]?)*$
Explanation:
(?=.{3,15}$) Must be 3-15 characters in the string
([A-Za-z0-9][._()\[\]-]?)* The string is a sequence of alphanumerics,
each of which may be followed by a symbol
This one permits Unicode alphanumerics:
^(?=.{3,15}$)((\p{L}|\p{N})[._()\[\]-]?)*$
This one is the Unicode variant, plus uses non-capturing groups:
^(?=.{3,15}$)(?:(?:\p{L}|\p{N})[._()\[\]-]?)*$
It is not so clean to express a set of unrelated rules in a single regular expression, but it can be done by using lookaround assertions (Rubular):
@"^(?=[A-Za-z0-9])(?!.*[._()\[\]-]{2})[A-Za-z0-9._()\[\]-]{3,15}$"
Explanation:
(?=[A-Za-z0-9]) Must start with a letter or number
(?!.*[._()\[\]-]{2}) Cannot contain two consecutive symbols
[A-Za-z0-9._()\[\]-]{3,15} Must consist of between 3 to 15 allowed characters
You might want to consider whether this would be easier to read and more maintainable as a list of simpler regular expressions, all of which must validate successfully, or else written in ordinary C# code.
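As a rough comparison, the lookaround regex and an ordinary C# version of the same rules might look like this. It is a sketch under stated assumptions: the helper names IsAsciiAlnum, IsSymbol and IsValidUsername are hypothetical, and the plain-code version is meant to mirror the regex on the sample inputs shown, not to be a proven equivalent.

```csharp
using System;
using System.Text.RegularExpressions;

var username = new Regex(@"^(?=[A-Za-z0-9])(?!.*[._()\[\]-]{2})[A-Za-z0-9._()\[\]-]{3,15}$");

bool IsAsciiAlnum(char c) =>
    (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || (c >= '0' && c <= '9');
bool IsSymbol(char c) => "._()[]-".IndexOf(c) >= 0;

// Hypothetical plain-code version of the same rules.
bool IsValidUsername(string name)
{
    if (name.Length < 3 || name.Length > 15) return false;   // length rule
    if (!IsAsciiAlnum(name[0])) return false;                // must start with letter or digit
    for (int i = 1; i < name.Length; i++)
    {
        char c = name[i];
        if (IsAsciiAlnum(c)) continue;
        if (!IsSymbol(c)) return false;                      // character outside the allowed set
        if (IsSymbol(name[i - 1])) return false;             // two symbols in a row
    }
    return true;
}

foreach (var s in new[] { "john.doe", "a_b", ".john", "john..doe", "ab", "john_(doe)" })
    Console.WriteLine($"{s}: regex={username.IsMatch(s)}, plain={IsValidUsername(s)}");
```

The plain version is longer, but each rule sits on its own line, which is the readability trade-off the answer is pointing at.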
