Regular expressions with the Cyrillic alphabet? - C#

I am currently writing validation for user-entered data. I am using regular expressions to do so, working with C#.
Password = @"(?!^[0-9]*$)(?!^[a-zA-Z]*$)^([a-zA-Z0-9]{6,18})$"
Validate Alpha Numeric = [^a-zA-Z0-9ñÑáÁéÉíÍóÓúÚüÜ¡¿{0}]
The above work fine with the Latin alphabet, but how can I extend them to work with the Cyrillic alphabet?

The basic approach to covering ranges of characters using regular expressions is to construct an expression of the form [A-Za-z], where A is the first letter of the range, and Z is the last letter of the range.
The problem is that there is no such thing as "the" Cyrillic alphabet: it differs slightly from language to language. If you would like to cover the Russian version of Cyrillic, use [А-Яа-яЁё]; the letters Ё and ё have to be listed separately because their code points fall outside the А-Я and а-я ranges. You would use a different range for, say, Serbian, because the last letter in its Cyrillic alphabet is Ш, not Я.
Another approach is to list all characters one-by-one. Simply find an authoritative reference for the alphabet that you want to put in a regexp, and put all characters for it into a pair of square brackets:
[АБВГДЕЁЖЗИЙКЛМНОПРСТУФХЦЧШЩЪЫЬЭЮЯабвгдеёжзийклмнопрстуфхцчшщъыьэюя]
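As a quick check of the range approach for Russian, here is a minimal sketch. The variable names are only for illustration; the point is that Ё and ё sit outside the contiguous А-Я and а-я code point ranges, so a range-only class rejects words that contain them.

```csharp
using System;
using System.Text.RegularExpressions;

// Ё (U+0401) and ё (U+0451) lie outside U+0410..U+044F,
// so the plain ranges do not cover them.
var rangeOnly = new Regex(@"^[А-Яа-я]+$");
var rangeWithYo = new Regex(@"^[А-Яа-яЁё]+$");

Console.WriteLine(rangeOnly.IsMatch("привет"));   // True
Console.WriteLine(rangeOnly.IsMatch("ёжик"));     // False: ё is not in а-я
Console.WriteLine(rangeWithYo.IsMatch("ёжик"));   // True
```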

You can use character classes if you need to allow characters of particular language or particular type:
@"\p{IsCyrillic}+" // characters from the Cyrillic Unicode block
@"[\p{Lu}\p{Ll}]+" // any upper/lower case letters in any language
In your case, maybe "not a whitespace" would be enough: @"[^\s]+", or maybe "word character" (which includes numbers and underscores): @"\w+".
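To see how the block class and the general letter category differ in practice, a small sketch (the sample strings are made up for illustration):

```csharp
using System;
using System.Text.RegularExpressions;

// \p{IsCyrillic} is the Cyrillic Unicode block; \p{L} is any letter in any script.
var cyrillicOnly = new Regex(@"^\p{IsCyrillic}+$");
var anyLetters = new Regex(@"^\p{L}+$");

Console.WriteLine(cyrillicOnly.IsMatch("Привет"));  // True
Console.WriteLine(cyrillicOnly.IsMatch("Hello"));   // False: Latin letters
Console.WriteLine(anyLetters.IsMatch("Hello"));     // True
Console.WriteLine(anyLetters.IsMatch("Привет"));    // True
Console.WriteLine(anyLetters.IsMatch("abc123"));    // False: digits are not letters
```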

Password = @"(?!^[0-9]*$)(?!^[А-Яа-я]*$)^([А-Яа-я0-9]{6,18})$"
Validate Alpha Numeric = [^а-яА-Я0-9ñÑáÁéÉíÍóÓúÚüÜ¡¿{0}]
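If the password should accept Latin and Cyrillic letters at the same time, one possible variant swaps the explicit ranges for \p{L}. This is only a sketch under that assumption, keeping the original rules (6 to 18 characters, not digits only, not letters only); the sample passwords are invented.

```csharp
using System;
using System.Text.RegularExpressions;

// Assumed rule set: 6-18 letters or digits, with at least one of each.
var password = new Regex(@"^(?![0-9]*$)(?!\p{L}*$)[\p{L}0-9]{6,18}$");

Console.WriteLine(password.IsMatch("пароль123"));  // True
Console.WriteLine(password.IsMatch("password1"));  // True
Console.WriteLine(password.IsMatch("123456"));     // False: digits only
Console.WriteLine(password.IsMatch("пароль"));     // False: letters only
Console.WriteLine(password.IsMatch("abc12"));      // False: too short
```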

Related

Match vocabulary words and phrases

I am writing application logic that takes a vocabulary word or phrase as an input parameter, and I am having trouble writing validation logic for this parameter's value.
These are the rules I've come up with:
can be up to 4 words (with hyphens or not)
one apostrophe is allowed
only regular letters are allowed (no special characters like !@#$%^&*()={}[]"";|/>/? ¶ © etc)
numbers are disallowed
case insensitive
multiple languages support (English, Russian, Norwegian, etc..) (so both Unicode and Cyrillic must be supported)
either whole string matches or nothing
Few examples (in 3 languages):
// match:
one two three four
one-two-three-four
one-two-three four
vær så snill
тест регекс
re-read
under the hood
ONe
rabbit's lair
// not-match:
one two three four five
one two three four#
one-two-three-four five
rabbit"s lair
one' two's
one1
1900
Given the expected results above, could someone point me in the right direction on how to create a validation rule like that? If it matters, I will be writing the validation logic in C#, so I have more tools than just Regex at my disposal.
If that is going to be of any help - I have been testing several solutions, like these ^[\p{Ll}\p{Lt}]+$ and (?=\S*['-])([a-zA-Z'-]+)$. The first regex seems to be doing a great job allowing just the letters I need (En, No and Rus), whereas the second rule set is doing great in using the Lookahead concept.
\p{Ll} or \p{Lowercase_Letter}: a lowercase letter that has an uppercase variant.
\p{Lu} or \p{Uppercase_Letter}: an uppercase letter that has a lowercase variant.
\p{Lt} or \p{Titlecase_Letter}: a letter that appears at the start of a word when only the first letter of the word is capitalized.
\p{L&} or \p{Letter&}: a letter that exists in lowercase and uppercase variants (combination of Ll, Lu and Lt).
\p{Lm} or \p{Modifier_Letter}: a special character that is used like a letter.
\p{Lo} or \p{Other_Letter}: a letter or ideograph that does not have lowercase and uppercase variants.
Needless to say, neither of the solutions I have been testing takes into account all the rules I defined above.
You can use
\A(?!(?:[^']*'){2})\p{L}+(?:[\s'-]\p{L}+){0,3}\z
See the regex demo. Details:
\A - start of string
(?!(?:[^']*'){2}) - the string cannot contain two apostrophes
\p{L}+ - one or more Unicode letters
(?:[\s'-]\p{L}+){0,3} - zero to three occurrences of
[\s'-] - a whitespace, ' or - char
\p{L}+ - one or more Unicode letters
\z - the very end of string.
In C#, you can use it as
var IsValid = Regex.IsMatch(text, @"\A(?!(?:[^']*'){2})\p{L}+(?:[\s'-]\p{L}+){0,3}\z");
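A small harness like the following can run the pattern against the examples from the question. It is only a sketch; the array names are illustrative.

```csharp
using System;
using System.Text.RegularExpressions;

var phrase = new Regex(@"\A(?!(?:[^']*'){2})\p{L}+(?:[\s'-]\p{L}+){0,3}\z");

string[] shouldMatch =
{
    "one two three four", "one-two-three-four", "one-two-three four",
    "vær så snill", "тест регекс", "re-read", "under the hood", "ONe",
    "rabbit's lair",
};
string[] shouldNotMatch =
{
    "one two three four five", "one two three four#",
    "one-two-three-four five", "rabbit\"s lair", "one' two's", "one1", "1900",
};

// Print the verdict for every sample so mismatches are easy to spot.
foreach (var s in shouldMatch)
    Console.WriteLine($"{s} -> {phrase.IsMatch(s)} (expected True)");
foreach (var s in shouldNotMatch)
    Console.WriteLine($"{s} -> {phrase.IsMatch(s)} (expected False)");
```

Note that the pattern counts letter runs rather than words: an apostrophe splits a word into two runs, so a four-word phrase that also contains an apostrophe would be rejected.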

Foreign language characters in Regular expression in C#

In C# code, I am trying to validate a string containing Chinese characters: "中文ABC123".
When I use alphanumeric in general using "^[a-zA-Z0-9\s]+$",
it doesn't pass for "中文ABC123" and regex validation fails.
What other expressions do I need to add for C#?
To match any letter character from any language use:
\p{L}
If you also want to match numbers:
[\p{L}\p{Nd}]+
\p{L} ... matches a character of the unicode category letter.
it is the short form for [\p{Ll}\p{Lu}\p{Lt}\p{Lm}\p{Lo}]
\p{Ll} ... matches lowercase letters. (abc)
\p{Lu} ... matches uppercase letters. (ABC)
\p{Lt} ... matches titlecase letters.
\p{Lm} ... matches modifier letters.
\p{Lo} ... matches letters without case. (中文)
\p{Nd} ... matches a character of the unicode category decimal digit.
Just replace: ^[a-zA-Z0-9\s]+$ with ^[\p{L}0-9\s]+$
Thanks to @Andie2302 for pointing to the right way to do it.
In addition, many languages have combining characters that need a base character to attach to. For example, if you keep only \p{L} characters from the Thai word 'เก็บ', you get 'เกบ': the combining mark is lost.
That's why \p{L} alone will not work for every language.
To support almost any language, allow marks as well:
[\p{L}\p{M}]
NOTE:
L stands for 'Letter' (all letters from all languages, not including marks)
M stands for 'Mark' (a mark cannot be displayed alone; it needs a letter to attach to)
If you also need numbers, add
\p{N}
NOTE:
N stands for 'Number'
Thanks to this website for very useful information
https://www.regular-expressions.info/unicode.html
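The Thai example can be reproduced with a short sketch. The word is written with \u escapes so that the snippet does not depend on the source file encoding; the variable names are illustrative.

```csharp
using System;
using System.Text.RegularExpressions;

// "เก็บ": U+0E40 and U+0E01 are letters, U+0E47 is a combining mark, U+0E1A is a letter.
string word = "\u0E40\u0E01\u0E47\u0E1A";

var lettersOnly = new Regex(@"^\p{L}+$");
var lettersAndMarks = new Regex(@"^[\p{L}\p{M}]+$");

Console.WriteLine(lettersOnly.IsMatch(word));      // False: the mark is not a letter
Console.WriteLine(lettersAndMarks.IsMatch(word));  // True

// Stripping everything except \p{L} drops the mark and changes the word.
string stripped = Regex.Replace(word, @"[^\p{L}]", "");
Console.WriteLine(stripped == "\u0E40\u0E01\u0E1A");  // True
```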

Regex validation Comma Separated Words - Foreign Characters

I am developing an application in Arabic and English, so I need a regex that validates a set of comma-separated words. Here is my regex:
^([a-zA-Z]+(,[a-zA-Z]+)*)?$
This works flawlessly for me, but as you can see the characters specified are English; I want this for Arabic.
Can this expression be altered to accept other characters, either Arabic or maybe some other language?
Instead of restricting to a set of alphabetical character, exclude the characters that mark the end of your word.
^([^,]+(,[^,]+)*)?$
If you really want to match Arabic characters, see: regular expression For Arabic Language
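If the words should still consist of letters only, another option is to keep the original structure and swap [a-zA-Z] for Unicode letters and marks. A sketch under that assumption (the Arabic sample word is written with \u escapes to avoid bidirectional text in the source):

```csharp
using System;
using System.Text.RegularExpressions;

// Same shape as the original pattern, but each word is letters/marks from any script.
var words = new Regex(@"^([\p{L}\p{M}]+(,[\p{L}\p{M}]+)*)?$");

string arabicWord = "\u0643\u0644\u0645\u0629";  // an Arabic word made of four letters

Console.WriteLine(words.IsMatch("apple," + arabicWord + ",word"));  // True
Console.WriteLine(words.IsMatch(""));                               // True: empty list allowed
Console.WriteLine(words.IsMatch("apple,,word"));                    // False: empty item
Console.WriteLine(words.IsMatch("apple, word"));                    // False: space is not a letter
```

Unlike ^([^,]+(,[^,]+)*)?$, this variant rejects digits, spaces and punctuation inside the items.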

how to create regular expression based on some condition

I want to create a regular expression to find and replace uppercase characters based on some conditions.
Find the starting uppercase character of each group of uppercase characters in a string, replace the group with lowercase, and put a * before the starting uppercase.
If any lowercase character follows the uppercase, replace the uppercase with lowercase and put a * before the starting uppercase.
input string : stackOVERFlow
expected output : stack*over*flow
I tried but could not get it working perfectly.
Any idea on how to create such a regular expression?
Thanks
Well the expected inputs and outputs are slightly illogical: you're lower-casing the "f" in "flow" but not including it in the asterisk.
Anyway, the regex you want is pretty simple: @"[A-Z]+". This matches a run of one or more uppercase alpha characters. The quantifier has to be greedy: the lazy form [A-Z]+? would stop after a single letter and wrap each uppercase character in its own pair of asterisks.
Now, to do the find/replace, you would do something like the following:
Regex.Replace(inputString, @"([A-Z]+)", "*$1*").ToLower();
This simply finds all occurrences of one or more uppercase alpha characters, and wherever it finds a match it replaces it with itself surrounded by asterisks. This does the surrounding but not the lowercasing; .NET Regex doesn't provide for that kind of string modification. However, since the end result of the operation should be a string with all lowercase chars, just do exactly that with a ToLower() and you'll get the expected result.
KeithS's solution can be simplified a bit
Regex.Replace("stackOVERFlow","[A-Z]+","*$0*").ToLower()
However, this will yield stack*overf*low including the f between the stars. If you want to exclude the last upper case letter, use the following expression
Regex.Replace("stackOVERFlow","[A-Z]+(?=[A-Z])","*$0*").ToLower()
It will yield stack*over*flow
This uses the pattern find(?=suffix), a positive lookahead: it matches find only where it is immediately followed by suffix, without making suffix part of the match.
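To compare the variants side by side, here is a short sketch that runs a lazy, a greedy and a lookahead pattern over the sample input (variable names are illustrative):

```csharp
using System;
using System.Text.RegularExpressions;

string input = "stackOVERFlow";

// Lazy: every uppercase letter is matched (and wrapped) on its own.
string lazy = Regex.Replace(input, "[A-Z]+?", "*$0*").ToLower();
// Greedy: the whole uppercase run, including the F, is wrapped.
string greedy = Regex.Replace(input, "[A-Z]+", "*$0*").ToLower();
// Greedy with lookahead: the run must be followed by another uppercase letter,
// so the last uppercase letter stays outside the match.
string lookahead = Regex.Replace(input, "[A-Z]+(?=[A-Z])", "*$0*").ToLower();

Console.WriteLine(lazy);       // stack*o**v**e**r**f*low
Console.WriteLine(greedy);     // stack*overf*low
Console.WriteLine(lookahead);  // stack*over*flow
```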

Regular expression to catch letters beyond a-z

A normal regexp to allow letters only would be "[a-zA-Z]", but I'm from Sweden, so I would have to change that into "[a-zåäöA-ZÅÄÖ]". But suppose I don't know what letters are used in the alphabet.
Is there a way to automatically know what chars are valid in a given locale/language or should I just make a blacklist of chars that I (think I) know I don't want?
You can use \p{L} (in .NET the braces are required) to match any 'letter', which will support all letters in all languages. You can narrow it down to specific languages using 'named blocks'. More information can be found in the Character Classes documentation on MSDN.
My recommendation would be to put the regular expression (or at least the "letter" part) into a localised resource, which you can then pull out based on the current locale and form into the larger pattern.
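The localised-resource idea could look roughly like this. Everything here is hypothetical: the dictionary stands in for real resource files, and the language keys, letter classes and BuildWordRegex helper are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical per-language letter classes; a real application might
// load these strings from localised resources instead.
var letterClasses = new Dictionary<string, string>
{
    ["en"] = "a-zA-Z",
    ["sv"] = "a-zåäöA-ZÅÄÖ",
    ["ru"] = "а-яА-ЯёЁ",
};

// Builds a "letters only" regex for the given language key.
Regex BuildWordRegex(string language)
{
    // Fall back to any Unicode letter when the language has no dedicated entry.
    string letters = letterClasses.TryGetValue(language, out var cls) ? cls : @"\p{L}";
    return new Regex("^[" + letters + "]+$");
}

Console.WriteLine(BuildWordRegex("sv").IsMatch("smörgåsbord"));  // True
Console.WriteLine(BuildWordRegex("en").IsMatch("smörgåsbord"));  // False: ö and å are not in a-z
Console.WriteLine(BuildWordRegex("fi").IsMatch("äiti"));         // True: falls back to \p{L}
```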
What about \p{name} ?
Matches any character in the named character class specified by {name}.
Supported names are Unicode groups and block ranges. For example, Ll, Nd, Z,
IsGreek, IsBoxDrawing.
I don't know enough about unicode, but maybe your characters fit a unicode class?
See character categories selection with \p and \w unicode semantics.
All chars are "valid," so I think you're really asking for chars that are "generally considered to be letters" in a locale.
The Unicode specification has some guidelines, but in general the answer is "no," you would need to list the characters you decide are "letters."
Is there a way to automatically know what chars are valid in a given locale/language or should I just make a blacklist of chars that I (think I) know I don't want?
This is not, in general, possible.
After all, English text does include some accented characters (e.g. in "fête" and "naïve", which in UK English still use accents to be strictly correct). In some languages some of the standard letters are rarely used (e.g. y-diaeresis in French).
Then consider that foreign words are often included (this will often be the case where technical terms are used). Quotations would be another source.
If your requirements are sufficiently narrowly defined you may be able to create a definition, but this requires linguistic experience in that language.
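This regex allows only Latin letters (including the accented letters of the Latin-1 range) and spaces through; note that the À-ÿ range also contains the × and ÷ signs: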
[a-zA-ZÀ-ÿ ]
