Foreign language characters in Regular expression in C#

In C# code, I am trying to validate a string that contains Chinese characters: "中文ABC123".
When I use a general alphanumeric pattern, "^[a-zA-Z0-9\s]+$",
the string "中文ABC123" does not match and regex validation fails.
What do I need to add to the expression in C#?

To match any letter character from any language use:
\p{L}
If you also want to match numbers:
[\p{L}\p{Nd}]+
\p{L} ... matches a character of the Unicode category Letter.
It is the short form for [\p{Ll}\p{Lu}\p{Lt}\p{Lm}\p{Lo}]
\p{Ll} ... matches lowercase letters. (abc)
\p{Lu} ... matches uppercase letters. (ABC)
\p{Lt} ... matches titlecase letters.
\p{Lm} ... matches modifier letters.
\p{Lo} ... matches letters without case. (中文)
\p{Nd} ... matches a character of the Unicode category Decimal Digit.
Just replace: ^[a-zA-Z0-9\s]+$ with ^[\p{L}0-9\s]+$
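
As a minimal sketch of the replacement above, the following compares the two patterns on the string from the question. It assumes C# 9 or later (top-level statements); the variable names are only illustrative.

```csharp
using System;
using System.Text.RegularExpressions;

// Original ASCII-only pattern from the question.
var asciiOnly = new Regex(@"^[a-zA-Z0-9\s]+$");
// Suggested replacement: any Unicode letter, ASCII digits, whitespace.
var unicodeLetters = new Regex(@"^[\p{L}0-9\s]+$");

string input = "中文ABC123";

Console.WriteLine(asciiOnly.IsMatch(input));      // False
Console.WriteLine(unicodeLetters.IsMatch(input)); // True
```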

Thanks to @Andie2302 for pointing to the right way to do it.
In addition, many languages have combining characters that need a base character to attach to. For example, in the Thai word 'เก็บ', matching only \p{L} yields 'เกบ'; the mark above the second letter is dropped.
That is why \p{L} alone does not work for every language.
To support almost all languages, match marks as well as letters:
\p{L}\p{M}
NOTE:
L stands for 'Letter' (all letters from all languages, not including marks)
M stands for 'Mark' (a mark cannot be displayed on its own; it needs a letter to attach to)
If you also need numbers, add:
\p{N}
NOTE:
N stands for 'Number'
Thanks to this website for the very useful information:
https://www.regular-expressions.info/unicode.html
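
A small sketch of the Thai example, assuming the two categories are combined inside one character class and that the code runs as C# 9+ top-level statements. The word is written with escapes so the code points are explicit; the variable names are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

// The Thai word from the answer, spelled out by code point:
// U+0E40 (letter), U+0E01 (letter), U+0E47 (nonspacing mark), U+0E1A (letter).
string thai = "\u0E40\u0E01\u0E47\u0E1A";

// Letters only: fails because U+0E47 is a mark, not a letter.
bool lettersOnly = Regex.IsMatch(thai, @"^\p{L}+$");

// Letters and marks together in one character class.
bool lettersAndMarks = Regex.IsMatch(thai, @"^[\p{L}\p{M}]+$");

// Removing everything that is not a letter drops the mark.
string stripped = Regex.Replace(thai, @"[^\p{L}]", "");

Console.WriteLine(lettersOnly);      // False
Console.WriteLine(lettersAndMarks);  // True
Console.WriteLine(stripped.Length);  // 3
```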

Related

Match vocabulary words and phrases

I am writing application logic that takes a vocabulary word or phrase as an input parameter, and I am having trouble writing validation logic for this parameter's value.
Following are the rules I've come up with:
can be up to 4 words (with hyphens or not)
one apostrophe is allowed
only regular letters are allowed (no special characters like !@#$%^&*()={}[]"";|/>/? ¶ © etc)
numbers are disallowed
case insensitive
multiple languages support (English, Russian, Norwegian, etc.) (so both Unicode and Cyrillic must be supported)
either whole string matches or nothing
A few examples (in 3 languages):
// match:
one two three four
one-two-three-four
one-two-three four
vær så snill
тест регекс
re-read
under the hood
ONe
rabbit's lair
// not-match:
one two three four five
one two three four#
one-two-three-four five
rabbit"s lair
one' two's
one1
1900
Given the expected results above, could someone point me in the right direction on how to create a validation rule like that? If it matters, I will be writing the validation logic in C#, so I have more tools than just Regex at my disposal.
If that is going to be of any help - I have been testing several solutions, like these ^[\p{Ll}\p{Lt}]+$ and (?=\S*['-])([a-zA-Z'-]+)$. The first regex seems to be doing a great job allowing just the letters I need (En, No and Rus), whereas the second rule set is doing great in using the Lookahead concept.
\p{Ll} or \p{Lowercase_Letter}: a lowercase letter that has an uppercase variant.
\p{Lu} or \p{Uppercase_Letter}: an uppercase letter that has a lowercase variant.
\p{Lt} or \p{Titlecase_Letter}: a letter that appears at the start of a word when only the first letter of the word is capitalized.
\p{L&} or \p{Letter&}: a letter that exists in lowercase and uppercase variants (combination of Ll, Lu and Lt).
\p{Lm} or \p{Modifier_Letter}: a special character that is used like a letter.
\p{Lo} or \p{Other_Letter}: a letter or ideograph that does not have lowercase and uppercase variants.
Needless to say, neither of the solutions I have been testing takes into account all the rules I defined above.

You can use
\A(?!(?:[^']*'){2})\p{L}+(?:[\s'-]\p{L}+){0,3}\z
Details:
\A - start of string
(?!(?:[^']*'){2}) - the string cannot contain two apostrophes
\p{L}+ - one or more Unicode letters
(?:[\s'-]\p{L}+){0,3} - zero to three occurrences of
[\s'-] - a whitespace, ' or - char
\p{L}+ - one or more Unicode letters
\z - the very end of string.
In C#, you can use it as
var IsValid = Regex.IsMatch(text, @"\A(?!(?:[^']*'){2})\p{L}+(?:[\s'-]\p{L}+){0,3}\z");
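
To see how the pattern behaves on some of the examples from the question, here is a minimal sketch. The helper name IsValidPhrase is hypothetical, and the code assumes C# 9 or later (top-level statements).

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper wrapping the pattern from the answer above.
static bool IsValidPhrase(string text) =>
    Regex.IsMatch(text, @"\A(?!(?:[^']*'){2})\p{L}+(?:[\s'-]\p{L}+){0,3}\z");

string[] samples =
{
    "one two three four",      // four words
    "vær så snill",            // Norwegian
    "rabbit's lair",           // one apostrophe
    "one two three four five", // too many words
    "one' two's",              // two apostrophes
    "one1",                    // contains a digit
};

foreach (string sample in samples)
    Console.WriteLine($"{sample} -> {IsValidPhrase(sample)}");
```

The first three samples are accepted and the last three are rejected. Note that each apostrophe or hyphen uses up one of the three allowed separators, so the limit is really on letter runs rather than on space-separated words.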

C# Regex specify allowed start and end conditions

I'm trying to create a regular expression with the following requirements:
The value:
Must start with a-z or _, numbers are OK after the first character
Can have parentheses at the end of the string if they are opened and closed with a number inside, e.g. SomeVar(10) is OK, SomeVar(10 is not OK.
Can have a . but only one at a time, and only between letters or numbers. SomeVar.InnerVar is OK, SomeVar..InnerVar is not OK.
My try at the regex:
[a-zA-Z_]
??
??

Assuming you want to match an entire string, you may use something like the following:
^[a-zA-Z_](?:\w|(?<=\w)\.(?=\w))*(?:\(\d+\))?$
If you want to match partial strings, you'd need to decide what boundaries are allowed. Otherwise, "SomeVar(10" would produce a match (i.e., the part that comes before the "(").
Notes:
\w matches a lowercase/uppercase letter, a digit, or an underscore. But it also matches Unicode letters and numbers. If you don't want that, you could use [a-zA-Z0-9_] instead.
Similarly, \d matches any Unicode digit. You either use it or use [0-9] depending on your requirements.
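
A quick sketch that runs this pattern against the examples from the question, assuming C# 9 or later (top-level statements). The extra samples "_x1" and "1x" are added here for illustration only.

```csharp
using System;
using System.Text.RegularExpressions;

// Pattern from the answer above, anchored to the whole string.
var pattern = new Regex(@"^[a-zA-Z_](?:\w|(?<=\w)\.(?=\w))*(?:\(\d+\))?$");

string[] samples =
{
    "SomeVar(10)",       // accepted
    "SomeVar(10",        // rejected: parenthesis never closed
    "SomeVar.InnerVar",  // accepted
    "SomeVar..InnerVar", // rejected: two dots in a row
    "_x1",               // accepted
    "1x",                // rejected: starts with a digit
};

foreach (string sample in samples)
    Console.WriteLine($"{sample} -> {pattern.IsMatch(sample)}");
```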

Use
^[a-zA-Z_][a-zA-Z0-9_]*(\.[a-zA-Z_][a-zA-Z0-9_]*)*(\([^()]*\))?$
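[a-zA-Z_][a-zA-Z0-9_]* - a letter or underscore, then zero or more letters, digits, or underscores
(\.[a-zA-Z_][a-zA-Z0-9_]*)* - zero or more groups of a dot followed by another such name, so a dot is only allowed between name segments
(\([^()]*\))? - optional group: a pair of parentheses at the end, containing any characters other than parentheses (use \d+ in place of [^()]* to require a number)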
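
The sketch below exercises this second pattern and shows two places where it differs from the stated requirements: the parentheses accept non-numeric content, and a segment after a dot may not start with a digit. It assumes C# 9 or later (top-level statements); the sample strings beyond those in the question are illustrative.

```csharp
using System;
using System.Text.RegularExpressions;

// Pattern from the answer above.
var pattern = new Regex(@"^[a-zA-Z_][a-zA-Z0-9_]*(\.[a-zA-Z_][a-zA-Z0-9_]*)*(\([^()]*\))?$");

string[] samples =
{
    "SomeVar(10)",       // accepted
    "SomeVar(abc)",      // accepted: [^()]* does not require digits
    "SomeVar.InnerVar",  // accepted
    "SomeVar..InnerVar", // rejected: two dots in a row
    "SomeVar.1x",        // rejected: segment after the dot starts with a digit
    "SomeVar(10",        // rejected: parenthesis never closed
};

foreach (string sample in samples)
    Console.WriteLine($"{sample} -> {pattern.IsMatch(sample)}");
```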

How to handle validations(RegEx) while localizing application

We decided to support other languages in our project, and I started localizing it.
In some text boxes, we use text validation that allows only certain characters, such as only letters from a to z. When we run our application on an OS in another language, such as Hebrew or Hindi, users cannot enter any text in those text boxes because of the validation.
How can we localize/globalize these rules? How should these scenarios be handled when localizing an application?

Use \p{L} in your regex to get the required validation for all languages.
To match any letter character from any language use:
\p{L}
If you also want to match numbers:
[\p{L}\p{Nd}]+
\p{L} ... matches a character of the Unicode category Letter.
It is the short form for [\p{Ll}\p{Lu}\p{Lt}\p{Lm}\p{Lo}]
\p{Ll} ... matches lowercase letters. (abc)
\p{Lu} ... matches uppercase letters. (ABC)
\p{Lt} ... matches titlecase letters.
\p{Lm} ... matches modifier letters.
\p{Lo} ... matches letters without case. (中文)
\p{Nd} ... matches a character of the Unicode category Decimal Digit.
Just replace: ^[a-zA-Z0-9\s]+$ with ^[\p{L}0-9\s]+$

What does this regexp mean - "\p{Lu}"?

I stumbled across this regular expression in C# that I would like to port to JavaScript, and I do not understand the following:
[-.\p{Lu}\p{Ll}0-9]+
The part I have a hard time with is of course \p{Lu}. None of the regexp websites I visited mention this modifier.
Any idea?

These are Unicode properties.
The Unicode property \p{L}, shorthand for \p{Letter}, matches any kind of letter from any language. \p{Lu} matches an uppercase letter that has a lowercase variant, and its opposite, \p{Ll}, matches a lowercase letter that has an uppercase variant.
In short, together they match any lowercase or uppercase letter that has a case variant, in any language, for example:
AaBbCcDdEeFfGgHhIiJjKkLlMmNnOoPpQqRrSsTtUuVvWwXxYyZz
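
As a rough illustration in C# (the language the expression came from), the sketch below anchors the character class from the question and tries it on a few strings. The anchors and the sample strings are assumptions added for the demonstration; the code assumes C# 9 or later (top-level statements).

```csharp
using System;
using System.Text.RegularExpressions;

// The character class from the question, anchored to the whole string.
var pattern = new Regex(@"^[-.\p{Lu}\p{Ll}0-9]+$");

Console.WriteLine(pattern.IsMatch("File-1.txt"));    // True
Console.WriteLine(pattern.IsMatch("\u00C9t\u00E9")); // True: U+00C9 is Lu, t and U+00E9 are Ll
Console.WriteLine(pattern.IsMatch("\u4E2D\u6587"));  // False: CJK ideographs are Lo, not Lu or Ll
Console.WriteLine(pattern.IsMatch("a_b"));           // False: underscore is not in the class
```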

Regex ignore underscores

I have a regex ([-#.\/,':\w]*[\w])* and it matches all words within a text (including punctuated words like I.B.M), but I want to make it exclude underscores and I can't seem to figure out how to do it... I tried adding ^[_] (e.g. (^[_][-#.\/,':\w]*[\w])*) but it just breaks up all the words into letters. I want to preserve the word matching, but I don't want to have words with underscores in them, nor words that are entirely made up of underscores.
What's the proper way to do this?
P.S.
My app is written in C# (if that makes any difference).
I can't use A-Za-z0-9 because I have to match words regardless of the language (could be Chinese, Russian, Japanese, German, English).
Update
Here is an example:
"I.B.M should be parsed as one word w_o_r_d! Russian should work too: мплекс исторических событий."
The matches should be:
I.B.M.
should
be
parsed
as
one
word
Russian
should
work
too
мплекс
исторических
событий
Note that w_o_r_d should not get matched.

Try this instead:
([-#.\/,':\p{L}\p{Nd}]*[\p{L}\p{Nd}])*
In .NET, the \w class is equivalent to [\p{L}\p{Mn}\p{Nd}\p{Pc}] when you're performing Unicode matching. (Or simply [a-zA-Z_0-9] if you're doing ECMAScript-compliant matching with RegexOptions.ECMAScript.)
It's the \p{Pc} Unicode category -- punctuation/connector -- that causes the problem by matching underscores, so we explicitly match against the other categories without including that one.
(Further information here, "Character Classes: Word Character", and here, "Character Classes: Supported Unicode General Categories".)
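
The effect of leaving out \p{Pc} can be checked on single characters. This is a minimal sketch, assuming C# 9 or later (top-level statements); the variable names are illustrative.

```csharp
using System;
using System.Text.RegularExpressions;

// \w accepts the underscore because it includes the connector punctuation category.
bool wordClassMatchesUnderscore = Regex.IsMatch("_", @"^\w$");

// The explicit class of letters and decimal digits does not.
bool explicitClassMatchesUnderscore = Regex.IsMatch("_", @"^[\p{L}\p{Nd}]$");

// It still accepts letters from other scripts, e.g. Cyrillic U+0436.
bool explicitClassMatchesCyrillic = Regex.IsMatch("\u0436", @"^[\p{L}\p{Nd}]$");

Console.WriteLine(wordClassMatchesUnderscore);     // True
Console.WriteLine(explicitClassMatchesUnderscore); // False
Console.WriteLine(explicitClassMatchesCyrillic);   // True
```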

The underscore comes from \w.
Simply use A-Za-z0-9 instead.

For a more concise version of LukeH's regex, you can use simply:
([-#.\/,':\p{L}]*\p{L})*
I simply used \p{L} instead of Lu, Ll, Lt, Lo, Lm. See Supported Unicode General Categories
