I am developing an application that supports both Arabic and English, so I need a regex that validates a list of comma-separated words. Here is my regex:
^([a-zA-Z]+(,[a-zA-Z]+)*)?$
This works flawlessly for me, but as you can see, the characters it allows are English only; I want the same thing for Arabic.
Can this expression be altered to accept other characters, either Arabic or possibly some other language?
Instead of restricting the match to a set of alphabetic characters, exclude the character that marks the end of a word (here, the comma).
^([^,]+(,[^,]+)*)?$
If you really want to match Arabic characters, see: regular expression For Arabic Language
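A minimal C# sketch of both options, assuming .NET's Regex class and C# 9 top-level statements. The helper names IsWordList and IsArabicOrEnglishWordList are made up for illustration, and the second pattern relies on the named block \p{IsArabic} (U+0600 to U+06FF), which does not cover every Arabic-script character:

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: accepts an empty string or comma-separated items,
// where an item is any non-empty run of characters other than a comma.
bool IsWordList(string input) =>
    Regex.IsMatch(input, @"^([^,]+(,[^,]+)*)?$");

// Hypothetical stricter variant: items may contain only ASCII letters
// or characters from the basic Arabic block.
bool IsArabicOrEnglishWordList(string input) =>
    Regex.IsMatch(input, @"^([a-zA-Z\p{IsArabic}]+(,[a-zA-Z\p{IsArabic}]+)*)?$");

// Two Arabic words separated by a comma, written with escapes
// so the comma position is unambiguous in the source file.
string arabicList = "\u0643\u0644\u0645\u0629,\u0643\u062A\u0627\u0628";

Console.WriteLine(IsWordList("one,two"));                  // True
Console.WriteLine(IsWordList("one,,two"));                 // False
Console.WriteLine(IsWordList(arabicList));                 // True
Console.WriteLine(IsArabicOrEnglishWordList(arabicList));  // True
Console.WriteLine(IsArabicOrEnglishWordList("one,2"));     // False
```

The first helper accepts any language because it only cares about the separator; the second is stricter but has to name every script it allows.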
Related
I have a regular expression validator with client-side validation disabled in an ASP.Net page. The regular expression being used for this validator is as below and it is validating input into a Product Description multi-line text box.
Expression="^[\\p .,;'\-(0-9)\(\)\[\]]+$"
The culture for this ASP.Net app is Chinese as specified in web config.
<globalization uiCulture="zh" culture="zh-CHT" />
The following input into the Product Description text box on the same ASP.Net page always fails. I am trying to match any of these: Chinese characters, periods, commas, semicolons, single quotes, digits, or round/square brackets.
Question: What is in the regular expression that is causing this input text to fail and how can I change it to satisfy the matching requirements?
(1)降低庫存過程 (2)增加了吞吐量(1)降低庫存過程 (2)增加了吞吐量(1)降低庫存過程 (2)增加了吞吐量(1)降低庫存過程 (2)增加了吞吐量
In .NET regex, which is what runs on the server side, you can make use of Unicode categories:
^[\p{L}\p{M}\p{N}\s\p{P}]+$
So, the character class matches:
\p{L} - Unicode letters
\p{M} - diacritic marks
\p{N} - numbers
\s - whitespace
\p{P} - punctuation.
Note that these Unicode categories won't work on the client side, where your English UI culture validation takes place. You can use your fixed expression there:
^[a-zA-Z .,;'\-0-9()\[\]]+$
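As a quick check, here is a small C# sketch that runs both patterns against the sample input from the question. It assumes .NET's Regex class and C# 9 top-level statements; the variable names are illustrative only:

```csharp
using System;
using System.Text.RegularExpressions;

string input = "(1)降低庫存過程 (2)增加了吞吐量";

// Server-side pattern: letters, marks, numbers, whitespace and punctuation
// from any script.
bool unicodeOk = Regex.IsMatch(input, @"^[\p{L}\p{M}\p{N}\s\p{P}]+$");

// ASCII-only pattern: fails on the same input because of the Chinese characters.
bool asciiOk = Regex.IsMatch(input, @"^[a-zA-Z .,;'\-0-9()\[\]]+$");

Console.WriteLine($"{unicodeOk} {asciiOk}"); // True False
```

Keep in mind that \p{P} is broader than the handful of punctuation marks listed in the question, so list the allowed marks explicitly if you need to stay that strict.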
In C# code, I am trying to validate a string containing Chinese characters: " 中文ABC123".
When I use a general alphanumeric pattern, "^[a-zA-Z0-9\s]+$",
the string "中文ABC123" does not match and regex validation fails.
What do I need to add to the expression in C#?
To match any letter character from any language use:
\p{L}
If you also want to match numbers:
[\p{L}\p{Nd}]+
\p{L} ... matches a character of the Unicode category letter.
It is the short form for [\p{Ll}\p{Lu}\p{Lt}\p{Lm}\p{Lo}]
\p{Ll} ... matches lowercase letters. (abc)
\p{Lu} ... matches uppercase letters. (ABC)
\p{Lt} ... matches titlecase letters.
\p{Lm} ... matches modifier letters.
\p{Lo} ... matches letters without case. (中文)
\p{Nd} ... matches a character of the unicode category decimal digit.
Just replace: ^[a-zA-Z0-9\s]+$ with ^[\p{L}0-9\s]+$
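A short C# sketch of the difference, assuming .NET's Regex class and C# 9 top-level statements (variable names are illustrative only):

```csharp
using System;
using System.Text.RegularExpressions;

string input = " 中文ABC123";

// ASCII-only class: rejects the Chinese characters.
bool asciiOnly = Regex.IsMatch(input, @"^[a-zA-Z0-9\s]+$");
// \p{L} accepts letters from any script.
bool anyLetter = Regex.IsMatch(input, @"^[\p{L}0-9\s]+$");
// Letters plus decimal digits, no whitespace allowed.
bool lettersAndDigits = Regex.IsMatch("中文ABC123", @"^[\p{L}\p{Nd}]+$");

// \p{Nd} also covers non-ASCII decimal digits, e.g. ARABIC-INDIC DIGIT THREE.
bool unicodeDigit = Regex.IsMatch("\u0663", @"^\p{Nd}$");
bool asciiDigit = Regex.IsMatch("\u0663", @"^[0-9]$");

Console.WriteLine($"{asciiOnly} {anyLetter} {lettersAndDigits}"); // False True True
Console.WriteLine($"{unicodeDigit} {asciiDigit}");                // True False
```

Whether to use 0-9 or \p{Nd} depends on whether digits from other scripts should be accepted.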
Thanks to @Andie2302 for pointing out the right way to do it.
In addition, many languages have combining characters that must attach to a base character. For example, with the Thai word 'เก็บ', \p{L} alone matches only 'เกบ'; the combining mark is dropped from the word.
That is why \p{L} alone does not work for every language.
To support almost any language, include both of the following in your character class:
\p{L}\p{M}
NOTE:
L stands for 'Letter' (all letters from all languages, but not including marks)
M stands for 'Mark' (a mark cannot be displayed alone; it needs a letter to attach to)
If you also need numbers, add:
\p{N}
NOTE:
N stands for 'Number'
Thanks to this website for very useful information
https://www.regular-expressions.info/unicode.html
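The Thai example can be reproduced with a few lines of C#. This is a sketch assuming .NET's Regex class and C# 9 top-level statements; the word is written with escapes so the combining mark (U+0E47) is visible in the source:

```csharp
using System;
using System.Text.RegularExpressions;

// The Thai word from the example: two letters, one combining mark, one letter.
string word = "\u0E40\u0E01\u0E47\u0E1A";

// Letters only: fails, because U+0E47 is a mark, not a letter.
bool lettersOnly = Regex.IsMatch(word, @"^\p{L}+$");
// Letters and marks: the whole word matches.
bool lettersAndMarks = Regex.IsMatch(word, @"^[\p{L}\p{M}]+$");
// Removing everything that is not a letter drops the mark.
string stripped = Regex.Replace(word, @"[^\p{L}]", "");

Console.WriteLine($"{lettersOnly} {lettersAndMarks}"); // False True
Console.WriteLine(stripped == "\u0E40\u0E01\u0E1A");   // True
```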
I am currently writing some validation for user input. I am using regular expressions to do so, working in C#.
Password = @"(?!^[0-9]*$)(?!^[a-zA-Z]*$)^([a-zA-Z0-9]{6,18})$"
Validate Alpha Numeric = [^a-zA-Z0-9ñÑáÁéÉíÍóÓúÚüÜ¡¿{0}]
The above work fine with the Latin alphabet, but how can I extend them to work with the Cyrillic alphabet?
The basic approach to covering ranges of characters using regular expressions is to construct an expression of the form [A-Za-z], where A is the first letter of the range, and Z is the last letter of the range.
The problem is that there is no such thing as "the" Cyrillic alphabet: the alphabet differs slightly depending on the language. If you would like to cover the Russian version of Cyrillic, use [А-Яа-я] (note that Ё and ё fall outside this range in Unicode, so add them explicitly if you need them). You would use a different range for, say, Serbian, because the last letter in Serbian Cyrillic is Ш, not Я.
Another approach is to list all characters one-by-one. Simply find an authoritative reference for the alphabet that you want to put in a regexp, and put all characters for it into a pair of square brackets:
[АБВГДЕЁЖЗИЙКЛМНОПРСТУФХЦЧШЩЪЫЬЭЮЯабвгдеёжзийклмнопрстуфхцчшщъыьэюя]
You can use character classes if you need to allow characters of a particular language or a particular type:
@"\p{IsCyrillic}+" // characters in the Cyrillic block
@"[\p{Lu}\p{Ll}]+" // any upper/lower case letters in any language
In your case maybe "not a whitespace" would be enough: @"[^\s]+", or maybe "word character" (which includes numbers and underscores): @"\w+".
Password = @"(?!^[0-9]*$)(?!^[А-Яа-я]*$)^([А-Яа-я0-9]{6,18})$"
Validate Alpha Numeric = [^а-яА-Я0-9ñÑáÁéÉíÍóÓúÚüÜ¡¿{0}]
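To compare the range-based and block-based approaches, here is a small C# sketch, assuming .NET's Regex class and C# 9 top-level statements. The test word starts with Ё (U+0401), which lies outside the А-я range; all variable names are illustrative only:

```csharp
using System;
using System.Text.RegularExpressions;

// "Ёлка" written with escapes; the first letter is U+0401.
string word = "\u0401\u043B\u043A\u0430";

// Plain range: misses Ё because it sits before А in Unicode.
bool byRange = Regex.IsMatch(word, @"^[А-Яа-я]+$");
// Range with Ё and ё added explicitly.
bool byRangePlusYo = Regex.IsMatch(word, @"^[А-Яа-яЁё]+$");
// Named block: covers the whole Cyrillic block, U+0400 to U+04FF.
bool byBlock = Regex.IsMatch(word, @"^\p{IsCyrillic}+$");

Console.WriteLine($"{byRange} {byRangePlusYo} {byBlock}"); // False True True

// The Cyrillic password pattern: 6 to 18 characters, letters and digits mixed.
string passwordPattern = @"(?!^[0-9]*$)(?!^[А-Яа-я]*$)^([А-Яа-я0-9]{6,18})$";
Console.WriteLine(Regex.IsMatch("пароль123", passwordPattern)); // True
Console.WriteLine(Regex.IsMatch("пароль", passwordPattern));    // False
Console.WriteLine(Regex.IsMatch("123456", passwordPattern));    // False
```

The named block is the least error-prone choice when any Cyrillic-script language should be accepted; the explicit range is tighter if only Russian is wanted.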
I am not very experienced with regular expressions, which is why I am asking :)
I use this pattern to validate emails:
/^(([^<>()[\]\\.,;:\s@\"]+(\.[^<>()[\]\\.,;:\s@\"]+)*)|(\".+\"))@((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\])|(([a-zA-Z\-0-9]+\.)+[a-zA-Z]{2,}))$/
What do I need to add to this pattern to disallow Arabic characters?
Regular expressions should not be used to validate emails.
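The correct way to validate an email address is to use the MailAddress class, like this:
// requires: using System.Net.Mail;
try
{
    // MailAddress parses the input and throws if the format is invalid
    string parsed = new MailAddress(address).Address;
}
catch (FormatException)
{
    // address is invalid
}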
As for the question itself: once you know it is a valid email address, you can check it for Arabic characters.
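Put together, the two steps could look like the sketch below. It assumes .NET's MailAddress and Regex classes and C# 9 top-level statements; the helper name IsNonArabicEmail is hypothetical, and \p{IsArabic} only covers the basic Arabic block (U+0600 to U+06FF):

```csharp
using System;
using System.Net.Mail;
using System.Text.RegularExpressions;

// Hypothetical helper: true only if the address parses as an email address
// and contains no character from the basic Arabic block.
bool IsNonArabicEmail(string address)
{
    try
    {
        // Step 1: let MailAddress decide whether the format is valid.
        _ = new MailAddress(address);
    }
    catch (Exception ex) when (ex is FormatException || ex is ArgumentException)
    {
        // FormatException: malformed address; ArgumentException: null or empty input.
        return false;
    }

    // Step 2: reject the address if any Arabic character is present.
    return !Regex.IsMatch(address, @"\p{IsArabic}");
}

Console.WriteLine(IsNonArabicEmail("user@example.com")); // True
Console.WriteLine(IsNonArabicEmail("not-an-email"));     // False
// Local part made of Arabic letters, written with escapes.
Console.WriteLine(IsNonArabicEmail("\u0645\u062B\u0627\u0644@example.com")); // False
```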
I bet you could do it with a bracket expression (also known as a character set or character class) and Unicode escapes (available in JavaScript and C#):
[^\u####-\u%%%%]
... where the hash signs (####) stand for the first Arabic character (i.e. the one with the lowest Unicode value), and the percent signs (%%%%) for the last Arabic character (i.e. the one with the highest Unicode value).
Wikipedia tells me that there are multiple ranges of Arabic characters, so you'd need to repeat the range above for each of them.
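Filled in with concrete values, the idea might look like this in C#. The ranges below are the Unicode blocks I would start from (Arabic, Arabic Supplement, Arabic Extended-A, and the two Presentation Forms blocks); treat the list as an assumption and check it against the current Unicode charts. The helper name ContainsArabic is hypothetical:

```csharp
using System;
using System.Text.RegularExpressions;

// Assumed list of Arabic blocks:
//   U+0600-U+06FF Arabic            U+0750-U+077F Arabic Supplement
//   U+08A0-U+08FF Arabic Extended-A U+FB50-U+FDFF Presentation Forms-A
//   U+FE70-U+FEFF Presentation Forms-B
string arabicClass =
    @"[\u0600-\u06FF\u0750-\u077F\u08A0-\u08FF\uFB50-\uFDFF\uFE70-\uFEFF]";

// Hypothetical helper: true if the text has at least one character in those ranges.
bool ContainsArabic(string text) => Regex.IsMatch(text, arabicClass);

Console.WriteLine(ContainsArabic("user@example.com"));                     // False
Console.WriteLine(ContainsArabic("\u0645\u062B\u0627\u0644@example.com")); // True
Console.WriteLine(ContainsArabic("\uFE8D"));                               // True (presentation form)
```

Searching for one forbidden character is usually easier to read than negating the whole class and anchoring it over the full string.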
Use Character Properties:
/\p{sc=Arabic}/
matches all Arabic characters.
Then negate the class so that it matches everything except Arabic characters:
/[^\p{sc=Arabic}]/
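The \p{sc=Arabic} script syntax is the form used by engines such as Perl and recent JavaScript (with the u flag). As far as I know, .NET's regex engine does not accept it and offers named blocks instead, so a rough C# counterpart, assuming the basic Arabic block is enough, could be:

```csharp
using System;
using System.Text.RegularExpressions;

// One ASCII string and one containing an Arabic letter (U+0645).
string ascii = "abc";
string mixed = "abc\u0645";

// Negated class: every character must be outside the basic Arabic block.
bool asciiViaClass = Regex.IsMatch(ascii, @"^[^\p{IsArabic}]+$");
bool mixedViaClass = Regex.IsMatch(mixed, @"^[^\p{IsArabic}]+$");

// \P{...} (capital P) is the shorthand for the same negation.
bool asciiViaShorthand = Regex.IsMatch(ascii, @"^\P{IsArabic}+$");
bool mixedViaShorthand = Regex.IsMatch(mixed, @"^\P{IsArabic}+$");

Console.WriteLine($"{asciiViaClass} {mixedViaClass}");         // True False
Console.WriteLine($"{asciiViaShorthand} {mixedViaShorthand}"); // True False
```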
I have a regex ([-#.\/,':\w]*[\w])* and it matches all words within a text (including punctuated words like I.B.M), but I want to make it exclude underscores and I can't seem to figure out how to do it... I tried adding ^[_] (e.g. (^[_][-#.\/,':\w]*[\w])*) but it just breaks up all the words into letters. I want to preserve the word matching, but I don't want to have words with underscores in them, nor words that are entirely made up of underscores.
What's the proper way to do this?
P.S.
My app is written in C# (if that makes any difference).
I can't use A-Za-z0-9 because I have to match words regardless of the language (could be Chinese, Russian, Japanese, German, English).
Update
Here is an example:
"I.B.M should be parsed as one word w_o_r_d! Russian should work too: мплекс исторических событий."
The matches should be:
I.B.M
should
be
parsed
as
one
word
Russian
should
work
too
мплекс
исторических
событий
Note that w_o_r_d should not get matched.
Try this instead:
([-#.\/,':\p{L}\p{Nd}]*[\p{L}\p{Nd}])*
The \w class is composed of [\p{L}\p{Mn}\p{Nd}\p{Pc}] when you're performing Unicode matching. (Or simply [a-zA-Z0-9_] if you're doing non-Unicode matching.)
It's the \p{Pc} Unicode category -- punctuation/connector -- that causes the problem by matching underscores, so we explicitly match against the other categories without including that one.
(Further information here, "Character Classes: Word Character", and here, "Character Classes: Supported Unicode General Categories".)
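Here is a C# sketch that runs this idea against the sample sentence from the question. Two things in it are my own assumptions rather than part of the answer: the outer (...)* is dropped so that no empty matches are returned, and a lookbehind/lookahead pair is added, because without it a token like w_o_r_d appears to be matched piecewise as w, o, r, d instead of being skipped:

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

string text = "I.B.M should be parsed as one word w_o_r_d! Russian should work too: мплекс исторических событий.";

// Character classes from the answer. The lookarounds (an addition, see above)
// reject any candidate that touches a \w character, underscore included,
// so tokens containing underscores are skipped as a whole.
string pattern = @"(?<!\w)[-#.\/,':\p{L}\p{Nd}]*[\p{L}\p{Nd}](?!\w)";

string[] words = Regex.Matches(text, pattern)
    .Cast<Match>()
    .Select(m => m.Value)
    .ToArray();

Console.WriteLine(string.Join("|", words));
// I.B.M|should|be|parsed|as|one|word|Russian|should|work|too|мплекс|исторических|событий
```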
The underscore comes from \w.
Simply use A-Za-z0-9 instead.
For a more concise version of LukeH's regex, you can use simply:
([-#.\/,':\p{L}]*\p{L})*
I simply used \p{L} instead of Lu, Ll, Lt, Lo, Lm. See Supported Unicode General Categories