Comparing Arabic letters with double diacritics

Arabic has diacritics, as do other languages such as Hebrew or Romanian, but I am not sure whether the issue I am seeing in Arabic also applies to those languages.
In Arabic, a letter can carry two diacritics at once, and that is the source of my problem.
As you can see from the images above, both strings render identically, but they do not match when compared.
I could just check whether both strings contain the same characters, but I am hoping for a better solution, because that approach would require a lot of changes in my application.

Instead of ==, use String.Equals(string1, string2, StringComparison.CurrentCulture), as long as your current culture is Arabic. The == operator performs an ordinal comparison of the raw chars and does not take the culture into account.
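The images from the question are not reproduced here, so the following is only a sketch under the assumption that the two strings differ in the order of their combining marks. The sample strings and the names DiacriticComparison and CanonicallyEqual are hypothetical. Unicode normalization puts combining marks into a canonical order, after which even an ordinal comparison succeeds:

```csharp
using System;
using System.Text;

public static class DiacriticComparison
{
    // Hypothetical sample data: the letter BEH carrying SHADDA and FATHA,
    // typed in the two possible orders. Both render identically.
    public const string ShaddaFirst = "\u0628\u0651\u064E";
    public const string FathaFirst = "\u0628\u064E\u0651";

    // Normalizes both strings to form C, which also reorders combining
    // marks canonically, and then compares the results ordinally.
    public static bool CanonicallyEqual(string a, string b)
    {
        return string.Equals(
            a.Normalize(NormalizationForm.FormC),
            b.Normalize(NormalizationForm.FormC),
            StringComparison.Ordinal);
    }
}
```

With these samples, ShaddaFirst == FathaFirst is false while CanonicallyEqual(ShaddaFirst, FathaFirst) is true. Normalizing strings once when they are stored may be less intrusive than changing every comparison, although whether that fits depends on the application.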

Related

Regex validation Comma Separated Words - Foreign Characters

I am developing an application in Arabic and English, so I need a regex that validates a comma-separated list of words. Here is my regex:
^([a-zA-Z]+(,[a-zA-Z]+)*)?$
This works flawlessly for me, but as you can see, the characters specified are English ones; I want the same for Arabic.
Can this expression be altered to accept other characters, either Arabic or perhaps some other language?

Instead of restricting the match to a set of alphabetic characters, exclude the character that marks the end of a word.
^([^,]+(,[^,]+)*)?$
If you really want to match Arabic characters, see: regular expression For Arabic Language
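As a rough illustration, the pattern above can be wrapped as follows. The second pattern is an assumption that is not part of the answer: it restricts words to Latin letters plus the Unicode Arabic block via \p{IsArabic}. The class and field names are hypothetical.

```csharp
using System.Text.RegularExpressions;

public static class WordListValidator
{
    // The pattern from the answer: any run of non-comma characters is a word.
    public static readonly Regex AnyScript =
        new Regex(@"^([^,]+(,[^,]+)*)?$");

    // Assumed alternative: only Latin letters and characters from the
    // Unicode Arabic block (U+0600 to U+06FF) may appear in a word.
    public static readonly Regex LatinOrArabic =
        new Regex(@"^([a-zA-Z\p{IsArabic}]+(,[a-zA-Z\p{IsArabic}]+)*)?$");
}
```

Both patterns accept an empty string and a mixed list such as an Arabic word followed by ",word"; both reject "a,,b". Only the first accepts "word1,two", because it does not care what a word is made of.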

Word-Counter in some hieroglyphics languages?

Is there any library available for counting words in some hieroglyphic languages (e.g. Chinese, Japanese, Korean...)?
I found that MS Word counts text in these languages effectively. Can I add a reference to the MS Word libraries in my .NET application to implement this function?
Or are there any other solutions to achieve this?

Is there any available library for word-counting of some hieroglyphics language (ex: chinese, japanese, korean...)?
Hieroglyphics? No, they're not. They're logographic characters, and the difference is not a subtle one. I'm sure a native speaker could explain this much better than I can.
Japanese and Chinese text is made of characters, just like Western languages, but a single character may also be a word. Moreover, these languages don't need spaces to separate words, so our usual character/word distinction can't be made using blanks as delimiters.
What Word does is count words on the assumption that each one equals a character, and you can do the same in your code by counting characters (just don't forget that the text is Unicode, so you can't count bytes). To count real words you need a dictionary, because you can't rely on spaces.
For example these strings:
这是一个示例文本
これは、サンプルのテキストです
They will be counted as 8 characters and 8 words in Chinese, and as 15 characters and 15 words in Japanese. The real word counts are different (for example, the Japanese sentence is 5 words when transliterated into romaji). Moreover, don't forget that Japanese uses more than one script, and one family of them (the kana) is phonetic.
So what is the point, and what will you count? Words transliterated into one of the phonetic representations (in Latin characters) that we use for them? Which one? The word count will differ considerably between them, and it will reflect our own concept of words (which is why, I suppose, Word counts characters).
That said, now try this code:
string text = "这是一个示例文本";
MessageBox.Show(text.Length.ToString());
It displays 8, as Word does (we're counting characters); in bytes, assuming UTF-8 encoding, the length is 24. There is no point in counting spaces here. If you plan to count words in one transliteration, you need an external library (it's not an easy task to do yourself), and a different one for each language you want to support (detecting the language automatically is fairly easy, because Japanese text uses hiragana and katakana very often). Which one? There are a lot of them; I don't know of one for Chinese, but for Japanese a popular one for transliterating kanji is Kakasi.
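The difference between characters and bytes can be sketched without a UI dependency as follows. The class and method names are hypothetical. Note that string.Length counts UTF-16 code units, so for characters outside the Basic Multilingual Plane, StringInfo gives the count a reader would expect:

```csharp
using System.Globalization;
using System.Text;

public static class CharacterCounts
{
    // Number of UTF-16 code units, which is what string.Length reports.
    public static int Utf16Units(string text)
    {
        return text.Length;
    }

    // Number of text elements (user-perceived characters).
    public static int TextElements(string text)
    {
        return new StringInfo(text).LengthInTextElements;
    }

    // Number of bytes the text occupies when encoded as UTF-8.
    public static int Utf8Bytes(string text)
    {
        return Encoding.UTF8.GetByteCount(text);
    }
}
```

For the Chinese sample all three methods agree with the figures above: 8 code units, 8 text elements, and 24 bytes. For a single supplementary character such as U+20000, Utf16Units returns 2 while TextElements returns 1.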
Korean is a completely different story: it uses an alphabet, just like the Latin one, but each character (which should really be called a syllable) may be composed of several letters. Again, you can't rely on spaces alone for word counting. It's somewhat more complicated, because here you may need a dictionary even for character counting (otherwise you'll just count syllables).

Regular expressions with the Cyrillic alphabet?

I am currently writing code to validate input data. I am using regular expressions to do so, working in C#.
Password = @"(?!^[0-9]*$)(?!^[a-zA-Z]*$)^([a-zA-Z0-9]{6,18})$"
Validate Alpha Numeric = [^a-zA-Z0-9ñÑáÁéÉíÍóÓúÚüÜ¡¿{0}]
These work fine with the Latin alphabet, but how can I extend them to work with the Cyrillic alphabet?

The basic approach to covering ranges of characters using regular expressions is to construct an expression of the form [A-Za-z], where A is the first letter of the range, and Z is the last letter of the range.
The problem is that there is no such thing as "the" Cyrillic alphabet: the alphabet differs slightly depending on the language. To cover the Russian version of Cyrillic, use [А-Яа-я] (note that this range does not include Ё and ё, which sit outside it in Unicode). You would use a different range for, say, Serbian, because the last letter in its Cyrillic alphabet is Ш, not Я.
Another approach is to list all characters one-by-one. Simply find an authoritative reference for the alphabet that you want to put in a regexp, and put all characters for it into a pair of square brackets:
[АБВГДЕЁЖЗИЙКЛМНОПРСТУФХЦЧШЩЪЫЬЭЮЯабвгдеёжзийклмнопрстуфхцчшщъыьэюя]
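One practical reason to prefer the explicit list is that the range [А-Яа-я] covers U+0410 to U+044F only, while Ё (U+0401) and ё (U+0451) lie outside it. A small sketch of the difference, with hypothetical class and field names:

```csharp
using System.Text.RegularExpressions;

public static class RussianLetters
{
    // Range form: misses Ё and ё, which are encoded outside the range.
    public static readonly Regex RangeOnly = new Regex(@"^[А-Яа-я]+$");

    // Range form with the two missing letters added explicitly.
    public static readonly Regex RangeWithYo = new Regex(@"^[А-ЯЁа-яё]+$");

    // Explicit list of all 33 Russian letters in both cases.
    public static readonly Regex Explicit = new Regex(
        "^[АБВГДЕЁЖЗИЙКЛМНОПРСТУФХЦЧШЩЪЫЬЭЮЯабвгдеёжзийклмнопрстуфхцчшщъыьэюя]+$");
}
```

RangeOnly accepts "привет" but rejects "ёлка"; the other two patterns accept both words.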

You can use character classes if you need to allow the characters of a particular language or of a particular type:
@"\p{IsCyrillic}+" // characters in the Cyrillic block
@"[\p{Lu}\p{Ll}\p{Lt}]+" // upper-, lower-, and title-case letters in any language
In your case, "not whitespace" may be enough: @"[^\s]+", or perhaps "word character" (which includes digits and underscores): @"\w+".
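A short sketch of how these classes behave when anchored to the whole input. The class and method names are hypothetical, and the anchors are an addition. Keep in mind that \p{IsCyrillic} names a Unicode block (U+0400 to U+04FF), so it also admits the few non-letter characters encoded there:

```csharp
using System.Text.RegularExpressions;

public static class ScriptChecks
{
    // True when every character belongs to the Unicode Cyrillic block.
    public static bool IsCyrillicOnly(string text)
    {
        return Regex.IsMatch(text, @"^\p{IsCyrillic}+$");
    }

    // True when every character is an upper-, lower-, or title-case
    // letter, regardless of script.
    public static bool IsCasedLettersOnly(string text)
    {
        return Regex.IsMatch(text, @"^[\p{Lu}\p{Ll}\p{Lt}]+$");
    }
}
```

IsCyrillicOnly accepts "Привет" and "ёлка" and rejects "Hello"; IsCasedLettersOnly accepts both "Привет" and "Hello" but rejects "abc1".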

Password = @"(?!^[0-9]*$)(?!^[А-Яа-я]*$)^([А-Яа-я0-9]{6,18})$"
Validate Alpha Numeric = [^а-яА-Я0-9ñÑáÁéÉíÍóÓúÚüÜ¡¿{0}]
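For illustration, the Cyrillic password pattern can be exercised as below. The wrapper class CyrillicPassword is hypothetical. The two lookaheads reject inputs that are all digits or all letters, so a valid password needs at least one of each and 6 to 18 characters in total:

```csharp
using System.Text.RegularExpressions;

public static class CyrillicPassword
{
    // The pattern from the answer, written as a C# verbatim string.
    private static readonly Regex Pattern = new Regex(
        @"(?!^[0-9]*$)(?!^[А-Яа-я]*$)^([А-Яа-я0-9]{6,18})$");

    public static bool IsValid(string candidate)
    {
        return Pattern.IsMatch(candidate);
    }
}
```

IsValid("пароль1") is true, while "пароль" (letters only), "123456" (digits only), "пар1" (too short), and "password1" (Latin letters) are all rejected.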

Regex ignore underscores

I have a regex ([-#.\/,':\w]*[\w])* that matches all words within a text (including punctuated words like I.B.M), but I want it to exclude underscores and I can't figure out how to do it. I tried adding ^[_] (e.g. (^[_][-#.\/,':\w]*[\w])*), but that just breaks all the words up into letters. I want to preserve the word matching, but I don't want words with underscores in them, nor words made up entirely of underscores.
What's the proper way to do this?
P.S.
My app is written in C# (if that makes any difference).
I can't use A-Za-z0-9 because I have to match words regardless of the language (could be Chinese, Russian, Japanese, German, English).
Update
Here is an example:
"I.B.M should be parsed as one word w_o_r_d! Russian should work too: мплекс исторических событий."
The matches should be:
I.B.M
should
be
parsed
as
one
word
Russian
should
work
too
мплекс
исторических
событий
Note that w_o_r_d should not get matched.

Try this instead:
([-#.\/,':\p{L}\p{Nd}]*[\p{L}\p{Nd}])*
In .NET, the \w class is equivalent to [\p{L}\p{Mn}\p{Nd}\p{Pc}] when you're performing Unicode matching (or simply [a-zA-Z_0-9] with RegexOptions.ECMAScript).
It's the \p{Pc} Unicode category (connector punctuation) that causes the problem by matching underscores, so we explicitly match against the letter and digit categories without including that one.
(Further information here, "Character Classes: Word Character", and here, "Character Classes: Supported Unicode General Categories".)
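Applied on its own, the pattern above no longer matches underscores, but it still picks up the fragments w, o, r, and d from w_o_r_d. The sketch below keeps the answer's character classes and adds two lookarounds so that a token touching an underscore is rejected as a whole. The lookarounds, the dropped outer group, and the name WordMatcher are assumptions, not part of the answer:

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class WordMatcher
{
    // Character classes from the answer. The lookbehind and lookahead are
    // an added assumption: a word may not start right after, or end right
    // before, a letter, a digit, or an underscore.
    private static readonly Regex Word = new Regex(
        @"(?<![\p{L}\p{Nd}_])[-#.\/,':\p{L}\p{Nd}]*[\p{L}\p{Nd}](?![\p{L}\p{Nd}_])");

    // Returns every word found in the text, in order of appearance.
    public static List<string> Words(string text)
    {
        var result = new List<string>();
        foreach (Match m in Word.Matches(text))
        {
            result.Add(m.Value);
        }
        return result;
    }
}
```

On the example sentence from the question this yields the fourteen expected words and skips w_o_r_d. Tokens that mix punctuation and underscores, such as a.b_c, may still match partially, so treat this as a starting point rather than a complete tokenizer.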

The underscore comes from \w.
Simply use A-Za-z0-9 instead.

For a more concise version of LukeH's regex, you can use simply:
([-#.\/,':\p{L}]*\p{L})*
I simply used \p{L} instead of Lu, Ll, Lt, Lo, Lm. See Supported Unicode General Categories

Diacritics alphabetical ordering in C#

I want to know how to perform a reliable alphabetical ordering (for a listbox) of people's full names that respects the diacritics of the language, in C#.
Thanks in advance.
Q: So you just want to treat diacritics as the "original" letter? (eg: João is the same as Joao)? – NullUserException
A: I want to treat them as they should be treated in the language I define, respecting the rules of alphabetical ordering that people apply everyday. I'm sure it's written in the grammars of each language. Thanks. – Queops

This MSDN article should give you what you need: Comparing and Sorting Data for a Specific Culture. It describes culture sensitive comparison, sorting, and normalization, with code samples.

You can use s.Normalize(NormalizationForm.FormD) to normalize a string s, which separates accented characters into unaccented base characters followed by their accent marks. You can then use CharUnicodeInfo.GetUnicodeCategory(c) == UnicodeCategory.NonSpacingMark to identify the accent characters c in the normalized s. Given that, you can implement your own string comparer to impose whatever ordering you need on accents.
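A minimal sketch of such a comparer follows. The ordering policy (compare base letters first, ignoring case, and let accents only break ties) is an assumption chosen for illustration, not a rule of any particular language, and the name AccentAwareComparer is hypothetical:

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text;

public sealed class AccentAwareComparer : IComparer<string>
{
    // Decomposes the string (form D) and drops every non-spacing mark,
    // leaving only the base characters.
    public static string StripMarks(string s)
    {
        var sb = new StringBuilder(s.Length);
        foreach (char c in s.Normalize(NormalizationForm.FormD))
        {
            if (CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
            {
                sb.Append(c);
            }
        }
        return sb.ToString();
    }

    // Assumed policy: order by base letters, then use the original
    // strings to break ties between names that differ only in accents.
    public int Compare(string x, string y)
    {
        int byBase = string.Compare(
            StripMarks(x), StripMarks(y), StringComparison.OrdinalIgnoreCase);
        return byBase != 0 ? byBase : string.CompareOrdinal(x, y);
    }
}
```

Sorting the names Zed, Joao, João, and Élodie with Array.Sort(names, new AccentAwareComparer()) gives Élodie, Joao, João, Zed. For ordering that follows the conventions of a specific language, a culture-sensitive comparer such as StringComparer.Create(culture, false) is likely the better fit, as the MSDN article mentioned earlier describes.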
