Diacritics alphabetical ordering in C#

How do you perform a reliable alphabetical ordering (for a listbox) of people's full names in C#, respecting the diacritics of the language?
Thanks in advance.
Q: So you just want to treat diacritics as the "original" letter? (eg: João is the same as Joao)? – NullUserException
A: I want to treat them as they should be treated in the language I define, respecting the rules of alphabetical ordering that people apply everyday. I'm sure it's written in the grammars of each language. Thanks. – Queops

This MSDN article should give you what you need: Comparing and Sorting Data for a Specific Culture. It describes culture sensitive comparison, sorting, and normalization, with code samples.

You can use s.Normalize(NormalizationForm.FormD) to normalize a string s, separating accented characters into unaccented characters followed by the accent symbol. You can then use CharUnicodeInfo.GetUnicodeCategory(c) == UnicodeCategory.NonSpacingMark to identify accent characters c in the normalized s. Given that, you can implement your own string comparison operator to place whatever ordering you need on accents.
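Both suggestions above can be sketched roughly as follows. This is only an illustration: the class and method names (NameSorting, SortForCulture, StripAccents) are made up for this example, and the sort order you actually get depends on the culture data available on the machine.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;
using System.Text;

// Hypothetical helper class, not part of .NET.
public static class NameSorting
{
    // Sorts names using the collation rules of the given culture (e.g. "sv-SE").
    public static List<string> SortForCulture(IEnumerable<string> names, string cultureName)
    {
        // StringComparer.Create builds a culture-sensitive, case-sensitive comparer.
        StringComparer comparer = StringComparer.Create(new CultureInfo(cultureName), false);
        return names.OrderBy(n => n, comparer).ToList();
    }

    // Decomposes the string (FormD) and drops the non-spacing marks,
    // so "João" becomes "Joao". Useful as a building block for a custom comparer.
    public static string StripAccents(string s)
    {
        string decomposed = s.Normalize(NormalizationForm.FormD);
        var sb = new StringBuilder(decomposed.Length);
        foreach (char c in decomposed)
        {
            if (CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
            {
                sb.Append(c);
            }
        }
        return sb.ToString().Normalize(NormalizationForm.FormC);
    }
}
```

For a listbox you would typically call something like NameSorting.SortForCulture(names, "sv-SE") and bind the result. The culture-aware comparer is usually enough; StripAccents only matters if you need an ordering the culture does not provide.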

Related

Comparing Arabic letters with double diacritics

Arabic has diacritics, much like other languages such as Hebrew or Romanian, but I am not sure whether the issue I have with Arabic also applies to those languages.
In Arabic, a letter can carry two diacritics at once, and that is the source of my problem.
As you can see from the images above, both strings render identically, but when I compare them they don't match.
I could just check whether each string contains all the characters of the other, but I am hoping for a better solution, as that would require a lot of changes in my application.
Instead of ==, use String.Equals(string1, string2, StringComparison.CurrentCulture), as long as your current culture is Arabic. == compares the raw chars and does not account for the culture.
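Since the images are not available here, the sketch below assumes the two strings differ only in the order of the two combining marks (shadda and fatha on the same letter). Under that assumption, Unicode normalization is another way to make them compare equal without depending on the current culture. The class name and the sample constants are hypothetical.

```csharp
using System;
using System.Text;

// Hypothetical helper class, not part of .NET.
public static class DiacriticComparison
{
    // Assumed sample data: the letter beh (U+0628) with shadda (U+0651)
    // and fatha (U+064E), typed in two different orders.
    public const string ShaddaThenFatha = "\u0628\u0651\u064E";
    public const string FathaThenShadda = "\u0628\u064E\u0651";

    // True when both strings are canonically equivalent: normalization puts
    // combining marks into a canonical order before the ordinal comparison.
    public static bool CanonicallyEqual(string a, string b)
    {
        return string.Equals(
            a.Normalize(NormalizationForm.FormC),
            b.Normalize(NormalizationForm.FormC),
            StringComparison.Ordinal);
    }
}
```

With these constants, ShaddaThenFatha == FathaThenShadda is false because the char sequences differ, while CanonicallyEqual reports them as equal. If the strings differ in some other way (for example, different code points that merely look alike), normalization alone will not help.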

Retrieve all upper case letters of current culture

I know that there is CultureInfo.TextInfo.ToUpper(), however, is there any way to retrieve a collection of all uppercase letters for a given culture?
Please note that I only want to get all the uppercase letters of the current language's alphabet. E.g. for en-US I want to get the list A,B,C,...Y,Z (order actually doesn't matter).
There's no database built into .NET that keeps track of the letters that appear in the alphabet of a particular language. It would be a very large one. And a controversial one: even a language with a simple alphabet like Dutch has speakers who don't agree on whether the Ÿ digraph is in the alphabet or not, and at what position it appears. The former Yugoslavia had two alphabets; wars have been fought over it. And a changeable one: Swedish added W not long ago, forced to by the World Wide Web. And a rather impractical one for languages like Chinese and Korean.
You do not want to have to solve this problem in the general case.
Depending on your actual definition of uppercase, there's a lot of them, just in the Invariant culture, let alone the others, and it varies depending upon your operating system.
This LinqPad query lists 973 (on Win8.1, 873 on Vista, 673 on XP) uppercase characters by my definition, namely that the char is left unchanged by ToUpperInvariant but is changed by ToLowerInvariant:
    var UppercaseChars = from i in Enumerable.Range(0, 65536)
                         let c = (char)i
                         let u = Char.ToUpperInvariant(c)
                         let l = Char.ToLowerInvariant(c)
                         where c == u && u != l
                         select c;
    UppercaseChars.Count().Dump();
    String.Join(" ", UppercaseChars).Dump();
Obviously you can change this to use CultureInfo.TextInfo.ToUpper and .ToLower to obtain the list for any culture available.
Note my "definition" of uppercase misses 33 characters (on Win8.1, 135 on Vista, 306 on XP) that are called uppercase by the Unicode Category, but don't have a lowercase alternative (according to ToLowerInvariant). However, it also includes 69 characters (on Win8.1, 71 on Vista, 42 on XP) that are not defined as UppercaseLetter by the Unicode Category, but still have a lowercase alternative (again according to ToLowerInvariant). The latter are some of the characters in the Unicode Categories TitlecaseLetter (not in XP), LetterNumber and OtherSymbol. Vista actually includes 4 characters that are in the Unicode Category LowercaseLetter (ῃ ῳ ⱥ ⱦ).
To actually answer your question, and your questions in comments: the place to get upper case characters according to the Unicode Category "database" is via Char.GetUnicodeCategory. The actual database is not publicly accessible in any other useful way.
For reference you can see the first 255 entries here; the rest is loaded here and looked up here.
And remember your definition of upper case may differ from Unicode's, as I mention in my other answer.
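The two definitions discussed in these answers (Unicode category versus "has a lowercase counterpart in a given culture") could be written outside LinqPad roughly like this. The class and method names are invented for the example, and as noted above the counts will vary with the runtime and operating system.

```csharp
using System.Collections.Generic;
using System.Globalization;
using System.Linq;

// Hypothetical helper class, not part of .NET.
public static class UppercaseLetters
{
    // All UTF-16 code units whose Unicode category is UppercaseLetter (Lu).
    public static List<char> ByUnicodeCategory()
    {
        return Enumerable.Range(0, 65536)
            .Select(i => (char)i)
            .Where(c => char.GetUnicodeCategory(c) == UnicodeCategory.UppercaseLetter)
            .ToList();
    }

    // All chars that the culture's casing rules leave unchanged when
    // uppercased but change when lowercased (the definition from the query above).
    public static List<char> ByCultureCasing(CultureInfo culture)
    {
        TextInfo textInfo = culture.TextInfo;
        return Enumerable.Range(0, 65536)
            .Select(i => (char)i)
            .Where(c => textInfo.ToUpper(c) == c && textInfo.ToLower(c) != c)
            .ToList();
    }
}
```

Neither list is "the alphabet" of a language: both cover every script in the Basic Multilingual Plane, so for en-US you would still have to intersect the result with the letters you consider part of English.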

Regex ignore underscores

I have a regex ([-#.\/,':\w]*[\w])* and it matches all words within a text (including punctuated words like I.B.M), but I want to make it exclude underscores and I can't seem to figure out how to do it... I tried adding ^[_] (e.g. (^[_][-#.\/,':\w]*[\w])*) but it just breaks up all the words into letters. I want to preserve the word matching, but I don't want to have words with underscores in them, nor words that are entirely made up of underscores.
What's the proper way to do this?
P.S.
My app is written in C# (if that makes any difference).
I can't use A-Za-z0-9 because I have to match words regardless of the language (could be Chinese, Russian, Japanese, German, English).
Update
Here is an example:
"I.B.M should be parsed as one word w_o_r_d! Russian should work too: мплекс исторических событий."
The matches should be:
I.B.M.
should
be
parsed
as
one
word
Russian
should
work
too
мплекс
исторических
событий
Note that w_o_r_d should not get matched.
Try this instead:
([-#.\/,':\p{L}\p{Nd}]*[\p{L}\p{Nd}])*
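In .NET, the \w class is equivalent to [\p{L}\p{Mn}\p{Nd}\p{Pc}] when you're performing Unicode matching (add \p{Mn} to the classes above if combining marks should also count as word characters). With RegexOptions.ECMAScript it is simply [a-zA-Z_0-9], which still includes the underscore.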
It's the \p{Pc} Unicode category -- punctuation/connector -- that causes the problem by matching underscores, so we explicitly match against the other categories without including that one.
(Further information here, "Character Classes: Word Character", and here, "Character Classes: Supported Unicode General Categories".)
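One caveat when trying this in C#: because of the outer * the pattern can also match the empty string, and on w_o_r_d it would return the letters w, o, r, d as separate matches rather than skipping the word. A possible workaround, sketched below under my own assumptions, is to tokenize with the underscore still allowed and then discard any token that contains one. WordMatcher and Words are hypothetical names.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper class, not part of .NET.
public static class WordMatcher
{
    // A token starts and ends with a letter, digit or underscore and may
    // contain the listed punctuation in between (so "I.B.M" stays together).
    private static readonly Regex Token = new Regex(
        @"[\p{L}\p{Nd}_](?:[-#./,':\p{L}\p{Nd}_]*[\p{L}\p{Nd}_])?");

    // Returns all tokens that do not contain an underscore.
    public static List<string> Words(string text)
    {
        return Token.Matches(text)
            .Cast<Match>()
            .Select(m => m.Value)
            .Where(w => w.IndexOf('_') < 0)
            .ToList();
    }
}
```

Keeping the underscore inside the token pattern is deliberate: it makes w_o_r_d a single token, so it can be dropped as a whole instead of being split into letters.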
The underscore comes from \w.
Simply use A-Za-z0-9 instead.
For a more concise version of LukeH's regex, you can use simply:
([-#.\/,':\p{L}]*\p{L})*
I simply used \p{L} instead of Lu, Ll, Lt, Lo, Lm. See Supported Unicode General Categories

Remove spam url in text

Input:
dsfdsf www. cnn .com dksfj kdsfjkdjfdf
www.google.com dkfjkdjfk w w w . ya
hoo .co mdfdd
Output:
dsfdsf dksfj kdsfjkdjfdf dkfjkdjfk mdfdd
How do I write a function that does this in C#?
Basically you would have to implement two steps:
Normalization
Filtering
Normalization means that you remove all whitespace and other noise characters from your input, then transcode all diacritics, special characters, etc. into the basic Latin alphabet (this maps identical- or similar-looking glyphs to a single char; e.g. omicron and o look identical). You need to retain a mapping from each position in the normalized version back to the original input.
Then you would search the normalized input for blocked patterns, retrieve the same pattern in the original input and remove it.
Of course, this approach is not fail-safe; you might actually get false positives.
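A minimal sketch of the two steps might look like the following. It only strips whitespace and lowercases during normalization (no transcoding of look-alike glyphs), and it assumes the blocked patterns are supplied already lowercased and without whitespace. SpamUrlFilter and RemoveBlocked are made-up names.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper class, not part of .NET.
public static class SpamUrlFilter
{
    // Removes every occurrence of a blocked pattern from the input, even if
    // the occurrence is broken up by whitespace or uses different casing.
    public static string RemoveBlocked(string input, IEnumerable<string> blockedPatterns)
    {
        // Step 1: normalize, remembering where each kept char came from.
        var normalized = new StringBuilder();
        var originalIndex = new List<int>();
        for (int i = 0; i < input.Length; i++)
        {
            if (char.IsWhiteSpace(input[i])) continue;
            normalized.Append(char.ToLowerInvariant(input[i]));
            originalIndex.Add(i);
        }
        string norm = normalized.ToString();

        // Step 2: find patterns in the normalized text and mark the
        // corresponding span of the original text for removal.
        var remove = new bool[input.Length];
        foreach (string pattern in blockedPatterns)
        {
            if (string.IsNullOrEmpty(pattern)) continue;
            int start = 0;
            while ((start = norm.IndexOf(pattern, start, StringComparison.Ordinal)) >= 0)
            {
                int first = originalIndex[start];
                int last = originalIndex[start + pattern.Length - 1];
                for (int k = first; k <= last; k++) remove[k] = true;
                start += pattern.Length;
            }
        }

        var result = new StringBuilder();
        for (int i = 0; i < input.Length; i++)
        {
            if (!remove[i]) result.Append(input[i]);
        }
        return result.ToString();
    }
}
```

The false-positive problem shows up immediately with the question's own input: once whitespace is removed, "hoo .co mdfdd" contains "yahoo.com", so a pattern for that URL would also swallow the leading m of the following word.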
A good answer describing how the simple filtering is doomed can be found here:
How do you implement a good profanity filter?
Start with learning about the RegEx (Regular Expression) facilities in C#, then you'll need a good RegEx that matches a URL. You'd need to change this to manage URLs with spaces though.

Regular expression to catch letters beyond a-z

A normal regexp to allow letters only would be "[a-zA-Z]", but I'm from Sweden, so I would have to change that into "[a-zåäöA-ZÅÄÖ]". But suppose I don't know what letters are used in the alphabet.
Is there a way to automatically know what chars are valid in a given locale/language, or should I just make a blacklist of chars that I (think I) know I don't want?
You can use \pL to match any 'letter', which will support all letters in all languages. You can narrow it down to specific languages using 'named blocks'. More information can be found on the Character Classes documentation on MSDN.
My recommendation would be to put the regular expression (or at least the "letter" part) into a localised resource, which you can then pull out based on the current locale and form into the larger pattern.
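In C#, the \p{L} class and a named block could be used along these lines. The helper names are hypothetical, and IsCyrillic is just one example of a block name; which block (if any) fits a given language is something you would have to decide yourself.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper class, not part of .NET.
public static class LetterMatching
{
    // True when the string consists of one or more Unicode letters of any script.
    public static bool IsLettersOnly(string s)
    {
        return Regex.IsMatch(s, @"^\p{L}+\z");
    }

    // True when every char falls inside the Cyrillic block (U+0400 to U+04FF).
    public static bool IsCyrillicOnly(string s)
    {
        return Regex.IsMatch(s, @"^\p{IsCyrillic}+\z");
    }
}
```

Note that \p{L} accepts letters from every script, so it answers "is this a letter?" rather than "is this a letter of Swedish?". The block-based check is narrower, but a block is a code point range, not an alphabet.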
What about \p{name} ?
Matches any character in the named character class specified by {name}.
Supported names are Unicode groups and block ranges. For example, Ll, Nd, Z,
IsGreek, IsBoxDrawing.
I don't know enough about unicode, but maybe your characters fit a unicode class?
See character categories selection with \p and \w unicode semantics.
All chars are "valid," so I think you're really asking for chars that are "generally considered to be letters" in a locale.
The Unicode specification has some guidelines, but in general the answer is "no," you would need to list the characters you decide are "letters."
Is there a way to automatically know what chars are valid in a given locale/language or should I just make a blacklist of chars that I (think I) know I don't want?
This is not, in general, possible.
After all, English text does include some accented characters (e.g. in "fête" and "naïve", which in UK English, to be strictly correct, still use accents). In some languages some of the standard letters are rarely used (e.g. y-diaeresis in French).
Then consider that foreign words are often included (this will often be the case where technical terms are used). Quotations would be another source.
If your requirements are sufficiently narrowly defined you may be able to create a definition, but this requires linguistic experience in that language.
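This regex allows only basic Latin letters, the Latin-1 accented letters, and spaces through (note that the À-ÿ range also contains the × and ÷ signs):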
[a-zA-ZÀ-ÿ ]
