Retrieve all upper case letters of current culture - c#

I know that there is CultureInfo.TextInfo.ToUpper(), however, is there any way to retrieve a collection of all uppercase letters for a given culture?
Please note that I only want to get all the uppercase letters of the current language's alphabet. E.g. for en-US I want to get the list A,B,C,...Y,Z (order actually doesn't matter).

There's no database built into .NET that keeps track of the letters that appear in the alphabet of a particular language. It would be a very large one. And a controversial one: even a language with a simple alphabet like Dutch has speakers who don't agree on whether the IJ digraph is in the alphabet or not, and at what position it appears. The former Yugoslavia had two alphabets; wars have been fought over it. And a changeable one: Swedish added W not long ago, forced to by the World Wide Web. And a rather impractical one for languages like Chinese and Korean.
You do not want to have to solve this problem in the general case.

Depending on your actual definition of uppercase, there's a lot of them, just in the Invariant culture, let alone the others, and it varies depending upon your operating system.
This LinqPad query lists 973 (on Win8.1, 873 on Vista, 673 on XP) uppercase characters by my definition, which is the char is invariant to ToUpperInvariant and not invariant to ToLowerInvariant:
var UppercaseChars = from i in Enumerable.Range(0, 65536)
                     let c = (char)i
                     let u = Char.ToUpperInvariant(c)
                     let l = Char.ToLowerInvariant(c)
                     where c == u && u != l
                     select c;
UppercaseChars.Count().Dump();
String.Join(" ", UppercaseChars).Dump();
Obviously you can change this to use CultureInfo.TextInfo.ToUpper and .ToLower to obtain the list for any culture available.
Note my "definition" of uppercase misses 33 characters (on Win8.1, 135 on Vista, 306 on XP) that are called uppercase by the Unicode Category, but don't have a lowercase alternative (according to ToLowerInvariant). However, it also includes 69 characters (on Win8.1, 71 on Vista, 42 on XP) that are not defined as UppercaseLetter by the Unicode Category, but still have a lowercase alternative (again according to ToLowerInvariant). The latter are some of the characters in the Unicode Categories TitlecaseLetter (not in XP), LetterNumber and OtherSymbol. Vista actually includes 4 characters that are in the Unicode Category LowercaseLetter (ῃ ῳ ⱥ ⱦ).
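As a rough sketch of the per-culture variant mentioned above, the same filter can be run through a culture's TextInfo instead of the invariant methods. The class and method names are made up for illustration, and the result still depends on the casing tables of the OS and runtime in use.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;

public static class CultureUppercase
{
    // Hypothetical helper (not a framework API): returns every UTF-16 code unit
    // that the culture's casing rules leave unchanged on ToUpper but change on ToLower.
    public static List<char> GetUppercaseChars(CultureInfo culture)
    {
        TextInfo textInfo = culture.TextInfo;
        return Enumerable.Range(0, 65536)
            .Select(i => (char)i)
            .Where(c => textInfo.ToUpper(c) == c && textInfo.ToLower(c) != c)
            .ToList();
    }
}
```

Calling CultureUppercase.GetUppercaseChars(CultureInfo.CurrentCulture) gives the list for the current culture. It is still "every character the casing tables know about", not "the alphabet of that language".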

To actually answer your question, and your questions in comments: the place to get upper case characters according to the Unicode Category "database" is via Char.GetUnicodeCategory. The actual database is not publicly accessible in any other useful way.
For reference you can see the first 255 entries here; the rest is loaded here and looked up here.
And remember your definition of upper case may differ from Unicode's, as I mention in my other answer.
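A minimal sketch of the Char.GetUnicodeCategory approach, assuming it is enough to scan the 65536 UTF-16 code units (so characters outside the Basic Multilingual Plane are ignored). The class and method names are illustrative only.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;

public static class UnicodeUppercase
{
    // Hypothetical helper: every char whose Unicode general category is
    // UppercaseLetter (Lu), according to the runtime's own category data.
    public static List<char> GetUppercaseLetters()
    {
        return Enumerable.Range(0, 65536)
            .Select(i => (char)i)
            .Where(c => char.GetUnicodeCategory(c) == UnicodeCategory.UppercaseLetter)
            .ToList();
    }
}
```

This list is culture-independent: the Unicode category of a character does not change with CultureInfo.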

Related

Combining (some) Unicode nonspacing marks with associated letters for uniform processing

I am writing a text-processing Windows app in C#. The app processes many plain text files to count characters, words, etc. To do this, the app iterates over the characters in each file. I am finding that some text files represent accented letters such as á by using the Unicode character U+00E1 (small letter A with acute) while others use a simple unaccented a (U+0061, small letter A) followed by a U+0301 (combining acute accent). There's no visual difference in how the text is rendered on screen by Notepad or other editors I've used, but the underlying character stream is obviously different.
I would like to detect and treat these two situations in the same way. In other words, I would like my app to combine a letter followed by a combining codepoint into the equivalent self-contained character. For example, I'd like to combine the sequence U+0061 U+0301 into U+00E1. As far as I know, there is no simple algorithm to do this, other than a large and error-prone lookup table for all the possible combinations of plain letters and combining characters.
Is there a simpler and more direct algorithm to perform this combination?
You're referring to Unicode normalization forms. That page goes into some interesting detail, but the gist is that representing e.g. accented letters as a single codepoint (e.g. á as U+00E1) is Normalization Form C, or NFC, and as separate codepoints (e.g. á as U+0061 U+0301) is NFD.
Section 3.11 of the Unicode specification goes into the gory details of how to implement it, with some extra details here.
Luckily, you don't need to implement this yourself: string.Normalize() already exists.
"\u00E1".Normalize(NormalizationForm.FormD); // \u0061\u0301
"\u0061\u0301".Normalize(NormalizationForm.FormC); // \u00E1
That said, we've only just scratched the surface of what a "character" is. A good illustration of this uses emoji, but it applies to various scripts as well: there are modern scripts where normal characters are composed of two codepoints, and there is no single combined codepoint available. This pops up in e.g. Tamil and Thai, as well as some eastern European languages (IIRC).
My favourite example is 👩🏽‍🚒, or "Woman Firefighter: Medium Skin Tone". Want to guess how that's encoded? That's right, 4 different code points: U+1F469 U+1F3FD U+200D U+1F692.
U+1F469 is 👩, the Woman emoji.
U+1F3FD is "Emoji Modifier Fitzpatrick Type-4", which modifies the previous emoji to give a brown skin tone 👩🏽, rendered as 🏽 when it appears on its own.
U+200D is a "zero-width joiner", which is used to glue codepoints together into the same character.
U+1F692 is 🚒, the Fire Engine emoji.
So you take a woman, add a brown skin tone, glue her to a fire engine, and you get a woman brown-skinned firefighter.
(Just for fun, try pasting 👩🏽‍🚒 into various editors and then using backspace on it. If it's rendered properly, some editors turn it into 👩🏽 and then 👩 and then delete it, while others skip various parts. However, you select it as a single character. This mirrors how editing complex characters works in some scripts).
(Another fun nugget is the flag emoji. Unicode defines "Regional Indicator Symbol Letters A-Z" (U+1F1E6 through U+1F1FF), and a flag is encoded as the country's ISO 3166-1 alpha-2 letter country code using these indicator symbols. So 🇺🇸 is 🇺 followed by 🇸. Paste the 🇸 after the 🇺 and a flag appears!)
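The flag encoding can be sketched in a few lines. The helper below is hypothetical (not a framework API); it maps each letter A-Z onto the regional indicator range starting at U+1F1E6 and does not check that the code is a real ISO 3166-1 country.

```csharp
using System;
using System.Text;

public static class FlagEmoji
{
    // Hypothetical helper: turns a two-letter code such as "US" into the pair of
    // regional indicator symbols that renderers display as a flag.
    public static string FromCountryCode(string alpha2)
    {
        if (alpha2 == null || alpha2.Length != 2)
            throw new ArgumentException("Expected a two-letter code.", nameof(alpha2));

        var builder = new StringBuilder();
        foreach (char c in alpha2.ToUpperInvariant())
        {
            if (c < 'A' || c > 'Z')
                throw new ArgumentException("Expected letters A-Z only.", nameof(alpha2));

            // U+1F1E6 is Regional Indicator Symbol Letter A; the rest follow in order.
            builder.Append(char.ConvertFromUtf32(0x1F1E6 + (c - 'A')));
        }
        return builder.ToString();
    }
}
```

FlagEmoji.FromCountryCode("US") yields U+1F1FA U+1F1F8, which is 4 chars in UTF-16 because both codepoints need surrogate pairs.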
Of course, if you're iterating over this codepoint-by-codepoint you're going to visit U+1F469 U+1F3FD U+200D U+1F692 individually, which probably isn't what you want.
If you're iterating this char-by-char you're going to do even worse due to surrogate pairs: those codepoints such as U+1F469 are simply too large to represent using a single 16-bit char, so we need to use two of them. This means that if you try to iterate over U+1F469, you'll actually find you've got two chars: 0xD83D (the high surrogate) and 0xDC69 (the low surrogate).
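To see the surrogate split for yourself, here is a small sketch that prints the UTF-16 code units of a single codepoint. The class and method names are made up for illustration.

```csharp
using System;
using System.Linq;

public static class SurrogateDemo
{
    // Hypothetical helper: encodes one codepoint as UTF-16 and returns its
    // code units as space-separated four-digit hex values.
    public static string DescribeCodeUnits(int codePoint)
    {
        string encoded = char.ConvertFromUtf32(codePoint);
        return string.Join(" ", encoded.Select(c => ((int)c).ToString("X4")));
    }
}
```

SurrogateDemo.DescribeCodeUnits(0x1F469) returns "D83D DC69", while a codepoint inside the Basic Multilingual Plane such as 0x00E1 comes back as the single unit "00E1".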
Instead, we need to introduce extended grapheme clusters, which represent what you'd traditionally think of as a single character. Again there's a bunch of complexity if you want to do this yourself, and again someone's helpfully done it for you: StringInfo.GetTextElementEnumerator. Note that this was a bit buggy pre-.NET 5, and didn't properly handle all EGCs.
In .NET 5, however:
// Number of chars, as 3 of the codepoints need to use surrogate pairs when
// encoded with UTF-16
"👩🏽‍🚒".Length; // 7
// Number of Unicode codepoints
"👩🏽‍🚒".EnumerateRunes().Count(); // 4
// Number of extended grapheme clusters
GetTextElements("👩🏽‍🚒").Count(); // 1
public static IEnumerable<string> GetTextElements(string s)
{
    TextElementEnumerator charEnum = StringInfo.GetTextElementEnumerator(s);
    while (charEnum.MoveNext())
    {
        yield return charEnum.GetTextElement();
    }
}
I've used emoji as an understandable example here, but these issues also crop up in modern scripts, and people working with text need to be aware of them.

Word-Counter in some hieroglyphics languages?

Is there any available library for word-counting of some hieroglyphics language (ex: chinese, japanese, korean...)?
I found that MS Word count effectively texts in these languages. Can I add reference to MS Word libraries in my .NET application to implement this function?
Or is there any other solutions to achieve this purpose?
Is there any available library for word-counting of some hieroglyphics language (ex: chinese, japanese, korean...)?
Hieroglyphics? No, they're not. They're logographic characters, and the difference is not so subtle. I'm sure a native speaker could explain this much better than I can.
Japanese and Chinese text is made of characters, exactly as in Western languages, but one character may be a word too. Moreover, these languages don't need spaces to separate words, so we can't distinguish characters from words by using blanks as delimiters.
What Word does is count words by assuming they are equal to characters, and you can do the same in your code by counting characters (just don't forget it's Unicode, so you can't count bytes). To count real words you need a dictionary, because you can't rely on spaces.
For example these strings:
这是一个示例文本
これは、サンプルのテキストです
These will be counted as 8 characters and 8 words for the Chinese string, and 15 characters and 15 words for the Japanese one. That's not the real word count, though (for example, the Japanese sentence is 5 words when transliterated into romaji). Also don't forget that Japanese has more than one script (and one family of them is phonetic).
What's the point? What will you count? Words transliterated into one of the phonetic representations (in Latin characters) that we use for them? Which one? The word count would be quite different, and it would actually be counting our concept of words (which is, I suppose, why Word counts characters).
That said, try this code:
string text = "这是一个示例文本";
MessageBox.Show(text.Length.ToString());
It'll display 8, as Word does (we're counting characters); in bytes (assuming UTF-8 encoding) it's 24. It makes no sense to count spaces here. If you plan to count words in a transliteration, you need an external library (it's not an easy task to do yourself), and a different one for each language you want to support (auto-detecting the language is fairly easy, because Japanese text uses hiragana/katakana characters very often). Which one? There are a lot of them. I don't know about Chinese, but for Japanese a popular tool for transliterating kanji is Kakasi.
Korean is a completely different story: it's an alphabet, just like the Latin one, but a character (which should really be called a syllable) may be composed of several letters. Again, they don't need spaces, so you can't rely on them for word counting. It's somewhat more complicated, because here you may need a dictionary even for character counting (otherwise you'll just count syllables).
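The character-versus-byte difference for the sample strings can be checked with a short sketch that avoids MessageBox, so it does not need a Windows Forms reference. The class and method names are illustrative.

```csharp
using System;
using System.Text;

public static class LengthVsBytes
{
    // Number of UTF-16 code units; for these samples this equals the number
    // of visible characters because every character is in the BMP.
    public static int CharCount(string text) => text.Length;

    // Number of bytes the same text occupies when encoded as UTF-8.
    public static int Utf8ByteCount(string text) => Encoding.UTF8.GetByteCount(text);
}
```

For "这是一个示例文本" this gives 8 characters and 24 bytes, since each of these characters takes 3 bytes in UTF-8.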

Counting special UTF-8 character

I'm looking for a way to count special characters that are formed from more than one character, but found no solution online!
E.g. I want to count the string "வாழைப்பழம". It actually consists of 6 Tamil characters, but it's 9 characters when we find the length the normal way. I am wondering whether Tamil is the only kind of encoding that causes this problem, and whether there is a solution to this. I'm currently trying to find a solution in C#.
Thank you in advance =)
Use StringInfo.LengthInTextElements:
var text = "வாழைப்பழம";
Console.WriteLine(text.Length); // 9
Console.WriteLine(new StringInfo(text).LengthInTextElements); // 6
The explanation for this behaviour can be found in the documentation of String.Length:
The Length property returns the number of Char objects in this instance, not the number of Unicode characters. The reason is that a Unicode character might be represented by more than one Char. Use the System.Globalization.StringInfo class to work with each Unicode character instead of each Char.
A minor nitpick: strings in .NET use UTF-16, not UTF-8.
When you're talking about the length of a string, there are several different things you could mean:
1. Length in bytes. This is the old C way of looking at things, usually.
2. Length in Unicode code points. This gets you closer to modern times and should be how string lengths are treated, except it isn't.
3. Length in UTF-8/UTF-16 code units. This is the most common interpretation, deriving from 1. Certain characters take more than one code unit in those encodings, which complicates things if you don't expect it.
4. Count of visible “characters” (graphemes). This is usually what people mean when they say characters or length of a string.
In your case your confusion stems from the difference between 4. and 3. 3. is what C# uses, 4. is what you expect. Complex scripts such as Tamil use ligatures and diacritics. Ligatures are contractions of two or more adjacent characters into a single glyph – in your case ழை is a ligature of ழ and ை – the latter of which changes the appearance of the former; வா is also such a ligature. Diacritics are ornaments around a letter, e.g. the accent in à or the dot above ப்.
The two cases I mentioned both result in a single grapheme (what you perceive as a single character), yet they both need two actual characters each. So you end up with three code points more in the string.
One thing to note: For your case the distinction between 2. and 3. is irrelevant, but generally you should keep it in mind.
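The four notions of length can be put side by side in a small sketch. The class and method names are made up, and the text element count relies on StringInfo, whose results have varied between .NET versions for some clusters.

```csharp
using System;
using System.Globalization;
using System.Text;

public static class StringLengths
{
    // 1. Length in bytes, here assuming a UTF-8 encoding.
    public static int Utf8Bytes(string s) => Encoding.UTF8.GetByteCount(s);

    // 2. Length in Unicode code points: a surrogate pair counts once.
    public static int CodePoints(string s)
    {
        int count = 0;
        for (int i = 0; i < s.Length; i++)
        {
            // A well-formed surrogate pair is one code point spread over two chars.
            if (char.IsSurrogatePair(s, i)) i++;
            count++;
        }
        return count;
    }

    // 3. Length in UTF-16 code units, which is what String.Length reports.
    public static int Utf16CodeUnits(string s) => s.Length;

    // 4. Count of text elements (graphemes) as seen by StringInfo.
    public static int TextElements(string s) => new StringInfo(s).LengthInTextElements;
}
```

For "வாழைப்பழம" the code unit and code point counts are both 9, because every Tamil codepoint fits in one char, while the text element count is 6.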

What is the regular expression for the following strings and would the expression change if the number rolled over?

What would be the following regular expressions for the following strings?
56AAA71064D6
56AAA7105A25
Would the regular expression change if the numbers rolled over? What I mean by this is that the above numbers happen to contain hexadecimal values and I don't know how the value changes once it reaches F. Using the first one as an example: 56AAA71064D6, if this went up to
56AAA71064F6 and then the following one became 56AAA7106406, this would create a different regular expression, because where a letter was allowed there is now a digit. Does this make the regular expression even more difficult? Suggestions?
A manufacturer is going to enter a range of serial numbers. The problem is that different manufacturers have different formats for serial numbers (some are just numbers, some are alphanumeric, some contain extra characters like dashes, and some contain hexadecimal values, which makes it more difficult because I don't know how they roll over to the next serial number). The roll-over issue is the biggest problem, because the serial numbers are entered as a range like 5A1B - 6F12, and without knowing how they roll over, it seems to me that storing them in the database is not so easy. I was going to give the user the option to input the pattern (expression) and store that in the database, but if a character or characters change from a digit to a letter or vice versa, then the regular expression is no longer valid for certain serial numbers.
Also, the above example I gave is with just one case. There are multitude of serial numbers that would contain different expressions.
There's no single regular expression which is "the" expression to match both of those strings. Instead, there are infinitely many which will do so. Here are two options at opposite ends of the spectrum:
(56AAA71064D6)|(56AAA7105A25)
.*
The first will only match those two strings. The second will match anything. Both satisfy all the criteria you've given.
Now, if you specify more criteria, then we'd be able to give a more reasonable idea of the regular expression to provide - and that will drive the answers to the other questions. (At the moment, the only answer that makes sense is "It depends on what regex you use.")
I think you could do it this way for 12 characters. This will search for a 12 character phrase where each of the characters must be a capital (A or B or C or D or E or F or 1 or 2 or 3 or 4 or 5 or 6 or 7 or 8 or 9 or 0)
[A-F0-9]{12}
If you're wanting to include the possibility of dashes then do this.
[A-F0-9\-]{12}
Or if you want to allow for dashes in addition to the 12 characters, then do this. Note that it will pick up any 12-15 character item that fits the criteria.
[A-F0-9\-]{12,15}
Or if it's surrounded by spaces (AAAAHHHh...SO is stripping out my spaces!!!)
[A-F0-9\-]{12}
Or if it's surrounded by tabs
\t[A-F0-9\-]{12}\t
This matches a string that contains 12 hex digits:
[0-9A-F]{12}
Assuming these are all 12-digit hexadecimal numbers, which it looks like they are, the following regex should work:
[0-9A-Fa-f]{12}
Here I'm using a character class to say that I want any digit, OR A-F, OR a-f. As a bonus I'm allowing lowercase letters; if you don't want those just get them out of the regex.
As Jon Skeet and others have said, you really didn't provide enough information, so if you don't like this answer please understand that I was doing the best I can with what information you provided.
So, how about this:
[0-9A-F]{12}
Well it sounds like you're describing a 12 digit hexadecimal number:
^[A-F0-9]{12}$
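If the serials really are 12-digit hexadecimal numbers (an assumption; the question does not confirm it), the roll-over worry goes away once the anchored pattern is combined with a numeric range check. This is a sketch with made-up names, not something any of the answers above proposed.

```csharp
using System;
using System.Text.RegularExpressions;

public static class HexSerial
{
    // Same anchored pattern as above: exactly 12 uppercase hex digits.
    private static readonly Regex Pattern = new Regex("^[0-9A-F]{12}$");

    public static bool IsValid(string serial) => Pattern.IsMatch(serial);

    // Hypothetical range check: parse as base-16 numbers and compare the values.
    // 12 hex digits are 48 bits, so they always fit in a signed 64-bit long.
    public static bool InRange(string serial, string first, string last)
    {
        long value = Convert.ToInt64(serial, 16);
        return Convert.ToInt64(first, 16) <= value && value <= Convert.ToInt64(last, 16);
    }
}
```

The pattern stays the same however the digits roll over, because every position accepts both digits and the letters A-F. Only the range comparison has to know that the value is a number.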

Regular expression to catch letters beyond a-z

A normal regexp to allow letters only would be "[a-zA-Z]" but I'm from Sweden, so I would have to change that into "[a-zåäöA-ZÅÄÖ]". But suppose I don't know what letters are used in the alphabet.
Is there a way to automatically know what chars are valid in a given locale/language or should I just make a blacklist of chars that I (think I) know I don't want?
You can use \pL to match any 'letter', which will support all letters in all languages. You can narrow it down to specific languages using 'named blocks'. More information can be found on the Character Classes documentation on MSDN.
My recommendation would be to put the regular expression (or at least the "letter" part) into a localised resource, which you can then pull out based on the current locale and form into the larger pattern.
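A minimal sketch of the \pL approach with .NET's Regex class; the helper names are made up for illustration. Note that \p{L} is not locale-aware: it accepts letters from every script, not just those of one language.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LetterCheck
{
    // True when the input consists only of characters in the Unicode
    // general category "Letter" (any script, any case).
    public static bool IsLettersOnly(string input) =>
        Regex.IsMatch(input, @"^\p{L}+$");

    // Same idea restricted to the "Uppercase Letter" (Lu) category.
    public static bool IsUppercaseLettersOnly(string input) =>
        Regex.IsMatch(input, @"^\p{Lu}+$");
}
```

LetterCheck.IsLettersOnly("åäöÅÄÖ") is true without listing the Swedish letters explicitly, while "abc1" is rejected because of the digit.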
What about \p{name} ?
Matches any character in the named character class specified by {name}.
Supported names are Unicode groups and block ranges. For example, Ll, Nd, Z,
IsGreek, IsBoxDrawing.
I don't know enough about unicode, but maybe your characters fit a unicode class?
See character categories selection with \p and \w unicode semantics.
All chars are "valid," so I think you're really asking for chars that are "generally considered to be letters" in a locale.
The Unicode specification has some guidelines, but in general the answer is "no," you would need to list the characters you decide are "letters."
Is there a way to automatically know what chars are valid in a given locale/language or should I just make a blacklist of chars that I (think I) know I don't want?
This is not, in general, possible.
After all, English text does include some accented characters (e.g. in "fête" and "naïve", which in UK English, to be strictly correct, still use accents). In some languages some of the standard letters are rarely used (e.g. y-diaeresis in French).
Then consider that foreign words may be included (this will often be the case where technical terms are used). Quotations would be another source.
If your requirements are sufficiently narrowly defined you may be able to create a definition, but this requires linguistic experience in that language.
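This regex allows only ASCII letters, the Latin-1 characters from À to ÿ, and the space through (note that the À-ÿ range also contains the non-letters × and ÷):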
[a-zA-ZÀ-ÿ ]
