Counting special UTF-8 character

I'm looking for a way to count special characters that are formed from more than one character, but I've found no solution online.
For example, I want to count the characters in the string "வாழைப்பழம". It actually consists of 6 Tamil characters, but the normal way of finding the length reports 9. Is Tamil the only script that causes this problem, and is there a solution? I'm currently trying to find one in C#.
Thank you in advance =)

Use StringInfo.LengthInTextElements:
var text = "வாழைப்பழம";
Console.WriteLine(text.Length); // 9
Console.WriteLine(new StringInfo(text).LengthInTextElements); // 6
The explanation for this behaviour can be found in the documentation of String.Length:
The Length property returns the number of Char objects in this instance, not the number of Unicode characters. The reason is that a Unicode character might be represented by more than one Char. Use the System.Globalization.StringInfo class to work with each Unicode character instead of each Char.
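The same text-element view is useful for more than counting. As a minimal sketch (the class and method names here are made up for illustration), this is one way to cut a string to a maximum number of text elements without splitting a base letter from its combining marks:

```csharp
using System;
using System.Globalization;
using System.Text;

public static class TextElementSketch
{
    // Returns at most maxElements text elements from the start of s.
    // Hypothetical helper, not part of the framework.
    public static string TruncateTextElements(string s, int maxElements)
    {
        if (s == null) throw new ArgumentNullException(nameof(s));
        if (maxElements < 0) throw new ArgumentOutOfRangeException(nameof(maxElements));

        var result = new StringBuilder();
        TextElementEnumerator elements = StringInfo.GetTextElementEnumerator(s);
        int taken = 0;
        while (taken < maxElements && elements.MoveNext())
        {
            // Each text element may consist of several Char values.
            result.Append(elements.GetTextElement());
            taken++;
        }
        return result.ToString();
    }
}
```

With the string from the question, TextElementSketch.TruncateTextElements(text, 2) should keep the first two text elements, which occupy four Char values. For emoji and some other sequences the element boundaries can differ between .NET versions.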

A minor nitpick: strings in .NET use UTF-16, not UTF-8.
When you're talking about the length of a string, there are several different things you could mean:
1. Length in bytes. This is the old C way of looking at things, usually.
2. Length in Unicode code points. This gets you closer to modern times and should be the way string lengths are treated, except it isn't.
3. Length in UTF-8/UTF-16 code units. This is the most common interpretation, deriving from 1. Certain characters take more than one code unit in those encodings, which complicates things if you don't expect it.
4. Count of visible “characters” (graphemes). This is usually what people mean when they say characters or length of a string.
In your case the confusion stems from the difference between 4 and 3: 3 is what C# uses, 4 is what you expect. Complex scripts such as Tamil use ligatures and diacritics. Ligatures are contractions of two or more adjacent characters into a single glyph. In your case ழை is a ligature of ழ and ை, the latter of which changes the appearance of the former; வா is also such a ligature. Diacritics are ornaments around a letter, e.g. the accent in à or the dot above ப்.
Each of the cases I mentioned results in a single grapheme (what you perceive as a single character), yet each needs two code points. So the string ends up with three more code points than graphemes.
One thing to note: for your case the distinction between 2 and 3 is irrelevant, but generally you should keep it in mind.
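To make the four meanings of "length" concrete, here is a rough sketch that computes each of them for the same string. The class and method names are invented for this example, and the byte counts depend on which encoding you assume for storage:

```csharp
using System;
using System.Globalization;
using System.Text;

public static class StringMeasures
{
    // 1. Length in bytes, which depends on the encoding used for storage.
    public static int Utf8Bytes(string s) => Encoding.UTF8.GetByteCount(s);
    public static int Utf16Bytes(string s) => Encoding.Unicode.GetByteCount(s);

    // 2. Length in Unicode code points: a valid surrogate pair counts once.
    public static int CodePoints(string s)
    {
        int count = 0;
        for (int i = 0; i < s.Length; i++)
        {
            if (char.IsSurrogatePair(s, i)) i++; // skip the low surrogate
            count++;
        }
        return count;
    }

    // 3. Length in UTF-16 code units, which is what String.Length reports.
    public static int CodeUnits(string s) => s.Length;

    // 4. Count of text elements (graphemes), via StringInfo.
    public static int Graphemes(string s) => new StringInfo(s).LengthInTextElements;
}
```

For the Tamil word in the question this should give 27 UTF-8 bytes, 18 UTF-16 bytes, 9 code points, 9 code units and 6 graphemes, since every character in it lies in the BMP and takes three bytes in UTF-8.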

Related

Combining (some) Unicode nonspacing marks with associated letters for uniform processing

I am writing a text-processing Windows app in C#. The app processes many plain text files to count characters, words, etc. To do this, the app iterates over the characters in each file. I am finding that some text files represent accented letters such as á by using the Unicode character U+00E1 (small letter A with acute) while others use a simple unaccented a (U+0061, small letter A) followed by a U+0301 (combining acute accent). There's no visual difference in how the text is rendered on screen by Notepad or other editors I've used, but the underlying character stream is obviously different.
I would like to detect and treat these two situations in the same way. In other words, I would like my app to combine a letter followed by a combining codepoint into the equivalent self-contained character. For example, I'd like to combine the sequence U+0061 U+0301 into U+00E1. As far as I know, there is no simple algorithm to do this, other than a large and error-prone lookup table for all the possible combinations of plain letters and combining characters.
Is there a simpler and more direct algorithm to perform this combination?

You're referring to Unicode normalization forms. That page goes into some interesting detail, but the gist is that representing e.g. accented letters as a single codepoint (e.g. á as U+00E1) is Normalization Form C, or NFC, and as separate codepoints (e.g. á as U+0061 U+0301) is NFD.
Section 3.11 of the Unicode specification goes into the gory details of how to implement it, with some extra details here.
Luckily, you don't need to implement this yourself: string.Normalize() already exists.
"\u00E1".Normalize(NormalizationForm.FormD); // \u0061\u0301
"\u0061\u0301".Normalize(NormalizationForm.FormC); // \u00E1
That said, we've only just scratched the surface of what a "character" is. A good illustration of this uses emoji, but it applies to various scripts as well: there are modern scripts where normal characters are comprised of two codepoints, and there is no single combined codepoint available. This pops up in e.g. Tamil and Thai, as well as some eastern European languages (IIRC).
My favourite example is 👩🏽‍🚒, or "Woman Firefighter: Medium Skin Tone". Want to guess how that's encoded? That's right, 4 different code points: U+1F469 U+1F3FD U+200D U+1F692.
U+1F469 is 👩, the Woman emoji.
U+1F3FD is "Emoji Modifier Fitzpatrick Type-4", which modifies the previous emoji to give a brown skin tone 👩🏽, rendered as 🏽 when it appears on its own.
U+200D is a "zero-width joiner", which is used to glue codepoints together into the same character.
U+1F692 is 🚒, the Fire Engine emoji.
So you take a woman, add a brown skin tone, glue her to a fire engine, and you get a woman brown-skinned firefighter.
(Just for fun, try pasting 👩🏽‍🚒 into various editors and then using backspace on it. If it's rendered properly, some editors turn it into 👩🏽 and then 👩 and then delete it, while others skip various parts. However, you select it as a single character. This mirrors how editing complex characters works in some scripts).
(Another fun nugget is the flag emoji. Unicode defines "Regional Indicator Symbol Letters A-Z" (U+1F1E6 through U+1F1FF), and a flag is encoded as the country's ISO 3166-1 alpha-2 country code using these indicator symbols. So 🇺🇸 is 🇺 followed by 🇸. Paste the 🇸 after the 🇺 and a flag appears!)
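The regional-indicator scheme is simple enough to sketch in a few lines. This is only an illustration (the helper name is invented): it maps each letter A-Z of a two-letter code onto the range starting at U+1F1E6 and does not check that the code is a real ISO 3166-1 country.

```csharp
using System;
using System.Text;

public static class FlagSketch
{
    private const int RegionalIndicatorA = 0x1F1E6; // REGIONAL INDICATOR SYMBOL LETTER A

    // Builds the two regional-indicator code points for a two-letter code.
    // Hypothetical helper; whether a flag is drawn depends on the font/renderer.
    public static string FlagFromCountryCode(string alpha2)
    {
        if (alpha2 == null || alpha2.Length != 2)
            throw new ArgumentException("Expected a two-letter code.", nameof(alpha2));

        var result = new StringBuilder(4);
        foreach (char c in alpha2.ToUpperInvariant())
        {
            if (c < 'A' || c > 'Z')
                throw new ArgumentException("Expected letters A-Z only.", nameof(alpha2));
            // Each indicator is outside the BMP, so it becomes a surrogate pair.
            result.Append(char.ConvertFromUtf32(RegionalIndicatorA + (c - 'A')));
        }
        return result.ToString();
    }
}
```

FlagSketch.FlagFromCountryCode("US") produces U+1F1FA followed by U+1F1F8, which is four Char values in a .NET string.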
Of course, if you're iterating over this codepoint-by-codepoint you're going to visit U+1F469 U+1F3FD U+200D U+1F692 individually, which probably isn't what you want.
If you're iterating this char-by-char you're going to do even worse due to surrogate pairs: those codepoints such as U+1F469 are simply too large to represent using a single 16-bit char, so we need to use two of them. This means that if you try to iterate over U+1F469, you'll actually find you've got two chars: 0xD83D (the high surrogate) and 0xDC69 (the low surrogate).
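For the curious, the surrogate values quoted above can be reproduced with a little arithmetic. The sketch below (names are illustrative) does by hand what char.ConvertFromUtf32 and char.ConvertToUtf32 already do for you, so in real code you would normally call those instead:

```csharp
using System;

public static class SurrogateMath
{
    // Encodes a code point as UTF-16: one Char for the BMP, two above it.
    public static string ToUtf16(int codePoint)
    {
        if (codePoint < 0 || codePoint > 0x10FFFF || (codePoint >= 0xD800 && codePoint <= 0xDFFF))
            throw new ArgumentOutOfRangeException(nameof(codePoint));
        if (codePoint < 0x10000)
            return ((char)codePoint).ToString();

        int offset = codePoint - 0x10000;              // 20 bits remain
        char high = (char)(0xD800 + (offset >> 10));   // top 10 bits
        char low = (char)(0xDC00 + (offset & 0x3FF));  // bottom 10 bits
        return new string(new[] { high, low });
    }

    // Decodes a high/low surrogate pair back into a code point.
    public static int FromSurrogates(char high, char low)
    {
        if (!char.IsSurrogatePair(high, low))
            throw new ArgumentException("Not a valid surrogate pair.");
        return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
    }
}
```

SurrogateMath.ToUtf16(0x1F469) yields the two chars 0xD83D and 0xDC69, and FromSurrogates turns them back into 0x1F469.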
Instead, we need to introduce extended grapheme clusters, which represent what you'd traditionally think of as a single character. Again there's a bunch of complexity if you want to do this yourself, and again someone's helpfully done it for you: StringInfo.GetTextElementEnumerator. Note that this was a bit buggy pre-.NET 5, and didn't properly handle all EGCs.
In .NET 5, however:
// Number of chars, as 3 of the codepoints need to use surrogate pairs when
// encoded with UTF-16
"👩🏽‍🚒".Length; // 7
// Number of Unicode codepoints
"👩🏽‍🚒".EnumerateRunes().Count(); // 4
// Number of extended grapheme clusters
GetTextElements("👩🏽‍🚒").Count(); // 1
public static IEnumerable<string> GetTextElements(string s)
{
    TextElementEnumerator charEnum = StringInfo.GetTextElementEnumerator(s);
    while (charEnum.MoveNext())
    {
        yield return charEnum.GetTextElement();
    }
}
I've used emoji as an understandable example here, but these issues also crop up in modern scripts, and people working with text need to be aware of them.
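Coming back to the original question of treating U+00E1 and U+0061 U+0301 alike: one possible approach, sketched here with made-up helper names, is to normalize both sides to the same form before comparing or counting. Which form to pick is an assumption; NFC is used because the question asked for the combined character.

```csharp
using System;
using System.Text;

public static class NormalizationSketch
{
    // Compares two strings after bringing both to Normalization Form C.
    public static bool EqualsNormalized(string a, string b)
    {
        if (a == null || b == null) return ReferenceEquals(a, b);
        return string.Equals(
            a.Normalize(NormalizationForm.FormC),
            b.Normalize(NormalizationForm.FormC),
            StringComparison.Ordinal);
    }

    // Counts Char values after composing letters with their combining marks.
    public static int ComposedLength(string s) =>
        s.Normalize(NormalizationForm.FormC).Length;
}
```

With this, EqualsNormalized("\u00E1", "\u0061\u0301") is true and ComposedLength("\u0061\u0301") is 1. Note that Normalize throws an ArgumentException for strings containing invalid Unicode such as a lone surrogate, so input read from arbitrary files may need a guard.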

How does html decoding work?

In my app I compare strings. Some strings look the same, but some of them contain a regular space and others contain nbsp, so when I compare them they come out as different. However, they represent the same entity, which causes problems. That's why I want to decode the strings before I compare them. That way nbsp will be converted to a space in both strings and they will be treated as equal in the comparison. So here's what I do:
HttpUtility.HtmlDecode(string1)[0]
HttpUtility.HtmlDecode(string2)[0]
But I still get that string1[0] has ascii code of 160, and string2[0] has ascii code of 32.
Obviously I am not understanding the concept. What am I doing wrong?

You are trying to compare two different characters, no matter how similar they might look to you. HtmlDecode turns the &nbsp; entity into the non-breaking space character (U+00A0); it does not turn it into a regular space (U+0020).
The fact that they have different character codes is enough to make the comparison fail. The easiest thing to do is replace the non-breaking space with a regular space and then compare them.
bool c = html.Replace('\u00A0', ' ').Equals(regular);
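If the inputs may still contain the &nbsp; entity as well as the raw character, decoding and replacing can be combined. This is a sketch only: the class and method names are invented, and it uses System.Net.WebUtility.HtmlDecode rather than HttpUtility so that it does not need a reference to System.Web.

```csharp
using System;
using System.Net;

public static class NbspComparison
{
    // Decodes HTML entities, then maps U+00A0 (non-breaking space) to U+0020.
    private static string Clean(string s) =>
        WebUtility.HtmlDecode(s ?? string.Empty).Replace('\u00A0', ' ');

    // Hypothetical helper: ordinal comparison that treats nbsp and space alike.
    public static bool EqualsIgnoringNbsp(string a, string b) =>
        string.Equals(Clean(a), Clean(b), StringComparison.Ordinal);
}
```

Only the non-breaking space is folded here; other whitespace characters such as tabs or thin spaces still compare as different.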

Word-Counter in some hieroglyphics languages?

Is there any available library for word counting in some hieroglyphic languages (e.g. Chinese, Japanese, Korean...)?
I found that MS Word counts text in these languages effectively. Can I add a reference to the MS Word libraries in my .NET application to implement this function?
Or are there any other solutions to achieve this?

Is there any available library for word-counting of some hieroglyphics language (ex: chinese, japanese, korean...)?
Hieroglyphics? No, they're not. They're logographic characters, and the difference is not so subtle. I'm sure a native speaker could explain this much better than me.
Japanese and Chinese text is made of characters, exactly like Western languages, but one character may be a word too. Moreover, they don't need spaces to separate words, so our character/word distinction can't be made using blanks as delimiters.
What Word does is count words (assuming they're equal to characters), and you can do the same in your code by counting characters (just don't forget it's Unicode, so you can't count bytes). To count real words you need a dictionary (because you can't rely on spaces).
For example these strings:
这是一个示例文本
これは、サンプルのテキストです
These will be counted as 8 characters and 8 words (in Chinese) and 15 characters and 15 words (in Japanese). That's not really accurate (for example, the Japanese sentence is 5 words when transliterated into romaji). Moreover, don't forget that Japanese has more than one alphabet (and one family of them is phonetic).
What's the point? What will you count? Words transliterated into one of the phonetic representations (with Latin characters) we use to represent them? Which one? The word count will be pretty different, and it'll actually count our concept of words (that's why, I suppose, Word counts characters).
That said now try to write this code:
string text = "这是一个示例文本";
MessageBox.Show(text.Length.ToString());
It'll display 8, as Word does (we're counting characters); in bytes (assuming UTF-8 encoding) it's 24. There's no sense in counting spaces here. If you plan to count words in one transliteration you need to use an external library (it's not an easy task to do by yourself), a different one for each language you want to support (it's fairly easy to auto-detect the language, because Japanese very often uses hiragana/katakana characters). Which one? There are a lot of them. I don't know about Chinese, but for Japanese a popular one for transliterating kanji is Kakasi.
Korean is a completely different story: it's an alphabet just like the Latin one, but a character (which should be called a syllable) may be composed of several letters. Again, they don't need spaces, so you can't rely on them for word counting. It's somewhat more complicated because here you may need a dictionary even for character counting (otherwise you'll just count syllables).
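If counting characters is good enough, a slightly more careful version of text.Length might look like the sketch below. It is an assumption about what is wanted, not a description of what Word actually does: it counts text elements rather than Char values and skips whitespace. The names are invented for this example.

```csharp
using System;
using System.Globalization;

public static class CharacterCounter
{
    // Counts text elements that do not start with a whitespace character.
    // Punctuation is counted, as it is by text.Length.
    public static int CountVisibleCharacters(string text)
    {
        if (text == null) throw new ArgumentNullException(nameof(text));

        int count = 0;
        TextElementEnumerator elements = StringInfo.GetTextElementEnumerator(text);
        while (elements.MoveNext())
        {
            string element = elements.GetTextElement();
            if (!char.IsWhiteSpace(element, 0)) count++;
        }
        return count;
    }
}
```

For the two sample strings above this gives 8 and 15, the same as text.Length, because they contain only BMP characters without combining marks. The 24-byte figure can be checked with System.Text.Encoding.UTF8.GetByteCount(text).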

Unicode SMP "character" in C# char

I am trying to determine the implications of character encoding for a software system I am planning, and I found something odd while doing a test.
To my knowledge C# internally uses UTF-16 which (to my knowledge) encompasses every Unicode code point using two 16-bit fields. So I wanted to make some character literals and intentionally chose 𝛃 and 얤, because the former is from the SMP plane and the latter is from the BMP plane. The results are:
char ch1 = '얤'; // No problem
char ch2 = '𝛃'; // Compilation error "Too many characters in character literal"
What's going on?
A corollary of this question is, if I have the string "얤𝛃얤" it is displayed correctly in a MessageBox, however when I convert it to a char[] using ToCharArray I get an array with four elements rather than three. Also the String.Length is reported as four rather than three.
Am I missing something here?

MSDN says that the char type represents a 16-bit Unicode character (thus only characters from the BMP).
If you use a character outside the BMP (in UTF-16: a surrogate pair, 2 x 16 bits), the compiler treats it as two characters.

Your source file may not be saved in UTF-8 (which is recommended when using special characters in the source), so the compiler may actually see a sequence of bytes that confuses it. You can verify that by opening your source file in a hex editor - the byte(s) you'll see in place of your character will likely be different.
If it's not already on, you can turn on that setting in Tools->Options->Documents in Visual Studio (I use 2008) - the option is Save documents as Unicode when data cannot be saved in codepage.
Typically, it's better to specify special characters using a character sequence.
This MSDN article describes how to use \uxxxx sequences to specify the Unicode character code you want. This blog entry has all the various C# escape sequences listed - the reason I'm including it is because it mentions using \xnnn - avoid using this format: it's a variable length version of \u and it can cause issues in some situations (not in yours, though).
The MSDN article points out why the character assignment is no good: the code point for the character in question is > FFFF which is outside the range for the char type.
As for the string part of the question, the answer is that the SMP character is represented as two char values. This SO question includes some code showing how to get the code points out of a string; it involves the use of StringInfo.GetTextElementEnumerator.
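If what you actually want is three elements for "얤𝛃얤" rather than the four that ToCharArray gives, one option is to split on code point boundaries yourself. A minimal sketch, with a made-up helper name, that keeps each surrogate pair together as one string:

```csharp
using System;
using System.Collections.Generic;

public static class CodePointSplitter
{
    // Splits s into strings holding one code point each:
    // one Char for BMP characters, two for a surrogate pair.
    public static List<string> SplitCodePoints(string s)
    {
        if (s == null) throw new ArgumentNullException(nameof(s));

        var result = new List<string>();
        for (int i = 0; i < s.Length; i++)
        {
            if (char.IsSurrogatePair(s, i))
            {
                result.Add(s.Substring(i, 2));
                i++; // the low surrogate has been consumed
            }
            else
            {
                result.Add(s[i].ToString());
            }
        }
        return result;
    }
}
```

Since a char literal cannot hold an SMP character, a string is the natural container for it: either a string literal, a \U escape with eight hex digits, or char.ConvertFromUtf32 with the code point.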

What is the regular expression for the following strings and would the expression change if the number rolled over?

What would be the following regular expressions for the following strings?
56AAA71064D6
56AAA7105A25
Would the regular expression change if the numbers rolled over? What I mean is that the above numbers happen to contain hexadecimal values, and I don't know how the value changes once it reaches F. Using the first one as an example: 56AAA71064D6, if this went up to 56AAA71064F6 and the following one then became 56AAA7106406, this would require a different regular expression, because where a letter was allowed there is now a digit. Does this make the regular expression even more difficult? Suggestions?
A manufacturer is going to enter a range of serial numbers. The problem is that different manufacturers have different formats for serial numbers (some are just numbers, some are alphanumeric, some contain extra characters like dashes, and some contain hexadecimal values, which makes it more difficult because I don't know how they roll over to the next serial number). The rollover issue is the biggest problem because the serial numbers are entered as a range like 5A1B - 6F12, and without knowing how they roll over, it seems to me that storing them in the database is not that easy. I was going to give the user the option to input the pattern (expression) and store that in the database, but if a character or characters change from a digit to a letter or vice versa, then the regular expression is no longer valid for certain serial numbers.
Also, the above example I gave is with just one case. There are multitude of serial numbers that would contain different expressions.

There's no single regular expression which is "the" expression to match both of those strings. Instead, there are infinitely many which will do so. Here are two options at opposite ends of the spectrum:
(56AAA71064D6)|(56AAA7105A25)
.*
The first will only match those two strings. The second will match anything. Both satisfy all the criteria you've given.
Now, if you specify more criteria, then we'd be able to give a more reasonable idea of the regular expression to provide - and that will drive the answers to the other questions. (At the moment, the only answer that makes sense is "It depends on what regex you use.")

I think you could do it this way for 12 characters. This will search for a 12 character phrase where each of the characters must be a capital (A or B or C or D or E or F or 1 or 2 or 3 or 4 or 5 or 6 or 7 or 8 or 9 or 0)
[A-F0-9]{12}
If you're wanting to include the possibility of dashes then do this.
[A-F0-9\-]{12}
Or, if you want to allow for dashes in addition to the 12 characters, do this. Note that it would pick up any 12-15 character item that fits the criteria.
[A-F0-9\-]{12,15}
Or if it's surrounded by spaces (AAAAHHHh...SO is stripping out my spaces!!!)
[A-F0-9\-]{12}
Or if it's surrounded by tabs
\t[A-F0-9\-]{12}\t

This matches a string that contains 12 hex digits:
[0-9A-F]{12}

Assuming these are all 12-digit hexadecimal numbers, which it looks like they are, the following regex should work:
[0-9A-Fa-f]{12}
Here I'm using a character class to say that I want any digit, OR A-F, OR a-f. As a bonus I'm allowing lowercase letters; if you don't want those just get them out of the regex.
As Jon Skeet and others have said, you really didn't provide enough information, so if you don't like this answer please understand that I was doing the best I can with what information you provided.
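In C#, a check along these lines could be wrapped up as follows. This is a sketch under the same assumption (exactly 12 hex digits, either case), with invented names. The pattern is anchored with \A and \z so that the whole input has to match, not just a 12-character stretch inside a longer string.

```csharp
using System.Text.RegularExpressions;

public static class SerialFormat
{
    // \A and \z anchor the match to the entire input.
    private static readonly Regex TwelveHexDigits =
        new Regex(@"\A[0-9A-Fa-f]{12}\z", RegexOptions.CultureInvariant);

    // Hypothetical helper: true only for exactly 12 hexadecimal digits.
    public static bool IsTwelveHexDigits(string candidate) =>
        candidate != null && TwelveHexDigits.IsMatch(candidate);
}
```

Both sample serials from the question pass this check, while a string with a G in it or with only 11 characters does not.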

So, how about this:
[0-9A-F]{12}

Well it sounds like you're describing a 12 digit hexadecimal number:
^[A-F0-9]{12}$
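On the rollover part of the question: if the serials really are plain hexadecimal numbers (an assumption, since manufacturers may use other schemes), a regular expression is not needed for the range at all. Incrementing in hex carries like any other base, so DF is followed by E0, and a range check becomes a numeric comparison. A rough sketch with invented names:

```csharp
using System;
using System.Globalization;

public static class HexSerials
{
    // Parses a serial as an unsigned hex number; 12 digits fit easily in a long.
    public static long Parse(string serial) =>
        long.Parse(serial, NumberStyles.AllowHexSpecifier, CultureInfo.InvariantCulture);

    // Returns the next serial, zero-padded to the width of the input.
    public static string Next(string serial) =>
        (Parse(serial) + 1).ToString("X" + serial.Length, CultureInfo.InvariantCulture);

    // True if serial lies between first and last, inclusive.
    public static bool InRange(string serial, string first, string last)
    {
        long value = Parse(serial);
        return Parse(first) <= value && value <= Parse(last);
    }
}
```

For example, HexSerials.Next("56AAA71064D6") is "56AAA71064D7" and HexSerials.Next("56AAA71064DF") is "56AAA71064E0", while HexSerials.InRange("5F00", "5A1B", "6F12") is true. Serials with dashes or non-hex letters would need to be cleaned or handled separately before parsing.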
