I am trying to determine the implications of character encoding for a software system I am planning, and I found something odd while doing a test.
To my knowledge, C# internally uses UTF-16, which (as I understood it) encompasses every Unicode code point in 16-bit code units. So I wanted to make some character literals and intentionally chose 𝛃 and 얤, because the former is from the SMP (Supplementary Multilingual Plane) and the latter from the BMP (Basic Multilingual Plane). The results are:
char ch1 = '얤'; // No problem
char ch2 = '𝛃'; // Compilation error "Too many characters in character literal"
What's going on?
A corollary of this question is, if I have the string "얤𝛃얤" it is displayed correctly in a MessageBox, however when I convert it to a char[] using ToCharArray I get an array with four elements rather than three. Also the String.Length is reported as four rather than three.
Am I missing something here?
MSDN says that the char type represents a single UTF-16 code unit, a 16-bit value (thus only characters from the BMP).
If you use a character outside the BMP (encoded in UTF-16 as a surrogate pair, i.e. 2×16 bits), the compiler treats it as two characters.
Your source file may not be saved in UTF-8 (which is recommended when using special characters in the source), so the compiler may actually see a sequence of bytes that confuses it. You can verify that by opening your source file in a hex editor - the byte(s) you'll see in place of your character will likely be different.
If it's not already on, you can turn on that setting in Tools->Options->Documents in Visual Studio (I use 2008) - the option is Save documents as Unicode when data cannot be saved in codepage.
Typically, it's better to specify special characters using an escape sequence.
This MSDN article describes how to use \uxxxx sequences to specify the Unicode character code you want. This blog entry has all the various C# escape sequences listed - the reason I'm including it is because it mentions using \xnnn - avoid using this format: it's a variable length version of \u and it can cause issues in some situations (not in yours, though).
The MSDN article points out why the character assignment is no good: the code point for the character in question is above 0xFFFF, which is outside the range of the char type.
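For instance (a quick sketch; U+1D6C3 is my reading of the code point for 𝛃 in the question, so double-check it against your actual character):
char ch1 = '얤';                  // BMP character: fits in a single char
string beta = "\U0001D6C3";       // 𝛃 is above U+FFFF, so it has to live in a string
string betaToo = "\uD835\uDEC3";  // the same character spelled out as its surrogate pair
Console.WriteLine(beta.Length);   // 2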
As for the string part of the question, the answer is that the SMP character is represented as two char values. This SO question includes some code showing how to get the code points out of a string; it involves the use of StringInfo.GetTextElementEnumerator.
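To make that concrete, here is a small sketch (using System.Globalization; the surrogate values shown are my own computation for 𝛃) of what the asker's string looks like to .NET:
string s = "얤𝛃얤";
Console.WriteLine(s.Length);      // 4: 𝛃 takes two UTF-16 code units
char[] chars = s.ToCharArray();   // ['얤', '\uD835', '\uDEC3', '얤']
var enumerator = StringInfo.GetTextElementEnumerator(s);
while (enumerator.MoveNext())
    Console.WriteLine(enumerator.GetTextElement()); // 얤, 𝛃, 얤 - three text elements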
Related
I am writing a text-processing Windows app in C#. The app processes many plain text files to count characters, words, etc. To do this, the app iterates over the characters in each file. I am finding that some text files represent accented letters such as á by using the Unicode character U+00E1 (small letter A with acute) while other use a simple unaccented a (U+0061, small letter A) followed by a U+0301 (combining acute accent). There's no visual difference in how the text is rendered on screen by Notepad or other editors I've used, but the underlying character stream is obviously different.
I would like to detect and treat these two situations in the same way. In other words, I would like my app to combine a letter followed by a combining codepoint into the equivalent self-contained character. For example, I'd like to combine the sequence U+0061 U+0301 into U+00E1. As far as I know, there is no simple algorithm to do this, other than a large and error-prone lookup table for all the possible combinations of plain letters and combining characters.
Is there a simpler and more direct algorithm to perform this combination?
You're referring to Unicode normalization forms. That page goes into some interesting detail, but the gist is that representing e.g. accented letters as a single codepoint (e.g. á as U+00E1) is Normalization Form C, or NFC, and as separate codepoints (e.g. á as U+0061 U+0301) is NFD.
Section 3.11 of the Unicode specification goes into the gory details of how to implement it, with some extra details here.
Luckily, you don't need to implement this yourself: string.Normalize() already exists.
"\u00E1".Normalize(NormalizationForm.FormD); // \u0061\u0301
"\u0061\u0301".Normalize(NormalizationForm.FormC); // \u00E1
That said, we've only just scratched the surface of what a "character" is. A good illustration of this uses emoji, but it applies to various scripts as well: there are modern scripts where normal characters are composed of two codepoints, and there is no single combined codepoint available. This pops up in e.g. Tamil and Thai, as well as some eastern European languages (IIRC).
My favourite example is 👩🏽🚒, or "Woman Firefighter: Medium Skin Tone". Want to guess how that's encoded? That's right, 4 different code points: U+1F469 U+1F3FD U+200D U+1F692.
U+1F469 is 👩, the Woman emoji.
U+1F3FD is "Emoji Modifier Fitzpatrick Type-4", which modifies the previous emoji to give a brown skin tone 👩🏽, rendered as 🏽 when it appears on its own.
U+200D is a "zero-width joiner", which is used to glue codepoints together into the same character.
U+1F692 is 🚒, the Fire Engine emoji.
So you take a woman, add a brown skin tone, glue her to a fire engine, and you get a woman brown-skinned firefighter.
(Just for fun, try pasting 👩🏽🚒 into various editors and then using backspace on it. If it's rendered properly, some editors turn it into 👩🏽 and then 👩 and then delete it, while others skip various parts. However, you select it as a single character. This mirrors how editing complex characters works in some scripts).
(Another fun nugget are the flag emoji. Unicode defines "Regional Indicator Symbol Letters A-Z" (U+1F1E6 through U+1F1FF), and a flag is encoded as the country's ISO 3166-1 alpha-2 letter country code using these indicator symbols. So 🇺🇸 is 🇺 followed by 🇸. Paste the 🇸 after the 🇺 and a flag appears!)
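As a sketch of that flag encoding (U+1F1FA and U+1F1F8 are the regional indicator symbols for U and S):
string flag = char.ConvertFromUtf32(0x1F1FA) + char.ConvertFromUtf32(0x1F1F8); // 🇺 + 🇸
Console.WriteLine(flag);        // 🇺🇸, if the font supports it
Console.WriteLine(flag.Length); // 4: each regional indicator is a surrogate pair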
Of course, if you're iterating over this codepoint-by-codepoint you're going to visit U+1F469 U+1F3FD U+200D U+1F692 individually, which probably isn't what you want.
If you're iterating this char-by-char you're going to do even worse due to surrogate pairs: those codepoints such as U+1F469 are simply too large to represent using a single 16-bit char, so we need to use two of them. This means that if you try to iterate over U+1F469, you'll actually find you've got two chars: 0xD83D (the high surrogate) and 0xDC69 (the low surrogate).
Instead, we need to introduce extended grapheme clusters, which represent what you'd traditionally think of as a single character. Again there's a bunch of complexity if you want to do this yourself, and again someone's helpfully done it for you: StringInfo.GetTextElementEnumerator. Note that this was a bit buggy pre-.NET 5, and didn't properly handle all EGCs.
In .NET 5, however:
// Number of chars, as 3 of the codepoints need to use surrogate pairs when
// encoded with UTF-16
"👩🏽🚒".Length; // 7
// Number of Unicode codepoints
"👩🏽🚒".EnumerateRunes().Count(); // 4
// Number of extended grapheme clusters
GetTextElements("👩🏽🚒").Count(); // 1
public static IEnumerable<string> GetTextElements(string s)
{
    TextElementEnumerator charEnum = StringInfo.GetTextElementEnumerator(s);
    while (charEnum.MoveNext())
    {
        yield return charEnum.GetTextElement();
    }
}
I've used emoji as an understandable example here, but these issues also crop up in modern scripts, and people working with text need to be aware of them.
I have string that contains an odd Unicode space character, but I'm not sure what character that is. I understand that in C# a string in memory is encoded using the UTF-16 format. What is a good way to determine which Unicode characters make up the string?
This question was marked as a possible duplicate of
Determine a string's encoding in C#
It's not a duplicate of this question because I'm not asking about what the encoding is. I already know that a string in C# is encoded as UTF-16. I'm just asking for an easy way to determine what the Unicode values are in the string.
The BMP covers code points 0x0000-0xFFFF, each of which fits in a single 16-bit code unit, so there's a good bit of coverage there. Characters from the Chinese, Thai, even Mongolian alphabets are there, so if you're not an encoding expert, you might be forgiven if your code only handles BMP characters. But all the same, characters like the one shown here http://www.fileformat.info/info/unicode/char/10330/index.htm won't be correctly handled by code that assumes every character fits into two bytes.
Unicode seems to identify characters as numeric code points. Not all code points actually refer to characters, however, because Unicode has the concept of combining characters (which I don’t know much about). However, each Unicode string, even some invalid ones (e.g., illegal sequence of combining characters), can be thought of as a list of code points (numbers).
In the UTF-16 encoding, each code point is encoded as a 2 or 4 byte sequence. In .net, Char might roughly correspond to either a 2 byte UTF-16 sequence or half of a 4 byte UTF-16 sequence. When Char contains half of a 4 byte sequence, it is considered a “surrogate” because it only has meaning when combined with another Char which it must be kept with. To get started with inspecting your .net string, you can get .net to tell you the code points contained in the string, automatically combining surrogate pairs together if necessary. .net provides Char.ConvertToUtf32 which is described the following way:
Converts the value of a UTF-16 encoded character or surrogate pair at a specified position in a string into a Unicode code point.
The documentation for Char.ConvertToUtf32(String s, Int32 index) states that an ArgumentException is thrown for the following case:
The specified index position contains a surrogate pair, and either the first character in the pair is not a valid high surrogate or the second character in the pair is not a valid low surrogate.
Thus, you can go character by character in a string and find all of the Unicode code points with the help of Char.IsHighSurrogate() and Char.ConvertToUtf32(). When you don’t encounter a high surrogate, the current character fits in one Char and you only need to advance one Char in your string. If you do encounter a high surrogate, the character requires two Char and you need to advance by two:
static IEnumerable<int> GetCodePoints(string s)
{
    for (var i = 0; i < s.Length; i += char.IsHighSurrogate(s[i]) ? 2 : 1)
    {
        yield return char.ConvertToUtf32(s, i);
    }
}
When you say “from a UTF-16 String”, that might imply that you have read in a series of bytes formatted as UTF-16. If that is the case, you would need to convert that to a .net string before passing to the above method:
GetCodePoints(Encoding.Unicode.GetString(myUtf16Blob)); // Encoding.Unicode is .NET's name for UTF-16
Another note: depending on how you build your String instance, it is possible that it contains an illegal sequence of Char with regards to surrogate pairs. For such strings, Char.ConvertToUtf32() will throw an exception when encountered. However, I think that Encoding.GetString() will always either return a valid string or throw an exception. So, generally, as long as your String instances are from “good” sources, you needn’t worry about Char.ConvertToUtf32() throwing (unless you pass in random values for the index offset because your offset might be in the middle of a surrogate pair).
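If you cannot guarantee well-formed input, one defensive variant (my own sketch, not part of the answer above) checks the surrogates itself and yields U+FFFD for unpaired ones instead of throwing:
static IEnumerable<int> GetCodePointsLenient(string s)
{
    for (var i = 0; i < s.Length; i++)
    {
        if (char.IsHighSurrogate(s[i]) && i + 1 < s.Length && char.IsLowSurrogate(s[i + 1]))
        {
            yield return char.ConvertToUtf32(s[i], s[i + 1]);
            i++; // the low surrogate was consumed as well
        }
        else if (char.IsSurrogate(s[i]))
        {
            yield return 0xFFFD; // unpaired surrogate: substitute the replacement character
        }
        else
        {
            yield return s[i];
        }
    }
}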
I saw this post on Jon Skeet's blog where he talks about string reversing. I wanted to try the example he showed myself, but it seems to work... which leads me to believe that I have no idea how to create a string that contains a surrogate pair which will actually cause the string reversal to fail. How does one actually go about creating a string with a surrogate pair in it so that I can see the failure myself?
The simplest way is to use \U######## where the U is capital, and the # denote exactly eight hexadecimal digits. If the value exceeds 0000FFFF hexadecimal, a surrogate pair will be needed:
string myString = "In the game of mahjong \U0001F01C denotes the Four of circles";
You can check myString.Length to see that the one Unicode character occupies two .NET Char values. Note that the char type has a couple of static methods that will help you determine if a char is a part of a surrogate pair.
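For example, a quick sketch of those checks on myString from above (0xD83C/0xDC1C are the UTF-16 surrogate halves of U+1F01C):
int i = myString.IndexOf('\uD83C');  // position of the high surrogate
Console.WriteLine(char.IsHighSurrogate(myString[i]));                  // True
Console.WriteLine(char.IsLowSurrogate(myString[i + 1]));               // True
Console.WriteLine(char.IsSurrogatePair(myString[i], myString[i + 1])); // True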
If you use a .NET language that does not have something like the \U######## escape sequence, you can use the method ConvertFromUtf32, for example:
string fourCircles = char.ConvertFromUtf32(0x1F01C);
Addition: If your C# source file has an encoding that allows all Unicode characters, like UTF-8, you can just put the character directly in the file (by copy-paste). For example:
string myString = "In the game of mahjong 🀜 denotes the Four of circles";
The character is UTF-8 encoded in the source file (in my example) but will be UTF-16 encoded (surrogate pairs) when the application runs and the string is in memory.
(Not sure if Stack Overflow software handles my mahjong character correctly. Try clicking "edit" to this answer and copy-paste from the text there, if the "funny" character is not here.)
The term "surrogate pair" refers to a means of encoding Unicode characters with high code-points in the UTF-16 encoding scheme (see this page for more information);
In the Unicode character encoding, characters are mapped to values between 0x000000 and 0x10FFFF. Internally, a UTF-16 encoding scheme is used to store strings of Unicode text in which two-byte (16-bit) code sequences are considered. Since two bytes can only contain the range of characters from 0x0000 to 0xFFFF, some additional complexity is used to store values above this range (0x010000 to 0x10FFFF).
This is done using pairs of code points known as surrogates. The surrogate characters are classified in two distinct ranges known as low surrogates and high surrogates, depending on whether they are allowed at the start or the end of the two-code sequence.
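A sketch of that arithmetic (my own illustration of the scheme, using the same code point as the snippet below):
int codePoint = 0x2A601;                  // a CJK Extension B ideograph, beyond the BMP
int v = codePoint - 0x10000;              // 20-bit value 0x1A601
char high = (char)(0xD800 + (v >> 10));   // 0xD869: high surrogate
char low = (char)(0xDC00 + (v & 0x3FF));  // 0xDE01: low surrogate
Console.WriteLine(new string(new[] { high, low }) == char.ConvertFromUtf32(codePoint)); // True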
Try this yourself:
String surrogate = "abc" + Char.ConvertFromUtf32(Int32.Parse("2A601", NumberStyles.HexNumber)) + "def";
Char[] surrogateArray = surrogate.ToCharArray();
Array.Reverse(surrogateArray);
String surrogateReversed = new String(surrogateArray);
or this, if you want to stick with the blog example:
String surrogate = "Les Mise" + Char.ConvertFromUtf32(Int32.Parse("0301", NumberStyles.HexNumber)) + "rables";
Char[] surrogateArray = surrogate.ToCharArray();
Array.Reverse(surrogateArray);
String surrogateReversed = new String(surrogateArray);
and then check the string values with the debugger. Jon Skeet is damn right... strings and dates seem easy but they are absolutely NOT.
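Not part of the answer above, but if you then want a reversal that survives these cases, one sketch is to reverse text elements (grapheme clusters) via StringInfo rather than individual chars:
static string ReverseByTextElements(string s)
{
    var elements = new List<string>();
    var enumerator = StringInfo.GetTextElementEnumerator(s);
    while (enumerator.MoveNext())
        elements.Add(enumerator.GetTextElement());
    elements.Reverse();
    return string.Concat(elements);
}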
I'm looking for a way to count special characters that are formed by more than one char, but I have found no solution online!
For example, I want to count the string "வாழைப்பழம". It actually consists of 6 Tamil characters, but it comes out as 9 characters when we use the normal way to find the length. I am wondering whether Tamil is the only script that causes this problem, and whether there is a solution to it. I'm currently trying to find a solution in C#.
Thank you in advance =)
Use StringInfo.LengthInTextElements:
var text = "வாழைப்பழம";
Console.WriteLine(text.Length); // 9
Console.WriteLine(new StringInfo(text).LengthInTextElements); // 6
The explanation for this behaviour can be found in the documentation of String.Length:
The Length property returns the number of Char objects in this instance, not the number of Unicode characters. The reason is that a Unicode character might be represented by more than one Char. Use the System.Globalization.StringInfo class to work with each Unicode character instead of each Char.
A minor nitpick: strings in .NET use UTF-16, not UTF-8
When you're talking about the length of a string, there are several different things you could mean:
Length in bytes. This is the old C way of looking at things, usually.
Length in Unicode code points. This gets you closer to modern times and should be how string lengths are treated, except it isn't.
Length in UTF-8/UTF-16 code units. This is the most common interpretation, deriving from 1. Certain characters take more than one code unit in those encodings which complicates things if you don't expect it.
Count of visible “characters” (graphemes). This is usually what people mean when they say characters or length of a string.
In your case your confusion stems from the difference between 4. and 3. 3. is what C# uses, 4. is what you expect. Complex scripts such as Tamil use ligatures and diacritics. Ligatures are contractions of two or more adjacent characters into a single glyph – in your case ழை is a ligature of ழ and ை – the latter of which changes the appearance of the former; வா is also such a ligature. Diacritics are ornaments around a letter, e.g. the accent in à or the dot above ப்.
The two cases I mentioned both result in a single grapheme (what you perceive as a single character), yet they both need two actual characters each. So you end up with three code points more in the string.
One thing to note: For your case the distinction between 2. and 3. is irrelevant, but generally you should keep it in mind.
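To make the four senses concrete, here is a sketch (mine, not the answerer's; run it on .NET 5 or later) computing each of them for the asker's string:
var text = "வாழைப்பழம";
Console.WriteLine(Encoding.UTF8.GetByteCount(text));          // 1. bytes in UTF-8: 27
Console.WriteLine(text.EnumerateRunes().Count());             // 2. Unicode code points: 9
Console.WriteLine(text.Length);                               // 3. UTF-16 code units: 9
Console.WriteLine(new StringInfo(text).LengthInTextElements); // 4. graphemes: 6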
I read Wikipedia, but I do not understand whether extended ASCII is still just ASCII, and whether it is available on any computer that would run my console application.
Also, if I understand it correctly, I can write an ASCII char only by using its Unicode code in VB or C#.
Thank you
ASCII only covers the characters with value 0-127, and those are the same on all computers. (Well, almost, although this is mostly a matter of glyphs rather than semantics.)
Extended ASCII is a term for various single-byte code pages that assign various characters to the range 128-255. There is no single "extended ASCII" set of characters.
In C# and VB.NET, all strings are Unicode, so by default there's no need to worry about this - whether or not a character can be displayed in a console app is a matter of the fonts being used, not the limitation of any specific single-byte codepage.
As others have said, true ASCII is always the lower 7 bits of each byte. Before the advent (and ubiquity) of Unicode standards, various extensions to the ASCII character set that utilized the eighth bit were released. The most common in the Windows world is Windows code page 1252.
If you're looking to use this encoding in .NET, you can get it like this:
Encoding windows1252 = Encoding.GetEncoding("windows-1252");
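A quick sketch of using it (note: on .NET Core and .NET 5+ the code-pages provider has to be registered first, via the System.Text.Encoding.CodePages package; on .NET Framework it is built in):
Encoding.RegisterProvider(CodePagesEncodingProvider.Instance); // only needed on .NET Core / .NET 5+
Encoding windows1252 = Encoding.GetEncoding("windows-1252");
byte[] bytes = windows1252.GetBytes("café");  // é becomes the single byte 0xE9
Console.WriteLine(bytes.Length);              // 4
Console.WriteLine(bytes[3].ToString("X2"));   // E9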
As Wikipedia says, ASCII is only 0-127. "Extended ASCII" is a misnomer, should be avoided, and used to loosely mean "some other character set based on ASCII which only uses single bytes" (meaning not multibyte like UTF-8). Sometimes the term means the 128-255 codepoints of that specific character set—but again, it's vague and you shouldn't count on it meaning anything specific.
The use of the term is sometimes criticized, because it can be mistakenly interpreted that the ASCII standard has been updated to include more than 128 characters or that the term unambiguously identifies a single encoding, both of which are untrue.
Source: http://en.wikipedia.org/wiki/Extended_ASCII