I came across this line of code today:
int c = (int)'c';
I was not aware you could cast a char to an int. So I tested it out, and found that a=97, b=98, c=99, d=100 etc etc...
Why is 'a' 97? What do those numbers relate to?
Everyone else (so far) has referred to ASCII. That's a very limited view: it works for 'a', but it doesn't work for anything with an accent, for example - and such characters can very easily be represented by a char.
A char is just an unsigned 16-bit integer, which is a UTF-16 code unit. Usually that's equivalent to a Unicode character, but not always - sometimes multiple code units are required for a single full character. See the documentation for System.Char for more details.
The implicit conversion from char to int (you don't need the cast in your code) just converts that 16-bit unsigned integer to a 32-bit signed integer in the natural, non-lossy way - just as if you had a ushort.
Note that every valid character in ASCII has the same value in UTF-16, which is why the two are often confused when the examples are only ones from the ASCII set.
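For instance, here is a minimal sketch of that implicit conversion (the accented character is just an illustration):
char a = 'a';
char e = 'é';               // U+00E9 - representable by char, but not ASCII
int ia = a;                 // 97, the same as the ASCII value
int ie = e;                 // 233, the UTF-16 code unit value
Console.WriteLine($"{ia} {ie}");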
97 is the UTF-16 code unit value of the letter 'a'.
Basically, the number is the UTF-16 code unit of the given character.
These are the ASCII values representing the characters.
They are the decimal representations of their ASCII counterparts:
http://www.asciitable.com/index/asciifull.gif
so 'a' would be 97.
They are character codes, commonly known as ASCII values.
Technically, though, the character codes are not ASCII.
Related
I have a routine that, given an int, returns the equivalent Unicode character. For some values, though, it doesn't return a character, but (I presume) its hex value.
For example:
17664 ---> '䔀' // CORRECT!
BUT
56384 ---> '\udc40' // WRONG!!!
Why is that?
The character with code 0xdc40 is a Low Surrogate.
That means that it is one half of a supplementary character, which UTF-16 represents as a pair of 16-bit code units (a high surrogate followed by a low surrogate), and thus on its own it does not correspond to an actual character.
That's why the output is showing '\udc40' rather than a single character.
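As a quick check (assuming the routine simply casts the int to a char), char.IsLowSurrogate and char.ConvertFromUtf32 show the difference:
int value = 56384;                                     // 0xDC40
Console.WriteLine(char.IsLowSurrogate((char)value));   // True - half a pair, not a character on its own
// A code point above 0xFFFF has to go through ConvertFromUtf32 instead:
Console.WriteLine(char.ConvertFromUtf32(0x13000));     // prints the surrogate pair as a single character (if the font supports it)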
I have string that contains an odd Unicode space character, but I'm not sure what character that is. I understand that in C# a string in memory is encoded using the UTF-16 format. What is a good way to determine which Unicode characters make up the string?
This question was marked as a possible duplicate to
Determine a string's encoding in C#
It's not a duplicate of this question because I'm not asking about what the encoding is. I already know that a string in C# is encoded as UTF-16. I'm just asking for an easy way to determine what the Unicode values are in the string.
The BMP characters are up to 2 bytes in length (values 0x0000-0xffff), so there's a good bit of coverage there. Characters from the Chinese, Thai, and even Mongolian alphabets are there, so if you're not an encoding expert, you might be forgiven if your code only handles BMP characters. But all the same, characters like the one shown at http://www.fileformat.info/info/unicode/char/10330/index.htm won't be correctly handled by code that assumes every character fits into two bytes.
Unicode seems to identify characters as numeric code points. Not all code points actually refer to characters, however, because Unicode has the concept of combining characters (which I don’t know much about). However, each Unicode string, even some invalid ones (e.g., illegal sequence of combining characters), can be thought of as a list of code points (numbers).
In the UTF-16 encoding, each code point is encoded as a 2 or 4 byte sequence. In .net, Char might roughly correspond to either a 2 byte UTF-16 sequence or half of a 4 byte UTF-16 sequence. When Char contains half of a 4 byte sequence, it is considered a “surrogate” because it only has meaning when combined with another Char which it must be kept with. To get started with inspecting your .net string, you can get .net to tell you the code points contained in the string, automatically combining surrogate pairs together if necessary. .net provides Char.ConvertToUtf32 which is described the following way:
Converts the value of a UTF-16 encoded character or surrogate pair at a specified position in a string into a Unicode code point.
The documentation for Char.ConvertToUtf32(String s, Int32 index) states that an ArgumentException is thrown for the following case:
The specified index position contains a surrogate pair, and either the first character in the pair is not a valid high surrogate or the second character in the pair is not a valid low surrogate.
Thus, you can go character by character in a string and find all of the Unicode code points with the help of Char.IsHighSurrogate() and Char.ConvertToUtf32(). When you don’t encounter a high surrogate, the current character fits in one Char and you only need to advance one Char in your string. If you do encounter a high surrogate, the character requires two Char and you need to advance by two:
static IEnumerable<int> GetCodePoints(string s)
{
for (var i = 0; i < s.Length; i += char.IsHighSurrogate(s[i]) ? 2 : 1)
{
yield return char.ConvertToUtf32(s, i);
}
}
When you say “from a UTF-16 String”, that might imply that you have read in a series of bytes formatted as UTF-16. If that is the case, you would need to convert that to a .net string before passing to the above method:
GetCodePoints(Encoding.Unicode.GetString(myUtf16Blob));
Another note: depending on how you build your String instance, it is possible that it contains an illegal sequence of Char with regards to surrogate pairs. For such strings, Char.ConvertToUtf32() will throw an exception when it reaches the bad sequence. However, I think that Encoding.GetString() will always either return a valid string or throw an exception. So, generally, as long as your String instances are from "good" sources, you needn't worry about Char.ConvertToUtf32() throwing (unless you pass in random values for the index offset, because your offset might land in the middle of a surrogate pair).
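For example, to track down the odd space character from the question, you could print each code point in U+ notation; the string literal below is just an illustration containing a no-break space and a figure space:
string s = "a\u00A0b\u2007c";
foreach (int cp in GetCodePoints(s))
{
    Console.WriteLine("U+{0:X4}", cp);   // U+0061, U+00A0, U+0062, U+2007, U+0063
}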
In C# I need to get the ASCII code of some characters.
So I convert the char to byte or int, then print the result.
String sample="A";
int AsciiInt = sample[0];
byte AsciiByte = (byte)sample[0];
For characters with ASCII code 128 and less, I get the right answer.
But for characters greater than 128 I get irrelevant answers!
I am sure all characters are less than 0xFF.
Also, I have tested System.Text.Encoding and got the same results.
For example: I get 172 for a char whose actual byte value is 129!
Actually, ASCII characters like ƒ, ‡, ‹, “, ¥, ©, Ï, ³, ·, ½, », Á each take 1 byte and go up to more than 193.
I guess there is a Unicode equivalent for them, and .NET returns that because it interprets strings as Unicode!
What if someone needs to access the actual value of a byte, whether or not it is a valid, known ASCII character?
But for characters greater than 128 I get irrelevant answers
No you don't. You get the bottom 8 bits of the UTF-16 code unit corresponding to the char.
Now if your text were all ASCII, that would be fine - because ASCII only goes up to 127 anyway. It sounds like you're actually expecting the representation in some other encoding - so you need to work out which encoding that is, at which point you can use:
Encoding encoding = ...;
byte[] bytes = encoding.GetBytes(sample);
// Now extract the bytes you want. Note that a character may be represented by more than
// one byte.
If you're essentially looking for an encoding which treats bytes 0 to 255 as U+0000 to U+00FF respectively, you should use ISO-8859-1, which you can access using Encoding.GetEncoding(28591).
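For instance, a small sketch with that code page (the sample character is just an example from the question's list):
string sample = "Á";                                        // U+00C1
byte[] isoBytes = Encoding.GetEncoding(28591).GetBytes(sample);
Console.WriteLine(isoBytes[0]);                             // 193 - the byte value matches the code point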
You can't just ignore the issue of encoding. There is no inherent mapping between bytes and characters - that's defined by the encoding.
If I use your example of 131, on my system, this produces â. However, since you're obviously on an arabic system, you most likely have Windows-1256 encoding, which produces ƒ for 131.
In other words, you need to use the correct encoding when converting characters to bytes and vice versa. In your case:
var sample = "ƒ";
var byteValue = Encoding.GetEncoding("windows-1256").GetBytes(sample)[0];
Which produces 131, as you seem to expect. Most importantly, this will work on all computers - if you want to have this system locale-specific, Encoding.Default can also work for you.
The only reason your method seems to work for values under 128 is that the first 128 Unicode code points match the ASCII standard mapping. However, you're misusing the term ASCII - it really only refers to those 7-bit characters. What you're calling ASCII is actually an extended 8-bit charset, and all characters with the eighth bit set are charset-dependent.
We're no longer in a world in which you can assume your application will only run on computers with the same locale as yours - .NET is designed for this, which is why all strings are Unicode. At the very least, read http://www.joelonsoftware.com/articles/Unicode.html for an explanation of how encodings work, and to clear up some of the serious and dangerous misconceptions you seem to have.
Why do we get the extra numeric values shown within the char array in C#? They look like equivalents of the chars, but I'm just wondering: are there any advantages to having this within the char array, and if so, where do we use this feature?
That is the integer representation of the char within the ASCII table
They're the ASCII values - the numerical values that the characters represent.
http://www.asciitable.com/
This is purely the watch window showing you both the char's numeric value and the actual char value.
So, from the ASCII table you can see that 48 is '0' and 49 is '1'.
Characters in C# are Unicode, and there are plenty of them. And there are plenty of them that look similar or downright identical or even unreadable. In such cases you need these numbers.
As per the C# Reference, each char from a char array is displayed along with its decimal value.
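A quick way to see the same pairing outside the watch window (purely illustrative):
char[] chars = { '0', '1', 'a' };
foreach (char ch in chars)
{
    Console.WriteLine($"{ch} = {(int)ch}");   // 0 = 48, 1 = 49, a = 97
}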
How do I get the numeric value of a Unicode character in C#?
For example, if the Tamil character அ (U+0B85) is given, the output should be 2949 (i.e. 0x0B85).
See also
C++: How to get decimal value of a unicode character in c++
Java: How can I get a Unicode character's code?
Multi code-point characters
Some characters require multiple code points. In these examples, encoded in UTF-16, each code unit still lies within the Basic Multilingual Plane:
An 'r' combined with a cedilla and a caron (i.e. U+0072 U+0327 U+030C)
An 'r' combined with a long run of combining marks (i.e. U+0072 U+0338 U+0327 U+0316 U+0317 U+0300 U+0301 U+0302 U+0308 U+0360)
The larger point being that one "character" can require more than 1 UTF-16 code unit - indeed, it can require more than 2, or even more than 3.
In fact, one "character" can require dozens of Unicode code points. In UTF-16 in C#, that means more than 1 char; one character can require 17 chars.
My question was about converting a char into its UTF-16 encoding value. Even if an entire string of 17 chars represents only one "character", I still want to know how to convert each UTF-16 code unit into a numeric value.
e.g.
String s = "அ";
int i = Unicode(s[0]);
Where Unicode returns the integer value, as defined by the Unicode standard, for the first character of the input expression.
It's basically the same as Java. If you've got it as a char, you can just convert to int implicitly:
char c = '\u0b85';
// Implicit conversion: char is basically a 16-bit unsigned integer
int x = c;
Console.WriteLine(x); // Prints 2949
If you've got it as part of a string, just get that single character first:
string text = GetText();
int x = text[2]; // Or whatever...
Note that characters not in the basic multilingual plane will be represented as two UTF-16 code units. There is support in .NET for finding the full Unicode code point, but it's not simple.
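If you do need the full code point for such a character, char.ConvertToUtf32 can combine the surrogate pair; a small sketch, using the U+10330 character linked earlier as an example:
string s = "\U00010330";                       // GOTHIC LETTER AHSA - outside the BMP, two chars in the string
int codePoint = char.ConvertToUtf32(s, 0);
Console.WriteLine(codePoint);                  // 66352
Console.WriteLine(codePoint.ToString("X4"));   // 10330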
((int)'அ').ToString()
If you have the character as a char, you can cast that to an int, which will represent the character's numeric value. You can then print that out in any way you like, just like with any other integer.
If you wanted hexadecimal output instead, you can use:
((int)'அ').ToString("X4")
X is for hexadecimal, 4 is for zero-padding to four characters.
How do I get the numeric value of a Unicode character in C#?
A char is not necessarily the whole Unicode code point. In languages with UTF-16 encoded strings, such as C#, you may actually need 2 chars to represent a single "logical" character. And your string lengths might not be what you expect - the MSDN documentation for the String.Length property says:
"The Length property returns the number of Char objects in this instance, not the number of Unicode characters."
So, if your Unicode character is encoded in just one char, it is already numeric (essentially an unsigned 16-bit integer). You may want to cast it to some of the integer types, but this won't change the actual bits that were originally present in the char.
If your Unicode character is 2 chars, you'll need to multiply one by 2^16 and add it to the other, resulting in a uint numeric value:
char c1 = ...;  // first char of the pair (the high surrogate)
char c2 = ...;  // second char of the pair (the low surrogate)
uint c = ((uint)c1 << 16) | c2;  // packs the two UTF-16 code units into one 32-bit value
How do I get the decimal value of a Unicode character in C#?
When you say "decimal", this usually means a character string containing only characters that a human being would interpret as decimal digits.
If you can represent your Unicode character with only one char, you can convert it to a decimal string simply by:
char c = 'அ';
string s = ((ushort)c).ToString();
If you have 2 chars for your Unicode character, convert them to a uint as described above, then call uint.ToString.
--- EDIT ---
AFAIK diacritical marks are considered separate "characters" (and separate code points) despite being visually rendered together with the "base" character. Each of these code points taken alone is still at most 2 UTF-16 code units.
BTW I think the proper name for what you are talking about is not "character" but "combining character". So yes, a single combining character can have more than 1 code point and therefore more than 2 code units. If you want a decimal representation of such a combining character, you can probably do it most easily through BigInteger:
string c = "\x0072\x0338\x0327\x0316\x0317\x0300\x0301\x0302\x0308\x0360";
string s = (new BigInteger(Encoding.Unicode.GetBytes(c))).ToString();
Depending on which order of significance you want for the code-unit "digits", you may want to reverse c first.
char c = 'அ';
short code = (short)c;    // 2949, but would go negative for code units above 0x7FFF
ushort code2 = (ushort)c; // 2949; ushort covers the full 0x0000-0xFFFF range
This is an example of using Plane 1, the Supplementary Multilingual Plane (SMP):
string single_character = "\U00013000"; //first Egyptian ancient hieroglyph in hex
//it is encoded as 4 bytes (instead of 2)
//get the Unicode code point using UTF-32 (fixed 4-byte encoding)
Encoding enc = new UTF32Encoding(false, true, true);
byte[] b = enc.GetBytes(single_character);
Int32 code = BitConverter.ToInt32(b, 0); //in decimal
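For comparison, char.ConvertToUtf32 should report the same code point directly from the UTF-16 string, without going through the UTF-32 bytes:
int code2 = char.ConvertToUtf32(single_character, 0);
Console.WriteLine(code);    // 77824 (0x13000)
Console.WriteLine(code2);   // 77824 as well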