In C# I need to get the ASCII code of some characters.
So I convert the char to byte or int, then print the result.
string sample = "A";
int AsciiInt = sample[0];
byte AsciiByte = (byte)sample[0];
For characters with ASCII codes of 128 or less, I get the right answer. But for characters above 128 I get irrelevant answers!
I am sure all characters are less than 0xFF.
I have also tested System.Text.Encoding and got the same results.
For example, I get 172 for a char whose actual byte value is 129!
These ASCII characters, like ƒ, ‡, ‹, “, ¥, ©, Ï, ³, ·, ½, », Á, each take one byte and have values going above 193.
I guess there is a Unicode equivalent for them and .NET returns that, because it interprets strings as Unicode.
What if someone needs to access the actual byte value, whether or not it is a valid, known ASCII character?
But for characters above 128 I get irrelevant answers
No you don't. You get the bottom 8 bits of the UTF-16 code unit corresponding to the char.
Now if your text were all ASCII, that would be fine - because ASCII only goes up to 127 anyway. It sounds like you're actually expecting the representation in some other encoding - so you need to work out which encoding that is, at which point you can use:
Encoding encoding = ...;
byte[] bytes = encoding.GetBytes(sample);
// Now extract the bytes you want. Note that a character may be represented by more than
// one byte.
If you're essentially looking for an encoding which treats bytes 0 to 255 respectively as U+0000 to U+00FF respectively, you should use ISO-8859-1, which you can access using Encoding.GetEncoding(28591).
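For example, a minimal sketch, assuming your data really does follow that byte-to-code-point layout (the sample string here is just illustrative):

using System;
using System.Text;

// ISO-8859-1 (code page 28591) maps bytes 0-255 directly to U+0000-U+00FF.
Encoding latin1 = Encoding.GetEncoding(28591);

string sample = "A¥©";
foreach (byte b in latin1.GetBytes(sample))
{
    Console.WriteLine(b); // 65, 165, 169
}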
You can't just ignore the issue of encoding. There is no inherent mapping between bytes and characters - that's defined by the encoding.
If I use your example of 131, on my system this produces â. However, since you're obviously on an Arabic system, you most likely have the Windows-1256 encoding, which produces ƒ for 131.
In other words, you need to use the correct encoding when converting characters to bytes and vice versa. In your case:
var sample = "ƒ";
var byteValue = Encoding.GetEncoding("windows-1256").GetBytes(sample)[0];
This produces 131, as you seem to expect. Most importantly, it will work on all computers; if you want it to be system-locale-specific, Encoding.Default can also work for you.
The only reason your method seems to work for bytes under 128 is that in UTF-8, those characters correspond to the ASCII standard mapping. However, you're misusing the term ASCII - it really only refers to those 7-bit characters. What you're calling ASCII is actually an extended 8-bit charset, and all characters with the 8th bit set are charset-dependent.
We're no longer in a world where you can assume your application will only run on computers with the same locale as yours - .NET is designed for this, which is why all strings are Unicode. At the very least, read http://www.joelonsoftware.com/articles/Unicode.html for an explanation of how encodings work, and to get rid of some of the serious and dangerous misconceptions you seem to have.
Related
According to the Wikipedia article on UTF-16, "...[UTF-16] is also the only web-encoding incompatible with ASCII." (at the end of the abstract.) This statement refers to the HTML Standard. Is this a wrong statement?
I'm mainly a C# / .NET dev, and .NET as well as .NET Core uses UTF-16 internally to represent strings. I'm pretty certain that UTF-16 is a superset of ASCII, as I can easily write code that displays all ASCII characters:
public static void Main()
{
    for (byte currentAsciiCharacter = 0; currentAsciiCharacter < 128; currentAsciiCharacter++)
    {
        Console.WriteLine($"ASCII character {currentAsciiCharacter}: \"{(char)currentAsciiCharacter}\"");
    }
}
Sure, the control characters will mess up the console output, but I think my statement is clear: the lower 7 bits of a 16-bit char hold the corresponding ASCII code point, while the upper 9 bits are zero. Thus UTF-16 should be a superset of ASCII in .NET.
I tried to find out why the HTML Standard says that UTF-16 is incompatible with ASCII, but it seems like they simply define it that way:
An ASCII-compatible encoding is any encoding that is not a UTF-16 encoding.
I couldn't find any explanations why UTF-16 is not compatible in their spec.
My detailed questions are:
Is UTF-16 actually compatible with ASCII? Or did I miss something here?
If it is compatible, why does the HTML Standard say it's not compatible? Maybe because of byte ordering?
ASCII is a 7-bit encoding stored in a single byte. UTF-16 uses 2-byte code units, which makes it incompatible at the byte level right away. UTF-8 uses single-byte code units and matches ASCII for the basic Latin range (the first 128 code points). In other words, UTF-8 is designed to be backward compatible with ASCII encoding.
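A short sketch of that difference for a plain-ASCII string (the sample text is arbitrary):

using System;
using System.Text;

string text = "Hi";

// ASCII and UTF-8 produce identical bytes for code points below 128.
Console.WriteLine(BitConverter.ToString(Encoding.ASCII.GetBytes(text))); // 48-69
Console.WriteLine(BitConverter.ToString(Encoding.UTF8.GetBytes(text)));  // 48-69

// UTF-16 (little endian) uses two bytes per code unit, so an ASCII consumer
// would see interleaved zero bytes.
Console.WriteLine(BitConverter.ToString(Encoding.Unicode.GetBytes(text))); // 48-00-69-00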
I have a string which I convert to a byte array to send to a TCP device.
byte[] loadRegionCommand = System.Text.Encoding.Unicode.GetBytes("$RGNLOAD http://1.1.1.1:9999/region1.txt");
The System.Text.Encoding.Unicode.GetBytes() method is adding a zero after each character, but my device probably doesn't accept Unicode. What can I use instead of System.Text.Encoding.Unicode.GetBytes()?
Thanks for your help.
But the System.Text.Encoding.Unicode.GetBytes() method is adding a zero after each character.
Yes, because Encoding.Unicode is UTF-16, which uses 16 bits (two bytes) per code unit; it is just what you asked for.
Use Encoding.UTF8.GetBytes() or maybe even Encoding.ASCII. That depends on what your device expects.
You are using a UnicodeEncoding. A Unicode (UTF-16) string has a character size of two bytes. Since your input characters are exclusively in the ASCII range, the upper byte will always be zero. The data is stored in little-endian order, so the lower byte is written first. Hence your result.
You can choose a different encoding depending on your input. If all your characters are ASCII, use an ASCIIEncoding. If you must use one byte per character and you have characters outside the ASCII range, use the appropriate code page. Otherwise you can use UTF8Encoding, which will encode all ASCII characters in one byte and all other characters in two or more bytes (up to four).
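A small sketch of the difference, using the command string from the question and assuming the device expects one byte per character:

using System;
using System.Text;

string command = "$RGNLOAD http://1.1.1.1:9999/region1.txt";

// UTF-16 little endian: each ASCII character becomes <byte> 00.
byte[] utf16Bytes = Encoding.Unicode.GetBytes(command);
Console.WriteLine(BitConverter.ToString(utf16Bytes, 0, 8)); // 24-00-52-00-47-00-4E-00

// ASCII (identical to UTF-8 here): one byte per character.
byte[] asciiBytes = Encoding.ASCII.GetBytes(command);
Console.WriteLine(BitConverter.ToString(asciiBytes, 0, 4)); // 24-52-47-4E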
Take the following example:
string testfile1 = Path.Combine(HttpRuntime.AppDomainAppPath, "folder\\" + "test1.txt");
if (!System.IO.File.Exists(testfile1))
{
    System.IO.File.WriteAllText(testfile1, "£100", System.Text.Encoding.ASCII);
}

string testfile2 = Path.Combine(HttpRuntime.AppDomainAppPath, "folder\\" + "test2.txt");
if (!System.IO.File.Exists(testfile2))
{
    System.IO.File.WriteAllText(testfile2, "£100", System.Text.Encoding.UTF8);
}
Note the encoding. The first outputs ?100. The second outputs £100.
I know the encoding is different, but can somebody explain why ASCII encoding can't write the £ sign?
ASCII doesn't include the "£" character. That is - there is no byte value (nor a multiple byte value - they don't exist in ASCII) that denotes that symbol. So it shows you a "?" to tell you that. UTF8, on the other hand, does include it.
See here a list of all of the printable characters in ASCII.
If you must use ASCII, consider using "GBP" as mentioned here for Pound sterling. (Also might be relevant: Extended ASCII.)
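A quick sketch of what happens at the byte level: the ASCII encoder falls back to '?' (byte 63) for any character it can't represent, while UTF-8 spends two bytes on '£':

using System;
using System.Text;

// '£' (U+00A3) is outside the 7-bit ASCII range, so it becomes '?' (0x3F).
Console.WriteLine(BitConverter.ToString(Encoding.ASCII.GetBytes("£100"))); // 3F-31-30-30

// UTF-8 encodes U+00A3 as the two-byte sequence C2 A3.
Console.WriteLine(BitConverter.ToString(Encoding.UTF8.GetBytes("£100")));  // C2-A3-31-30-30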
How characters beyond ASCII are handled depends largely on which code page you're using. £ isn't a character that is required or used universally within the Latin alphabet, so it didn't appear in the standard ASCII set.
Look at this article or this one on code pages to see how the character limitation was resolved and for an idea as to why it won't show up everywhere.
As Hans pointed out, ASCII was designed for Americans using only code points 0-127; the negligible rest of the English-speaking world can live with that unless they try to use obscure symbols like £, whose code points fall outside the range 0-127. I presume you live in the UK and aim only at customers from the UK or Western Europe. Don't use Encoding.ASCII but Encoding.Default, which would be code page 1252 in the UK (not in Turkey, of course). You get real ASCII for every character in the ASCII range 0-127, but can also use characters in the range 128-255, where the pound symbol lives. Note, however, that if someone tries to read the file assuming it is encoded in UTF-8, the £ sign will obscure the content, since its byte is not a valid UTF-8 sequence; it shows up as a weird glyph like �.
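For example, a sketch that requests code page 1252 explicitly rather than relying on Encoding.Default (an illustration only; on .NET Core / .NET 5+ the code page encodings need an extra registration step):

using System;
using System.Text;

// On .NET Core / .NET 5+ only: requires the System.Text.Encoding.CodePages package.
// Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

Encoding windows1252 = Encoding.GetEncoding(1252);
byte[] bytes = windows1252.GetBytes("£100");
Console.WriteLine(BitConverter.ToString(bytes)); // A3-31-30-30, '£' is byte 163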
I came across this line of code today:
int c = (int)'c';
I was not aware you could cast a char to an int. So I tested it out, and found that a=97, b=98, c=99, d=100 etc etc...
Why is 'a' 97? What do those numbers relate to?
Everyone else (so far) has referred to ASCII. That's a very limited view - it works for 'a', but doesn't work for anything with an accent etc - which can very easily be represented by char.
A char is just an unsigned 16-bit integer, which is a UTF-16 code unit. Usually that's equivalent to a Unicode character, but not always - sometimes multiple code units are required for a single full character. See the documentation for System.Char for more details.
The implicit conversion from char to int (you don't need the cast in your code) just converts that 16-bit unsigned integer to a 32-bit signed integer in the natural, non-lossy way - just as if you had a ushort.
Note that every valid character in ASCII has the same value in UTF-16, which is why the two are often confused when the examples are only ones from the ASCII set.
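A small sketch of the implicit conversion, plus a character that needs two char code units (the emoji is just an example):

using System;

char c = 'a';
int code = c;                     // implicit char -> int, no cast needed
Console.WriteLine(code);          // 97 (U+0061)
Console.WriteLine((int)'é');      // 233 (U+00E9), still a single code unit

// Code points above U+FFFF need two UTF-16 code units (a surrogate pair).
string emoji = "\U0001F600";      // 😀
Console.WriteLine(emoji.Length);  // 2
Console.WriteLine((int)emoji[0]); // 55357 (high surrogate)
Console.WriteLine((int)emoji[1]); // 56832 (low surrogate)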
97 is the UTF-16 code unit value of the letter a.
Basically, this number is the UTF-16 code unit of the given character.
These are the ASCII values representing the characters.
They are the decimal representations of their ASCII counterparts:
http://www.asciitable.com/index/asciifull.gif
so 'a' would be 97.
They are character codes, commonly known as ASCII values.
Technically, though, the character codes are not ASCII.
The size of char is 2 (per MSDN):
sizeof(char) // 2
A test:
char[] c = new char[1] {'a'};
Encoding.UTF8.GetByteCount(c) //1 ?
Why is the value 1?
(Of course, if c is a Unicode char like 'ש', it does show 2, as it should.)
Is 'a' not a .NET char?
It's because 'a' only takes one byte to encode in UTF-8.
Encoding.UTF8.GetByteCount(c) will tell you how many bytes it takes to encode the given array of characters in UTF-8. See the documentation for Encoding.GetByteCount for more details. That's entirely separate from how wide the char type is internally in .NET.
Each character with code points less than 128 (i.e. U+0000 to U+007F) takes a single byte to encode in UTF-8.
Other characters take 2, 3 or even 4 bytes in UTF-8. (There are values over U+1FFFFF which would take 5 or 6 bytes to encode, but they're not part of Unicode at the moment, and probably never will be.)
Note that the only characters which take 4 bytes to encode in UTF-8 can't be encoded in a single char anyway. A char is a UTF-16 code unit, and any Unicode code points over U+FFFF require two UTF-16 code units forming a surrogate pair to represent them.
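A sketch of the byte counts for characters of different widths (the specific characters are just examples):

using System;
using System.Text;

Console.WriteLine(Encoding.UTF8.GetByteCount(new[] { 'a' })); // 1 (U+0061)
Console.WriteLine(Encoding.UTF8.GetByteCount(new[] { 'ש' })); // 2 (U+05E9)
Console.WriteLine(Encoding.UTF8.GetByteCount(new[] { '€' })); // 3 (U+20AC)

// A code point above U+FFFF is a surrogate pair of two chars and 4 UTF-8 bytes.
Console.WriteLine(Encoding.UTF8.GetByteCount("\U0001F600"));  // 4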
The reason is that, internally, .NET represents characters as UTF-16, where each character typically occupies 2 bytes. On the other hand, in UTF-8, each character occupies 1 byte if it’s among the first 128 codepoints (which incidentally overlap with ASCII), and 2 or more bytes beyond that.
That's not fair. The page you mention says
The char keyword is used to declare a Unicode character
Then try:
Encoding.Unicode.GetByteCount(c)
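As a quick check (redeclaring the c array from the question), UTF-16 really does use two bytes for 'a' while UTF-8 uses one:

using System;
using System.Text;

char[] c = new char[1] { 'a' };
Console.WriteLine(Encoding.Unicode.GetByteCount(c)); // 2 - two bytes per UTF-16 code unit
Console.WriteLine(Encoding.UTF8.GetByteCount(c));    // 1 - 'a' fits in a single UTF-8 byte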