Hello, I have noticed that when I save a text file using UTF-8 encoding (no BOM), I can read it perfectly using the UTF-16 encoding in C#. This got me a little confused, because UTF-8 only uses 8 bits, right? And UTF-16 takes 16 bits for each character.
Now imagine that I have the string "ab" written in this file as UTF-8; then there is one byte for the letter "a" and another for the "b".
OK, but how is it possible to read this UTF-8 file using the UTF-16 charset? The way I see it, while reading the file, the two bytes of "ab" would be mistakenly combined into a single character containing both bytes, because UTF-16 needs those 2 bytes.
This is how I read it (t.txt is encoded as UTF-8):
using (StreamReader sr = new StreamReader(File.OpenRead("t.txt"), Encoding.GetEncoding("utf-16")))
{
    Console.Write(sr.ReadToEnd());
    Console.ReadKey();
}
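For what it's worth, here is a quick sketch of the byte layout I am describing; Encoding.Unicode (UTF-16 LE) stands in for the "utf-16" charset, and the sample string is just for illustration:

using System;
using System.Text;

class Utf16Misread
{
    static void Main()
    {
        // "ab" saved as UTF-8 (no BOM) is just two bytes: 0x61 0x62.
        byte[] utf8Bytes = Encoding.UTF8.GetBytes("ab");
        Console.WriteLine(BitConverter.ToString(utf8Bytes)); // 61-62

        // Decoding those same two bytes as UTF-16 LE pairs them into a single
        // code unit, 0x6261 - one character instead of the two characters "ab".
        Console.WriteLine(Encoding.Unicode.GetString(utf8Bytes));
    }
}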
Check out http://www.joelonsoftware.com/articles/Unicode.html, it will answer all your Unicode questions.
Take a look at the following article:
http://www.joelonsoftware.com/printerFriendly/articles/Unicode.html
The '8' means it uses 8-bit blocks to represent a character. This does not mean that each character takes a fixed 8 bits. The number of blocks per character varies from 1 to 4 (though characters could theoretically be up to 6 bytes long in the original design).
Try this simple test:
Create a text file (in, say, Notepad++) with "UTF-8 without BOM" encoding.
Read the text file (as you have done in your code) with File.ReadAllBytes(): byte[] utf8 = File.ReadAllBytes(@"E:\SavedUTF8.txt");
Check the number of bytes taken by each character.
Now try the same with a file encoded as ANSI: byte[] ansi = File.ReadAllBytes(@"E:\SavedANSI.txt");
Compare the bytes per character for both encodings (a minimal sketch of this comparison follows the note below).
Note: File.ReadAllBytes() simply returns the raw bytes of the file. It is File.ReadAllText() that attempts to automatically detect the encoding of a file based on the presence of byte order marks; the UTF-8 and UTF-32 formats (both big-endian and little-endian) can be detected.
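A minimal sketch of that comparison, assuming the two files exist at the paths used above:

using System;
using System.IO;
using System.Text;

class ByteCountTest
{
    static void Main()
    {
        // Paths assumed from the steps above; adjust to wherever you saved the files.
        byte[] utf8 = File.ReadAllBytes(@"E:\SavedUTF8.txt");
        byte[] ansi = File.ReadAllBytes(@"E:\SavedANSI.txt");

        Console.WriteLine($"UTF-8 file: {utf8.Length} byte(s)");
        Console.WriteLine($"ANSI file:  {ansi.Length} byte(s)");

        // For the UTF-8 file, compare character count with byte count: non-ASCII
        // characters such as © or € make the byte count exceed the character count.
        string text = Encoding.UTF8.GetString(utf8);
        Console.WriteLine($"UTF-8 file decodes to {text.Length} character(s) from {utf8.Length} byte(s)");
    }
}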
Interesting results
SavedUTF8.txt contains the characters:
a : Number of bytes in the byte array = 1
© (U+00A9) (Alt+0169) : Number of bytes in the byte array = 2
€ (U+20AC, UTF-8 bytes E2 82 AC) : Number of bytes in the byte array = 3
ANSI encoding always takes one byte per character (i.e. in the above sample, the byte array will always be of size 1, irrespective of the character in the file). As pointed out by @tchrist, UTF-16 takes 2 or 4 bytes per character (not a fixed 2 bytes per character).
Encoding table (from the UTF-8 and Unicode FAQ for Unix/Linux linked below)
The following byte sequences are used to represent a character. The sequence to be used depends on the Unicode number of the character:
U-00000000 – U-0000007F: 0xxxxxxx
U-00000080 – U-000007FF: 110xxxxx 10xxxxxx
U-00000800 – U-0000FFFF: 1110xxxx 10xxxxxx 10xxxxxx
U-00010000 – U-001FFFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
U-00200000 – U-03FFFFFF: 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
U-04000000 – U-7FFFFFFF: 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
The xxx bit positions are filled with the bits of the character code number in binary representation. The rightmost x bit is the least-significant bit. Only the shortest possible multibyte sequence which can represent the code number of the character can be used. Note that in multibyte sequences, the number of leading 1 bits in the first byte is identical to the number of bytes in the entire sequence.
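As a worked example (using © from the results above): U+00A9 is 10101001 in binary, and packing those bits into the two-byte pattern 110xxxxx 10xxxxxx gives 11000010 10101001, i.e. the bytes C2 A9. A small sketch that performs this packing by hand and cross-checks it against the framework encoder:

using System;
using System.Text;

class ManualUtf8
{
    static void Main()
    {
        int codePoint = 0x00A9; // © falls in the U+0080..U+07FF range, so it needs two bytes.

        // 110xxxxx carries the top 5 bits, 10xxxxxx carries the low 6 bits.
        byte first  = (byte)(0b1100_0000 | (codePoint >> 6));
        byte second = (byte)(0b1000_0000 | (codePoint & 0b0011_1111));

        Console.WriteLine($"{first:X2} {second:X2}");                               // C2 A9
        Console.WriteLine(BitConverter.ToString(Encoding.UTF8.GetBytes("\u00A9"))); // C2-A9
    }
}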
Determining the size of a character
The first byte of a multibyte sequence that represents a non-ASCII character is always in the range 0xC0 to 0xFD and it indicates how many bytes follow for this character.
This means that the leading bits for a 2 byte character (110) are different than the leading bits of a 3 byte character (1110). These leading bits can be used to uniquely identify the number of bytes a character takes.
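Those leading bits can be checked directly. A small sketch, where the byte values passed in are just illustrative lead bytes:

using System;

class Utf8SequenceLength
{
    // Returns how many bytes a UTF-8 sequence occupies, judging from its first byte.
    static int SequenceLength(byte first)
    {
        if ((first & 0b1000_0000) == 0b0000_0000) return 1; // 0xxxxxxx - ASCII
        if ((first & 0b1110_0000) == 0b1100_0000) return 2; // 110xxxxx
        if ((first & 0b1111_0000) == 0b1110_0000) return 3; // 1110xxxx
        if ((first & 0b1111_1000) == 0b1111_0000) return 4; // 11110xxx
        throw new ArgumentException("Not a valid UTF-8 lead byte", nameof(first));
    }

    static void Main()
    {
        Console.WriteLine(SequenceLength(0x61)); // 'a' -> 1
        Console.WriteLine(SequenceLength(0xC2)); // '©' -> 2 (C2 A9)
        Console.WriteLine(SequenceLength(0xE2)); // '€' -> 3 (E2 82 AC)
    }
}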
More information
UTF-8 Encoding
UTF-8, UTF-16, UTF-32 & BOM
UTF-8 and Unicode FAQ for Unix/Linux
Related
According to the Wikipedia article on UTF-16, "...[UTF-16] is also the only web-encoding incompatible with ASCII." (at the end of the abstract.) This statement refers to the HTML Standard. Is this a wrong statement?
I'm mainly a C# / .NET dev, and .NET as well as .NET Core uses UTF-16 internally to represent strings. I'm pretty certain that UTF-16 is a superset of ASCII, as I can easily write code that displays all ASCII characters:
public static void Main()
{
    for (byte currentAsciiCharacter = 0; currentAsciiCharacter < 128; currentAsciiCharacter++)
    {
        Console.WriteLine($"ASCII character {currentAsciiCharacter}: \"{(char) currentAsciiCharacter}\"");
    }
}
Sure, the control characters will mess up the console output, but I think my statement is clear: the lower 7 bits of a 16 bit char take the corresponding ASCII code point, while the upper 9 bits are zero. Thus UTF-16 should be a superset of ASCII in .NET.
I tried to find out why the HTML Standard says that UTF-16 is incompatible with ASCII, but it seems like they simply define it that way:
An ASCII-compatible encoding is any encoding that is not a UTF-16 encoding.
I couldn't find any explanations why UTF-16 is not compatible in their spec.
My detailed questions are:
Is UTF-16 actually compatible with ASCII? Or did I miss something here?
If it is compatible, why does the HTML Standard say it's not compatible? Maybe because of byte ordering?
ASCII is a 7-bit encoding stored in a single byte per character. UTF-16 uses two-byte code units, which makes its byte stream incompatible with ASCII right away. UTF-8 uses single-byte units for the ASCII range, so ASCII text is byte-for-byte identical to its UTF-8 encoding; in other words, UTF-8 is designed to be backward compatible with ASCII.
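A quick sketch of that difference at the byte level; the sample string and the choice of Encoding.Unicode (UTF-16 LE) are just for illustration:

using System;
using System.Text;

class AsciiCompatibility
{
    static void Main()
    {
        const string text = "abc";

        // ASCII and UTF-8 produce identical bytes for this input: 61 62 63.
        Console.WriteLine(BitConverter.ToString(Encoding.ASCII.GetBytes(text)));
        Console.WriteLine(BitConverter.ToString(Encoding.UTF8.GetBytes(text)));

        // UTF-16 LE interleaves a zero byte: 61 00 62 00 63 00, so an
        // ASCII-expecting consumer cannot read the UTF-16 byte stream unchanged.
        Console.WriteLine(BitConverter.ToString(Encoding.Unicode.GetBytes(text)));
    }
}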
I have a string, and I convert this string to a byte array to send to a TCP device.
byte[] loadRegionCommand=System.Text.Encoding.Unicode.GetBytes("$RGNLOAD http://1.1.1.1:9999/region1.txt");
The System.Text.Encoding.Unicode.GetBytes() method is adding a zero after each character, but my device probably doesn't accept Unicode. What can I use instead of System.Text.Encoding.Unicode.GetBytes()?
Thanks for your help.
But System.Text.Encoding.Unicode.GetBytes() method is adding a zero after each character.
Yes, because Encoding.Unicode is UTF-16, which uses 16 bits (two bytes) per code unit; that is just what you asked for.
Use Encoding.UTF8.GetBytes() or maybe even Encoding.ASCII. That depends on what your device expects.
You are using a UnicodeEncoding (UTF-16). A .NET string has a character size of two bytes. Since your input characters are exclusively in the ASCII range, the upper byte is always zero. The data is stored little-endian, so the lower byte is written first. Hence your result.
You can choose a different encoding depending on your input. If all your characters are ASCII, use an ASCIIEncoding. If you must use one byte per character and you have characters outside the ASCII range, use the appropriate code page. Otherwise you can use UTF8Encoding, which will encode all ASCII characters in one byte and all other characters in two or more bytes (up to four).
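A small sketch of the alternatives, using the command string from the question; which encoding is right depends entirely on what your device expects:

using System;
using System.Text;

class DeviceEncoding
{
    static void Main()
    {
        const string command = "$RGNLOAD http://1.1.1.1:9999/region1.txt";

        byte[] unicodeBytes = Encoding.Unicode.GetBytes(command); // UTF-16 LE: two bytes per character here
        byte[] asciiBytes = Encoding.ASCII.GetBytes(command);     // one byte per character
        byte[] utf8Bytes = Encoding.UTF8.GetBytes(command);       // identical to ASCII for this input

        Console.WriteLine($"UTF-16: {unicodeBytes.Length} bytes, ASCII: {asciiBytes.Length} bytes, UTF-8: {utf8Bytes.Length} bytes");
        Console.WriteLine(BitConverter.ToString(asciiBytes) == BitConverter.ToString(utf8Bytes)); // True
    }
}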
In C# I need to get the ASCII code of some characters.
So I convert the char to byte or int, then print the result.
string sample = "A";
int asciiInt = sample[0];
byte asciiByte = (byte)sample[0];
For characters with ASCII code 128 and less, I get the right answer.
But for characters greater than 128 I get irrelevant answers!
I am sure all characters are less than 0xFF.
I have also tested System.Text.Encoding and got the same results.
For example, I get 172 for a char whose actual byte value is 129!
Actually, ASCII characters like ƒ, ‡, ‹, “, ¥, ©, Ï, ³, ·, ½, », Á each take 1 byte and go up to more than 193.
I guess there is a Unicode equivalent for them, and .NET returns that because it interprets strings as Unicode.
What if someone needs to access the actual value of a byte, whether or not it is a valid, known ASCII character?
But for characters greater than 128 I get irrelevant answers
No you don't. You get the bottom 8 bits of the UTF-16 code unit corresponding to the char.
Now if your text were all ASCII, that would be fine - because ASCII only goes up to 127 anyway. It sounds like you're actually expecting the representation in some other encoding - so you need to work out which encoding that is, at which point you can use:
Encoding encoding = ...;
byte[] bytes = encoding.GetBytes(sample);
// Now extract the bytes you want. Note that a character may be represented by more than
// one byte.
If you're essentially looking for an encoding which treats bytes 0 to 255 as U+0000 to U+00FF respectively, you should use ISO-8859-1, which you can access using Encoding.GetEncoding(28591).
You can't just ignore the issue of encoding. There is no inherent mapping between bytes and characters - that's defined by the encoding.
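A small sketch of that byte-for-byte mapping, assuming the raw byte values are really what you are after:

using System;
using System.Text;

class Latin1RoundTrip
{
    static void Main()
    {
        // ISO-8859-1 (code page 28591) maps bytes 0x00-0xFF straight to U+0000-U+00FF.
        Encoding latin1 = Encoding.GetEncoding(28591);

        string sample = "A\u00A9\u00FF"; // 'A', '©', 'ÿ' - all at or below U+00FF
        byte[] bytes = latin1.GetBytes(sample);

        Console.WriteLine(BitConverter.ToString(bytes));      // 41-A9-FF
        Console.WriteLine(latin1.GetString(bytes) == sample); // True - it round-trips
    }
}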
If I use your example of 131, on my system this produces â. However, since you're obviously on an Arabic system, you most likely have the Windows-1256 encoding, which produces ƒ for 131.
In other words, you need to use the correct encoding when converting characters to bytes and vice versa. In your case:
var sample = "ƒ";
var byteValue = Encoding.GetEncoding("windows-1256").GetBytes(sample)[0];
This produces 131, as you seem to expect. Most importantly, it will work on all computers; if you want the result to depend on the current system's locale, Encoding.Default can also work for you.
The only reason your method seems to work for bytes under 128 is that in UTF-8 those characters correspond to the ASCII standard mapping. However, you're misusing the term ASCII: it really only refers to these 7-bit characters. What you're calling ASCII is actually an extended 8-bit charset, and all characters with the eighth bit set are charset-dependent.
We're no longer in a world in which you can assume your application will only run on computers with the same locale as yours - .NET is designed for this, which is why all strings are Unicode. At the very least, read http://www.joelonsoftware.com/articles/Unicode.html for an explanation of how encodings work, and to get rid of some of the serious and dangerous misconceptions you seem to have.
The size of char is 2 bytes (MSDN):
sizeof(char) //2
A test:
char[] c = new char[1] {'a'};
Encoding.UTF8.GetByteCount(c) //1 ?
Why is the value 1?
(Of course, if c contains a non-ASCII char like 'ש', it does show 2, as it should.)
Is 'a' not a .NET char?
It's because 'a' only takes one byte to encode in UTF-8.
Encoding.UTF8.GetByteCount(c) will tell you how many bytes it takes to encode the given array of characters in UTF-8. See the documentation for Encoding.GetByteCount for more details. That's entirely separate from how wide the char type is internally in .NET.
Each character with code points less than 128 (i.e. U+0000 to U+007F) takes a single byte to encode in UTF-8.
Other characters take 2, 3 or even 4 bytes in UTF-8. (There are values over U+1FFFFF which would take 5 or 6 bytes to encode, but they're not part of Unicode at the moment, and probably never will be.)
Note that the only characters which take 4 bytes to encode in UTF-8 can't be encoded in a single char anyway. A char is a UTF-16 code unit, and any Unicode code points over U+FFFF require two UTF-16 code units forming a surrogate pair to represent them.
The reason is that, internally, .NET represents characters as UTF-16, where each character typically occupies 2 bytes. On the other hand, in UTF-8, each character occupies 1 byte if it’s among the first 128 codepoints (which incidentally overlap with ASCII), and 2 or more bytes beyond that.
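A short sketch of those byte counts (the sample characters are only illustrative):

using System;
using System.Text;

class Utf8ByteCounts
{
    static void Main()
    {
        Console.WriteLine(Encoding.UTF8.GetByteCount("a"));  // 1 byte  (U+0061)
        Console.WriteLine(Encoding.UTF8.GetByteCount("ש"));  // 2 bytes (U+05E9)
        Console.WriteLine(Encoding.UTF8.GetByteCount("€"));  // 3 bytes (U+20AC)
        Console.WriteLine(Encoding.UTF8.GetByteCount("😀")); // 4 bytes (U+1F600)
        Console.WriteLine("😀".Length);                      // 2 - a surrogate pair, i.e. two UTF-16 code units
    }
}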
That's not fair. The page you mention says
The char keyword is used to declare a Unicode character
Try then:
Encoding.Unicode.GetByteCount(c)
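For completeness, a quick check of both counts against the same array:

using System;
using System.Text;

class CharByteCounts
{
    static void Main()
    {
        char[] c = { 'a' };

        Console.WriteLine(Encoding.UTF8.GetByteCount(c));    // 1 - 'a' needs one byte in UTF-8
        Console.WriteLine(Encoding.Unicode.GetByteCount(c)); // 2 - UTF-16 always uses at least two bytes
        Console.WriteLine(sizeof(char));                     // 2 - the in-memory size of a .NET char
    }
}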
In C#, we can use the classes below to do encoding:
System.Text.Encoding.UTF8
System.Text.Encoding.Unicode (UTF-16)
System.Text.Encoding.ASCII
Why is there no System.Text.Encoding.Base64?
We can only use the Convert.ToBase64String / Convert.FromBase64String methods. What is special about Base64?
Can I say Base64 is the same kind of encoding method as UTF-8? Or is UTF-8 one kind of Base64?
UTF-8 and UTF-16 are methods to encode Unicode strings to byte sequences.
See: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
Base64 is a method to encode a byte sequence to a string.
So, these are widely different concepts and should not be confused.
Things to keep in mind:
Not every byte sequence represents a Unicode string encoded in UTF-8 or UTF-16.
Not every Unicode string represents a byte sequence encoded in Base64.
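A minimal sketch of that asymmetry (the sample string is arbitrary):

using System;
using System.Text;

class EncodingDirections
{
    static void Main()
    {
        // Text encoding: Unicode string -> bytes (and back).
        byte[] utf8Bytes = Encoding.UTF8.GetBytes("héllo");

        // Base64: arbitrary bytes -> printable string (and back).
        string base64 = Convert.ToBase64String(utf8Bytes);
        byte[] decoded = Convert.FromBase64String(base64);

        Console.WriteLine(BitConverter.ToString(utf8Bytes));  // 68-C3-A9-6C-6C-6F
        Console.WriteLine(base64);                            // aMOpbGxv
        Console.WriteLine(Encoding.UTF8.GetString(decoded));  // héllo
    }
}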
Base64 is a way to encode binary data, while UTF8 and UTF16 are ways to encode Unicode text. Note that in a language like Python 2.x, where binary data and strings are mixed, you can encode strings into base64 or utf8 the same way:
u'abc'.encode('utf16')
u'abc'.encode('base64')
But in languages where there's a more well-defined separation between the two types of data, the two ways of representing data generally have quite different utilities, to keep the concerns separate.
UTF-8 is, like the other UTF encodings, a character encoding for the characters of the Unicode character set (UCS).
Base64 is an encoding to represent any byte sequence by a sequence of printable characters (i.e. A–Z, a–z, 0–9, +, and /).
There is no System.Text.Encoding.Base64 because Base64 is not a text encoding but rather a base conversion, like hexadecimal, which uses 0–9 and A–F (or a–f) to represent numbers.
Simply speaking, a character encoding such as UTF-8 or UTF-16 is used to map numbers (bytes) to characters and vice versa; for example, in ASCII the value 65 is mapped to "A". A base encoding, on the other hand, is used mainly to translate bytes to other bytes, so that the resulting bytes are printable and form a subset of the ASCII character set; for that reason you can also see Base64 as a bytes-to-text encoding mechanism. The main reason to use Base64 is to transmit data over a channel that doesn't allow binary data transfer.
That said, it should now be clear that you can have a Base64-encoded string that represents a UTF-8-encoded byte stream.