I'm reading a stream and am wondering why the UTF-8 encoded string is shorter than the ASCII one.
ASCIIEncoding encoder = new ASCIIEncoding();
UTF8Encoding enc = new UTF8Encoding();
string response = encoder.GetString(message, 0, bytesRead); // resulting string length: 4096
string responseUtf8 = enc.GetString(message, 0, bytesRead); // resulting string length: 3955
UTF-8 handles strings differently than ASCII: in UTF-8, a single character may be 1 to 4 bytes long, whereas ASCII treats every byte as one character. The C# UTF-8 decoder counts well-formed UTF-8 characters rather than raw bytes. I hope this helps.
Because when decoding bytes, ASCIIEncoding replaces every byte greater than 127 (0x7F) with a question mark (?), each of which is one character, while UTF8Encoding decodes UTF-8 multi-byte sequences correctly into single characters (for example, the three bytes 232, 170, 158 become the single character 語).
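A minimal sketch of that difference, reusing the three example bytes from above (only standard System.Text calls, nothing project-specific):
byte[] bytes = { 232, 170, 158 }; // the UTF-8 byte sequence for 語
string asAscii = System.Text.Encoding.ASCII.GetString(bytes); // "???" (each byte above 127 becomes '?')
string asUtf8 = System.Text.Encoding.UTF8.GetString(bytes);   // "語" (one character)
System.Console.WriteLine(asAscii.Length); // 3
System.Console.WriteLine(asUtf8.Length);  // 1
So the ASCII-decoded string always has exactly as many characters as there are bytes, while the UTF-8-decoded string shrinks wherever a multi-byte sequence occurs.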
That's because the stream is actually UTF-8 encoded. If it was ASCII encoded, the strings would be identical.
When read as ASCII, the byte combinations that represent characters outside the 0-127 code set will be read as separate characters, and they will look like garbage.
When read as UTF-8, the byte combinations will be decoded into the correct characters, each multi-byte combination ending up as a single character.
(Note: Strings are not encoded, it's the stream that is encoded. You decode the stream from ASCII or UTF-8 into a Unicode character string.)
Perhaps the message contained some characters that couldn't be encoded as a single byte in UTF-8.
I am having an issue trying to convert a string containing Simplified Chinese to a double-byte encoding (GB2312). This is for printing Chinese characters on a Zebra printer.
The specs I am looking at show an example with the text "冈区色呆", which they show converting to the hex value 38_54_47_78_49_2b_34_74.
In my C# code I am trying to convert this using the code below as a test. My result seems to be off by 8 in the leading hex digit of each byte. What am I missing here?
private const string SimplifiedChineseChars = "冈区色呆";
[TestMethod]
public void GetBackCorrectHexValues()
{
byte[] bytes = Encoding.GetEncoding(20936).GetBytes(SimplifiedChineseChars);
string hex = BitConverter.ToString(bytes).Replace("-", "_");
//I get the following: B8_D4_C7_F8_C9_AB_B4_F4
//I am expecting: 38_54_47_78_49_2b_34_74
}
The only thing that makes sense to me is that 38_54_47_78_49_2b_34_74 is some form of 7-bit encoding.
Interestingly, a 7-bit version of the GB2312 encoding does exist, and is called the HZ character encoding.
Here is the Wikipedia entry on HZ. Interesting parts:
The HZ ... encoding was invented to facilitate the use of Chinese characters through e-mail, which at that time only allowed 7-bit characters.
the HZ code uses only printable, 7-bit characters to represent Chinese characters.
And, according to this Microsoft reference page on EncodingInfo.GetEncoding, this character encoding is supported in .NET:
52936 hz-gb-2312 Chinese Simplified (HZ)
If I try your code and switch the character encoding to HZ, I get:
static void Main(string[] args)
{
const string SimplifiedChineseChars = "冈区色呆";
byte[] bytes = Encoding.GetEncoding("hz-gb-2312").GetBytes(SimplifiedChineseChars);
string hex = BitConverter.ToString(bytes).Replace("-", "_");
Console.WriteLine(hex);
}
Output:
7E_7B_38_54_47_78_49_2B_34_74_7E_7D
So, you basically get exactly what you are looking for, except that it adds the escape sequences ~{ and ~} before and after the Chinese character bytes. Those escape sequences are necessary because this encoding supports mixing ASCII character bytes (single-byte encoding) with GB Chinese character bytes (double-byte encoding). The escape sequences mark the regions that should not be interpreted as ASCII.
If you choose to use the hz-gb-2312 encoding, you would have to strip any unwanted escape sequences yourself, if you think you don't need them. But, perhaps you do need them. You'll have to figure out exactly what your printer is expecting.
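If you do decide to strip them, a minimal sketch could look like the following; it assumes the whole payload is a single run of Chinese characters with no embedded ASCII, so only the outer ~{ and ~} pairs need to be dropped:
static void Main(string[] args)
{
const string SimplifiedChineseChars = "冈区色呆";
byte[] hz = Encoding.GetEncoding("hz-gb-2312").GetBytes(SimplifiedChineseChars);
// drop the leading ~{ (0x7E 0x7B) and trailing ~} (0x7E 0x7D) escape pairs
byte[] stripped = new byte[hz.Length - 4];
Array.Copy(hz, 2, stripped, 0, stripped.Length);
Console.WriteLine(BitConverter.ToString(stripped).Replace("-", "_"));
// 38_54_47_78_49_2B_34_74
}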
Alternatively, if you really don't want those escape sequences, are not worried about handling ASCII characters, and are confident that you only have to deal with Chinese double-byte characters, then you could stick with the vanilla GB2312 encoding and drop the most significant bit of every byte yourself, essentially converting the result to a 7-bit encoding.
Here is what the code could look like. Notice that I mask each byte value with 0x7F to drop the 8th bit.
static void Main(string[] args)
{
const string SimplifiedChineseChars = "冈区色呆";
byte[] bytes = Encoding.GetEncoding("gb2312") // vanilla gb2312 encoding
.GetBytes(SimplifiedChineseChars)
.Select(b => (byte)(b & 0x7F)) // retain 7 bits only (Select requires using System.Linq)
.ToArray();
string hex = BitConverter.ToString(bytes).Replace("-", "_");
Console.WriteLine(hex);
}
Output:
38_54_47_78_49_2B_34_74
Whenever I send any character value greater than 127 to the COM port, I get junk output values at the serial port.
ComPort.Write(data);
Strictly speaking, ASCII only contains 128 possible symbols.
You will need to use a different character set to send anything other than the 33 control characters and the 95 printable characters (letters, digits, punctuation, and the space) that make up ASCII.
To make things more confusing, ASCII is used as the starting point for several larger (and mutually conflicting) character sets.
This list is not comprehensive, but the most common are (a short example follows the list):
Extended ASCII, which is ASCII with 128 more characters in it.
ISO 8859-1 is ASCII plus the characters required to represent Western European languages.
Windows-1252 is ISO 8859-1 with a few alterations (including printable characters in the 0x80-0x9F range).
ISO 8859-2 is the equivalent for Eastern European languages.
Windows-1250 is also for Eastern European languages, but bears little resemblance to 8859-2.
UTF-8 also matches ASCII for 0-127, but uses the bytes 128 to 255 only as pieces of multi-byte sequences, never as standalone characters.
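To see how differently these sets treat the same byte, here is a small illustration (0x96 is just an arbitrary value above 127; only standard Encoding calls are used):
byte[] data = { 0x96 };
Console.WriteLine(Encoding.ASCII.GetString(data));                       // "?" (not representable in ASCII)
Console.WriteLine(Encoding.GetEncoding("iso-8859-1").GetString(data));   // an invisible C1 control character
Console.WriteLine(Encoding.GetEncoding("windows-1252").GetString(data)); // "–" (an en dash)
Console.WriteLine(Encoding.UTF8.GetString(data));                        // "�" (0x96 on its own is not valid UTF-8)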
Back to your problem: your encoding is set to ASCII, so you are prevented from sending any characters outside of that character set. You need to set your encoding to an appropriate character set for the data you are sending. In my experience (which is admittedly in USA and Western Europe) Windows-1252 is the most used.
In C#, you can send either a string or bytes through a serial port. If you send a string, it uses SerialPort.Encoding to convert that string into bytes to send. You can set this to the appropriate encoding for your purposes, or you can manually convert a string with a System.Text.Encoding object.
Set the encoder on the com port:
ComPort.Encoding = Encoding.GetEncoding("Windows-1252");
or manually encode the string:
System.Text.Encoding enc = System.Text.Encoding.GetEncoding("Windows-1252");
byte[] sendBuffer = enc.GetBytes(command);
ComPort.Write(sendBuffer, 0, sendBuffer.Length);
Both do functionally the same thing.
EDIT:
0x96 is a valid character in Windows-1252
This outputs an en dash (–); a normal hyphen is 0x2D.
System.Text.Encoding enc = System.Text.Encoding.GetEncoding("windows-1252");
byte[] buffer = new byte[]{0x96};
Console.WriteLine(enc.GetString(buffer));
The issue was resolved when the encoding on the serial port was changed. (Code page 28591 is ISO 8859-1, which maps every byte 0-255 directly to the Unicode code point of the same value, so the bytes pass through unchanged.)
this.ComPort.Encoding = Encoding.GetEncoding(28591);
Theoretical question:
Let's say there is a source which knows only how to transmit ASCII chars (0..127).
And let's say there is an endpoint which receives these chars.
Can the endpoint decode those chars as UTF-8?
ascii chars
...
...
|
|
V
read as utf ?
Something like this pseudo code:
var txt = "אבג";
var _bytes = Encoding.ASCII.GetBytes(txt); // <= it won't recognize [א] here
...transmit...
var myUtfString = Encoding.UTF8.GetString(getBytesFromWire()); // <= some magic has to be done here
That is possible, but not using UTF-8.
UTF-8 represents non-ASCII characters as multi-byte sequences whose bytes are all between 128 and 255.
Your ASCII protocol will not be able to transmit those bytes.
Instead, you need some mechanism to store arbitrary Unicode codepoints or bytes in pure ASCII text:
You can encode the Unicode text using any encoding to get a stream of (non-ASCII) bytes, then transmit those bytes using Base64 encoding.
You can use the UTF7 encoding to encode Unicode codepoints using pure ASCII characters.
This will be substantially more space-efficient than Base64 if your text is mostly ASCII (a sketch of this follows the Base64 example below).
var txt = "אבג";
var str = Convert.ToBase64String(Encoding.UTF8.GetBytes(txt)); //<--ASCII
//Transmit
var txt2 = Encoding.UTF8.GetString(Convert.FromBase64String(str));
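And a sketch of the UTF7 route (note that UTF-7 is marked obsolete in recent .NET versions, so treat this as illustration only):
var txt = "אבג";
var asciiSafe = Encoding.UTF7.GetBytes(txt); //<--every byte is <= 127, so it survives the ASCII-only channel
//Transmit
var txt2 = Encoding.UTF7.GetString(asciiSafe);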
I am developing an application in which I want to encode Spanish text.
But the problem is that it doesn't encode special characters such as á, é, í, ó, ú, ü, Á, É, Í, Ó, Ú, Ü, Ñ, ñ.
How can I do this? I want to encode and decode the Spanish text.
For international support, using plain UTF-8 encoding to encode/decode your data should be enough.
UTF-8 has the nice property that ASCII characters are encoded as single bytes, exactly as in plain ASCII, while other Unicode characters take 2 to 4 bytes, so it stays compact wherever plain ASCII is enough.
For the C# documentation, see the UTF8Encoding class.
EDIT
Encoding enc = new UTF8Encoding(true, true);
string value = " á, é, í, ó, ú, ü, Á, É, Í, Ó, Ú, Ü, Ñ, ñ ";
byte[] bytes = enc.GetBytes(value); // convert the string to a byte array
File.WriteAllBytes("spanish.txt", bytes); // save it in some file (System.IO; the path is just an example)
// later, read the bytes back from the file and decode them
byte[] byteArrayReadFromFile = File.ReadAllBytes("spanish.txt");
string decodedString = enc.GetString(byteArrayReadFromFile);
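As a quick check of that variable-width behaviour, GetByteCount shows how many bytes a given string costs in UTF-8 (the values here are purely illustrative):
Console.WriteLine(Encoding.UTF8.GetByteCount("a"));      // 1 (plain ASCII character)
Console.WriteLine(Encoding.UTF8.GetByteCount("á"));      // 2 (accented character)
Console.WriteLine(Encoding.UTF8.GetByteCount("España")); // 7 (6 characters, ñ takes 2 bytes)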
OK, I am answering my own question; hope it will help someone. To print Spanish (or any other non-ASCII) characters in the given string, replace all non-ASCII characters with their Unicode escape sequences,
e.g. replace á with \u00e1,
and then simply print the string.
i.e.
string str = "árgrgrgrááhhttá";
str = str.Replace("á", @"\u00e1"); // verbatim string literal, so str now contains the literal text \u00e1
In C#, we can use the classes below to do encoding:
System.Text.Encoding.UTF8
System.Text.Encoding.Unicode (UTF-16)
System.Text.Encoding.ASCII
Why is there no System.Text.Encoding.Base64?
We can only use the Convert.ToBase64String / Convert.FromBase64String methods; what's so special about Base64?
Can I say Base64 is the same kind of encoding method as UTF-8? Or is UTF-8 a form of Base64?
UTF-8 and UTF-16 are methods to encode Unicode strings to byte sequences.
See: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
Base64 is a method to encode a byte sequence to a string.
So, these are widely different concepts and should not be confused (see the short example below).
Things to keep in mind:
Not every byte sequence represents a Unicode string encoded in UTF-8 or UTF-16.
Not every Unicode string represents a byte sequence encoded in Base64.
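A short sketch of the two directions in C# (a character encoding for text to bytes, Base64 for bytes to printable text); the string value is just an example:
string text = "héllo";
byte[] utf8Bytes = Encoding.UTF8.GetBytes(text);        // text -> bytes (character encoding)
string base64 = Convert.ToBase64String(utf8Bytes);      // bytes -> printable ASCII text (Base64)
byte[] bytesAgain = Convert.FromBase64String(base64);   // printable text -> original bytes
string textAgain = Encoding.UTF8.GetString(bytesAgain); // bytes -> original text, "héllo"
Console.WriteLine(base64);                              // "aMOpbGxv"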
Base64 is a way to encode binary data, while UTF8 and UTF16 are ways to encode Unicode text. Note that in a language like Python 2.x, where binary data and strings are mixed, you can encode strings into base64 or utf8 the same way:
u'abc'.encode('utf16')
u'abc'.encode('base64')
But in languages with a clearer separation between the two kinds of data, the two representations generally have quite different uses, which keeps the concerns separate.
UTF-8 is, like the other UTF encodings, a character encoding for the characters of the Unicode character set (UCS).
Base64 is an encoding to represent any byte sequence by a sequence of printable characters (i.e. A–Z, a–z, 0–9, +, and /).
There is no System.Text.Encoding.Base64 because Base64 is not a text encoding but rather a base conversion, like hexadecimal, which uses 0–9 and A–F (or a–f) to represent numbers.
Simply speaking, a character encoding like UTF-8 or UTF-16 is used to map numbers, i.e. bytes, to characters and vice versa; for example, in ASCII 65 is mapped to "A". A base encoding, on the other hand, is mainly used to translate bytes to bytes so that the resulting bytes are printable and form a subset of the ASCII character set; for that reason you can also see Base64 as a bytes-to-text encoding mechanism. The main reason to use Base64 is to transmit data over a channel that doesn't allow binary data transfer.
That said, it should now be clear that you can have a Base64-encoded stream that represents a UTF-8 encoded stream.