Unicode conversion to String leaves leading Byte order mark [duplicate] - c#

This question already has answers here:
Encoding.UTF8.GetString doesn't take into account the Preamble/BOM
(4 answers)
Closed 7 years ago.
In my .NET 3.5 C# application I'm converting a unicode encoded byte array to a string.
The byte array is as follows:
{255, 254, 85, 0, 83, 0, 69, 0}
Using Encoding.Unicode.GetString(var), I convert the byte array to a string, which returns:
{65279 '', 85 'U', 83 'S', 69 'E'}
The leading character, 65279, seems to be a Zero Width No-Break Space, which is used as a Byte Order Mark in Unicode encoding, and its appearance is causing problems in the rest of my application.
Currently the workaround I'm using is var.Trim(new char[]{'\uFEFF','\u200B'});, which works just fine.
But the question really is: shouldn't GetString take care of removing the byte order mark? Or am I doing something wrong when converting the byte array?

No, GetString() should not be removing the BOM. The BOM is actually a perfectly valid Unicode character (selected specifically because if it appears in the middle of a Unicode file, e.g. if the file was the result of concatenating multiple Unicode files, it won't affect the rendered text) and must be decoded along with all other characters in the byte[].
The only code that ought to be interpreting and filtering out the BOM would be code that understands the data is coming from some persistent storage, e.g. StreamReader. And note that it will do that only if you don't disable that behavior.
All that GetString() should do is interpret the actual encoded characters and convert them to the text they represent (of course, in C# strings are stored internally as UTF16, so there's very little to that conversion when the original data is already in UTF16 :) ).
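A minimal sketch of both options, using the byte values from the question (GetString keeps the BOM, StreamReader consumes it):

byte[] data = { 255, 254, 85, 0, 83, 0, 69, 0 };    // BOM + "USE" in UTF-16 LE

// GetString decodes every byte, so the BOM survives as U+FEFF.
string raw = Encoding.Unicode.GetString(data);       // 4 characters

// Option 1: strip a leading BOM yourself.
string trimmed = raw.TrimStart('\uFEFF');            // "USE"

// Option 2: let StreamReader (System.IO) consume the BOM while decoding.
using (var reader = new StreamReader(new MemoryStream(data), Encoding.Unicode, true))
{
    string text = reader.ReadToEnd();                // "USE"
}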

Related

Decode UTF-16 LE with BOM from ByteArray in C# [duplicate]


A weird thing in c# Encoding

I convert a byte array to a string, and then convert that string back to a byte array, but the two byte arrays are different.
As below:
byte[] tmp = Encoding.ASCII.GetBytes(Encoding.ASCII.GetString(b));
Suppose b is a byte array.
b[0]=3, b[1]=188, b[2]=2 //decimal system
Result:
tmp[0]=3, tmp[1]=63, tmp[2]=2
So that's my problem, what's wrong with it?
188 is out of range for ASCII. Characters that are not in the corresponding character set are replaced with '?' by design (would you prefer it to guess "¼", which is what byte 188 means in Latin-1?).
ASCII is 7-bit only, so any byte above 127 is invalid. By default the decoder replaces invalid bytes with ?, and that's why you get a ?.
For 8-bit character sets, you should be looking at either Latin-1 (standardized as ISO 8859-1 and often loosely called "extended ASCII") or code page 437 (which is frequently confused with extended ASCII, but is in fact a different character set).
You can use the following code:
Encoding enc = Encoding.GetEncoding("iso-8859-1");
// For CP437, use Encoding.GetEncoding(437)
byte[] tmp = enc.GetBytes(enc.GetString(b));
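A quick check with the bytes from the question, assuming they really are Latin-1 text: the round trip through iso-8859-1 keeps 188 intact.

byte[] b = { 3, 188, 2 };
Encoding enc = Encoding.GetEncoding("iso-8859-1");
byte[] tmp = enc.GetBytes(enc.GetString(b));
Console.WriteLine(tmp[1]);   // 188, not 63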
The character 188 is not defined for ASCII. Instead, you're getting 63, which is a question mark.
The ASCII character set covers the range 0 to 127. 188 is outside this range, so it is converted to ? (ASCII 63).
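A short demonstration of that replacement behavior with the question's bytes:

byte[] b = { 3, 188, 2 };
string s = Encoding.ASCII.GetString(b);               // 188 has no ASCII character
Console.WriteLine((int)s[1]);                         // 63, i.e. '?'
Console.WriteLine(Encoding.ASCII.GetBytes(s)[1]);     // 63 again - the original 188 is gone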
Not every sequence of bytes is necessarily a valid sequence of encoded values for a particular encoding.
So the result of Encoding.ASCII.GetString(b) on an arbitrary array of bytes, b, is poorly defined (and the same is true for any other encoding).
If you need to take an arbitrary byte array and obtain a sequence of characters, you might want to look into the Convert classes ToBase64String and FromBase64String. If that's not what you're trying to do, maybe explain the original problem to us.
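For example, a minimal sketch of the Base64 round trip (the byte values are just the ones from the question):

byte[] b = { 3, 188, 2 };
string text = Convert.ToBase64String(b);              // "A7wC" - safe in any text context
byte[] back = Convert.FromBase64String(text);         // { 3, 188, 2 } again, nothing lost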
188 isn't in the range of ASCII (7-bit); you should use Encoding.Default to get the system's ANSI encoding:
byte[] b = new byte[3]{ 3, 188, 2 };
byte[] tmp = Encoding.Default.GetBytes(Encoding.Default.GetString(b));

How do I convert a byte array to a string? [duplicate]

This question already has answers here:
How to convert UTF-8 byte[] to string
(16 answers)
Closed 6 years ago.
I have a byte array of 30 bytes, but when I use BitConverter.ToString it displays the hex string. The byte array is
0x42007200650061006B0069006E00670041007700650073006F006D0065,
which is in Unicode as well.
It means B.r.e.a.k.i.n.g.A.w.e.s.o.m.e, but I am not sure how to convert it from hex to Unicode to ASCII.
You can use one of the Encoding classes - you will need to know what encoding these bytes are in though.
string val = Encoding.UTF8.GetString(myByteArray);
The values you have displayed look like a Unicode encoding, so UTF8 or Unicode look like good bets.
It looks like that's little-endian UTF-16, so you want Encoding.Unicode:
string text = Encoding.Unicode.GetString(bytes);
You shouldn't normally assume what the encoding is though - it should be something you know about the data. For other encodings, you'd obviously use different Encoding instances, but Encoding is the right class for binary representations of text.
EDIT: As noted in comments, you appear to be missing an "00" either from the start of your byte array (in which case you need Encoding.BigEndianUnicode) or from the end (in which case just Encoding.Unicode is fine).
(When it comes to the other way round, however, taking arbitrary binary data and representing it as text, you should use hex or base64. That's not the case here, but you ought to be aware of it.)
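A small sketch pulling those points together, built around the question's string (the variable names are illustrative):

byte[] bytes = Encoding.Unicode.GetBytes("BreakingAwesome");   // 30 bytes, little-endian UTF-16

// Known text encoding: decode it back with the matching Encoding.
string text = Encoding.Unicode.GetString(bytes);               // "BreakingAwesome"

// Arbitrary binary data: represent it as hex or base64 instead.
string hex = BitConverter.ToString(bytes);                     // "42-00-72-00-..."
string b64 = Convert.ToBase64String(bytes);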

What happens to a null byte when converting bytes to ISO 8859-1 encoding?

I'm not entirely sure if the question even makes sense. I'm taking a byte array from an ID3 tag and converting it to a string. Most text frames in an ID3 tag use ISO 8859-1 encoding, but it depends on the frame. In any case, if you look up what 0x00 is in the ISO 8859-1 codes, it is invalid.
To further complicate things, either due to programmer error or just poor formatting, some of the strings end in 0x00 and some do not.
When converting a series of bytes into a string using ISO 8859-1 encoding, do you have to manually check the end of the string to see if it is a null? Or will the encoding object, through whatever method it uses to convert, deal with the null properly? Furthermore, is there some sort of function that could normalize or "fix" the null-terminated string?
When you try to display these strings they do not display properly.
I am using C# for this particular project.
Some extra info here about ID3 Tags: ID3 Specs
Or am I completely misunderstanding the whole thing? Is a null terminator simply a way a particular language handles strings and it has nothing to do with encoding?
Edit: I used System.Text.Encoding.GetEncoding("iso-8859-1") followed by a GetString call
If you use Encoding.GetEncoding(28591), it just converts a byte 0 to the Unicode U+0000. Encodings generally assume that they have to convert all the bytes - they don't look for terminators.
This treatment of 0 as Unicode U+0000 is in line with the Wikipedia description:
In 1992, the IANA registered the character map ISO_8859-1:1987, more commonly known by its preferred MIME name of ISO-8859-1 (note the extra hyphen over ISO 8859-1), a superset of ISO 8859-1, for use on the Internet. This map assigns the C0 and C1 control characters to the unassigned code values thus provides for 256 characters via every possible 8-bit value.
The C0 and C1 control characters page includes:
0: Originally used to allow gaps to be left on paper tape for edits. Later used for padding after a code that might take a terminal some time to process (e.g. a carriage return or line feed on a printing terminal). Now often used as a string terminator, especially in the C programming language.
Sample code:
using System;
using System.Text;
class Program
{
    static void Main(string[] args)
    {
        byte[] data = { 0, 0 };
        Encoding latin1 = Encoding.GetEncoding(28591);
        string text = latin1.GetString(data);
        Console.WriteLine(text.Length);    // 2
        Console.WriteLine((int) text[0]);  // 0
        Console.WriteLine((int) text[1]);  // 0
    }
}
Happily, ASCII, ISO-8859-1 and Unicode all agree on codepoints in the range 0..127. Thus your character '\0' will be encoded identically in ASCII, ISO-8859-1 and UTF-8.
If your program assigns special semantics to the zero byte, you have to take care of that appropriately.
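Since the decoder leaves the null in place, the caller has to strip it. A minimal sketch (the sample bytes are made up for illustration):

byte[] frame = { 0x48, 0x65, 0x6C, 0x6C, 0x6F, 0x00 };     // "Hello" plus a terminator
Encoding latin1 = Encoding.GetEncoding(28591);

string text = latin1.GetString(frame);                      // length 6, ends in '\0'

// Drop everything from the first null onward
// (or just text.TrimEnd('\0') if only trailing nulls are expected).
int nul = text.IndexOf('\0');
string clean = nul >= 0 ? text.Substring(0, nul) : text;    // "Hello"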

"Unable to translate Unicode character" error when saving to txt file

Additional information: Unable to translate Unicode character \uDFFF at index 195 to specified code page.
I made an algorithm whose results are binary values (of different lengths). I transformed them into uint, then into chars, and saved them into a StringBuilder, as you can see below:
uint n = Convert.ToUInt16(tmp_chars, 2);
_koded_text.Append(Convert.ToChar(n));
My problem is that when I try to save those values into a .txt file I get the previously mentioned error.
StreamWriter file = new StreamWriter(filename);
file.WriteLine(_koded_text);
file.Close();
What I am saving is this: "忿췾᷿]볯褟ﶞ痢ﳻ��伞ﳴ㿯ﹽ翼蛿㐻ﰻ筹��﷿₩マ랿鳿⏟麞펿"... which are some weird signs.
What I need is to convert those binary values into some kind of string of chars and save it to a txt file. I saw somewhere that converting to UTF8 should help, but I don't know how. Would changing the file's encoding help too?
You cannot transform binary data to a string directly. The Unicode characters in a string are encoded using UTF-16 in .NET. That encoding uses two bytes per character, providing 65536 distinct values. Unicode, however, has over one million codepoints. To make that work, the Unicode codepoints above \uffff (above the BMP, the Basic Multilingual Plane) are encoded with a surrogate pair. The first one has a value between 0xd800 and 0xdbff, the second between 0xdc00 and 0xdfff. That provides 2^(10 + 10), roughly one million, additional codes.
You can perhaps see where this leads, in your case the code detects a high surrogate value (0xdfff) that isn't paired with a low surrogate. That's illegal. Lots more possible mishaps, several codepoints are unassigned, several are diacritics that get mangled when the string is normalized.
You just can't make this work. Base64 encoding is the standard way to carry binary data across a text stream. It uses 6 bits per character, 3 bytes require 4 characters. The character set is ASCII so the odds of the receiving program decoding the character back to binary incorrectly are minimal. Only a decades old IBM mainframe that uses EBCDIC could get you into trouble. Or just plain avoid encoding to text and keep it binary.
Since you're trying to encode binary data to a text stream, this SO question already contains an answer to "How do I encode something as base64?" From there, plain ASCII/ANSI text is fine for the output encoding.
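A sketch of that Base64 route, assuming the algorithm's output can be collected as byte-sized chunks (chunks and filename are placeholders standing in for the question's variables):

// Gather the algorithm's output as raw bytes instead of characters.
var bytes = new List<byte>();                        // System.Collections.Generic
foreach (string chunk in chunks)                     // hypothetical bit strings, e.g. "10110010"
    bytes.Add(Convert.ToByte(chunk, 2));

// Base64 survives any text file and any text encoding.
File.WriteAllText(filename, Convert.ToBase64String(bytes.ToArray()));

// Later: recover the original bytes.
byte[] restored = Convert.FromBase64String(File.ReadAllText(filename));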
