Decode UTF-16 LE with BOM from ByteArray in C# [duplicate] - c#

This question already has answers here:
Encoding.UTF8.GetString doesn't take into account the Preamble/BOM
(4 answers)
Closed 7 years ago.
In my .NET 3.5 C# application I'm converting a unicode encoded byte array to a string.
The byte array is as follows:
{255, 254, 85, 0, 83, 0, 69, 0}
Using Encoding.Unicode.GetString(var), I convert the byte array to a string, which returns:
{65279 '\uFEFF', 85 'U', 83 'S', 69 'E'}
The leading character, 65279, seems to be a Zero Width No-Break Space, which is used as a Byte Order Mark in Unicode encoding, and its appearance is causing problems in the rest of my application.
Currently the workaround I'm using is var.Trim(new char[]{'\uFEFF','\u200B'});, which works just fine.
But the question really is, shouldn't GetString take care of removing the byte order mark? Or am I doing something wrong when converting the byte array?

No, GetString() should not be removing the BOM. The BOM is actually a perfectly valid Unicode character (selected specifically because if it appears in the middle of a Unicode file, e.g. if the file was the result of concatenating multiple Unicode files, it won't affect the rendered text) and must be decoded along with all other characters in the byte[].
The only code that ought to be interpreting and filtering out the BOM would be code that understands the data is coming from some persistent storage, e.g. StreamReader. And note that it will do that only if you don't disable that behavior.
All that GetString() should do is interpret the actual encoded characters and convert them to the text they represent (of course, in C# strings are stored internally as UTF16, so there's very little to that conversion when the original data is already in UTF16 :) ).
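If the leading U+FEFF is a problem downstream, one option, sketched below under the assumption that the input is always little-endian UTF-16 with an optional BOM, is to compare the start of the array against Encoding.Unicode.GetPreamble() and skip it before decoding. (Alternatively, wrap the bytes in a MemoryStream and let a StreamReader detect and strip the BOM for you.)
using System;
using System.Text;

byte[] input = { 255, 254, 85, 0, 83, 0, 69, 0 };
Console.WriteLine(DecodeUtf16Le(input));  // "USE", no leading U+FEFF

static string DecodeUtf16Le(byte[] data)
{
    // Encoding.Unicode.GetPreamble() returns { 0xFF, 0xFE }, the UTF-16 LE BOM
    byte[] bom = Encoding.Unicode.GetPreamble();
    int offset = data.Length >= 2 && data[0] == bom[0] && data[1] == bom[1] ? 2 : 0;
    return Encoding.Unicode.GetString(data, offset, data.Length - offset);
}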

Related

c# UTF8 GetString from bytes array not equal to php chr function

I'm trying to write a decoder. The original system is .NET 4.7, and I'm trying to migrate it to PHP, but I'm having trouble converting the bytes. As far as I understand, the default string encoding in C# is UTF-16LE, and I treated PHP's ord and chr functions as UCS-2. With the code below I do not get the same result on both sides. What can I do to fix this? Thanks in advance.
XOR Encoded Text Bytes = [101,107,217,78,40,68,234,218,162,67,139,81,44,166,24,148];
on C#
string result = System.Text.Encoding.UTF8.GetString(destinationArray);
On PHP
$tmpStr = "";
for ($i = 0; $i < sizeof($encoded); $i++) {
    echo "\t" . $encoded[$i] . " => " . chr($encoded[$i]) . "\n";
    $tmpStr .= chr($encoded[$i]);
}
C# Result size=26:
ek�N(D�ڢC�Q,��
PHP Result size=16:
ek�N(D�ڢC�Q,��
The strings look the same, but the byte translation is quite different.
C# Result to Bytes array:
byte[] utf8 = System.Text.Encoding.Unicode.GetBytes(result);
Console.WriteLine(string.Join("-", utf8));
response =
101-0-107-0-253-255-78-0-40-0-68-0-253-255-162-6-67-0-253-255-81-0-44-0-253-255-24-0-253-255
PHP Result to Bytes Array:
echo implode("-",unpack("C*", $tmpStr));
response = 101-107-217-78-40-68-234-218-162-67-139-81-44-166-24-148
If the PHP result is converted to UTF-16LE, the results are again different:
echo implode("-",unpack("C*", mb_convert_encoding($tmpStr,'UTF-16le')));
response =
101-0-107-0-63-0-78-0-40-0-68-0-63-0-162-6-67-0-63-0-81-0-44-0-63-0-24-0-63-0
You are mixing quite different things here.
First, in the C# code, you are not using the same encoding when converting from bytes to a string and then from that string back to bytes: Encoding.UTF8 in the first case and Encoding.Unicode (which is the .NET name for UTF-16) in the latter. Things cannot go well if you do this. And by the way, I'm not sure that PHP's UCS-2 is equivalent to UTF-16 (see the small sketch after the list below):
UTF-8 encodes characters on 1, 2, 3 or 4 bytes depending on the character
UTF-16 encodes characters on 2 or 4 bytes depending on the character
UCS-2 always encodes characters on 2 bytes, and hence cannot encode more than 65536 characters...
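For instance, a quick sketch of the byte counts you get in .NET for two sample characters:
using System;
using System.Text;

Console.WriteLine(Encoding.UTF8.GetBytes("A").Length);     // 1 byte in UTF-8
Console.WriteLine(Encoding.Unicode.GetBytes("A").Length);  // 2 bytes in UTF-16
Console.WriteLine(Encoding.UTF8.GetBytes("€").Length);     // 3 bytes in UTF-8
Console.WriteLine(Encoding.Unicode.GetBytes("€").Length);  // 2 bytes in UTF-16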
Then what you pass to the 'bytes to string' conversions is not necessarily valid! Because you've XORed the input data (I assume it to be some secret string), the resulting bytes may or may not be a valid sequence in some encodings. For example:
It is not valid in ASCII because you have (in your example) bytes > 127
It is not valid in UTF-8 because 217 followed by 78 is recognized neither as a 1-, 2-, 3-, or 4-byte character by UTF-8; hence, the � you see before the N.
It seems to be invalid UTF-16 as well, but round-tripping works (I could get back the original array using .NET's Unicode.GetString and then Unicode.GetBytes). If I remove your last byte - ending up with an odd number of bytes - then UTF-16 round-tripping no longer works...
Although I did not test it, it should also be invalid UCS-2 because UCS-2 'looks like' UTF-16 for 2-byte characters.
Round-tripping works with ANSI encodings such as windows-1252 because these encodings accept any byte. However, I would discourage using such a trick because you have to be sure the same code page is used on both sides of the encoding/decoding process.
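A small sketch of that difference, using the XORed bytes from the question: the UTF-8 decoder replaces the invalid sequences, so re-encoding does not reproduce the original bytes, while windows-1252 maps every byte to some character and round-trips cleanly.
using System;
using System.Linq;
using System.Text;

byte[] data = { 101, 107, 217, 78, 40, 68, 234, 218, 162, 67, 139, 81, 44, 166, 24, 148 };

// UTF-8: invalid sequences become U+FFFD, so the round trip is lossy
byte[] viaUtf8 = Encoding.UTF8.GetBytes(Encoding.UTF8.GetString(data));
Console.WriteLine(viaUtf8.SequenceEqual(data));   // False

// windows-1252: a single-byte code page, so the round trip is lossless
// (on .NET Core/5+ the code page requires the System.Text.Encoding.CodePages package)
Encoding cp1252 = Encoding.GetEncoding(1252);
byte[] via1252 = cp1252.GetBytes(cp1252.GetString(data));
Console.WriteLine(via1252.SequenceEqual(data));   // True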
Therefore, I think, in your case, the best way to store your XORed bytes into a string would be to convert the array to base64. In C# you can do it this way:
// The code below converts the XORed byte array to a base64 string
var converted = Convert.ToBase64String(array);
// And this one gives you back the initial array
var bytes = Convert.FromBase64String(converted);
Quick googling will tell you to use base64_encode and base64_decode in PHP.
Bottom note: if you want to really understand what's going on with all this encoding stuff, here is the must-read blog post on the subject: https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/

A weird thing in c# Encoding

I convert a byte array to a string, and then convert that string back to a byte array.
The two byte arrays are different.
As below:
byte[] tmp = Encoding.ASCII.GetBytes(Encoding.ASCII.GetString(b));
Suppose b is a byte array.
b[0]=3, b[1]=188, b[2]=2 //decimal system
Result:
tmp[0]=3, tmp[1]=63, tmp[2]=2
So that's my problem, what's wrong with it?
188 is out of range for ASCII. Characters that are not in the target character set are replaced with '?' by design (would you prefer them replaced with "¼", the character at 188 in Latin-1?).
ASCII is 7-bit only, so any byte above 127 is invalid. By default the decoder replaces each invalid byte with '?', and that's why you get a ? (63).
For 8-bit character sets, you should be looking at either extended ASCII (commonly meaning ISO 8859-1 / Latin-1) or code page 437 (which is often confused with extended ASCII, but is in fact a different character set).
You can use the following code:
Encoding enc = Encoding.GetEncoding("iso-8859-1");
// For CP437, use Encoding.GetEncoding(437)
byte[] tmp = enc.GetBytes(enc.GetString(b));
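As a quick check with the byte values from the question, the round trip preserves every byte, because ISO 8859-1 assigns a character to each of the 256 byte values:
using System;
using System.Text;

byte[] b = { 3, 188, 2 };
Encoding enc = Encoding.GetEncoding("iso-8859-1");
byte[] tmp = enc.GetBytes(enc.GetString(b));
Console.WriteLine(string.Join("-", tmp));  // 3-188-2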
The character 188 is not defined for ASCII. Instead, you're getting 63, which is a question mark.
The ASCII character set covers the range 0 to 127. 188 is not in this range, so it is converted to '?' (ASCII 63).
Not every sequence of bytes is necessarily a valid sequence of encoded values for a particular encoding.
So the result of Encoding.ASCII.GetString(b) on an arbitrary array of bytes, b, is poorly defined. (And could be, for any other encoding also).
If you need to take an arbitrary byte array and obtain a sequence of characters, you might want to look into the Convert classes ToBase64String and FromBase64String. If that's not what you're trying to do, maybe explain the original problem to us.
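For example, a minimal sketch with the same three bytes; base64 turns an arbitrary byte array into plain text and back without loss:
using System;

byte[] b = { 3, 188, 2 };
string text = Convert.ToBase64String(b);             // "A7wC"
byte[] roundTripped = Convert.FromBase64String(text);
Console.WriteLine(string.Join("-", roundTripped));   // 3-188-2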
188 isn't in the 7-bit ASCII range; you can use Encoding.Default to get the system's ANSI code page:
byte[] b = new byte[3]{ 3, 188, 2 };
byte[] tmp = Encoding.Default.GetBytes(Encoding.Default.GetString(b));

How do I convert a byte array to a string? [duplicate]

This question already has answers here:
How to convert UTF-8 byte[] to string
(16 answers)
Closed 6 years ago.
I have an array of 30 bytes, and when I use BitConverter.ToString it displays the hex string. The bytes are
0x42007200650061006B0069006E00670041007700650073006F006D0065
which is Unicode text as well.
It means B.r.e.a.k.i.n.g.A.w.e.s.o.m.e, but I am not sure how to convert it from hex to Unicode to ASCII.
You can use one of the Encoding classes - you will need to know what encoding these bytes are in though.
string val = Encoding.UTF8.GetString(myByteArray);
The values you have displayed look like a Unicode encoding, so UTF8 or Unicode look like good bets.
It looks like that's little-endian UTF-16, so you want Encoding.Unicode:
string text = Encoding.Unicode.GetString(bytes);
You shouldn't normally assume what the encoding is though - it should be something you know about the data. For other encodings, you'd obviously use different Encoding instances, but Encoding is the right class for binary representations of text.
EDIT: As noted in comments, you appear to be missing an "00" either from the start of your byte array (in which case you need Encoding.BigEndianUnicode) or from the end (in which case just Encoding.Unicode is fine).
(When it comes to the other way round, however, taking arbitrary binary data and representing it as text, you should use hex or base64. That's not the case here, but you ought to be aware of it.)
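If what you have at this point is the hex string itself rather than the original byte array, one sketch (assuming the 0x prefix and any dashes are stripped first) is to parse the pairs back into bytes and then decode:
using System;
using System.Text;

string hex = "42007200650061006B0069006E00670041007700650073006F006D0065";
byte[] bytes = new byte[hex.Length / 2];
for (int i = 0; i < bytes.Length; i++)
{
    bytes[i] = Convert.ToByte(hex.Substring(i * 2, 2), 16);
}
// Prints "BreakingAwesom" plus a mangled final character, because the example
// hex is missing a trailing 00 (see the note about the missing "00" above).
Console.WriteLine(Encoding.Unicode.GetString(bytes));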

"Unable to translate Unicode character" error when saving to txt file

Additional information: Unable to translate Unicode character \uDFFF at index 195 to specified code page.
I made an algorithm whose results are binary values (of different lengths). I transformed them into uint, then into chars, and saved them into a StringBuilder, as you can see below:
uint n = Convert.ToUInt16(tmp_chars, 2);
_koded_text.Append(Convert.ToChar(n));
My problem is that when I try to save those values into a .txt file, I get the previously mentioned error.
StreamWriter file = new StreamWriter(filename);
file.WriteLine(_koded_text);
file.Close();
What i am saving is this: "忿췾᷿]볯褟ﶞ痢ﳻ��伞ﳴ㿯ﹽ翼蛿㐻ﰻ筹��﷿₩マ랿鳿⏟麞펿"... which are some weird signs.
What I need is to convert those binary values into some kind of string of chars and save it to a txt file. I saw somewhere that converting to UTF-8 should help, but I don't know how. Would changing the file's encoding help too?
You cannot transform binary data to a string directly. The Unicode characters in a string are encoded using utf16 in .NET. That encoding uses two bytes per character, providing 65536 distinct values. Unicode however has over one million codepoints. To make that work, the Unicode codepoints above \uffff (above the BMP, Basic Multilingual Plane) are encoded with a surrogate pair. The first one has a value between 0xd800 and 0xdbff, the second between 0xdc00 and 0xdfff. That provides 2 ^ (10 + 10) = 1 million additional codes.
You can perhaps see where this leads, in your case the code detects a high surrogate value (0xdfff) that isn't paired with a low surrogate. That's illegal. Lots more possible mishaps, several codepoints are unassigned, several are diacritics that get mangled when the string is normalized.
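A quick way to see this in code (a sketch, not the asker's exact data) is to check the characters produced by the bit packing for lone surrogates:
using System;

char c = (char)0xDFFF;                      // a low surrogate with no preceding high surrogate
Console.WriteLine(char.IsSurrogate(c));     // True
Console.WriteLine(char.IsLowSurrogate(c));  // True
// A lone surrogate is not a valid Unicode scalar value, so encoders either
// replace it or throw, as in the error message above.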
You just can't make this work. Base64 encoding is the standard way to carry binary data across a text stream. It uses 6 bits per character, 3 bytes require 4 characters. The character set is ASCII so the odds of the receiving program decoding the character back to binary incorrectly are minimal. Only a decades old IBM mainframe that uses EBCDIC could get you into trouble. Or just plain avoid encoding to text and keep it binary.
Since you're trying to encode binary data to a text stream, this SO question already contains an answer: "How do I encode something as base64?" From there, plain ASCII/ANSI text is fine for the output encoding.
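A minimal sketch of that approach, assuming the algorithm's output is regrouped into 8-bit chunks; chunksOfEightBits and filename here are placeholder names:
using System;
using System.Collections.Generic;
using System.IO;

string[] chunksOfEightBits = { "01100101", "01101011" };  // placeholder sample data
string filename = "koded.txt";                            // placeholder path

// Collect the algorithm's output as bytes rather than as chars.
var bytes = new List<byte>();
foreach (string bits in chunksOfEightBits)
{
    bytes.Add(Convert.ToByte(bits, 2));
}

// Base64 is plain ASCII, so any StreamWriter/encoding can save it safely.
File.WriteAllText(filename, Convert.ToBase64String(bytes.ToArray()));

// Reading the data back:
byte[] decoded = Convert.FromBase64String(File.ReadAllText(filename));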
