How do I convert a byte array to a string? [duplicate] - c#

This question already has answers here:
How to convert UTF-8 byte[] to string
(16 answers)
Closed 6 years ago.
I have a byte array of 30 bytes, but when I use BitConverter.ToString on it, it displays a hex string. The byte array is
0x42007200650061006B0069006E00670041007700650073006F006D0065.
This is Unicode text; it means B.r.e.a.k.i.n.g.A.w.e.s.o.m.e, but I am not sure how to convert it from hex to Unicode to a readable ASCII string.

You can use one of the Encoding classes - you will need to know what encoding these bytes are in though.
string val = Encoding.UTF8.GetString(myByteArray);
The values you have displayed look like little-endian UTF-16, so Encoding.Unicode is the most likely bet.

It looks like that's little-endian UTF-16, so you want Encoding.Unicode:
string text = Encoding.Unicode.GetString(bytes);
You shouldn't normally assume what the encoding is though - it should be something you know about the data. For other encodings, you'd obviously use different Encoding instances, but Encoding is the right class for binary representations of text.
EDIT: As noted in comments, you appear to be missing an "00" either from the start of your byte array (in which case you need Encoding.BigEndianUnicode) or from the end (in which case just Encoding.Unicode is fine).
(When it comes to the other way round, however, taking arbitrary binary data and representing it as text, you should use hex or base64. That's not the case here, but you ought to be aware of it.)
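Since the question starts from a BitConverter hex dump, here is a minimal sketch of going back from the hex representation to bytes and then to text; it assumes the dashes BitConverter.ToString inserts have been stripped, and restores the trailing "00" that the EDIT above suggests is missing:
using System;
using System.Text;

// Hex from the question, trailing "00" restored per the EDIT above.
string hex = "42007200650061006B0069006E00670041007700650073006F006D006500";

byte[] bytes = new byte[hex.Length / 2];
for (int i = 0; i < bytes.Length; i++)
    bytes[i] = Convert.ToByte(hex.Substring(i * 2, 2), 16);

// Little-endian UTF-16, as suggested above.
Console.WriteLine(Encoding.Unicode.GetString(bytes)); // BreakingAwesome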

Decode UTF-8 bytes as Latin-1 characters

I have a string that I receive from a third party app and I would like to display it correctly in any language using C# on my Windows Surface.
Due to incorrect encoding, a piece of my string looks like this in Farsi (Persian-Arabic):
Ù…Ø¯Ù„-Ø±Ù†Ú¯-Ù…ÙˆÛŒ-Ø¬Ø¯ÛŒØ¯-5-436x500
whereas it should look like this:
مدل-رنگ-موی-جدید-5-436x500
This page converts it correctly:
http://www.ltg.ed.ac.uk/~richard/utf-8.html
How can I do it in C#?
It is very hard to tell exactly what is going on from the description in your question. We would all be much better off if you provided an example of what is happening using a single character instead of a whole string, and if you chose an example character which does not belong to some exotic character set, for example the bullet character (U+2022) or something like that.
Anyhow, what is probably happening is this:
The letter "ر" is represented in UTF-8 as the byte sequence D8 B1, but what you see is "Ø±", and that's because in Unicode Ø is U+00D8 and ± is U+00B1. So the incoming text was originally in UTF-8, but in the process of importing it into a .NET Unicode string in your application it was incorrectly interpreted as being in some 8-bit character set such as ANSI or Latin-1. That's why you now have a Unicode string which appears to contain garbage.
However, the process of converting 8-bit characters to Unicode is for the most part not destructive, so all of the information is still there, that's why the UTF-8 tool that you linked to can still kind of make sense out of it.
What you need to do is convert the string back to an array of ANSI (or Latin-1, whatever) bytes, and then re-construct the string the right way, which is a conversion of UTF-8 to Unicode.
I cannot easily reproduce your situation, so here are some things to try:
byte[] bytes = System.Text.Encoding.Default.GetBytes( garbledUnicodeString ); // Encoding.Default is the system ANSI code page; there is no Encoding.Ansi in .NET
followed by
string properUnicodeString = System.Text.Encoding.UTF8.GetString( bytes );
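Putting the two steps together, a minimal self-contained sketch; the garbled sample string is hypothetical, and it assumes the bad decode happened through Windows-1252, which is typical on Windows:
using System;
using System.Text;

// Hypothetical garbled input: UTF-8 bytes that were wrongly decoded
// as Windows-1252 somewhere upstream.
string garbled = "Ø±"; // should have been "ر" (U+0631)

// Reverse the wrong decode, then redo it as UTF-8.
byte[] raw = Encoding.GetEncoding(1252).GetBytes(garbled);
string repaired = Encoding.UTF8.GetString(raw);

Console.WriteLine(repaired); // ر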

Unicode conversion to String leaves leading Byte order mark [duplicate]

This question already has answers here:
Encoding.UTF8.GetString doesn't take into account the Preamble/BOM
(4 answers)
Closed 7 years ago.
In my .NET 3.5 C# application I'm converting a Unicode-encoded byte array to a string.
The byte array is as follows:
{255, 254, 85, 0, 83, 0, 69, 0}
Using Encoding.Unicode.GetString(var), I convert the byte array to a string, which returns:
{65279 '\uFEFF', 85 'U', 83 'S', 69 'E'}
The leading character, 65279, seems to be a Zero Width No-Break Space, which is used as a Byte Order Mark in Unicode encoding, and its appearance is causing problems in the rest of my application.
Currently the workaround I'm using is var.Trim(new char[]{ '\uFEFF', '\u200B' }), which works just fine.
But the question really is, shouldn't GetString take care of removing the byte order mark? Or am I doing something wrong when converting the byte array?
No, GetString() should not be removing the BOM. The BOM is actually a perfectly valid Unicode character (selected specifically because if it appears in the middle of a Unicode file, e.g. if the file was the result of concatenating multiple Unicode files, it won't affect the rendered text) and must be decoded along with all other characters in the byte[].
The only code that ought to be interpreting and filtering out the BOM would be code that understands the data is coming from some persistent storage, e.g. StreamReader. And note that it will do that only if you don't disable that behavior.
All that GetString() should do is interpret the actual encoded characters and convert them to the text they represent (of course, since C# strings are stored internally as UTF-16, there's very little to that conversion when the original data is already UTF-16 :) ).
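A minimal sketch of both behaviors, using the byte array from the question:
using System;
using System.IO;
using System.Text;

byte[] data = { 255, 254, 85, 0, 83, 0, 69, 0 }; // BOM + "USE" in UTF-16LE

// GetString() keeps the BOM, so trim it yourself when decoding manually.
string s = Encoding.Unicode.GetString(data).TrimStart('\uFEFF');
Console.WriteLine(s); // USE

// StreamReader detects and consumes the BOM for you.
using (var reader = new StreamReader(new MemoryStream(data), Encoding.Unicode, true))
    Console.WriteLine(reader.ReadToEnd()); // USE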

Can any byte array be converted to a string?

Can any byte array be converted to a string? Or are there some byte values that cannot be converted to characters, depending on the encoding of the string?
You should only try to convert byte arrays to strings if they started as text. If the byte array is actually the contents of an image file, or a video, or maybe encoded or compressed data, you should not try to convert it straight to a string using an encoding. Doing so almost always goes badly in the end: with ISO-8859-1 you might be okay, but it's fundamentally a bad idea, and you really shouldn't do it.
Instead, you should use Convert.ToBase64String to convert it to Base64, or perhaps convert it to hex instead.
If you do use Base64, you'd use Convert.FromBase64String to convert back from text to a byte array.
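A minimal round-trip sketch:
using System;

byte[] original = { 0x00, 0xFF, 0x10, 0x80 }; // arbitrary binary data

string text = Convert.ToBase64String(original); // "AP8QgA=="
byte[] restored = Convert.FromBase64String(text); // back to the original bytes

Console.WriteLine(text);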
Can any byte array be converted to a string?
Base64 seems like an appropriate representation of a byte array:
byte[] buffer = ...
string base64 = Convert.ToBase64String(buffer);
In .NET you could use the ToBase64String method to achieve this.
Also you seem to have talked about some encoding of a string in your question, but in .NET all strings are UTF-16 encoded, so I don't quite understand what you meant by that.
Strings may be converted into sequences of bytes using a variety of encodings. Some encodings can convert any possible string to some sequence of bytes; others will only work with strings containing a limited variety of characters, but for every possible byte sequence there will exist a string that would yield it. Some encoding methods will convert any possible string into an even-length sequence of bytes, and will allow any even-length sequence of bytes to be converted back to a string, but cannot yield odd-length byte sequences. I'm not aware of any encoding methods which create a one-to-one relationship between all possible strings and all possible arbitrary-length byte sequences.
Once upon a time, strings were a convenient way of holding arbitrary byte sequences, but in .NET they may only be used as a means of holding binary data if the data is filtered so as to ensure that it doesn't contain any invalid characters or sequences. I wish there was an "immutable byte sequence" type which could be used for that purpose which used to be served by strings, but I'm unaware of one.
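In the byte-to-string direction specifically, Latin-1 (code page 28591) does round-trip every byte value, which is the property the receipt-printer answers further down also rely on; a minimal sketch:
using System;
using System.Linq;
using System.Text;

// Every byte 0-255 maps to the char with the same code point, and back.
byte[] all = Enumerable.Range(0, 256).Select(i => (byte)i).ToArray();

Encoding latin1 = Encoding.GetEncoding(28591);
string carrier = latin1.GetString(all);
byte[] back = latin1.GetBytes(carrier);

Console.WriteLine(all.SequenceEqual(back)); // True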

A weird thing in c# Encoding

I convert a byte array to a string, and then convert that string back to a byte array, but the two byte arrays are different.
As below:
byte[] tmp = Encoding.ASCII.GetBytes(Encoding.ASCII.GetString(b));
Suppose b is a byte array.
b[0]=3, b[1]=188, b[2]=2 // decimal values
Result:
tmp[0]=3, tmp[1]=63, tmp[2]=2
So that's my problem, what's wrong with it?
188 is out of range for ASCII. Characters that are not in the character set are replaced with '?' by design (would you prefer it were replaced with "¼", which is what byte 188 means in Latin-1?)
ASCII is 7-bit only, so any byte above 127 is invalid. By default, invalid bytes are replaced with ?, and that's why you get one.
For 8-bit character sets, you should be looking at either ISO 8859-1 (Latin-1, often loosely called "Extended ASCII") or code page 437 (which is often confused with Extended ASCII, but is in fact a different character set).
You can use the following code:
Encoding enc = Encoding.GetEncoding("iso-8859-1");
// For CP437, use Encoding.GetEncoding(437)
byte[] tmp = enc.GetBytes(enc.GetString(b));
The character 188 is not defined for ASCII. Instead, you're getting 63, which is a question mark.
The ASCII character set covers the range 0 to 127. You can see 188 is not in this range, so it is converted to ? (ASCII 63).
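A minimal sketch of the behavior described above; the Latin-1 lines show where the "¼" remark comes from (byte 188 is U+00BC there):
using System;
using System.Text;

byte[] b = { 3, 188, 2 };

string s = Encoding.ASCII.GetString(b); // 188 > 127, so it decodes as '?'
byte[] tmp = Encoding.ASCII.GetBytes(s);
Console.WriteLine(string.Join(", ", tmp)); // 3, 63, 2

// Latin-1 keeps every byte value intact:
string latin1 = Encoding.GetEncoding("iso-8859-1").GetString(b);
Console.WriteLine((int)latin1[1]); // 188 ('¼')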
Not every sequence of bytes is necessarily a valid sequence of encoded values for a particular encoding.
So the result of Encoding.ASCII.GetString(b) on an arbitrary array of bytes, b, is poorly defined. (And could be, for any other encoding also).
If you need to take an arbitrary byte array and obtain a sequence of characters, you might want to look into the Convert classes ToBase64String and FromBase64String. If that's not what you're trying to do, maybe explain the original problem to us.
188 isn't in the range of ASCII (7 bit), you should use Encoding.Default to get the ANSI encoding:
byte[] b = new byte[3]{ 3, 188, 2 };
byte[] tmp = Encoding.Default.GetBytes(Encoding.Default.GetString(b));

Can we simplify this string encoding code

Is it possible to simplify this code into a cleaner/faster form?
StringBuilder builder = new StringBuilder();
var encoding = Encoding.GetEncoding(936);
// convert the text into a byte array
byte[] source = Encoding.Unicode.GetBytes(text);
// convert that byte array to the new codepage.
byte[] converted = Encoding.Convert(Encoding.Unicode, encoding, source);
// take multi-byte characters and encode them as separate ascii characters
foreach (byte b in converted)
builder.Append((char)b);
// return the result
string result = builder.ToString();
Simply put, it takes a string with Chinese characters such as 鄆 and converts them to ài.
For example, that Chinese character in decimal is 37126 or 0x9106 in hex.
See http://unicodelookup.com/#0x9106/1
Converted to a byte array (little-endian UTF-16), we get [6, 145] (145 * 256 + 6 = 37126). When encoded in CodePage 936 (Simplified Chinese), we get [224, 105]. If we break that byte array down into individual characters, we get 224 = 0xE0 = à and 105 = 0x69 = i in Unicode.
See http://unicodelookup.com/#0x00e0/1
and
http://unicodelookup.com/#0x0069/1
Thus, we're doing an encoding conversion and ensuring that all characters in our output Unicode string can be represented using at most two bytes.
Update: I need this final representation because it is the format my receipt printer accepts. Took me forever to figure it out! :) Since I'm not an encoding expert, I'm looking for simpler or faster code, but the output must remain the same.
Update (Cleaner version):
return Encoding.GetEncoding("ISO-8859-1").GetString(Encoding.GetEncoding(936).GetBytes(text));
Well, for one, you don't need to convert the "built-in" string representation to a byte array before calling Encoding.Convert.
You could just do:
byte[] converted = Encoding.GetEncoding(936).GetBytes(text);
To then reconstruct a string from that byte array whereby the char values directly map to the bytes, you could do...
// Requires: using System.Text; using System.Linq;
static string MangleTextForReceiptPrinter(string text) {
    return new string(
        Encoding.GetEncoding(936)
                .GetBytes(text)
                .Select(b => (char) b)
                .ToArray());
}
I wouldn't worry too much about efficiency; how many MB/sec are you going to print on a receipt printer anyhow?
Joe pointed out that there's an encoding that directly maps byte values 0-255 to code points, and it's age-old Latin1, which allows us to shorten the function to...
return Encoding.GetEncoding("Latin1").GetString(
Encoding.GetEncoding(936).GetBytes(text)
);
By the way, if this is a buggy Windows-only API (which it is, by the looks of it), you might be dealing with code page 1252 instead (which is almost identical). You might try .NET Reflector to see what it's doing with your System.String before it sends it over the wire.
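The two code pages only differ in the range 0x80-0x9F; a minimal sketch of where they diverge:
using System;
using System.Text;

byte[] b = { 0x80 };

// Latin-1 maps 0x80 to the C1 control character U+0080...
Console.WriteLine((int)Encoding.GetEncoding("ISO-8859-1").GetString(b)[0]); // 128

// ...while Windows-1252 maps it to the euro sign U+20AC.
Console.WriteLine(Encoding.GetEncoding(1252).GetString(b)); // €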
Almost anything would be cleaner than this - you're really abusing text here, IMO. You're trying to represent effectively opaque binary data (the encoded text) as text data... so you'll potentially get things like bell characters, escapes etc.
The normal way of encoding opaque binary data in text is base64, so you could use:
return Convert.ToBase64String(Encoding.GetEncoding(936).GetBytes(text));
The resulting text will be entirely ASCII, which is much less likely to cause you hassle.
EDIT: If you need that output, I would strongly recommend that you represent it as a byte array instead of as a string... pass it around as a byte array from that point onwards, so you're not tempted to perform string operations on it.
Does your receipt printer have an API that accepts a byte array rather than a string?
If so you may be able to simplify the code to a single conversion, from a Unicode string to a byte array using the encoding used by the receipt printer.
Also, if you want to convert an array of bytes to a string whose character values correspond 1-1 to the values of the bytes, you can use the code page 28591 aka Latin1 aka ISO-8859-1.
I.e., the following
foreach (byte b in converted)
builder.Append((char)b);
string result = builder.ToString();
can be replaced by:
// All three of the following are equivalent
// string result = Encoding.GetEncoding(28591).GetString(converted);
// string result = Encoding.GetEncoding("ISO-8859-1").GetString(converted);
string result = Encoding.GetEncoding("Latin1").GetString(converted);
Latin1 is a useful encoding when you want to encode binary data in a string, e.g. to send through a serial port.
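For the serial-port case, a hedged sketch; the port name is hypothetical, and it relies on SerialPort.Write(string) encoding via the port's Encoding property (which defaults to ASCII):
using System.IO.Ports;
using System.Text;

var port = new SerialPort("COM1"); // hypothetical port name
port.Encoding = Encoding.GetEncoding(28591); // Latin1: chars 0-255 map 1:1 to bytes
port.Open();
port.Write("\u0002payload\u0003"); // bytes go out exactly as the char values
port.Close();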
