Is it possible to get a character/string encoding of bytes greater than 0x7F in C#?
At the moment I get 0x3F (?) from any of these bytes greater than 0x7F, and I imagine that's an error character because there is no corresponding character.
I need to build a byte[], and unfortunately (due to code structure I can't control) it must be passed around my program as a string, with each character representing one byte. My byte[] needs some of its bytes to be greater than 0x7F, but the string can't handle these characters. The values to be encoded as characters are just ints, nothing special, but they are in the range 0-255.
Example:
Say I want my byte[] to be 3 bytes: {0x2E, 0x55, 0x8D}.
I want my string representation of this to be something like ".U\x8D",
but instead I get ".U?", which translates back to an incorrect byte array.
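A minimal sketch reproducing the problem, assuming the round trip goes through an ASCII-style encoding (the question doesn't say which encoding is actually in play):
byte[] data = { 0x2E, 0x55, 0x8D };
string s = Encoding.ASCII.GetString(data); // ".U?" - 0x8D has no ASCII mapping
byte[] back = Encoding.ASCII.GetBytes(s);  // { 0x2E, 0x55, 0x3F } - data corrupted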
If you can live with the 33% storage overhead, the simplest solution seems to be Base64 encoding:
byte[] bytes = {0x2e, 0x55, 0x8d};
string str = Convert.ToBase64String(bytes);
byte[] bytes2 = Convert.FromBase64String(str);
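For the example bytes {0x2E, 0x55, 0x8D}, str comes out as "LlWN", which is plain ASCII and survives being passed around as a string.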
I don't know if it is possible to make it work the way you would like. A string is a sequence of Unicode characters, and Unicode 0x8D is REVERSE LINE FEED, a control character, so in a string it ends up represented as ?. The same goes for other control characters (like 0x01, 0x02, ...).
If you have to send it as a string, you can try a comma-separated list of the byte values (1,2,141,255); it might be the best solution for your problem.
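A minimal sketch of that comma-separated approach (assuming the values only need to survive a round trip through a string; requires System.Linq):
byte[] data = { 0x2E, 0x55, 0x8D };
string asText = string.Join(",", data); // "46,85,141"
byte[] restored = asText.Split(',').Select(byte.Parse).ToArray(); // { 0x2E, 0x55, 0x8D }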
I'm trying to convert some strings from UTF-16 LE to UTF-16 BE, but it fails to encode the second Chinese character.
Sample string: test馨俞
Code:
byte[] bytes = Encoding.Unicode.GetBytes(sendMsg.Text);
sendMsg.Text = Encoding.BigEndianUnicode.GetString(bytes);
I've also tried
var encode = new UnicodeEncoding(false, true, true);
var messageAsBytes = encode.GetBytes(sendMsg.Text);
var enc = new UnicodeEncoding(true, true, true);
sendMsg.Text = enc.GetString(messageAsBytes);
Which results in the following error: "Unable to translate bytes [DE][4F] at index 184 from specified code page to Unicode" on the line:
sendMsg.Text = enc.GetString(messageAsBytes);
Thanks.
I think you should process your input string with the BigEndianUnicode class.
I made this code from the one you provided. It works fine, without error:
String input = "馨俞";
var messageAsBytes = Encoding.BigEndianUnicode.GetBytes(input);
input = Encoding.BigEndianUnicode.GetString(messageAsBytes);
If I process "input" with Encoding.Unicode and print out both byte arrays (the one produced with Unicode and the one with BigEndianUnicode), the output shows the difference: the two bytes of each character appear in opposite order.
So, input is converted to the endianness you need.
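A small sketch of that comparison (the hex values assume 馨 is U+99A8 and 俞 is U+4FDE, which matches the [DE][4F] pair in the error message):
String input = "馨俞";
byte[] le = Encoding.Unicode.GetBytes(input);          // A8-99-DE-4F (little endian)
byte[] be = Encoding.BigEndianUnicode.GetBytes(input); // 99-A8-4F-DE (big endian)
Console.WriteLine(BitConverter.ToString(le));
Console.WriteLine(BitConverter.ToString(be));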
The result of encoding a string is a byte array, not another string.
Just use
byte[] bytes = Encoding.BigEndianUnicode.GetBytes(sendMsg.Text);
to encode the string to bytes using the UTF 16 BE encoding.
Then send those bytes to the mainframe.
How you send those bytes to the mainframe may be the topic of another question, but it sounds like you somehow need to present those encoded bytes in a variable of type string. That sounds like a bug in the library you are using. We would need to understand the nature of that library and its possible bug to find a workaround. One option you could try, but it's a shot in the dark, is this:
string toSend = Encoding.Default.GetString(bytes);
That will produce a string where each character is the representation of one byte from the encoded string, in UTF-16 BE order. Its length will be double the length of the original string.
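A less machine-dependent variant of the same trick (Latin1, which also comes up later in this thread) maps every byte value 0-255 to exactly one character:
string toSend = Encoding.GetEncoding(28591).GetString(bytes); // 28591 = ISO-8859-1 / Latin1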
I got it working by setting this property without any conversion.
sendMsg.SetIntProperty(XMSC.JMS_IBM_CHARACTER_SET, 1201);
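That works because CCSID 1201 identifies UTF-16 big-endian, so the messaging layer performs the conversion itself and no manual re-encoding is needed.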
I'm trying to cast a char array into a byte array.
char[] cc = new char[] { ((char)(byte)210) }; // Count = 1
byte[] b = System.Text.Encoding.UTF8.GetBytes(cc); // Count = 2
The conversion results in 2 entries in my byte array: {195, 146}.
I guess there's a problem with the encoding. Any help is appreciated.
After facing some problems, I've written these two lines just for testing purposes, so don't mind the style.
Thanks
UTF-8 can use more than one byte to store a character. It uses a single byte for the ASCII characters in the range 0-127; other characters need two or more bytes.
You are encoding character 210, which falls in the extended ASCII range (numeric value > 127), so UTF-8 uses two bytes to store it.
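A quick sketch of what happens: (char)210 is the code point U+00D2, whose UTF-8 form is exactly the two bytes seen above:
char c = (char)210;                             // 'Ò', U+00D2
byte[] utf8 = Encoding.UTF8.GetBytes(new[] { c });
Console.WriteLine(BitConverter.ToString(utf8)); // C3-92, i.e. { 195, 146 }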
As M.kazemAkhgary said in the comments above:
cc.Select(c => (byte)c).ToArray();
The trick was to cast instead of converting. Thanks for that!
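For completeness, a self-contained sketch of that cast-based approach (requires System.Linq):
char[] cc = { (char)210 };
byte[] b = cc.Select(c => (byte)c).ToArray(); // { 210 } - one byte, value preserved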
Good day!
I convert a binary file into a char array:
var bytes = File.ReadAllBytes(@"file.wav");
char[] outArr = new char[(int)(Math.Ceiling((double)bytes.Length / 3) * 4)];
var result = Convert.ToBase64CharArray(bytes, 0, bytes.Length, outArr, 0, Base64FormattingOptions.None);
string resStr = new string(outArr);
So, is it little endian?
And does it convert to UTF-8?
Thank you!
You don't have any UTF-8 here - and UTF-8 doesn't have an endianness anyway, as its code unit size is just a single byte.
Your code would be simpler as:
var bytes = File.ReadAllBytes(@"file.wav");
string base64 = Convert.ToBase64String(bytes);
If you then write the string to a file, that would have an encoding, which could easily be UTF-8 (and will be by default), but again there's no endianness to worry about.
Note that as base64 text is always in ASCII, each character within a base64 string will take up a single byte in UTF-8 anyway. Even if UTF-8 did have different representations for multi-byte values, it wouldn't be an issue here.
A C# char represents a UTF-16 code unit. So there is no UTF-8 here.
Since .NET is little endian, and since char is two bytes wide, the char array (and the string) is stored in little-endian byte order.
If you want to convert your byte array to base64 and then encode as UTF-8 do it like this:
byte[] base64utf8 = Encoding.UTF8.GetBytes(Convert.ToBase64String(bytes));
If you wish to save the base64 text to a file, encoded as UTF-8, you could do that like so:
File.WriteAllText(filename, Convert.ToBase64String(bytes), Encoding.UTF8);
Since UTF-8 is a byte oriented encoding, endianness is not an issue.
I convert a byte array to a string, and then convert that string back to a byte array.
The two byte arrays are different.
As below:
byte[] tmp = Encoding.ASCII.GetBytes(Encoding.ASCII.GetString(b));
Suppose b is a byte array.
b[0]=3, b[1]=188, b[2]=2 //decimal system
Result:
tmp[0]=3, tmp[1]=63, tmp[2]=2
So that's my problem, what's wrong with it?
188 is out of range for ASCII. Characters that are not in the corresponding character set are replaced with '?' by design (would you prefer "¼", which is what byte 188 means in Latin-1?).
ASCII is 7-bit only, so anything above 127 is invalid. By default, invalid bytes are replaced with ?, and that's why you get a ?.
For 8-bit character sets, you should be looking at either extended ASCII (later standardized as ISO 8859-1) or code page 437 (which is often confused with extended ASCII but is in fact different).
You can use the following code:
Encoding enc = Encoding.GetEncoding("iso-8859-1");
// For CP437, use Encoding.GetEncoding(437)
byte[] tmp = enc.GetBytes(enc.GetString(b));
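If you'd rather fail loudly than silently get '?', you can also ask for an exception-throwing fallback (a sketch; not required for the fix above):
Encoding strictAscii = Encoding.GetEncoding(
    "us-ascii",
    new EncoderExceptionFallback(),
    new DecoderExceptionFallback());
// strictAscii.GetString(new byte[] { 3, 188, 2 }) now throws a
// DecoderFallbackException instead of silently producing '?'.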
The character 188 is not defined in ASCII. Instead, you're getting 63, which is a question mark.
The ASCII character set covers the range 0 to 127, so 188 is not in this range and is converted to ? (= ASCII 63).
Not every sequence of bytes is necessarily a valid sequence of encoded values for a particular encoding.
So the result of Encoding.ASCII.GetString(b) on an arbitrary array of bytes, b, is poorly defined. (The same goes for any other encoding.)
If you need to take an arbitrary byte array and obtain a sequence of characters, you might want to look into the Convert classes ToBase64String and FromBase64String. If that's not what you're trying to do, maybe explain the original problem to us.
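A short sketch of that base64 round trip for the bytes in the question:
byte[] b = { 3, 188, 2 };
string s = Convert.ToBase64String(b);     // "A7wC" - plain ASCII
byte[] tmp = Convert.FromBase64String(s); // { 3, 188, 2 } - intact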
188 isn't in the range of ASCII (7-bit); you should use Encoding.Default to get the ANSI encoding:
byte[] b = new byte[3]{ 3, 188, 2 };
byte[] tmp = Encoding.Default.GetBytes(Encoding.Default.GetString(b));
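Note that Encoding.Default depends on the machine's active ANSI code page, so whether every byte value round-trips intact can vary from system to system; the ISO-8859-1 approach above is the safer bet when all 256 byte values must survive.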
Is it possible to simplify this code into a cleaner/faster form?
StringBuilder builder = new StringBuilder();
var encoding = Encoding.GetEncoding(936);
// convert the text into a byte array
byte[] source = Encoding.Unicode.GetBytes(text);
// convert that byte array to the new codepage.
byte[] converted = Encoding.Convert(Encoding.Unicode, encoding, source);
// take multi-byte characters and encode them as separate ascii characters
foreach (byte b in converted)
builder.Append((char)b);
// return the result
string result = builder.ToString();
Simply put, it takes a string with Chinese characters such as 鄆 and converts them to ài.
For example, that Chinese character in decimal is 37126 or 0x9106 in hex.
See http://unicodelookup.com/#0x9106/1
Converted to a byte array, we get [145, 6] (145 * 256 + 6 = 37126). When encoded in code page 936 (Simplified Chinese), we get [224, 105]. If we break this byte array down into individual characters, we get 224 = 0xE0 = à and 105 = 0x69 = i in Unicode.
See http://unicodelookup.com/#0x00e0/1
and
http://unicodelookup.com/#0x0069/1
Thus, we're doing an encoding conversion and ensuring that all characters in our output Unicode string can be represented using at most two bytes.
Update: I need this final representation because this is the format my receipt printer is accepting. Took me forever to figure it out! :) Since I'm not an encoding expert, I'm looking for simpler or faster code, but the output must remain the same.
Update (Cleaner version):
return Encoding.GetEncoding("ISO-8859-1").GetString(Encoding.GetEncoding(936).GetBytes(text));
Well, for one, you don't need to convert the "built-in" string representation to a byte array before calling Encoding.Convert.
You could just do:
byte[] converted = Encoding.GetEncoding(936).GetBytes(text);
To then reconstruct a string from that byte array whereby the char values directly map to the bytes, you could do...
static string MangleTextForReceiptPrinter(string text) {
return new string(
Encoding.GetEncoding(936)
.GetBytes(text)
.Select(b => (char) b)
.ToArray());
}
I wouldn't worry too much about efficiency; how many MB/sec are you going to print on a receipt printer anyhow?
Joe pointed out that there's an encoding that directly maps byte values 0-255 to code points, and it's age-old Latin1, which allows us to shorten the function to...
return Encoding.GetEncoding("Latin1").GetString(
Encoding.GetEncoding(936).GetBytes(text)
);
By the way, if this is a buggy Windows-only API (which it is, by the looks of it), you might be dealing with code page 1252 instead (which is almost identical). You might try Reflector to see what it's doing with your System.String before it sends it over the wire.
Almost anything would be cleaner than this - you're really abusing text here, IMO. You're trying to represent effectively opaque binary data (the encoded text) as text data... so you'll potentially get things like bell characters, escapes etc.
The normal way of encoding opaque binary data in text is base64, so you could use:
return Convert.ToBase64String(Encoding.GetEncoding(936).GetBytes(text));
The resulting text will be entirely ASCII, which is much less likely to cause you hassle.
EDIT: If you need that output, I would strongly recommend that you represent it as a byte array instead of as a string... pass it around as a byte array from that point onwards, so you're not tempted to perform string operations on it.
Does your receipt printer have an API that accepts a byte array rather than a string?
If so you may be able to simplify the code to a single conversion, from a Unicode string to a byte array using the encoding used by the receipt printer.
Also, if you want to convert an array of bytes to a string whose character values correspond 1-1 to the values of the bytes, you can use the code page 28591 aka Latin1 aka ISO-8859-1.
I.e., the following
foreach (byte b in converted)
builder.Append((char)b);
string result = builder.ToString();
can be replaced by:
// All three of the following are equivalent
// string result = Encoding.GetEncoding(28591).GetString(converted);
// string result = Encoding.GetEncoding("ISO-8859-1").GetString(converted);
string result = Encoding.GetEncoding("Latin1").GetString(converted);
Latin1 is a useful encoding when you want to encode binary data in a string, e.g. to send through a serial port.