Can we simplify this string encoding code - c#

Is it possible to simplify this code into a cleaner/faster form?
StringBuilder builder = new StringBuilder();
var encoding = Encoding.GetEncoding(936);
// convert the text into a byte array
byte[] source = Encoding.Unicode.GetBytes(text);
// convert that byte array to the new codepage.
byte[] converted = Encoding.Convert(Encoding.Unicode, encoding, source);
// take multi-byte characters and encode them as separate ascii characters
foreach (byte b in converted)
    builder.Append((char)b);
// return the result
string result = builder.ToString();
Simply put, it takes a string with Chinese characters such as 鄆 and converts them to ài.
For example, that Chinese character in decimal is 37126 or 0x9106 in hex.
See http://unicodelookup.com/#0x9106/1
Converted to a byte array, we get [145, 6] (145 * 256 + 6 = 37126). When encoded in code page 936 (Simplified Chinese), we get [224, 105]. If we break this byte array down into individual characters, we get 224 = 0xE0 = à and 105 = 0x69 = i in Unicode.
See http://unicodelookup.com/#0x00e0/1
and
http://unicodelookup.com/#0x0069/1
Thus, we're doing an encoding conversion and ensuring that all characters in our output Unicode string can be represented using at most two bytes.
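The arithmetic above can be checked with a few lines of C#. This is only a sketch: the escape "\u9106" stands for the character in the question, and the two encodings are printed side by side because Encoding.Unicode is little-endian, so it yields the same two bytes in the opposite order from the [145, 6] written above.

```csharp
using System;
using System.Text;

string ch = "\u9106"; // the character from the question

// Code point as a decimal number.
Console.WriteLine((int)ch[0]); // 37126

// Big-endian UTF-16 gives the bytes in the order used in the text above.
Console.WriteLine(string.Join(",", Encoding.BigEndianUnicode.GetBytes(ch))); // 145,6

// Encoding.Unicode is little-endian, so the same bytes come out swapped.
Console.WriteLine(string.Join(",", Encoding.Unicode.GetBytes(ch))); // 6,145
```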
Update: I need this final representation because this is the format my receipt printer is accepting. Took me forever to figure it out! :) Since I'm not an encoding expert, I'm looking for simpler or faster code, but the output must remain the same.
Update (Cleaner version):
return Encoding.GetEncoding("ISO-8859-1").GetString(Encoding.GetEncoding(936).GetBytes(text));

Well, for one, you don't need to convert the "built-in" string representation to a byte array before calling Encoding.Convert.
You could just do:
byte[] converted = Encoding.GetEncoding(936).GetBytes(text);
To then reconstruct a string from that byte array whereby the char values directly map to the bytes, you could do...
static string MangleTextForReceiptPrinter(string text) {
    return new string(
        Encoding.GetEncoding(936)
            .GetBytes(text)
            .Select(b => (char) b)
            .ToArray());
}
I wouldn't worry too much about efficiency; how many MB/sec are you going to print on a receipt printer anyhow?
Joe pointed out that there's an encoding that directly maps byte values 0-255 to code points, and it's age-old Latin1, which allows us to shorten the function to...
return Encoding.GetEncoding("Latin1").GetString(
    Encoding.GetEncoding(936).GetBytes(text)
);
By the way, if this is a buggy Windows-only API (which it is, by the looks of it), you might be dealing with code page 1252 instead (which is almost identical to Latin1). You might try Reflector to see what it's doing with your System.String before it sends it over the wire.
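To check that the shortened version really produces the same output as the char-by-char cast, something like the following sketch could be used. The helper names ViaCast and ViaLatin1 and the sample text are made up for illustration; the RegisterProvider call is only needed on .NET Core / .NET 5+, where code page 936 is not available by default.

```csharp
using System;
using System.Linq;
using System.Text;

// On .NET Core / .NET 5+, code page 936 must be registered first;
// on .NET Framework this line can be dropped.
Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

// Original approach: one char per encoded byte (hypothetical helper name).
string ViaCast(string text) =>
    new string(Encoding.GetEncoding(936).GetBytes(text).Select(b => (char)b).ToArray());

// Shortened approach: decode the same bytes as ISO-8859-1 (code page 28591).
string ViaLatin1(string text) =>
    Encoding.GetEncoding(28591).GetString(Encoding.GetEncoding(936).GetBytes(text));

string sample = "Total: \u9106"; // hypothetical sample containing the character from the question

Console.WriteLine(ViaCast(sample) == ViaLatin1(sample)); // True
```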

Almost anything would be cleaner than this - you're really abusing text here, IMO. You're trying to represent effectively opaque binary data (the encoded text) as text data... so you'll potentially get things like bell characters, escapes etc.
The normal way of encoding opaque binary data in text is base64, so you could use:
return Convert.ToBase64String(Encoding.GetEncoding(936).GetBytes(text));
The resulting text will be entirely ASCII, which is much less likely to cause you hassle.
EDIT: If you need that output, I would strongly recommend that you represent it as a byte array instead of as a string... pass it around as a byte array from that point onwards, so you're not tempted to perform string operations on it.
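A minimal sketch of the "keep it as bytes" idea follows. It assumes the printer can be reached through some Stream, which the question does not confirm; EncodeForPrinter is a hypothetical helper and a MemoryStream stands in for the real device.

```csharp
using System;
using System.IO;
using System.Text;

// Needed on .NET Core / .NET 5+ for code page 936; can be dropped on .NET Framework.
Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

// Encode once, at the boundary, and keep the result as bytes (hypothetical helper).
byte[] EncodeForPrinter(string text) => Encoding.GetEncoding(936).GetBytes(text);

// Hypothetical transport: any Stream (serial port, socket, file) would do;
// a MemoryStream stands in for the printer here.
using var printer = new MemoryStream();
byte[] payload = EncodeForPrinter("Total: 12.50");
printer.Write(payload, 0, payload.Length);

Console.WriteLine(printer.Length); // number of bytes handed to the "printer"
```

Because the data never becomes a string again, nothing downstream can accidentally trim, normalize or re-encode it.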

Does your receipt printer have an API that accepts a byte array rather than a string?
If so you may be able to simplify the code to a single conversion, from a Unicode string to a byte array using the encoding used by the receipt printer.
Also, if you want to convert an array of bytes to a string whose character values correspond 1-1 to the values of the bytes, you can use the code page 28591 aka Latin1 aka ISO-8859-1.
I.e., the following
foreach (byte b in converted)
    builder.Append((char)b);
string result = builder.ToString();
can be replaced by:
// All three of the following are equivalent
// string result = Encoding.GetEncoding(28591).GetString(converted);
// string result = Encoding.GetEncoding("ISO-8859-1").GetString(converted);
string result = Encoding.GetEncoding("Latin1").GetString(converted);
Latin1 is a useful encoding when you want to encode binary data in a string, e.g. to send through a serial port.
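The 1-1 correspondence can be verified for every byte value with a short sketch like this one (variable names are illustrative only):

```csharp
using System;
using System.Linq;
using System.Text;

Encoding latin1 = Encoding.GetEncoding(28591); // ISO-8859-1

// Every possible byte value, 0x00 through 0xFF.
byte[] allBytes = Enumerable.Range(0, 256).Select(i => (byte)i).ToArray();

string asText = latin1.GetString(allBytes);
byte[] roundTripped = latin1.GetBytes(asText);

// Each char's code point equals the byte it came from...
bool oneToOne = asText.Select((c, i) => c == (char)i).All(ok => ok);
// ...and encoding again restores the original bytes.
bool lossless = roundTripped.SequenceEqual(allBytes);

Console.WriteLine($"{oneToOne} {lossless}"); // True True
```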

Related

c# UTF8 GetString from bytes array not equal to php chr function

I'm trying to write a decoder. The original system runs on .NET 4.7 and I'm trying to migrate it to PHP, but I'm having trouble converting bytes. As far as I understand, strings in C# are UTF-16LE by default, and I assumed that PHP's ord and chr functions work in UCS-2. I want to do the following, but I don't get the same result; the code is below. What can I do to fix this? Thanks in advance.
XOR Encoded Text Bytes = [101,107,217,78,40,68,234,218,162,67,139,81,44,166,24,148];
on C#
string result = System.Text.Encoding.UTF8.GetString(destinationArray);
On PHP
$tmpStr = "";
for ($i = 0; $i < sizeof($encoded); $i++) {
    echo "\t".$encoded[$i]." => ".chr($encoded[$i])."\n";
    $tmpStr .= chr($encoded[$i]);
}
C# Result size=26:
ek�N(D�ڢC�Q,��
PHP Result size=16:
ek�N(D�ڢC�Q,��
The strings look the same, but the byte translation is quite different.
C# Result to Bytes array:
byte[] utf8 = System.Text.Encoding.Unicode.GetBytes(result);
Console.WriteLine(string.Join("-", utf8));
response =
101-0-107-0-253-255-78-0-40-0-68-0-253-255-162-6-67-0-253-255-81-0-44-0-253-255-24-0-253-255
PHP Result to Bytes Array:
echo implode("-",unpack("C*", $tmpStr));
response = 101-107-217-78-40-68-234-218-162-67-139-81-44-166-24-148
If the PHP result is converted to UTF-16LE, the result is different again:
echo implode("-",unpack("C*", mb_convert_encoding($tmpStr,'UTF-16le')));
response =
101-0-107-0-63-0-78-0-40-0-68-0-63-0-162-6-67-0-63-0-81-0-44-0-63-0-24-0-63-0
You are mixing quite different things here.
First, in the C# code, you are not using the same encoding when converting from bytes to a string and then from a string back to bytes: Encoding.UTF8 in the first case and Encoding.Unicode (which is the .NET name for little-endian UTF-16) in the latter. Things cannot go well if you do this. And by the way, I'm not sure that PHP's UCS-2 is equivalent to UTF-16:
UTF-8 encodes characters on 1, 2, 3 or 4 bytes depending on the character
UTF-16 encodes characters on 2 or 4 bytes depending on the character
UCS-2 always encodes characters on 2 bytes, and hence cannot encode more than 65536 characters...
Then what you pass to the 'bytes to string' conversions is not necessarily valid! Because you've XORed the input data (I assume it to be some secret string), the resulting bytes may or may not be a valid sequence in some encodings. For example:
It is not valid in ASCII because you have (in your example) bytes > 127
It is not valid in UTF-8 because 217 followed by 78 is not recognized as a 1-, 2-, 3-, or 4-byte character by UTF-8; hence the � you see before the N.
It seems to be invalid UTF-16 as well, but roundtripping works (I could get back the original array using .NET's Unicode.GetString, then Unicode.GetBytes). If I remove your last byte, and end up with an odd number of bytes, then UTF-16 roundtripping does not work any more...
Although I did not test it, it should also be invalid UCS-2 because UCS-2 'looks like' UTF-16 for 2-byte characters.
Roundtripping works with ANSI encodings such as windows-1252 because these encodings accept any byte. However, I would discourage using such a trick because you have to be sure the same code page is used on both sides of the encoding/decoding process.
Therefore, I think, in your case, the best way to store your XORed bytes into a string would be to convert the array to base64. In C# you can do it this way:
// For the 16-byte array in the question, this gives ZWvZTihE6tqiQ4tRLKYYlA==
var converted = Convert.ToBase64String(array);
// And this one gives you back the initial array
var bytes = Convert.FromBase64String(converted);
Quick googling will tell you to use base64_encode and base64_decode in PHP.
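To make the difference concrete, here is a small sketch using the bytes from the question. It only relies on standard framework calls; the variable names are illustrative.

```csharp
using System;
using System.Linq;
using System.Text;

// The XOR-encoded bytes from the question.
byte[] data = { 101, 107, 217, 78, 40, 68, 234, 218, 162, 67, 139, 81, 44, 166, 24, 148 };

// UTF-8 decoding replaces invalid sequences with U+FFFD, so the original bytes are lost.
string viaUtf8 = Encoding.UTF8.GetString(data);
bool utf8RoundTrips = Encoding.UTF8.GetBytes(viaUtf8).SequenceEqual(data);

// Base64 is defined for arbitrary bytes and always round-trips.
string viaBase64 = Convert.ToBase64String(data);
bool base64RoundTrips = Convert.FromBase64String(viaBase64).SequenceEqual(data);

Console.WriteLine($"UTF-8: {utf8RoundTrips}, Base64: {base64RoundTrips}"); // UTF-8: False, Base64: True
```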
Bottom note: if you really want to understand what's going on with all this encoding stuff, here is the must-read blog post on the subject: https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/

Handle Non-UTF-8 Characters in Byte Array

I have an array of bytes which contains some characters that are not UTF-8. These characters cannot be deserialized using UTF-8 encoding. So, my question is, how can I handle these characters and make the string readable in whatever language it is.
For example, if I have an array:
byte[] b = myArrayWithNonUTF8Characters;
And I try to deserialize the array with:
DataContractJsonSerializer jsonSerializer = new DataContractJsonSerializer(typeof(MyObject));
MyObject objResponse = (MyObject)jsonSerializer.ReadObject(new MemoryStream(b));
Then I get an error that the array contains invalid UTF8 bytes.
Any way to make this work?
PS: Please, do not give me this answer: string s = System.Text.Encoding.UTF8.GetString(b, 0, b.Length); It will only return symbols replacing the non-UTF-8 characters.
The beauty of UTF is that it encodes characters in most languages; so you can have Greek and Japanese in the same character stream.
Without Unicode, your entire stream (or, in your case, array) must use a single code page, which typically covers one language or script. In a single-byte code page, each character is represented by one byte, but which character that byte stands for is determined by the code page (see http://en.wikipedia.org/wiki/Code_page for more details).
For example, if your text was written in Greek, you might use code page 1253 (Windows Greek):
System.Text.Encoding.GetEncoding(1253)
In short, you need to know which code page the text was encoded with.
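As a rough illustration of why the code page matters, the sketch below decodes the same three bytes twice. The byte values and the choice of windows-1253 (Greek) and windows-1252 (Western European) are assumptions made for the example, not something taken from the question.

```csharp
using System;
using System.Text;

// Needed on .NET Core / .NET 5+ for Windows code pages; can be dropped on .NET Framework.
Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

// Hypothetical input: three bytes as a windows-1253 (Greek) system would store "αβγ".
byte[] raw = { 0xE1, 0xE2, 0xE3 };

string asGreek = Encoding.GetEncoding(1253).GetString(raw);   // Greek code page
string asWestern = Encoding.GetEncoding(1252).GetString(raw); // Western European code page

Console.WriteLine(asGreek);   // αβγ
Console.WriteLine(asWestern); // áâã
```

The bytes are identical in both calls; only the code page chosen for decoding changes the text.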

Why am I getting two different 'formats' of hex in my bytes while evaluating an HMAC?

I'm getting a signed payload from an authentication source that comes in a base64 encoded and URL encoded format. I'm getting confused somewhere while evaluating, and ending up with similar data in different 'formats'.
Here's my code:
//Split the message to payload and signature
string[] split = raw_message.Split('.');
//Payload
string base64_payload = WebUtility.UrlDecode(split[0]);
byte[] payload = Convert.FromBase64String(base64_payload);
//Expected signature
string base64_expected_sig = WebUtility.UrlDecode(split[1]);
byte[] expected_sig = Convert.FromBase64String(base64_expected_sig);
//Signature
byte[] signature = hmacsha256.ComputeHash(payload);
//Output as a string
var foo = System.Text.Encoding.UTF8.GetString(expected_sig);
var bar = BitConverter.ToString(signature);
The expected signature (foo) comes out like so:
76eba09fcb54877299dcbd1e1e35717e3bd42e066e7ecdb131c7d0161dec3418
The computed signature (bar) is as follows:
76-EB-A0-9F-CB-54-87-72-99-DC-BD-1E-1E-35-71-7E-3B-D4-2E-06-6E-7E-CD-B1-31-C7-D0-16-1D-EC-34-18
Obviously, when comparing byte for byte, this doesn't work.
I see that I'm having to convert the expected_sig and the signature in different ways to get them to display as a string, but I can't figure out how I need to change the expected signature to get to where I can compare byte for byte.
I can obviously work around the issue by simply converting the string bar, but that's dirty and I just don't like it.
Where am I going wrong here? What am I not understanding?
The good news is that the hash computation appears to be working.
The bad news is that you're receiving the hash in a brain-dead fashion. For some reason it seems that the authors decided it was a good idea to:
Compute the hash (fine)
Convert this binary data to text as hex (fine)
Convert the hex back into binary data by applying ASCII/UTF-8/anything-ASCII-compatible encoding (why?)
Convert the result back into text using base64 (what?)
URL-encode the result (which wouldn't even be necessary with hex...)
Using either base64 or hex on the original binary makes sense, but applying both is crazy.
Anyway, it's fairly easy for you to do the same thing. For example:
string hexSignature = string.Join("", signature.Select(b => b.ToString("x2")));
byte[] hexSignatureUtf8 = Encoding.UTF8.GetBytes(hexSignature);
string finalSignature = Convert.ToBase64String(hexSignatureUtf8);
That should now match WebUtility.UrlDecode(split[1]).
Alternatively, you can work backwards from what's in the result, but I wouldn't go as far as parsing the hex back to bytes - it would be simpler to keep the first line of the above, but use:
string expectedHexBase64 = WebUtility.UrlDecode(split[1]);
byte[] expectedHexUtf8 = Convert.FromBase64String(expectedHexBase64);
string expectedHex = Encoding.UTF8.GetString(expectedHexUtf8);
Then compare it with hexSignature.
Ideally, you should talk to whoever's providing you with the crazy format and hit them with a cluestick though...
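Putting the pieces together, an end-to-end check might look roughly like this. The key, the payload and the ToHex helper are invented for the example, and the "sender side" simply reproduces the layering described above so that the sketch is self-contained.

```csharp
using System;
using System.Linq;
using System.Net;
using System.Security.Cryptography;
using System.Text;

// Hypothetical key and payload, used only to build a message in the format described.
byte[] key = Encoding.UTF8.GetBytes("demo-key");
byte[] payload = Encoding.UTF8.GetBytes("{\"user\":42}");

// Hypothetical helper: lowercase hex, two digits per byte.
string ToHex(byte[] bytes) => string.Concat(bytes.Select(b => b.ToString("x2")));

using var hmac = new HMACSHA256(key);

// Sender side: hash -> hex text -> UTF-8 bytes -> base64 -> URL-encode.
string sentSig = WebUtility.UrlEncode(
    Convert.ToBase64String(Encoding.UTF8.GetBytes(ToHex(hmac.ComputeHash(payload)))));
string rawMessage = WebUtility.UrlEncode(Convert.ToBase64String(payload)) + "." + sentSig;

// Receiver side: undo the outer layers, then compare hex with hex.
string[] split = rawMessage.Split('.');
byte[] receivedPayload = Convert.FromBase64String(WebUtility.UrlDecode(split[0]));
string expectedHex = Encoding.UTF8.GetString(
    Convert.FromBase64String(WebUtility.UrlDecode(split[1])));
string actualHex = ToHex(hmac.ComputeHash(receivedPayload));

bool valid = string.Equals(expectedHex, actualHex, StringComparison.OrdinalIgnoreCase);
Console.WriteLine(valid); // True
```

For real signature checks a constant-time comparison (for example CryptographicOperations.FixedTimeEquals on .NET Core 2.1 or later) is usually preferred over a plain string comparison.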

Can any byte array be converted to a string?

Can any byte array be converted to a string? Or are there some byte values that are not available or cannot be converted to characters, depending on the encoding of the string?
You should only try to convert byte arrays to strings if they started as text. If the byte array is actually the contents of an image file, or a video, or maybe encoded or compressed data, you should not try to convert it straight to a string using an encoding. Doing so almost always goes badly in the end: with ISO-8859-1 you might be okay, but it's fundamentally a bad idea, and you really shouldn't do it.
Instead, you should use Convert.ToBase64String to convert it to Base64, or perhaps convert it to hex instead.
If you do use Base64, you'd use Convert.FromBase64String to convert back from text to a byte array.
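For the hex alternative, one possible sketch is shown below. The sample bytes are arbitrary, and the manual parsing is just one straightforward way to do it, not the only one.

```csharp
using System;
using System.Linq;

// Arbitrary binary data, including values that are not valid text in most encodings.
byte[] buffer = { 0x00, 0x07, 0x1B, 0x80, 0xFF };

// Bytes -> hex text: two lowercase hex digits per byte.
string hex = string.Concat(buffer.Select(b => b.ToString("x2")));

// Hex text -> bytes: parse each pair of digits back into one byte.
byte[] decoded = Enumerable.Range(0, hex.Length / 2)
    .Select(i => Convert.ToByte(hex.Substring(i * 2, 2), 16))
    .ToArray();

Console.WriteLine(hex);                           // 00071b80ff
Console.WriteLine(decoded.SequenceEqual(buffer)); // True
```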
Can any byte array be converted to a string?
Base64 seems like an appropriate representation of a byte array:
byte[] buffer = ...
string base64 = Convert.ToBase64String(buffer);
In .NET you could use the ToBase64String method to achieve this.
Also you seem to have talked about some encoding of a string in your question, but in .NET all strings are UTF-16 encoded, so I don't quite understand what you meant by that.
Strings may be converted into sequences of bytes using a variety of encodings. Some encodings can convert any possible string to some sequence of bytes; others will only work with strings containing a limited variety of characters, but for every possible byte sequence there will exist a string that would yield it. Some encoding methods will convert any possible string into an even-length sequence of bytes, and will allow any even-length sequence of bytes to be converted back to a string, but cannot yield odd-length byte sequences. I'm not aware of any encoding methods which create a one-to-one relationship between all possible strings and all possible arbitrary-length byte sequences.
Once upon a time, strings were a convenient way of holding arbitrary byte sequences, but in .NET they may only be used as a means of holding binary data if the data is filtered so as to ensure that it doesn't contain any invalid characters or sequences. I wish there was an "immutable byte sequence" type which could be used for that purpose which used to be served by strings, but I'm unaware of one.

Ruby equivalent to .NET's Encoding.ASCII.GetString(byte[])

Does Ruby have an equivalent to .NET's Encoding.ASCII.GetString(byte[])?
Encoding.ASCII.GetString(bytes[]) takes an array of bytes and returns a string after decoding the bytes using the ASCII encoding.
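For reference, this is roughly the .NET behaviour being asked about, including what happens to bytes outside the 7-bit range. The sample values are made up for illustration.

```csharp
using System;
using System.Text;

// Plain 7-bit ASCII bytes decode to the corresponding characters.
byte[] bytes = { 104, 101, 108, 108, 111 };
string text = Encoding.ASCII.GetString(bytes);
Console.WriteLine(text); // hello

// Bytes above 127 are not ASCII; .NET decodes each of them as '?'.
string lossy = Encoding.ASCII.GetString(new byte[] { 104, 200, 105 });
Console.WriteLine(lossy); // h?i
```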
Assuming your data is in an array like so (each element is a byte, and further, from the description you posted, no larger than 127 in value, that is, a 7-bit ASCII character):
array =[104, 101, 108, 108, 111]
string = array.pack("c*")
After this, string will contain "hello", which is what I believe you're requesting.
The pack method "Packs the contents of arr into a binary sequence according to the directives in the given template string".
"c*" asks the method to interpret each element of the array as a "char". Use "C*" if you want to interpret them as unsigned chars.
http://ruby-doc.org/core/classes/Array.html#M002222
The example given in the documentation page uses the function to convert a string with Unicode characters. In Ruby I believe this is best done using Iconv:
require "iconv"
require "pp"
#Ruby representation of unicode characters is different
unicodeString = "This unicode string contains two characters " +
"with codes outside the ASCII code range, " +
"Pi (\342\x03\xa0) and Sigma (\342\x03\xa3).";
#printing original string
puts unicodeString
i = Iconv.new("ASCII//IGNORE","UTF-8")
#Printing converted string, unicode characters stripped
puts i.iconv(unicodeString)
bytes = i.iconv(unicodeString).unpack("c*")
#printing array of bytes of converted string
pp bytes
Read up on Ruby's Iconv here.
You might also want to check this question.
