Convert python byte in string format to byte array in c# - c#

In my python code i have a value in byte code, whenever print that byte code it will give something like this,
b'\xe0\xb6\x9c\xe0\xb7\x92\xe0\xb6\xb1\xe0\xb7\x8a\xe0\xb6\xaf\xe0\xb6\xbb'
now, that value in string format in c# that is,
string byteString = "b'\xe0\xb6\x9c\xe0\xb7\x92\xe0\xb6\xb1\xe0\xb7\x8a\xe0\xb6\xaf\xe0\xb6\xbb'";
so question is how can i convert that byteString to byte array in c#
but, my actual problem is i have a string value in python which is not in English, when i run the python code it will print the string(in non English, work fine).
But, whenever run that python code in c# from process class it work fine for English and i can get the value. but it not working for non English characters, it was a null value. therefore, in python if i print that non English value in byte code i can get the value in c#. problem is how can i convert that in byte code into byte array in c#.

First, you want to modify your string slightly for usage in C#.
var str = "\xe0\xb6\x9c\xe0\xb7\x92\xe0\xb6\xb1\xe0\xb7\x8a\xe0\xb6\xaf\xe0\xb6\xbb";
You can then get your bytes fairly easily with LINQ.
var bytes = str.Select(x => Convert.ToByte(x)).ToArray();
An odd case that can occur with trying to use byte strings between Python and C# is that python will sometimes put out straight ASCII characters for certain byte values, leaving you with a mixed string like b'\xe0ello'. C# recognizes \x##, but it also attempts to parse \x####, which will tend to break when dealing with the output of a python bytestring that mixes hex codes and ascii.

Related

c# UTF8 GetString from bytes array not equal to php chr function

I'm trying to make one decoder. Basic system .Net 4.7 I'm trying to migrate this system into php, but I'm having trouble converting bytes. As far as I understand the default string UTF-16le on C#, I understood the ord and chr functions as UCS-2 on the PHP side. I want to do below and I do not get the same result there are codes. What can I do to fix this, thanks in advance
XOR Encoded Text Bytes = [101,107,217,78,40,68,234,218,162,67,139,81,44,166,24,148];
on C#
string result = System.Text.Encoding.UTF8.GetString(destinationArray);
On PHP
for($i=0;$i<sizeof($encoded);$i++){
echo "\t".$encoded[$i]." => ".chr($encoded[$i])."\n";
$tmpStr .= chr($encoded[$i]);
}
C# Result size=26:
ek�N(D�ڢC�Q,��
PHP Result size=16:
ek�N(D�ڢC�Q,��
the strings looks the same, but byte translation is quite different.
C# Result to Bytes array:
byte[] utf8 = System.Text.Encoding.Unicode.GetBytes(result);
Console.WriteLine(string.Join("-", utf8));
response =
101-0-107-0-253-255-78-0-40-0-68-0-253-255-162-6-67-0-253-255-81-0-44-0-253-255-24-0-253-255
PHP Result to Bytes Array:
echo implode("-",unpack("C*", $tmpStr));
response = 101-107-217-78-40-68-234-218-162-67-139-81-44-166-24-148
if php response convert to UTF-16le, results again different
echo implode("-",unpack("C*", mb_convert_encoding($tmpStr,'UTF-16le')));
response =
101-0-107-0-63-0-78-0-40-0-68-0-63-0-162-6-67-0-63-0-81-0-44-0-63-0-24-0-63-0
You are mixing quite different things here.
First, in the C# code, you are not using the same encoding when converting from bytes to a string and then from a string back to bytes: Encoding.UTF8 in the first case and Encoding.Unicode (which is .NET name for UTF-16) in the latter... Things cannot go well if you do this. And by the way, I'm not sure that PHP's UCS2 is equivalent to UTF-16:
UTF-8 encodes characters on 1, 2, 3 or 4 bytes depending on the character
UTF-16 encodes characters on 2 or 4 bytes depending on the character
UCS-2 always encodes characters on 2 bytes, and hence cannot encode more than 65536 characters...
Then what you pass to the 'bytes to string' conversions is not necessarily valid! Because you've XORed the input data (I assume it to be some secret string), the resulting bytes may or may not be a valid sequence in some encodings. For example:
It is not valid in ASCII because you have (in your example) bytes > 127
It is not valid in UTF-8 because 217 followed by 78 is recognized neither as a 1-, 2-, 3-, or 4-byte character by UTF-8; hence, the � you see before the N.
It seems to be invalid UTF-16 as well, but roundtripping works (I could get back the original array using .NET's Unicode.GetString, then Unicode.GetBytes. If I remove your last byte - and end up with an odd number of bytes - then UTF-16 roundtripping does not work any more...
Although I did not test it, it should also be invalid UCS-2 because UCS-2 'looks like' UTF-16 for 2-byte characters.
Roundtripping works with ANSI encodings sucha as windows-1252 because these encodings accept any byte. However, I would discourage using such trick because you have to be sure the same code page is used on both sides of the encoding/decoding process.
Therefore, I think, in your case, the best way to store your XORed bytes into a string would be to convert the array to base64. In C# you can do it this way:
// The code below gives you ZWt1TihEInY+QydRLEIYMA==
var converted = Convert.ToBase64String(array);
// And this one gives you back the initial array
var bytes = Convert.FromBase64String(converted);
Quick googling will tell you to use base64_encode and base64_decode in PHP.
Bottom note: if you want to really understand what's going on with al this encodings stuff, here is the must-read blog post on the subject: https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/

Create properly escaped string from byte array

I have a byte array, read from an image file, that I am trying to send from C# across a socket to a Meteor server running collectionFS (v0.3.7).
I am trying to convert it to a string to match the result I would get from calling FileReader.readAsBinaryString() in JavaScript, for example:
?PNG\r\n\u001a\n\u0000\u0000\u0000\rIHDR\u0000\u0000\u0003?\u0000\u0000\u0002?
In my C# code, I have tried using System.Text.Encoding.UTF8.GetString(), which gives me something like this:
�PNG\r\n\n\0\0\0\rIHDR\0\0�\0\0
This fails on the transfer, presumably because the '\0' is treated like the end of the string.
Can anyone better explain what is happening here? Is there a nice way in C# to format the bytes using the unicode escape sequences like readAsBinaryString() does?
EDIT: The eventual destination for this data is a BSON binary entry in MongoDB (in Meteor), to be later extracted (as a Blob) and viewed through the normal Meteor web browser client.
There is no built in method that does exactly that.
To transform byte array to encoded you need to decide what is encoded and what is not. Looks like 0-9a-zA-Z range should not be encoded and the rest encoded as \uXXXX:
I'd do something like following:
var result = String.Join("", byteArray
.Select(b => b >'0' && b <'9' ?
(char)b.ToString() : String.Format(#"\u{0:x4}", b)));

Byte array to text for ScintillaNET

I'm writing a windows forms application in c#. The application allows the user to select source code-files from a listbox and displays them in colored code using ScintillaNET. The files are saved as byte arrays in a database. I've managed to make the conversion from a file on my hard drive to byte array and store it. The user should also be able to edit the code and then save it to the database without having to dowload the file to their local hard drive first, I don't know how to approach this.
Basically I want to save the text from the ScintillNET control and convert it to a byte array.
And the other way around, take a byte array and print out the text as it originally appeared in ScintillaNET.
You can use the "Encoding" class from System.Text.
System.Text.Encoding.Unicode.GetBytes("Example");
This will return a byte array with the bytes equivalent to the text "string" using the unicode encoding. There are other encoding available, but I suggest using unicode since it supports more characters (anything you find in windows charmap, for example). In my case is because I'm latin and certain letters aren't available in UTF and I have my doubts about ASCII.
Now to convert from the byte array to string use:
byte[] exampleByteArray = MemStream.ToArray();
System.Text.Encoding.Unicode.GetString(exampleByteArray);
This code will return the string saved previously as a byte array in a memory stream. You can load the byte array with other methods, in you your case you are gonna load it from the database and call System.Text.Encoding.Unicode.GetString().
I believe you are looking for the System.Text.Encoding namespace...
// a sample string...
string example = "A string example...";
// convert string to bytes
byte[] bytes = Encoding.UTF8.GetBytes(example);
// convert bytes to string
string str = System.Text.Encoding.UTF8.GetString(bytes);

How does WChar relate to Unicode and ASCII

I am about to show my total ignorance of how encoding works and different string formats.
I am passing a string to a compiler (Microsoft as it happens amd for their Flight Simulator). The string is passed as part of an XML document which is used as the source for the compiler. This is created using using standard NET strings. I have not needed to specifically specify any encoding or setting of type since the XML is just text.
The string is just a collection of characters. This is an example of one that gives the error:
ARG, AFL, AMX, ACA, DAH, CCA, AEL, AGN, MAU, SEY, TSC, AZA, AAL, ANA, BBC, CPA, CAL, COA, CUB, DAL, UGX, ELY, UAE, ERT, ETH, EEZ, GHA, IRA, JAL, NWA, KAL, KAC, LAN, LDI, MAS, MEA, PIA, QTR, RAM, RJA, SVA, SIA, SWR, ROT, THA, THY, AUI, UAL, USA, ACA, TAR, UZB, IYE, QFA
If I create the string using my C# managed program then there is no issue. However this string is coming from a c++ program that can create the compiled file using its own compiler that is not compliant with the MS one
The MS compiler does not like the string. It throws two errors:
INTERNAL COMPILER ERROR: #C2621: Couldn't convert WChar string!
INTERNAL COMPILER ERROR: #C2029: Failed to convert attribute value from UNICODE!
Unfortunately there is not any useful documentation with the compiler on its errors. We just makethe best of what we see!
I have seen other errors of this type but these contain hidden characters and control characters that I can trap and remove.
In this case I looked at the string as a Char[] and could not see anything unusual. Only what I expected. No values above the ascii limit of 127 and no control characters.
I understand that WChar is something that C++ understands (but I don't), Unicode is a two byte representation of characters and ASCII is a one byte representation.
I would like to do two things - first identify a string that will fail if passed to the compiler and second fix the string. I assume the compiler is expecting ASCII.
EDIT
I told an untruth - in fact I do use encoding. I checked the code I used to convert a byte array into a string.
public static string Bytes2String(byte[] bytes, int start, int length) {
string temp = Encoding.Defaut.GetString(bytes, start, length);
}
I realized that Default might be an issue but changing it to ASCII makes no difference. I am beginning to believe that the error message is not what it seems.
It looks like you are taking a byte array, and converting it as a string using the encoding returned by Encoding.Default.
It is recommended that you do not do this (in the Microsoft documentation).
You need to work out what encoding is being used in the C++ program to generate the byte array, and use the same one (or a compatible one) to convert the byte array back to a string again in the C# code.
E.g. if the byte array is using ASCII encoding, you could use:
System.Text.ASCIIEncoding.GetString(bytes, start, length);
or
System.Text.UTF8Encoding.GetString(bytes, start, length);
P.S. I hope Joel doesn't catch you ;)
I have to come clean that the compiler error has nothing to do with the encoding format of the string. It turns out that it is the length of the string that is at fault. As per the sample there are a number of entries separated by commas. The compiler throws the rather unhelful messages if the entry count exceeds 50.
However Thanks everyone for your help - it has raised the issue of encoding in my mind and I will now look at it much more carefully

Can we simplify this string encoding code

Is it possible to simplify this code into a cleaner/faster form?
StringBuilder builder = new StringBuilder();
var encoding = Encoding.GetEncoding(936);
// convert the text into a byte array
byte[] source = Encoding.Unicode.GetBytes(text);
// convert that byte array to the new codepage.
byte[] converted = Encoding.Convert(Encoding.Unicode, encoding, source);
// take multi-byte characters and encode them as separate ascii characters
foreach (byte b in converted)
builder.Append((char)b);
// return the result
string result = builder.ToString();
Simply put, it takes a string with Chinese characters such as 鄆 and converts them to ài.
For example, that Chinese character in decimal is 37126 or 0x9106 in hex.
See http://unicodelookup.com/#0x9106/1
Converted to a byte array, we get [145, 6] (145 * 256 + 6 = 37126). When encoded in CodePage 936 (simplified chinese), we get [224, 105]. If we break this byte array down into individual characters, we 224=e0=à and 105=69=i in unicode.
See http://unicodelookup.com/#0x00e0/1
and
http://unicodelookup.com/#0x0069/1
Thus, we're doing an encoding conversion and ensuring that all characters in our output Unicode string can be represented using at most two bytes.
Update: I need this final representation because this is the format my receipt printer is accepting. Took me forever to figure it out! :) Since I'm not an encoding expert, I'm looking for simpler or faster code, but the output must remain the same.
Update (Cleaner version):
return Encoding.GetEncoding("ISO-8859-1").GetString(Encoding.GetEncoding(936).GetBytes(text));
Well, for one, you don't need to convert the "built-in" string representation to a byte array before calling Encoding.Convert.
You could just do:
byte[] converted = Encoding.GetEncoding(936).GetBytes(text);
To then reconstruct a string from that byte array whereby the char values directly map to the bytes, you could do...
static string MangleTextForReceiptPrinter(string text) {
return new string(
Encoding.GetEncoding(936)
.GetBytes(text)
.Select(b => (char) b)
.ToArray());
}
I wouldn't worry too much about efficiency; how many MB/sec are you going to print on a receipt printer anyhow?
Joe pointed out that there's an encoding that directly maps byte values 0-255 to code points, and it's age-old Latin1, which allows us to shorten the function to...
return Encoding.GetEncoding("Latin1").GetString(
Encoding.GetEncoding(936).GetBytes(text)
);
By the way, if this is a buggy windows-only API (which it is, by the looks of it), you might be dealing with codepage 1252 instead (which is almost identical). You might try reflector to see what it's doing with your System.String before it sends it over the wire.
Almost anything would be cleaner than this - you're really abusing text here, IMO. You're trying to represent effectively opaque binary data (the encoded text) as text data... so you'll potentially get things like bell characters, escapes etc.
The normal way of encoding opaque binary data in text is base64, so you could use:
return Convert.ToBase64String(Encoding.GetEncoding(936).GetBytes(text));
The resulting text will be entirely ASCII, which is much less likely to cause you hassle.
EDIT: If you need that output, I would strongly recommend that you represent it as a byte array instead of as a string... pass it around as a byte array from that point onwards, so you're not tempted to perform string operations on it.
Does your receipt printer have an API that accepts a byte array rather than a string?
If so you may be able to simplify the code to a single conversion, from a Unicode string to a byte array using the encoding used by the receipt printer.
Also, if you want to convert an array of bytes to a string whose character values correspond 1-1 to the values of the bytes, you can use the code page 28591 aka Latin1 aka ISO-8859-1.
I.e., the following
foreach (byte b in converted)
builder.Append((char)b);
string result = builder.ToString();
can be replaced by:
// All three of the following are equivalent
// string result = Encoding.GetEncoding(28591).GetString(converted);
// string result = Encoding.GetEncoding("ISO-8859-1").GetString(converted);
string result = Encoding.GetEncoding("Latin1").GetString(converted);
Latin1 is a useful encoding when you want to encode binary data in a string, e.g. to send through a serial port.

Categories