What's going wrong with C#'s string formatter? - c#

I'm getting the following behavior from C#'s string encoder:
[Test Case Screenshot][1]
poundFromBytes should be "£", but instead it's "?".
It's as if it's trying to encode the byte array using ASCII instead of UTF-8.
Is this a bug in Windows 7 / C#'s string encoder, or am I missing something?
My real issue here is that I get the same problem when I use File.ReadAllText on an ANSI text file, and I get a related issue in a third party library.
EDIT
I found my problem: I was assuming that UTF-8 was backwards compatible with ANSI, but it is only backwards compatible with ASCII. Cheers anyway, at least I'll know to make sure I have no immaterial problems with my test case next time.

The single-byte representation of the pound sign is not valid UTF-8.
Use Encoding.GetBytes instead:
byte[] poundBytes = Encoding.GetEncoding("UTF-8").GetBytes(sPound);
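To illustrate the point about the single byte, here is a minimal sketch. The class and method names are made up for this example, and it assumes the lone byte in the test case was 0xA3 (the pound sign in ISO-8859-1 and Windows-1252); the screenshot is not reproduced here, so that part is a guess. Decoded as UTF-8, that byte is invalid and comes back as the replacement character U+FFFD, which many consoles display as "?".

```csharp
using System;
using System.Text;

public static class PoundSignDemo
{
    // Decode the same bytes with two different encodings.
    public static string DecodeUtf8(byte[] bytes) => Encoding.UTF8.GetString(bytes);

    public static string DecodeLatin1(byte[] bytes) =>
        Encoding.GetEncoding("ISO-8859-1").GetString(bytes);

    public static void Main()
    {
        const string pound = "\u00A3";                       // '£'
        byte[] singleByte = { 0xA3 };                        // assumed input: '£' as one "ANSI" byte
        byte[] utf8Bytes = Encoding.UTF8.GetBytes(pound);    // what UTF-8 actually uses

        Console.WriteLine(BitConverter.ToString(utf8Bytes));     // C2-A3
        Console.WriteLine(DecodeLatin1(singleByte) == pound);    // True
        Console.WriteLine(DecodeUtf8(singleByte) == "\uFFFD");   // True: invalid UTF-8 becomes U+FFFD
        Console.WriteLine(DecodeUtf8(utf8Bytes) == pound);       // True
    }
}
```

The same reasoning applies to File.ReadAllText: if the file is ANSI, passing the matching Encoding explicitly avoids the UTF-8 default.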

The correct block of code should read something like:
var testChar = '£';
var bytes = Encoding.UTF32.GetBytes(new []{testChar});
string testConvert = Encoding.UTF32.GetString(bytes, 0, bytes.Length);
As others have said, you need to use a UTF encoder to get the bytes for a character. Incidentally characters are UTF-16 format by default (see: http://msdn.microsoft.com/en-us/library/x9h8tsay.aspx)

If you want to use an Encoding's GetString() method, you should probably also use its corresponding GetBytes() method:
static void Main(string[] args)
{
char cPound = '£';
byte bPound = (byte)cPound; //not really valid
string sPound = "" + cPound;
byte[] poundBytes = Encoding.UTF8.GetBytes(sPound);
string poundFromBytes = Encoding.UTF8.GetString(poundBytes);
Console.WriteLine(poundFromBytes);
Console.ReadKey(true);
}

Check out the documents here. As mentioned in the comments you can't just cast your char to a byte. I'll edit with a more succinct answer but I want to avoid just copy/pasting what msdn has. http://msdn.microsoft.com/en-us/library/ds4kkd55(v=vs.110).aspx
char[] pound = new char[] { '£' };
byte[] poundAsBytes = Encoding.UTF8.GetBytes(pound);
Also, why is everyone using this GetEncoding with a hard coded argument rather than accessing UTF8 directly?

Related

.NET C# conversion from UTF 16 LE to UTF 16 BE failing

I'm trying to convert some strings from UTF 16 LE to UTF 16 BE but it fails to encode the second Chinese character.
Sample string: test馨俞
Code:
byte[] bytes = Encoding.Unicode.GetBytes(sendMsg.Text);
sendMsg.Text = Encoding.BigEndianUnicode.GetString(bytes);
I've also tried
var encode = new UnicodeEncoding(false, true, true);
var messageAsBytes = encode.GetBytes(sendMsg.Text);
var enc = new UnicodeEncoding(true, true, true);
sendMsg.Text = enc.GetString(messageAsBytes);
Which results in the following error: Unable to translate bytes [DE][4F] at index 184 from specified code page to Unicode on the line:
sendMsg.Text = enc.GetString(messageAsBytes);
Thanks.
I think you should process your input string with the BigEndianUnicode class.
I made this code from the one you provided. It works fine, without error:
String input = "馨俞";
var messageAsBytes = Encoding.BigEndianUnicode.GetBytes(input);
input = Encoding.BigEndianUnicode.GetString(messageAsBytes);
If I process "input" with Encoding.Unicode and print out both byte arrays (the one produced with Unicode and the one produced with big endian), the output shows the difference in byte order.
So, input is converted to the endian you need.
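A rough sketch of what the two byte arrays look like, and of why the original decode failed. The class and helper names are hypothetical. It uses the character whose bytes appear in the error message ([DE][4F] is U+4FDE in little-endian order). Read as big endian, those two bytes form 0xDE4F, which falls in the low-surrogate range (0xDC00 to 0xDFFF) and is not a valid character on its own, so a strict decoder throws.

```csharp
using System;
using System.Text;

public static class Utf16ByteOrderDemo
{
    public static string Hex(byte[] bytes) => BitConverter.ToString(bytes);

    // Returns true if decoding the bytes as strict UTF-16 BE throws.
    public static bool StrictBigEndianDecodeFails(byte[] bytes)
    {
        var strictBe = new UnicodeEncoding(
            bigEndian: true, byteOrderMark: false, throwOnInvalidBytes: true);
        try
        {
            strictBe.GetString(bytes);
            return false;
        }
        catch (DecoderFallbackException)
        {
            return true;
        }
    }

    public static void Main()
    {
        string ch = "\u4FDE"; // the second Chinese character from the question
        byte[] le = Encoding.Unicode.GetBytes(ch);          // UTF-16 little endian
        byte[] be = Encoding.BigEndianUnicode.GetBytes(ch); // UTF-16 big endian

        Console.WriteLine(Hex(le)); // DE-4F
        Console.WriteLine(Hex(be)); // 4F-DE

        // LE bytes misread as BE give 0xDE4F, a lone low surrogate.
        Console.WriteLine(StrictBigEndianDecodeFails(le)); // True
        Console.WriteLine(StrictBigEndianDecodeFails(be)); // False
    }
}
```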
The result of encoding a string is a byte array, not another string.
Just use
byte[] bytes = Encoding.BigEndianUnicode.GetBytes(sendMsg.Text);
to encode the string to bytes using the UTF 16 BE encoding.
Then send those bytes to the mainframe.
How you send those bytes to the mainframe may be the topic of another question, but it sounds like you somehow need to present those encoded bytes in a variable of type string. That sounds like a bug in the library you are using. We would need to understand the nature of that library and its possible bug to find a workaround. One option you could try, but it's a shot in the dark, is this:
string toSend = Encoding.Default.GetString(bytes);
That will produce a string where each character is the representation of one byte from the encoded string, in UTF 16 BE order. Its length will be double the length of the original string.
I got it working by setting this property without any conversion.
sendMsg.SetIntProperty(XMSC.JMS_IBM_CHARACTER_SET, 1201);

Byte[] from Registry returns only one letter

I'm trying to read data from the registry at "SOFTWARE\Microsoft\Windows\CurrentVersion\Explorer\RecentDocs\"
The return value I get is System.byte[], when I convert it to a string like suggested here.
It works (I think). But I only get 1 letter returned and not the whole string.
Perhaps I'm doing something wrong? I'm fairly certain there can't be only one letter in there..
I've tried Encoding.ASCII.GetString(bytes); and Encoding.UTF8.GetString(bytes); and Encoding.Default.GetString(bytes); but it all returns only 1 character/letter.
I've checked out this link as well, but that's for C++ and I'm using C#, and I don't see the method that they suggested (RegGetValueA).
Here is my code:
RegistryKey pRegKey = Registry.CurrentUser;
pRegKey = pRegKey.OpenSubKey("SOFTWARE\\Microsoft\\Windows\\CurrentVersion\\Explorer\\RecentDocs\\");
Object val = pRegKey.GetValue("0");
byte[] bytes = (byte[])pRegKey.GetValue ("0");
string str = Encoding.ASCII.GetString(bytes);
System.Windows.MessageBox.Show("The value is: " + str);
Thanks in advance for any help :)
The string is encoded using UTF-16, so you should use Encoding.Unicode.
But it doesn't seem it's just UTF-16 encoded strings, there's some more data. For me, (when decoded as UTF-16), it displays as
Stažené soubory□Š6□□□□□Stažené soubory.lnk□T□□뻯□□□□*□□□□□□□□□□□□Stažené soubory.lnk□6□
Stažené soubory means Downloads in Czech, which is the language of my Windows. And the U+25A1 squares in the above text are actually zero chars.
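Here is a small sketch of that idea. It uses a stand-in byte array rather than a real registry value, and FirstUtf16String is a hypothetical helper, not part of any registry API. It also hints at a likely reason for the single letter in the question: decoded as ASCII, the second byte of a UTF-16 string becomes a NUL character, and a message box will typically stop at the first NUL.

```csharp
using System;
using System.Linq;
using System.Text;

public static class RegistryBlobDemo
{
    // Hypothetical helper: decode a UTF-16 LE blob and keep the text before the first NUL.
    public static string FirstUtf16String(byte[] blob)
    {
        string decoded = Encoding.Unicode.GetString(blob);
        int end = decoded.IndexOf('\0');
        return end >= 0 ? decoded.Substring(0, end) : decoded;
    }

    public static void Main()
    {
        // Stand-in for the byte[] returned by GetValue: a NUL-terminated
        // UTF-16 name followed by a few bytes of unrelated binary data.
        byte[] blob = Encoding.Unicode.GetBytes("Downloads\0")
            .Concat(new byte[] { 0x36, 0x00, 0x32, 0x00 })
            .ToArray();

        string asAscii = Encoding.ASCII.GetString(blob);
        Console.WriteLine(asAscii[0]);             // D
        Console.WriteLine((int)asAscii[1]);        // 0 (a NUL right after the first letter)

        Console.WriteLine(FirstUtf16String(blob)); // Downloads
    }
}
```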
Are you sure that the encoding is ASCII?
I would suspect some UTF like Encoding.UTF8 or Encoding.Unicode - try that...

Convert UCS-2 characters to UTF-8 Using C#

I'm pulling some internationalized text from a MS SQL Server 2005 database. As per the defaults for that DB, the characters are stored as UCS-2. However, I need to output the data in UTF-8 format, as I'm sending it out over the web. Currently, I have the following code to convert:
SqlString dbString = resultReader.GetSqlString(0);
byte[] dbBytes = dbString.GetUnicodeBytes();
byte[] utf8Bytes = System.Text.Encoding.Convert(System.Text.Encoding.Unicode,
System.Text.Encoding.UTF8, dbBytes);
System.Text.UTF8Encoding encoder = new System.Text.UTF8Encoding();
string outputString = encoder.GetString(utf8Bytes);
However, when I examine the output in the browser, it appears to be garbage, no matter what I set the encoding to.
What am I missing?
EDIT:
In response to the answers below, the reason I thought I had to perform a conversion is because I can output literal multibyte strings just fine. For example:
OutputControl.Text = "カルフォルニア工科大学とチューリッヒ工科大学は共同で、太陽光を保管可能な燃料に直接変えることのできる装置の開発に成功したとのこと";
works. Here, OutputControl is an ASP.Net Literal. However,
OutputControl.Text = outputString; //Output from above snippet
results in mangled output as described above. My hypothesis was that the database's output was somehow getting mangled by ASP.Net. If that's not the case, then what are some other possibilities?
EDIT 2:
Okay, I'm stupid. It turns out that there's nothing wrong with the database at all. When I tried inserting my own literal double byte characters (材料,原料;木料), I could read and output them just fine even without any conversion process at all. It seems to me that whatever is inserting the data into the DB is mangling the characters somehow, so I'm going to look at that. With my verified, "clean" data, the following code works:
OutputControl.Text = dbString.ToString();
as the responses below indicate it should.
Your code does essentially the same as:
SqlString dbString = resultReader.GetSqlString(0);
string outputString = dbString.ToString();
string itself is a UNICODE string (specifically, UTF-16, which is 'almost' the same as UCS-2, except for codepoints not fitting into the lowest 16 bits). In other words, the conversions you are performing are redundant.
Your web app most likely mangles the encoding somewhere else as well, or sets a wrong encoding for the HTML output. However, that can't be diagnosed from the information you provided so far.
String in .net is 'encoding agnostic'.
You can convert bytes to a string using a particular encoding to tell .NET how to interpret your bytes.
You can convert a string to bytes using a particular encoding to tell .NET how you want your bytes served.
But trying to convert a string to another string using encodings makes no sense at all.
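As a quick check of that claim, the sketch below mirrors the conversion chain from the question (the class and method names are invented for the example, and the sample text is arbitrary). The string that comes out is equal to the string that went in, so the only place an encoding matters is where the text is turned into bytes for output.

```csharp
using System;
using System.Text;

public static class RedundantConversionDemo
{
    // Same steps as the question: string -> UTF-16 bytes -> UTF-8 bytes -> string.
    public static string RoundTripThroughUtf8(string text)
    {
        byte[] utf16 = Encoding.Unicode.GetBytes(text);
        byte[] utf8 = Encoding.Convert(Encoding.Unicode, Encoding.UTF8, utf16);
        return Encoding.UTF8.GetString(utf8);
    }

    public static void Main()
    {
        string text = "\u4E2D\u6587"; // two CJK characters as sample data

        // The chain is a no-op as far as the string is concerned.
        Console.WriteLine(RoundTripThroughUtf8(text) == text); // True

        // Encoding only matters when producing bytes, e.g. for a response body.
        byte[] wire = Encoding.UTF8.GetBytes(text);
        Console.WriteLine(wire.Length); // 6 (three bytes per character here)
    }
}
```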

How to convert hebrew (unicode) to Ascii in c#?

I have to create some sort of text file in which there are numbers and Hebrew letters encoded as ASCII.
This is file creation method which triggers on ButtonClick
protected void ToFile(object sender, EventArgs e)
{
filename = Transactions.generateDateYMDHMS();
string path = string.Format("{0}{1}.001", Server.MapPath("~/transactions/"), filename);
StreamWriter sw = new StreamWriter(path, false, Encoding.ASCII);
sw.WriteLine("hello");
sw.WriteLine(Transactions.convertUTF8ASCII("שלום"));
sw.WriteLine("bye");
sw.Close();
}
As you can see, I use the static method Transactions.convertUTF8ASCII() to convert what is probably a Unicode string from .NET to its ASCII representation. I use it on the Hebrew word 'shalom' and get back '????' instead of the result I need.
Here is the method.
public static string convertUTF8ASCII(string initialString)
{
byte[] unicodeBytes = Encoding.Unicode.GetBytes(initialString);
byte[] asciiBytes = Encoding.Convert(Encoding.Unicode, Encoding.ASCII, unicodeBytes);
return Encoding.ASCII.GetString(asciiBytes);
}
Instead of having the initial word encoded as ASCII, I get '????' in the file I create. I get the same result even when I run the debugger.
What am I doing wrong?
You can't simply translate arbitrary Unicode characters to ASCII. The best the encoder can do is replace each unsupported character with '?', hence ????. Obviously the basic 7-bit characters will work, but not much else. I'm curious as to what the expected result is?
If you need this for transfer (rather than representation) you might consider base-64 encoding of the underlying UTF8 bytes.
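A minimal sketch of both points, with made-up class and method names: encoding the Hebrew word as ASCII yields one '?' per letter, while base-64 over the UTF-8 bytes gives an ASCII-only string that can be turned back into the original text.

```csharp
using System;
using System.Text;

public static class AsciiVsBase64Demo
{
    // What the question's conversion effectively does.
    public static string ToAscii(string text) =>
        Encoding.ASCII.GetString(Encoding.ASCII.GetBytes(text));

    // ASCII-safe transfer form of arbitrary text.
    public static string ToBase64Utf8(string text) =>
        Convert.ToBase64String(Encoding.UTF8.GetBytes(text));

    public static string FromBase64Utf8(string base64) =>
        Encoding.UTF8.GetString(Convert.FromBase64String(base64));

    public static void Main()
    {
        string shalom = "\u05E9\u05DC\u05D5\u05DD"; // the Hebrew word from the question

        Console.WriteLine(ToAscii(shalom)); // ????

        string transfer = ToBase64Utf8(shalom);
        Console.WriteLine(transfer);                           // ASCII-only text
        Console.WriteLine(FromBase64Utf8(transfer) == shalom); // True
    }
}
```

Note that the base-64 form is only useful if the receiving side knows to decode it; it is not a readable rendering of the Hebrew text.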
Do you perhaps mean ANSI, not ASCII?
ASCII doesn't define any Hebrew characters. There are however some ANSI code pages which do such as "windows-1255"
In which case, you may want to consider looking at:
http://msdn.microsoft.com/en-us/library/system.text.encoding.aspx
In short, where you have:
Encoding.ASCII
You would replace it with:
Encoding.GetEncoding(1255)
Are you perhaps asking about transliteration (as in "Romanization") instead of encoding conversion, if you really are talking about ASCII?
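For the code page suggestion, a sketch under stated assumptions: it targets .NET Core 3.0 or later (or .NET 5+), where legacy code pages such as 1255 have to be registered through CodePagesEncodingProvider before Encoding.GetEncoding(1255) will find them. On the .NET Framework the registration line should not be needed. The class and method names are hypothetical.

```csharp
using System;
using System.Text;

public static class Windows1255Demo
{
    public static byte[] EncodeHebrew(string text)
    {
        // Assumption: running on .NET Core / .NET 5+, where this registration
        // makes Windows code pages available. Calling it repeatedly is harmless.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        return Encoding.GetEncoding(1255).GetBytes(text);
    }

    public static void Main()
    {
        string shalom = "\u05E9\u05DC\u05D5\u05DD"; // the Hebrew word from the question

        // One byte per letter in windows-1255, instead of '?'.
        byte[] bytes = EncodeHebrew(shalom);
        Console.WriteLine(BitConverter.ToString(bytes)); // F9-EC-E5-ED

        // Decoding with the same code page restores the text.
        Console.WriteLine(Encoding.GetEncoding(1255).GetString(bytes) == shalom); // True
    }
}
```

With a StreamWriter, the equivalent change would be to pass Encoding.GetEncoding(1255) where the question passes Encoding.ASCII, and to write the string directly without the conversion method.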
I just faced the same issue when original xml file was in ASCII Encoding.
As Userx suggested
Encoding.GetEncoding(1255)
XDocument.Parse(System.IO.File.ReadAllText(xmlPath, Encoding.GetEncoding(1255)));
So now my XDocument file can read hebrew even if the xml file was saved as ASCII

Can we simplify this string encoding code

Is it possible to simplify this code into a cleaner/faster form?
StringBuilder builder = new StringBuilder();
var encoding = Encoding.GetEncoding(936);
// convert the text into a byte array
byte[] source = Encoding.Unicode.GetBytes(text);
// convert that byte array to the new codepage.
byte[] converted = Encoding.Convert(Encoding.Unicode, encoding, source);
// take multi-byte characters and encode them as separate ascii characters
foreach (byte b in converted)
builder.Append((char)b);
// return the result
string result = builder.ToString();
Simply put, it takes a string with Chinese characters such as 鄆 and converts them to ài.
For example, that Chinese character in decimal is 37126 or 0x9106 in hex.
See http://unicodelookup.com/#0x9106/1
Converted to a byte array (high byte first), we get [145, 6] (145 * 256 + 6 = 37126). When encoded in code page 936 (Simplified Chinese), we get [224, 105]. If we break this byte array down into individual characters, we get 224=e0=à and 105=69=i in Unicode.
See http://unicodelookup.com/#0x00e0/1
and
http://unicodelookup.com/#0x0069/1
Thus, we're doing an encoding conversion and ensuring that all characters in our output Unicode string can be represented using at most two bytes.
Update: I need this final representation because this is the format my receipt printer is accepting. Took me forever to figure it out! :) Since I'm not an encoding expert, I'm looking for simpler or faster code, but the output must remain the same.
Update (Cleaner version):
return Encoding.GetEncoding("ISO-8859-1").GetString(Encoding.GetEncoding(936).GetBytes(text));
Well, for one, you don't need to convert the "built-in" string representation to a byte array before calling Encoding.Convert.
You could just do:
byte[] converted = Encoding.GetEncoding(936).GetBytes(text);
To then reconstruct a string from that byte array whereby the char values directly map to the bytes, you could do...
static string MangleTextForReceiptPrinter(string text) {
return new string(
Encoding.GetEncoding(936)
.GetBytes(text)
.Select(b => (char) b)
.ToArray());
}
I wouldn't worry too much about efficiency; how many MB/sec are you going to print on a receipt printer anyhow?
Joe pointed out that there's an encoding that directly maps byte values 0-255 to code points, and it's age-old Latin1, which allows us to shorten the function to...
return Encoding.GetEncoding("Latin1").GetString(
Encoding.GetEncoding(936).GetBytes(text)
);
By the way, if this is a buggy windows-only API (which it is, by the looks of it), you might be dealing with codepage 1252 instead (which is almost identical). You might try reflector to see what it's doing with your System.String before it sends it over the wire.
Almost anything would be cleaner than this - you're really abusing text here, IMO. You're trying to represent effectively opaque binary data (the encoded text) as text data... so you'll potentially get things like bell characters, escapes etc.
The normal way of encoding opaque binary data in text is base64, so you could use:
return Convert.ToBase64String(Encoding.GetEncoding(936).GetBytes(text));
The resulting text will be entirely ASCII, which is much less likely to cause you hassle.
EDIT: If you need that output, I would strongly recommend that you represent it as a byte array instead of as a string... pass it around as a byte array from that point onwards, so you're not tempted to perform string operations on it.
Does your receipt printer have an API that accepts a byte array rather than a string?
If so you may be able to simplify the code to a single conversion, from a Unicode string to a byte array using the encoding used by the receipt printer.
Also, if you want to convert an array of bytes to a string whose character values correspond 1-1 to the values of the bytes, you can use the code page 28591 aka Latin1 aka ISO-8859-1.
I.e., the following
foreach (byte b in converted)
builder.Append((char)b);
string result = builder.ToString();
can be replaced by:
// All three of the following are equivalent
// string result = Encoding.GetEncoding(28591).GetString(converted);
// string result = Encoding.GetEncoding("ISO-8859-1").GetString(converted);
string result = Encoding.GetEncoding("Latin1").GetString(converted);
Latin1 is a useful encoding when you want to encode binary data in a string, e.g. to send through a serial port.
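To check the 1-1 claim, here is a small sketch (hypothetical class and method names) that decodes every byte value from 0 to 255 with code page 28591 and compares the result with a plain (char) cast. The bytes [224, 105] from the question are used as the sample input, so code page 936 itself is not needed here.

```csharp
using System;
using System.Linq;
using System.Text;

public static class Latin1Demo
{
    // Byte value N becomes the character with code point N.
    public static string BytesToChars(byte[] bytes) =>
        Encoding.GetEncoding(28591).GetString(bytes);

    public static void Main()
    {
        byte[] all = Enumerable.Range(0, 256).Select(i => (byte)i).ToArray();
        string decoded = BytesToChars(all);

        // Every character equals the (char) cast of the corresponding byte.
        bool oneToOne = decoded.Length == 256
            && decoded.Select((c, i) => c == (char)i).All(ok => ok);
        Console.WriteLine(oneToOne); // True

        // The mapping is reversible, so the original bytes can be recovered.
        byte[] back = Encoding.GetEncoding(28591).GetBytes(decoded);
        Console.WriteLine(back.SequenceEqual(all)); // True

        // The two bytes from the question map to "à" followed by "i".
        Console.WriteLine(BytesToChars(new byte[] { 224, 105 }) == "\u00E0i"); // True
    }
}
```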
