Convert UCS-2 characters to UTF-8 Using C#

I'm pulling some internationalized text from a MS SQL Server 2005 database. As per the defaults for that DB, the characters are stored as UCS-2. However, I need to output the data in UTF-8 format, as I'm sending it out over the web. Currently, I have the following code to convert:
SqlString dbString = resultReader.GetSqlString(0);
byte[] dbBytes = dbString.GetUnicodeBytes();
byte[] utf8Bytes = System.Text.Encoding.Convert(System.Text.Encoding.Unicode,
System.Text.Encoding.UTF8, dbBytes);
System.Text.UTF8Encoding encoder = new System.Text.UTF8Encoding();
string outputString = encoder.GetString(utf8Bytes);
However, when I examine the output in the browser, it appears to be garbage, no matter what I set the encoding to.
What am I missing?
EDIT:
In response to the answers below, the reason I thought I had to perform a conversion is because I can output literal multibyte strings just fine. For example:
OutputControl.Text = "カルフォルニア工科大学とチューリッヒ工科大学は共同で、太陽光を保管可能な燃料に直接変えることのできる装置の開発に成功したとのこと";
works. Here, OutputControl is an ASP.Net Literal. However,
OutputControl.Text = outputString; //Output from above snippet
results in mangled output as described above. My hypothesis was that the database's output was somehow getting mangled by ASP.Net. If that's not the case, then what are some other possibilities?
EDIT 2:
Okay, I'm stupid. It turns out that there's nothing wrong with the database at all. When I tried inserting my own literal double byte characters (材料,原料;木料), I could read and output them just fine even without any conversion process at all. It seems to me that whatever is inserting the data into the DB is mangling the characters somehow, so I'm going to look at that. With my verified, "clean" data, the following code works:
OutputControl.Text = dbString.ToString();
as the responses below indicate it should.

Your code does essentially the same as:
SqlString dbString = resultReader.GetSqlString(0);
string outputString = dbString.ToString();
A .NET string is itself a Unicode string (specifically UTF-16, which is the same as UCS-2 except for code points above U+FFFF, which UTF-16 represents as surrogate pairs). In other words, the conversions you are performing are redundant.
Your web app most likely mangles the encoding somewhere else as well, or sets a wrong encoding for the HTML output. However, that can't be diagnosed from the information you provided so far.

String in .net is 'encoding agnostic'.
You can convert bytes to a string using a particular encoding to tell .NET how to interpret your bytes.
You can convert a string to bytes using a particular encoding to tell .NET how you want your bytes served.
But trying to convert a string to another string using encodings makes no sense at all.
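A minimal sketch of that distinction, assuming nothing beyond System.Text: an encoding only matters at the boundary between strings and bytes. The class and method names are made up for illustration, and the characters are written as \u escapes so the source file's own encoding cannot interfere.

```csharp
using System;
using System.Text;

public static class EncodingBoundaryDemo
{
    // String -> bytes: choose the encoding at the moment you need bytes.
    public static byte[] ToUtf8(string text)
    {
        return Encoding.UTF8.GetBytes(text);
    }

    // Reproduces the question's conversion chain:
    // UTF-16 bytes -> UTF-8 bytes -> back to a string.
    public static string QuestionRoundTrip(string text)
    {
        byte[] utf16Bytes = Encoding.Unicode.GetBytes(text);
        byte[] utf8Bytes = Encoding.Convert(Encoding.Unicode, Encoding.UTF8, utf16Bytes);
        return Encoding.UTF8.GetString(utf8Bytes);
    }

    public static void Main()
    {
        string text = "\u6750\u6599"; // the first two characters of the question's test data

        Console.WriteLine(Encoding.Unicode.GetBytes(text).Length); // 4 (UTF-16: 2 bytes each)
        Console.WriteLine(ToUtf8(text).Length);                    // 6 (UTF-8: 3 bytes each)
        Console.WriteLine(QuestionRoundTrip(text) == text);        // True
    }
}
```

The byte arrays differ per encoding, but the string that comes back from the round trip is the one that went in, which is why the conversion chain in the question changes nothing.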

Related

C# - String stored as Base64, but the retrieved string is not a valid base64 string

I'm using Base64 encoding to store values from my data structure into a string.
Basically what I do is convert a byte array into base64 string
string StoredData = Convert.ToBase64String(ByteArray);
I then divide StoredData into strings of at most 256 characters and store each one as an ASCII string (in AutoCAD XData, as a DxfCode.ExtendedDataAsciiString).
When I want to retrieve my data I do the following:
First I combine the 256-character strings using StoredData = string1 + string2 + ...
Then I convert StoredData back into ByteArray using
var ByteArray = Convert.FromBase64String(StoredData);
This worked well for me and my clients until a month ago, when one client started seeing crashes and errors.
I asked him to send me his stored data, and I was surprised to see that it contained invalid Base64 characters (see sample below):
tM7x24QLLLALr5ivAx3XFAM7uciYXrCjKXSFd3XOL/KGIc3C+JMO8QjHT/4c+puYrNLq5r9Is0vpDKyuxw9I6R3f1LuOYSdHS6XgZJEyMvGwSHNRSYJ/a0IoumQftB3XspQRwp4QSd7qcUVsrXw0+2RS/sd2vAvUFxEQgwsHaabb01YjchGeyxr1f78A4qy2BL/oHAsRak9UYN0mDzhZgbhpahlgdK3eWd8b2BTM01lWh74pYUrJR+JfQ0tw0Eu㿔
Z/1JxBMUv2cB6NrFehSuNF9l4dhAaZQ+TcIClZmk/ZC8TJ0rKka/J+HqhLDAwWExB3nXoIi00uJnE7J4R6rU+Q==
As you can see, the first 256-character string contains an invalid Base64 character (㿔).
Why is that happening? Can this be related to the user's computer? I tried to replicate this error without success, and because I don't have access to their computers, I'm starting to think it might be something on their side.
The application uses .NET Framework version 4.5.
Edit: it turned out the client had sent me a recovered document in which the text strings were not recovered properly, which explains the corrupted string.

It turns out the app had crashed and the client had recovered the drawing document, which left the string corrupted.

how to fix corrupt japanese character encoding

I have the following string that I know is supposed to be displayed as Japanese text:
25“ú‚¨“¾‚ÈƒAƒ‹ƒeƒBƒƒbƒgƒRƒXƒZƒbƒg‹L”O
Is there any way to decode and re-encode the text so it displays properly? I already tried using shift-jis but it did not produce a readable string.
string main = "25“ú‚¨“¾‚ÈƒAƒ‹ƒeƒBƒƒbƒgƒRƒXƒZƒbƒg‹L”O.zip";
byte[] mainBytes = System.Text.Encoding.GetEncoding("shift-jis").GetBytes(main);
string jpn = System.Text.Encoding.GetEncoding("shift-jis").GetString(mainBytes);
thanks!

I think the original is Shift-JIS, but you didn't show how you tried. So here is my attempt to re-decode it:
string s1 = "25“ú‚¨“¾‚ÈƒAƒ‹ƒeƒBƒƒbƒgƒRƒXƒZƒbƒg‹L”O";
byte[] bs = Encoding.GetEncoding(1252).GetBytes(s1);
string s2 = Encoding.GetEncoding(932).GetString(bs);
And s2 is now "25日お得なアルティャbトコスセット記念", which looks a lot more like Japanese.
What I assume is that a byte array representing Shift-JIS-encoded text was read using a different encoding, maybe Windows-1252. So first I try to get back the original byte array. Then I use the proper encoding to get the correct text.
A few notes about my code:
1252 is the numeric ID for Windows-1252, the encoding most often used by mistake. But this is just a guess; you can try other encodings and see if the result makes more sense.
932 is the numeric ID for Shift-JIS (you can also use the string name). This is also a guess, but likely right.
Bear in mind that decoding with the wrong encoding is not generally reversible, so some characters may be lost in translation.

How to decode a utf string in c#

I have been trying to decode the following string:
CrÃ©dit
in c# using the following code:
Encoding iso = Encoding.GetEncoding("ISO-8859-1");
Encoding utf8 = Encoding.UTF8;
string msg = iso.GetString(utf8.GetBytes(@"CrÃ©dit"));
which is yielding:
CrÃÂ©dit
I looked online (http://jeppesn.dk/utf-8.html) and this is correct UTF-8, which should yield:
Crédit
Can someone please point out where i am going wrong?
Thanks

It should be the other way around, and Windows-1252, not ISO-8859-1. Depending on context, people usually mean Windows-1252 when they say Latin-1 or ISO-8859-1, but actually using ISO-8859-1 will fail when there are characters like € because it was a mislabeling in the first place. Even browsers use Windows-1252 when ISO-8859-1 is specified as encoding.
Encoding w1252 = Encoding.GetEncoding(1252);
Encoding utf8 = Encoding.UTF8;
string msg = utf8.GetString(w1252.GetBytes(@"CrÃ©dit"));

You're trying to do something that doesn't make sense, basically. You should almost never [1] be interpreting the output of one encoding as the input to another encoding. It's like saying, "Suppose I save this image as a gif... then load that file using a jpeg loader... what does it look like?"
I suspect that if you use:
// Just an example: don't actually do this.
string msg = utf8.GetString(iso.GetBytes(@"CrÃ©dit"));
... it will do what you want, but you shouldn't be doing this at all.
Now, what is your real input (in what form) and what are you trying to achieve?
[1] If you're doing so, it's usually because someone else has already done the wrong thing, or there's a configuration problem somewhere. If you find yourself doing this, you should think very carefully about whether you should really be doing it, or whether you're just working around a different problem which should be tackled differently.
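To make the direction of the repair concrete, here is a small sketch. It uses ISO-8859-1 rather than Windows-1252 only because this particular string contains no characters on which the two differ; the class and method names are illustrative, not from the answers. Characters are written as \u escapes so the source file's own encoding cannot interfere.

```csharp
using System;
using System.Text;

public static class MojibakeDemo
{
    static readonly Encoding Latin1 = Encoding.GetEncoding("ISO-8859-1");

    // How the damage happens: UTF-8 bytes are decoded with a single-byte encoding.
    public static string Damage(string original)
    {
        return Latin1.GetString(Encoding.UTF8.GetBytes(original));
    }

    // The repair: recover the original bytes, then decode them as UTF-8.
    public static string Repair(string damaged)
    {
        return Encoding.UTF8.GetString(Latin1.GetBytes(damaged));
    }

    public static void Main()
    {
        string original = "Cr\u00E9dit";    // the intended word, with e-acute
        string damaged = Damage(original);  // the 7-character string from the question

        Console.WriteLine(damaged == "Cr\u00C3\u00A9dit"); // True
        Console.WriteLine(Repair(damaged) == original);    // True

        // Applying the damaging direction again, as the question's code does,
        // only adds another layer: 7 characters become 9.
        Console.WriteLine(Damage(damaged).Length);         // 9
    }
}
```

As the footnote says, a repair like this is a workaround for a mistake made upstream; the better fix is to decode the bytes correctly where they first enter the program.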

Insert UTF8 data into a SQL Server 2008

I have an issue with encoding. I want to put data from a UTF-8-encoded file into a SQL Server 2008 database. SQL Server only features UCS-2 encoding, so I decided to explicitly convert the retrieved data.
// connect to page file
_fsPage = new FileStream(mySettings.filePage, FileMode.Open, FileAccess.Read);
_streamPage = new StreamReader(_fsPage, System.Text.Encoding.UTF8);
Here's the conversion routine for the data:
private string ConvertTitle(string title)
{
string utf8_String = Regex.Replace(Regex.Replace(title, @"\\.", _myEvaluator), @"(?<=[^\\])_", " ");
byte[] utf8_bytes = System.Text.Encoding.UTF8.GetBytes(utf8_String);
byte[] ucs2_bytes = System.Text.Encoding.Convert(System.Text.Encoding.UTF8, System.Text.Encoding.Unicode, utf8_bytes);
string ucs2_String = System.Text.Encoding.Unicode.GetString(ucs2_bytes);
return ucs2_String;
}
When stepping through the code for critical titles, the variable watch shows the correct characters for both the UTF-8 and UCS-2 strings. But in the database the result is partially wrong: some special characters are saved correctly, others are not.
Wrong: ń becomes an n
Right: É or é are for example inserted correctly.
Any idea where the problem might be and how to solve it?
Thanks in advance,
Frank

SQL Server 2008 handles the conversion from UTF-8 into UCS-2 for you.
First make sure your SQL tables use the nchar or nvarchar data types for the columns. Then you need to tell SQL Server you're sending Unicode data by adding an N in front of the string literal.
INSERT INTO tblTest (test) VALUES (N'EncodedString')
from Microsoft http://support.microsoft.com/kb/239530
See my question and solution here: How do I convert UTF-8 data from Classic asp Form post to UCS-2 for inserting into SQL Server 2008 r2?

I think you have a misunderstanding of what encodings are. An encoding is used to convert a bunch of bytes into a character string. A String does not itself have an encoding associated with it.
Internally, Strings are stored in memory as UTF-16LE bytes (which is why Windows persists in confusing everyone by calling the UTF-16LE encoding just “Unicode”). But you don't need to know that — to you, they're just strings of characters.
What your function does is:
Takes a string and converts it to UTF-8 bytes.
Takes those UTF-8 bytes and converts them to UTF-16LE bytes. (You could have just encoded straight to UTF-16LE instead of UTF-8 in step one.)
Takes those UTF-16LE bytes and converts them back to a string. This gives you the exact same String you had in the first place!
So this function is redundant; you can actually just pass a normal String to SQL Server from .NET and not worry about it.
The bit with the backslashes does do something, presumably application-specific; I don't understand what it's for. But nothing in that function will cause Windows to flatten characters like ń to n.
What /will/ cause that kind of flattening is when you try to put characters that aren't in the database's own encoding in the database. Presumably é is OK because that character is in your default encoding of cp1252 Western European, but ń is not so it gets mangled.
SQL Server does use ‘UCS2’ (really UTF-16LE again) to store Unicode strings, but you have to tell it to, typically by using a NATIONAL CHARACTER (NCHAR/NVARCHAR) column type instead of plain CHAR.
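From .NET, one hedged sketch of what "telling it to" can look like is a parameter typed as NVARCHAR, so the string is sent as Unicode rather than through the database's default code page. The table and column names are borrowed from the INSERT example in the other answer; the parameter name, the length of 255, and the class name are assumptions, and System.Data.SqlClient is assumed to be available (it ships with .NET Framework and is a separate package elsewhere).

```csharp
using System;
using System.Data;
using System.Data.SqlClient;

public static class UnicodeInsertSketch
{
    // Builds the command only; the caller attaches an open connection and executes it.
    public static SqlCommand BuildInsert(string title)
    {
        SqlCommand cmd = new SqlCommand("INSERT INTO tblTest (test) VALUES (@title)");
        // SqlDbType.NVarChar marks the value as Unicode text, the parameter
        // counterpart of an N'...' literal. 255 is an assumed column size.
        cmd.Parameters.Add("@title", SqlDbType.NVarChar, 255).Value = title;
        return cmd;
    }

    public static void Main()
    {
        // Sample title containing the n-acute character from the question.
        using (SqlCommand cmd = BuildInsert("Pozna\u0144"))
        {
            Console.WriteLine(cmd.Parameters["@title"].SqlDbType); // NVarChar
        }
    }
}
```

The target column still has to be NCHAR or NVARCHAR; a Unicode parameter written into a plain CHAR or VARCHAR column would be converted to the column's code page again.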

We were also very confused about encoding. Here is a useful page that explains it.
Also, answer to following SO question will help to explain it too -
In C# String/Character Encoding what is the difference between GetBytes(), GetString() and Convert()?

For future readers using newer releases, note that SQL Server 2016 supports UTF-8 in their bcp utility.

How to get correctly-encoded HTML from the clipboard?

Has anyone noticed that if you retrieve HTML from the clipboard, it gets the encoding wrong and injects weird characters?
For example, executing a command like this:
string s = (string) Clipboard.GetData(DataFormats.Html)
Results in stuff like:
<FONT size=-2>Â Â <A href="/advanced_search?hl=en">Advanced
Search</A><BR>Â Â Preferences<BR>Â Â <A
href="/language_tools?hl=en">Language
Tools</A></FONT>
Not sure how MarkDown will process this, but there are weird characters in the resulting markup above.
It appears that the bug is with the .NET framework. What do you think is the best way to get correctly-encoded HTML from the clipboard?

In your case it is not as visible as it was in mine. Today I tried to copy data from the clipboard that contained a few Unicode characters. The data I got looked as if a UTF-8 encoded file had been read as Windows-1250 (the local encoding on my Windows).
It seems your case is the same. Save the HTML data in Windows-1252 (or Windows-1250; both work), remembering to put a non-breaking space (0xA0), not a standard space, after each Â character. Then open the file as UTF-8 and you will see what should have been there.
For another project I wrote a function that fixes data with a corrupted encoding.
In this case a simple conversion should be sufficient:
byte[] data = Encoding.Default.GetBytes(text);
text = Encoding.UTF8.GetString(data);
My original function is a little bit more complex and contains tests to ensure that data are not corrupted...
public static bool FixMisencodedUTF8(ref string text, Encoding encoding)
{
if (string.IsNullOrEmpty(text))
return false;
byte[] data = encoding.GetBytes(text);
// there should not be any character outside source encoding
string newStr = encoding.GetString(data);
if (!string.Equals(text, newStr)) // if there is any character "outside"
return false; // leave, the input is in a different encoding
if (IsValidUtf8(data) == 0) // test data to be valid UTF-8 byte sequence
return false; // if not, can not convert to UTF-8
text = Encoding.UTF8.GetString(data);
return true;
}
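The IsValidUtf8 helper called above is not shown in the answer. One possible sketch, assuming it only needs to say whether the bytes form well-formed UTF-8, relies on a UTF8Encoding constructed to throw on invalid input. This hypothetical version returns a bool, whereas the original evidently returns a number that is compared against 0.

```csharp
using System;
using System.Text;

public static class Utf8Check
{
    // Strict decoder: no BOM emission, and invalid byte sequences throw
    // instead of being silently replaced with U+FFFD.
    static readonly UTF8Encoding StrictUtf8 = new UTF8Encoding(false, true);

    public static bool IsWellFormedUtf8(byte[] data)
    {
        try
        {
            StrictUtf8.GetString(data);
            return true;
        }
        catch (ArgumentException)
        {
            // DecoderFallbackException derives from ArgumentException.
            return false;
        }
    }

    public static void Main()
    {
        Console.WriteLine(IsWellFormedUtf8(new byte[] { 0xC3, 0xA9 })); // True: e-acute
        Console.WriteLine(IsWellFormedUtf8(new byte[] { 0xC3, 0x28 })); // False: 0x28 is not a continuation byte
    }
}
```

Decoding the whole buffer just to validate it is wasteful for large inputs, but it keeps the check short and leaves the validation rules to the framework.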
I know that this is not the best (or a correct) solution, but I did not find any other way to fix the input...
EDIT: (July 20, 2017)
It seems Microsoft has already found this error, and it now works correctly. I'm not sure which framework versions have the problem, but I do know that the application now uses a different framework than it did when I wrote the answer. (Now it is 4.5; the previous version was 2.0.)
(Now all my code fails when parsing the data. The remaining problem is determining the correct behaviour for applications running with and without the fix applied.)

You have to interpret the data as UTF-8. See MS Office hyperlinks change code page?.

The DataFormats.Html specification states that the data is encoded in UTF-8. But there's a bug in .NET Framework 4 and lower: it actually reads the UTF-8 data as Windows-1252.
You get a lot of wrong characters, leading to funny/bad output such as
'Å','â€¹','Å’','Å½','Å¡','Å“','Å¾','Å¸','Â','Â¡','Â¢','Â£','Â¤','Â¥','Â¦','Â§','Â¨','Â©'
Full explanation here
Debugging Chart Mapping Windows-1252 Characters to UTF-8 Bytes to Latin-1 Characters
Solution: create a translation dictionary and do a search and replace.

I don't know what your original source document is, but be aware that Word and Outlook provide several versions of the clipboard in different encodings. One is usually Windows-1252 and another is UTF-8. Possibly you're grabbing the UTF-8 encoded version by default, when you're expecting the Windows-1252 (Latin-1 + Smart Quotes)? Non-ASCII characters would show up as multiple odd Latin-1 accented characters. Most "Smart Quotes" are not in the Latin-1 set and are often three bytes in UTF-8.
Can you specify which encoding you want the clipboard contents in?

Try this:
System.Windows.Forms.Clipboard.GetText(System.Windows.Forms.TextDataFormat.Html);
