I have an issue with encoding. I want to put data from a UTF-8-encoded file into a SQL Server 2008 database. SQL Server only features UCS-2 encoding, so I decided to explicitly convert the retrieved data.
// connect to page file
_fsPage = new FileStream(mySettings.filePage, FileMode.Open, FileAccess.Read);
_streamPage = new StreamReader(_fsPage, System.Text.Encoding.UTF8);
Here's the conversion routine for the data:
private string ConvertTitle(string title)
{
string utf8_String = Regex.Replace(Regex.Replace(title, @"\\.", _myEvaluator), @"(?<=[^\\])_", " ");
byte[] utf8_bytes = System.Text.Encoding.UTF8.GetBytes(utf8_String);
byte[] ucs2_bytes = System.Text.Encoding.Convert(System.Text.Encoding.UTF8, System.Text.Encoding.Unicode, utf8_bytes);
string ucs2_String = System.Text.Encoding.Unicode.GetString(ucs2_bytes);
return ucs2_String;
}
When stepping through the code for critical titles, the variable watch shows the correct characters for both the UTF-8 and the UCS-2 string. But in the database the result is partially wrong: some special characters are saved correctly, others are not.
Wrong: ń becomes an n
Right: É or é are for example inserted correctly.
Any idea where the problem might be and how to solve it?
Thanks in advance,
Frank
SQL Server 2008 handles the conversion from UTF-8 into UCS-2 for you.
First make sure your SQL tables use the nchar or nvarchar data types for the columns. Then you need to tell SQL Server that you're sending in Unicode data by adding an N in front of the string literal.
INSERT INTO tblTest (test) VALUES (N'EncodedString')
from Microsoft http://support.microsoft.com/kb/239530
See my question and solution here: How do I convert UTF-8 data from Classic asp Form post to UCS-2 for inserting into SQL Server 2008 r2?
I think you have a misunderstanding of what encodings are. An encoding is used to convert a bunch of bytes into a character string. A String does not itself have an encoding associated with it.
Internally, Strings are stored in memory as UTF-16LE bytes (which is why Windows persists in confusing everyone by calling the UTF-16LE encoding just “Unicode”). But you don't need to know that — to you, they're just strings of characters.
What your function does is:
Takes a string and converts it to UTF-8 bytes.
Takes those UTF-8 bytes and converts them to UTF-16LE bytes. (You could have just encoded straight to UTF-16LE instead of UTF-8 in step one.)
Takes those UTF-16LE bytes and converts them back to a string. This gives you the exact same String you had in the first place!
So this function is redundant; you can actually just pass a normal String to SQL Server from .NET and not worry about it.
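To make the redundancy concrete, here is a minimal sketch that repeats the three steps on an arbitrary string and compares the result with the input. The class and method names (RoundTripDemo, ConvertLikeTheQuestion, IsNoOp) are made up for illustration, and the regex step from the question is left out because it is unrelated to the encoding calls.

```csharp
using System;
using System.Text;

public static class RoundTripDemo
{
    // Mirrors the three steps above:
    // string -> UTF-8 bytes -> UTF-16LE bytes -> string.
    public static string ConvertLikeTheQuestion(string input)
    {
        byte[] utf8Bytes = Encoding.UTF8.GetBytes(input);
        byte[] utf16Bytes = Encoding.Convert(Encoding.UTF8, Encoding.Unicode, utf8Bytes);
        return Encoding.Unicode.GetString(utf16Bytes);
    }

    // True when the round trip returned exactly the characters it was given.
    public static bool IsNoOp(string input)
    {
        return string.Equals(input, ConvertLikeTheQuestion(input), StringComparison.Ordinal);
    }
}
```

For well-formed text such as "Gdańsk É é", IsNoOp returns true, which is why the watch window showed identical strings before and after the conversion.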
The bit with the backslashes does do something, presumably application-specific; I don't understand what it's for. But nothing in that function will cause Windows to flatten characters like ń to n.
What /will/ cause that kind of flattening is when you try to put characters that aren't in the database's own encoding in the database. Presumably é is OK because that character is in your default encoding of cp1252 Western European, but ń is not so it gets mangled.
SQL Server does use ‘UCS2’ (really UTF-16LE again) to store Unicode strings, but you have to tell it to, typically by using a NATIONAL CHARACTER (NCHAR/NVARCHAR) column type instead of plain CHAR.
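The flattening can be reproduced in .NET without a database. The sketch below pushes text through a single-byte encoding and back. ISO-8859-1 is used here only as a stand-in for the column's code page (an assumption, chosen because it is available everywhere), and the replacement fallback is set explicitly so that unrepresentable characters become '?'. In the question, SQL Server's own conversion picked the look-alike n instead. The names CodePageLossDemo and StoreAndReload are hypothetical.

```csharp
using System.Text;

public static class CodePageLossDemo
{
    // Stand-in for a single-byte column code page (assumption for illustration).
    // Characters the code page cannot represent are replaced with '?'.
    private static readonly Encoding SingleByte = Encoding.GetEncoding(
        "iso-8859-1",
        new EncoderReplacementFallback("?"),
        new DecoderReplacementFallback("?"));

    // Simulates writing text into a single-byte column and reading it back.
    public static string StoreAndReload(string text)
    {
        byte[] stored = SingleByte.GetBytes(text);
        return SingleByte.GetString(stored);
    }
}
```

StoreAndReload("é") gives back "é" because that character exists in the code page, while StoreAndReload("ń") comes back as "?". The information is lost at the moment of encoding, not in the earlier string handling.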
We were also very confused about encoding. Here is a useful page that explains it.
The answers to the following SO question will also help to explain it:
In C# String/Character Encoding what is the difference between GetBytes(), GetString() and Convert()?
For future readers using newer releases, note that SQL Server 2016 supports UTF-8 in their bcp utility.
I have a database in SQL Server with collation Arabic_CI_AS, and I need to compare some string data with another Postgres database that uses the UTF8 character set. I use C# to convert and compare. This is easy when the string contains just one word (in those cases I only have to replace 'ي' with 'ی'), but long strings, especially ones with a '(' character, cause problems.
I can't get it to work! I tried some suggested solutions such as:
var enc = Encoding.GetEncoding(1256);
byte[] encBytes = enc.GetBytes(customer.name);
customer.name = Encoding.UTF8.GetString(encBytes, 0, encBytes.Length);
or:
SELECT cast (name as nvarchar) as NewName
from Customer
But they don't work! Can anyone help me?
Example of input and output, see tooltips on the right:
Maybe this can help you change your collation dynamically:
SELECT name collate SQL_Latin1_General_CP1_CI_AS
from Customer
or
SELECT name collate Persian_100_CI_AI
from Customer
or
You can try this on the C# side:
string _Value=string.Empty;
byte[] enBuff= Encoding.GetEncoding("windows-1256").GetBytes(customer.name);
customer.name= Encoding.GetEncoding("windows-1252").GetString(enBuff);
You can choose other collations too.
You may have to try several collations and encoding numbers to get the result you want.
SQL Server does not support UTF-8 strings (UTF-8 collations only arrived with SQL Server 2019). If you have to deal with characters other than plain Latin, it is strongly recommended to use NVARCHAR instead of VARCHAR with an Arabic collation.
Many people think that NVARCHAR is UTF-16 while VARCHAR is UTF-8. This is not true! The second is extended ASCII and uses 1 byte per character in any case, while UTF-8 encodes some characters with more than one byte.
So - the most important question is: WHY?
SQL Server can take your string into a NVARCHAR variable, cast it to a chain of bytes and re-cast it to the former string:
DECLARE @str NVARCHAR(MAX)=N'(نماینده اراک)';
SELECT @str
,CAST(@str AS VARBINARY(MAX))
,CAST(CAST(@str AS VARBINARY(MAX)) AS NVARCHAR(MAX));
The problem with the ) is - quite probably! - that your Arabic letters are right-to-left while the ) is left-to-right. I wanted to paste the result of the query above into this answer but did not manage to get the closing ) into its original place... You try to edit, delete, replace, but you get something else... Somewhat funny, but this is not a question of bad encoding but one of buggy editors...
Anyway, SQL Server is not your issue. You must read the string as NVARCHAR out of SQL Server. C# works with Unicode strings and not with a collated 1-byte string. Every conversion carries the chance of destroying your text.
If your target (or the tooltip you show us) is not capable of showing the string properly, the string itself might be perfectly okay while the editor is not...
If you pass such a UTF-8 string back to SQL Server, you'll get a mess...
The only place where UTF-8 makes sense is when text is written to a file or transmitted over a low-bandwidth connection. If a text contains mostly plain Latin characters and just a few other letters (as is very often the case with XML or HTML), you can save quite some disk space or bandwidth. With a Far Eastern text you would even bloat your text: some of these characters need 3 or even 4 bytes to be encoded.
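The size trade-off is easy to measure. This is a small sketch with two hypothetical helpers (Utf8Size, Utf16Size) that only wrap Encoding.GetByteCount. The sample strings in the note below are invented for illustration.

```csharp
using System.Text;

public static class EncodedSizeDemo
{
    // Number of bytes the text occupies when encoded as UTF-8.
    public static int Utf8Size(string text) => Encoding.UTF8.GetByteCount(text);

    // Number of bytes the text occupies as UTF-16 (what NVARCHAR and .NET strings use).
    public static int Utf16Size(string text) => Encoding.Unicode.GetByteCount(text);
}
```

For the markup fragment "<p>café</p>" (11 characters) UTF-8 needs 12 bytes and UTF-16 needs 22. For the four Chinese characters "东北特钢" the picture flips: 12 bytes in UTF-8 against 8 in UTF-16.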
Within your database and application you should stick with unicode.
What I am working with:
Within my Asp.net Webforms application, I am getting form data from the user and then inserting that data into a SQL Server database. Each key is the identifier for the field from within the form, and the value is the data received by the user.
My Issue:
My issue is that users are copying and pasting UTF-8 data from emails, etc. into the "notes" field. The SQL Server database does not recognize UTF-8 as valid character data. Instead, it uses both the UCS-2 and ISO-8859-1 character sets. As a result, these characters are being inserted into the database as question marks (?). So, I would like to properly convert any UTF-8 characters to UCS-2 or ISO-8859-1.
Questions:
Should I convert the UTF-8 characters to UCS-2 or to ISO-8859-1?
Within the ASP.NET web form, what is the best means of determining the character sets used within the value for the "notes" key of my hashtable?
What is the best possible means for converting the characters that are UTF-8 into the acceptable character set?
Option 1: use nvarchar
You could just change your field from varchar to nvarchar so that your unicode characters are stored correctly. That's the point of that nvarchar data type. It's cool. Use it.
Option 2: Convert Intelligently.
If you have a legacy DB where nvarchar simply won't work, then you can create a string extension that lets you store the ASCII version of your users' values. Below is one such extension (note that we do some initial replacements for "smart" quotes, etc. before ditching all characters that aren't ASCII).
If you're supporting international text (accents, etc.), then this is a little culturally insensitive ("bah - away with your crazy accent marks and strange non-English-looking letters").
using System.Text.RegularExpressions;

public static class StringExt {
    public static string TryGetAsciiString(this string original) {
        // Replace those MS Word "smart" characters with ASCII (dumb) characters.
        string escaped = original
            .Replace('\u2013', '-').Replace('\u2014', '-').Replace('\u2015', '-').Replace('\u2017', '_')
            .Replace('\u2018', '\'').Replace('\u2019', '\'').Replace('\u201a', ',').Replace('\u201b', '\'')
            .Replace('\u201c', '\"').Replace('\u201d', '\"').Replace('\u201e', '\"').Replace("\u2026", "...")
            .Replace('\u2032', '\'').Replace('\u2033', '\"');
        // Regex out all the other non-ASCII characters (run on the already-replaced text).
        escaped = Regex.Replace(escaped, "[^A-Za-z 0-9 \\.,\\?\'\"!@#\\$%\\^&\\*\\(\\)\\-_=\\+;:<>\\/\\\\\\|\\}\\{\\[\\]`~\\n\\r]+", "");
        // All set.
        return escaped;
    }
}
Option ... err... 2A? : Ditch the first 30 ascii codes (give or take)
I've noticed that when users copy/paste from Word for Mac (and a few other programs), the pasted data contains characters from the first 30 ASCII codes. Aside from 9, 10 and 13, you can probably ditch those (they're just NULs, ACKs, DCs and some other garbage no user would actually type).
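One way to sketch that filter is shown below. It treats every code below 32 (the space character) as a control character and keeps only tab (9), line feed (10) and carriage return (13). The class and method names are hypothetical.

```csharp
using System.Text;

public static class ControlCharFilter
{
    // Drops ASCII control characters (codes below 32) except
    // tab (9), line feed (10) and carriage return (13).
    public static string StripControlChars(string input)
    {
        var result = new StringBuilder(input.Length);
        foreach (char c in input)
        {
            bool keep = c >= ' ' || c == '\t' || c == '\n' || c == '\r';
            if (keep)
            {
                result.Append(c);
            }
        }
        return result.ToString();
    }
}
```

Characters above the ASCII range pass through untouched, so this step can be combined with either option above.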
I have a Client/Server architecture where messages in text-format are exchanged.
For example:
12   2013/11/11   abcd   5
^    ^            ^      ^
int  date         text   int
Everything works fine with "normal" text.
Now this is a Chinese project, so they also want to send Chinese characters, encoded as GB18030 or GB2312.
I read the data this way:
char[] dataIn = binaryReader.ReadChars(length);
Then I create a new string from the char array and convert it to the right data type (int, float, string etc.).
How can I change/enable Chinese encoding, or convert the string values to Chinese?
And what would be a good and easy way to test this?
Thanks.
I tried using something like this
string stringData = new string(dataIn).Trim();
byte[] data = Encoding.Unicode.GetBytes(stringData);
stringData = Encoding.GetEncoding("GB18030").GetString(data);
Without success.
I also need to save some text values to MS SQL Server 2008. Is this possible, and do I need to configure anything special?
I also tried this example with storing to the database and printing to the console, but I just get ????????
string chinese = "123东北特钢大连新基地testtest";
byte[] utfBytes = Encoding.Unicode.GetBytes(chinese);
byte[] chineseBytes = Encoding.Convert(Encoding.Unicode, Encoding.GetEncoding("GB18030"), utfBytes);
string msg = Encoding.GetEncoding("GB18030").GetString(chineseBytes);
Edit
The problem was with the INSERT queries that I send to the database. I fixed it by using N' before the string.
sqlCommand = string.Format("INSERT INTO uber_chinese (columnName) VALUES(N'{0}')", myChineseString);
Also the column dataType has to be nvarchar instead of varchar.
This answer is "promoted" (by request from the Original Poster) from comments by myself.
In the .NET Framework, strings are already Unicode strings.
(Don't test Unicode strings by writing to the console, though, since the terminal window and console typically won't display them correctly. However, since .NET version 4.5 there is some support for this.)
The thing to be aware of is the Encoding when you get text from an outside source. In this case, the constructor of BinaryReader offers an overload that takes in an Encoding:
using (var binaryReader = new BinaryReader(yourStream, Encoding.GetEncoding("GB18030")))
...
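A self-contained way to try this without a server is sketched below, with an in-memory stream standing in for the network stream. The class and method names are hypothetical. The RegisterProvider call is an assumption about the runtime: on .NET Core and .NET 5+ the code-page encodings (including GB18030) have to be registered first, while on the classic .NET Framework that line is not needed.

```csharp
using System.IO;
using System.Text;

public static class Gb18030ReaderDemo
{
    private static Encoding GetGb18030()
    {
        // Needed on .NET Core / .NET 5+, where code-page encodings are opt-in.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        return Encoding.GetEncoding("GB18030");
    }

    // Produces the bytes a GB18030 sender would put on the wire (test helper).
    public static byte[] Encode(string text) => GetGb18030().GetBytes(text);

    // Reads charCount characters from the payload, decoding them as GB18030.
    public static string ReadText(byte[] payload, int charCount)
    {
        using (var stream = new MemoryStream(payload))
        using (var reader = new BinaryReader(stream, GetGb18030()))
        {
            char[] chars = reader.ReadChars(charCount);
            return new string(chars);
        }
    }
}
```

Note that the count passed to ReadChars is a number of characters, not bytes, so a length field in the message that counts bytes would need to be handled differently (for example with ReadBytes followed by GetString).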
On the SQL Server, be sure that any column that needs to hold Chinese strings is of type nvarchar (or nchar), not just varchar (char). Otherwise, depending on the collation, the column may not be able to hold general Unicode characters (it may be represented internally by some 8-bit Microsoft code page).
Whenever you give an nchar literal in SQL, use the format N'my text', not just 'my text', to make sure the literal is interpreted as an nchar rather than just char. For example N'Erdős' is distinct from N'Erdos' while, in many collations, 'Erdős' and 'Erdos' might be (projected onto) the same value in the underlying code page.
Similarly N'东北特钢大连新基地' will work, while '东北特钢大连新基地' might result in a lot of question marks. From the update of your question:
sqlCommand = string.Format("INSERT INTO uber_chinese (columnName) VALUES(N'{0}')", myChineseString);
(Note the N before the opening quote. This is prone to SQL injection, of course.)
The default collation of your column will be that of your database (SQL_Latin1_General_CP1_CI_AS from your comment). Unless you ORDER BY that column, or similar, that will probably be fine. If you do order by this column, consider using some Chinese language collation for the column (or for the entire database).
I know it's a recurrent question here, but none of the answers have worked for me.
I'm receiving Unicode text from a system: just an email and a name from customers.
When I save these strings to my SQL DB, some characters appear as \u escapes.
For example, the emails end up in the DB as: name\u0040domain.com
How do I transform the Unicode string in my C# program to ASCII, so the DB gets name@domain.com?
I also want to replace special characters with an equivalent, or with nothing. For example "Hernán π" to "Hernan "
Thanks!
IMHO converting Unicode back to ASCII for some dubious storage or technical benefit isn't a good idea in the 21st century, especially since email is being changed to support Unicode in headers and bodies.
http://en.wikipedia.org/wiki/Unicode_and_e-mail
If the reason why you want to convert Hernán to Hernan is for searching, you should look at using an Accent Insensitive (AI) collation on your database, or coerce it to do so - see this SO post.
One thing you might need to double-check, however, is that your strings aren't getting pre-encoded before storage in your database (assuming that your DB column is set to accept Unicode, i.e. NVARCHAR etc.): the character '@' should be stored as '@' (0040 in UTF-16) and not as '\u0040'.
EDIT:
The "\uNNNN" encoding in a string might originate from Java or Python.
You might be able to trace the email string data up your architecture to find the source of this encoding and change it to something easier to decode in C#, such as UTF-8.
How do I treat an ASCII string as unicode and unescape the escaped characters in it in python?
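If the source of the escaping cannot be changed, the escapes can also be undone on the C# side. The following is a minimal sketch, assuming the input only contains escapes of the form \u followed by exactly four hex digits. The class name UnicodeEscapes and its Decode method are made up for this example.

```csharp
using System.Globalization;
using System.Text.RegularExpressions;

public static class UnicodeEscapes
{
    // Matches a literal backslash, 'u', and exactly four hex digits.
    private static readonly Regex EscapePattern = new Regex(@"\\u([0-9A-Fa-f]{4})");

    // Replaces every \uNNNN escape with the UTF-16 code unit it denotes.
    public static string Decode(string input)
    {
        return EscapePattern.Replace(input, match =>
        {
            int codeUnit = int.Parse(match.Groups[1].Value, NumberStyles.HexNumber, CultureInfo.InvariantCulture);
            return ((char)codeUnit).ToString();
        });
    }
}
```

With this, UnicodeEscapes.Decode(@"name\u0040domain.com") returns "name@domain.com". Text without escapes passes through unchanged.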
You can use Encoding.Convert for such operations. Read about this on MSDN
I'm pulling some internationalized text from a MS SQL Server 2005 database. As per the defaults for that DB, the characters are stored as UCS-2. However, I need to output the data in UTF-8 format, as I'm sending it out over the web. Currently, I have the following code to convert:
SqlString dbString = resultReader.GetSqlString(0);
byte[] dbBytes = dbString.GetUnicodeBytes();
byte[] utf8Bytes = System.Text.Encoding.Convert(System.Text.Encoding.Unicode,
System.Text.Encoding.UTF8, dbBytes);
System.Text.UTF8Encoding encoder = new System.Text.UTF8Encoding();
string outputString = encoder.GetString(utf8Bytes);
However, when I examine the output in the browser, it appears to be garbage, no matter what I set the encoding to.
What am I missing?
EDIT:
In response to the answers below, the reason I thought I had to perform a conversion is because I can output literal multibyte strings just fine. For example:
OutputControl.Text = "カルフォルニア工科大学とチューリッヒ工科大学は共同で、太陽光を保管可能な燃料に直接変えることのできる装置の開発に成功したとのこと";
works. Here, OutputControl is an ASP.Net Literal. However,
OutputControl.Text = outputString; //Output from above snippet
results in mangled output as described above. My hypothesis was that the database's output was somehow getting mangled by ASP.Net. If that's not the case, then what are some other possibilities?
EDIT 2:
Okay, I'm stupid. It turns out that there's nothing wrong with the database at all. When I tried inserting my own literal double byte characters (材料,原料;木料), I could read and output them just fine even without any conversion process at all. It seems to me that whatever is inserting the data into the DB is mangling the characters somehow, so I'm going to look at that. With my verified, "clean" data, the following code works:
OutputControl.Text = dbString.ToString();
as the responses below indicate it should.
Your code does essentially the same as:
SqlString dbString = resultReader.GetSqlString(0);
string outputString = dbString.ToString();
string itself is a UNICODE string (specifically, UTF-16, which is 'almost' the same as UCS-2, except for codepoints not fitting into the lowest 16 bits). In other words, the conversions you are performing are redundant.
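The "almost" can be seen by counting. In the sketch below (hypothetical helper names), a character outside the Basic Multilingual Plane occupies two UTF-16 code units, a surrogate pair, which UCS-2 has no notion of.

```csharp
public static class SurrogateDemo
{
    // Number of 16-bit code units, which is what string.Length reports.
    public static int CodeUnits(string text) => text.Length;

    // Number of Unicode code points, counting each surrogate pair once.
    public static int CodePoints(string text)
    {
        int count = 0;
        for (int i = 0; i < text.Length; i++)
        {
            bool isPair = char.IsHighSurrogate(text[i])
                && i + 1 < text.Length
                && char.IsLowSurrogate(text[i + 1]);
            if (isPair)
            {
                i++; // skip the low surrogate; the pair is one code point
            }
            count++;
        }
        return count;
    }
}
```

For "材料" both counts are 2. For the musical symbol U+1D11E (written "\U0001D11E" in C#) CodeUnits gives 2 while CodePoints gives 1.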
Your web app most likely mangles the encoding somewhere else as well, or sets a wrong encoding for the HTML output. However, that can't be diagnosed from the information you provided so far.
String in .net is 'encoding agnostic'.
You can convert bytes to a string using a particular encoding to tell .NET how to interpret your bytes.
You can convert a string to bytes using a particular encoding to tell .NET how you want your bytes served.
But trying to convert a string to another string using encodings makes no sense at all.
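A short sketch of both directions, and of what goes wrong when bytes are interpreted with the wrong encoding, is given below. The names are hypothetical, and ISO-8859-1 is used as the "wrong" decoder only because it maps every byte to a character, which makes the garbling easy to see.

```csharp
using System.Text;

public static class WrongDecoderDemo
{
    // string -> bytes: choose how the characters are served.
    public static byte[] ToUtf8(string text) => Encoding.UTF8.GetBytes(text);

    // bytes -> string, interpreting the bytes as what they really are.
    public static string DecodeAsUtf8(byte[] bytes) => Encoding.UTF8.GetString(bytes);

    // bytes -> string, interpreting UTF-8 bytes as ISO-8859-1 (deliberately wrong).
    public static string DecodeAsLatin1(byte[] bytes) =>
        Encoding.GetEncoding("iso-8859-1").GetString(bytes);
}
```

DecodeAsUtf8(ToUtf8("é")) returns "é", whereas DecodeAsLatin1(ToUtf8("é")) returns the two characters "Ã©". Garbage of that kind in a browser usually points at a mismatch between the bytes sent and the declared charset, not at the string itself.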