C# Chinese Encoding/Network

I have a Client/Server architecture where messages in text-format are exchanged.
For example:
12   2013/11/11   abcd   5
^    ^            ^      ^
int  date         text   int
Everything works fine with "normal" text.
Now this is a Chinese project, so they also want to send Chinese characters, encoded as GB18030 or GB2312.
I read the data this way:
char[] dataIn = binaryReader.ReadChars(length);
Then I create a new string from the char array and convert it to the right data type (int, float, string, etc.).
How can I change/enable Chinese encoding, or convert the string values to Chinese?
And what would be a good and easy way to test this?
Thanks.
I tried using something like this
string stringData = new string(dataIn).Trim();
byte[] data = Encoding.Unicode.GetBytes(stringData);
stringData = Encoding.GetEncoding("GB18030").GetString(data);
Without success.
Also, I need to save some text values to MS SQL Server 2008. Is this possible, and do I need to configure anything special?
I also tried this example, storing to the database and printing to the console, but I just get ????????:
string chinese = "123东北特钢大连新基地testtest";
byte[] utfBytes = Encoding.Unicode.GetBytes(chinese);
byte[] chineseBytes = Encoding.Convert(Encoding.Unicode, Encoding.GetEncoding("GB18030"), utfBytes);
string msg = Encoding.GetEncoding("GB18030").GetString(chineseBytes);
Edit
The problem was with the INSERT queries that I send to the database. I fixed it by using the N' prefix before the string:
sqlCommand = string.Format("INSERT INTO uber_chinese (columnName) VALUES(N'{0}')", myChineseString);
Also, the column data type has to be nvarchar instead of varchar.

This answer is "promoted" (by request from the original poster) from my comments.
In the .NET Framework, strings are already Unicode strings.
(Don't test Unicode strings by writing to the console, though, since the terminal window and console typically won't display them correctly. However, since .NET version 4.5 there is some support for this.)
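(If you do want to eyeball strings in a console window while testing, one thing that may help on .NET 4.5 and later is switching the console's output encoding, provided the console font can actually display the characters; a minimal sketch:)
Console.OutputEncoding = System.Text.Encoding.UTF8;
Console.WriteLine("123东北特钢大连新基地testtest");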
The thing to be aware of is the Encoding when you get text from an outside source. In this case, the constructor of BinaryReader offers an overload that takes in an Encoding:
using (var binaryReader = new BinaryReader(yourStream, Encoding.GetEncoding("GB18030")))
...
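For example, here is a minimal sketch of reading one length-prefixed message with that reader (yourStream and the length field are assumptions based on the snippet in the question):
// using System.IO; using System.Text;
using (var binaryReader = new BinaryReader(yourStream, Encoding.GetEncoding("GB18030")))
{
    int length = binaryReader.ReadInt32();          // however your protocol transmits the length
    char[] dataIn = binaryReader.ReadChars(length); // the chars are decoded from GB18030 here
    string stringData = new string(dataIn).Trim();  // now an ordinary Unicode .NET string
    // ...split stringData into the int / date / text fields as before
}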
On the SQL Server, be sure that any column that needs to hold Chinese strings is of type nvarchar (or nchar), not just varchar (char). Otherwise, depending on the collation, the column may not be able to hold general Unicode characters (it may be represented internally by some 8-bit Microsoft code page).
Whenever you give an nchar literal in SQL, use the format N'my text', not just 'my text', to make sure the literal is interpreted as an nchar rather than just char. For example N'Erdős' is distinct from N'Erdos' while, in many collations, 'Erdős' and 'Erdos' might be (projected onto) the same value in the underlying code page.
Similarly N'东北特钢大连新基地' will work, while '东北特钢大连新基地' might result in a lot of question marks. From the update of your question:
sqlCommand = string.Format("INSERT INTO uber_chinese (columnName) VALUES(N'{0}')", myChineseString);
(note the N prefix before the string value)
(This is prone to SQL injection, of course.)
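To sidestep both the injection risk and the N'...' quoting entirely, you could pass the value as a parameter instead; a rough sketch, assuming the same table/column as above and an already-open SqlConnection named connection:
// using System.Data; using System.Data.SqlClient;
using (var cmd = new SqlCommand("INSERT INTO uber_chinese (columnName) VALUES (@text)", connection))
{
    // NVarChar parameters are sent to the server as Unicode, so no N prefix is needed.
    cmd.Parameters.Add("@text", SqlDbType.NVarChar, 200).Value = myChineseString;
    cmd.ExecuteNonQuery();
}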
The default collation of your column will be that of your database (SQL_Latin1_General_CP1_CI_AS from your comment). Unless you ORDER BY that column, or similar, that will probably be fine. If you do order by this column, consider using some Chinese language collation for the column (or for the entire database).
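If you do give the column a Chinese collation, it could look roughly like this (table name, column name and length are placeholders; Chinese_PRC_CI_AS is one of several Chinese collations SQL Server ships with):
// One-off schema change; assumes the same open connection as above.
using (var alter = new SqlCommand(
    "ALTER TABLE uber_chinese ALTER COLUMN columnName NVARCHAR(200) COLLATE Chinese_PRC_CI_AS",
    connection))
{
    alter.ExecuteNonQuery();
}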

Related

Arabic_CI_AS to utf8 in C#

I have a database in SQL Server with collation Arabic_CI_AS, and I need to compare some string data with another Postgres database that uses the UTF-8 character set. I use C# to convert and compare. It is easily done when the string contains just one word (in those cases I just replace 'ي' with 'ی'), but for longer strings, especially ones containing the '(' character, there is a problem.
I can't do it! I have tried some suggested solutions, such as:
var enc = Encoding.GetEncoding(1256);
byte[] encBytes = enc.GetBytes(customer.name);
customer.name = Encoding.UTF8.GetString(encBytes, 0, encBytes.Length);
or:
SELECT cast (name as nvarchar) as NewName
from Customer
But they don't work! Can anyone help me?
Maybe this can help you to change your collation dynamically:
SELECT name collate SQL_Latin1_General_CP1_CI_AS
from Customer
or
SELECT name collate Persian_100_CI_AI
from Customer
or
Or you can try this on the C# side:
byte[] enBuff = Encoding.GetEncoding("windows-1256").GetBytes(customer.name);
customer.name = Encoding.GetEncoding("windows-1252").GetString(enBuff);
You can choose other collations too.
You may have to try several different collations and code pages to get the result you want.
SQL Server does not support UTF-8 strings. If you have to deal with characters other than plain Latin, it is strongly recommended to use NVARCHAR instead of VARCHAR with an Arabic collation.
Many people think that NVARCHAR is UTF-16 while VARCHAR is UTF-8. This is not true! The latter is extended ASCII and uses 1 byte in any case, while UTF-8 encodes some characters with more than one byte.
So - the most important question is: WHY?
SQL Server can take your string into a NVARCHAR variable, cast it to a chain of bytes and re-cast it to the former string:
DECLARE @str NVARCHAR(MAX) = N'(نماینده اراک)';
SELECT @str
      ,CAST(@str AS VARBINARY(MAX))
      ,CAST(CAST(@str AS VARBINARY(MAX)) AS NVARCHAR(MAX));
The problem with the ) is, quite probably, that your Arabic letters are right-to-left while the ) is left-to-right. I wanted to paste the result of the query above into this answer but did not manage to get the closing ) back to its original place... You try to edit, delete, replace, but you get something else... Somewhat funny, but this is not a question of bad encoding, rather one of buggy editors...
Anyway, SQL Server is not your issue. You must read the string as NVARCHAR out of SQL Server. C# works with Unicode strings, not with collated 1-byte strings. Every conversion carries the chance of destroying your text.
If your target (or the tooltip you showed us) is not capable of showing the string properly, the string itself might be perfectly okay, but the editor is not...
If you pass such a UTF-8 string back to SQL Server, you'll get a mess...
The only place where UTF-8 makes sense is when text is written to a file or transmitted over a narrow band. If a text contains very many plain Latin characters and just a few special letters (as is very often the case with XML or HTML), you can save quite some disk space or bandwidth. With a Far East text you would even bloat the text: some of those characters need 3 or even 4 bytes to be encoded.
Within your database and application you should stick with Unicode.
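In practice that means reading the column as-is and letting the provider hand you a .NET string; a minimal sketch (the connection string, table and column names are placeholders):
// using System.Data.SqlClient;
using (var conn = new SqlConnection(connectionString))
using (var cmd = new SqlCommand("SELECT name FROM Customer", conn))
{
    conn.Open();
    using (var reader = cmd.ExecuteReader())
    {
        while (reader.Read())
        {
            // Already a Unicode string; no Encoding.GetEncoding(...) round-trips needed.
            string name = reader.GetString(0);
        }
    }
}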

Encoding insert-value Oracle

I have some data from a file and I'm inserting it into an Oracle database.
Here is an example of the problem:
Column type VARCHAR2, size 3 bytes
I've tried to insert 'абв' and saw the exception: ORA-12899: value too large for column (actual: 6, maximum: 3)
That's because each of these characters is encoded in two bytes. Okay, so we will re-encode. The database character set is AL32UTF8. The encoding of the file is CP866.
An unsuccessful attempt to re-encode:
Encoding srcEncodingFormat = Encoding.GetEncoding(866);
Encoding dstEncodingFormat = Encoding.UTF8;
byte[] originalByteString = srcEncodingFormat.GetBytes(s);
byte[] convertedByteString = Encoding.Convert(srcEncodingFormat,
dstEncodingFormat, originalByteString);
s = dstEncodingFormat.GetString(convertedByteString);
We can't change the column type. We can't use VARCHAR2(3 CHAR) either. How can I solve it? Is it possible to explicitly specify the encoding of the value to add when data is inserted into the database?
NLS_LANGUAGE AMERICAN
NLS_TERRITORY AMERICA
NLS_CURRENCY $
NLS_ISO_CURRENCY AMERICA
NLS_NUMERIC_CHARACTERS .,
NLS_CHARACTERSET AL32UTF8
NLS_CALENDAR GREGORIAN
NLS_DATE_FORMAT DD-MON-RR
NLS_DATE_LANGUAGE AMERICAN
NLS_SORT BINARY
NLS_TIME_FORMAT HH.MI.SSXFF AM
NLS_TIMESTAMP_FORMAT DD-MON-RR HH.MI.SSXFF AM
NLS_TIME_TZ_FORMAT HH.MI.SSXFF AM TZR
NLS_TIMESTAMP_TZ_FORMAT DD-MON-RR HH.MI.SSXFF AM TZR
NLS_DUAL_CURRENCY $
NLS_COMP BINARY
NLS_LENGTH_SEMANTICS BYTE
NLS_NCHAR_CONV_EXCP FALSE
NLS_NCHAR_CHARACTERSET AL16UTF16
NLS_RDBMS_VERSION 11.2.0.2.0
These are my NLS parameters. The fact is that the boss strictly forbids changing anything at the database level. Is there any way to do this without such changes?
Unfortunately what you want to do cannot be achieved:
Your string 'абв' requires 6 bytes in the AL32UTF8 character set.
Your column only allows up to 3 bytes.
You cannot define a specific character set for a single column.
Every time you provide the database with a string in a specific encoding, it automatically translates it to the correct representation in its own character set. This is a feature: you can insert (and query) with different clients using different character set settings and always get the correct encoding.
This leads to an ugly trick which is possible in some clients (I don't know about C#):
When sending a set of characters to the database, you tell it that the string is already in the database's NLS_CHARACTERSET. As no conversion is needed, the string is often not checked either, just inserted into the row.
As long as the string is only selected by the same client (with the same character set as the database), everything seems fine.
But whenever the string is used inside the database (most likely somewhere in the WHERE part of a query), unforeseen results will appear. The same is true if any client with another encoding ever tries to access this data.
This is why I recommend not implementing such hacks.
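What you can safely do on the C# side is check how many bytes a value will occupy in AL32UTF8 before attempting the insert, since the column uses byte length semantics; a minimal sketch:
// using System.Text;
string s = "абв";
int byteLength = Encoding.UTF8.GetByteCount(s);   // 6 bytes for "абв"
if (byteLength > 3)
{
    // The value cannot fit into VARCHAR2(3 BYTE): truncate, reject or log it here.
}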

Converting UTF-8 Encoded Data from Hashtable of ASP.NET Webform Before Inserting Into SQL Server Database

What I am working with:
Within my ASP.NET WebForms application, I am getting form data from the user and then inserting that data into a SQL Server database. Each key is the identifier for the field from within the form, and the value is the data entered by the user.
My Issue:
My issue is that users are copying and pasting UTF-8 data from emails, etc. into the "notes" field. The SQL Server database does not recognize UTF-8 as valid character data; instead, it uses the UCS-2 and ISO-8859-1 character sets. Thus, these characters are being inserted into the database as question marks (?). So, I would like to properly convert any UTF-8 characters to UCS-2 or ISO-8859-1.
Questions:
Should I convert the UTF-8 characters to UCS-2 or to ISO-8859-1?
Within the ASP.NET web form, what is the best means of determining the character sets used within the value for the "notes" key of my hashtable?
What is the best possible means for converting the characters that are UTF-8 into the acceptable character set?
Option 1: use nvarchar
You could just change your field from varchar to nvarchar so that your unicode characters are stored correctly. That's the point of that nvarchar data type. It's cool. Use it.
Option 2: Convert Intelligently.
If you have a legacy DB where nvarchar simply won't work, then you can just create a string extension that lets you store the ASCII version of your users' values. Below is one such extension (note that we do some initial replacements for "smart" quotes etc. before ditching all characters that aren't ASCII).
If you're supporting international text (accents, etc.), then this is a little culturally insensitive ("bah, away with your crazy accent marks and strange non-English looking letters").
using System.Text.RegularExpressions;

public static class StringExt {
    public static string TryGetAsciiString(this string original) {
        // Replace those MS Word "smart" characters with ASCII (dumb) characters.
        string escaped = original
            .Replace('\u2013', '-').Replace('\u2014', '-').Replace('\u2015', '-').Replace('\u2017', '_')
            .Replace('\u2018', '\'').Replace('\u2019', '\'').Replace('\u201a', ',').Replace('\u201b', '\'')
            .Replace('\u201c', '\"').Replace('\u201d', '\"').Replace('\u201e', '\"')
            .Replace("\u2026", "...").Replace('\u2032', '\'').Replace('\u2033', '\"');
        // Regex out all the other non-ASCII characters.
        escaped = Regex.Replace(escaped, "[^A-Za-z 0-9 \\.,\\?\'\"!@#\\$%\\^&\\*\\(\\)-_=\\+;:<>\\/\\\\\\|\\}\\{\\[\\]`~\\n\\r]*", "");
        // All set.
        return escaped;
    }
}
Option ... err ... 2A?: Ditch the first 30 ASCII codes (give or take)
I've noticed that, when users copy/paste from Mac Word (and a few other programs), the pasted data contains characters from the first 30 ASCII codes. Aside from 9, 10 and 13 (tab, LF, CR), you can probably ditch those (they're just NULs, ACKs, DCs and other garbage no user would actually type).
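A rough sketch of that clean-up (the helper name is made up; it keeps tab, line feed and carriage return and drops the rest of the low control characters):
// using System.Linq;
public static string StripLowControlChars(string pasted)
{
    // Keep 9 (tab), 10 (LF) and 13 (CR); drop every other character below 32.
    return new string(pasted.Where(c => c >= 32 || c == '\t' || c == '\n' || c == '\r').ToArray());
}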

Insert Russian Language data into database from an array

My query looks like:
string str = string.Format("Insert into [MyDB].[dbo].[tlb1] ([file_path],[CONTENT1],[CONTENT2]) values ('{0}','{1}','{2}');", fullpath, _val[0], _val[1]);
Now when I insert data into the database, if the array _val[] contains English text it is inserted correctly, but when the array contains Russian text it shows up in the database like ???????????????????????
Is there a way to insert Russian-language data from an array?
According to this (Archived) Microsoft Support Issue:
You must precede all Unicode strings with a prefix N when you deal with Unicode string constants in SQL Server
First of all, you should use prepared statements and let the database driver insert the placeholders correctly (i.e. SqlCommand with parameters). Then the issue should go away (as well as any potential SQL injection problems).
As a quick fix in your case: Prefix the string literals you're inserting with N:
... values (N'{0}',N'{1}',N'{2}')
This makes the literals Unicode literals rather than arbitrary-legacy-codepage ones, thus preventing the conversion from Unicode to the legacy code page (which yields question marks for characters that cannot be represented).
It seems that the data type of columns [Content1] and [Content2] is nchar. You should convert the columns to nvarchar, which is used to store Unicode data.
First of all, check the database code page on the server. The database may use a non-Unicode code page, while the data from your app comes in Unicode format.
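A quick way to check the database's default collation from C# (the connection string is a placeholder):
// using System.Data.SqlClient;
using (var conn = new SqlConnection(connectionString))
using (var cmd = new SqlCommand("SELECT DATABASEPROPERTYEX(DB_NAME(), 'Collation')", conn))
{
    conn.Open();
    string collation = Convert.ToString(cmd.ExecuteScalar());  // e.g. SQL_Latin1_General_CP1_CI_AS
}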

Insert UTF8 data into a SQL Server 2008

I have an issue with encoding. I want to put data from a UTF-8-encoded file into a SQL Server 2008 database. SQL Server only features UCS-2 encoding, so I decided to explicitly convert the retrieved data.
// connect to page file
_fsPage = new FileStream(mySettings.filePage, FileMode.Open, FileAccess.Read);
_streamPage = new StreamReader(_fsPage, System.Text.Encoding.UTF8);
Here's the conversion routine for the data:
private string ConvertTitle(string title)
{
    string utf8_String = Regex.Replace(Regex.Replace(title, @"\\.", _myEvaluator), @"(?<=[^\\])_", " ");
    byte[] utf8_bytes = System.Text.Encoding.UTF8.GetBytes(utf8_String);
    byte[] ucs2_bytes = System.Text.Encoding.Convert(System.Text.Encoding.UTF8, System.Text.Encoding.Unicode, utf8_bytes);
    string ucs2_String = System.Text.Encoding.Unicode.GetString(ucs2_bytes);
    return ucs2_String;
}
When stepping through the code for critical titles, the variable watch shows the correct characters for both the UTF-8 and UCS-2 strings. But in the database it is partially wrong: some special characters are saved correctly, others are not.
Wrong: ń becomes an n
Right: É or é, for example, are inserted correctly.
Any idea where the problem might be and how to solve it?
Thanks in advance,
Frank
SQL Server 2008 handles the conversion from UTF-8 into UCS-2 for you.
First make sure your SQL tables are using the nchar or nvarchar data types for the columns. Then you need to tell SQL Server you're sending in Unicode data by adding an N in front of the encoded string.
INSERT INTO tblTest (test) VALUES (N'EncodedString')
from Microsoft http://support.microsoft.com/kb/239530
See my question and solution here: How do I convert UTF-8 data from Classic asp Form post to UCS-2 for inserting into SQL Server 2008 r2?
I think you have a misunderstanding of what encodings are. An encoding is used to convert a bunch of bytes into a character string. A String does not itself have an encoding associated with it.
Internally, Strings are stored in memory as UTF-16LE bytes (which is why Windows persists in confusing everyone by calling the UTF-16LE encoding just “Unicode”). But you don't need to know that — to you, they're just strings of characters.
What your function does is:
Takes a string and converts it to UTF-8 bytes.
Takes those UTF-8 bytes and converts them to UTF-16LE bytes. (You could have just encoded straight to UTF-16LE instead of UTF-8 in step one.)
Takes those UTF-16LE bytes and converts them back to a string. This gives you the exact same String you had in the first place!
So this function is redundant; you can actually just pass a normal String to SQL Server from .NET and not worry about it.
The bit with the backslashes does do something, presumably application-specific; I don't understand what it's for. But nothing in that function will cause Windows to flatten characters like ń to n.
What /will/ cause that kind of flattening is when you try to put characters that aren't in the database's own encoding into the database. Presumably é is OK because that character is in your default encoding of cp1252 Western European, but ń is not, so it gets mangled.
SQL Server does use 'UCS-2' (really UTF-16LE again) to store Unicode strings, but you have to tell it to, typically by using a NATIONAL CHARACTER (NCHAR/NVARCHAR) column type instead of plain CHAR.
We were also very confused about encoding. Here is a useful page that explains it.
Also, the answer to the following SO question will help explain it too:
In C# String/Character Encoding what is the difference between GetBytes(), GetString() and Convert()?
For future readers using newer releases, note that SQL Server 2016 supports UTF-8 in its bcp utility.
