MySQL comparing Japanese characters in a query as question marks - c#

I have a MySQL database with some varchar fields that can contain Latin characters or Japanese characters. There are entries that contain Japanese characters; that in itself is not a problem. However, from my C# code, using MySqlConnection, I have been unable to get correct results when using Japanese characters in my WHERE clauses. It seems to compare the Japanese characters as though they were question marks. For example, a query with WHERE series_title LIKE '%未来警%' does not return rows where series_title contains "未来警", but instead returns all entries where series_title contains "???".
Some details:
series_title is a varchar(150) with collation utf8_general_ci.
the connection string for the MySqlConnection includes the key-value pair CharSet=utf8_general_ci
the database does contain Japanese characters and is able to return them to the C# client - it only has problems when Japanese characters are being sent to it

Try adding charset=utf8 to your connection string:
server=server;uid=my_user;password=pass;database=db;charset=utf8;
Note that utf8_general_ci is a collation name, not a character set, so CharSet=utf8_general_ci will not take effect.
EDIT:
Try executing this SQL after connecting:
SET NAMES utf8
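For completeness, a minimal sketch of the C# side (assuming the MySql.Data connector; the table name and connection details are placeholders, not from the question). Passing the Japanese text as a parameter also keeps it out of the SQL string itself:
using System;
using MySql.Data.MySqlClient;

class Program
{
    static void Main()
    {
        // charset=utf8 (a character set, not a collation) is the key part.
        var connStr = "server=server;uid=my_user;password=pass;database=db;charset=utf8;";
        using (var conn = new MySqlConnection(connStr))
        {
            conn.Open();
            // Hypothetical table name; the question only shows the column.
            var sql = "SELECT series_title FROM series WHERE series_title LIKE @title";
            using (var cmd = new MySqlCommand(sql, conn))
            {
                cmd.Parameters.AddWithValue("@title", "%未来警%");
                using (var reader = cmd.ExecuteReader())
                {
                    while (reader.Read())
                        Console.WriteLine(reader.GetString(0));
                }
            }
        }
    }
}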

I would ensure your data is stored using the right encoding. For Japanese, you might want to try eucjp, and you can find out more than you ever wanted to know about character encoding here. It looks like you may also need the BOM. Best of luck and let me know how you get on.

Related

Arabic_CI_AS to utf8 in C#

I have a database in SQL Server with collation Arabic_CI_AS, and I need to compare some string data against another Postgres database that uses the UTF-8 character set. I also use C# to convert and compare. It works easily when the string contains just one word (in those cases I just replace 'ي' with 'ی'), but long strings, especially ones containing the '(' character, are a problem.
I can't get it to work. I have tried some suggested solutions, such as:
// Re-encode via the Arabic (Windows-1256) code page:
var enc = Encoding.GetEncoding(1256);
byte[] encBytes = enc.GetBytes(customer.name);
customer.name = Encoding.UTF8.GetString(encBytes, 0, encBytes.Length);
or:
SELECT cast (name as nvarchar) as NewName
from Customer
But they don't work! Can anyone help me?
Maybe this can help you change your collation dynamically:
SELECT name COLLATE SQL_Latin1_General_CP1_CI_AS
FROM Customer
or
SELECT name COLLATE Persian_100_CI_AI
FROM Customer
Or you can try this on the C# side:
// Round-trip the text through two code pages: encode the characters
// as Arabic (Windows-1256) bytes, then decode those bytes as Windows-1252.
byte[] enBuff = Encoding.GetEncoding("windows-1256").GetBytes(customer.name);
customer.name = Encoding.GetEncoding("windows-1252").GetString(enBuff);
You can choose other collations too.
You may have to try several collations and encoding code pages to get the result you want.
SQL Server does not support UTF-8 strings. If you have to deal with characters other than plain Latin, it is strongly recommended to use NVARCHAR instead of VARCHAR with an Arabic collation.
Many people think that NVARCHAR is UTF-16 while VARCHAR is UTF-8. This is not true! The latter is extended ASCII and always uses 1 byte per character, while UTF-8 encodes some characters with more than one byte.
So - the most important question is: WHY?
SQL Server can take your string into an NVARCHAR variable, cast it to a chain of bytes and re-cast it to the former string:
DECLARE @str NVARCHAR(MAX) = N'(نماینده اراک)';
SELECT @str
      ,CAST(@str AS VARBINARY(MAX))
      ,CAST(CAST(@str AS VARBINARY(MAX)) AS NVARCHAR(MAX));
The problem with the ) is, quite probably, that your Arabic letters are right-to-left while the ) is left-to-right. I wanted to paste the result of the query above into this answer but did not manage to get the closing ) into its original place... You try to edit, delete, replace, but you get something else. Somehow funny, but this is not a question of bad encoding; it is one of buggy editors...
Anyway, SQL Server is not your issue. You must read the string as NVARCHAR out of SQL Server. C# works with Unicode strings, not with collated 1-byte strings. Every conversion carries the chance of destroying your text.
If your target (or the tooltip you show us) is not capable of showing the string properly, the string might be perfectly okay, but the editor is not...
If you pass such a UTF-8 string back to SQL Server, you'll get a mess...
The only place where UTF-8 makes sense is text written to a file or transmitted over a narrow bandwidth. If a text contains very many plain Latin characters and just a few special letters (as is very often the case with XML and HTML), you can save quite some disk space or bandwidth. With a Far East text you would even bloat your text: some of those characters need 3 or even 4 bytes to encode.
Within your database and application you should stick with unicode.
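To make that concrete, here is a small sketch of the recommended round trip (the connection string is a placeholder; the Customer table is borrowed from the question). The point is that nothing in the pipeline ever drops down to a 1-byte code page:
using System;
using System.Data;
using System.Data.SqlClient;

class UnicodeRoundTrip
{
    static void Main()
    {
        using (var conn = new SqlConnection("Server=.;Database=db;Integrated Security=true;"))
        {
            conn.Open();
            var sql = "SELECT name FROM Customer WHERE name = @name";
            using (var cmd = new SqlCommand(sql, conn))
            {
                // Typed as NVarChar, the parameter is sent as Unicode
                // (the equivalent of an N'...' literal).
                cmd.Parameters.Add("@name", SqlDbType.NVarChar, 200).Value = "نماینده اراک";
                using (var reader = cmd.ExecuteReader())
                {
                    while (reader.Read())
                        Console.WriteLine(reader.GetString(0)); // already UTF-16 in .NET
                }
            }
        }
    }
}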

Converting UTF-8 Encoded Data from Hashtable of ASP.NET Webform Before Inserting Into SQL Server Database

What I am working with:
Within my ASP.NET WebForms application, I am getting form data from the user and then inserting that data into a SQL Server database. Each key is the identifier for the field within the form, and the value is the data entered by the user.
My Issue:
My issue is that users are copying and pasting UTF-8 data from emails, etc. into the "notes" field. The SQL Server database does not recognize UTF-8 as valid character data; it uses the UCS-2 and ISO-8859-1 character sets. As a result, these characters are being inserted into the database as question marks (?). So, I would like to properly convert any UTF-8 characters to UCS-2 or ISO-8859-1.
Questions:
Should I convert the UTF-8 characters to UCS-2 or to ISO-8859-1?
Within the ASP.NET web form, what is the best means of determining the character sets used within the value for the "notes" key of my hashtable?
What is the best possible means for converting the characters that are UTF-8 into the acceptable character set?
Option 1: use nvarchar
You could just change your field from varchar to nvarchar so that your Unicode characters are stored correctly. That's the point of the nvarchar data type. It's cool. Use it.
Option 2: Convert Intelligently.
If you have a legacy db where nvarchar simply won't work, then you can just create a string extension that lets you store the ASCII version of the values from users. Below is one such extension (note that we do some initial replacements for "smart" quotes etc. before ditching all characters that aren't ASCII).
Note that if you're supporting international text (accents, etc.), this is a little culturally insensitive ("bah - away with your crazy accent marks and strange non-English looking letters").
using System.Text.RegularExpressions;

public static class StringExt {
    public static string TryGetAsciiString(this string original) {
        // Replace the MS Word "smart" characters with ASCII (dumb) characters.
        string escaped = original
            .Replace('\u2013', '-').Replace('\u2014', '-').Replace('\u2015', '-')
            .Replace('\u2017', '_')
            .Replace('\u2018', '\'').Replace('\u2019', '\'')
            .Replace('\u201a', ',').Replace('\u201b', '\'')
            .Replace('\u201c', '\"').Replace('\u201d', '\"').Replace('\u201e', '\"')
            .Replace("\u2026", "...")
            .Replace('\u2032', '\'').Replace('\u2033', '\"');
        // Regex out all the other non-ASCII characters.
        escaped = Regex.Replace(escaped, "[^A-Za-z 0-9 \\.,\\?\'\"!@#\\$%\\^&\\*\\(\\)-_=\\+;:<>\\/\\\\\\|\\}\\{\\[\\]`~\\n\\r]*", "");
        // All set..
        return escaped;
    }
}
Option ... err ... 2A?: Ditch the first 30 ASCII codes (give or take)
I've noticed that when users copy/paste from Mac Word (and a few other programs), the pasted data contains characters from the first 30 ASCII codes. Aside from 9, 10, and 13 (tab, line feed, carriage return), you can probably ditch those (they're just NULs, ACKs, DCs, and other control garbage no user would actually type).
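A sketch of what that could look like (my own helper, not from the answer): keep tab, LF and CR, and drop the rest of the control range:
using System.Linq;

public static class ControlCharExt
{
    // Removes ASCII control characters except tab (9), LF (10) and CR (13).
    public static string StripControlChars(this string value)
    {
        return new string(value
            .Where(c => c >= ' ' || c == '\t' || c == '\n' || c == '\r')
            .ToArray());
    }
}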

C# SQL Select Chinese characters returns weird characters

I'm trying to convert a piece of software to Chinese, but I'm having some problems with the database. It returns weird strings of characters, and my guess is that it's the wrong encoding, but I'm not sure what to do.
If I set column data to 头版 it returns
>> 头版
If I set column data to 头版 it returns
>> ??
In a way it works, because if I insert '头版' into the database, it comes back as '头版'. But I would like the characters to be stored and displayed correctly, so that searching through the database is easier.
I've tried running this query when connected to the database
SET NAMES utf8;
Also tried this
SET NAMES utf8; SELECT * FROM `table` ORDER BY num;
But it doesn't change anything.
The culture is set to zh-Hans.
The column should be nvarchar. This type supports Unicode and allows non-English characters (such as Mandarin, Arabic, etc.) to be used.
Update
The above was for SQL Server. For MySQL, the column should be VARCHAR(50) CHARACTER SET ucs2.
UCS-2 is better than utf8 here because most Chinese characters fit in a single 16-bit code unit, so each takes 2 bytes, whereas utf8 needs 3 bytes to store each of those code points.

Insert Russian Language data into database from an array

My query looks like:
string str = string.Format("Insert into [MyDB].[dbo].[tlb1] ([file_path],[CONTENT1],[CONTENT2]) values ('{0}','{1}','{2}');", fullpath, _val[0], _val[1]);
Now, when I insert data into the database, English text in the array _val[] is inserted correctly, but when the array contains Russian text it shows up in the database like ???????????????????????
Is there a way to insert Russian-language data from an array?
According to this (Archived) Microsoft Support Issue:
You must precede all Unicode strings with a prefix N when you deal with Unicode string constants in SQL Server
First of all, you should use prepared statements and let the database driver insert the placeholders correctly (i.e. SqlCommand with parameters). Then the issue should go away (as well as any potential SQL injection problems).
As a quick fix in your case: Prefix the string literals you're inserting with N:
... values (N'{0}',N'{1}',N'{2}')
This causes the literals to be Unicode literals rather than arbitrary-legacy-codepage ones, thus preventing the conversion from Unicode to the legacy code page (which turns unrepresentable characters into question marks).
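A sketch of the prepared-statement version the first paragraph recommends (connectionString and the nvarchar column sizes are assumptions; fullpath and _val come from the question):
using System.Data;
using System.Data.SqlClient;

var sql = "INSERT INTO [MyDB].[dbo].[tlb1] ([file_path],[CONTENT1],[CONTENT2]) " +
          "VALUES (@path, @c1, @c2)";
using (var conn = new SqlConnection(connectionString))
using (var cmd = new SqlCommand(sql, conn))
{
    // NVarChar parameters are transmitted as Unicode, so the Russian
    // text never passes through a legacy code page. Size -1 means MAX.
    cmd.Parameters.Add("@path", SqlDbType.NVarChar, 260).Value = fullpath;
    cmd.Parameters.Add("@c1", SqlDbType.NVarChar, -1).Value = _val[0];
    cmd.Parameters.Add("@c2", SqlDbType.NVarChar, -1).Value = _val[1];
    conn.Open();
    cmd.ExecuteNonQuery();
}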
It seems that the datatype of the columns [Content1] and [Content2] is nchar. You should convert the columns to nvarchar, which is used to store Unicode data.
First of all, you should check the database code page on the server. The database may use a non-Unicode code page while the data from your app arrives in Unicode format.

c#: How to convert a Unicode character to its ASCII equivalent

I know this is a recurring question here, but none of the answers have worked for me.
From a system I'm receiving Unicode text. Just an email + name from customers.
When I save these strings to my SQL DB, some chars appear escaped with \u.
For example, the emails end up in the DB as: name\u0040domain.com
How do I transform the Unicode string in my C# program to ASCII, so the DB gets name@domain.com?
Also, special chars should be replaced with an equivalent, or dropped... For example "Hernán π" becomes "Hernan ".
Thanks!
IMHO converting Unicode back to ASCII for some dubious storage or technical benefit isn't a good idea in the 21st century, especially since email is being changed to support Unicode in headers and bodies.
http://en.wikipedia.org/wiki/Unicode_and_e-mail
If the reason why you want to convert Hernán to Hernan is for searching, you should look at using an Accent Insensitive (AI) collation on your database, or coerce it to do so - see this SO post.
One thing you might need to double-check, however, is that your strings aren't getting pre-encoded before storage in your database (assuming your DB column is set to accept Unicode, i.e. NVARCHAR etc.); the character '@' should be stored as '@' (U+0040 in UTF-16) and not as '\u0040'.
EDIT:
The "\uNNNN" encoding in a string might originate from Java or Python.
You might be able to trace the email string data up your architecture to find the source of this encoding and change it to something more easy to decode in C# such as UTF-8.
How do I treat an ASCII string as unicode and unescape the escaped characters in it in python?
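In C#, a rough sketch of both steps (my own helper, under the assumption that the DB really contains literal \uNNNN sequences; Regex.Unescape understands that escape form, and the Normalize(FormD) pass is the usual accent-stripping trick):
using System.Globalization;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;

public static class AsciiHelper
{
    public static string ToPlainAscii(string input)
    {
        // "name\u0040domain.com" -> "name@domain.com"
        string unescaped = Regex.Unescape(input);

        // Decompose accented letters ("á" -> "a" + combining mark)
        // and drop the combining marks: "Hernán" -> "Hernan".
        string noAccents = new string(unescaped
            .Normalize(NormalizationForm.FormD)
            .Where(c => CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
            .ToArray());

        // Discard anything still outside ASCII (e.g. "π").
        return new string(noAccents.Where(c => c < 128).ToArray());
    }
}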
You can use Encoding.Convert for such operations. Read about it on MSDN.
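Note that Encoding.Convert works on byte arrays rather than strings, so the usual pattern is roughly this (characters that ASCII cannot represent are replaced with '?' by the default encoder fallback):
using System.Text;

byte[] utf8Bytes = Encoding.UTF8.GetBytes("Hernán");
byte[] asciiBytes = Encoding.Convert(Encoding.UTF8, Encoding.ASCII, utf8Bytes);
string ascii = Encoding.ASCII.GetString(asciiBytes); // "Hern?n"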
