Allowing Simplified Chinese Input - C#

The company I work for is bidding on a project that will require our eCommerce solution to accept simplified Chinese input. After doing a bit of research, it seems that ASP.NET makes globalization configuration easy:
<configuration>
  <system.web>
    <globalization
      fileEncoding="utf-8"
      requestEncoding="utf-8"
      responseEncoding="utf-8"
      culture="zh-Hans"
      uiCulture="en-us" />
  </system.web>
</configuration>
Questions:
Is this really all there is to it in ASP.NET? It seems too good to be true.
Are there any DB considerations with SQL Server 2005? Will the DB accept the simplified Chinese without additional configuration?

Ad 1. The real question is how far you want to go with internationalization, because i18n is not only about allowing Unicode input. You need to at least support local date, time and number formats and local collation (mostly related to sorting), and ensure that your application runs correctly on localized operating systems (unless you are developing a Cloud aka hosted solution). You might want to read more on the topic here.
As far as support for Chinese character input goes, if you are going to offer software in China, you need to support at least GB18030-2000. To do that, you need a .NET Framework version that supports Unicode 3.0 - I believe it has been supported since .NET Framework 2.0.
However, if you want to go one step further (which might be required to gain a competitive edge), you might want to support GB18030-2005. The only problem is that full support for these characters (CJK Unified Ideographs Extension B) came later in the process (I am not really sure whether it was Unicode 6.0 or Unicode 6.1). Therefore you might be forced to use the latest .NET Framework and still not be sure that it covers everything.
You might want to read Unicode FAQ on Han characters.
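As a quick sanity check for GB18030 handling, you can round-trip text through .NET's built-in encoding support. A minimal sketch (note the assumption in the comment: on .NET Core/5+ the code-page encodings must be registered first, while the classic .NET Framework has GB18030 out of the box):

```csharp
using System;
using System.Text;

class Gb18030Demo
{
    static void Main()
    {
        // On .NET Core/5+ register code-page encodings first:
        // Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        // On the classic .NET Framework, GB18030 is available out of the box.
        Encoding gb18030 = Encoding.GetEncoding("GB18030");

        string text = "\u4E2D\u6587"; // "Chinese (language)" in simplified Chinese
        byte[] bytes = gb18030.GetBytes(text);

        Console.WriteLine(gb18030.GetString(bytes) == text); // True: lossless round trip
    }
}
```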
Ad 2. I strongly advise you not to use SQL Server 2005 with Chinese characters. The reason is that the old SQL Server engine supports only UCS-2 rather than UTF-16. This might seem like a slight difference, but it poses a real problem with 4-byte Han ideographs: you won't be able to use them in queries (i.e. in LIKE or WHERE clauses) - you will receive all records instead. That's how it works. And to support them, you would need to set a very specific Chinese collation, which would simply break support for other languages.
Basically, using SQL Server 2005 with Chinese Ideographs is a bad idea.
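The UCS-2 vs. UTF-16 point can be made concrete in C#: a 4-byte Han ideograph is stored as a surrogate pair, i.e. two char values, which is exactly what a UCS-2-only engine treats as two separate, meaningless code units. A small sketch:

```csharp
using System;

class SurrogateDemo
{
    static void Main()
    {
        // U+20000 is the first CJK Unified Ideographs Extension B character.
        // In UTF-16 it occupies TWO char values (a surrogate pair).
        string han = char.ConvertFromUtf32(0x20000);

        Console.WriteLine(han.Length);                           // 2
        Console.WriteLine(char.IsSurrogatePair(han[0], han[1])); // True
    }
}
```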

First off, I wonder if you are sure that you picked the right culture identifier with zh-Hans, which is a neutral culture. Perhaps it would be more appropriate to target a specific culture, such as zh-CN (Chinese as used in China), if that is the market you are aiming to support.
Secondly, using the web.config file to set the culture is fine if you are planning a deployment that exclusively targets this culture. Often you'll want one and the same deployment to dynamically adapt to the end user's culture, in which case you would programmatically set Thread.CurrentCulture (and Thread.CurrentUICulture as well, if you are providing localized resources) based, for example, on a URL scheme (e.g. www.myapp.com would use en-US and www.myapp.com/china would use zh-CN), the accept-languages header, or an in-app language selector.
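A rough sketch of the programmatic approach (the helper name is hypothetical, and the mapping from URL/header to culture name is left to the caller; in ASP.NET this would typically run early in the request pipeline):

```csharp
using System;
using System.Globalization;
using System.Threading;

class CultureSwitcher
{
    // Hypothetical helper: apply the culture resolved from the URL scheme,
    // accept-languages header, or an in-app selector.
    public static void ApplyCulture(string cultureName)
    {
        CultureInfo culture = CultureInfo.GetCultureInfo(cultureName);
        Thread.CurrentThread.CurrentCulture = culture;   // dates, numbers, currency
        Thread.CurrentThread.CurrentUICulture = culture; // resource (.resx) lookup
    }

    static void Main()
    {
        ApplyCulture("zh-CN");
        // Formatting now follows zh-CN conventions without an explicit culture argument.
        Console.WriteLine(1234.5.ToString("N1"));
    }
}
```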
Other than the Unicode limitations that Paweł refers to (which mean that you may really need to use the latest .NET Framework/SQL Server), there isn't anything specific you should need to do for simplified Chinese -- if you follow standard internationalization guidelines you should be all set. Perhaps you should consider localizing (translating) your app into Chinese as part of this, by the way.
About SQL Server, Paweł's points seem pretty clear. That said, so long as you use nvarchar datatypes (Unicode) and you don't run queries on these columns or sort them based on these columns on the DB side, I'd be surprised if you had any issues on SQL Server 2005. So it really depends what you do with this data.

Related

Why is .NET "de-CH" culture number group separator different locally and on Azure?

I am seeing a different Unicode character as the number group separator for the "de-CH" culture when running on a local desktop and in Azure.
When the following code is run on my desktop on .NET Core 3.1 or .NET Framework 4.7.2, it outputs 2019 (U+2019, RIGHT SINGLE QUOTATION MARK), which looks like an apostrophe but is not the same character.
When run in Azure, for instance in https://try.dot.net or (slightly modified) in an Azure function running on .NET Core 3.1 (on a Windows based App Service) it results in 0027, a standard ASCII apostrophe.
using System;
using System.Linq;
using System.Globalization;

Console.WriteLine(
    ((int)(CultureInfo
        .GetCultureInfo("de-CH")
        .NumberFormat
        .NumberGroupSeparator
        .Single()))      // just getting the single character as an int
    .ToString("X4"));    // the Unicode value of that character
The result of this is that trying to parse the string 4'200.000 (where the apostrophe there is Unicode 0027) on local desktop using "de-CH" culture fails, but it works in Azure.
Why the difference?
This Microsoft blog by Shawn Steele explains why you shouldn't rely on a specific culture setting being stable (Fully quoted because it is no longer online at MSDN):
https://web.archive.org/web/20190110065542/https://blogs.msdn.microsoft.com/shawnste/2005/04/05/culture-data-shouldnt-be-considered-stable-except-for-invariant/
CultureInfo and RegionInfo data represents a cultural, regional, admin
or user preference for cultural settings. Applications should NOT
make any assumptions that rely on this data being stable. The only
exception (this is a rule, so of course there's an exception) is for
CultureInfo.InvariantCulture. CultureInfo.InvariantCulture is
supposed to remain stable, even between versions.
There are many reasons that cultural data can change. With Whidbey
and Custom Cultures the list gets a little longer.
The most obvious reason is that there is a bug in the data and we had to make a change. (Believe it or not we make mistakes ;-)) In this case our users (and yours too) want culturally correct data, so we have to fix the bug even if it breaks existing applications.
Another reason is that cultural preferences can change. There're lots of ways this can happen, but it does happen:
Global awareness, cross cultural exchange, the changing role of computers and so forth can all effect a cultural preference.
International treaties, trade, etc. can change values. The adoption of the Euro changed many countries currency symbol to €.
National or regional regulations can impact these values too.
Preferred spelling of words can change over time.
Preferred date formats, etc can change.
Multiple preferences could exist for a culture. The preferred best choice can then change over time.
Users could have overridden some values, like date or time formats. These can be requested without user override, however we recommend that applications consider using user overrides.
Users or administrators could have created a replacement culture, replacing common default values for a culture with company specific, regional specific, or other variations of the standard data.
Some cultures may have preferences that vary depending on the setting. A business might have a more formal form than an Internet Café.
An enterprise may require a specific date format or time format for the entire organization.
Differing versions of the same custom culture, or one that's custom on one machine and a windows only culture on another machine.
So if you format a string with a particular date/time format, and then
try to Parse it later, parse might fail if the version changed, if the
machine changed, if the framework version changed (newer data), or if
a custom culture was changed. If you need to persist data in a
reliable format, choose a binary method, provide your own format or
use the InvariantCulture.
Even without changing data, remembering to use Invariant is still a
good idea. If you have different . and , syntax for something like
1,000.29, then Parsing can get confused if a client was expecting
1.000,29. I've seen this problem with applications that didn't realize that a user's culture would be different than the developer's
culture. Using Invariant or another technique solves this kind of
problem.
Of course you can't have both "correct" display for the current user
and perfect round tripping if the culture data changes. So generally
I'd recommend persisting data using InvariantCulture or another
immutable format, and always using the appropriate formatting APIs for
display. Your application will have its own requirements, so consider
them carefully.
Note that for collation (sort order/comparisons), even Invariant
behavior can change. You'll need to use the Sort Versioning to get
around that if you require consistently stable sort orders.
If you need to parse data automatically that is formatted to be user-friendly, there are two approaches:
Allow the user to explicitly specify the used format.
First remove every character except digits, the minus sign and the decimal separator from the string before trying to parse it. Note that you need to know the correct decimal separator first: there is no way to guess it reliably, and guessing wrong could cause major problems.
Wherever possible, avoid parsing numbers that are formatted to be user-friendly; instead, request numbers in a strictly defined (invariant) format.
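A minimal sketch of the second approach, under the stated assumption that the caller supplies the decimal separator (the class and method names are hypothetical):

```csharp
using System;
using System.Globalization;
using System.Text;

class LenientNumberParser
{
    // Keep digits and the minus sign, normalize the caller-supplied decimal
    // separator to '.', and drop everything else (grouping separators,
    // spaces, currency symbols). Then parse with the invariant culture.
    public static decimal Parse(string input, char decimalSeparator)
    {
        var sb = new StringBuilder();
        foreach (char c in input)
        {
            if (char.IsDigit(c) || c == '-')
                sb.Append(c);
            else if (c == decimalSeparator)
                sb.Append('.');
            // every other character is discarded
        }
        return decimal.Parse(sb.ToString(), CultureInfo.InvariantCulture);
    }

    static void Main()
    {
        Console.WriteLine(Parse("4'200.000", '.') == 4200.000m); // True
        Console.WriteLine(Parse("1.000,29", ',') == 1000.29m);   // True
    }
}
```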

Localization alternative to Resx file

Why I don't want to use Resx files:
I am looking for an alternative for resx files to offer multilanguage support for my project, due to the following reasons:
I don't like to specify a "messageId" when writing messages; it is more effort, and it interrupts the flow, as I don't see what the log message would actually say and I would need to open another tab to edit the message
Sometimes I use code inline because I don't want to create new variables for overly simple steps (e.g. Log.Info($"Iterated {i+1} times");). Using variables or doing simple calculations inline sometimes makes the code clearer than creating additional lines of code
What I could imagine instead:
An external application which crawls a compiled exe for all strings, giving you the opportunity to ignore/add strings which should be translated. It could then create an XML or JSON file for all languages as well. It would replace all strings with a hash/ID so that a lookup of strings in all languages is still possible.
Am I the only one who is not happy with the commonly used resx / centralized string DB solution? Am I missing reasons why this wouldn't be a good idea?
One reason for relying on established approaches instead of implementing your own format is translation. It really depends on how your resources are translated: if it is done by volunteers with a technical background who don't mind working in a plain text editor, then you are free to come up with your own resource format. If on the other hand you send out your resources to professional translators who are not very technical and who prefer to work in a translation environment with integrated terminology management, translation memory, spelling and quality checks etc. it is quite likely that this environment will not be able to handle your homemade resource format.
Since I already mentioned professional translation environments: some of these tools rely on IDs to figure out which strings are old and which are new. If you use the approach that the text is the ID, every fixed typo in your source language means that you create a new string that needs to be translated - and paid for. If the translator sees that the source text for a string has changed, he can have a look at the change, notice that a typo has been fixed, decide that the translation is still OK and sign the string off, without extra translation cost.
By the way, if you want good localizations for strings like Log.Info("Iterated {i+1} times"); you have to find some way of dealing with plural forms correctly. Some languages have different grammatical rules for different numbers (see the Unicode Language Plural Rules for an overview). Just because something is easy to do in code does not mean that it is easy to localize, I'm afraid.
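To make the plural-form point concrete, here is a deliberately naive sketch that only handles English's two forms (the method name is hypothetical); a real localization would pick the message pattern per language using the CLDR plural categories, since many languages need more than two forms:

```csharp
using System;

class PluralDemo
{
    // English only: one form for exactly 1, one for everything else.
    // Languages like Polish, Russian or Arabic need additional forms,
    // so the "count == 1" test cannot simply be reused there.
    public static string IteratedMessage(int count)
    {
        return count == 1
            ? string.Format("Iterated {0} time", count)
            : string.Format("Iterated {0} times", count);
    }

    static void Main()
    {
        Console.WriteLine(IteratedMessage(1)); // Iterated 1 time
        Console.WriteLine(IteratedMessage(5)); // Iterated 5 times
    }
}
```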
To sum this up: if you want to create your own resource format, talk with your translators. Ask them which formats they can handle. Think about translation-related limitations that come with your format: for example, are there any characters that the translators should not use because they would break your strings? Apostrophes and quotes are prime candidates here because they are often used as string delimiters in resource files, as are < and & if you decide to go the XML way. Think about a conversion to XLIFF and back: most translation environments can handle XLIFF.

Translation and localization issue

Does Microsoft implementation of C# runtime offer some localization mechanism to translate common strings like Overflow, Stack overflow, Underflow, etc...
See the code below - it's a part of Mono and Mono itself has a Locale.GetText routine for making such translations.
// Added to avoid a possible integer overflow.
if (inputOffset > inputBuffer.Length - inputCount)
    throw new ArgumentException("inputOffset" +
        Locale.GetText("Overflow"));
Now - how is it done in Microsoft version of runtime and how can I use it, for example, to get the localized equivalent of Overflow without adding resource files?
.NET provides a framework that makes it easy to localize your content (ResourceManager) and while it internally maintains some translations for its own purpose (for example DateTime.ToString gives you a textual representation for the date/time that is locally appropriate, which includes the translated month and day names), it does not provide you with any ready-made translations, be they common strings or not. It could hardly do this reliably anyway, as there is a plethora of human languages out there and words can have different translations depending on context etc.
In your example, I would say that you are OK with untranslated exception messages. Although Microsoft recommends that you localize exception descriptions, and they do localize their own (at least for major languages), this advice seems ill-thought-out: not only is it a waste of effort to translate all this text that users should probably never see, but it can also make debugging a nightmare.
Yes, it does and it's a terrible idea. It makes debugging so much harder.
without adding resource files
What do you have against resource files? Resources are the prescribed way to provide localized and localizable strings, images, and other data for a .NET app or assembly.
Note that single-word substitution, as shown in your example code, will result in poor-quality translations. Different languages have different sentence structures and word order, which single-word substitution won't accommodate. Non-English languages often have genders for nouns and declension of words to properly reflect their role and number in a phrase. Single-word substitution fails miserably at this.
Your non-English customers will most likely prefer that you not butcher their language by attempting to partially translate text a word here and a word there. If you're going to go to the trouble of supporting localizable messages, do it right and allow the entire string to be translated so that word ordering and declension can be done properly by translators. In cases where the content is variable, make the format string a resource so that the translator can set off the variable data using the conventions of the language.
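A small sketch of that last point: when the whole format string is the resource, the translator controls word order. The German pattern below is illustrative, standing in for a value that would normally be loaded from a resource file:

```csharp
using System;
using System.Globalization;

class FormatResourceDemo
{
    static void Main()
    {
        // In a real app, both patterns would come from per-language resource files.
        string en = "{0} copies of {1} were printed";
        string de = "{1} wurde {0}-mal gedruckt"; // translator reordered the placeholders

        Console.WriteLine(string.Format(CultureInfo.InvariantCulture, en, 3, "report.pdf")); // 3 copies of report.pdf were printed
        Console.WriteLine(string.Format(CultureInfo.InvariantCulture, de, 3, "report.pdf")); // report.pdf wurde 3-mal gedruckt
    }
}
```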

Globalization in C#

Can somebody explain to me what is the use of globalization in C#?
Is it used for conversion purposes? I mean I want to convert any English word into a selected language.
So will this globalization or cultureinfo help me?
Globalization is a means of formatting text for specific cultures. E.g. a string representation of the number 1000 may be 1,000.00 for the UK or 1 000,00 for France. It is quite an in-depth subject, but that is the essential aim.
It is NOT a translation service, but it does allow you to determine the culture under which your application is running and therefore allow you to choose the language you want to display. You will have to provide text translation yourself, however, usually by means of resource files.
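A minimal sketch of the formatting point above (the value and cultures are arbitrary examples):

```csharp
using System;
using System.Globalization;

class CultureFormatDemo
{
    static void Main()
    {
        double value = 1000.5;

        // The same value rendered with different cultural conventions --
        // no translation is involved, only formatting.
        Console.WriteLine(value.ToString("N2", CultureInfo.GetCultureInfo("en-GB"))); // 1,000.50
        Console.WriteLine(value.ToString("N2", CultureInfo.GetCultureInfo("de-DE"))); // 1.000,50
    }
}
```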
Globalization is a way of allowing the user to customize the application that he or she may be using to fit the standards of wherever they may be. Customization allows for the:
Money Formatting
Time
Date
Text orientation
To be culturally appropriate. The region that is currently set is handled by the OS and passed to your application. Globalization/internationalization (i18n) also typically motivates the developer to separate the displayed text of the program from the implementation itself.
From MSDN:
System.Globalization - contains classes that define culture-related information, including the language, the country/region, the calendars in use, the format patterns for dates, currency and numbers, and the sort order for strings.
This namespace helps in making your application culture-aware and is used heavily internally within the .NET Framework. For example, when converting from DateTime to String, globalization is used to determine what format to use, such as "11/28/2009" or "28-11-2009". Generally this determination is done automatically within the framework without you ever using the namespace directly. However, if you need to, you can use Globalization directly to look up culture-specific information for your own use.
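A small sketch of that date example (the cultures are chosen purely for illustration; de-DE happens to use dots rather than the dashes shown above):

```csharp
using System;
using System.Globalization;

class DateFormatDemo
{
    static void Main()
    {
        var date = new DateTime(2009, 11, 28);

        // The framework picks the pattern from each culture's DateTimeFormat.
        Console.WriteLine(date.ToString("d", CultureInfo.GetCultureInfo("en-US"))); // 11/28/2009
        Console.WriteLine(date.ToString("d", CultureInfo.GetCultureInfo("de-DE"))); // 28.11.2009
    }
}
```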
To clear even more confusion
Localization (or Localisation for non-US people), L10n for short: the process of adapting a program for a specific location. It consists of translating resources, adapting the UI (if necessary), etc.
Internationalization, i18n for short: the process of adapting a program to support localization, regional characters, formats and so on and so forth, but most importantly, the process of allowing the program to work correctly regardless of the current locale settings and OS language version.
Globalization, g11n for short: consists of both i18n and L10n.
To clear some confusion:
Globalisation: Allowing your program to use locale-specific resources loaded from an external resource DLL at runtime. This means putting all your strings in resource files rather than hard-coding them into the source code.
Localisation: Adapting your program for a specific locale. This could mean translating strings and making dialog boxes read right-to-left for languages such as Arabic.
Here is a link to creating satellite DLLs. It says C++, but the same principle applies to C#.
Globalization:
Globalization is the process of designing and developing applications for multiple cultures and regions.
Localization:
Localization is the process of customizing an application for a given culture and locale.

Internationalization in the database

Do you guys think internationalization should only be done at the code level, or should it be done in the database as well?
Are there any design patterns or accepted practices for internationalizing databases?
If you think it's a code problem, then do you just use resource files to look up a key returned from the database?
Thanks
Internationalization extends to the database insofar as your columns must be able to hold the data: NText, NChar, NVarchar instead of Text, Char, Varchar.
As far as your UI goes, for non-changing labels, resources files are a good method to use.
If you refer to making your app support multiple languages at the UI level, then there are more possibilities. If the labels never change, unless when you release a new version, then resource files that get embedded in the executable or assembly itself are your best bet, since they work faster. If your labels, on the other hand, need to be adjusted at runtime by the users, then storing the translations in the database is a good choice. As far as the code itself, the names of tables & fields in the database, we keep them in English as much as possible, since English is the "de facto" standard for IT people.
It depends a lot on what you are storing in your database. Taking examples from a recent project I was on: a film title that is entered at a client site and only visible to that client is fair game to store as-is in the database. A projector error code, on the other hand, because it can be viewed by the client as well as by network operations centers that might be in different countries, should be stored as an error code (along with supporting data, like lamp hours and the title of the movie being shown) which can be translated at the GUI level depending on the language setting of the viewer.
#hova covers the technicalities, but something you might want to consider is supporting a system that is showing a language you don't understand.
One way to cope with this is to have English as the default language, and a user setting that switches into a different language. That way your support users can log in and see the system in a natural way (assuming English as their first language), and your actual users can see the system in their first language. IMO, the data should always be 'natural' - in the language of the users.
Which raises another interesting point: should your system allow multiple languages for cross-border installations? In my experience, for the user interface yes, but for data, no. To take a trivial example of address formatting, a letter to a French third party from a Swiss system should still have a Swiss-format address rather than a French one, as it has to go through the Swiss postal system first.
If your customers are Japanese and want to see their names in Kanji and Katakana (and sometimes in most formal Gaiji), you've got to store them as Unicode. No way around that.
Even things like addresses are very different between the US and Japan. One schema won't cut it for both.
