Do you guys think internationalization should only be done at the code level, or should it be done in the database as well?
Are there any design patterns or accepted practices for internationalizing databases?
If you think it's a code problem, do you just use resource files to look up a key returned from the database?
Thanks
Internationalization extends to the database insofar as your columns must be able to hold the data: NText, NChar, and NVarchar instead of Text, Char, and Varchar.
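To illustrate, here is a minimal sketch of writing Unicode data through ADO.NET; the table, column, and connection string are hypothetical:

```csharp
using System.Data;
using System.Data.SqlClient;

// Minimal sketch: writing a Unicode string into an NVARCHAR column.
// Table, column, and connection string are hypothetical.
static void SaveTitle(string connectionString, string title)
{
    using (var conn = new SqlConnection(connectionString))
    using (var cmd = new SqlCommand(
        "INSERT INTO FilmTitles (Title) VALUES (@title)", conn))
    {
        // SqlDbType.NVarChar sends the value as Unicode; with a plain
        // VARCHAR column, characters outside the column's code page
        // would be silently mangled to '?'.
        cmd.Parameters.Add("@title", SqlDbType.NVarChar, 200).Value = title;
        conn.Open();
        cmd.ExecuteNonQuery();
    }
}
```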
As far as your UI goes, for non-changing labels, resource files are a good method to use.
If you mean making your app support multiple languages at the UI level, then there are more possibilities. If the labels never change except when you release a new version, then resource files that get embedded in the executable or assembly itself are your best bet, since they work faster. If, on the other hand, your labels need to be adjusted at runtime by the users, then storing the translations in the database is a good choice. As for the code itself and the names of tables and fields in the database, we keep them in English as much as possible, since English is the de facto standard for IT people.
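For the embedded-resource route, a minimal sketch of a lookup; the base name "MyApp.Labels" and the key are hypothetical:

```csharp
using System.Globalization;
using System.Reflection;
using System.Resources;

// Sketch: looking up a UI label from embedded .resx resources.
// "MyApp.Labels" is a hypothetical base name (Labels.resx,
// Labels.fr.resx, Labels.es.resx, ...).
static string GetLabel(string key)
{
    var resources = new ResourceManager("MyApp.Labels",
                                        Assembly.GetExecutingAssembly());
    // Falls back through the chain (fr-CA -> fr -> neutral) when no
    // exact match exists for the current UI culture.
    return resources.GetString(key, CultureInfo.CurrentUICulture);
}
```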
It depends a lot on what you are storing in your database. Taking examples from a recent project I was on: a film title that is entered at a client site and only visible to that client is fair game to store as-is in the database. A projector error code, on the other hand, because it can be viewed by the client as well as by network operations centers that might be in different countries, should be stored as an error code (along with supporting data, like lamp hours and the title of the movie being shown) which can be translated at the GUI level depending on the language setting of the viewer.
@hova covers the technicalities, but something you might want to consider is supporting a system that is displaying a language you don't understand.
One way to cope with this is to have English as the default language, and a user setting that switches into a different language. That way your support users can log in and see the system in a natural way (assuming English as their first language), and your actual users can see the system in their first language. IMO, the data should always be 'natural' - in the language of the users.
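A sketch of that per-user switch; the UserProfile type and its PreferredCulture field are hypothetical stand-ins for however you store the setting:

```csharp
using System.Globalization;
using System.Threading;

// Hypothetical user profile carrying the language preference.
class UserProfile
{
    public string PreferredCulture; // e.g. "fr-FR", or null for the default
}

static void ApplyUserCulture(UserProfile user)
{
    // Support staff with no preference set fall back to English.
    var culture = CultureInfo.GetCultureInfo(user.PreferredCulture ?? "en-US");
    Thread.CurrentThread.CurrentUICulture = culture; // drives resource lookups
    Thread.CurrentThread.CurrentCulture = culture;   // drives dates and numbers
}
```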
Which raises another interesting point - should your system allow multiple languages for cross-border installations? In my experience, for user interface yes, but for data, no. To take a trivial example of address formatting, a letter to a French third party from a Swiss system should still have a Swiss-format address instead of a French one, as it has to go through the Swiss postal system first.
If your customers are Japanese and want to see their names in Kanji and Katakana (and sometimes in most formal Gaiji), you've got to store them as Unicode. No way around that.
Even things like addresses are very different between the US and Japan. One schema won't cut it for both.
Why I don't want to use Resx files:
I am looking for an alternative to resx files to offer multi-language support for my project, for the following reasons:
I don't like specifying a "messageId" when writing messages; it is extra effort, and it disrupts my flow because I can't see what the log message actually says and would need to open another tab to edit it.
Sometimes I write code inline because I don't want to create new variables for trivial steps (e.g. Log.Info($"Iterated {i + 1} times");). Using variables or doing simple calculations inline sometimes makes the code clearer than creating additional lines.
What I could imagine instead:
An external application which crawls a compiled exe for all strings, giving you the opportunity to ignore/add strings which should be translated. It could then create an XML or JSON file for each language as well. It would replace all strings with a hash/id so that a lookup for strings in all languages is still possible.
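A rough sketch of what the runtime side of that idea might look like; everything here (the hash keys, the fallback policy) is hypothetical:

```csharp
using System.Collections.Generic;

// Hypothetical runtime lookup for the crawler idea: the tool would emit
// one dictionary per language, keyed by a stable hash of the original
// source string.
class StringTable
{
    private readonly Dictionary<string, string> _byHash; // hash -> translation

    public StringTable(Dictionary<string, string> byHash)
    {
        _byHash = byHash;
    }

    // Fall back to the original text if the crawler missed the string
    // or the translation is not done yet.
    public string Lookup(string hash, string original) =>
        _byHash.TryGetValue(hash, out var translated) ? translated : original;
}
```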
Am I the only one who is not happy with the commonly used resx / centralized string-db solutions? Am I missing reasons why this wouldn't be a good idea?
One reason for relying on established approaches instead of implementing your own format is translation. It really depends on how your resources are translated: if it is done by volunteers with a technical background who don't mind working in a plain text editor, then you are free to come up with your own resource format. If, on the other hand, you send out your resources to professional translators who are not very technical and who prefer to work in a translation environment with integrated terminology management, translation memory, spelling and quality checks, etc., it is quite likely that this environment will not be able to handle your homemade resource format.
Since I already mentioned professional translation environments: some of these tools rely on IDs to figure out which strings are old and which are new. If you use the approach that the text is the ID, every fixed typo in your source language means that you create a new string that needs to be translated - and paid for. If the translator sees that the source text for a string has changed, he can have a look at the change, notice that a typo has been fixed, decide that the translation is still OK, and sign the string off without extra translation cost.
By the way, if you want good localizations for strings like Log.Info($"Iterated {i + 1} times");, you have to find some way of dealing with plural forms correctly. Some languages have different grammatical rules for different numbers (see the Unicode Language Plural Rules for an overview). Just because something is easy to do in code does not mean that it is easy to localize, I'm afraid.
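A minimal sketch of what even the English-only case forces on you; the message keys are hypothetical, and real CLDR plural rules define up to six categories per language, so the two-way split below does not generalize:

```csharp
using System.Collections.Generic;

// English-only plural handling; the keys are hypothetical.
static readonly Dictionary<string, string> Messages =
    new Dictionary<string, string>
    {
        ["Iterated.One"]   = "Iterated {0} time",
        ["Iterated.Other"] = "Iterated {0} times",
    };

static string IterationMessage(int count) =>
    string.Format(Messages[count == 1 ? "Iterated.One" : "Iterated.Other"],
                  count);
```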
To sum this up: if you want to create your own resource format, talk with your translators. Ask them which formats they can handle. Think about translation-related limitations that come with your format: for example, are there any characters that the translators should not use because they would break your strings? Apostrophes and quotes are prime candidates here because they are often used as string delimiters in resource files, as are < and & if you decide to go the XML way. Also think about a conversion to XLIFF and back: most translation environments can handle XLIFF.
I'm working on a website that will be deployed internationally. It's a very big site, but for the sake of simplicity, all we're concerned about here is my Error.aspx with C# code-behind. I'd like to make this custom error page as dynamic as possible. There are at least a handful of languages we need this page in right now, with more to come. This page needs to work independently, without a database to reference.
I'd like to have some text, and have the appropriate translation appear based on the language appropriate for that domain... e.g. ".com" = English, ".ca/fr" = French, ".mx" = Spanish... you get the idea.
What's the best way to do this?
I've looked into APIs, but the decent ones have a cost threshold, and while that might look really helpful, this is just pretty standard error-message text that's unlikely to change, so a dynamic translator seems like overkill. It might help with scalability, but it's an ongoing cost indefinitely, when it would only save over hard-coding on the handful of occasions where we add another language/country/domain.
The other idea I had was to simply hard-code it in the C#: parse out Request.Url to get the domain, and build an ever-growing switch statement which would assign the appropriate text. (As an aside, I'm also trying to find a better way to do this, but is the country code a piece of information available from either the request object or the server?) This approach would be independent and precise, and the only concrete drawback would be the cost of adding new languages, or of changing every string (probably not that many, at least at first) if the content of the error message needed to be adjusted. But this feels like bad practice.
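For what it's worth, here's a sketch of the mapping I have in mind; the domains and culture choices are examples only:

```csharp
// Sketch of the hard-coded domain-to-culture mapping; the domains and
// culture choices are examples only.
static string CultureForHost(string host)
{
    if (host.EndsWith(".mx")) return "es-MX";
    if (host.EndsWith(".ca")) return "fr-CA"; // ".ca/fr" would also need path inspection
    return "en-US";                           // ".com" and anything else
}

// Usage from code-behind, e.g. in Page_Load:
//   string culture = CultureForHost(Request.Url.Host);
```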
I've been researching this for a day now, but I haven't found any alternatives to these two options. What are the best practices for handling small amounts of text for translation, without the use of a CMS?
There is an easy built-in way to handle localization in ASP.NET Web Forms. It uses the Language Preference settings in the client's browser to select the language. Posting the steps of setting it up would be redundant since there's lots of information on this subject available online. Here is a good tutorial.
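As a sketch of the idea in code (the declarative equivalent is setting Culture="auto" and UICulture="auto" in web.config or the @Page directive):

```csharp
using System.Globalization;
using System.Threading;
using System.Web.UI;

public partial class Error : Page
{
    protected override void InitializeCulture()
    {
        // Request.UserLanguages reflects the browser's Accept-Language
        // header; entries can carry quality values like "en-US;q=0.8",
        // so strip anything after the ';'.
        string lang = (Request.UserLanguages != null && Request.UserLanguages.Length > 0)
            ? Request.UserLanguages[0].Split(';')[0]
            : "en-US";
        Thread.CurrentThread.CurrentUICulture = CultureInfo.CreateSpecificCulture(lang);
        base.InitializeCulture();
    }
}
```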
EDIT:
It might be a good idea to read up on ASP.NET resource files, the standard mechanism for handling different languages (referred to as localization). It is what ASP.NET uses in the background when creating a local resource for a server control.
I've been reading up a bit on how people do internationalization. It seems that the common consensus is to save those strings in a separate file (usually XML) and load it when necessary.
I'm wondering: why not just store those strings in a database instead? Isn't it much better that way?
By the way, the app in question is a website.
The most important thing is to store your string tables outside of your compilation units so that incorporating updated translations does not require a rebuild. This allows for new or updated translations to be incorporated at a later point without too much hassle.
Of course, those string tables could be stored anywhere. If you want to put them in a database, knock yourself out. As long as your application can reach them and your translation staff know how to deliver them into the right place, it doesn't make a difference.
Serious internationalization is always a big project with a lot of parts and players. As @zerkms alludes to, the translation task is very often an offline activity carried out by individuals and teams around the world.
So it makes sense to have a clean work product that a translation team can produce (the translated XML file).
Once the file is translated, it is up to you how you handle the translations within your software. It is common to keep them in memory, since you often need to substitute variables into the placeholders.
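For example, a minimal sketch of that substitution step; the key and template are invented:

```csharp
using System.Collections.Generic;

// Translated templates held in memory, variables substituted at
// display time. The key and template are invented.
static readonly Dictionary<string, string> Templates =
    new Dictionary<string, string>
    {
        ["Greeting"] = "Bonjour, {0} !", // loaded from the translated XML file
    };

static string Greet(string userName) =>
    string.Format(Templates["Greeting"], userName);
```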
If you do store them in the database, you introduce the additional overhead of querying the database every time your locale switches.
This is the reason for resource bundles. You package them along with the source code, but you don't have to change code to add support for new languages.
You could also subclass the ResourceBundle class yourself and implement JDBC support so the locale-specific strings are stored in the database.
http://java.sun.com/developer/technicalArticles/Intl/ResourceBundles/
I am developing my first multilingual C# site and everything is going OK except for one crucial aspect. I'm not 100% sure what the best option is for storing strings (typically single words) that will be translated by code from my code-behind pages.
On the front end of the site I am going to use ASP.NET resource files for the wording on the pages. This part is fine. However, this site will make XML calls, and the XML responses are only ever in English. I have been given an Excel sheet with all the words that will be returned by the XML, broken into the different languages, but I'm not sure how best to store/access this information. There are roughly 80 words x 7 languages.
I am thinking about creating a dictionary object for each language, created by my Global.asax file at application start-up and kept in memory. The plus side of doing this is that each dictionary object will only have to be created once (until IIS restarts) and can be accessed by any user without needing to be rebuilt; the downside is that I have 7 dictionary objects constantly held in memory. The server is a Win 2008 64-bit box with 4GB of RAM, so should I even be concerned about the memory taken up by this method?
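Concretely, something like this is what I have in mind; LoadTranslations is a placeholder for parsing the Excel data (exported to XML, say):

```csharp
using System;
using System.Collections.Generic;
using System.Web;

public class Global : HttpApplication
{
    // language code -> (English word -> translated word)
    public static Dictionary<string, Dictionary<string, string>> Translations;

    protected void Application_Start(object sender, EventArgs e)
    {
        // Built once per app domain; file name is a placeholder.
        Translations = LoadTranslations(
            Server.MapPath("~/App_Data/translations.xml"));
    }

    static Dictionary<string, Dictionary<string, string>> LoadTranslations(string path)
    {
        // Placeholder: real code would parse the exported spreadsheet here.
        return new Dictionary<string, Dictionary<string, string>>();
    }
}
```

At roughly 80 words x 7 languages, the whole structure should only be tens of kilobytes, if that.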
What do you guys think would be the best way to store/retrieve different language words that would be used by all users?
Thanks for your input.
Rich
From what you say, you are looking at 560 words which need to differ based on locale. This is a drop in the ocean. The resource-file method which you have contemplated is fit for purpose, and I would recommend using it. Resource files integrate with controls, so you will be making the most of them.
If it did trouble you, you could keep them in a sliding cache (of 20 minutes, for example), but I do not see anything wrong with your choice of solution.
OMO
Cheers,
Andrew
P.S. Have a read through this to see how you can find and bind values in different resource files to controls and literals, and use them programmatically.
http://msdn.microsoft.com/en-us/magazine/cc163566.aspx
As long as you are aware of the impact of doing so, then yes, storing this data in memory would be fine (as long as you have enough memory to do so). Once you know what is appropriate for the current user, tossing it into memory is fine. You might look at something like MemCached Win32 or Velocity, though, to offload the storage to another app server. Use this even on your local application for the time being; that way, when it is time to push this to another server or grow your app, you have a clear separation of concerns defined at your caching layer. Keep in mind that the more languages you support, the more you are storing in memory, so keep an eye on the amount of data held on your lone app server, as this could become overwhelming in time. Also, make sure that the keys you use are specific to the language. Otherwise you might find that you are serving a German menu to an English user.
This was a hard question for me to summarize so we may need to edit this a bit.
Background
About four years ago, we had to translate our ASP.NET application for our clients in Mexico. Extensibility and scalability were not that much of a concern at the time (oh yes, I just said those dreadful words) because we only had U.S. and Mexican customers.
Rather than use resource files, we replaced every single piece of static text in our application with some type of server control (an ASP.NET label, for example). We store each and every English word in a SQL database. We have added the ability to translate the English text into another language and can also add cultural overrides. For example, "hello" can be translated to "¡hola!" in one language and overridden to "¡bueno!" in a different culture. The business has full control over these translations because we built management utilities for them to control everything. The translation kicks in when we detect that the user has a browser culture other than en-US. Every form descends from a base form that iterates through each server control and executes a translation (translation data is stored as a DataTable in an application variable, per culture). I'm still amazed at how fast the control iteration is.
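The iteration looks roughly like this; Translate is a stand-in for our lookup into the cached per-culture DataTable:

```csharp
using System.Web.UI;
using System.Web.UI.WebControls;

// Roughly what the base form's translation pass looks like.
// Translate() is a stand-in for the lookup into the cached
// per-culture translation table.
void TranslateControls(ControlCollection controls)
{
    foreach (Control control in controls)
    {
        if (control is Label label)
            label.Text = Translate(label.Text);
        else if (control is Button button)
            button.Text = Translate(button.Text);

        // Recurse into containers (panels, tables, user controls, ...).
        if (control.HasControls())
            TranslateControls(control.Controls);
    }
}
```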
The problem
The business is very happy with how the translations work. In addition to the static content that I mentioned above, the business now wants to have certain data translated as well. System notes are a good example. Take "Sent Letter #XXXX to Customer": the business wants the "Sent Letter to Customer" text translated based on the user's browser culture.
I have read a couple of other posts on SO that talk about localization, but they don't address my problem. How do you translate a phrase that is dynamically generated? I could easily read the English text and translate "Sent", "Letter", "to", and "Customer", but I guarantee it would look stupid to the end user because it's a phrase. And the dynamic part of a system-generated note would break any look-ups that we perform on the phrase if we stored the phrase in English, less the dynamic text.
One thought I had: we don't have a table of system-generated note types. I suppose we could create one that had placeholders for dynamic data, and the translation engine would ignore the placeholder markers. The problem with this approach is that our SQL Server database is a replication of an old Pick database, and we don't really know all the types of system-generated phrases (they are deep in the Pick code base, in subroutines, control files, etc.). Things like notes, ticklers, and payment rejection reasons are all stored differently, and trying to normalize this data has proven difficult. It would be a huge effort to go back and identify and change every Pick program that generates a message.
This question is very close; but I'm not dealing with just system-generated status messages but rather an infinite number of phrases and types of phrases with no central generation mechanism.
Any ideas?
The lack of a "bottleneck" -- what you identify as the (missing) "central generation mechanism" -- is the architectural problem in this situation. Ideally, rearchitecting to put such a bottleneck in place (so you can keep using your general approach with a database of culture-appropriate renditions of messages, just with "placeholders" for e.g. the #XXXX in your example) would be best.
If that's just unfeasible, you can place the "bottleneck" at the other end of the pipe: when a message is about to be emitted. At that point (or those few points), you need to try to match the (English) string that's about to be emitted against a series of well-crafted regular expressions (with "placeholders", typically like (.*?)) and thereby identify the appropriate key for the DB lookup. Yes, that is still a lot of work, but at least it should be feasible without the issues you mention with respect to the old Pick code.
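A sketch of that matching step; the pattern, the key, and the LookupTemplate helper are all hypothetical:

```csharp
using System.Text.RegularExpressions;

// Match the English message about to be emitted against known
// patterns to recover a lookup key plus the dynamic parts.
static readonly Regex SentLetter =
    new Regex(@"^Sent Letter #(?<num>\S+) to Customer$");

static string TranslateNote(string english, string culture)
{
    Match m = SentLetter.Match(english);
    if (m.Success)
    {
        // LookupTemplate is hypothetical: fetch the culture-specific
        // template, e.g. "Carta #{0} enviada al cliente" for es-MX.
        string template = LookupTemplate("SentLetterToCustomer", culture);
        return string.Format(template, m.Groups["num"].Value);
    }
    return english; // no pattern matched: fall back to English
}
```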
We use the technique you propose, with insertion points:
"Sent letter #{0:Letter Num} to Customer {1:Customer Full Name}"
Which might be (in reverse Pig Latin, say):
"Ustomercay {1:Customer Full Name} asway entsay etterlay #{0:Letter Num}"
Note that this handles cases where the particular target language reverses the order of insertion, etc. It does not handle subtleties like first, second, etc., which have to be handled with application logic/more phrases:
"This is your {0:first, second, third} warning"
In a pinch I suppose you could try something like foisting the job off onto Google if you don't have a translation on hand for a particular phrase, and stashing the translation for later.
Stashing the translations for later provides both a data collection point for building a message catalog and a rough (if sometimes laughably wonky) dynamically built starter set of translations. Once you begin the process, track which translations have been reviewed and how frequently each has been hit. Frequently hit machine translations can then be reviewed and refined.
Dynamic machine translation is not suitable for a product that you actually expect people to pay money for. The only way to do it is with static templates containing insertion points (as Cade Roux has demonstrated in his answer).
There's no getting around a thorough refactoring of your code to make this feasible. The alternative is to do nothing with those phrases (which is what you're doing now, and it's working out okay, right?). Usually no translation is better than embarrassingly bad translation.