This was a hard question for me to summarize so we may need to edit this a bit.
Background
About four years ago, we had to translate our ASP.NET application for our clients in Mexico. Extensibility and scalability were not that much of a concern at the time (oh yes, I just said those dreadful words) because we only have U.S. and Mexican customers.
Rather than use resource files, we replaced every single piece of static text in our application with some type of server control (an ASP.NET Label, for example). We store each and every English word in a SQL database. We have added the ability to translate the English text into another language, and we can also add cultural overrides. For example, hello can be translated to ¡hola! in one language and overridden to ¡bueno! in a different culture. The business has full control over these translations because we built management utilities for them to control everything. The translation kicks in when we detect that the user has a browser culture other than en-US. Every form descends from a base form that iterates through each server control and executes a translation (translation data is stored as a DataTable in an application variable, per culture). I'm still amazed at how fast the control iteration is.
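For concreteness, here's roughly what that base-form iteration looks like (simplified; the dictionary shape and key names are illustrative, not our exact code):

using System;
using System.Collections.Generic;
using System.Threading;
using System.Web.UI;
using System.Web.UI.WebControls;

// Simplified base page: walks the control tree and swaps in translated text.
public class TranslatedPage : Page
{
    protected override void OnPreRender(EventArgs e)
    {
        base.OnPreRender(e);
        TranslateControls(this);
    }

    private void TranslateControls(Control parent)
    {
        foreach (Control child in parent.Controls)
        {
            // Only Labels are handled here; the real code covers every
            // server control type that carries static text.
            var label = child as Label;
            if (label != null)
                label.Text = Translate(label.Text);

            TranslateControls(child); // recurse into nested controls
        }
    }

    private string Translate(string english)
    {
        // Stand-in for the per-culture lookup kept in an application
        // variable; the key shape is an assumption for this sketch.
        string culture = Thread.CurrentThread.CurrentUICulture.Name;
        var table = Application["translations_" + culture] as Dictionary<string, string>;
        return table != null && table.TryGetValue(english, out var translated)
            ? translated
            : english; // fall back to the English original
    }
}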
The problem
The business is very happy with how the translations work. In addition to the static content that I mentioned above, the business now wants to have certain data translated as well. System notes are a good example of a translation they want. Example "Sent Letter #XXXX to Customer" - the business wants the "Sent Letter to Customer" text translated based on their browser culture.
I have read a couple of other posts on SO that talk about localization, but they don't address my problem. How do you translate a phrase that is dynamically generated? I could easily read the English text and translate "Sent", "Letter", "to" and "Customer", but I guarantee that it will look stupid to the end user because it's a phrase. The dynamic part of the system-generated note would also screw up any look-ups we perform on the phrase if we stored the phrase in English minus the dynamic text.
One thought I had... We don't have a table of system-generated note types. I suppose we could create one that had placeholders for dynamic data, and the translation engine would ignore the placeholder markers. The problem with this approach is that our SQL Server database is a replication of an old Pick database, and we don't really know all the types of system-generated phrases (they are deep in the Pick code base, in subroutines, control files, etc.). Things like notes, ticklers, and payment rejection reasons are all stored differently. Trying to normalize this data has proven difficult. It would be a huge effort to go back and identify and change every Pick program that generated a message.
This question is very close; but I'm not dealing with just system-generated status messages but rather an infinite number of phrases and types of phrases with no central generation mechanism.
Any ideas?
The lack of a "bottleneck" -- what you identify as the (missing) "central generation mechanism" -- is the architectural problem in this situation. Ideally, rearchitecting to put such a bottleneck in place (so you can keep using your general approach with a database of culture-appropriate renditions of messages, just with "placeholders" for e.g. the #XXXX in your example) would be best.
If that's just unfeasible, you can place the "bottleneck" at the other end of the pipe -- when a message is about to be emitted. At that point, or few points, you need to try and match the (English) string that's about to be emitted against a series of well-crafted regular expressions (with "placeholders" typically like (.*?)...) and thereby identify the appropriate key for the DB lookup. Yes, that is still a lot of work, but at least it should be feasible without the issues you mention with respect to the old Pick code.
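For illustration, the output-side matching might look something like this (patterns and lookup keys invented for the example):

using System.Collections.Generic;
using System.Text.RegularExpressions;

// Illustrative: map regexes over the emitted English strings to translation keys.
static class MessageMatcher
{
    private static readonly List<(Regex Pattern, string Key)> Patterns =
        new List<(Regex Pattern, string Key)>
        {
            (new Regex(@"^Sent Letter #(\d+) to Customer$"), "SentLetterToCustomer"),
            (new Regex(@"^Payment rejected: (.*)$"), "PaymentRejected"),
        };

    // Returns the lookup key plus the captured dynamic parts, or null if nothing matched.
    public static (string Key, string[] Args)? Match(string emitted)
    {
        foreach (var (pattern, key) in Patterns)
        {
            var m = pattern.Match(emitted);
            if (m.Success)
            {
                var args = new string[m.Groups.Count - 1];
                for (int i = 1; i < m.Groups.Count; i++)
                    args[i - 1] = m.Groups[i].Value;
                return (key, args);
            }
        }
        return null; // fall back to emitting the untranslated English
    }
}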
We use the technique you propose, with insertion points:
"Sent letter #{0:Letter Num} to Customer {1:Customer Full Name}"
Which might be (in reverse Pig Latin, say):
"Ustomercay {1:Customer Full Name} asway entsay etterlay #{0:Letter Num}"
Note that this handles cases where the particular target language reverses the order of insertion, etc. It does not handle subtleties like first, second, etc., which have to be handled with application logic/more phrases:
"This is your {0:first, second, third} warning"
In a pinch I suppose you could try something like foisting the job off onto Google if you don't have a translation on hand for a particular phrase, and stashing the translation for later.
Stashing the translations for later provides both a data collection point for building a message catalog and a rough (if sometimes laughably wonky) dynamically built starter set of translations. Once you begin the process, track which translations have been reviewed and how frequently each have been hit. Frequently hit machine translations can then be reviewed and refined.
Dynamic machine translation is not suitable for a product that you actually expect people to pay money for. The only way to do it is with static templates containing insertion points (as Cade Roux has demonstrated in his answer).
There's no getting around a thorough refactoring of your code to make this feasible. The alternative is to do nothing with those phrases (which is what you're doing now, and it's working out okay, right?). Usually no translation is better than embarrassingly bad translation.
Related
I'm working on a website that will be deployed internationally. It's a very big site, but for the sake of simplicity, all we're concerned about is my Error.aspx with C# code-behind. I'd like to make this custom error page as dynamic as possible. There are at least a handful of languages we need to display this page in right now, with more to come. This page needs to work independently and without a database to reference.
I'd like to have some text, and have the appropriate translation appear based on the language appropriate for that domain... e.g. ".com" = English, ".ca/fr" = French, ".mx" = Spanish... you get the idea.
What's the best way to do this?
I've looked into APIs, but the decent ones have a cost threshold, and while it might look really helpful, this is just pretty standard error message text that's unlikely to change, so a dynamic translator seems like overkill. It might help with scalability, but it's extra money indefinitely, when it will only save us (versus hard-coding) on the handful of occasions where we add another language/country/domain.
The other idea I had was to simply hardcode it in the C#: parse out Request.Url to get the domain, and make an ever-growing switch statement which would assign the appropriate text. (As an aside, I'm also trying to find a better way to do this, but is the country code something that would be an available piece of information from either the request object or server?) This way would be independent and precise, and the only drawback on a concrete level would be the cost of adding new languages, or changing every string (probably not that many, at least at first) if the content of the error message needed to be adjusted. But this feels like bad practice.
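For concreteness, the sort of mapping I mean (domains and cultures invented for the example; the request itself doesn't expose a country code as such, only things like the host name and the Accept-Language header, so you'd derive it):

using System.Globalization;

// Hypothetical mapping from the request's host/path to a culture.
static CultureInfo CultureForRequest(string host, string path)
{
    if (host.EndsWith(".mx")) return new CultureInfo("es-MX");
    if (host.EndsWith(".ca") && path.StartsWith("/fr")) return new CultureInfo("fr-CA");
    return new CultureInfo("en-US"); // ".com" and anything unrecognized
}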
I've been researching this for a day now, but I haven't found any alternatives to these 2 options. What are the best practices for handling small amounts of text for translation, without the use of a CMS?
There is an easy built-in way to handle localization in ASP.NET Web Forms. It uses the Language Preference settings in the client's browser to select the language. Posting the steps of setting it up would be redundant since there's lots of information on this subject available online. Here is a good tutorial.
EDIT:
It might be a good idea to read up on .resx resource files. That is the standard ASP.NET mechanism for handling different languages (referred to as localization), and it is what ASP.NET uses in the background when creating a local resource for a server control.
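For example, a minimal code-behind using a global resource file might look like this (the file, class, and control names are examples only):

// Hypothetical code-behind for Error.aspx. With <globalization uiCulture="auto"
// culture="auto" /> in web.config, ASP.NET picks the culture from the browser's
// language preference, and the resource lookup is then culture-aware automatically.
using System;
using System.Web.UI;
using System.Web.UI.WebControls;

public partial class ErrorPage : Page
{
    // Normally declared in the .aspx markup/designer file; shown here for completeness.
    protected Label ErrorLabel;

    protected void Page_Load(object sender, EventArgs e)
    {
        // Looks in App_GlobalResources for Errors.resx, Errors.fr.resx,
        // Errors.es.resx, etc., falling back to the neutral file.
        ErrorLabel.Text = (string)GetGlobalResourceObject("Errors", "GenericMessage");
    }
}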
I am wondering if there are any examples (googling, I haven't found any) of TAB auto-complete solutions for a command line interface (console) that use ANTLR4 grammars for predicting the next term (like in a REPL model).
I've written a PL/SQL grammar for an open source database, and now I would like to implement a command line interface to the database that provides the user the feature of completing statements according to the grammar, or eventually discovering the proper database object name to use (e.g. a table name, a trigger name, the name of a column, etc.).
Thanks for pointing me in the right direction.
Actually it is possible! (Depending, of course, on the complexity of your grammar.) The problem with auto-completion and ANTLR is that you do not have a complete expression, yet you want to parse it. If you had a complete expression, it would not be a big problem to know what kind of element sits at what place and what can be used at such a place. But you do not have a complete expression, and you cannot parse an incomplete one. So what you need to do is wrap the input in some wrapper/helper that completes the expression to create a parseable one (a toy sketch follows the steps below). Notice that nothing added merely to complete the expression matters to you - you will only ask for members up to the last character that was really written.
So:
A) Create the wrapper that will change this (Excel formula) '=If(' into '=If()'
B) Parse the wrapped input
C) Realize that you are in the IF function at the first parameter
D) Return all that can go into that place.
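For example, a toy wrapper for a language where only unbalanced parentheses need closing (real grammars need much more care than this):

// Toy sketch: pad an unfinished expression so the parser can accept it.
// Everything appended after the caret exists only to make the input parseable;
// suggestions are still computed at the original caret position.
static string CompleteForParsing(string partialInput)
{
    int open = 0;
    foreach (char c in partialInput)
    {
        if (c == '(') open++;
        else if (c == ')') open--;
    }

    var sb = new System.Text.StringBuilder(partialInput);
    for (int i = 0; i < open; i++)
        sb.Append(')'); // "=If(" becomes "=If()"
    return sb.ToString();
}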
It actually works; I have implemented IntelliSense editors for several simple languages. There is much more infrastructure than this, but the basic idea is as I wrote it. Just be careful: writing the wrapper is not easy, if not impossible, when the grammar is really complex. In that case, look at the Papa Carlo project. http://lakhin.com/projects/papa-carlo/
As already mentioned, auto-completion is based on the follow set at a given position, simply because that is what we defined in the grammar to be the valid language. But that's only a small part of the task. What you need is context (as Sam Harwell wrote: it's a semantic process, not a syntactic one). And this information is independent of the parser. And since a parser is made to parse valid input (and during auto-completion you have, most of the time, invalid input), it's not the right tool for this task.
Knowing what token can follow at a given position is useful to control the entire process (e.g. you don't want to show suggestions if only a string can appear), but is most of the time not what you actually want to suggest (except for keywords). If an ID is possible at the current position, it doesn't tell you what ID is actually allowed (a variable name? a namespace? etc.). So what you need is essentially 3 things:
1. A symbol table that provides you with all possible names sorted by scope. Creating this depends heavily on the parsed language, but this is a task where a parser is very helpful. You may want to cache this info, as it is time consuming to run this analysis step. (A minimal sketch of such a symbol table follows this list.)
2. Determine in which scope you are when invoking auto completion. You could use a parser here as well (maybe in conjunction with step 1).
3. Determine what type of symbol(s) you want to show. Many people think this is where a parser can give you all necessary information (the follow set). But as mentioned above, that's not true (keywords aside).
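For illustration, a minimal scoped symbol table might look like this (the symbol kinds and API are invented for the example):

using System.Collections.Generic;

// Minimal scoped symbol table sketch; kinds and API are illustrative only.
enum SymbolKind { Table, Column, Trigger, Variable }

class Scope
{
    private readonly Scope parent;
    private readonly Dictionary<string, SymbolKind> symbols =
        new Dictionary<string, SymbolKind>();

    public Scope(Scope parent = null) { this.parent = parent; }

    public void Define(string name, SymbolKind kind) => symbols[name] = kind;

    // Collect every visible symbol of a given kind, innermost scope first.
    public IEnumerable<KeyValuePair<string, SymbolKind>> Visible(SymbolKind kind)
    {
        for (var s = this; s != null; s = s.parent)
            foreach (var entry in s.symbols)
                if (entry.Value == kind)
                    yield return entry;
    }
}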
In my blog post Universal Code Completion using ANTLR3 I especially addressed the 3rd step. There I don't use a parser, but simulate one, only that I don't stop when a parser would, but when the caret position is reached (so it is essential that the input must be valid syntax up to that point). After reaching the caret the collection process starts, which not only collects terminal nodes (for keywords) but looks at the rule names to learn what needs to be collected too. Using specific rule names is my way there to put context into the grammar, so when the collection code finds a rule table_ref it knows that it doesn't need to go further down the rule chain (to the ultimate ID token), but instead can use this information to provide a list of tables as suggestion.
With ANTLR4 things might become even simpler. I haven't used it myself yet, but the parser interpreter could be a big help here, as it essentially does what I do manually in my implementation (with the ANTLR3 backend).
This is probably pretty hard to do.
Fundamentally you want to use some parser to predict "what comes next" to display as auto-completion. This has to at least predict what the FIRST token is at the point where the user's input stops.
For ANTLR, I think this will be very difficult. The reason is that ANTLR generates essentially procedural, recursive descent parsers. So at runtime, when you need to figure out what FIRST tokens are, you have to inspect the procedural source code of the generated parser. That way lies madness.
This blog entry claims to achieve autocompletion by collecting error reports rather than inspecting the parser code. It's sort of an interesting idea, but I do not understand how his method really works, and I cannot see how it would offer all possible FIRST tokens; it might acquire some of them. This SO answer confirms my intuition.
Sam Harwell discusses how he has tackled this; he is one of the ANTLR4 implementers and if anybody can make this work, he can. It wouldn't surprise me if he reached inside ANTLR to extract the information he needs; as an ANTLR implementer he would certainly know where to tap in. You are not likely to be so well positioned. Even so, he doesn't really describe what he did in detail. Good luck replicating. You might ask him what he really did.
What you want is a parsing engine for which that FIRST token information is either directly available (the parser generator could produce it) or computable based on the parser state. This is actually possible to do with bottom up parsers such as LALR(k); you can build an algorithm that walks the state tables and computes this information. (We do this with our DMS Software Reengineering Toolkit for its GLR parser precisely to produce syntax error reports that say "missing token, could be any of these [set]")
In my application we have multi-lingual language strings which are stored in custom tables, as the user can edit, delete, import new languages etc... via a UI
Currently, what I'm doing at the beginning of each request is going off and getting all the language strings (from our database) for the currently selected language and sticking them in a dictionary.
I then have a Html Helper extension method which I use in the razor views (See below), which fishes in the dictionary I got at the beginning of the request to pull out the correct language based on the key supplied in the helper.
Html.LanguageString("MyLanguage.KeyHere")
Now this works fine. However, as the application gets bigger, we are getting more and more language strings. It's not an issue right now, as it's still very fast with only around 200 strings to fetch.
But this also means I'm getting all of them, even if a page has, say, one on it. I'd ideally like a way of processing the LanguageString("")'s beforehand and doing a query at the beginning of the request to get just those that are needed. Or maybe my own LINQ-based language that can be processed to produce a more efficient call.
I'm looking for some advice on how to do this, as I'd like the application to be as efficient as possible. Any advice, help, or tips are gratefully received. Thanks.
I'd suggest caching language strings on the application basis rather than fetching them for every request. For example, this can be done by maintaining a static dictionary and invalidating the cache only when the user makes changes to these strings. This will make your application more responsive as well as save you from implementing (imho) rather more complex and not necessarily efficient technique of loading this data on-demand.
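A rough sketch of that idea (names are illustrative; LoadFromDatabase stands in for your existing per-culture query):

using System.Collections.Concurrent;
using System.Collections.Generic;

// Application-wide cache of language strings, invalidated only on edits.
public static class LanguageCache
{
    private static readonly ConcurrentDictionary<string, IDictionary<string, string>> cache =
        new ConcurrentDictionary<string, IDictionary<string, string>>();

    public static string Get(string culture, string key)
    {
        var strings = cache.GetOrAdd(culture, LoadFromDatabase);
        return strings.TryGetValue(key, out var value) ? value : key;
    }

    // Call this from the admin UI whenever a translation is edited.
    public static void Invalidate(string culture) => cache.TryRemove(culture, out _);

    private static IDictionary<string, string> LoadFromDatabase(string culture)
    {
        // Stub: replace with the existing query that fetches all strings
        // for the given culture.
        return new Dictionary<string, string>();
    }
}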
As a side note I'd add the following: it's usually a good practice to address these kinds of problems when they arise (rather than fixing something that is not broken) and focus on more important things. I totally agree that performance implications of a given solution must always be taken into consideration, I'm just saying that premature optimizations are not always a good idea.
I'm developing a "Zork" style text adventure in C#, and it's going to have a fairly large number of different areas with descriptions and environmental modifiers. I don't want to have a database, ideally, unless it really is the best way of doing it.
I need advice on the best way to store/load this data.
It will include:
Area description
Environmental modifiers (windows open/broken, door closed)
Items present by default
I would solve your problem by abandoning C# and writing your program in Inform7. Inform7 is just about the most awesome programming language I have ever seen and it is specifically designed to solve your problem.
The awesome thing about Inform7 is that you write your text adventure in a language that resembles text adventures. For example, here's a fragment of one of the sample adventures' source code:
The iron-barred gate is a door.
"An iron-barred gate leads [gate direction]."
It is north of the Drawbridge and south of the Entrance Hall.
It is closed and openable.
Before entering the castle, try entering the gate instead.
Before going inside in the Drawbridge, try going north instead.
Understand "door" as the gate.
This adds an object to the game - the object is a door, it is called "the iron-barred gate". A door is understood to be between two rooms, in this case, the drawbridge and the entrance hall. If the player tries to "enter the drawbridge" then the game logic will know that this is the same as "go north", and then the door logic will determine whether the door is closed or not. And so on. It makes writing text adventures extremely easy.
Is there some particular reason why you want to use C# instead of a domain-specific language like Inform7? If your goal is to learn how to write C# code or how to build a parser or whatever, then by all means do it yourself. If your goal is to write a text adventure, then I'd use a language designed for that.
Serialize all the data to file. It will ensure the smallest footprint when the user installs the game, without any real disadvantage. A database is great when you have a lot of data, but you are talking about a text adventure in which you will load the entire game contents into memory. A simple file will work for this very nicely.
Note, I'm not talking about XML but binary serialization. Any kind of text serialization will allow users to peek at your data and either cheat or hack up the game. And you can just as easily swap the serialized data file in/out whether it's text or binary. Remember, your whole 'text' is likely to be just a few hundred kilobytes at most.
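A minimal sketch of that approach (BinaryFormatter was the idiomatic choice at the time; in modern .NET it is deprecated for security reasons, but the shape of the approach is the same, and the types here are examples):

using System;
using System.Collections.Generic;
using System.IO;
using System.Runtime.Serialization.Formatters.Binary;

[Serializable]
public class Area
{
    public string Description;
    public List<string> Items = new List<string>();
}

[Serializable]
public class GameWorld
{
    public List<Area> Areas = new List<Area>();
}

static class GameData
{
    public static void Save(GameWorld world, string path)
    {
        using (var stream = File.Create(path))
            new BinaryFormatter().Serialize(stream, world);
    }

    public static GameWorld Load(string path)
    {
        using (var stream = File.OpenRead(path))
            return (GameWorld)new BinaryFormatter().Deserialize(stream);
    }
}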
There are many interactive fiction engines already. I would take a look at their data formats, that way you can re-use existing content and tools for editing the content.
The most popular engines currently are Glulx http://eblong.com/zarf/glulx/ and Z-Machine http://en.wikipedia.org/wiki/Z-machine
Here is a technical reference for the Glulx format: http://eblong.com/zarf/glulx/technical.txt
I know you didn't want a DB, but have you looked at SQL Server Compact Edition? It might just do what you want.
I'd argue C# offers you precisely the right tools for this. Just encapsulate your structures into classes. Our first OOP project at university was exactly this problem! It is the perfect case study for OOP.
You can then use C#'s many serialization methods to store it persistently (load/save) however you see fit.
How would 'scripting' your adventure as one large text file sound? Then have your application parse this file, build the adventure in classes and run from there?
This would mean you could edit the adventure using a simple text editor. I would imagine that when multiple decisions can be made from a single source it may become tricky visualising the links. However this would also be tricky in a DB without some specialist front-end.
UPDATE:
Or have you considered XML? E.g.:
<area id="DarkRoom1">
  <description>Dark Room</description>
  <item>Bucket</item>
  <item>Spade</item>
</area>
Then use this to populate your classes in memory.
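Loading that XML into classes takes only a few lines with LINQ to XML (the Area class here is hypothetical, matching the snippet above):

using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

class Area
{
    public string Id;
    public string Description;
    public List<string> Items;
}

class AdventureLoader
{
    // Reads every <area> element from the file into Area objects.
    public static List<Area> Load(string path) =>
        XDocument.Load(path)
            .Descendants("area")
            .Select(a => new Area
            {
                Id = (string)a.Attribute("id"),
                Description = (string)a.Element("description"),
                Items = a.Elements("item").Select(i => i.Value).ToList(),
            })
            .ToList();
}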
You could store the data in a file system (a zip file or a folder).
Each choice could be stored as a folder, while all descriptions, modifiers and other data could be stored as a text file (XML?). When the user makes a decision, you go to the appropriate folder and follow the plot.
Example:
Do you want to:
open the door (door)
leave (leave)
If the user chooses to open the door, you go to the door folder and read the data from the data file in that folder (a bare-bones sketch follows the pros and cons below).
Pros:
simple
do not require database
easy to add new adventures
Cons:
rolling back decisions is a problem (getting back to the start or to a certain point in the plot)
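The bare-bones folder walk mentioned above might look like this (the file name is invented for the example):

using System.IO;

// Hypothetical: each choice is a subfolder; "data.txt" holds the area text.
static string FollowChoice(string currentFolder, string choice)
{
    string next = Path.Combine(currentFolder, choice);
    if (!Directory.Exists(next))
        return null; // not a valid choice from here

    return File.ReadAllText(Path.Combine(next, "data.txt"));
}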
Personally, I'd avoid a database in this case and go with a text-based file format: probably two distinct files, one for the initial state (terrain and such), which never gets modified, and one for state that is modified during the course of the game (the broken windows etc.); or split the entire thing up into one pair of static/dynamic files per area.
A few reasons:
A text file is human-readable; hence, you can create content without a dedicated editor, while with a database approach, you'd either have to enter data through queries, or code a level editor
Assuming a single-player scenario, concurrency is not an issue
Savegames are a matter of copying the modified-state files into a savegame folder, or packing them into a single file
You can easily embed scripts
The data structures you're dealing with are probably simple enough for data integrity not to be a serious issue
Do you guys think internationalization should only be done at the code level, or should it be done in the database as well?
Are there any design patterns or accepted practices for internationalizing databases?
If you think it's a code problem then do you just use Resource files to look up a key returned from the database?
Thanks
Internationalization extends to the database insofar as your columns must be able to hold the data: NText, NChar, NVarchar instead of Text, Char, Varchar.
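On the code side, that also means keeping strings Unicode end to end, e.g. passing them as NVarChar parameters rather than VarChar (the table and column here are examples only):

using System.Data;
using System.Data.SqlClient;

// Hypothetical insert; NVarChar keeps Unicode text (e.g. Kanji) intact.
static void SaveCustomerName(SqlConnection conn, string name)
{
    using (var cmd = new SqlCommand(
        "INSERT INTO Customers (Name) VALUES (@name)", conn))
    {
        cmd.Parameters.Add("@name", SqlDbType.NVarChar, 100).Value = name;
        cmd.ExecuteNonQuery();
    }
}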
As far as your UI goes, for non-changing labels, resources files are a good method to use.
If you refer to making your app support multiple languages at the UI level, then there are more possibilities. If the labels never change, except when you release a new version, then resource files that get embedded in the executable or assembly itself are your best bet, since they work faster. If, on the other hand, your labels need to be adjusted at runtime by the users, then storing the translations in the database is a good choice. As for the code itself and the names of tables and fields in the database, we keep them in English as much as possible, since English is the de facto standard for IT people.
It depends a lot on what you are storing in your database. Taking examples from a recent project I was on: a film title that is entered at a client site and is only visible to that client is fair game to store as-is in the database. A projector error code, on the other hand, because it can be viewed by the client as well as by network operations centers that might be in different countries, should be stored as an error code (with supporting data, like lamp hours and the title of the movie being shown) which can be translated at the GUI level depending on the language setting of the viewer.
#hova covers the technicalities, but something you might want to consider is support of a system showing a language you don't understand.
One way to cope with this is to have English as the default language, and a user setting that switches into a different language. That way your support users can log in and see the system in a natural way (assuming English as their first language), and your actual users can see the system in their first language. IMO, the data should always be 'natural' - in the language of the users.
Which raises another interesting point - should your system allow multiple languages for cross-border installations? In my experience, for user interface yes, but for data, no. To take a trivial example of address formatting, a letter to a French third party from a Swiss system should still have a Swiss-format address instead of a French one, as it has to go through the Swiss postal system first.
If your customers are Japanese and want to see their names in Kanji and Katakana (and sometimes in most formal Gaiji), you've got to store them as Unicode. No way around that.
Even things like addresses are very different between the US and Japan. One schema won't cut it for both.