Prepare for multi language capability. Have I missed a trick?

Prepare for multi language capability. Have I missed a trick? - c#

My asp.net web app is currently being developed and I want to handle any language input by the user. This input will then be displayed to other users on the site.
So far I have done the following:
Put this is the head - meta http-equiv="Content-Type" content="text/html; charset=utf-8"
Saved inputs in NVARCHAR fields
Do I need to do anything else? Do I need any other meta tags (content-language, etc)?

Also think of a way to localize your UI, either via resources or with an appropriate support in your database. If the users are expected to generate non-English content, they will definitely appreciate seeing UI in their native language.

You should remember not to make assumptions that are not valid in general.
A fairly common assumption that is wrong is that (str.ToUpper().ToLower() == str) for any string str. A more subtle assumption is that the concept of "upper" and "lower" case even makes sense for any given language.
Another frequent problematic assumption is that a single char in the input is always an actual character from user's perspective. This is wrong - even setting things such as surrogate pairs aside, there are also combining characters. You either have to normalize your strings (and even that isn't 100% foolproof), or just avoid dealing with individual chars.
If you want to deal with more than just plain text input displayed verbatim - i.e. full, proper localization - you'll also have to handle number, date, currency etc formats correctly; and, for example, do not assume that decimal separator is a dot.
My best general advice would be to just go and read Michael Kaplan's blog, Microsoft's local guru on localization and related issues. Look for categories (tags) such as "Collation/Casing", "Encoding/Codepages" and "Int'l Programming". There's a lot of stuff there, and most of it is either directly relevant to your question, or interesting, or both. If, after reading a couple of his blog posts, you start thinking that maybe hiring a localization expert just to point out potential non-obvious problems in that area is a good idea, then you're probably right :)

Whenever you use String.Format append client's culture spec. Using FxCop allows to explore these places.
Exclude string constants from .cs code
Place images (that can contain culture specific text) into skin files or resources.

The browsers determine the charset in the following order:
Content-Type http header (value example: "text/html; charset=utf-8")
XML declaration
meta attribute
You should check that the web server does not send conflicting content type information in headers.
Make sure you save the files in UTF-8.

Related

How can i make sure one word can be surrounded by an element ASP.NET Localization

I am translating a string like this and I am wondering if there is a built in functionality so you can highlight a word, so you later can surround a specific word around an element like in the example below
What is shown when users translate
Read our Terms of Service here.
What is shown on the website
Read our Terms of Service here.

What you describe doesn't make sense in general, because any given word or phrase in the source language may get translated into a different number of words that may not even be contiguous.
The way this is typically done is that you simply include the markup in your string, and your translators deal with it. Any competent translation service can handle html markup.
Another option is to make the entire sentence the clickable text. The basic principle is to provide complete segments for translation. Attempting to deconstruct or reconstruct from parts is doomed to fail under localization.

Localization alternative to Resx file

Why I don't want to use Resx files:
I am looking for an alternative for resx files to offer multilanguage support for my project, due to the following reasons:
I don't like to specify a "messageId" when writing messages, it is more effort and it is annoying for the flow as I don't see what the log message would actually say and I would need to open another tab to edit the message
Sometimes I use code inline because I don't want to create new variables for to easy steps (e. g. Log.Info("Iterated {i+1} times");). Using variables or doing simple calculations inline makes the whole code sometimes more clearly than creating additional code lines
What I could imagine instead:
An external application which crawls a compiled exe for all strings, giving you the opportunity to ignore/add strings which should be translated. It could create a XML or Json file for all languages as well then. It would replace all strings with a hash/id so that a lookup for strings in all languages is still possible.
Am I the only one who is not happy with the commonly used Resx / centralized string db solution? Do I miss points why this wouldn't be a good idea?

One reason for relying on established approaches instead of implementing your own format is translation. It really depends on how your resources are translated: if it is done by volunteers with a technical background who don't mind working in a plain text editor, then you are free to come up with your own resource format. If on the other hand you send out your resources to professional translators who are not very technical and who prefer to work in a translation environment with integrated terminology management, translation memory, spelling and quality checks etc. it is quite likely that this environment will not be able to handle your homemade resource format.
Since I already mentioned professional translation environments: some of these tools rely on IDs to figure out which strings are old and which are new. If you use the approach that the text is the ID every fixed typo in your source language means that you create a new string that needs to be translated - and paid for. If the translator sees that the source text for a string has changed he can have a look at the change, notice that a typo has been fixed, decide that the translation is still OK and sign the string off, without extra translation cost.
By the way, if you want good localizations for strings like Log.Info("Iterated {i+1} times"); you have to find some way of dealing with plural forms correctly. Some languages have different grammatical rules for different numbers (see the Unicode Language Plural Rules for an overview). Just because something is easy to do in code does not mean that it is easy to localize, I'm afraid.
To sum this up: if you want to create your own resource format, talk with your translators. Ask them which formats they can handle. Think about translation related limitations that come with your format, for example if there are any characters that the translators should not use because they break your strings? Apostrophes and quotes are prime candidates here because they are often used as string delimiters in resource files, or < and & if you decide to go the XML way. Think about a conversion to XLIFF and back: most translation environments can handle XLIFF.

Stand-alone Error Page with translated text?

I'm working on a website that will deployed internationally. Very big site, but for the sake of simplicity, all we're concerned about is my Error.aspx with c# code behind. I'd like to make this custom error page as dynamic as possible. There's at least a handful of languages we need to read this page in right now, and more to come. This page needs to work independently and without a database to reference.
I'd like to have some text, and have the appropriate translation appear based on the language appropriate for that domain... e.g. ".com" = English, ".ca/fr" = French, ".mx" = Spanish... you get the idea.
What's the best way to do this?
I've looked into API's, but the decent ones have a cost threshold, and while it might look really helpful, this is just pretty standard error message text, that's unlikely to change, so that seems like overkill to have a dynamic translator. It might help with scalability, but it's extra money indefinitely, when it will only save vs hard-coding on the handful of occasions where we add another language/country/domain.
The other idea I had was to simply hardcode it in the c#. parse out Request.URL and get the domain, and make a ever-growing switch statement which would assign the appropriate text. (As an aside, I'm also trying to find a better way to do this, but is the country code something that would be an available piece of information from either the request object or server?) This way would be independent, precise, and the only drawback on a concrete level would be the cost of adding new languages, or changing every string (probably not that many, at least at first) if the content of the error message needed to be adjusted. But this feels like bad practice.
I've been researching this for a day now, but I haven't found any alternatives to these 2 options. What are the best practices for handling small amounts of text for translation, without the use of a CMS?

There is an easy built-in way to handle localization in ASP.NET Web Forms. It uses the Language Preference settings in the client's browser to select the language. Posting the steps of setting it up would be redundant since there's lots of information on this subject available online. Here is a good tutorial.
EDIT:
It might be a good idea to read up on HTML resource files. That is the HTML standard for handling different languages (referred to as localization). And it is what ASP.NET uses in the background when creating a local resource for a server control.

Translation and localization issue

Does Microsoft implementation of C# runtime offer some localization mechanism to translate common strings like Overflow, Stack overflow, Underflow, etc...
See the code below - it's a part of Mono and Mono itself has a Locale.GetText routine for making such translations.
// Added to avoid possible integer overflow.
if (inputOffset > inputBuffer.Length - inputCount)
throw new ArgumentException("inputOffset" +
Locale.GetText("Overflow");
Now - how is it done in Microsoft version of runtime and how can I use it, for example, to get the localized equivalent of Overflow without adding resource files?

.NET provides a framework that makes it easy to localize your content (ResourceManager) and while it internally maintains some translations for its own purpose (for example DateTime.ToString gives you a textual representation for the date/time that is locally appropriate, which includes the translated month and day names), it does not provide you with any ready-made translations, be they common strings or not. It could hardly do this reliably anyway, as there is a plethora of human languages out there and words can have different translations depending on context etc.
In your example, I would say that you are OK with untranslated exception messages. Although Microsoft recommends that you localize exception descriptions and they do localize their own (at least for major languages), this advice seems ill-thought at it's not only a waste of effort to translate all this text that users probably should never see, but it can make debugging a nightmare.

Yes, it does and it's a terrible idea. It makes debugging so much harder.

without adding resource files
What do you have against resource files? Resources are the prescribed way to provide localized and localizable strings, images, and other data for a .NET app or assembly.
Note that single word substitution as shown in your example code will result in poor quality translations. Different languages have different sentence structure and word order which your single word substitution won't accommodate. Non-English languages often involve genders for nouns and declension of words to properly reflect their role and number in a phrase. Single word substitution fails miserably at this.
Your non-English customers will most likely prefer that you not butcher their language by attempting to partially translate text a word here and a word there. If you're going to go to the trouble of supporting localizable messages, do it right and allow the entire string to be translated so that word ordering and declension can be done properly by translators. In cases where the content is variable, make the format string a resource so that the translator can set off the variable data using the conventions of the language.

BBCode to HTML transformation rules

Background
I have written very simple BBCode parser using C# which transforms BBCode to HTML. Currently it supports only [b], [i] and [u] tags. I know that BBCode is always considered as valid regardless whatever user have typed. I cannot find strict specification how to transform BBCode to HTML
Question
Does standard "BBCode to HTML" specification exist?
How should I handle "[b][b][/b][/b]"? For now parser yields "<b>[b][/b]</b>".
How should I handle "[b][i][u]zzz[/b][/i][/u]" input? Currently my parser is smart enough to produce "<b><i><u>zzz</u></i></b>" output for such case, but I wonder that it is "too smart" approach, or it is not?
More details
I have found some ready-to-use BBCode parser implementations, but they are too heavy/complex for me and, what is worse, use tons of Regular Expressions and produce not that markup what I expect. Ideally, I want to receive XHTML at the output. For inferring "BBCode to HTML" transformation rules I am using this online parser: http://www.bbcode.org/playground.php. It produces HTML that is intuitively correct on my opinion. The only thing I dislike it does not produce XHTML. For example "[b][i]zzz[/b][/i]" is transformed to "<b><i>zzz</b></i>" (note closing tags order). FireBug of course shows this as "<b><i>zzz</i></b><i></i>". As I understand, browsers fix such wrong closing tags order cases, but I am in doubt:
Should I rely on this browsers feature and do not try to make XHTML.
Maybe "[b][i]zzz[/b]ccc[/i]" must be understood as "<b>[i]zzz</b>ccc[/i]" - looks logically for such improper formatting, but is in conflict with popular forums BBCode outputs (*zzz****ccc*, not **[i]zzzccc[/i])
Thanks.

On your first question, I don't think that relying on browsers to correct any kind of mistakes is a good idea regardless the scope of your project (well, maybe except when you're actually doing bug tests on the browser itself). Some browsers might do an awesome job on that while others might fail miserably. The best way to make sure the output syntax is correct (or at least as correct as possible) is to send it with a correct syntax to the browser in the first place.
Regarding your second question, since you're trying to have correct BBCode converted to correct HTML, if your input is [b][i]zzz[/b]ccc[/i], its correct HTML equivalent would be <i><b>zzz</b>ccc</i> and not <b>[i]zzz</b>ccc[/i]. And this is where things get complicated as you would not be writing just a converter anymore, but also a syntax checker/correcter. I have written a similar script in PHP for a rather weird game engine scripting language but the logic could be easily applied to your case. Basically, I had a flag set for each opening tag and checked if the closing tag was in the right position. Of course, this gives limited functionality but for what I needed it did the trick. If you need more advanced search patterns, I think you're stuck with regex.

If you're only going to implement B, I and U, which aren't terribly important tags, why not simply have a counter for each of those tags: +1 each time it is opened, and -1 each time it's closed.
At the end of a forum post (or whatever) if there are still-open tags, simply close them. If the user puts in invalid bbcode, it may look strange for the duration of their post, but it won't be disastrous.

Regarding invalid user-submitted markup, you have at least three options:
Strip it out
Print it literally, i.e. don't convert it to HTML
Attempt to fix it.
I don't recommend 3. It gets really tricky really fast. 1 and 2 are both reasonable options.
As for how to parse BBCode, I strongly recommend against using regex. BBCode is actually a fairly complex language. Most significantly, it supports nesting of tags. Regex can't handle arbitrary nesting. That's one of the fundamental limitations of regex. That makes it a bad choice for parsing languages like HTML and BBCode.
For my own project, rbbcode, I use a parsing expression grammer (PEG). I recommend using something similar. In general, these types of tools are called "compiler compilers," "compiler generators," or "parser generators." Using one of these is probably the sanest approach, as it allows you to specify the grammar of BBCode in a clean, readable format. You'll have fewer bugs this way than if you use regex or attempt to build your own state machine.

We Keep Coding

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.