I am writing a program that needs to parse a bunch of text files generated by some third-party software. Some of these files will be generated in France, where something like "1,5" means "one and a half". Other files will be generated in the US, where "1,5" is not a number, and "one and a half" is "1.5". Of course, "1,234.5" is a legitimate number in the US.
These are just examples; in reality, my program needs to deal with a variety of numbers in a variety of locales; it needs to handle things like "e-5" and "2e10", etc. Unfortunately, there's no way to know ahead of time which file comes from which locale.
Is there some commonly accepted solution to this problem in C#? I realize that I can write my own number-parsing code, but I'd prefer to avoid it unless there's no other way...
Since your entire input file has been generated from one locale, you could look at the problem as having to detect the specific locale from the input file prior to actually parsing it. It's an extra requirement that results from the inadequate input files (which should all use one agreed locale or have a field to specify the locale used).
Language detection is not a complete solution as number formatting is not language-specific but locale-specific. Here is an example: If you detect the language as Spanish, would that be es-ES (Spain) or es-MX (Mexico)? In the former case, the decimal separator is a comma (1,23). In the latter, the decimal separator is a period (1.23).
The solution would be heuristics-based. The simplest is probably this: if you know what your locale generally is (e.g. most of your users use the period as the decimal separator), you could have an ordered list of culture identifiers and try them one after the other until you find one that can be used to interpret all the numbers in the file. It could be as simple as starting with en-US and, failing that, trying en-GB, since for numbers there really aren't many more formats.
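A minimal sketch of that idea, assuming the candidate cultures and the helper below are tailored to your expected sources (they are illustrative, not a fixed API):

using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;

static class LocaleGuesser
{
    // Ordered by how likely each format is for your input; adjust as needed.
    static readonly CultureInfo[] Candidates =
    {
        CultureInfo.GetCultureInfo("en-US"),   // 1,234.5
        CultureInfo.GetCultureInfo("fr-FR"),   // 1 234,5
        CultureInfo.GetCultureInfo("de-DE"),   // 1.234,5
    };

    // Returns the first culture that can parse every number token, or null if none can.
    public static CultureInfo GuessCulture(IEnumerable<string> numberTokens)
    {
        var tokens = numberTokens.ToList();
        return Candidates.FirstOrDefault(culture =>
            tokens.All(t => double.TryParse(
                t, NumberStyles.Float | NumberStyles.AllowThousands, culture, out _)));
    }
}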
This may be a somewhat overdesigned solution, but it could work (in case your text files contain some text apart from the numbers):
Detect the language of your text files using letter frequency. Google has open-sourced the code it uses in Chrome to detect page language - http://code.google.com/p/chromium-compact-language-detector/. I think I saw a C# wrapper for this, but I can't find it now. If you don't want to use any library, it is not that difficult to implement on your own. I have done some very simple testing of this algorithm, and it seems that a language can be detected from only about 15-20 letters.
Build a regular expression based on the rules for the detected language (or just parse the numbers directly). This can be a fairly complex problem, considering that there are many rules for decimal separators, number grouping, negative signs, etc., but it is not impossible to implement.
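As a rough illustration of step 2, a validation pattern could be derived from a culture's NumberFormatInfo instead of being hard-coded. This sketch deliberately ignores some cases (leading decimal points, currency symbols) and is only meant to show the idea:

using System.Globalization;
using System.Text.RegularExpressions;

static Regex BuildNumberRegex(CultureInfo culture)
{
    var nf = culture.NumberFormat;
    string group = Regex.Escape(nf.NumberGroupSeparator);
    string dec = Regex.Escape(nf.NumberDecimalSeparator);
    // Optional sign, digits grouped in threes or ungrouped,
    // optional fraction, optional exponent.
    return new Regex(
        $@"^[-+]?(?:\d{{1,3}}(?:{group}\d{{3}})+|\d+)(?:{dec}\d+)?(?:[eE][-+]?\d+)?$");
}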
As you can see from the comments, your problem has no fail-safe solution.
The best you can do is minimize the error:
Since each file (hopefully) contains several numbers, all from the same locale, try parsing the numbers in the file with all the expected distinct locales (i.e. don't parse with both en-US and en-AU, for instance, as the number format is the same in both).
After parsing you'll end up with either of:
A single matching locale.
Multiple locales.
In the second case, test whether the results from all locales match (most locales parse integers without thousands separators and numbers in scientific notation the same way); a sketch of this step follows below.
If they match, there is no problem; otherwise, try to employ heuristics to figure out the correct locale:
Are the values in the expected range?
If there is any other text in the file, you can do a word search in language dictionaries to try to figure out the language.
If everything fails, discard the file and mark it for manual processing.
Your program should have a facility that allows marking files as belonging to a specific culture, bypassing the heuristics.
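A sketch of the parse-and-compare step described above; the helper names are invented for this example, and the candidate list would come from your own configuration:

using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;

static class MultiLocaleParser
{
    // Try every candidate culture; keep only those that can parse all tokens.
    public static List<(CultureInfo Culture, double[] Values)> ParseWithAll(
        IReadOnlyList<string> tokens, IEnumerable<CultureInfo> candidates)
    {
        var results = new List<(CultureInfo Culture, double[] Values)>();
        foreach (var culture in candidates)
        {
            var values = new double[tokens.Count];
            bool ok = true;
            for (int i = 0; i < tokens.Count && ok; i++)
                ok = double.TryParse(tokens[i],
                    NumberStyles.Float | NumberStyles.AllowThousands, culture, out values[i]);
            if (ok) results.Add((culture, values));
        }
        return results;
    }

    // If every surviving culture produced the same values, the ambiguity is harmless;
    // otherwise fall back to the heuristics (expected ranges, dictionary lookups, ...).
    public static bool AllAgree(List<(CultureInfo Culture, double[] Values)> results) =>
        results.Count <= 1 ||
        results.Skip(1).All(r => r.Values.SequenceEqual(results[0].Values));
}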
Your best choice is to change the input format so that the file locale is specified somewhere, such as in the data, the name of the file or an accompanying metadata file.
Related
I am seeing a different Unicode character as the number group separator for the "de-CH" culture when running on a local desktop and in Azure.
When the following code is run on my desktop in .NET Core 3.1 or .NET Framework 4.7.2, it outputs 2019 (U+2019, RIGHT SINGLE QUOTATION MARK), which looks like an apostrophe but is not the same character.
When run in Azure, for instance in https://try.dot.net or (slightly modified) in an Azure function running on .NET Core 3.1 (on a Windows based App Service) it results in 0027, a standard ASCII apostrophe.
using System;
using System.Linq;
using System.Globalization;

Console.WriteLine(((int)(CultureInfo
    .GetCultureInfo("de-CH")
    .NumberFormat
    .NumberGroupSeparator
    .Single()))        // just getting the single character as an int
    .ToString("X4")    // the Unicode value of that character
);
The result is that trying to parse the string 4'200.000 (where the apostrophe is U+0027) on the local desktop using the "de-CH" culture fails, but it works in Azure.
Why the difference?
This Microsoft blog by Shawn Steele explains why you shouldn't rely on a specific culture setting being stable (Fully quoted because it is no longer online at MSDN):
https://web.archive.org/web/20190110065542/https://blogs.msdn.microsoft.com/shawnste/2005/04/05/culture-data-shouldnt-be-considered-stable-except-for-invariant/
CultureInfo and RegionInfo data represents a cultural, regional, admin
or user preference for cultural settings. Applications should NOT
make any assumptions that rely on this data being stable. The only
exception (this is a rule, so of course there's an exception) is for
CultureInfo.InvariantCulture. CultureInfo.InvariantCulture is
supposed to remain stable, even between versions.
There are many reasons that cultural data can change. With Whidbey
and Custom Cultures the list gets a little longer.
The most obvious reason is that there is a bug in the data and we had to make a change. (Believe it or not we make mistakes ;-)) In this case our users (and yours too) want culturally correct data, so we have to fix the bug even if it breaks existing applications.
Another reason is that cultural preferences can change. There're lots of ways this can happen, but it does happen:
Global awareness, cross cultural exchange, the changing role of computers and so forth can all effect a cultural preference.
International treaties, trade, etc. can change values. The adoption of the Euro changed many countries currency symbol to €.
National or regional regulations can impact these values too.
Preferred spelling of words can change over time.
Preferred date formats, etc can change.
Multiple preferences could exist for a culture. The preferred best choice can then change over time.
Users could have overridden some values, like date or time formats. These can be requested without user override, however we recommend that applications consider using user overrides.
Users or administrators could have created a replacement culture, replacing common default values for a culture with company specific, regional specific, or other variations of the standard data.
Some cultures may have preferences that vary depending on the setting. A business might have a more formal form than an Internet Café.
An enterprise may require a specific date format or time format for the entire organization.
Differing versions of the same custom culture, or one that's custom on one machine and a windows only culture on another machine.
So if you format a string with a particular date/time format, and then
try to Parse it later, parse might fail if the version changed, if the
machine changed, if the framework version changed (newer data), or if
a custom culture was changed. If you need to persist data in a
reliable format, choose a binary method, provide your own format or
use the InvariantCulture.
Even without changing data, remembering to use Invariant is still a
good idea. If you have different . and , syntax for something like
1,000.29, then Parsing can get confused if a client was expecting
1.000,29. I've seen this problem with applications that didn't realize that a user's culture would be different than the developer's
culture. Using Invariant or another technique solves this kind of
problem.
Of course you can't have both "correct" display for the current user
and perfect round tripping if the culture data changes. So generally
I'd recommend persisting data using InvariantCulture or another
immutable format, and always using the appropriate formatting APIs for
display. Your application will have its own requirements, so consider
them carefully.
Note that for collation (sort order/comparisons), even Invariant
behavior can change. You'll need to use the Sort Versioning to get
around that if you require consistently stable sort orders.
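The practical takeaway for data you persist and later parse yourself: format and parse with CultureInfo.InvariantCulture, so the stored text does not depend on culture data that can change between machines or framework versions. For example:

using System.Globalization;

string stored = 4200.5.ToString(CultureInfo.InvariantCulture);              // "4200.5"
double roundTripped = double.Parse(stored, CultureInfo.InvariantCulture);   // 4200.5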
If you need to parse data automatically that is formatted to be user-friendly, there are two approaches:
Allow the user to explicitly specify the used format.
First remove every character except digits, the minus sign, and the decimal separator from the string before trying to parse it. Note that you need to know the correct decimal separator first; there is no way to guess it reliably, and guessing wrong could result in major problems. A rough sketch of this follows after this list.
Wherever possible, avoid parsing numbers that are formatted to be user-friendly; instead, try to request numbers in a strictly defined (invariant) format.
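A rough sketch of the second approach, assuming the decimal separator is supplied by the user or by configuration (it deliberately ignores exponents and thousands grouping to keep the idea visible):

using System;
using System.Globalization;
using System.Linq;

static double ParseLoosely(string text, char decimalSeparator)
{
    // Keep only digits, the minus sign and the known decimal separator.
    var cleaned = new string(text
        .Where(c => char.IsDigit(c) || c == '-' || c == decimalSeparator)
        .ToArray());

    // Normalize the separator to '.' so the invariant culture can parse it.
    cleaned = cleaned.Replace(decimalSeparator, '.');
    return double.Parse(cleaned, NumberStyles.Float, CultureInfo.InvariantCulture);
}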
Why I don't want to use Resx files:
I am looking for an alternative for resx files to offer multilanguage support for my project, due to the following reasons:
I don't like having to specify a "messageId" when writing messages; it is more effort, and it disrupts the flow, because I don't see what the log message actually says and would need to open another tab to edit it.
Sometimes I use code inline because I don't want to create new variables for trivial steps (e.g. Log.Info($"Iterated {i+1} times");). Using variables or doing simple calculations inline sometimes makes the code clearer than creating additional lines.
What I could imagine instead:
An external application which crawls a compiled exe for all strings, giving you the opportunity to ignore or add strings which should be translated. It could then create an XML or JSON file for each language, and it would replace all strings with a hash/id so that a lookup for strings in all languages is still possible.
Am I the only one who is not happy with the commonly used Resx / centralized string db solution? Do I miss points why this wouldn't be a good idea?
One reason for relying on established approaches instead of implementing your own format is translation. It really depends on how your resources are translated: if it is done by volunteers with a technical background who don't mind working in a plain text editor, then you are free to come up with your own resource format. If on the other hand you send out your resources to professional translators who are not very technical and who prefer to work in a translation environment with integrated terminology management, translation memory, spelling and quality checks etc. it is quite likely that this environment will not be able to handle your homemade resource format.
Since I already mentioned professional translation environments: some of these tools rely on IDs to figure out which strings are old and which are new. If you use the approach that the text is the ID every fixed typo in your source language means that you create a new string that needs to be translated - and paid for. If the translator sees that the source text for a string has changed he can have a look at the change, notice that a typo has been fixed, decide that the translation is still OK and sign the string off, without extra translation cost.
By the way, if you want good localizations for strings like Log.Info("Iterated {i+1} times"); you have to find some way of dealing with plural forms correctly. Some languages have different grammatical rules for different numbers (see the Unicode Language Plural Rules for an overview). Just because something is easy to do in code does not mean that it is easy to localize, I'm afraid.
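To make the plural problem concrete, here is a hypothetical sketch; the resource keys and the hard-coded English rule are invented for illustration, and real projects usually lean on ICU-style message formats or a dedicated library instead:

using System.Resources;

static string FormatIterations(ResourceManager resources, int n)
{
    // English needs only "one" and "other"; many languages need more categories,
    // so a hard-coded n == 1 check does not generalize.
    string key = (n == 1) ? "IteratedTimes_One" : "IteratedTimes_Other";
    string pattern = resources.GetString(key) ?? "Iterated {0} times";
    return string.Format(pattern, n);
}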
To sum this up: if you want to create your own resource format, talk with your translators. Ask them which formats they can handle. Think about translation-related limitations that come with your format: for example, are there any characters that the translators should not use because they would break your strings? Apostrophes and quotes are prime candidates here because they are often used as string delimiters in resource files, as are < and & if you decide to go the XML way. Also think about a conversion to XLIFF and back: most translation environments can handle XLIFF.
Does Microsoft's implementation of the C# runtime offer some localization mechanism to translate common strings like Overflow, Stack overflow, Underflow, etc.?
See the code below; it's part of Mono, and Mono itself has a Locale.GetText routine for making such translations.
// Added to avoid possible integer overflow.
if (inputOffset > inputBuffer.Length - inputCount)
    throw new ArgumentException("inputOffset" +
        Locale.GetText("Overflow"));
Now - how is it done in Microsoft version of runtime and how can I use it, for example, to get the localized equivalent of Overflow without adding resource files?
.NET provides a framework that makes it easy to localize your content (ResourceManager) and while it internally maintains some translations for its own purpose (for example DateTime.ToString gives you a textual representation for the date/time that is locally appropriate, which includes the translated month and day names), it does not provide you with any ready-made translations, be they common strings or not. It could hardly do this reliably anyway, as there is a plethora of human languages out there and words can have different translations depending on context etc.
In your example, I would say that you are OK with untranslated exception messages. Although Microsoft recommends that you localize exception descriptions, and they do localize their own (at least for major languages), this advice seems ill-considered: it's not only a waste of effort to translate all this text that users should probably never see, but it can also make debugging a nightmare.
Yes, it does and it's a terrible idea. It makes debugging so much harder.
without adding resource files
What do you have against resource files? Resources are the prescribed way to provide localized and localizable strings, images, and other data for a .NET app or assembly.
Note that single-word substitution as shown in your example code will result in poor-quality translations. Different languages have different sentence structure and word order, which single-word substitution won't accommodate. Non-English languages often involve genders for nouns and declension of words to properly reflect their role and number in a phrase. Single-word substitution fails miserably at this.
Your non-English customers will most likely prefer that you not butcher their language by attempting to partially translate text a word here and a word there. If you're going to go to the trouble of supporting localizable messages, do it right and allow the entire string to be translated so that word ordering and declension can be done properly by translators. In cases where the content is variable, make the format string a resource so that the translator can set off the variable data using the conventions of the language.
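A small sketch of that last point, with the whole format string stored as a resource so translators can reorder the placeholders; the resource name and the ResourceManager instance are assumptions for the example:

using System.Globalization;
using System.Resources;

static string DescribeCopy(ResourceManager resources, string fileName, int count)
{
    // English resource "FilesCopied":   "Copied {1} items from {0}."
    // A German translation might read:  "{1} Einträge aus {0} kopiert."
    string pattern = resources.GetString("FilesCopied", CultureInfo.CurrentUICulture)
                     ?? "Copied {1} items from {0}.";
    return string.Format(CultureInfo.CurrentCulture, pattern, fileName, count);
}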
Question: In terms of program stability and ensuring that the system will actually operate, how safe is it to use chars like ¦, § or ‡ for complex delimiter sequences in strings? Can I reliably trust that I won't run into any issues with a program reading these incorrectly?
I am working in a system, using C# code, in which I have to store a fairly complex set of information within a single string. The readability of this string is only necessary on the computer side, end-users should only ever see the information after it has been parsed by the appropriate methods. Because some of the data in these strings will be collections of variable size, I use different delimiters to identify what parts of the string correspond to a certain tier of organization. There are enough cases that the standard sets of ;, |, and similar ilk have been exhausted. I considered two-char delimiters, like ;# or ;|, but I felt that it would be very inefficient. There probably isn't that large of a performance difference in storing with one char versus two chars, but when I have the option of picking the smaller option, it just feels wrong to pick the larger one.
So finally, I considered using the set of characters like the double dagger and section. They only take up one char, and they are definitely not going to show up in the actual text that I'll be storing, so they won't be confused for anything.
But character encoding is finicky. While visibility to the end user is irrelevant (since they won't see the string), I recently became concerned about how the programs in the system will read it. The string is stored in one database, while a separate program is responsible for both encoding and decoding the string into different object types for the rest of the application to work with. If something that is expected to be written one way is possibly written another way, then the whole system could fail, and I can't let that happen. So is it safe to use these kinds of chars as background delimiters?
Since you must encode the data in a string, I assume it is because you are interfacing with other systems. Why not use something like XML or JSON for this rather than inventing your own data format?
With XML you can specify the encoding in use, e.g.:
<?xml version="1.0" encoding="UTF-8"?>
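For example, with JSON the nesting is expressed by the serializer itself, so no hand-rolled delimiters or escaping are needed at all. A minimal sketch with System.Text.Json (Json.NET would work just as well on older frameworks); the Payload type is just an illustrative shape:

using System.Collections.Generic;
using System.Text.Json;

class Payload
{
    public string Name { get; set; } = "";
    public List<List<string>> Tiers { get; set; } = new();
}

static string Encode(Payload value) => JsonSerializer.Serialize(value);
static Payload? Decode(string json) => JsonSerializer.Deserialize<Payload>(json);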
There is very little danger that any system that stores and retrieves Unicode text will alter those specific characters.
The main characters that can be altered in a text transfer process are the end-of-line markers. For example, FTPing a file from a Unix system to a Windows system in text mode might replace LINE FEED characters with CARRIAGE RETURN + LINE FEED pairs.
After that, some systems may perform a canonical normalization of the text. Combining characters and characters with diacritics on them should not be used unless canonical normalization (either composing or decomposing) is taken into account. The Unicode character database contains information about which transformations are required under these normalization schemes.
That sums up the biggest things to watch out for, and none of them are a problem for the characters that you have listed.
Other transformations that might be made, but are less likely, are case changes and compatibility normalizations. To avoid these, just stay away from alphabetic letters or anything that looks like an alphabetic letter. Some symbols are also converted in a compatibility normalization, so you should check the properties in the Unicode Character Database just to be sure. But it is unlikely that any system will do a compatibility normalization without expressly indicating that it will do so.
In the Unicode Code Charts, canonical normalizations are indicated by "≡" and compatibility normalizations are indicated by "≈".
You could take the same approach as URL or HTML encoding and replace key chars with sequences of chars, e.g. & becomes &amp;.
Although this results in more chars, it could be pretty efficiently compressed due to the repetition of those sequences.
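The built-in percent-encoding helpers already do this kind of substitution, if you don't want to invent your own entity scheme:

using System;

string field = "a & b | c";
string encoded = Uri.EscapeDataString(field);     // "a%20%26%20b%20%7C%20c"
string decoded = Uri.UnescapeDataString(encoded); // back to "a & b | c"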
Well, Unicode is a standard, so as long as everybody involved (code, db, etc.) is using Unicode, you shouldn't have any problems.
There are rarer characters in the Unicode set. As far as I know, only the chars below 0x20 (space) have special meanings; anything above that should be preserved in an NVARCHAR column.
It is never going to be totally safe unless you have a good specification what characters can and cannot be part of your data.
Remember some of the laws of Murphy:
"Anything that can go wrong will."
"Anything that can't go wrong, will
anyway."
Those characters that definitely will not be used may eventually be used, and when they are, the application will definitely fail.
You can use any character you like as a delimiter, as long as you escape the values so that the character is guaranteed not to appear in them. I wrote an example a while back showing that you could even use a common character like "a" as the delimiter.
Escaping the values of course means that some characters will be represented as two characters, but usually that is still less overhead than using a multiple-character delimiter. More importantly, it's completely safe.
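A minimal sketch of that escaping idea, using '|' as the delimiter and '\' as the escape character (both choices are arbitrary):

using System.Collections.Generic;
using System.Linq;
using System.Text;

static class DelimitedCodec
{
    const char Delimiter = '|';
    const char Escape = '\\';

    // Prefix the delimiter and the escape char themselves with the escape char.
    static string EscapeValue(string value)
    {
        var sb = new StringBuilder(value.Length);
        foreach (char c in value)
        {
            if (c == Delimiter || c == Escape)
                sb.Append(Escape);
            sb.Append(c);
        }
        return sb.ToString();
    }

    public static string Join(IEnumerable<string> values) =>
        string.Join(Delimiter.ToString(), values.Select(EscapeValue));

    public static List<string> Split(string encoded)
    {
        var result = new List<string>();
        var current = new StringBuilder();
        for (int i = 0; i < encoded.Length; i++)
        {
            char c = encoded[i];
            if (c == Escape && i + 1 < encoded.Length)
                current.Append(encoded[++i]);   // take the next character literally
            else if (c == Delimiter)
            {
                result.Add(current.ToString());
                current.Clear();
            }
            else
                current.Append(c);
        }
        result.Add(current.ToString());
        return result;
    }
}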
I have a question about the windows invariant culture.
Succinctly, my question is:
does there exist any pair of characters c1 and c2 such that:
lower(c1, invariant) = lower(c2, invariant) under the Latin1-General case-insensitive collation,
but
lower(c1, invariant) != lower(c2, invariant) under the invariant culture?
Background:
I need to store an invariant lower case string (representing a file name) inside of SQL Server Compact, which does not support windows invariant collations.
Ideally I would like to do this without having to pull all of my comparison logic out of the database and into my app.
The idea I had for solving this was to store 2 versions of all file names: one that is used for displaying data to the customer, and another that is used for performing comparisons. The comparison column would be converted to lower case using the windows invariant locale before storing it in the database.
However, I don't really have any idea what kind of mappings the invariant culture does, other than the fact that it's what Windows uses for comparing file names.
I'm wondering if it is possible to get false positives (or false negatives) as a result of this scheme.
That is, can I produce characters (previously lower-cased using the invariant culture) that compare equal under the Latin1-General case-insensitive SQL Server collation, but do not compare equal under the invariant culture?
If this can happen, then my app may consider 2 files that Windows thinks are different as being the same. This could ultimately lead to data loss.
NOTE:
I am aware that it is possible to have case sensitive files on Windows. I don't need to support those scenarios, however.
By looking through the answers to this question, which I asked a while back:
win32-file-name-comparison
I found an indirect link to the following page:
http://msdn.microsoft.com/en-us/library/ms973919.aspx
It suggests using an ordinal comparison after an invariant upper case as the best way to mimic what the file system does.
So I think if I use a "case sensitive, accent sensitive" collation in the database, and do an "upper" using the invariant locale before storing the file names, I should be OK.
Does anyone know if there are any problems with that?
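A minimal sketch of that scheme, assuming a second column holds the comparison key (the helper names are illustrative):

using System;

static string MakeComparisonKey(string fileName) =>
    // Invariant upper-casing followed by an ordinal comparison mirrors what the
    // linked article recommends for mimicking the file system's name comparison.
    fileName.ToUpperInvariant();

static bool SameFile(string a, string b) =>
    string.Equals(MakeComparisonKey(a), MakeComparisonKey(b), StringComparison.Ordinal);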
Why don't you convert filenames to ASCII? In your situation, can filenames contain non-ASCII characters?
Why not URL-encode the UTF-8 byte representation of the filename to get an ASCII version which can be converted back to Unicode easily, without any possible loss?
"However, I don't really have any idea what kind of mappings the invariant culture does, other than the fact that its what windows uses for comparing file names."
I didn't think Windows used the invariant culture when comparing file names. For example if my culture is English then I can name two separate files turkish and TURKİSH, but if someone's culture is Turkish then I hope Windows won't let them do that.