Windows Invariant Culture Puzzle - C#

I have a question about the windows invariant culture.
Succinctly, my question is:
does there exist any pair of characters c1 and c2 such that:
lower(c1, invariant) =latin-general lower(c2, invariant)
but
lower(c1, invariant) !=invariant lower(c2, invariant)?
Background:
I need to store an invariant lower-case string (representing a file name) inside SQL Server Compact, which does not support the Windows invariant collations.
Ideally I would like to do this without having to pull all of my comparison logic out of the database and into my app.
The idea I had for solving this was to store 2 versions of all file names: one that is used for displaying data to the customer, and another that is used for performing comparisons. The comparison column would be converted to lower case using the windows invariant locale before storing it in the database.
However, I don't really have any idea what kind of mappings the invariant culture does, other than the fact that it's what Windows uses for comparing file names.
I'm wondering if it is possible to get false positives (or false negatives) as a result of this scheme.
That is, can I produce characters (previously lower cased using the invariant culture) that compare equal to each other using the latin-general-1 case insensitive SQL server collation, but do not compare equal to each other under the invariant culture?
If this can happen, then my app may consider 2 files that Windows thinks are different as being the same. This could ultimately lead to data loss.
NOTE:
I am aware that it is possible to have case sensitive files on Windows. I don't need to support those scenarios, however.

By looking through the answers to this question:
win32-file-name-comparison
which I asked a while back,
I found an indirect link to the following page:
http://msdn.microsoft.com/en-us/library/ms973919.aspx
It suggests using an ordinal comparison after an invariant upper case as the best way to mimic what the file system does.
So I think if I use a "case sensitive, accent sensitive" collation in the database, and do an "upper" using the invariant locale before storing the file names, I should be OK.
Does anyone know if there are any problems with that?
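A minimal sketch of that scheme (names are illustrative; it assumes .NET's ToUpperInvariant is close enough to the casing table the file system uses, which is what the linked MSDN article suggests):

```csharp
using System;

// Sketch: store the invariant-uppercased name in the comparison column,
// then compare the stored keys ordinally (no collation involved).
string ToComparisonKey(string fileName) => fileName.ToUpperInvariant();

string a = ToComparisonKey("Résumé.TXT");
string b = ToComparisonKey("résumé.txt");

// Ordinal comparison of the uppercased forms mimics case-insensitive
// file-name matching without depending on a database collation.
bool sameFile = string.Equals(a, b, StringComparison.Ordinal);
Console.WriteLine(sameFile); // True
```

In the database this corresponds to a binary (case-sensitive, accent-sensitive) collation on the comparison column.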

Why don't you convert the filenames to ASCII? In your situation, can filenames contain non-ASCII characters?

Why not URL-encode the utf8 byte representation of the filename to get an ascii version which can be converted back to unicode easily without any possible loss?
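For example (a sketch of that percent-encoding idea; Uri.EscapeDataString is one way to do it):

```csharp
using System;

// Percent-encode the UTF-8 bytes of the name to get a pure-ASCII form
// that round-trips back to the original Unicode string without loss.
string name = "résumé.txt";
string ascii = Uri.EscapeDataString(name);        // pure ASCII, e.g. "r%C3%A9sum%C3%A9.txt"
string roundTripped = Uri.UnescapeDataString(ascii);
Console.WriteLine(roundTripped == name); // True
```

Note that the encoded form no longer compares case-insensitively the way the file system does, which matters for the scenario in the question.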

"However, I don't really have any idea what kind of mappings the invariant culture does, other than the fact that it's what Windows uses for comparing file names."
I didn't think Windows used the invariant culture when comparing file names. For example if my culture is English then I can name two separate files turkish and TURKİSH, but if someone's culture is Turkish then I hope Windows won't let them do that.
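The Turkish "I" mentioned here is the classic example of culture-dependent casing (a sketch; the exact results assume standard tr-TR culture data from the OS/ICU):

```csharp
using System;
using System.Globalization;

// Under tr-TR, "I" lower-cases to dotless ı (U+0131);
// under the invariant culture it lower-cases to plain "i".
var turkish = CultureInfo.GetCultureInfo("tr-TR");
Console.WriteLine("I".ToLower(turkish) == "\u0131"); // True
Console.WriteLine("I".ToLowerInvariant() == "i");    // True
```

This is exactly why TURKİSH and turkish can collide or not depending on which casing table is used for the comparison.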


Why is .NET "de-CH" culture number group separator different locally and on Azure?

I am seeing a different Unicode character as the number group separator for the "de-CH" culture when running on a local desktop and in Azure.
When the following code is run on my desktop in .NET Core 3.1 or .NET Framework 4.7.2 it outputs 2019 which looks like an apostrophe but is not the same.
When run in Azure, for instance in https://try.dot.net or (slightly modified) in an Azure function running on .NET Core 3.1 (on a Windows based App Service) it results in 0027, a standard ASCII apostrophe.
using System;
using System.Linq;
using System.Globalization;

Console.WriteLine(((int)(CultureInfo
    .GetCultureInfo("de-CH")
    .NumberFormat
    .NumberGroupSeparator
    .Single()))        // just getting the single character as an int
    .ToString("X4")    // Unicode value of that character
);
The result of this is that trying to parse the string 4'200.000 (where the apostrophe there is Unicode 0027) on local desktop using "de-CH" culture fails, but it works in Azure.
Why the difference?
This Microsoft blog by Shawn Steele explains why you shouldn't rely on a specific culture setting being stable (Fully quoted because it is no longer online at MSDN):
https://web.archive.org/web/20190110065542/https://blogs.msdn.microsoft.com/shawnste/2005/04/05/culture-data-shouldnt-be-considered-stable-except-for-invariant/
CultureInfo and RegionInfo data represents a cultural, regional, admin
or user preference for cultural settings. Applications should NOT
make any assumptions that rely on this data being stable. The only
exception (this is a rule, so of course there's an exception) is for
CultureInfo.InvariantCulture. CultureInfo.InvariantCulture is
supposed to remain stable, even between versions.
There are many reasons that cultural data can change. With Whidbey
and Custom Cultures the list gets a little longer.
The most obvious reason is that there is a bug in the data and we had to make a change. (Believe it or not we make mistakes ;-)) In this case our users (and yours too) want culturally correct data, so we have to fix the bug even if it breaks existing applications.
Another reason is that cultural preferences can change. There're lots of ways this can happen, but it does happen:
Global awareness, cross cultural exchange, the changing role of computers and so forth can all effect a cultural preference.
International treaties, trade, etc. can change values. The adoption of the Euro changed many countries currency symbol to €.
National or regional regulations can impact these values too.
Preferred spelling of words can change over time.
Preferred date formats, etc can change.
Multiple preferences could exist for a culture. The preferred best choice can then change over time.
Users could have overridden some values, like date or time formats. These can be requested without user override, however we recommend that applications consider using user overrides.
Users or administrators could have created a replacement culture, replacing common default values for a culture with company specific, regional specific, or other variations of the standard data.
Some cultures may have preferences that vary depending on the setting. A business might have a more formal form than an Internet Café.
An enterprise may require a specific date format or time format for the entire organization.
Differing versions of the same custom culture, or one that's custom on one machine and a windows only culture on another machine.
So if you format a string with a particular date/time format, and then
try to Parse it later, parse might fail if the version changed, if the
machine changed, if the framework version changed (newer data), or if
a custom culture was changed. If you need to persist data in a
reliable format, choose a binary method, provide your own format or
use the InvariantCulture.
Even without changing data, remembering to use Invariant is still a
good idea. If you have different . and , syntax for something like
1,000.29, then Parsing can get confused if a client was expecting
1.000,29. I've seen this problem with applications that didn't realize that a user's culture would be different than the developer's
culture. Using Invariant or another technique solves this kind of
problem.
Of course you can't have both "correct" display for the current user
and perfect round tripping if the culture data changes. So generally
I'd recommend persisting data using InvariantCulture or another
immutable format, and always using the appropriate formatting APIs for
display. Your application will have its own requirements, so consider
them carefully.
Note that for collation (sort order/comparisons), even Invariant
behavior can change. You'll need to use the Sort Versioning to get
around that if you require consistently stable sort orders.
If you need to parse data automatically that is formatted to be user-friendly, there are two approaches:
Allow the user to explicitly specify the used format.
First remove every character except digits, minus sign and the decimal separator from the string before trying to parse this. Note that you need to know the correct decimal separator first. There is no way to guess this correctly and guessing wrong could result in major problems.
Wherever possible try to avoid parsing numbers that are formatted to be user-friendly. Instead whenever possible try to request numbers in a strictly defined (invariant) format.
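The persist-with-invariant advice from the quoted blog can be sketched like this:

```csharp
using System;
using System.Globalization;

// Persist with the invariant culture and parse back with the same culture,
// so the stored text never depends on culture data that may change.
// The "R" (round-trip) format preserves the exact double value.
double value = 4200.5;
string stored = value.ToString("R", CultureInfo.InvariantCulture); // "4200.5"
double parsed = double.Parse(stored, CultureInfo.InvariantCulture);
Console.WriteLine(parsed == value); // True
```

Formatting for display (with the user's culture) then stays a separate, presentation-only step.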

How to get all Cultures to avoid CultureNotFoundException

Guys I'm developing an application that will run across multiple machines. I've recently introduced Cultures in it to support all currencies.
I have 2 main development PCs and I move code between them. One is a Windows 8 laptop, while the other is a Windows 7 PC.
It seems that the list of SpecificCultures in these two machines is NOT the same. When the executable runs on Windows 8, a few more SpecificCultures are returned, and some existing ones are also renamed.
I used the following code to dump all specific cultures to a text file:
StringBuilder sb = new StringBuilder();
foreach (CultureInfo ci in CultureInfo.GetCultures(CultureTypes.SpecificCultures))
{
    sb.Append(ci.Name + "\t" + new RegionInfo(ci.LCID).CurrencyEnglishName);
    sb.AppendLine();
}
StreamWriter f = new StreamWriter(@"specificCulturesFound.txt");
f.Write(sb);
f.Close();
The SpecificCultures returned from my Windows 8 Laptop is this: http://pastebin.com/cznLRG62
The SpecificCultures returned from my Windows 7 PC is this: http://pastebin.com/MwMXwSdb
If you compare them in Notepad++ or something, you'll see differences.
For example, the et-EE Estonian Kroon entry is only available on my Windows 7 PC, while ku-Arab-IQ Iraqi Dinar is only available on the Windows 8 laptop.
Question is, how can I deal with this situation? Once the application is released, it will run on different machines with different .NET Framework versions.
Is there a way to maybe export all collected CultureInfo data with the application, so that can be used instead of getting it from the installed .NET framework ?
It's not an easy situation. In normal conditions your application shouldn't be aware of the differences between cultures; it should only be aware that differences exist.
This is usually done by saving everything in a neutral culture (say, the en-US locale) and then converting back to the specific locale of the user where the application runs. For example, if the user enters a list of date values, you save them by converting each date to its neutral representation and concatenating them with the neutral list separator (you can do this because you know both the user's current locale and the neutral one). Later, on another machine with a different locale, someone else will read that file; your application knows that those dates are in the neutral culture, so it can read them and present them to the user with the required formatting. Example:
A user with the it-IT locale enters two dates: 01/01/1977;25/12/2013.
The application stores them in a neutral en-US format: 01/01/1977,12/25/2013.
Another user with the ja-JP locale reads it; the application parses the neutral locale and shows it to the user as 1977/01/01・2013/12/25.
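That round trip can be sketched like this (the cultures and the fixed storage format are illustrative):

```csharp
using System;
using System.Globalization;

// Store dates in one fixed, agreed format; reformat for whoever reads them.
var enUS = CultureInfo.GetCultureInfo("en-US");
var date = new DateTime(2013, 12, 25);
string stored = date.ToString("MM/dd/yyyy", enUS); // "12/25/2013"

// On the reader's machine: parse the known neutral format...
var parsed = DateTime.ParseExact(stored, "MM/dd/yyyy", enUS);
// ...and display it using the reader's own culture.
Console.WriteLine(parsed.ToString("d", CultureInfo.GetCultureInfo("it-IT")));
```

The key point is that the parse step always uses the agreed culture/format, never the reader's current culture.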
Things become more complicated when the data isn't fixed, for example currencies. In this case you can't simply convert to a neutral value; let's imagine this (values are pretty random):
User enters a value in, let's say, 1000 YEN.
You convert YEN to DOLLARS (imagine 100 YEN = 1 DOLLAR): 10 DOLLARS and you save it.
Data will be read in another country in Europe where EURO is in use, application read 10 DOLLARS then it'll convert it to 10 EURO.
This is wrong, of course, because:
The YEN <-> DOLLARS exchange rate may have changed between the day you wrote the value and the day it is read in EURO.
Converting YEN <-> EURO is not equivalent to YEN <-> DOLLARS <-> EURO.
This problem is not easy to solve; you simply can't rely on "automatic" conversions via CultureInfo, you have to implement this on your own (in the way that best suits your requirements). For example, a (very naive, of course) implementation may store currencies like this:
struct CurrencyValue
{
    public decimal Value;
    public string Currency;
}
This may be made as complicated as needed (using a class hierarchy for currencies instead of a string, for example), but just keep in mind that this kind of conversion is absolutely not trivial.
To summarize
Don't handle these conversions yourself: just store everything using a neutral locale, and conversion (to and from) will happen on the user's machine (where you don't need to care which locale is in use).
If you have to manage special types like currencies, then you have to implement your own library to handle them in the proper way. What the "proper way" is depends on your full application requirements, so it can't be answered here.

Parsing doubles in an unknown locale in C#

I am writing a program that needs to parse a bunch of text files generated by some third-party software. Some of these files will be generated in France, where something like "1,5" means "one and a half". Other files will be generated in the US, where "1,5" is not a number, and "one and a half" is "1.5". Of course, "1,234.5" is a legitimate number in the US.
These are just examples; in reality, my program needs to deal with a variety of numbers in a variety of locales; it needs to handle things like "e-5" and "2e10", etc. Unfortunately, there's no way to know ahead of time which file comes from which locale.
Is there some commonly accepted solution to this problem in C# ? I realize that I can write my own number-parsing code, but I'd prefer to avoid it, unless there's no other way...
Since your entire input file has been generated from one locale, you could look at the problem as having to detect the specific locale from the input file prior to actually parsing it. It's an extra requirement that results from the inadequate input files (which should all use one agreed locale or have a field to specify the locale used).
Language detection is not a complete solution as number formatting is not language-specific but locale-specific. Here is an example: If you detect the language as Spanish, would that be es-ES (Spain) or es-MX (Mexico)? In the former case, the decimal separator is a comma (1,23). In the latter, the decimal separator is a period (1.23).
The solution would be heuristics-based. The simplest is probably that if you know what your locale generally is (e.g. most of your users use the period), you could have an ordered list of culture identifiers and try them one after the other until you've found one that can be used to interpret all the numbers in the file. It could be as simple as starting with en-US and, failing that, trying en-GB, since for numbers there really aren't many more formats.
This may be a slightly overdesigned solution, but it could work (in case your text files contain some text apart from numbers):
Detect the language of your text files using letter frequency. Google has open-sourced the code they use in Chrome to detect page language - http://code.google.com/p/chromium-compact-language-detector/. I think I saw a C# wrapper for this, but I can't find it now. If you don't want to use any library, it is not so difficult to implement it on your own. I have done some very simple testing of this algorithm and it seems that it is possible to detect a language from only about 15-20 letters.
Build a regular expression based on the rules for the detected language (or just parse it). This can be a very complex problem, considering that there are many rules for decimal separators, number grouping, negative signs etc. But it is not impossible to implement.
As you see from the comments your problem has no fail safe solution.
The best you can do is minimize the error:
Since each file (hopefully) contains several numbers, all from the same locale, try parsing the numbers in the file with all the expected distinct locales (i.e. don't parse with both en-US and en-AU, for instance, as the number format in both locales is the same.)
After parsing you'll end up with either of:
A single matching locale.
Multiple locales.
In the second case test whether the results from all locales match (most/all locales parse integers without thousand separators and scientific notation the same way.)
If they match no problem, else try to employ heuristics to figure out the correct locale:
Are the values in the expected range.
If there is any other text in the file, you can do a word search in language dictionaries to try and figure out the language.
If everything fails discard the file and mark it for manual processing.
Your program should have a facility that allows marking files as being of a specific culture bypassing the heuristics.
Your best choice is to change the input format so that the file locale is specified somewhere, such as in the data, the name of the file or an accompanying metadata file.
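A sketch of the multi-locale attempt described in the answers above (the candidate culture names are examples; note that .NET is lenient about group-separator positions, so "1,5" also parses under en-US - as 15 - which is why every token in the file must be checked):

```csharp
using System;
using System.Globalization;
using System.Linq;

// Keep only the candidate cultures that can parse every number token.
CultureInfo[] CulturesThatParseAll(string[] tokens, string[] cultureNames) =>
    cultureNames
        .Select(CultureInfo.GetCultureInfo)
        .Where(c => tokens.All(t => double.TryParse(
            t, NumberStyles.Float | NumberStyles.AllowThousands, c, out _)))
        .ToArray();

// "1.234,56" rules out en-US, where '.' is the decimal separator.
string[] tokens = { "1,5", "1.234,56", "2e10" };
var matches = CulturesThatParseAll(tokens, new[] { "en-US", "de-DE" });
Console.WriteLine(string.Join(",", matches.Select(c => c.Name))); // de-DE
```

If more than one culture survives, the heuristics from the answer above (expected value ranges, language detection on surrounding text) are still needed to break the tie.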

Allowing Simplified Chinese Input

The company I work for is bidding on a project that will require our eCommerce solution to accept simplified Chinese input. After doing a bit of research, it seems that ASP.net makes globalization configuration easy:
<configuration>
<system.web>
<globalization
fileEncoding="utf-8"
requestEncoding="utf-8"
responseEncoding="utf-8"
culture="zh-Hans"
uiCulture="en-us" />
</system.web>
</configuration>
Questions:
Is this really all there is to it in ASP.net? It seems too good to be true.
Are there any DB considerations with SQL Server 2005? Will the DB accept the simplified Chinese without additional configuration?
Ad 1. The real question is how far you want to go with internationalization, because i18n is not only about allowing Unicode input. You need to at least support local date, time and number formats and local collation (mostly related to sorting), and ensure that your application runs correctly on localized operating systems (unless you are developing a Cloud aka hosted solution). You might want to read more on the topic here.
As far as support for Chinese character input goes, if you are going to offer software in China, you need to at least support GB18030-2000. To do that, you need to use a .NET Framework version that supports Unicode 3.0; I believe it has been supported since .NET Framework 2.0.
However, if you want to go one step further (which might be required for gaining a competitive edge), you might want to support GB18030-2005. The only problem is that full support for these characters (CJK Unified Ideographs Extension B) came later in the process (I am not really sure whether it was Unicode 6.0 or 6.1). Therefore you might be forced to use the latest .NET Framework and still not be sure it covers everything.
You might want to read Unicode FAQ on Han characters.
Ad 2. I strongly advise you not to use SQL Server 2005 with Chinese characters. The reason is that the old SQL Server engine supports only UCS-2 rather than UTF-16. This might seem like a slight difference, but it really poses a problem with 4-byte Han ideographs: you won't be able to use them meaningfully in queries (i.e. LIKE or WHERE clauses) - you will receive all records. That's how it works. And to support them, you would need to set a very specific Chinese collation, which would simply break support for other languages.
Basically, using SQL Server 2005 with Chinese Ideographs is a bad idea.
First off, I wonder if you are sure that you picked the right culture identifier with zh-Hans, which is a neutral culture. Perhaps it would be more appropriate for you to target a specific culture, such as zh-CN (Chinese being used in China), if that is the market you are aiming to support.
Secondly, using the web.config file to set the culture is fine if you are planning a deployment that is exclusively targeting this culture. Often you'll want one same deployment to dynamically adapt to the end user's culture, in which case you would programmatically set the Thread.CurrentCulture (and even Thread.CurrentUICulture if you are providing localized resources) based for example on a URL scheme (e.g. www.myapp.com would use en-US and www.myapp.com/china would use zh-CN) or the accept-languages header or an in-app language selector.
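The per-request approach can be sketched like this (the culture name and the idea of deriving it from the URL or headers are illustrative):

```csharp
using System;
using System.Globalization;
using System.Threading;

// E.g. chosen at request time from the URL scheme or Accept-Language header.
var culture = CultureInfo.GetCultureInfo("zh-CN");
Thread.CurrentThread.CurrentCulture = culture;   // formatting: dates, numbers, currency
Thread.CurrentThread.CurrentUICulture = culture; // resource (translation) lookup
Console.WriteLine(Thread.CurrentThread.CurrentCulture.Name);
```

In classic ASP.NET this would typically run early in the request pipeline (e.g. in Application_BeginRequest or a base page class) rather than in application code.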
Other than the Unicode limitations that Paweł refers to (which mean that you may really need to use the latest .NET Framework/SQL Server), there isn't anything specific you should need to do for simplified Chinese -- if you follow standard internationalization guidelines you should be all set. Perhaps you should consider localizing (translating) your app into Chinese as part of this, by the way.
About SQL Server, Paweł's points seem pretty clear. That said, so long as you use nvarchar datatypes (Unicode) and you don't run queries on these columns or sort them based on these columns on the DB side, I'd be surprised if you had any issues on SQL Server 2005. So it really depends what you do with this data.

Globalization in C#

Can somebody explain to me what is the use of globalization in C#?
Is it used for conversion purposes? I mean I want to convert any English word into a selected language.
So will this globalization or cultureinfo help me?
Globalization is a means of formatting text for specific cultures. E.g. a string representation of the number 1000 may be 1,000.00 for the UK or 1 000,00 for France. It is quite an in-depth subject but that is the essential aim.
It is NOT a translation service, but it does allow you to determine the culture under which your application is running and therefore allow you to choose the language you want to display. You will have to provide text translation yourself, however, usually by means of resource files.
Globalization is a way of allowing the user to customize the application that he or she may be using to fit the standards where they may be. Customization allows for the:
Money Formatting
Time
Date
Text orientation
To be culturally appropriate. The region that is currently set is handled by the OS and passed to your application. Globalization/internationalization (i18n) also typically motivates the developer to separate the program's displayed text from the implementation itself.
From MSDN:
System.Globalization - contains classes that define culture-related information, including the language, the country/region, the calendars in use, the format patterns for dates, currency and numbers, and the sort order for strings.
This assembly helps in making your application culture-aware, and is used heavily internally within the .NET framework. For example, when converting from Date to String, Globalization is used to determine what format to use, such as "11/28/2009" or "28-11-2009". Generally this determination is done automatically within the framework without you ever using the assembly directly. However, if you need to, you can use Globalization directly to look up culture-specific information for your own use.
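For example, the date-to-string behavior described above, made explicit (a sketch; assumes standard en-US and de-DE culture data):

```csharp
using System;
using System.Globalization;

// The same date renders differently depending on the culture's format patterns.
var d = new DateTime(2009, 11, 28);
Console.WriteLine(d.ToString("d", CultureInfo.GetCultureInfo("en-US"))); // 11/28/2009
Console.WriteLine(d.ToString("d", CultureInfo.GetCultureInfo("de-DE"))); // 28.11.2009
```

Passing the culture explicitly like this is the "use Globalization directly" case; omitting it uses the thread's current culture automatically.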
To clear even more confusion
Localization (or Localisation for non-US people), L10n for short: the process of adapting a program for a specific location. It consists of translating resources, adapting the UI (if necessary), etc.
Internationalization, i18n for short: the process of adapting a program to support localization, regional characters, formats and so on and so forth, but most importantly, the process of allowing the program to work correctly regardless of current locale settings and OS language version.
Globalization, g11n for short: consists of both i18n and L10n.
To clear some confusion:
Globalisation: Allowing your program to use locale-specific resources loaded from an external resource DLL at runtime. This means putting all your strings in resource files rather than hard-coding them into the source code.
Localisation: Adapting your program for a specific locale. This could be translating Strings and making dialog boxes read right-to-left for languages such as Arabic.
Here is a link to creating satellite DLLs. It says C++, but the same principle applies to C#.
Globalization:-
Globalization is the process of designing and developing applications for multiple cultures and regions.
Localization:-
Localization is the process of customizing an application for a given culture and locale.
