Which Version of StringComparer to use - c#

If I want to have a case-insensitive string-keyed dictionary, which version of StringComparer should I use given these constraints:
The keys in the dictionary come from either C# code or config files written in english locale only (either US, or UK)
The software is internationalized and will run in different locales
I normally use StringComparer.InvariantCultureIgnoreCase but wasn't sure if that is the correct choice. Here is example code:
Dictionary<string, object> stuff = new Dictionary<string, object>(StringComparer.InvariantCultureIgnoreCase);

There are three kinds of comparers:
Culture-aware
Culture invariant
Ordinal
Each comparer has a case-sensitive as well as a case-insensitive version.
An ordinal comparer uses the ordinal (numeric) values of characters. This is the fastest comparer, and it should be used for internal purposes.
A culture-aware comparer considers aspects that are specific to the culture of the current thread. It handles issues such as the "Turkish i" and the "Spanish LL". It should be used for UI strings.
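The culture-invariant comparer applies the linguistic rules of the invariant culture, which is loosely based on English but tied to no region. Its results match no particular user's culture and it is slower than an ordinal comparison, so it is rarely the right choice: prefer ordinal for internal strings and culture-aware for UI strings.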
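As a rough illustration of the three kinds and their case-insensitive variants, the sketch below lists the six built-in StringComparer properties and applies each to the same pair of strings. The class and helper names are made up for this example. Only the ordinal results are fixed; the culture-based results depend on the current culture and on the runtime's culture data.

```csharp
using System;
using System.Collections.Generic;

public static class ComparerKindsDemo
{
    // Hypothetical helper: the six built-in comparers, paired with display names.
    public static List<KeyValuePair<string, StringComparer>> AllComparers()
    {
        return new List<KeyValuePair<string, StringComparer>>
        {
            new KeyValuePair<string, StringComparer>("CurrentCulture", StringComparer.CurrentCulture),
            new KeyValuePair<string, StringComparer>("CurrentCultureIgnoreCase", StringComparer.CurrentCultureIgnoreCase),
            new KeyValuePair<string, StringComparer>("InvariantCulture", StringComparer.InvariantCulture),
            new KeyValuePair<string, StringComparer>("InvariantCultureIgnoreCase", StringComparer.InvariantCultureIgnoreCase),
            new KeyValuePair<string, StringComparer>("Ordinal", StringComparer.Ordinal),
            new KeyValuePair<string, StringComparer>("OrdinalIgnoreCase", StringComparer.OrdinalIgnoreCase),
        };
    }

    public static void Main()
    {
        foreach (var entry in AllComparers())
        {
            // The culture-based lines can differ between machines and cultures.
            Console.WriteLine("{0,-28} \"file\" == \"FILE\": {1}",
                entry.Key, entry.Value.Equals("file", "FILE"));
        }
    }
}
```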
References
New Recommendations for Using Strings in Microsoft .NET 2.0

This MSDN article covers everything you could possibly want to know in great depth, including the Turkish-I problem.
It's been a while since I read it, so I'm off to do so again. See you in an hour!

The concept of "case insensitive" is a linguistic one, and so it doesn't make sense without a culture.
See this blog for more information.
That said, if you are just talking about strings using the Latin alphabet, then you will probably get away with the InvariantCulture.
It is probably best to create the dictionary with StringComparer.CurrentCulture, though. This will allow "ß" to match "ss" in your dictionary under a German culture, for example.

Since the keys are your known fixed values, then either InvariantCultureIgnoreCase or OrdinalIgnoreCase should work. Avoid the culture-specific one, or you could hit some of the more "fun" things like the "Turkish i" problem. Obviously, you'd use a cultured comparer if you were comparing cultured values... but it sounds like you aren't.
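A minimal sketch of that suggestion, assuming the keys are fixed identifiers such as setting names. The class name, helper name and keys are invented for the example.

```csharp
using System;
using System.Collections.Generic;

public static class SettingsLookupDemo
{
    // Hypothetical helper: keys ignore case and do not depend on the thread's culture.
    public static Dictionary<string, object> CreateSettings()
    {
        var settings = new Dictionary<string, object>(StringComparer.OrdinalIgnoreCase);
        settings["Timeout"] = 30;
        settings["title"] = "Example";
        return settings;
    }

    public static void Main()
    {
        Dictionary<string, object> settings = CreateSettings();
        Console.WriteLine(settings.ContainsKey("TIMEOUT")); // True
        Console.WriteLine(settings["TITLE"]);               // Example

        // Assigning through a differently cased key updates the existing entry.
        settings["TIMEOUT"] = 60;
        Console.WriteLine(settings.Count);                  // 2
    }
}
```

Because the comparison is ordinal, "title" and "TITLE" match even when the process runs under a Turkish culture.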

StringComparer.OrdinalIgnoreCase is slightly faster than InvariantCultureIgnoreCase, FWIW ("An ordinal comparison is fast, but culture-insensitive" according to MSDN).
You'd have to be doing a lot of comparisons to notice the difference of course.

The Invariant Culture exists specifically to deal with strings that are internal to the program and have nothing to do with user data or UI. It sounds like this is the case for this situation.

System.Collections.Specialized includes StringDictionary. The Remarks section of the MSDN states: "A key cannot be null, but a value can.
The key is handled in a case-insensitive manner; it is translated to lowercase before it is used with the string dictionary.
In .NET Framework version 1.0, this class uses culture-sensitive string comparisons. However, in .NET Framework version 1.1 and later, this class uses CultureInfo.InvariantCulture when comparing strings. For more information about how culture affects comparisons and sorting, see Comparing and Sorting Data for a Specific Culture and Performing Culture-Insensitive String Operations."

Related

Default String Sort Order

Is the default sort order an implementation detail? Or how is the default comparer selected?
It reminds me of the advice "Don't store hash codes in the database".
Is the following code guaranteed to sort the strings in the same order?
string[] randomStrings = { "Hello", "There", "World", "The", "Secrete", "To", "Life", };
randomStrings.ToList().Sort();
Strings are always sorted in alphabetical order.
The default (string.CompareTo()) uses the Unicode comparison rules of the current culture:
public int CompareTo(String strB) {
    if (strB==null) {
        return 1;
    }
    return CultureInfo.CurrentCulture.CompareInfo.Compare(this, strB, 0);
}
This overload of List<T>.Sort uses the default comparer for strings, which is documented like this:
This method performs a word (case-sensitive and culture-sensitive)
comparison using the current culture. For more information about word,
string, and ordinal sorts, see System.Globalization.CompareOptions.
If you do not specify a comparer, then Sort will use the default comparer, which sorts alphabetically according to the current culture. So to answer your question: yes, for a given culture that code will always return the strings in the same order.
There is an overload to the sort method that allows you to specify your own comparer if you wish to sort the data in a different order.
I wanted to post a reply about cultural issues when sorting strings, but Jon already covered it :-(. Yes, I think you should take that issue into account too: because alphabetical order is selected by default, strings in a language other than English (e.g. Spanish) may be placed after the English ones even though they start with the same letters. That means you need the System.Globalization namespace to deal with it.
By the way, Timothy, it is secret not secrete :D
Your code creates a new list copying from the array, sorts that list, and then discards it. It does not change the array at all.
Try:
Array.Sort(randomStrings);
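To make both points concrete, here is a small sketch: it shows that sorting the temporary list leaves the array untouched, and it passes StringComparer.Ordinal explicitly so that the resulting order does not depend on the current culture. The class name and the SortOrdinal helper are made up for this example; whether ordinal order is the order you want is a separate decision.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class SortDemo
{
    // Hypothetical helper: returns a sorted copy using ordinal (culture-independent) order.
    public static string[] SortOrdinal(IEnumerable<string> values)
    {
        string[] copy = values.ToArray();
        Array.Sort(copy, StringComparer.Ordinal);
        return copy;
    }

    public static void Main()
    {
        string[] randomStrings = { "Hello", "There", "World", "The", "Secrete", "To", "Life" };

        // Sorts a temporary list; the array itself keeps its original order.
        randomStrings.ToList().Sort();
        Console.WriteLine(string.Join(", ", randomStrings));
        // Hello, There, World, The, Secrete, To, Life

        Console.WriteLine(string.Join(", ", SortOrdinal(randomStrings)));
        // Hello, Life, Secrete, The, There, To, World
    }
}
```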

Is StringComparison.Ordinal or IgnoreCase more accurate?

In a WinForms application that may be used in non-US-English environments too, I have a String.Equals(strA, strB) call that is failing because I need a case-insensitive comparison, but by default the comparison is case-sensitive. What do you recommend as the better fix?
StringComparison.CurrentCultureIgnoreCase?
StringComparison.Ordinal?
StringComparison.OrdinalIgnoreCase?
Any better suggestions?
Thanks.
Use CurrentCultureIgnoreCase. An Ordinal comparison does not respect the alphabetical order used by the culture.
But of course it depends on what you're trying to accomplish. If you want to do something that ignores the culture of the user, certainly there are other possibilities, including using the InvariantCulture.
Addition: Even if you are not sorting/ordering but only checking for "equal" versus "not equal", there may be a difference between OrdinalIgnoreCase and CurrentCultureIgnoreCase. For example, to an ordinal comparison, "istanbul" and "Istanbul" are equal, up to case. However, with a Turkish culture, they might not be equivalent, because the capital version of 'i' is not 'I' but 'İ'. So the city would be "İstanbul".
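The difference can be sketched as follows. The class and helper names are made up for the example, and it assumes the runtime has Turkish culture data available (it will not in invariant-globalization mode). Only the ordinal result is fixed; the tr-TR lines are left unannotated because they depend on that data.

```csharp
using System;
using System.Globalization;

public static class TurkishIDemo
{
    // Culture-independent, case-insensitive equality.
    public static bool EqualsOrdinalIgnoreCase(string a, string b)
    {
        return string.Equals(a, b, StringComparison.OrdinalIgnoreCase);
    }

    // Case-insensitive equality under the rules of a specific culture.
    public static bool EqualsIgnoreCase(string a, string b, CultureInfo culture)
    {
        return string.Compare(a, b, culture, CompareOptions.IgnoreCase) == 0;
    }

    public static void Main()
    {
        // Assumption: tr-TR culture data is installed on this machine.
        var turkish = new CultureInfo("tr-TR");

        Console.WriteLine(EqualsOrdinalIgnoreCase("istanbul", "Istanbul")); // True

        // Under Turkish rules 'i' and 'I' belong to different letter pairs,
        // so this comparison may report the strings as different.
        Console.WriteLine(EqualsIgnoreCase("istanbul", "Istanbul", turkish));

        // Upper-casing with the Turkish culture is expected to produce a dotted capital İ.
        Console.WriteLine("istanbul".ToUpper(turkish));
    }
}
```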

MySQL's utf8_general_ci in C#

Is there an easy way to replicate the behavior of MySQL's utf8_general_ci collation in C#?
In particular, given a Unicode string, I want to generate a(n ASCII?) string that can then be trivially sorted or compared, as utf8_general_ci would.
I found this question, which shows how to strip accents from strings, which looks like a similar but not quite equivalent function, e.g., it doesn't decompose ß into ss.
For my purposes, that may end up being good enough, but if there's a way to replicate its behavior completely I'd prefer that.
Take a look at the SortKey class and its KeyData property.
For given collation (CultureInfo in .NET terms) and a string, you can get a naturally sortable byte array using MyCultureInfo.CompareInfo.GetSortKey(mystring).KeyData
I do not think the collating keys for .NET and for MySQL will necessarily match; however, both use the same technique, based on the Unicode Collation Algorithm.
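A minimal sketch of the sort-key approach. The class name, the KeyFor helper and the choice of CompareOptions (ignore case and non-spacing marks, under the invariant culture) are assumptions made for this example; they only approximate the MySQL collation and will not reproduce its keys.

```csharp
using System;
using System.Globalization;

public static class SortKeyDemo
{
    // Assumption: ignoring case and accents is "close enough" to the MySQL collation.
    private const CompareOptions Options =
        CompareOptions.IgnoreCase | CompareOptions.IgnoreNonSpace;

    // Hypothetical helper: builds a sort key under the invariant culture.
    public static SortKey KeyFor(string text)
    {
        return CultureInfo.InvariantCulture.CompareInfo.GetSortKey(text, Options);
    }

    public static void Main()
    {
        SortKey accented = KeyFor("R\u00E9sum\u00E9"); // "Résumé"
        SortKey plain = KeyFor("resume");

        // KeyData is a byte array that can be stored and compared byte by byte.
        Console.WriteLine(BitConverter.ToString(accented.KeyData));
        Console.WriteLine(BitConverter.ToString(plain.KeyData));

        // SortKey.Compare returns a negative number, zero, or a positive number.
        Console.WriteLine(SortKey.Compare(accented, plain));
    }
}
```

The key bytes themselves differ between runtimes and operating systems, so they are only comparable with other keys produced in the same environment.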

Problem comparing French character Î

When comparing "Île" and "Ile", C# does not consider these to be the same.
string.Equals("Île", "Ile", StringComparison.InvariantCultureIgnoreCase)
For all other accented characters I have come across the comparison works fine.
Is there another comparison function I should use?
You are specifying to compare the strings using the Invariant culture's comparison rules. Evidently, in the invariant culture, the two strings are not considered equal.
You can compare them in a culture-specific manner using String.Compare and providing the culture for which you want to compare the strings:
if(String.Compare("Île", "Ile", new CultureInfo("fr-FR"), CompareOptions.None)==0)
Please note that in the French culture, those strings are also considered different. I included the example to show that it is the culture that defines the sort rules. You might be able to find a culture that fits your requirements, or build a custom one with the needed compare rules, but that is probably not what you want.
For a good example of normalizing the string so there are no accents, have a look at this question. After normalizing the string, you would be able to compare them and consider them equal. This would probably be the easiest way to implement your requirement.
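One common way to do that normalization is sketched below: decompose the string (Unicode form D), drop the non-spacing marks, then compare what is left. This is not necessarily the code from the linked question; the class and method names are made up for the example.

```csharp
using System;
using System.Globalization;
using System.Text;

public static class AccentInsensitiveDemo
{
    // Hypothetical helper: removes combining accents, e.g. "Île" becomes "Ile".
    public static string RemoveDiacritics(string text)
    {
        // Form D splits an accented letter into a base letter plus combining marks.
        string decomposed = text.Normalize(NormalizationForm.FormD);
        var builder = new StringBuilder(decomposed.Length);
        foreach (char c in decomposed)
        {
            if (CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
            {
                builder.Append(c);
            }
        }
        return builder.ToString().Normalize(NormalizationForm.FormC);
    }

    public static bool EqualsIgnoringAccentsAndCase(string a, string b)
    {
        return string.Equals(RemoveDiacritics(a), RemoveDiacritics(b),
            StringComparison.OrdinalIgnoreCase);
    }

    public static void Main()
    {
        string accented = "\u00CEle"; // "Île"
        Console.WriteLine(RemoveDiacritics(accented));                    // Ile
        Console.WriteLine(EqualsIgnoringAccentsAndCase(accented, "ile")); // True
    }
}
```

Letters that are not a base letter plus an accent (ß or ø, for example) are left unchanged by this approach.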
Edit
It is not just the I character that has this behaviour in the InvariantCulture, this statement also returns false:
String.Equals("Ilê", "Ile", StringComparison.InvariantCultureIgnoreCase)
The framework does the right thing: those characters are in fact different (they have different meanings) in most cultures, and therefore they should not be considered the same.

String comparison: InvariantCultureIgnoreCase vs OrdinalIgnoreCase? [duplicate]

Which would be better code:
int index = fileName.LastIndexOf(".", StringComparison.InvariantCultureIgnoreCase);
or
int index = fileName.LastIndexOf(".", StringComparison.OrdinalIgnoreCase);
Neither code is always better. They do different things, so they are good at different things.
InvariantCultureIgnoreCase uses comparison rules based on English, but without any regional variations. This is good for a neutral comparison that still takes into account some linguistic aspects.
OrdinalIgnoreCase compares the character codes without cultural aspects. This is good for exact comparisons, like login names, but not for sorting strings with unusual characters like é or ö. This is also faster because there are no extra rules to apply before comparing.
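One place the two visibly differ is with text that looks identical but is encoded differently. In the sketch below (names are made up for the example) "é" is written once as a single code point and once as "e" plus a combining accent. The ordinal comparison sees different character codes; a linguistic comparison usually treats the two as the same text, although that result depends on the runtime's culture data, so it is left unannotated.

```csharp
using System;

public static class OrdinalVsInvariantDemo
{
    public const string Precomposed = "caf\u00E9"; // é as a single code point
    public const string Decomposed = "cafe\u0301"; // e followed by a combining acute accent

    public static bool OrdinalEquals(string a, string b)
    {
        return string.Equals(a, b, StringComparison.OrdinalIgnoreCase);
    }

    public static bool InvariantEquals(string a, string b)
    {
        return string.Equals(a, b, StringComparison.InvariantCultureIgnoreCase);
    }

    public static void Main()
    {
        // Different character codes, so the ordinal comparison reports a mismatch.
        Console.WriteLine(OrdinalEquals(Precomposed, Decomposed)); // False

        // Linguistic comparison; typically reports a match when full culture data is available.
        Console.WriteLine(InvariantEquals(Precomposed, Decomposed));
    }
}
```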
FxCop typically prefers OrdinalIgnoreCase. But your requirements may vary.
For English there is very little difference. It is when you wander into languages that have different written language constructs that this becomes an issue. I am not experienced enough to give you more than that.
OrdinalIgnoreCase
The StringComparer returned by the OrdinalIgnoreCase property treats the characters in the strings to compare as if they were converted to uppercase using the conventions of the invariant culture, and then performs a simple byte comparison that is independent of language. This is most appropriate when comparing strings that are generated programmatically or when comparing case-insensitive resources such as paths and filenames.
http://msdn.microsoft.com/en-us/library/system.stringcomparer.ordinalignorecase.aspx
InvariantCultureIgnoreCase
The StringComparer returned by the InvariantCultureIgnoreCase property compares strings in a linguistically relevant manner that ignores case, but it is not suitable for display in any particular culture. Its major application is to order strings in a way that will be identical across cultures.
http://msdn.microsoft.com/en-us/library/system.stringcomparer.invariantcultureignorecase.aspx
The invariant culture is the CultureInfo object returned by the InvariantCulture property.
The InvariantCultureIgnoreCase property actually returns an instance of an anonymous class derived from the StringComparer class.
If you really want to match only the dot, then StringComparison.Ordinal would be fastest, as there is no case-difference.
"Ordinal" doesn't apply culture or casing rules, which are not applicable anyway to a symbol like a dot (.).
You seem to be doing file name comparisons, so I would just add that OrdinalIgnoreCase is closest to what NTFS does (it's not exactly the same, but it's closer than InvariantCultureIgnoreCase)
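Putting the last two suggestions together, a sketch might look like this: an ordinal search for the dot, and an ordinal case-insensitive check of the extension. The class and helper names are made up for the example, and Path.GetExtension is used here as a convenience that the question itself does not mention.

```csharp
using System;
using System.IO;

public static class FileExtensionDemo
{
    // Ordinal search: the dot has no case, so no culture or casing rules are needed.
    public static int LastDotIndex(string fileName)
    {
        return fileName.LastIndexOf(".", StringComparison.Ordinal);
    }

    // Hypothetical helper: compares the extension without regard to case or culture.
    public static bool HasExtension(string fileName, string extension)
    {
        return string.Equals(Path.GetExtension(fileName), extension,
            StringComparison.OrdinalIgnoreCase);
    }

    public static void Main()
    {
        Console.WriteLine(LastDotIndex("archive.tar.gz"));     // 11
        Console.WriteLine(LastDotIndex("README"));             // -1
        Console.WriteLine(HasExtension("REPORT.PDF", ".pdf")); // True
    }
}
```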
