When comparing "Île" and "Ile", C# does not consider these to be the same.
string.Equals("Île", "Ile", StringComparison.InvariantCultureIgnoreCase)
For all other accented characters I have come across the comparison works fine.
Is there another comparison function I should use?
You are specifying to compare the strings using the Invariant culture's comparison rules. Evidently, in the invariant culture, the two strings are not considered equal.
You can compare them in a culture-specific manner using String.Compare and providing the culture for which you want to compare the strings:
if(String.Compare("Île", "Ile", new CultureInfo("fr-FR"), CompareOptions.None)==0)
Please note that in the French culture those strings are also considered different. I included the example to show that it is the culture that defines the sort rules. You might be able to find a culture that fits your requirements, or build a custom one with the needed comparison rules, but that is probably not what you want.
For a good example of normalizing the string so there are no accents, have a look at this question. After normalizing the string, you would be able to compare them and consider them equal. This would probably be the easiest way to implement your requirement.
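The normalization approach mentioned above could look roughly like the sketch below. The class and method names (AccentFolding, RemoveDiacritics, EqualsIgnoringAccents, EqualsIgnoringAccentsViaCompareInfo) are made up for illustration; the calls they rely on (string.Normalize, CharUnicodeInfo.GetUnicodeCategory, CompareInfo.Compare with CompareOptions.IgnoreNonSpace) are standard .NET APIs. Non-ASCII characters are written as escapes so the snippet does not depend on the source file encoding.

```csharp
using System;
using System.Globalization;
using System.Text;

public static class AccentFolding
{
    // Decompose each character (FormD), drop the combining marks,
    // then recompose what is left (FormC).
    public static string RemoveDiacritics(string text)
    {
        string decomposed = text.Normalize(NormalizationForm.FormD);
        var builder = new StringBuilder(decomposed.Length);
        foreach (char c in decomposed)
        {
            if (CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
            {
                builder.Append(c);
            }
        }
        return builder.ToString().Normalize(NormalizationForm.FormC);
    }

    // Accent- and case-insensitive equality built on the folding above.
    public static bool EqualsIgnoringAccents(string a, string b)
    {
        return string.Equals(RemoveDiacritics(a), RemoveDiacritics(b),
                             StringComparison.OrdinalIgnoreCase);
    }

    // Alternative that leaves the strings untouched: ask the comparer
    // to ignore non-spacing marks (accents) and case.
    public static bool EqualsIgnoringAccentsViaCompareInfo(string a, string b)
    {
        return CultureInfo.InvariantCulture.CompareInfo.Compare(
            a, b, CompareOptions.IgnoreNonSpace | CompareOptions.IgnoreCase) == 0;
    }
}
```

AccentFolding.EqualsIgnoringAccents("\u00CEle", "Ile") returns true, because "\u00CEle" ("Île") folds to "Ile". Keep in mind that folding throws information away, so it is best applied only to the comparison key and not to the stored value.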
Edit
It is not just the I character that has this behaviour in the InvariantCulture, this statement also returns false:
String.Equals("Ilê", "Ile", StringComparison.InvariantCultureIgnoreCase)
The framework does the right thing: those characters are in fact different (they have different meanings) in most cultures, and therefore they should not be considered the same.
Related
I have a List like this
List<string> items = new List<string>();
items.Add("-");
items.Add(".");
items.Add("a-");
items.Add("a.");
items.Add("a-a");
items.Add("a.a");
items.Sort();
string output = string.Empty;
foreach (string s in items)
{
output += s + Environment.NewLine;
}
MessageBox.Show(output);
The output is coming back as
-
.
a-
a.
a.a
a-a
whereas I am expecting the results as
-
.
a-
a.
a-a
a.a
Any idea why "a-a" is not coming before "a.a", whereas "a-" comes before "a."?
I suspect that in the last case "-" is treated in a different way due to culture-specific settings (perhaps as a "dash" as opposed to "minus" in the first strings). MSDN warns about this:
The comparison uses the current culture to obtain culture-specific
information such as casing rules and the alphabetic order of
individual characters. For example, a culture could specify that
certain combinations of characters be treated as a single character,
or uppercase and lowercase characters be compared in a particular way,
or that the sorting order of a character depends on the characters
that precede or follow it.
Also see in this MSDN page:
The .NET Framework uses three distinct ways of sorting: word sort,
string sort, and ordinal sort. Word sort performs a culture-sensitive
comparison of strings. Certain nonalphanumeric characters might have
special weights assigned to them; for example, the hyphen ("-") might
have a very small weight assigned to it so that "coop" and "co-op"
appear next to each other in a sorted list. String sort is similar to
word sort, except that there are no special cases; therefore, all
nonalphanumeric symbols come before all alphanumeric characters.
Ordinal sort compares strings based on the Unicode values of each
element of the string.
So, hyphen gets a special treatment in the default sort mode in order to make the word sort more "natural".
You can get "normal" ordinal sort if you specifically turn it on:
Console.WriteLine(string.Compare("a.", "a-")); //1
Console.WriteLine(string.Compare("a.a", "a-a")); //-1
Console.WriteLine(string.Compare("a.", "a-", StringComparison.Ordinal)); //1
Console.WriteLine(string.Compare("a.a", "a-a", StringComparison.Ordinal)); //1
To sort the original collection using ordinal comparison use:
items.Sort(StringComparer.Ordinal);
If you want your string sort to be based on the numeric value of each character (its UTF-16 code unit) as opposed to the rules defined by the current culture, you can sort by Ordinal:
items.Sort(StringComparer.Ordinal);
This will make the results consistent across all cultures (but it will produce unintuitive sortings of "14" coming before "9" which may or may not be what you're looking for).
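As a small sketch of what ordinal sorting does to the list from the question (the class and method names here are only illustrative):

```csharp
using System;
using System.Collections.Generic;

public static class OrdinalSortDemo
{
    // Returns a copy of the input sorted by UTF-16 code unit values,
    // independent of the current culture.
    public static List<string> SortOrdinal(IEnumerable<string> source)
    {
        var items = new List<string>(source);
        items.Sort(StringComparer.Ordinal);
        return items;
    }
}
```

For the six strings in the question this yields "-", ".", "a-", "a-a", "a.", "a.a". Note that this is still not the order listed as expected in the question: '-' (U+002D) is smaller than '.' (U+002E), so "a-a" sorts before "a." as soon as the second character is compared.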
The parameterless Sort method of the List<> class relies on the default string comparer of the .NET Framework, which compares strings using the CultureInfo of the current thread.
The CultureInfo specifies the alphabetical order of characters, and it seems that the default one uses a different order from what you would expect.
When sorting you can specify a particular CultureInfo, one that you know will match your sorting requirements. A sample (German culture):
var sortCulture = new CultureInfo("de-DE");
items.Sort(StringComparer.Create(sortCulture, false)); // Sort takes a comparer, not a CultureInfo
More info can be found here:
http://msdn.microsoft.com/en-us/library/b0zbh7b6.aspx
http://msdn.microsoft.com/de-de/library/system.stringcomparer.aspx
In a WinForms application that may also be used in non-US-English environments, I have a String.Equals(strA, strB) call that is failing because I need a case-insensitive comparison, but by default the comparison is case-sensitive. Which of the following do you recommend to fix this?
StringComparison.CurrentCultureIgnoreCase?
StringComparison.Ordinal?
StringComparison.OrdinalIgnoreCase?
Any better suggestions?
Thanks.
Use CurrentCultureIgnoreCase. An Ordinal comparison does not respect the alphabetical order used by the culture.
But of course it depends on what you're trying to accomplish. If you want to do something that ignores the culture of the user, certainly there are other possibilities, including using the InvariantCulture.
Addition: Even if you are not sorting/ordering but only checking for "equal" versus "not equal", there may be a difference between OrdinalIgnoreCase and CurrentCultureIgnoreCase. For example, to an ordinal comparison, "istanbul" and "Istanbul" are equal, up to case. However, with a Turkish culture, they might not be equivalent, because the capital version of 'i' is not 'I' but 'İ'. So the city would be "İstanbul".
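A hedged sketch of the "Turkish i" effect described above. The helper names are hypothetical, and the results assume that culture data for tr-TR is available on the machine (it is not when the runtime is configured for invariant globalization):

```csharp
using System;
using System.Globalization;

public static class TurkishIDemo
{
    // Upper-cases text using the casing rules of the named culture.
    public static string ToUpperIn(string text, string cultureName)
    {
        return text.ToUpper(CultureInfo.GetCultureInfo(cultureName));
    }

    // Case-insensitive equality under the rules of the named culture.
    public static bool EqualsIgnoreCaseIn(string a, string b, string cultureName)
    {
        CultureInfo culture = CultureInfo.GetCultureInfo(cultureName);
        return string.Compare(a, b, culture, CompareOptions.IgnoreCase) == 0;
    }
}
```

With Turkish casing rules, TurkishIDemo.ToUpperIn("istanbul", "tr-TR") produces a string starting with 'İ' (U+0130) instead of 'I', and EqualsIgnoreCaseIn("istanbul", "Istanbul", "tr-TR") is expected to be false, while string.Equals("istanbul", "Istanbul", StringComparison.OrdinalIgnoreCase) is true on every machine.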
Which would be better code:
int index = fileName.LastIndexOf(".", StringComparison.InvariantCultureIgnoreCase);
or
int index = fileName.LastIndexOf(".", StringComparison.OrdinalIgnoreCase);
Neither code is always better. They do different things, so they are good at different things.
InvariantCultureIgnoreCase uses comparison rules based on English, but without any regional variations. This is good for a neutral comparison that still takes into account some linguistic aspects.
OrdinalIgnoreCase compares the character codes without cultural aspects. This is good for exact comparisons, like login names, but not for sorting strings with unusual characters like é or ö. This is also faster because there are no extra rules to apply before comparing.
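One way to see the difference is with two spellings of the same letter: "é" as a single code point, and "e" followed by a combining acute accent. The sketch below uses illustrative names only and assumes the runtime has normal (non-invariant) globalization support:

```csharp
using System;

public static class EquivalenceDemo
{
    public const string Composed = "\u00E9";     // é as one code point
    public const string Decomposed = "e\u0301";  // e + combining acute accent

    // Ordinal compares code units, so the two spellings differ.
    public static bool OrdinalEqual()
    {
        return string.Equals(Composed, Decomposed, StringComparison.Ordinal);
    }

    // A linguistic comparison treats canonically equivalent sequences alike.
    public static bool InvariantEqual()
    {
        return string.Equals(Composed, Decomposed, StringComparison.InvariantCulture);
    }
}
```

OrdinalEqual() returns false, whereas InvariantEqual() should report the two strings as equal. That is the kind of "linguistic aspect" an ordinal comparison deliberately ignores.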
FxCop typically prefers OrdinalIgnoreCase. But your requirements may vary.
For English there is very little difference. It is when you wander into languages that have different written language constructs that this becomes an issue. I am not experienced enough to give you more than that.
OrdinalIgnoreCase
The StringComparer returned by the
OrdinalIgnoreCase property treats
the characters in the strings to
compare as if they were converted
to uppercase using the conventions
of the invariant culture, and then
performs a simple byte comparison
that is independent of language.
This is most appropriate when
comparing strings that are generated
programmatically or when comparing
case-insensitive resources such as
paths and filenames.
http://msdn.microsoft.com/en-us/library/system.stringcomparer.ordinalignorecase.aspx
InvariantCultureIgnoreCase
The StringComparer returned by the
InvariantCultureIgnoreCase property
compares strings in a linguistically
relevant manner that ignores case, but
it is not suitable for display in any
particular culture. Its major
application is to order strings in a
way that will be identical across
cultures.
http://msdn.microsoft.com/en-us/library/system.stringcomparer.invariantcultureignorecase.aspx
The invariant culture is the
CultureInfo object returned by the
InvariantCulture property.
The InvariantCultureIgnoreCase
property actually returns an instance
of an anonymous class derived from the
StringComparer class.
If you really want to match only the dot, then StringComparison.Ordinal would be fastest, as there is no case difference.
"Ordinal" doesn't use culture and/or casing rules, which are not applicable anyway to a symbol like '.'.
You seem to be doing file name comparisons, so I would just add that OrdinalIgnoreCase is closest to what NTFS does (it's not exactly the same, but it's closer than InvariantCultureIgnoreCase).
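Putting the last few answers together, a file-name helper might look like the following sketch. ExtensionLookup and its methods are hypothetical names; Path.GetExtension and the StringComparison overloads are the standard APIs.

```csharp
using System;
using System.IO;

public static class ExtensionLookup
{
    // A '.' has no case and no cultural meaning, so Ordinal is sufficient.
    public static int LastDotIndex(string fileName)
    {
        return fileName.LastIndexOf(".", StringComparison.Ordinal);
    }

    // Compares the extension the way file systems usually do:
    // case-insensitively and without linguistic rules.
    public static bool HasExtension(string fileName, string extension)
    {
        return string.Equals(Path.GetExtension(fileName), extension,
                             StringComparison.OrdinalIgnoreCase);
    }
}
```

For example, ExtensionLookup.LastDotIndex("archive.tar.gz") returns 11, and ExtensionLookup.HasExtension("REPORT.TXT", ".txt") returns true.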
If I execute the following statement:
string.Compare("mun", "mün", true, CultureInfo.InvariantCulture)
The result is '-1', indicating that 'mun' has a lower numeric value than 'mün'.
However, if I execute this statement:
string.Compare("Muntelier, Schweiz", "München, Deutschland", true, CultureInfo.InvariantCulture)
I get '1', indicating that 'Muntelier, Schweiz' should go last.
Is this a bug in the comparison? Or, more likely, is there a rule I should be taking into account when sorting strings containing accented characters?
The reason this is an issue is, I'm sorting a list and then doing a manual binary filter that's meant to get every string beginning with 'xxx'.
Previously I was using the Linq 'Where' method, but now I have to use this custom function written by another person, because he says it performs better.
But the custom function doesn't seem to take into account whatever 'unicode' rules .NET has. So if I tell it to filter by 'mün', it doesn't find any items, even though there are items in the list beginning with 'mun'.
This seems to be because of the inconsistent ordering of accented characters, depending on what characters go after the accented character.
OK, I think I've fixed the problem.
Before the filter, I do a sort based on the first n letters of each string, where n is the length of the search string.
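An alternative that avoids the mismatch altogether is to make the sort and the binary filter use the same, purely ordinal, rules. The sketch below is not the custom function mentioned above (which is not shown); OrdinalPrefixSearch and its methods are hypothetical names. It is case- and accent-sensitive, so "Mün" will not match "Mun"; if that is wanted, the keys would have to be folded first.

```csharp
using System;
using System.Collections.Generic;

public static class OrdinalPrefixSearch
{
    // Sort with the same ordinal rules that the search uses.
    public static List<string> SortOrdinal(IEnumerable<string> source)
    {
        var items = new List<string>(source);
        items.Sort(StringComparer.Ordinal);
        return items;
    }

    // Binary-search the first item >= prefix, then collect the run of
    // items that start with the prefix (they are contiguous in ordinal order).
    public static List<string> FindByPrefix(List<string> sortedOrdinal, string prefix)
    {
        int lo = 0, hi = sortedOrdinal.Count;
        while (lo < hi)
        {
            int mid = lo + (hi - lo) / 2;
            if (string.CompareOrdinal(sortedOrdinal[mid], prefix) < 0)
                lo = mid + 1;
            else
                hi = mid;
        }

        var result = new List<string>();
        for (int i = lo; i < sortedOrdinal.Count; i++)
        {
            if (!sortedOrdinal[i].StartsWith(prefix, StringComparison.Ordinal))
                break;
            result.Add(sortedOrdinal[i]);
        }
        return result;
    }
}
```

Because every string that begins with the prefix sorts at or after the prefix itself, and before anything that differs from it earlier, the matches always form one contiguous block under ordinal ordering. The same is not guaranteed under a culture-sensitive sort, which is what caused the original problem.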
There is a tie-breaking algorithm at work, see http://unicode.org/reports/tr10/
To address the complexities of
language-sensitive sorting, a
multilevel comparison algorithm is
employed. In comparing two words, for
example, the most important feature is
the base character: such as the
difference between an A and a B.
Accent differences are typically
ignored, if there are any differences
in the base letters. Case differences
(uppercase versus lowercase), are
typically ignored, if there are any
differences in the base or accents.
Punctuation is variable. In some
situations a punctuation character is
treated like a base character. In
other situations, it should be ignored
if there are any base, accent, or case
differences. There may also be a
final, tie-breaking level, whereby if
there are no other differences at all
in the string, the (normalized) code
point order is used.
So, "Munt..." and "Münc..." are alphabetically different and sort based on the "t" and "c".
Whereas "mun" and "mün" are alphabetically the same ("u" is equivalent to "ü" in most languages), so the accent difference is used as the tie-breaker.
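The levels can be probed with CompareOptions.IgnoreNonSpace, which tells the comparer to skip the accent level. This is only a sketch with hypothetical names, and it assumes the runtime has normal (non-invariant) globalization support:

```csharp
using System;
using System.Globalization;

public static class CollationLevels
{
    private static readonly CompareInfo Invariant = CultureInfo.InvariantCulture.CompareInfo;

    // Case-insensitive comparison; accents still count as a tie-breaker.
    public static int Default(string x, string y)
    {
        return Math.Sign(Invariant.Compare(x, y, CompareOptions.IgnoreCase));
    }

    // Same comparison, but with the accent level switched off.
    public static int IgnoringAccents(string x, string y)
    {
        return Math.Sign(Invariant.Compare(
            x, y, CompareOptions.IgnoreCase | CompareOptions.IgnoreNonSpace));
    }
}
```

Default("mun", "m\u00FCn") is negative, matching the -1 from the question, while IgnoringAccents("mun", "m\u00FCn") is 0: once accents are ignored, nothing distinguishes the two strings. For "Muntelier, Schweiz" versus "M\u00FCnchen, Deutschland" both methods return a positive value, because the base letters "t" and "c" already decide the order.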
It looks like the accented character is only being used in a sort of "tie-break" situation - in other words, if the strings are otherwise equal.
Here's some sample code to demonstrate:
using System;
using System.Globalization;

class Test
{
    static void Main()
    {
        Compare("mun", "mün");
        Compare("muna", "münb");
        Compare("munb", "müna");
    }

    // Case-insensitive comparison under the invariant culture.
    static void Compare(string x, string y)
    {
        int result = string.Compare(x, y, true, CultureInfo.InvariantCulture);
        Console.WriteLine("{0}; {1}; {2}", x, y, result);
    }
}
(I've tried adding a space after the "n" as well, to see if it was done on word boundaries - it isn't.)
Results:
mun; mün; -1
muna; münb; -1
munb; müna; 1
I suspect this is correct by various complicated Unicode rules - but I don't know enough about them.
As for whether you need to take this into account... I wouldn't expect so. What are you doing that is thrown by this?
As I understand this it is still somewhat consistent. When comparing using CultureInfo.InvariantCulture the umlaut character ü is treated like the non-accented character u.
As the strings in your first example obviously are not equal the result will not be 0 but -1 (which seems to be a default value). In the second example Muntelier goes last because t follows c in the alphabet.
I couldn't find any clear documentation in MSDN explaining these rules, but I found that
string.Compare("mun", "mün", CultureInfo.InvariantCulture,
CompareOptions.StringSort);
and
string.Compare("Muntelier, Schweiz", "München, Deutschland",
CultureInfo.InvariantCulture, CompareOptions.StringSort);
gives the desired result.
Anyway, I think you'd be better off to base your sorting on a specific culture such as the current user's culture (if possible).
If I want to have a case-insensitive string-keyed dictionary, which version of StringComparer should I use given these constraints:
The keys in the dictionary come from either C# code or config files written in an English locale only (either US or UK)
The software is internationalized and will run in different locales
I normally use StringComparer.InvariantCultureIgnoreCase but wasn't sure if that is the correct case. Here is example code:
Dictionary<string, object> stuff = new Dictionary<string, object>(StringComparer.InvariantCultureIgnoreCase);
There are three kinds of comparers:
Culture-aware
Culture invariant
Ordinal
Each comparer has a case-sensitive as well as a case-insensitive version.
An ordinal comparer uses ordinal values of characters. This is the fastest comparer, it should be used for internal purposes.
A culture-aware comparer considers aspects that are specific to the culture of the current thread. It knows the "Turkish i", "Spanish LL", etc. problems. It should be used for UI strings.
The culture-invariant comparer applies linguistic rules that do not belong to any real culture, so its results can be surprising, and it is rarely the right choice.
References
New Recommendations for Using Strings in Microsoft .NET 2.0
This MSDN article covers everything you could possibly want to know in great depth, including the Turkish-I problem.
It's been a while since I read it, so I'm off to do so again. See you in an hour!
The concept of "case insensitive" is a linguistic one, and so it doesn't make sense without a culture.
See this blog for more information.
That said if you are just talking about strings using the latin alphabet then you will probably get away with the InvariantCulture.
It is probably best to create the dictionary with StringComparer.CurrentCulture, though. This will allow "ß" to match "ss" in your dictionary under a German culture, for example.
Since the keys are your known fixed values, then either InvariantCultureIgnoreCase or OrdinalIgnoreCase should work. Avoid the culture-specific one, or you could hit some of the more "fun" things like the "Turkish i" problem. Obviously, you'd use a cultured comparer if you were comparing cultured values... but it sounds like you aren't.
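For keys that come from code or config files, a minimal sketch with the ordinal comparer could look like this (SettingsLookup is a hypothetical name, not part of the question's code):

```csharp
using System;
using System.Collections.Generic;

public static class SettingsLookup
{
    // Keys are matched case-insensitively, with no culture involved,
    // so lookups behave the same on every machine.
    public static Dictionary<string, object> Create()
    {
        return new Dictionary<string, object>(StringComparer.OrdinalIgnoreCase);
    }
}
```

With a dictionary created this way, storing a value under "Timeout" and reading it back with "TIMEOUT" or "timeout" hits the same entry, regardless of the current culture of the thread.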
StringComparer.OrdinalIgnoreCase is slightly faster than InvariantCultureIgnoreCase, FWIW ("An ordinal comparison is fast, but culture-insensitive" according to MSDN).
You'd have to be doing a lot of comparisons to notice the difference of course.
The Invariant Culture exists specifically to deal with strings that are internal to the program and have nothing to do with user data or UI. It sounds like this is the case for this situation.
System.Collections.Specialized includes StringDictionary. The Remarks section of its MSDN documentation states: "A key cannot be null, but a value can.
The key is handled in a case-insensitive manner; it is translated to lowercase before it is used with the string dictionary.
In .NET Framework version 1.0, this class uses culture-sensitive string comparisons. However, in .NET Framework version 1.1 and later, this class uses CultureInfo.InvariantCulture when comparing strings. For more information about how culture affects comparisons and sorting, see Comparing and Sorting Data for a Specific Culture and Performing Culture-Insensitive String Operations."
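A short sketch of the lowercasing behaviour described in those remarks (the demo class and its method names are illustrative only):

```csharp
using System;
using System.Collections.Generic;
using System.Collections.Specialized;

public static class StringDictionaryDemo
{
    public static StringDictionary Build()
    {
        var dictionary = new StringDictionary();
        dictionary.Add("Content-Type", "text/plain");
        return dictionary;
    }

    // Copies the keys as they are actually stored inside the dictionary.
    public static List<string> StoredKeys(StringDictionary dictionary)
    {
        var keys = new List<string>();
        foreach (string key in dictionary.Keys)
        {
            keys.Add(key);
        }
        return keys;
    }
}
```

Looking up "CONTENT-TYPE" in the result of Build() succeeds, but the only stored key is "content-type". The original casing of the key is lost, which is one reason to prefer a Dictionary<string, string> with an explicit StringComparer when the casing matters.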