Is StringComparison.Ordinal or an IgnoreCase option more accurate? - c#

In a WinForms application that may also be used in non-US-English environments, I have a String.Equals(strA, strB) call that is failing because I need a case-insensitive comparison, but by default the comparison is case-sensitive. Which of these would you recommend to fix it?
StringComparison.CurrentCultureIgnoreCase?
StringComparison.Ordinal?
StringComparison.OrdinalIgnoreCase?
Any better suggestions?
Thanks.

Use CurrentCultureIgnoreCase. An Ordinal comparison does not respect the alphabetical order used by the culture.
But of course it depends on what you're trying to accomplish. If you want to do something that ignores the culture of the user, certainly there are other possibilities, including using the InvariantCulture.
Addition: Even if you are not sorting/ordering but only checking for "equal" versus "not equal", there may be a difference between OrdinalIgnoreCase and CurrentCultureIgnoreCase. For example, to an ordinal comparison, "istanbul" and "Istanbul" are equal, up to case. However, with a Turkish culture, they might not be equivalent, because the capital version of 'i' is not 'I' but 'İ'. So the city would be "İstanbul".
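To see the difference for yourself, here is a minimal sketch. The class and method names (TurkishIDemo, OrdinalEquals, CultureEquals) are made up for illustration. It assumes the runtime has full globalization data and that the tr-TR culture is available; in invariant-globalization mode, creating that culture may fail or behave like the invariant culture. The culture-sensitive results are printed rather than assumed.

```csharp
using System;
using System.Globalization;

public static class TurkishIDemo
{
    // Ordinal, case-insensitive equality: no culture rules are consulted.
    public static bool OrdinalEquals(string a, string b) =>
        string.Equals(a, b, StringComparison.OrdinalIgnoreCase);

    // Case-insensitive equality under an explicitly supplied culture.
    public static bool CultureEquals(string a, string b, CultureInfo culture) =>
        string.Compare(a, b, culture, CompareOptions.IgnoreCase) == 0;

    public static void Main()
    {
        // Assumption: tr-TR is installed and usable on this machine.
        var turkish = new CultureInfo("tr-TR");

        Console.WriteLine("Ordinal:   " + OrdinalEquals("istanbul", "Istanbul"));
        Console.WriteLine("Turkish:   " + CultureEquals("istanbul", "Istanbul", turkish));
        Console.WriteLine("Invariant: " + CultureEquals("istanbul", "Istanbul", CultureInfo.InvariantCulture));
    }
}
```

Passing the culture explicitly, instead of relying on the current thread culture, makes it easier to reproduce a user's problem on a development machine that runs under a different locale.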

Related

Problem comparing French character Î

When comparing "Île" and "Ile", C# does not consider these to be the same.
string.Equals("Île", "Ile", StringComparison.InvariantCultureIgnoreCase)
For all other accented characters I have come across the comparison works fine.
Is there another comparison function I should use?
You are specifying to compare the strings using the Invariant culture's comparison rules. Evidently, in the invariant culture, the two strings are not considered equal.
You can compare them in a culture-specific manner using String.Compare and providing the culture for which you want to compare the strings:
if(String.Compare("Île", "Ile", new CultureInfo("fr-FR"), CompareOptions.None)==0)
Please note that in the French culture, those strings are also considered different. I included the example to show that it is the culture that defines the sort rules. You might be able to find a culture that fits your requirements, or build a custom one with the needed compare rules, but that is probably not what you want.
For a good example of normalizing the string so there are no accents, have a look at this question. After normalizing the string, you would be able to compare them and consider them equal. This would probably be the easiest way to implement your requirement.
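The normalization approach is commonly sketched along these lines: decompose the string (FormD), drop the non-spacing marks, and then compare. This is not necessarily the code from the linked question; the names AccentFolding, RemoveDiacritics and EqualsIgnoringAccentsAndCase are hypothetical. It assumes the runtime supports Unicode normalization, and it only handles accents that decompose into a base letter plus combining marks.

```csharp
using System;
using System.Globalization;
using System.Text;

public static class AccentFolding
{
    // Decompose each character (FormD), skip combining marks, then recompose (FormC).
    public static string RemoveDiacritics(string text)
    {
        string decomposed = text.Normalize(NormalizationForm.FormD);
        var builder = new StringBuilder(decomposed.Length);

        foreach (char c in decomposed)
        {
            if (CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
            {
                builder.Append(c);
            }
        }

        return builder.ToString().Normalize(NormalizationForm.FormC);
    }

    // Compare after stripping accents; ordinal ignore-case keeps the result culture-independent.
    public static bool EqualsIgnoringAccentsAndCase(string a, string b) =>
        string.Equals(RemoveDiacritics(a), RemoveDiacritics(b), StringComparison.OrdinalIgnoreCase);

    public static void Main()
    {
        // "\u00CEle" is "Île" written with an escape so the source file encoding does not matter.
        Console.WriteLine(RemoveDiacritics("\u00CEle"));
        Console.WriteLine(EqualsIgnoringAccentsAndCase("\u00CEle", "ile"));
    }
}
```

Keep in mind that stripping accents deliberately throws away information, so this is only appropriate when you really do want "Île" and "Ile" to be treated as the same key.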
Edit
It is not just the I character that has this behaviour in the InvariantCulture, this statement also returns false:
String.Equals("Ilê", "Ile", StringComparison.InvariantCultureIgnoreCase)
The framework does the right thing: those characters are in fact different (they have different meanings) in most cultures, and therefore they should not be considered the same.

String comparison: InvariantCultureIgnoreCase vs OrdinalIgnoreCase? [duplicate]

Which would be better code:
int index = fileName.LastIndexOf(".", StringComparison.InvariantCultureIgnoreCase);
or
int index = fileName.LastIndexOf(".", StringComparison.OrdinalIgnoreCase);
Neither code is always better. They do different things, so they are good at different things.
InvariantCultureIgnoreCase uses comparison rules based on English, but without any regional variations. This is good for a neutral comparison that still takes into account some linguistic aspects.
OrdinalIgnoreCase compares the character codes without cultural aspects. This is good for exact comparisons, like login names, but not for sorting strings with unusual characters like é or ö. This is also faster because there are no extra rules to apply before comparing.
FxCop typically prefers OrdinalIgnoreCase. But your requirements may vary.
For English there is very little difference. It is when you wander into languages that have different written language constructs that this becomes an issue. I am not experienced enough to give you more than that.
OrdinalIgnoreCase
The StringComparer returned by the OrdinalIgnoreCase property treats the characters in the strings to compare as if they were converted to uppercase using the conventions of the invariant culture, and then performs a simple byte comparison that is independent of language. This is most appropriate when comparing strings that are generated programmatically or when comparing case-insensitive resources such as paths and filenames.
http://msdn.microsoft.com/en-us/library/system.stringcomparer.ordinalignorecase.aspx
InvariantCultureIgnoreCase
The StringComparer returned by the InvariantCultureIgnoreCase property compares strings in a linguistically relevant manner that ignores case, but it is not suitable for display in any particular culture. Its major application is to order strings in a way that will be identical across cultures.
http://msdn.microsoft.com/en-us/library/system.stringcomparer.invariantcultureignorecase.aspx
The invariant culture is the CultureInfo object returned by the InvariantCulture property.
The InvariantCultureIgnoreCase property actually returns an instance of an anonymous class derived from the StringComparer class.
If you really want to match only the dot, then StringComparison.Ordinal would be fastest, as a dot has no upper or lower case.
"Ordinal" doesn't apply culture or casing rules, which are not applicable to a symbol like "." anyway.
You seem to be doing file name comparisons, so I would just add that OrdinalIgnoreCase is closest to what NTFS does (it's not exactly the same, but it's closer than InvariantCultureIgnoreCase)
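Putting those two points together, a small sketch might look like this. The helper names (FileNameDemo, LastDotIndex, HasExtension) are invented for the example: the dot is located with an ordinal search, and the extension is compared with OrdinalIgnoreCase.

```csharp
using System;

public static class FileNameDemo
{
    // A dot has no case, so a plain ordinal search is sufficient.
    public static int LastDotIndex(string fileName) =>
        fileName.LastIndexOf(".", StringComparison.Ordinal);

    // Extensions are usually treated case-insensitively on Windows file systems.
    public static bool HasExtension(string fileName, string extension) =>
        fileName.EndsWith(extension, StringComparison.OrdinalIgnoreCase);

    public static void Main()
    {
        Console.WriteLine(LastDotIndex("archive.tar.gz"));
        Console.WriteLine(HasExtension("REPORT.TXT", ".txt"));
    }
}
```

For real path handling, System.IO.Path.GetExtension is usually the more robust choice, since it also copes with directory separators.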

Why should I convert a string to upper case when comparing?

I constantly read about it being a good practise to convert a string to upper case (I think Hanselman mentioned this on his blog a long time ago), when that string is to be compared against another (which should also be converted to upper case).
What is the benefit of this? Why should I do this (or are there any cases when I shouldn't)?
Thanks
No, you should be using the StringComparison enum option that allows for case-insensitive comparison.
Make sure to use the overload of the comparison method that accepts it, e.g. String.Compare or String.Equals.
The reason that you should convert to upper case rather than lower case when doing a comparison (when it is not practically possible to do a case-insensitive comparison) is that some (not so commonly used) characters do not convert to lower case without losing information.
Some upper case characters don't have an equivalent lower case character, so making them lower case would convert them into a different lower case character. That could cause a false positive in the comparison.
A better way to do case-insensitive string comparison is:
bool ignoreCase = true;
bool stringsAreSame = (string.Compare(str1, str2, ignoreCase) == 0);
Also, see here:
Upper vs Lower Case
Strings should be normalized to uppercase. A small group of characters, when they are converted to lowercase, cannot make a round trip. To make a round trip means to convert the characters from one locale to another locale that represents character data differently, and then to accurately retrieve the original characters from the converted characters.
Reference:
https://learn.microsoft.com/en-us/visualstudio/code-quality/ca1308-normalize-strings-to-uppercase?view=vs-2015
This sounds like a cheap way to do case-insensitive comparisons. I would wonder if there isn't a function that would do that for you, without you having to explicitly tell it to go uppercase.
The .NET Framework is slightly faster at doing string comparisons between uppercase letters than string comparisons between lowercase letters.
As others have mentioned, some information might be lost when converting from uppercase to lowercase.
You may want to try using a StringComparer object to do case insensitive comparisons.
StringComparer comparer = StringComparer.OrdinalIgnoreCase;
bool isEqualV1 = comparer.Equals("stringA", "stringB");
bool isEqualV2 = (comparer.Compare("stringA", "stringB") == 0);
The Span<T> type (built into .NET Core 2.1 and later, and available to the .NET Framework through the System.Memory NuGet package) should assist in speeding up string comparisons in certain circumstances.
Depending on your use case, you may want to make use of the constructors for HashSet and Dictionary types which can take a StringComparer as an input parameter for the constructor.
I typically use a StringComparer as an input parameter to a method with a default of StringComparer.OrdinalIgnoreCase and I try to make use of other techniques (use of HashSets, Dictionaries or Spans) if speed is important.
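As a rough sketch of that pattern (ComparerDemo and ContainsName are hypothetical names): a StringComparer cannot be used as a literal default parameter value, because default values must be compile-time constants, so the default is supplied through an overload here.

```csharp
using System;
using System.Collections.Generic;

public static class ComparerDemo
{
    // Overload that supplies the default comparer.
    public static bool ContainsName(IEnumerable<string> names, string candidate) =>
        ContainsName(names, candidate, StringComparer.OrdinalIgnoreCase);

    // The HashSet uses the comparer for both hashing and equality,
    // so no upper/lower case conversion of the strings is needed.
    public static bool ContainsName(IEnumerable<string> names, string candidate, StringComparer comparer)
    {
        var set = new HashSet<string>(names, comparer);
        return set.Contains(candidate);
    }

    public static void Main()
    {
        var names = new[] { "Alice", "Bob" };
        Console.WriteLine(ContainsName(names, "ALICE"));
        Console.WriteLine(ContainsName(names, "ALICE", StringComparer.Ordinal));
    }
}
```

If the same set is queried many times, build the HashSet once and keep it around instead of rebuilding it on every call as this short example does.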
You do not have to convert a string to Upper case. Convert it to lower case 8-)

Why does string.Compare seem to handle accented characters inconsistently?

If I execute the following statement:
string.Compare("mun", "mün", true, CultureInfo.InvariantCulture)
The result is '-1', indicating that 'mun' has a lower numeric value than 'mün'.
However, if I execute this statement:
string.Compare("Muntelier, Schweiz", "München, Deutschland", true, CultureInfo.InvariantCulture)
I get '1', indicating that 'Muntelier, Schweiz' should go last.
Is this a bug in the comparison? Or, more likely, is there a rule I should be taking into account when sorting strings containing accented characters?
The reason this is an issue is, I'm sorting a list and then doing a manual binary filter that's meant to get every string beginning with 'xxx'.
Previously I was using the Linq 'Where' method, but now I have to use this custom function written by another person, because he says it performs better.
But the custom function doesn't seem to take into account whatever 'unicode' rules .NET has. So if I tell it to filter by 'mün', it doesn't find any items, even though there are items in the list beginning with 'mun'.
This seems to be because of the inconsistent ordering of accented characters, depending on what characters go after the accented character.
OK, I think I've fixed the problem.
Before the filter, I do a sort based on the first n letters of each string, where n is the length of the search string.
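A different way to get "every string beginning with xxx", which does not depend on sort order at all, is to ask the culture's CompareInfo whether the search text is a prefix. This is only a sketch of an alternative technique, not the custom binary filter mentioned above: it is a linear scan, the names PrefixFilter and StartingWith are made up, and whether "mun" matches "München" depends on the runtime having full globalization data so that CompareOptions.IgnoreNonSpace is honoured.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;

public static class PrefixFilter
{
    // Returns the items that start with the prefix, ignoring case and
    // (where the platform supports it) non-spacing accent marks.
    public static List<string> StartingWith(IEnumerable<string> items, string prefix)
    {
        CompareInfo compareInfo = CultureInfo.InvariantCulture.CompareInfo;
        const CompareOptions options = CompareOptions.IgnoreCase | CompareOptions.IgnoreNonSpace;

        return items.Where(item => compareInfo.IsPrefix(item, prefix, options)).ToList();
    }

    public static void Main()
    {
        var cities = new[] { "Muntelier, Schweiz", "München, Deutschland", "Berlin, Deutschland" };

        foreach (string city in StartingWith(cities, "mun"))
        {
            Console.WriteLine(city);
        }
    }
}
```

Because the same comparison options decide every match, the result does not change depending on which characters follow the accented letter.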
There is a tie-breaking algorithm at work, see http://unicode.org/reports/tr10/
To address the complexities of language-sensitive sorting, a multilevel comparison algorithm is employed. In comparing two words, for example, the most important feature is the base character: such as the difference between an A and a B. Accent differences are typically ignored, if there are any differences in the base letters. Case differences (uppercase versus lowercase), are typically ignored, if there are any differences in the base or accents. Punctuation is variable. In some situations a punctuation character is treated like a base character. In other situations, it should be ignored if there are any base, accent, or case differences. There may also be a final, tie-breaking level, whereby if there are no other differences at all in the string, the (normalized) code point order is used.
So, "Munt..." and "Münc..." are alphabetically different and sort based on the "t" and "c".
Whereas "mun" and "mün" are alphabetically the same ("u" is equivalent to "ü" in most languages), so the accent difference is used as the tie-breaker.
It looks like the accented character is only being used in a sort of "tie-break" situation - in other words, if the strings are otherwise equal.
Here's some sample code to demonstrate:
using System;
using System.Globalization;

class Test
{
    static void Main()
    {
        Compare("mun", "mün");
        Compare("muna", "münb");
        Compare("munb", "müna");
    }

    // Case-insensitive comparison using the invariant culture's rules.
    static void Compare(string x, string y)
    {
        int result = string.Compare(x, y, true,
                                    CultureInfo.InvariantCulture);
        Console.WriteLine("{0}; {1}; {2}", x, y, result);
    }
}
(I've tried adding a space after the "n" as well, to see if it was done on word boundaries - it isn't.)
Results:
mun; mün; -1
muna; münb; -1
munb; müna; 1
I suspect this is correct by various complicated Unicode rules - but I don't know enough about them.
As for whether you need to take this into account... I wouldn't expect so. What are you doing that is thrown by this?
As I understand this it is still somewhat consistent. When comparing using CultureInfo.InvariantCulture the umlaut character ü is treated like the non-accented character u.
As the strings in your first example obviously are not equal the result will not be 0 but -1 (which seems to be a default value). In the second example Muntelier goes last because t follows c in the alphabet.
I couldn't find any clear documentation in MSDN explaining these rules, but I found that
string.Compare("mun", "mün", CultureInfo.InvariantCulture,
CompareOptions.StringSort);
and
string.Compare("Muntelier, Schweiz", "München, Deutschland",
CultureInfo.InvariantCulture, CompareOptions.StringSort);
give the desired result.
Anyway, I think you'd be better off to base your sorting on a specific culture such as the current user's culture (if possible).

Which Version of StringComparer to use

If I want to have a case-insensitive string-keyed dictionary, which version of StringComparer should I use given these constraints:
The keys in the dictionary come from either C# code or config files written in an English locale only (either US or UK)
The software is internationalized and will run in different locales
I normally use StringComparer.InvariantCultureIgnoreCase but wasn't sure if that is the correct choice. Here is example code:
Dictionary<string, object> stuff = new Dictionary<string, object>(StringComparer.InvariantCultureIgnoreCase);
There are three kinds of comparers:
Culture-aware
Culture invariant
Ordinal
Each comparer has a case-sensitive as well as a case-insensitive version.
An ordinal comparer uses ordinal values of characters. This is the fastest comparer, it should be used for internal purposes.
A culture-aware comparer considers aspects that are specific to the culture of the current thread. It knows the "Turkish i", "Spanish LL", etc. problems. It should be used for UI strings.
The culture invariant comparer is actually not defined and can produce unpredictable results, and thus should never be used at all.
References
New Recommendations for Using Strings in Microsoft .NET 2.0
This MSDN article covers everything you could possibly want to know in great depth, including the Turkish-I problem.
It's been a while since I read it, so I'm off to do so again. See you in an hour!
The concept of "case insensitive" is a linguistic one, and so it doesn't make sense without a culture.
See this blog for more information.
That said if you are just talking about strings using the latin alphabet then you will probably get away with the InvariantCulture.
It is probably best to create the dictionary with StringComparer.CurrentCulture, though. This will allow "ß" to match "ss" in your dictionary under a German culture, for example.
Since the keys are your known fixed values, then either InvariantCultureIgnoreCase or OrdinalIgnoreCase should work. Avoid the culture-specific one, or you could hit some of the more "fun" things like the "Turkish i" problem. Obviously, you'd use a cultured comparer if you were comparing cultured values... but it sounds like you aren't.
StringComparer.OrdinalIgnoreCase is slightly faster than InvariantCultureIgnoreCase, FWIW ("An ordinal comparison is fast, but culture-insensitive" according to MSDN).
You'd have to be doing a lot of comparisons to notice the difference of course.
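For keys that are fixed, English-only identifiers, a sketch with the ordinal comparer could look like this. SettingsDemo, CreateSettings and the key names are invented for the example. Because no culture is consulted, lookups behave the same whatever locale the program runs under.

```csharp
using System;
using System.Collections.Generic;

public static class SettingsDemo
{
    // The comparer passed to the constructor controls both hashing and key equality.
    public static Dictionary<string, object> CreateSettings()
    {
        var stuff = new Dictionary<string, object>(StringComparer.OrdinalIgnoreCase);
        stuff["Timeout"] = 30;
        stuff["title"] = "Main window";
        return stuff;
    }

    public static void Main()
    {
        Dictionary<string, object> settings = CreateSettings();

        Console.WriteLine(settings["TIMEOUT"]);             // same entry as "Timeout"
        Console.WriteLine(settings.ContainsKey("TITLE"));   // matches "title" regardless of the current culture
    }
}
```

Swapping in StringComparer.InvariantCultureIgnoreCase only requires changing the constructor argument, so it is easy to try both and measure if performance matters.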
The Invariant Culture exists specifically to deal with strings that are internal to the program and have nothing to do with user data or UI. It sounds like this is the case for this situation.
System.Collections.Specialized includes StringDictionary. The Remarks section of the MSDN documentation states: "A key cannot be null, but a value can.
The key is handled in a case-insensitive manner; it is translated to lowercase before it is used with the string dictionary.
In .NET Framework version 1.0, this class uses culture-sensitive string comparisons. However, in .NET Framework version 1.1 and later, this class uses CultureInfo.InvariantCulture when comparing strings. For more information about how culture affects comparisons and sorting, see Comparing and Sorting Data for a Specific Culture and Performing Culture-Insensitive String Operations."
