I constantly read about it being a good practise to convert a string to upper case (I think Hanselman mentioned this on his blog a long time ago), when that string is to be compared against another (which should also be converted to upper case).
What is the benefit of this? Why should I do this (or are there any cases when I shouldn't)?
Thanks
no, you should be using the enum option that allows for case insenstive comparisson (string comparison).
Make sure to use that overload of the comparison method you are using i.e. String.Compare, String.Equals
The reason that you should convert to upper case rather than lower case when doing a comparison (and it's not practically possible to do a case insensetive comparison), is that some (not so commonly used) characters does not convert to lower case without losing information.
Some upper case characters doesn't have an equivalent lower case character, so making them lower case would convert them into a different lower case character. That could cause a false positive in the comparison.
A better way to do case-insensitive string comparison is:
bool ignoreCase = true;
bool stringsAreSame = (string.Compare(str1, str2, ignoreCase) == 0)
Also, see here:
Upper vs Lower Case
Strings should be normalized to uppercase. A small group of characters, when they are converted to lowercase, cannot make a round trip. To make a round trip means to convert the characters from one locale to another locale that represents character data differently, and then to accurately retrieve the original characters from the converted characters.
Reference:
https://learn.microsoft.com/en-us/visualstudio/code-quality/ca1308-normalize-strings-to-uppercase?view=vs-2015][1]
This sounds like a cheap way to do case-insensitive comparisons. I would wonder if there isn't a function that would do that for you, without you having to explicitly telling it to go uppercase.
The .Net framework is slightly faster at doing string comparisons between uppercase letters than string comparisons between lowercase letters.
As others have mentioned, some information might be lost when converting from uppercase to lowercase.
You may want to try using a StringComparer object to do case insensitive comparisons.
StringComparer comparer = StringComparer.OrdinalIgnoreCase;
bool isEqualV1 = comparer.Equals("stringA", "stringB");
bool isEqualV2 = (comparer.Compare("stringA", "stringB") == 0);
The .Net Framework as of of 4.7 has a Span type that should assist in speeding up string comparisons in certain circumstances.
Depending on your use case, you may want to make use of the constructors for HashSet and Dictionary types which can take a StringComparer as an input parameter for the constructor.
I typically use a StringComparer as an input parameter to a method with a default of StringComparer.OrdingalIgnoreCase and I try to make use of other techniques (use of HashSets, Dictionaries or Spans) if speed is important.
You do not have to convert a string to Upper case. Convert it to lower case 8-)
Related
Is it possible to convert a string to ordinal upper or lower case. Similar like invariant.
string upperInvariant = "ß".ToUpperInvariant();
string lowerInvariant = "ß".ToLowerInvariant();
bool invariant = upperInvariant == lowerInvariant; // true
string upperOrdinal = "ß".ToUpperOrdinal(); // SS
string lowerOrdinal = "ß".ToLowerOrdinal(); // ss
bool ordinal = upperOrdinal == lowerOrdinal; // false
How to implement ToUpperOrdinal and ToLowerOrdinal?
Edit:
How to to get the ordinal string representation? Likewise, how to get the invariant string representation? Maybe that's not possible as in the above case it might be ambiguous, at least for the ordinal representation.
Edit2:
string.Equals("ß", "ss", StringComparison.InvariantCultureIgnoreCase); // true
but
"ß".ToLowerInvariant() == "ss"; // false
I don't believe this functionality exists in the .NET Framework or .NET Core. The closest thing is string.Normalize(), but it is missing the case fold option that you need to successfully pull this off.
This functionality exists in the ICU project (which is available in C/Java). The functionality you are after is the unorm2.h file in C or the Normalizer2 class in Java. Example usage in Java and related test.
There are 2 implementations of Normalizer2 that I am aware of that have been ported to C#:
icu-dotnet (a C# wrapper library for ICU4C)
ICU4N (a fully managed port of ICU4J)
Full Disclosure: I am a maintainer of ICU4N.
From msdn:
TheStringComparer returned by the OrdinalIgnoreCase property treats the characters in the strings to compare as if they were converted to uppercase using the conventions of the invariant culture, and then performs a simple byte comparison that is independent of language.
But I'm guessing doing that won't achieve what you want, since simply doing "ß".ToUpperInvariant() won't give you a string that is ordinally equivallent to "ss". There must be some magic in the String.Equals method that handles the speciall case of Why “ss” equals 'ß'.
If you're only worried about German text then this answer might help.
Using left.ToUpper() == right.ToUpper() is not the best option to compare strings, at least because of Performance issues. I want to refactor (fully preserving behavior!) this code, to something efficient, but can't achieve full equivalency for the special case.
So, here is a simple test method:
[TestCase("Strasse", "Straße", "tr-TR")]
[TestCase("İ", "i", "tr-TR")]
public void UsingToUpper_AndCurrentCultureIgnoreCase_AreSame(string left, string right, string culture)
{
// Arrange, Act
Thread.CurrentThread.CurrentCulture = CultureInfo.GetCultureInfo(culture);
var toUpper = left.ToUpper() == right.ToUpper();
var stringComparison = left.Equals(right, StringComparison.CurrentCultureIgnoreCase);
// Assert
Assert.AreEqual(toUpper, stringComparison);
}
I tried two options,
StringComparison.CurrentCultureIgnoreCase and StringComparison.OrdinalIgnoreCase both of them fails (in different cases).
So, the question:
Is there a way to compare two strings, without changing case and fully preserve the behavior of ToUpper()?
You would have to write your own custom Comparison method I am afraid.
ToUpper is making use of the Unicode metadata. Every character
(Unicode code point) has a case as well as case mapping to upper- and
lowercase (and title case). .NET uses this information to convert a
string to upper- or lowercase. You can find the very same information
in the Unicode Character Database.
You can supply a culture to the ToUpper method, but this would not be your goal.
You can write your own customCulture like defined in this answer: Create custom culture in ASP.NET
However, there would not be any similar behaviour to the ToUpper method as mentioned before it uses Unicode metadata. You cannot force string Equals to use Unicode Characters.
In a WinForms application that may be used in non-US-English environmnets too, I have a String.Equals(strA, strB) method and it is failing because I needed to do a a case-INsensitive comparision but by defdault is comparision is case-sensitive. Now to fix this what do you recommend is better?
CurrentCultureIgnoreCase ?
StringComparision.Ordinal ?
StringComparision.OrdinalIgnoreCase ?
*ANY BETTER SUGGESTIONS?
Thanks.
Use CurrentCultureIgnoreCase. An Ordinal comparison does not respect the alphabetical order used by the culture.
But of course it depends on what you're trying to accomplish. If you want to do something that ignores the culture of the user, certainly there are other possibilities, including using the InvariantCulture.
Addition: Even if you are not sorting/ordering but only checking for "equal" versus "not equal", there may be a difference between OrdinalIgnoreCase and CurrentCultureIgnoreCase. For example, to an ordinal comparison, "istanbul" and "Istanbul" are equal, up to case. However, with a Turkish culture, they might not be equivalent, because the capital version of 'i' is not 'I' but 'İ'. So the city would be "İstanbul".
This question already has answers here:
Which is generally best to use — StringComparison.OrdinalIgnoreCase or StringComparison.InvariantCultureIgnoreCase?
(5 answers)
Closed 5 years ago.
Which would be better code:
int index = fileName.LastIndexOf(".", StringComparison.InvariantCultureIgnoreCase);
or
int index = fileName.LastIndexOf(".", StringComparison.OrdinalIgnoreCase);
Neither code is always better. They do different things, so they are good at different things.
InvariantCultureIgnoreCase uses comparison rules based on english, but without any regional variations. This is good for a neutral comparison that still takes into account some linguistic aspects.
OrdinalIgnoreCase compares the character codes without cultural aspects. This is good for exact comparisons, like login names, but not for sorting strings with unusual characters like é or ö. This is also faster because there are no extra rules to apply before comparing.
FXCop typically prefers OrdinalIgnoreCase. But your requirements may vary.
For English there is very little difference. It is when you wander into languages that have different written language constructs that this becomes an issue. I am not experienced enough to give you more than that.
OrdinalIgnoreCase
The StringComparer returned by the
OrdinalIgnoreCase property treats
the characters in the strings to
compare as if they were converted
to uppercase using the conventions
of the invariant culture, and then
performs a simple byte comparison
that is independent of language.
This is most appropriate when
comparing strings that are generated
programmatically or when comparing
case-insensitive resources such as
paths and filenames.
http://msdn.microsoft.com/en-us/library/system.stringcomparer.ordinalignorecase.aspx
InvariantCultureIgnoreCase
The StringComparer returned by the
InvariantCultureIgnoreCase property
compares strings in a linguistically
relevant manner that ignores case, but
it is not suitable for display in any
particular culture. Its major
application is to order strings in a
way that will be identical across
cultures.
http://msdn.microsoft.com/en-us/library/system.stringcomparer.invariantcultureignorecase.aspx
The invariant culture is the
CultureInfo object returned by the
InvariantCulture property.
The InvariantCultureIgnoreCase
property actually returns an instance
of an anonymous class derived from the
StringComparer class.
If you really want to match only the dot, then StringComparison.Ordinal would be fastest, as there is no case-difference.
"Ordinal" doesn't use culture and/or casing rules that are not applicable anyway on a symbol like a ..
You seem to be doing file name comparisons, so I would just add that OrdinalIgnoreCase is closest to what NTFS does (it's not exactly the same, but it's closer than InvariantCultureIgnoreCase)
If I execute the following statement:
string.Compare("mun", "mün", true, CultureInfo.InvariantCulture)
The result is '-1', indicating that 'mun' has a lower numeric value than 'mün'.
However, if I execute this statement:
string.Compare("Muntelier, Schweiz", "München, Deutschland", true, CultureInfo.InvariantCulture)
I get '1', indicating that 'Muntelier, Schewiz' should go last.
Is this a bug in the comparison? Or, more likely, is there a rule I should be taking into account when sorting strings containing accented
The reason this is an issue is, I'm sorting a list and then doing a manual binary filter that's meant to get every string beginning with 'xxx'.
Previously I was using the Linq 'Where' method, but now I have to use this custom function written by another person, because he says it performs better.
But the custom function doesn't seem to take into account whatever 'unicode' rules .NET has. So if I tell it to filter by 'mün', it doesn't find any items, even though there are items in the list beginning with 'mun'.
This seems to be because of the inconsistent ordering of accented characters, depending on what characters go after the accented character.
OK, I think I've fixed the problem.
Before the filter, I do a sort based on the first n letters of each string, where n is the length of the search string.
There is a tie-breaking algorithm at work, see http://unicode.org/reports/tr10/
To address the complexities of
language-sensitive sorting, a
multilevel comparison algorithm is
employed. In comparing two words, for
example, the most important feature is
the base character: such as the
difference between an A and a B.
Accent differences are typically
ignored, if there are any differences
in the base letters. Case differences
(uppercase versus lowercase), are
typically ignored, if there are any
differences in the base or accents.
Punctuation is variable. In some
situations a punctuation character is
treated like a base character. In
other situations, it should be ignored
if there are any base, accent, or case
differences. There may also be a
final, tie-breaking level, whereby if
there are no other differences at all
in the string, the (normalized) code
point order is used.
So, "Munt..." and "Münc..." are alphabetically different and sort based on the "t" and "c".
Whereas, "mun" and "mün" are alphabetically the same ("u" equivelent to "ü" in lost languages) so the character codes are compared
It looks like the accented character is only being used in a sort of "tie-break" situation - in other words, if the strings are otherwise equal.
Here's some sample code to demonstrate:
using System;
using System.Globalization;
class Test
{
static void Main()
{
Compare("mun", "mün");
Compare("muna", "münb");
Compare("munb", "müna");
}
static void Compare(string x, string y)
{
int result = string.Compare(x, y, true,
CultureInfo.InvariantCulture));
Console.WriteLine("{0}; {1}; {2}", x, y, result);
}
}
(I've tried adding a space after the "n" as well, to see if it was done on word boundaries - it isn't.)
Results:
mun; mün; -1
muna; münb; -1
munb; müna; 1
I suspect this is correct by various complicated Unicode rules - but I don't know enough about them.
As for whether you need to take this into account... I wouldn't expect so. What are you doing that is thrown by this?
As I understand this it is still somewhat consistent. When comparing using CultureInfo.InvariantCulture the umlaut character ü is treated like the non-accented character u.
As the strings in your first example obviously are not equal the result will not be 0 but -1 (which seems to be a default value). In the second example Muntelier goes last because t follows c in the alphabet.
I couldn't find any clear documentation in MSDN explaining these rules, but I found that
string.Compare("mun", "mün", CultureInfo.InvariantCulture,
CompareOptions.StringSort);
and
string.Compare("Muntelier, Schweiz", "München, Deutschland",
CultureInfo.InvariantCulture, CompareOptions.StringSort);
gives the desired result.
Anyway, I think you'd be better off to base your sorting on a specific culture such as the current user's culture (if possible).