In C#, how to convert to upper case following Unicode rules [duplicate] - c#

Is it possible to convert a string to ordinal upper or lower case, similar to the invariant conversions?
string upperInvariant = "ß".ToUpperInvariant();
string lowerInvariant = "ß".ToLowerInvariant();
bool invariant = upperInvariant == lowerInvariant; // true
string upperOrdinal = "ß".ToUpperOrdinal(); // SS
string lowerOrdinal = "ß".ToLowerOrdinal(); // ss
bool ordinal = upperOrdinal == lowerOrdinal; // false
How to implement ToUpperOrdinal and ToLowerOrdinal?
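For illustration, here is a naive per-code-unit sketch (the method name mirrors the question; it is not a framework API). Note that it cannot produce one-to-many mappings such as ß → SS, which is exactly the part that requires full Unicode case folding:

```csharp
using System;

// Naive sketch: uppercase each UTF-16 code unit independently.
// char.ToUpperInvariant returns its input unchanged when there is no
// single-char uppercase mapping, so 'ß' stays 'ß' rather than becoming "SS".
static string ToUpperOrdinal(string s)
{
    var chars = s.ToCharArray();
    for (int i = 0; i < chars.Length; i++)
        chars[i] = char.ToUpperInvariant(chars[i]);
    return new string(chars);
}

Console.WriteLine(ToUpperOrdinal("straße")); // STRAßE, not STRASSE
```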
Edit:
How to get the ordinal string representation? Likewise, how to get the invariant string representation? Maybe that's not possible, as in the above case it might be ambiguous, at least for the ordinal representation.
Edit2:
string.Equals("ß", "ss", StringComparison.InvariantCultureIgnoreCase); // true
but
"ß".ToLowerInvariant() == "ss"; // false

I don't believe this functionality exists in the .NET Framework or .NET Core. The closest thing is string.Normalize(), but it is missing the case fold option that you need to successfully pull this off.
This functionality exists in the ICU project (which is available in C/Java). The functionality you are after is the unorm2.h file in C or the Normalizer2 class in Java. Example usage in Java and related test.
There are 2 implementations of Normalizer2 that I am aware of that have been ported to C#:
icu-dotnet (a C# wrapper library for ICU4C)
ICU4N (a fully managed port of ICU4J)
Full Disclosure: I am a maintainer of ICU4N.

From msdn:
The StringComparer returned by the OrdinalIgnoreCase property treats the characters in the strings to compare as if they were converted to uppercase using the conventions of the invariant culture, and then performs a simple byte comparison that is independent of language.
But I'm guessing doing that won't achieve what you want, since simply doing "ß".ToUpperInvariant() won't give you a string that is ordinally equivalent to "ss". There must be some magic in the String.Equals method that handles the special case of why "ss" equals "ß".
If you're only worried about German text then this answer might help.
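For the German-only case, a hedged sketch along those lines (this hard-codes the single ß → ss expansion and is not general Unicode case folding; the method name is illustrative):

```csharp
using System;

// German-only shortcut: expand 'ß' before uppercasing, so both casings
// of "straße"/"STRASSE" normalize to the same string.
static string ToUpperGermanOrdinal(string s) =>
    s.Replace("ß", "ss").ToUpperInvariant();

Console.WriteLine(ToUpperGermanOrdinal("straße")); // STRASSE
```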

Related

C# "anyString".Contains('\0', StringComparison.InvariantCulture) returns true in .NET5 but false in older versions

I encountered a compatibility problem while trying to upgrade my projects from .NET Core 3.1 to the latest .NET 5.
My original code has a validation logic to check invalid file name characters by checking each character returned from Path.GetInvalidFileNameChars() API.
var invalidFilenameChars = Path.GetInvalidFileNameChars();
bool validFileName = !invalidFilenameChars.Any(ch => fileName.Contains(ch, StringComparison.InvariantCulture));
Suppose you give fileName a regular value such as "test.txt" that should be valid. Surprisingly, however, the above code reports that the file name is invalid if you run it with the net5.0 target framework.
After spending some time debugging, I found that the returned invalid character set contains '\0' (the ASCII null character) and that "test.txt".Contains("\0", StringComparison.InvariantCulture) returns true.
class Program
{
    static void Main(string[] args)
    {
        var containsNullChar = "test".Contains("\0", StringComparison.InvariantCulture);
        Console.WriteLine($"Contains null char {containsNullChar}");
    }
}
If you run this on .NET Core 3.1, it never says a regular string contains the null character. Also, if I omit the second parameter (StringComparison.InvariantCulture), or if I use StringComparison.Ordinal, the strange result is never returned.
Why has this behavior changed in .NET 5?
EDIT:
As commented by Karl-Johan Sjögren before, there is indeed a behavior change in .NET5 regarding string comparison:
Behavior changes when comparing strings on .NET 5+
Also see the related ticket:
string.IndexOf get different result in .Net 5
Though this issue should be related to the above, the current result for '\0' still looks strange to me and might still be considered a bug, as answered by #xanatos.
EDIT2:
Now I realized that the actual cause of this problem was my confusion between InvariantCulture and Ordinal string comparison. They are actually quite different things. See the ticket below:
Difference between InvariantCulture and Ordinal string comparison
Also note that this seems to be a problem unique to .NET, as other major programming languages such as Java, C++ and Python perform ordinal comparison by default.
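Given that distinction, a minimal sketch of the fixed validation (relying on the char overload of string.Contains, which always compares ordinally):

```csharp
using System;
using System.IO;
using System.Linq;

// Sketch of the validation rewritten with ordinal semantics: the char
// overload of string.Contains compares by UTF-16 code unit, so the '\0'
// entry in the invalid set no longer produces a spurious match.
var invalidFileNameChars = Path.GetInvalidFileNameChars();
bool IsValidFileName(string fileName) =>
    !invalidFileNameChars.Any(ch => fileName.Contains(ch));

Console.WriteLine(IsValidFileName("test.txt"));   // True
Console.WriteLine(IsValidFileName("te\0st.txt")); // False
```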
not a bug, a feature
The issue that I've opened has been closed, but they gave a very good explanation. Now... In .NET 5.0 they began using on Windows (on Linux it was already present) a new library for comparing strings: the ICU library. It is the official library of the Unicode Consortium, so it is authoritative. That library is used for CurrentCulture, InvariantCulture (plus the respective IgnoreCase variants) and any other culture. The only exception is Ordinal/OrdinalIgnoreCase. The library is targeted at text, and it has some "particular" ideas about non-text. In this particular case, there are some characters that are simply ignored. In the block 0000-00FF I would say the ignored characters are all control codes (please ignore the fact that they are shown as €‚ƒ„†‡ˆ‰Š‹ŒŽ‘’“”•–—™š›œžŸ; at a certain point these characters were remapped somewhere else in Unicode, but the glyphs shown don't reflect it. If you check their codes, as in char ch = '€'; int val = (int)ch;, you'll see it), and '\0' is a control code.
Now... My personal thinking is that to compare strings today you need a master's degree in Unicode Technologies 😥, and I do hope that they'll do some shenanigans in .NET 6.0 to make the default comparison Ordinal (it is one of the proposals for .NET 6.0, Option B). Note that if you want to write programs that can run in Turkey you already needed a master's degree in Unicode Technologies (see the Turkish i problem).
In general I would say that to look for natural-language words you should use culture-aware comparisons, while to look for keywords/fixed words (for example column names) and symbols/control codes you should use Ordinal comparisons. The problem is when you want to look for both at the same time. Normally in this case you are looking for exact words, so you can use Ordinal. Otherwise it becomes hellish. And I don't even want to think about how Regex works internally in a culture-aware environment, because in that direction there can only be folly and nightmares 😁.
As a side note, even before this change the "default" culture-aware comparisons had some secret shenanigans... for example:
int ix = "ʹ$ʹ".IndexOf("$"); // -1 on .NET Framework or .NET Core <= 3.1
what I had written before
I'll say that it is a bug. There is a similar bug with IndexOf. I've opened an issue on GitHub to track it.
As you have written, Ordinal and OrdinalIgnoreCase work as expected (probably because they don't use the new ICU library for handling Unicode).
Some sample code:
Console.WriteLine($"Ordinal Contains null char {"test".Contains("\0", StringComparison.Ordinal)}");
Console.WriteLine($"OrdinalIgnoreCase Contains null char {"test".Contains("\0", StringComparison.OrdinalIgnoreCase)}");
Console.WriteLine($"CurrentCulture Contains null char {"test".Contains("\0", StringComparison.CurrentCulture)}");
Console.WriteLine($"CurrentCultureIgnoreCase Contains null char {"test".Contains("\0", StringComparison.CurrentCultureIgnoreCase)}");
Console.WriteLine($"InvariantCulture Contains null char {"test".Contains("\0", StringComparison.InvariantCulture)}");
Console.WriteLine($"InvariantCultureIgnoreCase Contains null char {"test".Contains("\0", StringComparison.InvariantCultureIgnoreCase)}");
Console.WriteLine($"Ordinal IndexOf null char {"test".IndexOf("\0", StringComparison.Ordinal)}");
Console.WriteLine($"OrdinalIgnoreCase IndexOf null char {"test".IndexOf("\0", StringComparison.OrdinalIgnoreCase)}");
Console.WriteLine($"CurrentCulture IndexOf null char {"test".IndexOf("\0", StringComparison.CurrentCulture)}");
Console.WriteLine($"CurrentCultureIgnoreCase IndexOf null char {"test".IndexOf("\0", StringComparison.CurrentCultureIgnoreCase)}");
Console.WriteLine($"InvariantCulture IndexOf null char {"test".IndexOf("\0", StringComparison.InvariantCulture)}");
Console.WriteLine($"InvariantCultureIgnoreCase IndexOf null char {"test".IndexOf("\0", StringComparison.InvariantCultureIgnoreCase)}");
and
Console.WriteLine($"Ordinal Contains null char {"test".Contains("\0test", StringComparison.Ordinal)}");
Console.WriteLine($"OrdinalIgnoreCase Contains null char {"test".Contains("\0test", StringComparison.OrdinalIgnoreCase)}");
Console.WriteLine($"CurrentCulture Contains null char {"test".Contains("\0test", StringComparison.CurrentCulture)}");
Console.WriteLine($"CurrentCultureIgnoreCase Contains null char {"test".Contains("\0test", StringComparison.CurrentCultureIgnoreCase)}");
Console.WriteLine($"InvariantCulture Contains null char {"test".Contains("\0test", StringComparison.InvariantCulture)}");
Console.WriteLine($"InvariantCultureIgnoreCase Contains null char {"test".Contains("\0test", StringComparison.InvariantCultureIgnoreCase)}");
Console.WriteLine($"Ordinal IndexOf null char {"test".IndexOf("\0test", StringComparison.Ordinal)}");
Console.WriteLine($"OrdinalIgnoreCase IndexOf null char {"test".IndexOf("\0test", StringComparison.OrdinalIgnoreCase)}");
Console.WriteLine($"CurrentCulture IndexOf null char {"test".IndexOf("\0test", StringComparison.CurrentCulture)}");
Console.WriteLine($"CurrentCultureIgnoreCase IndexOf null char {"test".IndexOf("\0test", StringComparison.CurrentCultureIgnoreCase)}");
Console.WriteLine($"InvariantCulture IndexOf null char {"test".IndexOf("\0test", StringComparison.InvariantCulture)}");
Console.WriteLine($"InvariantCultureIgnoreCase IndexOf null char {"test".IndexOf("\0test", StringComparison.InvariantCultureIgnoreCase)}");

Alternative to string.ToUpper() with StringComparison or similar, that fully preserves behavior

Using left.ToUpper() == right.ToUpper() is not the best way to compare strings, not least because of performance issues. I want to refactor this code (fully preserving behavior!) into something efficient, but I can't achieve full equivalence for the special cases.
So, here is a simple test method:
[TestCase("Strasse", "Straße", "tr-TR")]
[TestCase("İ", "i", "tr-TR")]
public void UsingToUpper_AndCurrentCultureIgnoreCase_AreSame(string left, string right, string culture)
{
    // Arrange, Act
    Thread.CurrentThread.CurrentCulture = CultureInfo.GetCultureInfo(culture);
    var toUpper = left.ToUpper() == right.ToUpper();
    var stringComparison = left.Equals(right, StringComparison.CurrentCultureIgnoreCase);

    // Assert
    Assert.AreEqual(toUpper, stringComparison);
}
I tried two options, StringComparison.CurrentCultureIgnoreCase and StringComparison.OrdinalIgnoreCase; both of them fail (in different cases).
So, the question:
Is there a way to compare two strings without changing case that fully preserves the behavior of ToUpper()?
I am afraid you would have to write your own custom comparison method.
ToUpper is making use of the Unicode metadata. Every character
(Unicode code point) has a case as well as case mapping to upper- and
lowercase (and title case). .NET uses this information to convert a
string to upper- or lowercase. You can find the very same information
in the Unicode Character Database.
You can supply a culture to the ToUpper method, but that would not achieve your goal.
You could define your own custom culture, as described in this answer: Create custom culture in ASP.NET
However, there would not be any comparison exactly matching the behaviour of the ToUpper method since, as mentioned before, it uses the Unicode metadata. You cannot force string.Equals to apply those Unicode case mappings.
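A sketch of such a custom comparer (the class name is illustrative, not from the question). It reproduces the exact left.ToUpper() == right.ToUpper() semantics, including their culture sensitivity, though it does not remove the allocation cost of ToUpper:

```csharp
using System;
using System.Collections.Generic;

var cmp = new ToUpperEqualityComparer();
Console.WriteLine(cmp.Equals("abc", "ABC")); // True

// Wraps the original comparison so callers can pass it to collections
// and LINQ methods expecting an IEqualityComparer<string>.
sealed class ToUpperEqualityComparer : IEqualityComparer<string>
{
    public bool Equals(string x, string y)
    {
        if (ReferenceEquals(x, y)) return true;
        if (x is null || y is null) return false;
        return x.ToUpper() == y.ToUpper(); // culture-sensitive, like the original
    }

    public int GetHashCode(string s) => s.ToUpper().GetHashCode();
}
```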

Could string comparisons really differ based on culture when the string is guaranteed not to change?

I'm reading encrypted credentials/connection strings from a config file. Resharper tells me, "String.IndexOf(string) is culture-specific here" on this line:
if (line.Contains("host=")) {
    _host = line.Substring(line.IndexOf("host=") + "host=".Length, line.Length - "host=".Length);
...and so wants to change it to:
if (line.Contains("host=")) {
    _host = line.Substring(line.IndexOf("host=", System.StringComparison.Ordinal) + "host=".Length, line.Length - "host=".Length);
The value I'm reading will always be "host=" regardless of where the app may be deployed. Is it really sensible to add this "System.StringComparison.Ordinal" bit?
More importantly, could it hurt anything (to use it)?
Absolutely. Per MSDN (http://msdn.microsoft.com/en-us/library/d93tkzah.aspx),
This method performs a word (case-sensitive and culture-sensitive)
search using the current culture.
So you may get different results if you run it under a different culture (via regional and language settings in Control Panel).
In this particular case, you probably won't have a problem, but throw an i in the search string and run it in Turkey and it will probably ruin your day.
See MSDN: http://msdn.microsoft.com/en-us/library/ms973919.aspx
These new recommendations and APIs exist to alleviate misguided assumptions about the behavior of default string APIs. The canonical
example of bugs emerging where non-linguistic string data is
interpreted linguistically is the "Turkish-I" problem.
For nearly all Latin alphabets, including U.S. English, the character
i (\u0069) is the lowercase version of the character I (\u0049). This
casing rule quickly becomes the default for someone programming in
such a culture. However, in Turkish ("tr-TR"), there exists a capital
"i with a dot," character (\u0130), which is the capital version of
i. Similarly, in Turkish, there is a lowercase "i without a dot," or
(\u0131), which capitalizes to I. This behavior occurs in the Azeri
culture ("az") as well.
Therefore, assumptions normally made about capitalizing i or
lowercasing I are not valid among all cultures. If the default
overloads for string comparison routines are used, they will be
subject to variance between cultures. For non-linguistic data, as in
the following example, this can produce undesired results:
Thread.CurrentThread.CurrentCulture = new CultureInfo("en-US");
Console.WriteLine("Culture = {0}",
    Thread.CurrentThread.CurrentCulture.DisplayName);
Console.WriteLine("(file == FILE) = {0}",
    (String.Compare("file", "FILE", true) == 0));

Thread.CurrentThread.CurrentCulture = new CultureInfo("tr-TR");
Console.WriteLine("Culture = {0}",
    Thread.CurrentThread.CurrentCulture.DisplayName);
Console.WriteLine("(file == FILE) = {0}",
    (String.Compare("file", "FILE", true) == 0));
Because of the difference of the comparison of I, results of the
comparisons change when the thread culture is changed. This is the
output:
Culture = English (United States)
(file == FILE) = True
Culture = Turkish (Turkey)
(file == FILE) = False
Here is an example without case:
var s1 = "é"; //é as one character (ALT+0233)
var s2 = "é"; //'e', plus combining acute accent U+301 (two characters)
Console.WriteLine(s1.IndexOf(s2, StringComparison.Ordinal)); //-1
Console.WriteLine(s1.IndexOf(s2, StringComparison.InvariantCulture)); //0
Console.WriteLine(s1.IndexOf(s2, StringComparison.CurrentCulture)); //0
CA1309: UseOrdinalStringComparison
It doesn't hurt to not use it, but "by explicitly setting the parameter to either the StringComparison.Ordinal or StringComparison.OrdinalIgnoreCase, your code often gains speed, increases correctness, and becomes more reliable.".
What exactly is Ordinal, and why does it matter to your case?
An operation that uses ordinal sort rules performs a comparison based
on the numeric value (Unicode code point) of each Char in the string.
An ordinal comparison is fast but culture-insensitive. When you use
ordinal sort rules to sort strings that start with Unicode characters
(U+), the string U+xxxx comes before the string U+yyyy if the value of
xxxx is numerically less than yyyy.
And, as you stated, the string value you are reading in is not culture-sensitive, so it makes sense to use an ordinal comparison as opposed to a word (linguistic) comparison. Just remember: Ordinal means "this isn't culture-sensitive".
To answer your specific question: No, but a static analysis tool is not going to be able to realize that your input value will never have locale-specific information in it.
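Concretely, the ordinal version of the parsing code might look like this (the sample line value is illustrative, not from the question):

```csharp
using System;

// Ordinal search for a fixed, machine-readable token: the token "host="
// is program data, not natural language, so culture rules should not apply.
const string prefix = "host=";
string line = "host=db.example.com"; // illustrative input

int idx = line.IndexOf(prefix, StringComparison.Ordinal);
string host = idx >= 0 ? line.Substring(idx + prefix.Length) : null;
Console.WriteLine(host); // db.example.com
```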

Issue with surrogate unicode characters in F#

I'm working with strings which could contain surrogate Unicode characters (non-BMP, 4 bytes per character in UTF-16).
When I use the "\Uxxxxxxxx" format to specify a surrogate character in F#, for some characters it gives a different result than in C#. For example:
C#:
string s = "\U0001D11E";
bool c = Char.IsSurrogate(s, 0);
Console.WriteLine(String.Format("Length: {0}, is surrogate: {1}", s.Length, c));
Gives: Length: 2, is surrogate: True
F#:
let s = "\U0001D11E"
let c = Char.IsSurrogate(s, 0)
printf "Length: %d, is surrogate: %b" s.Length c
Gives: Length: 2, is surrogate: false
Note: Some surrogate characters work in F# ("\U0010011", "\U00100011"), but some of them don't.
Q: Is this a bug in F#? How can I handle the allowed surrogate Unicode characters in strings with F#? (Does F# have a different format, or is the only way to use Char.ConvertFromUtf32 0x1D11E?)
Update:
s.ToCharArray() gives for F# [| 0xD800; 0xDF41 |]; for C# { 0xD834, 0xDD1E }
This is a known bug in the F# compiler that shipped with VS2010 (and SP1); the fix appears in the VS11 bits, so if you have the VS11 Beta and use the F# 3.0 compiler, you'll see this behave as expected.
(If the other answers/comments here don't provide you with a suitable workaround in the meantime, let me know.)
That obviously means that F# makes a mistake while parsing some string literals. That is proven by the fact that the character you've mentioned is non-BMP, and in UTF-16 it should be represented as a pair of surrogates.
Surrogates are words in the range 0xD800-0xDFFF, while neither of the chars in the produced string fits in that range.
But the processing of surrogates doesn't change, as the framework (what is under the hood) is the same. So you already have the answer in your question: if you need string literals with non-BMP characters in your code, you should just use Char.ConvertFromUtf32 instead of the \UXXXXXXXX notation. All the rest of the processing will be just the same as always.
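A C# illustration of the suggested workaround (the same method is callable from F# as System.Char.ConvertFromUtf32 0x1D11E):

```csharp
using System;

// Build the non-BMP character at runtime instead of via a \U literal.
string s = char.ConvertFromUtf32(0x1D11E); // MUSICAL SYMBOL G CLEF
Console.WriteLine(s.Length);               // 2 (a surrogate pair)
Console.WriteLine(char.IsSurrogate(s, 0)); // True
Console.WriteLine($"{(int)s[0]:X4} {(int)s[1]:X4}"); // D834 DD1E
```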
It seems to me that this is connected with different forms of normalization.
Both in C# and in F# s.IsNormalized() returns true
But in C#
s.ToCharArray() gives us {55348, 56606} //0xD834, 0xDD1E
and in F#
s.ToCharArray() gives us {65533, 57422} //0xFFFD, 0xE04E
And as you probably know, System.Char.IsSurrogate is implemented in the following way:
public static bool IsSurrogate(char c)
{
    return (c >= HIGH_SURROGATE_START && c <= LOW_SURROGATE_END);
}
where
HIGH_SURROGATE_START = 0x00d800;
LOW_SURROGATE_END = 0x00dfff;
So in C# first char (55348) is less than LOW_SURROGATE_END but in F# first char (65533) is not less than LOW_SURROGATE_END.
I hope this helps.

Why should I convert a string to upper case when comparing?

I constantly read that it is good practice to convert a string to upper case (I think Hanselman mentioned this on his blog a long time ago) when that string is to be compared against another (which should also be converted to upper case).
What is the benefit of this? Why should I do this (or are there any cases when I shouldn't)?
Thanks
No, you should be using the enum option that allows for case-insensitive comparison (StringComparison).
Make sure to use the overload of the comparison method that accepts it, i.e. String.Compare, String.Equals.
The reason that you should convert to upper case rather than lower case when doing a comparison (when it's not practically possible to do a case-insensitive comparison) is that some (not so commonly used) characters do not convert to lower case without losing information.
Some upper case characters don't have an equivalent lower case character, so making them lower case would convert them into a different lower case character. That could cause a false positive in the comparison.
A better way to do case-insensitive string comparison is:
bool ignoreCase = true;
bool stringsAreSame = (string.Compare(str1, str2, ignoreCase) == 0);
Also, see here:
Upper vs Lower Case
Strings should be normalized to uppercase. A small group of characters, when they are converted to lowercase, cannot make a round trip. To make a round trip means to convert the characters from one locale to another locale that represents character data differently, and then to accurately retrieve the original characters from the converted characters.
Reference:
https://learn.microsoft.com/en-us/visualstudio/code-quality/ca1308-normalize-strings-to-uppercase?view=vs-2015
This sounds like a cheap way to do case-insensitive comparisons. I would wonder whether there isn't a function that would do that for you, without you having to explicitly convert to uppercase.
The .Net framework is slightly faster at doing string comparisons between uppercase letters than string comparisons between lowercase letters.
As others have mentioned, some information might be lost when converting from uppercase to lowercase.
You may want to try using a StringComparer object to do case insensitive comparisons.
StringComparer comparer = StringComparer.OrdinalIgnoreCase;
bool isEqualV1 = comparer.Equals("stringA", "stringB");
bool isEqualV2 = (comparer.Compare("stringA", "stringB") == 0);
The .NET Framework, as of 4.7, has a Span type that should help speed up string comparisons in certain circumstances.
Depending on your use case, you may want to make use of the constructors for HashSet and Dictionary types which can take a StringComparer as an input parameter for the constructor.
I typically use a StringComparer as an input parameter to a method, with a default of StringComparer.OrdinalIgnoreCase, and I try to make use of other techniques (HashSets, Dictionaries or Spans) if speed is important.
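A short sketch of that pattern (the values are illustrative):

```csharp
using System;
using System.Collections.Generic;

// Case-insensitive lookups without any explicit ToUpper/ToLower calls:
// the comparer is baked into the collection at construction time.
var set = new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "Alpha", "Beta" };
Console.WriteLine(set.Contains("ALPHA")); // True

var map = new Dictionary<string, int>(StringComparer.OrdinalIgnoreCase) { ["Key"] = 1 };
Console.WriteLine(map["KEY"]); // 1
```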
You do not have to convert a string to Upper case. Convert it to lower case 8-)
