Why does string.Compare seem to handle accented characters inconsistently?

If I execute the following statement:
string.Compare("mun", "mün", true, CultureInfo.InvariantCulture)
The result is '-1', indicating that 'mun' has a lower numeric value than 'mün'.
However, if I execute this statement:
string.Compare("Muntelier, Schweiz", "München, Deutschland", true, CultureInfo.InvariantCulture)
I get '1', indicating that 'Muntelier, Schweiz' should go last.
Is this a bug in the comparison? Or, more likely, is there a rule I should be taking into account when sorting strings containing accented characters?
The reason this is an issue is, I'm sorting a list and then doing a manual binary filter that's meant to get every string beginning with 'xxx'.
Previously I was using the Linq 'Where' method, but now I have to use this custom function written by another person, because he says it performs better.
But the custom function doesn't seem to take into account whatever 'unicode' rules .NET has. So if I tell it to filter by 'mün', it doesn't find any items, even though there are items in the list beginning with 'mun'.
This seems to be because of the inconsistent ordering of accented characters, depending on what characters go after the accented character.
OK, I think I've fixed the problem.
Before the filter, I do a sort based on the first n letters of each string, where n is the length of the search string.
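One way to keep a sorted list and a binary prefix filter in agreement is to use the same comparison for both. The sketch below assumes ordinal ordering is acceptable, which makes it accent- and case-sensitive. `LowerBound` and `StartingWith` are hypothetical helper names (not the custom function mentioned above), the city list is made-up sample data, and the top-level statements assume a recent C# compiler.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: index of the first item that is >= prefix in ordinal order.
int LowerBound(List<string> sorted, string prefix)
{
    int lo = 0, hi = sorted.Count;
    while (lo < hi)
    {
        int mid = lo + (hi - lo) / 2;
        if (string.CompareOrdinal(sorted[mid], prefix) < 0) lo = mid + 1;
        else hi = mid;
    }
    return lo;
}

// Hypothetical helper: in an ordinally sorted list, all items sharing a prefix
// are contiguous, so walk forward from the lower bound until the prefix stops matching.
List<string> StartingWith(List<string> sorted, string prefix)
{
    var result = new List<string>();
    for (int i = LowerBound(sorted, prefix);
         i < sorted.Count && sorted[i].StartsWith(prefix, StringComparison.Ordinal);
         i++)
    {
        result.Add(sorted[i]);
    }
    return result;
}

var cities = new List<string>
{
    "München, Deutschland", "Muntelier, Schweiz", "Münster, Deutschland", "Bern, Schweiz"
};
cities.Sort(StringComparer.Ordinal);          // same ordering as the search uses
List<string> matches = StartingWith(cities, "Mün");
Console.WriteLine(string.Join(" | ", matches));
```

Because the sort and the search share one ordering, the prefix range is always contiguous. The trade-off is that a search for "Mün" will not find "Muntelier".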

There is a tie-breaking algorithm at work, see http://unicode.org/reports/tr10/
To address the complexities of
language-sensitive sorting, a
multilevel comparison algorithm is
employed. In comparing two words, for
example, the most important feature is
the base character: such as the
difference between an A and a B.
Accent differences are typically
ignored, if there are any differences
in the base letters. Case differences
(uppercase versus lowercase), are
typically ignored, if there are any
differences in the base or accents.
Punctuation is variable. In some
situations a punctuation character is
treated like a base character. In
other situations, it should be ignored
if there are any base, accent, or case
differences. There may also be a
final, tie-breaking level, whereby if
there are no other differences at all
in the string, the (normalized) code
point order is used.
So, "Munt..." and "Münc..." are alphabetically different and sort based on the "t" and "c".
Whereas "mun" and "mün" are alphabetically the same ("u" is equivalent to "ü" in most languages), so the comparison falls through to the lower, tie-breaking levels (here, the accent difference).
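If the goal is a prefix filter that treats "u" and "ü" as the same base letter, one option is to compare at that first level only. This is a minimal sketch, assuming the invariant culture and `CompareOptions.IgnoreNonSpace`; `FilterByPrefix` is a hypothetical helper and the city list is sample data. Results of culture-sensitive comparisons can depend on the platform's globalization library, so treat it as an illustration rather than a guarantee.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;

CompareInfo compareInfo = CultureInfo.InvariantCulture.CompareInfo;

// Ignore case and non-spacing marks (accents), so only base letters are compared.
CompareOptions baseLettersOnly = CompareOptions.IgnoreCase | CompareOptions.IgnoreNonSpace;

// With accents ignored, "mun" and "mün" no longer need a tie-break.
int primaryOnly = compareInfo.Compare("mun", "mün", baseLettersOnly);

// Hypothetical helper: linear, accent-insensitive prefix filter.
List<string> FilterByPrefix(IEnumerable<string> items, string prefix) =>
    items.Where(item => compareInfo.IsPrefix(item, prefix, baseLettersOnly)).ToList();

var cities = new List<string>
{
    "Muntelier, Schweiz", "München, Deutschland", "Berlin, Deutschland"
};
List<string> matches = FilterByPrefix(cities, "mün");

Console.WriteLine(primaryOnly);
Console.WriteLine(string.Join(" | ", matches));
```

This trades the speed of a binary search for a linear scan. A sorted-list variant would need the list to be sorted with the same options that the filter uses.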

It looks like the accented character is only being used in a sort of "tie-break" situation - in other words, if the strings are otherwise equal.
Here's some sample code to demonstrate:
using System;
using System.Globalization;
class Test
{
    static void Main()
    {
        Compare("mun", "mün");
        Compare("muna", "münb");
        Compare("munb", "müna");
    }
    // Case-insensitive comparison under the invariant culture.
    static void Compare(string x, string y)
    {
        int result = string.Compare(x, y, true,
                                    CultureInfo.InvariantCulture);
        Console.WriteLine("{0}; {1}; {2}", x, y, result);
    }
}
(I've tried adding a space after the "n" as well, to see if it was done on word boundaries - it isn't.)
Results:
mun; mün; -1
muna; münb; -1
munb; müna; 1
I suspect this is correct by various complicated Unicode rules - but I don't know enough about them.
As for whether you need to take this into account... I wouldn't expect so. What are you doing that is thrown by this?

As I understand this it is still somewhat consistent. When comparing using CultureInfo.InvariantCulture the umlaut character ü is treated like the non-accented character u.
As the strings in your first example obviously are not equal the result will not be 0 but -1 (which seems to be a default value). In the second example Muntelier goes last because t follows c in the alphabet.
I couldn't find any clear documentation in MSDN explaining these rules, but I found that
string.Compare("mun", "mün", CultureInfo.InvariantCulture,
CompareOptions.StringSort);
and
string.Compare("Muntelier, Schweiz", "München, Deutschland",
CultureInfo.InvariantCulture, CompareOptions.StringSort);
gives the desired result.
Anyway, I think you'd be better off to base your sorting on a specific culture such as the current user's culture (if possible).

Related

String interpolation C#: Documentation of colon and semicolon functionality

I found this codegolf answer for the FizzBuzz test, and after examining it a bit I realized I had no idea how it actually worked, so I started investigating:
for(int i=1; i<101;i++)
System.Console.Write($"{(i%3*i%5<1?0:i):#}{i%3:;;Fizz}{i%5:;;Buzz}\n");
I put it into dotnetfiddle and established the 1st part works as follows:
{(BOOL?0:i):#}
When BOOL is true, then the conditional expression returns 0 otherwise the number.
However, the number isn't returned unless it's <> 0. I'm guessing this is the job of the :# characters. I can't find any documentation on how :# works. Can anyone explain the colon/hash or point me in the right direction?
Second part:
{VALUE:;;Fizz}
When VALUE = 0, "Fizz" is printed; otherwise nothing is printed. I assume this is controlled by the ; characters, with the text after the second ; meaning 'if VALUE = 0 then print what's after me.'
Again, does anyone have documentation on the use of a semicolon in string interpolation, as I can't find anything useful.

This is all covered in the String Interpolation documentation, especially the section on the Structure of an Interpolated String, which includes this:
{<interpolatedExpression>[,<alignment>][:<formatString>]}
along with a more detailed description for each of those three sections.
The format string portion of that structure is defined on separate pages, where you can use standard and custom formats for numeric types as well as standard and custom formats for date and time types. There are also options for Enum values, and you can even create your own custom format provider.
It's worth taking a look at the custom format provider documentation just because it will also lead you to the FormattableString type. This isn't well-covered by the documentation, but my understanding is this type may in theory allow you to avoid re-parsing the interpolated string for each iteration when used in a loop, thus potentially improving performance (though in practice, there's no difference at this time). I've written about this before, and my conclusion is MS needs to build this into the framework in a better way.

Thanks to all the commenters! Fast response.
The # is defined here (Custom specifier)
https://learn.microsoft.com/en-us/dotnet/standard/base-types/custom-numeric-format-strings#the--custom-specifier
The "#" custom format specifier serves as a digit-placeholder symbol.
If the value that is being formatted has a digit in the position where
the "#" symbol appears in the format string, that digit is copied to
the result string. Otherwise, nothing is stored in that position in
the result string. Note that this specifier never displays a zero that
is not a significant digit, even if zero is the only digit in the
string. It will display zero only if it is a significant digit in the
number that is being displayed.
The ; is defined here (Section Separator):
https://learn.microsoft.com/en-us/dotnet/standard/base-types/custom-numeric-format-strings#the--section-separator
The semicolon (;) is a conditional format specifier that applies
different formatting to a number depending on whether its value is
positive, negative, or zero. To produce this behavior, a custom format
string can contain up to three sections separated by semicolons...
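The two quoted rules can be tried in isolation. The sketch below applies the "#" placeholder and a three-section format string whose first two sections are empty; `Line` is a hypothetical helper that rebuilds one FizzBuzz line with `string.Format` and a spelled-out condition instead of the golfed interpolated expression.

```csharp
using System;
using System.Globalization;

CultureInfo inv = CultureInfo.InvariantCulture;

// "#" is a digit placeholder that never emits an insignificant zero.
string zeroWithHash = 0.ToString("#", inv);       // empty string
string sevenWithHash = 7.ToString("#", inv);      // "7"

// ";;Fizz" = empty positive section ; empty negative section ; "Fizz" for zero.
string zeroRemainder = 0.ToString(";;Fizz", inv);     // "Fizz"
string nonZeroRemainder = 2.ToString(";;Fizz", inv);  // empty string

// Hypothetical helper: one FizzBuzz line built from the same format strings.
string Line(int i) => string.Format(
    inv,
    "{0:#}{1:;;Fizz}{2:;;Buzz}",
    (i % 3 == 0 || i % 5 == 0) ? 0 : i,  // 0 suppresses the number via "#"
    i % 3,                               // 0 selects the "Fizz" section
    i % 5);                              // 0 selects the "Buzz" section

for (int i = 1; i <= 15; i++)
    Console.WriteLine(Line(i));
```

So the number is hidden because it is replaced by 0 and "#" prints nothing for 0, while "Fizz" and "Buzz" appear only when the remainder is exactly 0.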

Odd C# sorting Behavior [duplicate]

I have a List like this:
List<string> items = new List<string>();
items.Add("-");
items.Add(".");
items.Add("a-");
items.Add("a.");
items.Add("a-a");
items.Add("a.a");
items.Sort();
string output = string.Empty;
foreach (string s in items)
{
    output += s + Environment.NewLine;
}
MessageBox.Show(output);
The output is coming back as
-
.
a-
a.
a.a
a-a
whereas I am expecting the results as
-
.
a-
a.
a-a
a.a
Any idea why "a-a" does not come before "a.a", whereas "a-" comes before "a."?

I suspect that in the last case "-" is treated in a different way due to culture-specific settings (perhaps as a "dash" as opposed to "minus" in the first strings). MSDN warns about this:
The comparison uses the current culture to obtain culture-specific
information such as casing rules and the alphabetic order of
individual characters. For example, a culture could specify that
certain combinations of characters be treated as a single character,
or uppercase and lowercase characters be compared in a particular way,
or that the sorting order of a character depends on the characters
that precede or follow it.
Also see in this MSDN page:
The .NET Framework uses three distinct ways of sorting: word sort,
string sort, and ordinal sort. Word sort performs a culture-sensitive
comparison of strings. Certain nonalphanumeric characters might have
special weights assigned to them; for example, the hyphen ("-") might
have a very small weight assigned to it so that "coop" and "co-op"
appear next to each other in a sorted list. String sort is similar to
word sort, except that there are no special cases; therefore, all
nonalphanumeric symbols come before all alphanumeric characters.
Ordinal sort compares strings based on the Unicode values of each
element of the string.
So, hyphen gets a special treatment in the default sort mode in order to make the word sort more "natural".
You can get "normal" ordinal sort if you specifically turn it on:
Console.WriteLine(string.Compare("a.", "a-")); //1
Console.WriteLine(string.Compare("a.a", "a-a")); //-1
Console.WriteLine(string.Compare("a.", "a-", StringComparison.Ordinal)); //1
Console.WriteLine(string.Compare("a.a", "a-a", StringComparison.Ordinal)); //1
To sort the original collection using ordinal comparison use:
items.Sort(StringComparer.Ordinal);

If you want your string sort to be based on the numeric value of each character (its UTF-16 code unit) as opposed to the rules defined by the current culture, you can sort by Ordinal:
items.Sort(StringComparer.Ordinal);
This will make the results consistent across all cultures (but it will produce unintuitive orderings such as "14" coming before "9", which may or may not be what you're looking for).
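As a quick check of what ordinal sorting does with the list from the question, here is a small sketch (top-level statements, assuming a recent C# compiler). Only the ordinal results are shown, since culture-sensitive results can differ between platforms.

```csharp
using System;
using System.Collections.Generic;

var items = new List<string> { "-", ".", "a-", "a.", "a-a", "a.a" };
items.Sort(StringComparer.Ordinal);

// '-' is U+002D and '.' is U+002E, so every string starting with "a-"
// sorts before every string starting with "a.".
string ordinalOrder = string.Join(" ", items);
Console.WriteLine(ordinalOrder);   // - . a- a-a a. a.a

// Digits are compared character by character, not as numbers.
var numbers = new List<string> { "9", "14" };
numbers.Sort(StringComparer.Ordinal);
Console.WriteLine(string.Join(" ", numbers));   // 14 9
```

Note that the ordinal order is consistent but still not the order the question expected: "a-a" lands before "a." because the comparison stops at the first differing character.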

The Sort method of the List<> class relies on the default string comparer of the .NET Framework, which compares strings using the current CultureInfo of the thread.
The CultureInfo specifies the alphabetical order of characters, and it seems that the default one uses a different order from what you would expect.
When sorting you can specify a specific CultureInfo, one that you know will match your sorting requirements, by wrapping it in a StringComparer. Sample (German culture):
var sortCulture = new CultureInfo("de-DE");
items.Sort(StringComparer.Create(sortCulture, false));
More info can be found here:
http://msdn.microsoft.com/en-us/library/b0zbh7b6.aspx
http://msdn.microsoft.com/de-de/library/system.stringcomparer.aspx

CamelCase conversion to friendly name, i.e. Enum constants; Problems?

In my answer to this question, I mentioned that we used UpperCamelCase parsing to get a description of an enum constant not decorated with a Description attribute, but it was naive, and it didn't work in all cases. I revisited it, and this is what I came up with:
var result = Regex.Replace(camelCasedString,
    @"(?<a>(?<!^)[A-Z][a-z])", @" ${a}");
result = Regex.Replace(result,
    @"(?<a>[a-z])(?<b>[A-Z0-9])", @"${a} ${b}");
The first Replace looks for an uppercase letter, followed by a lowercase letter, EXCEPT where the uppercase letter is the start of the string (to avoid having to go back and trim), and adds a preceding space. It handles your basic UpperCamelCase identifiers, and leading all-upper acronyms like FDICInsured.
The second Replace looks for a lowercase letter followed by an uppercase letter or a number, and inserts a space between the two. This is to handle special but common cases of middle or trailing acronyms, or numbers in an identifier (except leading numbers, which are usually prohibited in C-style languages anyway).
Running some basic unit tests, the combination of these two correctly separated all of the following identifiers: NoDescription, HasLotsOfWords, AAANoDescription, ThisHasTheAcronymABCInTheMiddle, MyTrailingAcronymID, TheNumber3, IDo3Things, IAmAValueWithSingleLetterWords, and Basic (which didn't have any spaces added).
So, I'm posting this first to share it with others who may find it useful, and second to ask two questions:
Anyone see a case that would follow common CamelCase-ish conventions, that WOULDN'T be correctly separated into a friendly string this way? I know it won't separate adjacent acronyms (FDICFCUAInsured), recapitalize "properly" camelCased acronyms like FdicInsured, or capitalize the first letter of a lowerCamelCased identifier (but that one's easy to add - result = Regex.Replace(result, "^[a-z]", m=>m.ToString().ToUpper());). Anything else?
Can anyone see a way to make this one statement, or more elegant? I was looking to combine the Replace calls, but as they do two different things to their matches it can't be done with these two strings. They could be combined into a method chain with a RegexReplace extension method on String, but can anyone think of better?
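For anyone who wants to try the two-pass approach, here is a small self-contained sketch. The two patterns are the ones from the question; `ToFriendlyName` is a hypothetical wrapper name, and the sample identifiers are taken from the test list above.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical wrapper around the two Replace calls from the question.
string ToFriendlyName(string camelCasedString)
{
    // Pass 1: space before an uppercase+lowercase pair, except at the start of the string.
    string result = Regex.Replace(camelCasedString, @"(?<a>(?<!^)[A-Z][a-z])", " ${a}");
    // Pass 2: space between a lowercase letter and a following uppercase letter or digit.
    return Regex.Replace(result, @"(?<a>[a-z])(?<b>[A-Z0-9])", "${a} ${b}");
}

string[] samples =
{
    "HasLotsOfWords", "AAANoDescription", "ThisHasTheAcronymABCInTheMiddle",
    "MyTrailingAcronymID", "TheNumber3", "IDo3Things", "Basic"
};

foreach (string name in samples)
    Console.WriteLine($"{name} -> {ToFriendlyName(name)}");
```

For example, "IDo3Things" becomes "I Do3 Things" after the first pass and "I Do 3 Things" after the second, which is why the two passes are hard to merge without lookarounds.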

So while I agree with Hans Passant here, I have to say that I had to try my hand at making it one regex as an armchair regex user.
(?<a>(?<!^)((?:[A-Z][a-z])|(?:(?<!^[A-Z]+)[A-Z0-9]+(?:(?=[A-Z][a-z])|$))|(?:[0-9]+)))
Is what I came up with. It seems to pass all the tests you put forward in the question.
So
var result = Regex.Replace(camelCasedString, @"(?<a>(?<!^)((?:[A-Z][a-z])|(?:(?<!^[A-Z]+)[A-Z0-9]+(?:(?=[A-Z][a-z])|$))|(?:[0-9]+)))", @" ${a}");
Does it in one pass.

Not that this directly answers the question, but why not test by taking the standard C# API and converting each class name into a friendly name? It'd take some manual verification, but it'd give you a good list of standard names to test.

Let's say every case you come across works with this (you're asking us for examples that won't and then giving us some, so you don't even have a question left).
This still binds UI to programmatic identifiers in a way that will make both programming and UI changes brittle.
It still assumes your program will only be used in one language. Either your potential market is so small that just indexing an array of names would be scalable enough (e.g. a one-client bespoke or in-house project), or you are assuming you will never be successful enough to need to be available in other languages or other dialects of your first-chosen language.
Does "well, it'll work as long as we're a failure" sound like a passing grade in balancing designs?
Either code it to use resources, or else code it to pass the enum name blindly or use an array of names, as that at least will be modifiable afterwards.

Problem comparing French character Î

When comparing "Île" and "Ile", C# does not consider these to be the same.
string.Equals("Île", "Ile", StringComparison.InvariantCultureIgnoreCase)
For all other accented characters I have come across the comparison works fine.
Is there another comparison function I should use?

You are specifying to compare the strings using the Invariant culture's comparison rules. Evidently, in the invariant culture, the two strings are not considered equal.
You can compare them in a culture-specific manner using String.Compare and providing the culture for which you want to compare the strings:
if(String.Compare("Île", "Ile", new CultureInfo("fr-FR"), CompareOptions.None)==0)
Please note that in the French culture, those strings are also considered different. I included the example to show that it is the culture that defines the sort rules. You might be able to find a culture that fits your requirements, or build a custom one with the needed comparison rules, but that is probably not what you want.
For a good example of normalizing the string so there are no accents, have a look at this question. After normalizing the string, you would be able to compare them and consider them equal. This would probably be the easiest way to implement your requirement.
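A minimal sketch of that normalization idea, assuming the usual decompose-and-strip technique: convert to Unicode form D so that accents become separate combining marks, drop those marks, then compare. `RemoveDiacritics` is a hypothetical helper name, not a framework method.

```csharp
using System;
using System.Globalization;
using System.Linq;
using System.Text;

// Hypothetical helper: strips combining accent marks from a string.
string RemoveDiacritics(string text)
{
    // FormD splits "Î" into "I" followed by a combining circumflex.
    string decomposed = text.Normalize(NormalizationForm.FormD);
    char[] kept = decomposed
        .Where(c => CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
        .ToArray();
    // Recompose whatever is left into the usual precomposed form.
    return new string(kept).Normalize(NormalizationForm.FormC);
}

bool sameBaseLetters = string.Equals(
    RemoveDiacritics("Île"),
    RemoveDiacritics("Ile"),
    StringComparison.OrdinalIgnoreCase);

Console.WriteLine(RemoveDiacritics("Île"));   // Ile
Console.WriteLine(sameBaseLetters);           // True
```

This discards information on purpose, so it only suits cases where accented and unaccented spellings really should be treated as the same text.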
Edit
It is not just the I character that has this behaviour in the InvariantCulture, this statement also returns false:
String.Equals("Ilê", "Ile", StringComparison.InvariantCultureIgnoreCase)
The framework does the right thing: those characters are in fact different (they have different meanings) in most cultures, and therefore they should not be considered the same.
