Default String Sort Order

Default String Sort Order - c#

Is the default sort order an implementation detail? or how is it that the Default Comparer is selected?
It reminds me of the advice. "Don't store HashCodes in the database"
Is the following Code guaranteed to sort the string in the same order?
string[] randomStrings = { "Hello", "There", "World", "The", "Secrete", "To", "Life", };
randomStrings.ToList().Sort();

Strings are always sorted in alphabetical order.
The default (string.CompareTo()) uses the Unicode comparison rules of the current culture:
public int CompareTo(String strB) {
if (strB==null) {
return 1;
}
return CultureInfo.CurrentCulture.CompareInfo.Compare(this, strB, 0);
}

This overload of List<T>.Sort uses the default comparer for strings, which is implemented like this:
This method performs a word (case-sensitive and culture-sensitive)
comparison using the current culture. For more information about word,
string, and ordinal sorts, see System.Globalization.CompareOptions.

If you do not specify a comparer then sort will use the default comparer which sorts alphabetically. So to answer your question yes that code will always return the strings in the same order.
There is an overload to the sort method that allows you to specify your own comparer if you wish to sort the data in a different order.

I wanted to post a reply related to cultural things while you sort your strings but Jon already added it:-(. Yes, I think you may take into acc the issue too because it is selected by default the alphabetical order, the random strings if existed as foreign besides English (i.e Spanish) will be placed after English after all, they appear in the same first letters though. That means you need a globalization namespace to deal with it.
By the way, Timothy, it is secret not secrete :D

Your code creates a new list copying from the array, sorts that list, and then discards it. It does not change the array at all.
try:
Array.Sort(randomStrings);

Related

Odd C# sorting Behavior [duplicate]

I have List like this
List<string> items = new List<string>();
items.Add("-");
items.Add(".");
items.Add("a-");
items.Add("a.");
items.Add("a-a");
items.Add("a.a");
items.Sort();
string output = string.Empty;
foreach (string s in items)
{
output += s + Environment.NewLine;
}
MessageBox.Show(output);
The output is coming back as
-
.
a-
a.
a.a
a-a
where as I am expecting the results as
-
.
a-
a.
a-a
a.a
Any idea why "a-a" is not coming before "a.a" where as "a-" comes before "a."

I suspect that in the last case "-" is treated in a different way due to culture-specific settings (perhaps as a "dash" as opposed to "minus" in the first strings). MSDN warns about this:
The comparison uses the current culture to obtain culture-specific
information such as casing rules and the alphabetic order of
individual characters. For example, a culture could specify that
certain combinations of characters be treated as a single character,
or uppercase and lowercase characters be compared in a particular way,
or that the sorting order of a character depends on the characters
that precede or follow it.
Also see in this MSDN page:
The .NET Framework uses three distinct ways of sorting: word sort,
string sort, and ordinal sort. Word sort performs a culture-sensitive
comparison of strings. Certain nonalphanumeric characters might have
special weights assigned to them; for example, the hyphen ("-") might
have a very small weight assigned to it so that "coop" and "co-op"
appear next to each other in a sorted list. String sort is similar to
word sort, except that there are no special cases; therefore, all
nonalphanumeric symbols come before all alphanumeric characters.
Ordinal sort compares strings based on the Unicode values of each
element of the string.
So, hyphen gets a special treatment in the default sort mode in order to make the word sort more "natural".
You can get "normal" ordinal sort if you specifically turn it on:
Console.WriteLine(string.Compare("a.", "a-")); //1
Console.WriteLine(string.Compare("a.a", "a-a")); //-1
Console.WriteLine(string.Compare("a.", "a-", StringComparison.Ordinal)); //1
Console.WriteLine(string.Compare("a.a", "a-a", StringComparison.Ordinal)); //1
To sort the original collection using ordinal comparison use:
items.Sort(StringComparer.Ordinal);

If you want your string sort to be based on the actual byte value as opposed to the rules defined by the current culture you can sort by Ordinal:
items.Sort(StringComparer.Ordinal);
This will make the results consistent across all cultures (but it will produce unintuitive sortings of "14" coming before "9" which may or may not be what you're looking for).

The Sort method of the List<> class relies on the default string comparer of the .NET Framework, which is actually an instance of the current CultureInfo of the Thread.
The CultureInfo specifies the alphabetical order of characters and it seems that the default one is using an order different order to what you would expect.
When sorting you can specify a specific CultureInfo, one that you know will match your sorting requirements, sample (german culture):
var sortCulture = new CultureInfo("de-DE");
items.Sort(sortCulture);
More info can be found here:
http://msdn.microsoft.com/en-us/library/b0zbh7b6.aspx
http://msdn.microsoft.com/de-de/library/system.stringcomparer.aspx

String sorting issue in C#

I have List like this
List<string> items = new List<string>();
items.Add("-");
items.Add(".");
items.Add("a-");
items.Add("a.");
items.Add("a-a");
items.Add("a.a");
items.Sort();
string output = string.Empty;
foreach (string s in items)
{
output += s + Environment.NewLine;
}
MessageBox.Show(output);
The output is coming back as
-
.
a-
a.
a.a
a-a
where as I am expecting the results as
-
.
a-
a.
a-a
a.a
Any idea why "a-a" is not coming before "a.a" where as "a-" comes before "a."

I suspect that in the last case "-" is treated in a different way due to culture-specific settings (perhaps as a "dash" as opposed to "minus" in the first strings). MSDN warns about this:
The comparison uses the current culture to obtain culture-specific
information such as casing rules and the alphabetic order of
individual characters. For example, a culture could specify that
certain combinations of characters be treated as a single character,
or uppercase and lowercase characters be compared in a particular way,
or that the sorting order of a character depends on the characters
that precede or follow it.
Also see in this MSDN page:
The .NET Framework uses three distinct ways of sorting: word sort,
string sort, and ordinal sort. Word sort performs a culture-sensitive
comparison of strings. Certain nonalphanumeric characters might have
special weights assigned to them; for example, the hyphen ("-") might
have a very small weight assigned to it so that "coop" and "co-op"
appear next to each other in a sorted list. String sort is similar to
word sort, except that there are no special cases; therefore, all
nonalphanumeric symbols come before all alphanumeric characters.
Ordinal sort compares strings based on the Unicode values of each
element of the string.
So, hyphen gets a special treatment in the default sort mode in order to make the word sort more "natural".
You can get "normal" ordinal sort if you specifically turn it on:
Console.WriteLine(string.Compare("a.", "a-")); //1
Console.WriteLine(string.Compare("a.a", "a-a")); //-1
Console.WriteLine(string.Compare("a.", "a-", StringComparison.Ordinal)); //1
Console.WriteLine(string.Compare("a.a", "a-a", StringComparison.Ordinal)); //1
To sort the original collection using ordinal comparison use:
items.Sort(StringComparer.Ordinal);

If you want your string sort to be based on the actual byte value as opposed to the rules defined by the current culture you can sort by Ordinal:
items.Sort(StringComparer.Ordinal);
This will make the results consistent across all cultures (but it will produce unintuitive sortings of "14" coming before "9" which may or may not be what you're looking for).

The Sort method of the List<> class relies on the default string comparer of the .NET Framework, which is actually an instance of the current CultureInfo of the Thread.
The CultureInfo specifies the alphabetical order of characters and it seems that the default one is using an order different order to what you would expect.
When sorting you can specify a specific CultureInfo, one that you know will match your sorting requirements, sample (german culture):
var sortCulture = new CultureInfo("de-DE");
items.Sort(sortCulture);
More info can be found here:
http://msdn.microsoft.com/en-us/library/b0zbh7b6.aspx
http://msdn.microsoft.com/de-de/library/system.stringcomparer.aspx

How to generate a unique string from a string collection?

I need a way to convert a strings collection into a unique string. This means that I need to have a different string if any of the strings inside the collection has changed.
I'm working on a big solution so I may wont be able to work with some better ideas. The required unique string will be used to compare the 2 collections, so different strings means different collections. I cannot compare the strings inside one by one because the order may change plus the solution is already built to return result based on 2 strings comparison. This is an add-on. The generated string will be passed as parameter for this comparison.
Thank you!

These both work by deciding to use the separator character of ":" and also using an escape character to make it clear when we mean something else by the separator character. We therefore just need to escape all our strings before concatenating them with our separator in between. This gives us unique strings for every collection. All we need to do if we want to make collections the same regardless or order is to sort our collection before we do anything. I should add that my sample uses LINQ and thus assumes the collection implements IEnumerable<string> and that you have a using declaration for System.LINQ
You can wrap that up in a function as follows
string GetUniqueString(IEnumerable<string> Collection, bool OrderMatters = true, string Escape = "/", string Separator = ":")
{
if(Escape == Separator)
throw new Exception("Escape character should never equal separator character because it fails in the case of empty strings");
if(!OrderMatters)
Collection = Collection.OrderBy(v=>v);//Sorting fixes ordering issues.
return Collection
.Select(v=>v.Replace(Escape, Escape + Escape).Replace(Separator,Escape + Separator))//Escape String
.Aggregate((a,b)=>a+Separator+b);
}

What about using a hash function?

Considering you constraints, use a delimited approach:
pick a delimiter and an escape method.
e.g. use ; and escape it bwithin strings y \;, also escape \ by \\
So this list of strings...
"A;bc"
"D\ef;"
...becomes "A\;bc;D\\ef\;"
It ain't pretty, but considering that it has to be a string, then the good old ways of csv and its brethren isn't all too bad.

By a "collection string" you mean "collection of strings"?
Here's a naive (but working) approach: sort the collection (to eliminate dependency on order), concat them, and take a hash of that (MD5 for instance).
Trivial to implement, but not very clever performance-wise.

Are you saying that you need to encode a string collection as a string. So for example the collection {"abc", "def"} may be encoded as "sDFSDFSDFSD" but {"a", "b"} might be encoded as "SDFeg". If so and you don't care about unique keys then you could use something like SHA or MD5.

Why does string.Compare seem to handle accented characters inconsistently?

If I execute the following statement:
string.Compare("mun", "mün", true, CultureInfo.InvariantCulture)
The result is '-1', indicating that 'mun' has a lower numeric value than 'mün'.
However, if I execute this statement:
string.Compare("Muntelier, Schweiz", "München, Deutschland", true, CultureInfo.InvariantCulture)
I get '1', indicating that 'Muntelier, Schewiz' should go last.
Is this a bug in the comparison? Or, more likely, is there a rule I should be taking into account when sorting strings containing accented
The reason this is an issue is, I'm sorting a list and then doing a manual binary filter that's meant to get every string beginning with 'xxx'.
Previously I was using the Linq 'Where' method, but now I have to use this custom function written by another person, because he says it performs better.
But the custom function doesn't seem to take into account whatever 'unicode' rules .NET has. So if I tell it to filter by 'mün', it doesn't find any items, even though there are items in the list beginning with 'mun'.
This seems to be because of the inconsistent ordering of accented characters, depending on what characters go after the accented character.
OK, I think I've fixed the problem.
Before the filter, I do a sort based on the first n letters of each string, where n is the length of the search string.

There is a tie-breaking algorithm at work, see http://unicode.org/reports/tr10/
To address the complexities of
language-sensitive sorting, a
multilevel comparison algorithm is
employed. In comparing two words, for
example, the most important feature is
the base character: such as the
difference between an A and a B.
Accent differences are typically
ignored, if there are any differences
in the base letters. Case differences
(uppercase versus lowercase), are
typically ignored, if there are any
differences in the base or accents.
Punctuation is variable. In some
situations a punctuation character is
treated like a base character. In
other situations, it should be ignored
if there are any base, accent, or case
differences. There may also be a
final, tie-breaking level, whereby if
there are no other differences at all
in the string, the (normalized) code
point order is used.
So, "Munt..." and "Münc..." are alphabetically different and sort based on the "t" and "c".
Whereas, "mun" and "mün" are alphabetically the same ("u" equivelent to "ü" in lost languages) so the character codes are compared

It looks like the accented character is only being used in a sort of "tie-break" situation - in other words, if the strings are otherwise equal.
Here's some sample code to demonstrate:
using System;
using System.Globalization;
class Test
{
static void Main()
{
Compare("mun", "mün");
Compare("muna", "münb");
Compare("munb", "müna");
}
static void Compare(string x, string y)
{
int result = string.Compare(x, y, true,
CultureInfo.InvariantCulture));
Console.WriteLine("{0}; {1}; {2}", x, y, result);
}
}
(I've tried adding a space after the "n" as well, to see if it was done on word boundaries - it isn't.)
Results:
mun; mün; -1
muna; münb; -1
munb; müna; 1
I suspect this is correct by various complicated Unicode rules - but I don't know enough about them.
As for whether you need to take this into account... I wouldn't expect so. What are you doing that is thrown by this?

As I understand this it is still somewhat consistent. When comparing using CultureInfo.InvariantCulture the umlaut character ü is treated like the non-accented character u.
As the strings in your first example obviously are not equal the result will not be 0 but -1 (which seems to be a default value). In the second example Muntelier goes last because t follows c in the alphabet.
I couldn't find any clear documentation in MSDN explaining these rules, but I found that
string.Compare("mun", "mün", CultureInfo.InvariantCulture,
CompareOptions.StringSort);
and
string.Compare("Muntelier, Schweiz", "München, Deutschland",
CultureInfo.InvariantCulture, CompareOptions.StringSort);
gives the desired result.
Anyway, I think you'd be better off to base your sorting on a specific culture such as the current user's culture (if possible).

Which Version of StringComparer to use

If I want to have a case-insensitive string-keyed dictionary, which version of StringComparer should I use given these constraints:
The keys in the dictionary come from either C# code or config files written in english locale only (either US, or UK)
The software is internationalized and will run in different locales
I normally use StringComparer.InvariantCultureIgnoreCase but wasn't sure if that is the correct case. Here is example code:
Dictionary< string, object> stuff = new Dictionary< string, object>(StringComparer.InvariantCultureIgnoreCase);

There are three kinds of comparers:
Culture-aware
Culture invariant
Ordinal
Each comparer has a case-sensitive as well as a case-insensitive version.
An ordinal comparer uses ordinal values of characters. This is the fastest comparer, it should be used for internal purposes.
A culture-aware comparer considers aspects that are specific to the culture of the current thread. It knows the "Turkish i", "Spanish LL", etc. problems. It should be used for UI strings.
The culture invariant comparer is actually not defined and can produce unpredictable results, and thus should never be used at all.
References
New Recommendations for Using Strings in Microsoft .NET 2.0

This MSDN article covers everything you could possibly want to know in great depth, including the Turkish-I problem.
It's been a while since I read it, so I'm off to do so again. See you in an hour!

The concept of "case insensitive" is a linguistic one, and so it doesn't make sense without a culture.
See this blog for more information.
That said if you are just talking about strings using the latin alphabet then you will probably get away with the InvariantCulture.
It is probably best to create the dictionary with StringComparer.CurrentCulture, though. This will allow "ß" to match "ss" in your dictionary under a German culture, for example.

Since the keys are your known fixed values, then either InvariantCultureIgnoreCase or OrdinalIgnoreCase should work. Avoid the culture-specific one, or you could hit some of the more "fun" things like the "Turkish i" problem. Obviously, you'd use a cultured comparer if you were comparing cultured values... but it sounds like you aren't.

StringComparer.OrdinalIgnoreCase is slightly faster than InvariantCultureIgnoreCase FWIW ("An ordinal comparison is fast, but culture-insensitive" according to MSDN.
You'd have to be doing a lot of comparisons to notice the difference of course.

The Invariant Culture exists specifically to deal with strings that are internal to the program and have nothing to do with user data or UI. It sounds like this is the case for this situation.

System.Collections.Specialized includes StringDictionary. The Remarks section of the MSDN states "A key cannot be null, but a value can.
The key is handled in a case-insensitive manner; it is translated to lowercase before it is used with the string dictionary.
In .NET Framework version 1.0, this class uses culture-sensitive string comparisons. However, in .NET Framework version 1.1 and later, this class uses CultureInfo.InvariantCulture when comparing strings. For more information about how culture affects comparisons and sorting, see Comparing and Sorting Data for a Specific Culture and Performing Culture-Insensitive String Operations.

We Keep Coding

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.

Default String Sort Order - c#

Strings are always sorted in alphabetical order. The default (string.CompareTo()) uses the Unicode comparison rules of the current culture: public int CompareTo(String strB) { if (strB==null) { return 1; } return CultureInfo.CurrentCulture.CompareInfo.Compare(this, strB, 0); }

Your code creates a new list copying from the array, sorts that list, and then discards it. It does not change the array at all. try: Array.Sort(randomStrings);

Related

Odd C# sorting Behavior [duplicate]

String sorting issue in C#

How to generate a unique string from a string collection?

Why does string.Compare seem to handle accented characters inconsistently?

Which Version of StringComparer to use

Categories

Resources