Alphabetical order does not compare from left to right? - c#

I thought that in .NET strings were compared alphabetically and that they were compared from left to right.
string[] strings = { "-1", "1", "1Foo", "-1Foo" };
Array.Sort(strings);
Console.WriteLine(string.Join(",", strings));
I'd expect this (or both strings with a leading minus first):
1,1Foo,-1,-1Foo
But the result is:
1,-1,1Foo,-1Foo
It seems to be a mixture, either the minus sign is ignored or multiple characters are compared even if the first character was already different.
Edit: I've now tested OrdinalIgnoreCase and I get the expected order:
Array.Sort(strings, StringComparer.OrdinalIgnoreCase);
But even if I use InvariantCultureIgnoreCase, I still get the unexpected order.

Jon Skeet to the rescue here
Specifically:
The .NET Framework uses three distinct ways of sorting: word sort,
string sort, and ordinal sort. Word sort performs a culture-sensitive
comparison of strings. Certain nonalphanumeric characters might have
special weights assigned to them. For example, the hyphen ("-") might
have a very small weight assigned to it so that "coop" and "co-op"
appear next to each other in a sorted list. String sort is similar to
word sort, except that there are no special cases. Therefore, all
nonalphanumeric symbols come before all alphanumeric characters.
Ordinal sort compares strings based on the Unicode values of each
element of the string.
But adding the StringComparer.Ordinal makes it behave as you want:
string[] strings = { "-1", "1", "10", "-10", "a", "ba", "-a" };
Array.Sort(strings, StringComparer.Ordinal);
Console.WriteLine(string.Join(",", strings));
// prints: -1,-10,-a,1,10,a,ba
Edit:
About the Ordinal, quoting from MSDN CompareOptions Enumeration
Ordinal Indicates that the string comparison must use successive
Unicode UTF-16 encoded values of the string (code unit by code unit
comparison), leading to a fast comparison but one that is
culture-insensitive. A string starting with a code unit XXXX₁₆ comes
before a string starting with YYYY₁₆, if XXXX₁₆ is less than YYYY₁₆.
This value cannot be combined with other CompareOptions values and
must be used alone.
There is also String.CompareOrdinal if you want to compare two strings ordinally.
Here's another note of interest:
When possible, the application should use string comparison methods
that accept a CompareOptions value to specify the kind of comparison
expected. As a general rule, user-facing comparisons are best served
by the use of linguistic options (using the current culture), while
security comparisons should specify Ordinal or OrdinalIgnoreCase.
I guess we humans expect ordinal when dealing with strings :)
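As a minimal sketch of the difference discussed above, the following sorts the question's array with an ordinal comparer and with a culture-sensitive one. The class and method names (SortModesDemo, SortWith) are made up for illustration. Only the ordinal result is stated, because the culture-sensitive order depends on the runtime and its culture data.

```csharp
using System;

static class SortModesDemo
{
    // Hypothetical helper: sorts a copy of the input with the given comparer
    // and joins the result with commas.
    public static string SortWith(string[] items, StringComparer comparer)
    {
        var copy = (string[])items.Clone();
        Array.Sort(copy, comparer);
        return string.Join(",", copy);
    }

    static void Main()
    {
        string[] strings = { "-1", "1", "1Foo", "-1Foo" };

        // Ordinal: compares UTF-16 code units, so '-' (U+002D) sorts before '1' (U+0031).
        Console.WriteLine(SortWith(strings, StringComparer.Ordinal)); // -1,-1Foo,1,1Foo

        // Culture-sensitive (word sort): the hyphen may get a very small weight.
        // The exact order depends on the runtime, so no output is claimed here.
        Console.WriteLine(SortWith(strings, StringComparer.InvariantCulture));

        // String.CompareOrdinal returns a negative number when the first string sorts first.
        Console.WriteLine(string.CompareOrdinal("-1", "1") < 0); // True
    }
}
```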

There is a small note on the String.CompareTo method documentation:
Notes to Callers:
Character sets include ignorable characters. The
CompareTo(String) method does not consider such characters when it
performs a culture-sensitive comparison. For example, if the following
code is run on the .NET Framework 4 or later, a comparison of "animal"
with "ani-mal" (using a soft hyphen, or U+00AD) indicates that the two
strings are equivalent.
And then a little later states:
To recognize ignorable characters in a string comparison, call the
CompareOrdinal(String, String) method.
These two statements seem to be consistent with the results you are seeing.

Related

List of ignorable characters for string comparison

Culture sensitive comparison in C# does not take into account "ignorable characters":
Character sets include ignorable characters. The Compare(String, String) method does not consider such characters when it performs a culture-sensitive comparison. For example, a culture-sensitive comparison of "animal" with "ani-mal" (using a soft hyphen, or U+00AD) indicates that the two strings are equivalent, as the following example shows.
Where can I find complete list of such characters and maybe some details of comparison of strings containing ignorable characters?
All Unicode code points have a "default ignorable" property that is specified by the Unicode consortium; I would be very surprised if the .NET concept of ignorable characters is in any way different from the value of that property.
The definitive resource on which characters are default-ignorable is the Unicode standard, specifically section 5.21 (link to chapter 5 PDF for Unicode v6.2.0).

string.IndexOf() not recognizing modified characters

When using IndexOf to find a char that is followed by a high-valued char (e.g. char 700, which is ʼ), IndexOf fails to recognize the char you are looking for.
e.g.
string find = "abcʼabcabc";
int index = find.IndexOf("c");
In this code, index should be 2, but it returns 6.
Is there a way to get around this?
Unicode letter 700 is a modifier apostrophe: in other words, it modifies the letter c. In the same way, if you were to use an 'e' followed by character 769 (0x301), it would not really be an 'e' anymore: the e has been modified to be e with an acute accent. To wit: é. You'll see that letter is actually two characters: copy it to notepad and hit backspace (neat, huh?).
You need to do an "Ordinal" comparison (code unit by code unit) without any linguistic comparison. That will find the 'c' and ignore the linguistic fact that it is modified by the next character. In my 'e' example, the code units are (101)(769), so if you go unit by unit looking for 101, you will find it, ignoring the fact that (101)(769) is linguistically the same as (233): é. If you search for (233) linguistically, it will find the "equivalent" (101)(769):
string find = "abe\u0301abcabc"; // 'e' (101) followed by the combining acute accent (769)
int index = find.IndexOf("\u00E9"); // é (233): gives you 2 even though the match in "find" is two characters and the search string is one
Hopefully that's not too confusing. If you're doing this in real code you should explain in comments exactly what you're doing: as in my 'e' example generally you would want to do semantic equivalence for user data, and ordinal equivalence for e.g. constants (which hopefully wouldn't be different like this, lest your successor hunt you down with an axe).
The cʼ construct is being treated linguistically rather than as its raw code units. Use the Ordinal string comparison to force a code-unit-by-code-unit comparison.
string find = "abcʼabcabc";
int index = find.IndexOf("c", StringComparison.Ordinal);
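A small sketch of the ordinal search described above, written with explicit \u escapes so the code units are unambiguous. The names IndexOfDemo and OrdinalIndexOf are hypothetical. Only ordinal results are shown as expected output, since linguistic results can vary by runtime.

```csharp
using System;

static class IndexOfDemo
{
    // Ordinal search: matches UTF-16 code units and ignores linguistic rules.
    public static int OrdinalIndexOf(string text, string value) =>
        text.IndexOf(value, StringComparison.Ordinal);

    static void Main()
    {
        // \u02BC is the modifier letter apostrophe (decimal 700) from the question.
        string find = "abc\u02BCabcabc";
        Console.WriteLine(OrdinalIndexOf(find, "c")); // 2

        // 'e' followed by U+0301 (combining acute accent) occupies two code units.
        string decomposed = "abe\u0301abcabc";
        Console.WriteLine(decomposed.Length);                    // 10
        Console.WriteLine(OrdinalIndexOf(decomposed, "e"));      // 2

        // The precomposed é (U+00E9) does not occur as a code unit, so ordinal search fails.
        Console.WriteLine(OrdinalIndexOf(decomposed, "\u00E9")); // -1
    }
}
```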

Regex Split based on multiple delimiters in C#

I have a string of the form "KeyOperatorValue1,Value2,Value3...", for example "version>=5" or "lang=en,fr,es". Currently the possible values for the operator field are "=", "!=", ">", ">=", "<", "<=", but I don't want it to be limited to those. Given such a string, how can I split it into a triplet?
Since the operators' string representations are not mutually exclusive ("=" is a substring of ">="), I can't use public string[] Split(string[] separator, StringSplitOptions options), and Regex.Split doesn't have a variant that takes multiple regexes as parameters.
Since you have not mentioned the format of your input, I have made certain assumptions.
I have assumed that:
the key always contains alphanumeric characters
the values are always alphanumeric characters, optionally separated by ,
the key and the values are separated by non-word characters
(?<key>\w+)(?<operand>[^\w,]+)(?<value>[\w,]+)
So this matches a string as the operand if it contains neither , nor any of [a-zA-Z\d_].
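You can use this code:
string pattern = @"(?<key>\w+)(?<operand>[^\w,]+)(?<value>[\w,]+)";
var lst = Regex.Matches(input, pattern)
    .Cast<Match>()
    .Select(x => new
    {
        key = x.Groups["key"].Value,
        operand = x.Groups["operand"].Value,
        value = x.Groups["value"].Value
    });
You can now iterate over lst:
foreach (var l in lst)
{
    // Each item exposes the three parts of one match.
    Console.WriteLine($"{l.key} {l.operand} {l.value}");
}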
Regex has an "or" operator (the separators will be included in the result, though):
Regex.Split(sourceString, @"(>=)|(<=)|(!=)|(=)|(>)|(<)");
You don't have to use regular expressions to accomplish that. Simply store the operators in an array, sorted by length with the longest first, so that ">=" is tried before "=". Iterate over the operators and get the position of each operator using IndexOf(). Then use Substring() to extract the key and the values from your input string.
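The IndexOf/Substring approach described above could be sketched as follows. This is only one possible reading: the operator list is the one from the question, the names OperatorSplitDemo and Split are hypothetical, and the sketch picks the leftmost operator, preferring the longer one on a tie.

```csharp
using System;

static class OperatorSplitDemo
{
    // Assumed operator set, longest first so ">=" is tried before ">" and "=".
    static readonly string[] Operators = { ">=", "<=", "!=", "=", ">", "<" };

    // Returns (key, operator, value), or null when no operator is found.
    public static (string Key, string Op, string Value)? Split(string input)
    {
        int bestIndex = -1;
        string bestOp = "";
        foreach (string op in Operators)
        {
            int i = input.IndexOf(op, StringComparison.Ordinal);
            // Keep the leftmost match; on a tie the longer operator was seen first.
            if (i >= 0 && (bestIndex < 0 || i < bestIndex))
            {
                bestIndex = i;
                bestOp = op;
            }
        }
        if (bestIndex < 0) return null;
        return (input.Substring(0, bestIndex), bestOp,
                input.Substring(bestIndex + bestOp.Length));
    }

    static void Main()
    {
        Console.WriteLine(Split("version>=5"));    // (version, >=, 5)
        Console.WriteLine(Split("lang=en,fr,es")); // (lang, =, en,fr,es)
    }
}
```

Using StringComparison.Ordinal here avoids the culture-sensitive matching surprises discussed in the earlier questions.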
You can just use branching to provide multiple alternatives. There are multiple possibilities to achieve this, one example would be this:
(\w+)([!<>]?=|[<>])(.*)
As you can see this expression contains three separate capture groups:
(\w+): This will match one or more "word" characters (alphanumeric characters and underscores); the + requires the sequence to be at least one character long.
([!<>]?=|[<>]): This expression matches the operators given in your example. The first half ([!<>]?=) will match any of the characters inside [] (or skip it (?)) followed by =. The alternative simply matches < or >.
(.*): This will match any character (or nothing), whatever follows till the end of the string/line.
So when you match the expression, you'll get a total of 4 groups: group 0 is the entire match, followed by the three captures:
1: The name of the key.
2: The operator used.
3: The actual value given.
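For illustration, here is a short sketch that applies the expression above with Regex.Match and reads the three capture groups. The class and method names (TripletRegexDemo, Parse) are invented for this example.

```csharp
using System;
using System.Text.RegularExpressions;

static class TripletRegexDemo
{
    // The expression from the answer: key, operator, value.
    const string Pattern = @"(\w+)([!<>]?=|[<>])(.*)";

    // Returns { key, operator, value }, or an empty array when nothing matches.
    public static string[] Parse(string input)
    {
        Match m = Regex.Match(input, Pattern);
        if (!m.Success) return Array.Empty<string>();
        // Groups[0] is the whole match; 1..3 are the captures.
        return new[] { m.Groups[1].Value, m.Groups[2].Value, m.Groups[3].Value };
    }

    static void Main()
    {
        Console.WriteLine(string.Join(" | ", Parse("version>=5")));    // version | >= | 5
        Console.WriteLine(string.Join(" | ", Parse("lang=en,fr,es"))); // lang | = | en,fr,es
    }
}
```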
Edit:
If you'd like to match other operators as well, you'd have to add them as additional branches in the second matching group:
(\w+)([!<>]?=|[<>]|HERE)(.*)
Just keep in mind that there is, in general, no 100% perfect way to match any operator without defining the exact characters that should be considered valid operators (or components of an operator).

about string.compare method

A strange question. My code is:
static void Main(string[] args)
{
Console.WriteLine(string.Compare("-", "a"));//output -1
Console.WriteLine(string.Compare("-d", "a"));//output 1
Console.Read();
}
Who can tell me why?
By default, string comparison uses culture-specific settings. These settings allow for varying orders and weights to be applied to letters and symbols; for instance, "resume" and "résumé" will appear fairly close to each other when sorting using most culture settings, because "é" is ordered just after "e" and well before "f", even though the Unicode code chart places é well after the rest of the English alphabet. Similarly, symbols that are not whitespace and take up a position in the string but are considered "connective", such as dashes and slashes, are given low "weight", so they are only considered as tie-breakers. That means that "a-b" would be sorted just after "ab" and before "ac", because the dash is less important than the letters.
What you think you want is "ordinal sorting", where strings are sorted based on the first difference in the string, using the relative ordinal positions of the differing characters in Unicode. This places "-d" before "a" because "-" comes before "a": the dash is considered a full "character" and is compared to the character "a" in the same position. However, in a list of real words, an ordinal sort would order "re-do", "redo", "resume", "rosin", "ruble", and "résumé" in that order: "re-do" jumps ahead of "redo" because "-" has a lower code point than "d", and "résumé" lands at the very end because "é" has a higher code point than every unaccented letter. That may not make sense in context, and certainly not to a non-English speaker.
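A minimal sketch of ordinal sorting applied to that word list, with "é" written as \u00E9 so the code point is explicit. The names OrdinalWordsDemo and SortOrdinal are hypothetical. The console may need a Unicode-capable encoding to display the accented word correctly.

```csharp
using System;

static class OrdinalWordsDemo
{
    // Sorts a copy of the words by UTF-16 code unit value.
    public static string[] SortOrdinal(string[] words)
    {
        var copy = (string[])words.Clone();
        Array.Sort(copy, StringComparer.Ordinal);
        return copy;
    }

    static void Main()
    {
        string[] words = { "r\u00E9sum\u00E9", "ruble", "redo", "re-do", "rosin", "resume" };

        // '-' (U+002D) < 'd' (U+0064), and '\u00E9' is above every ASCII letter.
        // Ordinal order: re-do, redo, resume, rosin, ruble, then the accented word last.
        Console.WriteLine(string.Join(", ", SortOrdinal(words)));

        // The comparison from the question, done ordinally: "-d" sorts before "a".
        Console.WriteLine(string.CompareOrdinal("-d", "a") < 0); // True
    }
}
```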
It compares the relative positions of the characters. In other words, "-" comes before (is less than) "a".
String.Compare() uses word sort rules when comparing. Mind you, these are all relative positions. Here is some information from MSDN.
Value : Condition
Negative : strA is less than strB
Zero : strA equals strB
Positive : strA is greater than strB
The above comparison applies to this overload:
public static int Compare(
string strA,
string strB
)
The - is treated as a special case in sorting by the .NET Framework. This answer has the details: https://stackoverflow.com/a/9355086/1180433

Using InvariantCultureIgnoreCase instead of ToUpper for case-insensitive string comparisons

On this page, a commenter writes:
Do NOT ever use .ToUpper to ensure comparing strings is case-insensitive.
Instead of this:
type.Name.ToUpper() == (controllerName.ToUpper() + "Controller".ToUpper())
Do this:
type.Name.Equals(controllerName + "Controller",
StringComparison.InvariantCultureIgnoreCase)
Why is this way preferred?
Here is the answer in detail: The Turkey Test (read section 3)
As discussed by lots and lots of
people, the "I" in Turkish behaves
differently than in most languages.
Per the Unicode standard, our
lowercase "i" becomes "İ" (U+0130
"Latin Capital Letter I With Dot
Above") when it moves to uppercase.
Similarly, our uppercase "I" becomes
"ı" (U+0131 "Latin Small Letter
Dotless I") when it moves to
lowercase.
Fix: Again, use an ordinal (raw byte)
comparer, or invariant culture for
comparisons unless you absolutely need
culturally based linguistic
comparisons (which give you uppercase
I's with dots in Turkey)
And according to Microsoft you should not even be using the Invariant... but the Ordinal... (New Recommendations for Using Strings in Microsoft .NET 2.0)
In short, it's optimized by the CLR, and it uses less memory as well, because no temporary uppercased strings have to be allocated.
Further, uppercase comparison is more optimized than ToLower(), if that tiny degree of performance matters.
In response to your example there is a faster way yet:
String.Equals(type.Name, controllerName + "Controller",
StringComparison.InvariantCultureIgnoreCase);
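To make the Turkish "I" problem concrete, here is a hedged sketch. It assumes the runtime has culture data for "tr-TR" available (in invariant globalization mode the culture may be unavailable or behave differently), so no output is claimed for the culture-specific line. The names TurkeyTestDemo and EqualsIgnoreCase are made up for illustration.

```csharp
using System;
using System.Globalization;

static class TurkeyTestDemo
{
    // Case-insensitive equality that does not depend on the current culture.
    public static bool EqualsIgnoreCase(string a, string b) =>
        string.Equals(a, b, StringComparison.OrdinalIgnoreCase);

    static void Main()
    {
        // Assumption: culture data for Turkish is installed.
        var turkish = new CultureInfo("tr-TR");

        // Under Turkish casing rules "i" uppercases to U+0130 (I with dot above),
        // so an uppercase-then-compare check can fail for plain ASCII text.
        Console.WriteLine("file".ToUpper(turkish) == "FILE");

        // The ordinal, case-insensitive comparison is unaffected by culture.
        Console.WriteLine(EqualsIgnoreCase("file", "FILE")); // True
    }
}
```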
