String Comparison differences between .NET and T-SQL?

In a test case I've written, the string comparison doesn't appear to work the same way between SQL server / .NET CLR.
This C# code:
string lesser = "SR2-A1-10-90";
string greater = "SR2-A1-100-10";
Debug.WriteLine(string.Compare("A","B"));
Debug.WriteLine(string.Compare(lesser, greater));
Will output:
-1
1
This SQL Server code:
declare @lesser varchar(20);
declare @greater varchar(20);
set @lesser = 'SR2-A1-10-90';
set @greater = 'SR2-A1-100-10';
IF @lesser < @greater
SELECT 'Less Than';
ELSE
SELECT 'Greater than';
Will output:
Less Than
Why the difference?

This is documented here.
Windows collations (e.g. Latin1_General_CI_AS) use Unicode type collation rules. SQL Collations don't.
This causes the hyphen character to be treated differently between the two.

Further to gbn's answer, you can make them behave the same by using CompareOptions.StringSort in C# (or by using StringComparison.Ordinal). This treats symbols as occurring before alphanumeric symbols, so "-" < "0".
However, Unicode vs ASCII doesn't explain anything, as the hex codes for the ASCII codepage are translated verbatim to the Unicode codepage: "-" is 002D (45) while "0" is 0030 (48).
What is happening is that .NET is using "linguistic" sorting by default, which is based on a non-ordinal ordering and weight applied to various symbols by the specified or current culture. This linguistic algorithm allows, for instance, "résumé" (spelled with accents) to appear immediately following "resume" (spelled without accents) in a sorted list of words, as "é" is given a fractional order just after "e" and well before "f". It also allows "cooperation" and "co-operation" to be placed closely together, as the dash symbol is given low "weight"; it matters only as the absolute final tiebreaker when sorting words like "bits", "bit's", and "bit-shift" (which would appear in that order).
So-called ordinal sorting (strictly according to Unicode values, with or without case insensitivity) will produce very different and sometimes illogical results, as variants of letters usually appear well after the basic undecorated Latin alphabet in ASCII/Unicode ordinals, while symbols occur before it. For instance, "é" comes after "z", and so the words "resume", "rosin", "ruble", "résumé" would be sorted in that order. "Bit's", "Bit-shift", "Biter", "Bits" would be sorted in that order, as the apostrophe comes first, followed by the dash, then the letter "e", then the letter "s". Neither of these seems logical from a "natural language" perspective.
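As a rough sketch of the difference described above, here are the strings from the question compared three ways. The class and method names (HyphenCompareDemo, OrdinalSign, LinguisticSign) are made up for illustration. Only the ordinal result is pinned down by the character codes; the two linguistic results depend on the collation data your runtime uses, so I have not annotated them.

```csharp
using System;
using System.Globalization;

static class HyphenCompareDemo
{
    // Ordinal: compares UTF-16 code units, so '-' (0x2D) sorts before '0' (0x30).
    public static int OrdinalSign(string a, string b) =>
        Math.Sign(string.Compare(a, b, StringComparison.Ordinal));

    // Linguistic comparison with the invariant culture. The outcome depends on
    // the runtime's collation data, so treat it as something to observe.
    public static int LinguisticSign(string a, string b, CompareOptions options) =>
        Math.Sign(CultureInfo.InvariantCulture.CompareInfo.Compare(a, b, options));

    static void Main()
    {
        const string lesser = "SR2-A1-10-90";
        const string greater = "SR2-A1-100-10";

        Console.WriteLine(OrdinalSign(lesser, greater));   // -1
        Console.WriteLine(LinguisticSign(lesser, greater, CompareOptions.None));
        Console.WriteLine(LinguisticSign(lesser, greater, CompareOptions.StringSort));
    }
}
```

Math.Sign is used because string.Compare only promises a negative, zero, or positive value, not exactly -1 or 1.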

In SQL you used varchar, which is basically ASCII (subject to collation) and which will give - before 0.
In C#, all strings are Unicode.
The finer points of UTF-xx (C#) vs UCS-2 (SQL Server) are quite tricky.
Edit:
I posted too soon
I get "Greater Than" on SQL Server 2008 with collation Latin1_General_CI_AS
Edit 2:
I'd also try SELECT ASCII(...) on your dash. For example, if the SQL snippet has ever been in a Word document, the – (150) is not the - (45) I copied into SQL Server for testing out of my browser from your question. See CP 1252 (= CP1 in SQL Server lingo).
Edit 3: See Martin Smith's answer: the 2 collations have different sort orders.
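The same "is that really a plain hyphen?" check can be done on the C# side. This is only a sketch; DashCheck and NonAsciiChars are hypothetical names, and the sample string with an en dash (U+2013, which is byte 150 in CP 1252) is invented for the demonstration.

```csharp
using System;
using System.Collections.Generic;

static class DashCheck
{
    // Returns "index:U+XXXX" for every character outside the 7-bit ASCII range.
    public static List<string> NonAsciiChars(string text)
    {
        var found = new List<string>();
        for (int i = 0; i < text.Length; i++)
        {
            if (text[i] > 127)
            {
                found.Add($"{i}:U+{(int)text[i]:X4}");
            }
        }
        return found;
    }

    static void Main()
    {
        // \u2013 is the en dash that word processors often substitute for '-'.
        string pasted = "SR2\u2013A1-10-90";
        Console.WriteLine((int)'-');                                  // 45
        Console.WriteLine(string.Join(", ", NonAsciiChars(pasted)));  // 3:U+2013
    }
}
```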

Several great answers already on why this happens, but I'm sure others just want to know the C# code to iterate the collection in the same order as SQL server. I have found the following works best. "Ordinal" gets around the hyphen issue while "IgnoreCase" seems to reflect the SQL server default as well.
Debug.WriteLine(string.Compare(lesser, greater, StringComparison.OrdinalIgnoreCase));
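For sorting a whole collection rather than comparing one pair, the same comparison is available as a comparer. A minimal sketch, assuming your column's collation really does order the data this way (that depends on the collation, so check it against your own data); SqlLikeSort and the sample codes are made up.

```csharp
using System;
using System.Collections.Generic;

static class SqlLikeSort
{
    // Sorts a copy of the input with ordinal, case-insensitive comparison.
    public static List<string> Sort(IEnumerable<string> items)
    {
        var sorted = new List<string>(items);
        sorted.Sort(StringComparer.OrdinalIgnoreCase);
        return sorted;
    }

    static void Main()
    {
        var codes = new[] { "SR2-A1-100-10", "sr2-a1-10-90", "SR2-A1-10-9" };
        foreach (string code in Sort(codes))
        {
            Console.WriteLine(code);
        }
        // Prints SR2-A1-10-9, then sr2-a1-10-90, then SR2-A1-100-10.
    }
}
```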

Related

String interpolation C#: Documentation of colon and semicolon functionality

I found this codegolf answer for the FizzBuzz test, and after examining it a bit I realized I had no idea how it actually worked, so I started investigating:
for(int i=1; i<101;i++)
System.Console.Write($"{(i%3*i%5<1?0:i):#}{i%3:;;Fizz}{i%5:;;Buzz}\n");
I put it into dotnetfiddle and established the 1st part works as follows:
{(BOOL?0:i):#}
When BOOL is true, then the conditional expression returns 0 otherwise the number.
However, the number isn't returned unless it's <> 0. I'm guessing this is the job of the :# characters. I can't find any documentation on how the :# characters work. Can anyone explain the colon/hash or point me in the right direction?
Second part:
{VALUE:;;Fizz}
When VALUE = 0 then nothing is printed. I assume this is determined by the first ; character [end statement]. The second ; character determines 'if VALUE <> 0 then print what's after me.'
Again, does anyone have documentation on the use of a semicolon in string interpolation, as I can't find anything useful.

This is all covered in the String Interpolation documentation, especially the section on the Structure of an Interpolated String, which includes this:
{<interpolatedExpression>[,<alignment>][:<formatString>]}
along with a more detailed description for each of those three sections.
The format string portion of that structure is defined on separate pages, where you can use standard and custom formats for numeric types as well as standard and custom formats for date and time types. There are also options for Enum values, and you can even create your own custom format provider.
It's worth taking a look at the custom format provider documentation just because it will also lead you to the FormattableString type. This isn't well-covered by the documentation, but my understanding is this type may in theory allow you to avoid re-parsing the interpolated string for each iteration when used in a loop, thus potentially improving performance (though in practice, there's no difference at this time). I've written about this before, and my conclusion is MS needs to build this into the framework in a better way.
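To make the FormattableString point a bit more concrete, here is a small sketch of what the type captures when an interpolated string is assigned to it. FormattableDemo and Build are names I made up; this says nothing about performance, it only shows the template and arguments being kept separate.

```csharp
using System;
using System.Globalization;

static class FormattableDemo
{
    // Captures the interpolated string as a template plus its arguments.
    public static FormattableString Build(int value) => $"Value: {value:#}";

    static void Main()
    {
        FormattableString fs = Build(42);
        Console.WriteLine(fs.Format);                                 // Value: {0:#}
        Console.WriteLine(fs.ArgumentCount);                          // 1
        Console.WriteLine(fs.ToString(CultureInfo.InvariantCulture)); // Value: 42
    }
}
```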

Thanks to all the commenters! Fast response.
The # is defined here (Custom specifier)
https://learn.microsoft.com/en-us/dotnet/standard/base-types/custom-numeric-format-strings#the--custom-specifier
The "#" custom format specifier serves as a digit-placeholder symbol.
If the value that is being formatted has a digit in the position where
the "#" symbol appears in the format string, that digit is copied to
the result string. Otherwise, nothing is stored in that position in
the result string. Note that this specifier never displays a zero that
is not a significant digit, even if zero is the only digit in the
string. It will display zero only if it is a significant digit in the
number that is being displayed.
The ; is defined here (Section Separator):
https://learn.microsoft.com/en-us/dotnet/standard/base-types/custom-numeric-format-strings#the--section-separator
The semicolon (;) is a conditional format specifier that applies
different formatting to a number depending on whether its value is
positive, negative, or zero. To produce this behavior, a custom format
string can contain up to three sections separated by semicolons...
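Putting the two quotes together, here is an ungolfed sketch of the same trick. With ";;Fizz" the positive and negative sections are empty and the zero section is the literal "Fizz", so the text appears only when the value is 0. FizzFormat, F and Line are hypothetical names, and I pass the invariant culture just to keep the output independent of the machine's settings.

```csharp
using System;
using System.Globalization;

static class FizzFormat
{
    static string F(int n, string format) =>
        n.ToString(format, CultureInfo.InvariantCulture);

    // One FizzBuzz line, built from the same format strings as the golfed answer.
    public static string Line(int i) =>
        F(i % 3 == 0 || i % 5 == 0 ? 0 : i, "#")   // "#" prints nothing for 0
        + F(i % 3, ";;Fizz")                       // zero section: "Fizz"
        + F(i % 5, ";;Buzz");                      // zero section: "Buzz"

    static void Main()
    {
        foreach (int i in new[] { 1, 3, 5, 15 })
        {
            Console.WriteLine($"{i} -> \"{Line(i)}\"");
        }
        // 1 -> "1", 3 -> "Fizz", 5 -> "Buzz", 15 -> "FizzBuzz"
    }
}
```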

C#'s StringInfo and TextElementEnumerator can't recognize graphemes properly

In C# StringInfo and TextElementEnumerator classes provide methods and properties for text elements.
And here, we can find the definition of the Text Element.
The .NET Framework defines a text element as a unit of text that is
displayed as a single character, that is, a grapheme. A text element
can be any of the following:
Yes, it says a text element is a grapheme in .NET. I also tested with some unicode characters myself, and it really seemed true until I tested one Korean letter '가'.
As we all know some Unicode characters consist of multiple code points. Also we may face code point sequences and that's the reason I'm using StringInfo and TextElementEnumerator instead of simple String.
StringInfo and TextElementEnumerator could tell if Chars were surrogate pairs correctly. And "\u0061\u0308", a Unicode character which consists of multiple code points, was recognized as one text element just as expected. But as for "\u1100\u1161", it failed to say that it was also one text element.
"\u1100" is a leading letter "ㄱ", and "\u1161" is a vowel letter "ㅏ". They can be individual characters and shown to the users just as I write here and you can see them now. But if they are used together, they are rendered as one character "가" instead of "ㄱㅏ".
There are two ways in order to represent a Korean character "가":
Using a single code point U+AC00 from Hangul Syllable.
Using two code points U+1100 and U+1161 from Jamo.
Most of the time the former is used. The latter is rarely used, to be honest, I can't imagine when it's used at all..
Anyway, the first one is just one precomposed letter and the second is a sequence of Lead and Vowel which is treated as one character. When rendered they look exactly the same, and both are actually canonically equivalent.
Also the following line returns true in C# :
"\u1100\u1161".Normalize() == "\uAC00"
I wonder why Normalize() here works just fine when C# doesn't think they are one complete text element..
I thought it had something to do with my .NET's version, but it turns out it's not the case. This thing happens even in Mono too.
I tested this with ICU as well, and it could treat "\u1100\u1161" as one grapheme correctly!
I initially thought StringInfo and TextElementEnumerator could eliminate need for ICU4C in some simple cases, so I'm very disappointed now..
Here's my question :
Am I doing something wrong here?
or
A Text Element in .NET isn't a user-perceived character unlike in ICU?

The basic issue here is that per the Korean standard KS X 1026, the two jamos ㄱ and ㅏ are distinct from their combined form 가. In fact, this exact example is used in the official standard (see section 6.2).
Long story short, Microsoft attempted to follow the standard but other operating systems and applications don't necessarily do so. Hence you can get "malformed" content from other software / platforms that appears to be parsed incorrectly on Windows / in .NET, even though it is parsed "correctly" on those platforms.
You will either need to ensure your data is correctly formed in the first place (unlikely, given that the de-facto standard is to completely ignore the official standard) or you will need to use ICU (or a similar library) to deal with these cases.
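If pulling in ICU is not an option, one partial workaround (my own suggestion, not something the standard or the answer above prescribes) is to compose the text to NFC before counting text elements. This only helps for sequences that have a precomposed form, such as the example from the question. TextElementDemo and CountAfterNfc are hypothetical names.

```csharp
using System;
using System.Globalization;
using System.Text;

static class TextElementDemo
{
    // Counts text elements after composing the string to NFC.
    public static int CountAfterNfc(string text) =>
        new StringInfo(text.Normalize(NormalizationForm.FormC)).LengthInTextElements;

    static void Main()
    {
        string jamo = "\u1100\u1161";   // leading consonant + vowel

        // Raw count: differs between runtimes, so no expected value is given.
        Console.WriteLine(new StringInfo(jamo).LengthInTextElements);

        Console.WriteLine(CountAfterNfc(jamo));            // 1, NFC yields U+AC00
        Console.WriteLine(CountAfterNfc("\u0061\u0308"));  // 1, NFC yields U+00E4
    }
}
```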

C# string.IndexOf() returns unexpected value

This question applies to C#, .net Compact Framework 2 and Windows CE 5 devices.
I encountered a bug in a .net DLL which was in use on very different CE devices for years, without showing any problems. Suddenly, on a new Windows CE 5.0 device, this bug appeared in the following code:
string s = "Print revenue receipt"; // has only single space chars
int i = s.IndexOf("  "); // two space chars
I expect i to be -1, however this was only true until today, when indexOf suddenly returned 5.
Since this behaviour doesn't occur when using
int i = s.IndexOf("  ", StringComparison.Ordinal);
, I'm quite sure that this is a culture-based phenomenon, but I can't identify what difference this new device makes. It is a mostly identical version of a known device (just a faster CPU and a new board).
Both devices:
run Windows CE 5.0 with identical localization
System.Environment.Version reports '2.0.7045.0'
CultureInfo.CurrentUICulture and CultureInfo.CurrentCulture report 'en-GB' (also tested with 'de-DE')
'all' related registry keys are equal.
The new device had the CF 3.5 preinstalled, whose GAC files I experimentally renamed, with no change in the described behaviour. Since at runtime always Version 2.0.7045.0 is reported, I assume these assemblies have no effect.
Although this is not difficult to fix, I cannot stand it when things seem that magical. Any hints as to what I was missing?
Edit: it is getting stranger and stranger, see screenshot:
One more:

I believe you already have the answer using an ordinal search
int i = s.IndexOf("  ", StringComparison.Ordinal);
You can read a small section in the documentation for the String Class which has this to say on the subject:
String search methods, such as String.StartsWith and String.IndexOf, also can perform culture-sensitive or ordinal string comparisons. The following example illustrates the differences between ordinal and culture-sensitive comparisons using the IndexOf method. A culture-sensitive search in which the current culture is English (United States) considers the substring "oe" to match the ligature "œ". Because a soft hyphen (U+00AD) is a zero-width character, the search treats the soft hyphen as equivalent to Empty and finds a match at the beginning of the string. An ordinal search, on the other hand, does not find a match in either case.
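A short sketch of the ordinal side of that quote, using the string from the question. IndexOfDemo and OrdinalIndex are made-up names. The last line prints the culture-sensitive result for comparison; I have left it unannotated because it varies with runtime and culture, which is the whole problem here.

```csharp
using System;

static class IndexOfDemo
{
    public static int OrdinalIndex(string source, string value) =>
        source.IndexOf(value, StringComparison.Ordinal);

    static void Main()
    {
        string s = "Print revenue receipt";
        Console.WriteLine(OrdinalIndex(s, "  "));   // -1: no two adjacent spaces
        Console.WriteLine(OrdinalIndex(s, " "));    // 5: first single space

        // A soft hyphen (U+00AD) is matched literally by an ordinal search.
        Console.WriteLine(OrdinalIndex("co\u00ADop", "\u00AD"));   // 2

        // Culture-sensitive search for comparison; result depends on the runtime.
        Console.WriteLine("co\u00ADop".IndexOf("\u00AD", StringComparison.CurrentCulture));
    }
}
```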

Culture stuff can really appear to be quite magical on some systems. What I came to always do after years of pain is always set the culture information manually to InvariantCulture where I do not explicitly want different behaviour for different cultures. So my suggestion would be: Make that IndexOf check always use the same culture information, like so:
int i = s.IndexOf("  ", StringComparison.InvariantCulture);

The reference at http://msdn.microsoft.com/en-us/library/k8b1470s.aspx states:
"Character sets include ignorable characters, which are characters that are not considered when performing a linguistic or culture-sensitive comparison. In a culture-sensitive search, if value contains an ignorable character, the result is equivalent to searching with that character removed."
This is from the 4.5 reference; references for previous versions contain nothing like that.
So let me take a guess: they have changed the rules from 4.0 to 4.5, and now the second space of a two-space sequence is considered to be an "ignorable character", at least if the engine recognizes your string as English text (like in your example string s), otherwise not.
And somehow on your new device, a 4.5 dll is used instead of the expected 2.0 dll.
A wild guess, I know :)

What is wrong with ToLowerInvariant()?

I have the following line of code:
var connectionString = configItems.
Find(item => item.Name.ToLowerInvariant() == "connectionstring");
VS 2010 code analysis is telling me the following:
Warning 7 CA1308 : Microsoft.Globalization : In method ... replace the call to 'string.ToLowerInvariant()' with String.ToUpperInvariant().
Does this mean ToUpperInvariant() is more reliable?

Google gives a hint pointing to CA1308: Normalize strings to uppercase
It says:
Strings should be normalized to uppercase. A small group of characters, when they are converted to lowercase, cannot make a round trip. To make a round trip means to convert the characters from one locale to another locale that represents character data differently, and then to accurately retrieve the original characters from the converted characters.
So, yes - ToUpper is more reliable than ToLower.
In the future I suggest googling first - I do that for all those FxCop warnings I get thrown around ;) Helps a lot to read the corresponding documentation ;)

Besides what TomTom says, .net is optimized for string comparison in upper case. So using upper invariant is theoretically faster than lowerinvariant.
This is indeed stated in CLR via C# as pointed out in the comments.
I'm not sure whether this is really true, of course, since there is nothing to be found on MSDN about this topic. The string comparison guide on MSDN mentions that ToUpperInvariant and ToLowerInvariant are equal and does not prefer the former.
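Another way out of the upper-versus-lower question is not to change case at all and to ask for a case-insensitive ordinal comparison instead. A sketch only: ConfigItem, ConfigLookup and FindByName are hypothetical stand-ins for whatever types the question's configItems actually uses.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for the item type in the question.
class ConfigItem
{
    public string Name { get; set; }
    public string Value { get; set; }
}

static class ConfigLookup
{
    // Compares names without allocating an upper- or lower-cased copy.
    public static ConfigItem FindByName(List<ConfigItem> items, string name) =>
        items.Find(item => string.Equals(item.Name, name, StringComparison.OrdinalIgnoreCase));

    static void Main()
    {
        var configItems = new List<ConfigItem>
        {
            new ConfigItem { Name = "ConnectionString", Value = "Server=.;Database=Demo" }
        };
        ConfigItem match = FindByName(configItems, "connectionstring");
        Console.WriteLine(match != null ? match.Value : "(not found)");
    }
}
```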

Why does string.Compare seem to handle accented characters inconsistently?

If I execute the following statement:
string.Compare("mun", "mün", true, CultureInfo.InvariantCulture)
The result is '-1', indicating that 'mun' has a lower numeric value than 'mün'.
However, if I execute this statement:
string.Compare("Muntelier, Schweiz", "München, Deutschland", true, CultureInfo.InvariantCulture)
I get '1', indicating that 'Muntelier, Schweiz' should go last.
Is this a bug in the comparison? Or, more likely, is there a rule I should be taking into account when sorting strings containing accented characters?
The reason this is an issue is, I'm sorting a list and then doing a manual binary filter that's meant to get every string beginning with 'xxx'.
Previously I was using the Linq 'Where' method, but now I have to use this custom function written by another person, because he says it performs better.
But the custom function doesn't seem to take into account whatever 'unicode' rules .NET has. So if I tell it to filter by 'mün', it doesn't find any items, even though there are items in the list beginning with 'mun'.
This seems to be because of the inconsistent ordering of accented characters, depending on what characters go after the accented character.
OK, I think I've fixed the problem.
Before the filter, I do a sort based on the first n letters of each string, where n is the length of the search string.

There is a tie-breaking algorithm at work, see http://unicode.org/reports/tr10/
To address the complexities of
language-sensitive sorting, a
multilevel comparison algorithm is
employed. In comparing two words, for
example, the most important feature is
the base character: such as the
difference between an A and a B.
Accent differences are typically
ignored, if there are any differences
in the base letters. Case differences
(uppercase versus lowercase), are
typically ignored, if there are any
differences in the base or accents.
Punctuation is variable. In some
situations a punctuation character is
treated like a base character. In
other situations, it should be ignored
if there are any base, accent, or case
differences. There may also be a
final, tie-breaking level, whereby if
there are no other differences at all
in the string, the (normalized) code
point order is used.
So, "Munt..." and "Münc..." are alphabetically different and sort based on the "t" and "c".
Whereas "mun" and "mün" are alphabetically the same ("u" is equivalent to "ü" in most languages), so the character codes are compared.
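If the goal is to treat "u" and "ü" as the same letter outright, for instance in the prefix filter from the question, you can ask the comparison to ignore the accent level instead of relying on the tie-break. A sketch under that assumption; AccentInsensitive and its method names are made up. I would expect all three lines to print True, but try it on your own runtime.

```csharp
using System;
using System.Globalization;

static class AccentInsensitive
{
    static readonly CompareInfo Invariant = CultureInfo.InvariantCulture.CompareInfo;

    // Ignore case and non-spacing marks (accents), leaving only base letters.
    const CompareOptions BaseLettersOnly =
        CompareOptions.IgnoreCase | CompareOptions.IgnoreNonSpace;

    public static bool SameBaseLetters(string a, string b) =>
        Invariant.Compare(a, b, BaseLettersOnly) == 0;

    public static bool StartsWithBaseLetters(string text, string prefix) =>
        Invariant.IsPrefix(text, prefix, BaseLettersOnly);

    static void Main()
    {
        Console.WriteLine(SameBaseLetters("mun", "m\u00FCn"));
        Console.WriteLine(StartsWithBaseLetters("M\u00FCnchen, Deutschland", "mun"));
        Console.WriteLine(StartsWithBaseLetters("Muntelier, Schweiz", "m\u00FCn"));
    }
}
```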

It looks like the accented character is only being used in a sort of "tie-break" situation - in other words, if the strings are otherwise equal.
Here's some sample code to demonstrate:
using System;
using System.Globalization;

class Test
{
    static void Main()
    {
        Compare("mun", "mün");
        Compare("muna", "münb");
        Compare("munb", "müna");
    }

    // Case-insensitive comparison using the invariant culture.
    static void Compare(string x, string y)
    {
        int result = string.Compare(x, y, true,
                                    CultureInfo.InvariantCulture);
        Console.WriteLine("{0}; {1}; {2}", x, y, result);
    }
}
(I've tried adding a space after the "n" as well, to see if it was done on word boundaries - it isn't.)
Results:
mun; mün; -1
muna; münb; -1
munb; müna; 1
I suspect this is correct by various complicated Unicode rules - but I don't know enough about them.
As for whether you need to take this into account... I wouldn't expect so. What are you doing that is thrown by this?

As I understand this it is still somewhat consistent. When comparing using CultureInfo.InvariantCulture the umlaut character ü is treated like the non-accented character u.
As the strings in your first example obviously are not equal the result will not be 0 but -1 (which seems to be a default value). In the second example Muntelier goes last because t follows c in the alphabet.
I couldn't find any clear documentation in MSDN explaining these rules, but I found that
string.Compare("mun", "mün", CultureInfo.InvariantCulture,
CompareOptions.StringSort);
and
string.Compare("Muntelier, Schweiz", "München, Deutschland",
CultureInfo.InvariantCulture, CompareOptions.StringSort);
gives the desired result.
Anyway, I think you'd be better off to base your sorting on a specific culture such as the current user's culture (if possible).
