C# string.IndexOf() returns unexpected value

This question applies to C#, .NET Compact Framework 2 and Windows CE 5 devices.
I encountered a bug in a .NET DLL which had been in use on very different CE devices for years without showing any problems. Suddenly, on a new Windows CE 5.0 device, this bug appeared in the following code:
string s = "Print revenue receipt"; // has only single space chars
int i = s.IndexOf(" "); // two space chars
I expect i to be -1; however, this was only true until today, when IndexOf suddenly returned 5.
Since this behaviour doesn't occur when using
int i = s.IndexOf("  ", StringComparison.Ordinal);
I'm quite sure that this is a culture-based phenomenon, but I can't identify what difference this new device makes. It is a mostly identical version of a known device (just a faster CPU and a new board).
Both devices:
run Windows CE 5.0 with identical localization
System.Environment.Version reports '2.0.7045.0'
CultureInfo.CurrentUICulture and CultureInfo.CurrentCulture report 'en-GB' (also tested with 'de-DE')
'all' related registry keys are equal.
The new device had CF 3.5 preinstalled; I experimentally renamed its GAC files, with no change in the described behaviour. Since Version 2.0.7045.0 is always reported at runtime, I assume these assemblies have no effect.
Although this is not difficult to fix, I cannot stand it when things seem that magical. Any hints on what I'm missing?
Edit: it is getting stranger and stranger (screenshots omitted).

I believe you already have the answer: use an ordinal search
int i = s.IndexOf("  ", StringComparison.Ordinal);
You can read a small section in the documentation for the String Class which has this to say on the subject:
String search methods, such as String.StartsWith and String.IndexOf, also can perform culture-sensitive or ordinal string comparisons. The following example illustrates the differences between ordinal and culture-sensitive comparisons using the IndexOf method. A culture-sensitive search in which the current culture is English (United States) considers the substring "oe" to match the ligature "œ". Because a soft hyphen (U+00AD) is a zero-width character, the search treats the soft hyphen as equivalent to Empty and finds a match at the beginning of the string. An ordinal search, on the other hand, does not find a match in either case.
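To see the difference concretely, here is a minimal sketch based on the documented example (the result of the culture-sensitive call depends on the runtime's collation tables, which is exactly what seems to differ between the two devices):

using System;

class IndexOfDemo
{
    static void Main()
    {
        string s = "a\u00ADb"; // 'a', soft hyphen (U+00AD), 'b'

        // Culture-sensitive search: the soft hyphen is an ignorable
        // character, so "ab" is typically found at index 0.
        Console.WriteLine(s.IndexOf("ab"));

        // Ordinal search compares raw UTF-16 code units: no match.
        Console.WriteLine(s.IndexOf("ab", StringComparison.Ordinal)); // -1
    }
}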

Culture handling can really appear to be quite magical on some systems. What I came to do after years of pain is to always set the culture information manually, using InvariantCulture wherever I do not explicitly want different behaviour for different cultures. So my suggestion would be: make that IndexOf check always use the same comparison, like so:
int i = s.IndexOf("  ", StringComparison.InvariantCulture);

The reference at http://msdn.microsoft.com/en-us/library/k8b1470s.aspx states:
"Character sets include ignorable characters, which are characters that are not considered when performing a linguistic or culture-sensitive comparison. In a culture-sensitive search, if value contains an ignorable character, the result is equivalent to searching with that character removed."
This is from the 4.5 reference; the references for previous versions contain nothing like that.
So let me take a guess: they changed the rules from 4.0 to 4.5, and now the second space of a two-space sequence is considered to be an "ignorable character" - at least if the engine recognizes your string as English text (like your example string s), otherwise not.
And somehow on your new device, a 4.5 dll is used instead of the expected 2.0 dll.
A wild guess, I know :)
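To test that guess, a quick diagnostic sketch (assuming console output is available on the device; otherwise write the values to a log file) would be to dump which runtime and which mscorlib the process actually loaded:

using System;

class RuntimeCheck
{
    static void Main()
    {
        // Which CLR version and which mscorlib did the process load?
        Console.WriteLine(Environment.Version);
        Console.WriteLine(typeof(string).Assembly.FullName);
    }
}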

Related

C# Contains not picking up substring (on foreign country) [duplicate]

I came across the term 'The Turkey Test' while learning about code testing. I don't really know what it means.
What is Turkey Test? Why is it called so?
The Turkey problem relates to software internationalization, or simply to software misbehaving under various language cultures.
Different countries have different conventions, for example for writing dates (14.04.2008 in Turkey vs. 4/14/2008 in the US), numbers (e.g. 123,45 in Poland and 123.45 in the USA), and rules about character uppercasing (as in Turkey with the letters i, I and ı).
As Jeff Moser pointed out below, one such problem was reported by a Turkish user who found a bug in the ToUpper() function. There are more details in the comments below.
However the problem is not limited to Turkey and to string conversions.
For example, in Poland and many other countries, dates and numbers are also written in a different manner.
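A minimal sketch of these differences (the exact patterns come from the OS culture tables, so output can vary slightly between machines):

using System;
using System.Globalization;

class CultureDemo
{
    static void Main()
    {
        var date = new DateTime(2008, 4, 14);

        // The same date, formatted per culture.
        Console.WriteLine(date.ToString("d", new CultureInfo("tr-TR"))); // 14.04.2008
        Console.WriteLine(date.ToString("d", new CultureInfo("en-US"))); // 4/14/2008

        // The same digits, parsed per culture: the comma is the decimal
        // separator in Poland, but not in the US.
        double plValue = double.Parse("123,45", new CultureInfo("pl-PL"));
        double usValue;
        bool usOk = double.TryParse("123,45", NumberStyles.Float,
                                    new CultureInfo("en-US"), out usValue);
        Console.WriteLine(plValue.ToString(CultureInfo.InvariantCulture)); // 123.45
        Console.WriteLine(usOk); // False: NumberStyles.Float rejects the comma
    }
}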
Some links from a Google search for the Turkey Test:
Does Your Code Pass The Turkey Test?
by Jeff Moser
What's Wrong With Turkey?
by Jeff Atwood
The Turkey Test is described there:
Forget about Turkey, this won't even pass in the USA. You need a case insensitive compare. So you try:
String.Compare(string,string,bool ignoreCase):
....
Do any of these pass "The Turkey Test?"
Not a chance!
Reason: You've been hit with the "Turkish I" problem.
As discussed by lots and lots of people, the "I" in Turkish behaves differently than in most languages. Per the Unicode standard, our lowercase "i" becomes "İ" (U+0130 "Latin Capital Letter I With Dot Above") when it moves to uppercase. Similarly, our uppercase "I" becomes "ı" (U+0131 "Latin Small Letter Dotless I") when it moves to lowercase.
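A short demonstration of the Turkish-I problem (a sketch; the exact return value of the culture-sensitive Compare is implementation-defined, but it is non-zero):

using System;
using System.Globalization;

class TurkishI
{
    static void Main()
    {
        var tr = new CultureInfo("tr-TR");

        // In Turkish, "i" uppercases to dotted İ (U+0130)...
        Console.WriteLine("file".ToUpper(tr)); // FİLE
        // ...and "I" lowercases to dotless ı (U+0131).
        Console.WriteLine("FILE".ToLower(tr)); // fıle

        // A culture-sensitive, case-insensitive compare therefore fails:
        Console.WriteLine(string.Compare("file", "FILE", true, tr)); // non-zero

        // An ordinal case-insensitive compare ignores culture and matches:
        Console.WriteLine(string.Equals("file", "FILE",
            StringComparison.OrdinalIgnoreCase)); // True
    }
}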
We write dates smaller to bigger like dd.MM.yyyy: 28.10.2010
We use '.'(dot) for thousands separator, and ','(comma) for decimal separator: 4.567,9
We have ö=>Ö, ç=>Ç, ş=>Ş, ğ=>Ğ, ü=>Ü, and most importantly ı=>I and i => İ; in other words, lower case of upper I is dotless and upper case of lower i is dotted.
People can have very stressful times because of confusing errors caused by the rules above.
If your code properly runs in Turkey, it'll probably work anywhere.
The so-called "Turkey Test" relates to software internationalization. One problem of globalization/internationalization is that date and time formats in different cultures can differ on many levels (day/month/year order, date separator, etc.).
Also, Turkish has some special rules for capitalization, which can lead to problems. For example, the Turkish "i" character is a common problem for many programs, which capitalize it in the wrong way.
The link provided by @Luixv gives a comprehensive description of the issue.
The summary is that if you're going to test your code on only one non-English locale, test it on Turkish.
This is because Turkish exhibits most of the edge cases you are likely to encounter with localization, including "unusual" format strings and non-standard characters (such as different capitalization rules for i).
Jeff Atwood has a blog article on same which is the first place I came across it myself.
In summary, attempting to run your application under a Turkish locale is an excellent test
of your i18n.
Here's Jeff's article.

C#'s StringInfo and TextElementEnumerator can't recognize graphemes properly

In C#, the StringInfo and TextElementEnumerator classes provide methods and properties for working with text elements.
And here we can find the definition of a text element.
The .NET Framework defines a text element as a unit of text that is
displayed as a single character, that is, a grapheme. A text element
can be any of the following:
Yes, it says a text element is a grapheme in .NET. I also tested with some Unicode characters myself, and it really seemed true, until I tested the Korean letter '가'.
As we all know, some Unicode characters consist of multiple code points. We may also face code point sequences, and that's the reason I'm using StringInfo and TextElementEnumerator instead of plain String.
StringInfo and TextElementEnumerator could correctly tell whether chars were surrogate pairs. And "\u0061\u0308", a Unicode character which consists of multiple code points, was recognized as one text element, just as expected. But as for "\u1100\u1161", they failed to report that it is also one text element.
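A sketch of that test (behaviour as observed on .NET Framework and Mono at the time; newer .NET releases have since reworked StringInfo to follow Unicode grapheme-cluster rules, so results there differ):

using System;
using System.Globalization;

class GraphemeTest
{
    static void Main()
    {
        CountTextElements("\u0061\u0308"); // a + combining diaeresis -> 1 element
        CountTextElements("\u1100\u1161"); // Jamo lead + vowel -> 2 elements (!)
        CountTextElements("\uAC00");       // precomposed syllable -> 1 element
    }

    static void CountTextElements(string s)
    {
        var enumerator = StringInfo.GetTextElementEnumerator(s);
        int count = 0;
        while (enumerator.MoveNext()) count++;
        Console.WriteLine("{0} -> {1} text element(s)", s, count);
    }
}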
"\u1100" is a leading letter "ㄱ", and "\u1161" is a vowel letter "ㅏ". They can be individual characters and shown to the users just as I write here and you can see them now. But if they are used together, they are rendered as one character "가" instead of "ㄱㅏ".
There are two ways in order to represent a Korean character "가":
Using a single code point U+AC00 from Hangul Syllable.
Using two code points U+1100 and U+1161 from Jamo.
Most of the time the former is used. The latter is rarely used; to be honest, I can't imagine when it's used at all.
Anyway, the first one is just one precomposed letter, and the second is a sequence of a lead consonant and a vowel which is treated as one character. When rendered, they look exactly the same, and they are in fact canonically equivalent.
Also the following line returns true in C# :
"\u1100\u1161".Normalize() == "\uAC00"
I wonder why Normalize() works just fine here when C# doesn't think they are one complete text element.
I thought it had something to do with my .NET version, but it turns out that's not the case. This happens even on Mono.
I tested this with ICU as well, and it could treat "\u1100\u1161" as one grapheme correctly!
I initially thought StringInfo and TextElementEnumerator could eliminate the need for ICU4C in some simple cases, so I'm very disappointed now.
Here's my question :
Am I doing something wrong here?
or
A Text Element in .NET isn't a user-perceived character unlike in ICU?
The basic issue here is that per the Korean standard KS X 1026, the two jamos ㄱ and ㅏ are distinct from their combined form 가. In fact, this exact example is used in the official standard (see section 6.2).
Long story short, Microsoft attempted to follow the standard but other operating systems and applications don't necessarily do so. Hence you can get "malformed" content from other software / platforms that appears to be parsed incorrectly on Windows / in .NET, even though it is parsed "correctly" on those platforms.
You will either need to ensure your data is correctly formed in the first place (unlikely, given that the de-facto standard is to completely ignore the official standard) or you will need to use ICU (or a similar library) to deal with these cases.

C# Two strings, visually the same, yet they are not Equal nor Equivalent

I have a strange situation I can't figure out.
I am using a third-party conversion framework which expects units in abbreviated form, e.g. "μV", which is MicroVolts.
But when I go to parse the string "μV" as MicroVolts, it fails.
I boiled it down to the fact that the abbreviation string I pass in is not equal to the string the third-party framework uses for MicroVolts, even though they look identical.
Here is the output of the Immediate window, to help shed some light on the context:
targetUom
"µV"
targetUom.GetHashCode()
-837503221
"μV".GetHashCode()
-837502956
targetUom.Equals("µV") // This is using the value of targetUom
true
targetUom.Equals("μV") // This is using the value from the 3rd party framework
false
I have obtained the value used in the third party framework by debugging and copying the value of the abbreviation I know they use for MicroVolts.
Any idea why two strings, even though they look to be made up of the exact same characters, would not be considered equal?
I've also compared the first character, the micro unit representation, between the two strings which yields:
'μ'.CompareTo(targetUom[0])
775
*********** UPDATE ****************
So I've found that the two micro characters are actually different code points.
But when I attempt to use the same character that the target framework uses, Visual Studio gives me this message (screenshot omitted):
What are the implications of changing the encoding of the file? Should I be doing this, or should I collaborate with the framework author to enable their framework to handle both characters?
Turns out there are two Unicode characters which are probably rendered identically in most fonts:
Greek small letter mu, U+03BC
Micro sign, U+00B5
You can access them both in strings using the \u escape:
Console.WriteLine("Greek small letter mu: \u03bc");
Console.WriteLine("Micro sign: \u00b5");

String Comparison differences between .NET and T-SQL?

In a test case I've written, string comparison doesn't appear to work the same way between SQL Server and the .NET CLR.
This C# code:
string lesser = "SR2-A1-10-90";
string greater = "SR2-A1-100-10";
Debug.WriteLine(string.Compare("A","B"));
Debug.WriteLine(string.Compare(lesser, greater));
Will output:
-1
1
This SQL Server code:
declare @lesser varchar(20);
declare @greater varchar(20);
set @lesser = 'SR2-A1-10-90';
set @greater = 'SR2-A1-100-10';
IF @lesser < @greater
SELECT 'Less Than';
ELSE
SELECT 'Greater than';
Will output:
Less Than
Why the difference?
This is documented here.
Windows collations (e.g. Latin1_General_CI_AS) use Unicode type collation rules. SQL Collations don't.
This causes the hyphen character to be treated differently between the two.
Further to gbn's answer, you can make them behave the same by using CompareOptions.StringSort in C# (or by using StringComparison.Ordinal). This treats symbols as occurring before alphanumeric symbols, so "-" < "0".
However, Unicode vs ASCII doesn't explain anything, as the hex codes for the ASCII codepage are translated verbatim to the Unicode codepage: "-" is 002D (45) while "0" is 0030 (48).
What is happening is that .NET is using "linguistic" sorting by default, which is based on a non-ordinal ordering and weight applied to various symbols by the specified or current culture. This linguistic algorithm allows, for instance, "résumé" (spelled with accents) to appear immediately following "resume" (spelled without accents) in a sorted list of words, as "é" is given a fractional order just after "e" and well before "f". It also allows "cooperation" and "co-operation" to be placed closely together, as the dash symbol is given low "weight"; it matters only as the absolute final tiebreakers when sorting words like "bits", "bit's", and "bit-shift" (which would appear in that order).
So-called ordinal sorting (strictly according to Unicode values, with or without case insensitivity) will produce very different and sometimes illogical results, as variants of letters usually appear well after the basic undecorated Latin alphabet in ASCII/Unicode ordinals, while symbols occur before it. For instance, "é" comes after "z" and so the words "resume", "rosin", "ruble", "résumé" would be sorted in that order. "Bit's", "Bit-shift", "Biter", "Bits" would be sorted in that order as the apostrophe comes first, followed by the dash, then the letter "e", then the letter "s". Neither of these seem logical from a "natural language" perspective.
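A sketch tying this together, using the strings from the question and the word list above (the linguistic results assume an English culture):

using System;
using System.Globalization;

class ComparisonDemo
{
    static void Main()
    {
        string lesser = "SR2-A1-10-90";
        string greater = "SR2-A1-100-10";

        // Default linguistic comparison gives the hyphen low weight,
        // reproducing the C# output from the question (1).
        Console.WriteLine(string.Compare(lesser, greater));

        // StringSort treats symbols as significant ("-" < "0"),
        // matching the SQL Server output from the question (-1).
        Console.WriteLine(string.Compare(lesser, greater,
            CultureInfo.InvariantCulture, CompareOptions.StringSort));

        // Linguistic vs. ordinal ordering of accented words:
        var words = new[] { "resume", "rosin", "ruble", "résumé" };
        Array.Sort(words, StringComparer.Create(new CultureInfo("en-US"), false));
        Console.WriteLine(string.Join(", ", words)); // resume, résumé, rosin, ruble
        Array.Sort(words, StringComparer.Ordinal);
        Console.WriteLine(string.Join(", ", words)); // resume, rosin, ruble, résumé
    }
}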
In SQL you used varchar, which is basically ASCII (subject to collation), and which sorts '-' before '0'.
In C# all strings are Unicode
The finer points of UTF-xx (c#) vs UCS-2 (SQL Server) are quite tricky.
Edit:
I posted too soon
I get "Greater Than" on SQL Server 2008 with collation Latin1_General_CI_AS
Edit 2:
I'd also try SELECT ASCII(...) on your dash. For example, if the SQL snippet has ever been in a Word document, the – (150) is not the - (45) I copied into SQL Server for testing out of my browser from your question. See CP 1252 (= CP1 in SQL Server lingo).
Edit 3: See Martin Smith's answer: the 2 collations have different sort orders.
Several great answers already on why this happens, but I'm sure others just want to know the C# code to iterate a collection in the same order as SQL Server. I have found the following works best: "Ordinal" gets around the hyphen issue, while "IgnoreCase" seems to reflect the SQL Server default as well.
Debug.WriteLine(string.Compare(lesser, greater, StringComparison.OrdinalIgnoreCase));

What is wrong with ToLowerInvariant()?

I have the following line of code:
var connectionString = configItems
    .Find(item => item.Name.ToLowerInvariant() == "connectionstring");
VS 2010 code analysis is telling me the following:
Warning 7 CA1308 : Microsoft.Globalization : In method ... replace the call to 'string.ToLowerInvariant()' with String.ToUpperInvariant().
Does this mean ToUpperInvariant() is more reliable?
Google gives a hint pointing to CA1308: Normalize strings to uppercase
It says:
Strings should be normalized to uppercase. A small group of characters, when they are converted to lowercase, cannot make a round trip. To make a round trip means to convert the characters from one locale to another locale that represents character data differently, and then to accurately retrieve the original characters from the converted characters.
So, yes - ToUpper is more reliable than ToLower.
In the future I suggest googling first - I do that for all those FxCop warnings I get thrown around ;) Helps a lot to read the corresponding documentation ;)
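For example, the lookup from the question (configItems and Name are the names used there) could be rewritten either way:

// Satisfies CA1308 by normalizing to uppercase:
var connectionString = configItems.Find(
    item => item.Name.ToUpperInvariant() == "CONNECTIONSTRING");

// Or, avoiding the intermediate string allocation entirely:
var viaComparison = configItems.Find(
    item => string.Equals(item.Name, "connectionstring",
                          StringComparison.OrdinalIgnoreCase));

The second form also sidesteps the round-trip problem, since no case conversion takes place at all.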
Besides what TomTom says, .NET is optimized for string comparison in upper case, so using ToUpperInvariant is theoretically faster than ToLowerInvariant.
This is indeed stated in CLR via C#, as pointed out in the comments.
I'm not sure if this is really true, since there is nothing to be found on MSDN about this topic. The string comparison guide on MSDN mentions that ToUpperInvariant and ToLowerInvariant are equal and does not prefer the former.
