Using InvariantCultureIgnoreCase instead of ToUpper for case-insensitive string comparisons - c#

On this page, a commenter writes:
Do NOT ever use .ToUpper to ensure that comparing strings is case-insensitive.
Instead of this:
type.Name.ToUpper() == (controllerName.ToUpper() + "Controller".ToUpper())
Do this:
type.Name.Equals(controllerName + "Controller",
StringComparison.InvariantCultureIgnoreCase)
Why is this way preferred?

Here is the answer in detail: The Turkey Test (read section 3)
As discussed by lots and lots of
people, the "I" in Turkish behaves
differently than in most languages.
Per the Unicode standard, our
lowercase "i" becomes "İ" (U+0130
"Latin Capital Letter I With Dot
Above") when it moves to uppercase.
Similarly, our uppercase "I" becomes
"ı" (U+0131 "Latin Small Letter
Dotless I") when it moves to
lowercase.
Fix: Again, use an ordinal (raw byte)
comparer, or invariant culture for
comparisons unless you absolutely need
culturally based linguistic
comparisons (which give you uppercase
I's with dots in Turkey)
And according to Microsoft you should not even be using the Invariant... but the Ordinal... (New Recommendations for Using Strings in Microsoft .NET 2.0)
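To make the failure mode concrete, here is a minimal sketch (the controller names are made up, and it assumes the tr-TR culture is installed on the machine) showing the ToUpper comparison breaking under a Turkish culture while an ordinal case-insensitive comparison keeps working:
using System;
using System.Globalization;
using System.Threading;

// Simulate running on a machine configured for Turkish (assumes tr-TR is available).
Thread.CurrentThread.CurrentCulture = new CultureInfo("tr-TR");

string typeName = "MailController"; // hypothetical type name
string controllerName = "MAIL";     // hypothetical route value

// Culture-sensitive ToUpper() turns the "i" in "Mail" into "İ" (U+0130),
// so the two upper-cased strings no longer match: prints False.
Console.WriteLine(typeName.ToUpper() == (controllerName + "Controller").ToUpper());

// An ordinal case-insensitive comparison ignores the Turkish casing rules: prints True.
Console.WriteLine(typeName.Equals(controllerName + "Controller",
    StringComparison.OrdinalIgnoreCase));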

In short, comparing with a StringComparison is optimized by the CLR (and uses less memory as well, since no upper-cased copies of the strings are allocated).
Further, uppercase comparison is more optimized than ToLower(), if that tiny degree of performance matters.
In response to your example, there is an even faster way:
String.Equals(type.Name, controllerName + "Controller",
StringComparison.InvariantCultureIgnoreCase);

Related

Alphabetical order does not compare from left to right?

I thought that in .NET strings were compared alphabetically and that they were compared from left to right.
string[] strings = { "-1", "1", "1Foo", "-1Foo" };
Array.Sort(strings);
Console.WriteLine(string.Join(",", strings));
I'd expect this (or the two with the minus sign at the beginning first):
1,1Foo,-1,-1Foo
But the result is:
1,-1,1Foo,-1Foo
It seems to be a mixture: either the minus sign is ignored, or multiple characters are compared even though the first character already differs.
Edit: I've now tested OrdinalIgnoreCase and I get the expected order:
Array.Sort(strings, StringComparer.OrdinalIgnoreCase);
But even if I use InvariantCultureIgnoreCase I get the unexpected order.
Jon Skeet to the rescue here
Specifically:
The .NET Framework uses three distinct ways of sorting: word sort,
string sort, and ordinal sort. Word sort performs a culture-sensitive
comparison of strings. Certain nonalphanumeric characters might have
special weights assigned to them. For example, the hyphen ("-") might
have a very small weight assigned to it so that "coop" and "co-op"
appear next to each other in a sorted list. String sort is similar to
word sort, except that there are no special cases. Therefore, all
nonalphanumeric symbols come before all alphanumeric characters.
Ordinal sort compares strings based on the Unicode values of each
element of the string.
But passing StringComparer.Ordinal makes it behave as you want:
string[] strings = { "-1", "1", "10", "-10", "a", "ba","-a" };
Array.Sort(strings, StringComparer.Ordinal);
Console.WriteLine(string.Join(",", strings));
// prints: -1,-10,-a,1,10,a,ba
Edit:
About the Ordinal, quoting from MSDN CompareOptions Enumeration
Ordinal Indicates that the string comparison must use successive
Unicode UTF-16 encoded values of the string (code unit by code unit
comparison), leading to a fast comparison but one that is
culture-insensitive. A string starting with a code unit XXXX (a hexadecimal
value) comes before a string starting with YYYY, if XXXX is less than YYYY.
This value cannot be combined with other CompareOptions values and
must be used alone.
Also, it seems you have String.CompareOrdinal if you want an ordinal comparison of two strings.
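For instance, a quick sketch contrasting the two comparisons on the question's data:
using System;

// Word (culture-sensitive) comparison: the hyphen gets a very small weight,
// so "1" sorts before "-1" in most cultures (prints a negative number).
Console.WriteLine(string.Compare("1", "-1", StringComparison.CurrentCulture));

// Ordinal comparison: '1' (U+0031) is greater than '-' (U+002D),
// so "-1" sorts before "1" (prints a positive number).
Console.WriteLine(string.CompareOrdinal("1", "-1"));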
Here's another note of interest:
When possible, the application should use string comparison methods
that accept a CompareOptions value to specify the kind of comparison
expected. As a general rule, user-facing comparisons are best served
by the use of linguistic options (using the current culture), while
security comparisons should specify Ordinal or OrdinalIgnoreCase.
I guess we humans expect ordinal when dealing with strings :)
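As a rough illustration of that guidance (the role check below is hypothetical):
using System;
using System.Collections.Generic;

// User-facing: sort display strings linguistically with the current culture,
// so "coop" and "co-op" typically end up next to each other.
var names = new List<string> { "co-op", "coop", "résumé", "resume" };
names.Sort(StringComparer.CurrentCulture);
Console.WriteLine(string.Join(",", names));

// Security-sensitive: compare identifiers code unit by code unit, ignoring only case.
string requestedRole = "Admin"; // hypothetical input
bool isAdmin = string.Equals(requestedRole, "ADMIN", StringComparison.OrdinalIgnoreCase);
Console.WriteLine(isAdmin); // True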
There is a small note on the String.CompareTo method documentation:
Notes to Callers:
Character sets include ignorable characters. The
CompareTo(String) method does not consider such characters when it
performs a culture-sensitive comparison. For example, if the following
code is run on the .NET Framework 4 or later, a comparison of "animal"
with "ani-mal" (using a soft hyphen, or U+00AD) indicates that the two
strings are equivalent.
And then a little later states:
To recognize ignorable characters in a string comparison, call the
CompareOrdinal(String, String) method.
These two statements seem to be consistent with the results you are seeing.
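A small sketch of that soft-hyphen case (run on .NET Framework 4 or later, per the note above):
using System;

string a = "animal";
string b = "ani\u00ADmal"; // "ani-mal" with a soft hyphen (U+00AD)

// Culture-sensitive comparison treats the soft hyphen as ignorable: prints 0.
Console.WriteLine(a.CompareTo(b));

// Ordinal comparison sees the extra code unit: prints a non-zero value.
Console.WriteLine(string.CompareOrdinal(a, b));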

Why the capital letter is greater than small letter in .Net?

In Java:
"A".compareTo("a"); return -32 //"A" is less than "a".
In .Net, use String.CompareTo:
"A".CompareTo("a"); return 1 //"A" is greater than "a".
In .Net, use Char.CompareTo:
'A'.CompareTo('a'); // returns -32: 'A' is less than 'a'.
I know that Java compares string characters by their position in the Unicode table, but .NET does not. How does .NET determine that a capital letter is greater than a small letter?
String.CompareTo Method (String)
The doc I could find says that:
This method performs a word (case-sensitive and culture-sensitive) comparison using the current culture.
So, it is not quite the same as Java's .compareTo() which does a lexicographical comparison by default, using Unicode code points, as you say.
Therefore, in .NET, it depends on your current "culture" (Java would call this a "locale", I guess).
It seems that if you want to do String comparison "à la Java" in .NET, you must use String.CompareOrdinal() instead.
Conversely, if you want to do locale-dependent string comparison in Java, you need to use a Collator.
Lastly, another link on MSDN shows the influence of cultures on comparisons and even string equality.
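A short sketch of the difference, assuming a typical culture such as en-US is current:
using System;

// Word (culture-sensitive) comparison: lowercase sorts before uppercase
// in most cultures, so this prints 1.
Console.WriteLine("A".CompareTo("a"));

// Ordinal comparison, "à la Java": 'A' (U+0041) is less than 'a' (U+0061),
// so this prints a negative number.
Console.WriteLine(string.CompareOrdinal("A", "a"));

// Char.CompareTo is also ordinal, which is why the question saw -32 here.
Console.WriteLine('A'.CompareTo('a'));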
From Java String
Returns:
the value 0 if the argument string is equal to this string; a value less than 0 if this string is lexicographically less than the string argument; and a value greater than 0 if this string is lexicographically greater than the string argument.
From .Net String.CompareTo
This method performs a word (case-sensitive and culture-sensitive)
comparison using the current culture. For more information about word,
string, and ordinal sorts, see System.Globalization.CompareOptions.
This post explains the difference between the comparison types
And the doc explains the difference between all the comparison types.
If you look at these two, CurrentCulture and Ordinal:
StringComparison.Ordinal:
LATIN SMALL LETTER I (U+0069) is less than LATIN SMALL LETTER DOTLESS I (U+0131)
LATIN SMALL LETTER I (U+0069) is greater than LATIN CAPITAL LETTER I (U+0049)
LATIN SMALL LETTER DOTLESS I (U+0131) is greater than LATIN CAPITAL LETTER I (U+0049)
StringComparison.CurrentCulture:
LATIN SMALL LETTER I (U+0069) is less than LATIN SMALL LETTER DOTLESS I (U+0131)
LATIN SMALL LETTER I (U+0069) is less than LATIN CAPITAL LETTER I (U+0049)
LATIN SMALL LETTER DOTLESS I (U+0131) is greater than LATIN CAPITAL LETTER I (U+0049)
Ordinal is the only one where "i" > "I", and hence is Java-like.
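Those results can be reproduced directly; a small sketch, assuming a culture such as en-US is current:
using System;

string smallI = "\u0069";   // LATIN SMALL LETTER I
string dotlessI = "\u0131"; // LATIN SMALL LETTER DOTLESS I
string capitalI = "\u0049"; // LATIN CAPITAL LETTER I

// Ordinal: plain code-unit comparison, so "i" (U+0069) > "I" (U+0049): positive.
Console.WriteLine(string.Compare(smallI, capitalI, StringComparison.Ordinal));

// Current culture (e.g. en-US): linguistic weights, so "i" < "I": negative.
Console.WriteLine(string.Compare(smallI, capitalI, StringComparison.CurrentCulture));

// Dotless ı compares greater than I under both comparisons: positive.
Console.WriteLine(string.Compare(dotlessI, capitalI, StringComparison.Ordinal));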
This is due to the order of the characters in the ASCII character set. This is something you should really understand if you are going to do any form of data manipulation in your programs.

Counting special UTF-8 character

I'm trying to find a way to count special characters that are formed by more than one character, but I have found no solution online!
For example, I want to count the string "வாழைப்பழம". It actually consists of 6 Tamil characters, but it is 9 characters when we use the normal way to find the length. I am wondering whether Tamil is the only kind of script that causes this problem, and whether there is a solution. I'm currently looking for a solution in C#.
Thank you in advance =)
Use StringInfo.LengthInTextElements:
var text = "வாழைப்பழம";
Console.WriteLine(text.Length); // 9
Console.WriteLine(new StringInfo(text).LengthInTextElements); // 6
The explanation for this behaviour can be found in the documentation of String.Length:
The Length property returns the number of Char objects in this instance, not the number of Unicode characters. The reason is that a Unicode character might be represented by more than one Char. Use the System.Globalization.StringInfo class to work with each Unicode character instead of each Char.
A minor nitpick: strings in .NET use UTF-16, not UTF-8
When you're talking about the length of a string, there are several different things you could mean:
1. Length in bytes. This is the old C way of looking at things, usually.
2. Length in Unicode code points. This gets you closer to modern times and should be the way string lengths are treated, except it isn't.
3. Length in UTF-8/UTF-16 code units. This is the most common interpretation, deriving from 1. Certain characters take more than one code unit in those encodings, which complicates things if you don't expect it.
4. Count of visible "characters" (graphemes). This is usually what people mean when they say characters or length of a string.
In your case your confusion stems from the difference between 4. and 3. 3. is what C# uses, 4. is what you expect. Complex scripts such as Tamil use ligatures and diacritics. Ligatures are contractions of two or more adjacent characters into a single glyph – in your case ழை is a ligature of ழ and ை – the latter of which changes the appearance of the former; வா is also such a ligature. Diacritics are ornaments around a letter, e.g. the accent in à or the dot above ப்.
The two cases I mentioned both result in a single grapheme (what you perceive as a single character), yet they both need two actual characters each. So you end up with three code points more in the string.
One thing to note: For your case the distinction between 2. and 3. is irrelevant, but generally you should keep it in mind.
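If you need the individual graphemes rather than just the count, StringInfo can also enumerate the text elements; a small sketch:
using System;
using System.Globalization;

string text = "வாழைப்பழம";

Console.WriteLine(text.Length);                               // 9 UTF-16 code units
Console.WriteLine(new StringInfo(text).LengthInTextElements); // 6 text elements, as above

// Walk the string one text element (grapheme) at a time.
var enumerator = StringInfo.GetTextElementEnumerator(text);
while (enumerator.MoveNext())
{
    Console.WriteLine(enumerator.GetTextElement());
}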

about string.compare method

A strange question; my code is:
static void Main(string[] args)
{
    Console.WriteLine(string.Compare("-", "a"));  // output -1
    Console.WriteLine(string.Compare("-d", "a")); // output 1
    Console.Read();
}
Who can tell me why?
By default, string comparison uses culture-specific settings. These settings allow for varying orders and weights to be applied to letters and symbols; for instance, "resume" and "résumé" will appear fairly close to each other when sorting using most culture settings, because "é" is ordered just after "e" and well before "f", even though the Unicode codepage places é well after the rest of the English alphabet. Similarly, symbols that aren't whitespace, take up a position in the string, but are considered "connective" like dashes, slashes, etc are given low "weight", so that they are only considered as tie-breakers. That means that "a-b" would be sorted just after "ab" and before "ac", because the dash is less important than the letters.
What you think you want is "ordinal sorting", where strings are sorted based on the first difference in the string, based on the relative ordinal positions of the differing characters in the Unicode codepage. This would place "-d" before "a" if "-" would also come before "a", because the dash is considered a full "character" and is compared to the character "a" in the same position. However, in a list of real words, this would place the words "redo", "resume", "rosin", "ruble", "re-do", and "résumé" in that order when in an ordinal-sorted list, which may not make sense in context, and certainly not to a non-English speaker.
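A short sketch of how to request the ordinal behaviour explicitly:
using System;

// Word (culture-sensitive) comparison: the "-" is only a low-weight tie-breaker,
// so "-d" is effectively compared as "d" and comes after "a": prints 1.
Console.WriteLine(string.Compare("-d", "a"));

// Ordinal comparison: '-' (U+002D) is less than 'a' (U+0061),
// so "-d" comes first: prints a negative number.
Console.WriteLine(string.Compare("-d", "a", StringComparison.Ordinal));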
It compares the position of the characters within each other. In other words, "-" comes before (is less than) "a".
String.Compare() uses word sort rules when comparing. Mind you, these are all relative positions. Here is some information from MSDN.
Value : Condition
Negative : strA is less than strB
Zero : strA equals strB
Positive : strA is greater than strB
The above comparison applies to this overload:
public static int Compare(
    string strA,
    string strB
)
The - is treated as a special case in sorting by the .NET Framework. This answer has the details: https://stackoverflow.com/a/9355086/1180433

Regular expression to catch letters beyond a-z

A normal regexp to allow letters only would be "[a-zA-Z]", but I'm from Sweden, so I would have to change that into "[a-zåäöA-ZÅÄÖ]". But suppose I don't know what letters are used in the alphabet.
Is there a way to automatically know what chars are valid in a given locale/language, or should I just make a blacklist of chars that I (think I) know I don't want?
You can use \p{L} to match any 'letter', which will support all letters in all languages. You can narrow it down to specific languages using 'named blocks'. More information can be found in the Character Classes documentation on MSDN.
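For example, a minimal sketch of \p{L} covering the Swedish letters without listing them:
using System;
using System.Text.RegularExpressions;

// \p{L} matches any Unicode letter, so å, ä and ö pass without being listed.
var lettersOnly = new Regex(@"^\p{L}+$");

Console.WriteLine(lettersOnly.IsMatch("Smörgåsbord")); // True
Console.WriteLine(lettersOnly.IsMatch("abc123"));      // False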
My recommendation would be to put the regular expression (or at least the "letter" part) into a localised resource, which you can then pull out based on the current locale and form into the larger pattern.
What about \p{name} ?
Matches any character in the named character class specified by {name}.
Supported names are Unicode groups and block ranges. For example, Ll, Nd, Z,
IsGreek, IsBoxDrawing.
I don't know enough about unicode, but maybe your characters fit a unicode class?
See character categories selection with \p and \w unicode semantics.
All chars are "valid," so I think you're really asking for chars that are "generally considered to be letters" in a locale.
The Unicode specification has some guidelines, but in general the answer is "no," you would need to list the characters you decide are "letters."
Is there a way to automatically know what chars are valid in a given locale/language, or should I just make a blacklist of chars that I (think I) know I don't want?
This is not, in general, possible.
After all, English text does include some accented characters (e.g. in "fête" and "naïve", which, to be strictly correct in UK English, still take accents). In some languages some of the standard letters are rarely used (e.g. y-diaeresis in French).
Then consider that foreign words are often included (this will frequently be the case where technical terms are used). Quotations would be another source.
If your requirements are sufficiently narrowly defined you may be able to create a definition, but this requires linguistic experience in that language.
This regex allows the letters of the Latin-1 range through (note that À-ÿ also includes the non-letters × and ÷):
[a-zA-ZÀ-ÿ ]
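A quick sketch contrasting that range with \p{L}, showing that × (U+00D7) slips through the Latin-1 range but not the letter class:
using System;
using System.Text.RegularExpressions;

var latin1Range = new Regex(@"^[a-zA-ZÀ-ÿ ]+$");
var unicodeLetters = new Regex(@"^[\p{L} ]+$");

Console.WriteLine(latin1Range.IsMatch("Ångström"));  // True
Console.WriteLine(latin1Range.IsMatch("a × b"));     // True: × falls inside À-ÿ
Console.WriteLine(unicodeLetters.IsMatch("a × b"));  // False: × is not a letter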
