Why is a capital letter greater than a lowercase letter in .NET? - c#

In Java:
"A".compareTo("a"); // returns -32: "A" is less than "a".
In .NET, using String.CompareTo:
"A".CompareTo("a"); // returns 1: "A" is greater than "a".
In .NET, using Char.CompareTo:
'A'.CompareTo('a'); // returns -32: 'A' is less than 'a'.
I know that Java compares strings character by character using each character's position in the Unicode table, but .NET does not. How does .NET decide that a capital letter is greater than the corresponding lowercase letter?
String.CompareTo Method (String)

The doc I could find says that:
This method performs a word (case-sensitive and culture-sensitive) comparison using the current culture.
So, it is not quite the same as Java's .compareTo() which does a lexicographical comparison by default, using Unicode code points, as you say.
Therefore, in .NET, it depends on your current "culture" (Java would call this a "locale", I guess).
It seems that if you want to do String comparison "à la Java" in .NET, you must use String.CompareOrdinal() instead.
Conversely, if you want to do locale-dependent string comparison in Java, you need to use a Collator.
Lastly, another link on MSDN shows the influence of cultures on comparisons and even string equality.
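As a rough sketch of the difference described above, the following C# helpers put the three comparisons side by side. The class name ComparisonSketch and its method names are made up for illustration; only string.CompareOrdinal, string.Compare, and char.CompareTo are real .NET calls. The result of the culture-sensitive call depends on the current culture and platform, so no value is promised for it here.

```csharp
using System;

public static class ComparisonSketch
{
    // Java-like comparison: compares UTF-16 code units, ignoring culture.
    public static int Ordinal(string a, string b) =>
        string.CompareOrdinal(a, b);

    // What "A".CompareTo("a") does: a word sort using the current culture.
    public static int CurrentCulture(string a, string b) =>
        string.Compare(a, b, StringComparison.CurrentCulture);

    // Char.CompareTo compares the numeric values of the two chars.
    public static int Chars(char a, char b) =>
        a.CompareTo(b);
}
```

With these helpers, ComparisonSketch.Ordinal("A", "a") and ComparisonSketch.Chars('A', 'a') are both negative, matching the Java result in the question, while ComparisonSketch.CurrentCulture("A", "a") follows the culture's collation rules instead.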

From Java String
Returns:
the value 0 if the argument string is equal to this string; a value less than 0 if this string is lexicographically less than the string argument; and a value greater than 0 if this string is lexicographically greater than the string argument.
From .Net String.CompareTo
This method performs a word (case-sensitive and culture-sensitive)
comparison using the current culture. For more information about word,
string, and ordinal sorts, see System.Globalization.CompareOptions.
This post explains the difference between the comparison types, and the documentation covers all of them.
If you look at these two, CurrentCulture and Ordinal:
StringComparison.Ordinal:
LATIN SMALL LETTER I (U+0069) is less than LATIN SMALL LETTER DOTLESS I (U+0131)
LATIN SMALL LETTER I (U+0069) is greater than LATIN CAPITAL LETTER I (U+0049)
LATIN SMALL LETTER DOTLESS I (U+0131) is greater than LATIN CAPITAL LETTER I (U+0049)
StringComparison.CurrentCulture:
LATIN SMALL LETTER I (U+0069) is less than LATIN SMALL LETTER DOTLESS I (U+0131)
LATIN SMALL LETTER I (U+0069) is less than LATIN CAPITAL LETTER I (U+0049)
LATIN SMALL LETTER DOTLESS I (U+0131) is greater than LATIN CAPITAL LETTER I (U+0049)
Of these two, Ordinal is the only one where "i" > "I", and hence the Java-like one.

This is due to the order of the characters in the ASCII character set. This is something you should really understand if you are going to do any form of data manipulation in your programs.
I am not sure if the grid control has any properties that allow you to modify the sort order, if not you will have to write your own sort subroutine.
You could use the std::sort function with a user defined predicate function that puts all lower case before upper case.

Related

String.length() vs national letters

So I have a program that works on text, and I need to get the length of a string.
But if my word contains a national letter, the output of the length method is not correct. It gets an additional +1 for each national letter, so it returns 6 for "qwerty", but 7 if I use "e with a little tail" instead of a regular 'e'.
Any ideas how I could fix that?
Also, sorry for the descriptions of the letters, but I think Stack Overflow takes my national symbols as grammar errors and doesn't allow me to post the question :/
It tells you on the page for string.Length what to do (emphasis mine):
The Length property returns the number of Char objects in this
instance, not the number of Unicode characters. The reason is that a
Unicode character might be represented by more than one Char. Use the
System.Globalization.StringInfo class to work with each Unicode
character instead of each Char.
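A minimal sketch of what the quoted documentation suggests, assuming the extra count comes from a combining mark (for example "e" followed by U+0328 COMBINING OGONEK) rather than a single precomposed letter. The class name TextLength and the method name CountTextElements are hypothetical; StringInfo and LengthInTextElements are the real .NET members.

```csharp
using System.Globalization;

public static class TextLength
{
    // Counts text elements (a base character plus its combining marks
    // counts once) instead of counting UTF-16 Char values.
    public static int CountTextElements(string s) =>
        new StringInfo(s).LengthInTextElements;
}
```

For the string "qw\u0065\u0328rty", the Length property is 7, while TextLength.CountTextElements returns 6.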

List of ignorable characters for string comparison

Culture-sensitive comparison in C# does not take "ignorable characters" into account:
Character sets include ignorable characters. The Compare(String, String) method does not consider such characters when it performs a culture-sensitive comparison. For example, a culture-sensitive comparison of "animal" with "ani-mal" (using a soft hyphen, or U+00AD) indicates that the two strings are equivalent, as the following example shows.
Where can I find a complete list of such characters, and maybe some details on how strings containing ignorable characters are compared?
All Unicode code points have a "default ignorable" property that is specified by the Unicode consortium; I would be very surprised if the .NET concept of ignorable characters is in any way different from the value of that property.
The definitive resource on which characters are default-ignorable is the Unicode standard, specifically section 5.21 (link to chapter 5 PDF for Unicode v6.2.0).
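The soft-hyphen case from the quoted documentation can be tried with a sketch like the one below. SoftHyphenSketch and its members are names invented for this illustration. The culture-sensitive result depends on the runtime's globalization data (the documentation quoted above says the strings compare as equivalent), so only the ordinal result is stated with confidence.

```csharp
using System;

public static class SoftHyphenSketch
{
    public const string Plain = "animal";
    // Same word with a soft hyphen (U+00AD) after "ani".
    public const string WithSoftHyphen = "ani\u00ADmal";

    // Culture-sensitive comparison: ignorable characters are skipped
    // according to the culture's rules.
    public static int CompareWithCulture() =>
        string.Compare(Plain, WithSoftHyphen, StringComparison.CurrentCulture);

    // Ordinal comparison: every Char counts, so the soft hyphen matters.
    public static int CompareOrdinal() =>
        string.Compare(Plain, WithSoftHyphen, StringComparison.Ordinal);
}
```

CompareOrdinal() is negative because 'm' (U+006D) sorts before U+00AD; comparing its result with CompareWithCulture() on your own system shows whether the soft hyphen is being ignored there.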

regular expressions with the Cyrillic alphabet?

I am currently writing some validation that will validate inputted data. I am using regular expressions to do so, working with C#.
Password = @"(?!^[0-9]*$)(?!^[a-zA-Z]*$)^([a-zA-Z0-9]{6,18})$"
Validate Alpha Numeric = [^a-zA-Z0-9ñÑáÁéÉíÍóÓúÚüÜ¡¿{0}]
The above works fine with the Latin alphabet, but how can I extend it to work with the Cyrillic alphabet?
The basic approach to covering ranges of characters using regular expressions is to construct an expression of the form [A-Za-z], where A is the first letter of the range, and Z is the last letter of the range.
The problem is, there is no such thing as "The" Cyrillic alphabet: the alphabet is slightly different depending on the language. If you would like to cover Russian version of the Cyrillic, use [А-Яа-я]. You would use a different range, say, for Serbian, because the last letter in their Cyrillic is Ш, not Я.
Another approach is to list all characters one-by-one. Simply find an authoritative reference for the alphabet that you want to put in a regexp, and put all characters for it into a pair of square brackets:
[АБВГДЕЁЖЗИЙКЛМНОПРСТУФХЦЧШЩЪЫЬЭЮЯабвгдеёжзийклмнопрстуфхцчшщъыьэюя]
You can use character classes if you need to allow characters of particular language or particular type:
@"\p{IsCyrillic}+" // Cyrillic letters
@"[\p{Lu}\p{Ll}]+" // any upper/lower case letters in any language
In your case maybe "not a whitespace" would be enough: @"[^\s]+", or maybe "word character" (which includes digits and underscores): @"\w+".
Password = @"(?!^[0-9]*$)(?!^[А-Яа-я]*$)^([А-Яа-я0-9]{6,18})$"
Validate Alpha Numeric = [^а-яА-Я0-9ñÑáÁéÉíÍóÓúÚüÜ¡¿{0}]
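One thing worth checking before relying on the [А-Яа-я] range: the Russian letters Ё and ё sit outside that code point range (U+0401 and U+0451, versus U+0410 to U+044F), so they need to be listed separately. The sketch below compares the range-based and block-based patterns from the answers above; the class and method names are hypothetical, and the anchors ^ and $ are added here only to force a whole-string match.

```csharp
using System.Text.RegularExpressions;

public static class CyrillicPatterns
{
    // Range from the answer above: covers U+0410..U+044F only.
    public static bool MatchesRussianRange(string s) =>
        Regex.IsMatch(s, @"^[А-Яа-я]+$");

    // Same range with Ё/ё added explicitly.
    public static bool MatchesRussianRangeWithYo(string s) =>
        Regex.IsMatch(s, @"^[А-Яа-яЁё]+$");

    // Whole Cyrillic Unicode block (U+0400..U+04FF).
    public static bool MatchesCyrillicBlock(string s) =>
        Regex.IsMatch(s, @"^\p{IsCyrillic}+$");
}
```

For example, MatchesRussianRange("привет") is true, MatchesRussianRange("ёлка") is false, and both MatchesRussianRangeWithYo("ёлка") and MatchesCyrillicBlock("ёлка") are true.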

about string.compare method

A strange question. My code is:
static void Main(string[] args)
{
    Console.WriteLine(string.Compare("-", "a"));  // output: -1
    Console.WriteLine(string.Compare("-d", "a")); // output: 1
    Console.Read();
}
Who can tell me why?
By default, string comparison uses culture-specific settings. These settings allow for varying orders and weights to be applied to letters and symbols; for instance, "resume" and "résumé" will appear fairly close to each other when sorting using most culture settings, because "é" is ordered just after "e" and well before "f", even though the Unicode codepage places é well after the rest of the English alphabet. Similarly, symbols that aren't whitespace, take up a position in the string, but are considered "connective" like dashes, slashes, etc are given low "weight", so that they are only considered as tie-breakers. That means that "a-b" would be sorted just after "ab" and before "ac", because the dash is less important than the letters.
What you think you want is "ordinal sorting", where strings are ordered by their first difference, based on the relative ordinal positions of the differing characters in Unicode. This places "-d" before "a" because "-" comes before "a": the dash is treated as a full character and is compared with the character "a" in the same position. However, in a list of real words, ordinal sorting puts "re-do", "redo", "resume", "rosin", "ruble", and "résumé" in that order, with "résumé" far away from "resume", which may not make sense in context, and certainly not to a non-English speaker.
It compares the relative positions of the characters. In other words, "-" comes before (is less than) "a".
String.Compare() uses word sort rules when comparing. Mind you, these are all relative positions. Here is some information from MSDN.
Value : Condition
Negative : strA is less than strB
Zero : strA equals strB
Positive : strA is greater than strB
The above comparison applies to this overload:
public static int Compare(
    string strA,
    string strB
)
The - is treated as a special case in sorting by the .NET Framework. This answer has the details: https://stackoverflow.com/a/9355086/1180433
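To see ordinal ordering on the word list discussed above, a small sketch along these lines can help. OrdinalSortSketch and SortOrdinal are illustrative names only; Array.Sort and StringComparer.Ordinal are the real .NET APIs. The example assumes "résumé" is written with the precomposed character U+00E9.

```csharp
using System;

public static class OrdinalSortSketch
{
    // Returns a sorted copy of the input, ordered by raw UTF-16 code
    // unit values rather than by culture-specific word sort rules.
    public static string[] SortOrdinal(string[] words)
    {
        var copy = (string[])words.Clone();
        Array.Sort(copy, StringComparer.Ordinal);
        return copy;
    }
}
```

Sorting { "redo", "resume", "rosin", "ruble", "re-do", "r\u00E9sum\u00E9" } this way yields re-do, redo, resume, rosin, ruble, résumé: the dash (U+002D) sorts before every letter, and "é" (U+00E9) sorts after every unaccented ASCII letter. Likewise string.CompareOrdinal("-d", "a") is negative, unlike the culture-sensitive result in the question.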

Using InvariantCultureIgnoreCase instead of ToUpper for case-insensitive string comparisons

On this page, a commenter writes:
Do NOT ever use .ToUpper to insure comparing strings is case-insensitive.
Instead of this:
type.Name.ToUpper() == (controllerName.ToUpper() + "Controller".ToUpper()))
Do this:
type.Name.Equals(controllerName + "Controller",
StringComparison.InvariantCultureIgnoreCase)
Why is this way preferred?
Here is the answer in detail: The Turkey Test (read section 3)
As discussed by lots and lots of
people, the "I" in Turkish behaves
differently than in most languages.
Per the Unicode standard, our
lowercase "i" becomes "İ" (U+0130
"Latin Capital Letter I With Dot
Above") when it moves to uppercase.
Similarly, our uppercase "I" becomes
"ı" (U+0131 "Latin Small Letter
Dotless I") when it moves to
lowercase.
Fix: Again, use an ordinal (raw byte)
comparer, or invariant culture for
comparisons unless you absolutely need
culturally based linguistic
comparisons (which give you uppercase
I's with dots in Turkey)
And according to Microsoft, you should not even be using the Invariant comparisons, but the Ordinal ones (New Recommendations for Using Strings in Microsoft .NET 2.0).
In short, the comparison is optimized by the CLR, and it uses less memory as well because no upper-cased copies of the strings are created.
Further, uppercase comparison is more optimized than ToLower(), if that tiny degree of performance matters.
In response to your example, there is an even faster way:
String.Equals(type.Name, controllerName + "Controller",
StringComparison.InvariantCultureIgnoreCase);
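The Turkish "i" problem described above can be reproduced with a sketch like the following. CaseInsensitiveSketch and its method names are hypothetical; the example uses OrdinalIgnoreCase, following the Microsoft recommendation mentioned above, rather than the InvariantCultureIgnoreCase from the question. The Turkish behavior assumes the runtime has culture data for "tr-TR" available.

```csharp
using System;
using System.Globalization;

public static class CaseInsensitiveSketch
{
    // The discouraged approach: upper-case both sides, then compare.
    // The result depends on the casing rules of the given culture.
    public static bool EqualsViaToUpper(string a, string b, CultureInfo culture) =>
        a.ToUpper(culture) == b.ToUpper(culture);

    // Culture-independent, case-insensitive comparison that creates
    // no temporary upper-cased strings.
    public static bool EqualsOrdinalIgnoreCase(string a, string b) =>
        string.Equals(a, b, StringComparison.OrdinalIgnoreCase);
}
```

EqualsOrdinalIgnoreCase("title", "TITLE") is true regardless of the machine's culture. With CultureInfo.GetCultureInfo("tr-TR"), EqualsViaToUpper("title", "TITLE", ...) would be expected to return false on a system with Turkish culture data, because "i" upper-cases to "İ" (U+0130) there.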
