about string.compare method

A strange question. My code is:
static void Main(string[] args)
{
    Console.WriteLine(string.Compare("-", "a"));  // output: -1
    Console.WriteLine(string.Compare("-d", "a")); // output: 1
    Console.Read();
}
Can anyone tell me why?

By default, string comparison uses culture-specific settings. These settings allow varying orders and weights to be applied to letters and symbols. For instance, "resume" and "résumé" appear fairly close to each other when sorted under most culture settings, because "é" is ordered just after "e" and well before "f", even though the Unicode code point of "é" lies well after the whole English alphabet. Similarly, symbols that are not whitespace and do take up a position in the string, but are considered "connective" (dashes, slashes, and so on), are given low "weight", so that they are only considered as tie-breakers. That means "a-b" is sorted just after "ab" and before "ac", because the dash is less important than the letters. It also explains your output: "-" on its own has nothing but the dash to compare, so it sorts before "a", whereas in "-d" the dash is only a tie-breaker and "d" sorts after "a".
What you probably want is "ordinal sorting", where strings are ordered by the first position at which they differ, using the numeric Unicode values of the differing characters. This places "-d" before "a" just as "-" comes before "a", because the dash is treated as a full character and is compared to the "a" in the same position. However, an ordinal sort of real words would put "re-do", "redo", "resume", "rosin", "ruble", and "résumé" in that order, which may not make sense in context, and certainly not to a non-English speaker.
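For what it's worth, here is a minimal sketch of an ordinal sort over that word list. The class and method names (OrdinalSortDemo, SortOrdinal) are made up for illustration, and "é" is assumed to be the precomposed character U+00E9.

```csharp
using System;

static class OrdinalSortDemo
{
    // Returns a sorted copy; strings are ordered by their UTF-16 code unit values.
    public static string[] SortOrdinal(string[] words)
    {
        var copy = (string[])words.Clone();
        Array.Sort(copy, StringComparer.Ordinal);
        return copy;
    }

    static void Main()
    {
        // \u00E9 is the precomposed "é", written as an escape so that the
        // encoding of the source file cannot change the result.
        string[] words = { "redo", "resume", "rosin", "ruble", "re-do", "r\u00E9sum\u00E9" };

        Console.WriteLine(string.Join(", ", SortOrdinal(words)));
        // prints: re-do, redo, resume, rosin, ruble, résumé
    }
}
```

The dash (U+002D) is smaller than every letter, so "re-do" moves to the front, while "é" (U+00E9) is larger than every ASCII letter, so "résumé" moves to the end.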

It compares the relative positions of the characters in the sort order. In other words, "-" comes before (is less than) "a".
String.Compare() uses word sort rules when comparing. Mind you, these are all relative positions. Here is some information from MSDN.
Value : Condition
Negative : strA is less than strB
Zero : strA equals strB
Positive : strA is greater than strB
The above comparison applies to this overload:
public static int Compare(
string strA,
string strB
)

The - is treated as a special case in sorting by the .NET Framework. This answer has the details: https://stackoverflow.com/a/9355086/1180433
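A small sketch of how the two kinds of comparison can be told apart in code, assuming you only care about the sign of the result. CompareSignDemo and OrdinalSign are illustrative names, not part of the framework.

```csharp
using System;

static class CompareSignDemo
{
    // Collapses a comparison result to -1, 0 or 1; only the sign is documented.
    public static int OrdinalSign(string a, string b) =>
        Math.Sign(string.CompareOrdinal(a, b));

    static void Main()
    {
        // Ordinal: '-' (U+002D) is compared directly against 'a' (U+0061).
        Console.WriteLine(OrdinalSign("-", "a"));   // prints: -1
        Console.WriteLine(OrdinalSign("-d", "a"));  // prints: -1

        // Culture-sensitive (word sort): the result depends on the runtime and
        // its culture data, so no expected output is given here.
        Console.WriteLine(Math.Sign(string.Compare("-d", "a", StringComparison.CurrentCulture)));
    }
}
```

Passing a StringComparison explicitly makes it obvious to the reader which set of rules is in effect.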

Related

Simplify my regular expression (it's in C# so many suggestions are not working, I already tried)

Can anyone simplify my regex? I designed it after many tests and tried many things. Please don't simplify it according to JS rules; they seem to work differently, otherwise I would have done that myself.
"^[M]{0,3}([C]{1}[M]{1}){0,1}[D]{0,3}([C]{1}[D]{1}){0,1}[C]{0,3}([X]{1}[C]{1}){0,1}[L]{0,3}([X]{1}[L]{1}){0,1}[X]{0,3}([I]{1}[X]{1}){0,1}[V]{0,3}([I]{1}[V]{1}){0,1}[I]{0,3}$"
All characters with sequence are compulsory.
Adding some rules. This one is for a Roman numeral system as per my requirements:
Numbers are formed by combining symbols together and adding the values. For example, MMVI is 1000 + 1000 + 5 + 1 = 2006. Generally, symbols are placed in order of value, starting with the largest values. When smaller values precede larger values, the smaller values are subtracted from the larger values, and the result is added to the total. For example MCMXLIV = 1000 + (1000 − 100) + (50 − 10) + (5 − 1) = 1944.
The symbols "I", "X", "C", and "M" can be repeated three times in succession, but no more. (They may appear four times if the third and fourth are separated by a smaller value, such as XXXIX.) "D", "L", and "V" can never be repeated.
"I" can be subtracted from "V" and "X" only. "X" can be subtracted from "L" and "C" only. "C" can be subtracted from "D" and "M" only. "V", "L", and "D" can never be subtracted.
Only one small-value symbol may be subtracted from any large-value symbol.
A number written in Arabic numerals can be broken into digits. For example, 1903 is composed of 1, 9, 0, and 3. To write the Roman numeral, each of the non-zero digits should be treated separately. In the above example, 1,000 = M, 900 = CM, and 3 = III. Therefore, 1903 = MCMIII.

A few points:
There is no need for character classes with only one item, so "[M]" can be replaced with "M" (for example).
"{0,1}" can always be replaced with "?" without changing the meaning of the regex.
You never need to include "{1}", as it doesn't add any additional constraint.
For long regular expressions I suggest breaking the regex down into logical "subgroups" using string constants and building the regex from them; it's easier to read.
Always include comments above the regular expression explaining its purpose and giving examples of valid and invalid inputs (unless it's short enough to be obvious), otherwise it'll be difficult to maintain.
I haven't tested this as thoroughly as I'd like (it would be easier to do so given some examples of valid and invalid strings) but here's a stab at it:
"^M{0,3}(CM)?D{0,3}(CD)?C{0,3}(XC)?L{0,3}(XL)?X{0,3}(IX)?V{0,3}(IV)?I{0,3}$"
This'll match the string "MDCLXVI" but not something like "MMMMDCLXVI".
With that said, I suspect that your original regex isn't doing exactly what you intended it to, so this may not be only a problem of simplification. For example, you state in your post that "All characters with sequence are compulsory", but right now no particular sequence of strings is required; in fact, the regex will even match the empty string, which I suspect isn't what you want.
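To make the "subgroups" suggestion concrete, here is a rough sketch that builds the simplified pattern from string constants and checks that it agrees with the original on a few inputs. The constant names (GroupM, GroupDC, and so on) and the sample inputs are my own choices for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

static class RomanRegexDemo
{
    // The pattern from the question, unchanged (only split across two lines).
    public const string Original =
        "^[M]{0,3}([C]{1}[M]{1}){0,1}[D]{0,3}([C]{1}[D]{1}){0,1}[C]{0,3}([X]{1}[C]{1}){0,1}" +
        "[L]{0,3}([X]{1}[L]{1}){0,1}[X]{0,3}([I]{1}[X]{1}){0,1}[V]{0,3}([I]{1}[V]{1}){0,1}[I]{0,3}$";

    // The simplified pattern, assembled from named pieces.
    const string GroupM  = "M{0,3}(CM)?";
    const string GroupDC = "D{0,3}(CD)?C{0,3}(XC)?";
    const string GroupLX = "L{0,3}(XL)?X{0,3}(IX)?";
    const string GroupVI = "V{0,3}(IV)?I{0,3}";
    public const string Simplified = "^" + GroupM + GroupDC + GroupLX + GroupVI + "$";

    // True when both patterns give the same verdict for the input.
    public static bool Agree(string input) =>
        Regex.IsMatch(input, Original) == Regex.IsMatch(input, Simplified);

    static void Main()
    {
        foreach (string s in new[] { "MDCLXVI", "MMMMDCLXVI", "MCMXLIV", "DDD", "" })
        {
            Console.WriteLine($"'{s}': simplified={Regex.IsMatch(s, Simplified)}, agree={Agree(s)}");
        }
    }
}
```

Note that "DDD" and the empty string both match, which shows the looseness mentioned above: the simplification preserves the behaviour of the original pattern, including its gaps.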

This expression cannot be simplified for now, because I am trying to validate a string with C# regex processing. I have tried many other ways as well, including the suggestion provided above.
So I am closing this question for now.

Alphabetical order does not compare from left to right?

I thought that in .NET strings were compared alphabetically and that they were compared from left to right.
string[] strings = { "-1", "1", "1Foo", "-1Foo" };
Array.Sort(strings);
Console.WriteLine(string.Join(",", strings));
I'd expect this (or both entries with a leading minus first):
1,1Foo,-1,-1Foo
But the result is:
1,-1,1Foo,-1Foo
It seems to be a mixture, either the minus sign is ignored or multiple characters are compared even if the first character was already different.
Edit: I've now tested OrdinalIgnoreCase and I get the expected order:
Array.Sort(strings, StringComparer.OrdinalIgnoreCase);
But even if I use InvariantCultureIgnoreCase I get the unexpected order.

Jon Skeet to the rescue here
Specifically:
The .NET Framework uses three distinct ways of sorting: word sort,
string sort, and ordinal sort. Word sort performs a culture-sensitive
comparison of strings. Certain nonalphanumeric characters might have
special weights assigned to them. For example, the hyphen ("-") might
have a very small weight assigned to it so that "coop" and "co-op"
appear next to each other in a sorted list. String sort is similar to
word sort, except that there are no special cases. Therefore, all
nonalphanumeric symbols come before all alphanumeric characters.
Ordinal sort compares strings based on the Unicode values of each
element of the string.
But passing StringComparer.Ordinal makes it behave as you want:
string[] strings = { "-1", "1", "10", "-10", "a", "ba", "-a" };
Array.Sort(strings, StringComparer.Ordinal);
Console.WriteLine(string.Join(",", strings));
// prints: -1,-10,-a,1,10,a,ba
Edit:
About the Ordinal, quoting from MSDN CompareOptions Enumeration
Ordinal Indicates that the string comparison must use successive
Unicode UTF-16 encoded values of the string (code unit by code unit
comparison), leading to a fast comparison but one that is
culture-insensitive. A string starting with a code unit XXXX₁₆ comes
before a string starting with YYYY₁₆, if XXXX₁₆ is less than YYYY₁₆.
This value cannot be combined with other CompareOptions values and
must be used alone.
There is also String.CompareOrdinal if you want an ordinal comparison of two strings.
Here's another note of interest:
When possible, the application should use string comparison methods
that accept a CompareOptions value to specify the kind of comparison
expected. As a general rule, user-facing comparisons are best served
by the use of linguistic options (using the current culture), while
security comparisons should specify Ordinal or OrdinalIgnoreCase.
I guess we humans expect ordinal when dealing with strings :)
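If you want to see the three kinds of sort side by side, something like the following sketch should do. SortKindsDemo and Sorted are hypothetical names; the word-sort and string-sort lines depend on the runtime and its culture data, so I only state the output for the ordinal line.

```csharp
using System;
using System.Globalization;

static class SortKindsDemo
{
    // Returns a sorted copy of the items using the given comparison.
    public static string[] Sorted(string[] items, Comparison<string> compare)
    {
        var copy = (string[])items.Clone();
        Array.Sort(copy, compare);
        return copy;
    }

    static void Main()
    {
        string[] items = { "-1", "1", "1Foo", "-1Foo" };
        CompareInfo ci = CultureInfo.InvariantCulture.CompareInfo;

        // Word sort: the default for culture-sensitive comparisons.
        Console.WriteLine(string.Join(",", Sorted(items, (a, b) => ci.Compare(a, b, CompareOptions.None))));

        // String sort: like word sort, but without the special cases.
        Console.WriteLine(string.Join(",", Sorted(items, (a, b) => ci.Compare(a, b, CompareOptions.StringSort))));

        // Ordinal sort: plain UTF-16 code unit comparison.
        Console.WriteLine(string.Join(",", Sorted(items, string.CompareOrdinal)));
        // the ordinal line prints: -1,-1Foo,1,1Foo
    }
}
```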

There is a small note on the String.CompareTo method documentation:
Notes to Callers:
Character sets include ignorable characters. The
CompareTo(String) method does not consider such characters when it
performs a culture-sensitive comparison. For example, if the following
code is run on the .NET Framework 4 or later, a comparison of "animal"
with "ani-mal" (using a soft hyphen, or U+00AD) indicates that the two
strings are equivalent.
And then a little later states:
To recognize ignorable characters in a string comparison, call the
CompareOrdinal(String, String) method.
These two statements seem to be consistent with the results you are seeing.
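A quick sketch of the soft hyphen case from that note, for anyone who wants to try it. SoftHyphenDemo and OrdinalEqual are illustrative names; the culture-sensitive line is printed without a stated result because it depends on the runtime in use.

```csharp
using System;

static class SoftHyphenDemo
{
    public const string Plain = "animal";
    public const string WithSoftHyphen = "ani\u00ADmal"; // contains U+00AD SOFT HYPHEN

    // Ordinal equality looks at every code unit, including ignorable ones.
    public static bool OrdinalEqual(string a, string b) =>
        string.Equals(a, b, StringComparison.Ordinal);

    static void Main()
    {
        // Culture-sensitive: per the quoted note, this reports equality (0)
        // on .NET Framework 4 or later; other runtimes may differ.
        Console.WriteLine(string.Compare(Plain, WithSoftHyphen, StringComparison.CurrentCulture));

        Console.WriteLine(OrdinalEqual(Plain, WithSoftHyphen)); // prints: False
        Console.WriteLine(WithSoftHyphen.Length);               // prints: 7
    }
}
```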

string.IndexOf() not recognizing modified characters

When I use IndexOf to find a char that is followed by a high-valued char (e.g. char 700, which is ʼ), IndexOf fails to recognize the char I am looking for.
e.g.
string find = "abcʼabcabc";
int index = find.IndexOf("c");
In this code, index should be 2, but it returns 6.
Is there a way to get around this?

Unicode character 700 (U+02BC) is the modifier letter apostrophe: in other words, it modifies the letter c. In the same way, if you were to use an 'e' followed by character 769 (0x301), it would not really be an 'e' anymore: the e has been modified to be e with an acute accent. To wit: é. You'll see that letter is actually two characters: copy it to notepad and hit backspace (neat, huh?).
You need to do an "Ordinal" comparison (code unit by code unit) without any linguistic comparison. That will find the 'c' and ignore the linguistic fact that it is modified by the next character. In my 'e' example, the characters are (101)(769), so if you go through the string looking for 101, you will find it, and that ignores the fact that (101)(769) is linguistically the same as (233): é. If you search for (233) linguistically it will find the "equivalent" (101)(769):
string find = "abe\u0301abcabc"; // 'e' followed by U+0301 COMBINING ACUTE ACCENT
int index = find.IndexOf("\u00E9"); // gives you 2, even though "find" holds two characters there and the search string holds one
Hopefully that's not too confusing. If you're doing this in real code you should explain in comments exactly what you're doing: as in my 'e' example, generally you would want semantic equivalence for user data, and ordinal equivalence for e.g. constants (which hopefully wouldn't be different like this, lest your successor hunt you down with an axe).

The cʼ construct is handled linguistically as something different from a plain c. Use the Ordinal string comparison to force a code-unit comparison.
string find = "abcʼabcabc";
int index = find.IndexOf("c", StringComparison.Ordinal);
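Here is a slightly fuller sketch of the ordinal search, with the special characters written as escapes so there is no doubt about what the strings contain. IndexOfDemo and OrdinalIndexOf are made-up names; the normalization step at the end is an extra I added, not something from the answers above.

```csharp
using System;
using System.Text;

static class IndexOfDemo
{
    // Searches code unit by code unit, with no linguistic rules applied.
    public static int OrdinalIndexOf(string text, string value) =>
        text.IndexOf(value, StringComparison.Ordinal);

    static void Main()
    {
        string find = "abc\u02BCabcabc"; // U+02BC MODIFIER LETTER APOSTROPHE after the first 'c'
        Console.WriteLine(OrdinalIndexOf(find, "c")); // prints: 2
        Console.WriteLine(find.IndexOf('c'));         // prints: 2 (the char overload is ordinal)

        // 'e' + U+0301 is linguistically "é" but not ordinally equal to U+00E9.
        string decomposed = "abe\u0301abcabc";
        Console.WriteLine(OrdinalIndexOf(decomposed, "\u00E9")); // prints: -1

        // Normalizing to form C composes the pair into U+00E9 first.
        string composed = decomposed.Normalize(NormalizationForm.FormC);
        Console.WriteLine(OrdinalIndexOf(composed, "\u00E9"));   // prints: 2
    }
}
```

If you need ordinal speed but still want composed and decomposed forms to match, normalizing both sides before the search is one option.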

Why the capital letter is greater than small letter in .Net?

In Java:
"A".compareTo("a"); // returns -32: "A" is less than "a"
In .NET, using String.CompareTo:
"A".CompareTo("a"); // returns 1: "A" is greater than "a"
In .NET, using Char.CompareTo:
'A'.CompareTo('a'); // returns -32: 'A' is less than 'a'
I know that Java compares string characters by their position in the Unicode table, but .NET does not. How does .NET decide that a capital letter is greater than the corresponding small letter?
String.CompareTo Method (String)

The doc I could find says that:
This method performs a word (case-sensitive and culture-sensitive) comparison using the current culture.
So, it is not quite the same as Java's .compareTo() which does a lexicographical comparison by default, using Unicode code points, as you say.
Therefore, in .NET, it depends on your current "culture" (Java would call this a "locale", I guess).
It seems that if you want to do String comparison "à la Java" in .NET, you must use String.CompareOrdinal() instead.
Conversely, if you want to do locale-dependent string comparison in Java, you need to use a Collator.
Lastly, another link on MSDN shows the influence of cultures on comparisons and even string equality.
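As a rough illustration of the Java-like behaviour in .NET, assuming you only rely on the sign of the result (JavaStyleCompareDemo and JavaLikeSign are names I made up):

```csharp
using System;

static class JavaStyleCompareDemo
{
    // Ordinal comparison, reduced to its sign: negative means a sorts before b.
    public static int JavaLikeSign(string a, string b) =>
        Math.Sign(string.CompareOrdinal(a, b));

    static void Main()
    {
        Console.WriteLine(JavaLikeSign("A", "a")); // prints: -1
        Console.WriteLine('A'.CompareTo('a'));     // prints: -32

        // Culture-sensitive word comparison; the result depends on the current
        // culture and runtime, so no expected output is given.
        Console.WriteLine(Math.Sign("A".CompareTo("a")));
    }
}
```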

From Java String
Returns:
the value 0 if the argument string is equal to this string; a value less than 0 if this string is lexicographically less than the string argument; and a value greater than 0 if this string is lexicographically greater than the string argument.
From .Net String.CompareTo
This method performs a word (case-sensitive and culture-sensitive)
comparison using the current culture. For more information about word,
string, and ordinal sorts, see System.Globalization.CompareOptions.
This post explains the difference between the comparison types, and so does the documentation.
If you look at these two, CurrentCulture and Ordinal:
StringComparison.Ordinal:
LATIN SMALL LETTER I (U+0069) is less than LATIN SMALL LETTER DOTLESS I (U+0131)
LATIN SMALL LETTER I (U+0069) is greater than LATIN CAPITAL LETTER I (U+0049)
LATIN SMALL LETTER DOTLESS I (U+0131) is greater than LATIN CAPITAL LETTER I (U
StringComparison.CurrentCulture:
LATIN SMALL LETTER I (U+0069) is less than LATIN SMALL LETTER DOTLESS I (U+0131)
LATIN SMALL LETTER I (U+0069) is less than LATIN CAPITAL LETTER I (U+0049)
LATIN SMALL LETTER DOTLESS I (U+0131) is greater than LATIN CAPITAL LETTER I (U+0049)
Ordinal is the only one where "i" > "I", and hence the Java-like one.

This is due to the order of the characters in the ASCII character set. This is something you should really understand if you are going to do any form of data manipulation in your programs.
I am not sure if the grid control has any properties that allow you to modify the sort order, if not you will have to write your own sort subroutine.
You could use the std::sort function with a user defined predicate function that puts all lower case before upper case.

Using InvariantCultureIgnoreCase instead of ToUpper for case-insensitive string comparisons

On this page, a commenter writes:
Do NOT ever use .ToUpper to insure comparing strings is case-insensitive.
Instead of this:
type.Name.ToUpper() == (controllerName.ToUpper() + "Controller".ToUpper()))
Do this:
type.Name.Equals(controllerName + "Controller",
StringComparison.InvariantCultureIgnoreCase)
Why is this way preferred?

Here is the answer in detail: The Turkey Test (read section 3)
As discussed by lots and lots of
people, the "I" in Turkish behaves
differently than in most languages.
Per the Unicode standard, our
lowercase "i" becomes "İ" (U+0130
"Latin Capital Letter I With Dot
Above") when it moves to uppercase.
Similarly, our uppercase "I" becomes
"ı" (U+0131 "Latin Small Letter
Dotless I") when it moves to
lowercase.
Fix: Again, use an ordinal (raw byte)
comparer, or invariant culture for
comparisons unless you absolutely need
culturally based linguistic
comparisons (which give you uppercase
I's with dots in Turkey)
And according to Microsoft, you should not even be using Invariant but Ordinal (New Recommendations for Using Strings in Microsoft .NET 2.0).
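A minimal sketch of the Turkish "i" problem, assuming the runtime has Turkish culture data available (with globalization-invariant settings, creating "tr-TR" may fail or behave like the invariant culture). TurkeyTestDemo and EqualsIgnoreCaseOrdinal are illustrative names.

```csharp
using System;
using System.Globalization;

static class TurkeyTestDemo
{
    // Case-insensitive comparison that does not depend on any culture.
    public static bool EqualsIgnoreCaseOrdinal(string a, string b) =>
        string.Equals(a, b, StringComparison.OrdinalIgnoreCase);

    static void Main()
    {
        // Assumption: Turkish culture data is installed on this machine.
        var turkish = new CultureInfo("tr-TR");

        // Under Turkish rules "i" upper-cases to U+0130 rather than "I",
        // so this check is expected to fail when Turkish data is present.
        Console.WriteLine("title".ToUpper(turkish) == "TITLE");

        // The ordinal comparison gives the same answer on every machine.
        Console.WriteLine(EqualsIgnoreCaseOrdinal("title", "TITLE")); // prints: True
    }
}
```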

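In short, the comparison is optimized by the CLR and uses less memory as well: ToUpper() allocates an upper-cased copy of each operand, whereas Equals with a StringComparison compares the existing strings.
Further, if you do have to normalize case, uppercase comparison is more optimized than ToLower(), if that tiny degree of performance matters.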
In response to your example there is a faster way yet:
String.Equals(type.Name, controllerName + "Controller",
StringComparison.InvariantCultureIgnoreCase);
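Putting the advice together, a helper along these lines might be what you end up with. This is only a sketch: ControllerNameMatcher and IsControllerType are hypothetical names, and I used OrdinalIgnoreCase (per the Microsoft recommendation mentioned earlier) rather than InvariantCultureIgnoreCase.

```csharp
using System;

static class ControllerNameMatcher
{
    // True when typeName equals controllerName + "Controller", ignoring case.
    // The static string.Equals also copes with a null typeName.
    public static bool IsControllerType(string typeName, string controllerName) =>
        string.Equals(typeName, controllerName + "Controller", StringComparison.OrdinalIgnoreCase);

    static void Main()
    {
        Console.WriteLine(IsControllerType("HomeController", "home")); // prints: True
        Console.WriteLine(IsControllerType("HomeService", "home"));    // prints: False
        Console.WriteLine(IsControllerType(null, "home"));             // prints: False
    }
}
```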
