List of ignorable characters for string comparison

Culture-sensitive comparison in C# does not take "ignorable characters" into account:
Character sets include ignorable characters. The Compare(String, String) method does not consider such characters when it performs a culture-sensitive comparison. For example, a culture-sensitive comparison of "animal" with "ani-mal" (using a soft hyphen, or U+00AD) indicates that the two strings are equivalent, as the following example shows.
Where can I find a complete list of such characters, and perhaps some details on how strings containing ignorable characters are compared?

All Unicode code points have a "default ignorable" property that is specified by the Unicode Consortium; I would be very surprised if the .NET concept of ignorable characters differed in any way from the value of that property.
The definitive resource on which characters are default-ignorable is the Unicode standard, specifically section 5.21 (link to chapter 5 PDF for Unicode v6.2.0).
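
The soft-hyphen case quoted in the question can be checked with a minimal sketch like the one below. The variable names are illustrative. The culture-sensitive result is the one the documentation describes and may differ depending on the runtime's globalization settings, so only the ordinal result is stated as a definite output.

```csharp
using System;

string plain = "animal";
string withSoftHyphen = "ani\u00ADmal"; // U+00AD SOFT HYPHEN between "ani" and "mal"

// Culture-sensitive: per the documentation quoted in the question, U+00AD is
// ignored, so this is expected to print 0 (the strings compare as equivalent).
int cultural = string.Compare(plain, withSoftHyphen, StringComparison.CurrentCulture);
Console.WriteLine(cultural);

// Ordinal: every UTF-16 code unit counts, so the strings differ.
int ordinal = string.CompareOrdinal(plain, withSoftHyphen);
Console.WriteLine(ordinal < 0); // True: 'm' (U+006D) sorts before U+00AD
```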

Related

Why do some vendors map Unicode characters to another character set (code page)?

I'm reading a book which talks about text encoding in .NET:
There are two categories of text encoding in .NET:
• Those that map Unicode characters to another character set
• Those that use standard Unicode encoding schemes
The first category contains legacy encodings such as IBM’s EBCDIC and 8-bit character sets with extended characters in the upper-128 region that were popular prior to Unicode (identified by a code page). In the second category are UTF-8, UTF-16, and UTF-32.
I'm confused about the first category, the code page part. I have read some questions on Stack Overflow, but none of them is the same as the one I'm going to ask. My question is:
Why do some vendors need to map Unicode characters to another character set? As I understand it, Unicode covers the characters of almost all languages in the world, so why reinvent the wheel by mapping Unicode characters to another character set? For example, line feed in Unicode is U+000A; why would you want to map it to another character? If you just stick to the Unicode standard, you can use binary code to represent all kinds of characters.
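
The two categories the book describes can be sketched as follows. This is only an illustration, assuming ISO-8859-1 as the example legacy character set (it is not named in the book excerpt). It shows that the same Unicode character becomes different bytes depending on the encoding, which is why a mapping is needed whenever existing data or protocols use a legacy code page.

```csharp
using System;
using System.Text;

string text = "\u00E9"; // LATIN SMALL LETTER E WITH ACUTE

// Category 1: a legacy single-byte character set (ISO-8859-1, chosen as an example).
byte[] latin1 = Encoding.GetEncoding("iso-8859-1").GetBytes(text);

// Category 2: standard Unicode encoding schemes.
byte[] utf8 = Encoding.UTF8.GetBytes(text);
byte[] utf16 = Encoding.Unicode.GetBytes(text); // UTF-16, little-endian

Console.WriteLine(BitConverter.ToString(latin1)); // E9
Console.WriteLine(BitConverter.ToString(utf8));   // C3-A9
Console.WriteLine(BitConverter.ToString(utf16));  // E9-00
```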

String.length() vs national letters

I have a program that works on text, and I need to get the length of a string.
But if the word contains a national letter, the output of the length method is not correct. It counts one extra for each national letter, so it returns 6 for "qwerty" but 7 if I use "e with a little tail" instead of a regular 'e'.
Any ideas how I could fix that?
Also, sorry for describing the letters in words, but I think Stack Overflow treats my national symbols as grammar errors and doesn't allow me to post the question :/

It tells you on the page for string.Length what to do (emphasis mine):
The Length property returns the number of Char objects in this
instance, not the number of Unicode characters. The reason is that a
Unicode character might be represented by more than one Char. Use the
System.Globalization.StringInfo class to work with each Unicode
character instead of each Char.
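
A minimal sketch of the difference, assuming the "e with a little tail" is stored as a plain 'e' followed by U+0328 COMBINING OGONEK (an assumption; the question does not say how the letter is encoded):

```csharp
using System;
using System.Globalization;

// 'e' followed by a combining ogonek: two Char values, one visible letter.
string word = "qwe\u0328rty";

// Length counts Char objects (UTF-16 code units).
Console.WriteLine(word.Length); // 7

// StringInfo counts text elements (a base character plus its combining marks).
var info = new StringInfo(word);
Console.WriteLine(info.LengthInTextElements); // 6
```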

Alphabetical order does not compare from left to right?

I thought that in .NET strings were compared alphabetically and that they were compared from left to right.
string[] strings = { "-1", "1", "1Foo", "-1Foo" };
Array.Sort(strings);
Console.WriteLine(string.Join(",", strings));
I'd expect this (or the two strings with a leading minus first):
1,1Foo,-1,-1Foo
But the result is:
1,-1,1Foo,-1Foo
It seems to be a mixture: either the minus sign is ignored, or multiple characters are compared even though the first character already differs.
Edit: I've now tested OrdinalIgnoreCase and I get the expected order:
Array.Sort(strings, StringComparer.OrdinalIgnoreCase);
But even if I use InvariantCultureIgnoreCase, I get the unexpected order.

Jon Skeet to the rescue here
Specifically:
The .NET Framework uses three distinct ways of sorting: word sort,
string sort, and ordinal sort. Word sort performs a culture-sensitive
comparison of strings. Certain nonalphanumeric characters might have
special weights assigned to them. For example, the hyphen ("-") might
have a very small weight assigned to it so that "coop" and "co-op"
appear next to each other in a sorted list. String sort is similar to
word sort, except that there are no special cases. Therefore, all
nonalphanumeric symbols come before all alphanumeric characters.
Ordinal sort compares strings based on the Unicode values of each
element of the string.
But adding the StringComparer.Ordinal makes it behave as you want:
string[] strings = { "-1", "1", "10", "-10", "a", "ba","-a" };
Array.Sort(strings,StringComparer.Ordinal );
Console.WriteLine(string.Join(",", strings));
// prints: -1,-10,-a,1,10,a,ba
Edit:
About the Ordinal, quoting from MSDN CompareOptions Enumeration
Ordinal Indicates that the string comparison must use successive
Unicode UTF-16 encoded values of the string (code unit by code unit
comparison), leading to a fast comparison but one that is
culture-insensitive. A string starting with a code unit XXXX₁₆ comes
before a string starting with YYYY₁₆, if XXXX₁₆ is less than YYYY₁₆.
This value cannot be combined with other CompareOptions values and
must be used alone.
There is also String.CompareOrdinal if you want an ordinal comparison of two strings.
Here's another note of interest:
When possible, the application should use string comparison methods
that accept a CompareOptions value to specify the kind of comparison
expected. As a general rule, user-facing comparisons are best served
by the use of linguistic options (using the current culture), while
security comparisons should specify Ordinal or OrdinalIgnoreCase.
I guess we humans expect ordinal when dealing with strings :)
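
As a rough sketch of that advice applied to the array from the question: the ordinal results below follow directly from the code unit values, while the linguistic result depends on the runtime's collation data and is therefore printed without an expected value. The variable names are illustrative.

```csharp
using System;
using System.Globalization;

string[] strings = { "-1", "1", "1Foo", "-1Foo" };

// Ordinal sort: '-' (U+002D) has a lower code unit value than '1' (U+0031).
Array.Sort(strings, StringComparer.Ordinal);
Console.WriteLine(string.Join(",", strings)); // prints: -1,-1Foo,1,1Foo

// The same choice expressed through a CompareOptions value.
CompareInfo compareInfo = CultureInfo.InvariantCulture.CompareInfo;
int ordinal = compareInfo.Compare("-1", "1", CompareOptions.Ordinal);
Console.WriteLine(ordinal < 0); // True

// Linguistic (word sort) comparison of the same pair; the result can vary
// with the platform's collation rules, so no output is claimed here.
int linguistic = compareInfo.Compare("-1", "1", CompareOptions.None);
Console.WriteLine(linguistic);
```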

There is a small note on the String.CompareTo method documentation:
Notes to Callers:
Character sets include ignorable characters. The
CompareTo(String) method does not consider such characters when it
performs a culture-sensitive comparison. For example, if the following
code is run on the .NET Framework 4 or later, a comparison of "animal"
with "ani-mal" (using a soft hyphen, or U+00AD) indicates that the two
strings are equivalent.
And then a little later states:
To recognize ignorable characters in a string comparison, call the
CompareOrdinal(String, String) method.
These two statements seem to be consistent with the results you are seeing.

Algorithm to code URL [duplicate]

This question already has answers here:
URL Encoding using C#
(14 answers)
Closed 9 years ago.
Is there an algorithm in C# to encode a URL containing symbols so that it displays correctly in a web browser?
Something like Base64.

The Standard (RFC 3986 aka STD 66) lays it out for you. In particular, §2 and 2.1:
2. Characters
The URI syntax provides a method of encoding data, presumably for the
sake of identifying a resource, as a sequence of characters. The URI
characters are, in turn, frequently encoded as octets for transport
or presentation. This specification does not mandate any particular
character encoding for mapping between URI characters and the octets
used to store or transmit those characters. When a URI appears in a
protocol element, the character encoding is defined by that protocol;
without such a definition, a URI is assumed to be in the same
character encoding as the surrounding text.
The ABNF notation defines its terminal values to be non-negative
integers (codepoints) based on the US-ASCII coded character set
[ASCII]. Because a URI is a sequence of characters, we must invert
that relation in order to understand the URI syntax. Therefore, the
integer values used by the ABNF must be mapped back to their
corresponding characters via US-ASCII in order to complete the syntax
rules.
A URI is composed from a limited set of characters consisting of
digits, letters, and a few graphic symbols. A reserved subset of
those characters may be used to delimit syntax components within a
URI while the remaining characters, including both the unreserved set
and those reserved characters not acting as delimiters, define each
component's identifying data.
2.1. Percent-Encoding
A percent-encoding mechanism is used to represent a data octet in a
component when that octet's corresponding character is outside the
allowed set or is being used as a delimiter of, or within, the
component. A percent-encoded octet is encoded as a character
triplet, consisting of the percent character "%" followed by the two
hexadecimal digits representing that octet's numeric value. For
example, "%20" is the percent-encoding for the binary octet
"00100000" (ABNF: %x20), which in US-ASCII corresponds to the space
character (SP). Section 2.4 describes when percent-encoding and
decoding is applied.
pct-encoded = "%" HEXDIG HEXDIG
The uppercase hexadecimal digits 'A' through 'F' are equivalent to
the lowercase digits 'a' through 'f', respectively. If two URIs
differ only in the case of hexadecimal digits used in percent-encoded
octets, they are equivalent. For consistency, URI producers and
normalizers should use uppercase hexadecimal digits for all percent-
encodings.
In general, the only characters that may freely be represented in a URL without being percent-encoded are
The unreserved characters. These are the US-ASCII (7-bit) characters
A-Z
a-z
0-9
-._~
The reserved characters ... when used within their role in the grammar of a URL and its scheme. These reserved characters are:
:/?#[]@!$&'()*+,;=
Any other characters, per the standard, must be properly percent-encoded.
Further note that a URL may only contain characters drawn from the US-ASCII character set (0x00-0x7F). If your URL contains characters outside that range of code points, those characters need to be suitably encoded for representation in US-ASCII (typically by percent-encoding the octets of their UTF-8 encoding, as RFC 3986 §2.5 suggests). Further, your application is responsible for interpreting them.
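
The percent-encoding rule above can be sketched in a few lines. PercentEncode is a hypothetical helper written for illustration, and it assumes the text is first converted to UTF-8 octets (the RFC leaves the character encoding to the protocol). In practice the built-in Uri.EscapeDataString is the usual choice for encoding a single URL component.

```csharp
using System;
using System.Text;

// Hypothetical helper: keeps unreserved characters and percent-encodes
// every other octet of the UTF-8 representation with uppercase hex digits.
static string PercentEncode(string value)
{
    const string unreserved =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~";
    var result = new StringBuilder();
    foreach (byte octet in Encoding.UTF8.GetBytes(value))
    {
        char c = (char)octet;
        if (octet < 0x80 && unreserved.IndexOf(c) >= 0)
            result.Append(c);                                // allowed as-is
        else
            result.Append('%').Append(octet.ToString("X2")); // "%" HEXDIG HEXDIG
    }
    return result.ToString();
}

string input = "a b&c=d/\u00E9";
Console.WriteLine(PercentEncode(input));        // a%20b%26c%3Dd%2F%C3%A9
Console.WriteLine(Uri.EscapeDataString(input)); // built-in equivalent for a URL component
```

Note that this encodes reserved characters too, so it is suited to a single component value (such as a query parameter), not to a whole URL whose delimiters must survive.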

Why is a capital letter greater than the small letter in .NET?

In Java:
"A".compareTo("a"); // returns -32: "A" is less than "a"
In .NET, using String.CompareTo:
"A".CompareTo("a"); // returns 1: "A" is greater than "a"
In .NET, using Char.CompareTo:
'A'.CompareTo('a'); // returns -32: 'A' is less than 'a'
I know that Java compares strings character by character using each character's position in the Unicode table, but .NET does not. How does .NET determine that a capital letter is greater than the corresponding small letter?
String.CompareTo Method (String)

The doc I could find says that:
This method performs a word (case-sensitive and culture-sensitive) comparison using the current culture.
So, it is not quite the same as Java's .compareTo() which does a lexicographical comparison by default, using Unicode code points, as you say.
Therefore, in .NET, it depends on your current "culture" (Java would call this a "locale", I guess).
It seems that if you want to do String comparison "à la Java" in .NET, you must use String.CompareOrdinal() instead.
Conversely, if you want to do locale-dependent string comparison in Java, you need to use a Collator.
Lastly, another link on MSDN shows the influence of cultures on comparisons and even string equality.

From Java String
Returns:
the value 0 if the argument string is equal to this string; a value less than 0 if this string is lexicographically less than the string argument; and a value greater than 0 if this string is lexicographically greater than the string argument.
From .Net String.CompareTo
This method performs a word (case-sensitive and culture-sensitive)
comparison using the current culture. For more information about word,
string, and ordinal sorts, see System.Globalization.CompareOptions.
This post explains the difference between the comparison types,
and the documentation explains the difference between all of them.
If you look at these two, CurrentCulture and Ordinal:
StringComparison.Ordinal:
LATIN SMALL LETTER I (U+0069) is less than LATIN SMALL LETTER DOTLESS I (U+0131)
LATIN SMALL LETTER I (U+0069) is greater than LATIN CAPITAL LETTER I (U+0049)
LATIN SMALL LETTER DOTLESS I (U+0131) is greater than LATIN CAPITAL LETTER I (U
StringComparison.CurrentCulture:
LATIN SMALL LETTER I (U+0069) is less than LATIN SMALL LETTER DOTLESS I (U+0131)
LATIN SMALL LETTER I (U+0069) is less than LATIN CAPITAL LETTER I (U+0049)
LATIN SMALL LETTER DOTLESS I (U+0131) is greater than LATIN CAPITAL LETTER I (U+0049)
Ordinal is the only one where "i" > "I", and hence the Java-like one.
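
A short sketch of the three comparisons from the question, side by side. Only the sign of the ordinal results is stated, since that follows from the code unit values. The culture-sensitive result depends on the current culture and the runtime's collation data, so it is printed without a claimed value.

```csharp
using System;

// Ordinal: compares UTF-16 code units, like Java's String.compareTo.
int ordinal = string.CompareOrdinal("A", "a");
Console.WriteLine(ordinal < 0); // True: U+0041 is less than U+0061

// Char.CompareTo also compares the numeric values of the two Char objects.
int charCompare = 'A'.CompareTo('a');
Console.WriteLine(charCompare < 0); // True

// Culture-sensitive word comparison, which is what "A".CompareTo("a") performs.
int cultural = string.Compare("A", "a", StringComparison.CurrentCulture);
Console.WriteLine(cultural);
```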

This is due to the order of the characters in the ASCII character set. This is something you should really understand if you are going to do any form of data manipulation in your programs.
I am not sure if the grid control has any properties that allow you to modify the sort order, if not you will have to write your own sort subroutine.
You could use the std::sort function with a user defined predicate function that puts all lower case before upper case.
