C#'s StringInfo and TextElementEnumerator can't recognize graphemes properly

C#'s StringInfo and TextElementEnumerator can't recognize graphemes properly - c#

In C# StringInfo and TextElementEnumerator classes provide methods and properties for text elements.
And here, we can find the definition of the Text Element.
The .NET Framework defines a text element as a unit of text that is
displayed as a single character, that is, a grapheme. A text element
can be any of the following:
Yes, it says a text element is a grapheme in .NET. I also tested with some unicode characters myself, and it really seemed true until I tested one Korean letter '가'.
As we all know some Unicode characters consist of multiple code points. Also we may face code point sequences and that's the reason I'm using StringInfo and TextElementEnumerator instead of simple String.
StringInfo and TextElementEnumerator could tell if Chars were surrogate pairs correctly. And "\u0061\u0308", a Unicode character which consists of multiple code points, was recognized as one text element just as expected. But as for "\u1100\u1161", it failed to say that it was also one text element.
"\u1100" is a leading letter "ㄱ", and "\u1161" is a vowel letter "ㅏ". They can be individual characters and shown to the users just as I write here and you can see them now. But if they are used together, they are rendered as one character "가" instead of "ㄱㅏ".
There are two ways in order to represent a Korean character "가":
Using a single code point U+AC00 from Hangul Syllable.
Using two code points U+1100 and U+1161 from Jamo.
Most of the time the former is used. The latter is rarely used, to be honest, I can't imagine when it's used at all..
Anyway, the first one is just one precomposed letter and the second is a sequence of Lead and Vowel which is treated as one character. When rendered they look the exactly same and both are actually canonically equivalent.
Also the following line returns true in C# :
"\u1100\u1161".Normalize() == "\uAC00"
I wonder why Normalize() here works just fine when C# doesn't think they are one complete text element..
I thought it had something to do with my .NET's version, but it turns out it's not the case. This thing happens even in Mono too.
I tested this with ICU as well, and it could treat "\u1100\u1161" as one grapheme correctly!
I initially thought StringInfo and TextElementEnumerator could eliminate need for ICU4C in some simple cases, so I'm very disappointed now..
Here's my question :
Am I doing something wrong here?
or
A Text Element in .NET isn't a user-perceived character unlike in ICU?

The basic issue here is that per the Korean standard KS X 1026, the two jamos ㄱ and ㅏ are distinct from their combined form 가. In fact, this exact example is used in the official standard (see section 6.2).
Long story short, Microsoft attempted to follow the standard but other operating systems and applications don't necessarily do so. Hence you can get "malformed" content from other software / platforms that appears to be parsed incorrectly on Windows / in .NET, even though it is parsed "correctly" on those platforms.
You will either need to ensure your data is correctly formed in the first place (unlikely, given that the de-facto standard is to completely ignore the official standard) or you will need to use ICU (or a similar library) to deal with these cases.

Related

Does C# remove the characters • from a string?

I have this C# code used to populate a label on the screen of a phone. Note that it's not HTML source being used here.
c1Label.Text = "To select cards for your deck you can one of a number of options
•
and this XAML
<local:JustifiedLabel x:Name="c1Label" Text= "To select cards for your deck you can one of a number of options
•
The former shows &#10 as part of the text but the XAML version works fine and shows this as a line feed followed by a bullet.

This is to be expected. Both languages (C# and XML) have different rules, especially regarding what characters are “special” and how they have to be escaped when you want to use them anyway. In the C# string
"
•"
are just exactly those letters since they have no special meaning to the C# compiler. In XML they are numeric character references, and are an escape mechanism of including arbitrary characters.
Conversely, in C# the following
"\n \u2022"
represents a line feed and a bullet. But in XML it's just the exact characters as written.
You can construct endless such examples with almost any two different languages. Yes, this means you cannot just copy text from one language and expect it to represent the same string in another language. If you're transforming one language into another it's easy to handle programmatically, when you're copying stuff around manually you just have to live with this and adapt accordingly.

C# Two strings, visually the same, yet they are not Equal nor Equivalent

I have a strange situation I can't figure out.
I am using a third party conversion framework which is expecting units in abbreviated form e.g. "μV" which is MicroVolts
But when I go to parse the string "μV" as MicroVolts it fails.
I boiled it down to the fact the abbreviation string I pass in is not equal to the string the third party framework is using for Microvolts, even though they look identical.
Here is the output of the Immediate window, to help shed some light on the context:
targetUom
"µV"
targetUom.GetHashCode()
-837503221
"μV".GetHashCode()
-837502956
targetUom.Equals("µV") // This is using the value of targetUom
true
targetUom.Equals("μV") // This is using the value from the 3rd party framework
false
I have obtained the value used in the third party framework by debugging and copying the value of the abbreviation I know they use for MicroVolts.
Any idea why two strings, even though the look to be made up of the exact same characters, would not be considered equal?
I've also compared the first character, the micro unit representation, between the two strings which yields:
'μ'.CompareTo(targetUom[0])
775
*********** UPDATE ****************
So I've found that the two micro characters are different encodings.
But when i attempt to use the same encoding that the target framework uses, Visual Studio gives me this message:
What are the implication of changing the encoding of the file..should I be doing this or should i collaborate with the framework author to enable their framework to handle both encodings?

Turns out there are two unicode characters which are probably identical in most fonts:
Greek small letter mu, U+03BC
Micro sign, U+00B5
You can access them both in strings using the \u escape:
Console.WriteLine("Greek small letter mu: \u03bc");
Console.WriteLine("Micro sign: \u00b5");

Why does "i" get replaced with "ı"

I received a crash report from an application which was trying to read XML from a file it had previously written. After requesting the user send me the file, I compared it with what should have been written and found a really odd problem I haven't come across before.
Some (but not all) of the i characters had been replaced with ı - a dotless i. For example, a node named "title" was fine, but a node named "initialdirectory" had the first i replaced, the second was left alone, i.e. ınitialdirectory.
Until today I wasn't even aware there was such a character, but now I do and I just don't know how it was written like that - the XML was written using an XmlWriter with UTF8 encoding. Just a normal everyday write, nothing complicated.
I normally (well, since getting Resharper and it yells at me for skipping the parameter) use StringComparison.OrdinalIgnoreCase when doing IndexOf etc, but I'm at a loss on how I'm supposed to do this when writing data, unless I'm supposed to start changing thread cultures.
Has anyone experienced a similar issue before, and if so, what's the best way to deal with it?

In Turkish there are two i's: one with a dot, i, and one without a dot, ı. In upper case the first one has a dot, İ, and the second one hasn't, I.
At some point your program is converting InitialDirectory to lower case according to the default locale, which is known to be Turkish in some parts of the world. To fix the problem you can convert cases using a fixed, known locale, such as American English.
Update: Even better, use the ToLowerInvariant() method which converts a string to lower case in the "invariant culture".

Correct Hebrew character sequence in C# and searchable PDFs

I'm testing an SDK that extracts text from a searchable PDF. One of the SDK's dependencies was recently updated, and it's causing an existing test on Hebrew text to fail. I don't know Hebrew nor enough about how the involved technologies represent right-to-left languages.
The NUnit test asserts that the extracted text matches the C# string "מנבוצץז ".
string hebrewText = reader.ReadToEnd();
Assert.AreEqual("מנבוצץז ", hebrewText);
The rasterized PDF has what I believe are the same characters, but in the opposite order.
The unit test fails with this message:
Expected: "מנבוצץז "
But was: " זץצובנמ"
Although the actual result more closely matches what I see in the rasterized PDF, I'm not completely sure the original test is wrong.
Are Hebrew characters in a C# string supposed to be read right-to-left like printed Hebrew text?
Does any part of the .NET stack tamper with the direction of Hebrew strings?
What about NUnit?
Are Hebrew characters embedded in a searchable PDF normally supposed to go in the same direction as the rasterized text?
Anything else I should know before deciding whether to "fix" this unit test?

There are various ways to encode RTL languages. The most common way (and Window's default) is to use logical ordering, which means the first letter is encoded as the first character in a string (or file). So whether visually the first letter appears on the left or right side of the screen doesn't affect the order in which they are stored.
Now as for the text appearing in Visual Studio, it depends on the version. As far as I remember, prior to Visual Studio 2010 the code editor displayed Hebrew backwards, and it was apparent as when you tried to select Hebrew text, it reversed in an odd way (which was visually confusing). It appears this issue no longer exists is Visual Studio 2010 (at least with SP1 which I just tested).
Let's take a Hebrew word for which the direction is more clear to non-Hebrew speakers than the string specified in your text:
יון
The word happens to be the Hebrew word for an ion, and on your screen, it should appear as three letters where the tallest letter is on the left and the shortest is on the right. In a .NET string, the expression "יון".Substring(0, 1) will produce the short letter, since it's the first letter in the string. The string can also be written as "\u05D9\u05D5\u05DF" where the leftmost Unicode character \u05D9 represents the short letter displayed on the right, which clearly demonstrates the order in which the letters are stored.
Since the string in your test case is nonsensical, I can't tell you whether it was a wrong test all along or if it a correct test that should pass. If the image you uploaded has been rendered correctly then it appears the actual result of your test is correct and the expected value is incorrect, and so you should fix the test.

I believe that all strings in C# will be stored internally as LTR; RTL strings will have a non-printable character (or something) denoting that they are indeed RTL.
More than likely. RTL GUIs and rendered text for example need certain properties (specifically RightToLeft and RightToLeftLayout) to be set in order to display correctly.
NUnit shouldn't. Nor should it care. IMHO a reversed string != the original string.
I couldn't comment. I'd assume that they should be whatever the test is expecting though, assuming it was passing at first.
Don't do half measures with RTL, it really doesn't like it. Either have full RTL support, or nothing. It can be pretty nasty, I wish you the best of luck!

CamelCase conversion to friendly name, i.e. Enum constants; Problems?

In my answer to this question, I mentioned that we used UpperCamelCase parsing to get a description of an enum constant not decorated with a Description attribute, but it was naive, and it didn't work in all cases. I revisited it, and this is what I came up with:
var result = Regex.Replace(camelCasedString,
#"(?<a>(?<!^)[A-Z][a-z])", #" ${a}");
result = Regex.Replace(result,
#"(?<a>[a-z])(?<b>[A-Z0-9])", #"${a} ${b}");
The first Replace looks for an uppercase letter, followed by a lowercase letter, EXCEPT where the uppercase letter is the start of the string (to avoid having to go back and trim), and adds a preceding space. It handles your basic UpperCamelCase identifiers, and leading all-upper acronyms like FDICInsured.
The second Replace looks for a lowercase letter followed by an uppercase letter or a number, and inserts a space between the two. This is to handle special but common cases of middle or trailing acronyms, or numbers in an identifier (except leading numbers, which are usually prohibited in C-style languages anyway).
Running some basic unit tests, the combination of these two correctly separated all of the following identifiers: NoDescription, HasLotsOfWords, AAANoDescription, ThisHasTheAcronymABCInTheMiddle, MyTrailingAcronymID, TheNumber3, IDo3Things, IAmAValueWithSingleLetterWords, and Basic (which didn't have any spaces added).
So, I'm posting this first to share it with others who may find it useful, and second to ask two questions:
Anyone see a case that would follow common CamelCase-ish conventions, that WOULDN'T be correctly separated into a friendly string this way? I know it won't separate adjacent acronyms (FDICFCUAInsured), recapitalize "properly" camelCased acronyms like FdicInsured, or capitalize the first letter of a lowerCamelCased identifier (but that one's easy to add - result = Regex.Replace(result, "^[a-z]", m=>m.ToString().ToUpper());). Anything else?
Can anyone see a way to make this one statement, or more elegant? I was looking to combine the Replace calls, but as they do two different things to their matches it can't be done with these two strings. They could be combined into a method chain with a RegexReplace extension method on String, but can anyone think of better?

So while I agree with Hans Passant here, I have to say that I had to try my hand at making it one regex as an armchair regex user.
(?<a>(?<!^)((?:[A-Z][a-z])|(?:(?<!^[A-Z]+)[A-Z0-9]+(?:(?=[A-Z][a-z])|$))|(?:[0-9]+)))
Is what I came up with. It seems to pass all the tests you put forward in the question.
So
var result = Regex.Replace(camelCasedString, #"(?<a>(?<!^)((?:[A-Z][a-z])|(?:(?<!^[A-Z]+)[A-Z0-9]+(?:(?=[A-Z][a-z])|$))|(?:[0-9]+)))", #" ${a}");
Does it in one pass.

not that this directly answers the question, but why not test by taking the standard C# API and converting each class into a friendly name? It'd take some manual verification, but it'd give you a good list of standard names to test.

Let's say every case you come across works with this (you're asking us for examples that won't and then giving us some, so you don't even have a question left).
This still binds UI to programmatic identifiers in a way that will make both programming and UI changes brittle.
It still assumes your program will only be used in one language. Either your potential market it so small that just indexing an array of names would be scalable enough (e.g. a one-client bespoke or in-house project), or you are assuming you will never be successful enough to need to be available to other languages or other dialects of your first-chosen language.
Does "well, it'll work as long as we're a failure" sound like a passing grade in balancing designs?
Either code it to use resources, or else code it to pass the enum name blindly or use an array of names, as that at least will be modifiable afterwards.

We Keep Coding

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.