Issue with surrogate Unicode characters in F# vs. C#

I'm working with strings which could contain surrogate Unicode characters (non-BMP, 4 bytes per character in UTF-16).
When I use the "\Uxxxxxxxx" format to specify a surrogate character in F#, for some characters it gives a different result than in C#. For example:
C#:
string s = "\U0001D11E";
bool c = Char.IsSurrogate(s, 0);
Console.WriteLine(String.Format("Length: {0}, is surrogate: {1}", s.Length, c));
Gives: Length: 2, is surrogate: True
F#:
let s = "\U0001D11E"
let c = Char.IsSurrogate(s, 0)
printf "Length: %d, is surrogate: %b" s.Length c
Gives: Length: 2, is surrogate: false
Note: some surrogate characters work in F# ("\U0010011", "\U00100011"), but some of them don't.
Q: Is this a bug in F#? How can I handle the allowed surrogate Unicode characters in strings with F#? (Does F# have a different format, or is the only way to use Char.ConvertFromUtf32 0x1D11E?)
Update:
s.ToCharArray() gives [| 0xD800; 0xDF41 |] in F#; { 0xD834, 0xDD1E } in C#
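For reference, a minimal C# snippet to dump the UTF-16 code units (the kind of check used for the update above):
using System;

string s = "\U0001D11E"; // U+1D11E MUSICAL SYMBOL G CLEF
foreach (char c in s)
    Console.Write("0x{0:X4} ", (int)c); // prints: 0xD834 0xDD1E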

This is a known bug in the F# compiler that shipped with VS2010 (and SP1); the fix appears in the VS11 bits, so if you have the VS11 Beta and use the F# 3.0 compiler, you'll see this behave as expected.
(If the other answers/comments here don't provide you with a suitable workaround in the meantime, let me know.)

That obviously means that F# makes a mistake while parsing some string literals. That is supported by the fact that the character you've mentioned is non-BMP, and in UTF-16 it should be represented as a pair of surrogates.
Surrogates are words in the range 0xD800-0xDFFF, while neither of the chars in the produced string falls in that range.
But the processing of surrogates doesn't change, as the framework under the hood is the same. So you already have the answer in your question: if you need string literals with non-BMP characters in your code, just use Char.ConvertFromUtf32 instead of the \UXXXXXXXX notation. All the rest of the processing will be the same as always.
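A minimal sketch of that workaround (a plain .NET API, so it behaves identically from F# and C#):
using System;

// Char.ConvertFromUtf32 builds the surrogate pair at run time,
// bypassing the compiler's literal parsing entirely.
string s = char.ConvertFromUtf32(0x1D11E);
Console.WriteLine(s.Length);               // 2
Console.WriteLine(char.IsSurrogate(s, 0)); // True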

It seems to me that this is connected with different forms of normalization.
Both in C# and in F#, s.IsNormalized() returns true.
But in C#
s.ToCharArray() gives us {55348, 56606} //0xD834, 0xDD1E
and in F#
s.ToCharArray() gives us {65533, 57422} //0xFFFD, 0xE04E
And as you probably know, System.Char.IsSurrogate is implemented in the following way:
public static bool IsSurrogate(char c)
{
    return (c >= HIGH_SURROGATE_START && c <= LOW_SURROGATE_END);
}
where
HIGH_SURROGATE_START = 0x00d800;
LOW_SURROGATE_END = 0x00dfff;
So in C# the first char (55348 = 0xD834) falls inside the surrogate range, but in F# the first char (65533 = 0xFFFD, the Unicode replacement character) is greater than LOW_SURROGATE_END.
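To see it concretely, a small check using the values reported above:
using System;

// Code units from the C# output: a valid surrogate pair.
Console.WriteLine(char.IsSurrogate((char)0xD834)); // True (0xD800 <= 0xD834 <= 0xDFFF)
Console.WriteLine(char.IsSurrogate((char)0xDD1E)); // True
// First code unit from the F# output: the replacement character.
Console.WriteLine(char.IsSurrogate((char)0xFFFD)); // False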
I hope this helps.

Related

In C#, how to convert to upper case following Unicode rules [duplicate]

Is it possible to convert a string to ordinal upper or lower case, similar to invariant?
string upperInvariant = "ß".ToUpperInvariant();
string lowerInvariant = "ß".ToLowerInvariant();
bool invariant = upperInvariant == lowerInvariant; // true
string upperOrdinal = "ß".ToUpperOrdinal(); // SS
string lowerOrdinal = "ß".ToLowerOrdinal(); // ss
bool ordinal = upperOrdinal == lowerOrdinal; // false
How to implement ToUpperOrdinal and ToLowerOrdinal?
Edit:
How to get the ordinal string representation? Likewise, how to get the invariant string representation? Maybe that's not possible, as in the above case it might be ambiguous, at least for the ordinal representation.
Edit2:
string.Equals("ß", "ss", StringComparison.InvariantCultureIgnoreCase); // true
but
"ß".ToLowerInvariant() == "ss"; // false
I don't believe this functionality exists in the .NET Framework or .NET Core. The closest thing is string.Normalize(), but it is missing the case fold option that you need to successfully pull this off.
This functionality exists in the ICU project (which is available in C/Java). The functionality you are after is the unorm2.h file in C or the Normalizer2 class in Java. Example usage in Java and related test.
There are 2 implementations of Normalizer2 that I am aware of that have been ported to C#:
icu-dotnet (a C# wrapper library for ICU4C)
ICU4N (a fully managed port of ICU4J)
Full Disclosure: I am a maintainer of ICU4N.
From MSDN:
The StringComparer returned by the OrdinalIgnoreCase property treats the characters in the strings to compare as if they were converted to uppercase using the conventions of the invariant culture, and then performs a simple byte comparison that is independent of language.
But I'm guessing doing that won't achieve what you want, since simply doing "ß".ToUpperInvariant() won't give you a string that is ordinally equivalent to "ss". There must be some magic in the String.Equals method that handles the special case of why "ss" equals "ß".
If you're only worried about German text then this answer might help.
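For that German-only case, a minimal sketch (ToUpperOrdinal and ToLowerOrdinal here are hypothetical extension methods, not framework APIs; a general solution needs ICU's full case-folding data):
using System;

static class OrdinalCasing
{
    // Hypothetical helpers that only handle the German sharp s.
    public static string ToUpperOrdinal(this string s) =>
        s.Replace("ß", "SS").ToUpperInvariant();

    public static string ToLowerOrdinal(this string s) =>
        s.Replace("ß", "ss").ToLowerInvariant();
}

class Demo
{
    static void Main()
    {
        Console.WriteLine("ß".ToUpperOrdinal()); // SS
        Console.WriteLine("ß".ToLowerOrdinal()); // ss
    }
}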

C# - Replacing extended ASCII characters

I'm parsing a number of text files that contain 99.9% ASCII characters: numbers, basic punctuation and letters A-Z (upper and lower case).
The files also contain names, which occasionally contain characters from the extended ASCII character set, for example the umlaut Ü and the cedilla ç.
I want to only work with standard ascii, so I handle these extended characters by processing any names through a series of simple replace() commands...
myString = myString.Replace("ç", "c");
myString = myString.Replace("Ü", "U");
This works with all the strange characters I want to replace except for Ø (capital O with a forward slash through it). I think this has the decimal equivalent of 157.
If I process the string character by character using ToInt32() on each character, it claims the decimal equivalent is 65533 - well outside the normal range of extended ASCII codes.
Questions
Why doesn't myString.Replace("Ø", "O") work on this character?
How can I replace "Ø" with "O"?
Other information that may be pertinent: opening the file with Notepad shows the character as "Ø". Comparison with other sources indicates that the data is correct (i.e. the full string is "Jørgensen" - a valid Danish name). Viewing the character in Visual Studio shows it as "�". I'm getting exactly the same problem (with this one character) in hundreds of different files. I can happily replace all the other extended characters I encounter without problems. I'm using System.IO.File.ReadAllLines() to read all the lines into an array of strings for processing.
Replace works fine for the 'ø' when it 'knows' about it:
Console.WriteLine("Jørgensen".Replace("ø", "o"));
In your case the problem is that you are trying to read the data with the wrong encoding, that's why the string does not contain the character which you are trying to replace.
Ø is part of the extended ASCII set - iso-8859-1, but File.ReadAllLines tries to detect encoding using BOM chars and, I suspect, falls back to UTF-8 in your case (see Remarks in the documentation).
You can see the same behavior in VS Code: it tries to open the file with UTF-8 encoding and shows you �. If you switch the encoding to the correct one, it shows the text correctly.
If you know what encoding is used for your files, just use it explicitly, here is an example to illustrate the difference:
// prints J?rgensen
File.ReadAllLines("data.txt")
.Select(l => l.Replace("Ø", "O"))
.ToList()
.ForEach(Console.WriteLine);
// prints Jorgensen
File.ReadAllLines("data.txt",Encoding.GetEncoding("iso-8859-1"))
.Select(l => l.Replace("Ø", "O"))
.ToList()
.ForEach(Console.WriteLine);
If you want to use chars from the default ASCII set, you may convert all special chars from the extended set to the base one (it will be ugly and non-trivial). Or you can search online how to deal with your concern, and you may find String.Normalize() or this thread with several other suggestions.
// requires: using System.Text; using System.Globalization;
public static string RemoveDiacritics(string s)
{
    // FormD splits letters from their combining diacritical marks
    var normalizedString = s.Normalize(NormalizationForm.FormD);
    var stringBuilder = new StringBuilder();
    for (var i = 0; i < normalizedString.Length; i++)
    {
        var c = normalizedString[i];
        if (CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
            stringBuilder.Append(c);
    }
    return stringBuilder.ToString();
}
...
// prints Jørgensen - ø has no canonical decomposition, so FormD
// filtering alone does not replace it (see the note below)
File.ReadAllLines("data.txt", Encoding.GetEncoding("iso-8859-1"))
    .Select(RemoveDiacritics)
    .ToList()
    .ForEach(Console.WriteLine);
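One caveat: Ø and ø are standalone letters in Unicode with no canonical decomposition, so the FormD approach leaves them untouched; such characters still need an explicit mapping. A minimal sketch combining the two (ToBasicAscii is a hypothetical helper name):
// Sketch: explicit replacements for stroked letters that do not
// decompose, then FormD filtering for genuine combining marks.
public static string ToBasicAscii(string s)
{
    s = s.Replace("Ø", "O").Replace("ø", "o");
    return RemoveDiacritics(s);
}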
I'd strongly recommend reading C# in Depth: Unicode by Jon Skeet and Programming with Unicode by Victor Stinner to get a much better understanding of what's going on :) Good luck.
PS. My code example is functional and compact but pretty inefficient; if you parse huge files, consider another solution.

C# "anyString".Contains('\0', StringComparison.InvariantCulture) returns true in .NET5 but false in older versions

I encountered a compatibility problem while I was trying to upgrade my projects from .NET Core 3.1 to the latest .NET 5.
My original code has validation logic that checks for invalid file name characters, testing each character returned from the Path.GetInvalidFileNameChars() API.
var invalidFilenameChars = Path.GetInvalidFileNameChars();
bool validFileName = !invalidFilenameChars.Any(ch => fileName.Contains(ch, StringComparison.InvariantCulture));
Suppose you give fileName a regular value such as "test.txt" that should be valid. Surprisingly, however, the above code reports the file name as invalid if you run it with the 'net5' target framework.
After spending some time on debugging, what I found is that the returned invalid character set contains '\0' (the null ASCII character), and "test.txt".Contains("\0", StringComparison.InvariantCulture) gives true.
class Program
{
    static void Main(string[] args)
    {
        var containsNullChar = "test".Contains("\0", StringComparison.InvariantCulture);
        Console.WriteLine($"Contains null char {containsNullChar}");
    }
}
If you run it on .NET Core 3.1, it never says a regular string contains the null character. Also, if I omit the second parameter (StringComparison.InvariantCulture) or if I use StringComparison.Ordinal, the strange result is never returned.
Why was this behavior changed in .NET 5?
EDIT:
As commented by Karl-Johan Sjögren before, there is indeed a behavior change in .NET 5 regarding string comparison:
Behavior changes when comparing strings on .NET 5+
Also see the related ticket:
string.IndexOf get different result in .Net 5
Though this issue should be related to the above, the current result for '\0' still looks strange to me and might still be considered a bug, as answered by @xanatos.
EDIT2:
Now I realized that the actual cause of this problem was my confusion between InvariantCulture and Ordinal string comparison. They are actually quite different things. See the ticket below:
Difference between InvariantCulture and Ordinal string comparison
Also note that this seems to be a problem unique to .NET, as other major programming languages such as Java, C++ and Python use ordinal comparison by default.
not a bug, a feature
The issue that I've opened has been closed, but they gave a very good explanation. Now... in .NET 5.0 they began using on Windows (on Linux it was already present) a new library for comparing strings: the ICU library. It is the official library of the Unicode Consortium, so it is gospel. That library is used for CurrentCulture, InvariantCulture (plus the respective IgnoreCase variants) and any other culture. The only exception is Ordinal/OrdinalIgnoreCase. The library is targeted at text, and it has some "particular" ideas about non-text. In this particular case, there are some characters that are simply ignored. In the block 0000-00FF, I would say the ignored characters are all control codes (ignore the fact that they are shown as €‚ƒ„†‡ˆ‰Š‹ŒŽ‘’“”•–—™š›œžŸ; at a certain point these characters were remapped elsewhere in Unicode, but the glyphs shown don't reflect it - if you check their codes, e.g. char ch = '€'; int val = (int)ch;, you'll see it), and '\0' is a control code.
Now... my personal thinking is that to compare strings today you need a master's degree in Unicode Technologies 😥, and I do hope that they'll do some shenanigans in .NET 6.0 to make the default comparison Ordinal (it is one of the proposals for .NET 6.0, Option B). Note that if you want to write programs that can run in Turkey you already needed a master's degree in Unicode Technologies (see the Turkish i problem).
In general I would say that to look for words that aren't keywords/fixed words, you should use culture-aware comparisons, while to look for keywords/fixed words (for example column names) and symbols/control codes you should use Ordinal comparisons. The problem is when you want to look for both at the same time. Normally in this case you are looking for exact words, so you can use Ordinal. Otherwise it becomes hellish. And I don't even want to think about how Regex works internally in a culture-aware environment. Because in that direction there can only be folly and nightmares 😁.
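Applied to the question's file-name check, a minimal sketch along those lines (the char overload of Contains always compares ordinally):
using System;
using System.IO;
using System.Linq;

class Program
{
    static void Main()
    {
        string fileName = "test.txt";
        // Path.GetInvalidFileNameChars() contains control codes such as '\0',
        // so match them ordinally instead of with a culture-aware search.
        bool validFileName = !Path.GetInvalidFileNameChars()
            .Any(ch => fileName.Contains(ch)); // char overload: ordinal
        Console.WriteLine(validFileName); // True
    }
}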
As a sidenote, even before this change the "default" culture-aware comparisons had some secret shenanigans... for example:
int ix = "ʹ$ʹ".IndexOf("$"); // -1 on .NET Framework or .NET Core <= 3.1
what I had written before
I'll say that it is a bug. There is a similar bug with IndexOf. I've opened an issue on GitHub to track it.
As you have written, the Ordinal and OrdinalIgnoreCase work as expected (probably because they don't need to use the new ICU library for handling Unicode).
Some sample code:
Console.WriteLine($"Ordinal Contains null char {"test".Contains("\0", StringComparison.Ordinal)}");
Console.WriteLine($"OrdinalIgnoreCase Contains null char {"test".Contains("\0", StringComparison.OrdinalIgnoreCase)}");
Console.WriteLine($"CurrentCulture Contains null char {"test".Contains("\0", StringComparison.CurrentCulture)}");
Console.WriteLine($"CurrentCultureIgnoreCase Contains null char {"test".Contains("\0", StringComparison.CurrentCultureIgnoreCase)}");
Console.WriteLine($"InvariantCulture Contains null char {"test".Contains("\0", StringComparison.InvariantCulture)}");
Console.WriteLine($"InvariantCultureIgnoreCase Contains null char {"test".Contains("\0", StringComparison.InvariantCultureIgnoreCase)}");
Console.WriteLine($"Ordinal IndexOf null char {"test".IndexOf("\0t", StringComparison.Ordinal)}");
Console.WriteLine($"OrdinalIgnoreCase IndexOf null char {"test".IndexOf("\0", StringComparison.OrdinalIgnoreCase)}");
Console.WriteLine($"CurrentCulture IndexOf null char {"test".IndexOf("\0", StringComparison.CurrentCulture)}");
Console.WriteLine($"CurrentCultureIgnoreCase IndexOf null char {"test".IndexOf("\0", StringComparison.CurrentCultureIgnoreCase)}");
Console.WriteLine($"InvariantCulture IndexOf null char {"test".IndexOf("\0", StringComparison.InvariantCulture)}");
Console.WriteLine($"InvariantCultureIgnoreCase IndexOf null char {"test".IndexOf("\0", StringComparison.InvariantCultureIgnoreCase)}");
and
Console.WriteLine($"Ordinal Contains null char {"test".Contains("\0test", StringComparison.Ordinal)}");
Console.WriteLine($"OrdinalIgnoreCase Contains null char {"test".Contains("\0test", StringComparison.OrdinalIgnoreCase)}");
Console.WriteLine($"CurrentCulture Contains null char {"test".Contains("\0test", StringComparison.CurrentCulture)}");
Console.WriteLine($"CurrentCultureIgnoreCase Contains null char {"test".Contains("\0test", StringComparison.CurrentCultureIgnoreCase)}");
Console.WriteLine($"InvariantCulture Contains null char {"test".Contains("\0test", StringComparison.InvariantCulture)}");
Console.WriteLine($"InvariantCultureIgnoreCase Contains null char {"test".Contains("\0test", StringComparison.InvariantCultureIgnoreCase)}");
Console.WriteLine($"Ordinal IndexOf null char {"test".IndexOf("\0t", StringComparison.Ordinal)}");
Console.WriteLine($"OrdinalIgnoreCase IndexOf null char {"test".IndexOf("\0test", StringComparison.OrdinalIgnoreCase)}");
Console.WriteLine($"CurrentCulture IndexOf null char {"test".IndexOf("\0test", StringComparison.CurrentCulture)}");
Console.WriteLine($"CurrentCultureIgnoreCase IndexOf null char {"test".IndexOf("\0test", StringComparison.CurrentCultureIgnoreCase)}");
Console.WriteLine($"InvariantCulture IndexOf null char {"test".IndexOf("\0test", StringComparison.InvariantCulture)}");
Console.WriteLine($"InvariantCultureIgnoreCase IndexOf null char {"test".IndexOf("\0test", StringComparison.InvariantCultureIgnoreCase)}");

C# casing convention for hexadecimal literals?

I like unity and I want to keep my code style guidelines strictly consistent.
So consider this ulong literal:
var x = 0xFFUL;
vs.
var x = 0xffUL;
It might be a stupid question, but I hate when my code is not consistent even in these negligible things, so I'd like to know what hex literals look like for the C# project team...
To date, the most popular format for representing hexadecimal literals is 0 to 9 and A to F.
You can confirm that the use of uppercase characters is most popular by referring to the hexadecimal Wikipedia entry here.
Hence follow the "wisdom of the crowd" and use 0-9, A-F.
I'm actually looking at Microsoft's code base and I find inconsistencies there too.
Here it's lowercase while here it's uppercase.
So I believe the true answer is that there is no convention for this, and one should pick a favorite style and stay consistent.
Also, searching here, I'm not finding anything about hex casing.

Why do some character literals cause Syntax Errors in Java?

In the latest edition of the JavaSpecialists newsletter, the author mentions a piece of code that is un-compilable in Java:
public class A1 {
    Character aChar = '\u000d';
}
Try to compile it, and you will get an error such as:
A1.java:2: illegal line end in character literal
Character aChar = '\u000d';
^
Why does an equivalent piece of C# code not show such a problem?
public class CharacterFixture
{
    char aChar = '\u000d';
}
Am I missing anything?
EDIT: My original intention was to ask how the C# compiler got Unicode file parsing correct (if so), and why Java still sticks with the incorrect (if so) parsing.
EDIT: Also, I want my original question title to be restored. Why such heavy editing? I strongly suspect it heavily modified my intentions.
Java's compiler translates \uxxxx escape sequences as one of the very first steps, even before the tokenizer gets a crack at the code. By the time it actually starts tokenizing, there are no \uxxxx sequences anymore; they're already turned into the chars they represent, so to the compiler your Java example looks the same as if you'd actually typed a carriage return in there somehow. It does this in order to provide a way to use Unicode within the source, regardless of the source file's encoding. Even ASCII text can still fully represent Unicode chars if necessary (at the cost of readability), and since it's done so early, you can have them almost anywhere in the code. (You could say \u0063\u006c\u0061\u0073\u0073\u0020\u0053\u0074\u0075\u0066\u0066\u0020\u007b\u007d, and the compiler would read it as class Stuff {}, if you wanted to be annoying or torture yourself.)
C# doesn't do that. \uxxxx is translated later, with the rest of the program, and is only valid in certain types of tokens (namely, identifiers and string/char literals). This means it can't be used in certain places where it can be used in Java. cl\u0061ss is not a keyword, for example.
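To illustrate the C# side, a small sketch of the two token types where C# does interpret the escape (EscapeDemo is a made-up name):
using System;

class EscapeDemo
{
    static void Main()
    {
        // In a string literal the escape is interpreted as usual...
        string s = "\u0041"; // "A"
        // ...and in an identifier: \u0041bc declares a variable named Abc.
        int \u0041bc = 42;
        Console.WriteLine(s + " " + Abc); // prints: A 42
    }
}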
