Is there any reason to prefer UTF-16 over UTF-8? - c#

Examining the attributes of UTF-16 and UTF-8, I can't find any reason to prefer UTF-16.
However, checking out Java and C#, it looks like strings and chars there default to UTF-16. I was thinking that it might be for historic reasons, or perhaps for performance reasons, but couldn't find any information.
Anyone knows why these languages chose UTF-16? And is there any valid reason for me to do that as well?
EDIT: Meanwhile I've also found this answer, which seems relevant and has some interesting links.

East Asian languages typically require less storage in UTF-16 (2 bytes is enough for 99% of East-Asian language characters) than UTF-8 (typically 3 bytes is required).
Of course, for Western languages, UTF-8 is usually smaller (1 byte instead of 2). For mixed files like HTML (where there's a lot of markup) it's much of a muchness.
Processing of UTF-16 for user-mode applications is slightly easier than processing UTF-8, because surrogate pairs behave in almost the same way that combining characters behave. So UTF-16 can usually be processed as a fixed-size encoding.
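To make the size claims above concrete, here is a small sketch (the sample strings are my own, purely illustrative) comparing UTF-8 and UTF-16 byte counts in C#:

    using System;
    using System.Text;

    class EncodingSizes
    {
        static void Main()
        {
            // Illustrative samples: English, Chinese, and HTML-style mixed content.
            string english = "The quick brown fox jumps over the lazy dog.";
            string chinese = "敏捷的棕色狐狸跳过懒狗。";
            string html    = "<p lang=\"zh\">敏捷的棕色狐狸</p>";

            foreach (var (label, text) in new[] { ("English", english), ("Chinese", chinese), ("HTML", html) })
            {
                int utf8  = Encoding.UTF8.GetByteCount(text);
                int utf16 = Encoding.Unicode.GetByteCount(text);   // Encoding.Unicode is UTF-16LE
                Console.WriteLine($"{label}: UTF-8 = {utf8} bytes, UTF-16 = {utf16} bytes");
            }
        }
    }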

#Oak: this is too long for a comment...
I don't know about C# (and would be really surprised: it would mean they just copied Java too much) but for Java it's simple: Java was conceived before Unicode 3.1 came out.
Hence there were fewer than 65537 codepoints, every Unicode codepoint still fit in 16 bits, and so the Java char was born.
Of course this led to crazy issues that still affect Java programmers (like me) today, where you have a method charAt which in some cases returns neither a Unicode character nor a Unicode codepoint, and a method codePointAt (added in Java 5) which takes an argument that is not the number of codepoints you want to skip! You have to supply to codePointAt the number of Java chars you want to skip, which makes it one of the least understood methods in the String class.
So, yup, this is definitely wild and confuses most Java programmers (most aren't even aware of these issues) and, yup, it is for historical reasons. At least, that is the excuse that came up when people got mad about this issue: Unicode 3.1 simply wasn't out yet.
:)
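C# inherited the same model, so the same pitfall can be shown there. This is my own illustrative snippet, not part of the original answer: the string indexer returns UTF-16 code units, and char.ConvertToUtf32 takes a char index rather than a codepoint index, exactly like Java's codePointAt.

    using System;

    class CodePointDemo
    {
        static void Main()
        {
            // "𝄞" (MUSICAL SYMBOL G CLEF, U+1D11E) is outside the BMP,
            // so it is stored as a surrogate pair of two UTF-16 code units.
            string s = "a𝄞b";

            Console.WriteLine(s.Length);                  // 4 code units, not 3 "characters"
            Console.WriteLine((int)s[1]);                 // 55348 (0xD834): a lone high surrogate, not a codepoint
            Console.WriteLine(char.ConvertToUtf32(s, 1)); // 119070 (0x1D11E): the real codepoint...
            // ...but the 1 passed above is a char index, not a codepoint index,
            // mirroring Java's codePointAt.
        }
    }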

I imagine C# using UTF-16 derives from the Windows NT family of operating systems using UTF-16 internally.
I imagine there are two main reasons why Windows NT uses UTF-16 internally:
For memory usage: UTF-32 wastes a lot of space to encode.
For performance: UTF-8 is much harder to decode than UTF-16. In UTF-16 characters are either a Basic Multilingual Plane character (2 bytes) or a Surrogate Pair (4 bytes). UTF-8 characters can be anywhere between 1 and 4 bytes.
Contrary to what other people have answered - you cannot treat UTF-16 as UCS-2. If you want to correctly iterate over actual characters in a string, you have to use unicode-friendly iteration functions. For example in C# you need to use StringInfo.GetTextElementEnumerator().
For further information, this page on the wiki is worth reading: http://en.wikipedia.org/wiki/Comparison_of_Unicode_encodings
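For illustration, here is a small sketch (the sample string is my own) of walking a string by text elements with StringInfo instead of indexing chars directly:

    using System;
    using System.Globalization;

    class TextElementDemo
    {
        static void Main()
        {
            // "e" + COMBINING ACUTE ACCENT, plus a non-BMP emoji:
            // neither is a single char, but each is a single text element.
            string s = "e\u0301 café 😀";

            TextElementEnumerator enumerator = StringInfo.GetTextElementEnumerator(s);
            while (enumerator.MoveNext())
            {
                string element = enumerator.GetTextElement();
                Console.WriteLine($"'{element}' ({element.Length} char(s))");
            }
        }
    }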

It depends on the expected character sets. If you expect heavy use of Unicode code points outside of the 7-bit ASCII range then you might find that UTF-16 will be more compact than UTF-8, since some UTF-8 sequences are more than two bytes long.
Also, for efficiency reasons, Java and C# do not take surrogate pairs into account when indexing strings. Indexing like that would break down completely with code points represented by UTF-8 sequences that take up an odd number of bytes.

UTF-16 can be more efficient for representing characters in some languages such as Chinese, Japanese and Korean, where most characters can be represented in one 16-bit word. Some rarely used characters may require two 16-bit words. UTF-8 is generally much more efficient for representing characters from Western European character sets - UTF-8 and ASCII are equivalent over the ASCII range (0-127) - but less efficient with Asian languages, requiring three or four bytes to represent characters that can be represented with two bytes in UTF-16.
UTF-16 has an advantage as an in-memory format for Java/C# in that every character in the Basic Multilingual Plane can be represented in 16 bits (see Joe's answer) and some of the disadvantages of UTF-16 (e.g. confusing code relying on \0 terminators) are less relevant.

If we're talking about plain text alone, UTF-16 can be more compact in some languages, Japanese (about 20%) and Chinese (about 40%) being prime examples. As soon as you're comparing HTML documents, the advantage goes completely the other way, since UTF-16 is going to waste a byte for every ASCII character.
As for simplicity or efficiency: if you implement Unicode correctly in an editor application, complexity will be similar because UTF-16 does not always encode codepoints as a single number anyway, and single codepoints are generally not the right way to segment text.
Given that in the most common applications, UTF-16 is less compact, and equally complex to implement, the singular reason to prefer UTF-16 over UTF-8 is if you have a completely closed ecosystem where you are regularly storing or transporting plain text entirely in complex writing systems, without compression.
After compression with zstd or LZMA2, even for 100% Chinese plain text, the advantage is completely wiped out; with gzip the UTF-16 advantage is about 4% on Chinese text with around 3000 unique graphemes.

For many (most?) applications, you will be dealing only with characters in the Basic Multilingual Plane, so you can treat UTF-16 as a fixed-length encoding.
So you avoid all the complexity of variable-length encodings like UTF-8.

Related

Why is using UTF-8 for encoding during deserialization better than ASCII?

I want to deserialize a JSON file – which represents a RESTful web service response – into the corresponding classes. I was using
System.Text.ASCIIEncoding.ASCII.GetBytes(ResponseString) and I read on the Microsoft Docs that using UTF-8 encoding instead of ASCII is better for security reasons.
Now I am a little confused because I don't know the real difference between these two (regarding the security thing). Can anyone show me what the real practical advantages of using UTF-8 over ASCII for deserialization are?
Ultimately, the intention of an encoder is to get back the data you were meant to get. ASCII only defines a tiny tiny 7-bit range of values; anything over that isn't handled, and you could get back garbage - or ?, from payloads that include e̵v̷e̴n̸ ̷r̵e̸m̵o̸t̸e̵l̶y̸ ̶i̴n̴t̵e̵r̷e̵s̶t̶i̷n̷g̵ ̶t̸e̵x̵t̵.
Now; what happens when your application gets data it can't handle? We don't know, and it could indeed quite possibly cause a security problem when you get payloads you can't handle.
It is also just frankly embarrassing in this connected world if you can't correctly store and display the names etc of your customers (or print their name backwards because of right-to-left markers). Most people in the world use things outside of ASCII on a daily basis.
Since UTF-8 is a superset of ASCII, and UTF-8 basically won the encoding war: you might as well just use UTF-8.
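A quick sketch of the practical difference (the sample text is my own): Encoding.ASCII silently replaces anything outside the 7-bit range, while UTF-8 round-trips it.

    using System;
    using System.Text;

    class AsciiVsUtf8
    {
        static void Main()
        {
            string original = "Café – São Paulo";

            // ASCII has no representation for é, –, or ã; they come back as '?'.
            byte[] asciiBytes = Encoding.ASCII.GetBytes(original);
            Console.WriteLine(Encoding.ASCII.GetString(asciiBytes));  // "Caf? ? S?o Paulo"

            // UTF-8 is a superset of ASCII and round-trips the full string.
            byte[] utf8Bytes = Encoding.UTF8.GetBytes(original);
            Console.WriteLine(Encoding.UTF8.GetString(utf8Bytes));    // "Café – São Paulo"
        }
    }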
Since not every sequence of bytes is a valid encoded string, vulnerabilities arise from unwanted transformations, which can be exploited by clever attackers.
Let me cite from a black hat whitepaper on Unicode security:
Character encodings and the Unicode standard are also exposed to vulnerability. ... often they're related to implementation in practical use. ... the following categories can enable vulnerability in applications which are not built to prevent the relevant attacks:
Visual Spoofing 
Best-fit mappings
Charset transcodings and character mappings
Normalization
Canonicalization of overlong UTF-8
Over-consumption
Character substitution
Character deletion
Casing
Buffer overflows
Controlling Syntax
Charset mismatches
Consider the following ... example. In the case of U+017F LATIN SMALL LETTER LONG S, the upper casing and normalization operations transform the character into a completely different value. In some situations, this behavior could be exploited to create cross-site scripting or other attack scenarios.
... software vulnerabilities arise when best-fit mappings occur. To name a few:
Best-fit mappings are not reversible, so data is irrevocably lost.
Characters can be manipulated to bypass string handling filters, such as cross-site scripting (XSS) filters, WAFs, and IDS devices.
Characters can be manipulated to abuse logic in software, such as when the characters can be used to access files on the file system. In this case, a best-fit mapping to characters such as ../ or file:// could be damaging.
If you are actually storing binary data consider base64 or hex instead.

Performance: encoding UTF-8/16, handling char[] / char* / std::string / BSTR

Quick intro: the question is about UTF-8 vs. UTF-16.
*I tried my best to keep it as short and specific as possible; please bear with me.
I know there are a gazillion variations of the specific UTF-8/16 issue, not to mention the global encoding subject, which was the start of my questioning (ANSI vs. Unicode), and I guess it's not *my* quest only, as it could serve many other (performance-motivated) beginners in C++.
Being more specific - to the point, given the following environment parameters:
Windows platform
C++ and C#
using some English/Russian/Hebrew (*let's say this is a constant)
can I use UTF-8 (half the size of UTF-16) and "get away with it", saving space and time?
TLDR
I have recently moved to C++. In the last few days I have tried to decide how to handle strings, which are one of the most expensive datatypes to process. I have followed almost every famous and less famous article on the encoding issue, but the more I kept searching the more confused I became about compatibility, while trying to keep the application high-performance without crossing the boundaries of the *framework.
*I have used the term framework although I am planning to do most of the I/O via native C++.
Can I use UTF-8? Do I even want UTF-8? I know one thing:
Windows' "blood type" is UTF-16, although I think low-level I/O and also HTTP use/default to/prefer/benefit from UTF-8,
but I am on Windows and still working with .NET.
What can I use to max out my app's performance when querying, manipulating, and saving to a database?
a point I have read in a less famous [article]
A bit of research
This is a compilation of research I did to answer your problem:
Hebrew and Cyrillic in Unicode
According to Wikipedia, the Unicode Hebrew block extends from U+0590 to U+05FF and from U+FB1D to U+FB4F (I don't know the proportions):
https://en.wikipedia.org/wiki/Unicode_and_HTML_for_the_Hebrew_alphabet
According to Wikipedia, again, Cyrillic can be found in the following blocks: U+0400–U+04FF, U+0500–U+052F, U+2DE0–U+2DFF, U+A640–U+A69F, U+1D2B, U+1D78, U+FE2E–U+FE2F
https://en.wikipedia.org/wiki/Cyrillic_script_in_Unicode
UTF-8 vs. UTF-16
https://en.wikipedia.org/wiki/UTF-8
https://en.wikipedia.org/wiki/UTF-16
UTF-16 can represent the following code points with two bytes: U+0000 to U+D7FF and U+E000 to U+FFFF, which means all the characters above will be represented with two bytes (a wchar_t on Windows).
To represent Hebrew and Cyrillic, UTF-8 will always need at least two bytes, and possibly three (see the sketch after this list):
U+0000 - U+007F : 1 byte
U+0080 - U+07FF : 2 bytes
U+0800 - U+FFFF : 3 bytes
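As a rough check of those byte counts (the sample phrases are my own), C#'s Encoding class can report the encoded sizes directly:

    using System;
    using System.Text;

    class ScriptSizes
    {
        static void Main()
        {
            // Illustrative samples in the three scripts mentioned in the question.
            var samples = new (string Label, string Text)[]
            {
                ("English", "hello world"),
                ("Russian", "привет мир"),
                ("Hebrew",  "שלום עולם"),
            };

            foreach (var (label, text) in samples)
            {
                Console.WriteLine($"{label}: UTF-8 = {Encoding.UTF8.GetByteCount(text)} bytes, " +
                                  $"UTF-16 = {Encoding.Unicode.GetByteCount(text)} bytes");
            }
        }
    }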
Windows
You said it yourself: Windows's DNA is UTF-16. No matter what delusional websites claim, the WinAPI won't change to UTF-8 because that makes no sense from Microsoft's viewpoint (breaking compatibility with previous Windows applications just to make Linux lovers happy? Seriously?).
When you develop under Windows, everything Unicode there will be optimized/designed for UTF-16.
Even the "char" API in the WinAPI is just a wrapper that converts your char strings into wchar_t strings before calling the UTF-16 API you should have been calling directly anyway.
Test!
As your problem seems to be mainly I/O, you should experiment to see if there is a meaningful difference between reading/writing/sending/receiving UTF-16 vs. UTF-8 with sample data.
Conclusion
From every fact above, I see either a neutral choice between UTF-8 and UTF-16 (the Hebrew and Cyrillic code points) (*), or a choice leading to UTF-16 (Windows).
So, my own conclusion, unless your tests show otherwise, would be to stick to UTF-16 on Windows.
(*) You could sample a few strings in all the languages you are using, and gather statistics on how often the most common characters are used.
Bonus?
Now, in your stead, I would avoid using directly wchar_t on Windows.
Instead, I would use the _T(), TCHAR and <tchar.h> macro/typedef/include machinery offered by Windows: With but a few macros defined (UNICODE and _UNICODE if memory serves), as well as a few smart overloads, you can:
use wchar_t and UTF-16 on Windows
use UTF-8 on Linux
Which will make your code more portable should you switch to another OS.
Please read this article
http://www.joelonsoftware.com/articles/Unicode.html
Please read it carefully.
Now, regarding performance, I very much doubt that you'll see any difference.
You choose your encoding based on what your program is supposed to do.
Is it supposed to communicate with other programs?
Are you storing information in database that will be accessed by other people?
Performance and disk space are not your first priorities when deciding which encoding to use.

Why no byte strings in .net / c#?

Is there a good reason that .NET provides string functions (like search, substring extraction, splitting, etc) only for UTF-16 and not for byte arrays? I see many cases when it would be easier and much more efficient to work with 8-bit chars instead of 16-bit.
Let's take the MIME (.EML) format for example. It's basically an 8-bit text file. You cannot read it properly using ANY encoding (because the encoding info is contained within the file; moreover, different parts can have different encodings).
So you basically better read a MIME file as bytes, determine its structure (ideally, using 8-bit string parsing tools), and after finding the encodings for all encoding-dependent data blocks apply encoding.GetString(data) to get the normal UTF-16 representation of them.
Another thing is base64 data blocks (base64 is just an example; there are also UUE and others). Currently .NET expects you to have base64 in a 16-bit string, but it's not efficient to read data at double the size and do all the conversions from bytes to string just to decode it. When dealing with megabytes of data, this becomes important.
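For what it's worth, newer .NET versions do expose a byte-level base64 decoder; a minimal sketch, assuming .NET Core 2.1 or later where System.Buffers.Text.Base64 is available:

    using System;
    using System.Buffers;
    using System.Buffers.Text;
    using System.Text;

    class Base64FromBytes
    {
        static void Main()
        {
            // Pretend these bytes came straight off the wire, no string conversion involved.
            byte[] base64Utf8 = Encoding.ASCII.GetBytes("SGVsbG8sIE1JTUUh");

            byte[] decoded = new byte[Base64.GetMaxDecodedFromUtf8Length(base64Utf8.Length)];
            OperationStatus status = Base64.DecodeFromUtf8(base64Utf8, decoded,
                out int consumed, out int written);

            Console.WriteLine(status);                                        // Done
            Console.WriteLine(Encoding.ASCII.GetString(decoded, 0, written)); // Hello, MIME!
        }
    }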
The missing byte-string manipulation functions lead to the need to write them manually, but such implementations are obviously less efficient than the native implementations of the string functions.
I don't say it needs to be called 8-bit chars, let's keep it bytes. Just have a set of native methods which reflect most string manipulation routines, but with byte arrays. Is this needed only by me or am I missing something important about common .NET architecture?
Let's take the MIME (.EML) format for example. It's basically an 8-bit text file. You cannot read it properly using ANY encoding (because the encoding info is contained within the file; moreover, different parts can have different encodings).
So, you're talking about a case where general-purpose byte-string methods aren't very useful, and you'd need to specialise.
And then for other cases, you'd need to specialise again.
And again.
I actually think byte-string methods would be more useful than your example suggests, but it remains that a lot of cases for them have specialised needs that differ from other uses in incompatible ways.
Which suggests it may not be well-suited for the base library. It's not like you can't make your own that do fit those specialised needs.
Code that deals with mixed-encoding string manipulation is unnecessarily hard and much harder to explain/get right. The way you suggest handling mixed encodings, every "string" would need to carry encoding information with it, and the framework would have to provide implementations of all possible combinations of encodings.
The standard solution to such a problem is to provide a well-defined way to convert all types to/from a single "canonical" representation and to perform most operations on that canonical type. You see this more easily in image/video processing, where arbitrary incoming formats are converted into the one format the tool knows about, processed, and converted back to the original/any other format.
.NET strings are almost there, with a "canonical" way to represent a Unicode string. There are still many ways to represent what is, from the user's point of view, the same string but is actually composed of different char elements. Even regular string comparison is a huge problem (as, in addition to encoding, there are frequently locale differences).
Notes
there are already plenty of APIs dealing with byte arrays to compare/slice - both in the Array/List classes and as LINQ helpers (see the sketch after these notes). The only real missing part is regex-like matching.
even dealing with a single encoding for strings (UTF-16 in .NET, UTF-8 in many other systems) is hard enough - even getting the "string length" is a problem (do you count surrogate pairs as one, include all combining characters, or is .Length enough?).
it is a good idea to try to write the code yourself to see where the complexity comes from and whether a particular framework decision makes sense. Try to implement 10-15 common string functions to support several encodings - e.g. UTF-8, UTF-16, and one 8-bit encoding.
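As a hedged illustration of the byte-array support that already exists (the buffer and pattern below are made up for the example), spans give you comparison and searching over raw bytes without any string conversion:

    using System;
    using System.Text;

    class ByteSearchDemo
    {
        static void Main()
        {
            // A fake "8-bit" buffer, e.g. the raw bytes of a MIME part.
            byte[] buffer  = Encoding.ASCII.GetBytes("Content-Type: text/plain\r\n\r\nbody");
            byte[] pattern = Encoding.ASCII.GetBytes("\r\n\r\n");

            // MemoryExtensions.IndexOf works on spans of bytes much like string.IndexOf.
            int headerEnd = buffer.AsSpan().IndexOf(pattern);
            Console.WriteLine(headerEnd);   // index of the blank line separating headers from body
            Console.WriteLine(
                Encoding.ASCII.GetString(buffer, headerEnd + pattern.Length,
                                         buffer.Length - headerEnd - pattern.Length)); // "body"
        }
    }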

Fast and memory efficient ASCII string class for .NET

This might have been asked before, but I can't find any such posts. Is there a class to work with ASCII Strings? The benefits are numerous:
Comparison should be faster since it's just byte-for-byte (instead of UTF-8 with variable encoding)
Memory efficient, should use about half the memory in large strings
Faster versions of ToUpper()/ToLower() which use a Look-Up-Table that is language invariant
Jon Skeet wrote a basic AsciiString implementation and proved #2, but I'm wondering if anyone took this further and completed such a class. I'm sure there would be uses, although no one would typically take such a route since all the existing String functions would have to be re-implemented by hand. And conversions between String <> AsciiString would be scattered everywhere complicating an otherwise simple program.
Is there such a class? Where?
I thought I would post the outcome of my efforts to implement a system as described with as much string support and compatibility as I could. It's possibly not perfect but it should give you a decent base to improve on if needed.
The ASCIIChar struct and ASCIIString string implicitly convert to their native counterparts for ease of use.
The OP's suggestion of replacements for ToUpper/ToLower etc. has been implemented in a much quicker way than a lookup list, and all the operations are as quick and memory-friendly as I could make them.
Sorry, I couldn't post the source, it was too long. See the links below.
ASCIIChar - Replaces char, stores the value in a byte instead of an int, and provides support methods and compatibility for the string class. Implements virtually all methods and properties available for char.
ASCIIChars - Provides static properties for each of the valid ASCII characters for ease of use.
ASCIIString - Replaces string, stores characters in a byte array and implements virtually all methods and properties available for string.
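For readers who just want the flavour of the approach, here is a minimal sketch of the idea (my own simplification, not the poster's linked implementation): a byte-backed string type with implicit conversions and table-free upper-casing.

    using System;
    using System.Text;

    // Minimal byte-backed ASCII string: stores one byte per character.
    readonly struct AsciiString : IEquatable<AsciiString>
    {
        private readonly byte[] _bytes;

        public AsciiString(string s)
        {
            _bytes = new byte[s.Length];
            for (int i = 0; i < s.Length; i++)
            {
                if (s[i] > 127) throw new ArgumentException("Not an ASCII string.", nameof(s));
                _bytes[i] = (byte)s[i];
            }
        }

        private AsciiString(byte[] bytes) => _bytes = bytes;

        public int Length => _bytes?.Length ?? 0;
        public char this[int index] => (char)_bytes[index];

        // ASCII upper-casing needs no lookup table: just clear bit 5 for a-z.
        public AsciiString ToUpper()
        {
            var copy = (byte[])_bytes.Clone();
            for (int i = 0; i < copy.Length; i++)
                if (copy[i] >= (byte)'a' && copy[i] <= (byte)'z') copy[i] &= 0xDF;
            return new AsciiString(copy);
        }

        public bool Equals(AsciiString other) => _bytes.AsSpan().SequenceEqual(other._bytes);

        public override string ToString() => Encoding.ASCII.GetString(_bytes ?? Array.Empty<byte>());

        public static implicit operator AsciiString(string s) => new AsciiString(s);
        public static implicit operator string(AsciiString s) => s.ToString();
    }

    class AsciiStringDemo
    {
        static void Main()
        {
            AsciiString s = "hello world";   // implicit conversion from string
            Console.WriteLine(s.ToUpper());  // HELLO WORLD
            Console.WriteLine(s.Length);     // 11 characters, 11 bytes
        }
    }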
.NET has no direct ASCII string support. Strings are UTF-16 because the Windows API works with ASCII (one char - one byte) or UTF-16 only. UTF-8 would be the better solution, but .NET does not support it because Windows doesn't.
The Windows API can convert between charsets, but it only works with 1-byte or 2-byte chars, so if you use UTF-8 strings in .NET you must convert them every time, which has an impact on performance. .NET can use UTF-8 and other encodings via BinaryWriter/BinaryReader or a simple StreamWriter/StreamReader.
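To illustrate that last point, a small sketch (the file name is illustrative): strings stay UTF-16 in memory, and the encoding is applied only at the I/O boundary.

    using System.IO;
    using System.Text;

    class Utf8IoDemo
    {
        static void Main()
        {
            // In-memory strings are UTF-16; the writer transcodes to UTF-8 on the way out.
            using (var writer = new StreamWriter("sample.txt", false, new UTF8Encoding(false)))
            {
                writer.WriteLine("שלום, мир, hello");
            }

            // And the reader transcodes back to UTF-16 strings on the way in.
            using (var reader = new StreamReader("sample.txt", Encoding.UTF8))
            {
                System.Console.WriteLine(reader.ReadToEnd());
            }
        }
    }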

Is it safe to use random Unicode for complex delimiter sequences in strings?

Question: In terms of program stability and ensuring that the system will actually operate, how safe is it to use chars like ¦, § or ‡ for complex delimiter sequences in strings? Can I reliably believe that I won't run into any issues with a program reading these incorrectly?
I am working in a system, using C# code, in which I have to store a fairly complex set of information within a single string. The readability of this string is only necessary on the computer side, end-users should only ever see the information after it has been parsed by the appropriate methods. Because some of the data in these strings will be collections of variable size, I use different delimiters to identify what parts of the string correspond to a certain tier of organization. There are enough cases that the standard sets of ;, |, and similar ilk have been exhausted. I considered two-char delimiters, like ;# or ;|, but I felt that it would be very inefficient. There probably isn't that large of a performance difference in storing with one char versus two chars, but when I have the option of picking the smaller option, it just feels wrong to pick the larger one.
So finally, I considered using the set of characters like the double dagger and section. They only take up one char, and they are definitely not going to show up in the actual text that I'll be storing, so they won't be confused for anything.
But character encoding is finicky. While the visibility to the end user is meaningless (since they, in fact, won't see it), I became recently concerned about how the programs in the system will read it. The string is stored in one database, while a separate program is responsible for both encoding and decoding the string into different object types for the rest of the application to work with. And if something that is expected to be written one way is possibly written another, then maybe the whole system will fail, and I can't really let that happen. So is it safe to use these kinds of chars for background delimiters?
Because you must encode the data in a string, I am assuming it is because you are interfacing with other systems. Why not use something like XML or JSON for this rather than inventing your own data format?
With XML you can specify the encoding in use, e.g.:
<?xml version="1.0" encoding="UTF-8"?>
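Or, as a sketch of the JSON route (the type and property names are invented for the example, using System.Text.Json, which ships with .NET Core 3.0+):

    using System;
    using System.Collections.Generic;
    using System.Text.Json;

    // Hypothetical record type standing in for the tiers of data in the question.
    public class Order
    {
        public string Customer { get; set; }
        public List<string> Items { get; set; }
    }

    class JsonInsteadOfDelimiters
    {
        static void Main()
        {
            var order = new Order { Customer = "Ada", Items = new List<string> { "book", "pen" } };

            // Nesting replaces the need for tiered delimiter characters entirely.
            string json = JsonSerializer.Serialize(order);
            Console.WriteLine(json);   // {"Customer":"Ada","Items":["book","pen"]}

            Order roundTripped = JsonSerializer.Deserialize<Order>(json);
            Console.WriteLine(roundTripped.Items.Count);   // 2
        }
    }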
There is very little danger that any system that stores and retrieves Unicode text will alter those specific characters.
The main characters that can be altered in a text transfer process are the end-of-line markers. For example, FTPing a file from a Unix system to a Windows system in text mode might replace LINE FEED characters with CARRIAGE RETURN + LINE FEED pairs.
After that, some systems may perform a canonical normalization of the text. Combining characters and characters with diacritics on them should not be used unless canonical normalization (either composing or decomposing) is taken into account. The Unicode character database contains information about which transformations are required under these normalization schemes.
That sums up the biggest things to watch out for, and none of them are a problem for the characters that you have listed.
Other transformations that might be made, but are less likely, are case changes and compatibility normalizations. To avoid these, just stay away from alphabetic letters or anything that looks like an alphabetic letter. Some symbols are also converted in a compatibility normalization, so you should check the properties in the Unicode Character Database just to be sure. But it is unlikely that any system will do a compatibility normalization without expressly indicating that it will do so.
In the Unicode Code Charts, canonical normalizations are indicated by "≡" and compatibility normalizations are indicated by "≈".
You could take the same approach as URL or HTML encoding, and replace key chars with sequences of chars. I.e. & becomes &amp;.
Although this results in more chars, it could be pretty efficiently compressed due to the repetition of those sequences.
Well, Unicode is a standard, so as long as everybody involved (code, db, etc.) is using Unicode, you shouldn't have any problems.
There are rarer characters in the Unicode set. As far as I know, only the chars below 0x20 (space) have special meanings; anything above that should be preserved in an NVARCHAR data column.
It is never going to be totally safe unless you have a good specification what characters can and cannot be part of your data.
Remember some of the laws of Murphy:
"Anything that can go wrong will."
"Anything that can't go wrong, will
anyway."
Those characters that definitely will not be used may eventually be used. When they are, the application will definitely fail.
You can use any character you like as a delimiter, as long as you escape the values so that the character is guaranteed not to appear in them. I wrote an example a while back, showing that you could even use a common character like "a" as the delimiter.
Escaping the values of course means that some characters will be represented as two characters, but usually that will still be less of an overhead than using a multiple character delimiter. And more importantly, it's completely safe.
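A minimal sketch of that idea (the delimiter and escape characters are chosen arbitrarily here): escape the escape character and the delimiter inside each value, then join and split safely.

    using System;
    using System.Collections.Generic;
    using System.Linq;
    using System.Text;

    class EscapedDelimiterDemo
    {
        const char Delimiter = '|';
        const char Escape    = '\\';

        // Escape the escape character first, then the delimiter.
        static string EscapeValue(string value) =>
            value.Replace("\\", "\\\\").Replace("|", "\\|");

        static string Join(IEnumerable<string> values) =>
            string.Join(Delimiter.ToString(), values.Select(EscapeValue));

        static List<string> Split(string joined)
        {
            var result = new List<string>();
            var current = new StringBuilder();
            for (int i = 0; i < joined.Length; i++)
            {
                if (joined[i] == Escape && i + 1 < joined.Length)
                    current.Append(joined[++i]);          // take the next char literally
                else if (joined[i] == Delimiter)
                {
                    result.Add(current.ToString());
                    current.Clear();
                }
                else
                    current.Append(joined[i]);
            }
            result.Add(current.ToString());
            return result;
        }

        static void Main()
        {
            var values = new[] { "plain", "has|pipe", "has\\backslash" };
            string joined = Join(values);
            Console.WriteLine(joined);                           // plain|has\|pipe|has\\backslash
            Console.WriteLine(string.Join(", ", Split(joined))); // plain, has|pipe, has\backslash
        }
    }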
