Quick intro: the question is about UTF-8 vs. UTF-16.
*I tried my best to keep it as short and specific as possible, so please bear with me.
I know there are a gazillion variations of the UTF-8/16 question, not to mention the broader encoding subject
that started my questioning (ANSI vs. Unicode). I guess it's not only *my* quest,
as the answer could serve many other performance-motivated beginners in C++.
To be more specific and to the point:
Given the following environment parameters:
Windows platform
C++ and C#
text in English, Russian and Hebrew
*let's say this is a constant.
Can I use UTF-8 (supposedly half the size of UTF-16) and "get away with it"?
...saving space and time
TLDR
I have recently moved to C++, and over the last few days I have tried to decide how to handle strings, which are one of the most expensive data types to process. I have read almost every famous and less famous article on the encoding issue, but the more I searched, the more confused I became about compatibility and about keeping the application fast without crossing the boundaries of the *framework.
*I use the term framework although I am planning to do most of the I/O in native C++.
Can I use UTF-8? Do I even want UTF-8? I know one thing:
Windows' 'blood type' is UTF-16, although I think low-level I/O and HTTP use, default to, prefer or benefit from UTF-8.
But I am on Windows and still working with .NET.
What can I use to max out my app's performance when querying, manipulating and saving to a database?
a point I have read in a less famous [article]
A bit of research
This is a compilation of research I did to answer your problem:
Hebrew and Cyrillic in Unicode
According to Wikipedia, the Unicode Hebrew block extends from U+0590 to U+05FF and from U+FB1D to U+FB4F (I don't know the proportions):
https://en.wikipedia.org/wiki/Unicode_and_HTML_for_the_Hebrew_alphabet
According to Wikipedia, again, Cyrillic can be found in the following blocks: U+0400–U+04FF, U+0500–U+052F, U+2DE0–U+2DFF, U+A640–U+A69F, U+1D2B, U+1D78, U+FE2E–U+FE2F
https://en.wikipedia.org/wiki/Cyrillic_script_in_Unicode
UTF-8 vs. UTF-16
https://en.wikipedia.org/wiki/UTF-8
https://en.wikipedia.org/wiki/UTF-16
UTF-16 can represent the following code points with two bytes: U+0000 to U+D7FF and U+E000 to U+FFFF, which means all the characters above will be represented with two bytes (one wchar_t on Windows).
To represent Hebrew and Cyrillic, UTF-8 will always need at least two bytes, and possibly three:
U+0000 - U+007F : 1 byte
U+0080 - U+07FF : 2 bytes
U+0800 - U+FFFF : 3 bytes
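To make the ranges above concrete, here is a minimal C++ sketch. The function names utf8_bytes and utf16_bytes are made up for illustration, and the 4-byte case for code points above U+FFFF is added for completeness; it is not part of the table above.

```cpp
#include <cstdint>

// Bytes UTF-8 needs for one Unicode code point (same ranges as the table,
// plus the 4-byte range for code points above U+FFFF).
constexpr int utf8_bytes(std::uint32_t cp) {
    if (cp <= 0x7F)   return 1;
    if (cp <= 0x7FF)  return 2;
    if (cp <= 0xFFFF) return 3;
    return 4;
}

// Bytes UTF-16 needs: 2 inside the Basic Multilingual Plane,
// 4 (a surrogate pair) above it.
constexpr int utf16_bytes(std::uint32_t cp) {
    return cp <= 0xFFFF ? 2 : 4;
}
```

For the main Hebrew block (U+0590 to U+05FF) and the main Cyrillic blocks (U+0400 to U+052F), both functions return 2, which is why the choice looks neutral for those letters; the presentation forms starting at U+FB1D cost 3 bytes in UTF-8.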
Windows
You said it yourself: Windows' DNA is UTF-16. No matter what delusional websites claim, the WinAPI won't change to UTF-8 because that makes no sense from Microsoft's viewpoint (breaking compatibility with previous Windows applications just to make Linux lovers happy? Seriously?).
When you develop under Windows, everything Unicode there will be optimized/designed for UTF-16.
Even the "char" API from the WinAPI is just a wrapper that converts your char strings into wchar_t strings before calling the UTF-16 function you should have been calling directly anyway.
Test!
As your problem seems to be mainly I/O, you should experiment to see if there is a meaningful difference between reading/writing/sending/receiving UTF-16 vs. UTF-8 with sample data.
Conclusion
From every fact above, I see either a neutral choice between UTF-8 and UTF-16 (the Hebrew and Cyrillic glyphs) (*), or a choice leading to UTF-16 (Windows).
So, my own conclusion, unless your tests show otherwise, would be to stick to UTF-16 on Windows.
(*) You could sample a few strings in all the languages you are using, and gather statistics on how often the most common characters are used.
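One way to do the sampling suggested in (*) is to measure how many bytes a sample would take in each encoding. The following is only a sketch: EncodedSize and measure are hypothetical names, and the samples are assumed to be available as UTF-32 literals so that each element is one code point.

```cpp
#include <cstddef>

// Byte totals for one text sample in each encoding.
struct EncodedSize {
    std::size_t utf8;
    std::size_t utf16;
};

// Adds up the bytes needed to store a null-terminated UTF-32 sample
// in UTF-8 and in UTF-16 (terminators not counted).
constexpr EncodedSize measure(const char32_t* text) {
    EncodedSize total{0, 0};
    for (; *text != 0; ++text) {
        const char32_t cp = *text;
        total.utf8  += cp <= 0x7F ? 1 : cp <= 0x7FF ? 2 : cp <= 0xFFFF ? 3 : 4;
        total.utf16 += cp <= 0xFFFF ? 2 : 4;
    }
    return total;
}
```

For example, measure(U"abc") gives 3 bytes in UTF-8 against 6 in UTF-16, while a purely Hebrew or Cyrillic word comes out the same size in both. Real text mixes letters with ASCII spaces, digits and punctuation, so your own samples decide the outcome.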
Bonus?
Now, in your place, I would avoid using wchar_t directly on Windows.
Instead, I would use the _T(), TCHAR and <tchar.h> macro/typedef/include machinery offered by Windows: With but a few macros defined (UNICODE and _UNICODE if memory serves), as well as a few smart overloads, you can:
use wchar_t and utf-16 on Windows
use utf-8 on Linux
Which will make your code more portable should you switch to another OS.
Please read this article
http://www.joelonsoftware.com/articles/Unicode.html
Please read it carefully.
Now, regarding performance, I very much doubt that you'll see any difference.
You choose your encoding based on what your program is supposed to do.
Is it supposed to communicate with other programs?
Are you storing information in database that will be accessed by other people?
Performance and disk space are not your first priorities when deciding which encoding to use.
Related
I want to deserialize a JSON file – which represents a RESTful web service response – into the corresponding classes. I was using
System.Text.ASCIIEncoding.ASCII.GetBytes(ResponseString) and I read on the Microsoft Docs that using UTF-8 encoding instead of ASCII is better for security reasons.
Now I am a little confused because I don't know the real difference between these two (regarding the security thing). Can anyone show me what the real practical advantages of using UTF-8 over ASCII for deserialization are?
Ultimately, the intention of an encoder is to get back the data you were meant to get. ASCII only defines a tiny tiny 7-bit range of values; anything over that isn't handled, and you could get back garbage - or ?, from payloads that include e̵v̷e̴n̸ ̷r̵e̸m̵o̸t̸e̵l̶y̸ ̶i̴n̴t̵e̵r̷e̵s̶t̶i̷n̷g̵ ̶t̸e̵x̵t̵.
Now; what happens when your application gets data it can't handle? We don't know, and it could indeed quite possibly cause a security problem when you get payloads you can't handle.
It is also just frankly embarrassing in this connected world if you can't correctly store and display the names etc of your customers (or print their name backwards because of right-to-left markers). Most people in the world use things outside of ASCII on a daily basis.
Since UTF-8 is a superset of ASCII, and UTF-8 basically won the encoding war: you might as well just use UTF-8.
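As a rough illustration of the superset point (sketched in C++ rather than C#, with a made-up function name), you can check whether a UTF-8 payload stays inside the 7-bit range. Only when it does would an ASCII encoder represent the text without loss:

```cpp
// Returns true when every byte of a null-terminated UTF-8 string is in the
// 7-bit ASCII range, i.e. the text is identical in ASCII and UTF-8.
constexpr bool is_pure_ascii(const char* utf8) {
    for (; *utf8 != '\0'; ++utf8) {
        if (static_cast<unsigned char>(*utf8) > 0x7F) {
            return false;  // lead or continuation byte of a multi-byte sequence
        }
    }
    return true;
}
```

A JSON response containing a name such as "Zoë" already fails this check, and that is exactly the kind of payload where ASCIIEncoding would hand you a ? instead of the original character.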
Since not every sequence of bytes is a valid encoded string, vulnerabilities arise from unwanted transformations, which can be exploited by clever attackers.
Let me cite from a Black Hat white paper on Unicode security:
Character encodings and the Unicode standard are also exposed to vulnerability. ... often they’re related to implementation in practical use. ... the following categories can enable vulnerability in applications which are not built to prevent the relevant attacks:
Visual Spoofing
Best-fit mappings
Charset transcodings and character mappings
Normalization
Canonicalization of overlong UTF-8
Over-consumption
Character substitution
Character deletion
Casing
Buffer overflows
Controlling Syntax
Charset mismatches
Consider the following ... example. In the case of U+017F LATIN SMALL LETTER LONG S, the upper casing and normalization operations transform the character into a completely different value. In some situations, this behavior could be exploited to create cross-site scripting or other attack scenarios
... software vulnerabilities arise when best-fit mappings occur. To name a few:
Best-fit mappings are not reversible, so data is irrevocably lost.
Characters can be manipulated to bypass string handling filters, such as cross-site scripting (XSS) filters, WAF's, and IDS devices.
Characters can be manipulated to abuse logic in software. Such as when the characters can be used to access files on the file system. In this case, a best-fit mapping to characters such as ../ or file:// could be damaging.
If you are actually storing binary data consider base64 or hex instead.
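To illustrate the "overlong UTF-8" item from the list quoted above: the byte pair C0 AF is an overlong spelling of "/" (U+002F), and a decoder that accepts it can let a path filter be bypassed. Below is a minimal validation sketch in C++; it is not taken from the white paper, and the name is_valid_utf8 is made up.

```cpp
#include <cstddef>
#include <cstdint>

// Returns true only for well-formed UTF-8: no overlong forms, no UTF-16
// surrogates, nothing above U+10FFFF, no truncated sequences.
constexpr bool is_valid_utf8(const unsigned char* s, std::size_t n) {
    std::size_t i = 0;
    while (i < n) {
        const unsigned char b = s[i];
        std::size_t len = 0;
        std::uint32_t cp = 0;
        std::uint32_t min = 0;  // smallest code point allowed for this length
        if (b <= 0x7F)               { len = 1; cp = b;        min = 0; }
        else if ((b & 0xE0) == 0xC0) { len = 2; cp = b & 0x1F; min = 0x80; }
        else if ((b & 0xF0) == 0xE0) { len = 3; cp = b & 0x0F; min = 0x800; }
        else if ((b & 0xF8) == 0xF0) { len = 4; cp = b & 0x07; min = 0x10000; }
        else return false;              // stray continuation or invalid lead byte
        if (i + len > n) return false;  // sequence cut off at the end
        for (std::size_t k = 1; k < len; ++k) {
            if ((s[i + k] & 0xC0) != 0x80) return false;  // not a continuation byte
            cp = (cp << 6) | (s[i + k] & 0x3F);
        }
        if (cp < min) return false;                      // overlong encoding
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // surrogate code point
        if (cp > 0x10FFFF) return false;                 // beyond Unicode
        i += len;
    }
    return true;
}
```

The design choice here is to reject bad input outright instead of repairing it, so that later filters see the same characters the decoder saw.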
Is there a good reason that .NET provides string functions (like search, substring extraction, splitting, etc) only for UTF-16 and not for byte arrays? I see many cases when it would be easier and much more efficient to work with 8-bit chars instead of 16-bit.
Let's take the MIME (.EML) format, for example. It's basically an 8-bit text file. You cannot read it properly using ANY single encoding (because the encoding info is contained within the file; moreover, different parts can have different encodings).
So you are basically better off reading a MIME file as bytes, determining its structure (ideally, using 8-bit string parsing tools), and, after finding the encodings of all encoding-dependent data blocks, applying encoding.GetString(data) to get a normal UTF-16 representation of them.
Another thing is base64 data blocks (base64 is just an example; there are also UUE and others). Currently .NET expects you to have the base64 data as a 16-bit string, but it's not efficient to read data at double the size and do all the conversions from bytes to string just to decode it. When dealing with megabytes of data, this becomes important.
The lack of byte-string manipulation functions means writing them manually, and such an implementation is obviously less efficient than a native implementation of the string functions.
I don't say it needs to be called 8-bit chars, let's keep it bytes. Just have a set of native methods which reflect most string manipulation routines, but with byte arrays. Is this needed only by me or am I missing something important about common .NET architecture?
Let's take the MIME (.EML) format, for example. It's basically an 8-bit text file. You cannot read it properly using ANY single encoding (because the encoding info is contained within the file; moreover, different parts can have different encodings).
So, you're talking about a case where general-purpose byte-string methods aren't very useful, and you'd need to specialise.
And then for other cases, you'd need to specialise again.
And again.
I actually think byte-string methods would be more useful than your example suggests, but it remains that a lot of cases for them have specialised needs that differ from other uses in incompatible ways.
Which suggests it may not be well-suited for the base library. It's not like you can't make your own that do fit those specialised needs.
Code that manipulates mixed-encoding strings is unnecessarily hard to write and much harder to explain and get right. With the way you suggest handling mixed encodings, every "string" would need to carry its encoding information, and the framework would have to provide implementations for all possible combinations of encodings.
The standard solution for such a problem is to provide a well-defined way to convert all types to and from a single "canonical" representation and to perform most operations on that canonical type. You see that more easily in image/video processing, where arbitrary incoming formats are converted into the one format the tool knows about, processed, and converted back to the original or any other format.
.NET strings are almost there as a "canonical" way to represent a Unicode string. There are still many ways to represent what is the same string from the user's point of view but is actually composed of different char elements. Even regular string comparison is a huge problem (as there are frequently locale differences in addition to encoding).
Notes
There are already plenty of APIs for comparing and slicing byte arrays, both in the Array/List classes and as LINQ helpers. The only real missing part is regex-like matching.
Even dealing with a single encoding for strings (UTF-16 in .NET, UTF-8 in many other systems) is hard enough. Even getting the "string length" is a problem (do you need to count surrogate pairs only, or include all combining characters, or is .Length enough?).
It is a good idea to try writing the code yourself to see where the complexity comes from and whether a particular framework decision makes sense. Try to implement 10-15 common string functions that support several encodings, e.g. UTF-8, UTF-16, and one of the 8-bit encodings.
This might have been asked before, but I can't find any such posts. Is there a class to work with ASCII Strings? The benefits are numerous:
Comparison should be faster since it's just byte-for-byte (instead of UTF-8 with variable encoding)
Memory efficient, should use about half the memory in large strings
Faster versions of ToUpper()/ToLower() which use a Look-Up-Table that is language invariant
Jon Skeet wrote a basic AsciiString implementation and proved #2, but I'm wondering if anyone took this further and completed such a class. I'm sure there would be uses, although no one would typically take such a route since all the existing String functions would have to be re-implemented by hand. And conversions between String <> AsciiString would be scattered everywhere complicating an otherwise simple program.
Is there such a class? Where?
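For what it's worth, the language-invariant casing from point 3 is tiny once you restrict yourself to ASCII. Here is a sketch in C++ (not C#, and not taken from any existing AsciiString class); it uses a simple range check instead of the look-up table mentioned above, and both function names are made up.

```cpp
// Language-invariant upper-casing of one ASCII byte. Non-letters and bytes
// above 0x7F are returned unchanged.
constexpr unsigned char ascii_to_upper(unsigned char c) {
    return (c >= 'a' && c <= 'z')
        ? static_cast<unsigned char>(c - ('a' - 'A'))
        : c;
}

// Byte-for-byte, case-insensitive equality of two null-terminated ASCII strings.
constexpr bool ascii_iequals(const char* a, const char* b) {
    for (; *a != '\0' && *b != '\0'; ++a, ++b) {
        if (ascii_to_upper(static_cast<unsigned char>(*a)) !=
            ascii_to_upper(static_cast<unsigned char>(*b))) {
            return false;
        }
    }
    return *a == *b;  // equal only if both strings ended at the same position
}
```

This kind of comparison fits protocol tokens such as header names, where ascii_iequals("Content-Type", "content-type") holds regardless of the current locale.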
I thought I would post the outcome of my efforts to implement a system as described with as much string support and compatibility as I could. It's possibly not perfect but it should give you a decent base to improve on if needed.
The ASCIIChar struct and ASCIIString string implicitly convert to their native counterparts for ease of use.
The OP's suggestion for replacements of ToUpper/Lower etc have been implemented in a much quicker way than a lookup list and all the operations are as quick and memory friendly as I could make them.
Sorry couldn't post source, it was too long. See links below.
ASCIIChar - Replaces char, stores the value in a byte instead of int and provides support methods and compatibility for the string class. Implements virtually all methods and properties available for char.
ASCIIChars - Provides static properties for each of the valid ASCII characters for ease of use.
ASCIIString - Replaces string, stores characters in a byte array and implements virtually all methods and properties available for string.
Dotnet has no direct ASCII string support. Strings are UTF-16 because the Windows API works only with single-byte characters (one char, one byte) or UTF-16. UTF-8 might be the best solution, but .NET strings do not use it because Windows doesn't. (Java strings are UTF-16 as well.)
The Windows API can convert between charsets, but it only works with 1-byte or 2-byte chars, so if you use UTF-8 strings in .NET you must convert them every time, which has an impact on performance. Dotnet can read and write UTF-8 and other encodings via BinaryWriter/BinaryReader or a simple StreamWriter/StreamReader.
With .NET things are fairly simple: it is all (including ARM, AFAIK) running little endian.
The question that I have is: what is happening on Mono and (potentially) big-endian systems? Do the bits reverse (when compared to x86) in the Int32 / Int64 structures, or does the framework force a little-endian rule set?
Thanks
Your assertion that all MS .NET are little endian is not correct. It depends on the architecture that you are running on - the CLR spec says so:
From the CLI Annotated Standard (p.161) — Partition I, section 12.6.3: "Byte Ordering":
For data types larger than 1 byte, the byte ordering is dependent on the target CPU. Code that depends on byte ordering may not run on all platforms. [...]
(taken from this SO answer)
See this answer for more information on the internals of BitConverter and how it handles endianness.
A list of behavioral changes I can think of at the moment (unchecked and incomplete):
IPAddress.HostToNetworkOrder and IPAddress.NetworkToHostOrder
Nearly everything in BitConverter
BinaryReader and BinaryWriter (EDIT: From documentation: "BinaryReader reads this data type in little-endian format.")
Binary serialization
Everything that reads and writes Unicode in default encoding from/to streams (UnicodeEncoding) (EDIT: Default is defined as little endian)
and of course every (runtime library) function using these.
Usually Microsoft doesn't mention endianness in their docs - with some strange exceptions. For instance, BinaryReader.ReadUInt16 is defined to read little endian. Nothing mentioned for the other methods. One may assume that binary serialization is always little-endian, even on big-endian machines.
Note that XNA on Xbox 360 is big-endian, so this is not just a theoretical problem with Mono.
C#/.NET does not make any claims about endianness. Int32/Int64 are atomic values, not structures.
As far as I know such conversion would happen outside the scope of your code and hidden to you. It's called "managed code" for some reasons, including such potential issues.
To know if bytes are "reversed", just check BitConverter.IsLittleEndian:
if (BitConverter.IsLittleEndian)
{
    // reverse bytes
}
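The "reverse bytes" step, and a way to avoid needing it at all, can be sketched as follows. This is C++ rather than C#, and swap_bytes and read_le32 are made-up names, not framework APIs.

```cpp
#include <cstdint>

// Reverses the byte order of a 32-bit value (0x11223344 becomes 0x44332211).
constexpr std::uint32_t swap_bytes(std::uint32_t v) {
    return ((v & 0x000000FFu) << 24) | ((v & 0x0000FF00u) << 8) |
           ((v & 0x00FF0000u) >> 8)  | ((v & 0xFF000000u) >> 24);
}

// Reads a little-endian 32-bit value from 4 bytes. Because it works with
// shifts on values, it gives the same result on any host byte order.
constexpr std::uint32_t read_le32(const unsigned char* p) {
    return static_cast<std::uint32_t>(p[0])
         | (static_cast<std::uint32_t>(p[1]) << 8)
         | (static_cast<std::uint32_t>(p[2]) << 16)
         | (static_cast<std::uint32_t>(p[3]) << 24);
}
```

The second function shows the usual design choice for file and network formats: fix the byte order of the serialized data and decode it arithmetically, so the code never has to ask what the host's endianness is.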
Considering how similar .Net and Mono are by design, I'd say they probably handle endianness the same.
You can always test it by creating a managed int with a known value, then using reflection or marshalling to access the memory and take a look.
Examining the attributes of UTF-16 and UTF-8, I can't find any reason to prefer UTF-16.
However, checking out Java and C#, it looks like strings and chars there default to UTF-16. I was thinking that it might be for historic reasons, or perhaps for performance reasons, but couldn't find any information.
Anyone knows why these languages chose UTF-16? And is there any valid reason for me to do that as well?
EDIT: Meanwhile I've also found this answer, which seems relevant and has some interesting links.
East Asian languages typically require less storage in UTF-16 (2 bytes is enough for 99% of East-Asian language characters) than UTF-8 (typically 3 bytes is required).
Of course, for Western languages, UTF-8 is usually smaller (1 byte instead of 2). For mixed files like HTML (where there's a lot of markup) it's much of a muchness.
Processing of UTF-16 for user-mode applications is slightly easier than processing UTF-8, because surrogate pairs behave in almost the same way that combining characters behave. So UTF-16 can usually be processed as a fixed-size encoding.
@Oak: this is too long for a comment...
I don't know about C# (and would be really surprised: it would mean they just copied Java too much), but for Java it's simple: Java was conceived before Unicode 3.1 came out.
Hence there were fewer than 65537 code points, so every Unicode code point still fit in 16 bits, and so the Java char was born.
Of course this led to crazy issues that still affect Java programmers (like me) today: you have a method charAt which in some cases returns neither a Unicode character nor a Unicode code point, and a method codePointAt (added in Java 5) whose argument is not the number of code points you want to skip! (You have to give codePointAt the number of Java chars you want to skip, which makes it one of the least understood methods in the String class.)
So, yup, this is definitely wild and confusing to most Java programmers (most aren't even aware of these issues) and, yup, it is for historical reasons. At least, that was the excuse that came up when people got mad about this issue: Unicode 3.1 wasn't out yet.
:)
I imagine C# using UTF-16 derives from the Windows NT family of operating systems using UTF-16 internally.
I imagine there are two main reasons why Windows NT uses UTF-16 internally:
For memory usage: UTF-32 wastes a lot of space to encode.
For performance: UTF-8 is much harder to decode than UTF-16. In UTF-16 characters are either a Basic Multilingual Plane character (2 bytes) or a Surrogate Pair (4 bytes). UTF-8 characters can be anywhere between 1 and 4 bytes.
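The two UTF-16 cases described above can be written down in a few lines. This is only an illustrative C++ sketch (Utf16Units and encode_utf16 are made-up names); it assumes the input is a valid code point and does not check for surrogate values or anything above U+10FFFF.

```cpp
// Result of encoding one code point: one or two 16-bit code units.
struct Utf16Units {
    char16_t first;
    char16_t second;  // only meaningful when count == 2
    int count;
};

// Encodes one code point as UTF-16: a single unit inside the Basic
// Multilingual Plane, a high/low surrogate pair above it.
constexpr Utf16Units encode_utf16(char32_t cp) {
    if (cp <= 0xFFFF) {
        return Utf16Units{static_cast<char16_t>(cp), 0, 1};
    }
    const char32_t v = cp - 0x10000;  // 20 bits, split 10/10
    return Utf16Units{static_cast<char16_t>(0xD800 + (v >> 10)),
                      static_cast<char16_t>(0xDC00 + (v & 0x3FF)),
                      2};
}
```

For example, U+1F600 comes out as the pair D83D DE00, while U+05D0 stays a single unit.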
Contrary to what other people have answered - you cannot treat UTF-16 as UCS-2. If you want to correctly iterate over actual characters in a string, you have to use unicode-friendly iteration functions. For example in C# you need to use StringInfo.GetTextElementEnumerator().
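To show why UTF-16 is not UCS-2, here is a small C++ sketch that counts code points rather than 16-bit units. Note the limits: the function names are made up, and this counts code points only, not text elements (combining sequences), which is the extra work StringInfo does in C#.

```cpp
#include <cstddef>

// Counts 16-bit code units in a null-terminated UTF-16 string
// (roughly what a .Length-style property reports).
constexpr std::size_t count_code_units(const char16_t* s) {
    std::size_t count = 0;
    while (s[count] != 0) ++count;
    return count;
}

// Counts Unicode code points: a high surrogate followed by a low surrogate
// is one code point; every other unit counts on its own.
constexpr std::size_t count_code_points(const char16_t* s) {
    std::size_t count = 0;
    while (*s != 0) {
        const bool high = *s >= 0xD800 && *s <= 0xDBFF;
        const bool low_next = s[1] >= 0xDC00 && s[1] <= 0xDFFF;
        s += (high && low_next) ? 2 : 1;
        ++count;
    }
    return count;
}
```

For u"a\U0001F600b" the first function reports 4 units and the second 3 code points, so any code that indexes by unit can land in the middle of a pair.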
For further information, this page on the wiki is worth reading: http://en.wikipedia.org/wiki/Comparison_of_Unicode_encodings
It depends on the expected character sets. If you expect heavy use of Unicode code points outside of the 7-bit ASCII range then you might find that UTF-16 will be more compact than UTF-8, since some UTF-8 sequences are more than two bytes long.
Also, for efficiency reasons, Java and C# do not take surrogate pairs into account when indexing strings. This would break down completely when using code points that are represented with UTF-8 sequences that take up an odd number of bytes.
UTF-16 can be more efficient for representing characters in some languages such as Chinese, Japanese and Korean where most characters can be represented in one 16 bit word. Some rarely used characters may require two 16 bit words. UTF-8 is generally much more efficient for representing characters from Western European character sets - UTF-8 and ASCII are equivalent over the ASCII range (0-127) - but less efficient with Asian languages, requiring three or four bytes to represent characters that can be represented with two bytes in UTF-16.
UTF-16 has an advantage as an in-memory format for Java/C# in that every character in the Basic Multilingual Plane can be represented in 16 bits (see Joe's answer) and some of the disadvantages of UTF-16 (e.g. confusing code relying on \0 terminators) are less relevant.
If we're talking about plain text alone, UTF-16 can be more compact in some languages, Japanese (about 20%) and Chinese (about 40%) being prime examples. As soon as you're comparing HTML documents, the advantage goes completely the other way, since UTF-16 is going to waste a byte for every ASCII character.
As for simplicity or efficiency: if you implement Unicode correctly in an editor application, complexity will be similar because UTF-16 does not always encode codepoints as a single number anyway, and single codepoints are generally not the right way to segment text.
Given that in the most common applications, UTF-16 is less compact, and equally complex to implement, the singular reason to prefer UTF-16 over UTF-8 is if you have a completely closed ecosystem where you are regularly storing or transporting plain text entirely in complex writing systems, without compression.
After compression with zstd or LZMA2, even for 100% Chinese plain text, the advantage is completely wiped out; with gzip the UTF-16 advantage is about 4% on Chinese text with around 3000 unique graphemes.
For many (most?) applications, you will be dealing only with characters in the Basic Multilingual Plane, so can treat UTF-16 as a fixed-length encoding.
So you avoid all the complexity of variable-length encodings like UTF-8.