What is the formula for determining the maximum number of UTF-8 bytes required to encode a given number of UTF-16 code units (i.e. the value of String.Length in C# / .NET)?
I see 3 possibilities:
# of UTF-16 code units x 2
# of UTF-16 code units x 3
# of UTF-16 code units x 4
A Unicode code point is represented in UTF-16 by either 1 or 2 code units, so we just need to consider the worst-case scenario of a string filled with one or the other. If a UTF-16 string is composed entirely of 2-code-unit code points, then we know the UTF-8 representation will be at most the same size, since those code points take up a maximum of 4 bytes in both encodings; for that scenario the worst case is option (1) above.
So the interesting case to consider, which I don't know the answer to, is the maximum number of bytes that a single code unit UTF-16 code point can require in UTF-8 representation.
If all single code unit UTF-16 code points can be represented with 3 UTF-8 bytes, which my gut tells me makes the most sense, then option (2) will be the worst case scenario. If there are any that require 4 bytes then option (3) will be the answer.
Does anyone have insight into which is correct? I'm really hoping for (1) or (2) as (3) is going to make things a lot harder :/
UPDATE
From what I can gather, UTF-16 encodes all characters in the BMP in a single code unit, and all other planes are encoded in 2 code units.
It seems that UTF-8 can encode the entire BMP within 3 bytes and uses 4 bytes for encoding the other planes.
Thus it seems to me that option (2) above is the correct answer, and this should work:
string str = "Some string";
int maxUtf8EncodedSize = str.Length * 3;
Does that seem like it checks out?
The worst case for a single UTF-16 code unit is U+FFFF, which UTF-16 encodes just as-is (0xFFFF). In UTF-8 it is encoded as ef bf bf (three bytes).
The worst case for two UTF-16 code units (a "surrogate pair") is U+10FFFF, which UTF-16 encodes as 0xDBFF 0xDFFF. In UTF-8 it is encoded as f4 8f bf bf (four bytes).
Therefore the worst case is a load of U+FFFF's which will convert a UTF-16 string of length 2N bytes to a UTF-8 string of length 3N bytes.
So yes, you are correct. I don't think you need to consider stuff like glyphs because that sort of thing is done after decoding from UTF8/16 to code points.
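If you want to check this empirically, here is a quick sketch (class and variable names are mine; GetMaxByteCount is .NET's own built-in conservative bound, which for UTF-8 is (charCount + 1) * 3):

using System;
using System.Text;

class Utf8WorstCase
{
    static void Main()
    {
        // Worst case per the answer above: every char is U+FFFF (3 UTF-8 bytes).
        string worst = new string('\uFFFF', 10);
        Console.WriteLine(Encoding.UTF8.GetByteCount(worst)); // 30 = Length * 3

        // A surrogate pair: 2 chars -> 4 UTF-8 bytes, within the 2 * 3 bound.
        string astral = char.ConvertFromUtf32(0x10FFFF);      // "\uDBFF\uDFFF"
        Console.WriteLine(Encoding.UTF8.GetByteCount(astral)); // 4

        // .NET's own worst-case estimate is slightly more conservative.
        Console.WriteLine(Encoding.UTF8.GetMaxByteCount(10));  // 33 = (10 + 1) * 3
    }
}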
Properly formed UTF-8 can be up to 4 bytes per Unicode codepoint.
UTF-16 uses up to two 16-bit code units per Unicode codepoint.
Characters outside the Basic Multilingual Plane (including emoji and scripts added in more recent versions of Unicode) need up to 21 bits, which in UTF-8 results in 4-byte sequences; they also take up 4 bytes in UTF-16.
However, some environments do things oddly. Since UTF-16 represents characters outside the Basic Multilingual Plane with two 16-bit code units (detectable because they always fall in the surrogate range U+D800 to U+DFFF), some mistaken UTF-8 implementations, usually referred to as CESU-8, encode each of those surrogates as its own 3-byte UTF-8 sequence, for a total of six bytes per supplementary codepoint. (I believe some early Oracle DB implementations did this, and I'm sure they weren't the only ones.)
There's one more minor wrench in things: some glyphs are built from combining characters, so multiple code points go into determining what gets displayed on the screen, but I don't think that applies in your case.
Based on your edit, it looks like you're trying to estimate the maximum length of a .NET encoding conversion. String.Length counts Chars, which are UTF-16 code units. As a worst-case estimate, therefore, I believe you can safely use count(Char) * 3: each non-BMP character occupies two Chars but only 4 UTF-8 bytes, which stays under that bound.
If you want to get the total number of UTF-32 codepoints represented, you should be able to do something like
var maximumUtf8Bytes = new System.Globalization.StringInfo(myString).LengthInTextElements * 4;
(My C# is a bit rusty as I haven't used a .Net environment much in the last few years, but I think that does the trick).
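One caveat: LengthInTextElements counts text elements (grapheme clusters), which can each span several code points, so it isn't quite a code point count. If you're on .NET Core 3.0 or later, string.EnumerateRunes counts Unicode scalar values directly; a sketch (the sample string is mine):

using System.Linq;

string myString = "a\U0001F600b";                   // 'a', an emoji, 'b': 4 chars
int codePoints = myString.EnumerateRunes().Count(); // 3 scalar values
int maximumUtf8Bytes = codePoints * 4;              // each code point <= 4 UTF-8 bytes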
Related
When implementing the Run-length encoding (RLE), can I assume that the Runs are going to be shorter than one byte?
So there will not be a situation where there is a run like this
WWWBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB...
where there are 256 B's, because you cannot represent that length in one byte, whereas you can represent the W's as 3W.
If not, should the Run be split into two Runs? How should this situation be handled? I couldn't find any information about this case.
To my understanding, you understand the situation correctly. The word length used for counting the repetition of a character is usually a byte, and the individual characters usually are also encoded as a byte. If in the input there is a repetition of e.g. 300 b, the encoding will be as follows.
255 (number of repetitions of the next character)
98 (ASCII value for b)
45 (number of repetitions of the next character)
98 (ASCII value for b)
In total, a run of length greater than 255 will have to be split into two runs. That being said, the actual encoding depends on the specific implementation; it is also possible to use types other than bytes for counting the repetitions of characters.
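A minimal sketch of such an encoder in C# (the (count, value) pair format and the names are illustrative, not from any particular spec):

using System.Collections.Generic;

static class Rle
{
    // Emits (count, value) byte pairs, capping each run at 255 so the
    // count always fits in one byte; 300 b's become (255, 'b')(45, 'b').
    public static byte[] Encode(byte[] input)
    {
        var output = new List<byte>();
        int i = 0;
        while (i < input.Length)
        {
            byte value = input[i];
            int run = 1;
            while (i + run < input.Length && input[i + run] == value && run < 255)
                run++;
            output.Add((byte)run);
            output.Add(value);
            i += run;
        }
        return output.ToArray();
    }
}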
Welcome to unsafe land.
I'm doing P/Invoke to a legacy lib that gives me a 0-terminated C-style string in the form of an unknown-length unmanaged byte buffer that can be either ASCII or UTF-16, but without giving any indication whatsoever thereof - other than the byte stream itself that is...
Right now I have a bad scheme, based on checking for single and double 0-bytes, to decide if I should create a managed String from Char* or SByte*. The scheme obviously breaks down for every Unicode code-point higher than U+00FF.
This is what I have:
The address of the unmanaged byte buffer.
The unmanaged byte buffer is of unknown length.
The unmanaged byte buffer is either a 0-terminated ASCII C-style string or a 0-terminated UTF-16 C-style string.
This is what I want:
Create a correct managed String from the unmanaged byte buffer, whether it's ASCII or UTF-16.
Is that problem generically solvable?
I don't think this can be solved 100%. If the buffer contains 6c 34 00 00 ("l4"), is that the Chinese sign for water, or just an ASCII lower L and 4? But it should be possible to guess right "most of the time" depending on the specific strings.
Is the UTF-16 little endian or (probably) big endian?
The largest risk is buffer overrun. For instance, if the buffer starts with a 00, is that a zero-length ASCII string, or should we try reading more of the buffer, interpreting it as UTF-16BE?
Is that problem generically solvable?
No.
If you know the length of the string (and that it's even), you could identify UTF-16 by the presence of 00 bytes padding ISO-8859-1 characters. (Even a non-Latin alphabet language would still make heavy use of ASCII space and newline.)
But if you depend on null termination, that won't help you. If you look for 00 00, you can falsely match a stray 00 byte that just happens to sit right after the null terminator. Worse, if an ASCII string isn't double-null-terminated, you'll run right past the end of the buffer.
One way of adding a level of heuristics to the naïve encoding detection scheme that is based on checking for single and double 0-bytes:
Assume that a marshalled "context" from the legacy lib consists of one or more strings.
If one string in such a context is likely to be UTF-16, then all other strings in that context are also UTF-16.
So, as soon as a UTF-16 string is found with "high enough" certainty, bias all other detections to be "probably UTF-16".
If a "probably not UTF-16" string is found to be a "definitely not UTF-8" string, then it cannot be ASCII either, so set it as UTF-16.
That'll give a much higher rate of accurately created managed Strings.
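A rough sketch of what that biased detector could look like (the parity heuristic and all names are my own; an odd-length ASCII string will still fool the parity check, which is exactly where the context bias helps):

using System;
using System.Runtime.InteropServices;

static class NativeStringReader
{
    // In Latin-heavy UTF-16LE the very first 0x00 byte is the high byte of
    // code unit 0 (an odd offset); in ASCII the first 0x00 is the terminator.
    public static string GuessAndMarshal(IntPtr buffer, ref bool contextBiasUtf16)
    {
        int firstZero = 0;
        while (Marshal.ReadByte(buffer, firstZero) != 0)
            firstZero++;

        bool probablyUtf16 = contextBiasUtf16 || (firstZero % 2 == 1);
        if (firstZero % 2 == 1)
            contextBiasUtf16 = true; // one confident hit biases the whole context

        return probablyUtf16
            ? Marshal.PtrToStringUni(buffer)   // marshal as UTF-16
            : Marshal.PtrToStringAnsi(buffer); // marshal as 8-bit/ASCII
    }
}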
How many bits is a .NET string that's 10 characters in length? (.NET strings are UTF-16, right?)
On 32-bit systems:
4 bytes = Type pointer (Every object has one of these)
4 bytes = Lock (One of these too!)
4 bytes = Length (Need the length)
2 * Length bytes = Data (And the chars themselves)
=======================
12 + 2*Length bytes
=======================
96 + 16*Length bits
So 10 chars would = 256 bits = 32 bytes
I am not sure if the Lock grows to 64-bit on 64-bit systems. I kinda hope not, but you never know. The 64-bit structure overhead is therefore anywhere from 16-20 bytes (as opposed to the 12 bytes on 32-bit).
Every char in the string is two bytes in size, so if you are just taking the chars directly and not using any particular encoding, the answer is string.Length * 2 * 8 bits.
otherwise the result depends on the encoding, you can write:
int numbits = System.Text.Encoding.UTF8.GetByteCount(str)*8; //returns 80
or
int numbits = System.Text.Encoding.Unicode.GetByteCount(str)*8; //returns 160
If you are talking pure UTF-16, then:
10 characters = 20 bytes = 160 bits
This really needs a context in order to be answered properly.
It all comes down to how you define character and how you store the data.
For example, if you define character as a single letter from the user's point of view, it can be more than 2 bytes; for example, the character Å can be two Unicode code points (U+0041 U+030A, Latin Capital Letter A + Combining Ring Above), so it will require two .NET chars, or 4 bytes in UTF-16.
Now even if you are talking about 10 .NET Char elements, then if it's in memory you have some object overhead (as already mentioned) and a bit of alignment overhead (on a 32-bit system everything has to be aligned to a 4-byte boundary; in 64-bit the rules are more complicated), so you may have some empty bytes at the end.
If you are talking about databases or files, then each database and file system has its own overhead.
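To make the Å example concrete (a small snippet of mine, not from the original answer):

using System;
using System.Text;

string decomposed = "A\u030A";        // U+0041 + U+030A, renders as Å
Console.WriteLine(decomposed.Length); // 2 chars
Console.WriteLine(Encoding.Unicode.GetByteCount(decomposed));             // 4 bytes in UTF-16
Console.WriteLine(decomposed.Normalize(NormalizationForm.FormC).Length); // 1 (precomposed U+00C5)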
I am using this for encryption: http://msdn.microsoft.com/en-us/library/system.security.cryptography.rijndaelmanaged.aspx
Is there a way I can predict what the encrypted text will look like? I am converting the encrypted output to text so I can store it in the db.
I just want to make sure the size of the database column is large enough.
I am limiting the text input to be 20 characters.
Are you using SQL Server 2005 or above? If so you could just use VARCHAR(MAX) or NVARCHAR(MAX) for the column type.
If you want to be a bit more precise...
The maximum block size for RijndaelManaged is 256 bits (32 bytes).
Your maximum input size is 20 characters, so even if we assume a worst-case scenario of 4 bytes per character, that'll only amount to 80 bytes, which will then be padded up to a maximum of 96 bytes for the encryption process.
If you use Base64 encoding on the encrypted output that will create 128 characters from the 96 encrypted bytes. If you use hex encoding then that will create 192 characters from the 96 encrypted bytes (plus maybe a couple of extra characters if you're prefixing the hex string with "0x"). In either case a column width of 200 characters should give you more than enough headroom.
(NB: These are just off-the-top-of-my-head calculations. I haven't verified that they're actually correct!)
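If you'd rather verify than trust the arithmetic, a throwaway check along these lines should do it (a sketch: the 80-byte input is the worst-case estimate above, and note that 256-bit blocks work on .NET Framework, while .NET Core's RijndaelManaged only supports 128-bit blocks):

using System;
using System.Security.Cryptography;

class CipherSizeCheck
{
    static void Main()
    {
        using var rijndael = new RijndaelManaged
        {
            BlockSize = 256,              // 32-byte blocks, as discussed above
            Padding = PaddingMode.PKCS7
        };

        byte[] plaintext = new byte[80];  // 20 chars * 4 bytes, worst case
        byte[] ciphertext;
        using (var encryptor = rijndael.CreateEncryptor())
            ciphertext = encryptor.TransformFinalBlock(plaintext, 0, plaintext.Length);

        Console.WriteLine(ciphertext.Length);                         // 96
        Console.WriteLine(Convert.ToBase64String(ciphertext).Length); // 128
    }
}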
For an unknown encryption algorithm with no information to be found online, I would write a little test program that encrypted a random set of strings of maximum length, find the longest length in the output, then multiply by a safety factor based on how likely the length of input is to change, and how accurate the result of the test program was.
Really generally speaking though, you're probably going to be in the 1.5x - 2x input length range.
For this specific algorithm (with the default 128-bit, i.e. 16-byte, block size), the length of the ciphertext will be, using integer division,
((length + 16) / 16) * 16
This is to meet the block size and padding requirement.
I suggest you also prepend a random IV to the ciphertext; that will take another 16 bytes.
However, if you want to store this as text in the database, you will have to encode it, which will increase the size even more.
For base64, multiply it by 4/3. For hex, double it.
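Putting that together as a back-of-the-envelope helper (my own sketch; 16-byte block and a prepended IV assumed as above):

// All integer math; blockSize = 16 for the default 128-bit block.
static int EstimateBase64Chars(int plaintextBytes, int blockSize = 16)
{
    int padded = ((plaintextBytes + blockSize) / blockSize) * blockSize; // PKCS7
    int withIv = padded + blockSize;   // room for the prepended random IV
    return (withIv + 2) / 3 * 4;       // Base64 emits 4 chars per 3 bytes
}
// EstimateBase64Chars(20) == 64: 32 padded + 16 IV = 48 bytes -> 64 chars.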
Encryption will never increase the size of data beyond the minimum padding required.
If it does 'expand' the data, it is probably not a very good encryption algorithm.
I am trying to parse some output data from a PBX and I have found something that I can't really figure out.
In the documentation it says the following
Information for type of call and feature. Eight character for ’status information 3’ with following ASCII values in hexadecimal notation.
1. Character
Bit7 Incoming call
Bit6 Outgoing call
Bit5 Internal call
Bit4 CN call
2. Character
Bit3 Transferred call (transferring party inside)
Bit2 CN-transferred call (transferring party outside)
Bit1
Bit0
Any ideas how to interpret this? I have no raw data at the time to match against but I still need to figure it out.
Probably you'll receive two characters (hex digits: 0-9, A-F). The first digit represents the hex value of the most significant 4 bits, the next digit the least significant 4 bits.
Example:
You will probably receive something like the string "7C" as hex representation of the bitmap: 01111100.
Eight character for ’status information 3’ with following ASCII values in hexadecimal notation.
I think this means the following.
You will get 8 bytes - one byte per line, I guess.
It is just the wrong term. They mean two hex digits per byte but call them characters.
So it is just a byte with bit flags - or, more precisely, an array of eight such bytes.
Bit
7 incoming
6 outgoing
5 internal
4 CN
3 transferred
2 CN transferred
1 unused?
0 unused?
You could map this to an enum.
[Flags]
public enum CallInformation : byte
{
    Incoming = 128,
    Outgoing = 64,
    Internal = 32,
    CN = 16,
    Transferred = 8,
    CNTransferred = 4,
    Undefined = 0
}
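Using it on a two-hex-digit value could look like this (the sample value "7C" is taken from the earlier answer):

using System;

byte value = Convert.ToByte("7C", 16);   // "7C" -> 0b0111_1100
var info = (CallInformation)value;

Console.WriteLine(info.HasFlag(CallInformation.Outgoing)); // True
Console.WriteLine(info); // CNTransferred, Transferred, CN, Internal, Outgoing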
Very hard without data. I'd guess that you will get two bytes (two ASCII characters), and need to pick them apart at the bit level.
For instance, if the first character is 'A', you will need to look up its character code (65, or hex 0x41), and then look at the bits. Of course the bits are the same regardless of decimal or hex, but it's easier to do by hand in hex. 0x41 has bit 6 and bit 0 set, so per the documentation that would be an 'outgoing call'. Bit 0 seems undocumented.
I'm not sure why the documentation says it needs eight characters; only eight bits are documented, which would take just two hex characters.