Bit/byte conversion

Bit/byte conversion - c#

How many bits is a .NET string that's 10 characters in length? (.NET strings are UTF-16, right?)

On 32-bit systems:
4 bytes = Type pointer (Every object has one of these)
4 bytes = Lock (One of these too!)
4 bytes = Length (Need the length)
2 * Length bytes = Data (And the chars themselves)
=======================
12 + 2*Length bytes
=======================
96 + 16*Length bits
So 10 chars would = 256 bits = 32 bytes
I am not sure if the Lock grows to 64-bit on 64-bit systems. I kinda hope not, but you never know. The 64-bit structure overhead is therefore anywhere from 16-20 bytes (as opposed to the 12 bytes on 32-bit).
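The 32-bit arithmetic above can be written as a small helper. This is only a sketch of that answer's formula (12 bytes of overhead plus 2 bytes per char); the class and method names are made up for illustration, and it deliberately ignores the trailing null char and alignment padding that other answers mention.

```csharp
using System;

// Hypothetical helper: reproduces the 32-bit estimate from the answer above.
public static class StringSizeEstimate
{
    // 4 (type pointer) + 4 (lock) + 4 (length) = 12 bytes of overhead.
    private const int OverheadBytes32 = 12;

    // Estimated bytes: fixed overhead plus 2 bytes per UTF-16 code unit.
    public static int EstimateBytes32(string s)
    {
        return OverheadBytes32 + 2 * s.Length;
    }

    // Same estimate expressed in bits.
    public static int EstimateBits32(string s)
    {
        return EstimateBytes32(s) * 8;
    }
}
```

For a 10-character string such as "0123456789", EstimateBytes32 gives 32 and EstimateBits32 gives 256, matching the figures above.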

Every char in the string is two bytes in size, so if you are just converting the chars directly and not using any particular encoding, the answer is string.Length * 2 * 8.
Otherwise the result depends on the encoding. You can write:
int numbits = System.Text.Encoding.UTF8.GetByteCount(str) * 8; // 80 if all 10 characters are ASCII
or
int numbits = System.Text.Encoding.Unicode.GetByteCount(str) * 8; // 160 for a 10-char string

If you are talking plain UTF-16 then:
10 characters = 20 bytes = 160 bits
This really needs a context in order to be answered properly.

It all comes down to how you define a character and how you store the data.
For example, if you define a character as a single letter from the user's point of view, it can be more than 2 bytes. The character Å, for instance, can be written as two Unicode code points (U+0041 U+030A, Latin Capital Letter A + Combining Ring Above), so it will require two .NET chars, or 4 bytes in UTF-16.
Now even if you are talking about 10 .NET Char elements, then if it's in memory you have some object overhead (already mentioned) and a bit of alignment overhead (on a 32-bit system everything has to be aligned to a 4-byte boundary; on 64-bit the rules are more complicated), so you may have some empty bytes at the end.
If you are talking about a database or files, then each database and file system has its own overhead.
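To make the Å example concrete, here is a minimal sketch comparing the number of UTF-16 code units with the number of user-perceived characters. The class and method names are hypothetical; StringInfo and LengthInTextElements are the framework's own.

```csharp
using System;
using System.Globalization;

// Hypothetical helper contrasting two ways of "counting characters".
public static class CharCounting
{
    // Number of UTF-16 code units, which is what string.Length reports.
    public static int CodeUnits(string s)
    {
        return s.Length;
    }

    // Number of text elements (a base character plus its combining marks counts once).
    public static int TextElements(string s)
    {
        return new StringInfo(s).LengthInTextElements;
    }
}
```

For the decomposed form "A\u030A", CodeUnits returns 2 (4 bytes in UTF-16) while TextElements returns 1.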

Related

Maximum UTF-8 string size given UTF-16 size

What is the formula for determining the maximum number of UTF-8 bytes required to encode a given number of UTF-16 code units (i.e. the value of String.Length in C# / .NET)?
I see 3 possibilities:
# of UTF-16 code units x 2
# of UTF-16 code units x 3
# of UTF-16 code units x 4
A UTF-16 code point is represented by either 1 or 2 code units, so we just need to consider the worst case scenario of a string filled with one or the other. If a UTF-16 string is composed entirely of 2 code unit code points, then we know the UTF-8 representation will be at most the same size, since the code points take up a maximum of 4 bytes in both representations, thus worst case is option (1) above.
So the interesting case to consider, which I don't know the answer to, is the maximum number of bytes that a single code unit UTF-16 code point can require in UTF-8 representation.
If all single code unit UTF-16 code points can be represented with 3 UTF-8 bytes, which my gut tells me makes the most sense, then option (2) will be the worst case scenario. If there are any that require 4 bytes then option (3) will be the answer.
Does anyone have insight into which is correct? I'm really hoping for (1) or (2) as (3) is going to make things a lot harder :/
UPDATE
From what I can gather, UTF-16 encodes all characters in the BMP in a single code unit, and all other planes are encoded in 2 code units.
It seems that UTF-8 can encode the entire BMP within 3 bytes and uses 4 bytes for encoding the other planes.
Thus it seems to me that option (2) above is the correct answer, and this should work:
string str = "Some string";
int maxUtf8EncodedSize = str.Length * 3;
Does that seem like it checks out?

The worst case for a single UTF-16 word is U+FFFF, which in UTF-16 is encoded just as-is (0xFFFF). In UTF-8 it is encoded as ef bf bf (three bytes).
The worst case for two UTF-16 words (a "surrogate pair") is U+10FFFF, which in UTF-16 is encoded as 0xDBFF 0xDFFF. In UTF-8 it is encoded as f4 8f bf bf (four bytes).
Therefore the worst case is a load of U+FFFF's which will convert a UTF-16 string of length 2N bytes to a UTF-8 string of length 3N bytes.
So yes, you are correct. I don't think you need to consider stuff like glyphs because that sort of thing is done after decoding from UTF8/16 to code points.
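The two worst cases above can be checked with Encoding.UTF8.GetByteCount. The following is a minimal sketch; the class and method names are made up, and UpperBound is just the Length * 3 rule from the question, not a framework API. (The framework also offers Encoding.UTF8.GetMaxByteCount, which returns its own conservative bound.)

```csharp
using System;
using System.Text;

// Hypothetical helper for comparing actual UTF-8 size with the Length * 3 bound.
public static class Utf8WorstCase
{
    // Actual number of UTF-8 bytes needed for the string.
    public static int Utf8Bytes(string s)
    {
        return Encoding.UTF8.GetByteCount(s);
    }

    // Upper bound from the reasoning above: 3 bytes per UTF-16 code unit.
    public static int UpperBound(string s)
    {
        return s.Length * 3;
    }
}
```

Utf8Bytes("\uFFFF") is 3 for one code unit, and Utf8Bytes(char.ConvertFromUtf32(0x10FFFF)) is 4 for two code units, so both stay within UpperBound.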

Properly formed UTF-8 can be up to 4 bytes per Unicode codepoint.
UTF-16-encoded characters can be up to 2 16-bit sequences per Unicode codepoint.
Characters outside the basic multilingual plane (including emoji and languages that were added to more recent versions of Unicode) are represented in up to 21 bits, which in the UTF-8 format results in 4 byte sequences, which turn out to also take up 4 bytes in UTF-16.
However, there are some environments that do things weirdly. Since characters outside the basic multilingual plane take two 16-bit code units in UTF-16 (they're detectable because those code units are always in the range 0xD800 to 0xDFFF), some nonstandard implementations, usually referred to as CESU-8, encode each of the two surrogates as its own 3-byte UTF-8 sequence, for a total of six bytes per code point. (I believe some early Oracle DB implementations did this, and I'm sure they weren't the only ones).
There's one more minor wrench in things, which is that some glyphs are classified as combining characters, and multiple UTF-16 (or UTF-32) sequences are used when determining what gets displayed on the screen, but I don't think that applies in your case.
Based on your edit, it looks like you're trying to estimate the maximum length of a .NET encoding conversion. String.Length measures the total number of Chars, which is a count of UTF-16 code units. As a worst-case estimate, therefore, I believe you can safely use count(Char) * 3, because each non-BMP character occupies 2 Chars and yields only 4 bytes as UTF-8.
If you want to work from the number of characters rather than the number of Chars, you should be able to do something like
var maximumUtf8Bytes = new System.Globalization.StringInfo(myString).LengthInTextElements * 4;
Be aware that LengthInTextElements counts text elements, which can each contain several code points (for example a base letter plus combining marks), so this is an estimate rather than a strict upper bound.
(My C# is a bit rusty as I haven't used a .Net environment much in the last few years, but I think that does the trick).
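If what you actually want is the number of Unicode code points, one option is to walk the string and count each surrogate pair once. This is a sketch under that assumption, with hypothetical names; char.IsSurrogatePair is the only framework call it relies on.

```csharp
using System;

// Hypothetical helper: counts Unicode code points in a UTF-16 string.
public static class CodePoints
{
    public static int Count(string s)
    {
        int count = 0;
        for (int i = 0; i < s.Length; i++)
        {
            // A valid high/low surrogate pair is one code point in two Chars.
            if (char.IsSurrogatePair(s, i))
            {
                i++; // skip the low surrogate
            }
            count++;
        }
        return count;
    }
}
```

Multiplying CodePoints.Count(myString) by 4 gives a bound in the same spirit as the one above; unpaired surrogates are simply counted as one code point each here.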

Size of different objects in memory

I have about 100,000 sentences in a List<string>.
I'm trying to split each of these sentences into words and add everything to a List<List<string>>, where the outer list holds one entry per sentence and each inner list holds that sentence's words. I'm doing that because I have to do different work on each individual word. What would be the size difference in memory between a List<string> of sentences and a List<List<string>> of words?
One of these will be kept in memory eventually, so I'm looking for the memory impact of splitting each sentence versus keeping it as a single string.

We'll start with your List<string>. I'm going to assume the 64-bit runtime. Numbers for the 32-bit runtime are slightly smaller.
The List itself requires about 32 bytes (allocation overhead, plus internal variables), plus the backing array of strings. The array overhead is 50 bytes, and you need 8 bytes per string for the references. So if you have 100,000 sentences, you'll need at minimum 800,000 bytes for the array.
The strings themselves require something like 26 bytes each, plus two bytes per character. So if your average sentence is 80 characters, you need 186 bytes per string. Multiplied by 100K strings, that's about 18.5 megabytes. Altogether, your list of sentences will take around 20 MB (round number).
If you split the sentences into words, you now have 100,000 List<string> instances. That's about 5 megabytes just for the List<List<string>>. If we assume 10 words per sentence, then each sentence's list will require about 80 bytes for the backing array, plus 26 bytes per string (total of about 260 bytes), plus the string data itself (8 chars, or 160 bytes total). So each sentence costs you (again, round numbers) 80 + 260 + 160, or 500 bytes. Multiplied by 100,000 sentences, that's 50 MB.
So, very rough numbers, splitting your sentences into a List<List<string>> will occupy 55 or 60 megabytes.

So, first off we'll compare the difference in memory between a single string or two strings which, if concatted together, would result in the first:
string first = "ab";
string second = "a";
string third = "b";
How much memory does first use compared to second and third together? Well, the actual characters that they need to reference are the same, but every single string object has a small overhead (14 bytes on a 32-bit system, 26 bytes on a 64-bit system).
So for each string that you break up into a List<string> of smaller strings there is a 14 * (wordsPerSentence - 1) byte overhead on a 32-bit system.
Then there is the overhead for the list itself. The list will consume one word of memory (32 bits on a 32-bit system, 64 on a 64-bit system, etc.) for each item added to the list, plus the overhead of the List<string> itself (which is 24 bytes on a 32-bit system).
So for that you need to add (on a 32-bit system) (24 + (4 * averageWordsPerSentence)) * numberOfSentences bytes of memory.

Unfortunately, this isn't a question that can be answered very easily -- it depends on the particular strings, and what lengths you're willing to go to in order to optimize.
For example, take a look at the String.Intern() method. If you intern all the words, it's possible that the collection of words will require less memory than the collection of sentences. It would depend on the contents. There are other implications to interning, though, so that might not be the best idea. Again, it would depend on the particulars of the situation -- check the "Performance Considerations" section of the doc page I linked.
I think the best thing to do is to use GC.GetTotalMemory(true) before and after your operation to get a rough idea of how much memory is actually being used.
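The before/after measurement suggested above could look roughly like this. It is a sketch only: the class and method names are made up, splitting on a single space is an assumption about the data, and the number returned is approximate because other allocations in the process also affect GC.GetTotalMemory.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for a rough before/after memory measurement.
public static class MemoryProbe
{
    // Returns the approximate number of bytes retained by whatever build() allocates.
    public static long Measure<T>(Func<T> build, out T result)
    {
        long before = GC.GetTotalMemory(true); // true forces a collection first
        result = build();
        long after = GC.GetTotalMemory(true);
        return after - before;
    }

    // Splits every sentence into a list of words (naive split on spaces).
    public static List<List<string>> SplitAll(List<string> sentences)
    {
        var words = new List<List<string>>(sentences.Count);
        foreach (string sentence in sentences)
        {
            words.Add(new List<string>(sentence.Split(' ')));
        }
        return words;
    }
}
```

Calling MemoryProbe.Measure(() => MemoryProbe.SplitAll(sentences), out var words) on the real data gives a figure to compare against the estimates in the other answers. Keep a reference to the result alive while measuring, otherwise the collector may reclaim it.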

sizeof empty string in C#

In Java, an empty string is 40 bytes. In Python it's 20 bytes. How big is an empty string object in C#? I cannot do sizeof, and I don't know how else to find out. Thanks.

It's 18 bytes:
16 bytes of memory + 2 bytes per character allocated + 2 bytes for the final null character.
Note that this was written about .Net 1.1.
The m_ArrayLength field was removed in .Net 4.0 (you can see this in the reference source)

The CLR version matters. Prior to .NET 4, a string object had an extra 4-byte field that stored the "capacity", m_arrayLength field. That field is no longer around in .NET 4. It otherwise has the standard object header, 4 bytes for the sync-block, 4 bytes for the method table pointer. Then 4 bytes to store the string length (m_stringLength), followed by 2 bytes each for each character in the string. And a 0 char to make it compatible with native code. Objects are always a multiple of 4 bytes long, minimum 16 bytes.
An empty string is thus 4 + 4 + 4 + 2 = 14 bytes, rounded up to 16 bytes on .NET 4.0. 20 bytes on earlier versions. Given values are for x86. This is all very visible in the debugger, check this answer for hints.

Jon Skeet recently wrote a whole article on the subject.
On x86, an empty string is 16 bytes, and on x64 it's 32 bytes

If byte is 8 bit integer then how can we set it to 255?

The byte keyword denotes an integral type that stores values as indicated in the following table. It's an Unsigned 8-bit integer.
If it's only 8 bits then how can we assign it to equal 255?
byte myByte = 255;
I thought 8 bits was the same thing as just one character?

There are 256 different configurations of bits in a byte:
0000 0000
0000 0001
0000 0010
...
1111 1111
So you can assign a byte any value in the 0-255 range.

Characters are described (in a basic sense) by a numeric representation that fits inside an 8 bit structure. If you look at the ASCII Codes for ascii characters, you'll see that they're related to numbers.
The largest integer an n-bit sequence can represent is given by the formula 2^n - 1 (as partially described above by @Marc Gravell), and the number of distinct values is 2^n. So an 8-bit structure can hold 256 values including 0 (also note that an IPv4 address is 4 separate 8-bit values). If this were a signed integer, the first bit would indicate the sign and the remaining 7 would carry the value, so it would still hold 256 values, but the maximum would be determined by the 7 trailing bits (2^7 - 1 = 127) and the minimum, in two's complement, would be -128.
When you get into Unicode characters and "high ascii" characters, the encoding requires more than an 8-bit structure. So in your example, if you were to assign a byte a value of 76, a lookup table could be consulted to derive the ASCII character L.

11111111 (8 on bits) is 255: 128 + 64 + 32 + 16 + 8 + 4 + 2 + 1
Perhaps you're confusing this with 256, which is 2^8?
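A short sketch of the same arithmetic in code; the class and method names are hypothetical, while Convert.ToString(value, 2) is the framework's base-2 formatter.

```csharp
using System;

// Hypothetical helper showing how 8 bits add up to 255.
public static class ByteBits
{
    // Binary text of a byte, left-padded to 8 digits.
    public static string ToBinary(byte value)
    {
        return Convert.ToString(value, 2).PadLeft(8, '0');
    }

    // Adds the place values of all 8 bits: 1 + 2 + 4 + ... + 128.
    public static int AllBitsSet()
    {
        int total = 0;
        for (int bit = 0; bit < 8; bit++)
        {
            total += 1 << bit;
        }
        return total;
    }
}
```

ByteBits.AllBitsSet() returns 255, which equals byte.MaxValue, and ByteBits.ToBinary(255) returns "11111111".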

8 bits (unsigned) is 0 thru 255, or (2^8)-1.
It sounds like you are confusing integer vs text representations of data.

i thought 8 bits was the same thing as just one character?
I think you're confusing the number 255 with the string "255."
Think about it this way: if computers stored numbers internally using characters, how would it store those characters? Using bits, right?
So in this hypothetical scenario, a computer would use bits to represent characters which it then in turn used to represent numbers. Aside from being horrendous from an efficiency standpoint, this is just redundant. Bits can represent numbers directly.

255 = 2^8 − 1 = FF[hex] = 11111111[bin]

The range of values for an unsigned 8-bit integer is 0 to 255, so this is perfectly valid.
8 bits is not the same as one character in C#. In C# a character is 16 bits. And even if a character were 8 bits, it would have no relevance to the main question.

I think you're confusing character encoding with the actual integral value stored in the variable.
An 8-bit value can have 256 configurations (0 to 255), as answered by Arkain.
Optionally, in ASCII, each of those configurations represents a different ASCII character.
So, basically, it depends on how you interpret the value: as an integer value or as a character.
ASCII Table
Wikipedia on ASCII

Sure, a bit late to answer, but for those who get this in a google search, here we go...
Like others have said, a character is definitely different to an integer. Whether it's 8-bits or not is irrelevant, but I can help by simply stating how each one works:
for an 8-bit integer, a value range between 0 and 255 is possible (or -128..127 if it's signed, in which case the first bit decides the sign)
for an 8-bit character, it will most likely be an ASCII character, which is usually referenced by an index given as a hexadecimal value, e.g. FF or 0A. ASCII proper is 7-bit and defines 128 characters; the 8-bit extended sets built on it fill a 16x16 table, i.e. 256 possible characters.
Either way, if the byte is 8 bits long, then either an ASCII code or an 8-bit integer will fit in the variable's data. I would recommend using a different, more dedicated data type though for simplicity (e.g. char for ASCII or raw data, int for integers of any bit length, usually 32-bit).
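To illustrate that the same 8 bits can be read as an unsigned number, a signed number, or a character code, here is a minimal sketch with hypothetical names. It relies only on C# casts.

```csharp
using System;

// Hypothetical helper: reinterprets the bits of a byte in different ways.
public static class BitInterpretation
{
    // Same bit pattern read as a signed 8-bit integer (two's complement).
    public static sbyte AsSigned(byte b)
    {
        return unchecked((sbyte)b);
    }

    // Same value used as a character code (ASCII codes map directly to char).
    public static char AsAsciiChar(byte b)
    {
        return (char)b;
    }
}
```

With this, AsSigned(0xFF) is -1, AsSigned(0x80) is -128, and AsAsciiChar(76) is 'L'.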

ASCII values in hexadecimal notation

I am trying to parse some output data from a PBX and I have found something that I can't really figure out.
In the documentation it says the following
Information for type of call and feature. Eight character for ’status information 3’ with following ASCII values in hexadecimal notation.
1. Character
Bit7 Incoming call
Bit6 Outgoing call
Bit5 Internal call
Bit4 CN call
2. Character
Bit3 Transferred call (transferring party inside)
Bit2 CN-transferred call (transferring party outside)
Bit1
Bit0
Any ideas how to interpret this? I have no raw data at the time to match against but I still need to figure it out.

You'll probably receive two characters (hex digits: 0-9, A-F). The first digit represents the hex value of the most significant 4 bits, the next digit that of the least significant 4 bits.
Example:
You will probably receive something like the string "7C" as hex representation of the bitmap: 01111100.
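Assuming the PBX really does send two hex digits such as "7C" (this is a guess from the documentation, not confirmed by raw data), the string can be turned into a byte and tested bit by bit. The class and method names below are hypothetical.

```csharp
using System;

// Hypothetical helper for reading a two-digit hex string as a bitmap.
public static class HexFlags
{
    // Parses two hex digits (e.g. "7C") into a byte.
    public static byte ParseHexByte(string twoHexDigits)
    {
        return Convert.ToByte(twoHexDigits, 16);
    }

    // True if the given bit (0 = least significant, 7 = most significant) is set.
    public static bool IsBitSet(byte value, int bit)
    {
        return (value & (1 << bit)) != 0;
    }
}
```

For "7C" (01111100), bits 6 through 2 are set and bits 7, 1 and 0 are clear.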

Eight character for ’status information 3’ with following ASCII values in hexadecimal notation.
I think this means the following.
You will get 8 bytes - one byte per line, I guess.
It is just the wrong term. They mean two hex digits per byte but call them characters.
So it is just a byte with bit flags - or more precisely an array of eight such bytes.
Bit
7 incoming
6 outgoing
5 internal
4 CN
3 transferred
2 CN transferred
1 unused?
0 unused?
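You could map this to an enum.
using System;

[Flags] // the attribute is FlagsAttribute, not BitFlags
public enum CallInformation : byte
{
    Undefined = 0,
    CNTransferred = 4,  // bit 2
    Transferred = 8,    // bit 3
    CN = 16,            // bit 4
    Internal = 32,      // bit 5
    Outgoing = 64,      // bit 6
    Incoming = 128      // bit 7
}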
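A sketch of how such a flags enum might be used once real data arrives. Everything here is an assumption for illustration: the enum and parser names are made up, the bit positions are taken from the table in the question, and the input is assumed to be two hex digits per status byte.

```csharp
using System;

// Hypothetical flags enum mirroring the documented bit positions.
[Flags]
public enum CallFlags : byte
{
    None = 0,
    CNTransferred = 1 << 2,
    Transferred = 1 << 3,
    CN = 1 << 4,
    Internal = 1 << 5,
    Outgoing = 1 << 6,
    Incoming = 1 << 7
}

public static class CallFlagsParser
{
    // Converts two hex digits (e.g. "90") into the corresponding flag set.
    public static CallFlags FromHex(string twoHexDigits)
    {
        return (CallFlags)Convert.ToByte(twoHexDigits, 16);
    }
}
```

For example, CallFlagsParser.FromHex("90") yields Incoming combined with CN, and individual flags can then be tested with HasFlag(CallFlags.Incoming).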

Very hard without data. I'd guess that you will get two bytes (two ASCII characters), and need to pick them apart at the bit level.
For instance, if the first character is 'A', you will need to look up its character code (65, or hex 0x41), and then look at the bits. Of course the bits are the same regardless of decimal or hex, but it's easier to do by hand in hex. 0x41 is binary 0100 0001, i.e. bit 6 and bit 0 set, so that would be an "outgoing call". Bit 0 seems undocumented.
I'm not sure why it looks as if that would require two characters; it's only eight bits documented.
