For the purpose of learning, I'm trying to understand how C# strings are internally stored in memory.
According to this blog post, the size of a C# string (on x64 with .NET Framework 4.0) is:
26 + 2 * length
A string with a single character will take 26 + 2 * 1 = 28 bytes, rounded up to the next multiple of 8: 32 bytes.
This is indeed similar to what I measured.
What puzzles me is what is in that 26-byte overhead.
I have run the following code and inspected the memory:
string abc = "abcdeg";
string aaa = "x";
string ccc = "zzzzz";
AFAIK those blocks are the following :
Green : Sync block (8 bytes)
Cyan : Type info (8 bytes)
Yellow : Length (4 bytes)
Pink : The actual characters : 2 bytes per char + 2 bytes for NULL terminator.
Look at the "x" string. It is indeed 32 bytes (as calculated).
Anyway, it looks like the end of the string is padded with zeroes.
The "x" string could end right after the two bytes for the NULL terminator and still be memory aligned (thus being 24 bytes).
Why do we need an extra 8 bytes?
I have observed similar results with other (bigger) string sizes.
It looks like there is always an extra 8 bytes.
As Hans Passant suggested, there is an extra field added at the end of the string object which is 4 bytes (on x64, it might require another 4 bytes of padding).
So in the end we have :
= 8 (sync) + 8 (type) + 4 (length) + 4 (extra field) + 2 (null terminator) + 2 * length
= 26 + 2 * length
So Jon Skeet's blog post was right (how could it be wrong?).
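If you want to sanity-check the formula yourself, here is a rough sketch (my own, not from the blog post) that estimates the per-string allocation by allocating a large batch of identical strings and dividing the memory delta by the count; GC.GetTotalMemory is not a precise tool, so treat the output as an approximation:
const int count = 1000000;
var keep = new string[count];
long before = GC.GetTotalMemory(true);
for (int i = 0; i < count; i++)
    keep[i] = new string('x', 1);                    // 1-character strings
long after = GC.GetTotalMemory(true);
Console.WriteLine((after - before) / (double)count); // roughly 32 on x64 / .NET 4.0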
"encryptedHelloWorld=" ==
Convert.ToBase64String(
Convert.FromBase64String("encryptedHelloWorld="))
This statement returns false
Convert.ToBase64String(Convert.FromBase64String("encryptedHelloWorld="))
return "encryptedHelloWorlc="
Any idea why?
The original value of encryptedHelloWorld= is not correctly base-64 encoded.
The last "d" contains an extra bit that is ignored on extraction in this context, where it occurs immediately before padding. A stricter base-64 decoder could validly throw an error.
Minimal failing input cases include rld= or abq=. Only the last portion with the padding is relevant, as discussed below.
Consider that each base-64 output character represents 6 bits.
Thus the information encoded in rld= is:
r - 6 bits
l - 6 bits
d - 6 bits (4 relevant = "c" + 2 extra)
= - n/a
This must be extracted into 2 bytes (8 + 8 = 16 bits).
However, 6 + 6 + 6 = 18 bits and is not a multiple of 8. There are 2 extra bits which differentiate "c" from "d" in the initial base-64 value which do not reflect the actual encoded information.
During the decoding, the .NET decoder implementation silently drops the two extra bits in the "d", as they have nowhere to go. (This is also true for cases like abq= as "q" > "c"; note that capital letters are ordered first in the base-64 output space so "Q" < "c".)
In the normal case without padding, every 4 base-64 characters decode evenly into 3 bytes, which is why this particular issue only appears at the end of a base-64 string whose length (excluding padding characters) is not an even multiple of 4 base-64 characters.
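A minimal snippet to see this in action (just the round trip from the question):
string original = "encryptedHelloWorld=";
byte[] decoded = Convert.FromBase64String(original);  // the 2 stray bits in the final 'd' are dropped here
string reencoded = Convert.ToBase64String(decoded);
Console.WriteLine(reencoded);                          // encryptedHelloWorlc=
Console.WriteLine(original == reencoded);              // False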
We are rewriting some applications previously developed in Visual FoxPro, redeveloping them in .NET (using C#).
Here is our scenario:
Our application uses smartcards. We read data from a smartcard which has a name and a number. The name comes back OK as readable text, but the number, in this case '900', comes back as a 2-byte character representation (131 & 132) and looks like this - ƒ„
Those 2 special characters can be seen in the extended ASCII table. Now, as you can see, the 2 bytes are 131 and 132, and they can vary, as there is no single standard extended ASCII table (as far as I can tell from reading some of the posts on here).
So... the smartcard was previously written to using the BINTOC function in VFP, and therefore the 900 was written to the card as ƒ„. Within FoxPro those 2 special characters can be converted back to integer format using the CTOBIN function, another built-in FoxPro function.
So (finally getting to the point) - so far we have been unable to convert those 2 special characters back to an int (900), and we are wondering if it is possible in .NET to read the character representation of an integer back into an actual integer.
Or is there a way to rewrite the logic of those 2 VFP functions in C#?
UPDATE:
After some fiddling we realised that to get 900 into 2 bytes we need to convert 900 into a 16-bit binary value, and then convert that 16-bit binary value back to decimal.
So, as above, we are receiving back 131 and 132, whose binary values are 10000011 (decimal 131) and 10000100 (decimal 132).
When we concatenate these 2 values into '1000001110000100' it gives the decimal value 33668; however, if we remove the leading 1 and convert '000001110000100' to decimal, it gives the correct value of 900...
Not too sure why this is, though...
Any help would be appreciated.
It looks like VFP is storing your value as a signed 16-bit (short) integer. It seems to have a strange changeover point for the negative numbers, but it adds 128 to 8-bit numbers and 32768 to 16-bit numbers.
So converting your 16-bit numbers from the string should be as easy as reading the value as a 16-bit integer and then subtracting 32768 from it. If you have to do this manually, multiply the first number by 256 and add the second number to get the stored value. Then take 32768 away from this number to get your value.
Examples:
131 * 256 = 33536
33536 + 132 = 33668
33668 - 32768 = 900
You could try using the C# conversions described at http://msdn.microsoft.com/en-us/library/ms131059.aspx and http://msdn.microsoft.com/en-us/library/tw38dw27.aspx to do at least some of the work for you, but if not, it shouldn't be too hard to code the above manually.
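For what it's worth, here is a small sketch of the manual route described above (assuming the two bytes arrive high-order byte first, as in your example):
byte b1 = 131, b2 = 132;       // the two bytes read from the card
int stored = b1 * 256 + b2;    // 33668
int value = stored - 32768;    // 900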
It's a few years late, but here's a working example.
// Note: Select/ToArray below need "using System.Linq;".
public ulong CharToBin(byte[] s)
{
    if (s == null || s.Length < 1 || s.Length > 8)
        return 0ul;
    // Treat the bytes as an unsigned integer stored least significant byte first
    // ("reversed" binary, the format older VFP versions wrote).
    var v = s.Select(c => (ulong)c).ToArray();
    var result = 0ul;
    var multiplier = 1ul;
    for (var i = 0; i < v.Length; i++)
    {
        if (i > 0)
            multiplier *= 256ul; // each position is worth 256 times the previous one
        result += v[i] * multiplier;
    }
    return result;
}
This is a VFP 8 and earlier equivalent for CTOBIN, which covers your scenario. You should be able to write your own BINTOC based on the code above. VFP 9 added support for multiple options like non-reversed binary data, currency and double data types, and signed values. This sample only covers reversed unsigned binary like older VFP supported.
Some notes:
The code supports 1, 2, 4, and 8-byte values, which covers all unsigned numeric values up to System.UInt64.
Before casting the result down to your expected numeric type, you should verify the ceiling. For example, if you need an Int32, then check the result against Int32.MaxValue before you perform the cast.
The sample avoids the complexity of string encoding by accepting a byte array. You would need to understand which encoding was used to read the string, then apply that same encoding to get the byte array before calling this function. In the VFP world, this is frequently Encoding.ASCII, but it depends on the application.
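As a hypothetical usage example (the byte values and ordering below are illustrative only: CharToBin expects the least significant byte first, and the 32768 offset comes from the other answer's observation about how VFP stores signed 16-bit values):
byte[] raw = { 132, 131 };         // example only: card bytes, least significant first
ulong decoded = CharToBin(raw);    // 132 + 131 * 256 = 33668
int value = (int)decoded - 32768;  // 900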
In Java, an empty string is 40 bytes. In Python it's 20 bytes. How big is an empty string object in C#? I cannot do sizeof, and I don't know how else to find out. Thanks.
It's 18 bytes:
16 bytes of memory + 2 bytes per character allocated + 2 bytes for the final null character.
Note that this was written about .Net 1.1.
The m_ArrayLength field was removed in .Net 4.0 (you can see this in the reference source)
The CLR version matters. Prior to .NET 4, a string object had an extra 4-byte field that stored the "capacity" (the m_arrayLength field). That field is no longer around in .NET 4. Otherwise it has the standard object header: 4 bytes for the sync-block and 4 bytes for the method table pointer. Then 4 bytes to store the string length (m_stringLength), followed by 2 bytes for each character in the string. And a 0 char to make it compatible with native code. Objects are always a multiple of 4 bytes long, minimum 16 bytes.
An empty string is thus 4 + 4 + 4 + 2 = 14 bytes, rounded up to 16 bytes on .NET 4.0. 20 bytes on earlier versions. Given values are for x86. This is all very visible in the debugger, check this answer for hints.
Jon Skeet recently wrote a whole article on the subject.
On x86, an empty string is 16 bytes, and on x64 it's 32 bytes
I want to create an ASCII string which will have a number of fields. For example:
string s = f1 + "|" + f2 + "|" + f3;
f1, f2, f3 are fields and "|"(pipe) is the delimiter. I want to avoid this delimiter and keep the field count at the beginning like:
string s = f1.Length + f2.Length + f3.Length + f1 + f2 + f3;
All lengths are going to be packed into 2 chars, so the maximum length is 99 in this case. I was wondering if I could pack the length of each field into 2 bytes by extracting the bytes out of a short. This would allow me a range of 0-65535 using only 2 bytes. E.g.
short length = 20005;
byte b1 = (byte)length;
byte b2 = (byte)(length >> 8);
// Save bytes b1 and b2
// Read bytes b1 and b2
short length = 0;
length = b2;
length = (short)(length << 8);
length = (short)(length | b1);
// Now length is 20005
What do you think about the above code, Is this a good way to keep the record lengths?
I cannot see what you are trying to achieve. short, aka Int16, is 2 bytes, yes, so you can happily use it. But creating a string does not make sense.
short sh = 20005; // 2 bytes
I believe you mean being able to output the short to a stream. There are ways to do this:
BinaryWriter.Write(sh) which writes 2 bytes straight to the stream
BitConverter.GetBytes(sh) which gives you bytes of a short
Reading back you can use the same classes.
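For example, a quick sketch using a MemoryStream (needs System and System.IO; the same calls work against a file or network stream):
short length = 20005;
// Stream-based, via BinaryWriter / BinaryReader:
var ms = new MemoryStream();
var writer = new BinaryWriter(ms);
writer.Write(length);                      // writes exactly 2 bytes
writer.Flush();
ms.Position = 0;
var reader = new BinaryReader(ms);
Console.WriteLine(reader.ReadInt16());     // 20005
// Or without a stream, via BitConverter:
byte[] bytes = BitConverter.GetBytes(length);       // 2 bytes, platform endianness
Console.WriteLine(BitConverter.ToInt16(bytes, 0));  // 20005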
If you want ASCII, i.e. "00" as characters, then just:
byte[] bytes = Encoding.ASCII.GetBytes(length.ToString("00"));
or you could optimise it if you want.
But IMO, if you are storing 0-99, 1 byte is plenty:
byte b = (byte)length;
If you want the range 0-65535, then just:
bytes[0] = (byte)length;
bytes[1] = (byte)(length >> 8);
or swap index 0 and 1 for endianness.
But if you are using the full range (of either a single or a double byte), then it isn't ASCII, nor is it really a string. Anything that tries to read it as a string might fail.
Whether it's a good idea depends on the details of what it's for, but it's not likely to be good.
If you do this then you're no longer creating an "ASCII string". Those were your words, but maybe you don't really care whether it's ASCII.
You will sometimes get bytes with a value of 0 in your "string". If you're handling the strings with anything written in C, this is likely to cause trouble. You'll also get all sorts of other characters -- newlines, tabs, commas, etc. -- that may confuse software that's trying to work with your data.
The original plan of separating with (say) | characters will be more compact and easier for humans and software to read. The only obvious downsides are (1) you can't allow field values with a | in (or else you need some sort of escaping) and (2) parsing will be marginally slower.
If you want to get clever, you could pack your 2 bytes into 1: where the value is <= 127 you use 1 byte, and if the value is >= 128 you use 2 bytes instead. This technique loses you 1 bit per byte that you use, but if you normally have small values and only occasionally have larger ones, it dynamically grows to accommodate the value.
All you need to do is mark bit 8 with a value indicating that the 2nd byte is required to be read.
If bit 8 of the active byte is not set, it means you have completed your value.
E.g. if you have a value of 4, you use this:
|8|7|6|5|4|3|2|1|
|0|0|0|0|0|1|0|0|
If you have a value of 128, you read the 1st byte and check whether bit 8 is high, then read the remaining 7 bits of the 1st byte; then you do the same with the 2nd byte, shifting its 7 bits left by 7 bits.
|BYTE 0 |BYTE 1 |
|8|7|6|5|4|3|2|1|8|7|6|5|4|3|2|1|
|1|0|0|0|0|0|0|0|0|0|0|0|0|0|0|1|
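Here is a minimal sketch of that scheme for values up to 16383 (two bytes of 7 data bits each); it is the same idea .NET itself uses for 7-bit encoded ints, and the method names are mine:
static byte[] EncodeLength(ushort value)
{
    if (value <= 127)
        return new[] { (byte)value };          // bit 8 clear: one byte is enough
    return new[]
    {
        (byte)(0x80 | (value & 0x7F)),         // low 7 bits, with bit 8 set: a 2nd byte follows
        (byte)(value >> 7)                     // remaining 7 bits
    };
}
static ushort DecodeLength(byte[] bytes)
{
    var value = (ushort)(bytes[0] & 0x7F);     // 7 data bits from the 1st byte
    if ((bytes[0] & 0x80) != 0)                // continuation bit set?
        value |= (ushort)(bytes[1] << 7);      // next 7 bits, shifted left by 7
    return value;
}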
How many bits is a .NET string that's 10 characters in length? (.NET strings are UTF-16, right?)
On 32-bit systems:
4 bytes = Type pointer (Every object has one of these)
4 bytes = Lock (One of these too!)
4 bytes = Length (Need the length)
2 * Length bytes = Data (And the chars themselves)
=======================
12 + 2*Length bytes
=======================
96 + 16*Length bits
So 10 chars would = 256 bits = 32 bytes
I am not sure if the Lock grows to 64-bit on 64-bit systems. I kinda hope not, but you never know. The 64-bit structure overhead is therefore anywhere from 16-20 bytes (as opposed to the 12 bytes on 32-bit).
Every char in the string is two bytes in size, so if you are just converting the chars directly and not using any particular encoding, the answer is string.Length * 2 * 8
Otherwise the result depends on the encoding; you can write:
int numbits = System.Text.Encoding.UTF8.GetByteCount(str)*8; // returns 80 for a 10-char ASCII-only string
or
int numbits = System.Text.Encoding.Unicode.GetByteCount(str)*8; // returns 160
If you are talking pure UTF-16 then:
10 characters = 20 bytes = 160 bits
This really needs a context in order to be answered properly.
It all comes down to how you define a character and how you store the data.
For example, if you define a character as a single letter from the user's point of view, it can be more than 2 bytes. For example, the character Å can be encoded as two Unicode code points (U+0041 U+030A, Latin Capital A + Combining Ring Above), so it will require two .NET chars, or 4 bytes in UTF-16.
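A quick illustration of that point (just char counting, nothing framework-specific):
string a = "A\u030A";                                            // 'A' + combining ring above, displays as Å
Console.WriteLine(a.Length);                                     // 2 .NET chars
Console.WriteLine(System.Text.Encoding.Unicode.GetByteCount(a)); // 4 bytes in UTF-16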
Now even if you are talking about 10 .NET Char elements, if it's in memory you have some object overhead (already mentioned) plus a bit of alignment overhead (on a 32-bit system everything has to be aligned to a 4-byte boundary; in 64-bit the rules are more complicated), so you may have some empty bytes at the end.
If you are talking about databases or files, then each database and file system has its own overhead.