Possible Duplicate:
What is the binary representation of a boolean value in c#
According to the MSDN documentation, the sizeof keyword is "used to obtain the size in bytes for an unmanaged type" and primitives are considered unmanaged types. If I check the sizeof(bool), the result is 1.
It seems to me that storing a Boolean value should only require a single bit of memory. Am I mistaken? Does using a Boolean value actually require a full byte of memory? Why?
It uses a whole byte of memory for performance reasons.
If it used only a single bit, what would you do with the other 7 bits? Few variables are booleans, and other variables cannot use a fraction of a byte, so the spare bits would only be useful for packing in other booleans.
Other variables need more than a single bit, for example 4-byte integers. Also, many larger types need to start at appropriate byte boundaries for performance reasons: a CPU may not allow you to easily reference a 4-byte value starting at an arbitrary address (i.e. the address may need to be divisible by 4).
If it used a single bit of memory, with the other 7 bits available for other booleans, using that boolean would be more complicated. Because it is not directly addressable, you would need to fetch the containing byte and then extract the bit before testing whether it is 1 or 0. That means more instructions, hence slower performance.
If you have many booleans, and you want them to only use a single bit of memory EACH, you should use a BitArray. These are containers for single bits. They act like arrays of booleans.
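As a minimal sketch of that approach (the sizes here are illustrative), a BitArray packs each flag into a single bit:

```csharp
using System;
using System.Collections;

class Program
{
    static void Main()
    {
        // 1000 flags stored in roughly 125 bytes of bits
        // instead of 1000 separate one-byte bools.
        var flags = new BitArray(1000);

        flags[42] = true;              // set a single bit
        Console.WriteLine(flags[42]);  // prints True

        flags.SetAll(false);           // bulk operations touch whole words
    }
}
```

The indexer does the byte-fetch-and-mask work described above for you, which is exactly why a single BitArray access is slower than reading a plain bool.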
A byte is the smallest amount of addressable memory. The .NET team have chosen to use a byte to store a bool to simplify the implementation.
If you want to store a large number of bits more compactly you can look at BitArray.
Yes, it requires a full byte of memory, because a byte is the smallest addressable unit of memory.
It would of course be possible to come up with a scheme where several bools are put in the same byte, thus saving space. In most cases the overhead of such a solution would cost much more than it gains.
If you have a lot of bits to store, a specialised bit vector (such as the BitArray that Mark Byers mentions) can save precious space.
Think of it this way: sizeof reports sizes in whole bytes, so how could it ever say 1 bit? It can't; it would either have to round down to 0 (impossible) or return 1, and it returns 1 because storage is allocated in bytes, not bits.
Whether it's managed internally as a bit or a byte, I don't know.
In C++ you can append :1 to a member name (a bit field) to say it should be just 1 bit wide.
Related
Should primitive array content be accessed by int for best performance?
Here's an example:
int[] arr = new int[] { 1, 2, 3, 4, 5 };
The array is only 5 elements long, so the index doesn't have to be an int; a short or byte would do, and using a byte instead of an int would save 3 bytes of pointless allocation. Of course, that only works if I know the array won't grow beyond 255 elements.
byte index = 1;
int value = arr[index];
But does this work as well as it sounds?
I'm worried about how this executes at a lower level: does the index get cast to int, or are there other operations that would actually slow the whole process down instead of optimizing it?
In C and C++, arr[index] is formally equivalent to *(arr + index). Your concerns about casting should be answerable in terms of the simpler question of what the machine does when it needs to add an integer offset to a pointer.
I think it's safe to say that on most modern machines, when you add a "byte" to a pointer, it's going to use the same instruction as it would if you added a 32-bit integer to a pointer. And indeed it's still going to represent that byte using the machine word size, padded with some unused space. So this isn't going to make using the array faster.
Your optimization might make a difference if you need to store millions of these indices in a table; then using byte instead of int would use 4 times less memory and take less time to move that memory around. If the array you are indexing is huge, and the index needs to be larger than the machine word size, then that's a different consideration. But I think it's safe to say that in most normal situations this optimization doesn't really make sense, and size_t is probably the most appropriate generic type for array indices, all things being equal (since it corresponds exactly to the machine word size on the majority of architectures).
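To illustrate the point with a hypothetical C# sketch: indexing with a byte compiles to the same element access as indexing with an int, because the index is widened first; the narrow type only pays off when you store many indices.

```csharp
using System;

class Program
{
    static void Main()
    {
        int[] arr = { 10, 20, 30, 40, 50 };

        byte index = 3;
        // The byte is widened to a native-size integer before the
        // element address is computed, so this is no faster (and no
        // slower) than indexing with an int.
        int value = arr[index];
        Console.WriteLine(value); // prints 40

        // Where a narrow index type can pay off: a large table of
        // indices stored as bytes takes 1 byte per entry instead of 4.
        byte[] manyIndices = new byte[1000000];
        Console.WriteLine(manyIndices.Length);
    }
}
```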
does index gets casted to int or other operations which would actually slow down the whole process instead of this optimizing it
No, but
that would save useless 3 byte memory allocation
You don't gain anything by saving 3 bytes.
Only if you are storing a huge array of those indices might the amount of space you save make it a worthwhile investment.
Otherwise stick with a plain int, it's the processor's native word size and thus the fastest.
Could somebody help me understand what the most significant byte of a 160-bit (SHA-1) hash is?
I have C# code which calls a cryptography library to calculate a hash code from a data stream. The result I get is a 20-byte C# array. Then I calculate another hash code from another data stream, and then I need to place the hash codes in ascending order.
Now I'm trying to understand how to compare them correctly. Apparently, I need to subtract one from the other and then check whether the result is negative, positive or zero. Technically, I have two 20-byte arrays. Looking at them from the memory perspective, the least significant byte is at the beginning (lower memory address) and the most significant byte at the end (higher memory address). From the human-reading perspective, on the other hand, the most significant byte is at the beginning and the least significant at the end, and if I'm not mistaken this is the order used for comparing GUIDs. Of course, the two approaches give different orderings. Which way is considered the right or conventional one for comparing hash codes? This is especially important in our case because we are thinking about implementing a distributed hash table which should be compatible with existing ones.
You should think of the initial hash as just bytes, not a number. If you're trying to order them for indexed lookup, use whatever ordering is simplest to implement - there's no general purpose "right" or "conventional" here, really.
If you've got some specific hash table you want to be "compatible" with (not even sure what that would mean) you should see what approach to ordering that hash table takes, assuming it's even relevant. If you've got multiple tables you need to be compatible with, you may find you need to use different ordering for different tables.
Given the comments, you're trying to work with Kademlia, which, based on this document, treats the hashes as big-endian numbers:
Kademlia follows Pastry in interpreting keys (including nodeIDs) as bigendian numbers. This means that the low order byte in the byte array representing the key is the most significant byte and so if two keys are close together then the low order bytes in the distance array will be zero.
That's just an arbitrary interpretation of the bytes - so long as everyone uses the same interpretation, it will work... but it would work just as well if everyone decided to interpret them as little-endian numbers.
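For illustration, the big-endian interpretation just means comparing byte 0 first; a sketch of such a comparer (assuming equal-length hashes) might look like:

```csharp
using System;

static class HashCompare
{
    // Compare two equal-length hashes as big-endian unsigned numbers:
    // the byte at index 0 is treated as the most significant.
    public static int Compare(byte[] a, byte[] b)
    {
        if (a.Length != b.Length)
            throw new ArgumentException("Hashes must have the same length.");

        for (int i = 0; i < a.Length; i++)
        {
            if (a[i] != b[i])
                return a[i] < b[i] ? -1 : 1;
        }
        return 0;
    }
}
```

Flipping the loop to start at the last byte would give the little-endian ordering instead; either is internally consistent, as long as every node uses the same one.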
You can use SequenceEqual to compare byte arrays; see the following links for details:
How to compare two arrays of bytes
Comparing two byte arrays in .NET
I need a library which would help me save and query data in a condensed format (a mini DSL in essence). Here's a sample of what I want:
Update 1 - Please note, the figures in the samples below are kept small just to make the logic easier to follow; the real figures are only limited by the capacity of the C# long type, e.g.:
1,18,28,29,39,18456789,18456790,18456792,184567896.
Sample Raw Data set: 1,2,3,8,11,12,13,14
Condensed Sample Data set:
1..3,8,11..14
What would be absolutely nice to have is the ability to represent 1,2,4,5,6,7,8,9,10 as 1..10-3.
Querying Sample Data set:
Query 1 (get range):
1..5 -> 1..3
Query 2 (check if the value exists):
?2 -> true
Query 3 (get multiple ranges and scalar values):
1..5,11..12,14 -> 1..3,11..12,14
I don't want to develop it from scratch and would highly prefer to use something which already exists.
Here are some ideas I've had over the days since I read your question. I can't be sure any of them really apply to your use case but I hope you'll find something useful here.
Storing your data compressed
Steps you can take to reduce the amount of space your numbers take up on disk:
If your values are between 1 and ~10M, don't use a long, use a uint. (4 bytes per number.)
Actually, don't use a uint. Store your numbers 7 bits to a byte, with the remaining bit used to say "there are more bytes in this number". (Then 1-127 will fit in 1 byte, 128-~16k in 2 bytes, ~16k-~2M in 3 bytes, ~2M-~270M in 4 bytes.)
This should reduce your storage from 8 bytes per number (if you were originally storing them as longs) to, say, on average 3 bytes. Also, if you end up needing bigger numbers, the variable-byte storage will be able to hold them.
Then I can think of a couple of ways to reduce it further, given you know the numbers are always increasing and may contain lots of runs. Which works best for you only you can know by trying it on your actual data.
For each of your actual numbers, store two numbers: the number itself, followed by the number of numbers contiguous after it (e.g. 2,3,4,5,6 => 2,4). You'll have to store lone numbers as e.g. 8,0 so will increase storage for those, but if your data has lots of runs (especially long ones) this should reduce storage on average. You could further store "single gaps" in runs as e.g. 1,2,3,5,6,7 => 1,6,4 (unambiguous as 4 is too small to be the start of the next run) but this will make processing more complex and won't save much space so I wouldn't bother.
Or, rather than storing the numbers themselves, store the deltas (so 3,4,5,7,8,9 => 3,1,1,2,1,1). This will reduce the number of bytes used for storing larger numbers (e.g. 15000,15005 (4 bytes) => 15000,5 (3 bytes)). Further, if the data contains a lot of runs (and hence lots of 1 bytes), it will then compress (e.g. zip) nicely.
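A sketch of how the variable-byte and delta ideas above might combine (illustrative code; no end-of-stream handling):

```csharp
using System.Collections.Generic;
using System.IO;

static class VarIntCodec
{
    // Write a value 7 bits at a time, low bits first; the high bit of
    // each byte means "more bytes follow".
    public static void WriteVarUInt(Stream s, ulong value)
    {
        while (value >= 0x80)
        {
            s.WriteByte((byte)((value & 0x7F) | 0x80));
            value >>= 7;
        }
        s.WriteByte((byte)value);
    }

    // Assumes well-formed input; a real implementation would check EOF.
    public static ulong ReadVarUInt(Stream s)
    {
        ulong result = 0;
        int shift = 0;
        int b;
        while ((b = s.ReadByte()) >= 0x80)
        {
            result |= (ulong)(b & 0x7F) << shift;
            shift += 7;
        }
        return result | ((ulong)b << shift);
    }

    // Storing deltas keeps most encoded values small for an
    // increasing sequence, so most of them fit in a single byte.
    public static void WriteDeltas(Stream s, IEnumerable<ulong> sorted)
    {
        ulong previous = 0;
        foreach (ulong v in sorted)
        {
            WriteVarUInt(s, v - previous);
            previous = v;
        }
    }
}
```

Round-tripping a few values through a MemoryStream confirms the sizes: 1 takes one byte, 300 takes two, and the deltas of a dense sorted run are mostly single bytes.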
Handling in code
I'd simply advise you to write a couple of methods that stream a file from disk into an IEnumerable<uint> (or ulong if you end up with bigger numbers), and do the reverse, while handling whatever you've implemented from the above.
If you do this in a lazy fashion - using yield return to return the numbers as you read them from disk and calculate them, and streaming numbers to disk rather than holding them in memory and returning them at once, you can keep your memory usage down whatever the size of the stored data.
(I think, but I'm not sure, that even the GZipStream and other compression streams will let you stream your data without having it all in memory.)
Querying
If you're comparing two of your big data sets, I wouldn't advise using LINQ's Intersect method as it requires reading one of the sources completely into memory. However, as you know both sequences are increasing, you can write a similar method that needs only hold an enumerator for each sequence.
If you're querying one of your data sets against a user-input, small list of numbers, you can happily use LINQ's Intersect method as it is currently implemented, as it only needs the second sequence to be entirely in memory.
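Such a streaming intersection of two increasing sequences might look like this (a sketch; it holds only one enumerator and one current element per sequence):

```csharp
using System.Collections.Generic;

static class SortedSequences
{
    // Intersect two strictly increasing sequences lazily: only one
    // element of each sequence is held in memory at a time.
    public static IEnumerable<ulong> Intersect(
        IEnumerable<ulong> first, IEnumerable<ulong> second)
    {
        using (var a = first.GetEnumerator())
        using (var b = second.GetEnumerator())
        {
            if (!a.MoveNext() || !b.MoveNext())
                yield break;

            while (true)
            {
                if (a.Current == b.Current)
                {
                    yield return a.Current;
                    if (!a.MoveNext() || !b.MoveNext())
                        yield break;
                }
                else if (a.Current < b.Current)
                {
                    if (!a.MoveNext()) yield break;   // advance the smaller side
                }
                else
                {
                    if (!b.MoveNext()) yield break;
                }
            }
        }
    }
}
```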
I'm not aware of any off-the-shelf library that does quite what you want, but I'm not sure you need one.
I suggest you consider using the existing BitArray class. If, as your example suggests, you're interested in compressing sets of small integers, then a single BitArray with, say, 256 bits could represent any set of integers in the range [0..255]. Of course, if your typical set has only 5 integers in it, then this approach would actually expand your storage requirements; you'll have to figure out the right size for such arrays from your own knowledge of your sets.
I'd suggest also looking at your data as sets of integers, so your example 1,2,3,8,11,12,13,14 would be represented by setting the corresponding bits in a BitArray. Your query operations then reduce to intersection between a test BitArray and your data BitArray.
Incidentally, I think your Query 2, which transforms ?2 -> true, would be better staying in the domain of functions that map sets of integers to sets of integers, i.e. it should transform 2 -> 2. If you want a boolean, write a different method that returns one.
I guess you'd need to write code to pack integers into BitArrays and to unpack BitArrays into integers, but that's part of the cost of compression.
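Putting those pieces together, a sketch of the set-as-BitArray idea using the sample data from the question (the 256-bit capacity is illustrative):

```csharp
using System;
using System.Collections;

class Program
{
    static BitArray ToBitSet(int[] values, int capacity)
    {
        var bits = new BitArray(capacity);
        foreach (int v in values)
            bits[v] = true;      // pack each integer as one set bit
        return bits;
    }

    static void Main()
    {
        var data  = ToBitSet(new[] { 1, 2, 3, 8, 11, 12, 13, 14 }, 256);
        var query = ToBitSet(new[] { 1, 2, 3, 4, 5 }, 256);

        // Intersection reduces to a bitwise AND (And mutates 'query').
        BitArray result = query.And(data);

        for (int i = 0; i < result.Length; i++)
            if (result[i])
                Console.Write(i + " ");  // prints 1 2 3
    }
}
```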
What I'm trying to do:
I want to store a large amount of data in RAM. For faster access and a smaller memory footprint I need to use an array of struct values:
MyStruct[] myStructArray = new MyStruct[10000000000];
Now I want to store unsigned integer values of one, two, three or four bytes in MyStruct, but it should use the least possible amount of memory: when I store a value that fits in one byte, it should use only one byte, and so on.
I could implement this with classes, but that is inappropriate here because the pointer to each object would need 8 bytes on a 64-bit system. So it would be better to store just 4 bytes for every array entry; but I want to use only one/two/three bytes when that is enough, so I can't use any of the fancy classes.
I also can't use one array with one-byte values, one array with two-byte values and so on, because I need the values in a specific order, and the sizes are thoroughly mixed, so storing an additional reference for when to switch to another array would not help.
Is what I want possible, or is the only way to store an array of 4-byte uints even though one or two bytes would suffice about 60% of the time and three bytes about 25% of the time?
This is not possible. How would the CLR process the following expression?
myStructArray[100000]
If the elements are of variable size, the CLR cannot know the address of the 100000th element. Therefore array elements are of fixed size, always.
If you don't require O(1) access, you can implement variable-length elements on top of a byte[] and search the array yourself.
You could split the list into 1000 sublists, each packed individually. That way you get O(n/2000) search performance on average. Maybe that is good enough in practice.
A "packed" array can only be searched in O(n/2) on average, but if your partial arrays are 1/1000th the size, the search becomes O(n/2000). You can pick the right partial array in O(1) because they all hold the same number of elements.
Also, you can adjust the number of partial arrays so that each is about 1k elements in size. At that point the overhead of the array objects and the references to them vanishes. That gives you O(1000/2 + 1) lookup performance, which I think is quite an improvement over O(n/2): it is a constant-time lookup (with a big constant).
You could get close to what you want if you are willing to sacrifice some additional CPU time and waste an additional 2 or 4 bits per stored value.
You could use a byte[] combined with a BitArray. In the byte[] you sequentially store the one-, two-, three- or four-byte values, and in the BitArray you denote, either in binary form (pairs of bits) or simply with a bit set to 1, where a new set of bytes starts (or ends, however you implement it) in your data array.
You would then get something like this in memory:
byte[] --> [byte][byte][byte][byte][byte][byte][byte]...
BitArray --> 1001101...
This means you have 3-byte, 1-byte, 2-byte etc. values stored in your byte array.
Alternatively, you could encode your BitArray as binary pairs to make it even smaller. This means you would waste somewhere between 1.0625 and 1.25 bytes per actual data byte.
Whether this suffices depends on your actual data (your MyStruct). If you need to distinguish which values in your struct those bytes correspond to, you could spend some additional bits in the BitArray.
Update for your O(1) requirement:
Use another index structure which stores one byte offset for each N elements, for example every 1000. You could then access the item with index 234241 as
indexStore[234241/1000]
which gives you the position of element 234000; you then find the exact position of element 234241 by examining those few hundred entries in the BitArray.
O(const) is achieved this way; the constant can be controlled with the density of the main index. Of course, you trade time for space.
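One way these pieces might fit together, as a rough sketch (all names are mine, not an existing API): values packed back to back in a byte[], a BitArray marking the first byte of each value, and a sparse index caching the offset of every 1000th element:

```csharp
using System.Collections;
using System.Collections.Generic;

// Sketch: little-endian bytes packed back to back; starts[i] is true
// when data[i] begins a value; blockOffsets caches the byte offset of
// every BlockSize-th element so a lookup scans at most one block.
class PackedUIntArray
{
    const int BlockSize = 1000;

    readonly byte[] data;
    readonly BitArray starts;
    readonly List<int> blockOffsets = new List<int>();
    public int Count { get; private set; }

    public PackedUIntArray(IEnumerable<uint> values)
    {
        var bytes = new List<byte>();
        var marks = new List<bool>();
        foreach (uint v in values)
        {
            if (Count % BlockSize == 0)
                blockOffsets.Add(bytes.Count);
            uint x = v;
            bool first = true;
            do
            {
                bytes.Add((byte)x);   // store only as many bytes as needed
                marks.Add(first);
                first = false;
                x >>= 8;
            } while (x != 0);
            Count++;
        }
        data = bytes.ToArray();
        starts = new BitArray(marks.ToArray());
    }

    public uint this[int index]
    {
        get
        {
            // Jump to the cached block start, then walk the start marks.
            int offset = blockOffsets[index / BlockSize];
            for (int skip = index % BlockSize; skip > 0; )
            {
                offset++;
                if (starts[offset]) skip--;
            }
            // Read bytes until the next value starts (or data ends).
            uint result = 0;
            int shift = 0;
            do
            {
                result |= (uint)data[offset] << shift;
                shift += 8;
                offset++;
            } while (offset < data.Length && !starts[offset]);
            return result;
        }
    }
}
```

A lookup jumps to the nearest cached offset and scans at most one block of mark bits, giving the constant-time (with a big constant) behaviour described above.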
You can't do it.
If the data isn't sorted, and there is nothing more you can say about it, then you are not going to be able to do what you want.
Simple scenario:
array[3]
It should point to some memory address. But how would you know the sizes of array[0]-array[2]? To store that information in a way that keeps O(1) access, you would waste MORE memory than you want to save in the first place.
You are thinking outside the box, and that's great. But my guess is that this is the wrong box you are trying to get out of. If your data is really random and you want direct access to every array member, you'll have to use the MAXIMUM width your numbers need, for every number. Sorry.
I had a similar situation once, with numbers shorter than 32 bits that I needed to store. But they were all of fixed width, so I was able to solve it with a custom container and some bit shifting.
HOPE:
http://www.dcc.uchile.cl/~gnavarro/ps/spire09.3.pdf
Maybe after reading it you'll be able to have not just 8, 16, 24 or 32 bits per number, but ANY number size...
I'd almost start looking at some variant of short-word encoding like a PkZip program.
Or even RLE encoding.
Or try to understand the usage of your data better. For example, if these are all vectors, then certain combinations may be disallowed: -1,-1,-1 might be meaningless to a financial graphing application because it denotes data outside the graphable range. If you can find such oddities in your data, you may be able to reduce its size by using different structures for different needs.
Just to avoid reinventing the wheel, I am asking here...
I have an application with lots of arrays, and it is running out of memory.
So the thought is to compress the List<int> to something else, that would have same interface (IList<T> for example), but instead of int I could use shorter integers.
For example, if my value range is 0 - 1,000,000 I need only log2(1,000,000) ≈ 20 bits. So instead of storing 32 bits, I can trim the excess and reduce memory requirements by 12/32 = 37.5%.
Do you know of an implementation of such an array? C++ and Java would also be OK, since I could easily convert them to C#.
Additional requirements (since everyone is trying to talk me OUT of the idea):
integers in the list ARE unique
they have no special properties, so they aren't compressible in any way other than reducing the bit count
if the value range is one million, for example, lists would be from 2 to 1000 elements in size, but there will be plenty of them, so no BitSets
the new data container should behave like a resizable array (regarding the O()-ness of its methods)
EDIT:
Please don't tell me NOT to do it. The requirement is well thought out, and this is the ONLY option left.
Also, the 1M value range and 20 bits for it are ONLY AN EXAMPLE. I have cases with all sorts of ranges and integer sizes.
Also, I could have even shorter integers, for example 7-bit integers; the packing would then be
00000001
11111122
22222333
33334444
444.....
for the first five elements (0-4), packed into 5 bytes.
Almost done coding it - will be posted soon...
Since you can only allocate memory in byte quanta, you are essentially asking whether (and how) you can fit the integers in 3 bytes instead of 4 (but see point 3 below). This is not a good idea.
Since there is no 3-byte-sized integer type, you would need to use something else (e.g. an opaque 3-byte buffer) in its place. This would require wrapping all access to the contents of the list in code that performs the conversion, so that you can still put "ints" in and pull "ints" out.
Depending on both the architecture and the memory allocator, requesting 3-byte chunks might not affect the memory footprint of your program at all (it might simply litter your heap with unusable 1-byte "holes").
Reimplementing the list from scratch to work with an opaque byte array as its backing store would avoid the two previous issues (and it can also let you squeeze every last bit of memory instead of just whole bytes), but it's a tall order and quite prone to error.
You might want instead to try something like:
Not keeping all this data in memory at the same time. At 4 bytes per int, you'd need hundreds of millions of integers before memory runs out. Why do you need all of them at once?
Compressing the dataset by not storing duplicates if possible. There are bound to be a few of them if you are up to hundreds of millions.
Changing your data structure so that it stores differences between successive values (deltas), if that is possible. This might not be very hard to achieve, but you can only realistically expect something in the ballpark of a 50% improvement (which may not be enough), and it will totally destroy your ability to index into the list in constant time.
One option that will get you from 32 bits down to 24 bits is to create a custom struct that stores an integer inside 3 bytes:
public struct Entry {
    byte b1; // low
    byte b2; // middle
    byte b3; // high

    public void Set(int x) {
        b1 = (byte)x;
        b2 = (byte)(x >> 8);
        b3 = (byte)(x >> 16);
    }

    public int Get() {
        return (b3 << 16) | (b2 << 8) | b1;
    }
}
You can then just create a List<Entry>.
var list = new List<Entry>();
var e = new Entry();
e.Set(12312);
list.Add(e);
Console.WriteLine(list[0].Get()); // outputs 12312
This reminds me of base64 and similar kinds of binary-to-text encoding.
They take 8 bit bytes and do a bunch of bit-fiddling to pack them into 4-, 5-, or 6-bit printable characters.
This also reminds me of the Zork Standard Code for Information Interchange (ZSCII), which packs 3 letters into 2 bytes, where each letter occupies 5 bits.
It sounds like you want to take a bunch of 10- or 20-bit integers and pack them into a buffer of 8-bit bytes.
The source code is available for many libraries that handle a packed array of single bits (a, b, c, d, e).
Perhaps you could (a) download that source code and modify it (starting from some BitArray or other packed encoding), recompiling to create a new library that handles packing and unpacking 10- or 20-bit integers rather than single bits.
It may take less programming and testing time to (b) write a library that, from the outside, appears to act just like (a), but internally breaks each 20-bit integer up into 20 separate bits and stores them using an (unmodified) BitArray class.
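As an illustration of direction (a), a fixed-width packed array over a uint[] backing store might look like this (a hypothetical sketch, supporting element widths up to 31 bits):

```csharp
// Sketch of a fixed-width packed array: every element occupies
// exactly 'bitsPerValue' bits inside a uint[] backing store.
class PackedBitArray
{
    readonly uint[] store;
    readonly int bits;
    readonly uint mask;

    public PackedBitArray(int length, int bitsPerValue)
    {
        bits = bitsPerValue;                       // must be 1..31
        mask = (1u << bitsPerValue) - 1;
        store = new uint[(length * (long)bitsPerValue + 31) / 32];
    }

    public uint this[int index]
    {
        get
        {
            long bitPos = (long)index * bits;
            int word = (int)(bitPos / 32), shift = (int)(bitPos % 32);
            // Combine two adjacent words so a straddling value reads cleanly.
            ulong two = store[word]
                | ((word + 1 < store.Length ? (ulong)store[word + 1] : 0UL) << 32);
            return (uint)((two >> shift) & mask);
        }
        set
        {
            long bitPos = (long)index * bits;
            int word = (int)(bitPos / 32), shift = (int)(bitPos % 32);
            // Clear then set the bits that fall in the first word.
            store[word] = (store[word] & ~(mask << shift)) | ((value & mask) << shift);
            int spill = shift + bits - 32;   // bits overflowing into the next word
            if (spill > 0)
            {
                uint highMask = (1u << spill) - 1;
                store[word + 1] = (store[word + 1] & ~highMask)
                    | ((value & mask) >> (bits - spill));
            }
        }
    }
}
```

For 20-bit values this stores 1000 elements in 2500 bytes instead of 4000, while keeping O(1) indexed access.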
Edit: Given that your integers are unique, you could do the following: store the integers themselves until the number you're storing reaches half the maximum count, then switch to storing the integers you don't have. This will reduce the storage space by up to 50%.
Might be worth exploring other simplification techniques before trying to use 20-bit ints.
How do you treat duplicate integers? If you have lots of duplicates, you could reduce the storage size by keeping the integers in a Dictionary<int, int> where the keys are unique integers and the values are the corresponding counts. Note this assumes you don't care about the order of your integers.
Are your integers all unique? Perhaps you're storing lots of unique integers in the range 0 to 100 million. In that case you could try storing the integers you don't have; then, to determine whether you have an integer i, just ask whether it's absent from your collection.
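The "store what you don't have" idea can be sketched in a few lines (illustrative values; the range here is tiny to keep it readable):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class Program
{
    static void Main()
    {
        int range = 10;                                  // values are 0..9
        var present = new[] { 0, 1, 2, 4, 5, 6, 7, 9 };  // 8 of 10 values

        // More than half the range is present, so store the
        // complement instead: 2 integers rather than 8.
        var absent = new HashSet<int>(
            Enumerable.Range(0, range).Except(present));

        bool Has(int v) => !absent.Contains(v);

        Console.WriteLine(Has(4)); // prints True
        Console.WriteLine(Has(3)); // prints False
    }
}
```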