When creating a long[] in C#, the 2 GiB size limit for any single object in the CLR leads me to expect it to hold at most 2 GiB / 8 bytes (64 bits) = 268,435,456 elements. However, the maximum number of elements the array can actually hold before throwing an exception is 268,435,448. Also, a long[][] can hold multiple long[]s of the above size, making it substantially larger than 2 GiB in total. My questions are:
Where did those 64 bytes go that cannot be allocated? What are they used for by the CLR?
Why can a two-dimensional array be larger than 2 GiB?
Where did those 64 bytes go that cannot be allocated? What are they used for by the CLR?
Part of it goes to the object header (the sync block and the method table pointer, two pointer-sized fields) and part to the array's length data. Also, a few pointers are possibly used by the managed heap itself, because an object that big requires a separate chunk of heap.
Why can a two-dimensional array be larger than 2 GiB?
Because it is not a single CLR object. Every inner array is a separate object limited to 2GB, and the outer array only holds references to the inner arrays.
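For illustration, here's a minimal sketch of that idea; the inner length is taken from the question, and actually running it needs a 64-bit process with several GiB of free memory:
const int innerLength = 268435448;        // just under the per-object limit observed above
long[][] outer = new long[4][];           // the outer array only stores 4 references
for (int i = 0; i < outer.Length; i++)
{
    outer[i] = new long[innerLength];     // each inner array is a separate ~2 GiB object
}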
Related
Let's just say I'm running a physics simulation that uses integers as vertices of a model. In this simulation I load arrays of integers into a list, since the number of vertices may vary; like so:
List<int[]> x = new List<int[]>();
x.Add(new int[1]);
I know it's a bit overboard to consider using 2 GB worth of integers, but the model could range anywhere from a single object to an entire open field. So, assuming this process is repeated enough to take up 2 GB, would each element/array get its own 2 GB limit as its own object, or does the entire list still count as the same object?
The list is an object, the backing array inside the list (T[], so: int[][]) is an object, and each int[] array is (separately) an object. As long as no individual array is too large, you're OK. At no point are the arrays in a List<some array> treated as contiguous, so it doesn't matter if their combined length exceeds the 2 GiB limit.
Note that you can enable very-large-object support in your configuration (<gcAllowVeryLargeObjects>) to squeeze out a slightly larger array limit - for most arrays (not bytes/single-byte elements) it changes the maximum element count to 2,146,435,071 - which is ~8 GiB in your case (int[]). That doesn't necessarily mean it is a good idea to do so :)
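For reference, the switch goes in the application configuration file; a minimal sketch of enabling it:
<configuration>
  <runtime>
    <gcAllowVeryLargeObjects enabled="true" />
  </runtime>
</configuration>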
So on Intel's i7 processors, memory is read and written in 64-byte cache lines.
So if I wanted to fill a cache line up, I could use 16 longs (4 bytes each).
If a make an array of 16 longs, would that fit the entire cache line, or is there some overhead for the array?
My concern is that if an array has any overhead at all, and I use 16 longs, the total size in bytes will spill over 64.
So is it more like new long[63], or new long[62] etc?
In C#, the long data type, which is an alias for System.Int64, is 8 bytes, not 4. The int type, aka System.Int32, is 4 bytes. So 64 bytes of long values is eight elements, not sixteen.
The storage of a managed array is in fact contiguous, so in theory yes, a long[8] would fit in a 64-byte cache line exactly. But note that it would do so only if properly aligned on an address that's a multiple of 64. Since you don't have control over allocation location, that's going to be difficult to do.
So even without overhead in the array, you can't guarantee that a single 8-element array of longs would actually fit exactly in a 64-byte cache line.
Of course, an array longer than that will have sub-ranges that are aligned and so can be cached entirely. But then that's true for pretty much any data type you might have. Frankly, the thing to worry about isn't the size of your data or the length of your array, but the pattern of access. See "data locality" for advice on how to access your data in ways that help ensure efficient use of the cache.
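As a rough sketch of what the access pattern means in practice (the array size is an arbitrary example), compare these two loops over the same two-dimensional array; the first walks memory sequentially, the second jumps a whole row ahead on every access:
const int n = 4096;
long[,] grid = new long[n, n];            // stored row-major, so each row is contiguous

long sumRowMajor = 0;                     // cache-friendly: consecutive accesses are adjacent in memory
for (int row = 0; row < n; row++)
    for (int col = 0; col < n; col++)
        sumRowMajor += grid[row, col];

long sumColumnMajor = 0;                  // cache-unfriendly: each access is n * 8 bytes from the last
for (int col = 0; col < n; col++)
    for (int row = 0; row < n; row++)
        sumColumnMajor += grid[row, col];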
If I declare a List of char arrays, are they allocated in contiguous memory, or does .NET create a linked list instead?
If it's not contiguous, is there a way I can declare a contiguous list of char arrays? The size of the char arrays is known ahead of time and is fixed (they are all the same size).
Yes, but not in the way that you want. List<T> guarantees that its elements are stored contiguously.
Arrays are a reference type, so the references are stored contiguously, as List<T> guarantees. However, the arrays themselves are allocated separately, and where they are stored has nothing to do with the list. The list is only concerned with its elements, the references.
If you require that then you should simply use one large array and maintain boundary data.
EDIT: Per your comment:
The inner arrays are always 9 chars.
So, in this case, cache locality may be an issue because the sub-arrays are so small. You'll be jumping around a lot in memory getting from one array to the next, and I'll take you at your word about the performance sensitivity of this code.
Just use a multi-dimensional array if you can. This of course assumes you know the size or that you can impose a maximum size on it.
Is it possible to trade some memory to reduce complexity/time and just set a max size for N? A multi-dimensional array (rather than a jagged one) is the only way you can guarantee contiguous allocation.
EDIT 2:
Trying to keep the answer in sync with the comments. You say that the max size of the first dimension is 9! and, as before, the size of the second dimension is 9.
Allocate it all up front. You're trading some memory for time. 9! * 9 * 2 / 1024 / 1024 == ~6.22MB.
As you say, the List may grow to that size anyway, so worst case you waste a few MB of memory. I don't think it's going to be an issue unless you plan on running this code in a toaster oven. Just allocate the buffer as one array up front and you're good.
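A minimal sketch of that up-front allocation, using the sizes from the discussion above (9! records of 9 chars each) and plain offset arithmetic instead of nested arrays:
const int recordCount = 362880;           // 9!
const int recordLength = 9;
char[] buffer = new char[recordCount * recordLength];    // ~6.2 MB in one contiguous block

// Record i occupies buffer[i * recordLength .. i * recordLength + recordLength).
"ABCDEFGHI".CopyTo(0, buffer, 5 * recordLength, recordLength);   // write record 5
char thirdCharOfRecord5 = buffer[5 * recordLength + 2];          // read one char back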
List functions as a dynamic array, not a linked list, but this is beside the point. No memory will be allocated for the char[]s until they themselves are instantiated. The List is merely responsible for holding references to char[]s, of which it will contain none when first created.
If it's not contiguous, is there a way I can declare a contiguous list of char arrays? The size of the char arrays is known ahead of time and is fixed (they are all the same size).
No, but you could instantiate a 2-dimensional array of chars, if you also know how many char arrays there would have been:
char[,] array = new char[x, y];
I'm trying to optimize some code where I have a large number of arrays containing structs of different sizes, but based on the same interface. In some cases the structs are larger and hold more data, other times they are small structs, and other times I would prefer to keep null as a value to save memory.
My first question is: is it a good idea to do something like this? I've previously used an array of my full data struct, but when testing the mixed approach it looked like I could save a lot of memory. Are there any other downsides?
I've been trying out different things, and it seems to work quite well when making an array of a common interface, but I'm not sure I'm checking the size of the array correctly.
I've simplified the example quite a bit. Here I'm adding different structs to an array, but I'm unable to determine its size using the traditional Marshal.SizeOf method. Would it be correct to simply iterate through the collection and sum the size of each value in it?
IComparable[] myCollection = new IComparable[1000];
myCollection[0] = null;
myCollection[1] = (int)1;
myCollection[2] = "helloo world";
myCollection[3] = long.MaxValue;
System.Runtime.InteropServices.Marshal.SizeOf(myCollection);
The last line will throw this exception:
Type 'System.IComparable[]' cannot be marshaled as an unmanaged structure; no meaningful size or offset can be computed.
Excuse the long post:
Is this an optimal and usable solution?
How can I determine the size of my array?
I may be wrong, but it looks to me like your IComparable[] array is a managed array? If so, then you can use this code to get the length:
int arrayLength = myCollection.Length;
If you are doing platform interop between C# and C++, then the answer to your question headline "Can I find the length of an unmanaged array" is no, it's not possible. Function signatures with arrays in C/C++ tend to follow this pattern:
void doSomeWorkOnArrayUnmanaged(int * myUnmanagedArray, int length)
{
// Do work ...
}
In .NET the array itself is a type which carries some basic information, such as its size and its runtime type. Therefore we can do this:
void DoSomeWorkOnManagedArray(int [] myManagedArray)
{
int length = myManagedArray.Length;
// Do work ...
}
Whenever using platform invoke to interop between C# and C++ you will need to pass the length of the array to the receiving function, as well as pin the array (but that's a different topic).
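A hedged sketch of what that looks like from the C# side; the library name and the export are placeholders invented for this example:
using System.Runtime.InteropServices;

int[] data = new int[256];
Native.doSomeWorkOnArrayUnmanaged(data, data.Length);    // the length must be passed explicitly

static class Native
{
    // "mynative.dll" and the export name are hypothetical, purely to show the calling pattern.
    [DllImport("mynative.dll", CallingConvention = CallingConvention.Cdecl)]
    public static extern void doSomeWorkOnArrayUnmanaged(int[] array, int length);
}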
Does this answer your question? If not, please clarify.
Optimality always depends on your requirements. If you really need to store many elements of different classes/structs, your solution is completely viable.
However, I guess your expectations of the data structure might be misleading: array elements are by definition all of the same size. This is even true in your case: your array doesn't store the elements themselves but references (pointers) to them. The elements are allocated somewhere on the managed heap. So your data structure actually goes like this: it is an array of 1000 pointers, each pointer pointing to some data. The size of each particular element may of course vary.
This leads to the next question: The size of your array. What are you intending to do with the size? Do you need to know how many bytes to allocate when you serialize your data to some persistent storage? This depends on the serialization format... Or do you need just a rough estimate on how much memory your structure is consuming? In the latter case you need to consider the array itself and the size of each particular element. The array which you gave in your example consumes approximately 1000 times the size of a reference (should be 4 bytes on a 32 bit machine and 8 bytes on a 64 bit machine). To compute the sizes of each element, you can indeed iterate over the array and sum up the size of the particular elements. Please be aware that this is only an estimate: The virtual machine adds some memory management overhead which is hard to determine exactly...
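A sketch of that iterate-and-sum estimate, with the caveat above baked in; the per-object overhead figure is an assumption, since the CLR does not expose exact managed object sizes:
using System;
using System.Runtime.InteropServices;

IComparable[] myCollection = { null, 1, "helloo world", long.MaxValue };
Console.WriteLine(EstimateBytes(myCollection));

static long EstimateBytes(IComparable[] items)
{
    long total = (long)items.Length * IntPtr.Size;            // the reference slots themselves
    long overhead = 2L * IntPtr.Size;                         // assumed per-object header cost
    foreach (IComparable item in items)
    {
        if (item is null) continue;                           // a null slot adds nothing extra
        if (item is string s)
            total += overhead + sizeof(int) + 2L * s.Length;  // length field + UTF-16 payload
        else
            total += overhead + Marshal.SizeOf(item.GetType()); // boxed primitives and structs
    }
    return total;
}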
Is it worthwhile to initialize the collection size of a List<T> if it's reasonably known?
Edit: Furthering this question after reading the first answers, it really boils down to: what is the default capacity, and how is the growth operation performed? Does it double the capacity, etc.?
Yes, it gets to be important when your List<T> gets large. The exact numbers depend on the element type and the machine architecture; let's pick a List of reference types on a 32-bit machine. Each element will then take 4 bytes inside an internal array. The list will start out with a Capacity of 0 and an empty array. The first Add() call grows the Capacity to 4, reallocating the internal array to 16 bytes. Four Add() calls later, the array is full and needs to be reallocated again. It doubles the size: Capacity grows to 8, the array size to 32 bytes. The previous array is garbage.
This repeats as necessary; several copies of the internal array will become garbage.
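If you want to watch that growth happen, a quick throwaway test (exact capacities can vary by runtime version):
using System;
using System.Collections.Generic;

var list = new List<object>();
int lastCapacity = -1;
for (int i = 0; i < 20000; i++)
{
    list.Add(null);
    if (list.Capacity != lastCapacity)                        // report every reallocation
    {
        Console.WriteLine($"Count={list.Count,6}  Capacity={list.Capacity,6}");
        lastCapacity = list.Capacity;
    }
}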
Something special happens when the array has grown to 65,536 bytes (16,384 elements). The next Add() doubles the size again to 131,072 bytes. That's a memory allocation that exceeds the threshold for "large objects" (85,000 bytes). The allocation is now no longer made on the generation 0 heap, it is taken from the Large Object Heap.
Objects on the LOH are treated specially. They are only garbage collected during a generation 2 collection. And the heap doesn't get compacted, it takes too much time to move such large chunks.
This repeats as necessary; several LOH objects will become garbage. They can take up memory for quite a while, since generation 2 collections do not happen very often. Another problem is that these large blocks tend to fragment the virtual memory address space.
This doesn't repeat endlessly; sooner or later the List class needs to re-allocate the array, and it has grown so large that there isn't a hole left in the virtual memory address space to fit it. Your program will bomb with an OutOfMemoryException, usually well before all available virtual memory has been consumed.
Long story short, by setting the Capacity early, before you start filling the List, you can reserve that large internal array up front. You won't get all those awkward released blocks in the Large Object Heap and avoid fragmentation. In effect, you'll be able to store many more objects in the list and your program runs leaner since there's so little garbage. Do this only if you have a good idea how large the list will be, using a large Capacity that you'll never fill is wasteful.
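A minimal sketch of doing exactly that; the element count is an assumed, known-in-advance value for illustration:
using System.Collections.Generic;

int expectedCount = 100000;                    // assumed size, known before filling the list
var items = new List<string>(expectedCount);   // backing array allocated once, up front
for (int i = 0; i < expectedCount; i++)
    items.Add("item " + i);                    // no reallocations until Capacity is exceeded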
It is, as per documentation:
If the size of the collection can be estimated, specifying the initial capacity eliminates the need to perform a number of resizing operations while adding elements to the List(T).
Well, it will stop the values in the list (which will be references if the element type is a reference type) from having to be copied occasionally as the list grows.
If it's going to be a particularly large list and you've got a pretty good idea of the size, it won't hurt. However, if estimating the size involves extra calculations or any significant amount of code, I wouldn't worry about it unless you find it becomes a problem - it could distract from the main focus of the code, and the resizing is unlikely to cause performance issues unless it's a really big list or you're doing it a lot.