Resizing big arrays

Resizing big arrays - c#

I have an array which size is like 2 GB (filled with audio samples). Now I want to apply a filter for that array. This filter is generating like 50% more samples than input source. So now I need to create new array which size is 3 GB. Now I gave 5 GB of memory used. But if this filter can operate only at that source array and only need some more space in this array.
Question: can I allocate a memory in C# that can be resized w/o creating a second memory block, then removing that first one?
I just thought, If memory in PC's is divided into 4 kB pages (or more), so why C# cannot (?) use that good feature?

If your filter can work in-place just allocate 50% more space at the beginning. All you need to know is the actual length of the original sample.
If that code doesn't work always and you don't want to consume more memory beforehand, you can allocate half of the original array (the extension array) and check which part your access relates to:
byte[] myOriginalArray = new byte[2GB]; // previously allocated
byte[] myExtensionArray = new byte[1GB]; // 50% of the original
for(... my processing code of the array ...)
{
byte value = read(index);
... process the index and the value here
store(index, value);
}
byte read(int index)
{
if(index < 2GB) return myOriginalArray[index];
return myExtensionArray[index - 2GB];
}
void store(int index, byte value)
{
if(index < 2GB) myOriginalArray[index] = value;
myExtensionArray[index - 2GB] = value;
}
You add index check and subtraction overhead for each access to the array. That could also be made smarter for certain cases. For instance for the portion you do not need to access extension you can use your faster loop and for the part where you need to write to extension part you can use the slower version (two consecutive loops).

Question: can I allocate a memory in C# that can be resized w/o creating a second memory block, then removing that first one?
No, you cannot resize an array in .NET. If you want to increase the size of an array you will have to create a new and bigger array and copy all the data from the existing array to the new array.
To get around this problem you could provide your own "array" implementation based on allocating smaller chunks of memory but presenting it as one big buffer of data. An example of this is StringBuilder that is based on an implementation of chunks of characters, each chunk being a separate Char[] array.
Another option is to use P/Invoke to get access to low level memory management functions like VirtualAlloc that allows you to reserve pages of memory in advance. You need to do this in a 64 bit process because the virtual address space of a 32 bit process is only 4 GB. You probably also need to work with unsafe code and pointers.

Related

.Net Dictionary<int,int> out of memory exception at around 6,000,000 entries

I am using a Dictionary<Int,Int> to store the frequency of colors in an image, where the key is the the color (as an int), and the value is the number of times the color has been found in the image.
When I process larger / more colorful images, this dictionary grows very large. I get an out of memory exception at just around 6,000,000 entries. Is this the expected capacity when running in 32-bit mode? If so, is there anything I can do about it? And what might be some alternative methods of keeping track of this data that won't run out of memory?
For reference, here is the code that loops through the pixels in a bitmap and saves the frequency in the Dictionary<int,int>:
Bitmap b; // = something...
Dictionary<int, int> count = new Dictionary<int, int>();
System.Drawing.Color color;
for (int i = 0; i < b.Width; i++)
{
for (int j = 0; j < b.Height; j++)
{
color = b.GetPixel(i, j);
int colorString = color.ToArgb();
if (!count.Keys.Contains(color.ToArgb()))
{
count.Add(colorString, 0);
}
count[colorString] = count[colorString] + 1;
}
}
Edit: In case you were wondering what image has that many different colors in it: http://allrgb.com/images/mandelbrot.png
Edit: I also should mention that this is running inside an asp.net web application using .Net 4.0. So there may be additional memory restrictions.
Edit: I just ran the same code inside a console application and had no problems. The problem only happens in ASP.Net.

Update: Given the OP's sample image, it seems that the maximum number of items would be over 16 million, and apparently even that is too much to allocate when instantiating the dictionary. I see three options here:
Resize the image down to a manageable size and work from that.
Try to convert to a color scheme with fewer color possibilities.
Go for an array of fixed size as others have suggested.
Previous answer: the problem is that you don't allocate enough space for your dictionary. At some point, when it is expanding, you just run out of memory for the expansion, but not necessarily for the new dictionary.
Example: this code runs out of memory at nearly 24 million entries (in my machine, running in 32-bit mode):
Dictionary<int, int> count = new Dictionary<int, int>();
for (int i = 0; ; i++)
count.Add(i, i);
because with the last expansion it is currently using space for the entries already there, and tries to allocate new space for another so many million more, and that is too much.
Now, if we initially allocate space for, say, 40 million entries, it runs without problem:
Dictionary<int, int> count = new Dictionary<int, int>(40000000);
So try to indicate how many entries there will be when creating the dictionary.
From MSDN:
The capacity of a Dictionary is the number of elements that can be added to the Dictionary before resizing is necessary. As elements are added to a Dictionary, the capacity is automatically increased as required by reallocating the internal array.
If the size of the collection can be estimated, specifying the initial capacity eliminates the need to perform a number of resizing operations while adding elements to the Dictionary.

Each dictionary entry holds two 4-byte integers: 8 bytes total. 8 bytes * 6 millions entries is only about 48MB, +/- some space for object overhead, alignment, etc. There's plenty of space in memory for this. .Net provides virtual address space of up to 2 GB per process. 48MB or so shouldn't cause a problem.
I expect what's actually happening here is related to how the dictionary auto-expands and how the garbage collector handles (or doesn't handle) compaction.
First, the auto-expanding part. Last time I checked (back around .Net 2.0*), collections in .Net tended to use arrays internally. They would allocated a reasonably-sized array in the collection constructor (say, 10 items), and then use a doubling algorithm to create additional space whenever the array filled up. All the existing items would have to be copied to the new array, but then the old array could be garbage collected. The garbage collector is pretty reliable about this, and so it means you're left using space for at most 2n - 1 items in the collection.
Now the Garbage Collector compaction part. After a certain size, these arrays end up in a section of memory called the Large Object Heap. Garbage Collection still works here (though less often). What doesn't really work here well is compaction (think memory defragmentation). The physical memory used by the old object will be released, returned to the operating system, and available for other processes. However, the virtual address space in your process... the table that maps program memory offsets to physical memory addresses, will still have the (empty) space reserved.
This is important, because remember: we're working with a rapidly growing object. It's possible for such an object to take up address space far larger than the final size of the object itself. An object grows enough, fast enough, and suddenly you get an OutOfMemoryException, even though your app isn't really using all that much RAM.
The first solution here is allocate enough space in the initial collection for all of your data. This allows you to skip all those re-allocations and copying. Your data will live in a single array, and use only the space you actually asked for. Most collections, including the Dictionary, have an overload for the constructor that allows you to give it the number of items you want the first array to use. Be careful here: you don't need to allocate an item for every pixel in your image. There will be a lot of repeated colors. You only need to allocate enough to have space for each color in your image. If it's only large images that give you problems, and you're almost handling them with six millions records, you might find that 8 million is plenty.
My next suggestion is to group your pixel colors. A human can't tell and doesn't care if two colors are just one bit apart in any of the rgb components. You might go as far as to look at the separate RGB values for each pixel and normalize the pixel so that you only care about changes of more than 5 or so for an R,G,or B value. That would get you from 16.5 million potential colors all the way down to only about 132,000, and the data will likely be more useful, too. That might look something like this:
var colorCounts = new Dictionary<Color, int>(132651);
foreach(Color c in GetImagePixels().Select( c=> Color.FromArgb( (c.R/5) * 5, (c.G/5) * 5, (c.B/5) * 5) )
{
colorCounts[c] += 1;
}
* IIRC, somewhere in a recent or upcoming version of .Net both of these issues are being addressed. One by allowing you to force compaction of the LOH, and the other by using a set of arrays for collection backing stores, rather than trying to keep everything in one big array

The maximum size limit provided by CLR is 2GB
When you run a 64-bit managed application on a 64-bit Windows
operating system, you can create an object of no more than 2 gigabytes
(GB).
You may better use an array.
You may also check this BigArray<T>, getting around the 2GB array size limit

In the 32 bit runtime, the maximum number of items you can have in a Dictionary<int, int> is in the neighborhood of 61.7 million. See my old article for more info.
If you're running in 32 bit mode, then your entire application plus whatever bits of ASP.NET and the underlying machinery is required all have to fit within the memory available to your process: normally 2 GB in the 32-bit runtime.
By the way, a really wacky way to solve your problem (but one I wouldn't recommend unless you're really hurting for memory), would be the following (assuming a 24-bit image):
Call LockBits to get a pointer to the raw image data
Compress the per-scan-line padding by moving the data for each scan line to fill the previous row's padding. You end up with an array of 3-byte values followed by a bunch of empty space (to equal the padding).
Sort the image data. That is, sort the 3-byte values. You'd have to write a custom sort, but it wouldn't be too bad.
Go sequentially through the array and count the number of unique values.
Allocate a 2-dimensional array: int[count,2] to hold the values and their occurrence counts.
Go sequentially through the array again to count occurrences of each unique value and populate the counts array.
I wouldn't honestly suggest using this method. Just got a little laugh when I thought of it.

Try using an array instead. I doubt it will run out of memory. 6 million int array elements is not a big deal.

Is C# List<char[]> allocated in contiguous memory?

If I declare a List of char arrays, are they allocated in contiguous memory, or does .NET create a linked list instead?
If it's not contiguous, is there a way I can declare a contiguous list of char arrays? The size of the char arrays is know ahead of time and is fixed (they are all the same size).

Yes, but not in the way that you want. List<T> guarantees that its elements are stored contiguously.
Arrays are a reference type, so the references are stored cotiguously as List<T> guarantees. However, the arrays themselves are allocated separately and where they are stored has nothing to do with the list. It is only concerned with its elements, the references.
If you require that then you should simply use one large array and maintain boundary data.
EDIT: Per your comment:
The inner arrays are always 9 chars.
So, in this case, cache coherency may be an issue because the sub-arrays are so small. You'll be jumping around a lot in memory getting from one array to the next, and I'll just take you on your word about the performance sensitivity of this code.
Just use a multi-dimensional if you can. This of course assumes you know the size or that you can impose a maximum size on it.
Is it possible to trade some memory to reduce complexity/time and just set a max size for N? Using a multi-dimensional array (but don't use the latter) is the only way you can guarantee contiguous allocation.
EDIT 2:
Trying to keep the answer in sync with the comments. You say that the max size of the first dimension is 9! and, as before, the size of the second dimension is 9.
Allocate it all up front. You're trading some memory for time. 9! * 9 * 2 / 1024 / 1024 == ~6.22MB.
As you say, the List may grow to that size anyway, so worst case you waste a few MB of memory. I don't think it's going to be an issue unless you plan on running this code in a toaster oven. Just allocate the buffer as one array up front and you're good.

List functions as a dynamic array, not a linked list, but this is beside the point. No memory will be allocated for the char[]s until they themselves are instantiated. The List is merely responsible for holding references to char[]s, of which it will contain none when first created.
If it's not contiguous, is there a way I can declare a contiguous list of char arrays? The size of the char arrays is know ahead of time and is fixed (they are all the same size).
No, but you could instantiate a 2-dimensional array of chars, if you also know how many char arrays there would have been:
char[,] array = new char[x, y];

Best Way to Represent Large Byte Array

I'm looking for the most efficient way to store and manage a large byte array in memory. I will have the need to both insert and delete bytes from any position within the array.
At first, I was thinking that a regular array was best.
byte[] buffer = new byte[ArraySize];
This would allow me to access any byte within the array. I can also resize the array. However, there doesn't appear to be any built-in support for shifting or moving items within the array.
One option is to have a loop to move items one by one but that sounds horribly inefficient in C#. Another option is to create a new array and copy bytes over to the correct position, but that requires copying all data in the array.
Is there no better option?

Actually, I just found the Buffer Class, which appears ideal for what I need.
It looks like the BlockCopy method will block copy a bunch of items and supports copying within the same array, and even correctly handles overlapping items.

I think the best option in this case is a hybrid between a regular array and a list. This would only be necessary with megabyte sized arrays though.
So you could do something like this:
List<byte[]> buffer;
And have each element of the list just a chunk of the data(say 64K or something small and manageable)
It'd require quite a bit of custom code, but would definitely be the fastest option when having to shift data around in a large array.
Also, if you're doing a lot more shifting of bytes than anything else, LinkedList<T> may work better (but it's famously bad for everything but a specific set of cases)
To clarify why this is more correct than an array, consider inserting 1 byte to the beginning of an array. You must allocate another array (double memory consumption) and then copy every byte to the new array after inserting the new byte, and then free the old array (possible heap corruption depending on size)
Consider now this method with lists.
If you have to insert a lot of bytes, you'll probably want to insert at the beginning of the buffer list. This is an O(n) operation, so your ending efficiency for this operation is O(n/CHUNK_SIZE)
Or, if you just need to insert a single byte, you can just get the first element of the list and copy the array as normal. Then, the speed is O(CHUNK_SIZE), which isn't horrible, especially if n in comparison is very large (megabytes of data)

How to build struct with variable 1 to 4 bytes in one value?

What I try to do:
I want to store very much data in RAM. For faster access and less memory footprint I need to use an array of struct values:
MyStruct[] myStructArray = new MyStruct[10000000000];
Now I want to store unsigned integer values with one, two, three or four bytes in MyStruct. But it should only use the less possible memory amount. When I store a value it one byte it should only use one byte and so on.
I could implement this with classes, but that is inappropriate here because the pointer to the object would need 8 bytes on a 64bit system. So it would be better to store just 4 bytes for every array entry. But I want to only store/use one/two/three byte when needed. So I can't use some of the fancy classes.
I also can't use one array with one bytes, one array with two bytes and so on, because I need the special order of the values. And the values are very mixed, so storing an additional reference when to switch to the other array would not help.
Is it possible what want or is the only way to just store an array of 4 byte uints regardless I only need to store one byte, two byte in about 60% of the time and three bytes in about 25% of the time?

This is not possible. How would the CLR process the following expression?
myStructArray[100000]
If the elements are of variable size, the CLR cannot know the address of the 100000th element. Therefore array elements are of fixed size, always.
If you don't require O(1) access, you can implement variable-length elements on top of a byte[] and search the array yourself.
You could split the list into 1000 sublists, which are packed individually. That way you get O(n/2000) search performance on average. Maybe that is good enough in practice.
A "packed" array can only be searched in O(n/2) on average. But if your partial arrays are 1/1000th the size, it becomes O(n/2000). You can pick the partial array in O(1) because they all would be of the same size.
Also, you can adjust the number of partial arrays so that they are individually about 1k elements in size. At that point the overhead of the array object and reference to it vanish. That would give you O(1000/2 + 1) lookup performance which I think is quite an improvement over O(n/2). It is a constant-time lookup (with a big constant).

You could get close to that what you want if you are willing to sacrifice some additional CPU time and waste additional 2 or 4 bits per one stored value.
You could just use byte byte[] and combine it with BitArray collection. In byte[] you would then just sequentially store one, two, three or four bytes and in BitArray denote in binary form (pairs of two bits) or just put a bit to value 1 to denote a new set of bytes have just started (or ended, however you implement it) in your data array.
However you could get something like this in memory:
byte[] --> [byte][byte][byte][byte][byte][byte][byte]...
BitArray --> 1001101...
Which means you have 3 byte, 1 byte, 2 bytes etc. values stored in your byte array.
Or you could alternatively encode your bitarray as binary pairs to make it even smaller. This means you would vaste somewhere between 1.0625 and 1.25 bytes per your actual data byte.
It depends on your actual data (your MyStruct) if this will suffice. If you need to distinguish to which values in your struct those bytes really corresponds, you could waste some additional bits in BitArray.
Update to your O(1) requirement:
Use another index structure which would store one index for each N elements, for example 1000. You could then for example access item with index 234241 as
indexStore[234241/1000]
which gives you index of element 234000, then you just calculate the exact index of element 234241 by examining those few hundred elements in BitArray.
O(const) is acheieved this way, const can be controlled with density of main index, of course you trade time for space.

You can't do it.
If the data isn't sorted, and there is nothing more you can say about it, then you are not going to be able to do what you want.
Simple scenario:
array[3]
Should point to some memory address. But, how would you know what are dimensions of array[0]-array[2]? To store that information in an O(1) fashion, you would only waste MORE memory than you want to save in the first place.
You are thinking out of the box, and that's great. But, my guess is that this is the wrong box that you are trying to get out of. If your data is really random, and you want direct access to every array member, you'll have to use MAXIMUM width that is needed for your number for every number. Sorry.
I had one similar situation, with having numbers of length smaller than 32 bits that I needed to store. But they were all fixed width, so I was able to solve that, with custom container and some bit shifting.
HOPE:
http://www.dcc.uchile.cl/~gnavarro/ps/spire09.3.pdf
Maybe you can read it, and you'll be able not only to have 8, 16, 24, 32 bit per number, but ANY number size...

I'd almost start looking at some variant of short-word encoding like a PkZip program.
Or even RLE encoding.
Or try to understand the usage of your data better. Like, if these are all vectors or something, then there are certain combinations that are disallowed like, -1,-1,-1 is basically meaningless to a financial graphing application, as it denotes data outsides the graphable range. If you can find some oddities about your data, you may be able to reduce the size by having different structures for different needs.

How can I determine the sizeof for an unmanaged array in C#?

I'm trying to optimize some code where I have a large number of arrays containing structs of different size, but based on the same interface. In certain cases the structs are larger and hold more data, othertimes they are small structs, and othertimes I would prefer to keep null as a value to save memory.
My first question is. Is it a good idea to do something like this? I've previously had an array of my full data struct, but when testing mixing it up I would virtually be able to save lots of memory. Are there any other downsides?
I've been trying out different things, and it seams to work quite well when making an array of a common interface, but I'm not sure I'm checking the size of the array correctly.
To simplified the example quite a bit. But here I'm adding different structs to an array. But I'm unable to determine the size using the traditional Marshal.SizeOf method. Would it be correct to simply iterate through the collection and count the sizeof for each value in the collection?
IComparable[] myCollection = new IComparable[1000];
myCollection[0] = null;
myCollection[1] = (int)1;
myCollection[2] = "helloo world";
myCollection[3] = long.MaxValue;
System.Runtime.InteropServices.Marshal.SizeOf(myCollection);
The last line will throw this exception:
Type 'System.IComparable[]' cannot be marshaled as an unmanaged structure; no meaningful size or offset can be computed.
Excuse the long post:
Is this an optimal and usable solution?
How can I determine the size
of my array?

I may be wrong but it looks to me like your IComparable[] array is a managed array? If so then you can use this code to get the length
int arrayLength = myCollection.Length;
If you are doing platform interop between C# and C++ then the answer to your question headline "Can I find the length of an unmanaged array" is no, its not possible. Function signatures with arrays in C++/C tend to follow the following pattern
void doSomeWorkOnArrayUnmanaged(int * myUnmanagedArray, int length)
{
// Do work ...
}
In .NET the array itself is a type which has some basic information, such as its size, its runtime type etc... Therefore we can use this
void DoSomeWorkOnManagedArray(int [] myManagedArray)
{
int length = myManagedArray.Length;
// Do work ...
}
Whenever using platform invoke to interop between C# and C++ you will need to pass the length of the array to the receiving function, as well as pin the array (but that's a different topic).
Does this answer your question? If not, then please can you clarify

Optimality always depends on your requirements. If you really need to store many elements of different classes/structs, your solution is completely viable.
However, I guess your expectations on the data structure might be misleading: Array elements are per definition all of the same size. This is even true in your case: Your array doesn't store the elements themselves but references (pointers) to them. The elements are allocated somewhere on the VM heap. So your data structure actually goes like this: It is an array of 1000 pointers, each pointer pointing to some data. The size of each particular element may of course vary.
This leads to the next question: The size of your array. What are you intending to do with the size? Do you need to know how many bytes to allocate when you serialize your data to some persistent storage? This depends on the serialization format... Or do you need just a rough estimate on how much memory your structure is consuming? In the latter case you need to consider the array itself and the size of each particular element. The array which you gave in your example consumes approximately 1000 times the size of a reference (should be 4 bytes on a 32 bit machine and 8 bytes on a 64 bit machine). To compute the sizes of each element, you can indeed iterate over the array and sum up the size of the particular elements. Please be aware that this is only an estimate: The virtual machine adds some memory management overhead which is hard to determine exactly...

We Keep Coding

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.