.NET Generic List optimization relating to capacity - C#

I'm currently working with an app that will do the following.
// Initialize a list:
var myList = new List<aPoint>();
while (WeHaveMoreData)
    myList.AddRange(ReturnNext1000Points());   // each call returns the next batch of points
I have no way of knowing the total size of the list up front. From what I've read, List<> is the best way to handle this much incoming data (it could be upwards of 500k records).
I'm wondering if I should manage the capacity of the list (give it an initial value, or increase the capacity when it needs it)?
How do I approach optimizing such a procedure?

If you have an approximation of the total number of records you could set the capacity of the list; otherwise, leave it to grow. It is pretty well optimized; just make sure you don't run out of memory. Another approach is to use a lazy iterator, which won't load the entire list into memory:
public IEnumerable<aPoint> GetPoints()
{
    while (WeHaveMoreData)
    {
        yield return new aPoint();   // hand back one point at a time instead of buffering them all
    }
}
Records will only begin to be fetched once you start iterating, one at a time, and each can be released immediately:
foreach (var point in GetPoints())
{
/// TODO: do something with the point
}

First rule: premature optimization is the root of all evil. If performance is not an issue, leave it as is. Otherwise, try setting the initial size of the list to about AverageExpectedSize / 0.7.
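A minimal sketch of that, reusing the names from the question and assuming AverageExpectedSize is whatever estimate you have (the 500000 here is just a placeholder):
int averageExpectedSize = 500000;                                 // your own estimate
var myList = new List<aPoint>((int)(averageExpectedSize / 0.7));  // leave ~30% headroom
while (WeHaveMoreData)
    myList.AddRange(ReturnNext1000Points());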

I also don't think you can optimize it much. I guess you could do slightly better in some specific cases, so I have a question: what do you do with the data afterwards? Also, do you want to optimize for memory or for speed?
A typical list implementation grows its capacity by a factor of 2 every time, so maybe you could save some space by having a List<aPoint[]>, which would have far fewer elements, making it less likely that you end up with a few hundred thousand slots of spare capacity. But that would only matter if you were just about to run out of memory; it's likely that much more memory is spent on the data itself in any case.
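A hedged sketch of that idea, keeping the 1000-point batches from the question as the chunks (it assumes ReturnNext1000Points() hands back an aPoint[]; ProcessPoint is just a placeholder for your own work):
var chunks = new List<aPoint[]>();
while (WeHaveMoreData)
    chunks.Add(ReturnNext1000Points());        // each batch stays in its own small array

// iterate all points without ever concatenating them into one big backing array
foreach (aPoint[] chunk in chunks)
    foreach (aPoint p in chunk)
        ProcessPoint(p);                       // placeholder for whatever you do with a point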

In general, I would say that if you don't know the number of elements to within, say, +/- 20%, then you should probably just add blindly to the List instead of guessing the capacity.
A List is different from an array when it comes to adding at capacity. Remember that the List will double its capacity once you exceed it. For example, if your list has a current capacity of 128 elements and you add a 129th element, the list resizes its capacity to 256 elements. Then the next 127 Adds don't resize the list at all. Once you add the 257th element, it doubles to 512, and the process repeats.
Thus you will have O(log n) resizes of your list.
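You can see the doubling by printing Capacity as you add items (a small illustration, not from the answer above):
var list = new List<int>();
int lastCapacity = list.Capacity;              // 0 before anything is added
for (int i = 0; i < 1000; i++)
{
    list.Add(i);
    if (list.Capacity != lastCapacity)
    {
        Console.WriteLine("Count = {0}, Capacity = {1}", list.Count, list.Capacity);
        lastCapacity = list.Capacity;
    }
}
// prints capacities 4, 8, 16, ..., 1024: O(log n) resizes for n adds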


.Net Dictionary<int,int> out of memory exception at around 6,000,000 entries

I am using a Dictionary<Int,Int> to store the frequency of colors in an image, where the key is the color (as an int), and the value is the number of times the color has been found in the image.
When I process larger / more colorful images, this dictionary grows very large. I get an out of memory exception at just around 6,000,000 entries. Is this the expected capacity when running in 32-bit mode? If so, is there anything I can do about it? And what might be some alternative methods of keeping track of this data that won't run out of memory?
For reference, here is the code that loops through the pixels in a bitmap and saves the frequency in the Dictionary<int,int>:
Bitmap b; // = something...
Dictionary<int, int> count = new Dictionary<int, int>();
System.Drawing.Color color;
for (int i = 0; i < b.Width; i++)
{
    for (int j = 0; j < b.Height; j++)
    {
        color = b.GetPixel(i, j);
        int colorKey = color.ToArgb();
        if (!count.ContainsKey(colorKey))
        {
            count.Add(colorKey, 0);
        }
        count[colorKey] = count[colorKey] + 1;
    }
}
Edit: In case you were wondering what image has that many different colors in it: http://allrgb.com/images/mandelbrot.png
Edit: I also should mention that this is running inside an asp.net web application using .Net 4.0. So there may be additional memory restrictions.
Edit: I just ran the same code inside a console application and had no problems. The problem only happens in ASP.Net.
Update: Given the OP's sample image, it seems that the maximum number of items would be over 16 million, and apparently even that is too much to allocate when instantiating the dictionary. I see three options here:
Resize the image down to a manageable size and work from that.
Try to convert to a color scheme with fewer color possibilities.
Go for an array of fixed size as others have suggested.
Previous answer: the problem is that you don't allocate enough space for your dictionary. At some point, when it is expanding, you run out of memory for the expansion (old and new storage exist at the same time), not necessarily for the final dictionary itself.
Example: this code runs out of memory at nearly 24 million entries (on my machine, running in 32-bit mode):
Dictionary<int, int> count = new Dictionary<int, int>();
for (int i = 0; ; i++)
    count.Add(i, i);
because during the last expansion it is still using the space for the entries already there, and it tries to allocate new space for millions more on top of that, and that is too much.
Now, if we initially allocate space for, say, 40 million entries, it runs without problem:
Dictionary<int, int> count = new Dictionary<int, int>(40000000);
So try to indicate how many entries there will be when creating the dictionary.
From MSDN:
The capacity of a Dictionary is the number of elements that can be added to the Dictionary before resizing is necessary. As elements are added to a Dictionary, the capacity is automatically increased as required by reallocating the internal array.
If the size of the collection can be estimated, specifying the initial capacity eliminates the need to perform a number of resizing operations while adding elements to the Dictionary.
Each dictionary entry holds two 4-byte integers as its payload: 8 bytes. 8 bytes * 6 million entries is only about 48 MB, plus some space for per-entry bookkeeping, object overhead, alignment, etc. There's plenty of room in memory for this: a 32-bit .NET process gets up to 2 GB of virtual address space, and 48 MB or so shouldn't cause a problem.
I expect what's actually happening here is related to how the dictionary auto-expands and how the garbage collector handles (or doesn't handle) compaction.
First, the auto-expanding part. Last time I checked (back around .NET 2.0*), collections in .NET tended to use arrays internally. They would allocate a reasonably-sized array in the collection constructor (say, 10 items), and then use a doubling algorithm to create additional space whenever the array filled up. All the existing items would have to be copied to the new array, but then the old array could be garbage collected. The garbage collector is pretty reliable about this, and so it means you're left using space for at most 2n - 1 items in the collection.
Now the garbage collector compaction part. After a certain size, these arrays end up in a section of memory called the Large Object Heap. Garbage collection still works there (though less often). What doesn't really work well there is compaction (think memory defragmentation). The physical memory used by the old object will be released, returned to the operating system, and made available for other processes. However, the virtual address space in your process (the table that maps program memory offsets to physical memory addresses) will still have that now-empty range reserved.
This is important, because remember: we're working with a rapidly growing object. It's possible for such an object to take up address space far larger than the final size of the object itself. An object grows enough, fast enough, and suddenly you get an OutOfMemoryException, even though your app isn't really using all that much RAM.
The first solution here is to allocate enough space in the initial collection for all of your data. That lets you skip all those re-allocations and copies: your data will live in a single array and use only the space you actually asked for. Most collections, including Dictionary, have a constructor overload that lets you specify how many items the first array should hold. Be careful here: you don't need to allocate an entry for every pixel in your image. There will be a lot of repeated colors; you only need enough space for each distinct color in your image. If it's only large images that give you problems, and you're almost handling them at six million records, you might find that 8 million is plenty.
My next suggestion is to group your pixel colors. A human can't tell, and doesn't care, if two colors are just one bit apart in any of the RGB components. You might go as far as to look at the separate RGB values for each pixel and normalize them so that you only care about changes of more than 5 or so in an R, G, or B value. That would get you from about 16.8 million potential colors down to only about 132,000, and the data will likely be more useful, too. That might look something like this:
var colorCounts = new Dictionary<Color, int>(132651);
foreach (Color c in GetImagePixels()
    .Select(p => Color.FromArgb((p.R / 5) * 5, (p.G / 5) * 5, (p.B / 5) * 5)))
{
    int n;
    colorCounts.TryGetValue(c, out n);   // n stays 0 when this bucketed color is new
    colorCounts[c] = n + 1;
}
* IIRC, somewhere in a recent or upcoming version of .NET both of these issues are being addressed: one by allowing you to force compaction of the LOH, and the other by using a set of arrays for collection backing stores rather than trying to keep everything in one big array.
The maximum size of a single object allowed by the CLR is 2 GB:
When you run a 64-bit managed application on a 64-bit Windows
operating system, you can create an object of no more than 2 gigabytes
(GB).
You may be better off using an array.
You may also check out BigArray<T>, getting around the 2GB array size limit
In the 32 bit runtime, the maximum number of items you can have in a Dictionary<int, int> is in the neighborhood of 61.7 million. See my old article for more info.
If you're running in 32-bit mode, then your entire application plus whatever bits of ASP.NET and the underlying machinery are required all have to fit within the memory available to your process: normally 2 GB in the 32-bit runtime.
By the way, a really wacky way to solve your problem (but one I wouldn't recommend unless you're really hurting for memory), would be the following (assuming a 24-bit image):
Call LockBits to get a pointer to the raw image data
Compress the per-scan-line padding by moving the data for each scan line to fill the previous row's padding. You end up with an array of 3-byte values followed by a bunch of empty space (to equal the padding).
Sort the image data. That is, sort the 3-byte values. You'd have to write a custom sort, but it wouldn't be too bad.
Go sequentially through the array and count the number of unique values.
Allocate a 2-dimensional array: int[count,2] to hold the values and their occurrence counts.
Go sequentially through the array again to count occurrences of each unique value and populate the counts array.
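For what it's worth, here's a simplified sketch of that idea. It is not the exact procedure above: it copies the pixel data out to an int[] (giving up some of the memory savings of the in-place compaction) and only counts unique colors; a second pass could fill the counts array. Format32bppArgb is requested so each pixel is a single int.
using System;
using System.Drawing;
using System.Drawing.Imaging;
using System.Runtime.InteropServices;

static int CountUniqueColors(Bitmap b)
{
    var rect = new Rectangle(0, 0, b.Width, b.Height);
    BitmapData data = b.LockBits(rect, ImageLockMode.ReadOnly, PixelFormat.Format32bppArgb);
    var pixels = new int[b.Width * b.Height];
    try
    {
        // Copy row by row so any per-scan-line padding (Stride > Width * 4) is skipped.
        for (int y = 0; y < b.Height; y++)
            Marshal.Copy(data.Scan0 + y * data.Stride, pixels, y * b.Width, b.Width);
    }
    finally
    {
        b.UnlockBits(data);
    }

    Array.Sort(pixels);                        // identical colors become adjacent
    int unique = pixels.Length == 0 ? 0 : 1;
    for (int i = 1; i < pixels.Length; i++)
        if (pixels[i] != pixels[i - 1]) unique++;
    return unique;                             // a second pass could build an int[unique, 2] of counts
}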
I wouldn't honestly suggest using this method. Just got a little laugh when I thought of it.
Try using an array instead. I doubt it will run out of memory. 6 million int array elements is not a big deal.
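For a 24-bit color space that could be a flat histogram indexed by the packed RGB value, about 64 MB of ints. A sketch against the question's bitmap b (not the answerer's code):
// One counter per possible 24-bit RGB value: 16,777,216 ints = 64 MB, allocated once.
int[] counts = new int[1 << 24];
for (int i = 0; i < b.Width; i++)
{
    for (int j = 0; j < b.Height; j++)
    {
        int rgb = b.GetPixel(i, j).ToArgb() & 0xFFFFFF;   // drop the alpha byte
        counts[rgb]++;
    }
}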

Preallocating a List in C#

I am working in C# and I am creating a list (newList) whose length I can get from another list (otherList). Is List implemented in a way that preallocating its capacity from otherList.Count is better for performance, or should I just use newList.Add(obj) and not worry about the length?
The following constructor for List<T> is implemented for the purpose of improving performance in scenarios like yours:
http://msdn.microsoft.com/en-us/library/dw8e0z9z.aspx
public List(int capacity)
Just pass the capacity in the constructor.
newList = new List<string>(otherList.Count);
If you know the exact length of the new list, creating it with that capacity does indeed perform a bit better.
The reason is that the implementation of List<T> internally uses an array. If this gets too small, a new array is created and the items from the old array are copied over to the new one.
Taken from the Remarks section on MSDN
The capacity of a List<T> is the number of elements that the List<T>
can hold. As elements are added to a List<T>, the capacity is automatically increased as required by reallocating the internal array.
If the size of the collection can be estimated, specifying the initial capacity eliminates the need to perform a number of resizing operations while adding elements to the List<T>.
The capacity can be decreased by calling the TrimExcess method or by
setting the Capacity property explicitly. Decreasing the capacity
reallocates memory and copies all the elements in the List<T>.
So this would suggest there is a performance gain if you have an estimate of the size of the list you are going to populate. Of course, the other side of this is allocating a list that is too big and therefore using up memory unnecessarily.
To be honest, I would not worry about this sort of micro optimization unless I really need to.
So I benchmarked the scenario of preallocation, and just wanted to share some numbers. The code simply times this:
var repetitions = 100000000;
var list = new List<object>();
for (var i = 0; i < repetitions; i++)
    list.Add(new object());
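The preallocated variants presumably just pass a capacity to the constructor, something along these lines:
var list = new List<object>(repetitions);         // "preallocated"
// var list = new List<object>(repetitions * 2);  // "double preallocated"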
The comparison is between preallocating the list with the exact number of repetitions, with twice that number, and not preallocating at all. These are the times:
List not preallocated: 6561, 6394, 6556, 6283, 6466
List preallocated: 5951, 6037, 5885, 6044, 5996
List double preallocated: 6710, 6665, 6729, 6760, 6624
So is preallocation worth it? Yes, if the number of objects in the list is known, or if a lower bound is known (how many items the list will have at least).
If not, you risk wasting time and memory, since the preallocation itself spends time and memory to reserve the requested space.
Note also that in terms of memory, exact preallocation is the most efficient, since not preallocating will make the list grow past the tightest capacity it actually needs.

Removing duplicate items from a list without using temp memory

I want to write a function that takes a collection of integers and removes the duplicates from it. I cannot apply any sorting algorithm, and similarly I cannot duplicate the collection. I need to conserve memory and provide an efficient solution that can process millions of items without significantly draining the battery.
If you are very short on memory, the best solution would be not to include the redundant integers in the list in the first place.
To do this you might use an array of booleans indexed by value (e.g. [0..65535] for 16-bit values), which you could pack 8 flags to a byte to make smaller, recording which values have already been used.
Another solution is to keep the list sorted, inserting each item in the right place but skipping it if it is already there. Finding the position is O(log k) per item (k being the number of unique items so far), so the lookups cost about n log n overall; note that the insertions themselves also shift elements, which adds to the cost.
If you do not have control over the source, you could still use an array of booleans (a bigger one if you need to): initialize it to all false, set isUsed[itemList[i]] = true for each item, then dispose of the list so you get the memory back, and finally build a new list out of the array. As a bonus, the output will be ordered.
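A minimal sketch of that approach, assuming the values are known to fit in 0..65535 as in the example above, and that itemList is a List<int> (the other names here are illustrative):
// Mark which values occur, then rebuild the list from the flags.
var isUsed = new bool[65536];                 // or a BitArray to pack 8 flags per byte
for (int i = 0; i < itemList.Count; i++)
    isUsed[itemList[i]] = true;

itemList.Clear();                             // reclaim the old contents
for (int value = 0; value < isUsed.Length; value++)
    if (isUsed[value])
        itemList.Add(value);                  // result is de-duplicated and sorted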
If your integers are full 32-bit values, the (bit-packed) array would be about 500 MB, which is probably too big; but depending on the distribution of the integers (is there really a wide range of possible values?), you may be able to lower that size.
Notice that if you are very short on memory you might use an object pool to reuse objects (you might even reuse objects that you just removed from the list).

Efficiently storing a set of numbers

I am looking for the most efficient way to store a collection of integers. Right now they're being stored in a HashSet<T>, but profiling has shown that these collections weigh heavily on some performance-critical code and I suspect there's a better option.
Some more details:
Random lookups must be O(1) or close to it.
The collections can grow large, so space efficiency is desirable.
The values are uniformly distributed in a 64-bit space.
Mutability is not needed.
There's no clear upper bound on size, but tens of millions of elements is not uncommon.
The most painful performance hit right now is creating them. That seems to be allocation-related - clearing and reusing HashSets helps a lot in benchmarks, but unfortunately that is not a feasible option in the application code.
(added) Implementing a data structure that's tailored to the task is fine. Is a hash table still the way to go? A trie also seems like a possibility at first glance, but I don't have any practical experience with them.
HashSet is usually the best general purpose collection in this case.
If you have any specific information about your collection you may have better options.
If you have a fixed upper bound that is not incredibly large you can use a bit vector of suitable size.
If you have a very dense collection you can instead store the missing values.
If you have very small collections, <= 4 items or so, you can store them in a regular array. A full scan of such a small array may be faster than the hashing required to use the hash set.
If you don't have any more specific characteristics of your data than "large collections of int" HashSet is the way to go.
If the size of the values is bounded you could use a bitset. It stores one bit per possible value, so the memory use would be n bits, with n being the greatest integer.
Another option is a Bloom filter. Bloom filters are very compact, but you have to be prepared for an occasional false positive in lookups. You can find more about them on Wikipedia.
A third option is using a simple sorted array. Lookups are O(log n), with n being the number of integers. It may be fast enough.
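A rough sketch of that last option, assuming the set is built once and then only queried (values stands in for your incoming IEnumerable<long>; Distinct needs System.Linq):
// Build once: copy, de-duplicate, and sort.
long[] sorted = values.Distinct().ToArray();
Array.Sort(sorted);

// O(log n) membership test, with no per-element object overhead:
bool found = Array.BinarySearch(sorted, 123456789L) >= 0;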
I decided to try and implement a special purpose hash-based set class that uses linear probing to handle collisions:
Backing store is a simple array of longs
The array is sized to be larger than the expected number of elements to be stored.
For a value's hash code, use the least-significant 31 bits.
Searching for the position of a value in the backing store is done using a basic linear probe, like so:
int FindIndex(long value)
{
    // _storage is the backing long[] array, sized larger than the number of stored elements
    var index = (int)(value & 0x7FFFFFFF) % _storage.Length;
    var slotValue = _storage[index];
    if (slotValue == 0x0 || slotValue == value) return index;
    for (++index; ; index++)
    {
        if (index == _storage.Length) index = 0;
        slotValue = _storage[index];
        if (slotValue == 0x0 || slotValue == value) return index;
    }
}
(I was able to determine that the data being stored will never include 0, so that number is safe to use for empty slots.)
The array needs to be larger than the number of elements stored. (Load factor less than 1.) If the set is ever completely filled then FindIndex() will go into an infinite loop if it's used to search for a value that isn't already in the set. In fact, it will want to have quite a lot of empty space, otherwise search and retrieval may suffer as the data starts to form large clumps.
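For completeness, a hedged sketch of how Add and Contains might sit on top of FindIndex, using the same zero-means-empty convention (this is not the author's actual class, just an illustration):
private long[] _storage;   // allocated larger than the expected element count
private int _count;

public bool Contains(long value)
{
    // FindIndex stops at either the value's slot or the first empty slot on its probe path.
    return _storage[FindIndex(value)] == value;
}

public bool Add(long value)
{
    int index = FindIndex(value);
    if (_storage[index] == value) return false;   // already present
    _storage[index] = value;                      // FindIndex returned an empty slot
    _count++;
    return true;
}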
I'm sure there's still room for optimization, and I may end up using some sort of BigArray<T> or sharding for the backing store on large sets. But initial results are promising. It performs over twice as fast as HashSet<T> at a load factor of 0.5, nearly twice as fast with a load factor of 0.8, and even at 0.9 it's still working 40% faster in my tests.
Overhead is 1 / load factor, so if those performance figures hold out in the real world then I believe it will also be more memory-efficient than HashSet<T>. I haven't done a formal analysis, but judging by the internal structure of HashSet<T> I'm pretty sure its overhead is well above 10%.
--
So I'm pretty happy with this solution, but I'm still curious if there are other possibilities. Maybe some sort of trie?
--
Epilogue: Finally got around to doing some competitive benchmarks of this vs. HashSet<T> on live data. (Before I was using synthetic test sets.) It's even beating my optimistic expectations from before. Real-world performance is turning out to be as much as 6x faster than HashSet<T>, depending on collection size.
What I would do is just create an array of integers with a sufficient size to handle however many integers you need. Is there any reason for staying away from the generic List<T>? http://msdn.microsoft.com/en-us/library/6sh2ey19.aspx
The most painful performance hit right now is creating them...
As you've obviously observed, HashSet<T> does not have a constructor that takes a capacity argument to initialize its capacity.
One trick which I believe would work is the following:
int capacity = ... some appropriate number;
int[] items = new int[capacity];
HashSet<int> hashSet = new HashSet<int>(items);
hashSet.Clear();
...
Looking at the implementation with reflector, this will initialize the capacity to the size of the items array, ignoring the fact that this array contains duplicates. It will, however, only actually add one value (zero), so I'd assume that initializing and clearing should be reasonably efficient.
I haven't tested this so you'd have to benchmark it. And be willing to take the risk of depending on an undocumented internal implementation detail.
It would be interesting to know why Microsoft didn't provide a constructor with a capacity argument like they do for other collection types.

C# Increasing an array by one element at the end

In my program I have a bunch of growing arrays where a new element is grown one by one to the end of the array. I identified Lists to be a speed bottleneck in a critical part of my program due to their slow access time in comparison with an array - switching to an array increased performance tremendously to an acceptable level. So to grow the array i'm using Array.Resize. This works well as my implementation restricts the array size to approximately 20 elements, so the O(N) performance of Array.Resize is bounded.
But it would be better if there were a way to just grow an array by one element at the end without having to use Array.Resize, which I believe copies the old array into the newly sized one.
So my question is: is there a more efficient method for adding one element to the end of an array without using List or Array.Resize?
A List has constant time access just like an array. For 'growing arrays' you really should be using List.
When you know you may be adding elements to an array-backed structure, you don't want to grow it by one slot at a time. Usually it is best to grow an array by doubling its size when it fills up.
As has been previously mentioned, List<T> is what you are looking for. If you know the initial size of the list, you can supply an initial capacity to the constructor, which will increase your performance for your initial allocations:
List<int> values = new List<int>(5);
values.Add(1);
values.Add(2);
values.Add(3);
values.Add(4);
values.Add(5);
Lists allocate capacity for 4 elements when the first item is added (unless you specify a capacity when you construct them) and then double their capacity each time they fill up.
Why don't you try a similar thing with your array? I.e. create it with 4 elements, then when you insert the fifth element, first grow the array by another chunk (or by doubling it).
There is no way to resize an array, so the only way to get a larger array is to use Array.Resize to create a new array.
Why not just create the arrays to have 20 elements from start (or whatever capacity you need at most), and use a variable to keep track of how many elements are used in the array? That way you never have to resize any arrays.
Growing an array, AFAIK, means that a new array is allocated and the existing content is copied to the new instance. I doubt that this would be faster than using a List...?
It's much faster to resize an array in chunks (of, say, 10) and store the capacity in a separate variable, only resizing the array when the capacity is reached. This is how a List works, but if you prefer to use arrays you should look into resizing them in larger chunks, especially if you have a large number of Array.Resize calls.
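A small sketch of that pattern, with illustrative names (the chunk size of 10 matches the suggestion above):
class ChunkedIntArray
{
    private int[] _items = new int[10];   // capacity grows in chunks of 10
    private int _count;                   // number of slots actually in use

    public void Append(int value)
    {
        if (_count == _items.Length)
            Array.Resize(ref _items, _items.Length + 10);   // one copy per 10 appends
        _items[_count++] = value;
    }

    public int Count { get { return _count; } }
    public int this[int index] { get { return _items[index]; } }
}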
I think that any method that wants to use an array will never be optimal, because an array is a static (fixed-size) structure, so I think it's better to use dynamic structures like List or others.
