I am attempting to ascertain the maximum sizes (in RAM) of a List and a Dictionary. I am also curious as to the maximum number of elements / entries each can hold, and their memory footprint per entry.
My reasons are simple: I, like most programmers, am somewhat lazy (this is a virtue). When I write a program, I like to write it once and try to future-proof it as much as possible. I am currently writing a program that uses Lists, but noticed that the indexer wants an integer. Since the capabilities of my program are only limited by available memory / coding style, I'd like to write it so I can use a List with Int64s or possibly BigInts (as the indexes). I've seen IEnumerable as a possibility here, but would like to find out if I can just stuff an Int64 into a Dictionary object as the key, instead of rewriting everything. If I can, I'd like to know what the cost of that might be compared to rewriting it.
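Roughly what I have in mind, as a quick sketch (Item here is just a placeholder for my real element type):

using System.Collections.Generic;

class Item { public double Value; }  // placeholder for my real element type

class Series
{
    // Int64 keys instead of int indexes, so the key space isn't capped at int.MaxValue.
    private readonly Dictionary<long, Item> items = new Dictionary<long, Item>();

    public void Add(long key, Item item)
    {
        items[key] = item;
    }

    public Item Get(long key)
    {
        return items[key];
    }
}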
My hope is that, should my program prove useful, I need only hit recompile in 5 years' time to take advantage of the increase in memory.
Is it specified in the documentation for the class? No? Then it's unspecified.
In terms of current implementations, there's no maximum size in RAM in the classes themselves. If you create a value type that's 2MB in size, push a few thousand into a list, and receive an out-of-memory exception, that has nothing to do with List<T>.
Internally, List<T>'s workings would prevent it from ever having more than 2 billion items (its count and indexer are signed 32-bit integers). It's harder to come to a quick answer with Dictionary<TKey, TValue>, since the way things are positioned within it is more complicated, but really, if I were looking at dealing with a billion items (of a 32-bit value, for example, that's 4GB), I'd be looking to store them in a database and retrieve them using data-access code.
At the very least, once you're dealing with a single data structure that's 4GB in size, rolling your own custom collection class no longer counts as reinventing the wheel.
I am using a ConcurrentDictionary to rank 3x3 patterns in half a million games of Go. Obviously there are a lot of possible patterns. With .NET 4.0 the ConcurrentDictionary runs out of memory at around 120 million objects. It is using 8GB at that point (on a 32GB machine), but it wants to grow far too much, I think (table growths happen in large chunks with ConcurrentDictionary). Using a database would slow me down at least a hundredfold, I think. And the process is already taking 10 hours.
My solution was a multiphase approach: do multiple passes, one for each subset of patterns, for example one pass for odd patterns and one for even patterns. When using more objects no longer fails, I can reduce the number of passes.
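As a rough sketch of that multi-pass idea (EnumeratePatterns and WriteResults are placeholders for my real pattern source and output code):

using System;
using System.Collections.Concurrent;

// Pass 0 counts patterns with even keys, pass 1 counts odd keys,
// so only about half of the patterns are resident in memory at a time.
for (int pass = 0; pass < 2; pass++)
{
    var counts = new ConcurrentDictionary<long, int>();
    foreach (long patternKey in EnumeratePatterns())        // placeholder for the real pattern source
    {
        if (Math.Abs(patternKey % 2) != pass)
            continue;                                        // this key belongs to the other pass
        counts.AddOrUpdate(patternKey, 1, (k, v) => v + 1);  // count occurrences of this pattern
    }
    WriteResults(counts, pass);                              // placeholder: persist this pass before starting the next
}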
.NET 4.5 adds support for larger arrays in 64-bit processes by using unsigned 32-bit indexes for arrays
(the mentioned limit goes from 2 billion to 4 billion). See also
http://msdn.microsoft.com/en-us/library/hh285054(v=vs.110).aspx. Not sure which objects will benefit from this; List<> might.
I think you have bigger issues to solve before even wondering if a Dictionary with an int64 key will be useful in 5 or 10 years.
Having a List or Dictionary of 2e+9 elements in memory (Int32) doesn't seem to be a good idea, never mind 9e+18 elements (Int64). Anyhow, the framework currently won't allow you to create a monster of that size (not even close), and probably never will. (Keep in mind that a simple int[int.MaxValue] array, at roughly 8GB, already far exceeds the framework's 2GB limit for memory allocation of any given object.)
And the question remains: why would you ever want your application to hold a list of so many items in memory? You are better off using a specialized data storage backend (a database) if you have to manage that amount of information.
Related
Currently I am using XNA Game Studio 4.0 with C# in Visual Studio 2010. I want to use a versatile method for handling triangles. I am using a preset array of VertexPositionColor items passed to the GraphicsDevice.DrawUserPrimitives() method, which only handles arrays. Arrays are fixed in size, but I wanted a very large space to which I could arbitrarily add new triangles, so my original idea was to make a large array, specifically:
VertexPositionColor[] vertices = new VertexPositionColor[int.MaxValue];
but that ran my application out of memory. So what I'm wondering is how to approach this memory/performance issue best.
Is there an easy way to increase the amount of memory allocated to the stack whenever my program runs?
Would it be beneficial to store the array on the heap instead? And would I have to build my own allocator if I wanted to do that?
Or is my best approach simply to use a LinkedList and deal with the extra processing required to copy it to an array every frame?
I hit this building my voxel engine code.
Consider the problem I had:
Given an unknown volume size that would clearly be bigger than the amount of memory the computer had, how do I manage that volume of data?
My solution was to use sparse chunking. For example:
In my case, instead of using an array, I used a dictionary.
This way I could look up the values based on a key that was, say, the hash code of a voxel's position, and the value was the voxel itself.
This meant that the voxels were fast to pull out, and self-organised by the language / compiler into an indexed set.
It also means that when pulling data back out I could default to Voxel.Empty for voxels that hadn't yet been assigned.
In your case you might not need a default value but using a dictionary might prove more helpful than an array.
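A rough sketch of the approach (the Voxel type and the key packing here are simplified for illustration):

using System.Collections.Generic;

struct Voxel
{
    public static readonly Voxel Empty = new Voxel();
    public byte Material;
}

class SparseVolume
{
    // Only voxels that have actually been assigned take up memory.
    private readonly Dictionary<long, Voxel> voxels = new Dictionary<long, Voxel>();

    // Pack three coordinates into one 64-bit key (assumes each coordinate fits in 21 bits).
    private static long Key(int x, int y, int z)
    {
        return ((long)(x & 0x1FFFFF) << 42) | ((long)(y & 0x1FFFFF) << 21) | (long)(z & 0x1FFFFF);
    }

    public Voxel Get(int x, int y, int z)
    {
        Voxel v;
        // Positions that were never written fall back to the default instead of occupying storage.
        return voxels.TryGetValue(Key(x, y, z), out v) ? v : Voxel.Empty;
    }

    public void Set(int x, int y, int z, Voxel v)
    {
        voxels[Key(x, y, z)] = v;
    }
}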
The upshot: arrays are a tad faster for some things, but when you consider all of your usage scenarios for the data, you may find that overall the gains of using a dictionary are worth a slight allocation cost.
In testing I found that if I was prepared to drop from something like 100ms per thousand to, say, 120ms per thousand on allocations, I could then retrieve the data 100% faster for most of the queries I was performing on the set.
Reason for my suggestion here:
It looks like you don't know the size of your data set, and using an array only makes sense if you do know the size; otherwise you tie up needlessly pre-allocated chunks of RAM just to make your code ready for any eventuality you want to throw at it.
Hope this helps.
You may try List<T> and the ToArray() method associated with List. It's supported by the XNA framework too (MSDN).
List is a successor to ArrayList; it provides more features and is strongly typed (a good comparison).
About performance: List<T>.ToArray is an O(n) operation. I suggest you break your lengthy array into portions which you can name with a key (some sort of unique identifier for a region or so on), store the relevant information in a List, and use a Dictionary like Dictionary<Key, List<T>>, which could reduce the operations involved. Also, you can process the required models with a priority-based approach, which would give a performance gain over processing the complete array at once.
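A minimal sketch of that layout, assuming the XNA VertexPositionColor type and a simple integer region id as the key:

// Requires: using System.Collections.Generic; using Microsoft.Xna.Framework.Graphics;
Dictionary<int, List<VertexPositionColor>> regions = new Dictionary<int, List<VertexPositionColor>>();

void AddTriangle(int regionId, VertexPositionColor a, VertexPositionColor b, VertexPositionColor c)
{
    List<VertexPositionColor> list;
    if (!regions.TryGetValue(regionId, out list))
    {
        list = new List<VertexPositionColor>();
        regions[regionId] = list;
    }
    list.Add(a);
    list.Add(b);
    list.Add(c);
}

void DrawRegion(GraphicsDevice device, int regionId)
{
    List<VertexPositionColor> list;
    if (regions.TryGetValue(regionId, out list) && list.Count > 0)
    {
        // ToArray is O(n), but n is now the size of one region rather than the whole scene.
        VertexPositionColor[] vertices = list.ToArray();
        device.DrawUserPrimitives(PrimitiveType.TriangleList, vertices, 0, vertices.Length / 3);
    }
}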
Possible Duplicate:
How to refer to children in a tree with millions of nodes
I'm trying to implement a tree which will hold millions of nodes, which in turn can have an unspecified number of children nodes.
To achieve this (since each node can have more than one child node), I'm storing a node's children within a Dictionary data structure. As a result, when each node object is created (out of millions), I've got a node object which contains a character value stored in the respective node, as well as a separate Dictionary structure which holds references to the child nodes.
My tree works for a few thousand nodes, however when it reaches millions of nodes, an out of memory exception occurs. Is this due to the fact that each one of the millions of nodes running in memory also has its own Dictionary? i.e. I've got millions of objects running?
I need to have these objects running in memory, and cannot use files or databases. Could anyone suggest a solution?
Your OOM exception may be due to LOH (large object heap) fragmentation, rather than actually running out of memory. You could try switching to SortedDictionary, which uses a red-black tree, rather than Dictionary, which uses a hashtable, and see if that improves matters. Or you could implement your own tree structure.
You could try using a 64-bit version of Windows and compiling the program as 64-bit. This will give you much more memory... It isn't a magic bullet (there are still limits to how big a memory structure can be).
You can read about Memory Limits for Windows Releases.
http://msdn.microsoft.com/en-us/library/aa366778(v=vs.85).aspx
Think about using 64-bit.
Take a look on this question: Is there a memory limit for a single .NET process
For your scenario try using BigArray
http://blogs.msdn.com/b/joshwil/archive/2005/08/10/450202.aspx
The solution isn't going to be easy.
Options are:
Offload some of those nodes to disk, where you have a greater amount of "workspace" to deal with. Only load in memory what really really needs to be there. Due to the radical difference between disk and RAM speeds this could result in a huge performance penalty.
Increase the amount of RAM in the machine to accommodate what you are doing. This might necessitate a move to 64-bit (if it is currently a 32-bit app). Depending on your real memory requirements this could be pretty expensive. Of course, if you have a 32-bit app now AND have plenty of RAM available, switching to 64-bit is going to at least get you above the 3 to 4GB range...
Streamline each node to have a much, much smaller footprint. In other words, do you need everything Dictionary offers, or can you make do with just a struct defining the left and right links to other nodes? Basically, take a look at how you need to process this and look at the traditional ways of dealing with tree data structures (a rough sketch follows below).
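For instance, a much leaner node layout than one Dictionary per node might look roughly like this (a sketch only; the field names are illustrative, not a drop-in replacement for your tree):

// All nodes live in one big array, and children are linked through int indices
// instead of each node owning its own Dictionary instance.
struct Node
{
    public char Value;
    public int FirstChild;   // index of the first child in the nodes array, or -1 if none
    public int NextSibling;  // index of the next sibling in the nodes array, or -1 if none
}

// Millions of small structs in a single array cost far less memory (and far fewer
// allocations) than millions of separate node objects each holding a Dictionary.
Node[] nodes = new Node[10000000];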
Another solution I can suggest, if your only restriction is no file access (if you're opposed to ANY sort of database, this answer is moot), is to use an in-memory database (like SQLite) or something else that provides similar functionality. This will help you in many ways, not the least of which is that the database will perform memory management for you, and there are many proven algorithms for storing large trees in databases that you can adapt.
Hope this helps!
I have a case here on which I would like some opinions from the experts :)
Situation:
I have a data structure with Int32 and Double values, with a total size of 108 bytes.
I have to process a large series of this data structure. It's something like this (conceptual; I will use a for loop instead):
double result = 0;
foreach (Item item in series)
{
    result += // some calculation based on item
}
I expect the size of the series to be about 10 MB.
To be useful, the whole series must be processed. It's all or nothing.
The series data will never change.
My requirements:
Memory consumption is not an issue. I think that nowadays, if the user doesn't have a few dozen MB free on his machine, he probably has a deeper problem.
Speed is a concern. I want the iteration to be as fast as possible.
No unmanaged code, or interop, or even unsafe.
What I would like to know
Implement the item data structure as a value or reference type? From what I know, value types are cheaper, but I imagine that on each iteration a copy will be made for each item if I use a value type. Is this copy faster than a heap access?
Any real problem if I implement the accessors as auto-implemented properties? I believe this will increase the footprint. But also that the getter will be inlined anyway. Can I safely assume this?
I'm seriously considering creating a very large static readonly array of the series directly in code (it's rather easy to do this with the data source). This would give me a 10 MB assembly. Any reason why I should avoid this?
Hope someone can give me a good opinion on this.
Thanks
Implement the item data structure as a value or reference type? From what I know, value types are cheaper, but I imagine that on each iteration a copy will be made for each item if I use a value type. Is this copy faster than a heap access?
Code it both ways and profile it aggressively on real-world input. Then you'll know exactly which one is faster.
Any real problem if I implement the accessors as auto-implemented properties?
Real problem? No.
I believe this will increase the footprint. But also that the getter will be inlined anyway. Can I safely assume this?
You can only safely assume things guaranteed by the spec. It's not guaranteed by the spec.
I'm seriously considering creating a very large static readonly array of the series directly in code (it's rather easy to do this with the data source). This would give me a 10 MB assembly. Any reason why I should avoid this?
I think you're probably worrying about this too much.
I'm sorry if my answer seems dismissive. You're asking random people on the Internet to speculate which of two things is faster. We can guess, and we might be right, but you could just code it both ways in the blink of an eye and know exactly which is faster. So, just do it?
However, I always code for correctness, readability and maintainability at first. I establish reasonable performance requirements up front, and I see if my implementation meets them. If it does, I move on. If I need more performance from my application, I profile it to find the bottlenecks and then I start worrying.
You're asking about a trivial computation that takes ~10,000,000 / 108 ~= 100,000 iterations. Is this even a bottleneck in your application? Seriously, you are overthinking this. Just code it and move on.
That's 100,000 loops, which in CPU time is sod all. Stop overthinking it and just write the code. You're making a mountain out of a molehill.
Speed is subjective. How do you load your data and how much data is inside your process elsewhere? Loading the data will be the slowest part of your app if you do not need complex parsing logic to create your struct.
I think you are asking this question because you have a struct of 108 bytes in size on which you perform calculations, and you wonder why your app is slow. Please note that structs are passed by value, which means that if you pass the struct to one or more methods during your calculations, or fetch it from a List, you will create a copy of the struct every time. This is indeed very costly.
Change your struct to a class and expose only getters, to be sure you have a read-only object. That should fix your perf issues.
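A rough sketch of what I mean (the field names are illustrative only):

// A class is passed by reference, so handing it to methods or reading it back
// from a List copies only a reference instead of the full 108-byte payload.
public sealed class Item
{
    private readonly int id;
    private readonly double value;

    public Item(int id, double value)
    {
        this.id = id;
        this.value = value;
    }

    public int Id { get { return id; } }          // getter only, so the object stays read-only
    public double Value { get { return value; } } // getter only, so the object stays read-only
}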
A good practice is to separate data from code, so regarding your "big array embedded in the code" question, I say don't do that.
Use LINQ for calculations on the entire series; the speed is good (see the sketch below).
Use a Node class for each point if you want more functionality.
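For example, the LINQ version of the loop from the question could be as simple as this (Calculate stands in for whatever per-item formula you need):

// Requires: using System.Linq;
double result = series.Sum(item => Calculate(item));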
I used to work with such large series of data. They were points that you plot on a graph. Originally they were taken every ms or less. The datasets were huge. Users wanted to apply different formulas to these series and have the results displayed. It looks to me like your problem might be similar.
To improve speed we stored different zoom levels of the points in a DB: say, every ms, then aggregated for every minute, every hour, every day, etc. (whatever users needed). When users zoomed in or out we would load the new values from the DB instead of performing the calculations right then. We would also cache the values so users didn't have to go to the DB all the time.
Also, if the users wanted to apply some formulas to the series (as in your case), the aggregated data is much smaller to work with.
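A rough sketch of the kind of pre-aggregation described above (the Point type and the one-minute bucket are illustrative):

using System;
using System.Collections.Generic;
using System.Linq;

struct Point
{
    public DateTime Time;
    public double Value;
}

// Collapse raw millisecond samples into one averaged point per minute,
// so zoomed-out views only have to read a fraction of the values.
static IEnumerable<Point> AggregateByMinute(IEnumerable<Point> samples)
{
    return samples
        .GroupBy(p => new DateTime(p.Time.Year, p.Time.Month, p.Time.Day, p.Time.Hour, p.Time.Minute, 0))
        .Select(g => new Point { Time = g.Key, Value = g.Average(p => p.Value) });
}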
First of all, I am aware that this question has been discussed many times in this forum, such as
Large array C# OutOfMemoryException and
OutOfMemoryException
The object I am having a problem with is
Dictionary<long, Double> results
which stores ID in long and calculation result in Double
I will have to reuse the same object about 10-20 times; every time I reuse it, I will call
results = new Dictionary<long, Double>();
I know that I can write it to a text file or database file for further processing, but if possible I would try to avoid that, as it is way too slow for the amount of data I am handling. I have also tried GC.Collect(), but no luck with that.
Can anyone with some previous experience give some pointer on this?
Edit: I have > 3 million objects in the list, but they are fixed (i.e. the key is the same in all iterations)
Ah - no. It also makes little sense to get out-of-memory exceptions from calls like that.
I STRONGLY suggest you get serious about analysing this - put a memory profiler onto the program and find the real problem. A long/double combo makes zero sense as a memory problem unless you store some hundred million pairs, and even then...
And: a move to 64-bit is always wise. The 2/3 GB limit per process is harder on .NET due to GC "overhead" - it's impossible to use up all the memory. 64-bit has much higher limits.
But again, your indication is wrong. The new Dictionary is likely NOT the error at all; something else is wasting your memory.
If the issue is simply that the memory isn't being freed as expected, perhaps use .Clear() on the dictionary rather than re-creating it every time?
Rather than creating 20 different instances, use one, but clear the dictionary (which allows the GC to collect the old elements) so that you have more memory to work with. Also, moving to a 64-bit environment might be wise if you require huge amounts of memory.
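Something along these lines, as a sketch of the reuse pattern (FillResults and ProcessResults are placeholders for the real calculation and consumption code):

// Requires: using System.Collections.Generic;
// Pre-size once for the ~3 million fixed keys, then reuse the same instance.
var results = new Dictionary<long, double>(3000000);

for (int run = 0; run < 20; run++)
{
    results.Clear();            // reuse the existing backing storage instead of allocating a new dictionary
    FillResults(results, run);  // placeholder: populate the dictionary for this iteration
    ProcessResults(results);    // placeholder: consume the results before the next iteration
}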
Explicitly checking/handling that you don't hit the 2^31 - 1 (?) maximum number of entries when adding to a C# List is craziness, true or false?
(Assuming this is an app where the average List size is less than a 100.)
1. Memory limits
Well, the size of a System.Object without any properties is 8 bytes (2 x 32-bit pointers), or 16 bytes in a 64-bit system. [EDIT:] Actually, I just checked in WinDbg, and the size is 12 bytes on x86 (32-bit).
So in a 32-bit system you would need about 24 GB of RAM just for the objects (2^31 objects x 12 bytes), which you cannot have on a 32-bit system.
2. Program design
I strongly believe that such a large list shouldn't be held in memory, but rather in some other storage medium. In that case, you always have the option to create a caching class that wraps a List and handles the actual storage under the hood. So testing the size before adding is the wrong place to do the testing; your List implementation should do it itself if you find it necessary one day (see the sketch below).
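If you ever do find it necessary, the check belongs inside a wrapper along these lines rather than at every call site (a sketch only; the names are made up):

using System;
using System.Collections.Generic;

// A thin wrapper that owns the "is this getting too big?" policy in one place.
public class BoundedList<T>
{
    private readonly List<T> inner = new List<T>();
    private readonly int maxCount;

    public BoundedList(int maxCount)
    {
        this.maxCount = maxCount;
    }

    public void Add(T item)
    {
        if (inner.Count >= maxCount)
            throw new InvalidOperationException("The list has reached its configured limit.");
        inner.Add(item);
    }

    public T this[int index] { get { return inner[index]; } }

    public int Count { get { return inner.Count; } }
}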
3. To be on the safe side
Why not add a re-entrance counter inside each method to prevent a Stack Overflow? :)
So, yes, it's crazy to test for that. :)
Seems excessive. Would you not hit the machine's memory limit first, depending on the size of the objects in your list? (I assume this check is performed by the user of the List class, and is not a check inside the implementation?)
Perhaps it's reassuring that colleagues are thinking ahead, though? (sarcasm!)
It would seem so, and I probably wouldn't include the check, but I'm conflicted on this. Programmers once thought that two digits were enough to represent the year in date fields, on the grounds that it was fine for the expected life of their code; however, we discovered that this assumption wasn't correct.
Look at the risk, look at the effort, and make a judgement call (otherwise known as an educated guess! :-) ). I wouldn't say there's any hard-and-fast rule on this one.
As in the answer above, I suspect there would be more things going wrong before you need to worry about that. But yes, if you have the time and inclination, you can polish the code till it shines!
True
(well, you asked true or false...)
Just tried this code:
List<int> list = new List<int>();
while (true) list.Add(1);
I got a System.OutOfMemoryException. So what would you do to check / handle this?
If you keep adding items to the list, you'll run out of memory long before you hit that limit. By "long" I really mean "a lot sooner than you think".
See this discussion on the large object heap (LOH). Once you hit around 21,500 elements (half that on a 64-bit system), assuming you're storing object references, your list's backing array will start to be a large object: the LOH threshold is 85,000 bytes, and 85,000 / 4 bytes per reference ≈ 21,250. Since the LOH isn't compacted in the same way the normal .NET heaps are, you'll eventually fragment it badly enough that a large enough contiguous memory area cannot be allocated.
So you don't have to check for that limit at all; it's not the real limit.
Yes, that is craziness.
Consider what happens to the rest of the code when you start to reach those numbers. Is the application even usable if you would have millions of items in the list?
If it's even possible that the application would reach that amount of data, perhaps you should instead take measures to keep the list from getting that large. Perhaps you should not even keep all the data in memory at once. I can't really imagine a scenario where any code could practically make use of that much data.