I'm trying to implement a tree which will hold millions of nodes, each of which can in turn have an unspecified number of child nodes.
To achieve this (since each node can have more than one child), I'm storing each node's children in a Dictionary. As a result, every node object created (out of millions) contains a character value as well as its own Dictionary holding references to its child nodes.
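A minimal sketch of the node type described above (names are illustrative, not taken from the actual code):

using System.Collections.Generic;

// Sketch only: one character per node, plus a Dictionary mapping each child's
// character to the child node. Each node therefore carries its own Dictionary.
class TrieNode
{
    public char Value;
    public Dictionary<char, TrieNode> Children = new Dictionary<char, TrieNode>();
}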
My tree works for a few thousand nodes; however, when it reaches millions of nodes, an out-of-memory exception occurs. Is this because each of the millions of nodes in memory also has its own Dictionary, i.e. I've got millions of objects alive at once?
I need to have these objects running in memory, and cannot use files or databases. Could anyone suggest a solution?
Your OOM exception may be due to LOH fragmentation, rather than actually running out of memory. You could try switching to SortedDictionary, which uses a red-black tree, rather than Dictionary, which uses a hashtable, and see if that improves matters. Or you could implement your own tree structure.
You could try using 64-bit Windows and compiling the program as 64-bit. This will give you much more address space... It isn't a magic bullet (there are still limits to how big a single memory structure can be).
You can read about Memory Limits for Windows Releases.
http://msdn.microsoft.com/en-us/library/aa366778(v=vs.85).aspx
Think about using 64-bit..
Take a look at this question: Is there a memory limit for a single .NET process
For your scenario, try using BigArray:
http://blogs.msdn.com/b/joshwil/archive/2005/08/10/450202.aspx
The solution isn't going to be easy.
Options are:
Offload some of those nodes to disk, where you have a greater amount of "workspace" to deal with. Only load in memory what really really needs to be there. Due to the radical difference between disk and RAM speeds this could result in a huge performance penalty.
Increase the amount of RAM in the machine to accommodate what you are doing. This might necessitate a move to 64-bit (if it is currently a 32-bit app). Depending on your real memory requirements this could be pretty expensive. Of course, if you have a 32-bit app now AND have plenty of RAM available, switching to 64-bit will at least get you above the 3-4 GB range...
Streamline each node to have a much smaller footprint. In other words, do you need everything Dictionary offers, or can you get by with a struct defining the left and right links to other nodes (see the sketch below)? Basically, look at how you need to process this and at the traditional ways of dealing with tree data structures.
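As a hedged sketch of that third option, here is a left-child/right-sibling layout that uses array indices instead of a per-node Dictionary; all names and the layout are illustrative assumptions, not the poster's actual code:

using System.Collections.Generic;

// Sketch: nodes live in one List, and each node stores the index of its first
// child and next sibling instead of owning a Dictionary of children.
struct CompactNode
{
    public char Value;
    public int FirstChild;   // index into the node list, -1 if none
    public int NextSibling;  // index into the node list, -1 if none
}

class CompactTree
{
    private readonly List<CompactNode> nodes = new List<CompactNode>();

    public int AddNode(char value)
    {
        nodes.Add(new CompactNode { Value = value, FirstChild = -1, NextSibling = -1 });
        return nodes.Count - 1;
    }

    public void AddChild(int parent, int child)
    {
        // Prepend the child to the parent's child list (structs are copied,
        // so read, modify, and write back).
        CompactNode c = nodes[child];
        c.NextSibling = nodes[parent].FirstChild;
        nodes[child] = c;

        CompactNode p = nodes[parent];
        p.FirstChild = child;
        nodes[parent] = p;
    }
}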
Another solution I can suggest, if your only restriction is no file access (if you're opposed to ANY sort of database, this answer is moot), is to use an in-memory database (like SQLite) or something else that provides similar functionality. This will help you in many ways, not least because the database will perform memory management for you, and there are many proven algorithms for storing large trees in databases that you can adapt.
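A rough sketch of the in-memory SQLite idea (this assumes the Microsoft.Data.Sqlite package; the table and column names are made up for illustration):

using Microsoft.Data.Sqlite;

// Sketch: an in-memory SQLite database holding the tree as (id, parent_id, value)
// rows. The schema and names are illustrative assumptions, not a prescribed design.
using (var conn = new SqliteConnection("Data Source=:memory:"))
{
    conn.Open();

    var create = conn.CreateCommand();
    create.CommandText =
        "CREATE TABLE nodes (id INTEGER PRIMARY KEY, parent_id INTEGER, value TEXT)";
    create.ExecuteNonQuery();

    var insert = conn.CreateCommand();
    insert.CommandText = "INSERT INTO nodes (parent_id, value) VALUES ($parent, $value)";
    insert.Parameters.AddWithValue("$parent", 0);
    insert.Parameters.AddWithValue("$value", "a");
    insert.ExecuteNonQuery();
}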
Hope this helps!
I have a case here that I would like to have some opinions from the experts :)
Situation:
I have a data structure with Int32 and Double values, with a total of 108 bytes.
I have to process a large series of this data structure. It's something like this (conceptual; I will use a for loop instead):
double result = 0;
foreach (Item item in series)
{
    result += ...; // some calculation based on item
}
I expect the size of the series to be about 10 MB.
To be useful, the whole series must be processed. It's all or nothing.
The series data will never change.
My requirements:
Memory consumption is not an issue. I think that nowadays, if the user doesn't have a few dozen MB free on his machine, he probably has a deeper problem.
Speed is a concern. I want the iteration to be as fast as possible.
No unmanaged code, or interop, or even unsafe.
What I would like to know
Implement the item data structure as a value or reference type? From what I know, value types are cheaper, but I imagine that on each iteration a copy will be made for each item if I use a value type. Is this copy faster than a heap access?
Any real problem if I implement the accessors as auto-implemented properties? I believe this will increase the footprint, but also that the getter will be inlined anyway. Can I safely assume this?
I'm seriously considering creating a very large static readonly array of the series directly in code (it's rather easy to do this with the data source). This would give me a 10 MB assembly. Any reason why I should avoid this?
Hope someone can give me a good opinion on this.
Thanks
Implement the item data structure as a value or reference type? From what I know, value types are cheaper, but I imagine that on each iteration a copy will be made for each item if I use a value type. Is this copy faster than a heap access?
Code it both ways and profile it aggressively on real-world input. Then you'll know exactly which one is faster.
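A rough benchmark harness along those lines (a sketch only; the Item types and the "calculation" are placeholders, not the asker's real code):

using System;
using System.Diagnostics;

// Sketch: the same summing loop over a struct-based and a class-based item,
// timed with Stopwatch. The types and the calculation are placeholders.
struct ItemStruct { public double Value; }
class ItemClass { public double Value; }

static class Bench
{
    static void Main()
    {
        const int n = 100000;
        var structs = new ItemStruct[n];
        var classes = new ItemClass[n];
        for (int i = 0; i < n; i++)
        {
            structs[i] = new ItemStruct { Value = i };
            classes[i] = new ItemClass { Value = i };
        }

        var sw = Stopwatch.StartNew();
        double sum1 = 0;
        foreach (var item in structs) sum1 += item.Value;
        sw.Stop();
        Console.WriteLine($"struct: {sw.Elapsed.TotalMilliseconds} ms, sum={sum1}");

        sw.Restart();
        double sum2 = 0;
        foreach (var item in classes) sum2 += item.Value;
        sw.Stop();
        Console.WriteLine($"class:  {sw.Elapsed.TotalMilliseconds} ms, sum={sum2}");
    }
}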
Any real problem if I implement the accessors as auto-implemented properties?
Real problem? No.
I believe this will increase the footprint, but also that the getter will be inlined anyway. Can I safely assume this?
You can only safely assume things guaranteed by the spec. It's not guaranteed by the spec.
I'm seriously considering creating a very large static readonly array of the series directly in code (it's rather easy to do this with the data source). This would give me a 10 MB assembly. Any reason why I should avoid this?
I think you're probably worrying about this too much.
I'm sorry if my answer seems dismissive. You're asking random people on the Internet to speculate which of two things is faster. We can guess, and we might be right, but you could just code it both ways in the blink of an eye and know exactly which is faster. So, just do it?
However, I always code for correctness, readability and maintainability at first. I establish reasonable performance requirements up front, and I see if my implementation meets them. If it does, I move on. If I need more performance from my application, I profile it to find the bottlenecks and then I start worrying.
You're asking about a trivial computation that takes ~10,000,000 / 108 ~= 100,000 iterations. Is this even a bottleneck in your application? Seriously, you are overthinking this. Just code it and move on.
That's 100,000 iterations, which in CPU time is sod all. Stop overthinking it and just write the code. You're making a mountain out of a molehill.
Speed is subjective. How do you load your data, and how much data is inside your process elsewhere? Loading the data will be the slowest part of your app, unless you need complex parsing logic to create your structs.
I suspect you're asking this because you have a 108-byte struct that you perform calculations on and you wonder why your app is slow. Note that structs are passed by value, which means that if you pass the struct to one or more methods during your calculations, or fetch it from a List, you create a copy of the struct every time. This can be quite costly.
Change your struct to a class and expose only getters so you're sure to have a read-only object. That should fix your perf issues.
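A minimal sketch of that suggestion (the property names are illustrative placeholders):

// Sketch: an immutable class, so instances are passed by reference and never
// copied on method calls or when read from a List. Names are placeholders.
class Item
{
    public int Id { get; }
    public double Value { get; }

    public Item(int id, double value)
    {
        Id = id;
        Value = value;
    }
}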
A good practice is to separate data from code, so regarding your "big array embedded in the code question", I say don't do that.
Use LINQ for calculations on entire series; the speed is good.
Use a Node class for each point if you want more functionality.
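A rough illustration of the LINQ approach (a sketch only; "series" and "Value" are placeholders for the actual data and calculation):

using System.Linq;

// Sketch: the whole-series calculation expressed as a single LINQ aggregation.
// "series" and "Value" stand in for the real collection and property.
double result = series.Sum(item => item.Value * 2.0);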
I used to work with large series of data like this. They were points plotted on a graph, originally sampled every millisecond or less. The datasets were huge. Users wanted to apply different formulas to these series and have the results displayed. Your problem looks similar to me.
To improve speed we stored different zoom levels of the points in a database: say every millisecond, then aggregated per minute, per hour, per day, etc. (whatever users needed). When users zoomed in or out we would load the new values from the database instead of performing the calculations on the spot. We would also cache the values so users didn't have to go to the database all the time.
Also, when users applied formulas to an aggregated series (as in your case), there was less data to work through.
I am attempting to ascertain the maximum sizes (in RAM) of a List and a Dictionary. I am also curious as to the maximum number of elements / entries each can hold, and their memory footprint per entry.
My reasons are simple: I, like most programmers, am somewhat lazy (this is a virtue). When I write a program, I like to write it once and try to future-proof it as much as possible. I am currently writing a program that uses Lists, but noticed that the indexer wants an integer. Since the capabilities of my program are only limited by available memory / coding style, I'd like to write it so I can use a List with Int64s or possibly BigInts (as the indexes). I've seen IEnumerable as a possibility here, but would like to find out if I can just stuff an Int64 into a Dictionary object as the key, instead of rewriting everything. If I can, I'd like to know what the cost of that might be compared to rewriting it.
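A sketch of the idea in question (the value type here is just a placeholder):

using System.Collections.Generic;

// Sketch: Dictionary<TKey, TValue> happily takes a 64-bit key today, but the
// number of entries is still limited by Int32 and by available memory.
var lookup = new Dictionary<long, string>();
lookup[5000000000L] = "value stored under a 64-bit key";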
My hope is that should my program prove useful, I need only hit recompile in 5 years time to take advantage of the increase in memory.
Is it specified in the documentation for the class? No? Then it's unspecified.
In terms of current implementations, there's no maximum size in RAM in the classes themselves. If you create a value type that's 2 MB in size, push a few thousand of them into a list, and get an out-of-memory exception, that has nothing to do with List<T>.
Internally, List<T>'s workings would prevent it from ever having more than 2 billion items. It's harder to give a quick answer for Dictionary<TKey, TValue>, since the way things are laid out inside it is more complicated, but really, if I were looking at dealing with a billion items (at 4 bytes each, for example, that's 4 GB), I'd be looking to store them in a database and retrieve them using data-access code.
At the very least, once you're dealing with a single data structure that's 4GB in size, rolling your own custom collection class no longer counts as reinventing the wheel.
I am using a ConcurrentDictionary to rank 3x3 patterns in half a million games of Go. Obviously there are a lot of possible patterns. With .NET 4.0 the ConcurrentDictionary goes out of memory at around 120 million objects. It is using 8 GB at that point (on a 32 GB machine) but wants to grow far too much, I think (table growths happen in large chunks with ConcurrentDictionary). Using a database would slow me down at least a hundredfold, I think. And the process is already taking 10 hours.
My solution was a multi-phase approach, doing multiple passes, one for each subset of patterns: for example, one pass for odd patterns and one for even patterns. When using more objects no longer fails, I can reduce the number of passes.
.NET 4.5 adds support for larger arrays on 64-bit via the <gcAllowVeryLargeObjects> configuration element (the element-count limit mentioned above effectively goes from 2 billion to 4 billion). See also
http://msdn.microsoft.com/en-us/library/hh285054(v=vs.110).aspx. Not sure which objects will benefit from this; List<> might.
I think you have bigger issues to solve before even wondering if a Dictionary with an int64 key will be useful in 5 or 10 years.
Having a List or Dictionary of 2e+9 elements in memory (the Int32 range) doesn't seem like a good idea, never mind 9e+18 elements (Int64). The framework won't let you create a monster that size anyway (not even close) and probably never will. (Keep in mind that a simple int[int.MaxValue] array already far exceeds the framework's limit on the memory allocation of any single object.)
And the question remains: why would you ever want your application to hold a list of so many items in memory? You are better off using a specialized data storage backend (a database) if you have to manage that amount of information.
I need to be able to look up this data quickly and need access to all of it. Unfortunately, I also need to conserve memory (several of these arrays will cause OutOfMemoryExceptions).
short[,,] data = new short[8000,8000,2];
I have attempted the following:
tried jagged array - same memory problems
tried breaking into smaller arrays - still get memory issues
Is the only resolution to map this data efficiently using a memory-mapped file, or is there some other way to do this?
How about a database? After all, they are made for this.
I'd suggest you take a look at some NoSQL database. Depending on your needs, there are also in-memory databases [which obviously could suffer from the same out-of-memory problem] and databases that can be copy deployed or linked to your application.
I wouldn't want to mess with the storage details manually, and memory-mapping files is what some databases (at least MongoDB) are doing internally. So essentially, you'd be rolling your own DB, and writing a database is not trivial -- even if you narrow down the use case.
Redis or Membase sound like suitable alternatives for your problem. As far as I can see, both are able to manage the RAM utilization for you, that is, read data from the disk as needed and cache data in RAM for fast access. Of course, your access patterns will play a role here.
Keep in mind that a lot of effort went into building these DBs. According to Wikipedia, Zynga is using Membase and Redis is sponsored by VMWare.
Are you sure you need access to all of it all of the time? ...or could you load a portion of it, do your processing then move onto the next?
Could you get away with using mip-mapping or LoD representations if it's just height data? Both of those could allow you to hold lower resolutions until you need to load up specific chunks of the higher resolution data.
How much free memory do you have on your machine? What operating system are you using? Is it 64 bit?
If you're doing memory / processing intensive operations, have you considered implementing those parts in C++ where you have greater control over such things?
It's difficult to help you much further without knowing more specifics of your system and what you're actually doing with your data.
I wouldn't recommend a traditional relational database if you're doing numeric calculations with this data. I suspect what you're running into here isn't the size of the data itself, but rather a known problem with .NET called Large Object Heap Fragmentation. If you're running into a problem after allocating these buffers frequently (even though they should be being garbage collected), this is likely your culprit. Your best solution is to keep as many buffers as you need pre-allocated and re-use them, to prevent the reallocation and subsequent fragmentation.
How are you interacting with this large multi-dimensional array? Are you using recursion? If so, make sure your recursive methods are passing parameters by reference rather than by value.
On a side note, do you need 100% of this data accessible at the same time? The best way to deal with large volumes of data is usually via a stream, or some kind of reader object. Try to deal with the data in segments. I've got a few processes that deal with Gigs worth of data, and it can process it in a minor amount of memory due to how I'm streaming it in via a SqlDataReader.
TL;DR: look at how you pass data between your function calls (by reference vs. by value) and maybe use streaming patterns to deal with the data in smaller chunks.
hope that helps!
Note that an array of shorts already stores each element in 16 bits (it's individual short fields and locals that tend to get widened to 32 bits), so short[8000,8000,2] needs roughly 256 MB of contiguous memory, and packing two shorts into an int won't shrink it further.
That is already about as compact as such an array can get in managed memory. What you can do then is:
Use a 64-bit machine. Then you can allocate a lot of memory, and the OS will take care of paging the data to disk for you if you run out of RAM (make sure you have a large enough swap file). You can then address up to 8 terabytes of data (if you have a large enough disk).
Read parts of this data from disk as you need them, either manually using file I/O or using memory mapping.
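A hedged sketch of the memory-mapped variant (the file name, layout, and coordinates are assumptions for illustration):

using System.IO;
using System.IO.MemoryMappedFiles;

// Sketch: keep the 8000 x 8000 x 2 grid of shorts in a file and read single
// values through a memory-mapped view instead of holding the array in memory.
const int Width = 8000, Height = 8000, Depth = 2;
long byteLength = (long)Width * Height * Depth * sizeof(short);

using (var mmf = MemoryMappedFile.CreateFromFile(
           "data.bin", FileMode.OpenOrCreate, null, byteLength))
using (var accessor = mmf.CreateViewAccessor(0, byteLength))
{
    int x = 100, y = 200, z = 0;                      // example coordinates
    long index = ((long)x * Height + y) * Depth + z;  // flatten [x, y, z]
    short value = accessor.ReadInt16(index * sizeof(short));
}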
I have an out of memory exception using C# when reading in a massive file
I need to change the code, but for the time being can I increase the heap size (like I would in Java) as a short-term fix?
.Net does that automatically.
Looks like you have reached the limit of the memory one .NET process can use for its objects (on a 32-bit machine this is 2 GB by default, or 3 GB using the /3GB boot switch. Credits to Leppie & Eric Lippert for the info).
Rethink your algorithm, or perhaps a change to a 64 bit machine might help.
No, this is not possible. This problem might occur because you're running on a 32-bit OS and memory is too fragmented. Try not to load the whole file into memory (for instance, by processing line by line) or, when you really need to load it completely, by loading it in multiple, smaller parts.
No, you can't; see my answer here: Is there any way to pre-allocate the heap in the .NET runtime, like -Xmx/-Xms in Java?
For reading large files it is usually preferable to stream them from disk, reading them in chunks and dealing with them a piece at a time instead of loading the whole thing up front.
As others have already pointed out, this is not possible. The .NET runtime handles heap allocations on behalf of the application.
In my experience .NET applications commonly suffer from OOM when there should be plenty of memory available (or at least, so it appears). The reason for this is usually the use of huge collections such as arrays, List (which uses an array to store its data) or similar.
The problem is that these types will sometimes create peaks in memory use. If those peak requests cannot be honored, an OOM exception is thrown. E.g. when a List needs to increase its capacity, it does so by allocating a new array of double the current size and then copying all the references/values from one array to the other. Similarly, operations such as ToArray make a new copy of the array. I've also seen similar problems with big LINQ operations.
Each array is stored as contiguous memory, so to avoid an OOM the runtime must be able to obtain one big chunk of memory. As the address space of the process may be fragmented due to both DLL loading and general heap use, this is not always possible, in which case an OOM exception is thrown.
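One way to soften those peaks, as a hedged sketch (the element type and count are placeholders): give large collections their capacity up front so the double-and-copy growth step never happens.

using System.Collections.Generic;

// Sketch: pre-sizing the List means a single up-front allocation instead of
// repeated "allocate double the size, copy everything across" growth steps.
int expectedCount = 5000000;                 // assumed to be known from the input
var items = new List<double>(expectedCount);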
What sort of file are you dealing with?
You might be better off using a StreamReader and yield returning the ReadLine result, if it's textual.
Sure, you'll be keeping a file-pointer around, but the worst case scenario is massively reduced.
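A hedged sketch of that pattern (the method name and file path are placeholders):

using System.Collections.Generic;
using System.IO;

// Sketch: yield-return each line so only one line is held in memory at a time
// while the caller enumerates the file.
static IEnumerable<string> ReadLines(string path)
{
    using (var reader = new StreamReader(path))
    {
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            yield return line;
        }
    }
}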
There are similar approaches for binary files; if you're uploading a file to SQL, for example, you can read a byte[] buffer at a time and use the SQL pointer mechanics to write each buffer to the end of a blob.
Explicitly checking/handling that you don't hit the 2^31 - 1 (?) maximum number of entries when adding to a C# List is craziness, true or false?
(Assuming this is an app where the average List size is less than 100.)
1. Memory limits
Well, the size of a System.Object without any fields is 8 bytes (two 32-bit pointers), or 16 bytes on a 64-bit system. [EDIT:] Actually, I just checked in WinDbg, and the size is 12 bytes on x86 (32-bit).
So on a 32-bit system, 2^31 such objects would need about 24 GB of RAM (which you cannot have on a 32-bit system).
2. Program design
I strongly believe that such a large list shouldn't be held in memory, but rather in some other storage medium. In that case you always have the option of creating a caching class that wraps a List and handles the actual storage under the hood. So checking the size before adding is the wrong place to do the test; your List implementation should do it itself if you find it necessary one day.
3. To be on the safe side
Why not add a re-entrance counter inside each method to prevent a Stack Overflow? :)
So, yes, it's crazy to test for that. :)
Seems excessive. Would you not hit the machine's memory limit first, depending on the size of the objects in your list? (I assume this check is performed by the user of the List class, and is not a check inside the implementation?)
Perhaps it's reassuring that colleagues are thinking ahead though ? (sarcasm!)
It would seem so, and I probably wouldn't include the check, but I'm conflicted on this. Programmers once thought that two digits were enough to represent the year in date fields on the grounds that it was fine for the expected life of their code; however, we discovered that this assumption wasn't correct.
Look at the risk, look at the effort and make a judgement call (otherwise known as an educated guess! :-) ). I wouldn't say there's any hard or fast rule on this one.
As in the answer above, I suspect more things would go wrong before that limit ever became a problem. But yes, if you have the time and inclination, you can polish code till it shines!
True
(well you asked true or false..)
Just tried this code:
List<int> list = new List<int>();
while (true) list.Add(1);
I got a System.OutOfMemoryException. So what would you do to check / handle this?
If you keep adding items to the list, you'll run out of memory long before you hit that limit. By "long" I really mean "a lot sooner than you think".
See this discussion on the large object heap (LOH). Once you hit around 21,500 elements (half that on a 64-bit system, assuming you're storing object references), your list's backing array becomes a large object. Since the LOH isn't compacted the way the normal .NET heaps are, you'll eventually fragment it badly enough that a large enough contiguous memory area cannot be allocated.
So you don't have to check for that limit at all, it's not a real limit.
Yes, that is crazyness.
Consider what happens to the rest of the code when you start to reach those numbers. Is the application even usable with millions of items in the list?
If it's even possible for the application to reach that amount of data, perhaps you should instead take measures to keep the list from getting that large. Perhaps you should not even keep all the data in memory at once. I can't really imagine a scenario where any code could practically make use of that much data.