OutOfMemoryException - Strategies to overcome this problem? - c#

First of all, I am aware that this question has been discussed many times in this forum, such as
Large array C# OutOfMemoryException and
OutOfMemoryException
The object I am having problems with is
Dictionary<long, Double> results
which stores an ID as a long and a calculation result as a Double.
I will have to reuse the same object about 10-20 times; every time I reuse it, I call
results = new Dictionary<long, Double>();
I know that I can write it to a text file or database file for further processing, but if possible I would like to avoid that, as it is far too slow for the amount of data I'm handling. I have also tried GC.Collect(), but no luck with that.
Can anyone with some previous experience give some pointer on this?
Edit: I have > 3 million entries in the dictionary, but they are fixed (i.e. the keys are the same in all iterations).

Ah - no. It also makes little sense to get out-of-memory exceptions from those calls.
I STRONGLY suggest you get serious about analysis - put a memory profiler on the program and find the real problem. A long/double combo makes zero sense unless you store some hundred million pairs, and even then....
And: a move to 64 bit is always wise. The 2/3 GB per-process limit is harder to reach fully on .NET due to GC "overhead" - it is practically impossible to use up all that memory. 64 bit has much higher limits.
But again, your indication is wrong. The new Dictionary is likely NOT the error at all; something else is wasting your memory.

If the issue is simply that the memory isn't being freed as expected, perhaps use .Clear() on the dictionary rather than re-creating it every time?

Rather than creating 20 different instances, use one, but clear it (which allows the GC to collect the old elements) so that you have more memory to work with. Also, moving to a 64-bit environment might be wise if you require huge amounts of memory.
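To make that concrete, here is a minimal sketch of the reuse pattern both answers describe (the pass count and the fill/consume steps are placeholders for the asker's actual calculation):

using System.Collections.Generic;

class ReuseSketch
{
    static void Main()
    {
        // Allocate once, pre-sized for the ~3 million known keys, then reuse.
        var results = new Dictionary<long, double>(3000000);

        for (int pass = 0; pass < 20; pass++)
        {
            results.Clear();   // empties the dictionary but keeps its internal buckets allocated
            // ... fill `results` for this pass and process it ...
        }
    }
}

Since the keys are identical in every iteration, simply overwriting results[key] = value in each pass would avoid even the Clear() call.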

Related

Optimizing the storage and processing of large series of data in .NET

I have a case here that I would like to have some opinions from the experts :)
Situation:
I have a data structure with `Int32` and `Double` values, with a total size of 108 bytes.
I have to process a large series of this data structure. It's something like this (conceptual; I will use a for loop instead):
double result = 0;
foreach (Item item in series)
{
    result += Calculate(item);   // placeholder: some calculation based on item
}
I expect the size of the series to be about 10 MB.
To be useful, the whole series must be processed. It's all or nothing.
The series data will never change.
My requirements:
Memory consumption is not an issue. I think that nowadays, if the user doesn't have a few dozen MB free on his machine, he probably has a deeper problem.
Speed is a concern. I want the iteration to be as fast as possible.
No unmanaged code, or interop, or even unsafe.
What I would like to know
Implement the item data structure as a value or reference type? From what I know, value types are cheaper, but I imagine that on each iteration a copy will be made for each item if I use a value type. Is this copy faster than a heap access?
Any real problem if I implement the accessors as auto-implemented properties? I believe this will increase the footprint. But also that the getter will be inlined anyway. Can I safely assume this?
I'm seriously considering creating a very large static readonly array of the series directly in code (it's rather easy to do this with the data source). This would give me a 10 MB assembly. Any reason why I should avoid this?
Hope someone can give me a good opinion on this.
Thanks
Implement the item data structure as a value or reference type? From what I know, value types are cheaper, but I imagine that on each iteration a copy will be made for each item if I use a value type. Is this copy faster than a heap access?
Code it both ways and profile it aggressively on real-world input. Then you'll know exactly which one is faster.
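By way of illustration only, here is a crude Stopwatch sketch of the kind of measurement being suggested (ItemStruct and ItemClass are hypothetical stand-ins for the asker's 108-byte item):

using System;
using System.Diagnostics;

class ProfileBothWays
{
    struct ItemStruct { public double Value; }   // value-type variant (stand-in)
    class ItemClass { public double Value; }     // reference-type variant (stand-in)

    static void Main()
    {
        const int n = 100000;
        var structs = new ItemStruct[n];
        var classes = new ItemClass[n];
        for (int i = 0; i < n; i++) classes[i] = new ItemClass();

        var sw = Stopwatch.StartNew();
        double sumStructs = 0;
        foreach (ItemStruct item in structs) sumStructs += item.Value;  // copies each struct
        Console.WriteLine("structs: {0} ticks", sw.ElapsedTicks);

        sw.Restart();
        double sumClasses = 0;
        foreach (ItemClass item in classes) sumClasses += item.Value;   // follows each reference
        Console.WriteLine("classes: {0} ticks", sw.ElapsedTicks);

        Console.WriteLine(sumStructs + sumClasses);   // keep the results observable
    }
}

Run it in Release mode on realistic data sizes; Debug-mode numbers are close to meaningless for this kind of question.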
Any real problem if I implement the accessors as auto-implemented properties?
Real problem? No.
I believe this will increase the footprint. But also that the getter will be inlined anyway. Can I safely assume this?
You can only safely assume things guaranteed by the spec. It's not guaranteed by the spec.
I'm seriously considering creating a very large static readonly array of the series directly in code (it's rather easy to do this with the data source). This would give me a 10 MB assembly. Any reason why I should avoid this?
I think you're probably worrying about this too much.
I'm sorry if my answer seems dismissive. You're asking random people on the Internet to speculate which of two things is faster. We can guess, and we might be right, but you could just code it both ways in the blink of an eye and know exactly which is faster. So, just do it?
However, I always code for correctness, readability and maintainability at first. I establish reasonable performance requirements up front, and I see if my implementation meets them. If it does, I move on. If I need more performance from my application, I profile it to find the bottlenecks and then I start worrying.
You're asking about a trivial computation that takes ~10,000,000 / 108 ~= 100,000 iterations. Is this even a bottleneck in your application? Seriously, you are overthinking this. Just code it and move on.
That's 100,000 loops, which in CPU time is sod all. Stop overthinking it and just write the code. You're making a mountain out of a molehill.
Speed is subjective. How do you load your data and how much data is inside your process elsewhere? Loading the data will be the slowest part of your app if you do not need complex parsing logic to create your struct.
I think you are asking this question because you have a struct 108 bytes in size which you perform calculations on, and you wonder why your app is slow. Please note that structs are passed by value. This means that if you pass the struct to one or more methods during your calculations, or fetch it from a List, you create a copy of the struct every time. This is indeed very costly.
Change your struct to a class and expose only getters, to be sure you have a read-only object. That should fix your perf issues.
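A sketch of the copying cost being described, using a hypothetical struct whose fields total roughly 108 bytes (13 doubles plus an int; all names illustrative):

using System.Collections.Generic;

struct BigItem   // hypothetical layout totalling ~108 bytes
{
    public int Id;
    public double V01, V02, V03, V04, V05, V06, V07,
                  V08, V09, V10, V11, V12, V13;
}

static class CopyCost
{
    // Pass by value: all ~108 bytes are copied on every call.
    static double ByValue(BigItem item) { return item.V01; }

    // Pass by ref: only a reference-sized address is copied.
    static double ByRef(ref BigItem item) { return item.V01; }

    static double Process(List<BigItem> series)
    {
        double total = 0;
        for (int i = 0; i < series.Count; i++)
        {
            BigItem copy = series[i];   // List<T>'s indexer returns a copy of the struct
            total += ByValue(copy);     // and this call copies it again
        }
        return total;
    }
}

The class-based alternative the answer recommends replaces both copies with cheap reference handoffs, at the price of one heap allocation per item.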
A good practice is to separate data from code, so regarding your "big array embedded in the code question", I say don't do that.
Use LINQ for calculations on entire series; the speed is good (see the sketch below).
Use a Node class for each point if you want more functionality.
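As an illustration of the LINQ suggestion above (Item and the formula are placeholders):

using System.Collections.Generic;
using System.Linq;

class Item { public double Value; }

static class SeriesSum
{
    static double Total(IEnumerable<Item> series)
    {
        // Enumerates the series once; the lambda is the per-item calculation.
        return series.Sum(item => item.Value * 2.0);   // placeholder formula
    }
}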
I used to work with such large series of data. They were points that you plot on a graph. Originally they were taken every ms or less. The datasets were huge. Users wanted to apply different formulas to these series and have the results displayed. It looks to me like your problem might be similar.
To improve speed we stored different zoom levels of the points in a db. Say every ms, then aggregated for every minute, every hour, every day, etc. (whatever users needed). When users zoomed in or out we would load the new values from the db instead of performing the calculations right then. We would also cache the values so users didn't have to go to the db all the time.
Also, if the users wanted to apply some formulas to the series (as in your case), the aggregated data is smaller in size, so there is less to process.

List vs. Dictionary (Maximum Size, Number of Elements)

I am attempting to ascertain the maximum sizes (in RAM) of a List and a Dictionary. I am also curious as to the maximum number of elements / entries each can hold, and their memory footprint per entry.
My reasons are simple: I, like most programmers, am somewhat lazy (this is a virtue). When I write a program, I like to write it once, and try to future-proof it as much as possible. I am currently writing a program that uses Lists, but noticed that the indexer wants an integer. Since the capabilities of my program are only limited by available memory / coding style, I'd like to write it so I can use a List with Int64s or possibly BigInts (as the indexes). I've seen IEnumerable as a possibility here, but would like to find out if I can just stuff an Int64 into a Dictionary object as the key, instead of rewriting everything. If I can, I'd like to know what the cost of that might be compared to rewriting it.
My hope is that should my program prove useful, I need only hit recompile in 5 years time to take advantage of the increase in memory.
Is it specified in the documentation for the class? No? Then it's unspecified.
In terms of current implementations, there's no maximum size in RAM in the classes themselves. If you create a value type that's 2 MB in size, push a few thousand into a list, and get an out-of-memory exception, that has nothing to do with List<T>.
Internally, List<T>'s workings would prevent it from ever having more than 2 billion items. It's harder to come to a quick answer with Dictionary<TKey, TValue>, since the way things are positioned within it is more complicated, but really, if I were looking at dealing with a billion items (of a 32-bit value, for example, so 4 GB), I'd be looking to store them in a database and retrieve them using data-access code.
At the very least, once you're dealing with a single data structure that's 4 GB in size, rolling your own custom collection class no longer counts as reinventing the wheel.
I am using a ConcurrentDictionary to rank 3x3 patterns in half a million games of Go. Obviously there are a lot of possible patterns. With C# 4.0 the ConcurrentDictionary runs out of memory at around 120 million objects. It is using 8 GB at that point (on a 32 GB machine) but wants to grow far too much, I think (table growths happen in large chunks with ConcurrentDictionary). Using a database would slow me down at least a hundredfold, I think, and the process is already taking 10 hours.
My solution was a multiphase approach, doing multiple passes, one for each subset of patterns: for example, one pass for odd patterns and one for even patterns. When using more objects no longer fails, I can reduce the number of passes.
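A hedged sketch of that multiphase idea, assuming pattern codes are non-negative longs and the pass count is tunable (all names illustrative):

using System.Collections.Concurrent;
using System.Collections.Generic;

static class MultiPassRanking
{
    static void RankInPasses(IEnumerable<long> patterns, int passCount)
    {
        for (int pass = 0; pass < passCount; pass++)
        {
            // Each pass only holds 1/passCount of the key space in memory.
            var counts = new ConcurrentDictionary<long, int>();
            foreach (long pattern in patterns)
            {
                if (pattern % passCount != pass) continue;   // skip other subsets
                counts.AddOrUpdate(pattern, 1, (key, old) => old + 1);
            }
            // ... rank/persist this subset, then let `counts` be collected ...
        }
    }
}

With passCount = 2 this is the odd/even split from the answer; raising it trades extra full scans of the input for a lower peak dictionary size.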
.NET 4.5 adds support for larger arrays on 64-bit by using unsigned 32-bit indexes for arrays
(the mentioned limit goes from 2 billion to 4 billion elements). See also
http://msdn.microsoft.com/en-us/library/hh285054(v=vs.110).aspx. Not sure which objects will benefit from this; List<> might.
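For reference, the switch being described is the documented gcAllowVeryLargeObjects element, enabled per-process in app.config (it only takes effect in 64-bit processes):

<configuration>
  <runtime>
    <!-- Permits arrays totalling more than 2 GB on 64-bit. -->
    <!-- Any single dimension is still capped at roughly 2^31 elements. -->
    <gcAllowVeryLargeObjects enabled="true" />
  </runtime>
</configuration>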
I think you have bigger issues to solve before even wondering whether a Dictionary with an Int64 key will be useful in 5 or 10 years.
Having a List or Dictionary of 2e+9 elements in memory (Int32) doesn't seem to be a good idea, never mind 9e+18 elements (Int64). Anyhow, the framework will not allow you to create a monster of that size (not even close), and probably never will. (Keep in mind that a simple int[int.MaxValue] array already far exceeds the framework's limit for the memory allocation of any single object.)
And the question remains: why would you ever want your application to hold a list of so many items in memory? You are better off using a specialized data storage backend (a database) if you have to manage that amount of information.

How do I know if I'm about to get an OutOfMemoryException?

I have a program that processes high volumes of data, and can cache much of it for reuse with subsequent records in memory. The more I cache, the faster it works. But if I cache too much, boom, start over, and that takes a lot longer!
I haven't been too successful trying to do anything after the exception occurs - I can't get enough memory to do anything.
Also I've tried allocating a huge object, then de-allocating it right away, with inconsistent results. Maybe I'm doing something wrong?
Anyway, what I'm stuck with is just setting a hardcoded limit on the # of cached objects that, from experience, seems to be low enough. Any better ideas? Thanks.
edit after answer
The following code seems to be doing exactly what I want:
Do
    Dim memFailPoint As MemoryFailPoint = Nothing
    Try
        ' Probe: size in MB of the several objects I'm about to add to the cache.
        memFailPoint = New MemoryFailPoint(mySize)
        memFailPoint.Dispose()
    Catch ex As InsufficientMemoryException
        ' Dump the oldest items here.
    End Try
    ' Do work.
Loop
I need to test if it is slowing things down in this arrangement or not, but I can see the yellow line in Task Manager looking like a very healthy sawtooth pattern with a consistent top - yay!!
You can use MemoryFailPoint to check for available memory before allocating.
You may need to think about your release strategy for the cached objects. There is no possible way you can hold all of them forever, so you need to come up with an expiration timeframe and have older cached objects removed from memory. It should be possible to find out how much memory is left and use that as part of your strategy, but one thing is certain: old objects must go.
If you implement your cache with WeakReferences (http://msdn.microsoft.com/en-us/library/system.weakreference.aspx), the cached objects remain eligible for garbage collection in situations where you might otherwise throw an OutOfMemoryException.
This is an alternative to a fixed-size cache, but it potentially has the problem of being overly aggressive in clearing out the cache whenever a GC does occur.
You might consider taking a hybrid approach, where a (tunable) fixed number of non-weak references is kept in the cache but you let it grow additionally with weak references. Or this may be overkill; a sketch follows.
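A minimal sketch of that hybrid, assuming the cached values are reference types (all names illustrative):

using System;
using System.Collections.Generic;

class HybridCache<TKey, TValue> where TValue : class
{
    private readonly int strongLimit;
    private readonly Queue<TValue> strongRefs = new Queue<TValue>();  // newest N stay uncollectible
    private readonly Dictionary<TKey, WeakReference> weakRefs =
        new Dictionary<TKey, WeakReference>();                        // the GC may reclaim these

    public HybridCache(int strongLimit) { this.strongLimit = strongLimit; }

    public void Add(TKey key, TValue value)
    {
        weakRefs[key] = new WeakReference(value);
        strongRefs.Enqueue(value);
        if (strongRefs.Count > strongLimit)
            strongRefs.Dequeue();   // the oldest item becomes eligible for collection
    }

    public TValue Get(TKey key)
    {
        WeakReference wr;
        if (weakRefs.TryGetValue(key, out wr))
            return wr.Target as TValue;   // null if the GC already reclaimed it
        return null;
    }
}

A real version would also purge dead WeakReference entries from the dictionary now and then.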
There are a number of metrics you can use to keep track of how much memory your process is using:
GC.GetTotalMemory
Environment.WorkingSet (This one isn't useful, my bad)
The native GlobalMemoryStatusEx function
There are also various properties on the Process class
The trouble is that there isn't really a reliable way of telling from these values alone whether or not a given memory allocation will fail: although there may be sufficient space in the address space for a given allocation, memory fragmentation means that the space may not be contiguous, and so the allocation may still fail.
You can however use these values as an indication of how much memory the process is using and therefore whether or not you should think about removing objects from your cache.
Update: It's also important to make sure that you understand the distinction between virtual memory and physical memory - unless your page file is disabled (very unlikely), an OutOfMemoryException is caused by a lack, or fragmentation, of the virtual address space.
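As an illustration, here is a sketch of polling those metrics as an eviction signal (the threshold is illustrative, not a recommendation):

using System;
using System.Diagnostics;

static class MemoryPressure
{
    static bool ShouldTrimCache()
    {
        long managedBytes = GC.GetTotalMemory(false);   // estimate of the managed heap
        using (Process self = Process.GetCurrentProcess())
        {
            Console.WriteLine("managed={0:N0} workingSet={1:N0} virtual={2:N0}",
                managedBytes, self.WorkingSet64, self.VirtualMemorySize64);

            // Illustrative: start evicting well below a 32-bit process's ~2 GB
            // address-space ceiling, since fragmentation causes failures earlier.
            return self.VirtualMemorySize64 > 1600L * 1024 * 1024;
        }
    }
}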
If you're only using managed resources you can use the GC.GetTotalMemory method and compare the results with the maximum allowed memory for a process on your architecture.
A more advanced solution (I think this is how SQL Server manages to actually adapt to the available memory) is to use the CLR Hosting APIs:
the interface allows the CLR to inform the host of the consequences of
failing a particular allocation
which will mean actually removing some objects from the cache and trying again.
Anyway, I think this is probably overkill for almost all applications, unless you really need extreme performance.
The simple answer... by knowing what your memory limit is.
The closer you are to that limit, the closer you ARE to getting an OutOfMemoryException.
The more elaborate answer... Unless you write a mechanism to do that kind of thing yourself, programming languages/systems do not work that way; as far as I know they cannot inform you ahead of time that you are about to exceed a limit. BUT they gladly inform you when the problem has occurred, and that usually happens through exceptions, which you are supposed to write code to handle.
Memory is a resource that you can use; it has limits and it also has some conventions and rules for you to follow to make good use of that resource.
I believe what you are already doing (setting a sensible limit, hard-coded or configurable) is your best bet.

How to free, or recycle, a string in C#?

I have a large string (e.g. 20 MB).
I am now parsing this string. The problem is that strings in C# are immutable; this means that once I've created a substring and looked at it, the memory is wasted.
Because of all the processing, memory is getting clogged up with String objects that I no longer use, need or reference; but it takes the garbage collector too long to free them.
So the application runs out of memory.
I could use the poorly performing club approach, and sprinkle a few thousand calls to:
GC.Collect();
everywhere, but that's not really solving the issue.
I know that StringBuilder exists for creating a large string.
I know that TextReader exists to read a String into a char array.
I need to somehow "reuse" a string, making it no longer immutable, so that I don't needlessly allocate gigabytes of memory when 1 KB will do.
If your application is dying, that's likely to be because you still have references to strings - not because the garbage collector is just failing to clean them up. I have seen it fail like that, but it's pretty unlikely. Have you used a profiler to check that you really do have a lot of strings in memory at a time?
The long and the short of it is that you can't reuse a string to store different data - it just can't be done. You can write your own equivalent if you like - but the chances of doing that efficiently and correctly are pretty slim. Now if you could give more information about what you're doing, we may be able to suggest alternative approaches that don't use so much memory.
This question is almost 10 years old. These days, please look at ReadOnlySpan<char> - instantiate one from the string using the AsSpan() method. Then you can apply index operators to get slices as spans without allocating any new strings.
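A small sketch of that span-based slicing (requires a runtime with Span support, e.g. .NET Core 2.1+; sizes and offsets are placeholders):

using System;

class SpanSlicing
{
    static void Main()
    {
        string big = new string('x', 20 * 1024 * 1024);   // stand-in for the 20 MB input

        ReadOnlySpan<char> all = big.AsSpan();
        ReadOnlySpan<char> piece = all.Slice(1024, 100);  // a view into the original, not a new string

        // Only materialize a string for the rare piece you actually keep.
        string kept = piece.ToString();
        Console.WriteLine(kept.Length);
    }
}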
Considering the fact that you cannot reuse strings in C#, I would suggest using Memory-Mapped Files. You simply save the string to disk and process it through the mapped file as a stream, with an excellent performance/memory-consumption relationship. You reuse the same file and the same stream, operating only on the smallest portion of the data (as a string) that you need at that precise moment, and then immediately throw it away.
This solution depends heavily on your project requirements, but I think it is one of the solutions you may seriously consider, as memory consumption will go down dramatically, though you will "pay" something in terms of performance.
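A sketch of that memory-mapped approach, assuming the text has first been written to a file (the path is a placeholder):

using System.IO;
using System.IO.MemoryMappedFiles;

class MappedParsing
{
    static void Main()
    {
        using (var mmf = MemoryMappedFile.CreateFromFile(@"C:\temp\big.txt", FileMode.Open))
        using (var stream = mmf.CreateViewStream())
        using (var reader = new StreamReader(stream))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                // Work on one small piece at a time; each line becomes
                // collectible as soon as the next one is read.
            }
        }
    }
}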
Do you have some sample code to test whether possible solutions would work well?
In general, though, any object that is bigger than 85 KB is going to be allocated on the Large Object Heap, which will probably be garbage collected less often.
Also, if you're really pushing the CPU hard, the garbage collector will likely perform its work less often, trying to stay out of your way.

Adding items to a List<T> / defensive programming

Explicitly checking/handling that you don't hit the 2^31 - 1 (?) maximum number of entries when adding to a C# List is craziness, true or false?
(Assume this is an app where the average List size is less than 100.)
1. Memory limits
Well, the size of a System.Object without any properties is 8 bytes (2 x 32-bit pointers), or 16 bytes on a 64-bit system. [EDIT:] Actually, I just checked in WinDbg, and the size is 12 bytes on x86 (32-bit).
So on a 32-bit system, you would need about 24 GB of RAM (which you cannot have on a 32-bit system).
2. Program design
I strongly believe that such a large list shouldn't be held in memory, but rather in some other storage medium. But in that case, you will always have the option to create a caching class wrapping a List, which would handle the actual storage under the hood. So testing the size before adding is done in the wrong place; your List implementation should do it itself if you find it necessary one day (a sketch follows).
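A sketch of what "the implementation should do it itself" could look like (entirely illustrative; a real version might spill to a database or disk instead of throwing):

using System;
using System.Collections.Generic;

class BoundedList<T>
{
    private readonly List<T> inner = new List<T>();
    private readonly int limit;

    public BoundedList(int limit) { this.limit = limit; }

    public void Add(T item)
    {
        // The size policy lives in one place instead of at every call site.
        if (inner.Count >= limit)
            throw new InvalidOperationException(
                "Limit reached - switch to database/disk-backed storage here.");
        inner.Add(item);
    }

    public T this[int index] { get { return inner[index]; } }
    public int Count { get { return inner.Count; } }
}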
3. To be on the safe side
Why not add a re-entrance counter inside each method to prevent a stack overflow while you're at it? :)
So, yes, it's crazy to test for that. :)
Seems excessive. Would you not hit the machine's memory limit first, depending on the size of the objects in your list? (I assume the check is performed by the user of the List class, and is not a check inside the implementation?)
Perhaps it's reassuring that colleagues are thinking ahead though ? (sarcasm!)
It would seem so, and I probably wouldn't include the check, but I'm conflicted on this. Programmers once thought that two digits were enough to represent the year in date fields, on the grounds that it was fine for the expected life of their code; we later discovered that this assumption wasn't correct.
Look at the risk, look at the effort and make a judgement call (otherwise known as an educated guess! :-) ). I wouldn't say there's any hard or fast rule on this one.
As in the answer above, I suspect more things would go wrong before that became a worry. But yes, if you have the time and inclination, you can polish code until it shines!
True
(well you asked true or false..)
Just tried this code:
List<int> list = new List<int>();
while (true) list.Add(1);
I got a System.OutOfMemoryException. So what would you do to check / handle this?
If you keep adding items to the list, you'll run out of memory long before you hit that limit. By "long" I really mean "a lot sooner than you think".
See this discussion of the large object heap (LOH). Once you hit around 21,500 elements (half that on a 64-bit system, assuming you're storing object references), your list's backing array becomes a large object. Since the LOH isn't compacted the way the normal .NET heaps are, you'll eventually fragment it badly enough that a large enough contiguous memory area cannot be allocated.
So you don't have to check for that limit at all, it's not a real limit.
Yes, that is craziness.
Consider what happens to the rest of the code when you start to reach those numbers. Is the application even usable if you would have millions of items in the list?
If it's even possible that the application would reach that amount of data, perhaps you should instead take measures to keep the list from getting that large. Perhaps you should not even keep all the data in memory at once. I can't really imagine a scenario where any code could practically make use of that much data.
