I have an out of memory exception using C# when reading in a massive file
I need to change the code, but for the time being, can I increase the heap size (like I would in Java) as a short-term fix?
.Net does that automatically.
Looks like you have reached the limit of the memory one .NET process can use for its objects (on a 32-bit machine this is 2 GB by default, or 3 GB by using the /3GB boot switch. Credits to Leppie & Eric Lippert for the info).
Rethink your algorithm, or perhaps moving to a 64-bit machine might help.
No, this is not possible. This problem might occur because you're running on a 32-bit OS and memory is too fragmented. Try not to load the whole file into memory (for instance, by processing line by line) or, when you really need to load it completely, by loading it in multiple, smaller parts.
No, you can't; see my answer here: Is there any way to pre-allocate the heap in the .NET runtime, like -Xmx/-Xms in Java?
For reading large files it is usually preferable to stream them from disk, reading them in chunks and dealing with them a piece at a time instead of loading the whole thing up front.
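A minimal sketch of that chunked approach; the path and chunk size below are placeholders:
using System.IO;

class ChunkedReader
{
    static void Main()
    {
        const int ChunkSize = 64 * 1024;           // 64 KB per read; tune as needed
        byte[] buffer = new byte[ChunkSize];       // one reusable buffer, no per-iteration allocation

        using (FileStream stream = File.OpenRead(@"C:\data\massive-file.bin"))   // placeholder path
        {
            int bytesRead;
            while ((bytesRead = stream.Read(buffer, 0, buffer.Length)) > 0)
            {
                // Process buffer[0..bytesRead) here instead of holding the whole file in memory.
            }
        }
    }
}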
As others have already pointed out, this is not possible. The .NET runtime handles heap allocations on behalf of the application.
In my experience .NET applications commonly suffer from OOM when there should be plenty of memory available (or at least, so it appears). The reason for this is usually the use of huge collections such as arrays, List<T> (which uses an array to store its data) or similar.
The problem is that these types will sometimes create peaks in memory use. If these peak requests cannot be honored, an OOM exception is thrown. E.g. when a List<T> needs to increase its capacity, it does so by allocating a new array of double the current size and then copying all the references/values from one array to the other. Similarly, operations such as ToArray make a new copy of the array. I've also seen similar problems with big LINQ operations.
Each array is stored as contiguous memory, so to avoid OOM the runtime must be able to obtain one big chunk of memory. As the address space of the process may be fragmented due to both DLL loading and general use for the heap, this is not always possible in which case an OOM exception is thrown.
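For example, pre-sizing a List<T> avoids the copy-and-double peaks described above; the element count below is purely illustrative:
using System.Collections.Generic;

class ListCapacityExample
{
    static void Main()
    {
        const int count = 10000000;   // illustrative element count

        // Growing from empty: every time Count exceeds Capacity, List<T> allocates a new array
        // of roughly double the size and copies the old one across, so memory briefly peaks
        // at old array + new array.
        var grown = new List<int>();
        for (int i = 0; i < count; i++)
            grown.Add(i);

        // Pre-sizing avoids the intermediate arrays and the copy-time peak.
        var presized = new List<int>(count);
        for (int i = 0; i < count; i++)
            presized.Add(i);
    }
}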
What sort of file are you dealing with?
You might be better off using a StreamReader and yield returning the ReadLine result, if it's textual.
Sure, you'll be keeping a file-pointer around, but the worst case scenario is massively reduced.
There are similar approaches for binary files; if you're uploading a file to SQL Server, for example, you can read into a byte[] buffer and use the SQL pointer mechanics to write the buffer to the end of a blob.
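For the textual case, here is a minimal sketch of the StreamReader/yield-return approach (the log path in the usage comment is a placeholder):
using System.Collections.Generic;
using System.IO;

static class LineStreamer
{
    // Yields one line at a time, so only the current line needs to be in memory.
    public static IEnumerable<string> ReadLines(string path)
    {
        using (StreamReader reader = new StreamReader(path))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
                yield return line;
        }
    }
}

// Usage: foreach (string line in LineStreamer.ReadLines(@"C:\data\huge.log")) { ... }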
Related
In my Azure role running C# code inside a 64 bit process I want to download a ZIP file and unpack it as fast as possible. I figured I could do the following: create a MemoryStream instance, download to that MemoryStream, then pass the stream to some ZIP handling library for unpacking and once unpacking is done discard the stream. This way I would get rid of write-read-write sequence that unnecessarily performs a lot of I/O.
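A minimal sketch of that approach, assuming .NET 4.5's ZipArchive is available and a plain WebRequest is used for the download; the URL and target folder are placeholders:
using System.IO;
using System.IO.Compression;   // ZipArchive; requires .NET 4.5 (System.IO.Compression / .FileSystem)
using System.Net;

class InMemoryUnzip
{
    static void Main()
    {
        WebRequest request = WebRequest.Create("https://example.com/archive.zip");   // placeholder URL

        using (WebResponse response = request.GetResponse())
        using (Stream download = response.GetResponseStream())
        using (var zipBytes = new MemoryStream())
        {
            download.CopyTo(zipBytes);    // the whole ZIP ends up in one large in-memory byte[]
            zipBytes.Position = 0;

            using (var archive = new ZipArchive(zipBytes, ZipArchiveMode.Read))
            {
                archive.ExtractToDirectory(@"C:\unpacked");   // placeholder target folder
            }
        }
    }
}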
However, I've read that MemoryStream is backed by an array, and with half a gigabyte of data that array will definitely be considered a "large object" and will be allocated on the large object heap, which doesn't compact on garbage collection. That makes me worried that this usage of MemoryStream will lead to fragmenting the process memory and negative long-term effects.
Will this likely have any long-term negative effects on my process?
The answer is in the accepted answer to the question you linked to. Thanks for providing the reference.
The real problem is assuming that a program should be allowed to consume all virtual memory at any time - a problem that otherwise disappears completely by just running the code on a 64-bit operating system.
I would say if this is a 64 bit process you have nothing to worry about.
The hole that is created only leads to fragmentation of the virtual address space of the LOH. Fragmentation here isn't a big problem for you. In a 64 bit process any whole pages wasted due to fragmentation will just become unused and the physical memory they were mapped to becomes available again to map a new page. Very few partial pages will be wasted because these are large allocations. And locality of reference (the other advantage of defragmentation) is mostly preserved, again because these are large allocations.
I need to be able to look up this data quickly and need access to all of this data. Unfortunately, I also need to conserve memory (several of these will cause OutOfMemoryExceptions).
short[,,] data = new short[8000,8000,2];
I have attempted the following:
tried jagged array - same memory problems
tried breaking into smaller arrays - still get memory issues
Is the only resolution to map this data efficiently using a memory-mapped file, or is there some other way to do this?
How about a database? After all, they are made for this.
I'd suggest you take a look at some NoSQL database. Depending on your needs, there are also in-memory databases [which obviously could suffer from the same out-of-memory problem] and databases that can be copy deployed or linked to your application.
I wouldn't want to mess with the storage details manually, and memory-mapping files is what some databases (at least MongoDB) are doing internally. So essentially, you'd be rolling your own DB, and writing a database is not trivial -- even if you narrow down the use case.
Redis or Membase sound like suitable alternatives for your problem. As far as I can see, both are able to manage the RAM utilization for you, that is, read data from the disk as needed and cache data in RAM for fast access. Of course, your access patterns will play a role here.
Keep in mind that a lot of effort went into building these DBs. According to Wikipedia, Zynga is using Membase and Redis is sponsored by VMWare.
Are you sure you need access to all of it all of the time? ...or could you load a portion of it, do your processing then move onto the next?
Could you get away with using mip-mapping or LoD representations if it's just height data? Both of those could allow you to hold lower resolutions until you need to load up specific chunks of the higher resolution data.
How much free memory do you have on your machine? What operating system are you using? Is it 64 bit?
If you're doing memory / processing intensive operations, have you considered implementing those parts in C++ where you have greater control over such things?
It's difficult to help you much further without knowing more specifics of your system and what you're actually doing with your data.
I wouldn't recommend a traditional relational database if you're doing numeric calculations with this data. I suspect what you're running into here isn't the size of the data itself, but rather a known problem with .NET called Large Object Heap Fragmentation. If you're running into a problem after allocating these buffers frequently (even though they should be being garbage collected), this is likely your culprit. Your best solution is to keep as many buffers as you need pre-allocated and re-use them, to prevent the reallocation and subsequent fragmentation.
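A minimal sketch of the pre-allocate-and-reuse idea; the class and method names are hypothetical, and it assumes a single consumer at a time:
using System;

// Hypothetical sketch: allocate the big buffer once and reuse it, instead of newing up
// a fresh ~256 MB array for every pass over the data (which is what fragments the LOH).
static class HeightBufferPool
{
    private static readonly short[,,] Buffer = new short[8000, 8000, 2];   // allocated once, up front

    // Not thread-safe; one consumer at a time is assumed.
    public static short[,,] Rent()
    {
        Array.Clear(Buffer, 0, Buffer.Length);   // reset contents rather than reallocating
        return Buffer;
    }
}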
How are you interacting with this large multi-dimensional array? Are you using recursion? If so, make sure your recursive methods pass the array by reference rather than copying data by value.
On a side note, do you need 100% of this data accessible at the same time? The best way to deal with large volumes of data is usually via a stream, or some kind of reader object. Try to deal with the data in segments. I've got a few processes that deal with gigs' worth of data, and they can process it in a small amount of memory because of how I'm streaming it in via a SqlDataReader.
TL;DR: look at how you pass data between your function calls (by value vs. by reference) and consider streaming patterns to deal with the data in smaller chunks.
Hope that helps!
Individual short variables are often widened to 32 bits on the stack, but a short[] array stores its elements as packed 16-bit values, so the array above already takes 8000 × 8000 × 2 × 2 bytes ≈ 256 MB, and packing pairs of shorts into ints won't save you anything.
That is already pretty much the most efficient way of storing such an array in memory. What you can do then is:
Use a 64-bit machine. Then you can allocate a lot of memory, and the OS will take care of paging the data to disk for you if you run out of RAM (make sure you have a large enough page file). You can then address up to 8 terabytes of virtual memory (if you have a large enough disk).
Read parts of this data from disk as you need them manually using file IO, or using memory mapping.
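A minimal sketch of the memory-mapping option, using .NET 4.0's MemoryMappedFile; the file path and data layout are assumptions:
using System.IO.MemoryMappedFiles;

class MappedHeightData
{
    static void Main()
    {
        // Map a pre-built data file instead of holding all ~256 MB of shorts in managed memory.
        using (var mmf = MemoryMappedFile.CreateFromFile(@"C:\data\heights.bin"))   // placeholder path
        using (var accessor = mmf.CreateViewAccessor())
        {
            // Assumed layout: [x, y, channel] row-major, two shorts per (x, y) cell.
            int x = 1234, y = 5678, channel = 0;
            long elementIndex = (((long)x * 8000) + y) * 2 + channel;
            short value = accessor.ReadInt16(elementIndex * sizeof(short));
        }
    }
}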
I need to allocate very large arrays of simple structs (1 GB RAM). After a few allocations/deallocations the memory becomes fragmented and an OutOfMemory exception is thrown.
This is under 32 bit. I'd rather not use 64 bit due to the performance penalty I get - the same application runs 30% slower in 64 bit mode.
Are you aware of any IList-compatible array implementations which allocate memory in chunks rather than all at once? That would avoid my memory-fragmentation problem.
Josh Williams presented a BigArray<T> class on his blog using a chunked array:
BigArray<T>, getting around the 2GB array size limit
You will find more useful information in this related question:
C# huge size 2-dim arrays
A simple ad-hoc fix might be to mark your application as large-address aware. Doing so allows it to use more than the 2 GB per-process limit of 32-bit Windows. However, be aware that the maximum object size the CLR allows is still 2 GB. The flag can be enabled using a post-build action for your main executable:
call "$(DevEnvDir)..\tools\vsvars32.bat"
editbin.exe /LARGEADDRESSAWARE "$(TargetPath)"
When instantiating an array, .NET tries to find a contiguous block of memory for it. Since the total address space for a 32-bit app is 2 GB, you can see that it is going to be tough to find such a block after several allocations.
You can try using something like a LinkedList<T> to avoid the need for contiguous allocation, or restructure your code to make these chunks smaller (although you will not be completely safe from this happening with a 500 MB array either).
On the other hand, one solution would be to instantiate this large buffer only once, at the start of your app, and then implement an algorithm which would reuse this same space during you app's lifetime.
If you can use IEnumerable instead of IList to pass your data to the rest of your program, you would be able to flatten these chunks using the SelectMany LINQ method.
And at the end, you can simply implement the IList interface in a custom class, and use several smaller arrays under the hood.
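A minimal sketch of that last idea; only the indexer is shown and the chunk size is arbitrary, so treat it as a starting point rather than a complete IList<T> implementation:
using System;

// Chunked "big array": fixed-size blocks avoid one huge contiguous allocation.
class ChunkedArray<T>
{
    private const int ChunkSize = 1 << 20;   // 1M elements per chunk; tune as needed
    private readonly T[][] _chunks;

    public ChunkedArray(long length)
    {
        Length = length;
        _chunks = new T[(length + ChunkSize - 1) / ChunkSize][];
        for (int i = 0; i < _chunks.Length; i++)
            _chunks[i] = new T[Math.Min(ChunkSize, length - (long)i * ChunkSize)];
    }

    public long Length { get; private set; }

    public T this[long index]
    {
        get { return _chunks[index / ChunkSize][index % ChunkSize]; }
        set { _chunks[index / ChunkSize][index % ChunkSize] = value; }
    }
}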
Would LinkedList work for you?
http://msdn.microsoft.com/en-us/library/he2s3bh7.aspx
I have 10 threads writing thousands of small buffers (16-30 bytes each) to a huge file at random positions. Some of the threads throw an OutOfMemoryException on the FileStream.Write() operation.
What is causing the OutOfMemoryException ? What to look for?
I'm using the FileStream like this (for every written item - this code runs from 10 different threads):
using (FileStream fs = new FileStream(path, FileMode.OpenOrCreate, FileAccess.Write, FileShare.ReadWrite, BigBufferSizeInBytes, FileOptions.SequentialScan))
{
...
fs.Write();
}
I suspect that all the buffers allocated inside the FileStream don't get released in time by the GC. What I don't understand is why the CLR, instead of throwing, doesn't just run a GC cycle and free up all the unused buffers?
If ten threads are opening files as your code shows, then you have a maximum of ten undisposed FileStream objects at any one time. Yes, FileStream does have an internal buffer, the size of which you specify with "BigBufferSizeInBytes" in your code. Could you please disclose the exact value? If this is big enough (e.g. ~100MB) then it could well be the source of the problem.
By default (i.e. when you don't specify a number upon construction), this buffer is 4 kB and that is usually fine for most applications. In general, if you really care about disk write performance, then you might increase this to a couple of hundred kB, but not more.
However, for your specific application doing so wouldn't make much sense, as said buffer will never contain more than the 16-30 bytes you write into it before you Dispose() the FileStream object.
To answer your question, an OutOfMemoryException is thrown only when the requested memory can't be allocated after a GC has run. Again, if the buffer is really big then the system could have plenty of memory left, just not a contiguous chunk. This is because the large object heap is never compacted.
I've reminded people about this one a few times, but the large object heap can throw that exception fairly subtly, when seemingly you have plenty of available memory or the application is running OK.
I've run into this issue fairly frequently when doing almost exactly what you're describing here.
You need to post more of your code to answer this question properly. However, I'm guessing it could also be related to a potential Halloween problem (Spooky Dooky).
The buffer you are reading into may also be the problem (again, large-object-heap related); again, you need to give more details about what's going on in that loop. I've just nailed down the last bug I had, which was virtually identical (I am performing many parallel hash updates, which all require independent state to be maintained across reads of the input file)...
Oops! I just scrolled over and noticed "BigBufferSizeInBytes"; I'm leaning towards the large object heap again...
If I were you (and this is exceedingly difficult due to the lack of context), I would provide a small dispatch "mbuf" where you copy in and out, instead of allowing all of your disparate threads to individually read across your large backing array (i.e. it's hard not to cause incidental allocations with very subtle code syntax).
Buffers aren't generally allocated inside the FileStream. Perhaps the problem is the line "writing thousands of small buffers" - do you really mean that? Normally you re-use a buffer many, many, many times (i.e. on different calls to Read/Write).
Also - is this a single file? A single FileStream is not guaranteed to be thread safe... so if you aren't doing synchronization, expect chaos.
It's possible that these limitations arise from the underlying OS, and that the .NET Framework is powerless to overcome these kind of limitations.
What I cannot deduce from your code sample is whether you open up a lot of these FileStream objects at the same time, or open them really fast in sequence. Your use of the 'using' keyword will make sure that the files are closed after the fs.Write() call. There's no GC cycle required to close the file.
The FileStream class is really geared towards sequential read/write access to files. If you need to quickly write to random locations in a big file, you might want to take a look at using virtual file mapping.
Update: It seems that virtual file mapping will not be officially supported in .NET until 4.0. You may want to take a look at third party implementations for this functionality.
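For reference, a minimal sketch of writing a small buffer at an arbitrary offset through .NET 4.0's MemoryMappedFile; the path, file size, and offset are placeholders:
using System.IO;
using System.IO.MemoryMappedFiles;

class RandomPositionWriter
{
    static void Main()
    {
        const long fileSize = 10L * 1024 * 1024 * 1024;   // placeholder: pre-size the huge file at 10 GB

        using (var mmf = MemoryMappedFile.CreateFromFile(@"C:\data\huge.dat",
                                                         FileMode.OpenOrCreate, null, fileSize))
        using (var view = mmf.CreateViewAccessor())
        {
            byte[] payload = new byte[24];    // one of the 16-30 byte buffers
            long position = 123456789;        // placeholder offset; in practice one of the random positions
            view.WriteArray(position, payload, 0, payload.Length);
        }
    }
}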
Dave
I'm experiencing something similar and wondered if you ever pinned down the root of your problem?
My code does quite a lot of copying between files, passing quite a few megabytes between different byte[] buffers. I've noticed that whilst the process memory usage stays within a reasonable range, the system memory allocation shoots up way too high during the copying - much more than is being used by my process.
I've tracked the issue down to the FileStream.Write() call - when this line is taken out, memory usage seems to go as expected. My BigBufferSizeInBytes is the default (4k), and I can't see anywhere where these could be collecting...
Anything you discovered whilst looking at your problem would be gratefully received!
I am doing some calculations that require a large array to be initialized. The maximum size of the array determines the maximum size of the problem I can solve.
Is there a way to programmatically determine how much memory is available for say, the biggest array of bytes possible?
Thanks
Well, relying on a single huge array has a range of associated issues - memory fragmentation, contiguous blocks, the limit on the maximum object size, etc. If you need a lot of data, I would recommend creating a class that simulates a large array using lots of smaller (but still large) arrays, each of fixed size - i.e. the indexer divides to find the appropriate array, then uses % to get the offset inside that array.
You might also want to ensure you are on a 64-bit OS, with lots of memory. This will give you the maximum available head-room.
Depending on the scenario, more sophisticated algorithms such as sparse arrays, eta-vectors, etc might be of use to maximise what you can do. You might be amazed what people could do years ago with limited memory, and just a tape spinning backwards and forwards...
In order to ensure you have enough free memory you could use a MemoryFailPoint. If the memory cannot be allocated, then an InsufficientMemoryException will be generated, which you can catch and deal with in an appropriate way.
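A minimal sketch of that pattern; the 256 MB request is purely illustrative:
using System;
using System.Runtime;

class FailPointExample
{
    static void Main()
    {
        try
        {
            // Reserve a 256 MB "budget" before committing to the allocation (size is illustrative).
            using (new MemoryFailPoint(256))
            {
                byte[] big = new byte[256 * 1024 * 1024];
                big[big.Length - 1] = 1;   // touch the array so the allocation is real
                // ... work with the array ...
            }
        }
        catch (InsufficientMemoryException ex)
        {
            Console.WriteLine("Not enough memory for this problem size: " + ex.Message);
        }
    }
}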
The short answer is "no". There are two top-level resources that would need to be queried:
The largest block of unallocated virtual address space available to the process
The amount of available page file space.
As Marc Gravell correctly stated, you will have your best success on a 64-bit platform. Here, each process has a huge virtual address space. This will effectively solve your first problem. You should also make sure the page file is large.
But, there is a better way that is limited only by the free space on your disk: memory mapped files. You can create a large mapping (say 512MB) into an arbitrarily large file and move it as you process your data. Note, be sure to open it for exclusive access.
If you need really, really big arrays, don't use the CLR. Mono supports 64-bit array indexes, allowing you to fully take advantage of your memory resources.
I suppose binary search could be a way to go. First, start by allocating 1 byte; if that succeeds, drop the reference and double the request to 2 bytes. Keep going until an allocation fails, and you have found a limit you can treat as the lower bound.
The correct number of bytes that can be allocated (let's call it x) is within the interval lower ≤ x < 2 * lower. Continue narrowing this interval using binary search.
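A rough sketch of that probing strategy; the result is only an estimate, since fragmentation and GC timing vary between runs, and a single array is capped at 2 GB anyway:
using System;

class MaxAllocationProbe
{
    // Returns an estimate of the largest byte[] that can currently be allocated.
    static long ProbeLargestArray()
    {
        long lower = 1;

        // Doubling phase: grow the request until an allocation fails.
        while (TryAllocate(lower * 2))
            lower *= 2;

        // Binary search between the last success and the first failure.
        long low = lower, high = lower * 2;
        while (high - low > 1)
        {
            long mid = low + (high - low) / 2;
            if (TryAllocate(mid)) low = mid; else high = mid;
        }
        return low;
    }

    static bool TryAllocate(long bytes)
    {
        if (bytes > int.MaxValue) return false;   // a single byte[] is limited to ~2 GB anyway
        try
        {
            byte[] probe = new byte[bytes];
            probe[0] = 1;   // touch it so the allocation is real
            return true;
        }
        catch (OutOfMemoryException)
        {
            return false;
        }
    }

    static void Main()
    {
        Console.WriteLine("Largest allocatable byte[]: ~" + ProbeLargestArray() + " bytes");
    }
}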
The biggest single array one can allocate in a 64-bit .NET program is 2 GB (see another reference).
You could find out how many available bytes there are easily enough:
Using pc As New System.Diagnostics.PerformanceCounter("Memory", "Available Bytes")
    Dim freeBytes As Single = pc.NextValue()
End Using
Given that information you should be able to make your decision.