I have a set of large objects (over 20GB) that I need to access quickly from an application.
So far I have read these files from disk into RAM on application startup. This is an expensive task, as the files are deserialized into in-memory objects, but after the initial startup delay the objects can be accessed very quickly. Now, however, the files have grown too large to store in RAM.
I am now having to read part of the files from disk, deserializing them to memory, then discarding the used memory, reading the next files, and so on in a loop. This is very expensive computationally due to the deserialization.
Is there a way to have an "in-memory" object that points to a memory space stored on disk? This would be slower to access than if it were resident in RAM, but I suspect the slower disk access would still be faster than repeatedly deserializing the data.
Is there a way to do this?
The data btw is essentially a List of structs that need to be iterated over.
If it is essentially a list of structs, then yes: you can use memory mapped files here. The most effective way to do this would be to create a single huge view over the data (let the OS worry about mapping it and paging it as needed), and obtain and store the unmanaged pointer to the root (you can get that from MemoryMappedViewStream, but IIRC there are more direct ways to get it).
Now; two things you don't want to do:
constantly deal in unmanaged pointers
constantly copy the data
But: you can use ref T and Span<T> as your friend; System.Runtime.CompilerServices.Unsafe has facilities to hack between void* and ref T, and Span<T> can take a void*; this gives you two easy ways of working with struct data that is held in unmanaged memory.
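To make the Span&lt;T&gt;-over-mapped-memory idea concrete, here is a minimal sketch. The file name and the `Record` layout are invented for illustration (any blittable struct works), and note that a single `Span<T>` is capped at `int.MaxValue` elements, so a 20GB+ file has to be walked in windows rather than one span:

```csharp
using System;
using System.IO.MemoryMappedFiles;
using System.Runtime.InteropServices;

// Hypothetical record layout; any blittable struct works.
[StructLayout(LayoutKind.Sequential, Pack = 1)]
struct Record
{
    public int Id;
    public double Value;
}

static class Demo
{
    static unsafe void Main()
    {
        // "data.bin" is a placeholder for your serialized file of packed Records.
        long fileLength = new System.IO.FileInfo("data.bin").Length;
        using var mmf = MemoryMappedFile.CreateFromFile("data.bin", System.IO.FileMode.Open);
        using var view = mmf.CreateViewAccessor(0, fileLength);

        byte* ptr = null;
        view.SafeMemoryMappedViewHandle.AcquirePointer(ref ptr);
        try
        {
            // A Span<Record> directly over the unmanaged memory: no copying,
            // no per-access deserialization. For files larger than ~2G records,
            // take successive windows of the view instead of one span.
            int count = (int)(fileLength / sizeof(Record));
            var records = new Span<Record>(ptr, count);

            double sum = 0;
            foreach (ref readonly var r in records)
                sum += r.Value;   // the OS pages data in as you touch it
        }
        finally
        {
            view.SafeMemoryMappedViewHandle.ReleasePointer();
        }
    }
}
```

The `AcquirePointer`/`ReleasePointer` pair is the "more direct way" to get the unmanaged pointer from the view without going through `MemoryMappedViewStream`.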
Related
I am building a winrt metro app and always seem to run into this (non?) issue.
I find myself maintaining a lot of files of cached data that I can serialize back and forth. Data retrieved from services, user selected items and so on.
The question that I always seem to have when I write the calls is: is it opening (and releasing, etc.) the actual file that takes time/is expensive, or the amount of data that needs to be deserialized from it?

How much should I worry about, for example, trying to combine a couple of files that store the same object types into one, and then identifying the ones I need once I have the objects 'out'?
Did you ever get an insufficient-memory or out-of-memory exception?

WinRT lets an app use RAM and cached files up to around 70-80% of the device's memory; anything beyond that will crash the app. Once you navigate away from your page your resources are garbage collected, so that's not an issue, and if you wrap your memory streams in using blocks you'll also be fine. But saving large data and continuously fetching files from the database affects system memory, and since Surface tablets have a limited memory budget you should take a bit of care with large numbers of files :) I faced this while rendering bitmaps: loading around 100 bitmaps into memory simultaneously threw an insufficient-memory exception.
In C# I had to create my own dynamic memory management. For that reason I have created a static memory manager and a MappableObject. All objects that should be dynamically mappable and unmappable to and from the hard disk implement this interface.

This memory management is only done for these large objects, which have the ability to map/unmap their data to the hard disk. Everything else uses the regular GC, of course.

Every time a MappableObject is allocated it asks for memory. If no memory is available, the MemoryManager dynamically unmaps some data to the hard disk to free enough memory to allocate the new MappableObject.

A problem in my case is that I can have more than 100,000 MappableObject instances (scattered over a few files, ~10-20 files), and every time I need to unmap some data I have to run through a list of all objects. Is there a way to get all allocated objects that are created in my current instance?

In fact I don't know which is easier: keeping my own list, or enumerating the objects (if that is even possible). How would you solve such things?
Update
The reason is that I have a large amount of data, about 100GB, that I need to keep available during my run. Because I hold references to the data, the GC is not able to clean the memory. In fact C# manages memory pretty well, but in such memory-exhausting applications the GC gets really bad. Of course I tried to use MemoryFailPoint, but this slows down my allocations tremendously and does not give correct results for whatever reason. I have also tried MemoryMappedFiles, but since I have to access the data randomly they don't help. Also, MemoryMappedFiles only allow ~5000 file handles (on my system), and this is not enough.
Is there a ROT (Running Object Table) in .Net? The short answer is no.
You would have to maintain this information yourself.
Given your question update, could you not store your data in a database and use some sort of in-memory cache (perhaps with weak references or MFU, etc) to try and keep hot data close to you?
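If you do end up maintaining that information yourself, one possible sketch is a static registry that every MappableObject adds itself to via WeakReference, so the registry never keeps objects alive and the GC can still collect unreferenced ones. The interface and type names below are assumptions, not your actual API:

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sketch: objects register themselves so the memory manager can
// enumerate live instances without keeping them alive.
interface IMappable { void Unmap(); }

// Example implementation (hypothetical).
class LargeBlob : IMappable
{
    public LargeBlob() => MappableRegistry.Register(this);
    public void Unmap() { /* write payload to disk and drop it */ }
}

static class MappableRegistry
{
    static readonly List<WeakReference<IMappable>> _items = new();
    static readonly object _lock = new();

    public static void Register(IMappable obj)
    {
        lock (_lock) _items.Add(new WeakReference<IMappable>(obj));
    }

    // Enumerate live objects, pruning dead entries as we go.
    public static List<IMappable> Live()
    {
        lock (_lock)
        {
            _items.RemoveAll(w => !w.TryGetTarget(out _));
            var result = new List<IMappable>();
            foreach (var w in _items)
                if (w.TryGetTarget(out var obj)) result.Add(obj);
            return result;
        }
    }
}
```

This avoids scanning 100,000 strong references yourself, at the cost of a prune pass over the weak-reference list.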
This is an obvious case for a classic cache. Your data is stored in a database or indexed flat file while you maintain a much smaller number of entries in RAM.
To implement a cache for your program I would create a class that implements IDictionary. Reserve a certain amount of slots in your cache, say a number of elements that would cause about 100 MB of RAM to be allocated; make this cache size an adjustable parameter.
When you override this[], if the object requested is in the cache, return it. If the object requested is not in the cache, remove the least recently used cached value, add the requested value as the most recently used value, and return it. Functions like Remove() and Add() not only adjust the memory cache, but also manipulate the underlying database or flat file on disk.
While it's true that your program might hold some references to objects you removed from the cache, if so, your program is still using them. Garbage collection will clean them up as needed.
Caches like this are easier to implement in C# because of its strong OOP features and safety.
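As a rough illustration of the `this[]` behavior described above (a sketch, not a production implementation), here is a minimal LRU cache built from a Dictionary plus a linked list; `LoadFromDisk` and `SaveToDisk` are placeholders for whatever database or flat-file backing store you use:

```csharp
using System;
using System.Collections.Generic;

// Minimal LRU cache sketch. LoadFromDisk/SaveToDisk stand in for the
// database or indexed flat file backing the cache.
class LruCache<TKey, TValue> where TKey : notnull
{
    readonly int _capacity;
    readonly Dictionary<TKey, LinkedListNode<(TKey Key, TValue Value)>> _map = new();
    readonly LinkedList<(TKey Key, TValue Value)> _order = new();  // front = most recently used

    public Func<TKey, TValue> LoadFromDisk = _ => default!;   // placeholder
    public Action<TKey, TValue> SaveToDisk = (_, _) => { };   // placeholder

    public LruCache(int capacity) => _capacity = capacity;

    public TValue this[TKey key]
    {
        get
        {
            if (_map.TryGetValue(key, out var node))
            {
                _order.Remove(node);          // touch: move to front
                _order.AddFirst(node);
                return node.Value.Value;
            }
            var value = LoadFromDisk(key);    // cache miss: hit the backing store
            Add(key, value);
            return value;
        }
        set => Add(key, value);
    }

    public bool ContainsKey(TKey key) => _map.ContainsKey(key);

    void Add(TKey key, TValue value)
    {
        if (_map.TryGetValue(key, out var existing))
        {
            _order.Remove(existing);
            _map.Remove(key);
        }
        if (_map.Count >= _capacity)          // evict the least recently used entry
        {
            var lru = _order.Last!;
            SaveToDisk(lru.Value.Key, lru.Value.Value);
            _order.RemoveLast();
            _map.Remove(lru.Value.Key);
        }
        _map[key] = _order.AddFirst((key, value));
    }
}
```

Both the lookup and the touch are O(1); the cache size is the adjustable parameter mentioned above.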
I need to be able to look up this data quickly and need access to all of it. Unfortunately, I also need to conserve memory (several of these will cause OutOfMemoryExceptions):
short[,,] data = new short[8000,8000,2];
I have attempted the following:
tried jagged array - same memory problems
tried breaking into smaller arrays - still get memory issues
is the only resolution to map this data efficiently using a memory-mapped file, or is there some other way to do this?
How about a database? After all, they are made for this.
I'd suggest you take a look at some NoSQL database. Depending on your needs, there are also in-memory databases [which obviously could suffer from the same out-of-memory problem] and databases that can be copy deployed or linked to your application.
I wouldn't want to mess with the storage details manually, and memory-mapping files is what some databases (at least MongoDB) are doing internally. So essentially, you'd be rolling your own DB, and writing a database is not trivial -- even if you narrow down the use case.
Redis or Membase sound like suitable alternatives for your problem. As far as I can see, both are able to manage the RAM utilization for you, that is, read data from the disk as needed and cache data in RAM for fast access. Of course, your access patterns will play a role here.
Keep in mind that a lot of effort went into building these DBs. According to Wikipedia, Zynga is using Membase, and Redis is sponsored by VMware.
Are you sure you need access to all of it all of the time? ...or could you load a portion of it, do your processing, then move on to the next?
Could you get away with using mip-mapping or LoD representations if it's just height data? Both of those could allow you to hold lower resolutions until you need to load up specific chunks of the higher resolution data.
How much free memory do you have on your machine? What operating system are you using? Is it 64 bit?
If you're doing memory / processing intensive operations, have you considered implementing those parts in C++ where you have greater control over such things?
It's difficult to help you much further without knowing more specifics of your system and what you're actually doing with your data.
I wouldn't recommend a traditional relational database if you're doing numeric calculations with this data. I suspect what you're running into here isn't the size of the data itself, but rather a known problem with .NET called Large Object Heap Fragmentation. If you're running into a problem after allocating these buffers frequently (even though they should be being garbage collected), this is likely your culprit. Your best solution is to keep as many buffers as you need pre-allocated and re-use them, to prevent the reallocation and subsequent fragmentation.
How are you interacting with this large multidimensional array? Are you using recursion? If so, note that the array is a reference type: passing it normally only copies the reference, not the elements, so make sure you aren't copying slices of the data between calls.
On a side note, do you need 100% of this data accessible at the same time? The best way to deal with large volumes of data is usually via a stream, or some kind of reader object. Try to deal with the data in segments. I've got a few processes that deal with Gigs worth of data, and it can process it in a minor amount of memory due to how I'm streaming it in via a SqlDataReader.
TL;DR: look at how you pass data between your function calls, and consider streaming patterns to deal with the data in smaller chunks.
hope that helps!
Note that although the CLR widens an individual short local to 32 bits on the evaluation stack, a short[] array stores its elements packed at 16 bits each, so the 8000x8000x2 array above (roughly 256MB) is already about as compact as such an array can get. What you can do then is:
Use a 64-bit machine. Then you can allocate a lot of memory and the OS will take care of paging the data to disk for you if you run out of RAM (make sure you have a large enough swap file). Then you can use 8 TERAbytes of data (if you have a large enough disk).
Read parts of this data from disk as you need them manually using file IO, or using memory mapping.
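The memory-mapping option might look roughly like this. The file name and dimensions are assumptions; the point is that `MemoryMappedViewAccessor` gives you random access to the packed shorts while the OS pages in only the pages you actually touch:

```csharp
using System;
using System.IO.MemoryMappedFiles;

// Sketch: expose a hypothetical file of packed shorts ("heights.bin", laid
// out as [8000, 8000, 2]) as an indexable array without loading it into RAM.
class DiskBackedShortArray : IDisposable
{
    readonly MemoryMappedFile _mmf;
    readonly MemoryMappedViewAccessor _view;
    readonly int _dim1, _dim2;   // inner dimensions; the outer one is implied by file length

    public DiskBackedShortArray(string path, int dim1, int dim2)
    {
        _mmf = MemoryMappedFile.CreateFromFile(path, System.IO.FileMode.Open);
        _view = _mmf.CreateViewAccessor();
        _dim1 = dim1; _dim2 = dim2;
    }

    // Random access: only the touched pages are ever resident.
    public short this[int i, int j, int k]
    {
        get => _view.ReadInt16(Offset(i, j, k));
        set => _view.Write(Offset(i, j, k), value);
    }

    long Offset(int i, int j, int k) =>
        (((long)i * _dim1 + j) * _dim2 + k) * sizeof(short);

    public void Dispose() { _view.Dispose(); _mmf.Dispose(); }
}
```

Access is slower than a resident array, but there is no per-element deserialization and no 256MB managed allocation.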
I have an out of memory exception using C# when reading in a massive file
I need to change the code, but for the time being can I increase the heap size (like I would in Java) as a short-term fix?
.Net does that automatically.
Looks like you have reached the limit of the memory one .NET process can use for its objects (on a 32-bit machine this is 2GB by default, or 3GB by using the /3GB boot switch; credits to Leppie & Eric Lippert for the info).
Rethink your algorithm, or perhaps a change to a 64 bit machine might help.
No, this is not possible. This problem might occur because you're running on a 32-bit OS and memory is too fragmented. Try not to load the whole file into memory (for instance, by processing line by line) or, when you really need to load it completely, by loading it in multiple, smaller parts.
No, you can't; see my answer here: Is there any way to pre-allocate the heap in the .NET runtime, like -Xmx/-Xms in Java?
For reading large files it is usually preferable to stream them from disk, reading them in chunks and dealing with them a piece at a time instead of loading the whole thing up front.
As others have already pointed out, this is not possible. The .NET runtime handles heap allocations on behalf of the application.
In my experience .NET applications commonly suffer from OOM when there should be plenty of memory available (or at least, so it appears). The reason for this is usually the use of huge collections such as arrays, List (which uses an array to store its data) or similar.
The problem is that these types will sometimes create peaks in memory use. If those peak requests cannot be honored, an OOM exception is thrown. E.g. when a List needs to increase its capacity, it does so by allocating a new array of double the current size and then copying all the references/values from one array to the other. Similarly, operations such as ToArray make a new copy of the array. I've also seen similar problems with big LINQ operations.
Each array is stored as contiguous memory, so to avoid OOM the runtime must be able to obtain one big chunk of memory. As the address space of the process may be fragmented due to both DLL loading and general use for the heap, this is not always possible in which case an OOM exception is thrown.
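The doubling peak is easy to avoid when the final element count is known up front: presize the List so the backing array is allocated exactly once. A small sketch:

```csharp
using System.Collections.Generic;

static class ListGrowthDemo
{
    // Growing a List from empty reallocates and copies its backing array each
    // time capacity is exhausted (4, 8, 16, ...), briefly holding both the old
    // and new arrays alive: that transient double allocation is the "peak".
    public static List<int> Grown(int n)
    {
        var list = new List<int>();
        for (int i = 0; i < n; i++) list.Add(i);
        return list;
    }

    // Presizing allocates the backing array once: no copies, no peak.
    public static List<int> Presized(int n)
    {
        var list = new List<int>(n);
        for (int i = 0; i < n; i++) list.Add(i);
        return list;
    }
}
```

The same idea applies to `ToArray`/`ToList` in hot paths: reuse one preallocated buffer rather than materializing fresh copies.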
What sort of file are you dealing with?
You might be better off using a StreamReader and yield returning the ReadLine result, if it's textual.
Sure, you'll be keeping a file-pointer around, but the worst case scenario is massively reduced.
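A minimal sketch of the yield-return approach (the file name is a placeholder; on .NET 4 and later, `File.ReadLines` does essentially this for you):

```csharp
using System.Collections.Generic;
using System.IO;

static class LineStreamer
{
    // Yields one line at a time: only the current line is ever held in
    // memory, instead of the whole file.
    public static IEnumerable<string> ReadLines(string path)
    {
        using var reader = new StreamReader(path);
        string? line;
        while ((line = reader.ReadLine()) != null)
            yield return line;
    }
}

// Usage (hypothetical):
// foreach (var line in LineStreamer.ReadLines("huge.log"))
//     Process(line);
```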
There are similar methods for binary files; if you're uploading a file to SQL, for example, you can read into a byte[] buffer and use the SQL pointer mechanics to append the buffer to a blob.
I have written a program which analyzes a project's source code and reports various issues and metrics based on the code.
To analyze the source code, I load the code files that exist in the project's directory structure and analyze the code from memory. The code goes through extensive processing before it is passed to other methods to be analyzed further.
The code is passed around to several classes when it is processed.
The other day I was running it on one of the larger projects my group has, and my program crapped out on me because there was too much source code loaded into memory. This is a corner case at this point, but I want to be able to handle it in the future.
What would be the best way to avoid memory issues?
I'm thinking about loading the code, doing the initial processing of the file, then serializing the results to disk, so that when I need to access them again I do not have to go through the process of manipulating the raw code again. Does this make sense? Or is the serialization/deserialization more expensive than processing the code again?
I want to keep a reasonable level of performance while addressing this problem. Most of the time, the source code will fit into memory without issue, so is there a way to only "page" my information when I am low on memory? Is there a way to tell when my application is running low on memory?
Update:
The problem is not that a single file fills memory; it's that all of the files in memory at once fill it. My current idea is to rotate them out to disk as I process them.
1.6GB is still manageable and by itself should not cause memory problems. Inefficient string operations might do it.
As you parse the source code you probably split it apart into certain substrings - tokens or whatever you call them. If your tokens combined account for the entire source code, that doubles memory consumption right there. Depending on the complexity of the processing you do, the multiplier can be even bigger.

My first move here would be to have a closer look at how you use your strings and find a way to optimize it - e.g. discarding the original after the first pass, compressing the whitespace, or using indexes (pointers) into the original strings rather than actual substrings - there are a number of techniques which can be useful here.

If none of this helps, then I would resort to swapping them to and from the disk.
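The "indexes rather than substrings" technique can be sketched like this: each token is just an (offset, length) pair into the one original string, materialized only on demand. The `Token`/`Tokenizer` names are invented for illustration:

```csharp
using System.Collections.Generic;

// Sketch: tokens as (offset, length) pairs into one shared source string,
// so tokenizing never duplicates the source text.
readonly struct Token
{
    public readonly int Offset, Length;
    public Token(int offset, int length) { Offset = offset; Length = length; }

    // Copy the characters out only when a real string is actually needed.
    public string Materialize(string source) => source.Substring(Offset, Length);
}

static class Tokenizer
{
    public static List<Token> Split(string source, char sep)
    {
        var tokens = new List<Token>();
        int start = 0;
        for (int i = 0; i <= source.Length; i++)
        {
            if (i == source.Length || source[i] == sep)
            {
                if (i > start) tokens.Add(new Token(start, i - start));
                start = i + 1;
            }
        }
        return tokens;
    }
}
```

Each token costs 8 bytes regardless of length, versus a full string copy per substring.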
If the problem is that a single copy of your code causes you to fill the available memory, then there are at least two options:
serialize to disk
compress files in memory. If you have CPU to spare, it can be faster to zip and unzip information in memory instead of caching to disk.
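A sketch of the in-memory compression option using GZipStream from the base class library: hold only the compressed bytes and inflate on demand, trading CPU for RAM:

```csharp
using System.IO;
using System.IO.Compression;
using System.Text;

static class InMemoryZip
{
    // Deflate the text and keep only the compressed bytes around.
    public static byte[] Compress(string text)
    {
        var input = Encoding.UTF8.GetBytes(text);
        using var output = new MemoryStream();
        using (var gzip = new GZipStream(output, CompressionMode.Compress))
            gzip.Write(input, 0, input.Length);
        return output.ToArray();
    }

    // Inflate back to a string only when the content is actually needed.
    public static string Decompress(byte[] compressed)
    {
        using var input = new MemoryStream(compressed);
        using var gzip = new GZipStream(input, CompressionMode.Decompress);
        using var reader = new StreamReader(gzip, Encoding.UTF8);
        return reader.ReadToEnd();
    }
}
```

Source code is highly repetitive text, so it tends to compress well; whether the CPU trade pays off depends on how often each file is re-read.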
You should also check if you are disposing of objects properly. Do you have memory problems due to old copies of objects being in memory?
Use WinDbg with SOS to see what is holding on to the string references (or whatever is causing the extreme memory usage).
Serializing/deserializing sounds like a good strategy. I've done a fair amount of this and it is very fast. In fact I have an app that instantiates objects from a DB and then serializes them to the hard drives of my web nodes. It has been a while since I benchmarked it, but it was serializing several hundred a second and maybe over 1k back when I was load testing.
Of course it will depend on the size of your code files. My files were fairly small.
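For what it's worth, a minimal version of that serialize-results-to-disk strategy might look like this; System.Text.Json is an assumption on my part (any serializer works), and `FileMetrics` is a made-up stand-in for your analysis results:

```csharp
using System.IO;
using System.Text.Json;

// Hypothetical analysis-result type.
record FileMetrics(string Path, int Lines, int Issues);

static class ResultStore
{
    // Persist processed results so the raw source never has to be re-analyzed.
    public static void Save(string path, FileMetrics metrics) =>
        File.WriteAllText(path, JsonSerializer.Serialize(metrics));

    public static FileMetrics Load(string path) =>
        JsonSerializer.Deserialize<FileMetrics>(File.ReadAllText(path))!;
}
```

Deserializing a small results object like this is typically far cheaper than re-running the full analysis on the raw source.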