I am trying to load a large dump of DBpedia data into my C# application, but I get an out of memory exception every time I try to load it.
The files are very large text files holding millions of records, and each is more than 250MB (one of them is actually 7GB!). When I try to load the 250MB file into my application, it waits for about 10 seconds, during which my RAM usage (6GB total, initially about 2GB used) climbs to about 5GB, and then the program throws an out of memory exception.
I understand that the out of memory exception is really about the lack of a large enough contiguous free chunk of memory. I want to know how to manage loading such a file into my program.
Here's the code I use to load the files; I'm using the dotNetRDF library.
TripleStore temp = new TripleStore();
//load the triples from the dump file into the in-memory store
temp.LoadFromFile(@"C:\MyTripleStore\pnd_en.nt");
dotNetRDF is simply not designed to handle this amount of data in its in-memory store. All of its parsing is streaming, but it has to build in-memory structures to store the data, and that is what takes up all the memory and leads to the OOM exception.
By default triples are fully indexed so they can be queried efficiently with SPARQL, and with the current release of the library that requires approximately 1.7 KB per triple, so at most the library will let you work with 2-3 million triples in memory, depending on your available RAM. As a related point, the SPARQL engine in the current release performs terribly at that scale, so even if you could load your data into memory you wouldn't be able to query it effectively.
While the next release of the library does reduce memory usage and vastly improve SPARQL performance, it was still never designed for that volume of data.
However, dotNetRDF does support a wide variety of native triple stores out of the box (see the IQueryableGenericIOManager interface and its implementations), so you should load the DBpedia dump into an appropriate store using that store's native loading mechanism (which will be faster than loading via dotNetRDF) and then use dotNetRDF simply as a client through which to make your queries.
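For the querying side, here is a rough sketch of pointing dotNetRDF at a remote SPARQL endpoint once the data has been bulk-loaded into a store. The endpoint URL is just a placeholder for whatever store you pick; this uses the simple SparqlRemoteEndpoint class rather than a specific IQueryableGenericIOManager implementation.

using System;
using VDS.RDF.Query;

// placeholder endpoint URL - substitute your store's SPARQL endpoint
SparqlRemoteEndpoint endpoint = new SparqlRemoteEndpoint(new Uri("http://localhost:3030/dbpedia/sparql"));

// the query runs on the server, so only the (small) result set comes back into your process
SparqlResultSet results = endpoint.QueryWithResultSet("SELECT * WHERE { ?s ?p ?o } LIMIT 10");
foreach (SparqlResult result in results)
{
    Console.WriteLine(result.ToString());
}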
Related
We have very large data files; for example, assume a 3D volume with a 2048x2048 matrix size and a slice depth of 20000.
Originally, I had my own central memory management where each piece of data is backed by a "page file" on disk. During processing I track the process memory, and if memory gets low I check which slices haven't been touched for a while and page them out to my own hand-made paging file. This of course results in some form of zig-zag when you look at the process memory, but the method works even if the total size of my files is much larger than the available RAM. Of course putting more RAM into the machine improves the situation, but the system is able to adapt either way. So if you pay more you get more speed ;-)
But the method is of course hand-made: I need a separate thread that watches over the allocated files and data, I need special methods that handle the paging, etc.
So I decided to have a look at memory mapped files...
When I now create a memory-mapped file (read & write) and run through the data slice by slice, accessing it with a ReadOnlySpan<> in a for loop, the whole memory is consumed: you can see the process memory growing linearly until all RAM is used. After that the system goes into swapping and never comes out of it again. It even gets to the point where the system freezes and only a restart recovers it.
So obviously the MM-files are not balancing themselves, and once we reach the RAM maximum, the system is exhausted and obviously helpless.
The question is: can I control when the system swaps, and advise it which regions can be swapped out? Or do you have suggestions for better approaches?
I have created an example that does the following:
It creates a Volume as input
It creates one or more processors, each of which simply adds 10 to each slice.
Since everything is created only on access, a Parallel.For iterates over the slices and requests the result slice. Depending on the number N of processors in the chain, each slice of the result should have a value of (N+1)*10.
See here: Example Code
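For reference, a minimal sketch of the slice-by-slice pattern described above. The file name, dimensions and slice count are made up here, and it uses ReadArray/WriteArray on a per-slice view accessor rather than spans, just to keep the sketch in safe code.

using System;
using System.IO.MemoryMappedFiles;
using System.Threading.Tasks;

const int Width = 2048, Height = 2048;
const int SliceCount = 100;                                   // far fewer than 20000, just for a quick test
const long SliceBytes = (long)Width * Height * sizeof(short);

using var mmf = MemoryMappedFile.CreateFromFile(
    "volume.bin", System.IO.FileMode.OpenOrCreate, null, SliceBytes * SliceCount);

Parallel.For(0, SliceCount, slice =>
{
    // one short-lived view per slice so its pages can be released again
    using var accessor = mmf.CreateViewAccessor(slice * SliceBytes, SliceBytes);
    short[] buffer = new short[Width * Height];
    accessor.ReadArray(0, buffer, 0, buffer.Length);

    for (int i = 0; i < buffer.Length; i++) buffer[i] += 10;  // the "+10 per slice" processor

    accessor.WriteArray(0, buffer, 0, buffer.Length);
    accessor.Flush();                                         // ask the OS to write dirty pages back
});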
Update [26.09.2021]: My latest investigations have shown that there are two independent processes at work. One process (mine) is writing or reading data in the memory-mapped file(s). Another system process is trying to manage the memory. While the memory is growing, the system is obviously very lazy about actually flushing data to the backing memory-mapped file or the paging file.
This leads to a bad situation. When the maximum RAM is reached, the memory manager starts to empty the working set and flush the data to disk. While the system is flushing, I'm still producing data which fills the memory again. So in the end I reach a point where I'm always at maximum RAM usage and never get out of it.
I tried a few things:
VirtualUnlock lets me mark regions as pageable for the memory manager.
You can call a Flush on the ViewAccessor.
Since the working set still shows high memory usage, I call EmptyWorkingSet to empty the working set, which returns immediately (the P/Invoke declarations are sketched below).
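For completeness, the P/Invoke declarations I mean are roughly these (a sketch only, error handling omitted):

using System;
using System.Diagnostics;
using System.Runtime.InteropServices;

static class NativeMethods
{
    [DllImport("kernel32.dll", SetLastError = true)]
    public static extern bool VirtualUnlock(IntPtr lpAddress, UIntPtr dwSize);   // marks the range as pageable again

    [DllImport("psapi.dll", SetLastError = true)]
    public static extern bool EmptyWorkingSet(IntPtr hProcess);                  // trims the process working set
}

// usage: NativeMethods.EmptyWorkingSet(Process.GetCurrentProcess().Handle);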
What I need in the end is a way to tell whether the memory manager is currently unmapping, so that I can slow my process down.
I have to search more...
I am working on a C# service project which is a TCP server. When I start my service, its memory usage is around 40 MB, but as time passes it doubles and triples. I know that C# is garbage collected and all, but there is some problem with the application.
I am using log4net in my application, Entity Framework to store data in a database, and tcpServer to manage my incoming data.
Is there a way to list all the variables in memory, sorted by size in descending order, in a log file? Or is there a better way to debug such problems (like real-time analysis)?
You can try with WinDbg:
!dumpheap -stat
as described here and here
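If you haven't used SOS before, the usual sequence once the dump is open looks roughly like this (assuming .NET 4, where the runtime module is clr; for .NET 2.0 use mscorwks instead):

.loadby sos clr
!dumpheap -stat
!dumpheap -mt <method table address>     (list instances of one suspicious type)
!gcroot <object address>                 (find what is keeping an instance alive)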
Looking at the columns in the output: MT stands for Method Table and is basically a pointer to the table that describes that type of object; Count is the number of objects of the given type that exist on the heap; TotalSize is the total amount of memory consumed by all objects of that type; and the last column is the fully qualified type name.
Consider also SOSNet, an open-source C# project that analyzes the output from the command-line version of WinDbg to display the data in a friendlier way. It can also display the list of all .NET types loaded in memory with their "Total Size".
Generally speaking, there can be many causes. Maybe you are leaking threads: is there only one thread accepting all the connections?
Otherwise you can take and analyse memory snapshots of your process.
I am building a winrt metro app and always seem to run into this (non?) issue.
I find myself maintaining a lot of files of cached data that I can serialize back and forth. Data retrieved from services, user selected items and so on.
The question that I always seem to have when I write these calls is: is it the accessing of the actual file (opening, releasing, etc.) that takes time and is expensive, or the amount of data that needs to be serialized from it?
How much should I worry about, for example, trying to combine a couple of files that may store the same object types into one, and then identifying the ones I need once I have the objects 'out'?
Did you ever get an insufficient memory or out-of-memory exception?
WinRT lets you use the RAM and cached files up to around 70-80% of the device's memory; anything beyond that will crash the app. Once you navigate away from your page your resources are garbage collected, so that's not an issue. Wrapping the memory stream in a using block is also fine, but saving large data and continuously fetching files from the database affects system memory. And since Surface tablets have a limited amount of memory, you should take a bit of care with large numbers of files :) I faced this while rendering bitmaps: loading around 100 bitmaps simultaneously into memory threw an insufficient memory exception.
I need to be able to look up this data quickly and need access to all of it. Unfortunately, I also need to conserve memory (several of these will cause OutOfMemoryExceptions):
short[,,] data = new short[8000,8000,2];
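// 8000 * 8000 * 2 elements * 2 bytes per short = 256,000,000 bytes in one contiguous allocation (Large Object Heap)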
I have attempted the following:
tried jagged array - same memory problems
tried breaking into smaller arrays - still get memory issues
Is the only resolution to map this data efficiently using a memory-mapped file, or is there some other way to do this?
How about a database? After all, they are made for this.
I'd suggest you take a look at some NoSQL database. Depending on your needs, there are also in-memory databases [which obviously could suffer from the same out-of-memory problem] and databases that can be copy deployed or linked to your application.
I wouldn't want to mess with the storage details manually, and memory-mapping files is what some databases (at least MongoDB) are doing internally. So essentially, you'd be rolling your own DB, and writing a database is not trivial -- even if you narrow down the use case.
Redis or Membase sound like suitable alternatives for your problem. As far as I can see, both are able to manage the RAM utilization for you, that is, read data from the disk as needed and cache data in RAM for fast access. Of course, your access patterns will play a role here.
Keep in mind that a lot of effort went into building these DBs. According to Wikipedia, Zynga is using Membase and Redis is sponsored by VMWare.
Are you sure you need access to all of it all of the time? ...or could you load a portion of it, do your processing then move onto the next?
Could you get away with using mip-mapping or LoD representations if it's just height data? Both of those could allow you to hold lower resolutions until you need to load up specific chunks of the higher resolution data.
How much free memory do you have on your machine? What operating system are you using? Is it 64 bit?
If you're doing memory / processing intensive operations, have you considered implementing those parts in C++ where you have greater control over such things?
It's difficult to help you much further without knowing some more specifics of your system and what you're actually doing with your data.
I wouldn't recommend a traditional relational database if you're doing numeric calculations with this data. I suspect what you're running into here isn't the size of the data itself, but rather a known problem with .NET called Large Object Heap Fragmentation. If you're running into a problem after allocating these buffers frequently (even though they should be being garbage collected), this is likely your culprit. Your best solution is to keep as many buffers as you need pre-allocated and re-use them, to prevent the reallocation and subsequent fragmentation.
How are you interacting with this large multi dimensional array? Are you using Recursion? If so, make sure your recursive methods are passing parameters by reference, rather than by value.
On a side note, do you need 100% of this data accessible at the same time? The best way to deal with large volumes of data is usually via a stream, or some kind of reader object. Try to deal with the data in segments. I've got a few processes that deal with Gigs worth of data, and it can process it in a minor amount of memory due to how I'm streaming it in via a SqlDataReader.
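As an illustration of that streaming idea (the table and column names here are made up, and the connection string is a placeholder), something along these lines keeps only one row in memory at a time:

using System.Data.SqlClient;

string connectionString = "...";   // your connection string

using (var connection = new SqlConnection(connectionString))
using (var command = new SqlCommand("SELECT X, Y, Value FROM HeightData", connection))
{
    connection.Open();
    using (var reader = command.ExecuteReader())
    {
        while (reader.Read())
        {
            short value = reader.GetInt16(2);   // process one row at a time instead of holding the whole array
        }
    }
}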
TL;DR: look at how you pass data between your function calls (ref vs. value) and maybe use streaming patterns to deal with the data in smaller chunks.
hope that helps!
.NET stores shorts as 32-bit values even though they only contain 16 bits. So you could save a factor two by using an array of ints and decoding the int to two shorts yourself using bit operations.
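A sketch of that packing/unpacking with bit operations (the method names are mine):

static int Pack(short high, short low)
{
    return ((ushort)high << 16) | (ushort)low;   // high short in the upper 16 bits, low short in the lower 16
}

static short UnpackHigh(int packed) { return (short)(packed >> 16); }
static short UnpackLow(int packed)  { return (short)(packed & 0xFFFF); }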
Then you pretty much have the most efficient way of storing such an array. What you can do then is:
Use a 64-bit machine. Then you can allocate a lot of memory and the OS will take care of paging the data to disk for you if you run out of RAM (make sure you have a large enough swap file). Then you can use 8 TERAbytes of data (if you have a large enough disk).
Read parts of this data from disk as you need them manually using file IO, or using memory mapping.
I am working on an application which has a potentially large memory load (>5GB) but is required to run on 32-bit, .NET 2 based desktops due to the customer's deployment environment. My solution so far has been to use an app-wide data store for these large-volume objects: when an object is assigned to the store, the store checks the app's total memory usage, and if it is getting close to the limit it starts serialising some of the older objects in the store to the user's temp folder, retrieving them back into memory as and when they are needed. This is proving decidedly unreliable, because if other objects within the app start using memory, the store gets no prompt to clean up and make space. I did look at using weak references to hold the in-memory data objects, serialising them to disk when they were released, but the objects seemed to be released almost immediately (especially in debug builds), causing a massive performance hit as the app serialised everything.
Are there any useful patterns/paradigms I should be using to handle this? I have googled extensively but as yet haven't found anything useful.
I thought virtual memory was supposed to have you covered in this situation?
Anyway, it seems suspect that you really need all 5GB of data in memory at any given moment - you can't possibly be processing all of that data at once - at least not on what sounds like a consumer PC. You didn't go into detail about your data, but something smells to me like the object itself is poorly designed, in the sense that you need the entire set in memory to work with it. Have you thought about fragmenting your data into more sensible units and then pre-emptively loading it from disk just before it needs to be processed? You'd essentially be paying a more constant performance trade-off this way, but you'd reduce your current thrashing issue.
Maybe you could go with memory-mapped files - see Managing Memory-Mapped Files and look here. In .NET 2.0 you have to use P/Invoke to call those functions; since .NET 4.0 there is efficient built-in functionality with MemoryMappedFile.
Also take a look at:
http://msdn.microsoft.com/en-us/library/dd997372.aspx
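For the .NET 2.0 route, the core P/Invoke declarations look roughly like this (a sketch; constants and error handling trimmed). With .NET 4.0 the same thing is just MemoryMappedFile.CreateFromFile(...) plus CreateViewAccessor(...).

using System;
using System.Runtime.InteropServices;

static class MemoryMappedFileNative
{
    public const uint PAGE_READWRITE = 0x04;
    public const uint FILE_MAP_ALL_ACCESS = 0x000F001F;

    [DllImport("kernel32.dll", SetLastError = true, CharSet = CharSet.Auto)]
    public static extern IntPtr CreateFileMapping(IntPtr hFile, IntPtr lpAttributes,
        uint flProtect, uint dwMaximumSizeHigh, uint dwMaximumSizeLow, string lpName);

    [DllImport("kernel32.dll", SetLastError = true)]
    public static extern IntPtr MapViewOfFile(IntPtr hFileMappingObject, uint dwDesiredAccess,
        uint dwFileOffsetHigh, uint dwFileOffsetLow, UIntPtr dwNumberOfBytesToMap);

    [DllImport("kernel32.dll", SetLastError = true)]
    public static extern bool UnmapViewOfFile(IntPtr lpBaseAddress);

    [DllImport("kernel32.dll", SetLastError = true)]
    public static extern bool CloseHandle(IntPtr hObject);
}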
You can't store 5GB of data in memory efficiently. You have a 2GB limit per process on a 32-bit OS, and a 4GB limit per 32-bit process under 64-bit Windows-on-Windows (WOW64), and only if the executable is large-address aware.
So you have a choice:
Go the Google Chrome (and Firefox 4) way and split portions of the data across multiple processes. This may be applicable if your application runs under a 64-bit OS and you have some reason to keep your app 32-bit, but it is not an easy route. If you don't have a 64-bit OS, I wonder where you get >5GB of RAM from.
If you have a 32-bit OS, then any solution will be file-based. When you try to keep the data in memory (though I wonder how you address it under 32-bit and the 2GB per-process limit), the OS just continuously swaps portions of the data (memory pages) to disk and restores them again and again when you access them. You incur a great performance penalty, and you have already noticed it (I guessed this from the description of your problem). The main problem is that the OS can't predict when you need one piece of data and when you want another, so it just tries to do its best by reading and writing memory pages to and from disk.
So you are already using disk storage indirectly, in an inefficient way; MMFs just give you the same solution in an efficient and controlled manner.
You can re-architect your application to use MMFs, and the OS will help you with efficient caching. Do a quick test yourself; an MMF may be good enough for your needs.
Anyway, I don't see any solution for working with a dataset greater than the available RAM other than a file-based one. And it is usually better to have direct control over data manipulation, especially when such an amount of data arrives and needs to be processed.
When you have to store huge amounts of data and maintain accessibility, sometimes the most useful solution is a data storage and management system such as a database. A database (MySQL for example) can store lots of typical data types and of course binary data too. Maybe you can store your objects in the database (directly, or through a business object model) and retrieve them when you need to. This solution can often solve many problems with data management (moving, backup, searching, updating, ...) and storage (the data layer), and it's location-independent - maybe this point of view can help you.
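For example, persisting a serialized object as a BLOB is just an ordinary parameterized insert. The sketch below assumes MySQL Connector/NET and a made-up Blobs table; connectionString, objectId and serializedBytes are placeholders for your own values.

using MySql.Data.MySqlClient;

using (var connection = new MySqlConnection(connectionString))
using (var command = new MySqlCommand("INSERT INTO Blobs (Id, Payload) VALUES (@id, @payload)", connection))
{
    connection.Open();
    command.Parameters.AddWithValue("@id", objectId);
    command.Parameters.AddWithValue("@payload", serializedBytes);   // byte[] maps to a BLOB column
    command.ExecuteNonQuery();
}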