In C# I had to create my own dynamic memory management. For that purpose I created a static memory manager and a MappableObject interface. All objects that should be dynamically mappable to and unmappable from the hard disk implement this interface.
This memory management is only applied to those large objects that can map and unmap their data to and from the hard disk. Everything else of course uses the regular GC.
Every time a MappableObject is allocated, it asks for memory. If no memory is available, the MemoryManager dynamically unmaps some data to the hard disk to free enough memory to allocate the new MappableObject.
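Roughly, the scheme looks like this (a simplified illustration; all names and the memory budget are invented):

```csharp
using System.Collections.Generic;

// Simplified sketch of the scheme described above; names are invented.
interface IMappableObject
{
    long SizeInBytes { get; }
    bool IsMapped { get; }
    void Unmap(); // write the data to disk and release the in-memory buffer
    void Map();   // reload the data from disk
}

static class MemoryManager
{
    static readonly List<IMappableObject> tracked = new List<IMappableObject>();
    static readonly long budget = 2L * 1024 * 1024 * 1024; // allowed in-memory bytes
    static long used;

    // Called whenever a new MappableObject is allocated.
    public static void Register(IMappableObject obj)
    {
        // Unmap other objects until the new one fits into the budget.
        foreach (var candidate in tracked)
        {
            if (used + obj.SizeInBytes <= budget) break;
            if (candidate.IsMapped)
            {
                candidate.Unmap();
                used -= candidate.SizeInBytes;
            }
        }
        tracked.Add(obj);
        used += obj.SizeInBytes;
    }
}
```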
A problem in my case is that I can have more than 100,000 MappableObject instances (scattered over ~10-20 files), and every time I need to unmap some data I have to run through a list of all objects. Is there a way to get all allocated objects that were created in my current instance?
In fact I don't know which is easier: keeping my own list, or enumerating the objects (if that is even possible). How would you solve such things?
Update
The reason is that I have a large amount of data, about 100 GB, that I need to keep available during my run. I therefore need to hold references to the data, so the GC cannot reclaim the memory. C# actually manages memory pretty well, but in such memory-exhausting applications the GC performs really badly. Of course I tried MemoryFailPoint, but it slows down my allocations tremendously and does not give correct results, for whatever reason. I have also tried MemoryMappedFiles, but since I have to access the data randomly they don't help. In addition, MemoryMappedFiles only allow ~5,000 file handles (on my system), and that is not enough.
Is there a ROT (Running Object Table) in .Net? The short answer is no.
You would have to maintain this information yourself.
Given your question update, could you not store your data in a database and use some sort of in-memory cache (perhaps with weak references or an MFU (most frequently used) policy, etc.) to try to keep the hot data close at hand?
This is an obvious case for a classic cache. Your data is stored in a database or indexed flat file while you maintain a much smaller number of entries in RAM.
To implement a cache for your program, I would create a class that implements IDictionary. Reserve a certain number of slots in your cache, say enough elements to allocate about 100 MB of RAM; make this cache size an adjustable parameter.
When you override this[], if the object requested is in the cache, return it. If the object requested is not in the cache, remove the least recently used cached value, add the requested value as the most recently used value, and return it. Functions like Remove() and Add() not only adjust the memory cache, but also manipulate the underlying database or flat file on disk.
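A minimal sketch of such a cache (not a full IDictionary implementation; LoadFromDisk/SaveToDisk are placeholders for the database or flat-file layer):

```csharp
using System;
using System.Collections.Generic;

// Sketch of the LRU cache described above; a dictionary gives O(1) lookup
// while a linked list tracks recency of use.
class LruCache<TKey, TValue>
{
    private readonly int capacity;
    private readonly Dictionary<TKey, LinkedListNode<KeyValuePair<TKey, TValue>>> map;
    private readonly LinkedList<KeyValuePair<TKey, TValue>> order =
        new LinkedList<KeyValuePair<TKey, TValue>>();
    private readonly Func<TKey, TValue> loadFromDisk;   // backing-store read
    private readonly Action<TKey, TValue> saveToDisk;   // backing-store write

    public LruCache(int capacity, Func<TKey, TValue> load, Action<TKey, TValue> save)
    {
        this.capacity = capacity;
        map = new Dictionary<TKey, LinkedListNode<KeyValuePair<TKey, TValue>>>(capacity);
        loadFromDisk = load;
        saveToDisk = save;
    }

    public TValue this[TKey key]
    {
        get
        {
            if (map.TryGetValue(key, out var node))
            {
                order.Remove(node);   // hit: move to the most-recently-used end
                order.AddFirst(node);
                return node.Value.Value;
            }
            if (map.Count >= capacity)
            {
                var lru = order.Last; // full: evict the least recently used entry
                saveToDisk(lru.Value.Key, lru.Value.Value);
                map.Remove(lru.Value.Key);
                order.RemoveLast();
            }
            TValue value = loadFromDisk(key); // miss: pull from the backing store
            map[key] = order.AddFirst(new KeyValuePair<TKey, TValue>(key, value));
            return value;
        }
    }
}
```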
While it's true that your program might hold some references to objects you removed from the cache, if so, your program is still using them. Garbage collection will clean them up as needed.
Caches like this are easier to implement in C# because of its strong OOP features and safety.
Related
I have a set of large objects (over 20GB) that I need to access quickly from an application.
So far I have read these files from disk into RAM at application startup. This is an expensive task, as the files are deserialized into in-memory objects. However, after the initial startup delay of loading these files, the objects can be accessed very quickly. Now, however, the files have grown too large to hold in RAM.
I now have to read part of the files from disk, deserialize them into memory, discard the used memory, read the next files, and so on in a loop. This is computationally very expensive due to the deserialization.
Is there a way where I can have an "in-memory" object that points to a memory space that is stored on disk? This would be slower to access than if it was resident in RAM, but the slower access to disk rather than RAM would still be faster than repeatedly deserializing the data I suspect.
Is there a way to do this?
The data btw is essentially a List of structs that need to be iterated over.
If it is essentially a list of structs, then yes: you can use memory mapped files here. The most effective way to do this would be to create a single huge view over the data (let the OS worry about mapping it and paging it as needed), and obtain and store the unmanaged pointer to the root (you can get that from MemoryMappedViewStream, but IIRC there are more direct ways to get it).
Now; two things you don't want to do:
constantly deal in unmanaged pointers
constantly copy the data
But: you can use ref T and Span<T> as your friend; System.Runtime.CompilerServices.Unsafe has facilities to hack between void* and ref T, and Span<T> can take a void*; this gives you two easy ways of working with struct data that is held in unmanaged memory.
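For example, a minimal sketch along those lines (the Record struct and file name are placeholders, and a runtime with Span&lt;T&gt; support is assumed):

```csharp
using System;
using System.IO.MemoryMappedFiles;

// Placeholder layout; substitute the actual struct in the list.
struct Record
{
    public int Id;
    public double Value;
}

static class MmfDemo
{
    // Requires compiling with /unsafe.
    public static unsafe void SumIds(string path)
    {
        using (var mmf = MemoryMappedFile.CreateFromFile(path))
        using (var accessor = mmf.CreateViewAccessor())
        {
            byte* ptr = null;
            accessor.SafeMemoryMappedViewHandle.AcquirePointer(ref ptr);
            try
            {
                // Capacity may be rounded up to a page boundary, so a real
                // implementation should store the true record count elsewhere.
                int count = (int)(accessor.Capacity / sizeof(Record));
                var records = new Span<Record>(ptr, count);
                long sum = 0;
                for (int i = 0; i < records.Length; i++)
                    sum += records[i].Id; // iterate in place: no copies, no deserialization
                Console.WriteLine(sum);
            }
            finally
            {
                accessor.SafeMemoryMappedViewHandle.ReleasePointer();
            }
        }
    }
}
```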
I was reading an article here
https://azure.microsoft.com/en-in/documentation/articles/service-fabric-work-with-reliable-collections/
It says "you MUST not modify an object once you have given it to a reliable collection."
Why is that the case? Can I not modify the object and add it back to reliable collection? Will it not overwrite the previous value?
Theoretically you can modify the same object and write it back to the reliable collection, but this approach is buggy. When you make changes to the object, the value is only modified locally and is not written to the disk of the primary and secondary replicas. Until you explicitly write the modified object back to the reliable collection, the local copy of the object and the persisted copy won't be the same. So it is always good practice to treat the object as immutable and make modifications to a deep copy of it.
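A minimal sketch of the Get -> copy -> Set -> Commit pattern, assuming the standard Service Fabric reliable collections API (the Order type is invented):

```csharp
using System.Threading.Tasks;
using Microsoft.ServiceFabric.Data;
using Microsoft.ServiceFabric.Data.Collections;

// Hypothetical record type; instances are treated as immutable once stored.
class Order
{
    public string Id { get; }
    public decimal Total { get; }
    public Order(string id, decimal total) { Id = id; Total = total; }
}

class OrderRepository
{
    private readonly IReliableStateManager stateManager;

    public OrderRepository(IReliableStateManager stateManager)
    {
        this.stateManager = stateManager;
    }

    public async Task UpdateTotalAsync(string id, decimal newTotal)
    {
        var orders = await stateManager
            .GetOrAddAsync<IReliableDictionary<string, Order>>("orders");

        using (ITransaction tx = stateManager.CreateTransaction())
        {
            ConditionalValue<Order> current = await orders.TryGetValueAsync(tx, id);
            if (current.HasValue)
            {
                // Build a new object rather than mutating the stored one,
                // then write it back and commit so the change is replicated.
                var updated = new Order(id, newTotal);
                await orders.SetAsync(tx, id, updated);
                await tx.CommitAsync();
            }
        }
    }
}
```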
In traditional .NET collections, the Value from a keyed lookup in a dictionary (or a pop/peek from a queue) returns a reference to an object on the heap. When you modify the object through this reference, the value on the heap is modified, and as a result the state inside the dictionary is mutated.
Reliable collections are a facade around a much more complex interaction. While it is true that the collection is in memory*, the reliable state manager is also in charge of replicating any changes to the secondary replicas. The mechanism by which this occurs is calling CommitAsync on the ITransaction.
If you were to only mutate the in-memory representation of an object, the change would never be replicated to the secondary partitions, and undefined/unexpected behavior would result (say, when the active primary switches to a secondary). If you do call CommitAsync (even if you do a Get -> Modify -> Set), the transaction might fail to commit, and the current in-memory representation would then differ from that of the secondary partitions and the on-disk representation of the primary partition. This, again, leads to undefined/unexpected behavior.
*In most cases, unless the size of the collection is bigger than the available memory. In that case the values are paged from disk, and only the keys and recently used values are held in memory. In the future, I have heard talk of paging further into blob storage when disk pressure increases.
The full statement is:
However, with reliable collections, this code exhibits the same problem as already discussed: you MUST not modify an object once you have given it to a reliable collection.
That statement is in the context of problems associated with working with reliable collections. The API of an SF reliable dictionary looks like that of a standard .NET dictionary, but behind the scenes it is doing much more, such as managing replicated state. With that difference in mind, we need to watch out for common pitfalls when working with these data structures. The first code demonstration in the article looks like it updates the object, but it does not. Later on, the article provides the correct way of modifying an object in a reliable collection.
In my application I use a dictionary (supporting adding, removing, updating and lookup) where both keys and values are or can be made serializable (values can possibly be quite large object graphs). I came to a point when the dictionary became so large that holding it completely in memory started to occasionally trigger OutOfMemoryException (sometimes in the dictionary methods, and sometimes in other parts of code).
After an attempt to completely replace the dictionary with a database, performance dropped down to an unacceptable level.
Analysis of the dictionary usage patterns showed that usually a smaller part of the values are "hot" (accessed quite often), and the rest (the larger part) are "cold" (accessed rarely or never). It is difficult to say, when a new value is added, whether it will be hot or cold; moreover, some values may migrate back and forth between hot and cold over time.
I think that I need an implementation of a dictionary that is able to flush its cold values to a disk on a low memory event, and then reload some of them on demand and keep them in memory until the next low memory event occurs when their hot/cold status will be re-assessed. Ideally, the implementation should neatly adjust the sizes of its hot and cold parts and the flush interval depending on the memory usage profile in the application to maximize overall performance. Because several instances of a dictionary exist in the application (with different key/value types), I think, they might need to coordinate their workflows.
Could you please suggest how to implement such a dictionary?
Compile for 64 bit, deploy on 64 bit, add memory. Keep it in memory.
Before you grow your own, you may alternatively look at WeakReference: http://msdn.microsoft.com/en-us/library/ms404247.aspx. It would of course require you to rebuild those objects that were reclaimed, but one would hope that the reclaimed ones are not used much. It comes with the caveat that its own guidelines state to avoid using weak references as an automatic solution to memory management problems; instead, develop an effective caching policy for handling your application's objects.
Of course you can ignore that guideline and effectively work your code to account for it.
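A minimal sketch of that approach (the rebuild delegate, e.g. a database reload, is a placeholder):

```csharp
using System;
using System.Collections.Generic;

// Sketch of a weak-reference cache, per the caveat above: reclaimed entries
// must be rebuilt, so pair this with a proper caching policy in real code.
class WeakCache<TKey, TValue> where TValue : class
{
    private readonly Dictionary<TKey, WeakReference<TValue>> map =
        new Dictionary<TKey, WeakReference<TValue>>();
    private readonly Func<TKey, TValue> rebuild; // e.g. reload from the database

    public WeakCache(Func<TKey, TValue> rebuild)
    {
        this.rebuild = rebuild;
    }

    public TValue Get(TKey key)
    {
        if (map.TryGetValue(key, out var weak) && weak.TryGetTarget(out TValue value))
            return value; // still alive: the GC has not reclaimed it yet

        TValue rebuilt = rebuild(key); // reclaimed (or never cached): rebuild it
        map[key] = new WeakReference<TValue>(rebuilt);
        return rebuilt;
    }
}
```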
You can implement the caching policy and, upon expiry, save to the database; on fetch, get and cache. Use a sliding expiry, of course, since you are concerned with keeping the most-used items.
Do remember, however, that most-used vs. heaviest is a trade-off: losing an object 10 times a day that takes 5 minutes to restore would annoy users much more than losing an object 10,000 times that took just 5 ms to restore.
And someone above mentioned the web cache. It does automatic memory management with callbacks, as noted; it depends on whether you want to lug that one around in your apps.
And...last but not least, look at a distributed cache. With sharding you can split that big dictionary across a few machines.
Just an idea - never did that and never used System.Runtime.Caching:
Implement a wrapper around MemoryCache which will:
Add items with an eviction callback specified. The callback will place evicted items into the database.
Fetch the item from the database and put it back into MemoryCache if it is absent from MemoryCache during retrieval.
If you expect a lot of requests for items missing from both the database and memory, you'll probably also need to implement either a bloom filter or a cache of present/missing keys.
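A minimal sketch of that wrapper, assuming System.Runtime.Caching (the database delegates are placeholders):

```csharp
using System;
using System.Runtime.Caching;

// Sketch of the wrapper described above; loadFromDatabase/saveToDatabase
// are stand-ins for the persistence layer.
class EvictingCache<T> where T : class
{
    private readonly MemoryCache cache = MemoryCache.Default;
    private readonly Func<string, T> loadFromDatabase;
    private readonly Action<string, T> saveToDatabase;

    public EvictingCache(Func<string, T> load, Action<string, T> save)
    {
        loadFromDatabase = load;
        saveToDatabase = save;
    }

    public void Add(string key, T value)
    {
        var policy = new CacheItemPolicy
        {
            SlidingExpiration = TimeSpan.FromMinutes(10),
            // Persist the item when it leaves the cache. (The callback fires
            // on any removal; real code should check args.RemovedReason.)
            RemovedCallback = args =>
                saveToDatabase(args.CacheItem.Key, (T)args.CacheItem.Value)
        };
        cache.Set(key, value, policy);
    }

    public T Get(string key)
    {
        if (cache.Get(key) is T hit)
            return hit;
        T value = loadFromDatabase(key); // miss: fall back to the database
        if (value != null)
            Add(key, value); // and re-cache it
        return value;
    }
}
```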
I had a similar problem in the past.
The concept you are looking for is a read-through cache with an LRU (Least Recently Used) eviction queue.
Is it there any LRU implementation of IDictionary?
As you add things to your dictionary, keep track of which ones were used least recently; remove those from memory and persist them to disk.
Background:
I have a service whose purpose in life is to provide objects to requestors - it basically gets complicated data from a database and transforms it once (a bit like a view over data) to produce a simplified record. It then services requests from other services by providing up to 100k records (depending on the nature of the request) on demand.
The idea is that the complicated transformation is done once and cached by the service - it works out quicker than letting the database work it out each time a view is accessed, and for my purposes it works just fine. (I believe some call this SSOS.)
The way data is being cached is in a list of objects which are property bags for standard .Net types. These objects have no references to anything else.
Periodically a record will change, and the cache must be updated which means that the original record must be located, thrown away and replaced.
Now, a record in the cache will have been there for a long time and will have been promoted to Gen 2; pretty much all the collections will happen in the Gen 2 phase, as these objects hang around for ages (on purpose).
My understanding of Gen 2 collections is that they are slow, and if collections are mainly reclaiming Gen 2 objects, the GC is going to run them more often.
I would like to be able to de-reference an object in the list in a way that doesn't end up triggering a full Gen 2 collection... I was thinking that maybe there is a way of marking it as Gen 0 and then de-referencing it before replacing it, but I don't think that is possible.
I am constrained to using .Net 4 for this and the application is a service which serves data to up to 100 clients who request full lists or changes to the list over a period of time.
Question: Can anyone suggest a way to de-reference long lived objects in a GC friendly way or perhaps another way to approach this problem?
There is no simple answer to this. If you have lots of long-lived objects, then full collections really can hurt, as I discussed here. Since a picture tells a thousand words:
Those vertical spikes are where garbage collection happens and slaughters the response times.
The way we reduced the impact of this was: don't have a gazillion long-lived objects. What we did was change the classes to structs, which meant that the only object was the array that contained them. We were fortunate here in that the data was simple and didn't involve strings, which would of course themselves be objects. We also did some crazy fixed-size buffer work to reduce things that were previously collections, and changed what were references into indices (into the array). If you do have to use string data, perhaps try to ensure you don't have 20,000 different string instances with the same value - some kind of manual interner (a Dictionary<string,string> would suffice) can be really useful there.
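For illustration, a minimal manual interner along those lines:

```csharp
using System.Collections.Generic;

// Sketch of the manual interner mentioned above: every distinct string
// value is stored once, and repeated values share a single instance.
class StringInterner
{
    private readonly Dictionary<string, string> pool =
        new Dictionary<string, string>();

    public string Intern(string value)
    {
        if (value == null) return null;
        if (pool.TryGetValue(value, out string existing))
            return existing;      // reuse the single stored instance
        pool[value] = value;      // first occurrence becomes the canonical one
        return value;
    }
}
```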
Note that this needn't impact your public API, since you can always create the old class data from the struct storage - the difference is that the class will only exist briefly, as a DTO, and so will be collected cheaply in the next gen-0 sweep.
YMMV, but this worked well enough for us.
The problem is: you need to be really careful when working with structs; I strongly advise making them immutable.
I'm writing a web application that constantly retrieves XML "components" from a database and then transforms them into XHTML using XSLT. Some of these transformations happen frequently (e.g. a "sidebar navigation" component goes out for the same XML and performs the same XSL transformation on every page that features that sidebar), so I have started implementing some caching to speed things up.
In my current solution, before each component attempts to perform a transformation, the component checks with a static CacheManager object to see if a cached version of the transformed XML exists. If so, the component outputs this. If not, the component performs the transformation and then stores the transformed XML with the CacheManager object.
The CacheManager object keeps an in-memory store of the cached transformed XML (in a Dictionary, to be exact). In my local, development environment this is working beautifully, but I'm assuming this may not be a very scalable solution.
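For concreteness, a minimal sketch of that check-then-store pattern (the names are invented; a ConcurrentDictionary keeps concurrent requests safe):

```csharp
using System;
using System.Collections.Concurrent;

// Stand-in for the poster's CacheManager; keys and delegates are invented.
static class CacheManager
{
    static readonly ConcurrentDictionary<string, string> cache =
        new ConcurrentDictionary<string, string>();

    public static string GetOrTransform(string componentKey, Func<string> transform)
    {
        // Hit: return the cached XHTML. Miss: run the XSL transformation
        // once and cache its output. (GetOrAdd may invoke the factory more
        // than once under contention; harmless here, just redundant work.)
        return cache.GetOrAdd(componentKey, _ => transform());
    }
}
```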
What are the potential downfalls of storing these data in-memory? Do I need to put a cap on the amount of data I can store in an in-memory data structure like this? Should I be using a different data store for this type of caching altogether?
The obvious disadvantage will be, as you suspect, potentially high memory usage by your cache. You may want to implement a system where rarely-used items "expire" out of the cache when memory pressure goes up. Microsoft's Caching Application Block implements pretty much everything you need, right out of the box.
Another potential (though unlikely) problem you could run into is the cost of digging through the cache to find what you need. At some point it could be faster to just go ahead and generate what you need instead of looking through the cache. We've run into this in at least one precise scenario related to very large caches and very cheap operations. It's unlikely, but it can happen.
You can do the caching but keep only reference values (URIs) in the dictionary and store the actual transformed XML on disk. This is probably going to be faster than retrieving the values from the database and doing the transformation again; slower than keeping it all in memory, but it gets around the problem of holding all of this cached data in memory in the first place. It could also allow the transformed XML documents to survive a recycle/reset, although you'd have to rebuild your dictionary, and you'd also have to think about "expiring" documents that need to be purged from the disk cache.
Just an idea..
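A minimal sketch of that idea (the directory naming is invented, and keys are assumed to be filesystem-safe):

```csharp
using System.Collections.Generic;
using System.IO;

// Sketch of the disk-backed cache idea above: the dictionary holds only
// file paths, while the transformed XHTML lives on disk.
class DiskXslCache
{
    private readonly Dictionary<string, string> pathsByKey =
        new Dictionary<string, string>();
    private readonly string cacheDir =
        Path.Combine(Path.GetTempPath(), "xsl-cache");

    public DiskXslCache()
    {
        Directory.CreateDirectory(cacheDir);
    }

    public void Store(string key, string transformedXml)
    {
        string path = Path.Combine(cacheDir, key + ".xhtml");
        File.WriteAllText(path, transformedXml);
        pathsByKey[key] = path;
    }

    // Returns null on a miss; the caller re-runs the XSL transformation.
    public string TryGet(string key)
    {
        return pathsByKey.TryGetValue(key, out string path) && File.Exists(path)
            ? File.ReadAllText(path)
            : null;
    }
}
```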
You should definitely look for a generic cache implementation. I don't do C#, so I don't know of any solution in your language; in Java, I would have recommended EHCache.
Caching is much harder than it seems at first. That's why it is probably a good idea to rely on work done by others. Some of the problems you will run into at some point are concurrency, cache invalidation, time to live, blocking caches, cache management (statistics, clearing of caches, ...), overflowing to disk, distributed caching, ...
Monitoring a cache is a must. You need to see whether the caching strategy you put in place is actually doing any good (cache hits/cache misses, percentage of the cache used, ...). You should probably divide your cache into multiple regions to be able to monitor cache usage better.
As a side note, as long as you cache your XML after transformation, you should probably store its string representation (and not an object tree). That's one less transformation to do after the cache, as you are probably outputting it as a string anyway. And there is a good chance that the string representation will take less space (but as always, measure it; don't take my word for it).