Does anyone know of a library for in-process caching, other than the MS ASP.NET cache and ENTLib, with at least two features:
- expiring time;
- object dependency.
I implemented a thread-safe pseudo-LRU for in-memory caching. It's simpler and faster than using MemoryCache - performance is very close to ConcurrentDictionary (10x faster than MemoryCache, and zero memory allocations for hits).
Usage of the LRU looks like this (just like dictionary but you need to give capacity - it's a bounded cache):
int capacity = 666;
var timeToLive = TimeSpan.FromMinutes(5);
var lru = new ConcurrentTLru<int, SomeItem>(capacity, timeToLive);
var value = lru.GetOrAdd(1, (k) => new SomeItem(k));
GitHub: https://github.com/bitfaster/BitFaster.Caching
Install-Package BitFaster.Caching
My advice would be to cache object graphs (e.g. classes with properties that reference other classes) in an LRU, so that you can evict things at well-defined nodes in the dependency tree by simply updating the object. You will naturally end up with something that is easier to understand and doesn't have dependency cycles.
There are a few cache providers over at Codeplex, SharedCache seems promising: http://sharedcache.codeplex.com/.
I am coding an MVC 5 internet application, and am using the MemoryCache object for caching objects. I see that when using the MemoryCache.Set method, an absoluteExpiration can be specified.
If I use the following way to add and retrieve an object from the MemoryCache, what is the absoluteExpiration set to:
cache["cacheItem"] = testObject;
TestObject testObject = cache["cacheItem"] as TestObject;
Also, when using the MemoryCache in an MVC internet application, should I set the amount of memory that can be used for the MemoryCache, or is the default implementation safe enough for an Azure website?
Thanks in advance.
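Your code is equivalent to calling Set with no policy, like below (Set overwrites an existing entry, just as the indexer does, whereas Add would leave an existing entry untouched):
cache.Set("cacheItem", testObject, null);
The added entry would have the default expiration time, which is infinite (i.e., it doesn't expire). See the MSDN documentation on CacheItemPolicy.AbsoluteExpiration for details.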
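If an entry that does expire is wanted, a policy can be passed explicitly. The following is only a minimal sketch, assuming System.Runtime.Caching is referenced; the helper name AddWithExpiration, the stand-in TestObject type and the caller-chosen lifetime are illustrative assumptions, not part of the question.

```csharp
using System;
using System.Runtime.Caching;

// Stand-in for the asker's type (assumed shape).
public class TestObject
{
    public string Name { get; set; }
}

public static class CacheExpirationSketch
{
    // Stores the value so that it is evicted a fixed time after insertion.
    public static void AddWithExpiration(ObjectCache cache, string key, object value, TimeSpan lifetime)
    {
        var policy = new CacheItemPolicy
        {
            // Absolute expiration: a point in time, not a sliding window.
            AbsoluteExpiration = DateTimeOffset.UtcNow.Add(lifetime)
        };
        cache.Set(key, value, policy);
    }
}
```

Reading the entry back is unchanged: cache["cacheItem"] as TestObject returns null once the entry has expired.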
To answer the question about memory usage: (from CacheMemoryLimitMegabytes Property):
The default is zero, which indicates that MemoryCache instances manage their own memory based on the amount of memory that is installed on the computer.
I would say that it's safe to let the MemoryCache defaults decide how much memory to use, unless you're doing something really fancy.
We have a Hashtable (specifically the C# Dictionary class) that holds several thousands/millions of (Key,Value) pairs for near O(1) search hits/misses.
We'd like to be able to flush this data structure to disk (serialize it) and load it again later (deserialize) such that the internal hashtable of the Dictionary is preserved.
What we do right now:
Load from Disk => List<KVEntity>. (KVEntity is serializable. We use Avro to serialize - can drop Avro if needed)
Read every KVEntity from array => dictionary. This regenerates the dictionary/hashtable internal state.
< System operates, Dictionary can grow/shrink/values change etc >
When saving, read from the dictionary into array (via myKVDict.Values.SelectMany(x => x) into a new List<KVEntity>)
We serialize the array (List<KVEntity>) to disk to save the raw data
Notice that during our save/restore, we lose the internal hashtable/dictionary state and have to rebuild it each time.
We'd like to directly serialize to/from Dictionary (including its internal "live" state) instead of using an intermediate array just for the disk I/O. How can we do that?
Some pseudo code:
// The actual "node" that has information. Both myKey and myValue have actual data worth storing
public class KVEntity
{
public string myKey {get;set;}
public DataClass myValue {get;set;}
}
// unit of disk IO/serialization
public List<KVEntity> myKVList {get;set;}
// unit of run time processing. The string key is KVEntity.myKey
public Dictionary<string,KVEntity> myKVDict {get;set;}
Storing the internal state of the Dictionary instance would be bad practice - a key tenet of OOP is encapsulation: that internal implementation details are deliberately hidden from the consumer.
Furthermore, the mapping algorithm used by Dictionary might change across different versions of the .NET Framework, especially given that CIL assemblies are designed to be forward-compatible (i.e. a program written against .NET 2.0 will generally work against .NET 4.5).
Finally, there are no real performance gains from serialising the internal state of the dictionary. It is much better to use a well-defined file format with a focus on maintainability rather than speed. Besides, if the dictionary contains "several thousands" of entries then that should load from disk in under 15ms by my reckoning (assuming you have an efficient on-disk format). Also, a data structure optimised for RAM will not necessarily work well on disk, where sequential reads/writes are better.
Your post is very adamant about working with the internal state of the dictionary, but your existing approach seems fine (albeit it could do with some optimisations). If you reveal more details we can help you make it faster.
Optimisations
The main issue I see with your existing implementation is the conversion to/from arrays and Lists, which is unnecessary given that Dictionary is directly enumerable.
I would do something like this:
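Dictionary<string, TFoo> dict = ... // where TFoo has a parameterless constructor and implements arbitrary Serialize(BinaryWriter) and Deserialize(BinaryReader) methods
using (FileStream fs = File.Create("filename.dat")) // Create truncates an existing file; OpenWrite would leave stale trailing bytes
using (BinaryWriter wtr = new BinaryWriter(fs, Encoding.UTF8)) {
    wtr.Write( dict.Count ); // entry count first, so the reader can pre-size its dictionary
    foreach (KeyValuePair<string, TFoo> pair in dict) { // enumerate pairs directly instead of looking up each key
        wtr.Write( pair.Key );
        wtr.Write('\0');
        pair.Value.Serialize( wtr );
        wtr.Write('\0'); // assuming NULL characters can work as record delimiters for safety.
    }
}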
Assuming that your TFoo's Serialize method is fast, I really don't think you'll get any faster speeds than this approach.
Implementing a de-serializer is an exercise for the reader, but should be trivial. Note how I stored the size of the dictionary in the file, so the returned dictionary can be created with the correct capacity, thus avoiding the repeated resizing problem that @spender describes in his comment.
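For readers who want a starting point, here is one possible reader for the format written above (count, then key, delimiter, value, delimiter per entry). It is a sketch under assumptions: SampleItem is a made-up stand-in for TFoo, DictionaryFile, Save and Load are hypothetical names, and streams are used instead of a file name so the round trip is easy to try in memory.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Text;

// Hypothetical payload type standing in for TFoo.
public class SampleItem
{
    public int Id;
    public string Label;

    public void Serialize(BinaryWriter w) { w.Write(Id); w.Write(Label); }
    public void Deserialize(BinaryReader r) { Id = r.ReadInt32(); Label = r.ReadString(); }
}

public static class DictionaryFile
{
    // Writes the same layout as the loop above: count, then key, '\0', value, '\0' per entry.
    public static void Save(Dictionary<string, SampleItem> dict, Stream stream)
    {
        using (var w = new BinaryWriter(stream, Encoding.UTF8, leaveOpen: true))
        {
            w.Write(dict.Count);
            foreach (KeyValuePair<string, SampleItem> pair in dict)
            {
                w.Write(pair.Key);
                w.Write('\0');
                pair.Value.Serialize(w);
                w.Write('\0');
            }
        }
    }

    public static Dictionary<string, SampleItem> Load(Stream stream)
    {
        using (var r = new BinaryReader(stream, Encoding.UTF8, leaveOpen: true))
        {
            int count = r.ReadInt32();
            // Pre-sizing with the stored count avoids internal resizing while entries are added.
            var dict = new Dictionary<string, SampleItem>(count);
            for (int i = 0; i < count; i++)
            {
                string key = r.ReadString();
                r.ReadChar();               // skip the delimiter after the key
                var value = new SampleItem();
                value.Deserialize(r);
                r.ReadChar();               // skip the delimiter after the value
                dict.Add(key, value);
            }
            return dict;
        }
    }
}
```

The hash buckets are still rebuilt on load, but with the capacity known up front this is a single pass of inserts with no intermediate List.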
So we're going to stick with our existing strategy given Dai's reasoning and that we have C# and Java compatibility to maintain (which means the extra tree-state bits of the C# Dictionary would be dropped on the Java side anyways which would load only the node data as it does right now).
For later readers still interested in this, I found a very good response here that somewhat answers the question posed. A critical difference is that this answer is for B+ Trees, not Dictionaries, although in practical applications those two data structures are very similar in performance. B+ Tree performance is closer to that of Dictionaries than to that of regular trees (like binary, red-black, AVL etc). Specifically, Dictionaries deliver near O(1) performance (but no "select from a range" abilities) while B+ Trees have O(log_b(X)), where the base b is usually large, which makes them very performant compared to regular trees where b=2. I'm copy-pasting it here for completeness but all credit goes to csharptest.net for the B+ Tree code, test, benchmarks and writeup(s).
For completeness I'm going to add my own implementation here.
Introduction - http://csharptest.net/?page_id=563
Benchmarks - http://csharptest.net/?p=586
Online Help - http://help.csharptest.net/
Source Code - http://code.google.com/p/csharptest-net/
Downloads - http://code.google.com/p/csharptest-net/downloads
NuGet Package - http://nuget.org/List/Packages/CSharpTest.Net.BPlusTree
My code has to generate millions of objects to perform some algorithm (millions of objects will be created and at the same time 2/3 of them should be destroyed).
I know that object creation causes performance problems.
Could someone recommend how to manage such a huge number of objects, garbage collection and so on?
Thank you.
Elaborating a bit on my "make them a value type" comment above.
If you have a struct Foo, then preparing for the algorithm with e.g. var storage = new Foo[1000000] will only allocate one big block of memory (I'm assuming the required amount of contiguous memory will be available).
You can then manually manage the memory inside that block to avoid performing more memory allocations:
Keep a count of how many slots in the array are actually used
To "create" a new Foo, put it at the first unused slot and increment the counter
To "delete" a Foo, swap it with the one in last used slot and decrement the counter
Of course, making an algorithm work with value types instead of reference types is not as simple as changing class to struct. But if workable, it will allow you to side-step all of this overhead for a one-time startup cost.
If it is possible in your algorithm, then try to reuse objects - if 2/3 are destroyed immediately then you can try to use them again.
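The three bookkeeping rules above can be sketched roughly as follows. The fields of Foo and the FooPool name are made up for illustration, and the sketch assumes callers can tolerate items changing index when another item is removed.

```csharp
using System;

// Example value type; the fields are placeholders.
public struct Foo
{
    public int Id;
    public double Value;
}

// Fixed-capacity pool: live items always occupy slots [0, Count).
public sealed class FooPool
{
    private readonly Foo[] storage;

    public int Count { get; private set; }

    public FooPool(int capacity)
    {
        storage = new Foo[capacity]; // the only allocation
    }

    // "Create": copy the item into the first unused slot and return its index.
    public int Add(Foo item)
    {
        if (Count == storage.Length) throw new InvalidOperationException("Pool is full.");
        storage[Count] = item;
        return Count++;
    }

    // "Delete": overwrite the slot with the last used item, then shrink the used range.
    // The item that used to be last now lives at 'index'.
    public void RemoveAt(int index)
    {
        if (index < 0 || index >= Count) throw new ArgumentOutOfRangeException(nameof(index));
        storage[index] = storage[Count - 1];
        Count--;
    }

    public Foo Get(int index)
    {
        if (index < 0 || index >= Count) throw new ArgumentOutOfRangeException(nameof(index));
        return storage[index];
    }
}
```

Because removal moves the last item, any index held elsewhere for that item becomes stale; an algorithm that needs stable handles would need an extra level of indirection.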
You can implement the IDisposable interface on the type whose objects are being created. Then you can use the using keyword and write whatever logic involves the object within the using scope. The following links will give you a fair idea of what I am trying to say. Hope they are of some help.
http://www.codeguru.com/csharp/csharp/cs_syntax/interfaces/article.php/c8679
Am I implementing IDisposable correctly?
Regards,
Samar
I am putting 2 very large datasets into memory, performing a join to filter out a subset from the first collection and then attempting to destroy the second collection, as it uses approximately 600MB of my system's RAM. The problem is that the code below is not working. After the code below runs, a foreach loop runs and takes about 15 mins. During this time the memory does NOT reduce from 600MB+. Am I doing something wrong?
List<APPLES> tmpApples = dataContext.Apples.ToList(); // 100MB
List<ORANGES> tmpOranges = dataContext.Oranges.ToList(); // 600MB
List<APPLES> filteredApples = tmpApples
.Join(tmpOranges, apples => apples.Id, oranges => oranges.Id, (apples, oranges) => apples).ToList();
tmpOranges.Clear();
tmpOranges = null;
GC.Collect();
Note that I re-use tmpApples later, so I am not clearing it just now.
A few things to note:
Unless your dataContext can be cleared / garbage collected, that may well be retaining references to a lot of objects
Calling Clear() and then setting the variable to null is pointless, if you're really not doing anything else with the list. The GC can tell when you're not using a variable any more, in almost all cases.
Presumably you're judging how much memory the process has reserved; I don't think the CLR will actually return memory to the operating system, but the memory which has been freed by garbage collection will be available to further uses within the CLR. (EDIT: As per comments below, it's possible that the CLR frees areas of the Large Object Heap, but I don't know for sure.)
Clearing, nullifying and collecting hardly ever has any (positive) effect. The GC will automatically detect when objects are not referenced anymore. Furthermore, as long as the Join operation runs, both the tmpApples and tmpOranges collections are referenced, and with them all their objects. They therefore cannot be collected.
A better solution would be to do the filter in the database:
// NOTE that I removed the ToList operations, so the join is translated and executed by the database
IQueryable<APPLES> tmpApples = dataContext.Apples;
IQueryable<ORANGES> tmpOranges = dataContext.Oranges;
List<APPLES> filteredApples = tmpApples
    .Join(tmpOranges, apples => apples.Id,
        oranges => oranges.Id, (apples, oranges) => apples)
    .ToList(); // only the filtered apples are materialised in memory
The reason this data is not collected is that although you are clearing the collection (hence the collection does not have a reference to the items anymore), the DataContext keeps a reference, and this causes them to stay in memory.
You have to dispose your DataContext as soon as you are done.
UPDATE
OK, you have probably fallen victim to the large object issue.
Assuming this is a Large Object Heap issue, you could try not to retrieve all apples at once but instead get them in "packets". So instead of calling
List<APPLES> apples = dataContext.Apples.ToList();
try to store the apples in separate lists:
int packetSize = 100;
List<APPLES> applePacket1 = dataContext.Apples.Take(packetSize).ToList();
List<APPLES> applePacket2 = dataContext.Apples.Skip(packetSize).Take(packetSize).ToList();
Does that help?
Use a profiler or SOS.dll to find out where your memory is going. If some operations take far too much time, it sounds like you are swapping out to the page file.
EDIT: Also keep in mind, the Debug version will delay the collection of local variables which are not referenced anymore for easier investigation.
The only thing you're doing wrong is explicitly calling the garbage collector. You don't need to do this (in fact you shouldn't) and, as Steven says, you don't need to do anything to the collections anyway; they'll just go away - eventually.
If your concern is the performance of the 15 minute foreach loop, perhaps it is that loop which you should post. It is probably not related to the memory usage.
I have some places where implementing some sort of cache might be useful. For example in cases of doing resource lookups based on custom strings, finding names of properties using reflection, or to have only one PropertyChangedEventArgs per property name.
A simple example of the last one:
public static class Cache
{
private static Dictionary<string, PropertyChangedEventArgs> cache;
static Cache()
{
cache = new Dictionary<string, PropertyChangedEventArgs>();
}
public static PropertyChangedEventArgs GetPropertyChangedEventArgs(
string propertyName)
{
PropertyChangedEventArgs args;
if (!cache.TryGetValue(propertyName, out args)) // single lookup instead of ContainsKey plus indexer
cache[propertyName] = args = new PropertyChangedEventArgs(propertyName);
return args;
}
}
But, will this work well? For example if we had a whole load of different propertyNames, that would mean we would end up with a huge cache sitting there never being garbage collected or anything. I'm imagining if what is cached are larger values and if the application is a long-running one, this might end up as kind of a problem... or what do you think? How should a good cache be implemented? Is this one good enough for most purposes? Any examples of some nice cache implementations that are not too hard to understand or way too complex to implement?
This is a large problem, you need to determine the domain of the problem and apply the correct techniques. For instance, how would you describe the expiration of the objects? Do they become stale over a fixed interval of time? Do they become stale from an external event? How frequently does this happen? Additionally, how many objects do you have? Finally, how much does it cost to generate the object?
The simplest strategy would be to do straight memoization, as you have above. This assumes that objects never expire, and that there are not so many as to run your memory dry and that you think the cost to create these objects warrants the use of a cache to begin with.
The next layer might be to limit the number of objects, and use an implicit expiration policy, such as LRU (least recently used). To do this you'd typically use a doubly linked list in addition to your dictionary, and every time an object is accessed it is moved to the front of the list. Then, if you need to add a new object but you are at your limit of total objects, you'd remove from the back of the list.
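A bare-bones version of that dictionary plus doubly linked list arrangement might look like the sketch below. The LruCache name and its members are illustrative, and the sketch is not thread-safe; a real cache would need locking or a concurrent design.

```csharp
using System;
using System.Collections.Generic;

public sealed class LruCache<TKey, TValue>
{
    private readonly int capacity;
    // The dictionary gives O(1) lookup of a key's node; the list keeps recency order.
    private readonly Dictionary<TKey, LinkedListNode<KeyValuePair<TKey, TValue>>> map;
    private readonly LinkedList<KeyValuePair<TKey, TValue>> order =
        new LinkedList<KeyValuePair<TKey, TValue>>();

    public LruCache(int capacity)
    {
        if (capacity <= 0) throw new ArgumentOutOfRangeException(nameof(capacity));
        this.capacity = capacity;
        map = new Dictionary<TKey, LinkedListNode<KeyValuePair<TKey, TValue>>>(capacity);
    }

    public int Count { get { return map.Count; } }

    public bool TryGet(TKey key, out TValue value)
    {
        LinkedListNode<KeyValuePair<TKey, TValue>> node;
        if (map.TryGetValue(key, out node))
        {
            // Accessed: move the node to the front (most recently used).
            order.Remove(node);
            order.AddFirst(node);
            value = node.Value.Value;
            return true;
        }
        value = default(TValue);
        return false;
    }

    public void Set(TKey key, TValue value)
    {
        LinkedListNode<KeyValuePair<TKey, TValue>> existing;
        if (map.TryGetValue(key, out existing))
        {
            order.Remove(existing);            // replaced below with a fresh node
        }
        else if (map.Count >= capacity)
        {
            // At the limit: evict from the back (least recently used).
            LinkedListNode<KeyValuePair<TKey, TValue>> oldest = order.Last;
            order.RemoveLast();
            map.Remove(oldest.Value.Key);
        }
        var node = new LinkedListNode<KeyValuePair<TKey, TValue>>(
            new KeyValuePair<TKey, TValue>(key, value));
        order.AddFirst(node);
        map[key] = node;
    }
}
```

Both operations stay O(1) because the dictionary stores the list node itself, so no list traversal is needed to move or remove an entry.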
Next, you might need to enforce explicit expiration, either based on time, or some external stimulus. This would require you to have some sort of expiration event that could be called.
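As a rough illustration of both kinds of explicit expiration, the sketch below checks a time-to-live lazily on read and exposes an Invalidate method for external events. ExpiringCache and its members are hypothetical names, the clock is injectable only to make the behaviour easy to try out, and the class is not thread-safe.

```csharp
using System;
using System.Collections.Generic;

public sealed class ExpiringCache<TKey, TValue>
{
    private struct Entry
    {
        public TValue Value;
        public DateTime ExpiresAtUtc;
    }

    private readonly Dictionary<TKey, Entry> entries = new Dictionary<TKey, Entry>();
    private readonly TimeSpan timeToLive;
    private readonly Func<DateTime> utcNow;

    // utcNow is optional; it defaults to the system clock.
    public ExpiringCache(TimeSpan timeToLive, Func<DateTime> utcNow = null)
    {
        this.timeToLive = timeToLive;
        this.utcNow = utcNow ?? (() => DateTime.UtcNow);
    }

    public void Set(TKey key, TValue value)
    {
        entries[key] = new Entry { Value = value, ExpiresAtUtc = utcNow() + timeToLive };
    }

    public bool TryGet(TKey key, out TValue value)
    {
        Entry entry;
        if (entries.TryGetValue(key, out entry))
        {
            if (utcNow() < entry.ExpiresAtUtc)
            {
                value = entry.Value;
                return true;
            }
            entries.Remove(key); // stale: drop it lazily on access
        }
        value = default(TValue);
        return false;
    }

    // Expiration driven by an external stimulus, e.g. "the underlying record changed".
    public bool Invalidate(TKey key)
    {
        return entries.Remove(key);
    }
}
```

Stale entries that are never read again stay in the dictionary with this approach, so a periodic sweep or a size bound would still be needed in a long-running process.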
As you can see, there is a lot of design in caching, so you need to understand your domain and engineer appropriately. I felt you did not provide enough detail for me to discuss specifics.
P.S. Please consider using Generics when defining your class so that many types of objects can be stored, thus allowing your caching code to be reused.
You could wrap each of your cached items in a WeakReference. This would allow the GC to reclaim items if-and-when required, however it doesn't give you any granular control of when items will disappear from the cache, or allow you to implement explicit expiration policies etc.
(Ha! I just noticed that the example given on the MSDN page is a simple caching class.)
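A small sketch of the WeakReference idea, assuming .NET 4.5 or later for the generic WeakReference<T>; the WeakCache name and its GetOrAdd method are made up for illustration, and the class is not thread-safe.

```csharp
using System;
using System.Collections.Generic;

public sealed class WeakCache<TKey, TValue> where TValue : class
{
    private readonly Dictionary<TKey, WeakReference<TValue>> entries =
        new Dictionary<TKey, WeakReference<TValue>>();

    // Returns the cached value if the GC has not reclaimed it; otherwise rebuilds it.
    public TValue GetOrAdd(TKey key, Func<TKey, TValue> factory)
    {
        WeakReference<TValue> weak;
        TValue value;
        if (entries.TryGetValue(key, out weak) && weak.TryGetTarget(out value))
        {
            return value; // still alive somewhere, reuse it
        }
        value = factory(key);
        entries[key] = new WeakReference<TValue>(value);
        return value;
    }
}
```

Note that only the cached values are weakly held: the keys and the WeakReference wrappers themselves stay in the dictionary until they are overwritten, so a cache with many distinct keys would still need occasional pruning.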
Looks like .NET 4.0 now supports System.Runtime.Caching for caching many types of things. You should look into that first, instead of re-inventing the wheel. More details:
http://msdn.microsoft.com/en-us/library/system.runtime.caching%28VS.100%29.aspx
This is a nice debate to have, but depending on your application, here are some tips:
You should define the maximum size of the cache, what to do with old items when the cache is full (a scavenging strategy), the time to live of objects in the cache, whether the cache can or must be persisted somewhere other than memory in case the application terminates abnormally, and so on.
This is a common problem that has many solutions depending on your application need.
It is so common that Microsoft released a whole library to address it.
You should check out Microsoft Velocity before rolling up your own cache.
http://msdn.microsoft.com/en-us/data/cc655792.aspx
Hope this helps.
You could use a WeakReference, but if your object is not that large then don't, because the WeakReference could take more memory than the object itself, which is not a good technique. Also, if the object is short-lived and will never be promoted from generation 0 to generation 1 by the GC, there is not much need for the WeakReference; implementing the IDisposable interface on the object would suffice, with the release calling SuppressFinalize.
If you want to control the lifetime, you need a timer that checks the current date/time against the desiredExpirationTime of each object in your cache.
The important thing is that if the object is large you should opt for the WeakReference, otherwise use a strong reference. Also, you can set a capacity on the Dictionary and queue additional objects in a temp bin: serialize each object to a temp directory, load it when there is room in the Dictionary, and then clear it from the temp directory.