Memory management / caching for costly objects in C#

Assume that I have the following object
public class MyClass
{
    public ReadOnlyDictionary<T, V> Dict
    {
        get
        {
            return createDictionary();
        }
    }
}
Assume that ReadOnlyDictionary is a read-only wrapper around Dictionary<T, V>.
The createDictionary method takes significant time to complete, and the returned dictionary is relatively large.
Obviously, I want to implement some sort of caching so that I can reuse the result of createDictionary, but I also do not want to abuse the garbage collector or use too much memory.
I thought of using a WeakReference for the dictionary, but I am not sure whether this is the best approach.
What would you recommend? How should I properly handle the result of a costly method that might be called multiple times?
UPDATE:
I am interested in advice for a C# 2.0 library (single DLL, non-visual). The library might be used in a desktop or a web application.
UPDATE 2:
The question is relevant for read-only objects as well. I changed the type of the property from Dictionary to ReadOnlyDictionary.
UPDATE 3:
T is a relatively simple type (string, for example). V is a custom class. You may assume that an instance of V is costly to create. The dictionary might contain from 0 to a couple of thousand elements.
The code is assumed to be accessed from a single thread, or from multiple threads with an external synchronization mechanism.
I am fine if the dictionary is GC-ed when no one uses it. I am trying to find a balance between time (I want to somehow cache the result of createDictionary) and memory expenses (I do not want to keep memory occupied longer than necessary).

WeakReference is not a good solution for a cache, since your object won't survive the next GC if nobody else is referencing your dictionary. You can make a simple cache by storing the created value in a member variable and reusing it if it is not null.
This is not thread safe, and under heavy concurrent access you could end up creating the dictionary several times. You can use the double-checked locking pattern to guard against this with minimal performance impact.
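A minimal sketch of the member-variable cache with double-checked locking described above. The value type (Dictionary<string, int>), the BuildCount counter and the Invalidate method are assumptions added for illustration; the question's createDictionary and ReadOnlyDictionary are not shown in the source, so a trivial stand-in is used here.

```csharp
using System;
using System.Collections.Generic;

public class MyClass
{
    private readonly object sync = new object();

    // volatile so the unlocked first check always reads the published reference
    private volatile Dictionary<string, int> cached;

    // Illustration only: counts how often the expensive method actually ran.
    public int BuildCount;

    public Dictionary<string, int> Dict
    {
        get
        {
            Dictionary<string, int> result = cached;   // first check, no lock
            if (result == null)
            {
                lock (sync)
                {
                    result = cached;                   // second check, under the lock
                    if (result == null)
                    {
                        result = CreateDictionary();
                        cached = result;
                    }
                }
            }
            return result;
        }
    }

    // Hypothetical helper: drops the cached value so the next access rebuilds it.
    public void Invalidate()
    {
        lock (sync) { cached = null; }
    }

    // Stand-in for the costly createDictionary method from the question.
    private Dictionary<string, int> CreateDictionary()
    {
        BuildCount++;
        Dictionary<string, int> d = new Dictionary<string, int>();
        d["answer"] = 42;
        return d;
    }
}
```

With this shape the dictionary stays alive for as long as the MyClass instance does, unless something calls Invalidate; deciding when to do that is the cache policy question raised in the rest of this answer.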
To help you further, you would need to specify whether concurrent access is an issue for you, how much memory your dictionary consumes, and how it is created. If, for example, the dictionary is the result of an expensive query, it might help to simply serialize the dictionary to disk and reuse it until you need to recreate it (this depends on your specific needs).
Caching is another word for memory leak if you have no clear policy for when your object should be removed from the cache. Since you are trying WeakReference, I assume you do not know exactly when would be a good time to clear the cache.
Another option is to compress the dictionary into a less memory-hungry structure. How many keys does your dictionary have, and what are the values?

There are four major mechanisms available to you (Lazy comes in 4.0, so it is not an option):
lazy initialization
virtual proxy
ghost
value holder
Each has its own advantages.
I suggest a value holder, which populates the dictionary on the first call of the holder's GetValue method. You can then use that value for as long as you want, it is only created once, and it is only created when needed.
For more information, see Martin Fowler's page.
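A rough sketch of the value holder idea, written without Func<T> or Lazy<T> so that it stays within C# 2.0. The names ValueLoader and ValueHolder are hypothetical, and the class is deliberately not thread safe, which matches the single-threaded (or externally synchronized) usage the question describes.

```csharp
using System;

// Hypothetical delegate type; C# 2.0 has no built-in Func<T>.
public delegate T ValueLoader<T>();

public sealed class ValueHolder<T> where T : class
{
    private readonly ValueLoader<T> loader;
    private T value;

    public ValueHolder(ValueLoader<T> loader)
    {
        if (loader == null) throw new ArgumentNullException("loader");
        this.loader = loader;
    }

    // Runs the loader on the first call and returns the stored value afterwards.
    // Note: if the loader returns null, it will be invoked again on the next call.
    public T GetValue()
    {
        if (value == null)
        {
            value = loader();
        }
        return value;
    }
}
```

A class would keep a ValueHolder field and have its property return holder.GetValue(), so the expensive work is deferred until the first access.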

Are you sure you need to cache the entire dictionary?
From what you say, it might be better to keep a Most-Recently-Used list of key-value pairs.
If the key is found in the list, just return the value.
If it is not, create the one value (which is supposedly faster than creating all of them, and using less memory too) and store it in the list, thereby removing the key-value pair that hasn't been used the longest.
Here's a very simple MRU list implementation, it might serve as inspiration:
using System.Collections.Generic;
using System.Linq;

internal sealed class MostRecentlyUsedList<T> : IEnumerable<T>
{
    private readonly List<T> items;
    private readonly int maxCount;

    public MostRecentlyUsedList(int maxCount, IEnumerable<T> initialData)
        : this(maxCount)
    {
        this.items.AddRange(initialData.Take(maxCount));
    }

    public MostRecentlyUsedList(int maxCount)
    {
        this.maxCount = maxCount;
        this.items = new List<T>(maxCount);
    }

    /// <summary>
    /// Adds an item to the top of the most recently used list.
    /// </summary>
    /// <param name="item">The item to add.</param>
    /// <returns><c>true</c> if the list was updated, <c>false</c> otherwise.</returns>
    public bool Add(T item)
    {
        int index = this.items.IndexOf(item);
        if (index != 0)
        {
            // item is not already the first in the list
            if (index > 0)
            {
                // item is in the list, but not in the first position
                this.items.RemoveAt(index);
            }
            else if (this.items.Count >= this.maxCount)
            {
                // item is not in the list, and the list is full already
                this.items.RemoveAt(this.items.Count - 1);
            }
            this.items.Insert(0, item);
            return true;
        }
        else
        {
            return false;
        }
    }

    public IEnumerator<T> GetEnumerator()
    {
        return this.items.GetEnumerator();
    }

    System.Collections.IEnumerator System.Collections.IEnumerable.GetEnumerator()
    {
        return this.GetEnumerator();
    }
}
In your case, T is a key-value pair. Keep maxCount small enough that searching stays fast and memory usage does not become excessive. Call Add each time you use an item.

An application should use WeakReference as a caching mechanism if the useful lifetime of an object's presence in the cache will be comparable to reference lifetime of the object. Suppose, for example, that you have a method which will create a ReadOnlyDictionary based on deserializing a String. If a common usage pattern would be to read a string, create a dictionary, do some stuff with it, abandon it, and start again with another string, WeakReference is probably not ideal. On the other hand, if your objective is to deserialize many strings (quite a few of which will be equal) into ReadOnlyDictionary instances, it may be very useful if repeated attempts to deserialize the same string yield the same instance. Note that the savings would not just come from the fact that one only had to do the work of building the instance once, but also from the facts that (1) it would not be necessary to keep multiple instances in memory, and (2) if ReadOnlyDictionary variables refer to the same instance, they can be known to be equivalent without having to examine the instances themselves. By contrast, determining whether two distinct ReadOnlyDictionary instances were equivalent might require examining all the items in each. Code which would have to do many such comparisons could benefit from using a WeakReference cache so that variables which hold equivalent instances would usually hold the same instance.
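The "same string yields the same instance" idea could be sketched roughly as below. WeakInternCache and its Factory delegate are hypothetical names, the deserialization step is left to the caller-supplied factory, and the sketch is not thread safe. Entries whose targets have been collected are replaced on the next lookup for that key but are otherwise never purged, which a real implementation would need to handle.

```csharp
using System;
using System.Collections.Generic;

public sealed class WeakInternCache<TValue> where TValue : class
{
    // Hypothetical factory signature: builds the instance for a given source string.
    public delegate TValue Factory(string key);

    private readonly Dictionary<string, WeakReference> entries =
        new Dictionary<string, WeakReference>();
    private readonly Factory factory;

    public WeakInternCache(Factory factory)
    {
        if (factory == null) throw new ArgumentNullException("factory");
        this.factory = factory;
    }

    public TValue Get(string key)
    {
        WeakReference weak;
        if (entries.TryGetValue(key, out weak))
        {
            // Target is null once the GC has reclaimed the instance.
            TValue alive = weak.Target as TValue;
            if (alive != null)
            {
                return alive;
            }
        }

        // Not cached, or already collected: build it again and remember it weakly.
        TValue created = factory(key);
        entries[key] = new WeakReference(created);
        return created;
    }
}
```

As long as some caller still holds a strong reference to an instance, repeated Get calls with an equal string return that same instance, so equivalence can be checked with a reference comparison.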

I think you have two mechanisms you can rely on for caching, instead of developing your own. The first, as you yourself suggested, is to use a WeakReference and let the garbage collector decide when to free this memory.
You have a second mechanism: memory paging. If the dictionary is created in one swoop, it will probably be stored in a more or less contiguous part of the heap. Just keep the dictionary alive, and let Windows page it out to the swap file if you don't need it. Depending on your usage (how random your dictionary access is), you may end up with better performance than with the WeakReference.
This second approach is problematic if you're close to your address space limits (this happens only in 32-bit processes).

Related

Is it safe to replace immutable data structure with Interlocked.Exchange(ref oldValue, newValue) in ASP.NET Core Web-Api

I have an API that is an endpoint for geographic coordinate requests. That means users can search for specific locations in their area. At the same time, new locations can be added. To make the query as fast as possible, I thought I would make the R-tree immutable. That is, there are no locks within the R-tree, since several threads can read at the same time without a race condition. The updates are collected, and once, say, 100 updates have been collected, I want to create a new R-tree and replace the old one. My question is how best to do this.
I have a SearchService, which is registered as a singleton and holds an R-tree as a private instance field.
In my Startup.cs
services.AddSingleton<ISearchService, SearchService>();
ISearchService.cs
public interface ISearchService
{
    IEnumerable<GeoLocation> Get(RTreeQuery query);
    void Update(IEnumerable<GeoLocation> data);
}
SearchService.cs
public class SearchService : ISearchService
{
    private RTree rTree;

    public IEnumerable<GeoLocation> Get(RTreeQuery query)
    {
        return rTree.Get(query);
    }

    public void Update(IEnumerable<GeoLocation> data)
    {
        var newTree = new RTree(data);
        Interlocked.Exchange<RTree>(ref rTree, newTree);
    }
}
My question is: if I exchange the reference with Interlocked.Exchange(), the operation is atomic and there should be no race condition. But what happens if threads are still using the old instance to process their requests? Could the garbage collector delete the old instance while threads are still accessing it? After all, there is no longer a reference to the old instance.
I am relatively new to this topic, so any help is welcome. Thanks for your support!

Reads and writes of references are atomic, which means there will be no alignment issues. However, a read could return a stale value.
Section 12.6.6 of the CLI specs:
Unless explicit layout control (see Partition II (Controlling Instance
Layout)) is used to alter the default behavior, data elements no
larger than the natural word size (the size of a native int) shall be
properly aligned. Object references shall be treated as though they
are stored in the native word size.
In regard to the GC, your trees are safe from garbage collection while they are running Get.
So in summary, your methods are thread safe as far as reference atomicity goes. You can also use the Update method and safely overwrite the reference; there is no need for Interlocked.Exchange. The worst that can happen with your current implementation is that you get a stale tree, which you have mentioned is not an issue.
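To make the garbage collection point concrete, here is a small sketch of the swap pattern. ImmutableIndex is a hypothetical stand-in for the R-tree (the real RTree, GeoLocation and RTreeQuery types are not shown in the question), and the use of Volatile.Read and Volatile.Write is an optional addition that limits how stale a read can be; the answer above only relies on plain reference assignment being atomic.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

// Hypothetical immutable structure standing in for the R-tree.
public sealed class ImmutableIndex
{
    private readonly int[] items;

    public ImmutableIndex(IEnumerable<int> data)
    {
        items = new List<int>(data).ToArray();
    }

    public bool Contains(int value)
    {
        return Array.IndexOf(items, value) >= 0;
    }
}

public sealed class SearchService
{
    private ImmutableIndex index = new ImmutableIndex(new int[0]);

    public bool Contains(int value)
    {
        // Read the field once. The local variable is itself a reference, so the
        // instance it points to cannot be collected while this call is using it,
        // even if Update replaces the field in the meantime.
        ImmutableIndex snapshot = Volatile.Read(ref index);
        return snapshot.Contains(value);
    }

    public void Update(IEnumerable<int> data)
    {
        // Build the replacement completely before publishing it.
        ImmutableIndex fresh = new ImmutableIndex(data);
        Volatile.Write(ref index, fresh);
    }
}
```

The old instance becomes eligible for collection only after the field has been overwritten and every in-flight call that captured it has finished.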

Is List<T> really an undercover Array in C#?

I have been looking at .NET libraries using ILSpy and have come across List<T> class definition in System.Collections.Generic namespace. I see that the class uses methods like this one:
// System.Collections.Generic.List<T>
/// <summary>Removes all elements from the <see cref="T:System.Collections.Generic.List`1" />.</summary>
public void Clear()
{
    if (this._size > 0)
    {
        Array.Clear(this._items, 0, this._size);
        this._size = 0;
    }
    this._version++;
}
So, the Clear() method of the List<T> class actually uses Array.Clear method. I have seen many other List<T> methods that use Array stuff in the body.
Does this mean that List<T> is actually an undercover Array or List only uses some part of Array methods?
I know lists are type safe and don't require boxing/unboxing but this has confused me a bit.

The list class is not itself an array. In other words, it does not derive from an array. Instead it encapsulates an array that is used by the implementation to hold the list's member elements.
Since List<T> offers random access to its elements, and those elements are indexed 0..Count-1, using an array to store the elements is the obvious implementation.

This tends to surprise C++ programmers who know std::list, which is a linked list. .NET covers that as well, with the LinkedList class, and it has the same perf characteristics: O(1) for inserts and deletes.
You should, however, generally avoid it. Linked lists do not perform well on modern processors, which depend greatly on the CPU caches to get reasonable performance from memory that is many times slower than the execution core. A simple array is by far the data structure that takes the most advantage of the cache: accessing an element gives very high odds that subsequent elements are present in the cache as well. That is not the case for a linked list, whose elements tend to be scattered throughout the address space, making a cache miss likely. Cache misses can be very expensive, as much as 200 cycles with the CPU doing nothing but waiting on the memory subsystem to supply the data.
But do keep the perf characteristics of List<T> in mind: adding or removing an element that is not at the end of the List costs O(n), just like an array. A large List can also generate a lot of garbage as the array needs to be expanded; setting the Capacity property up front can help a lot to avoid that. More about that in this answer. Otherwise, the exact same concerns apply as for std::vector<>.

Yes, List<T> uses an array internally to store the items, although in most cases the array is actually larger than the number of elements in the collection: it has some extra "padding" at the end so that you can add new items without having to reallocate memory every time. It keeps track of the actual size of the collection with a separate field (you can see this._size in your decompiled code). When you add more elements than the current array has room for, it automatically allocates a new, larger array (twice as big, I think) and copies over all the existing elements.
If you're concerned about a List<T> using more memory than necessary, you can set the size of the array explicitly with the constructor overload that accepts a capacity parameter, if you know the size in advance, or call the TrimExcess() method to make sure the array is (close to) the actual size of the collection.
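A small sketch showing the capacity constructor and TrimExcess in action. The class name ListCapacityDemo and the numbers are arbitrary; the exact growth factor used when a list outgrows its array is an implementation detail and is deliberately not demonstrated here.

```csharp
using System;
using System.Collections.Generic;

public static class ListCapacityDemo
{
    // Returns { initial capacity, capacity after adds, count, capacity after trim }.
    public static int[] Run()
    {
        // The backing array is allocated once, with room for 1000 elements.
        List<int> list = new List<int>(1000);
        int initial = list.Capacity;

        for (int i = 0; i < 10; i++)
        {
            list.Add(i);   // fits in the existing array, no reallocation
        }
        int afterAdds = list.Capacity;

        // Count is far below Capacity, so the array is shrunk to fit.
        list.TrimExcess();
        int afterTrim = list.Capacity;

        return new int[] { initial, afterAdds, list.Count, afterTrim };
    }
}
```

Capacity is the length of the internal array and Count is the number of elements actually stored, which mirrors the _items and _size fields seen in the decompiled Clear method.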

Random access memory is an array, so in that sense all data structures from linked-lists to heaps and beyond, that rely on random-access to memory for their performance behaviour, are built on the array that is system memory. It is more a question of how many-levels of abstraction are in between.
Of course in a modern virtual memory machine, the random-access system memory is itself an abstraction built on a complicated virtual-memory model of multi-tier pipelined caches, non-cached RAM, and disk.

Is there a LinkedList collection that supports dictionary type operations

I was recently profiling an application trying to work out why certain operations were extremely slow. One of the classes in my application is a collection based on LinkedList. Here's a basic outline, showing just a couple of methods and some fluff removed:
public class LinkInfoCollection : PropertyNotificationObject, IEnumerable<LinkInfo>
{
    private LinkedList<LinkInfo> _items;

    public LinkInfoCollection()
    {
        _items = new LinkedList<LinkInfo>();
    }

    public void Add(LinkInfo item)
    {
        _items.AddLast(item);
    }

    public LinkInfo this[Guid id]
    { get { return _items.SingleOrDefault(i => i.Id == id); } }
}
The collection is used to store hyperlinks (represented by the LinkInfo class) in a single list. However, each hyperlink also has a list of hyperlinks which point to it, and a list of hyperlinks which it points to. Basically, it's a navigation map of a website. As this means you can have infinite recursion when links point back to each other, I implemented this as a linked list. As I understand it, that means that for every hyperlink, no matter how many times it is referenced by another hyperlink, there is only ever one copy of the object.
The ID property in the above example is a GUID.
With that long winded description out the way, my problem is simple - according to the profiler, when constructing this map for a fairly small website, the indexer referred to above is called no less than 27906 times. Which is an extraordinary amount. I still need to work out if it's really necessary to be called that many times, but at the same time, I would like to know if there's a more efficient way of doing the indexer as this is the primary bottleneck identified by the profiler (also assuming it isn't lying!). I still needed the linked list behaviour as I certainly don't want more than one copy of these hyperlinks floating around killing my memory, but I also do need to be able to access them by a unique key.
Does anyone have any advice to offer on improving the performance of this indexer. I also have another indexer which uses a URI rather than a GUID, but this is less problematic as the building incoming/outgoing links is done by GUID.
Thanks;
Richard Moss

You should use a Dictionary<Guid, LinkInfo>.

You don't need to use LinkedList in order to have only one copy of each LinkInfo in memory. Remember that LinkInfo is a managed reference type, and so you can place it in any collection, and it'll just be a reference to the object that gets placed in the list, not a copy of the object itself.
That said, I'd implement the LinkInfo class as containing two lists of Guids: one for the things this links to, one for the things linking to this. I'd have just one Dictionary<Guid, LinkInfo> to store all the links. Dictionary is a very fast lookup, I think that'll help with your performance.
The fact that this[] is getting called 27,000 times doesn't seem like a big deal to me, but what's making it show up in your profiler is probably the SingleOrDefault call on the LinkedList. Linked lists are best for situations where you need fast insertions & removals, particularly in the middle of the list. For quick lookups, which is probably more important here, let the Dictionary do its work with hash tables.
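One way the suggestion above might look. The LinkInfo shape shown here (an Id plus Incoming and Outgoing lists of Guids) is an assumption based on this answer, and the PropertyNotificationObject base class from the question is omitted because its definition is not shown.

```csharp
using System;
using System.Collections;
using System.Collections.Generic;

// Assumed shape of LinkInfo: links refer to each other by Guid, not by object.
public class LinkInfo
{
    public readonly Guid Id;
    public readonly List<Guid> Incoming = new List<Guid>();
    public readonly List<Guid> Outgoing = new List<Guid>();

    public LinkInfo(Guid id)
    {
        Id = id;
    }
}

public class LinkInfoCollection : IEnumerable<LinkInfo>
{
    private readonly Dictionary<Guid, LinkInfo> items = new Dictionary<Guid, LinkInfo>();

    public void Add(LinkInfo item)
    {
        // Stores a reference to the object, not a copy; throws on a duplicate Id.
        items.Add(item.Id, item);
    }

    // Hash lookup instead of a linear scan; returns null when the id is unknown,
    // as SingleOrDefault did in the original indexer.
    public LinkInfo this[Guid id]
    {
        get
        {
            LinkInfo found;
            return items.TryGetValue(id, out found) ? found : null;
        }
    }

    public IEnumerator<LinkInfo> GetEnumerator()
    {
        return items.Values.GetEnumerator();
    }

    IEnumerator IEnumerable.GetEnumerator()
    {
        return GetEnumerator();
    }
}
```

One behavioural difference to be aware of: a Dictionary does not promise to enumerate in insertion order, so code that depends on the order of the original linked list would need a separate list alongside it.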

Partially thread-safe dictionary

I have a class that maintains a private Dictionary instance that caches some data.
The class writes to the dictionary from multiple threads using a ReaderWriterLockSlim.
I want to expose the dictionary's values outside the class.
What is a thread-safe way of doing that?
Right now, I have the following:
public ReadOnlyCollection<MyClass> Values() {
    using (sync.ReadLock())
        return new ReadOnlyCollection<MyClass>(cache.Values.ToArray());
}
Is there a way to do this without copying the collection many times?
I'm using .Net 3.5 (not 4.0)

I want to expose the dictionary's values outside the class.
What is a thread-safe way of doing that?
You have three choices.
1) Make a copy of the data, hand out the copy. Pros: no worries about thread safe access to the data. Cons: Client gets a copy of out-of-date data, not fresh up-to-date data. Also, copying is expensive.
2) Hand out an object that locks the underlying collection when it is read from. You'll have to write your own read-only collection that has a reference to the lock of the "parent" collection. Design both objects carefully so that deadlocks are impossible. Pros: "just works" from the client's perspective; they get up-to-date data without having to worry about locking. Cons: More work for you.
3) Punt the problem to the client. Expose the lock, and make it a requirement that clients lock all views on the data themselves before using it. Pros: No work for you. Cons: Way more work for the client, work they might not be willing or able to do. Risk of deadlocks, etc, now become the client's problem, not your problem.

If you want a snapshot of the current state of the dictionary, there's really nothing else you can do with this collection type. This is the same technique used by the ConcurrentDictionary<TKey, TValue>.Values property.
If you don't mind throwing an InvalidOperationException if the collection is modified while you are enumerating it, you could just return cache.Values since it's readonly (and thus can't corrupt the dictionary data).

EDIT: I personally believe the below code is technically answering your question correctly (as in, it provides a way to enumerate over the values in a collection without creating a copy). Some developers far more reputable than I strongly advise against this approach, for reasons they have explained in their edits/comments. In short: This is apparently a bad idea. Therefore I'm leaving the answer but suggesting you not use it.
Unless I'm missing something, I believe you could expose your values as an IEnumerable<MyClass> without needing to copy values by using the yield keyword:
public IEnumerable<MyClass> Values {
    get {
        using (sync.ReadLock()) {
            foreach (MyClass value in cache.Values)
                yield return value;
        }
    }
}
Be aware, however (and I'm guessing you already knew this), that this approach provides lazy evaluation, which means that the Values property as implemented above can not be treated as providing a snapshot.
In other words... well, take a look at this code (I am of course guessing as to some of the details of this class of yours):
var d = new ThreadSafeDictionary<string, string>();
// d is empty right now
IEnumerable<string> values = d.Values;
d.Add("someKey", "someValue");
// if values were a snapshot, this would output nothing...
// but in FACT, since it is lazily evaluated, it will now have
// what is CURRENTLY in d.Values ("someValue")
foreach (string s in values) {
    Console.WriteLine(s);
}
So if it's a requirement that this Values property be equivalent to a snapshot of what is in cache at the time the property is accessed, then you're going to have to make a copy.
(begin 280Z28): The following is an example of how someone unfamiliar with the "C# way of doing things" could lock the code:
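IEnumerator<MyClass> enumerator = obj.Values.GetEnumerator();
MyClass first = null;
if (enumerator.MoveNext())
    first = enumerator.Current;
// The enumerator is never disposed, so the read lock taken inside the iterator is never released.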
(end 280Z28)

Consider another possibility: expose only the ICollection interface, so that Values() can return your own implementation. That implementation would hold only a reference to Dictionary.Values and always take the read lock when accessing items.
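A rough sketch of such a wrapper, under several assumptions: the name LockedValuesView is made up, it uses the real ReaderWriterLockSlim API (EnterReadLock / ExitReadLock) instead of the question's sync.ReadLock() helper, and for brevity it exposes Count, Contains and enumeration rather than the full ICollection interface. Enumeration here still copies the values under the lock so that the lock is never held while the caller's loop body runs; Count and Contains do not copy.

```csharp
using System;
using System.Collections;
using System.Collections.Generic;
using System.Threading;

public sealed class LockedValuesView<TKey, TValue> : IEnumerable<TValue>
{
    private readonly Dictionary<TKey, TValue> source;
    private readonly ReaderWriterLockSlim sync;

    // The view shares the parent's dictionary and lock; it owns neither.
    public LockedValuesView(Dictionary<TKey, TValue> source, ReaderWriterLockSlim sync)
    {
        if (source == null) throw new ArgumentNullException("source");
        if (sync == null) throw new ArgumentNullException("sync");
        this.source = source;
        this.sync = sync;
    }

    public int Count
    {
        get
        {
            sync.EnterReadLock();
            try { return source.Count; }
            finally { sync.ExitReadLock(); }
        }
    }

    public bool Contains(TValue item)
    {
        sync.EnterReadLock();
        try { return source.ContainsValue(item); }
        finally { sync.ExitReadLock(); }
    }

    public IEnumerator<TValue> GetEnumerator()
    {
        TValue[] snapshot;
        sync.EnterReadLock();
        try
        {
            // Copy under the lock, then release it before handing control to the caller.
            snapshot = new TValue[source.Count];
            source.Values.CopyTo(snapshot, 0);
        }
        finally { sync.ExitReadLock(); }
        return ((IEnumerable<TValue>)snapshot).GetEnumerator();
    }

    IEnumerator IEnumerable.GetEnumerator()
    {
        return GetEnumerator();
    }
}
```

The owning class can create one view instance up front and return it from Values() on every call, so repeated calls no longer allocate a new collection unless the caller actually enumerates.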

C#: How to implement a smart cache

I have some places where implementing some sort of cache might be useful. For example in cases of doing resource lookups based on custom strings, finding names of properties using reflection, or to have only one PropertyChangedEventArgs per property name.
A simple example of the last one:
public static class Cache
{
    private static Dictionary<string, PropertyChangedEventArgs> cache;

    static Cache()
    {
        cache = new Dictionary<string, PropertyChangedEventArgs>();
    }

    public static PropertyChangedEventArgs GetPropertyChangedEventArgs(
        string propertyName)
    {
        if (cache.ContainsKey(propertyName))
            return cache[propertyName];
        return cache[propertyName] = new PropertyChangedEventArgs(propertyName);
    }
}
But, will this work well? For example if we had a whole load of different propertyNames, that would mean we would end up with a huge cache sitting there never being garbage collected or anything. I'm imagining if what is cached are larger values and if the application is a long-running one, this might end up as kind of a problem... or what do you think? How should a good cache be implemented? Is this one good enough for most purposes? Any examples of some nice cache implementations that are not too hard to understand or way too complex to implement?

This is a large problem, you need to determine the domain of the problem and apply the correct techniques. For instance, how would you describe the expiration of the objects? Do they become stale over a fixed interval of time? Do they become stale from an external event? How frequently does this happen? Additionally, how many objects do you have? Finally, how much does it cost to generate the object?
The simplest strategy would be to do straight memoization, as you have above. This assumes that objects never expire, and that there are not so many as to run your memory dry and that you think the cost to create these objects warrants the use of a cache to begin with.
The next layer might be to limit the number of objects and use an implicit expiration policy, such as LRU (least recently used). To do this you'd typically use a doubly linked list in addition to your dictionary, and every time an object is accessed it is moved to the front of the list. Then, if you need to add a new object but doing so would exceed your limit on total objects, you'd remove one from the back of the list.
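The dictionary plus doubly linked list combination described above could be sketched like this. LruCache, TryGet and Set are hypothetical names, and the sketch is not thread safe.

```csharp
using System;
using System.Collections.Generic;

public sealed class LruCache<TKey, TValue>
{
    private readonly int capacity;

    // Dictionary gives O(1) lookup of the list node; the list keeps usage order
    // (front = most recently used, back = least recently used).
    private readonly Dictionary<TKey, LinkedListNode<KeyValuePair<TKey, TValue>>> map;
    private readonly LinkedList<KeyValuePair<TKey, TValue>> order =
        new LinkedList<KeyValuePair<TKey, TValue>>();

    public LruCache(int capacity)
    {
        if (capacity < 1) throw new ArgumentOutOfRangeException("capacity");
        this.capacity = capacity;
        map = new Dictionary<TKey, LinkedListNode<KeyValuePair<TKey, TValue>>>(capacity);
    }

    public int Count
    {
        get { return map.Count; }
    }

    public bool TryGet(TKey key, out TValue value)
    {
        LinkedListNode<KeyValuePair<TKey, TValue>> node;
        if (map.TryGetValue(key, out node))
        {
            // Accessed: move the node to the front of the list.
            order.Remove(node);
            order.AddFirst(node);
            value = node.Value.Value;
            return true;
        }
        value = default(TValue);
        return false;
    }

    public void Set(TKey key, TValue value)
    {
        LinkedListNode<KeyValuePair<TKey, TValue>> existing;
        if (map.TryGetValue(key, out existing))
        {
            // Replacing an existing entry: drop its old node.
            order.Remove(existing);
        }
        else if (map.Count >= capacity)
        {
            // Full: evict the least recently used entry from the back.
            LinkedListNode<KeyValuePair<TKey, TValue>> oldest = order.Last;
            order.RemoveLast();
            map.Remove(oldest.Value.Key);
        }
        map[key] = order.AddFirst(new KeyValuePair<TKey, TValue>(key, value));
    }
}
```

Because both the lookup and the move-to-front step are constant time, the cost of a hit does not grow with the size of the cache, unlike a list-only most-recently-used structure that has to search for the item.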
Next, you might need to enforce explicit expiration, either based on time, or some external stimulus. This would require you to have some sort of expiration event that could be called.
As you can see, there is a lot of design in caching, so you need to understand your domain and engineer appropriately. I felt you did not provide enough detail for me to discuss specifics.
P.S. Please consider using Generics when defining your class so that many types of objects can be stored, thus allowing your caching code to be reused.

You could wrap each of your cached items in a WeakReference. This would allow the GC to reclaim items if-and-when required, however it doesn't give you any granular control of when items will disappear from the cache, or allow you to implement explicit expiration policies etc.
(Ha! I just noticed that the example given on the MSDN page is a simple caching class.)

Looks like .NET 4.0 now supports System.Runtime.Caching for caching many types of things. You should look into that first, instead of re-inventing the wheel. More details:
http://msdn.microsoft.com/en-us/library/system.runtime.caching%28VS.100%29.aspx

This is a nice debate to have, but depending on your application, here are some tips:
You should define the maximum size of the cache, decide what to do with old items when the cache is full, have a scavenging strategy, determine the time to live of objects in the cache, and decide whether the cache can or must be persisted somewhere other than memory in case of abnormal application termination, ...

This is a common problem that has many solutions, depending on your application's needs.
It is so common that Microsoft released a whole library to address it.
You should check out Microsoft Velocity before rolling your own cache.
http://msdn.microsoft.com/en-us/data/cc655792.aspx
Hope this helps.

You could use a WeakReference, but if your object is not that large then don't, because the WeakReference would take more memory than the object itself, which is not a good technique. Also, if the object is short-lived and will never make it from generation 0 to generation 1 in the GC, there is not much need for the WeakReference; implementing the IDisposable interface on the object, with a call to GC.SuppressFinalize on release, would help instead.
If you want to control the lifetime, you need a timer that checks the current date/time against the desiredExpirationTime of each object in your cache.
The important thing is: if the object is large, opt for the WeakReference; otherwise use a strong reference. Also, you can set the capacity on the Dictionary and create a queue for additional objects, serializing them to a temp directory and loading them when there is room in the Dictionary, then clearing them from the temp directory.
