How to design an API for a persistent collection in C#?

I am thinking about creating a persistent collection (lists or other) in C#, but I can't figure out a good API.
I use 'persistent' in the Clojure sense: a persistent list is a list that behaves as if it has value semantics instead of reference semantics, but does not incur the overhead of copying large value types. Persistent collections use copy-on-write to share internal structure. Pseudocode:
l1 = PersistentList()
l1.add("foo")
l1.add("bar")
l2 = l1
l1.add("baz")
print(l1) # ==> ["foo", "bar", "baz"]
print(l2) # ==> ["foo", "bar"]
# l1 and l2 share a common structure of ["foo", "bar"] to save memory
Clojure uses such data structures, but additionally all of Clojure's data structures are immutable. There is some overhead in all the copy-on-write machinery, so Clojure provides a workaround in the form of transient data structures, which you can use if you are sure you're not sharing the data structure with anyone else. If you hold the only reference to a data structure, why not mutate it directly instead of paying the copy-on-write overhead?
One way to get this efficiency gain would be to keep a reference count on your datastructure (though I don't think Clojure works that way). If the refcount is 1, you're holding the only reference so do the updates destructively. If the refcount is higher, someone else is also holding a reference to it that's supposed to behave like a value type, so do copy-on-write to not disturb the other referrers.
In the API to such a data structure, one could expose the reference counting, which makes the API seriously less usable; or one could omit it, which means either every operation pays unnecessary copy-on-write overhead, or the API loses its value-type behaviour and the user has to decide manually when to copy.
If C# had copy constructors for structs, this would be possible. One could define a struct containing a reference to the real datastructure, and do all the incref()/decref() calls in the copy constructor and destructor of the struct.
Is there a way to do something like reference counting or struct copy constructors automatically in C#, without bothering the API users?
Edit:
Just to be clear, I'm just asking about the API. Clojure already has an implementation of this written in Java.
It is certainly possible to make such an interface by using a struct with a reference to the real collection that is COW'ed on every operation. The use of refcounting would be an optimisation to avoid unnecessary COWing, but apparently isn't possible with a sane API.
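For concreteness, here is a minimal sketch of that unoptimised baseline: a struct wrapper that copies the backing list on every write. Value semantics hold, but nothing detects sole ownership, so every mutation pays for a full copy. Names are illustrative, not a proposed API.

```csharp
using System;
using System.Collections.Generic;

struct NaiveCowList
{
    private List<string> items;

    public void Add(string value)
    {
        // Copy on every write; any struct copies keep the old backing list.
        items = items == null ? new List<string>() : new List<string>(items);
        items.Add(value);
    }

    public override string ToString() =>
        items == null ? "[]" : "[" + string.Join(", ", items) + "]";
}

class NaiveCowDemo
{
    static void Main()
    {
        var l1 = default(NaiveCowList);
        l1.Add("foo");
        l1.Add("bar");
        var l2 = l1;       // struct assignment: l2 shares l1's current backing list
        l1.Add("baz");     // l1 copies first, so l2 is untouched
        Console.WriteLine(l1); // [foo, bar, baz]
        Console.WriteLine(l2); // [foo, bar]
    }
}
```

This matches the pseudocode in the question; the refcounting discussed above would exist purely to skip the copy when the backing list is not shared.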

What you're looking to do isn't possible, strictly speaking. You could get close by using static functions that do the reference counting, but I understand that that isn't a terribly palatable option.
Even if it were possible, I would stay away from this. While the semantics you describe may well be useful in Clojure, this cross between value-type and reference-type semantics will be confusing to most C# developers (mutable value types, or types with mutable value-type semantics, are also usually considered evil).

You could use the WeakReference class as an alternative to refcounting and get some of the benefits refcounting gives you. When the only remaining reference to an object is a WeakReference, the object becomes eligible for garbage collection, and WeakReference exposes members (Target, IsAlive) that let you inspect whether that has happened.
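A small illustration of that WeakReference behaviour (note that collection timing is up to the GC, so the second check is likely but not guaranteed):

```csharp
using System;

class WeakReferenceDemo
{
    static WeakReference MakeWeakOnly()
    {
        // The only strong reference lives in this method and dies on return.
        var data = new object();
        return new WeakReference(data);
    }

    static void Main()
    {
        var strong = new object();
        var wr = new WeakReference(strong);
        Console.WriteLine(wr.IsAlive); // True: a strong reference still exists
        GC.KeepAlive(strong);          // keep 'strong' alive up to this point

        var weakOnly = MakeWeakOnly();
        GC.Collect();
        GC.WaitForPendingFinalizers();
        GC.Collect();
        // Usually False by now, but the GC makes no hard promise about timing.
        Console.WriteLine(weakOnly.IsAlive);
    }
}
```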
EDIT 3: While this approach does the trick, I'd urge you to stay away from pursuing value semantics on C# collections. Users of your structure do not expect this kind of behavior on the platform. These semantics add confusion and the potential for mistakes.
EDIT 2: Added an example. @AdamRobinson: I'm afraid I was not clear about how WeakReference can be of use. I must warn that, performance-wise, most of the time it might be even worse than doing a naive copy-on-write on every operation, due to the garbage collector call. Therefore this is merely an academic solution, and I cannot recommend its use in production systems. It does do exactly what you ask, however.
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Text;

class Program
{
    static void Main(string[] args)
    {
        var l1 = default(COWList);
        l1.Add("foo");    // initialize
        l1.Add("bar");    // no copy
        l1.Add("baz");    // no copy
        var l2 = l1;
        l1.RemoveAt(0);   // copy
        l2.Add("foobar"); // no copy
        l1.Add("barfoo"); // no copy
        l2.RemoveAt(1);   // no copy
        var l3 = l2;
        l3.RemoveAt(1);   // copy
        Trace.WriteLine(l1.ToString()); // bar baz barfoo
        Trace.WriteLine(l2.ToString()); // foo baz foobar
        Trace.WriteLine(l3.ToString()); // foo foobar
    }
}
struct COWList
{
    List<string> theList;    // Contains the actual data
    object dummy;            // helper variable to facilitate detection of copies of this struct instance.
    WeakReference weakDummy; // helper variable to facilitate detection of copies of this struct instance.

    /// <summary>
    /// Check whether this COWList has already been constructed properly.
    /// </summary>
    /// <returns>true when this COWList has already been initialized.</returns>
    bool EnsureInitialization()
    {
        if (theList == null)
        {
            theList = new List<string>();
            dummy = new object();
            weakDummy = new WeakReference(dummy);
            return false;
        }
        else
        {
            return true;
        }
    }

    void EnsureUniqueness()
    {
        if (EnsureInitialization())
        {
            // If the COWList has been copied, replacing the 'dummy' reference
            // will not kill weakDummy, because the copy retains a reference.
            dummy = new object();
            GC.Collect(2); // OUCH! This is expensive. You may replace it with GC.Collect(0),
                           // but that will cause spurious copy-on-write behaviour.
            if (weakDummy.IsAlive) // I don't know if the GC guarantees detection of all
                                   // collectable objects, so there might be cases in which
                                   // weakDummy is still considered to be alive.
            {
                // At this point there is probably a copy.
                // To be safe, do the expensive copy-on-write.
                theList = new List<string>(theList);
                // Prepare for the next modification.
                weakDummy = new WeakReference(dummy);
                Trace.WriteLine("Made copy.");
            }
            else
            {
                // At this point it is guaranteed there is no copy.
                weakDummy.Target = dummy;
                Trace.WriteLine("No copy made.");
            }
        }
        else
        {
            Trace.WriteLine("Initialized an instance.");
        }
    }

    public void Add(string val)
    {
        EnsureUniqueness();
        theList.Add(val);
    }

    public void RemoveAt(int index)
    {
        EnsureUniqueness();
        theList.RemoveAt(index);
    }

    public override string ToString()
    {
        if (theList == null)
        {
            return "Uninitialized COWList";
        }
        else
        {
            var sb = new StringBuilder("[ ");
            foreach (var item in theList)
            {
                sb.Append("\"").Append(item).Append("\" ");
            }
            sb.Append("]");
            return sb.ToString();
        }
    }
}
This outputs:
Initialized an instance.
No copy made.
No copy made.
Made copy.
No copy made.
No copy made.
No copy made.
Made copy.
[ "bar" "baz" "barfoo" ]
[ "foo" "baz" "foobar" ]
[ "foo" "foobar" ]

I read what you're asking for, and I'm thinking of a "terminal-server"-type API structure.
First, define an internal, thread-safe singleton class to be your "server"; it actually holds the data you're looking at. It exposes Get and Set methods that take the name of the value being read or written, guarded by a ReaderWriterLock so that the value can be read by anyone, but not while anyone is writing, and only one writer at a time.
Then, provide a factory for a class that is your "terminal"; this class will be public, and contains a reference to the internal singleton (which otherwise cannot be seen). It will contain properties that are really just pass-throughs for the singleton instance. In this way, you can provide a large number of "terminals" that will all see the same data from the "server", and will be able to modify that data in a thread-safe way.
You could use copy constructors and a list of the values accessed by each instance to provide copy-type knowledge. You could also mash up the value names with the object's handle to support cases where L1 and L2 share an A, but L3 has a different A because it was declared separately; or L3 can get the same A that L1 and L2 have. However you structure this, I would document very clearly how it is expected to behave, because this is NOT the way things normally behave in .NET.
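A minimal sketch of this terminal/server arrangement might look as follows. All names are illustrative, and ReaderWriterLockSlim stands in for the ReaderWriterLock mentioned above:

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

// The hidden "server": a thread-safe singleton holding the actual data.
sealed class DataServer
{
    public static readonly DataServer Instance = new DataServer();

    private readonly Dictionary<string, string> data = new Dictionary<string, string>();
    private readonly ReaderWriterLockSlim sync = new ReaderWriterLockSlim();

    private DataServer() { }

    public string Get(string key)
    {
        sync.EnterReadLock();
        try { return data.TryGetValue(key, out var v) ? v : null; }
        finally { sync.ExitReadLock(); }
    }

    public void Set(string key, string value)
    {
        sync.EnterWriteLock();
        try { data[key] = value; }
        finally { sync.ExitWriteLock(); }
    }
}

// The public "terminal": every instance is a pass-through to the same server.
public sealed class Terminal
{
    public string this[string key]
    {
        get => DataServer.Instance.Get(key);
        set => DataServer.Instance.Set(key, value);
    }
}
```

As the answer notes, all terminals observe the same state, which is precisely why the value-semantics behaviour the question asks for would need the extra copy-tracking machinery on top.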

I'd like to have something like this on a flexible tree collection object of mine, though it wouldn't be by using value-type semantics (which would be essentially impossible in .NET) but by having a clone generate a "virtual" deep clone instead of actually cloning every node within the collection. Instead of trying to keep an accurate reference count, every internal node would have one of three states:
Flexible
SharedImmutable
UnsharedMutable
Calling Clone() on a SharedImmutable node would simply yield the original object; calling Clone() on a Flexible node would turn it into a SharedImmutable one. Calling Clone() on an UnsharedMutable node would create a new node holding clones of all its descendants; the new object would be Flexible.
Before an object could be written to, it would have to be made UnsharedMutable. To make an object UnsharedMutable if it isn't already, make its parent (the node via which it was accessed) UnsharedMutable (recursively). Then, if the object was SharedImmutable, clone it (using a ForceClone method) and update the parent's link to point to the new object. Finally, set the new object's state to UnsharedMutable.
An essential aspect of this technique would be having separate classes for holding the data and providing the interface to it. A statement like MyCollection["this"]["that"]["theOther"].Add("George") needs to be evaluated by having the indexing operations return an indexer class which holds a reference to MyCollection. At that point, the Add method could act upon whatever intermediate nodes it had to in order to perform any necessary copy-on-write operations.
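The three-state scheme above could be sketched roughly like this for a single node. A real implementation would also walk the parent chain as described; the names and the PrepareForWrite helper are assumptions for illustration:

```csharp
using System;
using System.Collections.Generic;

enum NodeState { Flexible, SharedImmutable, UnsharedMutable }

// One node of the hypothetical tree. Clone() is cheap for shared nodes and
// only copies when the node is exclusively held and mutable.
sealed class Node
{
    public NodeState State = NodeState.UnsharedMutable;
    public List<string> Values = new List<string>();

    public Node Clone()
    {
        switch (State)
        {
            case NodeState.SharedImmutable:
                return this;                        // already safely shared
            case NodeState.Flexible:
                State = NodeState.SharedImmutable;  // freeze, then share
                return this;
            default:                                // UnsharedMutable: real copy
                return new Node
                {
                    State = NodeState.Flexible,
                    Values = new List<string>(Values)
                };
        }
    }

    // Must be called before mutating; returns the node that is safe to write.
    public Node PrepareForWrite()
    {
        if (State == NodeState.SharedImmutable)
        {
            // Shared: copy-on-write, the caller must relink its parent.
            return new Node
            {
                State = NodeState.UnsharedMutable,
                Values = new List<string>(Values)
            };
        }
        State = NodeState.UnsharedMutable;
        return this;
    }
}
```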

Related

Is it safe to replace immutable data structure with Interlocked.Exchange(ref oldValue, newValue) in ASP.NET Core Web-Api

I have an API that serves as an endpoint for geographic coordinate requests, meaning users can search for specific locations in their area. At the same time, new locations can be added. To make queries as fast as possible, I thought I would make the R-tree immutable: there are no locks within the R-tree, since several threads can read at the same time without race conditions. The updates are collected, and when e.g. 100 updates have accumulated, I want to build a new R-tree and replace the old one. And now my question is how best to do this.
I have a SearchService, which is registered as a singleton and has an R-tree as a private instance.
In my Startup.cs
services.AddSingleton<ISearchService, SearchService>();
ISearchService.cs
public interface ISearchService
{
    IEnumerable<GeoLocation> Get(RTreeQuery query);
    void Update(IEnumerable<GeoLocation> data);
}
SearchService.cs
public class SearchService : ISearchService
{
    private RTree rTree;

    public IEnumerable<GeoLocation> Get(RTreeQuery query)
    {
        return rTree.Get(query);
    }

    public void Update(IEnumerable<GeoLocation> data)
    {
        var newTree = new RTree(data);
        Interlocked.Exchange<RTree>(ref rTree, newTree);
    }
}
My question is: if I exchange the reference with Interlocked.Exchange(), the operation is atomic and there should be no race condition. But what happens if threads are still using the old instance to process their requests? Could the garbage collector delete the old instance while threads still have access to it? After all, there is no longer a field referencing the old instance.
I am relatively new to this topic, so any help is welcome. Thanks for your support!
Reads and writes of references are atomic, which means there will be no torn reads. However, a read can observe a stale value.
Section 12.6.6 of the CLI specification:
Unless explicit layout control (see Partition II (Controlling Instance
Layout)) is used to alter the default behavior, data elements no
larger than the natural word size (the size of a native int) shall be
properly aligned. Object references shall be treated as though they
are stored in the native word size.
Regarding the GC: the old tree is safe from garbage collection while any thread is still running Get on it, because that thread's execution keeps the reference alive.
So in summary, your methods are thread safe as far as reference atomicity goes. You can also use the Update method and safely overwrite the reference; there is no strict need for Interlocked.Exchange. The worst that can happen with your current implementation is that a reader briefly gets a stale tree, which you have said is not an issue.
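To make the pattern concrete, here is a minimal sketch of the swap-a-snapshot approach, with a trivial stand-in for the R-tree (the names are illustrative, not the asker's actual types). Readers capture the current snapshot once per query, so a concurrent swap cannot affect a query in flight:

```csharp
using System;
using System.Threading;

// Stand-in for the immutable R-tree: built wholesale, never mutated.
sealed class Index
{
    private readonly string[] items;
    public Index(string[] items) { this.items = items; }
    public int Count => items.Length;
}

class SnapshotSearchService
{
    private Index index = new Index(new string[0]);

    // Read the reference once; use that snapshot for the whole query.
    public int Query() => Volatile.Read(ref index).Count;

    // Build the replacement off to the side, then publish it atomically.
    public void Update(string[] data) =>
        Interlocked.Exchange(ref index, new Index(data));
}
```

The old Index stays alive as long as any in-flight query still holds a reference to it, and is collected only afterwards, which is exactly the behaviour the asker was worried about.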

Differences between object instantiation in C#: storing objects in references vs. calling a method directly

I have a question about object declarations in C#. Let me explain with an example.
I can do this:
MyObject obj = new MyObject();
int a = obj.getInt();
Or I can do this:
int a = new MyObject().getInt();
The results are the same, but are there any differences between these two forms (apart from the syntax)?
Thanks.
This isn't a declaration: it's a class instantiation.
There's no practical difference: it's all about readability and your own coding style.
I would add that there are a few cases where you will need to keep a reference to the object: namely, when the object is IDisposable.
For example:
// WRONG! Underlying stream may still be locked after reading to the end....
new StreamReader(...).ReadToEnd();
// OK! Store the whole instance in a reference so you can dispose it when you
// don't need it anymore.
using(StreamReader r = new StreamReader(...))
{
} // This will call r.Dispose() automatically
As some comments have noted, there are plenty of edge cases where instantiating a class and storing the object in a reference (a variable) is better or even required, but for your simple sample I believe the difference is negligible, and it remains a coding-style/readability issue.
It's mostly syntax.
The main difference is that you can't reuse the instance of MyObject in the second example. Also, it may become eligible for garbage collection immediately.
No, technically they are the same.
The only thing worth considering in this case: if the method does not actually need an instance, you might declare it static, so you can simply call it like:
int a = MyObject.getInt();
but this naturally depends on the concrete implementation.

Memory management / caching for costly objects in C#

Assume that I have the following object
public class MyClass
{
    public ReadOnlyDictionary<T, V> Dict
    {
        get
        {
            return createDictionary();
        }
    }
}
Assume that ReadOnlyDictionary is a read-only wrapper around Dictionary<T, V>.
The createDictionary method takes significant time to complete and returned dictionary is relatively large.
Obviously, I want to implement some sort of caching so I can reuse the result of createDictionary, but I also do not want to abuse the garbage collector or use too much memory.
I thought of using WeakReference for the dictionary but not sure if this is best approach.
What would you recommend? How to properly handle result of a costly method that might be called multiple times?
UPDATE:
I am interested in advice for a C# 2.0 library (single DLL, non-visual). The library might be used in a desktop or a web application.
UPDATE 2:
The question is relevant for read-only objects as well. I changed the type of the property from Dictionary to ReadOnlyDictionary.
UPDATE 3:
T is a relatively simple type (string, for example). V is a custom class. You may assume that an instance of V is costly to create. The dictionary might contain from zero to a couple of thousand elements.
The code is assumed to be accessed from a single thread, or from multiple threads with an external synchronization mechanism.
I am fine with the dictionary being GC-ed when no one uses it. I am trying to find a balance between time (I want to somehow cache the result of createDictionary) and memory expenses (I do not want to keep memory occupied longer than necessary).
WeakReference is not a good solution for a cache, since your object won't survive the next GC if nobody else is referencing your dictionary. You can make a simple cache by storing the created value in a member variable and reusing it if it is not null.
This is not thread safe, and under heavy concurrent access you could end up creating the dictionary several times. You can use the double-checked locking pattern to guard against this with minimal performance impact.
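A sketch of that double-checked locking pattern applied to the cached dictionary (the CreateCalls counter and all names are only for illustration; the field is written only under the lock, and the volatile read keeps the lock-free fast path safe):

```csharp
using System;
using System.Collections.Generic;

class CachedDictionaryHolder
{
    private readonly object gate = new object();
    private volatile Dictionary<string, int> cache;
    public int CreateCalls; // instrumentation for the sketch only

    private Dictionary<string, int> CreateDictionary()
    {
        CreateCalls++; // stands in for the expensive build
        return new Dictionary<string, int> { ["answer"] = 42 };
    }

    public Dictionary<string, int> Dict
    {
        get
        {
            var local = cache;           // first check, no lock taken
            if (local == null)
            {
                lock (gate)
                {
                    if (cache == null)   // second check, under the lock
                        cache = CreateDictionary();
                    local = cache;
                }
            }
            return local;
        }
    }
}
```

Note this caches forever; combining it with an eviction policy (or the WeakReference discussed above) is a separate decision.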
To help you further, you would need to specify whether concurrent access is an issue for you, how much memory your dictionary consumes, and how it is created. If, e.g., the dictionary is the result of an expensive query, it might help to simply serialize the dictionary to disk and reuse it until you need to recreate it (this depends on your specific needs).
Caching is another word for memory leak if you have no clear policy for when your object should be removed from the cache. Since you are trying WeakReference, I assume you do not know when exactly a good time would be to clear the cache.
Another option is to compress the dictionary into a less memory-hungry structure. How many keys does your dictionary have, and what are the values?
There are four major mechanisms available to you (Lazy<T> arrives in .NET 4.0, so it is not an option here):
lazy initialization
virtual proxy
ghost
value holder
Each has its own advantages.
I suggest a value holder, which populates the dictionary on the first call to the holder's GetValue method. You can then use that value as long as you want to, AND it is only created once, AND only when needed.
For more information, see Martin Fowler's page.
Are you sure you need to cache the entire dictionary?
From what you say, it might be better to keep a Most-Recently-Used list of key-value pairs.
If the key is found in the list, just return the value.
If it is not, create that one value (which is presumably faster than creating all of them, and uses less memory too) and store it in the list, removing the key-value pair that hasn't been used for the longest time.
Here's a very simple MRU list implementation, it might serve as inspiration:
using System.Collections.Generic;
using System.Linq;

internal sealed class MostRecentlyUsedList<T> : IEnumerable<T>
{
    private readonly List<T> items;
    private readonly int maxCount;

    public MostRecentlyUsedList(int maxCount, IEnumerable<T> initialData)
        : this(maxCount)
    {
        this.items.AddRange(initialData.Take(maxCount));
    }

    public MostRecentlyUsedList(int maxCount)
    {
        this.maxCount = maxCount;
        this.items = new List<T>(maxCount);
    }

    /// <summary>
    /// Adds an item to the top of the most recently used list.
    /// </summary>
    /// <param name="item">The item to add.</param>
    /// <returns><c>true</c> if the list was updated, <c>false</c> otherwise.</returns>
    public bool Add(T item)
    {
        int index = this.items.IndexOf(item);
        if (index != 0)
        {
            // item is not already the first in the list
            if (index > 0)
            {
                // item is in the list, but not in the first position
                this.items.RemoveAt(index);
            }
            else if (this.items.Count >= this.maxCount)
            {
                // item is not in the list, and the list is full already
                this.items.RemoveAt(this.items.Count - 1);
            }
            this.items.Insert(0, item);
            return true;
        }
        else
        {
            return false;
        }
    }

    public IEnumerator<T> GetEnumerator()
    {
        return this.items.GetEnumerator();
    }

    System.Collections.IEnumerator System.Collections.IEnumerable.GetEnumerator()
    {
        return this.GetEnumerator();
    }
}
In your case, T is a key-value pair. Keep maxCount small enough that searching stays fast and memory usage stays modest. Call Add each time you use an item.
An application should use WeakReference as a caching mechanism if the useful lifetime of an object's presence in the cache will be comparable to the reference lifetime of the object. Suppose, for example, that you have a method which will create a ReadOnlyDictionary based on deserializing a String. If a common usage pattern would be to read a string, create a dictionary, do some stuff with it, abandon it, and start again with another string, WeakReference is probably not ideal. On the other hand, if your objective is to deserialize many strings (quite a few of which will be equal) into ReadOnlyDictionary instances, it may be very useful if repeated attempts to deserialize the same string yield the same instance.
Note that the savings would not just come from the fact that one only had to do the work of building the instance once, but also from the facts that (1) it would not be necessary to keep multiple instances in memory, and (2) if ReadOnlyDictionary variables refer to the same instance, they can be known to be equivalent without having to examine the instances themselves. By contrast, determining whether two distinct ReadOnlyDictionary instances were equivalent might require examining all the items in each.
Code which would have to do many such comparisons could benefit from using a WeakReference cache so that variables which hold equivalent instances would usually hold the same instance.
I think you have two mechanisms you can rely on for caching, instead of developing your own. The first, as you yourself suggested, is to use a WeakReference and let the garbage collector decide when to free the memory.
You have a second mechanism - memory paging. If the dictionary is created in one swoop, it'll probably be stored in a more or less continuous part of the heap. Just keep the dictionary alive, and let Windows page it out to the swap file if you don't need it. Depending on your usage (how random is your dictionary access), you may end up with better performance than the WeakReference.
This second approach is problematic if you're close to your address space limits (this happens only in 32-bit processes).

How to have transactions on objects

How can I imitate transactions on objects? For example, I want to delete an item from one collection and then add the same item to another collection as an atomic action. It is possible to do a lot of checks when something fails and roll everything back, but that is annoying.
Is there any technique (in any language: Java/C++/C#) to achieve this?
This sort of thing becomes easier when you use immutable collections. In an immutable collection, adding or removing a member does not change the collection, it returns a new collection. (Implementing immutable collections which can do that using acceptably little time and space is a tricky problem.)
But if you have immutable collections, the logic becomes much easier. Suppose you want to move an item from the left collection to the right collection:
newLeft = left.Remove(item);
newRight = right.Add(item);
left and right have not changed; they are immutable. Now the problem you have to solve is making the assignments left = newLeft and right = newRight atomic as a pair, which isn't that hard a problem to solve.
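One way to make that pair of assignments atomic is to package both collections into a single immutable snapshot object and publish the replacement with a compare-and-swap. A minimal sketch, with all names illustrative and plain lists standing in for real immutable collections:

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

// Both collections live in one immutable snapshot, so swapping in a new
// snapshot updates them together or not at all.
sealed class Snapshot
{
    public readonly IReadOnlyList<string> Left;
    public readonly IReadOnlyList<string> Right;
    public Snapshot(IReadOnlyList<string> left, IReadOnlyList<string> right)
    {
        Left = left;
        Right = right;
    }
}

class Mover
{
    private Snapshot current = new Snapshot(new[] { "item" }, new string[0]);

    public Snapshot Current => Volatile.Read(ref current);

    public void Move(string item)
    {
        while (true)
        {
            var old = Volatile.Read(ref current);
            var newLeft = new List<string>(old.Left);
            if (!newLeft.Remove(item)) return; // not present; nothing to move
            var newRight = new List<string>(old.Right) { item };
            var candidate = new Snapshot(newLeft, newRight);
            // Publish atomically; retry if another thread swapped first.
            if (Interlocked.CompareExchange(ref current, candidate, old) == old)
                return;
        }
    }
}
```

Readers always see either the pre-move or the post-move state, never a half-moved item.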
For small, simple objects, you can use a copy-modify-swap idiom. Copy the original object. Make the changes. If all the changes succeeded, swap the copy with the original. (In C++, swap is typically efficient and no-fail.) The destructor will then clean up the original, instead of the copy.
In your case, you'd copy both collections. Remove the object from the first, add it to the second, and then swap the original collections with the copies.
However, this may not be practical if you have large or hard-to-copy objects. In those cases, you generally have to do more work manually.
Yes, Memento pattern http://en.wikipedia.org/wiki/Memento_pattern
Software transactional memory is one approach. There is no language-agnostic technology for this that I know of.
You can use Herb Sutter's method, like this:
class EmployeeDatabase
{
    public void TerminateEmployee(int index)
    {
        // Clone sensitive objects.
        ArrayList tempActiveEmployees = (ArrayList)activeEmployees.Clone();
        ArrayList tempTerminatedEmployees = (ArrayList)terminatedEmployees.Clone();

        // Perform actions on temp objects.
        object employee = tempActiveEmployees[index];
        tempActiveEmployees.RemoveAt(index);
        tempTerminatedEmployees.Add(employee);

        // Now commit the changes.
        ArrayList tempSpace = null;
        ListSwap(ref activeEmployees, ref tempActiveEmployees, ref tempSpace);
        ListSwap(ref terminatedEmployees, ref tempTerminatedEmployees, ref tempSpace);
    }

    void ListSwap(ref ArrayList first, ref ArrayList second, ref ArrayList temp)
    {
        temp = first;
        first = second;
        second = temp;
        temp = null;
    }

    private ArrayList activeEmployees;
    private ArrayList terminatedEmployees;
}
Essentially, it means dividing the code into two parts:
void ExceptionNeutralMethod()
{
    //----------------------------------------------------------
    // All code that could possibly throw exceptions is in this
    // first section. In this section, no changes in state are
    // applied to any objects in the system, including this.
    //----------------------------------------------------------

    //----------------------------------------------------------
    // All changes are committed at this point using operations
    // strictly guaranteed not to throw exceptions.
    //----------------------------------------------------------
}
Of course, the ArrayList version above is just to illustrate the method; better to use generics if possible, etc.
EDIT
Additionally, if you have extreme reliability requirements, have a look at Constrained Execution Regions as well.

Partially thread-safe dictionary

I have a class that maintains a private Dictionary instance that caches some data.
The class writes to the dictionary from multiple threads using a ReaderWriterLockSlim.
I want to expose the dictionary's values outside the class.
What is a thread-safe way of doing that?
Right now, I have the following:
public ReadOnlyCollection<MyClass> Values()
{
    using (sync.ReadLock())
        return new ReadOnlyCollection<MyClass>(cache.Values.ToArray());
}
Is there a way to do this without copying the collection many times?
I'm using .Net 3.5 (not 4.0)
I want to expose the dictionary's values outside the class.
What is a thread-safe way of doing that?
You have three choices.
1) Make a copy of the data, hand out the copy. Pros: no worries about thread safe access to the data. Cons: Client gets a copy of out-of-date data, not fresh up-to-date data. Also, copying is expensive.
2) Hand out an object that locks the underlying collection when it is read from. You'll have to write your own read-only collection that has a reference to the lock of the "parent" collection. Design both objects carefully so that deadlocks are impossible. Pros: "just works" from the client's perspective; they get up-to-date data without having to worry about locking. Cons: More work for you.
3) Punt the problem to the client. Expose the lock, and make it a requirement that clients lock all views on the data themselves before using it. Pros: No work for you. Cons: Way more work for the client, work they might not be willing or able to do. Risk of deadlocks, etc, now become the client's problem, not your problem.
If you want a snapshot of the current state of the dictionary, there's really nothing else you can do with this collection type. This is the same technique used by the ConcurrentDictionary<TKey, TValue>.Values property.
If you don't mind throwing an InvalidOperationException if the collection is modified while you are enumerating it, you could just return cache.Values since it's readonly (and thus can't corrupt the dictionary data).
EDIT: I personally believe the below code is technically answering your question correctly (as in, it provides a way to enumerate over the values in a collection without creating a copy). Some developers far more reputable than I strongly advise against this approach, for reasons they have explained in their edits/comments. In short: This is apparently a bad idea. Therefore I'm leaving the answer but suggesting you not use it.
Unless I'm missing something, I believe you could expose your values as an IEnumerable<MyClass> without needing to copy values by using the yield keyword:
public IEnumerable<MyClass> Values
{
    get
    {
        using (sync.ReadLock())
        {
            foreach (MyClass value in cache.Values)
                yield return value;
        }
    }
}
Be aware, however (and I'm guessing you already knew this), that this approach provides lazy evaluation, which means that the Values property as implemented above can not be treated as providing a snapshot.
In other words... well, take a look at this code (I am of course guessing as to some of the details of this class of yours):
var d = new ThreadSafeDictionary<string, string>();
// d is empty right now
IEnumerable<string> values = d.Values;
d.Add("someKey", "someValue");
// If values were a snapshot, this would output nothing...
// but in FACT, since it is lazily evaluated, it will now contain
// what is CURRENTLY in d.Values ("someValue").
foreach (string s in values)
{
    Console.WriteLine(s);
}
So if it's a requirement that this Values property be equivalent to a snapshot of what is in cache at the time the property is accessed, then you're going to have to make a copy.
(begin 280Z28): The following is an example of how someone unfamiliar with the "C# way of doing things" could leave the read lock held indefinitely (the enumerator is never disposed, so the finally block that releases the lock never runs):
IEnumerator enumerator = obj.Values.GetEnumerator();
MyClass first = null;
if (enumerator.MoveNext())
first = enumerator.Current;
(end 280Z28)
Another possibility: expose just the ICollection interface, so in Values() you can return your own implementation. That implementation would hold only a reference to Dictionary.Values and always take the read lock when accessing items.
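A rough sketch of such a wrapper (names are illustrative). Note the caveat discussed in the answers above: because this uses an iterator, the read lock is taken on the first MoveNext and held until the enumerator is exhausted or disposed, so a client that abandons enumeration without disposing will leave the lock held:

```csharp
using System;
using System.Collections;
using System.Collections.Generic;
using System.Threading;

// A read-only view over the dictionary's values that takes the parent's
// read lock on every access instead of copying the data.
sealed class LockedValueCollection : IEnumerable<string>
{
    private readonly Dictionary<string, string> parent;
    private readonly ReaderWriterLockSlim sync;

    public LockedValueCollection(Dictionary<string, string> parent,
                                 ReaderWriterLockSlim sync)
    {
        this.parent = parent;
        this.sync = sync;
    }

    public int Count
    {
        get
        {
            sync.EnterReadLock();
            try { return parent.Count; }
            finally { sync.ExitReadLock(); }
        }
    }

    public IEnumerator<string> GetEnumerator()
    {
        // Runs lazily: the lock is acquired on the first MoveNext and
        // released when enumeration finishes or the enumerator is disposed.
        sync.EnterReadLock();
        try
        {
            foreach (var v in parent.Values)
                yield return v;
        }
        finally
        {
            sync.ExitReadLock();
        }
    }

    IEnumerator IEnumerable.GetEnumerator() => GetEnumerator();
}
```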
