How to write a thread-safe list using copy-on-write model in .NET?
Below is my current implementation, but after lots of reading about threading, memory barriers, etc, I know that I need to be cautious when multi-threading without locks is involved. Could someone comment if this is the correct implementation?
class CopyOnWriteList
{
private List<string> list = new List<string>();
private object listLock = new object();
public void Add(string item)
{
lock (listLock)
{
list = new List<string>(list) { item };
}
}
public void Remove(string item)
{
lock (listLock)
{
var tmpList = new List<string>(list);
tmpList.Remove(item);
list = tmpList;
}
}
public bool Contains(string item)
{
return list.Contains(item);
}
public string Get(int index)
{
return list[index];
}
}
EDIT
To be more specific: is the above code thread-safe, or should I add something more? Also, will all threads eventually see the change in the list reference? Or maybe I should add the volatile keyword to the list field, or a Thread.MemoryBarrier in the Contains method between accessing the reference and calling a method on it?
Here is, for example, a Java implementation that looks like my code above, but is such an approach also thread-safe in .NET?
And here is the same question, but also in Java.
Here is another question related to this one.
The implementation is correct because reference assignment is atomic, according to Atomicity of variable references. I would add volatile to list.
Your approach looks correct, but I'd recommend using a string[] rather than a List<string> to hold your data. When you're adding an item, you know exactly how many items are going to be in the resulting collection, so you can create a new array of exactly the size required. When removing an item, you can grab a copy of the list reference and search it for your item before making a copy; if it turns out that the item doesn't exist, there's no need to remove it. If it does exist, you can create a new array of the exact required size, and copy to the new array all the items preceding or following the item to be removed.
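A minimal sketch of the array-based variant described above, assuming a class name (CopyOnWriteStringList) and a TryGet method that are not in the original question. Writers copy under a lock and publish the new array with a single reference assignment; readers take the current array once and work only on that snapshot. Marking the field volatile is a conservative choice here, not something the question's code strictly depends on.

```csharp
using System;

public sealed class CopyOnWriteStringList
{
    private readonly object writeLock = new object();

    // Readers never lock; they only ever see a fully built array.
    private volatile string[] items = Array.Empty<string>();

    public void Add(string item)
    {
        lock (writeLock)
        {
            string[] old = items;
            // The new size is known exactly, so allocate it once.
            var copy = new string[old.Length + 1];
            Array.Copy(old, copy, old.Length);
            copy[old.Length] = item;
            items = copy; // publish the new snapshot
        }
    }

    public bool Remove(string item)
    {
        lock (writeLock)
        {
            string[] old = items;
            int index = Array.IndexOf(old, item);
            if (index < 0)
                return false; // nothing to remove, so no copy is made

            var copy = new string[old.Length - 1];
            Array.Copy(old, 0, copy, 0, index);                              // items before
            Array.Copy(old, index + 1, copy, index, old.Length - index - 1); // items after
            items = copy;
            return true;
        }
    }

    public bool Contains(string item)
    {
        return Array.IndexOf(items, item) >= 0;
    }

    // Hypothetical helper: reads the field once so the bounds check and
    // the element access use the same snapshot.
    public bool TryGet(int index, out string value)
    {
        string[] snapshot = items;
        if (index >= 0 && index < snapshot.Length)
        {
            value = snapshot[index];
            return true;
        }
        value = null;
        return false;
    }
}
```

TryGet replaces the indexer-style Get because a caller cannot know whether an index is still valid when another thread may remove items between two calls.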
Another thing you might want to consider would be to use a int[1] as your lock flag, and use a pattern something like:
// Assumes fields: string[] list; int[] listLock = new int[1];
static string[] withAddedItem(string[] oldList, string dat)
{
string[] result = new string[oldList.Length+1];
Array.Copy(oldList, result, oldList.Length);
result[oldList.Length] = dat; // store the new item in the last slot
return result;
}
int Add(string dat) // Returns index of newly-added item
{
string[] oldList, newList;
if (listLock[0] == 0)
{
// No writers are queued: try to publish without taking the lock
oldList = list;
newList = withAddedItem(oldList, dat);
if (System.Threading.Interlocked.CompareExchange(ref list, newList, oldList) == oldList)
return newList.Length - 1;
}
System.Threading.Interlocked.Increment(ref listLock[0]);
lock (listLock)
{
do
{
oldList = list;
newList = withAddedItem(oldList, dat);
} while (System.Threading.Interlocked.CompareExchange(ref list, newList, oldList) != oldList);
}
System.Threading.Interlocked.Decrement(ref listLock[0]);
return newList.Length - 1;
}
If there is no write contention, the CompareExchange will succeed without having to acquire a lock. If there is write contention, writes will be serialized by the lock. Note that the lock here is neither necessary nor sufficient to ensure correctness. Its purpose is to avoid thrashing in the event of write contention. It is possible that thread #1 might get past its first "if" test, and get task-switched out while many other threads simultaneously try to write the list and start using the lock. If that occurs, thread #1 might then "surprise" the thread in the lock by performing its own CompareExchange. Such an action would result in the lock-holding thread having to waste time making a new array, but that situation should arise rarely enough that the occasional cost of an extra array copy shouldn't matter.
Yes, it is thread-safe:
Collection modifications in Add and Remove are done on separate collections, so it avoids concurrent access to the same collection from Add and Remove or from Add/Remove and Contains/Get.
Assignment of the new collection is done inside lock, which is just pair of Monitor.Enter and Monitor.Exit, which both do a full memory barrier as noted here, which means that after the lock all threads should observe the new value of list field.
Related
My working assumption is that LINQ is thread-safe when used with the System.Collections.Concurrent collections (including ConcurrentDictionary).
(Other Overflow posts seem to agree: link)
However, an inspection of the implementation of the LINQ OrderBy extension method shows that it appears not to be threadsafe with the subset of concurrent collections which implement ICollection (e.g. ConcurrentDictionary).
The OrderedEnumerable GetEnumerator (source here) constructs an instance of a Buffer struct (source here) which tries to cast the collection to an ICollection (which ConcurrentDictionary implements) and then performs a collection.CopyTo with an array initialised to the size of the collection.
Therefore, if the ConcurrentDictionary (as the concrete ICollection in this case) grows in size during the OrderBy operation, between initialising the array and copying into it, this operation will throw.
The following test code shows this exception:
(Note: I appreciate that performing an OrderBy on a thread-safe collection which is changing underneath you is not that meaningful, but I do not believe it should throw)
using System;
using System.Collections.Concurrent;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;
namespace Program
{
class Program
{
static void Main(string[] args)
{
try
{
int loop = 0;
while (true) //Run many loops until exception thrown
{
Console.WriteLine($"Loop: {++loop}");
_DoConcurrentDictionaryWork().Wait();
}
}
catch (Exception ex)
{
Console.WriteLine(ex);
}
}
private static async Task _DoConcurrentDictionaryWork()
{
var concurrentDictionary = new ConcurrentDictionary<int, object>();
var keyGenerator = new Random();
var tokenSource = new CancellationTokenSource();
var orderByTaskLoop = Task.Run(() =>
{
var token = tokenSource.Token;
while (token.IsCancellationRequested == false)
{
//Keep ordering concurrent dictionary on a loop
var orderedPairs = concurrentDictionary.OrderBy(x => x.Key).ToArray(); //THROWS EXCEPTION HERE
//...do some more work with ordered snapshot...
}
});
var updateDictTaskLoop = Task.Run(() =>
{
var token = tokenSource.Token;
while (token.IsCancellationRequested == false)
{
//keep mutating dictionary on a loop
var key = keyGenerator.Next(0, 1000);
concurrentDictionary[key] = new object();
}
});
//Wait for 1 second
await Task.Delay(TimeSpan.FromSeconds(1));
//Cancel and dispose token
tokenSource.Cancel();
tokenSource.Dispose();
//Wait for orderBy and update loops to finish (now token cancelled)
await Task.WhenAll(orderByTaskLoop, updateDictTaskLoop);
}
}
}
That the OrderBy throws an exception leads to one of a few possible conclusions:
1) My assumption about LINQ being threadsafe with concurrent collections is incorrect, and it is only safe to perform LINQ on collections (be they concurrent or not) which are not mutating during the LINQ query
2) There is a bug in the implementation of LINQ OrderBy: it is incorrect for the implementation to cast the source collection to an ICollection and perform the collection copy (it should just drop through to its default behaviour of iterating the IEnumerable).
3) I have misunderstood what is going on here...
Thoughts much appreciated!
It's not stated anywhere that OrderBy (or other LINQ methods) should always use GetEnumerator of source IEnumerable or that it should be thread safe on concurrent collections. All that is promised is this method
Sorts the elements of a sequence in ascending order according to a
key.
ConcurrentDictionary is not thread-safe in some global sense either. It's thread-safe with respect to other operations performed on it. Even more, documentation says that
All public and protected members of ConcurrentDictionary
are thread-safe and may be used concurrently from multiple threads.
However, members accessed through one of the interfaces the
ConcurrentDictionary implements, including extension
methods, are not guaranteed to be thread safe and may need to be
synchronized by the caller.
So, your understanding is correct (OrderBy will see IEnumerable you pass to it is really ICollection, will then get length of that collection, allocate buffer of that size, then will call ICollection.CopyTo, and this is of course not thread safe on any type of collection), but it's not a bug in OrderBy because neither OrderBy nor ConcurrentDictionary ever promised what you assume.
If you want to do OrderBy in a thread safe way on ConcurrentDictionary, you need to rely on methods that are promised to be thread safe. For example:
// note: this is NOT IEnumerable.ToArray()
// but public ToArray() method of ConcurrentDictionary itself
// it is guaranteed to be thread safe with respect to other operations
// on this dictionary
var snapshot = concurrentDictionary.ToArray();
// we are working on snapshot so no one other thread can modify it
// of course at this point real contents of dictionary might not be
// the same as our snapshot
var sorted = snapshot.OrderBy(c => c.Key);
If you don't want to allocate an additional array (with ToArray), you can use Select(c => c) and it will work in this case, but then we are again in murky territory, relying on something to be safe in a situation where that was never promised (Select will also not always enumerate your collection: if the collection is an array or a list, it will shortcut and use indexers instead). So you can create an extension method like this:
public static class Extensions {
public static IEnumerable<T> ForceEnumerate<T>(this ICollection<T> collection) {
foreach (var item in collection)
yield return item;
}
}
And use it like this if you want to be safe and don't want to allocate array:
concurrentDictionary.ForceEnumerate().OrderBy(c => c.Key).ToArray();
In this case we are forcing enumeration of the ConcurrentDictionary (which we know is safe from the documentation) and then passing that to OrderBy, knowing that it cannot do any harm with that pure IEnumerable. Note that, as correctly pointed out in the comments by mjwills, this is not exactly the same as ToArray, because ToArray produces a snapshot (it locks the collection, preventing modifications while building the array), whereas Select / yield does not acquire any locks (so items might be added or removed while enumeration is in progress). Though I doubt it matters for the scenario described in the question: in both cases, once OrderBy has completed, you have no idea whether your ordered results reflect the current state of the collection or not.
So I have a IList as the value in my ConcurrentDictionary.
ConcurrentDictionary<int, IList<string>> list1 = new ConcurrentDictionary<int, IList<string>>();
In order to update a value in a list I do this:
if (list1.ContainsKey(key))
{
IList<string> templist;
list1.TryGetValue(key, out templist);
templist.Add("helloworld");
}
However, does adding a string to templist update the ConcurrentDictionary? If so, is the update thread-safe so that no data corruption would occur?
Or is there a better way to update or create a list inside the ConcurrentDictionary
EDIT
If I were to use a ConcurrentBag instead of a List, how would I implement this? More specifically, how could I update it? ConcurrentDictionary's TryUpdate method feels a bit excessive.
Does ConcurrentBag.Add update the ConcurrentDictionary in a thread-safe manner?
ConcurrentDictionary<int, ConcurrentBag<string>> list1 = new ConcurrentDictionary<int, ConcurrentBag<string>>();
Firstly, there's no need to do ContainsKey() and TryGetValue().
You should just do this:
IList<string> templist;
if (list1.TryGetValue(key, out templist))
templist.Add("helloworld");
In fact your code as written has a race condition.
Between one thread calling ContainsKey() and TryGetValue(), a different thread may have removed the item with that key. Then TryGetValue() will set templist to null, and you'll get a null reference exception when you call templist.Add().
Secondly, yes: There's another possible threading issue here. You don't know that the IList<string> stored inside the dictionary is threadsafe.
Therefore calling templist.Add() is not guaranteed to be safe.
You could use ConcurrentQueue<string> instead of IList<string>. This is probably going to be the most robust solution.
Note that simply locking access to the IList<string> wouldn't be sufficient.
This is no good:
if (list1.TryGetValue(key, out templist))
{
lock (locker)
{
templist.Add("helloworld");
}
}
unless you also use the same lock everywhere else that the IList may be accessed. This is not easy to achieve, hence it's better to either use a ConcurrentQueue<> or add locking to this class and change the architecture so that no other threads have access to the underlying IList.
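A rough sketch of the ConcurrentQueue<string> option mentioned above, assuming a hypothetical MessageStore wrapper (the class and method names are illustrative only). GetOrAdd returns the queue that is actually stored for the key, and Enqueue is itself thread-safe, so no ContainsKey check and no external lock are involved.

```csharp
using System.Collections.Concurrent;

// Hypothetical wrapper; names are not from the original question.
public class MessageStore
{
    private readonly ConcurrentDictionary<int, ConcurrentQueue<string>> queues =
        new ConcurrentDictionary<int, ConcurrentQueue<string>>();

    public void Append(int key, string value)
    {
        // The factory may run more than once under contention, but only one
        // queue is ever stored, and GetOrAdd always returns the stored one.
        queues.GetOrAdd(key, _ => new ConcurrentQueue<string>()).Enqueue(value);
    }

    public string[] Snapshot(int key)
    {
        ConcurrentQueue<string> queue;
        // ToArray on ConcurrentQueue copies a moment-in-time snapshot.
        return queues.TryGetValue(key, out queue) ? queue.ToArray() : new string[0];
    }
}
```

The queue keeps insertion order, which is one reason to prefer it over ConcurrentBag when the values were previously held in a list.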
Operations on a thread-safe dictionary are thread-safe by key, so to say. So as long as you access your values (in this case an IList<T>) only from one thread, you're good to go.
The ConcurrentDictionary does not prevent two threads from accessing the value belonging to one key at the same time.
You can use the ConcurrentDictionary.AddOrUpdate method to add an item to the list in a thread-safe way. It's simpler and should work fine.
var list1 = new ConcurrentDictionary<int, IList<string>>();
list1.AddOrUpdate(key,
new List<string>() { "test" }, (k, l) => { l.Add("test"); return l;});
UPD
According to the docs and sources, the factory delegates passed to the AddOrUpdate method are run outside the lock scope, so calling List methods inside a factory delegate is NOT thread safe.
See comments under this answer.
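Given that the AddOrUpdate delegates run outside the dictionary's lock, one possible workaround is to fetch the list with GetOrAdd and then lock on the list instance itself. This is only a sketch: the ListValues helper and its method names are hypothetical, and it is safe only if every reader and writer of the list takes the same lock.

```csharp
using System.Collections.Concurrent;
using System.Collections.Generic;

// Hypothetical helper; not part of the original answer.
public static class ListValues
{
    public static void AddToList(ConcurrentDictionary<int, IList<string>> dict, int key, string value)
    {
        // GetOrAdd returns the list that is actually stored for the key.
        IList<string> list = dict.GetOrAdd(key, _ => new List<string>());
        lock (list) // the list instance doubles as its own lock object
        {
            list.Add(value);
        }
    }

    public static string[] ReadList(ConcurrentDictionary<int, IList<string>> dict, int key)
    {
        IList<string> list;
        if (!dict.TryGetValue(key, out list))
            return new string[0];

        lock (list) // readers must take the same lock as writers
        {
            var copy = new string[list.Count];
            list.CopyTo(copy, 0);
            return copy;
        }
    }
}
```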
The ConcurrentDictionary has no effect on whether you can apply changes to value objects in a thread-safe manner or not. That is the responsibility of the value object (the IList implementation in your case).
Looking at the answers to No ConcurrentList<T> in .Net 4.0?, there are some good reasons why there is no ConcurrentList implementation in .NET.
Basically you have to take care of thread-safe changes yourself. The simplest way is to use the lock statement. E.g.
lock (templist)
{
templist.Add("hello world");
}
Another way is to use the ConcurrentBag in the .net Framework. But this way is only useful for you, if you do not rely on the IList interface and the ordering of items.
The best solution, a ConcurrentDictionary with ConcurrentBag values, has already been mentioned. I'm just going to add how to do that:
ConcurrentBag<string> bag= new ConcurrentBag<string>();
bag.Add("inputstring");
list1.AddOrUpdate(key,bag,(k,v)=>{
v.Add("inputString");
return v;
});
does adding a string to templist update the ConcurrentDictionary?
It does not.
Your thread safe collection (Dictionary) holds references to non-thread-safe collections (IList). So changing those is not thread safe.
I suppose you should consider using mutexes.
If you use ConcurrentBag<T>:
var dic = new ConcurrentDictionary<int, ConcurrentBag<string>>();
Something like this could work OK:
public static class ConcurentDictionaryExt
{
public static ConcurrentBag<V> AddToInternal<K, V>(this ConcurrentDictionary<K, ConcurrentBag<V>> dic, K key, V value)
=> dic.AddOrUpdate(key,
k => new ConcurrentBag<V>() { value },
(k, existingBag) =>
{
existingBag.Add(value);
return existingBag;
}
);
public static ConcurrentBag<V> AddRangeToInternal<K, V>(this ConcurrentDictionary<K, ConcurrentBag<V>> dic, K key, IEnumerable<V> values)
=> dic.AddOrUpdate(key,
k => new ConcurrentBag<V>(values),
(k, existingBag) =>
{
foreach (var v in values)
existingBag.Add(v);
return existingBag;
}
);
}
I didn't test it yet :)
I have a list that may be null if not yet instantiated, and I want when calling GetList() to be able to return existing or create the list and then return. This looks cleaner:
private List<object> m_objects;
public List<object> GetList()
{
m_objects = m_objects ?? new List<object>();
return m_objects;
}
But is there a performance hit for setting the list as itself, or does C# realize that that's not necessary?
The alternative would be:
private List<object> m_objects;
public List<object> GetList()
{
if(m_objects != null)
{
return m_objects;
}
m_objects = new List<object>();
return m_objects;
}
Obviously not the end of the world but I'm still curious.
Use Lazy<T>:
private Lazy<List<object>> m_objects = new Lazy<List<object>>();
public List<object> GetList()
{
return m_objects.Value;
}
Addressing the performance issue. Worrying about performance here is premature optimisation. You should code it first to just work and then if you see any performance related problems, profile it and optimise.
This is perfectly valid:
private List<string> items;
public List<string> Items { get { return items ?? (items = new List<string>()); } }
note the ?? (items = difference. There is no performance hit as it is boolean short circuited at ?? if it has a non null value.
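As pointed out, if you want to keep your original code, then YES, it does carry a small cost: m_objects is assigned back to itself on every call, even when the list already exists. (A new list is still only created while the field is null.)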
As the question specifically asks for performance, this gives you the best possible performance when calling GetList() and it's thread safe:
private readonly List<object> m_objects = new List<object>();
public List<object> GetList()
{
return m_objects;
}
Another option for making it thread safe would be using Lazy<T>. This defers the new List<object>() at the expense of always doing new Lazy<List<object>>() with additional overhead in the GetList() method.
There will probably be a minimal performance hit because you are basically assigning the reference held in m_objects to the m_objects field. That is effectively just copying a 64 or 32-bit pointer. You wouldn't be copying around all the data, just the reference to the instance of the object (or null).
For your code, I think using the ?? operator is very neat and adequate as long as you aren't worried about multiple threads. If you are, then the Lazy<> approach that @Michał Kędrzyński suggested is the better way to go, since you wouldn't need to write the synchronization yourself.
The only exception being if that you needed to make sure you were on the Dispatcher thread in WPF or the UI thread in Winforms due to UI limitations.
Edit
Just to correct my own answer, using Lazy is moot if your generic type isn't thread-safety aware anyway (for example, List isn't thread safe)
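For completeness, here is a sketch of a third option that sits between the ?? version and Lazy<T>: LazyInitializer.EnsureInitialized from System.Threading. The Holder class name is an assumption for illustration. If two threads race, both may build a list, but only one instance is stored and both callers get that same instance back. As the edit above notes, this only makes the initialization safe; the List<object> itself is still not thread-safe.

```csharp
using System.Collections.Generic;
using System.Threading;

// Hypothetical containing class.
public class Holder
{
    private List<object> m_objects;

    public List<object> GetList()
    {
        // Assigns the field only if it is still null, then returns the
        // value that ended up in the field.
        return LazyInitializer.EnsureInitialized(ref m_objects, () => new List<object>());
    }
}
```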
I have the following code and wonder whether it is thread safe. I only lock when I add or remove items from the collection but do not lock when I iterate over the collection. Locking while iterating would severely impact performance because the collection potentially contains hundreds of thousands of items. Any advice what to do to make this thread safe?
Thanks
public class Item
{
public string DataPoint { get; private set; }
public Item(string dataPoint)
{
DataPoint = dataPoint;
}
}
public class Test
{
private List<Item> _items;
private readonly object myListLock = new object();
public Test()
{
_items = new List<Item>();
}
public void Subscribe(Item item)
{
lock (myListLock)
{
if (!_items.Contains(item))
{
_items.Add(item);
}
}
}
public void Unsubscribe(Item item)
{
lock (myListLock)
{
if (_items.Contains(item))
{
_items.Remove(item);
}
}
}
public void Iterate()
{
foreach (var item in _items)
{
var dp = item.DataPoint;
}
}
}
EDIT
I was curious and again profiled performance between an iteration that is not locked vs iterating inside the lock on the myListLock and the performance overhead of locking the iteration over 10 million items was actually quite minimal.
No, it isn't thread safe, because the collection could be modified while you look inside it... What you could do:
Item[] items;
lock (myListLock)
{
items = _items.ToArray();
}
foreach (var item in items)
{
var dp = item.DataPoint;
}
so you duplicate the collection inside a lock before cycling on it. This clearly will use memory (because you have to duplicate the List<>) (ConcurrentBag<>.GetEnumerator() does nearly exactly this)
Note that this works only if Item is thread safe (for example because it is immutable)
In theory your code is not threadsafe.
In the background, foreach walks the list through an enumerator, much like an index-based loop. If you add an item from a different thread while foreach iterates through your list, an item might be left out, or the enumerator may notice the change and throw an InvalidOperationException. If you remove an item (from a different thread), you might likewise get an exception or, even worse, stale or inconsistent data.
If you want your code to be threadsafe, you have two choices:
You can clone your list (I usually use the .ToArray() method for that purpose). It will lead to doubling the list in the memory with all its cost and the result might not be up-to-date for in-situ editions, or...
You can put your entire iteration in a locked block, which will result in blocking other threads accessing the array while you are performing a long-running operation.
No, it isn't. Note that every class documented on MSDN has a section on Thread Safety (near the end): https://msdn.microsoft.com/en-us/library/6sh2ey19%28v=vs.110%29.aspx
The documentation for GetEnumerator has some more notes: https://msdn.microsoft.com/en-us/library/b0yss765%28v=vs.110%29.aspx
The key point is that iteration is not itself thread-safe. Even if each individual iterated read from the collection is thread safe, a consistent iteration often breaks down if the collection is modified. You could get problems such as reading the same element twice or skipping some elements, even if the collection itself is never in an inconsistent state.
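To make the point about iteration concrete, here is a small single-threaded sketch (the IterationDemo name is illustrative). List<T>'s enumerator tracks an internal version number and throws InvalidOperationException on the next step after the list has been modified. With a second thread doing the modification, that check is best-effort only, so skipped or repeated elements are also possible.

```csharp
using System;
using System.Collections.Generic;

public static class IterationDemo
{
    // Returns true when the enumerator detects the modification.
    public static bool ModifyingDuringForeachThrows()
    {
        var items = new List<int> { 1, 2, 3 };
        try
        {
            foreach (var item in items)
            {
                if (item == 2)
                    items.Add(4); // modifies the list mid-iteration
            }
            return false;
        }
        catch (InvalidOperationException)
        {
            // "Collection was modified; enumeration operation may not execute."
            return true;
        }
    }
}
```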
Btw, your Unsubscribe() is doing TWO linear searches of the list, which is probably not what you want. You shouldn't need to call Contains() before Remove().
As far as Thread Safety goes is this ok to do or do I need to be using a different collection ?
List<FileMemberEntity> fileInfo = getList();
Parallel.ForEach(fileInfo, fileMember =>
{
//Modify each fileMember
});
As long as you are only modifying the contents of the item that is passed to the method, there is no locking needed.
(Provided of course that there are no duplicate reference in the list, i.e. two references to the same FileMemberEntity instance.)
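A minimal sketch of that case, assuming a made-up shape for FileMemberEntity (the Name and Processed properties are not from the question). Each iteration touches only the item it was handed, and the list itself is never added to or removed from, so no lock is taken.

```csharp
using System.Collections.Generic;
using System.Threading.Tasks;

// Hypothetical shape; the real FileMemberEntity is not shown in the question.
public class FileMemberEntity
{
    public string Name { get; set; }
    public bool Processed { get; set; }
}

public static class ParallelUpdate
{
    public static void MarkAll(List<FileMemberEntity> fileInfo)
    {
        Parallel.ForEach(fileInfo, fileMember =>
        {
            // Only this item's own state is modified.
            fileMember.Name = fileMember.Name.ToUpperInvariant();
            fileMember.Processed = true;
        });
    }
}
```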
If you need to modify the list itself, create a copy that you can iterate, and use a lock when you modify the list:
List<FileMemberEntity> fileInfo = getList();
List<FileMemberEntity> copy = new List<FileMemberEntity>(fileInfo);
object sync = new Object();
Parallel.ForEach(copy, fileMember => {
// do something
lock (sync) {
// here you can add or remove items from the fileInfo list
}
// do something
});
You're safe since you are just reading. Just don't modify the list while you are iterating over its items.
We should lock less often to make it faster. Let each worker thread of Parallel.ForEach collect into its own local list, and only take the lock when merging:
List<FileMemberEntity> copy = new List<FileMemberEntity>(fileInfo);
object sync = new Object();
Parallel.ForEach<FileMemberEntity, List<FileMemberEntity>>(
copy,
() => { return new List<FileMemberEntity>(); },
(itemInCopy, state, localList) =>
{
// collect per-thread results here; the local list needs no lock
localList.Add(itemInCopy);
return localList;
},
// merge into fileInfo, not copy: copy may still be enumerated by other workers
(finalResult) => { lock (sync) fileInfo.AddRange(finalResult); }
);
// do something
Reference: http://msdn.microsoft.com/en-gb/library/ff963547.aspx
If it does not matter what order the FileMemberEntity objects are acted on, you can use List<T> because you are not modifying the list.
If you must ensure some sort of ordering, you can use OrderablePartitioner<T> as a base class and implement an appropriate partitioning scheme. For example, if the FileMemberEntity has some sort of categorization and you must process each of the categories in some specific order, you would want to go this route.
Hypothetically if you have
Object 1 Category A
Object 2 Category A
Object 3 Category B
there is no guarantee that Object 2 Category A will be processed before Object 3 Category B is processed when iterating a List<T> using Parallel.ForEach.
The MSDN documentation you link to provides an example of how to do that.
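If the only ordering requirement is between categories, a simpler alternative to a custom OrderablePartitioner<T> is to group the items and run one Parallel.ForEach per category in sequence. This is a different technique from the partitioner approach above, sketched under the assumption of a hypothetical CategorizedItem type. Because Parallel.ForEach does not return until all of its iterations have finished, every item in category A completes before any item in category B starts; order within a category is still unspecified.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

// Hypothetical item type for illustration.
public class CategorizedItem
{
    public int Id { get; set; }
    public string Category { get; set; }
}

public static class OrderedProcessing
{
    public static void ProcessByCategory(
        IEnumerable<CategorizedItem> items,
        Action<CategorizedItem> process)
    {
        // Categories run one after another, in ordinal key order.
        var groups = items
            .GroupBy(i => i.Category)
            .OrderBy(g => g.Key, StringComparer.Ordinal);

        foreach (var group in groups)
        {
            // Items inside one category are processed in parallel.
            Parallel.ForEach(group, process);
        }
    }
}
```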