What is the difference between SynchronizedCollection<T> and the other concurrent collections? - c#

How do SynchronizedCollection<T> and the concurrent collections in the System.Collections.Concurrent namespace differ from each other, apart from Concurrent Collections being a namespace and SynchronizedCollection<T> being a class?
SynchronizedCollection<T> and all of the classes in Concurrent Collections provide thread-safe collections. How do I decide when to use one over the other, and why?

The SynchronizedCollection<T> class came first; it was introduced in .NET 3.0 to provide a thread-safe collection class. It achieves thread safety via locking, so you essentially have a List<T> where every access is wrapped in a lock statement.
The System.Collections.Concurrent namespace is much newer. It wasn't introduced until .NET 4.0, and it includes a substantially improved and more diverse set of choices. These classes avoid coarse-grained locks in favor of lock-free and fine-grained locking techniques, which means they should scale better in situations where multiple threads are accessing their data simultaneously. However, a class implementing the IList<T> interface is notably absent among these options.
So, if you're targeting version 4.0 of the .NET Framework, you should use one of the collections provided by the System.Collections.Concurrent namespace whenever possible. Just as with choosing between the various types of collections provided in the System.Collections.Generic namespace, you'll need to choose the one whose features and characteristics best fit your specific needs.
If you're targeting an older version of the .NET Framework or need a collection class that implements the IList<T> interface, you'll have to opt for the SynchronizedCollection<T> class.
This article on MSDN is also worth a read: When to Use a Thread-Safe Collection

The SynchronizedCollection<T> is a synchronized List<T>. It's a concept that can be devised in a second, and implemented fully in about an hour: just wrap each method of a List<T> inside a lock on a single synchronization object (the SyncRoot), and you are done. Now you have a thread-safe collection that can cover all the needs of a multithreaded application. Except that it doesn't.
The shortcomings of the SynchronizedCollection<T> become apparent as soon as you try to do anything non-trivial with it. Specifically as soon as you try to combine two or more methods of the collection for a conceptually singular operation. Then you realize that the operation is not atomic, and cannot be made atomic without resorting to explicit synchronization (locking on the SyncRoot property of the collection), which undermines the whole purpose of the collection. Some examples:
Ensure that the collection contains unique elements: if (!collection.Contains(x)) collection.Add(x);. This code ensures nothing. The inherent race condition between Contains and Add allows duplicates to occur.
Ensure that the collection contains at most N elements: if (collection.Count < N) collection.Add(x);. The race condition between Count and Add allows more than N elements in the collection.
Replace "Foo" with "Bar": int index = collection.IndexOf("Foo"); if (index >= 0) collection[index] = "Bar";. When a thread reads the index, its value is immediately stale. Another thread might change the collection in a way that the index points to some other element, or it's out of range.
At this point you realize that multithreading is more demanding than what you originally thought. Adding a layer of synchronization around the API of an existing collection doesn't cut it. You need a collection that is designed from the ground up for multithreaded usage, and has an API that reflects this design. This was the motivation for the introduction of the concurrent collections in .NET Framework 4.0.
The concurrent collections, for example the ConcurrentQueue<T> and the ConcurrentDictionary<K,V>, are highly sophisticated components. They are orders of magnitude more sophisticated than the clumsy SynchronizedCollection<T>. They are equipped with special atomic APIs that are well suited for multithreaded environments (TryDequeue, GetOrAdd, AddOrUpdate etc), and also with implementations that aim at minimizing the contention under heavy usage. Internally they employ lock-free, low-lock and granular-lock techniques. Learning how to use these collections requires some study. They are not direct drop-in replacements of their non-concurrent counterparts.
Caution: the enumeration of a SynchronizedCollection<T> is not synchronized. Getting an enumerator with GetEnumerator is synchronized, but using the enumerator is not. So if one thread does a foreach (var item in collection) while another thread mutates the collection in any way (Add, Remove, etc.), the behavior of the program is undefined. The safe way to enumerate a SynchronizedCollection<T> is to get a snapshot of the collection, and then enumerate the snapshot. Getting a snapshot is not trivial, because it involves two method calls (the Count getter and the CopyTo), so explicit synchronization is required. Beware of the LINQ ToArray operator; it's not thread-safe by itself. Below is a safe ToArraySafe extension method for the SynchronizedCollection<T> class:
/// <summary>Copies the elements of the collection to a new array.</summary>
public static T[] ToArraySafe<T>(this SynchronizedCollection<T> source)
{
    ArgumentNullException.ThrowIfNull(source);
    lock (source.SyncRoot)
    {
        T[] array = new T[source.Count];
        source.CopyTo(array, 0);
        return array;
    }
}

Related

Migrating from Dictionary to ConcurrentDictionary, what are the common traps that I should be aware of?

I am looking at migrating from Dictionary to ConcurrentDictionary for a multithreaded environment.
Specific to my use case, a kvp would typically be <string, List<T>>
What do I need to look out for?
How do I implement successfully for thread safety?
How do I manage reading key and values in different threads?
How do I manage updating key and values in different threads?
How do I manage adding/removing key and values in different threads?
What do I need to look out for?
Depends on what you are trying to achieve :)
How do I manage reading key and values in different threads?
How do I manage updating key and values in different threads?
How do I manage adding/removing key and values in different threads?
Those should be handled by the dictionary itself in a thread-safe manner, with several caveats:
Functions accepting factories, like ConcurrentDictionary<TKey,TValue>.GetOrAdd(TKey, Func<TKey,TValue>), are not thread-safe in terms of factory invocation (i.e. the dictionary does not guarantee that the factory will be invoked only once if multiple threads try to get or add the item). Quote from the docs:
All these operations are atomic and are thread-safe with regards to all other operations on the ConcurrentDictionary<TKey,TValue> class. The only exceptions are the methods that accept a delegate, that is, AddOrUpdate and GetOrAdd. For modifications and write operations to the dictionary, ConcurrentDictionary<TKey,TValue> uses fine-grained locking to ensure thread safety. (Read operations on the dictionary are performed in a lock-free manner.) However, delegates for these methods are called outside the locks to avoid the problems that can arise from executing unknown code under a lock. Therefore, the code executed by these delegates is not subject to the atomicity of the operation.
In your particular case the value, List<T>, is not thread-safe itself, so while the dictionary operations will be thread-safe (with the exception from the previous point), mutating operations on the value itself will not be. Consider using something like ConcurrentBag<T>, or switching to an IReadOnlyDictionary.
Personally, I would be cautious working with a concurrent dictionary via explicitly implemented interfaces like IDictionary<TKey, TValue> and/or the indexer, which can lead to race conditions in read-update-write scenarios.
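One common mitigation for the factory caveat is to store Lazy<TValue> values: GetOrAdd may still race, but every thread ends up observing the same Lazy<T>, so the expensive work inside it runs at most once. A sketch (LoadList here is a stand-in for whatever expensive factory you actually have):

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;

var cache = new ConcurrentDictionary<string, Lazy<List<int>>>();

// Stand-in for a hypothetical expensive factory.
List<int> LoadList(string key) => new List<int> { key.Length };

// GetOrAdd may invoke the lambda more than once under contention,
// but Lazy<T> guarantees LoadList itself executes at most once per key.
List<int> GetList(string key) =>
    cache.GetOrAdd(key, k => new Lazy<List<int>>(() => LoadList(k))).Value;
```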
The ConcurrentDictionary<TKey,TValue> collection is surprisingly difficult to master. The pitfalls that are waiting to trap the unwary are numerous and subtle. Here are some of them:
Giving the impression that the ConcurrentDictionary<TKey,TValue> blesses everything it contains with thread-safety. That's not true. If the TValue is a mutable class, and is allowed to be mutated by multiple threads, it can be corrupted just as easily as if it weren't contained in the dictionary.
Using the ConcurrentDictionary<TKey,TValue> with patterns familiar from the Dictionary<TKey,TValue>. Race conditions can trivially emerge. For example if (dict.ContainsKey(x)) list = dict[x]; is wrong. In a multithreaded environment it is entirely possible that the key x will be removed between the dict.ContainsKey(x) and the list = dict[x], resulting in a KeyNotFoundException. The ConcurrentDictionary<TKey,TValue> is equipped with special atomic APIs (like TryGetValue) that should be used instead of this chatty check-then-act pattern.
Using Count == 0 to check whether the dictionary is empty. The Count property is very cheap for a Dictionary<TKey,TValue>, and quite expensive for a ConcurrentDictionary<TKey,TValue>. The correct property to use is IsEmpty.
Assuming that the AddOrUpdate method can be safely used for updating a mutable TValue object. This is not a correct assumption. The "Update" in the name of the method means "update the dictionary, by replacing an existing value with a new value". It doesn't mean "modify an existing value".
Assuming that enumerating a ConcurrentDictionary<TKey,TValue> will yield the entries that were stored in the dictionary at the point in time that the enumeration started. That's not true. The enumerator does not maintain a snapshot of the dictionary. The behavior of the enumerator is not documented precisely. It's not even guaranteed that a single enumeration of a ConcurrentDictionary<TKey,TValue> will yield unique keys. In case you want to do an enumeration with snapshot semantics you must first take a snapshot explicitly with the (expensive) ToArray method, and then enumerate the snapshot. You might even consider switching to an ImmutableDictionary<TKey,TValue>, which is exceptionally good at providing these semantics.
Assuming that calling extension methods on a ConcurrentDictionary<TKey,TValue>'s interfaces is safe. This is not the case. For example the ToArray method is safe because it's a native method of the class. The ToList is not safe because it is a LINQ extension method on the IEnumerable<KeyValuePair<TKey,TValue>> interface. This method internally first calls the Count property of the ICollection<KeyValuePair<TKey,TValue>> interface, and then the CopyTo of the same interface. In a multithreaded environment the Count obtained by the first operation might not be compatible with the second operation, resulting in either an ArgumentException or a list that contains empty elements at the end.
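Most of these pitfalls boil down to replacing familiar check-then-act sequences with the dictionary's single atomic calls, for example:

```csharp
using System.Collections.Concurrent;

var dict = new ConcurrentDictionary<string, int>();

// Atomic read; never throws, unlike dict[key] after a racing ContainsKey:
dict.TryGetValue("hits", out int hits);

// Atomic get-or-insert:
int initial = dict.GetOrAdd("hits", 0);

// Atomic update by *replacing* the value (it does not mutate it in place):
dict.AddOrUpdate("hits", 1, (key, old) => old + 1);
```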
In conclusion, migrating from a Dictionary<TKey,TValue> to a ConcurrentDictionary<TKey,TValue> is not trivial. In many scenarios sticking with the Dictionary<TKey,TValue> and adding synchronization around it might be an easier (and safer) path to thread-safety. IMHO the ConcurrentDictionary<TKey,TValue> should be considered more as a performance-optimization over a synchronized Dictionary<TKey,TValue>, than as the tool of choice when a dictionary is needed in a multithreading scenario.

Using the ConcurrentBag type as a thread-safe substitute for a List

In general, would using the ConcurrentBag type be an acceptable thread-safe substitute for a List? I have read some answers on here that have suggested the use of ConcurrentBag when one was having a problem with thread safety with generic Lists in C#.
After reading a bit about ConcurrentBag, however, it seems performing a lot of searches and looping through the collection does not match with its intended usage. It seems to mostly be intended to solve producer/consumer problems, where jobs are being (some-what randomly) added and removed from the collection.
This is an example of the type of (IEnumerable) operations I want to use with the ConcurrentBag:
...
private readonly ConcurrentBag<Person> people = new ConcurrentBag<Person>();

public void AddPerson(Person person)
{
    people.Add(person);
}

public Person GetPersonWithName(string name)
{
    return people.Where(x => name.Equals(x.Name)).FirstOrDefault();
}
...
Would this cause performance concerns, and is it even a correct way to use a ConcurrentBag collection?
.NET's built-in concurrent data structures are mostly designed for patterns like producer-consumer, where there is a constant flow of work through the container.
In your case, the list seems to be long-term (relative to the lifetime of the class) storage, rather than just a resting place for some data before a consumer comes along to take it away and do something with it. In this case I'd suggest using a normal List<T> (or whichever non-concurrent collection is most appropriate for the operations you're intending), and simply using locks to control access to it.
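Under that suggestion, the class from the question might look roughly like this (a sketch; the Person record is a stand-in for the question's Person type, and the private gate object guards every access to the list):

```csharp
using System.Collections.Generic;

public record Person(string Name); // stand-in for the question's Person type

public class PersonRegistry
{
    private readonly List<Person> people = new List<Person>();
    private readonly object gate = new object();

    public void AddPerson(Person person)
    {
        lock (gate) { people.Add(person); }
    }

    public Person GetPersonWithName(string name)
    {
        // The search runs under the lock, so the list cannot change mid-scan.
        lock (gate) { return people.Find(p => name.Equals(p.Name)); }
    }
}
```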
A Bag is just the most general form of collection, allowing multiple identical entries, and without even the ordering of a List. It does happen to be useful in producer/consumer contexts where fairness is not an issue, but it is not specifically designed for that.
Because a Bag does not have any structure with respect to its contents, it's not very suitable for performing searches. In particular, the use case you mention will require time that scales with the size of the bag. A HashSet might be better if you don't need to be able to store multiple copies of an item and if manual synchronization is acceptable for your use case.
As far as I understand the ConcurrentBag, it makes use of multiple lists: it creates a list for each thread that uses it. Thus, when reading or accessing the ConcurrentBag from the same thread again, the performance should be roughly the same as with a normal List, but when the ConcurrentBag is accessed from a different thread there will be a performance overhead, as it has to search for the value in the "internal" lists created for each thread.
The MSDN page says the following regarding the ConcurrentBag:
Bags are useful for storing objects when ordering doesn't matter, and unlike sets, bags support duplicates. ConcurrentBag is a thread-safe bag implementation, optimized for scenarios where the same thread will be both producing and consuming data stored in the bag.
http://msdn.microsoft.com/en-us/library/dd381779%28v=VS.100%29.aspx
In general, would using the ConcurrentBag type be an acceptable thread-safe substitute for a List?
No, not in general because, concurring with Warren Dew, a List is ordered while a Bag is not (surely mine isn't ;)
But in cases where (potentially concurrent) reads greatly outnumber writes, you could just copy-on-write wrap your List.
That is a general solution, as you are working with original List instances, except that (as explained in more detail in the above link) you have to make sure that everyone modifying the List uses the appropriate copy-on-write utility method, which you could enforce by using List.AsReadOnly().
In highly concurrent programs, copy-on-write has many desirable performance properties in mostly-read scenarios, compared to locking.
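A minimal copy-on-write sketch, assuming reads greatly outnumber writes: readers take the current reference without locking, while writers build a fresh copy and publish it with a single reference swap:

```csharp
using System.Collections.Generic;

public class CopyOnWriteList<T>
{
    private volatile List<T> snapshot = new List<T>();
    private readonly object writeLock = new object();

    // Readers see a stable snapshot; no locking needed.
    public IReadOnlyList<T> Items => snapshot;

    public void Add(T item)
    {
        lock (writeLock)
        {
            var copy = new List<T>(snapshot) { item };
            snapshot = copy; // atomic reference publish; readers see old or new list
        }
    }
}
```

Readers holding the old reference keep iterating their old (unchanged) list, which is exactly the snapshot semantics copy-on-write provides.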

Add collection to BlockingCollection

BlockingCollection contains only methods to add individual items. What if I want to add a collection? Should I just use foreach loop?
Why BlockingCollection doesn't contain method to add a collection? I think such method can be pretty useful.
ICollection interfaces and many of the BCL list-type classes don't have an AddRange method for some reason, and it is annoying.
Yes, you'll need to foreach over the collection. You could write your own extension method if you're using it a lot.
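Such an extension method might look like this. Note that it is not atomic: each item is added individually, so consumers and other producers can interleave between the adds:

```csharp
using System.Collections.Concurrent;
using System.Collections.Generic;

public static class BlockingCollectionExtensions
{
    public static void AddRange<T>(this BlockingCollection<T> collection, IEnumerable<T> items)
    {
        foreach (T item in items)
            collection.Add(item); // blocks if a bounded collection is full
    }
}
```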
The BlockingCollection<T> is a wrapper of an underlying collection that implements the IProducerConsumerCollection<T> interface, which by default is a ConcurrentQueue<T>. This interface has only the method TryAdd for adding elements. It doesn't have a TryAddRange. The reason is, I guess, that not all native IProducerConsumerCollection<T> implementations are equipped with AddRange functionality. The ConcurrentStack<T> does have the PushRange method, but the ConcurrentQueue<T> doesn't have an equivalent.
My understanding is that if this API existed, it would have atomic semantics. It wouldn't be just a convenience method for reducing the code needed for doing multiple non-atomic insertions. Quoting from the ConcurrentStack<T>.PushRange documentation:
When adding multiple items to the stack, using PushRange is a more efficient mechanism than using Push one item at a time. Additionally, PushRange guarantees that all of the elements will be added atomically, meaning that no other threads will be able to inject elements between the elements being pushed. Items at lower indices in the items array will be pushed before items at higher indices.
Atomicity is especially important when adding to a blocking stack, because an added item can be immediately taken by another thread that is currently blocked. If I add the items A, B and C in the order C-B-A, it's because I want the A to be placed at the top of the stack and picked by another thread first. But since the additions are not atomic, another thread could win the race and take the C before the current thread manages to add the B and the A. According to my experiments this happens rarely, but there is no way to prevent it. In case it is absolutely necessary to enforce the insertion order, there is no other option than implementing a custom BlockingStack<T> collection from scratch.

IEnumerable<T> thread safety?

I have a main thread that populates a List<T>. Further, I create a chain of objects that will execute on different threads, requiring access to the List. The original list will never be written to after it's generated. My thought was to pass the list as IEnumerable<T> to the objects executing on other threads, mainly so that those implementing those objects cannot write to the list by mistake. In other words, if the original list is guaranteed not to be written to, is it safe for multiple threads to use .Where or foreach on the IEnumerable?
I am not sure if the iterator in itself is thread safe if the original collection is never changed.
IEnumerable<T> can't be modified through the interface. So what can be non-thread-safe about it (if you don't modify the actual List<T>)?
Thread-safety problems require both write and read operations.
The "iterator in itself" is instantiated anew for each foreach.
Edit: I simplified my answer a bit, but @Eric Lippert added a valuable comment. IEnumerable<T> doesn't define modifying methods, but that doesn't mean its access operations are thread-safe (GetEnumerator, MoveNext, etc.). The simplest example is a GetEnumerator implemented like this:
It returns the same instance of IEnumerator every time
It resets that instance's position
A more sophisticated example is caching.
This is an interesting point, but fortunately I don't know of any standard class with a non-thread-safe implementation of IEnumerable.
Each thread that calls Where or foreach gets its own enumerator - they don't share one enumerator object for the same list. So since the List isn't being modified, and since each thread is working with its own copy of an enumerator, there should be no thread safety issues.
You can see this at work in one thread - Just create a List of 10 objects, and get two enumerators from that List. Use one enumerator to enumerate through 5 items, and use the other to enumerate through 5 items. You will see that both enumerators enumerated through only the first 5 items, and that the second one did not start where the first enumerator left off.
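That experiment can be sketched as follows; each enumerator keeps its own position into the unchanged list:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

var list = Enumerable.Range(1, 10).ToList();

using var e1 = list.GetEnumerator();
using var e2 = list.GetEnumerator();

for (int i = 0; i < 5; i++) e1.MoveNext();
for (int i = 0; i < 5; i++) e2.MoveNext();

Console.WriteLine(e1.Current); // 5
Console.WriteLine(e2.Current); // 5, not 10: the positions are independent
```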
As long as you are certain that the List will never be modified then it will be safe to read from multiple threads. This includes the use of the IEnumerator instances it provides.
This is going to be true for most collections. In fact, all collections in the BCL should be stable during enumeration. In other words, the enumerator will not modify the data structure. I can think of some obscure cases, like a splay tree, where enumerating it might modify the structure. Again, none of the BCL collections do that.
If you are certain that the list will not be modified after creation, you should guarantee that by converting it to a ReadOnlyCollection<T>. Of course, if you keep the original list that the read-only collection wraps, you can still modify it, but if you toss the original list away you're effectively making it permanently read-only.
From the Thread Safety section of the collection:
A ReadOnlyCollection can support multiple readers concurrently, as long as the collection is not modified.
So if you don't touch the original list again and stop referencing it, you can ensure that multiple threads can read it without worry (so long as you don't do anything wacky with trying to modify it again).
In other words if the original list is guaranteed not be written to, is it safe for multiple threads to use .Where or foreach on the IEnumerable?
Yes it's only a problem if the list gets mutated.
But note that an IEnumerable<T> can be cast back to a List<T> (if that is the underlying object) and then modified.
But there is another alternative: wrap your list into a ReadOnlyCollection<T> and pass that around. If you now throw away the original list you basically created a new immutable list.
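In code, assuming the construction thread holds the only reference to the original list:

```csharp
using System.Collections.Generic;
using System.Collections.ObjectModel;

var source = new List<int> { 1, 2, 3 };
ReadOnlyCollection<int> readOnly = source.AsReadOnly();
source = null; // drop the writable reference; only readers remain

// Callers cannot cast readOnly back to a mutable List<int>,
// and any write through its IList<int> interface throws NotSupportedException.
```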
If you are using .NET Framework 4.5 or greater, this could be a great solution:
http://msdn.microsoft.com/en-us/library/dd997305(v=vs.110).aspx
(Microsoft already implemented a thread-safe enumerable.)

Are there any C# collections where modification does not invalidate iterators?

Are there any data structures in the C# Collections library where modification of the structure does not invalidate iterators?
Consider the following:
List<int> myList = new List<int>();
myList.Add( 1 );
myList.Add( 2 );
List<int>.Enumerator myIter = myList.GetEnumerator();
myIter.MoveNext(); // myIter.Current == 1
myList.Add( 3 );
myIter.MoveNext(); // throws InvalidOperationException
Yes, take a look at the System.Collections.Concurrent namespace in .NET 4.0.
Note that for some of the collections in this namespace (e.g., ConcurrentQueue<T>), this works by only exposing an enumerator on a "snapshot" of the collection in question.
From the MSDN documentation on ConcurrentQueue<T>:
The enumeration represents a moment-in-time snapshot of the contents of the queue. It does not reflect any updates to the collection after GetEnumerator was called. The enumerator is safe to use concurrently with reads from and writes to the queue.
This is not the case for all of the collections, though. ConcurrentDictionary<TKey, TValue>, for instance, gives you an enumerator that maintains updates to the underlying collection between calls to MoveNext.
From the MSDN documentation on ConcurrentDictionary<TKey, TValue>:
The enumerator returned from the dictionary is safe to use concurrently with reads and writes to the dictionary, however it does not represent a moment-in-time snapshot of the dictionary. The contents exposed through the enumerator may contain modifications made to the dictionary after GetEnumerator was called.
If you don't have 4.0, then I think the others are right and there is no such collection provided by .NET. You can always build your own, however, by doing the same thing ConcurrentQueue<T> does (iterate over a snapshot).
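The snapshot technique is simply: copy first, then iterate the copy, so later mutations of the original can no longer invalidate anything:

```csharp
using System;
using System.Collections.Generic;

var myList = new List<int> { 1, 2 };

int[] snapshot = myList.ToArray(); // copy first...
myList.Add(3);                     // ...then mutating the original is harmless

foreach (int item in snapshot)
    Console.WriteLine(item);       // prints 1 then 2; the snapshot is fixed
```

In a truly multithreaded scenario the copy itself must of course be made under whatever synchronization guards the list.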
According to this MSDN article on IEnumerator the invalidation behaviour you have found is required by all implementations of IEnumerable.
An enumerator remains valid as long as the collection remains unchanged. If changes are made to the collection, such as adding, modifying, or deleting elements, the enumerator is irrecoverably invalidated and the next call to MoveNext or Reset throws an InvalidOperationException. If the collection is modified between MoveNext and Current, Current returns the element that it is set to, even if the enumerator is already invalidated.
Supporting this behavior requires some pretty complex internal handling, so most of the collections don't support this (I'm not sure about the Concurrent namespace).
However, you can very well simulate this behavior using immutable collections. They don't allow you to modify the collection by design, but you can work with them in a slightly different way and this kind of processing allows you to use enumerator concurrently without complex handling (implemented in Concurrent collections).
You can implement a collection like that easily, or you can use FSharpList<T> from FSharp.Core.dll (not a standard part of .NET 4.0 though):
using Microsoft.FSharp.Collections;

// Create an immutable list from another collection
var list = ListModule.OfSeq(anyCollection);
// Now we can use `GetEnumerator`
var en = list.GetEnumerator();
// To modify the collection, you create a new collection that adds
// an element to the front (without actually copying everything)
var added = FSharpList<int>.Cons(42, list);
The benefit of immutable collections is that you can work with them (by creating copies) without affecting the original one, and so the behavior you wanted is "for free". For more information, there is a great series by Eric Lippert.
The only way to do this is to make a copy of the list before you iterate it:
var myIter = new List<int>(myList).GetEnumerator();
No, they do not exist. All C# standard collections invalidate the enumerator when the structure changes.
Use a for loop instead of a foreach, and then you can modify it. I wouldn't advise it though....
