Prevent collection modifications while reading sequentially - C#

I'm working with large collections of objects and sequential reads of them.
I found most questions along these lines refer to multi-threading, but I am more concerned with errors within the thread itself due to misuse of a distributable library.
A system within the library manages a potentially large collection of objects; at one point it performs a sequential read of this collection, performing an operation on each element.
Depending on the element implementation, which can be extended outside the library, an object may attempt to remove itself from the collection.
I would like that to be an option, but if it happens while the collection is being read sequentially it can lead to errors. I would like to be able to lock the contents of the collection while it's being read and put any removal request on a schedule to be executed after the sequential read has finished.
The removal request has to go through the system since objects do not have public access to the collection, I could just go with an isReading flag but I wonder if there is a more elegant construct.
Does C# or .NET provide a tool to do this? Perhaps something to lock the list contents so I can intercept removal requests during sequential reads? Or would I have to implement that behavior from scratch for this scenario?

You may want to look into using the SynchronizedCollection<T> class, available since .NET 3.0.
Alternatively, have a look at the answer to this question: What is the difference between SynchronizedCollection<T> and the other concurrent collections?
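For illustration, a rough sketch of how SynchronizedCollection<T> could be applied here (in .NET Framework it ships in System.ServiceModel.dll; each individual operation locks internally, but a whole enumeration is not atomic, so the read loop should hold the collection's SyncRoot lock itself):

using System.Collections.Generic;

var items = new SynchronizedCollection<int> { 1, 2, 3 };
items.Add(4); // individual operations are synchronized for you

// Enumeration is not a single atomic operation, so hold the same
// lock for the duration of the sequential read:
lock (items.SyncRoot)
{
    foreach (var item in items)
    {
        // operate on item
    }
}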

You can use the following trick:

List<T> collection = ...;

// Iterate backwards so that removing the current element does not
// shift the indices of the elements still to be visited.
for (int index = collection.Count - 1; index >= 0; --index)
{
    var item = collection[index];
    if (ShouldBeDeleted(item)) // your removal condition goes here
    {
        collection.RemoveAt(index); // faster than Remove(item), which has to search the list first
    }
}

This code will not fail with a "collection was modified" error, and it will process every item in the collection.
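If you do want elements to be able to request their own removal while the system is iterating, the isReading flag mentioned in the question is workable; here is a minimal single-threaded sketch (ItemSystem and IElement are made-up names for illustration):

using System.Collections.Generic;

public interface IElement
{
    void Update(ItemSystem owner);
}

public class ItemSystem
{
    private readonly List<IElement> elements = new List<IElement>();
    private readonly List<IElement> pendingRemovals = new List<IElement>();
    private bool isReading;

    public void UpdateAll()
    {
        isReading = true;
        foreach (var element in elements)
            element.Update(this); // may call RequestRemoval re-entrantly

        isReading = false;

        // Apply all removals scheduled during the read.
        foreach (var element in pendingRemovals)
            elements.Remove(element);
        pendingRemovals.Clear();
    }

    public void RequestRemoval(IElement element)
    {
        if (isReading)
            pendingRemovals.Add(element); // defer until the read finishes
        else
            elements.Remove(element);
    }
}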

Related

Using the ConcurrentBag type as a thread-safe substitute for a List

In general, would using the ConcurrentBag type be an acceptable thread-safe substitute for a List? I have read some answers on here that have suggested the use of ConcurrentBag when one was having a problem with thread safety with generic Lists in C#.
After reading a bit about ConcurrentBag, however, it seems performing a lot of searches and looping through the collection does not match with its intended usage. It seems to mostly be intended to solve producer/consumer problems, where jobs are being (some-what randomly) added and removed from the collection.
This is an example of the type of (IEnumerable) operations I want to use with the ConcurrentBag:
...
private readonly ConcurrentBag<Person> people = new ConcurrentBag<Person>();

public void AddPerson(Person person)
{
    people.Add(person);
}

public Person GetPersonWithName(string name)
{
    return people.Where(x => name.Equals(x.Name)).FirstOrDefault();
}
...
Would this cause performance concerns, and is it even a correct way to use a ConcurrentBag collection?
.NET's built-in concurrent data structures are mostly designed for patterns like producer-consumer, where there is a constant flow of work through the container.
In your case, the list seems to be long-term (relative to the lifetime of the class) storage, rather than just a resting place for some data before a consumer comes along to take it away and do something with it. In this case I'd suggest using a normal List<T> (or whichever non-concurrent collection is most appropriate for the operations you're intending), and simply using locks to control access to it.
A Bag is just the most general form of collection, allowing multiple identical entries, and without even the ordering of a List. It does happen to be useful in producer/consumer contexts where fairness is not an issue, but it is not specifically designed for that.
Because a Bag does not have any structure with respect to its contents, it's not very suitable for performing searches. In particular, the use case you mention will require time that scales with the size of the bag. A HashSet might be better if you don't need to be able to store multiple copies of an item and if manual synchronization is acceptable for your use case.
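To make the lock-based suggestion above concrete, here is a minimal sketch reusing the Person example from the question, with a plain List<T> guarded by a lock (the wrapper class name is made up):

using System.Collections.Generic;
using System.Linq;

public class PersonDirectory
{
    private readonly object gate = new object();
    private readonly List<Person> people = new List<Person>();

    public void AddPerson(Person person)
    {
        lock (gate) { people.Add(person); }
    }

    public Person GetPersonWithName(string name)
    {
        // Hold the lock for the whole search so no writer can
        // mutate the list mid-enumeration.
        lock (gate)
        {
            return people.FirstOrDefault(x => name.Equals(x.Name));
        }
    }
}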
As far as I understand the ConcurrentBag it makes use of multiple lists. It creates a List for each thread using the ConcurrentBag. Thus when reading or accessing the ConcurrentBag within the same thread again the performance should be roughly the same as when just using a normal List, but if the ConcurrentBag is accessed from a different thread there will be a performance overhead as it has to search for the value in the "internal" lists created for each thread.
The MSDN page says the following regarding ConcurrentBag:
Bags are useful for storing objects when ordering doesn't matter, and unlike sets, bags support duplicates. ConcurrentBag is a thread-safe bag implementation, optimized for scenarios where the same thread will be both producing and consuming data stored in the bag.
http://msdn.microsoft.com/en-us/library/dd381779%28v=VS.100%29.aspx
In general, would using the ConcurrentBag type be an acceptable thread-safe substitute for a List?
No, not in general, because, concurring with Warren Dew, a List is ordered, while a Bag is not (surely mine isn't ;)
But in cases where (potentially concurrent) reads greatly outnumber writes, you could just copy-on-write wrap your List.
That is a general solution, as you are working with original List instances, except (as explained in more detail in the link above) you have to make sure that everyone modifying the List uses the appropriate copy-on-write utility method - which you could enforce by using List.AsReadOnly().
In highly concurrent programs, copy-on-write has many desirable performance properties in mostly-read scenarios, compared to locking.
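A rough sketch of that copy-on-write idea (class name made up; writers serialize against each other, readers never lock):

using System.Collections.Generic;
using System.Linq;

public class CopyOnWritePeople
{
    private readonly object writeGate = new object();
    private volatile List<Person> people = new List<Person>();

    public void AddPerson(Person person)
    {
        lock (writeGate)
        {
            // Copy, mutate the copy, then publish it atomically;
            // readers holding the old reference are unaffected.
            var copy = new List<Person>(people) { person };
            people = copy;
        }
    }

    public Person GetPersonWithName(string name)
    {
        List<Person> snapshot = people; // stable snapshot, no lock needed
        return snapshot.FirstOrDefault(x => name.Equals(x.Name));
    }
}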

Concurrent collection supporting removal of a specified item?

Quite simple: Other than ConcurrentDictionary (which I'll use if I have to but it's not really the correct concept), is there any Concurrent collection (IProducerConsumer implementation) that supports removal of specific items based on simple equality of an item or a predicate defining a condition for removal?
Explanation: I have a multi-threaded, multi-stage workflow algorithm, which pulls objects from the DB and sticks them in a "starting" queue. From there they are grabbed by the next stage, further worked on, and stuffed into other queues. This process continues through a few more stages. Meanwhile, the first stage is invoked again by its supervisor and pulls objects out of the DB, and those can include objects still in process (because they haven't finished being processed and so haven't been re-persisted with the flag set saying they're done).
The solution I am designing is a master "in work" collection; objects go in that queue when they are retrieved for processing by the first stage, and are removed after they have been re-saved to the DB as "processed" by whatever stage of the workflow completed the necessary processing. While the object is in that list, it will be ignored if it is re-retrieved by the first stage.
I had planned to use a ConcurrentBag, but the only removal method (TryTake) removes an arbitrary item from the bag, not a specified one (and ConcurrentBag is slow in .NET 4). ConcurrentQueue and ConcurrentStack also do not allow removal of an item other than the next one it'll give you, leaving ConcurrentDictionary, which would work but is more than I need (all I really need is to store the Id of the records being processed; they don't change during the workflow).
The reason there is no such data structure is that these collections' lookup operations - IndexOf, Remove(element), and so on - all take O(n) time: they enumerate through all elements, checking each for equality.

Only hash tables have O(1) lookup time. In a concurrent scenario, an O(n) lookup would mean a very long lock on the collection; other threads would not be able to add elements during that time.

In a dictionary, only the bucket hit by the hash is locked, so other threads can keep adding while one thread checks the elements in that bucket for equality.

My advice is to go on and use ConcurrentDictionary.

By the way, you are right that ConcurrentDictionary is a bit oversized for your solution. What you really need is a quick check of whether an object is in work or not. A HashSet would be perfect for that: it does basically nothing more than Add(element), Contains(element), and Remove(element). There is a ConcurrentHashSet implementation in Java; for C# I found this: How to implement ConcurrentHashSet in .Net. I don't know how good it is.

As a first step I would still write a wrapper with a HashSet-like interface around ConcurrentDictionary, bring it up and running, and then try different implementations and compare the performance.
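Such a wrapper can be very small; a minimal sketch (the class name is made up, and the byte values are throwaway placeholders):

using System.Collections.Concurrent;

public sealed class ConcurrentSet<T>
{
    private readonly ConcurrentDictionary<T, byte> inner =
        new ConcurrentDictionary<T, byte>();

    public bool Add(T item)
    {
        return inner.TryAdd(item, 0); // false if already present
    }

    public bool Contains(T item)
    {
        return inner.ContainsKey(item);
    }

    public bool Remove(T item)
    {
        byte ignored;
        return inner.TryRemove(item, out ignored);
    }
}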
As already explained in other posts, it's not possible to remove specific items from a Queue or ConcurrentQueue by default, but actually the easiest way to get around this is to extend or wrap the item.
public class QueueItem
{
    public Boolean IsRemoved { get; private set; }

    public void Remove()
    {
        IsRemoved = true; // mark as logically removed; the queue itself is untouched
    }
}

And when dequeuing:

QueueItem item = _Queue.Dequeue(); // or TryDequeue if you use a ConcurrentQueue
if (!item.IsRemoved)
{
    // Do work here
}
It's really hard to make a collection thread-safe in the generic sense. There are so many factors that go into thread-safety that are outside the responsibility or purview of a library/framework class that affect the ability for it to be truly "thread-safe"... One of the drawbacks as you've pointed out is the performance. It's impossible to write a performant collection that is also thread-safe because it has to assume the worst...
The generally recommended practice is to use whatever collection you want and access it in a thread-safe way. This is basically why there aren't more thread-safe collections in the framework. More on this can be found at http://blogs.msdn.com/b/bclteam/archive/2005/03/15/396399.aspx#9534371

IEnumerable<T> thread safety?

I have a main thread that populates a List<T>. Further I create a chain of objects that will execute on different threads, requiring access to the List. The original list will never be written to after it's generated. My thought was to pass the list as IEnumerable<T> to the objects executing on other threads, mainly for the reason of not allowing those implementing those objects to write to the list by mistake. In other words if the original list is guaranteed not be written to, is it safe for multiple threads to use .Where or foreach on the IEnumerable?
I am not sure if the iterator in itself is thread safe if the original collection is never changed.
IEnumerable<T> itself can't be modified, so what about it could be non-thread-safe, as long as you don't modify the actual List<T>?

For a thread-safety problem you need both writing and reading operations.

The iterator itself is instantiated anew for each foreach.

Edit: I simplified my answer a bit, but @Eric Lippert added a valuable comment. IEnumerable<T> doesn't define modifying methods, but that doesn't mean its access operations (GetEnumerator, MoveNext, etc.) are thread safe. The simplest example is a GetEnumerator implemented like this:

It returns the same instance of IEnumerator every time
It resets that enumerator's position

A more sophisticated example is caching.
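A sketch of such a pathological (but legal) implementation, purely for illustration:

using System.Collections;
using System.Collections.Generic;

public class BadCollection : IEnumerable<int>
{
    private readonly List<int> items = new List<int>();
    private IEnumerator<int> shared; // one enumerator handed to every caller

    public IEnumerator<int> GetEnumerator()
    {
        if (shared == null)
            shared = items.GetEnumerator();
        shared.Reset(); // rewinds the position for anyone mid-iteration!
        return shared;
    }

    IEnumerator IEnumerable.GetEnumerator()
    {
        return GetEnumerator();
    }
}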
This is an interesting point, but fortunately I don't know of any standard class with a non-thread-safe IEnumerable implementation like that.
Each thread that calls Where or foreach gets its own enumerator - they don't share one enumerator object for the same list. So since the List isn't being modified, and since each thread is working with its own copy of an enumerator, there should be no thread safety issues.
You can see this at work in a single thread: just create a List of 10 objects, and get two enumerators from that List. Use one enumerator to enumerate through 5 items, and use the other to enumerate through 5 items. You will see that both enumerators enumerated through only the first 5 items, and that the second one did not start where the first enumerator left off.
As long as you are certain that the List will never be modified then it will be safe to read from multiple threads. This includes the use of the IEnumerator instances it provides.
This is going to be true for most collections. In fact, all collections in the BCL should be stable during enumeration. In other words, the enumerator will not modify the data structure. I can think of some obscure cases, like a splay tree, where enumerating it might modify the structure. Again, none of the BCL collections do that.
If you are certain that the list will not be modified after creation, you should guarantee that by converting it to a ReadOnlyCollection<T>. Of course if you keep the original list that the read-only collection wraps you can still modify it, but if you toss the original list away you're effectively making it permanently read-only.
From the Thread Safety section of the documentation:
A ReadOnlyCollection can support multiple readers concurrently, as long as the collection is not modified.
So if you don't touch the original list again and stop referencing it, you can ensure that multiple threads can read it without worry (so long as you don't do anything wacky with trying to modify it again).
In other words if the original list is guaranteed not be written to, is it safe for multiple threads to use .Where or foreach on the IEnumerable?
Yes, it's only a problem if the list gets mutated.
But note that an IEnumerable<T> can be cast back to a List and then modified.
But there is another alternative: wrap your list into a ReadOnlyCollection<T> and pass that around. If you now throw away the original list you basically created a new immutable list.
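A minimal sketch of that wrap-and-discard pattern:

using System.Collections.Generic;
using System.Collections.ObjectModel;

List<int> list = new List<int> { 1, 2, 3 };
ReadOnlyCollection<int> view = list.AsReadOnly();

// Hand only 'view' to the other threads and drop 'list';
// with no live reference to the underlying list, nothing can mutate it.
list = null;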
If you are using .NET Framework 4.5 or greater, this could be a great solution:
http://msdn.microsoft.com/en-us/library/dd997305(v=vs.110).aspx
(Microsoft has already implemented a thread-safe enumerable.)

Under what conditions can TryDequeue and similar System.Collections.Concurrent collection methods fail

I have recently noticed that on the collection types in the System.Collections.Concurrent namespace it is common to see Collection.TrySomeAction() rather than Collection.SomeAction().
What is the cause of this? I assume it has something to do with locking?
So I am wondering: under what conditions could an attempt to (for example) Dequeue an item from a stack, queue, bag, etc. fail?
Collections in the System.Collections.Concurrent namespace are considered to be thread-safe, so it is possible to use them to write multi-threaded programs that share data between threads.
Before .NET 4, you had to provide your own synchronization mechanisms if multiple threads might be accessing a single shared collection. You had to lock the collection each time you modified its elements. You also might need to lock the collection each time you accessed (or enumerated) it. That's for the simplest of multi-threaded scenarios. Some applications would create background threads that delivered results to a shared collection over time. Another thread would read and process those results. You needed to implement your own message-passing scheme between threads to notify each other when new results were available, and when those new results had been consumed. The classes and interfaces in System.Collections.Concurrent provide a consistent implementation for those and other common multi-threaded programming problems involving shared data across threads, in a lock-free way.
Try<Something> has "try to perform the action and report whether it succeeded" semantics. The plain DoThat form usually relies on throwing exceptions to indicate errors, which can be inefficient. As examples of when these methods return false:

if you try to add a new element, the ConcurrentDictionary might already contain it;
if you try to get an element from the collection, it might not exist there;
if you try to update an element, a more recent value may already be in place, so the method ensures it only updates the value it intended to update.
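To illustrate all three cases with real ConcurrentDictionary calls:

using System.Collections.Concurrent;

var dict = new ConcurrentDictionary<string, int>();

bool added = dict.TryAdd("a", 1);               // false if "a" was already present

int value;
bool found = dict.TryGetValue("a", out value);  // false if "a" does not exist

bool updated = dict.TryUpdate("a", 2, 1);       // false unless the current value is still 1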
Try reading:
Patterns for Parallel Programming: Understanding and Applying Parallel Patterns with the .NET Framework 4 - the best place to start;
Parallel Programming in the .NET Framework;
and articles on concurrency, such as
Thread-safe Collections in .NET Framework 4 and Their Performance Characteristics
What do you mean by fail?
Take the following example:
var queue = new Queue<string>();
string temp = queue.Dequeue();
// do something with temp
The above code will throw an exception, since we try to dequeue from an empty queue. Now, if you use a ConcurrentQueue<T> instead:
var queue = new ConcurrentQueue<string>();
string temp;
if (queue.TryDequeue(out temp))
{
    // do something with temp
}
The above code will not throw an exception. The queue will still fail to dequeue an item, but the code will not fail in the way of throwing an exception. The real use for this becomes apparent in a multithreaded environment. Code for the non-concurrent Queue<T> would typically look something like this:
lock (queueLock)
{
    if (queue.Count > 0)
    {
        string temp = queue.Dequeue();
        // do something with temp
    }
}
In order to avoid race conditions, we need to use a lock to ensure that nothing happens to the queue in the time that passes between checking Count and calling Dequeue. With ConcurrentQueue<T>, we don't really need to check Count, but can instead call TryDequeue.
If you examine the types found in the Systems.Collections.Concurrent namespace you will find that many of them wrap two operations that are typically called sequentially, and that would traditionally require locking (Count followed by Dequeue in ConcurrentQueue<T>, GetOrAdd in ConcurrentDictionary<TKey, TValue> replaces sequences of calling ContainsKey, adding an item and getting it, and so on).
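For example, with ConcurrentDictionary a single atomic call replaces the racy check-then-add sequence (ComputeValue is a made-up placeholder):

using System.Collections.Concurrent;

var cache = new ConcurrentDictionary<string, int>();

// Atomically returns the existing value, or runs the factory and adds its result.
// (Under contention the factory may run more than once, but only one result wins.)
int value = cache.GetOrAdd("key", k => ComputeValue(k));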
If there is nothing to be "dequeued", for example... This "Try pattern" is used commonly all across the FCL and BCL. It has nothing to do with locking; concurrent collections are (or at least should be) mostly implemented without locks...

List with non-null elements ends up containing null. A synchronization issue?

First of all, sorry about the title -- I couldn't figure out one that was short and clear enough.
Here's the issue: I have a list List<MyClass> list to which I always add newly-created instances of MyClass, like this: list.Add(new MyClass()). I don't add elements any other way.
However, then I iterate over the list with foreach and find that there are some null entries. That is, the following code:
foreach (MyClass entry in list)
if (entry == null)
throw new Exception("null entry!");
will sometimes throw an exception.
I should point out that the list.Add(new MyClass()) calls are performed from different threads running concurrently. The only thing I can think of to account for the null entries is the concurrent accesses. List<> isn't thread-safe, after all. Though I still find it strange that it ends up containing null entries, instead of just not offering any guarantees on ordering.
Can you think of any other reason?
Also, I don't care in which order the items are added, and I don't want the calling threads to block waiting to add their items. If synchronization is truly the issue, can you recommend a simple way to call the Add method asynchronously, i.e., create a delegate that takes care of that while my thread keeps running its code? I know I can create a delegate for Add and call BeginInvoke on it. Does that seem appropriate?
Thanks.
EDIT: A simple solution based on Kevin's suggestion:
public class AsynchronousList<T> : List<T>
{
    private AddDelegate addDelegate;

    public delegate void AddDelegate(T item);

    public AsynchronousList()
    {
        addDelegate = new AddDelegate(this.AddBlocking);
    }

    public void AddAsynchronous(T item)
    {
        addDelegate.BeginInvoke(item, null, null);
    }

    private void AddBlocking(T item)
    {
        lock (this) // note: locking on 'this' is generally discouraged; a private lock object is safer
        {
            Add(item);
        }
    }
}
I only need to control Add operations and I just need this for debugging (it won't be in the final product), so I just wanted a quick fix.
Thanks everyone for your answers.
List<T> can only support multiple readers concurrently. If you are going to use multiple threads to add to the list, you'll need to lock the object first. There is really no way around this, because without a lock you can still have someone reading from the list while another thread updates it (or multiple objects trying to update it concurrently also).
http://msdn.microsoft.com/en-us/library/6sh2ey19.aspx
Your best bet probably is to encapsulate the list in another object, and have that object handle the locking and unlocking actions on the internal list. That way you could make your new object's "Add" method asynchronous and let the calling objects go on their merry way. Any time you read from it, though, you'll most likely still have to wait on any other objects finishing their updates.
The only thing I can think of to account for the null entries is the concurrent accesses. List<> isn't thread-safe, after all.
That's basically it. We are specifically told it's not thread-safe, so we shouldn't be surprised that concurrent access results in contract-breaking behaviour.
As to why this specific problem occurs, we can but speculate, since List<>'s private implementation is, well, private (I know we have Reflector and Shared Source - but in principle it is private). Suppose the implementation involves an array and a 'last populated index'. Suppose also that 'Add an item' looks like this:
Ensure the array is big enough for another item
last populated index <- last populated index + 1
array[last populated index] = incoming item
Now suppose there are two threads calling Add. If the interleaved sequence of operations ends up like this:
Thread A : last populated index <- last populated index + 1
Thread B : last populated index <- last populated index + 1
Thread A : array[last populated index] = incoming item
Thread B : array[last populated index] = incoming item
then not only will there be a null in the array, but also the item that thread A was trying to add won't be in the array at all!
Now, I don't know for sure how List<> does its stuff internally. I have half a memory that it is backed by an array, like ArrayList, which internally uses this scheme; but in fact it doesn't matter. I suspect that any list mechanism that expects to be run non-concurrently can be made to break with concurrent access and a sufficiently 'unlucky' interleaving of operations. If we want thread-safety from an API that doesn't provide it, we have to do some work ourselves - or at least, we shouldn't be surprised if the API sometimes breaks its contract when we don't.
For your requirement of
I don't want the calling threads to block waiting to add their item
my first thought is a Multiple-Producer-Single-Consumer queue, wherein the threads wanting to add items are the producers, which dispatch items to the queue async, and there is a single consumer which takes items off the queue and adds them to the list with appropriate locking. My second thought is that this feels as if it would be heavier than this situation warrants, so I'll let it mull for a bit.
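If you are on .NET 4, a rough sketch of that queue idea with BlockingCollection<T> (names are made up; MyClass is from the question):

using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading;

var pending = new BlockingCollection<MyClass>();
var list = new List<MyClass>();

// Producers just enqueue and keep going; Add on an unbounded
// BlockingCollection does not block.
//   pending.Add(new MyClass());

// A single consumer thread owns the list, so Add needs no lock.
var consumer = new Thread(() =>
{
    foreach (MyClass item in pending.GetConsumingEnumerable())
        list.Add(item);
});
consumer.Start();

// When all producers are done:
//   pending.CompleteAdding();
//   consumer.Join();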
If you're using .NET Framework 4, you might check out the new Concurrent Collections. When it comes to threading, it's better not to try to be clever, as it's extremely easy to get it wrong. Synchronization can impact performance, but the effects of getting threading wrong can also result in strange, infrequent errors that are a royal pain to track down.
If you're still using Framework 2 or 3.5 for this project, I recommend simply wrapping your calls to the list in a lock statement. If you're concerned about performance of Add (are you performing some long-running operation using the list somewhere else?) then you can always make a copy of the list within a lock and use that copy for your long-running operation outside the lock. Simply blocking on the Adds themselves shouldn't be a performance issue, unless you have a very large number of threads. If that's the case, you can try the Multiple-Producer-Single-Consumer queue that AakashM recommended.
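A minimal sketch of that lock-and-copy approach (the wrapper name is made up):

using System;
using System.Collections.Generic;

public class SharedList
{
    private readonly object gate = new object();
    private readonly List<MyClass> items = new List<MyClass>();

    public void Add(MyClass item)
    {
        lock (gate) { items.Add(item); } // the lock is held only briefly
    }

    public void ProcessAll(Action<MyClass> action)
    {
        List<MyClass> copy;
        lock (gate) { copy = new List<MyClass>(items); } // copy under the lock

        foreach (var item in copy)
            action(item); // long-running work happens outside the lock
    }
}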
