Concurrent collection supporting removal of a specified item? - c#

Quite simple: other than ConcurrentDictionary (which I'll use if I have to, but it's not really the correct concept), is there any concurrent collection (IProducerConsumerCollection<T> implementation) that supports removal of specific items, based either on simple equality of an item or on a predicate defining a condition for removal?
Explanation: I have a multi-threaded, multi-stage workflow algorithm, which pulls objects from the DB and sticks them in a "starting" queue. From there they are grabbed by the next stage, worked on further, and stuffed into other queues. This process continues through a few more stages. Meanwhile, the first stage is invoked again by its supervisor and pulls objects out of the DB, and those can include objects still in process (they haven't finished being processed, so they haven't been re-persisted with the flag set saying they're done).
The solution I am designing is a master "in work" collection; objects go into it when they are retrieved for processing by the first stage, and are removed after they have been re-saved to the DB as "processed" by whatever stage of the workflow completed the necessary processing. While an object is in that collection, it will be ignored if it is re-retrieved by the first stage.
I had planned to use a ConcurrentBag, but its only removal method (TryTake) removes an arbitrary item from the bag, not a specified one (and ConcurrentBag is slow in .NET 4). ConcurrentQueue and ConcurrentStack also do not allow removal of any item other than the next one they'll give you, leaving ConcurrentDictionary, which would work but is more than I need: all I really need to store is the Id of each record being processed, and the Ids don't change during the workflow.

The reason there is no such data structure is that all list-like collections have O(n) lookup operations. IndexOf, Remove(element), etc. all enumerate over every element, checking each for equality.
Only hash tables have O(1) lookup. In a concurrent scenario, an O(n) lookup would hold a lock on the collection for a long time, and other threads would be unable to add elements during that time.
In a dictionary, only the bucket hit by the hash is locked, so other threads can keep adding while one thread checks the elements in that bucket for equality.
My advice is to go ahead and use ConcurrentDictionary.
By the way, you are right that ConcurrentDictionary is a bit oversized for your solution. What you really need is to quickly check whether an object is in work or not. A HashSet would be perfect for that: it does essentially nothing but Add(element), Contains(element), and Remove(element). There is a ConcurrentHashSet implementation in Java; for C# I found this: How to implement ConcurrentHashSet in .Net (I don't know how good it is).
As a first step I would still write a wrapper with a HashSet-like interface around ConcurrentDictionary, get it up and running, and then try different implementations and compare their performance.
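For illustration, a minimal sketch of such a wrapper; the class name and the use of byte as a throwaway value type are my own choices, not from any library:

using System.Collections.Concurrent;

public class ConcurrentHashSet<T>
{
    // ConcurrentDictionary does the real work; the byte values are never read.
    private readonly ConcurrentDictionary<T, byte> dict = new ConcurrentDictionary<T, byte>();

    public bool Add(T item)      { return dict.TryAdd(item, 0); }
    public bool Contains(T item) { return dict.ContainsKey(item); }
    public bool Remove(T item)   { byte ignored; return dict.TryRemove(item, out ignored); }
}

For the workflow above, the first stage would call Add(record.Id) when it picks up a record, skip any record whose Add returns false, and the completing stage would call Remove(record.Id) after re-saving.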

As already explained in other posts, it's not possible to remove items from a Queue or ConcurrentQueue by default, but actually the easiest way to get around that is to extend or wrap the item.
public class QueueItem
{
    public Boolean IsRemoved { get; private set; }
    public void Remove() { IsRemoved = true; }
}
And when dequeuing:
QueueItem item = _Queue.Dequeue(); // or TryDequeue if you use a ConcurrentQueue
if (!item.IsRemoved)
{
    // Do work here
}

It's really hard to make a collection thread-safe in the generic sense. So many factors that go into thread-safety lie outside the responsibility or purview of a library/framework class that no class can be truly "thread-safe" on its own... One of the drawbacks, as you've pointed out, is performance. It's impossible to write a collection that is both maximally performant and thread-safe, because it has to assume the worst...
The generally recommended practice is to use whatever collection you want and access it in a thread-safe way. This is essentially why there aren't more thread-safe collections in the framework. More on this can be found at http://blogs.msdn.com/b/bclteam/archive/2005/03/15/396399.aspx#9534371

Related

Using the ConcurrentBag type as a thread-safe substitute for a List

In general, would using the ConcurrentBag type be an acceptable thread-safe substitute for a List? I have read some answers on here that suggested the use of ConcurrentBag when one was having thread-safety problems with generic Lists in C#.
After reading a bit about ConcurrentBag, however, it seems that performing a lot of searches and looping through the collection does not match its intended usage. It seems mostly intended to solve producer/consumer problems, where jobs are being (somewhat randomly) added to and removed from the collection.
This is an example of the type of (IEnumerable) operations I want to use with the ConcurrentBag:
...
private readonly ConcurrentBag<Person> people = new ConcurrentBag<Person>();

public void AddPerson(Person person)
{
    people.Add(person);
}

public Person GetPersonWithName(string name)
{
    return people.Where(x => name.Equals(x.Name)).FirstOrDefault();
}
...
Would this cause performance concerns, and is it even a correct way to use a ConcurrentBag collection?
.NET's built-in concurrent data structures are mostly designed for patterns like producer-consumer, where there is a constant flow of work through the container.
In your case, the list seems to be long-term (relative to the lifetime of the class) storage, rather than just a resting place for some data before a consumer comes along to take it away and do something with it. In this case I'd suggest using a normal List<T> (or whichever non-concurrent collection is most appropriate for the operations you're intending), and simply using locks to control access to it.
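A minimal sketch of that approach, reusing the Person type from the question (the class and field names are mine):

using System.Collections.Generic;
using System.Linq;

public class PersonDirectory
{
    private readonly object gate = new object();
    private readonly List<Person> people = new List<Person>();

    public void AddPerson(Person person)
    {
        lock (gate) { people.Add(person); }
    }

    public Person GetPersonWithName(string name)
    {
        lock (gate)
        {
            // Linear scan; the lock is held for the whole search.
            return people.FirstOrDefault(x => name.Equals(x.Name));
        }
    }
}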
A Bag is just the most general form of collection, allowing multiple identical entries, and without even the ordering of a List. It does happen to be useful in producer/consumer contexts where fairness is not an issue, but it is not specifically designed for that.
Because a Bag does not have any structure with respect to its contents, it's not very suitable for performing searches. In particular, the use case you mention will require time that scales with the size of the bag. A HashSet might be better if you don't need to be able to store multiple copies of an item and if manual synchronization is acceptable for your use case.
As far as I understand ConcurrentBag, it makes use of multiple lists: it creates one internal list per thread that uses the bag. Reading or taking items from the same thread that added them should therefore perform roughly the same as a normal List, but if the ConcurrentBag is accessed from a different thread there will be a performance overhead, as it has to search for the value in the internal lists belonging to the other threads.
The MSDN page says the following regarding the ConcurrentBag.
Bags are useful for storing objects when ordering doesn't matter, and unlike sets, bags support duplicates. ConcurrentBag is a thread-safe bag implementation, optimized for scenarios where the same thread will be both producing and consuming data stored in the bag.
http://msdn.microsoft.com/en-us/library/dd381779%28v=VS.100%29.aspx
In general, would using the ConcurrentBag type be an acceptable thread-safe substitute for a List?
No, not in general, because, concurring with Warren Dew, a List is ordered, while a Bag is not (surely mine isn't ;)
But in cases where (potentially concurrent) reads greatly outnumber writes, you could just wrap your List for copy-on-write.
That is a general solution, as you keep working with the original List instances, except (as explained in more detail in the link above) you have to make sure that everyone modifying the List uses the appropriate copy-on-write utility method, which you could enforce by handing out List.AsReadOnly().
In highly concurrent programs, copy-on-write has many desirable performance properties in mostly-read scenarios, compared to locking.
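A minimal copy-on-write sketch, assuming reads vastly outnumber writes (the class and member names are mine):

using System.Collections.Generic;
using System.Collections.ObjectModel;

public class CopyOnWriteList<T>
{
    private readonly object writeGate = new object();
    private volatile List<T> current = new List<T>();

    // Readers get a snapshot; no locking is needed on the read path.
    public ReadOnlyCollection<T> Snapshot
    {
        get { return current.AsReadOnly(); }
    }

    public void Add(T item)
    {
        lock (writeGate) // serialize writers
        {
            var copy = new List<T>(current);
            copy.Add(item);
            current = copy; // publish via a single atomic reference swap
        }
    }
}

Readers that obtained a snapshot before an Add keep iterating over the old list, which is never mutated again; that is what makes the read path lock-free.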

Prevent collection modifications while reading sequentially

I'm working with large collections of objects and sequential reads of them.
I found most questions along these lines refer to multi-threading, but I am more concerned with errors within the thread itself due to misuse of a distributable library.
A system within the library manages a potentially large collection of objects, at one point it performs a sequential read of this collection performing an operation on each element.
Depending on the element implementation, which can be extended outside the library, an object may attempt to remove itself from the collection.
I would like that to be an option, but if this happens while the collection is being sequentially read it can lead to errors. I would like to be able to lock the contents of the collection while it's being read, and put any removal request on a schedule to be executed after the sequential read has finished.
The removal request has to go through the system since objects do not have public access to the collection, I could just go with an isReading flag but I wonder if there is a more elegant construct.
Does C# or .NET provide a tool to do this, perhaps to lock the list contents so I can intercept removal requests during sequential reads? Or would I have to implement that behavior from scratch for this scenario?
You may want to look into using the SynchronizedCollection<T> class in .NET 2.0+.
Alternatively, have a look at the answer to this question: What is the difference between SynchronizedCollection<T> and the other concurrent collections?
You can use the following trick:

List<T> collection;
for (int index = collection.Count - 1; index >= 0; --index)
{
    var item = collection[index];
    if (/* item must be deleted */)
    {
        collection.RemoveAt(index); // faster than Remove(item)
        // or: collection.Remove(item);
    }
}

This code will not crash when the collection is modified, and it will still process every item: iterating backwards by index means a removal never shifts the positions of the items that haven't been visited yet.

IEnumerable<T> thread safety?

I have a main thread that populates a List<T>. Further I create a chain of objects that will execute on different threads, requiring access to the List. The original list will never be written to after it's generated. My thought was to pass the list as IEnumerable<T> to the objects executing on other threads, mainly for the reason of not allowing those implementing those objects to write to the list by mistake. In other words if the original list is guaranteed not be written to, is it safe for multiple threads to use .Where or foreach on the IEnumerable?
I am not sure if the iterator in itself is thread safe if the original collection is never changed.
IEnumerable<T> can't be modified, so what could be non-thread-safe about it (provided you don't modify the actual List<T>)? For a thread-safety problem you need both writing and reading operations.
The "iterator in itself" is instantiated anew for each foreach.
Edit: I simplified my answer a bit, but @Eric Lippert added a valuable comment. IEnumerable<T> doesn't define modifying methods, but that doesn't mean its access operations (GetEnumerator, MoveNext, etc.) are thread-safe. Simplest example: a GetEnumerator implemented so that it:
returns the same instance of IEnumerator every time
resets that instance's position
A more sophisticated example is caching.
This is an interesting point, but fortunately I don't know of any standard class with a non-thread-safe implementation of IEnumerable.
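A contrived sketch of that hazardous pattern (hypothetical code, not any BCL type):

using System.Collections;
using System.Collections.Generic;

class SharedEnumeratorEnumerable<T> : IEnumerable<T>
{
    private readonly IEnumerator<T> shared; // one cached enumerator for everyone

    public SharedEnumeratorEnumerable(List<T> items)
    {
        shared = items.GetEnumerator();
    }

    public IEnumerator<T> GetEnumerator()
    {
        shared.Reset(); // two threads calling this stomp on each other's position
        return shared;
    }

    IEnumerator IEnumerable.GetEnumerator() { return GetEnumerator(); }
}

Two concurrent foreach loops over one instance would share a single cursor, so each would see a scrambled subset of the items; List<T> itself does nothing like this, which is the point above.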
Each thread that calls Where or foreach gets its own enumerator - they don't share one enumerator object for the same list. So since the List isn't being modified, and since each thread is working with its own copy of an enumerator, there should be no thread safety issues.
You can see this at work in one thread: just create a List of 10 objects, and get two enumerators from that List. Use one enumerator to enumerate through 5 items, and use the other to enumerate through 5 items. You will see that both enumerators enumerated through only the first 5 items, and that the second one did not start where the first left off.
As long as you are certain that the List will never be modified then it will be safe to read from multiple threads. This includes the use of the IEnumerator instances it provides.
This is going to be true for most collections. In fact, all collections in the BCL should be stable during enumeration. In other words, the enumerator will not modify the data structure. I can think of some obscure cases, like a splay tree, where enumerating it might modify the structure. Again, none of the BCL collections do that.
If you are certain that the list will not be modified after creation, you should guarantee that by converting it to a ReadOnlyCollection<T>. Of course, if you keep the original list that the read-only collection wraps, you can still modify it; but if you toss the original list away, you've effectively made it permanently read-only.
From the Thread Safety section of the collection's documentation:
A ReadOnlyCollection can support multiple readers concurrently, as long as the collection is not modified.
So if you don't touch the original list again and stop referencing it, you can ensure that multiple threads can read it without worry (so long as you don't do anything wacky with trying to modify it again).
In other words if the original list is guaranteed not be written to, is it safe for multiple threads to use .Where or foreach on the IEnumerable?
Yes; it's only a problem if the list gets mutated.
But note that an IEnumerable<T> can be cast back to a List<T> and then modified.
But there is another alternative: wrap your list into a ReadOnlyCollection<T> and pass that around. If you now throw away the original list you basically created a new immutable list.
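A minimal sketch of that publishing pattern (the method and values are mine):

using System.Collections.Generic;

static class ListPublisher
{
    // Build on one thread, then hand out only the read-only view.
    public static IEnumerable<int> BuildAndPublish()
    {
        var list = new List<int> { 1, 2, 3 }; // populate however you like
        // AsReadOnly is a cheap wrapper, not a copy; once the writable
        // reference is discarded, the data is effectively immutable.
        return list.AsReadOnly();
    }
}

Callers cannot cast the returned value back to List<int>, which closes the loophole mentioned two paragraphs up.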
If you are using .NET Framework 4.5 or greater, this could be a great solution:
http://msdn.microsoft.com/en-us/library/dd997305(v=vs.110).aspx
(Microsoft already implemented a thread-safe enumerable)

Clarification of Read and Write on a C# Dictionary

In the context of this statement,

A Dictionary can support multiple readers concurrently, as long as the collection is not modified. Even so, enumerating through a collection is intrinsically not a thread-safe procedure. In the rare case where an enumeration contends with write accesses, the collection must be locked during the entire enumeration. To allow the collection to be accessed by multiple threads for reading and writing, you must implement your own synchronization.
What do read and write mean? My understanding is that a read is an operation which looks up a key and provides a reference to its value, and a write is an operation which adds or removes a key/value pair from the dictionary. However, I can't find anything conclusive regarding this.
So the big question is: while implementing a thread-safe dictionary, would an operation that updates the value for an existing key be considered a reader or a writer? I plan to have multiple threads accessing unique keys in a dictionary and modifying their values, but the threads will not add/remove keys.
The obvious implication, assuming modifying an existing value is not a write operation on the dictionary, is that my implementation of the thread-safe dictionary can be a lot more efficient, as I would not need to take an exclusive lock every time I update the value of an existing key.
Usage of ConcurrentDictionary from .Net 4.0 is not an option.
A major point not yet mentioned is that if TValue is a class type, the things held by a Dictionary<TKey,TValue> will be the identities of TValue objects. If one receives a reference from the dictionary, the dictionary will neither know nor care about anything one may do with the object referred to thereby.
One useful little utility class, for cases where all of the keys in the dictionary will be known before the code that uses it runs, is:
class MutableValueHolder<T>
{
    public T Value;
}
If one wants to have multi-threaded code count how many times various strings appear in a bunch of files, and one knows in advance all the strings of interest, one may then use something like a Dictionary<string, MutableValueHolder<int>> for that purpose. Once the dictionary is loaded with all the proper strings and a MutableValueHolder<int> instance for each one, then any number of threads may retrieve references to MutableValueHolder<int> objects, and use Threading.Interlocked.Increment or other such methods to modify the Value associated with each one, without having to write to the Dictionary at all.
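A sketch of that counting scenario under those assumptions, reusing the MutableValueHolder<T> class above (the method and class names are mine):

using System.Collections.Generic;
using System.Threading;

static class WordCounter
{
    // Assumes 'counts' was fully populated on one thread before any workers
    // start; after that, the dictionary itself is only ever read.
    public static void CountWord(Dictionary<string, MutableValueHolder<int>> counts, string word)
    {
        MutableValueHolder<int> holder;
        if (counts.TryGetValue(word, out holder))
            Interlocked.Increment(ref holder.Value); // atomic; no dictionary write
    }
}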
Overwriting an existing value should be treated as a write operation. Anything that can affect the results of another read should be considered a write.
Changing a key is most definitely a write, since it will cause the item to move to a different bucket in the dictionary's internal hash table.
What you might want to do is look at ReaderWriterLock
http://msdn.microsoft.com/en-us/library/system.threading.readerwriterlock.aspx
Updating a value is conceptually a write operation. With concurrent access, if a read is performed before a write has completed, you read out a stale value; and when two writes conflict, the wrong value may end up stored.
Adding a new value could trigger a grow of the underlying storage. In this case new memory is allocated, all elements are copied into the new memory, the new element is added, the dictionary object is updated to refer to the new memory location for storage and the old memory is released and available for garbage collection. During this time, more writes could cause a big problem. Two writes at the same time could trigger two instances of this memory copying. If you follow through the logic, you'll see an element will get lost since only the last thread to update the reference will know about existing items and not the other items that were trying to be added.
ICollection provides a member to synchronize access and the reference remains valid across grow/shrink operations.
A read operation is anything that gets a key or value from a Dictionary, a write operation is anything that updates or adds a key or a value. So a process updating a key would be considered to be a writer.
A simple way to make a thread safe dictionary is to create your own implementation of IDictionary that simply locks a mutex and then forwards the call to an implementation:
public class MyThreadSafeDictionary<TKey, TValue> : IDictionary<TKey, TValue>
{
    private readonly object mutex = new object();
    private readonly IDictionary<TKey, TValue> impl;

    public MyThreadSafeDictionary(IDictionary<TKey, TValue> impl)
    {
        this.impl = impl;
    }

    public void Add(TKey key, TValue value)
    {
        lock (mutex)
        {
            impl.Add(key, value);
        }
    }

    // implement the other methods as for Add
}
You could replace the mutex with a reader-writer lock if you are having some threads only read the dictionary.
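A sketch of that reader-writer variant, with only two of the interface methods shown (the class name is mine):

using System.Collections.Generic;
using System.Threading;

public class ReadWriteDictionary<TKey, TValue>
{
    private readonly ReaderWriterLockSlim rwLock = new ReaderWriterLockSlim();
    private readonly IDictionary<TKey, TValue> impl;

    public ReadWriteDictionary(IDictionary<TKey, TValue> impl)
    {
        this.impl = impl;
    }

    public bool TryGetValue(TKey key, out TValue value)
    {
        rwLock.EnterReadLock(); // many readers may hold this simultaneously
        try { return impl.TryGetValue(key, out value); }
        finally { rwLock.ExitReadLock(); }
    }

    public void Add(TKey key, TValue value)
    {
        rwLock.EnterWriteLock(); // writers get exclusive access
        try { impl.Add(key, value); }
        finally { rwLock.ExitWriteLock(); }
    }
}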
Also note that Dictionary objects don't support changing keys; the only safe way to achieve what you want is to remove the existing key/value pair and add a new one with the updated key.
Modifying a value is a write and introduces a race condition.
Let's say the original value of mydict[5] = 42.
One thread updates mydict[5] to be 112.
Another thread updates mydict[5] to be 837.
What should the value of mydict[5] be at the end? The order of the threads matters in this case, meaning you either need to make the ordering explicit or ensure the threads don't write concurrently.
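If the threads must write, the usual fix is to serialize the whole read-modify-write; a minimal sketch (names mine):

using System.Collections.Generic;

static class DictUpdater
{
    private static readonly object gate = new object();

    public static void AddDelta(Dictionary<int, int> dict, int key, int delta)
    {
        lock (gate)
        {
            // The read and the write now form one indivisible step, so
            // concurrent updates cannot interleave between them.
            dict[key] = dict[key] + delta;
        }
    }
}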

List with non-null elements ends up containing null. A synchronization issue?

First of all, sorry about the title -- I couldn't figure out one that was short and clear enough.
Here's the issue: I have a List<MyClass> list to which I always add newly-created instances of MyClass, like this: list.Add(new MyClass()). I don't add elements any other way.
However, then I iterate over the list with foreach and find that there are some null entries. That is, the following code:
foreach (MyClass entry in list)
    if (entry == null)
        throw new Exception("null entry!");
will sometimes throw an exception.
I should point out that the list.Add(new MyClass()) are performed from different threads running concurrently. The only thing I can think of to account for the null entries is the concurrent accesses. List<> isn't thread-safe, after all. Though I still find it strange that it ends up containing null entries, instead of just not offering any guarantees on ordering.
Can you think of any other reason?
Also, I don't care in which order the items are added, and I don't want the calling threads to block waiting to add their items. If synchronization is truly the issue, can you recommend a simple way to call the Add method asynchronously, i.e., create a delegate that takes care of that while my thread keeps running its code? I know I can create a delegate for Add and call BeginInvoke on it. Does that seem appropriate?
Thanks.
EDIT: A simple solution based on Kevin's suggestion:
public class AsynchronousList<T> : List<T>
{
    private AddDelegate addDelegate;

    public delegate void AddDelegate(T item);

    public AsynchronousList()
    {
        addDelegate = new AddDelegate(this.AddBlocking);
    }

    public void AddAsynchronous(T item)
    {
        addDelegate.BeginInvoke(item, null, null);
    }

    private void AddBlocking(T item)
    {
        lock (this)
        {
            Add(item);
        }
    }
}
I only need to control Add operations and I just need this for debugging (it won't be in the final product), so I just wanted a quick fix.
Thanks everyone for your answers.
List<T> can only support multiple readers concurrently. If you are going to use multiple threads to add to the list, you'll need to lock the object first. There is really no way around this, because without a lock someone can be reading from the list while another thread updates it (or multiple threads can try to update it concurrently).
http://msdn.microsoft.com/en-us/library/6sh2ey19.aspx
Your best bet is probably to encapsulate the list in another object, and have that object handle the locking and unlocking of the internal list. That way you could make your new object's Add method asynchronous and let the calling objects go on their merry way. Any time you read from it, though, you'll most likely still have to wait for other objects to finish their updates.
The only thing I can think of to account for the null entries is the concurrent accesses. List<> isn't thread-safe, after all.
That's basically it. We are specifically told it's not thread-safe, so we shouldn't be surprised that concurrent access results in contract-breaking behaviour.
As to why this specific problem occurs, we can but speculate, since List<>'s private implementation is, well, private (I know we have Reflector and Shared Source - but in principle it is private). Suppose the implementation involves an array and a 'last populated index'. Suppose also that 'Add an item' looks like this:
1. Ensure the array is big enough for another item
2. last populated index <- last populated index + 1
3. array[last populated index] = incoming item
Now suppose there are two threads calling Add. If the interleaved sequence of operations ends up like this:
Thread A : last populated index <- last populated index + 1
Thread B : last populated index <- last populated index + 1
Thread A : array[last populated index] = incoming item
Thread B : array[last populated index] = incoming item
then not only will there be a null in the array, but also the item that thread A was trying to add won't be in the array at all!
Now, I don't know for sure how List<> does its stuff internally. I have half a memory that it is backed by an array, like ArrayList, which internally uses this scheme; but in fact it doesn't matter. I suspect that any list mechanism that expects to be run non-concurrently can be made to break with concurrent access and a sufficiently 'unlucky' interleaving of operations. If we want thread-safety from an API that doesn't provide it, we have to do some work ourselves; or at least, we shouldn't be surprised if the API sometimes breaks its contract when we don't.
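A quick way to see this for yourself is a sketch like the following; the counts vary from run to run, and the unsynchronized Add may even throw instead of completing:

using System;
using System.Collections.Generic;
using System.Threading.Tasks;

class Program
{
    static void Main()
    {
        var list = new List<object>();
        // Hammer Add from many threads with no synchronization (deliberately wrong).
        Parallel.For(0, 100000, i => list.Add(new object()));

        int nulls = 0;
        foreach (object item in list)
            if (item == null) nulls++;

        // Typically prints a non-zero null count, and Count is often < 100000.
        Console.WriteLine("nulls: {0}, count: {1}", nulls, list.Count);
    }
}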
For your requirement of
I don't want the calling threads to block waiting to add their item
my first thought is a Multiple-Producer-Single-Consumer queue, wherein the threads wanting to add items are the producers, which dispatch items to the queue async, and there is a single consumer which takes items off the queue and adds them to the list with appropriate locking. My second thought is that this feels as if it would be heavier than this situation warrants, so I'll let it mull for a bit.
If you're using .NET Framework 4, you might check out the new Concurrent Collections. When it comes to threading, it's better not to try to be clever, as it's extremely easy to get it wrong. Synchronization can impact performance, but the effects of getting threading wrong can also result in strange, infrequent errors that are a royal pain to track down.
If you're still using Framework 2 or 3.5 for this project, I recommend simply wrapping your calls to the list in a lock statement. If you're concerned about performance of Add (are you performing some long-running operation using the list somewhere else?) then you can always make a copy of the list within a lock and use that copy for your long-running operation outside the lock. Simply blocking on the Adds themselves shouldn't be a performance issue, unless you have a very large number of threads. If that's the case, you can try the Multiple-Producer-Single-Consumer queue that AakashM recommended.
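For the queue variant, a minimal sketch with BlockingCollection<T> from Framework 4 (the class name is mine):

using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

public class AsyncListWriter<T>
{
    private readonly BlockingCollection<T> queue = new BlockingCollection<T>();
    private readonly List<T> list = new List<T>();

    public AsyncListWriter()
    {
        // The single consumer is the only thread that ever touches 'list',
        // so the list itself needs no lock for writes. Reading 'list' from
        // other threads would still require synchronization.
        Task.Factory.StartNew(() =>
        {
            foreach (T item in queue.GetConsumingEnumerable())
                list.Add(item);
        });
    }

    // Producers return immediately; they only enqueue.
    public void Add(T item) { queue.Add(item); }

    // Call when done producing so the consumer loop can exit.
    public void CompleteAdding() { queue.CompleteAdding(); }
}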
