Is this use of Parallel.ForEach() thread safe? - c#

Essentially, I am working with this:
var data = input.AsParallel();
List<String> output = new List<String>();
Parallel.ForEach<String>(data, line => {
    String outputLine = "";
    // ** Do something with "line" and store result in "outputLine" **
    // Additionally, there are some this.Invoke statements for updating UI
    output.Add(outputLine);
});
Input is a List<String> object. The ForEach() statement does some processing on each value, updates the UI, and adds the result to the output List. Is there anything inherently wrong with this?
Notes:
Output order is unimportant
Update:
Based on feedback I've gotten, I've added a manual lock to the output.Add statement, as well as to the UI updating code.

Yes; List<T> is not thread safe, so adding to it ad-hoc from arbitrary threads (quite possibly at the same time) is doomed. You should use a thread-safe list instead, or add locking manually. Or maybe there is a Parallel.ToList.
Also, if it matters: insertion order will not be guaranteed.
This version is safe, though:
var output = new string[input.Count];
Parallel.ForEach<String>(data, (line, state, index) =>
{
    String outputLine = index.ToString();
    // ** Do something with "line" and store result in "outputLine" **
    // Additionally, there are some this.Invoke statements for updating UI
    output[index] = outputLine;
});
Here we use the index to write each result to a different array slot per parallel call, so no two calls ever touch the same element.

Is there anything inherently wrong with this?
Yes, everything. None of this is safe. Lists are not safe for updating on multiple threads concurrently, and you can't update the UI from any thread other than the UI thread.
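As a sketch of the UI half of the fix, assuming a WinForms form (the this.Invoke calls in the question suggest one; progressLabel and Process are hypothetical stand-ins):

Parallel.ForEach(input, line =>
{
    string outputLine = Process(line); // hypothetical per-line work

    // Marshal the UI update onto the UI thread instead of touching
    // the control directly from a worker thread.
    this.BeginInvoke((Action)(() =>
    {
        progressLabel.Text = outputLine;
    }));
});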

The documentation says the following about the thread safety of List<T>:
Public static (Shared in Visual Basic) members of this type are thread safe. Any instance members are not guaranteed to be thread safe.
A List(Of T) can support multiple readers concurrently, as long as the collection is not modified. Enumerating through a collection is intrinsically not a thread-safe procedure. In the rare case where an enumeration contends with one or more write accesses, the only way to ensure thread safety is to lock the collection during the entire enumeration. To allow the collection to be accessed by multiple threads for reading and writing, you must implement your own synchronization.
Thus, output.Add(outputLine) is not thread-safe and you need to ensure thread safety yourself, for example, by wrapping the add operation in a lock statement.
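For example, a minimal sketch of the locked version (Process is a hypothetical stand-in for the per-line work):

var output = new List<string>();
var gate = new object(); // dedicated lock object

Parallel.ForEach(input, line =>
{
    string outputLine = Process(line);

    lock (gate) // serialize all access to the shared list
    {
        output.Add(outputLine);
    }
});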

When you want the results of a parallel operation, PLINQ is more convenient than the Parallel class. You started well by converting your input to a ParallelQuery<T>:
ParallelQuery<string> data = input.AsParallel();
...but then you fed the data to Parallel.ForEach, which treats it as a plain IEnumerable<T>. So the AsParallel() was wasted: it provided no parallelization, only overhead. Here is the correct way to use PLINQ:
List<string> output = input
    .AsParallel()
    .Select(line =>
    {
        string outputLine = "";
        // ** Do something with "line" and store result in "outputLine" **
        return outputLine;
    })
    .ToList();
A few differences that you should have in mind:
The Parallel class runs the code on the ThreadPool by default, but this is configurable. PLINQ uses the ThreadPool exclusively.
Parallel has unbounded parallelism by default (it uses all the available threads of the ThreadPool). PLINQ uses at most Environment.ProcessorCount threads by default.
Regarding the order of the results, PLINQ doesn't preserve it by default. If you want to preserve the order, you can attach the AsOrdered operator, as sketched below.
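For example (a sketch; Process is again a hypothetical stand-in for the per-line work):

List<string> output = input
    .AsParallel()
    .AsOrdered()                   // results come out in input order
    .WithDegreeOfParallelism(4)    // optional: cap the number of threads
    .Select(line => Process(line))
    .ToList();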

Related

How to store results from tasks running in threadpool?

I have a problem with thread pool efficiency. I'm not sure I understand the whole concept. I did a lot of reading before asking this question, and I know that a thread pool is a good solution if you have a lot of small, relatively quick functions AND, more importantly, non-blocking tasks. Using lock in a thread pool is very bad.
And here is my question: how do you return values from thread pool functions? If you have functions to run, they probably produce some results, right? It's good to store those results somewhere. Where?
I'm running ca. 200k very quick functions in a thread pool. The results I store in a List. Of course I have to do:
lock (lockobj)
{
    myList.Add(result);
}
So, is this the right way? I mean, if your functions return SOMETHING, you have to store the results in some kind of collection, and it has to be a blocking collection. So I started thinking... "Blocking is very bad in a thread pool, but you have to do it at least once, at the end of every function."
How to store/return results from functions running in threadpool?
Thanks!
JB
EDIT: By "function" I mean...
ThreadPool.QueueUserWorkItem(state =>
{
    Result r = function(); // previously named "Task"
    lock (lockobj)
    {
        allResults.Add(r);
    }
});
If you don't want to block ThreadPool threads, use a lock-free approach. ConcurrentQueue is currently lock-free (as of .NET 4.6.2) when you enqueue items.
So simply do this:
public static ConcurrentQueue<Result> AllResults { get; } = new ConcurrentQueue<Result>();
ThreadPool.QueueUserWorkItem(state =>
{
    Result r = function();
    AllResults.Enqueue(r);
});
This will guarantee you don't block ThreadPool threads.
Any kind of collection that is thread safe/synchronized will do. There are plenty in the .NET Framework.
You can also use volatile variables to share data between multiple threads, but this is usually considered bad practice.
Another approach is to schedule those operations as tasks that produce results. Tasks run on the thread pool by default, and you can get the return values by awaiting them or by checking the Result of the returned Task.
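A sketch of that task-based approach, assuming the same function() as in the question and an async calling context:

var tasks = new List<Task<Result>>();
for (int i = 0; i < 200_000; i++)
{
    tasks.Add(Task.Run(() => function())); // each runs on the thread pool
}

// WhenAll gathers all the results; no shared collection or lock is needed.
Result[] allResults = await Task.WhenAll(tasks);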
Finally, you can write your own code to synchronize access to certain regions of code or variables, using primitives such as lock, semaphores, and mutexes.

Parallel.ForEach returning inconsistent result

I have a method that reads a text file containing one int value per line. To make reading faster, I used Parallel.ForEach, but the behaviour I'm seeing is unexpected: I have 800 lines in the file, but every time I run this method it returns a different count in the HashSet. From what I've read, Parallel.ForEach spawns multiple threads and returns the result when all threads have completed their work, but my code's execution contradicts that. Or am I missing something important here?
Here is my method:
private HashSet<int> GetKeyItemsProcessed()
{
    HashSet<int> keyItems = new HashSet<int>();
    if (!File.Exists(TrackingFilePath))
        return keyItems;

    // normal foreach works fine
    //foreach (var keyItem in File.ReadAllLines(TrackingFilePath))
    //{
    //    keyItems.Add(int.Parse(keyItem));
    //}

    // this does not return the right number of HashSet rows
    Parallel.ForEach(File.ReadAllLines(TrackingFilePath).AsParallel(), keyItem =>
    {
        keyItems.Add(int.Parse(keyItem));
    });
    return keyItems;
}
HashSet.Add is NOT thread safe.
From MSDN:
Any public static (Shared in Visual Basic) members of this type are
thread safe. Any instance members are not guaranteed to be thread
safe.
The unpredictability of multithreaded timing could be, and appears to be, causing the issue.
You could wrap the access in a synchronization construct, which is sometimes faster than a concurrent collection but may not speed anything up in some cases. As others have mentioned, another option is to use a thread-safe collection like ConcurrentDictionary or ConcurrentQueue, though those may have additional memory overhead.
Be sure to benchmark any results you get with regards to timing. The raw power of singlethreaded access can sometimes be faster than dealing with the overhead of threading. It may not be worth it at all to thread this code.
The final word, though, is that HashSet alone, without synchronization, is simply unacceptable for multi-threaded operations.
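For illustration, a minimal locked variant of the method above (whether the parallelism is worth keeping at all is a separate question, given how cheap int.Parse is):

private HashSet<int> GetKeyItemsProcessed()
{
    HashSet<int> keyItems = new HashSet<int>();
    if (!File.Exists(TrackingFilePath))
        return keyItems;

    object gate = new object();
    Parallel.ForEach(File.ReadAllLines(TrackingFilePath), keyItem =>
    {
        int value = int.Parse(keyItem); // parse outside the lock
        lock (gate)                     // HashSet<T>.Add is not thread safe
        {
            keyItems.Add(value);
        }
    });
    return keyItems;
}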

List thread safe?

Can the following be considered thread safe, given the apparently atomic operations in the code?
My main concern is that if the list needs to be resized, it becomes non-thread-safe during the resize.
List<int> list = new List<int>(10);

public List<int> GetList()
{
    var temp = list;
    list = new List<int>(10);
    return temp;
}

void TimerElapsed(int number)
{
    list.Add(number);
}
No. List<T> is explicitly documented not to be thread-safe:
It is safe to perform multiple read operations on a List, but issues can occur if the collection is modified while it’s being read. To ensure thread safety, lock the collection during a read or write operation. To enable a collection to be accessed by multiple threads for reading and writing, you must implement your own synchronization. For collections with built-in synchronization, see the classes in the System.Collections.Concurrent namespace. For an inherently thread–safe alternative, see the ImmutableList class.
Neither your code nor the List<T> are thread-safe.
The list isn't thread-safe according to its documentation. Your code is not thread safe because it lacks synchronization.
Consider two threads calling GetList concurrently. Say the first thread gets pre-empted right after assigning temp. Now the second thread assigns its own temp, replaces the list, and runs GetList to completion. When the first thread resumes, it returns the same list that the second thread has just returned.
But that's not all! If a third thread calls TimerElapsed after the second thread has completed but before the first thread has, it will add a value to a list that is about to be replaced without a trace. So not only would multiple threads return the same data, but some of your data would also disappear.
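One way to make both operations safe is a shared lock covering both the swap and the add; a minimal sketch:

private readonly object gate = new object();
private List<int> list = new List<int>(10);

public List<int> GetList()
{
    lock (gate) // the swap-and-return is now atomic
    {
        var temp = list;
        list = new List<int>(10);
        return temp;
    }
}

void TimerElapsed(int number)
{
    lock (gate) // adds can no longer interleave with the swap
    {
        list.Add(number);
    }
}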
No, it is not thread safe.
Try using the collections in the System.Collections.Concurrent namespace.
As already mentioned, a List<T> is not thread safe. You can look at alternatives in the Concurrent namespace, possibly using ConcurrentBag, or there is an article by Dean Chalk, Fast Parallel ConcurrentList<T> Implementation.
It is not thread safe, since a context switch can occur after the first line of the GetList method and transfer control to the TimerElapsed method, producing inconsistent results in different scenarios. Also, as other users have already mentioned, the List class is not thread safe and you should use the System.Collections.Concurrent equivalent.
It is thread safe for reading only, not for writing.

What is the correct usage of ConcurrentBag?

I've already read previous questions here about ConcurrentBag but did not find an actual sample of implementation in multi-threading.
"ConcurrentBag is a thread-safe bag implementation, optimized for scenarios where the same thread will be both producing and consuming data stored in the bag."
This is the current usage in my code (simplified, not the actual code):
private void MyMethod()
{
    List<Product> products = GetAllProducts(); // Get list of products
    ConcurrentBag<Product> myBag = new ConcurrentBag<Product>();

    // products are simply added to the ConcurrentBag here to simplify the code;
    // the actual code processes each product before adding it to the bag
    Parallel.ForEach(
        products,
        new ParallelOptions { MaxDegreeOfParallelism = Environment.ProcessorCount },
        product => myBag.Add(product));

    ProcessBag(myBag); // method to process each item in the ConcurrentBag
}
My questions:
Is this the right usage of ConcurrentBag? Is it ok to use ConcurrentBag in this kind of scenario?
For me, I think a simple List<Product> and a manual lock would do better. The reason is that the scenario above already breaks the "same thread will be both producing and consuming data stored in the bag" rule.
I also found out that the ThreadLocal storage created for each thread in the parallel loop still exists after the operation (even if the thread is reused; is this right?), which may cause an undesired memory leak.
Am I right on this one, guys? Or is simply clearing or emptying the ConcurrentBag enough?
This looks like an ok use of ConcurrentBag. The thread local variables are members of the bag, and will become eligible for garbage collection at the same time the bag is (clearing the contents won't release them). You are right that a simple List with a lock would suffice for your case. If the work you are doing in the loop is at all significant, the type of thread synchronization won't matter much to the overall performance. In that case, you might be more comfortable using what you are familiar with.
Another option would be to use ParallelEnumerable.Select, which matches what you are trying to do more closely. Again, any performance difference you are going to see is likely going to be negligible and there's nothing wrong with sticking with what you know.
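A sketch of that alternative (Process and ProcessResults are hypothetical stand-ins for the real per-item work and the downstream step):

List<Product> processed = GetAllProducts()
    .AsParallel()
    .WithDegreeOfParallelism(Environment.ProcessorCount)
    .Select(product => Process(product)) // per-item work runs in parallel
    .ToList();

ProcessResults(processed); // hypothetical downstream processing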
As always, if the performance of this is critical there's no substitute for trying it and measuring.
It seems to me that bmm6o's answer is not correct. The ConcurrentBag instance internally contains mini-bags for each thread that adds items to it, so item insertion does not involve any thread locks, and thus all Environment.ProcessorCount threads may get into full swing without being stuck waiting and without any thread context switches. Thread synchronization may be required when iterating over the collected items, but in the original example the iteration is done by a single thread after all insertions are done. Moreover, if the ConcurrentBag uses Interlocked techniques as the first layer of thread synchronization, then it is possible to avoid Monitor operations entirely.
On the other hand, wrapping each Add() call on an ordinary List<T> in a lock statement will hurt performance a lot: first, because of the constant Monitor.Enter() and Monitor.Exit() calls, which under contention can require dropping into kernel mode and working with Windows synchronization primitives; second, because occasionally one thread will be blocked by another thread that has not yet finished its addition.
As for me, the code above is a really good example of the right usage of ConcurrentBag class.
Is this the right usage of ConcurrentBag? Is it ok to use ConcurrentBag in this kind of scenario?
No, for multiple reasons:
This is not the intended usage scenario for this collection. The ConcurrentBag<T> is intended for mixed producer-consumer scenarios, meaning that each thread is expected to add and take items from the bag. Your scenario is nothing like this. You have many threads that add items, and zero threads that take items. The main application for the ConcurrentBag<T> is for making object-pools (pools of reusable objects that are expensive to create or destroy). And given the availability of the ObjectPool<T> class in the Microsoft.Extensions.ObjectPool package, even this niche application for this collection is contested.
It doesn't preserve the insertion order. Even if insertion order is not important, getting shuffled output makes debugging more difficult.
It creates garbage that has to be collected by the GC. It creates one WorkStealingQueue (an internal class) per thread, each containing an expandable array, so the more threads you have, the more objects you allocate. Also, each time it is enumerated it copies all the items into an array, and its GetEnumerator() returns an IEnumerator<T> that is allocated on every foreach.
There are better options available, offering both better performance and better ordering behavior.
In your scenario you can store the results of the parallel execution in a simple array. Just create an array with length equal to products.Count, switch from Parallel.ForEach to Parallel.For, and assign the result directly to the corresponding slot of the results array without doing any synchronization at all:
List<Product> products = GetAllProducts(); // Get list of products
Product[] results = new Product[products.Count];

Parallel.For(0, products.Count,
    new ParallelOptions { MaxDegreeOfParallelism = Environment.ProcessorCount },
    i => results[i] = products[i]);

ProcessResults(results);
This way you'll get the results with perfect ordering, stored in a container that has the most compact size and the fastest enumeration of all .NET collections, doing only a single object allocation.
In case you are concerned about the thread-safety of the above operation, there is nothing to worry about. Each thread writes on different slots in the results array. After the completion of the parallel execution the current thread has full visibility of all the values that are stored in the array, because the TPL includes the appropriate barriers when tasks are queued, and at the beginning/end of task execution (citation).
(I have posted more thoughts about the ConcurrentBag<T> in this answer.)
If a List<T> is used with a lock around the Add() method, it will make threads wait and will reduce the performance gain of using Parallel.ForEach().

Under what conditions can TryDequeue and similar System.Collections.Concurrent collection methods fail

I have recently noticed that on the collection types in the System.Collections.Concurrent namespace it is common to see Collection.TrySomeAction() rather than Collection.SomeAction().
What is the cause of this? I assume it has something to do with locking?
So I am wondering: under what conditions could an attempt to (for example) dequeue an item from a stack, queue, bag, etc. fail?
Collections in System.Collections.Concurrent namespace are considered to be thread-safe, so it is possible to use them to write multi-threaded programs that share data between threads.
Before .NET 4, you had to provide your own synchronization mechanisms if multiple threads might be accessing a single shared collection. You had to lock the collection each time you modified its elements, and you might also need to lock it each time you accessed (or enumerated) it. That's for the simplest of multi-threaded scenarios. Some applications would create background threads that delivered results to a shared collection over time, while another thread read and processed those results. You needed to implement your own message-passing scheme between threads to notify each other when new results were available and when those results had been consumed. The classes and interfaces in System.Collections.Concurrent provide a consistent implementation for those and other common multi-threaded programming problems involving shared data across threads, in a lock-free way.
Try<Something> has different semantics: try to perform the action and return whether it succeeded. The plain DoThat form usually signals failure by throwing an exception, which can be inefficient. As examples of when these methods return false (see the sketch below):
if you try to add a new element, it might already exist in the ConcurrentDictionary;
if you try to get an element from a collection, it might not exist there;
if you try to update an element, a more recent value may already be there, so the method ensures it only updates the element it intended to update.
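A small illustration of those three cases with ConcurrentDictionary<TKey, TValue>, checking boolean results instead of catching exceptions:

var dict = new ConcurrentDictionary<string, int>();

bool added = dict.TryAdd("answer", 42);                 // false if the key already exists
bool found = dict.TryGetValue("answer", out int value); // false if the key is absent
bool updated = dict.TryUpdate("answer", 43, 42);        // only succeeds while the value is still 42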
Try to read:
Patterns for Parallel Programming: Understanding and Applying Parallel Patterns with the .NET Framework 4 - best to start;
Parallel Programming in the .NET Framework;
read articles on Concurrency
Thread-safe Collections in .NET Framework 4 and Their Performance Characteristics
What do you mean by fail?
Take the following example:
var queue = new Queue<string>();
string temp = queue.Dequeue();
// do something with temp
The above code will throw an exception, since we try to dequeue from an empty queue. Now, if you use a ConcurrentQueue<T> instead:
var queue = new ConcurrentQueue<string>();
string temp;
if (queue.TryDequeue(out temp))
{
    // do something with temp
}
The above code will not throw an exception. The queue will still fail to dequeue an item, but the code will not fail by throwing an exception. The real use for this becomes apparent in a multithreaded environment. Code for the non-concurrent Queue<T> would typically look something like this:
lock (queueLock)
{
    if (queue.Count > 0)
    {
        string temp = queue.Dequeue();
        // do something with temp
    }
}
In order to avoid race conditions, we need a lock to ensure that nothing happens to the queue in the time between checking Count and calling Dequeue. With ConcurrentQueue<T>, we don't need to check Count at all; we can just call TryDequeue.
If you examine the types found in the System.Collections.Concurrent namespace, you will find that many of them wrap two operations that are typically called sequentially and that would traditionally require locking (Count followed by Dequeue in ConcurrentQueue<T>; GetOrAdd in ConcurrentDictionary<TKey, TValue> replaces the sequence of calling ContainsKey, adding an item, and getting it; and so on).
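For example, GetOrAdd collapses the check-then-add sequence into a single atomic call (a sketch; the cache and the LoadProduct factory are hypothetical):

var cache = new ConcurrentDictionary<string, Product>();

// Returns the existing value if the key is present, otherwise adds
// (and returns) the value produced by the factory delegate.
Product p = cache.GetOrAdd("sku-123", key => LoadProduct(key));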
If there is nothing to be "dequeued", for example... This "Try" pattern is used commonly all across the FCL and BCL. It has nothing to do with locking; concurrent collections are (or at least should be) mostly implemented without locks...
