Control order of execution of Parallel.ForEach tasks - C#

I have a list of table names (student, exam, school).
I use a Parallel.ForEach loop to iterate over the table names and do processing for each table, with MaxDegreeOfParallelism = 8.
My problem is that my Parallel.ForEach doesn't always engage in work stealing. For example, when two tables are left to process, they may be processed one after another instead of in parallel. I'm trying to improve performance and increase throughput.
I tried to do this by creating a custom TaskScheduler. However, my implementation needs a sorted list of tasks, with the easiest tasks ordered first so that they aren't held up by longer-running tables. I can't achieve this by sorting the list passed to Parallel.ForEach (a List<string>), because the tasks are enqueued by the TaskScheduler out of order. Therefore, I need a way to sort a list of tasks inside my CustomTaskScheduler, which is based on https://psycodedeveloper.wordpress.com/2013/06/28/a-custom-taskscheduler-in-c/
How can I control the order in which tasks are passed by the Parallel.ForEach to the TaskScheduler to be enqueued?

The Parallel.ForEach method employs two different partitioning strategies depending on the type of the source. If the source is an array or a List, it is partitioned statically (upfront). If the source is an honest-to-goodness¹ IEnumerable, it is partitioned dynamically (on the go). Dynamic partitioning has the desirable work-stealing behavior, but comes with more overhead. In your case the overhead is not important, because each work item (an entire table) is long-running.
To ensure that the partitioning is dynamic, the easiest way is to wrap your source with the Partitioner.Create method:
// Partitioner lives in the System.Collections.Concurrent namespace
string[] tableNames;

Parallel.ForEach(Partitioner.Create(tableNames), tableName =>
{
    // Process table
});
¹ (The expression is borrowed from a comment in the source code)

I would recommend looking up partitioners. Managing threads in a Parallel loop has some overhead, so there is built-in logic that tries to keep this overhead small while still balancing the work across all cores properly. This is done by dividing the list into chunks and adjusting the chunk size to hit some sweet spot.
I would guess that ordering the tasks smallest-first will work against the partitioner's balancing. I would try ordering the work largest-first if balancing is the goal. Another thing I would try is to partition the work items with some constant chunk size and see if that helps (see the sketch below). Or perhaps even write your own partitioner.
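Here is a minimal sketch of the constant chunk-size idea, using the built-in range partitioner (ProcessTable is a hypothetical stand-in for your per-table work):

using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

var tableNames = new List<string> { "student", "exam", "school" };

// rangeSize: 1 means each grabbed chunk is a single table, so no thread
// is handed a large upfront batch that the others cannot steal from.
Parallel.ForEach(
    Partitioner.Create(0, tableNames.Count, rangeSize: 1),
    range =>
    {
        for (int i = range.Item1; i < range.Item2; i++)
            ProcessTable(tableNames[i]); // hypothetical per-table work
    });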
I'm not sure it is a great idea to try to enforce some execution order. Since you do not control the OS scheduler, there cannot be any guaranteed ordering. And even if you can make it more ordered, it would probably be at the cost of throughput.
Also, if you are spending lots of time optimizing the parallelization, are you sure the rest of the code is optimized?

Related

What does the Parallel.Foreach do behind the scenes?

So I just can't grasp the concept here.
I have a method that uses the Parallel class with the ForEach method.
But the thing I don't understand is: does it create new threads so it can run the function faster?
Let's take this as an example.
I do a normal foreach loop.
private static void DoSimpleWork()
{
    foreach (var item in collection)
    {
        //DoWork();
    }
}
What that will do is take the first item in the list, run DoWork() on it, and wait until it finishes. Simple, plain, and it works.
Now.. There are three cases I am curious about
If I do this.
Parallel.ForEach(stringList, simpleString =>
{
    DoMagic(simpleString);
});
Will that split up the Foreach into let's say 4 chunks?
So what I think is happening is that it takes the first 4 lines in the list, assigns each string to a "thread" (assuming Parallel creates 4 virtual threads), does the work and then starts with the next 4 in that list?
If that is wrong please correct me I really want to understand how this works.
And then we have this.
Which essentially is the same but with a new parameter
Parallel.ForEach(stringList, new ParallelOptions() { MaxDegreeOfParallelism = 32 }, simpleString =>
{
    DoMagic(simpleString);
});
What I am curious about is this
new ParallelOptions() { MaxDegreeOfParallelism = 32 }
Does that mean it will take the first 32 strings from that list (if there even are that many in the list) and then do the same thing as I was talking about above?
And for the last one.
Task.Factory.StartNew(() =>
{
    Parallel.ForEach(stringList, simpleString =>
    {
        DoMagic(simpleString);
    });
});
Would that create a new task, assigning each "chunk" to its own task?
Do not mix async code with parallel code. Task is for async operations - querying a DB, reading a file, awaiting some comparatively computation-cheap operation so that your UI won't be blocked and unresponsive.
Parallel is different. It is designed for 1) multi-core systems and 2) computation-intensive operations. I won't go into details of how it works; that kind of info can be found in the MS documentation. Long story short, Parallel.For will most probably make its own decision about what exactly to run, when, and how. It may use fewer threads than your parameters (e.g. MaxDegreeOfParallelism) allow. The whole idea is to provide the best possible parallelization and thus complete your operation as fast as possible.
Parallel.ForEach performs the equivalent of a C# foreach loop, but with iterations executing in parallel instead of sequentially. There is no fixed sequencing; it depends on whether the OS can find an available thread. If there is one, the iteration will execute.
MaxDegreeOfParallelism
By default, For and ForEach will utilize however many threads the underlying scheduler provides, so changing MaxDegreeOfParallelism from the default only limits how many concurrent tasks will be used by the application.
You do not need to modify this parameter in general, but may choose to change it in advanced scenarios:

- When you know that a particular algorithm you're using won't scale beyond a certain number of cores. You can set the property to avoid wasting cycles on additional cores.
- When you're running multiple algorithms concurrently and want to manually define how much of the system each algorithm can utilize (see the sketch below).
- When the thread pool's heuristics are unable to determine the right number of threads to use and could end up injecting too many threads. For example, in long-running loop body iterations the thread pool might not be able to tell the difference between reasonable progress and a livelock or deadlock, and might not be able to reclaim threads that were added to improve performance. You can set the property to ensure that you don't use more than a reasonable number of threads.
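For the second scenario in the list above, a minimal sketch of how two concurrent loops might be capped so that they share the machine (listA, listB, ProcessA, and ProcessB are hypothetical):

// Cap each loop at half the cores so the two algorithms roughly share
// the machine instead of competing for every thread-pool thread.
var opts = new ParallelOptions
{
    MaxDegreeOfParallelism = Math.Max(1, Environment.ProcessorCount / 2)
};

var t1 = Task.Run(() => Parallel.ForEach(listA, opts, item => ProcessA(item)));
var t2 = Task.Run(() => Parallel.ForEach(listB, opts, item => ProcessB(item)));
Task.WaitAll(t1, t2);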
Task.Factory.StartNew is usually used when you require fine-grained control for a long-running, compute-bound task, and as #Сергей Боголюбов mentioned, do not mix them up.
It creates a new task, and that task runs the parallel loop, which in turn draws threads from the thread pool asynchronously.
You may find this ebook useful: http://www.albahari.com/threading/#_Introduction
does the work and then starts with the next 4 in that list?
This depends on your machine's hardware and how busy the machine's cores are with other processes/apps.
Does that mean it will take the first 32 strings from that list (if there even are that many in the list) and then do the same thing as I was talking about above?
No, there is no guarantee that it will take the first 32; it could be fewer. It will vary each time you execute the same code.
Task.Factory.StartNew creates a new task, but it will not create a new one for each chunk as you expect.
Putting a Parallel.ForEach inside a new Task will not help you further reduce the time taken for the parallel tasks themselves.

Task vs Barrier

So my problem is as follows: I have a list of items to process and I'd like to process the items in parallel then commit the processed items.
The Barrier class in C# will allow me to do this - I can run threads in parallel to process the list of items, and when SignalAndWait is called and all participants hit the barrier, I can commit the processed items.
The Task class will also allow me to do this - with the Task.WaitAll call I can wait for all tasks to complete and then commit the processed items. If I understand correctly, each task will run on its own thread rather than a bunch of tasks sharing the same thread.
Is my understanding correct on both usages for the problem?
Is there any advantage between one over the other?
Is there any way a hybrid solution is better (barrier and tasks?).
Is my understanding correct on both usages for the problem?
I think you have a misunderstanding of the Barrier class. The docs say:
A barrier is a user-defined synchronization primitive that enables multiple threads (known as participants) to work concurrently on an algorithm in phases.
A barrier is a synchronization primitive. Comparing it to a unit of work that may be computed in parallel, such as a Task, isn't correct.
A barrier can signal all threads to wait until all others have completed some work and check upon that work. By itself, it has no parallel computation capabilities and no threading model behind it.
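To make that concrete, here is a minimal sketch of what using a Barrier actually involves: you create and manage the threads yourself, and the barrier only coordinates the phase boundary (DoWork and Commit are hypothetical):

const int participants = 3;
using var barrier = new Barrier(participants,
    postPhaseAction: b => Console.WriteLine($"Phase {b.CurrentPhaseNumber} complete"));

var threads = Enumerable.Range(0, participants)
    .Select(i => new Thread(() =>
    {
        DoWork(i);               // phase 0: process this thread's share of items
        barrier.SignalAndWait(); // block until every participant arrives
        Commit(i);               // phase 1: runs only after all have signaled
    }))
    .ToArray();

foreach (var t in threads) t.Start();
foreach (var t in threads) t.Join();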
Is there any advantage between one over the other?
Given the answer to question 1, you can see this comparison is irrelevant.
Is there any way a hybrid solution is better (barrier and tasks?).
In your case, I'm not sure it's needed at all. If you simply want to do CPU-bound computation in parallel on a collection of items, you have Parallel.ForEach exactly for that purpose. It will partition the enumerable and process the items in parallel, blocking until the entire collection has been computed.
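A minimal sketch of that approach (processItem, Commit, and Result are hypothetical, and items is assumed to be an IList):

var results = new Result[items.Count];

// Parallel.For blocks until every iteration has finished, so reaching
// the commit line means all items have been processed.
Parallel.For(0, items.Count, i => results[i] = processItem(items[i]));

Commit(results);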
I'm not directly answering your question because I think that working with barriers and tasks is just making your code more complex than it needs to be.
I'd suggest using Microsoft's Reactive Framework for this - NuGet "Rx-Main" - as it just makes the whole problem super simple.
Here's the code:
var query =
    from item in items.ToObservable()
    from processed in Observable.Start(() => processItem(item))
    select new { item, processed };

query
    .ToArray()
    .Subscribe(processedItems =>
    {
        /* commit the processed items */
    });
The query turns the list of items into an observable and then processes each item using Observable.Start(...). This optimally fires off new threads as needed. The .ToArray() takes the sequence of individual results and turns it into a single array of results. The .Subscribe(...) method then lets you process the results.
The code is much simpler than using tasks or barriers.

How to improve throughput on Parallel.ForEach

I'm trying to optimize code with parallel execution, but sometimes only one thread gets all the heavy load. The following example shows how 40 tasks should be performed on at most 4 threads, where the first ten are more time-consuming than the rest.
Parallel.ForEach seems to split the array into 4 parts and lets one thread handle each part, so the entire execution takes about 10 seconds. It should be able to complete within at most 3.3 seconds!
Is there a way to use all threads the whole way through, since in my real problem it isn't known which tasks are time-consuming?
var array = System.Linq.Enumerable.Range(0, 40).ToArray();

System.Threading.Tasks.Parallel.ForEach(array,
    new System.Threading.Tasks.ParallelOptions() { MaxDegreeOfParallelism = 4 },
    i =>
    {
        Console.WriteLine("Running index {0,3} : {1}", i, DateTime.Now.ToString("HH:mm:ss.fff"));
        System.Threading.Thread.Sleep(i < 10 ? 1000 : 10);
    });
It would be possible with Parallel.ForEach, but you'd need to use a custom partitioner (or find a 3rd party partitioner) that would be able to partition the elements more sensibly based on your particular items. (Or just use much smaller batches.)
This is also assuming that you don't strictly know in advance which items are going to be fast and which are slow; if you did, you could re-order the items yourself before calling ForEach so that the expensive items are more spread out. That may or may not be sufficient, depending on the circumstances.
In general I prefer to solve these problems by simply having one producer and multiple consumers, each of which handles one item at a time rather than batches. The BlockingCollection class makes these situations rather straightforward: just add all of the items to the collection, create N tasks/threads/etc., each of which grabs an item and processes it until there are no more items. It doesn't give you the dynamic adding/removing of threads that Parallel.ForEach gives you, but that doesn't seem to be an issue in your case.
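A minimal sketch of that producer/consumer setup, applied to the 40-item workload from the question:

using System;
using System.Collections.Concurrent;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

using var queue = new BlockingCollection<int>();
foreach (var i in Enumerable.Range(0, 40)) queue.Add(i);
queue.CompleteAdding(); // lets GetConsumingEnumerable end once the queue drains

// 4 consumers, each pulling one item at a time: a consumer stuck on a
// slow item never prevents the others from taking the remaining work.
var consumers = Enumerable.Range(0, 4).Select(_ => Task.Run(() =>
{
    foreach (var i in queue.GetConsumingEnumerable())
        Thread.Sleep(i < 10 ? 1000 : 10); // stand-in for the real work
})).ToArray();

Task.WaitAll(consumers);

With this layout the ten slow items spread across all four consumers, so the total time approaches the ideal ~3.3 seconds from the question rather than 10.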
Using a custom partitioner is the right solution to modify the behavior of Parallel.ForEach(). If you're on .NET 4.5, there is an overload of Partitioner.Create() that you can use. With it, your code would look like this:
var partitioner = Partitioner.Create(
    array, EnumerablePartitionerOptions.NoBuffering);

Parallel.ForEach(
    partitioner, new ParallelOptions { MaxDegreeOfParallelism = 4 }, i => …);
This is not the default, because turning off buffering increases the overhead of Parallel.ForEach(). But if your iterations are really that long (seconds), that additional overhead shouldn't be noticeable.
This is due to a feature called the partitioner. By default your loop is divided among your available threads equally. It sounds like you want to change this behavior. The reasoning behind the current behavior is that it takes a certain amount of overhead time to set up a thread, so you want to do as much work as is reasonable on it. Therefore the collection is partitioned into blocks and sent to each thread. The system has no way to know that parts of the collection take longer than others (unless you explicitly tell it) and assumes that an equal division leads to roughly equal completion times.

In your case you may want to split out the tasks that take longer and run them in a different way. Or you may wish to provide a custom partitioner that traverses the collection in a non-sequential manner.
You might want to use the Microsoft TPL Dataflow library, which helps in designing highly concurrent systems.
Your code is roughly equivalent to the following one using this library:
var options = new ExecutionDataflowBlockOptions
{
    MaxDegreeOfParallelism = 4,
    SingleProducerConstrained = true
};

var actionBlock = new ActionBlock<int>(i =>
{
    Console.WriteLine("Running index {0,3} : {1}", i, DateTime.Now.ToString("HH:mm:ss.fff"));
    System.Threading.Thread.Sleep(i < 10 ? 1000 : 10);
}, options);

Task.WhenAll(Enumerable.Range(0, 40).Select(actionBlock.SendAsync)).Wait();
actionBlock.Complete();
actionBlock.Completion.Wait();
TPL Dataflow will use 4 consumers in this scenario, processing a new value as soon as one of the consumers is available, thus maximizing throughput.
Once you're used to the library, you might want to add more asynchrony to your system by using the various blocks provided by the library, and removing all those awful Wait calls.
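For reference, a sketch of what that more asynchronous variant might look like, assuming the surrounding method is async:

// Same pipeline without the blocking Wait() calls.
foreach (var i in Enumerable.Range(0, 40))
    await actionBlock.SendAsync(i); // honors the block's backpressure

actionBlock.Complete();
await actionBlock.Completion; // completes once every queued item is processed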

What is the correct usage of ConcurrentBag?

I've already read previous questions here about ConcurrentBag but did not find an actual sample of implementation in multi-threading.
ConcurrentBag is "a thread-safe bag implementation, optimized for scenarios where the same thread will be both producing and consuming data stored in the bag."
Currently this is the usage in my code (this is simplified, not the actual code):
private void MyMethod()
{
    List<Product> products = GetAllProducts(); // Get list of products
    ConcurrentBag<Product> myBag = new ConcurrentBag<Product>();

    // Products are simply added to the ConcurrentBag here to simplify the code;
    // the actual code processes each product before adding it to the bag.
    Parallel.ForEach(
        products,
        new ParallelOptions { MaxDegreeOfParallelism = Environment.ProcessorCount },
        product => myBag.Add(product));

    ProcessBag(myBag); // method to process each item in the ConcurrentBag
}
My questions:
Is this the right usage of ConcurrentBag? Is it ok to use ConcurrentBag in this kind of scenario?
For me, I think a simple List<Product> and a manual lock would do better. The reason for this is that the scenario above already breaks the "same thread will be both producing and consuming data stored in the bag" rule.
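For comparison, the List-plus-lock alternative described above might look like this sketch (ProcessProduct is hypothetical):

var results = new List<Product>();
var gate = new object();

Parallel.ForEach(products, product =>
{
    var processed = ProcessProduct(product); // do the real work outside the lock
    lock (gate)
    {
        results.Add(processed); // hold the lock only for the cheap Add
    }
});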
I also found out that the ThreadLocal storage created in each thread by the parallel loop will still exist after the operation (even if the thread is reused; is this right?), which may cause an undesired memory leak.
Am I right on this one, guys? Or is a simple Clear or empty method enough to remove the items from the ConcurrentBag?
This looks like an ok use of ConcurrentBag. The thread local variables are members of the bag, and will become eligible for garbage collection at the same time the bag is (clearing the contents won't release them). You are right that a simple List with a lock would suffice for your case. If the work you are doing in the loop is at all significant, the type of thread synchronization won't matter much to the overall performance. In that case, you might be more comfortable using what you are familiar with.
Another option would be to use ParallelEnumerable.Select, which matches what you are trying to do more closely. Again, any performance difference you are going to see is likely going to be negligible and there's nothing wrong with sticking with what you know.
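A sketch of that ParallelEnumerable alternative (Process is a hypothetical per-item transform):

// AsParallel().Select collects per-item results without a shared
// collection or manual locking; add AsOrdered() if the output order
// must match the input order.
var results = products
    .AsParallel()
    .WithDegreeOfParallelism(Environment.ProcessorCount)
    .Select(product => Process(product))
    .ToList();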
As always, if the performance of this is critical there's no substitute for trying it and measuring.
It seems to me that bmm6o's answer is not correct. The ConcurrentBag instance internally contains mini-bags for each thread that adds items to it, so item insertion does not involve any thread locks, and thus all Environment.ProcessorCount threads may get into full swing without being stuck waiting and without any thread context switches. Thread synchronization may be required when iterating over the collected items, but in the original example the iteration is done by a single thread after all insertions are done. Moreover, if the ConcurrentBag uses Interlocked techniques as the first layer of thread synchronization, it may not involve Monitor operations at all.
On the other hand, using a plain List<T> instance and wrapping each of its Add() calls with the lock keyword will hurt performance a lot. First, because of the constant Monitor.Enter() and Monitor.Exit() calls, which can become expensive under contention. Second, one thread may occasionally be blocked because another thread has not finished its addition yet.
As for me, the code above is a really good example of the right usage of the ConcurrentBag class.
Is this the right usage of ConcurrentBag? Is it ok to use ConcurrentBag in this kind of scenario?
No, for multiple reasons:
This is not the intended usage scenario for this collection. The ConcurrentBag<T> is intended for mixed producer-consumer scenarios, meaning that each thread is expected to add and take items from the bag. Your scenario is nothing like this. You have many threads that add items, and zero threads that take items. The main application for the ConcurrentBag<T> is for making object-pools (pools of reusable objects that are expensive to create or destroy). And given the availability of the ObjectPool<T> class in the Microsoft.Extensions.ObjectPool package, even this niche application for this collection is contested.
It doesn't preserve the insertion order. Even if preserving the insertion order is not important, getting a shuffled output makes the debugging more difficult.
It creates garbage that has to be collected by the GC. It creates one WorkStealingQueue (internal class) per thread, each containing an expandable array, so the more threads you have, the more objects you allocate. Also, each time it is enumerated it copies all the items into an array, and its IEnumerator<T>-returning GetEnumerator() method allocates on each foreach.
There are better options available, offering both better performance and better ordering behavior.
In your scenario you can store the results of the parallel execution in a simple array. Just create an array with length equal to products.Count, switch from Parallel.ForEach to Parallel.For, and assign the result directly to the corresponding slot of the results array without doing any synchronization at all:
List<Product> products = GetAllProducts(); // Get list of products
Product[] results = new Product[products.Count];

Parallel.For(0, products.Count,
    new ParallelOptions { MaxDegreeOfParallelism = Environment.ProcessorCount },
    i => results[i] = products[i]);

ProcessResults(results);
This way you'll get the results with perfect ordering, stored in a container that has the most compact size and the fastest enumeration of all .NET collections, doing only a single object allocation.
In case you are concerned about the thread safety of the above operation, there is nothing to worry about. Each thread writes to different slots in the results array. After the completion of the parallel execution, the current thread has full visibility of all the values stored in the array, because the TPL includes the appropriate barriers when tasks are queued and at the beginning/end of task execution (citation).
(I have posted more thoughts about the ConcurrentBag<T> in this answer.)
If List<T> is used with a lock around the Add() method, it will make threads wait and will reduce the performance gain of using Parallel.ForEach().

Why isn't Parallel.ForEach running multiple threads?

Today I tried to do some optimization to a foreach statement that works on an XDocument.
Before optimization:
foreach (XElement elem in xDoc.Descendants("APSEvent").ToList())
{
    //some operations
}
After optimization:
Parallel.ForEach(xDoc.Descendants("APSEvent").ToList(), elem =>
{
    //same operations
});
I saw that in Parallel.ForEach(...) .NET opened ONLY one thread! As a result, the Parallel version took longer than the standard foreach.
Why do you think .NET only opened one thread? Because of locking of the file?
Thanks
It's by design that Parallel.ForEach may use fewer threads than requested to achieve better performance. According to MSDN [link]:
By default, the Parallel.ForEach and Parallel.For methods can use a variable number of tasks. That's why, for example, the ParallelOptions class has a MaxDegreeOfParallelism property instead of a "MinDegreeOfParallelism" property. The idea is that the system can use fewer threads than requested to process a loop.
The .NET thread pool adapts dynamically to changing workloads by allowing the number of worker threads for parallel tasks to change over time. At run time, the system observes whether increasing the number of threads improves or degrades overall throughput and adjusts the number of worker threads accordingly.
From the problem description, there is nothing that explains why the TPL is not spawning more threads.
There is no evidence in the question that this is even the problem, and it can be checked quite easily: log the thread id before you enter the loop, and as the first thing you do inside the loop.
If it is always the same number, it is the TPL failing to spawn threads. You should then try different versions of your code and see what change triggers the TPL to serialize everything. One reason could be a small number of elements in your list: the TPL partitions your collection, and if you have only a few items, you might end up with only one batch. This behavior is configurable, by the way.
If you are inadvertently taking a lock in the loop, you will see lots of different thread ids but no speedup. In that case, simplify the code until the problem vanishes.
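A minimal sketch of that diagnostic: if every line prints the same id, the loop really is running serially.

Console.WriteLine($"Before loop: thread {Environment.CurrentManagedThreadId}");

Parallel.ForEach(xDoc.Descendants("APSEvent").ToList(), elem =>
{
    Console.WriteLine($"Iteration on thread {Environment.CurrentManagedThreadId}");
    // same operations
});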
The parallel way is not always faster than the "old-fashioned way".
http://social.msdn.microsoft.com/Forums/en-US/parallelextensions/thread/c860cf3f-f7a6-46b5-8a07-ca2f413258dd
Use it like this:
int parallelThreads = 10;

Parallel.ForEach(xDoc.Descendants("APSEvent").ToList(),
    new ParallelOptions() { MaxDegreeOfParallelism = parallelThreads },
    (myXDOC, loopState, index) =>
    {
        //do whatever you want here
    });
Yes, exactly: XDocument.Load(...) locks the file, and due to resource contention between threads the TPL is unable to use the power of multiple threads. Try to load the XML into a Stream and then use Parallel.For(...).
Do you happen to have a single processor? The TPL may limit the number of threads to one in this case. The same thing may happen if the collection is very small. Try a bigger collection.
See this answer for more details on how the degree of parallelism is determined.
