Parallel Programming TPL - c#

When you spawn multiple tasks, like so:
for (int i = 0; i < 1000000; i++) {
// create a new task
tasks[i] = new Task<int>((stateObject) => {
tls.Value = (int)stateObject;
for (int j = 0; j < 1000; j++) {
// update the TLS balance
tls.Value++;
}
return tls.Value;
}, account.Balance);
tasks[i].Start();
}
Those tasks are basically operating on a ProcessThread. So therefore we can slice 1 process thread 1,000,000 times for 1,000,000 tasks.
Is it the TPL task scheduler that looks at the OS and determines that we have 8 virtual process threads in a multicore machine, and then allocates the load of 1,000,000 tasks across these 8 virtual process threads?

Now tasks are basically operating on a ProcessThread, so therefore we can slice 1 process thread 1,000,000 times for 1,000,000 tasks.
This is not true. A Task != a thread, and especially does not equate to a ProcessThread. Multiple tasks will get scheduled onto a single thread.
Is it the TPL task scheduler that looks at the OS and determines that we have 8 virtual process threads in a multicore machine, and so therefore allocates the load of 1,000,000 tasks across these 8 virtual process threads?
Effectively, yes. When using the default TaskScheduler (which you're doing above), the tasks are run on ThreadPool threads. The 1000000 tasks will not create 1000000 threads (though it will use more than the 8 you mention...)
That being said, data parallelism (such as looping in a giant for loop) is typically much better handled via Parallel.For or Parallel.ForEach. The Parallel class will, internally, use a Partitioner<T> to split up the work into fewer tasks, which will give you better overall performance since it will have far less overhead. For more details, see my post on Partitioning in the TPL.

Roughly, your current code pushes 1000000 tasks onto the ThreadPool. When those tasks take some significant time, you could run into problems.
In a situation like this, always use
Parallel.For(0, 1000000, ...);
and then you have not only the scheduler but, more importantly, also a partitioner helping you distribute the load.
Not to mention it's much more readable.
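As a rough illustration of the Parallel.For suggestion, here is a minimal sketch that does the same per-iteration work as the question (1000 increments starting from a balance) but keeps a per-thread running total instead of creating one Task per iteration. The class and method names (ParallelForSketch, SumBalances) are hypothetical, and summing the results is an assumption made only so the sketch has something to return.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class ParallelForSketch
{
    // Hypothetical helper: runs `iterations` loop bodies and returns the sum of
    // the per-iteration balances. Each worker keeps its own local subtotal.
    public static long SumBalances(int iterations, int startingBalance)
    {
        long total = 0;
        Parallel.For(
            0, iterations,
            () => 0L,                                   // per-worker initial subtotal
            (i, loopState, localTotal) =>
            {
                int balance = startingBalance;          // private copy, like the TLS value
                for (int j = 0; j < 1000; j++)
                {
                    balance++;
                }
                return localTotal + balance;            // no shared state touched here
            },
            localTotal => Interlocked.Add(ref total, localTotal)); // merge once per worker
        return total;
    }
}
```

The partitioner hands each worker a range of indices, so the shared total is only touched once per worker rather than once per iteration.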

Related

Task.Factory.StartNew starts with a great delay despite having available threads in threadpool

This question is a continuation to a previous question I've asked:
It takes more than a few seconds for a task to start running
I now know how exactly to reproduce this scenario.
Task.Factory.StartNew is scheduled on the thread pool, so I'm logging the following (just before I invoke the Factory.StartNew):
int workerThreads = 0;
int completionPortThreads = 0;
ThreadPool.GetMaxThreads(out workerThreads, out completionPortThreads);
ThreadPool.GetAvailableThreads(out workerThreads, out completionPortThreads);
var tokenSource = new CancellationTokenSource();
CancellationToken token = tokenSource.Token;
//I HAVE A LOG HERE
Task task = Task.Factory.StartNew(() =>
{
//I HAVE A LOG ALSO HERE, AND THAT'S HOW I KNOW
//THE TASK INVOCATION IS DELAYED, AND THE DELAY IS NOT DUE TO MY CODE WITHIN THE TASK
// Some action that returns a boolean - **CODE_A**
}).ContinueWith((task2) =>
{
result= task2.Result;
if (!result)
{
//Another action **CODE_B**
}
}, token);
When the bug is reproduced, I get 32767 as Max worker threads, and 32756 as available worker threads.
Now, there is something I don't understand.
At least as I've understood, once the threadpool reaches its overload, the threadpool will stop creating new threads immediately. And that's probably the reason for the delay of my task (that starts after more than 5 seconds from the invocation of Factory.StartNew).
But when the delay occurs, I see that I have 32756 available worker threads in my threadpool, so why does the threadpool NOT use one of those 32756 available worker threads to start my task immediately?
The available threads are on the ThreadPool (I mean, I invoke ThreadPool.GetAvailableThreads), and Task.Factory.StartNew allocates a task from the threadPool. So, why am I getting this delay despite having available threads in threadpool?
It's not the MAX worker threads value you need to look at - it's the MIN value you get via ThreadPool.GetMinThreads().
The max value is the absolute maximum threads that can be active. The min value is the number to always keep active. If you try to start a thread when the number of active threads is less than max (and greater than min) you'll see a 2 second delay.
You can change the minimum number of threads if absolutely necessary (which it is in some circumstances) but generally speaking if you find yourself needing to do that, you might need to think about redesigning your multithreading so that you don't need to.
As the Microsoft documentation states:
By default, the minimum number of threads is set to the number of processors on a system. You can use the SetMinThreads method to increase the minimum number of threads. However, unnecessarily increasing these values can cause performance problems. If too many tasks start at the same time, all of them might appear to be slow. In most cases, the thread pool will perform better with its own algorithm for allocating threads. Reducing the minimum to less than the number of processors can also hurt performance.
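To make the min/max distinction concrete, the following sketch reads the current minimums and raises the worker minimum only when asked for a larger value. The names MinThreadsSketch, ReadMinimums and RaiseWorkerMinimum are hypothetical; only ThreadPool.GetMinThreads and ThreadPool.SetMinThreads come from the framework. Bear in mind that the setting is process-wide, so this is a sketch for experimenting rather than a recommendation.

```csharp
using System;
using System.Threading;

public static class MinThreadsSketch
{
    // Returns the current minimum worker and I/O completion thread counts.
    public static (int worker, int io) ReadMinimums()
    {
        ThreadPool.GetMinThreads(out int worker, out int io);
        return (worker, io);
    }

    // Raises the worker minimum to `desired` if it is currently lower.
    // Returns false if the thread pool rejected the new value.
    public static bool RaiseWorkerMinimum(int desired)
    {
        ThreadPool.GetMinThreads(out int worker, out int io);
        if (desired <= worker)
        {
            return true; // nothing to change
        }
        return ThreadPool.SetMinThreads(desired, io);
    }
}
```

Logging ReadMinimums() next to the GetAvailableThreads values from the question shows how far above the minimum the pool is when the delay appears.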

TPL force higher parallelism

When queuing Tasks to the ThreadPool, the code relies on the default TaskScheduler to execute them. In my code example, I can see that 7 Tasks maximum get executed in parallel on separate threads.
new Thread(() =>
{
while (true)
{
ThreadPool.GetAvailableThreads(out var wt, out var cpt);
Console.WriteLine($"WT:{wt} CPT:{cpt}");
Thread.Sleep(500);
}
}).Start();
var stopwatch = new Stopwatch();
stopwatch.Start();
var tasks = Enumerable.Range(0, 100).Select(async i => { await Task.Yield(); Thread.Sleep(10000); }).ToArray();
Task.WaitAll(tasks);
Console.WriteLine(stopwatch.Elapsed.TotalSeconds);
Console.ReadKey();
Is there a way to force the scheduler to fire up more Tasks on other threads? Or is there a more "generous" scheduler in the framework without implementing a custom one?
EDIT:
Adding ThreadPool.SetMinThreads(100, X) seems to do the trick. I presume awaiting frees up the thread, so the pool thinks it can fire up another one, and then it immediately resumes.
By default, the minimum number of threads is set to the number of processors on a system. You can use the SetMinThreads method to increase the minimum number of threads. However, unnecessarily increasing these values can cause performance problems. If too many tasks start at the same time, all of them might appear to be slow. In most cases, the thread pool will perform better with its own algorithm for allocating threads. Reducing the minimum to less than the number of processors can also hurt performance.
From here: https://msdn.microsoft.com/en-us/library/system.threading.threadpool.setminthreads(v=vs.110).aspx
I removed AsParallel as it is not relevant and it just seems to confuse readers.
Is there a way to force the scheduler to fire up more Tasks on other threads?
You cannot have more executing threads than you have CPU cores. This is just how computers work. If you use more threads, then your work will actually get done more slowly since the threads must swap in and out of the cores in order to run.
Or is there a more "generous" scheduler in the framework without implementing a custom one?
PLINQ is already tuned to make maximum use of the hardware.
You can see this for yourself if you replace the Thread.Sleep call with something that actually uses the CPU (e.g., while (true) ;), and then watch your CPU usage in Task Manager. My expectation is that the 7 or 8 threads used by PLINQ in this example is all your machine can handle.
Useful link that explains it can be done with ThreadPool.SetMinThreads:
https://gist.github.com/JonCole/e65411214030f0d823cb#file-threadpool-md
Try this: https://msdn.microsoft.com/en-us/library/system.threading.threadpool.setmaxthreads(v=vs.110).aspx
You can set the number of worker threads (first argument).
Use WithDegreeOfParallelism extension:
Enumerable.Range(0, 100).AsParallel().WithDegreeOfParallelism(x).Select(...

What determines the number of threads for a TaskFactory spawned jobs?

I have the following code:
var factory = new TaskFactory();
for (int i = 0; i < 100; i++)
{
var i1 = i;
factory.StartNew(() => foo(i1));
}
static void foo(int i)
{
Thread.Sleep(1000);
Console.WriteLine($"foo{i} - on thread {Thread.CurrentThread.ManagedThreadId}");
}
I can see it only does 4 threads at a time (based on observation). My questions:
What determines the number of threads used at a time?
How can I retrieve this number?
How can I change this number?
P.S. My box has 4 cores.
P.P.S. I needed to have a specific number of tasks (and no more) that are concurrently processed by the TPL and ended up with the following code:
private static int count = 0; // keep track of how many concurrent tasks are running
private static void SemaphoreImplementation()
{
var s = new Semaphore(20, 20); // allow 20 tasks at a time
for (int i = 0; i < 1000; i++)
{
var i1 = i;
Task.Factory.StartNew(() =>
{
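s.WaitOne(); // acquire before the try so Release only runs after a successful wait
try
{
Interlocked.Increment(ref count);
foo(i1);
}
finally
{
Interlocked.Decrement(ref count); // decrement before freeing the slot so count never exceeds 20
s.Release();
}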
}, TaskCreationOptions.LongRunning);
}
}
static void foo(int i)
{
Thread.Sleep(100);
Console.WriteLine($"foo{i:00} - on thread " +
$"{Thread.CurrentThread.ManagedThreadId:00}. Executing concurrently: {count}");
}
When you are using a Task in .NET, you are telling the TPL to schedule a piece of work (via TaskScheduler) to be executed on the ThreadPool. Note that the work will be scheduled at its earliest opportunity and however the scheduler sees fit. This means that the TaskScheduler will decide how many threads will be used to run n number of tasks and which task is executed on which thread.
The TPL is very well tuned and continues to adjust its algorithm as it executes your tasks. So, in most cases, it tries to minimize contention. What this means is if you are running 100 tasks and only have 4 cores (which you can get using Environment.ProcessorCount), it would not make sense to execute more than 4 threads at any given time, as otherwise it would need to do more context switching. Now there are times where you want to explicitly override this behaviour. Let's say in the case where you need to wait for some sort of IO to finish, which is a whole different story.
In summary, trust the TPL. But if you are adamant to spawn a thread per task (not always a good idea!), you can use:
Task.Factory.StartNew(
() => /* your piece of work */,
TaskCreationOptions.LongRunning);
This tells the default TaskScheduler to explicitly spawn a new thread for that piece of work.
You can also use your own Scheduler and pass it in to the TaskFactory. You can find a whole bunch of Schedulers HERE.
Another alternative is PLINQ, which by default analyses your query and decides whether parallelizing it would yield any benefit. In the case of blocking IO, where you are certain that starting multiple threads will result in better execution, you can force parallelism with WithExecutionMode(ParallelExecutionMode.ForceParallelism). You can then use WithDegreeOfParallelism to give a hint on how many threads to use, but remember there is no guarantee you will get that many threads. As MSDN says:
Sets the degree of parallelism to use in a query. Degree of
parallelism is the maximum number of concurrently executing tasks that
will be used to process the query.
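A minimal sketch of the PLINQ options mentioned here, assuming a blocking workload simulated with Thread.Sleep. The names PlinqSketch and SquareAll are hypothetical, and AsOrdered is added only so the output order is predictable in this example.

```csharp
using System;
using System.Linq;
using System.Threading;

public static class PlinqSketch
{
    // Squares 0..count-1 with a simulated blocking call per item.
    // `degree` is an upper bound on concurrently executing tasks, not a guarantee.
    public static int[] SquareAll(int count, int degree)
    {
        return Enumerable.Range(0, count)
            .AsParallel()
            .AsOrdered()                                              // keep source order in the result
            .WithExecutionMode(ParallelExecutionMode.ForceParallelism) // skip PLINQ's "is it worth it" analysis
            .WithDegreeOfParallelism(degree)
            .Select(i =>
            {
                Thread.Sleep(10);                                     // stand-in for blocking IO
                return i * i;
            })
            .ToArray();
    }
}
```

For example, PlinqSketch.SquareAll(8, 4) processes eight items with at most four running at once.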
Finally, I highly recommend having a read of THIS great series of articles on Threading and TPL.
If you increase the number of tasks to for example 1000000 you will see a lot more threads spawned over time. The TPL tends to inject one every 500ms.
The TPL threadpool does not understand IO-bound workloads (sleep is IO). It's not a good idea to rely on the TPL for picking the right degree of parallelism in these cases. The TPL is completely clueless and injects more threads based on vague guesses about throughput. Also to avoid deadlocks.
Here, the TPL policy clearly is not useful because the more threads you add the more throughput you get. Each thread can process one item per second in this contrived case. The TPL has no idea about that. It makes no sense to limit the thread count to the number of cores.
What determines the number of threads used at a time?
Barely documented TPL heuristics. They frequently go wrong. In particular they will spawn an unlimited number of threads over time in this case. Use task manager to see for yourself. Let this run for an hour and you'll have 1000s of threads.
How can I retrieve this number? How can I change this number?
You can retrieve some of these numbers but that's not the right way to go. If you need a guaranteed DOP you can use AsParallel().WithDegreeOfParallelism(...) or a custom task scheduler. You also can manually start LongRunning tasks. Do not mess with process global settings.
I would suggest using SemaphoreSlim: it is a lightweight semaphore that does not rely on a Windows kernel object (so it can be used in Linux C# microservices), and its SemaphoreSlim.CurrentCount property tells you how many slots are still free, so you don't need Interlocked.Increment or Interlocked.Decrement. Keep the copy i1, though: the lambda captures the loop variable i itself rather than its current value, so without the copy a task could observe a later value of i by the time it runs:
private static void SemaphoreImplementation()
{
    var maxTasksCount = 20; // allow 20 tasks at a time
    var semaphoreSlim = new SemaphoreSlim(maxTasksCount, maxTasksCount);
    var taskFactory = new TaskFactory();
    for (int i = 0; i < 1000; i++)
    {
        var i1 = i; // copy, because the lambda captures the variable, not the value
        taskFactory.StartNew(async () =>
        {
            // acquire before the try so Release only runs after a successful wait
            await semaphoreSlim.WaitAsync();
            try
            {
                var count = maxTasksCount - semaphoreSlim.CurrentCount; // CurrentCount is the number of free slots
                await foo(i1, count);
            }
            finally
            {
                semaphoreSlim.Release();
            }
        }, TaskCreationOptions.LongRunning);
    }
}
static async Task foo(int i, int count) // Task, not void, so that it can be awaited
{
    await Task.Delay(100);
    Console.WriteLine($"foo{i:00} - on thread " +
        $"{Thread.CurrentThread.ManagedThreadId:00}. Executing concurrently: {count}");
}
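One thing the snippet above leaves open is how to know when all the throttled work has finished. The sketch below is one possible variation, not the answerer's code: it starts the async work directly, awaits everything with Task.WhenAll, and reports the highest concurrency it observed. The names ThrottleSketch and RunThrottledAsync are hypothetical, and Task.Delay stands in for the real job.

```csharp
using System;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public static class ThrottleSketch
{
    // Runs jobCount simulated jobs, never more than maxConcurrent at once,
    // and returns the peak number of jobs that were in flight together.
    public static async Task<int> RunThrottledAsync(int jobCount, int maxConcurrent)
    {
        var gate = new SemaphoreSlim(maxConcurrent, maxConcurrent);
        var sync = new object();
        int running = 0, peak = 0;

        async Task RunOneAsync(int job)
        {
            await gate.WaitAsync();              // wait for a free slot
            try
            {
                lock (sync) { running++; peak = Math.Max(peak, running); }
                await Task.Delay(20);            // stand-in for the real work
            }
            finally
            {
                lock (sync) { running--; }
                gate.Release();                  // free the slot for the next job
            }
        }

        // Select starts every job; WhenAll completes when the last one finishes.
        await Task.WhenAll(Enumerable.Range(0, jobCount).Select(RunOneAsync));
        return peak;
    }
}
```

Because the jobs are asynchronous, no thread is blocked while a job waits for the semaphore or for its delay, so TaskCreationOptions.LongRunning is not needed in this variation.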

Is it the correct implementation?

I have a Windows Service that needs to pick jobs from a database and process them.
Here, each job is a scanning process that takes approximately 10 minutes to complete.
I am very new to the Task Parallel Library. I have implemented the following as sample logic:
Queue queue = new Queue();
for (int i = 0; i < 10000; i++)
{
queue.Enqueue(i);
}
for (int i = 0; i < 100; i++)
{
Task.Factory.StartNew((Object data ) =>
{
var Objdata = (Queue)data;
Console.WriteLine(Objdata.Dequeue());
Console.WriteLine(
"The current thread is " + Thread.CurrentThread.ManagedThreadId);
}, queue, TaskCreationOptions.LongRunning);
}
Console.ReadLine();
But this creates a lot of threads. Since the loop repeats 100 times, it creates 100 threads.
Is it the right approach to create that many parallel threads?
Is there any way to limit the number of threads to 10 (concurrency level)?
An important factor to remember when allocating new Threads is that the OS has to allocate a number of logical entities in order for that current thread to run:
Thread kernel object - an object for describing the thread, including the thread's context, CPU registers, etc.
Thread environment block - for exception handling and thread-local storage
User-mode stack - 1MB of stack
Kernel-mode stack - for passing arguments from user mode to kernel mode
Other than that, the number of concurrent threads that may run depends on the number of cores your machine has, and creating more threads than the machine has cores will start causing context switching, which in the long run may slow your work down.
So after the long intro, to the good stuff. What we actually want to do is limit the number of threads running and reuse them as much as possible.
For this kind of job, I would go with TPL Dataflow, which is based on the Producer-Consumer pattern. Just a small example of what can be done:
// a BufferBlock is an equivalent of a ConcurrentQueue to buffer your objects
var bufferBlock = new BufferBlock<object>();
// An ActionBlock to process each object and do something with it
var actionBlock = new ActionBlock<object>(obj =>
{
// Do stuff with the objects from the bufferblock
});
bufferBlock.LinkTo(actionBlock);
bufferBlock.Completion.ContinueWith(t => actionBlock.Complete());
You may pass each block an options object (DataflowBlockOptions for the BufferBlock, ExecutionDataflowBlockOptions for the ActionBlock). BoundedCapacity limits the number of objects a block will buffer, and MaxDegreeOfParallelism on the ActionBlock sets the maximum number of items it processes concurrently.
There is a good example here to get you started.
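As a hedged sketch of those options, the example below feeds integer job ids into a single ActionBlock that processes at most maxParallel of them at a time. DataflowSketch and ProcessJobsAsync are hypothetical names, Task.Delay stands in for the scanning job, and the bounded capacity of 100 is an arbitrary assumption. It relies on the System.Threading.Tasks.Dataflow library, which may need to be added as a package reference depending on the target framework.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

public static class DataflowSketch
{
    // Processes jobCount jobs with at most maxParallel running at once
    // and returns how many were processed.
    public static async Task<int> ProcessJobsAsync(int jobCount, int maxParallel)
    {
        var processed = new ConcurrentBag<int>();

        var worker = new ActionBlock<int>(
            async jobId =>
            {
                await Task.Delay(10);            // stand-in for the real scan
                processed.Add(jobId);
            },
            new ExecutionDataflowBlockOptions
            {
                MaxDegreeOfParallelism = maxParallel, // concurrency limit
                BoundedCapacity = 100                 // assumed cap on queued + in-flight jobs
            });

        for (int i = 0; i < jobCount; i++)
        {
            // SendAsync waits when the block is full, which gives back-pressure.
            await worker.SendAsync(i);
        }

        worker.Complete();                       // no more input
        await worker.Completion;                 // wait for in-flight jobs to finish
        return processed.Count;
    }
}
```

With maxParallel set to 10 this gives the concurrency level of 10 that the question asks for, without dedicating a thread to each queued job.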
Glad you asked, because you're right in the sense that - this is not the best approach.
The concept of Task should not be confused with a Thread. A Thread can be compared to a chef in a kitchen, while a Task is a dish ordered by a customer. You have a bunch of chefs, and they process the dish orders in some ordering (usually FIFO). A chef finishes a dish then moves on to the next. The concept of Thread Pool is the same. You create a bunch of Tasks to be completed, but you do not need to assign a new thread to each task.
Ok so the actual bits to do it. There are a few. The first one is ThreadPool.QueueUserWorkItem. (http://msdn.microsoft.com/en-us/library/system.threading.threadpool.queueuserworkitem(v=vs.110).aspx). Using the Parallel library, Parallel.For can also be used; it will automatically spawn threads based on the number of actual CPU cores available in the system.
Parallel.For(0, 100, i=>{
//here, this method will be called 100 times, and i will run from 0 to 99 (the upper bound is exclusive)
WaitForGrassToGrow();
Console.WriteLine(string.Format("The {0}-th task has completed!",i));
});
Note that there is no guarantee that the method called by Parallel.For is called in sequence (0,1,2,3,4,5...). The actual sequence depends on the execution.
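If a hard upper bound is wanted (the question asks for a concurrency level of 10), Parallel.For accepts a ParallelOptions argument. The sketch below is an illustration under that assumption; BoundedParallelForSketch and Run are hypothetical names, and Thread.Sleep stands in for the real job.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class BoundedParallelForSketch
{
    // Runs jobCount iterations with at most maxParallel running at once
    // and returns the highest concurrency actually observed.
    public static int Run(int jobCount, int maxParallel)
    {
        var options = new ParallelOptions { MaxDegreeOfParallelism = maxParallel };
        var sync = new object();
        int running = 0, peak = 0;

        Parallel.For(0, jobCount, options, i =>
        {
            lock (sync) { running++; peak = Math.Max(peak, running); }
            Thread.Sleep(5);                     // stand-in for the real work
            lock (sync) { running--; }
        });

        return peak;
    }
}
```

MaxDegreeOfParallelism is an upper limit, so the observed peak can be lower than the requested value on a lightly threaded pool.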

Parallel.For vs regular threads

I'm trying to understand why Parallel.For is able to outperform a number of threads in the following scenario: consider a batch of jobs that can be processed in parallel. While processing these jobs, new work may be added, which then needs to be processed as well. The Parallel.For solution would look as follows:
var jobs = new List<Job> { firstJob };
int startIdx = 0, endIdx = jobs.Count;
while (startIdx < endIdx) {
Parallel.For(startIdx, endIdx, i => WorkJob(jobs[i]));
startIdx = endIdx; endIdx = jobs.Count;
}
This means that there are multiple times where the Parallel.For needs to synchronize. Consider a breadth-first graph algorithm; the number of synchronizations would be quite large. Waste of time, no?
Trying the same in the old-fashioned threading approach:
var queue = new ConcurrentQueue<Job>(new[] { firstJob }); // ConcurrentQueue has no Add method, so a collection initializer does not compile
var threads = new List<Thread>();
var waitHandle = new AutoResetEvent(false);
int numBusy = 0;
for (int i = 0; i < maxThreads; i++)
threads.Add(new Thread(new ThreadStart(delegate {
while (!queue.IsEmpty || numBusy > 0) {
if (queue.IsEmpty)
// numbusy > 0 implies more data may arrive
waitHandle.WaitOne();
Job job;
if (queue.TryDequeue(out job)) {
Interlocked.Increment(ref numBusy);
WorkJob(job); // WorkJob does a waitHandle.Set() when more work was found
Interlocked.Decrement(ref numBusy);
}
}
// others are possibly waiting for us to enable more work which won't happen
waitHandle.Set();
})));
threads.ForEach(t => t.Start());
threads.ForEach(t => t.Join());
The Parallel.For code is of course much cleaner, but what I cannot comprehend is that it's even faster as well! Is the task scheduler just that good? The synchronizations were eliminated, there's no busy waiting, yet the threaded approach is consistently slower (for me). What's going on? Can the threading approach be made faster?
Edit: thanks for all the answers, I wish I could pick multiple ones. I chose to go with the one that also shows an actual possible improvement.
The two code samples are not really the same.
Parallel.For() will use a limited number of threads and reuse them. The second sample starts way behind by having to create a number of threads, which takes time.
And what is the value of maxThreads? That is critical; in Parallel.For() it is dynamic.
Is the task scheduler just that good?
It is pretty good. And TPL uses work-stealing and other adaptive technologies. You'll have a hard time to do any better.
Parallel.For doesn't actually break up the items into single units of work. It breaks up all the work (early on) based on the number of threads it plans to use and the number of iterations to be executed. Then has each thread synchronously process that batch (possibly using work stealing or saving some extra items to load-balance near the end). By using this approach the worker threads are virtually never waiting on each other, while your threads are constantly waiting on each other due to the heavy synchronization you're using before/after every single iteration.
On top of that since it's using thread pool threads many of the threads it needs are likely already created, which is another advantage in its favor.
As for synchronization, the entire point of a Parallel.For is that all of the iterations can be done in parallel, so there is almost no synchronization that needs to take place (at least in their code).
Then of course there is the issue of the number of threads. The thread pool has a lot of very good algorithms and heuristics to help it determine how many threads are needed at that instant in time, based on the current hardware, the load from other applications, etc. It's possible that you're using too many, or not enough, threads.
Also, since the number of items that you have isn't known before you start, I would suggest using Parallel.ForEach rather than several Parallel.For loops. It is simply designed for the situation that you're in, so its heuristics will apply better. (It also makes for even cleaner code.)
BlockingCollection<Job> queue = new BlockingCollection<Job>();
//add jobs to queue, possibly in another thread
//call queue.CompleteAdding() when there are no more jobs to run
Parallel.ForEach(queue.GetConsumingEnumerable(),
job => job.DoWork());
You're creating a bunch of new threads, while Parallel.For uses the thread pool. You would see better performance if you used the .NET thread pool directly, but there really is no point in doing that.
I would shy away from rolling your own solution; if there is a corner case where you need customization, use the TPL and customize it.
