Consider this code run on a CPU with 32 cores:
ParallelOptions po = new ParallelOptions();
po.MaxDegreeOfParallelism = 8;
Parallel.For(0, 4, po, (i) =>
{
    Parallel.For(0, 4, po, (j) =>
    {
        WorkMethod(i, j); // assume a long-running method
    });
});
My question is: what is the actual maximum possible concurrency of WorkMethod(i, j)? Is it 4, 8, or 16?
ParallelOptions.MaxDegreeOfParallelism is not applied globally. If you have enough cores, and the scheduler sees fit, you will get a multiplication of the nested MDP values, with each For able to spin up that many tasks (if the workloads are unconstrained). In the question's example the loop ranges are the binding constraint, so up to 4 x 4 = 16 invocations of WorkMethod can be in flight.
Consider this example: 3 tasks can each start 3 more tasks, with each loop limited by its MDP option of 3.
int k = 0; // number of loop bodies currently executing
ParallelOptions po = new ParallelOptions();
po.MaxDegreeOfParallelism = 3;
Parallel.For(0, 10, po, (i) =>
{
    Parallel.For(0, 10, po, (j) =>
    {
        Interlocked.Increment(ref k);
        Console.WriteLine(k); // peaks at 9: 3 outer slots x 3 inner slots each
        Thread.Sleep(2000);
        Interlocked.Decrement(ref k);
    });
    Thread.Sleep(2000);
});
Output
1
2
3
4
7
5
6
8
9
9
5
6
7
9
9
8
8
9
...
If MDP were applied globally, the counter would never exceed 3; since it isn't, you see 9s (3 outer x 3 inner).
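If you actually need a global cap across nested loops, MaxDegreeOfParallelism alone won't give you one; a single semaphore shared by both loops will. A minimal sketch of that idea (one approach among several; note that Wait() blocks the worker threads, which can prompt the pool to inject more of them):

using System;
using System.Threading;
using System.Threading.Tasks;

class GlobalCapDemo
{
    // One semaphore shared by both loops enforces a true global cap of 3.
    static readonly SemaphoreSlim GlobalLimit = new SemaphoreSlim(3);
    static int k = 0;

    static void Main()
    {
        var po = new ParallelOptions { MaxDegreeOfParallelism = 3 };
        Parallel.For(0, 10, po, i =>
        {
            Parallel.For(0, 10, po, j =>
            {
                GlobalLimit.Wait(); // blocks until one of the 3 slots frees up
                try
                {
                    Console.WriteLine(Interlocked.Increment(ref k)); // never exceeds 3
                    Thread.Sleep(2000);
                    Interlocked.Decrement(ref k);
                }
                finally
                {
                    GlobalLimit.Release();
                }
            });
        });
    }
}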
ParallelOptions.MaxDegreeOfParallelism is not global; it applies per parallel loop. More specifically, it sets the maximum number of tasks that can run in parallel for that loop, not the maximum number of cores or threads that will run those tasks.
Some demo tests
Note: I have 4 cores, 8 threads.
What's happening in the code
We're running 2 async methods; each one kicks off nested parallel loops.
We're setting the max degree of parallelism to 2 and a sleep time of 2 seconds to simulate the work each task does.
So, due to setting MaxDegreeOfParallelism to 2, we would expect to reach up to 12 concurrent tasks before the 40 tasks complete (I'm only counting tasks kicked off by the nested parallel loops).
How do I get 12?
2 max concurrent tasks started in the outer loop
+4 max concurrent tasks from inner loop (2 started per task started in outer loop)
that's 6 (per asynchronous task kicked off in Main)
12 total
Test code:
using System;
using System.Threading;
using System.Threading.Tasks;

namespace forfun
{
    class Program
    {
        static void Main(string[] args)
        {
            var taskRunner = new TaskRunner();
            taskRunner.RunTheseTasks();
            taskRunner.RunTheseTasksToo();
            Console.ReadLine();
        }

        private class TaskRunner
        {
            private int _totalTasks = 0;
            private int _runningTasks = 0;

            // async void: fire-and-forget, fine for a console demo
            public async void RunTheseTasks()
            {
                await Task.Run(() => ProcessThingsInParallel());
            }

            public async void RunTheseTasksToo()
            {
                await Task.Run(() => ProcessThingsInParallel());
            }

            private void ProcessThingsInParallel()
            {
                ParallelOptions po = new ParallelOptions();
                po.MaxDegreeOfParallelism = 2;
                Parallel.For(0, 4, po, (i) =>
                {
                    Interlocked.Increment(ref _totalTasks);
                    Interlocked.Increment(ref _runningTasks);
                    Console.WriteLine($"{_runningTasks} currently running of {_totalTasks} total tasks");
                    Parallel.For(0, 4, po, (j) =>
                    {
                        Interlocked.Increment(ref _totalTasks);
                        Interlocked.Increment(ref _runningTasks);
                        Console.WriteLine($"{_runningTasks} currently running of {_totalTasks} total tasks");
                        WorkMethod(i, j); // assume a long-running method
                        Interlocked.Decrement(ref _runningTasks);
                    });
                    Interlocked.Decrement(ref _runningTasks);
                });
            }

            private static void WorkMethod(int i, int l)
            {
                Thread.Sleep(2000);
            }
        }
    }
}
Spoiler: the output shows that MaxDegreeOfParallelism is not global, is not limited to core or thread count, and specifically caps the number of concurrently running tasks.
Output with max set to 2:
1 currently running of 1 total tasks
3 currently running of 3 total tasks
2 currently running of 2 total tasks
4 currently running of 4 total tasks
5 currently running of 5 total tasks
7 currently running of 7 total tasks
[ ... snip ...]
11 currently running of 33 total tasks
12 currently running of 34 total tasks
11 currently running of 35 total tasks
12 currently running of 36 total tasks
11 currently running of 37 total tasks
12 currently running of 38 total tasks
11 currently running of 39 total tasks
12 currently running of 40 total tasks
(output will vary, but each time, the max concurrent should be 12)
Output without max set:
1 currently running of 1 total tasks
3 currently running of 3 total tasks
4 currently running of 4 total tasks
2 currently running of 2 total tasks
5 currently running of 5 total tasks
7 currently running of 7 total tasks
[ ... snip ...]
19 currently running of 28 total tasks
19 currently running of 29 total tasks
18 currently running of 30 total tasks
13 currently running of 31 total tasks
13 currently running of 32 total tasks
16 currently running of 35 total tasks
16 currently running of 36 total tasks
14 currently running of 33 total tasks
15 currently running of 34 total tasks
15 currently running of 37 total tasks
16 currently running of 38 total tasks
16 currently running of 39 total tasks
17 currently running of 40 total tasks
Notice how, without setting the max, we get up to 19 concurrent tasks; now it is the 2-second sleep time that limits how many tasks can kick off before others finish.
Output after increasing the sleep time to 12 seconds:
1 currently running of 1 total tasks
2 currently running of 2 total tasks
3 currently running of 3 total tasks
4 currently running of 4 total tasks
[ ... snip ...]
26 currently running of 34 total tasks
26 currently running of 35 total tasks
27 currently running of 36 total tasks
28 currently running of 37 total tasks
28 currently running of 38 total tasks
28 currently running of 39 total tasks
28 currently running of 40 total tasks
Got up to 28 concurrent tasks.
Now setting the loops to 10 nested in 10 and the sleep time back to 2 seconds; again no max set:
1 currently running of 1 total tasks
3 currently running of 3 total tasks
2 currently running of 2 total tasks
4 currently running of 4 total tasks
[ ... snip ...]
38 currently running of 176 total tasks
38 currently running of 177 total tasks
38 currently running of 178 total tasks
37 currently running of 179 total tasks
38 currently running of 180 total tasks
38 currently running of 181 total tasks
[ ... snip ...]
35 currently running of 216 total tasks
35 currently running of 217 total tasks
32 currently running of 218 total tasks
32 currently running of 219 total tasks
33 currently running of 220 total tasks
Got up to 38 concurrent tasks before all 220 finished.
More related information
ParallelOptions.MaxDegreeOfParallelism Property
The MaxDegreeOfParallelism property affects the number of concurrent operations run by Parallel method calls that are passed this ParallelOptions instance. A positive property value limits the number of concurrent operations to the set value. If it is -1, there is no limit on the number of concurrently running operations.
By default, For and ForEach will utilize however many threads the underlying scheduler provides, so changing MaxDegreeOfParallelism from the default only limits how many concurrent tasks will be used.
To get the maximum degree of parallelism, don't set it; rather, let the TPL and its scheduler handle it.
Setting the max degree of parallelism only limits the number of concurrent tasks, not the threads used.
The maximum number of concurrent tasks is not equal to the number of threads available; threads can still juggle multiple tasks, and even if your app is using all threads, it is still sharing them with the other processes the machine is hosting.
Environment.ProcessorCount
Gets the number of processors on the current machine.
What if we say MaxDegreeOfParallelism = Environment.ProcessorCount?
Even setting the max degree of parallelism to Environment.ProcessorCount does not dynamically ensure that you get the maximum concurrency regardless of the system your app runs on. Doing so still limits the degree of parallelism, because any given thread can switch between many tasks; it merely caps the number of concurrent tasks at the number of available threads, and that does not mean each concurrent task is assigned neatly to a thread in a one-to-one relationship.
Related
When I run a lot of operations in parallel, using a SemaphoreSlim for each, their invocations are not as quick as expected.
Here is the code
var counter = 0; // declared here; the original snippet omitted it
var sw = new Stopwatch();
sw.Start();
for (int i = 0; i < 50; i++) {
    int localI = i;
    Task.Run(async () => {
        var semaphore = new SemaphoreSlim(1, 1);
        await semaphore.WaitAsync();
        Thread.Sleep(1000);
        counter++;
        semaphore.Release();
        Debug.WriteLine($"{localI} - {sw.ElapsedMilliseconds}");
    });
}
Thread.Sleep(5000);
Thread.Sleep(5000);
And here is the output:
2 - 1015
0 - 1015
1 - 1015
3 - 2053
4 - 2053
5 - 2053
6 - 2120
7 - 3009
8 - 3064
9 - 3066
10 - 3068
11 - 3134
12 - 4011
13 - 4016
14 - 4070
15 - 4071
16 - 4073
17 - 4140
Can somebody explain why they were not all invoked in approximately 1 second?
What you are seeing is the limited thread pool injection rate. It has nothing to do with SemaphoreSlim or even async, as all the code posted is actually synchronous.
On your machine, three threads are able to run immediately. The thread pool sees that it has other work to do (47 other items already queued). So it waits for a bit and then injects another thread. The next group of work uses four threads. The thread pool is still "behind", so it waits for a bit and then injects another thread, etc.
The "wait for a bit" part of the description above is the limited thread pool injection rate. The thread pool has to wait for a bit, or else whenever it gets more work, it would immediately create a bunch of threads, which would then be disposed of when the work is done. So to be more efficient and prevent this "thread thrashing", the thread pool waits for a bit before creating new threads.
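To see the injection rate at work (and work around it), you can raise the pool's minimum worker-thread count before queuing the work; with the minimum raised to 50, all 50 items should report roughly 1000 ms. A hedged sketch; the better fix is to not block pool threads at all, e.g. await Task.Delay(1000) instead of Thread.Sleep(1000):

// Read the current minimums, then raise only the worker-thread count.
// The pool can now grow to 50 threads without the injection delay.
ThreadPool.GetMinThreads(out int workers, out int ioThreads);
ThreadPool.SetMinThreads(50, ioThreads);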
I have a webservice which receives multiple requests at the same time. For each request, I need to call another webservice (authentication things). The problem is, if multiple (>20) requests happen at the same time, the response time suddenly gets a lot worse.
I made a sample to demonstrate the problem:
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Net;
using System.Net.Http;
using System.Threading.Tasks;

namespace CallTest
{
    public class Program
    {
        private static readonly HttpClient _httpClient = new HttpClient(new HttpClientHandler { Proxy = null, UseProxy = false });

        static void Main(string[] args)
        {
            ServicePointManager.DefaultConnectionLimit = 100;
            ServicePointManager.Expect100Continue = false;

            // warmup
            CallSomeWebsite().GetAwaiter().GetResult();
            CallSomeWebsite().GetAwaiter().GetResult();

            RunSequentiell().GetAwaiter().GetResult();
            RunParallel().GetAwaiter().GetResult();
        }

        private static async Task RunParallel()
        {
            var tasks = new List<Task>();
            for (var i = 0; i < 300; i++)
            {
                tasks.Add(CallSomeWebsite());
            }
            await Task.WhenAll(tasks);
        }

        private static async Task RunSequentiell()
        {
            for (var i = 0; i < 300; i++)
            {
                await CallSomeWebsite();
            }
        }

        private static async Task CallSomeWebsite()
        {
            var watch = Stopwatch.StartNew();
            using (var result = await _httpClient.GetAsync("http://example.com").ConfigureAwait(false))
            {
                // more work here, like checking success etc.
                Console.WriteLine(watch.ElapsedMilliseconds);
            }
        }
    }
}
Sequential calls are no problem. They take a few milliseconds to finish and the response time is mostly the same.
However, parallel requests start taking longer and longer the more requests are sent. Sometimes a request even takes a few seconds. I tested it on .NET Framework 4.6.1 and on .NET Core 2.0 with the same results.
What is even stranger: I traced the HTTP requests with Wireshark, and they always take around the same time. But the sample program reports much higher values for parallel requests than Wireshark does.
How can I get the same performance for parallel requests? Is this a thread pool issue?
This behaviour has been fixed with .NET Core 2.1. I think the problem was the underlying Windows WinHTTP handler, which was used by the HttpClient.
In .NET Core 2.1, they rewrote the HttpClientHandler (see https://blogs.msdn.microsoft.com/dotnet/2018/04/18/performance-improvements-in-net-core-2-1/#user-content-networking):
In .NET Core 2.1, HttpClientHandler has a new default implementation implemented from scratch entirely in C# on top of the other System.Net libraries, e.g. System.Net.Sockets, System.Net.Security, etc. Not only does this address the aforementioned behavioral issues, it provides a significant boost in performance (the implementation is also exposed publicly as SocketsHttpHandler, which can be used directly instead of via HttpClientHandler in order to configure SocketsHttpHandler-specific properties).
This turned out to remove the bottlenecks mentioned in the question.
On .NET Core 2.0, I get the following numbers (in milliseconds):
Fetching URL 500 times...
Sequentiell Total: 4209, Max: 35, Min: 6, Avg: 8.418
Parallel Total: 822, Max: 338, Min: 7, Avg: 69.126
But on .NET Core 2.1, the individual parallel HTTP requests seem to have improved a lot:
Fetching URL 500 times...
Sequentiell Total: 4020, Max: 40, Min: 6, Avg: 8.040
Parallel Total: 795, Max: 76, Min: 5, Avg: 7.972
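If you are on .NET Core 2.1 or later you can also construct the new handler explicitly rather than relying on the default. A sketch; the MaxConnectionsPerServer value of 100 simply mirrors the DefaultConnectionLimit from the question:

// SocketsHttpHandler is the new all-C# implementation exposed publicly in .NET Core 2.1.
private static readonly HttpClient _httpClient = new HttpClient(new SocketsHttpHandler
{
    UseProxy = false,
    MaxConnectionsPerServer = 100 // replaces ServicePointManager.DefaultConnectionLimit
});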
In the question's RunParallel() function, a stopwatch is started for all 300 calls in the first second of the program running, and ended when each http request completes.
Therefore these times can't really be compared to the sequential iterations.
For smaller numbers of parallel tasks e.g. 50, if you measure the wall time that the sequential and parallel methods take you should find that the parallel method is faster due to it pipelining as many GetAsync tasks as it can.
That said, when running the code for 300 iterations I did find a repeatable several-second stall when running outside the debugger only:
Debug build, in debugger: Sequential 27.6 seconds, parallel 0.6 seconds
Debug build, without debugger: Sequential 26.8 seconds, parallel 3.2 seconds
[Edit]
There's a similar scenario described in this question; it's possibly not relevant to your problem anyway.
This problem gets worse the more tasks are run, and disappears when:
Swapping the GetAsync work for an equivalent delay
Running against a local server
Slowing the rate of task creation / running fewer concurrent tasks
The watch.ElapsedMilliseconds diagnostic stops for all connections, indicating that all connections are affected by the throttling.
Seems to be some sort of (anti-SYN-flood?) throttling in the host or network that just halts the flow of packets once a certain number of sockets start connecting.
It sounds like for whatever reason, you're hitting a point of diminishing returns at around 20 concurrent Tasks. So, your best option might be to throttle your parallelism. TPL Dataflow is a great library for achieving this. To follow your pattern, add a method like this:
// Requires the System.Threading.Tasks.Dataflow NuGet package.
private static Task RunParallelThrottled()
{
    var throttler = new ActionBlock<int>(i => CallSomeWebsite(),
        new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 20 });
    for (var i = 0; i < 300; i++)
    {
        throttler.Post(i);
    }
    throttler.Complete();
    return throttler.Completion;
}
You might need to experiment with MaxDegreeOfParallelism until you find the sweet spot. Note that this is more efficient than doing batches of 20. In that scenario, all 20 in the batch would need to complete before the next batch begins. With TPL Dataflow, as soon as one completes, another is allowed to begin.
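If you'd rather not take the Dataflow dependency, a shared SemaphoreSlim gives similar throttling: as soon as one call finishes, it frees a slot for the next. A sketch following the question's pattern, with 20 mirroring the MaxDegreeOfParallelism above:

private static async Task RunParallelThrottledWithSemaphore()
{
    var semaphore = new SemaphoreSlim(20); // at most 20 concurrent calls
    var tasks = new List<Task>();
    for (var i = 0; i < 300; i++)
    {
        tasks.Add(Task.Run(async () =>
        {
            await semaphore.WaitAsync();
            try { await CallSomeWebsite(); }
            finally { semaphore.Release(); }
        }));
    }
    await Task.WhenAll(tasks);
}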
The reason you are having issues is that .NET does not resume Tasks in the order they are awaited; an awaited Task is only resumed when the calling function cannot continue execution, and Task is not for parallel execution.
If you make a few modifications so that you pass in i to the CallSomeWebsite function and call Console.WriteLine("All loaded"); after you add all the tasks to the list, you will get something like this: (RequestNumber: Time)
All loaded
0: 164
199: 236
299: 312
12: 813
1: 837
9: 870
15: 888
17: 905
5: 912
10: 952
13: 952
16: 961
18: 976
19: 993
3: 1061
2: 1061
Do you notice how every Task is created before any of the times are printed out to the screen? The entire loop of creating Tasks completes before any of the Tasks resume execution after awaiting the network call.
Also, see how request 199 is completed before request 1? .NET will resume Tasks in the order it deems best. (It is surely more complicated than that, but I am not exactly sure how .NET decides which Task to continue.)
One thing that I think you might be confusing is asynchronous and parallel. They are not the same, and Task is used for asynchronous execution. What that means is that all of these tasks are running on the same thread (probably; .NET can start a new thread for tasks if needed), so they are not running in parallel. If they were truly parallel, they would all be running on different threads, and the execution times would not be increasing for each execution.
Updated functions:
private static async Task RunParallel()
{
    var tasks = new List<Task>();
    for (var i = 0; i < 300; i++)
    {
        tasks.Add(CallSomeWebsite(i));
    }
    Console.WriteLine("All loaded");
    await Task.WhenAll(tasks);
}

private static async Task CallSomeWebsite(int i)
{
    var watch = Stopwatch.StartNew();
    using (var result = await _httpClient.GetAsync("https://www.google.com").ConfigureAwait(false))
    {
        // more work here, like checking success etc.
        Console.WriteLine($"{i}: {watch.ElapsedMilliseconds}");
    }
}
As for the reason that the time printed is longer for the asynchronous execution than the synchronous execution: your current method of tracking time does not account for the time spent between execution halting and continuing. That is why the reported execution times increase over the set of completed requests. If you want an accurate time, you will need to find a way of subtracting the time spent between the await occurring and execution continuing. The issue isn't that it is taking longer; it is that you have an inaccurate reporting method. If you sum the time for all the synchronous calls, it is actually significantly more than the max time of the asynchronous calls:
Sync: 27965
Max Async: 2341
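To compare the two approaches fairly, time the whole batch with a single stopwatch rather than one per request; a minimal sketch using the functions above:

private static async Task MeasureWallTime()
{
    var total = Stopwatch.StartNew();
    await RunParallel(); // one timer around the whole batch, not one per request
    Console.WriteLine($"Parallel wall time: {total.ElapsedMilliseconds} ms");
}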
While I was using Parallel.ForEach in my program, I found that some threads never seemed to finish. In fact, it kept spawning new threads over and over, a behaviour that I wasn't expecting and definitely don't want.
I was able to reproduce this behaviour with the following code, which, just like my 'real' program, uses both processor and memory heavily (.NET 4.0 code):
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public class Node
{
    public Node Previous { get; private set; }

    public Node(Node previous)
    {
        Previous = previous;
    }
}

public class Program
{
    public static void Main(string[] args)
    {
        DateTime startMoment = DateTime.Now;
        int concurrentThreads = 0;
        var jobs = Enumerable.Range(0, 2000);

        Parallel.ForEach(jobs, delegate(int jobNr)
        {
            Interlocked.Increment(ref concurrentThreads);

            int heavyness = jobNr % 9;

            //Give the processor and the garbage collector something to do...
            List<Node> nodes = new List<Node>();
            Node current = null;
            for (int y = 0; y < 1024 * 1024 * heavyness; y++)
            {
                current = new Node(current);
                nodes.Add(current);
            }

            TimeSpan elapsed = DateTime.Now - startMoment;
            int threadsRemaining = Interlocked.Decrement(ref concurrentThreads);
            Console.WriteLine("[{0:mm\\:ss}] Job {1,4} complete. {2} threads remaining.",
                elapsed, jobNr, threadsRemaining);
        });
    }
}
When run on my quad-core, it initially starts off with 4 concurrent threads, just as you would expect. However, over time more and more threads are created. Eventually, the program throws an OutOfMemoryException:
[00:00] Job 0 complete. 3 threads remaining.
[00:01] Job 1 complete. 4 threads remaining.
[00:01] Job 2 complete. 4 threads remaining.
[00:02] Job 3 complete. 4 threads remaining.
[00:05] Job 9 complete. 5 threads remaining.
[00:05] Job 4 complete. 5 threads remaining.
[00:05] Job 5 complete. 5 threads remaining.
[00:05] Job 10 complete. 5 threads remaining.
[00:08] Job 11 complete. 5 threads remaining.
[00:08] Job 6 complete. 5 threads remaining.
...
[00:55] Job 67 complete. 7 threads remaining.
[00:56] Job 81 complete. 8 threads remaining.
...
[01:54] Job 107 complete. 11 threads remaining.
[02:00] Job 121 complete. 12 threads remaining.
..
[02:55] Job 115 complete. 19 threads remaining.
[03:02] Job 166 complete. 21 threads remaining.
...
[03:41] Job 113 complete. 28 threads remaining.
<OutOfMemoryException>
The memory usage graph for the experiment above is as follows:
(The screenshot is in Dutch; the top part represents processor usage, the bottom part memory usage.) As you can see, it looks like a new thread is being spawned almost every time the garbage collector gets in the way (as can be seen in the dips of memory usage).
Can anyone explain why this is happening, and what I can do about it? I just want .NET to stop spawning new threads, and finish the existing threads first...
You can limit the maximum number of threads that get created by specifying a ParallelOptions instance with the MaxDegreeOfParallelism property set:
var jobs = Enumerable.Range(0, 2000);
ParallelOptions po = new ParallelOptions
{
    MaxDegreeOfParallelism = Environment.ProcessorCount
};
Parallel.ForEach(jobs, po, jobNr =>
{
    // ...
});
As to why you're getting the behaviour you're observing: The TPL (which underlies PLINQ) is, by default, at liberty to guess the optimal number of threads to use. Whenever a parallel task blocks, the task scheduler may create a new thread in order to maintain progress. In your case, the blocking might be happening implicitly; for example, through the Console.WriteLine call, or (as you observed) during garbage collection.
From Concurrency Levels Tuning with Task Parallel Library (How Many Threads to Use?):
Since the TPL default policy is to use one thread per processor, we can conclude that TPL initially assumes that the workload of a task is ~100% working and 0% waiting, and if the initial assumption fails and the task enters a waiting state (i.e. starts blocking) - TPL will take the liberty to add threads as appropriate.
You should probably read a bit about how the task scheduler works.
Parallel Programming with Microsoft .NET - Parallel Tasks
(latter half of the page)
"The .NET thread pool automatically manages the number of worker threads in the pool. It adds and removes threads according to built-in heuristics. The .NET thread pool has two main mechanisms for injecting threads: a starvation-avoidance mechanism that adds worker threads if it sees no progress being made on queued items and a hill-climbing heuristic that tries to maximize throughput while using as few threads as possible.
The goal of starvation avoidance is to prevent deadlock. This kind of deadlock can occur when a worker thread waits for a synchronization event that can only be satisfied by a work item that is still pending in the thread pool's global or local queues. If there were a fixed number of worker threads, and all of those threads were similarly blocked, the system would be unable to ever make further progress. Adding a new worker thread resolves the problem.
A goal of the hill-climbing heuristic is to improve the utilization of cores when threads are blocked by I/O or other wait conditions that stall the processor. By default, the managed thread pool has one worker thread per core. If one of these worker threads becomes blocked, there's a chance that a core might be underutilized, depending on the computer's overall workload. The thread injection logic doesn't distinguish between a thread that's blocked and a thread that's performing a lengthy, processor-intensive operation. Therefore, whenever the thread pool's global or local queues contain pending work items, active work items that take a long time to run (more than a half second) can trigger the creation of new thread pool worker threads."
You can mark a task as LongRunning, but this has the side effect of allocating a thread for it from outside the thread pool, which means that the task cannot be inlined.
Remember that Parallel.For treats the work it is given in blocks, so even if the work in one iteration is fairly small, the overall work done by the task invoking the loop body may appear longer to the scheduler.
Most calls to the GC in and of themselves aren't blocking (it runs on a separate thread), but if you wait for the GC to complete then this does block. Remember also that the GC rearranges memory, so this may have some side effects (and blocking) if you are trying to allocate memory while the GC is running. I don't have specifics here, but I know the PPL has some memory allocation features specifically for concurrent memory management for this reason.
Looking at your code's output, it seems that things are running for many seconds, so I'm not surprised that you are seeing thread injection. However, I seem to remember that the default thread pool size is roughly 30 threads (probably depending on the number of cores on your system). A thread takes up roughly 1 MB of memory before your code allocates any more, so I'm not clear why you could get an out-of-memory exception here.
I've posted the follow-up question "How to count the amount of concurrent threads in .NET application?"
If you count the threads directly, their number in Parallel.For() mostly only increases (very rarely and insignificantly decreasing) and is not released after loop completion.
Checked this in both Release and Debug mode, with
ParallelOptions po = new ParallelOptions
{
MaxDegreeOfParallelism = Environment.ProcessorCount
};
and without it.
The numbers vary, but the conclusions are the same.
Here is the complete code I was using, if someone wants to play with it:
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

namespace Edit4Posting
{
    public class Node
    {
        public Node Previous { get; private set; }

        public Node(Node previous)
        {
            Previous = previous;
        }
    }

    public class Edit4Posting
    {
        public static void Main(string[] args)
        {
            int concurrentThreads = 0;
            int directThreadsCount = 0;
            int diagThreadCount = 0;

            var jobs = Enumerable.Range(0, 160);
            ParallelOptions po = new ParallelOptions
            {
                MaxDegreeOfParallelism = Environment.ProcessorCount
            };
            Parallel.ForEach(jobs, po, delegate(int jobNr)
            //Parallel.ForEach(jobs, delegate(int jobNr)
            {
                int threadsRemaining = Interlocked.Increment(ref concurrentThreads);

                int heavyness = jobNr % 9;

                //Give the processor and the garbage collector something to do...
                List<Node> nodes = new List<Node>();
                Node current = null;
                //for (int y = 0; y < 1024 * 1024 * heavyness; y++)
                for (int y = 0; y < 1024 * 24 * heavyness; y++)
                {
                    current = new Node(current);
                    nodes.Add(current);
                }
                //*******************************
                directThreadsCount = Process.GetCurrentProcess().Threads.Count;
                //*******************************
                threadsRemaining = Interlocked.Decrement(ref concurrentThreads);
                Console.WriteLine("[Job {0} complete. {1} threads remaining but directThreadsCount == {2}",
                    jobNr, threadsRemaining, directThreadsCount);
            });
            Console.WriteLine("FINISHED");
            Console.ReadLine();
        }
    }
}
I'm using the C# TPL and I'm having a problem with producer/consumer code: for some reason, TPL doesn't reuse threads and keeps creating new ones without stopping.
I made a simple example to demonstrate this behavior:
using System;
using System.Collections.Concurrent;
using System.Diagnostics;
using System.Threading;
using System.Threading.Tasks;

class Program
{
    static BlockingCollection<int> m_Buffer = new BlockingCollection<int>(1);
    static CancellationTokenSource m_Cts = new CancellationTokenSource();

    static void Producer()
    {
        try
        {
            while (!m_Cts.IsCancellationRequested)
            {
                Console.WriteLine("Enqueuing job");
                m_Buffer.Add(0);
                Thread.Sleep(1000);
            }
        }
        finally
        {
            m_Buffer.CompleteAdding();
        }
    }

    static void Consumer()
    {
        Parallel.ForEach(m_Buffer.GetConsumingEnumerable(), Run);
    }

    static void Run(int i)
    {
        Console.WriteLine(
            "Job Processed\tThread: {0}\tProcess Thread Count: {1}",
            Thread.CurrentThread.ManagedThreadId,
            Process.GetCurrentProcess().Threads.Count);
    }

    static void Main(string[] args)
    {
        Task producer = new Task(Producer);
        Task consumer = new Task(Consumer);
        producer.Start();
        consumer.Start();
        Console.ReadKey();
        m_Cts.Cancel();
        Task.WaitAll(producer, consumer);
    }
}
This code creates 2 tasks, Producer and Consumer. Producer adds 1 work item every second, and Consumer only prints out a string with information. I would assume that 1 consumer thread is enough in this situation, because tasks are processed much faster than they are added to the queue, but what actually happens is that every second the number of threads in the process grows by 1, as if TPL were creating a new thread for every item.
After trying to understand what's happening, I also noticed another thing: even though the BlockingCollection size is 1, after a while Consumer starts getting called in bursts. For example, this is how it starts:
Enqueuing job
Job Processed Thread: 4 Process Thread Count: 9
Enqueuing job
Job Processed Thread: 6 Process Thread Count: 9
Enqueuing job
Job Processed Thread: 5 Process Thread Count: 10
Enqueuing job
Job Processed Thread: 4 Process Thread Count: 10
Enqueuing job
Job Processed Thread: 6 Process Thread Count: 11
and this is how it's processing items less than a minute later:
Enqueuing job
Job Processed Thread: 25 Process Thread Count: 52
Enqueuing job
Enqueuing job
Job Processed Thread: 5 Process Thread Count: 54
Job Processed Thread: 5 Process Thread Count: 54
And because threads get disposed after the Parallel.ForEach loop finishes (I don't show it in this example, but it was in the real project), I assumed it had something to do with ForEach specifically. I found this article http://reedcopsey.com/2010/01/26/parallelism-in-net-part-5-partitioning-of-work/, and I thought that my problem was caused by the default partitioner, so I took a custom partitioner from the TPL Examples that feeds consumer threads items one by one, and although it fixed the order of execution (got rid of the delay)...
Enqueuing job
Job Processed Thread: 71 Process Thread Count: 140
Enqueuing job
Job Processed Thread: 12 Process Thread Count: 141
Enqueuing job
Job Processed Thread: 72 Process Thread Count: 142
Enqueuing job
Job Processed Thread: 38 Process Thread Count: 143
Enqueuing job
Job Processed Thread: 73 Process Thread Count: 143
Enqueuing job
Job Processed Thread: 21 Process Thread Count: 144
Enqueuing job
Job Processed Thread: 74 Process Thread Count: 145
...it didn't stop threads from growing
I know about ParallelOptions.MaxDegreeOfParallelism, but I still want to understand what's happening with TPL and why it creates hundreds of threads for no reason
In my project I have code that has to run for hours, reading new data from a database, putting it into a BlockingCollection, and having the data processed by other code. There's 1 new item about every 5 seconds, and processing one takes from several milliseconds to almost a minute. After running for about 10 minutes, the thread count reached over 1000 threads.
There are two things that together cause this behavior:
ThreadPool tries to use the optimal number of threads for your situation. But if one of the threads in the pool blocks, the pool sees this as if that thread wasn't doing any useful work and so it tends to create another thread soon after that. What this means is that if you have a lot of blocking, ThreadPool is really bad at guessing the optimal number of threads and it tends to create new threads until it reaches the limit.
Parallel.ForEach() trusts the ThreadPool to guess the correct number of threads, unless you set the maximum number of threads explicitly. Parallel.ForEach() was also primarily meant for bounded collections, not streams of data.
When you combine these two things with GetConsumingEnumerable(), what you get is that Parallel.ForEach() creates threads that are almost always blocked. The ThreadPool sees this, and, to try to keep the CPU utilized, creates more and more threads.
The correct solution here is to set MaxDegreeOfParallelism. If your computations are CPU-bound, the best value is most likely Environment.ProcessorCount. If they are IO-bound, you will have to find out the best value experimentally.
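Applied to the code in the question, that looks like this (a sketch assuming the work is CPU-bound):

static void Consumer()
{
    var options = new ParallelOptions { MaxDegreeOfParallelism = Environment.ProcessorCount };
    Parallel.ForEach(m_Buffer.GetConsumingEnumerable(), options, Run);
}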
Another option, if you can use .Net 4.5, is to use TPL Dataflow. This library was made specifically to process streams of data, like you have, so it doesn't have the problems your code has. It's actually even better than that and doesn't use any threads at all when it's not processing anything currently.
Note: There is also a good reason why a new thread is created for each new item, but explaining that would require me to explain how Parallel.ForEach() works in more detail, and I feel that's not necessary here.
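For reference, a Dataflow version would replace both the BlockingCollection and the Parallel.ForEach consumer with a single ActionBlock. A hedged sketch against the question's code; it requires the System.Threading.Tasks.Dataflow package:

// add at the top of the file: using System.Threading.Tasks.Dataflow;
static ActionBlock<int> m_Block = new ActionBlock<int>(
    Run, // reuse the question's Run method
    new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = Environment.ProcessorCount });

static void Producer()
{
    while (!m_Cts.IsCancellationRequested)
    {
        Console.WriteLine("Enqueuing job");
        m_Block.Post(0);   // replaces m_Buffer.Add(0)
        Thread.Sleep(1000);
    }
    m_Block.Complete();    // replaces m_Buffer.CompleteAdding()
}

// The Consumer task is no longer needed; await m_Block.Completion to wait for shutdown.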
My laptop has 2 logical processors, and I stumbled upon the scenario where, if I schedule 2 tasks that take longer than 1 second without designating them long-running, subsequent tasks are started only after 1 second has elapsed. Is it possible to change this timeout?
I know normal tasks should be short-running (much shorter than a second if possible); I'm just wondering whether I am seeing hard-coded TPL behavior or whether I can influence it in any way other than designating tasks long-running.
This Console app method should demonstrate the behavior for a machine with any number of processors:
static void Main(string[] args)
{
    var timer = new Stopwatch();
    timer.Start();
    int numberOfTasks = Environment.ProcessorCount;
    var rudeTasks = new List<Task>();
    var shortTasks = new List<Task>();

    for (int index = 0; index < numberOfTasks; index++)
    {
        int capturedIndex = index;
        rudeTasks.Add(Task.Factory.StartNew(() =>
        {
            Console.WriteLine("Starting rude task {0} at {1}ms", capturedIndex, timer.ElapsedMilliseconds);
            Thread.Sleep(5000);
        }));
    }

    for (int index = 0; index < numberOfTasks; index++)
    {
        int capturedIndex = index;
        shortTasks.Add(Task.Factory.StartNew(() =>
        {
            Console.WriteLine("Short-running task {0} running at {1}ms", capturedIndex, timer.ElapsedMilliseconds);
        }));
    }

    Task.WaitAll(shortTasks.ToArray());
    Console.WriteLine("Finished waiting for short tasks at {0}ms", timer.ElapsedMilliseconds);
    Task.WaitAll(rudeTasks.ToArray());
    Console.WriteLine("Finished waiting for rude tasks at {0}ms", timer.ElapsedMilliseconds);
    Console.ReadLine();
}
Here is the app's output on my 2 proc laptop:
Starting rude task 0 at 2ms
Starting rude task 1 at 2ms
Short-running task 0 running at 1002ms
Short-running task 1 running at 1002ms
Finished waiting for short tasks at 1002ms
Finished waiting for rude tasks at 5004ms
Press any key to continue . . .
The lines:
Short-running task 0 running at 1002ms
Short-running task 1 running at 1002ms
indicate that there is a 1 second timeout or something of that nature allowing the shorter-running tasks to get scheduled over the 'rude' tasks. That's what I'm inquiring about.
The behavior that you are seeing is not specific to the TPL; it's specific to the TPL's default scheduler. The scheduler is attempting to increase the number of threads so that the two that are running don't "hog" the CPU and choke out the others. It's also helpful in avoiding deadlock situations if the two that are running start and wait on Tasks themselves.
If you want to change the scheduling behavior, you might want to look into implementing your own TaskScheduler.
This is standard behavior for the thread pool scheduler. It tries to keep the number of active threads equal to the number of cores, but it can't do that job very well when your tasks do a lot of blocking instead of running (sleeping, in your case). Twice a second it allows another thread to run to try to work down the backlog. It seems you have a dual-core CPU.
The proper workaround is to use TaskCreationOptions.LongRunning so the scheduler uses a regular Thread instead of a thread pool thread. An improper workaround is to use ThreadPool.SetMinThreads. But you should perhaps focus on doing real work in your tasks; Sleep() is not a very good simulation of that.
The problem is that it takes a while for the scheduler to start the new tasks, as it tries to determine whether a task is long-running. You can tell the TPL that a task is long-running via a parameter when creating the task:
for (int index = 0; index < numberOfTasks; index++)
{
    int capturedIndex = index;
    rudeTasks.Add(Task.Factory.StartNew(() =>
    {
        Console.WriteLine("Starting rude task {0} at {1}ms", capturedIndex, timer.ElapsedMilliseconds);
        Thread.Sleep(3000);
    }, TaskCreationOptions.LongRunning));
}
Resulting in:
Starting rude task 0 at 11ms
Starting rude task 1 at 13ms
Starting rude task 2 at 15ms
Starting rude task 3 at 19ms
Short-running task 0 running at 45ms
Short-running task 1 running at 45ms
Short-running task 2 running at 45ms
Short-running task 3 running at 45ms
Finished waiting for short tasks at 46ms
Finished waiting for rude tasks at 3019ms