I've noticed that firing up several Task.Delay() calls basically "at the same time" causes systematic, periodic long pauses in execution. Not just in one thread, but in all running threads.
Here's an old SO question, which probably describes the same issue: await Task.Delay(foo); takes seconds instead of ms
I hope it's ok to re-surface this with a fresh take, since the problem still exists and I haven't found any other workaround than "use Thread.Sleep", which doesn't really work in all cases.
Here's a test code:
static Stopwatch totalTime = new Stopwatch();

static void Main(string[] args)
{
    Task[] tasks = new Task[100];
    totalTime.Start();
    for (int i = 0; i < 100; i++)
    {
        tasks[i] = TestDelay(1000, 10, i);
    }
    Task.WaitAll(tasks);
}

private static async Task TestDelay(int loops, int delay, int id)
{
    int exact = 0;
    int close = 0;
    int off = 0;
    Stopwatch stopwatch = new Stopwatch();
    for (int i = 0; i < loops; i++)
    {
        stopwatch.Restart();
        await Task.Delay(delay);
        long duration = stopwatch.ElapsedMilliseconds;
        if (duration == delay) ++exact;
        else if (duration < delay + 10) ++close;
        else
        {
            // This is seen in chunks for all the tasks at once!
            Console.WriteLine(totalTime.ElapsedMilliseconds + " ------ " + id + ": " + duration + "ms");
            ++off;
        }
    }
    Console.WriteLine(totalTime.ElapsedMilliseconds + " -DONE- " + id + " Exact: " + exact + ", Close: " + close + ", Off:" + off);
}
When running the code, there will be 1-3 points in time at which all of the N tasks block/hang/something for significantly more than 10 ms, more like 100-500 ms. This happens to all tasks, and at the same time. I've added the relevant logging, in case someone wants to try it and fiddle with the numbers.
Finally the obvious question is: Why is this happening, and is there any way to avoid it? Can anyone run the code and NOT get the delays?
Tested with .NET Core 3.1 and .NET 5.0. Ran on macOS and Linux.
Changing min threads doesn't have any effect on this.
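For reference, the min thread change was done roughly like this (the numbers are illustrative; none of the values I tried made a difference):
// Raise the worker/IOCP thread pool minimums before starting the tasks.
ThreadPool.SetMinThreads(workerThreads: 200, completionPortThreads: 200);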
Just for laughs, I tried SemaphoreSlim.WaitAsync(millis) (on an always unsignaled semaphore), which funnily enough has the same problem.
EDIT: Here's a sample output:
136 ------ 65: 117ms
136 ------ 73: 117ms
160 ------ 99: 140ms
... all 100 of these
161 ------ 3: 144ms
Similar output is printed later in the execution as well.
These lines are printed when a task's delay takes more than 10 ms longer than requested.
So the first number is the point in time, which is almost the same for all tasks, so I assume they hit the same hang in execution. The second number is just the task id to tell them apart. The last number is the stopwatch-measured delay, which is significantly more than the requested 10 ms.
Being 10-20 ms off is easily explained by timer inaccuracy, but 10x is not.
I've tried looking into GC problems, but the pause doesn't happen during a manual GC.Collect(), and when it does happen I don't see changes in a heap dump. It's still a possibility, but I'm lost at pinpointing it.
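One way to sanity-check the GC theory is to compare GC.CollectionCount around each delay. Here's a minimal sketch of that idea, reusing the variables from the test code above:
int gen2Before = GC.CollectionCount(2);
stopwatch.Restart();
await Task.Delay(delay);
long duration = stopwatch.ElapsedMilliseconds;
if (duration > delay + 10)
{
    // If this prints 0, the long pause happened without a gen2 collection in between.
    Console.WriteLine($"Off by {duration - delay}ms, gen2 collections during the delay: {GC.CollectionCount(2) - gen2Before}");
}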
I'll do the unthinkable, and answer my own question, just in case anyone else stumbles upon this.
First, thanks to @paulomorgado for pointing me towards thread pool latency. That is indeed the problem if you fire up hundreds of Task.Delay() calls in a short period of time.
The way I solved this was to create a separate Thread that keeps track of the requested delays and uses TaskCompletionSource to enable asynchronous awaits on them.
E.g. create a struct with three fields: start time, delay duration and a TaskCompletionSource. Have the thread loop through these (inside a lock) and, whenever a duration has expired, mark the task done with TaskCompletionSource.SetResult().
Now you can have a custom async Delay(millis) method that:
- creates a new struct
- adds it to a "requested delays" list (inside the lock)
- awaits the task completion
- removes the struct from the list (inside the lock)
- returns
A custom TaskScheduler with the needed threads might be a fancier solution, but I found this approach simple and clean. And it seems to do the trick, especially since you can have more than one thread going through all the delays for extra efficiency. Obviously happy to have this approach murdered with any flaws you might notice.
Please note that this approach probably only makes sense if your code is filled with asynchronous delays for some reason (like mine).
EDIT: Quick sample code for the relevant parts. This needs some optimizing in how the locks, loops, allocations, and lists are handled, but even with this I can see a HUGE improvement.
With ridiculously short delays (say 10 ms), this shows an error of at most 80 ms (tried with 5 handler threads), whereas with Task.Delay it's at least 100 ms, up to 500 ms. With longer, reasonable delays (100 ms+) this is almost flawless, whereas Task.Delay() still hits the same 100-500 ms surprise delay, at least in the beginning.
private struct CustomDelay
{
    public TaskCompletionSource Completion;
    public long Started;
    public long Delay;
}

// Shared state used by the handler thread and Delay() below.
private static readonly object _delayListLock = new object();
private static volatile bool _running = true;
private static readonly Stopwatch _elapsed = Stopwatch.StartNew();
private static long TotalElapsed() => _elapsed.ElapsedMilliseconds;

// Your favorite data structure here. Using List for clarity. Note: using separate blocks based on delay granularity might be a good idea.
private static List<CustomDelay> _requestedDelays = new List<CustomDelay>();

// Create threads from this. Sleep can be longer if there are several threads.
private static void CustomDelayHandler()
{
    while (_running)
    {
        Thread.Sleep(10); // To avoid a busy loop
        lock (_delayListLock)
        {
            for (int i = 0; i < _requestedDelays.Count; ++i)
            {
                CustomDelay delay = _requestedDelays[i];
                if (!delay.Completion.Task.IsCompleted)
                {
                    if (TotalElapsed() - delay.Started >= delay.Delay)
                    {
                        delay.Completion.SetResult();
                    }
                }
            }
        }
    }
}

// Use this instead of Task.Delay(millis)
private static async Task Delay(int ms)
{
    if (ms <= 0) return;
    CustomDelay delay = new CustomDelay()
    {
        // RunContinuationsAsynchronously keeps awaiting continuations from running
        // inline on the handler thread (inside the lock) when SetResult() is called.
        Completion = new TaskCompletionSource(TaskCreationOptions.RunContinuationsAsynchronously),
        Delay = ms,
        Started = TotalElapsed()
    };
    lock (_delayListLock)
    {
        _requestedDelays.Add(delay);
    }
    await delay.Completion.Task;
    lock (_delayListLock)
    {
        _requestedDelays.Remove(delay);
    }
}
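For completeness, this is roughly how the pieces above can be wired together; the handler thread count and the surrounding Main are illustrative and assume everything lives in the same class:
// Start a couple of background handler threads once at startup,
// then await the custom Delay from async code instead of Task.Delay.
public static async Task Main()
{
    for (int i = 0; i < 2; i++)
    {
        new Thread(CustomDelayHandler) { IsBackground = true }.Start();
    }

    Task[] tasks = Enumerable.Range(0, 100)
        .Select(_ => Task.Run(async () => await Delay(100)))
        .ToArray();
    await Task.WhenAll(tasks);

    _running = false; // let the handler threads exit
}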
Here is my attempt to reproduce your observations. I am creating 100 tasks, and each task repeatedly awaits a 10 msec Task.Delay in a loop. The actual duration of each Delay is measured with a Stopwatch, and is used to update a dictionary that holds the occurrences of each duration (all measurements with the same integer duration are aggregated into a single entry in the dictionary). The total duration of the test is 10 seconds.
ThreadPool.SetMinThreads(100, 100);
const int nominalDelay = 10;
var cts = new CancellationTokenSource(10000); // Duration of the test
var durations = new ConcurrentDictionary<long, int>();
var tasks = Enumerable.Range(1, 100).Select(n => Task.Run(async () =>
{
    var stopwatch = new Stopwatch();
    while (true)
    {
        stopwatch.Restart();
        try { await Task.Delay(nominalDelay, cts.Token); }
        catch (OperationCanceledException) { break; }
        long duration = stopwatch.ElapsedMilliseconds;
        durations.AddOrUpdate(duration, _ => 1, (_, count) => count + 1);
    }
})).ToArray();
Task.WaitAll(tasks);
var totalTasks = durations.Values.Sum();
var totalDuration = durations.Select(pair => pair.Key * pair.Value).Sum();
Console.WriteLine($"Nominal delay: {nominalDelay} msec");
Console.WriteLine($"Min duration: {durations.Keys.Min()} msec");
Console.WriteLine($"Avg duration: {(double)totalDuration / totalTasks:#,0.0} msec");
Console.WriteLine($"Max duration: {durations.Keys.Max()} msec");
Console.WriteLine($"Total tasks: {totalTasks:#,0}");
Console.WriteLine($"---Occurrences by Duration---");
foreach (var pair in durations.OrderBy(e => e.Key))
{
    Console.WriteLine($"Duration {pair.Key,2} msec, Occurrences: {pair.Value:#,0}");
}
I ran the program on .NET Core 3.1.3, in Release mode without the debugger attached. Here are the results:
(Try it on fiddle)
Nominal delay: 10 msec
Min duration: 9 msec
Avg duration: 15.2 msec
Max duration: 40 msec
Total tasks: 63,418
---Occurrences by Duration---
Duration 9 msec, Occurrences: 165
Duration 10 msec, Occurrences: 11,373
Duration 11 msec, Occurrences: 21,299
Duration 12 msec, Occurrences: 2,745
Duration 13 msec, Occurrences: 878
Duration 14 msec, Occurrences: 375
Duration 15 msec, Occurrences: 252
Duration 16 msec, Occurrences: 7
Duration 17 msec, Occurrences: 16
Duration 18 msec, Occurrences: 102
Duration 19 msec, Occurrences: 110
Duration 20 msec, Occurrences: 1,995
Duration 21 msec, Occurrences: 14,839
Duration 22 msec, Occurrences: 7,347
Duration 23 msec, Occurrences: 1,269
Duration 24 msec, Occurrences: 166
Duration 25 msec, Occurrences: 136
Duration 26 msec, Occurrences: 264
Duration 27 msec, Occurrences: 47
Duration 28 msec, Occurrences: 1
Duration 36 msec, Occurrences: 5
Duration 37 msec, Occurrences: 8
Duration 38 msec, Occurrences: 9
Duration 39 msec, Occurrences: 7
Duration 40 msec, Occurrences: 3
Running the program on .NET Framework 4.8.3801.0 produces similar results.
TL;DR, I was not able to reproduce the 100-500 msec durations you observed.
This question already has answers here: ThreadPool not starting new Thread instantly (2 answers). Closed 1 year ago.
I am working on a program where we are constantly starting new threads to go off and do a piece of work. We noticed that even though we might have started 10 threads only 3 or 4 were executing at a time. To test it out I made a basic example like this:
private void startThreads()
{
    for (int i = 0; i < 100; i++)
    {
        //Task.Run(() => someThread());
        //Thread t = new Thread(() => someThread());
        //t.Start();
        ThreadPool.QueueUserWorkItem(_ => someThread());
    }
}

private void someThread()
{
    Thread.Sleep(1000);
}
Simple stuff right? Well, the code creates the 100 threads and they start to execute... but only 3 or 4 at a time. When they complete the next threads start to execute. I would have expected that almost all of them start execution at the same time. For 100 threads (each with a 1 second sleep time) it takes about 30 seconds for all of them to complete. I would have thought it would have taken far less time than this.
I have tried using Thread.Start, ThreadPool and Tasks, all give me the exact same result. If I use ThreadPool and check for the available number of threads each time a thread runs there are always >2000 available worker threads and 1000 available async threads.
I just used the above as a test for our code to try and find out what is going on. In practice, the code spawns threads all over the place. The program is running at less than 5% CPU usage but is getting really slow because the threads aren't executing quick enough.
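The thread pool check mentioned above looks roughly like this (a sketch; the actual logging in our program is more involved):
// Called from inside someThread(): report how much of the pool is still free.
ThreadPool.GetAvailableThreads(out int workerThreads, out int completionPortThreads);
Console.WriteLine("Available worker threads: " + workerThreads + ", IO completion threads: " + completionPortThreads);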
Yes, you may only have a few threads running at the same time. That's how a ThreadPool works: it doesn't necessarily run all of the work items at the same time. They get queued up fast, but the ThreadPool decides when each one actually runs.
If you want to ensure all 100 threads run simultaneously you can use:
ThreadPool.SetMinThreads(100, 100);
For example, see the code below. This is the result without setting the thread pool minimum size:
No MinThreads
internal void startThreads()
{
    ThreadPool.GetMaxThreads(out int maxThread, out int completionPortThreads);
    stopwatch.Start();
    var result = Parallel.For(0, 20, (i) =>
    {
        ThreadPool.QueueUserWorkItem(someThread, i);
    });
    while (!result.IsCompleted) { }
    Console.WriteLine("Queueing completed...");
}

private void someThread(Object stateInfo)
{
    int threadNum = (int)stateInfo;
    Console.WriteLine(threadNum + " started.");
    Thread.Sleep(10);
    Console.WriteLine(threadNum + " finished.");
}
Result (No MinThreads)
9 started.
7 started.
11 started.
10 started.
1 finished.
12 started.
9 finished.
13 started.
2 finished.
4 finished.
15 started.
3 finished.
8 finished.
16 started.
10 finished.
6 finished.
19 started.
0 finished.
14 started.
5 finished.
7 finished.
17 started.
18 started.
11 finished.
With MinThreads
internal void startThreads()
{
    ThreadPool.GetMaxThreads(out int maxThread, out int completionPortThreads);
    ThreadPool.SetMinThreads(20, 20); // HERE <-------
    stopwatch.Start();
    var result = Parallel.For(0, 20, (i) =>
    {
        ThreadPool.QueueUserWorkItem(someThread, i);
    });
    while (!result.IsCompleted) { }
    Console.WriteLine("Queueing completed...");
}
Results
...
7 started.
15 started.
9 started.
12 started.
17 started.
13 started.
16 started.
19 started.
18 started.
5 finished.
3 finished.
4 finished.
6 finished.
0 finished.
14 finished.
1 finished.
10 finished.
...
class Program
{
    static void Main(string[] args)
    {
        new Program();
    }

    private static int prepTotal;
    private static readonly object Lock = new object();

    public Program()
    {
        var sw = new Stopwatch();
        sw.Start();
        Parallel.For((long)0, 10, new ParallelOptions { MaxDegreeOfParallelism = 1 }, (j) =>
        {
            DoIt();
        });
        sw.Stop();
        Console.WriteLine($"1 thread sum time is {prepTotal} ms. Total time is {sw.ElapsedMilliseconds} ms.");

        sw.Restart();
        prepTotal = 0;
        Parallel.For((long)0, 10, new ParallelOptions { MaxDegreeOfParallelism = 3 }, (j) =>
        {
            DoIt();
        });
        sw.Stop();
        Console.WriteLine($"3 thread sum time is {prepTotal} ms. Total time is {sw.ElapsedMilliseconds} ms.");

        sw.Restart();
        prepTotal = 0;
        Parallel.For((long)0, 10, new ParallelOptions { MaxDegreeOfParallelism = 1 }, (j) =>
        {
            DoIt();
        });
        sw.Stop();
        Console.WriteLine($"1 thread sum time is {prepTotal} ms. Total time is {sw.ElapsedMilliseconds} ms.");

        sw.Restart();
        prepTotal = 0;
        Parallel.For((long)0, 10, new ParallelOptions { MaxDegreeOfParallelism = 3 }, (j) =>
        {
            DoIt();
        });
        sw.Stop();
        Console.WriteLine($"3 thread sum time is {prepTotal} ms. Total time is {sw.ElapsedMilliseconds} ms.");

        Console.ReadLine();
    }

    private static void DoIt()
    {
        var sw2 = new Stopwatch();
        sw2.Start();
        using (var bmp = new Bitmap(3000, 3000))
        {
        }
        sw2.Stop();
        lock (Lock)
        {
            prepTotal += (int)sw2.ElapsedMilliseconds;
        }
    }
}
When I run my test code (derived from the original, really complex code) I get the following results. As you can see, the code running on more threads is almost 3 times slower (in summed per-call time). Does the Bitmap constructor do some blocking, or what?
1 thread sum time is 125 ms. Total time is 132 ms.
3 thread sum time is 360 ms. Total time is 132 ms.
1 thread sum time is 121 ms. Total time is 127 ms.
3 thread sum time is 364 ms. Total time is 128 ms.
Well, I just used a profiler to see if my guess was correct, and indeed, new Bitmap(3000, 3000) is almost entirely memory bound. So unless you have a server machine with multiple independent memory systems, adding more CPU doesn't help any. The bottleneck is memory.
The second most important part happens in the Dispose, which is again... almost entirely memory bound.
Multi-threading only helps with CPU-bound code. Since the CPU is much faster than any memory you may have in your system, the CPU is only really saturated when it can avoid working with memory (and other I/O devices). Your case is pretty much exactly the opposite - there's very little CPU work, and where there is CPU work, it's mostly synchronized (e.g. requesting and freeing virtual memory). Not a lot of opportunities for parallelization.
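To see the contrast, here is a small sketch with a purely CPU-bound body instead of the Bitmap allocation (the Burn function is made up for illustration). With work like this, the summed per-call time should stay roughly flat as MaxDegreeOfParallelism grows, instead of tripling:
// CPU-bound work: no large allocations, so extra threads actually help.
static double Burn()
{
    double acc = 0;
    for (int i = 1; i < 5_000_000; i++) acc += Math.Sqrt(i);
    return acc;
}

var sw = Stopwatch.StartNew();
Parallel.For(0, 10, new ParallelOptions { MaxDegreeOfParallelism = 3 }, _ => Burn());
sw.Stop();
Console.WriteLine($"3 threads, total time: {sw.ElapsedMilliseconds} ms");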
I need some advice. I have an application that processes trade information from a real-time data feed from the stock exchanges. My processing is falling behind.
Since I'm running on a 3 GHz Intel i7 with 32 GB of main memory, I should have enough power for this application. The Parse routine stores trade information in a SQL Server 2014 database, running on a Windows 2012 R2 Server.
I put the following timing information in the main processing loop:
invokeTime.Restart();
Parallel.Invoke(() => parserObj.Parse(julian, data));
invokeTime.Stop();
var milliseconds = invokeTime.ElapsedMilliseconds;
if (milliseconds > maxMilliseconds) {
    maxMilliseconds = milliseconds;
    messageBar.SetText("Invoke: " + milliseconds);
}
I'm getting as much as 1122 milliseconds to do the Parallel.Invoke. A similar timing test shows that the Parse routine only takes 7 milliseconds (max).
Is there a better way of processing the data, other than doing the Parallel.Invoke?
Any suggestions will be greatly appreciated.
Charles
Have you tried
Task.Factory.StartNew(() => {
    parserObj.Parse(julian, data);
});
What does your Parse method look like? Maybe the bottleneck is in there...
In Stephen Toub's article: http://blogs.msdn.com/b/pfxteam/archive/2011/10/24/10229468.aspx he describes: "Task.Run can and should be used for the most common cases of simply offloading some work to be processed on the ThreadPool". That's exactly what I want to do, offload the Parse routine to a background thread. So I changed:
Parallel.Invoke(() => parserObj.Parse(julian, data));
to:
Task.Run(() => parserObj.Parse(julian, data));
I also increased the number of threads in the ThreadPool from 8 to 80 by doing:
int minWorker, minIOC;
ThreadPool.GetMinThreads(out minWorker, out minIOC);
var newMinWorker = 10 * minWorker;
var newMinIOC = 10 * minIOC;
if (ThreadPool.SetMinThreads(newMinWorker, newMinIOC)) {
    textBox.AddLine("The minimum no. of worker threads is now: " + newMinWorker);
} else {
    textBox.AddLine("Drat! The minimum no. of worker threads could not be changed.");
}
The parsing loop, which runs for 6 1/2 hours/day, looks like:
var stopWatch = new Stopwatch();
var maxMilliseconds = 0L;
while ((data = GetDataFromIQClient()) != null) {
    if (MarketHours.IsMarketClosedPlus2()) {
        break;
    }
    stopWatch.Restart();
    Task.Run(() => parserObj.Parse(julian, data));
    stopWatch.Stop();
    var milliseconds = stopWatch.ElapsedMilliseconds;
    if (milliseconds > maxMilliseconds) {
        maxMilliseconds = milliseconds;
        messageBar.SetText("Task.Run: " + milliseconds);
    }
}
Now, the maximum time spent to call Task.Run was 96 milliseconds, and the maximum time spent in parser was 18 milliseconds. I'm now keeping up with the data transmission.
Charles
I have a ConcurrentBag<UrlInfo> urls whose items are being processed in parallel (nothing is being written back to the collection):
urls.AsParallel<UrlInfo>().WithDegreeOfParallelism(17).ForAll(item =>
{
    UrlInfo info = MakeSynchronousWebRequest(item);
    (myProgress as IProgress<UrlInfo>).Report(info);
});
I have the timeout set to 30 seconds in the web request. When a url that is very slow to respond is encountered, all of the parallel processing grinds to a halt. Is this expected behavior, or should I be searching out some problem in my code?
Here's the progress :
myProgress = new Progress<UrlInfo>(info =>
{
    Action action = () =>
    {
        Interlocked.Increment(ref itested);
        if (info.status == UrlInfo.UrlStatusCode.dead)
        {
            Interlocked.Increment(ref idead);
            this.BadUrls.Add(info);
        }
        dead.Content = idead.ToString();
        tested.Content = itested.ToString();
    };
    try
    {
        Dispatcher.BeginInvoke(action);
    }
    catch (Exception ex)
    {
    }
});
It's the expected behavior. AsParallel doesn't return until all the operations are finished. Since you're making synchronous requests, you've got to wait until your slowest one is finished. However note that even if you've got one really slow task hogging up a thread, the scheduler continues to schedule new tasks as old ones finish on the remaining threads.
Here's a rather instructive example. It creates 101 tasks. The first task hogs one thread for 5000 ms, the other 100 churn on the remaining 20 threads for 1000 ms each. So it schedules 20 of those tasks at a time, each running for one second, and goes through that cycle 5 times to get through all 100 tasks, for a total of 5000 ms. However, if you change the 101 to 102, you've got 101 tasks churning on the 20 threads, which will end up taking 6000 ms; that 101st task just doesn't have a thread to churn on until the 5-second mark. If you change the 101 to, say, 2, note that it still takes 5000 ms because you have to wait for the slow task to complete.
static void Main()
{
    ThreadPool.SetMinThreads(21, 21);
    var sw = new Stopwatch();
    sw.Start();
    Enumerable.Range(0, 101).AsParallel().WithDegreeOfParallelism(21).ForAll(i => Thread.Sleep(i == 0 ? 5000 : 1000));
    Console.WriteLine(sw.ElapsedMilliseconds);
}
I need to execute strategy.AllTablesUpdated(); for 50 strategies in 2 ms (and I need to repeat that ~500 times per second).
Using the code below, I discovered that just the Monitor.TryEnter call spends up to 1 ms (!!!), and I do that 50 times!
// must be called ~500 times per second
public void FinishUpdatingTables()
{
    foreach (Strategy strategy in strategies) // about ~50, should be executed in 2 ms
    {
        // this is slow and can be parallelized
        strategy.AllTablesUpdated();
    }
}

...................

public override bool AllTablesUpdated(Stopwatch sw)
{
    this.sw = sw;
    Checkpoint(this + " TryEnter attempt ");
    if (Monitor.TryEnter(desiredOrdersBuy))
    {
        Checkpoint(this + " TryEnter success ");
        try
        {
            OnAllTablesUpdated();
        }
        finally
        {
            Monitor.Exit(desiredOrdersBuy);
        }
        return true;
    }
    else
    {
        Checkpoint(this + " TryEnter failed ");
    }
    return false;
}

public void Checkpoint(string message)
{
    if (sw == null)
    {
        return;
    }
    long time = sw.ElapsedTicks / (Stopwatch.Frequency / (1000L * 1000L));
    Log.Push(LogItemType.Debug, message + time);
}
From the logs (in µs), a failed attempt spends ~1 ms:
12:55:43:778 Debug: TryEnter attempt 1264
12:55:43:779 Debug: TryEnter failed 2123
From the logs (in µs), a successful attempt spends ~0.01 ms:
12:55:49:701 Debug: TryEnter attempt 889
12:55:49:701 Debug: TryEnter success 900
So now I think that Monitor.TryEnter is too expensive to be executed one by one for 50 strategies. So I want to parallelize this work using Tasks, like this:
// must be called ~500 times per second
public void FinishUpdatingTables()
{
    foreach (Strategy strategy in strategies) // about ~50, should be executed in 2 ms
    {
        // this is slow and can be parallelized
        Task.Factory.StartNew(() =>
        {
            strategy.AllTablesUpdated();
        });
    }
}
I will also probably replace Monitor.TryEnter with a plain lock, since with this approach everything will be asynchronous anyway.
My questions:
Why is Monitor.TryEnter so slow? (1 ms if the lock is not obtained)
How good an idea is it to start 50 Tasks every 2 ms = 25,000 Tasks per second? Can .NET manage this effectively? I could also use a producer-consumer pattern with a BlockingCollection: start 50 "workers" only ONCE and then submit a new pack of 50 items to the BlockingCollection every 2 ms (see the sketch below). Would that be better?
How would you execute 50 methods that can be parallelized every 2 ms (500 times per second), 25,000 calls per second in total?
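Here is roughly what I have in mind for the producer-consumer variant from question 2 (a sketch; it assumes AllTablesUpdated() is safe to call from worker threads, and the worker count is illustrative):
// Long-lived workers consume strategies from a BlockingCollection,
// instead of starting 50 new Tasks every 2 ms.
private readonly BlockingCollection<Strategy> work = new BlockingCollection<Strategy>();

public void StartWorkers(int count)
{
    for (int i = 0; i < count; i++)
    {
        new Thread(() =>
        {
            foreach (Strategy strategy in work.GetConsumingEnumerable())
            {
                strategy.AllTablesUpdated();
            }
        }) { IsBackground = true }.Start();
    }
}

// must be called ~500 times per second
public void FinishUpdatingTables()
{
    foreach (Strategy strategy in strategies) // about ~50
    {
        work.Add(strategy);
    }
}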
Monitor.TryEnter(object) is just Monitor.TryEnter(object, 0), i.e. a zero-millisecond timeout. That 1 ms when the lock is not obtained is simply the overhead of attempting to acquire the lock.
You can start as many tasks as you like; they all use the ThreadPool, which is limited to a maximum number of threads. The maximum depends on your system: number of cores, memory, etc. It will certainly not be 25,000 threads, though. However, if you start meddling with the TPL scheduler you'll get into trouble. I'd just use Parallel.ForEach and see how far it gets me.
I'd also ensure that strategies is an IList, so that as many items as possible are fired off without waiting on an iterator.
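A minimal sketch of that suggestion (assuming AllTablesUpdated() tolerates being called from multiple threads in parallel):
// Fan the ~50 calls out across thread pool threads instead of running them one by one.
public void FinishUpdatingTables()
{
    Parallel.ForEach(strategies, strategy => strategy.AllTablesUpdated());
}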
You haven't posted the code for OnAllTablesUpdated(), but you keep the lock for the duration of that procedure. In all likelihood that's going to be your bottleneck.
Some questions: why are you using a lock for when the table is ready to be processed? Are delegates not possible?
Why lock while you're running the strategy? Are you modifying that table inside each strategy? If that is the case, can you not take a copy of it?