Async/await performance - c#

I'm working on performance optimization of the program which widely uses async/await feature. Generally speaking it downloads thousands of json documents through HTTP in parallel, parses them and builds some response using this data. We experience some issues with performance, when we handle many requests simultaneously (e.g. download 1000 jsons), we can see that a simple HTTP request can take a few minutes.
I wrote a small console app to test it on a simplified example:
class Program
{
static void Main(string[] args)
{
for (int i = 0; i < 100000; i++)
{
Task.Run(IoBoundWork);
}
Console.ReadKey();
}
private static async Task IoBoundWork()
{
var sw = Stopwatch.StartNew();
await Task.Delay(1000);
Console.WriteLine(sw.Elapsed);
}
}
And I can see similar behavior here:
The question is why "await Task.Delay(1000)" eventually takes 23 sec.

Task.Delay isn't broken, but you're performing 100,000 tasks which each take some time. It's the call to Console.WriteLine that is causing the problem in this particular case. Each call is cheap, but they're accessing a shared resource, so they aren't very highly parallelizable.
If you remove the call to Console.WriteLine, all the tasks complete very quickly. I changed your code to return the elapsed time that each task observes, and then print just a single line of output at the end - the maximum observed time. On my computer, without any Console.WriteLine call, I see output of about 1.16 seconds, showing very little inefficiency:
using System;
using System.Linq;
using System.Collections.Generic;
using System.Diagnostics;
using System.Threading;
using System.Threading.Tasks;
class Program
{
static void Main(string[] args)
{
ThreadPool.SetMinThreads(50000, 50000);
var tasks = Enumerable.Repeat(0, 100000)
.Select(_ => Task.Run(IoBoundWork))
.ToArray();
Task.WaitAll(tasks);
var maxTime = tasks.Max(t => t.Result);
Console.WriteLine($"Max: {maxTime}");
}
private static async Task<double> IoBoundWork()
{
var sw = Stopwatch.StartNew();
await Task.Delay(1000);
return sw.Elapsed.TotalSeconds;
}
}
You can then modify IoBoundWork to do different tasks, and see the effect. Examples of work to try:
CPU work (do something actively "hard" for the CPU, but briefly)
Synchronous sleeping (so the thread is blocked, but the CPU isn't)
Synchronous IO which doesn't have any shared bottlenecks (although that's generally hard, given that the disk or network is likely to end up being a shared resource bottleneck even if you're writing to different files etc)
Synchronous IO with a shared bottleneck such as Console.WriteLine
Asynchronous IO (await foo.WriteAsync(...) etc)
You can also try removing the call to Task.Delay(1000) or changing it. I found that by removing it entirely, the result was very small - whereas replacing it with Task.Yield was very similar to Task.Delay. It's worth remembering that as soon as your async method has to actually "pause" you're effectively doubling the task scheduling problem - instead of scheduling 100,000 operations, you're scheduling 200,000.
You'll see a different pattern in each case. Fundamentally, you're starting 100,000 tasks, asking them all to wait for a second, then asking them all to do something. That causes issues in terms of continuation scheduling that's async/await specific, but also plain resource management of "Performing 100,000 tasks each of which needs to write to the console is going to take a while."

If your problem is performance, async-await is the wrong solution.
async-await is all about availability. Availability to handle the screen and user impute, availability to handle HTTP requests, etc.
The synchronization work behind async-await will use more resources and take more time than simply blocking until the operation completes.
Your HTTP server will handle more requests because less threads will be blocked waiting for operations to complete but each request will take slightly longer.

Related

Do race conditions exist when using just async/await?

.NET 5.0
using Microsoft.VisualStudio.TestTools.UnitTesting;
using System.Threading.Tasks;
using System;
using System.Collections.Generic;
namespace AsyncTest
{
[TestClass]
public class AsyncTest
{
public async Task AppendNewIntVal(List<int> intVals)
{
await Task.Delay(new Random().Next(15, 45));
intVals.Add(new Random().Next());
}
public async Task AppendNewIntVal(int count, List<int> intVals)
{
var appendNewIntValTasks = new List<Task>();
for (var a = 0; a < count; a++)
{
appendNewIntValTasks.Add(AppendNewIntVal(intVals));
}
await Task.WhenAll(appendNewIntValTasks);
}
[TestMethod]
public async Task TestAsyncIntList()
{
var appendCount = 30;
var intVals = new List<int>();
await AppendNewIntVal(appendCount, intVals);
Assert.AreEqual(appendCount, intVals.Count);
}
}
}
The above code compiles and runs, but the test fails with output similar to:
Assert.AreEqual failed. Expected:<30>. Actual:<17>.
In the above example the "Actual" value is 17, but it varies between executions.
I know I am missing some understanding around how asynchronous programming works in .NET as I'm not getting the expected output.
From my understanding, the AppendNewIntVal method kicks off N number of tasks, then waits for them all to complete. If they have all completed, I'd expect they would have each appended a single value to the list but that's not the case. It looks like there's a race condition but I didn't think that was possible because the code is not multithreaded. What am I missing?
Yes, if you don't await each awaitable immediately, i.e. here:
appendNewIntValTasks.Add(AppendNewIntVal(intVals));
This line is in async terms equivalent to (in thread-based code) Thread.Start, and we now have no safety around the inner async code:
intVals.Add(new Random().Next());
which can now fail in the same concurrency ways when two flows call Add at the same time. You should also probably avoid new Random(), as that isn't necessarily random (it is time based on many framework versions, and can end up with two flows getting the same seed).
So: the code as shown is indeed dangerous.
The obviously safe version is:
public async Task AppendNewIntVal(int count, List<int> intVals)
{
for (var a = 0; a < count; a++)
{
await AppendNewIntVal(intVals);
}
}
It is possible to defer the await, but you're explicitly opting into concurrency when you do that, and your code needs to handle it suitably defensively.
Yes, race conditions do exist.
Async methods are basically tasks that can potentially run in parallel, depending on a task scheduler they are submitted to. The default one is ThreadPoolTaskScheduler, which is a wrapper around ThreadPool. Thus, if you submit your tasks to a scheduler (thread pool) that can execute multiple tasks in parallel, you are likely going to run into race conditions.
You could make your code a bit safer:
lock (intVals) intVals.Add(new Random().Next());
But then this opens up another can of worms :)
If you are interested in more details about async programming, see this link. Also this article is quite useful and explains best practices in asynchronous programming.
Happy (asynchronous) coding!
Yes, race conditions are indeed possible when using async/await in a way that introduces concurrency. To introduce concurrency you must:
Launch multiple asynchronous operations concurrently, i.e. launch the next operation without awaiting the completion of the previous operation, and,
Have no ambient synchronization¹ mechanism in place, namely a SynchronizationContext, that would synchronize the execution of the continuations of the asynchronous operations.
In your case both conditions are met, so the continuations are running on multiple threads concurrently. And since the List<T> class is not thread-safe, you get undefined behavior.
To see what effect a SynchronizationContext has in a situation like this, you can install the Nito.AsyncEx.Context package and do this:
[TestMethod]
public void TestAsyncIntList()
{
AsyncContext.Run(async () =>
{
var appendCount = 30;
var intVals = new List<int>();
await AppendNewIntVal(appendCount, intVals);
Assert.AreEqual(appendCount, intVals.Count);
});
}
FYI many types of applications install automatically a SynchronizationContext when launched (WPF and Windows Forms to name a few). Console applications do not though, hence the need to be extra cautious when writing an async-enabled console application.
¹ It's worth noting that the similar-looking terms synchronized/unsynchronized and synchronous/asynchronous are mostly unrelated. This can be a source of confusion for someone who is not familiar with these terms. The first term is about preventing multiple threads from accessing a shared resource concurrently. The second term is about doing something without blocking a thread.

How to run batch of Tasks?

I have a lot of Task what i running simultaneously. But sometimes there are too much of them to run in same time. So i want to run them by bunches for 100 Tasks. But i'm not sure how can i modify my code.
There is my current code:
protected void ValidateFile(List<MyFile> validFiles, MyFile file)
{
///do something
validFiles.add(file);
}
internal Task ValidateFilesAsync(List<MyFile> validFiles, SplashScreenManager splashScreen, MyFile file)
{
return Task.Run(() => ValidateFile(validFiles, file)).ContinueWith(
t => splashScreen?.SendCommand(SplashScreen.SplashScreenCommand.IncreaseGeneralActionValue,
1));
}
var validFiles = new List<MyFile>;
var tasks = new List<Task>();
foreach (var file in filesToValidate)
{
tasks.Add(ValidateFilesAsync(validFiles, splashScreenManager, file));
}
Task.WaitAll(tasks.ToArray());
I'm not very good with Tasks so code may be not an optimal, but somehow it's working.
I found that i can use Parallel.Foreach with MaxDegreeOfParallelism param, but for this i have to ValidateFile into Action and remove ValidateFilesAsync and in this case i will lose ContinueWith functional what i use to increase progress bar in gui.
How can i achive restriction of simultaneously running Tasks and if possible save ContinueWith-like functional?
I would suggest using Parallel.Foreach. Not sure what you mean with " have to ValidateFile into Action", just change your foreach-body to a lambda. There should be plenty of examples easily available.
To update the UI there are several ways to do it:
Use SynchronizationContext
Start a task using a task scheduler that runs the task on the UI thread
Update the progress variable on the background thread, and use a timer to poll the progress variable from the UI thread.
Keep in mind that IO, like reading files, may not improve much, if at all, by reading in parallel. Spinning disks are inherently serial, and have a fairly large seek times. SSDs are inherently parallel, but I would still not expect any huge performance gains from reading in parallel, especially not going up to 100 concurrent reads.

Using Task.Yield to overcome ThreadPool starvation while implementing producer/consumer pattern

Answering the question: Task.Yield - real usages?
I proposed to use Task.Yield allowing a pool thread to be reused by other tasks. In such pattern:
CancellationTokenSource cts;
void Start()
{
cts = new CancellationTokenSource();
// run async operation
var task = Task.Run(() => SomeWork(cts.Token), cts.Token);
// wait for completion
// after the completion handle the result/ cancellation/ errors
}
async Task<int> SomeWork(CancellationToken cancellationToken)
{
int result = 0;
bool loopAgain = true;
while (loopAgain)
{
// do something ... means a substantial work or a micro batch here - not processing a single byte
loopAgain = /* check for loop end && */ cancellationToken.IsCancellationRequested;
if (loopAgain) {
// reschedule the task to the threadpool and free this thread for other waiting tasks
await Task.Yield();
}
}
cancellationToken.ThrowIfCancellationRequested();
return result;
}
void Cancel()
{
// request cancelation
cts.Cancel();
}
But one user wrote
I don't think using Task.Yield to overcome ThreadPool starvation while
implementing producer/consumer pattern is a good idea. I suggest you
ask a separate question if you want to go into details as to why.
Anybody knows, why is not a good idea?
There are some good points left in the comments to your question. Being the user you quoted, I'd just like to sum it up: use the right tool for the job.
Using ThreadPool doesn't feel like the right tool for executing multiple continuous CPU-bound tasks, even if you try to organize some cooperative execution by turning them into state machines which yield CPU time to each other with await Task.Yield(). Thread switching is rather expensive; by doing await Task.Yield() on a tight loop you add a significant overhead. Besides, you should never take over the whole ThreadPool, as the .NET framework (and the underlying OS process) may need it for other things. On a related note, TPL even has the TaskCreationOptions.LongRunning option that requests to not run the task on a ThreadPool thread (rather, it creates a normal thread with new Thread() behind the scene).
That said, using a custom TaskScheduler with limited parallelism on some dedicated, out-of-pool threads with thread affinity for individual long-running tasks might be a different thing. At least, await continuations would be posted on the same thread, which should help reducing the switching overhead. This reminds me of a different problem I was trying to solve a while ago with ThreadAffinityTaskScheduler.
Still, depending on a particular scenario, it's usually better to use an existing well-established and tested tool. To name a few: Parallel Class, TPL Dataflow, System.Threading.Channels, Reactive Extensions.
There is also a whole range of existing industrial-strength solutions to deal with Publish-Subscribe pattern (RabbitMQ, PubNub, Redis, Azure Service Bus, Firebase Cloud Messaging (FCM), Amazon Simple Queue Service (SQS) etc).
After a bit of debating on the issue with other users - who are worried about the context switching and its influence on the performance.
I see what they are worried about.
But I meant: do something ... inside the loop to be a substantial task - usually in the form of a message handler which reads a message from the queue and processes it. The message handlers are usually user defined and the message bus executes them using some sort of dispatcher. The user can implement a handler which executes synchronously (nobody knows what the user will do), and without Task.Yield that will block the thread to process those synchronous tasks in a loop.
Not to be empty worded i added tests to github: https://github.com/BBGONE/TestThreadAffinity
They compare the ThreadAffinityTaskScheduler, .NET ThreadScheduler with BlockingCollection and .NET ThreadScheduler with Threading.Channels.
The tests show that for Ultra Short jobs the performance degradation is
around 15%. To use the Task.Yield without the performance degradation (even small) - it is not to use extremely short tasks and if the task is too short then combine shorter tasks into a bigger batch.
[The price of context switch] = [context switch duration] / ([job duration]+[context switch duration]).
In that case the influence of the switching the tasks is negligible on the performance. But it adds a better task cooperation and responsiveness of the system.
For long running tasks it is better to use a custom Scheduler which executes tasks on its own dedicated thread pool - (like the WorkStealingTaskScheduler).
For the mixed jobs - which can contain different parts - short running CPU bound, asynchronous and long running code parts. It is better to split the task into subtasks.
private async Task HandleLongRunMessage(TestMessage message, CancellationToken token = default(CancellationToken))
{
// SHORT SYNCHRONOUS TASK - execute as is on the default thread (from thread pool)
CPU_TASK(message, 50);
// IO BOUND ASYNCH TASK - used as is
await Task.Delay(50);
// BUT WRAP the LONG SYNCHRONOUS TASK inside the Task
// which is scheduled on the custom thread pool
// (to save threadpool threads)
await Task.Factory.StartNew(() => {
CPU_TASK(message, 100000);
}, token, TaskCreationOptions.DenyChildAttach, _workStealingTaskScheduler);
}

Concurrent requests with HttpClient take longer than expected

I have a webservice which receives multiple requests at the same time. For each request, I need to call another webservice (authentication things). The problem is, if multiple (>20) requests happen at the same time, the response time suddenly gets a lot worse.
I made a sample to demonstrate the problem:
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Net;
using System.Net.Http;
using System.Threading.Tasks;
namespace CallTest
{
public class Program
{
private static readonly HttpClient _httpClient = new HttpClient(new HttpClientHandler { Proxy = null, UseProxy = false });
static void Main(string[] args)
{
ServicePointManager.DefaultConnectionLimit = 100;
ServicePointManager.Expect100Continue = false;
// warmup
CallSomeWebsite().GetAwaiter().GetResult();
CallSomeWebsite().GetAwaiter().GetResult();
RunSequentiell().GetAwaiter().GetResult();
RunParallel().GetAwaiter().GetResult();
}
private static async Task RunParallel()
{
var tasks = new List<Task>();
for (var i = 0; i < 300; i++)
{
tasks.Add(CallSomeWebsite());
}
await Task.WhenAll(tasks);
}
private static async Task RunSequentiell()
{
var tasks = new List<Task>();
for (var i = 0; i < 300; i++)
{
await CallSomeWebsite();
}
}
private static async Task CallSomeWebsite()
{
var watch = Stopwatch.StartNew();
using (var result = await _httpClient.GetAsync("http://example.com").ConfigureAwait(false))
{
// more work here, like checking success etc.
Console.WriteLine(watch.ElapsedMilliseconds);
}
}
}
}
Sequential calls are no problem. They take a few milliseconds to finish and the response time is mostly the same.
However, parallel request start taking longer and longer the more requests are being sent. Sometimes it takes even a few seconds. I tested it on .NET Framework 4.6.1 and on .NET Core 2.0 with the same results.
What is even stranger: I traced the HTTP requests with WireShark and they always take around the same time. But the sample program reports much higher values for parallel requests than WireShark.
How can I get the same performance for parallel requests? Is this a thread pool issue?
This behaviour has been fixed with .NET Core 2.1. I think the problem was the underlying windows WinHTTP handler, which was used by the HttpClient.
In .NET Core 2.1, they rewrote the HttpClientHandler (see https://blogs.msdn.microsoft.com/dotnet/2018/04/18/performance-improvements-in-net-core-2-1/#user-content-networking):
In .NET Core 2.1, HttpClientHandler has a new default implementation implemented from scratch entirely in C# on top of the other System.Net libraries, e.g. System.Net.Sockets, System.Net.Security, etc. Not only does this address the aforementioned behavioral issues, it provides a significant boost in performance (the implementation is also exposed publicly as SocketsHttpHandler, which can be used directly instead of via HttpClientHandler in order to configure SocketsHttpHandler-specific properties).
This turned out to remove the bottlenecks mentioned in the question.
On .NET Core 2.0, I get the following numbers (in milliseconds):
Fetching URL 500 times...
Sequentiell Total: 4209, Max: 35, Min: 6, Avg: 8.418
Parallel Total: 822, Max: 338, Min: 7, Avg: 69.126
But on .NET Core 2.1, the individual parallel HTTP requests seem to have improved a lot:
Fetching URL 500 times...
Sequentiell Total: 4020, Max: 40, Min: 6, Avg: 8.040
Parallel Total: 795, Max: 76, Min: 5, Avg: 7.972
In the question's RunParallel() function, a stopwatch is started for all 300 calls in the first second of the program running, and ended when each http request completes.
Therefore these times can't really be compared to the sequential iterations.
For smaller numbers of parallel tasks e.g. 50, if you measure the wall time that the sequential and parallel methods take you should find that the parallel method is faster due to it pipelining as many GetAsync tasks as it can.
That said, when running the code for 300 iterations I did find a repeatable several-second stall when running outside the debugger only:
Debug build, in debugger: Sequential 27.6 seconds, parallel 0.6 seconds
Debug build, without debugger: Sequential 26.8 seconds, parallel 3.2 seconds
[Edit]
There's a similar scenario described in this question, its possibly not relevant to your problem anyway.
This problem gets worse the more tasks are run, and disappears when:
Swapping the GetAsync work for an equivalent delay
Running against a local server
Slowing the rate of tasks creation / running less concurrent tasks
The watch.ElapsedMilliseconds diagnostic stops for all connections, indicating that all connections are affected by the throttling.
Seems to be some sort of (anti-syn-flood?) throttling in the host or network, that just halts the flow of packets once a certain number of sockets start connecting.
It sounds like for whatever reason, you're hitting a point of diminishing returns at around 20 concurrent Tasks. So, your best option might be to throttle your parallelism. TPL Dataflow is a great library for achieving this. To follow your pattern, add a method like this:
private static Task RunParallelThrottled()
{
var throtter = new ActionBlock<int>(i => CallSomeWebsite(),
new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 20 });
for (var i = 0; i < 300; i++)
{
throttler.Post(i);
}
throttler.Complete();
return throttler.Completion;
}
You might need to experiment with MaxDegreeOfParallelism until you find the sweet spot. Note that this is more efficient than doing batches of 20. In that scenario, all 20 in the batch would need to complete before the next batch begins. With TPL Dataflow, as soon as one completes, another is allowed to begin.
The reason that you are having issues is that .NET does not resume Tasks in the order that they are awaited, an awaited Task is only resumed when a calling function cannot resume execution, and Task is not for Parallel execution.
If you make a few modifications so that you pass in i to the CallSomeWebsite function and call Console.WriteLine("All loaded"); after you add all the tasks to the list, you will get something like this: (RequestNumber: Time)
All loaded
0: 164
199: 236
299: 312
12: 813
1: 837
9: 870
15: 888
17: 905
5: 912
10: 952
13: 952
16: 961
18: 976
19: 993
3: 1061
2: 1061
Do you notice how every Task is created before any of the times are printed out to the screen? The entire loop of creating Tasks completes before any of the Tasks resume execution after awaiting the network call.
Also, see how request 199 is completed before request 1? .NET will resume Tasks in the order that it deems best (This is guaranteed to be more complicated but I am not exactly sure how .NET decides which Task to continue).
One thing that I think you might be confusing is Asynchronous and Parallel. They are not the same, and Task is used for Asynchronous execution. What that means is that all of these tasks are running on the same thread (Probably. .NET can start a new thread for tasks if needed), so they are not running in Parallel. If they were truly Parallel, they would all be running in different threads, and the execution times would not be increasing for each execution.
Updated functions:
private static async Task RunParallel()
{
var tasks = new List<Task>();
for (var i = 0; i < 300; i++)
{
tasks.Add(CallSomeWebsite(i));
}
Console.WriteLine("All loaded");
await Task.WhenAll(tasks);
}
private static async Task CallSomeWebsite(int i)
{
var watch = Stopwatch.StartNew();
using (var result = await _httpClient.GetAsync("https://www.google.com").ConfigureAwait(false))
{
// more work here, like checking success etc.
Console.WriteLine($"{i}: {watch.ElapsedMilliseconds}");
}
}
As for the reason that the time printed is longer for the Asynchronous execution then the Synchronous execution, your current method of tracking time does not take into account the time that was spent between execution halt and continuation. That is why all of the reporting execution times are increasing over the set of completed requests. If you want an accurate time, you will need to find a way of subtracting the time that was spent between the await occurring and execution continuing. The issue isn't that it is taking longer, it is that you have an inaccurate reporting method. If you sum the time for all the Synchronous calls, it is actually significantly more than the max time of the Asynchronous call:
Sync: 27965
Max Async: 2341

Best way to work on 15000 work items that need 1-2 I/O calls each

I have a C#/.NET 4.5 application that does work on around 15,000 items that are all independent of each other. Each item has a relatively small cpu work to do (no more than a few milliseconds) and 1-2 I/O calls to WCF services implemented in .NET 4.5 with SQL Server 2008 backend. I assume they will queue concurrent requests that they can't process quick enough? These I/O operations can take anywhere from a few milliseconds to a full second. The work item then has a little more cpu work(less than 100 milliseconds) and it is done.
I am running this on a quad-core machine with hyper-threading. Using the task parallel library, I am trying to get the best performance with machine as I can with as little waiting on I/O as possible by running those operations asynchronously and the CPU work done in parallel.
Synchronously, with no parallel processes and no async operations, the application takes around 9 hours to run. I believe I can speed this up to under an hour or less but I am not sure if I am going about this the right way.
What is the best way to do the work per item in .NET? Should I make 15000 threads and have them doing all the work with context switching? Or should I just make 8 threads (how many logical cores I have) and go about it that way? Any help on this would be greatly appreciated.
My usuall suggestion is TPL Dataflow.
You can use an ActionBlock with an async operation and set the parallelism as high as you need it to be:
var block = new ActionBlock<WorkItem>(wi =>
{
DoWork(wi);
await Task.WhenAll(DoSomeWorkAsync(wi), DoOtherWorkAsync(wi));
},
new ExecutionDataflowBlockOptions{ MaxDegreeOfParallelism = 1000 });
foreach (var workItem in workItems)
{
block.Post(workItem);
}
block.Complete();
await block.Completion;
That way you can test and tweak MaxDegreeOfParallelism until you find the number that fits your specific situation the most.
For CPU intensive work having higher parallelism than your cores doesn't help, but for I/O (and other async operations) it definitely does so if your CPU intensive work is short then I would go with at least a 1000.
You definitely don't want to kick off 15000 threads and let them all thrash it out. If you can make your I/O methods completely async - meaning I/O completion ports based - then you can get some very nice controlled parallelism going on.
If you have to tie up threads whilst waiting for I/O you're going to be massively limiting your ability to process the items.
TaskFactory taskFactory = new TaskFactory(new WorkStealingTaskScheduler(Environment.ProcessorCount));
public Job[] GetJobs() { get { return new Job[15000];} }
public async Task ProcessJobs(Job[] jobs)
{
var jobTasks = jobs.Select(j => StartJob(j));
await Task.WhenAll(jobTasks);
}
private async Task StartJob(Job j)
{
var initialCpuResults = await taskFactory.StartNew(() => j.DoInitialCpuWork());
var wcfResult = await DoIOCalls(initialCpuResults);
await taskFactory.StartNew(() => j.DoLastCpuWork(wcfResult));
}
private async Task<bool> DoIOCalls(Result r)
{
// Sequential...
await myWcfClientProxy.DoIOAsync(...); // These MUST be fully IO completion port based methods [not Task.Run etc] to achieve good throughput
await mySQLServerClient.DoIOAsync(...);
// or in Parallel...
// await Task.WhenAll(myWcfClientProxy.DoIOAsync(...), mySQLServerClient.DoIOAsync(...));
return true;
}

Categories