I have a .NET 6 BackgroundService which pushes data from on-premise to a 3rd party API.
The 3rd party API takes about 500 milliseconds to process the API call.
The problem is that I have about 1,000,000 rows of data to push to this API one at a time. At 1/2 second per row, it's going to take about 6 days to sync up.
So, I would like to try to spawn multiple threads in order to hit the API simultaneously with 10 threads.
var startTime = DateTimeOffset.Now;
var batchSize = _config.GetValue<int>("BatchSize");
using (var scope = _serviceScopeFactory.CreateScope())
{
var context = scope.ServiceProvider.GetRequiredService<PlankContext>();
var dncEntries = await context.PlankQueueDnc.Where(x => x.ToProcessFlag == true).Take(batchSize).ToListAsync();
foreach (var plankQueueDnc in dncEntries)
{
var response = await _plankConnector.InsertDncAsync(plankQueueDnc);
context.PlankQueueDnc.Update(plankQueueDnc);
}
await context.SaveChangesAsync();
}
Here is the code. As you can see, it gets a batch of 100 records and then processes them one by one. Is there a way to modify this so this line is not awaited? I don't quite understand how it would work if it were not awaited. Would it create a thread for each execution in the loop?
var response = await _plankConnector.InsertDncAsync(plankQueueDnc);
I am clearly not as up to speed on threads as the esteemed @StephenCleary.
So suggestions would be appreciated.
In .NET 6 you can use Parallel.ForEachAsync to execute operations concurrently, using either all available cores or a limited Degree-Of-Parallelism.
The following code loads a batch of records, executes the posts concurrently, then updates the records:
using (var scope = _serviceScopeFactory.CreateScope())
{
var context = scope.ServiceProvider.GetRequiredService<PlankContext>();
var dncEntries = await context.PlankQueueDnc
.Where(x => x.ToProcessFlag == true)
.Take(batchSize)
.ToListAsync();
// The body delegate receives the item and a CancellationToken.
await Parallel.ForEachAsync(dncEntries, async (plankQueueDnc, cancellationToken) =>
{
var response = await _plankConnector.InsertDncAsync(plankQueueDnc);
plankQueueDnc.Whatever = response.Something;
});
await context.SaveChangesAsync();
}
There's no reason to call Update, as a DbContext tracks the objects it loaded and knows which ones were modified. SaveChangesAsync will persist all changes in a single transaction.
DOP and Throttling
By default, Parallel.ForEachAsync will execute as many tasks concurrently as there are cores. This may be too few or too many for HTTP calls. On the one hand, the client machine isn't using its CPU at all while waiting for the remote service. On the other hand, the remote service itself may not like or even allow too many concurrent calls and may even impose throttling.
The ParallelOptions class can be used to specify the degree of parallelism. If the API allows it, we could execute e.g. 20 concurrent calls:
var options = new ParallelOptions { MaxDegreeOfParallelism = 20 };
await Parallel.ForEachAsync(dncEntries, options, async (plankQueueDnc, cancellationToken) => {...});
Many services impose a limit on how many requests can be made in a period of time. A (somewhat naive) way of staying under such a limit is to add a small delay in the task worker code:
var delay=100;
await Parallel.ForEachAsync(dncEntries, options, async (plankQueueDnc, cancellationToken) => {
...
await Task.Delay(delay, cancellationToken);
});
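As a rough, self-contained sketch of the approach above (targeting .NET 6 with top-level statements), the following replaces the real API call with a hypothetical FakeInsertAsync that only waits 50 ms, and records the peak number of calls in flight. The names FakeInsertAsync, gate, inFlight and peak are illustrative and not part of the original code.

```csharp
using System;
using System.Collections.Concurrent;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical stand-in for the slow third-party call (about 50 ms here).
static async Task<string> FakeInsertAsync(int id, CancellationToken ct)
{
    await Task.Delay(50, ct);
    return $"ok-{id}";
}

var items = Enumerable.Range(1, 40).ToList();
var results = new ConcurrentDictionary<int, string>();
var gate = new object();
int inFlight = 0, peak = 0;

var options = new ParallelOptions { MaxDegreeOfParallelism = 10 };
await Parallel.ForEachAsync(items, options, async (id, ct) =>
{
    // Track how many bodies are running at once (no await inside the lock).
    lock (gate) { inFlight++; peak = Math.Max(peak, inFlight); }

    results[id] = await FakeInsertAsync(id, ct);

    lock (gate) { inFlight--; }
});

Console.WriteLine($"Processed {results.Count} items, peak concurrency {peak}");
```

The peak concurrency should never exceed MaxDegreeOfParallelism, which is the property that matters when the remote API limits concurrent calls.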
Related
I've been creating a service using C# in Azure Functions. I've read guides on the best usage of async/await but don't understand its value in the context of Azure Functions.
In my Azure function, I have 3 calls being made to an external API. I tried to use async/await to kick off my API calls. The idea is that the first two tasks return a list each, which will be concatenated, and then compared against the third list.
After this comparison is complete, all the items are then sent to a storage container queue, for processing by another function that uses a queue trigger.
My async implementation is below:
var firstListTask = GetResourceListAsync(1);
var secondListTask = GetResourceListAsync(2);
var thirdListTask = GetOtherResourceListAsync(1);
var completed = await Task.WhenAll(firstListTask, secondListTask);
var resultList = completed[0].Concat(completed[1]);
var compareList = await thirdListTask;
// LINQ here to filter out resultList based on compareList
Running the above, I get an execution time of roughly 38 seconds.
My understanding of the async implementation is that I kick off all 3 async calls to get my lists.
The first two tasks are awaited with 'await Task.WhenAll...' - at this point the thread exits the async method and 'does something else' until the API returns the payload
API payload is received, the method is then resumed and continues executing the next instruction (concatenating the two lists)
The third task is then awaited with 'await thirdListTask', which exits the async method and 'does something else' until the API returns the payload
API payload is received, the method is then resumed and continues executing the next instruction (filtering lists)
Now if I run the same code synchronously, I get an execution time of about 40 seconds:
var firstList = GetResourceList(1);
var secondList = GetResourceList(2);
var resultList = firstList.Concat(secondList);
var compareList = GetOtherResourceList(1);
var finalList = // linq query to filter out resultList based on compareList
I can see that the async function runs 2 seconds faster than the sync function, I'm assuming this is because the thirdListTask is being kicked off at the same time as firstListTask and secondListTask?
My problem with the async implementation is that I don't understand what 'does something else' entails in the context of Azure Functions. From my understanding there is nothing else to do other than the operations on the next line, but it can't progress there until the payload has returned from the API.
Moreover, is the following code sample doing the same thing as my first async implementation? I'm extremely confused seeing examples of Azure Functions that use await for each async call, just to await another call in the next line.
var firstList = await GetResourceListAsync(1);
var secondList = await GetResourceListAsync(2);
var resultList = firstList.Concat(secondList);
var compareList= await GetOtherResourceListAsync(1);
// LINQ here to filter out resultList based on compareList
I've tried reading MS best practice for Azure Functions and similar questions around async/await on stackoverflow, but I can't seem to wrap my head around the above. Can anyone help simplify this?
var firstListTask = GetResourceListAsync(1);
var secondListTask = GetResourceListAsync(2);
var thirdListTask = GetOtherResourceListAsync(1);
This starts all 3 tasks. All 3 API calls are now running.
var completed = await Task.WhenAll(firstListTask, secondListTask);
This asynchronously waits until both tasks finish. It frees up the thread to go "do something else". What is this something else? Whatever the framework needs it to be. It's a resource being freed, so it could be used to run another async operation's continuation, for example.
var compareList = await thirdListTask;
At this point, your third API call has most likely completed already, as it was started together with the other 2. If it has completed, the await will pull out the value, or throw an exception if the task faulted. If it is still ongoing, the await will asynchronously wait for it to complete, freeing up the thread to "go do something else".
var firstList = await GetResourceListAsync(1);
var secondList = await GetResourceListAsync(2);
var resultList = firstList.Concat(secondList);
var compareList= await GetOtherResourceListAsync(1);
This is different from your first example. If, for example, each of your API calls takes 5 seconds to complete, the total running time will be 15 seconds, because each call is started only after the previous one has been awaited to completion. In your first example, the total running time will be roughly 5 seconds, so 3 times quicker.
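The difference between the two patterns can be measured with a small sketch. It assumes a hypothetical FakeCallAsync that simply waits 200 ms in place of a real API call; the exact timings will vary from machine to machine.

```csharp
using System;
using System.Diagnostics;
using System.Linq;
using System.Threading.Tasks;

// Hypothetical stand-in for an API call that takes about 200 ms.
static async Task<int> FakeCallAsync(int value)
{
    await Task.Delay(200);
    return value;
}

// Sequential: each call starts only after the previous one has completed.
var sw = Stopwatch.StartNew();
int a = await FakeCallAsync(1);
int b = await FakeCallAsync(2);
int c = await FakeCallAsync(3);
long sequentialMs = sw.ElapsedMilliseconds;

// Concurrent: all three calls are started first, then awaited together.
sw.Restart();
Task<int> t1 = FakeCallAsync(1);
Task<int> t2 = FakeCallAsync(2);
Task<int> t3 = FakeCallAsync(3);
int[] all = await Task.WhenAll(t1, t2, t3);
long concurrentMs = sw.ElapsedMilliseconds;

Console.WriteLine($"sequential: {sequentialMs} ms, concurrent: {concurrentMs} ms");
```

The sequential run should take roughly three times as long as the concurrent one, since the waits add up instead of overlapping.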
I have a server app (C# with .NET 5) that exposes a gRPC bi-directional endpoint. This endpoint takes in a binary stream that the server analyzes, producing responses that are sent back on the gRPC response stream.
Each file being sent over gRPC is a few megabytes and it takes a few minutes for the gRPC call to complete streaming (without latency). With latencies, this time sometimes increases by 50%.
On the client, I have 2 tasks (Task.Run) running, one streaming the file from the clients' file system using FileStream, other reading responses from the server (gRPC).
On the server also, I have 2 tasks running, one reading messages from the gRPC request stream and pushing them into a queue (DataFlow.BufferBlock<byte[]>), other processing messages from the queue, and writing responses to gRPC.
The problem:
If I disable (comment out) all the server processing code, and simply read and log messages from gRPC, there's almost 0 latency from client to server.
When the server has processing enabled, the clients see latencies while writing to grpcClient.
With just 10 active parallel sessions (gRPC Calls) these latencies can go up to 10-15 seconds.
PS: this only happens when I have more than one client running, a higher number of concurrent clients means higher latency.
The client code looks a bit like the below:
FileStream fs = new(audioFilePath, FileMode.Open, FileAccess.Read, FileShare.Read, 1024 * 1024, true);
byte[] buffer = new byte[10_000];
GrpcClient client = new GrpcClient(_singletonChannel); // using single channel since only 5-10 clients are there right now
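BiDiCall call = client.BiDiService(headers: null, deadline: null, CancellationToken.None);
var writeTask = Task.Run(async () => {
int bytesRead;
// ReadAsync returns the number of bytes read; 0 means end of file
while ((bytesRead = await fs.ReadAsync(buffer, 0, buffer.Length)) > 0)
{
// send only the bytes that were actually read, and await the write
await call.RequestStream.WriteAsync(new() { Chunk = ByteString.CopyFrom(buffer, 0, bytesRead) });
}
await call.RequestStream.CompleteAsync();
});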
var readTask = Task.Run(async () => {
while (await call.ResponseStream.MoveNext())
{
// write to log call.ResponseStream.Current
}
});
await Task.WhenAll(writeTask, readTask);
await call;
Server code looks like:
readonly BufferBlock<MessageRequest> messages = new();
MessageProcessor _processor = new();
public override async Task BiDiService(IAsyncStreamReader<MessageRequest> requestStream,
IServerStreamWriter<MessageResponse> responseStream,
ServerCallContext context)
{
var readTask = Task.Factory.StartNew(async () => {
while (await requestStream.MoveNext())
{
messages.Post(requestStream.Current); // add to queue
}
messages.Complete();
}, TaskCreationOptions.LongRunning).Unwrap(); // StartNew with an async lambda returns Task<Task>, hence Unwrap
var processTask = Task.Run(async () => {
while (await messages.OutputAvailableAsync())
{
var message = await messages.ReceiveAsync(); // pick from queue
// if I comment out below line and run with multiple clients = latency disappears
var result = await _processor.Process(message); // takes some time to process
if (result.IsImportantForClient())
await responseStream.WriteAsync(result.Value);
}
});
await Task.WhenAll(readTask, processTask);
}
So, as it turned out, the problem was a delay in how quickly the ThreadPool spawned worker threads.
The ThreadPool was taking too long to spawn threads to process these tasks, causing gRPC reads to lag significantly.
This was fixed by increasing the minimum thread count with ThreadPool.SetMinThreads (see the MSDN reference).
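A minimal sketch of that fix, assuming it runs once at application startup. The value 50 is purely illustrative and not taken from the original post; a suitable number depends on the workload and should be measured.

```csharp
using System;
using System.Threading;

// Read the current minimums so the I/O completion value can be left unchanged.
ThreadPool.GetMinThreads(out int workerMin, out int ioMin);

// Illustrative target: never lower the existing minimum.
int desiredWorkers = Math.Max(workerMin, 50);

// SetMinThreads returns false if the values are rejected
// (for example, if they exceed the configured maximum).
bool applied = ThreadPool.SetMinThreads(desiredWorkers, ioMin);

ThreadPool.GetMinThreads(out int newWorkerMin, out int newIoMin);
Console.WriteLine($"applied={applied}, worker min {workerMin} -> {newWorkerMin}, IO min {newIoMin}");
```

Raising the minimum lets the pool create threads on demand up to that count without its usual throttling delay, which helps when many work items block at once. Removing the blocking work from pool threads is usually the more durable fix.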
There have been a number of promising comments on the original question, but I wanted to paraphrase what I thought was important. The code has:
an outer async method that calls into
two Task.Run() calls, with the TaskCreationOptions.LongRunning option, that wrap async loops, and finally
a Task.WhenAll() that rejoins the two tasks.
Alois Kraus suggests that the OS task scheduler may already be abstracting away what you might think is more efficient. This could very well be true, and if it is,
I would suggest removing the asynchronous processing and measuring which blend of sync and async code works best for your particular scenario.
One thing to remember is that async/await logically blocks, at the expense of automatic thread management. This is great for single-path I/O-bound processing (e.g. needing to call a DB or web service before moving on to the next step of execution) and can be less beneficial as you move toward compute-bound processing (execution that needs to be explicitly rejoined; async/await implicitly takes care of rejoining the Task).
I have a bunch of independent REST calls to make (say 1000), and each call has a different body. How can I make these calls in the least amount of time?
I am using a Parallel.ForEach loop to make the calls, but doesn't each call wait for the previous call to finish (on a single thread)? Is there any callback kind of system to prevent this and make the process faster?
Parallel.ForEach(...){
(REST call)
HttpResponseMessage response = this.client.PostAsync(new Uri(url), content).Result;
}
Using await also gives almost same results.
Make all the calls with PostAsync:
var urlAndContentArray = ...
// The fast part
IEnumerable<Task<HttpResponseMessage>> tasks = urlAndContentArray.Select
(x => this.client.PostAsync(new Uri(x.Url), x.Content));
// IF you care about results: here is the slow part:
var responses = await Task.WhenAll(tasks);
Note that this will send all the calls very quickly, as requested, but the time it takes to get results is mostly unrelated to how many requests you send out. It is limited by the number of outgoing requests .NET will run in parallel, as well as by how fast the servers reply (and whether they have any DoS protection or throttling).
The simplest way to run many async actions in parallel is to start them without waiting, capture the tasks in a collection, and then wait until all of the tasks have completed.
For example
var httpClient = new HttpClient();
var payloads = new List<string>(); // this is 1000 payloads
var tasks = payloads.Select(p => httpClient.PostAsync("https://addresss", new StringContent(p)));
await Task.WhenAll(tasks);
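This should be enough for a start, but mind 2 things.
There is still a per-host connection limit. Its default depends on the runtime: 2 on .NET Framework (ServicePointManager.DefaultConnectionLimit), and effectively unlimited with SocketsHttpHandler on .NET Core 2.1 and later. You can use SocketsHttpHandler.MaxConnectionsPerServer to control the pool size.
It really will start all 1000 items in parallel, which sometimes might not be what you want. To control the maximum number of parallel items, you can look at ActionBlock (TPL Dataflow).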
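A minimal sketch of the ActionBlock approach, assuming System.Threading.Tasks.Dataflow is available (on older frameworks it is a separate NuGet package). The HTTP call is replaced by a 20 ms Task.Delay, and the limit of 8 and the counters gate, inFlight, peak and done are illustrative only.

```csharp
using System;
using System.Linq;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

var gate = new object();
int inFlight = 0, peak = 0, done = 0;

// The block runs at most 8 delegates at a time; further items wait in its input queue.
var block = new ActionBlock<string>(async payload =>
{
    lock (gate) { inFlight++; peak = Math.Max(peak, inFlight); }

    await Task.Delay(20); // stand-in for httpClient.PostAsync(url, new StringContent(payload))

    lock (gate) { inFlight--; done++; }
}, new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 8 });

foreach (var payload in Enumerable.Range(0, 100).Select(i => $"payload-{i}"))
{
    block.Post(payload);
}

block.Complete();        // no more items will be posted
await block.Completion;  // wait until every queued item has been processed

Console.WriteLine($"done={done}, peak concurrency={peak}");
```

Compared with Task.WhenAll over all 1000 tasks, this keeps only a bounded number of requests outstanding at any moment.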
I have a console application which makes multiple API requests over HTTPS.
When running in a single thread it can do a maximum of about 8 API requests per second.
Server which is receiving API calls has lots of free resources, so it should be able to handle many more than 8 / sec.
Also when I run multiple instances of the application, each instance is still able to do 8 requests / sec.
I tried following code to parallelize the requests, but it still runs synchronously:
var taskList = new List<Task<string>>();
for (int i = 0; i < 10000; i++)
{
string threadNumber = i.ToString();
Task<string> task = Task<string>.Factory.StartNew(() => apiRequest(requestData));
taskList.Add(task);
}
foreach (var task in taskList)
{
Console.WriteLine(task.Result);
}
What am I doing wrong here?
EDIT:
My mistake was iterating over tasks and getting task.Result, that was blocking the main thread, making me think that it was running synchronously.
Code which I ended up using instead of foreach(var task in taskList):
while (taskList.Count > 0)
{
Task.WaitAny(taskList.ToArray()); // block until at least one of the remaining tasks finishes
// Gets tasks in RanToCompletion or Faulted state
var finishedTasks = GetFinishedTasks(taskList);
foreach (Task<string> finishedTask in finishedTasks)
{
Console.WriteLine(finishedTask.Result);
taskList.Remove(finishedTask);
}
}
There could be a couple of things going on.
First, the .NET ServicePoint class allows a maximum of 2 connections per host by default (ServicePointManager.DefaultConnectionLimit). See this Stack Overflow question/answer.
Second, your server might theoretically be able to handle more than 8/sec, but there could be resource constraints or other issues preventing that on the server side. I have run into issues with API calls which theoretically should be able to handle much more than they do, but for whatever reason were designed or implemented improperly.
@theMayer is kinda-sorta correct. It's possible that your call to apiRequest is what's blocking and making the whole expression seem synchronous...
However... you're iterating over each task and calling task.Result, which will block until the task completes in order to print it to the screen. So, for example, all tasks except the first could be complete, but you won't print them until the first one completes, and you will continue printing them in order.
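One way to print results in completion order rather than start order is a Task.WhenAny loop. The sketch below is an assumption-laden illustration, not the asker's code: FakeRequestAsync is a hypothetical stand-in for apiRequest, with delays chosen so that the first task started is the last to finish.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

// Hypothetical stand-in for apiRequest, with a configurable duration.
static async Task<string> FakeRequestAsync(int id, int delayMs)
{
    await Task.Delay(delayMs);
    return $"request {id}";
}

// The first task is deliberately the slowest.
var pending = new List<Task<string>>
{
    FakeRequestAsync(1, 600),
    FakeRequestAsync(2, 200),
    FakeRequestAsync(3, 400),
};

var completionOrder = new List<string>();
while (pending.Count > 0)
{
    // WhenAny completes as soon as any one of the pending tasks has finished.
    Task<string> finished = await Task.WhenAny(pending);
    pending.Remove(finished);

    string result = await finished; // already complete; rethrows if the task faulted
    completionOrder.Add(result);
    Console.WriteLine(result);
}
```

For a large number of tasks this loop is quadratic, because WhenAny is re-registered on every remaining task in each iteration, so it is best suited to small collections.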
On a slightly different note, you could rewrite this a little more succinctly like so:
var screenLock = new object();
var results = Enumerable.Range(1, 10000)
.AsParallel()
.Select(i => {
// I wouldn't actually use this printing, but it should help you understand your example a bit better
lock (screenLock) {
Console.WriteLine($"Task {i}");
}
return apiRequest(requestData);
});
Without the printing, it looks like this:
var results = Enumerable.Range(1, 10000)
.AsParallel()
.Select(i => apiRequest(requestData));
I am reading data from Azure tables (around 5k tables), collecting different metrics and saving them back to other Azure tables, all asynchronously. The problem I am facing is that when there is a huge amount of data, which can happen occasionally, the application starts hanging. The same code works fine with less data. The steps I am doing (all of them asynchronous, using Rx, async and await) are:
Read all the table names from Azure
Read all the tables previous metric data (1 & 2 are in parallel - Task.WhenAll)
Get data from each table, process and save (Task.WhenAll)
What I want is to use asynchrony only up to the point where it doesn't make my application hang. If there is more data than can be handled, it should not read any more tables' data and should instead focus on completing the processing of the data already available.
Does Parallel.ForEach take care of that?
The code, edited as per Stephen Cleary, is still not working for all the tables, whereas it does work for 500 tables.
I think it is the amount of data that brings the app (a console app) to a standstill rather than the number of threads. (One thread may end up retrieving a million rows, in chunks of a thousand; each thousand is passed to a method and its count is added to a dictionary, so it can be garbage collected when more memory is needed.) Or is it the way I have implemented SemaphoreSlim that is wrong?
public async Task CalculateMetricsForAllTablesAsync()
{
var allWizardTableNamesTask = GetAllWizardTableNamesAsync();
var allTablesNamesWithLastRunTimeTask = GetAllTableNamesWithLastRunTimeAsync();
await Task.WhenAll(allWizardTableNamesTask, allTablesNamesWithLastRunTimeTask).ConfigureAwait(false);
var allWizardTableNames = allWizardTableNamesTask.Result;
var allTablesNamesWithLastRunTime = allTablesNamesWithLastRunTimeTask.Result;
var throttler = new SemaphoreSlim(10);
var concurrentTableProcessingTasks = new ConcurrentStack<Task>();
foreach (var tname in allWizardTableNames)
{
await throttler.WaitAsync();
try
{
concurrentTableProcessingTasks.Push(ProcessTableDataAsync(tname, getTableNameWithLastRunTime(tname)));
}
finally
{
throttler.Release();
}
}
await Task.WhenAll(concurrentTableProcessingTasks).ConfigureAwait(false);
}
private async Task ProcessTableDataAsync(string tableName, Tuple<string, string> matchingTable)
{
var tableDataRetrieved = new TaskCompletionSource<bool>();
var metricCountsForEachDay = new ConcurrentDictionary<string, Tuple<int, int>>();
_fromATS.GetTableDataAsync<DynamicTableEntity>(tableName, GetFilter(matchingTable))
.Subscribe(entities => ProcessWizardDataChunk(metricCountsForEachDay, entities), () => tableDataRetrieved.TrySetResult(true));
await tableDataRetrieved.Task;
await SaveMetricDataAsync(tableName, metricCountsForEachDay).ConfigureAwait(false);
}
Since your async is wrapping Rx, I'd recommend throttling at the async level. You can do this by defining a SemaphoreSlim and wrapping your method logic within a WaitAsync/Release.
Alternatively, consider TPL Dataflow. Dataflow has built-in options for throttling (MaxDegreeOfParallelism), and also interoperates naturally with async and Rx.
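A minimal sketch of the first suggestion, assuming the table work is simulated by a 20 ms Task.Delay. The key difference from the code in the question is that the semaphore is acquired and released inside the async method, so the slot is held until the work has actually finished rather than only while the task is being started. ProcessTableAsync, gate, inFlight and peak are illustrative names.

```csharp
using System;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

var throttler = new SemaphoreSlim(10); // at most 10 tables processed at once
var gate = new object();
int inFlight = 0, peak = 0;

async Task ProcessTableAsync(string tableName)
{
    await throttler.WaitAsync();
    try
    {
        lock (gate) { inFlight++; peak = Math.Max(peak, inFlight); }

        await Task.Delay(20); // stand-in for reading, processing and saving one table

        lock (gate) { inFlight--; }
    }
    finally
    {
        // Released only after the awaited work above has completed.
        throttler.Release();
    }
}

// All tasks are created up front, but only 10 get past WaitAsync at a time.
var tasks = Enumerable.Range(0, 200)
    .Select(i => ProcessTableAsync($"table{i}"))
    .ToList();
await Task.WhenAll(tasks);

Console.WriteLine($"tables processed: {tasks.Count}, peak concurrency: {peak}");
```

This limits how many tables are read concurrently, which in turn bounds how much table data is held in memory at once.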