I need to fetch content from some 3000 URLs. I'm using HttpClient: I create a Task for each URL, add the tasks to a list, and then await Task.WhenAll. Something like this:
var tasks = new List<Task<string>>();
foreach (var url in urls) {
var task = Task.Run(() => httpClient.GetStringAsync(url));
tasks.Add(task);
}
var t = Task.WhenAll(tasks);
However, many tasks end up in the Faulted or Canceled state. I thought it might be a problem with the specific URLs, but no: I can fetch those URLs in parallel with curl without any problem.
I tried HttpClientHandler, WinHttpHandler with various timeouts etc. Always several hundred urls end with an error.
Then I tried to fetch those urls in batches of 10 and that works. No errors, but very slow. Curl will fetch 3000 urls in parallel very fast.
Then I tried to get httpbin.org 3000 times to verify that the issue is not with my particular urls:
var handler = new HttpClientHandler() { MaxConnectionsPerServer = 5000 };
var httpClient = new HttpClient(handler);
var tasks = new List<Task<HttpResponseMessage>>();
foreach (var _ in Enumerable.Range(1, 3000)) {
var task = Task.Run(() => httpClient.GetAsync("http://httpbin.org"));
tasks.Add(task);
}
var t = Task.WhenAll(tasks);
try { await t.ConfigureAwait(false); } catch { }
int ok = 0, faulted = 0, cancelled = 0;
foreach (var task in tasks) {
switch (task.Status) {
case TaskStatus.RanToCompletion: ok++; break;
case TaskStatus.Faulted: faulted++; break;
case TaskStatus.Canceled: cancelled++; break;
}
}
Console.WriteLine($"RanToCompletion: {ok} Faulted: {faulted} Canceled: {cancelled}");
Again, always several hundred Tasks end in error.
So, what is the issue here? Why can't I fetch those URLs with async?
I'm using .NET Core and therefore the suggestion to use ServicePointManager (Trying to run multiple HTTP requests in parallel, but being limited by Windows (registry)) is not applicable.
Also, the urls I need to fetch point to different hosts. The code with httpbin is just a test, to show that the problem was not with my urls being invalid.
As Fildor said in the comments, httpClient.GetStringAsync returns Task. So you don't need to wrap it in Task.Run.
I ran this code in the console app. It took 50 seconds to complete. In your comment, you wrote that curl performs 3000 queries in less than a minute - the same thing.
var httpClient = new HttpClient();
var tasks = new List<Task<string>>();
var sw = Stopwatch.StartNew();
for (int i = 0; i < 3000; i++)
{
var task = httpClient.GetStringAsync("http://httpbin.org");
tasks.Add(task);
}
Task.WaitAll(tasks.ToArray());
sw.Stop();
Console.WriteLine(sw.Elapsed);
Console.WriteLine(tasks.All(t => t.Status == TaskStatus.RanToCompletion)); // IsCompleted would also be true for faulted or canceled tasks
Also, all requests were completed successfully.
In your code, you are waiting for tasks started using Task.Run. But you need to wait for the completion of tasks started by calling httpClient.Get...
Related
Let's say I want to download 1000 recipes from a website. The websites accepts at most 10 concurrent connections. Each recipe should be stored in an array, at its corresponding index. (I don't want to send the array to the DownloadRecipe method.)
Technically, I've already solved the problem, but I would like to know if there is an even cleaner way to use async/await or something else to achieve it?
static async Task MainAsync()
{
int recipeCount = 1000;
int connectionCount = 10;
string[] recipes = new string[recipeCount];
Task<string>[] tasks = new Task<string>[connectionCount];
int r = 0;
while (r < recipeCount)
{
for (int t = 0; t < tasks.Length; t++)
{
int index = r; // copy the counter: the lambda runs later, when r has already changed
tasks[t] = Task.Run(async () => recipes[index] = await DownloadRecipe(index));
r++;
}
await Task.WhenAll(tasks);
}
}
static async Task<string> DownloadRecipe(int index)
{
// ... await calls to download recipe
}
Also, this solution is not optimal, since it doesn't start a new download until all 10 running downloads have finished. Is there something we can improve there without bloating the code too much? A thread pool limited to 10 threads?
There are many ways you could do this. One way is to use an ActionBlock, which gives you access to MaxDegreeOfParallelism fairly easily and works well with async methods:
static async Task MainAsync()
{
var recipeCount = 1000;
var connectionCount = 10;
var recipes = new string[recipeCount];
async Task Action(int i) => recipes[i] = await DownloadRecipe(i);
var processor = new ActionBlock<int>(Action, new ExecutionDataflowBlockOptions()
{
MaxDegreeOfParallelism = connectionCount,
SingleProducerConstrained = true
});
for (var i = 0; i < recipeCount; i++)
await processor.SendAsync(i);
processor.Complete();
await processor.Completion;
}
static async Task<string> DownloadRecipe(int index)
{
...
}
Another way might be to use a SemaphoreSlim
var slim = new SemaphoreSlim(connectionCount, connectionCount);
var tasks = Enumerable
.Range(0, recipeCount)
.Select(Selector);
async Task<string> Selector(int i)
{
await slim.WaitAsync();
try
{
return await DownloadRecipe(i);
}
finally
{
slim.Release();
}
}
var recipes = await Task.WhenAll(tasks);
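Another set of approaches is to use Reactive Extensions (Rx). Once again there are many ways to do this; this is just an awaitable approach (and likely could be better, all things considered). Note that Merge emits results as the downloads complete, so the resulting array is not guaranteed to be in index order.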
var results = await Enumerable
.Range(0, recipeCount)
.ToObservable()
.Select(i => Observable.FromAsync(() => DownloadRecipe(i)))
.Merge(connectionCount)
.ToArray()
.ToTask();
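On .NET 6 or later, the same "at most N downloads at a time, results stored by index" idea can also be written with Parallel.ForEachAsync, which awaits the async body and respects MaxDegreeOfParallelism. The following is only a sketch under that version assumption; the class name is illustrative and DownloadRecipe here is a hypothetical stand-in for the real method from the question.

```csharp
using System.Linq;
using System.Threading.Tasks;

static class ForEachAsyncSketch
{
    // Hypothetical stand-in for the real DownloadRecipe from the question.
    static async Task<string> DownloadRecipe(int index)
    {
        await Task.Delay(10);
        return $"recipe {index}";
    }

    public static async Task<string[]> DownloadAllAsync(int recipeCount, int connectionCount)
    {
        var recipes = new string[recipeCount];
        var options = new ParallelOptions { MaxDegreeOfParallelism = connectionCount };

        // Each body invocation is awaited; at most connectionCount bodies run at once.
        await Parallel.ForEachAsync(Enumerable.Range(0, recipeCount), options,
            async (i, cancellationToken) =>
            {
                // Every index is written by exactly one invocation, so no locking is needed.
                recipes[i] = await DownloadRecipe(i);
            });

        return recipes;
    }
}
```

Because each result is written to its own slot, the array ends up in index order regardless of the order in which the downloads finish.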
An alternative approach is to have 10 "pools" which load data "simultaneously".
You don't need to wrap IO operations in a separate thread. Using a separate thread for IO operations is just a waste of resources.
Notice that a thread that downloads data does nothing but wait for a response. This is where the async-await approach comes in very handy: we can send multiple requests without waiting for them to complete and without wasting threads.
static async Task MainAsync()
{
var requests = Enumerable.Range(0, 1000).ToArray();
var maxConnections = 10;
var pools = requests
.GroupBy(i => i % maxConnections)
.Select(group => DownloadRecipesFor(group.ToArray()))
.ToArray();
await Task.WhenAll(pools);
var recipes = pools.SelectMany(pool => pool.Result).ToArray();
}
static async Task<IEnumerable<string>> DownloadRecipesFor(params int[] requests)
{
var recipes = new List<string>();
foreach (var request in requests)
{
var recipe = await DownloadRecipe(request);
recipes.Add(recipe);
}
return recipes;
}
Because each pool (the DownloadRecipesFor method) downloads its results one by one, there are never more than 10 active requests at any time.
This is a little more efficient than the original, because we don't wait for all 10 tasks to complete before starting the next "bunch".
It is still not ideal: a "pool" that finishes earlier than the others cannot pick up the remaining requests of another pool.
Note that the final array is ordered pool by pool (requests 0, 10, 20, ... first, then 1, 11, 21, ...), because the pools and the requests inside them are processed in the order in which they were created. If each recipe has to end up at its own index, return the index together with the recipe and reorder afterwards.
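The limitation of fixed pools can be avoided by letting the workers share one queue of indexes, so an idle worker simply takes the next pending request. The sketch below is one possible way to do that, not the answer's own code; the class name and the body of DownloadRecipe are hypothetical stand-ins.

```csharp
using System.Collections.Concurrent;
using System.Linq;
using System.Threading.Tasks;

static class SharedQueueDownloader
{
    // Hypothetical stand-in for the real DownloadRecipe from the question.
    static async Task<string> DownloadRecipe(int index)
    {
        await Task.Delay(10);
        return $"recipe {index}";
    }

    public static async Task<string[]> DownloadAllAsync(int recipeCount, int maxConnections)
    {
        var recipes = new string[recipeCount];
        var pending = new ConcurrentQueue<int>(Enumerable.Range(0, recipeCount));

        // Each worker keeps taking the next index until the queue is empty,
        // so no worker sits idle while work remains.
        async Task WorkerAsync()
        {
            while (pending.TryDequeue(out var index))
                recipes[index] = await DownloadRecipe(index);
        }

        var workers = Enumerable.Range(0, maxConnections)
            .Select(_ => WorkerAsync())
            .ToArray();
        await Task.WhenAll(workers);
        return recipes;
    }
}
```

No extra threads are created: a worker is just an async loop, and since every result is written to its own index the array comes back in request order.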
I have an IEnumerable<Task>, where each Task will call the same endpoint. However, the endpoint can only handle so many calls per second. How can I put, say, a half second delay between each call?
I have tried adding Task.Delay(), but of course awaiting them simply means that the app waits a half second before sending all the calls at once.
Here is a code snippet:
var resultTasks = orders
.Select(async task =>
{
var result = new VendorTaskResult();
try
{
result.Response = await result.CallVendorAsync();
}
catch(Exception ex)
{
result.Exception = ex;
}
return result;
} );
var results = Task.WhenAll(resultTasks);
I feel like I should do something like
Task.WhenAll(resultTasks.EmitOverTime(500));
... but how exactly do I do that?
What you describe in your question is in other words rate limiting. You'd like to apply rate limiting policy to your client, because the API you use enforces such a policy on the server to protect itself from abuse.
While you could implement rate limiting yourself, I'd recommend going with a well-established solution. Rate Limiter from David Desmaisons was the one that I picked at random and I instantly liked it. It has solid documentation, good test coverage and is easy to use. It is also available as a NuGet package.
Check out the simple snippet below, which runs semi-overlapping tasks in sequence while deferring the start of each task until half a second after the immediately preceding task started. Each task lasts at least 750 ms.
using ComposableAsync;
using RateLimiter;
using System;
using System.Threading.Tasks;
namespace RateLimiterTest
{
class Program
{
static void Main(string[] args)
{
Log("Starting tasks ...");
var constraint = TimeLimiter.GetFromMaxCountByInterval(1, TimeSpan.FromSeconds(0.5));
var tasks = new[]
{
DoWorkAsync("Task1", constraint),
DoWorkAsync("Task2", constraint),
DoWorkAsync("Task3", constraint),
DoWorkAsync("Task4", constraint)
};
Task.WaitAll(tasks);
Log("All tasks finished.");
Console.ReadLine();
}
static void Log(string message)
{
Console.WriteLine(DateTime.Now.ToString("HH:mm:ss.fff ") + message);
}
static async Task DoWorkAsync(string name, IDispatcher constraint)
{
await constraint;
Log(name + " started");
await Task.Delay(750);
Log(name + " finished");
}
}
}
Sample output:
10:03:27.121 Starting tasks ...
10:03:27.154 Task1 started
10:03:27.658 Task2 started
10:03:27.911 Task1 finished
10:03:28.160 Task3 started
10:03:28.410 Task2 finished
10:03:28.680 Task4 started
10:03:28.913 Task3 finished
10:03:29.443 Task4 finished
10:03:29.443 All tasks finished.
If you change the constraint to allow maximum two tasks per second (var constraint = TimeLimiter.GetFromMaxCountByInterval(2, TimeSpan.FromSeconds(1));), which is not the same as one per half a second, then the output could be like:
10:06:03.237 Starting tasks ...
10:06:03.264 Task1 started
10:06:03.268 Task2 started
10:06:04.026 Task2 finished
10:06:04.031 Task1 finished
10:06:04.275 Task3 started
10:06:04.276 Task4 started
10:06:05.032 Task4 finished
10:06:05.032 Task3 finished
10:06:05.033 All tasks finished.
Note that the current version of Rate Limiter targets .NETFramework 4.7.2+ or .NETStandard 2.0+.
This is just a thought, but another approach could be to create a queue and add another thread that runs polling the queue for calls that need to go out to your endpoint.
Have you considered just turning that into a foreach-loop with a Task.Delay call? You seem to want to explicitly call them sequentially, it won't hurt if that is obvious from your code.
var results = new List<YourResultType>();
foreach(var order in orders){
var result = new VendorTaskResult();
try
{
result.Response = await result.CallVendorAsync();
results.Add(result.Response);
}
catch(Exception ex)
{
result.Exception = ex;
}
await Task.Delay(500); // pause before the next call without blocking the thread
}
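Instead of selecting from orders you could loop over them, start each call inside the loop, add its task to a list, and then call Task.WhenAll.
It would look something like:
var resultTasks = new List<Task<VendorTaskResult>>();
foreach (var item in orders)
{
resultTasks.Add(CallVendorSafelyAsync());
await Task.Delay(x); // pause without blocking the thread (instead of Thread.Sleep)
}
var results = await Task.WhenAll(resultTasks);
// Local helper (the name is illustrative): wraps one call so that a failure is captured in the result.
async Task<VendorTaskResult> CallVendorSafelyAsync()
{
var result = new VendorTaskResult();
try
{
result.Response = await result.CallVendorAsync();
}
catch(Exception ex)
{
result.Exception = ex;
}
return result;
}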
If you want to control the number of requests executed simultaneously, you have to use a semaphore.
I have something very similar, and it works fine with me. Please note that I call ToArray() after the Linq query finishes, that triggers the tasks:
using (HttpClient client = new HttpClient()) {
IEnumerable<Task<string>> _downloads = _group
.Select(async job => {
await Task.Delay(300);
return await client.GetStringAsync(<url with variable job>);
});
Task<string>[] _downloadTasks = _downloads.ToArray();
_pages = await Task.WhenAll(_downloadTasks);
}
Now please note that this will create n tasks, all running in parallel, so the Task.Delay does not space the calls out: every task waits the same 300 ms and then all the requests go out together. If you want to call the pages one after another (as putting a delay between the calls suggests), then this code may be better:
using (HttpClient client = new HttpClient()) {
foreach (string job in _group) {
await Task.Delay(300);
_pages.Add(await client.GetStringAsync(<url with variable job>));
}
}
The download of each page is still asynchronous (other work can run while a page is downloading), but the calls are made sequentially: each download finishes before the next one is started.
The code can easily be changed to call the pages concurrently in chunks, for example 20 pages at a time with a 5 second wait between chunks, as in this sample:
IEnumerable<string[]> toParse = myData
.Select((v, i) => new { v.code, group = i / 20 })
.GroupBy(x => x.group)
.Select(g => g.Select(x => x.code).ToArray());
using (HttpClient client = new HttpClient()) {
foreach (string[] _group in toParse) {
string[] _pages = null;
IEnumerable<Task<string>> _downloads = _group
.Select(job => {
return client.GetStringAsync(<url with job>);
});
Task<string>[] _downloadTasks = _downloads.ToArray();
_pages = await Task.WhenAll(_downloadTasks);
await Task.Delay(5000);
}
}
All this does is group your pages in chunks of 20, iterate through the chunks, download all pages of the chunk asynchronously, wait 5 seconds, move on to the next chunk.
I hope that is what you were waiting for :)
The proposed method EmitOverTime is doable, but only by blocking the current thread:
public static IEnumerable<Task<TResult>> EmitOverTime<TResult>(
this IEnumerable<Task<TResult>> tasks, int delay)
{
foreach (var item in tasks)
{
Thread.Sleep(delay); // Delay by blocking
yield return item;
}
}
Usage:
var results = await Task.WhenAll(resultTasks.EmitOverTime(500));
It is probably better to create a variant of Task.WhenAll that accepts a delay argument, and delays asynchronously:
public static async Task<TResult[]> WhenAllWithDelay<TResult>(
IEnumerable<Task<TResult>> tasks, int delay)
{
var tasksList = new List<Task<TResult>>();
foreach (var task in tasks)
{
await Task.Delay(delay).ConfigureAwait(false);
tasksList.Add(task);
}
return await Task.WhenAll(tasksList).ConfigureAwait(false);
}
Usage:
var results = await WhenAllWithDelay(resultTasks, 500);
This design implies that the enumerable of tasks should be enumerated only once. It is easy to forget this during development, and start enumerating it again, spawning a new set of tasks. For this reason I propose to make it an OnlyOnce enumerable, as it is shown in this question.
Update: I should mention why the above methods work, and under what premise. The premise is that the supplied IEnumerable<Task<TResult>> is deferred, in other words non-materialized. At the method's start there are no tasks created yet. The tasks are created one after the other during the enumeration of the enumerable, and the trick is that the enumeration is slow and controlled. The delay inside the loop ensures that the tasks are not created all at once. They are created hot (in other words already started), so at the time the last task has been created some of the first tasks may have already been completed. The materialized list of half-running/half-completed tasks is then passed to Task.WhenAll, that waits for all to complete asynchronously.
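The premise about deferred enumerables can be checked with a small experiment. This is only an illustrative sketch (the class name and the counter are made up for the demonstration): it counts how many tasks the Select lambda has created before enumeration, after one enumeration, and after a second one.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

static class DeferredTasksDemo
{
    public static int[] Run()
    {
        int created = 0;

        // Deferred query: the lambda, and therefore task creation, has not run yet.
        IEnumerable<Task<int>> deferred = Enumerable.Range(1, 3)
            .Select(i => { created++; return Task.FromResult(i); });
        int beforeEnumeration = created;

        // First enumeration creates the three tasks.
        var first = deferred.ToList();
        int afterFirst = created;

        // Enumerating again runs the lambda again and creates three more tasks.
        var second = deferred.ToList();
        int afterSecond = created;

        return new[] { beforeEnumeration, afterFirst, afterSecond }; // 0, 3, 6
    }
}
```

The jump from 3 to 6 is the accidental re-enumeration described above: with real requests it would send every call a second time.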
I would like to run a bunch of async tasks, with a limit on how many tasks may be pending completion at any given time.
Say you have 1000 URLs, and you only want to have 50 requests open at a time; but as soon as one request completes, you open up a connection to the next URL in the list. That way, there are always exactly 50 connections open at a time, until the URL list is exhausted.
I also want to utilize a given number of threads if possible.
I came up with an extension method, ThrottleTasksAsync that does what I want. Is there a simpler solution already out there? I would assume that this is a common scenario.
Usage:
class Program
{
static void Main(string[] args)
{
Enumerable.Range(1, 10).ThrottleTasksAsync(5, 2, async i => { Console.WriteLine(i); return i; }).Wait();
Console.WriteLine("Press a key to exit...");
Console.ReadKey(true);
}
}
Here is the code:
static class IEnumerableExtensions
{
public static async Task<Result_T[]> ThrottleTasksAsync<Enumerable_T, Result_T>(this IEnumerable<Enumerable_T> enumerable, int maxConcurrentTasks, int maxDegreeOfParallelism, Func<Enumerable_T, Task<Result_T>> taskToRun)
{
var blockingQueue = new BlockingCollection<Enumerable_T>(new ConcurrentBag<Enumerable_T>());
var semaphore = new SemaphoreSlim(maxConcurrentTasks);
// Run the throttler on a separate thread.
var t = Task.Run(() =>
{
foreach (var item in enumerable)
{
// Wait for the semaphore
semaphore.Wait();
blockingQueue.Add(item);
}
blockingQueue.CompleteAdding();
});
var taskList = new List<Task<Result_T>>();
Parallel.ForEach(IterateUntilTrue(() => blockingQueue.IsCompleted), new ParallelOptions { MaxDegreeOfParallelism = maxDegreeOfParallelism },
_ =>
{
Enumerable_T item;
if (blockingQueue.TryTake(out item, 100))
{
taskList.Add(
// Run the task
taskToRun(item)
.ContinueWith(tsk =>
{
// For effect
Thread.Sleep(2000);
// Release the semaphore
semaphore.Release();
return tsk.Result;
}
)
);
}
});
// Await all the tasks.
return await Task.WhenAll(taskList);
}
static IEnumerable<bool> IterateUntilTrue(Func<bool> condition)
{
while (!condition()) yield return true;
}
}
The method utilizes BlockingCollection and SemaphoreSlim to make it work. The throttler is run on one thread, and all the async tasks are run on the other thread. To achieve parallelism, I added a maxDegreeOfParallelism parameter that's passed to a Parallel.ForEach loop re-purposed as a while loop.
The old version was:
foreach (var master = ...)
{
var details = ...;
Parallel.ForEach(details, detail => {
// Process each detail record here
}, new ParallelOptions { MaxDegreeOfParallelism = 15 });
// Perform the final batch updates here
}
But, the thread pool gets exhausted fast, and you can't do async/await.
Bonus:
To get around the problem in BlockingCollection where an exception is thrown in Take() when CompleteAdding() is called, I'm using the TryTake overload with a timeout. If I didn't use the timeout in TryTake, it would defeat the purpose of using a BlockingCollection since TryTake won't block. Is there a better way? Ideally, there would be a TakeAsync method.
As suggested, use TPL Dataflow.
A TransformBlock<TInput, TOutput> may be what you're looking for.
You define a MaxDegreeOfParallelism to limit how many strings can be transformed (i.e., how many urls can be downloaded) in parallel. You then post urls to the block, and when you're done you tell the block you're done adding items and you fetch the responses.
var downloader = new TransformBlock<string, HttpResponse>(
url => Download(url),
new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 50 }
);
var buffer = new BufferBlock<HttpResponse>();
downloader.LinkTo(buffer);
foreach(var url in urls)
downloader.Post(url);
//or await downloader.SendAsync(url);
downloader.Complete();
await downloader.Completion;
IList<HttpResponse> responses;
if (buffer.TryReceiveAll(out responses))
{
//process responses
}
Note: The TransformBlock buffers both its input and output. Why, then, do we need to link it to a BufferBlock?
Because the TransformBlock won't complete until all items (HttpResponse) have been consumed, and await downloader.Completion would hang. Instead, we let the downloader forward all its output to a dedicated buffer block - then we wait for the downloader to complete, and inspect the buffer block.
Say you have 1000 URLs, and you only want to have 50 requests open at a time; but as soon as one request completes, you open up a connection to the next URL in the list. That way, there are always exactly 50 connections open at a time, until the URL list is exhausted.
The following simple solution has surfaced many times here on SO. It doesn't use blocking code and doesn't create threads explicitly, so it scales very well:
const int MAX_DOWNLOADS = 50;
static async Task DownloadAsync(string[] urls)
{
using (var semaphore = new SemaphoreSlim(MAX_DOWNLOADS))
using (var httpClient = new HttpClient())
{
var tasks = urls.Select(async url =>
{
await semaphore.WaitAsync();
try
{
var data = await httpClient.GetStringAsync(url);
Console.WriteLine(data);
}
finally
{
semaphore.Release();
}
});
await Task.WhenAll(tasks);
}
}
The thing is, the processing of the downloaded data should be done on a different pipeline, with a different level of parallelism, especially if it's a CPU-bound processing.
E.g., you'd probably want to have 4 threads concurrently doing the data processing (the number of CPU cores), and up to 50 pending requests for more data (which do not use threads at all). AFAICT, this is not what your code is currently doing.
That's where TPL Dataflow or Rx may come in handy as a preferred solution. Yet it is certainly possible to implement something like this with plain TPL. Note, the only blocking code here is the one doing the actual data processing inside Task.Run:
const int MAX_DOWNLOADS = 50;
const int MAX_PROCESSORS = 4;
// process data
class Processing
{
SemaphoreSlim _semaphore = new SemaphoreSlim(MAX_PROCESSORS);
HashSet<Task> _pending = new HashSet<Task>();
object _lock = new Object();
async Task ProcessAsync(string data)
{
await _semaphore.WaitAsync();
try
{
await Task.Run(() =>
{
// simulate work
Thread.Sleep(1000);
Console.WriteLine(data);
});
}
finally
{
_semaphore.Release();
}
}
public async void QueueItemAsync(string data)
{
var task = ProcessAsync(data);
lock (_lock)
_pending.Add(task);
try
{
await task;
}
catch
{
if (!task.IsCanceled && !task.IsFaulted)
throw; // not the task's exception, rethrow
// don't remove faulted/cancelled tasks from the list
return;
}
// remove successfully completed tasks from the list
lock (_lock)
_pending.Remove(task);
}
public async Task WaitForCompleteAsync()
{
Task[] tasks;
lock (_lock)
tasks = _pending.ToArray();
await Task.WhenAll(tasks);
}
}
// download data
static async Task DownloadAsync(string[] urls)
{
var processing = new Processing();
using (var semaphore = new SemaphoreSlim(MAX_DOWNLOADS))
using (var httpClient = new HttpClient())
{
var tasks = urls.Select(async (url) =>
{
await semaphore.WaitAsync();
try
{
var data = await httpClient.GetStringAsync(url);
// put the result on the processing pipeline
processing.QueueItemAsync(data);
}
finally
{
semaphore.Release();
}
});
await Task.WhenAll(tasks.ToArray());
await processing.WaitForCompleteAsync();
}
}
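For comparison, the same two-stage shape (many throttled downloads feeding a few processing workers) can be sketched with System.Threading.Channels instead of a hand-written pending set. This is a different technique from the code above, shown only as a sketch; it assumes .NET Core 3.0 or later and C# 8, and the class name and the download and process delegates are hypothetical parameters rather than part of the answer.

```csharp
using System;
using System.Linq;
using System.Threading;
using System.Threading.Channels;
using System.Threading.Tasks;

static class ChannelPipeline
{
    public static async Task<int> RunAsync(
        string[] urls,
        Func<string, Task<string>> download, // e.g. httpClient.GetStringAsync
        Action<string> process,              // CPU-bound work
        int maxDownloads = 50,
        int maxProcessors = 4)
    {
        var channel = Channel.CreateUnbounded<string>();
        int processed = 0;

        // Processing stage: a fixed number of workers drain the channel.
        var consumers = Enumerable.Range(0, maxProcessors).Select(_ => Task.Run(async () =>
        {
            await foreach (var data in channel.Reader.ReadAllAsync())
            {
                process(data);
                Interlocked.Increment(ref processed);
            }
        })).ToArray();

        // Download stage: the semaphore caps the number of requests in flight.
        using (var semaphore = new SemaphoreSlim(maxDownloads))
        {
            var downloads = urls.Select(async url =>
            {
                await semaphore.WaitAsync();
                try { await channel.Writer.WriteAsync(await download(url)); }
                finally { semaphore.Release(); }
            }).ToArray();
            await Task.WhenAll(downloads);
        }

        // No more items will arrive; the consumers finish once the channel is empty.
        channel.Writer.Complete();
        await Task.WhenAll(consumers);
        return processed;
    }
}
```

The unbounded channel means downloads are never held back by slow processing; a bounded channel (Channel.CreateBounded) would add back-pressure if memory use is a concern.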
As requested, here's the code I ended up going with.
The work is set up in a master-detail configuration, and each master is processed as a batch. Each unit of work is queued up in this fashion:
var success = true;
// Start processing all the master records.
Master master;
while (null != (master = await StoredProcedures.ClaimRecordsAsync(...)))
{
await masterBuffer.SendAsync(master);
}
// Finished sending master records
masterBuffer.Complete();
// Now, wait for all the batches to complete.
await batchAction.Completion;
return success;
Masters are buffered one at a time to save work for other outside processes. The details for each master are dispatched for work via the masterTransform TransformManyBlock. A BatchedJoinBlock is also created to collect the details in one batch.
The actual work is done in the detailTransform TransformBlock, asynchronously, 150 at a time. BoundedCapacity is set to 300 to ensure that too many Masters don't get buffered at the beginning of the chain, while also leaving room for enough detail records to be queued to allow 150 records to be processed at one time. The block outputs an object to its targets, because it's filtered across the links depending on whether it's a Detail or Exception.
The batchAction ActionBlock collects the output from all the batches, and performs bulk database updates, error logging, etc. for each batch.
There will be several BatchedJoinBlocks, one for each master. Since each ISourceBlock is output sequentially and each batch only accepts the number of detail records associated with one master, the batches will be processed in order. Each block only outputs one group, and is unlinked on completion. Only the last batch block propagates its completion to the final ActionBlock.
The dataflow network:
// The dataflow network
BufferBlock<Master> masterBuffer = null;
TransformManyBlock<Master, Detail> masterTransform = null;
TransformBlock<Detail, object> detailTransform = null;
ActionBlock<Tuple<IList<object>, IList<object>>> batchAction = null;
// Buffer master records to enable efficient throttling.
masterBuffer = new BufferBlock<Master>(new DataflowBlockOptions { BoundedCapacity = 1 });
// Sequentially transform master records into a stream of detail records.
masterTransform = new TransformManyBlock<Master, Detail>(async masterRecord =>
{
var records = await StoredProcedures.GetObjectsAsync(masterRecord);
// Filter the master records based on some criteria here
var filteredRecords = records;
// Only propagate completion to the last batch
var propagateCompletion = masterBuffer.Completion.IsCompleted && masterTransform.InputCount == 0;
// Create a batch join block to encapsulate the results of the master record.
var batchjoinblock = new BatchedJoinBlock<object, object>(records.Count(), new GroupingDataflowBlockOptions { MaxNumberOfGroups = 1 });
// Add the batch block to the detail transform pipeline's link queue, and link the batch block to the batch action block.
var detailLink1 = detailTransform.LinkTo(batchjoinblock.Target1, detailResult => detailResult is Detail);
var detailLink2 = detailTransform.LinkTo(batchjoinblock.Target2, detailResult => detailResult is Exception);
var batchLink = batchjoinblock.LinkTo(batchAction, new DataflowLinkOptions { PropagateCompletion = propagateCompletion });
// Unlink batchjoinblock upon completion.
// (the returned task does not need to be awaited, despite the warning.)
batchjoinblock.Completion.ContinueWith(task =>
{
detailLink1.Dispose();
detailLink2.Dispose();
batchLink.Dispose();
});
return filteredRecords;
}, new ExecutionDataflowBlockOptions { BoundedCapacity = 1 });
// Process each detail record asynchronously, 150 at a time.
detailTransform = new TransformBlock<Detail, object>(async detail => {
try
{
// Perform the action for each detail here asynchronously
await DoSomethingAsync();
return detail;
}
catch (Exception e)
{
success = false;
return e;
}
}, new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 150, BoundedCapacity = 300 });
// Perform the proper action for each batch
batchAction = new ActionBlock<Tuple<IList<object>, IList<object>>>(async batch =>
{
var details = batch.Item1.Cast<Detail>();
var errors = batch.Item2.Cast<Exception>();
// Do something with the batch here
}, new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 4 });
masterBuffer.LinkTo(masterTransform, new DataflowLinkOptions { PropagateCompletion = true });
masterTransform.LinkTo(detailTransform, new DataflowLinkOptions { PropagateCompletion = true });
I am failing to understand why this doesn't seem to run the tasks in Parallel:
var tasks = new Task<MyReturnType>[mbis.Length];
for (int i = 0; i < tasks.Length; i++)
{
tasks[i] = CAS.Service.GetAllRouterInterfaces(mbis[i], 3);
}
Parallel.ForEach(tasks, task => task.Start());
By stepping through the execution, I see that as soon as this line is evaluated:
tasks[i] = CAS.Service.GetAllRouterInterfaces(mbis[i], 3);
The task starts. I want to add all the new tasks to the list, and then execute them in parallel.
If GetAllRouterInterfaces is an async method, the resulting Task will already be started (see this answer for further explanation).
This means that tasks will contain multiple tasks all of which are running in parallel without the subsequent call to Parallel.ForEach.
You may wish to wait for all the entries in tasks to complete, you can do this with an await Task.WhenAll(tasks);.
So you should end up with:
var tasks = new Task<MyReturnType>[mbis.Length];
for (int i = 0; i < tasks.Length; i++)
{
tasks[i] = CAS.Service.GetAllRouterInterfaces(mbis[i], 3);
}
await Task.WhenAll(tasks);
Update from comments
It seems that despite GetAllRouterInterfaces being async and returning a Task it is still making synchronous POST requests (presumably before any other await). This would explain why you are getting minimal concurrency, as each call to GetAllRouterInterfaces is blocking while this request is made. The ideal solution would be to make an asynchronous POST request, e.g.:
await webclient.PostAsync(request).ConfigureAwait(false);
This will ensure your for loop is not blocked and the requests are made concurrently.
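The effect described in this update, that an async method runs synchronously on the caller until its first await, can be seen in a small experiment. The sketch below uses made-up names and Thread.Sleep as a stand-in for the synchronous POST; it is not the poster's actual service code.

```csharp
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

static class FirstAwaitDemo
{
    // Hypothetical stand-in for a method that blocks before its first await.
    public static async Task BlockingThenAwaitAsync(int id, List<string> log)
    {
        log.Add($"enter {id}");
        Thread.Sleep(50);                 // simulates the synchronous POST
        log.Add($"blocked part done {id}");
        await Task.Yield();               // control returns to the caller only here
    }

    public static async Task<List<string>> RunAsync()
    {
        var log = new List<string>();
        var tasks = new List<Task>();
        for (int i = 0; i < 2; i++)
            tasks.Add(BlockingThenAwaitAsync(i, log));
        log.Add("loop finished");
        await Task.WhenAll(tasks);
        return log;
    }
}
```

The log comes back as "enter 0", "blocked part done 0", "enter 1", "blocked part done 1", "loop finished": the second call does not even start until the blocking part of the first one is over, which is why the loop shows almost no concurrency.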
Further update after conversation
It seems you are unable to make the POST requests asynchronous and GetAllRouterInterfaces does not actually do any asynchronous work, due to this I have advised the following:
Remove async from GetAllRouterInterfaces and change the return type to MyReturnType
Call GetAllRouterInterfaces in parallel like so
var routerInterfaces = mbis.AsParallel()
.Select(mbi => CAS.Service.GetAllRouterInterfaces(mbi, 3));
I don't know if I understand you the right way.
First of all, if GetAllRouterInterfaces returns a Task, you have to await the result.
With Parallel.ForEach you can't await tasks as it is, but you can do something similar like this:
public async Task RunInParallel(IEnumerable<TWhatEver> mbisItems)
{
//mbisItems == your parameter that you want to pass to GetAllRouterInterfaces
//degree of concurrency
var concurrentTasks = 3;
//Parallel.Foreach does internally something like this:
await Task.WhenAll(
from partition in Partitioner.Create(mbisItems).GetPartitions(concurrentTasks)
select Task.Run(async delegate
{
using (partition)
while (partition.MoveNext())
{
var currentMbis = partition.Current;
var yourResult = await GetAllRouterInterfaces(currentMbis,3);
}
}
));
}
I am using the HttpClient in System.Net.Http to make requests against an API. The API is limited to 10 requests per second.
My code is roughly like so:
List<Task> taskList = items.Select(i => ProcessItem(i)).ToList();
try
{
await Task.WhenAll(taskList.ToArray());
}
catch (Exception ex)
{
}
The ProcessItem method does a few things but always calls the API using the following:
await SendRequestAsync(..blah). Which looks like:
private async Task<Response> SendRequestAsync(HttpRequestMessage request, CancellationToken token)
{
token.ThrowIfCancellationRequested();
var response = await HttpClient
.SendAsync(request: request, cancellationToken: token).ConfigureAwait(continueOnCapturedContext: false);
token.ThrowIfCancellationRequested();
return await Response.BuildResponse(response);
}
Originally the code worked fine but when I started using Task.WhenAll I started getting 'Rate Limit Exceeded' messages from the API. How can I limit the rate at which requests are made?
It's worth noting that ProcessItem can make between 1 and 4 API calls depending on the item.
The API is limited to 10 requests per second.
Then just have your code do a batch of 10 requests, ensuring they take at least one second:
Items[] items = ...;
int index = 0;
while (index < items.Length)
{
var timer = Task.Delay(TimeSpan.FromSeconds(1.2)); // ".2" to make sure
var tasks = items.Skip(index).Take(10).Select(i => ProcessItemsAsync(i));
var tasksAndTimer = tasks.Concat(new[] { timer });
await Task.WhenAll(tasksAndTimer);
index += 10;
}
Update
My ProcessItems method makes 1-4 API calls depending on the item.
In this case, batching is not an appropriate solution. You need to limit an asynchronous method to a certain number, which implies a SemaphoreSlim. The tricky part is that you want to allow more calls over time.
I haven't tried this code, but the general idea I would go with is to have a periodic function that releases the semaphore up to 10 times. So, something like this:
private readonly SemaphoreSlim _semaphore = new SemaphoreSlim(10, 10); // maxCount of 10, so Release() throws SemaphoreFullException once the semaphore is full
private async Task<Response> ThrottledSendRequestAsync(HttpRequestMessage request, CancellationToken token)
{
await _semaphore.WaitAsync(token);
return await SendRequestAsync(request, token);
}
private async Task PeriodicallyReleaseAsync(Task stop)
{
while (true)
{
var timer = Task.Delay(TimeSpan.FromSeconds(1.2));
if (await Task.WhenAny(timer, stop) == stop)
return;
// Release the semaphore at most 10 times.
for (int i = 0; i != 10; ++i)
{
try
{
_semaphore.Release();
}
catch (SemaphoreFullException)
{
break;
}
}
}
}
Usage:
// Start the periodic task, with a signal that we can use to stop it.
var stop = new TaskCompletionSource<object>();
var periodicTask = PeriodicallyReleaseAsync(stop.Task);
// Wait for all item processing.
await Task.WhenAll(taskList);
// Stop the periodic task.
stop.SetResult(null);
await periodicTask;
The answer is similar to this one.
Instead of using a list of tasks and WhenAll, use Parallel.ForEach and use ParallelOptions to limit the number of concurrent tasks to 10, and make sure each one takes at least 1 second:
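Parallel.ForEach(
items,
new ParallelOptions { MaxDegreeOfParallelism = 10 },
item => {
ProcessItems(item);
// Block the worker: Parallel.ForEach does not wait for async lambdas, so an awaited Task.Delay would not throttle anything.
Thread.Sleep(1000);
}
);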
Or if you want to make sure each item takes as close to 1 second as possible:
Parallel.ForEach(
items,
new ParallelOptions { MaxDegreeOfParallelism = 10 },
item => {
var watch = Stopwatch.StartNew();
ProcessItems(item);
watch.Stop();
// Block for the rest of the second; Parallel.ForEach does not wait for async lambdas.
if (watch.ElapsedMilliseconds < 1000) Thread.Sleep((int)(1000 - watch.ElapsedMilliseconds));
}
);
Or:
Parallel.ForEach(
items,
new ParallelOptions { MaxDegreeOfParallelism = 10 },
item => {
// Block until both the delay and the work are done; an async lambda would return at its first await.
Task.WaitAll(
Task.Delay(1000),
Task.Run(() => { ProcessItems(item); })
);
}
);
UPDATED ANSWER
My ProcessItems method makes 1-4 API calls depending on the item. So with a batch size of 10 I still exceed the rate limit.
You need to implement a rolling window in SendRequestAsync. A queue containing the timestamp of each request is a suitable data structure. You dequeue entries whose timestamp is older than the rate-limit window (one second for this API, which allows 10 requests per second). As it so happens, there is an implementation as an answer to a similar question on SO.
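A rolling window of this kind could look roughly like the sketch below. It is not the implementation the answer links to; the class and member names are invented for illustration, maxCalls is assumed to be at least 1, and DateTime.UtcNow is used for brevity even though a monotonic clock would be more robust.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

// Illustrative sketch; the names are not from any library.
sealed class RollingWindowLimiter
{
    private readonly int _maxCalls;      // assumed to be >= 1
    private readonly TimeSpan _window;
    private readonly Queue<DateTime> _timestamps = new Queue<DateTime>();
    private readonly SemaphoreSlim _gate = new SemaphoreSlim(1, 1);

    public RollingWindowLimiter(int maxCalls, TimeSpan window)
    {
        _maxCalls = maxCalls;
        _window = window;
    }

    // Completes once the caller may send a request without exceeding the limit.
    public async Task WaitAsync(CancellationToken token = default)
    {
        await _gate.WaitAsync(token); // one waiter at a time inspects the queue
        try
        {
            while (true)
            {
                var now = DateTime.UtcNow;

                // Drop timestamps that have left the window.
                while (_timestamps.Count > 0 && now - _timestamps.Peek() >= _window)
                    _timestamps.Dequeue();

                if (_timestamps.Count < _maxCalls)
                {
                    _timestamps.Enqueue(now);
                    return;
                }

                // Window is full: wait until the oldest entry expires, then re-check.
                await Task.Delay(_timestamps.Peek() + _window - now, token);
            }
        }
        finally
        {
            _gate.Release();
        }
    }
}
```

With a shared instance such as new RollingWindowLimiter(10, TimeSpan.FromSeconds(1)), calling await limiter.WaitAsync(token) at the top of SendRequestAsync would limit every API call, no matter how many calls a single item makes.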
ORIGINAL ANSWER
May still be useful to others
One straightforward way to handle this is to batch your requests in groups of 10, run those concurrently, and then wait until a total of 1 second has elapsed (if it hasn't already). This will bring you in right at the rate limit if the batch of requests can complete in 1 second, but is less than optimal if the batch of requests takes longer. Have a look at the .Batch() extension method in MoreLinq. Code would look approximately like
foreach (var taskList in tasks.Batch(10))
{
Stopwatch sw = Stopwatch.StartNew(); // From System.Diagnostics
await Task.WhenAll(taskList.ToArray());
if (sw.Elapsed.TotalSeconds < 1.0)
{
// Calculate how long you still have to wait and sleep that long
// You might want to wait 1.1 or 1.2 seconds just in case the rate
// limiting on the other side isn't perfectly implemented
}
}
https://github.com/thomhurst/EnumerableAsyncProcessor
I've written a library to help with this sort of logic.
Usage would be:
var responses = await AsyncProcessorBuilder.WithItems(items) // Or Extension Method: items.ToAsyncProcessorBuilder()
.SelectAsync(item => ProcessItem(item), CancellationToken.None)
.ProcessInParallel(levelOfParallelism: 10, TimeSpan.FromSeconds(1));