await thousands of Tasks - c#

I have an application that converts data; there are often 1,000 to 30,000 files.
I need to do 3 steps:
Copy a file (replacing some text in it)
Make a web request with WebClient to download a file (I send the copied file to a web server, which converts the file to another format)
Take the downloaded file and change some of its content
All three steps involve some I/O, so I used async/await methods:
var tasks = files.Select(async (file) =>
{
Item item = await createtempFile(file).ConfigureAwait(false);
await convert(item).ConfigureAwait(false);
await clean(item).ConfigureAwait(false);
}).ToList();
await Task.WhenAll(tasks).ConfigureAwait(false);
I don't know if this is best practice, because I create more than a thousand tasks. I thought about splitting the three steps like this:
List<Item> items = new List<Item>();
var tasks = files.Select(async (file) =>
{
Item item = await createtempFile(file, ext).ConfigureAwait(false);
lock(items)
items.Add(item);
}).ToList();
await Task.WhenAll(tasks).ConfigureAwait(false);
var convertTasks = items.Select(async (item) =>
{
await convert(item, baseAddress, ext).ConfigureAwait(false);
}).ToList();
await Task.WhenAll(convertTasks).ConfigureAwait(false);
var cleanTasks = items.Select(async (item) =>
{
await clean(targetFile, item.Doctype, ext).ConfigureAwait(false);
}).ToList();
await Task.WhenAll(cleanTasks).ConfigureAwait(false);
But that doesn't seem to be better or faster, because I create thousands of tasks three times over.
Should I throttle the creation of tasks, for example in chunks of 100 tasks?
Or am I just overthinking it, and creating thousands of tasks is just fine?
The CPU is idling at a 2-4% peak, so I wondered whether there are too many awaits or context switches.
Maybe there are too many WebRequest calls, because the WebServer/WebService can't handle thousands of requests simultaneously, and I should throttle only the WebRequests?
I already increased the .NET maxconnection setting in the app.config file.

It is possible to run async operations in parallel while limiting the number of concurrent operations. There is a handy extension method for that; it is not part of the .NET Framework.
/// <summary>
/// Enumerates a collection in parallel and calls an async method on each item. Useful for making
/// parallel async calls, e.g. independent web requests when the degree of parallelism needs to be
/// limited.
/// </summary>
public static Task ForEachAsync<T>(this IEnumerable<T> source, int degreeOfParallelism, Func<T, Task> action)
{
return Task.WhenAll(Partitioner.Create(source).GetPartitions(degreeOfParallelism).Select(partition => Task.Run(async () =>
{
using (partition)
while (partition.MoveNext())
await action(partition.Current);
})));
}
Call it like this:
var files = new List<string> {"one", "two", "three"};
await files.ForEachAsync(5, async file =>
{
// do async stuff here with the file
await Task.Delay(1000);
});

As commenters have correctly noted, you're overthinking it. The .NET runtime has absolutely no problem tracking thousands of tasks.
However, you might want to consider using a TPL Dataflow pipeline, which would enable you to easily have different concurrency levels for different operations ("blocks") in your pipeline.
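As a rough illustration of the Dataflow suggestion, here is a minimal sketch of a three-stage pipeline where each stage has its own concurrency limit. The helper methods (CreateTempFileAsync, ConvertAsync, CleanAsync), the file names, and the concurrency numbers are hypothetical stand-ins for the question's steps, and the sketch assumes the System.Threading.Tasks.Dataflow library is available.

```csharp
using System;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

// Hypothetical stand-ins for the question's createtempFile / convert / clean steps.
static async Task<string> CreateTempFileAsync(string file) { await Task.Delay(10); return file + ".tmp"; }
static async Task<string> ConvertAsync(string temp) { await Task.Delay(50); return temp + ".converted"; }
static async Task CleanAsync(string converted) { await Task.Delay(10); }

// Completion of one block flows to the next one.
var linkOptions = new DataflowLinkOptions { PropagateCompletion = true };

// Each block gets its own concurrency level (the numbers are arbitrary assumptions).
var createBlock = new TransformBlock<string, string>(
    file => CreateTempFileAsync(file),
    new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 8 });

// The web request step is throttled harder than the local file steps.
var convertBlock = new TransformBlock<string, string>(
    temp => ConvertAsync(temp),
    new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 4 });

int cleaned = 0;
var cleanBlock = new ActionBlock<string>(
    async converted =>
    {
        await CleanAsync(converted);
        Interlocked.Increment(ref cleaned);
    },
    new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 8 });

createBlock.LinkTo(convertBlock, linkOptions);
convertBlock.LinkTo(cleanBlock, linkOptions);

foreach (var file in Enumerable.Range(0, 100).Select(i => $"file{i}.txt"))
{
    await createBlock.SendAsync(file);
}

createBlock.Complete();
await cleanBlock.Completion;
Console.WriteLine($"Processed {cleaned} files");
```

Only the conversion stage is limited to 4 concurrent operations here, which matches the idea of throttling the web requests without slowing down the local file work.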

C# and WebApi stream result of task collection

I have a simple API and one service. The service reads a tree structure from a path taken from configuration. The problem is that the tree can be rather large, so I thought I could solve this by creating a collection of tasks and resolving those tasks as a stream in the ActionResult. To make things harder, I need the tree as a single result; I cannot split it across different requests.
So normally I would get file tree by:
public IEnumerable<string> GetFiles()
{
var result = new List<string>();
foreach (var resource in _root)
{
this.ValidateRootFolder(resource);
result.AddRange(Directory.EnumerateFiles(resource, "*.*", SearchOption.AllDirectories));
}
return result;
}
That is simple, but it can be slow if the tree is giant, so what I am trying to do is something like:
public ConcurrentBag<Task<IEnumerable<string>>> GetFiles()
{
var tasks = new ConcurrentBag<Task<IEnumerable<string>>>();
Parallel.ForEach(_root, (resource, token) =>
{
this.ValidateRootFolder(resource);
var task = Task.Run(() => Directory.EnumerateFiles(resource, "*.*", SearchOption.AllDirectories));
tasks.Add(task);
});
return tasks;
}
This creates a task collection, so I can execute those tasks in the endpoint, something like:
[HttpGet, ActionName("GetFiles")]
public IActionResult GetFiles()
{
ConcurrentBag<Task<IEnumerable<string>>> tasks = _fileService.GetFiles();
return Ok(tasks); // how to make stream out of all these files
}
So my question is: how do I convert this task collection into a stream of results, and is that even possible?
And if not, is there another way to do this?
Parallelism in a web application isn't always a great idea, because it uses threads that would otherwise be serving web requests. Each web request is served by a separate thread. If all cores are busy, incoming web requests will have to wait.
Parallel.ForEach will use all available cores, which means no other request will be served until either Parallel.ForEach completes or one of the worker threads is rescheduled. In most web applications that would be very bad.
One way to handle this would be to use PLINQ to enumerate all folders with a limited degree-of-parallelism and return the results as a single list:
public IEnumerable<string> GetFiles()
{
var files= _roots.AsParallel()
.WithDegreeOfParallelism(2)
.SelectMany(fld=>Directory.EnumerateFiles(fld))
.AsEnumerable();
return files;
}
or
public IEnumerable<string> GetFiles()
{
var files= _roots.AsParallel()
.WithDegreeOfParallelism(2)
.SelectMany(fld=>Directory.EnumerateFiles(fld))
.ToList();
return files;
}
The benefit over Parallel.ForEach is that PLINQ handles the collection of the partial results into the final result set.
If you want to get the files grouped by root, you could use Select and GetFiles:
public IEnumerable<string[]> GetFiles()
{
var files= _roots.AsParallel()
.WithDegreeOfParallelism(2)
.Select(fld=>Directory.GetFiles(fld))
.ToList();
return files;
}
You could also return a dictionary of files per root:
public Dictionary<string,string[]> GetFiles()
{
var files= _roots.AsParallel()
.WithDegreeOfParallelism(2)
.Select(root=>(root,files:Directory.GetFiles(root)))
.ToDictionary(p=>p.root,p=>p.files);
return files;
}
No async
There's no Directory.EnumerateFilesAsync so there's no way to benefit from asynchronous (not parallel) enumeration. The reason is that not all OSs have async IO and even when they do, the file system drivers may not have asynchronous file enumeration.
Windows NT was asynchronous from the start, with blocking operations emulated at the API level. Windows 9x wasn't. Linux on the other hand was synchronous, with async I/O added later. Even Windows doesn't have an async directory enumeration API though, because not all file systems support this.

C# Add to a List Asynchronously in API

I have an API which needs to be run in a loop for Mass processing.
Current single API is:
public async Task<ActionResult<CombinedAddressResponse>> GetCombinedAddress(AddressRequestDto request)
We are not allowed to touch/modify the original single API. However, it can be run in bulk using a foreach statement. What is the best way to run this asynchronously without locks?
The current solution below just builds a list; would this be it?
public async Task<ActionResult<List<CombinedAddressResponse>>> GetCombinedAddress(List<AddressRequestDto> requests)
{
var combinedAddressResponses = new List<CombinedAddressResponse>();
foreach(AddressRequestDto request in requests)
{
var newCombinedAddress = (await GetCombinedAddress(request)).Value;
combinedAddressResponses.Add(newCombinedAddress);
}
return combinedAddressResponses;
}
Update:
In the debugger, I have to go to combinedAddressResponse.Result.Value,
because combinedAddressResponse.Value = null.
Strangely, writing combinedAddressResponse.Result.Value also gives the error "Action Result does not contain a definition for 'Value' and no accessible extension method"
I'm writing this code off the top of my head without an IDE or sleep, so please comment if I'm missing something or there's a better way.
But effectively I think you want to run all your requests at once (not sequentially) doing something like this:
public async Task<ActionResult<List<CombinedAddressResponse>>> GetCombinedAddress(List<AddressRequestDto> requests)
{
var combinedAddressResponses = new List<CombinedAddressResponse>(requests.Count);
var tasks = new List<Task<ActionResult<CombinedAddressResponse>>>(requests.Count);
foreach (var request in requests)
{
tasks.Add(Task.Run(async () => await GetCombinedAddress(request)));
}
//This waits for all the tasks to complete
await Task.WhenAll(tasks);
combinedAddressResponses.AddRange(tasks.Select(x => x.Result.Value));
return combinedAddressResponses;
}
looking for a way to speed things up and run in parallel thanks
What you need is "asynchronous concurrency". I use the term "concurrency" to mean "doing more than one thing at a time", and "parallel" to mean "doing more than one thing at a time using threads". Since you're on ASP.NET, you don't want to use additional threads; you'd want to use a form of concurrency that works asynchronously (which uses fewer threads). So, Parallel and Task.Run should not be parts of your solution.
The way to do asynchronous concurrency is to build a collection of tasks, and then use await Task.WhenAll. E.g.:
public async Task<ActionResult<IReadOnlyList<CombinedAddressResponse>>> GetCombinedAddress(List<AddressRequestDto> requests)
{
// Build the collection of tasks by doing an asynchronous operation for each request.
var tasks = requests.Select(async request =>
{
var combinedAddressResponse = await GetCombinedAddress(request);
return combinedAddressResponse.Value;
}).ToList();
// Wait for all the tasks to complete and get the results.
var results = await Task.WhenAll(tasks);
return results;
}

Spawn new thread inside each foreach(), but do not return until all complete

I have a foreach() that loops through 15 reports and generates a PDF for each. The PDF generation process is slow (3 seconds each). But if I could generate them all concurrently with threads, maybe all 15 could be done in 4-5 seconds total. One constraint is that the function must not return until ALL pdfs have generated. Also, will 15 concurrent worker threads cause problems or instability for dotnet/windows?
Here is my pseudocode:
private void makePDFs(string path) {
string[] folders = Directory.GetDirectories(path);
foreach(string folderPath in folders) {
generatePDF(...);
}
// DO NOT RETURN UNTIL ALL PDFs HAVE BEEN GENERATED
}
What is the simplest way to achieve this?
The most straightforward approach is to use Parallel.ForEach:
private void makePDFs(string path)
{
string[] folders = Directory.GetDirectories(path);
Parallel.ForEach(folders, (folderPath) =>
{
generatePDF(folderPath);
});
//WILL NOT RETURN UNTIL ALL PDFs HAVE BEEN GENERATED
}
This way you avoid having to create, keep track of, and await each separate task; the TPL does it all for you.
You need to get a list of tasks and then use Task.WhenAll to wait for completion
var tasks = folders.Select(folder => Task.Run(() => generatePDF(...)));
await Task.WhenAll(tasks);
If you can't or don't want to use async/await you can use:
Task.WaitAll(tasks.ToArray());
It will block the current thread until all tasks are completed, so I'd recommend the first approach if you can use it.
You can also run your PDF generation in parallel using the Parallel class:
Parallel.ForEach(folders, folder => generatePDF(...));
Please see this answer to choose which approach works the best for your problem.
.NET has a handy method just for this: Task.WhenAll(IEnumerable<Task>)
It will wait for all tasks in the IEnumerable to finish before continuing. It is an async method, so you need to await it.
var tasks = new List<Task>();
foreach(string folderPath in folders) {
tasks.Add(Task.Run(() => generatePDF(folderPath)));
}
await Task.WhenAll(tasks);

How to limit the amount of concurrent async I/O operations?

// let's say there is a list of 1000+ URLs
string[] urls = { "http://google.com", "http://yahoo.com", ... };
// now let's send HTTP requests to each of these URLs in parallel
urls.AsParallel().ForAll(async (url) => {
var client = new HttpClient();
var html = await client.GetStringAsync(url);
});
Here is the problem, it starts 1000+ simultaneous web requests. Is there an easy way to limit the concurrent amount of these async http requests? So that no more than 20 web pages are downloaded at any given time. How to do it in the most efficient manner?
You can definitely do this in the latest versions of async for .NET, using .NET 4.5 Beta. The previous post from 'usr' points to a good article written by Stephen Toub, but the less announced news is that the async semaphore actually made it into the Beta release of .NET 4.5
If you look at our beloved SemaphoreSlim class (which you should be using since it's more performant than the original Semaphore), it now boasts the WaitAsync(...) series of overloads, with all of the expected arguments - timeout intervals, cancellation tokens, all of your usual scheduling friends :)
Stephen's also written a more recent blog post about the new .NET 4.5 goodies that came out with beta see What’s New for Parallelism in .NET 4.5 Beta.
Last, here's some sample code about how to use SemaphoreSlim for async method throttling:
public async Task MyOuterMethod()
{
// let's say there is a list of 1000+ URLs
string[] urls = { "http://google.com", "http://yahoo.com", ... };
// now let's send HTTP requests to each of these URLs in parallel
var allTasks = new List<Task>();
var throttler = new SemaphoreSlim(initialCount: 20);
foreach (var url in urls)
{
// do an async wait until we can schedule again
await throttler.WaitAsync();
// using Task.Run(...) to run the lambda in its own parallel
// flow on the threadpool
allTasks.Add(
Task.Run(async () =>
{
try
{
var client = new HttpClient();
var html = await client.GetStringAsync(url);
}
finally
{
throttler.Release();
}
}));
}
// won't get here until all urls have been put into tasks
await Task.WhenAll(allTasks);
// won't get here until all tasks have completed in some way
// (either success or exception)
}
Last, but probably a worthy mention is a solution that uses TPL-based scheduling. You can create delegate-bound tasks on the TPL that have not yet been started, and allow for a custom task scheduler to limit the concurrency. In fact, there's an MSDN sample for it here:
See also TaskScheduler .
If you have an IEnumerable (e.g. URL strings) and you want to run an I/O-bound operation on each item (e.g. make an async HTTP request) concurrently, and optionally cap the maximum number of concurrent I/O requests at run time, here is how you can do that. This approach does not tie up thread pool threads; it uses SemaphoreSlim to control the maximum number of concurrent I/O requests, similar to a sliding window: when one request completes it leaves the semaphore and the next one gets in.
usage:
await ForEachAsync(urlStrings, YourAsyncFunc, optionalMaxDegreeOfConcurrency);
public static Task ForEachAsync<TIn>(
IEnumerable<TIn> inputEnumerable,
Func<TIn, Task> asyncProcessor,
int? maxDegreeOfParallelism = null)
{
int maxAsyncThreadCount = maxDegreeOfParallelism ?? DefaultMaxDegreeOfParallelism;
SemaphoreSlim throttler = new SemaphoreSlim(maxAsyncThreadCount, maxAsyncThreadCount);
IEnumerable<Task> tasks = inputEnumerable.Select(async input =>
{
await throttler.WaitAsync().ConfigureAwait(false);
try
{
await asyncProcessor(input).ConfigureAwait(false);
}
finally
{
throttler.Release();
}
});
return Task.WhenAll(tasks);
}
Since the release of .NET 6 (November 2021), the recommended way of limiting the number of concurrent asynchronous I/O operations is the Parallel.ForEachAsync API, with the MaxDegreeOfParallelism configuration. Here is how it can be used in practice:
// let's say there is a list of 1000+ URLs
string[] urls = { "http://google.com", "http://yahoo.com", /*...*/ };
var client = new HttpClient();
var options = new ParallelOptions() { MaxDegreeOfParallelism = 20 };
// now let's send HTTP requests to each of these URLs in parallel
await Parallel.ForEachAsync(urls, options, async (url, cancellationToken) =>
{
var html = await client.GetStringAsync(url, cancellationToken);
});
In the above example the Parallel.ForEachAsync task is awaited asynchronously. You can also Wait it synchronously if you need to, which will block the current thread until the completion of all asynchronous operations. The synchronous Wait has the advantage that in case of errors, all exceptions will be propagated. On the contrary the await operator propagates by design only the first exception. In case this is a problem, you can find solutions here.
(Note: an idiomatic implementation of a ForEachAsync extension method that also propagates the results, can be found in the 4th revision of this answer)
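To make the difference in exception propagation concrete, here is a small sketch. It uses Task.WhenAll rather than Parallel.ForEachAsync so that both failures are guaranteed to occur; FailAsync and its messages are hypothetical names made up for the demo.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical helper: an async operation that always fails.
static async Task FailAsync(string message)
{
    await Task.Delay(10);
    throw new InvalidOperationException(message);
}

// Both operations fail, so the combined task holds two exceptions.
Task all = Task.WhenAll(FailAsync("first"), FailAsync("second"));

int surfacedByAwait = 0;
try
{
    await all;
}
catch (InvalidOperationException)
{
    // await rethrows a single exception, not the whole AggregateException.
    surfacedByAwait = 1;
}

// The task itself still records every failure in its AggregateException.
int recorded = all.Exception?.InnerExceptions.Count ?? 0;
Console.WriteLine($"await surfaced {surfacedByAwait} exception, the task recorded {recorded}");
```

If every error matters, inspecting the Exception property of the task after the await fails is one way to see all of them.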
There are a lot of pitfalls and direct use of a semaphore can be tricky in error cases, so I would suggest to use AsyncEnumerator NuGet Package instead of re-inventing the wheel:
// let's say there is a list of 1000+ URLs
string[] urls = { "http://google.com", "http://yahoo.com", ... };
// now let's send HTTP requests to each of these URLs in parallel
await urls.ParallelForEachAsync(async (url) => {
var client = new HttpClient();
var html = await client.GetStringAsync(url);
}, maxDegreeOfParalellism: 20);
Unfortunately, the .NET Framework is missing most important combinators for orchestrating parallel async tasks. There is no such thing built-in.
Look at the AsyncSemaphore class built by the most respectable Stephen Toub. What you want is called a semaphore, and you need an async version of it.
SemaphoreSlim can be very helpful here. Here's the extension method I've created.
/// <summary>
/// Concurrently executes async actions for each item of <see cref="IEnumerable{T}"/>
/// </summary>
/// <typeparam name="T">Type of IEnumerable</typeparam>
/// <param name="enumerable">instance of <see cref="IEnumerable{T}"/></param>
/// <param name="action">an async <see cref="Action" /> to execute</param>
/// <param name="maxActionsToRunInParallel">Optional, max number of actions to run in parallel,
/// Must be greater than 0</param>
/// <returns>A Task representing an async operation</returns>
/// <exception cref="ArgumentOutOfRangeException">If the maxActionsToRunInParallel is less than 1</exception>
public static async Task ForEachAsyncConcurrent<T>(
this IEnumerable<T> enumerable,
Func<T, Task> action,
int? maxActionsToRunInParallel = null)
{
if (maxActionsToRunInParallel.HasValue)
{
using (var semaphoreSlim = new SemaphoreSlim(
maxActionsToRunInParallel.Value, maxActionsToRunInParallel.Value))
{
var tasksWithThrottler = new List<Task>();
foreach (var item in enumerable)
{
// Increment the number of currently running tasks and wait if they are more than limit.
await semaphoreSlim.WaitAsync();
tasksWithThrottler.Add(Task.Run(async () =>
{
try
{
await action(item);
}
finally
{
// The action has finished (successfully or not), so decrement the number of currently running tasks.
semaphoreSlim.Release();
}
}));
}
// Wait for all of the provided tasks to complete.
await Task.WhenAll(tasksWithThrottler.ToArray());
}
}
else
{
await Task.WhenAll(enumerable.Select(item => action(item)));
}
}
Sample Usage:
await enumerable.ForEachAsyncConcurrent(
async item =>
{
await SomeAsyncMethod(item);
},
5);
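Although 1000 tasks might be queued very quickly, PLINQ by default runs only about as many work items concurrently as the machine has CPU cores (unless you change the degree of parallelism with WithDegreeOfParallelism). On a four-core machine, that means about 4 delegates executing at a given time. Note that this limit applies only to synchronous work: an async lambda returns to PLINQ at its first await, so the HTTP requests in the question are not throttled by it.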
This is not good practice, as it changes a global variable. It is also not a general solution for async. But it is easy and applies to all instances of HttpClient, if that's all you're after. You can simply try:
System.Net.ServicePointManager.DefaultConnectionLimit = 20;
Here is a handy extension method that wraps a list of task factories so that they are executed with a maximum degree of concurrency. It takes factories (Func<Task<T>>) rather than tasks, because a Task returned by an async method is already running and can no longer be throttled:
/// <summary>Allows to do any async operation in bulk while limiting the system to a number of concurrent items being processed.</summary>
public static IEnumerable<Task<T>> WithMaxConcurrency<T>(this IEnumerable<Func<Task<T>>> taskFactories, int maxParallelism)
{
SemaphoreSlim maxOperations = new SemaphoreSlim(maxParallelism);
// Each operation is wrapped in a new task that acquires the semaphore before the operation starts and releases it when the operation finishes.
return taskFactories.Select(async factory =>
{
await maxOperations.WaitAsync().ConfigureAwait(false);
try { return await factory().ConfigureAwait(false); }
finally { maxOperations.Release(); }
});
}
Now instead of:
await Task.WhenAll(someTaskFactories.Select(factory => factory()));
You can go
await Task.WhenAll(someTaskFactories.WithMaxConcurrency(20));
Parallel computations should be used for speeding up CPU-bound operations. Here we are talking about I/O bound operations. Your implementation should be purely async, unless you're overwhelming the busy single core on your multi-core CPU.
EDIT I like the suggestion made by usr to use an "async semaphore" here.
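A small sketch of why purely asynchronous code suits I/O-bound work: the example below starts 1000 simulated operations at once without creating 1000 threads. Task.Delay stands in for a real I/O call such as an HTTP request, and FakeIoAsync is a hypothetical name; both are assumptions made for the demo.

```csharp
using System;
using System.Diagnostics;
using System.Linq;
using System.Threading.Tasks;

// Task.Delay stands in for an I/O-bound call (an assumption for the demo).
// While the delay is pending, no thread is blocked waiting for it.
static async Task<int> FakeIoAsync(int id)
{
    await Task.Delay(200);
    return id;
}

var stopwatch = Stopwatch.StartNew();

// Start all 1000 operations, then wait for them together.
int[] results = await Task.WhenAll(
    Enumerable.Range(0, 1000).Select(id => FakeIoAsync(id)));

stopwatch.Stop();
Console.WriteLine($"{results.Length} operations finished in {stopwatch.ElapsedMilliseconds} ms");
```

The elapsed time should be far below the 200 seconds that sequential execution would take, since the operations overlap while they wait. A real server may still need the throttling discussed in the other answers.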
Essentially you're going to want to create an Action or Task for each URL that you want to hit, put them in a List, and then process that list, limiting the number that can be processed in parallel.
My blog post shows how to do this both with Tasks and with Actions, and provides a sample project you can download and run to see both in action.
With Actions
If using Actions, you can use the built-in .Net Parallel.Invoke function. Here we limit it to running at most 20 threads in parallel.
var listOfActions = new List<Action>();
foreach (var url in urls)
{
var localUrl = url;
// Note that we only add the Action here; it does not run until Parallel.Invoke is called.
listOfActions.Add(() => CallUrl(localUrl));
}
var options = new ParallelOptions {MaxDegreeOfParallelism = 20};
Parallel.Invoke(options, listOfActions.ToArray());
With Tasks
With Tasks there is no built-in function. However, you can use the one that I provide on my blog.
/// <summary>
/// Starts the given tasks and waits for them to complete. This will run, at most, the specified number of tasks in parallel.
/// <para>NOTE: If one of the given tasks has already been started, an exception will be thrown.</para>
/// </summary>
/// <param name="tasksToRun">The tasks to run.</param>
/// <param name="maxTasksToRunInParallel">The maximum number of tasks to run in parallel.</param>
/// <param name="cancellationToken">The cancellation token.</param>
public static async Task StartAndWaitAllThrottledAsync(IEnumerable<Task> tasksToRun, int maxTasksToRunInParallel, CancellationToken cancellationToken = new CancellationToken())
{
await StartAndWaitAllThrottledAsync(tasksToRun, maxTasksToRunInParallel, -1, cancellationToken);
}
/// <summary>
/// Starts the given tasks and waits for them to complete. This will run the specified number of tasks in parallel.
/// <para>NOTE: If a timeout is reached before the Task completes, another Task may be started, potentially running more than the specified maximum allowed.</para>
/// <para>NOTE: If one of the given tasks has already been started, an exception will be thrown.</para>
/// </summary>
/// <param name="tasksToRun">The tasks to run.</param>
/// <param name="maxTasksToRunInParallel">The maximum number of tasks to run in parallel.</param>
/// <param name="timeoutInMilliseconds">The maximum milliseconds we should allow the max tasks to run in parallel before allowing another task to start. Specify -1 to wait indefinitely.</param>
/// <param name="cancellationToken">The cancellation token.</param>
public static async Task StartAndWaitAllThrottledAsync(IEnumerable<Task> tasksToRun, int maxTasksToRunInParallel, int timeoutInMilliseconds, CancellationToken cancellationToken = new CancellationToken())
{
// Convert to a list of tasks so that we don't enumerate over it multiple times needlessly.
var tasks = tasksToRun.ToList();
using (var throttler = new SemaphoreSlim(maxTasksToRunInParallel))
{
var postTaskTasks = new List<Task>();
// Have each task notify the throttler when it completes so that it decrements the number of tasks currently running.
tasks.ForEach(t => postTaskTasks.Add(t.ContinueWith(tsk => throttler.Release())));
// Start running each task.
foreach (var task in tasks)
{
// Increment the number of tasks currently running and wait if too many are running.
await throttler.WaitAsync(timeoutInMilliseconds, cancellationToken);
cancellationToken.ThrowIfCancellationRequested();
task.Start();
}
// Wait for all of the provided tasks to complete.
// We wait on the list of "post" tasks instead of the original tasks, otherwise there is a potential race condition where the throttler's using block is exited before some Tasks have had their "post" action completed, which references the throttler, resulting in an exception due to accessing a disposed object.
await Task.WhenAll(postTaskTasks.ToArray());
}
}
And then creating your list of Tasks and calling the function to have them run, with say a maximum of 20 simultaneous at a time, you could do this:
var listOfTasks = new List<Task>();
foreach (var url in urls)
{
var localUrl = url;
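// Note that we create the Task here, but do not start it.
// CallUrl is invoked synchronously: wrapping an async lambda in new Task(...) would make the
// Task complete at the first await, so the throttler would release its slot too early.
listOfTasks.Add(new Task(() => CallUrl(localUrl)));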
}
await Tasks.StartAndWaitAllThrottledAsync(listOfTasks, 20);

TPL DataFlow with Lazy Source / stream of data

Suppose you have a TransformBlock with configured parallelism and want to stream data through the block. The input data should be created only when the pipeline can actually start processing it. (And it should be released the moment it leaves the pipeline.)
Can I achieve this? And if so how?
Basically I want a data source that works as an iterator.
Like so:
public IEnumerable<Guid> GetSourceData()
{
//In reality -> this should also be an async task -> but yield return does not work in combination with async/await ...
Func<ICollection<Guid>> GetNextBatch = () => Enumerable.Range(0, 100).Select(x => Guid.NewGuid()).ToArray();
while (true)
{
var batch = GetNextBatch();
if (batch == null || !batch.Any()) break;
foreach (var guid in batch)
yield return guid;
}
}
This would result in roughly 100 records in memory. Admittedly, it would be more if the blocks you append to this data source keep them in memory for some time, but you at least have a chance to hold only a subset (a stream) of the data.
Some background information:
I intend to use this in combination with Azure Cosmos DB, where the source could be all objects in a collection, or a change feed. Needless to say, I don't want all of those objects stored in memory. So this can't work:
using System.Threading.Tasks.Dataflow;
public async Task ExampleTask()
{
Func<Guid, object> TheActualAction = text => text.ToString();
var config = new ExecutionDataflowBlockOptions
{
BoundedCapacity = 5,
MaxDegreeOfParallelism = 15
};
var throtteler = new TransformBlock<Guid, object>(TheActualAction, config);
var output = new BufferBlock<object>();
throtteler.LinkTo(output);
throtteler.Post(Guid.NewGuid());
throtteler.Post(Guid.NewGuid());
throtteler.Post(Guid.NewGuid());
throtteler.Post(Guid.NewGuid());
//...
throtteler.Complete();
await throtteler.Completion;
}
The above example is not good because I add all the items without knowing if they are actually being 'used' by the transform block. Also, I don't really care about the output buffer. I understand that I need to send it somewhere so I can await the completion, but I have no use for the buffer after that. So it should just forget about all it gets ...
Post() will return false if the target is full without blocking. While this could be used in a busy-wait loop, it's wasteful. SendAsync() on the other hand will wait if the target is full :
public async Task ExampleTask()
{
var config = new ExecutionDataflowBlockOptions
{
BoundedCapacity = 50,
MaxDegreeOfParallelism = 15
};
var block = new ActionBlock<Guid>(guid => TheActualAction(guid), config);
while (/* some condition */)
{
var data=await GetDataFromCosmosDB();
await block.SendAsync(data);
//Wait a bit if we want to use polling
await Task.Delay(...);
}
block.Complete();
await block.Completion;
}
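The difference between Post() and SendAsync() on a full block can be seen in a small sketch like the one below. The capacity of 1 and the 200 ms delay are arbitrary values chosen for the demo, and the example assumes the System.Threading.Tasks.Dataflow library is available.

```csharp
using System;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

// A block that holds at most one item at a time (queued or being processed).
var block = new ActionBlock<int>(
    async n => await Task.Delay(200),
    new ExecutionDataflowBlockOptions { BoundedCapacity = 1 });

// The block is empty, so the first item is accepted.
bool firstPost = block.Post(1);

// The single slot is still occupied, so Post declines immediately instead of waiting.
bool secondPost = block.Post(2);

// SendAsync waits asynchronously until the block has room, then hands over the item.
bool sent = await block.SendAsync(3);

block.Complete();
await block.Completion;
Console.WriteLine($"first Post: {firstPost}, second Post: {secondPost}, SendAsync: {sent}");
```

Because SendAsync applies back-pressure to the producer, the loop that reads from the data source naturally slows down to the pace of the block.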
It seems you want to process data at a defined degree of parallelism (MaxDegreeOfParallelism = 15). TPL dataflow is very clunky to use for such a simple requirement.
There's a very simple and powerful pattern that might solve your problem. It's a parallel async foreach loop as described here: https://blogs.msdn.microsoft.com/pfxteam/2012/03/05/implementing-a-simple-foreachasync-part-2/
public static Task ForEachAsync<T>(this IEnumerable<T> source, int dop, Func<T, Task> body)
{
return Task.WhenAll(
from partition in Partitioner.Create(source).GetPartitions(dop)
select Task.Run(async delegate {
using (partition)
while (partition.MoveNext())
await body(partition.Current);
}));
}
You can then write something like:
var dataSource = ...; //some sequence
await dataSource.ForEachAsync(15, async item => await ProcessItem(item));
Very simple.
You can dynamically reduce the DOP by using a SemaphoreSlim. The semaphore acts as a gate that only lets N concurrent threads/tasks in. N can be changed dynamically.
So you would use ForEachAsync as the basic workhorse and then add additional restrictions and throttling on top.
