How to limit the Maximum number of parallel tasks in c# - c#

I have a collection of 1000 input message to process. I'm looping the input collection and starting the new task for each message to get processed.
//Assume this messages collection contains 1000 items
var messages = new List<string>();
foreach (var msg in messages)
{
Task.Factory.StartNew(() =>
{
Process(msg);
});
}
Can we guess how many maximum messages simultaneously get processed at the time (assuming normal Quad core processor), or can we limit the maximum number of messages to be processed at the time?
How to ensure this message get processed in the same sequence/order of the Collection?

You could use Parallel.Foreach and rely on MaxDegreeOfParallelism instead.
Parallel.ForEach(messages, new ParallelOptions {MaxDegreeOfParallelism = 10},
msg =>
{
// logic
Process(msg);
});

SemaphoreSlim is a very good solution in this case and I higly recommend OP to try this, but #Manoj's answer has flaw as mentioned in comments.semaphore should be waited before spawning the task like this.
Updated Answer: As #Vasyl pointed out Semaphore may be disposed before completion of tasks and will raise exception when Release() method is called so before exiting the using block must wait for the completion of all created Tasks.
int maxConcurrency=10;
var messages = new List<string>();
using(SemaphoreSlim concurrencySemaphore = new SemaphoreSlim(maxConcurrency))
{
List<Task> tasks = new List<Task>();
foreach(var msg in messages)
{
concurrencySemaphore.Wait();
var t = Task.Factory.StartNew(() =>
{
try
{
Process(msg);
}
finally
{
concurrencySemaphore.Release();
}
});
tasks.Add(t);
}
Task.WaitAll(tasks.ToArray());
}
Answer to Comments
for those who want to see how semaphore can be disposed without Task.WaitAll
Run below code in console app and this exception will be raised.
System.ObjectDisposedException: 'The semaphore has been disposed.'
static void Main(string[] args)
{
int maxConcurrency = 5;
List<string> messages = Enumerable.Range(1, 15).Select(e => e.ToString()).ToList();
using (SemaphoreSlim concurrencySemaphore = new SemaphoreSlim(maxConcurrency))
{
List<Task> tasks = new List<Task>();
foreach (var msg in messages)
{
concurrencySemaphore.Wait();
var t = Task.Factory.StartNew(() =>
{
try
{
Process(msg);
}
finally
{
concurrencySemaphore.Release();
}
});
tasks.Add(t);
}
// Task.WaitAll(tasks.ToArray());
}
Console.WriteLine("Exited using block");
Console.ReadKey();
}
private static void Process(string msg)
{
Thread.Sleep(2000);
Console.WriteLine(msg);
}

I think it would be better to use Parallel LINQ
Parallel.ForEach(messages ,
new ParallelOptions{MaxDegreeOfParallelism = 4},
x => Process(x);
);
where x is the MaxDegreeOfParallelism

With .NET 5.0 and Core 3.0 channels were introduced.
The main benefit of this producer/consumer concurrency pattern is that you can also limit the input data processing to reduce resource impact.
This is especially helpful when processing millions of data records.
Instead of reading the whole dataset at once into memory, you can now consecutively query only chunks of the data and wait for the workers to process it before querying more.
Code sample with a queue capacity of 50 messages and 5 consumer threads:
/// <exception cref="System.AggregateException">Thrown on Consumer Task exceptions.</exception>
public static async Task ProcessMessages(List<string> messages)
{
const int producerCapacity = 10, consumerTaskLimit = 3;
var channel = Channel.CreateBounded<string>(producerCapacity);
_ = Task.Run(async () =>
{
foreach (var msg in messages)
{
await channel.Writer.WriteAsync(msg);
// blocking when channel is full
// waiting for the consumer tasks to pop messages from the queue
}
channel.Writer.Complete();
// signaling the end of queue so that
// WaitToReadAsync will return false to stop the consumer tasks
});
var tokenSource = new CancellationTokenSource();
CancellationToken ct = tokenSource.Token;
var consumerTasks = Enumerable
.Range(1, consumerTaskLimit)
.Select(_ => Task.Run(async () =>
{
try
{
while (await channel.Reader.WaitToReadAsync(ct))
{
ct.ThrowIfCancellationRequested();
while (channel.Reader.TryRead(out var message))
{
await Task.Delay(500);
Console.WriteLine(message);
}
}
}
catch (OperationCanceledException) { }
catch
{
tokenSource.Cancel();
throw;
}
}))
.ToArray();
Task waitForConsumers = Task.WhenAll(consumerTasks);
try { await waitForConsumers; }
catch
{
foreach (var e in waitForConsumers.Exception.Flatten().InnerExceptions)
Console.WriteLine(e.ToString());
throw waitForConsumers.Exception.Flatten();
}
}
As pointed out by Theodor Zoulias:
On multiple consumer exceptions, the remaining tasks will continue to run and have to take the load of the killed tasks. To avoid this, I implemented a CancellationToken to stop all the remaining tasks and handle the exceptions combined in the AggregateException of waitForConsumers.Exception.
Side note:
The Task Parallel Library (TPL) might be good at automatically limiting the tasks based on your local resources. But when you are processing data remotely via RPC, it's necessary to manually limit your RPC calls to avoid filling the network/processing stack!

If your Process method is async you can't use Task.Factory.StartNew as it doesn't play well with an async delegate. Also there are some other nuances when using it (see this for example).
The proper way to do it in this case is to use Task.Run. Here's #ClearLogic answer modified for an async Process method.
static void Main(string[] args)
{
int maxConcurrency = 5;
List<string> messages = Enumerable.Range(1, 15).Select(e => e.ToString()).ToList();
using (SemaphoreSlim concurrencySemaphore = new SemaphoreSlim(maxConcurrency))
{
List<Task> tasks = new List<Task>();
foreach (var msg in messages)
{
concurrencySemaphore.Wait();
var t = Task.Run(async () =>
{
try
{
await Process(msg);
}
finally
{
concurrencySemaphore.Release();
}
});
tasks.Add(t);
}
Task.WaitAll(tasks.ToArray());
}
Console.WriteLine("Exited using block");
Console.ReadKey();
}
private static async Task Process(string msg)
{
await Task.Delay(2000);
Console.WriteLine(msg);
}

You can create your own TaskScheduler and override QueueTask there.
protected virtual void QueueTask(Task task)
Then you can do anything you like.
One example here:
Limited concurrency level task scheduler (with task priority) handling wrapped tasks

You can simply set the max concurrency degree like this way:
int maxConcurrency=10;
var messages = new List<1000>();
using(SemaphoreSlim concurrencySemaphore = new SemaphoreSlim(maxConcurrency))
{
foreach(var msg in messages)
{
Task.Factory.StartNew(() =>
{
concurrencySemaphore.Wait();
try
{
Process(msg);
}
finally
{
concurrencySemaphore.Release();
}
});
}
}

If you need in-order queuing (processing might finish in any order), there is no need for a semaphore. Old fashioned if statements work fine:
const int maxConcurrency = 5;
List<Task> tasks = new List<Task>();
foreach (var arg in args)
{
var t = Task.Run(() => { Process(arg); } );
tasks.Add(t);
if(tasks.Count >= maxConcurrency)
Task.WaitAny(tasks.ToArray());
}
Task.WaitAll(tasks.ToArray());

I ran into a similar problem where I wanted to produce 5000 results while calling apis, etc. So, I ran some speed tests.
Parallel.ForEach(products.Select(x => x.KeyValue).Distinct().Take(100), id =>
{
new ParallelOptions { MaxDegreeOfParallelism = 100 };
GetProductMetaData(productsMetaData, client, id).GetAwaiter().GetResult();
});
produced 100 results in 30 seconds.
Parallel.ForEach(products.Select(x => x.KeyValue).Distinct().Take(100), id =>
{
new ParallelOptions { MaxDegreeOfParallelism = 100 };
GetProductMetaData(productsMetaData, client, id);
});
Moving the GetAwaiter().GetResult() to the individual async api calls inside GetProductMetaData resulted in 14.09 seconds to produce 100 results.
foreach (var id in ids.Take(100))
{
GetProductMetaData(productsMetaData, client, id);
}
Complete non-async programming with the GetAwaiter().GetResult() in api calls resulted in 13.417 seconds.
var tasks = new List<Task>();
while (y < ids.Count())
{
foreach (var id in ids.Skip(y).Take(100))
{
tasks.Add(GetProductMetaData(productsMetaData, client, id));
}
y += 100;
Task.WhenAll(tasks).GetAwaiter().GetResult();
Console.WriteLine($"Finished {y}, {sw.Elapsed}");
}
Forming a task list and working through 100 at a time resulted in a speed of 7.36 seconds.
using (SemaphoreSlim cons = new SemaphoreSlim(10))
{
var tasks = new List<Task>();
foreach (var id in ids.Take(100))
{
cons.Wait();
var t = Task.Factory.StartNew(() =>
{
try
{
GetProductMetaData(productsMetaData, client, id);
}
finally
{
cons.Release();
}
});
tasks.Add(t);
}
Task.WaitAll(tasks.ToArray());
}
Using SemaphoreSlim resulted in 13.369 seconds, but also took a moment to boot to start using it.
var throttler = new SemaphoreSlim(initialCount: take);
foreach (var id in ids)
{
throttler.WaitAsync().GetAwaiter().GetResult();
tasks.Add(Task.Run(async () =>
{
try
{
skip += 1;
await GetProductMetaData(productsMetaData, client, id);
if (skip % 100 == 0)
{
Console.WriteLine($"started {skip}/{count}, {sw.Elapsed}");
}
}
finally
{
throttler.Release();
}
}));
}
Using Semaphore Slim with a throttler for my async task took 6.12 seconds.
The answer for me in this specific project was use a throttler with Semaphore Slim. Although the while foreach tasklist did sometimes beat the throttler, 4/6 times the throttler won for 1000 records.
I realize I'm not using the OPs code, but I think this is important and adds to this discussion because how is sometimes not the only question that should be asked, and the answer is sometimes "It depends on what you are trying to do."
Now to answer the specific questions:
How to limit the maximum number of parallel tasks in c#: I showed how to limit the number of tasks that are completed at a time.
Can we guess how many maximum messages simultaneously get processed at the time (assuming normal Quad core processor), or can we limit the maximum number of messages to be processed at the time? I cannot guess how many will be processed at a time unless I set an upper limit but I can set an upper limit. Obviously different computers function at different speeds due to CPU, RAM etc. and how many threads and cores the program itself has access to as well as other programs running in tandem on the same computer.
How to ensure this message get processed in the same sequence/order of the Collection? If you want to process everything in a specific order, it is synchronous programming. The point of being able to run things asynchronously is ensuring that they can do everything without an order. As you can see from my code, the time difference is minimal in 100 records unless you use async code. In the event that you need an order to what you are doing, use asynchronous programming up until that point, then await and do things synchronously from there. For example, task1a.start, task2a.start, then later task1a.await, task2a.await... then later task1b.start task1b.await and task2b.start task 2b.await.

public static void RunTasks(List<NamedTask> importTaskList)
{
List<NamedTask> runningTasks = new List<NamedTask>();
try
{
foreach (NamedTask currentTask in importTaskList)
{
currentTask.Start();
runningTasks.Add(currentTask);
if (runningTasks.Where(x => x.Status == TaskStatus.Running).Count() >= MaxCountImportThread)
{
Task.WaitAny(runningTasks.ToArray());
}
}
Task.WaitAll(runningTasks.ToArray());
}
catch (Exception ex)
{
Log.Fatal("ERROR!", ex);
}
}

you can use the BlockingCollection, If the consume collection limit has reached, the produce will stop producing until a consume process will finish. I find this pattern more easy to understand and implement than the SemaphoreSlim.
int TasksLimit = 10;
BlockingCollection<Task> tasks = new BlockingCollection<Task>(new ConcurrentBag<Task>(), TasksLimit);
void ProduceAndConsume()
{
var producer = Task.Factory.StartNew(RunProducer);
var consumer = Task.Factory.StartNew(RunConsumer);
try
{
Task.WaitAll(new[] { producer, consumer });
}
catch (AggregateException ae) { }
}
void RunConsumer()
{
foreach (var task in tasks.GetConsumingEnumerable())
{
task.Start();
}
}
void RunProducer()
{
for (int i = 0; i < 1000; i++)
{
tasks.Add(new Task(() => Thread.Sleep(1000), TaskCreationOptions.AttachedToParent));
}
}
Note that the RunProducer and RunConsumer has spawn two independent tasks.

Related

Limit number of notifications in scheduler

I have an cold observable and an observer.
Both are slow but the observer is slower than the observable.
They process many and many notifications so I don't want to store notifications without limit.
This sample takes about 30 seconds to complete. Very slow. I believe they can do in 21 seconds.
var subject = new Subject<int>();
subject.Subscribe(i =>
{
Thread.Sleep(2000);
Console.WriteLine($"{DateTime.Now:HH:mm:ss} - {i}");
});
Task.Run(() =>
{
Console.WriteLine($"{DateTime.Now:HH:mm:ss} - Start");
foreach (var i in Enumerable.Range(0, 10))
{
Thread.Sleep(1000);
subject.OnNext(i);
}
Console.WriteLine($"{DateTime.Now:HH:mm:ss} - End");
});
This sample finishes about in 20 seconds but the observable ends before the observer shows "4".
It indicates the scheduler stored 4 to 9 somewhere. I afraid if it stores over 1'000'000 notifications and throws OutOfMemoryException.
var subject = new Subject<int>();
subject.ObserveOn(ThreadPoolScheduler.Instance).Subscribe(i =>
{
Thread.Sleep(2000);
Console.WriteLine($"{DateTime.Now:HH:mm:ss} - {i}");
});
Task.Run(() =>
{
Console.WriteLine($"{DateTime.Now:HH:mm:ss} - Start");
foreach (var i in Enumerable.Range(0, 10))
{
Thread.Sleep(1000);
subject.OnNext(i);
}
Console.WriteLine($"{DateTime.Now:HH:mm:ss} - End");
});
That's why I want to limit number of notifications in the scheduler.
edit: diagram
x : calculate or other task
S : send notification
R : receive notification
-- time -->
sample1
thread1: xxxS xxxS xxxS
thread2: Rxxxxxx Rxxxxxx Rxxxxxx
sample2
thread1: xxxSxxxSxxxSxxxSxxxSxxxSxxxS
thread2: RxxxxxxRxxxxxxRxxxxxxRxxxxxx
I want
thread1: xxxSxxxS xxxS xxxS xxxS
thread2: RxxxxxxRxxxxxxRxxxxxxRxxxxxx
The Rx contract requires that notifications are serialized, so even though you may specify a Scheduler, it's more akin to saying "here, use this scheduler for managing concurrency".
ThreadPoolScheduler will still serialize the notifications, so the end result is that it won't call your method in parallel.
If you want asynchronous execution, you could rewrite it to this:
subject.Subscribe(async i =>
{
await Task.Delay(1000);
Console.WriteLine($"{DateTime.Now:HH:mm:ss} - {i}");
});
But the underlying problem is that your consumer lags behind your producer.
You could look into using backpressure, or if your application is a series of data processing tasks, you could also look into the excellent TPL Dataflow.
What happens in that pipeline between the source of events and its final sink, that's where Rx shines best.
You could try implementing a custom IScheduler that throttles the requests (the method Schedule) using a SemaphoreSlim.
Alternatively you could create a BlockingThrottle extension method that accepts and returns an IObservable, that you could chain to the original IObservable before subscribing to it. Here is an implementation that uses a BlockingCollection as a throttling mechanism:
private static IObservable<T> BlockingThrottle<T>(this IObservable<T> source,
int boundedCapacity)
{
return Observable.Create<T>(observer =>
{
var queue = new BlockingCollection<T>(boundedCapacity);
var cts = new CancellationTokenSource();
var locker = new object();
Exception exception = null;
new Thread(() =>
{
try
{
foreach (var item in queue.GetConsumingEnumerable(cts.Token))
{
observer.OnNext(item);
}
observer.OnCompleted();
}
catch (OperationCanceledException)
{
Exception ex; lock (locker) ex = exception;
if (ex != null) observer.OnError(ex);
}
// Leave all other exceptions unhandled.
// The responsibility for catching them belongs to the caller.
})
{ IsBackground = true }.Start();
var subscription = source.Subscribe(x =>
{
try
{
queue.Add(x, cts.Token);
}
catch (OperationCanceledException) { } // Ignore this exception too
}, ex =>
{
lock (locker) exception = ex;
cts.Cancel();
}, () =>
{
queue.CompleteAdding();
});
return Disposable.Create(() =>
{
cts.Cancel();
subscription.Dispose();
});
});
}
Usage example:
subject.BlockingThrottle(boundedCapacity: 10).Subscribe(i =>
Note: If you are planing to use this inside an ASP.NET application, consider replacing the BlockingCollection with an async queue (like a BufferBlock<T> or a Channel<T>), to avoid blocking threads.

C# .NET Parallel I/O operation (with throttling) [duplicate]

I would like to run a bunch of async tasks, with a limit on how many tasks may be pending completion at any given time.
Say you have 1000 URLs, and you only want to have 50 requests open at a time; but as soon as one request completes, you open up a connection to the next URL in the list. That way, there are always exactly 50 connections open at a time, until the URL list is exhausted.
I also want to utilize a given number of threads if possible.
I came up with an extension method, ThrottleTasksAsync that does what I want. Is there a simpler solution already out there? I would assume that this is a common scenario.
Usage:
class Program
{
static void Main(string[] args)
{
Enumerable.Range(1, 10).ThrottleTasksAsync(5, 2, async i => { Console.WriteLine(i); return i; }).Wait();
Console.WriteLine("Press a key to exit...");
Console.ReadKey(true);
}
}
Here is the code:
static class IEnumerableExtensions
{
public static async Task<Result_T[]> ThrottleTasksAsync<Enumerable_T, Result_T>(this IEnumerable<Enumerable_T> enumerable, int maxConcurrentTasks, int maxDegreeOfParallelism, Func<Enumerable_T, Task<Result_T>> taskToRun)
{
var blockingQueue = new BlockingCollection<Enumerable_T>(new ConcurrentBag<Enumerable_T>());
var semaphore = new SemaphoreSlim(maxConcurrentTasks);
// Run the throttler on a separate thread.
var t = Task.Run(() =>
{
foreach (var item in enumerable)
{
// Wait for the semaphore
semaphore.Wait();
blockingQueue.Add(item);
}
blockingQueue.CompleteAdding();
});
var taskList = new List<Task<Result_T>>();
Parallel.ForEach(IterateUntilTrue(() => blockingQueue.IsCompleted), new ParallelOptions { MaxDegreeOfParallelism = maxDegreeOfParallelism },
_ =>
{
Enumerable_T item;
if (blockingQueue.TryTake(out item, 100))
{
taskList.Add(
// Run the task
taskToRun(item)
.ContinueWith(tsk =>
{
// For effect
Thread.Sleep(2000);
// Release the semaphore
semaphore.Release();
return tsk.Result;
}
)
);
}
});
// Await all the tasks.
return await Task.WhenAll(taskList);
}
static IEnumerable<bool> IterateUntilTrue(Func<bool> condition)
{
while (!condition()) yield return true;
}
}
The method utilizes BlockingCollection and SemaphoreSlim to make it work. The throttler is run on one thread, and all the async tasks are run on the other thread. To achieve parallelism, I added a maxDegreeOfParallelism parameter that's passed to a Parallel.ForEach loop re-purposed as a while loop.
The old version was:
foreach (var master = ...)
{
var details = ...;
Parallel.ForEach(details, detail => {
// Process each detail record here
}, new ParallelOptions { MaxDegreeOfParallelism = 15 });
// Perform the final batch updates here
}
But, the thread pool gets exhausted fast, and you can't do async/await.
Bonus:
To get around the problem in BlockingCollection where an exception is thrown in Take() when CompleteAdding() is called, I'm using the TryTake overload with a timeout. If I didn't use the timeout in TryTake, it would defeat the purpose of using a BlockingCollection since TryTake won't block. Is there a better way? Ideally, there would be a TakeAsync method.
As suggested, use TPL Dataflow.
A TransformBlock<TInput, TOutput> may be what you're looking for.
You define a MaxDegreeOfParallelism to limit how many strings can be transformed (i.e., how many urls can be downloaded) in parallel. You then post urls to the block, and when you're done you tell the block you're done adding items and you fetch the responses.
var downloader = new TransformBlock<string, HttpResponse>(
url => Download(url),
new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 50 }
);
var buffer = new BufferBlock<HttpResponse>();
downloader.LinkTo(buffer);
foreach(var url in urls)
downloader.Post(url);
//or await downloader.SendAsync(url);
downloader.Complete();
await downloader.Completion;
IList<HttpResponse> responses;
if (buffer.TryReceiveAll(out responses))
{
//process responses
}
Note: The TransformBlock buffers both its input and output. Why, then, do we need to link it to a BufferBlock?
Because the TransformBlock won't complete until all items (HttpResponse) have been consumed, and await downloader.Completion would hang. Instead, we let the downloader forward all its output to a dedicated buffer block - then we wait for the downloader to complete, and inspect the buffer block.
Say you have 1000 URLs, and you only want to have 50 requests open at
a time; but as soon as one request completes, you open up a connection
to the next URL in the list. That way, there are always exactly 50
connections open at a time, until the URL list is exhausted.
The following simple solution has surfaced many times here on SO. It doesn't use blocking code and doesn't create threads explicitly, so it scales very well:
const int MAX_DOWNLOADS = 50;
static async Task DownloadAsync(string[] urls)
{
using (var semaphore = new SemaphoreSlim(MAX_DOWNLOADS))
using (var httpClient = new HttpClient())
{
var tasks = urls.Select(async url =>
{
await semaphore.WaitAsync();
try
{
var data = await httpClient.GetStringAsync(url);
Console.WriteLine(data);
}
finally
{
semaphore.Release();
}
});
await Task.WhenAll(tasks);
}
}
The thing is, the processing of the downloaded data should be done on a different pipeline, with a different level of parallelism, especially if it's a CPU-bound processing.
E.g., you'd probably want to have 4 threads concurrently doing the data processing (the number of CPU cores), and up to 50 pending requests for more data (which do not use threads at all). AFAICT, this is not what your code is currently doing.
That's where TPL Dataflow or Rx may come in handy as a preferred solution. Yet it is certainly possible to implement something like this with plain TPL. Note, the only blocking code here is the one doing the actual data processing inside Task.Run:
const int MAX_DOWNLOADS = 50;
const int MAX_PROCESSORS = 4;
// process data
class Processing
{
SemaphoreSlim _semaphore = new SemaphoreSlim(MAX_PROCESSORS);
HashSet<Task> _pending = new HashSet<Task>();
object _lock = new Object();
async Task ProcessAsync(string data)
{
await _semaphore.WaitAsync();
try
{
await Task.Run(() =>
{
// simuate work
Thread.Sleep(1000);
Console.WriteLine(data);
});
}
finally
{
_semaphore.Release();
}
}
public async void QueueItemAsync(string data)
{
var task = ProcessAsync(data);
lock (_lock)
_pending.Add(task);
try
{
await task;
}
catch
{
if (!task.IsCanceled && !task.IsFaulted)
throw; // not the task's exception, rethrow
// don't remove faulted/cancelled tasks from the list
return;
}
// remove successfully completed tasks from the list
lock (_lock)
_pending.Remove(task);
}
public async Task WaitForCompleteAsync()
{
Task[] tasks;
lock (_lock)
tasks = _pending.ToArray();
await Task.WhenAll(tasks);
}
}
// download data
static async Task DownloadAsync(string[] urls)
{
var processing = new Processing();
using (var semaphore = new SemaphoreSlim(MAX_DOWNLOADS))
using (var httpClient = new HttpClient())
{
var tasks = urls.Select(async (url) =>
{
await semaphore.WaitAsync();
try
{
var data = await httpClient.GetStringAsync(url);
// put the result on the processing pipeline
processing.QueueItemAsync(data);
}
finally
{
semaphore.Release();
}
});
await Task.WhenAll(tasks.ToArray());
await processing.WaitForCompleteAsync();
}
}
As requested, here's the code I ended up going with.
The work is set up in a master-detail configuration, and each master is processed as a batch. Each unit of work is queued up in this fashion:
var success = true;
// Start processing all the master records.
Master master;
while (null != (master = await StoredProcedures.ClaimRecordsAsync(...)))
{
await masterBuffer.SendAsync(master);
}
// Finished sending master records
masterBuffer.Complete();
// Now, wait for all the batches to complete.
await batchAction.Completion;
return success;
Masters are buffered one at a time to save work for other outside processes. The details for each master are dispatched for work via the masterTransform TransformManyBlock. A BatchedJoinBlock is also created to collect the details in one batch.
The actual work is done in the detailTransform TransformBlock, asynchronously, 150 at a time. BoundedCapacity is set to 300 to ensure that too many Masters don't get buffered at the beginning of the chain, while also leaving room for enough detail records to be queued to allow 150 records to be processed at one time. The block outputs an object to its targets, because it's filtered across the links depending on whether it's a Detail or Exception.
The batchAction ActionBlock collects the output from all the batches, and performs bulk database updates, error logging, etc. for each batch.
There will be several BatchedJoinBlocks, one for each master. Since each ISourceBlock is output sequentially and each batch only accepts the number of detail records associated with one master, the batches will be processed in order. Each block only outputs one group, and is unlinked on completion. Only the last batch block propagates its completion to the final ActionBlock.
The dataflow network:
// The dataflow network
BufferBlock<Master> masterBuffer = null;
TransformManyBlock<Master, Detail> masterTransform = null;
TransformBlock<Detail, object> detailTransform = null;
ActionBlock<Tuple<IList<object>, IList<object>>> batchAction = null;
// Buffer master records to enable efficient throttling.
masterBuffer = new BufferBlock<Master>(new DataflowBlockOptions { BoundedCapacity = 1 });
// Sequentially transform master records into a stream of detail records.
masterTransform = new TransformManyBlock<Master, Detail>(async masterRecord =>
{
var records = await StoredProcedures.GetObjectsAsync(masterRecord);
// Filter the master records based on some criteria here
var filteredRecords = records;
// Only propagate completion to the last batch
var propagateCompletion = masterBuffer.Completion.IsCompleted && masterTransform.InputCount == 0;
// Create a batch join block to encapsulate the results of the master record.
var batchjoinblock = new BatchedJoinBlock<object, object>(records.Count(), new GroupingDataflowBlockOptions { MaxNumberOfGroups = 1 });
// Add the batch block to the detail transform pipeline's link queue, and link the batch block to the the batch action block.
var detailLink1 = detailTransform.LinkTo(batchjoinblock.Target1, detailResult => detailResult is Detail);
var detailLink2 = detailTransform.LinkTo(batchjoinblock.Target2, detailResult => detailResult is Exception);
var batchLink = batchjoinblock.LinkTo(batchAction, new DataflowLinkOptions { PropagateCompletion = propagateCompletion });
// Unlink batchjoinblock upon completion.
// (the returned task does not need to be awaited, despite the warning.)
batchjoinblock.Completion.ContinueWith(task =>
{
detailLink1.Dispose();
detailLink2.Dispose();
batchLink.Dispose();
});
return filteredRecords;
}, new ExecutionDataflowBlockOptions { BoundedCapacity = 1 });
// Process each detail record asynchronously, 150 at a time.
detailTransform = new TransformBlock<Detail, object>(async detail => {
try
{
// Perform the action for each detail here asynchronously
await DoSomethingAsync();
return detail;
}
catch (Exception e)
{
success = false;
return e;
}
}, new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 150, BoundedCapacity = 300 });
// Perform the proper action for each batch
batchAction = new ActionBlock<Tuple<IList<object>, IList<object>>>(async batch =>
{
var details = batch.Item1.Cast<Detail>();
var errors = batch.Item2.Cast<Exception>();
// Do something with the batch here
}, new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 4 });
masterBuffer.LinkTo(masterTransform, new DataflowLinkOptions { PropagateCompletion = true });
masterTransform.LinkTo(detailTransform, new DataflowLinkOptions { PropagateCompletion = true });

C# Running many async tasks the same time

I'm kinda new to async tasks.
I've a function that takes student ID and scrapes data from specific university website with the required ID.
private static HttpClient client = new HttpClient();
public static async Task<Student> ParseAsync(string departmentLink, int id, CancellationToken ct)
{
string website = string.Format(departmentLink, id);
try
{
string data;
var stream = await client.GetAsync(website, ct);
using (var reader = new StreamReader(await stream.Content.ReadAsStreamAsync(), Encoding.GetEncoding("windows-1256")))
data = reader.ReadToEnd();
//Parse data here and return Student.
} catch (Exception ex)
{
Console.WriteLine(ex.Message);
}
}
And it works correctly. Sometimes though I need to run this function for a lot of students so I use the following
for(int i = ids.first; i <= ids.last; i++)
{
tasks[i - ids.first] = ParseStudentData.ParseAsync(entity.Link, i, cts.Token).ContinueWith(t =>
{
Dispatcher.Invoke(() =>
{
listview_students.Items.Add(t.Result);
//Students.Add(t.Result);
//lbl_count.Content = $"{listview_students.Items.Count}/{testerino.Length}";
});
});
}
I'm storing tasks in an array to wait for them later.
This also works finely as long as the students count is between (0, ~600?) it's kinda random.
And then for every other student that still hasn't been parsed throws A task was cancelled.
Keep in mind that, I never use the cancellation token at all.
I need to run this function on so many students it can reach ~9000 async task altogether. So what's happening?
You are basically creating a denial of service attack on the website when you are queuing up 9000 requests in such a short time frame. Not only is this causing you errors, but it could take down the website. It would be best to limit the number of concurrent requests to a more reasonable value (say 30). While there are probably several ways to do this, one that comes to mind is the following:
private async Task Test()
{
var tasks = new List<Task>();
for (int i = ids.first; i <= ids.last; i++)
{
tasks.Add(/* Do stuff */);
await WaitList(tasks, 30);
}
}
private async Task WaitList(IList<Task> tasks, int maxSize)
{
while (tasks.Count > maxSize)
{
var completed = await Task.WhenAny(tasks).ConfigureAwait(false);
tasks.Remove(completed);
}
}
Other approaches might leverage the producer/consumer pattern using .Net classes such as a BlockingCollection
This is what I ended up with based on #erdomke code:
public static async Task ForEachParallel<T>(
this IEnumerable<T> list,
Func<T, Task> action,
int dop)
{
var tasks = new List<Task>(dop);
foreach (var item in list)
{
tasks.Add(action(item));
while (tasks.Count >= dop)
{
var completed = await Task.WhenAny(tasks).ConfigureAwait(false);
tasks.Remove(completed);
}
}
// Wait for all remaining tasks.
await Task.WhenAll(tasks).ConfigureAwait(false);
}
// usage
await Enumerable
.Range(1, 500)
.ForEachParallel(i => ProcessItem(i), Environment.ProcessorCount);

Limit total concurrent tasks running [duplicate]

This question already has answers here:
How to limit the amount of concurrent async I/O operations?
(11 answers)
Closed 5 years ago.
I have a method Create which is executed whenever a new message is seen on the service bus message queue (https://azure.microsoft.com/en-us/services/service-bus/).
I am trying to limit the total number of concurrent tasks that can run in parallel for all calls of Create to 5 tasks.
In my code Parallel.ForEach does not seem to do anything.
I have tried to add a mutex/lock around the makePdfAsync() invocation like this:
mutex.WaitOne();
if(curretNumTasks < MaxTasks)
{
tasks.Add(makePdfAsync(form));
}
mutex.ReleaseMutex();
but it is extremely slow and makes the service bus throw.
How do I limit the number of concurrent tasks all invocations of Create creates?
public async Task Create(List<FormModel> forms)
{
var tasks = new List<Task>();
Parallel.ForEach(forms, new ParallelOptions { MaxDegreeOfParallelism = 5 }, form =>
{
tasks.Add(makePdfAsync(form));
});
await Task.WhenAny(Task.WhenAll(tasks), Task.Delay(TimeSpan.FromMinutes(10)));
}
public async Task makePdfAsync()
{
var message = new PdfMessageModel();
message.forms = new List<FormModel>() { form };
var retry = 10;
var uri = new Uri("http://localhost.:8007");
var json = JsonConvert.SerializeObject(message);
using (var wc = new WebClient())
{
wc.Encoding = System.Text.Encoding.UTF8;
// reconnect with delay in case process is not ready
while (true)
{
try
{
await wc.UploadStringTaskAsync(uri, json);
break;
}
catch
{
if (retry-- == 0) throw;
}
}
}
}
TL;DR. Create is a method on a class, it is called on many instances simultaneously. The concurrency is two fold; Several invocations of Create simultaneously and within each invocation of Create several tasks run concurrently.
How do I limit the total number of tasks running at any one point?
You could look at using a system wide semaphore?
for example :
var throttle = new Semaphore(5,5,"pdftaskthrottle");
if (throttle.WaitOne(5000)){
try{
//do some task / thread stuff
.....
} catch(Exception ex){
// handle
} finally {
//always remember to release the semaphore
throttle.Release();
}
} else {
//we timed out ... try again?
}
If I understand you correctly, you effectively want a producer/consumer queue with a limit of 5 tasks. BlockingCollection would be the best if that's what you're after. It has very good performance as internally it uses SemaphoreSlim to do the blocking when necessary. Also you can leverage Task together e.g. creating a BlockingCollection<Task<T>>. "C# in a nutshell" has a good section of this; see code below as a general example. Also try avoid using kernel-mode synchronisation construct like mutex if possible as they're slow (you have to pay for transiting from managed code into native code!).
class PCQueue : IDisposable
{
private BlockingCollection<Task> _taskQueue = new BlockingCollection<Task>();
public PCQueue(int workerCount)
{
for (int i = 0; i < workerCount; i++)
Task.Factory.StartNew(Consume);
}
public Task Enqueue(Action action, CancellationToken cancelToken = default(CancellationToken))
{
//! A task object can either be generated using TaskCompletionSource or instantiated directly (an unstarted or cold task!).
var task = new Task(action, cancelToken);
_taskQueue.Add(task); //? Create a cold task and enqueue it.
return task;
}
public Task<TResult> Enqueue<TResult>(Func<TResult> func, CancellationToken cancelToken = default(CancellationToken))
{
var task = new Task<TResult>(func, cancelToken);
_taskQueue.Add(task);
return task;
}
void Consume()
{
foreach (var task in _taskQueue.GetConsumingEnumerable())
{
try
{
//! We run the task synchronously on the consumer's thread.
if (!task.IsCanceled) task.RunSynchronously();
}
catch (InvalidOperationException)
{
//! Handle the unlikely event that the task is canceled in between checking whether it's canceled and running it.
// race condition!
}
}
}
public void Dispose() => _taskQueue.CompleteAdding();
}

Throttling asynchronous tasks

I would like to run a bunch of async tasks, with a limit on how many tasks may be pending completion at any given time.
Say you have 1000 URLs, and you only want to have 50 requests open at a time; but as soon as one request completes, you open up a connection to the next URL in the list. That way, there are always exactly 50 connections open at a time, until the URL list is exhausted.
I also want to utilize a given number of threads if possible.
I came up with an extension method, ThrottleTasksAsync that does what I want. Is there a simpler solution already out there? I would assume that this is a common scenario.
Usage:
class Program
{
static void Main(string[] args)
{
Enumerable.Range(1, 10).ThrottleTasksAsync(5, 2, async i => { Console.WriteLine(i); return i; }).Wait();
Console.WriteLine("Press a key to exit...");
Console.ReadKey(true);
}
}
Here is the code:
static class IEnumerableExtensions
{
public static async Task<Result_T[]> ThrottleTasksAsync<Enumerable_T, Result_T>(this IEnumerable<Enumerable_T> enumerable, int maxConcurrentTasks, int maxDegreeOfParallelism, Func<Enumerable_T, Task<Result_T>> taskToRun)
{
var blockingQueue = new BlockingCollection<Enumerable_T>(new ConcurrentBag<Enumerable_T>());
var semaphore = new SemaphoreSlim(maxConcurrentTasks);
// Run the throttler on a separate thread.
var t = Task.Run(() =>
{
foreach (var item in enumerable)
{
// Wait for the semaphore
semaphore.Wait();
blockingQueue.Add(item);
}
blockingQueue.CompleteAdding();
});
var taskList = new List<Task<Result_T>>();
Parallel.ForEach(IterateUntilTrue(() => blockingQueue.IsCompleted), new ParallelOptions { MaxDegreeOfParallelism = maxDegreeOfParallelism },
_ =>
{
Enumerable_T item;
if (blockingQueue.TryTake(out item, 100))
{
taskList.Add(
// Run the task
taskToRun(item)
.ContinueWith(tsk =>
{
// For effect
Thread.Sleep(2000);
// Release the semaphore
semaphore.Release();
return tsk.Result;
}
)
);
}
});
// Await all the tasks.
return await Task.WhenAll(taskList);
}
static IEnumerable<bool> IterateUntilTrue(Func<bool> condition)
{
while (!condition()) yield return true;
}
}
The method utilizes BlockingCollection and SemaphoreSlim to make it work. The throttler is run on one thread, and all the async tasks are run on the other thread. To achieve parallelism, I added a maxDegreeOfParallelism parameter that's passed to a Parallel.ForEach loop re-purposed as a while loop.
The old version was:
foreach (var master = ...)
{
var details = ...;
Parallel.ForEach(details, detail => {
// Process each detail record here
}, new ParallelOptions { MaxDegreeOfParallelism = 15 });
// Perform the final batch updates here
}
But, the thread pool gets exhausted fast, and you can't do async/await.
Bonus:
To get around the problem in BlockingCollection where an exception is thrown in Take() when CompleteAdding() is called, I'm using the TryTake overload with a timeout. If I didn't use the timeout in TryTake, it would defeat the purpose of using a BlockingCollection since TryTake won't block. Is there a better way? Ideally, there would be a TakeAsync method.
As suggested, use TPL Dataflow.
A TransformBlock<TInput, TOutput> may be what you're looking for.
You define a MaxDegreeOfParallelism to limit how many strings can be transformed (i.e., how many urls can be downloaded) in parallel. You then post urls to the block, and when you're done you tell the block you're done adding items and you fetch the responses.
var downloader = new TransformBlock<string, HttpResponse>(
url => Download(url),
new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 50 }
);
var buffer = new BufferBlock<HttpResponse>();
downloader.LinkTo(buffer);
foreach(var url in urls)
downloader.Post(url);
//or await downloader.SendAsync(url);
downloader.Complete();
await downloader.Completion;
IList<HttpResponse> responses;
if (buffer.TryReceiveAll(out responses))
{
//process responses
}
Note: The TransformBlock buffers both its input and output. Why, then, do we need to link it to a BufferBlock?
Because the TransformBlock won't complete until all items (HttpResponse) have been consumed, and await downloader.Completion would hang. Instead, we let the downloader forward all its output to a dedicated buffer block - then we wait for the downloader to complete, and inspect the buffer block.
Say you have 1000 URLs, and you only want to have 50 requests open at
a time; but as soon as one request completes, you open up a connection
to the next URL in the list. That way, there are always exactly 50
connections open at a time, until the URL list is exhausted.
The following simple solution has surfaced many times here on SO. It doesn't use blocking code and doesn't create threads explicitly, so it scales very well:
const int MAX_DOWNLOADS = 50;
static async Task DownloadAsync(string[] urls)
{
using (var semaphore = new SemaphoreSlim(MAX_DOWNLOADS))
using (var httpClient = new HttpClient())
{
var tasks = urls.Select(async url =>
{
await semaphore.WaitAsync();
try
{
var data = await httpClient.GetStringAsync(url);
Console.WriteLine(data);
}
finally
{
semaphore.Release();
}
});
await Task.WhenAll(tasks);
}
}
The thing is, the processing of the downloaded data should be done on a different pipeline, with a different level of parallelism, especially if it's a CPU-bound processing.
E.g., you'd probably want to have 4 threads concurrently doing the data processing (the number of CPU cores), and up to 50 pending requests for more data (which do not use threads at all). AFAICT, this is not what your code is currently doing.
That's where TPL Dataflow or Rx may come in handy as a preferred solution. Yet it is certainly possible to implement something like this with plain TPL. Note, the only blocking code here is the one doing the actual data processing inside Task.Run:
const int MAX_DOWNLOADS = 50;
const int MAX_PROCESSORS = 4;
// process data
class Processing
{
SemaphoreSlim _semaphore = new SemaphoreSlim(MAX_PROCESSORS);
HashSet<Task> _pending = new HashSet<Task>();
object _lock = new Object();
async Task ProcessAsync(string data)
{
await _semaphore.WaitAsync();
try
{
await Task.Run(() =>
{
// simuate work
Thread.Sleep(1000);
Console.WriteLine(data);
});
}
finally
{
_semaphore.Release();
}
}
public async void QueueItemAsync(string data)
{
var task = ProcessAsync(data);
lock (_lock)
_pending.Add(task);
try
{
await task;
}
catch
{
if (!task.IsCanceled && !task.IsFaulted)
throw; // not the task's exception, rethrow
// don't remove faulted/cancelled tasks from the list
return;
}
// remove successfully completed tasks from the list
lock (_lock)
_pending.Remove(task);
}
public async Task WaitForCompleteAsync()
{
Task[] tasks;
lock (_lock)
tasks = _pending.ToArray();
await Task.WhenAll(tasks);
}
}
// download data
static async Task DownloadAsync(string[] urls)
{
var processing = new Processing();
using (var semaphore = new SemaphoreSlim(MAX_DOWNLOADS))
using (var httpClient = new HttpClient())
{
var tasks = urls.Select(async (url) =>
{
await semaphore.WaitAsync();
try
{
var data = await httpClient.GetStringAsync(url);
// put the result on the processing pipeline
processing.QueueItemAsync(data);
}
finally
{
semaphore.Release();
}
});
await Task.WhenAll(tasks.ToArray());
await processing.WaitForCompleteAsync();
}
}
As requested, here's the code I ended up going with.
The work is set up in a master-detail configuration, and each master is processed as a batch. Each unit of work is queued up in this fashion:
var success = true;
// Start processing all the master records.
Master master;
while (null != (master = await StoredProcedures.ClaimRecordsAsync(...)))
{
await masterBuffer.SendAsync(master);
}
// Finished sending master records
masterBuffer.Complete();
// Now, wait for all the batches to complete.
await batchAction.Completion;
return success;
Masters are buffered one at a time to save work for other outside processes. The details for each master are dispatched for work via the masterTransform TransformManyBlock. A BatchedJoinBlock is also created to collect the details in one batch.
The actual work is done in the detailTransform TransformBlock, asynchronously, 150 at a time. BoundedCapacity is set to 300 to ensure that too many Masters don't get buffered at the beginning of the chain, while also leaving room for enough detail records to be queued to allow 150 records to be processed at one time. The block outputs an object to its targets, because it's filtered across the links depending on whether it's a Detail or Exception.
The batchAction ActionBlock collects the output from all the batches, and performs bulk database updates, error logging, etc. for each batch.
There will be several BatchedJoinBlocks, one for each master. Since each ISourceBlock is output sequentially and each batch only accepts the number of detail records associated with one master, the batches will be processed in order. Each block only outputs one group, and is unlinked on completion. Only the last batch block propagates its completion to the final ActionBlock.
The dataflow network:
// The dataflow network
BufferBlock<Master> masterBuffer = null;
TransformManyBlock<Master, Detail> masterTransform = null;
TransformBlock<Detail, object> detailTransform = null;
ActionBlock<Tuple<IList<object>, IList<object>>> batchAction = null;
// Buffer master records to enable efficient throttling.
masterBuffer = new BufferBlock<Master>(new DataflowBlockOptions { BoundedCapacity = 1 });
// Sequentially transform master records into a stream of detail records.
masterTransform = new TransformManyBlock<Master, Detail>(async masterRecord =>
{
var records = await StoredProcedures.GetObjectsAsync(masterRecord);
// Filter the master records based on some criteria here
var filteredRecords = records;
// Only propagate completion to the last batch
var propagateCompletion = masterBuffer.Completion.IsCompleted && masterTransform.InputCount == 0;
// Create a batch join block to encapsulate the results of the master record.
var batchjoinblock = new BatchedJoinBlock<object, object>(records.Count(), new GroupingDataflowBlockOptions { MaxNumberOfGroups = 1 });
// Add the batch block to the detail transform pipeline's link queue, and link the batch block to the the batch action block.
var detailLink1 = detailTransform.LinkTo(batchjoinblock.Target1, detailResult => detailResult is Detail);
var detailLink2 = detailTransform.LinkTo(batchjoinblock.Target2, detailResult => detailResult is Exception);
var batchLink = batchjoinblock.LinkTo(batchAction, new DataflowLinkOptions { PropagateCompletion = propagateCompletion });
// Unlink batchjoinblock upon completion.
// (the returned task does not need to be awaited, despite the warning.)
batchjoinblock.Completion.ContinueWith(task =>
{
detailLink1.Dispose();
detailLink2.Dispose();
batchLink.Dispose();
});
return filteredRecords;
}, new ExecutionDataflowBlockOptions { BoundedCapacity = 1 });
// Process each detail record asynchronously, 150 at a time.
detailTransform = new TransformBlock<Detail, object>(async detail => {
try
{
// Perform the action for each detail here asynchronously
await DoSomethingAsync();
return detail;
}
catch (Exception e)
{
success = false;
return e;
}
}, new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 150, BoundedCapacity = 300 });
// Perform the proper action for each batch
batchAction = new ActionBlock<Tuple<IList<object>, IList<object>>>(async batch =>
{
var details = batch.Item1.Cast<Detail>();
var errors = batch.Item2.Cast<Exception>();
// Do something with the batch here
}, new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 4 });
masterBuffer.LinkTo(masterTransform, new DataflowLinkOptions { PropagateCompletion = true });
masterTransform.LinkTo(detailTransform, new DataflowLinkOptions { PropagateCompletion = true });

Categories