I have created a DataFlow pipeline using a BufferBlock, a TransformBlock and an ActionBlock. When an exception is thrown in the TransformBlock, the application deadlocks. I'm throttling data using BoundedCapacity.
My code is like this:
public async Task PerformOperation()
{
var bufferBlock = new BufferBlock<ObjA>(new DataflowBlockOptions { BoundedCapacity = 1 });
var fetchApiResponse = new TransformBlock<ObjA, ObjA>((item) => {
//Call an api to fetch result.
//Here for some data i get exception
return item;
}, new ExecutionDataflowBlockOptions { BoundedCapacity = 2, MaxDegreeOfParallelism = 2, CancellationToken = cancellationToken });
var finalBlock = new ActionBlock<ObjA>((item) => {
if (item != null)
{
SaveToDB(item);
}
}, new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 1, BoundedCapacity = 1, CancellationToken = cancellationToken });
bufferBlock.LinkTo(fetchApiResponse, new DataflowLinkOptions { PropagateCompletion = true });
fetchApiResponse.LinkTo(finalBlock, new DataflowLinkOptions { PropagateCompletion = true });
await FetchData(bufferBlock);
bufferBlock.Complete();
await Task.WhenAll(fetchApiResponse.Completion, finalBlock.Completion);
}
public async Task FetchData(ITargetBlock<ObjA> bufferBlock)
{
List<ObjA> dataToProcessList = GetFromDB();
foreach (var item in dataToProcessList)
{
await bufferBlock.SendAsync(item);
}
}
Here, if an exception occurs in the fetchApiResponse block, the data stops moving and the pipeline deadlocks.
How do I handle exception in this pipeline?
Here around 200,000 records are pushed to bufferBlock.
What is the best way to handle the exceptions without causing this deadlock?
UPDATE 1:
Added the FetchData method also.
Thanks
Binil
Rather than trying to make sense of what faulted or not, the blocks should not allow unhandled exceptions. This is a very common pattern, also used in [Go pipelines](https://blog.golang.org/pipelines).
The article Exception Handling in TPL Dataflow Networks explains how exceptions are handled.
When an unhandled exception is thrown, the block enters the faulted state, only after all concurrent operations are finished.
That state is propagated to linked blocks that have PropagateCompletion set to true. That doesn't mean that downstream blocks will immediately fault though.
Awaiting a faulted block throws. The line:
await Task.WhenAll(fetchApiResponse.Completion, finalBlock.Completion);
should have thrown, unless those blocks were still busy.
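A minimal sketch of that behaviour, assuming a single ActionBlock with the default options (the names `FaultSketch` and `RunAsync` are made up for illustration): the delegate throws for one item, the block faults, and awaiting its Completion rethrows the original exception.

```csharp
using System;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

public static class FaultSketch
{
    public static async Task<string> RunAsync()
    {
        var block = new ActionBlock<int>(i =>
        {
            // An unhandled exception puts the block into the faulted state.
            if (i == 2) throw new InvalidOperationException("boom");
        });

        block.Post(1);
        block.Post(2);
        block.Post(3); // discarded once the block has faulted
        block.Complete();

        try
        {
            await block.Completion;
            return "completed";
        }
        catch (InvalidOperationException ex)
        {
            // await unwraps the AggregateException stored in the Completion task.
            return "faulted: " + ex.Message;
        }
    }
}
```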
The solution - don't allow unhandled exceptions
Return a Result object instead. When having to make e.g. 1000 HTTP calls, it would be a bad idea to have one exception prevent the other 999 calls anyway. This is broadly similar to Railway-oriented programming. A Dataflow pipeline is quite similar to a functional pipeline.
Each block should return a Result<T> class that wraps the actual result and somehow indicates success or failure. An exception handling block should catch any exceptions and return a faulted Result<T> item. The LinkTo method can have a predicate that allows redirecting failed results to eg a logging block or a NullBlock.
Let's say we have this simple Result<T> :
class Result<T>
{
public T Value{get;}
public Exception Exception{get;}
public bool Ok {get;}
public Result(){}
public Result(T value)
{
Value=value;
Ok=true;
}
public Result(Exception exc)
{
Exception=exc;
Ok=false;
}
}
fetchApiResponse could be :
var fetchApiResponse = new TransformBlock<TA, Result<TA>>((item) => {
try
{
...
return new Result<TA>(item);
}
catch(Exception exc)
{
return new Result<TA>(exc);
}
}, new ExecutionDataflowBlockOptions { BoundedCapacity = 2, MaxDegreeOfParallelism = 2, CancellationToken = cancellationToken });
and the LinkTo code could be :
var propagate=new DataflowLinkOptions { PropagateCompletion = true };
var nullBlock=DataflowBlock.NullTarget<Result<TA>>();
fetchApiResponse.LinkTo(nullBlock,propagate,msg=>!msg.Ok);
fetchApiResponse.LinkTo(finalBlock,propagate,msg=>msg.Ok);
In this case, bad messages are simply dumped to a null block.
There's no reason to use another buffer block, or await all blocks. Both TransformBlock and ActionBlock have an input buffer controlled by the ExecutionDataflowBlockOptions options.
Posting the messages and awaiting completion can be:
await FetchData(fetchApiResponse);
fetchApiResponse.Complete();
await finalBlock.Completion;
The null check in finalBlock can be removed too, if fetchApiResponse returns an empty Result object when there's no valid result.
More complex scenarios can be handled by more complex Result objects.
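Putting the pieces above together, a minimal sketch of a Result-based pipeline could look like the following. The `Result<T>` type mirrors the class shown earlier; `ResultPipelineSketch`, `RunAsync` and the `callApi` delegate are hypothetical names standing in for the real API call, and failed results are counted in a second ActionBlock rather than dumped to a null block so that the effect is visible.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

// Same shape as the Result<T> class above.
public sealed class Result<T>
{
    public T Value { get; }
    public Exception Exception { get; }
    public bool Ok { get; }
    public Result(T value) { Value = value; Ok = true; }
    public Result(Exception exc) { Exception = exc; Ok = false; }
}

public static class ResultPipelineSketch
{
    // callApi is a hypothetical stand-in for the real API call.
    public static async Task<(List<int> Saved, int Failed)> RunAsync(
        IEnumerable<int> items, Func<int, Task<int>> callApi)
    {
        var saved = new List<int>();
        var failed = 0;

        // Exceptions never escape the delegate; they travel as failed results.
        var fetch = new TransformBlock<int, Result<int>>(async item =>
        {
            try { return new Result<int>(await callApi(item)); }
            catch (Exception exc) { return new Result<int>(exc); }
        }, new ExecutionDataflowBlockOptions { BoundedCapacity = 2, MaxDegreeOfParallelism = 2 });

        // MaxDegreeOfParallelism defaults to 1, so these delegates need no locking.
        var save = new ActionBlock<Result<int>>(r => saved.Add(r.Value),
            new ExecutionDataflowBlockOptions { BoundedCapacity = 1 });
        var failures = new ActionBlock<Result<int>>(r => failed++);

        // The predicates route each result to exactly one target.
        var propagate = new DataflowLinkOptions { PropagateCompletion = true };
        fetch.LinkTo(save, propagate, r => r.Ok);
        fetch.LinkTo(failures, propagate, r => !r.Ok);

        foreach (var item in items)
            await fetch.SendAsync(item);
        fetch.Complete();

        await Task.WhenAll(save.Completion, failures.Completion);
        return (saved, failed);
    }
}
```

Because no block ever faults, the producer loop always runs to the end and the completion signal reaches both targets.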
Abrupt termination
Even when the pipeline needs to terminate immediately, there shouldn't be any unhandled exceptions. A fault may propagate downstream but won't affect the upstream blocks. They'll keep their messages in memory and keep accepting input even though the rest of the pipeline is broken.
That can definitely look like a deadlock.
The solution to this is to use a CancellationTokenSource, pass its token to all blocks, and signal it if the pipeline needs to be terminated.
This is common practice e.g. in Go, where a channel is used like a CancellationTokenSource for precisely this reason, to cancel both downstream and upstream blocks. This is described in Go Concurrency Patterns: Pipelines and cancellation.
Early cancellation is useful if a block decides there's no reason to continue working, not just in case of error. In this case it can signal the CancellationTokenSource to stop upstream blocks.
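A rough sketch of that idea, under the assumption that every block shares one token and that the producer checks the result of SendAsync (`CancellationSketch`, `RunAsync` and `stopAt` are illustrative names only): the last block signals the CancellationTokenSource, which stops the upstream block as well, and the producer loop ends because SendAsync returns false.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

public static class CancellationSketch
{
    // Sends 0..count-1 through two linked blocks. The last block cancels the shared
    // token when it sees stopAt. Returns how many items the head block accepted.
    public static async Task<int> RunAsync(int count, int stopAt)
    {
        using var cts = new CancellationTokenSource();
        var options = new ExecutionDataflowBlockOptions
        {
            BoundedCapacity = 1,
            CancellationToken = cts.Token
        };

        var first = new TransformBlock<int, int>(i => i, options);
        var last = new ActionBlock<int>(i =>
        {
            if (i == stopAt) cts.Cancel(); // stops every block that uses the token
        }, options);
        first.LinkTo(last, new DataflowLinkOptions { PropagateCompletion = true });

        var accepted = 0;
        for (var i = 0; i < count; i++)
        {
            // SendAsync returns false once the block no longer accepts input.
            if (!await first.SendAsync(i)) break;
            accepted++;
        }
        first.Complete();

        try { await last.Completion; }
        catch (OperationCanceledException) { /* expected after cts.Cancel() */ }
        return accepted;
    }
}
```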
I haven't been able to go through the post by @Panagiotis Kanavos yet. Meanwhile, I have updated my code like this to handle the exception, based on the comments.
public async Task PerformOperation()
{
try
{
var bufferBlock = new BufferBlock<ObjA>(new DataflowBlockOptions { BoundedCapacity = 1
});
var fetchApiResponse = new TransformBlock<ObjA, ObjA>(async (item) => {
//Call an api to fetch result.
//Here for some data i get exception
try
{
int apiResult = await apiCall();
}
catch(Exception ex)
{
// Fault the head of the pipeline so that FetchData stops sending data.
var dataflowBlock = (IDataflowBlock)bufferBlock;
dataflowBlock.Fault(ex);
throw; // rethrow without resetting the stack trace
}
return item;
}, new ExecutionDataflowBlockOptions { BoundedCapacity = 2, MaxDegreeOfParallelism = 2, CancellationToken = cancellationToken });
var finalBlock = new ActionBlock<ObjA>((item) => {
if (item != null)
{
SaveToDB(item);
}
}, new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 1, BoundedCapacity = 1, CancellationToken = cancellationToken });
bufferBlock.LinkTo(fetchApiResponse, new DataflowLinkOptions { PropagateCompletion = true });
fetchApiResponse.LinkTo(finalBlock, new DataflowLinkOptions { PropagateCompletion = true });
await FetchData(bufferBlock);
bufferBlock.Complete();
await Task.WhenAll(fetchApiResponse.Completion, finalBlock.Completion);
}
catch(AggregateException aex)
{
//logging the exceptions in aex
}
catch(Exception ex)
{
//logging the exception
}
}
public async Task FetchData(ITargetBlock<ObjA> bufferBlock)
{
List<ObjA> dataToProcessList = GetFromDB();
foreach (var item in dataToProcessList)
{
if(!await bufferBlock.SendAsync(item))
{
break; //breaking the loop to stop pushing data.
}
}
}
This now stops the pipeline instead of deadlocking. Since I'm dealing with lots of data, I'm planning to add a counter for the exceptions and stop the pipeline only if it exceeds a certain limit. If a small network glitch causes one api call to fail, the call might still work for the next item.
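One way the error-counter idea could be sketched, collapsing the pipeline into a single block for brevity. `ErrorBudgetSketch`, `callApi` and `maxErrors` are hypothetical names, and cancelling a shared token is an assumption here; it is swapped in for the Fault call used in the code above.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

public static class ErrorBudgetSketch
{
    // Returns the number of failed calls that were observed.
    public static async Task<int> RunAsync(
        IEnumerable<int> items, Func<int, Task<int>> callApi, int maxErrors)
    {
        using var cts = new CancellationTokenSource();
        var errors = 0;

        var fetch = new ActionBlock<int>(async item =>
        {
            try { await callApi(item); }
            catch (Exception)
            {
                // Interlocked because two delegates may run at the same time.
                if (Interlocked.Increment(ref errors) > maxErrors)
                    cts.Cancel(); // too many failures: stop the block
            }
        }, new ExecutionDataflowBlockOptions
        {
            BoundedCapacity = 2,
            MaxDegreeOfParallelism = 2,
            CancellationToken = cts.Token
        });

        foreach (var item in items)
        {
            // SendAsync returns false once the block has stopped accepting input.
            if (!await fetch.SendAsync(item)) break;
        }
        fetch.Complete();

        try { await fetch.Completion; }
        catch (OperationCanceledException) { /* error budget exhausted */ }
        return errors;
    }
}
```

A single failed call only increments the counter, so a short network glitch does not stop the remaining items.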
I'll go through the new posts and update my code to make things better.
Please provide inputs.
Thanks
Binil
Related
Typical situation: Fast producer, slow consumer, need to slow producer down.
Sample code that doesn't work as I expected (explained below):
// I assumed this block will behave like BlockingCollection, but it doesn't
var bb = new BufferBlock<int>(new DataflowBlockOptions {
BoundedCapacity = 3, // looks like this does nothing
});
// fast producer
int dataSource = -1;
var producer = Task.Run(() => {
while (dataSource < 10) {
var message = ++dataSource;
bb.Post(message);
Console.WriteLine($"Posted: {message}");
}
Console.WriteLine("Calling .Complete() on buffer block");
bb.Complete();
});
// slow consumer
var ab = new ActionBlock<int>(i => {
Thread.Sleep(500);
Console.WriteLine($"Received: {i}");
}, new ExecutionDataflowBlockOptions {
MaxDegreeOfParallelism = 2,
});
bb.LinkTo(ab);
ab.Completion.Wait();
How I thought this code would work, but it doesn't:
BufferBlock bb is the blocking queue with capacity of 3. Once capacity reached, producer should not be able to .Post() to it, until there's a vacant slot.
Doesn't work like that. bb seems to happily accept any number of messages.
producer is a task that quickly Posts messages. Once all messages have been posted, the call to bb.Complete() should propagate through the pipeline and signal shutdown once all messages have been processed. Hence waiting ab.Completion.Wait(); at the end.
Doesn't work either. As soon as .Complete() is called, action block ab won't receive any more messages.
Can be done with a BlockingCollection, which I thought in TPL Dataflow (TDF) world BufferBlock was the equivalent of. I guess I'm misunderstanding how backpressure is supposed to work in TPL Dataflow.
So where's the catch? How to run this pipeline, not allowing more than 3 messages in the buffer bb, and wait for its completion?
PS: I found this gist (https://gist.github.com/mnadel/df2ec09fe7eae9ba8938) where it's suggested to maintain a semaphore to block writing to BufferBlock. I thought this was "built-in".
Update after accepting an answer:
If you're looking at this question, you need to remember that ActionBlock also has its own input buffer.
That's one thing. You also need to realize that, because all blocks have their own input buffers, you don't need the BufferBlock for what its name might imply. A BufferBlock is more like a utility block for more complex architectures, or a load-balancing block. But it's not a backpressure buffer.
Completion propagation needs to be defined at the link level explicitly.
When calling .LinkTo(), you need to explicitly pass new DataflowLinkOptions {PropagateCompletion = true} as the 2nd argument.
To introduce back pressure you need to use SendAsync when you send items into the block. This allows your producer to wait for the block to be ready for the item. Something like this is what you're looking for:
class Program
{
static async Task Main()
{
var options = new ExecutionDataflowBlockOptions()
{
BoundedCapacity = 3
};
var block = new ActionBlock<int>(async i =>
{
await Task.Delay(100);
Console.WriteLine(i);
}, options);
//Producer
foreach (var i in Enumerable.Range(0, 10))
{
await block.SendAsync(i);
}
block.Complete();
await block.Completion;
}
}
If you change this to use Post and print the result of the Post you'll see that many items fail to be passed to the block:
class Program
{
static async Task Main()
{
var options = new ExecutionDataflowBlockOptions()
{
BoundedCapacity = 1
};
var block = new ActionBlock<int>(async i =>
{
await Task.Delay(1000);
Console.WriteLine(i);
}, options);
//Producer
foreach (var i in Enumerable.Range(0, 10))
{
var result = block.Post(i);
Console.WriteLine(result);
}
block.Complete();
await block.Completion;
}
}
Output:
True
False
False
False
False
False
False
False
False
False
0
With the guidance from JSteward's answer, I came up with the following code.
It produces (reads etc.) new items concurrently with processing said items, maintaining a read-ahead buffer.
The completion signal is sent to the head of the chain when the "producer" has no more items.
The program also awaits the completion of the whole chain before terminating.
static async Task Main() {
string Time() => $"{DateTime.Now:hh:mm:ss.fff}";
// the buffer is added to the chain just for demonstration purposes
// the chain would work fine using just the built-in input buffer
// of the `action` block.
var buffer = new BufferBlock<int>(new DataflowBlockOptions {BoundedCapacity = 3});
var action = new ActionBlock<int>(async i =>
{
Console.WriteLine($"[{Time()}]: Processing: {i}");
await Task.Delay(500);
}, new ExecutionDataflowBlockOptions {MaxDegreeOfParallelism = 2, BoundedCapacity = 2});
// it's necessary to set `PropagateCompletion` property
buffer.LinkTo(action, new DataflowLinkOptions {PropagateCompletion = true});
//Producer
foreach (var i in Enumerable.Range(0, 10))
{
Console.WriteLine($"[{Time()}]: Ready to send: {i}");
await buffer.SendAsync(i);
Console.WriteLine($"[{Time()}]: Sent: {i}");
}
// we call `.Complete()` on the head of the chain and it's propagated forward
buffer.Complete();
await action.Completion;
}
I'm implementing simple data loader over HTTP, following tips from my previous question C# .NET Parallel I/O operation (with throttling), answered by Throttling asynchronous tasks.
I split loading and deserialization, assuming that one may be slower/faster than other. Also I want to throttle downloading, but don't want to throttle deserialization. Therefore I'm using two blocks and one buffer.
Unfortunately, I'm facing a problem: this pipeline sometimes processes fewer messages than it consumed (I know from the target server that I made exactly n requests, but I end up with fewer responses).
My method looks like this (no error handling):
public async Task<IEnumerable<DummyData>> LoadAsync(IEnumerable<Uri> uris)
{
IList<DummyData> result;
using (var client = new HttpClient())
{
var buffer = new BufferBlock<DummyData>();
var downloader = new TransformBlock<Uri, string>(
async u => await client.GetStringAsync(u),
new ExecutionDataflowBlockOptions
{ MaxDegreeOfParallelism = _maxParallelism });
var deserializer =
new TransformBlock<string, DummyData>(
s => JsonConvert.DeserializeObject<DummyData>(s),
new ExecutionDataflowBlockOptions
{ MaxDegreeOfParallelism = DataflowBlockOptions.Unbounded });
var linkOptions = new DataflowLinkOptions { PropagateCompletion = true };
downloader.LinkTo(deserializer, linkOptions);
deserializer.LinkTo(buffer, linkOptions);
foreach (Uri uri in uris)
{
await downloader.SendAsync(uri);
}
downloader.Complete();
await downloader.Completion;
buffer.TryReceiveAll(out result);
}
return result;
}
So to be more specific, I have 100 URLs to load, but I get 90-99 responses. No error & server handled 100 requests. This happens randomly, most of the time code behaves correctly.
There are three issues with your code:
Awaiting the completion of the first block of the pipeline (downloader) instead of the last (buffer).
Using the TryReceiveAll method for retrieving the messages of the buffer block. The correct way to retrieve all messages from an unlinked block without introducing race conditions is to use the methods OutputAvailableAsync and TryReceive in a nested loop. You can find examples here and here.
In case of a timeout, the HttpClient throws an unexpected TaskCanceledException, and the TPL Dataflow blocks happen to ignore exceptions of this type. The combination of these two unfortunate realities means that by default any timeout occurrence will remain unobserved. To fix this problem you could change your code like this:
var downloader = new TransformBlock<Uri, string>(async url =>
{
try
{
return await client.GetStringAsync(url);
}
catch (OperationCanceledException) { throw new TimeoutException(); }
},
new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = _maxParallelism });
A fourth unrelated issue is the use of the MaxDegreeOfParallelism = DataflowBlockOptions.Unbounded option for the deserializer block. In the (hopefully unlikely) case that the deserializer is slower than the downloader, the deserializer will start queuing more and more work on the ThreadPool, keeping it permanently starved. This will not be good for the performance and the responsiveness of your application, or for the health of the system as a whole. In practice there is rarely a reason to configure a CPU-bound block with a MaxDegreeOfParallelism larger than Environment.ProcessorCount.
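For the second issue above, the nested OutputAvailableAsync/TryReceive loop could be sketched roughly as follows. `DrainSketch` and `ReceiveAllAsync` are made-up names; the Dataflow calls themselves are the standard extension methods.

```csharp
using System.Collections.Generic;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

public static class DrainSketch
{
    // Collects every message from an unlinked source block until it completes.
    public static async Task<List<T>> ReceiveAllAsync<T>(IReceivableSourceBlock<T> source)
    {
        var results = new List<T>();

        // OutputAvailableAsync yields false once the block has completed and is empty.
        while (await source.OutputAvailableAsync())
        {
            // Take everything that is available right now, then wait again.
            while (source.TryReceive(out T item))
                results.Add(item);
        }

        await source.Completion; // surfaces a fault, if there was one
        return results;
    }
}
```

In the question's method this would replace both `await downloader.Completion` and `buffer.TryReceiveAll(out result)`: call `downloader.Complete()` and then `await DrainSketch.ReceiveAllAsync(buffer)`.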
I would like to run a bunch of async tasks, with a limit on how many tasks may be pending completion at any given time.
Say you have 1000 URLs, and you only want to have 50 requests open at a time; but as soon as one request completes, you open up a connection to the next URL in the list. That way, there are always exactly 50 connections open at a time, until the URL list is exhausted.
I also want to utilize a given number of threads if possible.
I came up with an extension method, ThrottleTasksAsync that does what I want. Is there a simpler solution already out there? I would assume that this is a common scenario.
Usage:
class Program
{
static void Main(string[] args)
{
Enumerable.Range(1, 10).ThrottleTasksAsync(5, 2, async i => { Console.WriteLine(i); return i; }).Wait();
Console.WriteLine("Press a key to exit...");
Console.ReadKey(true);
}
}
Here is the code:
static class IEnumerableExtensions
{
public static async Task<Result_T[]> ThrottleTasksAsync<Enumerable_T, Result_T>(this IEnumerable<Enumerable_T> enumerable, int maxConcurrentTasks, int maxDegreeOfParallelism, Func<Enumerable_T, Task<Result_T>> taskToRun)
{
var blockingQueue = new BlockingCollection<Enumerable_T>(new ConcurrentBag<Enumerable_T>());
var semaphore = new SemaphoreSlim(maxConcurrentTasks);
// Run the throttler on a separate thread.
var t = Task.Run(() =>
{
foreach (var item in enumerable)
{
// Wait for the semaphore
semaphore.Wait();
blockingQueue.Add(item);
}
blockingQueue.CompleteAdding();
});
var taskList = new List<Task<Result_T>>();
Parallel.ForEach(IterateUntilTrue(() => blockingQueue.IsCompleted), new ParallelOptions { MaxDegreeOfParallelism = maxDegreeOfParallelism },
_ =>
{
Enumerable_T item;
if (blockingQueue.TryTake(out item, 100))
{
taskList.Add(
// Run the task
taskToRun(item)
.ContinueWith(tsk =>
{
// For effect
Thread.Sleep(2000);
// Release the semaphore
semaphore.Release();
return tsk.Result;
}
)
);
}
});
// Await all the tasks.
return await Task.WhenAll(taskList);
}
static IEnumerable<bool> IterateUntilTrue(Func<bool> condition)
{
while (!condition()) yield return true;
}
}
The method utilizes BlockingCollection and SemaphoreSlim to make it work. The throttler is run on one thread, and all the async tasks are run on the other thread. To achieve parallelism, I added a maxDegreeOfParallelism parameter that's passed to a Parallel.ForEach loop re-purposed as a while loop.
The old version was:
foreach (var master = ...)
{
var details = ...;
Parallel.ForEach(details, detail => {
// Process each detail record here
}, new ParallelOptions { MaxDegreeOfParallelism = 15 });
// Perform the final batch updates here
}
But, the thread pool gets exhausted fast, and you can't do async/await.
Bonus:
To get around the problem in BlockingCollection where an exception is thrown in Take() when CompleteAdding() is called, I'm using the TryTake overload with a timeout. If I didn't use the timeout in TryTake, it would defeat the purpose of using a BlockingCollection since TryTake won't block. Is there a better way? Ideally, there would be a TakeAsync method.
As suggested, use TPL Dataflow.
A TransformBlock<TInput, TOutput> may be what you're looking for.
You define a MaxDegreeOfParallelism to limit how many strings can be transformed (i.e., how many urls can be downloaded) in parallel. You then post urls to the block, and when you're done you tell the block you're done adding items and you fetch the responses.
var downloader = new TransformBlock<string, HttpResponse>(
url => Download(url),
new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 50 }
);
var buffer = new BufferBlock<HttpResponse>();
downloader.LinkTo(buffer);
foreach(var url in urls)
downloader.Post(url);
//or await downloader.SendAsync(url);
downloader.Complete();
await downloader.Completion;
IList<HttpResponse> responses;
if (buffer.TryReceiveAll(out responses))
{
//process responses
}
Note: The TransformBlock buffers both its input and output. Why, then, do we need to link it to a BufferBlock?
Because the TransformBlock won't complete until all items (HttpResponse) have been consumed, and await downloader.Completion would hang. Instead, we let the downloader forward all its output to a dedicated buffer block - then we wait for the downloader to complete, and inspect the buffer block.
Say you have 1000 URLs, and you only want to have 50 requests open at
a time; but as soon as one request completes, you open up a connection
to the next URL in the list. That way, there are always exactly 50
connections open at a time, until the URL list is exhausted.
The following simple solution has surfaced many times here on SO. It doesn't use blocking code and doesn't create threads explicitly, so it scales very well:
const int MAX_DOWNLOADS = 50;
static async Task DownloadAsync(string[] urls)
{
using (var semaphore = new SemaphoreSlim(MAX_DOWNLOADS))
using (var httpClient = new HttpClient())
{
var tasks = urls.Select(async url =>
{
await semaphore.WaitAsync();
try
{
var data = await httpClient.GetStringAsync(url);
Console.WriteLine(data);
}
finally
{
semaphore.Release();
}
});
await Task.WhenAll(tasks);
}
}
The thing is, the processing of the downloaded data should be done on a different pipeline, with a different level of parallelism, especially if it's a CPU-bound processing.
E.g., you'd probably want to have 4 threads concurrently doing the data processing (the number of CPU cores), and up to 50 pending requests for more data (which do not use threads at all). AFAICT, this is not what your code is currently doing.
That's where TPL Dataflow or Rx may come in handy as a preferred solution. Yet it is certainly possible to implement something like this with plain TPL. Note, the only blocking code here is the one doing the actual data processing inside Task.Run:
const int MAX_DOWNLOADS = 50;
const int MAX_PROCESSORS = 4;
// process data
class Processing
{
SemaphoreSlim _semaphore = new SemaphoreSlim(MAX_PROCESSORS);
HashSet<Task> _pending = new HashSet<Task>();
object _lock = new Object();
async Task ProcessAsync(string data)
{
await _semaphore.WaitAsync();
try
{
await Task.Run(() =>
{
// simulate work
Thread.Sleep(1000);
Console.WriteLine(data);
});
}
finally
{
_semaphore.Release();
}
}
public async void QueueItemAsync(string data)
{
var task = ProcessAsync(data);
lock (_lock)
_pending.Add(task);
try
{
await task;
}
catch
{
if (!task.IsCanceled && !task.IsFaulted)
throw; // not the task's exception, rethrow
// don't remove faulted/cancelled tasks from the list
return;
}
// remove successfully completed tasks from the list
lock (_lock)
_pending.Remove(task);
}
public async Task WaitForCompleteAsync()
{
Task[] tasks;
lock (_lock)
tasks = _pending.ToArray();
await Task.WhenAll(tasks);
}
}
// download data
static async Task DownloadAsync(string[] urls)
{
var processing = new Processing();
using (var semaphore = new SemaphoreSlim(MAX_DOWNLOADS))
using (var httpClient = new HttpClient())
{
var tasks = urls.Select(async (url) =>
{
await semaphore.WaitAsync();
try
{
var data = await httpClient.GetStringAsync(url);
// put the result on the processing pipeline
processing.QueueItemAsync(data);
}
finally
{
semaphore.Release();
}
});
await Task.WhenAll(tasks.ToArray());
await processing.WaitForCompleteAsync();
}
}
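For comparison, the same two-stage arrangement could be sketched with TPL Dataflow as below. This is only an illustration, not the code above rewritten line by line: `DownloadPipelineSketch`, `download` and `process` are hypothetical names, with `download` standing in for something like `httpClient.GetStringAsync` and `process` for the CPU-bound work.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

public static class DownloadPipelineSketch
{
    public static async Task RunAsync(
        IEnumerable<string> urls,
        Func<string, Task<string>> download,
        Action<string> process,
        int maxDownloads = 50,
        int maxProcessors = 4)
    {
        // Up to maxDownloads requests are pending at once; no thread is blocked while waiting.
        var downloader = new TransformBlock<string, string>(download,
            new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = maxDownloads });

        // At most maxProcessors items are processed concurrently on the thread pool.
        var processor = new ActionBlock<string>(process,
            new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = maxProcessors });

        downloader.LinkTo(processor, new DataflowLinkOptions { PropagateCompletion = true });

        foreach (var url in urls)
            await downloader.SendAsync(url);
        downloader.Complete();

        // Awaiting the last block covers the whole pipeline.
        await processor.Completion;
    }
}
```

Each stage gets its own degree of parallelism, which is the separation the paragraph above asks for.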
As requested, here's the code I ended up going with.
The work is set up in a master-detail configuration, and each master is processed as a batch. Each unit of work is queued up in this fashion:
var success = true;
// Start processing all the master records.
Master master;
while (null != (master = await StoredProcedures.ClaimRecordsAsync(...)))
{
await masterBuffer.SendAsync(master);
}
// Finished sending master records
masterBuffer.Complete();
// Now, wait for all the batches to complete.
await batchAction.Completion;
return success;
Masters are buffered one at a time to save work for other outside processes. The details for each master are dispatched for work via the masterTransform TransformManyBlock. A BatchedJoinBlock is also created to collect the details in one batch.
The actual work is done in the detailTransform TransformBlock, asynchronously, 150 at a time. BoundedCapacity is set to 300 to ensure that too many Masters don't get buffered at the beginning of the chain, while also leaving room for enough detail records to be queued to allow 150 records to be processed at one time. The block outputs an object to its targets, because it's filtered across the links depending on whether it's a Detail or Exception.
The batchAction ActionBlock collects the output from all the batches, and performs bulk database updates, error logging, etc. for each batch.
There will be several BatchedJoinBlocks, one for each master. Since each ISourceBlock is output sequentially and each batch only accepts the number of detail records associated with one master, the batches will be processed in order. Each block only outputs one group, and is unlinked on completion. Only the last batch block propagates its completion to the final ActionBlock.
The dataflow network:
// The dataflow network
BufferBlock<Master> masterBuffer = null;
TransformManyBlock<Master, Detail> masterTransform = null;
TransformBlock<Detail, object> detailTransform = null;
ActionBlock<Tuple<IList<object>, IList<object>>> batchAction = null;
// Buffer master records to enable efficient throttling.
masterBuffer = new BufferBlock<Master>(new DataflowBlockOptions { BoundedCapacity = 1 });
// Sequentially transform master records into a stream of detail records.
masterTransform = new TransformManyBlock<Master, Detail>(async masterRecord =>
{
var records = await StoredProcedures.GetObjectsAsync(masterRecord);
// Filter the master records based on some criteria here
var filteredRecords = records;
// Only propagate completion to the last batch
var propagateCompletion = masterBuffer.Completion.IsCompleted && masterTransform.InputCount == 0;
// Create a batch join block to encapsulate the results of the master record.
var batchjoinblock = new BatchedJoinBlock<object, object>(records.Count(), new GroupingDataflowBlockOptions { MaxNumberOfGroups = 1 });
// Add the batch block to the detail transform pipeline's link queue, and link the batch block to the batch action block.
var detailLink1 = detailTransform.LinkTo(batchjoinblock.Target1, detailResult => detailResult is Detail);
var detailLink2 = detailTransform.LinkTo(batchjoinblock.Target2, detailResult => detailResult is Exception);
var batchLink = batchjoinblock.LinkTo(batchAction, new DataflowLinkOptions { PropagateCompletion = propagateCompletion });
// Unlink batchjoinblock upon completion.
// (the returned task does not need to be awaited, despite the warning.)
batchjoinblock.Completion.ContinueWith(task =>
{
detailLink1.Dispose();
detailLink2.Dispose();
batchLink.Dispose();
});
return filteredRecords;
}, new ExecutionDataflowBlockOptions { BoundedCapacity = 1 });
// Process each detail record asynchronously, 150 at a time.
detailTransform = new TransformBlock<Detail, object>(async detail => {
try
{
// Perform the action for each detail here asynchronously
await DoSomethingAsync();
return detail;
}
catch (Exception e)
{
success = false;
return e;
}
}, new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 150, BoundedCapacity = 300 });
// Perform the proper action for each batch
batchAction = new ActionBlock<Tuple<IList<object>, IList<object>>>(async batch =>
{
var details = batch.Item1.Cast<Detail>();
var errors = batch.Item2.Cast<Exception>();
// Do something with the batch here
}, new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 4 });
masterBuffer.LinkTo(masterTransform, new DataflowLinkOptions { PropagateCompletion = true });
masterTransform.LinkTo(detailTransform, new DataflowLinkOptions { PropagateCompletion = true });
The title of my question says it all.
I am looking for a TPL dataflow block that doesn't need an input.
Right now I am using a transform block but its input is unused.
I would build a block like this from a BufferBlock<T>: the method accepts a delegate that presents the ITargetBlock<T> side of the block and returns the ISourceBlock<T> side of it. This way, the delegate can send input to the block, but from the outside, it looks like a block that only produces output.
The code:
public static ISourceBlock<T> CreateProducerBlock<T>(
Func<ITargetBlock<T>, Task> producer,
int boundedCapacity = DataflowBlockOptions.Unbounded)
{
var block = new BufferBlock<T>(
new DataflowBlockOptions { BoundedCapacity = boundedCapacity });
Task.Run(async () =>
{
try
{
await producer(block);
block.Complete();
}
catch (Exception ex)
{
((IDataflowBlock)block).Fault(ex);
}
});
return block;
}
Example usage:
var producer = CreateProducerBlock<int>(async target =>
{
await target.SendAsync(10);
await target.SendAsync(20);
});
ITargetBlock<int> consumer = …;
producer.LinkTo(consumer);
Sometimes it is easiest to just use a throwaway bool as the input for a TransformManyBlock, and .Post(true) to kick off your pipeline.
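A small sketch of that trick, with made-up names (`KickoffSketch`, `RunAsync`) and a fixed range standing in for the real data source:

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

public static class KickoffSketch
{
    public static async Task<List<int>> RunAsync()
    {
        var received = new List<int>();

        // The bool input carries no information; it only triggers the producer.
        var producer = new TransformManyBlock<bool, int>(_ => Enumerable.Range(1, 3));
        var consumer = new ActionBlock<int>(i => received.Add(i));
        producer.LinkTo(consumer, new DataflowLinkOptions { PropagateCompletion = true });

        producer.Post(true); // kick off the pipeline
        producer.Complete();
        await consumer.Completion;
        return received;
    }
}
```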
I would like to run a bunch of async tasks, with a limit on how many tasks may be pending completion at any given time.
Say you have 1000 URLs, and you only want to have 50 requests open at a time; but as soon as one request completes, you open up a connection to the next URL in the list. That way, there are always exactly 50 connections open at a time, until the URL list is exhausted.
I also want to utilize a given number of threads if possible.
I came up with an extension method, ThrottleTasksAsync that does what I want. Is there a simpler solution already out there? I would assume that this is a common scenario.
Usage:
class Program
{
static void Main(string[] args)
{
Enumerable.Range(1, 10).ThrottleTasksAsync(5, 2, async i => { Console.WriteLine(i); return i; }).Wait();
Console.WriteLine("Press a key to exit...");
Console.ReadKey(true);
}
}
Here is the code:
static class IEnumerableExtensions
{
public static async Task<Result_T[]> ThrottleTasksAsync<Enumerable_T, Result_T>(this IEnumerable<Enumerable_T> enumerable, int maxConcurrentTasks, int maxDegreeOfParallelism, Func<Enumerable_T, Task<Result_T>> taskToRun)
{
var blockingQueue = new BlockingCollection<Enumerable_T>(new ConcurrentBag<Enumerable_T>());
var semaphore = new SemaphoreSlim(maxConcurrentTasks);
// Run the throttler on a separate thread.
var t = Task.Run(() =>
{
foreach (var item in enumerable)
{
// Wait for the semaphore
semaphore.Wait();
blockingQueue.Add(item);
}
blockingQueue.CompleteAdding();
});
var taskList = new List<Task<Result_T>>();
Parallel.ForEach(IterateUntilTrue(() => blockingQueue.IsCompleted), new ParallelOptions { MaxDegreeOfParallelism = maxDegreeOfParallelism },
_ =>
{
Enumerable_T item;
if (blockingQueue.TryTake(out item, 100))
{
taskList.Add(
// Run the task
taskToRun(item)
.ContinueWith(tsk =>
{
// For effect
Thread.Sleep(2000);
// Release the semaphore
semaphore.Release();
return tsk.Result;
}
)
);
}
});
// Await all the tasks.
return await Task.WhenAll(taskList);
}
static IEnumerable<bool> IterateUntilTrue(Func<bool> condition)
{
while (!condition()) yield return true;
}
}
The method utilizes BlockingCollection and SemaphoreSlim to make it work. The throttler is run on one thread, and all the async tasks are run on the other thread. To achieve parallelism, I added a maxDegreeOfParallelism parameter that's passed to a Parallel.ForEach loop re-purposed as a while loop.
The old version was:
foreach (var master = ...)
{
var details = ...;
Parallel.ForEach(details, detail => {
// Process each detail record here
}, new ParallelOptions { MaxDegreeOfParallelism = 15 });
// Perform the final batch updates here
}
But, the thread pool gets exhausted fast, and you can't do async/await.
Bonus:
To get around the problem in BlockingCollection where an exception is thrown in Take() when CompleteAdding() is called, I'm using the TryTake overload with a timeout. If I didn't use the timeout in TryTake, it would defeat the purpose of using a BlockingCollection since TryTake won't block. Is there a better way? Ideally, there would be a TakeAsync method.
As suggested, use TPL Dataflow.
A TransformBlock<TInput, TOutput> may be what you're looking for.
You define a MaxDegreeOfParallelism to limit how many strings can be transformed (i.e., how many urls can be downloaded) in parallel. You then post urls to the block, and when you're done you tell the block you're done adding items and you fetch the responses.
var downloader = new TransformBlock<string, HttpResponse>(
url => Download(url),
new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 50 }
);
var buffer = new BufferBlock<HttpResponse>();
downloader.LinkTo(buffer);
foreach(var url in urls)
downloader.Post(url);
//or await downloader.SendAsync(url);
downloader.Complete();
await downloader.Completion;
IList<HttpResponse> responses;
if (buffer.TryReceiveAll(out responses))
{
//process responses
}
Note: The TransformBlock buffers both its input and output. Why, then, do we need to link it to a BufferBlock?
Because the TransformBlock won't complete until all items (HttpResponse) have been consumed, and await downloader.Completion would hang. Instead, we let the downloader forward all its output to a dedicated buffer block - then we wait for the downloader to complete, and inspect the buffer block.
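If you would rather not add the extra BufferBlock, another option is to drain the TransformBlock yourself before awaiting its completion. This is only a minimal sketch: FakeDownloadAsync is a hypothetical stand-in for Download, and string stands in for HttpResponse, so neither is part of the code above.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

public static class DrainSketch
{
    // Hypothetical stand-in for Download(url); no real network access.
    static async Task<string> FakeDownloadAsync(string url)
    {
        await Task.Delay(10);
        return "response for " + url;
    }

    public static async Task<List<string>> DownloadAllAsync(IEnumerable<string> urls)
    {
        var downloader = new TransformBlock<string, string>(
            url => FakeDownloadAsync(url),
            new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 50 });

        foreach (var url in urls)
            downloader.Post(url);
        downloader.Complete();

        var responses = new List<string>();
        // OutputAvailableAsync returns false once the block has completed
        // and its output queue is empty, so the loop ends on its own.
        while (await downloader.OutputAvailableAsync())
        {
            while (downloader.TryReceive(out var response))
                responses.Add(response);
        }

        // Safe to await now: every output has already been consumed.
        await downloader.Completion;
        return responses;
    }

    public static async Task Main()
    {
        var responses = await DownloadAllAsync(new[] { "a", "b", "c" });
        Console.WriteLine(responses.Count);
    }
}
```

Here the consumer empties the output queue itself, so the block can reach completion without a linked target. The linked BufferBlock version is shorter when you only need the results at the end.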
Say you have 1000 URLs, and you only want to have 50 requests open at
a time; but as soon as one request completes, you open up a connection
to the next URL in the list. That way, there are always exactly 50
connections open at a time, until the URL list is exhausted.
The following simple solution has surfaced many times here on SO. It doesn't use blocking code and doesn't create threads explicitly, so it scales very well:
const int MAX_DOWNLOADS = 50;
static async Task DownloadAsync(string[] urls)
{
using (var semaphore = new SemaphoreSlim(MAX_DOWNLOADS))
using (var httpClient = new HttpClient())
{
var tasks = urls.Select(async url =>
{
await semaphore.WaitAsync();
try
{
var data = await httpClient.GetStringAsync(url);
Console.WriteLine(data);
}
finally
{
semaphore.Release();
}
});
await Task.WhenAll(tasks);
}
}
The thing is, the processing of the downloaded data should be done on a different pipeline, with a different level of parallelism, especially if the processing is CPU-bound.
E.g., you'd probably want to have 4 threads concurrently doing the data processing (the number of CPU cores), and up to 50 pending requests for more data (which do not use threads at all). AFAICT, this is not what your code is currently doing.
That's where TPL Dataflow or Rx may come in handy as a preferred solution. Yet it is certainly possible to implement something like this with plain TPL. Note, the only blocking code here is the part doing the actual data processing inside Task.Run:
const int MAX_DOWNLOADS = 50;
const int MAX_PROCESSORS = 4;
// process data
class Processing
{
SemaphoreSlim _semaphore = new SemaphoreSlim(MAX_PROCESSORS);
HashSet<Task> _pending = new HashSet<Task>();
object _lock = new Object();
async Task ProcessAsync(string data)
{
await _semaphore.WaitAsync();
try
{
await Task.Run(() =>
{
// simulate work
Thread.Sleep(1000);
Console.WriteLine(data);
});
}
finally
{
_semaphore.Release();
}
}
public async void QueueItemAsync(string data)
{
var task = ProcessAsync(data);
lock (_lock)
_pending.Add(task);
try
{
await task;
}
catch
{
if (!task.IsCanceled && !task.IsFaulted)
throw; // not the task's exception, rethrow
// don't remove faulted/cancelled tasks from the list
return;
}
// remove successfully completed tasks from the list
lock (_lock)
_pending.Remove(task);
}
public async Task WaitForCompleteAsync()
{
Task[] tasks;
lock (_lock)
tasks = _pending.ToArray();
await Task.WhenAll(tasks);
}
}
// download data
static async Task DownloadAsync(string[] urls)
{
var processing = new Processing();
using (var semaphore = new SemaphoreSlim(MAX_DOWNLOADS))
using (var httpClient = new HttpClient())
{
var tasks = urls.Select(async (url) =>
{
await semaphore.WaitAsync();
try
{
var data = await httpClient.GetStringAsync(url);
// put the result on the processing pipeline
processing.QueueItemAsync(data);
}
finally
{
semaphore.Release();
}
});
await Task.WhenAll(tasks.ToArray());
await processing.WaitForCompleteAsync();
}
}
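For comparison, the same two-stage shape (many pending downloads, few concurrent processors) could be sketched with TPL Dataflow roughly as follows. This is an illustration under assumptions, not a drop-in replacement: FakeDownloadAsync is a hypothetical stand-in for httpClient.GetStringAsync, and the upper-casing step stands in for the real CPU-bound work.

```csharp
using System;
using System.Collections.Concurrent;
using System.Linq;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

public static class PipelineSketch
{
    const int MAX_DOWNLOADS = 50;
    const int MAX_PROCESSORS = 4;

    // Hypothetical stand-in for httpClient.GetStringAsync(url).
    static async Task<string> FakeDownloadAsync(string url)
    {
        await Task.Delay(10);
        return "data from " + url;
    }

    public static async Task<int> RunAsync(string[] urls)
    {
        var processed = new ConcurrentBag<string>();

        // I/O-bound stage: up to MAX_DOWNLOADS requests in flight.
        var download = new TransformBlock<string, string>(
            url => FakeDownloadAsync(url),
            new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = MAX_DOWNLOADS });

        // Processing stage: at most MAX_PROCESSORS items handled at once.
        var process = new ActionBlock<string>(
            data => processed.Add(data.ToUpperInvariant()),
            new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = MAX_PROCESSORS });

        // Completion (and any fault) flows from the first stage to the second.
        download.LinkTo(process, new DataflowLinkOptions { PropagateCompletion = true });

        foreach (var url in urls)
            await download.SendAsync(url);
        download.Complete();

        await process.Completion;
        return processed.Count;
    }

    public static async Task Main()
    {
        var urls = Enumerable.Range(1, 100).Select(i => "url" + i).ToArray();
        Console.WriteLine(await RunAsync(urls));
    }
}
```

Each block carries its own MaxDegreeOfParallelism, so the two levels of concurrency are configured independently and no explicit semaphores or pending-task bookkeeping are needed.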
As requested, here's the code I ended up going with.
The work is set up in a master-detail configuration, and each master is processed as a batch. Each unit of work is queued up in this fashion:
var success = true;
// Start processing all the master records.
Master master;
while (null != (master = await StoredProcedures.ClaimRecordsAsync(...)))
{
await masterBuffer.SendAsync(master);
}
// Finished sending master records
masterBuffer.Complete();
// Now, wait for all the batches to complete.
await batchAction.Completion;
return success;
Masters are buffered one at a time to save work for other outside processes. The details for each master are dispatched for work via the masterTransform TransformManyBlock. A BatchedJoinBlock is also created to collect the details in one batch.
The actual work is done in the detailTransform TransformBlock, asynchronously, 150 at a time. BoundedCapacity is set to 300 to ensure that too many Masters don't get buffered at the beginning of the chain, while also leaving room for enough detail records to be queued to allow 150 records to be processed at one time. The block's output type is object, because each result is routed across filtered links depending on whether it is a Detail or an Exception.
The batchAction ActionBlock collects the output from all the batches, and performs bulk database updates, error logging, etc. for each batch.
There will be several BatchedJoinBlocks, one for each master. Since each ISourceBlock is output sequentially and each batch only accepts the number of detail records associated with one master, the batches will be processed in order. Each block only outputs one group, and is unlinked on completion. Only the last batch block propagates its completion to the final ActionBlock.
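To show the routing idea in isolation, here is a small sketch of one transform feeding a single BatchedJoinBlock through two predicate-filtered links. It is a simplified illustration only: the int inputs and the "even numbers fail" rule are made up for the example, and it uses one batch block rather than one per master.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

public static class BatchRoutingSketch
{
    // Returns (number of successes, number of errors) for one batch.
    public static async Task<Tuple<int, int>> RunAsync(int[] inputs)
    {
        // Emits either the input value or the caught Exception, typed as object.
        var transform = new TransformBlock<int, object>(n =>
        {
            try
            {
                // Made-up failure rule, purely for illustration.
                if (n % 2 == 0) throw new InvalidOperationException("even: " + n);
                return (object)n;
            }
            catch (Exception e)
            {
                return e;
            }
        });

        // A batch is emitted once both targets together hold inputs.Length items.
        var batch = new BatchedJoinBlock<object, object>(inputs.Length);

        // Filtered links: successes go to Target1, exceptions to Target2.
        transform.LinkTo(batch.Target1, result => result is int);
        transform.LinkTo(batch.Target2, result => result is Exception);

        foreach (var n in inputs)
            await transform.SendAsync(n);
        transform.Complete();

        Tuple<IList<object>, IList<object>> group = await batch.ReceiveAsync();
        return Tuple.Create(group.Item1.Count, group.Item2.Count);
    }

    public static async Task Main()
    {
        var counts = await RunAsync(new[] { 1, 2, 3, 4, 5 });
        Console.WriteLine(counts.Item1 + " ok, " + counts.Item2 + " failed");
    }
}
```

Because every result matches exactly one of the two predicates, nothing gets stuck in the transform's output queue. If some results could match neither filter, they would block the items behind them unless a catch-all link is added.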
The dataflow network:
// The dataflow network
BufferBlock<Master> masterBuffer = null;
TransformManyBlock<Master, Detail> masterTransform = null;
TransformBlock<Detail, object> detailTransform = null;
ActionBlock<Tuple<IList<object>, IList<object>>> batchAction = null;
// Buffer master records to enable efficient throttling.
masterBuffer = new BufferBlock<Master>(new DataflowBlockOptions { BoundedCapacity = 1 });
// Sequentially transform master records into a stream of detail records.
masterTransform = new TransformManyBlock<Master, Detail>(async masterRecord =>
{
var records = await StoredProcedures.GetObjectsAsync(masterRecord);
// Filter the detail records of this master based on some criteria here
var filteredRecords = records;
// Only propagate completion to the last batch
var propagateCompletion = masterBuffer.Completion.IsCompleted && masterTransform.InputCount == 0;
// Create a batch join block to encapsulate the results of the master record.
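// The batch size must match the number of records actually returned below, or the batch never fills.
var batchjoinblock = new BatchedJoinBlock<object, object>(filteredRecords.Count(), new GroupingDataflowBlockOptions { MaxNumberOfGroups = 1 });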
// Add the batch block to the detail transform pipeline's link queue, and link the batch block to the batch action block.
var detailLink1 = detailTransform.LinkTo(batchjoinblock.Target1, detailResult => detailResult is Detail);
var detailLink2 = detailTransform.LinkTo(batchjoinblock.Target2, detailResult => detailResult is Exception);
var batchLink = batchjoinblock.LinkTo(batchAction, new DataflowLinkOptions { PropagateCompletion = propagateCompletion });
// Unlink batchjoinblock upon completion.
// (the returned task does not need to be awaited, despite the warning.)
batchjoinblock.Completion.ContinueWith(task =>
{
detailLink1.Dispose();
detailLink2.Dispose();
batchLink.Dispose();
});
return filteredRecords;
}, new ExecutionDataflowBlockOptions { BoundedCapacity = 1 });
// Process each detail record asynchronously, 150 at a time.
detailTransform = new TransformBlock<Detail, object>(async detail => {
try
{
// Perform the action for each detail here asynchronously
await DoSomethingAsync();
return detail;
}
catch (Exception e)
{
success = false;
return e;
}
}, new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 150, BoundedCapacity = 300 });
// Perform the proper action for each batch
batchAction = new ActionBlock<Tuple<IList<object>, IList<object>>>(async batch =>
{
var details = batch.Item1.Cast<Detail>();
var errors = batch.Item2.Cast<Exception>();
// Do something with the batch here
}, new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 4 });
masterBuffer.LinkTo(masterTransform, new DataflowLinkOptions { PropagateCompletion = true });
masterTransform.LinkTo(detailTransform, new DataflowLinkOptions { PropagateCompletion = true });