How to correctly complete a branched pipeline with BatchBlock? - C#

Consider the following pipeline:
public static void Main()
{
var firstBlock = new TransformBlock<string, string>(s =>
{
Console.WriteLine("FirstBlock");
return s + "-FirstBlock";
});
var batchBlock = new BatchBlock<string>(100, new GroupingDataflowBlockOptions { Greedy = true });
var afterBatchBlock = new TransformManyBlock<string[], string>(strings =>
{
Console.WriteLine("AfterBatchBlock");
return new[] { strings[0] + "-AfterBatchBlock" };
});
var lastBlock = new ActionBlock<string>(s =>
{
Console.WriteLine($"LastBlock {s}");
});
firstBlock.LinkTo(batchBlock, new DataflowLinkOptions { PropagateCompletion = true }, x => x.Contains("0"));
batchBlock.LinkTo(afterBatchBlock, new DataflowLinkOptions { PropagateCompletion = true });
afterBatchBlock.LinkTo(lastBlock, new DataflowLinkOptions { PropagateCompletion = true });
firstBlock.LinkTo(lastBlock, new DataflowLinkOptions { PropagateCompletion = true });
firstBlock.Post("0");
firstBlock.Complete();
firstBlock.Completion.GetAwaiter().GetResult();
batchBlock.Completion.GetAwaiter().GetResult();
afterBatchBlock.Completion.GetAwaiter().GetResult();
lastBlock.Completion.GetAwaiter().GetResult();
}
Running this code blocks indefinitely, with the following output:
FirstBlock
AfterBatchBlock
What I think is happening behind the scenes is the following:
FirstBlock sends a Completion signal to all its linked targets (BatchBlock, AfterBatchBlock, LastBlock)
BatchBlock and LastBlock get completed
AfterBatchBlock then tries to pass its output to LastBlock, but LastBlock is already completed, so it gets stuck forever.
My question is:
Is it a bug?
Assuming it is by design, what is the recommended way to overcome the bad state my pipeline has reached?

No, it's not a bug. This is the expected behavior. You must first realize that you are not dealing with a simple, straightforward dataflow pipeline, but rather with a dataflow mesh. It's not a complex mesh, but still not what I would describe as a pipeline. Dealing with the completion of dataflow blocks that form a mesh can be tricky. In your case you have one block, the lastBlock, which is the target of two source blocks, the firstBlock and the afterBatchBlock. This block should be completed when both source blocks have completed, not when either one of them has. So you can't use the PropagateCompletion = true option when linking this block to its sources.
There is no built-in API to do this, but implementing a two-to-one completion propagator is not very difficult. In fact you could just copy-paste the PropagateCompletion method found in this answer, and use it like this:
afterBatchBlock.LinkTo(lastBlock);
firstBlock.LinkTo(lastBlock);
PropagateCompletion(new IDataflowBlock[] { afterBatchBlock, firstBlock },
new[] { lastBlock });
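For reference, a minimal sketch of such a two-to-one (or many-to-one) propagator is shown below. The linked answer's version is more elaborate; this sketch just assumes that completing the targets when all sources have finished, and forwarding a fault if any source faulted, is sufficient:
public static void PropagateCompletion(IDataflowBlock[] sources,
    IDataflowBlock[] targets)
{
    // Complete (or fault) the targets only when every source has completed.
    Task.WhenAll(sources.Select(source => source.Completion))
        .ContinueWith(allSources =>
        {
            foreach (var target in targets)
            {
                if (allSources.IsFaulted)
                    target.Fault(allSources.Exception.InnerException);
                else
                    target.Complete();
            }
        }, TaskScheduler.Default);
}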

Related

TPL: Dispose processed items

In C#, I am using the Task Parallel Library (TPL) to download an image, process the image, and save the analysis results. A simplified version of the code reads as follows.
var getImage = new TransformBlock<int, Image>(GetImage);
var proImage = new TransformBlock<Image, double>(ProcessImage);
var saveRes = new ActionBlock<double>(SaveResult);
var linkOptions = new DataflowLinkOptions() { PropagateCompletion = true };
getImage.LinkTo(proImage, linkOptions);
proImage.LinkTo(saveRes, linkOptions);
for (int x = 0; x < 1000000; x++)
getImage.Post(x);
getImage.Complete();
saveRes.Completion.Wait();
This works as expected, except for memory usage. I am expecting int_x, image_x, and double_x to be disposed when the pipeline has processed that iteration. In other words, I am expecting every resource created during the execution of getImage, proImage, and saveRes for iteration x to be disposed when the last block completes its execution. However, this implementation keeps all the objects in memory until I exit the scope of TPL.
Am I missing something? Is this the expected behavior of TPL? And is there any option to set so that the resources are released at the end of each iteration?
Update
Following the suggestion in the comments, I rewrote the code using BufferBlock and SendAsync as follows. However, I do not think it leads to reclaiming the resources consumed by each task. Setting the BoundedCapacity only causes my program to halt at a point where I believe it has reached the limit set by BoundedCapacity.
var blockOpts = new DataflowBlockOptions()
{ BoundedCapacity = 100 };
var imgBuffer = new BufferBlock<int>(blockOpts);
var getImage = new TransformBlock<int, Image>(GetImage, blockOpts);
var proImage = new TransformBlock<Image, double>(ProcessImage, blockOpts);
var saveRes = new ActionBlock<double>(SaveResult, blockOpts);
var linkOptions = new DataflowLinkOptions() { PropagateCompletion = true };
imgBuffer.LinkTo(getImage, linkOptions);
getImage.LinkTo(proImage, linkOptions);
proImage.LinkTo(saveRes, linkOptions);
for (int x = 0; x < 1000000; x++)
await imgBuffer.SendAsync(x);
getImage.Complete();
saveRes.Completion.Wait();
is this the expected behavior of TPL?
Yes. It doesn't root all the objects (they are available for garbage collection and finalization), but it does not dispose them, either.
and is there any option to set so the resources are released at the end of each iteration?
No.
how can I make sure dispose is auto called when the last block/action executed on an input?
To dispose objects, your code should call Dispose. This is fairly easily done by modifying ProcessImage or wrapping it in a delegate.
If ProcessImage is synchronous:
var proImage = new TransformBlock<Image, double>(image => { using (image) return ProcessImage(image); });
or if it's asynchronous:
var proImage = new TransformBlock<Image, double>(async image => { using (image) return await ProcessImage(image); });

Report when input to first dataflow block finishes all linked blocks

I am using TPL Dataflow to download data from a ticketing system.
The system takes the ticket number as the input, calls an API and receives a nested JSON response with various information. Once received, a set of blocks handles each level of the nested structure and writes it to a relational database. e.g. Conversation, Conversation Attachments, Users, User Photos, User Tags, etc
JSON
{
  "conversations": [
    {
      "id": 12345,
      "author_id": 23456,
      "body": "First Conversation"
    },
    {
      "id": 98765,
      "authorid": 34567,
      "body": "Second Conversation",
      "attachments": [
        {
          "attachmentid": 12345,
          "attachment_name": "Test.jpg"
        }
      ]
    }
  ],
  "users": [
    {
      "userid": 12345,
      "user_name": "John Smith"
    },
    {
      "userid": 34567,
      "user_name": "Joe Bloggs",
      "user_photo": {
        "photoid": 44556,
        "photo_name": "headshot.jpg"
      },
      "tags": [
        "development",
        "vip"
      ]
    }
  ]
}
Code
Some blocks need to broadcast so that deeper nesting can still have access to the data. For example, UserModelJson is broadcast so that one block can handle writing the User, one block can handle writing the User Tags, and one block can handle writing the User Photos.
var loadTicketsBlock = new TransformBlock<int, ConversationsModelJson>(async ticketNumber => await p.GetConversationObjectFromTicket(ticketNumber));
var broadcastConversationsObjectBlock = new BroadcastBlock<ConversationsModelJson>(conversations => conversations);
//Conversation
var getConversationsFromConversationObjectBlock = new TransformManyBlock<ConversationsModelJson, ConversationModelJson>(conversation => ModelConverter.ConvertConversationsObjectJsonToConversationJson(conversation));
var convertConversationsBlock = new TransformBlock<ConversationModelJson, ConversationModel>(conversation => ModelConverter.ConvertConversationJsonToConversation(conversation));
var batchConversionBlock = new BatchBlock<ConversationModel>(batchBlockSize);
var convertConversationsToDTBlock = new TransformBlock<IEnumerable<ConversationModel>, DataTable>(conversations => ModelConverter.ConvertConversationModelToConversationDT(conversations));
var writeConversationsBlock = new ActionBlock<DataTable>(async conversations => await p.ProcessConversationsAsync(conversations));
var getUsersFromConversationsBlock = new TransformManyBlock<ConversationsModelJson, UserModelJson>(conversations => ModelConverter.ConvertConversationsJsonToUsersJson(conversations));
var broadcastUserBlock = new BroadcastBlock<UserModelJson>(userModelJson => userModelJson);
//User
var convertUsersBlock = new TransformBlock<UserModelJson, UserModel>(user => ModelConverter.ConvertUserJsonToUser(user));
var batchUsersBlock = new BatchBlock<UserModel>(batchBlockSize);
var convertUsersToDTBlock = new TransformBlock<IEnumerable<UserModel>, DataTable>(users => ModelConverter.ConvertUserModelToUserDT(users));
var writeUsersBlock = new ActionBlock<DataTable>(async users => await p.ProcessUsersAsync(users));
//UserTag
var getUserTagsFromUserBlock = new TransformBlock<UserModelJson, UserTagModel>(user => ModelConverter.ConvertUserJsonToUserTag(user));
var batchTagsBlock = new BatchBlock<UserTagModel>(batchBlockSize);
var convertTagsToDTBlock = new TransformBlock<IEnumerable<UserTagModel>, DataTable>(tags => ModelConverter.ConvertUserTagModelToUserTagDT(tags));
var writeTagsBlock = new ActionBlock<DataTable>(async tags => await p.ProcessUserTagsAsync(tags));
DataflowLinkOptions linkOptions = new DataflowLinkOptions()
{
PropagateCompletion = true
};
loadTicketsBlock.LinkTo(broadcastConversationsObjectBlock, linkOptions);
//Conversation
broadcastConversationsObjectBlock.LinkTo(getConversationsFromConversationObjectBlock, linkOptions);
getConversationsFromConversationObjectBlock.LinkTo(convertConversationsBlock, linkOptions);
convertConversationsBlock.LinkTo(batchConversionBlock, linkOptions);
batchConversionBlock.LinkTo(convertConversationsToDTBlock, linkOptions);
convertConversationsToDTBlock.LinkTo(writeConversationsBlock, linkOptions);
var tickets = await provider.GetAllTicketsAsync();
foreach (var ticket in tickets)
{
cts.Token.ThrowIfCancellationRequested();
await loadTicketsBlock.SendAsync(ticket.TicketID);
}
loadTicketsBlock.Complete();
The LinkTo blocks are repeated for each type of data to be written.
I know when the whole pipeline is complete by using
await Task.WhenAll(<Last block of each branch>.Completion);
but if I pass ticket number 1 into the loadTicketsBlock block, how do I know when that specific ticket has been through all blocks in the pipeline and is therefore complete?
The reason that I want to know this is so that I can report to the UI that ticket 1 of 100 is complete.
You could consider using the TaskCompletionSource as the base class for all your sub-entities. For example:
class Attachment : TaskCompletionSource
{
}
class Conversation : TaskCompletionSource
{
}
Then every time you insert a sub-entity in the database, you mark it as completed:
attachment.SetResult();
...or if the insert fails, mark it as faulted:
attachment.SetException(ex);
Finally, you can combine all the asynchronous completions into one with the method Task.WhenAll:
Task ticketCompletion = Task.WhenAll(Enumerable.Empty<Task>()
.Append(ticket.Task)
.Concat(attachments.Select(e => e.Task))
.Concat(conversations.Select(e => e.Task)));
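For example, the block that writes attachments to the database could be the one that settles their completion. This is only a sketch: writeAttachmentsBlock and p.ProcessAttachmentsAsync are hypothetical names following the pattern of the question's code, and the parameterless SetResult assumes the non-generic TaskCompletionSource available since .NET 5.
var writeAttachmentsBlock = new ActionBlock<Attachment[]>(async attachments =>
{
    try
    {
        // Hypothetical bulk insert, analogous to p.ProcessConversationsAsync
        await p.ProcessAttachmentsAsync(attachments);
        foreach (var attachment in attachments)
            attachment.SetResult();
    }
    catch (Exception ex)
    {
        foreach (var attachment in attachments)
            attachment.SetException(ex);
    }
});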
If I am tracking progress in Dataflow, I usually set up the last block as a notify-the-UI-of-progress block. To be able to track the progress of your inputs, you need to keep the context of the original input in all the objects you pass around. In this case, you need to be able to tell that you are working on ticket 1 all the way through your pipeline; if one of your transforms drops the context that it is working on ticket 1, you will need to rethink the object types you pass through your pipeline so that you can keep that context.
A simple example of what I'm talking about is laid out below with a broadcast block going to three transform blocks, and then all three transform blocks going back to an action block that notifies about the progress of the pipelines.
When combining back into the single action block, you need to make sure not to propagate completion at that point, because as soon as one block propagates completion to the action block, that action block stops accepting input. Instead, wait for the last block of each pipeline to complete, and only then manually complete the final notify-the-UI action block.
using System;
using System.Threading.Tasks.Dataflow;
using System.Threading.Tasks;
using System.Collections.Generic;
public class Program
{
public static void Main()
{
var broadcastBlock = new BroadcastBlock<string>(x => x);
var transformBlockA = new TransformBlock<string, string>(x =>
{
return x + "A";
});
var transformBlockB = new TransformBlock<string, string>(x =>
{
return x + "B";
});
var transformBlockC = new TransformBlock<string, string>(x =>
{
return x + "C";
});
var ticketTracking = new Dictionary<int, List<string>>();
var notifyUiBlock = new ActionBlock<string>(x =>
{
var ticketNumber = int.Parse(x.Substring(5,1));
var taskLetter = x.Substring(7,1);
var success = ticketTracking.TryGetValue(ticketNumber, out var tasksComplete);
if (!success)
{
tasksComplete = new List<string>();
ticketTracking[ticketNumber] = tasksComplete;
}
tasksComplete.Add(taskLetter);
if (tasksComplete.Count == 3)
{
Console.WriteLine($"Ticket {ticketNumber} is complete");
}
});
DataflowLinkOptions linkOptions = new DataflowLinkOptions() {PropagateCompletion = true};
broadcastBlock.LinkTo(transformBlockA, linkOptions);
broadcastBlock.LinkTo(transformBlockB, linkOptions);
broadcastBlock.LinkTo(transformBlockC, linkOptions);
transformBlockA.LinkTo(notifyUiBlock);
transformBlockB.LinkTo(notifyUiBlock);
transformBlockC.LinkTo(notifyUiBlock);
for(var i = 0; i < 5; i++)
{
broadcastBlock.Post($"Task {i} ");
}
broadcastBlock.Complete();
Task.WhenAll(transformBlockA.Completion, transformBlockB.Completion, transformBlockC.Completion).Wait();
notifyUiBlock.Complete();
notifyUiBlock.Completion.Wait();
Console.WriteLine("Done");
}
}
This will give an output similar to the following:
Ticket 0 is complete
Ticket 1 is complete
Ticket 2 is complete
Ticket 3 is complete
Ticket 4 is complete
Done

Exception causing deadlock in TPL Dataflow pipeline

I have created a Dataflow pipeline using a BufferBlock, a TransformBlock and an ActionBlock. Due to an exception in the TransformBlock, the application deadlocks. I'm throttling data using BoundedCapacity.
My code is like this:
public async Task PerformOperation()
{
var bufferBlock = new BufferBlock<ObjA>(new DataflowBlockOptions { BoundedCapacity = 1 });
var fetchApiResponse = new TransformBlock<ObjA, ObjA>((item) => {
//Call an api to fetch result.
//Here for some data i get exception
return item;
}, new ExecutionDataflowBlockOptions { BoundedCapacity = 2, MaxDegreeOfParallelism = 2, CancellationToken = cancellationToken });
var finalBlock = new ActionBlock<ObjA>((item) => {
if (item != null)
{
SaveToDB(item);
}
}, new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 1, BoundedCapacity = 1, CancellationToken = cancellationToken });
bufferBlock.LinkTo(fetchApiResponse, new DataflowLinkOptions { PropagateCompletion = true });
fetchApiResponse.LinkTo(finalBlock, new DataflowLinkOptions { PropagateCompletion = true });
await FetchData(bufferBlock);
bufferBlock.Complete();
await Task.WhenAll(fetchApiResponse.Completion, finalBlock.Completion);
}
public async Task FetchData(BufferBlock<ObjA> bufferBlock)
{
List<ObjA> dataToProcessList = GetFromDB();
foreach (var item in dataToProcessList)
{
await bufferBlock.SendAsync(item);
}
}
Here, if an exception occurs in the fetchApiResponse block, the data stops moving and the pipeline deadlocks.
How do I handle exceptions in this pipeline?
Here around 200,000 records are pushed to the bufferBlock.
What is the best way to handle the exceptions without causing this deadlock?
UPDATE 1:
Added the FetchData method also.
Thanks
Binil
Rather than trying to make sense of what faulted and what didn't, the blocks should not allow unhandled exceptions. This is a very common pattern, also used in Go pipelines (https://blog.golang.org/pipelines).
The article Exception Handling in TPL Dataflow Networks explains how exceptions are handled.
When an unhandled exception is thrown, the block enters the faulted state, but only after all concurrent operations have finished.
That state is propagated to linked blocks that have PropagateCompletion set to true. That doesn't mean that downstream blocks will fault immediately, though.
Awaiting a faulted block throws. The line:
await Task.WhenAll(fetchApiResponse.Completion, finalBlock.Completion);
should have thrown, unless those blocks were still busy.
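To actually observe the fault instead of hanging, the await can be wrapped in a try/catch. A minimal sketch, using the names from the question:
try
{
    await Task.WhenAll(fetchApiResponse.Completion, finalBlock.Completion);
}
catch (Exception ex)
{
    // The block's unhandled exception surfaces here, but only after the
    // block has finished all of its in-flight operations.
    Console.WriteLine($"Pipeline faulted: {ex}");
}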
The solution - don't allow unhandled exceptions
Return a Result object instead. When making e.g. 1000 HTTP calls, it would be a bad idea for one exception to prevent the remaining calls anyway. This is broadly similar to Railway-oriented programming. A Dataflow pipeline is quite similar to a functional pipeline.
Each block should return a Result<T> class that wraps the actual result and somehow indicates success or failure. An exception-handling block should catch any exceptions and return a faulted Result<T> item. The LinkTo method can take a predicate that allows redirecting failed results to e.g. a logging block or a null target block.
Let's say we have this simple Result<T>:
class Result<T>
{
    public T Value { get; }
    public Exception Exception { get; }
    public bool Ok { get; }

    public Result() {}

    public Result(T value)
    {
        Value = value;
        Ok = true;
    }

    public Result(Exception exc)
    {
        Exception = exc;
        Ok = false;
    }
}
fetchApiResponse could be:
var fetchApiResponse = new TransformBlock<TA, Result<TA>>((item) => {
    try
    {
        ...
        return new Result<TA>(ObjA);
    }
    catch (Exception exc)
    {
        return new Result<TA>(exc);
    }
}, new ExecutionDataflowBlockOptions { BoundedCapacity = 2, MaxDegreeOfParallelism = 2, CancellationToken = cancellationToken });
and the LinkTo code could be:
var propagate = new DataflowLinkOptions { PropagateCompletion = true };
var nullBlock = DataflowBlock.NullTarget<Result<TA>>();
fetchApiResponse.LinkTo(nullBlock, msg => !msg.Ok);
fetchApiResponse.LinkTo(finalBlock, propagate, msg => msg.Ok);
In this case, bad messages are simply dumped to a null block.
There's no reason to use another buffer block, or to await all blocks. Both TransformBlock and ActionBlock have an input buffer controlled by their ExecutionDataflowBlockOptions.
Posting the messages and awaiting completion can be:
await FetchData(fetchApiResponse);
fetchApiResponse.Complete();
await finalBlock.Completion;
The null check in finalBlock can be removed too, if fetchApiResponse returns an empty Result object when there's no valid result.
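Putting that together, a sketch of finalBlock consuming Result<TA> directly (assuming SaveToDB accepts the unwrapped value):
// The LinkTo predicate above only routes successful results here,
// so the old null check is no longer needed.
var finalBlock = new ActionBlock<Result<TA>>(result =>
{
    SaveToDB(result.Value);
}, new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 1,
    BoundedCapacity = 1, CancellationToken = cancellationToken });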
More complex scenarios can be handled by more complex Result objects.
Abrupt termination
Even when the pipeline needs to terminate immediately, there shouldn't be any unhandled exceptions. A fault may propagate downstream but won't affect the upstream blocks. They'll keep their messages in memory and keep accepting input even though the rest of the pipeline is broken.
That can definitely look like a deadlock.
The solution to this is to use a CancellationTokenSource, pass its token to all blocks, and signal it if the pipeline needs to be terminated.
It is common practice in Go, for example, to use a channel like a CancellationTokenSource for precisely this reason, cancelling both downstream and upstream blocks. This is described in Go Concurrency Patterns: Pipelines and cancellation.
Early cancellation is useful if a block decides there's no reason to continue working, not just in case of error. In that case it can signal the CancellationTokenSource to stop the upstream blocks.
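A rough sketch of that wiring, with placeholder blocks (the real pipeline would pass the same token to every block it constructs):
// One CancellationTokenSource shared by every block in the pipeline.
var cts = new CancellationTokenSource();
var options = new ExecutionDataflowBlockOptions
{
    BoundedCapacity = 2,
    MaxDegreeOfParallelism = 2,
    CancellationToken = cts.Token // same token everywhere
};
var worker = new TransformBlock<int, int>(n => n * 2, options);
var sink = new ActionBlock<int>(n => Console.WriteLine(n), options);
worker.LinkTo(sink, new DataflowLinkOptions { PropagateCompletion = true });

// If the pipeline must terminate early, signal the token:
cts.Cancel();
// Pending SendAsync calls now return false (releasing the producer), and
// awaiting the blocks' Completion throws an OperationCanceledException.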
I couldn't go through the post of @Panagiotis Kanavos yet. Meanwhile, I have updated my code like this to handle the exception, based on the comments.
public async Task PerformOperation()
{
try
{
var bufferBlock = new BufferBlock<ObjA>(new DataflowBlockOptions { BoundedCapacity = 1 });
var fetchApiResponse = new TransformBlock<ObjA, ObjA>(async (item) => {
//Call an api to fetch result.
//Here for some data i get exception
try
{
int apiResult = await apiCall();
}
catch (Exception ex)
{
var dataflowBlock = (IDataflowBlock)bufferBlock;
dataflowBlock.Fault(ex);
throw;
}
return item;
}, new ExecutionDataflowBlockOptions { BoundedCapacity = 2, MaxDegreeOfParallelism = 2, CancellationToken = cancellationToken });
var finalBlock = new ActionBlock<ObjA>((item) => {
if (item != null)
{
SaveToDB(item);
}
}, new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 1, BoundedCapacity = 1, CancellationToken = cancellationToken });
bufferBlock.LinkTo(fetchApiResponse, new DataflowLinkOptions { PropagateCompletion = true });
fetchApiResponse.LinkTo(finalBlock, new DataflowLinkOptions { PropagateCompletion = true });
await FetchData(bufferBlock);
bufferBlock.Complete();
await Task.WhenAll(fetchApiResponse.Completion, finalBlock.Completion);
}
catch (AggregateException aex)
{
//logging the exceptions in aex
}
catch (Exception ex)
{
//logging the exception
}
}
public async Task FetchData(BufferBlock<ObjA> bufferBlock)
{
List<ObjA> dataToProcessList = GetFromDB();
foreach (var item in dataToProcessList)
{
if(!await bufferBlock.SendAsync(item))
{
break; //breaking the loop to stop pushing data.
}
}
}
This now stops the pipeline and doesn't deadlock. Since I'm dealing with lots of data, I'm planning to add a counter for the exceptions and stop the pipeline only if it exceeds a certain limit. If a small network glitch caused one API call to fail, the next call might still work.
I'll go through the new posts and update my code to make things better.
Please provide inputs.
Thanks
Binil

Backpressure via BufferBlock not working. (C# TPL Dataflow)

Typical situation: Fast producer, slow consumer, need to slow producer down.
Sample code that doesn't work as I expected (explained below):
// I assumed this block will behave like BlockingCollection, but it doesn't
var bb = new BufferBlock<int>(new DataflowBlockOptions {
BoundedCapacity = 3, // looks like this does nothing
});
// fast producer
int dataSource = -1;
var producer = Task.Run(() => {
while (dataSource < 10) {
var message = ++dataSource;
bb.Post(message);
Console.WriteLine($"Posted: {message}");
}
Console.WriteLine("Calling .Complete() on buffer block");
bb.Complete();
});
// slow consumer
var ab = new ActionBlock<int>(i => {
Thread.Sleep(500);
Console.WriteLine($"Received: {i}");
}, new ExecutionDataflowBlockOptions {
MaxDegreeOfParallelism = 2,
});
bb.LinkTo(ab);
ab.Completion.Wait();
How I thought this code would work, but it doesn't:
BufferBlock bb is the blocking queue with a capacity of 3. Once capacity is reached, the producer should not be able to .Post() to it until there's a vacant slot.
Doesn't work like that. bb seems to happily accept any number of messages.
producer is a task that quickly Posts messages. Once all messages have been posted, the call to bb.Complete() should propagate through the pipeline and signal shutdown once all messages have been processed. Hence the ab.Completion.Wait(); at the end.
Doesn't work either. As soon as .Complete() is called, action block ab won't receive any more messages.
This can be done with a BlockingCollection, of which I thought BufferBlock was the TPL Dataflow (TDF) equivalent. I guess I'm misunderstanding how backpressure is supposed to work in TPL Dataflow.
So where's the catch? How to run this pipeline, not allowing more than 3 messages in the buffer bb, and wait for its completion?
PS: I found this gist (https://gist.github.com/mnadel/df2ec09fe7eae9ba8938) where it's suggested to maintain a semaphore to block writing to BufferBlock. I thought this was "built-in".
Updates after accepting the answer:
If you're looking at this question, you need to remember that ActionBlock also has its own input buffer.
That's for one. You also need to realize that, because all blocks have their own input buffers, you don't need the BufferBlock for what you might think its name implies. A BufferBlock is more like a utility block for more complex architectures, or a load-balancing block. It's not a backpressure buffer.
Completion propagation needs to be defined explicitly at link level: when calling .LinkTo(), you need to explicitly pass new DataflowLinkOptions { PropagateCompletion = true } as the 2nd argument.
To introduce backpressure you need to use SendAsync when you send items into the block. This allows your producer to wait for the block to be ready for the item. Something like this is what you're looking for:
class Program
{
static async Task Main()
{
var options = new ExecutionDataflowBlockOptions()
{
BoundedCapacity = 3
};
var block = new ActionBlock<int>(async i =>
{
await Task.Delay(100);
Console.WriteLine(i);
}, options);
//Producer
foreach (var i in Enumerable.Range(0, 10))
{
await block.SendAsync(i);
}
block.Complete();
await block.Completion;
}
}
If you change this to use Post and print the result of the Post you'll see that many items fail to be passed to the block:
class Program
{
static async Task Main()
{
var options = new ExecutionDataflowBlockOptions()
{
BoundedCapacity = 1
};
var block = new ActionBlock<int>(async i =>
{
await Task.Delay(1000);
Console.WriteLine(i);
}, options);
//Producer
foreach (var i in Enumerable.Range(0, 10))
{
var result = block.Post(i);
Console.WriteLine(result);
}
block.Complete();
await block.Completion;
}
}
Output:
True
False
False
False
False
False
False
False
False
False
0
With the guidance from JSteward's answer, I came up with the following code.
It produces (reads etc.) new items concurrently with processing said items, maintaining a read-ahead buffer.
The completion signal is sent to the head of the chain when the "producer" has no more items.
The program also awaits the completion of the whole chain before terminating.
static async Task Main() {
string Time() => $"{DateTime.Now:hh:mm:ss.fff}";
// the buffer is added to the chain just for demonstration purposes
// the chain would work fine using just the built-in input buffer
// of the `action` block.
var buffer = new BufferBlock<int>(new DataflowBlockOptions {BoundedCapacity = 3});
var action = new ActionBlock<int>(async i =>
{
Console.WriteLine($"[{Time()}]: Processing: {i}");
await Task.Delay(500);
}, new ExecutionDataflowBlockOptions {MaxDegreeOfParallelism = 2, BoundedCapacity = 2});
// it's necessary to set `PropagateCompletion` property
buffer.LinkTo(action, new DataflowLinkOptions {PropagateCompletion = true});
//Producer
foreach (var i in Enumerable.Range(0, 10))
{
Console.WriteLine($"[{Time()}]: Ready to send: {i}");
await buffer.SendAsync(i);
Console.WriteLine($"[{Time()}]: Sent: {i}");
}
// we call `.Complete()` on the head of the chain and it's propagated forward
buffer.Complete();
await action.Completion;
}

TPL Dataflow, BroadcastBlock to BatchBlocks

I have a problem connecting BroadcastBlock(s) to BatchBlocks. The scenario is that the sources are BroadcastBlocks, and recipients are BatchBlocks.
In the simplified code below, only one of the supplemental action blocks executes. I even set the batchSize for each BatchBlock to 1 to illustrate the problem.
Setting Greedy to "true" would make the 2 ActionBlocks execute, but that's not what I want as it will cause the BatchBlock to proceed even if it's not complete yet. Any ideas?
class Program
{
static void Main(string[] args)
{
// My possible sources are BroadcastBlocks. Could be more
var source1 = new BroadcastBlock<int>(z => z);
// batch 1
// can be many potential sources, one for now
// I want all sources to arrive first before proceeding
var batch1 = new BatchBlock<int>(1, new GroupingDataflowBlockOptions() { Greedy = false });
var batch1Action = new ActionBlock<int[]>(arr =>
{
// this does not run sometimes
Console.WriteLine("Received from batch 1 block!");
foreach (var item in arr)
{
Console.WriteLine("Received {0}", item);
}
});
batch1.LinkTo(batch1Action, new DataflowLinkOptions() { PropagateCompletion = true });
// batch 2
// can be many potential sources, one for now
// I want all sources to arrive first before proceeding
var batch2 = new BatchBlock<int>(1, new GroupingDataflowBlockOptions() { Greedy = false });
var batch2Action = new ActionBlock<int[]>(arr =>
{
// this does not run sometimes
Console.WriteLine("Received from batch 2 block!");
foreach (var item in arr)
{
Console.WriteLine("Received {0}", item);
}
});
batch2.LinkTo(batch2Action, new DataflowLinkOptions() { PropagateCompletion = true });
// connect source(s)
source1.LinkTo(batch1, new DataflowLinkOptions() { PropagateCompletion = true });
source1.LinkTo(batch2, new DataflowLinkOptions() { PropagateCompletion = true });
// fire
source1.SendAsync(3);
Task.WaitAll(new Task[] { batch1Action.Completion, batch2Action.Completion });
Console.ReadLine();
}
}
It seems that there is a flaw in the internal mechanism of the TPL Dataflow library that supports the non-greedy functionality. What happens is that a BatchBlock configured as non-greedy will postpone all messages offered by linked blocks, instead of accepting them. It keeps an internal queue with the messages it has postponed, and when their number reaches its BatchSize configuration it attempts to consume the postponed messages; if successful, it propagates them downstream as expected. The problem is that source blocks like the BroadcastBlock and the BufferBlock will stop offering more messages to a block that has postponed a previously offered message, until that message has been consumed. The combination of these two behaviors results in a deadlock. No forward progress can be made, because the BatchBlock waits for more messages to be offered before consuming the postponed ones, and the BroadcastBlock waits for the postponed messages to be consumed before offering more messages...
This scenario occurs only with BatchSize larger than one (which is the typical configuration for this block).
Here is a demonstration of this problem. The source used is the more common BufferBlock instead of a BroadcastBlock. 10 messages are posted to a three-block pipeline, and the expected behavior is for the messages to flow through the pipeline to the last block. In reality nothing happens, and all messages remain stuck in the first block.
using System;
using System.Threading;
using System.Threading.Tasks.Dataflow;
public static class Program
{
static void Main(string[] args)
{
var bufferBlock = new BufferBlock<int>();
var batchBlock = new BatchBlock<int>(batchSize: 2,
new GroupingDataflowBlockOptions() { Greedy = false });
var actionBlock = new ActionBlock<int[]>(batch =>
Console.WriteLine($"Received: {String.Join(", ", batch)}"));
bufferBlock.LinkTo(batchBlock,
new DataflowLinkOptions() { PropagateCompletion = true });
batchBlock.LinkTo(actionBlock,
new DataflowLinkOptions() { PropagateCompletion = true });
for (int i = 1; i <= 10; i++)
{
var accepted = bufferBlock.Post(i);
Console.WriteLine(
$"bufferBlock.Post({i}) {(accepted ? "accepted" : "rejected")}");
Thread.Sleep(100);
}
bufferBlock.Complete();
actionBlock.Completion.Wait(millisecondsTimeout: 1000);
Console.WriteLine();
Console.WriteLine($"bufferBlock.Completion: {bufferBlock.Completion.Status}");
Console.WriteLine($"batchBlock.Completion: {batchBlock.Completion.Status}");
Console.WriteLine($"actionBlock.Completion: {actionBlock.Completion.Status}");
Console.WriteLine($"bufferBlock.Count: {bufferBlock.Count}");
}
}
Output:
bufferBlock.Post(1) accepted
bufferBlock.Post(2) accepted
bufferBlock.Post(3) accepted
bufferBlock.Post(4) accepted
bufferBlock.Post(5) accepted
bufferBlock.Post(6) accepted
bufferBlock.Post(7) accepted
bufferBlock.Post(8) accepted
bufferBlock.Post(9) accepted
bufferBlock.Post(10) accepted
bufferBlock.Completion: WaitingForActivation
batchBlock.Completion: WaitingForActivation
actionBlock.Completion: WaitingForActivation
bufferBlock.Count: 10
My guess is that the internal offer-consume-reserve-release mechanism has been tuned for maximum efficiency at supporting the BoundedCapacity functionality, which is critical for many applications, and the rarely used Greedy = false functionality was not as thoroughly tested.
The good news is that in your case you don't really need to set Greedy to false. A BatchBlock in the default greedy mode will not propagate fewer messages than the configured BatchSize, unless it has been marked as completed (in which case it propagates any leftover messages) or you manually call its TriggerBatch method at an arbitrary moment. The intended usage of the non-greedy configuration is to prevent resource starvation in complex graph scenarios with multiple dependencies between blocks.
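A minimal sketch of the greedy default, using TriggerBatch to flush a partial batch:
// A greedy BatchBlock (the default) emits only full batches of 5.
var batchBlock = new BatchBlock<int>(batchSize: 5);
var printBlock = new ActionBlock<int[]>(batch =>
    Console.WriteLine($"Batch: {String.Join(", ", batch)}"));
batchBlock.LinkTo(printBlock,
    new DataflowLinkOptions() { PropagateCompletion = true });

for (int i = 1; i <= 12; i++) batchBlock.Post(i); // emits {1..5} and {6..10}

batchBlock.TriggerBatch(); // flushes {11, 12} as a smaller batch
batchBlock.Complete();     // would also flush any leftovers
printBlock.Completion.Wait();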
