This problem reared its head while trying to implement the suggested solution to another question.
Problem summary
Performing a ReceiveAsync() call from a TransformBlock to a WriteOnceBlock is causing the TransformBlock to essentially remove itself from the flow. It ceases to propagate any kind of message, whether it's data or a completion signal.
System design
The system is intended for parsing large CSV files through a series of steps.
The problematic part of the flow can be (inexpertly) visualized as follows:
The parallelogram is a BufferBlock, diamonds are BroadcastBlocks, triangles are WriteOnceBlocks and arrows are TransformBlocks. Solid lines denote a link created with LinkTo(), and the dotted line represents a ReceiveAsync() call from the ParsedHeaderAndRecordJoiner to the ParsedHeaderContainer block. I'm aware that this flow is somewhat suboptimal but that's not the primary reason for the question.
Code
Application root
Here is the part of the class that creates the necessary blocks and links them together using PropagateCompletion:
using (var cancellationSource = new CancellationTokenSource())
{
var cancellationToken = cancellationSource.Token;
var temporaryEntityInstance = new Card(); // Just as an example
var producerQueue = queueFactory.CreateQueue<string>(new DataflowBlockOptions{CancellationToken = cancellationToken});
var recordDistributor = distributorFactory.CreateDistributor<string>(s => (string)s.Clone(),
new DataflowBlockOptions { CancellationToken = cancellationToken });
var headerRowContainer = containerFactory.CreateContainer<string>(s => (string)s.Clone(),
new DataflowBlockOptions { CancellationToken = cancellationToken });
var headerRowParser = new HeaderRowParserFactory().CreateHeaderRowParser(temporaryEntityInstance.GetType(), ';',
new ExecutionDataflowBlockOptions { CancellationToken = cancellationToken });
var parsedHeaderContainer = containerFactory.CreateContainer<HeaderParsingResult>(HeaderParsingResult.Clone,
new DataflowBlockOptions { CancellationToken = cancellationToken});
var parsedHeaderAndRecordJoiner = new ParsedHeaderAndRecordJoinerFactory().CreateParsedHeaderAndRecordJoiner(parsedHeaderContainer,
new ExecutionDataflowBlockOptions { CancellationToken = cancellationToken });
var entityParser = new EntityParserFactory().CreateEntityParser(temporaryEntityInstance.GetType(), ';',
dataflowBlockOptions: new ExecutionDataflowBlockOptions { CancellationToken = cancellationToken });
var entityDistributor = distributorFactory.CreateDistributor<EntityParsingResult>(EntityParsingResult.Clone,
new DataflowBlockOptions{CancellationToken = cancellationToken});
var linkOptions = new DataflowLinkOptions {PropagateCompletion = true};
// Producer subprocess
producerQueue.LinkTo(recordDistributor, linkOptions);
// Header subprocess
recordDistributor.LinkTo(headerRowContainer, linkOptions);
headerRowContainer.LinkTo(headerRowParser, linkOptions);
headerRowParser.LinkTo(parsedHeaderContainer, linkOptions);
parsedHeaderContainer.LinkTo(errorQueue, new DataflowLinkOptions{MaxMessages = 1, PropagateCompletion = true}, dataflowResult => !dataflowResult.WasSuccessful);
// Parsing subprocess
recordDistributor.LinkTo(parsedHeaderAndRecordJoiner, linkOptions);
parsedHeaderAndRecordJoiner.LinkTo(entityParser, linkOptions, joiningResult => joiningResult.WasSuccessful);
entityParser.LinkTo(entityDistributor, linkOptions);
entityDistributor.LinkTo(errorQueue, linkOptions, dataflowResult => !dataflowResult.WasSuccessful);
}
HeaderRowParser
This block parses a header row from a CSV file and does some validation.
public class HeaderRowParserFactory
{
public TransformBlock<string, HeaderParsingResult> CreateHeaderRowParser(Type entityType,
char delimiter,
ExecutionDataflowBlockOptions dataflowBlockOptions = null)
{
return new TransformBlock<string, HeaderParsingResult>(headerRow =>
{
// Set up some containers
var result = new HeaderParsingResult(identifier: "N/A", wasSuccessful: true);
var fieldIndexesByPropertyName = new Dictionary<string, int>();
// Get all serializable properties on the chosen entity type
var serializableProperties = entityType.GetProperties()
.Where(prop => prop.IsDefined(typeof(CsvFieldNameAttribute), false))
.ToList();
// Add their CSV fieldnames to the result
var entityFieldNames = serializableProperties.Select(prop => prop.GetCustomAttribute<CsvFieldNameAttribute>().FieldName);
result.SetEntityFieldNames(entityFieldNames);
// Create the dictionary of properties by field name
var serializablePropertiesByFieldName = serializableProperties.ToDictionary(prop => prop.GetCustomAttribute<CsvFieldNameAttribute>().FieldName, prop => prop, StringComparer.OrdinalIgnoreCase);
var fields = headerRow.Split(delimiter);
for (var i = 0; i < fields.Length; i++)
{
// If any field in the CSV is unknown as a serializable property, we return a failed result
if (!serializablePropertiesByFieldName.TryGetValue(fields[i], out var foundProperty))
{
result.Invalidate($"The header row contains a field that does not match any of the serializable properties - {fields[i]}.",
DataflowErrorSeverity.Critical);
return result;
}
// Perform a bunch more validation
fieldIndexesByPropertyName.Add(foundProperty.Name, i);
}
result.SetFieldIndexesByName(fieldIndexesByPropertyName);
return result;
}, dataflowBlockOptions ?? new ExecutionDataflowBlockOptions());
}
}
ParsedHeaderAndRecordJoiner
For each subsequent record that comes through the pipe, this block is intended to retrieve the parsed header data and add it to the record.
public class ParsedHeaderAndRecordJoinerFactory
{
public TransformBlock<string, HeaderAndRecordJoiningResult> CreateParsedHeaderAndRecordJoiner(WriteOnceBlock<HeaderParsingResult> parsedHeaderContainer,
ExecutionDataflowBlockOptions dataflowBlockOptions = null)
{
return new TransformBlock<string, HeaderAndRecordJoiningResult>(async csvRecord =>
{
var headerParsingResult = await parsedHeaderContainer.ReceiveAsync();
// If the header couldn't be parsed, a critical error is already on its way to the failure logger so we don't need to continue
if (!headerParsingResult.WasSuccessful) return new HeaderAndRecordJoiningResult(identifier: "N.A.", wasSuccessful: false, null, null);
// The entity parser can't do anything with the header record, so we send a message with wasSuccessful false
var isHeaderRecord = true;
foreach (var entityFieldName in headerParsingResult.EntityFieldNames)
{
isHeaderRecord &= csvRecord.Contains(entityFieldName);
}
if (isHeaderRecord) return new HeaderAndRecordJoiningResult(identifier: "N.A.", wasSuccessful: false, null, null);
return new HeaderAndRecordJoiningResult(identifier: "N.A.", wasSuccessful: true, headerParsingResult, csvRecord);
}, dataflowBlockOptions ?? new ExecutionDataflowBlockOptions());
}
}
Problem detail
With the current implementation, the ParsedHeaderAndRecordJoiner receives the data correctly from the ReceiveAsync() call to the ParsedHeaderContainer and returns as expected; however no message arrives at the EntityParser.
Also, when a Complete signal is sent to the front of the flow (the ProducerQueue), it propagates to the RecordDistributor but then stops at the ParsedHeaderAndRecordJoiner (it does continue from the HeaderRowContainer onwards, so the RecordDistributor is passing it on).
If I remove the ReceiveAsync() call and mock the data myself, the block behaves as expected.
I think this part is key
however no message arrives at the EntityParser.
Based on the sample, the only way the EntityParser doesn't receive a message output by the ParsedHeaderAndRecordJoiner is when WasSuccessful is false. The predicate used in your link excludes failed messages, but those messages have nowhere to go, so they build up in the ParsedHeaderAndRecordJoiner's output buffer and would also prevent Completion from propagating. You'd need to link a null target to dump the failed messages:
parsedHeaderAndRecordJoiner.LinkTo(DataflowBlock.NullTarget<HeaderAndRecordJoiningResult>());
Further, if your mock data always comes back with WasSuccessful set to true, that might be pointing you to the await ...ReceiveAsync() call.
Not necessarily the smoking gun, but a good place to start. Can you confirm the state of all messages in the output buffer of the ParsedHeaderAndRecordJoiner when the pipeline gets stuck?
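As a first check, TransformBlock exposes InputCount and OutputCount, so a small diagnostic sketch (reusing the variable name from your application root) can dump the buffer state while the pipeline is stuck:
// Diagnostic sketch: inspect the joiner's buffers and completion state while stuck.
Console.WriteLine($"Joiner input buffer:  {parsedHeaderAndRecordJoiner.InputCount}");
Console.WriteLine($"Joiner output buffer: {parsedHeaderAndRecordJoiner.OutputCount}");
Console.WriteLine($"Joiner completion:    {parsedHeaderAndRecordJoiner.Completion.Status}");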
Related
I am using TPL Dataflow to download data from a ticketing system.
The system takes the ticket number as the input, calls an API and receives a nested JSON response with various information. Once received, a set of blocks handles each level of the nested structure and writes it to a relational database. e.g. Conversation, Conversation Attachments, Users, User Photos, User Tags, etc
Json
{
   "conversations":[
      {
         "id":12345,
         "author_id":23456,
         "body":"First Conversation"
      },
      {
         "id":98765,
         "author_id":34567,
         "body":"Second Conversation",
         "attachments":[
            {
               "attachment_id":12345,
               "attachment_name":"Test.jpg"
            }
         ]
      }
   ],
   "users":[
      {
         "user_id":12345,
         "user_name":"John Smith"
      },
      {
         "user_id":34567,
         "user_name":"Joe Bloggs",
         "user_photo":{
            "photo_id":44556,
            "photo_name":"headshot.jpg"
         },
         "tags":[
            "development",
            "vip"
         ]
      }
   ]
}
Code
Some blocks need to broadcast so that deeper nesting can still have access to the data. e.g. UserModelJson is broadcast so that 1 block can handle writing the user, 1 block can handle writing the User Tags and 1 block can handle writing the User Photos.
var loadTicketsBlock = new TransformBlock<int, ConversationsModelJson>(async ticketNumber => await p.GetConversationObjectFromTicket(ticketNumber));
var broadcastConversationsObjectBlock = new BroadcastBlock<ConversationsModelJson>(conversations => conversations);
//Conversation
var getConversationsFromConversationObjectBlock = new TransformManyBlock<ConversationsModelJson, ConversationModelJson>(conversation => ModelConverter.ConvertConversationsObjectJsonToConversationJson(conversation));
var convertConversationsBlock = new TransformBlock<ConversationModelJson, ConversationModel>(conversation => ModelConverter.ConvertConversationJsonToConversation(conversation));
var batchConversionBlock = new BatchBlock<ConversationModel>(batchBlockSize);
var convertConversationsToDTBlock = new TransformBlock<IEnumerable<ConversationModel>, DataTable>(conversations => ModelConverter.ConvertConversationModelToConversationDT(conversations));
var writeConversationsBlock = new ActionBlock<DataTable>(async conversations => await p.ProcessConversationsAsync(conversations));
var getUsersFromConversationsBlock = new TransformManyBlock<ConversationsModelJson, UserModelJson>(conversations => ModelConverter.ConvertConversationsJsonToUsersJson(conversations));
var broadcastUserBlock = new BroadcastBlock<UserModelJson>(userModelJson => userModelJson);
//User
var convertUsersBlock = new TransformBlock<UserModelJson, UserModel>(user => ModelConverter.ConvertUserJsonToUser(user));
var batchUsersBlock = new BatchBlock<UserModel>(batchBlockSize);
var convertUsersToDTBlock = new TransformBlock<IEnumerable<UserModel>, DataTable>(users => ModelConverter.ConvertUserModelToUserDT(users));
var writeUsersBlock = new ActionBlock<DataTable>(async users => await p.ProcessUsersAsync(users));
//UserTag
var getUserTagsFromUserBlock = new TransformBlock<UserModelJson, UserTagModel>(user => ModelConverter.ConvertUserJsonToUserTag(user));
var batchTagsBlock = new BatchBlock<UserTagModel>(batchBlockSize);
var convertTagsToDTBlock = new TransformBlock<IEnumerable<UserTagModel>, DataTable>(tags => ModelConverter.ConvertUserTagModelToUserTagDT(tags));
var writeTagsBlock = new ActionBlock<DataTable>(async tags => await p.ProcessUserTagsAsync(tags));
DataflowLinkOptions linkOptions = new DataflowLinkOptions()
{
PropagateCompletion = true
};
loadTicketsBlock.LinkTo(broadcastConversationsObjectBlock, linkOptions);
//Conversation
broadcastConversationsObjectBlock.LinkTo(getConversationsFromConversationObjectBlock, linkOptions);
getConversationsFromConversationObjectBlock.LinkTo(convertConversationsBlock, linkOptions);
convertConversationsBlock.LinkTo(batchConversionBlock, linkOptions);
batchConversionBlock.LinkTo(convertConversationsToDTBlock, linkOptions);
convertConversationsToDTBlock.LinkTo(writeConversationsBlock, linkOptions);
var tickets = await provider.GetAllTicketsAsync();
foreach (var ticket in tickets)
{
cts.Token.ThrowIfCancellationRequested();
await loadTicketsBlock.SendAsync(ticket.TicketID);
}
loadTicketsBlock.Complete();
The LinkTo blocks are repeated for each type of data to be written.
I know when the whole pipeline is complete by using
await Task.WhenAll(<Last block of each branch>.Completion);
but if I pass in ticket number 1 into the loadTicketsBlock block then how do I know when that specific ticket has been through all blocks in the pipeline and therefore is complete?
The reason that I want to know this is so that I can report to the UI that ticket 1 of 100 is complete.
You could consider using TaskCompletionSource as the base class for all your sub-entities. For example:
class Attachment : TaskCompletionSource
{
}
class Conversation : TaskCompletionSource
{
}
Then every time you insert a sub-entity in the database, you mark it as completed:
attachment.SetResult();
...or if the insert fails, mark it as faulted:
attachment.SetException(ex);
Finally you can combine all the asynchronous completions in one, with the method Task.WhenAll:
Task ticketCompletion = Task.WhenAll(Enumerable.Empty<Task>()
.Append(ticket.Task)
.Concat(attachments.Select(e => e.Task))
.Concat(conversations.Select(e => e.Task)));
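To surface this to the UI, you could then attach a continuation per ticket; in this sketch, progress (an IProgress<string>), ticketNumber and totalTickets are illustrative names, not part of the code above:
// Hypothetical reporting hook; progress, ticketNumber and totalTickets are illustrative.
ticketCompletion.ContinueWith(t =>
{
    progress.Report(t.IsFaulted
        ? $"Ticket {ticketNumber} failed"
        : $"Ticket {ticketNumber} of {totalTickets} is complete");
}, TaskScheduler.Default);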
If I am tracking progress in Dataflow, I will usually set up the last block as a notify-the-UI-of-progress block. To track the progress of your inputs, you need to keep the context of the original input in all the objects you pass around; in this case, you need to be able to tell that you are working on ticket 1 all the way through your pipeline. If one of your transforms drops the context that it is working on ticket 1, you will need to rethink the object types you pass through your pipeline so you can keep that context.
A simple example of what I'm talking about is laid out below with a broadcast block going to three transform blocks, and then all three transform blocks going back to an action block that notifies about the progress of the pipelines.
When combining back into the single action block, you need to make sure not to propagate completion at that point: as soon as one block propagates completion to the action block, that action block will stop accepting input. Instead, wait for the last block of each pipeline to complete, and only then manually complete your final notify-the-UI action block.
using System;
using System.Threading.Tasks.Dataflow;
using System.Threading.Tasks;
using System.Collections.Generic;
public class Program
{
public static void Main()
{
var broadcastBlock = new BroadcastBlock<string>(x => x);
var transformBlockA = new TransformBlock<string, string>(x =>
{
return x + "A";
});
var transformBlockB = new TransformBlock<string, string>(x =>
{
return x + "B";
});
var transformBlockC = new TransformBlock<string, string>(x =>
{
return x + "C";
});
var ticketTracking = new Dictionary<int, List<string>>();
var notifyUiBlock = new ActionBlock<string>(x =>
{
var ticketNumber = int.Parse(x.Substring(5,1));
var taskLetter = x.Substring(7,1);
var success = ticketTracking.TryGetValue(ticketNumber, out var tasksComplete);
if (!success)
{
tasksComplete = new List<string>();
ticketTracking[ticketNumber] = tasksComplete;
}
tasksComplete.Add(taskLetter);
if (tasksComplete.Count == 3)
{
Console.WriteLine($"Ticket {ticketNumber} is complete");
}
});
DataflowLinkOptions linkOptions = new DataflowLinkOptions() {PropagateCompletion = true};
broadcastBlock.LinkTo(transformBlockA, linkOptions);
broadcastBlock.LinkTo(transformBlockB, linkOptions);
broadcastBlock.LinkTo(transformBlockC, linkOptions);
transformBlockA.LinkTo(notifyUiBlock);
transformBlockB.LinkTo(notifyUiBlock);
transformBlockC.LinkTo(notifyUiBlock);
for(var i = 0; i < 5; i++)
{
broadcastBlock.Post($"Task {i} ");
}
broadcastBlock.Complete();
Task.WhenAll(transformBlockA.Completion, transformBlockB.Completion, transformBlockC.Completion).Wait();
notifyUiBlock.Complete();
notifyUiBlock.Completion.Wait();
Console.WriteLine("Done");
}
}
This will give an output similar to this
Ticket 0 is complete
Ticket 1 is complete
Ticket 2 is complete
Ticket 3 is complete
Ticket 4 is complete
Done
Typical situation: Fast producer, slow consumer, need to slow producer down.
Sample code that doesn't work as I expected (explained below):
// I assumed this block will behave like BlockingCollection, but it doesn't
var bb = new BufferBlock<int>(new DataflowBlockOptions {
BoundedCapacity = 3, // looks like this does nothing
});
// fast producer
int dataSource = -1;
var producer = Task.Run(() => {
while (dataSource < 10) {
var message = ++dataSource;
bb.Post(message);
Console.WriteLine($"Posted: {message}");
}
Console.WriteLine("Calling .Complete() on buffer block");
bb.Complete();
});
// slow consumer
var ab = new ActionBlock<int>(i => {
Thread.Sleep(500);
Console.WriteLine($"Received: {i}");
}, new ExecutionDataflowBlockOptions {
MaxDegreeOfParallelism = 2,
});
bb.LinkTo(ab);
ab.Completion.Wait();
How I thought this code would work, but it doesn't:
BufferBlock bb is the blocking queue with a capacity of 3. Once capacity is reached, the producer should not be able to .Post() to it until there's a vacant slot.
Doesn't work like that. bb seems to happily accept any number of messages.
producer is a task that quickly posts messages. Once all messages have been posted, the call to bb.Complete() should propagate through the pipeline and signal shutdown once all messages have been processed; hence the ab.Completion.Wait() at the end.
Doesn't work either. As soon as .Complete() is called, action block ab won't receive any more messages.
This can be done with a BlockingCollection, which I thought BufferBlock was the TPL Dataflow (TDF) equivalent of. I guess I'm misunderstanding how backpressure is supposed to work in TPL Dataflow.
So where's the catch? How to run this pipeline, not allowing more than 3 messages in the buffer bb, and wait for its completion?
PS: I found this gist (https://gist.github.com/mnadel/df2ec09fe7eae9ba8938) where it's suggested to maintain a semaphore to block writing to BufferBlock. I thought this was "built-in".
Updates after accepting the answer:
If you're looking at this question, you need to remember that ActionBlock also has its own input buffer.
You also need to realize that, because all blocks have their own input buffers, you don't need a BufferBlock for what its name might suggest. A BufferBlock is more of a utility block for more complex architectures, or a load-balancing block; it is not a backpressure buffer.
Completion propagation needs to be defined explicitly at the link level: when calling .LinkTo(), pass new DataflowLinkOptions {PropagateCompletion = true} as the second argument.
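For example, re-linking the question's bb and ab with completion propagation looks like this:
bb.LinkTo(ab, new DataflowLinkOptions {PropagateCompletion = true});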
To introduce backpressure you need to use SendAsync when you send items into the block. This allows your producer to wait for the block to be ready for the item. Something like this is what you're looking for:
class Program
{
static async Task Main()
{
var options = new ExecutionDataflowBlockOptions()
{
BoundedCapacity = 3
};
var block = new ActionBlock<int>(async i =>
{
await Task.Delay(100);
Console.WriteLine(i);
}, options);
//Producer
foreach (var i in Enumerable.Range(0, 10))
{
await block.SendAsync(i);
}
block.Complete();
await block.Completion;
}
}
If you change this to use Post and print the result of the Post you'll see that many items fail to be passed to the block:
class Program
{
static async Task Main()
{
var options = new ExecutionDataflowBlockOptions()
{
BoundedCapacity = 1
};
var block = new ActionBlock<int>(async i =>
{
await Task.Delay(1000);
Console.WriteLine(i);
}, options);
//Producer
foreach (var i in Enumerable.Range(0, 10))
{
var result = block.Post(i);
Console.WriteLine(result);
}
block.Complete();
await block.Completion;
}
}
Output:
True
False
False
False
False
False
False
False
False
False
0
With the guidance from JSteward's answer, I came up with the following code.
It produces (reads etc.) new items concurrently with processing said items, maintaining a read-ahead buffer.
The completion signal is sent to the head of the chain when the "producer" has no more items.
The program also awaits the completion of the whole chain before terminating.
static async Task Main() {
string Time() => $"{DateTime.Now:hh:mm:ss.fff}";
// the buffer is added to the chain just for demonstration purposes
// the chain would work fine using just the built-in input buffer
// of the `action` block.
var buffer = new BufferBlock<int>(new DataflowBlockOptions {BoundedCapacity = 3});
var action = new ActionBlock<int>(async i =>
{
Console.WriteLine($"[{Time()}]: Processing: {i}");
await Task.Delay(500);
}, new ExecutionDataflowBlockOptions {MaxDegreeOfParallelism = 2, BoundedCapacity = 2});
// it's necessary to set `PropagateCompletion` property
buffer.LinkTo(action, new DataflowLinkOptions {PropagateCompletion = true});
//Producer
foreach (var i in Enumerable.Range(0, 10))
{
Console.WriteLine($"[{Time()}]: Ready to send: {i}");
await buffer.SendAsync(i);
Console.WriteLine($"[{Time()}]: Sent: {i}");
}
// we call `.Complete()` on the head of the chain and it's propagated forward
buffer.Complete();
await action.Completion;
}
I am getting items from an upstream API which is quite slow. I am trying to speed this up by using TPL Dataflow to create multiple connections and bring the results together, like this:
class Stuff
{
int Id { get; }
}
async Task<Stuff> GetStuffById(int id) => throw new NotImplementedException();
async Task<IEnumerable<Stuff>> GetLotsOfStuff(IEnumerable<int> ids)
{
var bagOfStuff = new ConcurrentBag<Stuff>();
var options = new ExecutionDataflowBlockOptions
{
MaxDegreeOfParallelism = 5
};
var processor = new ActionBlock<int>(async id =>
{
bagOfStuff.Add(await GetStuffById(id));
}, options);
foreach (int id in ids)
{
processor.Post(id);
}
processor.Complete();
await processor.Completion;
return bagOfStuff.ToArray();
}
The problem is that I have to wait until I have finished querying the entire collection of Stuff before I can return it to the caller. What I would prefer is that, whenever any of the multiple parallel queries returns an item, I return that item in a yield return fashion. Therefore I don't need to return a Task<IEnumerable<Stuff>>; I can just return an IEnumerable<Stuff>, and the caller advances the iteration as soon as any item returns.
I tried doing it like this:
IEnumerable<Stuff> GetLotsOfStuff(IEnumerable<int> ids)
{
var options = new ExecutionDataflowBlockOptions
{
MaxDegreeOfParallelism = 5
};
var processor = new ActionBlock<int>(async id =>
{
yield return await GetStuffById(id);
}, options);
foreach (int id in ids)
{
processor.Post(id);
}
processor.Complete();
processor.Completion.Wait();
yield break;
}
But I get an error
The yield statement cannot be used inside an anonymous method or lambda expression
How can I restructure my code?
You can return an IEnumerable, but to do so you must block your current thread. You need a TransformBlock to process the ids, and a feeder task that asynchronously feeds the TransformBlock with ids. Finally, the current thread enters a blocking loop, waiting for produced stuff to yield:
static IEnumerable<Stuff> GetLotsOfStuff(IEnumerable<int> ids)
{
using var completionCTS = new CancellationTokenSource();
var processor = new TransformBlock<int, Stuff>(async id =>
{
return await GetStuffById(id);
}, new ExecutionDataflowBlockOptions
{
MaxDegreeOfParallelism = 5,
BoundedCapacity = 50, // Avoid buffering millions of ids
CancellationToken = completionCTS.Token
});
var feederTask = Task.Run(async () =>
{
try
{
foreach (int id in ids)
if (!await processor.SendAsync(id)) break;
}
finally { processor.Complete(); }
});
try
{
while (processor.OutputAvailableAsync().Result)
while (processor.TryReceive(out var stuff))
yield return stuff;
}
finally // This runs when the caller exits the foreach loop
{
completionCTS.Cancel(); // Cancel the TransformBlock if it's still running
}
Task.WaitAll(feederTask, processor.Completion); // Propagate all exceptions
}
No ConcurrentBag is needed, since the TransformBlock has an internal output buffer. The tricky part is dealing with the case where the caller abandons the enumeration of the IEnumerable<Stuff> by breaking early, or by being obstructed by an exception. In this case you don't want the feeder task to keep pumping the IEnumerable<int> of ids to the end. Fortunately there is a solution: enclosing the yielding loop in a try/finally block allows a notification of this event to be received, so that the feeder task can be terminated in a timely manner.
An alternative implementation could remove the need for a feeder task by combining pumping the ids, feeding the block, and yielding stuff in a single loop. In this case you would want a lag between pumping and yielding. To achieve it, MoreLinq's Lag (or Lead) extension method could be handy.
Update: Here is a different implementation, that enumerates and yields in the same loop. To achieve the desired lagging, the source enumerable is right-padded with some dummy elements, equal in number with the degree of concurrency.
This implementation accepts generic types, instead of int and Stuff.
public static IEnumerable<TResult> Transform<TSource, TResult>(
IEnumerable<TSource> source, Func<TSource, Task<TResult>> taskFactory,
int degreeOfConcurrency)
{
var processor = new TransformBlock<TSource, TResult>(async item =>
{
return await taskFactory(item);
}, new ExecutionDataflowBlockOptions
{
MaxDegreeOfParallelism = degreeOfConcurrency
});
var paddedSource = source.Select(item => (item, true))
.Concat(Enumerable.Repeat((default(TSource), false), degreeOfConcurrency));
int index = -1;
bool completed = false;
foreach (var (item, hasValue) in paddedSource)
{
index++;
if (hasValue) { processor.Post(item); }
else if (!completed) { processor.Complete(); completed = true; }
if (index >= degreeOfConcurrency)
{
if (!processor.OutputAvailableAsync().Result) break; // Blocking call
if (!processor.TryReceive(out var result))
throw new InvalidOperationException(); // Should never happen
yield return result;
}
}
processor.Completion.Wait();
}
Usage example:
IEnumerable<Stuff> lotsOfStuff = Transform(ids, GetStuffById, 5);
Both implementations can be modified trivially to return an IAsyncEnumerable instead of IEnumerable, to avoid blocking the calling thread; a sketch follows.
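As a sketch of that variation, the first implementation could become an async iterator like the following (C# 8; the name TransformAsync and the BoundedCapacity choice are illustrative, and early-abandonment handling is omitted for brevity):
public static async IAsyncEnumerable<TResult> TransformAsync<TSource, TResult>(
    IEnumerable<TSource> source, Func<TSource, Task<TResult>> taskFactory,
    int degreeOfConcurrency)
{
    var processor = new TransformBlock<TSource, TResult>(taskFactory,
        new ExecutionDataflowBlockOptions
        {
            MaxDegreeOfParallelism = degreeOfConcurrency,
            BoundedCapacity = degreeOfConcurrency * 2 // avoid buffering the whole source
        });
    // Feed the block from a separate task, completing it when the source is drained
    var feeder = Task.Run(async () =>
    {
        try
        {
            foreach (var item in source)
                if (!await processor.SendAsync(item)) break;
        }
        finally { processor.Complete(); }
    });
    // Yield results as they become available, without blocking the calling thread
    while (await processor.OutputAvailableAsync())
        while (processor.TryReceive(out var result))
            yield return result;
    await Task.WhenAll(feeder, processor.Completion); // propagate exceptions
}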
There are probably a few different ways to handle this based on your specific use case, but to handle items as they come through in TPL Dataflow terms, you'd change your source block to a TransformBlock<,> and flow the items to another block that processes them. Note that you can now get rid of the collecting ConcurrentBag; be sure to set EnsureOrdered to false if you don't care what order you receive your items in. Also link the blocks and propagate completion to ensure your pipeline finishes once all items are retrieved and subsequently processed.
class Stuff
{
int Id { get; }
}
public class GetStuff
{
async Task<Stuff> GetStuffById(int id) => throw new NotImplementedException();
async Task GetLotsOfStuff(IEnumerable<int> ids)
{
//var bagOfStuff = new ConcurrentBag<Stuff>();
var options = new ExecutionDataflowBlockOptions
{
MaxDegreeOfParallelism = 5,
EnsureOrdered = false
};
var processor = new TransformBlock<int, Stuff>(id => GetStuffById(id), options);
var handler = new ActionBlock<Stuff>(s => throw new NotImplementedException());
processor.LinkTo(handler, new DataflowLinkOptions() { PropagateCompletion = true });
foreach (int id in ids)
{
processor.Post(id);
}
processor.Complete();
await handler.Completion;
}
}
Other options could be making your method an observable that streams out of the TransformBlock, or using IAsyncEnumerable to combine yield return with an async method.
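For the observable route, a minimal sketch could wrap the TransformBlock with DataflowBlock.AsObservable; the lambda-based Subscribe overload assumes the System.Reactive package, and HandleStuff is a hypothetical consumer:
var processor = new TransformBlock<int, Stuff>(id => GetStuffById(id),
    new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 5, EnsureOrdered = false });
// AsObservable exposes the block's output as a push-based stream
IObservable<Stuff> stream = processor.AsObservable();
using var subscription = stream.Subscribe(
    stuff => HandleStuff(stuff),
    ex => Console.WriteLine($"Failed: {ex}"),
    () => Console.WriteLine("All stuff received"));
foreach (var id in ids) processor.Post(id);
processor.Complete();
await processor.Completion;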
I'll describe my problem with a simple example first, and then with one closer to my real scenario.
Imagine we have n items [i1, i2, i3, i4, ..., in] in box1, and a box2 that can work on m items at a time (m is usually much less than n). The time required for each item varies. I want box2 to always be working on m items until all items have been processed.
Closer to my real problem: you have a list1 of n strings (URL addresses of files), and you want a system that downloads m files concurrently (for example via HttpClient.GetAsync()). Whenever one of the m downloads finishes, another remaining item from list1 must be substituted as soon as possible, and this must continue until all items of list1 have been processed.
(n and m are specified by user input at runtime.)
How can this be done?
Here is a generic method you can use.
When you call this, TIn will be string (the URL addresses) and asyncProcessor will be your async method that takes a URL address as input and returns a Task.
The SemaphoreSlim used by this method allows only a fixed number of concurrent async I/O requests in flight; as soon as one completes, the next request executes, like a sliding-window pattern.
public static Task ForEachAsync<TIn>(
IEnumerable<TIn> inputEnumerable,
Func<TIn, Task> asyncProcessor,
int? maxDegreeOfParallelism = null)
{
int maxAsyncThreadCount = maxDegreeOfParallelism ?? DefaultMaxDegreeOfParallelism; // DefaultMaxDegreeOfParallelism is a constant assumed to be defined elsewhere
SemaphoreSlim throttler = new SemaphoreSlim(maxAsyncThreadCount, maxAsyncThreadCount);
IEnumerable<Task> tasks = inputEnumerable.Select(async input =>
{
await throttler.WaitAsync().ConfigureAwait(false);
try
{
await asyncProcessor(input).ConfigureAwait(false);
}
finally
{
throttler.Release();
}
});
return Task.WhenAll(tasks);
}
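For the download scenario in the question, a call could look like this (DownloadAsync is a hypothetical wrapper around HttpClient):
var client = new HttpClient();
async Task DownloadAsync(string url)
{
    var response = await client.GetAsync(url);
    // do something with the response
}
// Download the URLs with at most m concurrent requests
await ForEachAsync(urls, DownloadAsync, maxDegreeOfParallelism: m);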
You should look into TPL Dataflow: add the System.Threading.Tasks.Dataflow NuGet package to your project, then what you want is as simple as
private static HttpClient _client = new HttpClient();
public async Task<List<MyClass>> ProcessDownloads(IEnumerable<string> uris,
int concurrentDownloads)
{
var result = new List<MyClass>();
var downloadData = new TransformBlock<string, string>(async uri =>
{
return await _client.GetStringAsync(uri); //GetStringAsync is a thread safe method.
}, new ExecutionDataflowBlockOptions{MaxDegreeOfParallelism = concurrentDownloads});
var processData = new TransformBlock<string, MyClass>(
json => JsonConvert.DeserializeObject<MyClass>(json),
new ExecutionDataflowBlockOptions {MaxDegreeOfParallelism = DataflowBlockOptions.Unbounded});
var collectData = new ActionBlock<MyClass>(
data => result.Add(data)); //When you don't specify options, dataflow processes items one at a time.
//Set up the chain of blocks, have it call `.Complete()` on the next block when the current block finishes processing its last item.
downloadData.LinkTo(processData, new DataflowLinkOptions {PropagateCompletion = true});
processData.LinkTo(collectData, new DataflowLinkOptions {PropagateCompletion = true});
//Load the data in to the first transform block to start off the process.
foreach (var uri in uris)
{
await downloadData.SendAsync(uri).ConfigureAwait(false);
}
downloadData.Complete(); //Signal you are done adding data.
//Wait for the last object to be added to the list.
await collectData.Completion.ConfigureAwait(false);
return result;
}
In the above code, only concurrentDownloads downloads will be in flight at any given time; an unbounded number of threads will process the received strings and turn them into objects, and a single thread will take those objects and add them to a list.
UPDATE: Here is a simplified example that does only what you asked for in the question:
private static HttpClient _client = new HttpClient();
public void ProcessDownloads(IEnumerable<string> uris, int concurrentDownloads)
{
var downloadData = new ActionBlock<string>(async uri =>
{
var response = await _client.GetAsync(uri); //GetAsync is a thread safe method.
//do something with response here.
}, new ExecutionDataflowBlockOptions{MaxDegreeOfParallelism = concurrentDownloads});
foreach (var uri in uris)
{
downloadData.Post(uri);
}
downloadData.Complete();
downloadData.Completion.Wait();
}
A simple solution for throttling is a SemaphoreSlim.
EDIT
After a slight alteration, the code now creates the tasks only when they are needed:
var client = new HttpClient();
SemaphoreSlim semaphore = new SemaphoreSlim(m, m); //set the max here
var tasks = new List<Task>();
foreach(var url in urls)
{
// moving the wait here throttles the foreach loop
await semaphore.WaitAsync();
tasks.Add(((Func<Task>)(async () =>
{
//await semaphore.WaitAsync();
try
{
var response = await client.GetAsync(url); // possibly ConfigureAwait(false) here
// do something with response
}
finally
{
semaphore.Release(); // release in a finally so a failed request doesn't leak a slot
}
}))());
}
await Task.WhenAll(tasks);
This is another way to do it
var client = new HttpClient();
var tasks = new HashSet<Task>();
foreach(var url in urls)
{
if(tasks.Count == m)
{
tasks.Remove(await Task.WhenAny(tasks));
}
tasks.Add(((Func<Task>)(async () =>
{
var response = await client.GetAsync(url); // possibly ConfigureAwait(false) here
// do something with response
}))());
}
await Task.WhenAll(tasks);
Process items in parallel, limiting the number of simultaneous jobs (note that Parallel.ForEach suits synchronous, CPU-bound work; it will not await async lambdas):
string[] strings = GetStrings(); // Items to process.
const int m = 2; // Max simultaneous jobs.
Parallel.ForEach(strings, new ParallelOptions {MaxDegreeOfParallelism = m}, s =>
{
DoWork(s);
});
In my application I want to join multiple strings with a dictionary of replacement values.
The readTemplateBlock gets fed with FileInfos and returns their contents as string.
The getReplacersBlock gets fed (once) with a single replacers dictionary.
The joinTemplateAndReplacersBlock should join each item from the readTemplateBlock with the single getReplacersBlock result.
In my current setup it requires me to post the same replacers dictionary again for each file I post.
// Build
var readTemplateBlock = new TransformBlock<FileInfo, string>(file => File.ReadAllText(file.FullName));
var getReplacersBlock = new WriteOnceBlock<IDictionary<string, string>>(null);
var joinTemplateAndReplacersBlock = new JoinBlock<string, IDictionary<string, string>>();
// Assemble
var propagateComplete = new DataflowLinkOptions {PropagateCompletion = true};
readTemplateBlock.LinkTo(joinTemplateAndReplacersBlock.Target1, propagateComplete);
getReplacersBlock.LinkTo(joinTemplateAndReplacersBlock.Target2, propagateComplete);
joinTemplateAndReplacersBlock.LinkTo(replaceTemplateBlock, propagateComplete);
// Post
foreach (var template in templateFilenames)
{
readTemplateBlock.Post(template);
}
readTemplateBlock.Complete();
getReplacersBlock.Post(replacers);
getReplacersBlock.Complete();
Is there a better block I'm missing? Maybe a configuration option I overlooked?
I couldn't figure out how to do this using the built-in dataflow blocks. The alternatives I can see:
Use a BufferBlock with a small BoundedCapacity along with a Task that keeps sending the value to it. How exactly the Task gets the value could vary, but if you like WriteOnceBlock, you can reuse and encapsulate it:
static IPropagatorBlock<T, T> CreateWriteOnceRepeaterBlock<T>()
{
var target = new WriteOnceBlock<T>(null);
var source = new BufferBlock<T>(new DataflowBlockOptions { BoundedCapacity = 1 });
Task.Run(
async () =>
{
var value = await target.ReceiveAsync();
while (true)
{
await source.SendAsync(value);
}
});
return DataflowBlock.Encapsulate(target, source);
}
You would then use CreateWriteOnceRepeaterBlock<IDictionary<string, string>>() instead of new WriteOnceBlock<IDictionary<string, string>>(null).
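Wiring it into your pipeline would then be a drop-in change (a sketch reusing your variable names; note the repeater never completes, so don't propagate completion from it):
var getReplacersBlock = CreateWriteOnceRepeaterBlock<IDictionary<string, string>>();
getReplacersBlock.LinkTo(joinTemplateAndReplacersBlock.Target2);
// Post the dictionary once; the repeater keeps re-sending it for every joined template.
getReplacersBlock.Post(replacers);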
Write a custom block similar to WriteOnceBlock that behaves exactly the way you want. Looking at how big the source of WriteOnceBlock is, this is probably not very appealing.
Use a TaskCompletionSource instead of dataflow blocks for this.
Assuming your current code looks something like this (using C# 7 and the System.ValueTuple package for brevity):
void ReplaceTemplateBlockAction(Tuple<string, IDictionary<string, string>> tuple)
{
var (template, replacers) = tuple;
…
}
…
var getReplacersBlock = new WriteOnceBlock<IDictionary<string, string>>(null);
var replaceTemplateBlock = new ActionBlock<Tuple<string, IDictionary<string, string>>>(
ReplaceTemplateBlockAction);
…
getReplacersBlock.Post(replacers);
You would instead use:
void ReplaceTemplateBlockAction(string template, IDictionary<string, string> replacers)
{
…
}
…
var getReplacersTcs = new TaskCompletionSource<IDictionary<string, string>>();
var replaceTemplateBlock = new ActionBlock<string>(
async template => ReplaceTemplateBlockAction(template, await getReplacersTcs.Task));
…
getReplacersTcs.SetResult(replacers);