Report when input to first dataflow block finishes all linked blocks - c#

I am using TPL Dataflow to download data from a ticketing system.
The system takes the ticket number as the input, calls an API and receives a nested JSON response with various information. Once received, a set of blocks handles each level of the nested structure and writes it to a relational database. e.g. Conversation, Conversation Attachments, Users, User Photos, User Tags, etc
Json
{
"conversations":[
{
"id":12345,
"author_id":23456,
"body":"First Conversation"
},
{
"id":98765,
"authorid":34567,
"body":"Second Conversation",
"attachments":[
{
"attachmentid":12345
"attachment_name":"Test.jpg"
}
}
],
"users":[
{
"userid":12345
"user_name":"John Smith"
},
{
"userid":34567
"user_name":"Joe Bloggs"
"user_photo":
{
"photoid":44556,
"photo_name":"headshot.jpg"
},
tags:[
"development",
"vip"
]
}
]
Code
Some blocks need to broadcast so that deeper nesting can still have access to the data. e.g. UserModelJson is broadcast so that 1 block can handle writing the user, 1 block can handle writing the User Tags and 1 block can handle writing the User Photos.
var loadTicketsBlock = new TransformBlock<int, ConversationsModelJson>(async ticketNumber => await p.GetConversationObjectFromTicket(ticketNumber));
var broadcastConversationsObjectBlock = new BroadcastBlock<ConversationsModelJson>(conversations => conversations);
//Conversation
var getConversationsFromConversationObjectBlock = new TransformManyBlock<ConversationsModelJson, ConversationModelJson>(conversation => ModelConverter.ConvertConversationsObjectJsonToConversationJson(conversation));
var convertConversationsBlock = new TransformBlock<ConversationModelJson, ConversationModel>(conversation => ModelConverter.ConvertConversationJsonToConversation(conversation));
var batchConversionBlock = new BatchBlock<ConversationModel>(batchBlockSize);
var convertConversationsToDTBlock = new TransformBlock<IEnumerable<ConversationModel>, DataTable>(conversations => ModelConverter.ConvertConversationModelToConversationDT(conversations));
var writeConversationsBlock = new ActionBlock<DataTable>(async conversations => await p.ProcessConversationsAsync(conversations));
var getUsersFromConversationsBlock = new TransformManyBlock<ConversationsModelJson, UserModelJson>(conversations => ModelConverter.ConvertConversationsJsonToUsersJson(conversations));
var broadcastUserBlock = new BroadcastBlock<UserModelJson>(userModelJson => userModelJson);
//User
var convertUsersBlock = new TransformBlock<UserModelJson, UserModel>(user => ModelConverter.ConvertUserJsonToUser(user));
var batchUsersBlock = new BatchBlock<UserModel>(batchBlockSize);
var convertUsersToDTBlock = new TransformBlock<IEnumerable<UserModel>, DataTable>(users => ModelConverter.ConvertUserModelToUserDT(users));
var writeUsersBlock = new ActionBlock<DataTable>(async users => await p.ProcessUsersAsync(users));
//UserTag
var getUserTagsFromUserBlock = new TransformBlock<UserModelJson, UserTagModel>(user => ModelConverter.ConvertUserJsonToUserTag(user));
var batchTagsBlock = new BatchBlock<UserTagModel>(batchBlockSize);
var convertTagsToDTBlock = new TransformBlock<IEnumerable<UserTagModel>, DataTable>(tags => ModelConverter.ConvertUserTagModelToUserTagDT(tags));
var writeTagsBlock = new ActionBlock<DataTable>(async tags => await p.ProcessUserTagsAsync(tags));
DataflowLinkOptions linkOptions = new DataflowLinkOptions()
{
PropagateCompletion = true
};
loadTicketsBlock.LinkTo(broadcastConversationsObjectBlock, linkOptions);
//Conversation
broadcastConversationsObjectBlock.LinkTo(getConversationsFromConversationObjectBlock, linkOptions);
getConversationsFromConversationObjectBlock.LinkTo(convertConversationsBlock, linkOptions);
convertConversationsBlock.LinkTo(batchConversionBlock, linkOptions);
batchConversionBlock.LinkTo(convertConversationsToDTBlock, linkOptions);
convertConversationsToDTBlock.LinkTo(writeConversationsBlock, linkOptions);
var tickets = await provider.GetAllTicketsAsync();
foreach (var ticket in tickets)
{
cts.Token.ThrowIfCancellationRequested();
await loadTicketsBlock.SendAsync(ticket.TicketID);
}
loadTicketsBlock.Complete();
The LinkTo blocks are repeated for each type of data to be written.
I know when the whole pipeline is complete by using
await Task.WhenAll(<Last block of each branch>.Completion);
but if I pass in ticket number 1 into the loadTicketsBlock block then how do I know when that specific ticket has been through all blocks in the pipeline and therefore is complete?
The reason that I want to know this is so that I can report to the UI that ticket 1 of 100 is complete.

You could consider using the TaskCompletionSource as the base class for all your sub-entities. For example:
class Attachment : TaskCompletionSource
{
}
class Conversation : TaskCompletionSource
{
}
Then every time you insert a sub-entity in the database, you mark it as completed:
attachment.SetResult();
...or if the insert fails, mark it as faulted:
attachment.SetException(ex);
Finally you can combine all the asynchronous completions in one, with the method Task.WhenAll:
Task ticketCompletion = Task.WhenAll(Enumerable.Empty<Task>()
.Append(ticket.Task)
.Concat(attachments.Select(e => e.Task))
.Concat(conversations.Select(e => e.Task)));

If I am tracking progress in Dataflow, usually I will set up the last block as a notify the UI of progress type block. To be able to track the progress of your inputs, you will need to keep the context of the original input in all the objects you are passing around, so in this case you need to be able to tell that you are working on ticket 1 all the way through your pipeline, and if one of your transforms removes the context that it is working on ticket 1, then you will need to rethink the object types that you are passing through your pipeline so you can keep that context.
A simple example of what I'm talking about is laid out below with a broadcast block going to three transform blocks, and then all three transform blocks going back to an action block that notifies about the progress of the pipelines.
When combining back into the single action block you need to make sure not to propagate completion at that point because as soon as one block propagates completion to the action block, that action block will stop accepting input, so you will still wait for the last block of each pipeline to complete, and then after that manually complete your final notify the UI action block.
using System;
using System.Threading.Tasks.Dataflow;
using System.Threading.Tasks;
using System.Collections.Generic;
public class Program
{
public static void Main()
{
var broadcastBlock = new BroadcastBlock<string>(x => x);
var transformBlockA = new TransformBlock<string, string>(x =>
{
return x + "A";
});
var transformBlockB = new TransformBlock<string, string>(x =>
{
return x + "B";
});
var transformBlockC = new TransformBlock<string, string>(x =>
{
return x + "C";
});
var ticketTracking = new Dictionary<int, List<string>>();
var notifyUiBlock = new ActionBlock<string>(x =>
{
var ticketNumber = int.Parse(x.Substring(5,1));
var taskLetter = x.Substring(7,1);
var success = ticketTracking.TryGetValue(ticketNumber, out var tasksComplete);
if (!success)
{
tasksComplete = new List<string>();
ticketTracking[ticketNumber] = tasksComplete;
}
tasksComplete.Add(taskLetter);
if (tasksComplete.Count == 3)
{
Console.WriteLine($"Ticket {ticketNumber} is complete");
}
});
DataflowLinkOptions linkOptions = new DataflowLinkOptions() {PropagateCompletion = true};
broadcastBlock.LinkTo(transformBlockA, linkOptions);
broadcastBlock.LinkTo(transformBlockB, linkOptions);
broadcastBlock.LinkTo(transformBlockC, linkOptions);
transformBlockA.LinkTo(notifyUiBlock);
transformBlockB.LinkTo(notifyUiBlock);
transformBlockC.LinkTo(notifyUiBlock);
for(var i = 0; i < 5; i++)
{
broadcastBlock.Post($"Task {i} ");
}
broadcastBlock.Complete();
Task.WhenAll(transformBlockA.Completion, transformBlockB.Completion, transformBlockC.Completion).Wait();
notifyUiBlock.Complete();
notifyUiBlock.Completion.Wait();
Console.WriteLine("Done");
}
}
This will give an output similar to this
Ticket 0 is complete
Ticket 1 is complete
Ticket 2 is complete
Ticket 3 is complete
Ticket 4 is complete
Done

Related

Deduplicate stream records

I'm using Redis to stream data. I have multiple producer instances producing the same data, aiming event consistency.
Right now the producers generate trades with random trade ids between 1 and 2. I want a deduplication service or something which based on trade id to distrinct the duplicates. How do I do that?
Consumer
using System.Text.Json;
using Shared;
using StackExchange.Redis;
var tokenSource = new CancellationTokenSource();
var token = tokenSource.Token;
var muxer = ConnectionMultiplexer.Connect("localhost:6379");
var db = muxer.GetDatabase();
const string streamName = "positions";
const string groupName = "avg";
if (!await db.KeyExistsAsync(streamName) ||
(await db.StreamGroupInfoAsync(streamName)).All(x => x.Name != groupName))
{
await db.StreamCreateConsumerGroupAsync(streamName, groupName, "0-0");
}
var consumerGroupReadTask = Task.Run(async () =>
{
var id = string.Empty;
while (!token.IsCancellationRequested)
{
if (!string.IsNullOrEmpty(id))
{
await db.StreamAcknowledgeAsync(streamName, groupName, id);
id = string.Empty;
}
var result = await db.StreamReadGroupAsync(streamName, groupName, "avg-1", ">", 1);
if (result.Any())
{
id = result.First().Id;
var dict = ParseResult(result.First());
var trade = JsonSerializer.Deserialize<Trade>(dict["trade"]);
Console.WriteLine($"Group read result: trade: {dict["trade"]}, time: {dict["time"]}");
}
await Task.Delay(1000);
}
});
Console.ReadLine();
static Dictionary<string, string> ParseResult(StreamEntry entry)
{
return entry.Values.ToDictionary(x => x.Name.ToString(), x => x.Value.ToString());
}
Producer
using System.Text.Json;
using Shared;
using StackExchange.Redis;
var tokenSource = new CancellationTokenSource();
var token = tokenSource.Token;
var muxer = ConnectionMultiplexer.Connect("localhost:6379");
var db = muxer.GetDatabase();
const string streamName = "positions";
var producerTask = Task.Run(async () =>
{
var random = new Random();
while (!token.IsCancellationRequested)
{
var trade = new Trade(random.Next(1, 3), "btcusdt", 25000, 2);
var entry = new List<NameValueEntry>
{
new("trade", JsonSerializer.Serialize(trade)),
new("time", DateTimeOffset.Now.ToUnixTimeSeconds())
};
await db.StreamAddAsync(streamName, entry.ToArray());
await Task.Delay(2000);
}
});
Console.ReadLine();
You can use a couple tactics here, depending on the level of distribution required and the degree to which you can handle missing messages incoming from your stream. Here are a couple workable solutions using Redis:
Use Bloom Filters When you can tolerate a 1% miss in events
You can use a BloomFilter in Redis, which will be a very compact, very fast way to determine if a particular record has not been recorded yet. If you run:
var hasBeenAdded = ((int)await db.ExecuteAsync("BF.ADD", "bf:trades",dict["trade"])) == 1;
If hasBeenAdded is true, you can definitively say that the record is not a duplicate, if it is false, there's about a probability depending on how you set up the bloom filter with BF.RESERVE
If you want to use a Bloom Filter, you'll need to either side-load RedisBloom into your instance of Redis, or you can just use Redis Stack
Use a Sorted Set when misses aren't acceptable
If your app cannot tolerate a miss, you are probably wiser to use a Set or a Sorted Set, in general I'd advise you to use a set because they are much easier to clean up.
Basically if you are using a sorted set, you would check to see if a record has already been recorded in your average by using a ZSCORE zset:trades trade-id, if a score comes back you know that the records been used already, otherwise you can add it to the sorted set. Importantly, because your sorted set grows linearly you are going to want to clean it up periodically, so if you set the timestamp from the message id to the score, you likely can determine some workable interval to go back in and do a ZREMRANGEBYSCORE to clear out older records.

ReceiveAsync interrupting/breaking message passing

This problem raised its head while trying to implement the suggested solution to this problem.
Problem summary
Performing a ReceiveAsync() call from a TransformBlock to a WriteOnceBlock is causing the TransformBlock to essentially remove itself from the flow. It ceases to propagate any kind of message, whether it's data or a completion signal.
System design
The system is intended for parsing large CSV files through a series of steps.
The problematic part of the flow can be (inexpertly) visualized as follows:
The parallelogram is a BufferBlock, diamonds are BroadcastBlocks, triangles are WriteOnceBlocks and arrows are TransformBlocks. Solid lines denote a link created with LinkTo(), and the dotted line represents a ReceiveAsync() call from the ParsedHeaderAndRecordJoiner to the ParsedHeaderContainer block. I'm aware that this flow is somewhat suboptimal but that's not the primary reason for the question.
Code
Application root
Here is a part of the class that creates the necessary blocks and links them together using PropagateCompletion
using (var cancellationSource = new CancellationTokenSource())
{
var cancellationToken = cancellationSource.Token;
var temporaryEntityInstance = new Card(); // Just as an example
var producerQueue = queueFactory.CreateQueue<string>(new DataflowBlockOptions{CancellationToken = cancellationToken});
var recordDistributor = distributorFactory.CreateDistributor<string>(s => (string)s.Clone(),
new DataflowBlockOptions { CancellationToken = cancellationToken });
var headerRowContainer = containerFactory.CreateContainer<string>(s => (string)s.Clone(),
new DataflowBlockOptions { CancellationToken = cancellationToken });
var headerRowParser = new HeaderRowParserFactory().CreateHeaderRowParser(temporaryEntityInstance.GetType(), ';',
new ExecutionDataflowBlockOptions { CancellationToken = cancellationToken });
var parsedHeaderContainer = containerFactory.CreateContainer<HeaderParsingResult>(HeaderParsingResult.Clone,
new DataflowBlockOptions { CancellationToken = cancellationToken});
var parsedHeaderAndRecordJoiner = new ParsedHeaderAndRecordJoinerFactory().CreateParsedHeaderAndRecordJoiner(parsedHeaderContainer,
new ExecutionDataflowBlockOptions { CancellationToken = cancellationToken });
var entityParser = new entityParserFactory().CreateEntityParser(temporaryEntityInstance.GetType(), ';',
dataflowBlockOptions: new ExecutionDataflowBlockOptions { CancellationToken = cancellationToken });
var entityDistributor = distributorFactory.CreateDistributor<EntityParsingResult>(EntityParsingResult.Clone,
new DataflowBlockOptions{CancellationToken = cancellationToken});
var linkOptions = new DataflowLinkOptions {PropagateCompletion = true};
// Producer subprocess
producerQueue.LinkTo(recordDistributor, linkOptions);
// Header subprocess
recordDistributor.LinkTo(headerRowContainer, linkOptions);
headerRowContainer.LinkTo(headerRowParser, linkOptions);
headerRowParser.LinkTo(parsedHeaderContainer, linkOptions);
parsedHeaderContainer.LinkTo(errorQueue, new DataflowLinkOptions{MaxMessages = 1, PropagateCompletion = true}, dataflowResult => !dataflowResult.WasSuccessful);
// Parsing subprocess
recordDistributor.LinkTo(parsedHeaderAndRecordJoiner, linkOptions);
parsedHeaderAndRecordJoiner.LinkTo(entityParser, linkOptions, joiningResult => joiningResult.WasSuccessful);
entityParser.LinkTo(entityDistributor, linkOptions);
entityDistributor.LinkTo(errorQueue, linkOptions, dataflowResult => !dataflowResult.WasSuccessful);
}
HeaderRowParser
This block parses a header row from a CSV file and does some validation.
public class HeaderRowParserFactory
{
public TransformBlock<string, HeaderParsingResult> CreateHeaderRowParser(Type entityType,
char delimiter,
ExecutionDataflowBlockOptions dataflowBlockOptions = null)
{
return new TransformBlock<string, HeaderParsingResult>(headerRow =>
{
// Set up some containers
var result = new HeaderParsingResult(identifier: "N/A", wasSuccessful: true);
var fieldIndexesByPropertyName = new Dictionary<string, int>();
// Get all serializable properties on the chosen entity type
var serializableProperties = entityType.GetProperties()
.Where(prop => prop.IsDefined(typeof(CsvFieldNameAttribute), false))
.ToList();
// Add their CSV fieldnames to the result
var entityFieldNames = serializableProperties.Select(prop => prop.GetCustomAttribute<CsvFieldNameAttribute>().FieldName);
result.SetEntityFieldNames(entityFieldNames);
// Create the dictionary of properties by field name
var serializablePropertiesByFieldName = serializableProperties.ToDictionary(prop => prop.GetCustomAttribute<CsvFieldNameAttribute>().FieldName, prop => prop, StringComparer.OrdinalIgnoreCase);
var fields = headerRow.Split(delimiter);
for (var i = 0; i < fields.Length; i++)
{
// If any field in the CSV is unknown as a serializable property, we return a failed result
if (!serializablePropertiesByFieldName.TryGetValue(fields[i], out var foundProperty))
{
result.Invalidate($"The header row contains a field that does not match any of the serializable properties - {fields[i]}.",
DataflowErrorSeverity.Critical);
return result;
}
// Perform a bunch more validation
fieldIndexesByPropertyName.Add(foundProperty.Name, i);
}
result.SetFieldIndexesByName(fieldIndexesByPropertyName);
return result;
}, dataflowBlockOptions ?? new ExecutionDataflowBlockOptions());
}
}
ParsedHeaderAndRecordJoiner
For each subsequent record that comes through the pipe, this block is intended to retrieve the parsed header data and add it to the record.
public class ParsedHeaderAndRecordJoinerFactory
{
public TransformBlock<string, HeaderAndRecordJoiningResult> CreateParsedHeaderAndRecordJoiner(WriteOnceBlock<HeaderParsingResult> parsedHeaderContainer,
ExecutionDataflowBlockOptions dataflowBlockOptions = null)
{
return new TransformBlock<string, HeaderAndRecordJoiningResult>(async csvRecord =>
{
var headerParsingResult = await parsedHeaderContainer.ReceiveAsync();
// If the header couldn't be parsed, a critical error is already on its way to the failure logger so we don't need to continue
if (!headerParsingResult.WasSuccessful) return new HeaderAndRecordJoiningResult(identifier: "N.A.", wasSuccessful: false, null, null);
// The entity parser can't do anything with the header record, so we send a message with wasSuccessful false
var isHeaderRecord = true;
foreach (var entityFieldName in headerParsingResult.EntityFieldNames)
{
isHeaderRecord &= csvRecord.Contains(entityFieldName);
}
if (isHeaderRecord) return new HeaderAndRecordJoiningResult(identifier: "N.A.", wasSuccessful: false, null, null);
return new HeaderAndRecordJoiningResult(identifier: "N.A.", wasSuccessful: true, headerParsingResult, csvRecord);
}, dataflowBlockOptions ?? new ExecutionDataflowBlockOptions());
}
}
Problem detail
With the current implementation, the ParsedHeaderAndRecordJoiner receives the data correctly from the ReceiveAsync() call to ParsedHeaderContainer and returns as expected, however no message arrives at the EntityParser.
Also, when a Complete signal is sent to the front of the flow (the ProducerQueue), it propagates to the RecordDistributor but then stops at the ParsedHeaderAndRecordJoiner (it does continue from the HeaderRowContainer onwards, so the RecordDistributor is passing it on).
If I remove the ReceiveAsync() call and mock the data myself, the block behaves as expected.
I think this part is key
however no message arrives at the EntityParser.
Based on the sample the only way EntityParser doesn't receive a message output by the ParsedHeaderAndRecordJoiner is when WasSuccessful returns false. The predicate used in your link excludes failed messages but those messages have no where to go, so they build up in the ParsedHeaderAndRecordJoiner output buffer and would also prevent Completion from propagating. You'd need to link a null target to dump the failed messages.
parsedHeaderAndRecordJoiner.LinkTo(DataflowBlock.NullTarget<HeaderParsingResult>());
Further if your mock data always is coming back with WasSuccessful true, then that might be pointing you to the await ...ReceiveAsync()
Not necessarily the smoking gun but a good place to start. Can you confirm the state of all messages in the output buffer of ParsedHeaderAndRecordJoiner when the pipeline gets stuck.

C# .NET Parallel I/O operation (with throttling) [duplicate]

I would like to run a bunch of async tasks, with a limit on how many tasks may be pending completion at any given time.
Say you have 1000 URLs, and you only want to have 50 requests open at a time; but as soon as one request completes, you open up a connection to the next URL in the list. That way, there are always exactly 50 connections open at a time, until the URL list is exhausted.
I also want to utilize a given number of threads if possible.
I came up with an extension method, ThrottleTasksAsync that does what I want. Is there a simpler solution already out there? I would assume that this is a common scenario.
Usage:
class Program
{
static void Main(string[] args)
{
Enumerable.Range(1, 10).ThrottleTasksAsync(5, 2, async i => { Console.WriteLine(i); return i; }).Wait();
Console.WriteLine("Press a key to exit...");
Console.ReadKey(true);
}
}
Here is the code:
static class IEnumerableExtensions
{
public static async Task<Result_T[]> ThrottleTasksAsync<Enumerable_T, Result_T>(this IEnumerable<Enumerable_T> enumerable, int maxConcurrentTasks, int maxDegreeOfParallelism, Func<Enumerable_T, Task<Result_T>> taskToRun)
{
var blockingQueue = new BlockingCollection<Enumerable_T>(new ConcurrentBag<Enumerable_T>());
var semaphore = new SemaphoreSlim(maxConcurrentTasks);
// Run the throttler on a separate thread.
var t = Task.Run(() =>
{
foreach (var item in enumerable)
{
// Wait for the semaphore
semaphore.Wait();
blockingQueue.Add(item);
}
blockingQueue.CompleteAdding();
});
var taskList = new List<Task<Result_T>>();
Parallel.ForEach(IterateUntilTrue(() => blockingQueue.IsCompleted), new ParallelOptions { MaxDegreeOfParallelism = maxDegreeOfParallelism },
_ =>
{
Enumerable_T item;
if (blockingQueue.TryTake(out item, 100))
{
taskList.Add(
// Run the task
taskToRun(item)
.ContinueWith(tsk =>
{
// For effect
Thread.Sleep(2000);
// Release the semaphore
semaphore.Release();
return tsk.Result;
}
)
);
}
});
// Await all the tasks.
return await Task.WhenAll(taskList);
}
static IEnumerable<bool> IterateUntilTrue(Func<bool> condition)
{
while (!condition()) yield return true;
}
}
The method utilizes BlockingCollection and SemaphoreSlim to make it work. The throttler is run on one thread, and all the async tasks are run on the other thread. To achieve parallelism, I added a maxDegreeOfParallelism parameter that's passed to a Parallel.ForEach loop re-purposed as a while loop.
The old version was:
foreach (var master = ...)
{
var details = ...;
Parallel.ForEach(details, detail => {
// Process each detail record here
}, new ParallelOptions { MaxDegreeOfParallelism = 15 });
// Perform the final batch updates here
}
But, the thread pool gets exhausted fast, and you can't do async/await.
Bonus:
To get around the problem in BlockingCollection where an exception is thrown in Take() when CompleteAdding() is called, I'm using the TryTake overload with a timeout. If I didn't use the timeout in TryTake, it would defeat the purpose of using a BlockingCollection since TryTake won't block. Is there a better way? Ideally, there would be a TakeAsync method.
As suggested, use TPL Dataflow.
A TransformBlock<TInput, TOutput> may be what you're looking for.
You define a MaxDegreeOfParallelism to limit how many strings can be transformed (i.e., how many urls can be downloaded) in parallel. You then post urls to the block, and when you're done you tell the block you're done adding items and you fetch the responses.
var downloader = new TransformBlock<string, HttpResponse>(
url => Download(url),
new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 50 }
);
var buffer = new BufferBlock<HttpResponse>();
downloader.LinkTo(buffer);
foreach(var url in urls)
downloader.Post(url);
//or await downloader.SendAsync(url);
downloader.Complete();
await downloader.Completion;
IList<HttpResponse> responses;
if (buffer.TryReceiveAll(out responses))
{
//process responses
}
Note: The TransformBlock buffers both its input and output. Why, then, do we need to link it to a BufferBlock?
Because the TransformBlock won't complete until all items (HttpResponse) have been consumed, and await downloader.Completion would hang. Instead, we let the downloader forward all its output to a dedicated buffer block - then we wait for the downloader to complete, and inspect the buffer block.
Say you have 1000 URLs, and you only want to have 50 requests open at
a time; but as soon as one request completes, you open up a connection
to the next URL in the list. That way, there are always exactly 50
connections open at a time, until the URL list is exhausted.
The following simple solution has surfaced many times here on SO. It doesn't use blocking code and doesn't create threads explicitly, so it scales very well:
const int MAX_DOWNLOADS = 50;
static async Task DownloadAsync(string[] urls)
{
using (var semaphore = new SemaphoreSlim(MAX_DOWNLOADS))
using (var httpClient = new HttpClient())
{
var tasks = urls.Select(async url =>
{
await semaphore.WaitAsync();
try
{
var data = await httpClient.GetStringAsync(url);
Console.WriteLine(data);
}
finally
{
semaphore.Release();
}
});
await Task.WhenAll(tasks);
}
}
The thing is, the processing of the downloaded data should be done on a different pipeline, with a different level of parallelism, especially if it's a CPU-bound processing.
E.g., you'd probably want to have 4 threads concurrently doing the data processing (the number of CPU cores), and up to 50 pending requests for more data (which do not use threads at all). AFAICT, this is not what your code is currently doing.
That's where TPL Dataflow or Rx may come in handy as a preferred solution. Yet it is certainly possible to implement something like this with plain TPL. Note, the only blocking code here is the one doing the actual data processing inside Task.Run:
const int MAX_DOWNLOADS = 50;
const int MAX_PROCESSORS = 4;
// process data
class Processing
{
SemaphoreSlim _semaphore = new SemaphoreSlim(MAX_PROCESSORS);
HashSet<Task> _pending = new HashSet<Task>();
object _lock = new Object();
async Task ProcessAsync(string data)
{
await _semaphore.WaitAsync();
try
{
await Task.Run(() =>
{
// simuate work
Thread.Sleep(1000);
Console.WriteLine(data);
});
}
finally
{
_semaphore.Release();
}
}
public async void QueueItemAsync(string data)
{
var task = ProcessAsync(data);
lock (_lock)
_pending.Add(task);
try
{
await task;
}
catch
{
if (!task.IsCanceled && !task.IsFaulted)
throw; // not the task's exception, rethrow
// don't remove faulted/cancelled tasks from the list
return;
}
// remove successfully completed tasks from the list
lock (_lock)
_pending.Remove(task);
}
public async Task WaitForCompleteAsync()
{
Task[] tasks;
lock (_lock)
tasks = _pending.ToArray();
await Task.WhenAll(tasks);
}
}
// download data
static async Task DownloadAsync(string[] urls)
{
var processing = new Processing();
using (var semaphore = new SemaphoreSlim(MAX_DOWNLOADS))
using (var httpClient = new HttpClient())
{
var tasks = urls.Select(async (url) =>
{
await semaphore.WaitAsync();
try
{
var data = await httpClient.GetStringAsync(url);
// put the result on the processing pipeline
processing.QueueItemAsync(data);
}
finally
{
semaphore.Release();
}
});
await Task.WhenAll(tasks.ToArray());
await processing.WaitForCompleteAsync();
}
}
As requested, here's the code I ended up going with.
The work is set up in a master-detail configuration, and each master is processed as a batch. Each unit of work is queued up in this fashion:
var success = true;
// Start processing all the master records.
Master master;
while (null != (master = await StoredProcedures.ClaimRecordsAsync(...)))
{
await masterBuffer.SendAsync(master);
}
// Finished sending master records
masterBuffer.Complete();
// Now, wait for all the batches to complete.
await batchAction.Completion;
return success;
Masters are buffered one at a time to save work for other outside processes. The details for each master are dispatched for work via the masterTransform TransformManyBlock. A BatchedJoinBlock is also created to collect the details in one batch.
The actual work is done in the detailTransform TransformBlock, asynchronously, 150 at a time. BoundedCapacity is set to 300 to ensure that too many Masters don't get buffered at the beginning of the chain, while also leaving room for enough detail records to be queued to allow 150 records to be processed at one time. The block outputs an object to its targets, because it's filtered across the links depending on whether it's a Detail or Exception.
The batchAction ActionBlock collects the output from all the batches, and performs bulk database updates, error logging, etc. for each batch.
There will be several BatchedJoinBlocks, one for each master. Since each ISourceBlock is output sequentially and each batch only accepts the number of detail records associated with one master, the batches will be processed in order. Each block only outputs one group, and is unlinked on completion. Only the last batch block propagates its completion to the final ActionBlock.
The dataflow network:
// The dataflow network
BufferBlock<Master> masterBuffer = null;
TransformManyBlock<Master, Detail> masterTransform = null;
TransformBlock<Detail, object> detailTransform = null;
ActionBlock<Tuple<IList<object>, IList<object>>> batchAction = null;
// Buffer master records to enable efficient throttling.
masterBuffer = new BufferBlock<Master>(new DataflowBlockOptions { BoundedCapacity = 1 });
// Sequentially transform master records into a stream of detail records.
masterTransform = new TransformManyBlock<Master, Detail>(async masterRecord =>
{
var records = await StoredProcedures.GetObjectsAsync(masterRecord);
// Filter the master records based on some criteria here
var filteredRecords = records;
// Only propagate completion to the last batch
var propagateCompletion = masterBuffer.Completion.IsCompleted && masterTransform.InputCount == 0;
// Create a batch join block to encapsulate the results of the master record.
var batchjoinblock = new BatchedJoinBlock<object, object>(records.Count(), new GroupingDataflowBlockOptions { MaxNumberOfGroups = 1 });
// Add the batch block to the detail transform pipeline's link queue, and link the batch block to the the batch action block.
var detailLink1 = detailTransform.LinkTo(batchjoinblock.Target1, detailResult => detailResult is Detail);
var detailLink2 = detailTransform.LinkTo(batchjoinblock.Target2, detailResult => detailResult is Exception);
var batchLink = batchjoinblock.LinkTo(batchAction, new DataflowLinkOptions { PropagateCompletion = propagateCompletion });
// Unlink batchjoinblock upon completion.
// (the returned task does not need to be awaited, despite the warning.)
batchjoinblock.Completion.ContinueWith(task =>
{
detailLink1.Dispose();
detailLink2.Dispose();
batchLink.Dispose();
});
return filteredRecords;
}, new ExecutionDataflowBlockOptions { BoundedCapacity = 1 });
// Process each detail record asynchronously, 150 at a time.
detailTransform = new TransformBlock<Detail, object>(async detail => {
try
{
// Perform the action for each detail here asynchronously
await DoSomethingAsync();
return detail;
}
catch (Exception e)
{
success = false;
return e;
}
}, new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 150, BoundedCapacity = 300 });
// Perform the proper action for each batch
batchAction = new ActionBlock<Tuple<IList<object>, IList<object>>>(async batch =>
{
var details = batch.Item1.Cast<Detail>();
var errors = batch.Item2.Cast<Exception>();
// Do something with the batch here
}, new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 4 });
masterBuffer.LinkTo(masterTransform, new DataflowLinkOptions { PropagateCompletion = true });
masterTransform.LinkTo(detailTransform, new DataflowLinkOptions { PropagateCompletion = true });

Parallel execution of tasks in groups

I am describing my problem in a simple example and then describing a more close problem.
Imagine We Have n items [i1,i2,i3,i4,...,in] in the box1 and we have a box2 that can handle m items to do them (m is usually much less than n) . The time required for each item is different. I want to always have doing m job items until all items are proceeded.
A much more close problem is that for example you have a list1 of n strings (URL addresses) of files and we want to have a system to have m files downloading concurrently (for example via httpclient.getAsync() method). Whenever downloading of one of m items finishes, another remaining item from list1 must be substituted as soon as possible and this must be countinued until all of List1 items proceeded.
(number of n and m are specified by users input at runtime)
How this can be done?
Here is a generic method you can use.
when you call this TIn will be string (URL addresses) and the asyncProcessor will be your async method that takes the URL address as input and returns a Task.
The SlimSemaphore used by this method is going to allow only n number of concurrent async I/O requests in real time, as soon as one completes the other request will execute. Something like a sliding window pattern.
public static Task ForEachAsync<TIn>(
IEnumerable<TIn> inputEnumerable,
Func<TIn, Task> asyncProcessor,
int? maxDegreeOfParallelism = null)
{
int maxAsyncThreadCount = maxDegreeOfParallelism ?? DefaultMaxDegreeOfParallelism;
SemaphoreSlim throttler = new SemaphoreSlim(maxAsyncThreadCount, maxAsyncThreadCount);
IEnumerable<Task> tasks = inputEnumerable.Select(async input =>
{
await throttler.WaitAsync().ConfigureAwait(false);
try
{
await asyncProcessor(input).ConfigureAwait(false);
}
finally
{
throttler.Release();
}
});
return Task.WhenAll(tasks);
}
You should look in to TPL Dataflow, add the System.Threading.Tasks.Dataflow NuGet package to your project then what you want is as simple as
private static HttpClient _client = new HttpClient();
public async Task<List<MyClass>> ProcessDownloads(IEnumerable<string> uris,
int concurrentDownloads)
{
var result = new List<MyClass>();
var downloadData = new TransformBlock<string, string>(async uri =>
{
return await _client.GetStringAsync(uri); //GetStringAsync is a thread safe method.
}, new ExecutionDataflowBlockOptions{MaxDegreeOfParallelism = concurrentDownloads});
var processData = new TransformBlock<string, MyClass>(
json => JsonConvert.DeserializeObject<MyClass>(json),
new ExecutionDataflowBlockOptions {MaxDegreeOfParallelism = DataflowBlockOptions.Unbounded});
var collectData = new ActionBlock<MyClass>(
data => result.Add(data)); //When you don't specifiy options dataflow processes items one at a time.
//Set up the chain of blocks, have it call `.Complete()` on the next block when the current block finishes processing it's last item.
downloadData.LinkTo(processData, new DataflowLinkOptions {PropagateCompletion = true});
processData.LinkTo(collectData, new DataflowLinkOptions {PropagateCompletion = true});
//Load the data in to the first transform block to start off the process.
foreach (var uri in uris)
{
await downloadData.SendAsync(uri).ConfigureAwait(false);
}
downloadData.Complete(); //Signal you are done adding data.
//Wait for the last object to be added to the list.
await collectData.Completion.ConfigureAwait(false);
return result;
}
In the above code only concurrentDownloads number of HttpClients will be active at any given time, unlimited threads will be processing the received strings and turning them in to objects, and a single thread will be taking those objects and adding them to a list.
UPDATE: here is a simplified example that only does what you asked for in the question
private static HttpClient _client = new HttpClient();
public void ProcessDownloads(IEnumerable<string> uris, int concurrentDownloads)
{
var downloadData = new ActionBlock<string>(async uri =>
{
var response = await _client.GetAsync(uri); //GetAsync is a thread safe method.
//do something with response here.
}, new ExecutionDataflowBlockOptions{MaxDegreeOfParallelism = concurrentDownloads});
foreach (var uri in uris)
{
downloadData.Post(uri);
}
downloadData.Complete();
downloadData.Completion.Wait();
}
A simple solution for throttling is a SemaphoreSlim.
EDIT
After a slight alteration the code now creates the tasks when they are needed
var client = new HttpClient();
SemaphoreSlim semaphore = new SemaphoreSlim(m, m); //set the max here
var tasks = new List<Task>();
foreach(var url in urls)
{
// moving the wait here throttles the foreach loop
await semaphore.WaitAsync();
tasks.Add(((Func<Task>)(async () =>
{
//await semaphore.WaitAsync();
var response = await client.GetAsync(url); // possibly ConfigureAwait(false) here
// do something with response
semaphore.Release();
}))());
}
await Task.WhenAll(tasks);
This is another way to do it
var client = new HttpClient();
var tasks = new HashSet<Task>();
foreach(var url in urls)
{
if(tasks.Count == m)
{
tasks.Remove(await Task.WhenAny(tasks));
}
tasks.Add(((Func<Task>)(async () =>
{
var response = await client.GetAsync(url); // possibly ConfigureAwait(false) here
// do something with response
}))());
}
await Task.WhenAll(tasks);
Process items in parallel, limiting the number of simultaneous jobs:
string[] strings = GetStrings(); // Items to process.
const int m = 2; // Max simultaneous jobs.
Parallel.ForEach(strings, new ParallelOptions {MaxDegreeOfParallelism = m}, s =>
{
DoWork(s);
});

Throttling asynchronous tasks

I would like to run a bunch of async tasks, with a limit on how many tasks may be pending completion at any given time.
Say you have 1000 URLs, and you only want to have 50 requests open at a time; but as soon as one request completes, you open up a connection to the next URL in the list. That way, there are always exactly 50 connections open at a time, until the URL list is exhausted.
I also want to utilize a given number of threads if possible.
I came up with an extension method, ThrottleTasksAsync that does what I want. Is there a simpler solution already out there? I would assume that this is a common scenario.
Usage:
class Program
{
static void Main(string[] args)
{
Enumerable.Range(1, 10).ThrottleTasksAsync(5, 2, async i => { Console.WriteLine(i); return i; }).Wait();
Console.WriteLine("Press a key to exit...");
Console.ReadKey(true);
}
}
Here is the code:
static class IEnumerableExtensions
{
public static async Task<Result_T[]> ThrottleTasksAsync<Enumerable_T, Result_T>(this IEnumerable<Enumerable_T> enumerable, int maxConcurrentTasks, int maxDegreeOfParallelism, Func<Enumerable_T, Task<Result_T>> taskToRun)
{
var blockingQueue = new BlockingCollection<Enumerable_T>(new ConcurrentBag<Enumerable_T>());
var semaphore = new SemaphoreSlim(maxConcurrentTasks);
// Run the throttler on a separate thread.
var t = Task.Run(() =>
{
foreach (var item in enumerable)
{
// Wait for the semaphore
semaphore.Wait();
blockingQueue.Add(item);
}
blockingQueue.CompleteAdding();
});
var taskList = new List<Task<Result_T>>();
Parallel.ForEach(IterateUntilTrue(() => blockingQueue.IsCompleted), new ParallelOptions { MaxDegreeOfParallelism = maxDegreeOfParallelism },
_ =>
{
Enumerable_T item;
if (blockingQueue.TryTake(out item, 100))
{
taskList.Add(
// Run the task
taskToRun(item)
.ContinueWith(tsk =>
{
// For effect
Thread.Sleep(2000);
// Release the semaphore
semaphore.Release();
return tsk.Result;
}
)
);
}
});
// Await all the tasks.
return await Task.WhenAll(taskList);
}
static IEnumerable<bool> IterateUntilTrue(Func<bool> condition)
{
while (!condition()) yield return true;
}
}
The method utilizes BlockingCollection and SemaphoreSlim to make it work. The throttler is run on one thread, and all the async tasks are run on the other thread. To achieve parallelism, I added a maxDegreeOfParallelism parameter that's passed to a Parallel.ForEach loop re-purposed as a while loop.
The old version was:
foreach (var master = ...)
{
var details = ...;
Parallel.ForEach(details, detail => {
// Process each detail record here
}, new ParallelOptions { MaxDegreeOfParallelism = 15 });
// Perform the final batch updates here
}
But, the thread pool gets exhausted fast, and you can't do async/await.
Bonus:
To get around the problem in BlockingCollection where an exception is thrown in Take() when CompleteAdding() is called, I'm using the TryTake overload with a timeout. If I didn't use the timeout in TryTake, it would defeat the purpose of using a BlockingCollection since TryTake won't block. Is there a better way? Ideally, there would be a TakeAsync method.
As suggested, use TPL Dataflow.
A TransformBlock<TInput, TOutput> may be what you're looking for.
You define a MaxDegreeOfParallelism to limit how many strings can be transformed (i.e., how many urls can be downloaded) in parallel. You then post urls to the block, and when you're done you tell the block you're done adding items and you fetch the responses.
var downloader = new TransformBlock<string, HttpResponse>(
url => Download(url),
new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 50 }
);
var buffer = new BufferBlock<HttpResponse>();
downloader.LinkTo(buffer);
foreach(var url in urls)
downloader.Post(url);
//or await downloader.SendAsync(url);
downloader.Complete();
await downloader.Completion;
IList<HttpResponse> responses;
if (buffer.TryReceiveAll(out responses))
{
//process responses
}
Note: The TransformBlock buffers both its input and output. Why, then, do we need to link it to a BufferBlock?
Because the TransformBlock won't complete until all items (HttpResponse) have been consumed, and await downloader.Completion would hang. Instead, we let the downloader forward all its output to a dedicated buffer block - then we wait for the downloader to complete, and inspect the buffer block.
Say you have 1000 URLs, and you only want to have 50 requests open at
a time; but as soon as one request completes, you open up a connection
to the next URL in the list. That way, there are always exactly 50
connections open at a time, until the URL list is exhausted.
The following simple solution has surfaced many times here on SO. It doesn't use blocking code and doesn't create threads explicitly, so it scales very well:
const int MAX_DOWNLOADS = 50;
static async Task DownloadAsync(string[] urls)
{
using (var semaphore = new SemaphoreSlim(MAX_DOWNLOADS))
using (var httpClient = new HttpClient())
{
var tasks = urls.Select(async url =>
{
await semaphore.WaitAsync();
try
{
var data = await httpClient.GetStringAsync(url);
Console.WriteLine(data);
}
finally
{
semaphore.Release();
}
});
await Task.WhenAll(tasks);
}
}
The thing is, the processing of the downloaded data should be done on a different pipeline, with a different level of parallelism, especially if it's a CPU-bound processing.
E.g., you'd probably want to have 4 threads concurrently doing the data processing (the number of CPU cores), and up to 50 pending requests for more data (which do not use threads at all). AFAICT, this is not what your code is currently doing.
That's where TPL Dataflow or Rx may come in handy as a preferred solution. Yet it is certainly possible to implement something like this with plain TPL. Note, the only blocking code here is the one doing the actual data processing inside Task.Run:
const int MAX_DOWNLOADS = 50;
const int MAX_PROCESSORS = 4;
// process data
class Processing
{
SemaphoreSlim _semaphore = new SemaphoreSlim(MAX_PROCESSORS);
HashSet<Task> _pending = new HashSet<Task>();
object _lock = new Object();
async Task ProcessAsync(string data)
{
await _semaphore.WaitAsync();
try
{
await Task.Run(() =>
{
// simuate work
Thread.Sleep(1000);
Console.WriteLine(data);
});
}
finally
{
_semaphore.Release();
}
}
public async void QueueItemAsync(string data)
{
var task = ProcessAsync(data);
lock (_lock)
_pending.Add(task);
try
{
await task;
}
catch
{
if (!task.IsCanceled && !task.IsFaulted)
throw; // not the task's exception, rethrow
// don't remove faulted/cancelled tasks from the list
return;
}
// remove successfully completed tasks from the list
lock (_lock)
_pending.Remove(task);
}
public async Task WaitForCompleteAsync()
{
Task[] tasks;
lock (_lock)
tasks = _pending.ToArray();
await Task.WhenAll(tasks);
}
}
// download data
static async Task DownloadAsync(string[] urls)
{
var processing = new Processing();
using (var semaphore = new SemaphoreSlim(MAX_DOWNLOADS))
using (var httpClient = new HttpClient())
{
var tasks = urls.Select(async (url) =>
{
await semaphore.WaitAsync();
try
{
var data = await httpClient.GetStringAsync(url);
// put the result on the processing pipeline
processing.QueueItemAsync(data);
}
finally
{
semaphore.Release();
}
});
await Task.WhenAll(tasks.ToArray());
await processing.WaitForCompleteAsync();
}
}
As requested, here's the code I ended up going with.
The work is set up in a master-detail configuration, and each master is processed as a batch. Each unit of work is queued up in this fashion:
var success = true;
// Start processing all the master records.
Master master;
while (null != (master = await StoredProcedures.ClaimRecordsAsync(...)))
{
await masterBuffer.SendAsync(master);
}
// Finished sending master records
masterBuffer.Complete();
// Now, wait for all the batches to complete.
await batchAction.Completion;
return success;
Masters are buffered one at a time to save work for other outside processes. The details for each master are dispatched for work via the masterTransform TransformManyBlock. A BatchedJoinBlock is also created to collect the details in one batch.
The actual work is done in the detailTransform TransformBlock, asynchronously, 150 at a time. BoundedCapacity is set to 300 to ensure that too many Masters don't get buffered at the beginning of the chain, while also leaving room for enough detail records to be queued to allow 150 records to be processed at one time. The block outputs an object to its targets, because it's filtered across the links depending on whether it's a Detail or Exception.
The batchAction ActionBlock collects the output from all the batches, and performs bulk database updates, error logging, etc. for each batch.
There will be several BatchedJoinBlocks, one for each master. Since each ISourceBlock is output sequentially and each batch only accepts the number of detail records associated with one master, the batches will be processed in order. Each block only outputs one group, and is unlinked on completion. Only the last batch block propagates its completion to the final ActionBlock.
The dataflow network:
// The dataflow network
BufferBlock<Master> masterBuffer = null;
TransformManyBlock<Master, Detail> masterTransform = null;
TransformBlock<Detail, object> detailTransform = null;
ActionBlock<Tuple<IList<object>, IList<object>>> batchAction = null;
// Buffer master records to enable efficient throttling.
masterBuffer = new BufferBlock<Master>(new DataflowBlockOptions { BoundedCapacity = 1 });
// Sequentially transform master records into a stream of detail records.
masterTransform = new TransformManyBlock<Master, Detail>(async masterRecord =>
{
var records = await StoredProcedures.GetObjectsAsync(masterRecord);
// Filter the master records based on some criteria here
var filteredRecords = records;
// Only propagate completion to the last batch
var propagateCompletion = masterBuffer.Completion.IsCompleted && masterTransform.InputCount == 0;
// Create a batch join block to encapsulate the results of the master record.
var batchjoinblock = new BatchedJoinBlock<object, object>(records.Count(), new GroupingDataflowBlockOptions { MaxNumberOfGroups = 1 });
// Add the batch block to the detail transform pipeline's link queue, and link the batch block to the the batch action block.
var detailLink1 = detailTransform.LinkTo(batchjoinblock.Target1, detailResult => detailResult is Detail);
var detailLink2 = detailTransform.LinkTo(batchjoinblock.Target2, detailResult => detailResult is Exception);
var batchLink = batchjoinblock.LinkTo(batchAction, new DataflowLinkOptions { PropagateCompletion = propagateCompletion });
// Unlink batchjoinblock upon completion.
// (the returned task does not need to be awaited, despite the warning.)
batchjoinblock.Completion.ContinueWith(task =>
{
detailLink1.Dispose();
detailLink2.Dispose();
batchLink.Dispose();
});
return filteredRecords;
}, new ExecutionDataflowBlockOptions { BoundedCapacity = 1 });
// Process each detail record asynchronously, 150 at a time.
detailTransform = new TransformBlock<Detail, object>(async detail => {
try
{
// Perform the action for each detail here asynchronously
await DoSomethingAsync();
return detail;
}
catch (Exception e)
{
success = false;
return e;
}
}, new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 150, BoundedCapacity = 300 });
// Perform the proper action for each batch
batchAction = new ActionBlock<Tuple<IList<object>, IList<object>>>(async batch =>
{
var details = batch.Item1.Cast<Detail>();
var errors = batch.Item2.Cast<Exception>();
// Do something with the batch here
}, new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 4 });
masterBuffer.LinkTo(masterTransform, new DataflowLinkOptions { PropagateCompletion = true });
masterTransform.LinkTo(detailTransform, new DataflowLinkOptions { PropagateCompletion = true });

Categories