Best way to convert thread safe collection to DataTable? - c#

So here is the scenario:
I have to take a group of data, process it, build an object and then insert those objects into a database.
To increase performance, I am multi-threading the processing of the data using a parallel loop and storing the objects in a ConcurrentBag.
That part works fine. However, the issue is that I now need to take that collection, convert it into a DataTable object and insert the data into the database. It's very ugly and I feel like I'm not doing this in the best way possible (pseudocode below):
ConcurrentBag<FinalObject> bag = new ConcurrentBag<FinalObject>();
ParallelOptions parallelOptions = new ParallelOptions();
parallelOptions.MaxDegreeOfParallelism = Environment.ProcessorCount;
Parallel.ForEach(allData, parallelOptions, dataObj =>
{
.... Process data ....
bag.Add(theData);
Thread.Sleep(100);
});
DataTable table = createTable();
foreach(FinalObject moveObj in bag) {
table.Rows.Add(moveObj.x);
}

This is a good candidate for PLINQ (or Rx - I'll focus on PLINQ since it's part of the Base Class Library).
IEnumerable<FinalObject> bag = allData
.AsParallel()
.WithDegreeOfParallelism(Environment.ProcessorCount)
.Select(dataObj =>
{
FinalObject theData = Process(dataObj);
Thread.Sleep(100);
return theData;
});
DataTable table = createTable();
foreach (FinalObject moveObj in bag)
{
table.Rows.Add(moveObj.x);
}
Realistically, instead of throttling the loop via Thread.Sleep, you should be limiting the maximum degree of parallelism further until you get the CPU usage down to the desired level.
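For example, a minimal sketch of the same PLINQ query with a lower degree of parallelism (pick whatever value brings CPU usage down to the level you want):
// Sketch only: throttle by lowering the degree of parallelism instead of sleeping.
IEnumerable<FinalObject> bag = allData
    .AsParallel()
    .WithDegreeOfParallelism(Math.Max(1, Environment.ProcessorCount / 2))
    .Select(dataObj => Process(dataObj));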
Disclaimer: all of the below is meant for entertainment only, although it does actually work.
Of course you can always kick it up a notch and produce a full-on async Parallel.ForEach implementation that allows you to process input in parallel and do your throttling asynchronously, without blocking any thread pool threads.
async Task ParallelForEachAsync<TInput, TResult>(IEnumerable<TInput> input,
int maxDegreeOfParallelism,
Func<TInput, Task<TResult>> body,
Action<TResult> onCompleted)
{
Queue<TInput> queue = new Queue<TInput>(input);
if (queue.Count == 0) {
return;
}
List<Task<TResult>> tasksInFlight = new List<Task<TResult>>(maxDegreeOfParallelism);
do
{
while (tasksInFlight.Count < maxDegreeOfParallelism && queue.Count != 0)
{
TInput item = queue.Dequeue();
Task<TResult> task = body(item);
tasksInFlight.Add(task);
}
Task<TResult> completedTask = await Task.WhenAny(tasksInFlight).ConfigureAwait(false);
tasksInFlight.Remove(completedTask);
TResult result = completedTask.GetAwaiter().GetResult(); // We know the task has completed. No need for await.
onCompleted(result);
}
while (queue.Count != 0 || tasksInFlight.Count != 0);
}
Usage (full Fiddle here):
async Task<DataTable> ProcessAllAsync(IEnumerable<InputObject> allData)
{
DataTable table = CreateTable();
int maxDegreeOfParallelism = Environment.ProcessorCount;
await ParallelForEachAsync(
allData,
maxDegreeOfParallelism,
// Loop body: these Tasks will run in parallel, up to {maxDegreeOfParallelism} at any given time.
async dataObj =>
{
FinalObject o = await Task.Run(() => Process(dataObj)).ConfigureAwait(false); // Thread pool processing.
await Task.Delay(100).ConfigureAwait(false); // Artificial throttling.
return o;
},
// Completion handler: these will be executed one at a time, and can safely mutate shared state.
moveObj => table.Rows.Add(moveObj.x)
);
return table;
}
struct InputObject
{
public int x;
}
struct FinalObject
{
public int x;
}
FinalObject Process(InputObject o)
{
// Simulate synchronous work.
Thread.Sleep(100);
return new FinalObject { x = o.x };
}
Same behaviour, but without Thread.Sleep and ConcurrentBag<T>.

Sounds like you've complicated things quite a bit by trying to make everything run in parallel, but if you store DataRow objects in your bag instead of plain objects, at the end you can use DataTableExtensions to create a DataTable from a generic collection quite easily:
var dataTable = bag.CopyToDataTable();
Just add a reference to System.Data.DataSetExtensions in your project.
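A minimal sketch of how that could fit the original parallel loop, assuming the columns created by createTable() and a Process helper; note that DataTable.NewRow is not thread-safe, so row creation is serialized with a lock here:
DataTable template = createTable();
ConcurrentBag<DataRow> rows = new ConcurrentBag<DataRow>();
object tableLock = new object();

Parallel.ForEach(allData, dataObj =>
{
    DataRow row;
    lock (tableLock)               // DataTable.NewRow is not thread-safe
    {
        row = template.NewRow();
    }
    row["x"] = Process(dataObj).x; // assumed column name and helper
    rows.Add(row);
});

DataTable table = rows.CopyToDataTable();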

I think something like this should give better performance; object[] looks like a better option than DataRow here, because you need a DataTable to create a DataRow object.
ConcurrentBag<object[]> bag = new ConcurrentBag<object[]>();
Parallel.ForEach(allData,
new ParallelOptions { MaxDegreeOfParallelism = Environment.ProcessorCount },
dataObj =>
{
object[] row = new object[colCount];
//do processing
bag.Add(row);
Thread.Sleep(100);
});
DataTable table = createTable();
foreach (object[] row in bag)
{
table.Rows.Add(row);
}

Related

Streaming data via IEnumerable & TPL Dataflow

I am getting items from an upstream API which is quite slow. I try to speed this up by using TPL Dataflow to create multiple connections and bring these together, like this:
class Stuff
{
int Id { get; }
}
async Task<Stuff> GetStuffById(int id) => throw new NotImplementedException();
async Task<IEnumerable<Stuff>> GetLotsOfStuff(IEnumerable<int> ids)
{
var bagOfStuff = new ConcurrentBag<Stuff>();
var options = new ExecutionDataflowBlockOptions
{
MaxDegreeOfParallelism = 5
};
var processor = new ActionBlock<int>(async id =>
{
bagOfStuff.Add(await GetStuffById(id));
}, options);
foreach (int id in ids)
{
processor.Post(id);
}
processor.Complete();
await processor.Completion;
return bagOfStuff.ToArray();
}
The problem is that I have to wait until I have finished querying the entire collection of Stuff before I can return it to the caller. What I would prefer is that, whenever any of the multiple parallel queries returns an item, I return that item in a yield return fashion. Therefore I don't need to return a Task<IEnumerable<Stuff>>; I can just return an IEnumerable<Stuff>, and the caller advances the iteration as soon as any item returns.
I tried doing it like this:
IEnumerable<Stuff> GetLotsOfStuff(IEnumerable<int> ids)
{
var options = new ExecutionDataflowBlockOptions
{
MaxDegreeOfParallelism = 5
};
var processor = new ActionBlock<int>(async id =>
{
yield return await GetStuffById(id);
}, options);
foreach (int id in ids)
{
processor.Post(id);
}
processor.Complete();
processor.Completion.Wait();
yield break;
}
But I get an error
The yield statement cannot be used inside an anonymous method or lambda expression
How can I restructure my code?
You can return an IEnumerable, but to do so you must block your current thread. You need a TransformBlock to process the ids, and a feeder-task that will feed asynchronously the TransformBlock with ids. Finally the current thread will enter a blocking loop, waiting for produced stuff to yield:
static IEnumerable<Stuff> GetLotsOfStuff(IEnumerable<int> ids)
{
using var completionCTS = new CancellationTokenSource();
var processor = new TransformBlock<int, Stuff>(async id =>
{
return await GetStuffById(id);
}, new ExecutionDataflowBlockOptions
{
MaxDegreeOfParallelism = 5,
BoundedCapacity = 50, // Avoid buffering millions of ids
CancellationToken = completionCTS.Token
});
var feederTask = Task.Run(async () =>
{
try
{
foreach (int id in ids)
if (!await processor.SendAsync(id)) break;
}
finally { processor.Complete(); }
});
try
{
while (processor.OutputAvailableAsync().Result)
while (processor.TryReceive(out var stuff))
yield return stuff;
}
finally // This runs when the caller exits the foreach loop
{
completionCTS.Cancel(); // Cancel the TransformBlock if it's still running
}
Task.WaitAll(feederTask, processor.Completion); // Propagate all exceptions
}
No ConcurrentBag is needed, since the TransformBlock has an internal output buffer. The tricky part is dealing with the case that the caller will abandon the enumeration of the IEnumerable<Stuff> by breaking early, or by being obstructed by an exception. In this case you don't want the feeder-task to keep pumping the IEnumerable<int> with the ids till the end. Fortunately there is a solution. Enclosing the yielding loop in a try/finally block allows a notification of this event to be received, so that the feeder-task can be terminated in a timely manner.
An alternative implementation could remove the need for a feeder-task by combining pumping the ids, feeding the block, and yielding stuff in a single loop. In this case you would want a lag between pumping and yielding. To achieve it, the MoreLinq's Lag (or Lead) extension method could be handy.
Update: Here is a different implementation, that enumerates and yields in the same loop. To achieve the desired lagging, the source enumerable is right-padded with some dummy elements, equal in number with the degree of concurrency.
This implementation accepts generic types, instead of int and Stuff.
public static IEnumerable<TResult> Transform<TSource, TResult>(
IEnumerable<TSource> source, Func<TSource, Task<TResult>> taskFactory,
int degreeOfConcurrency)
{
var processor = new TransformBlock<TSource, TResult>(async item =>
{
return await taskFactory(item);
}, new ExecutionDataflowBlockOptions
{
MaxDegreeOfParallelism = degreeOfConcurrency
});
var paddedSource = source.Select(item => (item, true))
.Concat(Enumerable.Repeat((default(TSource), false), degreeOfConcurrency));
int index = -1;
bool completed = false;
foreach (var (item, hasValue) in paddedSource)
{
index++;
if (hasValue) { processor.Post(item); }
else if (!completed) { processor.Complete(); completed = true; }
if (index >= degreeOfConcurrency)
{
if (!processor.OutputAvailableAsync().Result) break; // Blocking call
if (!processor.TryReceive(out var result))
throw new InvalidOperationException(); // Should never happen
yield return result;
}
}
processor.Completion.Wait();
}
Usage example:
IEnumerable<Stuff> lotsOfStuff = Transform(ids, GetStuffById, 5);
Both implementations can be modified trivially to return an IAsyncEnumerable instead of IEnumerable, to avoid blocking the calling thread.
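For instance, a minimal sketch of an IAsyncEnumerable variant built on the same feeder-task idea as the first implementation (assumes C# 8 async iterators; this is a sketch, not code from the answers above):
public static async IAsyncEnumerable<TResult> TransformAsync<TSource, TResult>(
    IEnumerable<TSource> source, Func<TSource, Task<TResult>> taskFactory,
    int degreeOfConcurrency)
{
    var processor = new TransformBlock<TSource, TResult>(taskFactory,
        new ExecutionDataflowBlockOptions
        {
            MaxDegreeOfParallelism = degreeOfConcurrency,
            BoundedCapacity = degreeOfConcurrency * 2 // avoid buffering the whole source
        });

    // Feed the block on a background task, completing it when the source is exhausted.
    var feeder = Task.Run(async () =>
    {
        try
        {
            foreach (var item in source)
                if (!await processor.SendAsync(item)) break;
        }
        finally { processor.Complete(); }
    });

    // Yield results as they become available, without blocking the calling thread.
    while (await processor.OutputAvailableAsync().ConfigureAwait(false))
        while (processor.TryReceive(null, out var result))
            yield return result;

    await Task.WhenAll(feeder, processor.Completion); // Propagate exceptions.
}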
There's probably a few different ways you can handle this based on your specific use case, but to handle items as they come through in TPL Dataflow terms, you'd change your source block to a TransformBlock<,> and flow the items to another block that processes them. Note that you can now get rid of the collecting ConcurrentBag; be sure to set EnsureOrdered to false if you don't care what order you receive your items in. Also link the blocks and propagate completion to ensure your pipeline finishes once all items are retrieved and subsequently processed.
class Stuff
{
int Id { get; }
}
public class GetStuff
{
async Task<Stuff> GetStuffById(int id) => throw new NotImplementedException();
async Task GetLotsOfStuff(IEnumerable<int> ids)
{
//var bagOfStuff = new ConcurrentBag<Stuff>();
var options = new ExecutionDataflowBlockOptions
{
MaxDegreeOfParallelism = 5,
EnsureOrdered = false
};
var processor = new TransformBlock<int, Stuff>(id => GetStuffById(id), options);
var handler = new ActionBlock<Stuff>(s => throw new NotImplementedException());
processor.LinkTo(handler, new DataflowLinkOptions() { PropagateCompletion = true });
foreach (int id in ids)
{
processor.Post(id);
}
processor.Complete();
await handler.Completion;
}
}
Other options could be making your method an observable that streams out of the TransformBlock, or using IAsyncEnumerable to yield return from an async get method.
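For instance, a minimal sketch of the observable route (DataflowBlock.AsObservable is part of TPL Dataflow; the Subscribe(Action<T>) overload shown assumes the System.Reactive package is referenced):
// Sketch: expose the TransformBlock's output as a push-based stream.
IObservable<Stuff> stream = processor.AsObservable();
IDisposable subscription = stream.Subscribe(stuff => Console.WriteLine(stuff));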

Task Factory for each loop with await

I'm new to tasks and have a question regarding their usage. Does Task.Factory fire off all items in the foreach loop, or does it block at the 'await', basically making the program single-threaded? If I am thinking about this correctly, the foreach loop starts all the tasks and the .GetAwaiter().GetResult(); blocks the main thread until the last task is complete.
Also, I'm just wanting some anonymous tasks to load the data. Would this be a correct implementation? I'm not referring to exception handling as this is simply an example.
To provide clarity, I am loading data into a database from an outside API. This one is using the FRED database. (https://fred.stlouisfed.org/), but I have several I will hit to complete the entire transfer (maybe 200k data points). Once they are done I update the tables, refresh market calculations, etc. Some of it is real time and some of it is End-of-day. I would also like to say, I currently have everything working in docker, but have been working to update the code using tasks to improve execution.
class Program
{
private async Task SQLBulkLoader()
{
foreach (var fileListObj in indicators.file_list)
{
await Task.Factory.StartNew( () =>
{
string json = this.GET(//API call);
SeriesObject obj = JsonConvert.DeserializeObject<SeriesObject>(json);
DataTable dataTableConversion = ConvertToDataTable(obj.observations);
dataTableConversion.TableName = fileListObj.series_id;
using (SqlConnection dbConnection = new SqlConnection("SQL Connection"))
{
dbConnection.Open();
using (SqlBulkCopy s = new SqlBulkCopy(dbConnection))
{
s.DestinationTableName = dataTableConversion.TableName;
foreach (var column in dataTableConversion.Columns)
s.ColumnMappings.Add(column.ToString(), column.ToString());
s.WriteToServer(dataTableConversion);
}
Console.WriteLine("File: {0} Complete", fileListObj.series_id);
}
});
}
}
static void Main(string[] args)
{
Program worker = new Program();
worker.SQLBulkLoader().GetAwaiter().GetResult();
}
}
Awaiting the task returned from Task.Factory.StartNew does make it effectively single-threaded. You can see a simple demonstration of this with this short LinqPad example:
for (var i = 0; i < 3; i++)
{
var index = i;
$"{index} inline".Dump();
await Task.Run(() =>
{
Thread.Sleep((3 - index) * 1000);
$"{index} in thread".Dump();
});
}
Here we wait less as we progress through the loop. The output is:
0 inline
0 in thread
1 inline
1 in thread
2 inline
2 in thread
If you remove the await in front of StartNew you'll see it runs in parallel. As others have mentioned, you can certainly use Parallel.ForEach, but for a demonstration of doing it a bit more manually, you can consider a solution like this:
var tasks = new List<Task>();
for (var i = 0; i < 3; i++)
{
var index = i;
$"{index} inline".Dump();
tasks.Add(Task.Factory.StartNew(() =>
{
Thread.Sleep((3 - index) * 1000);
$"{index} in thread".Dump();
}));
}
Task.WaitAll(tasks.ToArray());
Notice now how the result is:
0 inline
1 inline
2 inline
2 in thread
1 in thread
0 in thread
This is a typical problem that C# 8.0 Async Streams are going to solve very soon.
Until C# 8.0 is released, you can use the AsyncEnumerator library:
using System.Collections.Async;
class Program
{
private async Task SQLBulkLoader() {
await indicators.file_list.ParallelForEachAsync(async fileListObj =>
{
...
await s.WriteToServerAsync(dataTableConversion);
...
},
maxDegreeOfParalellism: 3,
cancellationToken: default);
}
static void Main(string[] args)
{
Program worker = new Program();
worker.SQLBulkLoader().GetAwaiter().GetResult();
}
}
I do not recommend using Parallel.ForEach and Task.WhenAll as those functions are not designed for asynchronous streaming.
You'll want to add each task to a collection and then use Task.WhenAll to await all of the tasks in that collection:
private async Task SQLBulkLoader()
{
var tasks = new List<Task>();
foreach (var fileListObj in indicators.file_list)
{
tasks.Add(Task.Factory.StartNew(() => { /* Doing Stuff */ }));
}
await Task.WhenAll(tasks.ToArray());
}
My take on this: the most time-consuming operations will be getting the data using a GET operation and the actual call to WriteToServer using SqlBulkCopy. If you take a look at that class you will see that there is a native async WriteToServerAsync method (docs here). Always use those before creating tasks yourself using Task.Run.
The same applies to the http GET call. You can use the native HttpClient.GetAsync (docs here) for that.
Doing that you can rewrite your code to this:
private async Task ProcessFileAsync(string series_id)
{
string json = await GetAsync();
SeriesObject obj = JsonConvert.DeserializeObject<SeriesObject>(json);
DataTable dataTableConversion = ConvertToDataTable(obj.observations);
dataTableConversion.TableName = series_id;
using (SqlConnection dbConnection = new SqlConnection("SQL Connection"))
{
dbConnection.Open();
using (SqlBulkCopy s = new SqlBulkCopy(dbConnection))
{
s.DestinationTableName = dataTableConversion.TableName;
foreach (var column in dataTableConversion.Columns)
s.ColumnMappings.Add(column.ToString(), column.ToString());
await s.WriteToServerAsync(dataTableConversion);
}
Console.WriteLine("File: {0} Complete", series_id);
}
}
private async Task SQLBulkLoaderAsync()
{
var tasks = indicators.file_list.Select(f => ProcessFileAsync(f.series_id));
await Task.WhenAll(tasks);
}
Both operations (the HTTP call and the SQL Server call) are I/O calls. Using the native async/await pattern there won't even be a thread created or used; see this question for a more in-depth explanation. That is why for I/O-bound operations you should never have to use Task.Run or Task.Factory.StartNew (and if you do need to offload CPU-bound work, Task.Run is the recommended approach of the two).
Sidenote: if you are using HttpClient in a loop, please read this about how to correctly use it.
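For example, a minimal sketch of a reusable client (the GetAsync helper below is an assumed stand-in for the GET call inside ProcessFileAsync):
// Sketch: one shared HttpClient for the whole process, reused across requests.
private static readonly HttpClient httpClient = new HttpClient();

private async Task<string> GetAsync(string url)
{
    using (HttpResponseMessage response = await httpClient.GetAsync(url))
    {
        response.EnsureSuccessStatusCode();
        return await response.Content.ReadAsStringAsync();
    }
}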
If you need to limit the number of parallel actions you could also use TPL Dataflow, as it plays very nicely with Task-based I/O-bound operations. The SQLBulkLoaderAsync should then be modified to (leaving the ProcessFileAsync method from earlier in this answer intact):
private async Task SQLBulkLoaderAsync()
{
var ab = new ActionBlock<string>(ProcessFileAsync, new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 5 });
foreach (var file in indicators.file_list)
{
ab.Post(file.series_id);
}
ab.Complete();
await ab.Completion;
}
Use a Parallel.ForEach loop to enable data parallelism over any System.Collections.Generic.IEnumerable<T> source.
// Method signature: Parallel.ForEach(IEnumerable<TSource> source, Action<TSource> body)
Parallel.ForEach(fileList, (currentFile) =>
{
//Doing Stuff
Console.WriteLine("Processing {0} on thread {1}", currentFile, Thread.CurrentThread.ManagedThreadId);
});
Why didn't you try this? :) This program will not start parallel tasks (in the foreach); it blocks, and the logic in each task runs on a separate thread pool thread (only one at a time, while the main thread is blocked).
The proper approach in your situation is to use Parallel.ForEach:
How can I convert this foreach code to Parallel.ForEach?

Using SemaphoreSlim for tasks with return type

I am trying to create a producer/consumer TPL Dataflow process. As part of it, I will be creating multiple producer tasks for different id ranges which will generate the data records for processing. So I am planning to throttle the number of threads that connect to the database to get the data and are active at the same time.
However, the method which extracts data and sends it to the BufferBlock has a return type, so that I can get the count of records extracted. Because of this I cannot figure out where to call the Release method of the SemaphoreSlim while still getting the return value. Below is the sample code that I am using. Can someone please suggest a workaround for this?
private Task<KeyRange>[] DataExtractProducer(ITargetBlock<DataRow[]> targetBuffer, ExtractQueryConfiguration QueryConf)
{
CancellationToken cancelToken = cancelTokenSrc.Token;
var tasks = new List<Task<KeyRange>>();
int taskCount = 0;
int maxThreads = QueryConf.MaxThreadsLimit > 0 ? QueryConf.MaxThreadsLimit : DataExtConstants.DefaultMaxThreads;
using (SemaphoreSlim concurrency = new SemaphoreSlim(maxThreads))
{
foreach (KeyRange range in keyRangeList)
{
concurrency.WaitAsync();
var task = Task.Run(() =>
{
Console.WriteLine(MsgConstants.StartKeyRangeExtract, QueryConf.KeyColumn, range.StartValue, range.EndValue);
//concurrency.Release();
return GetDataTask(targetBuffer, range, QueryConf.ExtractQuery);
}, cancelToken);
tasks.Add(task);
taskCount++;
}
}
Console.WriteLine(MsgConstants.TaskCountMessage, taskCount);
return tasks.ToArray<Task<KeyRange>>();
}
Edit: I tried this variant as well, but it does not seem to work. I tried with a limit of 20, but I see more than 50 DB connections going out. Eventually I hit high memory consumption because of the unthrottled connections.
using (SemaphoreSlim concurrency = new SemaphoreSlim(maxThreads))
{
foreach (KeyRange range in keyRangeList)
{
concurrency.WaitAsync();
var task = Task.Run(async () =>
{
Console.WriteLine(MsgConstants.StartKeyRangeExtract, QueryConf.KeyColumn, range.StartValue, range.EndValue);
var temptask = await GetDataTask(targetBuffer, range, QueryConf.ExtractQuery);
concurrency.Release();
return temptask;
}, cancelToken);
tasks.Add(task);
taskCount++;
}
}
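For reference, a minimal sketch of the usual pattern for pairing SemaphoreSlim with tasks that return a value (a sketch, not an answer quoted above): acquire inside the async delegate and release in a finally block, so the KeyRange result still flows out of the task. The semaphore must outlive all the tasks, so it is not wrapped in a using here.
SemaphoreSlim concurrency = new SemaphoreSlim(maxThreads); // dispose only after all tasks finish
var tasks = new List<Task<KeyRange>>();
foreach (KeyRange range in keyRangeList)
{
    tasks.Add(Task.Run(async () =>
    {
        await concurrency.WaitAsync(cancelToken); // limits concurrent DB connections
        try
        {
            return await GetDataTask(targetBuffer, range, QueryConf.ExtractQuery);
        }
        finally
        {
            concurrency.Release(); // always release, even if GetDataTask throws
        }
    }, cancelToken));
}
// Later: await Task.WhenAll(tasks); then concurrency.Dispose();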

C# .NET Parallel I/O operation (with throttling) [duplicate]

I would like to run a bunch of async tasks, with a limit on how many tasks may be pending completion at any given time.
Say you have 1000 URLs, and you only want to have 50 requests open at a time; but as soon as one request completes, you open up a connection to the next URL in the list. That way, there are always exactly 50 connections open at a time, until the URL list is exhausted.
I also want to utilize a given number of threads if possible.
I came up with an extension method, ThrottleTasksAsync that does what I want. Is there a simpler solution already out there? I would assume that this is a common scenario.
Usage:
class Program
{
static void Main(string[] args)
{
Enumerable.Range(1, 10).ThrottleTasksAsync(5, 2, async i => { Console.WriteLine(i); return i; }).Wait();
Console.WriteLine("Press a key to exit...");
Console.ReadKey(true);
}
}
Here is the code:
static class IEnumerableExtensions
{
public static async Task<Result_T[]> ThrottleTasksAsync<Enumerable_T, Result_T>(this IEnumerable<Enumerable_T> enumerable, int maxConcurrentTasks, int maxDegreeOfParallelism, Func<Enumerable_T, Task<Result_T>> taskToRun)
{
var blockingQueue = new BlockingCollection<Enumerable_T>(new ConcurrentBag<Enumerable_T>());
var semaphore = new SemaphoreSlim(maxConcurrentTasks);
// Run the throttler on a separate thread.
var t = Task.Run(() =>
{
foreach (var item in enumerable)
{
// Wait for the semaphore
semaphore.Wait();
blockingQueue.Add(item);
}
blockingQueue.CompleteAdding();
});
var taskList = new List<Task<Result_T>>();
Parallel.ForEach(IterateUntilTrue(() => blockingQueue.IsCompleted), new ParallelOptions { MaxDegreeOfParallelism = maxDegreeOfParallelism },
_ =>
{
Enumerable_T item;
if (blockingQueue.TryTake(out item, 100))
{
taskList.Add(
// Run the task
taskToRun(item)
.ContinueWith(tsk =>
{
// For effect
Thread.Sleep(2000);
// Release the semaphore
semaphore.Release();
return tsk.Result;
}
)
);
}
});
// Await all the tasks.
return await Task.WhenAll(taskList);
}
static IEnumerable<bool> IterateUntilTrue(Func<bool> condition)
{
while (!condition()) yield return true;
}
}
The method utilizes BlockingCollection and SemaphoreSlim to make it work. The throttler is run on one thread, and all the async tasks are run on the other thread. To achieve parallelism, I added a maxDegreeOfParallelism parameter that's passed to a Parallel.ForEach loop re-purposed as a while loop.
The old version was:
foreach (var master = ...)
{
var details = ...;
Parallel.ForEach(details, detail => {
// Process each detail record here
}, new ParallelOptions { MaxDegreeOfParallelism = 15 });
// Perform the final batch updates here
}
But, the thread pool gets exhausted fast, and you can't do async/await.
Bonus:
To get around the problem in BlockingCollection where an exception is thrown in Take() when CompleteAdding() is called, I'm using the TryTake overload with a timeout. If I didn't use the timeout in TryTake, it would defeat the purpose of using a BlockingCollection since TryTake won't block. Is there a better way? Ideally, there would be a TakeAsync method.
As suggested, use TPL Dataflow.
A TransformBlock<TInput, TOutput> may be what you're looking for.
You define a MaxDegreeOfParallelism to limit how many strings can be transformed (i.e., how many urls can be downloaded) in parallel. You then post urls to the block, and when you're done you tell the block you're done adding items and you fetch the responses.
var downloader = new TransformBlock<string, HttpResponse>(
url => Download(url),
new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 50 }
);
var buffer = new BufferBlock<HttpResponse>();
downloader.LinkTo(buffer);
foreach(var url in urls)
downloader.Post(url);
//or await downloader.SendAsync(url);
downloader.Complete();
await downloader.Completion;
IList<HttpResponse> responses;
if (buffer.TryReceiveAll(out responses))
{
//process responses
}
Note: The TransformBlock buffers both its input and output. Why, then, do we need to link it to a BufferBlock?
Because the TransformBlock won't complete until all items (HttpResponse) have been consumed, and await downloader.Completion would hang. Instead, we let the downloader forward all its output to a dedicated buffer block - then we wait for the downloader to complete, and inspect the buffer block.
Say you have 1000 URLs, and you only want to have 50 requests open at a time; but as soon as one request completes, you open up a connection to the next URL in the list. That way, there are always exactly 50 connections open at a time, until the URL list is exhausted.
The following simple solution has surfaced many times here on SO. It doesn't use blocking code and doesn't create threads explicitly, so it scales very well:
const int MAX_DOWNLOADS = 50;
static async Task DownloadAsync(string[] urls)
{
using (var semaphore = new SemaphoreSlim(MAX_DOWNLOADS))
using (var httpClient = new HttpClient())
{
var tasks = urls.Select(async url =>
{
await semaphore.WaitAsync();
try
{
var data = await httpClient.GetStringAsync(url);
Console.WriteLine(data);
}
finally
{
semaphore.Release();
}
});
await Task.WhenAll(tasks);
}
}
The thing is, the processing of the downloaded data should be done on a different pipeline, with a different level of parallelism, especially if it's a CPU-bound processing.
E.g., you'd probably want to have 4 threads concurrently doing the data processing (the number of CPU cores), and up to 50 pending requests for more data (which do not use threads at all). AFAICT, this is not what your code is currently doing.
That's where TPL Dataflow or Rx may come in handy as a preferred solution. Yet it is certainly possible to implement something like this with plain TPL. Note, the only blocking code here is the one doing the actual data processing inside Task.Run:
const int MAX_DOWNLOADS = 50;
const int MAX_PROCESSORS = 4;
// process data
class Processing
{
SemaphoreSlim _semaphore = new SemaphoreSlim(MAX_PROCESSORS);
HashSet<Task> _pending = new HashSet<Task>();
object _lock = new Object();
async Task ProcessAsync(string data)
{
await _semaphore.WaitAsync();
try
{
await Task.Run(() =>
{
// simulate work
Thread.Sleep(1000);
Console.WriteLine(data);
});
}
finally
{
_semaphore.Release();
}
}
public async void QueueItemAsync(string data)
{
var task = ProcessAsync(data);
lock (_lock)
_pending.Add(task);
try
{
await task;
}
catch
{
if (!task.IsCanceled && !task.IsFaulted)
throw; // not the task's exception, rethrow
// don't remove faulted/cancelled tasks from the list
return;
}
// remove successfully completed tasks from the list
lock (_lock)
_pending.Remove(task);
}
public async Task WaitForCompleteAsync()
{
Task[] tasks;
lock (_lock)
tasks = _pending.ToArray();
await Task.WhenAll(tasks);
}
}
// download data
static async Task DownloadAsync(string[] urls)
{
var processing = new Processing();
using (var semaphore = new SemaphoreSlim(MAX_DOWNLOADS))
using (var httpClient = new HttpClient())
{
var tasks = urls.Select(async (url) =>
{
await semaphore.WaitAsync();
try
{
var data = await httpClient.GetStringAsync(url);
// put the result on the processing pipeline
processing.QueueItemAsync(data);
}
finally
{
semaphore.Release();
}
});
await Task.WhenAll(tasks.ToArray());
await processing.WaitForCompleteAsync();
}
}
As requested, here's the code I ended up going with.
The work is set up in a master-detail configuration, and each master is processed as a batch. Each unit of work is queued up in this fashion:
var success = true;
// Start processing all the master records.
Master master;
while (null != (master = await StoredProcedures.ClaimRecordsAsync(...)))
{
await masterBuffer.SendAsync(master);
}
// Finished sending master records
masterBuffer.Complete();
// Now, wait for all the batches to complete.
await batchAction.Completion;
return success;
Masters are buffered one at a time to save work for other outside processes. The details for each master are dispatched for work via the masterTransform TransformManyBlock. A BatchedJoinBlock is also created to collect the details in one batch.
The actual work is done in the detailTransform TransformBlock, asynchronously, 150 at a time. BoundedCapacity is set to 300 to ensure that too many Masters don't get buffered at the beginning of the chain, while also leaving room for enough detail records to be queued to allow 150 records to be processed at one time. The block outputs an object to its targets, because it's filtered across the links depending on whether it's a Detail or Exception.
The batchAction ActionBlock collects the output from all the batches, and performs bulk database updates, error logging, etc. for each batch.
There will be several BatchedJoinBlocks, one for each master. Since each ISourceBlock is output sequentially and each batch only accepts the number of detail records associated with one master, the batches will be processed in order. Each block only outputs one group, and is unlinked on completion. Only the last batch block propagates its completion to the final ActionBlock.
The dataflow network:
// The dataflow network
BufferBlock<Master> masterBuffer = null;
TransformManyBlock<Master, Detail> masterTransform = null;
TransformBlock<Detail, object> detailTransform = null;
ActionBlock<Tuple<IList<object>, IList<object>>> batchAction = null;
// Buffer master records to enable efficient throttling.
masterBuffer = new BufferBlock<Master>(new DataflowBlockOptions { BoundedCapacity = 1 });
// Sequentially transform master records into a stream of detail records.
masterTransform = new TransformManyBlock<Master, Detail>(async masterRecord =>
{
var records = await StoredProcedures.GetObjectsAsync(masterRecord);
// Filter the master records based on some criteria here
var filteredRecords = records;
// Only propagate completion to the last batch
var propagateCompletion = masterBuffer.Completion.IsCompleted && masterTransform.InputCount == 0;
// Create a batch join block to encapsulate the results of the master record.
var batchjoinblock = new BatchedJoinBlock<object, object>(records.Count(), new GroupingDataflowBlockOptions { MaxNumberOfGroups = 1 });
// Add the batch block to the detail transform pipeline's link queue, and link the batch block to the the batch action block.
var detailLink1 = detailTransform.LinkTo(batchjoinblock.Target1, detailResult => detailResult is Detail);
var detailLink2 = detailTransform.LinkTo(batchjoinblock.Target2, detailResult => detailResult is Exception);
var batchLink = batchjoinblock.LinkTo(batchAction, new DataflowLinkOptions { PropagateCompletion = propagateCompletion });
// Unlink batchjoinblock upon completion.
// (the returned task does not need to be awaited, despite the warning.)
batchjoinblock.Completion.ContinueWith(task =>
{
detailLink1.Dispose();
detailLink2.Dispose();
batchLink.Dispose();
});
return filteredRecords;
}, new ExecutionDataflowBlockOptions { BoundedCapacity = 1 });
// Process each detail record asynchronously, 150 at a time.
detailTransform = new TransformBlock<Detail, object>(async detail => {
try
{
// Perform the action for each detail here asynchronously
await DoSomethingAsync();
return detail;
}
catch (Exception e)
{
success = false;
return e;
}
}, new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 150, BoundedCapacity = 300 });
// Perform the proper action for each batch
batchAction = new ActionBlock<Tuple<IList<object>, IList<object>>>(async batch =>
{
var details = batch.Item1.Cast<Detail>();
var errors = batch.Item2.Cast<Exception>();
// Do something with the batch here
}, new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 4 });
masterBuffer.LinkTo(masterTransform, new DataflowLinkOptions { PropagateCompletion = true });
masterTransform.LinkTo(detailTransform, new DataflowLinkOptions { PropagateCompletion = true });

TPL DataFlow with Lazy Source / stream of data

Suppose you have a TransformBlock with configured parallelism and want to stream data through the block. The input data should be created only when the pipeline can actually start processing it. (And should be released the moment it leaves the pipeline.)
Can I achieve this? And if so how?
Basically I want a data source that works as an iterator.
Like so:
public IEnumerable<Guid> GetSourceData()
{
//In reality -> this should also be an async task -> but yield return does not work in combination with async/await ...
Func<ICollection<Guid>> GetNextBatch = () => Enumerable.Range(0, 100).Select(x => Guid.NewGuid()).ToArray();
while (true)
{
var batch = GetNextBatch();
if (batch == null || !batch.Any()) break;
foreach (var guid in batch)
yield return guid;
}
}
This would result in +- 100 records in memory. OK: more if the blocks you append to this data source would keep them in memory for some time, but you have a chance to get only a subset (/stream) of data.
Some background information:
I intend to use this in combination with Azure Cosmos DB, where the source could be all the objects in a collection, or a change feed. Needless to say, I don't want all of those objects stored in memory. So this can't work:
using System.Threading.Tasks.Dataflow;
public async Task ExampleTask()
{
Func<Guid, object> TheActualAction = text => text.ToString();
var config = new ExecutionDataflowBlockOptions
{
BoundedCapacity = 5,
MaxDegreeOfParallelism = 15
};
var throtteler = new TransformBlock<Guid, object>(TheActualAction, config);
var output = new BufferBlock<object>();
throtteler.LinkTo(output);
throtteler.Post(Guid.NewGuid());
throtteler.Post(Guid.NewGuid());
throtteler.Post(Guid.NewGuid());
throtteler.Post(Guid.NewGuid());
//...
throtteler.Complete();
await throtteler.Completion;
}
The above example is not good because I add all the items without knowing if they are actually being 'used' by the transform block. Also, I don't really care about the output buffer. I understand that I need to send it somewhere so I can await the completion, but I have no use for the buffer after that. So it should just forget about all it gets ...
Post() will return false if the target is full, without blocking. While this could be used in a busy-wait loop, it's wasteful. SendAsync(), on the other hand, will wait if the target is full:
public async Task ExampleTask()
{
var config = new ExecutionDataflowBlockOptions
{
BoundedCapacity = 50,
MaxDegreeOfParallelism = 15
};
var block = new ActionBlock<Guid>(TheActualAction, config);
while (/* some condition */)
{
var data=await GetDataFromCosmosDB();
await block.SendAsync(data);
//Wait a bit if we want to use polling
await Task.Delay(...);
}
block.Complete();
await block.Completion;
}
It seems you want to process data at a defined degree of parallelism (MaxDegreeOfParallelism = 15). TPL dataflow is very clunky to use for such a simple requirement.
There's a very simple and powerful pattern that might solve your problem. It's a parallel async foreach loop as described here: https://blogs.msdn.microsoft.com/pfxteam/2012/03/05/implementing-a-simple-foreachasync-part-2/
public static Task ForEachAsync<T>(this IEnumerable<T> source, int dop, Func<T, Task> body)
{
return Task.WhenAll(
from partition in Partitioner.Create(source).GetPartitions(dop)
select Task.Run(async delegate {
using (partition)
while (partition.MoveNext())
await body(partition.Current);
}));
}
You can then write something like:
var dataSource = ...; //some sequence
dataSource.ForEachAsync(15, async item => await ProcessItem(item));
Very simple.
You can dynamically reduce the DOP by using a SemaphoreSlim. The semaphore acts as a gate that only lets N concurrent threads/tasks in. N can be changed dynamically.
So you would use ForEachAsync as the basic workhorse and then add additional restrictions and throttling on top.
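A minimal sketch of that layering (reusing the ForEachAsync extension, dataSource, and ProcessItem from above; the gate's count can be adjusted at runtime):
var gate = new SemaphoreSlim(initialCount: 4); // effective DOP, changeable at runtime

await dataSource.ForEachAsync(15, async item =>
{
    await gate.WaitAsync();    // only N items proceed concurrently
    try
    {
        await ProcessItem(item);
    }
    finally
    {
        gate.Release();
    }
});

// Raise the effective DOP later with gate.Release(n);
// lower it by awaiting gate.WaitAsync() without a matching Release.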
