I have a program that downloads files from the Internet and processes them. The following is the code I have written to download a file on a background thread:
Task<File> re = Task.Factory.StartNew(() => { /*Download the File*/ });
re.ContinueWith((x) => { /*Do another function*/ });
I now want it to use only 10 threads for downloading. I have looked into the ParallelOptions.MaxDegreeOfParallelism property, but I can't understand how to use it when the task returns a result.
One good way to do that is to use the TPL Dataflow API. To use it, install the Microsoft.Tpl.Dataflow NuGet package (on newer frameworks the same library ships as System.Threading.Tasks.Dataflow).
Assuming that you have the following methods for downloading and processing data:
public async Task<DownloadResult> DownloadFile(string url)
{
    // Asynchronously download the file and return the result of the download.
    // You don't need a thread to download a file if you use an asynchronous API.
}

public ProcessingResult ProcessDownloadResult(DownloadResult download_result)
{
    // Synchronously process the download result and produce a ProcessingResult.
}
And assuming that you have a list of URLs that you want to download:
List<string> urls = new List<string>();
Then you can do the following with the DataFlow API:
TransformBlock<string, DownloadResult> download_block =
    new TransformBlock<string, DownloadResult>(
        url => DownloadFile(url),
        new ExecutionDataflowBlockOptions
        {
            // Only 10 asynchronous download operations
            // can be in flight at any point in time.
            MaxDegreeOfParallelism = 10
        });

TransformBlock<DownloadResult, ProcessingResult> process_block =
    new TransformBlock<DownloadResult, ProcessingResult>(
        dr => ProcessDownloadResult(dr),
        new ExecutionDataflowBlockOptions
        {
            // We limit the number of CPU-intensive operations
            // to the number of processors in the system.
            MaxDegreeOfParallelism = Environment.ProcessorCount
        });
download_block.LinkTo(process_block);
foreach (var url in urls)
{
    download_block.Post(url);
}
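Note that this snippet never signals completion, so there is no way to wait for the pipeline to drain. A minimal sketch of how you could do that, assuming you create the link above with PropagateCompletion:

download_block.LinkTo(process_block,
    new DataflowLinkOptions { PropagateCompletion = true });
// ... post the URLs as above ...
download_block.Complete();      // no more URLs will be posted
await process_block.Completion; // completes once every result is processed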
You can use something like:
Func<File> work = () => {
    // Do something
    File file = ...;
    return file;
};
var maxNoOfWorkers = 10;
IEnumerable<Task> tasks = Enumerable.Range(0, maxNoOfWorkers)
    .Select(s =>
    {
        var task = Task.Factory.StartNew<File>(work);
        return task.ContinueWith(ant => { /* do something else */ });
    });
This way, the TPL decides how many threads to take from the thread pool. If, however, you really want dedicated (non-thread-pool) threads, you can create them using:
IEnumerable<Task> tasks = Enumerable.Range(0, maxNoOfWorkers)
    .Select(s =>
    {
        var task = Task.Factory.StartNew<File>(
            work,
            CancellationToken.None,
            TaskCreationOptions.LongRunning,
            TaskScheduler.Default);
        return task.ContinueWith(ant => { /* do something else */ });
    });
Your other options would be to use PLINQ or Parallel.For/ForEach, both of which let you cap the degree of parallelism.
A PLINQ example can be:
Func<File> work = () => {
    // Do something
    File file = ...;
    return file;
};
var maxNoOfWorkers = 10;
ParallelEnumerable.Range(0, maxNoOfWorkers)
    .WithDegreeOfParallelism(maxNoOfWorkers)
    .ForAll(x => {
        var file = work();
        // Do something with file
    });
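And for completeness, a Parallel.ForEach sketch with the same cap (DownloadFileBlocking is a hypothetical synchronous download helper, not something from the question):

var urls = new List<string>(); // the URLs to download
Parallel.ForEach(urls,
    new ParallelOptions { MaxDegreeOfParallelism = 10 },
    url =>
    {
        // DownloadFileBlocking is assumed to be a blocking download method.
        var file = DownloadFileBlocking(url);
        // Do something with file
    });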
Of course I don't know the context of your example so you may need to adapt it to your requirement.
I am trying to upload files using Parallel.ForEachAsync. It works, but it loses the sort order. Is there any way to keep the sort order of the source and destination lists in sync?
await Parallel.ForEachAsync(model.DestinationFiles,
    new ParallelOptions { MaxDegreeOfParallelism = 20 }, async (file, cancellationToken) =>
    {
        var storeAsync = await _fileServerService.Init(displayUrl).StoreAsync(file.FileInfo, false, file.OutputFileName);
        convertResultDto.Files.Add(new ConverterConvertResultFile(storeAsync));
    });
Previously I used the LINQ parallel operator (PLINQ), which has the AsOrdered operator to deal with sorting. Anyway, I think Parallel.ForEachAsync is a better fit for async I/O scenarios?
var storeFiles = model.DestinationFiles.AsParallel().AsOrdered().WithDegreeOfParallelism(50)
    .Select(file => StoreAsync(file.FileInfo, false, file.OutputFileName).GetAwaiter().GetResult())
    .Select(storeFile => new StoreFile
    {
        FileId = storeFile.FileId,
        Url = storeFile.Url,
        OutputFileName = storeFile.OutputFileName,
        Size = storeFile.Size
    });
In this case, you want to get a set of results and store them in a resulting collection. Parallel is designed for operations that don't produce results. For operations with results, you can use PLINQ for CPU-bound operations or asynchronous concurrency for I/O-bound operations. Unfortunately, there isn't a PLINQ equivalent for Parallel.ForEachAsync, which would be the closest match for your current code.
Asynchronous concurrency uses Task.WhenAll to get the results of multiple asynchronous operations. It can also use SemaphoreSlim for throttling. Something like this:
var mutex = new SemaphoreSlim(20);
var results = await Task.WhenAll(model.DestinationFiles.Select(async file =>
{
    await mutex.WaitAsync();
    try
    {
        var storeAsync = await _fileServerService.Init(displayUrl).StoreAsync(file.FileInfo, false, file.OutputFileName);
        return new ConverterConvertResultFile(storeAsync);
    }
    finally { mutex.Release(); }
}));
convertResultDto.Files.AddRange(results);
However, if you have a mixture of CPU-bound and I/O-bound operations, then you'll probably want to continue to use ForEachAsync. In that case, you can create the entries in your destination collection first, then perform each operation with an index so it knows where to store them:
// This code assumes convertResultDto.Files is empty at this point.
var count = model.DestinationFiles.Count;
convertResultDto.Files.AddRange(Enumerable.Repeat<ConverterConvertResultFile>(null!, count));
await Parallel.ForEachAsync(
    model.DestinationFiles.Select((file, i) => (file, i)),
    new ParallelOptions { MaxDegreeOfParallelism = 20 },
    async (item, cancellationToken) =>
    {
        var (file, i) = item;
        var storeAsync = await _fileServerService.Init(displayUrl).StoreAsync(file.FileInfo, false, file.OutputFileName);
        convertResultDto.Files[i] = new ConverterConvertResultFile(storeAsync);
    });
I would like to run a bunch of async tasks, with a limit on how many tasks may be pending completion at any given time.
Say you have 1000 URLs, and you only want to have 50 requests open at a time; but as soon as one request completes, you open up a connection to the next URL in the list. That way, there are always exactly 50 connections open at a time, until the URL list is exhausted.
I also want to utilize a given number of threads if possible.
I came up with an extension method, ThrottleTasksAsync that does what I want. Is there a simpler solution already out there? I would assume that this is a common scenario.
Usage:
class Program
{
static void Main(string[] args)
{
Enumerable.Range(1, 10).ThrottleTasksAsync(5, 2, async i => { Console.WriteLine(i); return i; }).Wait();
Console.WriteLine("Press a key to exit...");
Console.ReadKey(true);
}
}
Here is the code:
static class IEnumerableExtensions
{
    public static async Task<Result_T[]> ThrottleTasksAsync<Enumerable_T, Result_T>(this IEnumerable<Enumerable_T> enumerable, int maxConcurrentTasks, int maxDegreeOfParallelism, Func<Enumerable_T, Task<Result_T>> taskToRun)
    {
        var blockingQueue = new BlockingCollection<Enumerable_T>(new ConcurrentBag<Enumerable_T>());
        var semaphore = new SemaphoreSlim(maxConcurrentTasks);
        // Run the throttler on a separate thread.
        var t = Task.Run(() =>
        {
            foreach (var item in enumerable)
            {
                // Wait for the semaphore
                semaphore.Wait();
                blockingQueue.Add(item);
            }
            blockingQueue.CompleteAdding();
        });
        var taskList = new List<Task<Result_T>>();
        Parallel.ForEach(IterateUntilTrue(() => blockingQueue.IsCompleted), new ParallelOptions { MaxDegreeOfParallelism = maxDegreeOfParallelism },
            _ =>
            {
                Enumerable_T item;
                if (blockingQueue.TryTake(out item, 100))
                {
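                    // Note: List<T>.Add is not thread-safe; with maxDegreeOfParallelism > 1,
                    // these concurrent adds should really be guarded by a lock.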
                    taskList.Add(
                        // Run the task
                        taskToRun(item)
                            .ContinueWith(tsk =>
                            {
                                // For effect
                                Thread.Sleep(2000);
                                // Release the semaphore
                                semaphore.Release();
                                return tsk.Result;
                            }));
                }
            });
        // Await all the tasks.
        return await Task.WhenAll(taskList);
    }

    static IEnumerable<bool> IterateUntilTrue(Func<bool> condition)
    {
        while (!condition()) yield return true;
    }
}
The method uses BlockingCollection and SemaphoreSlim to make it work. The throttler runs on one thread, and the async tasks are started from the Parallel.ForEach worker threads. To achieve parallelism, I added a maxDegreeOfParallelism parameter that's passed to a Parallel.ForEach loop re-purposed as a while loop.
The old version was:
foreach (var master in ...)
{
    var details = ...;
    Parallel.ForEach(details, detail => {
        // Process each detail record here
    }, new ParallelOptions { MaxDegreeOfParallelism = 15 });
    // Perform the final batch updates here
}
But, the thread pool gets exhausted fast, and you can't do async/await.
Bonus:
To get around the problem in BlockingCollection where an exception is thrown in Take() when CompleteAdding() is called, I'm using the TryTake overload with a timeout. If I didn't use the timeout in TryTake, it would defeat the purpose of using a BlockingCollection since TryTake won't block. Is there a better way? Ideally, there would be a TakeAsync method.
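For what it's worth, TPL Dataflow's BufferBlock<T> behaves like an awaitable BlockingCollection: SendAsync waits asynchronously when a bounded block is full, and ReceiveAsync waits asynchronously for the next item. A minimal sketch of the same producer/consumer handoff, assuming a single consumer:

var queue = new BufferBlock<Enumerable_T>(
    new DataflowBlockOptions { BoundedCapacity = maxConcurrentTasks });

// Producer: awaits instead of blocking when the queue is full.
foreach (var item in enumerable)
    await queue.SendAsync(item);
queue.Complete();

// Consumer: awaits instead of blocking when the queue is empty.
while (await queue.OutputAvailableAsync())
{
    var item = await queue.ReceiveAsync();
    // start taskToRun(item) here
}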
As suggested, use TPL Dataflow.
A TransformBlock<TInput, TOutput> may be what you're looking for.
You define a MaxDegreeOfParallelism to limit how many strings can be transformed (i.e., how many urls can be downloaded) in parallel. You then post urls to the block, and when you're done you tell the block you're done adding items and you fetch the responses.
var downloader = new TransformBlock<string, HttpResponse>(
url => Download(url),
new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 50 }
);
var buffer = new BufferBlock<HttpResponse>();
downloader.LinkTo(buffer);
foreach(var url in urls)
downloader.Post(url);
//or await downloader.SendAsync(url);
downloader.Complete();
await downloader.Completion;
IList<HttpResponse> responses;
if (buffer.TryReceiveAll(out responses))
{
//process responses
}
Note: The TransformBlock buffers both its input and output. Why, then, do we need to link it to a BufferBlock?
Because the TransformBlock won't complete until all items (HttpResponse) have been consumed, and await downloader.Completion would hang. Instead, we let the downloader forward all its output to a dedicated buffer block - then we wait for the downloader to complete, and inspect the buffer block.
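An alternative, for what it's worth, is to propagate completion into a collecting ActionBlock and await that instead of calling TryReceiveAll; a sketch (ActionBlock processes one item at a time by default, so the List<T>.Add is safe):

var responses = new List<HttpResponse>();
var collector = new ActionBlock<HttpResponse>(response => responses.Add(response));
downloader.LinkTo(collector, new DataflowLinkOptions { PropagateCompletion = true });

foreach (var url in urls)
    downloader.Post(url);

downloader.Complete();
await collector.Completion; // completes once every response has been collected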
Say you have 1000 URLs, and you only want to have 50 requests open at a time; but as soon as one request completes, you open up a connection to the next URL in the list. That way, there are always exactly 50 connections open at a time, until the URL list is exhausted.
The following simple solution has surfaced many times here on SO. It doesn't use blocking code and doesn't create threads explicitly, so it scales very well:
const int MAX_DOWNLOADS = 50;

static async Task DownloadAsync(string[] urls)
{
    using (var semaphore = new SemaphoreSlim(MAX_DOWNLOADS))
    using (var httpClient = new HttpClient())
    {
        var tasks = urls.Select(async url =>
        {
            await semaphore.WaitAsync();
            try
            {
                var data = await httpClient.GetStringAsync(url);
                Console.WriteLine(data);
            }
            finally
            {
                semaphore.Release();
            }
        });
        await Task.WhenAll(tasks);
    }
}
The thing is, the processing of the downloaded data should be done on a different pipeline, with a different level of parallelism, especially if the processing is CPU-bound.
E.g., you'd probably want to have 4 threads concurrently doing the data processing (the number of CPU cores), and up to 50 pending requests for more data (which do not use threads at all). AFAICT, this is not what your code is currently doing.
That's where TPL Dataflow or Rx may come in handy as a preferred solution. Yet it is certainly possible to implement something like this with plain TPL. Note that the only blocking code here is the part doing the actual data processing, inside Task.Run:
const int MAX_DOWNLOADS = 50;
const int MAX_PROCESSORS = 4;

// process data
class Processing
{
    SemaphoreSlim _semaphore = new SemaphoreSlim(MAX_PROCESSORS);
    HashSet<Task> _pending = new HashSet<Task>();
    object _lock = new Object();

    async Task ProcessAsync(string data)
    {
        await _semaphore.WaitAsync();
        try
        {
            await Task.Run(() =>
            {
                // simulate work
                Thread.Sleep(1000);
                Console.WriteLine(data);
            });
        }
        finally
        {
            _semaphore.Release();
        }
    }

    public async void QueueItemAsync(string data)
    {
        var task = ProcessAsync(data);
        lock (_lock)
            _pending.Add(task);
        try
        {
            await task;
        }
        catch
        {
            if (!task.IsCanceled && !task.IsFaulted)
                throw; // not the task's exception, rethrow
            // don't remove faulted/cancelled tasks from the list
            return;
        }
        // remove successfully completed tasks from the list
        lock (_lock)
            _pending.Remove(task);
    }

    public async Task WaitForCompleteAsync()
    {
        Task[] tasks;
        lock (_lock)
            tasks = _pending.ToArray();
        await Task.WhenAll(tasks);
    }
}

// download data
static async Task DownloadAsync(string[] urls)
{
    var processing = new Processing();
    using (var semaphore = new SemaphoreSlim(MAX_DOWNLOADS))
    using (var httpClient = new HttpClient())
    {
        var tasks = urls.Select(async (url) =>
        {
            await semaphore.WaitAsync();
            try
            {
                var data = await httpClient.GetStringAsync(url);
                // put the result on the processing pipeline
                processing.QueueItemAsync(data);
            }
            finally
            {
                semaphore.Release();
            }
        });
        await Task.WhenAll(tasks.ToArray());
        await processing.WaitForCompleteAsync();
    }
}
As requested, here's the code I ended up going with.
The work is set up in a master-detail configuration, and each master is processed as a batch. Each unit of work is queued up in this fashion:
var success = true;
// Start processing all the master records.
Master master;
while (null != (master = await StoredProcedures.ClaimRecordsAsync(...)))
{
await masterBuffer.SendAsync(master);
}
// Finished sending master records
masterBuffer.Complete();
// Now, wait for all the batches to complete.
await batchAction.Completion;
return success;
Masters are buffered one at a time to save work for other outside processes. The details for each master are dispatched for work via the masterTransform TransformManyBlock. A BatchedJoinBlock is also created to collect the details in one batch.
The actual work is done in the detailTransform TransformBlock, asynchronously, 150 at a time. BoundedCapacity is set to 300 to ensure that too many Masters don't get buffered at the beginning of the chain, while also leaving room for enough detail records to be queued to allow 150 records to be processed at one time. The block outputs an object to its targets, because it's filtered across the links depending on whether it's a Detail or Exception.
The batchAction ActionBlock collects the output from all the batches, and performs bulk database updates, error logging, etc. for each batch.
There will be several BatchedJoinBlocks, one for each master. Since each ISourceBlock is output sequentially and each batch only accepts the number of detail records associated with one master, the batches will be processed in order. Each block only outputs one group, and is unlinked on completion. Only the last batch block propagates its completion to the final ActionBlock.
The dataflow network:
// The dataflow network
BufferBlock<Master> masterBuffer = null;
TransformManyBlock<Master, Detail> masterTransform = null;
TransformBlock<Detail, object> detailTransform = null;
ActionBlock<Tuple<IList<object>, IList<object>>> batchAction = null;

// Buffer master records to enable efficient throttling.
masterBuffer = new BufferBlock<Master>(new DataflowBlockOptions { BoundedCapacity = 1 });

// Sequentially transform master records into a stream of detail records.
masterTransform = new TransformManyBlock<Master, Detail>(async masterRecord =>
{
    var records = await StoredProcedures.GetObjectsAsync(masterRecord);
    // Filter the master records based on some criteria here
    var filteredRecords = records;
    // Only propagate completion to the last batch
    var propagateCompletion = masterBuffer.Completion.IsCompleted && masterTransform.InputCount == 0;
    // Create a batch join block to encapsulate the results of the master record.
    var batchjoinblock = new BatchedJoinBlock<object, object>(records.Count(), new GroupingDataflowBlockOptions { MaxNumberOfGroups = 1 });
    // Add the batch block to the detail transform pipeline's link queue, and link the batch block to the batch action block.
    var detailLink1 = detailTransform.LinkTo(batchjoinblock.Target1, detailResult => detailResult is Detail);
    var detailLink2 = detailTransform.LinkTo(batchjoinblock.Target2, detailResult => detailResult is Exception);
    var batchLink = batchjoinblock.LinkTo(batchAction, new DataflowLinkOptions { PropagateCompletion = propagateCompletion });
    // Unlink batchjoinblock upon completion.
    // (The returned task does not need to be awaited, despite the warning.)
    batchjoinblock.Completion.ContinueWith(task =>
    {
        detailLink1.Dispose();
        detailLink2.Dispose();
        batchLink.Dispose();
    });
    return filteredRecords;
}, new ExecutionDataflowBlockOptions { BoundedCapacity = 1 });

// Process each detail record asynchronously, 150 at a time.
detailTransform = new TransformBlock<Detail, object>(async detail =>
{
    try
    {
        // Perform the action for each detail here asynchronously
        await DoSomethingAsync();
        return detail;
    }
    catch (Exception e)
    {
        success = false;
        return e;
    }
}, new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 150, BoundedCapacity = 300 });

// Perform the proper action for each batch
batchAction = new ActionBlock<Tuple<IList<object>, IList<object>>>(async batch =>
{
    var details = batch.Item1.Cast<Detail>();
    var errors = batch.Item2.Cast<Exception>();
    // Do something with the batch here
}, new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 4 });

masterBuffer.LinkTo(masterTransform, new DataflowLinkOptions { PropagateCompletion = true });
masterTransform.LinkTo(detailTransform, new DataflowLinkOptions { PropagateCompletion = true });
I have a list of URLs of pages I want to download concurrently using HttpClient. The list of URLs can be large (100 or more!).
I currently have this code:
var urls = new List<string>
{
    "http://www.amazon.com",
    "http://www.bing.com",
    "http://www.facebook.com",
    "http://www.twitter.com",
    "http://www.google.com"
};
var client = new HttpClient();
var contents = urls
    .ToObservable()
    .SelectMany(uri => client.GetStringAsync(new Uri(uri, UriKind.Absolute)));
contents.Subscribe(Console.WriteLine);
The problem: due to the usage of SelectMany, a big bunch of Tasks are created almost at the same time. It seems that if the list of URLs is big enough, a lot of Tasks time out (I'm getting "A task was cancelled" exceptions).
So, I thought there should be a way, maybe using some kind of Scheduler, to limit the number of concurrent Tasks, not allowing more than 5 or 6 at a given time.
This way I could get concurrent downloads without launching too many tasks that may stall, like they do right now.
How can I do that so I don't end up saturated with lots of timed-out Tasks?
Remember that SelectMany() is actually Select().Merge(). While SelectMany does not have a maxConcurrent parameter, Merge() does, so you can use that.
From your example, you can do this:
var urls = new List<string>
{
    "http://www.amazon.com",
    "http://www.bing.com",
    "http://www.facebook.com",
    "http://www.twitter.com",
    "http://www.google.com"
};
var client = new HttpClient();
var contents = urls
    .ToObservable()
    .Select(uri => Observable.FromAsync(() => client.GetStringAsync(uri)))
    .Merge(2); // 2 maximum concurrent requests!
contents.Subscribe(Console.WriteLine);
Here is an example of how you can do it with the TPL Dataflow API:
private static Task DoIt()
{
    var urls = new List<string>
    {
        "http://www.amazon.com",
        "http://www.bing.com",
        "http://www.facebook.com",
        "http://www.twitter.com",
        "http://www.google.com"
    };
    var client = new HttpClient();

    // Create a block that takes a URL as input
    // and produces the download result as output.
    TransformBlock<string, string> downloadBlock =
        new TransformBlock<string, string>(
            uri => client.GetStringAsync(new Uri(uri, UriKind.Absolute)),
            new ExecutionDataflowBlockOptions
            {
                // At most 2 download operations execute at the same time.
                MaxDegreeOfParallelism = 2
            });

    // Create a block that prints out the result.
    ActionBlock<string> doneBlock =
        new ActionBlock<string>(x => Console.WriteLine(x));

    // Link the output of the first block to the input of the second one.
    downloadBlock.LinkTo(
        doneBlock,
        new DataflowLinkOptions { PropagateCompletion = true });

    // Input the urls into the first block.
    foreach (var url in urls)
    {
        downloadBlock.Post(url);
    }
    downloadBlock.Complete(); // Mark completion of input.

    // Allows the consumer to wait for the whole operation to complete.
    return doneBlock.Completion;
}

static void Main(string[] args)
{
    DoIt().Wait();
    Console.WriteLine("Done");
    Console.ReadLine();
}
Can you see if this helps?
var urls = new List<string>
{
    "http://www.amazon.com",
    "http://www.bing.com",
    "http://www.google.com",
    "http://www.twitter.com",
    "http://www.google.com"
};
var contents = urls
    .ToObservable()
    .SelectMany(uri =>
        Observable.Using(
            () => new System.Net.Http.HttpClient(),
            client => client
                .GetStringAsync(new Uri(uri, UriKind.Absolute))
                .ToObservable()));
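Note that this variant still starts all the requests at once, because SelectMany subscribes to every inner observable immediately. If you also want the throttling from the earlier answer, you could combine Using with Merge; a sketch over the same urls list:

var contents = urls
    .ToObservable()
    .Select(uri => Observable.Using(
        () => new System.Net.Http.HttpClient(),
        client => client.GetStringAsync(new Uri(uri, UriKind.Absolute)).ToObservable()))
    .Merge(2); // at most 2 concurrent requests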
I was reading this post, Limit the number of parallel threads in C#, and trying to use it to send multiple files via FTP at the same time:
Perhaps something along the lines of:
ParallelOptions options = new ParallelOptions();
options.MaxDegreeOfParallelism = 4;
Then in your loop something like:
Parallel.Invoke(options,
    () => new WebClient().UploadFile("http://www.linqpad.net", "lp.html"),
    () => new WebClient().UploadFile("http://www.jaoo.dk", "jaoo.html"));
I am trying to add the files in my directory to the Invoke but I wasn't sure how to add them:
var dirlisting = Directory.GetFiles(zipdir, "*.*", SearchOption.TopDirectoryOnly);
if (!dirlisting.Any())
{
    Console.WriteLine("Error! No zipped files found!!");
    return;
}
foreach (var s in dirlisting)
{
    var thread = new Thread(() => FtpFile.SendFile(s));
    thread.Start();
}
I wasn't sure how to add them to the list of files to send. I only want 3 to go up at a time.
How do I add a thread for each file in the directory listing to be sent?
Something along these lines should do the trick
ParallelOptions o = new ParallelOptions();
o.MaxDegreeOfParallelism = 3;
Parallel.ForEach(dirlisting, o, (f) =>
{
Ftp.SendFile(f);
});
The Parallel.For family of APIs is for CPU-bound tasks. You have I/O-bound tasks here. Ideally, you should be using an asynchronous, Task-based I/O API.
HttpClient doesn't support FTP uploads, but you can use FtpWebRequest.GetRequestStreamAsync and Stream.WriteAsync. You could use WebClient.UploadFileAsync too, but you'd need to create a new WebClient instance for each UploadFileAsync call, as WebClient doesn't support multiple operations in parallel on the same instance.
Then you can use the TPL Dataflow library or just SemaphoreSlim to limit the level of parallelism. For examples, check "Throttling asynchronous tasks".
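A minimal sketch of that suggestion, throttling asynchronous FtpWebRequest uploads with SemaphoreSlim (the "ftp://host/" destination and the limit of 3 are assumptions taken from the question):

var mutex = new SemaphoreSlim(3); // the asker wants 3 uploads at a time
var uploads = dirlisting.Select(async path =>
{
    await mutex.WaitAsync();
    try
    {
        // "ftp://host/" is a placeholder for the real FTP destination.
        var request = (FtpWebRequest)WebRequest.Create("ftp://host/" + Path.GetFileName(path));
        request.Method = WebRequestMethods.Ftp.UploadFile;
        using (var source = File.OpenRead(path))
        using (var target = await request.GetRequestStreamAsync())
        {
            await source.CopyToAsync(target);
        }
    }
    finally
    {
        mutex.Release();
    }
});
await Task.WhenAll(uploads);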
These lines worked for me:
var options = new ParallelOptions { MaxDegreeOfParallelism = Environment.ProcessorCount * 10 };
Parallel.ForEach(dirlisting, options, (item) =>
{
    FtpFile.SendFile(item);
});