Given an input text file containing the URLs, I would like to download the corresponding files all at once. I use the answer to this question as a reference:
UserState using WebClient and TaskAsync download from Async CTP
public void Run()
{
List<string> urls = File.ReadAllLines(@"c:/temp/Input/input.txt").ToList();
int index = 0;
Task[] tasks = new Task[urls.Count()];
foreach (string url in urls)
{
WebClient wc = new WebClient();
string path = string.Format("{0}image-{1}.jpg", @"c:/temp/Output/", index + 1);
Task downloadTask = wc.DownloadFileTaskAsync(new Uri(url), path);
Task outputTask = downloadTask.ContinueWith(t => Output(path));
tasks[index] = outputTask;
index++;
}
Console.WriteLine("Start now");
Task.WhenAll(tasks);
Console.WriteLine("Done");
}
public void Output(string path)
{
Console.WriteLine(path);
}
I expected that the downloading of the files would begin at the point of "Task.WhenAll(tasks)". But it turns out that the output looks like this:
c:/temp/Output/image-2.jpg
c:/temp/Output/image-1.jpg
c:/temp/Output/image-4.jpg
c:/temp/Output/image-6.jpg
c:/temp/Output/image-3.jpg
[many lines deleted]
Start now
c:/temp/Output/image-18.jpg
c:/temp/Output/image-19.jpg
c:/temp/Output/image-20.jpg
c:/temp/Output/image-21.jpg
c:/temp/Output/image-23.jpg
[many lines deleted]
Done
Why does the downloading begin before WaitAll is called? What can I change to achieve what I would like (i.e. all tasks will begin at the same time)?
Thanks
Why does the downloading begin before WaitAll is called?
First of all, you're not calling Task.WaitAll, which blocks synchronously; you're calling Task.WhenAll, which returns an awaitable that should be awaited.
Now, as others said, when you call an async method, even without awaiting it, it fires off the asynchronous operation, because any method conforming to the TAP (Task-based Asynchronous Pattern) returns a "hot" task.
What can I change to achieve what I would like (i.e. all tasks will
begin at the same time)?
Now, if you want to defer execution until Task.WhenAll, you can use Enumerable.Select to lazily project each element to a Task; the tasks aren't created (and started) until the sequence is enumerated, which happens when you pass it to Task.WhenAll:
public async Task RunAsync()
{
IEnumerable<string> urls = File.ReadAllLines(@"c:/temp/Input/input.txt");
var urlTasks = urls.Select((url, index) =>
{
WebClient wc = new WebClient();
string path = string.Format("{0}image-{1}.jpg", @"c:/temp/Output/", index);
var downloadTask = wc.DownloadFileTaskAsync(new Uri(url), path);
Output(path);
return downloadTask;
});
Console.WriteLine("Start now");
await Task.WhenAll(urlTasks);
Console.WriteLine("Done");
}
Why does the downloading begin before WaitAll is called?
Because:
Tasks created by its public constructors are referred to as “cold”
tasks, in that they begin their life cycle in the non-scheduled
TaskStatus.Created state, and it’s not until Start is called on these
instances that they progress to being scheduled. All other tasks begin
their life cycle in a “hot” state, meaning that the asynchronous
operations they represent have already been initiated and their
TaskStatus is an enumeration value other than Created. All tasks
returned from TAP methods must be “hot.”
Since DownloadFileTaskAsync is a TAP method, it returns a "hot" (that is, already started) task.
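For illustration, here is a minimal standalone sketch (not from the question; it assumes an async context) contrasting a "cold" task created through the constructor with the "hot" task you get back from a TAP-style call:

// "Cold" task: constructed but not scheduled until Start() is called.
var cold = new Task(() => Console.WriteLine("cold task running"));
Console.WriteLine(cold.Status);  // Created
cold.Start();

// "Hot" task: a TAP-style call (Task.Run used here as a stand-in) returns a task
// whose operation is already in flight.
var hot = Task.Run(() => Console.WriteLine("hot task running"));
Console.WriteLine(hot.Status);   // WaitingToRun/Running - anything but Created

await Task.WhenAll(cold, hot);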
What can I change to achieve what I would like (i.e. all tasks will begin at the same time)?
I'd look at TPL Dataflow. Something like this (I've used HttpClient instead of WebClient, but it doesn't really matter):
static async Task DownloadData(IEnumerable<string> urls)
{
// we want to execute this in parallel
var executionOptions = new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = Environment.ProcessorCount };
// this block will receive a URL and download the content it points to
var downloadBlock = new TransformBlock<string, Tuple<string, string>>(async url =>
{
using (var client = new HttpClient())
{
var content = await client.GetStringAsync(url);
return Tuple.Create(url, content);
}
}, executionOptions);
// this block will print number of bytes downloaded
var outputBlock = new ActionBlock<Tuple<string, string>>(tuple =>
{
Console.WriteLine($"Downloaded {(string.IsNullOrEmpty(tuple.Item2) ? 0 : tuple.Item2.Length)} bytes from {tuple.Item1}");
}, executionOptions);
// here we link downloadBlock to outputBlock: every item produced by
// downloadBlock will be passed on to outputBlock
using (downloadBlock.LinkTo(outputBlock))
{
// fill downloadBlock with input data
foreach (var url in urls)
{
await downloadBlock.SendAsync(url);
}
// tell downloadBlock that no more items will be added
downloadBlock.Complete();
// wait until all downloads have finished
await downloadBlock.Completion;
// tell outputBlock that no more items will arrive
outputBlock.Complete();
// wait until all output has been printed
await outputBlock.Completion;
}
}
static void Main(string[] args)
{
var urls = new[]
{
"http://www.microsoft.com",
"http://www.google.com",
"http://stackoverflow.com",
"http://www.amazon.com",
"http://www.asp.net"
};
Console.WriteLine("Start now.");
DownloadData(urls).Wait();
Console.WriteLine("Done.");
Console.ReadLine();
}
Output:
Start now.
Downloaded 1020 bytes from http://www.microsoft.com
Downloaded 53108 bytes from http://www.google.com
Downloaded 244143 bytes from http://stackoverflow.com
Downloaded 468922 bytes from http://www.amazon.com
Downloaded 27771 bytes from http://www.asp.net
Done.
What can I change to achieve what I would like (i.e. all tasks will
begin at the same time)?
To synchronize the beginning of the downloads you could use the Barrier class.
public void Run()
{
List<string> urls = File.ReadAllLines(@"c:/temp/Input/input.txt").ToList();
Barrier barrier = new Barrier(urls.Count, b => { Console.WriteLine("Start now"); });
Task[] tasks = new Task[urls.Count];
Parallel.For(0, urls.Count, (int index) =>
{
string path = string.Format("{0}image-{1}.jpg", @"c:/temp/Output/", index + 1);
tasks[index] = DownloadAsync(new Uri(urls[index]), path, barrier);
});
Task.WaitAll(tasks); // wait for completion
Console.WriteLine("Done");
}
async Task DownloadAsync(Uri url, string path, Barrier barrier)
{
using (WebClient wc = new WebClient())
{
barrier.SignalAndWait();
await wc.DownloadFileTaskAsync(url, path);
Output(path);
}
}
Related
I have a list of URLs (thousands), and I want to asynchronously get the page data from each URL as fast as possible without putting extreme load on the CPU.
I have tried using threading, but it still feels quite slow:
public static ConcurrentQueue<string> List = new ConcurrentQueue<string>(); //URL List (assume I added them already)
public static void Threading()
{
for(int i=0;i<100;i++) //100 threads
{
Thread thread = new Thread(new ThreadStart(Task));
thread.Start();
}
}
public static void Task()
{
while (!List.IsEmpty)
{
List.TryDequeue(out string URL);
//GET REQUEST HERE
}
}
Is there any better way to do this? I want to do this asynchronously but I can't figure out how to do it, and I don't want to sacrifice speed or CPU efficiency to do so.
Thanks :)
You should use Microsoft's Reactive Framework (aka Rx) - NuGet System.Reactive and add using System.Reactive.Linq; - then you can do this:
public static IObservable<(string url, string content)> GetAllUrls(List<string> urls) =>
Observable
.Using(
() => new HttpClient(),
hc =>
from url in urls.ToObservable()
from response in Observable.FromAsync(() => hc.GetAsync(url))
from content in Observable.FromAsync(() => response.Content.ReadAsStringAsync())
select (url, content));
That allows you to consume the results in a couple of ways.
You can process them as they get produced:
IDisposable subscription =
GetAllUrls(urlsx).Subscribe(x => Console.WriteLine(x.content));
Or you can get all of them produced and then await the full results:
(string url, string content)[] results = await GetAllUrls(urlsx).ToArray();
You are best off using HttpClient which allows async Task requests.
Just store each task in a list, and await the whole list. To prevent too many requests at once, wait for any single one to complete if there are too many, and remove the completed one from the list.
const int maxDegreeOfParallelism = 100;
static HttpClient _client = new HttpClient();
public static async Task GetAllUrls(List<string> urls)
{
var tasks = new List<Task>(urls.Count);
foreach (var url in urls)
{
if (tasks.Count == maxDegreeOfParallelism) // this prevents too many requests at once
tasks.Remove(await Task.WhenAny(tasks));
tasks.Add(GetUrl(url));
}
await Task.WhenAll(tasks);
}
private static async Task GetUrl(string url)
{
using var response = await _client.GetAsync(url);
// handle response here
var responseStr = await response.Content.ReadAsStringAsync(); // whatever
// do stuff etc
}
I have a question regarding Task.WaitAll. At first I tried to use async/await to get something like this:
private async Task ReadImagesAsync(string HTMLtag)
{
await Task.Run(() =>
{
ReadImages(HTMLtag);
});
}
The content of this function doesn't matter; it works synchronously and is completely independent of the outside world.
I use it like this:
private void Execute()
{
string tags = ConfigurationManager.AppSettings["HTMLTags"];
var cursor = Mouse.OverrideCursor;
Mouse.OverrideCursor = System.Windows.Input.Cursors.Wait;
List<Task> tasks = new List<Task>();
foreach (string tag in tags.Split(';'))
{
tasks.Add(ReadImagesAsync(tag));
//tasks.Add(Task.Run(() => ReadImages(tag)));
}
Task.WaitAll(tasks.ToArray());
Mouse.OverrideCursor = cursor;
}
Unfortunately I get a deadlock on Task.WaitAll if I use it that way (with async/await). My functions do their jobs (so they are executed properly), but Task.WaitAll just stays there forever because apparently ReadImagesAsync doesn't return to the caller.
The commented-out line is the approach that actually works correctly. If I comment out the tasks.Add(ReadImagesAsync(tag)); line and use tasks.Add(Task.Run(() => ReadImages(tag))); instead, everything works well.
What am I missing here?
The ReadImages method looks like this:
private void ReadImages (string HTMLtag)
{
string section = HTMLtag.Split(':')[0];
string tag = HTMLtag.Split(':')[1];
List<string> UsedAdresses = new List<string>();
var webClient = new WebClient();
string page = webClient.DownloadString(Link);
var siteParsed = Link.Split('/');
string site = $"{siteParsed[0]} + // + {siteParsed[1]} + {siteParsed[2]}";
int.TryParse(MinHeight, out int minHeight);
int.TryParse(MinWidth, out int minWidth);
int index = 0;
while (index < page.Length)
{
int startSection = page.IndexOf("<" + section, index);
if (startSection < 0)
break;
int endSection = page.IndexOf(">", startSection) + 1;
index = endSection;
string imgSection = page.Substring(startSection, endSection - startSection);
int imgLinkStart = imgSection.IndexOf(tag + "=\"") + tag.Length + 2;
if (imgLinkStart < 0 || imgLinkStart > imgSection.Length)
continue;
int imgLinkEnd = imgSection.IndexOf("\"", imgLinkStart);
if (imgLinkEnd < 0)
continue;
string imgAdress = imgSection.Substring(imgLinkStart, imgLinkEnd - imgLinkStart);
string format = null;
foreach (var imgFormat in ConfigurationManager.AppSettings["ImgFormats"].Split(';'))
{
if (imgAdress.IndexOf(imgFormat) > 0)
{
format = imgFormat;
break;
}
}
// not an image
if (format == null)
continue;
// some internal resource, but we can try to get it anyways
if (!imgAdress.StartsWith("http"))
imgAdress = site + imgAdress;
string imgName = imgAdress.Split('/').Last();
if (!UsedAdresses.Contains(imgAdress))
{
try
{
Bitmap pic = new Bitmap(webClient.OpenRead(imgAdress));
if (pic.Width > minHeight && pic.Height > minWidth)
webClient.DownloadFile(imgAdress, SaveAdress + "\\" + imgName);
}
catch { }
finally
{
UsedAdresses.Add(imgAdress);
}
}
}
}
You are synchronously waiting for the tasks to finish. This is not going to work in WPF without a little bit of ConfigureAwait(false) magic. Here is a better solution:
private async Task Execute()
{
string tags = ConfigurationManager.AppSettings["HTMLTags"];
var cursor = Mouse.OverrideCursor;
Mouse.OverrideCursor = System.Windows.Input.Cursors.Wait;
List<Task> tasks = new List<Task>();
foreach (string tag in tags.Split(';'))
{
tasks.Add(ReadImagesAsync(tag));
//tasks.Add(Task.Run(() => ReadImages(tag)));
}
await Task.WhenAll(tasks.ToArray());
Mouse.OverrideCursor = cursor;
}
If this is WPF, then I'm sure you would call it when some kind of event happens. The way you should call this method is from an event handler, e.g.:
private async void OnWindowOpened(object sender, EventArgs args)
{
await Execute();
}
Looking at the edited version of your question, I can see that you can in fact make it all very nice and clean by using the awaitable DownloadStringTaskAsync:
private async Task ReadImages (string HTMLtag)
{
string section = HTMLtag.Split(':')[0];
string tag = HTMLtag.Split(':')[1];
List<string> UsedAdresses = new List<string>();
var webClient = new WebClient();
string page = await webClient.DownloadStringTaskAsync(Link);
//...
}
Now, what's the deal with tasks.Add(Task.Run(() => ReadImages(tag)));?
This requires knowledge of SynchronizationContext. When you await a task, the context of the thread that scheduled it (here, the UI thread) is captured so that execution can come back to it when the await finishes. When you call the method without Task.Run, you are saying "I want to come back to the UI thread". That is not possible, because the UI thread is already blocked waiting for the task, so each is waiting for the other. When you add another task to the mix, you are saying: "the UI thread must schedule an 'outer' task, which schedules another, 'inner' task, and that inner task is what I will come back to."
Use WhenAll instead of WaitAll: turn your Execute into an async Task method and await the task returned by Task.WhenAll.
This way it never blocks on asynchronous code.
I found some more detailed articles explaining why the deadlock actually happened:
https://medium.com/bynder-tech/c-why-you-should-use-configureawait-false-in-your-library-code-d7837dce3d7f
https://blog.stephencleary.com/2012/07/dont-block-on-async-code.html
The short answer was making a small change to my async method so it looks like this:
private async Task ReadImagesAsync(string HTMLtag)
{
await Task.Run(() =>
{
ReadImages(HTMLtag);
}).ConfigureAwait(false);
}
Yup. That's it. Suddenly it doesn't deadlock. But these two articles plus @FCin's response explain WHY it actually happened.
It's like you are saying "I do not care when ReadImagesAsync() finishes", but then you still have to wait for it. Here is a definition:
The Task.WaitAll blocks the current thread until all other tasks have completed execution.
The Task.WhenAll method is used to create a task that will complete if and only if all the other tasks are complete.
So, if you are using Task.WhenAll you get a task object that isn't complete yet. However, it will not block, and it allows the program to keep executing. By contrast, the Task.WaitAll method call actually blocks and waits for all of the tasks to complete.
Essentially, Task.WhenAll gives you a task that isn't complete yet, but you can use ContinueWith to run code as soon as the specified tasks have finished their execution. Note that neither Task.WhenAll nor Task.WaitAll runs the tasks, i.e., no tasks are started by either of these methods.
Task.WhenAll(taskList).ContinueWith(t => {
// write your code here
});
I would like to run a bunch of async tasks, with a limit on how many tasks may be pending completion at any given time.
Say you have 1000 URLs, and you only want to have 50 requests open at a time; but as soon as one request completes, you open up a connection to the next URL in the list. That way, there are always exactly 50 connections open at a time, until the URL list is exhausted.
I also want to utilize a given number of threads if possible.
I came up with an extension method, ThrottleTasksAsync that does what I want. Is there a simpler solution already out there? I would assume that this is a common scenario.
Usage:
class Program
{
static void Main(string[] args)
{
Enumerable.Range(1, 10).ThrottleTasksAsync(5, 2, async i => { Console.WriteLine(i); return i; }).Wait();
Console.WriteLine("Press a key to exit...");
Console.ReadKey(true);
}
}
Here is the code:
static class IEnumerableExtensions
{
public static async Task<Result_T[]> ThrottleTasksAsync<Enumerable_T, Result_T>(this IEnumerable<Enumerable_T> enumerable, int maxConcurrentTasks, int maxDegreeOfParallelism, Func<Enumerable_T, Task<Result_T>> taskToRun)
{
var blockingQueue = new BlockingCollection<Enumerable_T>(new ConcurrentBag<Enumerable_T>());
var semaphore = new SemaphoreSlim(maxConcurrentTasks);
// Run the throttler on a separate thread.
var t = Task.Run(() =>
{
foreach (var item in enumerable)
{
// Wait for the semaphore
semaphore.Wait();
blockingQueue.Add(item);
}
blockingQueue.CompleteAdding();
});
var taskList = new List<Task<Result_T>>();
Parallel.ForEach(IterateUntilTrue(() => blockingQueue.IsCompleted), new ParallelOptions { MaxDegreeOfParallelism = maxDegreeOfParallelism },
_ =>
{
Enumerable_T item;
if (blockingQueue.TryTake(out item, 100))
{
taskList.Add(
// Run the task
taskToRun(item)
.ContinueWith(tsk =>
{
// For effect
Thread.Sleep(2000);
// Release the semaphore
semaphore.Release();
return tsk.Result;
}
)
);
}
});
// Await all the tasks.
return await Task.WhenAll(taskList);
}
static IEnumerable<bool> IterateUntilTrue(Func<bool> condition)
{
while (!condition()) yield return true;
}
}
The method utilizes BlockingCollection and SemaphoreSlim to make it work. The throttler is run on one thread, and all the async tasks are run on the other thread. To achieve parallelism, I added a maxDegreeOfParallelism parameter that's passed to a Parallel.ForEach loop re-purposed as a while loop.
The old version was:
foreach (var master = ...)
{
var details = ...;
Parallel.ForEach(details, detail => {
// Process each detail record here
}, new ParallelOptions { MaxDegreeOfParallelism = 15 });
// Perform the final batch updates here
}
But, the thread pool gets exhausted fast, and you can't do async/await.
Bonus:
To get around the problem in BlockingCollection where an exception is thrown in Take() when CompleteAdding() is called, I'm using the TryTake overload with a timeout. If I didn't use the timeout in TryTake, it would defeat the purpose of using a BlockingCollection since TryTake won't block. Is there a better way? Ideally, there would be a TakeAsync method.
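For what it's worth, TPL Dataflow's BufferBlock<T> does provide the awaitable take asked for here, via OutputAvailableAsync and ReceiveAsync. A minimal single-consumer sketch (not part of the extension method above, just an illustration of the pattern):

using System;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

static async Task ConsumeAsync(BufferBlock<int> buffer)
{
    // OutputAvailableAsync returns false once the block is completed and drained,
    // so there is no exception-based signalling as with BlockingCollection.Take().
    while (await buffer.OutputAvailableAsync())
    {
        int item = await buffer.ReceiveAsync(); // the awaitable "TakeAsync"
        Console.WriteLine(item);
    }
}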
As suggested, use TPL Dataflow.
A TransformBlock<TInput, TOutput> may be what you're looking for.
You define a MaxDegreeOfParallelism to limit how many strings can be transformed (i.e., how many urls can be downloaded) in parallel. You then post urls to the block, and when you're done you tell the block you're done adding items and you fetch the responses.
var downloader = new TransformBlock<string, HttpResponse>(
url => Download(url),
new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 50 }
);
var buffer = new BufferBlock<HttpResponse>();
downloader.LinkTo(buffer);
foreach(var url in urls)
downloader.Post(url);
//or await downloader.SendAsync(url);
downloader.Complete();
await downloader.Completion;
IList<HttpResponse> responses;
if (buffer.TryReceiveAll(out responses))
{
//process responses
}
Note: The TransformBlock buffers both its input and output. Why, then, do we need to link it to a BufferBlock?
Because the TransformBlock won't complete until all items (HttpResponse) have been consumed, and await downloader.Completion would hang. Instead, we let the downloader forward all its output to a dedicated buffer block - then we wait for the downloader to complete, and inspect the buffer block.
Say you have 1000 URLs, and you only want to have 50 requests open at
a time; but as soon as one request completes, you open up a connection
to the next URL in the list. That way, there are always exactly 50
connections open at a time, until the URL list is exhausted.
The following simple solution has surfaced many times here on SO. It doesn't use blocking code and doesn't create threads explicitly, so it scales very well:
const int MAX_DOWNLOADS = 50;
static async Task DownloadAsync(string[] urls)
{
using (var semaphore = new SemaphoreSlim(MAX_DOWNLOADS))
using (var httpClient = new HttpClient())
{
var tasks = urls.Select(async url =>
{
await semaphore.WaitAsync();
try
{
var data = await httpClient.GetStringAsync(url);
Console.WriteLine(data);
}
finally
{
semaphore.Release();
}
});
await Task.WhenAll(tasks);
}
}
The thing is, the processing of the downloaded data should be done on a different pipeline, with a different level of parallelism, especially if it's CPU-bound processing.
E.g., you'd probably want to have 4 threads concurrently doing the data processing (the number of CPU cores), and up to 50 pending requests for more data (which do not use threads at all). AFAICT, this is not what your code is currently doing.
That's where TPL Dataflow or Rx may come in handy as a preferred solution. Yet it is certainly possible to implement something like this with plain TPL. Note, the only blocking code here is the one doing the actual data processing inside Task.Run:
const int MAX_DOWNLOADS = 50;
const int MAX_PROCESSORS = 4;
// process data
class Processing
{
SemaphoreSlim _semaphore = new SemaphoreSlim(MAX_PROCESSORS);
HashSet<Task> _pending = new HashSet<Task>();
object _lock = new Object();
async Task ProcessAsync(string data)
{
await _semaphore.WaitAsync();
try
{
await Task.Run(() =>
{
// simulate work
Thread.Sleep(1000);
Console.WriteLine(data);
});
}
finally
{
_semaphore.Release();
}
}
public async void QueueItemAsync(string data)
{
var task = ProcessAsync(data);
lock (_lock)
_pending.Add(task);
try
{
await task;
}
catch
{
if (!task.IsCanceled && !task.IsFaulted)
throw; // not the task's exception, rethrow
// don't remove faulted/cancelled tasks from the list
return;
}
// remove successfully completed tasks from the list
lock (_lock)
_pending.Remove(task);
}
public async Task WaitForCompleteAsync()
{
Task[] tasks;
lock (_lock)
tasks = _pending.ToArray();
await Task.WhenAll(tasks);
}
}
// download data
static async Task DownloadAsync(string[] urls)
{
var processing = new Processing();
using (var semaphore = new SemaphoreSlim(MAX_DOWNLOADS))
using (var httpClient = new HttpClient())
{
var tasks = urls.Select(async (url) =>
{
await semaphore.WaitAsync();
try
{
var data = await httpClient.GetStringAsync(url);
// put the result on the processing pipeline
processing.QueueItemAsync(data);
}
finally
{
semaphore.Release();
}
});
await Task.WhenAll(tasks.ToArray());
await processing.WaitForCompleteAsync();
}
}
As requested, here's the code I ended up going with.
The work is set up in a master-detail configuration, and each master is processed as a batch. Each unit of work is queued up in this fashion:
var success = true;
// Start processing all the master records.
Master master;
while (null != (master = await StoredProcedures.ClaimRecordsAsync(...)))
{
await masterBuffer.SendAsync(master);
}
// Finished sending master records
masterBuffer.Complete();
// Now, wait for all the batches to complete.
await batchAction.Completion;
return success;
Masters are buffered one at a time to save work for other outside processes. The details for each master are dispatched for work via the masterTransform TransformManyBlock. A BatchedJoinBlock is also created to collect the details in one batch.
The actual work is done in the detailTransform TransformBlock, asynchronously, 150 at a time. BoundedCapacity is set to 300 to ensure that too many Masters don't get buffered at the beginning of the chain, while also leaving room for enough detail records to be queued to allow 150 records to be processed at one time. The block outputs an object to its targets, because it's filtered across the links depending on whether it's a Detail or Exception.
The batchAction ActionBlock collects the output from all the batches, and performs bulk database updates, error logging, etc. for each batch.
There will be several BatchedJoinBlocks, one for each master. Since each ISourceBlock is output sequentially and each batch only accepts the number of detail records associated with one master, the batches will be processed in order. Each block only outputs one group, and is unlinked on completion. Only the last batch block propagates its completion to the final ActionBlock.
The dataflow network:
// The dataflow network
BufferBlock<Master> masterBuffer = null;
TransformManyBlock<Master, Detail> masterTransform = null;
TransformBlock<Detail, object> detailTransform = null;
ActionBlock<Tuple<IList<object>, IList<object>>> batchAction = null;
// Buffer master records to enable efficient throttling.
masterBuffer = new BufferBlock<Master>(new DataflowBlockOptions { BoundedCapacity = 1 });
// Sequentially transform master records into a stream of detail records.
masterTransform = new TransformManyBlock<Master, Detail>(async masterRecord =>
{
var records = await StoredProcedures.GetObjectsAsync(masterRecord);
// Filter the master records based on some criteria here
var filteredRecords = records;
// Only propagate completion to the last batch
var propagateCompletion = masterBuffer.Completion.IsCompleted && masterTransform.InputCount == 0;
// Create a batch join block to encapsulate the results of the master record.
var batchjoinblock = new BatchedJoinBlock<object, object>(records.Count(), new GroupingDataflowBlockOptions { MaxNumberOfGroups = 1 });
// Add the batch block to the detail transform pipeline's link queue, and link the batch block to the batch action block.
var detailLink1 = detailTransform.LinkTo(batchjoinblock.Target1, detailResult => detailResult is Detail);
var detailLink2 = detailTransform.LinkTo(batchjoinblock.Target2, detailResult => detailResult is Exception);
var batchLink = batchjoinblock.LinkTo(batchAction, new DataflowLinkOptions { PropagateCompletion = propagateCompletion });
// Unlink batchjoinblock upon completion.
// (the returned task does not need to be awaited, despite the warning.)
batchjoinblock.Completion.ContinueWith(task =>
{
detailLink1.Dispose();
detailLink2.Dispose();
batchLink.Dispose();
});
return filteredRecords;
}, new ExecutionDataflowBlockOptions { BoundedCapacity = 1 });
// Process each detail record asynchronously, 150 at a time.
detailTransform = new TransformBlock<Detail, object>(async detail => {
try
{
// Perform the action for each detail here asynchronously
await DoSomethingAsync();
return detail;
}
catch (Exception e)
{
success = false;
return e;
}
}, new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 150, BoundedCapacity = 300 });
// Perform the proper action for each batch
batchAction = new ActionBlock<Tuple<IList<object>, IList<object>>>(async batch =>
{
var details = batch.Item1.Cast<Detail>();
var errors = batch.Item2.Cast<Exception>();
// Do something with the batch here
}, new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 4 });
masterBuffer.LinkTo(masterTransform, new DataflowLinkOptions { PropagateCompletion = true });
masterTransform.LinkTo(detailTransform, new DataflowLinkOptions { PropagateCompletion = true });
I am describing my problem with a simple example first, and then with a closer-to-real one.
Imagine we have n items [i1, i2, i3, i4, ..., in] in box1, and a box2 that can work on m items at a time (m is usually much less than n). The time required for each item is different. I want box2 to always be working on m items until all items have been processed.
A closer-to-real problem: you have a list1 of n strings (URL addresses) of files, and we want a system that downloads m files concurrently (for example via the HttpClient.GetAsync() method). Whenever one of the m downloads finishes, another remaining item from list1 must be substituted as soon as possible, and this must continue until all items of list1 have been processed.
(The numbers n and m are specified by user input at runtime.)
How can this be done?
Here is a generic method you can use.
When you call this, TIn will be string (the URL address) and asyncProcessor will be your async method that takes the URL address as input and returns a Task.
The SemaphoreSlim used by this method allows only n concurrent async I/O requests at a time; as soon as one completes, the next request starts. Something like a sliding window pattern.
public static Task ForEachAsync<TIn>(
IEnumerable<TIn> inputEnumerable,
Func<TIn, Task> asyncProcessor,
int? maxDegreeOfParallelism = null)
{
int maxAsyncThreadCount = maxDegreeOfParallelism ?? DefaultMaxDegreeOfParallelism;
SemaphoreSlim throttler = new SemaphoreSlim(maxAsyncThreadCount, maxAsyncThreadCount);
IEnumerable<Task> tasks = inputEnumerable.Select(async input =>
{
await throttler.WaitAsync().ConfigureAwait(false);
try
{
await asyncProcessor(input).ConfigureAwait(false);
}
finally
{
throttler.Release();
}
});
return Task.WhenAll(tasks);
}
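For reference, a hypothetical call site could look like this (the URL list, the shared HttpClient, and the limit of 10 are made-up values; it assumes you are calling from an async method in the same class):

var client = new HttpClient();
var urls = new[] { "http://example.com/a", "http://example.com/b" };

await ForEachAsync(urls, async url =>
{
    var content = await client.GetStringAsync(url);
    Console.WriteLine($"{url}: {content.Length} chars");
}, maxDegreeOfParallelism: 10);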
You should look into TPL Dataflow. Add the System.Threading.Tasks.Dataflow NuGet package to your project, and then what you want is as simple as:
private static HttpClient _client = new HttpClient();
public async Task<List<MyClass>> ProcessDownloads(IEnumerable<string> uris,
int concurrentDownloads)
{
var result = new List<MyClass>();
var downloadData = new TransformBlock<string, string>(async uri =>
{
return await _client.GetStringAsync(uri); //GetStringAsync is a thread safe method.
}, new ExecutionDataflowBlockOptions{MaxDegreeOfParallelism = concurrentDownloads});
var processData = new TransformBlock<string, MyClass>(
json => JsonConvert.DeserializeObject<MyClass>(json),
new ExecutionDataflowBlockOptions {MaxDegreeOfParallelism = DataflowBlockOptions.Unbounded});
var collectData = new ActionBlock<MyClass>(
data => result.Add(data)); //When you don't specify options, dataflow processes items one at a time.
//Set up the chain of blocks, have each block call `.Complete()` on the next block when it finishes processing its last item.
downloadData.LinkTo(processData, new DataflowLinkOptions {PropagateCompletion = true});
processData.LinkTo(collectData, new DataflowLinkOptions {PropagateCompletion = true});
//Load the data in to the first transform block to start off the process.
foreach (var uri in uris)
{
await downloadData.SendAsync(uri).ConfigureAwait(false);
}
downloadData.Complete(); //Signal you are done adding data.
//Wait for the last object to be added to the list.
await collectData.Completion.ConfigureAwait(false);
return result;
}
In the above code, only concurrentDownloads downloads will be in flight at any given time, an unbounded number of thread-pool threads will be processing the received strings and turning them into objects, and a single thread will be taking those objects and adding them to a list.
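For illustration, a hypothetical call site could look like this (the URL list is made up; MyClass comes from the code above, and the call is assumed to happen from an async method in the same class):

var uris = new[] { "https://example.com/1.json", "https://example.com/2.json" };
List<MyClass> items = await ProcessDownloads(uris, concurrentDownloads: 4);
Console.WriteLine($"Downloaded and deserialized {items.Count} items");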
UPDATE: here is a simplified example that only does what you asked for in the question
private static HttpClient _client = new HttpClient();
public void ProcessDownloads(IEnumerable<string> uris, int concurrentDownloads)
{
var downloadData = new ActionBlock<string>(async uri =>
{
var response = await _client.GetAsync(uri); //GetAsync is a thread safe method.
//do something with response here.
}, new ExecutionDataflowBlockOptions{MaxDegreeOfParallelism = concurrentDownloads});
foreach (var uri in uris)
{
downloadData.Post(uri);
}
downloadData.Complete();
downloadData.Completion.Wait();
}
A simple solution for throttling is a SemaphoreSlim.
EDIT
After a slight alteration the code now creates the tasks when they are needed
var client = new HttpClient();
SemaphoreSlim semaphore = new SemaphoreSlim(m, m); //set the max here
var tasks = new List<Task>();
foreach(var url in urls)
{
// moving the wait here throttles the foreach loop
await semaphore.WaitAsync();
tasks.Add(((Func<Task>)(async () =>
{
//await semaphore.WaitAsync();
var response = await client.GetAsync(url); // possibly ConfigureAwait(false) here
// do something with response
semaphore.Release();
}))());
}
await Task.WhenAll(tasks);
This is another way to do it
var client = new HttpClient();
var tasks = new HashSet<Task>();
foreach(var url in urls)
{
if(tasks.Count == m)
{
tasks.Remove(await Task.WhenAny(tasks));
}
tasks.Add(((Func<Task>)(async () =>
{
var response = await client.GetAsync(url); // possibly ConfigureAwait(false) here
// do something with response
}))());
}
await Task.WhenAll(tasks);
Process items in parallel, limiting the number of simultaneous jobs:
string[] strings = GetStrings(); // Items to process.
const int m = 2; // Max simultaneous jobs.
Parallel.ForEach(strings, new ParallelOptions {MaxDegreeOfParallelism = m}, s =>
{
DoWork(s);
});
Update
Added the missing code for adding in taskList
I have a list of tasks that I await on...
var files = Directory.GetFiles(myFilesDirectory);
var listOfTasks = new List<Task>();
files.ToList().ForEach(file => {
var localFile = file; // to avoid any closure issue
listOfTasks.Add(ProcessMyFileTask(localFile));
});
await Task.WhenAll(listOfTasks.ToArray());
Console.WriteLine("All done!");
Here's the ProcessMyFileTask
private async Task<List<string>> ProcessMyFileTask(string filePath)
{
using (var streamReader = File.OpenText(filePath))
{
string line;
if ((line = await streamReader.ReadLineAsync()) != null)
{
return DumpHexInLog(line);
}
return null;
}
}
The message shows up when all files are processed. But if I add a continuation task, like this..
var files = Directory.GetFiles(myFilesDirectory);
var listOfTasks = new List<Task>();
files.ToList().ForEach(file => {
var localFile = file; // to avoid any closure issue
listOfTasks.Add(ProcessMyFileTask(localFile).ContinueWith(list =>
ValidateHexDumpsTask(list.Result, localFile)));
});
await Task.WhenAll(listOfTasks.ToArray());
Console.WriteLine("All done!");
Then which tasks will be awaited on? I mean, will "All done!" come after all the ProcessMyFileTask calls are done? Or will it come only after all the ValidateHexDumpsTask calls are done too?
When I tested it, it came after the ValidateHexDumpsTask calls, but I am not sure that will always be the case, as this might have been due to some threading condition or such.
It will complete only when both the ProcessMyFileTask calls and the ValidateHexDumps continuations are done.
However, ContinueWith is not recommended. It's a low-level, dangerous API. You should use await instead:
var files = Directory.GetFiles(myFilesDirectory);
var listOfTasks = files.Select(ProcessAndValidateAsync);
await Task.WhenAll(listOfTasks);
Console.WriteLine("All done!");
async Task ProcessAndValidateAsync(string file)
{
var list = await ProcessMyFileTask(file);
ValidateHexDumps(list, file);
}
WhenAll will be completed when ValidateHexDumps has finished running on each of the items.
ContinueWith returns a Task that represents the completion of the continuation, not the completion of the task that it is a continuation of.
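To see this in isolation, here is a small standalone sketch (not from the question; it assumes an async context and using System.Threading) showing that the task returned by ContinueWith completes only after the continuation body has run:

var antecedent = Task.Delay(100);              // the original task
var continuation = antecedent.ContinueWith(_ =>
{
    Thread.Sleep(100);                         // the continuation body
});

await continuation;                            // completes only after both have run
Console.WriteLine(antecedent.IsCompleted);     // True
Console.WriteLine(continuation.IsCompleted);   // True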