ConcurrentBag Skiping some items C# - c#

I am using concurrentbag for scraping URLs , Right now its working fine for 500 / 100 urls but when I am trying to scrape 8000 urls . All URLs not processing and some items pending in inputQueue.
But I am using while (!inputQueue.IsEmpty) . So, it should run loop till any items exists into inputqueue.
I want only run 100 threads max. So, I first creating 100 threads and calling "Run()" method and inside that method I am running a loop to take items till items exits in inputqueue and add into output queue after scraping urls.
public ConcurrentBag<Data> inputQueue = new ConcurrentBag<Data>();
public ConcurrentBag<Data> outPutQueue = new ConcurrentBag<Data>();
public List<Data> Scrapes(List<Data> scrapeRequests)
{
ServicePointManager.ServerCertificateValidationCallback += (sender, cert, chain, sslPolicyErrors) => true;
string proxy_session_id = new Random().Next().ToString();
numberOfRequestSent = 0;
watch.Start();
foreach (var sRequest in scrapeRequests)
{
inputQueue.Add(sRequest);
}
//inputQueue.CompleteAdding();
var taskList = new List<Task>();
for (var i = 0; i < n_parallel_exit_nodes; i++) //create 100 threads only
{
taskList.Add(Task.Factory.StartNew(async () =>
{
await Run();
}, TaskCreationOptions.RunContinuationsAsynchronously));
}
Task.WaitAll(taskList.ToArray()); //Waiting
//print result
Console.WriteLine("Number Of URLs Found - {0}", scrapeRequests.Count);
Console.WriteLine("Number Of Request Sent - {0}", numberOfRequestSent);
Console.WriteLine("Input Queue - {0}", inputQueue.Count);
Console.WriteLine("OutPut Queue - {0}", outPutQueue.ToList().Count);
Console.WriteLine("Success - {0}", outPutQueue.ToList().Where(x=>x.IsProxySuccess==true).Count().ToString());
Console.WriteLine("Failed - {0}", outPutQueue.ToList().Where(x => x.IsProxySuccess == false).Count().ToString());
Console.WriteLine("Process Time In - {0}", watch.Elapsed);
return outPutQueue.ToList();
}
async Task<string> Run()
{
while (!inputQueue.IsEmpty)
{
var client = new Client(super_proxy_ip, "US");
if (!client.have_good_super_proxy())
client.switch_session_id();
if (client.n_req_for_exit_node == switch_ip_every_n_req)
client.switch_session_id();
var scrapeRequest = new ProductResearch_ProData();
inputQueue.TryTake(out scrapeRequest);
try
{
numberOfRequestSent++;
// Console.WriteLine("Sending request for - {0}", scrapeRequest.URL);
scrapeRequest.HTML = client.DownloadString((string)scrapeRequest.URL);
//Console.WriteLine("Response done for - {0}", scrapeRequest.URL);
scrapeRequest.IsProxySuccess = true;
outPutQueue.Add(scrapeRequest); //add object to output queue
//lumanti code
client.handle_response();
}
catch (WebException e)
{
Console.WriteLine("Failed");
scrapeRequest.IsProxySuccess = false;
Console.WriteLine(e.Message);
outPutQueue.Add(scrapeRequest); //add object to output queue
//lumanti code
client.handle_response(e);
}
client.clean_connection_pool();
client.Dispose();
}
return await Task.Run(() => "Done");
}

There are multiple problems here, but none of them seems to be the cause for the inputQueue.Count having a none-zero value at the end. In any case I would like to point at the problems I can see.
var taskList = new List<Task>();
for (var i = 0; i < n_parallel_exit_nodes; i++) // create 100 threads only
{
taskList.Add(Task.Factory.StartNew(async () =>
{
await Run();
}, TaskCreationOptions.RunContinuationsAsynchronously));
}
The method Task.Factory.StartNew doesn't understand async delegates, so when it is called with an async lambda as argument it returns a nested task. In this case it returns a Task<Task<string>>. You store this nested task in List<Task> collection, which is possible because the type Task<TResult> inherits from the type Task, but doing so you lose the ability to await for the completion (and get the result) of the inner task. You only hold a reference to the outer task. Miraculously this is not a problem in this case (it usually is) since the outer task does all the work, and the inner task does essentially nothing (other than using a thread-pool thread to return a "Done" string that is not really needed anywhere).
You also don't attach any continuations to the outer tasks, so the flag TaskCreationOptions.RunContinuationsAsynchronously seems redundant.
// create 100 threads only
You don't create 100 threads, you create 100 tasks. These tasks are scheduled in the ThreadPool, which will be immediately starved because the tasks are long-running, and will start injecting one new thread every 500 msec until all scheduled tasks have been assigned to a thread.
var scrapeRequest = new ProductResearch_ProData();
inputQueue.TryTake(out scrapeRequest);
Here you instantiate an object of type ProductResearch_ProData that is immediately discarded and becomes eligible for garbage collection in the very next line. The TryTake method will either return an object removed from the bag, or null if the bag is empty. You ignore the return value of the TryTake method, which is entirely possible to be false because meanwhile the bag may have been emptied by another worker, and then proceed with a scrapeRequest that has possibly a null value, resulting in that case to a NullReferenceException.
Worth noting that you extract an object of type ProductResearch_ProData from a ConcurrentBag<Data>, so either the class Data inherits from
the base class ProductResearch_ProData, or there is a transcription error in the code.

Related

Multiple Async Calls with Pause Between Calls

I have an IEnumerable<Task>, where each Task will call the same endpoint. However, the endpoint can only handle so many calls per second. How can I put, say, a half second delay between each call?
I have tried adding Task.Delay(), but of course awaiting them simply means that the app waits a half second before sending all the calls at once.
Here is a code snippet:
var resultTasks = orders
.Select(async task =>
{
var result = new VendorTaskResult();
try
{
result.Response = await result.CallVendorAsync();
}
catch(Exception ex)
{
result.Exception = ex;
}
return result;
} );
var results = Task.WhenAll(resultTasks);
I feel like I should do something like
Task.WhenAll(resultTasks.EmitOverTime(500));
... but how exactly do I do that?
What you describe in your question is in other words rate limiting. You'd like to apply rate limiting policy to your client, because the API you use enforces such a policy on the server to protect itself from abuse.
While you could implement rate limiting yourself, I'd recommend you to go with some well established solution. Rate Limiter from Davis Desmaisons was the one that I picked at random and I instantly liked it. It had solid documentation, superior coverage and was easy to use. It is also available as NuGet package.
Check out the simple snippet below that demonstrates running semi-overlapping tasks in sequence while defering the task start by half a second after the immediately preceding task started. Each task lasts at least 750 ms.
using ComposableAsync;
using RateLimiter;
using System;
using System.Threading.Tasks;
namespace RateLimiterTest
{
class Program
{
static void Main(string[] args)
{
Log("Starting tasks ...");
var constraint = TimeLimiter.GetFromMaxCountByInterval(1, TimeSpan.FromSeconds(0.5));
var tasks = new[]
{
DoWorkAsync("Task1", constraint),
DoWorkAsync("Task2", constraint),
DoWorkAsync("Task3", constraint),
DoWorkAsync("Task4", constraint)
};
Task.WaitAll(tasks);
Log("All tasks finished.");
Console.ReadLine();
}
static void Log(string message)
{
Console.WriteLine(DateTime.Now.ToString("HH:mm:ss.fff ") + message);
}
static async Task DoWorkAsync(string name, IDispatcher constraint)
{
await constraint;
Log(name + " started");
await Task.Delay(750);
Log(name + " finished");
}
}
}
Sample output:
10:03:27.121 Starting tasks ...
10:03:27.154 Task1 started
10:03:27.658 Task2 started
10:03:27.911 Task1 finished
10:03:28.160 Task3 started
10:03:28.410 Task2 finished
10:03:28.680 Task4 started
10:03:28.913 Task3 finished
10:03:29.443 Task4 finished
10:03:29.443 All tasks finished.
If you change the constraint to allow maximum two tasks per second (var constraint = TimeLimiter.GetFromMaxCountByInterval(2, TimeSpan.FromSeconds(1));), which is not the same as one per half a second, then the output could be like:
10:06:03.237 Starting tasks ...
10:06:03.264 Task1 started
10:06:03.268 Task2 started
10:06:04.026 Task2 finished
10:06:04.031 Task1 finished
10:06:04.275 Task3 started
10:06:04.276 Task4 started
10:06:05.032 Task4 finished
10:06:05.032 Task3 finished
10:06:05.033 All tasks finished.
Note that the current version of Rate Limiter targets .NETFramework 4.7.2+ or .NETStandard 2.0+.
This is just a thought, but another approach could be to create a queue and add another thread that runs polling the queue for calls that need to go out to your endpoint.
Have you considered just turning that into a foreach-loop with a Task.Delay call? You seem to want to explicitly call them sequentially, it won't hurt if that is obvious from your code.
var results = new List<YourResultType>;
foreach(var order in orders){
var result = new VendorTaskResult();
try
{
result.Response = await result.CallVendorAsync();
results.Add(result.Response);
}
catch(Exception ex)
{
result.Exception = ex;
}
}
Instead of selecting from orders you could loop over them, and inside the loop put the result into a list and then call Task.WhenAll.
Would look something like:
var resultTasks = new List<VendorTaskResult>(orders.Count);
orders.ToList().ForEach( item => {
var result = new VendorTaskResult();
try
{
result.Response = await result.CallVendorAsync();
}
catch(Exception ex)
{
result.Exception = ex;
}
resultTasks.Add(result);
Thread.Sleep(x);
});
var results = Task.WhenAll(resultTasks);
If you want to control the number of requests executed simultaneously, you have to use a semaphore.
I have something very similar, and it works fine with me. Please note that I call ToArray() after the Linq query finishes, that triggers the tasks:
using (HttpClient client = new HttpClient()) {
IEnumerable<Task<string>> _downloads = _group
.Select(job => {
await Task.Delay(300);
return client.GetStringAsync(<url with variable job>);
});
Task<string>[] _downloadTasks = _downloads.ToArray();
_pages = await Task.WhenAll(_downloadTasks);
}
Now please note that this will create n nunmber of tasks, all in parallel, and the Task.Delay literally does nothing. If you want to call the pages synchronously (as it sounds by putting a delay between the calls), then this code may be better:
using (HttpClient client = new HttpClient()) {
foreach (string job in _group) {
await Task.Delay(300);
_pages.Add(await client.GetStringAsync(<url with variable job>));
}
}
The download of the pages is still asynchronous (while downloading other tasks are done), but each call to download the page is synchronous, ensuring that you can wait for one to finish in order to call the next one.
The code can be easily changed to call the pages asynchronously in chunks, like every 10 pages, wait 300ms, like in this sample:
IEnumerable<string[]> toParse = myData
.Select((v, i) => new { v.code, group = i / 20 })
.GroupBy(x => x.group)
.Select(g => g.Select(x => x.code).ToArray());
using (HttpClient client = new HttpClient()) {
foreach (string[] _group in toParse) {
string[] _pages = null;
IEnumerable<Task<string>> _downloads = _group
.Select(job => {
return client.GetStringAsync(<url with job>);
});
Task<string>[] _downloadTasks = _downloads.ToArray();
_pages = await Task.WhenAll(_downloadTasks);
await Task.Delay(5000);
}
}
All this does is group your pages in chunks of 20, iterate through the chunks, download all pages of the chunk asynchronously, wait 5 seconds, move on to the next chunk.
I hope that is what you were waiting for :)
The proposed method EmitOverTime is doable, but only by blocking the current thread:
public static IEnumerable<Task<TResult>> EmitOverTime<TResult>(
this IEnumerable<Task<TResult>> tasks, int delay)
{
foreach (var item in tasks)
{
Thread.Sleep(delay); // Delay by blocking
yield return item;
}
}
Usage:
var results = await Task.WhenAll(resultTasks.EmitOverTime(500));
Probably better is to create a variant of Task.WhenAll that accepts a delay argument, and delays asyncronously:
public static async Task<TResult[]> WhenAllWithDelay<TResult>(
IEnumerable<Task<TResult>> tasks, int delay)
{
var tasksList = new List<Task<TResult>>();
foreach (var task in tasks)
{
await Task.Delay(delay).ConfigureAwait(false);
tasksList.Add(task);
}
return await Task.WhenAll(tasksList).ConfigureAwait(false);
}
Usage:
var results = await WhenAllWithDelay(resultTasks, 500);
This design implies that the enumerable of tasks should be enumerated only once. It is easy to forget this during development, and start enumerating it again, spawning a new set of tasks. For this reason I propose to make it an OnlyOnce enumerable, as it is shown in this question.
Update: I should mention why the above methods work, and under what premise. The premise is that the supplied IEnumerable<Task<TResult>> is deferred, in other words non-materialized. At the method's start there are no tasks created yet. The tasks are created one after the other during the enumeration of the enumerable, and the trick is that the enumeration is slow and controlled. The delay inside the loop ensures that the tasks are not created all at once. They are created hot (in other words already started), so at the time the last task has been created some of the first tasks may have already been completed. The materialized list of half-running/half-completed tasks is then passed to Task.WhenAll, that waits for all to complete asynchronously.

C# .NET Parallel I/O operation (with throttling) [duplicate]

I would like to run a bunch of async tasks, with a limit on how many tasks may be pending completion at any given time.
Say you have 1000 URLs, and you only want to have 50 requests open at a time; but as soon as one request completes, you open up a connection to the next URL in the list. That way, there are always exactly 50 connections open at a time, until the URL list is exhausted.
I also want to utilize a given number of threads if possible.
I came up with an extension method, ThrottleTasksAsync that does what I want. Is there a simpler solution already out there? I would assume that this is a common scenario.
Usage:
class Program
{
static void Main(string[] args)
{
Enumerable.Range(1, 10).ThrottleTasksAsync(5, 2, async i => { Console.WriteLine(i); return i; }).Wait();
Console.WriteLine("Press a key to exit...");
Console.ReadKey(true);
}
}
Here is the code:
static class IEnumerableExtensions
{
public static async Task<Result_T[]> ThrottleTasksAsync<Enumerable_T, Result_T>(this IEnumerable<Enumerable_T> enumerable, int maxConcurrentTasks, int maxDegreeOfParallelism, Func<Enumerable_T, Task<Result_T>> taskToRun)
{
var blockingQueue = new BlockingCollection<Enumerable_T>(new ConcurrentBag<Enumerable_T>());
var semaphore = new SemaphoreSlim(maxConcurrentTasks);
// Run the throttler on a separate thread.
var t = Task.Run(() =>
{
foreach (var item in enumerable)
{
// Wait for the semaphore
semaphore.Wait();
blockingQueue.Add(item);
}
blockingQueue.CompleteAdding();
});
var taskList = new List<Task<Result_T>>();
Parallel.ForEach(IterateUntilTrue(() => blockingQueue.IsCompleted), new ParallelOptions { MaxDegreeOfParallelism = maxDegreeOfParallelism },
_ =>
{
Enumerable_T item;
if (blockingQueue.TryTake(out item, 100))
{
taskList.Add(
// Run the task
taskToRun(item)
.ContinueWith(tsk =>
{
// For effect
Thread.Sleep(2000);
// Release the semaphore
semaphore.Release();
return tsk.Result;
}
)
);
}
});
// Await all the tasks.
return await Task.WhenAll(taskList);
}
static IEnumerable<bool> IterateUntilTrue(Func<bool> condition)
{
while (!condition()) yield return true;
}
}
The method utilizes BlockingCollection and SemaphoreSlim to make it work. The throttler is run on one thread, and all the async tasks are run on the other thread. To achieve parallelism, I added a maxDegreeOfParallelism parameter that's passed to a Parallel.ForEach loop re-purposed as a while loop.
The old version was:
foreach (var master = ...)
{
var details = ...;
Parallel.ForEach(details, detail => {
// Process each detail record here
}, new ParallelOptions { MaxDegreeOfParallelism = 15 });
// Perform the final batch updates here
}
But, the thread pool gets exhausted fast, and you can't do async/await.
Bonus:
To get around the problem in BlockingCollection where an exception is thrown in Take() when CompleteAdding() is called, I'm using the TryTake overload with a timeout. If I didn't use the timeout in TryTake, it would defeat the purpose of using a BlockingCollection since TryTake won't block. Is there a better way? Ideally, there would be a TakeAsync method.
As suggested, use TPL Dataflow.
A TransformBlock<TInput, TOutput> may be what you're looking for.
You define a MaxDegreeOfParallelism to limit how many strings can be transformed (i.e., how many urls can be downloaded) in parallel. You then post urls to the block, and when you're done you tell the block you're done adding items and you fetch the responses.
var downloader = new TransformBlock<string, HttpResponse>(
url => Download(url),
new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 50 }
);
var buffer = new BufferBlock<HttpResponse>();
downloader.LinkTo(buffer);
foreach(var url in urls)
downloader.Post(url);
//or await downloader.SendAsync(url);
downloader.Complete();
await downloader.Completion;
IList<HttpResponse> responses;
if (buffer.TryReceiveAll(out responses))
{
//process responses
}
Note: The TransformBlock buffers both its input and output. Why, then, do we need to link it to a BufferBlock?
Because the TransformBlock won't complete until all items (HttpResponse) have been consumed, and await downloader.Completion would hang. Instead, we let the downloader forward all its output to a dedicated buffer block - then we wait for the downloader to complete, and inspect the buffer block.
Say you have 1000 URLs, and you only want to have 50 requests open at
a time; but as soon as one request completes, you open up a connection
to the next URL in the list. That way, there are always exactly 50
connections open at a time, until the URL list is exhausted.
The following simple solution has surfaced many times here on SO. It doesn't use blocking code and doesn't create threads explicitly, so it scales very well:
const int MAX_DOWNLOADS = 50;
static async Task DownloadAsync(string[] urls)
{
using (var semaphore = new SemaphoreSlim(MAX_DOWNLOADS))
using (var httpClient = new HttpClient())
{
var tasks = urls.Select(async url =>
{
await semaphore.WaitAsync();
try
{
var data = await httpClient.GetStringAsync(url);
Console.WriteLine(data);
}
finally
{
semaphore.Release();
}
});
await Task.WhenAll(tasks);
}
}
The thing is, the processing of the downloaded data should be done on a different pipeline, with a different level of parallelism, especially if it's a CPU-bound processing.
E.g., you'd probably want to have 4 threads concurrently doing the data processing (the number of CPU cores), and up to 50 pending requests for more data (which do not use threads at all). AFAICT, this is not what your code is currently doing.
That's where TPL Dataflow or Rx may come in handy as a preferred solution. Yet it is certainly possible to implement something like this with plain TPL. Note, the only blocking code here is the one doing the actual data processing inside Task.Run:
const int MAX_DOWNLOADS = 50;
const int MAX_PROCESSORS = 4;
// process data
class Processing
{
SemaphoreSlim _semaphore = new SemaphoreSlim(MAX_PROCESSORS);
HashSet<Task> _pending = new HashSet<Task>();
object _lock = new Object();
async Task ProcessAsync(string data)
{
await _semaphore.WaitAsync();
try
{
await Task.Run(() =>
{
// simuate work
Thread.Sleep(1000);
Console.WriteLine(data);
});
}
finally
{
_semaphore.Release();
}
}
public async void QueueItemAsync(string data)
{
var task = ProcessAsync(data);
lock (_lock)
_pending.Add(task);
try
{
await task;
}
catch
{
if (!task.IsCanceled && !task.IsFaulted)
throw; // not the task's exception, rethrow
// don't remove faulted/cancelled tasks from the list
return;
}
// remove successfully completed tasks from the list
lock (_lock)
_pending.Remove(task);
}
public async Task WaitForCompleteAsync()
{
Task[] tasks;
lock (_lock)
tasks = _pending.ToArray();
await Task.WhenAll(tasks);
}
}
// download data
static async Task DownloadAsync(string[] urls)
{
var processing = new Processing();
using (var semaphore = new SemaphoreSlim(MAX_DOWNLOADS))
using (var httpClient = new HttpClient())
{
var tasks = urls.Select(async (url) =>
{
await semaphore.WaitAsync();
try
{
var data = await httpClient.GetStringAsync(url);
// put the result on the processing pipeline
processing.QueueItemAsync(data);
}
finally
{
semaphore.Release();
}
});
await Task.WhenAll(tasks.ToArray());
await processing.WaitForCompleteAsync();
}
}
As requested, here's the code I ended up going with.
The work is set up in a master-detail configuration, and each master is processed as a batch. Each unit of work is queued up in this fashion:
var success = true;
// Start processing all the master records.
Master master;
while (null != (master = await StoredProcedures.ClaimRecordsAsync(...)))
{
await masterBuffer.SendAsync(master);
}
// Finished sending master records
masterBuffer.Complete();
// Now, wait for all the batches to complete.
await batchAction.Completion;
return success;
Masters are buffered one at a time to save work for other outside processes. The details for each master are dispatched for work via the masterTransform TransformManyBlock. A BatchedJoinBlock is also created to collect the details in one batch.
The actual work is done in the detailTransform TransformBlock, asynchronously, 150 at a time. BoundedCapacity is set to 300 to ensure that too many Masters don't get buffered at the beginning of the chain, while also leaving room for enough detail records to be queued to allow 150 records to be processed at one time. The block outputs an object to its targets, because it's filtered across the links depending on whether it's a Detail or Exception.
The batchAction ActionBlock collects the output from all the batches, and performs bulk database updates, error logging, etc. for each batch.
There will be several BatchedJoinBlocks, one for each master. Since each ISourceBlock is output sequentially and each batch only accepts the number of detail records associated with one master, the batches will be processed in order. Each block only outputs one group, and is unlinked on completion. Only the last batch block propagates its completion to the final ActionBlock.
The dataflow network:
// The dataflow network
BufferBlock<Master> masterBuffer = null;
TransformManyBlock<Master, Detail> masterTransform = null;
TransformBlock<Detail, object> detailTransform = null;
ActionBlock<Tuple<IList<object>, IList<object>>> batchAction = null;
// Buffer master records to enable efficient throttling.
masterBuffer = new BufferBlock<Master>(new DataflowBlockOptions { BoundedCapacity = 1 });
// Sequentially transform master records into a stream of detail records.
masterTransform = new TransformManyBlock<Master, Detail>(async masterRecord =>
{
var records = await StoredProcedures.GetObjectsAsync(masterRecord);
// Filter the master records based on some criteria here
var filteredRecords = records;
// Only propagate completion to the last batch
var propagateCompletion = masterBuffer.Completion.IsCompleted && masterTransform.InputCount == 0;
// Create a batch join block to encapsulate the results of the master record.
var batchjoinblock = new BatchedJoinBlock<object, object>(records.Count(), new GroupingDataflowBlockOptions { MaxNumberOfGroups = 1 });
// Add the batch block to the detail transform pipeline's link queue, and link the batch block to the the batch action block.
var detailLink1 = detailTransform.LinkTo(batchjoinblock.Target1, detailResult => detailResult is Detail);
var detailLink2 = detailTransform.LinkTo(batchjoinblock.Target2, detailResult => detailResult is Exception);
var batchLink = batchjoinblock.LinkTo(batchAction, new DataflowLinkOptions { PropagateCompletion = propagateCompletion });
// Unlink batchjoinblock upon completion.
// (the returned task does not need to be awaited, despite the warning.)
batchjoinblock.Completion.ContinueWith(task =>
{
detailLink1.Dispose();
detailLink2.Dispose();
batchLink.Dispose();
});
return filteredRecords;
}, new ExecutionDataflowBlockOptions { BoundedCapacity = 1 });
// Process each detail record asynchronously, 150 at a time.
detailTransform = new TransformBlock<Detail, object>(async detail => {
try
{
// Perform the action for each detail here asynchronously
await DoSomethingAsync();
return detail;
}
catch (Exception e)
{
success = false;
return e;
}
}, new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 150, BoundedCapacity = 300 });
// Perform the proper action for each batch
batchAction = new ActionBlock<Tuple<IList<object>, IList<object>>>(async batch =>
{
var details = batch.Item1.Cast<Detail>();
var errors = batch.Item2.Cast<Exception>();
// Do something with the batch here
}, new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 4 });
masterBuffer.LinkTo(masterTransform, new DataflowLinkOptions { PropagateCompletion = true });
masterTransform.LinkTo(detailTransform, new DataflowLinkOptions { PropagateCompletion = true });

How can I guarantee continuations run in task completion order?

If I have code that abstracts a staged sequence of asynchronous operations by returning a Task representing each stage, how can I ensure that continuations execute in stage order (i.e. the order in which the Tasks are completed)?
Note that this is a different requirement from simply 'not wasting time waiting for slower tasks'. The order needs to be guaranteed without race conditions in the scheduling. This looser requirement could addressed by parts of the answers to the following questions:
Sort Tasks into order of completition
Is there default way to get first task that finished successfully?
I think the logical solution would be to attach the continuations using a custom TaskScheduler (such as one based on a SynchronizationContext). However, I can't find any assurance that the scheduling of continuations is performed synchronously upon task completion.
In code this could be something like
class StagedOperationSource
{
public TaskCompletionSource Connect = new TaskCompletionSource();
public TaskCompletionSource Accept = new TaskCompletionSource();
public TaskCompletionSource Complete = new TaskCompletionSource();
}
class StagedOperation
{
public Task Connect, Accept, Complete;
public StagedOperation(StagedOperationSource source)
{
Connect = source.Connect.Task;
Accept = source.Accept.Task;
Complete = source.Complete.Task;
}
}
...
private StagedOperation InitiateStagedOperation(int opId)
{
var source = new StagedOperationSource();
Task.Run(GetRunnerFromOpId(opId, source));
return new StagedOperation(source);
}
...
public RunOperations()
{
for (int i=0; i<3; i++)
{
var op = InitiateStagedOperation(i);
op.Connect.ContinueWith(t => Console.WriteLine("{0}: Connected", i));
op.Accept.ContinueWith(t => Console.WriteLine("{0}: Accepted", i));
op.Complete.ContinueWith(t => Console.WriteLine("{0}: Completed", i));
}
}
which should produce output similar to
0: Connected
1: Connected
0: Accepted
2: Connected
0: Completed
1: Accepted
2: Accepted
2: Completed
1: Completed
Obviously the example is missing details like forwarding exceptions to (or cancelling) later stages if an earlier stage fails, but its just an example.
Just await each stage before going onto the next...
public static async Task ProcessStagedOperation(StagedOperation operation, int i)
{
await operation.Connect;
Console.WriteLine("{0}: Connected", i);
await operation.Accept;
Console.WriteLine("{0}: Accepted", i);
await operation.Complete;
Console.WriteLine("{0}: Completed", i);
}
You can then call that method in your for loop.
If you use TAP (Task Asynchronous Programming), i.e. async and await, you can make the flow of processing a lot more apparent. In this case I would create a new method to encapsulate the order of operations:
public async Task ProcessStagedOperation(StagedOperation op, int i)
{
await op.Connect;
Console.WriteLine("{0}: Connected", i);
await op.Accept;
Console.WriteLine("{0}: Accepted", i)
await op.Complete;
Console.WriteLine("{0}: Completed", i)
}
Now your processing loop gets simplified a bit:
public async Task RunOperations()
{
List<Task> pendingOperations = new List<Task>();
for (int i=0; i<3; i++)
{
var op = InitiateStagedOperation(i);
pendingOperations.Add(ProcessStagedOperation(op, i));
}
await Task.WhenAll(pendingOperations); // finish
}
You now have a reference to a task object you can explicitly wait or simply await from another context. (or you can simply ignore it). The way I modified the RunOperations() method allows you to create a large queue of pending tasks but not block while you wait for them all to finish.

Alternate to Dataflow BroadcastBlock with guaranteed delivery

I need to have some kind of object that acts like a BroadcastBlock, but with guaranteed delivery. So i used an answer from this question. But i don't really clearly understand the execution flow here. I have a console app. Here is my code:
static void Main(string[] args)
{
ExecutionDataflowBlockOptions execopt = new ExecutionDataflowBlockOptions { BoundedCapacity = 5 };
List<ActionBlock<int>> blocks = new List<ActionBlock<int>>();
for (int i = 0; i <= 10; i++)
blocks.Add(new ActionBlock<int>(num =>
{
int coef = i;
Console.WriteLine(Thread.CurrentThread.ManagedThreadId + ". " + num * coef);
}, execopt));
ActionBlock<int> broadcaster = new ActionBlock<int>(async num =>
{
foreach (ActionBlock<int> block in blocks) await block.SendAsync(num);
}, execopt);
broadcaster.Completion.ContinueWith(task =>
{
foreach (ActionBlock<int> block in blocks) block.Complete();
});
Task producer = Produce(broadcaster);
List<Task> ToWait = new List<Task>();
foreach (ActionBlock<int> block in blocks) ToWait.Add(block.Completion);
ToWait.Add(producer);
Task.WaitAll(ToWait.ToArray());
Console.ReadLine();
}
static async Task Produce(ActionBlock<int> broadcaster)
{
for (int i = 0; i <= 15; i++) await broadcaster.SendAsync(i);
broadcaster.Complete();
}
Each number must be handled sequentially, so i can't use MaxDegreeOfParallelism in broadcaster block. But all actionblocks that receive the number can run in parallel.
So here is the question:
In the output i can see different thread ids. Do i understand it correctly that works as follows:
Execution hits await block.SendAsync(num); in a broadcaster.
If current block is not ready to accept the number, execution exits broadcaster and hangs at the Task.WaitAll.
When block accepts the number, the rest of foreach statement in broadcaster is executed in a threadpool.
And the same till the end.
Each iteration of foreach is executed in a threadpool. But actually it happens sequentially.
Am i right or wrong in my understanding?
How can i change this code to send the number to all blocks asynchronously?
To make sure that if one of blocks is not ready to receive the number at the moment, i won't wait for it and all others that are ready will receive the number. And that all blocks can run in parallel. And guarantee delivery.
Assuming you want to handle one item at a time by the broadcaster while enabling the target blocks to receive that item concurrently you need to change the broadcaster to offer the number to all blocks at the same time and then asynchronously wait for all of them together to accept it before moving on to the next number:
var broadcaster = new ActionBlock<int>(async num =>
{
var tasks = new List<Task>();
foreach (var block in blocks)
{
tasks.Add(block.SendAsync(num));
}
await Task.WhenAll(tasks);
}, execopt);
Now, in this case where you don't have work after the await you can slightly optimize while still returning an awaitable task:
ActionBlock<int> broadcaster = new ActionBlock<int>(
num => Task.WhenAll(blocks.Select(block => block.SendAsync(num))), execopt);

Throttling asynchronous tasks

I would like to run a bunch of async tasks, with a limit on how many tasks may be pending completion at any given time.
Say you have 1000 URLs, and you only want to have 50 requests open at a time; but as soon as one request completes, you open up a connection to the next URL in the list. That way, there are always exactly 50 connections open at a time, until the URL list is exhausted.
I also want to utilize a given number of threads if possible.
I came up with an extension method, ThrottleTasksAsync that does what I want. Is there a simpler solution already out there? I would assume that this is a common scenario.
Usage:
class Program
{
static void Main(string[] args)
{
Enumerable.Range(1, 10).ThrottleTasksAsync(5, 2, async i => { Console.WriteLine(i); return i; }).Wait();
Console.WriteLine("Press a key to exit...");
Console.ReadKey(true);
}
}
Here is the code:
static class IEnumerableExtensions
{
public static async Task<Result_T[]> ThrottleTasksAsync<Enumerable_T, Result_T>(this IEnumerable<Enumerable_T> enumerable, int maxConcurrentTasks, int maxDegreeOfParallelism, Func<Enumerable_T, Task<Result_T>> taskToRun)
{
var blockingQueue = new BlockingCollection<Enumerable_T>(new ConcurrentBag<Enumerable_T>());
var semaphore = new SemaphoreSlim(maxConcurrentTasks);
// Run the throttler on a separate thread.
var t = Task.Run(() =>
{
foreach (var item in enumerable)
{
// Wait for the semaphore
semaphore.Wait();
blockingQueue.Add(item);
}
blockingQueue.CompleteAdding();
});
var taskList = new List<Task<Result_T>>();
Parallel.ForEach(IterateUntilTrue(() => blockingQueue.IsCompleted), new ParallelOptions { MaxDegreeOfParallelism = maxDegreeOfParallelism },
_ =>
{
Enumerable_T item;
if (blockingQueue.TryTake(out item, 100))
{
taskList.Add(
// Run the task
taskToRun(item)
.ContinueWith(tsk =>
{
// For effect
Thread.Sleep(2000);
// Release the semaphore
semaphore.Release();
return tsk.Result;
}
)
);
}
});
// Await all the tasks.
return await Task.WhenAll(taskList);
}
static IEnumerable<bool> IterateUntilTrue(Func<bool> condition)
{
while (!condition()) yield return true;
}
}
The method utilizes BlockingCollection and SemaphoreSlim to make it work. The throttler is run on one thread, and all the async tasks are run on the other thread. To achieve parallelism, I added a maxDegreeOfParallelism parameter that's passed to a Parallel.ForEach loop re-purposed as a while loop.
The old version was:
foreach (var master = ...)
{
var details = ...;
Parallel.ForEach(details, detail => {
// Process each detail record here
}, new ParallelOptions { MaxDegreeOfParallelism = 15 });
// Perform the final batch updates here
}
But, the thread pool gets exhausted fast, and you can't do async/await.
Bonus:
To get around the problem in BlockingCollection where an exception is thrown in Take() when CompleteAdding() is called, I'm using the TryTake overload with a timeout. If I didn't use the timeout in TryTake, it would defeat the purpose of using a BlockingCollection since TryTake won't block. Is there a better way? Ideally, there would be a TakeAsync method.
As suggested, use TPL Dataflow.
A TransformBlock<TInput, TOutput> may be what you're looking for.
You define a MaxDegreeOfParallelism to limit how many strings can be transformed (i.e., how many urls can be downloaded) in parallel. You then post urls to the block, and when you're done you tell the block you're done adding items and you fetch the responses.
var downloader = new TransformBlock<string, HttpResponse>(
url => Download(url),
new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 50 }
);
var buffer = new BufferBlock<HttpResponse>();
downloader.LinkTo(buffer);
foreach(var url in urls)
downloader.Post(url);
//or await downloader.SendAsync(url);
downloader.Complete();
await downloader.Completion;
IList<HttpResponse> responses;
if (buffer.TryReceiveAll(out responses))
{
//process responses
}
Note: The TransformBlock buffers both its input and output. Why, then, do we need to link it to a BufferBlock?
Because the TransformBlock won't complete until all items (HttpResponse) have been consumed, and await downloader.Completion would hang. Instead, we let the downloader forward all its output to a dedicated buffer block - then we wait for the downloader to complete, and inspect the buffer block.
Say you have 1000 URLs, and you only want to have 50 requests open at
a time; but as soon as one request completes, you open up a connection
to the next URL in the list. That way, there are always exactly 50
connections open at a time, until the URL list is exhausted.
The following simple solution has surfaced many times here on SO. It doesn't use blocking code and doesn't create threads explicitly, so it scales very well:
const int MAX_DOWNLOADS = 50;
static async Task DownloadAsync(string[] urls)
{
using (var semaphore = new SemaphoreSlim(MAX_DOWNLOADS))
using (var httpClient = new HttpClient())
{
var tasks = urls.Select(async url =>
{
await semaphore.WaitAsync();
try
{
var data = await httpClient.GetStringAsync(url);
Console.WriteLine(data);
}
finally
{
semaphore.Release();
}
});
await Task.WhenAll(tasks);
}
}
The thing is, the processing of the downloaded data should be done on a different pipeline, with a different level of parallelism, especially if it's a CPU-bound processing.
E.g., you'd probably want to have 4 threads concurrently doing the data processing (the number of CPU cores), and up to 50 pending requests for more data (which do not use threads at all). AFAICT, this is not what your code is currently doing.
That's where TPL Dataflow or Rx may come in handy as a preferred solution. Yet it is certainly possible to implement something like this with plain TPL. Note, the only blocking code here is the one doing the actual data processing inside Task.Run:
const int MAX_DOWNLOADS = 50;
const int MAX_PROCESSORS = 4;
// process data
class Processing
{
SemaphoreSlim _semaphore = new SemaphoreSlim(MAX_PROCESSORS);
HashSet<Task> _pending = new HashSet<Task>();
object _lock = new Object();
async Task ProcessAsync(string data)
{
await _semaphore.WaitAsync();
try
{
await Task.Run(() =>
{
// simuate work
Thread.Sleep(1000);
Console.WriteLine(data);
});
}
finally
{
_semaphore.Release();
}
}
public async void QueueItemAsync(string data)
{
var task = ProcessAsync(data);
lock (_lock)
_pending.Add(task);
try
{
await task;
}
catch
{
if (!task.IsCanceled && !task.IsFaulted)
throw; // not the task's exception, rethrow
// don't remove faulted/cancelled tasks from the list
return;
}
// remove successfully completed tasks from the list
lock (_lock)
_pending.Remove(task);
}
public async Task WaitForCompleteAsync()
{
Task[] tasks;
lock (_lock)
tasks = _pending.ToArray();
await Task.WhenAll(tasks);
}
}
// download data
static async Task DownloadAsync(string[] urls)
{
var processing = new Processing();
using (var semaphore = new SemaphoreSlim(MAX_DOWNLOADS))
using (var httpClient = new HttpClient())
{
var tasks = urls.Select(async (url) =>
{
await semaphore.WaitAsync();
try
{
var data = await httpClient.GetStringAsync(url);
// put the result on the processing pipeline
processing.QueueItemAsync(data);
}
finally
{
semaphore.Release();
}
});
await Task.WhenAll(tasks.ToArray());
await processing.WaitForCompleteAsync();
}
}
As requested, here's the code I ended up going with.
The work is set up in a master-detail configuration, and each master is processed as a batch. Each unit of work is queued up in this fashion:
var success = true;
// Start processing all the master records.
Master master;
while (null != (master = await StoredProcedures.ClaimRecordsAsync(...)))
{
await masterBuffer.SendAsync(master);
}
// Finished sending master records
masterBuffer.Complete();
// Now, wait for all the batches to complete.
await batchAction.Completion;
return success;
Masters are buffered one at a time to save work for other outside processes. The details for each master are dispatched for work via the masterTransform TransformManyBlock. A BatchedJoinBlock is also created to collect the details in one batch.
The actual work is done in the detailTransform TransformBlock, asynchronously, 150 at a time. BoundedCapacity is set to 300 to ensure that too many Masters don't get buffered at the beginning of the chain, while also leaving room for enough detail records to be queued to allow 150 records to be processed at one time. The block outputs an object to its targets, because it's filtered across the links depending on whether it's a Detail or Exception.
The batchAction ActionBlock collects the output from all the batches, and performs bulk database updates, error logging, etc. for each batch.
There will be several BatchedJoinBlocks, one for each master. Since each ISourceBlock is output sequentially and each batch only accepts the number of detail records associated with one master, the batches will be processed in order. Each block only outputs one group, and is unlinked on completion. Only the last batch block propagates its completion to the final ActionBlock.
The dataflow network:
// The dataflow network
BufferBlock<Master> masterBuffer = null;
TransformManyBlock<Master, Detail> masterTransform = null;
TransformBlock<Detail, object> detailTransform = null;
ActionBlock<Tuple<IList<object>, IList<object>>> batchAction = null;
// Buffer master records to enable efficient throttling.
masterBuffer = new BufferBlock<Master>(new DataflowBlockOptions { BoundedCapacity = 1 });
// Sequentially transform master records into a stream of detail records.
masterTransform = new TransformManyBlock<Master, Detail>(async masterRecord =>
{
var records = await StoredProcedures.GetObjectsAsync(masterRecord);
// Filter the master records based on some criteria here
var filteredRecords = records;
// Only propagate completion to the last batch
var propagateCompletion = masterBuffer.Completion.IsCompleted && masterTransform.InputCount == 0;
// Create a batch join block to encapsulate the results of the master record.
var batchjoinblock = new BatchedJoinBlock<object, object>(records.Count(), new GroupingDataflowBlockOptions { MaxNumberOfGroups = 1 });
// Add the batch block to the detail transform pipeline's link queue, and link the batch block to the the batch action block.
var detailLink1 = detailTransform.LinkTo(batchjoinblock.Target1, detailResult => detailResult is Detail);
var detailLink2 = detailTransform.LinkTo(batchjoinblock.Target2, detailResult => detailResult is Exception);
var batchLink = batchjoinblock.LinkTo(batchAction, new DataflowLinkOptions { PropagateCompletion = propagateCompletion });
// Unlink batchjoinblock upon completion.
// (the returned task does not need to be awaited, despite the warning.)
batchjoinblock.Completion.ContinueWith(task =>
{
detailLink1.Dispose();
detailLink2.Dispose();
batchLink.Dispose();
});
return filteredRecords;
}, new ExecutionDataflowBlockOptions { BoundedCapacity = 1 });
// Process each detail record asynchronously, 150 at a time.
detailTransform = new TransformBlock<Detail, object>(async detail => {
try
{
// Perform the action for each detail here asynchronously
await DoSomethingAsync();
return detail;
}
catch (Exception e)
{
success = false;
return e;
}
}, new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 150, BoundedCapacity = 300 });
// Perform the proper action for each batch
batchAction = new ActionBlock<Tuple<IList<object>, IList<object>>>(async batch =>
{
var details = batch.Item1.Cast<Detail>();
var errors = batch.Item2.Cast<Exception>();
// Do something with the batch here
}, new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 4 });
masterBuffer.LinkTo(masterTransform, new DataflowLinkOptions { PropagateCompletion = true });
masterTransform.LinkTo(detailTransform, new DataflowLinkOptions { PropagateCompletion = true });

Categories