How to yield from parallel tasks in .NET 4.5 - c#

I would like to combine a .NET iterator with parallel Tasks/await. Something like this:
IEnumerable<TDest> Foo<TSrc, TDest>(IEnumerable<TSrc> source)
{
    Parallel.ForEach(
        source,
        s =>
        {
            // Ordering is NOT important
            // items can be yielded as soon as they are done
            yield return ExecuteOrDownloadSomething(s);
        });
}
Unfortunately .NET cannot natively handle this. The best answer so far, by @svick, is to use AsParallel().
BONUS: Any simple async/await code that implements multiple publishers and a single subscriber? The subscriber would yield, and the publishers would process. (Core libraries only.)

This seems like a job for PLINQ:
return source.AsParallel().Select(s => ExecuteOrDownloadSomething(s));
This will execute the delegate in parallel using a limited number of threads, returning each result as soon as it completes.
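If you need results to stream out the moment they finish (rather than in buffered chunks), here is a hedged sketch of the relevant PLINQ knobs; the delegate parameter is my own addition so the sketch is self-contained:
public static IEnumerable<TDest> Foo<TSrc, TDest>(IEnumerable<TSrc> source, Func<TSrc, TDest> executeOrDownloadSomething)
{
    return source
        .AsParallel()
        .WithMergeOptions(ParallelMergeOptions.NotBuffered) // surface each result as soon as it is ready
        .WithDegreeOfParallelism(Environment.ProcessorCount) // optional; roughly the default anyway
        .Select(executeOrDownloadSomething);
}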
If the ExecuteOrDownloadSomething() method is IO-bound (e.g. it actually downloads something) and you don't want to waste threads, then using async-await might make sense, but it would be more complicated.
If you want to fully take advantage of async, you shouldn't return IEnumerable, because it's synchronous (i.e. it blocks if no items are available). What you need is some sort of asynchronous collection, and you can use ISourceBlock (specifically, TransformBlock) from TPL Dataflow for that:
ISourceBlock<TDest> Foo<TSrc, TDest>(IEnumerable<TSrc> source)
{
    var block = new TransformBlock<TSrc, TDest>(
        async s => await ExecuteOrDownloadSomethingAsync(s),
        new ExecutionDataflowBlockOptions
        {
            MaxDegreeOfParallelism = DataflowBlockOptions.Unbounded
        });

    foreach (var item in source)
        block.Post(item);
    block.Complete();

    return block;
}
If the source is “slow” (i.e. you want to start processing the results from Foo() before iterating source has completed), you might want to move the foreach and the Complete() call to a separate Task. An even better solution would be to make source an ISourceBlock<TSrc> too.
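On the consuming side, a minimal sketch of my own (inside an async method) that drains the returned block, assuming a hypothetical Foo<string, byte[]> that downloads URLs:
var downloads = Foo<string, byte[]>(urls); // urls is a hypothetical IEnumerable<string>
var consume = new ActionBlock<byte[]>(bytes => Console.WriteLine(bytes.Length));
downloads.LinkTo(consume, new DataflowLinkOptions { PropagateCompletion = true });
await consume.Completion; // finishes once Foo() called Complete() and everything has drained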

So it appears what you really want to do is to order a sequence of tasks based on when they complete. This is not terribly complex:
public static IEnumerable<Task<T>> Order<T>(this IEnumerable<Task<T>> tasks)
{
    var input = tasks.ToList();
    // Materialize the completion sources; a lazy Select would create new ones on every enumeration.
    var output = input.Select(task => new TaskCompletionSource<T>()).ToList();
    var collection = new BlockingCollection<TaskCompletionSource<T>>();
    foreach (var tcs in output)
        collection.Add(tcs);

    foreach (var task in input)
    {
        task.ContinueWith(t =>
        {
            var tcs = collection.Take();
            switch (task.Status)
            {
                case TaskStatus.Canceled:
                    tcs.TrySetCanceled();
                    break;
                case TaskStatus.Faulted:
                    tcs.TrySetException(task.Exception.InnerExceptions);
                    break;
                case TaskStatus.RanToCompletion:
                    tcs.TrySetResult(task.Result);
                    break;
            }
        }
        , CancellationToken.None
        , TaskContinuationOptions.ExecuteSynchronously
        , TaskScheduler.Default);
    }

    return output.Select(tcs => tcs.Task);
}
So here we create a TaskCompletionSource for each input task, then go through each of the tasks and set a continuation which grabs the next completion source from a BlockingCollection and sets its result. The first task to complete grabs the first tcs that was returned, the second task to complete gets the second tcs that was returned, and so on.
Now your code becomes quite simple:
var tasks = collection.Select(item => LongRunningOperationThatReturnsTask(item))
                      .Order();

foreach (var task in tasks)
{
    var result = task.Result; // or you could `await` each result
    //....
}
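For illustration, a sketch of my own (using delays in place of real work) showing that results come back in completion order rather than input order:
var delays = new[] { 3000, 1000, 2000 };
var tasks = delays.Select(async ms => { await Task.Delay(ms); return ms; });

foreach (var task in tasks.Order())
{
    Console.WriteLine(await task); // prints 1000, 2000, 3000
}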

In the asynchronous library made by the MS robotics team (the CCR), there were concurrency primitives which allowed using an iterator to yield asynchronous code.
The library is free (it didn't use to be). A nice introductory article can be found here: Concurrent Affairs.
Perhaps you can use this library alongside the .NET task library, or it'll inspire you to 'roll your own'.

C# Add to a List Asynchronously in API

I have an API which needs to be run in a loop for mass processing.
The current single API is:
public async Task<ActionResult<CombinedAddressResponse>> GetCombinedAddress(AddressRequestDto request)
We are not allowed to touch/modify the original single API. However, it can be run in bulk using a foreach statement. What is the best way to run this asynchronously without locks?
The current solution below just provides a list; would this be it?
public async Task<ActionResult<List<CombinedAddressResponse>>> GetCombinedAddress(List<AddressRequestDto> requests)
{
    var combinedAddressResponses = new List<CombinedAddressResponse>();
    foreach (AddressRequestDto request in requests)
    {
        var newCombinedAddress = (await GetCombinedAddress(request)).Value;
        combinedAddressResponses.Add(newCombinedAddress);
    }
    return combinedAddressResponses;
}
Update:
In the debugger, the value shows up under combinedAddressResponse.Result.Value, while combinedAddressResponse.Value is null.
Also, strangely, writing combinedAddressResponse.Result.Value in code gives the error below: "ActionResult does not contain a definition for 'Value' and no accessible extension method".
I'm writing this code off the top of my head without an IDE or sleep, so please comment if I'm missing something or there's a better way.
But effectively I think you want to run all your requests at once (not sequentially) doing something like this:
public async Task<ActionResult<List<CombinedAddressResponse>>> GetCombinedAddress(List<AddressRequestDto> requests)
{
    var combinedAddressResponses = new List<CombinedAddressResponse>(requests.Count);
    var tasks = new List<Task<ActionResult<CombinedAddressResponse>>>(requests.Count);
    foreach (var request in requests)
    {
        tasks.Add(Task.Run(async () => await GetCombinedAddress(request)));
    }
    // This waits for all the tasks to complete
    await Task.WhenAll(tasks);
    combinedAddressResponses.AddRange(tasks.Select(x => x.Result.Value));
    return combinedAddressResponses;
}
I'm looking for a way to speed things up and run these in parallel. Thanks.
What you need is "asynchronous concurrency". I use the term "concurrency" to mean "doing more than one thing at a time", and "parallel" to mean "doing more than one thing at a time using threads". Since you're on ASP.NET, you don't want to use additional threads; you'd want to use a form of concurrency that works asynchronously (which uses fewer threads). So, Parallel and Task.Run should not be parts of your solution.
The way to do asynchronous concurrency is to build a collection of tasks, and then use await Task.WhenAll. E.g.:
public async Task<ActionResult<IReadOnlyList<CombinedAddressResponse>>> GetCombinedAddress(List<AddressRequestDto> requests)
{
    // Build the collection of tasks by doing an asynchronous operation for each request.
    var tasks = requests.Select(async request =>
    {
        var combinedAddressResponse = await GetCombinedAddress(request);
        return combinedAddressResponse.Value;
    }).ToList();

    // Wait for all the tasks to complete and get the results.
    var results = await Task.WhenAll(tasks);
    return results;
}
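If the request list can be large, you may also want to cap how many calls run at once. A common sketch layers a SemaphoreSlim on top of the same pattern (the limit of 10 is an arbitrary assumption):
var throttler = new SemaphoreSlim(10); // at most 10 calls in flight at a time
var tasks = requests.Select(async request =>
{
    await throttler.WaitAsync();
    try
    {
        var response = await GetCombinedAddress(request);
        return response.Value;
    }
    finally
    {
        throttler.Release();
    }
}).ToList();
var results = await Task.WhenAll(tasks);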

How can I asynchronously transform one IEnumerable to another, just like LINQ's Select(), but using await on every transformed item?

Consider this situation:
class Product { }
interface IWorker
{
Task<Product> CreateProductAsync();
}
I am now given an IEnumerable<IWorker> workers and am supposed to create an IEnumerable<Product> from it that I have to pass to some other function that I cannot alter:
void CheckProducts(IEnumerable<Product> products);
This method needs to have access to the entire IEnumerable<Product>. It is not possible to subdivide it and call CheckProducts on multiple subsets.
One obvious solution is this:
CheckProducts(workers.Select(worker => worker.CreateProductAsync().Result));
But this is blocking, of course, and hence it would only be my last resort.
Syntactically, I need precisely this, just without blocking.
I cannot use await inside of the function I'm passing to Select() as I would have to mark it as async and that would require it to return a Task itself and I would have gained nothing. In the end I need an IEnumerable<Product> and not an IEnumerable<Task<Product>>.
It is important to know that the order in which the workers create their products matters; their work must not overlap. Otherwise, I would do this:
async Task<IEnumerable<Product>> CreateProductsAsync(IEnumerable<IWorker> workers)
{
var tasks = workers.Select(worker => worker.CreateProductAsync());
return await Task.WhenAll(tasks);
}
But unfortunately, Task.WhenAll() executes some tasks in parallel while I need them executed sequentially.
Here is one possibility to implement it if I had an IReadOnlyList<IWorker> instead of an IEnumerable<IWorker>:
async Task<IEnumerable<Product>> CreateProductsAsync(IReadOnlyList<IWorker> workers)
{
    var resultList = new Product[workers.Count];
    for (int i = 0; i < resultList.Length; ++i)
        resultList[i] = await workers[i].CreateProductAsync();
    return resultList;
}
But I must deal with an IEnumerable and, even worse, it is usually quite huge, sometimes it is even unlimited, yielding workers forever. If I knew that its size was decent, I would just call ToArray() on it and use the method above.
The ultimate solution would be this:
async Task<IEnumerable<Product>> CreateProductsAsync(IEnumerable<IWorker> workers)
{
foreach (var worker in workers)
yield return await worker.CreateProductAsync();
}
But yield and await are incompatible as described in this answer. Looking at that answer, would that hypothetical IAsyncEnumerator help me here? Does something similar meanwhile exist in C#?
A summary of the issues I'm facing:
I have a potentially endless IEnumerable<IWorker>
I want to asynchronously call CreateProductAsync() on each of them in the same order as they are coming in
In the end I need an IEnumerable<Product>
A summary of what I already tried, but doesn't work:
I cannot use Task.WhenAll() because it executes tasks in parallel.
I cannot use ToArray() and process that array manually in a loop because my sequence is sometimes endless.
I cannot use yield return because it's incompatible with await.
Does anybody have a solution or workaround for me?
Otherwise I will have to use that blocking code...
IEnumerator<T> is a synchronous interface, so blocking is unavoidable if CheckProducts enumerates the next product before the next worker has finished creating the product.
Nevertheless, you can achieve parallelism by creating products on another thread, adding them to a BlockingCollection<T>, and yielding them on the main thread:
static IEnumerable<Product> CreateProducts(IEnumerable<IWorker> workers)
{
    var products = new BlockingCollection<Product>(3);
    Task.Run(async () => // On the thread pool...
    {
        foreach (IWorker worker in workers)
        {
            Product product = await worker.CreateProductAsync(); // Create products serially.
            products.Add(product); // Enqueue the product, blocking if the queue is full.
        }
        products.CompleteAdding(); // Notify GetConsumingEnumerable that we're done.
    });
    return products.GetConsumingEnumerable();
}
To avoid unbounded memory consumption, you can optionally specify the capacity of the queue as a constructor argument to BlockingCollection<T>. I used 3 in the code above.
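Usage is then simply (assuming CheckProducts and the worker sequence from the question):
IEnumerable<IWorker> workers = GetWorkers(); // hypothetical source of workers
CheckProducts(CreateProducts(workers)); // enumeration blocks only when the bounded queue runs dry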
The Situation:
Here you're saying you need to do this synchronously, because IEnumerable doesn't support async and the requirements are you need an IEnumerable<Product>.
I am now given an IEnumerable<IWorker> workers and am supposed to create an IEnumerable<Product> from it that I have to pass to some other function that I cannot alter:
Here you say the entire product set needs to be processed at the same time, presumably making a single call to void CheckProducts(IEnumerable<Product> products).
This methods needs to check the entire Product set as a whole. It is
not possible to subdivide the result.
And here you say the enumerable can yield an indefinite number of items
But I must deal with an IEnumerable and, even worse, it is usually
quite huge, sometimes it is even unlimited, yielding workers forever.
If I knew that its size was decent, I would just call ToArray() on it
and use the method above.
So let's put these together. You need to do asynchronous processing of an indefinite number of items within a synchronous environment and then evaluate the entire set as a whole... synchronously.
The Underlying Problems:
1: To evaluate a set as a whole, it must be completely enumerated. To completely enumerate a set, it must be finite. Therefore it is impossible to evaluate an infinite set as a whole.
2: Switching back and forth between sync and async forces the async code to run synchronously. That might be OK from a requirements perspective, but from a technical perspective it can cause deadlocks (maybe unavoidable, I don't know. Look that up. I'm not the expert).
Possible Solutions to Problem 1:
1: Force the source to be an ICollection<T> instead of IEnumerable<T>. This enforces finiteness.
2: Alter the CheckProducts algorithm to process iteratively, potentially yielding intermediary results while still maintaining an ongoing aggregation internally.
Possible Solutions to Problem 2:
1: Make the CheckProducts method asynchronous.
2: Make the CreateProduct... method synchronous.
Bottom Line
You can't do what you're asking how you're asking, and it sounds like someone else is dictating your requirements. They need to change some of the requirements, because what they're asking for is (and I really hate using this word) impossible. Is it possible you have misinterpreted some of the requirements?
Two ideas for you OP
Multiple call solution
If you are allowed to call CheckProducts more than once, you could simply do this:
foreach (var worker in workers)
{
var product = await worker.CreateProductAsync();
CheckProducts(new [] { product } );
}
If it adds value, I'm pretty sure you could work out a way to do it in batches of, say, 100 at a time, too.
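For instance, a rough sketch of that batched variant (the batch size of 100 is arbitrary):
var batch = new List<Product>(100);
foreach (var worker in workers)
{
    batch.Add(await worker.CreateProductAsync());
    if (batch.Count == 100)
    {
        CheckProducts(batch);
        batch.Clear();
    }
}
if (batch.Count > 0)
    CheckProducts(batch); // flush the final partial batch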
Thread pool solution
If you are not allowed to call CheckProducts more than once, and not allowed to modify CheckProducts, there is no way to force it to yield control and allow other continuations to run. So no matter what you do, you cannot force asynchrony into the IEnumerable that you pass to it, not just because of the compiler checking, but because it would probably deadlock.
So here is a thread pool solution. The idea is to create one separate thread to process the products in series; the processor is async, so a call to CreateProductAsync() will still yield control to anything else that has been posted to the synchronization context, as needed. However it can't magically force CheckProduct to give up control, so there is still some possibility that it will block occasionally if it is able to check products faster than they are created. In my example I'm using Monitor.Wait() so the O/S won't schedule the thread until there is something waiting for it. You'll still be using up a thread resource while it blocks, but at least you won't be wasting CPU time in a busy-wait loop.
public static IEnumerable<Product> CreateProducts(IEnumerable<Worker> workers)
{
    var queue = new ConcurrentQueue<Product>();
    var task = Task.Run(() => ConvertProducts(workers.GetEnumerator(), queue));
    while (true)
    {
        while (queue.Count > 0)
        {
            Product product;
            var ok = queue.TryDequeue(out product);
            if (ok) yield return product;
        }
        if (task.IsCompleted && queue.Count == 0) yield break;
        lock (queue) Monitor.Wait(queue, 1000); // Monitor requires that the lock be held
    }
}

private static async Task ConvertProducts(IEnumerator<Worker> input, ConcurrentQueue<Product> output)
{
    while (input.MoveNext())
    {
        var current = input.Current;
        var product = await current.CreateProductAsync();
        output.Enqueue(product);
        lock (output) Monitor.Pulse(output); // wake the consumer if it is waiting
    }
}
From your requirements I can put together the following:
1) Workers are processed in order
2) New Workers can be received at any time
So we use the fact that a dataflow TransformBlock has a built-in queue and processes items in order; that way we can accept Workers from the producer at any time.
Next we make the result of the TransformBlock observable so that the consumer can consume Products on demand.
I made some quick changes and started the consumer portion. It simply takes the observable produced by the Transformer and maps it to an enumerable that yields each product. For background, here is ToEnumerable():
The ToEnumerable operator returns an enumerator from an observable sequence. The enumerator will yield each item in the sequence as it is produced.
Source
using System;
using System.Collections.Generic;
using System.Reactive.Linq;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

namespace ClassLibrary1
{
    public class WorkerProducer
    {
        public async Task ProduceWorker()
        {
            //await ProductTransformer_Transformer.SendAsync(new Worker())
        }
    }

    public class ProductTransformer
    {
        public IObservable<Product> Products { get; private set; }
        public TransformBlock<Worker, Product> Transformer { get; private set; }

        private Task<Product> CreateProductAsync(Worker worker) => Task.FromResult(new Product());

        public ProductTransformer()
        {
            Transformer = new TransformBlock<Worker, Product>(wrk => CreateProductAsync(wrk));
            Products = Transformer.AsObservable();
        }
    }

    public class ProductConsumer
    {
        private ThirdParty ThirdParty { get; set; } = new ThirdParty();
        private ProductTransformer Transformer { get; set; } = new ProductTransformer();

        public ProductConsumer()
        {
            ThirdParty.CheckProducts(Transformer.Products.ToEnumerable());
        }
    }

    public class Worker { }
    public class Product { }

    public class ThirdParty
    {
        public void CheckProducts(IEnumerable<Product> products)
        {
        }
    }
}
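Condensed into a few lines, the same idea looks roughly like this (a sketch assuming a finite workers sequence, the question's CreateProductAsync/CheckProducts, and using System.Reactive.Linq for ToEnumerable()):
var transformer = new TransformBlock<Worker, Product>(wrk => CreateProductAsync(wrk));
var products = transformer.AsObservable().ToEnumerable(); // a pull enumerable over the pushed results

foreach (var worker in workers) // producer side: post workers whenever they become available
    transformer.Post(worker);
transformer.Complete();

CheckProducts(products); // consumes the products in the order they were produced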
Unless I misunderstood something, I don't see why you don't simply do it like this:
var productList = new List<Product>(workers.Count());
foreach (var worker in workers)
{
    productList.Add(await worker.CreateProductAsync());
}
CheckProducts(productList);
What about if you simply keep clearing a List of size 1?
var productList = new List<Product>(1);
var checkTask = Task.CompletedTask;
foreach (var worker in workers)
{
    await checkTask;
    productList.Clear();
    productList.Add(await worker.CreateProductAsync());
    checkTask = Task.Run(() => CheckProducts(productList));
}
await checkTask;
You can use Task.WhenAll, but instead of returning the result of Task.WhenAll, return the collection of tasks transformed into a collection of results.
async Task<IEnumerable<Product>> CreateProductsAsync(IEnumerable<IWorker> workers)
{
var tasks = workers.Select(worker => worker.CreateProductAsync()).ToList();
await Task.WhenAll(tasks);
return tasks.Select(task => task.Result);
}
The order of the tasks will be preserved.
It also seems like it should be fine to go with just return await Task.WhenAll() (see the sketch after the quote below).
From the docs of Task.WhenAll Method (IEnumerable<Task<TResult>>):
The Task.Result property of the returned task will be set to
an array containing all of the results of the supplied tasks in the
same order as they were provided...
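A minimal sketch of that shorter variant (note it still starts all the tasks up front, which the question's ordering requirement ultimately rules out):
async Task<IEnumerable<Product>> CreateProductsAsync(IEnumerable<IWorker> workers)
{
    // WhenAll's result array preserves the order of the input tasks.
    return await Task.WhenAll(workers.Select(worker => worker.CreateProductAsync()));
}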
If the workers need to be executed one by one, in the order they were created, and given the requirement that another function needs the whole set of results:
async Task<IEnumerable<Product>> CreateProductsAsync(IEnumerable<IWorker> workers)
{
    var products = new List<Product>();
    foreach (var worker in workers)
    {
        var product = await worker.CreateProductAsync();
        products.Add(product);
    }
    return products;
}
You can do this now with async, IEnumerable and LINQ, but every method in the chain after the async one will deal in Task<T>, and you need to await somewhere at the end (for example with Task.WhenAll, or by awaiting each task). You can use async lambdas in the LINQ methods, which return Task<T>; you don't need to wait synchronously in them.
The Select will start your tasks sequentially, i.e. they won't even exist as tasks until the Select enumerates each one, and it won't keep going after you stop enumerating. You could also run your own foreach over the enumerable of tasks if you want to await them all individually.
You can break out of this like any other foreach without it starting all of them, so this will also work on an infinite enumerable.
public async Task Main()
{
    // This async method call could also be an async lambda
    foreach (var task in GetTasks())
    {
        var result = await task;
        Console.WriteLine($"Result is {result}");
        if (result > 5) break;
    }
}

private IEnumerable<Task<int>> GetTasks()
{
    return GetNumbers().Select(WaitAndDoubleAsync);
}

private async Task<int> WaitAndDoubleAsync(int i)
{
    Console.WriteLine($"Waiting {i} seconds asynchronously");
    await Task.Delay(TimeSpan.FromSeconds(i));
    return i * 2;
}

/// Keeps yielding numbers
private IEnumerable<int> GetNumbers()
{
    var i = 0;
    while (true) yield return i++;
}
Outputs the following, then stops:
Waiting 0 seconds asynchronously
Result is 0
Waiting 1 seconds asynchronously
Result is 2
Waiting 2 seconds asynchronously
Result is 4
Waiting 3 seconds asynchronously
Result is 6
The important thing is that you can't mix yield and await in the same method, but you can yield Tasks returned from a method that uses await absolutely fine, so you can use them together just by splitting them into separate methods. Select is already a method that uses yield, so you may not need to write your own method for this.
In your post you were looking for a Task<IEnumerable<Product>>, but what you can actually use is a IEnumerable<Task<Product>>.
You can go even further with this e.g. if you had something like a REST API where one resource can have links to other resources, like if you just wanted to get a list of users of a group, but stop when you found the user you were interested in:
public async Task<IEnumerable<Task<User>>> GetUserTasksAsync(int groupId)
{
    var group = await GetGroupAsync(groupId);
    return group.UserIds.Select(GetUserAsync);
}

foreach (var task in await GetUserTasksAsync(1))
{
    var user = await task;
    ...
}
There is no solution to your problem. You can't transform a deferred IEnumerable<Task<Product>> to a deferred IEnumerable<Product>, such that the consuming thread will not get blocked while enumerating the IEnumerable<Product>. The IEnumerable<T> is a synchronous interface. It returns an enumerator with a synchronous MoveNext method. The MoveNext returns bool, which is not an awaitable type. An asynchronous interface IAsyncEnumerable<T> exists, whose enumerator has an asynchronous MoveNextAsync method, with a return type of ValueTask<bool>. But you have explicitly said that you can't change the consuming method, so you are stuck with the IEnumerable<T> interface. No solution then.
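For completeness, a sketch of what the IAsyncEnumerable<T> route mentioned above looks like (C# 8 / .NET Core 3.0 or later), in case the consuming side could ever be changed:
async IAsyncEnumerable<Product> CreateProductsAsync(IEnumerable<IWorker> workers)
{
    foreach (var worker in workers)
        yield return await worker.CreateProductAsync(); // sequential, streaming, non-blocking
}

// Consumption requires an async-aware consumer, unlike CheckProducts:
await foreach (var product in CreateProductsAsync(workers))
{
    // ...
}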
try
// Note: this assumes workers is a List<IWorker> (List<T>.ForEach) or that a ForEach extension exists;
// the lambda becomes async void, so exceptions and completion go unobserved.
workers.ForEach(async wrkr =>
{
    var prdlist = await wrkr.CreateProductAsync();
    // Remaining tasks....
});

How to convert WaitAll() to async and await

Is it possible to rewrite the code below using the async await syntax?
private void PullTablePages()
{
    var taskList = new List<Task>();
    var faultedList = new List<Task>();
    foreach (var featureItem in featuresWithDataName)
    {
        var task = Task.Run(() => PullTablePage());
        taskList.Add(task);
        if (taskList.Count == Constants.THREADS)
        {
            var index = Task.WaitAny(taskList.ToArray());
            taskList.Remove(taskList[index]);
        }
    }
    if (taskList.Any())
    {
        Task.WaitAll(taskList.ToArray());
    }
    //todo: do something with faulted list
}
When I rewrote it as below, the code doesn't block and the console application finishes before most of the threads complete.
It seems like the await syntax doesn't block as I expected.
private async void PullTablePagesAsync()
{
    var taskList = new List<Task>();
    var faultedList = new List<Task>();
    foreach (var featureItem in featuresWithDataName)
    {
        var task = Task.Run(() => PullTablePage());
        taskList.Add(task);
        if (taskList.Count == Constants.THREADS)
        {
            var anyFinished = await Task.WhenAny(taskList.ToArray());
            await anyFinished;
            for (var index = taskList.Count - 1; index >= 0; index--)
            {
                if (taskList[index].IsCompleted)
                {
                    taskList.Remove(taskList[index]);
                }
            }
        }
    }
    if (taskList.Any())
    {
        await Task.WhenAll(taskList);
    }
    //todo: what to do with faulted list?
}
Is it possible to do so?
WaitAll doesn't seem to wait for all tasks to complete. How do I get it to do so? The return type says that it returns a task, but I can't seem to figure out the syntax.
I'm new to multithreading, so please excuse my ignorance.
Is it possible to rewrite the code below using the async await syntax?
Yes and no. Yes, because you already did it. No, because while it is equivalent from the function implementation standpoint, it's not equivalent from the function caller's perspective (as you already experienced with your console application). Contrary to many other C# keywords, await is not just syntax sugar. There is a reason the compiler forces you to mark your function with async in order to enable the await construct: your function is no longer blocking, and that puts additional responsibility on the callers - either become non-blocking (async) themselves, or use a blocking call to your function. Every async method should in fact return Task or Task<TResult>, and then the compiler will warn callers that ignore that fact. The only exception is async void, which is supposed to be used only for event handlers, which by nature should not care what the object being notified is doing.
In short, async/await is not for rewriting synchronous (blocking) code, but for easily turning it into asynchronous (non-blocking) code. If your function is supposed to be synchronous, then just keep it the way it is (your original implementation does that perfectly well). But if you need an asynchronous version, then you should change the signature to
private async Task PullTablePagesAsync()
with the body rewritten to use await (which you already did correctly). For backward compatibility, you can provide the old synchronous version on top of the new implementation like this:
private void PullTablePages() { PullTablePagesAsync().Wait(); }
It seems like the await syntax doesn't block as I expected.
You're expecting the wrong thing.
The await syntax should never block - it's just that the execution flow should not continue until the task is finished.
Usually you are using async/await in methods that return a task. In your case you're using it in a method with a return type void.
It takes time to get your head around async void methods, which is why their use is usually discouraged. An async void method runs synchronously (blocking the calling method) until the first incomplete task is awaited. What happens after that depends on the execution context (usually you're running on the thread pool). What's important: the calling method (the one that calls PullTablePagesAsync) knows nothing about the continuations and cannot know when all the code in PullTablePagesAsync is done.
Maybe take a look at the async/await best practices on MSDN.

Is it in general dubious to call Task.Factory.StartNew(async () => {}) in Subscribe?

I have a situation where I need to use a custom scheduler to run tasks (these need to be tasks) and the scheduler does not set a synchronization context (so no ObserveOn, SubscribeOn, SynchronizationContextScheduler etc. I gather). The following is how I ended up doing it. Now, I wonder, I'm not really sure if this is the fittest way of doing asynchronous calls and awaiting their results. Is this all right or is there a more robust or idiomatic way?
var orleansScheduler = TaskScheduler.Current;
var someObservable = ...;
someObservable.Subscribe(i =>
{
    Task.Factory.StartNew(async () =>
    {
        return await AsynchronousOperation(i);
    }, CancellationToken.None, TaskCreationOptions.None, orleansScheduler);
});
What if awaiting weren't needed?
Edit: I found a concrete, simplified example of what I'm doing here. Basically I'm using Rx in Orleans, and the above code is a bare-bones illustration of what I'm up to. Though I'm also interested in this situation in general, too.
The final code
It turns out this was a bit tricky in the Orleans context. I don't see how I could get to use ObserveOn, which would be just the thing I'd like to use. The problem is that by using it, the Subscribe would never get called. The code:
var orleansScheduler = TaskScheduler.Current;
var factory = new TaskFactory(orleansScheduler);
var rxScheduler = new TaskPoolScheduler(factory);
var someObservable = ...;
someObservable
    //.ObserveOn(rxScheduler) This doesn't look useful since...
    .SelectMany(i =>
    {
        //... we need to set the custom scheduler here explicitly anyway.
        //See Async SelectMany at http://log.paulbetts.org/rx-and-await-some-notes/.
        //Doing the "shorthand" form of .SelectMany(async... would call Task.Run, which
        //in turn always runs on the .NET ThreadPool and not on the Orleans scheduler, and hence
        //the following .Subscribe wouldn't be called.
        return Task.Factory.StartNew(async () =>
        {
            //In reality this is an asynchronous grain call. Doing it the "shorthand way"
            //(and optionally using ObserveOn) would get the grain called, but not the
            //following .Subscribe.
            return await AsynchronousOperation(i);
        }, CancellationToken.None, TaskCreationOptions.None, orleansScheduler).Unwrap().ToObservable();
    })
    .Subscribe(i =>
    {
        Trace.WriteLine(i);
    });
Also, a link to a related thread at Codeplex Orleans forums.
I strongly recommend against StartNew for any modern code. It does have a use case, but it's very rare.
If you have to use a custom task scheduler, I recommend using ObserveOn with a TaskPoolScheduler constructed from a TaskFactory wrapper around your scheduler. That's a mouthful, so here's the general idea:
var factory = new TaskFactory(customScheduler);
var rxScheduler = new TaskPoolScheduler(factory);
someObservable.ObserveOn(rxScheduler)...
Then you could use SelectMany to start an asynchronous operation for each event in a source stream as they arrive.
An alternative, less ideal solution is to use async void for your subscription "events". This is acceptable, but you have to watch your error handling. As a general rule, don't allow exceptions to propagate out of an async void method.
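A quick sketch of that alternative (my own illustration; the Log helper is hypothetical):
someObservable.Subscribe(async i => // the async lambda becomes an async void handler
{
    try
    {
        await AsynchronousOperation(i);
    }
    catch (Exception ex)
    {
        Log(ex); // hypothetical logging hook; don't let exceptions escape async void
    }
});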
There is a third alternative, where you hook an observable into a TPL Dataflow block. A block like ActionBlock can specify its task scheduler, and Dataflow naturally understands asynchronous handlers. Note that by default, Dataflow blocks will throttle the processing to a single element at a time.
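A rough sketch of that third option (assuming orleansScheduler and AsynchronousOperation from the question, and an int-valued observable):
var block = new ActionBlock<int>(
    async i => { await AsynchronousOperation(i); },
    new ExecutionDataflowBlockOptions
    {
        TaskScheduler = orleansScheduler, // run the handlers on the custom scheduler
        MaxDegreeOfParallelism = 1        // the default: one element at a time
    });

var subscription = someObservable.Subscribe(i => block.Post(i));
// ...later, when the stream is finished:
// subscription.Dispose();
// block.Complete();
// await block.Completion;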
Generally speaking, instead of subscribing to execute, it's better/more idiomatic to project the task parameters into the task execution and subscribe just for the results. That way you can compose with further Rx downstream.
e.g. Given a random task like:
static async Task<int> DoubleAsync(int i, Random random)
{
Console.WriteLine("Started");
await Task.Delay(TimeSpan.FromSeconds(random.Next(10) + 1));
return i * 2;
}
Then you might do:
void Main()
{
    var random = new Random();
    // stream of task parameters
    var source = Observable.Range(1, 5);
    // project the task parameters into the task execution, collect and flatten results
    source.SelectMany(i => DoubleAsync(i, random))
          // subscribe just for results, which turn up as they are done
          // gives you flexibility to continue the rx chain here
          .Subscribe(result => Console.WriteLine(result),
                     () => Console.WriteLine("All done."));
}

TPL DataFlow with Lazy Source / stream of data

Suppose you have a TransformBlock with configured parallelism and want to stream data through the block. The input data should be created only when the pipeline can actually start processing it. (And should be released the moment it leaves the pipeline.)
Can I achieve this? And if so how?
Basically I want a data source that works as an iterator.
Like so:
public IEnumerable<Guid> GetSourceData()
{
    // In reality this should also be an async task, but yield return does not work in combination with async/await...
    Func<ICollection<Guid>> GetNextBatch = () => Enumerable.Range(0, 100).Select(x => Guid.NewGuid()).ToArray();
    while (true)
    {
        var batch = GetNextBatch();
        if (batch == null || !batch.Any()) break;
        foreach (var guid in batch)
            yield return guid;
    }
}
This would result in roughly 100 records in memory at a time. OK: more if the blocks you append to this data source keep them in memory for a while, but at least you have a chance of handling only a subset (a stream) of the data.
Some background information:
I intend to use this in combination with Azure Cosmos DB, where the source could be all objects in a collection, or a change feed. Needless to say, I don't want all of those objects stored in memory. So this can't work:
using System.Threading.Tasks.Dataflow;

public async Task ExampleTask()
{
    Func<Guid, object> TheActualAction = text => text.ToString();
    var config = new ExecutionDataflowBlockOptions
    {
        BoundedCapacity = 5,
        MaxDegreeOfParallelism = 15
    };
    var throtteler = new TransformBlock<Guid, object>(TheActualAction, config);
    var output = new BufferBlock<object>();
    throtteler.LinkTo(output);
    throtteler.Post(Guid.NewGuid());
    throtteler.Post(Guid.NewGuid());
    throtteler.Post(Guid.NewGuid());
    throtteler.Post(Guid.NewGuid());
    //...
    throtteler.Complete();
    await throtteler.Completion;
}
The above example is not good because I add all the items without knowing if they are actually being 'used' by the transform block. Also, I don't really care about the output buffer. I understand that I need to send it somewhere so I can await the completion, but I have no use for the buffer after that. So it should just forget about all it gets ...
Post() will return false without blocking if the target is full. While this could be used in a busy-wait loop, it's wasteful. SendAsync(), on the other hand, will wait if the target is full:
public async Task ExampleTask()
{
    var config = new ExecutionDataflowBlockOptions
    {
        BoundedCapacity = 50,
        MaxDegreeOfParallelism = 15
    };
    // ActionBlock takes a single type parameter; the action's return value is simply discarded.
    var block = new ActionBlock<Guid>(guid => TheActualAction(guid), config);
    while (/* some condition */)
    {
        var data = await GetDataFromCosmosDB();
        await block.SendAsync(data);
        // Wait a bit if we want to use polling
        await Task.Delay(...);
    }
    block.Complete();
    await block.Completion;
}
It seems you want to process data at a defined degree of parallelism (MaxDegreeOfParallelism = 15). TPL dataflow is very clunky to use for such a simple requirement.
There's a very simple and powerful pattern that might solve your problem. It's a parallel async foreach loop as described here: https://blogs.msdn.microsoft.com/pfxteam/2012/03/05/implementing-a-simple-foreachasync-part-2/
public static Task ForEachAsync<T>(this IEnumerable<T> source, int dop, Func<T, Task> body)
{
    return Task.WhenAll(
        from partition in Partitioner.Create(source).GetPartitions(dop)
        select Task.Run(async delegate {
            using (partition)
                while (partition.MoveNext())
                    await body(partition.Current);
        }));
}
You can then write something like:
var dataSource = ...; //some sequence
await dataSource.ForEachAsync(15, async item => await ProcessItem(item));
Very simple.
You can dynamically reduce the DOP by using a SemaphoreSlim. The semaphore acts as a gate that only lets N concurrent threads/tasks in. N can be changed dynamically.
So you would use ForEachAsync as the basic workhorse and then add additional restrictions and throttling on top.
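A hedged sketch of that layering (ProcessItem is the hypothetical per-item work from the snippet above; the numbers are arbitrary):
var gate = new SemaphoreSlim(15, 15); // up to 15 concurrent items to start with

var loop = dataSource.ForEachAsync(15, async item =>
{
    await gate.WaitAsync();
    try { await ProcessItem(item); }
    finally { gate.Release(); }
});

// Elsewhere: temporarily shrink the effective DOP to about 5 by taking 10 permits away...
for (var i = 0; i < 10; i++) await gate.WaitAsync();
// ...and widen it again later:
gate.Release(10);

await loop;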
