I have code which streams data down from SQL and writes it to a different store. The code is approximately this:
using (var cmd = new SqlCommand("select * from MyTable", connection))
{
using (var reader = await cmd.ExecuteReaderAsync())
{
var list = new List<MyData>();
while (await reader.ReadAsync())
{
var row = GetRow(reader);
list.Add(row);
if (list.Count == BatchSize)
{
await WriteDataAsync(list);
list.Clear();
}
}
if (list.Count > 0)
{
await WriteDataAsync(list);
}
}
}
I would like to use Reactive extensions for this purpose instead. Ideally the code would look like this:
await StreamDataFromSql()
.Buffer(BatchSize)
.ForEachAsync(async batch => await WriteDataAsync(batch));
However, it seems that the extension method ForEachAsync only accepts synchronous actions. Would it be possible to write an extension which would accept an async action?
Would it be possible to write an extension which would accept an async action?
Not directly.
Rx subscriptions are necessarily synchronous because Rx is a push-based system. When a data item arrives, it travels through your query until it hits the final subscription - which in this case is to execute an Action.
The await-able methods provided by Rx are awaiting the sequence itself - i.e., ForEachAsync is asynchronous in terms of the sequence (you are asynchronously waiting for the sequence to complete), but the subscription within ForEachAsync (the action taken for each element) must still be synchronous.
In order to do a sync-to-async transition in your data pipeline, you'll need to have a buffer. An Rx subscription can (synchronously) add to the buffer as a producer while an asynchronous consumer is retrieving items and processing them. So, you'd need a producer/consumer queue that supports both synchronous and asynchronous operations.
The various block types in TPL Dataflow can satisfy this need. Something like this should suffice:
var obs = StreamDataFromSql().Buffer(BatchSize);
var buffer = new ActionBlock<IList<MyData>>(batch => WriteDataAsync(batch));
using (var subscription = obs.Subscribe(buffer.AsObserver()))
await buffer.Completion;
Note that there is no backpressure; as quickly as StreamDataFromSql can push data, it'll be buffered and stored in the incoming queue of the ActionBlock. Depending on the size and type of data, this can quickly use a lot of memory.
The correct thing to do is to use Reactive Extensions end to end - start from the point where you create the connection and go right up until you write your data.
Here's how:
IObservable<IList<MyData>> query =
    Observable
        .Using(() => new SqlConnection(""), connection =>
            Observable
                .Using(() => new SqlCommand("select * from MyTable", connection), cmd =>
                    Observable
                        .Using(() => cmd.ExecuteReader(), reader =>
                            Observable
                                //Defer ensures GetRow(reader) is evaluated once per iteration of While
                                .While(() => reader.Read(), Observable.Defer(() => Observable.Return(GetRow(reader)))))))
        .Buffer(BatchSize);
IDisposable subscription =
query
.Subscribe(async list => await WriteDataAsync(list));
I couldn't test the code, but it should work. It assumes that WriteDataAsync can accept an IList<MyData> too; if it can't, just drop in a .ToList().
Here is a version of the ForEachAsync method that supports asynchronous actions. It projects the source observable to a nested IObservable<IObservable<Unit>> containing the asynchronous actions, and then flattens it back to an IObservable<Unit> using the Merge operator. The resulting observable is finally converted to a task.
By default the actions are invoked sequentially, but it is possible to invoke them concurrently by configuring the optional maximumConcurrency argument.
Canceling the optional cancellationToken argument results in the immediate completion (cancellation) of the returned Task, potentially before the cancellation of the currently running actions.
Any exception that may occur is propagated through the Task, and causes the cancellation of all currently running actions.
/// <summary>
/// Invokes an asynchronous action for each element in the observable sequence,
/// and returns a 'Task' that represents the completion of the sequence and
/// all the asynchronous actions.
/// </summary>
public static Task ForEachAsync<TSource>(
this IObservable<TSource> source,
Func<TSource, CancellationToken, Task> action,
CancellationToken cancellationToken = default,
int maximumConcurrency = 1)
{
// Arguments validation omitted
return source
.Select(item => Observable.FromAsync(ct => action(item, ct)))
.Merge(maximumConcurrency)
.DefaultIfEmpty()
.ToTask(cancellationToken);
}
Usage example:
await StreamDataFromSql()
.Buffer(BatchSize)
.ForEachAsync(async (batch, token) => await WriteDataAsync(batch, token));
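For example, to allow up to four concurrent invocations of WriteDataAsync (the value 4 is just an illustration):
await StreamDataFromSql()
    .Buffer(BatchSize)
    .ForEachAsync(async (batch, token) => await WriteDataAsync(batch, token),
        maximumConcurrency: 4);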
Here is the source code for ForEachAsync, and an article on the ToEnumerable and AsObservable methods.
We can make a wrapper around the ForEachAsync that will await a Task-returning function:
public static async Task ForEachAsync<T>(this IObservable<T> t, Func<T, Task> onNext)
{
    // Note: ToEnumerable blocks the consuming thread while it waits for each element.
    foreach (var x in t.ToEnumerable())
        await onNext(x);
}
Example usage:
await ForEachAsync(Observable.Range(0, 10), async x => await Task.FromResult(x));
Related
I'm submitting a series of select statements (queries - thousands of them) to a single database synchronously and getting back one DataTable per query (Note: This program is such that it has knowledge of the DB schema it is scanning only at run time, hence the use of DataTables). The program runs on a client machine and connects to DBs on a remote machine. It takes a long time to run so many queries. So, assuming that executing them async or in parallel will speed things up, I'm exploring TPL Dataflow (TDF). I want to use the TDF library because it seems to handle all of the concerns related to writing multi-threaded code that would otherwise need to be done by hand.
The code shown is based on http://blog.i3arnon.com/2016/05/23/tpl-dataflow/. Its minimal and is just to help me understand the basic operations of TDF. Please do know I've read many blogs and coded many iterations trying to crack this nut.
Nonetheless, with this current iteration, I have one problem and a question:
Problem
The code is inside a button click method (Using a UI, a user selects a machine, a sql instance, and a database, and then kicks off the scan). The two lines with the await operator return an error at build time: The 'await' operator can only be used within an async method. Consider marking this method with the 'async' modifier and changing its return type to 'Task'. I can't change the return type of the button click method. Do I need to somehow isolate the button click method from the async-await code?
Question
Although I've found beaucoup write-ups describing the basics of TDF, I can't find an example of how to get my hands on the output that each invocation of the TransformBlock produces (i.e., a DataTable). Although I want to submit the queries async, I do need to block until all queries submitted to the TransformBlock are completed. How do I get my hands on the series of DataTables produced by the TransformBlock and block until all queries are complete?
Note: I acknowledge that I have only one block now. At a minimum, I'll be adding a cancellation block and so do need/want to use TPL.
private async Task ToolStripButtonStart_Click(object sender, EventArgs e)
{
UserInput userInput = new UserInput
{
MachineName = "gat-admin",
InstanceName = "",
DbName = "AdventureWorks2014",
};
DataAccessLayer dataAccessLayer = new DataAccessLayer(userInput.MachineName, userInput.InstanceName);
//CreateTableQueryList gets a list of all tables from the DB and returns a list of
// select statements, one per table, e.g., SELECT * from [schemaname].[tablename]
IList<String> tableQueryList = CreateTableQueryList(userInput);
// Define a block that accepts a select statement and returns a DataTable of results
// where each returned record is: schemaname + tablename + columnname + column datatype + field data
// e.g., if the select query returns one record with 5 columns, then a datatable with 5
// records (one per field) will come back
var transformBlock_SubmitTableQuery = new TransformBlock<String, Task<DataTable>>(
async tableQuery => await dataAccessLayer._SubmitSelectStatement(tableQuery),
new ExecutionDataflowBlockOptions
{
MaxDegreeOfParallelism = 2,
});
// Add items to the block and start processing
foreach (String tableQuery in tableQueryList)
{
await transformBlock_SubmitTableQuery.SendAsync(tableQuery);
}
// Enable the Cancel button and disable the Start button.
toolStripButtonStart.Enabled = false;
toolStripButtonStop.Enabled = true;
//shut down the block (no more inputs or outputs)
transformBlock_SubmitTableQuery.Complete();
//await the completion of the task that produces the output DataTable
await transformBlock_SubmitTableQuery.Completion;
}
public async Task<DataTable> _SubmitSelectStatement(string queryString )
{
try
{
.
.
await Task.Run(() => sqlDataAdapter.Fill(dt));
// process dt into the output DataTable I need
return outputDt;
}
catch
{
throw;
}
}
The cleanest way to retrieve the output of a TransformBlock is to perform a nested loop using the methods OutputAvailableAsync and TryReceive. It is a bit verbose, so you could consider encapsulating this functionality in an extension method ToListAsync:
public static async Task<List<T>> ToListAsync<T>(this IReceivableSourceBlock<T> source,
CancellationToken cancellationToken = default)
{
ArgumentNullException.ThrowIfNull(source);
List<T> list = new();
while (await source.OutputAvailableAsync(cancellationToken).ConfigureAwait(false))
{
while (source.TryReceive(out T item))
{
list.Add(item);
}
}
Debug.Assert(source.Completion.IsCompleted);
await source.Completion.ConfigureAwait(false); // Propagate possible exception
return list;
}
Then you could use the ToListAsync method like this:
private async Task ToolStripButtonStart_Click(object sender, EventArgs e)
{
TransformBlock<string, DataTable> transformBlock = new(async query => //...
//...
transformBlock.Complete();
foreach (DataTable dataTable in await transformBlock.ToListAsync())
{
// Do something with each dataTable
}
}
Note: this ToListAsync implementation is destructive, meaning that in case of an error the consumed messages are discarded. To make it non-destructive, just remove the await source.Completion line. In this case you'll have to remember to await the Completion of the block after processing the list with the consumed messages, otherwise you won't be aware if the TransformBlock failed to process all of its input.
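For example, the non-destructive usage would look roughly like this (assuming the await source.Completion line has been removed from the ToListAsync above):
List<DataTable> dataTables = await transformBlock.ToListAsync();
foreach (DataTable dataTable in dataTables)
{
    // Do something with each dataTable
}
await transformBlock.Completion; // Observe a possible failure of the block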
Alternative ways to retrieve the output of a dataflow block do exist, for example this one by dcastro uses a BufferBlock as a buffer and is slightly more performant, but personally I find the approach above to be safer and more straightforward.
Instead of waiting for the completion of the block before retrieving the output, you could also retrieve it in a streaming manner, as an IAsyncEnumerable<T> sequence:
public static async IAsyncEnumerable<T> ToAsyncEnumerable<T>(
this IReceivableSourceBlock<T> source,
[EnumeratorCancellation] CancellationToken cancellationToken = default)
{
ArgumentNullException.ThrowIfNull(source);
while (await source.OutputAvailableAsync(cancellationToken).ConfigureAwait(false))
{
while (source.TryReceive(out T item))
{
yield return item;
cancellationToken.ThrowIfCancellationRequested();
}
}
Debug.Assert(source.Completion.IsCompleted);
await source.Completion.ConfigureAwait(false); // Propagate possible exception
}
This way you will be able to get your hands on each DataTable immediately after it has been cooked, without having to wait for the processing of all queries. To consume an IAsyncEnumerable<T> you simply move the await before the foreach:
await foreach (DataTable dataTable in transformBlock.ToAsyncEnumerable())
{
// Do something with each dataTable
}
Advanced: Below is a more sophisticated version of the ToListAsync method, which propagates all the errors of the underlying block in the same direct way that they are propagated by methods like Task.WhenAll and Parallel.ForEachAsync. The original, simpler ToListAsync method wraps the errors in a nested AggregateException, using the Wait technique that is shown in this answer.
/// <summary>
/// Asynchronously waits for the successful completion of the specified source, and
/// returns all the received messages. In case the source completes with error,
/// the error is propagated and the received messages are discarded.
/// </summary>
public static Task<List<T>> ToListAsync<T>(this IReceivableSourceBlock<T> source,
CancellationToken cancellationToken = default)
{
ArgumentNullException.ThrowIfNull(source);
async Task<List<T>> Implementation()
{
List<T> list = new();
while (await source.OutputAvailableAsync(cancellationToken)
.ConfigureAwait(false))
while (source.TryReceive(out T item))
list.Add(item);
await source.Completion.ConfigureAwait(false);
return list;
}
return Implementation().ContinueWith(t =>
{
if (t.IsCanceled) return t;
Debug.Assert(source.Completion.IsCompleted);
if (source.Completion.IsFaulted)
{
TaskCompletionSource<List<T>> tcs = new();
tcs.SetException(source.Completion.Exception.InnerExceptions);
return tcs.Task;
}
return t;
}, default, TaskContinuationOptions.DenyChildAttach |
TaskContinuationOptions.ExecuteSynchronously, TaskScheduler.Default).Unwrap();
}
.NET 6 update: A new API DataflowBlock.ReceiveAllAsync was introduced in .NET 6, with this signature:
public static IAsyncEnumerable<TOutput> ReceiveAllAsync<TOutput> (
this IReceivableSourceBlock<TOutput> source,
CancellationToken cancellationToken = default);
It is similar to the aforementioned ToAsyncEnumerable method. The important difference is that the new API does not propagate the possible exception of the consumed source block after propagating all of its messages. This behavior is not consistent with the analogous ReadAllAsync API from the Channels library. I have reported this inconsistency on GitHub, and the issue is currently labeled by Microsoft as a bug.
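Until that is resolved, a possible workaround (my suggestion, not something documented) is to await the block's Completion after the enumeration, so that a possible exception still surfaces:
await foreach (DataTable dataTable in transformBlock.ReceiveAllAsync())
{
    // Do something with each dataTable
}
await transformBlock.Completion; // Propagates the block's exception, if any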
As it turns out, to meet my requirements, TPL Dataflow is a bit overkill. I was able to meet my requirements using async/await and Task.WhenAll. I used the Microsoft How-To How to: Extend the async Walkthrough by Using Task.WhenAll (C#) as a model.
Regarding my "Problem"
My "problem" is not a problem. An event method signature (in my case, a "Start" button click method that initiates my search) can be modified to be async. In the Microsoft How-To GetURLContentsAsync solution, see the startButton_Click method signature:
private async void startButton_Click(object sender, RoutedEventArgs e)
{
.
.
}
Regarding my question
Using Task.WhenAll, I can wait for all my queries to finish and then process all the outputs for use on my UI. In the Microsoft How-To GetURLContentsAsync solution, see the SumPageSizesAsync method, i.e., the array of int named lengths that collects all the outputs.
private async Task SumPageSizesAsync()
{
.
.
// Create a query.
IEnumerable<Task<int>> downloadTasksQuery = from url in urlList select ProcessURLAsync(url);
// Use ToArray to execute the query and start the download tasks.
Task<int>[] downloadTasks = downloadTasksQuery.ToArray();
// Await the completion of all the running tasks.
Task<int[]> whenAllTask = Task.WhenAll(downloadTasks);
int[] lengths = await whenAllTask;
.
.
}
Using Dataflow blocks properly results in both cleaner and faster code. Dataflow blocks aren't agents or tasks. They're meant to work in a pipeline of blocks, connected with LinkTo calls, not manual coding.
It seems the scenario is to download some data, e.g. some CSVs, parse them, and insert them into a database. Each of those steps can go into its own block:
a Downloader with a DOP > 1, to allow multiple downloads to run concurrently without flooding the network.
a Parser that converts the files into arrays of objects
an Importer that uses SqlBulkCopy to bulk insert the rows into the database in the fastest way possible, using minimal logging.
var downloadDOP=8;
var parseDOP=2;
var tableName="SomeTable";
var linkOptions=new DataflowLinkOptions { PropagateCompletion = true};
var downloadOptions =new ExecutionDataflowBlockOptions {
MaxDegreeOfParallelism = downloadDOP,
};
var parseOptions =new ExecutionDataflowBlockOptions {
MaxDegreeOfParallelism = parseDOP,
};
With these options, we can construct a pipeline of blocks
//HttpClient is thread-safe and reusable
HttpClient httpClient=new HttpClient(...);
var downloader=new TransformBlock<(Uri Uri,string Path),FileInfo>(async item=>{
    var file=new FileInfo(item.Path);
    using var stream=await httpClient.GetStreamAsync(item.Uri);
    using var fileStream=file.Create();
    //Copy the download stream into the local file
    await stream.CopyToAsync(fileStream);
    return file;
},downloadOptions);
var parser=new TransformBlock<FileInfo,Foo[]>(async file=>{
using var reader = file.OpenText();
using var csv = new CsvReader(reader, CultureInfo.InvariantCulture);
var records = csv.GetRecords<Foo>().ToArray();
return records;
},parseOptions);
var importer=new ActionBlock<Foo[]>(async recs=>{
using var bcp=new SqlBulkCopy(connectionString, SqlBulkCopyOptions.TableLock);
bcp.DestinationTableName=tableName;
//Map columns if needed
...
using var reader=ObjectReader.Create(recs);
await bcp.WriteToServerAsync(reader);
});
downloader.LinkTo(parser,linkOptions);
parser.LinkTo(importer,linkOptions);
Once the pipeline is complete, you can start posting Uris to the head block and await until the tail block completes:
IEnumerable<(Uri,string)> filesToDownload = ...
foreach(var pair in filesToDownload)
{
await downloader.SendAsync(pair);
}
downloader.Complete();
await importer.Completion;
The code uses CsvHelper to parse the CSV file and FastMember's ObjectReader to create an IDataReader wrapper over the CSV records.
In each block you can use a Progress<T> instance to update the UI based on the pipeline's progress.
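For example, a rough sketch of reporting progress from the importer block (the toolStripStatusLabel control and the message format are assumptions, not part of the pipeline above):
// Create the Progress<T> on the UI thread so reports are marshalled back to it
var progress = new Progress<string>(message => toolStripStatusLabel.Text = message);
var importer = new ActionBlock<Foo[]>(async recs =>
{
    // ... SqlBulkCopy import as shown above ...
    ((IProgress<string>)progress).Report($"Imported {recs.Length} rows");
});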
I need to modify an existing program and it contains following code:
var inputs = events.Select(async ev => await ProcessEventAsync(ev))
.Select(t => t.Result)
.Where(i => i != null)
.ToList();
But this seems very weird to me, first of all the use of async and await in the Select. According to this answer by Stephen Cleary I should be able to drop those.
Then the second Select which selects the result. Doesn't this mean the task isn't async at all and is performed synchronously (so much effort for nothing), or will the task be performed asynchronously and when it's done the rest of the query is executed?
Should I write the above code like following according to another answer by Stephen Cleary:
var tasks = await Task.WhenAll(events.Select(ev => ProcessEventAsync(ev)));
var inputs = tasks.Where(result => result != null).ToList();
and is it completely the same like this?
var inputs = (await Task.WhenAll(events.Select(ev => ProcessEventAsync(ev))))
.Where(result => result != null).ToList();
While I'm working on this project I'd like to change the first code sample, but I'm not too keen on changing (apparently working) async code. Maybe I'm just worrying over nothing and all 3 code samples do exactly the same thing?
ProcessEventAsync looks like this:
async Task<InputResult> ProcessEventAsync(InputEvent ev) {...}
var inputs = events.Select(async ev => await ProcessEventAsync(ev))
.Select(t => t.Result)
.Where(i => i != null)
.ToList();
But this seems very weird to me, first of all the use of async and await in the select. According to this answer by Stephen Cleary I should be able to drop those.
The call to Select is valid. These two lines are essentially identical:
events.Select(async ev => await ProcessEventAsync(ev))
events.Select(ev => ProcessEventAsync(ev))
(There's a minor difference regarding how a synchronous exception would be thrown from ProcessEventAsync, but in the context of this code it doesn't matter at all.)
Then the second Select which selects the result. Doesn't this mean the task isn't async at all and is performed synchronously (so much effort for nothing), or will the task be performed asynchronously and when it's done the rest of the query is executed?
It means that the query is blocking. So it is not really asynchronous.
Breaking it down:
var inputs = events.Select(async ev => await ProcessEventAsync(ev))
will first start an asynchronous operation for each event. Then this line:
.Select(t => t.Result)
will wait for those operations to complete one at a time (first it waits for the first event's operation, then the next, then the next, etc).
This is the part I don't care for, because it blocks and also would wrap any exceptions in AggregateException.
and is it completely the same like this?
var tasks = await Task.WhenAll(events.Select(ev => ProcessEventAsync(ev)));
var inputs = tasks.Where(result => result != null).ToList();
var inputs = (await Task.WhenAll(events.Select(ev => ProcessEventAsync(ev))))
.Where(result => result != null).ToList();
Yes, those two examples are equivalent. They both start all asynchronous operations (events.Select(...)), then asynchronously wait for all the operations to complete in any order (await Task.WhenAll(...)), then proceed with the rest of the work (Where...).
Both of these examples are different from the original code. The original code is blocking and will wrap exceptions in AggregateException.
I used this code:
public static async Task<IEnumerable<TResult>> SelectAsync<TSource,TResult>(
this IEnumerable<TSource> source, Func<TSource, Task<TResult>> method)
{
return await Task.WhenAll(source.Select(async s => await method(s)));
}
like this:
var result = await sourceEnumerable.SelectAsync(async s=>await someFunction(s,other params));
Edit:
Some people have raised the issue of concurrency, like when you are accessing a database and you can't run two tasks at the same time. So here is a more complex version that also allows for a specific concurrency level:
public static async Task<IEnumerable<TResult>> SelectAsync<TSource, TResult>(
this IEnumerable<TSource> source, Func<TSource, Task<TResult>> method,
int concurrency = int.MaxValue)
{
var semaphore = new SemaphoreSlim(concurrency);
try
{
return await Task.WhenAll(source.Select(async s =>
{
try
{
await semaphore.WaitAsync();
return await method(s);
}
finally
{
semaphore.Release();
}
}));
} finally
{
semaphore.Dispose();
}
}
Without a parameter it behaves exactly as the simpler version above. With a parameter of 1 it will execute all tasks sequentially:
var result = await sourceEnumerable.SelectAsync(async s=>await someFunction(s,other params),1);
Note: Executing the tasks sequentially doesn't mean the execution will stop on error!
Just like with a larger value for concurrency or no parameter specified, all the tasks will be executed and if any of them fail, the resulting AggregateException will contain the thrown exceptions.
If you want to execute tasks one after the other and fail at the first one, try another solution, like the one suggested by xhafan (https://stackoverflow.com/a/64363463/379279)
Existing code is working, but is blocking the thread.
.Select(async ev => await ProcessEventAsync(ev))
creates a new Task for every event, but
.Select(t => t.Result)
blocks the thread waiting for each new task to end.
On the other hand, your code produces the same result but stays asynchronous.
Just one comment on your first code. This line
var tasks = await Task.WhenAll(events...
will produce a single Task<TResult[]>, so after the await the variable actually holds an array of results rather than tasks; a name like results would fit better.
Finally, your last code sample does the same but is more succinct.
For reference: Task.Wait / Task.WhenAll
I prefer this as an extension method:
public static async Task<IEnumerable<T>> WhenAll<T>(this IEnumerable<Task<T>> tasks)
{
return await Task.WhenAll(tasks);
}
So that it is usable with method chaining:
var inputs = await events
.Select(async ev => await ProcessEventAsync(ev))
.WhenAll();
I have the same problem as #KTCheek in that I need it to execute sequentially. However I figured I would try using IAsyncEnumerable (introduced in .NET Core 3) and await foreach (introduced in C# 8). Here's what I have come up with:
public static class IEnumerableExtensions {
public static async IAsyncEnumerable<TResult> SelectAsync<TSource, TResult>(this IEnumerable<TSource> source, Func<TSource, Task<TResult>> selector) {
foreach (var item in source) {
yield return await selector(item);
}
}
}
public static class IAsyncEnumerableExtensions {
public static async Task<List<TSource>> ToListAsync<TSource>(this IAsyncEnumerable<TSource> source) {
var list = new List<TSource>();
await foreach (var item in source) {
list.Add(item);
}
return list;
}
}
This can be consumed by saying:
var inputs = await events.SelectAsync(ev => ProcessEventAsync(ev)).ToListAsync();
Update: Alternatively you can add a reference to System.Linq.Async and then you can say:
var inputs = await events
.ToAsyncEnumerable()
.SelectAwait(async ev => await ProcessEventAsync(ev))
.ToListAsync();
With the current methods available in LINQ it looks quite ugly:
var tasks = items.Select(
async item => new
{
Item = item,
IsValid = await IsValid(item)
});
var tuples = await Task.WhenAll(tasks);
var validItems = tuples
.Where(p => p.IsValid)
.Select(p => p.Item)
.ToList();
Hopefully following versions of .NET will come up with more elegant tooling to handle collections of tasks and tasks of collections.
I wanted to call Select(...) but ensure it ran in sequence because running in parallel would cause some other concurrency problems, so I ended up with this.
I cannot call .Result because it will block the UI thread.
public static class TaskExtensions
{
public static async Task<IEnumerable<TResult>> SelectInSequenceAsync<TSource, TResult>(this IEnumerable<TSource> source, Func<TSource, Task<TResult>> asyncSelector)
{
var result = new List<TResult>();
foreach (var s in source)
{
result.Add(await asyncSelector(s));
}
return result;
}
}
Usage:
var inputs = (await events.SelectInSequenceAsync(ev => ProcessEventAsync(ev)))
    .Where(i => i != null)
    .ToList();
I am aware that Task.WhenAll is the way to go when we can run in parallel.
"Just because you can doesn't mean you should."
You can probably use async/await in LINQ expressions such that it will behave exactly as you want it to, but will any other developer reading your code still understand its behavior and intent?
(In particular: Should the async operations be run in parallel or are they intentionally sequential? Did the original developer even think about it?)
This is also shown clearly by the question, which seems to have been asked by a developer trying to understand someone else's code, without knowing its intent. To make sure this does not happen again, it may be best to rewrite the LINQ expression as a loop statement, if possible.
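For instance, the original query could be rewritten as a plain, intentionally sequential loop (a sketch using the names from the question):
var inputs = new List<InputResult>();
foreach (var ev in events)
{
    var input = await ProcessEventAsync(ev);
    if (input != null)
        inputs.Add(input);
}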
I have a situation where I need to use a custom scheduler to run tasks (these need to be tasks) and the scheduler does not set a synchronization context (so no ObserveOn, SubscribeOn, SynchronizationContextScheduler, etc., I gather). The following is how I ended up doing it. Now I wonder: I'm not really sure this is the best way of doing asynchronous calls and awaiting their results. Is this all right, or is there a more robust or idiomatic way?
var orleansScheduler = TaskScheduler.Current;
var someObservable = ...;
someObservable.Subscribe(i =>
{
Task.Factory.StartNew(async () =>
{
return await AsynchronousOperation(i);
}, CancellationToken.None, TaskCreationOptions.None, orleansScheduler);
});
What if awaiting weren't needed?
Edit: I found a concrete, simplified example of what I'm doing here. Basically I'm using Rx in Orleans, and the above code is a bare-bones illustration of what I'm up to. I'm also interested in this situation in general, though.
The final code
It turns out this was a bit tricky in the Orleans context. I don't see how I could use ObserveOn, which would be just the thing I'd like to use. The problem is that by using it, Subscribe would never get called. The code:
var orleansScheduler = TaskScheduler.Current;
var factory = new TaskFactory(orleansScheduler);
var rxScheduler = new TaskPoolScheduler(factory);
var someObservable = ...;
someObservable
//.ObserveOn(rxScheduler) This doesn't look useful since...
.SelectMany(i =>
{
//... we need to set the custom scheduler here explicitly anyway.
//See Async SelectMany at http://log.paulbetts.org/rx-and-await-some-notes/.
//Doing the "shorthand" form of .SelectMany(async... would call Task.Run, which
//in turn runs always on .NET ThreadPool and not on Orleans scheduler and hence
//the following .Subscribe wouldn't be called.
return Task.Factory.StartNew(async () =>
{
//In reality this is an asynchronous grain call. Doing the "shorthand way"
//(and optionally using ObserveOn) would get the grain called, but not the
//following .Subscribe.
return await AsynchronousOperation(i);
}, CancellationToken.None, TaskCreationOptions.None, orleansScheduler).Unwrap().ToObservable();
})
.Subscribe(i =>
{
Trace.WriteLine(i);
});
Also, a link to a related thread at Codeplex Orleans forums.
I strongly recommend against StartNew for any modern code. It does have a use case, but it's very rare.
If you have to use a custom task scheduler, I recommend using ObserveOn with a TaskPoolScheduler constructed from a TaskFactory wrapper around your scheduler. That's a mouthful, so here's the general idea:
var factory = new TaskFactory(customScheduler);
var rxScheduler = new TaskPoolScheduler(factory);
someObservable.ObserveOn(rxScheduler)...
Then you could use SelectMany to start an asynchronous operation for each event in a source stream as they arrive.
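Roughly like this (a sketch, assuming the AsynchronousOperation method from the question):
someObservable
    .ObserveOn(rxScheduler)
    .SelectMany(i => AsynchronousOperation(i))
    .Subscribe(result => Trace.WriteLine(result));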
An alternative, less ideal solution is to use async void for your subscription "events". This is acceptable, but you have to watch your error handling. As a general rule, don't allow exceptions to propagate out of an async void method.
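For example (a sketch; the catch block is the important part):
someObservable.Subscribe(async i =>
{
    try
    {
        await AsynchronousOperation(i);
    }
    catch (Exception ex)
    {
        // Handle or log here; never let an exception escape an async void handler
        Trace.WriteLine(ex);
    }
});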
There is a third alternative, where you hook an observable into a TPL Dataflow block. A block like ActionBlock can specify its task scheduler, and Dataflow naturally understands asynchronous handlers. Note that by default, Dataflow blocks will throttle the processing to a single element at a time.
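A rough sketch of that option, assuming the orleansScheduler and AsynchronousOperation from the question and an int-valued observable:
var block = new ActionBlock<int>(
    async i => { await AsynchronousOperation(i); },
    new ExecutionDataflowBlockOptions { TaskScheduler = orleansScheduler });
// Bridge Rx into the block; completing the observable completes the block
using (someObservable.Subscribe(block.AsObserver()))
    await block.Completion;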
Generally speaking, instead of subscribing to execute, it's better/more idiomatic to project the task parameters into the task execution and subscribe just for the results. That way you can compose with further Rx downstream.
e.g. Given a random task like:
static async Task<int> DoubleAsync(int i, Random random)
{
Console.WriteLine("Started");
await Task.Delay(TimeSpan.FromSeconds(random.Next(10) + 1));
return i * 2;
}
Then you might do:
void Main()
{
var random = new Random();
// stream of task parameters
var source = Observable.Range(1, 5);
// project the task parameters into the task execution, collect and flatten results
source.SelectMany(i => DoubleAsync(i, random))
// subscribe just for results, which turn up as they are done
// gives you flexibility to continue the rx chain here
.Subscribe(result => Console.WriteLine(result),
() => Console.WriteLine("All done."));
}
I would like to use a .NET iterator with parallel Tasks/await. Something like this:
IEnumerable<TDest> Foo<TSrc, TDest>(IEnumerable<TSrc> source)
{
Parallel.ForEach(
source,
s=>
{
// Ordering is NOT important
// items can be yielded as soon as they are done
yield return ExecuteOrDownloadSomething(s);
});
}
Unfortunately .NET cannot natively handle this. Best answer so far by #svick - use AsParallel().
BONUS: Any simple async/await code that implements multiple publishers and a single subscriber? The subscriber would yield, and the pubs would process. (core libraries only)
This seems like a job for PLINQ:
return source.AsParallel().Select(s => ExecuteOrDownloadSomething(s));
This will execute the delegate in parallel using a limited number of threads, returning each result as soon as it completes.
If the ExecuteOrDownloadSomething() method is IO-bound (e.g. it actually downloads something) and you don't want to waste threads, then using async-await might make sense, but it would be more complicated.
If you want to fully take advantage of async, you shouldn't return IEnumerable, because it's synchronous (i.e. it blocks if no items are available). What you need is some sort of asynchronous collection, and you can use ISourceBlock (specifically, TransformBlock) from TPL Dataflow for that:
ISourceBlock<TDest> Foo<TSrc, TDest>(IEnumerable<TSrc> source)
{
var block = new TransformBlock<TSrc, TDest>(
async s => await ExecuteOrDownloadSomethingAsync(s),
new ExecutionDataflowBlockOptions
{
MaxDegreeOfParallelism = DataflowBlockOptions.Unbounded
});
foreach (var item in source)
block.Post(item);
block.Complete();
return block;
}
If the source is “slow” (i.e. you want to start processing the results from Foo() before iterating source has completed), you might want to move the foreach and the Complete() call to a separate Task. An even better solution would be to make source an ISourceBlock<TSrc> too.
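For example, the feeding loop inside Foo() could be moved to a background task like this (a sketch; SendAsync is used instead of Post so the loop also respects a BoundedCapacity, if one is configured):
// Inside Foo(), instead of the synchronous foreach + Complete():
Task.Run(async () =>
{
    try
    {
        foreach (var item in source)
            await block.SendAsync(item);
        block.Complete();
    }
    catch (Exception ex)
    {
        // Surface feeding failures through the block's Completion
        ((IDataflowBlock)block).Fault(ex);
    }
});
return block;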
So it appears what you really want to do is to order a sequence of tasks based on when they complete. This is not terribly complex:
public static IEnumerable<Task<T>> Order<T>(this IEnumerable<Task<T>> tasks)
{
var input = tasks.ToList();
var output = input.Select(task => new TaskCompletionSource<T>());
var collection = new BlockingCollection<TaskCompletionSource<T>>();
foreach (var tcs in output)
collection.Add(tcs);
foreach (var task in input)
{
task.ContinueWith(t =>
{
var tcs = collection.Take();
switch (task.Status)
{
case TaskStatus.Canceled:
tcs.TrySetCanceled();
break;
case TaskStatus.Faulted:
tcs.TrySetException(task.Exception.InnerExceptions);
break;
case TaskStatus.RanToCompletion:
tcs.TrySetResult(task.Result);
break;
}
}
, CancellationToken.None
, TaskContinuationOptions.ExecuteSynchronously
, TaskScheduler.Default);
}
return output.Select(tcs => tcs.Task);
}
So here we create a TaskCompletionSource for each input task, then go through each of the tasks and set a continuation which grabs the next completion source from a BlockingCollection and sets its result. The first task to complete grabs the first tcs that was returned, the second task to complete gets the second tcs that was returned, and so on.
Now your code becomes quite simple:
var tasks = collection.Select(item => LongRunningOperationThatReturnsTask(item))
.Order();
foreach(var task in tasks)
{
var result = task.Result;//or you could `await` each result
//....
}
In the asynchronous library made by the MS robotics team, they had concurrency primitives which allowed for using an iterator to yield asynchronous code.
The library (CCR) is free (it didn't use to be free). A nice introductory article can be found here: Concurrent Affairs.
Perhaps you can use this library alongside the .NET Task library, or it'll inspire you to 'roll your own'.
Suppose you have a TransformBlock with configured parallelism and want to stream data through the block. The input data should be created only when the pipeline can actually start processing it. (And should be released the moment it leaves the pipeline.)
Can I achieve this? And if so how?
Basically I want a data source that works as an iterator.
Like so:
public IEnumerable<Guid> GetSourceData()
{
//In reality -> this should also be an async task -> but yield return does not work in combination with async/await ...
Func<ICollection<Guid>> GetNextBatch = () => Enumerable.Range(0, 100).Select(x => Guid.NewGuid()).ToArray();
while (true)
{
var batch = GetNextBatch();
if (batch == null || !batch.Any()) break;
foreach (var guid in batch)
yield return guid;
}
}
This would result in roughly 100 records in memory (more if the blocks you append to this data source keep them in memory for some time), but you at least have a chance of holding only a subset, a stream, of the data.
Some background information:
I intend to use this in combination with Azure Cosmos DB, where the source could be all objects in a collection, or a change feed. Needless to say, I don't want all of those objects stored in memory. So this can't work:
using System.Threading.Tasks.Dataflow;
public async Task ExampleTask()
{
Func<Guid, object> TheActualAction = text => text.ToString();
var config = new ExecutionDataflowBlockOptions
{
BoundedCapacity = 5,
MaxDegreeOfParallelism = 15
};
var throtteler = new TransformBlock<Guid, object>(TheActualAction, config);
var output = new BufferBlock<object>();
throtteler.LinkTo(output);
throtteler.Post(Guid.NewGuid());
throtteler.Post(Guid.NewGuid());
throtteler.Post(Guid.NewGuid());
throtteler.Post(Guid.NewGuid());
//...
throtteler.Complete();
await throtteler.Completion;
}
The above example is not good because I add all the items without knowing if they are actually being 'used' by the transform block. Also, I don't really care about the output buffer. I understand that I need to send it somewhere so I can await the completion, but I have no use for the buffer after that. So it should just forget about all it gets ...
Post() will return false, without blocking, if the target is full. While this could be used in a busy-wait loop, it's wasteful. SendAsync(), on the other hand, will wait if the target is full:
public async Task ExampleTask()
{
var config = new ExecutionDataflowBlockOptions
{
BoundedCapacity = 50,
MaxDegreeOfParallelism = 15
};
var block = new ActionBlock<Guid>(TheActualAction, config);
while(//some condition//)
{
var data=await GetDataFromCosmosDB();
await block.SendAsync(data);
//Wait a bit if we want to use polling
await Task.Delay(...);
}
block.Complete();
await block.Completion;
}
It seems you want to process data at a defined degree of parallelism (MaxDegreeOfParallelism = 15). TPL dataflow is very clunky to use for such a simple requirement.
There's a very simple and powerful pattern that might solve your problem. It's a parallel async foreach loop as described here: https://blogs.msdn.microsoft.com/pfxteam/2012/03/05/implementing-a-simple-foreachasync-part-2/
public static Task ForEachAsync<T>(this IEnumerable<T> source, int dop, Func<T, Task> body)
{
return Task.WhenAll(
from partition in Partitioner.Create(source).GetPartitions(dop)
select Task.Run(async delegate {
using (partition)
while (partition.MoveNext())
await body(partition.Current);
}));
}
You can then write something like:
var dataSource = ...; //some sequence
await dataSource.ForEachAsync(15, async item => await ProcessItem(item));
Very simple.
You can dynamically reduce the DOP by using a SemaphoreSlim. The semaphore acts as a gate that only lets N concurrent threads/tasks in. N can be changed dynamically.
So you would use ForEachAsync as the basic workhorse and then add additional restrictions and throttling on top.
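A minimal sketch of such a gate layered on top of the ForEachAsync above (dataSource and ProcessItem are the names from the earlier example; the counts are arbitrary):
var gate = new SemaphoreSlim(15); // Acquire extra permits elsewhere to lower the effective DOP at runtime
await dataSource.ForEachAsync(15, async item =>
{
    await gate.WaitAsync();
    try
    {
        await ProcessItem(item);
    }
    finally
    {
        gate.Release();
    }
});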