Unwrapping IObservable<Task<T>> into IObservable<T> with order preservation - c#

Is there a way to unwrap the IObservable<Task<T>> into IObservable<T> keeping the same order of events, like this?
Tasks: ----a-------b--c----------d------e---f---->
Values: -------A-----------B--C------D-----E---F-->
Let's say I have a desktop application that consumes a stream of messages, some of which require heavy post-processing:
IObservable<Message> streamOfMessages = ...;
IObservable<Task<Result>> streamOfTasks = streamOfMessages
.Select(async msg => await PostprocessAsync(msg));
IObservable<Result> streamOfResults = ???; // unwrap streamOfTasks
I imagine two ways of dealing with that.
First, I can subscribe to streamOfTasks using the asynchronous event handler:
streamOfTasks.Subscribe(async task =>
{
var result = await task;
Display(result);
});
Second, I can convert streamOfTasks using Observable.Create, like this:
var streamOfResults =
from task in streamOfTasks
from value in Observable.Create<T>(async (obs, cancel) =>
{
var v = await task;
obs.OnNext(v);
// TODO: don't know when to call obs.OnComplete()
})
select value;
streamOfResults.Subscribe(result => Display(result));
Either way, the order of messages is not preserved: some later messages that
don't need any post-processing come out faster than earlier messages that
require post-processing. Both my solutions handle the incoming messages
in parallel, but I'd like them to be processed sequentially, one by one.
I can write a simple task queue to process just one task at a time,
but perhaps it's an overkill. Seems to me that I'm missing something obvious.
UPD. I wrote a sample console program to demonstrate my approaches. All solutions by far don't preserve the original order of events. Here is the output of the program:
Timer: 0
Timer: 1
Async handler: 1
Observable.Create: 1
Observable.FromAsync: 1
Timer: 2
Async handler: 2
Observable.Create: 2
Observable.FromAsync: 2
Observable.Create: 0
Async handler: 0
Observable.FromAsync: 0
Here is the complete source code:
// "C:\Program Files (x86)\MSBuild\14.0\Bin\csc.exe" test.cs /r:System.Reactive.Core.dll /r:System.Reactive.Linq.dll /r:System.Reactive.Interfaces.dll
using System;
using System.Reactive;
using System.Reactive.Concurrency;
using System.Reactive.Linq;
using System.Threading.Tasks;
class Program
{
static void Main()
{
Console.WriteLine("Press ENTER to exit.");
// the source stream
var timerEvents = Observable.Timer(TimeSpan.Zero, TimeSpan.FromSeconds(1));
timerEvents.Subscribe(x => Console.WriteLine($"Timer: {x}"));
// solution #1: using async event handler
timerEvents.Subscribe(async x =>
{
var result = await PostprocessAsync(x);
Console.WriteLine($"Async handler: {x}");
});
// solution #2: using Observable.Create
var processedEventsV2 =
from task in timerEvents.Select(async x => await PostprocessAsync(x))
from value in Observable.Create<long>(async (obs, cancel) =>
{
var v = await task;
obs.OnNext(v);
})
select value;
processedEventsV2.Subscribe(x => Console.WriteLine($"Observable.Create: {x}"));
// solution #3: using FromAsync, as answered by #Enigmativity
var processedEventsV3 =
from msg in timerEvents
from result in Observable.FromAsync(() => PostprocessAsync(msg))
select result;
processedEventsV3.Subscribe(x => Console.WriteLine($"Observable.FromAsync: {x}"));
Console.ReadLine();
}
static async Task<long> PostprocessAsync(long x)
{
// some messages require long post-processing
if (x % 3 == 0)
{
await Task.Delay(TimeSpan.FromSeconds(2.5));
}
// and some don't
return x;
}
}

Combining #Enigmativity's simple approach with #VMAtm's idea of attaching the counter and some code snippets from this SO question, I came up with this solution:
// usage
var processedStream = timerEvents.SelectAsync(async t => await PostprocessAsync(t));
processedStream.Subscribe(x => Console.WriteLine($"Processed: {x}"));
// my sample console program prints the events ordered properly:
Timer: 0
Timer: 1
Timer: 2
Processed: 0
Processed: 1
Processed: 2
Timer: 3
Timer: 4
Timer: 5
Processed: 3
Processed: 4
Processed: 5
....
Here is my SelectAsync extension method to transform IObservable<Task<TSource>> into IObservable<TResult> keeping the original order of events:
public static IObservable<TResult> SelectAsync<TSource, TResult>(
this IObservable<TSource> src,
Func<TSource, Task<TResult>> selectorAsync)
{
// using local variable for counter is easier than src.Scan(...)
var counter = 0;
var streamOfTasks =
from source in src
from result in Observable.FromAsync(async () => new
{
Index = Interlocked.Increment(ref counter) - 1,
Result = await selectorAsync(source)
})
select result;
// buffer the results coming out of order
return Observable.Create<TResult>(observer =>
{
var index = 0;
var buffer = new Dictionary<int, TResult>();
return streamOfTasks.Subscribe(item =>
{
buffer.Add(item.Index, item.Result);
TResult result;
while (buffer.TryGetValue(index, out result))
{
buffer.Remove(index);
observer.OnNext(result);
index++;
}
});
});
}
I'm not particularly satisfied with my solution as it looks too complex to me, but at least it doesn't require any external dependencies. I'm using here a simple Dictionary to buffer and reorder task results because the subscriber need not to be thread-safe (the subscriptions are neved called concurrently).
Any comments or suggestions are welcome. I'm still hoping to find the native RX way of doing this without custom buffering extension method.

The RX library contains three operators that can unwrap an observable sequence of tasks, the Concat, Merge and Switch. All three accept a single source argument of type IObservable<Task<T>>, and return an IObservable<T>. Here are their descriptions from the documentation:
Concat
Concatenates all task results, as long as the previous task terminated successfully.
Merge
Merges results from all source tasks into a single observable sequence.
Switch
Transforms an observable sequence of tasks into an observable sequence producing values only from the most recent observable sequence. Each time a new task is received, the previous task's result is ignored.
In other words the Concat returns the results in their original order, the Merge returns the results in order of completion, and the Switch filters out any results from tasks that didn't complete before the next task was emitted. So your problem can be solved by just using the built-in Concat operator. No custom operator is needed.
var streamOfResults = streamOfTasks
.Select(async task =>
{
var result1 = await task;
var result2 = await PostprocessAsync(result1);
return result2;
})
.Concat();
The tasks are already started before they are emitted by the streamOfTasks. In other words they are emerging in a "hot" state. So the fact that the Concat operator awaits them the one after the other has no consequence regarding the concurrency of the operations. It only affects the order of their results. This would be a consideration if instead of hot tasks you had cold observables, like these created by the Observable.FromAsync and Observable.Create methods, in which case the Concat would execute the operations sequentially.

Is the following simple approach an answer for you?
IObservable<Result> streamOfResults =
from msg in streamOfMessages
from result in Observable.FromAsync(() => PostprocessAsync(msg))
select result;

To maintain the order of events you can funnel your stream into a TransformBlock from TPL Dataflow. The TransformBlock would execute your post-processing logic and will maintain the order of its output by default.
using System;
using System.Collections.Generic;
using System.Reactive.Linq;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;
using NUnit.Framework;
namespace HandlingStreamInOrder {
[TestFixture]
public class ItemHandlerTests {
[Test]
public async Task Items_Are_Output_In_The_Same_Order_As_They_Are_Input() {
var itemHandler = new ItemHandler();
var timerEvents = Observable.Timer(TimeSpan.Zero, TimeSpan.FromMilliseconds(250));
timerEvents.Subscribe(async x => {
var data = (int)x;
Console.WriteLine($"Value Produced: {x}");
var dataAccepted = await itemHandler.SendAsync((int)data);
if (dataAccepted) {
InputItems.Add(data);
}
});
await Task.Delay(5000);
itemHandler.Complete();
await itemHandler.Completion;
CollectionAssert.AreEqual(InputItems, itemHandler.OutputValues);
}
private IList<int> InputItems {
get;
} = new List<int>();
}
public class ItemHandler {
public ItemHandler() {
var options = new ExecutionDataflowBlockOptions() {
BoundedCapacity = DataflowBlockOptions.Unbounded,
MaxDegreeOfParallelism = Environment.ProcessorCount,
EnsureOrdered = true
};
PostProcessBlock = new TransformBlock<int, int>((Func<int, Task<int>>)PostProcess, options);
var output = PostProcessBlock.AsObservable().Subscribe(x => {
Console.WriteLine($"Value Output: {x}");
OutputValues.Add(x);
});
}
public async Task<bool> SendAsync(int data) {
return await PostProcessBlock.SendAsync(data);
}
public void Complete() {
PostProcessBlock.Complete();
}
public Task Completion {
get { return PostProcessBlock.Completion; }
}
public IList<int> OutputValues {
get;
} = new List<int>();
private IPropagatorBlock<int, int> PostProcessBlock {
get;
}
private async Task<int> PostProcess(int data) {
if (data % 3 == 0) {
await Task.Delay(TimeSpan.FromSeconds(2));
}
return data;
}
}
}

Rx and TPL can be easily combined here, and TPL do save the order of events, by default, so your code could be something like this:
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;
static async Task<long> PostprocessAsync(long x) { ... }
IObservable<Message> streamOfMessages = ...;
var streamOfTasks = new TransformBlock<long, long>(async msg =>
await PostprocessAsync(msg)
// set the concurrency level for messages to handle
, new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = Environment.ProcessorCount });
// easily convert block into observable
IObservable<long> streamOfResults = streamOfTasks.AsObservable();
Edit: Rx extensions meant to be a reactive pipeline of events for UI. As this type of applications are in general single-threaded, so messages are being handled with saving the order. But in general events in C# aren't thread safe, so you have to provide some additional logic to same the order.
If you don't like the idea to introduce another dependency, you need to store the operation number with Interlocked class, something like this:
// counter for operations get started
int operationNumber = 0;
// counter for operations get done
int doneNumber = 0;
...
var currentOperationNumber = Interlocked.Increment(ref operationNumber);
...
while (Interlocked.CompareExchange(ref doneNumber, currentOperationNumber + 1, currentOperationNumber) != currentOperationNumber)
{
// spin once here
}
// handle event
Interlocked.Increment(ref doneNumber);

Related

Slow down IObserver.OnNext while item handling is in progress [duplicate]

Is there a way in C# rx to handle backpressure?
I'm trying to call a web api from the results of a paged query. This web api is very fragile and I need to not have more than say 3 concurrent calls, so, the program should be something like:
Feth a page from db
Call the web api with a maximum of three concurrent calls per each record on the page
Save the results back to db
Fetch another page and repeat until there are no more results.
I'm not really getting the sequence that I'm after, basically the db gets all the records regardless of whether they can be processed or not.
I've tried a variety of things including tweaking at the ObserveOn operator, implementing a semaphore, and a few other things. Could I get a little bit of guidance to implement something like this?
using System;
using System.Collections.Generic;
using System.Linq;
using System.Reactive.Concurrency;
using System.Reactive.Linq;
using System.Reactive.Threading.Tasks;
using System.Threading;
using System.Threading.Tasks;
using Castle.Core.Internal;
using Xunit;
using Xunit.Abstractions;
namespace ProductValidation.CLI.Tests.Services
{
public class Example
{
private readonly ITestOutputHelper output;
public Example(ITestOutputHelper output)
{
this.output = output;
}
[Fact]
public async Task RunsObservableToCompletion()
{
var repo = new Repository(output);
var client = new ServiceClient(output);
var results = repo.FetchRecords()
.Select(x => client.FetchMoreInformation(x).ToObservable())
.Merge(1)
.Do(async x => await repo.Save(x));
await results.LastOrDefaultAsync();
}
}
public class Repository
{
private readonly ITestOutputHelper output;
public Repository(ITestOutputHelper output)
{
this.output = output;
}
public IObservable<int> FetchRecords()
{
return Observable.Create<int>(async (observer) =>
{
var page = 1;
var products = await FetchPage(page);
while (!products.IsNullOrEmpty())
{
foreach (var product in products)
{
observer.OnNext(product);
}
page += 1;
products = await FetchPage(page);
}
observer.OnCompleted();
})
.ObserveOn(SynchronizationContext.Current);
}
private async Task<IEnumerable<int>> FetchPage(int page)
{
// Simulate fetching a paged query.
await Task.Delay(500).ToObservable().ObserveOn(new TaskPoolScheduler(new TaskFactory()));
output.WriteLine("Fetching page {0}", page);
if (page >= 4) return Enumerable.Empty<int>();
return Enumerable.Range(1, 3).Select(_ => page);
}
public async Task Save(string id)
{
await Task.Delay(50); //Simulates latency
}
}
public class ServiceClient
{
private readonly ITestOutputHelper output;
private readonly SemaphoreSlim semaphore;
public ServiceClient(ITestOutputHelper output)
{
this.output = output;
this.semaphore = new SemaphoreSlim(2);
}
public async Task<string> FetchMoreInformation(int id)
{
try
{
output.WriteLine("Calling the web client for {0}", id);
await semaphore.WaitAsync(); // Protection for the webapi not sending too many calls
await Task.Delay(1000); //Simulates latency
return id.ToString();
}
finally
{
semaphore.Release();
}
}
}
}
The Rx does not support backpressure, so there is no easy way to fetch the records from the DB at the same tempo that the records are processed. Maybe you could use a Subject<Unit> as a signaling mechanism, push a value every time a record is processed, and devise a way to use these signals at the producing site to fetch a new record from the DB when a signal is received. But it will be a messy and idiomatic solution. The TPL Dataflow is a more suitable tool than the Rx for doing this kind of work. It supports natively the BoundedCapacity configuration option.
Some comments regarding the code you've posted, that are not directly related to the backpressure issue:
The Merge operator with a maxConcurrent parameter imposes a limit on the concurrent subscriptions to the inner sequences, but this will have no effect in case the inner sequences are already up and running. So you have to ensure that the inner sequences are cold, and a handy way to do this is the Defer operator:
.Select(x => Observable.Defer(() =>
client.FetchMoreInformation(x).ToObservable()))
A more common way to convert asynchronous methods to deferred observable sequences is the FromAsync operator:
.Select(x => Observable.FromAsync(() => client.FetchMoreInformation(x)))
Btw the Do operator does not understand async delegates, so instead of:
.Do(async x => await repo.Save(x));
...which creates async void lambdas, it's better to do this:
.Select(x => Observable.FromAsync(() => repo.Save(x)))
.Merge(1);
Update: Here is an example of how you could use a SemaphoreSlim in order to implement backpressure in Rx:
const int boundedCapacity = 10;
using var semaphore = new SemaphoreSlim(boundedCapacity, boundedCapacity);
IObservable<int> results = repo
.FetchRecords(semaphore)
.Select(x => Observable.FromAsync(() => client.FetchMoreInformation(x)))
.Merge(1)
.Select(x => Observable.FromAsync(() => repo.Save(x)))
.Merge(1)
.Do(_ => semaphore.Release());
await results.DefaultIfEmpty();
And inside the FetchRecords method:
//...
await semaphore.WaitAsync();
observer.OnNext(product);
//...
This is a fragile solution, because it depends on propagating all elements through the pipeline. If in the future you decide to include filtering or throttling inside the pipeline, then the one-to-one relationship between WaitAsync and Release will be violated, with the most probable outcome being a deadlocked pipeline.

Async/Await in loop behaviour

Say I had a list of algorithms, each containing an async call somewhere in the body of that algorithm. The order in which I execute the algorithms is the order in which I want to receive the results. That is, I want the AlgorithmResult List to look like {Algorithm1Result, Algorithm2Result, Algorithm3Result} after all the algorithms have executed. Would I be right in saying that if Algorithm1 and 3 finished before 2 that my results would actually be in the order {Algorithm1Result, Algorithm3Result, Algorithm2Result}
var algorithms = new List<Algorithm>(){Algorithm1, Algorithm2, Algorithm3};
var algorithmResults = new List<AlgorithmResults>();
foreach (var algorithm in algorithms)
{
algorithmResults.Add(await algorithm.Execute());
}
NO, Result would be in the same order you added it to the list, since each operation is being waited for separately.
class Program
{
public static async Task<int> GetResult(int timing)
{
return await Task<int>.Run(() =>
{
Thread.Sleep(timing * 1000);
return timing;
});
}
public static async Task<List<int>> GetAll()
{
List<int> tasks = new List<int>();
tasks.Add(await GetResult(3));
tasks.Add(await GetResult(2));
tasks.Add(await GetResult(1));
return tasks;
}
static void Main(string[] args)
{
var res = GetAll().Result;
}
}
res anyway contains list in order it was added, also this is not parallel execution.
Since you don't add the tasks, but the results of the task after awaiting, they will be in the order you want.
Even if you did not await the result before adding the next task you could get the order you want:
in small steps, showing type awareness:
List<Task<AlgorithmResult>> tasks = new List<Task<AlgorithmResult>>();
foreach (Algorithm algorithm in algorithms)
{
Task<AlgorithmResult> task = algorithm.Execute();
// don't wait until task Completes, just remember the Task
// and continue executing the next Algorithm
tasks.Add(task);
}
Now some Tasks may be running, some may already have completed. Let's wait until they are all complete, and fetch the results:
Task.WhenAll(tasks.ToArray());
List<AlgrithmResults> results = tasks.Select(task => task.Result).ToList();

Fire and forget with TPL

I have following code:
public void Func()
{
...
var task = httpClient.PostAsync(...);
var onlyOnRanToCompletionTask = task
.ContinueWith(
t => OnPostAsyncSuccess(t, notification, provider, account.Name),
TaskContinuationOptions.OnlyOnRanToCompletion);
var onlyOnFaultedTask = task
.ContinueWith(
t => OnPostAsyncAggregateException(t, notification.EntityId),
TaskContinuationOptions.OnlyOnFaulted);
return true;
}
Func is not async and I would like to have something like fire and forget but with continuation function. I don't want to have that function async. The scenario is that this function is called in some kind of loop for group of objects to handle them. For me the problem seems that e.g. onlyOnRanToCompletionTask can be deleted when we finish Func execution.
Thank you in advance!
But your code should be performed as expected. Simple example:
void Main()
{
HandlingMyFuncAsync();
Console.WriteLine("Doing some work, while 'fire and forget job is performed");
Console.ReadLine();
}
public void HandlingMyFuncAsync()
{
var task = MyFuncAsync();
task.ContinueWith(t => Console.WriteLine(t), TaskContinuationOptions.OnlyOnRanToCompletion);
}
public async Task<string> MyFuncAsync()
{
await Task.Delay(5000);
return "A";
}
produces
Doing some work, while 'fire and forget job is performed
[end after 5 sec]
A
If I were you I'd look at using Microsoft's Reactive Framework (Rx) for this. Rx is designed for this kind of thing.
Here's some basic code:
void Main()
{
int[] source = new [] { 1, 2, 3 };
IObservable<int> query =
from s in source.ToObservable()
from u in Observable.Start(() => Func(s))
select s;
IDisposable subscription =
query
.Subscribe(
x => Console.WriteLine($"!{x}!"),
() => Console.WriteLine("Done."));
}
public void Func(int x)
{
Console.WriteLine($"+{x} ");
Thread.Sleep(TimeSpan.FromSeconds(1.0));
Console.WriteLine($" {x}-");
}
When I run that I get this kind of output:
+1
+2
+3
1-
!1!
2-
!2!
3-
!3!
Done.
It triggers off all of the calls and waits for the results to come in. It also let's you know when it is done.
Here's a more practical example showing how to use async and Rx to return a result:
void Main()
{
int[] source = new [] { 1, 2, 3 };
var query =
from s in source.ToObservable()
from u in Observable.FromAsync(() => Func(s))
select new { s, u };
IDisposable subscription =
query
.Subscribe(
x => Console.WriteLine($"!{x.s},{x.u}!"),
() => Console.WriteLine("Done."));
}
public async Task<int> Func(int x)
{
Console.WriteLine($"+{x} ");
await Task.Delay(TimeSpan.FromSeconds(1.0));
Console.WriteLine($" {x}-");
return 10 * x;
}
That gives:
+1
+2
+3
1-
!1,10!
3-
2-
!2,20!
!3,30!
Done.
Without seeing your full code I can't easily give you a complete example that you can use, but Rx will also handle opening an closing on your DB contexts using the Observable.Using operator.
You can get Rx by Nugetting "System.Reactive".

creating a .net async wrapper to a sync request

I have the following situation (or a basic misunderstanding with the async await mechanism).
Assume you have a set of 1-20 web request call that takes a long time: findItemsByProduct().
you want to wrap it around in an async request, that would be able to abstract all these calls into one async call, but I can't seem to be able to do it without using more threads.
If I'm doing:
int total = result.paginationOutput.totalPages;
for (int i = 2; i < total + 1; i++)
{
await Task.Factory.StartNew(() =>
{
result = client.findItemsByProduct(i);
});
newList.AddRange(result.searchResult.item);
}
}
return newList;
problem here, that the calls don't run together, rather they are waiting one by one.
I would like all the calls to run together and than harvest the results.
as pseudo code, I would like the code to run like this:
forEach item {
result = item.makeWebRequest();
}
foreach item {
List.addRange(item.harvestResults);
}
I have no idea how to make the code to do that though..
Ideally, you should add a findItemsByProductAsync that returns a Task<Item[]>. That way, you don't have to create unnecessary tasks using StartNew or Task.Run.
Then your code can look like this:
int total = result.paginationOutput.totalPages;
// Start all downloads; each download is represented by a task.
Task<Item[]>[] tasks = Enumerable.Range(2, total - 1)
.Select(i => client.findItemsByProductAsync(i)).ToArray();
// Wait for all downloads to complete.
Item[][] results = await Task.WhenAll(tasks);
// Flatten the results into a single collection.
return results.SelectMany(x => x).ToArray();
Given your requirements which I see as:
Process n number of non-blocking tasks
Process results after all queries have returned
I would use the CountdownEvent for this e.g.
var results = new ConcurrentBag<ItemType>(result.pagination.totalPages);
using (var e = new CountdownEvent(result.pagination.totalPages))
{
for (int i = 2; i <= result.pagination.totalPages+1; i++)
{
Task.Factory.StartNew(() => return client.findItemsByProduct(i))
.ContinueWith(items => {
results.AddRange(items);
e.Signal(); // signal task is done
});
}
// Wait for all requests to complete
e.Wait();
}
// Process results
foreach (var item in results)
{
...
}
This particular problem is solved easily enough without even using await. Simply create each of the tasks, put all of the tasks into a list, and then use WhenAll on that list to get a task that represents the completion of all of those tasks:
public static Task<Item[]> Foo()
{
int total = result.paginationOutput.totalPages;
var tasks = new List<Task<Item>>();
for (int i = 2; i < total + 1; i++)
{
tasks.Add(Task.Factory.StartNew(() => client.findItemsByProduct(i)));
}
return Task.WhenAll(tasks);
}
Also note you have a major problem in how you use result in your code. You're having each of the different tasks all using the same variable, so there are race conditions as to whether or not it works properly. You could end up adding the same call twice and having one skipped entirely. Instead you should have the call to findItemsByProduct be the result of the task, and use that task's Result.
If you want to use async-await properly you have to declare your functions async, and the functions that call you also have to be async. This continues until you have once synchronous function that starts the async process.
Your function would look like this:
by the way you didn't describe what's in the list. I assume they are
object of type T. in that case result.SearchResult.Item returns
IEnumerable
private async Task<List<T>> FindItems(...)
{
int total = result.paginationOutput.totalPages;
var newList = new List<T>();
for (int i = 2; i < total + 1; i++)
{
IEnumerable<T> result = await Task.Factory.StartNew(() =>
{
return client.findItemsByProduct(i);
});
newList.AddRange(result.searchResult.item);
}
return newList;
}
If you do it this way, your function will be asynchronous, but the findItemsByProduct will be executed one after another. If you want to execute them simultaneously you should not await for the result, but start the next task before the previous one is finished. Once all tasks are started wait until all are finished. Like this:
private async Task<List<T>> FindItems(...)
{
int total = result.paginationOutput.totalPages;
var tasks= new List<Task<IEnumerable<T>>>();
// start all tasks. don't wait for the result yet
for (int i = 2; i < total + 1; i++)
{
Task<IEnumerable<T>> task = Task.Factory.StartNew(() =>
{
return client.findItemsByProduct(i);
});
tasks.Add(task);
}
// now that all tasks are started, wait until all are finished
await Task.WhenAll(tasks);
// the result of each task is now in task.Result
// the type of result is IEnumerable<T>
// put all into one big list using some linq:
return tasks.SelectMany ( task => task.Result.SearchResult.Item)
.ToList();
// if you're not familiar to linq yet, use a foreach:
var newList = new List<T>();
foreach (var task in tasks)
{
newList.AddRange(task.Result.searchResult.item);
}
return newList;
}

TPL DataFlow with Lazy Source / stream of data

Suppose you have a TransformBlock with configured parallelism and want to stream data trough the block. The input data should be created only when the pipeline can actually start processing it. (And should be released the moment it leaves the pipeline.)
Can I achieve this? And if so how?
Basically I want a data source that works as an iterator.
Like so:
public IEnumerable<Guid> GetSourceData()
{
//In reality -> this should also be an async task -> but yield return does not work in combination with async/await ...
Func<ICollection<Guid>> GetNextBatch = () => Enumerable.Repeat(100).Select(x => Guid.NewGuid()).ToArray();
while (true)
{
var batch = GetNextBatch();
if (batch == null || !batch.Any()) break;
foreach (var guid in batch)
yield return guid;
}
}
This would result in +- 100 records in memory. OK: more if the blocks you append to this data source would keep them in memory for some time, but you have a chance to get only a subset (/stream) of data.
Some background information:
I intend to use this in combination with azure cosmos db, where the source could all objects in a collection, or a change feed. Needless to say that I don't want all of those objects stored in memory. So this can't work:
using System.Threading.Tasks.Dataflow;
public async Task ExampleTask()
{
Func<Guid, object> TheActualAction = text => text.ToString();
var config = new ExecutionDataflowBlockOptions
{
BoundedCapacity = 5,
MaxDegreeOfParallelism = 15
};
var throtteler = new TransformBlock<Guid, object>(TheActualAction, config);
var output = new BufferBlock<object>();
throtteler.LinkTo(output);
throtteler.Post(Guid.NewGuid());
throtteler.Post(Guid.NewGuid());
throtteler.Post(Guid.NewGuid());
throtteler.Post(Guid.NewGuid());
//...
throtteler.Complete();
await throtteler.Completion;
}
The above example is not good because I add all the items without knowing if they are actually being 'used' by the transform block. Also, I don't really care about the output buffer. I understand that I need to send it somewhere so I can await the completion, but I have no use for the buffer after that. So it should just forget about all it gets ...
Post() will return false if the target is full without blocking. While this could be used in a busy-wait loop, it's wasteful. SendAsync() on the other hand will wait if the target is full :
public async Task ExampleTask()
{
var config = new ExecutionDataflowBlockOptions
{
BoundedCapacity = 50,
MaxDegreeOfParallelism = 15
};
var block= new ActionBlock<Guid, object>(TheActualAction, config);
while(//some condition//)
{
var data=await GetDataFromCosmosDB();
await block.SendAsync(data);
//Wait a bit if we want to use polling
await Task.Delay(...);
}
block.Complete();
await block.Completion;
}
It seems you want to process data at a defined degree of parallelism (MaxDegreeOfParallelism = 15). TPL dataflow is very clunky to use for such a simple requirement.
There's a very simple and powerful pattern that might solve your problem. It's a parallel async foreach loop as described here: https://blogs.msdn.microsoft.com/pfxteam/2012/03/05/implementing-a-simple-foreachasync-part-2/
public static Task ForEachAsync<T>(this IEnumerable<T> source, int dop, Func<T, Task> body)
{
return Task.WhenAll(
from partition in Partitioner.Create(source).GetPartitions(dop)
select Task.Run(async delegate {
using (partition)
while (partition.MoveNext())
await body(partition.Current);
}));
}
You can then write something like:
var dataSource = ...; //some sequence
dataSource.ForEachAsync(15, async item => await ProcessItem(item));
Very simple.
You can dynamically reduce the DOP by using a SemaphoreSlim. The semaphore acts as a gate that only lets N concurrent threads/tasks in. N can be changed dynamically.
So you would use ForEachAsync as the basic workhorse and then add additional restrictions and throttling on top.

Categories