Call many asynchronous tasks and process the results in a certain order - c#

Here is my code, which generates XmlNodes asynchronously and then inserts them into the main document sequentially, because insertion is fast and I may also need to keep a certain order.
There are about 15 nodes to import. How can I refactor this so the code is more compact?
XmlNode soccerNode = null;
XmlNode basketbalNode = null;
XmlNode hockeyNode = null;
...
var tasks = new List<Task>
{
Task.Factory.StartNew(() => soccerNode = this.getSoccer()),
Task.Factory.StartNew(() => basketbalNode = this.getBasketball()),
Task.Factory.StartNew(() => hockeyNode = this.getHockey()),
...};
Task.WaitAll(tasks.ToArray());
AddToMainDocument(soccerNode);
AddToMainDocument(basketbalNode);
AddToMainDocument(hockeyNode);
...

I'm guessing getSoccer, getBasketball and getHockey are properties, since your code example did not have any method parentheses. If they are methods, just add the missing parentheses to the code below.
var soccerTask = Task.Run(() => getSoccer).ConfigureAwait(false);
var basketbalTask = Task.Run(() => getBasketball).ConfigureAwait(false);
var hockeyTask = Task.Run(() => getHockey).ConfigureAwait(false);
AddToMainDocument(await soccerTask);
AddToMainDocument(await basketbalTask);
AddToMainDocument(await hockeyTask);
All three jobs are executed asynchronously and then you await them in the order you need them.
Regarding how to make your code more compact, I would need to know more about it. The methods/properties, are they all within the same object? Should all of them always be called? You could use reflection to find all of your methods, have attributes on them to specify order and use that information to spin up tasks and await them in the right order.
But to be honest, if you know which methods need to be called, just do it manually as in your example. If you have a dynamic object where methods are added or removed, then you should probably use reflection to do the job; otherwise it's just unnecessary overhead.

I think this might help:
var t1 = Task.Factory.StartNew(() => soccerNode = /*Do something */);
var t2 = Task.Factory.StartNew(() => basketbalNode = /*Do something */);
var tasks = new List<Task> { t1, t2 };
Task.WaitAll(tasks.ToArray());
AddToMainDocument(t1.Result);
AddToMainDocument(t2.Result);
We wait until they are all done, then add the results to the main document in the order we need.
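For the compactness part of the question, one option is to keep the producers in an ordered array and rely on Task.WhenAll returning results in input order. This is only a sketch under assumptions: MakeNode is a hypothetical stand-in for getSoccer(), getBasketball() and so on, and the "sports" root element is made up for illustration.

```csharp
using System;
using System.Linq;
using System.Threading.Tasks;
using System.Xml;

public static class NodeImporter
{
    // Hypothetical stand-in for getSoccer(), getBasketball(), getHockey(), ...
    // Each call builds its node in its own XmlDocument, so no document is shared between threads.
    static XmlNode MakeNode(string name) => new XmlDocument().CreateElement(name);

    public static async Task<XmlDocument> BuildAsync()
    {
        // The order of this array is the order of insertion.
        var producers = new Func<XmlNode>[]
        {
            () => MakeNode("soccer"),
            () => MakeNode("basketball"),
            () => MakeNode("hockey"),
        };

        // Start every producer on the thread pool; WhenAll returns results in input order.
        XmlNode[] nodes = await Task.WhenAll(producers.Select(p => Task.Run(p)));

        var doc = new XmlDocument();
        XmlNode root = doc.AppendChild(doc.CreateElement("sports"));
        foreach (XmlNode node in nodes)
            root.AppendChild(doc.ImportNode(node, true)); // sequential and ordered
        return doc;
    }
}
```

Adding a sixteenth node then means adding one line to the array; the start, wait and insert steps stay the same.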

Related

Is there a way to combine LINQ and async

Basically I have a procedure like
var results = await Task.WhenAll(
from input in inputs
select Task.Run(async () => await InnerMethodAsync(input))
);
.
.
.
private static async Task<Output> InnerMethodAsync(Input input)
{
var x = await Foo(input);
var y = await Bar(x);
var z = await Baz(y);
return z;
}
and I'm wondering whether there's a fancy way to combine this into a single LINQ query that's like an "async stream" (best way I can describe it).
When you use LINQ, there are generally two parts to it: creation and iteration.
Creation:
var query = list.Select( a => a.Name);
These calls are always synchronous. But this code doesn't do much more than create an object that exposes an IEnumerable. The actual work isn't done till later, due to a pattern called deferred execution.
Iteration:
var results = query.ToList();
This code takes the enumerable and gets the value of each item, which typically involves invoking your callback delegates (in this case, a => a.Name). This is the part that is potentially expensive and could benefit from asynchrony, e.g. if your callback is something like async a => await httpClient.GetByteArrayAsync(a).
So it's the iteration part that we're interested in, if we want to make it async.
The issue here is that ToList() (and most of the other methods that force iteration, like Any() or Last()) are not asynchronous methods, so your callback delegate will be invoked synchronously, and you’ll end up with a list of tasks instead of the data you want.
We can get around that with a piece of code like this:
public static class ExtensionMethods
{
static public async Task<List<T>> ToListAsync<T>(this IEnumerable<Task<T>> This)
{
var tasks = This.ToList(); //Force LINQ to iterate and create all the tasks. Tasks returned by async methods are already running when created.
var results = new List<T>(); //Create a list to hold the results (not the tasks)
foreach (var item in tasks)
{
results.Add(await item); //Await the result for each task and add to results list
}
return results;
}
}
With this extension method, we can rewrite your code:
var results = await inputs.Select( async i => await InnerMethodAsync(i) ).ToListAsync();
That should give you the async behavior you're looking for, and it avoids creating thread pool tasks, which your example does.
Note: If you are using LINQ-to-entities, the expensive part (the data retrieval) isn't exposed to you. For LINQ-to-entities, you'd want to use the ToListAsync() that comes with Entity Framework instead.
Try it out and see the timings in my demo on DotNetFiddle.
A rather obvious answer, but you have just used LINQ and async together - you're using LINQ's select to project, and start, a bunch of async Tasks, and then await on the results, which provides an asynchronous parallelism pattern.
Although you've likely just provided a sample, there are a couple of things to note in your code (I've switched to lambda syntax, but the same principles apply):
Since there's basically zero CPU bound work on each Task before the first await (i.e. no work done before var x = await Foo(input);), there's no real reason to use Task.Run here.
And since there's no work to be done in the lambda after call to InnerMethodAsync, you don't need to wrap the InnerMethodAsync calls in an async lambda (but be wary of IDisposable)
i.e. You can just select the Task returned from InnerMethodAsync and await these with Task.WhenAll.
var tasks = inputs
.Select(input => InnerMethodAsync(input)); // or just .Select(InnerMethodAsync)
var results = await Task.WhenAll(tasks);
More complex patterns are possible with asynchrony and LINQ, but rather than reinventing the wheel, you should have a look at Reactive Extensions and the TPL Dataflow library, which have many building blocks for complex flows.
Try using Microsoft's Reactive Framework. Then you can do this:
IObservable<Output> query =
from input in inputs.ToObservable()
from x in Observable.FromAsync(() => Foo(input))
from y in Observable.FromAsync(() => Bar(x))
from z in Observable.FromAsync(() => Baz(y))
select z;
Output[] results = await query.ToArray();
Simple.
Just NuGet "System.Reactive" and add using System.Reactive.Linq; to your code.

When returning multiple async tasks how do I know which results came from which task?

I have the following code to run multiple async tasks and wait for all the results.
string[] personStoreNames = _faceStoreRepo.GetPersonStoreNames();
IEnumerable<Task<IdentifyResult[]>> identifyFaceTasks =
personStoreNames.Select(storename => _faceServiceClient.IdentifyAsync(storename, faceIds, 1));
var recognitionresults =
await Task.WhenAll(identifyFaceTasks);
When I get the results, how can I get the storename for each task result? Each array of IdentifyResult will be for a certain storename, but I'm not sure how to end up with my IdentifyResults and the storename they were found in.
As MSDN says, the results come back in the same order as the tasks, so use the same index you used for the input to look up its result.
WhenAll
If none of the tasks faulted and none of the tasks were canceled, the resulting task will end in the TaskStatus.RanToCompletion state. The Result of the returned task will be set to an array containing all of the results of the supplied tasks in the same order as they were provided (e.g. if the input tasks array contained t1, t2, t3, the output task's Result will return an TResult[] where arr[0] == t1.Result, arr[1] == t2.Result, and arr[2] == t3.Result).
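As a small illustration of that index guarantee, the store names can be zipped with the result array after awaiting. This is a sketch, not the Face API itself: IdentifyAsync below is a hypothetical stub standing in for _faceServiceClient.IdentifyAsync, and it returns strings instead of IdentifyResult.

```csharp
using System;
using System.Linq;
using System.Threading.Tasks;

public static class StoreLookup
{
    // Hypothetical stub standing in for _faceServiceClient.IdentifyAsync(storename, faceIds, 1).
    static async Task<string[]> IdentifyAsync(string storeName)
    {
        await Task.Delay(10);
        return new[] { storeName + "-match" };
    }

    public static async Task<(string StoreName, string[] Results)[]> IdentifyAllAsync(string[] storeNames)
    {
        // WhenAll keeps input order, so results[i] belongs to storeNames[i].
        string[][] results = await Task.WhenAll(storeNames.Select(name => IdentifyAsync(name)));

        // Pair each store name with the result found at the same index.
        return storeNames
            .Zip(results, (name, r) => (StoreName: name, Results: r))
            .ToArray();
    }
}
```

An alternative with the same effect is to return the pair from inside the Select lambda, so each task carries its own store name.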
This is not a direct answer to the question, but you can use Microsoft's Reactive Framework to make this code a bit neater.
You can write this:
var query =
from sn in _faceStoreRepo.GetPersonStoreNames().ToObservable()
from irs in Observable.FromAsync(() => _faceServiceClient.IdentifyAsync(sn, faceIds, 1))
select new { sn, irs };
var result = await query.ToArray();
result is an array of anonymous-type objects of the form new { sn, irs }.
One advantage is that you can process the values as they become available:
var result = await query
.Do(x => { /* process each `x.sn` & `x.irs` pair as they arrive */ })
.ToArray();

Reactive Extensions SelectMany with large objects

I have this little piece of code that simulates a flow that uses large objects (that huge byte[]). For each item in the sequence, an async method is invoked to get some result. The problem? As it is, it throws OutOfMemoryException.
Code compatible with LINQPad (C# Program):
void Main()
{
var selectMany = Enumerable.Range(1, 100)
.Select(i => new LargeObject(i))
.ToObservable()
.SelectMany(o => Observable.FromAsync(() => DoSomethingAsync(o)));
selectMany
.Subscribe(r => Console.WriteLine(r));
}
private static async Task<int> DoSomethingAsync(LargeObject lo)
{
await Task.Delay(10000);
return lo.Id;
}
internal class LargeObject
{
public int Id { get; }
public LargeObject(int id)
{
this.Id = id;
}
public byte[] Data { get; } = new byte[10000000];
}
It seems that it creates all the objects at the same time. How can I do it the right way?
The underlying idea is to invoke DoSomethingAsync in order to get some result for each object, so that's why I use SelectMany. To simplify, I have just introduced a Task.Delay, but in real life it is a service that can process some items concurrently, so I want to introduce some concurrency mechanism to take advantage of it.
Please notice that, theoretically, processing a small number of items at a time shouldn't fill the memory. In fact, we only need each "large object" to get the results of the DoSomethingAsync method. After that point, the large object isn't used anymore.
I feel like I'm repeating myself. As with your last question and my last answer, what you need to do is limit the number of bigObjects™ created concurrently.
To do so, you need to combine object creation and processing and put them on the same thread pool. Now the problem is, we use async methods to allow threads to do other things while our async method runs. Since your slow network call is async, your (fast) object creation code will keep creating large objects too fast.
Instead, we can use Rx to keep count of the number of concurrent Observables running by combining the object creation with the async call, and use .Merge(maxConcurrent) to limit concurrency.
As a bonus, we can also set a minimum time for queries to execute. Just Zip with something that takes a minimum delay.
static void Main()
{
var selectMany = Enumerable.Range(1, 100)
.ToObservable()
.Select(i => Observable.Defer(() => Observable.Return(new LargeObject(i)))
.SelectMany(o => Observable.FromAsync(() => DoSomethingAsync(o)))
.Zip(Observable.Timer(TimeSpan.FromMilliseconds(400)), (el, _) => el)
).Merge(4);
selectMany
.Subscribe(r => Console.WriteLine(r));
Console.ReadLine();
}
private static async Task<int> DoSomethingAsync(LargeObject lo)
{
await Task.Delay(10000);
return lo.Id;
}
internal class LargeObject
{
public int Id { get; }
public LargeObject(int id)
{
this.Id = id;
Console.WriteLine(id + "!");
}
public byte[] Data { get; } = new byte[10000000];
}
"It seems that it creates all the objects at the same time."
Yes, because you are creating them all at once.
If I simplify your code I can show you why:
void Main()
{
var selectMany =
Enumerable
.Range(1, 5)
.Do(x => Console.WriteLine($"{x}!"))
.ToObservable()
.SelectMany(i => Observable.FromAsync(() => DoSomethingAsync(i)));
selectMany
.Subscribe(r => Console.WriteLine(r));
}
private static async Task<int> DoSomethingAsync(int i)
{
await Task.Delay(1);
return i;
}
Running this produces:
1!
2!
3!
4!
5!
4
3
5
2
1
Because of the Observable.FromAsync you are allowing the source to run to completion before any of the results return. In other words you are quickly building all of the large objects, but slowly processing them.
You should allow Rx to run synchronously, but on the default scheduler so that your main thread is not blocked. The code will then run without any memory issues and your program will remain responsive on the main thread.
Here's the code for this:
var selectMany =
Observable
.Range(1, 100, Scheduler.Default)
.Select(i => new LargeObject(i))
.Select(o => DoSomethingAsync(o))
.Select(t => t.Result);
(I've effectively replaced Enumerable.Range(1, 100).ToObservable() with Observable.Range(1, 100) as that will also help with some issues.)
I've tried testing other options, but so far anything that allows DoSomethingAsync to run asynchronously runs into the out of memory error.
ConcatMap supports this out of the box. That operator is not available in .NET, but you can get the same behavior with the Concat operator, which defers subscribing to each inner source until the previous one completes.
You can introduce a time interval delay this way:
var source = Enumerable.Range(1, 100)
.ToObservable()
.Zip(Observable.Interval(TimeSpan.FromSeconds(1)), (i, ts) => i)
.Select(i => new LargeObject(i))
.SelectMany(o => Observable.FromAsync(() => DoSomethingAsync(o)));
So instead of pulling all 100 integers at once, immediately converting them to the LargeObject then calling DoSomethingAsync on all 100, it drips the integers out one-by-one spaced out one second each.
This is what a TPL+Rx solution would look like. Needless to say it is less elegant than Rx alone, or TPL alone. However, I don't think this problem is well suited for Rx:
void Main()
{
var source = Observable.Range(1, 100);
const int MaxParallelism = 5;
var transformBlock = new TransformBlock<int, int>(async i => await DoSomethingAsync(new LargeObject(i)),
new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = MaxParallelism });
source.Subscribe(transformBlock.AsObserver());
var selectMany = transformBlock.AsObservable();
selectMany
.Subscribe(r => Console.WriteLine(r));
}
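For comparison, here is what a TPL-only throttle could look like, with no Rx or Dataflow involved. This is a different technique from the answers above, offered as a sketch: SelectAsync and DemoAsync are hypothetical names, and a SemaphoreSlim caps how many calls are in flight so that only that many large buffers exist at once.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public static class Throttled
{
    // Runs 'work' once per input with at most maxConcurrency calls in flight.
    // Results come back in input order.
    public static async Task<TOut[]> SelectAsync<TIn, TOut>(
        IEnumerable<TIn> inputs, int maxConcurrency, Func<TIn, Task<TOut>> work)
    {
        using (var gate = new SemaphoreSlim(maxConcurrency))
        {
            var tasks = inputs.Select(async input =>
            {
                await gate.WaitAsync();          // wait for a free slot
                try { return await work(input); }
                finally { gate.Release(); }      // free the slot for the next input
            }).ToList();                         // create the (cheap) wrapper tasks exactly once
            return await Task.WhenAll(tasks);
        }
    }

    // Hypothetical usage mirroring the question: the large buffer is allocated
    // inside 'work', i.e. only after a slot has been acquired.
    public static Task<int[]> DemoAsync() =>
        SelectAsync(Enumerable.Range(1, 20), 4, async i =>
        {
            var data = new byte[10000000];
            await Task.Delay(50);                // stands in for the slow service call
            GC.KeepAlive(data);                  // keep the buffer alive across the await
            return i;
        });
}
```

The important detail is that the large object is created inside the throttled delegate rather than in an earlier Select, otherwise all of them would still be allocated up front.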

Counting Non-Faulted Tasks causes re-execution of each task

I am saving a bunch of items to my database using async saves
var tasks = items.Select(item =>
{
var clone = item.MakeCopy();
clone.Id = Guid.NewGuid();
return dbAccess.SaveAsync(clone);
});
await Task.WhenAll(tasks);
I need to verify how many times SaveAsync was successful (it throws an exception if something goes wrong). I am using the IsFaulted flag to examine the tasks:
var successCount = tasks.Count(t => !t.IsFaulted);
The items collection consists of 3 elements, so SaveAsync should have been called three times, but it is executed 6 times. Upon closer examination I noticed that counting non-faulted tasks with tasks.Count(...) causes each of the tasks to re-run.
I suspect it has something to do with deferred LINQ execution but I am not sure why exactly and how to fix this.
Any suggestion why I observe this behavior and what would be the optimal pattern to avoid this artifact?
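It happens because your Select query is enumerated more than once. Select is lazily evaluated: Task.WhenAll enumerates it once and Count enumerates it again, and each enumeration re-runs the lambda and therefore calls SaveAsync again.
To fix it, force a single enumeration by calling ToList(), so that tasks holds the already-created tasks. Then it will work correctly.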
var tasks = items.Select(item =>
{
var clone = item.MakeCopy();
clone.Id = Guid.NewGuid();
return dbAccess.SaveAsync(clone);
})
.ToList();
Also you may take a look at these more detailed answers:
https://stackoverflow.com/a/8240935/3872935
https://stackoverflow.com/a/20129161/3872935
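The double execution is easy to reproduce without a database. In this minimal sketch, Save is a hypothetical stand-in for dbAccess.SaveAsync that only counts how often it is called:

```csharp
using System;
using System.Linq;
using System.Threading.Tasks;

public static class DeferredDemo
{
    public static async Task<(int LazyCalls, int MaterializedCalls)> RunAsync()
    {
        int calls = 0;
        // Hypothetical stand-in for dbAccess.SaveAsync: counts invocations.
        Task Save(int item) { calls++; return Task.CompletedTask; }

        var items = new[] { 1, 2, 3 };

        // Lazy query: every enumeration runs the lambda again.
        var lazy = items.Select(i => Save(i));
        await Task.WhenAll(lazy);                        // first enumeration
        int ok = lazy.Count(t => !t.IsFaulted);          // second enumeration, new tasks
        int lazyCalls = calls;

        // Materialized query: the lambda runs once per item.
        calls = 0;
        var materialized = items.Select(i => Save(i)).ToList();
        await Task.WhenAll(materialized);
        ok = materialized.Count(t => !t.IsFaulted);      // inspects the same task objects
        int materializedCalls = calls;

        return (lazyCalls, materializedCalls);
    }
}
```

With three items, the lazy version calls Save six times and the materialized version three times. Note also that await Task.WhenAll(...) rethrows if any save faulted, so when failures are expected the await needs a try/catch before the tasks are counted.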

Parallel Linq - return first result that comes back

I'm using PLINQ to run a function that tests serial ports to determine if they're a GPS device.
Some serial ports immediately are found to be a valid GPS. In this case, I want the first one to complete the test to be the one returned. I don't want to wait for the rest of the results.
Can I do this with PLINQ, or do I have to schedule a batch of tasks and wait for one to return?
PLINQ is probably not going to suffice here. While you can use .First, in .NET 4, this will cause it to run sequentially, which defeats the purpose. (Note that this will be improved in .NET 4.5.)
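The TPL, however, is most likely the right answer here. You can create a Task<Location> for each serial port, and then use Task.WaitAny to wait for the first operation to finish. Note that WaitAny returns on the first task to complete, whether it succeeded or not, so check the result and wait again if that port turned out not to be a GPS.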
This provides a simple way to schedule a bunch of "tasks" and then just use the first result.
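A sketch of that idea using the awaitable Task.WhenAny instead of the blocking Task.WaitAny: keep waiting until one task finishes successfully with a matching result. FirstMatchAsync and ProbeAsync are hypothetical names, and the probe just simulates port tests with delays and returns the port name (or null) instead of a Location.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

public static class FirstSuccess
{
    // Returns the first successfully completed result that satisfies isMatch,
    // or default(T) if no task produces one. Faulted tasks are skipped.
    public static async Task<T> FirstMatchAsync<T>(IEnumerable<Task<T>> source, Func<T, bool> isMatch)
    {
        var pending = source.ToList();
        while (pending.Count > 0)
        {
            Task<T> finished = await Task.WhenAny(pending);
            pending.Remove(finished);
            if (finished.Status == TaskStatus.RanToCompletion && isMatch(finished.Result))
                return finished.Result;
        }
        return default(T);
    }

    // Hypothetical probe: waits delayMs, then reports the port name if it is a GPS, else null.
    static async Task<string> ProbeAsync(string port, int delayMs, bool isGps)
    {
        await Task.Delay(delayMs);
        return isGps ? port : null;
    }

    public static Task<string> DemoAsync() =>
        FirstMatchAsync(new[]
        {
            ProbeAsync("COM1", 10, false),   // answers first, but is not a GPS
            ProbeAsync("COM2", 100, true),   // first GPS to answer
            ProbeAsync("COM3", 600, true),   // also a GPS, but slower
        }, port => port != null);
}
```

The remaining probes keep running after a match is returned; a real implementation would probably pass a CancellationToken to stop them.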
I have been thinking about this on and off for the past couple days and I can't find a built in PLINQ way to do this in C# 4.0. The accepted answer to this question of using FirstOrDefault does not return a value until the full PLINQ query is complete and still returns the (ordered) first result. The following extreme example shows the behavior:
var cts = new CancellationTokenSource();
var rnd = new ThreadLocal<Random>(() => new Random());
var q = Enumerable.Range(0, 11).Select(x => x).AsParallel()
.WithCancellation(cts.Token).WithMergeOptions( ParallelMergeOptions.NotBuffered).WithDegreeOfParallelism(10).AsUnordered()
.Where(i => i % 2 == 0 )
.Select( i =>
{
if( i == 0 )
Thread.Sleep(3000);
else
Thread.Sleep(rnd.Value.Next(50, 100));
return string.Format("dat {0}", i).Dump();
});
cts.CancelAfter(5000);
// waits until all results are in, then returns first
q.FirstOrDefault().Dump("result");
I don't see a built-in way to immediately get the first available result, but I was able to come up with two workarounds.
The first creates Tasks to do the work and returns the Task, resulting in a quickly completed PLINQ query. The resulting tasks can be passed to WaitAny to get the first result as soon as it is available:
var cts = new CancellationTokenSource();
var rnd = new ThreadLocal<Random>(() => new Random());
var q = Enumerable.Range(0, 11).Select(x => x).AsParallel()
.WithCancellation(cts.Token).WithMergeOptions( ParallelMergeOptions.NotBuffered).WithDegreeOfParallelism(10).AsUnordered()
.Where(i => i % 2 == 0 )
.Select( i =>
{
return Task.Factory.StartNew(() =>
{
if( i == 0 )
Thread.Sleep(3000);
else
Thread.Sleep(rnd.Value.Next(50, 100));
return string.Format("dat {0}", i).Dump();
});
});
cts.CancelAfter(5000);
// returns as soon as the tasks are created
var ts = q.ToArray();
// wait till the first task finishes
var idx = Task.WaitAny( ts );
ts[idx].Result.Dump("res");
This is probably a terrible way to do it. Since the actual work of the PLINQ query is just a very fast Task.Factory.StartNew, it's pointless to use PLINQ at all. A simple .Select( i => Task.Factory.StartNew( ... on the IEnumerable is cleaner and probably faster.
The second workaround uses a queue (BlockingCollection) and just inserts results into this queue once they are computed:
var cts = new CancellationTokenSource();
var rnd = new ThreadLocal<Random>(() => new Random());
var q = Enumerable.Range(0, 11).Select(x => x).AsParallel()
.WithCancellation(cts.Token).WithMergeOptions( ParallelMergeOptions.NotBuffered).WithDegreeOfParallelism(10).AsUnordered()
.Where(i => i % 2 == 0 )
.Select( i =>
{
if( i == 0 )
Thread.Sleep(3000);
else
Thread.Sleep(rnd.Value.Next(50, 100));
return string.Format("dat {0}", i).Dump();
});
cts.CancelAfter(5000);
var qu = new BlockingCollection<string>();
// ForAll blocks until PLINQ query is complete
Task.Factory.StartNew(() => q.ForAll( x => qu.Add(x) ));
// get first result asap
qu.Take().Dump("result");
With this method, the work is done using PLINQ, and the BlockingCollection's Take() will return the first result as soon as it is inserted by the PLINQ query.
While this produces the desired result, I am not sure it has any advantage over just using the simpler Tasks + WaitAny.
Upon further review, you can apparently just use FirstOrDefault to solve this. PLINQ will not preserve ordering by default, and with an unbuffered query, will return immediately.
http://msdn.microsoft.com/en-us/library/dd460677.aspx
To accomplish this entirely with PLINQ in .NET 4.0:
SerialPorts. // Your IEnumerable of serial ports
AsParallel().AsUnordered(). // Run as an unordered parallel query
Where(IsGps). // Matching the predicate IsGps (Func<SerialPort, bool>)
Take(1). // Taking the first match
FirstOrDefault(); // And unwrap it from the IEnumerable (or null if none are found)
The key is to not use an ordered evaluation like First or FirstOrDefault until you have specified that you only care to find one.
