I need to execute multiple long-running operations in parallel and would like to report progress in some way. From my initial research, IObservable seems to fit this model. The idea is that I call a method that returns an IObservable of int, where each int is the reported percent complete. Parallel execution starts immediately when the method returns. The observable must be hot, so that all subscribers see the same progress information at any given point in time; for example, a late subscriber may only learn that the whole execution is complete and there is no more progress to track.
The closest approach I have found is to use Observable.ForkJoin and Observable.Start, but I can't work out how to combine them into a single observable that I can return from a method.
Please share your ideas on how this can be achieved, or whether there is another approach to this problem using Rx for .NET.
To make a hot observable, I would probably start with a method that uses a BehaviorSubject as the return value and the way the operations report progress. If you just want the example, skip to the end. The rest of this answer explains the steps.
I will assume for the sake of this answer that your long-running operations do not have their own way to be called asynchronously. If they do, the next step may be a little different. The next thing to do is to send the work to another thread using an IScheduler. If desired, you can let the caller select where the work happens by adding an overload that takes the scheduler as a parameter (in which case the overload without it picks a default scheduler). There are quite a few overloads of IScheduler.Schedule, several of which are extension methods, so you should look through them to see which is most appropriate for your situation; I'm using the one that takes only an Action here. If you have multiple operations that can all run in parallel, you can call scheduler.Schedule multiple times.
The hardest part of this will probably be determining what the progress is at any given point. If you have multiple operations going on at once, you will probably need to keep track of how many have completed to know what the current progress is. With the information you provided, I can't be more specific than that.
Finally, if your operations are cancellable, you may want to take a CancellationToken as a parameter. You can use this to cancel the operation while it is in the scheduler's queue before it starts. If you write your operation code correctly, it can use the token for cancellation as well.
IObservable<int> DoStuff(/*args*/,
                         CancellationToken cancel,
                         IScheduler scheduler)
{
    //BehaviorSubject needs an initial value; late subscribers get the latest one
    var progress = new BehaviorSubject<int>(0);
    //if you don't take it as a parameter, pick a scheduler
    //IScheduler scheduler = Scheduler.ThreadPool;
    var disp = scheduler.Schedule(() =>
    {
        //do stuff that needs to run on another thread
        //report progress
        progress.OnNext(25);
    });
    var disp2 = scheduler.Schedule(...);
    //if the operation is cancelled before the scheduler has started it,
    //you need to dispose the return from the Schedule calls
    var allOps = new CompositeDisposable(disp, disp2);
    cancel.Register(allOps.Dispose);
    return progress;
}
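To make the "keep track of how many have completed" part concrete, here is a rough sketch that uses only the base class library interfaces, so it runs without the Rx package. LatestProgress, ActionObserver and ProgressDemo are names I made up for illustration; LatestProgress is a hand-rolled stand-in for what BehaviorSubject does (replay the latest value to late subscribers), not the Rx implementation. It assumes each operation counts equally towards the total and leaves out error handling.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical stand-in for BehaviorSubject<int>: remembers the latest
// percentage and replays it to each new subscriber.
public sealed class LatestProgress : IObservable<int>
{
    private readonly object _gate = new object();
    private readonly List<IObserver<int>> _observers = new List<IObserver<int>>();
    private int _latest;
    private bool _done;

    public IDisposable Subscribe(IObserver<int> observer)
    {
        lock (_gate)
        {
            observer.OnNext(_latest);            // a late subscriber sees the current state
            if (_done) observer.OnCompleted();
            else _observers.Add(observer);
        }
        return new Unsubscriber(this, observer);
    }

    public void Report(int percent)
    {
        lock (_gate)
        {
            if (_done || percent <= _latest) return;   // never report backwards
            _latest = percent;
            var snapshot = _observers.ToArray();
            foreach (var o in snapshot) o.OnNext(percent);
            if (percent < 100) return;
            _done = true;
            _observers.Clear();
            foreach (var o in snapshot) o.OnCompleted();
        }
    }

    private sealed class Unsubscriber : IDisposable
    {
        private readonly LatestProgress _owner;
        private readonly IObserver<int> _observer;
        public Unsubscriber(LatestProgress owner, IObserver<int> observer)
        { _owner = owner; _observer = observer; }
        public void Dispose() { lock (_owner._gate) _owner._observers.Remove(_observer); }
    }
}

// Minimal observer, only here to make the sketch usable without Rx.
public sealed class ActionObserver : IObserver<int>
{
    private readonly Action<int> _onNext;
    private readonly Action _onCompleted;
    public ActionObserver(Action<int> onNext, Action onCompleted)
    { _onNext = onNext; _onCompleted = onCompleted; }
    public void OnNext(int value) { _onNext(value); }
    public void OnError(Exception error) { }
    public void OnCompleted() { _onCompleted(); }
}

public static class ProgressDemo
{
    // Starts every operation on the thread pool right away and
    // returns a hot stream of "percent complete" values.
    public static IObservable<int> RunAll(IReadOnlyList<Action> operations)
    {
        var progress = new LatestProgress();
        int completed = 0;
        if (operations.Count == 0) progress.Report(100);
        foreach (var operation in operations)
        {
            Task.Run(() =>
            {
                operation();
                int done = Interlocked.Increment(ref completed);
                progress.Report(done * 100 / operations.Count);
            });
        }
        return progress;
    }
}
```

A subscriber that arrives after everything has finished receives a single 100 followed by completion, which matches the "late subscriber" requirement in the question.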
Here is one approach
// set up a method to do some work
// and report its own partial progress
Func<string, IObservable<int>> doPartialWork =
    (arg) => Observable.Create<int>(obsvr => {
        // create the flag before scheduling, so that disposing the
        // subscription can stop the loop while it is still running
        var cancel = new BooleanDisposable();
        // Scheduler.TaskPool is TaskPoolScheduler.Default in later Rx versions
        var scheduled = Scheduler.TaskPool.Schedule(() => {
            var progress = 0;
            while (progress < 10 && !cancel.IsDisposed)
            {
                // do work with arg
                Thread.Sleep(550);
                obsvr.OnNext(1); //report progress
                progress++;
            }
            if (!cancel.IsDisposed)
                obsvr.OnCompleted();
        });
        return new CompositeDisposable(scheduled, cancel);
    });
var myArgs = new[]{"Arg1", "Arg2", "Arg3"};
// run all the partial bits of work
// use SelectMany to get a flat stream of
// partial progress notifications
var xsOfPartialProgress =
    myArgs.ToObservable(Scheduler.NewThread)
          .SelectMany(arg => doPartialWork(arg))
          .Replay().RefCount();
// use Scan to get a running aggregation of progress
var xsProgress = xsOfPartialProgress
    .Scan(0d, (prog, nextPartial)
        => prog + (nextPartial / (myArgs.Length * 10d)));
If we fill a list of Tasks that need to do both CPU-bound and I/O bound work, by simply passing their method declaration to that list (Not by creating a new task and manually scheduling it by using Task.Start), how exactly are these tasks handled?
I know that they are not done in parallel, but concurrently.
Does that mean that a single thread will move along them, and that single thread might not be the same thread in the thread pool, or the same thread that initially started waiting for them all to complete/added them to the list?
EDIT: My question is about how exactly these items are handled in the list concurrently - is the calling thread moving through them, or something else is going on?
Code for those that need code:
public async Task SomeFancyMethod(int i)
{
    doCPUBoundWork(i);
    await doIOBoundWork(i);
}
//Main thread
List<Task> someFancyTaskList = new List<Task>();
for (int i = 0; i < 10; i++)
    someFancyTaskList.Add(SomeFancyMethod(i));
// Do various other things here --
// how are the items handled in the meantime?
await Task.WhenAll(someFancyTaskList);
Thank you.
Asynchronous methods always start running synchronously. The magic happens at the first await. When the await keyword sees an incomplete Task, it returns its own incomplete Task. If it sees a complete Task, execution continues synchronously.
So at this line:
someFancyTaskList.Add(SomeFancyMethod(i));
You're calling SomeFancyMethod(i), which will:
1. Run doCPUBoundWork(i) synchronously.
2. Run doIOBoundWork(i).
3. If doIOBoundWork(i) returns an incomplete Task, then the await in SomeFancyMethod will return its own incomplete Task.
Only then will the returned Task be added to your list and your loop will continue. So the CPU-bound work is happening sequentially (one after the other).
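If you want to see that ordering for yourself, a small sketch along these lines should do it. SyncStartDemo and the log strings are made up for illustration, and a TaskCompletionSource stands in for the pending I/O so that the "I/O" cannot finish before the loop does.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class SyncStartDemo
{
    static async Task SomeFancyMethod(int i, Task pendingIo, List<string> log)
    {
        lock (log) log.Add("cpu " + i);      // runs synchronously, inside the caller's loop
        await pendingIo;                     // incomplete task: control returns to the caller here
        lock (log) log.Add("io done " + i);  // continuation, runs once the "I/O" has completed
    }

    public static async Task<List<string>> Run()
    {
        var log = new List<string>();
        var io = new TaskCompletionSource<bool>();   // stands in for I/O that is still in flight
        var tasks = new List<Task>();
        for (int i = 0; i < 3; i++)
            tasks.Add(SomeFancyMethod(i, io.Task, log));
        lock (log) log.Add("loop finished");
        io.SetResult(true);                          // the "I/O" completes only now
        await Task.WhenAll(tasks);
        return log;
    }
}
```

The first four log entries are "cpu 0", "cpu 1", "cpu 2" and "loop finished": every CPU-bound part has run, one after the other, before the loop ends and before any continuation runs.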
There is some more reading about this here: Control flow in async programs (C#)
As each I/O operation completes, the continuations of those tasks are scheduled. How those are done depends on the type of application - particularly, if there is a context that it needs to return to (desktop and ASP.NET do unless you specify ConfigureAwait(false), ASP.NET Core doesn't). So they might run sequentially on the same thread, or in parallel on ThreadPool threads.
If you want to immediately move the CPU-bound work to another thread to run that in parallel, you can use Task.Run:
int copy = i; // copy the loop variable: the lambda may run after i has changed
someFancyTaskList.Add(Task.Run(() => SomeFancyMethod(copy)));
If this is in a desktop application, then this would be wise, since you want to keep CPU-heavy work off of the UI thread. However, then you've lost your context in SomeFancyMethod, which may or may not matter to you. In a desktop app, you can always marshal calls back to the UI thread fairly easily.
I assume you don't mean passing their method declaration, but just invoking the method, like so:
var tasks = new Task[] { MethodAsync("foo"),
MethodAsync("bar") };
And we'll compare that to using Task.Run:
var tasks = new Task[] { Task.Run(() => MethodAsync("foo")),
Task.Run(() => MethodAsync("bar")) };
First, let's get the quick answer out of the way. The first variant will have lower or equal parallelism compared to the second variant. Parts of MethodAsync will run on the caller's thread in the first case, but not in the second case. How much this actually affects the parallelism depends entirely on the implementation of MethodAsync.
To get a bit deeper, we need to understand how async methods work. We have a method like:
async Task MethodAsync(string argument)
{
DoSomePreparationWork();
await WaitForIO();
await DoSomeOtherWork();
}
What happens when you call such a method? There is no magic. The method is a method like any other, just rewritten as a state machine (similar to how yield return works). It will run as any other method until it encounters the first await. At that point, it may or may not return a Task object. You may or may not await that Task object in the caller code. Ideally, your code should not depend on the difference. Just like yield return, await on a (non-completed!) task returns control to the caller of the method. Essentially, the contract is:
If you have CPU work to do, use my thread.
If whatever you do would mean the thread isn't going to use the CPU, return a promise of the result (a Task object) to the caller.
It allows you to maximize the ratio of what CPU work each thread is doing. If the asynchronous operation doesn't need the CPU, it will let the caller do something else. It doesn't inherently allow for parallelism, but it gives you the tools to do any kind of asynchronous operation, including parallel operations. One of the operations you can do is Task.Run, which is just another asynchronous method that returns a task, but which returns to the caller immediately.
So, the difference between:
MethodAsync("foo");
MethodAsync("bar");
and
Task.Run(() => MethodAsync("foo"));
Task.Run(() => MethodAsync("bar"));
is that the former will return (and continue to execute the next MethodAsync) after it reaches the first await on a non-completed task, while the latter will always return immediately.
You should usually decide based on your actual requirements:
Do you need to use the CPU efficiently and minimize context switching etc., or do you expect the async method to have negligible CPU work to do? Invoke the method directly.
Do you want to encourage parallelism or do you expect the async method to do interesting amounts of CPU work? Use Task.Run.
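A quick way to check which thread runs the part before the first await is a sketch like the following. CallerThreadDemo is a made-up name, and the blocking GetResult() calls are only there to keep the demo short; it assumes a console-style program without a synchronization context.

```csharp
using System;
using System.Threading.Tasks;

public static class CallerThreadDemo
{
    // Returns the id of the thread that ran the code before the first await.
    static async Task<int> MethodAsync()
    {
        int prepThread = Environment.CurrentManagedThreadId;
        await Task.Delay(10).ConfigureAwait(false);
        return prepThread;
    }

    // Direct invocation: the preparation work runs on the caller's thread.
    public static bool DirectCallStartsOnCallerThread()
    {
        int caller = Environment.CurrentManagedThreadId;
        return MethodAsync().GetAwaiter().GetResult() == caller;
    }

    // Task.Run: the whole method, including the preparation work,
    // runs on a thread pool thread while the caller is free (or, here, blocked).
    public static bool TaskRunStartsOnCallerThread()
    {
        int caller = Environment.CurrentManagedThreadId;
        return Task.Run(() => MethodAsync()).GetAwaiter().GetResult() == caller;
    }
}
```

The first method returns true and the second returns false, which is the whole difference between the two variants: where the synchronous prefix of the async method executes.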
Here is your code rewritten without async/await, with old-school continuations instead. Hopefully it will make it easier to understand what's going on.
public Task CompoundMethodAsync(int i)
{
doCPUBoundWork(i);
return doIOBoundWorkAsync(i).ContinueWith(_ =>
{
doMoreCPUBoundWork(i);
});
}
// Main thread
var tasks = new List<Task>();
for (int i = 0; i < 10; i++)
{
Task task = CompoundMethodAsync(i);
tasks.Add(task);
}
// The doCPUBoundWork has already run synchronously 10 times at this point
// Do various things while the compound tasks are progressing concurrently
Task.WhenAll(tasks).ContinueWith(_ =>
{
// The doIOBoundWorkAsync/doMoreCPUBoundWork have completed 10 times at this point
// Do various things after all compound tasks have been completed
});
// No code should exist here. Move everything inside the continuation above.
I will first provide the pseudocode and describe it below:
public void RunUntilEmpty(List<Job> jobs)
{
    while (jobs.Any()) // the list "jobs" will be modified during the execution
    {
        List<Job> childJobs = new List<Job>();
        Parallel.ForEach(jobs, job => // this will be done in parallel
        {
            List<Job> newJobs = job.Do(); // after a job is done, it may return new jobs to do
            lock (childJobs)
                childJobs.AddRange(newJobs); // I would like to add those jobs to the "pool"
        });
        jobs = childJobs;
    }
}
As you can see, I am performing an unusual kind of foreach. The source set (jobs) can grow during execution, and this behaviour cannot be determined in advance. When the method Do() is called on an object (here, job), it may return new jobs to perform and thus extend the source (jobs).
I could call this method (RunUntilEmpty) recursively, but unfortunately the stack can be really huge and is likely to result in an overflow.
Could you please tell me how to achieve this? Is there a way of doing this kind of thing in C#?
If I understand correctly, you basically start out with some collection of Job objects, each representing some task which can itself create one or more new Job objects as a result of performing its task.
Your updated code example looks like it will basically accomplish this. But note that, as commenter CommuSoft points out, it won't make the most efficient use of your CPU cores. Because you are only updating the list of jobs after each group of jobs has completed, there's no way for newly-generated jobs to run until all of the previously-generated jobs have completed.
A better implementation would use a single queue of jobs, continually retrieving new Job objects for execution as old ones complete.
I agree that TPL Dataflow may be a useful way to implement this. However, depending on your needs, you might find it simple enough to just queue the tasks directly to the thread pool and use CountdownEvent to track the progress of the work so that your RunUntilEmpty() method knows when to return.
Without a good, minimal, complete code example, it's impossible to provide an answer that includes a similarly complete code example. But hopefully the below snippet illustrates the basic idea well enough:
public void RunUntilEmpty(List<Job> jobs)
{
    CountdownEvent countdown = new CountdownEvent(1);
    QueueJobs(jobs, countdown);
    countdown.Signal();
    countdown.Wait();
}
private static void QueueJobs(List<Job> jobs, CountdownEvent countdown)
{
    foreach (Job job in jobs)
    {
        countdown.AddCount(1);
        Task.Run(() =>
        {
            // after a job is done, it may return new jobs to do
            QueueJobs(job.Do(), countdown);
            countdown.Signal();
        });
    }
}
The basic idea is to queue a new task for each Job object, incrementing the counter of the CountdownEvent for each task that is queued. The tasks themselves do three things:
1. Run the Do() method,
2. Queue any new tasks, using the QueueJobs() method so that the CountdownEvent object's counter is incremented accordingly, and
3. Signal the CountdownEvent, decrementing its counter for the current task
The main RunUntilEmpty() signals the CountdownEvent to account for the single count it contributed to the object's counter when it created it, and then waits for the counter to reach zero.
Note that the calls to QueueJobs() are not recursive. The QueueJobs() method is not called by itself, but rather by the anonymous method declared within it, which is itself also not called by QueueJobs(). So there is no stack-overflow issue here.
The key feature in the above is that tasks are continuously queued as they become known, i.e. as they are returned by the previously-executed Do() method calls. Thus, the available CPU cores are kept busy by the thread pool, at least to the extent that any completed Do() method has in fact returned any new Job object to run. This addresses the main problem with the version of the code you've included in your question.
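For anyone who wants to try the idea end to end, here is a self-contained sketch. The Job class is hypothetical (the question never shows one): a job at depth d spawns two children at depth d - 1 until depth 0, and a static counter records how many jobs ran. I have also wrapped the work in try/finally so that a throwing Do() cannot leave the CountdownEvent waiting forever; that is my addition, not part of the snippet above.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical job type, for illustration only.
public sealed class Job
{
    public static int Executed;          // total number of jobs that have run
    private readonly int _depth;
    public Job(int depth) { _depth = depth; }

    public List<Job> Do()
    {
        Interlocked.Increment(ref Executed);
        var children = new List<Job>();
        if (_depth > 0)
        {
            children.Add(new Job(_depth - 1));
            children.Add(new Job(_depth - 1));
        }
        return children;
    }
}

public static class JobRunner
{
    public static void RunUntilEmpty(List<Job> jobs)
    {
        using (var countdown = new CountdownEvent(1))   // 1 = this method's own count
        {
            QueueJobs(jobs, countdown);
            countdown.Signal();                         // release our own count
            countdown.Wait();                           // returns when every job has signalled
        }
    }

    private static void QueueJobs(List<Job> jobs, CountdownEvent countdown)
    {
        foreach (Job job in jobs)
        {
            countdown.AddCount(1);                      // one count per queued job
            Task.Run(() =>
            {
                try { QueueJobs(job.Do(), countdown); } // children are queued before we signal
                finally { countdown.Signal(); }
            });
        }
    }
}
```

Starting from a single job of depth 3, this runs 1 + 2 + 4 + 8 = 15 jobs. Because each task queues its children before signalling, the counter cannot reach zero while work is still outstanding.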
I have a function which is along the lines of
private void DoSomethingToFeed(IFeed feed)
{
feed.SendData(); // Send data to remote server
Thread.Sleep(1000 * 60 * 5); // Sleep 5 minutes
feed.GetResults(); // Get data from remote server after it's processed it
}
I want to parallelize this, since I have lots of feeds that are all independent of each other. Based on this answer, leaving the Thread.Sleep() in there is not a good idea. I also want to wait after all the threads have spun up, until they've all had a chance to get their results.
What's the best way to handle a scenario like this?
Edit, because I accidentally left it out: I had originally considered calling this function as Parallel.ForEach(feeds, DoSomethingToFeed), but I was wondering if there was a better way to handle the sleeping when I found the answer I linked to.
Unless you have an awful lot of threads, you can keep it simple. Create all the threads. You'll get some thread creation overhead, but since the threads are basically sleeping the whole time, you won't get too much context switching.
It'll be easier to code than any other solution (unless you're using C# 5). So start with that, and improve it only if you actually see a performance problem.
I think you should take a look at the Task class in .NET. It is a nice abstraction on top of more low level threading / thread pool management.
In order to wait for all tasks to complete, you can use Task.WaitAll.
An example use of Tasks could look like:
IFeed feedOne = new SomeFeed();
IFeed feedTwo = new SomeFeed();
var t1 = Task.Factory.StartNew(() => { feedOne.SendData(); });
var t2 = Task.Factory.StartNew(() => { feedTwo.SendData(); });
// Waits for all provided tasks to finish execution
Task.WaitAll(t1, t2);
However, another solution would be to use Parallel.ForEach, which handles all Task creation for you and batches the tasks appropriately as well. A good comparison of the two approaches is given here; among other good points, it states that:
Parallel.ForEach, internally, uses a Partitioner to distribute your collection into work items. It will not do one task per item, but rather batch this to lower the overhead involved.
Check WaitHandle for waiting on tasks.
private Task DoSomethingToFeed(IFeed feed)
{
    // Return the task so that callers can wait until GetResults() has run
    return Task.Factory.StartNew(() => feed.SendData())
        .ContinueWith(_ => Delay(1000 * 60 * 5))
        .Unwrap() // flatten Task<Task> so the next step waits for the delay
        .ContinueWith(_ => feed.GetResults());
}
//http://stevenhollidge.blogspot.com/2012/06/async-taskdelay.html
//(on .NET 4.5 and later, Task.Delay does this for you)
Task Delay(int milliseconds)
{
    var tcs = new TaskCompletionSource<object>();
    new System.Threading.Timer(_ => tcs.SetResult(null)).Change(milliseconds, -1);
    return tcs.Task;
}
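If C# 5 and .NET 4.5 are available, the same flow can be written with async/await, Task.Delay and Task.WhenAll. The sketch below is only an illustration: the IFeed members are guessed from the question, FakeFeed and FeedRunner are made-up names, and the wait is a parameter so that it can be shortened for testing.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

// Assumed shape of the interface from the question.
public interface IFeed
{
    void SendData();
    void GetResults();
}

// Hypothetical feed used only to exercise the sketch.
public sealed class FakeFeed : IFeed
{
    public bool Sent;
    public bool GotResults;
    public void SendData() { Sent = true; }
    public void GetResults() { GotResults = Sent; }
}

public static class FeedRunner
{
    public static async Task DoSomethingToFeedAsync(IFeed feed, TimeSpan wait)
    {
        await Task.Run(() => feed.SendData());    // blocking call moved to the thread pool
        await Task.Delay(wait);                   // no thread is held while waiting
        await Task.Run(() => feed.GetResults());
    }

    // Starts every feed at once and completes when all have fetched their results.
    public static Task ProcessAllAsync(IEnumerable<IFeed> feeds, TimeSpan wait)
    {
        return Task.WhenAll(feeds.Select(feed => DoSomethingToFeedAsync(feed, wait)));
    }
}
```

With the five-minute wait from the question, the call would be ProcessAllAsync(feeds, TimeSpan.FromMinutes(5)), awaited or waited on by the caller.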
My generalized question is this: how do you write asynchronous code that is still clear and easy to follow, like a synchronous solution would be?
My experience is that if you need to make some synchronous code asynchronous, using something like BackgroundWorker, you no longer have a series of easy-to-follow program statements that express your overall intent and order of activities. Instead you end up with a bunch of "Done" event handlers, each of which starts the next BackgroundWorker, producing code that's really hard to follow.
I know that's not very clear; something more concrete:
Let's say a function in my WinForms application needs to start up some Amazon EC2 instances, wait for them to become running, and then wait for them to all accept an SSH connection. A synchronous solution in pseudo code might look like this:
instances StartNewInstances() {
    instances = StartInstances()
    WaitForInstancesToBecomeRunning(instances)
    WaitForInstancesToAcceptSSHConnection(instances)
    return (instances)
}
That's nice. What is happening is very clear, and the order of program actions is very clear. No white noise to distract you from understanding the code and the flow. I'd really like to end up with code that looks like that.
But in reality, I can't have a synchronous solution: each of those functions can run for a long time, and each needs to do things like update the UI, monitor for time-outs being exceeded, and retry operations periodically until success or time-out. In short, each of these needs to be happening in the background so the foreground UI thread can continue on.
But if I use solutions like BackgroundWorker, it seems like I don't end up with nice easy-to-follow program logic like the above. Instead I might start a background worker from my UI thread to perform the first function, and then my UI thread goes back to the UI while the worker thread runs. When it finishes, its "done" event handler might start the next BackgroundWorker. When that finishes, its "done" event handler might start the last BackgroundWorker, and so on. This means you have to "follow the trail" of the Done event handlers in order to understand the overall program flow.
There has to be a better way that a) lets my UI thread be responsive, b) lets my async operations update the UI, and most importantly c) lets me express my program as a series of consecutive steps (as I've shown above) so that someone can understand the resultant code.
Any and all input would be greatly appreciated!
Michael
My generalized question is this: how do you write asynchronous code that is still clear and easy to follow, like a synchronous solution would be?
You wait for C# 5. It won't be long now. async/await rocks. You've really described the feature in the above sentence... See the Visual Studio async homepage for tutorials, the language spec, downloads etc.
At the moment, there really isn't a terribly clean way - which is why the feature was required in the first place. Asynchronous code very naturally becomes a mess, especially when you consider error handling etc.
Your code would be expressed as:
async Task<List<Instance>> StartNewInstances() {
    List<Instance> instances = await StartInstancesAsync();
    await instances.ForEachAsync(instance => instance.WaitUntilRunningAsync());
    await instances.ForEachAsync(instance => instance.WaitToAcceptSSHConnectionAsync());
    return instances;
}
That's assuming a little bit of extra work, such as an extension method on IEnumerable<T> with the form
public static Task ForEachAsync<T>(this IEnumerable<T> source,
Func<T, Task> taskStarter)
{
// Stuff. It's not terribly tricky :(
}
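One possible way to fill in that body, as a sketch only: start a task for every element and combine them with Task.WhenAll. This version runs all the operations concurrently; other designs (sequential execution, a concurrency limit) are equally valid readings of "ForEachAsync". The class name AsyncEnumerableExtensions is made up.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

public static class AsyncEnumerableExtensions
{
    // Starts taskStarter for every element and returns a task that
    // completes when all of the started tasks have completed.
    public static Task ForEachAsync<T>(this IEnumerable<T> source,
                                       Func<T, Task> taskStarter)
    {
        if (source == null) throw new ArgumentNullException("source");
        if (taskStarter == null) throw new ArgumentNullException("taskStarter");
        // Task.WhenAll enumerates the sequence immediately, so all tasks start here.
        return Task.WhenAll(source.Select(taskStarter));
    }
}
```

If any of the started tasks fails, the returned task fails too, and awaiting it rethrows the first of those exceptions.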
On the off chance that you can't wait for C# 5 as Jon rightly suggests, I'd suggest that you look at the Task Parallel Library (part of .NET 4). It provides a lot of the plumbing around the "Do this asynchronously, and when it finishes do that" paradigm that you describe in the question. It also has solid support for error handling in the asynchronous tasks themselves.
Async/await is really the best way to go. However, if you don't want to wait, you can try continuation-passing style, or CPS. To do this, you pass a delegate into the async method, which is called when processing is complete. In my opinion, this is cleaner than having all of the extra events.
That will change this method signature
Foo GetFoo(Bar bar)
{
return new Foo(bar);
}
To
void GetFooAsync(Bar bar, Action<Foo> callback)
{
Foo foo = new Foo(bar);
callback(foo);
}
Then to use it, you would have
Bar bar = new Bar();
GetFooAsync(bar, GetFooAsyncCallback);
....
public void GetFooAsyncCallback(Foo foo)
{
//work with foo
}
This gets a little tricky when GetFoo could throw an exception. The method I prefer is to change the signature of GetFooAsync.
void GetFooAsync(Bar bar, Action<Func<Foo>> callback)
{
    Foo foo;
    try
    {
        foo = new Foo(bar);
    }
    catch(Exception ex)
    {
        callback(() => {throw ex;});
        return;
    }
    callback(() => foo);
}
Your callback method will look like this
public void GetFooAsyncCallback(Func<Foo> getFoo)
{
    try
    {
        Foo foo = getFoo();
        //work with foo
    }
    catch(Exception ex)
    {
        //handle exception
    }
}
Other methods involve giving the callback two parameters, the actual result and an exception.
void GetFooAsync(Bar bar, Action<Foo, Exception> callback);
This relies on the callback checking for an exception, which could allow it to be ignored. Other methods have two call backs, one for success, and one for failure.
void GetFooAsync(Bar bar, Action<Foo> callback, Action<Exception> error);
To me this makes the flow more complicated, and still allows the Exception to be ignored.
However, giving the callback a method that must be called to get the result forces the callback to deal with the Exception.
When it finishes, its "done" event handler might start the next Background Worker.
This is something that I've been struggling with for a while. Basically waiting for a process to finish without locking the UI.
Instead of using a backgroundWorker to start a backgroundWorker however, you can just do all the tasks in one backgroundWorker. Inside the backgroundWorker.DoWork function, it runs synchronously on that thread. So you can have one DoWork function that processes all 3 items.
Then you have to wait on just the one BackgroundWorker.Completed and have "cleaner" code.
So you can end up with
BackgroundWorker_DoWork
    returnValue = LongFunction1
    returnValue2 = LongFunction2(returnValue)
    LongFunction3
BackgroundWorker_ProgressReported
    Common Update UI code for any of the 3 LongFunctions
BackgroundWorker_Completed
    Notify user long process is done
In some scenarios (explained below), you can wrap the async calls in a method like the following pseudo code:
byte[] ReadTheFile() {
var buf = new byte[1000000];
var signal = new AutoResetEvent(false);
proxy.BeginReadAsync(..., data => {
data.FillBuffer(buf);
signal.Set();
});
signal.WaitOne();
return buf;
}
For the above code to work, the callback needs to be invoked from a different thread, so whether you can use it depends on what you are working with. In my experience, at least Silverlight web service calls are handled on the UI thread, which means the above pattern cannot be used: if the UI thread is blocked, the preceding begin call cannot even be carried out. If you are working with this kind of framework, another way to handle multiple async calls is to move your higher-level logic to a background thread and use the UI thread for communication. However, this approach is overkill in most cases because it requires some boilerplate code to start and stop the background thread.
I'm new to the TPL (Task-Parallel Library) and am wondering if the following is the most efficient way to spin up 1 or more tasks, collate the results, and display them in a datagrid.
Search1 & Search2 talk to two separate databases, but return the same results.
I disable the buttons and turn on a spinner.
I'm firing the tasks off using a single ContinueWhenAll method call.
I've added the scheduler to the ContinueWhenAll call to update form buttons, datagrid, and turn off the spinner.
Q: Am I doing this the right way ? Is there a better way ?
Q: How could I add cancellation/exception checking to this ?
Q: If I needed to add progress reporting - how would I do that ?
The reason that I chose this method over say, a background worker is so that I could fire each DB task off in parallel vs. sequentially. Besides that, I thought it might be fun to use the TPL.. however, since I could not find any concrete examples of what I'm doing below (multiple tasks) I thought it might be nice to put it on here to get the answers, and hopefully be an example for others.
Thank you!
Code:
// Disable buttons and start the spinner
btnSearch.Enabled = btnClear.Enabled = false;
searchSpinner.Active = searchSpinner.Visible = true;
// Setup scheduler
TaskScheduler scheduler = TaskScheduler.FromCurrentSynchronizationContext();
// Start the tasks
Task.Factory.ContinueWhenAll(
    // Define the search tasks that return List<ImageDocument>
    new [] {
        Task.Factory.StartNew<List<ImageDocument>>(Search1),
        Task.Factory.StartNew<List<ImageDocument>>(Search2)
    },
    // Process the return results
    (taskResults) => {
        // Create a holding list
        List<ImageDocument> documents = new List<ImageDocument>();
        // Iterate through the results and add them to the holding list
        foreach (var item in taskResults) {
            documents.AddRange(item.Result);
        }
        // Assign the document list to the grid
        grid.DataSource = documents;
        // Re-enable the search buttons
        btnSearch.Enabled = btnClear.Enabled = true;
        // End the spinner
        searchSpinner.Active = searchSpinner.Visible = false;
    },
    CancellationToken.None,
    TaskContinuationOptions.None,
    scheduler
);
Q: Am I doing this the right way ? Is there a better way ?
Yes, this is a good way to handle this type of situation. Personally, I would consider refactoring the disable/enable of the UI into a separate method, but other than that, this seems very reasonable.
Q: How could I add cancellation/exception checking to this ?
You could pass around a CancellationToken to your methods, and have them check it and throw if a cancellation was requested.
You'd handle exceptions where you grab the results from taskResults. This line:
documents.AddRange(item.Result);
Is where the exception will get thrown (as an AggregateException, whose inner exception is a TaskCanceledException if the task was cancelled) if an exception or cancellation occurred during the operations.
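As a rough illustration of handling that inside the continuation, something along these lines would work. ResultCollector and its parameters are made up, and plain int lists stand in for List<ImageDocument>; a failed search is recorded as an error message instead of aborting the whole merge.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class ResultCollector
{
    // Merges the results of all searches; failures are recorded in 'errors'.
    public static List<int> Collect(Task<List<int>>[] searches, List<string> errors)
    {
        var documents = new List<int>();
        foreach (var search in searches)
        {
            try
            {
                documents.AddRange(search.Result);   // throws if the task faulted or was cancelled
            }
            catch (AggregateException ex)
            {
                // Flatten() unwraps nested AggregateExceptions into one list
                foreach (var inner in ex.Flatten().InnerExceptions)
                    errors.Add(inner.Message);
            }
        }
        return documents;
    }
}
```

You could equally check task.IsFaulted or task.IsCanceled before touching Result; the try/catch version just keeps the loop close to the original code.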
Q: If I needed to add progress reporting - how would I do that ?
The simplest way would be to pass the scheduler into your methods. Once you've done that, you could use it to schedule a task that updates the UI on the UI thread, i.e. Task.Factory.StartNew with the TaskScheduler specified.
however, since I could not find any concrete examples of what I'm doing below (multiple tasks)
Just FYI - I have samples of working with multiple tasks in Part 18 of my series on TPL.
For best practices, read the Task-Based Asynchronous Pattern document. It includes recommendations on cancellation support and progress notification for Task-based APIs.
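The shape that document recommends is a method that accepts a CancellationToken and an IProgress<T>. Here is a small sketch under that assumption; Search.RunAsync, the "doc" strings and DelegateProgress are all invented for illustration, and the loop stands in for the real database work.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical IProgress<int> that invokes its callback synchronously.
public sealed class DelegateProgress : IProgress<int>
{
    private readonly Action<int> _handler;
    public DelegateProgress(Action<int> handler) { _handler = handler; }
    public void Report(int value) { _handler(value); }
}

public static class Search
{
    // Runs 'steps' units of work on the thread pool, reporting percent complete
    // after each one and stopping early if cancellation is requested.
    public static Task<List<string>> RunAsync(int steps,
                                              IProgress<int> progress,
                                              CancellationToken cancel)
    {
        return Task.Run(() =>
        {
            var results = new List<string>();
            for (int i = 1; i <= steps; i++)
            {
                cancel.ThrowIfCancellationRequested();
                results.Add("doc" + i);                  // stand-in for real work
                if (progress != null) progress.Report(i * 100 / steps);
            }
            return results;
        }, cancel);
    }
}
```

In a WinForms or WPF app you would more likely pass a Progress<int> created on the UI thread, since it posts each report back to the synchronization context it was created on.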
You'd also benefit from the async/await keywords in the Async CTP; they greatly simplify task continuations.