I have a function which is along the lines of
private void DoSomethingToFeed(IFeed feed)
{
feed.SendData(); // Send data to remote server
Thread.Sleep(1000 * 60 * 5); // Sleep 5 minutes
feed.GetResults(); // Get data from remote server after it's processed it
}
I want to parallelize this, since I have lots of feeds that are all independent of each other. Based on this answer, leaving the Thread.Sleep() in there is not a good idea. I also want to wait after all the threads have spun up, until they've all had a chance to get their results.
What's the best way to handle a scenario like this?
Edit, because I accidentally left it out: I had originally considered calling this function as Parallel.ForEach(feeds, DoSomethingToFeed), but I was wondering if there was a better way to handle the sleeping when I found the answer I linked to.
Unless you have an awful lot of threads, you can keep it simple. Create all the threads. You'll get some thread creation overhead, but since the threads are basically sleeping the whole time, you won't get too much context switching.
It'll be easier to code than any other solution (unless you're using C# 5). So start with that, and improve it only if you actually see a performance problem.
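A minimal sketch of that thread-per-feed approach follows. The IFeed shape, the DemoFeed stand-in, and the ThreadPerFeed.ProcessAll name are assumptions made for illustration; the question does not show the real interface.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

// Assumed shape of the question's IFeed; the real interface is not shown.
public interface IFeed
{
    void SendData();
    void GetResults();
}

// Hypothetical stand-in feed that only records which calls were made.
public class DemoFeed : IFeed
{
    public bool Sent { get; private set; }
    public bool Received { get; private set; }
    public void SendData() { Sent = true; }
    public void GetResults() { Received = true; }
}

public static class ThreadPerFeed
{
    // Runs send / wait / get-results for every feed on its own thread,
    // then blocks until all of those threads have finished.
    public static void ProcessAll(IEnumerable<IFeed> feeds, TimeSpan wait)
    {
        var threads = new List<Thread>();
        foreach (var feed in feeds)
        {
            var current = feed; // per-iteration copy for the closure (required before C# 5)
            var thread = new Thread(() =>
            {
                current.SendData();
                Thread.Sleep(wait); // the thread is idle here, so it uses almost no CPU
                current.GetResults();
            });
            thread.Start();
            threads.Add(thread);
        }

        foreach (var thread in threads)
        {
            thread.Join(); // wait until every feed has fetched its results
        }
    }
}
```

For the question's scenario the call would be something like ThreadPerFeed.ProcessAll(feeds, TimeSpan.FromMinutes(5)).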
I think you should take a look at the Task class in .NET. It is a nice abstraction on top of more low level threading / thread pool management.
In order to wait for all tasks to complete, you can use Task.WaitAll.
An example use of Tasks could look like:
IFeed feedOne = new SomeFeed();
IFeed feedTwo = new SomeFeed();
var t1 = Task.Factory.StartNew(() => { feedOne.SendData(); });
var t2 = Task.Factory.StartNew(() => { feedTwo.SendData(); });
// Waits for all provided tasks to finish execution
Task.WaitAll(t1, t2);
However, another solution would be to use Parallel.ForEach, which handles all Task creation for you and batches the work appropriately. A good comparison of the two approaches is given here; among other good points, it states that:
Parallel.ForEach, internally, uses a Partitioner to distribute your collection into work items. It will not do one task per item, but rather batch this to lower the overhead involved.
Check WaitHandle for waiting on tasks.
private void DoSomethingToFeed(IFeed feed)
{
Task.Factory.StartNew(() => feed.SendData())
.ContinueWith(_ => Delay(1000 * 60 * 5)
.ContinueWith(__ => feed.GetResults())
);
}
//http://stevenhollidge.blogspot.com/2012/06/async-taskdelay.html
Task Delay(int milliseconds)
{
var tcs = new TaskCompletionSource<object>();
new System.Threading.Timer(_ => tcs.SetResult(null)).Change(milliseconds, -1);
return tcs.Task;
}
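If C# 5 and .NET 4.5 or later are available, the same idea can probably be written with async/await and the built-in Task.Delay. This is only a sketch: the IFeed shape, RecordingFeed, and the AsyncFeeds names are assumptions made for illustration, not part of the question.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

// Assumed shape of the question's IFeed; the real interface is not shown.
public interface IFeed
{
    void SendData();
    void GetResults();
}

// Hypothetical stand-in feed that only records which calls were made.
public class RecordingFeed : IFeed
{
    public bool Sent { get; private set; }
    public bool GotResults { get; private set; }
    public void SendData() { Sent = true; }
    public void GetResults() { GotResults = true; }
}

public static class AsyncFeeds
{
    // Send, wait without blocking a thread, then fetch the results.
    public static async Task ProcessAsync(IFeed feed, TimeSpan wait)
    {
        feed.SendData();
        await Task.Delay(wait); // timer based: no thread sleeps during the wait
        feed.GetResults();
    }

    // Starts processing for every feed and completes when all have their results.
    public static Task ProcessAllAsync(IEnumerable<IFeed> feeds, TimeSpan wait)
    {
        return Task.WhenAll(feeds.Select(f => ProcessAsync(f, wait)));
    }
}
```

Note that SendData still runs synchronously, one feed after another, on the calling thread; if it is slow, wrapping it in Task.Run may be worth considering.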
Related
This sounds like an overly trivial question, and I think I am overcomplicating it because I haven't been able to find the answer for months. There are easy ways of doing this in Golang, Scala/Akka, etc but I can't seem to find anything in .NET.
What I need is an ability to have a list of Tasks that are all independent of each other, and the ability to execute them concurrently on a specified (and easily changeable) number of threads.
Basically something like:
int numberOfParallelThreads = 3; // changeable
Queue<Task> pendingTasks = GetPendingTasks(); // returns 80 items
await SomeBuiltInDotNetParallelExecutableManager.RunAllTasksWithSpecifiedConcurrency(pendingTasks, numberOfParallelThreads);
And that SomeBuiltInDotNetParallelExecutableManager would execute 80 tasks three at a time; i.e. when one finishes it draws the next one from the queue, until the queue is exhausted.
There is Task.WhenAll and Task.WaitAll, but you can't specify the max number of parallel threads in them.
Is there a built in, simple way to do this?
Parallel.ForEachAsync (or, depending on the actual workload, its synchronous counterpart Parallel.ForEach, although that one will not correctly handle functions that return Task):
IEnumerable<int> x = ...;
await Parallel.ForEachAsync(x, new ParallelOptions
{
MaxDegreeOfParallelism = 3
}, async (i, token) => await Task.Delay(i * 1000, token));
Also, it is highly recommended that methods in C# return so-called "hot" (already started) tasks. Idiomatically, then, a Queue<Task> would be a collection of tasks that are already running, and you would have no control over how many of them execute in parallel, because that is decided by the ThreadPool/TaskScheduler.
There is also a port of Akka to .NET, Akka.NET, if you want to go down that route.
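On runtimes without Parallel.ForEachAsync, one common workaround is to queue task factories (Func<Task>) instead of started tasks and gate them with a SemaphoreSlim. The sketch below illustrates that idea; Throttled.RunAllAsync is a hypothetical helper, not a built-in API.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public static class Throttled
{
    // Runs every work item, allowing at most maxConcurrency of them in flight at once.
    // Each item is a factory, so the task is only created once a slot is free.
    public static async Task RunAllAsync(IEnumerable<Func<Task>> work, int maxConcurrency)
    {
        using (var gate = new SemaphoreSlim(maxConcurrency, maxConcurrency))
        {
            var running = work.Select(async start =>
            {
                await gate.WaitAsync(); // wait for a free slot
                try
                {
                    await start(); // the task is created and awaited here
                }
                finally
                {
                    gate.Release(); // hand the slot to the next item
                }
            }).ToList(); // materialize so every wrapper is started exactly once

            await Task.WhenAll(running);
        }
    }
}
```

With 80 factories and maxConcurrency set to 3, a new item starts each time one of the three running items finishes, which is the behaviour the question asks for.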
Microsoft's Reactive Framework makes this easy too:
IEnumerable<int> values = ...;
IDisposable subscription =
values
.ToObservable()
.Select(v => Observable.Defer(() => Observable.Start(() => { /* do work on each value */ })))
.Merge(3)
.Subscribe();
I have the following code:
var factory = new TaskFactory();
for (int i = 0; i < 100; i++)
{
var i1 = i;
factory.StartNew(() => foo(i1));
}
static void foo(int i)
{
Thread.Sleep(1000);
Console.WriteLine($"foo{i} - on thread {Thread.CurrentThread.ManagedThreadId}");
}
I can see it only does 4 threads at a time (based on observation). My questions:
What determines the number of threads used at a time?
How can I retrieve this number?
How can I change this number?
P.S. My box has 4 cores.
P.P.S. I needed to have a specific number of tasks (and no more) that are concurrently processed by the TPL and ended up with the following code:
private static int count = 0; // keep track of how many concurrent tasks are running
private static void SemaphoreImplementation()
{
var s = new Semaphore(20, 20); // allow 20 tasks at a time
for (int i = 0; i < 1000; i++)
{
var i1 = i;
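Task.Factory.StartNew(() =>
{
s.WaitOne(); // acquire outside the try so a failed wait never triggers Release
Interlocked.Increment(ref count);
try
{
foo(i1);
}
finally
{
Interlocked.Decrement(ref count); // decrement before freeing the slot
s.Release();
}
}, TaskCreationOptions.LongRunning);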
}
}
static void foo(int i)
{
Thread.Sleep(100);
Console.WriteLine($"foo{i:00} - on thread " +
$"{Thread.CurrentThread.ManagedThreadId:00}. Executing concurrently: {count}");
}
When you are using a Task in .NET, you are telling the TPL to schedule a piece of work (via TaskScheduler) to be executed on the ThreadPool. Note that the work will be scheduled at its earliest opportunity and however the scheduler sees fit. This means that the TaskScheduler will decide how many threads will be used to run n number of tasks and which task is executed on which thread.
The TPL is very well tuned and continues to adjust its algorithm as it executes your tasks. So, in most cases, it tries to minimize contention. What this means is that if you are running 100 tasks and only have 4 cores (which you can get using Environment.ProcessorCount), it would not make sense to execute more than 4 threads at any given time, as otherwise it would need to do more context switching. There are times when you want to explicitly override this behaviour, for example when you need to wait for some sort of IO to finish, which is a whole different story.
In summary, trust the TPL. But if you are adamant to spawn a thread per task (not always a good idea!), you can use:
Task.Factory.StartNew(
() => /* your piece of work */,
TaskCreationOptions.LongRunning);
This tells the default TaskScheduler to explicitly spawn a new thread for that piece of work.
You can also use your own Scheduler and pass it in to the TaskFactory. You can find a whole bunch of Schedulers HERE.
Another alternative is PLINQ, which by default analyses your query and decides whether parallelizing it would yield any benefit. In the case of blocking IO, where you are certain that starting multiple threads will result in better execution, you can force parallelism with WithExecutionMode(ParallelExecutionMode.ForceParallelism). You can then use WithDegreeOfParallelism to hint at how many threads to use, but remember there is no guarantee you will get that many, as MSDN says:
Sets the degree of parallelism to use in a query. Degree of
parallelism is the maximum number of concurrently executing tasks that
will be used to process the query.
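A short sketch of such a PLINQ query is shown here. PlinqDemo and its parameters are illustrative names only, and Thread.Sleep stands in for the blocking IO.

```csharp
using System;
using System.Linq;
using System.Threading;

public static class PlinqDemo
{
    // Simulates a blocking wait for each item and returns the squared values.
    // Without AsOrdered(), the order of the results is not guaranteed.
    public static int[] SquareWithBlockingWait(int[] items, int degree, int waitMs)
    {
        return items
            .AsParallel()
            .WithExecutionMode(ParallelExecutionMode.ForceParallelism) // skip the "is it worth it" analysis
            .WithDegreeOfParallelism(degree) // upper bound on concurrently executing tasks
            .Select(x =>
            {
                Thread.Sleep(waitMs); // stand-in for blocking IO
                return x * x;
            })
            .ToArray();
    }
}
```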
Finally, I highly recommend having a read of THIS great series of articles on Threading and TPL.
If you increase the number of tasks to for example 1000000 you will see a lot more threads spawned over time. The TPL tends to inject one every 500ms.
The TPL thread pool does not understand IO-bound workloads (sleeping behaves like IO here: the thread is blocked without using the CPU). It's not a good idea to rely on the TPL to pick the right degree of parallelism in these cases. The TPL is completely clueless and injects more threads based on vague guesses about throughput. It also injects threads to avoid deadlocks.
Here, the TPL policy clearly is not useful because the more threads you add the more throughput you get. Each thread can process one item per second in this contrived case. The TPL has no idea about that. It makes no sense to limit the thread count to the number of cores.
What determines the number of threads used at a time?
Barely documented TPL heuristics. They frequently go wrong. In particular they will spawn an unlimited number of threads over time in this case. Use task manager to see for yourself. Let this run for an hour and you'll have 1000s of threads.
How can I retrieve this number? How can I change this number?
You can retrieve some of these numbers but that's not the right way to go. If you need a guaranteed DOP you can use AsParallel().WithDegreeOfParallelism(...) or a custom task scheduler. You also can manually start LongRunning tasks. Do not mess with process global settings.
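For reference, the numbers that can be read back are the thread pool limits, not the scheduler's current decision. A small read-only sketch (PoolInfo.Describe is a hypothetical helper name):

```csharp
using System;
using System.Threading;

public static class PoolInfo
{
    // Reads the current thread pool limits without changing any of them.
    public static string Describe()
    {
        int minWorker, minIo, maxWorker, maxIo, freeWorker, freeIo;
        ThreadPool.GetMinThreads(out minWorker, out minIo);
        ThreadPool.GetMaxThreads(out maxWorker, out maxIo);
        ThreadPool.GetAvailableThreads(out freeWorker, out freeIo);

        return string.Format(
            "cores={0}, worker min/max/free={1}/{2}/{3}, IO min/max/free={4}/{5}/{6}",
            Environment.ProcessorCount,
            minWorker, maxWorker, freeWorker,
            minIo, maxIo, freeIo);
    }
}
```

The worker minimum is the number of threads the pool creates on demand before it starts throttling thread injection, which may explain the "4 at a time" observation on a 4-core box.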
I would suggest using SemaphoreSlim. It does not use Windows kernel semaphores (so it behaves the same way in Linux C# microservices), it supports asynchronous waiting through WaitAsync, and its CurrentCount property tells you how many slots are still free, so you don't need Interlocked.Increment or Interlocked.Decrement. Keep the i1 copy, though: the lambda captures the loop variable i itself rather than its value at that moment, so without the copy a task that starts late may see a later value of i.
private static void SemaphoreImplementation()
{
    const int maxTasksCount = 20; // allow 20 tasks at a time
    var semaphoreSlim = new SemaphoreSlim(maxTasksCount, maxTasksCount);
    for (int i = 0; i < 1000; i++)
    {
        var i1 = i; // copy: the lambda would otherwise capture the shared loop variable
        // Task.Run understands async lambdas; LongRunning brings no benefit for awaited work
        Task.Run(async () =>
        {
            await semaphoreSlim.WaitAsync(); // acquire before the try so Release always matches
            try
            {
                // CurrentCount is the number of free slots, so the difference is the number in use
                var count = maxTasksCount - semaphoreSlim.CurrentCount;
                await foo(i1, count);
            }
            finally
            {
                semaphoreSlim.Release();
            }
        });
    }
}
// Returns Task (not async void) so that callers can await it
static async Task foo(int i, int count)
{
    await Task.Delay(100);
    Console.WriteLine($"foo{i:00} - on thread " +
        $"{Thread.CurrentThread.ManagedThreadId:00}. Executing concurrently: {count}");
}
I need to execute multiple long-running operations in parallel and would like to report progress in some way. From my initial research, it seems that IObservable fits this model. The idea is that I call a method that returns an IObservable of int, where the int is the reported percent complete. Parallel execution starts immediately when the method returns. The observable must be hot, so that all subscribers see the same progress information at a given point in time; a late subscriber, for example, may only learn that the whole execution is complete and there is no more progress to track.
The closest approach I have found is to use Observable.ForkJoin and Observable.Start, but I can't work out how to combine them into a single observable that I can return from a method.
Please share your ideas of how can it be achieved or maybe there is another approach to this problem using .Net RX.
To make a hot observable, I would probably start with a method that uses a BehaviorSubject as the return value and the way the operations report progress. If you just want the example, skip to the end. The rest of this answer explains the steps.
I will assume for the sake of this answer that your long-running operations do not have their own way to be called asynchronously. If they do, the next step may be a little different. The next thing to do is to send the work to another thread using an IScheduler. You may allow the caller to select where the work happens by making an overload that takes the scheduler as a parameter if desired (in which case the overload that does not will pick a default scheduler). There are quite a few overloads of IScheduler.Schedule, several of which are extension methods, so you should look through them to see which is most appropriate for your situation; I'm using the one that takes only an Action here. If you have multiple operations that can all run in parallel, you can call scheduler.Schedule multiple times.
The hardest part of this will probably be determining what the progress is at any given point. If you have multiple operations going on at once, you will probably need to keep track of how many have completed to know what the current progress is. With the information you provided, I can't be more specific than that.
Finally, if your operations are cancellable, you may want to take a CancellationToken as a parameter. You can use this to cancel the operation while it is in the scheduler's queue before it starts. If you write your operation code correctly, it can use the token for cancellation as well.
IObservable<int> DoStuff(/*args*/,
CancellationToken cancel,
IScheduler scheduler)
{
//a BehaviorSubject needs an initial value; start at 0 percent
var progress = new BehaviorSubject<int>(0);
//if you don't take it as a parameter, pick a scheduler
//IScheduler scheduler = Scheduler.ThreadPool;
var disp = scheduler.Schedule(() =>
{
//do stuff that needs to run on another thread
//report progress
progress.OnNext(25);
});
var disp2 = scheduler.Schedule(...);
//if the operation is cancelled before the scheduler has started it,
//you need to dispose the return from the Schedule calls
var allOps = new CompositeDisposable(disp, disp2);
cancel.Register(allOps.Dispose);
return progress;
}
Here is one approach
// setup a method to do some work,
// and report its own partial progress
Func<string, IObservable<int>> doPartialWork =
(arg) => Observable.Create<int>(obsvr => {
return Scheduler.TaskPool.Schedule(arg,(sched,state) => {
var progress = 0;
var cancel = new BooleanDisposable();
while(progress < 10 && !cancel.IsDisposed)
{
// do work with arg
Thread.Sleep(550);
obsvr.OnNext(1); //report progress
progress++;
}
obsvr.OnCompleted();
return cancel;
});
});
var myArgs = new[]{"Arg1", "Arg2", "Arg3"};
// run all the partial bits of work
// use SelectMany to get a flat stream of
// partial progress notifications
var xsOfPartialProgress =
myArgs.ToObservable(Scheduler.NewThread)
.SelectMany(arg => doPartialWork(arg))
.Replay().RefCount();
// use Scan to get a running aggregation of progress
var xsProgress = xsOfPartialProgress
.Scan(0d, (prog,nextPartial)
=> prog + (nextPartial/(myArgs.Length*10d)));
At first I had a thread-waiting example, and it works perfectly. Its job is to make 100 threads wait 3 seconds and then print some output:
for (int i = 0; i < 100; ++i)
{
int index = i;
Thread t = new Thread(() =>
{
Caller c = new Caller();
c.DoWaitCall();
}) { IsBackground = true };
t.Start();
}
Caller.DoWaitCall() looks like this:
public void DoWaitCall()
{
Thread.Sleep(3000);
Console.WriteLine("done");
}
In this case, all threads wait 3 seconds and print their output message at almost the same time.
But when I try to use an async callback to do the Console.WriteLine:
public void DoWaitCall()
{
MyDel del = () => { Thread.Sleep(3000); };
del.BeginInvoke(CallBack, del);
}
private void CallBack(IAsyncResult r)
{
Console.WriteLine("done");
}
Each thread waits for a different amount of time, and the output appears slowly, one line at a time.
Is there a good way to run the async callbacks in parallel?
The effect you're seeing is the ThreadPool ramping up gradually, basically. The idea is that it's relatively expensive to create (and then keep) threads, and the ThreadPool is designed for short-running tasks. So if it receives a bunch of tasks in a short space of time, it makes sense to batch them up, only starting new threads when it spots that there are still tasks waiting a little bit later.
You can force it to keep a minimum number of threads around using ThreadPool.SetMinThreads. For real systems you shouldn't normally need to do this, but it makes sense for a demo or something similar.
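A hedged sketch of raising the minimum for a demo follows. PoolWarmup.EnsureMinWorkers is a hypothetical helper; it only ever raises the worker minimum and leaves the IO completion minimum as it was.

```csharp
using System;
using System.Threading;

public static class PoolWarmup
{
    // Raises the worker-thread minimum to at least `workers`.
    // Returns false if the pool rejects the value (for example, above its maximum).
    public static bool EnsureMinWorkers(int workers)
    {
        int minWorker, minIo;
        ThreadPool.GetMinThreads(out minWorker, out minIo);
        if (minWorker >= workers)
        {
            return true; // already high enough, nothing to change
        }
        return ThreadPool.SetMinThreads(workers, minIo);
    }
}
```

For the 100-callback demo in the question, a call such as PoolWarmup.EnsureMinWorkers(100) before starting the work should let the pool create those threads without the gradual ramp-up. The setting is process-wide, so it is best kept out of library code.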
The first time, you spawned many threads that do the job in parallel. The second time, you used the thread pool, which has a limited number of threads. As Jon noted, you can call ThreadPool.SetMinThreads to set the minimum number of threads.
But why do you need to make an async call from your parallel threads? It will not improve performance at all: your work is already done in parallel, and you are adding another split (through the thread pool), which introduces more latency due to thread context switches. There is no need to do that.
I am building a node-based drag-and-drop editor, where each node represents one action (for example, read this file, or sort this data, etc.) Outputs and inputs of nodes can be connected.
One of the features I'd like to implement is automatic parallelization, so that if a path branches off I can automatically begin a thread to handle each branch. I'm concerned about a few issues, however:
If a path branches off, but then later joins back together, I will need to synchronize them somehow
If there are multiple start-nodes (where execution begins), their paths will have to be managed separately and then possibly dynamically joined/merged
I want to limit how many threads are created so that I don't suddenly have 20 threads deadlocked
Essentially, I'd like to know if any strategies for doing something like this exist (not looking for code necessarily; just theory). Could scheduling algorithms help?
Thanks for your advice! I look forward to hearing your suggestions.
Note: I'm using C# 3.5, so none of the fun parallel-tasking abilities are available to me. If necessary, I will make the switch to C# 4.0, but I'd like to avoid this.
The Task Parallel Library might be exactly what you're looking for.
I imagine your node-based drag-and-drop editor to look like this:
Every node is essentially a Task. A Task can be anything -- read a file from disk, download some data from the web, or compute anything.
When a Task has finished, it can ContinueWith one or more other Tasks, passing the result of the old Task to the new Tasks.
A Task can also consist of waiting for multiple Tasks to finish. WhenAll these Tasks have finished, this Task can continue with another Task, passing the result of all Tasks to the new task.
The TPL will schedule all these Tasks on a Thread Pool, so Threads can be reused and each Task doesn't need to have its own Thread. The TPL will find the optimal number of Threads for the system it is running on.
The Visual Studio Async CTP adds native language support for asynchronous operations to C#, which makes working with Tasks really easy and fun.
With the TPL it is just a matter of creating Tasks and composing them according to the node layout.
Complete program code for the above example:
var t1 = Task.Factory.StartNew<int>(() => 42);
var t2a = t1.ContinueWith<int>(t => t.Result + 1);
var t2b = t1.ContinueWith<int>(t => t.Result + 1);
var t3a = t2a.ContinueWith<int>(t => t.Result * 2);
var t3b = t2b.ContinueWith<int>(t => t.Result * 3);
var t4 = TaskEx.WhenAll<int>(t3a, t3b)
.ContinueWith<int>(t => t.Result[0] + t.Result[1]);
t4.ContinueWith(t => { Console.WriteLine(t.Result); });
Console.ReadKey();
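TaskEx.WhenAll comes from the Async CTP; in .NET 4.5 and later the equivalent is Task.WhenAll. Assuming such a runtime is available (the question targets 3.5, so this is only for comparison), the same graph might be sketched as follows. GraphDemo.RunAsync is an illustrative name.

```csharp
using System;
using System.Threading.Tasks;

public static class GraphDemo
{
    // Same node layout as the example above: one start node, two branches, one join.
    public static async Task<int> RunAsync()
    {
        Task<int> t1 = Task.Run(() => 42);
        Task<int> t2a = t1.ContinueWith(t => t.Result + 1);
        Task<int> t2b = t1.ContinueWith(t => t.Result + 1);
        Task<int> t3a = t2a.ContinueWith(t => t.Result * 2);
        Task<int> t3b = t2b.ContinueWith(t => t.Result * 3);

        // Join: results come back in the same order as the tasks passed in
        int[] results = await Task.WhenAll(t3a, t3b);
        return results[0] + results[1]; // (42 + 1) * 2 + (42 + 1) * 3 = 215
    }
}
```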