I have a function like such:
static void AddResultsToDb(IEnumerable<int> numbers)
{
foreach (int number in numbers)
{
int result = ComputeResult(number); // This takes a long time, but is thread safe.
AddResultToDb(number, result); // This is quick but not thread safe.
}
}
I could solve this problem by using, for example, Parallel.ForEach to compute the results, and then use a regular foreach to add the results to the database.
However, for educational purposes, I would like a solution that revolves around await/async. But no matter how much I read about it, I cannot wrap my mind around it. If await/async is not applicable in this context, I would like to understand why.
As others have suggested, this isn't a case of using async/await as that is for asynchrony. What you're doing is concurrency. Microsoft has a framework specifically for that and it solves this problem nicely.
So for learning purposes, you should use Microsoft's Reactive Framework (aka Rx) - NuGet System.Reactive and add using System.Reactive.Linq; - then you can do this:
static void AddResultsToDb(IEnumerable<int> numbers)
{
numbers
.ToObservable()
.SelectMany(n => Observable.Start(() => new { n, r = ComputeResult(n) }))
.Do(x => AddResultToDb(x.n, x.r))
.Wait();
}
The SelectMany/Observable.Start combination allows as many ComputeResult calls to occur as possible concurrently. The nice thing about Rx is that it then serializes the results so that only one call at a time goes to AddResultToDb.
To control the degrees of parallelism you can change the SelectMany to a Select/Merge like this:
static void AddResultsToDb(IEnumerable<int> numbers)
{
numbers
.ToObservable()
.Select(n => Observable.Start(() => new { n, r = ComputeResult(n) }))
.Merge(maxConcurrent: 2)
.Do(x => AddResultToDb(x.n, x.r))
.Wait();
}
The async and await pattern is not really suitable for your first method. It's well suited for IO Bound workloads to achieve scalability, or for frameworks that have UI's for responsiveness. It's less suited for raw CPU workloads.
However you could still get benefits from parallel processing because your first method is expensive and thread safe.
In the following example I used Parallel LINQ (PLINQ) for a fluent expression of the results without worrying about a pre-sized array / concurrent collection / locking, though you could use other TPL functionality, like Parallel.For/ForEach
// Potentially break up the workloads in parallel
// return the number and result in a ValueTuple
var results = numbers.AsParallel()
.Select(x => (number: x, result: ComputeResult(x)))
.ToList();
// iterate through the number and results and execute them serially
foreach (var (number, result) in results)
AddResultToDb(number, result);
Note : The assumption here is the order is not important
Supplemental
Your method AddResultToDb looks like it's just inserting results into a database, which is IO Bound and is worthy of async, furthermore could probably take all results at once and insert them in bulk/batch saving round trips
From Comments credit #TheodorZoulias
To preserve the order you could use the method AsOrdered, at
the cost of some performance penalty. A possible performance
improvement is to remove the ToList(), so that the results are added
to the DB concurrently with the computations.
To make the results available as fast as possible it's probably a good
idea to disable the partial buffering that happens by default, by
chaining the method
.WithMergeOptions(ParallelMergeOptions.NotBuffered) in the query
var results = numbers.AsParallel()
.Select(x => (number: x, result: ComputeResult(x)))
.WithMergeOptions(ParallelMergeOptions.NotBuffered)
.AsOrdered();
Example
Additional resources
ParallelEnumerable.AsOrdered Method
Enables treatment of a data source as if it were ordered, overriding
the default of unordered. AsOrdered may only be invoked on non-generic
sequences
ParallelEnumerable.WithMergeOptions
Sets the merge options for this query, which specify how the query
will buffer output.
ParallelMergeOptions Enum
NotBuffered Use a merge without output buffers. As soon as result elements have been computed, make that element available to the
consumer of the query.
This isn't really a case for async/await because it sounds like ComputeResult is expensive computationally, as opposed to just taking a long, indeterminate amount of time. aync/await is better for tasks you are truly waiting on. Parallel.ForEach will actually thread your workload.
If anything, AddResultToDb is what you would want to async/await - you would be waiting on an external action to complete.
Good in-depth explanation: https://stackoverflow.com/a/35485780/127257
Using Parallel.For honestly seems like the simplest solution, since your computations are likely to be CPU-bound. Async/await is better for I/O bound operations since it does not require another thread to wait for an I/O operation to complete (see there is no thread).
That being said, you can still use async/await for tasks that you put on the thread pool. So here's how you could do it.
static void AddResultToDb(int number)
{
int result = ComputeResult(number);
AddResultToDb(number, result);
}
static async Task AddResultsToDb(IEnumerable<int> numbers)
{
var tasks = numbers.Select
(
number => Task.Run( () => AddResultToDb(number) )
)
.ToList();
await Task.WhenAll(tasks);
}
Related
This sounds like an overly trivial question, and I think I am overcomplicating it because I haven't been able to find the answer for months. There are easy ways of doing this in Golang, Scala/Akka, etc but I can't seem to find anything in .NET.
What I need is an ability to have a list of Tasks that are all independent of each other, and the ability to execute them concurrently on a specified (and easily changeable) number of threads.
Basically something like:
int numberOfParallelThreads = 3; // changeable
Queue<Task> pendingTasks = GetPendingTasks(); // returns 80 items
await SomeBuiltInDotNetParallelExecutableManager.RunAllTasksWithSpecifiedConcurrency(pendingTasks, numberOfParallelThreads);
And that SomeBuiltInDotNetParallelExecutableManager would execute 80 tasks three at a time; i.e. when one finishes it draws the next one from the queue, until the queue is exhausted.
There is Task.WhenAll and Task.WaitAll, but you can't specify the max number of parallel threads in them.
Is there a built in, simple way to do this?
Parallel.ForEachAsync (or depending on actual workload it's sync counterpart - Parallel.ForEach, but it will not handle functions returning Task correctly):
IEnumerable<int> x = ...;
await Parallel.ForEachAsync(x, new ParallelOptions
{
MaxDegreeOfParallelism = 3
}, async (i, token) => await Task.Delay(i * 1000, token));
Also it is highly recommended that methods in C# return so called "hot", i.e. started tasks, so "idiomatically" Queue<Task> should be a collection of already started tasks, so you will have no control over number of them executing in parallel cause it will be controlled by ThreadPool/TaskScheduler.
And there is port of Akka to .NET - Akka.NET if you want to go down that route.
Microsoft's Reactive Framework makes this easy too:
IEnumerable<int> values = ...;
IDisposable subscription =
values
.ToObservable()
.Select(v => Observable.Defer(() => Observable.Start(() => { /* do work on each value */ })))
.Merge(3)
.Subscribe();
I have this (below) process where it collects search results returned from a service. Each result is then added to UI for display.
Can this be improved so that the _list can be processed in parallel (perhaps using multiple threads?), therefore I get faster results?
List<Query> queries = _list.Where(x => string.IsNullOrEmpty(x.Title));
foreach (var item in queries)
{
List<ExtendedSearchResult> searchResults = (await _service.SearchAsync(item.Query))
.Select(x => ExtendedSearchResult.FromSearchResult(x))
.ToList();
if (searchResults != null)
{
foreach (var result in searchResults)
{
_view.AddItem(result);
}
}
}
Found this post but not sure if this applies to my scenario and how to implement it.
Can this be improved so that the _list can be processed in parallel (perhaps using multiple threads?)
Maybe? there is not really anyway to tell from the example. Is the performed work IO-bound or compute bound? The searching might need to take some kind of lock to gain exclusive access to some resource, if that is the case parallelism would do nothing except increase overhead.
As a rule of thumb, parallelism is best for compute bound tasks, while async is best for IO bound tasks. But often a combination can be useful, modern SSDs are inherently parallel, and compute bound tasks are often done in the background to avoid blocking the main thread.
If you want to make this code parallel you need to make it thread-safe, i.e. make sure any shared objects are threadsafe, and ensure the UI is only updated on the main-thread.
There are plenty of resources available for running tasks concurrently. For example using async await for multiple tasks. As an alternative I would consider IAsynEnumerable to update the UI as results are returned.
I've googled this plenty but I'm afraid I don't fully understand the consequences of concurrency and parallelism.
I have about 3000 rows of database objects that each have an average of 2-4 logical data attached to them that need to be validated as a part of a search query, meaning the validation service needs to execute approx. 3*3000 times. E.g. the user has filtered on color then each row needs to validate the color and return the result. The loop cannot break when a match has been found, meaning all logical objects will always need to be evaluated (this is due to calculations of relevance and just not a match).
This is done on-demand when the user selects various properties, meaning performance is key here.
I'm currently doing this by using Parallel.ForEach but wonder if it is smarter to use async behavior instead?
Current way
var validatorService = new LogicalGroupValidatorService();
ConcurrentBag<StandardSearchResult> results = new ConcurrentBag<StandardSearchResult>();
Parallel.ForEach(searchGroups, (group) =>
{
var searchGroupResult = validatorService.ValidateLogicGroupRecursivly(
propertySearchQuery, group.StandardPropertyLogicalGroup);
result.Add(new StandardSearchResult(searchGroupResult));
});
Async example code
var validatorService = new LogicalGroupValidatorService();
List<StandardSearchResult> results = new List<StandardSearchResult>();
var tasks = new List<Task<StandardPropertyLogicalGroupSearchResult>>();
foreach (var group in searchGroups)
{
tasks.Add(validatorService.ValidateLogicGroupRecursivlyAsync(
propertySearchQuery, group.StandardPropertyLogicalGroup));
}
await Task.WhenAll(tasks);
results = tasks.Select(logicalGroupResultTask =>
new StandardSearchResult(logicalGroupResultTask.Result)).ToList();
The difference between parallel and async is this:
Parallel: Spin up multiple threads and divide the work over each thread
Async: Do the work in a non-blocking manner.
Whether this makes a difference depends on what it is that is blocking in the async-way. If you're doing work on the CPU, it's the CPU that is blocking you and therefore you will still end up with multiple threads. In case it's IO (or anything else besides the CPU, you will reuse the same thread)
For your particular example that means the following:
Parallel.ForEach => Spin up new threads for each item in the list (the nr of threads that are spun up is managed by the CLR) and execute each item on a different thread
async/await => Do this bit of work, but let me continue execution. Since you have many items, that means saying this multiple times. It depends now what the results:
If this bit of workis on the CPU, the effect is the same
Otherwise, you'll just use a single thread while the work is being done somewhere else
What is the best way to use parallelization such as with Parallel.ForEach so that I can rapidly iterate a collection and add items to a new List without violating thread safety but using the performance gain of multiple server cores and plenty of RAM?
public List<Leg> FetchLegs(Trip trip)
{
var result = new List<Leg>();
try
{
// get days
var days = FetchDays(trip);
// add each day's legs to result
Parallel.ForEach<Day>(days, day =>
{
var legs = FetchLegs(day);
result.AddRange(legs);
});
// sort legs by scheduledOut
result.Sort((x, y) => x.scheduledOut.Value.CompareTo(y.scheduledOut.Value));
}
catch (Exception ex)
{
SiAuto.Main.LogException(ex);
System.Diagnostics.Debugger.Break();
}
return result;
}
The simplest solution is to use PLINQ and its extension methods like ToList, ToDictionary. Example:
var result = days.AsParallel().SelectMany(d => FetchLegs(d))
.OrderBy(l => l.scheduledOut.Value).ToList();
Shifting the comment to the answer with few more suggestions, beside using a thread safe structure like ConcurrentQueue for thread safety inside Parallel loop, or you may consider a custom thread safe list, check my following post - Thread safe List. In fact you may consider ReaderWriterSlim lock in place of plain lock.
Another important point would be, you may consider using ParallelOptions MaxDegreeOfParallelism, check here with value equal to Environment.ProcessorCount, so that Parallel.ForEach partitioner works well, by not creating a task for each data point in the collection at the beginning, it will do it based on number of logical cores in the system, thus effectively reducing the thread context switching and making application much more efficient
I have certain objects on which certain tasks needs to be performed.On all objects all task needs to be performed. I want to employ multiple threads say N parallel threads
Say I have objects identifiers like A,B,C (Objects can be in 100 K range ; keys can be long or string)
And Tasks can T1,T2,T3,TN - (Task are max 20 in number)
Conditions for task execution -
Tasks can be executed in parallel even for the same object.
But for the same object, for a given task, it should be executed in series.
Example , say I have
Objects on which are task performed are A,B,A
and tasks are t1, t2
So T1(A), T2(A) or T1(A) , T2(B) are possible , but T1(A) and T1(A) shouldnt be allowed
How can I ensure that , that my conditions are met. I know I have to use some sort of hashing.
I read about hashing , so my hash function can be of -
return ObjectIdentifier.getHashCode() + TaskIdentifier.getHashCode()
or other can be - a^3 + b^2 (where a and b are hashes of object identifier and task identifier respectively)
What would be best strategy, any suggestions
My task doesnt involve any IO, and as of now I am using one thread for each task.
So my current design is ok, or should I try to optimize it based on num of processors. (have fixed num of threads )
You can do a Parallel.ForEach on one of the lists, and a regular foreach on the other list, for example:
Parallel.ForEach (myListOfObjects, currentObject =>
{
foreach(var task in myListOfTasks)
{
task.DoSomething(currentObject);
}
});
I must say that I really like Rufus L's answer. You have to be smart about the things you parallelise and not over-encumber your implementation with excessive thread synchronisation and memory-intensive constructs - those things diminish the benefit of parallelisation. Given the large size of the item pool and the CPU-bound nature of the work, Parallel.ForEach with a sequential inner loop should provide very reasonable performance while keeping the implementation dead simple. It's a win.
Having said that, I have a pretty trivial LINQ-based tweak to Rufus' answer which addresses your other requirement (which is for the same object, for a given task, it should be executed in series). The solution works provided that the following assumptions hold:
The order in which the tasks are executed is not significant.
The work to be performed (all combinations of task x object) is known in advance and cannot change.
(Sorry for stating the obvious) The work which you want to parallelise can be parallelised - i.e. there are no shared resources / side-effects are completely isolated.
With those assumptions in mind, consider the following:
// Cartesian product of the two sets (*objects* and *tasks*).
var workItems = objects.SelectMany(
o => tasks.Select(t => new { Object = o, Task = t })
);
// Group *work items* and materialise *work item groups*.
var workItemGroups = workItems
.GroupBy(i => i, (key, items) => items.ToArray())
.ToArray();
Parallel.ForEach(workItemGroups, workItemGroup =>
{
// Execute non-unique *task* x *object*
// combinations sequentially.
foreach (var workItem in workItemGroup)
{
workItem.Task.Execute(workItem.Object);
}
});
Note that I am not limiting the degree of parallelism in Parallel.ForEach. Since all work is CPU-bound, it will work out the best number of threads on its own.