C# Splitting loop on multiple threads - c#

I have a task that essentially loops through a collection and does an operation on them in pairs (for int i = 0; i < limit; i+=2 etc.) And so, most suggestions I see on threading loops use some sort of foreach mechanism. But that seems a bit tricky to me, seeing as how I use this approach of operating in pairs.
So what I would want to do is essentially replace:
DoOperation(list.Take(numberToProcess));
with
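Thread lowerHalf = new Thread(() => DoOperation(list.Take(numberToProcess / 2)));
// GetRange takes (index, count), so the second argument is the size of the upper half.
Thread upperHalf = new Thread(() => DoOperation(list.GetRange(numberToProcess / 2, numberToProcess - numberToProcess / 2)));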
lowerHalf.Start();
upperHalf.Start();
And this seems to get the work done, but it's VERY slow. Every iteration is slower than the previous one, and when I debug, the Thread view shows a growing list of Threads.
But I was under the impression that Threads terminated themselves upon completion? And yes, the threads do complete. The DoOperation() method is pretty much just a for loop.
So what am I not understanding here?

Try Parallel.For. It will save a lot of work.
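Since the loop in the question works on pairs, one way to apply Parallel.For is to iterate over the pair index instead of the element index. The following is only a minimal sketch, assuming each pair can be processed independently of the others; PairLoop, ProcessPairs and Combine are hypothetical names standing in for whatever DoOperation() does.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class PairLoop
{
    // Hypothetical stand-in for the per-pair work done inside DoOperation().
    private static int Combine(int first, int second) => first + second;

    // Processes items (0,1), (2,3), ... in parallel and returns one result per pair.
    // A trailing unpaired item is skipped.
    public static int[] ProcessPairs(IReadOnlyList<int> items)
    {
        int pairCount = items.Count / 2;
        var results = new int[pairCount];

        // Iterate over pair indices; each iteration writes only to its own slot,
        // so no locking is needed.
        Parallel.For(0, pairCount, pair =>
        {
            int i = pair * 2;
            results[pair] = Combine(items[i], items[i + 1]);
        });

        return results;
    }
}
```

Parallel.For blocks until every iteration has finished and reuses thread-pool threads, so no Thread objects have to be created or tracked by hand.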

To explain pranitkothari's answer a little bit more and give a different example you can use
list.AsParallel().ForAll(delegate([ListContainingType] item) {
// do stuff to a single item here (whatever is done in DoOperation() in your code,
// except applied to a single item rather than several)
});
For instance, if I had a list of strings, it would be
List<String> list = new List<String>();
list.AsParallel().ForAll(delegate(String item) {
// do stuff to a single item here (whatever is done in DoOperation() in your code,
// except applied to a single item rather than several)
});
This will let you perform the operation on each item in the list in parallel, with the items spread across multiple threads. It's simpler in that it handles all the "multi-threadedness" for you.
This is a good post that explains one of the differences in them

Related

Parallel or async ASP.NET Core C#

I've googled this plenty but I'm afraid I don't fully understand the consequences of concurrency and parallelism.
I have about 3000 rows of database objects that each have an average of 2-4 logical data items attached to them that need to be validated as part of a search query, meaning the validation service needs to execute approx. 3*3000 times. E.g. if the user has filtered on color, then each row needs to validate the color and return the result. The loop cannot break when a match has been found, meaning all logical objects will always need to be evaluated (this is because relevance is calculated, not just whether there is a match).
This is done on-demand when the user selects various properties, meaning performance is key here.
I'm currently doing this by using Parallel.ForEach but wonder if it is smarter to use async behavior instead?
Current way
var validatorService = new LogicalGroupValidatorService();
ConcurrentBag<StandardSearchResult> results = new ConcurrentBag<StandardSearchResult>();
Parallel.ForEach(searchGroups, (group) =>
{
var searchGroupResult = validatorService.ValidateLogicGroupRecursivly(
propertySearchQuery, group.StandardPropertyLogicalGroup);
results.Add(new StandardSearchResult(searchGroupResult));
});
Async example code
var validatorService = new LogicalGroupValidatorService();
List<StandardSearchResult> results = new List<StandardSearchResult>();
var tasks = new List<Task<StandardPropertyLogicalGroupSearchResult>>();
foreach (var group in searchGroups)
{
tasks.Add(validatorService.ValidateLogicGroupRecursivlyAsync(
propertySearchQuery, group.StandardPropertyLogicalGroup));
}
await Task.WhenAll(tasks);
results = tasks.Select(logicalGroupResultTask =>
new StandardSearchResult(logicalGroupResultTask.Result)).ToList();
The difference between parallel and async is this:
Parallel: Spin up multiple threads and divide the work over each thread
Async: Do the work in a non-blocking manner.
Whether this makes a difference depends on what is blocking in the async version. If the work is CPU-bound, it is the CPU that is the bottleneck, so you will still end up using multiple threads. If it is I/O (or anything else that waits on something other than the CPU), the same thread can be reused while the work is pending.
For your particular example that means the following:
Parallel.ForEach => Partition the items over worker threads from the thread pool (the number of threads in use is managed by the runtime), so that items are processed concurrently on different threads
async/await => Start this bit of work, but let me continue execution in the meantime. Since you have many items, that means saying this many times. What happens next depends on the nature of the work:
If this bit of work is on the CPU, the effect is the same
Otherwise, you'll just use a single thread while the work is being done somewhere else
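The distinction above can be sketched side by side. This is only an illustration under stated assumptions: ValidateCpuBound and ValidateIoBoundAsync are hypothetical stand-ins for the validator service, and Task.Delay simulates an I/O wait such as a database call.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

public static class ParallelVsAsync
{
    // Hypothetical CPU-bound validation: pure computation, no waiting.
    private static int ValidateCpuBound(int value) => value * value;

    // Hypothetical I/O-bound validation: Task.Delay stands in for a database or network call.
    private static async Task<int> ValidateIoBoundAsync(int value)
    {
        await Task.Delay(10); // no thread is blocked while waiting
        return value * value;
    }

    // CPU-bound work: spread the items across cores with Parallel.ForEach.
    public static List<int> RunParallel(IEnumerable<int> inputs)
    {
        var results = new ConcurrentBag<int>();
        Parallel.ForEach(inputs, value => results.Add(ValidateCpuBound(value)));
        return results.OrderBy(r => r).ToList();
    }

    // I/O-bound work: start every operation, then await them all together.
    public static async Task<List<int>> RunAsync(IEnumerable<int> inputs)
    {
        int[] results = await Task.WhenAll(inputs.Select(ValidateIoBoundAsync));
        return results.OrderBy(r => r).ToList();
    }
}
```

If the validation is pure in-memory computation, as the question suggests, the Parallel.ForEach variant is the one that actually uses additional cores.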

What does the Parallel.Foreach do behind the scenes?

So I just can't grasp the concept here.
I have a method that uses the Parallel class with the ForEach method.
But the thing I don't understand is: does it create new threads so it can run the function faster?
Let's take this as an example.
I do a normal foreach loop.
private static void DoSimpleWork()
{
foreach (var item in collection)
{
//DoWork();
}
}
What that will do is, it will take the first item in the list, assign the method DoWork(); to it and wait until it finishes. Simple, plain and works.
Now.. There are three cases I am curious about
If I do this.
Parallel.ForEach(stringList, simpleString =>
{
DoMagic(simpleString);
});
Will that split up the Foreach into let's say 4 chunks?
So what I think is happening is that it takes the first 4 lines in the list, assigns each string to each "thread" (assuming parallel creates 4 virtual threads) does the work and then starts with the next 4 in that list?
If that is wrong please correct me I really want to understand how this works.
And then we have this.
Which essentially is the same but with a new parameter
Parallel.ForEach(stringList, new ParallelOptions() { MaxDegreeOfParallelism = 32 }, simpleString =>
{
DoMagic(simpleString);
});
What I am curious about is this
new ParallelOptions() { MaxDegreeOfParallelism = 32 }
Does that mean it will take the first 32 strings from that list (if there even is that many in the list) and then do the same thing as I was talking about above?
And for the last one.
Task.Factory.StartNew(() =>
{
Parallel.ForEach(stringList, simpleString =>
{
DoMagic(simpleString);
});
});
Would that create a new task, assigning each "chunk" to its own task?
Do not mix async code with parallel code. Task is for async operations: querying a DB, reading a file, or awaiting some comparatively cheap operation so that your UI isn't blocked and unresponsive.
Parallel is different. It's designed for 1) multi-core systems and 2) computationally intensive operations. I won't go into detail on how it works; that kind of information can be found in the MS documentation. Long story short, Parallel.For will most probably make its own decisions about what to run, when and how. It will not necessarily use everything your parameters allow; MaxDegreeOfParallelism, for example, is only an upper bound, so fewer threads may be used. The whole idea is to provide the best possible parallelization and thus complete your operation as fast as possible.
Parallel.ForEach performs the equivalent of a C# foreach loop, but with iterations able to execute in parallel instead of sequentially. There is no guaranteed ordering; an iteration runs whenever the scheduler can find an available thread for it.
MaxDegreeOfParallelism
By default, For and ForEach will utilize however many threads the underlying scheduler provides, so changing MaxDegreeOfParallelism from the default only limits how many concurrent tasks will be used.
You do not need to modify this parameter in general but may choose to change it in advanced scenarios:
When you know that a particular algorithm you're using won't scale beyond a certain number of cores. You can set the property to avoid wasting cycles on additional cores.
When you're running multiple algorithms concurrently and want to manually define how much of the system each algorithm can utilize.
When the thread pool's heuristics are unable to determine the right number of threads to use and could end up injecting too many threads. E.g. in long-running loop body iterations, the thread pool might not be able to tell the difference between reasonable progress and livelock or deadlock, and might not be able to reclaim threads that were added to improve performance. You can set the property to ensure that you don't use more than a reasonable number of threads.
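To illustrate that MaxDegreeOfParallelism is an upper bound rather than a batch size, here is a small sketch that records how many iterations are running at once. DegreeOfParallelismDemo and MeasurePeakConcurrency are hypothetical names, and Thread.Sleep merely stands in for real work.

```csharp
using System;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public static class DegreeOfParallelismDemo
{
    // Runs a loop with the given cap and returns the highest number of
    // iterations that were observed running at the same time.
    public static int MeasurePeakConcurrency(int itemCount, int maxDegree)
    {
        int current = 0;
        int peak = 0;
        var options = new ParallelOptions { MaxDegreeOfParallelism = maxDegree };

        Parallel.ForEach(Enumerable.Range(0, itemCount), options, _ =>
        {
            int now = Interlocked.Increment(ref current);

            // Record a new peak with a compare-and-swap retry loop.
            int seen;
            while (now > (seen = Volatile.Read(ref peak)) &&
                   Interlocked.CompareExchange(ref peak, now, seen) != seen)
            {
            }

            Thread.Sleep(5); // hypothetical stand-in for real work
            Interlocked.Decrement(ref current);
        });

        return peak;
    }
}
```

The measured peak never exceeds maxDegree, but it can be lower, and it may differ from run to run depending on how many threads the scheduler makes available.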
Task.Factory.StartNew is usually used when you require fine-grained control over a long-running, compute-bound task, and as @Сергей Боголюбов mentioned, do not mix the two up.
It creates a new task, and that task runs the Parallel.ForEach loop, which schedules its iterations on thread-pool threads.
You may find this ebook useful: http://www.albahari.com/threading/#_Introduction
does the work and then starts with the next 4 in that list?
This depends on your machine's hardware and how busy the machine's cores are with other processes/apps your CPU is working on
Does that mean it will take the first 32 strings from that list (if there even is that many in the list) and then do the same thing as I was talking about above?
No, there is no guarantee that it will take the first 32; it could be fewer. It will vary each time you execute the same code.
Task.Factory.StartNew creates a new task, but it will not create a new one for each chunk as you expect.
Putting a Parallel.ForEach inside a new Task will not help you further reduce the time taken for the parallel tasks themselves.

Hashing on multiple keys : for task execution In Multi threaded environment

I have certain objects on which certain tasks need to be performed. Every task needs to be performed on every object. I want to employ multiple threads, say N parallel threads.
Say I have object identifiers like A, B, C (objects can be in the 100K range; keys can be long or string)
And tasks can be T1, T2, T3, ..., TN (at most 20 tasks)
Conditions for task execution:
Different tasks can be executed in parallel, even for the same object.
But for the same object, a given task should be executed in series.
For example, say
the objects on which tasks are performed are A, B, A
and the tasks are T1, T2
So T1(A), T2(A) or T1(A), T2(B) may run concurrently, but T1(A) and T1(A) running concurrently shouldn't be allowed
How can I ensure that my conditions are met? I know I have to use some sort of hashing.
I read about hashing, so my hash function could be:
return ObjectIdentifier.getHashCode() + TaskIdentifier.getHashCode()
or other can be - a^3 + b^2 (where a and b are hashes of object identifier and task identifier respectively)
What would be the best strategy? Any suggestions?
My tasks don't involve any I/O, and as of now I am using one thread for each task.
So is my current design OK, or should I try to optimize it based on the number of processors (i.e. have a fixed number of threads)?
You can do a Parallel.ForEach on one of the lists, and a regular foreach on the other list, for example:
Parallel.ForEach (myListOfObjects, currentObject =>
{
foreach(var task in myListOfTasks)
{
task.DoSomething(currentObject);
}
});
I must say that I really like Rufus L's answer. You have to be smart about the things you parallelise and not over-encumber your implementation with excessive thread synchronisation and memory-intensive constructs - those things diminish the benefit of parallelisation. Given the large size of the item pool and the CPU-bound nature of the work, Parallel.ForEach with a sequential inner loop should provide very reasonable performance while keeping the implementation dead simple. It's a win.
Having said that, I have a pretty trivial LINQ-based tweak to Rufus' answer which addresses your other requirement (which is for the same object, for a given task, it should be executed in series). The solution works provided that the following assumptions hold:
The order in which the tasks are executed is not significant.
The work to be performed (all combinations of task x object) is known in advance and cannot change.
(Sorry for stating the obvious) The work which you want to parallelise can be parallelised - i.e. there are no shared resources / side-effects are completely isolated.
With those assumptions in mind, consider the following:
// Cartesian product of the two sets (*objects* and *tasks*).
var workItems = objects.SelectMany(
o => tasks.Select(t => new { Object = o, Task = t })
);
// Group *work items* and materialise *work item groups*.
var workItemGroups = workItems
.GroupBy(i => i, (key, items) => items.ToArray())
.ToArray();
Parallel.ForEach(workItemGroups, workItemGroup =>
{
// Execute non-unique *task* x *object*
// combinations sequentially.
foreach (var workItem in workItemGroup)
{
workItem.Task.Execute(workItem.Object);
}
});
Note that I am not limiting the degree of parallelism in Parallel.ForEach. Since all work is CPU-bound, it will work out the best number of threads on its own.
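For reference, here is a self-contained variant of the same grouping idea that can be run as is. It is a sketch under assumptions: objects and tasks are represented by plain strings, value tuples replace the anonymous type, and counting executions in a dictionary is a hypothetical stand-in for calling Task.Execute(Object). GroupedWork and Run are made-up names.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

public static class GroupedWork
{
    // Runs every (object, job) combination. Duplicate combinations land in the
    // same group and run one after another; different groups may run in parallel.
    // Returns how many times each combination was executed.
    public static IDictionary<(string Obj, string Job), int> Run(
        IEnumerable<string> objects, IEnumerable<string> jobs)
    {
        var executions = new ConcurrentDictionary<(string Obj, string Job), int>();

        var groups = objects
            .SelectMany(o => jobs.Select(j => (Obj: o, Job: j)))
            .GroupBy(item => item)      // value tuples compare by value
            .Select(g => g.ToArray())
            .ToArray();

        Parallel.ForEach(groups, group =>
        {
            // Items within one group are processed sequentially on one thread.
            foreach (var item in group)
            {
                // Hypothetical stand-in for executing the job on the object.
                executions.AddOrUpdate(item, 1, (_, count) => count + 1);
            }
        });

        return executions;
    }
}
```

With objects A, B, A and tasks t1, t2 this produces four groups; the two occurrences of (A, t1) share a group and therefore never overlap.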

Implement Parallel.Foreach inside a while loop

I have a scenario where I need to run Parallel.ForEach within a while loop. I need to understand the impact of this implementation in terms of how the processing will take place. I will have an implementation something like this:
ConcurrentQueue<MyTable> queue = new ConcurrentQueue<MyTable>();
Initially I add a lot of items to the queue, but more items can be added while it is being processed.
while(true)
{
Parallel.ForEach(queue, (myTable) => { /* some processing */ });
Thread.Sleep(someTime);
}
Each time, one item will be dequeued and a new thread will be spawned to work on it. In the meantime new items will be added, which is why I need to keep an infinite while loop.
Since the concurrent queue is thread-safe, I think each item will be processed only once despite the while loop around the ForEach, but I am not sure. What I also don't know is whether there will be multiple instances of the ForEach itself, each spawning child threads, or whether a single ForEach will be running within the while loop at any time. I do not know how ForEach itself is implemented.
I have a scenario where I need to run Parallel.ForEach within a while loop.
I don't think you do. You want to process new items as they come in in parallel, but I think this is not the best way to do that.
I think the best way is to use ActionBlock from TPL Dataflow. It won't waste CPU or threads when there are no items to process and if you set its MaxDegreeOfParallelism, it will process items in parallel:
ActionBlock<MyTable> actionBlock = new ActionBlock<MyTable>(
myTable => /* some processing */,
new ExecutionDataflowBlockOptions
{
MaxDegreeOfParallelism = DataflowBlockOptions.Unbounded
});
...
actionBlock.Post(someTable);
If you don't want to or can't use TPL Dataflow (it requires .NET 4.5 or later), another option would be to use a single Parallel.ForEach() (no while) together with BlockingCollection and GetConsumingPartitioner() (not GetConsumingEnumerable()!). Note that GetConsumingPartitioner() is not part of the framework itself; it is an extension method from the ParallelExtensionsExtras samples.
Using this, the Parallel.ForEach() threads will be blocked when there are no items to process, but there also won't be any delays in processing (like the ones caused by your Sleep()):
BlockingCollection<MyTable> queue = new BlockingCollection<MyTable>();
...
Parallel.ForEach(
queue.GetConsumingPartitioner(), myTable => /* some processing */);
...
queue.Add(someTable);
I think each item will be processed only once despite the while loop around the ForEach, but I am not sure
That's one reason why you should use one of the options above, since they mean you don't need to know much about the details of how they work, they just work.
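As a rough, runnable sketch of the BlockingCollection option: the code below swaps in the built-in Partitioner.Create with EnumerablePartitionerOptions.NoBuffering in place of GetConsumingPartitioner(), on the assumption that the latter extension method is not available in the project. QueueConsumer and ProduceAndConsume are hypothetical names, and counting items stands in for the real processing.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

public static class QueueConsumer
{
    // Starts a background consumer, feeds it itemCount items, then signals
    // completion and returns how many items were processed.
    public static int ProduceAndConsume(int itemCount)
    {
        using (var queue = new BlockingCollection<int>())
        {
            int processed = 0;

            // NoBuffering hands items to workers one at a time, so an item is
            // processed as soon as it arrives instead of waiting for a chunk to fill.
            var partitioner = Partitioner.Create(
                queue.GetConsumingEnumerable(),
                EnumerablePartitionerOptions.NoBuffering);

            Task consumer = Task.Run(() =>
            {
                Parallel.ForEach(partitioner, item =>
                {
                    Interlocked.Increment(ref processed); // stand-in for real processing
                });
            });

            for (int i = 0; i < itemCount; i++)
            {
                queue.Add(i); // items can be added while the loop is already running
            }

            queue.CompleteAdding(); // lets the consuming enumerable, and so the loop, finish
            consumer.Wait();
            return processed;
        }
    }
}
```

Each item is removed from the collection exactly once, so there is no need for an outer while loop or a Sleep() to pick up late arrivals.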

C#: Different ways to run this code asynchronously?

I have this code
List<string> myList = new List<string>();
myList.AddRange(new MyClass1().Load());
myList.AddRange(new MyClass2().Load());
myList.AddRange(new MyClass3().Load());
myList.DoSomethingWithValues();
What's the best way of running an arbitrary number of Load() methods asynchronously and then ensuring DoSomethingWithValues() runs when all asynchronous threads have completed (of course without incrementing a variable every time a callback happens and waiting for == 3)
My personal favorite would be:
List<string> myList = new List<string>();
var task1 = Task.Factory.StartNew( () => new MyClass1().Load() );
var task2 = Task.Factory.StartNew( () => new MyClass2().Load() );
var task3 = Task.Factory.StartNew( () => new MyClass3().Load() );
myList.AddRange(task1.Result);
myList.AddRange(task2.Result);
myList.AddRange(task3.Result);
myList.DoSomethingWithValues();
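For an arbitrary number of loaders, the same idea can be written with Task.WhenAll so that the final step runs only after every load has finished. This is a minimal sketch, not the answerer's code: LoadAll and LoadAllAsync are hypothetical names, and each loader is passed as a delegate standing in for new MyClassN().Load().

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

public static class LoadAll
{
    // Runs every loader on the thread pool and returns the combined results
    // once all of them have finished.
    public static async Task<List<string>> LoadAllAsync(
        IEnumerable<Func<IEnumerable<string>>> loaders)
    {
        // ToList() inside the task forces a lazy Load() to be evaluated in the
        // background; ToArray() starts all tasks before any result is awaited.
        Task<List<string>>[] tasks = loaders
            .Select(load => Task.Run(() => load().ToList()))
            .ToArray();

        // Results come back in the same order as the tasks.
        List<string>[] parts = await Task.WhenAll(tasks);
        return parts.SelectMany(part => part).ToList();
    }
}
```

A caller would await LoadAllAsync(...) and then call DoSomethingWithValues() on the returned list.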
How about PLINQ?
var loadables = new ILoadable[]
{ new MyClass1(), new MyClass2(), new MyClass3() };
var loadResults = loadables.AsParallel()
.SelectMany(l => l.Load());
myList.AddRange(loadResults);
myList.DoSomethingWithValues();
EDIT: Changed Select to SelectMany as pointed out by Reed Copsey.
Ani's conceptual solution can be written more concisely:
new ILoadable[] { new MyClass1(), new MyClass2(), new MyClass3() }
.AsParallel().SelectMany(o => o.Load()).ToList()
.DoSomethingWithValues();
That's my preferred solution: declarative (AsParallel) and concise.
Reed's solution, when written in this fashion, looks as follows:
new ILoadable[] { new MyClass1(), new MyClass2(), new MyClass3() }
.Select(o=>Task.Factory.StartNew(()=>o.Load().ToArray())).ToArray()
.SelectMany(t=>t.Result).ToList()
.DoSomethingWithValues();
Note that both ToArray calls may be necessary. The first call is necessary if o.Load is lazy (which in general it can be, though YMMV) to ensure evaluation of o.Load is completed inside the background task. The second call is necessary to ensure the list of tasks has been fully constructed before the call to SelectMany - if you don't do this, then SelectMany will attempt to iterate over its source only as necessary - i.e. it won't iterate to the second task before it has to, and that's not until the first task's Result has been computed. Effectively, you're starting tasks but then lazily executing them one after the other - turning background tasks back into a strictly sequential execution.
Note that the second, less declarative solution has many more pitfalls and requires a much more thorough analysis to be sure it's correct - i.e., this is less maintainable, though still miles better than manual threading. Incidentally, you may be able to get away with leaving out the calls to .ToList - that depends on the details of DoSomethingWithValues - for even better performance, whereby your final processing can access the first values as they trickle in without needing to wait for all tasks or parallel enumerables to complete. And that's even shorter to boot!
Unless there's a compelling reason to try to run them all at once, I'd suggest you just run them all in a single asynchronous method.
A compelling reason might be heavy disk/database I/O, where running more than one background thread would actually allow the loads to proceed simultaneously. If most of the initialization is actually code logic, you might find that multiple threads result in slower performance.
