We would like to optionally control the number of "threads" on our parallel loops to avoid overwhelming a web service (for example).
Is it possible to specify a custom MaxDegreeOfParallelism on a Parallel.ForEach loop, but also to revert to the default value as required? Seemingly zero (0) is an invalid value for MaxDegreeOfParallelism, whereas I was hoping it could simply mean "ignore".
In other words, can you avoid writing this type of code?
int numParallelOperations = GetNumParallelOperations();
if (numParallelOperations > 0)
{
    ParallelOptions options = new ParallelOptions();
    options.MaxDegreeOfParallelism = numParallelOperations;
    Parallel.ForEach(items, options, i =>
    {
        Foo(i);
    });
}
else
{
    Parallel.ForEach(items, i =>
    {
        Foo(i);
    });
}
Do you mean -1 as per MSDN:
The MaxDegreeOfParallelism limits the number of concurrent operations
run by Parallel method calls that are passed this ParallelOptions
instance to the set value, if it is positive. If
MaxDegreeOfParallelism is -1, then there is no limit placed on the
number of concurrently running operations.
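Given that, the if/else from the question can collapse into a single call by mapping a non-positive configured value to -1. A minimal sketch, where GetNumParallelOperations, Foo, and items stand in for the names from the question:

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

class Program
{
    // Stand-ins for the names in the question.
    static int GetNumParallelOperations() => 0; // non-positive means "use the default"
    static int total;
    static void Foo(int i) => Interlocked.Add(ref total, i);

    static void Main()
    {
        var items = new List<int> { 1, 2, 3, 4 };
        int numParallelOperations = GetNumParallelOperations();

        var options = new ParallelOptions
        {
            // Map "not configured" to -1, which the documentation defines as "no limit".
            MaxDegreeOfParallelism = numParallelOperations > 0 ? numParallelOperations : -1
        };

        Parallel.ForEach(items, options, i => Foo(i));
        Console.WriteLine(total); // 10
    }
}
```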
You can control the approximate number of threads like this:

// use (roughly) one core:
int degreeOfParallelism = 1;

// leave (roughly) one core idle:
int degreeOfParallelism = Environment.ProcessorCount - 1;

// use (roughly) half of the cores:
int degreeOfParallelism = Environment.ProcessorCount > 1 ?
    Environment.ProcessorCount / 2 : 1;

// run at full speed:
int degreeOfParallelism = -1;

var options = new ParallelOptions();
options.MaxDegreeOfParallelism = degreeOfParallelism;
Parallel.For(0, x, options, y =>
//...
This may not be a definitive answer, as I cannot add a comment yet, having only just joined Stack Overflow. I don't believe it is possible to do what you are asking, but I do know that the MSDN documentation states that -1 is the value that allows an unlimited number of concurrent tasks in the ForEach. From my experience, it is best to let the CLR determine how many concurrent tasks to run unless you really know what you are doing. The Parallel library is high level; if you really needed this kind of control, you should be coding at a lower level and managing your own threads rather than leaving it to a TaskScheduler or ThreadPool, but that takes a lot of experimentation to get your own algorithms running effectively.
The only thing I can suggest is wrapping the Parallel.ForEach method to include the setting of ParallelOptions.MaxDegreeOfParallelism, which cuts down the duplicated code and lets you add an interface and test the asynchronous code in a synchronous manner.
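Such a wrapper might look like the sketch below (the helper name ForEachWithOptionalLimit is made up for illustration; it is not a framework API):

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

static class ParallelHelper
{
    // Limits parallelism only when maxDegree is positive;
    // otherwise falls back to -1, the documented "no limit" value.
    public static void ForEachWithOptionalLimit<T>(
        IEnumerable<T> source, int maxDegree, Action<T> body)
    {
        var options = new ParallelOptions
        {
            MaxDegreeOfParallelism = maxDegree > 0 ? maxDegree : -1
        };
        Parallel.ForEach(source, options, body);
    }
}

class Program
{
    static void Main()
    {
        int total = 0;
        // Passing 0 means "no explicit limit" with this wrapper.
        ParallelHelper.ForEachWithOptionalLimit(
            new[] { 1, 2, 3, 4 }, 0, i => Interlocked.Add(ref total, i));
        Console.WriteLine(total); // 10
    }
}
```

This keeps the single decision point (positive value or default) in one place instead of duplicating the loop body.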
Apologies for not providing a more positive response!
I have just made a sample for multithreading using This Link, like below:
Console.WriteLine("Number of Threads: {0}", System.Diagnostics.Process.GetCurrentProcess().Threads.Count);
int count = 0;
Parallel.For(0, 50000, (i, state) =>
{
    count++;
});
Console.WriteLine("Number of Threads: {0}", System.Diagnostics.Process.GetCurrentProcess().Threads.Count);
Console.ReadKey();
It gives me 15 threads before the Parallel.For and 17 threads after it, so only 2 threads are occupied by the Parallel.For.
Then I created another sample using This Link, like below:
var options = new ParallelOptions { MaxDegreeOfParallelism = Environment.ProcessorCount * 10 };
Console.WriteLine("MaxDegreeOfParallelism : {0}", Environment.ProcessorCount * 10);
Console.WriteLine("Number of Threads: {0}", System.Diagnostics.Process.GetCurrentProcess().Threads.Count);
int count = 0;
Parallel.For(0, 50000, options, (i, state) =>
{
    count++;
});
Console.WriteLine("Number of Threads: {0}", System.Diagnostics.Process.GetCurrentProcess().Threads.Count);
Console.ReadKey();
In the code above, I have set MaxDegreeOfParallelism to 40, but Parallel.For still uses the same number of threads.
So how can I increase the number of threads running in Parallel.For?
I am facing a problem where some numbers are skipped inside the Parallel.For when I perform heavy and complex work in it. So I want to increase the maximum number of threads and get around the skipping issue.
What you're saying is something like: "My car shakes when I drive too fast. I'm trying to avoid this by driving even faster." That doesn't make any sense. What you need is to fix the car, not change the speed.
How exactly to do that depends on what you are actually doing in the loop. The code you showed is obviously a placeholder, but even that is wrong. So I think what you should do first is learn about thread safety.
Using a lock is one option, and it's the easiest one to get correct. But it's also hard to make efficient: you need to hold the lock for only a short time in each iteration.
There are other options for achieving thread safety, including Interlocked, the overloads of Parallel.For that use thread-local data, and approaches other than Parallel.For(), like PLINQ or TPL Dataflow.
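As a sketch of the thread-local-data overload applied to the counting loop from the question (the iteration count is taken from that question):

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

class Program
{
    static void Main()
    {
        long count = 0;

        Parallel.For(0, 50000,
            () => 0L,                             // each thread starts its own subtotal
            (i, state, subtotal) => subtotal + 1, // no shared state touched per iteration
            subtotal => Interlocked.Add(ref count, subtotal)); // merge once per thread

        Console.WriteLine(count); // always 50000, unlike the unsynchronized count++
    }
}
```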
Only after you have made sure your code is thread safe is it time to worry about things like the number of threads. And regarding that, I think there are two things to note:
For CPU-bound computations, it doesn't make sense to use more threads than the number of cores your CPU has. Using more threads than that will actually usually lead to slower code, since switching between threads has some overhead.
I don't think you can measure the number of threads used by Parallel.For() like that. Parallel.For() uses the thread pool and it's quite possible that there already are some threads in the pool before the loop begins.
Parallel loops use hardware CPU cores. If your CPU has 2 cores, that is the maximum degree of parallelism you can get on your machine.
Taken from MSDN:
What to Expect
By default, the degree of parallelism (that is, how many iterations run at the same time in hardware) depends on the number of available cores. In typical scenarios, the more cores you have, the faster your loop executes, until you reach the point of diminishing returns that Amdahl's Law predicts. How much faster depends on the kind of work your loop does.
Further reading:
Threading vs Parallelism, how do they differ?
Threading vs. Parallel Processing
Parallel loops will give you the wrong result for summation operations without locks, because the result of each iteration depends on the single variable count, and the value of count in a parallel loop is not predictable. However, using locks in parallel loops does not achieve actual parallelism, so you should try something other than summation for testing a parallel loop.
I need to process several lines from a database (possibly millions) in parallel in C#. The processing is quite quick (50 or 150 ms per line), but I cannot know this speed before runtime, as it depends on hardware and network.
The ThreadPool, or the newer Task Parallel Library, seems to be what fits my needs, as I am new to threading and want the most efficient way to process the data.
However, these methods do not provide a way to control the execution speed of my tasks (lines per minute): I want to be able to set a maximum speed limit for the processing, or run it at full speed.
Please note that setting the number of threads of the ThreadPool/TaskFactory does not provide sufficient accuracy for my needs, as I would like to be able to set a speed limit below the speed of a single thread.
Using a custom scheduler for the TPL seems to be one way to do this, but I have not found a way to implement it.
Furthermore, I'm worried about the efficiency cost of such a setup.
Could you suggest a way or some advice on how to achieve this?
Thanks in advance for your answers.
The TPL provides a convenient programming abstraction on top of the Thread Pool. I would always select TPL when that is an option.
If you wish to throttle the total processing speed, there's nothing built-in that would support that.
You can measure the total processing speed as you proceed through the file and regulate speed by introducing (non-spinning) delays in each thread. The size of the delay can be dynamically adjusted in your code based on observed processing speed.
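One way to sketch that idea (the names and numbers below are illustrative; the TPL has no built-in throttle): each iteration compares the observed rate against the limit and sleeps off any surplus.

```csharp
using System;
using System.Diagnostics;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

class Program
{
    static void Main()
    {
        double maxLinesPerSecond = 1000; // illustrative speed limit
        int processed = 0;
        var clock = Stopwatch.StartNew();

        Parallel.ForEach(Enumerable.Range(0, 50), line =>
        {
            ProcessLine(line);
            int done = Interlocked.Increment(ref processed);

            // If we are ahead of the allowed rate, sleep off the difference.
            double allowedSeconds = done / maxLinesPerSecond;
            double aheadBy = allowedSeconds - clock.Elapsed.TotalSeconds;
            if (aheadBy > 0)
                Thread.Sleep(TimeSpan.FromSeconds(aheadBy));
        });

        Console.WriteLine($"Processed {processed} lines");
    }

    static void ProcessLine(int line)
    {
        // placeholder for the real per-line work
    }
}
```

Because the delay is computed from the measured rate, the same code runs at full speed when the limit is set very high.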
I am not seeing the advantage of limiting a speed, but I suggest you look into limiting the maximum degree of parallelism of the operation instead. That can be done via MaxDegreeOfParallelism in the ParallelOptions passed to Parallel.ForEach as the code works over the disparate lines of data. That way you can control the slots, for lack of a better term, which can be expanded or reduced depending on the criteria you are working under.
Here is an example using a ConcurrentBag to process lines of disparate data with 2 parallel tasks.
var myLines = new List<string> { "Alpha", "Beta", "Gamma", "Omega" };
var stringResult = new ConcurrentBag<string>();

ParallelOptions parallelOptions = new ParallelOptions();
parallelOptions.MaxDegreeOfParallelism = 2;

Parallel.ForEach(myLines, parallelOptions, line =>
{
    if (line.Contains("e"))
        stringResult.Add(line);
});

Console.WriteLine(string.Join(" | ", stringResult));
// Outputs "Beta | Omega" (order may vary)
Note that ParallelOptions also has a TaskScheduler property with which you can refine the processing further. Finally, for more control, maybe you want to cancel the processing when a specific threshold is reached? If so, look into the CancellationToken property to exit the process early.
In the parallel for loop below, is there a reliable way to determine how many threads will be created and with what index boundaries?
Parallel.For
(
    0,
    int.MaxValue,
    new ParallelOptions() { MaxDegreeOfParallelism = Environment.ProcessorCount },
    (i) =>
    {
        // Monitor [i] to see how the range is segmented.
    }
);
If the processor count on the target machine is 4 and we use all 4 processors, I observe that 4 segments are created, roughly equal in size, each being approximately int.MaxValue / 4. However, this is just observation, and Parallel.For may or may not offer deterministic segmentation.
Searching around did not help much either. Is it possible to predict or calculate this?
You can specify your own partitioner if you don't like the default behavior.
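For example, Partitioner.Create with an explicit range size makes the segment boundaries deterministic, regardless of how many threads service them (the range size below is arbitrary):

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

class Program
{
    static void Main()
    {
        // Explicit ranges of 250,000 indices each:
        // the boundaries no longer depend on the runtime's default partitioner.
        var ranges = Partitioner.Create(0, 1_000_000, 250_000);
        int segments = 0;

        Parallel.ForEach(ranges, range =>
        {
            Interlocked.Increment(ref segments);
            for (int i = range.Item1; i < range.Item2; i++)
            {
                // work on index i
            }
        });

        Console.WriteLine($"Processed {segments} segments"); // always 4 with these numbers
    }
}
```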
I have a program with two methods. The first method takes two arrays as parameters, and performs an operation in which values from one array are conditionally written into the other, like so:
void Blend(int[] dest, int[] src, int offset)
{
    for (int i = 0; i < src.Length; i++)
    {
        int rdr = dest[i + offset];
        dest[i + offset] = src[i] > rdr ? src[i] : rdr;
    }
}
The second method creates two separate sets of int arrays and iterates through them such that each array of one set is Blended with each array from the other set, like so:
void CrossBlend()
{
    int[][] set1 = new int[150][75000]; // we'll pretend this actually compiles
    int[][] set2 = new int[25][10000]; // we'll pretend this actually compiles
    for (int i1 = 0; i1 < set1.Length; i1++)
    {
        for (int i2 = 0; i2 < set2.Length; i2++)
        {
            Blend(set1[i1], set2[i2], 0); // or any offset, doesn't matter
        }
    }
}
First question: Since this approach is an obvious candidate for parallelization, is it intrinsically thread-safe? It seems like no, since I can conceive of a scenario (unlikely, I think) where one thread's changes are lost because of a different thread's near-simultaneous operation.
If no, would this:
void Blend(int[] dest, int[] src, int offset)
{
    lock (dest)
    {
        for (int i = 0; i < src.Length; i++)
        {
            int rdr = dest[i + offset];
            dest[i + offset] = src[i] > rdr ? src[i] : rdr;
        }
    }
}
be an effective fix?
Second question: If so, what would be the likely performance cost of using locks like this? I assume that with something like this, if a thread attempts to lock a destination array that is currently locked by another thread, the first thread would block until the lock was released instead of continuing to process something.
Also, how much time does it actually take to acquire a lock? Nanosecond scale, or worse than that? Would this be a major issue in something like this?
Third question: How would I best approach this problem in a multi-threaded way that would take advantage of multi-core processors (and this is based on the potentially wrong assumption that a multi-threaded solution would not speed up this operation on a single core processor)? I'm guessing that I would want to have one thread running per core, but I don't know if that's true.
The potential contention in CrossBlend is set1 - the destination of the blend. Rather than using a lock, which is comparatively expensive relative to the amount of work you are doing, arrange for each thread to work on its own destination. That is, a given destination (the array at some index in set1) is owned by a given task. This is possible since the outcome is independent of the order in which CrossBlend processes the arrays.
Each task should then run just the inner loop of CrossBlend, with the task parameterized by the index (or range of indices) of the dest array in set1 to use.
You could also parallelize the Blend method, since each index is computed independently of the others, so there is no contention there. But on today's machines, with fewer than 40 cores, you will get sufficient parallelism just by threading the CrossBlend method.
To run effectively on multiple cores you can either:
For N cores, divide the problem into N parts. Given that set1 is reasonably large compared to the number of cores, you could just divide set1 into N ranges and pass each range of indices into N threads running the inner CrossBlend loop. That will give you fairly good parallelism, but it's not optimal (some threads will finish sooner and end up with no work to do).
A more involved scheme is to make each iteration of the CrossBlend inner loop a separate task. Have N queues (for N cores) and distribute the tasks among the queues. Start N threads, with each thread reading its tasks from a queue. If a thread's queue becomes empty, it takes a task from some other thread's queue.
The second approach is best suited to irregularly sized tasks, or where the system is being used for other work, so some cores may be time-slicing between other processes, and you cannot expect equal amounts of work to complete in roughly the same time on different cores.
The first approach is much simpler to code, and will give you a good level of parallelism.
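The first approach can be sketched with Parallel.For over the set1 indices, so that each destination array is touched by exactly one thread. The array sizes are shrunk from the question's so the sample runs quickly, and the sequential sanity check at the end is only for illustration:

```csharp
using System;
using System.Linq;
using System.Threading.Tasks;

class Program
{
    static void Blend(int[] dest, int[] src, int offset)
    {
        for (int i = 0; i < src.Length; i++)
        {
            int rdr = dest[i + offset];
            dest[i + offset] = src[i] > rdr ? src[i] : rdr;
        }
    }

    static int[][] MakeSet(int count, int length, int seed)
    {
        var rng = new Random(seed);
        return Enumerable.Range(0, count)
            .Select(_ => Enumerable.Range(0, length).Select(__ => rng.Next(256)).ToArray())
            .ToArray();
    }

    static void Main()
    {
        int[][] set1 = MakeSet(150, 1000, 1);
        int[][] set2 = MakeSet(25, 1000, 2);

        // Each iteration owns exactly one destination array, so no locking is needed;
        // set2 is only ever read, which is safe to share between threads.
        Parallel.For(0, set1.Length, i1 =>
        {
            foreach (var src in set2)
                Blend(set1[i1], src, 0);
        });

        // Sanity check: the element-wise max is order-independent, so a sequential
        // run over a fresh copy must produce the same result.
        int[][] expected = MakeSet(150, 1000, 1);
        foreach (var src in set2)
            for (int i1 = 0; i1 < expected.Length; i1++)
                Blend(expected[i1], src, 0);

        bool match = set1.Zip(expected, (a, b) => a.SequenceEqual(b)).All(ok => ok);
        Console.WriteLine(match ? "Parallel result matches sequential result" : "MISMATCH");
    }
}
```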
I'm just looking into the new .NET 4.0 features. With that, I'm attempting a simple calculation using Parallel.For and a normal for(x;x;x) loop.
However, I'm getting different results about 50% of the time.
long sum = 0;
Parallel.For(1, 10000, y =>
{
    sum += y;
});
Console.WriteLine(sum.ToString());

sum = 0;
for (int y = 1; y < 10000; y++)
{
    sum += y;
}
Console.WriteLine(sum.ToString());
My guess is that the threads are trying to update "sum" at the same time.
Is there an obvious way around it?
You can't do this. sum is being shared across your parallel threads. You need to make sure that the sum variable is only accessed by one thread at a time:
// DON'T DO THIS!
Parallel.For(0, data.Count, i =>
{
    Interlocked.Add(ref sum, data[i]);
});
BUT... this is an anti-pattern: it effectively serializes the loop, because every thread contends on the Interlocked.Add.
What you need to do is add sub totals and merge them at the end like this:
Parallel.For<int>(0, result.Count, () => 0, (i, loop, subtotal) =>
{
    subtotal += result[i];
    return subtotal;
},
(x) => Interlocked.Add(ref sum, x));
You can find further discussion of this on MSDN: http://msdn.microsoft.com/en-us/library/dd460703.aspx
PLUG: You can find more on this in Chapter 2 of A Guide to Parallel Programming.
The following is also definitely worth a read...
Patterns for Parallel Programming: Understanding and Applying Parallel Patterns with the .NET Framework 4 - Stephen Toub
sum += y; is actually sum = sum + y;. You are getting incorrect results because of the following race condition:
Thread1 reads sum
Thread2 reads sum
Thread1 calculates sum+y1, and stores the result in sum
Thread2 calculates sum+y2, and stores the result in sum
sum is now equal to sum+y2, instead of sum+y1+y2.
Your surmise is correct.
When you write sum += y, the runtime does the following:
Read the field onto the stack
Add y to the stack
Write the result back to the field
If two threads read the field at the same time, the change made by the first thread will be overwritten by the second thread.
You need to use Interlocked.Add, which performs the addition as a single atomic operation.
Incrementing a long isn't an atomic operation.
I think it's important to point out that this loop, as written, is not capable of being partitioned for parallelism, because each iteration of the loop depends on the previous one. Parallel.For is designed for explicitly parallel tasks, such as pixel scaling, where no iteration of the loop has data dependencies outside its own iteration.
Parallel.For(0, input.Length, x =>
{
    output[x] = input[x] * scalingFactor;
});
The above is an example of code that allows easy partitioning for parallelism. A word of warning, though: parallelism comes with a cost. Even the loop I used as an example above is far too simple to bother parallelizing, because the setup time takes longer than the time saved.
An important point no-one seems to have mentioned: For data-parallel operations (such as the OP's), it is often better (in terms of both efficiency and simplicity) to use PLINQ instead of the Parallel class. The OP's code is actually trivial to parallelize:
long sum = Enumerable.Range(1, 10000).AsParallel().Sum();
The above snippet uses the ParallelEnumerable.Sum method, although one could also use Aggregate for more general scenarios. Refer to the Parallel Loops chapter for an explanation of these approaches.
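For comparison, here is a sketch of the same reduction written with the seeded Aggregate overload of PLINQ, which generalizes to reductions other than Sum:

```csharp
using System;
using System.Linq;

class Program
{
    static void Main()
    {
        long sum = Enumerable.Range(1, 10000)
            .AsParallel()
            .Aggregate(
                0L,                  // seed for each partition
                (acc, y) => acc + y, // fold within a partition
                (a, b) => a + b,     // combine partition results
                total => total);     // final projection

        Console.WriteLine(sum); // 50005000, same as the sequential loop
    }
}
```

The per-partition fold plays the same role as the subtotal in the Parallel.For<int> example above: shared state is only touched when partition results are combined.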
What if there are two variables in this code?
For example:
long sum1 = 0;
long sum2 = 0;
Parallel.For(1, 10000, y =>
{
    sum1 += y;
    sum2 = sum1 * y;
});
What should we do? I am guessing we have to use an array!