For and Parallel.For efficiency - C#

I have a question about the difference between the following two pieces of code. I will show the version with a normal for loop first.
The Guesser class constructor shuffles the character array, and startBruteForce permutes it into keys of the matching length until one equals the private password. For every new key I added Console.Write(key); and the output was different on each run.
DateTime timeStarted = DateTime.Now;
Console.WriteLine("Start BruteForce - {0}", timeStarted.ToString());
for (int i = 0; i < Environment.ProcessorCount; i++)
{
Guesser guesser = new Guesser();
Thread thread = new Thread(guesser.startBruteForce);
thread.Start();
}
while (!Guesser.isMatched) ;
Console.WriteLine("Time passed: {0}s", DateTime.Now.Subtract(timeStarted).TotalSeconds);
And here is the Parallel.For version:
DateTime timeStarted = DateTime.Now;
Console.WriteLine("Start BruteForce - {0}", timeStarted.ToString());
Parallel.For(0, Environment.ProcessorCount, (i) =>
{
Guesser guesser = new Guesser();
guesser.startBruteForce();
});
while (!Guesser.isMatched) ;
Console.WriteLine("Time passed: {0}s", DateTime.Now.Subtract(timeStarted).TotalSeconds);
With Parallel.For the output was something like: a, b, c, d, e, f ... ab, ac, ad ...
while with the normal for loop it was much more shuffled, for example: b, r, 1, a, k, b, ed.
It looks like Parallel.For created only one thread, and the normal for loop performs the brute-force algorithm faster than the Parallel.For version.
My question is why this is happening.
Another thing that interests me: why, if I narrow the number of iterations to Environment.ProcessorCount / 2, do both versions run faster than with the unchanged Environment.ProcessorCount?
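One way to check how the work is actually distributed (a diagnostic sketch, separate from the Guesser code above) is to print the managed thread ID from inside each Parallel.For iteration:

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

class ThreadCheck
{
    static void Main()
    {
        // Each iteration reports which thread ran it.
        Parallel.For(0, Environment.ProcessorCount, i =>
        {
            Console.WriteLine("Iteration {0} on thread {1}",
                i, Thread.CurrentThread.ManagedThreadId);
            Thread.Sleep(100); // simulate some work so the threads overlap
        });
    }
}
```

Note also that Parallel.For does not return until every iteration has finished, so in the Parallel.For version the `while (!Guesser.isMatched) ;` spin never actually spins; in the manual-thread version, by contrast, the for loop returns immediately after starting the threads and the spin does the waiting.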

Related

How can I determine whether a parallel foreach loop is going to have better performance than a foreach loop?

I just did a simple test in .NET Fiddle: sorting 100 random integer arrays of length 1000 and seeing whether doing so with a Parallel.ForEach loop is faster than with a plain old foreach loop.
Here is my code (I put this together fast, so please ignore the repetition and overall bad look of the code):
using System;
using System.Net;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;
using System.Linq;
public class Program
{
public static int[] RandomArray(int minval, int maxval, int arrsize)
{
Random randNum = new Random();
int[] rand = Enumerable
.Repeat(0, arrsize)
.Select(i => randNum.Next(minval, maxval))
.ToArray();
return rand;
}
public static void SortOneThousandArraysSync() // despite the name, this is the Parallel.ForEach version
{
var arrs = new List<int[]>(100);
for(int i = 0; i < 100; ++i)
arrs.Add(RandomArray(Int32.MinValue,Int32.MaxValue,1000));
Parallel.ForEach(arrs, (arr) =>
{
Array.Sort(arr);
});
}
public static void SortOneThousandArraysAsync() // despite the name, this is the plain foreach version
{
var arrs = new List<int[]>(100);
for(int i = 0; i < 100; ++i)
arrs.Add(RandomArray(Int32.MinValue,Int32.MaxValue,1000));
foreach(var arr in arrs)
{
Array.Sort(arr);
};
}
public static void Main()
{
var start = DateTime.Now;
SortOneThousandArraysSync();
var end = DateTime.Now;
Console.WriteLine("t1 = " + (end - start).ToString());
start = DateTime.Now;
SortOneThousandArraysAsync();
end = DateTime.Now;
Console.WriteLine("t2 = " + (end - start).ToString());
}
}
and here are the results after hitting Run twice:
t1 = 00:00:00.0156244
t2 = 00:00:00.0156243
...
t1 = 00:00:00.0467854
t2 = 00:00:00.0156246
...
So, sometimes it's faster and sometimes it's about the same.
Possible explanations:
The random arrays were "more unsorted" for the sync one versus the async one in the 2nd test I ran
It has something to do with the processes running on .NET Fiddle. In the first case the parallel one basically ran like a non-parallel operation because there weren't any threads for my fiddle to take over. (Or something like that)
Thoughts?
You should only use Parallel.ForEach() if the code within the loop takes a significant amount of time to execute. In your case, it takes more time to create multiple threads, sort the arrays, and then combine the results on one thread than to simply sort them on a single thread. For example, the Parallel.ForEach() in the following code snippet takes less time to execute than the normal foreach loop:
public static void Main(string[] args)
{
var numbers = Enumerable.Range(1, 10000);
Parallel.ForEach(numbers, n => Factorial(n));
foreach (var number in numbers)
{
Factorial(number);
}
}
private static int Factorial(int number)
{
if (number == 1 || number == 0)
return 1;
return number * Factorial(number - 1);
}
However, if I change var numbers = Enumerable.Range(1, 10000); to var numbers = Enumerable.Range(1, 1000);, the foreach loop is faster than Parallel.ForEach().
When working with small tasks (which don't take a significant amount of time to execute), take a look at the Partitioner class (in System.Collections.Concurrent); in your case:
public static void SortOneThousandArraysAsyncWithPart() {
var arrs = new List<int[]>(100);
for (int i = 0; i < 100; ++i)
arrs.Add(RandomArray(Int32.MinValue, Int32.MaxValue, 1000));
// Let's spread the tasks between threads manually with a help of Partitioner.
// We don't want task stealing and other optimizations: just split the
// list between 8 (on my workstation) threads and run them
Parallel.ForEach(Partitioner.Create(0, 100), part => {
for (int i = part.Item1; i < part.Item2; ++i)
Array.Sort(arrs[i]);
});
}
I get the following results (i7 3.2GHz 4 cores HT, .Net 4.6 IA-64) - averaged by 100 runs:
0.0081 Async (foreach)
0.0119 Parallel.ForEach
0.0084 Parallel.ForEach + Partitioner
As you can see, foreach is still on top, but Parallel.ForEach + Partitioner comes very close to the winner.
Checking performance of algorithms is a tricky business, and performance at small scale can easily be affected by a variety of factors external to your code. Please see my answer to an almost-duplicate question here for an in-depth explanation, plus some links to benchmarking templates that you can adapt to better measure your algorithm's performance.
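As a minimal illustration of that advice (a sketch, not one of the linked templates): warm the code up first so JIT compilation doesn't land inside the measurement, then average many timed runs with Stopwatch:

```csharp
using System;
using System.Diagnostics;

class Bench
{
    static double MeasureMs(Action action, int warmups, int runs)
    {
        for (int i = 0; i < warmups; i++) action(); // JIT + cache warm-up
        var sw = Stopwatch.StartNew();
        for (int i = 0; i < runs; i++) action();
        sw.Stop();
        return sw.Elapsed.TotalMilliseconds / runs; // average per run
    }

    static void Main()
    {
        double ms = MeasureMs(() => { /* code under test */ }, warmups: 3, runs: 100);
        Console.WriteLine("{0:F4} ms per run", ms);
    }
}
```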

Parallel.ForEach slower than normal foreach

I'm playing around with Parallel.ForEach in a C# console application, but can't seem to get it right. I create an array with random numbers, and I have a sequential foreach and a Parallel.ForEach that find the largest value in the array. With approximately the same code in C++ I started to see a payoff from using several threads at around 3M values in the array. But here the Parallel.ForEach is twice as slow even at 100M values. What am I doing wrong?
class Program
{
static void Main(string[] args)
{
dostuff();
}
static void dostuff() {
Console.WriteLine("How large do you want the array to be?");
int size = int.Parse(Console.ReadLine());
int[] arr = new int[size];
Random rand = new Random();
for (int i = 0; i < size; i++)
{
arr[i] = rand.Next(0, int.MaxValue);
}
var watchSeq = System.Diagnostics.Stopwatch.StartNew();
var largestSeq = FindLargestSequentially(arr);
watchSeq.Stop();
var elapsedSeq = watchSeq.ElapsedMilliseconds;
Console.WriteLine("Finished sequential in: " + elapsedSeq + "ms. Largest = " + largestSeq);
var watchPar = System.Diagnostics.Stopwatch.StartNew();
var largestPar = FindLargestParallel(arr);
watchPar.Stop();
var elapsedPar = watchPar.ElapsedMilliseconds;
Console.WriteLine("Finished parallel in: " + elapsedPar + "ms Largest = " + largestPar);
dostuff();
}
static int FindLargestSequentially(int[] arr) {
int largest = arr[0];
foreach (int i in arr) {
if (largest < i) {
largest = i;
}
}
return largest;
}
static int FindLargestParallel(int[] arr) {
int largest = arr[0];
Parallel.ForEach<int, int>(arr, () => 0, (i, loop, subtotal) =>
{
if (i > subtotal)
subtotal = i;
return subtotal;
},
(finalResult) => {
Console.WriteLine("Thread finished with result: " + finalResult);
if (largest < finalResult) largest = finalResult;
}
);
return largest;
}
}
It's the performance ramification of having a very small delegate body.
We can achieve better performance using partitioning. In that case the body delegate performs work on a large chunk of data per invocation:
static int FindLargestParallelRange(int[] arr)
{
object locker = new object();
int largest = arr[0];
Parallel.ForEach(Partitioner.Create(0, arr.Length), () => arr[0], (range, loop, subtotal) =>
{
for (int i = range.Item1; i < range.Item2; i++)
if (arr[i] > subtotal)
subtotal = arr[i];
return subtotal;
},
(finalResult) =>
{
lock (locker)
if (largest < finalResult)
largest = finalResult;
});
return largest;
}
Pay attention to synchronizing the localFinally delegate. Also note the need for proper initialization of the localInit: () => arr[0] instead of () => 0.
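To see why the localInit matters, consider a hypothetical all-negative array (illustrative data, not from the question). Seeding each task's subtotal with 0 produces a "maximum" that isn't even in the array:

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

class LocalInitDemo
{
    static int FindLargest(int[] arr, Func<int> localInit)
    {
        object locker = new object();
        int largest = int.MinValue;
        Parallel.ForEach(Partitioner.Create(0, arr.Length), localInit,
            (range, loop, subtotal) =>
            {
                for (int i = range.Item1; i < range.Item2; i++)
                    if (arr[i] > subtotal) subtotal = arr[i];
                return subtotal;
            },
            finalResult =>
            {
                lock (locker)
                    if (finalResult > largest) largest = finalResult;
            });
        return largest;
    }

    static void Main()
    {
        int[] arr = { -5, -3, -8 };
        Console.WriteLine(FindLargest(arr, () => 0));      // prints 0, which is wrong
        Console.WriteLine(FindLargest(arr, () => arr[0])); // prints -3, correct
    }
}
```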
Partitioning with PLINQ:
static int FindLargestPlinqRange(int[] arr)
{
return Partitioner.Create(0, arr.Length)
.AsParallel()
.Select(range =>
{
int largest = arr[0];
for (int i = range.Item1; i < range.Item2; i++)
if (arr[i] > largest)
largest = arr[i];
return largest;
})
.Max();
}
I highly recommend the free book Patterns of Parallel Programming by Stephen Toub.
As the other answerers have mentioned, the action you're trying to perform against each item here is so insignificant that there are a variety of other factors which end up carrying more weight than the actual work you're doing. These may include:
JIT optimizations
CPU branch prediction
I/O (outputting thread results while the timer is running)
the cost of invoking delegates
the cost of task management
the system incorrectly guessing what thread strategy will be optimal
memory/cpu caching
memory pressure
environment (debugging)
etc.
Running each approach a single time is not an adequate way to test, because it enables a number of the above factors to weigh more heavily on one iteration than on another. You should start with a more robust benchmarking strategy.
Furthermore, your implementation is actually dangerously incorrect. The documentation specifically says:
The localFinally delegate is invoked once per task to perform a final action on each task’s local state. This delegate might be invoked concurrently on multiple tasks; therefore, you must synchronize access to any shared variables.
You have not synchronized your final delegate, so your function is prone to race conditions that would make it produce incorrect results.
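For completeness, here is a sketch of one way to synchronize that shared maximum without a lock (an alternative to the lock shown elsewhere in this thread, not the answerer's own code): a compare-and-swap retry loop.

```csharp
using System.Threading;

static class MaxUpdater
{
    // Lock-free update of a shared maximum; safe to call from a
    // localFinally delegate that may run concurrently on many tasks.
    public static void UpdateMax(ref int sharedMax, int candidate)
    {
        int current = Volatile.Read(ref sharedMax);
        while (candidate > current)
        {
            // Swap in candidate only if sharedMax is still 'current';
            // otherwise another thread won the race, so retry against
            // the value it wrote.
            int witnessed = Interlocked.CompareExchange(ref sharedMax, candidate, current);
            if (witnessed == current) break;
            current = witnessed;
        }
    }
}
```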
As in most cases, the best approach to this one is to take advantage of work done by people smarter than we are. In my testing, the following approach appears to be the fastest overall:
return arr.AsParallel().Max();
The Parallel.ForEach loop should be expected to run slower, because the algorithm used is not parallel-friendly and a lot more work is being done to run it.
In the single-threaded version, to find the max value we take the first number as our max and compare it to every other number in the array. If a number is larger than our current max, we replace it and continue. This way we access each number in the array once, for a total of N comparisons.
In the Parallel loop above, the algorithm creates overhead because each operation is wrapped inside a function call with a return value. So in addition to doing the comparisons, it pays the overhead of pushing and popping these calls on the call stack. In addition, since each call depends on the value of the call before it, it needs to run in sequence.
In the Parallel.For loop below, the array is divided among an explicit number of threads determined by the variable threadNumber. This limits the overhead of function calls to a low number.
Note that for low values the parallel loop performs more slowly. However, for 100M values there is a decrease in elapsed time.
static int FindLargestParallel(int[] arr)
{
var answers = new ConcurrentBag<int>();
int threadNumber = 4;
int partitionSize = arr.Length/threadNumber;
Parallel.For(0, /* starting number */
threadNumber+1, /* Adding 1 to threadNumber in case array.Length not evenly divisible by threadNumber */
i =>
{
if (i*partitionSize < arr.Length) /* check in case # in array is divisible by # threads */
{
var max = arr[i*partitionSize];
for (var x = i*partitionSize;
x < (i + 1)*partitionSize && x < arr.Length;
++x)
{
if (arr[x] > max)
max = arr[x];
}
answers.Add(max);
}
});
/* note the shortcut in finding max in the bag */
return answers.Max(i=>i);
}
Some thoughts here: In the parallel case there is thread-management logic involved that determines how many threads to use. This thread-management logic presumably runs on your main thread. Every time a thread returns with a new maximum value, the management logic kicks in and determines the next work item (the next number to process in your array). I'm pretty sure this requires some kind of locking. In any case, determining the next item may even cost more than performing the comparison itself.
That sounds like an order of magnitude more work (overhead) to me than a single thread that processes one number after another. In the single-threaded case there are a number of optimizations at play: no bounds checks, the CPU can load data into its first-level cache, etc. I'm not sure which of these optimizations apply in the parallel case.
Keep in mind that on a typical desktop machine there are only 2 to 4 physical CPU cores available, so you will never have more than that actually doing work. So if the parallel-processing overhead is more than 2-4 times that of the single-threaded operation, the parallel version will inevitably be slower, which is what you are observing.
Have you attempted to run this on a 32 core machine? ;-)
A better solution would be to determine non-overlapping ranges (start + stop index) covering the entire array and let each parallel task process one range. That way each task can run a tight single-threaded loop internally and only return once its entire range has been processed. You could probably even determine a near-optimal number of ranges based on the number of logical cores of the machine. I haven't tried this, but I'm pretty sure you would see an improvement over the single-threaded case.
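A sketch of that range-splitting idea (names are illustrative; it assumes a non-empty array):

```csharp
using System;
using System.Linq;
using System.Threading.Tasks;

static class RangeMax
{
    public static int FindLargestByRanges(int[] arr)
    {
        int parts = Environment.ProcessorCount;
        int chunk = (arr.Length + parts - 1) / parts; // ceiling division
        int[] partial = Enumerable.Repeat(int.MinValue, parts).ToArray();

        Parallel.For(0, parts, p =>
        {
            int start = p * chunk;
            int end = Math.Min(start + chunk, arr.Length);
            int max = int.MinValue;
            for (int i = start; i < end; i++) // tight loop, no delegate per element
                if (arr[i] > max) max = arr[i];
            partial[p] = max; // each task writes only its own slot, no locking needed
        });
        return partial.Max();
    }
}
```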
Try splitting the set into batches and running the batches in parallel, where the number of batches corresponds to your number of CPU cores.
I ran some equations 1K, 10K and 1M times using the following methods:
A "for" loop.
A "Parallel.For" from the System.Threading.Tasks lib, across the entire set.
A "Parallel.For" across 4 batches.
A "Parallel.ForEach" from the System.Threading.Tasks lib, across the entire set.
A "Parallel.ForEach" across 4 batches.
Results: (Measured in seconds)
Conclusion:
Processing batches in parallel using the "Parallel.ForEach" has the best outcome in cases above 10K records. I believe the batching helps because it utilizes all CPU cores (4 in this example), but also minimizes the amount of threading overhead associated with parallelization.
Here is my code:
public void ParallelSpeedTest()
{
var rnd = new Random(56);
int range = 1000000;
int numberOfCores = 4;
int batchSize = range / numberOfCores;
int[] rangeIndexes = Enumerable.Range(0, range).ToArray();
double[] inputs = rangeIndexes.Select(n => rnd.NextDouble()).ToArray();
double[] weights = rangeIndexes.Select(n => rnd.NextDouble()).ToArray();
double[] outputs = new double[rangeIndexes.Length];
/// Series "for"...
var startTimeSeries = DateTime.Now;
for (var i = 0; i < range; i++)
{
outputs[i] = Math.Sqrt(Math.Pow(inputs[i] * weights[i], 2));
}
var durationSeries = DateTime.Now - startTimeSeries;
/// "Parallel.For"...
var startTimeParallel = DateTime.Now;
Parallel.For(0, range, (i) => {
outputs[i] = Math.Sqrt(Math.Pow(inputs[i] * weights[i], 2));
});
var durationParallelFor = DateTime.Now - startTimeParallel;
/// "Parallel.For" in Batches...
var startTimeParallel2 = DateTime.Now;
Parallel.For(0, numberOfCores, (c) => {
var endValue = (c == numberOfCores - 1) ? range : (c + 1) * batchSize;
var startValue = c * batchSize;
for (var i = startValue; i < endValue; i++)
{
outputs[i] = Math.Sqrt(Math.Pow(inputs[i] * weights[i], 2));
}
});
var durationParallelForBatches = DateTime.Now - startTimeParallel2;
/// "Parallel.ForEach"...
var startTimeParallelForEach = DateTime.Now;
Parallel.ForEach(rangeIndexes, (i) => {
outputs[i] = Math.Sqrt(Math.Pow(inputs[i] * weights[i], 2));
});
var durationParallelForEach = DateTime.Now - startTimeParallelForEach;
/// Parallel.ForEach in Batches...
List<Tuple<int,int>> ranges = new List<Tuple<int, int>>();
for (var i = 0; i < numberOfCores; i++)
{
int start = i * batchSize;
int end = (i == numberOfCores - 1) ? range : (i + 1) * batchSize;
ranges.Add(new Tuple<int,int>(start, end));
}
var startTimeParallelBatches = DateTime.Now;
Parallel.ForEach(ranges, (range) => {
for(var i = range.Item1; i < range.Item2; i++) { // fixed: the upper bound must be range.Item2, not range.Item1 (which made this loop a no-op)
outputs[i] = Math.Sqrt(Math.Pow(inputs[i] * weights[i], 2));
}
});
var durationParallelForEachBatches = DateTime.Now - startTimeParallelBatches;
Debug.Print($"=================================================================");
Debug.Print($"Given: Set-size: {range}, number-of-batches: {numberOfCores}, batch-size: {batchSize}");
Debug.Print($".................................................................");
Debug.Print($"Series For: {durationSeries}");
Debug.Print($"Parallel For: {durationParallelFor}");
Debug.Print($"Parallel For Batches: {durationParallelForBatches}");
Debug.Print($"Parallel ForEach: {durationParallelForEach}");
Debug.Print($"Parallel ForEach Batches: {durationParallelForEachBatches}");
Debug.Print($"");
}

How to obtain a thread ID in the range (0, MaxDegreeOfParallelism) in a Parallel.For

I want to assign an index in the range (0 to MaxDegreeOfParallelism) to each incoming thread in a Parallel.For. I have tried using Thread.CurrentThread.ManagedThreadId and Thread.CurrentThread.Name, but they are not in a given range, and I think they sometimes change values (is this right?). What I am using now is a ConcurrentStack in the following way, which works but I think is making my loop horribly slow. Any idea how I could solve this problem? Thanks!!
int nCPU = 16; //number of CORES
ConcurrentStack<int> cs = new ConcurrentStack<int>();
for (int i = 0; i < nCPU; i++) { cs.Push(i); }
options.MaxDegreeOfParallelism = nCPU - 1;
Parallel.For(0, 10000, options, tt =>
{
int threadIndex;
cs.TryPop(out threadIndex); //assign an index to the incoming thread
//..... do stuff with all my variables matrix1[threadIndex][i][j], matrix2.......//
cs.Push(threadIndex); //the thread leaves the previously assigned index
});
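An alternative sketch that avoids the per-iteration ConcurrentStack traffic: use the localInit overload of Parallel.For so each task claims one index up front via Interlocked.Increment and keeps it for all of its iterations (variable names are illustrative):

```csharp
using System.Threading;
using System.Threading.Tasks;

int nCPU = 16;
var options = new ParallelOptions { MaxDegreeOfParallelism = nCPU - 1 };
int nextIndex = -1;

Parallel.For(0, 10000, options,
    () => Interlocked.Increment(ref nextIndex), // once per task: claim an index
    (tt, loop, threadIndex) =>
    {
        // ... work with matrix1[threadIndex][i][j], matrix2, ... here ...
        return threadIndex; // the task keeps its index across iterations
    },
    threadIndex => { } // nothing to release
);
```

One caveat: the TPL may retire a worker task and start a fresh one during a long loop, and each fresh task claims a new index, so the counter can in principle exceed MaxDegreeOfParallelism - 1. The ConcurrentStack version doesn't have that problem, because indices are returned after every iteration.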

Passing data to a thread using Lambda expressions

for (int i = 0; i < 10; i++)
new Thread (() => Console.Write (i)).Start();
As expected, the output of the above code is non-deterministic, because the variable i refers to the same memory location throughout the loop's lifetime. Therefore, each thread calls Console.Write on a variable whose value may change while it is running.
However,
for (int i = 0; i < 10; i++)
{
int temp = i;
new Thread (() => Console.Write (temp)).Start();
}
is also giving non-deterministic output! I thought the variable temp was local to each loop iteration, so each thread would capture a different memory location and there should have been no problem.
Your program should have 10 lambdas, each writing one of the digits from 0 to 9 to the console. However, there's no guarantee that the threads will execute in order.
Is also giving non-deterministic output!
No, it's not. I checked your first snippet ten times (it had repeating numbers) and the second one (it did not).
So it all works fine, just as it should.
The second code snippet should be deterministic in the sense that each thread eventually writes its own temp, and all the temps will differ.
However, there is no guarantee that the threads will be scheduled in the order of their creation. You'll see all possible temps, but not necessarily in ascending order.
Here is a test that examines the second piece of the OP's code and demonstrates that it works as expected. Note that "non-deterministic" would only be a problem if the threads received the wrong parameter; the output order is never guaranteed.
I store the pair (thread identity, parameter) and then print it, to compare against the thread output and prove the pairs aren't changed. I also added a random sleep of a few hundred milliseconds, so the for index has obviously changed by the time each thread runs.
Dictionary<int, int> hash = new Dictionary<int, int>();
Random r = new Random(DateTime.Now.Millisecond);
for (int i = 0; i < 10; i++)
{
int temp = i;
var th = new Thread(() =>
{
Thread.Sleep(r.Next(9) * 100);
Console.WriteLine("{0} {1}",
Thread.CurrentThread.GetHashCode(), temp);
});
hash.Add(th.GetHashCode(), temp);
th.Start();
}
Thread.Sleep(1000);
Console.WriteLine();
foreach (var kvp in hash)
Console.WriteLine("{0} {1}", kvp.Key, kvp.Value);

Trying to understand multi-threading in C#

I'm trying to understand the basics of multi-threading, so I built a little program that raised a few questions, and I'll be thankful for any help :)
Here is the little program:
class Program
{
public static int count;
public static int max;
static void Main(string[] args)
{
int t = 0;
DateTime Result;
Console.WriteLine("Enter Max Number : ");
max = int.Parse(Console.ReadLine());
Console.WriteLine("Enter Thread Number : ");
t = int.Parse(Console.ReadLine());
count = 0;
Result = DateTime.Now;
List<Thread> MyThreads = new List<Thread>();
for (int i = 1; i < 31; i++)
{
Thread Temp = new Thread(print);
Temp.Name = i.ToString();
MyThreads.Add(Temp);
}
foreach (Thread th in MyThreads)
th.Start();
while (count < max)
{
}
Console.WriteLine("Finish , Took : " + (DateTime.Now - Result).ToString() + " With : " + t + " Threads.");
Console.ReadLine();
}
public static void print()
{
while (count < max)
{
Console.WriteLine(Thread.CurrentThread.Name + " - " + count.ToString());
count++;
}
}
}
I checked this with some test runs:
I made the maximum number 100, and it seems that the fastest execution time is with 2 threads, which is 80% faster than with 10 threads.
Questions:
1) Threads 4-10 don't print even once. How can that be?
2) Shouldn't more threads be faster?
I made the maximum number 10000 and disabled printing. With this configuration, 5 threads seems to be the fastest.
Why is there a change compared to the first check?
Also, when I run this configuration with printing enabled, all the threads print a few times. Why is that different from the first run, where only a few threads printed?
Is there a way to make all the threads print one by one, in a line or something like that?
Thank you very much for your help :)
Your code is certainly a first step into the world of threading, and you've just experienced the first (of many) headaches!
To start with, static may enable you to share a variable among the threads, but it does not do so in a thread safe manner. This means your count < max expression and count++ are not guaranteed to be up to date or an effective guard between threads. Look at the output of your program when max is only 10 (t set to 4, on my 8 processor workstation):
T0 - 0
T0 - 1
T0 - 2
T0 - 3
T1 - 0 // wait T1 got count = 0 too!
T2 - 1 // and T2 got count = 1 too!
T2 - 6
T2 - 7
T2 - 8
T2 - 9
T0 - 4
T3 - 1 // and T3 got count = 1 too!
T1 - 5
To your question about each thread printing one-by-one, I assume you're trying to coordinate access to count. You can accomplish this with synchronization primitives (such as the lock statement in C#). Here is a naive modification to your code which will ensure only max increments occur:
static object countLock = new object();
public static void printWithLock()
{
// loop forever
while(true)
{
// protect access to count using a static object
// now only 1 thread can use 'count' at a time
lock (countLock)
{
if (count >= max) return;
Console.WriteLine(Thread.CurrentThread.Name + " - " + count.ToString());
count++;
}
}
}
This simple modification makes your program logically correct, but also slow. The sample now exhibits a new problem: lock contention. Every thread is now vying for access to countLock. We've made our program thread safe, but without any benefits of parallelism!
Threading and parallelism is not particularly easy to get right, but thankfully recent versions of .Net come with the Task Parallel Library (TPL) and Parallel LINQ (PLINQ).
The beauty of the library is how easy it would be to convert your current code:
var sw = new Stopwatch();
sw.Start();
Enumerable.Range(0, max)
.AsParallel()
.ForAll(number =>
Console.WriteLine("T{0}: {1}",
Thread.CurrentThread.ManagedThreadId,
number));
Console.WriteLine("{0} ms elapsed", sw.ElapsedMilliseconds);
// Sample output from max = 10
//
// T9: 3
// T9: 4
// T9: 5
// T9: 6
// T9: 7
// T9: 8
// T9: 9
// T8: 1
// T7: 2
// T1: 0
// 30 ms elapsed
The output above is an interesting illustration of why threading produces "unexpected results" for newer users. When threads execute in parallel, they may complete chunks of code at different points in time or one thread may be faster than another. You never really know with threading!
Your print function is far from thread safe; that's why threads 4-10 don't print. All threads share the same max and count variables.
The reason more threads slow you down is likely the context switching that takes place each time the processor changes focus between threads.
Also, when you create a lot of threads, the system needs to allocate new ones. These days it is usually advisable to use Tasks instead, as they are drawn from a system-managed thread pool and thus don't necessarily have to be allocated; creating a distinct new thread is rather expensive.
Take a look here anyhow: http://msdn.microsoft.com/en-us/library/aa645740(VS.71).aspx
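A quick sketch of the Task-based equivalent of the question's thread loop (it reuses the OP's print method; only the pooling is different):

```csharp
using System.Threading.Tasks;

var tasks = new Task[30];
for (int i = 0; i < 30; i++)
{
    // Task.Run queues the work to the shared thread pool instead of
    // allocating a dedicated OS thread per worker.
    tasks[i] = Task.Run(() => print());
}
Task.WaitAll(tasks); // blocks until all workers finish; no busy-wait needed
```

Note that pool threads are unnamed, so Thread.CurrentThread.Name inside print() would come back null here.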
Look carefully:
t = int.Parse(Console.ReadLine());
count = 0;
Result = DateTime.Now;
List<Thread> MyThreads = new List<Thread>();
for (int i = 1; i < 31; i++)
{
Thread Temp = new Thread(print);
Temp.Name = i.ToString();
MyThreads.Add(Temp);
}
I think you never actually used the variable t: the loop always creates 30 threads (i < 31) regardless of what the user entered.
You should read many books on parallel and multithreaded programming before writing code, because programming language is just a tool. Good luck!
