Disappointing performance with Parallel.For

Disappointing performance with Parallel.For - c#

I am trying to speed up my calculation times by using Parallel.For. I have an Intel Core i7 Q840 CPU with 8 cores, but I only manage to get a performance ratio of 4 compared to a sequential for loop. Is this as good as it can get with Parallel.For, or can the method call be fine-tuned to increase performance?
Here is my test code, sequential:
var loops = 200;
var perloop = 10000000;
var sum = 0.0;
for (var k = 0; k < loops; ++k)
{
var sumk = 0.0;
for (var i = 0; i < perloop; ++i) sumk += (1.0 / i) * i;
sum += sumk;
}
and parallel:
sum = 0.0;
Parallel.For(0, loops,
k =>
{
var sumk = 0.0;
for (var i = 0; i < perloop; ++i) sumk += (1.0 / i) * i;
sum += sumk;
});
The loop that I am parallelizing involves computation with a "globally" defined variable, sum, but this should only amount to a tiny, tiny fraction of the total time within the parallelized loop.
In Release build ("optimize code" flag set) the sequential for loop takes 33.7 s on my computer, whereas the Parallel.For loop takes 8.4 s, a performance ratio of only 4.0.
In the Task Manager, I can see that the CPU utilization is 10-11% during the sequential calculation, whereas it is only 70% during the parallel calculation. I have tried to explicitly set
ParallelOptions.MaxDegreesOfParallelism = Environment.ProcessorCount
but to no avail. It is not clear to me why not all CPU power is assigned to the parallel calculation?
I have noticed that a similar question has been raised on SO before, with an even more disappointing result. However, that question also involved inferior parallelization in a third-party library. My primary concern is parallelization of basic operations in the core libraries.
UPDATE
It was pointed out to me in some of the comments that the CPU I am using only has 4 physical cores, which is visible to the system as 8 cores if hyper threading is enabled. For the sake of it, I disabled hyper-threading and re-benchmarked.
With hyper-threading disabled, my calculations are now faster, both the parallel and also the (what I thought was) sequential for loop. CPU utilization during the for loop is up to approx. 45% (!!!) and 100% during the Parallel.For loop.
Computation time for the for loop 15.6 s (more than twice as fast as with hyper-threading enabled) and 6.2 s for Parallel.For (25% better than when hyper-threading is enabled). Performance ratio with Parallel.For is now only 2.5, running on 4 real cores.
So the performance ratio is still substantially lower than expected, despite hyper-threading being disabled. On the other hand it is intriguing that CPU utilization is so high during the for loop? Could there be some kind of internal parallelization going on in this loop as well?

Using a global variable can introduce significant synchronization problems, even when you are not using locks. When you assign a value to the variable each core will have to get access to the same place in system memory, or wait for the other core to finish before accessing it.
You can avoid corruption without locks by using the lighter Interlocked.Add method to add a value to the sum atomically, at the OS level, but you will still get delays due to contention.
The proper way to do this is to update a thread local variable to create the partial sums and add all of them to a single global sum at the end. Parallel.For has an overload that does just this. MSDN even has an example using sumation at How To: Write a Parallel.For Loop that has Thread Local Variables
int[] nums = Enumerable.Range(0, 1000000).ToArray();
long total = 0;
// Use type parameter to make subtotal a long, not an int
Parallel.For<long>(0, nums.Length, () => 0, (j, loop, subtotal) =>
{
subtotal += nums[j];
return subtotal;
},
(x) => Interlocked.Add(ref total, x)
);
Each thread updates its own subtotal value and updates the global total using Interlocked.Add when it finishes.

Parallel.For and Parallel.ForEach will use a degree of parallelism that it feels is appropriate, balancing the cost to setup and tear down threads and the work it expects each thread will perform. .NET 4.5 made several improvements to performance (including more intelligent decisions on the number of threads to spin up) compared to previous .NET versions.
Note that, even if it were to spin up one thread per core, context switches, false sharing issues, resource locks, and other issues may prevent you from achieving linear scalability (in general, not necessarily with your specific code example).

I think the computation gain is so low because your code is "too easy" to work on other task each iteration - because parallel.for just create new task in each iteration, so this will take time to service them in threads. I will it like this:
int[] nums = Enumerable.Range(0, 1000000).ToArray();
long total = 0;
Parallel.ForEach(
Partitioner.Create(0, nums.Length),
() => 0,
(part, loopState, partSum) =>
{
for (int i = part.Item1; i < part.Item2; i++)
{
partSum += nums[i];
}
return partSum;
},
(partSum) =>
{
Interlocked.Add(ref total, partSum);
}
);
Partitioner will create optimal part of job for each task, there will be less time for service task with threads. If you can, please benchmark this solution and tell us if it get better speed up.

foreach vs parallel for each an example
for (int i = 0; i < 10; i++)
{
int[] array = new int[] { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29 };
Stopwatch watch = new Stopwatch();
watch.Start();
//Parallel foreach
Parallel.ForEach(array, line =>
{
for (int x = 0; x < 1000000; x++)
{
}
});
watch.Stop();
Console.WriteLine("Parallel.ForEach {0}", watch.Elapsed.Milliseconds);
watch = new Stopwatch();
//foreach
watch.Start();
foreach (int item in array)
{
for (int z = 0; z < 10000000; z++)
{
}
}
watch.Stop();
Console.WriteLine("ForEach {0}", watch.Elapsed.Milliseconds);
Console.WriteLine("####");
}
Console.ReadKey();
My CPU
Intel® Core™ i7-620M Processor (4M Cache, 2.66 GHz)

Related

Why are 1000 threads faster than a few?

I have a simple program that searches linearly in an array of 2D points. I do 1000 searches into an array of 1 000 000 points.
The curious thing is that if I spawn 1000 threads, the program works as fast as when I span only as much as CPU cores I have, or when I use Parallel.For. This is contrary to everything I know about creating threads. Creating and destroying threads is expensive, but obviously not in this case.
Can someone explain why?
Note: this is a methodological example; the search algorithm is deliberately not meant do to optimal. The focus is on threading.
Note 2: I tested on an 4-core i7 and 3-core AMD, the results follow the same pattern!
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Threading;
/// <summary>
/// We search for closest points.
/// For every point in array searchData, we search into inputData for the closest point,
/// and store it at the same position into array resultData;
/// </summary>
class Program
{
class Point
{
public double X { get; set; }
public double Y { get; set; }
public double GetDistanceFrom (Point p)
{
double dx, dy;
dx = p.X - X;
dy = p.Y - Y;
return Math.Sqrt(dx * dx + dy * dy);
}
}
const int inputDataSize = 1_000_000;
static Point[] inputData = new Point[inputDataSize];
const int searchDataSize = 1000;
static Point[] searchData = new Point[searchDataSize];
static Point[] resultData = new Point[searchDataSize];
static void GenerateRandomData (Point[] array)
{
Random rand = new Random();
for (int i = 0; i < array.Length; i++)
{
array[i] = new Point()
{
X = rand.NextDouble() * 100_000,
Y = rand.NextDouble() * 100_000
};
}
}
private static void SearchOne(int i)
{
var searchPoint = searchData[i];
foreach (var p in inputData)
{
if (resultData[i] == null)
{
resultData[i] = p;
}
else
{
double oldDistance = searchPoint.GetDistanceFrom(resultData[i]);
double newDistance = searchPoint.GetDistanceFrom(p);
if (newDistance < oldDistance)
{
resultData[i] = p;
}
}
}
}
static void AllThreadSearch()
{
List<Thread> threads = new List<Thread>();
for (int i = 0; i < searchDataSize; i++)
{
var thread = new Thread(
obj =>
{
int index = (int)obj;
SearchOne(index);
});
thread.Start(i);
threads.Add(thread);
}
foreach (var t in threads) t.Join();
}
static void FewThreadSearch()
{
int threadCount = Environment.ProcessorCount;
int workSize = searchDataSize / threadCount;
List<Thread> threads = new List<Thread>();
for (int i = 0; i < threadCount; i++)
{
var thread = new Thread(
obj =>
{
int[] range = (int[])obj;
int from = range[0];
int to = range[1];
for (int index = from; index < to; index++)
{
SearchOne(index);
}
}
);
int rangeFrom = workSize * i;
int rangeTo = workSize * (i + 1);
thread.Start(new int[]{ rangeFrom, rangeTo });
threads.Add(thread);
}
foreach (var t in threads) t.Join();
}
static void ParallelThreadSearch()
{
System.Threading.Tasks.Parallel.For (0, searchDataSize,
index =>
{
SearchOne(index);
});
}
static void Main(string[] args)
{
Console.Write("Generatic data... ");
GenerateRandomData(inputData);
GenerateRandomData(searchData);
Console.WriteLine("Done.");
Console.WriteLine();
Stopwatch watch = new Stopwatch();
Console.Write("All thread searching... ");
watch.Restart();
AllThreadSearch();
watch.Stop();
Console.WriteLine($"Done in {watch.ElapsedMilliseconds} ms.");
Console.Write("Few thread searching... ");
watch.Restart();
FewThreadSearch();
watch.Stop();
Console.WriteLine($"Done in {watch.ElapsedMilliseconds} ms.");
Console.Write("Parallel thread searching... ");
watch.Restart();
ParallelThreadSearch();
watch.Stop();
Console.WriteLine($"Done in {watch.ElapsedMilliseconds} ms.");
Console.WriteLine();
Console.WriteLine("Press ENTER to quit.");
Console.ReadLine();
}
}
EDIT: Please make sure to run the app outside the debugger. VS Debugger slows down the case of multiple threads.
EDIT 2: Some more tests.
To make it clear, here is updated code that guarantees we do have 1000 running at once:
public static void AllThreadSearch()
{
ManualResetEvent startEvent = new ManualResetEvent(false);
List<Thread> threads = new List<Thread>();
for (int i = 0; i < searchDataSize; i++)
{
var thread = new Thread(
obj =>
{
startEvent.WaitOne();
int index = (int)obj;
SearchOne(index);
});
thread.Start(i);
threads.Add(thread);
}
startEvent.Set();
foreach (var t in threads) t.Join();
}
Testing with a smaller array - 100K elements, the results are:
1000 vs 8 threads
Method | Mean | Error | StdDev | Scaled |
--------------------- |---------:|---------:|----------:|-------:|
AllThreadSearch | 323.0 ms | 7.307 ms | 21.546 ms | 1.00 |
FewThreadSearch | 164.9 ms | 3.311 ms | 5.251 ms | 1.00 |
ParallelThreadSearch | 141.3 ms | 1.503 ms | 1.406 ms | 1.00 |
Now, 1000 threads is much slower, as expected. Parallel.For still bests them all, which is also logical.
However, growing the array to 500K (i.e. the amount of work every thread does), things start to look weird:
1000 vs 8, 500K
Method | Mean | Error | StdDev | Scaled |
--------------------- |---------:|---------:|---------:|-------:|
AllThreadSearch | 890.9 ms | 17.74 ms | 30.61 ms | 1.00 |
FewThreadSearch | 712.0 ms | 13.97 ms | 20.91 ms | 1.00 |
ParallelThreadSearch | 714.5 ms | 13.75 ms | 12.19 ms | 1.00 |
Looks like context-switching has negligible costs. Thread-creation costs are also relatively small. The only significant cost of having too many threads is loss of memory (memory addresses). Which, alone, is bad enough.
Now, are thread-creation costs that little indeed? We've been universally told that creating threads is very bad and context-switches are evil.

You may want to consider how the application is accessing memory. In the maximum threads scenario you are effectively accessing memory sequentially, which is efficient from a caching point of view. The approach using a small number of threads is more random, causing cache misses. Depending on the CPU there are performance counters that allow you to measure L1 and L2 cache hits/misses.

I think the real issue (other than memory use) with too many threads is that the CPU may have a hard time optimizing itself because it is switching tasks all the time. In the OP's original benchmark, the threads are all working on the same task and so you aren't seeing that much of a cost for the extra threads.
To simulate threads working on different tasks, I modified Jodrell's reformulation of the original code (labeled "Normal" in the data below) to first optimize memory access by ensuring all the threads are working in the same block of memory at the same time and such that the block fits in the cache (4mb) using the method from this cache blocking techniques article. Then I "reversed" that to ensure each set of 4 threads work in a different block of memory. The results for my machine (in ms):
Intel Core i7-5500U CPU 2.40GHz (Max: 2.39GHz) (Broadwell), 1 CPU, 4 logical and 2 physical cores)
inputDataSize = 1_000_000; searchDataSize = 1000; blocks used for O/D: 10
Threads 1 2 3 4 6 8 10 18 32 56 100 178 316 562 1000
Normal(N) 5722 3729 3599 2909 3485 3109 3422 3385 3138 3220 3196 3216 3061 3012 3121
Optimized(O) 5480 2903 2978 2791 2826 2806 2820 2796 2778 2775 2775 2805 2833 2866 2988
De-optimized(D) 5455 3373 3730 3849 3409 3350 3297 3335 3365 3406 3455 3553 3692 3719 3900
For O, all the threads worked in the same block of cacheable memory at the same time (where 1 block = 1/10 of inputData). For D, for every set of 4 threads, no thread worked in the same block of memory at the same time. So basically, in the former case access of inputData was able to make use of the cache whereas in the latter case for 4 threads access of inputData was forced to use main memory.
It's easier to see in charts. These charts have the thread-creation cost subtracted out and note the x-axis is logarithmic and y-axis is truncated to better show the shape of the data. Also, the value for 1 thread has been halved to show the theoretical best multi-threaded performance:
A quick glance above shows the optimized data (O) is indeed faster than the others. It is also more consistent (smoother) because compared to N it is not having to deal with cache-misses. As suggested by Jodrell, there appears to be a sweet spot around 100 threads, which is the number on my system which would allow a thread to complete its work within 1 time-slice. After that, the time increases linearly with number of threads (remember, the x-axis has a logarithmic scale on the chart.)
Comparing the normal and optimized data, the former is quite jagged whereas the latter is smooth. This answer suggested more threads would be more efficient from a caching point of view compared to fewer threads where the memory access could be more "random". The chart below seems to confirm this (note 4 threads is optimal for my machine as it has 4 logical cores):
The de-optimized version is most interesting. The worse case is with 4 threads as they have been forced to work in different areas of memory, preventing effective caching. As the number threads increases, the system is able to cache as threads share blocks of memory. But, as the number of threads increases presumably the context-switching makes it harder for the system to cache again and the results tend back to the worst-case:
I think this last chart is what shows the real cost of context-switching. In the original (N) version, the threads are all doing the same task. As a result there is limited competition for resources other than CPU time and the CPU is able to optimize itself for the workload (i.e. cache effectively.) If the threads are all doing different things, then the CPU isn't able to optimize itself and a severe performance hit results. So it's not directly the context switching that causes the problem, but the competition for resources.
In this case, the difference for 4 Threads between O (2909ms) and D (3849ms) is 940ms. This represents a 32% performance hit. Because my machine has a shared L3 cache, this performance hit shows up even with only 4 threads.

I took the liberty of rearranging your code to run using BenchmarkDotNet, it looks like this,
using System;
using System.Collections.Generic;
using System.Threading;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
namespace Benchmarks
{
public class Point
{
public double X { get; set; }
public double Y { get; set; }
public double GetDistanceFrom(Point p)
{
double dx, dy;
dx = p.X - X;
dy = p.Y - Y;
return Math.Sqrt(dx * dx + dy * dy);
}
}
[ClrJob(baseline: true)]
public class SomeVsMany
{
[Params(1000)]
public static int inputDataSize = 1000;
[Params(10)]
public static int searchDataSize = 10;
static Point[] inputData = new Point[inputDataSize];
static Point[] searchData = new Point[searchDataSize];
static Point[] resultData = new Point[searchDataSize];
[GlobalSetup]
public static void Setup()
{
GenerateRandomData(inputData);
GenerateRandomData(searchData);
}
[Benchmark]
public static void AllThreadSearch()
{
List<Thread> threads = new List<Thread>();
for (int i = 0; i < searchDataSize; i++)
{
var thread = new Thread(
obj =>
{
int index = (int)obj;
SearchOne(index);
});
thread.Start(i);
threads.Add(thread);
}
foreach (var t in threads) t.Join();
}
[Benchmark]
public static void FewThreadSearch()
{
int threadCount = Environment.ProcessorCount;
int workSize = searchDataSize / threadCount;
List<Thread> threads = new List<Thread>();
for (int i = 0; i < threadCount; i++)
{
var thread = new Thread(
obj =>
{
int[] range = (int[])obj;
int from = range[0];
int to = range[1];
for (int index = from; index < to; index++)
{
SearchOne(index);
}
}
);
int rangeFrom = workSize * i;
int rangeTo = workSize * (i + 1);
thread.Start(new int[] { rangeFrom, rangeTo });
threads.Add(thread);
}
foreach (var t in threads) t.Join();
}
[Benchmark]
public static void ParallelThreadSearch()
{
System.Threading.Tasks.Parallel.For(0, searchDataSize,
index =>
{
SearchOne(index);
});
}
private static void GenerateRandomData(Point[] array)
{
Random rand = new Random();
for (int i = 0; i < array.Length; i++)
{
array[i] = new Point()
{
X = rand.NextDouble() * 100_000,
Y = rand.NextDouble() * 100_000
};
}
}
private static void SearchOne(int i)
{
var searchPoint = searchData[i];
foreach (var p in inputData)
{
if (resultData[i] == null)
{
resultData[i] = p;
}
else
{
double oldDistance = searchPoint.GetDistanceFrom(resultData[i]);
double newDistance = searchPoint.GetDistanceFrom(p);
if (newDistance < oldDistance)
{
resultData[i] = p;
}
}
}
}
}
public class Program
{
static void Main(string[] args)
{
var summary = BenchmarkRunner.Run<SomeVsMany>();
}
}
}
When I run the benchmark I get these results,
BenchmarkDotNet=v0.11.1, OS=Windows 10.0.14393.2485
(1607/AnniversaryUpdate/Redstone1) Intel Core i7-7600U CPU 2.80GHz
(Max: 2.90GHz) (Kaby Lake), 1 CPU, 4 logical and 2 physical cores
Frequency=2835938 Hz, Resolution=352.6170 ns, Timer=TSC [Host] :
.NET Framework 4.7.2 (CLR 4.0.30319.42000), 64bit RyuJIT-v4.7.3163.0
Clr : .NET Framework 4.7.2 (CLR 4.0.30319.42000), 64bit
RyuJIT-v4.7.3163.0 Job=Clr Runtime=Clr
Method inputDataSize searchDataSize Mean Error StdDev
AllThreadSearch 1000 10 1,276.53us 51.0605us 142.3364us
FewThreadSearch 1000 10 547.72us 24.8199us 70.0049us
ParallelThreadSearch 1000 10 36.54us 0.6973us 0.8564us
These are the kind of results I'd expect and different to what you are claiming in the question. However, as you correctly identify in the comment, this is because I have reduced the values of inputDataSize and searchDataSize.
If I rerun the test with the original values I get results like this,
Method inputDataSize searchDataSize Mean Error StdDev
AllThreadSearch 1000000 1000 2.872s 0.0554s 0.0701s
FewThreadSearch 1000000 1000 2.384s 0.0471s 0.0612s
ParallelThreadSearch 1000000 1000 2.449s 0.0368s 0.0344s
These results support your question.
FWIW I did another test run,
Method inputDataSize searchDataSize Mean Error StdDev
AllThreadSearch 20000000 40 1.972s 0.0392s 0.1045s
FewThreadSearch 20000000 40 1.718s 0.0501s 0.1477s
ParallelThreadSearch 20000000 40 1.978s 0.0454s 0.0523s
This may help distinguish the cost of context switching versus thread creation but ultimately, there must be an element of both.
There is a little speculation but, here are a few assertions and, a conclusion, based on our aggregated results.
Creating a Thread incurs some fixed overhead. When the work is large, the overhead becomes insignificant.
The operating system and processor architecture can only run a certain number of CPU threads at a time. Some amount of CPU time will be reserved for the many operations that keep the computer running behind the scenes. A chunk of that CPU time will be consumed by the background processes and services, not related to this test.
Even if we have a 8 core CPU and spawn just 2 threads we cannot expect both threads to progress through the program at exactly the same rate.
Accepting the points above, whether or not the threads are serviced via a .Net ThreadPool, only a finite number can be serviced concurrently. Even if all instantiated threads are progressed to some semaphore, they did not all get there at once and they will not all proceed at once. If we have more threads than available cores, some threads will have to wait before they can progress at all.
Each thread will proceed for a certain time-slice or until it is waiting for a resource.
This is where the speculation comes in but, when inputDataSize is small, the threads will tend to complete their work within one time-slice, requiring less or no context switching.
When inputDataSize becomes sufficiently large, the work cannot be completed within one time-slice, this makes context switching more likely.
So, given a large fixed size for searchDataSize we have three scenarios. The boundaries of these scenarios will depend on the characteristics of the test platform.
inputDataSize is small
Here, the cost of thread creation is significant, AllThreadSearch is massively slower. ParallelThreadSearch tends to win because it minimizes the cost of thread creation.
inputDataSize is medium
The cost of thread creation is insignificant. Crucially, the work can be completed in one time slice. AllThreadSearch makes use of OS level scheduling and avoids the reasonable but significant overhead of both the Parallel.For and the bucket looping in FewThreadSearch. Somewhere in this area is the sweet spot for AllThreadSearch, it may be possible that for some combinations AllThreadSearch is the fastest option.
inputDataSize is large
Crucially, the work cannot be completed in one time slice. Both the OS scheduler and the ThreadPool fail to anticipate the cost of context switching. Without some expensive heuristics how could they? FewThreadSearch wins out because it avoids the context switching, the cost of which outweighs the cost of bucket looping.
As ever, if you care about performance it pays to benchmark, on a representative system, with a representative workload, with a representative configuration.

First you have to understand the difference between Process and Thread to deep dive into the benefits of concurrency to achieve faster results over sequential programming.
Process - We can call it as an instance of a program in execution. Operating System creates different processes while executing an application. An application can have one or more processes. Process creation is some what costly job to the operating system as it needs to provide several resources while creating, such as Memory, Registers, open handles to system objects to access, security context etc.,
Thread - it is the entity within a process that can be scheduled for execution(can be a part of your code). Unlike Process creation, thread creation is not costly/time consuming as threads share virtual address space and system resources of the process where it belongs. It's improving the performance of the OS as it no need to provide resources for each thread it creates.
Below diagram will elaborate more than my words.
As threads are sharing the resources and having the concurrency nature in them they can run parallel and produce improved results. If your application needs to be highly parallel then you can create ThreadPool(collection of worker threads) to achieve efficiently execute asynchronous callbacks.
And to correct your final assumption/question, creating/destroying threads is not costly than creating/destroying process so always having a "properly handled threading code" would benefit the performance of the application.

It is simply because you can't create threads more than the capacity of your cpu ... so actually in both cases you are creating the same number of threads; your CPU max ...

Parallel.ForEach slower than normal foreach

I'm playing around with the Parallel.ForEach in a C# console application, but can't seem to get it right. I'm creating an array with random numbers and i have a sequential foreach and a Parallel.ForEach that finds the largest value in the array. With approximately the same code in c++ i started to see a tradeoff to using several threads at 3M values in the array. But the Parallel.ForEach is twice as slow even at 100M values. What am i doing wrong?
class Program
{
static void Main(string[] args)
{
dostuff();
}
static void dostuff() {
Console.WriteLine("How large do you want the array to be?");
int size = int.Parse(Console.ReadLine());
int[] arr = new int[size];
Random rand = new Random();
for (int i = 0; i < size; i++)
{
arr[i] = rand.Next(0, int.MaxValue);
}
var watchSeq = System.Diagnostics.Stopwatch.StartNew();
var largestSeq = FindLargestSequentially(arr);
watchSeq.Stop();
var elapsedSeq = watchSeq.ElapsedMilliseconds;
Console.WriteLine("Finished sequential in: " + elapsedSeq + "ms. Largest = " + largestSeq);
var watchPar = System.Diagnostics.Stopwatch.StartNew();
var largestPar = FindLargestParallel(arr);
watchPar.Stop();
var elapsedPar = watchPar.ElapsedMilliseconds;
Console.WriteLine("Finished parallel in: " + elapsedPar + "ms Largest = " + largestPar);
dostuff();
}
static int FindLargestSequentially(int[] arr) {
int largest = arr[0];
foreach (int i in arr) {
if (largest < i) {
largest = i;
}
}
return largest;
}
static int FindLargestParallel(int[] arr) {
int largest = arr[0];
Parallel.ForEach<int, int>(arr, () => 0, (i, loop, subtotal) =>
{
if (i > subtotal)
subtotal = i;
return subtotal;
},
(finalResult) => {
Console.WriteLine("Thread finished with result: " + finalResult);
if (largest < finalResult) largest = finalResult;
}
);
return largest;
}
}

It's performance ramifications of having a very small delegate body.
We can achieve better performance using the partitioning. In this case the body delegate performs work with a high data volume.
static int FindLargestParallelRange(int[] arr)
{
object locker = new object();
int largest = arr[0];
Parallel.ForEach(Partitioner.Create(0, arr.Length), () => arr[0], (range, loop, subtotal) =>
{
for (int i = range.Item1; i < range.Item2; i++)
if (arr[i] > subtotal)
subtotal = arr[i];
return subtotal;
},
(finalResult) =>
{
lock (locker)
if (largest < finalResult)
largest = finalResult;
});
return largest;
}
Pay attention to synchronize the localFinally delegate. Also note the need for proper initialization of the localInit: () => arr[0] instead of () => 0.
Partitioning with PLINQ:
static int FindLargestPlinqRange(int[] arr)
{
return Partitioner.Create(0, arr.Length)
.AsParallel()
.Select(range =>
{
int largest = arr[0];
for (int i = range.Item1; i < range.Item2; i++)
if (arr[i] > largest)
largest = arr[i];
return largest;
})
.Max();
}
I highly recommend free book Patterns of Parallel Programming by Stephen Toub.

As the other answerers have mentioned, the action you're trying to perform against each item here is so insignificant that there are a variety of other factors which end up carrying more weight than the actual work you're doing. These may include:
JIT optimizations
CPU branch prediction
I/O (outputting thread results while the timer is running)
the cost of invoking delegates
the cost of task management
the system incorrectly guessing what thread strategy will be optimal
memory/cpu caching
memory pressure
environment (debugging)
etc.
Running each approach a single time is not an adequate way to test, because it enables a number of the above factors to weigh more heavily on one iteration than on another. You should start with a more robust benchmarking strategy.
Furthermore, your implementation is actually dangerously incorrect. The documentation specifically says:
The localFinally delegate is invoked once per task to perform a final action on each task’s local state. This delegate might be invoked concurrently on multiple tasks; therefore, you must synchronize access to any shared variables.
You have not synchronized your final delegate, so your function is prone to race conditions that would make it produce incorrect results.
As in most cases, the best approach to this one is to take advantage of work done by people smarter than we are. In my testing, the following approach appears to be the fastest overall:
return arr.AsParallel().Max();

The Parallel Foreach loop should be running slower because the algorithm used is not parallel and a lot more work is being done to run this algorithm.
In the single thread, to find the max value, we can take the first number as our max value and compare it to every other number in the array. If one of the numbers larger than our first number, we swap and continue. This way we access each number in the array once, for a total of N comparisons.
In the Parallel loop above, the algorithm creates overhead because each operation is wrapped inside a function call with a return value. So in addition to doing the comparisons, it is running overhead of adding and removing these calls onto the call stack. In addition, since each call is dependent on the value of the function call before, it needs to run in sequence.
In the Parallel For Loop below, the array is divided into an explicit number of threads determined by the variable threadNumber. This limits the overhead of function calls to a low number.
Note, for low values, the parallel loops performs slower. However, for 100M, there is a decrease in time elapsed.
static int FindLargestParallel(int[] arr)
{
var answers = new ConcurrentBag<int>();
int threadNumber = 4;
int partitionSize = arr.Length/threadNumber;
Parallel.For(0, /* starting number */
threadNumber+1, /* Adding 1 to threadNumber in case array.Length not evenly divisible by threadNumber */
i =>
{
if (i*partitionSize < arr.Length) /* check in case # in array is divisible by # threads */
{
var max = arr[i*partitionSize];
for (var x = i*partitionSize;
x < (i + 1)*partitionSize && x < arr.Length;
++x)
{
if (arr[x] > max)
max = arr[x];
}
answers.Add(max);
}
});
/* note the shortcut in finding max in the bag */
return answers.Max(i=>i);
}

Some thoughts here: In the parallel case, there is thread management logic involved that determines how many threads it wants to use. This thread management logic presumably possibly runs on your main thread. Every time a thread returns with the new maximum value, the management logic kicks in and determines the next work item (the next number to process in your array). I'm pretty sure that this requires some kind of locking. In any case, determining the next item may even cost more than performing the comparison operation itself.
That sounds like a magnitude more work (overhead) to me than a single thread that processes one number after the other. In the single-threaded case there are a number of optimization at play: No boundary checks, CPU can load data into the first level cache within the CPU, etc. Not sure, which of these optimizations apply for the parallel case.
Keep in mind that on a typical desktop machine there are only 2 to 4 physical CPU cores available so you will never have more than that actually doing work. So if the parallel processing overhead is more than 2-4 times of a single-threaded operation, the parallel version will inevitably be slower, which you are observing.
Have you attempted to run this on a 32 core machine? ;-)
A better solution would be determine non-overlapping ranges (start + stop index) covering the entire array and let each parallel task process one range. This way, each parallel task can internally do a tight single-threaded loop and only return once the entire range has been processed. You could probably even determine a near optimal number of ranges based on the number of logical cores of the machine. I haven't tried this but I'm pretty sure you will see an improvement over the single-threaded case.

Try splitting the set into batches and running the batches in parallel, where the number of batches corresponds to your number of CPU cores.
I ran some equations 1K, 10K and 1M times using the following methods:
A "for" loop.
A "Parallel.For" from the System.Threading.Tasks lib, across the entire set.
A "Parallel.For" across 4 batches.
A "Parallel.ForEach" from the System.Threading.Tasks lib, across the entire set.
A "Parallel.ForEach" across 4 batches.
Results: (Measured in seconds)
Conclusion:
Processing batches in parallel using the "Parallel.ForEach" has the best outcome in cases above 10K records. I believe the batching helps because it utilizes all CPU cores (4 in this example), but also minimizes the amount of threading overhead associated with parallelization.
Here is my code:
public void ParallelSpeedTest()
{
var rnd = new Random(56);
int range = 1000000;
int numberOfCores = 4;
int batchSize = range / numberOfCores;
int[] rangeIndexes = Enumerable.Range(0, range).ToArray();
double[] inputs = rangeIndexes.Select(n => rnd.NextDouble()).ToArray();
double[] weights = rangeIndexes.Select(n => rnd.NextDouble()).ToArray();
double[] outputs = new double[rangeIndexes.Length];
/// Series "for"...
var startTimeSeries = DateTime.Now;
for (var i = 0; i < range; i++)
{
outputs[i] = Math.Sqrt(Math.Pow(inputs[i] * weights[i], 2));
}
var durationSeries = DateTime.Now - startTimeSeries;
/// "Parallel.For"...
var startTimeParallel = DateTime.Now;
Parallel.For(0, range, (i) => {
outputs[i] = Math.Sqrt(Math.Pow(inputs[i] * weights[i], 2));
});
var durationParallelFor = DateTime.Now - startTimeParallel;
/// "Parallel.For" in Batches...
var startTimeParallel2 = DateTime.Now;
Parallel.For(0, numberOfCores, (c) => {
var endValue = (c == numberOfCores - 1) ? range : (c + 1) * batchSize;
var startValue = c * batchSize;
for (var i = startValue; i < endValue; i++)
{
outputs[i] = Math.Sqrt(Math.Pow(inputs[i] * weights[i], 2));
}
});
var durationParallelForBatches = DateTime.Now - startTimeParallel2;
/// "Parallel.ForEach"...
var startTimeParallelForEach = DateTime.Now;
Parallel.ForEach(rangeIndexes, (i) => {
outputs[i] = Math.Sqrt(Math.Pow(inputs[i] * weights[i], 2));
});
var durationParallelForEach = DateTime.Now - startTimeParallelForEach;
/// Parallel.ForEach in Batches...
List<Tuple<int,int>> ranges = new List<Tuple<int, int>>();
for (var i = 0; i < numberOfCores; i++)
{
int start = i * batchSize;
int end = (i == numberOfCores - 1) ? range : (i + 1) * batchSize;
ranges.Add(new Tuple<int,int>(start, end));
}
var startTimeParallelBatches = DateTime.Now;
Parallel.ForEach(ranges, (range) => {
for(var i = range.Item1; i < range.Item1; i++) {
outputs[i] = Math.Sqrt(Math.Pow(inputs[i] * weights[i], 2));
}
});
var durationParallelForEachBatches = DateTime.Now - startTimeParallelBatches;
Debug.Print($"=================================================================");
Debug.Print($"Given: Set-size: {range}, number-of-batches: {numberOfCores}, batch-size: {batchSize}");
Debug.Print($".................................................................");
Debug.Print($"Series For: {durationSeries}");
Debug.Print($"Parallel For: {durationParallelFor}");
Debug.Print($"Parallel For Batches: {durationParallelForBatches}");
Debug.Print($"Parallel ForEach: {durationParallelForEach}");
Debug.Print($"Parallel ForEach Batches: {durationParallelForEachBatches}");
Debug.Print($"");
}

Performance penalty between for loop and Parallel.For() with MaxDegreeOfParallelism of 1

I want to do something like the following:
int firstLoopMaxThreads = 1; // or -1
int secondLoopMaxThreads = firstLoopMaxThreads == 1 ? -1 : 1;
Parallel.For(0, m, new ParallelOptions() { MaxDegreeOfParallelism = firstLoopMaxThreads }, i =>
{
//do some processor and/or memory-intensive stuff
Parallel.For(0, n, new ParallelOptions() { MaxDegreeOfParallelism = secondLoopMaxThreads }, j =>
{
//do some other processor and/or memory-intensive stuff
});
});
Would it be worth it, performance wise, to swap the inner Parallel.For loop with a normal for loop when secondLoopMaxThreads = 1? What is the performance difference between a regular for loop and a Parallel.For loop with MaxDegreeofParallelism = 1?

It depends on how many iterations you're talking about and what level of performance you're talking about to answer whether it is worth it or not. In your context is 1ms considered a lot, or a little?
I did a rudimentary test as below (since Thread.Sleep is not entirely accurate.. although the for loop measured 15,000ms to within 1ms everytime). Over 15,000 iterations repeated 5 times it generally added about 4ms of overhead compared to a standard for loop... but of course results would be different depending on the environment.
for (int z = 0; z < 5; z++)
{
int iterations = 15000;
Stopwatch s = Stopwatch.StartNew();
for (int i = 0; i < iterations; i++)
Thread.Sleep(1);
s.Stop();
Console.WriteLine("#{0}:Elapsed (for): {1:#,0}ms", z, ((double)s.ElapsedTicks / (double)Stopwatch.Frequency) * 1000);
var options = new ParallelOptions() { MaxDegreeOfParallelism = 1 };
s = Stopwatch.StartNew();
Parallel.For(0, iterations, options, (i) => Thread.Sleep(1));
s.Stop();
Console.WriteLine("#{0}: Elapsed (parallel): {1:#,0}ms", z, ((double)s.ElapsedTicks / (double)Stopwatch.Frequency) * 1000);
}

The loop body performs equally well in both versions but the loop itself is drastically slower with Parallel.For even for single-threaded execution. Each element needs to call a delegate. This is very much slower than incrementing a loop counter.
If your loop body does anything meaningful the loop overhead will be dwarfed by useful work. Just ensure that your work items are not too small and you won't notice a difference.
Nesting parallel loops is rarely a good idea. A single parallel loop is usually best enough provided the work items are neither too small nor too big.

How come this algorithm in Ruby runs faster than in Parallel'd C#?

The following ruby code runs in ~15s. It barely uses any CPU/Memory (about 25% of one CPU):
def collatz(num)
num.even? ? num/2 : 3*num + 1
end
start_time = Time.now
max_chain_count = 0
max_starter_num = 0
(1..1000000).each do |i|
count = 0
current = i
current = collatz(current) and count += 1 until (current == 1)
max_chain_count = count and max_starter_num = i if (count > max_chain_count)
end
puts "Max starter num: #{max_starter_num} -> chain of #{max_chain_count} elements. Found in: #{Time.now - start_time}s"
And the following TPL C# puts all my 4 cores to 100% usage and is orders of magnitude slower than the ruby version:
static void Euler14Test()
{
Stopwatch sw = new Stopwatch();
sw.Start();
int max_chain_count = 0;
int max_starter_num = 0;
object locker = new object();
Parallel.For(1, 1000000, i =>
{
int count = 0;
int current = i;
while (current != 1)
{
current = collatz(current);
count++;
}
if (count > max_chain_count)
{
lock (locker)
{
max_chain_count = count;
max_starter_num = i;
}
}
if (i % 1000 == 0)
Console.WriteLine(i);
});
sw.Stop();
Console.WriteLine("Max starter i: {0} -> chain of {1} elements. Found in: {2}s", max_starter_num, max_chain_count, sw.Elapsed.ToString());
}
static int collatz(int num)
{
return num % 2 == 0 ? num / 2 : 3 * num + 1;
}
How come ruby runs faster than C#? I've been told that Ruby is slow. Is that not true when it comes to algorithms?
Perf AFTER correction:
Ruby (Non parallel): 14.62s
C# (Non parallel): 2.22s
C# (With TPL): 0.64s

Actually, the bug is quite subtle, and has nothing to do with threading. The reason that your C# version takes so long is that the intermediate values computed by the collatz method eventually start to overflow the int type, resulting in negative numbers which may then take ages to converge.
This first happens when i is 134,379, for which the 129th term (assuming one-based counting) is 2,482,111,348. This exceeds the maximum value of 2,147,483,647 and therefore gets stored as -1,812,855,948.
To get good performance (and correct results) on the C# version, just change:
int current = i;
…to:
long current = i;
…and:
static int collatz(int num)
…to:
static long collatz(long num)
That will bring down your performance to a respectable 1.5 seconds.
Edit: CodesInChaos raises a very valid point about enabling overflow checking when debugging math-oriented applications. Doing so would have allowed the bug to be immediately identified, since the runtime would throw an OverflowException.

Should be:
Parallel.For(1L, 1000000L, i =>
{
Otherwise, you have integer overfill and start checking negative values. The same collatz method should operate with long values.

I experienced something like that. And I figured out that's because each of your loop iterations need to start other thread and this takes some time, and in this case it's comparable (I think it's more time) than the operations you acctualy do in the loop body.
There is an alternative for that: You can get how many CPU cores you have and than use a parallelism loop with the same number of iterations you have cores, each loop will evaluate part of the acctual loop you want, it's done by making an inner for loop that depends on the parallel loop.
EDIT: EXAMPLE
int start = 1, end = 1000000;
Parallel.For(0, N_CORES, n =>
{
int s = start + (end - start) * n / N_CORES;
int e = n == N_CORES - 1 ? end : start + (end - start) * (n + 1) / N_CORES;
for (int i = s; i < e; i++)
{
// Your code
}
});
You should try this code, I'm pretty sure this will do the job faster.
EDIT: ELUCIDATION
Well, quite a long time since I answered this question, but I faced the problem again and finally understood what's going on.
I've been using AForge implementation of Parallel for loop, and it seems like, it fires a thread for each iteration of the loop, so, that's why if the loop takes relatively a small amount of time to execute, you end up with a inefficient parallelism.
So, as some of you pointed out, System.Threading.Tasks.Parallel methods are based on Tasks, which are kind of a higher level of abstraction of a Thread:
"Behind the scenes, tasks are queued to the ThreadPool, which has been enhanced with algorithms that determine and adjust to the number of threads and that provide load balancing to maximize throughput. This makes tasks relatively lightweight, and you can create many of them to enable fine-grained parallelism."
So yeah, if you use the default library's implementation, you won't need to use this kind of "bogus".

Should I always use Parallel.Foreach because more threads MUST speed up everything?

Does it make sense to you to use for every normal foreach a parallel.foreach loop ?
When should I start using parallel.foreach, only iterating 1,000,000 items?

No, it doesn't make sense for every foreach. Some reasons:
Your code may not actually be parallelizable. For example, if you're using the "results so far" for the next iteration and the order is important)
If you're aggregating (e.g. summing values) then there are ways of using Parallel.ForEach for this, but you shouldn't just do it blindly
If your work will complete very fast anyway, there's no benefit, and it may well slow things down
Basically nothing in threading should be done blindly. Think about where it actually makes sense to parallelize. Oh, and measure the impact to make sure the benefit is worth the added complexity. (It will be harder for things like debugging.) TPL is great, but it's no free lunch.

No, you should definitely not do that. The important point here is not really the number of iterations, but the work to be done. If your work is really simple, executing 1000000 delegates in parallel will add a huge overhead and will most likely be slower than a traditional single threaded solution. You can get around this by partitioning the data, so you execute chunks of work instead.
E.g. consider the situation below:
Input = Enumerable.Range(1, Count).ToArray();
Result = new double[Count];
Parallel.ForEach(Input, (value, loopState, index) => { Result[index] = value*Math.PI; });
The operation here is so simple, that the overhead of doing this in parallel will dwarf the gain of using multiple cores. This code runs significantly slower than a regular foreach loop.
By using a partition we can reduce the overhead and actually observe a gain in performance.
Parallel.ForEach(Partitioner.Create(0, Input.Length), range => {
for (var index = range.Item1; index < range.Item2; index++) {
Result[index] = Input[index]*Math.PI;
}
});
The morale of the story here is that parallelism is hard and you should only employ this after looking closely at the situation at hand. Additionally, you should profile the code both before and after adding parallelism.
Remember that regardless of any potential performance gain parallelism always adds complexity to the code, so if the performance is already good enough, there's little reason to add the complexity.

The short answer is no, you should not just use Parallel.ForEach or related constructs on each loop that you can.
Parallel has some overhead, which is not justified in loops with few, fast iterations. Also, break is significantly more complex inside these loops.
Parallel.ForEach is a request to schedule the loop as the task scheduler sees fit, based on number of iterations in the loop, number of CPU cores on the hardware and current load on that hardware. Actual parallel execution is not always guaranteed, and is less likely if there are fewer cores, the number of iterations is low and/or the current load is high.
See also Does Parallel.ForEach limits the number of active threads? and Does Parallel.For use one Task per iteration?
The long answer:
We can classify loops by how they fall on two axes:
Few iterations through to many iterations.
Each iteration is fast through to each iteration is slow.
A third factor is if the tasks vary in duration very much – for instance if you are calculating points on the Mandelbrot set, some points are quick to calculate, some take much longer.
When there are few, fast iterations it's probably not worth using parallelisation in any way, most likely it will end up slower due to the overheads. Even if parallelisation does speed up a particular small, fast loop, it's unlikely to be of interest: the gains will be small and it's not a performance bottleneck in your application so optimise for readability not performance.
Where a loop has very few, slow iterations and you want more control, you may consider using Tasks to handle them, along the lines of:
var tasks = new List<Task>(actions.Length);
foreach(var action in actions)
{
tasks.Add(Task.Factory.StartNew(action));
}
Task.WaitAll(tasks.ToArray());
Where there are many iterations, Parallel.ForEach is in its element.
The Microsoft documentation states that
When a parallel loop runs, the TPL partitions the data source so that
the loop can operate on multiple parts concurrently. Behind the
scenes, the Task Scheduler partitions the task based on system
resources and workload. When possible, the scheduler redistributes
work among multiple threads and processors if the workload becomes
unbalanced.
This partitioning and dynamic re-scheduling is going to be harder to do effectively as the number of loop iterations decreases, and is more necessary if the iterations vary in duration and in the presence of other tasks running on the same machine.
I ran some code.
The test results below show a machine with nothing else running on it, and no other threads from the .Net Thread Pool in use. This is not typical (in fact in a web-server scenario it is wildly unrealistic). In practice, you may not see any parallelisation with a small number of iterations.
The test code is:
namespace ParallelTests
{
class Program
{
private static int Fibonacci(int x)
{
if (x <= 1)
{
return 1;
}
return Fibonacci(x - 1) + Fibonacci(x - 2);
}
private static void DummyWork()
{
var result = Fibonacci(10);
// inspect the result so it is no optimised away.
// We know that the exception is never thrown. The compiler does not.
if (result > 300)
{
throw new Exception("failed to to it");
}
}
private const int TotalWorkItems = 2000000;
private static void SerialWork(int outerWorkItems)
{
int innerLoopLimit = TotalWorkItems / outerWorkItems;
for (int index1 = 0; index1 < outerWorkItems; index1++)
{
InnerLoop(innerLoopLimit);
}
}
private static void InnerLoop(int innerLoopLimit)
{
for (int index2 = 0; index2 < innerLoopLimit; index2++)
{
DummyWork();
}
}
private static void ParallelWork(int outerWorkItems)
{
int innerLoopLimit = TotalWorkItems / outerWorkItems;
var outerRange = Enumerable.Range(0, outerWorkItems);
Parallel.ForEach(outerRange, index1 =>
{
InnerLoop(innerLoopLimit);
});
}
private static void TimeOperation(string desc, Action operation)
{
Stopwatch timer = new Stopwatch();
timer.Start();
operation();
timer.Stop();
string message = string.Format("{0} took {1:mm}:{1:ss}.{1:ff}", desc, timer.Elapsed);
Console.WriteLine(message);
}
static void Main(string[] args)
{
TimeOperation("serial work: 1", () => Program.SerialWork(1));
TimeOperation("serial work: 2", () => Program.SerialWork(2));
TimeOperation("serial work: 3", () => Program.SerialWork(3));
TimeOperation("serial work: 4", () => Program.SerialWork(4));
TimeOperation("serial work: 8", () => Program.SerialWork(8));
TimeOperation("serial work: 16", () => Program.SerialWork(16));
TimeOperation("serial work: 32", () => Program.SerialWork(32));
TimeOperation("serial work: 1k", () => Program.SerialWork(1000));
TimeOperation("serial work: 10k", () => Program.SerialWork(10000));
TimeOperation("serial work: 100k", () => Program.SerialWork(100000));
TimeOperation("parallel work: 1", () => Program.ParallelWork(1));
TimeOperation("parallel work: 2", () => Program.ParallelWork(2));
TimeOperation("parallel work: 3", () => Program.ParallelWork(3));
TimeOperation("parallel work: 4", () => Program.ParallelWork(4));
TimeOperation("parallel work: 8", () => Program.ParallelWork(8));
TimeOperation("parallel work: 16", () => Program.ParallelWork(16));
TimeOperation("parallel work: 32", () => Program.ParallelWork(32));
TimeOperation("parallel work: 64", () => Program.ParallelWork(64));
TimeOperation("parallel work: 1k", () => Program.ParallelWork(1000));
TimeOperation("parallel work: 10k", () => Program.ParallelWork(10000));
TimeOperation("parallel work: 100k", () => Program.ParallelWork(100000));
Console.WriteLine("done");
Console.ReadLine();
}
}
}
the results on a 4-core Windows 7 machine are:
serial work: 1 took 00:02.31
serial work: 2 took 00:02.27
serial work: 3 took 00:02.28
serial work: 4 took 00:02.28
serial work: 8 took 00:02.28
serial work: 16 took 00:02.27
serial work: 32 took 00:02.27
serial work: 1k took 00:02.27
serial work: 10k took 00:02.28
serial work: 100k took 00:02.28
parallel work: 1 took 00:02.33
parallel work: 2 took 00:01.14
parallel work: 3 took 00:00.96
parallel work: 4 took 00:00.78
parallel work: 8 took 00:00.84
parallel work: 16 took 00:00.86
parallel work: 32 took 00:00.82
parallel work: 64 took 00:00.80
parallel work: 1k took 00:00.77
parallel work: 10k took 00:00.78
parallel work: 100k took 00:00.77
done
Running code Compiled in .Net 4 and .Net 4.5 give much the same results.
The serial work runs are all the same. It doesn't matter how you slice it, it runs in about 2.28 seconds.
The parallel work with 1 iteration is slightly longer than no parallelism at all. 2 items is shorter, so is 3 and with 4 or more iterations is all about 0.8 seconds.
It is using all cores, but not with 100% efficiency. If the serial work was divided 4 ways with no overhead it would complete in 0.57 seconds (2.28 / 4 = 0.57).
In other scenarios I saw no speed-up at all with parallel 2-3 iterations. You do not have fine-grained control over that with Parallel.ForEach and the algorithm may decide to "partition " them into just 1 chunk and run it on 1 core if the machine is busy.

There is no lower limit for doing parallel operations. If you have only 2 items to work on but each one will take a while, it might still make sense to use Parallel.ForEach. On the other hand if you have 1000000 items but they don't do very much, the parallel loop might not go any faster than the regular loop.
For example, I wrote a simple program to time nested loops where the outer loop ran both with a for loop and with Parallel.ForEach. I timed it on my 4-CPU (dual-core, hyperthreaded) laptop.
Here's a run with only 2 items to work on, but each takes a while:
2 outer iterations, 100000000 inner iterations:
for loop: 00:00:00.1460441
ForEach : 00:00:00.0842240
Here's a run with millions of items to work on, but they don't do very much:
100000000 outer iterations, 2 inner iterations:
for loop: 00:00:00.0866330
ForEach : 00:00:02.1303315
The only real way to know is to try it.

In general, once you go above a thread per core, each extra thread involved in an operation will make it slower, not faster.
However, if part of each operation will block (the classic example being waiting on disk or network I/O, another being producers and consumers that are out of synch with each other) then more threads than cores can begin to speed things up again, because tasks can be done while other threads are unable to make progress until the I/O operation returns.
For this reason, when single-core machines were the norm, the only real justifications in multi-threading were when either there was blocking of the sort I/O introduces or else to improve responsiveness (slightly slower to perform a task, but much quicker to start responding to user-input again).
Still, these days single-core machines are increasingly rare, so it would appear that you should be able to make everything at least twice as fast with parallel processing.
This will still not be the case if order is important, or something inherent to the task forces it to have a synchronised bottleneck, or if the number of operations is so small that the increase in speed from parallel processing is outweighed by the overheads involved in setting up that parallel processing. It may or may not be the case if a share resource requires threads to block on other threads performing the same parallel operation (depending on the degree of lock contention).
Also, if your code is inherently multithreaded to begin with, you can be in a situation where you are essentially competing for resources with yourself (a classic case being ASP.NET code handling simultaneous requests). Here the advantage in parallel operation may mean that a single test operation on a 4-core machine approaches 4 times the performance, but once the number of requests needing the same task to be performed reaches 4, then since each of those 4 requests are each trying to use each core, it becomes little better than if they had a core each (perhaps slightly better, perhaps slightly worse). The benefits of parallel operation hence disappears as the use changes from a single-request test to a real-world multitude of requests.

You shouldn't blindly replace every single foreach loop in your application with the parallel foreach. More threads doesn't necessary mean that your application will work faster. You need to slice the task into smaller tasks which could run in parallel if you want to really benefit from multiple threads. If your algorithm is not parallelizable you won't get any benefit.

No. You need to understand what the code is doing and whether it is amenable to parallelization. Dependencies between your data items can make it hard to parallelize, i.e., if a thread uses the value calculated for the previous element it has to wait until the value is calculated anyway and can't run in parallel. You also need to understand your target architecture, though, you will typically have a multicore CPU on just about anything you buy these days. Even on a single core, you can get some benefits from more threads but only if you have some blocking tasks. You should also keep in mind that there is overhead in creating and organizing the parallel threads. If this overhead is a significant fraction of (or more than) the time your task takes you could slow it down.

These are my benchmarks showing pure serial is slowest, along with various levels of partitioning.
class Program
{
static void Main(string[] args)
{
NativeDllCalls(true, 1, 400000000, 0); // Seconds: 0.67 |) 595,203,995.01 ops
NativeDllCalls(true, 1, 400000000, 3); // Seconds: 0.91 |) 439,052,826.95 ops
NativeDllCalls(true, 1, 400000000, 4); // Seconds: 0.80 |) 501,224,491.43 ops
NativeDllCalls(true, 1, 400000000, 8); // Seconds: 0.63 |) 635,893,653.15 ops
NativeDllCalls(true, 4, 100000000, 0); // Seconds: 0.35 |) 1,149,359,562.48 ops
NativeDllCalls(true, 400, 1000000, 0); // Seconds: 0.24 |) 1,673,544,236.17 ops
NativeDllCalls(true, 10000, 40000, 0); // Seconds: 0.22 |) 1,826,379,772.84 ops
NativeDllCalls(true, 40000, 10000, 0); // Seconds: 0.21 |) 1,869,052,325.05 ops
NativeDllCalls(true, 1000000, 400, 0); // Seconds: 0.24 |) 1,652,797,628.57 ops
NativeDllCalls(true, 100000000, 4, 0); // Seconds: 0.31 |) 1,294,424,654.13 ops
NativeDllCalls(true, 400000000, 0, 0); // Seconds: 1.10 |) 364,277,890.12 ops
}
static void NativeDllCalls(bool useStatic, int nonParallelIterations, int parallelIterations = 0, int maxParallelism = 0)
{
if (useStatic) {
Iterate<string, object>(
(msg, cntxt) => {
ServiceContracts.ForNativeCall.SomeStaticCall(msg);
}
, "test", null, nonParallelIterations,parallelIterations, maxParallelism );
}
else {
var instance = new ServiceContracts.ForNativeCall();
Iterate(
(msg, cntxt) => {
cntxt.SomeCall(msg);
}
, "test", instance, nonParallelIterations, parallelIterations, maxParallelism);
}
}
static void Iterate<T, C>(Action<T, C> action, T testMessage, C context, int nonParallelIterations, int parallelIterations=0, int maxParallelism= 0)
{
var start = DateTime.UtcNow;
if(nonParallelIterations == 0)
nonParallelIterations = 1; // normalize values
if(parallelIterations == 0)
parallelIterations = 1;
if (parallelIterations > 1) {
ParallelOptions options;
if (maxParallelism == 0) // default max parallelism
options = new ParallelOptions();
else
options = new ParallelOptions { MaxDegreeOfParallelism = maxParallelism };
if (nonParallelIterations > 1) {
Parallel.For(0, parallelIterations, options
, (j) => {
for (int i = 0; i < nonParallelIterations; ++i) {
action(testMessage, context);
}
});
}
else { // no nonParallel iterations
Parallel.For(0, parallelIterations, options
, (j) => {
action(testMessage, context);
});
}
}
else {
for (int i = 0; i < nonParallelIterations; ++i) {
action(testMessage, context);
}
}
var end = DateTime.UtcNow;
Console.WriteLine("\tSeconds: {0,8:0.00} |) {1,16:0,000.00} ops",
(end - start).TotalSeconds, (Math.Max(parallelIterations, 1) * nonParallelIterations / (end - start).TotalSeconds));
}
}

We Keep Coding

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.