Parallel.ForEach slower than normal foreach - c#

I'm playing around with Parallel.ForEach in a C# console application, but can't seem to get it right. I'm creating an array with random numbers, and I have a sequential foreach and a Parallel.ForEach that each find the largest value in the array. With approximately the same code in C++, I started to see a benefit from using several threads at 3M values in the array. But the Parallel.ForEach is twice as slow even at 100M values. What am I doing wrong?
class Program
{
    static void Main(string[] args)
    {
        dostuff();
    }
    static void dostuff() {
        Console.WriteLine("How large do you want the array to be?");
        int size = int.Parse(Console.ReadLine());
        int[] arr = new int[size];
        Random rand = new Random();
        for (int i = 0; i < size; i++)
        {
            arr[i] = rand.Next(0, int.MaxValue);
        }
        var watchSeq = System.Diagnostics.Stopwatch.StartNew();
        var largestSeq = FindLargestSequentially(arr);
        var elapsedSeq = watchSeq.ElapsedMilliseconds;
        Console.WriteLine("Finished sequential in: " + elapsedSeq + "ms. Largest = " + largestSeq);
        var watchPar = System.Diagnostics.Stopwatch.StartNew();
        var largestPar = FindLargestParallel(arr);
        var elapsedPar = watchPar.ElapsedMilliseconds;
        Console.WriteLine("Finished parallel in: " + elapsedPar + "ms Largest = " + largestPar);
    }
    static int FindLargestSequentially(int[] arr) {
        int largest = arr[0];
        foreach (int i in arr) {
            if (largest < i) {
                largest = i;
            }
        }
        return largest;
    }
    static int FindLargestParallel(int[] arr) {
        int largest = arr[0];
        Parallel.ForEach<int, int>(arr, () => 0, (i, loop, subtotal) =>
        {
            if (i > subtotal)
                subtotal = i;
            return subtotal;
        },
        (finalResult) => {
            Console.WriteLine("Thread finished with result: " + finalResult);
            if (largest < finalResult) largest = finalResult;
        });
        return largest;
    }
}

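This is a performance consequence of having a very small delegate body: invoking the delegate for each element costs more than the single comparison it performs.
We can achieve better performance by partitioning the array into ranges, so that each invocation of the body delegate processes a large chunk of data.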
static int FindLargestParallelRange(int[] arr)
{
    object locker = new object();
    int largest = arr[0];
    Parallel.ForEach(Partitioner.Create(0, arr.Length), () => arr[0], (range, loop, subtotal) =>
    {
        for (int i = range.Item1; i < range.Item2; i++)
            if (arr[i] > subtotal)
                subtotal = arr[i];
        return subtotal;
    },
    (finalResult) =>
    {
        lock (locker)
            if (largest < finalResult)
                largest = finalResult;
    });
    return largest;
}
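Take care to synchronize the localFinally delegate, since it may run concurrently on several tasks. Also note the need to initialize localInit properly: () => arr[0] instead of () => 0, because 0 is not a valid starting maximum for an array that contains only negative values.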
Partitioning with PLINQ:
static int FindLargestPlinqRange(int[] arr)
{
    return Partitioner.Create(0, arr.Length)
        .AsParallel()
        .Select(range =>
        {
            int largest = arr[0];
            for (int i = range.Item1; i < range.Item2; i++)
                if (arr[i] > largest)
                    largest = arr[i];
            return largest;
        })
        .Max();
}
I highly recommend the free book Patterns of Parallel Programming by Stephen Toub.

As the other answerers have mentioned, the action you're trying to perform against each item here is so insignificant that there are a variety of other factors which end up carrying more weight than the actual work you're doing. These may include:
JIT optimizations
CPU branch prediction
I/O (outputting thread results while the timer is running)
the cost of invoking delegates
the cost of task management
the system incorrectly guessing what thread strategy will be optimal
memory/cpu caching
memory pressure
environment (debugging)
Running each approach a single time is not an adequate way to test, because it enables a number of the above factors to weigh more heavily on one iteration than on another. You should start with a more robust benchmarking strategy.
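As a rough illustration of what a more robust measurement could look like, here is a minimal sketch that warms the code up first and then reports the median of several timed runs. The MiniBenchmark class, its method name, and the default warm-up and run counts are all made up for this example; it is not a replacement for a real benchmarking library.

```csharp
using System;
using System.Diagnostics;

public static class MiniBenchmark
{
    // Runs `action` a few times untimed (warm-up, so JIT compilation is not measured),
    // then `runs` timed iterations, and returns the median elapsed time in milliseconds.
    public static double MedianMilliseconds(Action action, int warmups = 3, int runs = 15)
    {
        if (action == null) throw new ArgumentNullException(nameof(action));
        if (runs < 1) throw new ArgumentOutOfRangeException(nameof(runs));

        for (int i = 0; i < warmups; i++)
            action();

        var samples = new double[runs];
        for (int i = 0; i < runs; i++)
        {
            // Start every timed run from a comparable heap state.
            GC.Collect();
            GC.WaitForPendingFinalizers();
            GC.Collect();

            var sw = Stopwatch.StartNew();
            action();
            sw.Stop();
            samples[i] = sw.Elapsed.TotalMilliseconds;
        }

        // The median is less sensitive to a single slow outlier than the mean.
        Array.Sort(samples);
        int mid = runs / 2;
        return runs % 2 == 1 ? samples[mid] : (samples[mid - 1] + samples[mid]) / 2.0;
    }
}
```

For instance, MiniBenchmark.MedianMilliseconds(() => FindLargestSequentially(arr)) could be compared against the same call with FindLargestParallel, ideally in a Release build without a debugger attached and with the Console.WriteLine removed from the timed code.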
Furthermore, your implementation is actually dangerously incorrect. The documentation specifically says:
The localFinally delegate is invoked once per task to perform a final action on each task’s local state. This delegate might be invoked concurrently on multiple tasks; therefore, you must synchronize access to any shared variables.
You have not synchronized your final delegate, so your function is prone to race conditions that would make it produce incorrect results.
As in most cases, the best approach to this one is to take advantage of work done by people smarter than we are. In my testing, the following approach appears to be the fastest overall:
return arr.AsParallel().Max();

The Parallel Foreach loop should be running slower because the algorithm used is not parallel and a lot more work is being done to run this algorithm.
In the single-threaded version, to find the max value, we can take the first number as our max value and compare it to every other number in the array. If one of the numbers is larger than our current max, we replace the max and continue. This way we access each number in the array once, for a total of N comparisons.
In the Parallel loop above, the algorithm creates overhead because each operation is wrapped inside a function call with a return value. So in addition to doing the comparisons, it is running overhead of adding and removing these calls onto the call stack. In addition, since each call is dependent on the value of the function call before, it needs to run in sequence.
In the Parallel For Loop below, the array is divided into an explicit number of threads determined by the variable threadNumber. This limits the overhead of function calls to a low number.
Note that for small arrays the parallel loop performs slower. However, at 100M elements, there is a decrease in elapsed time.
static int FindLargestParallel(int[] arr)
{
    var answers = new ConcurrentBag<int>();
    int threadNumber = 4;
    int partitionSize = arr.Length/threadNumber;
    Parallel.For(0, /* starting number */
        threadNumber+1, /* Adding 1 to threadNumber in case array.Length not evenly divisible by threadNumber */
        i =>
        {
            if (i*partitionSize < arr.Length) /* check in case # in array is divisible by # threads */
            {
                var max = arr[i*partitionSize];
                for (var x = i*partitionSize;
                    x < (i + 1)*partitionSize && x < arr.Length;
                    ++x)
                {
                    if (arr[x] > max)
                        max = arr[x];
                }
                answers.Add(max); /* one result per partition */
            }
        });
    /* note the shortcut in finding max in the bag */
    return answers.Max(i=>i);
}

Some thoughts here: In the parallel case, there is thread management logic involved that determines how many threads it wants to use. This thread management logic presumably runs on your main thread. Every time a thread returns with the new maximum value, the management logic kicks in and determines the next work item (the next number to process in your array). I'm pretty sure that this requires some kind of locking. In any case, determining the next item may even cost more than performing the comparison operation itself.
That sounds like an order of magnitude more work (overhead) to me than a single thread that processes one number after the other. In the single-threaded case there are a number of optimizations at play: no bounds checks, the CPU can load data into its first-level cache, etc. I'm not sure which of these optimizations apply in the parallel case.
Keep in mind that on a typical desktop machine there are only 2 to 4 physical CPU cores available, so you will never have more than that actually doing work. So if the parallel processing overhead is more than 2-4 times that of a single-threaded operation, the parallel version will inevitably be slower, which is what you are observing.
Have you attempted to run this on a 32 core machine? ;-)
A better solution would be to determine non-overlapping ranges (start + stop index) covering the entire array and let each parallel task process one range. This way, each parallel task can internally do a tight single-threaded loop and only return once the entire range has been processed. You could probably even determine a near-optimal number of ranges based on the number of logical cores of the machine. I haven't tried this, but I'm pretty sure you will see an improvement over the single-threaded case.
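The range idea can be sketched roughly as follows. This is only an illustration: the RangeMax class and its rangeCount parameter are hypothetical names, and defaulting to Environment.ProcessorCount ranges is an assumption, not a measured optimum.

```csharp
using System;
using System.Threading.Tasks;

public static class RangeMax
{
    // Splits arr into contiguous, non-overlapping ranges, finds each range's maximum
    // in a tight sequential loop, then combines the per-range results on the caller.
    public static int FindLargest(int[] arr, int rangeCount = 0)
    {
        if (arr == null || arr.Length == 0)
            throw new ArgumentException("Array must not be empty.", nameof(arr));

        if (rangeCount <= 0)
            rangeCount = Environment.ProcessorCount;   // number of logical cores
        rangeCount = Math.Min(rangeCount, arr.Length); // guarantees every range is non-empty

        var partial = new int[rangeCount];             // one slot per range, so no locking is needed
        Parallel.For(0, rangeCount, r =>
        {
            // long arithmetic avoids overflow in arr.Length * r for very large arrays
            int start = (int)((long)arr.Length * r / rangeCount);
            int stop = (int)((long)arr.Length * (r + 1) / rangeCount);

            int max = arr[start];
            for (int i = start + 1; i < stop; i++)
                if (arr[i] > max)
                    max = arr[i];
            partial[r] = max;
        });

        int largest = partial[0];
        for (int k = 1; k < rangeCount; k++)
            if (partial[k] > largest)
                largest = partial[k];
        return largest;
    }
}
```

Because each task writes only to its own slot in partial, the final merge can run on the calling thread without any synchronization.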

Try splitting the set into batches and running the batches in parallel, where the number of batches corresponds to your number of CPU cores.
I ran some equations 1K, 10K and 1M times using the following methods:
A "for" loop.
A "Parallel.For" from the System.Threading.Tasks lib, across the entire set.
A "Parallel.For" across 4 batches.
A "Parallel.ForEach" from the System.Threading.Tasks lib, across the entire set.
A "Parallel.ForEach" across 4 batches.
Results: (Measured in seconds)
Processing batches in parallel using the "Parallel.ForEach" has the best outcome in cases above 10K records. I believe the batching helps because it utilizes all CPU cores (4 in this example), but also minimizes the amount of threading overhead associated with parallelization.
Here is my code:
public void ParallelSpeedTest()
{
    var rnd = new Random(56);
    int range = 1000000;
    int numberOfCores = 4;
    int batchSize = range / numberOfCores;
    int[] rangeIndexes = Enumerable.Range(0, range).ToArray();
    double[] inputs = rangeIndexes.Select(n => rnd.NextDouble()).ToArray();
    double[] weights = rangeIndexes.Select(n => rnd.NextDouble()).ToArray();
    double[] outputs = new double[rangeIndexes.Length];
    /// Series "for"...
    var startTimeSeries = DateTime.Now;
    for (var i = 0; i < range; i++)
    {
        outputs[i] = Math.Sqrt(Math.Pow(inputs[i] * weights[i], 2));
    }
    var durationSeries = DateTime.Now - startTimeSeries;
    /// "Parallel.For"...
    var startTimeParallel = DateTime.Now;
    Parallel.For(0, range, (i) => {
        outputs[i] = Math.Sqrt(Math.Pow(inputs[i] * weights[i], 2));
    });
    var durationParallelFor = DateTime.Now - startTimeParallel;
    /// "Parallel.For" in Batches...
    var startTimeParallel2 = DateTime.Now;
    Parallel.For(0, numberOfCores, (c) => {
        var endValue = (c == numberOfCores - 1) ? range : (c + 1) * batchSize;
        var startValue = c * batchSize;
        for (var i = startValue; i < endValue; i++)
        {
            outputs[i] = Math.Sqrt(Math.Pow(inputs[i] * weights[i], 2));
        }
    });
    var durationParallelForBatches = DateTime.Now - startTimeParallel2;
    /// "Parallel.ForEach"...
    var startTimeParallelForEach = DateTime.Now;
    Parallel.ForEach(rangeIndexes, (i) => {
        outputs[i] = Math.Sqrt(Math.Pow(inputs[i] * weights[i], 2));
    });
    var durationParallelForEach = DateTime.Now - startTimeParallelForEach;
    /// Parallel.ForEach in Batches...
    List<Tuple<int,int>> ranges = new List<Tuple<int, int>>();
    for (var i = 0; i < numberOfCores; i++)
    {
        int start = i * batchSize;
        int end = (i == numberOfCores - 1) ? range : (i + 1) * batchSize;
        ranges.Add(new Tuple<int,int>(start, end));
    }
    var startTimeParallelBatches = DateTime.Now;
    // The lambda parameter is named r because "range" is already a local variable here.
    Parallel.ForEach(ranges, (r) => {
        // The upper bound must be r.Item2; with r.Item1 the loop body would never execute.
        for (var i = r.Item1; i < r.Item2; i++) {
            outputs[i] = Math.Sqrt(Math.Pow(inputs[i] * weights[i], 2));
        }
    });
    var durationParallelForEachBatches = DateTime.Now - startTimeParallelBatches;
    Debug.Print($"Given: Set-size: {range}, number-of-batches: {numberOfCores}, batch-size: {batchSize}");
    Debug.Print($"Series For: {durationSeries}");
    Debug.Print($"Parallel For: {durationParallelFor}");
    Debug.Print($"Parallel For Batches: {durationParallelForBatches}");
    Debug.Print($"Parallel ForEach: {durationParallelForEach}");
    Debug.Print($"Parallel ForEach Batches: {durationParallelForEachBatches}");
}


How can I determine whether a parallel foreach loop is going to have better performance than a foreach loop?

I just did a simple test in .NET Fiddle of sorting 100 random integer arrays of length 1000 and seeing whether doing so with a Parallel.ForEach loop is faster than a plain old foreach loop.
Here is my code (I put this together fast, so please ignore the repetition and overall bad look of the code)
using System;
using System.Net;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;
using System.Linq;
public class Program
{
    public static int[] RandomArray(int minval, int maxval, int arrsize)
    {
        Random randNum = new Random();
        int[] rand = Enumerable
            .Repeat(0, arrsize)
            .Select(i => randNum.Next(minval, maxval))
            .ToArray();
        return rand;
    }
    public static void SortOneThousandArraysSync()
    {
        var arrs = new List<int[]>(100);
        for(int i = 0; i < 100; ++i)
            arrs.Add(RandomArray(Int32.MinValue, Int32.MaxValue, 1000));
        Parallel.ForEach(arrs, (arr) =>
        {
            Array.Sort(arr);
        });
    }
    public static void SortOneThousandArraysAsync()
    {
        var arrs = new List<int[]>(100);
        for(int i = 0; i < 100; ++i)
            arrs.Add(RandomArray(Int32.MinValue, Int32.MaxValue, 1000));
        foreach(var arr in arrs)
        {
            Array.Sort(arr);
        }
    }
    public static void Main()
    {
        var start = DateTime.Now;
        SortOneThousandArraysSync(); // assumed call; the timed statement is missing from the listing
        var end = DateTime.Now;
        Console.WriteLine("t1 = " + (end - start).ToString());
        start = DateTime.Now;
        SortOneThousandArraysAsync(); // assumed call, as above
        end = DateTime.Now;
        Console.WriteLine("t2 = " + (end - start).ToString());
    }
}
and here are the results after hitting Run twice:
t1 = 00:00:00.0156244
t2 = 00:00:00.0156243
t1 = 00:00:00.0467854
t2 = 00:00:00.0156246
So, sometimes it's faster and sometimes it's about the same.
Possible explanations:
The random arrays were "more unsorted" for the sync one versus the async one in the 2nd test I ran
It has something to do with the processes running on .NET Fiddle. In the first case the parallel one basically ran like a non-parallel operation because there weren't any threads for my fiddle to take over. (Or something like that)
You should only use Parallel.ForEach() if the code within the loop takes a significant amount of time to execute. In this case, it takes more time to create multiple threads, sort the array, and then combine the result onto one thread than it is to simply sort it on a single thread. For example, the Parallel.ForEach() in the following code snippet takes less time to execute than the normal ForEach loop:
public static void Main(string[] args)
{
    var numbers = Enumerable.Range(1, 10000);
    Parallel.ForEach(numbers, n => Factorial(n));
    foreach (var number in numbers)
        Factorial(number);
}
private static int Factorial(int number)
{
    if (number == 1 || number == 0)
        return 1;
    return number * Factorial(number - 1);
}
However, if I change var numbers = Enumerable.Range(1, 10000); to var numbers = Enumerable.Range(1, 1000);, the ForEach loop is faster than Parallel.ForEach().
When working with small tasks (which don't take a significant amount of time to execute), have a look at the Partitioner class; in your case:
public static void SortOneThousandArraysAsyncWithPart() {
    var arrs = new List<int[]>(100);
    for (int i = 0; i < 100; ++i)
        arrs.Add(RandomArray(Int32.MinValue, Int32.MaxValue, 1000));
    // Let's spread the tasks between threads manually with a help of Partitioner.
    // We don't want task stealing and other optimizations: just split the
    // list between 8 (on my workstation) threads and run them
    Parallel.ForEach(Partitioner.Create(0, 100), part => {
        for (int i = part.Item1; i < part.Item2; ++i)
            Array.Sort(arrs[i]);
    });
}
I get the following results (i7 3.2GHz, 4 cores with HT, .NET 4.6, x64), averaged over 100 runs:
0.0081 Async (foreach)
0.0119 Parallel.ForEach
0.0084 Parallel.ForEach + Partitioner
As you can see, foreach is still on top, but Parallel.ForEach + Partitioner comes very close to the winner.
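If the default chunking is not what you want, Partitioner.Create also has an overload that takes an explicit range size. A small sketch, assuming a hypothetical helper called SortInChunks and a chunk size you would have to tune yourself:

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

public static class PartitionedSort
{
    // Sorts every array in `arrays`, handing out work in chunks of `chunkSize` indices,
    // so the delegate runs once per chunk rather than once per array.
    // Returns the number of delegate invocations.
    public static int SortInChunks(IList<int[]> arrays, int chunkSize)
    {
        if (arrays == null) throw new ArgumentNullException(nameof(arrays));
        if (chunkSize < 1) throw new ArgumentOutOfRangeException(nameof(chunkSize));
        if (arrays.Count == 0) return 0; // Partitioner.Create rejects an empty range

        int invocations = 0;
        Parallel.ForEach(Partitioner.Create(0, arrays.Count, chunkSize), part =>
        {
            Interlocked.Increment(ref invocations);
            for (int i = part.Item1; i < part.Item2; ++i)
                Array.Sort(arrays[i]);
        });
        return invocations;
    }
}
```

With 100 arrays and a chunk size of 25, the delegate runs 4 times instead of 100.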
Checking performance of algorithms is a tricky business, and performance at small scale can easily be affected by a variety of factors external to your code. Please see my answer to an almost-duplicate question here for an in-depth explanation, plus some links to benchmarking templates that you can adapt to better measure your algorithm's performance.

Performance penalty between for loop and Parallel.For() with MaxDegreeOfParallelism of 1

I want to do something like the following:
int firstLoopMaxThreads = 1; // or -1
int secondLoopMaxThreads = firstLoopMaxThreads == 1 ? -1 : 1;
Parallel.For(0, m, new ParallelOptions() { MaxDegreeOfParallelism = firstLoopMaxThreads }, i =>
{
    //do some processor and/or memory-intensive stuff
    Parallel.For(0, n, new ParallelOptions() { MaxDegreeOfParallelism = secondLoopMaxThreads }, j =>
    {
        //do some other processor and/or memory-intensive stuff
    });
});
Would it be worth it, performance-wise, to swap the inner Parallel.For loop with a normal for loop when secondLoopMaxThreads = 1? What is the performance difference between a regular for loop and a Parallel.For loop with MaxDegreeOfParallelism = 1?
It depends on how many iterations you're talking about and what level of performance you're talking about to answer whether it is worth it or not. In your context is 1ms considered a lot, or a little?
I did a rudimentary test as below (Thread.Sleep is not entirely accurate, although the for loop measured 15,000ms to within 1ms every time). Over 15,000 iterations, repeated 5 times, Parallel.For generally added about 4ms of overhead compared to a standard for loop, but of course results will differ depending on the environment.
for (int z = 0; z < 5; z++)
{
    int iterations = 15000;
    Stopwatch s = Stopwatch.StartNew();
    for (int i = 0; i < iterations; i++)
        Thread.Sleep(1);
    Console.WriteLine("#{0}:Elapsed (for): {1:#,0}ms", z, ((double)s.ElapsedTicks / (double)Stopwatch.Frequency) * 1000);
    var options = new ParallelOptions() { MaxDegreeOfParallelism = 1 };
    s = Stopwatch.StartNew();
    Parallel.For(0, iterations, options, (i) => Thread.Sleep(1));
    Console.WriteLine("#{0}: Elapsed (parallel): {1:#,0}ms", z, ((double)s.ElapsedTicks / (double)Stopwatch.Frequency) * 1000);
}
The loop body performs equally well in both versions but the loop itself is drastically slower with Parallel.For even for single-threaded execution. Each element needs to call a delegate. This is very much slower than incrementing a loop counter.
If your loop body does anything meaningful the loop overhead will be dwarfed by useful work. Just ensure that your work items are not too small and you won't notice a difference.
Nesting parallel loops is rarely a good idea. A single parallel loop is usually good enough, provided the work items are neither too small nor too big.
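If you want to avoid Parallel.For entirely in the single-threaded case, one option is a small wrapper that picks the loop at run time. This is a sketch with a hypothetical Loops.For helper, not something built into the framework. Note that it still invokes a delegate per iteration; only the Parallel.For machinery is skipped.

```csharp
using System;
using System.Threading.Tasks;

public static class Loops
{
    // Runs body(i) for every i in [fromInclusive, toExclusive).
    // With maxDegreeOfParallelism == 1 a plain for loop is used;
    // any other value is passed on to Parallel.For via ParallelOptions.
    public static void For(int fromInclusive, int toExclusive, int maxDegreeOfParallelism, Action<int> body)
    {
        if (body == null) throw new ArgumentNullException(nameof(body));

        if (maxDegreeOfParallelism == 1)
        {
            for (int i = fromInclusive; i < toExclusive; i++)
                body(i);
            return;
        }

        // ParallelOptions accepts -1 (no limit) or a positive value.
        var options = new ParallelOptions { MaxDegreeOfParallelism = maxDegreeOfParallelism };
        Parallel.For(fromInclusive, toExclusive, options, body);
    }
}
```

In the question's setup, the inner loop could then be written as Loops.For(0, n, secondLoopMaxThreads, j => { ... }). When the body is tiny, writing the plain for loop inline avoids the remaining delegate call as well.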

Why .NET group by is (much) slower when the number of buckets grows

Given this simple piece of code and an array of 10 million random numbers:
static int Main(string[] args)
{
    int size = 10000000;
    int num = 10; //increase num to reduce number of buckets
    int numOfBuckets = size/num;
    int[] ar = new int[size];
    Random r = new Random(); //initialize with random numbers
    for (int i = 0; i < size; i++)
        ar[i] = r.Next(size);
    var s = new Stopwatch();
    s.Start(); // assumed; the listing creates the stopwatch but the start/stop calls are missing
    var group = ar.GroupBy(i => i / num);
    var l = group.Count();
    s.Stop();
    Console.WriteLine(s.ElapsedMilliseconds);
    return 0;
}
I did some performance measurements on grouping: when the number of buckets is 10k the execution time is about 0.7s, for 100k buckets it is 2s, and for 1m buckets it is 7.5s.
I wonder why that is. I imagine that if GroupBy is implemented using a hash table, there might be a problem with collisions. For example, initially the hash table is prepared to work for, let's say, 1000 groups, and then when the number of groups grows it needs to increase its size and rehash. If this were the case, I could write my own grouping where I initialize the hash table with the expected number of buckets. I did that, but it was only slightly faster.
So my question is, why number of buckets influences groupBy performance that much?
Running in release mode changes the results to 0.55s, 1.6s and 6.5s respectively.
I also changed the group.ToArray to the piece of code below, just to force execution of the grouping:
foreach (var g in group)
array[g.Key] = 1;
where array is initialized before timer with appropriate size, the results stayed almost the same.
You can see the working code from mellamokb in here
I'm pretty certain this is showing the effects of memory locality (various levels of caching) and also object allocation.
To verify this, I took three steps:
Improve the benchmarking to avoid unnecessary parts and to garbage collect between tests
Remove the LINQ part by populating a Dictionary (which is effectively what GroupBy does behind the scenes)
Remove even Dictionary<,> and show the same trend for plain arrays.
In order to show this for arrays, I needed to increase the input size, but it does show the same kind of growth.
Here's a short but complete program which can be used to test both the dictionary and the array side - just flip which line is commented out in the middle:
using System;
using System.Collections.Generic;
using System.Diagnostics;
class Test
{
    const int Size = 100000000;
    const int Iterations = 3;
    static void Main()
    {
        int[] input = new int[Size];
        // Use the same seed for repeatability
        var rng = new Random(0);
        for (int i = 0; i < Size; i++)
        {
            input[i] = rng.Next(Size);
        }
        // Switch to PopulateArray to change which method is tested
        Func<int[], int, TimeSpan> test = PopulateDictionary;
        for (int buckets = 10; buckets <= Size; buckets *= 10)
        {
            TimeSpan total = TimeSpan.Zero;
            for (int i = 0; i < Iterations; i++)
            {
                // Switch which line is commented to change the test
                // total += PopulateDictionary(input, buckets);
                total += PopulateArray(input, buckets);
                // Garbage collect between tests (assumed placement; the calls are missing from the listing)
                GC.Collect();
                GC.WaitForPendingFinalizers();
            }
            Console.WriteLine("{0,9}: {1,7}ms", buckets, (long) total.TotalMilliseconds);
        }
    }
    static TimeSpan PopulateDictionary(int[] input, int buckets)
    {
        int divisor = input.Length / buckets;
        var dictionary = new Dictionary<int, int>(buckets);
        var stopwatch = Stopwatch.StartNew();
        foreach (var item in input)
        {
            int key = item / divisor;
            int count;
            dictionary.TryGetValue(key, out count);
            count++;
            dictionary[key] = count;
        }
        return stopwatch.Elapsed;
    }
    static TimeSpan PopulateArray(int[] input, int buckets)
    {
        int[] output = new int[buckets];
        int divisor = input.Length / buckets;
        var stopwatch = Stopwatch.StartNew();
        foreach (var item in input)
        {
            int key = item / divisor;
            output[key]++;
        }
        return stopwatch.Elapsed;
    }
}
Results on my machine:
Dictionary:
       10:   10500ms
      100:   10556ms
     1000:   10557ms
    10000:   11303ms
   100000:   15262ms
  1000000:   54037ms
 10000000:   64236ms // Why is this slower? See later.
100000000:   56753ms
Array:
       10:    1298ms
      100:    1287ms
     1000:    1290ms
    10000:    1286ms
   100000:    1357ms
  1000000:    2717ms
 10000000:    5940ms
100000000:    7870ms
An earlier version of PopulateDictionary used an Int32Holder class, and created one for each bucket (when the lookup in the dictionary failed). This was faster when there was a small number of buckets (presumably because we were only going through the dictionary lookup path once per iteration instead of twice) but got significantly slower, and ended up running out of memory. This would contribute to fragmented memory access as well, of course. Note that PopulateDictionary specifies the capacity to start with, to avoid effects of data copying within the test.
The aim of using the PopulateArray method is to remove as much framework code as possible, leaving less to the imagination. I haven't yet tried using an array of a custom struct (with various different struct sizes) but that may be something you'd like to try too.
EDIT: I can reproduce the oddity of the slower result for 10000000 than 100000000 at will, regardless of test ordering. I don't understand why yet. It may well be specific to the exact processor and cache I'm using...
The reason why 10000000 is slower than the 100000000 results has to do with the way hashing works. A few more tests explain this.
First off, let's look at the operations. There's Dictionary.FindEntry, which is used in the [] indexing and in Dictionary.TryGetValue, and there's Dictionary.Insert, which is used in the [] indexing and in Dictionary.Add. If we would just do a FindEntry, the timings would go up as we expect it:
static TimeSpan PopulateDictionary1(int[] input, int buckets)
{
    int divisor = input.Length / buckets;
    var dictionary = new Dictionary<int, int>(buckets);
    var stopwatch = Stopwatch.StartNew();
    foreach (var item in input)
    {
        int key = item / divisor;
        int count;
        dictionary.TryGetValue(key, out count);
    }
    return stopwatch.Elapsed;
}
This implementation doesn't have to deal with hash collisions (because there are none), which makes it behave as we expect. Once we start dealing with collisions, the timings start to drop. If we have as many buckets as elements, there are obviously fewer collisions. We can figure out exactly how many collisions there are by doing:
static TimeSpan PopulateDictionary(int[] input, int buckets)
{
    int divisor = input.Length / buckets;
    int c1, c2;
    c1 = c2 = 0;
    var dictionary = new Dictionary<int, int>(buckets);
    var stopwatch = Stopwatch.StartNew();
    foreach (var item in input)
    {
        int key = item / divisor;
        int count;
        if (!dictionary.TryGetValue(key, out count))
        {
            dictionary.Add(key, 1);
            ++c1; // new key added (counter updates assumed; they are missing from the listing)
        }
        else
        {
            count++;
            dictionary[key] = count;
            ++c2; // existing key updated
        }
    }
    Console.WriteLine("{0}:{1}", c1, c2);
    return stopwatch.Elapsed;
}
The result is something like this:
10: 4683ms
100: 4946ms
1000: 4732ms
10000: 4964ms
100000: 7033ms
1000000: 22038ms
9999538:90000462 <<-
10000000: 26104ms
63196841:36803159 <<-
100000000: 25045ms
Note the value 36803159. This answers the question of why the last result is faster than the one before it: it simply has to do fewer operations, and since caching fails anyway, that factor no longer makes a difference.
10k the estimated execution time is 0.7s, for 100k buckets it is 2s, for 1m buckets it is 7.5s.
This is an important pattern to recognize when you profile code. It is one of the standard size vs execution time relationships in software algorithms. Just from seeing the behavior, you can tell a lot about the way the algorithm was implemented. And the other way around of course, from the algorithm you can predict the expected execution time. A relationship that's annotated in the Big Oh notation.
Speediest code you can get is amortized O(1), execution time barely increases when you double the size of the problem. The Dictionary<> class behaves that way, as John demonstrated. The increases in time as the problem set gets large is the "amortized" part. A side-effect of Dictionary having to perform linear O(n) searches in buckets that keep getting bigger.
A very common pattern is O(n). That tells you that there is a single for() loop in the algorithm that iterates over the collection. O(n^2) tells you there are two nested for() loops. O(n^3) has three, etcetera.
What you got is the one in between, O(log n). It is the standard complexity of a divide-and-conquer algorithm. In other words, each pass splits the problem in two, continuing with the smaller set. This is very common; you see it in sorting algorithms, and binary search is the example you find in your textbook. Note how log₂(10) = 3.3, very close to the increment you see in your test. Perf starts to tank a bit for very large sets due to poor locality of reference, a CPU cache problem that's always associated with O(log n) algorithms.
The one thing that John's answer demonstrates is that his guess cannot be correct, GroupBy() certainly does not use a Dictionary<>. And it is not possible by design, Dictionary<> cannot provide an ordered collection. Where GroupBy() must be ordered, it says so in the MSDN Library:
The IGrouping objects are yielded in an order based on the order of the elements in source that produced the first key of each IGrouping. Elements in a grouping are yielded in the order they appear in source.
Not having to maintain order is what makes Dictionary<> fast. Keeping order always cost O(log n), a binary tree in your text book.
Long story short, if you don't actually care about order, and you surely would not for random numbers, then you don't want to use GroupBy(). You want to use a Dictionary<>.
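To illustrate the Dictionary<> suggestion for the case where only the bucket counts are needed: a minimal sketch, where BucketCounter and CountByBucket are made-up names and num is the divisor from the question.

```csharp
using System;
using System.Collections.Generic;

public static class BucketCounter
{
    // Counts how many values fall into each bucket (value / num)
    // without materializing the groups themselves.
    public static Dictionary<int, int> CountByBucket(int[] values, int num)
    {
        if (values == null) throw new ArgumentNullException(nameof(values));
        if (num <= 0) throw new ArgumentOutOfRangeException(nameof(num));

        var counts = new Dictionary<int, int>();
        foreach (int value in values)
        {
            int key = value / num;
            int count;
            counts.TryGetValue(key, out count); // count stays 0 when the key is new
            counts[key] = count + 1;
        }
        return counts;
    }
}
```

The result holds the same counts as ar.GroupBy(i => i / num) followed by Count() on each group, but the order of the keys is not something to rely on.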
There are (at least) two influence factors: First, a hash table lookup only takes O(1) if you have a perfect hash function, which does not exist. Thus, you have hash collisions.
More important, I guess, are caching effects. Modern CPUs have large caches, so for the smaller bucket counts, the hash table itself might fit into the cache. As the hash table is accessed frequently, this can have a strong influence on performance. If there are more buckets, more accesses to RAM might be necessary, which are slow compared to a cache hit.
There are a few factors at work here.
Hashes and groupings
The way grouping works is by creating a hash table. Each individual group then supports an 'add' operation, which adds an element to the add list. To put it bluntly, it's like a Dictionary<Key, List<Value>>.
Hash tables are always overallocated. If you add an element to the hash, it checks if there is enough capacity, and if not, recreates the hash table with a larger capacity (To be exact: new capacity = count * 2 with count the number of groups). However, a larger capacity means that the bucket index is no longer correct, which means you have to re-build the entries in the hash table. The Resize() method in Lookup<Key, Value> does this.
The 'groups' themselves work like a List<T>. These too are overallocated, but are easier to reallocate. To be precise: the data is simply copied (with Array.Copy in Array.Resize) and a new element is added. Since there's no re-hashing or calculation involved, this is quite a fast operation.
The initial capacity of a grouping is 7. This means, for 10 elements you need to reallocate 1 time, for 100 elements 4 times, for 1000 elements 8 times, and so on. Because you have to re-hash more elements each time, your code gets a bit slower each time the number of buckets grows.
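The reallocation counts quoted above can be checked with a simplified model that starts at capacity 7 and doubles whenever the table is full. CountResizes is a hypothetical helper written for this illustration; the real Lookup implementation may differ in its exact growth formula.

```csharp
using System;

public static class ResizeModel
{
    // Counts how many times a table that starts at `initialCapacity` and doubles
    // when full has to grow before it can hold `elements` entries.
    public static int CountResizes(int elements, int initialCapacity = 7)
    {
        if (initialCapacity < 1) throw new ArgumentOutOfRangeException(nameof(initialCapacity));

        long capacity = initialCapacity; // long, so repeated doubling cannot overflow
        int resizes = 0;
        while (capacity < elements)
        {
            capacity *= 2;
            resizes++;
        }
        return resizes;
    }
}
```

Under this model, CountResizes(10), CountResizes(100) and CountResizes(1000) give 1, 4 and 8, matching the numbers in the paragraph above.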
I think these overallocations are the largest contributors to the small growth in the timings as the number of buckets grow. The easiest way to test this theory is to do no overallocations at all (test 1), and simply put counters in an array. The result can be shown below in the code for FixArrayTest (or if you like FixBucketTest which is closer to how groupings work). As you can see, the timings of # buckets = 10...10000 are the same, which is correct according to this theory.
Cache and random
Caching and random number generators aren't friends.
Our little test also shows that when the number of buckets grows above a certain threshold, memory comes into play. On my computer this is at an array size of roughly 4 MB (4 * number of buckets). Because the data is random, random chunks of RAM will be loaded into and evicted from the cache, which is a slow process. This also accounts for the large jump in the timings. To see this in action, change the random numbers to a sequence (called 'test 2'), and, because the data pages can now be cached, the speed will remain the same overall.
Note that hashes overallocate, so you will hit the mark before you have a million entries in your grouping.
Test code
static void Main(string[] args)
{
    int size = 10000000;
    int[] ar = new int[size];
    //random number init with numbers [0,size-1]
    var r = new Random();
    for (var i = 0; i < size; i++)
    {
        ar[i] = r.Next(0, size);
        //ar[i] = i; // Test 2 -> uncomment to see the effects of caching more clearly
    }
    Console.WriteLine("Fixed dictionary:");
    for (var numBuckets = 10; numBuckets <= 1000000; numBuckets *= 10)
    {
        var num = (size / numBuckets);
        var timing = 0L;
        for (var i = 0; i < 5; i++)
        {
            timing += FixBucketTest(ar, num);
            //timing += FixArrayTest(ar, num); // test 1
        }
        var avg = ((float)timing) / 5.0f;
        Console.WriteLine("Avg Time: " + avg + " ms for " + numBuckets);
    }
    Console.WriteLine("Fixed array:");
    for (var numBuckets = 10; numBuckets <= 1000000; numBuckets *= 10)
    {
        var num = (size / numBuckets);
        var timing = 0L;
        for (var i = 0; i < 5; i++)
        {
            timing += FixArrayTest(ar, num); // test 1
        }
        var avg = ((float)timing) / 5.0f;
        Console.WriteLine("Avg Time: " + avg + " ms for " + numBuckets);
    }
}
static long FixBucketTest(int[] ar, int num)
{
    // This test shows that timings will not grow for the smaller numbers of buckets if you don't have to re-allocate
    System.Diagnostics.Stopwatch s = new Stopwatch();
    s.Start(); // assumed; without it ElapsedMilliseconds would always be 0
    var grouping = new Dictionary<int, List<int>>(ar.Length / num + 1); // exactly the right size
    foreach (var item in ar)
    {
        int idx = item / num;
        List<int> ll;
        if (!grouping.TryGetValue(idx, out ll))
        {
            grouping.Add(idx, ll = new List<int>());
        }
        //ll.Add(item); //-> this would complete a 'grouper'; however, we don't want the overallocator of List to kick in
    }
    s.Stop();
    return s.ElapsedMilliseconds;
}
// Test with arrays
static long FixArrayTest(int[] ar, int num)
{
    System.Diagnostics.Stopwatch s = new Stopwatch();
    s.Start(); // assumed, as above
    int[] buf = new int[(ar.Length / num + 1) * 10];
    foreach (var item in ar)
    {
        int code = (item & 0x7FFFFFFF) % buf.Length;
        buf[code]++; // counter update assumed; the statement is missing from the listing
    }
    s.Stop();
    return s.ElapsedMilliseconds;
}
Bigger calculations leave less physical memory available on the computer, and counting the buckets is slower when memory is short. As you add more buckets, free memory shrinks further.
Try something like the following:
int size = 2500000; //10000000 divided by 4
int[] ar = new int[size];
//random number init with numbers [0,size-1]
System.Diagnostics.Stopwatch s = new Stopwatch();
for (int i = 0; i < 4; i++)
{
    // the lambda parameter is named x because i is already the loop variable
    var group = ar.GroupBy(x => x / num);
    //the number of expected buckets is size / num.
    var l = group.ToArray();
}
That is, run the calculation 4 times on smaller inputs.

How come this algorithm in Ruby runs faster than in Parallel'd C#?

The following ruby code runs in ~15s. It barely uses any CPU/Memory (about 25% of one CPU):
def collatz(num)
  num.even? ? num/2 : 3*num + 1
end
start_time = Time.now
max_chain_count = 0
max_starter_num = 0
(1..1000000).each do |i|
  count = 0
  current = i
  current = collatz(current) and count += 1 until (current == 1)
  max_chain_count = count and max_starter_num = i if (count > max_chain_count)
end
puts "Max starter num: #{max_starter_num} -> chain of #{max_chain_count} elements. Found in: #{Time.now - start_time}s"
And the following TPL C# puts all my 4 cores to 100% usage and is orders of magnitude slower than the ruby version:
static void Euler14Test()
{
    Stopwatch sw = new Stopwatch();
    sw.Start();
    int max_chain_count = 0;
    int max_starter_num = 0;
    object locker = new object();
    Parallel.For(1, 1000000, i =>
    {
        int count = 0;
        int current = i;
        while (current != 1)
        {
            current = collatz(current);
            count++;
        }
        if (count > max_chain_count)
        {
            lock (locker)
            {
                max_chain_count = count;
                max_starter_num = i;
            }
        }
        if (i % 1000 == 0)
            Console.WriteLine(i);
    });
    sw.Stop();
    Console.WriteLine("Max starter i: {0} -> chain of {1} elements. Found in: {2}s", max_starter_num, max_chain_count, sw.Elapsed.ToString());
}
static int collatz(int num)
{
    return num % 2 == 0 ? num / 2 : 3 * num + 1;
}
How come ruby runs faster than C#? I've been told that Ruby is slow. Is that not true when it comes to algorithms?
Perf AFTER correction:
Ruby (Non parallel): 14.62s
C# (Non parallel): 2.22s
C# (With TPL): 0.64s
Actually, the bug is quite subtle, and has nothing to do with threading. The reason that your C# version takes so long is that the intermediate values computed by the collatz method eventually start to overflow the int type, resulting in negative numbers which may then take ages to converge.
This first happens when i is 134,379, for which the 129th term (assuming one-based counting) is 2,482,111,348. This exceeds the maximum value of 2,147,483,647 and therefore gets stored as -1,812,855,948.
To get good performance (and correct results) on the C# version, just change:
int current = i;
to:
long current = i;
and:
static int collatz(int num)
to:
static long collatz(long num)
That will bring your run time down to a respectable 1.5 seconds.
Edit: CodesInChaos raises a very valid point about enabling overflow checking when debugging math-oriented applications. Doing so would have allowed the bug to be immediately identified, since the runtime would throw an OverflowException.
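As a rough illustration of the overflow-checking point, here is a minimal sketch that wraps the 3n + 1 step in C#'s checked operator, so the 32-bit overflow throws an OverflowException instead of silently wrapping to a negative number. The class and method names (CollatzOverflow, CheckedStep, FirstOverflowingStart) are made up for this example and are not part of the original code; the search runs sequentially, so the start value it reports need not match the one observed under Parallel.For.

```csharp
using System;

static class CollatzOverflow
{
    // Same step as the question's collatz method, but the odd branch is
    // evaluated in a checked context: overflowing int throws
    // OverflowException instead of wrapping around.
    public static int CheckedStep(int num)
    {
        return num % 2 == 0 ? num / 2 : checked(3 * num + 1);
    }

    // Hypothetical helper: scans start values 1 .. limit-1 in order and
    // returns the first one whose chain overflows int, or -1 if none does.
    public static int FirstOverflowingStart(int limit)
    {
        for (int i = 1; i < limit; i++)
        {
            int current = i;
            try
            {
                while (current != 1)
                    current = CheckedStep(current);
            }
            catch (OverflowException)
            {
                return i;
            }
        }
        return -1;
    }

    static void Main()
    {
        Console.WriteLine("First overflowing start: " + FirstOverflowingStart(1000000));
    }
}
```

The same effect can be had project-wide with the compiler's overflow-checking option, which avoids sprinkling checked through the code while debugging.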
Should be:
Parallel.For(1L, 1000000L, i =>
Otherwise, you get integer overflow and start checking negative values. Likewise, the collatz method should operate on long values.
I experienced something like that, and I figured out it's because each of your loop iterations needs to start another thread, which takes some time. In this case that time is comparable to (I think even greater than) the time spent on the operations you actually do in the loop body.
There is an alternative: find out how many CPU cores you have, then use a parallel loop with as many iterations as you have cores. Each iteration evaluates part of the actual loop you want, using an inner for loop whose bounds depend on the parallel loop's index.
int start = 1, end = 1000000;
Parallel.For(0, N_CORES, n =>
{
    int s = start + (end - start) * n / N_CORES;
    int e = n == N_CORES - 1 ? end : start + (end - start) * (n + 1) / N_CORES;
    for (int i = s; i < e; i++)
    {
        // Your code
    }
});
You should try this code, I'm pretty sure this will do the job faster.
Well, quite a long time since I answered this question, but I faced the problem again and finally understood what's going on.
I had been using the AForge implementation of the parallel for loop, and it seems to fire a thread for each iteration of the loop. That's why, if each iteration takes relatively little time to execute, you end up with inefficient parallelism.
So, as some of you pointed out, System.Threading.Tasks.Parallel methods are based on Tasks, which are kind of a higher level of abstraction of a Thread:
"Behind the scenes, tasks are queued to the ThreadPool, which has been enhanced with algorithms that determine and adjust to the number of threads and that provide load balancing to maximize throughput. This makes tasks relatively lightweight, and you can create many of them to enable fine-grained parallelism."
So yeah, if you use the default library's implementation, you won't need this kind of workaround.
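For comparison, the chunk-per-worker idea can also be expressed with the standard library's range partitioner rather than hand-computed bounds. The following is only a sketch under that assumption: Partitioner.Create(from, to) and Parallel.ForEach are real .NET APIs, while the class and method names (CollatzRanges, ChainLength, LongestChain) are invented for illustration. It uses long for the chain values, as recommended in the answers above.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

static class CollatzRanges
{
    // Number of steps needed to reach 1; long avoids the 32-bit overflow.
    public static int ChainLength(long n)
    {
        int count = 0;
        while (n != 1)
        {
            n = (n % 2 == 0) ? n / 2 : 3 * n + 1;
            count++;
        }
        return count;
    }

    // Returns the start value in [1, limit) with the longest chain.
    // Ties are resolved in favour of the smaller start value.
    public static (int Start, int Length) LongestChain(int limit)
    {
        int bestStart = 1, bestLength = 0;
        object locker = new object();

        // Each element handed to the body is a Tuple<int, int> describing
        // the half-open range [Item1, Item2).
        Parallel.ForEach(Partitioner.Create(1, limit), range =>
        {
            int localStart = range.Item1, localLength = -1;
            for (int i = range.Item1; i < range.Item2; i++)
            {
                int length = ChainLength(i);
                if (length > localLength) { localLength = length; localStart = i; }
            }
            // One lock per range instead of one per iteration.
            lock (locker)
            {
                if (localLength > bestLength ||
                    (localLength == bestLength && localStart < bestStart))
                {
                    bestLength = localLength;
                    bestStart = localStart;
                }
            }
        });
        return (bestStart, bestLength);
    }

    static void Main()
    {
        var (start, length) = LongestChain(1000000);
        Console.WriteLine("Max starter: {0} -> chain of {1} steps", start, length);
    }
}
```

Keeping the running maximum in locals and merging once per range means the shared state is touched only a handful of times, which is the same motivation as the manual N_CORES split.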

Cannot corrupt data with concurrent thread access

I wrote this experiment to demonstrate to someone that accessing shared data concurrently with multiple threads is a big no-no. To my surprise, regardless of how many threads I created, I was not able to create a concurrency issue and the value always ended up as a balanced 0. I know that the increment operator is not thread-safe, which is why there are methods like Interlocked.Increment() and Interlocked.Decrement() (also noted here: Is the ++ operator thread safe?).
If the increment/decrement operator is not thread-safe, then why does the code below execute without any issues and produce the expected value?
The snippet below creates 2,000 threads: 1,000 constantly incrementing and 1,000 constantly decrementing, to ensure that the data is being accessed by multiple threads at the same time. A normal program would not have nearly as many threads. Yet despite the exaggerated numbers, chosen to provoke a concurrency issue, the value always ends up as a balanced 0.
static void Main(string[] args)
Random random = new Random();
int value = 0;
for (int x=0; x<1000; x++)
Thread incThread = new Thread(() =>
for (int y=0; y<100; y++)
Thread decThread = new Thread(() =>
for (int z=0; z<100; z++)
I'm hoping someone can provide me with an explanation so that I know that all my effort into writing thread-safe software is not in vain, or perhaps this experiment is flawed in some way. I have also tried with all threads incrementing and using the ++i instead of i++. The value always results in the expected value.
You'll usually only see issues if you have two threads which are incrementing and decrementing at very close times. (There are also memory model issues, but they're separate.) That means you want them spending most of the time incrementing and decrementing, in order to give you the best chance of the operations colliding.
Currently, your threads will be spending the vast majority of the time sleeping or writing to the console. That's massively reducing the chances of collision.
Additionally, I'd note that absence of evidence is not evidence of absence - concurrency issues can indeed be hard to provoke, particularly if you happen to be running on a CPU with a strong memory model and internally-atomic increment/decrement instructions that the JIT can use. It could be that you'll never provoke the problem on your particular machine - but that the same program could fail on another machine.
IMO these loops are too short. I bet that by the time the second thread starts the first thread has already finished executing its loop and exited. Try to drastically increase the number of iterations that each thread executes. At this point you could even spawn just two threads (remove the outer loop) and it should be enough to see wrong values.
For example, with the following code I'm getting totally wrong results on my system:
static void Main(string[] args)
Random random = new Random();
int value = 0;
Thread incThread = new Thread(() =>
for (int y = 0; y < 2000000; y++)
Thread decThread = new Thread(() =>
for (int z = 0; z < 2000000; z++)
In addition to Jon Skeet's answer:
A simple test that shows the problem easily, at least on my little dual core:
Sub Main()
Dim i As Long = 1
Dim j As Long = 1
Dim f = Sub()
While Interlocked.Read(j) < 10 * 1000 * 1000
i += 1
End While
End Sub
Dim l As New List(Of Task)
For n = 1 To 4
Console.WriteLine("i={0} j={1}", i, j)
End Sub
i and j should both have the same final value. But they don't!
And in case you think that C# is more clever than VB:
static void Main(string[] args)
{
    long i = 1;
    long j = 1;
    Task[] t = new Task[4];
    for (int k = 0; k < 4; k++)
        t[k] = Task.Run(() => {
            while (Interlocked.Read(ref j) < (long)(10*1000*1000))
            {
                i++;
                Interlocked.Increment(ref j);
            }
        });
    Task.WaitAll(t);
    Console.WriteLine("i = {0} j = {1}", i, j);
}
it isn't ;)
The result: i is around 15% (percent!) lower than j on my machine. An eight-thread machine would probably make the difference even more pronounced, because the error is more likely to happen when several tasks truly run in parallel rather than merely being pre-empted.
The above code is flawed of course :(
If a task is pre-empted just after i++, all other tasks continue to increment i and j, so i is expected to differ from j even if ++ were atomic. There is a simple solution though:
static void Main(string[] args)
{
    long i = 0;
    int runs = 10*1000*1000;
    Task[] t = new Task[Environment.ProcessorCount];
    Stopwatch stp = Stopwatch.StartNew();
    for (int k = 0; k < t.Length; k++)
        t[k] = Task.Run(() =>
        {
            for (int j = 0; j < runs; j++ )
                i++;
        });
    Task.WaitAll(t);
    Console.WriteLine("i = {0} should be = {1} ms={2}", i, runs * t.Length, stp.ElapsedMilliseconds);
}
Now a task could be pre-empted somewhere in the loop statements, but that wouldn't affect i. So the only way to see an effect on i is for a task to be interrupted (or overlapped by a task on another core) in the middle of the i++ statement. And that's what was to be shown: it can happen, and it's more likely to happen when you have fewer but longer-running tasks.
If you write Interlocked.Increment(ref i); instead of i++, the code runs much longer (because of the synchronization the atomic operation requires), but i is exactly what it should be!
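To make the comparison concrete, here is a small sketch that runs the same counting loop twice, once with plain ++ and once with Interlocked.Increment. The names CounterRace and Count are hypothetical and chosen only for this example. The unsynchronized total depends on timing and hardware, so it may or may not come up short on a given run; only the interlocked total is guaranteed.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

static class CounterRace
{
    // Runs `tasks` tasks that each bump a shared counter `runs` times.
    // With useInterlocked == false the plain ++ is a non-atomic
    // read-modify-write, so concurrent updates can be lost.
    public static long Count(int tasks, int runs, bool useInterlocked)
    {
        long counter = 0;
        var work = new Task[tasks];
        for (int k = 0; k < work.Length; k++)
        {
            work[k] = Task.Run(() =>
            {
                for (int j = 0; j < runs; j++)
                {
                    if (useInterlocked)
                        Interlocked.Increment(ref counter);
                    else
                        counter++;
                }
            });
        }
        Task.WaitAll(work);
        return counter;
    }

    static void Main()
    {
        int tasks = Environment.ProcessorCount;
        int runs = 10 * 1000 * 1000;
        long expected = (long)tasks * runs;
        Console.WriteLine("expected    = {0}", expected);
        Console.WriteLine("plain ++    = {0}", Count(tasks, runs, false));
        Console.WriteLine("interlocked = {0}", Count(tasks, runs, true));
    }
}
```

On a multi-core machine the plain ++ line will typically print less than the expected total, while the interlocked line always matches it.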
