Related
I'm playing around with the Parallel.ForEach in a C# console application, but can't seem to get it right. I'm creating an array with random numbers and i have a sequential foreach and a Parallel.ForEach that finds the largest value in the array. With approximately the same code in c++ i started to see a tradeoff to using several threads at 3M values in the array. But the Parallel.ForEach is twice as slow even at 100M values. What am i doing wrong?
class Program
{
static void Main(string[] args)
{
dostuff();
}
static void dostuff() {
Console.WriteLine("How large do you want the array to be?");
int size = int.Parse(Console.ReadLine());
int[] arr = new int[size];
Random rand = new Random();
for (int i = 0; i < size; i++)
{
arr[i] = rand.Next(0, int.MaxValue);
}
var watchSeq = System.Diagnostics.Stopwatch.StartNew();
var largestSeq = FindLargestSequentially(arr);
watchSeq.Stop();
var elapsedSeq = watchSeq.ElapsedMilliseconds;
Console.WriteLine("Finished sequential in: " + elapsedSeq + "ms. Largest = " + largestSeq);
var watchPar = System.Diagnostics.Stopwatch.StartNew();
var largestPar = FindLargestParallel(arr);
watchPar.Stop();
var elapsedPar = watchPar.ElapsedMilliseconds;
Console.WriteLine("Finished parallel in: " + elapsedPar + "ms Largest = " + largestPar);
dostuff();
}
static int FindLargestSequentially(int[] arr) {
int largest = arr[0];
foreach (int i in arr) {
if (largest < i) {
largest = i;
}
}
return largest;
}
static int FindLargestParallel(int[] arr) {
int largest = arr[0];
Parallel.ForEach<int, int>(arr, () => 0, (i, loop, subtotal) =>
{
if (i > subtotal)
subtotal = i;
return subtotal;
},
(finalResult) => {
Console.WriteLine("Thread finished with result: " + finalResult);
if (largest < finalResult) largest = finalResult;
}
);
return largest;
}
}
It's performance ramifications of having a very small delegate body.
We can achieve better performance using the partitioning. In this case the body delegate performs work with a high data volume.
static int FindLargestParallelRange(int[] arr)
{
object locker = new object();
int largest = arr[0];
Parallel.ForEach(Partitioner.Create(0, arr.Length), () => arr[0], (range, loop, subtotal) =>
{
for (int i = range.Item1; i < range.Item2; i++)
if (arr[i] > subtotal)
subtotal = arr[i];
return subtotal;
},
(finalResult) =>
{
lock (locker)
if (largest < finalResult)
largest = finalResult;
});
return largest;
}
Pay attention to synchronize the localFinally delegate. Also note the need for proper initialization of the localInit: () => arr[0] instead of () => 0.
Partitioning with PLINQ:
static int FindLargestPlinqRange(int[] arr)
{
return Partitioner.Create(0, arr.Length)
.AsParallel()
.Select(range =>
{
int largest = arr[0];
for (int i = range.Item1; i < range.Item2; i++)
if (arr[i] > largest)
largest = arr[i];
return largest;
})
.Max();
}
I highly recommend free book Patterns of Parallel Programming by Stephen Toub.
As the other answerers have mentioned, the action you're trying to perform against each item here is so insignificant that there are a variety of other factors which end up carrying more weight than the actual work you're doing. These may include:
JIT optimizations
CPU branch prediction
I/O (outputting thread results while the timer is running)
the cost of invoking delegates
the cost of task management
the system incorrectly guessing what thread strategy will be optimal
memory/cpu caching
memory pressure
environment (debugging)
etc.
Running each approach a single time is not an adequate way to test, because it enables a number of the above factors to weigh more heavily on one iteration than on another. You should start with a more robust benchmarking strategy.
Furthermore, your implementation is actually dangerously incorrect. The documentation specifically says:
The localFinally delegate is invoked once per task to perform a final action on each task’s local state. This delegate might be invoked concurrently on multiple tasks; therefore, you must synchronize access to any shared variables.
You have not synchronized your final delegate, so your function is prone to race conditions that would make it produce incorrect results.
As in most cases, the best approach to this one is to take advantage of work done by people smarter than we are. In my testing, the following approach appears to be the fastest overall:
return arr.AsParallel().Max();
The Parallel Foreach loop should be running slower because the algorithm used is not parallel and a lot more work is being done to run this algorithm.
In the single thread, to find the max value, we can take the first number as our max value and compare it to every other number in the array. If one of the numbers larger than our first number, we swap and continue. This way we access each number in the array once, for a total of N comparisons.
In the Parallel loop above, the algorithm creates overhead because each operation is wrapped inside a function call with a return value. So in addition to doing the comparisons, it is running overhead of adding and removing these calls onto the call stack. In addition, since each call is dependent on the value of the function call before, it needs to run in sequence.
In the Parallel For Loop below, the array is divided into an explicit number of threads determined by the variable threadNumber. This limits the overhead of function calls to a low number.
Note, for low values, the parallel loops performs slower. However, for 100M, there is a decrease in time elapsed.
static int FindLargestParallel(int[] arr)
{
var answers = new ConcurrentBag<int>();
int threadNumber = 4;
int partitionSize = arr.Length/threadNumber;
Parallel.For(0, /* starting number */
threadNumber+1, /* Adding 1 to threadNumber in case array.Length not evenly divisible by threadNumber */
i =>
{
if (i*partitionSize < arr.Length) /* check in case # in array is divisible by # threads */
{
var max = arr[i*partitionSize];
for (var x = i*partitionSize;
x < (i + 1)*partitionSize && x < arr.Length;
++x)
{
if (arr[x] > max)
max = arr[x];
}
answers.Add(max);
}
});
/* note the shortcut in finding max in the bag */
return answers.Max(i=>i);
}
Some thoughts here: In the parallel case, there is thread management logic involved that determines how many threads it wants to use. This thread management logic presumably possibly runs on your main thread. Every time a thread returns with the new maximum value, the management logic kicks in and determines the next work item (the next number to process in your array). I'm pretty sure that this requires some kind of locking. In any case, determining the next item may even cost more than performing the comparison operation itself.
That sounds like a magnitude more work (overhead) to me than a single thread that processes one number after the other. In the single-threaded case there are a number of optimization at play: No boundary checks, CPU can load data into the first level cache within the CPU, etc. Not sure, which of these optimizations apply for the parallel case.
Keep in mind that on a typical desktop machine there are only 2 to 4 physical CPU cores available so you will never have more than that actually doing work. So if the parallel processing overhead is more than 2-4 times of a single-threaded operation, the parallel version will inevitably be slower, which you are observing.
Have you attempted to run this on a 32 core machine? ;-)
A better solution would be determine non-overlapping ranges (start + stop index) covering the entire array and let each parallel task process one range. This way, each parallel task can internally do a tight single-threaded loop and only return once the entire range has been processed. You could probably even determine a near optimal number of ranges based on the number of logical cores of the machine. I haven't tried this but I'm pretty sure you will see an improvement over the single-threaded case.
Try splitting the set into batches and running the batches in parallel, where the number of batches corresponds to your number of CPU cores.
I ran some equations 1K, 10K and 1M times using the following methods:
A "for" loop.
A "Parallel.For" from the System.Threading.Tasks lib, across the entire set.
A "Parallel.For" across 4 batches.
A "Parallel.ForEach" from the System.Threading.Tasks lib, across the entire set.
A "Parallel.ForEach" across 4 batches.
Results: (Measured in seconds)
Conclusion:
Processing batches in parallel using the "Parallel.ForEach" has the best outcome in cases above 10K records. I believe the batching helps because it utilizes all CPU cores (4 in this example), but also minimizes the amount of threading overhead associated with parallelization.
Here is my code:
public void ParallelSpeedTest()
{
var rnd = new Random(56);
int range = 1000000;
int numberOfCores = 4;
int batchSize = range / numberOfCores;
int[] rangeIndexes = Enumerable.Range(0, range).ToArray();
double[] inputs = rangeIndexes.Select(n => rnd.NextDouble()).ToArray();
double[] weights = rangeIndexes.Select(n => rnd.NextDouble()).ToArray();
double[] outputs = new double[rangeIndexes.Length];
/// Series "for"...
var startTimeSeries = DateTime.Now;
for (var i = 0; i < range; i++)
{
outputs[i] = Math.Sqrt(Math.Pow(inputs[i] * weights[i], 2));
}
var durationSeries = DateTime.Now - startTimeSeries;
/// "Parallel.For"...
var startTimeParallel = DateTime.Now;
Parallel.For(0, range, (i) => {
outputs[i] = Math.Sqrt(Math.Pow(inputs[i] * weights[i], 2));
});
var durationParallelFor = DateTime.Now - startTimeParallel;
/// "Parallel.For" in Batches...
var startTimeParallel2 = DateTime.Now;
Parallel.For(0, numberOfCores, (c) => {
var endValue = (c == numberOfCores - 1) ? range : (c + 1) * batchSize;
var startValue = c * batchSize;
for (var i = startValue; i < endValue; i++)
{
outputs[i] = Math.Sqrt(Math.Pow(inputs[i] * weights[i], 2));
}
});
var durationParallelForBatches = DateTime.Now - startTimeParallel2;
/// "Parallel.ForEach"...
var startTimeParallelForEach = DateTime.Now;
Parallel.ForEach(rangeIndexes, (i) => {
outputs[i] = Math.Sqrt(Math.Pow(inputs[i] * weights[i], 2));
});
var durationParallelForEach = DateTime.Now - startTimeParallelForEach;
/// Parallel.ForEach in Batches...
List<Tuple<int,int>> ranges = new List<Tuple<int, int>>();
for (var i = 0; i < numberOfCores; i++)
{
int start = i * batchSize;
int end = (i == numberOfCores - 1) ? range : (i + 1) * batchSize;
ranges.Add(new Tuple<int,int>(start, end));
}
var startTimeParallelBatches = DateTime.Now;
Parallel.ForEach(ranges, (range) => {
for(var i = range.Item1; i < range.Item1; i++) {
outputs[i] = Math.Sqrt(Math.Pow(inputs[i] * weights[i], 2));
}
});
var durationParallelForEachBatches = DateTime.Now - startTimeParallelBatches;
Debug.Print($"=================================================================");
Debug.Print($"Given: Set-size: {range}, number-of-batches: {numberOfCores}, batch-size: {batchSize}");
Debug.Print($".................................................................");
Debug.Print($"Series For: {durationSeries}");
Debug.Print($"Parallel For: {durationParallelFor}");
Debug.Print($"Parallel For Batches: {durationParallelForBatches}");
Debug.Print($"Parallel ForEach: {durationParallelForEach}");
Debug.Print($"Parallel ForEach Batches: {durationParallelForEachBatches}");
Debug.Print($"");
}
I found something interesting while doing a HW question.
The howework question asks to code the Median Maintenance algorithm.
The formal statement is as follows:
The goal of this problem is to implement the "Median Maintenance" algorithm (covered in the Week 5 lecture on heap applications). The text file contains a list of the integers from 1 to 10000 in unsorted order; you should treat this as a stream of numbers, arriving one by one. Letting xi denote the ith number of the file, the kth median mk is defined as the median of the numbers x1,…,xk. (So, if k is odd, then mk is ((k+1)/2)th smallest number among x1,…,xk; if k is even, then m1 is the (k/2)th smallest number among x1,…,xk.)
In order to get O(n) running time, this should be implemented using heaps obviously. Anyways, I coded this using Brute Force (deadline was too soon and needed an answer right away) (O(n2)) with the following steps:
Read data in
Sort array
Find Median
Add it to running time
I ran the algorithm through several test cases (with a known answer) and got the correct results, however when I was running the same algorithm on a larger data set I was getting the wrong answer. I was doing all the operations using Int64 ro represent the data.
Then I tried switching to Int32 and magically I got the correct answer which makes no sense to me.
The code is below, and it is also found here (the data is in the repo). The algorithm starts to give erroneous results after the 3810 index:
private static void Main(string[] args)
{
MedianMaintenance("Question2.txt");
}
private static void MedianMaintenance(string filename)
{
var txtData = File.ReadLines(filename).ToArray();
var inputData32 = new List<Int32>();
var medians32 = new List<Int32>();
var sums32 = new List<Int32>();
var inputData64 = new List<Int64>();
var medians64 = new List<Int64>();
var sums64 = new List<Int64>();
var sum = 0;
var sum64 = 0f;
var i = 0;
foreach (var s in txtData)
{
//Add to sorted list
var intToAdd = Convert.ToInt32(s);
inputData32.Add(intToAdd);
inputData64.Add(Convert.ToInt64(s));
//Compute sum
var count = inputData32.Count;
inputData32.Sort();
inputData64.Sort();
var index = 0;
if (count%2 == 0)
{
//Even number of elements
index = count/2 - 1;
}
else
{
//Number is odd
index = ((count + 1)/2) - 1;
}
var val32 = Convert.ToInt32(inputData32[index]);
var val64 = Convert.ToInt64(inputData64[index]);
if (i > 3810)
{
var t = sum;
var t1 = sum + val32;
}
medians32.Add(val32);
medians64.Add(val64);
//Debug.WriteLine("Median is {0}", val);
sum += val32;
sums32.Add(Convert.ToInt32(sum));
sum64 += val64;
sums64.Add(Convert.ToInt64(sum64));
i++;
}
Console.WriteLine("Median Maintenance result is {0}", (sum).ToString("N"));
Console.WriteLine("Median Maintenance result is {0}", (medians32.Sum()).ToString("N"));
Console.WriteLine("Median Maintenance result is {0} - Int64", (sum64).ToString("N"));
Console.WriteLine("Median Maintenance result is {0} - Int64", (medians64.Sum()).ToString("N"));
}
What's more interesting is that the running sum (in the sum64 variable) yields a different result than summing all items in the list with LINQ's Sum() function.
The results (the thirs one is the one that's wrong):
These are the computer details:
I'll appreciate if someone can give me some insights on why is this happening.
Thanks,
0f is initializing a 32 bit float variable, you meant 0d or 0.0 to receive a 64 bit floating point.
As for linq, you'll probably get better results if you use strongly typed lists.
new List<int>()
new List<long>()
The first thing I notice is what the commenter did: var sum64 = 0f initializes sum64 as a float. As the median value of a collection of Int64s will itself be an Int64 (the specified rules don't use the mean between two midpoint values in a collection of even cardinality), you should instead declare this variable explicitly as a long. In fact, I would go ahead and replace all usages of var in this code example; the convenience of var is being lost here in causing type-related bugs.
Given this simple piece of code and 10mln array of random numbers:
static int Main(string[] args)
{
int size = 10000000;
int num = 10; //increase num to reduce number of buckets
int numOfBuckets = size/num;
int[] ar = new int[size];
Random r = new Random(); //initialize with randum numbers
for (int i = 0; i < size; i++)
ar[i] = r.Next(size);
var s = new Stopwatch();
s.Start();
var group = ar.GroupBy(i => i / num);
var l = group.Count();
s.Stop();
Console.WriteLine(s.ElapsedMilliseconds);
Console.ReadLine();
return 0;
}
I did some performance on grouping, so when the number of buckets is 10k the estimated execution time is 0.7s, for 100k buckets it is 2s, for 1m buckets it is 7.5s.
I wonder why is that. I imagine that if the GroupBy is implemented using HashTable there might be problem with collisions. For example initially the hashtable is prepard to work for let's say 1000 groups and then when the number of groups is growing it needs to increase the size and do the rehashing. If these was the case I could then write my own grouping where I would initialize the HashTable with expected number of buckets, I did that but it was only slightly faster.
So my question is, why number of buckets influences groupBy performance that much?
EDIT:
running under release mode change the results to 0.55s, 1.6s, 6.5s respectively.
I also changed the group.ToArray to piece of code below just to force execution of grouping :
foreach (var g in group)
array[g.Key] = 1;
where array is initialized before timer with appropriate size, the results stayed almost the same.
EDIT2:
You can see the working code from mellamokb in here pastebin.com/tJUYUhGL
I'm pretty certain this is showing the effects of memory locality (various levels of caching) and also object allocation.
To verify this, I took three steps:
Improve the benchmarking to avoid unnecessary parts and to garbage collect between tests
Remove the LINQ part by populating a Dictionary (which is effecively what GroupBy does behind the scenes)
Remove even Dictionary<,> and show the same trend for plain arrays.
In order to show this for arrays, I needed to increase the input size, but it does show the same kind of growth.
Here's a short but complete program which can be used to test both the dictionary and the array side - just flip which line is commented out in the middle:
using System;
using System.Collections.Generic;
using System.Diagnostics;
class Test
{
const int Size = 100000000;
const int Iterations = 3;
static void Main()
{
int[] input = new int[Size];
// Use the same seed for repeatability
var rng = new Random(0);
for (int i = 0; i < Size; i++)
{
input[i] = rng.Next(Size);
}
// Switch to PopulateArray to change which method is tested
Func<int[], int, TimeSpan> test = PopulateDictionary;
for (int buckets = 10; buckets <= Size; buckets *= 10)
{
TimeSpan total = TimeSpan.Zero;
for (int i = 0; i < Iterations; i++)
{
// Switch which line is commented to change the test
// total += PopulateDictionary(input, buckets);
total += PopulateArray(input, buckets);
GC.Collect();
GC.WaitForPendingFinalizers();
}
Console.WriteLine("{0,9}: {1,7}ms", buckets, (long) total.TotalMilliseconds);
}
}
static TimeSpan PopulateDictionary(int[] input, int buckets)
{
int divisor = input.Length / buckets;
var dictionary = new Dictionary<int, int>(buckets);
var stopwatch = Stopwatch.StartNew();
foreach (var item in input)
{
int key = item / divisor;
int count;
dictionary.TryGetValue(key, out count);
count++;
dictionary[key] = count;
}
stopwatch.Stop();
return stopwatch.Elapsed;
}
static TimeSpan PopulateArray(int[] input, int buckets)
{
int[] output = new int[buckets];
int divisor = input.Length / buckets;
var stopwatch = Stopwatch.StartNew();
foreach (var item in input)
{
int key = item / divisor;
output[key]++;
}
stopwatch.Stop();
return stopwatch.Elapsed;
}
}
Results on my machine:
PopulateDictionary:
10: 10500ms
100: 10556ms
1000: 10557ms
10000: 11303ms
100000: 15262ms
1000000: 54037ms
10000000: 64236ms // Why is this slower? See later.
100000000: 56753ms
PopulateArray:
10: 1298ms
100: 1287ms
1000: 1290ms
10000: 1286ms
100000: 1357ms
1000000: 2717ms
10000000: 5940ms
100000000: 7870ms
An earlier version of PopulateDictionary used an Int32Holder class, and created one for each bucket (when the lookup in the dictionary failed). This was faster when there was a small number of buckets (presumably because we were only going through the dictionary lookup path once per iteration instead of twice) but got significantly slower, and ended up running out of memory. This would contribute to fragmented memory access as well, of course. Note that PopulateDictionary specifies the capacity to start with, to avoid effects of data copying within the test.
The aim of using the PopulateArray method is to remove as much framework code as possible, leaving less to the imagination. I haven't yet tried using an array of a custom struct (with various different struct sizes) but that may be something you'd like to try too.
EDIT: I can reproduce the oddity of the slower result for 10000000 than 100000000 at will, regardless of test ordering. I don't understand why yet. It may well be specific to the exact processor and cache I'm using...
--EDIT--
The reason why 10000000 is slower than the 100000000 results has to do with the way hashing works. A few more tests explain this.
First off, let's look at the operations. There's Dictionary.FindEntry, which is used in the [] indexing and in Dictionary.TryGetValue, and there's Dictionary.Insert, which is used in the [] indexing and in Dictionary.Add. If we would just do a FindEntry, the timings would go up as we expect it:
static TimeSpan PopulateDictionary1(int[] input, int buckets)
{
int divisor = input.Length / buckets;
var dictionary = new Dictionary<int, int>(buckets);
var stopwatch = Stopwatch.StartNew();
foreach (var item in input)
{
int key = item / divisor;
int count;
dictionary.TryGetValue(key, out count);
}
stopwatch.Stop();
return stopwatch.Elapsed;
}
This is implementation doesn't have to deal with hash collisions (because there are none), which makes the behavior as we expect it. Once we start dealing with collisions, the timings start to drop. If we have as much buckets as elements, there are obviously less collisions... To be exact, we can figure out exactly how many collisions there are by doing:
static TimeSpan PopulateDictionary(int[] input, int buckets)
{
int divisor = input.Length / buckets;
int c1, c2;
c1 = c2 = 0;
var dictionary = new Dictionary<int, int>(buckets);
var stopwatch = Stopwatch.StartNew();
foreach (var item in input)
{
int key = item / divisor;
int count;
if (!dictionary.TryGetValue(key, out count))
{
dictionary.Add(key, 1);
++c1;
}
else
{
count++;
dictionary[key] = count;
++c2;
}
}
stopwatch.Stop();
Console.WriteLine("{0}:{1}", c1, c2);
return stopwatch.Elapsed;
}
The result is something like this:
10:99999990
10: 4683ms
100:99999900
100: 4946ms
1000:99999000
1000: 4732ms
10000:99990000
10000: 4964ms
100000:99900000
100000: 7033ms
1000000:99000000
1000000: 22038ms
9999538:90000462 <<-
10000000: 26104ms
63196841:36803159 <<-
100000000: 25045ms
Note the value of '36803159'. This answers the question why the last result is faster than the first result: it simply has to do less operations -- and since caching fails anyways, that factor doesn't make a difference anymore.
10k the estimated execution time is 0.7s, for 100k buckets it is 2s, for 1m buckets it is 7.5s.
This is an important pattern to recognize when you profile code. It is one of the standard size vs execution time relationships in software algorithms. Just from seeing the behavior, you can tell a lot about the way the algorithm was implemented. And the other way around of course, from the algorithm you can predict the expected execution time. A relationship that's annotated in the Big Oh notation.
Speediest code you can get is amortized O(1), execution time barely increases when you double the size of the problem. The Dictionary<> class behaves that way, as John demonstrated. The increases in time as the problem set gets large is the "amortized" part. A side-effect of Dictionary having to perform linear O(n) searches in buckets that keep getting bigger.
A very common pattern is O(n). That tells you that there is a single for() loop in the algorithm that iterates over the collection. O(n^2) tells you there are two nested for() loops. O(n^3) has three, etcetera.
What you got is the one in between, O(log n). It is the standard complexity of a divide-and-conquer algorithm. In other words, each pass splits the problem in two, continuing with the smaller set. Very common, you see it back in sorting algorithms. Binary search is the one you find back in your text book. Note how log₂(10) = 3.3, very close to the increment you see in your test. Perf starts to tank a bit for very large sets due to the poor locality of reference, a cpu cache problem that's always associated with O(log n) algoritms.
The one thing that John's answer demonstrates is that his guess cannot be correct, GroupBy() certainly does not use a Dictionary<>. And it is not possible by design, Dictionary<> cannot provide an ordered collection. Where GroupBy() must be ordered, it says so in the MSDN Library:
The IGrouping objects are yielded in an order based on the order of the elements in source that produced the first key of each IGrouping. Elements in a grouping are yielded in the order they appear in source.
Not having to maintain order is what makes Dictionary<> fast. Keeping order always cost O(log n), a binary tree in your text book.
Long story short, if you don't actually care about order, and you surely would not for random numbers, then you don't want to use GroupBy(). You want to use a Dictionary<>.
There are (at least) two influence factors: First, a hash table lookup only takes O(1) if you have a perfect hash function, which does not exist. Thus, you have hash collisions.
I guess more important, though, are caching effects. Modern CPUs have large caches, so for the smaller bucket count, the hash table itself might fit into the cache. As the hash table is frequently accessed, this might have a strong influence on the performance. If there are more buckets, more accesses to the RAM might be neccessary, which are slow compared to a cache hit.
There are a few factors at work here.
Hashes and groupings
The way grouping works is by creating a hash table. Each individual group then supports an 'add' operation, which adds an element to the add list. To put it bluntly, it's like a Dictionary<Key, List<Value>>.
Hash tables are always overallocated. If you add an element to the hash, it checks if there is enough capacity, and if not, recreates the hash table with a larger capacity (To be exact: new capacity = count * 2 with count the number of groups). However, a larger capacity means that the bucket index is no longer correct, which means you have to re-build the entries in the hash table. The Resize() method in Lookup<Key, Value> does this.
The 'groups' themselves work like a List<T>. These too are overallocated, but are easier to reallocate. To be precise: the data is simply copied (with Array.Copy in Array.Resize) and a new element is added. Since there's no re-hashing or calculation involved, this is quite a fast operation.
The initial capacity of a grouping is 7. This means, for 10 elements you need to reallocate 1 time, for 100 elements 4 times, for 1000 elements 8 times, and so on. Because you have to re-hash more elements each time, your code gets a bit slower each time the number of buckets grows.
I think these overallocations are the largest contributors to the small growth in the timings as the number of buckets grow. The easiest way to test this theory is to do no overallocations at all (test 1), and simply put counters in an array. The result can be shown below in the code for FixArrayTest (or if you like FixBucketTest which is closer to how groupings work). As you can see, the timings of # buckets = 10...10000 are the same, which is correct according to this theory.
Cache and random
Caching and random number generators aren't friends.
Our little test also shows that when the number of buckets grows above a certain threshold, memory comes into play. On my computer this is at an array size of roughly 4 MB (4 * number of buckets). Because the data is random, random chunks of RAM will be loaded and unloaded into the cache, which is a slow process. This is also the large jump in the speed. To see this in action, change the random numbers to a sequence (called 'test 2'), and - because the data pages can now be cached - the speed will remain the same overall.
Note that hashes overallocate, so you will hit the mark before you have a million entries in your grouping.
Test code
static void Main(string[] args)
{
int size = 10000000;
int[] ar = new int[size];
//random number init with numbers [0,size-1]
var r = new Random();
for (var i = 0; i < size; i++)
{
ar[i] = r.Next(0, size);
//ar[i] = i; // Test 2 -> uncomment to see the effects of caching more clearly
}
Console.WriteLine("Fixed dictionary:");
for (var numBuckets = 10; numBuckets <= 1000000; numBuckets *= 10)
{
var num = (size / numBuckets);
var timing = 0L;
for (var i = 0; i < 5; i++)
{
timing += FixBucketTest(ar, num);
//timing += FixArrayTest(ar, num); // test 1
}
var avg = ((float)timing) / 5.0f;
Console.WriteLine("Avg Time: " + avg + " ms for " + numBuckets);
}
Console.WriteLine("Fixed array:");
for (var numBuckets = 10; numBuckets <= 1000000; numBuckets *= 10)
{
var num = (size / numBuckets);
var timing = 0L;
for (var i = 0; i < 5; i++)
{
timing += FixArrayTest(ar, num); // test 1
}
var avg = ((float)timing) / 5.0f;
Console.WriteLine("Avg Time: " + avg + " ms for " + numBuckets);
}
}
static long FixBucketTest(int[] ar, int num)
{
// This test shows that timings will not grow for the smaller numbers of buckets if you don't have to re-allocate
System.Diagnostics.Stopwatch s = new Stopwatch();
s.Start();
var grouping = new Dictionary<int, List<int>>(ar.Length / num + 1); // exactly the right size
foreach (var item in ar)
{
int idx = item / num;
List<int> ll;
if (!grouping.TryGetValue(idx, out ll))
{
grouping.Add(idx, ll = new List<int>());
}
//ll.Add(item); //-> this would complete a 'grouper'; however, we don't want the overallocator of List to kick in
}
s.Stop();
return s.ElapsedMilliseconds;
}
// Test with arrays
static long FixArrayTest(int[] ar, int num)
{
System.Diagnostics.Stopwatch s = new Stopwatch();
s.Start();
int[] buf = new int[(ar.Length / num + 1) * 10];
foreach (var item in ar)
{
int code = (item & 0x7FFFFFFF) % buf.Length;
buf[code]++;
}
s.Stop();
return s.ElapsedMilliseconds;
}
When executing bigger calculations, less physical memory is available on the computer, counting the buckets will be slower with less memory, as you expend the buckets, your memory will decrease.
Try something like the following:
int size = 2500000; //10000000 divided by 4
int[] ar = new int[size];
//random number init with numbers [0,size-1]
System.Diagnostics.Stopwatch s = new Stopwatch();
s.Start();
for (int i = 0; i<4; i++)
{
var group = ar.GroupBy(i => i / num);
//the number of expected buckets is size / num.
var l = group.ToArray();
}
s.Stop();
calcuting 4 times with lower numbers.
The following ruby code runs in ~15s. It barely uses any CPU/Memory (about 25% of one CPU):
def collatz(num)
num.even? ? num/2 : 3*num + 1
end
start_time = Time.now
max_chain_count = 0
max_starter_num = 0
(1..1000000).each do |i|
count = 0
current = i
current = collatz(current) and count += 1 until (current == 1)
max_chain_count = count and max_starter_num = i if (count > max_chain_count)
end
puts "Max starter num: #{max_starter_num} -> chain of #{max_chain_count} elements. Found in: #{Time.now - start_time}s"
And the following TPL C# puts all my 4 cores to 100% usage and is orders of magnitude slower than the ruby version:
static void Euler14Test()
{
Stopwatch sw = new Stopwatch();
sw.Start();
int max_chain_count = 0;
int max_starter_num = 0;
object locker = new object();
Parallel.For(1, 1000000, i =>
{
int count = 0;
int current = i;
while (current != 1)
{
current = collatz(current);
count++;
}
if (count > max_chain_count)
{
lock (locker)
{
max_chain_count = count;
max_starter_num = i;
}
}
if (i % 1000 == 0)
Console.WriteLine(i);
});
sw.Stop();
Console.WriteLine("Max starter i: {0} -> chain of {1} elements. Found in: {2}s", max_starter_num, max_chain_count, sw.Elapsed.ToString());
}
static int collatz(int num)
{
return num % 2 == 0 ? num / 2 : 3 * num + 1;
}
How come ruby runs faster than C#? I've been told that Ruby is slow. Is that not true when it comes to algorithms?
Perf AFTER correction:
Ruby (Non parallel): 14.62s
C# (Non parallel): 2.22s
C# (With TPL): 0.64s
Actually, the bug is quite subtle, and has nothing to do with threading. The reason that your C# version takes so long is that the intermediate values computed by the collatz method eventually start to overflow the int type, resulting in negative numbers which may then take ages to converge.
This first happens when i is 134,379, for which the 129th term (assuming one-based counting) is 2,482,111,348. This exceeds the maximum value of 2,147,483,647 and therefore gets stored as -1,812,855,948.
To get good performance (and correct results) on the C# version, just change:
int current = i;
…to:
long current = i;
…and:
static int collatz(int num)
…to:
static long collatz(long num)
That will bring down your performance to a respectable 1.5 seconds.
Edit: CodesInChaos raises a very valid point about enabling overflow checking when debugging math-oriented applications. Doing so would have allowed the bug to be immediately identified, since the runtime would throw an OverflowException.
Should be:
Parallel.For(1L, 1000000L, i =>
{
Otherwise, you have integer overfill and start checking negative values. The same collatz method should operate with long values.
I experienced something like that. And I figured out that's because each of your loop iterations need to start other thread and this takes some time, and in this case it's comparable (I think it's more time) than the operations you acctualy do in the loop body.
There is an alternative for that: You can get how many CPU cores you have and than use a parallelism loop with the same number of iterations you have cores, each loop will evaluate part of the acctual loop you want, it's done by making an inner for loop that depends on the parallel loop.
EDIT: EXAMPLE
int start = 1, end = 1000000;
Parallel.For(0, N_CORES, n =>
{
int s = start + (end - start) * n / N_CORES;
int e = n == N_CORES - 1 ? end : start + (end - start) * (n + 1) / N_CORES;
for (int i = s; i < e; i++)
{
// Your code
}
});
You should try this code, I'm pretty sure this will do the job faster.
EDIT: ELUCIDATION
Well, quite a long time since I answered this question, but I faced the problem again and finally understood what's going on.
I've been using AForge implementation of Parallel for loop, and it seems like, it fires a thread for each iteration of the loop, so, that's why if the loop takes relatively a small amount of time to execute, you end up with a inefficient parallelism.
So, as some of you pointed out, System.Threading.Tasks.Parallel methods are based on Tasks, which are kind of a higher level of abstraction of a Thread:
"Behind the scenes, tasks are queued to the ThreadPool, which has been enhanced with algorithms that determine and adjust to the number of threads and that provide load balancing to maximize throughput. This makes tasks relatively lightweight, and you can create many of them to enable fine-grained parallelism."
So yeah, if you use the default library's implementation, you won't need to use this kind of "bogus".
I was working on a project today, and found myself using Math.Max in several places and inline if statements in other places. So, I was wondering if anybody knew which is "better"... or rather, what the real differences are.
For example, in the following, c1 = c2:
Random rand = new Random();
int a = rand.next(0,10000);
int b = rand.next(0,10000);
int c1 = Math.Max(a, b);
int c2 = a>b ? a : b;
I'm asking specifically about C#, but I suppose the answer could be different in different languages, though I'm not sure which ones have similar concepts.
One of the major differences I would notice right away would be for readability sake, as far as I know for implementation/performance sake, they would be nearly equivalent.
Math.Max(a,b) is very simple to understand, regardless of previous coding knowledge.
a>b ? a : b would require the user to have some knowledge of the ternary operator, at least.
"When in doubt - go for readability"
I thought it would be fun to throw in some numbers into this discussion so I wrote some code to profile it. As expected they are almost identical for all practical purposes.
The code does a billion loops (yep 1 billion). Subtracting the overhead of the loop you get:
Math.Max() took .0044 seconds to run 1 billion times
The inline if took .0055 seconds to run 1 billion times
I subtracted the overhead which I calculated by running an empty loop 1 billion times, the overhead was 1.2 seconds.
I ran this on a laptop, 64-bit Windows 7, 1.3 Ghz Intel Core i5 (U470). The code was compiled in release mode and ran without a debugger attached.
Here's the code:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Diagnostics;
namespace TestMathMax {
class Program {
static int Main(string[] args) {
var num1 = 10;
var num2 = 100;
var maxValue = 0;
var LoopCount = 1000000000;
double controlTotalSeconds;
{
var stopwatch = new Stopwatch();
stopwatch.Start();
for (var i = 0; i < LoopCount; i++) {
// do nothing
}
stopwatch.Stop();
controlTotalSeconds = stopwatch.Elapsed.TotalSeconds;
Console.WriteLine("Control - Empty Loop - " + controlTotalSeconds + " seconds");
}
Console.WriteLine();
{
var stopwatch = new Stopwatch();
stopwatch.Start();
for (int i = 0; i < LoopCount; i++) {
maxValue = Math.Max(num1, num2);
}
stopwatch.Stop();
Console.WriteLine("Math.Max() - " + stopwatch.Elapsed.TotalSeconds + " seconds");
Console.WriteLine("Relative: " + (stopwatch.Elapsed.TotalSeconds - controlTotalSeconds) + " seconds");
}
Console.WriteLine();
{
var stopwatch = new Stopwatch();
stopwatch.Start();
for (int i = 0; i < LoopCount; i++) {
maxValue = num1 > num2 ? num1 : num2;
}
stopwatch.Stop();
Console.WriteLine("Inline Max: " + stopwatch.Elapsed.TotalSeconds + " seconds");
Console.WriteLine("Relative: " + (stopwatch.Elapsed.TotalSeconds - controlTotalSeconds) + " seconds");
}
Console.ReadLine();
return maxValue;
}
}
}
UPDATED Results 2/7/2015
On a Windows 8.1, Surface 3 Pro, i7 4650U 2.3Ghz
Ran as a console application in release mode without the debugger attached.
Math.Max() - 0.3194749 seconds
Inline Max: 0.3465041 seconds
if statement considered beneficial
Summary
a statement of the form if (a > max) max = a is the fastest way to determine the maximum of a set of numbers. However the loop infrastructure itself takes most of the CPU time, so this optimization is questionable in the end.
Details
The answer by luisperezphd is interesting because it provides numbers, however I believe the method is flawed: the compiler will most likely move the comparison out of the loop, so the answer doesn't measure what it wants to measure. This explains the negligible timing difference between control loop and measurement loops.
To avoid this loop optimization, I added an operation that depends on the loop variable, to the empty control loop as well as to all measurement loops. I simulate the common use case of finding the maximum in a list of numbers, and used three data sets:
best case: the first number is the maximum, all numbers after it are smaller
worst case: every number is bigger than the previous, so the max changes each iteration
average case: a set of random numbers
See below for the code.
The result was rather surprising to me. On my Core i5 2520M laptop I got the following for 1 billion iterations (the empty control took about 2.6 sec in all cases):
max = Math.Max(max, a): 2.0 sec best case / 1.3 sec worst case / 2.0 sec average case
max = Math.Max(a, max): 1.6 sec best case / 2.0 sec worst case / 1.5 sec average case
max = max > a ? max : a: 1.2 sec best case / 1.2 sec worst case / 1.2 sec average case
if (a > max) max = a: 0.2 sec best case / 0.9 sec worst case / 0.3 sec average case
So despite long CPU pipelines and the resulting penalties for branching, the good old if statement is the clear winner for all simulated data sets; in the best case it is 10 times faster than Math.Max, and in the worst case still more than 30% faster.
Another surprise is that the order of the arguments to Math.Max matters. Presumably this is because of CPU branch prediction logic working differently for the two cases, and mispredicting branches more or less depending on the order of arguments.
However, the majority of the CPU time is spent in the loop infrastructure, so in the end this optimization is questionable at best. It provides a measurable but minor reduction in overall execution time.
UPDATED by luisperezphd
I couldn't fit this as a comment and it made more sense to write it here instead of as part of my answer so that it was in context.
Your theory makes sense, but I was not able to reproduce the results. First for some reason using your code my control loop was taking longer than the loops containing work.
For that reason I made the numbers here relative to the lowest time instead of the control loop. The seconds in the results are how much longer it took than the fastest time. For example in the results immediately below the fastest time was for Math.Max(a, max) best case, so every other result represents how much longer they took than that.
Below are the results I got:
max = Math.Max(max, a): 0.012 sec best case / 0.007 sec worst case / 0.028 sec average case
max = Math.Max(a, max): 0.000 best case / 0.021 worst case / 0.019 sec average case
max = max > a ? max : a: 0.022 sec best case / 0.02 sec worst case / 0.01 sec average case
if (a > max) max = a: 0.015 sec best case / 0.024 sec worst case / 0.019 sec average case
The second time I ran it I got:
max = Math.Max(max, a): 0.024 sec best case / 0.010 sec worst case / 0.009 sec average case
max = Math.Max(a, max): 0.001 sec best case / 0.000 sec worst case / 0.018 sec average case
max = max > a ? max : a: 0.011 sec best case / 0.005 sec worst case / 0.018 sec average case
if (a > max) max = a: 0.000 sec best case / 0.005 sec worst case / 0.039 sec average case
There is enough volume in these tests that any anomalies should have been wiped out. Yet despite that the results are pretty different. Maybe the large memory allocation for the array has something to do with it. Or possibly the difference is so small that anything else happening on the computer at the time is the true cause of the variation.
Note the fastest time, represented in the results above by 0.000 is about 8 seconds. So if you consider that the longest run then was 8.039, the variation in time is about half a percent (0.5%) - aka too small to matter.
The computer
The code was ran on Windows 8.1, i7 4810MQ 2.8Ghz and compiled in .NET 4.0.
Code modifications
I modified your code a bit to output the results in the format shown above. I also added additional code to wait 1 second after starting to account for any additional loading time .NET might need when running the assembly.
Also I ran all the tests twice to account for any CPU optimizations. Finally I changed the int for i to a unit so I could run the loop 4 billion times instead of 1 billion to get a longer timespan.
That's probably all overkill, but it's all to make sure as much as possible that the tests are not affected by any of those factors.
You can find the code at: http://pastebin.com/84qi2cbD
Code
using System;
using System.Diagnostics;
namespace ProfileMathMax
{
class Program
{
static double controlTotalSeconds;
const int InnerLoopCount = 100000;
const int OuterLoopCount = 1000000000 / InnerLoopCount;
static int[] values = new int[InnerLoopCount];
static int total = 0;
static void ProfileBase()
{
Stopwatch stopwatch = new Stopwatch();
stopwatch.Start();
int maxValue;
for (int j = 0; j < OuterLoopCount; j++)
{
maxValue = 0;
for (int i = 0; i < InnerLoopCount; i++)
{
// baseline
total += values[i];
}
}
stopwatch.Stop();
controlTotalSeconds = stopwatch.Elapsed.TotalSeconds;
Console.WriteLine("Control - Empty Loop - " + controlTotalSeconds + " seconds");
}
static void ProfileMathMax()
{
int maxValue;
Stopwatch stopwatch = new Stopwatch();
stopwatch.Start();
for (int j = 0; j < OuterLoopCount; j++)
{
maxValue = 0;
for (int i = 0; i < InnerLoopCount; i++)
{
maxValue = Math.Max(values[i], maxValue);
total += values[i];
}
}
stopwatch.Stop();
Console.WriteLine("Math.Max(a, max) - " + stopwatch.Elapsed.TotalSeconds + " seconds");
Console.WriteLine("Relative: " + (stopwatch.Elapsed.TotalSeconds - controlTotalSeconds) + " seconds");
}
static void ProfileMathMaxReverse()
{
int maxValue;
Stopwatch stopwatch = new Stopwatch();
stopwatch.Start();
for (int j = 0; j < OuterLoopCount; j++)
{
maxValue = 0;
for (int i = 0; i < InnerLoopCount; i++)
{
maxValue = Math.Max(maxValue, values[i]);
total += values[i];
}
}
stopwatch.Stop();
Console.WriteLine("Math.Max(max, a) - " + stopwatch.Elapsed.TotalSeconds + " seconds");
Console.WriteLine("Relative: " + (stopwatch.Elapsed.TotalSeconds - controlTotalSeconds) + " seconds");
}
static void ProfileInline()
{
int maxValue = 0;
Stopwatch stopwatch = new Stopwatch();
stopwatch.Start();
for (int j = 0; j < OuterLoopCount; j++)
{
maxValue = 0;
for (int i = 0; i < InnerLoopCount; i++)
{
maxValue = maxValue > values[i] ? values[i] : maxValue;
total += values[i];
}
}
stopwatch.Stop();
Console.WriteLine("max = max > a ? a : max: " + stopwatch.Elapsed.TotalSeconds + " seconds");
Console.WriteLine("Relative: " + (stopwatch.Elapsed.TotalSeconds - controlTotalSeconds) + " seconds");
}
static void ProfileIf()
{
int maxValue = 0;
Stopwatch stopwatch = new Stopwatch();
stopwatch.Start();
for (int j = 0; j < OuterLoopCount; j++)
{
maxValue = 0;
for (int i = 0; i < InnerLoopCount; i++)
{
if (values[i] > maxValue)
maxValue = values[i];
total += values[i];
}
}
stopwatch.Stop();
Console.WriteLine("if (a > max) max = a: " + stopwatch.Elapsed.TotalSeconds + " seconds");
Console.WriteLine("Relative: " + (stopwatch.Elapsed.TotalSeconds - controlTotalSeconds) + " seconds");
}
static void Main(string[] args)
{
Random rnd = new Random();
for (int i = 0; i < InnerLoopCount; i++)
{
//values[i] = i; // worst case: every new number biggest than the previous
//values[i] = i == 0 ? 1 : 0; // best case: first number is the maximum
values[i] = rnd.Next(int.MaxValue); // average case: random numbers
}
ProfileBase();
Console.WriteLine();
ProfileMathMax();
Console.WriteLine();
ProfileMathMaxReverse();
Console.WriteLine();
ProfileInline();
Console.WriteLine();
ProfileIf();
Console.ReadLine();
}
}
}
I'd say it is quicker to understand what Math.Max is doing, and that should really be the only deciding factor here.
But as an indulgence, it's interesting to consider that Math.Max(a,b) evaluates the arguments once, whilst a > b ? a : b evaluates one of them twice. Not a problem with local variables, but for properties with side effects, the side effect may happen twice.
If the JITer chooses to inline the Math.Max function, the executable code will be identical to the if statement. If Math.Max isn't inlined, it will execute as a function call with call and return overhead not present in the if statement. So, the if statement will give identical performance to Math.Max() in the inlining case or the if statement may be a few clock cycles faster in the non-inlined case, but the difference won't be noticeable unless you are running tens of millions of comparisons.
Since the performance difference between the two is small enough to be negligible in most situations, I'd prefer the Math.Max(a,b) because it's easier to read.
Regarding performance, Modern CPUs have internal command pipeline such that every assembly command is executed in several internal steps. (e.g. fetching, interpretation, calculation, storage)
In most cases the CPU is smart enough to run these steps in parallel for sequential commands so the overall throughput is very high.
This is fine till there comes a branch (if, ?: etc.) .
The branch may break the sequence and force the CPU to trash the pipeline.
This costs a lot of clock cycles.
In theory, if the compiler is smart enough, the Math.Max can be implemented using a built it CPU command and the branching can be avoided.
In this case the Math.Max would actually be faster than the if - but it depends on the compiler..
In case of more complicated Max - like working on vectors, double []v; v.Max() the compiler can utilize highly optimized library code, that can be much faster than regular compiled code.
So it's best to go with Math.Max, but it is also recommended to check on your particular target system and compiler if it is important enough.
Math.Max(a,b)
is NOT equivalent to a > b ? a : b in all cases.
Math.Max returns the greater value of the two arguments, that is:
if (a == b) return a; // or b, doesn't matter since they're identical
else if (a > b && b < a) return a;
else if (b > a && a < b) return b;
else return undefined;
The undefined is mapped to double.NaN in case of the double overload of Math.Max for example.
a > b ? a : b
evaluates to a if a is greater than b, which does not necessarily mean that b is less than a.
A simple example that demonstrates that they are not equivalent:
var a = 0.0/0.0; // or double.NaN
var b = 1.0;
a > b ? a : b // evaluates to 1.0
Math.Max(a, b) // returns double.NaN
Take an operation;
N must be >= 0
General solutions:
A) N = Math.Max(0, N)
B) if(N < 0){N = 0}
Sorting by speed:
Slow: Math.Max (A) < (B) if-then statement :Fast (3% more faster than solution 'A')
But my solution is 4% faster than solution 'B':
N *= Math.Sign(1 + Math.Sign(N));