Performance when Generating CPU Cache Misses - C#

I am trying to learn about CPU cache performance in the world of .NET. Specifically, I am working through Igor Ostrovsky's article about Processor Cache Effects.
I have gone through the first three examples in his article and recorded results that differ widely from his. I think I must be doing something wrong, because my machine is showing almost the exact opposite of what his article reports: I am not seeing the large effects from cache misses that I would expect.
What am I doing wrong? (bad code, compiler setting, etc.)
Here are the performance results on my machine:
If it helps, the processor on my machine is an Intel Core i7-2630QM. Here is info on my processor's cache:
I have compiled in x64 Release mode.
Below is my source code:
class Program
{
static Stopwatch watch = new Stopwatch();
static int[] arr = new int[64 * 1024 * 1024];
static void Main(string[] args)
{
Example1();
Example2();
Example3();
Console.ReadLine();
}
static void Example1()
{
Console.WriteLine("Example 1:");
// Loop 1
watch.Restart();
for (int i = 0; i < arr.Length; i++) arr[i] *= 3;
watch.Stop();
Console.WriteLine(" Loop 1: " + watch.ElapsedMilliseconds.ToString() + " ms");
// Loop 2
watch.Restart();
for (int i = 0; i < arr.Length; i += 32) arr[i] *= 3;
watch.Stop();
Console.WriteLine(" Loop 2: " + watch.ElapsedMilliseconds.ToString() + " ms");
Console.WriteLine();
}
static void Example2()
{
Console.WriteLine("Example 2:");
for (int k = 1; k <= 1024; k *= 2)
{
watch.Restart();
for (int i = 0; i < arr.Length; i += k) arr[i] *= 3;
watch.Stop();
Console.WriteLine(" K = "+ k + ": " + watch.ElapsedMilliseconds.ToString() + " ms");
}
Console.WriteLine();
}
static void Example3()
{
Console.WriteLine("Example 3:");
for (int k = 1; k <= 1024*1024; k *= 2)
{
// 256 ints * 4 bytes per 32-bit int * k = k kilobytes
arr = new int[256*k];
int steps = 64 * 1024 * 1024; // Arbitrary number of steps
int lengthMod = arr.Length - 1;
watch.Restart();
for (int i = 0; i < steps; i++)
{
arr[(i * 16) & lengthMod]++; // (x & lengthMod) equals (x % arr.Length) because arr.Length is a power of two
}
watch.Stop();
Console.WriteLine(" Array size = " + arr.Length * 4 + " bytes: " + (int)(watch.Elapsed.TotalMilliseconds * 1000000.0 / arr.Length) + " nanoseconds per element");
}
Console.WriteLine();
}
}

Why are you using i += 32 in the second loop? You are stepping over cache lines that way: 32 * 4 = 128 bytes, which is much bigger than the 64 bytes needed.
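For reference, here is a minimal sketch (not from the question or the article) of that stride arithmetic, assuming a 64-byte cache line, which is typical for the i7-2630QM:
// Minimal sketch, assuming a 64-byte cache line (an assumption, not measured here).
// With 4-byte ints, a stride of 16 touches every cache line exactly once, so it
// causes the same memory traffic as a stride of 1; a stride of 32 skips every
// other cache line and only causes half the traffic.
const int cacheLineBytes = 64;
int intsPerLine = cacheLineBytes / sizeof(int); // 16
int[] data = new int[64 * 1024 * 1024];
for (int i = 0; i < data.Length; i += intsPerLine) data[i] *= 3;     // touches every line
for (int i = 0; i < data.Length; i += 2 * intsPerLine) data[i] *= 3; // touches every other line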

Related

Ryzen vs. i7 Multi-Threaded Performance

I made the following C# Console App:
class Program
{
static RNGCryptoServiceProvider rng = new RNGCryptoServiceProvider();
public static ConcurrentDictionary<int, int> StateCount { get; set; }
static int length = 1000000000;
static void Main(string[] args)
{
StateCount = new ConcurrentDictionary<int, int>();
for (int i = 0; i < 3; i++)
{
StateCount.AddOrUpdate(i, 0, (k, v) => 0);
}
Console.WriteLine("Processors: " + Environment.ProcessorCount);
Console.WriteLine("Starting...");
Console.WriteLine();
Timer t = new Timer(1000);
t.Elapsed += T_Elapsed;
t.Start();
Stopwatch sw = new Stopwatch();
sw.Start();
Parallel.For(0, length, (i) =>
{
var rand = GetRandomNumber();
int newState = 0;
if(rand < 0.3)
{
newState = 0;
}
else if (rand < 0.6)
{
newState = 1;
}
else
{
newState = 2;
}
StateCount.AddOrUpdate(newState, 0, (k, v) => v + 1);
});
sw.Stop();
t.Stop();
Console.WriteLine();
Console.WriteLine("Total time: " + sw.Elapsed.TotalSeconds);
Console.ReadKey();
}
private static void T_Elapsed(object sender, ElapsedEventArgs e)
{
int total = 0;
for (int i = 0; i < 3; i++)
{
if(StateCount.TryGetValue(i, out int value))
{
total += value;
}
}
int percent = (int)Math.Round((total / (double)length) * 100);
Console.Write("\r" + percent + "%");
}
public static double GetRandomNumber()
{
var bytes = new Byte[8];
rng.GetBytes(bytes);
var ul = BitConverter.ToUInt64(bytes, 0) / (1 << 11);
Double randomDouble = ul / (Double)(1UL << 53);
return randomDouble;
}
}
Before running this, the Task Manager reported <2% CPU usage (across all runs and machines).
I ran it on a machine with a Ryzen 3800X. The output was:
Processors: 16
Total time: 209.22
The speed reported in the Task Manager while it ran was ~4.12 GHz.
I ran it on a machine with an i7-7820HK. The output was:
Processors: 8
Total time: 213.09
The speed reported in the Task Manager while it ran was ~3.45 GHz.
I modified Parallel.For to include the processor count (Parallel.For(0, length, new ParallelOptions() { MaxDegreeOfParallelism = Environment.ProcessorCount }, (i) => {code});). The outputs were:
3800X: 16 - 158.58 # ~4.13
7820HK: 8 - 210.49 # ~3.40
There's something to be said about Parallel.For not natively distinguishing the Ryzen's logical processors from its cores, but setting that aside, even here the Ryzen's performance is still significantly poorer than expected: only ~25% faster with double the cores/threads, a higher clock speed, and larger L1-L3 caches. Can anyone explain why?
Edit: Following a couple of comments, I made some changes to my code. See below:
static int length = 1000;
static void Main(string[] args)
{
StateCount = new ConcurrentDictionary<int, int>();
for (int i = 0; i < 3; i++)
{
StateCount.AddOrUpdate(i, 0, (k, v) => 0);
}
var procCount = Environment.ProcessorCount;
Console.WriteLine("Processors: " + procCount);
Console.WriteLine("Starting...");
Console.WriteLine();
List<double> times = new List<double>();
Stopwatch sw = new Stopwatch();
for (int m = 0; m < 10; m++)
{
sw.Restart();
Parallel.For(0, length, new ParallelOptions() { MaxDegreeOfParallelism = procCount }, (i) =>
{
for (int j = 0; j < 1000000; j++)
{
var rand = GetRandomNumber();
int newState = 0;
if (rand < 0.3)
{
newState = 0;
}
else if (rand < 0.6)
{
newState = 1;
}
else
{
newState = 2;
}
StateCount.AddOrUpdate(newState, 0, (k, v) => v + 1);
}
});
sw.Stop();
Console.WriteLine("Total time: " + sw.Elapsed.TotalSeconds);
times.Add(sw.Elapsed.TotalSeconds);
}
Console.WriteLine();
var avg = times.Average();
var variance = times.Select(x => (x - avg) * (x - avg)).Sum() / times.Count;
var stdev = Math.Sqrt(variance);
Console.WriteLine("Average time: " + avg + " +/- " + stdev);
Console.ReadKey();
Console.ReadKey();
}
The outer loop is 1,000 iterations instead of 1,000,000,000, so there are "only" 1,000 parallel "tasks." Within each parallel "task", however, there is now a loop of 1,000,000 actions, so the overhead of "getting the task" should have a much smaller effect on the total. I also run the whole thing 10 times and take the average plus the standard deviation. Output:
Ryzen 3800X: 158.531 +/- 0.429 # ~4.13
i7-7820HK: 202.159 +/- 2.538 # ~3.48
Even here, the Ryzen's twice as many threads and 0.60 GHz higher clock only make the total operation about 25% faster (the Ryzen's time is roughly 78% of the i7's).
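For what it's worth, a common pattern for loops like the ones above is to give each worker thread its own RNG and its own counters via the Parallel.For localInit/localFinally overload, so the shared RNGCryptoServiceProvider and the ConcurrentDictionary are not touched on every iteration. The sketch below is not from the original code, and whether removing that shared state changes the Ryzen/i7 gap is an assumption that would have to be measured:
// Sketch only: per-thread RNG and per-thread counts, merged once per worker.
// Assumes the same three-state split and the same `length` as the code above.
var stateCount = new ConcurrentDictionary<int, int>();
for (int s = 0; s < 3; s++) stateCount[s] = 0;
Parallel.For(0, length,
    // localInit: one RNG and one small count array per worker thread
    () => (rng: RandomNumberGenerator.Create(), counts: new int[3]),
    (i, loop, local) =>
    {
        var bytes = new byte[8];
        local.rng.GetBytes(bytes);
        double rand = (BitConverter.ToUInt64(bytes, 0) >> 11) / (double)(1UL << 53);
        int newState = rand < 0.3 ? 0 : rand < 0.6 ? 1 : 2;
        local.counts[newState]++; // no shared state touched in the hot path
        return local;
    },
    // localFinally: merge this worker's counts into the shared dictionary once
    local =>
    {
        for (int s = 0; s < 3; s++)
            stateCount.AddOrUpdate(s, local.counts[s], (k, v) => v + local.counts[s]);
        local.rng.Dispose();
    });
RandomNumberGenerator.Create and the tuple-typed thread-local state are just one way to express this; a per-thread Random would also work if cryptographic randomness is not required.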

Relative speed of jagged, multidimensional arrays and object fields

Some background: I'm an engineer, not a professional programmer, so apologies for any dumb questions. I have a program that is calculation-intensive and I want to speed it up. My understanding (from reading forums) was that C# is optimised for jagged arrays, but my test below seems to say otherwise. If I'm doing something wrong please let me know. Also, the test which uses the Tray object (it could be any moderately complex object) seems too fast; is there a problem with the code?
The relative speeds I'm getting (in Debug) are below. (Note: I just realised the times include array initialisation, but that's OK; the arrays will be cleared or re-initialised many times in the program.)
Object/field access 0ms
Jagged Array access 87ms
Normal Array access 12ms
If I don't count initialising the arrays I get 0, 7 and 7 ms respectively.
var watch = Stopwatch.StartNew();
double res=0;
int count = 10000;
Tray tray = column[0][4];
for (int i = 0; i < count; i++)
{
tray = column[0][4];
tray.T = i;
res = tray.T;
}
watch.Stop();
var elapsedMs = watch.ElapsedMilliseconds;
Debug.WriteLine("Column Solution Time1 ms: " + elapsedMs.ToString() + " " + res.ToString());
MessageBox.Show("Object Field" + elapsedMs.ToString());
watch = Stopwatch.StartNew();
double res2 = 0;
int count2 = count;
double[][] tray1 = new double[count2][];
for (int i = 0; i < count2; i++)
tray1[i] = new double[count2];
for (int i = 0; i < count2; i++)
{
tray1[i][i] = i;
res2 = tray1[i][i];
}
watch.Stop();
elapsedMs = watch.ElapsedMilliseconds;
Debug.WriteLine("Column Solution Time2 ms: " + elapsedMs.ToString() +" "+ res2.ToString());
MessageBox.Show("Jagged Matix" + elapsedMs.ToString());
watch = Stopwatch.StartNew();
double res3 = 0;
int count3 = count;
double[,] tray3 = new double[count3,count3];
for (int i = 0; i < count3; i++)
{
tray3[i,i] = i;
res3 = tray3[i,i];
}
watch.Stop();
elapsedMs = watch.ElapsedMilliseconds;
Debug.WriteLine("Column Solution Time3 ms: " + elapsedMs.ToString() + " " + res3.ToString());
MessageBox.Show("Matix" + elapsedMs.ToString());

Adding integers from 2 arrays using Vector<int> takes longer time than traditional for loop

I am trying to use Vector&lt;int&gt; to add the integer values from two arrays faster than a traditional for loop.
My vector count is 4, which should mean that the addArrays_Vector function should run about 4 times faster than addArrays_Normally:
var vectSize = Vector<int>.Count;
Vector.IsHardwareAccelerated is also true on my computer.
However strangely enough those are the benchmarks:
addArrays_Normally takes 475 milliseconds
addArrays_Vector takes 627 milliseconds
How is this possible? Shouldn't addArrays_Vector take only about 120 milliseconds? Am I doing this wrong?
void runVectorBenchmark()
{
var v1 = new int[92564080];
var v2 = new int[92564080];
for (int i = 0; i < v1.Length; i++)
{
v1[i] = 2;
v2[i] = 2;
}
//new Thread(() => addArrays_Normally(v1, v2)).Start();
new Thread(() => addArrays_Vector(v1, v2, Vector<int>.Count)).Start();
}
void addArrays_Normally(int[] v1, int[] v2)
{
Stopwatch stopWatch = new Stopwatch();
stopWatch.Start();
int sum = 0;
int i = 0;
for (i = 0; i < v1.Length; i++)
{
sum = v1[i] + v2[i];
}
stopWatch.Stop();
MessageBox.Show("stopWatch: " + stopWatch.ElapsedMilliseconds.ToString() + " milliseconds\n\n" );
}
void addArrays_Vector(int[] v1, int[] v2, int vectSize)
{
Stopwatch stopWatch = new Stopwatch();
stopWatch.Start();
int[] retVal = new int[v1.Length];
int i = 0;
for (i = 0; i < v1.Length - vectSize; i += vectSize)
{
var va = new Vector<int>(v1, i);
var vb = new Vector<int>(v2, i);
var vc = va + vb;
vc.CopyTo(retVal, i);
}
stopWatch.Stop();
MessageBox.Show("stopWatch: " + stopWatch.ElapsedMilliseconds.ToString() + " milliseconds\n\n" );
}
The two functions are different, and it looks like RAM bandwidth is the bottleneck here.
In the first example:
var v1 = new int[92564080];
var v2 = new int[92564080];
...
int sum = 0;
int i = 0;
for (i = 0; i < v1.Length; i++)
{
sum = v1[i] + v2[i];
}
The code reads both arrays once, so the memory traffic is sizeof(int) * 92564080 * 2 == 4 * 92564080 * 2 == 706 MB.
In the second example:
var v1 = new int[92564080];
var v2 = new int[92564080];
...
int[] retVal = new int[v1.Length];
int i = 0;
for (i = 0; i < v1.Length - vectSize; i += vectSize)
{
var va = new Vector<int>(v1, i);
var vb = new Vector<int>(v2, i);
var vc = va + vb;
vc.CopyTo(retVal, i);
}
The code reads the two input arrays and writes into an output array, so the memory traffic is at least sizeof(int) * 92564080 * 3 == 1059 MB.
Update:
RAM is much slower than the CPU and its caches. From the great article Memory Bandwidth Napkin Math, roughly:
L1 Bandwidth: 210 GB/s
...
RAM Bandwidth: 45 GB/s
So the extra memory traffic negates the vectorization speed-up.
And the YouTube video mentioned is comparing different code; the non-vectorized code from the video is as follows, which consumes the same amount of memory as the vectorized code:
int[] AddArrays_Simple(int[] v1, int[] v2)
{
int[] retVal = new int[v1.Length];
for (int i = 0; i < v1.Length; i++)
{
retVal[i] = v1[i] + v2[i];
}
return retVal;
}
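For completeness, here is a sketch of the vectorized counterpart with the leftover tail handled by a scalar loop, so it produces the same result as AddArrays_Simple above (this is a sketch, not the code from the video):
int[] AddArrays_Vectorized(int[] v1, int[] v2)
{
    int[] retVal = new int[v1.Length];
    int vectSize = Vector<int>.Count;
    int i = 0;
    // Main vector loop: processes vectSize ints per iteration.
    int lastBlock = v1.Length - vectSize;
    for (; i <= lastBlock; i += vectSize)
    {
        var va = new Vector<int>(v1, i);
        var vb = new Vector<int>(v2, i);
        (va + vb).CopyTo(retVal, i);
    }
    // Scalar tail: the last few elements the vector loop did not cover.
    for (; i < v1.Length; i++)
        retVal[i] = v1[i] + v2[i];
    return retVal;
}
With equal memory traffic on both sides, the remaining difference is what the vector instructions actually buy, which on a memory-bound workload like this can be small.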

StopWatch gives random results

This is my first attempt to use Stopwatch to measure code performance, and I don't know what is wrong. I want to check whether there is a difference when casting to double to calculate the average of two integers.
public static double Avarage(int a, int b)
{
return (a + b + 0.0) / 2;
}
public static double AvarageDouble(int s, int d)
{
return (double)(s + d) / 2;
}
public static double AvarageDouble2(int x, int v)
{
return ((double)x + v) / 2;
}
Code to test these 3 methods, using StopWatch:
Stopwatch sw = new Stopwatch();
sw.Start();
for (int i = 0; i < 1000000; i++)
{
var ret = Avarage(2, 3);
}
sw.Stop();
Console.Write("Using 0.0: " + sw.ElapsedTicks + "\n");
sw.Reset();
sw.Start();
for (int i = 0; i < 1000000; i++)
{
var ret2 = AvarageDouble(2, 3);
}
sw.Stop();
Console.Write("Using Double(s+d): " + sw.ElapsedTicks + "\n");
sw.Reset();
sw.Start();
for (int i = 0; i < 1000000; i++)
{
var ret3 = AvarageDouble2(2, 3);
}
sw.Stop();
Console.Write("Using double (x): " + sw.ElapsedTicks + "\n");
It shows random results: sometimes Avarage is the fastest, other times AvarageDouble or AvarageDouble2. I use different variable names, but that does not seem to matter.
What am I missing?
PS: What is the best method to calculate the average of two ints?
I tested your code, and yes, the results were quite random at times. Remember that Stopwatch only measures the time elapsed from sw.Start() to sw.Stop(). It does not take into account .NET's just-in-time compilation, operating system process scheduling, CPU load, etc.
This is more noticeable in methods with such small runtimes, where this noise can more than double the measured time.
A more elaborate and better explanation is written in the following SO question.
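A minimal sketch (not from the answer above) of reducing that noise: call each method once so the JIT cost is paid before timing starts, use a much larger iteration count, and consume the result so the loop body cannot be optimized away:
// Warm-up: forces the three methods to be JIT-compiled before timing.
double sink = Avarage(2, 3) + AvarageDouble(2, 3) + AvarageDouble2(2, 3);
const int N = 10000000;
var sw = Stopwatch.StartNew();
for (int i = 0; i < N; i++)
{
    sink += Avarage(2, 3); // consuming the result keeps the call from being dead code
}
sw.Stop();
Console.WriteLine("Avarage: " + sw.ElapsedMilliseconds + " ms (sink = " + sink + ")");
// Repeat the same timed loop for AvarageDouble and AvarageDouble2.
For anything beyond a quick check, a benchmarking harness such as BenchmarkDotNet handles warm-up, iteration counts and statistical reporting for you.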

Trying to find the average number of attempts it takes for Tom to roll a six, over 10 repetitions

It takes the total number of rolls and divides it by 10. For example, if it took 56 rolls, my average is 5.6.
Random numGen = new Random ();
int numOfAttempt = 0;
int Attempt = 0;
int avrBefore = 0;
int avrAfter = 0;
for (int i = 1; i <= 10; i++)
do {
Attempt = numGen.Next (1, 7);
Console.WriteLine (Attempt);
numOfAttempt++;
avrBefore = numOfAttempt;
avrAfter = avrBefore / 10;
} while (Attempt != 6);
Console.WriteLine ("He tried " + numOfAttempt + " times to roll a six.");
Console.WriteLine ("The average number of times it took to get a six was " + avrAfter);
Console.ReadKey ();
Maybe you need to reset the variables inside the for loop, before the do-while:
for (int i = 1; i <= 10; i++){
int numOfAttempt = 0;
int Attempt = 0;
int avrBefore = 0;
int avrAfter = 0;
do {//your code
You are dividing by ten at the wrong time: you divide by ten even when you haven't done ten iterations yet.
The code below instead shows the correct average at each iteration.
void Main()
{
Random numGen = new Random ();
int totalAttempts = 0;
for (int i = 1; i <= 10; i++)
{
int attempts = 0;
int attempt = 0;
do {
attempt = numGen.Next (1, 7);
attempts++;
} while (attempt != 6);
totalAttempts+=attempts;
Console.WriteLine ("He tried " + attempts + " times to roll a six.");
Console.WriteLine ("The average number of times it took to get a six was " + (double)totalAttempts / i);
}
}
So you want to know how many rolls it took to get a six. This is quite simple: you just have to keep track of how many times you rolled the die and how many times you got a six.
Random numGen = new Random ();
int attempt = 0;
int numOfAttempt = 0;
int numberOfSixes = 0;
int i;
for (i = 0; i < 10; i++) {
do {
attempt = numGen.Next (1, 7);
Console.WriteLine (attempt);
numOfAttempt++;
} while (attempt != 6);
numberOfSixes++;
}
Console.WriteLine ("He tried " + numOfAttempt + " times to roll a six.");
Console.WriteLine ("The average number of times it took to get a six was " + (numberOfAttempt/numberOfSixes));
Console.ReadKey ();
I hope you understand my approach.
I would appreciate it if you could elaborate on your thought process behind avrBefore, avrAfter and the division by 10. I didn't get that.
You can shorten down the code a lot and get it correct:
Random numGen = new Random();
int numOfAttempt = 0;
for (int i = 0; i < 10; i++)
{
int attempt = 0;
while (attempt != 6)
{
attempt = numGen.Next(1, 7);
Console.WriteLine (attempt );
numOfAttempt++;
}
}
Console.WriteLine("He tried " + numOfAttempt + " times to roll a six.");
Console.WriteLine("The average number of times it took to get a six was " + numOfAttempt / 10.0);
Console.ReadKey();
