I made the following C# Console App:
class Program
{
static RNGCryptoServiceProvider rng = new RNGCryptoServiceProvider();
public static ConcurrentDictionary<int, int> StateCount { get; set; }
static int length = 1000000000;
static void Main(string[] args)
{
StateCount = new ConcurrentDictionary<int, int>();
for (int i = 0; i < 3; i++)
{
StateCount.AddOrUpdate(i, 0, (k, v) => 0);
}
Console.WriteLine("Processors: " + Environment.ProcessorCount);
Console.WriteLine("Starting...");
Console.WriteLine();
Timer t = new Timer(1000);
t.Elapsed += T_Elapsed;
t.Start();
Stopwatch sw = new Stopwatch();
sw.Start();
Parallel.For(0, length, (i) =>
{
var rand = GetRandomNumber();
int newState = 0;
if(rand < 0.3)
{
newState = 0;
}
else if (rand < 0.6)
{
newState = 1;
}
else
{
newState = 2;
}
StateCount.AddOrUpdate(newState, 0, (k, v) => v + 1);
});
sw.Stop();
t.Stop();
Console.WriteLine();
Console.WriteLine("Total time: " + sw.Elapsed.TotalSeconds);
Console.ReadKey();
}
private static void T_Elapsed(object sender, ElapsedEventArgs e)
{
int total = 0;
for (int i = 0; i < 3; i++)
{
if(StateCount.TryGetValue(i, out int value))
{
total += value;
}
}
int percent = (int)Math.Round((total / (double)length) * 100);
Console.Write("\r" + percent + "%");
}
public static double GetRandomNumber()
{
var bytes = new Byte[8];
rng.GetBytes(bytes);
var ul = BitConverter.ToUInt64(bytes, 0) / (1 << 11);
Double randomDouble = ul / (Double)(1UL << 53);
return randomDouble;
}
}
Before running this, the Task Manager reported <2% CPU usage (across all runs and machines).
I ran it on a machine with a Ryzen 3800X. The output was:
Processors: 16
Total time: 209.22
The speed reported in the Task Manager while it ran was ~4.12 GHz.
I ran it on a machine with an i7-7820HK. The output was:
Processors: 8
Total time: 213.09
The speed reported in the Task Manager while it ran was ~3.45 GHz.
I modified Parallel.For to include the processor count (Parallel.For(0, length, new ParallelOptions() { MaxDegreeOfParallelism = Environment.ProcessorCount }, (i) => {code});). The outputs were:
3800X: 16 - 158.58 # ~4.13
7820HK: 8 - 210.49 # ~3.40
There's something to be said about Parallel.For not natively identifying the Ryzen's processors vs. cores, but setting that aside, even here the Ryzen's performance is still significantly poorer than would be expected (only ~25% faster with double the cores/threads, a higher clock speed, and larger L1-L3 caches). Can anyone explain why?
Edit: Following a couple of comments, I made some changes to my code. See below:
static int length = 1000;
static void Main(string[] args)
{
StateCount = new ConcurrentDictionary<int, int>();
for (int i = 0; i < 3; i++)
{
StateCount.AddOrUpdate(i, 0, (k, v) => 0);
}
var procCount = Environment.ProcessorCount;
Console.WriteLine("Processors: " + procCount);
Console.WriteLine("Starting...");
Console.WriteLine();
List<double> times = new List<double>();
Stopwatch sw = new Stopwatch();
for (int m = 0; m < 10; m++)
{
sw.Restart();
Parallel.For(0, length, new ParallelOptions() { MaxDegreeOfParallelism = procCount }, (i) =>
{
for (int j = 0; j < 1000000; j++)
{
var rand = GetRandomNumber();
int newState = 0;
if (rand < 0.3)
{
newState = 0;
}
else if (rand < 0.6)
{
newState = 1;
}
else
{
newState = 2;
}
StateCount.AddOrUpdate(newState, 0, (k, v) => v + 1);
}
});
sw.Stop();
Console.WriteLine("Total time: " + sw.Elapsed.TotalSeconds);
times.Add(sw.Elapsed.TotalSeconds);
}
Console.WriteLine();
var avg = times.Average();
var variance = times.Select(x => (x - avg) * (x - avg)).Sum() / times.Count;
var stdev = Math.Sqrt(variance);
Console.WriteLine("Average time: " + avg + " +/- " + stdev);
Console.ReadKey();
Console.ReadKey();
}
The outer loop is 1,000 instead of 1,000,000,000, so there are "only" 1,000 parallel "tasks." Within each parallel "task," however, there is now a loop of 1,000,000 iterations, so the overhead of "getting the task" should have a much smaller effect on the total. I also loop the whole thing 10 times and report the average plus the standard deviation. Output:
Ryzen 3800X: 158.531 +/- 0.429 # ~4.13
i7-7820HK: 202.159 +/- 2.538 # ~3.48
Even here, the Ryzen's twice as many threads and ~0.6 GHz higher clock only result in a ~27% faster time for the total operation.
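For reference, a variant that keeps all state thread-local, so neither the shared RNGCryptoServiceProvider nor the ConcurrentDictionary sits in the hot path. This is only a sketch using Parallel.For's localInit/localFinally overload and RandomNumberGenerator.Create(); I have not benchmarked it:
Parallel.For(0, length,
    () => (counts: new int[3], rng: RandomNumberGenerator.Create(), buffer: new byte[8]),
    (i, loopState, local) =>
    {
        // per-thread RNG and per-thread counters: no shared state inside the loop
        local.rng.GetBytes(local.buffer);
        double rand = (BitConverter.ToUInt64(local.buffer, 0) >> 11) / (double)(1UL << 53);
        local.counts[rand < 0.3 ? 0 : rand < 0.6 ? 1 : 2]++;
        return local;
    },
    local =>
    {
        // merge each thread's counters into the shared dictionary once, at the end
        for (int s = 0; s < 3; s++)
        {
            int count = local.counts[s];
            StateCount.AddOrUpdate(s, count, (k, v) => v + count);
        }
    });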
Related
I have an application that calculates the prime numbers up to the user's input. So if the user types 10 into the console, it shows every prime number from 0 to 10. If I enter something like 10000000, it takes a long time before it shows every prime number, so I want to divide the work across 4 threads, each doing 1/4 of the total range. So for an input of 1000000, the first thread does 0 to 250000, the second thread does 250000 to 500000, and so on. This is my code so far; what it currently does is use 1 thread to calculate the primes up to the number the user enters, and at the end I sum the values in the array.
using System;
using System.Diagnostics;
using System.Linq;
using System.Threading;
namespace WeekOpdr__5
{
internal class Program
{
static int[] deel1 = new int[10000];
static int n;
static int startNumber;
static int secondNumber;
static void Main(string[] args)
{
Stopwatch sw = new Stopwatch();
while (true)
{
Console.WriteLine("type a number in: ");
n = Convert.ToInt32(Console.ReadLine());
int start1 = 2;
int threadCount = 4;
var threads = new Thread[threadCount];
sw.Start();
for (int i = 0; i < threadCount; i++)
{
int range = n / threadCount * 2;
Console.WriteLine(range);
secondNumber = n / 2;
int start2 = start1;
threads[i] = new Thread(() => PrimeNumbers(startNumber, range ));
//threads[i] = new Thread(() => PrimeNumbers(secondNumber, n));
threads[i].Start();
}
for (int i = 0; i < threadCount; i++)
{
threads[i].Join();
}
Console.WriteLine($"The Prime Numbers between 0 and {n} are : ");
sw.Restart();
ReturnN();
SumN();
long timeElapsed = sw.ElapsedMilliseconds;
Console.WriteLine($"\nthe total time is {timeElapsed} ms");
}
static void PrimeNumbers(int startNumber, int n)
{
int position = 0;
for (int i = startNumber; i <= n; i++)
{
bool primeDetected = true;
for (int j = 2; j <= i / 2; j++)
{
if (i % j == 0)
{
primeDetected = false;
break;
}
}
if (primeDetected && i != 1)
{
deel1[position++] = i;
}
}
}
static void ReturnN()
{
foreach (int i in deel1)
{
Console.Write($"{i} ");
}
}
static void SumN()
{
int sum = deel1.Sum();
Console.WriteLine($"\nThe sum of the prime numbers between 2 and {n}: {sum}");
}
}
}
}
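A rough sketch of the partitioning described above (assumed code, not from the question): each thread gets its own slice [start, end) and its own result list, so the shared deel1 array and the shared static fields are not needed. It assumes the usual System.Collections.Generic and System.Threading usings.
static List<int>[] CountPrimesInParallel(int n, int threadCount)
{
    var results = new List<int>[threadCount];
    var threads = new Thread[threadCount];
    for (int t = 0; t < threadCount; t++)
    {
        int index = t;                                   // copy so the lambda does not capture the loop variable
        int start = Math.Max(2, index * n / threadCount);
        int end = (index == threadCount - 1) ? n + 1 : (index + 1) * n / threadCount; // last slice includes n itself
        results[index] = new List<int>();
        threads[index] = new Thread(() =>
        {
            for (int i = start; i < end; i++)
            {
                bool isPrime = true;
                for (int j = 2; j * j <= i; j++)
                {
                    if (i % j == 0) { isPrime = false; break; }
                }
                if (isPrime) results[index].Add(i);
            }
        });
        threads[index].Start();
    }
    foreach (var thread in threads) thread.Join();
    return results;
}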
I am trying to use Vector to add integer values from 2 arrays faster than a traditional for loop.
My Vector count is 4, which should mean that the addArrays_Vector function should run about 4 times faster than addArrays_Normally:
var vectSize = Vector<int>.Count;
This is true on my computer:
Vector.IsHardwareAccelerated
However, strangely enough, these are the benchmarks:
addArrays_Normally takes 475 milliseconds
addArrays_Vector takes 627 milliseconds
How is this possible? Shouldn't addArrays_Vector take only approximately 120 milliseconds? Am I doing something wrong?
void runVectorBenchmark()
{
var v1 = new int[92564080];
var v2 = new int[92564080];
for (int i = 0; i < v1.Length; i++)
{
v1[i] = 2;
v2[i] = 2;
}
//new Thread(() => addArrays_Normally(v1, v2)).Start();
new Thread(() => addArrays_Vector(v1, v2, Vector<int>.Count)).Start();
}
void addArrays_Normally(int[] v1, int[] v2)
{
Stopwatch stopWatch = new Stopwatch();
stopWatch.Start();
int sum = 0;
int i = 0;
for (i = 0; i < v1.Length; i++)
{
sum = v1[i] + v2[i];
}
stopWatch.Stop();
MessageBox.Show("stopWatch: " + stopWatch.ElapsedMilliseconds.ToString() + " milliseconds\n\n" );
}
void addArrays_Vector(int[] v1, int[] v2, int vectSize)
{
Stopwatch stopWatch = new Stopwatch();
stopWatch.Start();
int[] retVal = new int[v1.Length];
int i = 0;
for (i = 0; i < v1.Length - vectSize; i += vectSize)
{
var va = new Vector<int>(v1, i);
var vb = new Vector<int>(v2, i);
var vc = va + vb;
vc.CopyTo(retVal, i);
}
stopWatch.Stop();
MessageBox.Show("stopWatch: " + stopWatch.ElapsedMilliseconds.ToString() + " milliseconds\n\n" );
}
The two functions are different, and it looks like RAM is the bottleneck here:
in the first example
var v1 = new int[92564080];
var v2 = new int[92564080];
...
int sum = 0;
int i = 0;
for (i = 0; i < v1.Length; i++)
{
sum = v1[i] + v2[i];
}
The code reads each array once, so the memory traffic is: sizeof(int) * 92564080 * 2 == 4 * 92564080 * 2 == 706 MB.
in the second example
var v1 = new int[92564080];
var v2 = new int[92564080];
...
int[] retVal = new int[v1.Length];
int i = 0;
for (i = 0; i < v1.Length - vectSize; i += vectSize)
{
var va = new Vector<int>(v1, i);
var vb = new Vector<int>(v2, i);
var vc = va + vb;
vc.CopyTo(retVal, i);
}
The code reads 2 input arrays and writes into an output array, so the memory traffic is at least sizeof(int) * 92564080 * 3 == 1,059 MB.
Update:
RAM is much slower than the CPU and its caches. From this great article, Memory Bandwidth Napkin Math, roughly:
L1 Bandwidth: 210 GB/s
...
RAM Bandwidth: 45 GB/s
So the extra memory traffic negates the vectorization speed-up.
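As a rough sanity check: 1,059 MB vs. 706 MB is about 1.5x more traffic, which is in the same ballpark as the measured 627 ms vs. 475 ms (about 1.3x).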
And the YouTube video mentioned runs its comparison on different code; the non-vectorized code from the video is as follows, and it consumes the same amount of memory as the vectorized code:
int[] AddArrays_Simple(int[] v1, int[] v2)
{
int[] retVal = new int[v1.Length];
for (int i = 0; i < v1.Length; i++)
{
retVal[i] = v1[i] + v2[i];
}
return retVal;
}
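For completeness, a sketch (assumed code, not from the video or the question; it needs using System.Numerics;) of a vectorized version that, like the question's scalar loop, reads both arrays but writes no output array, so the comparison is not dominated by the extra store traffic:
int AddArrays_VectorSum(int[] v1, int[] v2)
{
    int vectSize = Vector<int>.Count;
    var acc = Vector<int>.Zero;
    int i = 0;
    for (; i <= v1.Length - vectSize; i += vectSize)
    {
        acc += new Vector<int>(v1, i) + new Vector<int>(v2, i); // stays in registers, no output array
    }
    int sum = Vector.Dot(acc, Vector<int>.One);                 // horizontal sum of the accumulator
    for (; i < v1.Length; i++)                                  // scalar tail for the leftover elements
    {
        sum += v1[i] + v2[i];
    }
    return sum;
}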
My task is to sum the numbers in some range, and to achieve that I have to use threads to split the computation.
I divided the range into parts and used a thread for each part.
public class ParallelCalc
{
public long resultLong;
private Thread[] threads;
private List<long> list = new List<long>();
public long MaxNumber { get; set; }
public int ThreadsNumber { get; set; }
public event CalcFinishedEventHandler finished;
public ParallelCalc(long MaxNumber, int ThreadsNumber)
{
this.MaxNumber = MaxNumber;
this.ThreadsNumber = ThreadsNumber;
this.threads = new Thread[ThreadsNumber];
}
public void Start()
{
Stopwatch sw = new Stopwatch();
for (int i = 0; i < ThreadsNumber; i++)
{
threads[i] = new Thread(() => Sum(((MaxNumber / ThreadsNumber) * i) + 1,
MaxNumber / ThreadsNumber * (i + 1)));
if (i == ThreadsNumber - 1)
{
threads[i] = new Thread(() => Sum(((MaxNumber / ThreadsNumber) * i) + 1,
MaxNumber));
}
sw.Start();
threads[i].Start();
}
while (threads.All(t => t.IsAlive));
sw.Stop();
finished?.Invoke(this,
new CalcFinishedEventArgs()
{
Result = list.Sum(),
Time = sw.ElapsedMilliseconds
});
}
private void Sum(long startNumber, long endnumber)
{
long result = 0;
for (long i = startNumber; i <= endnumber; i++)
{
result += i;
}
list.Add(result);
}
}
The result should be the sum of the numbers, but it is incorrect, presumably because of unsynchronized additions to the list from multiple threads. Please point out the error.
There is more than one thing wrong here, brace yourself...
Start creates a Stopwatch sw, but you call sw.Start on every iteration of the loop. Start it only once.
If i == ThreadsNumber - 1 evaluates to true, you replace the Thread you just created and let it go to garbage. I fail to grasp why, because:
(MaxNumber / ThreadsNumber) * (i + 1)   when i == ThreadsNumber - 1
= (MaxNumber / ThreadsNumber) * (ThreadsNumber - 1 + 1)
= (MaxNumber / ThreadsNumber) * ThreadsNumber
= MaxNumber
Do you have rounding problems? Rewrite like this:
((i + 1) * MaxNumber) / ThreadsNumber
By dividing last, you avoid the rounding problem.
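For example, with MaxNumber = 10 and ThreadsNumber = 4: (MaxNumber / ThreadsNumber) * ThreadsNumber = (10 / 4) * 4 = 2 * 4 = 8 because of integer division, so the last range would stop at 8 and silently drop 9 and 10, whereas (ThreadsNumber * MaxNumber) / ThreadsNumber = 40 / 4 = 10 as intended.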
You are spin-waiting on the threads with while (threads.All(t => t.IsAlive));. Note that All(t => t.IsAlive) becomes false as soon as the first thread finishes, so this does not even wait for all of them. You could use Thread.Join instead, or better yet, let the threads notify you when they are done.
The ranges in the lambdas close over the loop variable i. You need to be careful with that in C# (see "C# - For loop and the lambda expressions").
List<T> is not thread-safe. I would suggest using a simple array (you know the number of threads after all) and having each thread store its result only in the position that corresponds to it.
You have not considered what would happen if a second call to Start happens before the first one finishes.
So, we will have an array for the output:
var output = new long[ThreadsNumber];
And one for the Threads:
var threads = new Thread[ThreadsNumber];
Hmm, almost like we should create a class.
We will have the stopwatch:
var sw = new Stopwatch();
Let us start it once:
sw.Start();
Now a for to create the Threads:
for (var i = 0; i < ThreadsNumber; i++)
{
// ...
}
Have a copy of i to prevent problems:
for (var i = 0; i < ThreadsNumber; i++)
{
var index = i;
// ...
}
Compute the range for the current thread:
for (var i = 0; i < ThreadsNumber; i++)
{
var index = i;
var start = 1 + (i * MaxNumber) / ThreadsNumber;
var end = ((i + 1) * MaxNumber) / ThreadsNumber;
// ...
}
We need to write Sum in such way that we can store the output in the array:
private void Sum(long startNumber, long endNumber, int index)
{
long result = 0;
for (long i = startNumber; i <= endNumber; i++)
{
result += i;
}
output[index] = result;
}
Hmm... wait, there is a better way...
private static void Sum(long startNumber, long endNumber, out long output)
{
long result = 0;
for (long i = startNumber; i <= endNumber; i++)
{
result += i;
}
output = result;
}
Hmm... no, we can do better...
private static long Sum(long startNumber, long endNumber)
{
long result = 0;
for (long i = startNumber; i <= endNumber; i++)
{
result += i;
}
return result;
}
Create the Thread
for (var i = 0; i < ThreadsNumber; i++)
{
var index = i;
var start = 1 + (i * MaxNumber) / ThreadsNumber;
var end = ((i + 1) * MaxNumber) / ThreadsNumber;
threads[i] = new Thread(() => output[index] = Sum(start, end));
// ...
}
And start the Thread:
for (var i = 0; i < ThreadsNumber; i++)
{
var index = i;
var start = 1 + (i * MaxNumber) / ThreadsNumber;
var end = ((i + 1) * MaxNumber) / ThreadsNumber;
threads[i] = new Thread(() => {output[index] = Sum(start, end);});
threads[i].Start();
}
Are we really going to wait on these?
Think, think...
We keep track of how many threads are pending... and when they are all done, we call the event (and stop the Stopwatch).
var pendingThreads = ThreadsNumber;
// ...
for (var i = 0; i < ThreadsNumber; i++)
{
// ...
threads[i] = new Thread
(
() =>
{
output[index] = Sum(start, end);
if (Interlocked.Decrement(ref pendingThreads) == 0)
{
sw.Stop();
finished?.Invoke
(
this,
new CalcFinishedEventArgs()
{
Result = output.Sum(),
Time = sw.ElapsedMilliseconds
}
);
}
}
);
// ...
}
Let us bring it all together:
void Main()
{
var pc = new ParallelCalc(20, 5);
pc.Finished += (sender, args) =>
{
Console.WriteLine(args);
};
pc.Start();
}
public class CalcFinishedEventArgs : EventArgs
{
public long Result {get; set;}
public long Time {get; set;}
}
public class ParallelCalc
{
public long MaxNumber { get; set; }
public int ThreadsNumber { get; set; }
public event EventHandler<CalcFinishedEventArgs> Finished;
public ParallelCalc(long MaxNumber, int ThreadsNumber)
{
this.MaxNumber = MaxNumber;
this.ThreadsNumber = ThreadsNumber;
}
public void Start()
{
var output = new long[ThreadsNumber];
var threads = new Thread[ThreadsNumber];
var pendingThreads = ThreadsNumber;
var sw = new Stopwatch();
sw.Start();
for (var i = 0; i < ThreadsNumber; i++)
{
var index = i;
var start = 1 + (i * MaxNumber) / ThreadsNumber;
var end = ((i + 1) * MaxNumber) / ThreadsNumber;
threads[i] = new Thread
(
() =>
{
output[index] = Sum(start, end);
if (Interlocked.Decrement(ref pendingThreads) == 0)
{
sw.Stop();
Finished?.Invoke
(
this,
new CalcFinishedEventArgs()
{
Result = output.Sum(),
Time = sw.ElapsedMilliseconds
}
);
}
}
);
threads[i].Start();
}
}
private static long Sum(long startNumber, long endNumber)
{
long result = 0;
for (long i = startNumber; i <= endNumber; i++)
{
result += i;
}
return result;
}
}
Output:
Result: 210
Time: 0
That is too fast... let me input:
var pc = new ParallelCalc(2000000000, 5);
pc.Finished += (sender, args) =>
{
Console.WriteLine(args);
};
pc.Start();
Output:
Result: 2000000001000000000
Time: 773
And that is correct.
And yes, this code takes care of the case of calling Start multiple times. Notice that it creates a new array for the output and a new array of threads each time; that way, it does not trip over itself.
I leave error handling to you. Hints: MaxNumber / ThreadsNumber -> division by zero, (i + 1) * MaxNumber -> overflow, not to mention output.Sum() -> overflow.
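For a feel for the overflow limits: the total is MaxNumber * (MaxNumber + 1) / 2, which exceeds long.MaxValue (about 9.2e18) once MaxNumber passes roughly 4.3 billion, and (i + 1) * MaxNumber overflows once MaxNumber exceeds long.MaxValue / ThreadsNumber.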
I am trying to optimize a search algorithm I use to find marked symbols in TwinCAT 3 through the ADS interface. The question is not TwinCAT-specific, so don't get scared off yet.
The problems:
Symbols are not loaded all at once; I think the TwinCAT ADS library uses lazy loading.
The symbols form a tree-like structure: a non-binary, unbalanced tree.
The solution:
You can open more than one stream to ADS and handle the streams in multiple threads.
The question: I divide the first level of symbols by the number of processor cores, but because the tree is unbalanced, some of the threads finish much faster than others. Because of this I need a better way to divide the work between the threads.
PS: I can't use Parallel.ForEach(); because of the streams it results in the same or a greater time than the single-threaded solution.
My test code looks like this; it just counts all the symbols of a huge project.
using TwinCAT.Ads;
using System.Threading;
using System.IO;
using System.Diagnostics;
using System.Collections;
namespace MultipleStreamsTest
{
class Program
{
static int numberOfThreads = Environment.ProcessorCount;
static TcAdsClient client;
static TcAdsSymbolInfoLoader symbolLoader;
static TcAdsSymbolInfoCollection[] collection = new TcAdsSymbolInfoCollection[numberOfThreads];
static int[] portionResult = new int[numberOfThreads];
static int[] portionStart = new int[numberOfThreads];
static int[] portionStop = new int[numberOfThreads];
static void Connect()
{
client = new TcAdsClient();
client.Connect(851);
Console.WriteLine("Connected");
}
static void Main(string[] args)
{
Connect();
symbolLoader = client.CreateSymbolInfoLoader();
CountAllOneThread();
CountWithMultipleThreads();
Console.ReadKey();
}
static public void CountAllOneThread()
{
Stopwatch stopwatch = new Stopwatch();
int index = 0;
stopwatch.Start();
Console.WriteLine("Counting with one thread...");
//Count all symbols
foreach (TcAdsSymbolInfo symbol in symbolLoader)
{
index++;
}
stopwatch.Stop();
//Output
Console.WriteLine("Counted with one thread " + index + " symbols in " + stopwatch.Elapsed);
}
static public int countRecursive(TcAdsSymbolInfo symbol)
{
int i = 0;
TcAdsSymbolInfo subSymbol = symbol.FirstSubSymbol;
while (subSymbol != null)
{
i = i + countRecursive(subSymbol);
subSymbol = subSymbol.NextSymbol;
i++;
}
return i;
}
static public void countRecursiveMultiThread(object portionNum)
{
int portionNumAsInt = (int)portionNum;
for (int i = portionStart[portionNumAsInt]; i <= portionStop[portionNumAsInt]; i++)
{
portionResult[portionNumAsInt] += countRecursive(collection[portionNumAsInt][i]); // this worker's part of the collection
}
}
static public void CountWithMultipleThreads()
{
Stopwatch stopwatch = new Stopwatch();
int sum = 0;
stopwatch.Start();
Console.WriteLine("Counting with multiple thread...");
for (int i = 0; i < numberOfThreads; i++)
{
collection[i] = symbolLoader.GetSymbols(true);
}
int size = (int)(collection[0].Count / numberOfThreads);
int rest = collection[0].Count % numberOfThreads;
int m = 0;
for (; m < numberOfThreads; m++)
{
portionStart[m] = m * size;
portionStop[m] = portionStart[m] + size - 1;
}
portionStop[m - 1] += rest;
Thread[] threads = new Thread[numberOfThreads];
for (int i = 0; i < numberOfThreads; i++)
{
threads[i] = new Thread(countRecursiveMultiThread);
threads[i].Start(i);
Console.WriteLine("Thread #" + threads[i].ManagedThreadId + " started, fieldIndex: " + i);
}
//Check when threads finishing:
int threadsFinished = 0;
bool[] threadFinished = new bool[numberOfThreads];
int x = 0;
while (true)
{
if (threads[x].Join(10) && !threadFinished[x] )
{
Console.WriteLine("Thread #" + threads[x].ManagedThreadId + " finished ~ at: " + stopwatch.Elapsed);
threadsFinished++;
threadFinished[x] = true;
}
x++;
x = x % numberOfThreads;
if (threadsFinished == numberOfThreads) break;
Thread.Sleep(50);
}
foreach (int n in portionResult)
{
sum += n;
}
sum += collection[0].Count;
stopwatch.Stop();
//Output
Console.WriteLine("Counted with multiple threads in collection " + sum + " symbols" + " in " + stopwatch.Elapsed);
for (int i = 0; i < numberOfThreads; i++)
{
Console.WriteLine("#" + i + ": " + portionResult[i]);
}
}
}
}
The console output:
If you are trying to run the code, use TwinCAT.Ads version 4.0.17.0 (the one I am using); they broke something in the newer version that is available on NuGet.
Make a thread pool and keep track of each thread's running/idle status. At each branch, check whether there are idle threads; if there are, assign one to the sub-branch.
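A lighter-weight way to get the same effect is a shared work queue of top-level symbols: each worker keeps its own symbol collection (its own stream) and, whenever it finishes a subtree, pulls the next index from the queue, so the unbalanced tree no longer leaves threads idle. A rough sketch (an assumed helper, reusing countRecursive and the per-worker collections from the question, plus a using System.Collections.Concurrent;):
static int CountWithWorkQueue(TcAdsSymbolInfoCollection[] collections, int workerCount)
{
    var pending = new ConcurrentQueue<int>();               // indices of the top-level symbols
    for (int i = 0; i < collections[0].Count; i++) pending.Enqueue(i);
    int total = 0;
    var workers = new Thread[workerCount];
    for (int w = 0; w < workerCount; w++)
    {
        int workerIndex = w;                                // copy for the closure
        workers[workerIndex] = new Thread(() =>
        {
            int local = 0;
            while (pending.TryDequeue(out int symbolIndex)) // pull the next subtree until the queue is empty
            {
                local += 1 + countRecursive(collections[workerIndex][symbolIndex]);
            }
            Interlocked.Add(ref total, local);              // publish this worker's count once
        });
        workers[workerIndex].Start();
    }
    foreach (var worker in workers) worker.Join();
    return total;
}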
I am trying to learn about CPU cache performance in the world of .NET. Specifically I am working through Igor Ostrovsky's article about Processor Cache Effects.
I have gone through the first three examples in his article and have recorded results that widely differ from his. I think I must be doing something wrong because the performance on my machine is showing almost the exact opposite results of what he shows in his article. I am not seeing the large effects from cache misses that I would expect.
What am I doing wrong? (bad code, compiler setting, etc.)
Here are the performance results on my machine:
If it helps, the processor on my machine is an Intel Core i7-2630QM. Here is info on my processor's cache:
I have compiled in x64 Release mode.
Below is my source code:
class Program
{
static Stopwatch watch = new Stopwatch();
static int[] arr = new int[64 * 1024 * 1024];
static void Main(string[] args)
{
Example1();
Example2();
Example3();
Console.ReadLine();
}
static void Example1()
{
Console.WriteLine("Example 1:");
// Loop 1
watch.Restart();
for (int i = 0; i < arr.Length; i++) arr[i] *= 3;
watch.Stop();
Console.WriteLine(" Loop 1: " + watch.ElapsedMilliseconds.ToString() + " ms");
// Loop 2
watch.Restart();
for (int i = 0; i < arr.Length; i += 32) arr[i] *= 3;
watch.Stop();
Console.WriteLine(" Loop 2: " + watch.ElapsedMilliseconds.ToString() + " ms");
Console.WriteLine();
}
static void Example2()
{
Console.WriteLine("Example 2:");
for (int k = 1; k <= 1024; k *= 2)
{
watch.Restart();
for (int i = 0; i < arr.Length; i += k) arr[i] *= 3;
watch.Stop();
Console.WriteLine(" K = "+ k + ": " + watch.ElapsedMilliseconds.ToString() + " ms");
}
Console.WriteLine();
}
static void Example3()
{
Console.WriteLine("Example 3:");
for (int k = 1; k <= 1024*1024; k *= 2)
{
//256* 4bytes per 32 bit int * k = k Kilobytes
arr = new int[256*k];
int steps = 64 * 1024 * 1024; // Arbitrary number of steps
int lengthMod = arr.Length - 1;
watch.Restart();
for (int i = 0; i < steps; i++)
{
arr[(i * 16) & lengthMod]++; // (x & lengthMod) is equal to (x % arr.Length)
}
watch.Stop();
Console.WriteLine(" Array size = " + arr.Length * 4 + " bytes: " + (int)(watch.Elapsed.TotalMilliseconds * 1000000.0 / arr.Length) + " nanoseconds per element");
}
Console.WriteLine();
}
}
Why are you using i += 32 in the second loop? You are stepping over cache lines that way: 32 * 4 = 128 bytes, which is way bigger than the 64 bytes of a cache line.
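With 64-byte cache lines and 4-byte ints, a stride of 16 touches each cache line exactly once, which is presumably what the second loop was meant to demonstrate:
for (int i = 0; i < arr.Length; i += 16) arr[i] *= 3; // 16 * 4 bytes = 64 bytes = one cache line per iteration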