I'm not entirely sure if I have done something wrong in my tests, but from my results MemoryPool is consistently slower and allocates more memory than ArrayPool. Since you can convert an array to Memory<T> anyway, what is the point of using MemoryPool?
using System.Buffers;
using BenchmarkDotNet.Running;
using BenchmarkDotNet.Attributes;
BenchmarkRunner.Run<test>();
[MemoryDiagnoser]
public class test
{
[Benchmark]
public void WithArrayPool()
{
ArrayPool<int> pool = ArrayPool<int>.Shared;
for (int z = 0; z < 100; z++)
{
var memory = pool.Rent(2347);
for (int i = 0; i < memory.Length; i++)
{
memory[i] = i + 1;
}
int total = 0;
for (int i = 0; i < memory.Length; i++)
{
total += memory[i];
}
pool.Return(memory);
}
}
[Benchmark]
public void WithMemoryPool()
{
MemoryPool<int> pool = MemoryPool<int>.Shared;
for (int z = 0; z < 100; z++)
{
var rentedArray = pool.Rent(2347);
var memory = rentedArray.Memory;
for (int i = 0; i < memory.Length; i++)
{
memory.Span[i] = i + 1;
}
int total = 0;
for (int i = 0; i < memory.Length; i++)
{
total += memory.Span[i];
}
rentedArray.Dispose();
}
}
}
| Method         | Mean       | Error   | StdDev  | Allocated |
|--------------- |-----------:|--------:|--------:|----------:|
| WithArrayPool  |   770.2 us | 2.27 us | 2.01 us |       1 B |
| WithMemoryPool | 1,714.6 us | 0.56 us | 0.50 us |   2,402 B |
My test code with results is above. Is MemoryPool actually just slower in general, or is there something I am missing? If MemoryPool is in fact slower, what use case does it have?
Thanks.
About the performance:
The repeated call to memory.Span[i] is the culprit. The source code on GitHub shows that a fair amount of work happens behind that property getter. Instead of repeating that call, store its result in a variable:
var span = memory.Span;
See full code block below.
Now the Mean numbers are almost equal.
| Method         | Mean     | Error   | StdDev  | Median   | Gen 0  | Allocated |
|--------------- |---------:|--------:|--------:|---------:|-------:|----------:|
| WithArrayPool  | 333.4 us | 2.34 us | 5.15 us | 331.0 us |      - |         - |
| WithMemoryPool | 368.6 us | 7.08 us | 5.53 us | 366.7 us | 0.4883 |   2,400 B |
[Benchmark]
public void WithMemoryPool()
{
MemoryPool<int> pool = MemoryPool<int>.Shared;
for (int z = 0; z < 100; z++)
{
var rentedArray = pool.Rent(2347);
var memory = rentedArray.Memory;
var span = memory.Span;
for (int i = 0; i < memory.Length; i++)
{
span[i] = i + 1;
}
int total = 0;
for (int i = 0; i < memory.Length; i++)
{
total += span[i];
}
rentedArray.Dispose();
}
}
About the difference in allocated memory.
That's by design.
There's already an very good post that explains this.
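In short: MemoryPool<int>.Shared.Rent returns an IMemoryOwner<int>, and that owner object is itself a small heap allocation, whereas ArrayPool<int>.Shared.Rent hands back the pooled array directly. A minimal sketch of the difference (presumably the source of the ~2,400 B for the 100 rents in the benchmark):

using System.Buffers;

// MemoryPool: each Rent allocates a small IMemoryOwner<int> wrapper object.
using (IMemoryOwner<int> owner = MemoryPool<int>.Shared.Rent(2347))
{
    Memory<int> memory = owner.Memory;
    // ... use memory ...
}

// ArrayPool: Rent returns the pooled array itself, allocation-free,
// and you can still get a Memory<int> over it.
int[] array = ArrayPool<int>.Shared.Rent(2347);
Memory<int> asMemory = array.AsMemory(0, 2347);
// ... use asMemory ...
ArrayPool<int>.Shared.Return(array);

The trade-off is the abstraction: IMemoryOwner<T> makes ownership and lifetime explicit, and a MemoryPool<T> implementation is free to back the Memory<T> with something other than a managed array (native memory, for example), which ArrayPool<T> cannot do.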
I recently translated some Python code into C#. The C# function works fine and produces the same output as the Python one, but it takes about 15 times longer. This is mostly due to one of my functions, which calculates the cross-correlation of two vectors and takes much longer than numpy's correlate(a, b, "full").
I wrote (assume a and b are of the same length):
double[] CrossCorr(double[] a, double[] b)
{
int l = a.Length;
int jmin, jmax, index;
index = 0;
int lmax = 2*l - 1;
double[] z = new double[lmax];
for (int i = 0; i < lmax; i++)
{
if (i >= l)
{
jmin = i - l + 1;
jmax = l - 1;
}
else
{
jmax = i;
jmin = 0;
}
for (int j = jmin; j <= jmax; j++)
{
index = l - i + j - 1;
z[i] += (a[j] * b[index]);
}
}
return z;
}
Another attempt was to use the known equation:
corr(a, b) = ifft(fft(a_and_zeros) * conj(fft(b_and_zeros)))
which resulted in this function (using Math.Net):
double[] Corrcorfour(double[] a, double[] b)
{
//Fourier transformation
Complex[] distrib_Pr_Com = new Complex[a.Count()];
Complex[] distrib_Pi_noise_Com = new Complex[b.Count()];
for (int k = 0; k < a.Length; k++)
{
distrib_Pr_Com[k] = (Complex)a[k];
distrib_Pi_noise_Com[k] = (Complex)b[k];
}
MathNet.Numerics.IntegralTransforms.Fourier.Forward(distrib_Pi_noise_Com);
MathNet.Numerics.IntegralTransforms.Fourier.Forward(distrib_Pr_Com);
//complex conj
for (int k = 0; k < distrib_Pi_noise_Com.Count(); k++)
{
distrib_Pi_noise_Com[k] = Complex.Conjugate(distrib_Pi_noise_Com[k]);
}
//multiply results
Complex[] test = new Complex[distrib_Pr_Com.Count()];
for (int k = 0; k < distrib_Pr_Com.Count(); k++)
{
test[k] = Complex.Multiply(distrib_Pr_Com[k], distrib_Pi_noise_Com[k]);
}
//transform back (inverse FFT)
MathNet.Numerics.IntegralTransforms.Fourier.Inverse(test);
//transform to double
double[] finish = new double[test.Count()];
for (int k = 0; k < test.Count(); k++)
{
finish[k] = test[k].Real;
}
return finish;
}
However, this function somehow returns an array of a different length and with different results (so it's wrong). It is also only 2 times quicker than the first function, which would still not be enough.
I tried to look up other people's functions, but they didn't seem to do it much differently from either of the two. Is there a shortcut I'm not seeing, or an error?
I found a few similar questions but I didn't find a satisfying answer in any of them.
One option to speed this up significantly would be to run it in a parallel loop, since each output index is independent of the others. I believe numpy uses parallelism as well. I'm sure there are other per-thread optimizations, but this should be a good start.
public static double[] CrossCorrParallel(double[] a, double[] b)
{
int l = a.Length;
int lmax = 2 * l - 1;
double[] z = new double[lmax];
Parallel.For(0, lmax, (i) =>
{
int jmin, jmax, index;
if (i >= l)
{
jmin = i - l + 1;
jmax = l - 1;
}
else
{
jmax = i;
jmin = 0;
}
for (int j = jmin; j <= jmax; j++)
{
index = l - i + j - 1;
z[i] += (a[j] * b[index]);
}
});
return z;
}
My benchmarks are below; note that I am using a 10850K with 10 cores (20 threads):
| Method | Mean | Error | StdDev |
|------------- |----------:|---------:|---------:|
| TestOrig | 189.32 ms | 0.150 ms | 0.133 ms |
| TestParallel | 11.49 ms | 0.221 ms | 0.206 ms |
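As a side note on the question's FFT attempt: the identity corr(a, b) = ifft(fft(a_and_zeros) * conj(fft(b_and_zeros))) only yields the full-length correlation if both inputs are zero-padded to length 2l - 1 before the transforms, which Corrcorfour never does; that is why its output length differs. A sketch of the padded version using Math.NET (the method name CrossCorrFft is mine, and depending on the FFT convention the result may still need circular reordering to exactly match numpy's correlate(a, b, "full")):

using System.Numerics;
using MathNet.Numerics.IntegralTransforms;

static double[] CrossCorrFft(double[] a, double[] b)
{
    int l = a.Length;
    int padded = 2 * l - 1;                 // length of the full correlation
    var fa = new Complex[padded];
    var fb = new Complex[padded];
    for (int k = 0; k < l; k++)
    {
        fa[k] = a[k];                       // copy inputs into zero-padded buffers
        fb[k] = b[k];
    }
    Fourier.Forward(fa, FourierOptions.Matlab);
    Fourier.Forward(fb, FourierOptions.Matlab);
    for (int k = 0; k < padded; k++)
        fa[k] *= Complex.Conjugate(fb[k]);  // pointwise multiply by the conjugate
    Fourier.Inverse(fa, FourierOptions.Matlab);
    var z = new double[padded];
    for (int k = 0; k < padded; k++)
        z[k] = fa[k].Real;                  // keep the real part
    return z;
}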
I am working on a project that compares the time bubble sort and selection sort take. I made two separate programs and combined them into one, and now bubble sort is running much faster than selection sort. I checked that the code wasn't just giving me 0s because of some conversion error and that it was running as intended. I am using System.Diagnostics to measure the time. I also checked that the machine was not the problem; I ran it on Replit and got similar results.
{
class Program
{
public static int s1 = 0;
public static int s2 = 0;
static decimal bubblesort(int[] arr1)
{
int n = arr1.Length;
var sw1 = Stopwatch.StartNew();
for (int i = 0; i < n - 1; i++)
{
for (int j = 0; j < n - i - 1; j++)
{
if (arr1[j] > arr1[j + 1])
{
int tmp = arr1[j];
// swap arr1[j] and arr1[j + 1]
arr1[j] = arr1[j + 1];
arr1[j + 1] = tmp;
s1++;
}
}
}
sw1.Stop();
// Console.WriteLine(sw1.ElapsedMilliseconds);
decimal a = Convert.ToDecimal(sw1.ElapsedMilliseconds);
return a;
}
static decimal selectionsort(int[] arr2)
{
int n = arr2.Length;
var sw1 = Stopwatch.StartNew();
// for (int e = 0; e < 1000; e++)
// {
for (int x = 0; x < arr2.Length - 1; x++)
{
int minPos = x;
for (int y = x + 1; y < arr2.Length; y++)
{
if (arr2[y] < arr2[minPos])
minPos = y;
}
if (x != minPos && minPos < arr2.Length)
{
int temp = arr2[minPos];
arr2[minPos] = arr2[x];
arr2[x] = temp;
s2++;
}
}
// }
sw1.Stop();
// Console.WriteLine(sw1.ElapsedMilliseconds);
decimal a = Convert.ToDecimal(sw1.ElapsedMilliseconds);
return a;
}
static void Main(string[] args)
{
Console.WriteLine("Enter the size of n");
int n = Convert.ToInt32(Console.ReadLine());
Random rnd = new System.Random();
decimal bs = 0M;
decimal ss = 0M;
int s = 0;
int[] arr1 = new int[n];
int tx = 1000; //tx is a variable that I can use to adjust sample size
decimal tm = Convert.ToDecimal(tx);
for (int i = 0; i < tx; i++)
{
for (int a = 0; a < n; a++)
{
arr1[a] = rnd.Next(0, 1000000);
}
ss += selectionsort(arr1);
bs += bubblesort(arr1);
}
bs = bs / tm;
ss = ss / tm;
Console.WriteLine("Bubble Sort took " + bs + " miliseconds");
Console.WriteLine("Selection Sort took " + ss + " miliseconds");
}
}
}
What is going on? What is causing bubble sort to be fast, or what is slowing down selection sort? How can I fix this?
I found that the problem was that the selection sort was looping 1000 times per method run, in addition to the 1000 runs for sample size, causing it to perform significantly worse than bubble sort. Thank you guys for the help, and thank you TheGeneral for showing me the benchmarking tools. Also, the array given as a parameter was a reference rather than a copy, so running through the loop manually showed me that bubble sort was doing its job on an already sorted array.
To solve your initial problem you just need to copy your arrays. You can do this easily with ToArray():
Creates an array from an IEnumerable<T>.
ss += selectionsort(arr1.ToArray());
bs += bubblesort(arr1.ToArray());
However, let's learn how to do a more reliable benchmark with BenchmarkDotNet:
BenchmarkDotNet Nuget
Official Documentation
Given
public class Sort
{
public static void BubbleSort(int[] arr1)
{
int n = arr1.Length;
for (int i = 0; i < n - 1; i++)
{
for (int j = 0; j < n - i - 1; j++)
{
if (arr1[j] > arr1[j + 1])
{
int tmp = arr1[j];
// swap arr1[j] and arr1[j + 1]
arr1[j] = arr1[j + 1];
arr1[j + 1] = tmp;
}
}
}
}
public static void SelectionSort(int[] arr2)
{
int n = arr2.Length;
for (int x = 0; x < arr2.Length - 1; x++)
{
int minPos = x;
for (int y = x + 1; y < arr2.Length; y++)
{
if (arr2[y] < arr2[minPos])
minPos = y;
}
if (x != minPos && minPos < arr2.Length)
{
int temp = arr2[minPos];
arr2[minPos] = arr2[x];
arr2[x] = temp;
}
}
}
}
Benchmark code
[SimpleJob(RuntimeMoniker.Net50)]
[MemoryDiagnoser()]
public class SortBenchmark
{
private int[] data;
[Params(100, 1000)]
public int N;
[GlobalSetup]
public void Setup()
{
var r = new Random(42);
data = Enumerable
.Repeat(0, N)
.Select(i => r.Next(0, N))
.ToArray();
}
[Benchmark]
public void Bubble() => Sort.BubbleSort(data.ToArray());
[Benchmark]
public void Selection() => Sort.SelectionSort(data.ToArray());
}
Usage
static void Main(string[] args)
{
BenchmarkRunner.Run<SortBenchmark>();
}
Results
| Method    | N    | Mean       | Error     | StdDev    |
|---------- |-----:|-----------:|----------:|----------:|
| Bubble    |  100 |   8.553 us | 0.0753 us | 0.0704 us |
| Selection |  100 |   4.757 us | 0.0247 us | 0.0231 us |
| Bubble    | 1000 | 657.760 us | 7.2581 us | 6.7893 us |
| Selection | 1000 | 300.395 us | 2.3302 us | 2.1796 us |
Summary
What have we learnt? Your bubble sort code is slower ¯\_(ツ)_/¯
It looks like you're passing the already-sorted array into bubble sort. Because arrays are reference types, the sort you do in selection sort edits the same contents that are eventually passed into bubble sort.
Make a second array and pass the second array into bubble sort.
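For example, a minimal sketch against the question's Main loop:

// Give each sort its own copy so neither sees the other's sorted output.
int[] forSelection = (int[])arr1.Clone();
int[] forBubble = (int[])arr1.Clone();
ss += selectionsort(forSelection);
bs += bubblesort(forBubble);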
I have a two-dimensional array. When I add values by column, it writes very slowly (nearly 3x slower than adding by row):
class Program
{
static void Main(string[] args)
{
TwoDimArrayPerfomrance.GetByColumns();
TwoDimArrayPerfomrance.GetByRows();
}
}
class TwoDimArrayPerfomrance
{
public static void GetByRows()
{
int maxLength = 20000;
int[,] a = new int[maxLength, maxLength];
DateTime dt = DateTime.Now;
Console.WriteLine("The current time is: " + dt.ToString());
//fill value
for (int i = 0; i < maxLength; i++)
{
for (int j = 0; j < maxLength; j++)
{
a[i, j] = i + j;
}
}
DateTime end = DateTime.Now;
Console.WriteLine("Total: " + end.Subtract(dt).TotalSeconds);
}
public static void GetByColumns()
{
int maxLength = 20000;
int[,] a = new int[maxLength, maxLength];
DateTime dt = DateTime.Now;
Console.WriteLine("The current time is: " + dt.ToString());
for (int i = 0; i < maxLength; i++)
{
for (int j = 0; j < maxLength; j++)
{
a[j, i] = j + i;
}
}
DateTime end = DateTime.Now;
Console.WriteLine("Total: " + end.Subtract(dt).TotalSeconds);
}
}
Filling column-wise takes around 4.2 seconds,
while row-wise takes 1.53 seconds.
It is the "cache proximity" problem mentioned in the first comment. There are memory caches that any data must go through to be accessed by the CPU. Those caches store blocks of memory, so if you are first accessing memory N and then memory N+1 then cache is not changed. But if you first access memory N and then memory N+M (where M is big enough) then new memory block must be added to the cache. When you add new block to the cache some existing block must be removed. If you then have to access this removed block then you have inefficiency in the code.
I concur fully with what @Dialecticus wrote... I'll just add that there are bad ways to write a microbenchmark, and there are worse ways. There are many things to get right when microbenchmarking: remember to run in Release mode without the debugger attached; remember that there is a GC, and that it is better if it runs when you want it to and not at random moments during the benchmark; remember that code is jitted only after it has executed at least once, so at least one full warm-up round is a good idea... and so on. There is even a full library for benchmarking (https://benchmarkdotnet.org/articles/overview.html) that is used by the Microsoft .NET Core teams to check that there are no speed regressions in the code they write.
class Program
{
static void Main(string[] args)
{
if (Debugger.IsAttached)
{
Console.WriteLine("Warning, debugger attached!");
}
#if DEBUG
Console.WriteLine("Warning, Debug version!");
#endif
Console.WriteLine($"Running at {(Environment.Is64BitProcess ? 64 : 32)}bits");
Console.WriteLine(RuntimeInformation.FrameworkDescription);
Process.GetCurrentProcess().PriorityClass = ProcessPriorityClass.High;
Console.WriteLine();
const int MaxLength = 10000;
for (int i = 0; i < 10; i++)
{
Console.WriteLine($"Round {i + 1}:");
TwoDimArrayPerfomrance.GetByRows(MaxLength);
GC.Collect();
GC.WaitForPendingFinalizers();
TwoDimArrayPerfomrance.GetByColumns(MaxLength);
GC.Collect();
GC.WaitForPendingFinalizers();
Console.WriteLine();
}
}
}
class TwoDimArrayPerfomrance
{
public static void GetByRows(int maxLength)
{
int[,] a = new int[maxLength, maxLength];
Stopwatch sw = Stopwatch.StartNew();
//fill value
for (int i = 0; i < maxLength; i++)
{
for (int j = 0; j < maxLength; j++)
{
a[i, j] = i + j;
}
}
sw.Stop();
Console.WriteLine($"By Rows, size {maxLength} * {maxLength}, {sw.ElapsedMilliseconds / 1000.0:0.00} seconds");
// So that the assignment isn't optimized out, we do some fake operation on the array
for (int i = 0; i < maxLength; i++)
{
for (int j = 0; j < maxLength; j++)
{
if (a[i, j] == int.MaxValue)
{
throw new Exception();
}
}
}
}
public static void GetByColumns(int maxLength)
{
int[,] a = new int[maxLength, maxLength];
Stopwatch sw = Stopwatch.StartNew();
//fill value
for (int i = 0; i < maxLength; i++)
{
for (int j = 0; j < maxLength; j++)
{
a[j, i] = i + j;
}
}
sw.Stop();
Console.WriteLine($"By Columns, size {maxLength} * {maxLength}, {sw.ElapsedMilliseconds / 1000.0:0.00} seconds");
// So that the assignment isn't optimized out, we do some fake operation on the array
for (int i = 0; i < maxLength; i++)
{
for (int j = 0; j < maxLength; j++)
{
if (a[i, j] == int.MaxValue)
{
throw new Exception();
}
}
}
}
}
Ah... and multi-dimensional arrays of the type FooType[,] went the way of the dodo with .NET 3.5, when LINQ came out and didn't support them. You should use jagged arrays, FooType[][].
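A sketch of the same fill using a jagged array (each row is an ordinary int[], which LINQ handles fine):

int[][] a = new int[maxLength][];
for (int i = 0; i < maxLength; i++)
{
    a[i] = new int[maxLength];   // allocate one row at a time
    for (int j = 0; j < maxLength; j++)
    {
        a[i][j] = i + j;
    }
}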
If you map your two-dimensional array to a one-dimensional one, it might be a bit easier to see what is going on. The mapping gives:
var a = new int[maxLength * maxLength];
Now the lookup calculation is up to you.
for (int i = 0; i < maxLength; i++)
{
for (int j = 0; j < maxLength; j++)
{
//var rowBased = j + i * MaxLength;
var colBased = i + j * MaxLength;
//a[rowBased] = i + j;
a[colBased] = i + j;
}
}
So observe the following:
On column-based lookup, j * MaxLength is recomputed on every inner iteration because j changes each time, giving 20,000 * 20,000 multiplications.
On row-based lookup, i * MaxLength is hoisted by the compiler and only happens 20,000 times.
Now that a is a one-dimensional array, it's also easier to see how the memory is being accessed. With row-based indexing the memory is accessed sequentially, whereas column-based access jumps around, and depending on the size of the array the overhead will vary, as you have seen.
Looking a bit at what BenchmarkDotNet produces:
BenchmarkDotNet=v0.12.1, OS=Windows 10.0.19042
AMD Ryzen 9 3900X, 1 CPU, 24 logical and 12 physical cores
.NET Core SDK=5.0.101
| Method       | MaxLength | Mean             | Error         | StdDev        |
|------------- |----------:|-----------------:|--------------:|--------------:|
| GetByRows    |       100 |         23.60 us |      0.081 us |      0.076 us |
| GetByColumns |       100 |         23.74 us |      0.357 us |      0.334 us |
| GetByRows    |      1000 |      2,333.20 us |     13.150 us |     12.301 us |
| GetByColumns |      1000 |      2,784.43 us |     10.027 us |      8.889 us |
| GetByRows    |     10000 |    238,599.37 us |  1,592.838 us |  1,412.009 us |
| GetByColumns |     10000 |    516,771.56 us |  4,272.849 us |  3,787.770 us |
| GetByRows    |     50000 |  5,903,087.26 us | 13,822.525 us | 12,253.308 us |
| GetByColumns |     50000 | 19,623,369.45 us | 92,325.407 us | 86,361.243 us |
You will see that while MaxLength is reasonably small (100x100 and 1000x1000), the differences are almost negligible, because I expect the CPU can keep the whole allocated array in its fast cache, and the differences are only related to the number of multiplications.
When the matrix becomes larger, the CPU can no longer keep all the allocated memory in its internal cache, and we start to see cache misses and fetches from external memory instead, which is always a lot slower. That overhead just increases as the matrix grows.
I need to optimize code that counts positive/negative values and removes values that no longer qualify by time.
I have a queue of values with a timestamp attached.
I need to discard values that are more than 1 ms old and count the negative and positive values. Here is pseudo code:
list<val> l;
v = q.dequeue();
deleteold(l, v.time);
l.add(v);
negcount = l.count(i => i.value < 0);
poscount = l.count(i => i.value >= 0);
if(negcount == 10) return -1;
if(poscount == 10) return 1;
I need this code in C# working at maximum speed. There is no need to stick with the List; in fact, separate arrays for negative and positive values are welcome.
edit: probably unsafe arrays will be best. Any hints?
EDIT: Thanks for the heads-up. I quickly tested the array version vs. the list (which I already have), and the list is faster: 35 vs 16 ms for 1 million iterations...
Here is the code for fairness sake:
class Program
{
static int LEN = 10;
static int LEN1 = 9;
static void Main(string[] args)
{
Var[] data = GenerateData();
Stopwatch sw = new Stopwatch();
for (int i = 0; i < 30; i++)
{
sw.Reset();
ArraysMethod(data, sw);
Console.Write("Array: {0:0.0000}ms ", sw.ElapsedTicks / 10000.0);
sw.Reset();
ListMethod(data, sw);
Console.WriteLine("List: {0:0.0000}ms", sw.ElapsedTicks / 10000.0);
}
Console.ReadLine();
}
private static void ArraysMethod(Var[] data, Stopwatch sw)
{
int signal = 0;
int ni = 0, pi = 0;
Var[] n = new Var[LEN];
Var[] p = new Var[LEN];
for (int i = 0; i < LEN; i++)
{
n[i] = new Var();
p[i] = new Var();
}
sw.Start();
for (int i = 0; i < DATALEN; i++)
{
Var v = data[i];
if (v.val < 0)
{
int x = 0;
ni = 0;
// time is not sequential
for (int j = 0; j < LEN; j++)
{
long diff = v.time - n[j].time;
if (diff < 0)
diff = 0;
// too old
if (diff > 10000)
x = j;
else
ni++;
}
n[x] = v;
if (ni >= LEN1)
signal = -1;
}
else
{
int x = 0;
pi = 0;
// time is not sequential
for (int j = 0; j < LEN; j++)
{
long diff = v.time - p[j].time;
if (diff < 0)
diff = 0;
// too old
if (diff > 10000)
x = j;
else
pi++;
}
p[x] = v;
if (pi >= LEN1)
signal = 1;
}
}
sw.Stop();
}
private static void ListMethod(Var[] data, Stopwatch sw)
{
int signal = 0;
List<Var> d = new List<Var>();
sw.Start();
for (int i = 0; i < DATALEN; i++)
{
Var v = data[i];
d.Add(new Var() { time = v.time, val = v.val < 0 ? -1 : 1 });
// delete expired
for (int j = 0; j < d.Count; j++)
{
if (v.time - d[j].time < 10000)
d.RemoveAt(j--);
else
break;
}
int cnt = 0;
int k = d.Count;
for (int j = 0; j < k; j++)
{
cnt += d[j].val;
}
if ((cnt >= 0 ? cnt : -cnt) >= LEN)
signal = 9;
}
sw.Stop();
}
static int DATALEN = 1000000;
private static Var[] GenerateData()
{
Random r = new Random(DateTime.Now.Millisecond);
Var[] data = new Var[DATALEN];
Var prev = new Var() { val = 0, time = DateTime.Now.TimeOfDay.Ticks};
for (int i = 0; i < DATALEN; i++)
{
int x = r.Next(20);
data[i] = new Var() { val = x - 10, time = prev.time + x * 1000 };
}
return data;
}
class Var
{
public int val;
public long time;
}
}
To get negcount and poscount, you are traversing the entire list twice.
Instead, traverse it once (to compute negcount), and then poscount = l.Count - negcount.
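In the question's pseudo-types, that looks something like this:

int negcount = 0;
foreach (var v in l)
{
    if (v.value < 0) negcount++;   // one pass over the list
}
int poscount = l.Count - negcount; // everything else is non-negative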
Some ideas:
Only count until max(negcount,poscount) becomes 10, then quit (no need to count the rest). Only works if 10 is the maximum count.
Count negative and positive items in 1 go.
Calculate only negcount and infer poscount from count - negcount, which is easier than counting both.
Whether any of these is faster than what you have now, and which is fastest, depends among other things on what the data typically looks like. Is the list long? Short?
Some more about 3:
You can use trickery to avoid branches here. You don't have to test whether the item is negative; you can add its sign bit to a counter. Supposing the item x is an int, x >> 31 is 0 for non-negative x and -1 for negative x, so counter -= x >> 31 gives negcount.
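A small sketch of that trick (the name CountNegatives is mine; it relies on >> being an arithmetic shift for int):

static int CountNegatives(int[] values)
{
    int negcount = 0;
    foreach (int x in values)
    {
        negcount -= x >> 31;   // x >> 31 is -1 for negative x, 0 otherwise
    }
    return negcount;
}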
Edit: unsafe array access can be faster, but it shouldn't be in this case, because the loop would be of the form
for (int i = 0; i < array.Length; i++)
do something with array[i];
which is already optimized by the JIT compiler (the bounds checks are eliminated for this pattern).
I was doing some performance metrics and I ran into something that seems quite odd to me. I time the following two functions:
private static void DoOne()
{
List<int> A = new List<int>();
for (int i = 0; i < 200; i++) A.Add(i);
int s=0;
for (int j = 0; j < 100000; j++)
{
for (int c = 0; c < A.Count; c++) s += A[c];
}
}
private static void DoTwo()
{
List<int> A = new List<int>();
for (int i = 0; i < 200; i++) A.Add(i);
IList<int> L = A;
int s = 0;
for (int j = 0; j < 100000; j++)
{
for (int c = 0; c < L.Count; c++) s += L[c];
}
}
Even when compiling in release mode, the timings consistently showed DoTwo taking ~100x longer than DoOne:
DoOne took 0.06171706 seconds.
DoTwo took 8.841709 seconds.
Given the fact that List<int> directly implements IList<int>, I was very surprised by the results. Can anyone clarify this behavior?
The gory details
Responding to questions, here is the full code:
using System;
using System.Collections.Generic;
using System.Text;
using System.Diagnostics;
using System.Collections;
namespace TimingTests
{
class Program
{
static void Main(string[] args)
{
Stopwatch SW = new Stopwatch();
SW.Start();
DoOne();
SW.Stop();
Console.WriteLine(" DoOne took {0} seconds.", ((float)SW.ElapsedTicks) / Stopwatch.Frequency);
SW.Reset();
SW.Start();
DoTwo();
SW.Stop();
Console.WriteLine(" DoTwo took {0} seconds.", ((float)SW.ElapsedTicks) / Stopwatch.Frequency);
}
private static void DoOne()
{
List<int> A = new List<int>();
for (int i = 0; i < 200; i++) A.Add(i);
int s=0;
for (int j = 0; j < 100000; j++)
{
for (int c = 0; c < A.Count; c++) s += A[c];
}
}
private static void DoTwo()
{
List<int> A = new List<int>();
for (int i = 0; i < 200; i++) A.Add(i);
IList<int> L = A;
int s = 0;
for (int j = 0; j < 100000; j++)
{
for (int c = 0; c < L.Count; c++) s += L[c];
}
}
}
}
Thanks for all the good answers (especially @kentaromiura). I would have closed the question, though I feel we still miss an important part of the puzzle. Why would accessing a class via an interface it implements be so much slower? The only difference I can see is that accessing a function via an interface implies using virtual tables, while normally the functions can be called directly. To see whether this is the case I made a couple of changes to the above code. First I introduced two almost identical classes:
public class VC
{
virtual public int f() { return 2; }
virtual public int Count { get { return 200; } }
}
public class C
{
public int f() { return 2; }
public int Count { get { return 200; } }
}
As you can see VC is using virtual functions and C doesn't. Now to DoOne and DoTwo:
private static void DoOne()
{ C a = new C();
int s=0;
for (int j = 0; j < 100000; j++)
{
for (int c = 0; c < a.Count; c++) s += a.f();
}
}
private static void DoTwo()
{
VC a = new VC();
int s = 0;
for (int j = 0; j < 100000; j++)
{
for (int c = 0; c < a.Count; c++) s += a.f();
}
}
And indeed:
DoOne took 0.01287789 seconds.
DoTwo took 8.982396 seconds.
This is even more scary - virtual function calls 800 times slower?? So a couple of questions to the community:
1. Can you reproduce this? (Everyone saw worse performance than DoOne before, but not as bad as mine.)
2. Can you explain it?
3. (This may be the most important.) Can you think of a way to avoid it?
Boaz
A note to everyone out there who is trying to benchmark stuff like this.
Do not forget that the code is not jitted until the first time it runs. That means that the first time you run a method, the cost of running that method could be dominated by the time spent loading the IL, analyzing the IL, and jitting it into machine code, particularly if it is a trivial method.
If what you're trying to do is compare the "marginal" runtime cost of two methods, it's a good idea to run both of them twice and consider only the second runs for comparison purposes.
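A minimal warm-up pattern along those lines, sketched with the Stopwatch code from the question:

// Run each method once so the jitting cost is paid before timing starts.
DoOne();
DoTwo();

Stopwatch SW = Stopwatch.StartNew();
DoOne();
SW.Stop();
Console.WriteLine(" DoOne took {0} seconds.", ((float)SW.ElapsedTicks) / Stopwatch.Frequency);

SW.Restart();
DoTwo();
SW.Stop();
Console.WriteLine(" DoTwo took {0} seconds.", ((float)SW.ElapsedTicks) / Stopwatch.Frequency);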
Profiling one on one:
Testing with Snippet Compiler.
Using your code, the results are:
0.043s vs 0.116s
Eliminating the temporary L:
0.043s vs 0.116s - no effect
Caching A.Count in cmax in both methods:
0.041s vs 0.076s
IList<int> A = new List<int>();
for (int i = 0; i < 200; i++) A.Add(i);
int s = 0;
for (int j = 0; j < 100000; j++)
{
for (int c = 0, cmax = A.Count; c < cmax; c++) s += A[c];
}
Now I will try to slow down DoOne. First try, casting to IList before Add:
for (int i = 0; i < 200; i++) ((IList<int>)A).Add(i);
0.041s vs 0.076s - so Add has no effect.
So there remains only one place where the slowdown can happen: s += A[c];
So I try this:
s += ((IList<int>)A)[c];
0.075s vs 0.075s - TADaaan!
So it seems that accessing Count or an indexed element is slower on the interfaced version.
EDIT:
Just for fun, take a look at this:
for (int c = 0, cmax = A.Count; c < cmax; c++) s += ((List<int>)A)[c];
0.041s vs 0.050s
So it is not a cast problem, but a virtual-dispatch one!
First, I want to thank everyone for their answers; they were really essential in figuring out what was going on. Special thanks go to @kentaromiura, who found the key needed to get to the bottom of things.
The source of the slowdown when using List<T> via an IList<T> interface is the JIT compiler's inability to inline the Item property getter. The virtual-table dispatch caused by accessing the list through its IList<T> interface prevents that inlining.
As proof, I wrote the following code:
public class VC
{
virtual public int f() { return 2; }
virtual public int Count { get { return 200; } }
}
public class C
{
//[MethodImpl( MethodImplOptions.NoInlining)]
public int f() { return 2; }
public int Count
{
// [MethodImpl(MethodImplOptions.NoInlining)]
get { return 200; }
}
}
and modified the DoOne and DoTwo classes to the following:
private static void DoOne()
{
C c = new C();
int s = 0;
for (int j = 0; j < 100000; j++)
{
for (int i = 0; i < c.Count; i++) s += c.f();
}
}
private static void DoTwo()
{
VC c = new VC();
int s = 0;
for (int j = 0; j < 100000; j++)
{
for (int i = 0; i < c.Count; i++) s += c.f();
}
}
Sure enough the function times are now very similar to before:
DoOne took 0.01273598 seconds.
DoTwo took 8.524558 seconds.
Now, if you remove the comments before the MethodImpl attributes in the C class (forcing the JIT not to inline), the timings become:
DoOne took 8.734635 seconds.
DoTwo took 8.887354 seconds.
Voila - the methods now take almost the same time. You can still see that DoOne is slightly faster, which is consistent with the extra overhead of a virtual function call in DoTwo.
I believe that the problem lies in your time metrics, what are you using to measure the elapsed time?
Just for the record, here are my results:
DoOne() -> 295 ms
DoTwo() -> 291 ms
And the code:
Stopwatch sw = new Stopwatch();
sw.Start();
{
DoOne();
}
sw.Stop();
Console.WriteLine("DoOne() -> {0} ms", sw.ElapsedMilliseconds);
sw.Reset();
sw.Start();
{
DoTwo();
}
sw.Stop();
Console.WriteLine("DoTwo() -> {0} ms", sw.ElapsedMilliseconds);
I am seeing a significant penalty for the interface version, but nowhere near the magnitude of the penalty you are seeing.
Can you post a small, complete program that demonstrates the behaviour along with exactly how you are compiling it and exactly what version of the framework you are using?
My tests show the interface version to be about 3x slower when compiled in release mode. When compiled in debug mode they're almost neck-and-neck.
--------------------------------------------------------
DoOne Release (ms) | 92 | 91 | 91 | 92 | 92 | 92
DoTwo Release (ms) | 313 | 313 | 316 | 352 | 320 | 318
--------------------------------------------------------
DoOne Debug (ms) | 535 | 534 | 548 | 536 | 534 | 537
DoTwo Debug (ms) | 566 | 570 | 569 | 565 | 568 | 571
--------------------------------------------------------
EDIT
In my tests I used a slightly modified version of the DoTwo method so that it was directly comparable to DoOne. (This change didn't make any discernible difference to the performance.)
private static void DoTwo()
{
IList<int> A = new List<int>();
for (int i = 0; i < 200; i++) A.Add(i);
int s = 0;
for (int j = 0; j < 100000; j++)
{
for (int c = 0; c < A.Count; c++) s += A[c];
}
}
The only difference between the IL generated for DoOne and (modified) DoTwo is that the callvirt instructions for Add, get_Item and get_Count use IList<int> and ICollection<int> rather than List<int> itself.
I'm guessing that the runtime has to do more work to find the actual method implementation when the callvirt is through an interface (and that the JIT compiler/optimiser can do a better job with the non-interface calls than the interface calls when compiling in release mode).
Can anybody confirm this?
I've run this using Jon Skeet's Benchmark Helper and I am not seeing the results you are; the execution time is approximately the same between the two methods.