Why writing by column is slow in two dimensional array in C# - c#

I have a two-dimensional array. When I fill its values by column it writes very slowly (roughly 3× slower than filling by row):
class Program
{
    static void Main(string[] args)
    {
        TwoDimArrayPerformance.GetByColumns();
        TwoDimArrayPerformance.GetByRows();
    }
}

class TwoDimArrayPerformance
{
    public static void GetByRows()
    {
        int maxLength = 20000;
        int[,] a = new int[maxLength, maxLength];
        DateTime dt = DateTime.Now;
        Console.WriteLine("The current time is: " + dt.ToString());
        // fill values; the second (rightmost) index moves fastest
        for (int i = 0; i < maxLength; i++)
        {
            for (int j = 0; j < maxLength; j++)
            {
                a[i, j] = i + j;
            }
        }
        DateTime end = DateTime.Now;
        Console.WriteLine("Total: " + end.Subtract(dt).TotalSeconds);
    }

    public static void GetByColumns()
    {
        int maxLength = 20000;
        int[,] a = new int[maxLength, maxLength];
        DateTime dt = DateTime.Now;
        Console.WriteLine("The current time is: " + dt.ToString());
        // fill values; the first (leftmost) index moves fastest
        for (int i = 0; i < maxLength; i++)
        {
            for (int j = 0; j < maxLength; j++)
            {
                a[j, i] = j + i;
            }
        }
        DateTime end = DateTime.Now;
        Console.WriteLine("Total: " + end.Subtract(dt).TotalSeconds);
    }
}
Filling column-wise takes around 4.2 seconds,
while row-wise takes 1.53 seconds.

It is the "cache proximity" problem mentioned in the first comment. There are memory caches that any data must go through to be accessed by the CPU. Those caches store blocks of memory, so if you first access memory address N and then address N+1, the cache does not change. But if you first access address N and then address N+M (where M is large enough), a new memory block must be loaded into the cache, and loading a new block means evicting an existing one. If you then need the evicted block again, you pay that cost again, and that is the inefficiency in the code.
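To make that concrete for this case: a C# rectangular array int[,] is stored row-major, so a minimal sketch of the address arithmetic (assuming 4-byte ints and the 20000-element dimensions from the question) looks like this:

```csharp
using System;

class RowMajorLayout
{
    static void Main()
    {
        const int maxLength = 20000;

        // For int[maxLength, maxLength], element (i, j) lives at flat
        // offset i * maxLength + j. Stepping j (row-wise) moves 4 bytes;
        // stepping i (column-wise) jumps a whole row.
        int rowStepBytes = sizeof(int);             // distance to the next j
        int colStepBytes = maxLength * sizeof(int); // distance to the next i

        Console.WriteLine($"row-wise step: {rowStepBytes} bytes");
        Console.WriteLine($"column-wise step: {colStepBytes} bytes");
        // A typical cache line is 64 bytes: row-wise access reuses each
        // fetched line 16 times, column-wise access misses on every step.
    }
}
```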

I concur fully with what @Dialecticus wrote... I'll just add that there are bad ways to write a microbenchmark, and there are worse ways. There are many things to keep in mind when microbenchmarking: remember to run in Release mode without the debugger attached; remember that there is a GC, and that it is better if it runs when you want it to, not at random while you are measuring; remember that code is sometimes compiled only after it has executed at least once, so at least one full warmup round is a good idea... and so on. There is even a full benchmarking library, BenchmarkDotNet (https://benchmarkdotnet.org/articles/overview.html), that the Microsoft .NET Core teams use to check for speed regressions in the code they write.
class Program
{
    static void Main(string[] args)
    {
        if (Debugger.IsAttached)
        {
            Console.WriteLine("Warning, debugger attached!");
        }
#if DEBUG
        Console.WriteLine("Warning, Debug version!");
#endif
        Console.WriteLine($"Running at {(Environment.Is64BitProcess ? 64 : 32)}bits");
        Console.WriteLine(RuntimeInformation.FrameworkDescription);
        Process.GetCurrentProcess().PriorityClass = ProcessPriorityClass.High;
        Console.WriteLine();
        const int MaxLength = 10000;
        for (int i = 0; i < 10; i++)
        {
            Console.WriteLine($"Round {i + 1}:");
            TwoDimArrayPerformance.GetByRows(MaxLength);
            GC.Collect();
            GC.WaitForPendingFinalizers();
            TwoDimArrayPerformance.GetByColumns(MaxLength);
            GC.Collect();
            GC.WaitForPendingFinalizers();
            Console.WriteLine();
        }
    }
}

class TwoDimArrayPerformance
{
    public static void GetByRows(int maxLength)
    {
        int[,] a = new int[maxLength, maxLength];
        Stopwatch sw = Stopwatch.StartNew();
        // fill values
        for (int i = 0; i < maxLength; i++)
        {
            for (int j = 0; j < maxLength; j++)
            {
                a[i, j] = i + j;
            }
        }
        sw.Stop();
        Console.WriteLine($"By Rows, size {maxLength} * {maxLength}, {sw.ElapsedMilliseconds / 1000.0:0.00} seconds");
        // So that the assignment isn't optimized out, we do some fake operation on the array
        for (int i = 0; i < maxLength; i++)
        {
            for (int j = 0; j < maxLength; j++)
            {
                if (a[i, j] == int.MaxValue)
                {
                    throw new Exception();
                }
            }
        }
    }

    public static void GetByColumns(int maxLength)
    {
        int[,] a = new int[maxLength, maxLength];
        Stopwatch sw = Stopwatch.StartNew();
        // fill values
        for (int i = 0; i < maxLength; i++)
        {
            for (int j = 0; j < maxLength; j++)
            {
                a[j, i] = i + j;
            }
        }
        sw.Stop();
        Console.WriteLine($"By Columns, size {maxLength} * {maxLength}, {sw.ElapsedMilliseconds / 1000.0:0.00} seconds");
        // So that the assignment isn't optimized out, we do some fake operation on the array
        for (int i = 0; i < maxLength; i++)
        {
            for (int j = 0; j < maxLength; j++)
            {
                if (a[i, j] == int.MaxValue)
                {
                    throw new Exception();
                }
            }
        }
    }
}
Ah... and multi-dimensional arrays of the type FooType[,] went the way of the dodo with .NET 3.5, when LINQ came out and it didn't support them. You should use jagged arrays FooType[][].
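As a minimal sketch of that suggestion (with a smaller size than the question's, just for illustration), the same fill loop with a jagged array looks like this; hoisting the row reference also avoids the double indexing inside the inner loop:

```csharp
using System;

class JaggedFill
{
    static void Main()
    {
        const int maxLength = 1000;     // smaller than the original 20000, for illustration
        int[][] a = new int[maxLength][];
        for (int i = 0; i < maxLength; i++)
        {
            a[i] = new int[maxLength];  // each row is its own allocation
        }

        for (int i = 0; i < maxLength; i++)
        {
            int[] row = a[i];           // hoist the row reference out of the inner loop
            for (int j = 0; j < maxLength; j++)
            {
                row[j] = i + j;         // sequential access within one row
            }
        }

        Console.WriteLine(a[999][999]); // 1998
    }
}
```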

If you try to map your two-dimensional array to a one-dimensional one, it might be a bit easier to see what is going on.
The mapping gives
var a = new int[maxLength * maxLength];
Now the lookup calculation is up to you.
for (int i = 0; i < maxLength; i++)
{
    for (int j = 0; j < maxLength; j++)
    {
        //var rowBased = j + i * maxLength;
        var colBased = i + j * maxLength;
        //a[rowBased] = i + j;
        a[colBased] = i + j;
    }
}
So observe the following:
On the column-based lookup, the multiplication j * maxLength happens 20,000 * 20,000 times, because its result changes on every iteration of the inner loop.
On the row-based lookup, i * maxLength is compiler-optimized (hoisted out of the inner loop) and only happens 20,000 times.
Now that a is a one-dimensional array, it's also easier to see how the memory is being accessed. With row-based indexing the memory is accessed sequentially, whereas column-based access jumps maxLength elements on every step; depending on the size of the array, the overhead will vary, as you have seen.
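The hoisting described above can also be done by hand; here is a small sketch of the row-based fill with the multiplication lifted out of the inner loop (using a smaller size for illustration):

```csharp
using System;

class HoistedIndex
{
    static void Main()
    {
        const int maxLength = 1000;
        int[] a = new int[maxLength * maxLength];

        for (int i = 0; i < maxLength; i++)
        {
            int rowBase = i * maxLength;    // one multiplication per row, not per element
            for (int j = 0; j < maxLength; j++)
            {
                a[rowBase + j] = i + j;     // sequential memory access
            }
        }

        Console.WriteLine(a[a.Length - 1]); // 1998
    }
}
```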
Looking a bit at what BenchmarkDotNet produces:
BenchmarkDotNet=v0.12.1, OS=Windows 10.0.19042
AMD Ryzen 9 3900X, 1 CPU, 24 logical and 12 physical cores
.NET Core SDK=5.0.101

| Method       | MaxLength | Mean             | Error         | StdDev        |
|------------- |---------- |-----------------:|--------------:|--------------:|
| GetByRows    | 100       | 23.60 us         | 0.081 us      | 0.076 us      |
| GetByColumns | 100       | 23.74 us         | 0.357 us      | 0.334 us      |
| GetByRows    | 1000      | 2,333.20 us      | 13.150 us     | 12.301 us     |
| GetByColumns | 1000      | 2,784.43 us      | 10.027 us     | 8.889 us      |
| GetByRows    | 10000     | 238,599.37 us    | 1,592.838 us  | 1,412.009 us  |
| GetByColumns | 10000     | 516,771.56 us    | 4,272.849 us  | 3,787.770 us  |
| GetByRows    | 50000     | 5,903,087.26 us  | 13,822.525 us | 12,253.308 us |
| GetByColumns | 50000     | 19,623,369.45 us | 92,325.407 us | 86,361.243 us |
You will see that while MaxLength is reasonably small (100x100 and 1000x1000), the differences are almost negligible, because I expect the CPU can keep the entire two-dimensional array in its fast cache, and the differences relate only to the number of multiplications.
When the matrix becomes larger, the CPU can no longer keep all the allocated memory in its internal cache, and we start to see cache misses and fetches from main memory instead, which is always a lot slower.
That overhead only increases as the size of the matrix grows.

Related

Why is MemoryPool slower and allocates more than ArrayPool?

I'm not entirely sure if I have done something wrong in my tests, but from my results MemoryPool is consistently slower and allocates more memory than ArrayPool. Since you can convert an Array to Memory anyway, what is the point of using MemoryPool?
using System.Buffers;
using BenchmarkDotNet.Running;
using BenchmarkDotNet.Attributes;
BenchmarkRunner.Run<test>();
[MemoryDiagnoser]
public class test
{
    [Benchmark]
    public void WithArrayPool()
    {
        ArrayPool<int> pool = ArrayPool<int>.Shared;
        for (int z = 0; z < 100; z++)
        {
            var memory = pool.Rent(2347);
            for (int i = 0; i < memory.Length; i++)
            {
                memory[i] = i + 1;
            }
            int total = 0;
            for (int i = 0; i < memory.Length; i++)
            {
                total += memory[i];
            }
            pool.Return(memory);
        }
    }

    [Benchmark]
    public void WithMemoryPool()
    {
        MemoryPool<int> pool = MemoryPool<int>.Shared;
        for (int z = 0; z < 100; z++)
        {
            var rentedArray = pool.Rent(2347);
            var memory = rentedArray.Memory;
            for (int i = 0; i < memory.Length; i++)
            {
                memory.Span[i] = i + 1;
            }
            int total = 0;
            for (int i = 0; i < memory.Length; i++)
            {
                total += memory.Span[i];
            }
            rentedArray.Dispose();
        }
    }
}
| Method         | Mean       | Error   | StdDev  | Allocated |
|--------------- |-----------:|--------:|--------:|----------:|
| WithArrayPool  | 770.2 us   | 2.27 us | 2.01 us | 1 B       |
| WithMemoryPool | 1,714.6 us | 0.56 us | 0.50 us | 2,402 B   |
My test code with results is above. Is Memory Pool actually just slower in general or is there something I am missing? If MemoryPool is in fact slower, what use case does it have?
Thanks.
About the performance:
The repeated call to memory.Span[i] is the culprit. The source code on GitHub shows that quite a bit of work happens behind that property getter. Instead of repeating that call, store its result in a variable:
var span = memory.Span;
See the full code block below. Now the Mean numbers are almost equal.
| Method         | Mean     | Error   | StdDev  | Median   | Gen 0  | Allocated |
|--------------- |---------:|--------:|--------:|---------:|-------:|----------:|
| WithArrayPool  | 333.4 us | 2.34 us | 5.15 us | 331.0 us | -      | -         |
| WithMemoryPool | 368.6 us | 7.08 us | 5.53 us | 366.7 us | 0.4883 | 2,400 B   |
[Benchmark]
public void WithMemoryPool()
{
    MemoryPool<int> pool = MemoryPool<int>.Shared;
    for (int z = 0; z < 100; z++)
    {
        var rentedArray = pool.Rent(2347);
        var memory = rentedArray.Memory;
        var span = memory.Span;
        for (int i = 0; i < memory.Length; i++)
        {
            span[i] = i + 1;
        }
        int total = 0;
        for (int i = 0; i < memory.Length; i++)
        {
            total += span[i];
        }
        rentedArray.Dispose();
    }
}
About the difference in allocated memory:
That's by design. There's already a very good post that explains this.

Cannon's algorithm of matrix multiplication

I am trying to implement Cannon's algorithm of matrix multiplication. The description I read on Wikipedia provides this pseudocode:
row i of matrix a is circularly shifted by i elements to the left.
col j of matrix b is circularly shifted by j elements up.
Repeat n times:
    p[i][j] multiplies its two entries and adds to running total.
    circular shift each row of a 1 element left
    circular shift each col of b 1 element up
and I implemented it in C# the following way:
public static void ShiftLeft(int[][] matrix, int i, int count)
{
    int ind = 0;
    while (ind < count)
    {
        int temp = matrix[i][0];
        int indl = matrix[i].Length - 1;
        for (int j = 0; j < indl; j++)
            matrix[i][j] = matrix[i][j + 1];
        matrix[i][indl] = temp;
        ind++;
    }
}

public static void ShiftUp(int[][] matrix, int j, int count)
{
    int ind = 0;
    while (ind < count)
    {
        int temp = matrix[0][j];
        int indl = matrix.Length - 1;
        for (int i = 0; i < indl; i++)
            matrix[i][j] = matrix[i + 1][j];
        matrix[indl][j] = temp;
        ind++;
    }
}

public static int[][] Cannon(int[][] A, int[][] B)
{
    int[][] C = new int[A.Length][];
    for (int i = 0; i < C.Length; i++)
        C[i] = new int[A.Length];
    for (int i = 0; i < A.Length; i++)
        ShiftLeft(A, i, i);
    for (int i = 0; i < B.Length; i++)
        ShiftUp(B, i, i);
    for (int k = 0; k < A.Length; k++)
    {
        for (int i = 0; i < A.Length; i++)
        {
            for (int j = 0; j < B.Length; j++)
            {
                var m = (i + j + k) % A.Length;
                C[i][j] += A[i][m] * B[m][j];
                ShiftLeft(A, i, 1);
                ShiftUp(B, j, 1);
            }
        }
    }
    return C;
}
This code returns the correct result, but does so very slowly. Much slower, even, than the naive matrix multiplication algorithm.
For a 200x200 matrix I got this result:
00:00:00.0490432 //naive algorithm
00:00:07.1397479 //Cannon's algorithm
What am I doing wrong?
Edit
Thanks SergeySlepov, it was a bad attempt to do it in parallel. When I went back to a sequential implementation I got this result:
Count   Naive              Cannon's
200     00:00:00.0492098   00:00:08.0465076
250     00:00:00.0908136   00:00:22.3891375
300     00:00:00.1477764   00:00:58.0640621
350     00:00:00.2639114   00:01:51.5545524
400     00:00:00.4323984   00:04:50.7260942
Okay, it's not a parallel implementation, but how can I do it correctly?
Cannon's algorithm was built for a 'Distributed Memory Machine' (a grid of processors, each with its own memory). This is very different to the hardware you're running it on (a few processors with shared memory) and that is why you're not seeing any increase in performance.
The 'circular shifts' in the pseudocode that you quoted actually mimic data transfers between processors. After the initial matrix 'skewing', each processor in the grid keeps track of three numbers (a, b and c) and executes pseudocode similar to this:
c += a * b;
pass 'a' to the processor to your left (wrapping around)
pass 'b' to the processor to 'above' you (wrapping around)
wait for the next iteration of k
We could mimic this behaviour on a PC using NxN threads, but the overhead of context switching (or spawning Tasks) would kill all the joy. To make the most of a PC's 4 (or so) CPUs we could make the loop over i parallel. The loop over k needs to be sequential (unlike your solution), otherwise you might face race conditions, as each iteration of k modifies the matrices A, B and C. In a 'distributed memory machine' race conditions are not a problem, because the processors do not share any memory.
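As an illustrative sketch of that shape (parallel over the rows, sequential inner loops), here is a plain shared-memory multiplication using Parallel.For rather than Cannon's shifts, since the shifts mutate A and B and would race across threads:

```csharp
using System;
using System.Threading.Tasks;

class ParallelMultiply
{
    // Each thread writes only to its own rows of C, so there are no races.
    static int[][] Multiply(int[][] A, int[][] B)
    {
        int n = A.Length;
        int[][] C = new int[n][];
        Parallel.For(0, n, i =>
        {
            C[i] = new int[n];
            for (int k = 0; k < n; k++)
            {
                int aik = A[i][k];              // hoist A[i][k] out of the inner loop
                for (int j = 0; j < n; j++)
                {
                    C[i][j] += aik * B[k][j];   // row-wise access of both B and C
                }
            }
        });
        return C;
    }

    static void Main()
    {
        int[][] A = { new[] { 1, 2 }, new[] { 3, 4 } };
        int[][] B = { new[] { 5, 6 }, new[] { 7, 8 } };
        int[][] C = Multiply(A, B);
        Console.WriteLine($"{C[0][0]} {C[0][1]} {C[1][0]} {C[1][1]}"); // 19 22 43 50
    }
}
```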

Best/Fastest Way To Change/Access Elements of a Matrix

I'm quite new to C# and I'm having difficulty with arrays, arrays of arrays, jagged arrays, matrices and so on. It's quite different from C++, since I can't get a reference (unless I use unsafe code) to a row of a matrix using pointers and such.
Anyway, here's the problem: I have a struct/class called "Image" that contains 1024 columns and 768 lines. For each line/column there's a 'pixel' struct/class that contains 3 bytes. I'd like to get/set pixels at random places in the matrix as fast as possible.
Let's pretend I have a matrix with 25 pixels. That is 5 rows and 5 columns, something like this:
A B C D E
F G H I J
K L M N O
P Q R S T
U V X W Y
And I need to compare M to H and R. Then M to L and N. Then I need to 'sum' G+H+I+L+M+N+Q+R+S.
How can I do that?
Possibilities:
1) Create something like pixel[5][5] (that's a jagged array, right?), which will be slow whenever I try to compare elements in different columns, right?
2) Create something like pixel[25], which won't be as easy to code/read, because I'll need to do some (simple) math every time I want to access an element.
3) Create something like pixel[5,5] (that's a multi-dimensional array, right?)... But I don't know how that translates to actual memory... whether it's going to be a single block of memory, like pixel[25], or something else...
Since I intend to do these operations ('random' sums/comparisons of elements in different rows/columns) tens of thousands of times per image, and I have 1000+ images, code optimization is a must... Sadly I'm not sure which structure/class I should use.
TL;DR: What's the FASTEST and what's the EASIEST (coding-wise) way of getting/setting elements at random positions of a (fixed-size) matrix?
edit: I do not want to compare C++ to C#. I'm just saying I AM NEW TO C# and I'd like to find the best way to accomplish this, using C#. Please don't tell me to go back to C++.
I have worked on pixel based image processing in C#. I found that pattern #2 in your list is fastest. For speed, you must forget about accessing pixels via some kind of nice abstract interface. The pixel processing routines must explicitly deal with the width and height of the image. This makes for crappy code in general, but unless Microsoft improves the C# compiler, we are stuck with this approach.
If your for loops start at index 0 of an array, and end at array.Length - 1, the compiler will optimize out the array index bounds testing. This is nice, but typically you have to use more than one pixel at a time while processing.
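A minimal sketch of pattern #2 with the width handled explicitly; the PixelBuffer type and its members are hypothetical names for illustration:

```csharp
using System;

class PixelBuffer
{
    // Hypothetical 24-bit pixel buffer backed by a single byte array:
    // 3 bytes per pixel, rows stored one after another (row-major).
    readonly byte[] data;
    readonly int width;

    public PixelBuffer(int width, int height)
    {
        this.width = width;
        data = new byte[width * height * 3];
    }

    int Offset(int x, int y) => (y * width + x) * 3;  // the "simple math" from option 2

    public void SetPixel(int x, int y, byte r, byte g, byte b)
    {
        int o = Offset(x, y);
        data[o] = r; data[o + 1] = g; data[o + 2] = b;
    }

    public (byte r, byte g, byte b) GetPixel(int x, int y)
    {
        int o = Offset(x, y);
        return (data[o], data[o + 1], data[o + 2]);
    }

    static void Main()
    {
        var img = new PixelBuffer(1024, 768);
        img.SetPixel(10, 20, 255, 128, 0);
        Console.WriteLine(img.GetPixel(10, 20)); // (255, 128, 0)
    }
}
```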
I just finished testing, here's the result:
SD Array Test1: 00:00:00.9388379
SD Array Test2: 00:00:00.4117926
MD Array Test1: 00:00:01.4977765
MD Array Test2: 00:00:00.8950093
Jagged Array Test1: 00:00:03.6850013
Jagged Array Test2: 00:00:00.5036041
Conclusion: the single-dimensional array is the way to go... Sadly we lose readability.
And here's the code:
int[] myArray = new int[10000 * 10000];
for (int i = 0; i < 10000; i++)
{
    for (int j = 0; j < 10000; j++)
    {
        myArray[(i * 10000) + j] = i + j;
    }
}
Stopwatch sw = new Stopwatch();
int sum = 0;
sw.Start();
for (int i = 0; i < 10000; i++)
{
    for (int j = 0; j < 10000; j++)
    {
        sum += myArray[(j * 10000) + i];
    }
}
sw.Stop();
Console.WriteLine("SD Array Test1: " + sw.Elapsed.ToString());
sum = 0;
sw.Restart();
for (int i = 0; i < 10000; i++)
{
    for (int j = 0; j < 10000; j++)
    {
        sum += myArray[(i * 10000) + j];
    }
}
sw.Stop();
Console.WriteLine("SD Array Test2: " + sw.Elapsed.ToString());
myArray = null;
int[,] MDA = new int[10000, 10000];
for (int i = 0; i < 10000; i++)
{
    for (int j = 0; j < 10000; j++)
    {
        MDA[i, j] = i + j;
    }
}
sum = 0;
sw.Restart();
for (int i = 0; i < 10000; i++)
{
    for (int j = 0; j < 10000; j++)
    {
        sum += MDA[j, i];
    }
}
sw.Stop();
Console.WriteLine("MD Array Test1: " + sw.Elapsed.ToString());
sw.Restart();
for (int i = 0; i < 10000; i++)
{
    for (int j = 0; j < 10000; j++)
    {
        sum += MDA[i, j];
    }
}
sw.Stop();
Console.WriteLine("MD Array Test2: " + sw.Elapsed.ToString());
MDA = null;
int[][] JA = new int[10000][];
for (int i = 0; i < 10000; i++)
{
    JA[i] = new int[10000];
}
for (int i = 0; i < 10000; i++)
{
    for (int j = 0; j < 10000; j++)
    {
        JA[i][j] = i + j;
    }
}
sum = 0;
sw.Restart();
for (int i = 0; i < 10000; i++)
{
    for (int j = 0; j < 10000; j++)
    {
        sum += JA[j][i];
    }
}
sw.Stop();
Console.WriteLine("Jagged Array Test1: " + sw.Elapsed.ToString());
sw.Restart();
for (int i = 0; i < 10000; i++)
{
    for (int j = 0; j < 10000; j++)
    {
        sum += JA[i][j];
    }
}
sw.Stop();
Console.WriteLine("Jagged Array Test2: " + sw.Elapsed.ToString());
JA = null;
Console.ReadKey();

2D array vs 1D array

I have read the question Performance of 2-dimensional array vs 1-dimensional array,
but its conclusion says they could perform the same (depending on your own mapping function; C does this automatically)...
I have a matrix which has 1,000 columns and 440,000,000 rows, where each element is a double, in C#...
If I am doing computations in memory, which one performs better? (Note that I have the memory needed to hold such a monstrous quantity of information.)
If what you're asking is which is better, a 2D array of size 1000x44000 or a 1D array of size 44000000, well what's the difference as far as memory goes? You still have the same number of elements! In the case of performance and understandability, the 2D is probably better. Imagine having to manually find each column or row in a 1D array, when you know exactly where they are in a 2D array.
It depends on how many operations you are performing. In the example below, I'm setting the values of the array 2500 times. The size of the array is (1000 * 1000 * 3). The 1D array took 40 seconds and the 3D array took 1:39 minutes.
var startTime = DateTime.Now;
Test1D(new byte[1000 * 1000 * 3]);
Console.WriteLine("Total Time taken 1d = " + (DateTime.Now - startTime));
startTime = DateTime.Now;
Test3D(new byte[1000, 1000, 3], 1000, 1000);
Console.WriteLine("Total Time taken 3D = " + (DateTime.Now - startTime));

public static void Test1D(byte[] array)
{
    for (int c = 0; c < 2500; c++)
    {
        for (int i = 0; i < array.Length; i++)
        {
            array[i] = 10;
        }
    }
}

public static void Test3D(byte[,,] array, int w, int h)
{
    for (int c = 0; c < 2500; c++)
    {
        for (int i = 0; i < h; i++)
        {
            for (int j = 0; j < w; j++)
            {
                array[i, j, 0] = 10;
                array[i, j, 1] = 10;
                array[i, j, 2] = 10;
            }
        }
    }
}
The difference between double[1000,44000] and double[44000000] will not be significant.
You're probably better off with the [,] version (letting the compiler(s) figure out the addressing). But the pattern of your calculations is likely to have more impact (locality and cache use).
Also consider the array-of-arrays variant, double[1000][]. It is a known 'feature' of the JITter that it cannot eliminate range checking in [,] arrays.

What's better in regards to performance? type[,] or type[][]?

Is it more performant to have a bidimensional array (type[,]) or an array of arrays (type[][]) in C#?
Particularly for initial allocation and item access
Of course, if all else fails... test it! Following gives (in "Release", at the console):
Size 1000, Repeat 1000
int[,] set: 3460
int[,] get: 4036 (chk=1304808064)
int[][] set: 2441
int[][] get: 1283 (chk=1304808064)
So a jagged array is quicker, at least in this test. Interesting! However, it is a relatively small factor, so I would still stick with whichever describes my requirement better. Except for some specific (high CPU/processing) scenarios, readability / maintainability should trump a small performance gain. Up to you, though.
Note that this test assumes you access the array much more often than you create it, so I have not included timings for creation, where I would expect rectangular to be slightly quicker unless memory is highly fragmented.
using System;
using System.Diagnostics;

static class Program
{
    static void Main()
    {
        Console.WriteLine("First is just for JIT...");
        Test(10, 10);
        Console.WriteLine("Real numbers...");
        Test(1000, 1000);
        Console.ReadLine();
    }

    static void Test(int size, int repeat)
    {
        Console.WriteLine("Size {0}, Repeat {1}", size, repeat);
        int[,] rect = new int[size, size];
        int[][] jagged = new int[size][];
        for (int i = 0; i < size; i++)
        { // don't count this in the metrics...
            jagged[i] = new int[size];
        }
        Stopwatch watch = Stopwatch.StartNew();
        for (int cycle = 0; cycle < repeat; cycle++)
        {
            for (int i = 0; i < size; i++)
            {
                for (int j = 0; j < size; j++)
                {
                    rect[i, j] = i * j;
                }
            }
        }
        watch.Stop();
        Console.WriteLine("\tint[,] set: " + watch.ElapsedMilliseconds);
        int sum = 0;
        watch = Stopwatch.StartNew();
        for (int cycle = 0; cycle < repeat; cycle++)
        {
            for (int i = 0; i < size; i++)
            {
                for (int j = 0; j < size; j++)
                {
                    sum += rect[i, j];
                }
            }
        }
        watch.Stop();
        Console.WriteLine("\tint[,] get: {0} (chk={1})", watch.ElapsedMilliseconds, sum);
        watch = Stopwatch.StartNew();
        for (int cycle = 0; cycle < repeat; cycle++)
        {
            for (int i = 0; i < size; i++)
            {
                for (int j = 0; j < size; j++)
                {
                    jagged[i][j] = i * j;
                }
            }
        }
        watch.Stop();
        Console.WriteLine("\tint[][] set: " + watch.ElapsedMilliseconds);
        sum = 0;
        watch = Stopwatch.StartNew();
        for (int cycle = 0; cycle < repeat; cycle++)
        {
            for (int i = 0; i < size; i++)
            {
                for (int j = 0; j < size; j++)
                {
                    sum += jagged[i][j];
                }
            }
        }
        watch.Stop();
        Console.WriteLine("\tint[][] get: {0} (chk={1})", watch.ElapsedMilliseconds, sum);
    }
}
I believe that [,] can allocate one contiguous chunk of memory, while [][] is N+1 chunk allocations where N is the size of the first dimension. So I would guess that [,] is faster on initial allocation.
Access is probably about the same, except that [][] would involve one extra dereference. Unless you're in an exceptionally tight loop it's probably a wash. Now, if you're doing something like image processing where you are referencing between rows rather than traversing row by row, locality of reference will play a big factor and [,] will probably edge out [][] depending on your cache size.
As Marc Gravell mentioned, usage is key to evaluating the performance...
It really depends. The MSDN Magazine article, Harness the Features of C# to Power Your Scientific Computing Projects, says this:
Although rectangular arrays are generally superior to jagged arrays in terms of structure and performance, there might be some cases where jagged arrays provide an optimal solution. If your application does not require arrays to be sorted, rearranged, partitioned, sparse, or large, then you might find jagged arrays to perform quite well.
type[,] will work faster. Not only because of less offset calculations. Mainly because of less constraint checking, less memory allocation and greater localization in memory. type[][] is not a single object -- it's 1 + N objects that must be allocated and can be away from each other.
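A small sketch of that allocation difference; the counts in the comments follow directly from the two declarations:

```csharp
using System;

class AllocationShape
{
    static void Main()
    {
        const int n = 5;

        int[,] rect = new int[n, n];   // a single contiguous object on the heap
        int[][] jagged = new int[n][]; // one object for the "spine"...
        for (int i = 0; i < n; i++)
        {
            jagged[i] = new int[n];    // ...plus n separate row objects (1 + N total)
        }

        Console.WriteLine(rect.Length);   // 25: all elements in one block
        Console.WriteLine(jagged.Length); // 5: rows may live anywhere on the heap
    }
}
```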
