I am trying to implement Cannon's algorithm for matrix multiplication. I read the description on Wikipedia, which gives the following pseudocode:
row i of matrix a is circularly shifted by i elements to the left.
col j of matrix b is circularly shifted by j elements up.
Repeat n times:
p[i][j] multiplies its two entries and adds to running total.
circular shift each row of a 1 element left
circular shift each col of b 1 element up
and I implemented it in C# as follows:
public static void ShiftLeft(int[][] matrix, int i, int count)
{
    int ind = 0;
    while (ind < count)
    {
        int temp = matrix[i][0];
        int indl = matrix[i].Length - 1;
        for (int j = 0; j < indl; j++)
            matrix[i][j] = matrix[i][j + 1];
        matrix[i][indl] = temp;
        ind++;
    }
}
public static void ShiftUp(int[][] matrix, int j, int count)
{
    int ind = 0;
    while (ind < count)
    {
        int temp = matrix[0][j];
        int indl = matrix.Length - 1;
        for (int i = 0; i < indl; i++)
            matrix[i][j] = matrix[i + 1][j];
        matrix[indl][j] = temp;
        ind++;
    }
}
public static int[][] Cannon(int[][] A, int[][] B)
{
    int[][] C = new int[A.Length][];
    for (int i = 0; i < C.Length; i++)
        C[i] = new int[A.Length];

    for (int i = 0; i < A.Length; i++)
        ShiftLeft(A, i, i);
    for (int i = 0; i < B.Length; i++)
        ShiftUp(B, i, i);

    for (int k = 0; k < A.Length; k++)
    {
        for (int i = 0; i < A.Length; i++)
        {
            for (int j = 0; j < B.Length; j++)
            {
                var m = (i + j + k) % A.Length;
                C[i][j] += A[i][m] * B[m][j];
                ShiftLeft(A, i, 1);
                ShiftUp(B, j, 1);
            }
        }
    }
    return C;
}
This code returns the correct result, but it does so very slowly, much more slowly even than the naive matrix multiplication algorithm.
For a 200x200 matrix I got this result:
00:00:00.0490432 //naive algorithm
00:00:07.1397479 //Cannon's algorithm
What am I doing wrong?
Edit
Thanks SergeySlepov, it was a bad attempt to do it in parallel. When I went back to a sequential implementation I got these results:
Count Naive Cannon's
200 00:00:00.0492098 00:00:08.0465076
250 00:00:00.0908136 00:00:22.3891375
300 00:00:00.1477764 00:00:58.0640621
350 00:00:00.2639114 00:01:51.5545524
400 00:00:00.4323984 00:04:50.7260942
Okay, so it's not a parallel implementation, but how can I do it correctly?
Cannon's algorithm was built for a 'Distributed Memory Machine' (a grid of processors, each with its own memory). This is very different to the hardware you're running it on (a few processors with shared memory) and that is why you're not seeing any increase in performance.
The 'circular shifts' in the pseudocode that you quoted actually mimic data transfers between processors. After the initial matrix 'skewing', each processor in the grid keeps track of three numbers (a, b and c) and executes pseudocode similar to this:
c += a * b;
pass 'a' to the processor to your left (wrapping around)
pass 'b' to the processor to 'above' you (wrapping around)
wait for the next iteration of k
We could mimic this behaviour on a PC with NxN threads, but the overhead of context switching (or spawning Tasks) would kill all the joy. To make the most of a PC's 4 (or so) CPUs we could make the loop over i parallel. The loop over k needs to stay sequential (unlike your solution), otherwise you might face race conditions, as each iteration of k modifies the matrices A, B and C. In a 'distributed memory machine' race conditions are not a problem, since the processors do not share any memory.
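A minimal sketch of that idea (my code, not the OP's, assuming square jagged arrays as in the question): drop the physical shifts entirely and replace them with modular index arithmetic, keep the loop over k sequential, and parallelise only the loop over i, so that each thread owns one row of C and A and B are never written at all:

using System.Threading.Tasks;

public static int[][] CannonParallel(int[][] A, int[][] B)
{
    int n = A.Length;
    int[][] C = new int[n][];
    for (int i = 0; i < n; i++)
        C[i] = new int[n];

    for (int k = 0; k < n; k++)              // sequential 'rounds', as on the processor grid
    {
        Parallel.For(0, n, i =>              // each thread owns row i of C: no races
        {
            for (int j = 0; j < n; j++)
            {
                int m = (i + j + k) % n;     // where the shifted entry would have been
                C[i][j] += A[i][m] * B[m][j];
            }
        });
    }
    return C;
}

Since nothing is physically shifted, A and B are only ever read, which is exactly what removes the race conditions (and, incidentally, the O(n) cost of every one-element shift).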
Related
I have a two-dimensional array. When I fill its values by columns, the writes are very slow (nearly 3x slower than filling by rows):
class Program
{
    static void Main(string[] args)
    {
        TwoDimArrayPerfomrance.GetByColumns();
        TwoDimArrayPerfomrance.GetByRows();
    }
}

class TwoDimArrayPerfomrance
{
    public static void GetByRows()
    {
        int maxLength = 20000;
        int[,] a = new int[maxLength, maxLength];
        DateTime dt = DateTime.Now;
        Console.WriteLine("The current time is: " + dt.ToString());

        // fill values row by row
        for (int i = 0; i < maxLength; i++)
        {
            for (int j = 0; j < maxLength; j++)
            {
                a[i, j] = i + j;
            }
        }

        DateTime end = DateTime.Now;
        Console.WriteLine("Total: " + end.Subtract(dt).TotalSeconds);
    }

    public static void GetByColumns()
    {
        int maxLength = 20000;
        int[,] a = new int[maxLength, maxLength];
        DateTime dt = DateTime.Now;
        Console.WriteLine("The current time is: " + dt.ToString());

        // fill values column by column (indices swapped)
        for (int i = 0; i < maxLength; i++)
        {
            for (int j = 0; j < maxLength; j++)
            {
                a[j, i] = j + i;
            }
        }

        DateTime end = DateTime.Now;
        Console.WriteLine("Total: " + end.Subtract(dt).TotalSeconds);
    }
}
Filling column-wise takes around 4.2 seconds, while row-wise takes 1.53 seconds.
It is the "cache proximity" problem mentioned in the first comment. There are memory caches that any data must go through to be accessed by the CPU. Those caches store blocks of memory, so if you are first accessing memory N and then memory N+1 then cache is not changed. But if you first access memory N and then memory N+M (where M is big enough) then new memory block must be added to the cache. When you add new block to the cache some existing block must be removed. If you then have to access this removed block then you have inefficiency in the code.
I concur fully with what @Dialecticus wrote... I'll just add that there are bad ways to write a microbenchmark, and there are worse ways. There are many things to get right when microbenchmarking: remember to run in Release mode without the debugger attached; remember that there is a GC, and that it is better if it runs when you want it to, not casually in the middle of your benchmark; remember that sometimes code is only compiled after it has executed at least once, so at least one full warmup round is a good idea... and so on. There is even a full library for benchmarking (https://benchmarkdotnet.org/articles/overview.html) that the Microsoft .NET Core teams use to check that there are no speed regressions in the code they write.
using System;
using System.Diagnostics;
using System.Runtime.InteropServices;

class Program
{
    static void Main(string[] args)
    {
        if (Debugger.IsAttached)
        {
            Console.WriteLine("Warning, debugger attached!");
        }

#if DEBUG
        Console.WriteLine("Warning, Debug version!");
#endif

        Console.WriteLine($"Running at {(Environment.Is64BitProcess ? 64 : 32)}bits");
        Console.WriteLine(RuntimeInformation.FrameworkDescription);
        Process.GetCurrentProcess().PriorityClass = ProcessPriorityClass.High;
        Console.WriteLine();

        const int MaxLength = 10000;

        for (int i = 0; i < 10; i++)
        {
            Console.WriteLine($"Round {i + 1}:");
            TwoDimArrayPerfomrance.GetByRows(MaxLength);

            GC.Collect();
            GC.WaitForPendingFinalizers();

            TwoDimArrayPerfomrance.GetByColumns(MaxLength);

            GC.Collect();
            GC.WaitForPendingFinalizers();

            Console.WriteLine();
        }
    }
}
class TwoDimArrayPerfomrance
{
    public static void GetByRows(int maxLength)
    {
        int[,] a = new int[maxLength, maxLength];
        Stopwatch sw = Stopwatch.StartNew();

        // fill values row by row
        for (int i = 0; i < maxLength; i++)
        {
            for (int j = 0; j < maxLength; j++)
            {
                a[i, j] = i + j;
            }
        }

        sw.Stop();
        Console.WriteLine($"By Rows, size {maxLength} * {maxLength}, {sw.ElapsedMilliseconds / 1000.0:0.00} seconds");

        // So that the assignment isn't optimized out, we do some fake operation on the array
        for (int i = 0; i < maxLength; i++)
        {
            for (int j = 0; j < maxLength; j++)
            {
                if (a[i, j] == int.MaxValue)
                {
                    throw new Exception();
                }
            }
        }
    }

    public static void GetByColumns(int maxLength)
    {
        int[,] a = new int[maxLength, maxLength];
        Stopwatch sw = Stopwatch.StartNew();

        // fill values column by column (indices swapped)
        for (int i = 0; i < maxLength; i++)
        {
            for (int j = 0; j < maxLength; j++)
            {
                a[j, i] = i + j;
            }
        }

        sw.Stop();
        Console.WriteLine($"By Columns, size {maxLength} * {maxLength}, {sw.ElapsedMilliseconds / 1000.0:0.00} seconds");

        // So that the assignment isn't optimized out, we do some fake operation on the array
        for (int i = 0; i < maxLength; i++)
        {
            for (int j = 0; j < maxLength; j++)
            {
                if (a[i, j] == int.MaxValue)
                {
                    throw new Exception();
                }
            }
        }
    }
}
Ah... and multi-dimensional arrays of the type FooType[,] went the way of the dodo with .NET 3.5, when LINQ came out and it didn't support them. You should use jagged arrays FooType[][].
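A quick sketch of the jagged equivalent (same row-wise fill as GetByRows above, just with one allocation per row):

int[][] a = new int[maxLength][];
for (int i = 0; i < maxLength; i++)
{
    a[i] = new int[maxLength];   // each row is its own one-dimensional array
    for (int j = 0; j < maxLength; j++)
        a[i][j] = i + j;         // same sequential, cache-friendly access pattern
}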
If you map your two-dimensional array to a one-dimensional one, it may be a bit easier to see what is going on.
The mapping gives

var a = new int[maxLength * maxLength];
Now the lookup calculation is up to you.
for (int i = 0; i < maxLength; i++)
{
for (int j = 0; j < maxLength; j++)
{
//var rowBased = j + i * MaxLength;
var colBased = i + j * MaxLength;
//a[rowBased] = i + j;
a[colBased] = i + j;
}
}
So observe the following:
- On the column-based lookup, the multiplication j * MaxLength happens 20,000 * 20,000 times, because j changes on every iteration of the inner loop.
- On the row-based lookup, the i * MaxLength is compiler-optimised (hoisted out of the inner loop) and only happens 20,000 times.
Now that a is a one-dimensional array, it's also easier to see how the memory is being accessed. With the row-based index the memory is accessed sequentially, whereas the column-based access jumps through memory with a large stride and is almost random from the cache's point of view; depending on the size of the array, the overhead will vary, as you have seen.
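Hand-hoisting the multiplication shows what the row-based variant effectively compiles to (just to illustrate the point, using the same one-dimensional array a):

for (int i = 0; i < maxLength; i++)
{
    int rowStart = i * maxLength;    // one multiplication per row: 20,000 in total
    for (int j = 0; j < maxLength; j++)
        a[rowStart + j] = i + j;     // sequential, cache-friendly writes
}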
Looking a bit at what BenchmarkDotNet produces:
BenchmarkDotNet=v0.12.1, OS=Windows 10.0.19042
AMD Ryzen 9 3900X, 1 CPU, 24 logical and 12 physical cores
.NET Core SDK=5.0.101
|       Method | MaxLength |             Mean |         Error |        StdDev |
|------------- |----------:|-----------------:|--------------:|--------------:|
|    GetByRows |       100 |         23.60 us |      0.081 us |      0.076 us |
| GetByColumns |       100 |         23.74 us |      0.357 us |      0.334 us |
|    GetByRows |      1000 |      2,333.20 us |     13.150 us |     12.301 us |
| GetByColumns |      1000 |      2,784.43 us |     10.027 us |      8.889 us |
|    GetByRows |     10000 |    238,599.37 us |  1,592.838 us |  1,412.009 us |
| GetByColumns |     10000 |    516,771.56 us |  4,272.849 us |  3,787.770 us |
|    GetByRows |     50000 |  5,903,087.26 us | 13,822.525 us | 12,253.308 us |
| GetByColumns |     50000 | 19,623,369.45 us | 92,325.407 us | 86,361.243 us |
You will see that while MaxLength is reasonably small (100x100 and 1000x1000), the differences are almost negligible, because I expect the CPU can keep the allocated two-dimensional array in its fast cache, and the differences are only related to the number of multiplications.
When the matrix becomes larger, the CPU can no longer keep all the allocated memory in its internal cache, and we start to see cache misses and fetches from main memory instead, which is always going to be a lot slower.
That overhead only increases as the size of the matrix grows.
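For reference, here is a sketch of the kind of BenchmarkDotNet harness that produces a table like the one above (the names are illustrative; this is not necessarily the exact code that was run):

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

public class ArrayFillBenchmark
{
    // the 50000 case needs <gcAllowVeryLargeObjects enabled="true" /> and ~10 GB of RAM
    [Params(100, 1000, 10000, 50000)]
    public int MaxLength;

    [Benchmark]
    public int[,] GetByRows()
    {
        var a = new int[MaxLength, MaxLength];
        for (int i = 0; i < MaxLength; i++)
            for (int j = 0; j < MaxLength; j++)
                a[i, j] = i + j;
        return a; // returning the array keeps the fill from being optimized away
    }

    [Benchmark]
    public int[,] GetByColumns()
    {
        var a = new int[MaxLength, MaxLength];
        for (int i = 0; i < MaxLength; i++)
            for (int j = 0; j < MaxLength; j++)
                a[j, i] = i + j;
        return a;
    }
}

class Program
{
    static void Main() => BenchmarkRunner.Run<ArrayFillBenchmark>();
}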
I am trying to work towards a 512 x 512 grid (262144 elements).
I currently have

List<double[,]> data;

whose dimensions are 4096 x [8, 8] (262144 elements).
The 2D structure I am working towards is square:

List<List<float>> newList = new List<List<float>>(); // working towards
I have tried something along the lines of:
for (int i = 0; i < Math.Sqrt(data.Count); i++ ) {
List<float> row = new List<float>();
foreach (double[,] block in data) {
for (int j = 0; j < 8; j++) {
row.Add(block[i,j]); //i clearly out of range
}
}
newList.Add(row);
}
What I was trying to do there was brute-force my way through: build up each large row (taking 8 values at a time from each block) and then add the large rows to newList.
I believe you can do that in the following way
var newList = new List<List<float>>();
for (int i = 0; i < 512; i++)
{
    var innerList = new List<float>();
    for (int j = 0; j < 512; j++)
    {
        int x = (i / 8) * 64 + (j / 8);      // which 8x8 block
        int y = i % 8;                       // row within the block
        int z = j % 8;                       // column within the block
        innerList.Add((float)data[x][y, z]); // explicit cast: double -> float
    }
    newList.Add(innerList);
}
Basically you have a 64x64 grid of your 8x8 blocks, so the (i, j) coordinate of the larger 512x512 structure translates in the following way. First, to find the 8x8 block, work out its row and column in the 64x64 grid of blocks by dividing i and j by the block size (8); then multiply the block row (i / 8) by the number of blocks in a row (64) and add the block column (j / 8). For y and z it's simpler, because they are just the remainders of i and j when divided by 8: (i % 8) and (j % 8).
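A quick worked example of the mapping, picking an arbitrary coordinate (i = 10, j = 20):

int i = 10, j = 20;
int x = (i / 8) * 64 + (j / 8);  // (10 / 8) * 64 + (20 / 8) = 1 * 64 + 2 = 66
int y = i % 8;                   // 10 % 8 = 2
int z = j % 8;                   // 20 % 8 = 4
Console.WriteLine($"({i},{j}) -> data[{x}][{y},{z}]"); // (10,20) -> data[66][2,4]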
I have written a console application
Int64 sum = 0;
int T = Convert.ToInt32(Console.ReadLine());
Int64[] input = new Int64[T];

for (int i = 0; i < T; i++)
{
    input[i] = Convert.ToInt32(Console.ReadLine());
}

for (int i = 0; i < T; i++)
{
    int[,] Matrix = new int[input[i], input[i]];
    sum = 0;
    for (int j = 0; j < input[i]; j++)
    {
        for (int k = 0; k < input[i]; k++)
        {
            Matrix[j, k] = Math.Abs(j - k);
            sum += Matrix[j, k];
        }
    }
    Console.WriteLine(sum);
}
When I gave the input

2
1
999999

it threw an OutOfMemoryException. Can you please help?
Look at what you are allocating:

input[] is allocated as 2 elements (16 bytes) - no worries.

But then you enter the values 1 and 999999, and in the first iteration of the loop you attempt to allocate

Matrix[1, 1] = 4 bytes - again no worries,

but the second time round you try to allocate

Matrix[999999, 999999]

which is 999999 * 999999 * 4 bytes, about 4 * 10^12 bytes (roughly 3.6 TB), and certainly beyond the capacity of your computer, even with swap space on disk.

I suspect that this is not what you really want to allocate (you'd never be able to fill or manipulate that many elements anyway...).

If you are merely trying to do the calculations as per your original code, there is no need to allocate or use the array at all, as you only ever store one value, immediately use it, and then never touch it again.
Int64 sum = 0;
int T = Convert.ToInt32(Console.ReadLine());
Int64[] input = new Int64[T];

for (int i = 0; i < T; i++)
    input[i] = Convert.ToInt32(Console.ReadLine());

for (int i = 0; i < T; i++)
{
    // int[,] Matrix = new int[input[i], input[i]];
    sum = 0;
    for (int j = 0; j < input[i]; j++)
        for (int k = 0; k < input[i]; k++)
        {
            //Matrix[j, k] = Math.Abs(j - k);
            //sum += Matrix[j, k];
            sum += Math.Abs(j - k);
        }
    Console.WriteLine(sum);
}
But now beware - a trillion sums is going to take forever to calculate - it won't bomb out, but you might like to take a vacation, get married and have kids before you can expect a result.
Of course instead of doing the full squared set of calculations, you can calculate the sum thus:
for (int i = 0; i < T; i++)
{
sum = 0;
for (int j = 1, term = 0; j < input[i]; j++)
{
term += j;
sum += term * 2;
}
Console.WriteLine(sum);
}
So now the calculation is O(n) instead of O(n^2).
And if you need to know what the value in Matrix[x,y] would have been, you can compute it with the simple expression Math.Abs(x - y); there is no need to store it.
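For what it's worth (this goes beyond the code above, but follows from the same arithmetic), the sum has a closed form: the sum of |j - k| over all j, k in [0, n) equals n * (n^2 - 1) / 3, so it can even be computed in O(1):

// closed form of the double loop over Math.Abs(j - k)
static long AbsDiffSum(long n) => n * (n * n - 1) / 3;

Console.WriteLine(AbsDiffSum(1));      // 0
Console.WriteLine(AbsDiffSum(2));      // 2 (|0-1| + |1-0|)
Console.WriteLine(AbsDiffSum(999999)); // 333332333334000000, still fits in an Int64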
I'm trying to write code that will fill an array with unique numbers.
I could write the code separately for 1-, 2- and 3-dimensional arrays, but the number of nested for loops grows with every added dimension, effectively without limit.
This is the code for a 2D array:
static void fillArray(int[,] array)
{
    Random rand = new Random();
    for (int i = 0; i < array.GetLength(0); i++)
    {
        for (int j = 0; j < array.GetLength(1); j++)
        {
            array[i, j] = rand.Next(1, 100);
            for (int k = 0; k < j; k++)
                if (array[i, k] == array[i, j])
                    j--;
        }
    }
    print_info(array);
}
Is it possible to do something like this for n-dimensional arrays?
My approach is to start with a 1-d array of unique numbers, which you can shuffle, and then slot into appropriate places in your real array.
Here is the main function:
private static void Initialize(Array array)
{
    var rank = array.Rank;
    var dimensionLengths = new List<int>();
    var totalSize = 1;
    int[] arrayIndices = new int[rank];

    for (var dimension = 0; dimension < rank; dimension++)
    {
        var upperBound = array.GetLength(dimension);
        dimensionLengths.Add(upperBound);
        totalSize *= upperBound;
    }

    var singleArray = new int[totalSize];
    for (int i = 0; i < totalSize; i++) singleArray[i] = i;
    singleArray = Shuffle(singleArray);

    for (var i = 0; i < singleArray.Length; i++)
    {
        var remainingIndex = i;
        for (var dimension = array.Rank - 1; dimension >= 0; dimension--)
        {
            arrayIndices[dimension] = remainingIndex % dimensionLengths[dimension];
            remainingIndex /= dimensionLengths[dimension];
        }
        // Now, set the appropriate cell in your real array:
        array.SetValue(singleArray[i], arrayIndices);
    }
}
The key in this example is the array.SetValue(value, params int[] indices) function. By building up the correct list of indices, you can use this function to set an arbitrary cell in your array.
Here is the Shuffle function:
private static int[] Shuffle(int[] singleArray)
{
    var random = new Random();
    for (int i = singleArray.Length; i > 1; i--)
    {
        // Pick random element to swap.
        int j = random.Next(i); // 0 <= j <= i-1

        // Swap.
        int tmp = singleArray[j];
        singleArray[j] = singleArray[i - 1];
        singleArray[i - 1] = tmp;
    }
    return singleArray;
}
And finally a demonstration of it in use:
var array1 = new int[2, 3, 5];
Initialize(array1);

var array2 = new int[2, 2, 3, 4];
Initialize(array2);
My strategy assigns sequential numbers to the original 1-d array to ensure uniqueness, but you can adopt a different strategy for this as you see fit.
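For instance, if you want unique random values (like the rand.Next(1, 100) in the question) rather than sequential ones, you could seed singleArray from a helper such as this (my sketch; note the range must contain at least count values, or the loop can never finish):

private static int[] UniqueRandoms(int count, int minInclusive, int maxExclusive)
{
    var random = new Random();
    var seen = new HashSet<int>();
    var result = new int[count];
    for (int i = 0; i < count; i++)
    {
        int candidate;
        do
        {
            candidate = random.Next(minInclusive, maxExclusive);
        } while (!seen.Add(candidate)); // Add returns false for a duplicate
        result[i] = candidate;
    }
    return result;
}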
You can use the Rank property to get the total number of dimensions in your array, and the SetValue method to insert values.
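A minimal sketch of those two members in action (illustrative values):

Array arr = new int[2, 3, 4];

Console.WriteLine(arr.Rank);              // 3: the number of dimensions
Console.WriteLine(arr.GetLength(1));      // 3: the length of dimension 1

arr.SetValue(42, new[] { 1, 2, 3 });      // arr[1, 2, 3] = 42, indices as an int[]
Console.WriteLine(arr.GetValue(1, 2, 3)); // 42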
In the first two for loops you are traversing the array properly (i and j go from the start to the end of the corresponding dimension). The problem comes in the innermost part, where you introduce a "correction" which actually provokes an endless loop on j:
First iteration:
- First loop: i = 0
- Second loop: j = 0
- Third loop: j = -1

Second iteration:
- First loop: i = 0
- Second loop: j = 0
- Third loop: j = -1

etc., etc.
(I start my analysis at the moment when the internal loop is used for the first time. Also bear in mind that the exact behaviour cannot be predicted, since random numbers are involved; but the idea is that you keep pushing the j counter back over and over by following an arbitrary rule.)
What exactly do you want to accomplish? What is this last correction (the one provoking the endless loop) meant to do?
If the only thing you intend to do is checking the previously stored values, you have to rely on a different variable (j2, for example) which will not affect any of the loops above:
int j2 = j;
for (int k = 0; k < j2; k++)
    if (array[i, k] == array[i, j2])
        j2--;
I have read the question Performance of 2-dimensional array vs 1-dimensional array, but its conclusion is that the two could perform the same (depending on your own mapping function; C does this automatically)?
I have a matrix which has 1,000 columns and 440,000,000 rows, where each element is a double, in C#.
If I am doing some computations in memory, which one is better to use performance-wise? (Note that I have the memory needed to hold such a monstrous quantity of information.)
If what you're asking is which is better, a 2D array of size 1000x44000 or a 1D array of size 44000000, well what's the difference as far as memory goes? You still have the same number of elements! In the case of performance and understandability, the 2D is probably better. Imagine having to manually find each column or row in a 1D array, when you know exactly where they are in a 2D array.
It depends on how many operations you are performing. In the below example, I'm setting the values of the array 2500 times. Size of the array is (1000 * 1000 * 3). The 1D array took 40 seconds and the 3D array took 1:39 mins.
var startTime = DateTime.Now;
Test1D(new byte[1000 * 1000 * 3]);
Console.WriteLine("Total Time taken 1d = " + (DateTime.Now - startTime));

startTime = DateTime.Now;
Test3D(new byte[1000, 1000, 3], 1000, 1000);
Console.WriteLine("Total Time taken 3D = " + (DateTime.Now - startTime));

public static void Test1D(byte[] array)
{
    for (int c = 0; c < 2500; c++)
    {
        for (int i = 0; i < array.Length; i++)
        {
            array[i] = 10;
        }
    }
}

public static void Test3D(byte[,,] array, int w, int h)
{
    for (int c = 0; c < 2500; c++)
    {
        for (int i = 0; i < h; i++)
        {
            for (int j = 0; j < w; j++)
            {
                array[i, j, 0] = 10;
                array[i, j, 1] = 10;
                array[i, j, 2] = 10;
            }
        }
    }
}
The difference between double[1000, 44000] and double[44000000] will not be significant.
You're probably better off with the [,] version (letting the compiler figure out the addressing), but the access pattern of your calculations is likely to have more impact (locality and cache use).
Also consider the array-of-arrays variant, double[1000][]: it is a known 'feature' of the JITter that it cannot eliminate range checking in [,] arrays.
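A sketch of what that variant looks like in practice (sizes borrowed from the answer above, purely for illustration; hoisting the row reference lets the JIT drop the bounds check, because the inner loop condition tests row.Length directly):

const int Rows = 1000, Cols = 44000; // illustrative sizes

double[][] m = new double[Rows][];
for (int i = 0; i < Rows; i++)
    m[i] = new double[Cols];

for (int i = 0; i < Rows; i++)
{
    double[] row = m[i];                 // hoist the row reference once per row
    for (int j = 0; j < row.Length; j++) // j < row.Length: bounds check elided
        row[j] = i + j;
}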