Matrix Multiplication? - C#

What I am trying here is multiplication of two 500x500 matrices of doubles, and I am getting a NullReferenceException. Could you take a look at it?
void Topt(double[][] A, double[][] B, double[][] C) {
    var source = Enumerable.Range(0, N);
    var pquery = from num in source.AsParallel()
                 select num;
    pquery.ForAll((e) => Popt(A, B, C, e));
}

void Popt(double[][] A, double[][] B, double[][] C, int i) {
    double[] iRowA = A[i];
    double[] iRowC = C[i];
    for (int k = 0; k < N; k++) {
        double[] kRowB = B[k];
        double ikA = iRowA[k];
        for (int j = 0; j < N; j++) {
            iRowC[j] += ikA * kRowB[j];
        }
    }
}
Thanks in advance

Since your NullReferenceException problem was already solved, why not a small performance tip ;) One thing you could try is a cache-oblivious algorithm. For 2k x 2k matrices I get 24.7 s for the recursive cache-oblivious variant and 50.26 s for the trivial iterative method, single-threaded in C on an E8400 @ 3 GHz (both could obviously be optimized further: better compiler flags, making sure SSE is used, etc.). Though 500x500 is quite small, so you'd have to try whether that gives you a noticeable improvement.
The recursive one is usually easier to make multithreaded, though.
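To make the idea concrete, here is a heavily simplified sketch in C# (the names, the cutoff and the split strategy are mine, not from the benchmark above): recursively split the largest of the three loop ranges until the block is small enough to be cache-friendly, then run the plain triple loop on it.

// Simplified cache-oblivious multiply: C += A * B on jagged arrays.
// Splitting only narrows the working set; the arithmetic is unchanged.
static void MultiplyRec(double[][] A, double[][] B, double[][] C,
                        int i0, int i1, int j0, int j1, int k0, int k1)
{
    const int Cutoff = 32; // illustrative; tune for your cache size
    int di = i1 - i0, dj = j1 - j0, dk = k1 - k0;
    if (di <= Cutoff && dj <= Cutoff && dk <= Cutoff)
    {
        // Base case: the block is small, use the plain loops.
        for (int i = i0; i < i1; i++)
            for (int k = k0; k < k1; k++)
            {
                double aik = A[i][k];
                for (int j = j0; j < j1; j++)
                    C[i][j] += aik * B[k][j];
            }
    }
    else if (di >= dj && di >= dk)
    {
        int mid = (i0 + i1) / 2; // row halves touch disjoint rows of C,
        MultiplyRec(A, B, C, i0, mid, j0, j1, k0, k1); // so they could
        MultiplyRec(A, B, C, mid, i1, j0, j1, k0, k1); // run in parallel
    }
    else if (dj >= dk)
    {
        int mid = (j0 + j1) / 2;
        MultiplyRec(A, B, C, i0, i1, j0, mid, k0, k1);
        MultiplyRec(A, B, C, i0, i1, mid, j1, k0, k1);
    }
    else
    {
        int mid = (k0 + k1) / 2; // k halves both write C: keep sequential
        MultiplyRec(A, B, C, i0, i1, j0, j1, k0, mid);
        MultiplyRec(A, B, C, i0, i1, j0, j1, mid, k1);
    }
}

Call it as MultiplyRec(A, B, C, 0, N, 0, N, 0, N) with C zero-initialized.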
Oh, but the most important thing since you're writing in C#: read this - it's for Java, but the same should apply to any modern JIT.
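And regarding the original NullReferenceException: the fix isn't shown in this thread, but with jagged arrays the usual culprit is allocating only the outer array and forgetting the rows, e.g.

int N = 500;
double[][] C = new double[N][]; // allocates only N null row references
for (int i = 0; i < N; i++)
{
    C[i] = new double[N]; // each row must be allocated separately
}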

Related

How to multiply 2 matrices using Parallel.ForEach?

There is a function that multiplies two matrices in the usual way:
public IMatrix Multiply(IMatrix m1, IMatrix m2)
{
    var resultMatrix = new Matrix(m1.RowCount, m2.ColCount);
    for (long i = 0; i < m1.RowCount; i++)
    {
        for (byte j = 0; j < m2.ColCount; j++)
        {
            long sum = 0;
            for (byte k = 0; k < m1.ColCount; k++)
            {
                sum += m1.GetElement(i, k) * m2.GetElement(k, j);
            }
            resultMatrix.SetElement(i, j, sum);
        }
    }
    return resultMatrix;
}
This function should be rewritten using Parallel.ForEach threading. I tried this way:
public IMatrix Multiply(IMatrix m1, IMatrix m2)
{
    // todo: feel free to add your code here
    var resultMatrix = new Matrix(m1.RowCount, m2.ColCount);
    Parallel.ForEach(m1.RowCount, row =>
    {
        for (byte j = 0; j < m2.ColCount; j++)
        {
            long sum = 0;
            for (byte k = 0; k < m1.ColCount; k++)
            {
                sum += m1.GetElement(row, k) * m2.GetElement(k, j);
            }
            resultMatrix.SetElement(row, j, sum);
        }
    });
    return resultMatrix;
}
But there is an error with the type argument in the loop. How can I fix it?
Just use Parallel.For instead of Parallel.ForEach; that should let you keep the exact same body as the non-parallel version:
Parallel.For(0, m1.RowCount, i =>
{
    ...
});
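Put together, a minimal sketch of the whole method (assuming the same IMatrix members as in the question; I've also widened the byte loop counters to int, since byte would overflow for matrices with more than 255 columns):

public IMatrix Multiply(IMatrix m1, IMatrix m2)
{
    var resultMatrix = new Matrix(m1.RowCount, m2.ColCount);
    // Each iteration writes only to its own row of the result,
    // so no locking is needed.
    Parallel.For(0, m1.RowCount, i =>
    {
        for (int j = 0; j < m2.ColCount; j++)
        {
            long sum = 0;
            for (int k = 0; k < m1.ColCount; k++)
            {
                sum += m1.GetElement(i, k) * m2.GetElement(k, j);
            }
            resultMatrix.SetElement(i, j, sum);
        }
    });
    return resultMatrix;
}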
Note that only fairly large matrices will benefit from parallelization, so if you are working with 4x4 matrices for graphics, this is not the approach to take.
One problem with multiplying matrices is that in your innermost loop you read one value per row from one of the matrices. This access pattern is hard on the processor cache and causes lots of cache misses. A fairly easy optimization is therefore to copy an entire column into a temporary array and do all computations that need this column before reading the next one. That makes all memory accesses nice and linear and easy to cache. This does more work overall, but the better cache utilization easily makes it a win. There are even more cache-efficient methods, but their complexity also tends to increase.
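To illustrate on plain rectangular arrays (not the IMatrix interface from the question; the names are mine), the column-copy idea looks roughly like this:

static double[,] MultiplyColumnCached(double[,] a, double[,] b)
{
    int n = a.GetLength(0), m = a.GetLength(1), p = b.GetLength(1);
    var result = new double[n, p];
    var column = new double[m]; // reused buffer for one column of b
    for (int j = 0; j < p; j++)
    {
        // Copy column j of b once; all following reads are sequential.
        for (int k = 0; k < m; k++)
            column[k] = b[k, j];
        for (int i = 0; i < n; i++)
        {
            double sum = 0;
            for (int k = 0; k < m; k++)
                sum += a[i, k] * column[k];
            result[i, j] = sum;
        }
    }
    return result;
}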
Another optimization would be to use SIMD, but this might require platform-specific code for best performance and will likely involve more work. You might also be able to find libraries that are already optimized.
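For example, on .NET you could try Vector&lt;T&gt; from System.Numerics, which maps to SSE/AVX under RyuJIT without platform-specific code. A rough sketch of a vectorized inner step (row[j] += scalar * other[j], the kernel of a row-based multiplication), under the assumption that rows are plain double[] arrays:

using System.Numerics;

static void ScaledAdd(double[] row, double[] other, double scalar)
{
    int width = Vector<double>.Count; // e.g. 4 doubles with AVX
    var vScalar = new Vector<double>(scalar);
    int j = 0;
    for (; j <= row.Length - width; j += width)
    {
        var v = new Vector<double>(row, j)
              + vScalar * new Vector<double>(other, j);
        v.CopyTo(row, j);
    }
    for (; j < row.Length; j++) // scalar remainder
        row[j] += scalar * other[j];
}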
But perhaps most importantly: profile your code. It is quite easy for simple things to consume a lot of time. You are, for example, using an interface, so you may have a virtual method call for each element access that cannot be inlined, potentially causing a severe performance penalty compared to a direct array access.
Parallel.ForEach receives a collection (an IEnumerable) as the first argument, and m1.RowCount is a number.
Parallel.For() is probably what you wanted.

Optimizing this enormous for loop

So I have this for loop:
for (int i = 0; i < meshes.Count; i++)
{
    for (int j = 0; j < meshes.Count; j++)
    {
        for (int m = 0; m < meshes[i].vertices.Length; m++)
        {
            for (int n = 0; n < meshes[i].vertices.Length; n++)
            {
                if ((meshes[i].vertices[m].x == meshes[j].vertices[n].x) && (meshes[i].vertices[m].z == meshes[j].vertices[n].z))
                {
                    if (meshes[i].vertices[m] != meshes[j].vertices[n])
                    {
                        meshes[i].vertices[m].y = meshes[j].vertices[n].y;
                    }
                }
            }
        }
    }
}
This goes through a few million vectors, comparing them to all other vectors and then modifying some of their y values. I think it works; however, after hitting play it takes an unbelievably long time to load (I've currently been waiting for 15 minutes, and it's still going). Is there a way to make it more efficient? Thanks for the help!
As I read this, what you're basically doing is: for all vertices with the same x and z, set their y values to the same value.
A more optimized way would be to use the LINQ method GroupBy, which internally uses hashing to avoid the quadratic time complexity of your current approach:
var vGroups = meshes.SelectMany(mesh => mesh.vertices)
                    .GroupBy(vertex => new { vertex.x, vertex.z });
foreach (var vGroup in vGroups)
{
    vGroup.Aggregate((prev, curr) =>
    {
        // If prev is null (i.e. first iteration of the "loop")
        // don't change the 'y' value
        curr.y = prev?.y ?? curr.y;
        return curr;
    });
}
// All vertices should now be updated in the 'meshes'
Note that the final y value of the vertices depends on the order of the meshes and vertices in your original list. The first vertex in each vGroup is the deciding vertex. I believe it'll be the opposite of your approach, where the last vertex is the deciding one, but it doesn't sound like that's important for you.
Furthermore, be aware that in this (and your) approach you are possibly merging two vertices within the same mesh if they have the same x and z values. I don't know if that's intended, but I wanted to point it out.
An additional performance optimization would be to parallelize this. Just start out with a call to AsParallel:
var vGroups = meshes.AsParallel()
                    .SelectMany(mesh => mesh.vertices)
                    .GroupBy(vertex => new { vertex.x, vertex.z });
// ...
Be aware that parallelization does not always speed things up: if the computation you are trying to parallelize is not computationally expensive enough, the overhead of parallelizing may outweigh the benefits. I'm not sure whether the GroupBy operation is heavy enough to benefit; you'll have to test that for yourself. Try without it first.
For a simplified example, see this fiddle.
You want to make y equal for all vertices with the same x and z. Let's do just that:
var yForXZDict = new Dictionary<(int, int), int>();

foreach (var mesh in meshes)
{
    foreach (var vertex in mesh.vertices)
    {
        var xz = (vertex.x, vertex.z);
        if (yForXZDict.TryGetValue(xz, out var y))
        {
            vertex.y = y;
        }
        else
        {
            yForXZDict[xz] = vertex.y;
        }
    }
}
You should replace int with the exact type you use for coordinates.
You are comparing twice unnecessarily.
Here is a short example of what I mean. Let's say we have meshes A, B, C. You are comparing

A, A
A, B
A, C
B, A
B, B
B, C
C, A
C, B
C, C

so this checks e.g. the combination A and B twice.
A first easy improvement would be to use e.g.
for (int i = 0; i < meshes.Count; i++)
{
    // only check the current and following meshes
    for (int j = i; j < meshes.Count; j++)
    {
        ...
Do you even want to compare a mesh with itself? If not, you can actually use j = i + 1, so you only compare the current mesh to the following ones.
Then for the vertices it depends: if you do want to check a mesh against itself, you at least want int n = m + 1 in the case that i == j.
It makes no sense to check a vertex against itself, since the condition will always be true.
The next point is to minimize accesses. You are accessing e.g.

meshes[i].vertices

five times! Rather, fetch and store it once, e.g.:
// To minimize GC it sometimes makes sense to reuse variables outside of a loop
Mesh meshA;
Mesh meshB;
Vector3[] vertsA;
Vector3[] vertsB;
Vector3 vA;
Vector3 vB;

for (int i = 0; i < meshes.Count; i++)
{
    meshA = meshes[i];
    vertsA = meshA.vertices;
    for (int j = i; j < meshes.Count; j++)
    {
        meshB = meshes[j];
        vertsB = meshB.vertices;
        for (int m = 0; m < vertsA.Length; m++)
        {
            vA = vertsA[m];
            ...
Also note that a line like

meshes[i].vertices[m].y = meshes[j].vertices[n].y;

doesn't actually do what you want! The vertices are Vector3, which is a struct, and Mesh.vertices is a property that returns a copy of the vertex array, so the assignment only changes a value in that temporary copy and doesn't in any way change the actual content of the mesh.
You would rather work with vA as mentioned before, at the end assign it back via

vertsA[m] = vA;

and then, at the end of the loop, assign the entire array back once via

meshA.vertices = vertsA;
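Putting those pieces together, a minimal sketch for a single mesh (someNewY stands in for whatever value you computed):

// Mesh.vertices returns a copy, so: read once, modify, write back once.
Vector3[] verts = mesh.vertices;
for (int m = 0; m < verts.Length; m++)
{
    Vector3 v = verts[m]; // local copy of the struct
    v.y = someNewY;       // someNewY: placeholder for your computed value
    verts[m] = v;         // write the modified struct back into the array
}
mesh.vertices = verts;    // push the whole array back to the mesh once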
And finally: I would put this into a thread, or use Unity's Job System and the Burst compiler, and meanwhile display a progress bar or some user feedback instead of freezing the entire application.
Yet another point is floating-point precision: you are directly comparing two float values using ==. Due to floating-point precision this might fail even when it shouldn't, e.g.

10f * 0.1f == 1f

is not necessarily true; the result might be 0.99999999 or 1.0000000001. Therefore Unity uses a precision of only 0.00001 for Vector3 == Vector3.
You should either do the same and use

if (Mathf.Abs(vA.x - vB.x) <= 0.00001f)

or use

if (Mathf.Approximately(vA.x, vB.x))

which is roughly equivalent to

if (Mathf.Abs(vA.x - vB.x) <= Mathf.Epsilon)

where Epsilon is the smallest value by which two floats can differ.

I need to do my operations faster

I have a piece of code that reads points from an STL file, applies a transformation matrix to them, and writes the result to another STL file. I do all this, but it's too slow: about 5 or more minutes.
Here is the code of the matrix multiplication; it receives the two matrices and performs the multiplication:
public double[,] MultiplyMatrix(double[,] A, double[,] B)
{
    int rA = A.GetLength(0);
    int cA = A.GetLength(1);
    int rB = B.GetLength(0);
    int cB = B.GetLength(1);
    double temp = 0;
    double[,] kHasil = new double[rA, cB];
    if (cA != rB)
    {
        MessageBox.Show("matrix can't be multiplied !!");
    }
    else
    {
        for (int i = 0; i < rA; i++)
        {
            for (int j = 0; j < cB; j++)
            {
                temp = 0;
                for (int k = 0; k < cA; k++)
                {
                    temp += A[i, k] * B[k, j];
                }
                kHasil[i, j] = temp;
            }
        }
        return kHasil;
    }
    return kHasil;
}
My problem is that the whole pipeline is too slow: it has to read from one STL file, multiply all the points, and write the results to another STL file, and that takes 5-10 minutes. I see that commercial-grade programs like CloudCompare do all these operations in a few seconds.
Can anyone tell me how I can do it faster? Is there a library that can do this faster than my code?
Thank you! :)
I found this on the internet:
double[] iRowA = A[i];
double[] iRowC = C[i];
for (int k = 0; k < N; k++) {
    double[] kRowB = B[k];
    double ikA = iRowA[k];
    for (int j = 0; j < N; j++) {
        iRowC[j] += ikA * kRowB[j];
    }
}
then use PLINQ:
var source = Enumerable.Range(0, N);
var pquery = from num in source.AsParallel()
select num;
pquery.ForAll((e) => Popt(A, B, C, e));
Where Popt is our method name taking 3 jagged arrays (C = A * B) and the row to calculate (e). How fast is this:
Name    Milliseconds
Popt    187
Source is: Daniweb
That's over 12 times faster than the original code! With the magic of PLINQ we are queuing 500 parallel work items in this example and don't have to manage a single thread ourselves; everything is handled for us.
You have a couple of options:
Rewrite your code with jagged arrays (like double[][] A); it should give a ~2x increase in speed.
Write an unmanaged C/C++ DLL with the matrix multiplication code.
Use a third-party math library that has a native BLAS implementation under the hood. I suggest Math.NET Numerics; it can be switched to use Intel MKL, which is smoking fast.
Probably the third option is the best.
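For illustration, the third option could look roughly like this with Math.NET Numerics (arrayA and arrayB are placeholders for your own double[,] data):

using MathNet.Numerics;
using MathNet.Numerics.LinearAlgebra;

// Wrap the existing 2D arrays and let the library multiply them.
var a = Matrix<double>.Build.DenseOfArray(arrayA);
var b = Matrix<double>.Build.DenseOfArray(arrayB);

// Optional: route the math through the Intel MKL native provider,
// if the corresponding native package is installed.
// Control.UseNativeMKL();

Matrix<double> product = a * b;
double[,] result = product.ToArray();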
Just for the record: CloudCompare is not a commercial product. It's a free open-source project, and there is no 'huge team of developers' (only a handful of them actually, doing this in their free time).
Here is our biggest secret: we use pure C++ code ;). And we rarely use multi-threading except for very lengthy processes (you have to take the thread management and processing time overhead into account).
Here are a few 'best practice' rules for the parts of the code that are called loads of times:
avoid any dynamic memory allocation
make as few (far) function calls as possible
always process the most probable case first in an 'if-then-else' branching
avoid very small loops (unroll them if N = 2 or 3)

Is there a more efficient way to convert double to float?

I need to convert a multi-dimensional double array to a jagged float array. The sizes will vary from [2][5] up to around [6][1024].
I was curious how just looping and casting the double to float would perform, and it's not TOO bad: about 225 µs for a [2][5] array. Here's the code:
const int count = 5;
const int numCh = 2;
double[,] dbl = new double[numCh, count];
float[][] flt = new float[numCh][];

for (int i = 0; i < numCh; i++)
{
    flt[i] = new float[count];
    for (int j = 0; j < count; j++)
    {
        flt[i][j] = (float)dbl[i, j];
    }
}
However, if there are more efficient techniques I'd like to use them. I should mention that I ONLY timed the two nested loops, not the allocations before them.
After experimenting a little more, I think 99% of the time is burned on the loops themselves, even without the assignment!
This will run faster. For small data it's not worth doing Parallel.For(0, count, (j) => ...); it actually runs considerably slower for very small data, which is why I have commented that section out.
double* dp0;
float* fp0;

fixed (double* dp1 = dbl)
{
    dp0 = dp1;
    float[] newFlt = new float[count];
    fixed (float* fp1 = newFlt)
    {
        fp0 = fp1;
        for (int i = 0; i < numCh; i++)
        {
            //Parallel.For(0, count, (j) =>
            for (int j = 0; j < count; j++)
            {
                fp0[j] = (float)dp0[i * count + j];
            }
            //});
            flt[i] = newFlt.Clone() as float[];
        }
    }
}
This runs faster because accessing rectangular [,] arrays is really taxing in .NET due to the array bounds checking. The newFlt.Clone() just means we're not fixing and unfixing new pointers all the time (as there is a slight overhead in doing so).
You will need to mark the method with the unsafe keyword and compile with /unsafe.
But really you should be running with data closer to 5000 x 5000, not 5 x 2; if something takes less than 1000 ms you need to either add more loops or increase the data, because at that level a minor spike in CPU activity can add a lot of noise to your profiling.
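For example, a rough Stopwatch harness along those lines (ConvertOnce is a placeholder for a method wrapping the nested loops under test):

var sw = System.Diagnostics.Stopwatch.StartNew();
const int reps = 100000;
for (int r = 0; r < reps; r++)
{
    ConvertOnce(dbl, flt); // the code under test
}
sw.Stop();
// Average over many repetitions so timer resolution and CPU spikes
// don't dominate a sub-millisecond measurement.
Console.WriteLine(sw.Elapsed.TotalMilliseconds / reps + " ms per call");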
In your example, I think you are not measuring the double-to-float conversion so much (which should be a single processor instruction) as the array accesses (which involve a lot of indirection plus, obviously, array bounds checks for the IndexOutOfRangeException).
I would suggest timing a test without arrays.
I don't really think you can optimize your code much more. One option would be to make your code parallel, but for your input data sizes ([2][5] up to around [6][1024]) I don't think you would profit much, if at all. In fact, I wouldn't even bother optimizing that piece of code at all...
Anyway, to optimize it, the only thing I would do (if it fits what you want to do) would be to use fixed-width arrays instead of jagged ones, even if you waste memory that way.
If you could also use Lists in your case, you could take the ConvertAll approach:
List<List<double>> t = new List<List<double>>();
//adding test data
t.Add(new List<double>() { 12343, 345, 3, 23, 2, 1 });
t.Add(new List<double>() { 43, 123, 3, 54, 233, 1 });
//creating target
List<List<float>> q;
//conversion
q = t.ConvertAll<List<float>>(
    (List<double> inList) =>
    {
        return inList.ConvertAll<float>((double inValue) => { return (float)inValue; });
    }
);
Whether it's faster you have to measure yourself (doubtful), but you could parallelize it, which could speed it up (PLINQ).

How to prevent creating intermediate objects in cascading operators?

I use a custom Matrix class in my application, and I frequently add multiple matrices:
Matrix result = a + b + c + d; // a, b, c and d are also Matrices
However, this creates an intermediate matrix for each addition operation. Since this is simple addition, it is possible to avoid the intermediate objects and create the result by adding the elements of all 4 matrices at once. How can I accomplish this?
NOTE: I know I can define multiple functions like Add3Matrices(a, b, c), Add4Matrices(a, b, c, d), etc., but I want to keep the elegance of result = a + b + c + d.
You could limit yourself to a single small intermediate by using lazy evaluation. Something like
using System.Collections.Generic;

public class LazyMatrix
{
    public static implicit operator Matrix(LazyMatrix l)
    {
        Matrix m = new Matrix();
        foreach (Matrix x in l.Pending)
        {
            for (int i = 0; i < 2; ++i)
                for (int j = 0; j < 2; ++j)
                    m.Contents[i, j] += x.Contents[i, j];
        }
        return m;
    }

    public List<Matrix> Pending = new List<Matrix>();
}

public class Matrix
{
    public int[,] Contents = { { 0, 0 }, { 0, 0 } };

    public static LazyMatrix operator +(Matrix a, Matrix b)
    {
        LazyMatrix l = new LazyMatrix();
        l.Pending.Add(a);
        l.Pending.Add(b);
        return l;
    }

    // The left operand must be the LazyMatrix: a + b + c parses as
    // (a + b) + c, i.e. LazyMatrix + Matrix. With the operands the other
    // way around, overload resolution would fall back on the implicit
    // conversion and materialize an intermediate Matrix at every step.
    public static LazyMatrix operator +(LazyMatrix a, Matrix b)
    {
        a.Pending.Add(b);
        return a;
    }
}

class Program
{
    static void Main(string[] args)
    {
        Matrix a = new Matrix();
        Matrix b = new Matrix();
        Matrix c = new Matrix();
        Matrix d = new Matrix();

        a.Contents[0, 0] = 1;
        b.Contents[1, 0] = 4;
        c.Contents[0, 1] = 9;
        d.Contents[1, 1] = 16;

        Matrix m = a + b + c + d;

        for (int i = 0; i < 2; ++i)
        {
            for (int j = 0; j < 2; ++j)
            {
                System.Console.Write(m.Contents[i, j]);
                System.Console.Write(" ");
            }
            System.Console.WriteLine();
        }

        System.Console.ReadLine();
    }
}
Something that would at least avoid the pain of
Matrix Add3Matrices(a,b,c) //and so on
would be
Matrix AddMatrices(Matrix[] matrices)
In C++ it is possible to use template metaprogramming, using expression templates to do exactly this. However, the template programming involved is non-trivial. I don't know if a similar technique is available in C#; quite possibly not.
This technique, in C++, does exactly what you want. The disadvantage is that if something is not quite right, the compiler error messages tend to run to several pages and are almost impossible to decipher.
Without such techniques I suspect you are limited to functions such as Add3Matrices.
But for C# this link might be exactly what you need: Efficient Matrix Programming in C#, although it seems to work slightly differently from C++ expression templates.
You can't avoid creating intermediate objects.
However, you can use expression templates as described here to minimise them and do fancy lazy evaluation of the templates.
At the simplest level, the expression template could be an object that stores references to several matrices and calls an appropriate function like Add3Matrices() upon assignment. At the most advanced level, the expression templates will do things like calculate the minimum amount of information in a lazy fashion upon request.
This is not the cleanest solution, but if you know the evaluation order, you could do something like this:

result = MatrixAdditionCollector() << a + b + c + d

(or the same thing with different names). The MatrixCollector then implements + as +=; that is, it starts with a 0-matrix of undefined size, takes on a size once the first + is evaluated, and adds everything together (or copies the first matrix). This reduces the number of intermediate objects to 1 (or even 0 if you implement assignment well, because the MatrixCollector might be/contain the result immediately).
I am not entirely sure whether this is ugly as hell or one of the nicer hacks one might do. A certain advantage is that it is fairly obvious what's happening.
Might I suggest a MatrixAdder that behaves much like a StringBuilder: you add matrices to the MatrixAdder and then call a ToMatrix() method that does the additions for you in a lazy implementation. This would get you the result you want, could be extended to any sort of lazy evaluation, and wouldn't introduce any clever tricks that could confuse other maintainers of the code.
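A rough sketch of that idea, reusing the 2x2 Matrix class from the lazy-evaluation answer above (the class and method names are mine):

using System.Collections.Generic;

public class MatrixAdder
{
    private readonly List<Matrix> pending = new List<Matrix>();

    public MatrixAdder Add(Matrix m)
    {
        pending.Add(m); // just remember the operand; no work yet
        return this;    // returning this allows chaining, like StringBuilder
    }

    public Matrix ToMatrix()
    {
        // One pass over all operands, a single result object.
        Matrix result = new Matrix();
        foreach (Matrix m in pending)
            for (int i = 0; i < 2; ++i)
                for (int j = 0; j < 2; ++j)
                    result.Contents[i, j] += m.Contents[i, j];
        return result;
    }
}

Usage would then be: Matrix sum = new MatrixAdder().Add(a).Add(b).Add(c).Add(d).ToMatrix();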
I thought that you could just make the desired add-in-place behavior explicit:
Matrix result = a;
result += b;
result += c;
result += d;
But as pointed out by Doug in the Comments on this post, this code is treated by the compiler as if I had written:
Matrix result = a;
result = result + b;
result = result + c;
result = result + d;
so temporaries are still created.
I'd just delete this answer, but it seems others might have the same misconception, so consider this a counter example.
Bjarne Stroustrup has a short paper called Abstraction, libraries, and efficiency in C++ where he mentions techniques used to achieve what you're looking for. Specifically, he mentions the library Blitz++, a library for scientific calculations that also has efficient operations for matrices, along with some other interesting libraries. Also, I recommend reading a conversation with Bjarne Stroustrup on artima.com on that subject.
It is not possible using operators.
My first solution would be something along these lines (to add to the Matrix class if possible):
static Matrix AddMatrices(Matrix[] lMatrices) // or List<Matrix> lMatrices
{
    // Check consistency of matrices

    Matrix m = new Matrix(n, p);
    for (int i = 0; i < n; i++)
        for (int j = 0; j < p; j++) // p columns, not n
            foreach (Matrix mat in lMatrices)
                m[i, j] += mat[i, j];

    return m;
}
I'd have it in the Matrix class because you can rely on the private methods and properties that could be useful for your function in case the implementation of the matrix changes (a linked list of non-empty nodes instead of a big double array, for example).
Of course, you would lose the elegance of result = a + b + c + d. But you would have something along the lines of result = Matrix.AddMatrices(new Matrix[] { a, b, c, d });.
There are several ways to implement lazy evaluation to achieve that. But it's important to remember that your compiler will not always be able to generate the best code for all of them.
I've made implementations that worked great in GCC and even exceeded the performance of the traditional unreadable pile of for loops, because they led the compiler to observe that there were no aliases between the data segments (something hard to establish with arrays coming out of nowhere). But some of those were a complete failure in MSVC, and vice versa for other implementations. Unfortunately those are too long to post here (several thousand lines of code don't fit).
A very complex library with great embedded knowledge in this area is the Blitz++ library for scientific computation.
