I need to do my operations faster - c#

I have a piece of code that reads points from an stl, then I have to do a transformation, aplying a transformation matrix, of this stl and write the results on other stl. I do all this stuff, but it's too slow, about 5 or more minutes.
I put the code of the matrix multiplication, it recieves the two matrix and it makes the multiplication:
public double[,] MultiplyMatrix(double[,] A, double[,] B)
{
int rA = A.GetLength(0);
int cA = A.GetLength(1);
int rB = B.GetLength(0);
int cB = B.GetLength(1);
double temp = 0;
double[,] kHasil = new double[rA, cB];
if (cA != rB)
{
MessageBox.Show("matrix can't be multiplied !!");
}
else
{
for (int i = 0; i < rA; i++)
{
for (int j = 0; j < cB; j++)
{
temp = 0;
for (int k = 0; k < cA; k++)
{
temp += A[i, k] * B[k, j];
}
kHasil[i, j] = temp;
}
}
return kHasil;
}
return kHasil;
}
My problem is that all the code is too slow, it has to read from a stl, multiply all the points and write in other stl the results, it spends 5-10 minutes to do that. I see that all comercial programs, like cloudcompare, do all this operations in a few seconds.
Can anyone tell me how I can do it faster? Is there any library to do that faster than my code?
Thank you! :)

I fond this on internet:
double[] iRowA = A[i];
double[] iRowC = C[i];
for (int k = 0; k < N; k++) {
double[] kRowB = B[k];
double ikA = iRowA[k];
for (int j = 0; j < N; j++) {
iRowC[j] += ikA * kRowB[j];
}
}
then use Plinq
var source = Enumerable.Range(0, N);
var pquery = from num in source.AsParallel()
select num;
pquery.ForAll((e) => Popt(A, B, C, e));
Where Popt is our method name taking 3 jagged arrays (C = A * B) and the row to calculate (e). How fast is this:
1.Name Milliseconds2.Popt 187
Source is: Daniweb
That's over 12 times faster than our original code! With the magic of PLINQ we are creating 500 threads in this example and don't have to manage a single one of them, everything is handled for you.

You have couple of options:
Rewrite your code with jagged arrays (like double[][] A) it should give ~2x increase in speed.
Write unmanaged C/C++ DLL with matrix multiplication code.
Use third-party math library that have native BLAS implementation under the hood. I suggest Math.NET Numerics, it can be switched to use Intel MKL which is smoking fast.
Probably, the third option is the best.

Just for the records: CloudCompare is not a commercial product. It's a free open-source project. And there are no 'huge team of developers' (only a handful of them actually, doing this on their free time).
Here is our biggest secret: we use pure C++ code ;). And we rarely use multi-threading but for very lengthy processes (you have to take the thread management and processing time overhead into account).
Here are a few 'best practice' rules for the parts of the code that are called loads of times:
avoid any dynamic memory allocation
make as less (far) function calls as possible
always process the most probable case first in a 'if-then-else' branching
avoid very small loops (inline them if N = 2 or 3)

Related

How to multiply 2 matrices using Parallel.ForEach?

There is a function that multiplies two matrices as usual
public IMatrix Multiply(IMatrix m1, IMatrix m2)
{
var resultMatrix = new Matrix(m1.RowCount, m2.ColCount);
for (long i = 0; i < m1.RowCount; i++)
{
for (byte j = 0; j < m2.ColCount; j++)
{
long sum = 0;
for (byte k = 0; k < m1.ColCount; k++)
{
sum += m1.GetElement(i, k) * m2.GetElement(k, j);
}
resultMatrix.SetElement(i, j, sum);
}
}
return resultMatrix;
}
This function should be rewritten using Parallel.ForEach Threading, I tried this way
public IMatrix Multiply(IMatrix m1, IMatrix m2)
{
// todo: feel free to add your code here
var resultMatrix = new Matrix(m1.RowCount, m2.ColCount);
Parallel.ForEach(m1.RowCount, row =>
{
for (byte j = 0; j < m2.ColCount; j++)
{
long sum = 0;
for (byte k = 0; k < m1.ColCount; k++)
{
sum += m1.GetElement(row, k) * m2.GetElement(k, j);
}
resultMatrix.SetElement(row, j, sum);
}
});
return resultMatrix;
}
But there is an error with the type argument in the loop. How can I fix it?
Just use Parallel.For instead of Parallel.Foreach, that should let you keep the exact same body as the non-parallel version:
Parallel.For(0, m1.RowCount, i =>{
...
}
Note that only fairly large matrices will benefit from parallelization, so if you are working with 4x4 matrices for graphics, this is not the approach to take.
One problem with multiplying matrices is that you need to access one value for each row for one of the matrices in your innermost loop. This access pattern may be difficult to cache by your processor, causing lots of cache misses. So a fairly easy optimization is to copy an entire column to a temporary array and do all computations that need this column before reading the next. This lets all memory access be nice and linear and easy to cache. this will do more work overall, but better cache utilization easily makes it a win. There are even more cache efficient methods, but the complexity also tend to increase.
Another optimization would be to use SIMD, but this might require platform specific code for best performance, and will likely involve more work. But you might be able to find libraries that are already optimized.
But perhaps most importantly, Profile your code. It is quite easy to have simple things consume lot of time. You are for example using an interface, so if you may have a virtual method call for each memory access that cannot be inlined, potentially causing a severe performance penalty compared to a direct array access.
ForEach receives a collection, IEnumerable as the first argument and m1.RowCount is a number.
Probably Parallel.For() is what you wanted.

Why are multiple cores engaged on sequential algorithms?

When I run bubblesorts, cocktailsorts and quicksorts in C#, I can see that all 3 cores are engaged on my AMD X3 (X4 shipped with 1 broken core).
Why is this happening? My algorithm is sequential and my code doesn't have any thread tags. Especially something like sorting algorithms which is a highly sequential algorithm no, one event can't happen until the next is completed. How does it manage to split up the algorithm?
The bubblesort for example on request:
public void BubbleSort()
{
for (int i = 1; i < amount; i++)
{
for (int j = 0; j < a; j++)
{
if (numbers[j] > numbers[j + 1])
{
t = numbers[j + 1];
numbers[j + 1] = numbers[j];
numbers[j] = t;
}
}
a--;
}
}
Your code can potentially swap cores on a context switch. But will only use one at a time.
Sorting algorithms can be made to run in parallel and utilize multiple cores. Which sort routines are you using? It's very possible that they are not sequential algorithms.
As an example, Quicksort is very easy to parallelize via divide and conquer.

Is there a more efficient way to convert double to float?

I have a need to convert a multi-dimensional double array to a jagged float array. The sizes will var from [2][5] up to around [6][1024].
I was curious how just looping and casting the double to the float would perform and it's not TOO bad, about 225µs for a [2][5] array - here's the code:
const int count = 5;
const int numCh = 2;
double[,] dbl = new double[numCh, count];
float[][] flt = new float[numCh][];
for (int i = 0; i < numCh; i++)
{
flt[i] = new float[count];
for (int j = 0; j < count; j++)
{
flt[i][j] = (float)dbl[i, j];
}
}
However if there are more efficient techniques I'd like to use them. I should mention that I ONLY timed the two nested loops, not the allocations before it.
After experimenting a little more I think 99% of the time is burned on the loops, even without the assignment!
This will run faster, for small data it's not worth doing Parallel.For(0, count, (j) => it actually runs considerably slower for very small data, which is why that I have commented that section out.
double* dp0;
float* fp0;
fixed (double* dp1 = dbl)
{
dp0 = dp1;
float[] newFlt = new float[count];
fixed (float* fp1 = newFlt)
{
fp0 = fp1;
for (int i = 0; i < numCh; i++)
{
//Parallel.For(0, count, (j) =>
for (int j = 0; j < count; j++)
{
fp0[j] = (float)dp0[i * count + j];
}
//});
flt[i] = newFlt.Clone() as float[];
}
}
}
This runs faster because double accessing double arrays [,] is really taxing in .NET due to the array bounds checking. the newFlt.Clone() just means we're not fixing and unfixing new pointers all the time (as there is a slight overhead in doing so)
You will need to run it with the unsafe code tag and compile with /UNSAFE
But really you should be running with data closer to 5000 x 5000 not 5 x 2, if something takes less than 1000 ms you need to either add in more loops or increase the data because at that level a minor spike in cpu activity can add a lot of noise to your profiling.
In your example - I think you dont measure the double/float comparison so much (which should be a processor internal instruction) as the array accesses (which have a lot of redirects plus obviousl.... aray delimiter checks (for the array index of bounds exception).
I would suggest timining a test without arrays.
I don't really think that you can optimize your code much more, one option would be to make your code parallel but for your input data size ([2][5] up to around [6][1024]) I don't thing that you would profit so much if you would even have any profit. In fact, I wouldn't even bother optimizing that piece of code at all...
Anyway, to optimize that, the only thing that I would do (if that fits in what you want to do) would be to just used fixed-width arrays instead of the jagged ones, even if you would waste memory with that.
If you could use also Lists in your case you could use the LINQ approach:
List<List<double>> t = new List<List<double>>();
//adding test data
t.Add(new List<double>() { 12343, 345, 3, 23, 2, 1 });
t.Add(new List<double>() { 43, 123, 3, 54, 233, 1 });
//creating target
List<List<float>> q;
//conversion
q = t.ConvertAll<List<float>>(
(List<double> inList) =>
{
return inList.ConvertAll<float>((double inValue) => { return (float)inValue; });
}
);
if its faster you have to measure yourself. (doubtful)
but you could parallelize it which could fasten it up (PLINQ)

Matrix Multiplication?

What i am trying here is multiplication of two 500x500 matrices of doubles.
And i am getting null reference exception !
Could u take a look into it
void Topt(double[][] A, double[][] B, double[][] C) {
var source = Enumerable.Range(0, N);
var pquery = from num in source.AsParallel()
select num;
pquery.ForAll((e) => Popt(A, B, C, e));
}
void Popt(double[][] A, double[][] B, double[][] C, int i) {
double[] iRowA = A[i];
double[] iRowC = C[i];
for (int k = 0; k < N; k++) {
double[] kRowB = B[k];
double ikA = iRowA[k];
for (int j = 0; j < N; j++) {
iRowC[j] += ikA * kRowB[j];
}
}
}
Thanks in advance
Since your nullpointer problem was already solved, why not a small performance tip ;) One thing you could try would be a cache oblivious algorithm. For 2k x 2k matrices I get 24.7sec for the recursive cache oblivious variant and 50.26s with the trivial iterative method on a e8400 #3ghz single threaded in C (both could be further optimized obviously, some better arguments for the compiler and making sure SSE is used etc.). Though 500x500 is quite small so you'd have to try if that gives you noticeable improvements.
The recursive one is easier to make multithreaded usually though.
Oh but the most important thing since you're writing in c#: Read this - it's for java, but the same should apply for any modern JIT..

Is Multiplication Faster Than Comparison in .NET?

Good morning, afternoon or night,
Up until today, I thought comparison was one of the basic processor instructions, and so that it was one of the fastest operations one can do in a computer... On the other hand I know multiplication in sometimes trickier and involves lots of bits operations. However, I was a little shocked to look at the results of the following code:
Stopwatch Test = new Stopwatch();
int a = 0;
int i = 0, j = 0, l = 0;
double c = 0, d = 0;
for (i = 0; i < 32; i++)
{
Test.Start();
for (j = Int32.MaxValue, l = 1; j != 0; j = -j + ((j < 0) ? -1 : 1), l = -l)
{
a = l * j;
}
Test.Stop();
Console.WriteLine("Product: {0}", Test.Elapsed.TotalMilliseconds);
c += Test.Elapsed.TotalMilliseconds;
Test.Reset();
Test.Start();
for (j = Int32.MaxValue, l = 1; j != 0; j = -j + ((j < 0) ? -1 : 1), l = -l)
{
a = (j < 0) ? -j : j;
}
Test.Stop();
Console.WriteLine("Comparison: {0}", Test.Elapsed.TotalMilliseconds);
d += Test.Elapsed.TotalMilliseconds;
Test.Reset();
}
Console.WriteLine("Product: {0}", c / 32);
Console.WriteLine("Comparison: {0}", d / 32);
Console.ReadKey();
}
Result:
Product: 8558.6
Comparison: 9799.7
Quick explanation: j is an ancillary alternate variable which goes like (...), 11, -10, 9, -8, 7, (...) until it reaches zero, l is a variable which stores j's sign, and a is the test variable, which I want always to be equal to the modulus of j. The goal of the test was to check whether it is faster to set a to this value using multiplication or the conditional operator.
Can anyone please comment on these results?
Thank you very much.
Your second test it's not a mere comparison, but an if statement.
That's probably translated in a JUMP/BRANCH instruction in CPU, involving branch prediction (with possible blocks of the pipeline) and then is likely slower than a simple multiplication (even if not so much).
It can often be very difficult to make such assertions about an optimising compiler. They do lots of tricks that make simple cases different from real code. That said, you aren't just doing a comparison, you're doing a compare/assign in a very tight loop. The thread you're working on may have to pause many times at the branch; the multiplication can assign as many times as it likes, as long as the last assignment is still last, so many multiplies can be going on at once.
As is the general rule, make your code clear and ignore minor timing issues unless they become a problem.
If you do have a speed problem a good tracing/timing tool will guide you much better than knowing if one operation is faster than other in a specific case.
I guess one comment I would make is that you are doing a lot more in the second operation:
a = (j < 0) ? -j : j;
Not only are you doing a comparison, but also effectivly a "if..else.." with the ? operator and a negation of j.
You should try to run this test a 1000 or so times and use avrage to compare you never now what CLR is doing in background

Categories