Matrix multiplication in C#

I'm having problems multiplying two matrices of dimensions 5000x1024.
I tried to do it through loops in a conventional manner, but it takes forever.
Is there any good library with implemented and optimized matrix operations, or any algorithm that does it without 3 loops?

Have you looked into using OpenCL? One of the examples in the Cloo (C# OpenCL library) distribution is a large 2D matrix multiplication.
Unlike CUDA, OpenCL kernels will run on your GPU (if available and supported) or on the CPU. On the GPU you'll see really, really dramatic speed increases, on the order of 10x-100x depending on how efficient your kernel is and how many cores your GPU has. (A Fermi-based NVidia card will have between 384 and 512, and the new 600 series has something like 1500.)
If you're not interested in going that route - though anyone that's doing numerically-intensive, easily parallelizable operations like this should be using the GPU - make sure you're at least using C#'s built-in parallelization:
Parallel.For(0, 5000, i =>
{
    for (var j = 0; j < 1024; j++)
    {
        result[i, j] = .....  // fill in the dot product for row i, column j here
    }
});
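Fleshed out, a complete naive parallel multiply might look like the sketch below (a minimal sketch, assuming A is m×n and B is n×p so the product is defined, and that System.Threading.Tasks is referenced):

using System.Threading.Tasks;

// Minimal sketch of a naive parallel multiply: each row of the result is
// computed independently, so the outer loop parallelizes cleanly.
static double[,] Multiply(double[,] a, double[,] b)
{
    int rows = a.GetLength(0);    // m
    int inner = a.GetLength(1);   // n (must equal b.GetLength(0))
    int cols = b.GetLength(1);    // p
    var result = new double[rows, cols];

    Parallel.For(0, rows, i =>
    {
        for (int j = 0; j < cols; j++)
        {
            double sum = 0;
            for (int k = 0; k < inner; k++)
                sum += a[i, k] * b[k, j];
            result[i, j] = sum;
        }
    });

    return result;
}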
Also, check out GPU.NET and Brahma. Brahma lets you build OpenCL kernels in C# with LINQ, which definitely lowers the learning curve.

Take a look at the Strassen algorithm, which has a runtime of approximately O(n^2.8) instead of the O(n^3) of the naive method of multiplying matrices. One problem is that it is not always numerically stable, but it works fine for really high dimensions. It is also fairly complex to implement, so I would suggest you rethink your design and maybe decrease the size of the matrix or split your problem into smaller pieces.
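For reference, the core trick is that Strassen replaces the eight multiplications of a 2x2 block product with seven; a minimal sketch for plain 2x2 matrices (the recursive algorithm applies the same formulas to sub-matrix blocks instead of scalars):

// Strassen's seven products for a 2x2 multiply; the recursive version uses
// the same formulas on matrix blocks, giving the ~O(n^2.8) runtime.
static double[,] Strassen2x2(double[,] a, double[,] b)
{
    double m1 = (a[0, 0] + a[1, 1]) * (b[0, 0] + b[1, 1]);
    double m2 = (a[1, 0] + a[1, 1]) * b[0, 0];
    double m3 = a[0, 0] * (b[0, 1] - b[1, 1]);
    double m4 = a[1, 1] * (b[1, 0] - b[0, 0]);
    double m5 = (a[0, 0] + a[0, 1]) * b[1, 1];
    double m6 = (a[1, 0] - a[0, 0]) * (b[0, 0] + b[0, 1]);
    double m7 = (a[0, 1] - a[1, 1]) * (b[1, 0] + b[1, 1]);

    return new double[2, 2]
    {
        { m1 + m4 - m5 + m7, m3 + m5 },
        { m2 + m4,           m1 - m2 + m3 + m6 }
    };
}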
But keep in mind that a matrix multiplication with no special properties (like Aidan mentioned) is nearly impossible to optimize much beyond that. An example: the Coppersmith-Winograd algorithm takes O(n^2.3737), and it is one of the best known algorithms for matrix multiplication. The best option here would be to either use OpenCL and the GPU (mentioned by David) or to look at other optimized environments such as Python with the numpy package.
Good luck either way!

Related

Unity C# Voxel finite water optimization

I got a (basic) voxel engine running and a water system that looks (and I assume basically works) like this: https://www.youtube.com/watch?v=Q_TdeGIOOts (not my game).
The water values are stored in a 3D array of floats, and every 0.05 s it calculates water flow by checking the voxels below and adjacent (y-1, x-1, x+1, z-1, z+1) and adding the value.
This system works fine (70+ fps) for small amounts of water, but when I start calculating water on 8+ chunks, it gets too much.
(I disabled all rendering and mesh creation to check whether that is the bottleneck; it isn't. It's purely the flow calculations.)
I am not a very experienced programmer, so I wouldn't know where to start optimizing, apart from making the calculations happen in a coroutine as I already did.
In this post: https://gamedev.stackexchange.com/questions/55414/how-to-define-areas-filled-with-water (near the bottom) Boreal suggests running it in a compute shader. Is this the way to go for me? And how would I go about such a thing?
Any help is much appreciated.
If you're really calculating a voxel based simulation, you will be expanding the number of calculations geometrically as your size increases, so you will quickly run out of processing power on larger volumes.
A compute shader is great for doing massively parallel calculations quickly, although it's a very different programming paradigm that takes some getting used to. A compute shader will look at the contents of a buffer (ie, a 'texture' for us civilians) and do things to it very quickly -- in your case the buffer will probably be a buffer/texture whose pixel values represent water cells. If you want to do something really simple like increment them up or down the compute shader uses the parallel processing power of the GPU to do it really fast.
The hard part is that GPUs are optimized for parallel processing. This means that you can't write code like "texelA.value += texelB.value" - without extra work on your part, each fragment of the buffer is processed with zero knowledge of what happens in the other fragments. To reference other texels you need to read the texture again somehow - some techniques read one texture multiple times with offsets (this GL example does this to implement blurs), others do it by repeatedly processing a texture, putting the result into a temporary texture, and then reprocessing that.
At the 10,000 foot level: yes, a compute shader is a good tool for this kind of problem since it involves tons of self-similar calculation. But it won't be easy to do off the bat. If you have not done conventional shader programming before, you may want to look at that first to get used to the way GPUs work. Even really basic tools (if-then-else or loops) have very different performance implications and uses in GPU programming, and it takes some time to get your head around the differences. As of this writing (1/10/13) it looks like Nvidia and Udacity are offering an intro compute shader course which might be a good way to get up to speed.
FWIW you also need pretty modern hardware for compute shaders, which may limit your audience.
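To give a feel for the moving parts on the Unity/C# side, here is a minimal, hypothetical sketch of dispatching a compute shader over a water buffer. The kernel name, buffer name, chunk dimensions, and thread-group size are all assumptions; the actual flow rules would live in the .compute file:

using UnityEngine;

public class WaterFlowDispatcher : MonoBehaviour
{
    public ComputeShader flowShader;     // the .compute asset containing the flow kernel
    ComputeBuffer waterBuffer;           // one float per voxel: the water level
    int kernel;
    const int SizeX = 64, SizeY = 64, SizeZ = 64;   // hypothetical chunk dimensions

    void Start()
    {
        kernel = flowShader.FindKernel("FlowStep");            // assumed kernel name
        waterBuffer = new ComputeBuffer(SizeX * SizeY * SizeZ, sizeof(float));
        flowShader.SetBuffer(kernel, "Water", waterBuffer);     // assumed buffer name
        flowShader.SetInts("Size", SizeX, SizeY, SizeZ);
    }

    void FixedUpdate()
    {
        // One simulation step; group counts must match [numthreads(8,8,8)] in the kernel.
        flowShader.Dispatch(kernel, SizeX / 8, SizeY / 8, SizeZ / 8);
    }

    void OnDestroy()
    {
        waterBuffer.Release();
    }
}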

XNA, Vector math and the GPU

I am looking into making a game for Windows Phone and Windows 8 RT. The first iteration of the game will use XNA for the UI.
But since I plan to have other iterations that may not use XNA, I am writing my core game logic in a Portable Class Library.
I have gotten to the part where I am calculating vector math (sprite locations) in the core game logic.
As I was figuring this out, I had a co-worker tell me that I should make sure that I am doing these calculations on the GPU (not the CPU).
So, here is the question, if I use XNA vector libraries to do my vector calculations, are they automatically done on the GPU?
Side Question: If not, should they be done on the GPU? Or is it OK for me to do them in my Portable Class Library and have the CPU run them?
Note: If I need to have XNA do them so that I can use the GPU then it is not hard to inject that functionality into my core logic from XNA. I just want to know if it is something I should really be doing.
Note II: My game is a 2D game. It will be calculating movement of bad guys and projectiles along a vector. (Meaning this is not a huge 3D Game.)
I think your co-worker is mistaken. Here are just two of the reasons that doing this kind of calculation on the GPU doesn't make sense:
The #1 reason, by a very large margin, is that it's not cheap to get data onto the GPU in the first place. And then it's extremely expensive to get data back from the GPU.
The #2 reason is that the GPU is good for doing parallel calculations - that is - it does the same operation on a large amount of data. The kind of vector operations you will be doing are many different operations, on a small-to-medium amount of data.
So you'd get a huge win if - say - you were doing a particle system on the GPU. It's a large amount of homogeneous data, you perform the same operation on each particle, and all the data can live on the GPU.
Even XNA's built-in SpriteBatch does most of its per-sprite work on the CPU (everything but the final, overall matrix transformation). While it could do per-sprite transforms on the GPU (and I think it used to in XNA 3), it doesn't. This allows it to reduce the amount of data it needs to send the GPU (a performance win), and makes it more flexible - as it leaves the vertex shader free for your own use.
These are great reasons to use the CPU. I'd say if it's good enough for the XNA team - it's good enough for you :)
Now, what I think your co-worker may have meant - rather than the GPU - was to do the vector maths using SIMD instructions (on the CPU).
These give you a performance win. For example - adding a vector usually requires you to add the X component, and then the Y component. Using SIMD allows the CPU to add both components at the same time.
Sadly Microsoft's .NET runtime does not currently make (this kind of) use of SIMD instructions. It's supported in Mono, though.
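For illustration, a minimal sketch of that kind of SIMD add using Mono.Simd (this only applies when running on Mono with the Mono.Simd assembly referenced; on supporting hardware the runtime maps it to SSE instructions):

using Mono.Simd;

// Adds all four components in one operation instead of four scalar adds.
Vector4f a = new Vector4f(1f, 2f, 3f, 4f);
Vector4f b = new Vector4f(5f, 6f, 7f, 8f);
Vector4f sum = a + b;   // (6, 8, 10, 12)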
So, here is the question, if I use XNA vector libraries to do my vector calculations, are they automatically done on the GPU?
Looking inside the Vector class in XNA using ILSpy reveals that the XNA Vector libraries do not use the graphics card for vector math.

How can I avoid frequent int/float/double casts when developing an XNA application?

I'm making a simple 2D game for Windows Phone 7 (Mango) using the XNA Framework.
I've made the following observations:
Most of the drawing operations accept floats
SpriteBatch.Draw accepts a Rectangle which uses ints
The Math class accepts doubles as parameters and also returns doubles
So my code is full of typecasts between ints, floats and doubles. That's a hell of a lot of typecasts.
Is there any way I can get rid of them or I should just not care about this?
Also, do these typecasts present a measurable performance loss?
I noticed this too, but unless you see an actual speed decrease, worrying about it is a micro-optimization. Converting between float and int is relatively expensive, whereas converting between float and double is cheap. So wherever you don't need a float-to-int conversion, avoid it. Type casting is generally cheaper than an actual conversion (e.g. using Convert.ToInt32).
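As a small illustration of the difference (in behaviour as well as cost), a cast truncates while Convert.ToInt32 rounds, so the two are not interchangeable:

float f = 2.7f;

int truncated = (int)f;             // 2 - the cast truncates toward zero
int rounded   = Convert.ToInt32(f); // 3 - Convert rounds to the nearest integer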
However, all of this is unlikely to be a bottleneck unless you're performing it many times. Also, from this post:
float, double and int multiplications have the same performance ==> use the right type of number for your app, no worries
the phone is 4x to 10x slower than the PC I use to develop my apps ==> test with a real phone, don't trust your PC for math operations
divisions are up to 8x slower than multiplications! ==> don't use divisions, e.g. try to use 1/a then multiply
Unofficial numbers, but I think the last one is quite an accepted method. Also, doubles are often thought to be slower than floats, but this comes down to the system it's running on. AFAIK, Windows Phone is optimized to use doubles, which would explain the Math class accepting those.
All in all, it's quite common to see a fair amount of casting with the XNA framework. Of course, it should be avoided where possible, but it's unlikely to be the source of bottlenecks for games unless you perform it very often, in which case other areas may be easier to optimize (or a redesign of the game structure might be required).
If you're worried about the Rectangle conversion, there are overloads that take Vector2's instead, which are float-based:
http://msdn.microsoft.com/en-us/library/ff433988.aspx
Note that the source (texture) rectangle is still a Rectangle, but this is typically a static thing (not static in the C# keyword sense, just unchanging).
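For example, a minimal sketch of drawing with the float-based position overload (spriteBatch, texture, and the source rectangle values here are placeholders):

// The position is a float-based Vector2, so no int casts are needed for movement;
// the int-based source Rectangle only describes which part of the texture to draw.
Vector2 position = new Vector2(123.4f, 56.7f);
Rectangle? sourceRect = new Rectangle(0, 0, 32, 32);

spriteBatch.Draw(texture, position, sourceRect, Color.White);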

What technologies to use for a particle system with enormous calculation demand?

I have a particle system with X particles.
Each particle tests for collision with every other particle. This gives X*X = X^2 collision tests per frame. At 60 fps, this corresponds to 60*X^2 collision checks per second.
What is the best technological approach for these intensive calculations? Should I use F#, C, C++ or C#, or something else?
The following are constraints
The code is written in C# with the latest XNA
Multi-threading may be considered
No special algorithm that tests the collision with the nearest neighbors or that reduces the problem
The last constraint may be strange, so let me explain.
Regardless of constraint 3: given a problem with enormous computational requirements, what would be the best approach to solving it?
An algorithm reduces the problem; still, the same algorithm may behave differently depending on the technology. Consider the pros and cons of the CLR vs. native C.
The simple answer is "measure it". But take a look at this graph (borrowed from this question, which is worth reading).
C++ is maybe 10% faster than MS's C# implementation (for this particular calculation) and faster still against Mono's C# implementation. But in real world terms, C++ is not all that much faster than C#.
If you're doing hard-core number crunching, you will want to use the SIMD/SSE unit of your CPU. This is something that C# does not normally support - but Mono is adding support for it through Mono.Simd. You can see from the graph that using the SIMD unit gives a significant performance boost to both languages.
(It's worth noting that while C++ is still "faster" than C#, the choice of language has only a small effect on performance, compared to the choice of what hardware to use. As was mentioned in the comments - your choice of algorithm will have by far the greatest effect.)
And finally, as Jerry Coffin mentioned in his answer, you could also do the processing on the GPU. I imagine that it would be even faster than SIMD (but as I said - measure it). Using the GPU has the added benefit of leaving the CPU free to do other tasks. The downside is that your end-users will need a reasonable GPU.
You should probably consider doing this on the GPU using something like CUDA, OpenCL, or a DirectX compute shader.
Sweep and prune is a broad-phase collision detection algorithm that may be well suited to this task. If you can make use of temporal coherence (from frame to frame the location differences are generally small), a reduction in processing may be obtained. A good book on the subject is "Real-Time Collision Detection".
For a simple speed-up you could sort by one axis first and check for overlap on that axis before doing a full check... For each particle you only need to look forward in the array until you find one that doesn't overlap on that axis, then you can move on to the next particle.
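A minimal sketch of that sort-and-sweep idea (the Particle type with Position and Radius, and the ResolveCollision call, are hypothetical placeholders):

// Sort by X once per frame, then for each particle only scan forward while the
// next particle could still overlap on the X axis; everything beyond that is skipped.
particles.Sort((a, b) => a.Position.X.CompareTo(b.Position.X));

for (int i = 0; i < particles.Count; i++)
{
    var p = particles[i];
    for (int j = i + 1; j < particles.Count; j++)
    {
        var q = particles[j];

        // Past this point no later particle can overlap p on X, so stop scanning.
        if (q.Position.X - p.Position.X > p.Radius + q.Radius)
            break;

        if (Vector2.Distance(p.Position, q.Position) <= p.Radius + q.Radius)
            ResolveCollision(p, q);   // hypothetical collision response
    }
}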

Vector Transformation with Matrix

I'm working on a software skinner (bone/skin animation), and I'm at the "optimization" phase (the skinner works pretty well and skins a 4900-triangle mesh with 22 bones in 1.09 ms on a 2 GHz Core Duo notebook). What I need to know is:
1) Can someone show me the way (maybe with pseudocode) to transform a float3 (array of 3 float) (representing a coordinate) against a float4x3 matrix?
2) Can someone show me the way (maybe with pseudocode) to transform a float3 (array of 3 float) (representing a normal) against a float3x3 matrix?
I ask this because I know that in the skinning process you can avoid using part of the matrix without changing the result of the animation (and so recover some computation time).
Thanks!
Optimizing vector/matrix operations via mathematical reduction is possible, but tricky. You can find some information on the topic here, here, and here.
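As a sketch of what the two transforms in the question might look like (assuming a row-vector convention where the fourth row of the float4x3 matrix holds the translation, and bone matrices without non-uniform scale so normals can use the plain 3x3 part):

// Transform a position by a 4x3 matrix: rotate/scale by the upper 3x3 rows, then translate.
static void TransformPoint(float[] p, float[,] m, float[] result)
{
    result[0] = p[0] * m[0, 0] + p[1] * m[1, 0] + p[2] * m[2, 0] + m[3, 0];
    result[1] = p[0] * m[0, 1] + p[1] * m[1, 1] + p[2] * m[2, 1] + m[3, 1];
    result[2] = p[0] * m[0, 2] + p[1] * m[1, 2] + p[2] * m[2, 2] + m[3, 2];
}

// Transform a normal by the 3x3 part only: normals are directions, so the translation
// row is skipped entirely - this is the part of the matrix you can avoid touching.
static void TransformNormal(float[] n, float[,] m, float[] result)
{
    result[0] = n[0] * m[0, 0] + n[1] * m[1, 0] + n[2] * m[2, 0];
    result[1] = n[0] * m[0, 1] + n[1] * m[1, 1] + n[2] * m[2, 1];
    result[2] = n[0] * m[0, 2] + n[1] * m[1, 2] + n[2] * m[2, 2];
}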
Now, this may not be quite what you're looking for, but...
You can use the machine's GPU (graphics card processor) to vastly increase the performance of vector/matrix operations. Many operations can be sped up by several orders of magnitude by taking advantage of the SIMD-style parallel processing available on the GPU.
There are two reasonably good libraries available for C# developers for GPGPU programming:
Microsoft's Accelerator Library, documentation available here.
Brahma - an open source GPU library for C# developers that leverages LINQ.
