Faster Matrix Multiplication in C#

I have a small C# project that involves matrices. I am processing large amounts of data by splitting it into n-length chunks, treating the chunks as vectors, and multiplying each by a Vandermonde** matrix. The problem is that, depending on the conditions, the size of the chunks and of the corresponding Vandermonde** matrix can vary. I have a general solution which is easy to read, but way too slow:
public byte[] addBlockRedundancy(byte[] data) {
    if (data.Length!=numGood) D.error("Expecting data to be just "+numGood+" bytes long");
    aMatrix d=aMatrix.newColumnMatrix(this.mod, data);
    var r=vandermonde.multiplyBy(d);
    return r.ToByteArray();
}//method
This can process about a quarter of a megabyte per second on my i5 U470 @ 1.33GHz. I can make this faster by manually inlining the matrix multiplication:
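int o=0;
int d=0;
// Note: with "<", a final full chunk is skipped when data.Length is an exact
// multiple of numGood; use "<=" if that chunk should be processed here too.
for (d=0; d<data.Length-numGood; d+=numGood) {
    for (int r=0; r<numGood+numRedundant; r++) {
        Byte value=0;
        for (int c=0; c<numGood; c++) {
            value=mod.Add(value, mod.Multiply(vandermonde.get(r, c), data[d+c]));
        }//for
        output[r][o]=value;
    }//for
    o++;
}//for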
This can process about 1 meg a second.
(Please note the "mod" is performing operations over GF(2^8) modulo my favorite irreducible polynomial.)
I know this can get a lot faster: after all, the Vandermonde** matrix is mostly zeros. I should be able to write, or find, a routine that takes my matrix and returns an optimized method which multiplies vectors by that matrix, only faster. Then, when I give this routine a 5x5 Vandermonde matrix (the identity matrix), there is simply no arithmetic to perform, and the original data is just copied.
** Please note: When I use the term "Vandermonde", I actually mean an identity matrix with some number of rows from the Vandermonde matrix appended (see comments). This matrix is wonderful because of all the zeros, and because if you remove enough rows (of your choosing) to make it square, it is an invertible matrix. And, of course, I would like to use this same routine to convert any one of those inverted matrices into an optimized series of instructions.
How can I make this matrix multiplication faster?
Thanks!
(edited to correct my mistake with Vandermonde matrix)

Maybe you can define a matrix interface and build implementations at runtime using Reflection.Emit.
IMatrix m = MatrixGenerator.CreateMatrix(data);
m.multiplyBy(...)
Here, MatrixGenerator.CreateMatrix will create a tailored IMatrix implementation, with full loop unrolling and further code pruning (zero cells, identity rows, etc.). MatrixGenerator.CreateMatrix may cache the generated implementations to avoid recreating them later for the same set of data.
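A lighter-weight variant of the same idea, without Reflection.Emit, is to precompile the matrix into per-row lists of its non-zero cells. The following is only a sketch under stated assumptions: the class name CompiledGfMatrix is made up for illustration, the reduction polynomial 0x11B stands in for the asker's unspecified polynomial, and addition is taken to be XOR as usual in GF(2^8). Zero cells cost nothing, and a cell equal to 1 becomes a plain copy.

```csharp
using System;
using System.Collections.Generic;

// Sketch only: precompiles a fixed matrix over GF(2^8) into per-row lists of
// non-zero terms, so zero cells are skipped and coefficient-1 cells are copies.
public sealed class CompiledGfMatrix
{
    // ASSUMPTION: reduction polynomial x^8+x^4+x^3+x+1 (0x11B); substitute your own.
    private const int Poly = 0x11B;

    private readonly int[][] cols;      // cols[r][k]   = column of the k-th non-zero cell in row r
    private readonly byte[][][] tables; // tables[r][k] = 256-entry product table, or null for coefficient 1

    public CompiledGfMatrix(byte[,] matrix)
    {
        int rows = matrix.GetLength(0), width = matrix.GetLength(1);
        cols = new int[rows][];
        tables = new byte[rows][][];
        var cache = new Dictionary<byte, byte[]>();
        for (int r = 0; r < rows; r++)
        {
            var c = new List<int>();
            var t = new List<byte[]>();
            for (int j = 0; j < width; j++)
            {
                byte coeff = matrix[r, j];
                if (coeff == 0) continue;                    // pruned: contributes nothing
                c.Add(j);
                t.Add(coeff == 1 ? null : TableFor(coeff, cache));
            }
            cols[r] = c.ToArray();
            tables[r] = t.ToArray();
        }
    }

    // Shift-and-add multiplication in GF(2^8), reducing by Poly on overflow.
    public static byte Multiply(byte a, byte b)
    {
        int x = a, y = b, p = 0;
        while (y != 0)
        {
            if ((y & 1) != 0) p ^= x;
            x <<= 1;
            if ((x & 0x100) != 0) x ^= Poly;
            y >>= 1;
        }
        return (byte)p;
    }

    // Builds (or reuses) the table of coeff * x for every byte x.
    private static byte[] TableFor(byte coeff, Dictionary<byte, byte[]> cache)
    {
        if (!cache.TryGetValue(coeff, out byte[] table))
        {
            table = new byte[256];
            for (int x = 0; x < 256; x++) table[x] = Multiply(coeff, (byte)x);
            cache[coeff] = table;
        }
        return table;
    }

    // Multiplies the chunk starting at input[offset] by the matrix; output needs one byte per row.
    public void Apply(byte[] input, int offset, byte[] output)
    {
        for (int r = 0; r < cols.Length; r++)
        {
            int[] c = cols[r];
            byte[][] t = tables[r];
            byte value = 0;
            for (int k = 0; k < c.Length; k++)
            {
                byte x = input[offset + c[k]];
                value ^= t[k] == null ? x : t[k][x];         // addition in GF(2^8) is XOR
            }
            output[r] = value;
        }
    }
}
```

The compiled form is built once per matrix and reused for every chunk. Whether it beats a generated, fully unrolled method would have to be measured.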

I've seen solutions using Reflection.Emit, and I've seen solutions which involve TPL. The real answer here is, for most situations, that you want to use an existing unmanaged library such as Intel MKL via P/Invoke. Alternatively, if you are using the GPU, you can go with the GPGPU approach which would go a lot faster.
And yes, SSE together with multi-core processing is the fastest way to do it on a CPU. But I wouldn't recommend writing your own algorithm - instead, go look for something that's already out there. Most likely, it will end up being a C++ library, possibly with a C# wrapper.

While it won't speed up the math, you could at least use all your cores with Parallel.For in .NET 4.0 (see Microsoft's documentation).
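A rough sketch of how that could look for chunked data, assuming each chunk can be encoded independently of the others. The names ParallelChunks, EncodeAll and the encodeChunk callback are hypothetical and not part of the asker's code:

```csharp
using System;
using System.Threading.Tasks;

public static class ParallelChunks
{
    // encodeChunk(input, inputOffset, output, outputOffset) is a hypothetical callback that
    // reads chunkIn bytes and writes chunkOut bytes; it must not touch any other chunk.
    public static byte[] EncodeAll(byte[] data, int chunkIn, int chunkOut,
                                   Action<byte[], int, byte[], int> encodeChunk)
    {
        int chunks = data.Length / chunkIn;        // a trailing partial chunk is ignored in this sketch
        var output = new byte[chunks * chunkOut];
        // Each iteration writes to its own disjoint slice of output, so no locking is needed.
        Parallel.For(0, chunks, i => encodeChunk(data, i * chunkIn, output, i * chunkOut));
        return output;
    }
}
```

For very small chunks the per-iteration overhead can dominate, so handing each iteration a range of chunks instead of a single one may work better.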

From the math perspective
You could look at eigenspaces, eigenvectors, and eigenvalues. I'm not sure what your application does or whether they will help.
You could look at LU decomposition.
All of the above topics can be found on Wikipedia.
From a programming perspective
You could try SIMD instructions, but they are mostly geared toward small fixed sizes such as the 4x4 matrices used for homogeneous transformations of 3D space in computer graphics.
You could write special algorithms for your most common dimensions.
See also: Using SSE in C#, is it possible?

Related

Improve performance of per-pixel image multiplication with mask and thresholding

I am looking for suggestions to make this algorithm run faster. Using C# and EmguCV (a managed wrapper over OpenCV), I am processing images in real-time, adjusting the gain of pixels that are not fully saturated (basically a flat-field correction with a threshold). The code works correctly, but the CPU usage is relatively high due to frame-rate and image size.
I have dabbled a bit with using UMat to perform the same operations, but performance has been worse. I also tried a version with integer math (i.e., doing integer multiplication and division instead of a floating-point multiply), but that was slower than leaving it in floating point.
I have profiled the code and optimized it somewhat by doing things like reusing my Mat objects, pointing Mats to existing blocks of memory instead of allocating new memory, etc. Suggestions using OpenCV would also be welcome, and I will find and use the EmguCV equivalents.
var ushortImage = new Mat(_frameH, _frameW, DepthType.Cv16U, 1, (IntPtr)pixelsPtr, _frameW * 2);
//get mask of all the pixels that are at max-value
CvInvoke.InRange(ushortImage, _lowerRange, _upperRange, _byteMask);
//convert input to float for multiplication
ushortImage.ConvertTo(_floatFrame, DepthType.Cv32F);
//multiply each pixel by its gain value
CvInvoke.Multiply(_floatFrame, gainMat, _floatFrame);
//threshold to allowable range
CvInvoke.Threshold(_floatFrame, _floatFrame, _frameMaxPixelValue, _frameMaxPixelValue, ThresholdType.Trunc);
//convert back to 16-bit
_floatFrame.ConvertTo(ushortImage, DepthType.Cv16U);
//restore the max-intensity pixels
ushortImage.SetTo(new MCvScalar(_frameMaxPixelValue), _byteMask);
Here is the profiling info from dotTrace. Nothing really stands out as being the bottleneck, so I'm hoping there's a way to combine some operations into a single function call. Even a 10% improvement would be helpful because I'm trying to basically double the throughput of the application.

Least squares using general matrix vector multiplication, not sparse matrices

Is there a way to compute
\argmin_{x}\|Ax-b\|_2
based on a function that computes matrix vector products Ax, without explicitly storing, sparse or non-sparse, A in memory?
In Python, I'd use scipy.sparse.linalg.lsqr for that (despite the package name, this function doesn't require sparse matrices, but allows for LinearOperators).

I ended up translating the open-source code of SciPy's lsmr to C#. Most of the already surprisingly short code there is documentation and logging; there are maybe 100 non-trivial lines, all of which have direct equivalents in BLAS.
(lsmr is an improved version of lsqr)
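To illustrate what "matrix-free" means here, the sketch below uses CGLS (conjugate gradient applied to the normal equations), which is a simpler relative of LSQR/LSMR and not a translation of SciPy's code. It only ever calls two user-supplied functions, one for A x and one for A^T y. The names MatrixFreeLeastSquares, Solve, apply and applyT are illustrative assumptions.

```csharp
using System;

public static class MatrixFreeLeastSquares
{
    // Sketch of CGLS, not LSMR. apply(x) must return A x (length m) and
    // applyT(y) must return A^T y (length n); A itself is never stored.
    public static double[] Solve(Func<double[], double[]> apply,
                                 Func<double[], double[]> applyT,
                                 double[] b, int n, int maxIter = 100, double tol = 1e-12)
    {
        var x = new double[n];                 // start from x = 0
        var r = (double[])b.Clone();           // residual b - A x
        double[] s = applyT(r);                // A^T r, the negative gradient direction
        var p = (double[])s.Clone();           // search direction
        double gamma = Dot(s, s);
        double gamma0 = gamma;

        // Stop once ||A^T r|| has dropped below tol * ||A^T b||.
        for (int k = 0; k < maxIter && gamma > tol * tol * gamma0; k++)
        {
            double[] q = apply(p);
            double qq = Dot(q, q);
            if (qq == 0) break;                // p lies in the null space of A
            double alpha = gamma / qq;
            for (int i = 0; i < n; i++) x[i] += alpha * p[i];
            for (int i = 0; i < r.Length; i++) r[i] -= alpha * q[i];
            s = applyT(r);
            double gammaNew = Dot(s, s);
            double beta = gammaNew / gamma;
            for (int i = 0; i < n; i++) p[i] = s[i] + beta * p[i];
            gamma = gammaNew;
        }
        return x;
    }

    private static double Dot(double[] u, double[] v)
    {
        double sum = 0;
        for (int i = 0; i < u.Length; i++) sum += u[i] * v[i];
        return sum;
    }
}
```

CGLS is known to be less robust than LSQR or LSMR on ill-conditioned problems, so for production use the translated lsmr described above is likely the better choice.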

Understanding performance of two Matrix rotation algorithms

My lack of in depth understanding of the fundamentals has taken a toll on these types of problem solving challenges.
The HackerRank matrix rotation problem is a very fun one to solve. I recommend people who are trying to enrich their coding skills to use hackerrank (https://www.hackerrank.com/challenges/matrix-rotation-algo)
The problem summary is that you are given an R x C matrix of integers where the minimum of R and C must be even. You have to rotate the matrix anti-clockwise x number of times. Rotation applies to the elements of the matrix, not the matrix dimension in case it is not clear.
So I solved this problem with two algorithms. They are both very similar in that you can imagine the matrix like layers of onions where you loop through each layer, and rotate the elements in that layer. The number of rotations is simply x % (count of elements in that layer) so if you are given x=1,000,000 it doesn't make sense to repeat full rotations.
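A minimal sketch of that x % count reduction, assuming a layer has already been copied out into a one-dimensional array. RingRotation and RotateRing are illustrative names, not taken from either linked solution, and which way the shift moves elements around the matrix depends on the order in which the layer was read out:

```csharp
using System;

public static class RingRotation
{
    // Shifts the elements of one layer by x positions, after reducing x modulo
    // the layer length so that full turns are never performed.
    public static int[] RotateRing(int[] ring, long x)
    {
        int n = ring.Length;                   // assumed to be non-zero
        int shift = (int)(x % n);
        var result = new int[n];
        for (int i = 0; i < n; i++)
            result[i] = ring[(i + shift) % n];
        return result;
    }
}
```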
The first one, which is the faster of the two, is:
https://codetidy.com/8002/
The second one does not loop through the number of rotations but instead does some heavy logic and math to figure out where to move each element.
https://codetidy.com/8001/
So when I was writing the second one, I assumed that it would be crazy faster, because you don't iterate through maximum number of rotations in each layer. However, it ended up faring slower.
I don't quite understand why. I logged the number of iterations in a console and the first one does 50x more iterations, but is faster.

Number of iterations is not everything. Here are a few general things that might affect the performance.
One important thing to keep in mind with arrays and matrices is cache behavior. If your operations generate lots of cache hits they will seem orders of magnitude faster. To get cache hits you usually need to access memory in storage order. For an array that means going sequentially forward. For a matrix stored row by row it means varying the last index fastest. To get misses you need to jump around in increments larger than the size of the cache line (CPU dependent). Fun experiment: benchmark for (i...) for (j...) ++m[i][j] against for (i...) for (j...) ++m[j][i] to see the difference.
In your case I would guess that the faster approach has very linear access on the horizontal parts at least.
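The experiment mentioned above could be set up roughly as follows. This is a sketch with made-up names (TraversalOrder, SumRowOrder, SumColumnOrder, TimeMs); it sums the elements of a rectangular int[,] instead of incrementing them, which exercises the same two access patterns.

```csharp
using System;
using System.Diagnostics;

public static class TraversalOrder
{
    // Last index changes fastest: this follows the storage order of a C# int[,].
    public static long SumRowOrder(int[,] m)
    {
        long sum = 0;
        int rows = m.GetLength(0), cols = m.GetLength(1);
        for (int i = 0; i < rows; i++)
            for (int j = 0; j < cols; j++)
                sum += m[i, j];
        return sum;
    }

    // First index changes fastest: consecutive accesses are a whole row apart in memory.
    public static long SumColumnOrder(int[,] m)
    {
        long sum = 0;
        int rows = m.GetLength(0), cols = m.GetLength(1);
        for (int j = 0; j < cols; j++)
            for (int i = 0; i < rows; i++)
                sum += m[i, j];
        return sum;
    }

    // Times a single traversal in milliseconds.
    public static long TimeMs(Func<int[,], long> traversal, int[,] m)
    {
        var sw = Stopwatch.StartNew();
        traversal(m);
        sw.Stop();
        return sw.ElapsedMilliseconds;
    }
}
```

To try it, allocate something like new int[4096, 4096], call TimeMs with each traversal several times, and discard the first run of each so that JIT compilation is not counted. The size of the gap depends on the CPU and on the matrix size.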
Then there's branch prediction. Modern CPUs pipeline the instructions to make better use of the existing hardware. Branches (IFs) break the pipeline since you don't know which path to take (that instruction is still executing). As an optimization the compiler/CPU pick one and start processing and if the condition result is the other way it will throw everything away and restart processing. Checking something that usually gives the same result (like i<n) will be faster than something that's harder to predict.
These are some low-level reasons why the simpler approach might seem faster. Add some higher-level reasons (like the compiler not optimizing the code the way you expect) and you get results like this.
An important note: complexity reflects asymptotic behavior. Yes, the second approach will be faster for a sufficiently large matrix, but it's very likely that the sizes used for this problem are not sufficiently large.

Uniform distribution from a fractal Perlin noise function in C#

My Perlin noise function (which adds up 6 octaves of 3D simplex noise at 0.75 persistence) generates a 2D array of doubles.
These numbers each come out normalized to [-1, 1], with mean at 0. I clamp them to avoid exceptions, which I think are due to floating-point accuracy issues, but I am fairly sure my scaling factor is good enough for restricting the noise output to exactly this neighborhood in the ideal case.
Anyway, that's all details. The point is, here is a 256-by-256 array of noise:
The histogram with a normal fit looks like this:
Matlab's lillietest is a function which applies the Lilliefors test to determine if a set of numbers comes from a normal distribution. My result was, repeatedly, 1, which means that these numbers are not normally distributed.
I would like a function f(x) such that, when applied to the list of values from my noise function, the results appear uniformly distributed.
I would like this function to be implementable in C# and not take minutes to run.
Once again, it shouldn't matter where the numbers come from (the question is about transforming one distribution into another, specifically a normal-like one to uniform). Nevertheless, my noise function implementation is based on this and this. You can find the above array of values here.

Oddly enough I just wrote an article on your very question:
http://ericlippert.com/2012/02/21/generating-random-non-uniform-data/
There I discuss how to turn a uniform distribution into some other distribution, but of course you can use similar techniques to transform other distributions.

You will probably be interested in one of the following (related) techniques:
Probability integral transform
Histogram equalization
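Both techniques amount to pushing each value through the cumulative distribution function of the data. When that function is not known in closed form, the empirical one can be used, which reduces to ranking the values. A minimal sketch, with the illustrative names Equalize and ToUniform (a 2D noise array would need to be flattened first):

```csharp
using System;

public static class Equalize
{
    // Replaces each value by its rank-based empirical CDF position in (0, 1).
    // Tied values receive distinct, arbitrarily ordered ranks in this sketch.
    public static double[] ToUniform(double[] values)
    {
        int n = values.Length;
        var order = new int[n];
        for (int i = 0; i < n; i++) order[i] = i;

        var keys = (double[])values.Clone();
        Array.Sort(keys, order);               // sorts keys and permutes order alongside them

        var result = new double[n];
        for (int rank = 0; rank < n; rank++)
            result[order[rank]] = (rank + 0.5) / n;
        return result;
    }
}
```

This is O(n log n), so a 256-by-256 array is handled quickly. The mapping is specific to the data set it was computed from; to apply one fixed f(x) to noise generated later, the sorted values could be kept and new inputs located by binary search.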

FFT on WP7 shows two mirrors

Hello
I'm exploring the audio possibilities of the WP7 platform and the first stumble I've had is trying to implement a FFT using the Cooley-Tukey method. The result of that is that the spectrogram shows 4 identical images in this order: one normal, one reversed, one normal, one reversed.
The code was taken from another C# project (for desktop), the implementation and all variables seem in place with the algorithm.
So I can see two problems right away: reduced resolution and CPU wasted to generate four identical spectrograms.
Given a sample size of 1600 (could be 2048), I now have only 512 usable frequency bins, which leaves me with a 15Hz resolution for an 8kHz frequency span. Not bad, but not so good either.
Should I just give up on the code and use NAudio? I cannot seem to have an explanation why the spectrum is quadrupled, input data is ok, algorithm seems ok.

This sounds correct. You have 2 mirrors; I can only assume that one is the real part and the other is the imaginary part. This is standard FFT.
From the real and imaginary parts you can compute the magnitude (amplitude) of each harmonic, which is more common, or the angle (phase shift) of each harmonic, which is less common.
Gilad.

I switched to NAudio and now the FFT works. However I might have found the cause (I probably won't try to test again): when I was constructing an array of double to feed into the FFT function, I did something like:
for (int i = 0; i < buffer.Length; i+= sizeof(short))
{
    samples[i] = ReadSample(buffer, i);
}
For reference, 'samples' is the double[] input to fft, ReadSample is something that takes care of little/big endian. Can't remember right now how the code was, but it was skipping every odd sample.
My math knowledge has never been great but I'm thinking this induces some aliasing patterns which might in the end produce the effect I experienced.
Anyway, problem worked around, but thanks for your input and if you can still explain the phenomenon I am grateful.
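If the loop really was as quoted, the byte offset i was also used as the index into samples, so only every second entry of samples was written and the entries in between stayed at zero. Inserting a zero after every sample makes the spectrum repeat twice across the FFT output, and since the spectrum of real input is mirror-symmetric, each repeat comes with its own mirror image. That would match the normal, reversed, normal, reversed pattern described in the question. A hedged sketch of the conversion with separate indices follows; the ReadSample shown here is a hypothetical little-endian stand-in, since the original helper is not shown.

```csharp
using System;

public static class SampleConversion
{
    // HYPOTHETICAL stand-in for the ReadSample mentioned above:
    // reads one 16-bit little-endian PCM sample starting at byteOffset.
    public static short ReadSample(byte[] buffer, int byteOffset)
    {
        return (short)(buffer[byteOffset] | (buffer[byteOffset + 1] << 8));
    }

    // Converts a byte buffer into one double per 16-bit sample, with no gaps.
    public static double[] ToSamples(byte[] buffer)
    {
        var samples = new double[buffer.Length / sizeof(short)];
        for (int i = 0; i < samples.Length; i++)
            samples[i] = ReadSample(buffer, i * sizeof(short));   // sample index i, byte offset 2 * i
        return samples;
    }
}
```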
