I am looking for suggestions to make this algorithm run faster. Using C# and EmguCV (a managed wrapper over OpenCV), I am processing images in real time, adjusting the gain of pixels that are not fully saturated (basically a flat-field correction with a threshold). The code works correctly, but the CPU usage is relatively high because of the frame rate and image size.
I have dabbled a bit with using UMat to perform the same operations, but performance has been worse. I also tried a version with integer math (i.e., doing an integer multiplication and division instead of a floating-point multiply), but that was slower than leaving it in floating point.
I have profiled the code and optimized it some by doing things like reusing my Mat objects, pointing Mats to existing blocks of memory instead of allocating new memory, etc. Suggestions using OpenCV would also be welcome, and I will find/use the EmguCV equivalents.
var ushortImage = new Mat(_frameH, _frameW, DepthType.Cv16U, 1, (IntPtr)pixelsPtr, _frameW * 2);
//get mask of all the pixels that are at max-value
CvInvoke.InRange(ushortImage, _lowerRange, _upperRange, _byteMask);
//convert input to float for multiplication
ushortImage.ConvertTo(_floatFrame, DepthType.Cv32F);
//multiply each pixel by its gain value
CvInvoke.Multiply(_floatFrame, gainMat, _floatFrame);
//threshold to allowable range
CvInvoke.Threshold(_floatFrame, _floatFrame, _frameMaxPixelValue, _frameMaxPixelValue, ThresholdType.Trunc);
//convert back to 16-bit
_floatFrame.ConvertTo(ushortImage, DepthType.Cv16U);
//restore the max-intensity pixels
ushortImage.SetTo(new MCvScalar(_frameMaxPixelValue), _byteMask);
Here is the profiling info from dotTrace. Nothing really stands out as being the bottleneck, so I'm hoping there's a way to combine some operations into a single function call. Even a 10% improvement would be helpful, because I'm trying to basically double the throughput of the application.
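One combination I have been considering (sketch only): when _frameMaxPixelValue is the full 16-bit maximum of 65535, the 32F-to-16U ConvertTo already performs a saturating cast, so the explicit Threshold call could be dropped. I have not measured this, and it does not apply if the max pixel value is below 65535 (e.g., 12-bit data):
// Sketch: valid only when _frameMaxPixelValue == 65535, because the
// 32F -> 16U conversion saturates at 65535 on its own.
CvInvoke.InRange(ushortImage, _lowerRange, _upperRange, _byteMask);
ushortImage.ConvertTo(_floatFrame, DepthType.Cv32F);
CvInvoke.Multiply(_floatFrame, gainMat, _floatFrame);
_floatFrame.ConvertTo(ushortImage, DepthType.Cv16U);   // saturating cast replaces Threshold
ushortImage.SetTo(new MCvScalar(_frameMaxPixelValue), _byteMask);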
I have a data source that provides many (4096) double values in an array. These are measured with high resolution and are the result of an FFT. For visualisation purposes, they need to be reduced. (Reapplying the FFT to the raw signal is not possible here.) I could simply average each n samples and get a resulting array of length / n values. To allow more flexible selection of the number of resulting values, I need interpolation, though.
I've looked up some basic information about this on Wikipedia. I am already familiar with 2D downsampling/interpolation from a user perspective in raster image editors. Now I need this in 1D in C# code. Think of it as changing (reducing) the raster image size of a 1D barcode image, or resampling an audio wave file.
One library I've found recommended is Math.NET Numerics. This is already used for other tasks in my application, so I could easily use that. There's the CubicSpline class in there but I have no idea how to use it.
Q: What would be an approach to reduce the number of samples in a double[] to an arbitrary number using interpolation?
I'm not interested in finding a single double value between two others. I need to combine multiple source values into each single output value without losing information (single frequency bins with an extreme level) at the group boundaries, and without aliasing effects or rounding caused by unequal group sizes when the numbers aren't evenly divisible.
Maybe using bitmap image functions with a 1×n bitmap size would be a good solution instead of dealing with the math directly? That would involve a lot of data conversion, though, which reduces performance and probably also precision. Or some library from the audio processing field?
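To illustrate the direction I'm considering: a fractional box average (the 1D analogue of area-averaging image downscaling) would handle arbitrary output lengths without rounding at the group boundaries. A rough sketch (Resample is just a hypothetical helper of mine, not a Math.NET API):
// Downsample src to dstLength values using fractional (area-averaging) bins.
// Each output value is the weighted mean of the source samples it covers,
// so unequal group sizes are handled without rounding at the boundaries.
static double[] Resample(double[] src, int dstLength)
{
    var dst = new double[dstLength];
    double ratio = (double)src.Length / dstLength;   // source samples per output bin
    for (int i = 0; i < dstLength; i++)
    {
        double start = i * ratio;
        double end = (i + 1) * ratio;
        double sum = 0.0;
        for (int j = (int)Math.Floor(start); j < end && j < src.Length; j++)
        {
            // overlap of source sample j with the bin [start, end)
            double weight = Math.Min(j + 1, end) - Math.Max(j, start);
            sum += src[j] * weight;
        }
        dst[i] = sum / (end - start);
    }
    return dst;
}
I'm still unsure whether this dilutes single extreme bins too much; maybe a max-per-bin variant would be needed to keep peaks visible.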
I have seen a lot of topics about this; I understand the theory, but I'm not able to code it.
I have some pictures and I want to determine whether they are blurred or not. I found a library (aforge.dll) and I used it to compute an FFT of an image.
As an example, here are two images I'm working on:
My code is in C#:
public Bitmap PerformFFT(Bitmap Picture)
{
    // Load the image (AForge expects an 8bpp grayscale bitmap whose width and height are powers of two)
    ComplexImage output = ComplexImage.FromBitmap(Picture);
    // Perform the forward FFT
    output.ForwardFourierTransform();
    // Return the spectrum as a bitmap
    return output.ToBitmap();
}
How can I determine whether the image is blurred? I am not very comfortable with the theory and need a concrete example. I saw this post, but I have no idea how to do that.
EDIT:
I'll clarify my question. Given the 2D array of complex values in ComplexImage output (the image's FFT), what C# code (or pseudocode) can I use to determine whether the image is blurred?
The concept of "blurred" is subjective. How much power at high frequencies indicates it's not blurry? Note that a blurry image of a complex scene has more power at high frequencies than a sharp image of a very simple scene. For example a sharp picture of a completely uniform scene has no high frequencies whatsoever. Thus it is impossible to define a unique blurriness measure.
What is possible is to compare two images of the same scene, and determine which one is more blurry (or identically, which one is sharper). This is what is used in automatic focussing. I don't know how exactly what process commercial cameras use, but in microscopy, images are taken at a series of focal depths, and compared.
One of the classical comparison methods doesn't involve Fourier transforms at all. One computes the local variance (for each pixel, take a small window around it and compute the variance for those values), and averages it across the image. The image with the highest variance has the best focus.
Comparing high vs. low frequencies as in MBo's answer is comparable to computing the Laplace-filtered image and averaging its absolute values (the filter can return negative values). The Laplace filter is a high-pass filter, meaning that low frequencies are removed. Since the power in the high frequencies gives a relative measure of sharpness, this statistic does too (again relative: it should only be compared across images of the same scene, taken under identical circumstances).
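A rough sketch of the local-variance measure described above, assuming the image has already been loaded into a double[,] of gray values (names and window size are illustrative):
// Mean of the local variance over all pixels; higher means sharper (relative measure).
// 'radius' controls the window size (radius 1 => 3x3 window).
static double MeanLocalVariance(double[,] gray, int radius = 1)
{
    int h = gray.GetLength(0), w = gray.GetLength(1);
    double total = 0.0;
    int count = 0;
    for (int y = radius; y < h - radius; y++)
    {
        for (int x = radius; x < w - radius; x++)
        {
            double sum = 0.0, sumSq = 0.0;
            int n = 0;
            for (int dy = -radius; dy <= radius; dy++)
                for (int dx = -radius; dx <= radius; dx++)
                {
                    double v = gray[y + dy, x + dx];
                    sum += v;
                    sumSq += v * v;
                    n++;
                }
            double mean = sum / n;
            total += sumSq / n - mean * mean;   // variance within the window
            count++;
        }
    }
    return count > 0 ? total / count : 0.0;
}
Run it on the candidate images and favour the one with the highest value.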
A blurred image has an FFT with smaller magnitudes in the high-frequency regions. Array elements with low indexes (near Result[0][0]) represent the low-frequency region.
So split the result array into regions by some criterion, sum the magnitudes in each region, and compare them. For example, take as the low-frequency region the quarter of the result array (of size M×M) with index_x < M/2 and index_y < M/2.
For a series of progressively more blurred versions of the same initial image, you should see a higher and higher ratio Sum(Low)/Sum(High).
The result is a square N×N array. Its magnitude has central symmetry (|F(x,y)| = |F(-x,-y)| because the source is purely real), so it is enough to treat the top half of the array, with y < N/2.
Low-frequency components are located near the top-left and top-right corners of the array (smallest values of y; smallest and highest values of x). So sum the magnitudes of the array elements in these ranges:
for y in range 0..N/2
    for x in range 0..N
        amp = magnitude(y, x)
        if (y < N/4) and ((x < N/4) or (x >= 3*N/4))
            low = low + amp
        else
            high = high + amp
Note that your picture shows the array quadrants swapped around: it is standard practice to shift the zero-frequency component to the center for display.
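If you stay with AForge's ComplexImage, the pseudocode above translates roughly as follows (a sketch, assuming Data exposes the untranslated spectrum as Complex[height, width] with the zero-frequency component at [0, 0]):
// Ratio of low-frequency to high-frequency energy; larger values suggest more blur.
// 'fft' is the ComplexImage after ForwardFourierTransform().
static double LowHighRatio(ComplexImage fft)
{
    var data = fft.Data;                  // AForge.Math.Complex[,]
    int n = fft.Height, m = fft.Width;
    double low = 0.0, high = 0.0;
    for (int y = 0; y < n / 2; y++)       // top half is enough (central symmetry)
    {
        for (int x = 0; x < m; x++)
        {
            double amp = data[y, x].Magnitude;
            bool isLow = y < n / 4 && (x < m / 4 || x >= 3 * m / 4);
            if (isLow) low += amp; else high += amp;
        }
    }
    return low / high;
}
Compare the ratio across versions of the same image only; the absolute value is not meaningful on its own.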
I use C# with OpenTK to access the OpenGL API. My project uses tessellation to render a heightmap. My tessellation control shader splits a square into a grid of 64 squares, and my tessellation evaluation shader adds vertical offsets to those points. The vertical offsets are stored in a plain uniform float array like this:
uniform float HeightmapBuffer[65 * 65];
Everything works fine when I run the project on my laptop with an AMD Radeon 8250 GPU. The problems start when I try to run it on Nvidia graphics cards. I tried an older GT 430 and a brand-new GTX 1060, but the results are the same:
Tessellation evaluation info
----------------------------
0(13) : error C5041: cannot locate suitable resource to bind variable "HeightmapBuffer". Possibly large array.
While researching this problem, I found the GL_MAX_UNIFORM_BLOCK_SIZE query, which returns ~500 MB on the AMD chip and 65.54 kB on both Nvidia chips. That is a little strange, since my array actually uses only 16.9 kB, so I am not even sure whether the "BLOCK SIZE" actually limits the size of one variable. Maybe it limits the size of all uniforms passed to one shader? Even so, I can't believe that my program would use 65 kB.
Note that I also tried to go the 'common' way of using a texture, but I think there were interpolation problems, so when placing two adjacent heightmaps next to each other, the borders didn't match. With the uniform array, on the other hand, things work perfectly.
So what is the actual meaning of GL_MAX_UNIFORM_BLOCK_SIZE? Why is this value on Nvidia GPUs so low? Is there any other way to pass a large array to my shader?
While researching this problem, I found the GL_MAX_UNIFORM_BLOCK_SIZE query, which returns ~500 MB on the AMD chip and 65.54 kB on both Nvidia chips.
GL_MAX_UNIFORM_BLOCK_SIZE is the wrong limit. That applies only to Uniform Buffer Objects.
You just declare an array
uniform float HeightmapBuffer[65 * 65];
outside of a uniform block. Since you seem to use this in a tessellation evaluation shader, the relevant limit is MAX_TESS_EVALUATION_UNIFORM_COMPONENTS (there is a separate such limit for each programmable stage). This component limit counts just the number of float components, so a vec4 consumes 4 components and a float just one.
In your particular case, the latest GL spec at the time of this writing, the [GL 4.6 core profile](https://www.khronos.org/registry/OpenGL/specs/gl/glspec46.core.pdf), only guarantees a minimum value of 1024 for that limit (= 4 KiB), and you are way beyond it.
It is actually a very bad idea to use plain uniforms for such amounts of data. You should consider using UBOs, Texture Buffer Objects, Shader Storage Buffer Objects or even plain textures to store your array. UBOs would probably be the most natural choice in your scenario.
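A rough OpenTK sketch of the UBO route (block and variable names are illustrative; program is your linked shader program handle and heights is the float[65 * 65] you already have). Under std140 a plain float array would get a 16-byte stride per element, so the sketch packs four heights into each vec4, which also keeps the block well under the 64 KiB Nvidia limit:
// GLSL side (tessellation evaluation shader), replacing the plain uniform array:
//   layout(std140) uniform HeightmapBlock { vec4 Heights[(65 * 65 + 3) / 4]; };
//   float h = Heights[i >> 2][i & 3];

// C# side with OpenTK: a tightly packed float[] maps exactly onto the vec4 array.
int count = 65 * 65;
int paddedCount = (count + 3) & ~3;                   // round up to a multiple of 4
float[] packed = new float[paddedCount];
Array.Copy(heights, packed, count);

int ubo = GL.GenBuffer();
GL.BindBuffer(BufferTarget.UniformBuffer, ubo);
GL.BufferData(BufferTarget.UniformBuffer, (IntPtr)(packed.Length * sizeof(float)), packed, BufferUsageHint.DynamicDraw);

int blockIndex = GL.GetUniformBlockIndex(program, "HeightmapBlock");
GL.UniformBlockBinding(program, blockIndex, 0);       // attach the block to binding point 0
GL.BindBufferBase(BufferRangeTarget.UniformBuffer, 0, ubo);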
I have a small C# project that involves matrices. I am processing large amounts of data by splitting it into n-length chunks, treating the chunks as vectors, and multiplying by a Vandermonde** matrix. The problem is that, depending on the conditions, the size of the chunks and of the corresponding Vandermonde** matrix can vary. I have a general solution which is easy to read, but way too slow:
public byte[] addBlockRedundancy(byte[] data) {
    if (data.Length != numGood) D.error("Expecting data to be just " + numGood + " bytes long");
    aMatrix d = aMatrix.newColumnMatrix(this.mod, data);
    var r = vandermonde.multiplyBy(d);
    return r.ToByteArray();
}//method
This can process about 1/4 megabyte per second on my i5 U470 @ 1.33 GHz. I can make this faster by manually inlining the matrix multiplication:
int o = 0;
int d = 0;
for (d = 0; d < data.Length - numGood; d += numGood) {
    for (int r = 0; r < numGood + numRedundant; r++) {
        Byte value = 0;
        for (int c = 0; c < numGood; c++) {
            value = mod.Add(value, mod.Multiply(vandermonde.get(r, c), data[d + c]));
        }//for
        output[r][o] = value;
    }//for
    o++;
}//for
This can process about 1 MB per second.
(Please note the "mod" is performing operations over GF(2^8) modulo my favorite irreducible polynomial.)
I know this can get a lot faster: after all, the Vandermonde** matrix is mostly zeros. I should be able to make a routine, or find one, that can take my matrix and return an optimized method that effectively multiplies vectors by the given matrix, but faster. Then, when I give this routine a 5x5 Vandermonde matrix (which is the identity matrix), there is simply no arithmetic to perform, and the original data is just copied.
** Please note: when I use the term "Vandermonde", I actually mean an identity matrix with some number of rows from the Vandermonde matrix appended (see comments). This matrix is wonderful because of all the zeros, and because if you remove enough rows (of your choosing) to make it square, it is an invertible matrix. And, of course, I would like to use this same routine to convert any one of those inverted matrices into an optimized series of instructions.
How can I make this matrix multiplication faster?
Thanks!
(edited to correct my mistake with Vandermonde matrix)
Maybe you can define a matrix interface and build implementations at runtime using Reflection.Emit.
IMatrix m = MatrixGenerator.CreateMatrix(data);
m.multiplyBy(...)
Here, MatrixGenerator.CreateMatrix would create a tailored IMatrix implementation, with full loop unrolling and further code pruning (zero cells, identity rows, etc.). MatrixGenerator.CreateMatrix could also cache the generated implementations to avoid recreating them later for the same matrix.
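If Reflection.Emit feels too heavy, expression trees give the same kind of specialization with less ceremony. A sketch of what such a CreateMatrix could do internally (mulTable is an assumed precomputed GF(2^8) multiplication table, field addition is XOR, and none of these names are an existing API):
using System;
using System.Collections.Generic;
using System.Linq.Expressions;

// Compile a delegate specialized to one fixed matrix over GF(2^8).
// Zero cells generate no code at all, and a coefficient of 1 becomes a plain copy.
static Func<byte[], byte[]> CompileMultiply(byte[][] matrix, byte[,] mulTable)
{
    var data = Expression.Parameter(typeof(byte[]), "data");
    var output = Expression.Variable(typeof(byte[]), "output");
    var table = Expression.Constant(mulTable);
    var body = new List<Expression>
    {
        Expression.Assign(output,
            Expression.NewArrayBounds(typeof(byte), Expression.Constant(matrix.Length)))
    };

    for (int r = 0; r < matrix.Length; r++)
    {
        Expression acc = null;
        for (int c = 0; c < matrix[r].Length; c++)
        {
            byte coeff = matrix[r][c];
            if (coeff == 0) continue;                          // prune zero cells entirely
            Expression term = Expression.ArrayIndex(data, Expression.Constant(c));
            if (coeff != 1)                                     // multiplying by 1 needs no lookup
                term = Expression.ArrayIndex(table,
                    Expression.Constant((int)coeff),
                    Expression.Convert(term, typeof(int)));
            acc = acc == null
                ? term
                : (Expression)Expression.Convert(
                      Expression.ExclusiveOr(                   // GF(2^8) addition is XOR
                          Expression.Convert(acc, typeof(int)),
                          Expression.Convert(term, typeof(int))),
                      typeof(byte));
        }
        body.Add(Expression.Assign(
            Expression.ArrayAccess(output, Expression.Constant(r)),
            acc ?? (Expression)Expression.Constant((byte)0)));
    }
    body.Add(output);                                           // the block's return value
    return Expression.Lambda<Func<byte[], byte[]>>(
        Expression.Block(new[] { output }, body), data).Compile();
}
For the identity rows of your matrix the generated code degenerates to plain element copies, which is exactly the pruning you were after.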
I've seen solutions using Reflection.Emit, and I've seen solutions which involve TPL. The real answer here is, for most situations, that you want to use an existing unmanaged library such as Intel MKL via P/Invoke. Alternatively, if you are using the GPU, you can go with the GPGPU approach which would go a lot faster.
And yes, SSE together with multi-core processing is the fastest way to do it on a CPU. But I wouldn't recommend writing your own algorithm - instead, go look for something that's already out there. Most likely, it will end up being a C++ library, possibly with a C# wrapper.
While it won't speed up the math itself, you could at least use all your cores with Parallel.For in .NET 4.0. Microsoft link
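Applied to the inlined loop from the question, that could look roughly like this (a sketch; it assumes mod, vandermonde and data are only read, and each task writes its own output row, so the rows can be processed in parallel):
// requires: using System.Threading.Tasks;
int o = 0;
for (int d = 0; d < data.Length - numGood; d += numGood)
{
    int blockStart = d, column = o;
    // Each output row depends only on its own matrix row, so rows parallelize cleanly.
    Parallel.For(0, numGood + numRedundant, r =>
    {
        byte value = 0;
        for (int c = 0; c < numGood; c++)
            value = mod.Add(value, mod.Multiply(vandermonde.get(r, c), data[blockStart + c]));
        output[r][column] = value;
    });
    o++;
}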
From the math perspective
You could look at eigenspaces, eigenvectors, and eigenvalues. I'm not sure what your application does or whether it would help.
You could look at LU decomposition.
All of the above topics can be found on Wikipedia.
From a programming perspective
You could try SIMD, but those instruction sets are largely geared toward 4x4 matrices for homogeneous transformations of 3D space, mostly for computer graphics.
You could write special algorithms for your most common dimensions.
Using SSE in C#: is it possible?
Interpolation of a bitmap:
I have a 16x16 bitmap and I want to increase its size to 160x160. Which interpolation type is best suited for this?
Bicubic interpolation (cubic spline) blurs areas and thereby softens edges.
Nearest-neighbour interpolation preserves edges, but introduces pixelation.
So the best interpolation for this bitmap would be a hybrid of the two above, such as the pixel-art scalers 2xSaI or Eagle.
EDIT: Bicubic interpolation JAVA example code.
Good luck.
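For what it's worth, with System.Drawing the interpolation choice is just a property on Graphics, so both extremes are easy to compare (a sketch):
using System.Drawing;
using System.Drawing.Drawing2D;

// Scale 'source' (e.g. 16x16) up to width x height (e.g. 160x160).
static Bitmap Scale(Bitmap source, int width, int height, InterpolationMode mode)
{
    var result = new Bitmap(width, height);
    using (var g = Graphics.FromImage(result))
    {
        g.InterpolationMode = mode;                // e.g. NearestNeighbor or HighQualityBicubic
        g.PixelOffsetMode = PixelOffsetMode.Half;  // avoids the half-pixel shift at the edges
        g.DrawImage(source, 0, 0, width, height);
    }
    return result;
}

// var crisp  = Scale(bmp, 160, 160, InterpolationMode.NearestNeighbor);
// var smooth = Scale(bmp, 160, 160, InterpolationMode.HighQualityBicubic);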
Um, this is a horrendously bad idea no matter which way you look at it. You're increasing the size of the bitmap by an order of magnitude, but you're not adding any new data, so regardless of whether you use a polynomial, linear, or simple copy mechanism, the result is going to show either extreme pixelation, extreme blurring, or some mixture of the two.
In general, these algorithms work much better for reducing the size of bitmap images while maintaining overall image integrity; blowing an image up by an order of magnitude is going to be ugly no matter what you do. The bottom line is that the original information ends up dwarfed by the information the interpolator 'made up'.