LockBits Performance Critical Code - C#

I have a method which needs to be as fast as it possibly can be. It uses unsafe memory pointers, and since it's my first foray into this type of coding I know it can probably be faster.
/// <summary>
/// Copies bitmap data from one bitmap to another at a specified point on the output bitmap.
/// </summary>
/// <param name="sourcebtmpdata">The source bitmap must be smaller than the destination bitmap.</param>
/// <param name="destbtmpdata"></param>
/// <param name="point">The point on the destination bitmap to draw at.</param>
private static unsafe void CopyBitmapToDest(BitmapData sourcebtmpdata, BitmapData destbtmpdata, Point point)
{
    //TODO: It is expected that the bitmap PixelFormat is Format24bppRgb but this could change in the future
    // calculate total number of rows to draw.
    var totalRow = Math.Min(
        destbtmpdata.Height - point.Y,
        sourcebtmpdata.Height);

    // loop through each row on the source bitmap and get memory pointers
    // to the source bitmap and dest bitmap
    for (int i = 0; i < totalRow; i++)
    {
        int destRow = point.Y + i;

        // get the pointer to the start of the current pixel "row" on the output image
        byte* destRowPtr = (byte*)destbtmpdata.Scan0 + (destRow * destbtmpdata.Stride);
        // get the pointer to the start of the i-th pixel row on the source image
        byte* srcRowPtr = (byte*)sourcebtmpdata.Scan0 + (i * sourcebtmpdata.Stride);

        int pointX = point.X;
        // the rowSize is pre-computed before the inner loop to improve performance
        int rowSize = Math.Min(destbtmpdata.Width - pointX, sourcebtmpdata.Width);

        // for each row, set each pixel (3 bytes per pixel for Format24bppRgb)
        for (int j = 0; j < rowSize; j++)
        {
            int firstBlueByte = (pointX + j) * 3;
            int srcByte = j * 3;
            destRowPtr[firstBlueByte] = srcRowPtr[srcByte];
            destRowPtr[firstBlueByte + 1] = srcRowPtr[srcByte + 1];
            destRowPtr[firstBlueByte + 2] = srcRowPtr[srcByte + 2];
        }
    }
}
So, is there anything that can be done to make this faster? Ignore the TODO for now; I'll fix that later once I have some baseline performance measurements.
UPDATE: Sorry, I should have mentioned that the reason I'm using this instead of Graphics.DrawImage is that I'm implementing multi-threading, and because of that I can't use DrawImage.
UPDATE 2: I'm still not satisfied with the performance and I'm sure there are a few more ms that can be had.

There was something fundamentally wrong with the code that I can't believe I didn't notice until now.
byte* destRowPtr = (byte*)destbtmpdata.Scan0 + (destRow * destbtmpdata.Stride);
This gets a pointer to the destination row, but not to the column it is copying to; in the old code the column offset was computed inside the rowSize loop. It now looks like:
byte* destRowPtr = (byte*)destbtmpdata.Scan0 + (destRow * destbtmpdata.Stride) + pointX * 3;
So now we have the correct pointer for the destination data. Now we can get rid of that for loop. Using suggestions from Vilx- and Rob the code now looks like:
private static unsafe void CopyBitmapToDestSuperFast(BitmapData sourcebtmpdata, BitmapData destbtmpdata, Point point)
{
    // calculate total number of rows to copy.
    // using ternary operator instead of Math.Min, a few ms faster
    int totalRows = (destbtmpdata.Height - point.Y < sourcebtmpdata.Height) ? destbtmpdata.Height - point.Y : sourcebtmpdata.Height;

    // calculate the width of the image to draw; this cuts off the image
    // if it goes past the width of the destination image
    int rowWidth = (destbtmpdata.Width - point.X < sourcebtmpdata.Width) ? destbtmpdata.Width - point.X : sourcebtmpdata.Width;

    // loop through each row on the source bitmap and get memory pointers
    // to the source bitmap and dest bitmap
    for (int i = 0; i < totalRows; i++)
    {
        int destRow = point.Y + i;

        // get the pointer to the start of the current pixel "row" and column on the output image
        byte* destRowPtr = (byte*)destbtmpdata.Scan0 + (destRow * destbtmpdata.Stride) + point.X * 3;
        // get the pointer to the start of the i-th pixel row on the source image
        byte* srcRowPtr = (byte*)sourcebtmpdata.Scan0 + (i * sourcebtmpdata.Stride);

        // RtlMoveMemory function
        CopyMemory(new IntPtr(destRowPtr), new IntPtr(srcRowPtr), (uint)rowWidth * 3);
    }
}
Copying a 500x500 image to a 5000x5000 image in a grid 50 times took: 00:00:07.9948993 secs. Now with the changes above it takes 00:00:01.8714263 secs. Much better.
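For completeness, the CopyMemory function called above isn't declared anywhere in the snippet; one common way to write the P/Invoke binding (kernel32 exports this as RtlMoveMemory) looks like this:

using System;
using System.Runtime.InteropServices;

// Binds CopyMemory to the native RtlMoveMemory export in kernel32.dll.
[DllImport("kernel32.dll", EntryPoint = "RtlMoveMemory")]
private static extern void CopyMemory(IntPtr destination, IntPtr source, uint length);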

Well... I'm not sure whether .NET bitmap data formats are entirely compatible with Windows's GDI32 functions...
But one of the first few Win32 API I learned was BitBlt:
BOOL BitBlt(
    HDC hdcDest,
    int nXDest,
    int nYDest,
    int nWidth,
    int nHeight,
    HDC hdcSrc,
    int nXSrc,
    int nYSrc,
    DWORD dwRop
);
And it was the fastest way to copy data around, if I remember correctly.
Here's the BitBlt P/Invoke signature for use in C# and related usage information, a great read for anyone working with high-performance graphics in C#:
http://www.pinvoke.net/default.aspx/gdi32/BitBlt.html
Definitely worth a look.
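For reference, a minimal sketch of that P/Invoke declaration (following the pinvoke.net entry; SRCCOPY is the raster-operation code for a plain copy):

using System;
using System.Runtime.InteropServices;

static class Gdi32
{
    // Raster-operation code for a straight source-to-destination copy.
    public const uint SRCCOPY = 0x00CC0020;

    [DllImport("gdi32.dll", SetLastError = true)]
    public static extern bool BitBlt(IntPtr hdcDest, int nXDest, int nYDest, int nWidth, int nHeight,
                                     IntPtr hdcSrc, int nXSrc, int nYSrc, uint dwRop);
}

The HDC arguments can come from Graphics.GetHdc(); remember to call ReleaseHdc() when you are done with them.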

The inner loop is where you want to concentrate a lot of your time (but do measurements to make sure):
for (int j = 0; j < sourcebtmpdata.Width; j++)
{
    destRowPtr[(point.X + j) * 3] = srcRowPtr[j * 3];
    destRowPtr[((point.X + j) * 3) + 1] = srcRowPtr[(j * 3) + 1];
    destRowPtr[((point.X + j) * 3) + 2] = srcRowPtr[(j * 3) + 2];
}
Get rid of the multiplies and the array indexing (which is a multiply under the hood) and replace them with pointers that you increment.
Ditto with the +1, +2: increment a pointer instead.
Your compiler probably won't keep recomputing point.X (check), but make a local variable just in case. It won't matter for a single iteration, but it might when repeated every iteration.
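For illustration, a minimal sketch of that rewrite (assuming 24bpp data and that destRowPtr has already been offset to the destination column, as in the updated code above):

byte* src = srcRowPtr;
byte* dst = destRowPtr;
for (int j = 0; j < rowSize; j++)
{
    *dst++ = *src++; // blue
    *dst++ = *src++; // green
    *dst++ = *src++; // red
}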

You may want to look at Eigen.
It is a C++ template library that uses the SSE (2 and later) and AltiVec instruction sets, with graceful fallback to non-vectorized code.
It is fast (see its benchmark).
Expression templates allow temporaries to be removed intelligently and enable lazy evaluation where appropriate; Eigen takes care of this automatically and handles aliasing too in most cases. These optimizations are applied globally, across whole expressions.
With fixed-size objects, dynamic memory allocation is avoided, and the loops are unrolled when that makes sense.
For large matrices, special attention is paid to cache-friendliness.
You could implement your function in C++ and then call it from C#.
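A minimal sketch of what calling into such a native helper could look like from C#; the library name (fastcopy.dll) and the exported CopyRect function are hypothetical placeholders, not a real API:

using System;
using System.Runtime.InteropServices;

// Hypothetical binding: fastcopy.dll / CopyRect are placeholder names for a
// C++ helper compiled with vectorization enabled; they are not a real library.
[DllImport("fastcopy.dll", CallingConvention = CallingConvention.Cdecl)]
static extern void CopyRect(IntPtr dest, int destStride, IntPtr src, int srcStride,
                            int rowBytes, int rowCount);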

You don't always need to use pointers to get good speed. This should be within a couple of ms of the original:
private static void CopyBitmapToDest(BitmapData sourcebtmpdata, BitmapData destbtmpdata, Point point)
{
    byte[] src = new byte[sourcebtmpdata.Height * sourcebtmpdata.Width * 3];
    int maximum = src.Length;
    byte[] dest = new byte[maximum];
    Marshal.Copy(sourcebtmpdata.Scan0, src, 0, src.Length);
    int pointX = point.X * 3;
    int copyLength = destbtmpdata.Width * 3 - pointX;
    int k = pointX + point.Y * sourcebtmpdata.Stride;
    int rowWidth = sourcebtmpdata.Stride;
    while (k < maximum)
    {
        Array.Copy(src, k, dest, k, copyLength);
        k += rowWidth;
    }
    Marshal.Copy(dest, 0, destbtmpdata.Scan0, dest.Length);
}

Unfortunately I don't have the time to write a full solution, but I would look into using the platform RtlMoveMemory() function to move rows as a whole, not byte-by-byte. That should be a lot faster.

I think the stride size and row-number limits can be calculated in advance, and I precalculated all the multiplications, resulting in the following code:
private static unsafe void CopyBitmapToDest(BitmapData sourcebtmpdata, BitmapData destbtmpdata, Point point)
{
    //TODO: It is expected that the bitmap PixelFormat is Format24bppRgb but this could change in the future
    const int pixelSize = 3;

    // calculate total number of rows to draw.
    var totalRow = Math.Min(
        destbtmpdata.Height - point.Y,
        sourcebtmpdata.Height);
    var rowSize = Math.Min(
        (destbtmpdata.Width - point.X) * pixelSize,
        sourcebtmpdata.Width * pixelSize);

    // starting point of the copy operation
    // (note the column offset is in bytes, hence point.X * pixelSize)
    byte* srcPtr = (byte*)sourcebtmpdata.Scan0;
    byte* destPtr = (byte*)destbtmpdata.Scan0 + point.Y * destbtmpdata.Stride + point.X * pixelSize;

    // loop through each row
    for (int i = 0; i < totalRow; i++)
    {
        // copy the entire row
        for (int j = 0; j < rowSize; j++)
            destPtr[j] = srcPtr[j];

        // advance each pointer by one row
        destPtr += destbtmpdata.Stride;
        srcPtr += sourcebtmpdata.Stride;
    }
}
Haven't tested it thoroughly, but you should be able to get that to work.
I have removed the multiplication operations from the loop (pre-calculated instead) and removed most branching, so it should be somewhat faster.
Let me know if this helps :-)

I am looking at your C# code and I can't recognize anything familiar; it all looks like a ton of C++. BTW, it looks like DirectX/XNA needs to become your new friend. Just my 2 cents. Don't kill the messenger.
If you must rely on the CPU to do this: I've done some 24-bit layout optimizations myself, and I can tell you that memory access speed should be your bottleneck. Use SSE3 instructions for the fastest possible byte-wise access. This means C++ and embedded assembly language. In pure C you'll be 30% slower on most machines.
Keep in mind that modern GPUs are MUCH faster than the CPU at this sort of operation.

I am not sure if this will give extra performance, but I see the pattern a lot in Reflector.
So:
int srcByte = j * 3;
destRowPtr[firstBlueByte] = srcRowPtr[srcByte];
destRowPtr[firstBlueByte + 1] = srcRowPtr[srcByte + 1];
destRowPtr[firstBlueByte + 2] = srcRowPtr[srcByte + 2];
Becomes:
*destRowPtr++ = *srcRowPtr++;
*destRowPtr++ = *srcRowPtr++;
*destRowPtr++ = *srcRowPtr++;
Probably needs more braces.
If the width is fixed, you could probably unroll the entire line into a few hundred lines. :)
Update
You could also try using a bigger type, e.g. Int32 or Int64, for better performance.
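For illustration, a minimal sketch of the wider-type idea (assuming rowBytes holds the row length in bytes and that unaligned 8-byte accesses are acceptable, as they are on x86/x64):

// Copy 8 bytes at a time, then finish the remainder byte by byte.
long* src8 = (long*)srcRowPtr;
long* dst8 = (long*)destRowPtr;
int chunks = rowBytes / 8;
for (int k = 0; k < chunks; k++)
    *dst8++ = *src8++;

byte* src1 = (byte*)src8;
byte* dst1 = (byte*)dst8;
for (int k = chunks * 8; k < rowBytes; k++)
    *dst1++ = *src1++;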

Alright, this is going to be fairly close to the line of how many ms you can get out of the algorithm, but get rid of the call to Math.Min and replace it with a ternary operator instead.
Generally, making a library call will take longer than doing something on your own, and I made a simple test driver to confirm this for Math.Min.
using System;
using System.Diagnostics;

namespace TestDriver
{
    class Program
    {
        static void Main(string[] args)
        {
            // Report whether the high-resolution timer is available
            if (Stopwatch.IsHighResolution)
            { Console.WriteLine("Using high resolution timer"); }
            else
            { Console.WriteLine("High resolution timer unavailable"); }

            // Test Math.Min for 10000 iterations
            Stopwatch sw = Stopwatch.StartNew();
            for (int ndx = 0; ndx < 10000; ndx++)
            {
                int result = Math.Min(ndx, 5000);
            }
            Console.WriteLine(sw.Elapsed.TotalMilliseconds.ToString("0.0000"));

            // Test ternary operator for 10000 iterations
            sw = Stopwatch.StartNew();
            for (int ndx = 0; ndx < 10000; ndx++)
            {
                int result = (ndx < 5000) ? ndx : 5000;
            }
            Console.WriteLine(sw.Elapsed.TotalMilliseconds.ToString("0.0000"));
            Console.ReadKey();
        }
    }
}
The results when running the above on my computer, an Intel T2400 @ 1.83GHz. Also, note that there is a bit of variation in the results, but generally the ternary operator is faster by about 0.01 ms. That's not much, but over a big enough dataset it will add up.
Using high resolution timer
0.0539
0.0402

Related

c# managedCuda 2d array to GPU

I'm new to CUDA and trying to figure out how to pass a 2D array to the kernel.
I have the following working code for a one-dimensional array:
class Program
{
    static void Main(string[] args)
    {
        int N = 10;
        int deviceID = 0;
        CudaContext ctx = new CudaContext(deviceID);
        CudaKernel kernel = ctx.LoadKernel(@"doubleIt.ptx", "DoubleIt");
        kernel.GridDimensions = (N + 255) / 256;
        kernel.BlockDimensions = Math.Min(N, 256);

        // Allocate input vector h_A in host memory
        float[] h_A = new float[N];
        // Initialize input vector h_A
        for (int i = 0; i < N; i++)
        {
            h_A[i] = i;
        }

        // Allocate vectors in device memory and copy vectors from host memory to device memory
        CudaDeviceVariable<float> d_A = h_A;
        CudaDeviceVariable<float> d_C = new CudaDeviceVariable<float>(N);

        // Invoke kernel
        kernel.Run(d_A.DevicePointer, d_C.DevicePointer, N);

        // Copy result from device memory to host memory
        float[] h_C = d_C;
        // h_C contains the result in host memory
    }
}
with the following kernel code:
__global__ void DoubleIt(const float* A, float* C, int N)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < N)
        C[i] = A[i] * 2;
}
As I said, everything works fine, but I want to work with a 2D array as follows:
// Allocate input vectors h_A in host memory
int W = 10;
float[][] h_A = new float[N][];
// Initialize input vectors h_A
for (int i = 0; i < N; i++)
{
    h_A[i] = new float[W];
    for (int j = 0; j < W; j++)
    {
        h_A[i][j] = i * W + j;
    }
}
I need all of the 2nd dimension to be on the same thread, so kernel.BlockDimensions must stay one-dimensional and each kernel thread needs to get a 1D array with 10 elements.
So my bottom-line question is: how should I copy this 2D array to the device, and how do I use it in the kernel? (Per the example, there should be a total of 10 threads.)
Short answer: you shouldn't do it...
Long answer: Jagged arrays are difficult to handle in general. Instead of one continuous segment of memory for your data, you have many small ones scattered around your memory. What happens if you copy the data to the GPU? If you had one large continuous segment, you would call the cudaMemcpy/CopyToDevice functions and copy the entire block at once. But just as you allocate jagged arrays in a for loop, you'd have to copy your data line by line into a CudaDeviceVariable<CUdeviceptr>, where each entry points to a CudaDeviceVariable<float>. In parallel you maintain a host array CudaDeviceVariable<float>[] that manages your CUdeviceptrs on the host side. Copying data is in general already quite slow; doing it this way is probably a real performance killer...
To conclude: if you can, use flattened arrays and index the entries with y * DimX + x. Even better, on the GPU side use pitched memory, where the allocation is done so that each line starts on a "good" address: the index then turns into y * Pitch + x (simplified). The 2D copy methods in CUDA are made for these pitched memory allocations, where each line gets some additional bytes added.
For completeness: in C# you also have 2-dimensional arrays like float[,]. You can also use these on the host side instead of flattened 1D arrays. But I wouldn't recommend doing so, as the ISO standard of .NET does not guarantee that the internal memory is actually continuous, an assumption that managedCuda must rely on in order to use these arrays. The current .NET framework doesn't have any internal weirdness, but who knows if it will stay like this...
This would realize the jagged array copy:
float[][] data_h;
CudaDeviceVariable<CUdeviceptr> data_d;
CUdeviceptr[] ptrsToData_h; // represents data_d on host side
CudaDeviceVariable<float>[] arrayOfarray_d; // array of CudaDeviceVariables to manage memory, source for pointers in ptrsToData_h

int sizeX = 512;
int sizeY = 256;

data_h = new float[sizeX][];
arrayOfarray_d = new CudaDeviceVariable<float>[sizeX];
data_d = new CudaDeviceVariable<CUdeviceptr>(sizeX);
ptrsToData_h = new CUdeviceptr[sizeX];
for (int x = 0; x < sizeX; x++)
{
    data_h[x] = new float[sizeY];
    arrayOfarray_d[x] = new CudaDeviceVariable<float>(sizeY);
    ptrsToData_h[x] = arrayOfarray_d[x].DevicePointer;
    //ToDo: init data on host...
}
// Copy the pointers once:
data_d.CopyToDevice(ptrsToData_h);
// Copy data:
for (int x = 0; x < sizeX; x++)
{
    arrayOfarray_d[x].CopyToDevice(data_h[x]);
}
// Call a kernel:
kernel.Run(data_d.DevicePointer /*, other parameters*/);

// kernel in *.cu file:
// __global__ void kernel(float** data_d, ...)
This is a sample for CudaPitchedDeviceVariable:
int dimX = 512;
int dimY = 512;
float[] array_host = new float[dimX * dimY];
CudaPitchedDeviceVariable<float> arrayPitched_d = new CudaPitchedDeviceVariable<float>(dimX, dimY);
for (int y = 0; y < dimY; y++)
{
    for (int x = 0; x < dimX; x++)
    {
        array_host[y * dimX + x] = x * y;
    }
}

arrayPitched_d.CopyToDevice(array_host);
kernel.Run(arrayPitched_d.DevicePointer, arrayPitched_d.Pitch, dimX, dimY);

// Corresponding kernel:
extern "C"
__global__ void kernel(float* data, size_t pitch, int dimX, int dimY)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= dimX || y >= dimY)
        return;

    // Pointer arithmetic: add y*pitch to a char* pointer, as pitch is given in bytes,
    // which gives the start of line y. Convert to float* and add x to get the
    // value at entry x of line y:
    float value = *(((float*)((char*)data + y * pitch)) + x);
    *(((float*)((char*)data + y * pitch)) + x) = value + 1;

    // Or simpler, if you don't like pointers:
    float* line = (float*)((char*)data + y * pitch);
    float value2 = line[x];
}

Accessing processed values from FFT

I am attempting to use Lomont FFT to return complex numbers and build a spectrogram / spectral density chart in C#.
I am having trouble understanding how to return values from the class.
Here is the code I have put together thus far, which appears to be working.
int read = 0;
Double[] data;
byte[] buffer = new byte[1024];

FileStream wave = new FileStream(args[0], FileMode.Open, FileAccess.Read);
read = wave.Read(buffer, 0, 44);   // skip the 44-byte WAV header
read = wave.Read(buffer, 0, 1024);

data = new Double[read];
for (int i = 0; i < read; i += 2)
{
    data[i] = BitConverter.ToInt16(buffer, i) / 32768.0;
    Console.WriteLine(data[i]);
}

LomontFFT LFFT = new LomontFFT();
LFFT.FFT(data, true);
What I am not clear on is how to return/access the values from the Lomont FFT implementation back in my application (console)?
Being pretty new to C# development, I'm thinking I am perhaps missing a fundamental aspect of how to retrieve processed values from the instance of the Lomont class, or perhaps I am even calling it incorrectly.
Console.WriteLine(LFFT.A); // Returns 0
Console.WriteLine(LFFT.B); // Returns 1
I have been searching for a code snippet or explanation of how to do this, but so far have come up with nothing that I understand or that explains this particular aspect of the issue I am facing. Any guidance would be greatly appreciated.
A subset of the results held in the data array noted in the code above can be found below and, based on my current understanding, they appear to be valid:
0.00531005859375
0.0238037109375
0.041473388671875
0.0576171875
0.07183837890625
0.083465576171875
0.092193603515625
0.097625732421875
0.099639892578125
0.098114013671875
0.0931396484375
0.0848388671875
0.07354736328125
0.05963134765625
0.043609619140625
0.026031494140625
0.007476806640625
-0.011260986328125
-0.0296630859375
-0.047027587890625
-0.062713623046875
-0.076141357421875
-0.086883544921875
-0.09454345703125
-0.098785400390625
-0.0994873046875
-0.0966796875
-0.090362548828125
-0.080810546875
-0.06842041015625
-0.05352783203125
-0.036712646484375
-0.0185546875
What am I actually attempting to do? (perspective)
I am looking to load a wave file into a console application and return a spectrogram/spectral density chart/image as a jpg/png for further processing.
The wave files I am reading are mono.
UPDATE 1
I receive slightly different results depending on which FFT is used.
Using RealFFT
for (int i = 0; i < read; i += 2)
{
    data[i] = BitConverter.ToInt16(buffer, i) / 32768.0;
    //Console.WriteLine(data[i]);
}

LomontFFT LFFT = new LomontFFT();
LFFT.RealFFT(data, true);

for (int i = 0; i < buffer.Length / 2; i++)
{
    System.Console.WriteLine("{0}",
        Math.Sqrt(data[2 * i] * data[2 * i] + data[2 * i + 1] * data[2 * i + 1]));
}
Partial Result of RealFFT
0.314566983321381
0.625242818210924
0.30314888696868
0.118468857708093
0.0587697011760449
0.0369034115568654
0.0265842582236275
0.0207195964060356
0.0169601273233317
0.0143745438577886
0.012528799609089
0.0111831275153128
0.0102313284519146
0.00960198279358434
0.00920236001619566
Using FFT
for (int i = 0; i < read; i += 2)
{
    data[i] = BitConverter.ToInt16(buffer, i) / 32768.0;
    //Console.WriteLine(data[i]);
}

double[] bufferB = new double[2 * data.Length];
for (int i = 0; i < data.Length; i++)
{
    bufferB[2 * i] = data[i];
    bufferB[2 * i + 1] = 0;
}

LomontFFT LFFT = new LomontFFT();
LFFT.FFT(bufferB, true);

for (int i = 0; i < bufferB.Length / 2; i++)
{
    System.Console.WriteLine("{0}",
        Math.Sqrt(bufferB[2 * i] * bufferB[2 * i] + bufferB[2 * i + 1] * bufferB[2 * i + 1]));
}
Partial Result of FFT:
0.31456698332138
0.625242818210923
0.303148886968679
0.118468857708092
0.0587697011760447
0.0369034115568653
0.0265842582236274
0.0207195964060355
0.0169601273233317
0.0143745438577886
0.012528799609089
0.0111831275153127
0.0102313284519146
0.00960198279358439
0.00920236001619564
Looking at the LomontFFT.FFT documentation:
Compute the forward or inverse Fourier Transform of data, with
data containing complex valued data as alternating real and
imaginary parts. The length must be a power of 2. The data is
modified in place.
This tells us a few things. First, the function expects complex-valued data, whereas your data is real. A quick fix for this is to create another buffer of twice the size and set all the imaginary parts to 0:
double[] buffer = new double[2 * data.Length];
for (int i = 0; i < data.Length; i++)
{
    buffer[2 * i] = data[i];
    buffer[2 * i + 1] = 0;
}
The documentation also tells us that the computation is done in place. That means that after the call to FFT returns, the input array is replaced with the computed result. You could thus print the spectrum with:
LomontFFT LFFT = new LomontFFT();
LFFT.FFT(buffer, true);

for (int i = 0; i < buffer.Length / 2; i++)
{
    System.Console.WriteLine("{0}",
        Math.Sqrt(buffer[2 * i] * buffer[2 * i] + buffer[2 * i + 1] * buffer[2 * i + 1]));
}
Note since your input data is real valued you could also use LomontFFT.RealFFT. In that case, given a slightly different packing rule, you would obtain the FFT results using:
LomontFFT LFFT = new LomontFFT();
LFFT.RealFFT(data, true);

System.Console.WriteLine("{0}", Math.Abs(data[0]));
for (int i = 1; i < data.Length / 2; i++)
{
    System.Console.WriteLine("{0}",
        Math.Sqrt(data[2 * i] * data[2 * i] + data[2 * i + 1] * data[2 * i + 1]));
}
System.Console.WriteLine("{0}", Math.Abs(data[1]));
This would give you the non-redundant lower half of the spectrum (unlike LomontFFT.FFT, which provides the entire spectrum). Also, numerical differences on the order of double precision (around 1e-16 times the spectrum's peak value) with respect to LomontFFT.FFT can be expected.

Segmented Aggregation within an Array

I have a large array of primitive value types. The array is in fact one-dimensional, but logically represents a 2-dimensional field. As you read from left to right, the values need to become (the original value of the current cell) + (the result calculated in the cell to the left), with the obvious exception of the first element of each row, which is just the original value.
I already have an implementation which accomplishes this, but it iterates over the entire array and is extremely slow for large (1M+ element) arrays.
Given the following example array,
0 0 1 0 0
2 0 0 0 3
0 4 1 1 0
0 1 0 4 1
Becomes
0 0 1 1 1
2 2 2 2 5
0 4 5 6 6
0 1 1 5 6
And so forth to the right, up to problematic sizes (1024x1024)
The array needs to be updated in place (ideally), but another array can be used if necessary. Memory footprint isn't much of an issue here, but performance is critical, as these arrays have millions of elements and must be processed hundreds of times per second.
The individual cell calculations do not appear to be parallelizable given their dependence on values starting from the left, so GPU acceleration seems impossible. I have investigated PLINQ, but its requirement for indices makes it very difficult to apply.
Is there another way to structure the data to make it faster to process?
If efficient GPU processing is feasible using an innovative technique, this would be vastly preferable, as this is currently texture data which has to be pulled from and pushed back to the video card.
Proper coding and a bit of insight into how .NET works helps as well :-)
Some rules of thumb that apply in this case:
If you can hint the JIT that the indexing will never get out of bounds of the array, it will remove the extra branch.
You should only spread the work across multiple threads if it's really slow (e.g. >1 second). Otherwise task switching, cache flushes, etc. will probably just eat up the added speed and you'll end up worse off.
If possible, make memory access predictable, even sequential. If you need another array, so be it; if not, prefer to avoid it.
Use as few IL instructions as possible if you want speed. Generally this seems to work.
Test multiple iterations. A single iteration might not be good enough.
Using these rules, you can make a small test case as follows. Note that I've upped the stakes to 4Kx4K, since 1Kx1K is just so fast you cannot measure it :-)
public static void Main(string[] args)
{
    int width = 4096;
    int height = 4096;
    int[] ar = new int[width * height];

    Random rnd = new Random(213);
    for (int i = 0; i < ar.Length; ++i)
    {
        ar[i] = rnd.Next(0, 120);
    }

    // (5) test multiple iterations
    for (int j = 0; j < 10; ++j)
    {
        Stopwatch sw = Stopwatch.StartNew();
        int sum = 0;
        for (int i = 0; i < ar.Length; ++i) // (3) sequential access
        {
            if ((i % width) == 0)
            {
                sum = 0;
            }
            // (1) --> the JIT will notice this won't go out of bounds because [0 <= i < ar.Length]
            // (4) --> '+=' is an expression generating a 'dup'; this creates less IL.
            ar[i] = (sum += ar[i]);
        }
        Console.WriteLine("This took {0:0.0000}s", sw.Elapsed.TotalSeconds);
    }
    Console.ReadLine();
}
One of these iterations will take roughly 0.0174 s here, and since this is about 16x the worst-case scenario you describe, I suppose your performance problem is solved.
If you really want to parallelize it to make it faster, I suppose that is possible, even though you will lose some of the optimizations in the JIT (specifically: (1)). However, if you have a multi-core system like most people, the benefits might outweigh this:
for (int j = 0; j < 10; ++j)
{
    Stopwatch sw = Stopwatch.StartNew();
    Parallel.For(0, height, (a) =>
    {
        // the first element of each row keeps its value and seeds the running sum
        int sum = ar[width * a];
        for (var i = width * a + 1; i < width * (a + 1); i++)
        {
            ar[i] = (sum += ar[i]);
        }
    });
    Console.WriteLine("This took {0:0.0000}s", sw.Elapsed.TotalSeconds);
}
If you really, really need performance, you can compile it to C++ and use P/Invoke. Even if you don't use the GPU, I suppose the SSE/AVX instructions might already give you a significant performance boost that you won't get with .NET/C#. Also, I'd like to point out that the Intel C++ compiler can automatically vectorize your code, even for Xeon Phis. Without a lot of effort, this might give you a nice boost in performance.
Well, I don't know too much about GPUs, but I see no reason why you can't parallelize it, as the dependencies are only from left to right.
There are no dependencies between rows:
0 0 1 0 0 - process on core1 |
2 0 0 0 3 - process on core1 |
-------------------------------
0 4 1 1 0 - process on core2 |
0 1 0 4 1 - process on core2 |
Although the above statement is not completely true: there are still hidden dependencies between rows when it comes to the memory cache.
It's possible that there's going to be cache thrashing. You can read about "false sharing" to understand the problem and see how to overcome it.
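For illustration, a minimal sketch of one way to sidestep false sharing by padding each row's stride to a whole number of cache lines (the 64-byte line size is an assumption, typical for x86/x64 CPUs):

// Pad the row stride to a multiple of 64 bytes so threads writing to
// adjacent rows never touch the same cache line.
const int cacheLineBytes = 64;
int intsPerLine = cacheLineBytes / sizeof(int);                    // 16 ints per cache line
int paddedWidth = (width + intsPerLine - 1) / intsPerLine * intsPerLine;
int[] padded = new int[paddedWidth * height];                      // row r starts at r * paddedWidth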
As @Chris Eelmaa said, it is possible to do a parallel execution by row. Using Parallel.For, it could be rewritten like this:
static int[,] values = new int[,]{
    {0, 0, 1, 0, 0},
    {2, 0, 0, 0, 3},
    {0, 4, 1, 1, 0},
    {0, 1, 0, 4, 1}};

static void Main(string[] args)
{
    int rows = values.GetLength(0);
    int columns = values.GetLength(1);
    Parallel.For(0, rows, (row) =>
    {
        for (var column = 1; column < columns; column++)
        {
            values[row, column] += values[row, column - 1];
        }
    });

    for (var row = 0; row < rows; row++)
    {
        for (var column = 0; column < columns; column++)
        {
            Console.Write("{0} ", values[row, column]);
        }
        Console.WriteLine();
    }
}
Since, as stated in your question, you actually have a one-dimensional array, the code would be a bit faster:
static void Main(string[] args)
{
    var values = new int[1024 * 1024];
    Random r = new Random();
    for (int i = 0; i < 1024; i++)
    {
        for (int j = 0; j < 1024; j++)
        {
            values[i * 1024 + j] = r.Next(25);
        }
    }

    int rows = 1024;
    int columns = 1024;
    Stopwatch sw = Stopwatch.StartNew();
    for (int i = 0; i < 100; i++)
    {
        Parallel.For(0, rows, (row) =>
        {
            for (var column = 1; column < columns; column++)
            {
                values[(row * columns) + column] += values[(row * columns) + column - 1];
            }
        });
    }
    Console.WriteLine(sw.Elapsed);
}
But not as fast as a GPU. To use parallel GPU processing you would have to rewrite it in C++ AMP, or take a look at how to port this parallel for to cudafy: http://w8isms.blogspot.com.es/2012/09/cudafy-me-part-3-of-4.html
You may as well store the array as a jagged array; each row will have the same memory layout as the corresponding slice of the 1D array. So, instead of
int[] texture;
you have
int[][] texture;
Isolate the row operation as,
private static Task ProcessRow(int[] row)
{
    // run the scan on the thread pool so that rows can execute concurrently
    return Task.Run(() =>
    {
        var v = row[0];
        for (var i = 1; i < row.Length; i++)
        {
            v = row[i] += v;
        }
    });
}
then you can write a function that does,
Task.WhenAll(texture.Select(ProcessRow)).Wait();
If you want to remain with a 1-dimensional array, a similar approach will work, just change ProcessRow.
private static Task ProcessRow(int[] texture, int start, int limit)
{
    // run the scan on the thread pool so that rows can execute concurrently
    return Task.Run(() =>
    {
        var v = texture[start];
        for (var i = start + 1; i < limit; i++)
        {
            v = texture[i] += v;
        }
    });
}
then once,
var rowSize = 1024;
var rows =
    Enumerable.Range(0, texture.Length / rowSize)
        .Select(i => Tuple.Create(i * rowSize, (i * rowSize) + rowSize))
        .ToArray();
then on each cycle:
Task.WhenAll(rows.Select(t => ProcessRow(texture, t.Item1, t.Item2))).Wait();
Either way, each row is processed in parallel.

New image overlays previous bitmap

There are a number of posts about this, but I still can't figure it out. I am rather new at this, so please be forgiving.
I display an image, then grab a new image and try to display it. When the new image is displayed, it has remnants of the old image. I have tried Picture1.Image = null to no avail.
Is it an issue with managed memory? I suspect it has to do with how the memory is being managed: somehow the code copies a new image over an old image in a way that leaves some data from the previous image.
Here is the code to display the data in scaled1 (from a helpful earlier post):
Edit:
Code added showing the processing of the arrays that are plotted. The overlaying behavior stops if the arrays are cleared using the Array.Clear method. Perhaps when this is cleared up I can post a canonical snippet demonstrating the issue.
This resets the question as: why do arrays need to be cleared when every value of the array is rewritten? How can an array retain information about previous values?
ushort[] frame = null;
byte[] scaled1 = null;
double[][] frameringSin;
double[][] frameringCos;
double[] sumsin;
double[] sumcos;

frame = new ushort[mImageWidth * mImageHeight];
scaled1 = new byte[mImageWidth * mImageHeight];
frameringSin = new double[RingSize][];
frameringCos = new double[RingSize][];
ringsin = new double[RingSize];
ringcos = new double[RingSize];

// Fill array with images
for (int ring = 0; ring < nN; ++ring)
{
    mCamera.GrabFrameReduced(framering[ring], reduced, out preset);
}

// Process images
for (int i = 0; i < nN; ++i)
{
    Array.Clear(frameringSin[i], 0, frameringSin.Length);
    Array.Clear(frameringCos[i], 0, frameringSin.Length);
}
Array.Clear(sumsin, 0, sumsin.Length);
Array.Clear(sumcos, 0, sumcos.Length);

for (int r = 0; r < nN; ++r)
{
    for (int i = 0; i < frame.Length; ++i) // up to 12 ms
    {
        frameringSin[r][i] = framering[r][i] * ringsin[r] / nN;
        frameringCos[r][i] = framering[r][i] * ringcos[r] / nN;
    }
}

for (int i = 0; i < sumsin.Length; ++i) // up to 25 ms
{
    for (int r = 0; r < nN; ++r)
    {
        sumsin[i] += frameringSin[r][i];
        sumcos[i] += frameringCos[r][i];
    }
}

for (int i = 0; i < sumsin.Length; ++i)
{
    A[i] = Math.Sqrt(sumsin[i] * sumsin[i] + sumcos[i] * sumcos[i]);
}

// extract scaling parameters
...

// Scale image
for (i1 = 0; i1 < frame.Length; ++i1)
    scaled1[i1] = (byte)((Math.Min(Math.Max(min1, frameA[i1]), max1) - min1) * scale1);

bmp1 = new Bitmap(mImageWidth, mImageHeight, System.Drawing.Imaging.PixelFormat.Format8bppIndexed);
var bdata1 = bmp1.LockBits(new Rectangle(new Point(0, 0), bmp1.Size), ImageLockMode.WriteOnly, bmp1.PixelFormat);
try
{
    Marshal.Copy(scaled1, 0, bdata1.Scan0, scaled1.Length);
}
finally
{
    bmp1.UnlockBits(bdata1);
}
Picture1.Image = bmp1;
Picture1.Refresh();
Actually, you're not replacing all values in the arrays - your for cycles are wrong. You want them to look like this:
for (i1 = 0; i1 < frame.Length; i1++)
    scaled1[i1] = (byte)((Math.Min(Math.Max(min1, frameA[i1]), max1) - min1) * scale1);
The difference (i++ vs ++i) is that your way, you're skipping the first and the last index. C# arrays start at 0, while you start at 1 (you increment the loop variable before you run the body for the first time).
Also, note that for performance reasons, it's very handy if you're going through the array like this:
for (var i = 0; i < array.Length; i++)
/* do work with array[i] */
The JIT compiler recognizes this and avoids bounds checks, because it knows there can never be an overflow. When you're doing a lot of work with arrays, this can give you a massive performance boost, even if you access multiple arrays through the same index (one of them will not have the checks, the others will - still saves a lot of work).
The default JIT isn't very smart about this (it has to be quite fast), so pretty much anything else will reintroduce the bounds check. If performance is a concern for you, you'd want to profile the code anyway, of course.
EDIT: Ah, my bad. Anyway, I believe your problem isn't having to clear the frameringXXX arrays, but rather, the sumsin and sumcos arrays - you're always adding to those, so you'd be adding to the value that was already there, rather than starting from zero again. So you need to reset those arrays to zeroes first (which is what Array.Clear does).

Connected-component labeling algorithm optimization

I need some help with optimisation of my CCL (connected-component labeling) algorithm implementation. I use it to detect black areas on an image. On a 2000x2000 image it takes 11 seconds, which is far too long. I need to reduce the running time as much as possible. Also, I would be glad to know if there is any other algorithm out there that does the same thing but faster. So here is my code:
// The method returns a dictionary, where the key is the label
// and the list contains all the pixels with that label
public Dictionary<short, LinkedList<Point>> ProcessCCL()
{
    Color backgroundColor = this.image.Palette.Entries[1];

    // Matrix to store pixels' labels
    short[,] labels = new short[this.image.Width, this.image.Height];

    // I particularly don't like how I store the label equality table,
    // but I don't know how else I can store it.
    // I use LinkedList to add and remove items faster.
    Dictionary<short, LinkedList<short>> equalityTable = new Dictionary<short, LinkedList<short>>();

    // Current label
    short currentKey = 1;
    for (int x = 1; x < this.bitmap.Width; x++)
    {
        for (int y = 1; y < this.bitmap.Height; y++)
        {
            if (!GetPixelColor(x, y).Equals(backgroundColor))
            {
                // Minimum of the neighbours' labels
                short label = Math.Min(labels[x - 1, y], labels[x, y - 1]);
                // If there are no neighbours
                if (label == 0)
                {
                    // Create a new unique label
                    labels[x, y] = currentKey;
                    equalityTable.Add(currentKey, new LinkedList<short>());
                    equalityTable[currentKey].AddFirst(currentKey);
                    currentKey++;
                }
                else
                {
                    labels[x, y] = label;
                    short west = labels[x - 1, y], north = labels[x, y - 1];
                    // A little trick:
                    // because of those "ifs" the lowest label value
                    // will always be the first in the list,
                    // but I'm afraid that because of them
                    // the running time also increases
                    if (!equalityTable[label].Contains(west))
                        if (west < equalityTable[label].First.Value)
                            equalityTable[label].AddFirst(west);
                    if (!equalityTable[label].Contains(north))
                        if (north < equalityTable[label].First.Value)
                            equalityTable[label].AddFirst(north);
                }
            }
        }
    }

    // This dictionary will be returned as the result.
    // I'm not proud of using a dictionary here either; I guess there
    // is a better way to store the result.
    Dictionary<short, LinkedList<Point>> result = new Dictionary<short, LinkedList<Point>>();

    // I define the variable outside the loops in order
    // to reuse the memory address
    short cellValue;
    for (int x = 0; x < this.bitmap.Width; x++)
    {
        for (int y = 0; y < this.bitmap.Height; y++)
        {
            cellValue = labels[x, y];
            // If the pixel is not background
            if (cellValue != 0)
            {
                // Take the minimum value from the label equality table
                short value = equalityTable[cellValue].First.Value;
                // I'd like to get rid of these lines
                if (!result.ContainsKey(value))
                    result.Add(value, new LinkedList<Point>());
                result[value].AddLast(new Point(x, y));
            }
        }
    }
    return result;
}
Thanks in advance!
You could split your picture into multiple sub-pictures, process them in parallel, and then merge the results.
Pass 1: 4 tasks, each processing a 1000x1000 sub-picture
Pass 2: 2 tasks, each processing 2 of the sub-pictures from pass 1
Pass 3: 1 task, processing the result of pass 2
For C# I recommend the Task Parallel Library (TPL), which lets you easily define tasks that depend on and wait for each other. The following CodeProject article gives you a basic introduction to the TPL: The Basics of Task Parallelism via C#.
I would process one scan line at a time, keeping track of the beginning and end of each run of black pixels.
Then, on each scan line, I would compare it to the runs on the previous line. If there is a run on the current line that does not overlap a run on the previous line, it represents a new blob. If there is a run on the previous line that overlaps a run on the current line, it gets the same blob label as the previous one. And so on; you get the idea. A sketch of the run extraction follows below.
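For illustration, a minimal sketch of the per-scan-line run extraction (the isBlack predicate and the byte-per-pixel row representation are assumptions; the cross-line overlap test that assigns blob labels is omitted):

using System;
using System.Collections.Generic;

// Collects (start, end) index pairs for each run of "black" pixels in one row.
static List<Tuple<int, int>> FindRuns(byte[] row, Func<byte, bool> isBlack)
{
    var runs = new List<Tuple<int, int>>();
    int start = -1;
    for (int x = 0; x < row.Length; x++)
    {
        if (isBlack(row[x]))
        {
            if (start < 0) start = x;                              // a run begins
        }
        else if (start >= 0)
        {
            runs.Add(Tuple.Create(start, x - 1));                  // a run ends
            start = -1;
        }
    }
    if (start >= 0) runs.Add(Tuple.Create(start, row.Length - 1)); // run reaches end of line
    return runs;
}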
I would try not to use dictionaries and such.
In my experience, randomly halting the program shows that those things may make programming incrementally easier, but they can exact a serious performance cost due to all the new-ing.
The problem is GetPixelColor(x, y); it takes a very long time to access image data that way.
The Set/GetPixel functions are terribly slow in C#, so if you need to use them a lot, you should use Bitmap.LockBits instead.
private void ProcessUsingLockbits(Bitmap ProcessedBitmap)
{
    BitmapData bitmapData = ProcessedBitmap.LockBits(new Rectangle(0, 0, ProcessedBitmap.Width, ProcessedBitmap.Height), ImageLockMode.ReadWrite, ProcessedBitmap.PixelFormat);
    int BytesPerPixel = System.Drawing.Bitmap.GetPixelFormatSize(ProcessedBitmap.PixelFormat) / 8;
    int ByteCount = bitmapData.Stride * ProcessedBitmap.Height;
    byte[] Pixels = new byte[ByteCount];
    IntPtr PtrFirstPixel = bitmapData.Scan0;
    Marshal.Copy(PtrFirstPixel, Pixels, 0, Pixels.Length);
    int HeightInPixels = bitmapData.Height;
    int WidthInBytes = bitmapData.Width * BytesPerPixel;

    for (int y = 0; y < HeightInPixels; y++)
    {
        int CurrentLine = y * bitmapData.Stride;
        for (int x = 0; x < WidthInBytes; x = x + BytesPerPixel)
        {
            int OldBlue = Pixels[CurrentLine + x];
            int OldGreen = Pixels[CurrentLine + x + 1];
            int OldRed = Pixels[CurrentLine + x + 2];

            // Transform blue and clip to 255:
            Pixels[CurrentLine + x] = (byte)((OldBlue + BlueMagnitudeToAdd > 255) ? 255 : OldBlue + BlueMagnitudeToAdd);
            // Transform green and clip to 255:
            Pixels[CurrentLine + x + 1] = (byte)((OldGreen + GreenMagnitudeToAdd > 255) ? 255 : OldGreen + GreenMagnitudeToAdd);
            // Transform red and clip to 255:
            Pixels[CurrentLine + x + 2] = (byte)((OldRed + RedMagnitudeToAdd > 255) ? 255 : OldRed + RedMagnitudeToAdd);
        }
    }

    // Copy modified bytes back:
    Marshal.Copy(Pixels, 0, PtrFirstPixel, Pixels.Length);
    ProcessedBitmap.UnlockBits(bitmapData);
}
Here is the basic code to access pixel data.
And I made a function to transform this into a 2D matrix; it's easier to manipulate (but a little slower):
private void bitmap_to_matrix()
{
    unsafe
    {
        bitmapData = ProcessedBitmap.LockBits(new Rectangle(0, 0, ProcessedBitmap.Width, ProcessedBitmap.Height), ImageLockMode.ReadWrite, ProcessedBitmap.PixelFormat);
        int BytesPerPixel = System.Drawing.Bitmap.GetPixelFormatSize(ProcessedBitmap.PixelFormat) / 8;
        int HeightInPixels = ProcessedBitmap.Height;
        int WidthInPixels = ProcessedBitmap.Width;
        int WidthInBytes = ProcessedBitmap.Width * BytesPerPixel;
        byte* PtrFirstPixel = (byte*)bitmapData.Scan0;

        Parallel.For(0, HeightInPixels, y =>
        {
            byte* CurrentLine = PtrFirstPixel + (y * bitmapData.Stride);
            for (int x = 0; x < WidthInBytes; x = x + BytesPerPixel)
            {
                // Conversion to grey level
                double rst = CurrentLine[x] * 0.0721 + CurrentLine[x + 1] * 0.7154 + CurrentLine[x + 2] * 0.2125;
                // Fill the grey matrix
                TG[x / 3, y] = (int)rst;
            }
        });
    }
}
And here is the website the code comes from:
"High performance SystemDrawingBitmap"
Thanks to the author for his really good work!
Hope this helps!
