Connected-component labeling algorithm optimization - c#

I need some help optimising my CCL algorithm implementation. I use it to detect black areas in an image. On a 2000x2000 image it takes 11 seconds, which is far too long. I need to reduce the running time as much as possible. I would also be glad to know whether there is another algorithm that does the same thing faster. Here is my code:
//The method returns a dictionary, where the key is the label
//and the list contains all the pixels with that label
public Dictionary<short, LinkedList<Point>> ProcessCCL()
{
Color backgroundColor = this.image.Palette.Entries[1];
//Matrix to store pixels' labels
short[,] labels = new short[this.image.Width, this.image.Height];
//I particularly don't like how I store the label equality table
//But I don't know how else I can store it
//I use LinkedList to add and remove items faster
Dictionary<short, LinkedList<short>> equalityTable = new Dictionary<short, LinkedList<short>>();
//Current label
short currentKey = 1;
for (int x = 1; x < this.bitmap.Width; x++)
{
for (int y = 1; y < this.bitmap.Height; y++)
{
if (!GetPixelColor(x, y).Equals(backgroundColor))
{
//Minimum label of the neighbours' labels
short label = Math.Min(labels[x - 1, y], labels[x, y - 1]);
//If there are no neighbours
if (label == 0)
{
//Create a new unique label
labels[x, y] = currentKey;
equalityTable.Add(currentKey, new LinkedList<short>());
equalityTable[currentKey].AddFirst(currentKey);
currentKey++;
}
else
{
labels[x, y] = label;
short west = labels[x - 1, y], north = labels[x, y - 1];
//A little trick:
//Because of those "ifs" the lowest label value
//will always be the first in the list
//but I'm afraid that because of them
//the running time also increases
if (!equalityTable[label].Contains(west))
if (west < equalityTable[label].First.Value)
equalityTable[label].AddFirst(west);
if (!equalityTable[label].Contains(north))
if (north < equalityTable[label].First.Value)
equalityTable[label].AddFirst(north);
}
}
}
}
//This dictionary will be returned as the result
//I'm not proud of using dictionary here too, I guess there
//is a better way to store the result
Dictionary<short, LinkedList<Point>> result = new Dictionary<short, LinkedList<Point>>();
//I define the variable outside the loops in order
//to reuse the memory address
short cellValue;
for (int x = 0; x < this.bitmap.Width; x++)
{
for (int y = 0; y < this.bitmap.Height; y++)
{
cellValue = labels[x, y];
//If the pixel is not a background
if (cellValue != 0)
{
//Take the minimum value from the label equality table
short value = equalityTable[cellValue].First.Value;
//I'd like to get rid of these lines
if (!result.ContainsKey(value))
result.Add(value, new LinkedList<Point>());
result[value].AddLast(new Point(x, y));
}
}
}
return result;
}
Thanks in advance!

You could split your picture into multiple sub-pictures, process them in parallel and then merge the results.
Pass 1: 4 tasks, each processing a 1000x1000 sub-picture
Pass 2: 2 tasks, each processing 2 of the sub-pictures from pass 1
Pass 3: 1 task, processing the result of pass 2
For C# I recommend the Task Parallel Library (TPL), which makes it easy to define tasks that depend on and wait for each other. The following Code Project article gives a basic introduction to the TPL: The Basics of Task Parallelism via C#.
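The three passes above can be sketched with TPL tasks. This is only an illustration of the task structure: ProcessTile and Merge are hypothetical stand-ins (they count and add foreground pixels), and a real implementation would have to label each quadrant and then reconcile labels along the shared borders when merging.

```csharp
using System;
using System.Threading.Tasks;

public static class ParallelTiles
{
    // Hypothetical stand-in for "label one sub-picture":
    // counts the foreground pixels inside a rectangle.
    static int ProcessTile(bool[,] image, int y0, int x0, int h, int w)
    {
        int count = 0;
        for (int y = y0; y < y0 + h; y++)
            for (int x = x0; x < x0 + w; x++)
                if (image[y, x]) count++;
        return count;
    }

    // Hypothetical stand-in for "merge two partial results".
    static int Merge(int a, int b) { return a + b; }

    public static int Process(bool[,] image)
    {
        int h = image.GetLength(0), w = image.GetLength(1);
        int hh = h / 2, hw = w / 2;

        // Pass 1: four independent tasks, one per quadrant.
        Task<int> topLeft = Task.Run(() => ProcessTile(image, 0, 0, hh, hw));
        Task<int> topRight = Task.Run(() => ProcessTile(image, 0, hw, hh, w - hw));
        Task<int> bottomLeft = Task.Run(() => ProcessTile(image, hh, 0, h - hh, hw));
        Task<int> bottomRight = Task.Run(() => ProcessTile(image, hh, hw, h - hh, w - hw));

        // Pass 2: two tasks, each waiting for and merging two results of pass 1.
        Task<int> top = Task.WhenAll(topLeft, topRight)
            .ContinueWith(t => Merge(t.Result[0], t.Result[1]));
        Task<int> bottom = Task.WhenAll(bottomLeft, bottomRight)
            .ContinueWith(t => Merge(t.Result[0], t.Result[1]));

        // Pass 3: one final merge of the two halves.
        return Merge(top.Result, bottom.Result);
    }
}
```

Whether this pays off depends on how expensive the border merge is compared with the labeling itself, so it is worth measuring before committing to it.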

I would process one scan line at a time, keeping track of the beginning and end of each run of black pixels.
Then, on each scan line, I would compare its runs to the runs on the previous line. If a run on the current line does not overlap any run on the previous line, it represents a new blob. If it overlaps a run on the previous line, it gets the same blob label as that run. etc. etc. You get the idea.
I would try not to use dictionaries and such.
In my experience, randomly halting the program shows that those things may make programming incrementally easier, but they can exact a serious performance cost due to new-ing.
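A minimal sketch of the run-based idea, assuming the image has already been converted to a bool[,] mask indexed as [y, x] and that blobs are 4-connected. The names (RunLabeling, Run, Label) are made up for illustration, and a small union-find list is swapped in for the dictionary-based equality table from the question.

```csharp
using System;
using System.Collections.Generic;

public static class RunLabeling
{
    // One horizontal run of foreground pixels on a single scan line.
    public struct Run
    {
        public int Y, StartX, EndX; // EndX is inclusive
        public int Label;           // index into the union-find parent list
    }

    // Union-find lookup with path halving.
    static int Find(List<int> parent, int i)
    {
        while (parent[i] != i)
        {
            parent[i] = parent[parent[i]];
            i = parent[i];
        }
        return i;
    }

    // Returns every run; runs with the same Label belong to the same blob.
    public static List<Run> Label(bool[,] foreground)
    {
        int height = foreground.GetLength(0), width = foreground.GetLength(1);
        var parent = new List<int>();
        var runs = new List<Run>();
        int prevStart = 0, prevEnd = 0; // runs[prevStart..prevEnd) lie on the previous line

        for (int y = 0; y < height; y++)
        {
            int lineStart = runs.Count;
            int x = 0;
            while (x < width)
            {
                if (!foreground[y, x]) { x++; continue; }
                int start = x;
                while (x < width && foreground[y, x]) x++;
                var run = new Run { Y = y, StartX = start, EndX = x - 1, Label = -1 };

                // Compare with the runs on the previous line that overlap horizontally.
                for (int p = prevStart; p < prevEnd; p++)
                {
                    Run above = runs[p];
                    if (above.EndX < run.StartX || above.StartX > run.EndX) continue;
                    int root = Find(parent, above.Label);
                    if (run.Label < 0) { run.Label = root; continue; }
                    // The run touches two blobs: merge them, keeping the smaller label.
                    int current = Find(parent, run.Label);
                    int lo = Math.Min(root, current), hi = Math.Max(root, current);
                    parent[hi] = lo;
                    run.Label = lo;
                }
                // No overlap with the previous line: start a new blob.
                if (run.Label < 0) { run.Label = parent.Count; parent.Add(run.Label); }
                runs.Add(run);
            }
            prevStart = lineStart;
            prevEnd = runs.Count;
        }

        // Final pass: replace provisional labels by their resolved root.
        for (int i = 0; i < runs.Count; i++)
        {
            Run r = runs[i];
            r.Label = Find(parent, r.Label);
            runs[i] = r;
        }
        return runs;
    }
}
```

The inner loop scans all runs of the previous line for each new run, which is simple but not optimal; since both lists are sorted by x, a two-pointer walk would avoid the rescans. The final labels are not consecutive, only equal within a blob.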

The problem is GetPixelColor(x, y): accessing image data through it takes a very long time.
The Set/GetPixel functions are terribly slow in C#, so if you need to call them a lot, you should use Bitmap.LockBits instead.
private void ProcessUsingLockbits(Bitmap ProcessedBitmap)
{
BitmapData bitmapData = ProcessedBitmap.LockBits(new Rectangle(0, 0, ProcessedBitmap.Width, ProcessedBitmap.Height), ImageLockMode.ReadWrite, ProcessedBitmap.PixelFormat);
int BytesPerPixel = System.Drawing.Bitmap.GetPixelFormatSize(ProcessedBitmap.PixelFormat) / 8;
int ByteCount = bitmapData.Stride * ProcessedBitmap.Height;
byte[] Pixels = new byte[ByteCount];
IntPtr PtrFirstPixel = bitmapData.Scan0;
Marshal.Copy(PtrFirstPixel, Pixels, 0, Pixels.Length);
int HeightInPixels = bitmapData.Height;
int WidthInBytes = bitmapData.Width * BytesPerPixel;
for (int y = 0; y < HeightInPixels; y++)
{
int CurrentLine = y * bitmapData.Stride;
for (int x = 0; x < WidthInBytes; x = x + BytesPerPixel)
{
int OldBlue = Pixels[CurrentLine + x];
int OldGreen = Pixels[CurrentLine + x + 1];
int OldRed = Pixels[CurrentLine + x + 2];
// Transform blue and clip to 255:
Pixels[CurrentLine + x] = (byte)((OldBlue + BlueMagnitudeToAdd > 255) ? 255 : OldBlue + BlueMagnitudeToAdd);
// Transform green and clip to 255:
Pixels[CurrentLine + x + 1] = (byte)((OldGreen + GreenMagnitudeToAdd > 255) ? 255 : OldGreen + GreenMagnitudeToAdd);
// Transform red and clip to 255:
Pixels[CurrentLine + x + 2] = (byte)((OldRed + RedMagnitudeToAdd > 255) ? 255 : OldRed + RedMagnitudeToAdd);
}
}
// Copy modified bytes back:
Marshal.Copy(Pixels, 0, PtrFirstPixel, Pixels.Length);
ProcessedBitmap.UnlockBits(bitmapData);
}
That is the basic code for accessing pixel data.
I also wrote a function that converts the image into a 2D matrix, which is easier to manipulate (but a little slower):
private void bitmap_to_matrix()
{
unsafe
{
bitmapData = ProcessedBitmap.LockBits(new Rectangle(0, 0, ProcessedBitmap.Width, ProcessedBitmap.Height), ImageLockMode.ReadWrite, ProcessedBitmap.PixelFormat);
int BytesPerPixel = System.Drawing.Bitmap.GetPixelFormatSize(ProcessedBitmap.PixelFormat) / 8;
int HeightInPixels = ProcessedBitmap.Height;
int WidthInPixels = ProcessedBitmap.Width;
int WidthInBytes = ProcessedBitmap.Width * BytesPerPixel;
byte* PtrFirstPixel = (byte*)bitmapData.Scan0;
Parallel.For(0, HeightInPixels, y =>
{
byte* CurrentLine = PtrFirstPixel + (y * bitmapData.Stride);
for (int x = 0; x < WidthInBytes; x = x + BytesPerPixel)
{
// Conversion to grey level (bytes are in B, G, R order)
double rst = CurrentLine[x] * 0.0721 + CurrentLine[x + 1] * 0.7154 + CurrentLine[x + 2] * 0.2125;
// Fill the grey matrix; divide by BytesPerPixel to get the pixel column
TG[x / BytesPerPixel, y] = (int)rst;
}
});
// Release the lock once all rows have been read
ProcessedBitmap.UnlockBits(bitmapData);
}
}
The code comes from the website
"High performance SystemDrawingBitmap"
Thanks to the author for the really good job!
Hope this helps!

Related

c# managedCuda 2d array to GPU

I'm new to CUDA and trying to figure out how to pass a 2D array to the kernel.
I have the following working code for a one-dimensional array:
class Program
{
static void Main(string[] args)
{
int N = 10;
int deviceID = 0;
CudaContext ctx = new CudaContext(deviceID);
CudaKernel kernel = ctx.LoadKernel(@"doubleIt.ptx", "DoubleIt");
kernel.GridDimensions = (N + 255) / 256;
kernel.BlockDimensions = Math.Min(N,256);
// Allocate input vectors h_A in host memory
float[] h_A = new float[N];
// Initialize input vectors h_A
for (int i = 0; i < N; i++)
{
h_A[i] = i;
}
// Allocate vectors in device memory and copy vectors from host memory to device memory
CudaDeviceVariable<float> d_A = h_A;
CudaDeviceVariable<float> d_C = new CudaDeviceVariable<float>(N);
// Invoke kernel
kernel.Run(d_A.DevicePointer, d_C.DevicePointer, N);
// Copy result from device memory to host memory
float[] h_C = d_C;
// h_C contains the result in host memory
}
}
with the following kernel code:
__global__ void DoubleIt(const float* A, float* C, int N)
{
int i = blockDim.x * blockIdx.x + threadIdx.x;
if (i < N)
C[i] = A[i] * 2;
}
As I said, everything works fine, but I want to work with a 2D array as follows:
// Allocate input vectors h_A in host memory
int W = 10;
float[][] h_A = new float[N][];
// Initialize input vectors h_A
for (int i = 0; i < N; i++)
{
h_A[i] = new float[W];
for (int j = 0; j < W; j++)
{
h_A[i][j] = i*W+j;
}
}
I need all of the second dimension to be handled by the same thread, so kernel.BlockDimensions must stay one-dimensional and each kernel thread needs to get a 1D array with 10 elements.
So my bottom-line question is: how shall I copy this 2D array to the device, and how do I use it in the kernel? (In this example there should be a total of 10 threads.)
Short answer: you shouldn't do it...
Long answer: Jagged arrays are difficult to handle in general. Instead of one contiguous segment of memory for your data, you have plenty of small ones scattered somewhere in memory. What happens if you copy the data to the GPU? With one large contiguous segment you call the cudaMemcpy/CopyToDevice functions and copy the entire block at once. But just as you allocate jagged arrays in a for loop, you'd have to copy your data line by line into a CudaDeviceVariable<CUdeviceptr>, where each entry points to a CudaDeviceVariable<float>. In parallel you maintain a host array CudaDeviceVariable<float>[] that manages your CUdeviceptrs on the host side. Copying data is already quite slow in general; doing it this way is probably a real performance killer...
To conclude: if you can, use flattened arrays and index the entries with y * DimX + x. Even better on the GPU side, use pitched memory, where the allocation is done so that each line starts on a "good" (aligned) address; the index then becomes y * Pitch + x (simplified). The 2D copy methods in CUDA are made for these pitched memory allocations, where each line is padded with some additional bytes.
For completeness: C# also has two-dimensional arrays such as float[,]. You can use these on the host side instead of flattened 1D arrays. But I wouldn't recommend doing so, as the ISO standard for .NET does not guarantee that the internal memory is actually contiguous, an assumption that managedCuda must make in order to use these arrays. The current .NET Framework doesn't have any internal weirdness, but who knows if it will stay like this...
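As a small host-side illustration of the flattened layout, the following sketch copies a rectangular jagged array into one contiguous buffer, so that element (y, x) ends up at index y * dimX + x. The class and method names are made up for this example; it uses only plain .NET and no managedCuda calls.

```csharp
using System;

public static class Flatten
{
    // Copies a rectangular jagged array into one contiguous row-major buffer.
    // Every row must have exactly dimX elements.
    public static float[] ToFlat(float[][] jagged, int dimX)
    {
        int dimY = jagged.Length;
        var flat = new float[dimY * dimX];
        for (int y = 0; y < dimY; y++)
        {
            if (jagged[y].Length != dimX)
                throw new ArgumentException("Row " + y + " does not have " + dimX + " elements.");
            // Row y starts at offset y * dimX in the flat buffer.
            Array.Copy(jagged[y], 0, flat, y * dimX, dimX);
        }
        return flat;
    }
}
```

The resulting float[] can then be uploaded in one go, the same way h_A is in the question's one-dimensional code, and a kernel thread i would read its W values at A[i * W + j].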
This would realize the jagged array copy:
float[][] data_h;
CudaDeviceVariable<CUdeviceptr> data_d;
CUdeviceptr[] ptrsToData_h; //represents data_d on host side
CudaDeviceVariable<float>[] arrayOfarray_d; //Array of CudaDeviceVariables to manage memory, source for pointers in ptrsToData_h.
int sizeX = 512;
int sizeY = 256;
data_h = new float[sizeX][];
arrayOfarray_d = new CudaDeviceVariable<float>[sizeX];
data_d = new CudaDeviceVariable<CUdeviceptr>(sizeX);
ptrsToData_h = new CUdeviceptr[sizeX];
for (int x = 0; x < sizeX; x++)
{
data_h[x] = new float[sizeY];
arrayOfarray_d[x] = new CudaDeviceVariable<float>(sizeY);
ptrsToData_h[x] = arrayOfarray_d[x].DevicePointer;
//ToDo: init data on host...
}
//Copy the pointers once:
data_d.CopyToDevice(ptrsToData_h);
//Copy data:
for (int x = 0; x < sizeX; x++)
{
arrayOfarray_d[x].CopyToDevice(data_h[x]);
}
//Call a kernel:
kernel.Run(data_d.DevicePointer /*, other parameters*/);
//kernel in *.cu file:
//__global__ void kernel(float** data_d, ...)
This is a sample for CudaPitchedDeviceVariable:
int dimX = 512;
int dimY = 512;
float[] array_host = new float[dimX * dimY];
CudaPitchedDeviceVariable<float> arrayPitched_d = new CudaPitchedDeviceVariable<float>(dimX, dimY);
for (int y = 0; y < dimY; y++)
{
for (int x = 0; x < dimX; x++)
{
array_host[y * dimX + x] = x * y;
}
}
arrayPitched_d.CopyToDevice(array_host);
kernel.Run(arrayPitched_d.DevicePointer, arrayPitched_d.Pitch, dimX, dimY);
//Corresponding kernel:
extern "C"
__global__ void kernel(float* data, size_t pitch, int dimX, int dimY)
{
int x = blockIdx.x * blockDim.x + threadIdx.x;
int y = blockIdx.y * blockDim.y + threadIdx.y;
if (x >= dimX || y >= dimY)
return;
//pointer arithmetic: add y*pitch to char* pointer as pitch is given in bytes,
//which gives the start of line y. Convert to float* and add x, to get the
//value at entry x of line y:
float value = *(((float*)((char*)data + y * pitch)) + x);
*(((float*)((char*)data + y * pitch)) + x) = value + 1;
//Or simpler if you don't like pointers:
float* line = (float*)((char*)data + y * pitch);
float value2 = line[x];
}

A method designed to chop a tileset into an array fails when it's given more than one row?

This is the method in question:
Color[][] ChopUpTiles()
{
int numTilesPerRow = terrainTiles.width / tileResolution;
int numRows = terrainTiles.height / tileResolution;
Color[][] tiles = new Color[numTilesPerRow * numRows][];
for (int y = 0; y < numRows; y++)
{
for (int x = 0; x < numTilesPerRow; x++)
{
tiles[y * numTilesPerRow + x] = terrainTiles.GetPixels(x * tileResolution , y * tileResolution, tileResolution, tileResolution);
}
}
return tiles;
}
It's a pretty basic function, and it works as long as the tileset in question has only one row. If it has more than a single row, it freaks out. Suddenly, "tiles[1]" no longer returns tile 1. Instead, it returns... tile 15. I have no idea why it's acting this way, or where the math is wrong. Can someone spot what's going on?
Don't you mean tiles[y][numTilesPerRow + x] or tiles[y][x] or something along those lines? I don't know exactly what you are trying to do, but you are retrieving an entire row, not a single tile.
Also, I think Color[][] tiles = new Color[numTilesPerRow * numRows][]; should be Color[][] tiles = new Color[numRows][numTilesPerRow]; or am I wrong?
Basically, you have a multi-dimensional array, yet you are treating it as a single-dimensional array.
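One other possibility, offered purely as an assumption because the question never says what terrainTiles is: if it is a Unity Texture2D, the y coordinate passed to GetPixels is measured from the bottom edge of the texture, so y = 0 in the loop addresses the bottom row of tiles rather than the top one. The helper below (a hypothetical name) converts a tile position counted from the bottom into an index counted from the top.

```csharp
public static class TileIndex
{
    // Maps a tile position whose row is counted from the bottom of the image
    // to a row-major index whose rows are counted from the top.
    public static int TopDownIndex(int x, int yFromBottom, int numTilesPerRow, int numRows)
    {
        int yFromTop = numRows - 1 - yFromBottom;
        return yFromTop * numTilesPerRow + x;
    }
}
```

Under that assumption, storing each tile at tiles[TileIndex.TopDownIndex(x, y, numTilesPerRow, numRows)] instead of tiles[y * numTilesPerRow + x] would put the tiles in top-to-bottom order. With a hypothetical tileset of 2 rows of 14 tiles, position (1, 0) counted from the bottom maps to index 15, which would match the symptom described.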

Converting a simple JavaScript code

I've been trying to convert this JavaScript code, which gets the dominant color from an image, to C#, so far with no success. I get errors with the colorCount and color variables. I don't know the suitable equivalent data types to use for these variables. Here is my code:
public string dominantColor(Bitmap img)
{
int[] colorCount = new int[0];
int maxCount = 0;
string dominantColor = "";
// data is an array of a series of 4 one-byte values representing the rgba values of each pixel
Bitmap Bmp = new Bitmap(img);
BitmapData BmpData = Bmp.LockBits(new Rectangle(0, 0, Bmp.Width, Bmp.Height), ImageLockMode.ReadOnly, Bmp.PixelFormat);
byte[] data = new byte[BmpData.Stride * Bmp.Height];
for (int i = 0; i < data.Length; i += 4)
{
// ignore transparent pixels
if (data[i+3] == 0)
continue;
string color = data[i] + "." + data[i+1] + "," + data[i+2];
// ignore white
if (color == "255,255,255")
continue;
if (colorCount[color] != 0)
colorCount[color] = colorCount[color] + 1;
else
colorCount[color] = 0;
// keep track of the color that appears the most times
if (colorCount[color] > maxCount)
{
maxCount = colorCount[color];
dominantColor = color.ToString;
}
}
string rgb = dominantColor.Split(",");
return rgb;
}
I'll give you a complete managed version of your code:
static Color dominantColor(Bitmap img)
{
Hashtable colorCount = new Hashtable();
int maxCount = 0;
Color dominantColor = Color.White;
for (int i = 0; i < img.Width; i++)
{
for (int j = 0; j < img.Height; j++)
{
var color = img.GetPixel(i, j);
if (color.A == 0)
continue;
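// ignore white (compare ARGB values, because Color.Equals also compares the color name)
if (color.ToArgb() == Color.White.ToArgb())
continue;
if (colorCount[color] != null)
colorCount[color] = (int)colorCount[color] + 1;
else
colorCount.Add(color, 1); // the first occurrence counts as 1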
// keep track of the color that appears the most times
if ((int)colorCount[color] > maxCount)
{
maxCount = (int)colorCount[color];
dominantColor = color;
}
}
}
return dominantColor;
}
So what is the difference here?
- I use a Hashtable instead of your array (you never resize it, and the closest match for an extensible JavaScript object is a Hashtable)
- I prefer to use the built-in Color structure (which stores 4 bytes: Alpha, Red, Green, Blue)
- I also do the comparisons on this structure and return it (then you are free to do whatever you want with it; in JavaScript those strings are just a workaround because the browser only gives you such RGB(a) strings)
Another problem in your code is the line byte[] data = new byte[BmpData.Stride * Bmp.Height]; The array is created and initialized, but it contains no pixel data (.NET zero-initializes new arrays, so it is all zeros). Therefore you will not get anywhere.
The drawback of my version is that it is indeed very slow (this is where your LockBits comes into play). I can give you an unmanaged version (using LockBits and an unsafe block) if you want. It depends on whether performance matters a lot to you and whether you are interested!
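For comparison, here is a sketch of the counting step working directly on a byte array, which is what a LockBits-based version would operate on. It assumes the bytes were already copied out of the bitmap (for example with Marshal.Copy from BitmapData.Scan0), that the format is 32 bits per pixel with bytes in B, G, R, A order, and that rows have no padding. The class name and the packed-int return value are choices made for this example.

```csharp
using System;
using System.Collections.Generic;

public static class DominantColorSketch
{
    // pixels: tightly packed 32-bit pixels in B, G, R, A byte order.
    // Returns the most frequent color packed as 0xRRGGBB,
    // or -1 if every pixel was skipped.
    public static int Find(byte[] pixels)
    {
        var counts = new Dictionary<int, int>();
        int best = -1, bestCount = 0;
        for (int i = 0; i + 3 < pixels.Length; i += 4)
        {
            if (pixels[i + 3] == 0) continue;        // ignore transparent pixels
            int rgb = (pixels[i + 2] << 16) | (pixels[i + 1] << 8) | pixels[i];
            if (rgb == 0xFFFFFF) continue;           // ignore white
            int n;
            counts.TryGetValue(rgb, out n);          // n is 0 for a color not seen before
            counts[rgb] = ++n;
            // keep track of the color that appears the most times
            if (n > bestCount) { bestCount = n; best = rgb; }
        }
        return best;
    }
}
```

A Dictionary<int, int> keyed on the packed color avoids the boxing and casts that Hashtable requires, and the result can be turned back into a Color with Color.FromArgb if needed.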

A way to list to array of bytes pixels values of bricks

Sorry, I had no idea how to title this question so that it expresses what help I need.
I have an array of bytes holding the value of each pixel of a bitmap. It is a one-dimensional array, filled from left to right: each row is appended to the end of the array.
I would like to split the bitmap into 225 (= 15*15) pieces. Each brick is, for example, 34x34 pixels, so the length of the array is 260100 (= 225*34*34). So, as you can see, we need 15 bricks across the width and 15 down the height.
A few months ago I used two loops running from 0 to 14. I wrote my own long code to get all of those 34x34 bricks. However, I didn't use any array that stored all the values.
Now I have a one-dimensional array, because Marshal.Copy and BitmapData with LockBits were the fastest way to copy all the pixel values into an array.
But I am stuck on how to take 34 elements, then go one row lower and take another 34, knowing that at row 35 another brick begins with its own starting offset.
PS. Edit my post if something is not good.
Some people might say "first write some test code of your own". I tried that, but what I got was just garbage, and I really don't know how to do it.
The method below was used to crop the image into smaller images containing the bricks. But I don't want to store small images of bricks. I need the values stored in an array of bytes.
Below is the proof.
private void OCropImage(int ii, int jj, int p, int p2)
{
////We took letter and save value to binnary, then we search in dictionary by value
this.rect = new Rectangle();
this.newBitmap = new Bitmap(this.bitmap);
for (ii = 0; ii < p; ii++)
{
for (jj = 0; jj < p2; jj++)
{
////New bitmap
this.newBitmap = new Bitmap(this.bitmap);
////Set rectangle working area with letters
this.rect = new Rectangle(jj * this.miniszerokosc, ii * this.miniwysokosc, this.miniszerokosc, this.miniwysokosc);
////Cut single rectangle with letter
this.newBitmap = this.newBitmap.Clone(this.rect, this.newBitmap.PixelFormat);
////Add frame to rectangle to delet bad noise
this.OAddFrameToCropImage(this.newBitmap, this.rect.Width, this.rect.Height);
this.frm1.SetIm3 = (System.Drawing.Image)this.newBitmap;
////Create image with letter which constains less background
this.newBitmap = this.newBitmap.Clone(this.GetAreaLetter(this.newBitmap), this.newBitmap.PixelFormat);
////Count pixels in bitmap
this.workingArea = this.GetBinnary(this.newBitmap);
var keysWithMatchingValues = this.alphabetLetters.Where(x => x.Value == this.workingArea).Select(x => x.Key);
foreach (var key in keysWithMatchingValues)
{
this.chesswords += key.ToString();
}
}
this.chesswords += Environment.NewLine;
var ordered = this.alphabetLetters.OrderBy(x => x.Value);
}
}
PS2. Sorry for my English, please correct it if needed.
If I get you right, you have an image like this:
p00|p01|p02|...
---+---+-------
p10|p11|p12|...
---+---+-------
p20|p21|p22|...
---+---+---+---
...|...|...|...
Which is stored in an array in left-to-right row scan like this:
p00,p01,...,p0n, p10,p11,...,p1n, p20,p21, ...
If I understand you correctly, what you want to be able to do, is to take a given rectangle (from a certain x and y with a certain width and height) from the image. Here is code to do this, with explanations:
byte[] crop_area (byte[] source_image, int image_width, int image_height,
int start_x, int start_y, int result_width, int result_height)
{
byte[] result = new byte[result_width * result_height];
int end_x = start_x + result_width;
int end_y = start_y + result_height;
int pos = 0;
for (int y = start_y; y < end_y; y++)
for (int x = start_x; x < end_x; x++)
{
/* To get to the pixel in the row I (starting from I=1), we need
* to skip I-1 rows. Since our y indexes start from row 0 (not 1),
* then we don't need to subtract 1.
*
* So, the offset of the pixel at (x,y) is:
*
* y * image_width + x
* |-----------------------| |-----------------|
* Skip pixels of y rows Offset inside row
*/
result[pos] = source_image[y * image_width + x];
/* Advance to the next pixel in the result image */
pos++;
}
return result;
}
Then, to take the block in the row I and column J (I,J=0,...,14) do:
crop_area (source_image, image_width, image_height, J*image_width/15, I*image_height/15, image_width/15, image_height/15)
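To collect all blocks at once, the same offset arithmetic can be wrapped in two outer loops. The sketch below assumes one byte per pixel, a row-major array without padding, and image dimensions that are exact multiples of the number of bricks per side; the names are made up for this example. It copies a whole brick row at a time with Array.Copy instead of pixel by pixel.

```csharp
using System;

public static class Bricks
{
    // Splits a row-major, one-byte-per-pixel image into
    // bricksPerSide * bricksPerSide rectangular bricks.
    // result[i * bricksPerSide + j] is the brick in brick-row i, brick-column j.
    public static byte[][] Split(byte[] image, int imageWidth, int imageHeight, int bricksPerSide)
    {
        int brickWidth = imageWidth / bricksPerSide;
        int brickHeight = imageHeight / bricksPerSide;
        var result = new byte[bricksPerSide * bricksPerSide][];
        for (int i = 0; i < bricksPerSide; i++)
        {
            for (int j = 0; j < bricksPerSide; j++)
            {
                var brick = new byte[brickWidth * brickHeight];
                for (int row = 0; row < brickHeight; row++)
                {
                    // Source offset: skip (i * brickHeight + row) full image rows,
                    // then j bricks inside that row.
                    int source = (i * brickHeight + row) * imageWidth + j * brickWidth;
                    Array.Copy(image, source, brick, row * brickWidth, brickWidth);
                }
                result[i * bricksPerSide + j] = brick;
            }
        }
        return result;
    }
}
```

For the 510x510 image from the question, Bricks.Split(pixels, 510, 510, 15) would give 225 arrays of 34*34 = 1156 bytes each.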

LockBits Performance Critical Code

I have a method that needs to be as fast as it possibly can. It uses unsafe memory pointers, and it's my first foray into this type of coding, so I know it can probably be made faster.
/// <summary>
/// Copies bitmapdata from one bitmap to another at a specified point on the output bitmapdata
/// </summary>
/// <param name="sourcebtmpdata">The source bitmap must be smaller than the destination bitmap</param>
/// <param name="destbtmpdata"></param>
/// <param name="point">The point on the destination bitmap to draw at</param>
private static unsafe void CopyBitmapToDest(BitmapData sourcebtmpdata, BitmapData destbtmpdata, Point point)
{
// calculate total number of rows to draw.
var totalRow = Math.Min(
destbtmpdata.Height - point.Y,
sourcebtmpdata.Height);
//loop through each row on the source bitmap and get mem pointers
//to the source bitmap and dest bitmap
for (int i = 0; i < totalRow; i++)
{
int destRow = point.Y + i;
//get the pointer to the start of the current pixel "row" on the output image
byte* destRowPtr = (byte*)destbtmpdata.Scan0 + (destRow * destbtmpdata.Stride);
//get the pointer to the start of the FIRST pixel row on the source image
byte* srcRowPtr = (byte*)sourcebtmpdata.Scan0 + (i * sourcebtmpdata.Stride);
int pointX = point.X;
//the rowSize is pre-computed before the loop to improve performance
int rowSize = Math.Min(destbtmpdata.Width - pointX, sourcebtmpdata.Width);
//for each row each set each pixel
for (int j = 0; j < rowSize; j++)
{
int firstBlueByte = ((pointX + j)*3);
int srcByte = j *3;
destRowPtr[(firstBlueByte)] = srcRowPtr[srcByte];
destRowPtr[(firstBlueByte) + 1] = srcRowPtr[srcByte + 1];
destRowPtr[(firstBlueByte) + 2] = srcRowPtr[srcByte + 2];
}
}
}
So is there anything that can be done to make this faster? Ignore the TODO for now; I'll fix that later once I have some baseline performance measurements.
UPDATE: Sorry, I should have mentioned that the reason I'm using this instead of Graphics.DrawImage is that I'm implementing multi-threading, and because of that I can't use DrawImage.
UPDATE 2: I'm still not satisfied with the performance and I'm sure there are a few more ms to be had.
There was something fundamentally wrong with the code that I can't believe I didn't notice until now.
byte* destRowPtr = (byte*)destbtmpdata.Scan0 + (destRow * destbtmpdata.Stride);
This gets a pointer to the destination row, but it does not account for the column being copied to; in the old code that is done inside the rowSize loop. It now looks like:
byte* destRowPtr = (byte*)destbtmpdata.Scan0 + (destRow * destbtmpdata.Stride) + pointX * 3;
So now we have the correct pointer for the destination data. Now we can get rid of that for loop. Using suggestions from Vilx- and Rob the code now looks like:
private static unsafe void CopyBitmapToDestSuperFast(BitmapData sourcebtmpdata, BitmapData destbtmpdata, Point point)
{
//calculate total number of rows to copy.
//using ternary operator instead of Math.Min, few ms faster
int totalRows = (destbtmpdata.Height - point.Y < sourcebtmpdata.Height) ? destbtmpdata.Height - point.Y : sourcebtmpdata.Height;
//calculate the width of the image to draw, this cuts off the image
//if it goes past the width of the destination image
int rowWidth = (destbtmpdata.Width - point.X < sourcebtmpdata.Width) ? destbtmpdata.Width - point.X : sourcebtmpdata.Width;
//loop through each row on the source bitmap and get mem pointers
//to the source bitmap and dest bitmap
for (int i = 0; i < totalRows; i++)
{
int destRow = point.Y + i;
//get the pointer to the start of the current pixel "row" and column on the output image
byte* destRowPtr = (byte*)destbtmpdata.Scan0 + (destRow * destbtmpdata.Stride) + point.X * 3;
//get the pointer to the start of the FIRST pixel row on the source image
byte* srcRowPtr = (byte*)sourcebtmpdata.Scan0 + (i * sourcebtmpdata.Stride);
//RtlMoveMemory function
CopyMemory(new IntPtr(destRowPtr), new IntPtr(srcRowPtr), (uint)rowWidth * 3);
}
}
Copying a 500x500 image to a 5000x5000 image in a grid 50 times took: 00:00:07.9948993 secs. Now with the changes above it takes 00:00:01.8714263 secs. Much better.
Well... I'm not sure whether .NET bitmap data formats are entirely compatible with Windows's GDI32 functions...
But one of the first few Win32 APIs I learned was BitBlt:
BOOL BitBlt(
HDC hdcDest,
int nXDest,
int nYDest,
int nWidth,
int nHeight,
HDC hdcSrc,
int nXSrc,
int nYSrc,
DWORD dwRop
);
And it was the fastest way to copy data around, if I remember correctly.
Here's the BitBlt PInvoke signature for use in C# and related usage information, a great read for anyone working with high-performance graphics in C#:
http://www.pinvoke.net/default.aspx/gdi32/BitBlt.html
Definitely worth a look.
The inner loop is where you want to concentrate a lot of your time (but, do measurements to make sure)
for (int j = 0; j < sourcebtmpdata.Width; j++)
{
destRowPtr[(point.X + j) * 3] = srcRowPtr[j * 3];
destRowPtr[((point.X + j) * 3) + 1] = srcRowPtr[(j * 3) + 1];
destRowPtr[((point.X + j) * 3) + 2] = srcRowPtr[(j * 3) + 2];
}
Get rid of the multiplies and the array indexing (which is a multiply under the hood) and replace them with a pointer that you increment.
Ditto with the +1 and +2: increment a pointer.
Your compiler probably won't keep recomputing point.X (check), but copy it into a local variable just in case. Computing it once is harmless, but it might be recomputed on each iteration.
You may want to look at Eigen.
It is a C++ template library that uses SSE (2 and later) and AltiVec instruction sets with graceful fallback to non-vectorized code.
Fast. (See benchmark).
Expression templates allow to intelligently remove temporaries and enable lazy evaluation, when that is appropriate -- Eigen takes care of this automatically and handles aliasing too in most cases.
Explicit vectorization is performed for the SSE (2 and later) and AltiVec instruction sets, with graceful fallback to non-vectorized code. Expression templates allow to perform these optimizations globally for whole expressions.
With fixed-size objects, dynamic memory allocation is avoided, and the loops are unrolled when that makes sense.
For large matrices, special attention is paid to cache-friendliness.
You could implement your function in C++ and then call that from C#.
You don't always need to use pointers to get good speed. This should be within a couple ms of the original:
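private static void CopyBitmapToDest(BitmapData sourcebtmpdata, BitmapData destbtmpdata, Point point)
{
// Assumes Format24bppRgb (3 bytes per pixel) and positive strides
byte[] src = new byte[sourcebtmpdata.Stride * sourcebtmpdata.Height];
Marshal.Copy(sourcebtmpdata.Scan0, src, 0, src.Length);
// number of rows and bytes per row to copy, clipped to the destination size
int rows = Math.Min(destbtmpdata.Height - point.Y, sourcebtmpdata.Height);
int copyLength = Math.Min(destbtmpdata.Width - point.X, sourcebtmpdata.Width) * 3;
int srcOffset = 0;
int destOffset = point.Y * destbtmpdata.Stride + point.X * 3;
for (int row = 0; row < rows; row++)
{
// copy one clipped row from the managed buffer straight into the destination bitmap
Marshal.Copy(src, srcOffset, IntPtr.Add(destbtmpdata.Scan0, destOffset), copyLength);
srcOffset += sourcebtmpdata.Stride;
destOffset += destbtmpdata.Stride;
}
}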
Unfortunately I don't have the time to write a full solution, but I would look into using the platform RtlMoveMemory() function to move rows as a whole, not byte-by-byte. That should be a lot faster.
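The row-at-a-time idea can be illustrated without P/Invoke. The sketch below is a managed stand-in, not RtlMoveMemory itself: it works on plain byte arrays and uses Buffer.BlockCopy to move one clipped row per call. It assumes 24 bits per pixel, strides given in bytes, and a non-negative destination position; all names are made up for this example.

```csharp
using System;

public static class RowCopy
{
    // Copies a 24-bit source image into a destination buffer at pixel
    // (destX, destY), clipping at the destination's right and bottom edges.
    public static void Blit(byte[] src, int srcWidth, int srcHeight, int srcStride,
                            byte[] dest, int destWidth, int destHeight, int destStride,
                            int destX, int destY)
    {
        const int pixelSize = 3;
        // Row count and row length are computed once, outside the loop.
        int rows = Math.Min(destHeight - destY, srcHeight);
        int rowBytes = Math.Min(destWidth - destX, srcWidth) * pixelSize;
        if (rows <= 0 || rowBytes <= 0) return; // nothing visible to copy

        for (int i = 0; i < rows; i++)
        {
            int srcOffset = i * srcStride;
            int destOffset = (destY + i) * destStride + destX * pixelSize;
            // Move the whole row in one call instead of byte by byte.
            Buffer.BlockCopy(src, srcOffset, dest, destOffset, rowBytes);
        }
    }
}
```

With locked bitmaps the same loop structure applies, with the block copy replaced by a native memory move between the two Scan0-based row pointers.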
I think the stride size and row number limits can be calculated in advance.
And I precalculated all multiplications, resulting in the following code:
private static unsafe void CopyBitmapToDest(BitmapData sourcebtmpdata, BitmapData destbtmpdata, Point point)
{
//TODO: It is expected that the bitmap PixelFormat is Format24bppRgb but this could change in the future
const int pixelSize = 3;
// calculate total number of rows to draw.
var totalRow = Math.Min(
destbtmpdata.Height - point.Y,
sourcebtmpdata.Height);
var rowSize = Math.Min(
(destbtmpdata.Width - point.X) * pixelSize,
sourcebtmpdata.Width * pixelSize);
// starting point of copy operation (point.X is in pixels, so convert it to bytes)
byte* srcPtr = (byte*)sourcebtmpdata.Scan0;
byte* destPtr = (byte*)destbtmpdata.Scan0 + point.Y * destbtmpdata.Stride + point.X * pixelSize;
// loop through each row
for (int i = 0; i < totalRow; i++) {
// draw the entire row
for (int j = 0; j < rowSize; j++)
destPtr[j] = srcPtr[j];
// advance each pointer by 1 row
destPtr += destbtmpdata.Stride;
srcPtr += sourcebtmpdata.Stride;
}
}
Haven't tested it thoroughly, but you should be able to get it to work.
I have removed multiplication operations from the loop (pre-calculated instead) and removed most branchings so it should be somewhat faster.
Let me know if this helps :-)
I am looking at your C# code and I can't recognize anything familiar. It all looks like a ton of C++. BTW, it looks like DirectX/XNA needs to become your new friend. Just my 2 cents. Don't kill the messenger.
If you must rely on CPU to do this: I've done some 24-bit layout optimizations myself, and I can tell you that memory access speed should be your bottleneck. Use SSE3 instructions for fastest possible bytewise access. This means C++ and embedded assembly language. In pure C you'll be 30% slower on most machines.
Keep in mind that modern GPUs are MUCH faster than CPU in this sort of operations.
I am not sure if this will give extra performance, but I see the pattern a lot in Reflector.
So:
int srcByte = j *3;
destRowPtr[(firstBlueByte)] = srcRowPtr[srcByte];
destRowPtr[(firstBlueByte) + 1] = srcRowPtr[srcByte + 1];
destRowPtr[(firstBlueByte) + 2] = srcRowPtr[srcByte + 2];
Becomes:
*destRowPtr++ = *srcRowPtr++;
*destRowPtr++ = *srcRowPtr++;
*destRowPtr++ = *srcRowPtr++;
Probably needs more braces.
If the width is fixed, you could probably unroll the entire line into a few hundred lines. :)
Update
You could also try using a bigger type, eg Int32 or Int64 for better performance.
Alright, this is getting fairly close to the limit of how many ms you can squeeze out of the algorithm, but get rid of the call to Math.Min and replace it with a ternary operator instead.
Generally, making a library call will take longer than doing something on your own and I made a simple test driver to confirm this for Math.Min.
using System;
using System.Diagnostics;
namespace TestDriver
{
class Program
{
static void Main(string[] args)
{
// Start the stopwatch
if (Stopwatch.IsHighResolution)
{ Console.WriteLine("Using high resolution timer"); }
else
{ Console.WriteLine("High resolution timer unavailable"); }
// Test Math.Min for 10000 iterations
Stopwatch sw = Stopwatch.StartNew();
for (int ndx = 0; ndx < 10000; ndx++)
{
int result = Math.Min(ndx, 5000);
}
Console.WriteLine(sw.Elapsed.TotalMilliseconds.ToString("0.0000"));
// Test ternary operator for 10000 iterations
sw = Stopwatch.StartNew();
for (int ndx = 0; ndx < 10000; ndx++)
{
int result = (ndx < 5000) ? ndx : 5000;
}
Console.WriteLine(sw.Elapsed.TotalMilliseconds.ToString("0.0000"));
Console.ReadKey();
}
}
}
These are the results of running the above on my computer, an Intel T2400 @ 1.83GHz. Note that there is a bit of variation in the results, but generally the ternary operator is faster by about 0.01 ms. That's not much, but over a big enough dataset it will add up.
Using high resolution timer
0.0539
0.0402
