managedCuda 2D array to GPU - C#

I'm new to CUDA and trying to figure out how to pass a 2D array to the kernel.
I have the following working code for a one-dimensional array:
class Program
{
    static void Main(string[] args)
    {
        int N = 10;
        int deviceID = 0;
        CudaContext ctx = new CudaContext(deviceID);
        CudaKernel kernel = ctx.LoadKernel(@"doubleIt.ptx", "DoubleIt");
        kernel.GridDimensions = (N + 255) / 256;
        kernel.BlockDimensions = Math.Min(N, 256);

        // Allocate input vector h_A in host memory
        float[] h_A = new float[N];
        // Initialize input vector h_A
        for (int i = 0; i < N; i++)
        {
            h_A[i] = i;
        }

        // Allocate vectors in device memory and copy vectors from host memory to device memory
        CudaDeviceVariable<float> d_A = h_A;
        CudaDeviceVariable<float> d_C = new CudaDeviceVariable<float>(N);

        // Invoke kernel
        kernel.Run(d_A.DevicePointer, d_C.DevicePointer, N);

        // Copy result from device memory to host memory
        float[] h_C = d_C;
        // h_C contains the result in host memory
    }
}
with the following kernel code:
__global__ void DoubleIt(const float* A, float* C, int N)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < N)
        C[i] = A[i] * 2;
}
As I said, everything works fine, but I want to work with a 2D array as follows:
// Allocate input vectors h_A in host memory
int W = 10;
float[][] h_A = new float[N][];
// Initialize input vectors h_A
for (int i = 0; i < N; i++)
{
    h_A[i] = new float[W];
    for (int j = 0; j < W; j++)
    {
        h_A[i][j] = i * W + j;
    }
}
I need the whole 2nd dimension to be handled by the same thread, so kernel.BlockDimensions must stay one-dimensional and each kernel thread needs to get a 1D array with 10 elements.
So my bottom-line question is: how should I copy this 2D array to the device, and how do I use it in the kernel? (As in the example, there should be a total of 10 threads.)

Short answer: you shouldn't do it...
Long answer: Jagged arrays are difficult to handle in general. Instead of one continuous segment of memory for your data, you have many small ones lying sparsely somewhere in your memory. What happens when you copy the data to the GPU? If you had one large continuous segment, you would call the cudaMemcpy/CopyToDevice functions and copy the entire block at once. But just as you allocate jagged arrays in a for loop, you'd have to copy your data line by line into a CudaDeviceVariable<CUdeviceptr>, where each entry points to a CudaDeviceVariable<float>. In parallel you maintain a host array CudaDeviceVariable<float>[] that manages your CUdeviceptrs on the host side. Copying data is already quite slow in general; doing it this way is probably a real performance killer...
To conclude: if you can, use flattened arrays and index the entries with y * DimX + x. Even better on the GPU side, use pitched memory, where the allocation is done so that each line starts at a "good" address: the index then becomes y * Pitch + x (simplified). The 2D copy methods in CUDA are made for these pitched memory allocations, where each line gets some additional bytes added.
For completeness: C# also has true 2-dimensional arrays like float[,]. You can use these on the host side instead of flattened 1D arrays, but I wouldn't recommend doing so, as the ISO standard of .NET does not guarantee that the internal memory is actually continuous, an assumption that managedCuda must make in order to use these arrays. The current .NET framework doesn't have any internal weirdness here, but who knows whether it will stay like this...
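For the question's N×W example, the flattened-array approach might look roughly like this (a minimal, untested sketch; it assumes the kernel is changed to take the row width W as an extra parameter and to index with row * W + col):
int N = 10, W = 10;
// Flatten the 2D data into one contiguous host array
float[] h_A = new float[N * W];
for (int i = 0; i < N; i++)
    for (int j = 0; j < W; j++)
        h_A[i * W + j] = i * W + j;

// One copy for the whole block
CudaDeviceVariable<float> d_A = h_A;
CudaDeviceVariable<float> d_C = new CudaDeviceVariable<float>(N * W);

// One thread per row; inside the kernel each thread loops over its W elements
// using the index row * W + col.
kernel.Run(d_A.DevicePointer, d_C.DevicePointer, N, W);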
This would realize the jagged array copy:
float[][] data_h;
CudaDeviceVariable<CUdeviceptr> data_d;
CUdeviceptr[] ptrsToData_h;                  // represents data_d on host side
CudaDeviceVariable<float>[] arrayOfarray_d;  // array of CudaDeviceVariables to manage memory, source for the pointers in ptrsToData_h

int sizeX = 512;
int sizeY = 256;

data_h = new float[sizeX][];
arrayOfarray_d = new CudaDeviceVariable<float>[sizeX];
data_d = new CudaDeviceVariable<CUdeviceptr>(sizeX);
ptrsToData_h = new CUdeviceptr[sizeX];
for (int x = 0; x < sizeX; x++)
{
    data_h[x] = new float[sizeY];
    arrayOfarray_d[x] = new CudaDeviceVariable<float>(sizeY);
    ptrsToData_h[x] = arrayOfarray_d[x].DevicePointer;
    //ToDo: init data on host...
}
// Copy the pointers once:
data_d.CopyToDevice(ptrsToData_h);
// Copy data:
for (int x = 0; x < sizeX; x++)
{
    arrayOfarray_d[x].CopyToDevice(data_h[x]);
}
// Call a kernel:
kernel.Run(data_d.DevicePointer /*, other parameters*/);
// Kernel in *.cu file:
// __global__ void kernel(float** data_d, ...)
This is a sample for CudaPitchedDeviceVariable:
int dimX = 512;
int dimY = 512;
float[] array_host = new float[dimX * dimY];
CudaPitchedDeviceVariable<float> arrayPitched_d = new CudaPitchedDeviceVariable<float>(dimX, dimY);
for (int y = 0; y < dimY; y++)
{
    for (int x = 0; x < dimX; x++)
    {
        array_host[y * dimX + x] = x * y;
    }
}

arrayPitched_d.CopyToDevice(array_host);
kernel.Run(arrayPitched_d.DevicePointer, arrayPitched_d.Pitch, dimX, dimY);

//Corresponding kernel:
extern "C"
__global__ void kernel(float* data, size_t pitch, int dimX, int dimY)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= dimX || y >= dimY)
        return;

    // Pointer arithmetic: add y*pitch to a char* pointer, as pitch is given in bytes,
    // which gives the start of line y. Convert to float* and add x to get the
    // value at entry x of line y:
    float value = *(((float*)((char*)data + y * pitch)) + x);
    *(((float*)((char*)data + y * pitch)) + x) = value + 1;

    // Or simpler, if you don't like pointers:
    float* line = (float*)((char*)data + y * pitch);
    float value2 = line[x];
}

Related

TensorFlowSharp results GetValue() is very slow

I am using TensorFlowSharp to run evaluations using a neural network on an Android phone. I am building the project with Unity.
I am using the tensorflowsharp unity plugin listed under the requirements here: https://github.com/Unity-Technologies/ml-agents/blob/master/docs/Using-TensorFlow-Sharp-in-Unity.md.
Everything is working; however, extracting the result is very slow.
The network I am running is an autoencoder and the output is an image with dimensions of 128x128x16 (yes, there are a lot of output channels).
The evaluation is done in ~0.2 seconds, which is acceptable. However, when I need to extract the result data using results[0].GetValue() it is VERY slow.
This is my code where I run the neural network:
var runner = session.GetRunner();
runner.AddInput(graph[INPUT_NAME][0], tensor).Fetch(graph[OUTPUT_NAME][0]);
var results = runner.Run();
float[,,,] heatmaps = results[0].GetValue() as float[,,,]; // <- this is SLOW
The problem:
The last line, where I convert the result to floats, is taking ~1.2 seconds.
Can it really be true that reading the result data into a float array takes more than 5 times as long as the actual evaluation of the network?
Is there another way to extract the result values?
So I have found a solution to this. I still do not know why the GetValue() call is so slow, but I found another way to retrieve the data.
I chose to manually read the raw tensor data available at results[0].Data.
I created a small function to handle this as a drop-in replacement for GetValue (here with the dimensions I am expecting hardcoded):
private float[,,,] TensorToFloats(TFTensor tensor)
{
    IntPtr resData = tensor.Data;
    UIntPtr dataSize = tensor.TensorByteSize;
    byte[] s_ImageBuffer = new byte[(int)dataSize];
    System.Runtime.InteropServices.Marshal.Copy(resData, s_ImageBuffer, 0, (int)dataSize);

    int floatsLength = s_ImageBuffer.Length / 4;
    float[] floats = new float[floatsLength];
    for (int n = 0; n < s_ImageBuffer.Length; n += 4)
    {
        floats[n / 4] = BitConverter.ToSingle(s_ImageBuffer, n);
    }

    float[,,,] result = new float[1, 128, 128, 16];
    int i = 0;
    for (int y = 0; y < 128; y++)
    {
        for (int x = 0; x < 128; x++)
        {
            for (int p = 0; p < 16; p++)
            {
                result[0, y, x, p] = floats[i++];
            }
        }
    }
    return result;
}
Given this, I can replace the code in my question with the following:
var runner = session.GetRunner();
runner.AddInput(graph[INPUT_NAME][0], tensor).Fetch(graph[OUTPUT_NAME][0]);
var results = runner.Run();
float[,,,] heatmaps = TensorToFloats(results[0]);
This is insanely much faster. Where GetValue() took ~1 second, the TensorToFloats function I created gets the same data in ~0.02 seconds.
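As a possible further simplification (an untested sketch, not part of the original answer): Marshal.Copy has an overload that copies unmanaged memory directly into a float[], which would avoid the intermediate byte[] and the BitConverter loop, and Buffer.BlockCopy can then reshape the flat array into the expected float[,,,]:
private float[,,,] TensorToFloatsDirect(TFTensor tensor)
{
    // Copy the raw tensor data straight into a float array (4 bytes per float)
    int floatCount = (int)((ulong)tensor.TensorByteSize / 4);
    float[] floats = new float[floatCount];
    System.Runtime.InteropServices.Marshal.Copy(tensor.Data, floats, 0, floatCount);

    // Reshape to the expected dimensions (hardcoded, as in the original)
    float[,,,] result = new float[1, 128, 128, 16];
    System.Buffer.BlockCopy(floats, 0, result, 0, floatCount * 4);
    return result;
}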

EmguCv: Reduce the grayscales

Is there a way to reduce the gray levels of a grayscale image in OpenCV?
Normally I have gray values from 0 to 255 for an
Image<Gray, byte> inputImage.
In my case I just need gray values from 0-10. Is there a good way to do that with OpenCV, especially for C#?
There's nothing built into OpenCV that does this sort of thing.
Nevertheless, you can write something yourself. Take a look at this C++ implementation and just translate it to C#:
void colorReduce(cv::Mat& image, int div = 64)
{
    int nl = image.rows;                    // number of lines
    int nc = image.cols * image.channels(); // number of elements per line
    for (int j = 0; j < nl; j++)
    {
        // get the address of row j
        uchar* data = image.ptr<uchar>(j);
        for (int i = 0; i < nc; i++)
        {
            // process each pixel
            data[i] = data[i] / div * div + div / 2;
        }
    }
}
Just send a grayscale Mat to this function and play with the div parameter.
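A rough C# translation for an EmguCV Image<Gray, byte> might look like this (an untested sketch; with div = 26 the 0-255 range collapses to roughly ten distinct gray values):
static void ColorReduce(Image<Gray, byte> image, int div = 64)
{
    byte[,,] data = image.Data; // [row, column, channel]
    int rows = image.Rows;
    int cols = image.Cols;
    for (int y = 0; y < rows; y++)
    {
        for (int x = 0; x < cols; x++)
        {
            // Quantize each pixel to the middle of its bucket
            data[y, x, 0] = (byte)(data[y, x, 0] / div * div + div / 2);
        }
    }
}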

AccessViolationException with sound buffer conversion

I'm using the NAudio AsioOut object to pass data from the input buffer to my delayProc() function and then to the output buffer.
delayProc() needs a float[] buffer, which is possible using e.GetAsInterleavedSamples(). The problem is that I need to convert it back to a multidimensional IntPtr; to do this I'm using the AsioSampleConvertor class.
When I try to apply the effect it shows me an error: an AccessViolationException in the code of the AsioSampleConvertor class.
So I think the problem is due to the conversion from float[] to IntPtr[].
I give you some code:
OnAudioAvailable()
floatIn = new float[e.SamplesPerBuffer * e.InputBuffers.Length]; //*2
e.GetAsInterleavedSamples(floatIn);
floatOut = delayProc(floatIn, e.SamplesPerBuffer * e.InputBuffers.Length, 1.5f);
// conversion from float[] to IntPtr[L][R]
Outp = Marshal.AllocHGlobal(sizeof(float) * floatOut.Length);
Marshal.Copy(floatOut, 0, Outp, floatOut.Length);
NAudio.Wave.Asio.ASIOSampleConvertor.ConvertorFloatToInt2Channels(Outp, e.OutputBuffers, e.InputBuffers.Length, floatOut.Length);
delayProc()
private float[] delayProc(float[] sourceBuffer, int sampleCount, float delay)
{
    if (OldBuf == null)
    {
        OldBuf = new float[sampleCount];
    }

    float[] BufDly = new float[(int)(sampleCount * delay)];
    int delayLength = (int)(BufDly.Length - (BufDly.Length / delay));

    for (int j = sampleCount - delayLength; j < sampleCount; j++)
        for (int i = 0; i < delayLength; i++)
            BufDly[i] = OldBuf[j];

    for (int j = 0; j < sampleCount; j++)
        for (int i = delayLength; i < BufDly.Length; i++)
            BufDly[i] = sourceBuffer[j];

    for (int i = 0; i < sampleCount; i++)
        OldBuf[i] = sourceBuffer[i];

    return BufDly;
}
AsioSampleConvertor
public static void ConvertorFloatToInt2Channels(IntPtr inputInterleavedBuffer, IntPtr[] asioOutputBuffers, int nbChannels, int nbSamples)
{
    unsafe
    {
        float* inputSamples = (float*)inputInterleavedBuffer;
        int* leftSamples = (int*)asioOutputBuffers[0];
        int* rightSamples = (int*)asioOutputBuffers[1];
        for (int i = 0; i < nbSamples; i++)
        {
            *leftSamples++ = clampToInt(inputSamples[0]);
            *rightSamples++ = clampToInt(inputSamples[1]);
            inputSamples += 2;
        }
    }
}
ClampToInt()
private static int clampToInt(double sampleValue)
{
    sampleValue = (sampleValue < -1.0) ? -1.0 : (sampleValue > 1.0) ? 1.0 : sampleValue;
    return (int)(sampleValue * 2147483647.0);
}
If you need some other code, just ask me.
When you call ConvertorFloatToInt2Channels you are passing in the total number of samples across all channels, then trying to read that many pairs of samples. So you are trying to read twice as many samples from your input buffer as are actually there. Using unsafe code you are trying to address well past the end of the allocated block, which results in the access violation you are getting.
Change the for loop in your ConvertorFloatToInt2Channels method to read:
for (int i = 0; i < nbSamples; i += 2)
This will stop your code from trying to read double the number of items actually present in the source memory block.
Incidentally, why are you messing around with allocating global memory and using unsafe code here? Why not process them as managed arrays? Processing the data itself isn't much slower, and you save on all the overheads of copying data to and from unmanaged memory.
Try this:
public static void FloatMonoToIntStereo(float[] samples, int[] leftChannel, int[] rightChannel)
{
    // De-interleave the float samples and scale them to 32-bit integer range
    for (int i = 0, j = 0; i < samples.Length; i += 2, j++)
    {
        leftChannel[j] = (int)(samples[i] * Int32.MaxValue);
        rightChannel[j] = (int)(samples[i + 1] * Int32.MaxValue);
    }
}
On my machine that processes around 12 million samples per second, converting the samples to integer and splitting the channels. About half that speed if I allocate the buffers for every set of results. About half again when I write that to use unsafe code, AllocHGlobal etc.
Never assume that unsafe code is faster.
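For the question's scenario, wiring this up might look roughly like the following (an untested sketch; it assumes floatOut holds interleaved stereo samples and that e.OutputBuffers[0]/[1] are the left and right ASIO buffers, as in the original code):
int samplePairs = floatOut.Length / 2;
int[] left = new int[samplePairs];
int[] right = new int[samplePairs];

// Split and convert entirely in managed code
FloatMonoToIntStereo(floatOut, left, right);

// One copy per channel into the ASIO output buffers
Marshal.Copy(left, 0, e.OutputBuffers[0], samplePairs);
Marshal.Copy(right, 0, e.OutputBuffers[1], samplePairs);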

Connected-component labeling algorithm optimization

I need some help with optimisation of my CCL (connected-component labeling) algorithm implementation. I use it to detect black areas in an image. On a 2000x2000 image it takes 11 seconds, which is far too long. I need to reduce the running time as much as possible. Also, I would be glad to know if there is any other algorithm out there that does the same thing, but faster than this one. So here is my code:
// The method returns a dictionary, where the key is the label
// and the list contains all the pixels with that label
public Dictionary<short, LinkedList<Point>> ProcessCCL()
{
    Color backgroundColor = this.image.Palette.Entries[1];
    // Matrix to store pixels' labels
    short[,] labels = new short[this.image.Width, this.image.Height];
    // I particularly don't like how I store the label equality table,
    // but I don't know how else I can store it.
    // I use LinkedList to add and remove items faster.
    Dictionary<short, LinkedList<short>> equalityTable = new Dictionary<short, LinkedList<short>>();
    // Current label
    short currentKey = 1;
    for (int x = 1; x < this.bitmap.Width; x++)
    {
        for (int y = 1; y < this.bitmap.Height; y++)
        {
            if (!GetPixelColor(x, y).Equals(backgroundColor))
            {
                // Minimum label of the neighbours' labels
                short label = Math.Min(labels[x - 1, y], labels[x, y - 1]);
                // If there are no neighbours
                if (label == 0)
                {
                    // Create a new unique label
                    labels[x, y] = currentKey;
                    equalityTable.Add(currentKey, new LinkedList<short>());
                    equalityTable[currentKey].AddFirst(currentKey);
                    currentKey++;
                }
                else
                {
                    labels[x, y] = label;
                    short west = labels[x - 1, y], north = labels[x, y - 1];
                    // A little trick:
                    // Because of those "ifs" the lowest label value
                    // will always be the first in the list,
                    // but I'm afraid that because of them
                    // the running time also increases
                    if (!equalityTable[label].Contains(west))
                        if (west < equalityTable[label].First.Value)
                            equalityTable[label].AddFirst(west);
                    if (!equalityTable[label].Contains(north))
                        if (north < equalityTable[label].First.Value)
                            equalityTable[label].AddFirst(north);
                }
            }
        }
    }

    // This dictionary will be returned as the result.
    // I'm not proud of using a dictionary here too; I guess there
    // is a better way to store the result.
    Dictionary<short, LinkedList<Point>> result = new Dictionary<short, LinkedList<Point>>();
    // I define the variable outside the loops in order
    // to reuse the memory address
    short cellValue;
    for (int x = 0; x < this.bitmap.Width; x++)
    {
        for (int y = 0; y < this.bitmap.Height; y++)
        {
            cellValue = labels[x, y];
            // If the pixel is not background
            if (cellValue != 0)
            {
                // Take the minimum value from the label equality table
                short value = equalityTable[cellValue].First.Value;
                // I'd like to get rid of these lines
                if (!result.ContainsKey(value))
                    result.Add(value, new LinkedList<Point>());
                result[value].AddLast(new Point(x, y));
            }
        }
    }
    return result;
}
Thanks in advance!
You could split your picture into multiple sub-pictures, process them in parallel, and then merge the results.
Pass 1: 4 tasks, each processing a 1000x1000 sub-picture
Pass 2: 2 tasks, each processing 2 of the sub-pictures from pass 1
Pass 3: 1 task, processing the result of pass 2
For C# I recommend the Task Parallel Library (TPL), which allows you to easily define tasks that depend on and wait for each other. The following CodeProject article gives you a basic introduction to the TPL: The Basics of Task Parallelism via C#.
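A minimal sketch of that task structure with the TPL might look like this (ProcessQuadrant and MergeLabels are hypothetical placeholders for a per-sub-picture CCL pass and a border-merging step; they are not part of the question's code):
// Pass 1: label the four quadrants in parallel
Task<Dictionary<short, LinkedList<Point>>>[] pass1 =
{
    Task.Run(() => ProcessQuadrant(0, 0, 1000, 1000)),
    Task.Run(() => ProcessQuadrant(1000, 0, 1000, 1000)),
    Task.Run(() => ProcessQuadrant(0, 1000, 1000, 1000)),
    Task.Run(() => ProcessQuadrant(1000, 1000, 1000, 1000))
};

// Pass 2: merge pairs of neighbouring results once both are available
// (MergeLabels resolves labels across the shared border).
var top = Task.WhenAll(pass1[0], pass1[1]).ContinueWith(t => MergeLabels(t.Result[0], t.Result[1]));
var bottom = Task.WhenAll(pass1[2], pass1[3]).ContinueWith(t => MergeLabels(t.Result[0], t.Result[1]));

// Pass 3: merge the two halves into the final result.
var final = Task.WhenAll(top, bottom).ContinueWith(t => MergeLabels(t.Result[0], t.Result[1]));
var result = final.Result;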
I would process one scan line at a time, keeping track of the beginning and end of each run of black pixels.
Then I would, on each scan line, compare it to the runs on the previous line. If there is a run on the current line that does not overlap a run on the previous line, it represents a new blob. If there is a run on the previous line that overlaps a run on the current line, it gets the same blob label as the previous one, and so on. You get the idea.
I would try not to use dictionaries and such.
In my experience, randomly halting the program shows that those things may make programming incrementally easier, but they can exact a serious performance cost due to new-ing.
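One standard way to drop the dictionary-of-linked-lists equality table is a small union-find (disjoint-set) structure over the labels; a minimal sketch (not from the original answer) could look like this:
// Minimal union-find over short labels; replaces the equalityTable.
// Labels are merged whenever two neighbouring pixels carry different labels.
short[] parent = new short[short.MaxValue];

short Find(short label)
{
    while (parent[label] != label)
    {
        parent[label] = parent[parent[label]]; // path halving
        label = parent[label];
    }
    return label;
}

void Union(short a, short b)
{
    short rootA = Find(a);
    short rootB = Find(b);
    if (rootA != rootB)
        parent[Math.Max(rootA, rootB)] = Math.Min(rootA, rootB);
}

// When creating a new label:  parent[currentKey] = currentKey;
// When west and north are both non-zero and differ:  Union(west, north);
// In the second pass, use Find(labels[x, y]) instead of the equality table lookup.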
The problem is GetPixelColor(x, y); it takes a very long time to access image data.
The Set/GetPixel functions are terribly slow in C#, so if you need to use them a lot, you should use Bitmap.LockBits instead.
private void ProcessUsingLockbits(Bitmap ProcessedBitmap)
{
    BitmapData bitmapData = ProcessedBitmap.LockBits(new Rectangle(0, 0, ProcessedBitmap.Width, ProcessedBitmap.Height), ImageLockMode.ReadWrite, ProcessedBitmap.PixelFormat);
    int BytesPerPixel = System.Drawing.Bitmap.GetPixelFormatSize(ProcessedBitmap.PixelFormat) / 8;
    int ByteCount = bitmapData.Stride * ProcessedBitmap.Height;
    byte[] Pixels = new byte[ByteCount];
    IntPtr PtrFirstPixel = bitmapData.Scan0;
    Marshal.Copy(PtrFirstPixel, Pixels, 0, Pixels.Length);
    int HeightInPixels = bitmapData.Height;
    int WidthInBytes = bitmapData.Width * BytesPerPixel;
    for (int y = 0; y < HeightInPixels; y++)
    {
        int CurrentLine = y * bitmapData.Stride;
        for (int x = 0; x < WidthInBytes; x = x + BytesPerPixel)
        {
            int OldBlue = Pixels[CurrentLine + x];
            int OldGreen = Pixels[CurrentLine + x + 1];
            int OldRed = Pixels[CurrentLine + x + 2];
            // Transform blue and clip to 255:
            Pixels[CurrentLine + x] = (byte)((OldBlue + BlueMagnitudeToAdd > 255) ? 255 : OldBlue + BlueMagnitudeToAdd);
            // Transform green and clip to 255:
            Pixels[CurrentLine + x + 1] = (byte)((OldGreen + GreenMagnitudeToAdd > 255) ? 255 : OldGreen + GreenMagnitudeToAdd);
            // Transform red and clip to 255:
            Pixels[CurrentLine + x + 2] = (byte)((OldRed + RedMagnitudeToAdd > 255) ? 255 : OldRed + RedMagnitudeToAdd);
        }
    }
    // Copy modified bytes back:
    Marshal.Copy(Pixels, 0, PtrFirstPixel, Pixels.Length);
    ProcessedBitmap.UnlockBits(bitmapData);
}
Here is the basic code to access pixel data.
And I made a function to transform this into a 2D matrix, which is easier to manipulate (but a little slower):
private void bitmap_to_matrix()
{
    unsafe
    {
        bitmapData = ProcessedBitmap.LockBits(new Rectangle(0, 0, ProcessedBitmap.Width, ProcessedBitmap.Height), ImageLockMode.ReadWrite, ProcessedBitmap.PixelFormat);
        int BytesPerPixel = System.Drawing.Bitmap.GetPixelFormatSize(ProcessedBitmap.PixelFormat) / 8;
        int HeightInPixels = ProcessedBitmap.Height;
        int WidthInPixels = ProcessedBitmap.Width;
        int WidthInBytes = ProcessedBitmap.Width * BytesPerPixel;
        byte* PtrFirstPixel = (byte*)bitmapData.Scan0;

        Parallel.For(0, HeightInPixels, y =>
        {
            byte* CurrentLine = PtrFirstPixel + (y * bitmapData.Stride);
            for (int x = 0; x < WidthInBytes; x = x + BytesPerPixel)
            {
                // Conversion to grey level
                double rst = CurrentLine[x] * 0.0721 + CurrentLine[x + 1] * 0.7154 + CurrentLine[x + 2] * 0.2125;
                // Fill the grey matrix
                TG[x / 3, y] = (int)rst;
            }
        });
    }
}
And here is the website the code comes from:
"High performance SystemDrawingBitmap"
Thanks to the author for his really good job!
Hope this will help!

LockBits Performance Critical Code

I have a method which needs to be as fast as it possibly can; it uses unsafe memory pointers and it's my first foray into this type of coding, so I know it can probably be faster.
/// <summary>
/// Copies bitmap data from one bitmap to another at a specified point on the output bitmap data
/// </summary>
/// <param name="sourcebtmpdata">The source bitmap must be smaller than the dest bitmap</param>
/// <param name="destbtmpdata"></param>
/// <param name="point">The point on the destination bitmap to draw at</param>
private static unsafe void CopyBitmapToDest(BitmapData sourcebtmpdata, BitmapData destbtmpdata, Point point)
{
    // calculate total number of rows to draw.
    var totalRow = Math.Min(
        destbtmpdata.Height - point.Y,
        sourcebtmpdata.Height);

    //loop through each row on the source bitmap and get mem pointers
    //to the source bitmap and dest bitmap
    for (int i = 0; i < totalRow; i++)
    {
        int destRow = point.Y + i;

        //get the pointer to the start of the current pixel "row" on the output image
        byte* destRowPtr = (byte*)destbtmpdata.Scan0 + (destRow * destbtmpdata.Stride);
        //get the pointer to the start of the FIRST pixel row on the source image
        byte* srcRowPtr = (byte*)sourcebtmpdata.Scan0 + (i * sourcebtmpdata.Stride);

        int pointX = point.X;
        //the rowSize is pre-computed before the inner loop to improve performance
        int rowSize = Math.Min(destbtmpdata.Width - pointX, sourcebtmpdata.Width);

        //for each row set each pixel
        for (int j = 0; j < rowSize; j++)
        {
            int firstBlueByte = (pointX + j) * 3;
            int srcByte = j * 3;
            destRowPtr[firstBlueByte] = srcRowPtr[srcByte];
            destRowPtr[firstBlueByte + 1] = srcRowPtr[srcByte + 1];
            destRowPtr[firstBlueByte + 2] = srcRowPtr[srcByte + 2];
        }
    }
}
So is there anything that can be done to make this faster? Ignore the TODO for now; I'll fix that later once I have some baseline performance measurements.
UPDATE: Sorry, I should have mentioned that the reason I'm using this instead of Graphics.DrawImage is that I'm implementing multi-threading, and because of that I can't use DrawImage.
UPDATE 2: I'm still not satisfied with the performance and I'm sure there are a few more ms that can be had.
There was something fundamentally wrong with the code that I can't believe I didn't notice until now.
byte* destRowPtr = (byte*)destbtmpdata.Scan0 + (destRow * destbtmpdata.Stride);
This gets a pointer to the destination row, but it does not get the column that it is copying to; in the old code that is done inside the rowSize loop. It now looks like this:
byte* destRowPtr = (byte*)destbtmpdata.Scan0 + (destRow * destbtmpdata.Stride) + pointX * 3;
So now we have the correct pointer for the destination data, and we can get rid of the inner for loop. Using suggestions from Vilx- and Rob, the code now looks like this:
private static unsafe void CopyBitmapToDestSuperFast(BitmapData sourcebtmpdata, BitmapData destbtmpdata, Point point)
{
    //calculate total number of rows to copy.
    //using ternary operator instead of Math.Min, a few ms faster
    int totalRows = (destbtmpdata.Height - point.Y < sourcebtmpdata.Height) ? destbtmpdata.Height - point.Y : sourcebtmpdata.Height;

    //calculate the width of the image to draw; this cuts off the image
    //if it goes past the width of the destination image
    int rowWidth = (destbtmpdata.Width - point.X < sourcebtmpdata.Width) ? destbtmpdata.Width - point.X : sourcebtmpdata.Width;

    //loop through each row on the source bitmap and get mem pointers
    //to the source bitmap and dest bitmap
    for (int i = 0; i < totalRows; i++)
    {
        int destRow = point.Y + i;

        //get the pointer to the start of the current pixel "row" and column on the output image
        byte* destRowPtr = (byte*)destbtmpdata.Scan0 + (destRow * destbtmpdata.Stride) + point.X * 3;
        //get the pointer to the start of the FIRST pixel row on the source image
        byte* srcRowPtr = (byte*)sourcebtmpdata.Scan0 + (i * sourcebtmpdata.Stride);

        //RtlMoveMemory function
        CopyMemory(new IntPtr(destRowPtr), new IntPtr(srcRowPtr), (uint)rowWidth * 3);
    }
}
Copying a 500x500 image to a 5000x5000 image in a grid 50 times took: 00:00:07.9948993 secs. Now with the changes above it takes 00:00:01.8714263 secs. Much better.
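For reference, CopyMemory is not a built-in .NET method; a typical P/Invoke declaration for the underlying RtlMoveMemory function (the question does not show how it was declared, so this is an assumption) is:
using System.Runtime.InteropServices;

// RtlMoveMemory copies count bytes from src to dest and handles overlapping blocks.
[DllImport("kernel32.dll", EntryPoint = "RtlMoveMemory", SetLastError = false)]
private static extern void CopyMemory(IntPtr dest, IntPtr src, uint count);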
Well... I'm not sure whether .NET bitmap data formats are entirely compatible with Windows's GDI32 functions...
But one of the first Win32 APIs I learned was BitBlt:
BOOL BitBlt(
    HDC hdcDest,
    int nXDest,
    int nYDest,
    int nWidth,
    int nHeight,
    HDC hdcSrc,
    int nXSrc,
    int nYSrc,
    DWORD dwRop
);
And it was the fastest way to copy data around, if I remember correctly.
Here's the BitBlt P/Invoke signature for use in C# and related usage information, a great read for anyone working with high-performance graphics in C#:
http://www.pinvoke.net/default.aspx/gdi32/BitBlt.html
Definitely worth a look.
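A typical C# declaration, along the lines of what that page documents, is roughly:
using System.Runtime.InteropServices;

// GDI bit-block transfer; dwRop selects the raster operation (0x00CC0020 = SRCCOPY).
[DllImport("gdi32.dll", SetLastError = true)]
static extern bool BitBlt(IntPtr hdcDest, int nXDest, int nYDest, int nWidth, int nHeight,
                          IntPtr hdcSrc, int nXSrc, int nYSrc, uint dwRop);

const uint SRCCOPY = 0x00CC0020;
Note that BitBlt works on device contexts, so you would obtain HDCs for both bitmaps (for example via Graphics.FromImage(...).GetHdc()) rather than using LockBits pointers.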
The inner loop is where you want to concentrate a lot of your time (but, do measurements to make sure)
for (int j = 0; j < sourcebtmpdata.Width; j++)
{
    destRowPtr[(point.X + j) * 3] = srcRowPtr[j * 3];
    destRowPtr[((point.X + j) * 3) + 1] = srcRowPtr[(j * 3) + 1];
    destRowPtr[((point.X + j) * 3) + 2] = srcRowPtr[(j * 3) + 2];
}
Get rid of the multiplies and the array indexing (which is a multiply under the hood) and replace them with a pointer that you increment.
Ditto with the +1, +2: increment a pointer.
Your compiler probably won't keep recomputing point.X (check), but make it a local variable just in case. It won't matter for a single iteration, but it might across iterations.
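Applied to the loop above, that advice might look roughly like this (an untested sketch; rowSize and the row pointers come from the question's code):
// Position both pointers at the first byte to copy for this row,
// then advance them one byte at a time instead of re-computing offsets.
byte* dst = destRowPtr + point.X * 3;
byte* src = srcRowPtr;
int rowBytes = rowSize * 3;
for (int j = 0; j < rowBytes; j++)
{
    *dst++ = *src++;
}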
You may want to look at Eigen.
It is a C++ template library that uses SSE (2 and later) and AltiVec instruction sets with graceful fallback to non-vectorized code.
Fast. (See benchmark).
Expression templates allow temporaries to be removed intelligently and enable lazy evaluation when that is appropriate -- Eigen takes care of this automatically and handles aliasing too in most cases.
Explicit vectorization is performed for the SSE (2 and later) and AltiVec instruction sets, with graceful fallback to non-vectorized code. Expression templates allow these optimizations to be performed globally, for whole expressions.
With fixed-size objects, dynamic memory allocation is avoided, and the loops are unrolled when that makes sense.
For large matrices, special attention is paid to cache-friendliness.
You could implement your function in C++ and then call it from C#.
You don't always need to use pointers to get good speed. This should be within a couple ms of the original:
private static void CopyBitmapToDest(BitmapData sourcebtmpdata, BitmapData destbtmpdata, Point point)
{
    byte[] src = new byte[sourcebtmpdata.Height * sourcebtmpdata.Width * 3];
    int maximum = src.Length;
    byte[] dest = new byte[maximum];
    Marshal.Copy(sourcebtmpdata.Scan0, src, 0, src.Length);

    int pointX = point.X * 3;
    int copyLength = destbtmpdata.Width * 3 - pointX;
    int k = pointX + point.Y * sourcebtmpdata.Stride;
    int rowWidth = sourcebtmpdata.Stride;

    while (k < maximum)
    {
        Array.Copy(src, k, dest, k, copyLength);
        k += rowWidth;
    }
    Marshal.Copy(dest, 0, destbtmpdata.Scan0, dest.Length);
}
Unfortunately I don't have the time to write a full solution, but I would look into using the platform RtlMoveMemory() function to move rows as a whole, not byte-by-byte. That should be a lot faster.
I think the stride size and row number limits can be calculated in advance.
And I precalculated all multiplications, resulting in the following code:
private static unsafe void CopyBitmapToDest(BitmapData sourcebtmpdata, BitmapData destbtmpdata, Point point)
{
    //TODO: It is expected that the bitmap PixelFormat is Format24bppRgb but this could change in the future
    const int pixelSize = 3;

    // calculate total number of rows to draw.
    var totalRow = Math.Min(
        destbtmpdata.Height - point.Y,
        sourcebtmpdata.Height);
    var rowSize = Math.Min(
        (destbtmpdata.Width - point.X) * pixelSize,
        sourcebtmpdata.Width * pixelSize);

    // starting point of the copy operation (row and column offset, in bytes)
    byte* srcPtr = (byte*)sourcebtmpdata.Scan0;
    byte* destPtr = (byte*)destbtmpdata.Scan0 + point.Y * destbtmpdata.Stride + point.X * pixelSize;

    // loop through each row
    for (int i = 0; i < totalRow; i++)
    {
        // draw the entire row
        for (int j = 0; j < rowSize; j++)
            destPtr[j] = srcPtr[j];

        // advance each pointer by 1 row
        destPtr += destbtmpdata.Stride;
        srcPtr += sourcebtmpdata.Stride;
    }
}
I haven't tested it thoroughly, but you should be able to get it to work.
I have removed the multiplication operations from the loop (pre-calculated instead) and removed most branching, so it should be somewhat faster.
Let me know if this helps :-)
I am looking at your C# code and I can't recognize anything familiar. It all looks like a ton of C++. BTW, it looks like DirectX/XNA needs to become your new friend. Just my 2 cents. Don't kill the messenger.
If you must rely on the CPU to do this: I've done some 24-bit layout optimizations myself, and I can tell you that memory access speed should be your bottleneck. Use SSE3 instructions for the fastest possible bytewise access. This means C++ and embedded assembly language. In pure C you'll be about 30% slower on most machines.
Keep in mind that modern GPUs are MUCH faster than CPUs at this sort of operation.
I am not sure if this will give extra performance, but I see the pattern a lot in Reflector.
So:
int srcByte = j *3;
destRowPtr[(firstBlueByte)] = srcRowPtr[srcByte];
destRowPtr[(firstBlueByte) + 1] = srcRowPtr[srcByte + 1];
destRowPtr[(firstBlueByte) + 2] = srcRowPtr[srcByte + 2];
Becomes:
*destRowPtr++ = *srcRowPtr++;
*destRowPtr++ = *srcRowPtr++;
*destRowPtr++ = *srcRowPtr++;
Probably needs more braces.
If the width is fixed, you could probably unroll the entire line into a few hundred lines. :)
Update
You could also try using a bigger type, e.g. Int32 or Int64, for better performance.
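As a rough illustration of the bigger-type idea (an untested sketch, assuming each row is rowWidth * 3 contiguous bytes as in the question's updated code): copy the bulk of each row eight bytes at a time and finish the remainder byte by byte.
// Copy one row: bulk copy as 64-bit words, then the remaining bytes.
byte* dst = destRowPtr;
byte* src = srcRowPtr;
int rowBytes = rowWidth * 3;

long* dst64 = (long*)dst;
long* src64 = (long*)src;
for (int n = 0; n < rowBytes / 8; n++)
{
    *dst64++ = *src64++;
}
for (int n = rowBytes - rowBytes % 8; n < rowBytes; n++)
{
    dst[n] = src[n];
}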
Alright, this is going to be fairly close to the line of how many ms you can get out of the algorithm, but get rid of the call to Math.Min and replace it with a ternary operator instead.
Generally, making a library call will take longer than doing something on your own, and I made a simple test driver to confirm this for Math.Min.
using System;
using System.Diagnostics;

namespace TestDriver
{
    class Program
    {
        static void Main(string[] args)
        {
            // Start the stopwatch
            if (Stopwatch.IsHighResolution)
            { Console.WriteLine("Using high resolution timer"); }
            else
            { Console.WriteLine("High resolution timer unavailable"); }

            // Test Math.Min for 10000 iterations
            Stopwatch sw = Stopwatch.StartNew();
            for (int ndx = 0; ndx < 10000; ndx++)
            {
                int result = Math.Min(ndx, 5000);
            }
            Console.WriteLine(sw.Elapsed.TotalMilliseconds.ToString("0.0000"));

            // Test ternary operator for 10000 iterations
            sw = Stopwatch.StartNew();
            for (int ndx = 0; ndx < 10000; ndx++)
            {
                int result = (ndx < 5000) ? ndx : 5000;
            }
            Console.WriteLine(sw.Elapsed.TotalMilliseconds.ToString("0.0000"));
            Console.ReadKey();
        }
    }
}
The results when running the above on my computer, an Intel T2400 @ 1.83GHz. Also, note that there is a bit of variation in the results, but generally the ternary operator is faster by about 0.01 ms. That's not much, but over a big enough dataset it will add up.
Using high resolution timer
0.0539
0.0402
