New image overlays previous bitmap - C#

There are a number of posts about this, but I still can't figure it out. I am rather new at this, so please be forgiving.
I display an image, then grab a new image and try to display it. When the new image is displayed, it has remnants of the old image. I have tried Picture1.Image = null to no avail.
Is it an issue with managed memory? I suspect it has to do with how the memory is being managed: somehow the code copies a new image over an old image in a way that leaves some data from the previous image behind.
Here is the code to display the data in scaled1 (from this helpful earlier post):
Edit:
Code added showing processing of arrays that are plotted. The overlaying behavior stops if the arrays are cleared using the Array.Clear method. Perhaps when this is cleared up I can post a canonical snippet demonstrating the issue.
This recasts the question as: why do arrays need to be cleared when every value of the array is rewritten? How can an array retain information from previous values?
ushort[] frame = null;
byte[] scaled1 = null;
double[][] frameringSin;
double[][] frameringCos;
double[] ringsin;
double[] ringcos;
double[] sumsin;
double[] sumcos;
frame = new ushort[mImageWidth * mImageHeight];
scaled1 = new byte[mImageWidth * mImageHeight];
frameringSin = new double[RingSize][];
frameringCos = new double[RingSize][];
ringsin = new double[RingSize];
ringcos = new double[RingSize];
sumsin = new double[mImageWidth * mImageHeight];
sumcos = new double[mImageWidth * mImageHeight];
for (int ring = 0; ring < RingSize; ++ring)
{
    frameringSin[ring] = new double[frame.Length];
    frameringCos[ring] = new double[frame.Length];
}
//Fill array with images
for (int ring = 0; ring < nN; ++ring)
{
    mCamera.GrabFrameReduced(framering[ring], reduced, out preset);
}
//Process images
for (int i = 0; i < nN; ++i)
{
    Array.Clear(frameringSin[i], 0, frameringSin[i].Length);
    Array.Clear(frameringCos[i], 0, frameringCos[i].Length);
}
Array.Clear(sumsin, 0, sumsin.Length);
Array.Clear(sumcos, 0, sumcos.Length);
for (int r = 0; r < nN; ++r)
{
    for (int i = 0; i < frame.Length; ++i) // up to 12 ms
    {
        frameringSin[r][i] = framering[r][i] * ringsin[r] / nN;
        frameringCos[r][i] = framering[r][i] * ringcos[r] / nN;
    }
}
for (int i = 0; i < sumsin.Length; ++i) // up to 25 ms
{
    for (int r = 0; r < nN; ++r)
    {
        sumsin[i] += frameringSin[r][i];
        sumcos[i] += frameringCos[r][i];
    }
}
for (int i = 0; i < sumsin.Length; ++i)
{
    A[i] = Math.Sqrt(sumsin[i] * sumsin[i] + sumcos[i] * sumcos[i]);
}
//extract scaling parameters
...
//Scale Image
for (i1 = 0; i1 < frame.Length; ++i1)
    scaled1[i1] = (byte)((Math.Min(Math.Max(min1, frameA[i1]), max1) - min1) * scale1);
bmp1 = new Bitmap(mImageWidth, mImageHeight, System.Drawing.Imaging.PixelFormat.Format8bppIndexed);
var bdata1 = bmp1.LockBits(new Rectangle(new Point(0, 0), bmp1.Size), ImageLockMode.WriteOnly, bmp1.PixelFormat);
try
{
    Marshal.Copy(scaled1, 0, bdata1.Scan0, scaled1.Length);
}
finally
{
    bmp1.UnlockBits(bdata1);
}
Picture1.Image = bmp1;
Picture1.Refresh();

Actually, you're not replacing all values in the arrays - your for loops are wrong. You want them to look like this:
for (i1 = 0; i1 < frame.Length; i1++)
    scaled1[i1] = (byte)((Math.Min(Math.Max(min1, frameA[i1]), max1) - min1) * scale1);
The difference (i++ vs ++i) is that your way, you're skipping the first and the last index. C# arrays start at 0, while you start at 1 (you increment the loop variable before you run the body for the first time).
Also, note that for performance reasons, it's very handy if you're going through the array like this:
for (var i = 0; i < array.Length; i++)
/* do work with array[i] */
The JIT compiler recognizes this and avoids bounds checks, because it knows there can never be an overflow. When you're doing a lot of work with arrays, this can give you a massive performance boost, even if you access multiple arrays through the same index (one of them will not have the checks, the others will - still saves a lot of work).
The default JIT isn't very smart about this (it has to be quite fast), so pretty much anything else will reintroduce the bounds check. If performance is a concern for you, you'd want to profile the code anyway, of course.
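For illustration, a minimal sketch of the two patterns (the second half is my assumption about the classic JIT; newer runtimes may handle it too):
int[] array = new int[1000];
// The JIT can prove 0 <= i < array.Length here, so the per-access
// bounds check is removed.
for (int i = 0; i < array.Length; i++)
    array[i] = i;
// Hoisting the length into a local can be enough to make older JITs
// re-insert the bounds check on every access.
int n = array.Length;
for (int i = 0; i < n; i++)
    array[i] = i;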
EDIT: Ah, my bad. Anyway, I believe your problem isn't having to clear the frameringXXX arrays, but rather, the sumsin and sumcos arrays - you're always adding to those, so you'd be adding to the value that was already there, rather than starting from zero again. So you need to reset those arrays to zeroes first (which is what Array.Clear does).
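A minimal sketch of that pitfall (the array sizes are made up; only the reset matters):
double[] sumsin = new double[4];
double[][] frameringSin = { new double[] { 1, 2, 3, 4 }, new double[] { 5, 6, 7, 8 } };
// Without this reset, a second pass through this code would add on top
// of the totals left over from the first pass.
Array.Clear(sumsin, 0, sumsin.Length);
for (int r = 0; r < frameringSin.Length; ++r)
    for (int i = 0; i < sumsin.Length; ++i)
        sumsin[i] += frameringSin[r][i]; // '+=' accumulates; it never overwrites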

Related

Intrinsics SIMD instruction to replace values

I wonder how it would be possible to replace byte values in a Vector128<byte>
Assume the code below, where we have a resultvector with these values:
<0,0,0,0,1,1,1,1,0,0,0,0,0,0,0,0>
Here I would like to create a new vector where every "0" is replaced with "2" and every "1" is replaced with "0", like this:
<2,2,2,2,0,0,0,0,2,2,2,2,2,2,2,2>
I am not sure whether there is an intrinsic for this, or how else to achieve it.
Thank you!
//Create array
byte[] array = new byte[16];
for (int i = 0; i < 4; i++) { array[i] = 0; }
for (int i = 4; i < 8; i++) { array[i] = 1; }
for (int i = 8; i < 16; i++) { array[i] = 0; }
fixed (byte* ptr = array)
{
    byte* pointarray = &*((byte*)(ptr + 0));
    System.Runtime.Intrinsics.Vector128<byte> resultvector = System.Runtime.Intrinsics.X86.Avx.LoadVector128(&pointarray[0]);
    //<0,0,0,0,1,1,1,1,0,0,0,0,0,0,0,0>
    //resultvector
}
The instruction for that is pshufb, available in modern .NET as Avx2.Shuffle, and as Ssse3.Shuffle for the 16-byte version. Both are really fast: 1 cycle latency on modern CPUs.
Pass your source data as the shuffle control mask argument, and a special value as the first argument (the bytes being shuffled), something like this:
// Create an AVX vector with all zeros except the first byte in each 16-byte lane, which is 2
static Vector256<byte> makeShufflingVector()
{
    Vector128<byte> res = Vector128<byte>.Zero;
    res = Sse2.Insert( res.AsInt16(), 2, 0 ).AsByte();
    return Vector256.Create( res, res );
}
See _mm_shuffle_epi8 section on page 18 of this article for details.
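Putting that together for the 16-byte vector in the question, a sketch (lut is my name for the lookup table):
// The lookup table is the "bytes being shuffled"; the source vector acts
// as the control mask. Every source byte 0 selects lut[0] = 2, and every
// source byte 1 selects lut[1] = 0.
Vector128<byte> lut = Vector128.Create(
    (byte)2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0);
Vector128<byte> mapped = Ssse3.Shuffle(lut, resultvector);
// <0,0,0,0,1,1,1,1,0,0,0,0,0,0,0,0> becomes <2,2,2,2,0,0,0,0,2,2,2,2,2,2,2,2>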
Update: if you don’t have SSSE3, you can do the same in SSE2, in 2 instructions instead of 1:
static Vector128<byte> replaceZeros( Vector128<byte> src )
{
    src = Sse2.CompareEqual( src, Vector128<byte>.Zero );
    return Sse2.And( src, Vector128.Create( (byte)2 ) );
}
By the way, there’s a performance problem in .NET that prevents compiler from loading constants outside of loops. If you gonna call that method in a loop and want to maximize the performance, consider passing both constant vectors, with zero and 2, as method parameters.

Tensorflowsharp results getvalue() is very slow

I am using TensorflowSharp to run evaluations using a neural network on an Android phone. I am building the project with Unity.
I am using the tensorflowsharp unity plugin listed under the requirements here: https://github.com/Unity-Technologies/ml-agents/blob/master/docs/Using-TensorFlow-Sharp-in-Unity.md.
Everything is working; however, extracting the result is very slow.
The network I am running is an autoencoder and the output is an image with dimensions of 128x128x16 (yes, there are a lot of output channels).
The evaluation is done in ~0.2 seconds, which is acceptable. However, when I need to extract the result data using results[0].GetValue(), it is VERY slow.
This is my code where I run the neural network:
var runner = session.GetRunner();
runner.AddInput(graph[INPUT_NAME][0], tensor).Fetch(graph[OUTPUT_NAME][0]);
var results = runner.Run();
float[,,,] heatmaps = results[0].GetValue() as float[,,,]; // <- this is SLOW
The problem:
The last line, where I convert the result to floats, is taking ~1.2 seconds.
Can it really be true that reading the result data into a float array takes more than 5 times as long as the actual evaluation of the network?
Is there another way to extract the result values?
So I have found a solution to this. I still do not know why the GetValue() call is so slow, but I found another way to retrieve the data.
I chose to manually read the raw tensor data available at results[0].Data
I created a small function to handle this as a drop-in for GetValue (here with the dimensions I am expecting hardcoded):
private float[,,,] TensorToFLoats(TFTensor tensor)
{
    IntPtr resData = tensor.Data;
    UIntPtr dataSize = tensor.TensorByteSize;
    byte[] s_ImageBuffer = new byte[(int)dataSize];
    System.Runtime.InteropServices.Marshal.Copy(resData, s_ImageBuffer, 0, (int)dataSize);
    int floatsLength = s_ImageBuffer.Length / 4;
    float[] floats = new float[floatsLength];
    for (int n = 0; n < s_ImageBuffer.Length; n += 4)
    {
        floats[n / 4] = BitConverter.ToSingle(s_ImageBuffer, n);
    }
    float[,,,] result = new float[1, 128, 128, 16];
    int i = 0;
    for (int y = 0; y < 128; y++)
    {
        for (int x = 0; x < 128; x++)
        {
            for (int p = 0; p < 16; p++)
            {
                result[0, y, x, p] = floats[i++];
            }
        }
    }
    return result;
}
Given this, I can replace the code in my question with the following:
var runner = session.GetRunner();
runner.AddInput(graph[INPUT_NAME][0], tensor).Fetch(graph[OUTPUT_NAME][0]);
var results = runner.Run();
float[,,,] heatmaps = TensorToFLoats(results[0]);
This is much faster: where GetValue() took ~1 second, the TensorToFLoats function I created gets the same data in ~0.02 seconds.
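For what it's worth, the intermediate byte[] and the BitConverter loop can probably be skipped as well: Marshal.Copy has a float[] overload, and Buffer.BlockCopy accepts multidimensional arrays of primitives. A sketch under the same hardcoded dimensions (my untested variant, not from the original answer):
private float[,,,] TensorToFloatsDirect(TFTensor tensor)
{
    // Copy the raw tensor buffer straight into a flat float[].
    int floatCount = (int)((ulong)tensor.TensorByteSize / 4);
    float[] floats = new float[floatCount];
    System.Runtime.InteropServices.Marshal.Copy(tensor.Data, floats, 0, floatCount);
    // Reshape by raw copy; Buffer.BlockCopy works on any primitive-typed
    // arrays, including multidimensional ones, and the element order matches.
    float[,,,] result = new float[1, 128, 128, 16];
    Buffer.BlockCopy(floats, 0, result, 0, floatCount * 4);
    return result;
}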

Accessing processed values from FFT

I am attempting to use Lomont FFT in order to return complex numbers to build a spectrogram / spectral density chart using C#.
I am having trouble understanding how to return values from the class.
Here is the code I have put together thus far which appears to be working.
int read = 0;
Double[] data;
byte[] buffer = new byte[1024];
FileStream wave = new FileStream(args[0], FileMode.Open, FileAccess.Read);
read = wave.Read(buffer, 0, 44);
read = wave.Read(buffer, 0, 1024);
data = new Double[read];
for (int i = 0; i < read; i += 2)
{
    data[i] = BitConverter.ToInt16(buffer, i) / 32768.0;
    Console.WriteLine(data[i]);
}
LomontFFT LFFT = new LomontFFT();
LFFT.FFT(data, true);
What I am not clear on is how to return/access the values from the Lomont FFT implementation back in my application (a console app).
Being pretty new to C# development, I suspect I am missing something fundamental about how to retrieve processed values from the instance of the Lomont class, or am perhaps even calling it incorrectly.
Console.WriteLine(LFFT.A); // Returns 0
Console.WriteLine(LFFT.B); // Returns 1
I have been searching for a code snippet or explanation of how to do this, but so far have come up with nothing that I understand or explains this particular aspect of the issue I am facing. Any guidance would be greatly appreciated.
A subset of the results held in the data array noted in the code above can be found below and, based on my current understanding, appears to be valid:
0.00531005859375
0.0238037109375
0.041473388671875
0.0576171875
0.07183837890625
0.083465576171875
0.092193603515625
0.097625732421875
0.099639892578125
0.098114013671875
0.0931396484375
0.0848388671875
0.07354736328125
0.05963134765625
0.043609619140625
0.026031494140625
0.007476806640625
-0.011260986328125
-0.0296630859375
-0.047027587890625
-0.062713623046875
-0.076141357421875
-0.086883544921875
-0.09454345703125
-0.098785400390625
-0.0994873046875
-0.0966796875
-0.090362548828125
-0.080810546875
-0.06842041015625
-0.05352783203125
-0.036712646484375
-0.0185546875
What am I actually attempting to do? (perspective)
I am looking to load a wave file into a console application and return a spectrogram/spectral density chart/image as a jpg/png for further processing.
The wave files I am reading are mono.
UPDATE 1
I receive slightly different results depending on which FFT is used.
Using RealFFT
for (int i = 0; i < read; i += 2)
{
    data[i] = BitConverter.ToInt16(buffer, i) / 32768.0;
    //Console.WriteLine(data[i]);
}
LomontFFT LFFT = new LomontFFT();
LFFT.RealFFT(data, true);
for (int i = 0; i < buffer.Length / 2; i++)
{
    System.Console.WriteLine("{0}",
        Math.Sqrt(data[2 * i] * data[2 * i] + data[2 * i + 1] * data[2 * i + 1]));
}
Partial Result of RealFFT
0.314566983321381
0.625242818210924
0.30314888696868
0.118468857708093
0.0587697011760449
0.0369034115568654
0.0265842582236275
0.0207195964060356
0.0169601273233317
0.0143745438577886
0.012528799609089
0.0111831275153128
0.0102313284519146
0.00960198279358434
0.00920236001619566
Using FFT
for (int i = 0; i < read; i += 2)
{
    data[i] = BitConverter.ToInt16(buffer, i) / 32768.0;
    //Console.WriteLine(data[i]);
}
double[] bufferB = new double[2 * data.Length];
for (int i = 0; i < data.Length; i++)
{
    bufferB[2 * i] = data[i];
    bufferB[2 * i + 1] = 0;
}
LomontFFT LFFT = new LomontFFT();
LFFT.FFT(bufferB, true);
for (int i = 0; i < bufferB.Length / 2; i++)
{
    System.Console.WriteLine("{0}",
        Math.Sqrt(bufferB[2 * i] * bufferB[2 * i] + bufferB[2 * i + 1] * bufferB[2 * i + 1]));
}
Partial Result of FFT:
0.31456698332138
0.625242818210923
0.303148886968679
0.118468857708092
0.0587697011760447
0.0369034115568653
0.0265842582236274
0.0207195964060355
0.0169601273233317
0.0143745438577886
0.012528799609089
0.0111831275153127
0.0102313284519146
0.00960198279358439
0.00920236001619564
Looking at the LomontFFT.FFT documentation:
Compute the forward or inverse Fourier Transform of data, with
data containing complex valued data as alternating real and
imaginary parts. The length must be a power of 2. The data is
modified in place.
This tells us a few things. First the function is expecting complex-valued data whereas your data is real. A quick fix for this is to create another buffer of twice the size and setting all the imaginary parts to 0:
double[] buffer = new double[2 * data.Length];
for (int i = 0; i < data.Length; i++)
{
    buffer[2 * i] = data[i];
    buffer[2 * i + 1] = 0;
}
The documentation also tells us that the computation is done in place. That means that after the call to FFT returns, the input array is replaced with the computed result. You could thus print the spectrum with:
LomontFFT LFFT = new LomontFFT();
LFFT.FFT(buffer, true);
for (int i = 0; i < buffer.Length / 2; i++)
{
    System.Console.WriteLine("{0}",
        Math.Sqrt(buffer[2 * i] * buffer[2 * i] + buffer[2 * i + 1] * buffer[2 * i + 1]));
}
Note since your input data is real valued you could also use LomontFFT.RealFFT. In that case, given a slightly different packing rule, you would obtain the FFT results using:
LomontFFT LFFT = new LomontFFT();
LFFT.RealFFT(data, true);
System.Console.WriteLine("{0}", Math.Abs(data[0]));
for (int i = 1; i < data.Length / 2; i++)
{
    System.Console.WriteLine("{0}",
        Math.Sqrt(data[2 * i] * data[2 * i] + data[2 * i + 1] * data[2 * i + 1]));
}
System.Console.WriteLine("{0}", Math.Abs(data[1]));
This would give you the non-redundant lower half of the spectrum (Unlike LomontFFT.FFT which provides the entire spectrum). Also, numerical differences on the order of double precision (around 1e-16 times the spectrum peak value) with respect to LomontFFT.FFT can be expected.
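For reference, a sketch of the packing rule the RealFFT code above assumes (worth verifying against the comments in the Lomont source you are using):
// RealFFT in-place packing for N real input samples (N a power of 2):
//   data[0]        -> real part of bin 0 (DC); its imaginary part is zero
//   data[1]        -> real part of bin N/2 (Nyquist); imaginary part is zero
//   data[2*i]      -> real part of bin i,      for 1 <= i < N/2
//   data[2*i + 1]  -> imaginary part of bin i, for 1 <= i < N/2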

Segmented Aggregation within an Array

I have a large array of primitive value types. The array is in fact one-dimensional, but logically represents a 2-dimensional field. As you read from left to right, each value needs to become (the original value of the current cell) + (the result calculated in the cell to the left), with the obvious exception of the first element of each row, which is just the original value.
I already have an implementation which accomplishes this, but is entirely iterative over the entire array and is extremely slow for large (1M+ elements) arrays.
Given the following example array,
0 0 1 0 0
2 0 0 0 3
0 4 1 1 0
0 1 0 4 1
Becomes
0 0 1 1 1
2 2 2 2 5
0 4 5 6 6
0 1 1 5 6
And so forth to the right, up to problematic sizes (1024x1024)
The array needs to be updated (ideally), but another array can be used if necessary. Memory footprint isn't much of an issue here, but performance is critical as these arrays have millions of elements and must be processed hundreds of times per second.
The individual cell calculations do not appear to be parallelizable given their dependence on values to their left, so GPU acceleration seems impossible. I have investigated PLINQ, but the need for indices makes it very difficult to apply.
Is there another way to structure the data to make it faster to process?
If efficient GPU processing is feasible using an innovative teqnique, this would be vastly preferable, as this is currently texture data which is having to be pulled from and pushed back to the video card.
Proper coding and a bit of insight into how .NET works helps as well :-)
Some rules of thumb that apply in this case:
1. If you can hint the JIT that the indexing will never get out of the bounds of the array, it will remove the extra branch.
2. Only split the work across multiple threads if it's really slow (e.g. >1 second). Otherwise task switching, cache flushes, etc. will probably just eat up the added speed and you'll end up worse off.
3. If possible, make memory access predictable, even sequential. If you need another array, so be it - if not, prefer working in place.
4. Use as few IL instructions as possible if you want speed. Generally this seems to work.
5. Test multiple iterations. A single iteration might not be good enough.
Using these rules, you can make a small test case as follows. Note that I've upped the stakes to 4K x 4K, since 1K x 1K is just so fast you cannot measure it :-)
public static void Main(string[] args)
{
    int width = 4096;
    int height = 4096;
    int[] ar = new int[width * height];
    Random rnd = new Random(213);
    for (int i = 0; i < ar.Length; ++i)
    {
        ar[i] = rnd.Next(0, 120);
    }
    // (5)...
    for (int j = 0; j < 10; ++j)
    {
        Stopwatch sw = Stopwatch.StartNew();
        int sum = 0;
        for (int i = 0; i < ar.Length; ++i) // (3) sequential access
        {
            if ((i % width) == 0)
            {
                sum = 0;
            }
            // (1) --> the JIT will notice this won't go out of bounds because [0 <= i < ar.Length]
            // (4) --> '+=' is an expression generating a 'dup'; this creates less IL.
            ar[i] = (sum += ar[i]);
        }
        Console.WriteLine("This took {0:0.0000}s", sw.Elapsed.TotalSeconds);
    }
    Console.ReadLine();
}
One of these iterations will take roughly 0.0174 s here, and since this is about 16x the worst-case scenario you describe, I suppose your performance problem is solved.
If you really want to parallelize it to make it faster, I suppose that is possible, even though you will lose some of the optimizations in the JIT (specifically: (1)). However, if you have a multi-core system like most people, the benefits might outweigh these:
for (int j = 0; j < 10; ++j)
{
    Stopwatch sw = Stopwatch.StartNew();
    Parallel.For(0, height, (a) =>
    {
        // seed the running sum with the row's first element, which stays as-is
        int sum = ar[width * a];
        for (var i = width * a + 1; i < width * (a + 1); i++)
        {
            ar[i] = (sum += ar[i]);
        }
    });
    Console.WriteLine("This took {0:0.0000}s", sw.Elapsed.TotalSeconds);
}
If you really, really need performance, you can compile it to C++ and use P/Invoke. Even if you don't use the GPU, I suppose the SSE/AVX instructions might already give you a significant performance boost that you won't get with .NET/C#. Also I'd like to point out that the Intel C++ compiler can automatically vectorize your code - even for the Xeon Phi. Without a lot of effort, this might give you a nice boost in performance.
Well, I don't know too much about GPU, but I see no reason why you can't parallelize it as the dependencies are only from left to right.
There are no dependencies between rows.
0 0 1 0 0 - process on core1 |
2 0 0 0 3 - process on core1 |
-------------------------------
0 4 1 1 0 - process on core2 |
0 1 0 4 1 - process on core2 |
Although the above statement is not completely true: there are still hidden dependencies between rows when it comes to the memory cache.
It's possible that there will be cache thrashing. You can read about "false sharing" to understand the problem and see how to overcome it.
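A sketch of the usual mitigation: give each thread a contiguous block of rows instead of interleaving them, so writes from different threads stay far apart in memory (variable names follow the test case above):
int threads = Environment.ProcessorCount;
int blockRows = (height + threads - 1) / threads;
Parallel.For(0, threads, t =>
{
    int first = t * blockRows;
    int last = Math.Min(first + blockRows, height);
    for (int row = first; row < last; row++)
    {
        int baseIdx = row * width;
        int sum = ar[baseIdx]; // first cell of the row keeps its value
        for (int i = baseIdx + 1; i < baseIdx + width; i++)
        {
            ar[i] = (sum += ar[i]);
        }
    }
});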
As @Chris Eelmaa said, it is possible to execute this in parallel by row. Using Parallel.For, it could be rewritten like this:
static int[,] values = new int[,]{
    {0, 0, 1, 0, 0},
    {2, 0, 0, 0, 3},
    {0, 4, 1, 1, 0},
    {0, 1, 0, 4, 1}};
static void Main(string[] args)
{
    int rows = values.GetLength(0);
    int columns = values.GetLength(1);
    Parallel.For(0, rows, (row) =>
    {
        for (var column = 1; column < columns; column++)
        {
            values[row, column] += values[row, column - 1];
        }
    });
    for (var row = 0; row < rows; row++)
    {
        for (var column = 0; column < columns; column++)
        {
            Console.Write("{0} ", values[row, column]);
        }
        Console.WriteLine();
    }
}
And since, as stated in your question, you have a one-dimensional array, the code would be a bit faster:
static void Main(string[] args)
{
    var values = new int[1024 * 1024];
    Random r = new Random();
    for (int i = 0; i < 1024; i++)
    {
        for (int j = 0; j < 1024; j++)
        {
            values[i * 1024 + j] = r.Next(25);
        }
    }
    int rows = 1024;
    int columns = 1024;
    Stopwatch sw = Stopwatch.StartNew();
    for (int i = 0; i < 100; i++)
    {
        Parallel.For(0, rows, (row) =>
        {
            for (var column = 1; column < columns; column++)
            {
                values[(row * columns) + column] += values[(row * columns) + column - 1];
            }
        });
    }
    Console.WriteLine(sw.Elapsed);
}
But it is not as fast as a GPU. To use parallel GPU processing you will have to rewrite it in C++ AMP, or take a look at how to port this Parallel.For to CUDAfy: http://w8isms.blogspot.com.es/2012/09/cudafy-me-part-3-of-4.html
You may as well store the array as a jagged array; each row is still a contiguous block of memory. So, instead of,
int[] texture;
you have,
int[][] texture;
Isolate the row operation as,
private static Task ProcessRow(int[] row)
{
    var v = row[0];
    for (var i = 1; i < row.Length; i++)
    {
        v = row[i] += v;
    }
    return Task.FromResult(true);
}
then you can write a function that does,
Task.WhenAll(texture.Select(row => Task.Run(() => ProcessRow(row)))).Wait();
If you want to remain with a 1-dimensional array, a similar approach will work, just change ProcessRow.
private static Task ProcessRow(int[] texture, int start, int limit)
{
    var v = texture[start];
    for (var i = start + 1; i < limit; i++)
    {
        v = texture[i] += v;
    }
    return Task.FromResult(true);
}
then once,
var rowSize = 1024;
var rows =
    Enumerable.Range(0, texture.Length / rowSize)
        .Select(i => Tuple.Create(i * rowSize, (i * rowSize) + rowSize))
        .ToArray();
then, on each cycle:
Task.WhenAll(rows.Select(t => Task.Run(() => ProcessRow(texture, t.Item1, t.Item2)))).Wait();
Either way, each row is processed in parallel.

LockBits Performance Critical Code

I have a method which needs to be as fast as it possibly can be. It uses unsafe memory pointers, and it's my first foray into this type of coding, so I know it can probably be faster.
/// <summary>
/// Copies bitmapdata from one bitmap to another at a specified point on the output bitmapdata
/// </summary>
/// <param name="sourcebtmpdata">The sourcebitmap must be smaller that the destbitmap</param>
/// <param name="destbtmpdata"></param>
/// <param name="point">The point on the destination bitmap to draw at</param>
private static unsafe void CopyBitmapToDest(BitmapData sourcebtmpdata, BitmapData destbtmpdata, Point point)
{
    //TODO: It is expected that the bitmap PixelFormat is Format24bppRgb but this could change in the future
    // calculate total number of rows to draw.
    var totalRow = Math.Min(
        destbtmpdata.Height - point.Y,
        sourcebtmpdata.Height);
    //loop through each row on the source bitmap and get mem pointers
    //to the source bitmap and dest bitmap
    for (int i = 0; i < totalRow; i++)
    {
        int destRow = point.Y + i;
        //get the pointer to the start of the current pixel "row" on the output image
        byte* destRowPtr = (byte*)destbtmpdata.Scan0 + (destRow * destbtmpdata.Stride);
        //get the pointer to the start of the FIRST pixel row on the source image
        byte* srcRowPtr = (byte*)sourcebtmpdata.Scan0 + (i * sourcebtmpdata.Stride);
        int pointX = point.X;
        //the rowSize is pre-computed before the loop to improve performance
        int rowSize = Math.Min(destbtmpdata.Width - pointX, sourcebtmpdata.Width);
        //for each row, set each pixel
        for (int j = 0; j < rowSize; j++)
        {
            int firstBlueByte = ((pointX + j) * 3);
            int srcByte = j * 3;
            destRowPtr[firstBlueByte] = srcRowPtr[srcByte];
            destRowPtr[firstBlueByte + 1] = srcRowPtr[srcByte + 1];
            destRowPtr[firstBlueByte + 2] = srcRowPtr[srcByte + 2];
        }
    }
}
So is there anything that can be done to make this faster? Ignore the TODO for now; I'll fix that later once I have some baseline performance measurements.
UPDATE: Sorry, I should have mentioned that the reason I'm using this instead of Graphics.DrawImage is that I'm implementing multi-threading, and because of that I can't use DrawImage.
UPDATE 2: I'm still not satisfied with the performance and I'm sure there are a few more ms that can be had.
There was something fundamentally wrong with the code that I can't believe I didn't notice until now.
byte* destRowPtr = (byte*)destbtmpdata.Scan0 + (destRow * destbtmpdata.Stride);
This gets a pointer to the destination row, but it does not account for the column that it is copying to; in the old code that was done inside the rowSize loop. It now looks like:
byte* destRowPtr = (byte*)destbtmpdata.Scan0 + (destRow * destbtmpdata.Stride) + pointX * 3;
So now we have the correct pointer for the destination data. Now we can get rid of that for loop. Using suggestions from Vilx- and Rob the code now looks like:
private static unsafe void CopyBitmapToDestSuperFast(BitmapData sourcebtmpdata, BitmapData destbtmpdata, Point point)
{
    //calculate total number of rows to copy.
    //using ternary operator instead of Math.Min, few ms faster
    int totalRows = (destbtmpdata.Height - point.Y < sourcebtmpdata.Height) ? destbtmpdata.Height - point.Y : sourcebtmpdata.Height;
    //calculate the width of the image to draw, this cuts off the image
    //if it goes past the width of the destination image
    int rowWidth = (destbtmpdata.Width - point.X < sourcebtmpdata.Width) ? destbtmpdata.Width - point.X : sourcebtmpdata.Width;
    //loop through each row on the source bitmap and get mem pointers
    //to the source bitmap and dest bitmap
    for (int i = 0; i < totalRows; i++)
    {
        int destRow = point.Y + i;
        //get the pointer to the start of the current pixel "row" and column on the output image
        byte* destRowPtr = (byte*)destbtmpdata.Scan0 + (destRow * destbtmpdata.Stride) + point.X * 3;
        //get the pointer to the start of the FIRST pixel row on the source image
        byte* srcRowPtr = (byte*)sourcebtmpdata.Scan0 + (i * sourcebtmpdata.Stride);
        //RtlMoveMemory function
        CopyMemory(new IntPtr(destRowPtr), new IntPtr(srcRowPtr), (uint)rowWidth * 3);
    }
}
Copying a 500x500 image to a 5000x5000 image in a grid 50 times took: 00:00:07.9948993 secs. Now with the changes above it takes 00:00:01.8714263 secs. Much better.
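For completeness, a declaration for the CopyMemory call used above (not shown in the original; this is the standard RtlMoveMemory P/Invoke):
using System.Runtime.InteropServices;
// kernel32 exports RtlMoveMemory; the Win32 CopyMemory macro maps to it.
[DllImport("kernel32.dll", EntryPoint = "RtlMoveMemory", SetLastError = false)]
private static extern void CopyMemory(IntPtr dest, IntPtr src, uint count);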
Well... I'm not sure whether .NET bitmap data formats are entirely compatible with Windows's GDI32 functions...
But one of the first few Win32 API I learned was BitBlt:
BOOL BitBlt(
HDC hdcDest,
int nXDest,
int nYDest,
int nWidth,
int nHeight,
HDC hdcSrc,
int nXSrc,
int nYSrc,
DWORD dwRop
);
And it was the fastest way to copy data around, if I remember correctly.
Here's the BitBlt PInvoke signature for use in C# and related usage information, a great read for anyone working with high-performance graphics in C#:
http://www.pinvoke.net/default.aspx/gdi32/BitBlt.html
Definitely worth a look.
The inner loop is where you want to concentrate a lot of your time (but do measurements to make sure):
for (int j = 0; j < sourcebtmpdata.Width; j++)
{
    destRowPtr[(point.X + j) * 3] = srcRowPtr[j * 3];
    destRowPtr[((point.X + j) * 3) + 1] = srcRowPtr[(j * 3) + 1];
    destRowPtr[((point.X + j) * 3) + 2] = srcRowPtr[(j * 3) + 2];
}
Get rid of the multiplies and the array indexing (which is a multiply under the hood) and replace them with a pointer that you are incrementing.
Ditto with the +1, +2: increment a pointer.
The compiler probably won't keep recomputing point.X (check), but make a local variable just in case. It won't matter on a single iteration, but it might across iterations.
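A sketch of that rewrite (the local names are mine; it assumes the same 24bpp layout and the pointX/rowSize locals from the question):
// advance two pointers instead of recomputing byte offsets per pixel
byte* dst = destRowPtr + pointX * 3; // start of the destination pixel run
byte* src = srcRowPtr;
for (int j = 0; j < rowSize; j++)
{
    *dst++ = *src++; // blue
    *dst++ = *src++; // green
    *dst++ = *src++; // red
}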
You may want to look at Eigen.
It is a C++ template library that uses SSE (2 and later) and AltiVec instruction sets with graceful fallback to non-vectorized code.
Fast. (See benchmark).
Expression templates allow to intelligently remove temporaries and enable lazy evaluation, when that is appropriate -- Eigen takes care of this automatically and handles aliasing too in most cases.
Explicit vectorization is performed for the SSE (2 and later) and AltiVec instruction sets, with graceful fallback to non-vectorized code. Expression templates allow to perform these optimizations globally for whole expressions.
With fixed-size objects, dynamic memory allocation is avoided, and the loops are unrolled when that makes sense.
For large matrices, special attention is paid to cache-friendliness.
You could implement your function in C++ and then call that from C#.
You don't always need to use pointers to get good speed. This should be within a couple ms of the original:
private static void CopyBitmapToDest(BitmapData sourcebtmpdata, BitmapData destbtmpdata, Point point)
{
    byte[] src = new byte[sourcebtmpdata.Height * sourcebtmpdata.Width * 3];
    int maximum = src.Length;
    byte[] dest = new byte[maximum];
    Marshal.Copy(sourcebtmpdata.Scan0, src, 0, src.Length);
    int pointX = point.X * 3;
    int copyLength = destbtmpdata.Width * 3 - pointX;
    int k = pointX + point.Y * sourcebtmpdata.Stride;
    int rowWidth = sourcebtmpdata.Stride;
    while (k < maximum)
    {
        Array.Copy(src, k, dest, k, copyLength);
        k += rowWidth;
    }
    Marshal.Copy(dest, 0, destbtmpdata.Scan0, dest.Length);
}
Unfortunately I don't have the time to write a full solution, but I would look into using the platform RtlMoveMemory() function to move rows as a whole, not byte-by-byte. That should be a lot faster.
I think the stride size and row number limits can be calculated in advance.
And I precalculated all multiplications, resulting in the following code:
private static unsafe void CopyBitmapToDest(BitmapData sourcebtmpdata, BitmapData destbtmpdata, Point point)
{
    //TODO: It is expected that the bitmap PixelFormat is Format24bppRgb but this could change in the future
    const int pixelSize = 3;
    // calculate total number of rows to draw.
    var totalRow = Math.Min(
        destbtmpdata.Height - point.Y,
        sourcebtmpdata.Height);
    var rowSize = Math.Min(
        (destbtmpdata.Width - point.X) * pixelSize,
        sourcebtmpdata.Width * pixelSize);
    // starting point of copy operation, including the destination column offset
    byte* srcPtr = (byte*)sourcebtmpdata.Scan0;
    byte* destPtr = (byte*)destbtmpdata.Scan0 + point.Y * destbtmpdata.Stride + point.X * pixelSize;
    // loop through each row
    for (int i = 0; i < totalRow; i++) {
        // draw the entire row
        for (int j = 0; j < rowSize; j++)
            destPtr[j] = srcPtr[j];
        // advance each pointer by 1 row
        destPtr += destbtmpdata.Stride;
        srcPtr += sourcebtmpdata.Stride;
    }
}
Haven't tested it thoroughly, but you should be able to get that to work.
I have removed multiplication operations from the loop (pre-calculated instead) and removed most branching, so it should be somewhat faster.
Let me know if this helps :-)
I am looking at your C# code and I can't recognize anything familiar. It all looks like a ton of C++. BTW, it looks like DirectX/XNA needs to become your new friend. Just my 2 cents. Don't kill the messenger.
If you must rely on the CPU to do this: I've done some 24-bit layout optimizations myself, and I can tell you that memory access speed should be your bottleneck. Use SSE3 instructions for the fastest possible bytewise access. This means C++ and embedded assembly language. In pure C you'll be 30% slower on most machines.
Keep in mind that modern GPUs are MUCH faster than CPU in this sort of operations.
I am not sure if this will give extra performance, but I see the pattern a lot in Reflector.
So:
int srcByte = j *3;
destRowPtr[(firstBlueByte)] = srcRowPtr[srcByte];
destRowPtr[(firstBlueByte) + 1] = srcRowPtr[srcByte + 1];
destRowPtr[(firstBlueByte) + 2] = srcRowPtr[srcByte + 2];
Becomes:
*destRowPtr++ = *srcRowPtr++;
*destRowPtr++ = *srcRowPtr++;
*destRowPtr++ = *srcRowPtr++;
Probably needs more braces.
If the width is fixed, you could probably unroll the entire line into a few hundred lines. :)
Update
You could also try using a bigger type, e.g. Int32 or Int64, for better performance.
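A sketch of the wider-type idea (my code; it assumes the pointX/rowSize locals from the question, and that any remainder still needs a byte-wise tail copy):
int byteCount = rowSize * 3;                    // rowSize pixels, 3 bytes each
uint* dst32 = (uint*)(destRowPtr + pointX * 3); // start of the destination run
uint* src32 = (uint*)srcRowPtr;
for (int j = 0; j < byteCount / 4; j++)
{
    *dst32++ = *src32++;                        // moves 4 bytes per iteration
}
// the remaining byteCount % 4 bytes would be copied byte-by-byte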
Alright, this is going to be fairly close to the line of how many ms you can get out of the algorithm, but get rid of the call to Math.Min and replace it with a ternary operator instead.
Generally, making a library call will take longer than doing something on your own, and I made a simple test driver to confirm this for Math.Min.
using System;
using System.Diagnostics;

namespace TestDriver
{
    class Program
    {
        static void Main(string[] args)
        {
            // Start the stopwatch
            if (Stopwatch.IsHighResolution)
            { Console.WriteLine("Using high resolution timer"); }
            else
            { Console.WriteLine("High resolution timer unavailable"); }
            // Test Math.Min for 10000 iterations
            Stopwatch sw = Stopwatch.StartNew();
            for (int ndx = 0; ndx < 10000; ndx++)
            {
                int result = Math.Min(ndx, 5000);
            }
            Console.WriteLine(sw.Elapsed.TotalMilliseconds.ToString("0.0000"));
            // Test ternary operator for 10000 iterations
            sw = Stopwatch.StartNew();
            for (int ndx = 0; ndx < 10000; ndx++)
            {
                int result = (ndx < 5000) ? ndx : 5000;
            }
            Console.WriteLine(sw.Elapsed.TotalMilliseconds.ToString("0.0000"));
            Console.ReadKey();
        }
    }
}
The results below are from running the above on my computer, an Intel T2400 @ 1.83GHz. Note that there is a bit of variation in the results, but generally the ternary operator is faster by about 0.01 ms. That's not much, but over a big enough dataset it will add up.
Using high resolution timer
0.0539
0.0402
