I'm having performance issue when I copy the data from input (ReadOnlySpan) to output (Span) using (Loops like 'for')
there is Span.CopyTo, it's perfect and very fast
but for now it's useless without converting the pixels.
below is the code, I have feeling that there is some short way to do that instead of the current process:
public unsafe void UpdateFromOutput(CanvasDevice device, ReadOnlySpan<byte> data, uint width, uint height, uint pitch)
{
using (var renderTargetMap = new BitmapMap(device, RenderTarget))
{
var inputPitch = (int)pitch;
var mapPitch = (int)renderTargetMap.PitchBytes;
var mapData = new Span<byte>(new IntPtr(renderTargetMap.Data).ToPointer(), (int)RenderTarget.Size.Height * mapPitch);
switch (CurrentPixelFormat)
{
case PixelFormats.RGB0555:
FramebufferConverter.ConvertFrameBufferRGB0555ToXRGB8888(width, height, data, inputPitch, mapData, mapPitch);
break;
case PixelFormats.RGB565:
FramebufferConverter.ConvertFrameBufferRGB565ToXRGB8888(width, height, data, inputPitch, mapData, mapPitch);
break;
}
}
}
then inside function like ConvertFrameBufferRGB0555ToXRGB8888
I will go through width and height like below:
var castInput = MemoryMarshal.Cast<byte, ushort>(input);
var castInputPitch = inputPitch / sizeof(ushort);
var castOutput = MemoryMarshal.Cast<byte, uint>(output);
var castOutputPitch = outputPitch / sizeof(uint);
castOutput.Fill(0);
for (var i = 0; i < height;i++)
{
var inputLine = castInput.Slice(i * castInputPitch, castInputPitch);
var outputLine = castOutput.Slice(i * castOutputPitch, castOutputPitch);
for (var j = 0; j < width;j++)
{
outputLine[j] = ConverToRGB888(inputLine[j]);
}
}
The code above working but slow in some cases.
Please note: I'm modifying a project so the code above was written by the original developer and I need help because I don't understand how the process is working, still very confused.. specially in the Slice part.
Tried as test only to copy the input to output directly data.CopyTo(mapData); and I got this (as expected):
Hope there is some solution with Marshal and Span functions
Many thanks.
Update regarding (ConverToRGB888)
As for ConverToRGB888, the original code contains RGB565LookupTable:
private const uint LookupTableSize = ushort.MaxValue + 1;
private static uint[] RGB565LookupTable = new uint[LookupTableSize];
public static void SetRGB0565LookupTable()
{
uint r565, g565, b565;
double red = 255.0;
double green = 255.0;
double blue = 255.0;
for (uint i = 0; i < LookupTableSize; i++)
{
//RGB565
r565 = (i >> 11) & 0x1F;
g565 = (i >> 5) & 0x3F;
b565 = (i & 0x1F);
r565 = (uint)Math.Round(r565 * red / 31.0);
g565 = (uint)Math.Round(g565 * green / 63.0);
b565 = (uint)Math.Round(b565 * blue / 31.0);
RGB565LookupTable[i] = (0xFF000000 | r565 << 16 | g565 << 8 | b565);
}
}
private static uint ConverToRGB888(ushort x)
{
return RGB565LookupTable[x];
}
SetRGB0565LookupTable() will be called only once to fill the values.
Conclusion:
The Fill(0) was not important and it was causing delay
The unsafe version (accepted answer) was clear for me and a bit faster
Avoiding For partially even faster like Here [Tested]
Pre-Lookup table is very helpful and made the conversion faster
Memory helpers like Span.CopyTo, Buffer.MemoryCopy source available Here
Using Parallel.For is faster in some cases with help of UnsafeMemory
If you have input with (Pixels Type) supported by Win2D there is possibility to avoid loops like:
byte[] dataBytes = new byte[data.Length];
fixed (byte* inputPointer = &data[0])
Marshal.Copy((IntPtr)inputPointer, dataBytes, 0, data.Length);
RenderTarget = CanvasBitmap.CreateFromBytes(renderPanel, dataBytes, (int)width, (int)height, DirectXPixelFormat.R8G8UIntNormalized, 92, CanvasAlphaMode.Ignore);
but not sure from the last point as I wasn't able to test on 565,555.
Thanks for DekuDesu the explanation and simplified version he provide helped me to do more tests.
I'm having performance issue when I copy the data from input (ReadOnlySpan) to output (Span) using (Loops like 'for')
The the code you provided is already pretty safe and has the best complexity you're going get for pixel-by-pixel operations. The presence of nested for loops does not necessarily correspond to performance issues or increased complexity.
I need help because I don't understand how the process is working, still very confused.. specially in the Slice part.
This code looks like it's meant to convert one bitmap format to another. Bitmaps come in varying sizes and formats. Because of this they include an additional piece of information along with width and height, pitch.
Pitch is the distance in bytes between two lines of pixel information, this is used to account for formats that don't include full 32/64bit color information.
Knowing this I commented the method in question as to help explain what it's doing.
public static void ConvertFrameBufferRGB565ToXRGB8888(uint width, uint height, ReadOnlySpan<byte> input, int inputPitch, Span<byte> output, int outputPitch)
{
// convert the span of bytes into a span of ushorts
// so we can use span[i] to get a ushort
var castInput = MemoryMarshal.Cast<byte, ushort>(input);
// pitch is the number of bytes between the first byte of a line and the first byte of the next line
// convert the pitch from bytes into ushort pitch
var castInputPitch = inputPitch / sizeof(ushort);
// convert the span of bytes into a span of ushorts
// so we can use span[i] to get a ushort
var castOutput = MemoryMarshal.Cast<byte, uint>(output);
var castOutputPitch = outputPitch / sizeof(uint);
for (var i = 0; i < height; i++)
{
// get a line from the input
// remember that pitch is the number of ushorts between lines
// so i * pitch here gives us the index of the i'th line, and we don't need the padding
// ushorts at the end so we only take castInputPitch number of ushorts
var inputLine = castInput.Slice(i * castInputPitch, castInputPitch);
// same thing as above but for the output
var outputLine = castOutput.Slice(i * castOutputPitch, castOutputPitch);
for (var j = 0; j < width; j++)
{
// iterate through the line, converting each pixel and storing it in the output span
outputLine[j] = ConverToRGB888(inputLine[j]);
}
}
}
Fastest way to copy data from ReadOnlySpan to output with pixel conversion
Honestly the method you provided is just fine, it's safe and fast ish. Keep in mind that copying data like bitmaps linearly on CPU's is an inherently slow process. The most performance savings you could hope for is avoiding copying data redundantly. Unless this needs absolutely blazing speed I would not recommend changes other than removing .fill(0) since it's probably unnecessary, but you would have to test that.
If you ABSOLUTELY must get more performance out of this you may want to consider something like what I've provided below. I caution you however, unsafe code like this is well.. unsafe and is prone to errors. It has almost no error checking and makes a LOT of assumptions, so that's up for you to implement.
If it's still not fast enough consider writing a .dll in C and use interop maybe.
public static unsafe void ConvertExtremelyUnsafe(ulong height, ref byte inputArray, ulong inputLength, ulong inputPitch, ref byte outputArray, ulong outputLength, ulong outputPitch)
{
// pin down pointers so they dont move on the heap
fixed (byte* inputPointer = &inputArray, outputPointer = &outputArray)
{
// since we have to account for padding we should go line by line
for (ulong y = 0; y < height; y++)
{
// get a pointer for the first byte of the line of the input
byte* inputLinePointer = inputPointer + (y * inputPitch);
// get a pointer for the first byte of the line of the output
byte* outputLinePointer = outputPointer + (y * outputPitch);
// traverse the input line by ushorts
for (ulong i = 0; i < (inputPitch / sizeof(ushort)); i++)
{
// calculate the offset for the i'th ushort,
// becuase we loop based on the input and ushort we dont need an index check here
ulong inputOffset = i * sizeof(ushort);
// get a pointer to the i'th ushort
ushort* rgb565Pointer = (ushort*)(inputLinePointer + inputOffset);
ushort rgb565Value = *rgb565Pointer;
// convert the rgb to the other format
uint rgb888Value = ConverToRGB888(rgb565Value);
// calculate the offset for i'th uint
ulong outputOffset = i * sizeof(uint);
// at least attempt to avoid overflowing a buffer, not that the runtime would let you do that, i would hope..
if (outputOffset >= outputLength)
{
throw new IndexOutOfRangeException($"{nameof(outputArray)}[{outputOffset}]");
}
// get a pointer to the i'th uint
uint* rgb888Pointer = (uint*)(outputLinePointer + outputOffset);
// write the bytes of the rgb888 to the output array
*rgb888Pointer = rgb888Value;
}
}
}
}
disclaimer: I wrote this on mobile
I need to fill a byte[] with a single non-zero value. How can I do this in C# without looping through each byte in the array?
Update: The comments seem to have split this into two questions -
Is there a Framework method to fill a byte[] that might be akin to memset
What is the most efficient way to do it when we are dealing with a very large array?
I totally agree that using a simple loop works just fine, as Eric and others have pointed out. The point of the question was to see if I could learn something new about C# :) I think Juliet's method for a Parallel operation should be even faster than a simple loop.
Benchmarks:
Thanks to Mikael Svenson: http://techmikael.blogspot.com/2009/12/filling-array-with-default-value.html
It turns out the simple for loop is the way to go unless you want to use unsafe code.
Apologies for not being clearer in my original post. Eric and Mark are both correct in their comments; need to have more focused questions for sure. Thanks for everyone's suggestions and responses.
You could use Enumerable.Repeat:
byte[] a = Enumerable.Repeat((byte)10, 100).ToArray();
The first parameter is the element you want repeated, and the second parameter is the number of times to repeat it.
This is OK for small arrays but you should use the looping method if you are dealing with very large arrays and performance is a concern.
Actually, there is little known IL operation called Initblk (English version) which does exactly that. So, let's use it as a method that doesn't require "unsafe". Here's the helper class:
public static class Util
{
static Util()
{
var dynamicMethod = new DynamicMethod("Memset", MethodAttributes.Public | MethodAttributes.Static, CallingConventions.Standard,
null, new [] { typeof(IntPtr), typeof(byte), typeof(int) }, typeof(Util), true);
var generator = dynamicMethod.GetILGenerator();
generator.Emit(OpCodes.Ldarg_0);
generator.Emit(OpCodes.Ldarg_1);
generator.Emit(OpCodes.Ldarg_2);
generator.Emit(OpCodes.Initblk);
generator.Emit(OpCodes.Ret);
MemsetDelegate = (Action<IntPtr, byte, int>)dynamicMethod.CreateDelegate(typeof(Action<IntPtr, byte, int>));
}
public static void Memset(byte[] array, byte what, int length)
{
var gcHandle = GCHandle.Alloc(array, GCHandleType.Pinned);
MemsetDelegate(gcHandle.AddrOfPinnedObject(), what, length);
gcHandle.Free();
}
public static void ForMemset(byte[] array, byte what, int length)
{
for(var i = 0; i < length; i++)
{
array[i] = what;
}
}
private static Action<IntPtr, byte, int> MemsetDelegate;
}
And what is the performance? Here's my result for Windows/.NET and Linux/Mono (different PCs).
Mono/for: 00:00:01.1356610
Mono/initblk: 00:00:00.2385835
.NET/for: 00:00:01.7463579
.NET/initblk: 00:00:00.5953503
So it's worth considering. Note that the resulting IL will not be verifiable.
Building on Lucero's answer, here is a faster version. It will double the number of bytes copied using Buffer.BlockCopy every iteration. Interestingly enough, it outperforms it by a factor of 10 when using relatively small arrays (1000), but the difference is not that large for larger arrays (1000000), it is always faster though. The good thing about it is that it performs well even down to small arrays. It becomes faster than the naive approach at around length = 100. For a one million element byte array, it was 43 times faster.
(tested on Intel i7, .Net 2.0)
public static void MemSet(byte[] array, byte value) {
if (array == null) {
throw new ArgumentNullException("array");
}
int block = 32, index = 0;
int length = Math.Min(block, array.Length);
//Fill the initial array
while (index < length) {
array[index++] = value;
}
length = array.Length;
while (index < length) {
Buffer.BlockCopy(array, 0, array, index, Math.Min(block, length-index));
index += block;
block *= 2;
}
}
A little bit late, but the following approach might be a good compromise without reverting to unsafe code. Basically it initializes the beginning of the array using a conventional loop and then reverts to Buffer.BlockCopy(), which should be as fast as you can get using a managed call.
public static void MemSet(byte[] array, byte value) {
if (array == null) {
throw new ArgumentNullException("array");
}
const int blockSize = 4096; // bigger may be better to a certain extent
int index = 0;
int length = Math.Min(blockSize, array.Length);
while (index < length) {
array[index++] = value;
}
length = array.Length;
while (index < length) {
Buffer.BlockCopy(array, 0, array, index, Math.Min(blockSize, length-index));
index += blockSize;
}
}
Looks like System.Runtime.CompilerServices.Unsafe.InitBlock now does the same thing as the OpCodes.Initblk instruction that Konrad's answer mentions (he also mentioned a source link).
The code to fill in the array is as follows:
byte[] a = new byte[N];
byte valueToFill = 255;
System.Runtime.CompilerServices.Unsafe.InitBlock(ref a[0], valueToFill, (uint) a.Length);
This simple implementation uses successive doubling, and performs quite well (about 3-4 times faster than the naive version according to my benchmarks):
public static void Memset<T>(T[] array, T elem)
{
int length = array.Length;
if (length == 0) return;
array[0] = elem;
int count;
for (count = 1; count <= length/2; count*=2)
Array.Copy(array, 0, array, count, count);
Array.Copy(array, 0, array, count, length - count);
}
Edit: upon reading the other answers, it seems I'm not the only one with this idea. Still, I'm leaving this here, since it's a bit cleaner and it performs on par with the others.
If performance is critical, you could consider using unsafe code and working directly with a pointer to the array.
Another option could be importing memset from msvcrt.dll and use that. However, the overhead from invoking that might easily be larger than the gain in speed.
Or use P/Invoke way:
[DllImport("msvcrt.dll",
EntryPoint = "memset",
CallingConvention = CallingConvention.Cdecl,
SetLastError = false)]
public static extern IntPtr MemSet(IntPtr dest, int c, int count);
static void Main(string[] args)
{
byte[] arr = new byte[3];
GCHandle gch = GCHandle.Alloc(arr, GCHandleType.Pinned);
MemSet(gch.AddrOfPinnedObject(), 0x7, arr.Length);
}
With the advent of Span<T> (which is dotnet core only, but it is the future of dotnet) you have yet another way of solving this problem:
var array = new byte[100];
var span = new Span<byte>(array);
span.Fill(255);
If performance is absolutely critical, then Enumerable.Repeat(n, m).ToArray() will be too slow for your needs. You might be able to crank out faster performance using PLINQ or Task Parallel Library:
using System.Threading.Tasks;
// ...
byte initialValue = 20;
byte[] data = new byte[size]
Parallel.For(0, size, index => data[index] = initialValue);
All answers are writing single bytes only - what if you want to fill a byte array with words? Or floats? I find use for that now and then. So after having written similar code to 'memset' in a non-generic way a few times and arriving at this page to find good code for single bytes, I went about writing the method below.
I think PInvoke and C++/CLI each have their drawbacks. And why not have the runtime 'PInvoke' for you into mscorxxx? Array.Copy and Buffer.BlockCopy are native code certainly. BlockCopy isn't even 'safe' - you can copy a long halfway over another, or over a DateTime as long as they're in arrays.
At least I wouldn't go file new C++ project for things like this - it's a waste of time almost certainly.
So here's basically an extended version of the solutions presented by Lucero and TowerOfBricks that can be used to memset longs, ints, etc as well as single bytes.
public static class MemsetExtensions
{
static void MemsetPrivate(this byte[] buffer, byte[] value, int offset, int length) {
var shift = 0;
for (; shift < 32; shift++)
if (value.Length == 1 << shift)
break;
if (shift == 32 || value.Length != 1 << shift)
throw new ArgumentException(
"The source array must have a length that is a power of two and be shorter than 4GB.", "value");
int remainder;
int count = Math.DivRem(length, value.Length, out remainder);
var si = 0;
var di = offset;
int cx;
if (count < 1)
cx = remainder;
else
cx = value.Length;
Buffer.BlockCopy(value, si, buffer, di, cx);
if (cx == remainder)
return;
var cachetrash = Math.Max(12, shift); // 1 << 12 == 4096
si = di;
di += cx;
var dx = offset + length;
// doubling up to 1 << cachetrash bytes i.e. 2^12 or value.Length whichever is larger
for (var al = shift; al <= cachetrash && di + (cx = 1 << al) < dx; al++) {
Buffer.BlockCopy(buffer, si, buffer, di, cx);
di += cx;
}
// cx bytes as long as it fits
for (; di + cx <= dx; di += cx)
Buffer.BlockCopy(buffer, si, buffer, di, cx);
// tail part if less than cx bytes
if (di < dx)
Buffer.BlockCopy(buffer, si, buffer, di, dx - di);
}
}
Having this you can simply add short methods to take the value type you need to memset with and call the private method, e.g. just find replace ulong in this method:
public static void Memset(this byte[] buffer, ulong value, int offset, int count) {
var sourceArray = BitConverter.GetBytes(value);
MemsetPrivate(buffer, sourceArray, offset, sizeof(ulong) * count);
}
Or go silly and do it with any type of struct (although the MemsetPrivate above only works for structs that marshal to a size that is a power of two):
public static void Memset<T>(this byte[] buffer, T value, int offset, int count) where T : struct {
var size = Marshal.SizeOf<T>();
var ptr = Marshal.AllocHGlobal(size);
var sourceArray = new byte[size];
try {
Marshal.StructureToPtr<T>(value, ptr, false);
Marshal.Copy(ptr, sourceArray, 0, size);
} finally {
Marshal.FreeHGlobal(ptr);
}
MemsetPrivate(buffer, sourceArray, offset, count * size);
}
I changed the initblk mentioned before to take ulongs to compare performance with my code and that silently fails - the code runs but the resulting buffer contains the least significant byte of the ulong only.
Nevertheless I compared the performance writing as big a buffer with for, initblk and my memset method. The times are in ms total over 100 repetitions writing 8 byte ulongs whatever how many times fit the buffer length. The for version is manually loop-unrolled for the 8 bytes of a single ulong.
Buffer Len #repeat For millisec Initblk millisec Memset millisec
0x00000008 100 For 0,0032 Initblk 0,0107 Memset 0,0052
0x00000010 100 For 0,0037 Initblk 0,0102 Memset 0,0039
0x00000020 100 For 0,0032 Initblk 0,0106 Memset 0,0050
0x00000040 100 For 0,0053 Initblk 0,0121 Memset 0,0106
0x00000080 100 For 0,0097 Initblk 0,0121 Memset 0,0091
0x00000100 100 For 0,0179 Initblk 0,0122 Memset 0,0102
0x00000200 100 For 0,0384 Initblk 0,0123 Memset 0,0126
0x00000400 100 For 0,0789 Initblk 0,0130 Memset 0,0189
0x00000800 100 For 0,1357 Initblk 0,0153 Memset 0,0170
0x00001000 100 For 0,2811 Initblk 0,0167 Memset 0,0221
0x00002000 100 For 0,5519 Initblk 0,0278 Memset 0,0274
0x00004000 100 For 1,1100 Initblk 0,0329 Memset 0,0383
0x00008000 100 For 2,2332 Initblk 0,0827 Memset 0,0864
0x00010000 100 For 4,4407 Initblk 0,1551 Memset 0,1602
0x00020000 100 For 9,1331 Initblk 0,2768 Memset 0,3044
0x00040000 100 For 18,2497 Initblk 0,5500 Memset 0,5901
0x00080000 100 For 35,8650 Initblk 1,1236 Memset 1,5762
0x00100000 100 For 71,6806 Initblk 2,2836 Memset 3,2323
0x00200000 100 For 77,8086 Initblk 2,1991 Memset 3,0144
0x00400000 100 For 131,2923 Initblk 4,7837 Memset 6,8505
0x00800000 100 For 263,2917 Initblk 16,1354 Memset 33,3719
I excluded the first call every time, since both initblk and memset take a hit of I believe it was about .22ms for the first call. Slightly surprising my code is faster for filling short buffers than initblk, seeing it got half a page full of setup code.
If anybody feels like optimizing this, go ahead really. It's possible.
Tested several ways, described in different answers.
See sources of test in c# test class
You could do it when you initialize the array but I don't think that's what you are asking for:
byte[] myBytes = new byte[5] { 1, 1, 1, 1, 1};
.NET Core has a built-in Array.Fill() function, but sadly .NET Framework is missing it. .NET Core has two variations: fill the entire array and fill a portion of the array starting at an index.
Building on the ideas above, here is a more generic Fill function that will fill the entire array of several data types. This is the fastest function when benchmarking against other methods discussed in this post.
This function, along with the version that fills a portion an array are available in an open source and free NuGet package (HPCsharp on nuget.org). Also included is a slightly faster version of Fill using SIMD/SSE instructions that performs only memory writes, whereas BlockCopy-based methods perform memory reads and writes.
public static void FillUsingBlockCopy<T>(this T[] array, T value) where T : struct
{
int numBytesInItem = 0;
if (typeof(T) == typeof(byte) || typeof(T) == typeof(sbyte))
numBytesInItem = 1;
else if (typeof(T) == typeof(ushort) || typeof(T) != typeof(short))
numBytesInItem = 2;
else if (typeof(T) == typeof(uint) || typeof(T) != typeof(int))
numBytesInItem = 4;
else if (typeof(T) == typeof(ulong) || typeof(T) != typeof(long))
numBytesInItem = 8;
else
throw new ArgumentException(string.Format("Type '{0}' is unsupported.", typeof(T).ToString()));
int block = 32, index = 0;
int endIndex = Math.Min(block, array.Length);
while (index < endIndex) // Fill the initial block
array[index++] = value;
endIndex = array.Length;
for (; index < endIndex; index += block, block *= 2)
{
int actualBlockSize = Math.Min(block, endIndex - index);
Buffer.BlockCopy(array, 0, array, index * numBytesInItem, actualBlockSize * numBytesInItem);
}
}
Most of answers is for byte memset but if you want to use it for float or any other struct you should multiply index by size of your data. Because Buffer.BlockCopy will copy based on the bytes.
This code will be work for float values
public static void MemSet(float[] array, float value) {
if (array == null) {
throw new ArgumentNullException("array");
}
int block = 32, index = 0;
int length = Math.Min(block, array.Length);
//Fill the initial array
while (index < length) {
array[index++] = value;
}
length = array.Length;
while (index < length) {
Buffer.BlockCopy(array, 0, array, index * sizeof(float), Math.Min(block, length-index)* sizeof(float));
index += block;
block *= 2;
}
}
The Array object has a method called Clear. I'm willing to bet that the Clear method is faster than any code you can write in C#.
you can try the following codes, it will work.
byte[] test = new byte[65536];
Array.Clear(test,0,test.Length);