Unsafe.As from byte array to ulong array - c#

I'm currently looking at porting my MetroHash implementation to use C# 7 features, as several parts might profit from ref locals to improve performance.
The hash does the calculations on a ulong[4] array, but the result is a 16 byte array. Currently I'm copying the ulong array to the result byte buffer, but this takes a bit of time.
So I'm wondering if System.Runtime.CompilerServices.Unsafe is safe to use here:
var result = new byte[16];
ulong[] state = Unsafe.As<byte[], ulong[]>(ref result);
ref var firstState = ref state[0];
ref var secondState = ref state[1];
ulong thirdState = 0;
ulong fourthState = 0;
The above code snippet means that I'm using the result buffer also for parts of my state calculations and not only for the final output.
My unit tests are successful and according to benchmarkdotnet skipping the block copy would result in a 20% performance increase, which is high enough for me to find out if it is correct to use it.

In current .NET terms, this would be a good fit for Span<T>:
Span<byte> result = new byte[16];
Span<ulong> state = MemoryMarshal.Cast<byte, ulong>(result);
This enforces lengths etc, while having good JIT behaviour and not requiring unsafe. You can even stackalloc the original buffer (from C# 7.2 onwards):
Span<byte> result = stackalloc byte[16];
Span<ulong> state = MemoryMarshal.Cast<byte, ulong>(result);
Note that Span<T> gets the length change correct; it is also trivial to cast into a Span<Vector<T>> if you want to use SIMD for hardware acceleration.
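As a minimal sketch of how this could remove the block copy, assuming System.Memory is available (it ships with .NET Core 2.1 and later); the class and method names here are made up for illustration and are not part of the question's code:

```csharp
using System;
using System.Runtime.InteropServices;

public static class SpanCastSketch
{
    // Hypothetical helper: writes two 64-bit state words straight into a
    // 16-byte result buffer, so no block copy is needed afterwards.
    public static byte[] WriteState(ulong first, ulong second)
    {
        var result = new byte[16];

        // Reinterpret the 16 bytes as ulongs; the span's Length becomes 2.
        Span<ulong> state = MemoryMarshal.Cast<byte, ulong>(result.AsSpan());
        state[0] = first;
        state[1] = second;

        // state[2] would throw IndexOutOfRangeException rather than
        // touch memory past the end of the array.
        return result;
    }
}
```

The bytes land in the machine's native byte order, which is the same thing a block copy from a ulong[] would produce.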

What you're doing seems fine, just be careful because there's nothing to stop you from doing this:
byte[] x = new byte[16];
long[] y = Unsafe.As<byte[], long[]>(ref x);
Console.WriteLine(y.Length); // still 16
for (int i = 0; i < y.Length; i++)
Console.WriteLine(y[i]); // reads random memory from your program, could cause crash

C# supports "fixed buffers", here's the kind of thing we can do:
public unsafe struct Bytes
{
public fixed byte bytes[16];
}
then
public unsafe static Bytes Convert (long[] longs)
{
fixed (long * longs_ptr = longs)
return *((Bytes*)(longs_ptr));
}
Try it. (1D arrays of primitive types in C# are always stored as a contiguous block of memory which is why taking the address of the (managed) arrays is fine).
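You could also return the pointer to avoid the copy, but be aware that the pointer is only valid while the array is pinned. Once the fixed block exits, the GC is free to move the array, so this is only safe if the array is kept pinned some other way: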
public unsafe static Bytes * Convert (long[] longs)
{
fixed (long * longs_ptr = longs)
return ((Bytes*)(longs_ptr));
}
and manipulate/access the bytes as you want.
var s = Convert(longs);
var b = s->bytes[0];

Related

Speed up nested loops and bitwise operations with Alea GPU

I'm trying to use Alea to speed up a program I'm working on but I need some help.
What I need to do is a lot of bitcount and bitwise operations with values stored in two arrays.
For each element of my first array I have to do a bitwise & operation with each element of my second array, then count the bits set to 1 of the & result.
If the result is greater than/equal to a certain value I need to exit the inner for and go to the next element of my first array.
The first array is usually a big one, with millions of elements, the second one is usually less than 200.000 elements.
Trying to do all these operations in parallel, here is my code:
[GpuManaged]
private long[] Check(long[] arr1, long[] arr2, int limit)
{
Gpu.FreeAllImplicitMemory(true);
var gpu = Gpu.Default;
long[] result = new long[arr1.Length];
gpu.For(0, arr1.Length, i =>
{
bool found = false;
long b = arr1[i];
for (int i2 = 0; i2 < arr2.Length; i2++)
{
if (LibDevice.__nv_popcll(b & arr2[i2]) >= limit)
{
found = true;
break;
}
}
if (!found)
{
result[i] = b;
}
});
return result;
}
This works as expected but is just a little faster than my version running in parallel on a quad core CPU.
I'm certainly missing something here, it's my very first attempt to write GPU code.
By the way, my NVIDIA is a GeForce GT 740M.
EDIT
The following code is 2x faster than the previous one, at least on my PC. Many thanks to Michael Randall for pointing me in the right direction.
private static int[] CheckWithKernel(Gpu gpu, int[] arr1, int[] arr2, int limit)
{
var lp = new LaunchParam(16, 256);
var result = new int[arr1.Length];
try
{
using (var dArr1 = gpu.AllocateDevice(arr1))
using (var dArr2 = gpu.AllocateDevice(arr2))
using (var dResult = gpu.AllocateDevice<int>(arr1.Length))
{
gpu.Launch(Kernel, lp, arr1.Length, arr2.Length, dArr1.Ptr, dArr2.Ptr, dResult.Ptr, limit);
Gpu.Copy(dResult, result);
return result;
}
}
finally
{
Gpu.Free(arr1);
Gpu.Free(arr2);
Gpu.Free(result);
}
}
private static void Kernel(int a1, int a2, deviceptr<int> arr1, deviceptr<int> arr2, deviceptr<int> arr3, int limit)
{
var iinit = blockIdx.x * blockDim.x + threadIdx.x;
var istep = gridDim.x * blockDim.x;
for (var i = iinit; i < a1; i += istep)
{
bool found = false;
int b = arr1[i];
for (var j = 0; j < a2; j++)
{
if (LibDevice.__nv_popcll(b & arr2[j]) >= limit)
{
found = true;
break;
}
}
if (!found)
{
arr3[i] = b;
}
}
}
Update
It seems pinning won't work with GCHandle.Alloc()
However the point of this answer is you will get a much greater performance gain out of direct memory access.
http://www.aleagpu.com/release/3_0_3/doc/advanced_features_csharp.html
Directly Working with Device Memory
Device memory provides even more flexibility as it also allows all
kind of pointer arithmetics. Device memory is allocated with
Memory<T> Gpu.AllocateDevice<T>(int length)
Memory<T> Gpu.AllocateDevice<T>(T[] array)
The first overload creates a device memory object for the specified
type T and length on the selected GPU. The second one allocates
storage on the GPU and copies the .NET array into it. Both return a
Memory<T> object, which implements IDisposable and can therefore
support the using syntax which ensures proper disposal once the
Memory<T> object goes out of scope. A Memory<T> object has properties
to determine the length, the GPU or the device on which it lives. The
Memory<T>.Ptr property returns a deviceptr<T>, which can be used in
GPU code to access the actual data or to perform pointer arithmetics.
The following example illustrates a simple use case of device
pointers. The kernel only operates on part of the data, defined by an
offset.
using (var dArg1 = gpu.AllocateDevice(arg1))
using (var dArg2 = gpu.AllocateDevice(arg2))
using (var dOutput = gpu.AllocateDevice<int>(Length/2))
{
// pointer arithmetics to access subset of data
gpu.Launch(Kernel, lp, dOutput.Length, dOutput.Ptr, dArg1.Ptr + Length/2, dArg2.Ptr + Length / 2);
var result = dOutput.ToArray();
var expected = arg1.Skip(Length/2).Zip(arg2.Skip(Length/2), (x, y) => x + y);
Assert.That(result, Is.EqualTo(expected));
}
Original Answer
Disregarding the logic going on, or how relevant this is to GPU code, you could complement your parallel routine and possibly speed things up by pinning your arrays in memory with GCHandle.Alloc() and the GCHandleType.Pinned flag and using direct pointer access (if you can run unsafe code).
Notes
You will take a hit from pinning the memory; however, for large arrays you can gain a lot of performance from direct access.
You will have to mark your assembly unsafe in Build Properties.
This is obviously untested and just an example.
You could use fixed; however, the parallel lambda makes it fiddlier.
Example
private unsafe long[] Check(long[] arr1, long[] arr2, int limit)
{
Gpu.FreeAllImplicitMemory(true);
var gpu = Gpu.Default;
var result = new long[arr1.Length];
// Create some pinned memory
var resultHandle = GCHandle.Alloc(result, GCHandleType.Pinned);
var arr2Handle = GCHandle.Alloc(arr2, GCHandleType.Pinned);
var arr1Handle = GCHandle.Alloc(arr1, GCHandleType.Pinned);
// Get the addresses (the arrays are long[], so use long* to get the right stride)
var resultPtr = (long*)resultHandle.AddrOfPinnedObject().ToPointer();
var arr2Ptr = (long*)arr2Handle.AddrOfPinnedObject().ToPointer();
var arr1Ptr = (long*)arr1Handle.AddrOfPinnedObject().ToPointer();
// I hate nasty lambda statements. I always find local methods easier to read.
void Workload(int i)
{
var found = false;
var b = *(arr1Ptr + i);
for (var j = 0; j < arr2.Length; j++)
{
if (LibDevice.__nv_popcll(b & *(arr2Ptr + j)) >= limit)
{
found = true;
break;
}
}
if (!found)
{
*(resultPtr + i) = b;
}
}
try
{
gpu.For(0, arr1.Length, i => Workload(i));
}
finally
{
// Make sure we free resources
arr1Handle.Free();
arr2Handle.Free();
resultHandle.Free();
}
return result;
}
Additional Resources
GCHandle.Alloc Method (Object)
A new GCHandle that protects the object from garbage collection. This
GCHandle must be released with Free when it is no longer needed.
GCHandleType Enumeration
Pinned : This handle type is similar to Normal, but allows the address of the pinned object to be taken. This prevents the garbage
collector from moving the object and hence undermines the efficiency
of the garbage collector. Use the Free method to free the allocated
handle as soon as possible.
Unsafe Code and Pointers (C# Programming Guide)
In the common language runtime (CLR), unsafe code is referred to as
unverifiable code. Unsafe code in C# is not necessarily dangerous; it
is just code whose safety cannot be verified by the CLR. The CLR will
therefore only execute unsafe code if it is in a fully trusted
assembly. If you use unsafe code, it is your responsibility to ensure
that your code does not introduce security risks or pointer errors.
A note, there has since been an update, this:
http://www.aleagpu.com/release/3_0_3/doc/advanced_features_csharp.html
is now this:
http://www.aleagpu.com/release/3_0_4/doc/advanced_features_csharp.html
some of the samples and info have changed or moved in release 3.0.4.

Writing a value to a byte array without using C# unsafe nor fixed keyword

In C or C++, we can write the value of a variable directly onto a byte array.
int value = 3;
unsigned char array[100];
*(int*)(&array[10]) = value;
In C#, we can also do this by using the unsafe and fixed keywords.
int value = 3;
byte[] array = new byte[100];
fixed(...) { ... }
However, Unity3D does not allow using unsafe or fixed. In this case, what is a cost-efficient way of doing it at runtime? My rough guess is that it can be done with a binary reader or writer class in .NET Core or .NET Framework, but I am not sure.
Since you can't use unsafe - you can just pack that int value yourself:
int value = 3;
var array = new char[100];
array[10] = (char)value; // right half
array[11] = (char)(value >> 16); // left half
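This works because char is basically a ushort in C# (a 16-bit number), so two elements hold one int. For a byte[] you would split the value into four bytes instead. Either way, the intent is the same as the C++ line: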
*(int*)(&array[10]) = value;
Another approach is using BitConverter:
var bytes = BitConverter.GetBytes(value);
array[10] = (char)BitConverter.ToInt16(bytes, 0);
array[11] = (char)BitConverter.ToInt16(bytes, 2);
But pay attention to endianness.
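For the byte[] from the question, the same packing idea could look like the following sketch. It uses neither unsafe nor fixed; the helper name is hypothetical, and the little-endian byte order is an assumption chosen to match what the C++ line produces on x86:

```csharp
using System;

public static class ByteWriterSketch
{
    // Hypothetical helper: stores a 32-bit value at `offset`, least
    // significant byte first (little-endian), using only shifts and casts.
    public static void WriteInt32LittleEndian(byte[] buffer, int offset, int value)
    {
        if (buffer == null) throw new ArgumentNullException(nameof(buffer));
        if (offset < 0 || offset > buffer.Length - 4)
            throw new ArgumentOutOfRangeException(nameof(offset));

        unchecked
        {
            // Each cast keeps only the low 8 bits of the shifted value.
            buffer[offset]     = (byte)value;
            buffer[offset + 1] = (byte)(value >> 8);
            buffer[offset + 2] = (byte)(value >> 16);
            buffer[offset + 3] = (byte)(value >> 24);
        }
    }
}
```

Calling ByteWriterSketch.WriteInt32LittleEndian(array, 10, 3) on a fresh byte[100] leaves array[10] equal to 3 and array[11] through array[13] equal to 0.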
You could also try to activate the unsafe keyword in Unity:
How to use unsafe code Unity
That would spare you the effort of using any "hacks".

Byte-array to float using bitwise shifting instead of BitConverter

I'm receiving byte-arrays containing float variables (32 bit).
In my C# application I'd like to turn byte[] byteArray into a float using bitwise shifting (because it's a lot faster than BitConverter).
Turning a byte-array into a short works like this:
short shortVal = (short)((short)inputBuffer [i++] << 8 | inputBuffer [i++]);
How do I do this for float-variables?
Let's gut the BCL and use its intestines for our purposes:
unsafe public static float ToSingle (byte[] value, int startIndex)
{
int val = ToInt32(value, startIndex);
return *(float*)&val;
}
You can implement ToInt32 using bit shifting.
If you don't need endianness behavior a single unsafe access can give you the float (assuming it's aligned).
Alternatively, you can use a union struct to convert an int to a float.
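A sketch of the shift-plus-union variant, which needs no unsafe code. The type and method names are made up for illustration, and the big-endian byte order is an assumption chosen to match the short example in the question:

```csharp
using System.Runtime.InteropServices;

public static class FloatBitsSketch
{
    // Overlays an int and a float on the same four bytes.
    [StructLayout(LayoutKind.Explicit)]
    private struct Int32SingleUnion
    {
        [FieldOffset(0)] public int AsInt32;
        [FieldOffset(0)] public float AsSingle;
    }

    // Hypothetical helper: assembles the 32 bits with shifts (most
    // significant byte first), then reinterprets them as a float.
    public static float ToSingleBigEndian(byte[] buffer, int startIndex)
    {
        int bits = (buffer[startIndex] << 24)
                 | (buffer[startIndex + 1] << 16)
                 | (buffer[startIndex + 2] << 8)
                 | buffer[startIndex + 3];

        var union = new Int32SingleUnion { AsInt32 = bits };
        return union.AsSingle;
    }
}
```

If the sender writes the float least significant byte first instead, the four indices in the shift expression would need to be reversed.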
To get away from C# conventional methods and obtain fast performance, you'll most likely have to implement "unsafe" behavior. You could do something like the C style memory copy.
unsafe public static void MemoryCopy (void* memFrom, void* memTo, int size) {
byte* pFrom = (byte*)memFrom;
byte* pTo = (byte*)memTo;
while (size-- > 0)
*pTo++ = *pFrom++;
}
This assumes that the float's endianness is the same going into the byte[] as it is on the other end.
To use this you'll have to first fix the byte array since the runtime can move it anytime it wants during garbage collection. Something like this:
float f;
unsafe {
fixed (byte* ptr = byteArray) {
MemoryCopy (ptr, &f, sizeof(float));
}
}

Casting int on a c++ logic layer

I am new to C++ and currently have quite a heavy task at work: I have a GUI made in WPF and I need to send parameters from the GUI to the C++ layer (which I have already handled).
My problem is that on the C++ layer I get the data as a BYTE* and need to reinterpret the values back to their "original" state (the first translation from int/float to a byte array is done on the C# side using the static BitConverter class). For now I use this little method:
void GetNextValue(byte* bytes, deque<BYTE> *buffer)
{
bytes[3] = buffer->front();
buffer->pop_front();
bytes[2] = buffer->front();
buffer->pop_front();
bytes[1] = buffer->front();
buffer->pop_front();
bytes[0] = buffer->front();
buffer->pop_front();
}
But for an integer value of 1 I get a really high number; on the other hand, reading the int value directly from the buffer yields the correct answer (i.e. int x = pBuffer[4]). Any help or suggestions will be gladly accepted.
BTW-
I used
_rxBuffer.insert( _rxBuffer.end(), pBuffer, pBuffer + nLength);
To convert the BYTE* of data to -
deque<BYTE> _rxBuffer;
If you have an array of byte[4] you can just convert it to integer by this:
byte bytes[4];
int value = *(int*)bytes;
But beware: depending on the endianness of your platform you may or may not need to swap the byte order (try swapping 3<>0 and 2<>1 in bytes).

What is an equivalent of Delphi FillChar in C#?

What is the C# equivalent of Delphi's FillChar?
I'm assuming you want to fill a byte array with zeros (as that's what FillChar is mostly used for in Delphi).
.NET is guaranteed to initialize all the values in a byte array to zero on creation, so generally FillChar in .NET isn't necessary.
So saying:
byte[] buffer = new byte[1024];
will create a buffer of 1024 zero bytes.
If you need to zero the bytes after the buffer has been used, you could consider just discarding your byte array and declaring a new one (that's if you don't mind having the GC work a bit harder cleaning up after you).
If I understand FillChar correctly, it sets all elements of an array to the same value, yes?
In which case, unless the value is 0, you probably have to loop:
for(int i = 0 ; i < arr.Length ; i++) {
arr[i] = value;
}
For setting the values to the type's 0, there is Array.Clear
Obviously, with the loop answer you can stick this code in a utility method if you need... for example, as an extension method:
public static void FillChar<T>(this T[] arr, T value) {...}
Then you can use:
int[] data = {1,2,3,4,5};
//...
data.FillChar(7);
If you absolutely must have block operations, then Buffer.BlockCopy can be used to blit data between array locations - for example, you could write the first chunk, then blit it a few times to fill the bulk of the array.
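One way to read the "write the first chunk, then blit" idea is the doubling fill sketched below. The class and method names are hypothetical, and whether this beats a plain loop would need to be measured for your array sizes:

```csharp
using System;

public static class FillSketch
{
    // Hypothetical helper: writes one byte, then repeatedly copies the
    // already-filled prefix onto the unfilled remainder, doubling the
    // filled region on each pass.
    public static void Fill(byte[] buffer, byte value)
    {
        if (buffer == null) throw new ArgumentNullException(nameof(buffer));
        if (buffer.Length == 0) return;

        buffer[0] = value;
        int filled = 1;
        while (filled < buffer.Length)
        {
            // Copy as much of the filled prefix as still fits.
            int count = Math.Min(filled, buffer.Length - filled);
            Buffer.BlockCopy(buffer, 0, buffer, filled, count);
            filled += count;
        }
    }
}
```

Because the source and destination ranges never overlap, each pass is a straightforward block copy, and an array of n bytes needs about log2(n) passes.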
Try this in C#:
String text = "hello";
text.PadRight(10, 'h').ToCharArray();
