Understanding Unsafe code and its uses

I am currently reading ECMA-334, as suggested by a friend who programs for a living. I am on the section dealing with unsafe code, but I am a bit confused by what it is saying.
The garbage collector underlying C# might work by moving objects
around in memory, but this motion is invisible to most C# developers.
For developers who are generally content with automatic memory
management but sometimes need fine-grained control or that extra bit
of performance, C# provides the ability to write “unsafe” code. Such
code can deal directly with pointer types and object addresses;
however, C# requires the programmer to fix objects to temporarily
prevent the garbage collector from moving them. This “unsafe” code
feature is in fact a “safe” feature from the perspective of both
developers and users. Unsafe code shall be clearly marked in the code
with the modifier unsafe, so developers can't possibly use unsafe
language features accidentally, and the compiler and the execution
engine work together to ensure that unsafe
code cannot masquerade as safe code. These restrictions limit the use
of unsafe code to situations in which the code is trusted.
The example
using System;
class Test
{
    static void WriteLocations(byte[] arr)
    {
        unsafe
        {
            fixed (byte* pArray = arr)
            {
                byte* pElem = pArray;
                for (int i = 0; i < arr.Length; i++)
                {
                    byte value = *pElem;
                    Console.WriteLine("arr[{0}] at 0x{1:X} is {2}",
                        i, (uint)pElem, value);
                    pElem++;
                }
            }
        }
    }
    static void Main()
    {
        byte[] arr = new byte[] { 1, 2, 3, 4, 5 };
        WriteLocations(arr);
        Console.ReadLine();
    }
}
shows an unsafe block in a method named WriteLocations that fixes an
array instance and uses pointer manipulation to iterate over the
elements. The index, value, and location of each array element are
written to the console. One possible example of output is:
arr[0] at 0x8E0360 is 1
arr[1] at 0x8E0361 is 2
arr[2] at 0x8E0362 is 3
arr[3] at 0x8E0363 is 4
arr[4] at 0x8E0364 is 5
but, of course, the exact memory locations can be different in
different executions of the application.
Why is knowing the exact memory locations of, for example, this array beneficial to us as developers? And could someone explain this idea in a simplified context?

The fixed language feature is not so much "beneficial" as "absolutely necessary".
Ordinarily a C# user will imagine reference types as being equivalent to single-indirection pointers (e.g. for class Foo, this: Foo foo = new Foo(); is equivalent to this C++: Foo* foo = new Foo();).
In reality, a C# reference behaves more like a handle than a raw pointer: the address it resolves to can change over time. (It is often pictured as a pointer to an entry in an object table that stores the actual address; in the CLR, references are direct pointers that the GC rewrites whenever it moves an object.) The GC not only cleans up unused objects but also moves objects around in memory to avoid fragmentation.
All this is well-and-good if you're exclusively using object references in C#. As soon as you use pointers then you've got problems because the GC could run at any point in time, even during tight-loop execution, and when the GC runs your program's execution is frozen (which is why the CLR and Java are not suitable for Hard Real Time applications - a GC pause can last a few hundred milliseconds in some cases).
...because of this inherent behaviour (where an object is moved during code execution) you need to prevent that object being moved, hence the fixed keyword, which instructs the GC not to move that object.
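An example (illustrative only: the compiler rejects taking the address of an array element outside a fixed statement, precisely to prevent the following):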
unsafe void Foo() {
Byte[] safeArray = new Byte[ 50 ];
safeArray[0] = 255;
Byte* p = &safeArray[0];
Console.WriteLine( "Array address: {0}", &safeArray );
Console.WriteLine( "Pointer target: {0}", p );
// These will both print "0x12340000".
while( executeTightLoop() ) {
Console.WriteLine( *p );
// valid pointer dereferencing, will output "255".
}
// Pretend at this point that GC ran right here during execution. The safeArray object has been moved elsewhere in memory.
Console.WriteLine( "Array address: {0}", &safeArray );
Console.WriteLine( "Pointer target: {0}", p );
// These two printed values will differ, demonstrating that p is invalid now.
Console.WriteLine( *p );
// the above code now prints garbage (if the memory has been reused by another allocation) or causes the program to crash (if it's in a memory page that has been released, an Access Violation)
}
So instead, by applying fixed to the safeArray object, the pointer p will always be a valid pointer and will neither cause a crash nor read garbage data.
Side-note: An alternative to fixed is to use stackalloc, but that limits the object lifetime to the scope of your function.
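As a minimal sketch of the pattern this answer recommends, the method below pins the array with fixed before reading it through a pointer. The names FixedDemo and SumBytes are made up for illustration, and the code has to be compiled with unsafe blocks enabled (for example with the /unsafe compiler switch).

```csharp
using System;

static class FixedDemo
{
    // Sums the bytes of an array through a raw pointer.
    public static unsafe int SumBytes(byte[] data)
    {
        if (data == null || data.Length == 0)
            return 0;

        int sum = 0;
        // The array stays pinned for the lifetime of this block,
        // so p cannot be invalidated by a compacting collection.
        fixed (byte* p = data)
        {
            for (int i = 0; i < data.Length; i++)
                sum += p[i];
        }
        // After the block the array may move again; p is out of scope.
        return sum;
    }
}
```

The pointer never escapes the fixed block, which is what keeps the code free of dangling pointers.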

One of the primary reasons I use fixed is for interfacing with native code. Suppose you have a native function with the following signature:
double cblas_ddot(int n, double* x, int incx, double* y, int incy);
You could write an interop wrapper like this:
public static extern double cblas_ddot(int n, [In] double[] x, int incx,
[In] double[] y, int incy);
And write C# code to call it like this:
double[] x = ...
double[] y = ...
cblas_ddot(n, x, 1, y, 1);
But now suppose I wanted to operate on some data in the middle of my array, say starting at x[2] and y[2]. There is no way to make the call without copying the array.
double[] x = ...
double[] y = ...
cblas_ddot(n, x[2], 1, y[2], 1); // this wouldn't compile: x[2] and y[2] are doubles, not arrays
In this case fixed comes to the rescue. We can change the signature of the interop and use fixed from the caller.
public unsafe static extern double cblas_ddot(int n, [In] double* x, int incx,
[In] double* y, int incy);
double[] x = ...
double[] y = ...
fixed (double* pX = x, pY = y)
{
cblas_ddot(n, pX + 2, 1, pY + 2, 1);
}
I've also used fixed in rare cases where I need fast loops over arrays and needed to ensure the .NET array bounds checking was not happening.

In general, the exact memory locations within an "unsafe" block are not so relevant.
As explained in Dai's answer, when you are using memory managed by the garbage collector, you need to make sure that the data you are manipulating does not get moved (using "fixed"). You generally use this when:
You are running a performance-critical operation many times in a loop, and manipulating raw byte structures is sufficiently faster.
You are doing interop and have some non-standard data marshaling needs.
In some cases, you are working with memory that is not managed by the garbage collector. Some examples of such scenarios are:
When doing interop with unmanaged code, it can be used to prevent repeatedly marshaling data back and forth, and instead do some work in larger granularity chunks, using the "raw bytes", or structs mapped to these raw bytes.
When doing low level IO with large buffers that you need to share with the OS (e.g. for scatter/gather IO).
When creating specific structures in a memory-mapped file. An example could be a B+Tree with memory-page-sized nodes, stored in a disk-based file that you want to page into memory.

Related

Fixing an array of array in C# (unsafe code)

I'm trying to come up with a solution as to how I can pass an array of arrays from C# into a native function. I already have a delegate to the function (Marshal.GetDelegateForFunctionPointer), but now I'm trying to pass a multidimensional array (or rather; an array of arrays) into it.
This code example works when the input has 2 sub-arrays, but I need to be able to handle any number of sub-arrays. What's the easiest way you can think of to do that? I'd prefer not to copy the data between arrays as this will be happening in a real-time loop (I'm communicating with an audio effect)
public void process(float[][] input)
{
unsafe
{
// If I know how many sub-arrays I have I can just fix them like this... but I need to handle n-many arrays
fixed (float* inp0 = input[0], inp1 = input[1] )
{
// Create the pointer array and put the pointers to input[0] and input[1] into it
float*[] inputArray = new float*[2];
inputArray[0] = inp0;
inputArray[1] = inp1;
fixed(float** inputPtr = inputArray)
{
// C function signature is someFunction(float** input, int numberOfChannels, int length)
functionDelegate(inputPtr, 2, input[0].Length);
}
}
}
}

You can pin an object in place without using fixed by instead obtaining a pinned GCHandle to the object in question. Of course, it should go without saying that by doing so you take responsibility for ensuring that the pointer does not survive past the point where the object is unpinned. We call it "unsafe" code for a reason; you get to be responsible for safe memory management, not the runtime.
http://msdn.microsoft.com/en-us/library/system.runtime.interopservices.gchandle.aspx
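A rough sketch of how pinned GCHandles could cover any number of sub-arrays. The helper name WithPinnedRows and the callback shape are assumptions made for illustration; the native call itself is left to the caller, who receives one address per sub-array.

```csharp
using System;
using System.Runtime.InteropServices;

static class JaggedPinning
{
    // Pins every sub-array, passes their addresses to 'callback',
    // and unpins everything afterwards, even if the callback throws.
    public static void WithPinnedRows(float[][] rows, Action<IntPtr[]> callback)
    {
        var handles = new GCHandle[rows.Length];
        var addresses = new IntPtr[rows.Length];
        try
        {
            for (int i = 0; i < rows.Length; i++)
            {
                handles[i] = GCHandle.Alloc(rows[i], GCHandleType.Pinned);
                addresses[i] = handles[i].AddrOfPinnedObject();
            }
            // The addresses are only valid inside this call.
            callback(addresses);
        }
        finally
        {
            foreach (GCHandle handle in handles)
            {
                if (handle.IsAllocated)
                    handle.Free();
            }
        }
    }
}
```

The handles are freed in a finally block because a pinned handle that is never freed keeps its array pinned indefinitely. In a real-time loop the handle and address arrays would presumably be allocated once and reused rather than created per call.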

It makes no sense to try to pin the array of references to the managed arrays.
The reference values in there probably don't point to the address of the first element, and even if they did, that would be an implementation detail. It could change from release to release.
Copying an array of pointers to a lot of data should not be that slow, especially not when compared with the multimedia processing you are calling into.
If it is significant, allocate your data outside of the managed heap; then there is no pinning or copying, but more bookkeeping.

The easiest way I know is to use a one-dimensional array. It reduces complexity and memory fragmentation and also performs better. I actually do this in my project. You can use manual indexing, where array[i][j] corresponds to oneDimArray[i * n + j], and pass n as a parameter to the function. You then only need to fix once, just as you did in your example:
public void process(float[] oneDimInput, int numberOfColumns)
{
unsafe
{
fixed (float* inputPtr = &oneDimInput[0])
{
// C function signature is someFunction(
// float* input,
// int number of columns in oneDimInput
// int numberOfChannels,
// int length)
// Assumption: length is taken here to be the total element count.
functionDelegate(inputPtr, numberOfColumns, 2, oneDimInput.Length);
}
}
}
I should also note that two-dimensional arrays are rarely used in high-performance computation libraries such as Intel MKL, Intel IPP and many others. Even the BLAS and LAPACK interfaces take only one-dimensional arrays and emulate two dimensions using the approach I've mentioned (for performance reasons).
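To make the i * n + j indexing concrete, here is a small sketch. Flat2D, Flatten and Index are hypothetical names; in practice the buffer would be allocated flat from the start, and Flatten is included only to show how the rows end up laid out.

```csharp
using System;

static class Flat2D
{
    // Copies a rectangular jagged array into one row-major buffer.
    public static float[] Flatten(float[][] rows, int columns)
    {
        var flat = new float[rows.Length * columns];
        for (int i = 0; i < rows.Length; i++)
        {
            if (rows[i].Length != columns)
                throw new ArgumentException("All rows must have 'columns' elements.");
            Array.Copy(rows[i], 0, flat, i * columns, columns);
        }
        return flat;
    }

    // Row-major position of element (row, col) in the flat buffer.
    public static int Index(int row, int col, int columns)
    {
        return row * columns + col;
    }
}
```

With this layout a single fixed statement over the flat buffer pins all of the data at once.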

Passing an array of ints from C# to native code with interop

I have a Blah.cs:
public unsafe static int Main()
{
int[] ai = {1, 2, 3, 4, 5};
UIntPtr stai = (UIntPtr) ai.Length;
CManagedStuff obj = new CManagedStuff();
obj.DoSomething(ai, stai);
return 0;
}
Then a ManagedStuff.cpp:
void CManagedStuff::DoSomething(int^ _ai, UIntPtr _stai)
{
// Here I should do something to marshal the int^ to an int*
pUnmanagedStuff->DoSomething(_ai, (size_t) _stai);
}
And an UnmanagedStuff.cpp:
void CUnmanagedStuff::DoSomething(int* _ai, size_t _stai)
{
// Walk and print the _stai ints in _ai
}
How can I pass int[] ai from Main to ManagedStuff::DoSomething? I understand there is no marshaling in that call, because all the code involved is managed.
And how can I then marshal int^ _ai in ManagedStuff::DoSomething to call UnmanagedStuff::DoSomething? If I had an int[] _ai the code in the answer for this SO question may help (C#: Marshalling a "pointer to an int array" from a SendMessage() lParam).
Alternatively, how can I avoid working with C#, C++ interop, Microsoft and Windows, and stop world suffering?

I just need to point out how broken the original idea is.
In native code, you can pass an array by passing the address of the first element, because adjacent elements can be found through pointer arithmetic.
In managed code, the elements are also stored adjacently, but passing an int^ boxes the element, making a copy outside the array. This copy will not have any other array elements stored nearby.
In fact, this also happens in native cross-process communications. The trick of using pointer arithmetic to find other elements only works in-process, and is not generally applicable.

OK, I've got it working like this:
void CManagedStuff::DoSomething(array<int>^ _ai, UIntPtr _stai)
{
// Pin the array and obtain a native int* to its first element
pin_ptr<int> _aiPinned = &_ai[0];
pUnmanagedStuff->DoSomething(_aiPinned, (size_t) _stai);
}
First, passing an array<int>^.
Secondly, as Tamschi was suggesting, using a pin pointer pointing to the address of the first element in the array.

You have to pin the managed resource (your array), so the garbage collector doesn't move it while you're using the pointer.
In C#, you can do this with the fixed statement: fixed Statement (C# Reference)
Pinning in C++ works with pinning pointers, which pin a managed object while they're in scope. (A pointer to any element will pin the entire array):
// In CManagedStuff:
pin_ptr<int> _aiPinned = &_ai[0];
More info: C++/CLI in Action - Using interior and pinning pointers

Is it safe to point to the middle of a valuetype in C#/CLR?

I have a Matrix4D that should be passed to glLoadMatrixf. To overcome p/invoke overhead (i.e. pinning, marshaling etc. each time), I'm using pointers instead of usual arrays. So I have two issues.
Matrix4D is based on a copypasted class. It's tested and probably optimized a bit -- didn't want to reinvent the wheel (also I suck at math). Anyway, that class uses 16 fields instead of 1 fixed array (the class was written in the C# 1.0 era I guess). The layout is sequential, so that GetPointer method just gets a pointer to the very first field. THE QUESTION is: can there be some padding problems? I mean cases when, for example, the runtime extends floats to doubles so that indexing a pack of fields as an array would get garbage. Or does sequential layout prevent that by specs? Or should I adhere strictly to fixed arrays?
The second issue is possible alterations by the optimizer. The matrix is a value type, on which float* GetPointer() is called. I'm afraid the optimizer may rearrange the code in such a way that GetPointer would point to some garbage.
For example:
GL32NativeMethods.glLoadMatrixf((mat1 * mat2).GetPointer());
Is it safe to do, or not? Currently I'm doing this to be sure (though I'm not sure at all):
Matrix4D tmp = mat1 * mat2;
GL32NativeMethods.glLoadMatrixf(tmp.GetPointer());
Are there other possible solutions to this problem?
P.S. After the call to glLoadMatrixf, the pointer isn't needed.
UPD
My concern is that in between the calls to GetPointer() and glLoadMatrixf() the value may be discarded by the optimizer (as I suppose):
float* f = mat.GetPointer();
// Here the optimizer decides to discard mat variable because it isn't used anymore.
// Maybe it now fills the memory area of mat with other helper values (for P/Invoke, for example?)
GL32NativeMethods.glLoadMatrixf(f); // References discarded data.

Heh, I plumb forgot that native code is type-agnostic, i.e. it's actual memory alignment that matters.
I'll try out this one:
public static extern void glLoadMatrixf(ref Matrix4D mat);
GL32NativeMethods.glLoadMatrixf(ref mat);
Native code will be tricked into thinking that mat is an array of floats, although it's actually a valuetype with the alignment of a float array.
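One way to gain some confidence about padding is to ask the marshaller what layout it sees. The sketch below uses a stand-in struct (Matrix4DSketch, with assumed field names M11..M44) and checks that sixteen sequential float fields occupy exactly 64 bytes, with the last field at offset 60. It reports the marshaled layout, which for a struct made only of floats is expected to match the in-memory layout.

```csharp
using System;
using System.Runtime.InteropServices;

[StructLayout(LayoutKind.Sequential)]
struct Matrix4DSketch
{
    // Sixteen float fields declared in row order.
    public float M11, M12, M13, M14;
    public float M21, M22, M23, M24;
    public float M31, M32, M33, M34;
    public float M41, M42, M43, M44;
}

static class LayoutCheck
{
    // True when the struct is laid out like a float[16] with no padding.
    public static bool LooksLikeFloat16()
    {
        int size = Marshal.SizeOf<Matrix4DSketch>();
        int lastOffset = Marshal.OffsetOf<Matrix4DSketch>("M44").ToInt32();
        return size == 16 * sizeof(float) && lastOffset == 15 * sizeof(float);
    }
}
```

A check like this could run once at startup or in a unit test, so a layout surprise shows up as a failed assertion rather than as garbage on screen.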

I wrapped OpenGL methods in the following way:
public static void UniformMatrix4(int location, Int32 count, bool transpose, float[] value) {
unsafe {
fixed (float* fp_value = value)
{
Delegates.pglUniformMatrix4fv(location, count, transpose, fp_value);
}
}
}
[System.Runtime.InteropServices.DllImport(Library, EntryPoint = "glUniformMatrix4fv", ExactSpelling = true)]
internal extern static unsafe void glUniformMatrix4fv(int location, Int32 count, bool transpose, float* value);
Then, I can use a float[] for specifying matrix components. Of course there's a Matrix class which defines the array of floats and abstract math operations.

Are ref and out in C# the same as pointers in C++?

I just made a Swap routine in C# like this:
static void Swap(ref int x, ref int y)
{
int temp = x;
x = y;
y = temp;
}
It does the same thing that this C++ code does:
void swap(int *d1, int *d2)
{
int temp=*d1;
*d1=*d2;
*d2=temp;
}
So are the ref and out keywords like pointers for C# without using unsafe code?

They're more limited. You can say ++ on a pointer, but not on a ref or out.
EDIT Some confusion in the comments, so to be absolutely clear: the point here is to compare with the capabilities of pointers. You can't perform the same operation as ptr++ on a ref/out, i.e. make it address an adjacent location in memory. It's true (but irrelevant here) that you can perform the equivalent of (*ptr)++, but that would be to compare it with the capabilities of values, not pointers.
It's a safe bet that they are internally just pointers, because the stack doesn't get moved and C# is carefully organised so that ref and out always refer to an active region of the stack.
EDIT To be absolutely clear again (if it wasn't already clear from the example below), the point here is not that ref/out can only point to the stack. It's that when it points to the stack, it is guaranteed by the language rules not to become a dangling pointer. This guarantee is necessary (and relevant/interesting here) because the stack just discards information in accordance with method call exits, with no checks to ensure that any referrers still exist.
Conversely when ref/out refers to objects in the GC heap it's no surprise that those objects are able to be kept alive as long as necessary: the GC heap is designed precisely for the purpose of retaining objects for any length of time required by their referrers, and provides pinning (see example below) to support situations where the object must not be moved by GC compacting.
If you ever play with interop in unsafe code, you will find that ref is very closely related to pointers. For example, if a COM interface is declared like this:
HRESULT Write(BYTE *pBuffer, UINT size);
The interop assembly will turn it into this:
void Write(ref byte pBuffer, uint size);
And you can do this to call it (I believe the COM interop stuff takes care of pinning the array):
byte[] b = new byte[1000];
obj.Write(ref b[0], (uint)b.Length);
In other words, ref to the first byte gets you access to all of it; it's apparently a pointer to the first byte.

Reference parameters in C# can be used to replace one use of pointers, yes. But not all.
Another common use for pointers is as a means for iterating over an array. Out/ref parameters can not do that, so no, they are not "the same as pointers".
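The difference can be sketched side by side. RefVsPointer and its method names are invented for this illustration, and the pointer half needs unsafe blocks enabled at compile time.

```csharp
static class RefVsPointer
{
    // A ref parameter aliases exactly one variable; it can be read and
    // written through, but it cannot be advanced to a neighbouring one.
    public static void Swap(ref int x, ref int y)
    {
        int temp = x;
        x = y;
        y = temp;
    }

    // A pointer can be stepped through adjacent elements.
    public static unsafe int Sum(int[] values)
    {
        int total = 0;
        fixed (int* start = values)
        {
            int* p = start; // 'start' itself is read-only, so walk a copy
            for (int i = 0; i < values.Length; i++)
            {
                total += *p;
                p++; // no equivalent operation exists for a ref parameter
            }
        }
        return total;
    }
}
```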

ref and out are only used with function arguments to signify that the argument is to be passed by reference instead of value. In this sense, yes, they are somewhat like pointers in C++ (more like references actually). Read more about it in this article.

The nice thing about using out is that you're guaranteed that the item will be assigned a value -- you will get a compile error if not.

Actually, I'd compare them to C++ references rather than pointers. Pointers, in C++ and C, are a more general concept, and references will do what you want.
All of these are undoubtedly pointers under the covers, of course.

While comparisons are in the eye of the beholder...I say no. 'ref' changes the calling convention but not the type of the parameters. In your C++ example, d1 and d2 are of type int*. In C# they are still Int32's, they just happen to be passed by reference instead of by value.
By the way, your C++ code doesn't really swap its inputs in the traditional sense. Generalizing it like so:
template<typename T>
void swap(T *d1, T *d2)
{
T temp = *d1;
*d1 = *d2;
*d2 = temp;
}
...won't work unless all types T have copy constructors, and even then will be much less efficient than swapping pointers.

The short answer is Yes (similar functionality, but not exactly the same mechanism).
As a side note, if you use FxCop to analyse your code, using out and ref will result in a "Microsoft.Design" error of "CA1045:DoNotPassTypesByReference."

C# Unsafe/Fixed Code

Can someone give an example of a good time to actually use "unsafe" and "fixed" in C# code? I've played with it before, but never actually found a good use for it.
Consider this code...
fixed (byte* pSrc = src, pDst = dst) {
//Code that copies the bytes in a loop
}
compared to simply using...
Array.Copy(source, target, source.Length);
The second is the code found in the .NET Framework, the first a part of the code copied from the Microsoft website, http://msdn.microsoft.com/en-us/library/28k1s2k6(VS.80).aspx.
The built-in Array.Copy() is dramatically faster than using unsafe code. This might just be because the second is better written and the first is only an example, but in what kinds of situations would you really need to use unsafe/fixed code for anything? Or is this poor web developer messing with something above his head?

It's useful for interop with unmanaged code. Any pointers passed to unmanaged functions need to be fixed (aka. pinned) to prevent the garbage collector from relocating the underlying memory.
If you are using P/Invoke, then the default marshaller will pin objects for you. Sometimes it's necessary to perform custom marshalling, and sometimes it's necessary to pin an object for longer than the duration of a single P/Invoke call.

I've used unsafe-blocks to manipulate Bitmap-data. Raw pointer-access is significantly faster than SetPixel/GetPixel.
unsafe
{
BitmapData bmData = bm.LockBits(...);
byte* bits = (byte*)bmData.Scan0.ToPointer();
// Do stuff with bits
bm.UnlockBits(bmData);
}
"fixed" and "unsafe" are typically used when doing interop, or when extra performance is required. For example, String.CopyTo() uses unsafe and fixed in its implementation.

reinterpret_cast style behaviour
If you are doing bit manipulation, this can be incredibly useful.
Many high-performance hash code implementations use UInt32 for the hash value (this makes the shifts simpler). Since .NET requires Int32 for the GetHashCode method, you want to quickly convert the uint to an int. Since the actual value does not matter, only that all the bits in the value are preserved, a reinterpret cast is desired.
public static unsafe int UInt32ToInt32Bits(uint x)
{
return *((int*)(void*)&x);
}
Note that the naming is modelled on BitConverter.DoubleToInt64Bits.
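For comparison, a small sketch (BitCasts is a made-up name) that puts the pointer reinterpretation next to an unchecked cast. For the specific uint-to-int case the two give the same bits; the pointer form is the one that generalises to other types.

```csharp
static class BitCasts
{
    // Reinterprets the 32 bits of x as a signed integer via a pointer.
    public static unsafe int ViaPointer(uint x)
    {
        return *((int*)&x);
    }

    // Same bits for uint -> int, without unsafe code.
    public static int ViaUncheckedCast(uint x)
    {
        return unchecked((int)x);
    }
}
```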
Continuing in the hashing vein, converting a stack based struct into a byte* allows easy use of per byte hashing functions:
// from the Jenkins one at a time hash function
private static unsafe void Hash(byte* data, int len, ref uint hash)
{
for (int i = 0; i < len; i++)
{
hash += data[i];
hash += (hash << 10);
hash ^= (hash >> 6);
}
}
public unsafe static void HashCombine(ref uint sofar, long data)
{
byte* dataBytes = (byte*)(void*)&data;
Hash(dataBytes, sizeof(long), ref sofar);
}
unsafe also (from 2.0 onwards) lets you use stackalloc. This can be very useful in high-performance situations where a small, variable-length, array-like temporary space is needed.
All of these uses fall firmly into the 'only if your application really needs the performance' category and are thus inappropriate in general use, but sometimes you really do need it.
fixed is necessary when you wish to interop with some useful unmanaged function (there are many) that takes C-style arrays or strings. As such, in interop scenarios it is needed not only for performance but for correctness.

Unsafe is useful for (for example) getting pixel data out of an image quickly using LockBits. The performance improvement over doing this using the managed API is several orders of magnitude.

We had to use a fixed when an address gets passed to a legacy C DLL. Since the DLL maintained an internal pointer across function calls, all hell would break loose if the GC compacted the heap and moved stuff around.

I believe unsafe code is used if you want to access something outside of the .NET runtime, ie. it is not managed code (no garbage collection and so on). This includes raw calls to the Windows API and all that jazz.

This tells me the designers of the .NET framework did a good job of covering the problem space--of making sure the "managed code" environment can do everything a traditional (e.g. C++) approach can do with its unsafe code/pointers. In case it cannot, the unsafe/fixed features are there if you need them. I'm sure someone has an example where unsafe code is needed, but it seems rare in practice--which is rather the point, isn't it? :)
