What is marshalling and why do we need it?
I find it hard to believe that I cannot send an int over the wire from C# to C and have to marshal it. Why can't C# just send the 32 bits over with a starting and terminating signal, telling the C code that it has received an int?
If there are any good tutorials or sites about why we need marshalling and how to use it, that would be great.
Because different languages and environments have different calling conventions, different layout conventions, different sizes of primitives (cf. char in C# and char in C), different object creation/destruction conventions, and different design guidelines. You need a way to get the stuff out of managed land and into somewhere where unmanaged land can see and understand it, and vice versa. That's what marshalling is for.
.NET code (C#, VB) is called "managed" because it's "managed" by the CLR (Common Language Runtime).
If you write code in C or C++ or assembler it is all called "unmanaged", since no CLR is involved. You are responsible for all memory allocation/de-allocation.
Marshaling is the process of converting data as it crosses between managed code and unmanaged code; it is one of the most important services offered by the CLR.
Marshalling an int is ideally just what you said: copying the memory from the CLR's managed stack into someplace where the C code can see it. Marshalling strings, objects, arrays, and other types is the difficult part.
But the P/Invoke interop layer takes care of almost all of these things for you.
As Vinko says in the comments, you can pass primitive types without any special marshalling. These are called "blittable" types and include types like byte, short, int, long, etc., and their unsigned counterparts.
This page contains the list of blittable and non-blittable types.
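To make the blittable distinction concrete, here is a minimal sketch that compares the managed size of a few primitives with the size the marshaller uses for them by default. The class name MarshalSizes and its members are illustrative only; they do not come from any of the answers.

```csharp
using System;
using System.Runtime.InteropServices;

public static class MarshalSizes
{
    // Managed size in bytes: what the CLR stores.
    public static int ManagedInt => sizeof(int);
    public static int ManagedChar => sizeof(char);
    public static int ManagedBool => sizeof(bool);

    // Default unmanaged size in bytes: what the marshaller produces.
    public static int NativeInt => Marshal.SizeOf<int>();
    public static int NativeChar => Marshal.SizeOf<char>();
    public static int NativeBool => Marshal.SizeOf<bool>();

    public static void Main()
    {
        Console.WriteLine($"int:  managed {ManagedInt}, native {NativeInt}");
        Console.WriteLine($"char: managed {ManagedChar}, native {NativeChar}");
        Console.WriteLine($"bool: managed {ManagedBool}, native {NativeBool}");
    }
}
```

With default marshalling settings, int should report the same size on both sides (which is why it can be copied as-is), while char and bool should differ, which is the kind of mismatch the marshaller exists to paper over.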
Marshalling is a "medium", for want of a better word, or a gateway for communicating with the unmanaged world's data types and vice versa through P/Invoke; it ensures the data is returned in a safe manner.
Marshalling is also the name for passing the signature of a function to a different process, which may be on a different machine. It is usually implemented by converting structured data to a dedicated format that can be transferred to other processor systems (serialization / deserialization).
Related
So I'm writing a wrapper in C# for a C dll. The problem is several of the functions use complex datatypes e.g.:
ComplexType* CreateComplexType(int a, int b);
Is there a way I can declare a valid C# type such that I can use dllimport?
If I were doing a Windows-only solution I'd probably use C++/CLI as a go-between for the native complex type and a managed complex type.
I do have access to the source code of the C dll, so would it be possible to instead use an opaque type (e.g. handles)?
Such a function is difficult to call reliably from a C program, and it doesn't get better when you pinvoke it. The issue is memory management: that struct needs to be destroyed again, which requires the calling program to use the exact same memory allocator as the DLL. This rarely turns out well in a C program, but you might be lucky: you have the source code for the DLL, so you can recompile it and ensure that everybody is using the same shared CRT version.
There is no such luck from C# of course, the pinvoke marshaller will call CoTaskMemFree() to release the struct. Few real C programs use CoTaskMemAlloc() to allocate the struct so that's a silent failure on XP, an AccessViolationException on Vista and higher. Modern Windows versions have a much stricter heap manager that doesn't ignore invalid pointers.
You can declare the return value as IntPtr, which stops the pinvoke marshaller from trying to destroy it, and then manually marshal with Marshal.PtrToStructure(). This doesn't otherwise stop the memory leak; your program will eventually crash with an out-of-memory error. Usually, anyway.
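A rough sketch of the IntPtr approach follows. The layout of ComplexType (two int fields), the DLL name, and DestroyComplexType are assumptions made for illustration; the question shows none of them. So that the sketch runs without the native DLL, FakeCreate allocates unmanaged memory itself as a stand-in for the native CreateComplexType.

```csharp
using System;
using System.Runtime.InteropServices;

// Hypothetical layout: the real ComplexType is not shown in the question.
[StructLayout(LayoutKind.Sequential)]
public struct ComplexType
{
    public int A;
    public int B;
}

public static class ManualMarshalDemo
{
    // With the real DLL the declarations might look like this (names assumed):
    //   [DllImport("native.dll")] static extern IntPtr CreateComplexType(int a, int b);
    //   [DllImport("native.dll")] static extern void DestroyComplexType(IntPtr p);

    // Stand-in for the native allocator so the sketch runs on its own.
    public static IntPtr FakeCreate(int a, int b)
    {
        IntPtr p = Marshal.AllocHGlobal(Marshal.SizeOf<ComplexType>());
        Marshal.StructureToPtr(new ComplexType { A = a, B = b }, p, false);
        return p;
    }

    // Copies the struct into managed memory, then releases the unmanaged block
    // with the allocator that created it.
    public static ComplexType ReadAndFree(IntPtr p)
    {
        try { return Marshal.PtrToStructure<ComplexType>(p); }
        finally { Marshal.FreeHGlobal(p); }
    }

    public static void Main()
    {
        ComplexType c = ReadAndFree(FakeCreate(3, 4));
        Console.WriteLine($"{c.A}, {c.B}");
    }
}
```

The point of the pairing is that the block is freed by the same allocator that produced it. With a real DLL, that would mean calling a destroy function exported by the DLL rather than FreeHGlobal.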
Mono has a good documentation page on using P/Invoke in Windows vs. Linux. Specifically, see the section on marshaling, that discusses simple vs. complex types. If you want to get creative, you could serialize your type to some convenient string-based format like JSON or XML and use that as your marshaling mechanism.
I read this question today about safe and unsafe code, then read about it on MSDN, but I still don't understand it. Why would you want to use pointers in C#? Is this purely for speed?
There are three reasons to use unsafe code:
APIs (as noted by John)
Getting actual memory address of data (e.g. access memory-mapped hardware)
Most efficient way to access and modify data (time-critical performance requirements)
Sometimes you'll need pointers to interface your C# to the underlying operating system or other native code. You're strongly discouraged from doing so, as it is "unsafe" (natch).
There will be some very rare occasions where your performance is so CPU-bound that you need that minuscule extra bit of performance. My recommendation would be to write those CPU-intensive pieces in a separate module in assembler or C/C++, export an API, and have your .NET code call that API. A possible additional benefit is that you can put platform-specific code in the unmanaged module and leave the .NET code platform-agnostic.
I tend to avoid it, but there are some times when it is very helpful:
for performance working with raw buffers (graphics, etc)
needed for some unmanaged APIs (also pretty rare for me)
for cheating with data
As an example of the last, I maintain some serialization code. Writing a float to a stream without having to use BitConverter.GetBytes (which creates an array each time) is painful - but I can cheat:
float f = ...;
int i = *(int*)&f;
Now I can use shift (>>) etc to write i much more easily than writing f would be (the bytes will be identical to if I had called BitConverter.GetBytes, plus I now control the endianness by how I choose to use shift).
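A small sketch of the shift-based write described above. To keep it compilable without the unsafe switch, it obtains the bits with BitConverter.SingleToInt32Bits (available on .NET Core 2.0 and later) as a stand-in for the pointer cast; the class and method names are made up for illustration.

```csharp
using System;

public static class FloatWriter
{
    // Writes the IEEE 754 bits of f into buffer, most significant byte first.
    public static void WriteBigEndian(float f, byte[] buffer, int offset)
    {
        // Safe stand-in for: int i = *(int*)&f;
        int i = BitConverter.SingleToInt32Bits(f);
        unchecked
        {
            // Each cast keeps only the low 8 bits of the shifted value.
            buffer[offset]     = (byte)(i >> 24);
            buffer[offset + 1] = (byte)(i >> 16);
            buffer[offset + 2] = (byte)(i >> 8);
            buffer[offset + 3] = (byte)i;
        }
    }

    public static void Main()
    {
        var buffer = new byte[4];
        WriteBigEndian(1.0f, buffer, 0);
        Console.WriteLine(BitConverter.ToString(buffer)); // 3F-80-00-00
    }
}
```

Reversing the order of the four assignments would give little-endian output, which is the "I now control the endianness" part of the answer.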
There is at least one managed .Net API that often makes using pointers unavoidable. See SecureString and Marshal.SecureStringToGlobalAllocUnicode.
The only way to get the plain text value of a SecureString is to use one of the Marshal methods to copy it to unmanaged memory.
We have to interop with native code a lot, and in this case it is much faster to use unsafe structs that don't require marshaling. However, we cannot do this when the structs contain fixed size buffers of nonprimitive types.
Why is it a requirement from the C# compiler that fixed size buffers are only of the primitive types? Why can a fixed size buffer not be made of a struct such as:
[StructLayout(LayoutKind.Sequential)]
struct SomeType
{
    int Number1;
    int Number2;
}
Fixed size buffers in C# are implemented with a CLI feature called "opaque classes". Section I.12.1.6.3 of Ecma-335 describes them:
Some languages provide multi-byte data structures whose contents are manipulated directly by
address arithmetic and indirection operations. To support this feature, the CLI allows value types
to be created with a specified size but no information about their data members. Instances of
these “opaque classes” are handled in precisely the same way as instances of any other class, but
the ldfld, stfld, ldflda, ldsfld, and stsfld instructions shall not be used to access their contents.
The "no information about their data members" and "ldfld/stfld shall not be used" are the rub. The 2nd rule puts the kibosh on structures: you need ldfld and stfld to access their members. The C# compiler cannot provide an alternative, because the layout of a struct is a runtime implementation detail. Decimal and Nullable<> are out because they are structs as well. IntPtr is out because its size depends on the bitness of the process, making it difficult for the C# compiler to generate the address for the ldind/stind opcode used to access the buffer. Reference type references are out because the GC needs to be able to find them and, by the 1st rule, can't. Enum types have a variable size that depends on their base type; that sounds like a solvable problem, and I'm not entirely sure why they skipped it.
Which just leaves the ones mentioned by the C# language specification: sbyte, byte, short, ushort, int, uint, long, ulong, char, float, double or bool. Just the simple types with a well defined size.
What is a fixed buffer?
From MSDN:
In C#, you can use the fixed statement to create a buffer with a fixed size array in a data structure. This is useful when you are working with existing code, such as code written in other languages, pre-existing DLLs or COM projects. The fixed array can take any attributes or modifiers that are allowed for regular struct members. The only restriction is that the array type must be bool, byte, char, short, int, long, sbyte, ushort, uint, ulong, float, or double.
I'm just going to quote Mr. Hans Passant in regards to why a fixed buffer MUST be unsafe. You might see Why is a fixed size buffers (arrays) must be unsafe? for more information.
Because a "fixed buffer" is not a real array. It is a custom value type, about the only way
to generate one in the C# language that I know. There is no way for
the CLR to verify that indexing of the array is done in a safe way.
The code is not verifiable either. The most graphic demonstration of
this:
using System;
class Program {
    static unsafe void Main(string[] args) {
        var buf = new Buffer72();
        Console.WriteLine(buf.bs[8]);
        Console.ReadLine();
    }
}
public struct Buffer72 {
    public unsafe fixed byte bs[7];
}
You can arbitrarily access the stack frame in this example. The standard buffer overflow injection
technique would be available to malicious code to patch the function
return address and force your code to jump to an arbitrary location.
Yes, that's quite unsafe.
Why can't a fixed buffer contain non-primitive data types?
Simon White raised a valid point:
I'm gonna go with "added complexities to the compiler". The compiler would have to check that no .NET specific functionality was applied to the struct that applied to enumerable items. For example, generics, interface implementation, even deeper properties of non-primitive arrays, etc. No doubt the runtime would also have some interop issues with that sort of thing too.
And Ibasa:
"But that is already done by the compiler." Only partly. The compiler can do the checks to see if a type is managed but that doesn't take care of generating code to read/write structs to fixed buffers. It can be done (there's nothing stopping it at CIL level) it just isn't implemented in C#.
Lastly, Mehrdad:
I think it's literally because they don't want you to use fixed-size buffers (because they want you to use managed code). Making it too easy to interop with native code makes you less likely to use .NET for everything, and they want to promote managed code as much as possible.
The answer appears to be a resounding "it's just not implemented".
Why's it not implemented?
My guess is that the cost and implementation time just isn't worth it to them. The developers would rather promote managed code over unmanaged code. It could possibly be done in a future version of C#, but the current CLR lacks a lot of the complexity needed.
An alternative explanation could be security. Because fixed buffers are immensely vulnerable to all sorts of problems and security risks should they be implemented poorly in your code, I can see why their use would be discouraged in favor of managed code in C#. Why put a lot of work into something you'd like to discourage the use of?
I understand your point of view... on the other hand, I suppose that it could be some kind of forward compatibility reserved by Microsoft. Your code is compiled to MSIL, and it is the business of the specific .NET Framework and OS to lay it out in memory.
I can imagine a new CPU from Intel that requires variables to be aligned on 8-byte boundaries for optimal performance. In that case, some future .NET Framework 6 and some future Windows 9 would need to lay out these structs in a different way. Your example code would then put pressure on Microsoft not to change the memory layout in the future, and so not to speed up the .NET Framework for modern hardware.
It is only speculation...
Did you try setting FieldOffset? See C++ union in C#
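As one reading of the FieldOffset suggestion, the sketch below lays out two SomeType values at explicit offsets to approximate a fixed-size buffer of structs. The wrapper type SomeTypePair, its field names, and the offsets (which assume SomeType occupies 8 bytes) are all hypothetical, and the fields are made public here so they can be accessed.

```csharp
using System;
using System.Runtime.InteropServices;

[StructLayout(LayoutKind.Sequential)]
public struct SomeType
{
    public int Number1;
    public int Number2;
}

// Hypothetical stand-in for something like "fixed SomeType Items[2]".
[StructLayout(LayoutKind.Explicit, Size = 16)]
public struct SomeTypePair
{
    [FieldOffset(0)] public SomeType Item0;
    [FieldOffset(8)] public SomeType Item1; // assumes SomeType is 8 bytes
}

public static class LayoutDemo
{
    public static void Main()
    {
        // Unmanaged size of the wrapper and the byte offset of the second element.
        Console.WriteLine(Marshal.SizeOf<SomeTypePair>());
        Console.WriteLine(Marshal.OffsetOf<SomeTypePair>(nameof(SomeTypePair.Item1)));
    }
}
```

This only scales to a handful of elements, since every slot needs its own field, and it does not give you indexing; it is a workaround rather than a replacement for a real fixed buffer.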
A Judy array is a fast data structure that can represent a sparse array or a set of values. Is there an implementation of it for managed languages such as C#? Thanks
It's worth noting that these are often called Judy Trees or Judy Tries if you are googling for them.
I also looked for a .Net implementation but found nothing.
Also worth noting that:
The implementation is heavily designed around efficient cache usage, as such implementation specifics may be highly dependent on the size of certain constructs used within the sub structures. A .Net managed implementation may be somewhat different in this regard.
There are some significant hurdles to it that I can see (and there are probably more that my brief scan missed)
The API has some fairly anti-OO aspects (for example, a null pointer is viewed as an empty tree), so a simplistic conversion to C++ (move the state pointer to the LHS and make the functions instance methods) wouldn't work.
The implementation of the sub structures I looked at made heavy use of pointers. I cannot see these efficiently being translated to references in managed languages.
The implementation is a distillation of a lot of very complex ideas, which the simplicity of the public API belies.
The code base is about 20K lines (most of it complex); this doesn't strike me as an easy port.
You could take the library and wrap the C code in C++/CLI (probably simply holding internally a pointer that is the C API trie and having all the C calls point to this one). This would provide a simplistic implementation, but the linked libraries for the native implementation may be problematic (as might memory allocation).
You would also probably need to deal with converting .Net strings to plain old byte* on the transition as well (or just work with bytes directly)
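For the string conversion step, a minimal sketch might look like the following. It assumes the native side expects a null-terminated UTF-8 string; that encoding choice and the helper name are my assumptions, not something the Judy API or the answer specifies.

```csharp
using System;
using System.Text;

public static class NativeStrings
{
    // Returns the UTF-8 bytes of s followed by a single 0 terminator.
    public static byte[] ToNullTerminatedUtf8(string s)
    {
        int count = Encoding.UTF8.GetByteCount(s);
        var bytes = new byte[count + 1]; // the extra last byte stays 0
        Encoding.UTF8.GetBytes(s, 0, s.Length, bytes, 0);
        return bytes;
    }

    public static void Main()
    {
        Console.WriteLine(BitConverter.ToString(ToNullTerminatedUtf8("ab")));
    }
}
```

The resulting array could then be pinned or passed as a byte[] parameter to a P/Invoke declaration, at the cost of one allocation and copy per call.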
Judy really doesn't fit well with managed languages. I don't think you'll be able to use something like SWIG and get the first layer done automatically.
I wrote PyJudy and I ended up having to make some non-trivial API changes to fit well in Python. For example, I wrote in the documentation:
JudyL arrays map machine words to machine words. In practice the words store unsigned integers or pointers. PyJudy supports all four mappings as distinct classes.
pyjudy.JudyLIntInt - map unsigned integer keys to unsigned integer values
pyjudy.JudyLIntObj - map unsigned integer keys to Python object values
pyjudy.JudyLObjInt - map Python object keys to unsigned integer values
pyjudy.JudyLObjObj - map Python object keys to Python object values
I haven't looked at the code for a few years so my memories about it are pretty hazy. It was my first Python extension library, and I remember I hacked together a sort of template system for code generation. Nowadays I would use something like genshi.
I can't point to alternatives to Judy - that's one reason why I'm searching Stackoverflow.
Edit: I've been told that my timing numbers in the documentation are off from what Judy's documentation suggests because Judy is developed for 64-byte cache lines and my PowerBook's were only 32 bytes.
Some other links:
Patricia tries (http://www.csse.monash.edu.au/~lloyd/tildeAlgDS/Tree/PATRICIA/ )
Double-Array tries (http://linux.thai.net/~thep/datrie/datrie.html)
HAT-trie (http://members.optusnet.com.au/~askitisn/index.html)
The last has comparison numbers for different high-performance trie implementations.
This is proving trickier than I thought. PyJudy might be worth a look, as would be Tie::Judy. There's something on Softpedia, and something Ruby-ish. Trouble is, none of these are .NET specifically.