Getting the most performance out of C++/CLI - C#

Profiling my application reveals that 50% of the runtime is spent in a packArrays() function which performs array transformations in which C++ strongly outperforms C#.
To improve performance, I used unsafe in packArrays, but gained only low-single-digit percentage improvements in runtime. To rule out the cache as the bottleneck and to estimate the ceiling of the possible improvement, I rewrote packArrays in C++ and timed both versions. The C++ version runs approximately 5x faster than the C# one, so I decided to give C++/CLI a try.
As a result, I have three implementations:
C++ - a simple packArrays() function
C# - packArrays() is wrapped in a class, but the code inside the function is identical to the C++ version
C++/CLI - shown below, but again the implementation of packArrays() is identical (literally) to the previous two
The C++/CLI implementation is as follows
QCppCliPackArrays.cpp
public ref class QCppCliPackArrays
{
public:
    void packArrays(array<bool> ^ xBoolArray, int xLen, array<bool> ^% yBoolArray, int % yLen)
    {
        // pin the managed arrays so the GC cannot move them while native code uses them
        pin_ptr<bool> xBoolArrayPinned = &xBoolArray[0];
        bool * xBoolArray_ = xBoolArrayPinned;
        pin_ptr<bool> yBoolArrayPinned = &yBoolArray[0];
        bool * yBoolArray_ = yBoolArrayPinned;
        // call the unmanaged worker (global scope)
        ::packArrays(xBoolArray_, xLen, yBoolArray_, yLen);
    }
};
packArraysWorker.cpp
#pragma managed(push, off)
void packArrays(bool * xArray, int xLen, bool * yArray, int & yLen)
{
    ... actual code, identical across all languages ...
}
#pragma managed(pop)
QCppCliPackArrays.cpp is compiled with the /clr option; packArraysWorker.cpp is compiled with the No Common Language Runtime Support option.
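From the C# side, the wrapper is then called roughly like this (a sketch only; the array names and sizes are illustrative, not taken from the project):
// sketch of calling the C++/CLI wrapper from C#
var packer = new QCppCliPackArrays();
bool[] xBools = new bool[86000];   // input conditions
bool[] yBools = new bool[86000];   // output buffer
int yLen = 0;
packer.packArrays(xBools, xBools.Length, ref yBools, ref yLen);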
The problem: when a C# application runs both the C# and the C++/CLI implementations, the C++/CLI implementation is still only marginally faster than the C# one.
Questions:
Is there any other option/setting/keyword I can use to increase the performance of C++/CLI?
Can the performance loss of C++/CLI compared to C++ be wholly attributed to interop? Currently, over 10K repetitions, C# runs some 4.5 seconds slower than C++, which would put interop at 0.45 milliseconds per repetition. As all types being passed are blittable, I would expect interop to, well, just pass some pointers across.
Would I gain anything by using P/Invoke? From what I've read, no, but it's always better to ask (see the sketch after these questions).
Is there any other method I can use? Leaving a five-fold increase in performance on the table is just too much.
All timings are made in Release/x64 from the command line (not from VS) on a single thread.
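For reference, a P/Invoke declaration for the native worker could look roughly like the sketch below. The DLL name, calling convention and marshaling attributes are assumptions, not taken from the project; note also that bool is not actually a blittable type, so the marshaler copies bool[] arguments rather than just passing pointers across:
using System.Runtime.InteropServices;

static class NativePack
{
    // each bool element is marshaled as a single byte to match the C++ bool
    [DllImport("PackArraysNative.dll", CallingConvention = CallingConvention.Cdecl)]
    public static extern void packArrays(
        [In, MarshalAs(UnmanagedType.LPArray, ArraySubType = UnmanagedType.I1)] bool[] xArray,
        int xLen,
        [In, Out, MarshalAs(UnmanagedType.LPArray, ArraySubType = UnmanagedType.I1)] bool[] yArray,
        ref int yLen);
}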
EDIT:
In order to determine the performance loss due to interop, I placed a Stopwatch around the QCppCliPackArrays::packArrays() call as well as a chrono::high_resolution_clock inside packArrays() itself. The results show that the C# <-> C++/CLI switch costs approximately 5 milliseconds per 10K calls, and that the switch from managed C++/CLI to unmanaged C++/CLI costs nothing.
Hence, interop can be ruled out as the cause of the performance degradation.
On the other hand, it is obvious that packArrays() is NOT being run as unmanaged code! But why?
EDIT 2:
I tried linking packArrays() as a .lib exported from a separate unmanaged C++ library. The results are still the same.
EDIT 3:
The actual packArrays is this:
public void packArrays(bool[] xConditions, int[] xValues, int xLen, ref int[] yValuesPacked, ref int yPackedLen)
{
    // alloc: trueCount() is a custom extension method counting the true entries
    yPackedLen = xConditions.trueCount();
    yValuesPacked = new int[yPackedLen];

    // fill
    int xPackedIdx = 0;
    for (int xIdx = 0; xIdx < xLen; xIdx++)
        if (xConditions[xIdx])
            yValuesPacked[xPackedIdx++] = xValues[xIdx];
}
It puts into yValuesPacked all values from xValues for which the corresponding xConditions[i] is true.
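An unsafe variant of this function can be sketched as follows (illustrative only; this is not the exact code that was benchmarked):
// illustrative sketch of an unsafe/fixed version of packArrays: pins the
// arrays and walks them with raw pointers to avoid per-element bounds checks
public static unsafe void packArraysUnsafe(bool[] xConditions, int[] xValues, int xLen, ref int[] yValuesPacked, ref int yPackedLen)
{
    // count the true entries first so the output array can be allocated once
    int count = 0;
    for (int i = 0; i < xLen; i++)
        if (xConditions[i]) count++;
    yPackedLen = count;
    yValuesPacked = new int[count];

    fixed (bool* xc = xConditions)
    fixed (int* xv = xValues, yp = yValuesPacked)
    {
        int packedIdx = 0;
        for (int i = 0; i < xLen; i++)
            if (xc[i])
                yp[packedIdx++] = xv[i];
    }
}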
Now I am facing a new issue. I have several implementations aiming to solve this problem, all of which work correctly (tested). I ran a benchmark that individually calls each implementation 50K times on arrays 86K items long and compared the timings in seconds.
The original implementation, originalArray, is the code listed above. Clearly, the QCsCpp* versions, the implementations using C++/CLI, dominate the benchmark. However, when I replace originalArray in my original application, which calls packArrays a vast number of times, with either QCsCpp* implementation, the whole application runs SLOWER. With this result I am really clueless, and I must admit it honestly crushed me. How can this be true? As always, any insight is much appreciated.

Related

C# - How to Bypass Error cs0212 Cheaply for Programmers and Computers?

I want to process many integers in a class, so I put them into an int* array.
int*[] pp = new int*[]{&aaa,&bbb,&ccc};
However, the compiler rejected the code above with the following EXCUSE:
> You can only take the address of an unfixed expression inside of a fixed statement initializer
I know I can change the code above to avoid this error; however, keep in mind that ddd and eee will join the array in the future.
// Alternative 1: an enum of indexes into the array gg
public enum E {
    aaa,
    bbb,
    ccc,
    _count
}

for (int i = 0; i < (int)E._count; i++)
    gg[(int)E.bbb]

// Alternative 2: a dictionary mapping names to indexes into gg
Dictionary<string, int> ppp = new Dictionary<string, int>();
ppp["aaa"] = ppp.Count;
ppp["bbb"] = ppp.Count;
ppp["ccc"] = ppp.Count;
gg[ppp["bbb"]]
These solutions work, but they make both the code and the execution time longer.
I have also hoped for an unofficial patch to the compiler, or an alternative unofficial C# compiler, but I have not seen one available for download in many years; it seems very difficult for us to get one.
Are there better ways, so that:
I do not need to track the element count of ppp myself.
If the code becomes longer, it is only by several characters.
The execution time does not increase much.
Adding ddd and eee to the array takes only one or two statements per new member.
.NET runtime is a managed execution runtime which (among other things) provides garbage collection. .NET garbage collector (GC)
not only manages the allocation and release of memory, but also transparently moves the objects around the "managed heap", blocking
the rest of your code while doing it.
It also compacts (defragments) the memory by moving longer lived objects together, and even "promoting" them into different parts of the heap, called generations, to avoid checking their status too often.
There is a bunch of memory being copied all the time without your program even realizing it. Since garbage collection is an operation that can happen at any time during the execution of your program, any pointer-related
("unsafe") operations must be done within a small scope, by telling the runtime to "pin" the objects using the fixed keyword. This prevents the GC from moving them, but only for a while.
Using pointers and unsafe code in C# is not only less safe, but also not very idiomatic for managed languages in general. If you are coming from a C background, you may feel at home with these constructs, but C# has a completely different philosophy: your job as a C# programmer is to write reliable, readable and maintainable code, and only then think about squeezing out a couple of CPU cycles for performance reasons. You can use pointers from time to time in small functions doing some very specific, time-critical work. But even then it is your duty to profile before making such optimizations. Even the most experienced programmers often fail at predicting bottlenecks before profiling.
Finally, regarding your actual code:
I don't see why you think this:
int*[] pp = new int*[] {&aaa, &bbb, &ccc};
would be any more performant than this:
int[] pp = new int[] {aaa, bbb, ccc};
On a 32-bit machine, an int and a pointer are of the same size. On a 64-bit machine, a pointer is even bigger.
Consider replacing these plain ints with a class of your own which will provide some context and additional functionality/data to each of these values. Create a new question describing the actual problem you are trying to solve (you can also use Code Review for such questions) and you will benefit from much better suggestions.
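As an example of that last suggestion (a sketch only; the Counter type and its members are made up, not taken from your code):
public class Counter
{
    public string Name;
    public int Value;
}

// the array holds references, so updating pp[i].Value updates the original
// object; no pointers and no fixed blocks are needed
Counter aaa = new Counter { Name = "aaa" };
Counter bbb = new Counter { Name = "bbb" };
Counter ccc = new Counter { Name = "ccc" };
Counter[] pp = { aaa, bbb, ccc };
pp[1].Value = 42; // updates bbb.Value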

Prevent compiler/cpu instruction reordering c#

I have an Int64 containing two Int32 like this:
[StructLayout(LayoutKind.Explicit)]
public struct PackedInt64
{
    [FieldOffset(0)]
    public Int64 All;

    [FieldOffset(0)]
    public Int32 First;

    [FieldOffset(4)]
    public Int32 Second;
}
Now I want constructors (for All, for First and for Second). However, the struct requires all fields to be assigned before the constructor exits.
Consider the All constructor.
public PackedInt64(Int64 all)
{
    this.First = 0;
    this.Second = 0;
    Thread.MemoryBarrier();
    this.All = all;
}
I want to be absolutely sure that this.All is assigned last in the constructor so that half of the field or more isn't overwritten in case of some compiler optimization or instruction reordering in the cpu.
Is Thread.MemoryBarrier() sufficient? Is it the best option?
Yes, this is the correct and best way of preventing reordering.
By executing Thread.MemoryBarrier() in your sample code, the processor will never be allowed to reorder instructions in such a way that the access/modification of First or Second occurs after the access/modification of All. Since they all occupy the same storage, you don't have to worry about your later changes being overwritten by your earlier ones.
Note that Thread.MemoryBarrier() only works for the current executing thread -- it isn't a type of lock. However, given that this code is running in a constructor and no other thread can yet have access to this data, this should be perfectly fine. If you do need cross-thread guarantee of operations, however, you'll need to use a locking mechanism to guarantee exclusive access.
Note that you may not actually need this instruction on x86-based machines, but I would still recommend keeping it in case you one day run on a platform with a weaker memory model (such as IA64), where stores can be reordered, not just loads.
The MemoryBarrier will prevent re-ordering, but this code is still broken.
LayoutKind.Explicit and FieldOffsetAttribute are documented as affecting the memory layout of the object when it is passed to unmanaged code. It can be used to interop with a C union, but it cannot be used to emulate a C union.
Even if it currently acts the way you expect, on the platform you tested, there is no guarantee that it will continue to do so. The only guarantee made is in the context of interop with unmanaged code (that is, p/invoke, COM interop, or C++/CLI it-just-works).
If you want to read a subset of the bytes in a portable, future-proof manner, you'll have to use bitwise operations, or a byte array and BitConverter, even if the syntax isn't as nice.
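For instance, a minimal sketch of the bitwise approach, assuming the convention that First occupies the low 32 bits of All:
// sketch only: packs/unpacks two Int32 values in one Int64 with shifts and
// masks, so the result does not depend on the runtime field layout
static class Packed
{
    public static long Pack(int first, int second) =>
        ((long)second << 32) | (uint)first;

    public static int First(long all) => unchecked((int)all);

    public static int Second(long all) => unchecked((int)(all >> 32));
}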
Check the remarks section of the following link: http://msdn.microsoft.com/en-us/library/system.threading.thread.memorybarrier.aspx
It says MemoryBarrier() is required only on multiprocessor systems with weak memory ordering. So it is a sufficient option, but whether it is the best option depends on the system you are targeting.
First, I'm aware this answer doesn't really solve the reordering problem; it sidesteps it. By using unsafe code, you can avoid writing to First and Second completely.
public unsafe PackedInt64(long all)
{
    fixed (PackedInt64* ptr = &this)
        *(long*)ptr = all;
}
It's not meant to be the most elegant solution and probably doesn't pass most company policies regarding managed code, but it should work.
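A quick usage sketch, assuming the PackedInt64 struct and constructor from the question compile in your project, and a little-endian platform such as x86/x64:
var packed = new PackedInt64(0x0000000200000001L);
Console.WriteLine(packed.First);  // prints 1 on little-endian hardware
Console.WriteLine(packed.Second); // prints 2 on little-endian hardware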

.NET EventWaitHandle slow

I'm using waveOutWrite with a callback function, and in native code everything is fast. Under .NET it is much slower, to the point that I think I'm doing something very wrong; it is sometimes 5 or 10 times slower.
I could post both sets of code, but that seems like too much, so I'll just post the C code that is fast and point out the minor variances in the .NET code.
HANDLE WaveEvent;
const int TestCount = 100;
HWAVEOUT hWaveOut[1]; // don't ask why this is an array, just test code
WAVEHDR woh[1][20];
void CALLBACK OnWaveOut(HWAVEOUT,UINT uMsg,DWORD,DWORD,DWORD)
{
if(uMsg != WOM_DONE)
return;
assert(SetEvent(WaveEvent)); // .NET code uses EventWaitHandle.Set()
}
void test(void)
{
WaveEvent = CreateEvent(NULL,FALSE,FALSE,NULL);
assert(WaveEvent);
WAVEFORMATEX wf;
memset(&wf,0,sizeof(wf));
wf.wFormatTag = WAVE_FORMAT_PCM;
wf.nChannels = 1;
wf.nSamplesPerSec = 8000;
wf.wBitsPerSample = 16;
wf.nBlockAlign = WORD(wf.nChannels*(wf.wBitsPerSample/8));
wf.nAvgBytesPerSec = (wf.wBitsPerSample/8)*wf.nSamplesPerSec;
assert(waveOutOpen(&hWaveOut[0],WAVE_MAPPER,&wf,(DWORD)OnWaveOut,0,CALLBACK_FUNCTION) == MMSYSERR_NOERROR);
for(int x=0;x<2;x++)
{
memset(&woh[0][x],0,sizeof(woh[0][x]));
woh[0][x].dwBufferLength = PCM_BUF_LEN;
woh[0][x].lpData = (char*) malloc(woh[0][x].dwBufferLength);
assert(waveOutPrepareHeader(hWaveOut[0],&woh[0][x],sizeof(woh[0][x])) == MMSYSERR_NOERROR);
assert(waveOutWrite(hWaveOut[0],&woh[0][x],sizeof(woh[0][x])) == MMSYSERR_NOERROR);
}
int bufferIndex = 0;
DWORD times[TestCount];
for(int x=0;x<TestCount;x++)
{
DWORD t = timeGetTime();
assert(WaitForSingleObject(WaveEvent,INFINITE) == WAIT_OBJECT_0); // .NET code uses EventWaitHandle.WaitOne()
assert(woh[0][bufferIndex].dwFlags & WHDR_DONE);
assert(waveOutWrite(hWaveOut[0],&woh[0][bufferIndex],sizeof(woh[0][bufferIndex])) == MMSYSERR_NOERROR);
bufferIndex = bufferIndex == 0 ? 1 : 0;
times[x] = timeGetTime() - t;
}
}
The times[] array for the C code always has values around 80, which is the PCM buffer length I am using. The .NET code also shows similar values sometimes, however, it sometimes shows values as high as 1000, and more often values in the 300 to 500 range.
Doing the work from the bottom loop inside the OnWaveOut callback, instead of using events, makes it fast all the time, with .NET or native code. So it appears the issue is with the wait events in .NET only, and mostly only when "other stuff" is happening on the test PC; not a lot of stuff, it can be as simple as moving a window around or opening a folder in My Computer.
Maybe .NET events are just really bad about context switching, or .NET apps/threads in general? In the app I'm using to test my .NET code, the code just runs in the constructor of a form (easy place to add test code), not on a thread-pool thread or anything.
I also tried using the version of waveOutOpen that takes an event instead of a function callback. This is also slow in .NET but not in C, so again, it points to an issue with events and/or context switching.
I'm trying to keep my code simple and setting an event to do the work outside the callback is the best way I can do this with my overall design. Actually just using the event driven waveOut is even better, but I tried this other method because straight callbacks are fast, and I didn't expect normal event wait handles to be so slow.
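For what it's worth, the .NET-side pattern described above (the question's actual .NET code is not shown) boils down to something like this sketch, mirroring CreateEvent/SetEvent/WaitForSingleObject with an auto-reset EventWaitHandle:
using System.Threading;

class WaveOutTest
{
    // auto-reset, initially non-signaled, like CreateEvent(NULL, FALSE, FALSE, NULL)
    static readonly EventWaitHandle WaveEvent =
        new EventWaitHandle(false, EventResetMode.AutoReset);

    // called from the waveOutProc callback on WOM_DONE
    static void OnWaveOut() => WaveEvent.Set();

    // playback loop: block until the driver reports a finished buffer
    static void WaitForNextBuffer() => WaveEvent.WaitOne();
}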
Maybe not 100% related, but I faced somewhat the same issue: calling EventWaitHandle.Set X times is fine, but then, after a threshold I could not pin down, each call of this method takes a full second!
It appears that some .NET ways of synchronizing threads are much slower than the ones you use in C++.
The almighty @jonskeet once made a post on his web site (https://jonskeet.uk/csharp/threads/waithandles.html) where he also refers to the very complex concept of .NET synchronization domains, explained here: https://www.drdobbs.com/windows/synchronization-domains/184405771
He mentions that .NET and the OS must communicate in a very time-precise way, with objects that must be converted from one environment to the other. All of this is very time consuming.
I am summarizing a lot here, not to take credit for the answer, but there is an explanation. There are also some recommendations here (https://learn.microsoft.com/en-us/dotnet/standard/threading/overview-of-synchronization-primitives) about how to choose a synchronization mechanism depending on the context; the performance aspect is mentioned a little.

How many bytes does my function use? (C#)

I would like to calculate how many bytes my function fills so that I can inject it into another process using CreateRemoteThread(). Once I know the number of bytes, I can write them into the remote process using the function's pointer. I have found an article online (see http://www.codeproject.com/KB/threads/winspy.aspx#section_3, chapter III) where they do the following in C++ :
// ThreadFunc
// Notice: the code being injected.
// Return value: password length
static DWORD WINAPI ThreadFunc (INJDATA *pData)
{
    // Code to be executed remotely
}

// This function marks the memory address after ThreadFunc.
static void AfterThreadFunc (void)
{
}
Then they calculate the number of bytes ThreadFunc fills using :
const int cbCodeSize = ((LPBYTE) AfterThreadFunc - (LPBYTE) ThreadFunc);
Using cbCodeSize they allocate memory in the remote process for the injected ThreadFunc and write a copy of ThreadFunc to the allocated memory:
pCodeRemote = (PDWORD) VirtualAllocEx(hProcess, 0, cbCodeSize, MEM_COMMIT, PAGE_EXECUTE_READWRITE);
if (pCodeRemote == NULL)
    __leave;
WriteProcessMemory(hProcess, pCodeRemote, &ThreadFunc, cbCodeSize, &dwNumBytesXferred);
I would like to do this in C#. :)
I have tried creating delegates, getting their pointers, and subtracting them like this:
// Thread proc, to be used with Create*Thread
public delegate int ThreadProc(InjectionData param);

// Function pointer
ThreadFuncDeleg = new ThreadProc(ThreadFunc);
ThreadFuncPtr = Marshal.GetFunctionPointerForDelegate(ThreadFuncDeleg);

// Function pointer
AfterThreadFuncDeleg = new ThreadProc(AfterThreadFunc);
IntPtr AfterThreadFuncDelegPtr = Marshal.GetFunctionPointerForDelegate(AfterThreadFuncDeleg);

// Number of bytes
int cbCodeSize = (AfterThreadFuncDelegPtr.ToInt32() - ThreadFuncPtr.ToInt32()) * 4;
It just does not seem right, as I get a static number no matter what I do with the code.
My question is, if possible, how does one calculate the number of bytes a function's code fills in C#?
Thank you in advance.
I don't think this is possible, due to dynamic optimization and code generation in .NET. You can try to measure the IL code length, but measuring the machine-dependent code length will fail in the general case.
By 'fail' I mean you cannot get a correct, meaningful size by using this technique dynamically.
Of course you can dig into how NGEN and the JIT compiler work, and into the PDB structure, and try to measure from there. You can determine the size of your code by exploring the generated machine code in Visual Studio, for example.
How to see the Assembly code generated by the JIT using Visual Studio
If you really need to determine the size, start with .NET Internals and Code Injection / .NET Internals and Native Compiling, but I can't imagine why you would ever want to.
Be aware that all the internals of exactly how the JIT works are subject to change, so any solution depending on them can be broken by a future version of .NET.
If you want to stick with IL: check the Profiling Interfaces (CLR Profiling API), and the somewhat older articles Rewrite MSIL Code on the Fly with the .NET Framework Profiling API and No Code Can Hide from the Profiling API in the .NET Framework 2.0. There are also some topics about the CLR Profiling API here on SO.
But the simplest way to explore an assembly is the Reflection API; you want MethodBody there. Check the length of the array returned by MethodBody.GetILAsByteArray() and you'll get the method's length in IL bytes.
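A minimal sketch of that Reflection approach (MyClass and MyMethod are placeholders for whatever method you want to measure):
using System;
using System.Reflection;

class Program
{
    static void Main()
    {
        // fetch the method and read the raw IL of its body
        MethodInfo method = typeof(MyClass).GetMethod("MyMethod");
        byte[] il = method.GetMethodBody().GetILAsByteArray();
        Console.WriteLine($"IL body length: {il.Length} bytes");
    }
}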

How much does bytecode size impact JIT / Inlining / Performance?

I've been poking around mscorlib to see how the generic collection optimized their enumerators and I stumbled on this:
// in List<T>.Enumerator
public bool MoveNext()
{
    List<T> list = this.list;
    if ((this.version == list._version) && (this.index < list._size))
    {
        this.current = list._items[this.index];
        this.index++;
        return true;
    }
    return this.MoveNextRare();
}
The stack size is 3, and the size of the bytecode should be 80 bytes. The naming of the MoveNextRare method caught my attention; it contains both an error case and an empty-collection case, so it clearly breaches separation of concerns.
I assume the MoveNext method is split this way to optimize stack space and help the JIT, and I'd like to do the same for some of my perf bottlenecks, but without hard data, I don't want my voodoo programming turning into cargo-cult ;)
Thanks!
Florian
If you're going to think about ways in which List<T>.Enumerator is "odd" for the sake of performance, consider this first: it's a mutable struct. Feel free to recoil with horror; I know I do.
Ultimately, I wouldn't start mimicking optimisations from the BCL without benchmarking/profiling what difference they make in your specific application. It may well be appropriate for the BCL but not for you; don't forget that the BCL goes through the whole NGEN-alike service on install. The only way to find out what's appropriate for your application is to measure it.
You say you want to try the same kind of thing for your performance bottlenecks: that suggests you already know the bottlenecks, which suggests you've got some sort of measurement in place. So, try this optimisation and measure it, then see whether the gain in performance is worth the pain of readability/maintenance which goes with it.
There's nothing cargo-culty about trying something and measuring it, then making decisions based on that evidence.
Separating it into two functions has some advantages:
If the method were to be inlined, only the fast path would be inlined and the error handling would still be a function call. This prevents inlining from costing too much extra space. But 80 bytes of IL is probably still above the threshold for inlining (it was once documented as 32 bytes, don't know if it's changed since .NET 2.0).
Even if it isn't inlined, the function will be smaller and fit within the CPU's instruction cache more easily, and since the slow path is separate, it won't have to be fetched into cache every time the fast path is.
It may help the CPU branch predictor optimize for the more common path (returning true).
I think that MoveNextRare is always going to return false, but by structuring it like this it becomes a tail call, and if it's private and can only be called from here then the JIT could theoretically build a custom calling convention between these two methods that consists of just a jmp instruction with no prologue and no duplication of epilogue.
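If you want to try the same split on one of your own bottlenecks, a hedged sketch could look like this (the Cache type and its methods are made up; measure before and after, as suggested above):
using System.Runtime.CompilerServices;

public class Cache
{
    private int _cachedKey;
    private string _cachedValue;

    public string Get(int key)
    {
        // small, hot path: a good inlining candidate
        if (key == _cachedKey && _cachedValue != null)
            return _cachedValue;

        // rare path kept out of the hot method's IL
        return GetSlow(key);
    }

    [MethodImpl(MethodImplOptions.NoInlining)]
    private string GetSlow(int key)
    {
        // the expensive lookup / error handling would go here
        _cachedKey = key;
        _cachedValue = key.ToString();
        return _cachedValue;
    }
}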
