Make sure you run outside of the IDE. That is key.
-edit- I LOVE SLaks comment. "The amount of misinformation in these answers is staggering." :D
Calm down guys. Pretty much all of you were wrong. I DID make optimizations.
It turns out whatever optimizations I made weren't good enough.
I ran the code in GCC using gettimeofday (I'll paste code below), compiled with g++ -O2 file.cpp, and got slightly faster results than C#.
Maybe MS didn't apply the optimizations needed in this specific case, but after downloading and installing MinGW I tested and found the speed to be nearly identical.
Justicle seems to be right. I could have sworn I used clock() on my PC to time it and found it was slower, but problem solved. C++ isn't almost twice as slow with the MS compiler.
When my friend informed me of this I couldn't believe it. So I took his code and put some timers onto it.
Instead of Boo I used C#. I consistently got faster results in C#. Why? The .NET version took nearly half the time no matter what number I used.
C++ version (bad version):
#include <iostream>
#include <stdio.h>
#include <intrin.h>
#include <windows.h>
using namespace std;

int fib(int n)
{
    if (n < 2) return n;
    return fib(n - 1) + fib(n - 2);
}

int main()
{
    __int64 time = 0xFFFFFFFF;
    while (1)
    {
        int n;
        //cin >> n;
        n = 41;
        if (n < 0) break;
        __int64 start = __rdtsc();
        int res = fib(n);
        __int64 end = __rdtsc();
        cout << res << endl;
        cout << (float)(end - start) / 1000000 << endl;
        break;
    }
    return 0;
}
C++ version (better version):
#include <iostream>
#include <stdio.h>
#include <intrin.h>
#include <windows.h>
using namespace std;

int fib(int n)
{
    if (n < 2) return n;
    return fib(n - 1) + fib(n - 2);
}

int main()
{
    __int64 time = 0xFFFFFFFF;
    while (1)
    {
        int n;
        //cin >> n;
        n = 41;
        if (n < 0) break;
        LARGE_INTEGER start, end, delta, freq;
        ::QueryPerformanceFrequency(&freq);
        ::QueryPerformanceCounter(&start);
        int res = fib(n);
        ::QueryPerformanceCounter(&end);
        delta.QuadPart = end.QuadPart - start.QuadPart;
        cout << res << endl;
        cout << (delta.QuadPart * 1000) / freq.QuadPart << endl;
        break;
    }
    return 0;
}
C# version:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Runtime.InteropServices;
using System.ComponentModel;
using System.Threading;
using System.IO;
using System.Diagnostics;

namespace fibCSTest
{
    class Program
    {
        static int fib(int n)
        {
            if (n < 2) return n;
            return fib(n - 1) + fib(n - 2);
        }

        static void Main(string[] args)
        {
            //var sw = new Stopwatch();
            //var timer = new PAB.HiPerfTimer();
            var timer = new Stopwatch();
            while (true)
            {
                int n;
                //cin >> n;
                n = 41;
                if (n < 0) break;
                timer.Start();
                int res = fib(n);
                timer.Stop();
                Console.WriteLine(res);
                Console.WriteLine(timer.ElapsedMilliseconds);
                break;
            }
        }
    }
}
GCC version:
#include <iostream>
#include <stdio.h>
#include <sys/time.h>
using namespace std;

int fib(int n)
{
    if (n < 2) return n;
    return fib(n - 1) + fib(n - 2);
}

int main()
{
    timeval start, end;
    while (1)
    {
        int n;
        //cin >> n;
        n = 41;
        if (n < 0) break;
        gettimeofday(&start, 0);
        int res = fib(n);
        gettimeofday(&end, 0);
        int sec = end.tv_sec - start.tv_sec;
        int usec = end.tv_usec - start.tv_usec;
        cout << res << endl;
        cout << sec << " " << usec << endl;
        break;
    }
    return 0;
}
EDIT: TL/DR version: CLR JIT will inline one level of recursion, MSVC 8 SP1 will not without #pragma inline_recursion(on). And you should run the C# version outside of a debugger to get the fully optimized JIT.
I got similar results to acidzombie24 with C# vs. C++ using VS 2008 SP1 on a Core2 Duo laptop running Vista plugged in with "high performance" power settings (~1600 ms vs. ~3800 ms). It's kind of tricky to see the optimized JIT'd C# code, but for x86 it boils down to this:
00000000 55 push ebp
00000001 8B EC mov ebp,esp
00000003 57 push edi
00000004 56 push esi
00000005 53 push ebx
00000006 8B F1 mov esi,ecx
00000008 83 FE 02 cmp esi,2
0000000b 7D 07 jge 00000014
0000000d 8B C6 mov eax,esi
0000000f 5B pop ebx
00000010 5E pop esi
00000011 5F pop edi
00000012 5D pop ebp
00000013 C3 ret
return fib(n - 1) + fib(n - 2);
00000014 8D 7E FF lea edi,[esi-1]
00000017 83 FF 02 cmp edi,2
0000001a 7D 04 jge 00000020
0000001c 8B DF mov ebx,edi
0000001e EB 19 jmp 00000039
00000020 8D 4F FF lea ecx,[edi-1]
00000023 FF 15 F8 2F 12 00 call dword ptr ds:[00122FF8h]
00000029 8B D8 mov ebx,eax
0000002b 4F dec edi
0000002c 4F dec edi
0000002d 8B CF mov ecx,edi
0000002f FF 15 F8 2F 12 00 call dword ptr ds:[00122FF8h]
00000035 03 C3 add eax,ebx
00000037 8B D8 mov ebx,eax
00000039 4E dec esi
0000003a 4E dec esi
0000003b 83 FE 02 cmp esi,2
0000003e 7D 04 jge 00000044
00000040 8B D6 mov edx,esi
00000042 EB 19 jmp 0000005D
00000044 8D 4E FF lea ecx,[esi-1]
00000047 FF 15 F8 2F 12 00 call dword ptr ds:[00122FF8h]
0000004d 8B F8 mov edi,eax
0000004f 4E dec esi
00000050 4E dec esi
00000051 8B CE mov ecx,esi
00000053 FF 15 F8 2F 12 00 call dword ptr ds:[00122FF8h]
00000059 03 C7 add eax,edi
0000005b 8B D0 mov edx,eax
0000005d 03 DA add ebx,edx
0000005f 8B C3 mov eax,ebx
00000061 5B pop ebx
00000062 5E pop esi
00000063 5F pop edi
00000064 5D pop ebp
00000065 C3 ret
Contrast that with the generated C++ code (/Ox /Ob2 /Oi /Ot /Oy /GL /Gr):
int fib(int n)
{
00B31000 56 push esi
00B31001 8B F1 mov esi,ecx
if (n < 2) return n;
00B31003 83 FE 02 cmp esi,2
00B31006 7D 04 jge fib+0Ch (0B3100Ch)
00B31008 8B C6 mov eax,esi
00B3100A 5E pop esi
00B3100B C3 ret
00B3100C 57 push edi
return fib(n - 1) + fib(n - 2);
00B3100D 8D 4E FE lea ecx,[esi-2]
00B31010 E8 EB FF FF FF call fib (0B31000h)
00B31015 8D 4E FF lea ecx,[esi-1]
00B31018 8B F8 mov edi,eax
00B3101A E8 E1 FF FF FF call fib (0B31000h)
00B3101F 03 C7 add eax,edi
00B31021 5F pop edi
00B31022 5E pop esi
}
00B31023 C3 ret
The C# version basically inlines fib(n-1) and fib(n-2). For a function that is so call heavy, reducing the number of function calls is the key to speed. Replacing fib with the following:
int fib(int n);

int fib2(int n)
{
    if (n < 2) return n;
    return fib(n - 1) + fib(n - 2);
}

int fib(int n)
{
    if (n < 2) return n;
    return fib2(n - 1) + fib2(n - 2);
}
Gets it down to ~1900 ms. Incidentally, if I use #pragma inline_recursion(on) I get similar results with the original fib. Unrolling it one more level:
int fib(int n);

int fib3(int n)
{
    if (n < 2) return n;
    return fib(n - 1) + fib(n - 2);
}

int fib2(int n)
{
    if (n < 2) return n;
    return fib3(n - 1) + fib3(n - 2);
}

int fib(int n)
{
    if (n < 2) return n;
    return fib2(n - 1) + fib2(n - 2);
}
Gets it down to ~1380 ms. Beyond that it tapers off.
So it appears that the CLR JIT for my machine will inline recursive calls one level, whereas the C++ compiler will not do that by default.
If only all performance critical code were like fib!
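If you want to see on your own machine how much the JIT's inlining is worth, one crude experiment (my own sketch, not part of the measurements above) is to compare the plain method against one decorated with MethodImplOptions.NoInlining:

using System;
using System.Diagnostics;
using System.Runtime.CompilerServices;

class FibInlineTest
{
    static int fib(int n)
    {
        if (n < 2) return n;
        return fib(n - 1) + fib(n - 2);
    }

    // The attribute forbids the JIT from inlining this method anywhere,
    // including at its own recursive call sites.
    [MethodImpl(MethodImplOptions.NoInlining)]
    static int fibNoInline(int n)
    {
        if (n < 2) return n;
        return fibNoInline(n - 1) + fibNoInline(n - 2);
    }

    static void Main()
    {
        var timer = Stopwatch.StartNew();
        Console.WriteLine(fib(41));
        timer.Stop();
        Console.WriteLine("default:    {0} ms", timer.ElapsedMilliseconds);

        timer = Stopwatch.StartNew();
        Console.WriteLine(fibNoInline(41));
        timer.Stop();
        Console.WriteLine("NoInlining: {0} ms", timer.ElapsedMilliseconds);
    }
}

If the single level of inlining described above is really what buys the CLR its advantage here, the NoInlining variant should land noticeably closer to the un-unrolled C++ numbers.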
EDIT:
While the original C++ timing is wrong (comparing cycles to milliseconds), better timing does show C# is faster with vanilla compiler settings.
OK, enough random speculation, time for some science. After getting weird results with existing C++ code, I just tried running:
int fib(int n)
{
    if (n < 2) return n;
    return fib(n - 1) + fib(n - 2);
}

int main()
{
    __int64 time = 0xFFFFFFFF;
    while (1)
    {
        int n;
        //cin >> n;
        n = 41;
        if (n < 0) break;
        LARGE_INTEGER start, end, delta, freq;
        ::QueryPerformanceFrequency(&freq);
        ::QueryPerformanceCounter(&start);
        int res = fib(n);
        ::QueryPerformanceCounter(&end);
        delta.QuadPart = end.QuadPart - start.QuadPart;
        cout << res << endl;
        cout << (delta.QuadPart * 1000) / freq.QuadPart << endl;
        break;
    }
    return 0;
}
EDIT:
MSN pointed out you should time C# outside the debugger, so I re-ran everything:
Best Results (VC2008, running release build from commandline, no special options enabled)
C++ Original Code - 10239
C++ QPF - 3427
C# - 2166 (was 4700 in debugger).
The original C++ code (with rdtsc) wasn't returning milliseconds, just a scaled count of clock cycles, so comparing it directly to the Stopwatch results is invalid; the original timing code is just wrong.
Note that Stopwatch uses the QueryPerformance* calls:
http://msdn.microsoft.com/en-us/library/system.diagnostics.stopwatch.aspx
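For what it's worth, Stopwatch exposes the same frequency/tick pair, so you can mirror the C++ QueryPerformanceCounter arithmetic exactly if you want sub-millisecond output (a small sketch, reusing the fib from the C# program above):

var sw = Stopwatch.StartNew();
int res = fib(41);
sw.Stop();
// ticks / Frequency = seconds, the same division the C++ QPC code does
double ms = sw.ElapsedTicks * 1000.0 / Stopwatch.Frequency;
Console.WriteLine("{0}: {1:F3} ms", res, ms);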
So in this case C++ is faster than C#.
It depends on your compiler settings - see MSN's answer.
I don't understand the answers about garbage collection or console buffering.
It could be that your timer mechanism in C++ is inherently flawed.
According to http://en.wikipedia.org/wiki/Rdtsc, it is possible that you get wrong benchmark results.
Quoted:
While this makes time keeping more consistent, it can skew benchmarks, where a certain amount of spin-up time is spent at a lower clock rate before the OS switches the processor to the higher rate. This has the effect of making things seem like they require more processor cycles than they normally would.
I think the problem is your timing code in C++.
From the MS docs for __rdtsc:
Generates the rdtsc instruction, which returns the processor time stamp.
The processor time stamp records the number of clock cycles since the last reset.
Perhaps try GetTickCount().
Not saying that's the issue, but you may want to read How to: Use the High-Resolution Timer
Also see this...
http://en.wikipedia.org/wiki/Comparison_of_Java_and_C%2B%2B#Performance
Several studies of mostly numerical benchmarks argue that Java could potentially be faster than C++ in some circumstances, for a variety of reasons:[8][9]
Pointers make optimization difficult since they may point to arbitrary data, though many C++ compilers provide the C99 keyword restrict which corrects this problem.[10]
Compared to C++ implementations which make unrestrained use of standard implementations of malloc/new for memory allocation, implementations of Java garbage collection may have better cache coherence as its allocations are generally made sequentially.
Run-time compilation can potentially use additional information available at run-time to optimise code more effectively, such as knowing what processor the code will be executed on.
It's about Java but begins to tackle the issue of Performance between C runtimes and JITed runtimes.
Maybe C# is able to unroll the stack in recursive calls? I think that also reduces the number of computations.
One important thing to remember when comparing languages is that if you do a simple line-by-line translation, you're not comparing apples to apples.
What makes sense in one language may have horrible side effects in another. To really compare the performance characteristics you need a C# version and a C++ version, and the code for those versions may be very different. For example, in C# I wouldn't even use the same function signature. I'd go with something more like this:
IEnumerable<int> Fibonacci()
{
    int n1 = 0;
    int n2 = 1;
    yield return 1;
    while (true)
    {
        int n = n1 + n2;
        n1 = n2;
        n2 = n;
        yield return n;
    }
}
and then wrap that like this:
public static int fib(int n)
{
    return Fibonacci().Skip(n).First();
}
That will do much better, because it works from the bottom up to take advantage of the calculations in the last term to help build the next one, rather than two separate sets of recursive calls.
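If you don't need the lazy sequence, the same bottom-up idea as a plain loop is even simpler (a minimal sketch of mine, using the conventional fib(0) = 0, fib(1) = 1 indexing):

public static int fibIterative(int n)
{
    int n1 = 0, n2 = 1;
    for (int i = 0; i < n; i++)
    {
        int next = n1 + n2;   // each term reuses the previous two
        n1 = n2;
        n2 = next;
    }
    return n1;
}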
And if you really want screaming performance in C++ you can use meta-programming to make the compiler pre-compute your results like this:
template<int N> struct fibonacci
{
    static const int value = fibonacci<N - 1>::value + fibonacci<N - 2>::value;
};

template<> struct fibonacci<1>
{
    static const int value = 1;
};

template<> struct fibonacci<0>
{
    static const int value = 0;
};
It could be that the methods are pre-JITted at runtime prior to running the test, or that Console is a wrapper around the console-output API while the C++ cout is buffered... I guess.
Hope this helps,
Best regards,
Tom.
You are calling a static function in the C# code, which will be inlined, while in C++ you use a non-static function. I get ~1.4 s for C++; with g++ -O3 you can get 1.21 s.
You just can't compare C# with C++ using badly translated code.
If that code is truly 1/2 the execution time then some possible reasons are:
Garbage collection speeds up execution of C# code over C++ code if that were happening anywhere in the above code.
The C# writing to the console may be buffered (C++ might not, or it might just not be as efficient)
Speculation 1
Garbage collection procedure might play a role.
In the C++ version all memory management would occur inline while the program is running, and that would count into the final time.
In .NET the Garbage Collector (GC) of the Common Language Runtime (CLR) runs on separate threads and often cleans up after your program once it has completed. Therefore your program will finish and the times will print out before memory is freed, especially for small programs, which usually won't be cleaned up at all until completion.
It all depends on details of the Garbage Collection implementation (and if it optimizes for the stack in the same way as the heap) but I assume this plays a partial role in the speed gains. If the C++ version was also optimized to not deallocate/clean up memory until after it finished (or push that step until after the program completed) then I'm sure you would see C++ speed gains.
To Test GC: To see the "delayed" .NET GC behaviour in action, put a breakpoint in some of your object's destructor/finalizer methods. The debugger will come alive and hit those breakpoints after the program is completed (yes, after Main is completed).
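A minimal way to try that suggestion (my own sketch, with a throwaway Traced class):

using System;

class Traced
{
    ~Traced()
    {
        // Set a breakpoint here: per the speculation above you would only
        // expect to reach it after Main has already finished, if at all.
        Console.WriteLine("Traced finalized");
    }
}

class GcTest
{
    static void Main()
    {
        new Traced();               // becomes garbage immediately
        Console.WriteLine("Main done");
        // deliberately no GC.Collect()/WaitForPendingFinalizers(): the point
        // is to see when the runtime gets around to it on its own
    }
}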
Speculation 2
Otherwise, the C# source code is compiled by the programmer down to IL code (Microsoft byte code instructions), and at runtime that is in turn compiled by the CLR's Just-In-Time compiler into a processor-specific instruction set (as with classic compiled programs), so there's really no reason a .NET program should be slower once it gets going and has run the first time.
I think everyone here has missed the "secret ingredient" that makes all the difference: The JIT compiler knows exactly what the target architecture is, whereas a static compiler does not. Different x86 processors have very different architectures and pipelines, so a sequence of instructions that is the fastest possible on one CPU might be relatively slower on another.
In this case the Microsoft C++ compiler's optimization strategy was targeted to a different processor than the CPU acidzombie24 was actually using, but gcc chose instructions more suited to his CPU. On a newer, older, or different-manufacturer CPU it is likely Microsoft C++ would be faster than gcc.
JIT has the best potential of all: Since it knows exactly what CPU is being targeted it has the ability to generate the very best possible code in every situation. Thus C# is inherently (in the long term) likely to be faster than C++ for such code.
Having said this, I would guess that the fact that CLR's JIT picked a better instruction sequence than Microsoft C++ was more a matter of luck than knowing the architecture. This is evidenced by the fact that on Justicle's CPU the Microsoft C++ compiler selected a better instruction sequence than the CLR JIT compiler.
A note on _rdtsc vs QueryPerformanceCounter: Yes _rdtsc is broken, but when you're talking a 3-4 second operation and running it several times to validate consistent timing, any situation that causes _rdtsc to give bogus timings (such as processor speed changes or processor changes) should cause outlying values in the test data that will be thrown out, so assuming acidzombie24 did his original benchmarks properly I doubt the _rdtsc vs QueryPerformanceCounter question really had any impact.
I know that the .NET compiler has an Intel optimization.
Related
Consider:
using System;

public class Test
{
    enum State : sbyte { OK = 0, BUG = -1 }

    static void Main(string[] args)
    {
        var s = new State[1, 1];
        s[0, 0] = State.BUG;
        State a = s[0, 0];
        Console.WriteLine(a == s[0, 0]); // False
    }
}
How can this be explained? It occurs in debug builds in Visual Studio 2015 when running in the x86 JIT. A release build or running in the x64 JIT prints True as expected.
To reproduce from the command line:
csc Test.cs /platform:x86 /debug
(/debug:pdbonly, /debug:portable and /debug:full also reproduce.)
You found a code generation bug in the .NET 4 x86 jitter. It is a very unusual one, it only fails when the code is not optimized. The machine code looks like this:
State a = s[0, 0];
013F04A9 push 0 ; index 2 = 0
013F04AB mov ecx,dword ptr [ebp-40h] ; s[] reference
013F04AE xor edx,edx ; index 1 = 0
013F04B0 call 013F0058 ; eax = s[0, 0]
013F04B5 mov dword ptr [ebp-4Ch],eax ; $temp1 = eax
013F04B8 movsx eax,byte ptr [ebp-4Ch] ; convert sbyte to int
013F04BC mov dword ptr [ebp-44h],eax ; a = s[0, 0]
Console.WriteLine(a == s[0, 0]); // False
013F04BF mov eax,dword ptr [ebp-44h] ; a
013F04C2 mov dword ptr [ebp-50h],eax ; $temp2 = a
013F04C5 push 0 ; index 2 = 0
013F04C7 mov ecx,dword ptr [ebp-40h] ; s[] reference
013F04CA xor edx,edx ; index 1 = 0
013F04CC call 013F0058 ; eax = s[0, 0]
013F04D1 mov dword ptr [ebp-54h],eax ; $temp3 = eax
; <=== Bug here!
013F04D4 mov eax,dword ptr [ebp-50h] ; a == s[0, 0]
013F04D7 cmp eax,dword ptr [ebp-54h]
013F04DA sete cl
013F04DD movzx ecx,cl
013F04E0 call 731C28F4
A plodding affair with lots of temporaries and code duplication, that's normal for unoptimized code. The instruction at 013F04B8 is notable, that is where the necessary conversion from sbyte to a 32-bit integer occurs. The array getter helper function returned 0x000000FF, the raw byte value of State.BUG, and that needs to be converted to -1 (0xFFFFFFFF) before the value can be compared. The MOVSX instruction is a Sign eXtension instruction.
Same thing happens again at 013F04CC, but this time there is no MOVSX instruction to make the same conversion. That's where the chips fall down, the CMP instruction compares 0xFFFFFFFF with 0x000000FF and that is false. So this is an error of omission, the code generator failed to emit MOVSX again to perform the same sbyte to int conversion.
What is particularly unusual about this bug is that this works correctly when you enable the optimizer, it now knows to use MOVSX in both cases.
The probable reason that this bug went undetected for so long is the usage of sbyte as the base type of the enum. Quite rare to do. Using a multi-dimensional array is instrumental as well, the combination is fatal.
Otherwise a pretty critical bug I'd say. How widespread it might be is hard to guess, I only have the 4.6.1 x86 jitter to test. The x64 and the 3.5 x86 jitter generate very different code and avoid this bug. The temporary workaround to keep going is to remove sbyte as the enum base type and let it be the default, int, so no sign extension is necessary.
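Spelled out, the workaround is just dropping the explicit base type:

// was: enum State : sbyte { OK = 0, BUG = -1 }
enum State { OK = 0, BUG = -1 }   // implicitly int, so no sign extension is needed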
You can file the bug at connect.microsoft.com, linking to this Q+A should be enough to tell them everything they need to know. Let me know if you don't want to take the time and I'll take care of it.
Let's consider OP's declaration:
enum State : sbyte { OK = 0, BUG = -1 }
Since the bug only occurs when BUG is negative (from -128 to -1) and State is an enum backed by a signed byte, I started to suspect a cast issue somewhere.
If you run this:
Console.WriteLine((sbyte)s[0, 0]);
Console.WriteLine((sbyte)State.BUG);
Console.WriteLine(s[0, 0]);
unchecked
{
    Console.WriteLine((byte) State.BUG);
}
it will output :
255
-1
BUG
255
For a reason I don't yet know, s[0, 0] is treated as a byte (not sign-extended) before the comparison, and that's why it claims that a == s[0, 0] is false.
I'm testing what sort of speedup I can get from using SIMD instructions with RyuJIT and I'm seeing some disassembly instructions that I don't expect. I'm basing the code on this blog post from the RyuJIT team's Kevin Frei, and a related post here. Here's the function:
static void AddPointwiseSimd(float[] a, float[] b) {
    int simdLength = Vector<float>.Count;
    int i = 0;
    for (i = 0; i < a.Length - simdLength; i += simdLength) {
        Vector<float> va = new Vector<float>(a, i);
        Vector<float> vb = new Vector<float>(b, i);
        va += vb;
        va.CopyTo(a, i);
    }
}
The section of disassembly I'm querying copies the array values into the Vector<float>. Most of the disassembly is similar to that in Kevin and Sasha's posts, but I've highlighted some extra instructions (along with my confused annotations) that don't appear in their disassemblies:
;// Vector<float> va = new Vector<float>(a, i);
cmp eax,r8d ; <-- Unexpected - Compare a.Length to i?
jae 00007FFB17DB6D5F ; <-- Unexpected - Jump to range check failure
lea r10d,[rax+3]
cmp r10d,r8d
jae 00007FFB17DB6D5F
mov r11,rcx ; <-- Unexpected - Extra register copy?
movups xmm0,xmmword ptr [r11+rax*4+10h ]
;// Vector<float> vb = new Vector<float>(b, i);
cmp eax,r9d ; <-- Unexpected - Compare b.Length to i?
jae 00007FFB17DB6D5F ; <-- Unexpected - Jump to range check failure
cmp r10d,r9d
jae 00007FFB17DB6D5F
movups xmm1,xmmword ptr [rdx+rax*4+10h]
Note the loop range check is as expected:
;// for (i = 0; i < a.Length - simdLength; i += simdLength) {
add eax,4
cmp r9d,eax
jg loop
so I don't know why there are extra comparisons against eax. Can anyone explain why I'm seeing these extra instructions, and whether it's possible to get rid of them?
In case it's related to the project settings I've got a very similar project that shows the same issue here on github (see FloatSimdProcessor.HwAcceleratedSumInPlace() or UShortSimdProcessor.HwAcceleratedSumInPlaceUnchecked()).
I'll annotate the code generation that I see for a processor that supports AVX2, like Haswell; it can move 8 floats at a time:
00007FFA1ECD4E20 push rsi
00007FFA1ECD4E21 sub rsp,20h
00007FFA1ECD4E25 xor eax,eax ; i = 0
00007FFA1ECD4E27 mov r8d,dword ptr [rcx+8] ; a.Length
00007FFA1ECD4E2B lea r9d,[r8-8] ; a.Length - simdLength
00007FFA1ECD4E2F test r9d,r9d ; if (i >= a.Length - simdLength)
00007FFA1ECD4E32 jle 00007FFA1ECD4E75 ; then skip loop
00007FFA1ECD4E34 mov r10d,dword ptr [rdx+8] ; b.Length
00007FFA1ECD4E38 cmp eax,r8d ; if (i >= a.Length)
00007FFA1ECD4E3B jae 00007FFA1ECD4E7B ; then OutOfRangeException
00007FFA1ECD4E3D lea r11d,[rax+7] ; i+7
00007FFA1ECD4E41 cmp r11d,r8d ; if (i+7 >= a.Length)
00007FFA1ECD4E44 jae 00007FFA1ECD4E7B ; then OutOfRangeException
00007FFA1ECD4E46 mov rsi,rcx ; move a[i..i+7]
00007FFA1ECD4E49 vmovupd ymm0,ymmword ptr [rsi+rax*4+10h]
00007FFA1ECD4E50 cmp eax,r10d ; same as above
00007FFA1ECD4E53 jae 00007FFA1ECD4E7B ; but for b
00007FFA1ECD4E55 cmp r11d,r10d
00007FFA1ECD4E58 jae 00007FFA1ECD4E7B
00007FFA1ECD4E5A vmovupd ymm1,ymmword ptr [rdx+rax*4+10h]
00007FFA1ECD4E61 vaddps ymm0,ymm0,ymm1 ; a[i..] + b[i...]
00007FFA1ECD4E66 vmovupd ymmword ptr [rsi+rax*4+10h],ymm0
00007FFA1ECD4E6D add eax,8 ; i += 8
00007FFA1ECD4E70 cmp r9d,eax ; if (i < a.Length)
00007FFA1ECD4E73 jg 00007FFA1ECD4E38 ; then loop
00007FFA1ECD4E75 add rsp,20h
00007FFA1ECD4E79 pop rsi
00007FFA1ECD4E7A ret
So the eax compares are those "pesky bound checks" that the blog post talks about. The blog post gives an optimized version that is not actually implemented (yet), real code right now checks both the first and the last index of the 8 floats that are moved at the same time. The blog post's comment "Hopefully, we'll get our bounds-check elimination work strengthened enough" is an uncompleted task :)
The mov rsi,rcx instruction is present in the blog post as well and appears to be a limitation in the register allocator. Probably influenced by RCX being an important register, it normally stores this. Not important enough to do the work to get this optimized away I'd assume, register-to-register moves take 0 cycles since they only affect register renaming.
Note how the difference between SSE2 and AVX2 is ugly: while the code moves and adds 8 floats at a time, it only actually uses 4 of them. Vector<float>.Count is 4 regardless of the processor flavor, leaving 2x perf on the table. Hard to hide the implementation detail, I guess.
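If you want to see what your own runtime is doing, System.Numerics will tell you both facts discussed here (a quick sketch):

using System;
using System.Numerics;

class VectorInfo
{
    static void Main()
    {
        // On the RyuJIT build discussed above this reported Count = 4 even on
        // AVX2 hardware; later runtimes report 8 there.
        Console.WriteLine("IsHardwareAccelerated: {0}", Vector.IsHardwareAccelerated);
        Console.WriteLine("Vector<float>.Count:   {0}", Vector<float>.Count);
    }
}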
I have this code in C that I need to port to C#:
void CryptoBuffer(unsigned char *Buffer, unsigned short length)
{
    unsigned short i;
    for(i=0; i < length; i++)
    {
        *Buffer ^= 0xAA;
        *Buffer++ += 0xC9;
    }
}
I tried this:
public void CryptoBuffer(byte[] buffer, int length)
{
    for(int i = 0; i < length; i++)
    {
        buffer[i] ^= 0xAA;
        buffer[i] += 0xC9;
    }
}
But the outcome doesn't match the one expected.
According to the example, this:
A5 03 18 01...
should become this:
A5 6F 93 8B...
It also says the first byte is not encrypted, so that's why A5 stays the same.
EDIT for clarification: The specification just says you should skip the first byte, it doesn't go into details, so I'm guessing you just pass the sequence from position 1 until the last position to skip the first byte.
But my outcome with that C# port is:
A5 72 7B 74...
Is this port correct or am I missing something?
EDIT 2: For further clarification, this is a closed protocol, so I can't go into details, that's why I provided just enough information to help me port the code, that C code was the one that was given to me, and that's what the specification said it would do.
The real problem was that the "0xAA" was wrong in the specification, that's why the output wasn't the expected one. The C# code provided here and by the accepted answer are correct after all.
Let's break it down shall we, one step at a time.
void CryptoBuffer(unsigned char *Buffer, unsigned short length)
{
    unsigned short i;
    for(i=0; i < length; i++)
    {
        *Buffer ^= 0xAA;
        *Buffer++ += 0xC9;
    }
}
Regardless of some other remarks, this is how you normally do these things in C/C++. There's nothing fancy about this code, and it isn't overly complicated, but I think it is good to break it down to show you what happens.
Things to note:
unsigned char is basically the same as byte in C#
unsigned short length has a value between 0 and 65535; int should do the trick
Buffer has a post-increment
The byte assignment (+= 0xC9) will overflow. If it overflows it's truncated to 8 bits in this case.
The buffer is passed by ptr, so the pointer in the calling method will stay the same.
This is just basic C code, no C++. It's quite safe to assume people don't use operator overloading here.
The only "difficult" thing here is the Buffer++. Details can be read in the book "Exceptional C++" from Sutter, but a small example explains this as well. And fortunately we have a perfect example at our disposal. A literal translation of the above code is:
void CryptoBuffer(unsigned char *Buffer, unsigned short length)
{
    unsigned short i;
    for(i=0; i < length; i++)
    {
        *Buffer ^= 0xAA;
        unsigned char *tmp = Buffer;
        *tmp += 0xC9;
        Buffer = tmp + 1;
    }
}
In this case the temp variable can be solved trivially, which leads us to:
void CryptoBuffer(unsigned char *Buffer, unsigned short length)
{
    unsigned short i;
    for(i=0; i < length; i++)
    {
        *Buffer ^= 0xAA;
        *Buffer += 0xC9;
        ++Buffer;
    }
}
Changing this code to C# now is pretty easy:
private void CryptoBuffer(byte[] Buffer, int length)
{
    for (int i=0; i<length; ++i)
    {
        Buffer[i] = (byte)((Buffer[i] ^ 0xAA) + 0xC9);
    }
}
This is basically the same as your ported code. This means that somewhere down the road something else went wrong... So let's hack the cryptobuffer shall we? :-)
If we assume that the first byte isn't used (as you stated) and that the '0xAA' and/or the '0xC9' are wrong, we can simply try all combinations:
static void Main(string[] args)
{
    byte[] orig = new byte[] { 0x03, 0x18, 0x01 };
    byte[] target = new byte[] { 0x6F, 0x93, 0x8b };

    for (int i = 0; i < 256; ++i)
    {
        for (int j = 0; j < 256; ++j)
        {
            bool okay = true;
            for (int k = 0; okay && k < 3; ++k)
            {
                byte tmp = (byte)((orig[k] ^ i) + j);
                if (tmp != target[k]) { okay = false; break; }
            }
            if (okay)
            {
                Console.WriteLine("Solution for i={0} and j={1}", i, j);
            }
        }
    }
    Console.ReadLine();
}
There we go: oops, there are no solutions. That means that the cryptobuffer is not doing what you think it's doing, or part of the C code is missing here. For example, do they really pass 'Buffer' to the CryptoBuffer method, or did they change the pointer before that?
Concluding, I think the only good answer here is that critical information for solving this question is missing.
The example you were provided with is inconsistent with the code in the C sample, and the C and C# code produce identical results.
The porting looks right; can you explain why 03 should become 6F? The fact that the result seems to be off the "expected" value by 03 is a bit suspicious to me.
The port looks right.
What I would do in this situation is to take out a piece of paper and a pen, write out the bytes in binary, do the XOR, and then the addition. Now compare this to the C and C# codes.
In C#, you are overflowing the byte so it gets truncated to 0x72. Here's the math for converting the 0x03 in both binary and hex:
00000011 0x003
^ 10101010 0x0AA
= 10101001 0x0A9
+ 11001001 0x0C9
= 101110010 0x172
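The same arithmetic in C#, just to confirm the truncation (a throwaway check of mine):

using System;

class OverflowCheck
{
    static void Main()
    {
        byte b = 0x03;
        b ^= 0xAA;                       // 0x03 ^ 0xAA = 0xA9
        b += 0xC9;                       // 0xA9 + 0xC9 = 0x172, truncated to 0x72
        Console.WriteLine("{0:X2}", b);  // prints 72
    }
}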
With the original method in C, we first suppose the sequence is encrypted/decrypted symmetrically by calling CryptoBuffer:
initially invoke on a5 03 18 01 ...
a5 03 18 01 ... => d8 72 7b 74 ...
then on d8 72 7b 74 ...
d8 72 7b 74 ... => 3b a1 9a a7 ...
initially invoke on a5 6f 93 8b ...
a5 6f 93 8b ... => d8 8e 02 ea ...
then on d8 8e 02 ea ...
d8 8e 02 ea ... => 3b ed 71 09 ...
and we know it's not feasible.
Of course, you might have an asymmetric decrypt method; but first we would need either a5 03 18 01 ... => a5 6f 93 8b ... or the reverse direction to be achievable with some possible magic number. The code for a brute-force analysis is at the end of this post.
I made the magic number a variable for testing. The reproducibility analysis found that the original sequence can be reproduced every 256 invocations with a continuously varied magic number. Okay, so with what we've gone through so far it's still possible.
However, the feasibility analysis tests all 256*256 = 65536 cases in both directions, from original => expected and expected => original, and none of them works.
And now we know there is no way to decrypt the encrypted sequence to the expected result.
Thus, we can only say that the code behaves identically in both languages; the expected result is simply not achievable, because the assumption behind it was broken.
Code for the analysis
public void CryptoBuffer(byte[] buffer, ushort magicShort) {
    var magicBytes=BitConverter.GetBytes(magicShort);
    var count=buffer.Length;
    for(var i=0; i<count; i++) {
        buffer[i]^=magicBytes[1];
        buffer[i]+=magicBytes[0];
    }
}
int Analyze(
    Action<byte[], ushort> subject,
    byte[] expected, byte[] original,
    ushort? magicShort=default(ushort?)
) {
    Func<byte[], String> LaHeX= // narrowing bytes to hex statement
        arg => arg.Select(x => String.Format("{0:x2}\x20", x)).Aggregate(String.Concat);
    var temporal=(byte[])original.Clone();
    var found=0;
    // use an int counter: a ushort counter always satisfies i>=0 and would
    // wrap around instead of terminating
    for(int i=ushort.MaxValue; i>=0; --i) {
        if(found>255) {
            Console.WriteLine(": might have found more than 256 magic numbers; ");
            Console.WriteLine(": analysis stopped ");
            Console.WriteLine();
            break;
        }
        subject(temporal, magicShort??(ushort)i);
        if(expected.SequenceEqual(temporal)) {
            ++found;
            Console.WriteLine("i={0:x2}; temporal={1}", i, LaHeX(temporal));
        }
        if(expected!=original)
            temporal=(byte[])original.Clone();
    }
    return found;
}
void PerformTest() {
    var original=new byte[] { 0xa5, 0x03, 0x18, 0x01 };
    var expected=new byte[] { 0xa5, 0x6f, 0x93, 0x8b };

    Console.WriteLine("--- reproducibility analysis --- ");
    Console.WriteLine("found: {0}", Analyze(CryptoBuffer, original, original, 0xaac9));
    Console.WriteLine();

    Console.WriteLine("--- feasibility analysis --- ");
    Console.WriteLine("found: {0}", Analyze(CryptoBuffer, expected, original));
    Console.WriteLine();

    // swap original and expected
    var temporal=original;
    original=expected;
    expected=temporal;

    Console.WriteLine("--- reproducibility analysis --- ");
    Console.WriteLine("found: {0}", Analyze(CryptoBuffer, original, original, 0xaac9));
    Console.WriteLine();

    Console.WriteLine("--- feasibility analysis --- ");
    Console.WriteLine("found: {0}", Analyze(CryptoBuffer, expected, original));
    Console.WriteLine();
}
Here's a demonstration:
http://codepad.org/UrX0okgu
It shows that the original code, given an input of A5 03 18 01, produces D8 72 7B 74; so:
the rule that the first byte is not decoded can be correct only if the buffer is passed starting from the 2nd byte (show us the call)
the output does not match (are you missing other calls?)
So your translation is correct but your expectations on what the original code does are not.
It seems that when performing an & operation between two longs, it takes the same amount of time as the equivalent operation on four 32-bit ints.
For example
long1 & long2
Takes as long as
int1 & int2
int3 & int4
This is running on a 64bit OS and targeting 64bit .net.
In theory, this should be twice as fast. Has anyone encountered this previously?
EDIT
As a simplification, imagine I have two lots of 64 bits of data. I take those 64 bits and put them into a long, and perform a bitwise & on those two.
I also take those two sets of data, and put the 64 bits into two 32 bit int values and perform two &s. I expect to see the long & operation running faster than the int & operation.
I couldn't reproduce the problem.
My test was as follows (int version shown):
// deliberately made hard to optimise without whole program optimisation
public static int[] data = new int[1000000]; // long[] when testing long

// I happened to have a winforms app open, feel free to make this a console app..
private void button1_Click(object sender, EventArgs e)
{
    long best = long.MaxValue;
    for (int j = 0; j < 1000; j++)
    {
        Stopwatch timer = Stopwatch.StartNew();

        int a1 = ~0, b1 = 0x55555555, c1 = 0x12345678; // varies: see below
        int a2 = ~0, b2 = 0x55555555, c2 = 0x12345678;
        int[] d = data; // long[] when testing long

        for (int i = 0; i < d.Length; i++)
        {
            int v = d[i]; // long when testing long, see below
            a1 &= v; a2 &= v;
            b1 &= v; b2 &= v;
            c1 &= v; c2 &= v;
        }

        // don't average times: we want the result with minimal context switching
        best = Math.Min(best, timer.ElapsedTicks);
        button1.Text = best.ToString() + ":" + (a1 + a2 + b1 + b2 + c1 + c2).ToString("X8");
    }
}
For testing longs a1 and a2 etc are merged, giving:
long a = ~0, b = 0x5555555555555555, c = 0x1234567812345678;
Running the two programs on my laptop (i7 Q720) as a release build outside of VS (.NET 4.5) I got the following times:
int: 2238, long: 1924
Now considering there's a huge amount of loop overhead, and that the long version is working with twice as much data (8mb vs 4mb), it still comes out clearly ahead. So I have no reason to believe that C# is not making full use of the processor's 64 bit bitops.
But we really shouldn't be benching it in the first place. If there's a concern, simply check the jited code (Debug -> Windows -> Disassembly). Ensure the compiler's using the instructions you expect it to use, and move on.
Attempting to measure the performance of those individual instructions on your processor (and this could well be specific to your processor model) in anything other than assembler is a very bad idea - and from within a jit compiled language like C#, beyond futile. But there's no need to anyway, as it's all in Intel's optimisation handbook should you need to know.
To this end, here's the disassembly of the a &= for the long version of the program on x64 (release, but inside of debugger - unsure if this affects the assembly, but it certainly affects the performance):
00000111 mov rcx,qword ptr [rsp+60h] ; a &= v
00000116 mov rax,qword ptr [rsp+38h]
0000011b and rax,rcx
0000011e mov qword ptr [rsp+38h],rax
As you can see there's a single 64 bit and operation as expected, along with three 64 bit moves. So far so good, and exactly half the number of ops of the int version:
00000122 mov ecx,dword ptr [rsp+5Ch] ; a1 &= v
00000126 mov eax,dword ptr [rsp+38h]
0000012a and eax,ecx
0000012c mov dword ptr [rsp+38h],eax
00000130 mov ecx,dword ptr [rsp+5Ch] ; a2 &= v
00000134 mov eax,dword ptr [rsp+44h]
00000138 and eax,ecx
0000013a mov dword ptr [rsp+44h],eax
I can only conclude that the problem you're seeing is specific to something about your test suite, build options, processor... or quite possibly, that the & isn't the point of contention you believe it to be. HTH.
I can't reproduce your timings. The following code generates two arrays: one of 1,000,000 longs, and one with 2,000,000 ints. Then it loops through the arrays, applying the & operator to successive values. It keeps a running sum and outputs it, just to make sure that the compiler doesn't decide to remove the loop entirely because it isn't doing anything.
Over dozens of successive runs, the long loop is at least twice as fast as the int loop. This is running on a Core 2 Quad with Windows 8 Developer Preview and Visual Studio 11 Developer Preview. Program is compiled with "Any CPU", and run in 64 bit mode. All testing done using Ctrl+F5 so that the debugger isn't involved.
int numLongs = 1000000;
int numInts = 2*numLongs;
var longs = new long[numLongs];
var ints = new int[numInts];
Random rnd = new Random();

// generate values
for (int i = 0; i < numLongs; ++i)
{
    int i1 = rnd.Next();
    int i2 = rnd.Next();
    ints[2 * i] = i1;
    ints[2 * i + 1] = i2;
    long l = i1;
    l = (l << 32) | (uint)i2;
    longs[i] = l;
}

// time operations.
int isum = 0;
Stopwatch sw = Stopwatch.StartNew();
for (int i = 0; i < numInts; i += 2)
{
    isum += ints[i] & ints[i + 1];
}
sw.Stop();
Console.WriteLine("Ints: {0} ms. isum = {1}", sw.ElapsedMilliseconds, isum);

long lsum = 0;
int halfLongs = numLongs / 2;
sw.Restart();
for (int i = 0; i < halfLongs; i += 2)
{
    lsum += longs[i] & longs[i + 1];
}
sw.Stop();
Console.WriteLine("Longs: {0} ms. lsum = {1}", sw.ElapsedMilliseconds, lsum);
We are looking to migrate a performance critical application to .NET and find that the C# version is 30% to 100% slower than the Win32/C version, depending on the processor (the difference is more marked on a mobile T7200 processor). I have a very simple sample of code that demonstrates this. For brevity I shall just show the C version - the C# is a direct translation:
#include "stdafx.h"
#include "Windows.h"
int array1[100000];
int array2[100000];
int Test();
int main(int argc, char* argv[])
{
int res = Test();
return 0;
}
int Test()
{
int calc,i,k;
calc = 0;
for (i = 0; i < 50000; i++) array1[i] = i + 2;
for (i = 0; i < 50000; i++) array2[i] = 2 * i - 2;
for (i = 0; i < 50000; i++)
{
for (k = 0; k < 50000; k++)
{
if (array1[i] == array2[k]) calc = calc - array2[i] + array1[k];
else calc = calc + array1[i] - array2[k];
}
}
return calc;
}
If we look at the disassembly in Win32 for the 'else' we have:
35: else calc = calc + array1[i] - array2[k];
004011A0 jmp Test+0FCh (004011bc)
004011A2 mov eax,dword ptr [ebp-8]
004011A5 mov ecx,dword ptr [ebp-4]
004011A8 add ecx,dword ptr [eax*4+48DA70h]
004011AF mov edx,dword ptr [ebp-0Ch]
004011B2 sub ecx,dword ptr [edx*4+42BFF0h]
004011B9 mov dword ptr [ebp-4],ecx
(this is in debug but bear with me)
The disassembly for the optimised c# version using the CLR debugger on the optimised exe:
else calc = calc + pev_tmp[i] - gat_tmp[k];
000000a7 mov eax,dword ptr [ebp-4]
000000aa mov edx,dword ptr [ebp-8]
000000ad mov ecx,dword ptr [ebp-10h]
000000b0 mov ecx,dword ptr [ecx]
000000b2 cmp edx,dword ptr [ecx+4]
000000b5 jb 000000BC
000000b7 call 792BC16C
000000bc add eax,dword ptr [ecx+edx*4+8]
000000c0 mov edx,dword ptr [ebp-0Ch]
000000c3 mov ecx,dword ptr [ebp-14h]
000000c6 mov ecx,dword ptr [ecx]
000000c8 cmp edx,dword ptr [ecx+4]
000000cb jb 000000D2
000000cd call 792BC16C
000000d2 sub eax,dword ptr [ecx+edx*4+8]
000000d6 mov dword ptr [ebp-4],eax
Many more instructions, presumably the cause of the performance difference.
So 3 questions really:
Am I looking at the correct disassembly for the 2 programs or are the tools misleading me?
If the difference in the number of generated instructions is not the cause of the difference what is?
What can we possibly do about it other than keep all our performance critical code in a native DLL.
Thanks in advance
Steve
PS I did receive an invite recently to a joint MS/Intel seminar entitled something like 'Building performance critical native applications' Hmm...
I believe your main issue in this code is going to be bounds checking on your arrays.
If you switch to using unsafe code in C#, and use pointer math, you should be able to achieve the same (or potentially faster) code.
This same issue was previously discussed in detail in this question.
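For concreteness, a sketch of what the unsafe/pointer version of the question's loop might look like (my translation, compiled with /unsafe and taking the arrays as parameters; timings not verified here):

static unsafe int Test(int[] array1, int[] array2)
{
    int calc = 0;
    fixed (int* p1 = array1, p2 = array2)   // pin the arrays; pointer access has no per-element bounds check
    {
        for (int i = 0; i < 50000; i++)
        {
            for (int k = 0; k < 50000; k++)
            {
                if (p1[i] == p2[k]) calc = calc - p2[i] + p1[k];
                else calc = calc + p1[i] - p2[k];
            }
        }
    }
    return calc;
}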
I believe you are seeing the results of bounds checks on the arrays. You can avoid the bounds checks by using unsafe code.
I believe the JITer can recognize patterns like for loops that go up to array.Length and avoid the bounds check, but it doesn't look like your code can utilize that; see the sketch below.
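For reference, the loop shape the JIT does recognize looks roughly like this (a minimal sketch); the question's inner loop indexes array2 with k while the bound is a constant, so it doesn't qualify:

static int Sum(int[] array1)
{
    int sum = 0;
    // the bound is array1.Length and the same index is the only array access,
    // so the per-element bounds check can typically be dropped
    for (int i = 0; i < array1.Length; i++)
    {
        sum += array1[i];
    }
    return sum;
}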
As others have said, one of the aspects is bounds checking. There's also some redundancy in your code in terms of array access. I've managed to improve the performance somewhat by changing the inner block to:
int tmp1 = array1[i];
int tmp2 = array2[k];
if (tmp1 == tmp2)
{
    calc = calc - array2[i] + array1[k];
}
else
{
    calc = calc + tmp1 - tmp2;
}
That change knocked the total time down from ~8.8s to ~5s.
Just for fun, I tried building this in C# in Visual Studio 2010, and took a look at the JITed disassembly:
else
calc = calc + array1[i] - array2[k];
000000cf mov eax,dword ptr [ebp-10h]
000000d2 add eax,dword ptr [ebp-14h]
000000d5 sub eax,edx
000000d7 mov dword ptr [ebp-10h],eax
They made a number of improvements to the jitter in 4.0 of the CLR.
C# is doing bounds checking.
When running the calculation part using C# unsafe code, does it perform as well as the native implementation?
If your application's performance critical path consists entirely of unchecked array processing, I'd advise you not to rewrite it in C#.
But then, if your application already works fine in language X, I'd advise you not to rewrite it in language Y.
What do you want to achieve from the rewrite? At the very least, give serious consideration to a mixed language solution, using your already-debugged C code for the high performance sections and using C# to get a nice user interface or convenient integration with the latest rich .NET libraries.
A longer answer on a possibly related theme.
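If you do go the mixed-language route, the glue on the C# side is just a P/Invoke declaration; a sketch, assuming the C Test() above is exported from a hypothetical native.dll:

using System;
using System.Runtime.InteropServices;

class NativeInterop
{
    // hypothetical DLL name; the export would be the question's C function,
    // built as extern "C" __declspec(dllexport) int Test(void)
    [DllImport("native.dll", CallingConvention = CallingConvention.Cdecl)]
    static extern int Test();

    static void Main()
    {
        Console.WriteLine(Test());
    }
}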
I am sure the optimization for C is different from that for C#. You also have to expect at least a little performance slowdown, since .NET adds another layer to the application with the framework.
The trade-off is more rapid development and huge libraries and functions, for (what should be) a small amount of speed.