Performance of bitwise & on longs vs ints on 64 bit

Performance of bitwise & on longs vs ints on 64 bit - c#

It seems that when performing an & operation between two longs it takes the same amount of time as the equivalent operation inside 4 32bit ints.
For example
long1 & long2
Takes as long as
int1 & int2
int3 & int4
This is running on a 64bit OS and targeting 64bit .net.
In theory, this should be twice as fast. Has anyone encountered this previously?
EDIT
As a simplification, imagine I have two lots of 64 bits of data. I take those 64 bits and put them into a long, and perform a bitwise & on those two.
I also take those two sets of data, and put the 64 bits into two 32 bit int values and perform two &s. I expect to see the long & operation running faster than the int & operation.

I couldn't reproduce the problem.
My test was as follows (int version shown):
// deliberately made hard to optimise without whole program optimisation
public static int[] data = new int[1000000]; // long[] when testing long
// I happened to have a winforms app open, feel free to make this a console app..
private void button1_Click(object sender, EventArgs e)
{
long best = long.MaxValue;
for (int j = 0; j < 1000; j++)
{
Stopwatch timer = Stopwatch.StartNew();
int a1 = ~0, b1 = 0x55555555, c1 = 0x12345678; // varies: see below
int a2 = ~0, b2 = 0x55555555, c2 = 0x12345678;
int[] d = data; // long[] when testing long
for (int i = 0; i < d.Length; i++)
{
int v = d[i]; // long when testing long, see below
a1 &= v; a2 &= v;
b1 &= v; b2 &= v;
c1 &= v; c2 &= v;
}
// don't average times: we want the result with minimal context switching
best = Math.Min(best, timer.ElapsedTicks);
button1.Text = best.ToString() + ":" + (a1 + a2 + b1 + b2 + c1 + c2).ToString("X8");
}
}
For testing longs a1 and a2 etc are merged, giving:
long a = ~0, b = 0x5555555555555555, c = 0x1234567812345678;
Running the two programs on my laptop (i7 Q720) as a release build outside of VS (.NET 4.5) I got the following times:
int: 2238, long: 1924
Now considering there's a huge amount of loop overhead, and that the long version is working with twice as much data (8mb vs 4mb), it still comes out clearly ahead. So I have no reason to believe that C# is not making full use of the processor's 64 bit bitops.
But we really shouldn't be benching it in the first place. If there's a concern, simply check the jited code (Debug -> Windows -> Disassembly). Ensure the compiler's using the instructions you expect it to use, and move on.
Attempting to measure the performance of those individual instructions on your processor (and this could well be specific to your processor model) in anything other than assembler is a very bad idea - and from within a jit compiled language like C#, beyond futile. But there's no need to anyway, as it's all in Intel's optimisation handbook should you need to know.
To this end, here's the disassembly of the a &= for the long version of the program on x64 (release, but inside of debugger - unsure if this affects the assembly, but it certainly affects the performance):
00000111 mov rcx,qword ptr [rsp+60h] ; a &= v
00000116 mov rax,qword ptr [rsp+38h]
0000011b and rax,rcx
0000011e mov qword ptr [rsp+38h],rax
As you can see there's a single 64 bit and operation as expected, along with three 64 bit moves. So far so good, and exactly half the number of ops of the int version:
00000122 mov ecx,dword ptr [rsp+5Ch] ; a1 &= v
00000126 mov eax,dword ptr [rsp+38h]
0000012a and eax,ecx
0000012c mov dword ptr [rsp+38h],eax
00000130 mov ecx,dword ptr [rsp+5Ch] ; a2 &= v
00000134 mov eax,dword ptr [rsp+44h]
00000138 and eax,ecx
0000013a mov dword ptr [rsp+44h],eax
I can only conclude that the problem you're seeing is specific to something about your test suite, build options, processor... or quite possibly, that the & isn't the point of contention you believe it to be. HTH.

I can't reproduce your timings. The following code generates two arrays: one of 1,000,000 longs, and one with 2,000,000 ints. Then it loops through the arrays, applying the & operator to successive values. It keeps a running sum and outputs it, just to make sure that the compiler doesn't decide to remove the loop entirely because it isn't doing anything.
Over dozens of successive runs, the long loop is at least twice as fast as the int loop. This is running on a Core 2 Quad with Windows 8 Developer Preview and Visual Studio 11 Developer Preview. Program is compiled with "Any CPU", and run in 64 bit mode. All testing done using Ctrl+F5 so that the debugger isn't involved.
int numLongs = 1000000;
int numInts = 2*numLongs;
var longs = new long[numLongs];
var ints = new int[numInts];
Random rnd = new Random();
// generate values
for (int i = 0; i < numLongs; ++i)
{
int i1 = rnd.Next();
int i2 = rnd.Next();
ints[2 * i] = i1;
ints[2 * i + 1] = i2;
long l = i1;
l = (l << 32) | (uint)i2;
longs[i] = l;
}
// time operations.
int isum = 0;
Stopwatch sw = Stopwatch.StartNew();
for (int i = 0; i < numInts; i += 2)
{
isum += ints[i] & ints[i + 1];
}
sw.Stop();
Console.WriteLine("Ints: {0} ms. isum = {1}", sw.ElapsedMilliseconds, isum);
long lsum = 0;
int halfLongs = numLongs / 2;
sw.Restart();
for (int i = 0; i < halfLongs; i += 2)
{
lsum += longs[i] & longs[i + 1];
}
sw.Stop();
Console.WriteLine("Longs: {0} ms. lsum = {1}", sw.ElapsedMilliseconds, lsum);

Related

How do I properly loop through and print bits of an Int, Long, Float, or BigInteger?

I'm trying to debug some bit shifting operations and I need to visualize the bits as they exist before and after a Bit-Shifting operation.
I read from this answer that I may need to handle backfill from the shifting, but I'm not sure what that means.
I think that by asking this question (how do I print the bits in a int) I can figure out what the backfill is, and perhaps some other questions I have.
Here is my sample code so far.
static string GetBits(int num)
{
StringBuilder sb = new StringBuilder();
uint bits = (uint)num;
while (bits!=0)
{
bits >>= 1;
isBitSet = // somehow do an | operation on the first bit.
// I'm unsure if it's possible to handle different data types here
// or if unsafe code and a PTR is needed
if (isBitSet)
sb.Append("1");
else
sb.Append("0");
}
}

Convert.ToString(56,2).PadLeft(8,'0') returns "00111000"
This is for a byte, works for int also, just increase the numbers

To test if the last bit is set you could use:
isBitSet = ((bits & 1) == 1);
But you should do so before shifting right (not after), otherwise you's missing the first bit:
isBitSet = ((bits & 1) == 1);
bits = bits >> 1;
But a better option would be to use the static methods of the BitConverter class to get the actual bytes used to represent the number in memory into a byte array. The advantage (or disadvantage depending on your needs) of this method is that this reflects the endianness of the machine running the code.
byte[] bytes = BitConverter.GetBytes(num);
int bitPos = 0;
while(bitPos < 8 * bytes.Length)
{
int byteIndex = bitPos / 8;
int offset = bitPos % 8;
bool isSet = (bytes[byteIndex] & (1 << offset)) != 0;
// isSet = [True] if the bit at bitPos is set, false otherwise
bitPos++;
}

Typedef for indexes in C# with static type checking without runtime overhead

It's pretty common case to use multidimensional arrays with complicated indexing. It's really confusing and error-prone when all indexes are ints because you can easily mix up columns and rows (or whatever you have) and there's no way for compiler to identify the problem. In fact there should be two types of indexes: rows and columns but it's not expressed on type level.
Here's a small illustration of what I want:
var table = new int[RowsCount,ColumnsCount];
Row row = 5;
Column col = 10;
int value = table[row, col];
public void CalcSum(int[,] table, Column col)
{
int sum = 0;
for (Row r = 0; r < table.GetLength(0); r++)
{
sum += table[row, col];
}
return sum;
}
CalcSum(table, col); // OK
CalcSum(table, row); // Compile time error
Summing up:
indexes should be statically checked for mixing up (kind of type check)
important! they should be run time efficient since it's not OK for performance to wrap ints to custom objects containing the index and then unwrapping them back
they should be implicitly convertible to ints in order to serve as indexes in native multidimensional arrays
Is there any way to achieve this? The perfect solution would be something like typedef which serves as compile-time check only compiling into plane ints.

You'll only get a 2x slowdown with the x64 jitter. It generates interesting optimized code. The loop that uses the struct looks like this:
00000040 mov ecx,1
00000045 nop word ptr [rax+rax+00000000h]
00000050 lea eax,[rcx-1]
s.Idx = j;
00000053 mov dword ptr [rsp+30h],eax
00000057 mov dword ptr [rsp+30h],ecx
0000005b add ecx,2
for (int j = 0; j < 100000000; j++) {
0000005e cmp ecx,5F5E101h
00000064 jl 0000000000000050
This requires some annotation since the code is unusual. First off, the weird NOP at offset 45 is there to align the instruction at the start of the loop. That makes the branch at offset 64 faster. The instruction at 53 looks completely unnecessary. What you see happen here is loop unrolling, note how the instruction at 5b increments the loop counter by 2. The optimizer is however not smart enough to then also see that the store is unnecessary.
And most of all note that there's no ADD instruction to be seen. In other words, the code doesn't actually calculate the value of "sum". Which is because you are not using it anywhere after the loop, the optimizer can see that the calculation is useless and removed it entirely.
It does a much better job at the second loop:
000000af xor eax,eax
000000b1 add eax,4
for (int j = 0; j < 100000000; j++) {
000000b4 cmp eax,5F5E100h
000000b9 jl 00000000000000B1
It now entirely removed the "sum" calculation and the "i" variable assignment. It could have also removed the entire for() loop but that's never done by the jitter optimizer, it assumes that the delay is intentional.
Hopefully the message is clear by now: avoid making assumptions from artificial benchmarks and only ever profile real code. You can make it more real by actually displaying the value of "sum" so the optimizer doesn't throw away the calculation. Add this line of code after the loops:
Console.Write("Sum = {0} ", sum);
And you'll now see that there's no difference anymore.

In C#, Is it slower to reference an array variable?

I've got an array of integers, and I'm looping through them:
for (int i = 0; i < data.Length; i++)
{
// do a lot of stuff here using data[i]
}
If I do:
for (int i = 0; i < data.Length; i++)
{
int value = data[i];
// do a lot of stuff with value instead of data[i]
}
Is there any performance gain/loss?
From my understanding, C/C++ array elements are accessed directly, i.e. an n-element array of integers has a contiguous memory block of length n * sizeof(int), and the program access element i by doing something like *data[i] = *data[0] + (i * sizeof(int)). (Please excuse my abuse of notation, but you get what I mean.)
So this means C/C++ should have no performance gain/loss for referencing array variables.
What about C#?
C# has a bunch of extra overhead like data.Length, data.IsSynchronized, data.GetLowerBound(), data.GetEnumerator().
Clearly, a C# array is not the same as a C/C++ array.
So what's the verdict? Should I store int value = data[i] and work with value, or is there no performance impact?

You can have the cake and eat it too. There are many cases where the jitter optimizer can easily determine that an array indexing access is safe and doesn't need to be checked. Any for-loop like you got in your question is one such case, the jitter knows the range of the index variable. And knows that checking it again is pointless.
The only way you can see that is from the generated machine code. I'll give an annotated example:
static void Main(string[] args) {
int[] array = new int[] { 0, 1, 2, 3 };
for (int ix = 0; ix < array.Length; ++ix) {
int value = array[ix];
Console.WriteLine(value);
}
}
Starting at the for loop, ebx has the pointer to the array:
for (int ix = 0; ix < array.Length; ++ix) {
00000037 xor esi,esi ; ix = 0
00000039 cmp dword ptr [ebx+4],0 ; array.Length < 0 ?
0000003d jle 0000005A ; skip everything
int value = array[ix];
0000003f mov edi,dword ptr [ebx+esi*4+8] ; NO BOUNDS CHECK !!!
Console.WriteLine(value);
00000043 call 6DD5BE38 ; Console.Out
00000048 mov ecx,eax ; arg = Out
0000004a mov edx,edi ; arg = value
0000004c mov eax,dword ptr [ecx] ; call WriteLine()
0000004e call dword ptr [eax+000000BCh]
for (int ix = 0; ix < array.Length; ++ix) {
00000054 inc esi ; ++ix
00000055 cmp dword ptr [ebx+4],esi ; array.Length > ix ?
00000058 jg 0000003F ; loop
The array indexing happens at address 00003f, ebx has the array pointer, esi is the index, 8 is the offset of the array elements in the object. Note how the esi value is not checked again against the array bounds. This runs just as fast as the code generated by a C compiler.

Yes, there is a performance loss due to the bounds check for every access to the array.
No, you most likely don't need to worry about it.
Yes, you can should store the value and work with the value. No, this isn't because of the performance issue, but rather because it makes the code more readable (IMHO).
By the way, the JIT compiler might optimize out redundant checks, so it doesn't mean you'll actually get a check on every call. Either way, it's probably not worth your time to worry about it; just use it, and if it turns out to be a bottleneck you can always go back and use unsafe blocks.

You have written it both ways. Run it both ways, measure it. Then you'll know.
But I think you would prefer working with the copy rather than always working with the array element directly, simply because it's easier to write the code that way, particularly if you have lots of operations involving that particular value.

The compiler can only perform common subexpression optimization here if it can prove that the array isn't accessed by other threads or any methods (including delegates) called inside the loop, it might be better to create the local copy yourself.
But readability should be your main concern, unless this loop executes a huge number of times.
All of this is also true in C and C++ -- indexing into an array will be slower than accessing a local variable.
As a side note, your suggested optimization is no good: value is a keyword, choose a different variable name.

Not really sure, but it probably wouldn't hurt to store the value if you are going to use it multiple times. You could also use a foreach statement :)

Why is .NET faster than C++ in this case?

Make sure you run outside of the IDE. That is key.
-edit- I LOVE SLaks comment. "The amount of misinformation in these answers is staggering." :D
Calm down guys. Pretty much all of you were wrong. I DID make optimizations.
It turns out whatever optimizations I made wasn't good enough.
I ran the code in GCC using gettimeofday (I'll paste code below) and used g++ -O2 file.cpp and got slightly faster results then C#.
Maybe MS didn't create the optimizations needed in this specific case but after downloading and installing mingw I was tested and found the speed to be near identical.
Justicle Seems to be right. I could have sworn I use clock on my PC and used that to count and found it was slower but problem solved. C++ speed isn't almost twice as slower in the MS compiler.
When my friend informed me of this I couldn't believe it. So I took his code and put some timers onto it.
Instead of Boo I used C#. I constantly got faster results in C#. Why? The .NET version was nearly half the time no matter what number I used.
C++ version (bad version):
#include <iostream>
#include <stdio.h>
#include <intrin.h>
#include <windows.h>
using namespace std;
int fib(int n)
{
if (n < 2) return n;
return fib(n - 1) + fib(n - 2);
}
int main()
{
__int64 time = 0xFFFFFFFF;
while (1)
{
int n;
//cin >> n;
n = 41;
if (n < 0) break;
__int64 start = __rdtsc();
int res = fib(n);
__int64 end = __rdtsc();
cout << res << endl;
cout << (float)(end-start)/1000000<<endl;
break;
}
return 0;
}
C++ version (better version):
#include <iostream>
#include <stdio.h>
#include <intrin.h>
#include <windows.h>
using namespace std;
int fib(int n)
{
if (n < 2) return n;
return fib(n - 1) + fib(n - 2);
}
int main()
{
__int64 time = 0xFFFFFFFF;
while (1)
{
int n;
//cin >> n;
n = 41;
if (n < 0) break;
LARGE_INTEGER start, end, delta, freq;
::QueryPerformanceFrequency( &freq );
::QueryPerformanceCounter( &start );
int res = fib(n);
::QueryPerformanceCounter( &end );
delta.QuadPart = end.QuadPart - start.QuadPart;
cout << res << endl;
cout << ( delta.QuadPart * 1000 ) / freq.QuadPart <<endl;
break;
}
return 0;
}
C# version:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Runtime.InteropServices;
using System.ComponentModel;
using System.Threading;
using System.IO;
using System.Diagnostics;
namespace fibCSTest
{
class Program
{
static int fib(int n)
{
if (n < 2)return n;
return fib(n - 1) + fib(n - 2);
}
static void Main(string[] args)
{
//var sw = new Stopwatch();
//var timer = new PAB.HiPerfTimer();
var timer = new Stopwatch();
while (true)
{
int n;
//cin >> n;
n = 41;
if (n < 0) break;
timer.Start();
int res = fib(n);
timer.Stop();
Console.WriteLine(res);
Console.WriteLine(timer.ElapsedMilliseconds);
break;
}
}
}
}
GCC version:
#include <iostream>
#include <stdio.h>
#include <sys/time.h>
using namespace std;
int fib(int n)
{
if (n < 2) return n;
return fib(n - 1) + fib(n - 2);
}
int main()
{
timeval start, end;
while (1)
{
int n;
//cin >> n;
n = 41;
if (n < 0) break;
gettimeofday(&start, 0);
int res = fib(n);
gettimeofday(&end, 0);
int sec = end.tv_sec - start.tv_sec;
int usec = end.tv_usec - start.tv_usec;
cout << res << endl;
cout << sec << " " << usec <<endl;
break;
}
return 0;
}

EDIT: TL/DR version: CLR JIT will inline one level of recursion, MSVC 8 SP1 will not without #pragma inline_recursion(on). And you should run the C# version outside of a debugger to get the fully optimized JIT.
I got similar results to acidzombie24 with C# vs. C++ using VS 2008 SP1 on a Core2 Duo laptop running Vista plugged in with "high performance" power settings (~1600 ms vs. ~3800 ms). It's kind of tricky to see the optimized JIT'd C# code, but for x86 it boils down to this:
00000000 55 push ebp
00000001 8B EC mov ebp,esp
00000003 57 push edi
00000004 56 push esi
00000005 53 push ebx
00000006 8B F1 mov esi,ecx
00000008 83 FE 02 cmp esi,2
0000000b 7D 07 jge 00000014
0000000d 8B C6 mov eax,esi
0000000f 5B pop ebx
00000010 5E pop esi
00000011 5F pop edi
00000012 5D pop ebp
00000013 C3 ret
return fib(n - 1) + fib(n - 2);
00000014 8D 7E FF lea edi,[esi-1]
00000017 83 FF 02 cmp edi,2
0000001a 7D 04 jge 00000020
0000001c 8B DF mov ebx,edi
0000001e EB 19 jmp 00000039
00000020 8D 4F FF lea ecx,[edi-1]
00000023 FF 15 F8 2F 12 00 call dword ptr ds:[00122FF8h]
00000029 8B D8 mov ebx,eax
0000002b 4F dec edi
0000002c 4F dec edi
0000002d 8B CF mov ecx,edi
0000002f FF 15 F8 2F 12 00 call dword ptr ds:[00122FF8h]
00000035 03 C3 add eax,ebx
00000037 8B D8 mov ebx,eax
00000039 4E dec esi
0000003a 4E dec esi
0000003b 83 FE 02 cmp esi,2
0000003e 7D 04 jge 00000044
00000040 8B D6 mov edx,esi
00000042 EB 19 jmp 0000005D
00000044 8D 4E FF lea ecx,[esi-1]
00000047 FF 15 F8 2F 12 00 call dword ptr ds:[00122FF8h]
0000004d 8B F8 mov edi,eax
0000004f 4E dec esi
00000050 4E dec esi
00000051 8B CE mov ecx,esi
00000053 FF 15 F8 2F 12 00 call dword ptr ds:[00122FF8h]
00000059 03 C7 add eax,edi
0000005b 8B D0 mov edx,eax
0000005d 03 DA add ebx,edx
0000005f 8B C3 mov eax,ebx
00000061 5B pop ebx
00000062 5E pop esi
00000063 5F pop edi
00000064 5D pop ebp
00000065 C3 ret
In contrast to the C++ generated code (/Ox /Ob2 /Oi /Ot /Oy /GL /Gr):
int fib(int n)
{
00B31000 56 push esi
00B31001 8B F1 mov esi,ecx
if (n < 2) return n;
00B31003 83 FE 02 cmp esi,2
00B31006 7D 04 jge fib+0Ch (0B3100Ch)
00B31008 8B C6 mov eax,esi
00B3100A 5E pop esi
00B3100B C3 ret
00B3100C 57 push edi
return fib(n - 1) + fib(n - 2);
00B3100D 8D 4E FE lea ecx,[esi-2]
00B31010 E8 EB FF FF FF call fib (0B31000h)
00B31015 8D 4E FF lea ecx,[esi-1]
00B31018 8B F8 mov edi,eax
00B3101A E8 E1 FF FF FF call fib (0B31000h)
00B3101F 03 C7 add eax,edi
00B31021 5F pop edi
00B31022 5E pop esi
}
00B31023 C3 ret
The C# version basically inlines fib(n-1) and fib(n-2). For a function that is so call heavy, reducing the number of function calls is the key to speed. Replacing fib with the following:
int fib(int n);
int fib2(int n)
{
if (n < 2) return n;
return fib(n - 1) + fib(n - 2);
}
int fib(int n)
{
if (n < 2) return n;
return fib2(n - 1) + fib2(n - 2);
}
Gets it down to ~1900 ms. Incidentally, if I use #pragma inline_recursion(on) I get similar results with the original fib. Unrolling it one more level:
int fib(int n);
int fib3(int n)
{
if (n < 2) return n;
return fib(n - 1) + fib(n - 2);
}
int fib2(int n)
{
if (n < 2) return n;
return fib3(n - 1) + fib3(n - 2);
}
int fib(int n)
{
if (n < 2) return n;
return fib2(n - 1) + fib2(n - 2);
}
Gets it down to ~1380 ms. Beyond that it tapers off.
So it appears that the CLR JIT for my machine will inline recursive calls one level, whereas the C++ compiler will not do that by default.
If only all performance critical code were like fib!

EDIT:
While the original C++ timing is wrong (comparing cycles to milliseconds), better timing does show C# is faster with vanilla compiler settings.
OK, enough random speculation, time for some science. After getting weird results with existing C++ code, I just tried running:
int fib(int n)
{
if (n < 2) return n;
return fib(n - 1) + fib(n - 2);
}
int main()
{
__int64 time = 0xFFFFFFFF;
while (1)
{
int n;
//cin >> n;
n = 41;
if (n < 0) break;
LARGE_INTEGER start, end, delta, freq;
::QueryPerformanceFrequency( &freq );
::QueryPerformanceCounter( &start );
int res = fib(n);
::QueryPerformanceCounter( &end );
delta.QuadPart = end.QuadPart - start.QuadPart;
cout << res << endl;
cout << ( delta.QuadPart * 1000 ) / freq.QuadPart <<endl;
break;
}
return 0;
}
EDIT:
MSN pointed out you should time C# outside the debugger, so I re-ran everything:
Best Results (VC2008, running release build from commandline, no special options enabled)
C++ Original Code - 10239
C++ QPF - 3427
C# - 2166 (was 4700 in debugger).
The original C++ code (with rdtsc) wasn't returning milliseconds, just a factor of reported clock cycles, so comparing directly to StopWatch() results is invalid. The original timing code is just wrong.
Note StopWatch() uses QueryPerformance* calls:
http://msdn.microsoft.com/en-us/library/system.diagnostics.stopwatch.aspx
So in this case C++ is faster than C#.
It depends on your compiler settings - see MSN's answer.

Don't understand the answer with garbage collection or console buffering.
It could be that your timer mechanism in C++ is inherently flawed.
According to http://en.wikipedia.org/wiki/Rdtsc, it is possible that you get wrong benchmark results.
Quoted:
While this makes time keeping more
consistent, it can skew benchmarks,
where a certain amount of spin-up time
is spent at a lower clock rate before
the OS switches the processor to the
higher rate. This has the effect of
making things seem like they require
more processor cycles than they
normally would.

I think the problem is your timing code in C++.
From the MS docs for __rdtsc:
Generates the rdtsc instruction, which returns the processor time stamp.
The processor time stamp records the number of clock cycles since the last reset.
Perhaps try GetTickCount().

Not saying that's the issue, but you may want to read How to: Use the High-Resolution Timer
Also see this...
http://en.wikipedia.org/wiki/Comparison_of_Java_and_C%2B%2B#Performance
Several studies of mostly numerical benchmarks argue that Java could potentially be faster than C++ in some circumstances, for a variety of reasons:[8][9]
Pointers make optimization difficult since they may point to arbitrary data, though many C++ compilers provide the C99 keyword restrict which corrects this problem.[10]
Compared to C++ implementations which make unrestrained use of standard implementations of malloc/new for memory allocation, implementations of Java garbage collection may have better cache coherence as its allocations are generally made sequentially.
* Run-time compilation can potentially use additional information available at run-time to optimise code more effectively, such as knowing what processor the code will be executed on.
It's about Java but begins to tackle the issue of Performance between C runtimes and JITed runtimes.

Maybe C# is able to unroll stack in recursive calls? I think it is also reduces number of computations.

One important thing to remember when comparing languages is that if you do a simple line-by-line translation, you're not comparing apples to apples.
What makes sense in one language may have horrible side effects in another. To really compare the performance characteristics you need a C# version and a C++, and the code for those versions may be very different. For example, in C# I wouldn't even use the same function signature. I'd go with something more like this:
IEnumerable<int> Fibonacci()
{
int n1 = 0;
int n2 = 1;
yield return 1;
while (true)
{
int n = n1 + n2;
n1 = n2;
n2 = n;
yield return n;
}
}
and then wrap that like this:
public static int fib(int n)
{
return Fibonacci().Skip(n).First();
}
That will do much better, because it works from the bottom up to take advantage of the calculations in the last term to help build the next one, rather than two separate sets of recursive calls.
And if you really want screaming performance in C++ you can use meta-programming to make the compiler pre-compute your results like this:
template<int N> struct fibonacci
{
static const int value = fibonacci<N - 1>::value + fibonacci<N - 2>::value;
};
template<> struct fibonacci<1>
{
static const int value = 1;
};
template<> struct fibonacci<0>
{
static const int value = 0;
};

It could be that the methods are pre-jitted at runtime prior to running the test...or that the Console is a wrapper around the API for outputting to console, when the C++'s code for cout is buffered..I guess..
Hope this helps,
Best regards,
Tom.

you are calling static function in c# code which will be inlined, and in c++ you use nonstatic function. i have ~1.4 sec for c++. with g++ -O3 you can have 1.21 sec.
you just can't compare c# with c++ with badly translated code

If that code is truly 1/2 the execution time then some possible reasons are:
Garbage collection speeds up execution of C# code over C++ code if that were happening anywhere in the above code.
The C# writing to the console may be buffered (C++ might not, or it might just not be as efficient)

Speculation 1
Garbage collection procedure might play a role.
In the C++ version all memory management would occur inline while the program is running, and that would count into the final time.
In .NET the Garbage Collector (GC) of the Common Language Runtime (CLR) is a separate process on a different thread and often cleans up your program after it's completed. Therefore your program will finish, the times will print out before memory is freed. Especially for small programs which usually won't be cleaned up at all until completion.
It all depends on details of the Garbage Collection implementation (and if it optimizes for the stack in the same way as the heap) but I assume this plays a partial role in the speed gains. If the C++ version was also optimized to not deallocate/clean up memory until after it finished (or push that step until after the program completed) then I'm sure you would see C++ speed gains.
To Test GC: To see the "delayed" .NET GC behaviour in action, put a breakpoint in some of your object's destructor/finalizer methods. The debugger will come alive and hit those breakpoints after the program is completed (yes, after Main is completed).
Speculation 2
Otherwise, the C# source code is compiled by the programmer down to IL code (Microsoft byte code instructions) and at runtime those are in turn compiled by the CLR's Just-In-Time compiler into an processor-specific instruction set (as with classic compiled programs) so there's really no reason a .NET program should be slower once it gets going and has run the first time.

I think everyone here has missed the "secret ingredient" that makes all the difference: The JIT compiler knows exactly what the target architecture is, whereas a static compiler does not. Different x86 processors have very different architectures and pipelines, so a sequence of instructions that is the fastest possible on one CPU might be relatively slower on another.
In this case the Microsoft C++ compiler's optimization strategy was targeted to a different processor than the CPU acidzombie24 was actually using, but gcc chose instructions more suited to his CPU. On a newer, older, or different-manufacturer CPU it is likely Microsoft C++ would be faster than gcc.
JIT has the best potential of all: Since it knows exactly what CPU is being targeted it has the ability to generate the very best possible code in every situation. Thus C# is inherently (in the long term) likely to be faster than C++ for such code.
Having said this, I would guess that the fact that CLR's JIT picked a better instruction sequence than Microsoft C++ was more a matter of luck than knowing the architecture. This is evidenced by the fact that on Justicle's CPU the Microsoft C++ compiler selected a better instruction sequence than the CLR JIT compiler.
A note on _rdtsc vs QueryPerformanceCounter: Yes _rdtsc is broken, but when you're talking a 3-4 second operation and running it several times to validate consistent timing, any situation that causes _rdtsc to give bogus timings (such as processor speed changes or processor changes) should cause outlying values in the test data that will be thrown out, so assuming acidzombie24 did his original benchmarks properly I doubt the _rdtsc vs QueryPerformanceCounter question really had any impact.

I know that the .NET compiler has a Intel optimization.

C# - Making one Int64 from two Int32s

Is there a function in c# that takes two 32 bit integers (int) and returns a single 64 bit one (long)?
Sounds like there should be a simple way to do this, but I couldn't find a solution.

Try the following
public long MakeLong(int left, int right) {
//implicit conversion of left to a long
long res = left;
//shift the bits creating an empty space on the right
// ex: 0x0000CFFF becomes 0xCFFF0000
res = (res << 32);
//combine the bits on the right with the previous value
// ex: 0xCFFF0000 | 0x0000ABCD becomes 0xCFFFABCD
res = res | (long)(uint)right; //uint first to prevent loss of signed bit
//return the combined result
return res;
}

Just for clarity... While the accepted answer does appear to work correctly. All of the one liners presented do not appear to produce accurate results.
Here is a one liner that does work:
long correct = (long)left << 32 | (long)(uint)right;
Here is some code so you can test it for yourself:
long original = 1979205471486323557L;
int left = (int)(original >> 32);
int right = (int)(original & 0xffffffffL);
long correct = (long)left << 32 | (long)(uint)right;
long incorrect1 = (long)(((long)left << 32) | (long)right);
long incorrect2 = ((Int64)left << 32 | right);
long incorrect3 = (long)(left * uint.MaxValue) + right;
long incorrect4 = (long)(left * 0x100000000) + right;
Console.WriteLine(original == correct);
Console.WriteLine(original == incorrect1);
Console.WriteLine(original == incorrect2);
Console.WriteLine(original == incorrect3);
Console.WriteLine(original == incorrect4);

Try
(long)(((long)i1 << 32) | (long)i2)
this shifts the first int left by 32 bits (the length of an int), then ors in the second int, so you end up with the two ints concatentated together in a long.

Be careful with the sign bit. Here is a fast ulong solution, that is also not portable from little endian to big endian:
var a = 123;
var b = -123;
unsafe
{
ulong result = *(uint*)&a;
result <<= 32;
result |= *(uint*)&b;
}

This should do the trick
((Int64) a << 32 | b)
Where a and b are Int32. Although you might want to check what happens with the highest bits. Or just put it inside an "unchecked {...}" block.

Gotta be careful with bit twiddling like this though cause you'll have issues on little endian/big endian machines (exp Mono platforms aren't always little endian). Plus you have to deal with sign extending. Mathematically the following is the same but deals with sign extension and is platform agnostic.
return (long)( high * uint.MaxValue ) + low;
When jitted at runtime it will result in performance similar to the bit twiddling. That's one of the nice things about interpreted languages.

There is a problem when i2 < 0 - high 32 bits will be set (0xFFFFFFFF,1xxx... binary) - thecoop was wrong
Better would be something like (Int64)(((UInt64)i1 << 32) | (UInt32)i2)
Or simply C++ way
public static unsafe UInt64 MakeLong(UInt32 low, UInt32 high)
{
UInt64 retVal;
UInt32* ptr = (UInt32*)&retVal;
*ptr++ = low;
*ptr = high;
return retVal;
}
UInt64 retVal;
unsafe
{
UInt32* ptr = (UInt32*)&retVal;
*ptr++ = low;
*ptr = high;
}
But the best solution found then here ;-)
[StructLayout(LayoutKind.Explicit)]
[FieldOffset()]
https://stackoverflow.com/questions/12898591
(even w/o unsafe)
Anyway FieldOffset works for each item, so you have to specify position of each half separate and remember negative #s are zero complements, so ex. low <0 and high >0 will not make sense - for example -1,0 will give Int64 as 4294967295 probably.

We Keep Coding

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.