Related
I'm trying to create C# app which uses dll library which contains C++ code and inline assembly. In function test_MMX I want to add two arrays of specific length.
extern "C" __declspec(dllexport) void __stdcall test_MMX(int *first_array,int *second_array,int length)
{
__asm
{
mov ecx,length;
mov esi,first_array;
shr ecx,1;
mov edi,second_array;
label:
movq mm0,QWORD PTR[esi];
paddd mm0,QWORD PTR[edi];
add edi,8;
movq QWORD PTR[esi],mm0;
add esi,8;
dec ecx;
jnz label;
}
}
After run app it's showing this warning:
warning C4799: function 'test_MMX' has no EMMS instruction.
When I want to measure time of running this function C# in miliseconds it returns this value: -922337203685477 instead of (for example 0,0141)...
private Stopwatch time = new Stopwatch();
time.Reset();
time.Start();
test_MMX(first_array, second_array, length);
time.Stop();
TimeSpan interval = time.Elapsed;
return trvanie.TotalMilliseconds;
Any ideas how to fix it please ?
Since MMX aliases over the floating-point registers, any routine that uses MMX instructions must end with the EMMS instruction. This instruction "clears" the registers, making them available for use by the x87 FPU once again. (Which any C or C++ calling convention for x86 will assume is safe.)
The compiler is warning you that you have written a routine that uses MMX instructions but does not end with the EMMS instruction. That's a bug waiting to happen, as soon as some FPU instruction tries to execute.
This is a huge disadvantage of MMX, and the reason why you really can't freely intermix MMX and floating-point instructions. Sure, you could just throw EMMS instructions around, but it is a slow, high-latency instruction, so this kills performance. SSE had the same limitations as MMX in this regard, at least for integer operations. SSE2 was the first instruction set to address this problem, since it used its own discrete register set. Its registers are also twice as wide as MMX's are, so you can do even more at a time. Since SSE2 does everything that MMX does, but faster, easier, and more efficiently, and is supported by the Pentium 4 and later, it is quite rare that anyone needs to write new code today that uses MMX. If you can use SSE2, you should. It will be faster than MMX. Another reason not to use MMX is that it is not supported in 64-bit mode.
Anyway, the correct way to write the MMX code would be:
__asm
{
mov ecx, [length]
mov eax, [first_array]
shr ecx, 1
mov edx, [second_array]
label:
movq mm0, QWORD PTR [eax]
paddd mm0, QWORD PTR [edx]
add edx, 8
movq QWORD PTR [eax], mm0
add eax, 8
dec ecx
jnz label
emms
}
Note that, in addition to the EMMS instruction (which, of course, is placed outside of the loop), I made a few additional changes:
Assembly-language instructions do not end with semicolons. In fact, in assembly language's syntax, the semicolon is used to begin a comment. So I have removed your semicolons.
I've also added spaces for readability.
And, while it isn't strictly necessary (Microsoft's inline assembler is sufficiently forgiving so as to allow you to get away with not doing it), it is a good idea to be explicit and wrap the use of addresses (C/C++ variables) in square brackets, since you are actually dereferencing them.
As a commenter pointed out, you can freely use the ESI and EDI registers in inline assembly, since the inline assembler will detect their use and generate additional instructions that push/pop them accordingly. In fact, it will do this with all non-volatile registers. And if you need additional registers, then you need them, and this is a nice feature. But in this code, you're only using three general-purpose registers, and in the __stdcall calling convention, there are three general-purpose registers that are specifically defined as volatile (i.e., can be freely clobbered by any function): EAX, EDX, and ECX. So you should be using those registers for maximum speed. As such, I've changed your use of ESI to EAX, and your use of EDI to EDX. This will improve the code that you can't see, the prologue and epilogue automatically generated by the compiler.
You have a potential speed trap lurking here, though, and that is alignment. To obtain maximum speed, MMX instructions need to operate on data that is aligned on 8-byte boundaries. In a loop, misaligned data has a compounding negative effect on performance: not only is the data misaligned the first time through the loop, exerting a significant performance penalty, but it is guaranteed to be misaligned each subsequent time through the loop, too. So for this code to have any chance of being fast, the caller needs to guarantee that first_array and second_array are aligned on 8-byte boundaries.
If you can't guarantee that, then the function should really have extra code added to it to fix up misalignments. Essentially, you want to do a couple of non-vector operations (on individual bytes) at the beginning, before starting the loop, until you've reached a suitable alignment. Then, you can start issuing the vectorized MMX instructions.
(Unaligned loads are no longer penalized on modern processors, but if you were targeting modern processors, you'd be writing SSE2 code. On the older processors where you need to run MMX code, alignment will be a big deal, and misaligned data will kill your performance.)
Now, this inline assembly won't produce particularly efficient code. When you use inline assembly, the compiler always generates prologue and epilogue code for the function. That isn't terrible, since it's outside of the critical inner loop, but still—it's cruft you don't need. Worse, jumps in inline assembly blocks tend to confuse MSVC's inline assembler and cause it to generate sub-optimal code. It is overly cautious, preventing you from doing something that could corrupt the stack or cause other external side effects, which is nice, except that the whole reason you're writing inline assembly is (presumably) because you desire maximum performance.
(It should go without saying, but if you don't need the maximum possible performance, you should just write the code in C (or C++) and let the compiler optimize it. It does a darn good job in the majority of cases.)
If you do need the maximum possible performance, and have decided that the compiler-generated code just won't cut it, then a better alternative to inline assembly is the use of intrinsics. Intrinsics will generally map one-to-one to assembly-language instructions, but the compiler does a lot better job optimizing around them.
Here's my version of your code, using MMX intrinsics:
#include <intrin.h> // include header with MMX intrinsics
void __stdcall Function_With_Intrinsics(int *first_array, int *second_array, int length)
{
unsigned int counter = static_cast<unsigned int>(length);
counter /= 2;
do
{
*reinterpret_cast<__m64*>(first_array) = _mm_add_pi32(*reinterpret_cast<const __m64*>(first_array),
*reinterpret_cast<const __m64*>(second_array));
first_array += 8;
second_array += 8;
} while (--counter != 0);
_mm_empty();
}
It does the same thing, but more efficiently by delegating more to the compiler's optimizer. A couple of notes:
Since your assembly code treats length as an unsigned integer, I assume that your interface requires that it actually be an unsigned integer. (And, if so, I wonder why you don't declare it as such in the function's signature.) To achieve the same effect, I've cast it to an unsigned int, which is subsequently used as the counter. (If I hadn't done that, I'd have to have either done a shift operation on a signed integer, which risks undefined behavior, or a division by two, for which the compiler would have generated slower code to correctly deal with the sign bit.)
The *reinterpret_cast<__m64*> business scattered throughout looks scary, but is actually safe—at least, relatively speaking. That's what you're supposed to do with the MMX intrinsics. The MMX data type is __m64, which you can think of as being roughly equivalent to an mm? register. It is 64 bits in length, and loads and stores are accomplished by casting. These get translated directly into MOVQ instructions.
Your original assembly code was written such that the loop always iterated at least once, so I transformed that into a do…while loop. This means the test of the loop condition only has to be done at the bottom of the loop, rather than once at the top and once at the bottom.
The _mm_empty() intrinsic causes an EMMS instruction to be emitted.
Just for grins, let's see what the compiler transformed this into. This is the output from MSVC 16 (VS 2010), targeting x86-32 and optimizing for speed over size (though it makes no difference in this particular case):
PUBLIC ?Function_With_Intrinsics##YGXPAH0H#Z
; Function compile flags: /Ogtpy
_first_array$ = 8 ; size = 4
_second_array$ = 12 ; size = 4
_length$ = 16 ; size = 4
?Function_With_Intrinsics##YGXPAH0H#Z PROC
mov ecx, DWORD PTR _length$[esp-4]
mov edx, DWORD PTR _second_array$[esp-4]
mov eax, DWORD PTR _first_array$[esp-4]
shr ecx, 1
sub edx, eax
$LL3:
movq mm0, MMWORD PTR [eax]
movq mm1, MMWORD PTR [edx+eax]
paddd mm0, mm1
movq MMWORD PTR [eax], mm0
add eax, 32
dec ecx
jne SHORT $LL3
emms
ret 12
?Function_With_Intrinsics##YGXPAH0H#Z ENDP
It is recognizably similar to your original code, but does a couple of things differently. In particular, it tracks the array pointers differently, in a way that it (and I) believe is slightly more efficient than your original code, since it does less work inside of the loop. It also breaks apart your PADDD instruction so that both of its operands are MMX registers, instead of the source being a memory operand. Again, this tends to make the code more efficient at the expense of clobbering an additional MMX register, but we've got plenty of those to spare, so it's certainly worth it.
Better yet, as the optimizer improves in newer versions of the compiler, code that is written using intrinsics may get even better!
Of course, rewriting the function to use intrinsics doesn't solve the alignment problem, but I'm assuming you have already dealt with that on the caller side. If not, you'll need to add code to handle it.
If you wanted to use SSE2—perhaps that would be test_SSE2 and you would dynamically delegate to the appropriate implementation depending on the current processor's feature bits—then you could do it like this:
#include <intrin.h> // include header with SSE2 intrinsics
void __stdcall Function_With_Intrinsics_SSE2(int *first_array, int *second_array, int length)
{
unsigned int counter = static_cast<unsigned>(length);
counter /= 4;
do
{
_mm_storeu_si128(reinterpret_cast<__m128i*>(first_array),
_mm_add_epi32(_mm_loadu_si128(reinterpret_cast<const __m128i*>(first_array)),
_mm_loadu_si128(reinterpret_cast<const __m128i*>(second_array))));
first_array += 16;
second_array += 16;
} while (--counter != 0);
}
I've written this code not assuming alignment, so it will work when the loads and stores are misaligned. For maximum speed on many older architectures, SSE2 requires 16-byte alignment, and if you can guarantee that the source and destination pointers are thusly aligned, you can use slightly faster instructions (e.g., MOVDQA as opposed to MOVDQU). As mentioned above, on newer architectures (at least Sandy Bridge and later, perhaps earlier), it doesn't matter.
To give you an idea of how SSE2 is basically just a drop-in replacement for MMX on Pentium 4 and later, except that you also get to do operations that are twice as wide, look at the code this compiles to:
PUBLIC ?Function_With_Intrinsics_SSE2##YGXPAH0H#Z
; Function compile flags: /Ogtpy
_first_array$ = 8 ; size = 4
_second_array$ = 12 ; size = 4
_length$ = 16 ; size = 4
?Function_With_Intrinsics_SSE2##YGXPAH0H#Z PROC
mov ecx, DWORD PTR _length$[esp-4]
mov edx, DWORD PTR _second_array$[esp-4]
mov eax, DWORD PTR _first_array$[esp-4]
shr ecx, 2
sub edx, eax
$LL3:
movdqu xmm0, XMMWORD PTR [eax]
movdqu xmm1, XMMWORD PTR [edx+eax]
paddd xmm0, xmm1
movdqu XMMWORD PTR [eax], xmm0
add eax, 64
dec ecx
jne SHORT $LL3
ret 12
?Function_With_Intrinsics_SSE2##YGXPAH0H#Z ENDP
As for the final question about getting negative values from the .NET Stopwatch class, I would normally guess that would be due to an overflow. In other words, your code executed too slowly, and the timer wrapped around. Kevin Gosse pointed out, though, that this is apparently a bug in the implementation of the Stopwatch class. I don't know much more about it, since I don't really use it. If you want a good microbenchmarking library, I use and recommend Google Benchmark. However, it is for C++, not C#.
While you're benchmarking, definitely take the time to time the code generated by the compiler when you write it the naïve way. Say, something like:
void Naive_PackedAdd(int *first_array, int *second_array, int length)
{
for (unsigned int i = 0; i < static_cast<unsigned int>(length); ++i)
{
first_array[i] += second_array[i];
}
}
You just might be pleasantly surprised at how fast the code is after the compiler gets finished auto-vectorizing the loop. :-) Remember that less code does not necessarily mean faster code. All of that extra code is required to deal with alignment issues, which I've diplomatically skirted throughout this answer. If you scroll down, at $LL4#Naive_Pack, you'll find an inner loop very similar to what we've been considering here.
This is done in C#, profiler used is Intel Vtune. Code compiled in Release configuration.
The following line of code:
_smalestRangeSq = float.MaxValue;
where _smalestRangeSq is a class field, result in this
movss xmm0, dword ptr [rip+0x9ee]
movss dword ptr [rsi+0x58], xmm0
And apparently it consumes a lot of time performing stores. Can anyone explain for me why?
I'm writing some math code in C#, forcing it to compile for x86 in release, optimized, and I'm looking at the disassembly in windbg. It's usually pretty good, often writing better assembly than I could (not that I'm all that good at assembly, but there you go).
However, I've noticed that this function:
static void TemporaryWork()
{
double x = 4;
double y = 3;
double z = Math.Atan2(x, y);
}
Is producing this disassembly:
001f0078 55 push ebp
001f0079 8bec mov ebp,esp
001f007b dd05a0001f00 fld qword ptr ds:[1F00A0h]
001f0081 83ec08 sub esp,8
001f0084 dd1c24 fstp qword ptr [esp]
001f0087 dd05a8001f00 fld qword ptr ds:[1F00A8h]
001f008d 83ec08 sub esp,8
001f0090 dd1c24 fstp qword ptr [esp]
001f0093 e86e9ba66f call clr!GetHashFromBlob+0x94e09 (6fc59c06) (System.Math.Atan2(Double, Double), mdToken: 06000de7)
001f0098 ddd8 fstp st(0)
001f009a 5d pop ebp
001f009b c3 ret
Even if you're not an x86 guru, you'll notice something odd in there: there's a call to System.Math.Atan2. As in a function call.
But there's actually an x86 opcode that would do that: FPATAN
Why is the JITer calling a function when there's an actual assembly instruction to do the operation? I thought that System.Math was basically a wrapper for native assembly instructions. Most of the operations in there have direct assembly opcodes. But that's apparently not the case?
Does anyone have any information on why the JITer isn't/can't perform this rather obvious optimization?
You can chase down the reason from this answer, it shows how these math functions are mapped by the jitter.
Which takes you to clr/src/classlibnative/float/comfloat.cpp, ComDouble::Atan2() function. Which explains the reason:
// the intrinsic for Atan2 does not produce Nan for Atan2(+-inf,+-inf)
if (IS_DBL_INFINITY(x) && IS_DBL_INFINITY(y)) {
return(x / y); // create a NaN
}
return (double) atan2(x, y);
So it is a workaround to fix FPU behavior that is not CLI compliant.
ILSpy shows that String.IsNullOrEmpty is implemented in terms of String.Length. But then why is String.IsNullOrEmpty(s) faster than s.Length == 0?
For example, it's 5% faster in this benchmark:
var stopwatches = Enumerable.Range(0, 4).Select(_ => new Stopwatch()).ToArray();
var strings = "A,B,,C,DE,F,,G,H,,,,I,J,,K,L,MN,OP,Q,R,STU,V,W,X,Y,Z,".Split(',');
var testers = new Func<string, bool>[] { s => s == String.Empty, s => s.Length == 0, s => String.IsNullOrEmpty(s), s => s == "" };
int count = 0;
for (int i = 0; i < 10000; ++i) {
stopwatches[i % 4].Start();
for (int j = 0; j < 1000; ++j)
count += strings.Count(testers[i % 4]);
stopwatches[i % 4].Stop();
}
(Other benchmarks show similar results. This one minimized the effect of cruft running on my computer. Also, as an aside, the tests comparing to empty strings came out the same at about 13% slower than IsNullOrEmpty.)
Additionally, why is IsNullOrEmpty only faster on x86, whereas on x64 String.Length is about 9% faster?
Update: Test setup details: .NET 4.0 running on 64-bit Windows 7, Intel Core i5 processor, console project compiled with "Optimize code" enabled. However, "Suppress JIT optimization on module load" was also enabled (see accepted answer and comments).
With optimization fully enabled, Length is about 14% faster than IsNullOrEmpty with the delegate and other overhead removed, as in this test:
var strings = "A,B,,C,DE,F,,G,H,,,,I,J,,K,L,MN,OP,Q,R,,STU,V,,W,,X,,,Y,,Z,".Split(',');
int count = 0;
for (uint i = 0; i < 100000000; ++i)
count += strings[i % 32].Length == 0 ? 1 : 0; // Replace Length test with String.IsNullOrEmpty
It's because you ran your benchmark from within Visual Studio which prevents JIT compiler from optimizing code. Without optimizations, this code is produced for String.IsNullOrEmpty
00000000 push ebp
00000001 mov ebp,esp
00000003 sub esp,8
00000006 mov dword ptr [ebp-8],ecx
00000009 cmp dword ptr ds:[00153144h],0
00000010 je 00000017
00000012 call 64D85BDF
00000017 mov ecx,dword ptr [ebp-8]
0000001a call 63EF7C0C
0000001f mov dword ptr [ebp-4],eax
00000022 movzx eax,byte ptr [ebp-4]
00000026 mov esp,ebp
00000028 pop ebp
00000029 ret
and now compare it to code produced for Length == 0
00000000 push ebp
00000001 mov ebp,esp
00000003 sub esp,8
00000006 mov dword ptr [ebp-8],ecx
00000009 cmp dword ptr ds:[001E3144h],0
00000010 je 00000017
00000012 call 64C95BDF
00000017 mov ecx,dword ptr [ebp-8]
0000001a cmp dword ptr [ecx],ecx
0000001c call 64EAA65B
00000021 mov dword ptr [ebp-4],eax
00000024 cmp dword ptr [ebp-4],0
00000028 sete al
0000002b movzx eax,al
0000002e mov esp,ebp
00000030 pop ebp
00000031 ret
You can see, that code for Length == 0 does everything that does code for String.IsNullOrEmpty, but additionally it tries something like foolishly convert boolean value (returned from length comparison) again to boolean and this makes it slower than String.IsNullOrEmpty.
If you compile program with optimizations enabled (Release mode) and run .exe file directly from Windows, code generated by JIT compiler is much better. For String.IsNullOrEmpty it is:
001f0650 push ebp
001f0651 mov ebp,esp
001f0653 test ecx,ecx
001f0655 je 001f0663
001f0657 cmp dword ptr [ecx+4],0
001f065b sete al
001f065e movzx eax,al
001f0661 jmp 001f0668
001f0663 mov eax,1
001f0668 and eax,0FFh
001f066d pop ebp
001f066e ret
and for Length == 0:
001406f0 cmp dword ptr [ecx+4],0
001406f4 sete al
001406f7 movzx eax,al
001406fa ret
With this code, result are as expected, i.e. Length == 0 is slightly faster than String.IsNullOrEmpty.
It's also worth mentioning, that using Linq, lambda expressions and computing modulo in your benchmark is not such a good idea, because these operations are slow (relatively to string comparison) and make result of benchmark inaccurate.
Your benchmark does not measure String.IsNullOrEmpty vs String.Length, but rather how different lambda expressions are generated to functions. I.e. it is not very surprising that delegate that just contains single function call (IsNullOrEmpty) is faster than one with function call and comparison (Length == 0).
To get comparison of actuall call - write code that calls them directly without delegates.
EDIT: My rough measurements show that delegate version with IsNullOrEmpty is slightly faster then the rest, while direct calls to the same comparision are in reverse order (and about twice faster due to significantly less number of extra code) on my machine. Results likely to wary between machines, x86/x64 mode, as well between versions of runtime. For practical purposes I would consider all 4 ways are about the same if you need to use them in LINQ queries.
Overall I doubt there will be measurable difference in real program cased by choice between these methods, so pick the one that is most readable to you and use it. I generally prefer IsNullOrEmpty since it gives less chance to get ==/!= wrong in a condition.
Removal of string manipulation altogether from time critical code will likley bring much higer benifit that picking between these choices, also dropping LINQ for critical code is an option. As always - make sure to measure overall program speed in real life scenario.
You test is wrong somethere. IsNullOrEmpty can't be faster by definition, since it makes additional null comparison operation, and then tests the Length.
So the answer can be: it's faster because of your test. However even your code shows that IsNullOrEmpty is consistently slower on my machine in both x86 and x64 modes.
I believe your test is not correct:
This test shows that string.IsNullOrEmpty is always slower than s.Length==0 because it performs an additional null check:
var strings = "A,B,,C,DE,F,,G,H,,,,I,J,,K,L,MN,OP,Q,R,STU,V,W,X,Y,Z,".Split(',');
var testers = new Func<string, bool>[] {
s => s == String.Empty,
s => s.Length == 0,
s => String.IsNullOrEmpty(s),
s => s == "" ,
};
int n = testers.Length;
var stopwatches = Enumerable.Range(0, testers.Length).Select(_ => new Stopwatch()).ToArray();
int count = 0;
for(int i = 0; i < n; ++i) { // iterate testers one by one
Stopwatch sw = stopwatches[i];
var tester = testers[i];
sw.Start();
for(int j = 0; j < 10000000; ++j) // increase this count for better precision
count += strings.Count(tester);
sw.Stop();
}
for(int i = 0; i < testers.Length; i++)
Console.WriteLine(stopwatches[i].ElapsedMilliseconds);
Results:
6573
5328
5488
6419
You can use s.Length==0 when you are ensure that target data does not contains null strings. In other cases I suggest you use the String.IsNullOrEmpty.
I think it is impossible IsNullOrEmpty to be faster because as all the rest said it also makes a check for null. But faster or not the difference is going to be so small, that this gives a plus on using IsNullOrEmpty just because of this additional null check that makes your code safer.
In CLR via CSharp chapter 10 "Properties" Jeff Richter writes:
A property method can take a long time to execute; field access always completes immediately. A common reason to use properties is to perform thread synchronization, which can stop the thread forever, and therefore, a property should not be used if thread synchronization is required. In that situation, a method is preferred. Also, if your class can be accessed remotely (for example, your class is derived from System.MarshalByRefObject), calling the property method will be very slow, and therefore, a method is preferred to a property. In my opinion, classes derived from MarshalByRefObject should never use properties.
So if we see String.Length is property and String.IsNullOrEmpty is a method which may execute faster than the property String.Length.
it may be caused by the types of the involved variables.
*Empty seems to use a boolean, length an int (i guess).
Peace !
: edit
I wrote some code for testing the impact of try-catch, but seeing some surprising results.
static void Main(string[] args)
{
Thread.CurrentThread.Priority = ThreadPriority.Highest;
Process.GetCurrentProcess().PriorityClass = ProcessPriorityClass.RealTime;
long start = 0, stop = 0, elapsed = 0;
double avg = 0.0;
long temp = Fibo(1);
for (int i = 1; i < 100000000; i++)
{
start = Stopwatch.GetTimestamp();
temp = Fibo(100);
stop = Stopwatch.GetTimestamp();
elapsed = stop - start;
avg = avg + ((double)elapsed - avg) / i;
}
Console.WriteLine("Elapsed: " + avg);
Console.ReadKey();
}
static long Fibo(int n)
{
long n1 = 0, n2 = 1, fibo = 0;
n++;
for (int i = 1; i < n; i++)
{
n1 = n2;
n2 = fibo;
fibo = n1 + n2;
}
return fibo;
}
On my computer, this consistently prints out a value around 0.96..
When I wrap the for loop inside Fibo() with a try-catch block like this:
static long Fibo(int n)
{
long n1 = 0, n2 = 1, fibo = 0;
n++;
try
{
for (int i = 1; i < n; i++)
{
n1 = n2;
n2 = fibo;
fibo = n1 + n2;
}
}
catch {}
return fibo;
}
Now it consistently prints out 0.69... -- it actually runs faster! But why?
Note: I compiled this using the Release configuration and directly ran the EXE file (outside Visual Studio).
EDIT: Jon Skeet's excellent analysis shows that try-catch is somehow causing the x86 CLR to use the CPU registers in a more favorable way in this specific case (and I think we're yet to understand why). I confirmed Jon's finding that x64 CLR doesn't have this difference, and that it was faster than the x86 CLR. I also tested using int types inside the Fibo method instead of long types, and then the x86 CLR was as equally fast as the x64 CLR.
UPDATE: It looks like this issue has been fixed by Roslyn. Same machine, same CLR version -- the issue remains as above when compiled with VS 2013, but the problem goes away when compiled with VS 2015.
One of the Roslyn engineers who specializes in understanding optimization of stack usage took a look at this and reports to me that there seems to be a problem in the interaction between the way the C# compiler generates local variable stores and the way the JIT compiler does register scheduling in the corresponding x86 code. The result is suboptimal code generation on the loads and stores of the locals.
For some reason unclear to all of us, the problematic code generation path is avoided when the JITter knows that the block is in a try-protected region.
This is pretty weird. We'll follow up with the JITter team and see whether we can get a bug entered so that they can fix this.
Also, we are working on improvements for Roslyn to the C# and VB compilers' algorithms for determining when locals can be made "ephemeral" -- that is, just pushed and popped on the stack, rather than allocated a specific location on the stack for the duration of the activation. We believe that the JITter will be able to do a better job of register allocation and whatnot if we give it better hints about when locals can be made "dead" earlier.
Thanks for bringing this to our attention, and apologies for the odd behaviour.
Well, the way you're timing things looks pretty nasty to me. It would be much more sensible to just time the whole loop:
var stopwatch = Stopwatch.StartNew();
for (int i = 1; i < 100000000; i++)
{
Fibo(100);
}
stopwatch.Stop();
Console.WriteLine("Elapsed time: {0}", stopwatch.Elapsed);
That way you're not at the mercy of tiny timings, floating point arithmetic and accumulated error.
Having made that change, see whether the "non-catch" version is still slower than the "catch" version.
EDIT: Okay, I've tried it myself - and I'm seeing the same result. Very odd. I wondered whether the try/catch was disabling some bad inlining, but using [MethodImpl(MethodImplOptions.NoInlining)] instead didn't help...
Basically you'll need to look at the optimized JITted code under cordbg, I suspect...
EDIT: A few more bits of information:
Putting the try/catch around just the n++; line still improves performance, but not by as much as putting it around the whole block
If you catch a specific exception (ArgumentException in my tests) it's still fast
If you print the exception in the catch block it's still fast
If you rethrow the exception in the catch block it's slow again
If you use a finally block instead of a catch block it's slow again
If you use a finally block as well as a catch block, it's fast
Weird...
EDIT: Okay, we have disassembly...
This is using the C# 2 compiler and .NET 2 (32-bit) CLR, disassembling with mdbg (as I don't have cordbg on my machine). I still see the same performance effects, even under the debugger. The fast version uses a try block around everything between the variable declarations and the return statement, with just a catch{} handler. Obviously the slow version is the same except without the try/catch. The calling code (i.e. Main) is the same in both cases, and has the same assembly representation (so it's not an inlining issue).
Disassembled code for fast version:
[0000] push ebp
[0001] mov ebp,esp
[0003] push edi
[0004] push esi
[0005] push ebx
[0006] sub esp,1Ch
[0009] xor eax,eax
[000b] mov dword ptr [ebp-20h],eax
[000e] mov dword ptr [ebp-1Ch],eax
[0011] mov dword ptr [ebp-18h],eax
[0014] mov dword ptr [ebp-14h],eax
[0017] xor eax,eax
[0019] mov dword ptr [ebp-18h],eax
*[001c] mov esi,1
[0021] xor edi,edi
[0023] mov dword ptr [ebp-28h],1
[002a] mov dword ptr [ebp-24h],0
[0031] inc ecx
[0032] mov ebx,2
[0037] cmp ecx,2
[003a] jle 00000024
[003c] mov eax,esi
[003e] mov edx,edi
[0040] mov esi,dword ptr [ebp-28h]
[0043] mov edi,dword ptr [ebp-24h]
[0046] add eax,dword ptr [ebp-28h]
[0049] adc edx,dword ptr [ebp-24h]
[004c] mov dword ptr [ebp-28h],eax
[004f] mov dword ptr [ebp-24h],edx
[0052] inc ebx
[0053] cmp ebx,ecx
[0055] jl FFFFFFE7
[0057] jmp 00000007
[0059] call 64571ACB
[005e] mov eax,dword ptr [ebp-28h]
[0061] mov edx,dword ptr [ebp-24h]
[0064] lea esp,[ebp-0Ch]
[0067] pop ebx
[0068] pop esi
[0069] pop edi
[006a] pop ebp
[006b] ret
Disassembled code for slow version:
[0000] push ebp
[0001] mov ebp,esp
[0003] push esi
[0004] sub esp,18h
*[0007] mov dword ptr [ebp-14h],1
[000e] mov dword ptr [ebp-10h],0
[0015] mov dword ptr [ebp-1Ch],1
[001c] mov dword ptr [ebp-18h],0
[0023] inc ecx
[0024] mov esi,2
[0029] cmp ecx,2
[002c] jle 00000031
[002e] mov eax,dword ptr [ebp-14h]
[0031] mov edx,dword ptr [ebp-10h]
[0034] mov dword ptr [ebp-0Ch],eax
[0037] mov dword ptr [ebp-8],edx
[003a] mov eax,dword ptr [ebp-1Ch]
[003d] mov edx,dword ptr [ebp-18h]
[0040] mov dword ptr [ebp-14h],eax
[0043] mov dword ptr [ebp-10h],edx
[0046] mov eax,dword ptr [ebp-0Ch]
[0049] mov edx,dword ptr [ebp-8]
[004c] add eax,dword ptr [ebp-1Ch]
[004f] adc edx,dword ptr [ebp-18h]
[0052] mov dword ptr [ebp-1Ch],eax
[0055] mov dword ptr [ebp-18h],edx
[0058] inc esi
[0059] cmp esi,ecx
[005b] jl FFFFFFD3
[005d] mov eax,dword ptr [ebp-1Ch]
[0060] mov edx,dword ptr [ebp-18h]
[0063] lea esp,[ebp-4]
[0066] pop esi
[0067] pop ebp
[0068] ret
In each case the * shows where the debugger entered in a simple "step-into".
EDIT: Okay, I've now looked through the code and I think I can see how each version works... and I believe the slower version is slower because it uses fewer registers and more stack space. For small values of n that's possibly faster - but when the loop takes up the bulk of the time, it's slower.
Possibly the try/catch block forces more registers to be saved and restored, so the JIT uses those for the loop as well... which happens to improve the performance overall. It's not clear whether it's a reasonable decision for the JIT to not use as many registers in the "normal" code.
EDIT: Just tried this on my x64 machine. The x64 CLR is much faster (about 3-4 times faster) than the x86 CLR on this code, and under x64 the try/catch block doesn't make a noticeable difference.
Jon's disassemblies show, that the difference between the two versions is that the fast version uses a pair of registers (esi,edi) to store one of the local variables where the slow version doesn't.
The JIT compiler makes different assumptions regarding register use for code that contains a try-catch block vs. code which doesn't. This causes it to make different register allocation choices. In this case, this favors the code with the try-catch block. Different code may lead to the opposite effect, so I would not count this as a general-purpose speed-up technique.
In the end, it's very hard to tell which code will end up running the fastest. Something like register allocation and the factors that influence it are such low-level implementation details that I don't see how any specific technique could reliably produce faster code.
For example, consider the following two methods. They were adapted from a real-life example:
interface IIndexed { int this[int index] { get; set; } }
struct StructArray : IIndexed {
public int[] Array;
public int this[int index] {
get { return Array[index]; }
set { Array[index] = value; }
}
}
static int Generic<T>(int length, T a, T b) where T : IIndexed {
int sum = 0;
for (int i = 0; i < length; i++)
sum += a[i] * b[i];
return sum;
}
static int Specialized(int length, StructArray a, StructArray b) {
int sum = 0;
for (int i = 0; i < length; i++)
sum += a[i] * b[i];
return sum;
}
One is a generic version of the other. Replacing the generic type with StructArray would make the methods identical. Because StructArray is a value type, it gets its own compiled version of the generic method. Yet the actual running time is significantly longer than the specialized method's, but only for x86. For x64, the timings are pretty much identical. In other cases, I've observed differences for x64 as well.
This looks like a case of inlining gone bad. On an x86 core, the jitter has the ebx, edx, esi and edi register available for general purpose storage of local variables. The ecx register becomes available in a static method, it doesn't have to store this. The eax register often is needed for calculations. But these are 32-bit registers, for variables of type long it must use a pair of registers. Which are edx:eax for calculations and edi:ebx for storage.
Which is what stands out in the disassembly for the slow version, neither edi nor ebx are used.
When the jitter can't find enough registers to store local variables then it must generate code to load and store them from the stack frame. That slows down code, it prevents a processor optimization named "register renaming", an internal processor core optimization trick that uses multiple copies of a register and allows super-scalar execution. Which permits several instructions to run concurrently, even when they use the same register. Not having enough registers is a common problem on x86 cores, addressed in x64 which has 8 extra registers (r9 through r15).
The jitter will do its best to apply another code generation optimization, it will try to inline your Fibo() method. In other words, not make a call to the method but generate the code for the method inline in the Main() method. Pretty important optimization that, for one, makes properties of a C# class for free, giving them the perf of a field. It avoids the overhead of making the method call and setting up its stack frame, saves a couple of nanoseconds.
There are several rules that determine exactly when a method can be inlined. They are not exactly documented but have been mentioned in blog posts. One rule is that it won't happen when the method body is too large. That defeats the gain from inlining, it generates too much code that doesn't fit as well in the L1 instruction cache. Another hard rule that applies here is that a method won't be inlined when it contains a try/catch statement. The background behind that one is an implementation detail of exceptions, they piggy-back onto Windows' built-in support for SEH (Structure Exception Handling) which is stack-frame based.
One behavior of the register allocation algorithm in the jitter can be inferred from playing with this code. It appears to be aware of when the jitter is trying to inline a method. One rule it appears to use that only the edx:eax register pair can be used for inlined code that has local variables of type long. But not edi:ebx. No doubt because that would be too detrimental to the code generation for the calling method, both edi and ebx are important storage registers.
So you get the fast version because the jitter knows up front that the method body contains try/catch statements. It knows it can never be inlined so readily uses edi:ebx for storage for the long variable. You got the slow version because the jitter didn't know up front that inlining wouldn't work. It only found out after generating the code for the method body.
The flaw then is that it didn't go back and re-generate the code for the method. Which is understandable, given the time constraints it has to operate in.
This slow-down doesn't occur on x64 because for one it has 8 more registers. For another because it can store a long in just one register (like rax). And the slow-down doesn't occur when you use int instead of long because the jitter has a lot more flexibility in picking registers.
I'd have put this in as a comment as I'm really not certain that this is likely to be the case, but as I recall it doesn't a try/except statement involve a modification to the way the garbage disposal mechanism of the compiler works, in that it clears up object memory allocations in a recursive way off the stack. There may not be an object to be cleared up in this case or the for loop may constitute a closure that the garbage collection mechanism recognises sufficient to enforce a different collection method.
Probably not, but I thought it worth a mention as I hadn't seen it discussed anywhere else.
9 years later and the bug is still there! You can see it easily with:
static void Main( string[] args )
{
int hundredMillion = 1000000;
DateTime start = DateTime.Now;
double sqrt;
for (int i=0; i < hundredMillion; i++)
{
sqrt = Math.Sqrt( DateTime.Now.ToOADate() );
}
DateTime end = DateTime.Now;
double sqrtMs = (end - start).TotalMilliseconds;
Console.WriteLine( "Elapsed milliseconds: " + sqrtMs );
DateTime start2 = DateTime.Now;
double sqrt2;
for (int i = 0; i < hundredMillion; i++)
{
try
{
sqrt2 = Math.Sqrt( DateTime.Now.ToOADate() );
}
catch (Exception e)
{
int br = 0;
}
}
DateTime end2 = DateTime.Now;
double sqrtMsTryCatch = (end2 - start2).TotalMilliseconds;
Console.WriteLine( "Elapsed milliseconds: " + sqrtMsTryCatch );
Console.WriteLine( "ratio is " + sqrtMsTryCatch / sqrtMs );
Console.ReadLine();
}
The ratio is less than one on my machine, running the latest version of MSVS 2019, .NET 4.6.1