warning C4799: function has no EMMS instruction - c#

I'm trying to create a C# app which uses a DLL library that contains C++ code and inline assembly. In the function test_MMX I want to add two arrays of a specific length.
extern "C" __declspec(dllexport) void __stdcall test_MMX(int *first_array,int *second_array,int length)
{
__asm
{
mov ecx,length;
mov esi,first_array;
shr ecx,1;
mov edi,second_array;
label:
movq mm0,QWORD PTR[esi];
paddd mm0,QWORD PTR[edi];
add edi,8;
movq QWORD PTR[esi],mm0;
add esi,8;
dec ecx;
jnz label;
}
}
After running the app, it shows this warning:
warning C4799: function 'test_MMX' has no EMMS instruction.
When I try to measure the running time of this function from C# in milliseconds, it returns this value: -922337203685477 instead of (for example) 0.0141...
private Stopwatch time = new Stopwatch();
time.Reset();
time.Start();
test_MMX(first_array, second_array, length);
time.Stop();
TimeSpan interval = time.Elapsed;
return interval.TotalMilliseconds;
Any ideas how to fix it, please?

Since MMX aliases over the floating-point registers, any routine that uses MMX instructions must end with the EMMS instruction. This instruction "clears" the registers, making them available for use by the x87 FPU once again. (Which any C or C++ calling convention for x86 will assume is safe.)
The compiler is warning you that you have written a routine that uses MMX instructions but does not end with the EMMS instruction. That's a bug waiting to happen, as soon as some FPU instruction tries to execute.
This is a huge disadvantage of MMX, and the reason why you really can't freely intermix MMX and floating-point instructions. Sure, you could just throw EMMS instructions around, but it is a slow, high-latency instruction, so this kills performance. SSE had the same limitations as MMX in this regard, at least for integer operations. SSE2 was the first instruction set to address this problem, since it used its own discrete register set. Its registers are also twice as wide as MMX's are, so you can do even more at a time. Since SSE2 does everything that MMX does, but faster, easier, and more efficiently, and is supported by the Pentium 4 and later, it is quite rare that anyone needs to write new code today that uses MMX. If you can use SSE2, you should. It will be faster than MMX. Another reason not to use MMX is that it is not supported in 64-bit mode.
Anyway, the correct way to write the MMX code would be:
__asm
{
mov ecx, [length]
mov eax, [first_array]
shr ecx, 1
mov edx, [second_array]
label:
movq mm0, QWORD PTR [eax]
paddd mm0, QWORD PTR [edx]
add edx, 8
movq QWORD PTR [eax], mm0
add eax, 8
dec ecx
jnz label
emms
}
Note that, in addition to the EMMS instruction (which, of course, is placed outside of the loop), I made a few additional changes:
Assembly-language instructions do not end with semicolons. In fact, in assembly language's syntax, the semicolon is used to begin a comment. So I have removed your semicolons.
I've also added spaces for readability.
And, while it isn't strictly necessary (Microsoft's inline assembler is sufficiently forgiving so as to allow you to get away with not doing it), it is a good idea to be explicit and wrap the use of addresses (C/C++ variables) in square brackets, since you are actually dereferencing them.
As a commenter pointed out, you can freely use the ESI and EDI registers in inline assembly, since the inline assembler will detect their use and generate additional instructions that push/pop them accordingly. In fact, it will do this with all non-volatile registers. And if you need additional registers, then you need them, and this is a nice feature. But in this code, you're only using three general-purpose registers, and in the __stdcall calling convention, there are three general-purpose registers that are specifically defined as volatile (i.e., can be freely clobbered by any function): EAX, EDX, and ECX. So you should be using those registers for maximum speed. As such, I've changed your use of ESI to EAX, and your use of EDI to EDX. This will improve the code that you can't see, the prologue and epilogue automatically generated by the compiler.
You have a potential speed trap lurking here, though, and that is alignment. To obtain maximum speed, MMX instructions need to operate on data that is aligned on 8-byte boundaries. In a loop, misaligned data has a compounding negative effect on performance: not only is the data misaligned the first time through the loop, exerting a significant performance penalty, but it is guaranteed to be misaligned each subsequent time through the loop, too. So for this code to have any chance of being fast, the caller needs to guarantee that first_array and second_array are aligned on 8-byte boundaries.
If you can't guarantee that, then the function should really have extra code added to it to fix up misalignments. Essentially, you want to do a couple of non-vector operations (on individual bytes) at the beginning, before starting the loop, until you've reached a suitable alignment. Then, you can start issuing the vectorized MMX instructions.
(Unaligned loads are no longer penalized on modern processors, but if you were targeting modern processors, you'd be writing SSE2 code. On the older processors where you need to run MMX code, alignment will be a big deal, and misaligned data will kill your performance.)
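To make that concrete, here is a hedged sketch (my own, not part of the original routine) of such a fix-up wrapper: it handles leading elements one at a time until the data is aligned, hands the aligned middle to the MMX routine from the question, and finishes any leftover element with scalar code. It assumes both arrays share the same misalignment; if they don't, no prologue can align both at once.
#include <cstdint>   // uintptr_t
extern "C" void __stdcall test_MMX(int *first_array, int *second_array, int length);
// Sketch only: alignment fix-up around the vectorized loop.
void AddArrays_FixupAlignment(int *first_array, int *second_array, int length)
{
    // Scalar prologue: advance one element at a time until first_array sits on
    // an 8-byte boundary (int pointers are 4-byte aligned, so at most one step).
    while (length > 0 && (reinterpret_cast<uintptr_t>(first_array) & 7) != 0)
    {
        *first_array++ += *second_array++;
        --length;
    }
    // Aligned, vectorizable middle: an even number of ints, at least one pair.
    int vector_part = length & ~1;
    if (vector_part > 0)
        test_MMX(first_array, second_array, vector_part);
    // Scalar tail: at most one leftover element.
    for (int i = vector_part; i < length; ++i)
        first_array[i] += second_array[i];
}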
Now, this inline assembly won't produce particularly efficient code. When you use inline assembly, the compiler always generates prologue and epilogue code for the function. That isn't terrible, since it's outside of the critical inner loop, but still—it's cruft you don't need. Worse, jumps in inline assembly blocks tend to confuse MSVC's inline assembler and cause it to generate sub-optimal code. It is overly cautious, preventing you from doing something that could corrupt the stack or cause other external side effects, which is nice, except that the whole reason you're writing inline assembly is (presumably) because you desire maximum performance.
(It should go without saying, but if you don't need the maximum possible performance, you should just write the code in C (or C++) and let the compiler optimize it. It does a darn good job in the majority of cases.)
If you do need the maximum possible performance, and have decided that the compiler-generated code just won't cut it, then a better alternative to inline assembly is the use of intrinsics. Intrinsics will generally map one-to-one to assembly-language instructions, but the compiler does a lot better job optimizing around them.
Here's my version of your code, using MMX intrinsics:
#include <intrin.h> // include header with MMX intrinsics
void __stdcall Function_With_Intrinsics(int *first_array, int *second_array, int length)
{
unsigned int counter = static_cast<unsigned int>(length);
counter /= 2;
do
{
*reinterpret_cast<__m64*>(first_array) = _mm_add_pi32(*reinterpret_cast<const __m64*>(first_array),
*reinterpret_cast<const __m64*>(second_array));
first_array += 2; // advance two ints = 8 bytes, matching the 64-bit load/store
second_array += 2;
} while (--counter != 0);
_mm_empty();
}
It does the same thing, but more efficiently by delegating more to the compiler's optimizer. A couple of notes:
Since your assembly code treats length as an unsigned integer, I assume that your interface requires that it actually be an unsigned integer. (And, if so, I wonder why you don't declare it as such in the function's signature.) To achieve the same effect, I've cast it to an unsigned int, which is subsequently used as the counter. (If I hadn't done that, I'd have to have either done a shift operation on a signed integer, whose result is implementation-defined for negative values, or a division by two, for which the compiler would have generated slower code to correctly deal with the sign bit.)
The *reinterpret_cast<__m64*> business scattered throughout looks scary, but is actually safe—at least, relatively speaking. That's what you're supposed to do with the MMX intrinsics. The MMX data type is __m64, which you can think of as being roughly equivalent to an mm? register. It is 64 bits in length, and loads and stores are accomplished by casting. These get translated directly into MOVQ instructions.
Your original assembly code was written such that the loop always iterated at least once, so I transformed that into a do…while loop. This means the test of the loop condition only has to be done at the bottom of the loop, rather than once at the top and once at the bottom.
The _mm_empty() intrinsic causes an EMMS instruction to be emitted.
Just for grins, let's see what the compiler transformed this into. This is the output from MSVC 16 (VS 2010), targeting x86-32 and optimizing for speed over size (though it makes no difference in this particular case):
PUBLIC ?Function_With_Intrinsics@@YGXPAH0H@Z
; Function compile flags: /Ogtpy
_first_array$ = 8 ; size = 4
_second_array$ = 12 ; size = 4
_length$ = 16 ; size = 4
?Function_With_Intrinsics@@YGXPAH0H@Z PROC
mov ecx, DWORD PTR _length$[esp-4]
mov edx, DWORD PTR _second_array$[esp-4]
mov eax, DWORD PTR _first_array$[esp-4]
shr ecx, 1
sub edx, eax
$LL3:
movq mm0, MMWORD PTR [eax]
movq mm1, MMWORD PTR [edx+eax]
paddd mm0, mm1
movq MMWORD PTR [eax], mm0
add eax, 8
dec ecx
jne SHORT $LL3
emms
ret 12
?Function_With_Intrinsics@@YGXPAH0H@Z ENDP
It is recognizably similar to your original code, but does a couple of things differently. In particular, it tracks the array pointers differently, in a way that it (and I) believe is slightly more efficient than your original code, since it does less work inside of the loop. It also breaks apart your PADDD instruction so that both of its operands are MMX registers, instead of the source being a memory operand. Again, this tends to make the code more efficient at the expense of clobbering an additional MMX register, but we've got plenty of those to spare, so it's certainly worth it.
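If it helps to see that pointer-tracking trick at the source level, here is a hedged sketch (mine, not the compiler's actual internal form) of the equivalent transformation: one moving pointer, with the second array addressed at a constant byte offset from it, mirroring the sub edx, eax / [edx+eax] addressing in the listing above.
#include <intrin.h>   // _mm_add_pi32, _mm_empty
#include <cstddef>    // ptrdiff_t
// Sketch only: the same loop written with a single induction variable.
void Function_Single_Induction(int *first_array, int *second_array, int length)
{
    unsigned int counter = static_cast<unsigned int>(length) / 2;
    const ptrdiff_t offset = reinterpret_cast<char*>(second_array)
                           - reinterpret_cast<char*>(first_array);          // edx = second - first
    do
    {
        const __m64 a = *reinterpret_cast<const __m64*>(first_array);
        const __m64 b = *reinterpret_cast<const __m64*>(
                            reinterpret_cast<char*>(first_array) + offset); // [edx+eax]
        *reinterpret_cast<__m64*>(first_array) = _mm_add_pi32(a, b);
        first_array += 2;                                                   // add eax, 8
    } while (--counter != 0);
    _mm_empty();                                                            // emms
}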
Better yet, as the optimizer improves in newer versions of the compiler, code that is written using intrinsics may get even better!
Of course, rewriting the function to use intrinsics doesn't solve the alignment problem, but I'm assuming you have already dealt with that on the caller side. If not, you'll need to add code to handle it.
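For completeness, here is a hedged sketch of what that caller-side handling could look like from native code, using MSVC's aligned-allocation helpers (the size is arbitrary; the C# caller in the question would have to arrange alignment on its side instead):
#include <malloc.h>   // _aligned_malloc / _aligned_free (MSVC CRT)
// Sketch only: allocate both arrays on 8-byte boundaries so the MMX loop never
// sees a misaligned address, then call the intrinsics version from above.
void Example_Aligned_Caller()
{
    const int length = 1024;   // arbitrary, but must be even for the length/2 loop
    int *first_array  = static_cast<int*>(_aligned_malloc(length * sizeof(int), 8));
    int *second_array = static_cast<int*>(_aligned_malloc(length * sizeof(int), 8));
    for (int i = 0; i < length; ++i) { first_array[i] = i; second_array[i] = 2 * i; }
    Function_With_Intrinsics(first_array, second_array, length);
    _aligned_free(first_array);
    _aligned_free(second_array);
}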
If you wanted to use SSE2—perhaps that would be test_SSE2 and you would dynamically delegate to the appropriate implementation depending on the current processor's feature bits—then you could do it like this:
#include <intrin.h> // include header with SSE2 intrinsics
void __stdcall Function_With_Intrinsics_SSE2(int *first_array, int *second_array, int length)
{
unsigned int counter = static_cast<unsigned>(length);
counter /= 4;
do
{
_mm_storeu_si128(reinterpret_cast<__m128i*>(first_array),
_mm_add_epi32(_mm_loadu_si128(reinterpret_cast<const __m128i*>(first_array)),
_mm_loadu_si128(reinterpret_cast<const __m128i*>(second_array))));
first_array += 4; // advance four ints = 16 bytes, matching the 128-bit load/store
second_array += 4;
} while (--counter != 0);
}
I've written this code not assuming alignment, so it will work when the loads and stores are misaligned. For maximum speed on many older architectures, SSE2 requires 16-byte alignment, and if you can guarantee that the source and destination pointers are thusly aligned, you can use slightly faster instructions (e.g., MOVDQA as opposed to MOVDQU). As mentioned above, on newer architectures (at least Sandy Bridge and later, perhaps earlier), it doesn't matter.
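As for the dynamic delegation mentioned above, a hedged sketch of the feature-bit check with MSVC's __cpuid intrinsic might look like this (the dispatcher itself is hypothetical; only the two worker functions come from this answer):
#include <intrin.h>   // __cpuid
// Sketch only: CPUID leaf 1 reports SSE2 in EDX bit 26 and MMX in EDX bit 23.
void Add_Arrays_Dispatch(int *first_array, int *second_array, int length)
{
    int regs[4];                       // EAX, EBX, ECX, EDX
    __cpuid(regs, 1);
    if (regs[3] & (1 << 26))           // SSE2 available
        Function_With_Intrinsics_SSE2(first_array, second_array, length);
    else if (regs[3] & (1 << 23))      // fall back to MMX
        Function_With_Intrinsics(first_array, second_array, length);
    else
        for (int i = 0; i < length; ++i)   // last-resort scalar loop
            first_array[i] += second_array[i];
}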
To give you an idea of how SSE2 is basically just a drop-in replacement for MMX on Pentium 4 and later, except that you also get to do operations that are twice as wide, look at the code this compiles to:
PUBLIC ?Function_With_Intrinsics_SSE2@@YGXPAH0H@Z
; Function compile flags: /Ogtpy
_first_array$ = 8 ; size = 4
_second_array$ = 12 ; size = 4
_length$ = 16 ; size = 4
?Function_With_Intrinsics_SSE2@@YGXPAH0H@Z PROC
mov ecx, DWORD PTR _length$[esp-4]
mov edx, DWORD PTR _second_array$[esp-4]
mov eax, DWORD PTR _first_array$[esp-4]
shr ecx, 2
sub edx, eax
$LL3:
movdqu xmm0, XMMWORD PTR [eax]
movdqu xmm1, XMMWORD PTR [edx+eax]
paddd xmm0, xmm1
movdqu XMMWORD PTR [eax], xmm0
add eax, 16
dec ecx
jne SHORT $LL3
ret 12
?Function_With_Intrinsics_SSE2@@YGXPAH0H@Z ENDP
As for the final question about getting negative values from the .NET Stopwatch class, I would normally guess that would be due to an overflow. In other words, your code executed too slowly, and the timer wrapped around. Kevin Gosse pointed out, though, that this is apparently a bug in the implementation of the Stopwatch class. I don't know much more about it, since I don't really use it. If you want a good microbenchmarking library, I use and recommend Google Benchmark. However, it is for C++, not C#.
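If you do end up benchmarking the native routine with Google Benchmark, a minimal hedged sketch might look like this (the array size is arbitrary; test_MMX is the exported function from the question):
#include <benchmark/benchmark.h>
#include <vector>
extern "C" void __stdcall test_MMX(int *first_array, int *second_array, int length);
// Sketch only: the framework repeats the timed loop until the timing is stable.
static void BM_test_MMX(benchmark::State& state)
{
    std::vector<int> first(1024, 1), second(1024, 2);
    for (auto _ : state)
    {
        test_MMX(first.data(), second.data(), static_cast<int>(first.size()));
        benchmark::ClobberMemory();    // keep the stores from being optimized away
    }
}
BENCHMARK(BM_test_MMX);
BENCHMARK_MAIN();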
While you're benchmarking, definitely take the time to time the code generated by the compiler when you write it the naïve way. Say, something like:
void Naive_PackedAdd(int *first_array, int *second_array, int length)
{
for (unsigned int i = 0; i < static_cast<unsigned int>(length); ++i)
{
first_array[i] += second_array[i];
}
}
You just might be pleasantly surprised at how fast the code is after the compiler gets finished auto-vectorizing the loop. :-) Remember that less code does not necessarily mean faster code. All of that extra code is required to deal with alignment issues, which I've diplomatically skirted throughout this answer. If you look at the compiler's output for this function, you'll find an inner loop (at $LL4@Naive_Pack) very similar to what we've been considering here.

Related

Can the compiler/JIT optimize away short-circuit evaluation if there are no side-effects?

I have a test which goes:
if(variable==SOME_CONSTANT || variable==OTHER_CONSTANT)
In these circumstances, on a platform where branching over the second test would take more cycles than simply doing it, would the optimizer be allowed to treat the || as a simple |?
Yes, that is permitted, and in fact the C# compiler will perform this optimization in some cases on && and ||, reducing them to & and |. As you note, there must be no side effects of evaluating the right side.
Consult the compiler source code for the exact details of when the optimization is generated.
The compiler will also perform that optimization when the logical operation involves lifted-to-nullable operands. Consider for example
int? z = x + y;
where x and y are also nullable ints; this will be generated as
int? z;
int? temp1 = x;
int? temp2 = y;
z = temp1.HasValue & temp2.HasValue ?
new int?(temp1.GetValueOrDefault() + temp2.GetValueOrDefault()) :
new int?();
Note that it's & and not &&. I knew that calling HasValue is so fast that it would not be worth the extra branching logic to avoid it.
If you're interested in how I wrote the nullable arithmetic optimizer, I've written a detailed explanation of it here: https://ericlippert.com/2012/12/20/nullable-micro-optimizations-part-one/
Yes, the compiler can make that optimization. Indeed, every language of interest generally has an explicit or implicit "as if" clause that permits such non-observable optimizations without needing a specific rule for them. This allows it to implement the checks in a non-shortcut manner, in addition to a whole host of more extreme optimizations, such as combining multiple conditions into one, eliminating the check entirely, implementing the check without any branch at all using predicated instructions, etc.
The other side, however, is that the specific optimization you mention of unconditionally performing the second check isn't performed very often on most common platforms, because on many instruction sets the branching approach is the fastest, if you assume it doesn't change the predictability of the branch. For example, on x86, you can use cmp to compare a variable to a known value (as in your example), but the "result" ends up in the EFLAGS register (of which there is only one, architecturally). How do you implement the || in that case between the two comparison results? The second comparison will overwrite the flag set by the first, so you'll be stuck saving the flag somewhere, then doing the second comparison, and then trying to combine the flags somehow just so you can do your single test¹.
The truth is, ignoring prediction, the conditional branch is often almost free, especially when the compiler organizes it to be "not taken". For example, on x86, your condition could look like two cmp operations, each immediately followed by a jump over the code in the if() block. So just two branch instructions versus the hoops you'd have to jump through to reduce it to one. Going further - these cmp and subsequent branches often macro-fuse into a single operation that has about the same cost as the comparison alone (and takes a single cycle). There are various caveats, but the overall assumption that "branching over the second test" will take much time is probably not well founded.
The main caveat is branch prediction. In the case that each individual clause is unpredictable, but where the whole condition is predictable, combining everything into a single branch can be very profitable. Imagine, for example, that in your (variable==SOME_CONSTANT || variable==OTHER_CONSTANT) that variable was equal to SOME_CONSTANT 50% of the time, and OTHER_CONSTANT 49% of the time. The if will thus be taken 99% of the time, but the first check variable==SOME_CONSTANT will be totally unpredictable: branching exactly half the time! In this case it would be a great idea to combine the checks, even at some cost, since the misprediction is expensive.
Now there are certain cases where the compiler can combine checks together simply due to the form of the check. Peter shows a range-check-like example in his answer, and there are others.
Here's an interesting one I stumbled across where your SOME_CONSTANT is 2 and OTHER_CONSTANT is 4:
void test(int a) {
if (a == 2 || a == 4) {
call();
}
}
Both clang and icc implement this as a series of two checks and two branches, but recent gcc uses another trick:
test(int):
sub edi, 2
and edi, -3
je .L4
rep ret
.L4:
jmp call()
Essentially it subtracts 2 from a and then checks if any bit other than 0b10 is set. The values 2 and 4 are the only values accepted by that check. Interesting transformation! It's not that much better than the two branch approach, for predictable inputs, but for the unpredictable clauses but predictable final outcome case it will be a big win.
This isn't really a case of doing both checks unconditionally however: just a clever case of being able to combine multiple checks into fewer, possibly with a bit of math. So I don't know if it meets your criteria for a "yes, they actually do in practice" answer. Perhaps compilers do make this optimization, but I haven't seen it on x86. If it exists there it might only be triggered by profile-guided optimization, where the compiler has an idea of the probability of various clauses.
¹ On platforms with fast cmov, two cmovs to implement || is probably not a terrible choice, and && can be implemented similarly.
Compilers are allowed to optimize short-circuit comparisons into asm that isn't two separate test & branch. But sometimes it's not profitable (especially on x86 where compare-into-register takes multiple instructions), and sometimes compilers miss the optimization.
Or if compilers choose to make branchless code using a conditional-move, both conditions are always evaluated. (This is of course only an option when there are no side-effects).
One special case is range-checks: compilers can transform x > min && x < max (especially when min and max are compile-time constants) into a single check. This can be done with 2 instructions instead of branching on each condition separately. Subtracting the low end of the range will wrap to a large unsigned number if the input was lower, so a subtract + unsigned-compare gives you a range check.
The range-check optimization is easy / well-known (by compiler developers), so I'd assume C# JIT and ahead-of-time compilers would do it, too.
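The same trick can be written by hand at the source level; a sketch, using the 10 < x && x < 100 range from the example below:
// Sketch only: one subtraction plus one unsigned compare replaces two signed
// compares. 10 < x && x < 100 means x is in [11, 99], i.e. x - 11 is in [0, 88].
// Doing the subtraction in unsigned avoids signed overflow and makes values
// below 11 wrap to huge numbers that fail the compare.
int in_range(int x)
{
    return (unsigned)x - 11u < 89u;
}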
To take a C example (which has the same short-circuit evaluation rules as C#):
int foo(int x, int a, int b) {
if (10 < x && x < 100) {
return a;
}
return b;
}
Compiled (with gcc7.3 -O3 for the x86-64 Windows ABI, on the Godbolt compiler explorer. You can see output for ICC, clang, or MSVC; or for gcc on ARM, MIPS, etc.):
foo(int, int, int):
sub ecx, 11 # x-11
mov eax, edx # retval = a;
cmp ecx, 89
cmovnb eax, r8d # retval = (x-11U) < 89U ? retval : b;
ret
So the function is branchless, using cmov (conditional mov). @HansPassant says .NET's compiler only tends to do this for assignment operations, so maybe you'd only get that asm if you wrote it in the C# source as retval = (10 < x && x < 100) ? a : b;.
Or to take a branching example, we get the same optimization of the range check into a sub and then an unsigned compare/branch instead of compare/cmov.
int ext(void);
int bar(int x) {
if (10 < x && x < 100) {
return ext();
}
return 0;
}
# gcc -O3
sub ecx, 11
cmp ecx, 88
jbe .L7 # jump if ((unsigned)x-11U) <= 88U
xor eax, eax # return 0;
ret
.L7:
jmp ext() # tailcall ext()
IDK if existing C# implementations make this optimization the same way, but it's easy and valid for all possible inputs, so they should.
Godbolt doesn't have a C# compiler; if there is a convenient online C# compiler that shows you the asm, it would be interesting to try these functions there. (I think they're valid C# syntax as well as valid C and valid C++).
Other cases
Some cases other than range-checks can be profitable to optimize into a single branch or cmov on multiple conditions. x86 can't compare into a register very efficiently (xor-zero / cmp / setcc), but in some cases you only need 0 / non-zero instead of a 0 / 1 boolean to combine later. x86's OR instruction sets flags, so you can or / jnz to jump if either register was non-zero. (But note that saving the test reg,reg before a jcc only saves code-size; macro-fusion works for test/jcc but not or/jcc, so or/test/jcc is the same number of uops as or/jcc. It saves a uop with cmovcc or setcc, though.)
If branches predict perfectly, two cmp / jcc are probably still cheapest (because of macro-fusion: cmp / jne is a single uop on recent CPUs), but if not then two conditions together may well predict better, or be better with CMOV.
int foo(int x, int a, int b) {
if ((a-10) || (x!=5)) {
return a;
}
return b;
}
On Godbolt with gcc7.3, clang5.0, ICC18, and MSVC CL19
gcc compiles it the obvious way, with 2 branches and a couple mov instructions. clang5.0 spots the opportunity to transform it:
# compiled for the x86-64 System V ABI this time: args in edi=x, esi=a, edx=b
mov eax, esi
xor eax, 10
xor edi, 5
or edi, eax # flags set from edi=(a^10) | (x^5)
cmovne edx, esi # edx = (edi!=0) ? a : b
mov eax, edx # return edx
ret
Other compilers need some hand-holding if you want them to emit code like this. (And clang could use the same help to realize that it can use lea to copy-and-subtract instead of needing a mov before xor to avoid destroying an input that's needed later).
int should_optimize_to(int x, int a, int b) {
// x!=10 fools compilers into missing the optimization
if ((a-10) | (x-5)) {
return a;
}
return b;
}
gcc, clang, msvc, and ICC all compile this to basically the same thing:
# gcc7.3 -O3
lea eax, [rsi-10] # eax = a-10
sub edi, 5 # x-=5
or eax, edi # set flags
mov eax, edx
cmovne eax, esi
ret
This is smarter than clang's code: putting the mov to eax before the cmov creates instruction-level parallelism. If mov has non-zero latency, that latency can happen in parallel with the latency of creating the flag input for cmov.
If you want this kind of optimization, you usually have to hand-hold compilers toward it.

How to display the overclocked (actual) CPU frequency in C# [duplicate]

I'm trying to make a C# software that reads information about the CPU and displays them to the user (just like CPU-Z).
My current problem is that I've failed to find a way to display the CPU frequency.
At first I tried the easy way, using the Win32_Processor class. It proved very effective, except when the CPU is overclocked (or underclocked).
Then I discovered that my registry contains, at HKLM\HARDWARE\DESCRIPTION\System\CentralProcessor\0, the "standard" clock of the CPU (even when overclocked). The problem is that in modern CPUs the core multiplier decreases when the CPU does not need its full power, so the CPU frequency also changes, but the value in the registry remains the same.
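For reference, reading that registry value from native code is straightforward; a hedged sketch (it only returns the rated "~MHz" value, which is exactly the limitation described above):
#include <windows.h>
// Sketch only: read the nominal clock that Windows recorded for core 0.
DWORD ReadNominalMHzFromRegistry()
{
    HKEY key;
    DWORD mhz = 0, size = sizeof(mhz);
    if (RegOpenKeyExA(HKEY_LOCAL_MACHINE,
                      "HARDWARE\\DESCRIPTION\\System\\CentralProcessor\\0",
                      0, KEY_READ, &key) == ERROR_SUCCESS)
    {
        RegQueryValueExA(key, "~MHz", nullptr, nullptr,
                         reinterpret_cast<LPBYTE>(&mhz), &size);
        RegCloseKey(key);
    }
    return mhz;   // 0 if the key or value could not be read
}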
My next step was to try using RDTSC to actually calculate the CPU frequency. I used C++ for this because I can embed it in a C# project if the method works. I found the following code at http://www.codeproject.com/Articles/7340/Get-the-Processor-Speed-in-two-simple-ways,
but the problem was the same: the program gives me only the maximum frequency (like the registry value, within 1-2 MHz) and it also looks like it loads the CPU more than it should (I even had CPU load spikes).
#include "stdafx.h"
#include <windows.h>
#include <cstdlib>
#include "intrin.h"
#include <WinError.h>
#include <winnt.h>
float ProcSpeedCalc() {
#define RdTSC __asm _emit 0x0f __asm _emit 0x31
// variables for the clock-cycles:
__int64 cyclesStart = 0, cyclesStop = 0;
// variables for the High-Res Performance Counter:
unsigned __int64 nCtr = 0, nFreq = 0, nCtrStop = 0;
// retrieve performance-counter frequency per second:
if(!QueryPerformanceFrequency((LARGE_INTEGER *) &nFreq))
return 0;
// retrieve the current value of the performance counter:
QueryPerformanceCounter((LARGE_INTEGER *) &nCtrStop);
// add the frequency to the counter-value:
nCtrStop += nFreq;
_asm
{// retrieve the clock-cycles for the start value:
RdTSC
mov DWORD PTR cyclesStart, eax
mov DWORD PTR [cyclesStart + 4], edx
}
do{
// retrieve the value of the performance counter
// until 1 sec has gone by:
QueryPerformanceCounter((LARGE_INTEGER *) &nCtr);
}while (nCtr < nCtrStop);
_asm
{// retrieve again the clock-cycles after 1 sec. has gone by:
RdTSC
mov DWORD PTR cyclesStop, eax
mov DWORD PTR [cyclesStop + 4], edx
}
// stop - start is the speed in Hz; dividing by 1,000,000 gives the speed in MHz
return ((float)cyclesStop-(float)cyclesStart) / 1000000;
}
int _tmain(int argc, _TCHAR* argv[])
{
while(true)
{
printf("CPU frequency = %f\n",ProcSpeedCalc());
Sleep(1000);
}
return 0;
}
I should also mention that I've tested the last method on an AMD CPU.
I've also tried some other code for the RDTSC method, but none of it worked correctly.
Finally, I've tried to understand the code used to make this program https://code.google.com/p/open-hardware-monitor/source/browse/ , but it was much too complex for me.
So, my question is: how do I determine the CPU frequency in real time (even when the CPU is overclocked) using C++ or C#? I know that this question has been asked a lot of times, but none of the answers actually addresses my problem.
Yes, that code sits and busy-waits for an entire second, which causes that core to be 100% busy for a second. One second is more than enough time for dynamic clocking algorithms to detect load and kick the CPU frequency up out of power-saving states. I wouldn't be surprised if processors with boost actually show you a frequency above the labelled frequency.
The concept isn't bad, however. What you have to do is sleep for an interval of about one second. Then, instead of assuming the RDTSC invocations were exactly one second apart, divide by the actual time indicated by QueryPerformanceCounter.
Also, I recommend checking RDTSC both before and after the QueryPerformanceCounter call, to detect whether there was a context switch between RDTSC and QueryPerformanceCounter which would mess up your results.
Unfortunately, RDTSC on new processors doesn't actually count CPU clock cycles. So this doesn't reflect the dynamically changing CPU clock rate (it does measure the nominal rate without busy-waiting, though, so it is a big improvement over the code provided in the question).
Bruce Dawson explained this in a blog post.
So it looks like you'd need to access model-specific registers after all, which can't be done from user mode. The OpenHardwareMonitor project has both a driver that can be used and code for the frequency calculations.
float ProcSpeedCalc()
{
/*
RdTSC:
It's the Pentium instruction "ReaD Time Stamp Counter". It measures the
number of clock cycles that have passed since the processor was reset, as a
64-bit number. That's what the <CODE>_emit</CODE> lines do.
*/
// Microsoft inline assembler knows the rdtsc instruction. No need for emit.
// variables for the CPU cycle counter (unknown rate):
__int64 tscBefore, tscAfter, tscCheck;
// variables for the Performance Counter (steady, known rate):
LARGE_INTEGER hpetFreq, hpetBefore, hpetAfter;
// retrieve performance-counter frequency per second:
if (!QueryPerformanceFrequency(&hpetFreq)) return 0;
int retryLimit = 10;
do {
// read CPU cycle count
_asm
{
rdtsc
mov DWORD PTR tscBefore, eax
mov DWORD PTR [tscBefore + 4], edx
}
// retrieve the current value of the performance counter:
QueryPerformanceCounter(&hpetBefore);
// read CPU cycle count again, to detect context switch
_asm
{
rdtsc
mov DWORD PTR tscCheck, eax
mov DWORD PTR [tscCheck + 4], edx
}
} while ((tscCheck - tscBefore) > 800 && (--retryLimit) > 0);
Sleep(1000);
do {
// read CPU cycle count
_asm
{
rdtsc
mov DWORD PTR tscAfter, eax
mov DWORD PTR [tscAfter + 4], edx
}
// retrieve the current value of the performance counter:
QueryPerformanceCounter(&hpetAfter);
// read CPU cycle count again, to detect context switch
_asm
{
rdtsc
mov DWORD PTR tscCheck, eax
mov DWORD PTR [tscCheck + 4], edx
}
} while ((tscCheck - tscAfter) > 800 && (--retryLimit) > 0);
// (TSC delta / QPC delta) * QPC frequency = TSC rate in Hz; divide by 1,000,000 for MHz
return (double)(tscAfter - tscBefore) / (double)(hpetAfter.QuadPart - hpetBefore.QuadPart) * (double)hpetFreq.QuadPart / 1.0e6;
}
Most compilers provide an __rdtsc() intrinsic, in which case you could use tscBefore = __rdtsc(); instead of the __asm block. Both methods are platform- and compiler-specific, unfortunately.
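For example, a hedged sketch of the same measurement written with the intrinsic (no inline assembly, so it also builds for x64; the context-switch retry check from the code above is omitted for brevity):
#include <windows.h>
#include <intrin.h>   // __rdtsc
// Sketch only: TSC ticks per QueryPerformanceCounter-second, reported in MHz.
double MeasureTscMHz()
{
    LARGE_INTEGER freq, before, after;
    if (!QueryPerformanceFrequency(&freq)) return 0.0;

    unsigned __int64 tscBefore = __rdtsc();
    QueryPerformanceCounter(&before);

    Sleep(1000);   // idle instead of busy-waiting, so the clock isn't pushed up

    unsigned __int64 tscAfter = __rdtsc();
    QueryPerformanceCounter(&after);

    double seconds = double(after.QuadPart - before.QuadPart) / double(freq.QuadPart);
    return double(tscAfter - tscBefore) / seconds / 1.0e6;
}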
The answer depends on what you really want to know.
If your goal is to find the operating frequency of some particular application that you are currently running then this is a hard problem which requires administrator/root privileges to access model specific registers and maybe even access to the BIOS. You can do this with CPU-Z on Windows or powertop on Linux.
However, if you just want to know the operating frequency of your processor for one or many threads under load so that you could for example calculate the peak flops (which is why I care about this) then this can be done with more or less general code which does not need administrator privileges.
I got the idea from the code by Bruce Dawson at http://randomascii.wordpress.com/2013/08/06/defective-heat-sinks-causing-garbage-gaming/. I mostly extended his code to work with multiple threads using OpenMP.
I have tested this on Linux and Windows on Intel processors including Nehalem, Ivy Bridge, and Haswell, with one socket up to four sockets (40 threads). The results all deviate less than 0.5% from the correct answer.
I described how to determine the frequency in how-can-i-programmatically-find-the-cpu-frequency-with-c, so I won't repeat all the details.
Your question is fundamentally unanswerable. CPU frequencies change constantly. Sometimes the OS knows about the changes and can tell you, but sometimes it does not. CPUs may overclock themselves (TurboBoost) or underclock themselves (due to overheating). Some processors reduce power to avoid melting by running the clock at the same rate but only doing work on some cycles, at which point the entire concept of clock frequency is meaningless.
In this post I talk about a significant number of machines that I analyze where the CPU was being thermally throttled without Windows noticing.
http://randomascii.wordpress.com/2013/08/06/defective-heat-sinks-causing-garbage-gaming/
It is possible to write some messy code that is very processor specific to detect this but it requires administrator privileges.
My point is that you are asking an unanswerable question and, in most cases, it is not a question that you should be asking. Use the value in the registry, or ask Windows what frequency it thinks the CPU is running at (see PROCESSOR_POWER_INFORMATION) and call that good enough.
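A hedged sketch of that "good enough" approach from native code (the struct layout follows the PROCESSOR_POWER_INFORMATION documentation; it is typically not in the user-mode SDK headers, so samples declare it themselves, and this reports what Windows believes, not a measured value):
#include <windows.h>
#include <powrprof.h>   // CallNtPowerInformation; link with PowrProf.lib
#include <cstdio>
#include <vector>
// Layout from the PROCESSOR_POWER_INFORMATION documentation.
typedef struct _PROCESSOR_POWER_INFORMATION {
    ULONG Number;
    ULONG MaxMhz;
    ULONG CurrentMhz;
    ULONG MhzLimit;
    ULONG MaxIdleState;
    ULONG CurrentIdleState;
} PROCESSOR_POWER_INFORMATION;
// Sketch only: print the frequency Windows reports for each logical processor.
void PrintWindowsReportedMhz()
{
    SYSTEM_INFO si;
    GetSystemInfo(&si);
    std::vector<PROCESSOR_POWER_INFORMATION> info(si.dwNumberOfProcessors);
    ULONG bytes = static_cast<ULONG>(info.size() * sizeof(info[0]));
    if (CallNtPowerInformation(ProcessorInformation, nullptr, 0, info.data(), bytes) == 0)
    {
        for (const auto& p : info)
            printf("CPU %lu: %lu MHz (max %lu MHz)\n", p.Number, p.CurrentMhz, p.MaxMhz);
    }
}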

Why does .NET Native compile loop in reverse order?

I'm working on optimization techniques performed by the .NET Native compiler.
I've created a sample loop:
for (int i = 0; i < 100; i++)
{
Function();
}
And I've compiled it with .NET Native. Then I disassembled the resulting .dll (with the machine code inside) in IDA. As the result, I have:
(I've removed a few unnecessary lines, so don't worry that the address lines are inconsistent.)
I understand that add esi, 0FFFFFFFFh effectively subtracts one from esi and updates the Zero Flag accordingly, so we can jump to the beginning if zero hasn't been reached yet.
What I don't understand is why the compiler reversed the loop.
I came to the conclusion that
LOOP:
add esi, 0FFFFFFFFh
jnz LOOP
is just faster than for example
LOOP:
inc esi
cmp esi, 064h
jl LOOP
But is it really because of that and is the speed difference really significant?
inc might be slower than add because of the partial flag update. Moreover, because the reversed loop counts down to zero, the add itself sets the zero flag when the counter is exhausted, so you don't need another cmp instruction; you can jump directly.
This is one well-known type of loop optimization, loop reversal:
reversal: Loop reversal reverses the order in which values are assigned to the index variable. This is a subtle optimization which can help eliminate dependencies and thus enable other optimizations. Also, certain architectures utilize looping constructs at Assembly language level that count in a single direction only (e.g. decrement-jump-if-not-zero (DJNZ)).
Is it faster to count down than it is to count up?
GCC Loop optimization
You can see the result for other compilers here.
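At the source level, the reversed form corresponds to something like this C++ sketch; it is only a legal transformation here because the loop body never reads the index, which is exactly the "eliminate dependencies" point in the quote above:
// Sketch only: counting down to zero lets the decrement itself set the zero
// flag, so the compiler can emit dec/jnz (or add -1 / jnz) with no separate cmp.
void CallNTimes(int n, void (*fn)())
{
    for (int i = n; i != 0; --i)   // i is only a trip counter, never used in the body
        fn();
}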
Your conclusion is correct: the reversed loop counts down to 0 (the loop ends when the register value reaches 0), so the add sets the zero flag that the conditional branch then uses.
This way you don't need a dedicated cmp, which 1) makes the code smaller and 2) is also faster (a conclusion drawn from the compiler writers' decision and the other answer).
That's a pretty common assembler trick: write the loop so it counts toward 0. I am surprised you understand the assembly but haven't come across it.

Will Thread.SpinWait be inlined when called?

I have the following code:
while(flag)
{
Thread.SpinWait(1);
}
The following is the implementation of SpinWait in Rotor (sscli20\clr\src\vm\comsynchronizable.cpp):
FCIMPL1(void, ThreadNative::SpinWait, int iterations)
{
WRAPPER_CONTRACT;
STATIC_CONTRACT_SO_TOLERANT;
for(int i = 0; i < iterations; i++)
YieldProcessor();
}
FCIMPLEND
Will Thread.SpinWait be inlined when called?
If not, each loop iteration will spend extra time on stack operations (push and pop) and consume more of the CPU's execution resources.
If yes, how does the CLR accomplish that, given that ThreadNative::SpinWait is implemented as a standard function instruction sequence, including stack operations (push and pop)?
Based on Eren's testing, no inlining occurs in debug mode. Is it possible for the CLR to optimize this and produce inlined code?
Summary: thanks for your answers. I wish that one day the CLR could inline pre-compiled code through a mechanism such as MethodImplOptions.InternalCall. Then it could eliminate the stack operations and spend most of its time on checking the flag and spin-waiting (consuming fewer CPU resources than nop).
Best to try and see. Sample code:
static void Main(string[] args)
{
while (true)
Thread.SpinWait(1);
}
The optimized disassembly shows:
x86:
00000000 push ebp
00000001 mov ebp,esp
00000003 mov ecx,1
00000008 call 6F11D3FE
0000000d jmp 00000003
x64:
00000000 sub rsp,28h
00000004 mov ecx,1
00000009 call 000000005F815434
0000000e jmp 0000000000000004
00000010 add rsp,28h
00000014 ret
So there is no inlining in either case.
Maybe I'm missing something but I don't quite understand why you care about the stack operations as spinning the CPU consumes cycles anyway (the whole purpose is to not yield).
No, the jitter is not capable of inlining pre-compiled C++ code, only managed code that started as IL.
This is entirely irrelevant for a SpinWait() call. The point of spin-waiting is to have the processor execute code rather than paying the cost of a thread-context switch. With the expectation that flag will turn false in 10,000 cpu cycles or less. It doesn't matter what kind of code. CALL is a fine way to execute code.

Try-catch speeding up my code?

I wrote some code to test the impact of try-catch, but I'm seeing some surprising results.
static void Main(string[] args)
{
Thread.CurrentThread.Priority = ThreadPriority.Highest;
Process.GetCurrentProcess().PriorityClass = ProcessPriorityClass.RealTime;
long start = 0, stop = 0, elapsed = 0;
double avg = 0.0;
long temp = Fibo(1);
for (int i = 1; i < 100000000; i++)
{
start = Stopwatch.GetTimestamp();
temp = Fibo(100);
stop = Stopwatch.GetTimestamp();
elapsed = stop - start;
avg = avg + ((double)elapsed - avg) / i;
}
Console.WriteLine("Elapsed: " + avg);
Console.ReadKey();
}
static long Fibo(int n)
{
long n1 = 0, n2 = 1, fibo = 0;
n++;
for (int i = 1; i < n; i++)
{
n1 = n2;
n2 = fibo;
fibo = n1 + n2;
}
return fibo;
}
On my computer, this consistently prints out a value around 0.96..
When I wrap the for loop inside Fibo() with a try-catch block like this:
static long Fibo(int n)
{
long n1 = 0, n2 = 1, fibo = 0;
n++;
try
{
for (int i = 1; i < n; i++)
{
n1 = n2;
n2 = fibo;
fibo = n1 + n2;
}
}
catch {}
return fibo;
}
Now it consistently prints out 0.69... -- it actually runs faster! But why?
Note: I compiled this using the Release configuration and directly ran the EXE file (outside Visual Studio).
EDIT: Jon Skeet's excellent analysis shows that try-catch is somehow causing the x86 CLR to use the CPU registers in a more favorable way in this specific case (and I think we're yet to understand why). I confirmed Jon's finding that the x64 CLR doesn't have this difference, and that it was faster than the x86 CLR. I also tested using int types inside the Fibo method instead of long types, and then the x86 CLR was as fast as the x64 CLR.
UPDATE: It looks like this issue has been fixed by Roslyn. Same machine, same CLR version -- the issue remains as above when compiled with VS 2013, but the problem goes away when compiled with VS 2015.
One of the Roslyn engineers who specializes in understanding optimization of stack usage took a look at this and reports to me that there seems to be a problem in the interaction between the way the C# compiler generates local variable stores and the way the JIT compiler does register scheduling in the corresponding x86 code. The result is suboptimal code generation on the loads and stores of the locals.
For some reason unclear to all of us, the problematic code generation path is avoided when the JITter knows that the block is in a try-protected region.
This is pretty weird. We'll follow up with the JITter team and see whether we can get a bug entered so that they can fix this.
Also, we are working on improvements for Roslyn to the C# and VB compilers' algorithms for determining when locals can be made "ephemeral" -- that is, just pushed and popped on the stack, rather than allocated a specific location on the stack for the duration of the activation. We believe that the JITter will be able to do a better job of register allocation and whatnot if we give it better hints about when locals can be made "dead" earlier.
Thanks for bringing this to our attention, and apologies for the odd behaviour.
Well, the way you're timing things looks pretty nasty to me. It would be much more sensible to just time the whole loop:
var stopwatch = Stopwatch.StartNew();
for (int i = 1; i < 100000000; i++)
{
Fibo(100);
}
stopwatch.Stop();
Console.WriteLine("Elapsed time: {0}", stopwatch.Elapsed);
That way you're not at the mercy of tiny timings, floating point arithmetic and accumulated error.
Having made that change, see whether the "non-catch" version is still slower than the "catch" version.
EDIT: Okay, I've tried it myself - and I'm seeing the same result. Very odd. I wondered whether the try/catch was disabling some bad inlining, but using [MethodImpl(MethodImplOptions.NoInlining)] instead didn't help...
Basically you'll need to look at the optimized JITted code under cordbg, I suspect...
EDIT: A few more bits of information:
Putting the try/catch around just the n++; line still improves performance, but not by as much as putting it around the whole block
If you catch a specific exception (ArgumentException in my tests) it's still fast
If you print the exception in the catch block it's still fast
If you rethrow the exception in the catch block it's slow again
If you use a finally block instead of a catch block it's slow again
If you use a finally block as well as a catch block, it's fast
Weird...
EDIT: Okay, we have disassembly...
This is using the C# 2 compiler and .NET 2 (32-bit) CLR, disassembling with mdbg (as I don't have cordbg on my machine). I still see the same performance effects, even under the debugger. The fast version uses a try block around everything between the variable declarations and the return statement, with just a catch{} handler. Obviously the slow version is the same except without the try/catch. The calling code (i.e. Main) is the same in both cases, and has the same assembly representation (so it's not an inlining issue).
Disassembled code for fast version:
[0000] push ebp
[0001] mov ebp,esp
[0003] push edi
[0004] push esi
[0005] push ebx
[0006] sub esp,1Ch
[0009] xor eax,eax
[000b] mov dword ptr [ebp-20h],eax
[000e] mov dword ptr [ebp-1Ch],eax
[0011] mov dword ptr [ebp-18h],eax
[0014] mov dword ptr [ebp-14h],eax
[0017] xor eax,eax
[0019] mov dword ptr [ebp-18h],eax
*[001c] mov esi,1
[0021] xor edi,edi
[0023] mov dword ptr [ebp-28h],1
[002a] mov dword ptr [ebp-24h],0
[0031] inc ecx
[0032] mov ebx,2
[0037] cmp ecx,2
[003a] jle 00000024
[003c] mov eax,esi
[003e] mov edx,edi
[0040] mov esi,dword ptr [ebp-28h]
[0043] mov edi,dword ptr [ebp-24h]
[0046] add eax,dword ptr [ebp-28h]
[0049] adc edx,dword ptr [ebp-24h]
[004c] mov dword ptr [ebp-28h],eax
[004f] mov dword ptr [ebp-24h],edx
[0052] inc ebx
[0053] cmp ebx,ecx
[0055] jl FFFFFFE7
[0057] jmp 00000007
[0059] call 64571ACB
[005e] mov eax,dword ptr [ebp-28h]
[0061] mov edx,dword ptr [ebp-24h]
[0064] lea esp,[ebp-0Ch]
[0067] pop ebx
[0068] pop esi
[0069] pop edi
[006a] pop ebp
[006b] ret
Disassembled code for slow version:
[0000] push ebp
[0001] mov ebp,esp
[0003] push esi
[0004] sub esp,18h
*[0007] mov dword ptr [ebp-14h],1
[000e] mov dword ptr [ebp-10h],0
[0015] mov dword ptr [ebp-1Ch],1
[001c] mov dword ptr [ebp-18h],0
[0023] inc ecx
[0024] mov esi,2
[0029] cmp ecx,2
[002c] jle 00000031
[002e] mov eax,dword ptr [ebp-14h]
[0031] mov edx,dword ptr [ebp-10h]
[0034] mov dword ptr [ebp-0Ch],eax
[0037] mov dword ptr [ebp-8],edx
[003a] mov eax,dword ptr [ebp-1Ch]
[003d] mov edx,dword ptr [ebp-18h]
[0040] mov dword ptr [ebp-14h],eax
[0043] mov dword ptr [ebp-10h],edx
[0046] mov eax,dword ptr [ebp-0Ch]
[0049] mov edx,dword ptr [ebp-8]
[004c] add eax,dword ptr [ebp-1Ch]
[004f] adc edx,dword ptr [ebp-18h]
[0052] mov dword ptr [ebp-1Ch],eax
[0055] mov dword ptr [ebp-18h],edx
[0058] inc esi
[0059] cmp esi,ecx
[005b] jl FFFFFFD3
[005d] mov eax,dword ptr [ebp-1Ch]
[0060] mov edx,dword ptr [ebp-18h]
[0063] lea esp,[ebp-4]
[0066] pop esi
[0067] pop ebp
[0068] ret
In each case the * shows where the debugger entered in a simple "step-into".
EDIT: Okay, I've now looked through the code and I think I can see how each version works... and I believe the slower version is slower because it uses fewer registers and more stack space. For small values of n that's possibly faster - but when the loop takes up the bulk of the time, it's slower.
Possibly the try/catch block forces more registers to be saved and restored, so the JIT uses those for the loop as well... which happens to improve the performance overall. It's not clear whether it's a reasonable decision for the JIT to not use as many registers in the "normal" code.
EDIT: Just tried this on my x64 machine. The x64 CLR is much faster (about 3-4 times faster) than the x86 CLR on this code, and under x64 the try/catch block doesn't make a noticeable difference.
Jon's disassemblies show that the difference between the two versions is that the fast version uses a pair of registers (esi, edi) to store one of the local variables, whereas the slow version doesn't.
The JIT compiler makes different assumptions regarding register use for code that contains a try-catch block vs. code which doesn't. This causes it to make different register allocation choices. In this case, this favors the code with the try-catch block. Different code may lead to the opposite effect, so I would not count this as a general-purpose speed-up technique.
In the end, it's very hard to tell which code will end up running the fastest. Something like register allocation and the factors that influence it are such low-level implementation details that I don't see how any specific technique could reliably produce faster code.
For example, consider the following two methods. They were adapted from a real-life example:
interface IIndexed { int this[int index] { get; set; } }
struct StructArray : IIndexed {
public int[] Array;
public int this[int index] {
get { return Array[index]; }
set { Array[index] = value; }
}
}
static int Generic<T>(int length, T a, T b) where T : IIndexed {
int sum = 0;
for (int i = 0; i < length; i++)
sum += a[i] * b[i];
return sum;
}
static int Specialized(int length, StructArray a, StructArray b) {
int sum = 0;
for (int i = 0; i < length; i++)
sum += a[i] * b[i];
return sum;
}
One is a generic version of the other. Replacing the generic type with StructArray would make the methods identical. Because StructArray is a value type, it gets its own compiled version of the generic method. Yet the actual running time is significantly longer than the specialized method's, but only for x86. For x64, the timings are pretty much identical. In other cases, I've observed differences for x64 as well.
This looks like a case of inlining gone bad. On an x86 core, the jitter has the ebx, edx, esi and edi registers available for general-purpose storage of local variables. The ecx register becomes available in a static method since it doesn't have to store this. The eax register often is needed for calculations. But these are 32-bit registers; for variables of type long it must use a pair of registers. Which are edx:eax for calculations and edi:ebx for storage.
Which is what stands out in the disassembly for the slow version: neither edi nor ebx is used.
When the jitter can't find enough registers to store local variables then it must generate code to load and store them from the stack frame. That slows down code, it prevents a processor optimization named "register renaming", an internal processor core optimization trick that uses multiple copies of a register and allows super-scalar execution. Which permits several instructions to run concurrently, even when they use the same register. Not having enough registers is a common problem on x86 cores, addressed in x64 which has 8 extra registers (r9 through r15).
The jitter will do its best to apply another code generation optimization, it will try to inline your Fibo() method. In other words, not make a call to the method but generate the code for the method inline in the Main() method. Pretty important optimization that, for one, makes properties of a C# class for free, giving them the perf of a field. It avoids the overhead of making the method call and setting up its stack frame, saves a couple of nanoseconds.
There are several rules that determine exactly when a method can be inlined. They are not exactly documented but have been mentioned in blog posts. One rule is that it won't happen when the method body is too large. That defeats the gain from inlining; it generates too much code that doesn't fit as well in the L1 instruction cache. Another hard rule that applies here is that a method won't be inlined when it contains a try/catch statement. The background behind that one is an implementation detail of exceptions: they piggy-back onto Windows' built-in support for SEH (Structured Exception Handling), which is stack-frame based.
One behavior of the register allocation algorithm in the jitter can be inferred from playing with this code. It appears to be aware of when the jitter is trying to inline a method. One rule it appears to use is that only the edx:eax register pair can be used for inlined code that has local variables of type long. But not edi:ebx. No doubt because that would be too detrimental to the code generation for the calling method; both edi and ebx are important storage registers.
So you get the fast version because the jitter knows up front that the method body contains try/catch statements. It knows it can never be inlined, so it readily uses edi:ebx for storage of the long variable. You got the slow version because the jitter didn't know up front that inlining wouldn't work. It only found out after generating the code for the method body.
The flaw then is that it didn't go back and re-generate the code for the method. Which is understandable, given the time constraints it has to operate in.
This slow-down doesn't occur on x64 because for one it has 8 more registers. For another because it can store a long in just one register (like rax). And the slow-down doesn't occur when you use int instead of long because the jitter has a lot more flexibility in picking registers.
I'd have put this in as a comment, as I'm really not certain that this is likely to be the case, but as I recall, doesn't a try/catch statement involve a modification to the way the garbage collection mechanism works, in that it clears up object memory allocations recursively off the stack? There may not be an object to be cleared up in this case, or the for loop may constitute a closure that the garbage collection mechanism recognises as sufficient to enforce a different collection method.
Probably not, but I thought it worth a mention as I hadn't seen it discussed anywhere else.
9 years later and the bug is still there! You can see it easily with:
static void Main( string[] args )
{
int hundredMillion = 1000000;
DateTime start = DateTime.Now;
double sqrt;
for (int i=0; i < hundredMillion; i++)
{
sqrt = Math.Sqrt( DateTime.Now.ToOADate() );
}
DateTime end = DateTime.Now;
double sqrtMs = (end - start).TotalMilliseconds;
Console.WriteLine( "Elapsed milliseconds: " + sqrtMs );
DateTime start2 = DateTime.Now;
double sqrt2;
for (int i = 0; i < hundredMillion; i++)
{
try
{
sqrt2 = Math.Sqrt( DateTime.Now.ToOADate() );
}
catch (Exception e)
{
int br = 0;
}
}
DateTime end2 = DateTime.Now;
double sqrtMsTryCatch = (end2 - start2).TotalMilliseconds;
Console.WriteLine( "Elapsed milliseconds: " + sqrtMsTryCatch );
Console.WriteLine( "ratio is " + sqrtMsTryCatch / sqrtMs );
Console.ReadLine();
}
The ratio is less than one on my machine, running the latest version of MSVS 2019, .NET 4.6.1
