I have a test which goes:
if(variable==SOME_CONSTANT || variable==OTHER_CONSTANT)
In these circumstances, on a platform where branching over the second test would take more cycles than simply doing it, would the optimizer be allowed to treat the || as a simple |?
In these circumstances, on a platform where branching over the second test would take more cycles than simply doing it, would the optimizer be allowed to treat the || as a simple |?
Yes, that is permitted, and in fact the C# compiler will perform this optimization in some cases on && and ||, reducing them to & and |. As you note, there must be no side effects of evaluating the right side.
Consult the compiler source code for the exact details of when the optimization is generated.
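To make that precondition concrete, here is a small illustrative sketch (my own, not from the answer; the constant values and names are placeholders) showing a case where the reduction is legal and one where it is not:
static class ShortCircuitSketch
{
    const int SOME_CONSTANT = 2;   // illustrative values, not from the question
    const int OTHER_CONSTANT = 4;
    static bool CanBeReduced(int variable)
    {
        // Evaluating the right-hand comparison has no side effects, so the
        // compiler may evaluate both and combine them with a plain |.
        return variable == SOME_CONSTANT || variable == OTHER_CONSTANT;
    }
    static bool MustShortCircuit(int index, int[] data)
    {
        // The right-hand side can throw IndexOutOfRangeException, an
        // observable effect, so the short circuit must be preserved here.
        return index < data.Length && data[index] == SOME_CONSTANT;
    }
}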
The compiler will also perform that optimization when the logical operation involves lifted-to-nullable operands. Consider for example
int? z = x + y;
where x and y are also nullable ints; this will be generated as
int? z;
int? temp1 = x;
int? temp2 = y;
z = temp1.HasValue & temp2.HasValue ?
new int?(temp1.GetValueOrDefault() + temp2.GetValueOrDefault()) :
new int?();
Note that it's & and not &&. I knew that calling HasValue is so fast that it would not be worth the extra branching logic to avoid it.
If you're interested in how I wrote the nullable arithmetic optimizer, I've written a detailed explanation of it here: https://ericlippert.com/2012/12/20/nullable-micro-optimizations-part-one/
Yes, the compiler can make that optimization. Indeed, every language of interest generally has an explicit or implicit "as if" clause that permits any optimization whose effects are not observable, without needing a specific rule for it. This allows it to implement the checks in a non-short-circuit manner, in addition to a whole host of more extreme optimizations, such as combining multiple conditions into one, eliminating the check entirely, implementing the check without any branch at all using predicated instructions, etc.
The other side, however, is that the specific optimization you mention of unconditionally performing the second check isn't performed very often on most common platforms, because on many instruction sets the branching approach is the fastest, assuming it doesn't change the predictability of the branch. For example, on x86, you can use cmp to compare a variable to a known value (as in your example), but the "result" ends up in the EFLAGS register (of which there is only one, architecturally). How do you implement the || in that case between the two comparison results? The second comparison will overwrite the flags set by the first, so you'll be stuck saving the flags somewhere, then doing the second comparison, and then trying to "combine" the flags somehow just so you can do your single test1.
The truth is, ignoring prediction, the conditional branch is often almost free, especially when the compiler organizes it to be "not taken". For example, on x86, your condition could look like two cmp operations, each immediately followed by a jump over the code in the if() block. So that's just two branch instructions versus the hoops you'd have to jump through to reduce it to one. Going further, these cmp and subsequent branches often macro-fuse into a single operation that has about the same cost as the comparison alone (and takes a single cycle). There are various caveats, but the overall assumption that "branching over the second test" will take much time is probably not well founded.
The main caveat is branch prediction. In the case that each individual clause is unpredictable, but where the whole condition is predictable, combining everything into a single branch can be very profitable. Imagine, for example, that in your (variable==SOME_CONSTANT || variable==OTHER_CONSTANT) that variable was equal to SOME_CONSTANT 50% of the time, and OTHER_CONSTANT 49% of the time. The if will thus be taken 99% of the time, but the first check variable==SOME_CONSTANT will be totally unpredictable: branching exactly half the time! In this case it would be a great idea to combine the checks, even at some cost, since the misprediction is expensive.
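If you want to steer the code generator toward that combined form yourself, C# (like C) also offers the non-short-circuiting operator, so you can write it explicitly; a minimal sketch reusing the question's placeholder names, with DoSomething() standing in for the guarded work:
if (variable == SOME_CONSTANT | variable == OTHER_CONSTANT)
{
    DoSomething();   // both comparisons are always evaluated; a single branch remains
}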
Now there are certain cases where the compiler can combine checks simply due to the form of the check. Peter shows a range-check example in his answer, and there are others.
Here's an interesting one I stumbled across where your SOME_CONSTANT is 2 and OTHER_CONSTANT is 4:
void test(int a) {
if (a == 2 || a == 4) {
call();
}
}
Both clang and icc implement this as a series of two checks and two branches, but recent gcc uses another trick:
test(int):
sub edi, 2
and edi, -3
je .L4
rep ret
.L4:
jmp call()
Essentially it subtracts 2 from a and then checks if any bit other than 0b10 is set. The values 2 and 4 are the only values accepted by that check. Interesting transformation! It's not that much better than the two branch approach, for predictable inputs, but for the unpredictable clauses but predictable final outcome case it will be a big win.
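A source-level sketch of the same idea (my illustration, not compiler output): after subtracting 2, the only bit allowed to survive is bit 1, so masking everything else and testing for zero accepts exactly 2 and 4:
static bool IsTwoOrFour(int a)
{
    // a - 2 must be 0 or 2, i.e. only bit 1 may be set; ~2 is the same -3 mask gcc uses above.
    return (unchecked(a - 2) & ~2) == 0;
}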
This isn't really a case of doing both checks unconditionally however: just a clever case of being able to combine multiple checks into fewer, possibly with a bit of math. So I don't know if it meets your criteria for a "yes, they actually do in practice" answer. Perhaps compilers do make this optimization, but I haven't seen it on x86. If it exists there it might only be triggered by profile-guided optimization, where the compiler has an idea of the probability of various clauses.
1 On platforms with fast cmov two cmovs to implement || is probably not a terrible choice, and && can be implemented similarly.
Compilers are allowed to optimize short-circuit comparisons into asm that isn't two separate test & branch. But sometimes it's not profitable (especially on x86 where compare-into-register takes multiple instructions), and sometimes compilers miss the optimization.
Or if compilers choose to make branchless code using a conditional-move, both conditions are always evaluated. (This is of course only an option when there are no side-effects).
One special case is range-checks: compilers can transform x > min && x < max (especially when min and max are compile-time constants) into a single check. This can be done with 2 instructions instead of branching on each condition separately. Subtracting the low end of the range will wrap to a large unsigned number if the input was lower, so a subtract + unsigned-compare gives you a range check.
The range-check optimization is easy / well-known (by compiler developers), so I'd assume C# JIT and ahead-of-time compilers would do it, too.
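Applied by hand in C#, the transformation would look something like this (my sketch, mirroring the 10 < x && x < 100 example that follows):
static bool InRange(int x)
{
    // 10 < x && x < 100  is the same as  (uint)(x - 11) being in [0, 88]
    return unchecked((uint)(x - 11)) < 89u;
}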
To take a C example (which has the same short-circuit evaluation rules as C#):
int foo(int x, int a, int b) {
if (10 < x && x < 100) {
return a;
}
return b;
}
Compiled (with gcc7.3 -O3 for the x86-64 Windows ABI, on the Godbolt compiler explorer. You can see output for ICC, clang, or MSVC; or for gcc on ARM, MIPS, etc.):
foo(int, int, int):
sub ecx, 11 # x-11
mov eax, edx # retval = a;
cmp ecx, 89
cmovnb eax, r8d # retval = (x-11U) < 89U ? retval : b;
ret
So the function is branchless, using cmov (conditional mov). @HansPassant says .NET's compiler only tends to do this for assignment operations, so maybe you'd only get that asm if you wrote it in the C#
source as retval = (10 < x && x < 100) ? a : b;.
Or to take a branching example, we get the same optimization of the range check into a sub and then an unsigned compare/branch instead of compare/cmov.
int ext(void);
int bar(int x) {
if (10 < x && x < 100) {
return ext();
}
return 0;
}
# gcc -O3
sub ecx, 11
cmp ecx, 88
jbe .L7 # jump if ((unsigned)x-11U) <= 88U
xor eax, eax # return 0;
ret
.L7:
jmp ext() # tailcall ext()
IDK if existing C# implementations make this optimization the same way, but it's easy and valid for all possible inputs, so they should.
Godbolt doesn't have a C# compiler; if there is a convenient online C# compiler that shows you the asm, it would be interesting to try these functions there. (I think they're valid C# syntax as well as valid C and valid C++).
Other cases
Some cases other than range-checks can be profitable to optimize into a single branch or cmov on multiple conditions. x86 can't compare into a register very efficiently (xor-zero / cmp / setcc), but in some cases you only need 0 / non-zero instead of a 0 / 1 boolean to combine later. x86's OR instruction sets flags, so you can or / jnz to jump if either register was non-zero. (But note that saving the test reg,reg before a jcc only saves code-size; macro-fusion works for test/jcc but not or/jcc, so or/test/jcc is the same number of uops as or/jcc. It saves a uop with cmovcc or setcc, though.)
If branches predict perfectly, two cmp / jcc are probably still cheapest (because of macro-fusion: cmp / jne is a single uop on recent CPUs), but if not then two conditions together may well predict better, or be better with CMOV.
int foo(int x, int a, int b) {
if ((a-10) || (x!=5)) {
return a;
}
return b;
}
On Godbolt with gcc7.3, clang5.0, ICC18, and MSVC CL19
gcc compiles it the obvious way, with 2 branches and a couple mov instructions. clang5.0 spots the opportunity to transform it:
# compiled for the x86-64 System V ABI this time: args in edi=x, esi=a, edx=b
mov eax, esi
xor eax, 10
xor edi, 5
or edi, eax # flags set from edi=(a^10) | (x^5)
cmovne edx, esi # edx = (edi!=0) ? a : b
mov eax, edx # return edx
ret
Other compilers need some hand-holding if you want them to emit code like this. (And clang could use the same help to realize that it can use lea to copy-and-subtract instead of needing a mov before xor to avoid destroying an input that's needed later).
int should_optimize_to(int x, int a, int b) {
// x!=10 fools compilers into missing the optimization
if ((a-10) | (x-5)) {
return a;
}
return b;
}
gcc, clang, msvc, and ICC all compile this to basically the same thing:
# gcc7.3 -O3
lea eax, [rsi-10] # eax = a-10
sub edi, 5 # x-=5
or eax, edi # set flags
mov eax, edx
cmovne eax, esi
ret
This is smarter than clang's code: putting the mov to eax before the cmov creates instruction-level parallelism. If mov has non-zero latency, that latency can happen in parallel with the latency of creating the flag input for cmov.
If you want this kind of optimization, you usually have to hand-hold compilers toward it.
I'm trying to create a C# app which uses a DLL library that contains C++ code and inline assembly. In the function test_MMX I want to add two arrays of a specific length.
extern "C" __declspec(dllexport) void __stdcall test_MMX(int *first_array,int *second_array,int length)
{
__asm
{
mov ecx,length;
mov esi,first_array;
shr ecx,1;
mov edi,second_array;
label:
movq mm0,QWORD PTR[esi];
paddd mm0,QWORD PTR[edi];
add edi,8;
movq QWORD PTR[esi],mm0;
add esi,8;
dec ecx;
jnz label;
}
}
After running the app, it shows this warning:
warning C4799: function 'test_MMX' has no EMMS instruction.
When I measure the running time of this function from C# in milliseconds, it returns this value: -922337203685477 instead of (for example) 0.0141...
private Stopwatch time = new Stopwatch();
time.Reset();
time.Start();
test_MMX(first_array, second_array, length);
time.Stop();
TimeSpan interval = time.Elapsed;
return interval.TotalMilliseconds;
Any ideas how to fix it, please?
Since MMX aliases over the floating-point registers, any routine that uses MMX instructions must end with the EMMS instruction. This instruction "clears" the registers, making them available for use by the x87 FPU once again. (Which any C or C++ calling convention for x86 will assume is safe.)
The compiler is warning you that you have written a routine that uses MMX instructions but does not end with the EMMS instruction. That's a bug waiting to happen, as soon as some FPU instruction tries to execute.
This is a huge disadvantage of MMX, and the reason why you really can't freely intermix MMX and floating-point instructions. Sure, you could just throw EMMS instructions around, but it is a slow, high-latency instruction, so this kills performance. SSE had the same limitations as MMX in this regard, at least for integer operations. SSE2 was the first instruction set to address this problem, since it used its own discrete register set. Its registers are also twice as wide as MMX's are, so you can do even more at a time. Since SSE2 does everything that MMX does, but faster, easier, and more efficiently, and is supported by the Pentium 4 and later, it is quite rare that anyone needs to write new code today that uses MMX. If you can use SSE2, you should. It will be faster than MMX. Another reason not to use MMX is that it is not supported in 64-bit mode.
Anyway, the correct way to write the MMX code would be:
__asm
{
mov ecx, [length]
mov eax, [first_array]
shr ecx, 1
mov edx, [second_array]
label:
movq mm0, QWORD PTR [eax]
paddd mm0, QWORD PTR [edx]
add edx, 8
movq QWORD PTR [eax], mm0
add eax, 8
dec ecx
jnz label
emms
}
Note that, in addition to the EMMS instruction (which, of course, is placed outside of the loop), I made a few additional changes:
Assembly-language instructions do not end with semicolons. In fact, in assembly language's syntax, the semicolon is used to begin a comment. So I have removed your semicolons.
I've also added spaces for readability.
And, while it isn't strictly necessary (Microsoft's inline assembler is sufficiently forgiving so as to allow you to get away with not doing it), it is a good idea to be explicit and wrap the use of addresses (C/C++ variables) in square brackets, since you are actually dereferencing them.
As a commenter pointed out, you can freely use the ESI and EDI registers in inline assembly, since the inline assembler will detect their use and generate additional instructions that push/pop them accordingly. In fact, it will do this with all non-volatile registers. And if you need additional registers, then you need them, and this is a nice feature. But in this code, you're only using three general-purpose registers, and in the __stdcall calling convention, there are three general-purpose registers that are specifically defined as volatile (i.e., can be freely clobbered by any function): EAX, EDX, and ECX. So you should be using those registers for maximum speed. As such, I've changed your use of ESI to EAX, and your use of EDI to EDX. This will improve the code that you can't see, the prologue and epilogue automatically generated by the compiler.
You have a potential speed trap lurking here, though, and that is alignment. To obtain maximum speed, MMX instructions need to operate on data that is aligned on 8-byte boundaries. In a loop, misaligned data has a compounding negative effect on performance: not only is the data misaligned the first time through the loop, exerting a significant performance penalty, but it is guaranteed to be misaligned each subsequent time through the loop, too. So for this code to have any chance of being fast, the caller needs to guarantee that first_array and second_array are aligned on 8-byte boundaries.
If you can't guarantee that, then the function should really have extra code added to it to fix up misalignments. Essentially, you want to do a couple of non-vector operations (on individual bytes) at the beginning, before starting the loop, until you've reached a suitable alignment. Then, you can start issuing the vectorized MMX instructions.
(Unaligned loads are no longer penalized on modern processors, but if you were targeting modern processors, you'd be writing SSE2 code. On the older processors where you need to run MMX code, alignment will be a big deal, and misaligned data will kill your performance.)
Now, this inline assembly won't produce particularly efficient code. When you use inline assembly, the compiler always generates prologue and epilogue code for the function. That isn't terrible, since it's outside of the critical inner loop, but still—it's cruft you don't need. Worse, jumps in inline assembly blocks tend to confuse MSVC's inline assembler and cause it to generate sub-optimal code. It is overly cautious, preventing you from doing something that could corrupt the stack or cause other external side effects, which is nice, except that the whole reason you're writing inline assembly is (presumably) because you desire maximum performance.
(It should go without saying, but if you don't need the maximum possible performance, you should just write the code in C (or C++) and let the compiler optimize it. It does a darn good job in the majority of cases.)
If you do need the maximum possible performance, and have decided that the compiler-generated code just won't cut it, then a better alternative to inline assembly is the use of intrinsics. Intrinsics will generally map one-to-one to assembly-language instructions, but the compiler does a lot better job optimizing around them.
Here's my version of your code, using MMX intrinsics:
#include <intrin.h> // include header with MMX intrinsics
void __stdcall Function_With_Intrinsics(int *first_array, int *second_array, int length)
{
unsigned int counter = static_cast<unsigned int>(length);
counter /= 2;
do
{
*reinterpret_cast<__m64*>(first_array) = _mm_add_pi32(*reinterpret_cast<const __m64*>(first_array),
*reinterpret_cast<const __m64*>(second_array));
first_array += 2; // advance by two ints (8 bytes), i.e. one __m64
second_array += 2;
} while (--counter != 0);
_mm_empty();
}
It does the same thing, but more efficiently by delegating more to the compiler's optimizer. A couple of notes:
Since your assembly code treats length as an unsigned integer, I assume that your interface requires that it actually be an unsigned integer. (And, if so, I wonder why you don't declare it as such in the function's signature.) To achieve the same effect, I've cast it to an unsigned int, which is subsequently used as the counter. (If I hadn't done that, I'd have to have either done a shift operation on a signed integer, which risks undefined behavior, or a division by two, for which the compiler would have generated slower code to correctly deal with the sign bit.)
The *reinterpret_cast<__m64*> business scattered throughout looks scary, but is actually safe—at least, relatively speaking. That's what you're supposed to do with the MMX intrinsics. The MMX data type is __m64, which you can think of as being roughly equivalent to an mm? register. It is 64 bits in length, and loads and stores are accomplished by casting. These get translated directly into MOVQ instructions.
Your original assembly code was written such that the loop always iterated at least once, so I transformed that into a do…while loop. This means the test of the loop condition only has to be done at the bottom of the loop, rather than once at the top and once at the bottom.
The _mm_empty() intrinsic causes an EMMS instruction to be emitted.
Just for grins, let's see what the compiler transformed this into. This is the output from MSVC 16 (VS 2010), targeting x86-32 and optimizing for speed over size (though it makes no difference in this particular case):
PUBLIC ?Function_With_Intrinsics@@YGXPAH0H@Z
; Function compile flags: /Ogtpy
_first_array$ = 8 ; size = 4
_second_array$ = 12 ; size = 4
_length$ = 16 ; size = 4
?Function_With_Intrinsics@@YGXPAH0H@Z PROC
mov ecx, DWORD PTR _length$[esp-4]
mov edx, DWORD PTR _second_array$[esp-4]
mov eax, DWORD PTR _first_array$[esp-4]
shr ecx, 1
sub edx, eax
$LL3:
movq mm0, MMWORD PTR [eax]
movq mm1, MMWORD PTR [edx+eax]
paddd mm0, mm1
movq MMWORD PTR [eax], mm0
add eax, 8
dec ecx
jne SHORT $LL3
emms
ret 12
?Function_With_Intrinsics@@YGXPAH0H@Z ENDP
It is recognizably similar to your original code, but does a couple of things differently. In particular, it tracks the array pointers differently, in a way that it (and I) believe is slightly more efficient than your original code, since it does less work inside of the loop. It also breaks apart your PADDD instruction so that both of its operands are MMX registers, instead of the source being a memory operand. Again, this tends to make the code more efficient at the expense of clobbering an additional MMX register, but we've got plenty of those to spare, so it's certainly worth it.
Better yet, as the optimizer improves in newer versions of the compiler, code that is written using intrinsics may get even better!
Of course, rewriting the function to use intrinsics doesn't solve the alignment problem, but I'm assuming you have already dealt with that on the caller side. If not, you'll need to add code to handle it.
If you wanted to use SSE2—perhaps that would be test_SSE2 and you would dynamically delegate to the appropriate implementation depending on the current processor's feature bits—then you could do it like this:
#include <intrin.h> // include header with SSE2 intrinsics
void __stdcall Function_With_Intrinsics_SSE2(int *first_array, int *second_array, int length)
{
unsigned int counter = static_cast<unsigned>(length);
counter /= 4;
do
{
_mm_storeu_si128(reinterpret_cast<__m128i*>(first_array),
_mm_add_epi32(_mm_loadu_si128(reinterpret_cast<const __m128i*>(first_array)),
_mm_loadu_si128(reinterpret_cast<const __m128i*>(second_array))));
first_array += 4; // advance by four ints (16 bytes), i.e. one __m128i
second_array += 4;
} while (--counter != 0);
}
I've written this code not assuming alignment, so it will work when the loads and stores are misaligned. For maximum speed on many older architectures, SSE2 requires 16-byte alignment, and if you can guarantee that the source and destination pointers are thusly aligned, you can use slightly faster instructions (e.g., MOVDQA as opposed to MOVDQU). As mentioned above, on newer architectures (at least Sandy Bridge and later, perhaps earlier), it doesn't matter.
To give you an idea of how SSE2 is basically just a drop-in replacement for MMX on Pentium 4 and later, except that you also get to do operations that are twice as wide, look at the code this compiles to:
PUBLIC ?Function_With_Intrinsics_SSE2@@YGXPAH0H@Z
; Function compile flags: /Ogtpy
_first_array$ = 8 ; size = 4
_second_array$ = 12 ; size = 4
_length$ = 16 ; size = 4
?Function_With_Intrinsics_SSE2@@YGXPAH0H@Z PROC
mov ecx, DWORD PTR _length$[esp-4]
mov edx, DWORD PTR _second_array$[esp-4]
mov eax, DWORD PTR _first_array$[esp-4]
shr ecx, 2
sub edx, eax
$LL3:
movdqu xmm0, XMMWORD PTR [eax]
movdqu xmm1, XMMWORD PTR [edx+eax]
paddd xmm0, xmm1
movdqu XMMWORD PTR [eax], xmm0
add eax, 16
dec ecx
jne SHORT $LL3
ret 12
?Function_With_Intrinsics_SSE2@@YGXPAH0H@Z ENDP
As for the final question about getting negative values from the .NET Stopwatch class, I would normally guess that would be due to an overflow. In other words, your code executed too slowly, and the timer wrapped around. Kevin Gosse pointed out, though, that this is apparently a bug in the implementation of the Stopwatch class. I don't know much more about it, since I don't really use it. If you want a good microbenchmarking library, I use and recommend Google Benchmark. However, it is for C++, not C#.
While you're benchmarking, definitely take the time to time the code generated by the compiler when you write it the naïve way. Say, something like:
void Naive_PackedAdd(int *first_array, int *second_array, int length)
{
for (unsigned int i = 0; i < static_cast<unsigned int>(length); ++i)
{
first_array[i] += second_array[i];
}
}
You just might be pleasantly surprised at how fast the code is after the compiler gets finished auto-vectorizing the loop. :-) Remember that less code does not necessarily mean faster code. All of that extra code is required to deal with alignment issues, which I've diplomatically skirted throughout this answer. If you scroll down, at $LL4@Naive_Pack, you'll find an inner loop very similar to what we've been considering here.
Back in the day when I was learning C and assembly we were taught it is better to use simple comparisons to increase speed. So for example if you say:
if(x <= 0)
versus
if(x < 1)
which would execute faster? My argument (which may be wrong) is that the second would almost always execute faster because there is only a single comparison, i.e. is it less than one, yes or no.
Whereas the first will execute fast if the number is less than 0, because that equates to true and there is no need to check for equality, making it as fast as the second; however, it will always be slower if the number is 0 or more, because it then has to do a second comparison to see if it is equal to 0.
I am now using C#, and while developing for desktops speed is not an issue (at least not to the degree that this point is worth arguing), I still think such arguments need to be considered, as I am also developing for mobile devices which are much less powerful than desktops, and speed does become an issue on such devices.
For further consideration, I am talking about whole numbers (no decimals) and values that cannot legitimately be negative like -1 or -12,345 (unless there is an error): for example, when dealing with lists or arrays, you can't have a negative number of items, but you may want to check whether a list is empty, or use a negative value to signal a problem (say, there are some items in the list but you cannot retrieve the whole list for some reason, and you indicate this by setting the count to a negative number, which is not the same as saying there are no items).
For the reason above I deliberately left out the obvious
if(x == 0)
and
if(x.isnullorempty())
and other such items for detecting a list with no items.
Again, for consideration, we are talking about the possibility of retrieving items from a database, perhaps using SQL stored procedures which have the functionality mentioned (i.e. the standard, at least in this company, is to return a negative number to indicate a problem).
So in such cases, is it better to use the first or the second item above?
They're identical. Neither is faster than the other. They both ask precisely the same question, assuming x is an integer. C# is not assembly. You're asking the compiler to generate the best code to get the effect you are asking for. You aren't specifying how it gets that result.
See also this answer.
My argument (which may be wrong) is that the second would almost always execute faster because there is only a single comparison, i.e. is it less than one, yes or no.
Clearly that's wrong. Watch what happens if you assume that's true:
< is faster than <= because it asks fewer questions. (Your argument.)
> is the same speed as <= because it asks the same question, just with an inverted answer.
Thus < is faster than >! But this same argument shows > is faster than <.
"just with an inverted answer" seems to sneak in an additional boolean operation so I'm not sure I follow this answer.
That's wrong (for silicon, it is sometimes correct for software) for the same reason. Consider:
3 != 4 is more expensive to compute than 3 == 4, because it's 3 == 4 with an inverted answer, an additional boolean operation.
3 == 4 is more expensive than 3 != 4, because it's 3 != 4 with an inverted answer, an additional boolean operation.
Thus, 3 != 4 is more expensive than itself.
An inverted answer is just the opposite question, not an additional boolean operation. Or, to be a bit more precise, it's the same comparison with a different mapping of comparison results to the final answer. Both 3 == 4 and 3 != 4 require you to compare 3 and 4. That comparison results in either "equal" or "unequal". The questions just map "equal" and "unequal" to "true" and "false" differently. Neither mapping is more expensive than the other.
At least in most cases, no, there's no advantage to one over the other.
A <= does not normally get implemented as two separate comparisons. On a typical (e.g., x86) CPU, you'll have two separate flags, one to indicate equality, and one to indicate negative (which can also mean "less than"). Along with that, you'll have branches that depend on a combination of those flags, so < translates to a jl or jb (jump if less or jump if below --the former is for signed numbers, the latter for unsigned). A <= will translate to a jle or jbe (jump if less than or equal, jump if below or equal).
Different CPUs will use different names/mnemonics for the instructions, but most still have equivalent instructions. In every case of which I'm aware, all of those execute at the same speed.
Edit: Oops -- I meant to mention one possible exception to the general rule I mentioned above. Although it's not exactly from < vs. <=, if/when you can compare to 0 instead of any other number, you can sometimes gain a little (minuscule) advantage. For example, let's assume you had a variable you were going to count down until you reached some minimum. In a case like this, you might well gain a little advantage if you can count down to 0 instead of counting down to 1. The reason is fairly simple: the flags I mentioned previously are affected by most instructions. Let's assume you had something like:
do {
// whatever
} while (--i >= 1);
A compiler might translate this to something like:
loop_top:
; whatever
dec i
cmp i, 1
jge loop_top
If, instead, you compare to 0 (while (--i > 0) or while (--i != 0)), it might translate to something like this instead;
loop_top:
; whatever
dec i
jg loop_top
; or: jnz loop_top
Here the dec sets/clears the zero flag to indicate whether the result of the decrement was zero or not, so the condition can be based directly on the result from the dec, eliminating the cmp used in the other code.
I should add, however, that while this was quite effective, say, 30+ years ago, most modern compilers can handle translations like this without your help (though some compilers may not, especially for things like small embedded systems). IOW, if you care about optimization in general, it's barely possible that you might someday care -- but at least to me, application to C# seems doubtful at best.
Most modern hardware has built-in instructions for checking the less-than-or-equal condition in a single instruction that executes exactly as fast as the one checking the less-than condition. The argument that applied to the (much) older hardware no longer applies - choose the alternative that you think is most readable, i.e. the one that better conveys your idea to the readers of your code.
Here are my functions:
public static void TestOne()
{
Boolean result;
Int32 i = 2;
for (Int32 j = 0; j < 1000000000; ++j)
result = (i < 1);
}
public static void TestTwo()
{
Boolean result;
Int32 i = 2;
for (Int32 j = 0; j < 1000000000; ++j)
result = (i <= 0);
}
Here is the IL code, which is identical:
L_0000: ldc.i4.2
L_0001: stloc.0
L_0002: ldc.i4.0
L_0003: stloc.1
L_0004: br.s L_000a
L_0006: ldloc.1
L_0007: ldc.i4.1
L_0008: add
L_0009: stloc.1
L_000a: ldloc.1
L_000b: ldc.i4 1000000000
L_0010: blt.s L_0006
L_0012: ret
After a few testing sessions, obviously, the result is that neither is faster than the other. The difference is only a few milliseconds, which can't be considered a real difference, and the produced IL output is the same anyway.
Both ARM and x86 processors have dedicated instructions for both "less than" and "less than or equal" (which could also be evaluated as "not greater than"), so there will be absolutely no real-world difference if you use any semi-modern compiler.
While refactoring, if you change your mind about the logic, if(x<=0) is easier (and less error prone) to negate mentally (it flips to if(x>0)), compared to if(x<1), which is easy to negate incorrectly (to if(x>1) instead of if(x>=1)) - but that's probably not the performance you're referring to. ;-)
If x<1 is faster, then modern compilers will change x<=0 to x<1 (assuming x is of an integral type). So for modern compilers, this should not matter; they should produce identical machine code.
Even if x<=0 compiled to different instructions than x<1, the performance difference would be so minuscule as to not be worth worrying about most of the time; there will very likely be other, more productive areas for optimization in your code. The golden rule is to profile your code and optimise the bits that ARE actually slow in the real world, not the bits that you think hypothetically may be slow, or are not as fast as they theoretically could be. Also concentrate on making your code readable to others, and not on phantom micro-optimisations that disappear in a puff of compiler smoke.
@Francis Rodgers, you said:
Whereas the first will execute fast if the number is less than 0, because that equates to true and there is no need to check for equality, making it as fast as the second; however, it will always be slower if the number is 0 or more, because it then has to do a second comparison to see if it is equal to 0.
and (in comments),
Can you explain where > is the same as <=, because this doesn't make sense in my logical world. For example, <= 0 is not the same as > 0; in fact, it's the total opposite. I would just like an example so I can understand your answer better.
You ask for help, and you need help. I really want to help you, and I’m afraid that many other people need this help too.
Begin with the most basic thing. Your idea that testing for > is not the same as testing for <= is logically wrong (not only in any programming language). Look at these diagrams, relax, and think about it. What happens if you know that X <= Y in A and in B? What happens if you know that X > Y in each diagram?
Right, nothing changed; they are equivalent. The key detail of the diagrams is that true and false in A and B are on opposite sides. The meaning of that is that the compiler (or, in general, the coder) has the freedom to reorganize the program flow so that both questions are equivalent. That means there is no need to split <= into two steps, only to reorganize your flow a little. Only a very bad compiler or interpreter will not be able to do that. Nothing to do yet with any assembler. The idea is that even for CPUs without sufficient flags for all comparisons, the compiler can generate (pseudo) assembler code that uses whichever test best suits the CPU's characteristics. And once you add the ability of CPUs to check more than one flag in parallel at the electronic level, the compiler's job becomes much simpler.
You may find it curious/interesting to read pages 3-14 to 3-15 and 5-3 to 5-5 (the latter include the jump instructions, which could be surprising for you): http://download.intel.com/products/processor/manual/325462.pdf
Anyway, I’d like to discuss a few related situations in more detail.
Comparing with 0 or with 1: @Jerry Coffin has a very good explanation at the assembler level. Going deeper, at the machine-code level, the variant that compares with 1 needs to “hard code” the 1 into the CPU instruction and load it into the CPU, while the other variant manages not to. Anyway, the gain here is absolutely small; I don’t think it will be measurable in speed in any real-life situation. As a side comment, the instruction cmp i, 1 just performs a sort of subtraction i - 1 (without saving the result) while setting the flags, so you actually end up comparing with 0 anyway!
More important could be this situation: comparing X <= Y versus Y >= X, which are obviously logically equivalent, but which could have severe side effects if X and Y are expressions that need to be evaluated and one could influence the result of the other! That is still very bad, and potentially undefined.
Now, coming back to the diagrams, and looking at the assembler examples from @Jerry Coffin too, I see the following issue. Real software is a sort of linear chain in memory: you select one of the conditions and jump to another program-memory position to continue, while the opposite case just falls through. It could make sense to select the more frequent condition to be the one that just continues. I don’t see how we can give the compiler a hint in these situations, and obviously the compiler can’t figure it out by itself. Please correct me if I’m wrong, but this sort of optimization problem is pretty general, and the programmer must decide without the help of the compiler.
But again, in almost any situation I’ll write my code with an eye on general style and readability, and not on these small local optimizations.
Consider the following two alternatives of getting the higher number between currentPrice and 100...
int price = currentPrice > 100 ? currentPrice : 100;
int price = Math.Max(currentPrice, 100);
I raised this question because I was thinking about a context where the currentPrice variable could be edited by other threads.
In the first case... could price obtain a value lower than 100?
I'm thinking about the following:
if (currentPrice > 100) {
//currentPrice is edited here.
price = currentPrice;
}
It is not threadsafe.
?: is just a shortcut for a normal if, so your if sample is equivalent to the ?: one - you can get a price lower than 100 if there is no locking outside this code.
I'm not a specialist in C#, but even var++ is not thread-safe, since it may be translated in assembly into a read into a register followed by a write back from that register.
The ternary operator is far more complicated. It has three parts, and each part can be arbitrarily big (e.g. a call to some function). Therefore, it's easy to conclude that the ternary operator is not thread-safe either.
In theory, currentPrice is read twice. Once for comparison, once for assignment.
In practice, the compiler may cache the access to the variable. I don't know about C# but in C++ on x86:
MOV AX, [currentPrice]
MOV BX, 100 ;cache the immediate
CMP AX, BX
JG $1 ;if (currentPrice > 100), keep AX = currentPrice
MOV AX, BX ;else price = 100
$1:
MOV [BP+price], AX ;price is on the stack.
The same load-once optimisation happens in Java bytecode unless currentPrice is declared volatile.
So, in theory, it can happen. In practice, on most platforms, it won't, but you cannot count on that.
As stated by others, it might be cached but the language does not require it.
You can use Interlocked.CompareExchange if you need lock-free threadsafe assignments. But given the example, I'd go for a more coarse grained locking strategy.
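As a hedged sketch of both ideas (the field and method names are mine, not from the question): read the shared field once into a local so the comparison and the assignment see the same value, and use a CompareExchange loop only if you actually need to publish a lock-free running maximum:
using System.Threading;
static class PriceSketch
{
    static int currentPrice;   // written by other threads (hypothetical)
    static int sharedMax;      // lock-free running maximum (hypothetical)
    static int LocalMax()
    {
        // A single read of the shared field; the ternary then works on a
        // stable local copy, so the result can never drop below 100.
        int snapshot = Volatile.Read(ref currentPrice);
        return snapshot > 100 ? snapshot : 100;
    }
    static void PublishMax(int candidate)
    {
        int observed;
        do
        {
            observed = Volatile.Read(ref sharedMax);
            if (candidate <= observed)
                return;   // the current maximum is already at least as large
        }
        while (Interlocked.CompareExchange(ref sharedMax, candidate, observed) != observed);
    }
}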
I recently changed
this.FieldValues = new object[2, fieldValues.GetUpperBound(1) + 1];
for (int i = 0; i < FieldCount; i++)
{
this.FieldValues[Current, i] = fieldValues[Current, i];
this.FieldValues[Original, i] = fieldValues[Original, i];
}
to
FieldValues = new object[2, fieldValues.GetLength(1)];
Array.Copy(fieldValues, FieldValues, FieldValues.Length);
Where the values of Current and Original are constants 0 and 1 respectively. FieldValues is a field and fieldValues is a parameter.
In the place I was using it, I found the Array.Copy() version to be faster. But another developer says he timed the for-loop against Array.Copy() in a standalone program and found the for-loop faster.
Is it possible that Array.Copy() is not really faster? I thought it was supposed to be super-optimised!
In my own experience, I've found that I can't trust my intuition about anything when it comes to performance. Consequently, I keep a quick-and-dirty benchmarking app around (that I call "StupidPerformanceTricks"), which I use to test these scenarios. This is invaluable, as I've made all sorts of surprising and counter-intuitive discoveries about performance tricks. It's also important to remember to run your benchmark app in release mode, without a debugger attached, as you otherwise don't get JIT optimizations, and those optimizations can make a significant difference: technique A might be slower than technique B in debug mode, but significantly faster in release mode, with optimized code.
That said, in general, my own testing experience indicates that if your array is < ~32 elements, you'll get better performance by rolling your own copy loop - presumably because you don't have the method call overhead, which can be significant. However, if the loop is larger than ~32 elements, you'll get better performance by using Array.Copy(). (If you're copying ints or floats or similar sorts of things, you might also want to investigate Buffer.BlockCopy(), which is ~10% faster than Array.Copy() for small arrays.)
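If you do try Buffer.BlockCopy, note that it only works on arrays of primitive element types (so not on the object[,] from the question) and that its offsets and count are in bytes, not elements; a minimal sketch:
int[] source = new int[1000];
int[] destination = new int[1000];
// The last argument is a byte count, hence the multiplication by sizeof(int).
Buffer.BlockCopy(source, 0, destination, 0, source.Length * sizeof(int));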
But all that said, the real answer is, "Write your own tests that match these precise alternatives as closely as possible, wrap them each with a loop, give the loop enough iterations for it to chew up at least 2-3 seconds worth of CPU, and then compare the alternatives yourself."
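A skeleton along those lines might look like the following (my sketch; the array sizes and iteration count are placeholders to tune until each loop runs for a few seconds, and it should be built in Release mode and run without a debugger):
using System;
using System.Diagnostics;
static class CopyBenchmark
{
    static void Main()
    {
        var source = new object[2, 64];
        var destination = new object[2, 64];
        const int iterations = 5000000;
        var sw = Stopwatch.StartNew();
        for (int n = 0; n < iterations; n++)
        {
            for (int i = 0; i < source.GetLength(1); i++)
            {
                destination[0, i] = source[0, i];
                destination[1, i] = source[1, i];
            }
        }
        Console.WriteLine("for loop:     {0} ms", sw.ElapsedMilliseconds);
        sw.Restart();
        for (int n = 0; n < iterations; n++)
        {
            Array.Copy(source, destination, destination.Length);
        }
        Console.WriteLine("Array.Copy(): {0} ms", sw.ElapsedMilliseconds);
    }
}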
The way .Net works under the hood, I'd guess that in an optimized situation, Array.Copy would avoid bounds checking.
If you do a loop on any type of collection, by default the CLR will check to make sure you're not passing the end of the collection, and then the JIT will either have to do a runtime assessment or emit code that doesn't need checking. (check the article in my comment for better details of this)
You can modify this behaviour, but generally you don't save that much. Unless you're in a tightly executed inner loop where every millisecond counts, that is.
If the Array is large, I'd use Array.Copy, if it's small, either should perform the same.
I do think it's bounds checking that's creating the different results for you though.
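For what it's worth, the loop shape the JIT recognizes best for dropping those checks is one that indexes directly against the array's own Length in the loop condition; a hedged sketch (exact JIT behaviour varies between runtime versions):
static long Sum(int[] values)
{
    long total = 0;
    // Comparing i against values.Length lets the JIT prove every index is in
    // range, so it can omit the per-element bounds check inside the loop.
    for (int i = 0; i < values.Length; i++)
    {
        total += values[i];
    }
    return total;
}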
In your particular example, there is a factor that might (in theory) indicate the for loop is faster.
Array.Copy is an O(n) operation, while your for loop runs only n/2 iterations (copying two elements per iteration), where n is the total size of your matrix.
Array.Copy needs to loop through all the elements in your two-dimensional array because:
When copying between multidimensional arrays, the array behaves like a long one-dimensional array, where the rows (or columns) are conceptually laid end to end. For example, if an array has three rows (or columns) with four elements each, copying six elements from the beginning of the array would copy all four elements of the first row (or column) and the first two elements of the second row (or column).