I'm testing what sort of speedup I can get from using SIMD instructions with RyuJIT and I'm seeing some disassembly instructions that I don't expect. I'm basing the code on this blog post from the RyuJIT team's Kevin Frei, and a related post here. Here's the function:
static void AddPointwiseSimd(float[] a, float[] b) {
int simdLength = Vector<float>.Count;
int i = 0;
for (i = 0; i < a.Length - simdLength; i += simdLength) {
Vector<float> va = new Vector<float>(a, i);
Vector<float> vb = new Vector<float>(b, i);
va += vb;
va.CopyTo(a, i);
}
}
The section of disassembly I'm querying copies the array values into the Vector<float>. Most of the disassembly is similar to that in Kevin and Sasha's posts, but I've highlighted some extra instructions (along with my confused annotations) that don't appear in their disassemblies:
;// Vector<float> va = new Vector<float>(a, i);
cmp eax,r8d ; <-- Unexpected - Compare a.Length to i?
jae 00007FFB17DB6D5F ; <-- Unexpected - Jump to range check failure
lea r10d,[rax+3]
cmp r10d,r8d
jae 00007FFB17DB6D5F
mov r11,rcx ; <-- Unexpected - Extra register copy?
movups xmm0,xmmword ptr [r11+rax*4+10h ]
;// Vector<float> vb = new Vector<float>(b, i);
cmp eax,r9d ; <-- Unexpected - Compare b.Length to i?
jae 00007FFB17DB6D5F ; <-- Unexpected - Jump to range check failure
cmp r10d,r9d
jae 00007FFB17DB6D5F
movups xmm1,xmmword ptr [rdx+rax*4+10h]
Note the loop range check is as expected:
;// for (i = 0; i < a.Length - simdLength; i += simdLength) {
add eax,4
cmp r9d,eax
jg loop
so I don't know why there are extra comparisons to eax. Can anyone explain why I'm seeing these extra instructions and if it's possible to get rid of them.
In case it's related to the project settings I've got a very similar project that shows the same issue here on github (see FloatSimdProcessor.HwAcceleratedSumInPlace() or UShortSimdProcessor.HwAcceleratedSumInPlaceUnchecked()).
I'll annotate the code generation that I see, for a processor that supports AVX2 like Haswell, it can move 8 floats at a time:
00007FFA1ECD4E20 push rsi
00007FFA1ECD4E21 sub rsp,20h
00007FFA1ECD4E25 xor eax,eax ; i = 0
00007FFA1ECD4E27 mov r8d,dword ptr [rcx+8] ; a.Length
00007FFA1ECD4E2B lea r9d,[r8-8] ; a.Length - simdLength
00007FFA1ECD4E2F test r9d,r9d ; if (i >= a.Length - simdLength)
00007FFA1ECD4E32 jle 00007FFA1ECD4E75 ; then skip loop
00007FFA1ECD4E34 mov r10d,dword ptr [rdx+8] ; b.Length
00007FFA1ECD4E38 cmp eax,r8d ; if (i >= a.Length)
00007FFA1ECD4E3B jae 00007FFA1ECD4E7B ; then OutOfRangeException
00007FFA1ECD4E3D lea r11d,[rax+7] ; i+7
00007FFA1ECD4E41 cmp r11d,r8d ; if (i+7 >= a.Length)
00007FFA1ECD4E44 jae 00007FFA1ECD4E7B ; then OutOfRangeException
00007FFA1ECD4E46 mov rsi,rcx ; move a[i..i+7]
00007FFA1ECD4E49 vmovupd ymm0,ymmword ptr [rsi+rax*4+10h]
00007FFA1ECD4E50 cmp eax,r10d ; same as above
00007FFA1ECD4E53 jae 00007FFA1ECD4E7B ; but for b
00007FFA1ECD4E55 cmp r11d,r10d
00007FFA1ECD4E58 jae 00007FFA1ECD4E7B
00007FFA1ECD4E5A vmovupd ymm1,ymmword ptr [rdx+rax*4+10h]
00007FFA1ECD4E61 vaddps ymm0,ymm0,ymm1 ; a[i..] + b[i...]
00007FFA1ECD4E66 vmovupd ymmword ptr [rsi+rax*4+10h],ymm0
00007FFA1ECD4E6D add eax,8 ; i += 8
00007FFA1ECD4E70 cmp r9d,eax ; if (i < a.Length)
00007FFA1ECD4E73 jg 00007FFA1ECD4E38 ; then loop
00007FFA1ECD4E75 add rsp,20h
00007FFA1ECD4E79 pop rsi
00007FFA1ECD4E7A ret
So the eax compares are those "pesky bound checks" that the blog post talks about. The blog post gives an optimized version that is not actually implemented (yet), real code right now checks both the first and the last index of the 8 floats that are moved at the same time. The blog post's comment "Hopefully, we'll get our bounds-check elimination work strengthened enough" is an uncompleted task :)
The mov rsi,rcx instruction is present in the blog post as well and appears to be a limitation in the register allocator. Probably influenced by RCX being an important register, it normally stores this. Not important enough to do the work to get this optimized away I'd assume, register-to-register moves take 0 cycles since they only affect register renaming.
Note how the difference between SSE2 and AVX2 is ugly, while the code moves and adds 8 floats at a time, it only actually uses 4 of them. Vector<float>.Count is 4 regardless of the processor flavor, leaving 2x perf on the table. Hard to hide the implementation detail I guess.
Take the following code (usable as a Console Application):
static void Main(string[] args)
{
int i = 0;
i += i++;
Console.WriteLine(i);
Console.ReadLine();
}
The result of i is 0. I expected 2 (as some of my colleagues did). Probably the compiler creates some sort of structure that results in i being zero.
The reason I expected 2 is that, in my line of thought, the right hand statement would be evaluated first, incrementing i with 1. Than it is added to i. Since i is already 1, it is adding 1 to 1. So 1 + 1 = 2. Obviously this is not what's happening.
Can you explain what the compiler does or what happens at runtime? Why is the result zero?
Some-sort-of-disclaimer: I'm absolutely aware you won't (and probably shouldn't) use this code. I know I never will. Nevertheless, I find it is interesting to know why it acts in such a way and what is happening exactly.
This:
int i = 0;
i += i++
Can be seen as you doing (the following is a gross oversimplification):
int i = 0;
i = i + i; // i=0 because the ++ is a postfix operator and hasn't been executed
i + 1; // Note that you are discarding the calculation result
What actually happens is more involved than that - take a look at MSDN, 7.5.9 Postfix increment and decrement operators:
The run-time processing of a postfix increment or decrement operation of the form x++ or x-- consists of the following steps:
If x is classified as a variable:
x is evaluated to produce the variable.
The value of x is saved.
The selected operator is invoked with the saved value of x as its argument.
The value returned by the operator is stored in the location given by the evaluation of x.
The saved value of x becomes the result of the operation.
Note that due to order of precedence, the postfix ++ occurs before +=, but the result ends up being unused (as the previous value of i is used).
A more thorough decomposition of i += i++ to the parts it is made of requires one to know that both += and ++ are not atomic (that is, neither one is a single operation), even if they look like they are. The way these are implemented involve temporary variables, copies of i before the operations take place - one for each operation. (I will use the names iAdd and iAssign for the temporary variables used for ++ and += respectively).
So, a closer approximation to what is happening would be:
int i = 0;
int iAdd = i; // Copy of the current value of i, for ++
int iAssign = i; // Copy of the current value of i, for +=
i = i + 1; // i++ - Happens before += due to order of precedence
i = iAdd + iAssign;
Disassembly of the running code:
int i = 0;
xor edx, edx
mov dword ptr i, edx // set i = 0
i += i++;
mov eax, dword ptr i // set eax = i (=0)
mov dword ptr tempVar1, eax // set tempVar1 = eax (=0)
mov eax, dword ptr i // set eax = 0 ( again... why??? =\ )
mov dword ptr tempVar2, eax // set tempVar2 = eax (=0)
inc dword ptr i // set i = i+1 (=1)
mov eax, dword ptr tempVar1 // set eax = tempVar1 (=0)
add eax, dword ptr tempVar2 // set eax = eax+tempVar2 (=0)
mov dword ptr i, eax // set i = eax (=0)
Equivalent code
It compiles to the same code as the following code:
int i, tempVar1, tempVar2;
i = 0;
tempVar1 = i; // created due to postfix ++ operator
tempVar2 = i; // created due to += operator
++i;
i = tempVar1 + tempVar2;
Disassembly of the second code (just to prove they are the same)
int i, tempVar1, tempVar2;
i = 0;
xor edx, edx
mov dword ptr i, edx
tempVar1 = i; // created due to postfix ++ operator
mov eax, dword ptr i
mov dword ptr tempVar1, eax
tempVar2 = i; // created due to += operator
mov eax, dword ptr i
mov dword ptr tempVar2, eax
++i;
inc dword ptr i
i = tempVar1 + tempVar2;
mov eax, dword ptr tempVar1
add eax, dword ptr tempVar2
mov dword ptr i, eax
Opening disassembly window
Most people don't know, or even don't remember, that they can see the final in-memory assembly code, using Visual Studio Disassembly window. It shows the machine code that is being executed, it is not CIL.
Use this while debuging:
Debug (menu) -> Windows (submenu) -> Disassembly
So what is happening with postfix++?
The postfix++ tells that we'd like to increment the value of the operand after the evaluation... that everybody knows... what confuses a bit is the meaning of "after the evaluation".
So what does "after the evaluation" means:
other usages of the operand, on the same line of code must be affected:
a = i++ + i the second i is affected by the increment
Func(i++, i) the second i is affected
other usages on the same line respect short-circuit operator like || and &&:
(false && i++ != i) || i == 0 the third i is not affected by i++ because it is not evaluated
So what is the meaning of: i += i++;?
It is the same as i = i + i++;
The order of evaluation is:
Store i + i (that is 0 + 0)
Increment i (i becomes 1)
Assign the value of step 1 to i (i becomes 0)
Not that the increment is being discarded.
What is the meaning of: i = i++ + i;?
This is not the same as the previous example. The 3rd i is affected by the increment.
The order of evaluation is:
Store i (that is 0)
Increment i (i becomes 1)
Store value of step 1 + i (that is 0 + 1)
Assign the value of step 3 to i (i becomes 1)
int i = 0;
i += i++;
is evaluated as follows:
Stack<int> stack = new Stack<int>();
int i;
// int i = 0;
stack.Push(0); // push 0
i = stack.Pop(); // pop 0 --> i == 0
// i += i++;
stack.Push(i); // push 0
stack.Push(i); // push 0
stack.Push(i); // push 0
stack.Push(1); // push 1
i = stack.Pop() + stack.Pop(); // pop 0 and 1 --> i == 1
i = stack.Pop() + stack.Pop(); // pop 0 and 0 --> i == 0
i.e. i is changed twice: once by the i++ expression and once by the += statement.
But the operands of the += statement are
the value i before the evaluation of i++ (left-hand side of +=) and
the value i before the evaluation of i++ (right-hand side of +=).
First, i++ returns 0. Then i is incremented by 1. Lastly i is set to the initial value of i which is 0 plus the value i++ returned, which is zero too. 0 + 0 = 0.
This is simply left to right, bottom-up evaluation of the abstract syntax tree. Conceptually, the expression's tree is walked from top down, but the evaluation unfolds as the recursion pops back up the tree from the bottom.
// source code
i += i++;
// abstract syntax tree
+=
/ \
i ++ (post)
\
i
Evaluation begins by considering the root node +=. That is the major constituent of the expression. The left operand of += must be evaluated to determine the place where we store the variable, and to obtain the prior value which is zero. Next, the right side must be evaluated.
The right side is a post-incrementing ++ operator. It has one operand, i which is evaluated both as a source of a value, and as a place where a value is to be stored. The operator evaluates i, finding 0, and consequently stores a 1 into that location. It returns the prior value, 0, in accordance with its semantics of returning the prior value.
Now control is back to the += operator. It now has all the info to complete its operation. It knows the place where to store the result (the storage location of i) as well as the prior value, and it has the value to added to the prior value, namely 0. So, i ends up with zero.
Like Java, C# has sanitized a very asinine aspect of the C language by fixing the order of evaluation. Left-to-right, bottom-up: the most obvious order that is likely to be expected by coders.
Because i++ first returns the value, then increments it. But after i is set to 1, you set it back to 0.
The post-increment method looks something like this
int ++(ref int i)
{
int c = i;
i = i + 1;
return c;
}
So basically when you call i++, i is increment but the original value is returned in your case it's 0 being returned.
Simple answer
int i = 0;
i += i++;
// Translates to:
i = i + 0; // because post increment returns the current value 0 of i
// Before the above operation is set, i will be incremented to 1
// Now i gets set after the increment,
// so the original returned value of i will be taken.
i = 0;
i++ means: return the value of i THEN increment it.
i += i++ means:
Take the current value of i.
Add the result of i++.
Now, let's add in i = 0 as a starting condition.
i += i++ is now evaluated like this:
What's the current value of i? It is 0. Store it so we can add the result of i++ to it.
Evaluate i++ (evaluates to 0 because that's the current value of i)
Load the stored value and add the result of step 2 to it. (add 0 to 0)
Note: At the end of step 2, the value of i is actually 1. However, in step 3, you discard it by loading the value of i before it was incremented.
As opposed to i++, ++i returns the incremented value.
Therefore, i+= ++i would give you 1.
The post fix increment operator, ++, gives the variable a value in the expression and then do the increment you assigned returned zero (0) value to i again that overwrites the incremented one (1), so you are getting zero. You can read more about increment operator in ++ Operator (MSDN).
i += i++; will equal zero, because it does the ++ afterwards.
i += ++i; will do it before
The ++ postfix evaluates i before incrementing it, and += only evaluates i once.
Therefore, 0 + 0 = 0, as i is evaluated and used before it is incremented, as the postfix format of ++ is used. To get i incremented first, use the prefix form (++i).
(Also, just a note: you should only get 1, as 0 + (0 + 1) = 1)
References: http://msdn.microsoft.com/en-us/library/sa7629ew.aspx (+=)
http://msdn.microsoft.com/en-us/library/36x43w8w.aspx (++)
What C# is doing, and the "why" of the confusion
I also expected the value to be 1... but some exploration on that matter did clarify some points.
Cosider the following methods:
static int SetSum(ref int a, int b) { return a += b; }
static int Inc(ref int a) { return a++; }
I expected that i += i++ to be the same as SetSum(ref i, Inc(ref i)). The value of i after this statement is 1:
int i = 0;
SetSum(ref i, Inc(ref i));
Console.WriteLine(i); // i is 1
But then I came to another conclusion... i += i++ is actually the same as i = i + i++... so I have created another similar example, using these functions:
static int Sum(int a, int b) { return a + b; }
static int Set(ref int a, int b) { return a = b; }
After calling this Set(ref i, Sum(i, Inc(ref i))) the value of i is 0:
int i = 0;
Set(ref i, Sum(i, Inc(ref i)));
Console.WriteLine(i); // i is 0
This not only explains what C# is doing... but also why a lot of people got confused with it... including me.
A good mnemonic I always remember about this is the following:
If ++ stands after the expression, it returns the value it was before. So the following code
int a = 1;
int b = a++;
is 1, because a was 1 before it got increased by the ++ standing after a. People call this postfix notation. There is also a prefix notation, where things are exactly the opposite: if ++ stands before, the expression returns the value that it is after the operation:
int a = 1;
int b = ++a;
b is two in here.
So for your code, this means
int i = 0;
i += (i++);
i++ returns 0 (as described above), so 0 + 0 = 0.
i += (++i); // Here 'i' would become two
Scott Meyers describes the difference between those two notations in "Effective C++ programming". Internally, i++ (postfix) remembers the value i was, and calls the prefix-notation (++i) and returns the old value, i. This is why you should allways use ++i in for loops (although I think all modern compilers are translating i++ to ++i in for loops).
The only answer to your question which is correct is: Because it is undefined.
i+=i++; result in 0 is undefined.
a bug in the language evaluation mechanism if you will.. or even worse! a bug in design.
want a proof? of course you want!
int t=0; int i=0; t+=i++; //t=0; i=1
Now this... is intuitive result! because we first evaluated t assigned it with a value and only after evaluation and assignment we had the post operation happening - rational isn't it?
is it rational that: i=i++ and i=i yield the same result for i?
while t=i++ and t=i have different results for i.
The post operation is something that should happen after the statement evaluation.
Therefore:
int i=0;
i+=i++;
Should be the same if we wrote:
int i=0;
i = i + i ++;
and therefore the same as:
int i=0;
i= i + i;
i ++;
and therefore the same as:
int i=0;
i = i + i;
i = i + 1;
Any result which is not 1 indicate a bug in the complier or a bug in the language design if we go with rational thinking - however MSDN and many other sources tells us "hey - this is undefined!"
Now, before I continue, even this set of examples I gave is not supported or acknowledged by anyone.. However this is what according to intuitive and rational way should have been the result.
The coder should have no knowledge of how the assembly is being written or translated!
If it is written in a manner that will not respect the language definitions - it is a bug!
And to finish I copied this from Wikipedia, Increment and decrement operators :
Since the increment/decrement operator modifies its operand, use of such an operand more than once within the same expression can produce undefined results. For example, in expressions such as x − ++x, it is not clear in what sequence the subtraction and increment operators should be performed. Situations like this are made even worse when optimizations are applied by the compiler, which could result in the order of execution of the operations to be different than what the programmer intended.
And therefore.
The correct answer is that this SHOULD NOT BE USED! (as it is UNDEFINED!)
Yes.. - It has unpredictable results even if C# complier is trying to normalize it somehow.
I did not find any documentation of C# describing the behavior all of you documented as a normal or well defined behavior of the language. What I did find is the exact opposite!
[copied from MSDN documentation for Postfix Increment and Decrement Operators: ++ and --]
When a postfix operator is applied to a function argument, the value of the argument is not guaranteed to be incremented or decremented before it is passed to the function. See section 1.9.17 in the C++ standard for more information.
Notice those words not guaranteed...
The ++ operator after the variable makes it a postfix increment. The incrementing happens after everything else in the statement, the adding and assignment. If instead, you put the ++ before the variable, it would happen before i's value was evaluated, and give you the expected answer.
The steps in calculation are:
int i=0 //Initialized to 0
i+=i++ //Equation
i=i+i++ //after simplifying the equation by compiler
i=0+i++ //i value substitution
i=0+0 //i++ is 0 as explained below
i=0 //Final result i=0
Here, initially the value of i is 0.
WKT, i++ is nothing but: first use the i value and then increment the i value by 1. So
it uses the i value, 0, while calculating i++ and then increments it by 1.
So it results in a value of 0.
Be very careful: read the C FAQ: what you're trying to do (mixing assignement and ++ of the same variable) is not only unspecified, but it is also undefined (meaning that the compiler may do anything when evaluating!, not only giving "reasonnable" results).
Please read, section 3. The whole section is well worth a read! Especially 3.9, which explains the implication of unspecified. Section 3.3 gives you a quick summary of what you can, and cannot do, with "i++" and the like.
Depending on the compilers internals, you may get 0, or 2, or 1, or even anything else! And as it is undefined, it's OK for them to do so.
There are two options:
The first option: if the compiler read the statement as follows,
i++;
i+=i;
then the result is 2.
For
else if
i+=0;
i++;
the result is 1.
There's lot of excellent reasoning in above answers, I just did a small test and want to share with you
int i = 0;
i+ = i++;
Here result i is showing 0 result.
Now consider below cases :
Case 1:
i = i++ + i; //Answer 1
earlier I thought above code resemble this so at first look answer is 1, and really answer of i for this one is 1.
Case 2:
i = i + i++; //Answer 0 this resembles the question code.
here increment operator doesn't come in execution path, unlike previous case where i++ has the chance to execute before addition.
I hope this helps a bit. Thanks
Hoping to answer this from a C programming 101 type of perspective.
Looks to me like it's happening in this order:
i is evaluated as 0, resulting in i = 0 + 0 with the increment operation i++ "queued", but the assignment of 0 to i hasn't happened yet either.
The increment i++ occurs
The assignment i = 0 from above happens, effectively overwriting anything that #2 (the post-increment) would've done.
Now, #2 may never actually happen (probably doesn't?) because the compiler likely realizes it will serve no purpose, but this could be compiler dependent. Either way, other, more knowledgeable answers have shown that the result is correct and conforms to the C# standard, but it's not defined what happens here for C/C++.
How and why is beyond my expertise, but the fact that the previously evaluated right-hand-side assignment happens after the post-increment is probably what's confusing here.
Further, you would not expect the result to be 2 regardless unless you did ++i instead of i++ I believe.
Simply put,
i++, will add 1 to "i" after the "+=" operator has completed.
What you want is ++i, so that it will add 1 to "i" before the "+=" operator is executed.
i=0
i+=i
i=i+1
i=0;
Then the 1 is added to i.
i+=i++
So before adding 1 to i, i took the value of 0. Only if we add 1 before, i get the value 0.
i+=++i
i=2
The answer is i will be 1.
Let's have a look how:
Initially i=0;.
Then while calculating i +=i++; according to value of we will have something like 0 +=0++;, so according to operator precedence 0+=0 will perform first and the result will be 0.
Then the increment operator will applied as 0++, as 0+1 and the value of i will be 1.
It's pretty common case to use multidimensional arrays with complicated indexing. It's really confusing and error-prone when all indexes are ints because you can easily mix up columns and rows (or whatever you have) and there's no way for compiler to identify the problem. In fact there should be two types of indexes: rows and columns but it's not expressed on type level.
Here's a small illustration of what I want:
var table = new int[RowsCount,ColumnsCount];
Row row = 5;
Column col = 10;
int value = table[row, col];
public void CalcSum(int[,] table, Column col)
{
int sum = 0;
for (Row r = 0; r < table.GetLength(0); r++)
{
sum += table[row, col];
}
return sum;
}
CalcSum(table, col); // OK
CalcSum(table, row); // Compile time error
Summing up:
indexes should be statically checked for mixing up (kind of type check)
important! they should be run time efficient since it's not OK for performance to wrap ints to custom objects containing the index and then unwrapping them back
they should be implicitly convertible to ints in order to serve as indexes in native multidimensional arrays
Is there any way to achieve this? The perfect solution would be something like typedef which serves as compile-time check only compiling into plane ints.
You'll only get a 2x slowdown with the x64 jitter. It generates interesting optimized code. The loop that uses the struct looks like this:
00000040 mov ecx,1
00000045 nop word ptr [rax+rax+00000000h]
00000050 lea eax,[rcx-1]
s.Idx = j;
00000053 mov dword ptr [rsp+30h],eax
00000057 mov dword ptr [rsp+30h],ecx
0000005b add ecx,2
for (int j = 0; j < 100000000; j++) {
0000005e cmp ecx,5F5E101h
00000064 jl 0000000000000050
This requires some annotation since the code is unusual. First off, the weird NOP at offset 45 is there to align the instruction at the start of the loop. That makes the branch at offset 64 faster. The instruction at 53 looks completely unnecessary. What you see happen here is loop unrolling, note how the instruction at 5b increments the loop counter by 2. The optimizer is however not smart enough to then also see that the store is unnecessary.
And most of all note that there's no ADD instruction to be seen. In other words, the code doesn't actually calculate the value of "sum". Which is because you are not using it anywhere after the loop, the optimizer can see that the calculation is useless and removed it entirely.
It does a much better job at the second loop:
000000af xor eax,eax
000000b1 add eax,4
for (int j = 0; j < 100000000; j++) {
000000b4 cmp eax,5F5E100h
000000b9 jl 00000000000000B1
It now entirely removed the "sum" calculation and the "i" variable assignment. It could have also removed the entire for() loop but that's never done by the jitter optimizer, it assumes that the delay is intentional.
Hopefully the message is clear by now: avoid making assumptions from artificial benchmarks and only ever profile real code. You can make it more real by actually displaying the value of "sum" so the optimizer doesn't throw away the calculation. Add this line of code after the loops:
Console.Write("Sum = {0} ", sum);
And you'll now see that there's no difference anymore.
It seems that when performing an & operation between two longs it takes the same amount of time as the equivalent operation inside 4 32bit ints.
For example
long1 & long2
Takes as long as
int1 & int2
int3 & int4
This is running on a 64bit OS and targeting 64bit .net.
In theory, this should be twice as fast. Has anyone encountered this previously?
EDIT
As a simplification, imagine I have two lots of 64 bits of data. I take those 64 bits and put them into a long, and perform a bitwise & on those two.
I also take those two sets of data, and put the 64 bits into two 32 bit int values and perform two &s. I expect to see the long & operation running faster than the int & operation.
I couldn't reproduce the problem.
My test was as follows (int version shown):
// deliberately made hard to optimise without whole program optimisation
public static int[] data = new int[1000000]; // long[] when testing long
// I happened to have a winforms app open, feel free to make this a console app..
private void button1_Click(object sender, EventArgs e)
{
long best = long.MaxValue;
for (int j = 0; j < 1000; j++)
{
Stopwatch timer = Stopwatch.StartNew();
int a1 = ~0, b1 = 0x55555555, c1 = 0x12345678; // varies: see below
int a2 = ~0, b2 = 0x55555555, c2 = 0x12345678;
int[] d = data; // long[] when testing long
for (int i = 0; i < d.Length; i++)
{
int v = d[i]; // long when testing long, see below
a1 &= v; a2 &= v;
b1 &= v; b2 &= v;
c1 &= v; c2 &= v;
}
// don't average times: we want the result with minimal context switching
best = Math.Min(best, timer.ElapsedTicks);
button1.Text = best.ToString() + ":" + (a1 + a2 + b1 + b2 + c1 + c2).ToString("X8");
}
}
For testing longs a1 and a2 etc are merged, giving:
long a = ~0, b = 0x5555555555555555, c = 0x1234567812345678;
Running the two programs on my laptop (i7 Q720) as a release build outside of VS (.NET 4.5) I got the following times:
int: 2238, long: 1924
Now considering there's a huge amount of loop overhead, and that the long version is working with twice as much data (8mb vs 4mb), it still comes out clearly ahead. So I have no reason to believe that C# is not making full use of the processor's 64 bit bitops.
But we really shouldn't be benching it in the first place. If there's a concern, simply check the jited code (Debug -> Windows -> Disassembly). Ensure the compiler's using the instructions you expect it to use, and move on.
Attempting to measure the performance of those individual instructions on your processor (and this could well be specific to your processor model) in anything other than assembler is a very bad idea - and from within a jit compiled language like C#, beyond futile. But there's no need to anyway, as it's all in Intel's optimisation handbook should you need to know.
To this end, here's the disassembly of the a &= for the long version of the program on x64 (release, but inside of debugger - unsure if this affects the assembly, but it certainly affects the performance):
00000111 mov rcx,qword ptr [rsp+60h] ; a &= v
00000116 mov rax,qword ptr [rsp+38h]
0000011b and rax,rcx
0000011e mov qword ptr [rsp+38h],rax
As you can see there's a single 64 bit and operation as expected, along with three 64 bit moves. So far so good, and exactly half the number of ops of the int version:
00000122 mov ecx,dword ptr [rsp+5Ch] ; a1 &= v
00000126 mov eax,dword ptr [rsp+38h]
0000012a and eax,ecx
0000012c mov dword ptr [rsp+38h],eax
00000130 mov ecx,dword ptr [rsp+5Ch] ; a2 &= v
00000134 mov eax,dword ptr [rsp+44h]
00000138 and eax,ecx
0000013a mov dword ptr [rsp+44h],eax
I can only conclude that the problem you're seeing is specific to something about your test suite, build options, processor... or quite possibly, that the & isn't the point of contention you believe it to be. HTH.
I can't reproduce your timings. The following code generates two arrays: one of 1,000,000 longs, and one with 2,000,000 ints. Then it loops through the arrays, applying the & operator to successive values. It keeps a running sum and outputs it, just to make sure that the compiler doesn't decide to remove the loop entirely because it isn't doing anything.
Over dozens of successive runs, the long loop is at least twice as fast as the int loop. This is running on a Core 2 Quad with Windows 8 Developer Preview and Visual Studio 11 Developer Preview. Program is compiled with "Any CPU", and run in 64 bit mode. All testing done using Ctrl+F5 so that the debugger isn't involved.
int numLongs = 1000000;
int numInts = 2*numLongs;
var longs = new long[numLongs];
var ints = new int[numInts];
Random rnd = new Random();
// generate values
for (int i = 0; i < numLongs; ++i)
{
int i1 = rnd.Next();
int i2 = rnd.Next();
ints[2 * i] = i1;
ints[2 * i + 1] = i2;
long l = i1;
l = (l << 32) | (uint)i2;
longs[i] = l;
}
// time operations.
int isum = 0;
Stopwatch sw = Stopwatch.StartNew();
for (int i = 0; i < numInts; i += 2)
{
isum += ints[i] & ints[i + 1];
}
sw.Stop();
Console.WriteLine("Ints: {0} ms. isum = {1}", sw.ElapsedMilliseconds, isum);
long lsum = 0;
int halfLongs = numLongs / 2;
sw.Restart();
for (int i = 0; i < halfLongs; i += 2)
{
lsum += longs[i] & longs[i + 1];
}
sw.Stop();
Console.WriteLine("Longs: {0} ms. lsum = {1}", sw.ElapsedMilliseconds, lsum);