I am working on optimizing a physics simulation program using Red Gate's Performance Profiler. One part of the collision-detection code had around 52 of the following little checks, covering cells in 26 directions in 3 dimensions, under two cases.
CollisionPrimitiveList cell = innerGrid[cellIndex + 1];
if (cell.Count > 0)
contactsMade += collideWithCell(obj, cell, data, ref attemptedContacts);
cell = innerGrid[cellIndex + grid.XExtent];
if (cell.Count > 0)
contactsMade += collideWithCell(obj, cell, data, ref attemptedContacts);
cell = innerGrid[cellIndex + grid.XzLayerSize];
if (cell.Count > 0)
contactsMade += collideWithCell(obj, cell, data, ref attemptedContacts);
As an extremely tight loop of the program, all of this had to be in the same method, but I found that, after I extended the area from two dimensions to three (raising the count from 16 checks to 52), cell.Count was suddenly no longer being inlined, even though it is a simple getter.
public int Count { get { return count; } }
This caused a humongous performance hit, and it took me considerable time to find that, when cell.Count appeared in the method 28 times or fewer, it was inlined every time, but once cell.Count appeared in the method 29 times or more, it was not inlined a single time (even though the vast majority of calls were in worst-case parts of the code that were rarely executed).
So, back to my question: does anybody have any idea how to get around this limit? I think the easy solution is just to make the count field internal rather than private (sketched below), but I would like a better solution than that, or at least a better understanding of the situation. I wish this sort of thing had been mentioned on Microsoft's Writing High-Performance Managed Applications page at http://msdn.microsoft.com/en-us/library/ms973858.aspx, but sadly it is not (possibly because of how arbitrary the 28-count limit is?).
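For reference, a minimal sketch of that easy solution (the class and field names are taken from the snippets above; the exact class layout is an assumption):

public class CollisionPrimitiveList
{
    internal int count;                        // exposed to the assembly instead of being private
    public int Count { get { return count; } } // getter kept for external callers
}

// the hot path reads the field directly, so the inliner's decision no longer matters:
CollisionPrimitiveList cell = innerGrid[cellIndex + 1];
if (cell.count > 0)
    contactsMade += collideWithCell(obj, cell, data, ref attemptedContacts);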
I am using .NET 4.0.
EDIT: It looks like I misinterpreted my little test. I found that the failure to inline was caused not by the methods themselves being called 28+ times, but by the method they ought to be inlined into being "too long" by some standard. This still confuses me, because I don't see how a simple getter could rationally not be inlined (and performance is significantly better with them inlined, as my profiler clearly shows me), but apparently the CLR JIT compiler refuses to inline anything once the method is already large. Playing around with slight variations showed me that this limit is a code size (from ildasm) of 1500, above which no inlining is done, even in the case of my getters, which some testing showed add no additional code overhead when inlined.
Thank you.
I haven't tested this, but it seems like one possible workaround is to have multiple properties that all return the same thing. Conceivably you could then get 28 inlines per property.
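An untested sketch of that idea (the duplicate property names are made up for illustration):

public class CollisionPrimitiveList
{
    private int count;
    public int Count  { get { return count; } }
    public int Count2 { get { return count; } } // hypothetical duplicate getter
    public int Count3 { get { return count; } } // hypothetical duplicate getter
}

// spread the call sites across the duplicates so no single property
// appears more than 28 times in the big method:
if (cell.Count2 > 0)
    contactsMade += collideWithCell(obj, cell, data, ref attemptedContacts);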
Note that the number of times a method gets inlined most likely depends on the size of native code for that method (see http://blogs.msdn.com/b/vancem/archive/2008/08/19/to-inline-or-not-to-inline-that-is-the-question.aspx); the number 28 is specific to that one property. A simple property would likely get inlined more times than a more complex method.
Straight off, this doesn't explain why 28 is the magic number, but I'm curious what would happen if you collate all your candidate CollisionPrimitiveList instances into an array, and then call your "if count > 0" block within a loop over the array.
Is the cell.Count call then made inline again?
e.g.
CollisionPrimitiveList[] cells = new CollisionPrimitiveList[] {
innerGrid[cellIndex + 1],
innerGrid[cellIndex + grid.XExtent],
innerGrid[cellIndex + grid.XzLayerSize]
// and all the rest
};
// Loop over cells - for demo only. Use for loop or LINQ'ify if faster
foreach (CollisionPrimitiveList cell in cells)
{
if (cell.Count > 0)
contactsMade += collideWithCell(obj, cell, data, ref attemptedContacts);
}
I know performance is the issue, and you'll have overheads constructing the array and looping through it, but if cell.Count is inlined again, might the performance still be better / good enough overall?
I'm guessing (though in no way positive) that this might have to do with the enregistration issue mentioned -- it's possible that the CLR is allocating a new variable for each if statement, and that those are exceeding a total of 64 variables. Do you think this might be the case?
Related
I'm doing a bit of coding, where I have to write this sort of code:
if( array[i]==false )
array[i]=true;
I wonder if it should be re-written as
array[i]=true;
This raises the question: are comparisons faster than assignments?
What about differences from language to language? (contrast between java & cpp, eg.)
NOTE: I've heard that "premature optimization is the root of all evil." I don't think that applies here :)
This isn't just premature optimization, this is micro-optimization, which is an irrelevant distraction.
Assuming your array is of boolean type then your comparison is unnecessary, which is the only relevant observation.
Well, since you say you're sure that this matters, you should just write a test program and measure to find the difference.
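For example, a minimal C# sketch of such a test might look like the following (the array size and loop bodies are arbitrary; run it as a release build without the debugger attached):

using System;
using System.Diagnostics;

class CompareVsAssign
{
    static void Main()
    {
        const int N = 10000000;
        bool[] a = new bool[N];

        Stopwatch sw = Stopwatch.StartNew();
        for (int i = 0; i < N; i++)
            a[i] = true;                  // unconditional assignment
        sw.Stop();
        Console.WriteLine("assignment:       {0} ms", sw.ElapsedMilliseconds);

        Array.Clear(a, 0, N);             // back to all false
        sw = Stopwatch.StartNew();
        for (int i = 0; i < N; i++)
            if (!a[i]) a[i] = true;       // worst case: test succeeds, assignment still happens
        sw.Stop();
        Console.WriteLine("compare + assign: {0} ms", sw.ElapsedMilliseconds);

        sw = Stopwatch.StartNew();
        for (int i = 0; i < N; i++)
            if (!a[i]) a[i] = true;       // best case: array already true, comparison only
        sw.Stop();
        Console.WriteLine("compare only:     {0} ms", sw.ElapsedMilliseconds);
    }
}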
Comparison can be faster if this code is executed on multiple variables allocated at scattered addresses in memory. With comparison you will only read data from memory into the processor cache, and if you don't change the variable's value, then when the cache decides to flush the line it will see that the line was not changed and there's no need to write it back to memory. This can speed up execution.
Edit: I wrote a script in PHP. I just noticed that there was a glaring error in it meaning the best-case runtime was being calculated incorrectly (scary that nobody else noticed!)
Best case just beats outright assignment but worst case is a lot worse than plain assignment. Assignment is likely fastest in terms of real-world data.
Output:
assignment in 0.0119960308075 seconds
worst case comparison in 0.0188510417938 seconds
best case comparison in 0.0116770267487 seconds
Code:
<?php
$arr = array();
$mtime = explode(" ", microtime());
$starttime = $mtime[1] + $mtime[0];

reset_arr($arr);
// plain assignment
for ($i = 0; $i < 10000; $i++)
    $arr[$i] = true;

$mtime = explode(" ", microtime());
$firsttime = $mtime[1] + $mtime[0];
$totaltime = ($firsttime - $starttime);
echo "assignment in ".$totaltime." seconds<br />";

reset_arr($arr);
// worst case for the comparison approach: every element is false,
// so the test succeeds and the assignment happens anyway
for ($i = 0; $i < 10000; $i++)
    if (!$arr[$i])
        $arr[$i] = true;

$mtime = explode(" ", microtime());
$secondtime = $mtime[1] + $mtime[0];
$totaltime = ($secondtime - $firsttime);
echo "worst case comparison in ".$totaltime." seconds<br />";

reset_arr($arr);
// best case for the comparison approach: the test fails every time,
// so only the comparison is paid and no assignment is made
for ($i = 0; $i < 10000; $i++)
    if ($arr[$i])
        $arr[$i] = true;

$mtime = explode(" ", microtime());
$thirdtime = $mtime[1] + $mtime[0];
$totaltime = ($thirdtime - $secondtime);
echo "best case comparison in ".$totaltime." seconds<br />";

function reset_arr(&$arr) {
    for ($i = 0; $i < 10000; $i++)
        $arr[$i] = false;
}
I believe if comparison and assignment statements are both atomic (i.e., one processor instruction) and the loop executes n times, then in the worst case comparing and then assigning would require n+1 executions (comparing on every iteration plus doing the assignment once), whereas constantly assigning the bool would require n executions. Therefore the second one is more efficient.
Depends on the language. However, looping through arrays can be costly as well. If the array is in consecutive memory, the fastest is to write 1 bits (255s) across the entire array with memset, assuming your language/compiler can do this.
That way you perform essentially one block write in total, instead of reading/writing the loop variable and the array element (2 reads / 2 writes per iteration) several hundred times.
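A very rough C# analogue of that block-fill idea, purely as a sketch: it assumes a bool[] whose elements occupy one byte each, and it needs to be compiled with /unsafe.

static unsafe void FillTrue(bool[] array)
{
    fixed (bool* p = array)
    {
        long* chunks = (long*)p;
        int chunkCount = array.Length / 8;
        for (int i = 0; i < chunkCount; i++)
            chunks[i] = 0x0101010101010101; // eight 'true' bytes written at once
        for (int i = chunkCount * 8; i < array.Length; i++)
            p[i] = true;                    // remaining tail elements
    }
}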
I really wouldn't expect there to be any kind of noticeable performance difference for something as trivial as this, so surely it comes down to what gives you clear, more readable code. In my opinion that would be always assigning true.
Might give this a try:
if(!array[i])
array[i]=true;
But really the only way to know for sure is to profile; I'm sure pretty much any compiler would see the comparison to false as unnecessary and optimize it out.
It all depends on the data type. Assigning booleans is faster than first comparing them. But that may not be true for larger value-based datatypes.
As others have noted, this is micro-optimization.
(In politics or journalism, this is known as navel-gazing ;-)
Is the program large enough to have more than a couple layers of function/method/subroutine calls?
If so, it probably has some avoidable calls, and those can waste hundreds of times as much time as low-level inefficiencies.
On the assumption that you have removed those (which few people do), then by all means run it 10^9 times under a stopwatch, and see which is faster.
Why would you even write the first version? What's the benefit of checking to see if something is false before setting it true. If you always are going to set it true, then always set it true.
When you have a performance bottleneck that you've traced back to setting a single boolean value unnecessarily, come back and talk to us.
I remember that in one book about assembly language the author claimed that the if condition should be avoided if possible.
It is much slower if the condition is false and execution has to jump to another line, considerably slowing down performance. Also, since programs are ultimately executed as machine code, I think 'if' is slower in every (compiled) language, unless its condition is true almost all the time.
If you just want to flip the values, then do:
array[i] = !array[i];
Performance using this is actually worse, though: instead of only having to do a single check for a true/false value and then setting it, it checks twice.
If you declare a 1,000,000-element array in a true, false, true, false pattern, comparison is slower; (var b = !b) essentially does the check twice instead of once.
I don't have a background in C/C++ or related lower-level languages, so I've never run into pointers before. I'm a game dev working primarily in C# and I finally decided to move to an unsafe context this morning for some performance-critical sections of code (and please, no "don't use unsafe" answers like I've read so many times while doing research, as it's already yielding me around 6 times the performance in certain areas, with no issues so far, plus I love the ability to do stuff like reverse arrays with no allocation). Anyhow, there's a certain situation where I expected no difference, or even a possible decrease in speed, and I'm saving a lot of ticks in reality (I'm talking about double the speed in some instances). This benefit seems to decrease with the number of iterations, which I don't fully understand.
This is the situation:
int x = 0;
for(int i = 0; i < 100; i++)
x++;
Takes, on average about 15 ticks.
EDIT: The following is unsafe code, though I assumed that was a given.
int x = 0, i = 0;
int* i_ptr;
for(i_ptr = &i; *i_ptr < 100; (*i_ptr)++)
x++;
Takes about 7 ticks, on average.
As I mentioned, I don't have a low-level background and I literally just started using pointers this morning, at least directly, so I'm probably missing quite a bit of info. So my first query is: why is the pointer more performant in this case? It isn't an isolated instance, and there are of course a lot of other variables at that specific point in time in relation to the PC, but I'm getting these results very consistently across a lot of tests.
In my head, the operations are as such:
No pointer:
Get address of i
Get value at address
Pointer:
Get address of i_ptr
Get address of i from i_ptr
Get value at address
In my head, there must surely be more overhead, however ridiculously negligible, from using a pointer here. How is it that a pointer is consistently more performant than the direct variable in this case? These are all on the stack as well, of course, so it's not dependent on where they end up being stored, from what I can tell.
As touched on earlier, the caveat is that this bonus decreases with the number of iterations, and pretty fast. I took out the extremes from the following data to account for background interference.
At 1000 iterations, they are both identical at 30 to 34 ticks.
At 10000 iterations, the pointer is slower by about 20 ticks.
Jump up to 10000000 iterations, and the pointer is slower by about 10000 ticks or so.
My assumption is that the decrease comes from the extra step I covered earlier, given that there is an additional lookup, which brings me back to wonder why it's more performant with a pointer than without at low loop counts. At the very least, I'd assume they would be more or less identical (which they are in practice, I suppose, but a difference of 8 ticks from millions of repeated tests is pretty definitive to me) up until the very rough threshold I found somewhere between 100 and 1000 iterations.
Apologies if I'm nitpicking somewhat, or if this is a poor question, but I feel as though it will be beneficial to know exactly what is going on under the hood. And if nothing else, I think it's pretty interesting!
Some users suggested that the test results were most likely due to measurement inaccuracies, and it would seem as such, at least up to a point. When averaged across ten million continuous tests, the mean of both is typically equal, though in some cases the use of pointers averages out to an extra tick. Interestingly, when testing as a single case, the use of pointers has a consistently lower execution time than without. There are of course a lot of additional variables at play at the specific points in time at which a test is tried, which makes it somewhat of a pointless pursuit to track this down any further. But the result is that I've learned some more about pointers, which was my primary goal, and so I'm pleased with the test.
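For anyone repeating the experiment, here is a minimal sketch of one way such a micro-measurement could be stabilized; the helper name and the repetition count are illustrative, not part of the original tests.

static long AverageTicks(System.Action body, int repetitions)
{
    body();                                        // warm-up so the JIT has compiled the code path
    var sw = System.Diagnostics.Stopwatch.StartNew();
    for (int r = 0; r < repetitions; r++)
        body();
    sw.Stop();
    return sw.ElapsedTicks / repetitions;          // average ticks per run
}

// usage (RunDirectLoop / RunPointerLoop would be hypothetical wrappers around the two loops shown above):
// long direct     = AverageTicks(RunDirectLoop, 1000000);
// long viaPointer = AverageTicks(RunPointerLoop, 1000000);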
Back in the day when I was learning C and assembly we were taught it is better to use simple comparisons to increase speed. So for example if you say:
if(x <= 0)
versus
if(x < 1)
which would execute faster? My argument (which may be wrong) is that the second would almost always execute faster because there is only a single comparison: i.e., is it less than one, yes or no.
Whereas the first will execute fast if the number is less than 0, because this equates to true and there is no need to check the equals, making it as fast as the second; however, it will always be slower if the number is 0 or more, because it then has to do a second comparison to see if it is equal to 0.
I am now using C#, and while developing for desktops speed is not an issue (at least not to the degree that this point is worth arguing), I still think such arguments need to be considered, as I am also developing for mobile devices, which are much less powerful than desktops, and speed does become an issue on such devices.
For further consideration, I am talking about whole numbers (no decimals) and values that cannot legitimately be negative, like -1 or -12,345, etc. (unless there is an error): for example, when dealing with lists or arrays, where you can't have a negative number of items but you want to check whether a list is empty (or, if there is a problem, the value of x is set to a negative number to indicate the error; an example is where there are some items in a list but you cannot retrieve the whole list for some reason, and to indicate this you set the number to negative, which is not the same as saying there are no items).
For the reason above I deliberately left out the obvious
if(x == 0)
and
if(x.isnullorempty())
and other such items for detecting a list with no items.
Again, for consideration, we are talking about the possibility of retrieving items from a database, perhaps using SQL stored procedures that have the functionality mentioned (i.e. the standard, at least in this company, is to return a negative number to indicate a problem).
So in such cases, is it better to use the first or the second item above?
They're identical. Neither is faster than the other. They both ask precisely the same question, assuming x is an integer. C# is not assembly. You're asking the compiler to generate the best code to get the effect you are asking for. You aren't specifying how it gets that result.
See also this answer.
My argument (which may be wrong) is the second would almost always execute faster because there is only a single comparison) i.e. is it less than one, yes or no.
Clearly that's wrong. Watch what happens if you assume that's true:
< is faster than <= because it asks fewer questions. (Your argument.)
> is the same speed as <= because it asks the same question, just with an inverted answer.
Thus < is faster than >! But this same argument shows > is faster than <.
"just with an inverted answer" seems to sneak in an additional boolean operation so I'm not sure I follow this answer.
That's wrong (for silicon, it is sometimes correct for software) for the same reason. Consider:
3 != 4 is more expensive to compute than 3 == 4, because it's 3 == 4 with an inverted answer, an additional boolean operation.
3 == 4 is more expensive than 3 != 4, because it's 3 != 4 with an inverted answer, an additional boolean operation.
Thus, 3 != 4 is more expensive than itself.
An inverted answer is just the opposite question, not an additional boolean operation. Or, to be a bit more precise, it's the same comparison with a different mapping of comparison results to the final answer. Both 3 == 4 and 3 != 4 require you to compare 3 and 4. That comparison results in either "equal" or "unequal". The questions just map "equal" and "unequal" to "true" and "false" differently. Neither mapping is more expensive than the other.
At least in most cases, no, there's no advantage to one over the other.
A <= does not normally get implemented as two separate comparisons. On a typical (e.g., x86) CPU, you'll have two separate flags, one to indicate equality, and one to indicate negative (which can also mean "less than"). Along with that, you'll have branches that depend on a combination of those flags, so < translates to a jl or jb (jump if less or jump if below --the former is for signed numbers, the latter for unsigned). A <= will translate to a jle or jbe (jump if less than or equal, jump if below or equal).
Different CPUs will use different names/mnemonics for the instructions, but most still have equivalent instructions. In every case of which I'm aware, all of those execute at the same speed.
Edit: Oops -- I meant to mention one possible exception to the general rule I mentioned above. Although it's not exactly from < vs. <=, if/when you can compare to 0 instead of any other number, you can sometimes gain a little (minuscule) advantage. For example, let's assume you had a variable you were going to count down until you reached some minimum. In a case like this, you might well gain a little advantage if you can count down to 0 instead of counting down to 1. The reason is fairly simple: the flags I mentioned previously are affected by most instructions. Let's assume you had something like:
do {
// whatever
} while (--i >= 1);
A compiler might translate this to something like:
loop_top:
; whatever
dec i
cmp i, 1
jge loop_top
If, instead, you compare to 0 (while (--i > 0) or while (--i != 0)), it might translate to something like this instead;
loop_top:
; whatever
dec i
jg loop_top
; or: jnz loop_top
Here the dec sets/clears the zero flag to indicate whether the result of the decrement was zero or not, so the condition can be based directly on the result from the dec, eliminating the cmp used in the other code.
I should add, however, that while this was quite effective, say, 30+ years ago, most modern compilers can handle translations like this without your help (though some compilers may not, especially for things like small embedded systems). IOW, if you care about optimization in general, it's barely possible that you might someday care -- but at least to me, application to C# seems doubtful at best.
Most modern hardware has built-in instructions for checking the less-than-or-equal condition in a single instruction that executes exactly as fast as the one checking the less-than condition. The argument that applied to (much) older hardware no longer applies - choose the alternative that you think is most readable, i.e. the one that better conveys your idea to the readers of your code.
Here are my functions:
public static void TestOne()
{
Boolean result;
Int32 i = 2;
for (Int32 j = 0; j < 1000000000; ++j)
result = (i < 1);
}
public static void TestTwo()
{
Boolean result;
Int32 i = 2;
for (Int32 j = 0; j < 1000000000; ++j)
result = (i <= 0);
}
Here is the IL code, which is identical:
L_0000: ldc.i4.2
L_0001: stloc.0
L_0002: ldc.i4.0
L_0003: stloc.1
L_0004: br.s L_000a
L_0006: ldloc.1
L_0007: ldc.i4.1
L_0008: add
L_0009: stloc.1
L_000a: ldloc.1
L_000b: ldc.i4 1000000000
L_0010: blt.s L_0006
L_0012: ret
After a few testing sessions, obviously, the result is that neither is faster than the other. The difference is only a few milliseconds, which can't be considered a real difference, and the produced IL output is the same anyway.
Both ARM and x86 processors have dedicated instructions for both "less than" and "less than or equal" (which could also be evaluated as "NOT greater than"), so there will be absolutely no real-world difference if you use any semi-modern compiler.
While refactoring, if you change your mind about the logic, if(x<=0) is faster (and less error-prone) to negate (i.e. if(!(x<=0))), compared to if(!(x<1)), which is easy to negate incorrectly; but that's probably not the kind of performance you're referring to. ;-)
If x<1 is faster, then modern compilers will change x<=0 to x<1 (assuming x is an integral type). So for modern compilers this should not matter, and they should produce identical machine code.
Even if x<=0 compiled to different instructions than x<1, the performance difference would be so minuscule as not to be worth worrying about most of the time; there will very likely be other, more productive areas for optimisation in your code. The golden rule is to profile your code and optimise the bits that ARE actually slow in the real world, not the bits that you think hypothetically may be slow, or are not as fast as they theoretically could be. Also concentrate on making your code readable to others, and not on phantom micro-optimisations that disappear in a puff of compiler smoke.
@Francis Rodgers, you said:
Whereas the first will execute fast if the number is less than 0
because this equates to true there is no need to check the equals
making it as fast as the second, however, it will always be slower if
the number is 0 or more because it has to then do a second comparison
to see if it is equal to 0.
and (in comments),
Can you explain where > is the same as <= because this doesn't make
sense in my logical world. For example, <=0 is not the same as >0 in
fact, totally opposite. I would just like an example so I can
understand your answer better
You ask for help, and you need help. I really want to help you, and I’m afraid that many other people need this help too.
Begin with the most basic thing. Your idea that testing for > is not the same as testing for <= is logically wrong (not just in any particular programming language). Look at these diagrams, relax, and think about it. What happens if you know that X <= Y in A and in B? What happens if you know that X > Y in each diagram?
Right, nothing changed; they are equivalent. The key detail of the diagrams is that true and false in A and B are on opposite sides. The meaning of that is that the compiler (or, in general, the coder) has the freedom to reorganize the program flow in such a way that both questions are equivalent. That means there is no need to split <= into two steps, only to reorganize your flow a little. Only a very bad compiler or interpreter will be unable to do that. Nothing to do yet with any assembler. The idea is that even for CPUs without sufficient flags for all comparisons, the compiler can generate (pseudo) assembler code that uses the test best suited to the CPU's characteristics. And given the ability of CPUs to check more than one flag in parallel at the electronic level, the compiler's job is much simpler.
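For what it's worth, the equivalence can be spot-checked trivially in C# (a throwaway sketch, not from the original answer):

static void CheckEquivalence(int x, int y)
{
    // The two tests ask the same question; only the mapping to true/false is inverted.
    System.Diagnostics.Debug.Assert((x <= y) == !(x > y));
}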
You may find it curious/interesting to read pages 3-14 to 3-15 and 5-3 to 5-5 (the latter include the jump instructions, which could be surprising for you): http://download.intel.com/products/processor/manual/325462.pdf
Anyway, I’d like to discuss more about related situations.
Comparing with 0 or with 1: @Jerry Coffin has a very good explanation at the assembler level. Going deeper, at the machine-code level, the variant that compares with 1 needs to "hard code" the 1 into the CPU instruction and load it into the CPU, while the other variant manages to avoid that. Anyway, the gain here is absolutely small; I don't think it will be measurable in any real-life situation. As a side comment, the instruction cmp i, 1 will just perform a sort of subtraction i-1 (without saving the result) while setting the flags, so you actually end up comparing with 0 anyway!
More important could be this situation: comparing X<=Y or Y>=X, which are obviously logically equivalent, could have severe side effects if X and Y are expressions that need to be evaluated and could influence each other's results! That is very bad, and potentially undefined.
Now, coming back to the diagrams, and looking at the assembler examples from @Jerry Coffin too, I see the following issue. Real software is a sort of linear chain in memory. With one of the conditions you jump to another position in program memory to continue, while with the opposite one you just fall through. It could make sense to select the more frequent condition to be the one that just falls through. I don't see how we can give the compiler a hint in these situations, and obviously the compiler can't figure it out by itself. Please correct me if I'm wrong, but this sort of optimization problem is pretty general, and the programmer must decide without the help of the compiler.
But again, in almost any situation I'll write my code with an eye on the general style and readability, not on these small local optimizations.
I have two questions:
1) I need some expert views on writing code that is sound in terms of performance and memory consumption.
2) In terms of performance and memory consumption, how good or bad is the following piece of code, and why?
I need to increment a counter that can go up to a maximum of 100, and I am writing code like this:
Some sample code is as follows:
for(int i=0;i=100;i++)
{
Some Code
}
for(long i=0;i=1000;i++)
{
Some Code
}
How good would it be to use Int16 or anything else instead of int or long, if the requirement is the same?
I need to increment a counter that can go up to a maximum of 100, and I am writing code like this:
Options given:
for(int i=0;i=100;i++)
for(long i=0;i=1000;i++)
EDIT: As noted, neither of these would even actually compile, due to the middle expression being an assignment rather than an expression of type bool.
This demonstrates a hugely important point: get your code working before you make it fast. Your two loops don't do the same thing - one has an upper bound of 1000, the other has an upper bound of 100. If you have to choose between "fast" and "correct", you almost always want to pick "correct". (There are exceptions to this, of course - but that's usually in terms of absolute correctness of results across large amounts of data, not code correctness.)
Changing between the variable types here is unlikely to make any measurable difference. That's often the case with micro-optimizations. When it comes to performance, architecture is usually much more important than in-method optimizations - and it's also a lot harder to change later on. In general, you should:
Write the cleanest code you can, using types that represent your data most correctly and simply
Determine reasonable performance requirements
Measure your clean implementation
If it doesn't perform well enough, use profiling etc to work out how to improve it
DateTime dtStart = DateTime.Now;
for (int i = 0; i < 10000; i++)
{
    // some code
}
Response.Write((DateTime.Now - dtStart).TotalMilliseconds.ToString());
Do the same for long as well, and you will know which one is better... ;)
When you are doing things that require a number representing iterations, or the quantity of something, you should always use int unless you have a good semantic reason to use a different type (i.e. the data can never be negative, or it could be bigger than 2^31). Additionally, worrying about this sort of nano-optimization concern will basically never matter when writing C# code.
That being said, if you are wondering about the differences between things like this (incrementing a 4-byte register versus incrementing 8 bytes), you can always consult Mr. Agner's wonderful instruction tables.
On an AMD64 machine, incrementing long takes the same amount of time as incrementing int.**
On a 32 bit x86 machine, incrementing int will take less time.
** The same is true for almost all logic and math operations, as long as the value is not both memory bound and unaligned. In .NET a long will always be aligned, so the two will always be the same.
I recently changed
this.FieldValues = new object[2, fieldValues.GetUpperBound(1) + 1];
for (int i = 0; i < FieldCount; i++)
{
this.FieldValues[Current, i] = fieldValues[Current, i];
this.FieldValues[Original, i] = fieldValues[Original, i];
}
to
FieldValues = new object[2, fieldValues.GetLength(1)];
Array.Copy(fieldValues, FieldValues, FieldValues.Length);
Where the values of Current and Original are constants 0 and 1 respectively. FieldValues is a field and fieldValues is a parameter.
In the place I was using it, I found the Array.Copy() version to be faster. But another developer says he timed the for-loop against Array.Copy() in a standalone program and found the for-loop faster.
Is it possible that Array.Copy() is not really faster? I thought it was supposed to be super-optimised!
In my own experience, I've found that I can't trust my intuition about anything when it comes to performance. Consequently, I keep a quick-and-dirty benchmarking app around (that I call "StupidPerformanceTricks"), which I use to test these scenarios. This is invaluable, as I've made all sorts of surprising and counter-intuitive discoveries about performance tricks. It's also important to remember to run your benchmark app in release mode, without a debugger attached, as you otherwise don't get JIT optimizations, and those optimizations can make a significant difference: technique A might be slower than technique B in debug mode, but significantly faster in release mode, with optimized code.
That said, in general, my own testing experience indicates that if your array is < ~32 elements, you'll get better performance by rolling your own copy loop - presumably because you don't have the method call overhead, which can be significant. However, if the loop is larger than ~32 elements, you'll get better performance by using Array.Copy(). (If you're copying ints or floats or similar sorts of things, you might also want to investigate Buffer.BlockCopy(), which is ~10% faster than Array.Copy() for small arrays.)
But all that said, the real answer is, "Write your own tests that match these precise alternatives as closely as possible, wrap them each with a loop, give the loop enough iterations for it to chew up at least 2-3 seconds worth of CPU, and then compare the alternatives yourself."
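As a concrete starting point, a rough sketch of that kind of head-to-head test might look like this; the element type, array size, and iteration count are assumptions for illustration.

using System;
using System.Diagnostics;

class CopyBenchmark
{
    static void Main()
    {
        int[] source = new int[32];
        int[] destination = new int[32];
        const int iterations = 10000000;

        Stopwatch sw = Stopwatch.StartNew();
        for (int n = 0; n < iterations; n++)
            for (int i = 0; i < source.Length; i++)
                destination[i] = source[i];                 // hand-rolled copy loop
        sw.Stop();
        Console.WriteLine("for loop:   {0} ms", sw.ElapsedMilliseconds);

        sw = Stopwatch.StartNew();
        for (int n = 0; n < iterations; n++)
            Array.Copy(source, destination, source.Length);
        sw.Stop();
        Console.WriteLine("Array.Copy: {0} ms", sw.ElapsedMilliseconds);

        sw = Stopwatch.StartNew();
        for (int n = 0; n < iterations; n++)
            Buffer.BlockCopy(source, 0, destination, 0, source.Length * sizeof(int));
        sw.Stop();
        Console.WriteLine("BlockCopy:  {0} ms", sw.ElapsedMilliseconds);
    }
}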
The way .Net works under the hood, I'd guess that in an optimized situation, Array.Copy would avoid bounds checking.
If you do a loop on any type of collection, by default the CLR will check to make sure you're not going past the end of the collection, and then the JIT will either have to do a runtime assessment or emit code that doesn't need checking. (Check the article in my comment for better details of this.)
You can modify this behaviour, but generally you don't save that much. Unless you're in a tightly executed inner loop where every millisecond counts, that is.
If the Array is large, I'd use Array.Copy, if it's small, either should perform the same.
I do think it's bounds checking that's creating the different results for you though.
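To illustrate the bounds-checking point, here is a small sketch; whether the check is actually elided depends on the JIT, so treat the comments as typical behaviour rather than a guarantee.

static int SumKnownBound(int[] values)
{
    int sum = 0;
    for (int i = 0; i < values.Length; i++)   // bound tied to values.Length: the JIT can usually drop the per-access check
        sum += values[i];
    return sum;
}

static int SumExternalBound(int[] values, int count)
{
    int sum = 0;
    for (int i = 0; i < count; i++)           // bound not provably within range: the check is generally kept
        sum += values[i];
    return sum;
}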
In your particular example, there is a factor that might (in theory) indicate the for loop is faster.
Array.Copy is an O(n) operation while your for loop is O(n/2), where n is the total size of your matrix.
Array.Copy needs to loop through all the elements in your two-dimensional array because:
When copying between multidimensional arrays, the array behaves like a
long one-dimensional array, where the rows (or columns) are
conceptually laid end to end. For example, if an array has three rows
(or columns) with four elements each, copying six elements from the
beginning of the array would copy all four elements of the first row
(or column) and the first two elements of the second row (or column).