This question already has answers here:
Performance difference for control structures 'for' and 'foreach' in C#
(9 answers)
Closed 9 years ago.
In this thread we discussed about performance analysis of the for loop and foreach.
Which one gives better performance - the for or foreach?
Here are two simple methods:
public static void TestFor()
{
Stopwatch stopwatch = Stopwatch.StartNew();
int[] myInterger = new int[1];
int total = 0;
for (int i = 0; i < myInterger.Length; i++)
{
total += myInterger[i];
}
stopwatch.Stop();
Console.WriteLine("for loop Time Elapsed={0}", stopwatch.Elapsed);
}
public static void TestForeach()
{
Stopwatch stopwatchForeach = Stopwatch.StartNew();
int[] myInterger1 = new int[1];
int totall = 0;
foreach (int i in myInterger1)
{
totall += i;
}
stopwatchForeach.Stop();
Console.WriteLine("foreach loop Time Elapsed={0}", stopwatchForeach.Elapsed);
}
Then I ran the above code the result was foreach loop Time Elapsed=00:00:00.0000003,
for loop Time Elapsed= 00:00:00.0001462. I think we want high performance code. We would use foreach
My decision would not be based on some simple performance loop like this. I am assuming that you have a frequent use of loops/large data sets. You will not notice the difference until we start talking about iterations in the hundreds of thousands (at a minimum).
1) If you are writing applications with potential memory pressured frameworks (XBOX, Windows Phone, Silverlight). I would use the for loop as foreach can leave lightweight "garbage" that can be left behind for collectioning. When I was doing XBOX development years ago for games, a common trick was to initialize a fixed array of items shown on a screen using a for loop and keep that in memory and then you don't have to worry about garbage collection/memory adjustments/garbage collection etc. This can be an issue if you have a loop like this called 60+ times/second (i.e. games)
2) If you have a very large set you are iterating AND performance is your key decision driver (remember these numbers are not going to be noticeable unless they are large), then you may want to look at parallelizing your code. The difference then might not be for vs foreach, but Parallel.For vs Parallel.Foreach vs PLINQ (AsParallel() method). You have have different threads tackle the problem.
Edit: In a production application, you are more than likely going to have some kind of logic in your loops which will take >>> time to iterate an item. Once you add that to the mix performance drivers usually shift to the actual logic not optimizing iterations (which the compiler does pretty well).
Related
Given the task to improve the performance of a piece of code, I have came across the following phenomenon. I have a large collection of reference types in a generic Queue and I'm removing and processing the element one by one, then add them to another generic collection.
It seems the larger the elements are the more time it takes to add the element to the collection.
Trying to narrow down the problem to the relevant part of the code, I've written a test (omitting the processing of elements, just doing the insert):
class Small
{
public Small()
{
this.s001 = "001";
this.s002 = "002";
}
string s001;
string s002;
}
class Large
{
public Large()
{
this.s001 = "001";
this.s002 = "002";
...
this.s050 = "050";
}
string s001;
string s002;
...
string s050;
}
static void Main(string[] args)
{
const int N = 1000000;
var storage = new List<object>(N);
for (int i = 0; i < N; ++i)
{
//storage.Add(new Small());
storage.Add(new Large());
}
List<object> outCollection = new List<object>();
Stopwatch sw = new Stopwatch();
sw.Start();
for (int i = N-1; i > 0; --i)
{
outCollection.Add(storage[i];);
}
sw.Stop();
Console.WriteLine(sw.ElapsedMilliseconds);
}
On the test machine, using the Small class, it takes about 25-30 ms to run, while it takes 40-45 ms with Large.
I know that the outCollection has to grow from time to time to be able to store all the items, so there is some dynamic memory allocation. But given an initial collection size even makes the difference more obvious: 11-12 ms with Small and 35-38 ms with Large objects.
I am somewhat surprised, as these are reference types, so I was expecting the collections to work only with references to the Small/Large instances. I have read Eric Lippert's relevant article that and know that references should not be treated as pointers. At the same time, AFAIK currently they are implemented as pointers and their size and the collection's performance should be independent of element size.
I've decided to put up a question here hoping that someone could explain or help me to understand what's happening here. Aside the performance improvement, I'm really curious what is happening behind the scenes.
Update:
Profiling data using the diagnostic tools didn't help me much, although I have to admit I'm not an expert using the profiler. I'll collect more data later today to find where the bottleneck is.
The pressure on the GC is quite high of course, especially with the Large instances. But once the instances are created and stored in the storage collection, and the program enters the loop, there was no collection triggered any more, and memory usage hasn't increased significantly (outCollction already pre-allocated).
Most of the CPU time is of course spent with memory allocation (JIT_New), around 62% and the only other significant entry is Function Name Inclusive Samples Exclusive Samples Inclusive Samples % Exclusive Samples % Module Name
System.Collections.Generic.List`1[System.__Canon].Add with about 7%.
With 1 million items the preallocated outCollection size is 8 million bytes (the same as the size of storage); one can suspect 64 bit addresses being stored in the collections.
Probably I'm not using the tools properly or don't have the experience to interpret the results correctly, but the profiler didn't help me to get closer to the cause.
If the loop is not triggering collections and it only copies pointers between 2 pre-allocated collections, how could the item size cause any difference? Cache hit/miss ratio is supposed to be the more or less the same in both cases, as the loop is iteration over a list of "addresses" in both cases.
Thanks for all the help so far, I will collect more data, and put an update here if anything found.
I suspect that at least one action in the above (maybe some type checks) will require a de-reference. Then the fact that many Smalls are probably sat close together on the heap and thus sharing cache lines could account for some amount of difference (certainly many more of them could share a single cache line than Larges).
Added to which you are also accessing them in the reverse order in which they were allocated which maximises such a benefit.
I'm refactoring my app to make it faster. I was looking for tips on doing so, and found this statement:
"ForEach can simplify the code in a For loop but it is a heavy object and is slower than a loop written using For."
Is that true? If it was true when it was written, is it still true today, or has foreach itself been refactored to improve performance?
I have the same question about this tip from the same source:
"Where possible use arrays instead of collections. Arrays are normally more efficient especially for value types. Also, initialize collections to their required size when possible."
UPDATE
I was looking for performance tips because I had a database operation that was taking several seconds.
I have found that the "using" statement is a time hog.
I completely solved my performance problem by reversing the for loop and the "using" (of course, refactoring was necessary for this to work).
The slower-than-molasses code was:
for (int i = 1; i <= googlePlex; i++) {
. . .
using (OracleCommand ocmd = new OracleCommand(insert, oc)) {
. . .
InsertRecord();
. . .
The faster-than-a-speeding-bullet code is:
using (OracleCommand ocmd = new OracleCommand(insert, oc)) {
for (int i = 1; i <= googlePlex; i++) {
. . .
InsertRecord();
. . .
Short answer:
Code that is hard to read eventually results in software that behaves and performs poorly.
Long answer:
There was a culture of micro-optimization suggestions in early .NET. Partly it was because a few Microsoft's internal tools (such as FxCop) had gained popularity among general public. Partly it was because C# had and has aspirations to be a successor to assembly, C, and C++ regarding the unhindered access to raw hardware performance in the few hottest code paths of a performance critical application. This does require more knowledge and discipline than a typical application, of course. The consequences of performance related decisions in framework code and in app code are also quite different.
The net impact of this on C# coding culture has been positive, of course; but it would be ridiculous to stop using foreach or is or "" just in order to save a couple CIL instructions that your recent jitter could probably optimize away completely if it wanted to.
There are probably very many loops in your app and probably at most one of them might be a current performance bottleneck. "Optimizing" a non-bottleck for perfomance at the expense of readability is a very bad deal.
It's true in many cases that foreach is slower than an equivalent for. It's also true that
for (int i = 0; i < myCollection.Length; i++) // Compiler must re-evaluate getter because value may have changed
is slower than
int max = myCollection.Length;
for (int i = 0; i < max; i++)
But that probably will not matter at all. For a very detailed discussion see Performance difference for control structures 'for' and 'foreach' in C#
Have you done any profiling to determine the hot spots of your application? I would be astonished if the loop management overhead is where you should be focusing your attention.
You should try profiling your code with Red Gate ANTS or something of that ilk - you will be surprised.
I found that in an application I was writing it was the parameter sniffing in SQL that took up 25% of the processing time. After writing a command cache which sniffed the params at the start of the application, there was a big speed boost.
Unless you are doing a large amount of nested for loops, I don't think you will see much of a performance benefit from changing your loops. I can't imagine anything but a real time application such as a game or a large number crunching or scientific application would need that kind of optimisation.
Yes. The classic for is a bit faster than a foreach as the iteration is index based instead of access the element of the collection thought an enumerator
static void Main()
{
const int m = 100000000;
//just to create an array
int[] array = new int[100000000];
for (int x = 0; x < array.Length; x++) {
array[x] = x;
}
var s1 = Stopwatch.StartNew();
var upperBound = array.Length;
for (int i = 0; i < upperBound; i++)
{
}
s1.Stop();
GC.Collect();
var s2 = Stopwatch.StartNew();
foreach (var item in array) {
}
s2.Stop();
Console.WriteLine(((double)(s1.Elapsed.TotalMilliseconds *
1000000) / m).ToString("0.00 ns"));
Console.WriteLine(((double)(s2.Elapsed.TotalMilliseconds *
1000000) / m).ToString("0.00 ns"));
Console.Read();
//2.49 ns
//4.68 ns
// In Release Mode
//0.39 ns
//1.05 ns
}
I've been learning a little about parallelism in the last few days, and I came across this example.
I put it side to side with a sequential for loop like this:
private static void NoParallelTest()
{
int[] nums = Enumerable.Range(0, 1000000).ToArray();
long total = 0;
var watch = Stopwatch.StartNew();
for (int i = 0; i < nums.Length; i++)
{
total += nums[i];
}
Console.WriteLine("NoParallel");
Console.WriteLine(watch.ElapsedMilliseconds);
Console.WriteLine("The total is {0}", total);
}
I was surprised to see that the NoParallel method finished way way faster than the parallel example given at the site.
I have an i5 PC.
I really thought that the Parallel method would finish faster.
Is there a reasonable explanation for this? Maybe I misunderstood something?
The sequential version was faster because the time spent doing operations on each iteration in your example is very small and there is a fairly significant overhead involved with creating and managing multiple threads.
Parallel programming only increases efficiency when each iteration is sufficiently expensive in terms of processor time.
I think that's because the loop performs a very simple, very fast operation.
In the case of the non-parallel version that's all it does. But the parallel version has to invoke a delegate. Invoking a delegate is quite fast and usually you don't have to worry how often you do that. But in this extreme case, it's what makes the difference. I can easily imagine that invoking a delegate will be, say, ten times slower (or more, I have no idea what the exact ratio is) than adding a number from an array.
We were having a performance issue in a C# while loop. The loop was super slow doing only one simple math calc. Turns out that parmIn can be a huge number anywhere from 999999999 to MaxInt. We hadn't anticipated the giant value of parmIn. We have fixed our code using a different methodology.
The loop, coded for simplicity below, did one math calc. I am just curious as to what the actual execution time for a single iteration of a while loop containing one simple math calc is?
int v1=0;
while(v1 < parmIn) {
v1+=parmIn2;
}
There is something else going on here. The following will complete in ~100ms for me. You say that the parmIn can approach MaxInt. If this is true, and the ParmIn2 is > 1, you're not checking to see if your int + the new int will overflow. If ParmIn >= MaxInt - parmIn2, your loop might never complete as it will roll back over to MinInt and continue.
static void Main(string[] args)
{
int i = 0;
int x = int.MaxValue - 50;
int z = 42;
System.Diagnostics.Stopwatch st = new System.Diagnostics.Stopwatch();
st.Start();
while (i < x)
{
i += z;
}
st.Stop();
Console.WriteLine(st.Elapsed.Milliseconds.ToString());
Console.ReadLine();
}
Assuming an optimal compiler, it should be one operation to check the while condition, and one operation to do the addition.
The time, small as it is, to execute just one iteration of the loop shown in your question is ... surprise ... small.
However, it depends on the actual CPU speed and whatnot exactly how small it is.
It should be just a few machine instructions, so not many cycles to pass once through the iteration, but there could be a few cycles to loop back up, especially if branch prediction fails.
In any case, the code as shown either suffers from:
Premature optimization (in that you're asking about timing for it)
Incorrect assumptions. You can probably get a much faster code if parmIn is big by just calculating how many loop iterations you would have to perform, and do a multiplication. (note again that this might be an incorrect assumption, which is why there is only one sure way to find performance issues, measure measure measure)
What is your real question?
It depends on the processor you are using and the calculation it is performing. (For example, even on some modern architectures, an add may take only one clock cycle, but a divide may take many clock cycles. There is a comparison to determine if the loop should continue, which is likely to be around one clock cycle, and then a branch back to the start of the loop, which may take any number of cycles depending on pipeline size and branch prediction)
IMHO the best way to find out more is to put the code you are interested into a very large loop (millions of iterations), time the loop, and divide by the number of iterations - this will give you an idea of how long it takes per iteration of the loop. (on your PC). You can try different operations and learn a bit about how your PC works. I prefer this "hands on" approach (at least to start with) because you can learn so much more from physically trying it than just asking someone else to tell you the answer.
The while loop is couple of instructions and one instruction for the math operation. You're really looking at a minimal execution time for one iteration. it's the sheer number of iterations you're doing that is killing you.
Note that a tight loop like this has implications on other things as well, as it bogs down one CPU and it blocks the UI thread (if it's running on it). Thus, not only it is slow due to the number of operations, it also adds a perceived perf impact due to making the whole machine look unresponsive.
If you're interested in the actual execution time, why not time it for yourself and find out?
int parmIn = 10 * 1000 * 1000; // 10 million
int v1=0;
Stopwatch sw = Stopwatch.StartNew();
while(v1 < parmIn) {
v1+=parmIn2;
}
sw.Stop();
double opsPerSec = (double)parmIn / sw.Elapsed.TotalSeconds;
And, of course, the time for one iteration is 1/opsPerSec.
Whenever someone asks about how fast control structures in any language you know they are trying to optimize the wrong thing. If you find yourself changing all your i++ to ++i or changing all your switch to if...else for speed you are micro-optimizing. And micro optimizations almost never give you the speed you want. Instead, think a bit more about what you are really trying to do and devise a better way to do it.
I'm not sure if the code you posted is really what you intend to do or if it is simply the loop stripped down to what you think is causing the problem. If it is the former then what you are trying to do is find the largest value of a number that is smaller than another number. If this is really what you want then you don't really need a loop:
// assuming v1, parmIn and parmIn2 are integers,
// and you want the largest number (v1) that is
// smaller than parmIn but is a multiple of parmIn2.
// AGAIN, assuming INTEGER MATH:
v1 = (parmIn/parmIn2)*parmIn2;
EDIT: I just realized that the code as originally written gives the smallest number that is a multiple of parmIn2 that is larger than parmIn. So the correct code is:
v1 = ((parmIn/parmIn2)*parmIn2)+parmIn2;
If this is not what you really want then my advise remains the same: think a bit on what you are really trying to do (or ask on Stackoverflow) instead of trying to find out weather while or for is faster. Of course, you won't always find a mathematical solution to the problem. In which case there are other strategies to lower the number of loops taken. Here's one based on your current problem: keep doubling the incrementer until it is too large and then back off until it is just right:
int v1=0;
int incrementer=parmIn2;
// keep doubling the incrementer to
// speed up the loop:
while(v1 < parmIn) {
v1+=incrementer;
incrementer=incrementer*2;
}
// now v1 is too big, back off
// and resume normal loop:
v1-=incrementer;
while(v1 < parmIn) {
v1+=parmIn2;
}
Here's yet another alternative that speeds up the loop:
// First count at 100x speed
while(v1 < parmIn) {
v1+=parmIn2*100;
}
// back off and count at 50x speed
v1-=parmIn2*100;
while(v1 < parmIn) {
v1+=parmIn2*50;
}
// back off and count at 10x speed
v1-=parmIn2*50;
while(v1 < parmIn) {
v1+=parmIn2*10;
}
// back off and count at normal speed
v1-=parmIn2*10;
while(v1 < parmIn) {
v1+=parmIn2;
}
In my experience, especially with graphics programming where you have millions of pixels or polygons to process, speeding up code usually involve adding even more code which translates to more processor instructions instead of trying to find the fewest instructions possible for the task at hand. The trick is to avoid processing what you don't have to.
I was trying to figure out if a for loop was faster than a foreach loop and was using the System.Diagnostics classes to time the task. While running the test I noticed that which ever loop I put first always executes slower then the last one. Can someone please tell me why this is happening? My code is below:
using System;
using System.Diagnostics;
namespace cool {
class Program {
static void Main(string[] args) {
int[] x = new int[] { 3, 6, 9, 12 };
int[] y = new int[] { 3, 6, 9, 12 };
DateTime startTime = DateTime.Now;
for (int i = 0; i < 4; i++) {
Console.WriteLine(x[i]);
}
TimeSpan elapsedTime = DateTime.Now - startTime;
DateTime startTime2 = DateTime.Now;
foreach (var item in y) {
Console.WriteLine(item);
}
TimeSpan elapsedTime2 = DateTime.Now - startTime2;
Console.WriteLine("\nSummary");
Console.WriteLine("--------------------------\n");
Console.WriteLine("for:\t{0}\nforeach:\t{1}", elapsedTime, elapsedTime2);
Console.ReadKey();
}
}
}
Here is the output:
for: 00:00:00.0175781
foreach: 00:00:00.0009766
Probably because the classes (e.g. Console) need to be JIT-compiled the first time through. You'll get the best metrics by calling all methods (to JIT them (warm then up)) first, then performing the test.
As other users have indicated, 4 passes is never going to be enough to to show you the difference.
Incidentally, the difference in performance between for and foreach will be negligible and the readability benefits of using foreach almost always outweigh any marginal performance benefit.
I would not use DateTime to measure performance - try the Stopwatch class.
Measuring with only 4 passes is never going to give you a good result. Better use > 100.000 passes (you can use an outer loop). Don't do Console.WriteLine in your loop.
Even better: use a profiler (like Redgate ANTS or maybe NProf)
I am not so much in C#, but when I remember right, Microsoft was building "Just in Time" compilers for Java. When they use the same or similar techniques in C#, it would be rather natural that "some constructs coming second perform faster".
For example it could be, that the JIT-System sees that a loop is executed and decides adhoc to compile the whole method. Hence when the second loop is reached, it is yet compiled and performs much faster than the first. But this is a rather simplistic guess of mine. Of course you need a far greater insight in the C# runtime system to understand what is going on. It could also be, that the RAM-Page is accessed first in the first loop and in the second it is still in the CPU-cache.
Addon: The other comment that was made: that the output module can be JITed a first time in the first loop seams to me more likely than my first guess. Modern languages are just very complex to find out what is done under the hood. Also this statement of mine fits into this guess:
But also you have terminal-outputs in your loops. They make things yet more difficult. It could also be, that it costs some time to open the terminal a first time in a program.
I was just performing tests to get some real numbers, but in the meantime Gaz beat me to the answer - the call to Console.Writeline is jitted at the first call, so you pay that cost in the first loop.
Just for information though - using a stopwatch rather than the datetime and measuring number of ticks:
Without a call to Console.Writeline before the first loop the times were
for: 16802
foreach: 2282
with a call to Console.Writeline they were
for: 2729
foreach: 2268
Though these results were not consistently repeatable because of the limited number of runs, but the magnitude of difference was always roughly the same.
The edited code for reference:
int[] x = new int[] { 3, 6, 9, 12 };
int[] y = new int[] { 3, 6, 9, 12 };
Console.WriteLine("Hello World");
Stopwatch sw = new Stopwatch();
sw.Start();
for (int i = 0; i < 4; i++)
{
Console.WriteLine(x[i]);
}
sw.Stop();
long elapsedTime = sw.ElapsedTicks;
sw.Reset();
sw.Start();
foreach (var item in y)
{
Console.WriteLine(item);
}
sw.Stop();
long elapsedTime2 = sw.ElapsedTicks;
Console.WriteLine("\nSummary");
Console.WriteLine("--------------------------\n");
Console.WriteLine("for:\t{0}\nforeach:\t{1}", elapsedTime, elapsedTime2);
Console.ReadKey();
The reason why is there are several forms of overhead in the foreach version that are not present in the for loop
Use of an IDisposable.
An additional method call for every element. Each element must be accessed under the hood by using IEnumerator<T>.Current which is a method call. Because it's on an interface it cannot be inlined. This means N method calls where N is the number of elements in the enumeration. The for loop just uses and indexer
In a foreach loop all calls go through an interface. In general this a bit slower than through a concrete type
Please note that the things I listed above are not necessarily huge costs. They are typically very small costs that can contribute to a small performance difference.
Also note, as Mehrdad pointed out, the compilers and JIT may choose to optimize a foreach loop for certain known data structures such as an array. The end result may just be a for loop.
Note: Your performance benchmark in general needs a bit more work to be accurate.
You should use a StopWatch instead of DateTime. It is much more accurate for performance benchmarks.
You should perform the test many times not just once
You need to do a dummy run on each loop to eliminate the problems that come with JITing a method the first time. This probably isn't an issue when all of the code is in the same method but it doesn't hurt.
You need to use more than just 4 values in the list. Try 40,000 instead.
You should be using the StopWatch to time the behavior.
Technically the for loop is faster. Foreach calls the MoveNext() method (creating a method stack and other overhead from a call) on the IEnumerable's iterator, when for only has to increment a variable.
I don't see why everyone here says that for would be faster than foreach in this particular case. For a List<T>, it is (about 2x slower to foreach through a List than to for through a List<T>).
In fact, the foreach will be slightly faster than the for here. Because foreach on an array essentially compiles to:
for(int i = 0; i < array.Length; i++) { }
Using .Length as a stop criteria allows the JIT to remove bounds checks on the array access, since it's a special case. Using i < 4 makes the JIT insert extra instructions to check each iteration whether or not i is out of bounds of the array, and throw an exception if that is the case. However, with .Length, it can guarantee you'll never go outside of the array bounds so the bounds checks are redundant, making it faster.
However, in most loops, the overhead of the loop is insignificant compared to the work done inside.
The discrepancy you're seeing can only be explained by the JIT I guess.
I wouldn't read too much into this - this isn't good profiling code for the following reasons
1. DateTime isn't meant for profiling. You should use QueryPerformanceCounter or StopWatch which use the CPU hardware profile counters
2. Console.WriteLine is a device method so there may be subtle effects such as buffering to take into account
3. Running one iteration of each code block will never give you accurate results because your CPU does a lot of funky on the fly optimisation such as out of order execution and instruction scheduling
4. Chances are the code that gets JITed for both code blocks is fairly similar so is likely to be in the instruction cache for the second code block
To get a better idea of timing, I did the following
Replaced the Console.WriteLine with a math expression ( e^num)
I used QueryPerformanceCounter/QueryPerformanceTimer through P/Invoke
I ran each code block 1 million times then averaged the results
When I did that I got the following results:
The for loop took 0.000676 milliseconds
The foreach loop took 0.000653 milliseconds
So foreach was very slightly faster but not by much
I then did some further experiments and ran the foreach block first and the for block second
When I did that I got the following results:
The foreach loop took 0.000702 milliseconds
The for loop took 0.000691 milliseconds
Finally I ran both loops together twice i.e for + foreach then for + foreach again
When I did that I got the following results:
The foreach loop took 0.00140 milliseconds
The for loop took 0.001385 milliseconds
So basically it looks to me that whatever code you run second, runs very slightly faster but not
enough to be of any significance.
--Edit--
Here are a couple of useful links
How to time managed code using QueryPerformanceCounter
The instruction cache
Out of order execution