PLINQ Performs Worse Than Usual LINQ

PLINQ Performs Worse Than Usual LINQ - c#

Amazingly, using PLINQ did not yield benefits on a small test case I created; in fact, it was even worse than usual LINQ.
Here's the test code:
int repeatedCount = 10000000;
private void button1_Click(object sender, EventArgs e)
{
var currTime = DateTime.Now;
var strList = Enumerable.Repeat(10, repeatedCount);
var result = strList.AsParallel().Sum();
var currTime2 = DateTime.Now;
textBox1.Text = (currTime2.Ticks-currTime.Ticks).ToString();
}
private void button2_Click(object sender, EventArgs e)
{
var currTime = DateTime.Now;
var strList = Enumerable.Repeat(10, repeatedCount);
var result = strList.Sum();
var currTime2 = DateTime.Now;
textBox2.Text = (currTime2.Ticks - currTime.Ticks).ToString();
}
The result?
textbox1: 3437500
textbox2: 781250
So, LINQ is taking less time than PLINQ to complete a similar operation!
What am I doing wrong? Or is there a twist that I don't know about?
Edit: I've updated my code to use stopwatch, and yet, the same behavior persisted. To discount the effect of JIT, I actually tried a few times with clicking both button1 and button2 and in no particular order. Although the time I got might be different, but the qualitative behavior remained: PLINQ was indeed slower in this case.

First: Stop using DateTime to measure run time. Use a Stopwatch instead. The test code would look like:
var watch = new Stopwatch();
var strList = Enumerable.Repeat(10, 10000000);
watch.Start();
var result = strList.Sum();
watch.Stop();
Console.WriteLine("Linear: {0}", watch.ElapsedMilliseconds);
watch.Reset();
watch.Start();
var parallelResult = strList.AsParallel().Sum();
watch.Stop();
Console.WriteLine("Parallel: {0}", watch.ElapsedMilliseconds);
Console.ReadKey();
Second: Running things in Parallel adds overhead. In this case, PLINQ has to figure out the best way to divide your collection so that it can Sum the elements safely in parallel. After that, you need to join the results from the various threads created and Sum those as well. This isn't a trivial task.
Using the code above I can see that using Sum() nets a ~95ms call. Calling .AsParallel().Sum() nets around ~185ms.
Doing a task in Parallel is only a good idea if you gain something by doing it. In this case, Sum is a simple enough task that you don't gain by using PLINQ.

This is a classic mistake -- thinking, "I'll run a simple test to compare the performance of this single-threaded code with this multi-threaded code."
A simple test is the worst kind of test you can run to measure multi-threaded performance.
Typically, parallelizing some operation yields a performance benefit when the steps you're parallelizing require substantial work. When the steps are simple -- as in, quick* -- the overhead of parallelizing your work ends up dwarfing the miniscule performance gain you would have otherwise gotten.
Consider this analogy.
You're constructing a building. If you have one worker, he has to lay bricks one by one until he's made one wall, then do the same for the next wall, and so on until all walls are built and connected. This is a slow and laborious task that could benefit from parallelization.
The right way to do this would be to parallelize the wall building -- hire, say, 3 more workers, and have each worker construct his own wall so that 4 walls can be built simultaneously. The time it takes to find the 3 extra workers and assign them their tasks is insignificant in comparison to the savings you get by getting 4 walls up in the amount of time it would have previously taken to build 1.
The wrong way to do it would be to parallelize the brick laying -- hire about a thousand more workers and have each worker responsible for laying a single brick at a time. You may think, "If one worker can lay 2 bricks per minute, then a thousand workers should be able to lay 2000 bricks per minute, so I'll finish this job in no time!" But the reality is that by parallelizing your workload at such a microscopic level, you're wasting a tremendous amount of energy gathering and coordinating all of your workers, assigning tasks to them ("lay this brick right there"), making sure no one's work is interfering with anyone else's, etc.
So the moral of this analogy is: in general, use parallelization to split up the substantial units of work (like walls), but leave the insubstantial units (like bricks) to be handled in the usual sequential manner.
*For this reason, you can actually make a pretty good approximation of the performance gain of parallelization in a more work-intensive context by taking any fast-executing code and adding Thread.Sleep(100) (or some other random number) to the end of it. Suddenly sequential executions of this code will be slowed down by 100 ms per iteration, while parallel executions will be slowed significantly less.

Others have pointed out some flaws in your benchmarks. Here's a short console app to make it simpler:
using System;
using System.Diagnostics;
using System.Linq;
public class Test
{
const int Iterations = 1000000000;
static void Main()
{
// Make sure everything's JITted
Time(Sequential, 1);
Time(Parallel, 1);
Time(Parallel2, 1);
// Now run the real tests
Time(Sequential, Iterations);
Time(Parallel, Iterations);
Time(Parallel2, Iterations);
}
static void Time(Func<int, int> action, int count)
{
GC.Collect();
Stopwatch sw = Stopwatch.StartNew();
int check = action(count);
if (count != check)
{
Console.WriteLine("Check for {0} failed!", action.Method.Name);
}
sw.Stop();
Console.WriteLine("Time for {0} with count={1}: {2}ms",
action.Method.Name, count,
(long) sw.ElapsedMilliseconds);
}
static int Sequential(int count)
{
var strList = Enumerable.Repeat(1, count);
return strList.Sum();
}
static int Parallel(int count)
{
var strList = Enumerable.Repeat(1, count);
return strList.AsParallel().Sum();
}
static int Parallel2(int count)
{
var strList = ParallelEnumerable.Repeat(1, count);
return strList.Sum();
}
}
Compilation:
csc /o+ /debug- Test.cs
Results on my quad core i7 laptop; runs up to 2 cores fast, or 4 cores more slowly. Basically ParallelEnumerable.Repeat wins, followed by the sequence version, followed by parallelising the normal Enumerable.Repeat.
Time for Sequential with count=1: 117ms
Time for Parallel with count=1: 181ms
Time for Parallel2 with count=1: 12ms
Time for Sequential with count=1000000000: 9152ms
Time for Parallel with count=1000000000: 44144ms
Time for Parallel2 with count=1000000000: 3154ms
Note that earlier versions of this answer were embarrassingly flawed by having the wrong number of elements - I'm much more confident in the results above.

Is it possible you are not taking into account JIT time? You should run your test twice and discard the first set of results.
Also, you shouldn't use DateTime to get performance timing, use the Stopwatch class instead:
var swatch = new Stopwatch();
swatch.StartNew();
var strList = Enumerable.Repeat(10, repeatedCount);
var result = strList.AsParallel().Sum();
swatch.Stop();
textBox1.Text = swatch.Elapsed;
PLINQ does add some overhead to the processing of a sequence. But the magnitute difference in your case seems excessive. PLINQ makes sense when the overhead cost is outweighed by the benefit of running the logic on multiple cores/CPUs. If you don't have multiple core, running processing in parallel offers no real advantage - and PLINQ should detect such a case and perform the processing sequentially.
EDIT: When creating embedded performance tests of this kind, you should make sure that you are not running them under the debugger, or with Intellitrace enabled, as those can significantly skew performance timings.

Something more important that I didn't see mentioned is that .AsParallel will have different performance depending on the collection used.
In my tests PLINQ is faster than LINQ when NOT used on IEnumerable (Enumerable.Repeat) :
29ms PLINQ ParralelQuery
30ms LINQ ParralelQuery
30ms PLINQ Array
38ms PLINQ List
163ms LINQ IEnumerable
211ms LINQ Array
213ms LINQ List
273ms PLINQ IEnumerable
4 processors
Code is in VB, but provided to show that using .ToArray made the PLINQ version few times faster
Dim test = Function(LINQ As Action, PLINQ As Action, type As String)
Dim sw1 = Stopwatch.StartNew : LINQ() : Dim ts1 = sw1.ElapsedMilliseconds
Dim sw2 = Stopwatch.StartNew : PLINQ() : Dim ts2 = sw2.ElapsedMilliseconds
Return {String.Format("{0,4}ms LINQ {1}", ts1, type), String.Format("{0,4}ms PLINQ {1}", ts2, type)}
End Function
Dim results = New List(Of String) From {Environment.ProcessorCount & " processors"}
Dim count = 12345678, iList = Enumerable.Repeat(1, count)
With iList : results.AddRange(test(Sub() .Sum(), Sub() .AsParallel.Sum(), "IEnumerable")) : End With
With iList.ToArray : results.AddRange(test(Sub() .Sum(), Sub() .AsParallel.Sum(), "Array")) : End With
With iList.ToList : results.AddRange(test(Sub() .Sum(), Sub() .AsParallel.Sum(), "List")) : End With
With ParallelEnumerable.Repeat(1, count) : results.AddRange(test(Sub() .Sum(), Sub() .AsParallel.Sum(), "ParralelQuery")) : End With
MessageBox.Show(String.join(Environment.NewLine, From l In results Order By l))
Running the tests in different order will have a bit different results, so having them in one line makes moving them up and down a bit easier for me.

That indeed may be the case because you are increasing the number of context switches and you are not performing any action that would benefit of having threads waiting for something like i/o completion. This is going to be even worse if you are running in a single cpu box.

I'd recommend using the Stopwatch class for timing metrics. In your case it's a better measure of the interval.

Please read the Side Effects section of this article.
http://msdn.microsoft.com/en-us/magazine/cc163329.aspx
I think you can run into many conditions where PLINQ has additional data processing patterns you must understand before you opt to think that is will always purely have faster response times.

Justin's comment about overhead is exactly right.
Just something to consider when writing concurrent software in general, beyond the use of PLINQ:
You always need to be thinking about the "granularity" of your work items. Some problems are very well suited to parallelization because they can be "chunked" at a very high level, like raytracing entire frames concurrently (these sorts of problems are called embarrassingly parallel). When there are very large "chunks" of work, then the overhead of creating and managing multiple threads becomes negligible compared to the actual work that you want to get done.
PLINQ makes concurrent programming easier, but it doesn't mean that you can ignore thinking about the granularity of your work.

Related

How to reliably measure code efficiency/complexity/performance/expensiveness in C#?

My question consists of 2 parts:
Is there any good way in C# to measure computation effort other than using timers such as Stopwatch? Below is what I have been doing, but the granularity is not great, and the result returned varies every time. I am wondering if there is more precise measure such as CPU operation count so that the result returned can be consistent.
Stopwatch stopWatch = new Stopwatch();
stopWatch.Start();
//do work
stopWatch.Stop();
TimeSpan ts = stopWatch.Elapsed;
Console.WriteLine(ts);
If the alternative approach in 1 is not possible, how can I make the performance test result less variate? What are some factors that can make the result change? Would closing all other applications running help? (I did try it but there seems to be no significant effect.) How about running the test on a VM, sandbox, etc.?
(After typing the proceeding text I realized that I also have tried the Performance Analysis feature which comes with Visual Studio. The test result seems more coarse because of the sampling method it uses. So I also want to rule out that option)

You need to get a profiling tool. But you can use StopWatch more reliably if you run your tests in a loop multiple times but only take the results of the test if the garbage collection generation stays the same.
Like this:
var timespans = new List<TimeSpan>();
while (true)
{
var count = GC.CollectionCount(0);
var sw = Stopwatch.StartNew();
/* run test here */
sw.Stop();
if (count == GC.CollectionCount(0))
{
timespans.Add(sw.Elapsed);
}
if (timespans.Count == 100)
{
break;
}
}
That'll give you 100 tests where garbage collection didn't occur. The average is then pretty good to work from.
If you find that your tests never run without invoking a garbage collection then try working out the minimum number of GC's that get triggered and collect your time spans only when that number occurs.

You could query a system performance counter. The msdn doc for the System.Diagnostics.PerformanceCounter class has some examples. With this class you could query "\Process(your_process_name)\% Processor Time" for example. It's an alternative to Stopwatch but tbh I think just using stopwatch and averaging many runs over time is a perfectly good way to go.
If what you need is a higher resolution stopwatch because you are trying to measure a very small slice of cpu time, then you may be interested in the High-Performance Counter.

Why was the parallel version slower than the sequential version in this example?

I've been learning a little about parallelism in the last few days, and I came across this example.
I put it side to side with a sequential for loop like this:
private static void NoParallelTest()
{
int[] nums = Enumerable.Range(0, 1000000).ToArray();
long total = 0;
var watch = Stopwatch.StartNew();
for (int i = 0; i < nums.Length; i++)
{
total += nums[i];
}
Console.WriteLine("NoParallel");
Console.WriteLine(watch.ElapsedMilliseconds);
Console.WriteLine("The total is {0}", total);
}
I was surprised to see that the NoParallel method finished way way faster than the parallel example given at the site.
I have an i5 PC.
I really thought that the Parallel method would finish faster.
Is there a reasonable explanation for this? Maybe I misunderstood something?

The sequential version was faster because the time spent doing operations on each iteration in your example is very small and there is a fairly significant overhead involved with creating and managing multiple threads.
Parallel programming only increases efficiency when each iteration is sufficiently expensive in terms of processor time.

I think that's because the loop performs a very simple, very fast operation.
In the case of the non-parallel version that's all it does. But the parallel version has to invoke a delegate. Invoking a delegate is quite fast and usually you don't have to worry how often you do that. But in this extreme case, it's what makes the difference. I can easily imagine that invoking a delegate will be, say, ten times slower (or more, I have no idea what the exact ratio is) than adding a number from an array.

Why is PLINQ slower? [duplicate]

Amazingly, using PLINQ did not yield benefits on a small test case I created; in fact, it was even worse than usual LINQ.
Here's the test code:
int repeatedCount = 10000000;
private void button1_Click(object sender, EventArgs e)
{
var currTime = DateTime.Now;
var strList = Enumerable.Repeat(10, repeatedCount);
var result = strList.AsParallel().Sum();
var currTime2 = DateTime.Now;
textBox1.Text = (currTime2.Ticks-currTime.Ticks).ToString();
}
private void button2_Click(object sender, EventArgs e)
{
var currTime = DateTime.Now;
var strList = Enumerable.Repeat(10, repeatedCount);
var result = strList.Sum();
var currTime2 = DateTime.Now;
textBox2.Text = (currTime2.Ticks - currTime.Ticks).ToString();
}
The result?
textbox1: 3437500
textbox2: 781250
So, LINQ is taking less time than PLINQ to complete a similar operation!
What am I doing wrong? Or is there a twist that I don't know about?
Edit: I've updated my code to use stopwatch, and yet, the same behavior persisted. To discount the effect of JIT, I actually tried a few times with clicking both button1 and button2 and in no particular order. Although the time I got might be different, but the qualitative behavior remained: PLINQ was indeed slower in this case.

First: Stop using DateTime to measure run time. Use a Stopwatch instead. The test code would look like:
var watch = new Stopwatch();
var strList = Enumerable.Repeat(10, 10000000);
watch.Start();
var result = strList.Sum();
watch.Stop();
Console.WriteLine("Linear: {0}", watch.ElapsedMilliseconds);
watch.Reset();
watch.Start();
var parallelResult = strList.AsParallel().Sum();
watch.Stop();
Console.WriteLine("Parallel: {0}", watch.ElapsedMilliseconds);
Console.ReadKey();
Second: Running things in Parallel adds overhead. In this case, PLINQ has to figure out the best way to divide your collection so that it can Sum the elements safely in parallel. After that, you need to join the results from the various threads created and Sum those as well. This isn't a trivial task.
Using the code above I can see that using Sum() nets a ~95ms call. Calling .AsParallel().Sum() nets around ~185ms.
Doing a task in Parallel is only a good idea if you gain something by doing it. In this case, Sum is a simple enough task that you don't gain by using PLINQ.

Others have pointed out some flaws in your benchmarks. Here's a short console app to make it simpler:
using System;
using System.Diagnostics;
using System.Linq;
public class Test
{
const int Iterations = 1000000000;
static void Main()
{
// Make sure everything's JITted
Time(Sequential, 1);
Time(Parallel, 1);
Time(Parallel2, 1);
// Now run the real tests
Time(Sequential, Iterations);
Time(Parallel, Iterations);
Time(Parallel2, Iterations);
}
static void Time(Func<int, int> action, int count)
{
GC.Collect();
Stopwatch sw = Stopwatch.StartNew();
int check = action(count);
if (count != check)
{
Console.WriteLine("Check for {0} failed!", action.Method.Name);
}
sw.Stop();
Console.WriteLine("Time for {0} with count={1}: {2}ms",
action.Method.Name, count,
(long) sw.ElapsedMilliseconds);
}
static int Sequential(int count)
{
var strList = Enumerable.Repeat(1, count);
return strList.Sum();
}
static int Parallel(int count)
{
var strList = Enumerable.Repeat(1, count);
return strList.AsParallel().Sum();
}
static int Parallel2(int count)
{
var strList = ParallelEnumerable.Repeat(1, count);
return strList.Sum();
}
}
Compilation:
csc /o+ /debug- Test.cs
Results on my quad core i7 laptop; runs up to 2 cores fast, or 4 cores more slowly. Basically ParallelEnumerable.Repeat wins, followed by the sequence version, followed by parallelising the normal Enumerable.Repeat.
Time for Sequential with count=1: 117ms
Time for Parallel with count=1: 181ms
Time for Parallel2 with count=1: 12ms
Time for Sequential with count=1000000000: 9152ms
Time for Parallel with count=1000000000: 44144ms
Time for Parallel2 with count=1000000000: 3154ms
Note that earlier versions of this answer were embarrassingly flawed by having the wrong number of elements - I'm much more confident in the results above.

Is it possible you are not taking into account JIT time? You should run your test twice and discard the first set of results.
Also, you shouldn't use DateTime to get performance timing, use the Stopwatch class instead:
var swatch = new Stopwatch();
swatch.StartNew();
var strList = Enumerable.Repeat(10, repeatedCount);
var result = strList.AsParallel().Sum();
swatch.Stop();
textBox1.Text = swatch.Elapsed;
PLINQ does add some overhead to the processing of a sequence. But the magnitute difference in your case seems excessive. PLINQ makes sense when the overhead cost is outweighed by the benefit of running the logic on multiple cores/CPUs. If you don't have multiple core, running processing in parallel offers no real advantage - and PLINQ should detect such a case and perform the processing sequentially.
EDIT: When creating embedded performance tests of this kind, you should make sure that you are not running them under the debugger, or with Intellitrace enabled, as those can significantly skew performance timings.

Something more important that I didn't see mentioned is that .AsParallel will have different performance depending on the collection used.
In my tests PLINQ is faster than LINQ when NOT used on IEnumerable (Enumerable.Repeat) :
29ms PLINQ ParralelQuery
30ms LINQ ParralelQuery
30ms PLINQ Array
38ms PLINQ List
163ms LINQ IEnumerable
211ms LINQ Array
213ms LINQ List
273ms PLINQ IEnumerable
4 processors
Code is in VB, but provided to show that using .ToArray made the PLINQ version few times faster
Dim test = Function(LINQ As Action, PLINQ As Action, type As String)
Dim sw1 = Stopwatch.StartNew : LINQ() : Dim ts1 = sw1.ElapsedMilliseconds
Dim sw2 = Stopwatch.StartNew : PLINQ() : Dim ts2 = sw2.ElapsedMilliseconds
Return {String.Format("{0,4}ms LINQ {1}", ts1, type), String.Format("{0,4}ms PLINQ {1}", ts2, type)}
End Function
Dim results = New List(Of String) From {Environment.ProcessorCount & " processors"}
Dim count = 12345678, iList = Enumerable.Repeat(1, count)
With iList : results.AddRange(test(Sub() .Sum(), Sub() .AsParallel.Sum(), "IEnumerable")) : End With
With iList.ToArray : results.AddRange(test(Sub() .Sum(), Sub() .AsParallel.Sum(), "Array")) : End With
With iList.ToList : results.AddRange(test(Sub() .Sum(), Sub() .AsParallel.Sum(), "List")) : End With
With ParallelEnumerable.Repeat(1, count) : results.AddRange(test(Sub() .Sum(), Sub() .AsParallel.Sum(), "ParralelQuery")) : End With
MessageBox.Show(String.join(Environment.NewLine, From l In results Order By l))
Running the tests in different order will have a bit different results, so having them in one line makes moving them up and down a bit easier for me.

That indeed may be the case because you are increasing the number of context switches and you are not performing any action that would benefit of having threads waiting for something like i/o completion. This is going to be even worse if you are running in a single cpu box.

I'd recommend using the Stopwatch class for timing metrics. In your case it's a better measure of the interval.

Please read the Side Effects section of this article.
http://msdn.microsoft.com/en-us/magazine/cc163329.aspx
I think you can run into many conditions where PLINQ has additional data processing patterns you must understand before you opt to think that is will always purely have faster response times.

Justin's comment about overhead is exactly right.
Just something to consider when writing concurrent software in general, beyond the use of PLINQ:
You always need to be thinking about the "granularity" of your work items. Some problems are very well suited to parallelization because they can be "chunked" at a very high level, like raytracing entire frames concurrently (these sorts of problems are called embarrassingly parallel). When there are very large "chunks" of work, then the overhead of creating and managing multiple threads becomes negligible compared to the actual work that you want to get done.
PLINQ makes concurrent programming easier, but it doesn't mean that you can ignore thinking about the granularity of your work.

Why does the second for loop always execute faster than the first one?

I was trying to figure out if a for loop was faster than a foreach loop and was using the System.Diagnostics classes to time the task. While running the test I noticed that which ever loop I put first always executes slower then the last one. Can someone please tell me why this is happening? My code is below:
using System;
using System.Diagnostics;
namespace cool {
class Program {
static void Main(string[] args) {
int[] x = new int[] { 3, 6, 9, 12 };
int[] y = new int[] { 3, 6, 9, 12 };
DateTime startTime = DateTime.Now;
for (int i = 0; i < 4; i++) {
Console.WriteLine(x[i]);
}
TimeSpan elapsedTime = DateTime.Now - startTime;
DateTime startTime2 = DateTime.Now;
foreach (var item in y) {
Console.WriteLine(item);
}
TimeSpan elapsedTime2 = DateTime.Now - startTime2;
Console.WriteLine("\nSummary");
Console.WriteLine("--------------------------\n");
Console.WriteLine("for:\t{0}\nforeach:\t{1}", elapsedTime, elapsedTime2);
Console.ReadKey();
}
}
}
Here is the output:
for: 00:00:00.0175781
foreach: 00:00:00.0009766

Probably because the classes (e.g. Console) need to be JIT-compiled the first time through. You'll get the best metrics by calling all methods (to JIT them (warm then up)) first, then performing the test.
As other users have indicated, 4 passes is never going to be enough to to show you the difference.
Incidentally, the difference in performance between for and foreach will be negligible and the readability benefits of using foreach almost always outweigh any marginal performance benefit.

I would not use DateTime to measure performance - try the Stopwatch class.
Measuring with only 4 passes is never going to give you a good result. Better use > 100.000 passes (you can use an outer loop). Don't do Console.WriteLine in your loop.
Even better: use a profiler (like Redgate ANTS or maybe NProf)

I am not so much in C#, but when I remember right, Microsoft was building "Just in Time" compilers for Java. When they use the same or similar techniques in C#, it would be rather natural that "some constructs coming second perform faster".
For example it could be, that the JIT-System sees that a loop is executed and decides adhoc to compile the whole method. Hence when the second loop is reached, it is yet compiled and performs much faster than the first. But this is a rather simplistic guess of mine. Of course you need a far greater insight in the C# runtime system to understand what is going on. It could also be, that the RAM-Page is accessed first in the first loop and in the second it is still in the CPU-cache.
Addon: The other comment that was made: that the output module can be JITed a first time in the first loop seams to me more likely than my first guess. Modern languages are just very complex to find out what is done under the hood. Also this statement of mine fits into this guess:
But also you have terminal-outputs in your loops. They make things yet more difficult. It could also be, that it costs some time to open the terminal a first time in a program.

I was just performing tests to get some real numbers, but in the meantime Gaz beat me to the answer - the call to Console.Writeline is jitted at the first call, so you pay that cost in the first loop.
Just for information though - using a stopwatch rather than the datetime and measuring number of ticks:
Without a call to Console.Writeline before the first loop the times were
for: 16802
foreach: 2282
with a call to Console.Writeline they were
for: 2729
foreach: 2268
Though these results were not consistently repeatable because of the limited number of runs, but the magnitude of difference was always roughly the same.
The edited code for reference:
int[] x = new int[] { 3, 6, 9, 12 };
int[] y = new int[] { 3, 6, 9, 12 };
Console.WriteLine("Hello World");
Stopwatch sw = new Stopwatch();
sw.Start();
for (int i = 0; i < 4; i++)
{
Console.WriteLine(x[i]);
}
sw.Stop();
long elapsedTime = sw.ElapsedTicks;
sw.Reset();
sw.Start();
foreach (var item in y)
{
Console.WriteLine(item);
}
sw.Stop();
long elapsedTime2 = sw.ElapsedTicks;
Console.WriteLine("\nSummary");
Console.WriteLine("--------------------------\n");
Console.WriteLine("for:\t{0}\nforeach:\t{1}", elapsedTime, elapsedTime2);
Console.ReadKey();

The reason why is there are several forms of overhead in the foreach version that are not present in the for loop
Use of an IDisposable.
An additional method call for every element. Each element must be accessed under the hood by using IEnumerator<T>.Current which is a method call. Because it's on an interface it cannot be inlined. This means N method calls where N is the number of elements in the enumeration. The for loop just uses and indexer
In a foreach loop all calls go through an interface. In general this a bit slower than through a concrete type
Please note that the things I listed above are not necessarily huge costs. They are typically very small costs that can contribute to a small performance difference.
Also note, as Mehrdad pointed out, the compilers and JIT may choose to optimize a foreach loop for certain known data structures such as an array. The end result may just be a for loop.
Note: Your performance benchmark in general needs a bit more work to be accurate.
You should use a StopWatch instead of DateTime. It is much more accurate for performance benchmarks.
You should perform the test many times not just once
You need to do a dummy run on each loop to eliminate the problems that come with JITing a method the first time. This probably isn't an issue when all of the code is in the same method but it doesn't hurt.
You need to use more than just 4 values in the list. Try 40,000 instead.

You should be using the StopWatch to time the behavior.
Technically the for loop is faster. Foreach calls the MoveNext() method (creating a method stack and other overhead from a call) on the IEnumerable's iterator, when for only has to increment a variable.

I don't see why everyone here says that for would be faster than foreach in this particular case. For a List<T>, it is (about 2x slower to foreach through a List than to for through a List<T>).
In fact, the foreach will be slightly faster than the for here. Because foreach on an array essentially compiles to:
for(int i = 0; i < array.Length; i++) { }
Using .Length as a stop criteria allows the JIT to remove bounds checks on the array access, since it's a special case. Using i < 4 makes the JIT insert extra instructions to check each iteration whether or not i is out of bounds of the array, and throw an exception if that is the case. However, with .Length, it can guarantee you'll never go outside of the array bounds so the bounds checks are redundant, making it faster.
However, in most loops, the overhead of the loop is insignificant compared to the work done inside.
The discrepancy you're seeing can only be explained by the JIT I guess.

I wouldn't read too much into this - this isn't good profiling code for the following reasons
1. DateTime isn't meant for profiling. You should use QueryPerformanceCounter or StopWatch which use the CPU hardware profile counters
2. Console.WriteLine is a device method so there may be subtle effects such as buffering to take into account
3. Running one iteration of each code block will never give you accurate results because your CPU does a lot of funky on the fly optimisation such as out of order execution and instruction scheduling
4. Chances are the code that gets JITed for both code blocks is fairly similar so is likely to be in the instruction cache for the second code block
To get a better idea of timing, I did the following
Replaced the Console.WriteLine with a math expression ( e^num)
I used QueryPerformanceCounter/QueryPerformanceTimer through P/Invoke
I ran each code block 1 million times then averaged the results
When I did that I got the following results:
The for loop took 0.000676 milliseconds
The foreach loop took 0.000653 milliseconds
So foreach was very slightly faster but not by much
I then did some further experiments and ran the foreach block first and the for block second
When I did that I got the following results:
The foreach loop took 0.000702 milliseconds
The for loop took 0.000691 milliseconds
Finally I ran both loops together twice i.e for + foreach then for + foreach again
When I did that I got the following results:
The foreach loop took 0.00140 milliseconds
The for loop took 0.001385 milliseconds
So basically it looks to me that whatever code you run second, runs very slightly faster but not
enough to be of any significance.
--Edit--
Here are a couple of useful links
How to time managed code using QueryPerformanceCounter
The instruction cache
Out of order execution

Testing your code for speed?

I'm a total newbie, but I was writing a little program that worked on strings in C# and I noticed that if I did a few things differently, the code executed significantly faster.
So it had me wondering, how do you go about clocking your code's execution speed? Are there any (free)utilities? Do you go about it the old-fashioned way with a System.Timer and do it yourself?

What you are describing is known as performance profiling. There are many programs you can get to do this such as Jetbrains profiler or Ants profiler, although most will slow down your application whilst in the process of measuring its performance.
To hand-roll your own performance profiling, you can use System.Diagnostics.Stopwatch and a simple Console.WriteLine, like you described.
Also keep in mind that the C# JIT compiler optimizes code depending on the type and frequency it is called, so play around with loops of differing sizes and methods such as recursive calls to get a feel of what works best.

ANTS Profiler from RedGate is a really nice performance profiler. dotTrace Profiler from JetBrains is also great. These tools will allow you to see performance metrics that can be drilled down the each individual line.
Scree shot of ANTS Profiler:
ANTS http://www.red-gate.com/products/ants_profiler/images/app/timeline_calltree3.gif
If you want to ensure that a specific method stays within a specific performance threshold during unit testing, I would use the Stopwatch class to monitor the execution time of a method one ore many times in a loop and calculate the average and then Assert against the result.

Just a reminder - make sure to compile in Relase, not Debug! (I've seen this mistake made by seasoned developers - it's easy to forget).

What are you describing is 'Performance Tuning'. When we talk about performance tuning there are two angle to it. (a) Response time - how long it take to execute a particular request/program. (b) Throughput - How many requests it can execute in a second. When we typically 'optimize' - when we eliminate unnecessary processing both response time as well as throughput improves. However if you have wait events in you code (like Thread.sleep(), I/O wait etc) your response time is affected however throughput is not affected. By adopting parallel processing (spawning multiple threads) we can improve response time but throughput will not be improved. Typically for server side application both response time and throughput are important. For desktop applications (like IDE) throughput is not important only response time is important.
You can measure response time by 'Performance Testing' - you just note down the response time for all key transactions. You can measure the throughput by 'Load Testing' - You need to pump requests continuously from sufficiently large number of threads/clients such that the CPU usage of server machine is 80-90%. When we pump request we need to maintain the ratio between different transactions (called transaction mix) - for eg: in a reservation system there will be 10 booking for every 100 search. there will be one cancellation for every 10 booking etc.
After identifying the transactions require tuning for response time (performance testing) you can identify the hot spots by using a profiler.
You can identify the hot spots for throughput by comparing the response time * fraction of that transaction. Assume in search, booking, cancellation scenario, ratio is 89:10:1.
Response time are 0.1 sec, 10 sec and 15 sec.
load for search - 0.1 * .89 = 0.089
load for booking- 10 * .1 = 1
load for cancell= 15 * .01= 0.15
Here tuning booking will yield maximum impact on throughput.
You can also identify hot spots for throughput by taking thread dumps (in the case of java based applications) repeatedly.

Use a profiler.
Ants (http://www.red-gate.com/Products/ants_profiler/index.htm)
dotTrace (http://www.jetbrains.com/profiler/)
If you need to time one specific method only, the Stopwatch class might be a good choice.

I do the following things:
1) I use ticks (e.g. in VB.Net Now.ticks) for measuring the current time. I subtract the starting ticks from the finished ticks value and divide by TimeSpan.TicksPerSecond to get how many seconds it took.
2) I avoid UI operations (like console.writeline).
3) I run the code over a substantial loop (like 100,000 iterations) to factor out usage / OS variables as best as I can.

You can use the StopWatch class to time methods. Remember the first time is often slow due to code having to be jitted.

There is a native .NET option (Team Edition for Software Developers) that might address some performance analysis needs. From the 2005 .NET IDE menu, select Tools->Performance Tools->Performance Wizard...
[GSS is probably correct that you must have Team Edition]

This is simple example for testing code speed. I hope I helped you
class Program {
static void Main(string[] args) {
const int steps = 10000;
Stopwatch sw = new Stopwatch();
ArrayList list1 = new ArrayList();
sw.Start();
for(int i = 0; i < steps; i++) {
list1.Add(i);
}
sw.Stop();
Console.WriteLine("ArrayList:\tMilliseconds = {0},\tTicks = {1}", sw.ElapsedMilliseconds, sw.ElapsedTicks);
MyList list2 = new MyList();
sw.Start();
for(int i = 0; i < steps; i++) {
list2.Add(i);
}
sw.Stop();
Console.WriteLine("MyList: \tMilliseconds = {0},\tTicks = {1}", sw.ElapsedMilliseconds, sw.ElapsedTicks);

We Keep Coding

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.