I posted another SO question here, and as a follow-up, my colleague did a test, seen below, as some form of "counter" to the argument for async/await/Tasks.
(I am aware that the lock on resultList isn't needed, disregard that)
I am aware that async/await and Tasks are not made to handle CPU-intensive work, but rather I/O operations that are performed by the OS. The benchmark below is a CPU-intensive task, so the test is flawed from the start.
However, as I understand it, using new Task().Start() schedules the operation on the ThreadPool and executes the test code on different ThreadPool threads. Wouldn't that mean that the first and second tests are more or less the same? (I'm guessing not; please explain.)
Why, then, the big difference between them?
some form of "counter" to the argument for async/await/Tasks.
The posted code has absolutely nothing to do with async or await. It's comparing three different kinds of parallelism:
Dynamic Task Parallelism.
Direct threadpool access.
Manual multithreading with manual partitioning.
The first two are somewhat comparable. Of course, direct threadpool access will be faster than Dynamic Task Parallelism. But what these tests don't show is that direct threadpool access is much harder to do correctly. In particular, when you are running real-world code and need to handle exceptions and return values, you have to add in boilerplate code and object instances to the direct threadpool access code that slows it down.
The third one is not comparable at all. It just uses 10 manual threads. Again, this example ignores the additional complexity necessary in real-world code; specifically, the need to handle exceptions and return values. It also assumes a partition size, which is problematic; real-world code does not have that luxury. If you're managing your own set of threads, then you have to decide things like how quickly you should increase the number of threads when the queue has many items, and how quickly you should end threads when the queue is empty. These are all difficult questions that add lots of code to the #3 test before you're really comparing the same thing.
And that's not even to say anything about the cost of maintenance. In my experience (i.e., as an application developer), micro-optimizations are just not worth it. Even if you took the "worst" (#1) approach, you're losing about 7 microseconds per item. That is an unimaginably small amount of savings. As a general rule, developer time is far more valuable to your company than user time. If your users have to process a hundred thousand items, the difference would barely be perceptible. If you were to adopt the "best" (#3) approach, the code would be much less maintainable, particularly considering the boilerplate and thread management code necessary in production code and not shown here. Going with #3 would probably cost your company far more in terms of developer time just writing or reading the code than it would ever save in terms of user time.
Oh, and the funniest part of all this is that with all these different kinds of parallelism compared, they didn't even include the one that is most suitable for this test: PLINQ.
static void Main(string[] args)
{
    TaskParallelLibrary();
    ManualThreads();
    Console.ReadKey();
}

static void ManualThreads()
{
    var queue = new List<string>();
    for (int i = 0; i != 1000000; ++i)
        queue.Add("string" + i);
    var resultList = new List<string>();
    var stopwatch = Stopwatch.StartNew();
    var counter = 0;
    for (int i = 0; i != 10; ++i)
    {
        new Thread(() =>
        {
            while (true)
            {
                var t = "";
                lock (queue)
                {
                    if (counter >= queue.Count)
                        break;
                    t = queue[counter];
                    ++counter;
                }
                t = t.Substring(0, 5);
                string t2 = t.Substring(0, 2) + t;
                lock (resultList)
                    resultList.Add(t2);
            }
        }).Start();
    }
    while (resultList.Count < queue.Count)
        Thread.Sleep(1);
    stopwatch.Stop();
    Console.WriteLine($"Manual threads: Processed {resultList.Count} in {stopwatch.Elapsed}");
}

static void TaskParallelLibrary()
{
    var queue = new List<string>();
    for (int i = 0; i != 1000000; ++i)
        queue.Add("string" + i);
    var stopwatch = Stopwatch.StartNew();
    var resultList = queue.AsParallel().Select(t =>
    {
        t = t.Substring(0, 5);
        return t.Substring(0, 2) + t;
    }).ToList();
    stopwatch.Stop();
    Console.WriteLine($"Parallel: Processed {resultList.Count} in {stopwatch.Elapsed}");
}
On my machine, after running this code several times, I find that the PLINQ code outperforms the Manual Threads by about 30%. Sample output on .NET Core 3.0 preview5-27626-15, built for Release, run standalone:
Parallel: Processed 1000000 in 00:00:00.3629408
Manual threads: Processed 1000000 in 00:00:00.5119985
And, of course, the PLINQ code is:
Shorter
More maintainable
More robust (handles exceptions and return types)
Less awkward (no need to poll for completion)
More portable (partitions based on number of processors)
More flexible (automatically adjusts the thread pool as necessary based on amount of work)
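To make the robustness point concrete, here is a minimal sketch (not from the original benchmark; the failing Process method is invented for illustration) of how PLINQ surfaces worker exceptions as a single AggregateException, something the manual-threads version would need hand-written plumbing for:

using System;
using System.Collections.Generic;
using System.Linq;

static class PlinqExceptionDemo
{
    // Hypothetical worker that fails on short inputs; a stand-in only.
    static string Process(string t) =>
        t.Length < 5 ? throw new InvalidOperationException($"Bad item: {t}")
                     : t.Substring(0, 5);

    static void Main()
    {
        var queue = new List<string> { "string1", "x", "string2" };
        try
        {
            // Exceptions thrown on any worker thread are collected and
            // rethrown here as one AggregateException.
            var results = queue.AsParallel().Select(Process).ToList();
        }
        catch (AggregateException ex)
        {
            foreach (var inner in ex.InnerExceptions)
                Console.WriteLine(inner.Message);
        }
    }
}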
Related
I have the following code:
var factory = new TaskFactory();
for (int i = 0; i < 100; i++)
{
    var i1 = i;
    factory.StartNew(() => foo(i1));
}

static void foo(int i)
{
    Thread.Sleep(1000);
    Console.WriteLine($"foo{i} - on thread {Thread.CurrentThread.ManagedThreadId}");
}
I can see it only does 4 threads at a time (based on observation). My questions:
What determines the number of threads used at a time?
How can I retrieve this number?
How can I change this number?
P.S. My box has 4 cores.
P.P.S. I needed to have a specific number of tasks (and no more) that are concurrently processed by the TPL and ended up with the following code:
private static int count = 0; // keep track of how many concurrent tasks are running

private static void SemaphoreImplementation()
{
    var s = new Semaphore(20, 20); // allow 20 tasks at a time

    for (int i = 0; i < 1000; i++)
    {
        var i1 = i;
        Task.Factory.StartNew(() =>
        {
            try
            {
                s.WaitOne();
                Interlocked.Increment(ref count);

                foo(i1);
            }
            finally
            {
                s.Release();
                Interlocked.Decrement(ref count);
            }
        }, TaskCreationOptions.LongRunning);
    }
}

static void foo(int i)
{
    Thread.Sleep(100);
    Console.WriteLine($"foo{i:00} - on thread " +
        $"{Thread.CurrentThread.ManagedThreadId:00}. Executing concurrently: {count}");
}
When you use a Task in .NET, you are telling the TPL to schedule a piece of work (via a TaskScheduler) to be executed on the ThreadPool. Note that the work will be scheduled at its earliest opportunity, however the scheduler sees fit. This means that the TaskScheduler decides how many threads to use to run n tasks, and which task is executed on which thread.
The TPL is very well tuned and continuously adjusts its algorithm as it executes your tasks. So, in most cases, it tries to minimize contention. What this means is that if you are running 100 tasks and only have 4 cores (which you can get using Environment.ProcessorCount), it would not make sense to execute more than 4 threads at any given time; otherwise it would need to do more context switching. There are times when you want to explicitly override this behaviour, say when you need to wait for some sort of IO to finish, but that is a whole different story.
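To answer the "how can I retrieve this number" part concretely, here is a small sketch using the standard ThreadPool APIs (note that the numbers printed are the pool's configured bounds, not a live thread count):

using System;
using System.Threading;

static class ThreadPoolInfo
{
    static void Main()
    {
        // The number of logical processors the scheduler starts from.
        Console.WriteLine($"Processors: {Environment.ProcessorCount}");

        // The pool's configured bounds; worker threads run Tasks,
        // completion-port threads run IO callbacks.
        ThreadPool.GetMinThreads(out int minWorker, out int minIo);
        ThreadPool.GetMaxThreads(out int maxWorker, out int maxIo);
        Console.WriteLine($"Workers: min {minWorker}, max {maxWorker}");

        // You can raise the floor so more tasks start immediately, but
        // this is a process-global knob; use it with care.
        ThreadPool.SetMinThreads(16, minIo);
    }
}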
In summary, trust the TPL. But if you are adamant to spawn a thread per task (not always a good idea!), you can use:
Task.Factory.StartNew(
    () => /* your piece of work */,
    TaskCreationOptions.LongRunning);
This tells the default TaskScheduler to explicitly spawn a new thread for that piece of work.
You can also use your own TaskScheduler and pass it in to the TaskFactory. You can find a whole bunch of schedulers HERE.
Note that another alternative is PLINQ, which by default analyses your query and decides whether parallelizing it would yield any benefit. Again, in the case of blocking IO where you are certain that starting multiple threads will result in better execution, you can force parallelism using WithExecutionMode(ParallelExecutionMode.ForceParallelism). You can then use WithDegreeOfParallelism to give a hint on how many threads to use, but remember there is no guarantee you will get that many threads; as MSDN says:
Sets the degree of parallelism to use in a query. Degree of parallelism is the maximum number of concurrently executing tasks that will be used to process the query.
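For instance, a sketch of such a forced-parallel query over blocking work might look like this (the Download method is a hypothetical stand-in for whatever blocking IO call you have):

using System;
using System.Linq;
using System.Threading;

static class ForcedPlinq
{
    // Hypothetical stand-in for a blocking IO call, used only for illustration.
    static string Download(string url)
    {
        Thread.Sleep(100);
        return url;
    }

    static void Main()
    {
        var urls = Enumerable.Range(0, 100).Select(i => $"item{i}");

        // For blocking work, force parallelism and hint at a DOP above the
        // core count; PLINQ treats the hint as a maximum, not a guarantee.
        var results = urls.AsParallel()
                          .WithExecutionMode(ParallelExecutionMode.ForceParallelism)
                          .WithDegreeOfParallelism(16)
                          .Select(Download)
                          .ToList();

        Console.WriteLine($"Processed {results.Count} items.");
    }
}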
Finally, I highly recommend having a read of THIS great series of articles on Threading and TPL.
If you increase the number of tasks to, for example, 1,000,000, you will see a lot more threads spawned over time. The TPL tends to inject one roughly every 500 ms.
The TPL thread pool does not understand IO-bound workloads (and a Thread.Sleep occupies a pool thread exactly like blocking IO does). It's not a good idea to rely on the TPL to pick the right degree of parallelism in these cases. The TPL is essentially clueless here; it injects more threads based on vague guesses about throughput, and also to avoid deadlocks.
Here, the TPL policy is clearly not useful, because the more threads you add, the more throughput you get: each thread can process one item per second in this contrived case. The TPL has no idea about that, so it makes no sense to limit the thread count to the number of cores.
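A rough way to watch this injection behavior yourself (a sketch; the process thread count includes non-pool threads, so treat it as an approximation):

using System;
using System.Diagnostics;
using System.Threading;
using System.Threading.Tasks;

static class ThreadInjectionDemo
{
    static void Main()
    {
        // Queue far more blocking work items than there are cores.
        for (int i = 0; i < 1000; i++)
            Task.Run(() => Thread.Sleep(60000));

        // Watch the count creep upward as the pool's hill-climbing
        // heuristic injects roughly one new thread every half second.
        for (int i = 0; i < 30; i++)
        {
            Console.WriteLine(Process.GetCurrentProcess().Threads.Count);
            Thread.Sleep(1000);
        }
    }
}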
What determines the number of threads used at a time?
Barely documented TPL heuristics, and they frequently go wrong. In particular, they will spawn an unbounded number of threads over time in this case. Use Task Manager to see for yourself: let this run for an hour and you'll have thousands of threads.
How can I retrieve this number? How can I change this number?
You can retrieve some of these numbers, but that's not the right way to go. If you need a guaranteed DOP, you can use AsParallel().WithDegreeOfParallelism(...) or a custom task scheduler. You can also manually start LongRunning tasks. Do not mess with process-global settings.
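If what you need is a hard cap rather than process-global tweaking, ParallelOptions also gives you one; a minimal sketch:

using System;
using System.Threading;
using System.Threading.Tasks;

static class FixedDopDemo
{
    static void Main()
    {
        // MaxDegreeOfParallelism is a hard cap on concurrency; the loop
        // will never use more than 4 workers (though it may use fewer).
        var options = new ParallelOptions { MaxDegreeOfParallelism = 4 };
        Parallel.For(0, 100, options, i =>
        {
            Console.WriteLine($"Item {i} on thread " +
                $"{Thread.CurrentThread.ManagedThreadId}");
        });
    }
}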
I would suggest using SemaphoreSlim because it doesn't use the Windows kernel (so it can be used in Linux C# microservices), and it also has a SemaphoreSlim.CurrentCount property that reports how many slots remain, so you don't need the Interlocked.Increment/Interlocked.Decrement counter. One caveat: the copy var i1 = i is actually still required, because the lambda captures the loop variable i itself, not its value at that iteration; without the copy, tasks can observe a mutated i. The version below keeps it, uses Task.Run (Task.Factory.StartNew with an async lambda returns a nested Task<Task>, and LongRunning has no effect past the first await), and acquires the semaphore before the try block so it is only released if it was actually acquired:
private static void SemaphoreImplementation()
{
    var maxTasksCount = 20; // allow 20 tasks at a time
    var semaphoreSlim = new SemaphoreSlim(maxTasksCount, maxTasksCount);
    for (int i = 0; i < 1000; i++)
    {
        var i1 = i; // copy: the lambda captures the variable, not its value
        Task.Run(async () =>
        {
            await semaphoreSlim.WaitAsync();
            try
            {
                // CurrentCount is the number of free slots, so the number
                // of tasks currently inside the semaphore is the difference
                var count = maxTasksCount - semaphoreSlim.CurrentCount;
                await foo(i1, count);
            }
            finally
            {
                semaphoreSlim.Release();
            }
        });
    }
}

static async Task foo(int i, int count)
{
    await Task.Delay(100);
    Console.WriteLine($"foo{i:00} - on thread " +
        $"{Thread.CurrentThread.ManagedThreadId:00}. Executing concurrently: {count}");
}
I would like to know the best way, or any documents/articles that can help me identify the differences, between using Parallel.ForEach and using Tasks within a normal foreach loop, like the following:
Case 1 - Parallel.ForEach:
Parallel.ForEach(items, item =>
{
    // Do something thread-safe: parse an XML document and then save
    // it into a DB server through a repository approach
});
Case 2 - Task within foreach:
var tasks = new List<Task>();
foreach (var item in items)
{
    tasks.Add(Task.Factory.StartNew(() =>
    {
        // Do the same thing as case 1 (thread-safe)
    }));
}
Task.WaitAll(tasks.ToArray());
I did my own tests, and the results show that case 1 performs way better than case 2. The ratio is roughly:
sequential vs case 1 vs case 2 = 5s : 1s : 4s
That is almost 1:4 between case 1 and case 2. Does this mean we should always use Parallel.ForEach or Parallel.For when we want to run a loop body in parallel?
First, the best documentation on the subject is Part V of CLR via C#.
http://www.amazon.com/CLR-via-C-Developer-Reference/dp/0735667454/ref=sr_1_1?ie=UTF8&qid=1376239791&sr=8-1&keywords=clr+via+c%23
Secondly, I would expect Parallel.ForEach to perform better because it will not only create Tasks, but batch them. In Jeffrey Richter's book, he explains that tasks started individually are put on the thread pool's global queue, and there is some overhead to locking that queue. To combat this, Tasks themselves have their own local queue for Tasks that they create, and that local sub-queue can actually do some work without locking!
I would have to read that chapter again (Chapter 27), so I am not sure that Parallel.ForEach works exactly this way, but it is what I would expect it to do.
Locking, he explains, is expensive because it requires accessing a kernel-level construct.
In either case, do not expect them to process items in order. Due to the aforementioned internals, Parallel.ForEach is even less likely to process items sequentially than Tasks started in a foreach loop.
What Parallel.ForEach() does is create a small number of Tasks to process iterations of your loop. Tasks are relatively cheap, but they aren't free, so this tends to improve performance. And if the body of your loop executes quickly, the improvement can be really big. This is the most likely explanation for the behavior you're observing.
How many tasks are you running? Just the creation of a new Task can take a significant amount of time if you're looping enough. For example, the following runs in 15 ms for the first block and over 1 second for the second block, and the second block doesn't even run the tasks. Uncomment the Start and the time goes up to nearly 3 seconds; the WaitAll only adds a small amount.
static class Program
{
    static void Main()
    {
        const int max = 3000000;
        var range = Enumerable.Range(0, max).ToArray();
        {
            var sw = new Stopwatch();
            sw.Start();
            Parallel.ForEach(range, i => { });
            sw.Stop();
            Console.WriteLine(sw.ElapsedMilliseconds);
        }
        {
            var tasks = new Task[max];
            var sw = new Stopwatch();
            sw.Start();
            foreach (var i in range)
            {
                tasks[i] = new Task(() => { });
                //tasks[i].Start();
            }
            //Task.WaitAll(tasks);
            sw.Stop();
            Console.WriteLine(sw.ElapsedMilliseconds);
        }
    }
}
I'm trying to understand why Parallel.For is able to outperform a number of threads in the following scenario: consider a batch of jobs that can be processed in parallel. While processing these jobs, new work may be added, which then needs to be processed as well. The Parallel.For solution would look as follows:
var jobs = new List<Job> { firstJob };
int startIdx = 0, endIdx = jobs.Count;
while (startIdx < endIdx)
{
    Parallel.For(startIdx, endIdx, i => WorkJob(jobs[i]));
    startIdx = endIdx;
    endIdx = jobs.Count;
}
This means that Parallel.For needs to synchronize multiple times. Consider a breadth-first graph algorithm: the number of synchronizations would be quite large. Waste of time, no?
Trying the same in the old-fashioned threading approach:
var queue = new ConcurrentQueue<Job>(new[] { firstJob }); // ConcurrentQueue has no Add, so no collection initializer
var threads = new List<Thread>();
var waitHandle = new AutoResetEvent(false);
int numBusy = 0;
for (int i = 0; i < maxThreads; i++)
    threads.Add(new Thread(new ThreadStart(delegate
    {
        while (!queue.IsEmpty || numBusy > 0)
        {
            if (queue.IsEmpty)
                // numBusy > 0 implies more data may arrive
                waitHandle.WaitOne();
            Job job;
            if (queue.TryDequeue(out job))
            {
                Interlocked.Increment(ref numBusy);
                WorkJob(job); // WorkJob does a waitHandle.Set() when more work was found
                Interlocked.Decrement(ref numBusy);
            }
        }
        // others are possibly waiting for us to enable more work which won't happen
        waitHandle.Set();
    })));
threads.ForEach(t => t.Start());
threads.ForEach(t => t.Join());
The Parallel.For code is of course much cleaner, but what I cannot comprehend is that it's even faster! Is the task scheduler just that good? The synchronizations were eliminated and there's no busy waiting, yet the threaded approach is consistently slower (for me). What's going on? Can the threading approach be made faster?
Edit: thanks for all the answers, I wish I could pick multiple ones. I chose to go with the one that also shows an actual possible improvement.
The two code samples are not really the same.
Parallel.ForEach() will use a limited number of threads and re-use them. The second sample starts out way behind by having to create a number of threads, and that takes time.
And what is the value of maxThreads? Very critical; in Parallel.ForEach() it is dynamic.
Is the task scheduler just that good?
It is pretty good, and the TPL uses work-stealing and other adaptive techniques. You'll have a hard time doing any better.
Parallel.For doesn't actually break the work into single-item units. It breaks up all the work (early on) based on the number of threads it plans to use and the number of iterations to be executed. Then each thread synchronously processes its batch (possibly using work stealing or saving some extra items to load-balance near the end). With this approach the worker threads virtually never wait on each other, while your threads constantly wait on each other due to the heavy synchronization you're using before and after every single iteration.
On top of that since it's using thread pool threads many of the threads it needs are likely already created, which is another advantage in its favor.
As for synchronization, the entire point of a Parallel.For is that all of the iterations can be done in parallel, so there is almost no synchronization that needs to take place (at least in their code).
Then of course there is the issue of the number of threads. The threadpool has a lot of very good algorithms and heuristics to help it determine how many threads are needed at that instant in time, based on the current hardware, the load from other applications, and so on. It's possible that you're using too many, or not enough, threads.
Also, since the number of items that you have isn't known before you start, I would suggest using Parallel.ForEach rather than several Parallel.For loops. It is simply designed for the situation that you're in, so its heuristics will apply better. (It also makes for even cleaner code.)
BlockingCollection<Job> queue = new BlockingCollection<Job>();
//add jobs to queue, possibly in another thread
//call queue.CompleteAdding() when there are no more jobs to run
Parallel.ForEach(queue.GetConsumingEnumerable(),
    job => job.DoWork());
You're creating a bunch of new threads, while Parallel.For is using a thread pool. You'd see better performance if you used the thread pool yourself, but there really is no point in doing that.
I would shy away from rolling your own solution; if there is a corner case where you need customization, use the TPL and customize it.
I am creating a component which performs sequential processing on inputs. As it will be hosted in several different processes, it needs to be thread safe. At first, I intentionally left thread safety out of the code. Now it is time to introduce it.
First, I wanted to provoke an error to start with, but was not able to. Here is a simplified version of the code for the processing engine:
public Document DoOrchestration(Document input)
{
    Document output = new Document();
    foreach (var orchestrationStep in m_OrchestrationSteps)
    {
        var processor = GetProcessor(orchestrationStep).Clone();
        output = processor.Process(input);
        input = output;
    }
    return output;
}
The processors can be developed by other people in my organisation, and they can include some complex initialization. They may also be thread-unsafe, so I use the Prototype Pattern to get unique instances of the processors and thereby avoid threading issues inside them.
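A minimal sketch of what that prototype approach might look like; the IProcessor, XmlParsingProcessor, and Settings names are illustrative assumptions rather than the original code, and Document is the type from the question:

using System.Text;

// Illustrative only: the shape of a prototype-based processor.
public class Settings { /* expensive, shared, immutable configuration */ }

public interface IProcessor
{
    Document Process(Document input);
    IProcessor Clone(); // the prototype: hands out a private copy
}

public class XmlParsingProcessor : IProcessor
{
    private readonly Settings m_settings;   // shared, read-only after init
    private readonly StringBuilder m_buffer // mutable per-instance state that
        = new StringBuilder();              // makes this processor thread-unsafe

    public XmlParsingProcessor(Settings settings)
    {
        m_settings = settings; // the complex initialization happens once
    }

    public Document Process(Document input)
    {
        m_buffer.Clear(); // safe, because each thread works on its own clone
        return input;
    }

    // Cloning reuses the expensive initialization but gives the caller a
    // fresh copy of all mutable state, so Process needs no locking.
    public IProcessor Clone() => new XmlParsingProcessor(m_settings);
}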
To test this function I used the following code:
for (int i = 0; i < 20000; i++)
{
    Thread t = new Thread(() => TestOrchestration(i));
    t.Start();
}

void TestOrchestration(int number)
{
    Document doc = new Document(string.Format("Test {0}", number));
    doc = DoOrchestration(doc);
    if (doc.ToString().Substring(0, 35) != strExpectedResult)
    {
        System.Console.WriteLine("Error: {0}", doc.ToString());
    }
}
I expected some of the threads to collide with one another and mix up their results, but to my surprise that did not happen.
There is probably an easy and logical explanation, but it eludes me. Or is it just that the code is too simple for two threads to end up fiddling with the input/output variables at the same time?
Check out CHESS.
CHESS is a tool for finding and reproducing Heisenbugs in concurrent programs. CHESS repeatedly runs a concurrent test ensuring that every run takes a different interleaving. If an interleaving results in an error, CHESS can reproduce the interleaving for improved debugging. CHESS is available for both managed and native programs.
I would assume that, due to the simplicity of your test function, your threads do not even get time to spawn in large numbers before the previous one is done with its work. Consider using a barrier to allow all threads to spawn before you start the computational step. Also consider increasing the complexity of your test case, for instance by performing several of the same operations in the same loop (starting a thread is expensive, and allows other cores to complete their work before you even get to the resource contention).
Generally, resource contention can be provoked by rapid access to the same resource over a longer period of time; your test case does not seem to allow for this. As an aside, I would strongly recommend designing with thread safety in mind instead of introducing it later. While working with the code, you have a better understanding of resource access patterns than you will have when analyzing the code at a later stage.
You could use a ManualResetEvent to release any number of awaiting threads simultaneously.
I think your test function is almost complete before the next thread starts. You can make all the threads wait on a ManualResetEventSlim before invoking the orchestration function, and then set the ManualResetEventSlim.
That way, all your threads will in fact try to invoke orchestration at the same time.
Also, you would probably not need 20,000 threads to simulate this behavior if all threads make the orchestration call at almost the same time:
ManualResetEventSlim manualEvent = new ManualResetEventSlim(false);

for (int i = 0; i < 20000; i++)
{
    Thread t = new Thread(() => TestOrchestration(i));
    t.Start();
}
manualEvent.Set();

void TestOrchestration(int number)
{
    manualEvent.Wait();
    Document doc = new Document(string.Format("Test {0}", number));
    doc = DoOrchestration(doc);
    if (doc.ToString().Substring(0, 35) != strExpectedResult)
    {
        System.Console.WriteLine("Error: {0}", doc.ToString());
    }
}
Before I started a project, I wrote a simple test to compare the performance of ConcurrentBag (from System.Collections.Concurrent) against locking and lists. I am extremely surprised that ConcurrentBag is over 10 times slower than locking with a simple List. From what I understand, ConcurrentBag works best when the reader and writer are the same thread. However, I hadn't thought its performance would be so much worse than traditional locks.
I have run a test with two parallel for loops writing to and reading from a list/bag. However, the write by itself already shows a huge difference:
private static void ConcurrentBagTest()
{
    int collSize = 10000000;
    Stopwatch stopWatch = new Stopwatch();
    ConcurrentBag<int> bag1 = new ConcurrentBag<int>();

    stopWatch.Start();
    Parallel.For(0, collSize, delegate(int i)
    {
        bag1.Add(i);
    });
    stopWatch.Stop();
    Console.WriteLine("Elapsed Time = {0}",
        stopWatch.Elapsed.TotalSeconds);
}
On my box, this takes between 3 and 4 seconds to run, compared to 0.5-0.9 seconds for this code:
private static void LockCollTest()
{
    int collSize = 10000000;
    object list1_lock = new object();
    List<int> lst1 = new List<int>(collSize);
    Stopwatch stopWatch = new Stopwatch();

    stopWatch.Start();
    Parallel.For(0, collSize, delegate(int i)
    {
        lock (list1_lock)
        {
            lst1.Add(i);
        }
    });
    stopWatch.Stop();
    Console.WriteLine("Elapsed = {0}",
        stopWatch.Elapsed.TotalSeconds);
}
As I mentioned, doing concurrent reads and writes doesn't help the concurrent bag test. Am I doing something wrong or is this data structure just really slow?
[EDIT] - I removed the Tasks because I don't need them here (Full code had another task reading)
[EDIT]
Thanks a lot for the answers. I am having a hard time picking "the right answer" since it seems to be a mix of a few answers.
As Michael Goldshteyn pointed out, the speed really depends on the data.
Darin pointed out that there should be more contention for ConcurrentBag to be faster, and that Parallel.For doesn't necessarily start the same number of threads. One point to take away is to not do anything you don't have to inside a lock. In the above case, I don't see myself doing anything inside the lock except maybe assigning the value to a temp variable.
Additionally, sixlettervariables pointed out that the number of threads that happen to be running may also affect results, although I tried running the original test in reverse order and ConcurrentBag was still slower.
I ran some tests starting 15 Tasks, and the results depended on the collection size among other things. However, ConcurrentBag performed almost as well as, or better than, locking a list for up to 1 million insertions. Above 1 million, locking sometimes seemed much faster, but I'll probably never have a larger data structure for my project.
Here's the code I ran:
int collSize = 1000000;
object list1_lock = new object();
List<int> lst1 = new List<int>();
ConcurrentBag<int> concBag = new ConcurrentBag<int>();
int numTasks = 15;

Stopwatch sWatch = new Stopwatch();
sWatch.Start();

// First, try locks
Task.WaitAll(Enumerable.Range(1, numTasks)
    .Select(x => Task.Factory.StartNew(() =>
    {
        // the loop counter must be local to each task;
        // a counter shared across tasks would race
        for (int i = 0; i < collSize / numTasks; i++)
        {
            lock (list1_lock)
            {
                lst1.Add(x);
            }
        }
    })).ToArray());

sWatch.Stop();
Console.WriteLine("lock test. Elapsed = {0}",
    sWatch.Elapsed.TotalSeconds);

// now try ConcurrentBag
sWatch.Restart();
Task.WaitAll(Enumerable.Range(1, numTasks)
    .Select(x => Task.Factory.StartNew(() =>
    {
        for (int i = 0; i < collSize / numTasks; i++)
        {
            concBag.Add(x);
        }
    })).ToArray());

sWatch.Stop();
Console.WriteLine("Conc Bag test. Elapsed = {0}",
    sWatch.Elapsed.TotalSeconds);
Let me ask you this: how realistic is it that you'd have an application which is constantly adding to a collection and never reading from it? What's the use of such a collection? (This is not a purely rhetorical question. I could imagine there being uses where, e.g., you only read from the collection on shutdown (for logging) or when requested by the user. I believe these scenarios are fairly rare, though.)
This is what your code is simulating. Calling List<T>.Add is going to be lightning-fast in all but the occasional case where the list has to resize its internal array; but this is smoothed out by all the other adds that happen quite quickly. So you're not likely to see a significant amount of contention in this context, especially testing on a personal PC with, e.g., even 8 cores (as you stated you have in a comment somewhere). Maybe you might see more contention on something like a 24-core machine, where many cores can be trying to add to the list literally at the same time.
Contention is much more likely to creep in where you read from your collection, especially in foreach loops (or LINQ queries which amount to foreach loops under the hood), which require locking the entire operation so that you aren't modifying your collection while iterating over it.
If you can realistically reproduce this scenario, I believe you will see ConcurrentBag<T> scale much better than your current test is showing.
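To illustrate why reads are where the locking cost bites: with a plain List<T> the entire enumeration must happen under the lock, whereas ConcurrentBag<T> enumerates a moment-in-time snapshot. A sketch (the names here are illustrative, not from the test above):

using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;

static class ReadContentionDemo
{
    static readonly object listLock = new object();

    static double SumList(List<double> list)
    {
        // The whole enumeration must hold the lock; a concurrent Add
        // would otherwise invalidate the enumerator mid-loop.
        lock (listLock)
        {
            return list.Sum();
        }
    }

    static double SumBag(ConcurrentBag<double> bag)
    {
        // ConcurrentBag enumerates a snapshot, so no user lock is
        // needed even while writers keep adding.
        return bag.Sum();
    }

    static void Main()
    {
        Console.WriteLine(SumList(new List<double> { 1, 2, 3 }));
        Console.WriteLine(SumBag(new ConcurrentBag<double>(new[] { 1.0, 2.0, 3.0 })));
    }
}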
Update: Here is a program I wrote to compare these collections in the scenario I described above (multiple writers, many readers). Running 25 trials with a collection size of 10000 and 8 reader threads, I got the following results:
Took 529.0095 ms to add 10000 elements to a List<double> with 8 reader threads.
Took 39.5237 ms to add 10000 elements to a ConcurrentBag<double> with 8 reader threads.
Took 309.4475 ms to add 10000 elements to a List<double> with 8 reader threads.
Took 81.1967 ms to add 10000 elements to a ConcurrentBag<double> with 8 reader threads.
Took 228.7669 ms to add 10000 elements to a List<double> with 8 reader threads.
Took 164.8376 ms to add 10000 elements to a ConcurrentBag<double> with 8 reader threads.
[ ... ]
Average list time: 176.072456 ms.
Average bag time: 59.603656 ms.
So clearly it depends on exactly what you're doing with these collections.
There seems to be a bug in .NET Framework 4 that Microsoft fixed in 4.5; it seems they didn't expect ConcurrentBag to be used a lot.
See the following Ayende post for more info
http://ayende.com/blog/156097/the-high-cost-of-concurrentbag-in-net-4-0
As a general answer:
Concurrent collections that use locking can be very fast if there is little or no contention for their data (i.e., their locks). This is because such collection classes are often built using very inexpensive locking primitives, which are especially cheap when uncontended.
Lockless collections can be slower because of the tricks used to avoid locks, and because of other bottlenecks such as false sharing and the complexity required to implement their lockless nature, which leads to cache misses, etc.
To summarize, which way is faster is highly dependent on the data structures employed and the amount of contention for the locks, among other issues (e.g., the number of readers vs. writers in a shared/exclusive arrangement).
Your particular example has a very high degree of contention, so I must say I am surprised by the behavior. On the other hand, the amount of work done while the lock is kept is very small, so maybe there is little contention for the lock itself, after all. There could also be deficiencies in the implementation of ConcurrentBag's concurrency handling which makes your particular example (with frequent inserts and no reads) a bad use case for it.
Looking at the program using MS's contention visualizer shows that ConcurrentBag<T> has a much higher cost associated with parallel insertion than simply locking on a List<T>. One thing I noticed is that there appears to be a cost associated with spinning up the 6 threads (used on my machine) for the first ConcurrentBag<T> run (cold run). 5 or 6 threads are then reused by the List<T> code, which is faster (warm run). Adding another ConcurrentBag<T> run after the list shows that it takes less time than the first (warm run).
From what I'm seeing in the contention, a lot of time is spent in the ConcurrentBag<T> implementation allocating memory. Removing the explicit allocation of size from the List<T> code slows it down, but not enough to make a difference.
EDIT: it appears that ConcurrentBag<T> internally keeps a list per Thread.CurrentThread, locks 2-4 times depending on whether it is running on a new thread, and performs at least one Interlocked.Exchange. As noted in MSDN, it is "optimized for scenarios where the same thread will be both producing and consuming data stored in the bag." This is the most likely explanation for its performance decrease versus a raw list.
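To see the fast path that MSDN quote describes, here is a sketch in which each thread both produces and consumes, so most operations should hit the bag's per-thread local list rather than the cross-thread locking path the add-only benchmark pays for (illustrative only; timings will vary):

using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

static class BagFastPathDemo
{
    static void Main()
    {
        var bag = new ConcurrentBag<int>();

        // Each worker adds and removes on the same thread, so TryTake
        // can usually consume from that thread's own local list.
        Parallel.For(0, 4, worker =>
        {
            for (int i = 0; i < 1000000; i++)
            {
                bag.Add(i);
                bag.TryTake(out int item);
            }
        });

        Console.WriteLine($"Remaining: {bag.Count}");
    }
}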
This was already resolved in .NET 4.5. The underlying issue was that ThreadLocal, which ConcurrentBag uses, didn't expect to have a lot of instances. That has been fixed, and it can now run fairly fast.
source - The HIGH cost of ConcurrentBag in .NET 4.0
As Darin Dimitrov said, I suspect that your Parallel.For isn't actually spawning the same number of threads in each of the two tests. Try manually creating N threads to ensure that you are actually seeing thread contention in both cases.
You basically have very few concurrent writes and no contention (Parallel.For doesn't necessarily mean many threads). Try parallelizing the writes and you will observe different results:
class Program
{
    private static object list1_lock = new object();
    private const int collSize = 1000;

    static void Main()
    {
        ConcurrentBagTest();
        LockCollTest();
    }

    private static void ConcurrentBagTest()
    {
        var bag1 = new ConcurrentBag<int>();
        var stopWatch = Stopwatch.StartNew();
        Task.WaitAll(Enumerable.Range(1, collSize).Select(x => Task.Factory.StartNew(() =>
        {
            Thread.Sleep(5);
            bag1.Add(x);
        })).ToArray());
        stopWatch.Stop();
        Console.WriteLine("Elapsed Time = {0}", stopWatch.Elapsed.TotalSeconds);
    }

    private static void LockCollTest()
    {
        var lst1 = new List<int>(collSize);
        var stopWatch = Stopwatch.StartNew();
        Task.WaitAll(Enumerable.Range(1, collSize).Select(x => Task.Factory.StartNew(() =>
        {
            lock (list1_lock)
            {
                Thread.Sleep(5);
                lst1.Add(x);
            }
        })).ToArray());
        stopWatch.Stop();
        Console.WriteLine("Elapsed = {0}", stopWatch.Elapsed.TotalSeconds);
    }
}
My guess is that the locks don't experience much contention. I would recommend reading the following article: Java theory and practice: Anatomy of a flawed microbenchmark. It discusses a lock microbenchmark, and as stated in the article, there are a lot of things to take into consideration in this kind of situation.
It would be interesting to see scaling between the two of them.
Two questions:
1) How fast is bag vs. list for reading? (Remember to put a lock on the list.)
2) How fast is bag vs. list for reading while another thread is writing?
Because the loop body is small, you could try using the Partitioner class's Create method...
which enables you to provide a sequential loop for the delegate body, so that the delegate is invoked only once per partition, instead of once per iteration
How to: Speed Up Small Loop Bodies
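Adapting the earlier timing example to that approach might look like the sketch below: each partition runs its range in a plain sequential loop, instead of paying delegate-invocation overhead on every iteration.

using System;
using System.Collections.Concurrent;
using System.Diagnostics;
using System.Threading.Tasks;

static class PartitionerDemo
{
    static void Main()
    {
        const int max = 3000000;
        var sw = Stopwatch.StartNew();

        // Partitioner.Create(0, max) hands each worker a (fromInclusive,
        // toExclusive) range; the tiny body then runs in a tight loop
        // once per partition rather than once per element.
        Parallel.ForEach(Partitioner.Create(0, max), range =>
        {
            for (int i = range.Item1; i < range.Item2; i++)
            {
                // tiny loop body
            }
        });

        sw.Stop();
        Console.WriteLine(sw.ElapsedMilliseconds);
    }
}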
It appears that ConcurrentBag is just slower than the other concurrent collections.
I think it's an implementation problem: ANTS Profiler shows that it gets bogged down in a couple of places, including an array copy.
Using ConcurrentDictionary is thousands of times faster.