I have a .NET 4.5 Single Instance WCF service which maintains a collection of items in a list which will have simultaneous concurrent readers and writers, but with far more readers than writers.
I am currently deciding whether to use the BCL ConcurrentBag<T> or my own custom generic ThreadSafeList class (which implements IList<T> and encapsulates the BCL ReaderWriterLockSlim, as this is better suited to multiple concurrent readers).
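For illustration, a minimal sketch of such a wrapper (not the exact implementation; only the members the test exercises are shown, and the real class implements all of IList<T>):
using System.Collections;
using System.Collections.Generic;
using System.Linq;
using System.Threading;

// Sketch of a ReaderWriterLockSlim-based list wrapper: readers take the
// shared lock, writers the exclusive one.
public class ThreadSafeList<T> : IEnumerable<T>
{
    private readonly List<T> _inner = new List<T>();
    private readonly ReaderWriterLockSlim _lock = new ReaderWriterLockSlim();

    public void Add(T item)
    {
        _lock.EnterWriteLock();
        try { _inner.Add(item); }
        finally { _lock.ExitWriteLock(); }
    }

    public IEnumerator<T> GetEnumerator()
    {
        _lock.EnterReadLock();
        try
        {
            // Snapshot under the read lock so enumeration stays safe
            // after the lock has been released.
            return _inner.ToList().GetEnumerator();
        }
        finally { _lock.ExitReadLock(); }
    }

    IEnumerator IEnumerable.GetEnumerator() { return GetEnumerator(); }
}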
I have found significant performance differences when testing these implementations by simulating a concurrent scenario of 1M readers (each simply running a LINQ Sum query) and only 100 writers (each adding items to the list).
For my performance test I have a list of tasks:
List<Task> tasks = new List<Task>();
Test 1: If I create 1m reader tasks followed by 100 writer tasks using the following code:
tasks.AddRange(Enumerable.Range(0, 1000000).Select(n => new Task(() => { temp.Where(t => t < 1000).Sum(); })).ToArray());
tasks.AddRange(Enumerable.Range(0, 100).Select(n => new Task(() => { temp.Add(n); })).ToArray());
I get the following timing results:
ConcurrentBag: ~300ms
ThreadSafeList: ~520ms
Test 2: However, if I create 1M reader tasks mixed with 100 writer tasks (whereby the list of tasks to be executed could be {Reader, Reader, Writer, Reader, Reader, Writer, etc.}):
foreach (var item in Enumerable.Range(0, 1000000))
{
tasks.Add(new Task(() => temp.Where(t => t < 1000).Sum()));
if (item % 10000 == 0)
tasks.Add(new Task(() => temp.Add(item)));
}
I get the following timing results:
ConcurrentBag: ~4000ms
ThreadSafeList: ~800ms
My code for getting the execution time for each test is as follows:
Stopwatch watch = new Stopwatch();
watch.Start();
tasks.ForEach(task => task.Start());
Task.WaitAll(tasks.ToArray());
watch.Stop();
Console.WriteLine("Time: {0}ms", watch.Elapsed.TotalMilliseconds);
The efficiency of ConcurrentBag in Test 1 is better, and in Test 2 worse, than my custom implementation, so I'm finding it difficult to decide which one to use.
Q1. Why are the results so different when the only thing I’m changing is the ordering of the tasks within the list?
Q2. Is there a better way to change my test to make it more fair?
Why are the results so different when the only thing I'm changing is the ordering of the tasks within the list?
My best guess is that Test #1 never actually reads any items, because there is nothing to read yet. The order of task execution is:
Read from shared pool 1M times and calculate sum
Write to shared pool 100 times
Your Test #2 mixes the reads and writes, and this, I am guessing, is why you get a different result.
Is there a better way to change my test to make it more fair?
Before you start the tasks, try randomising the order of the tasks. It might be difficult to reproduce the same result, but you may get closer to real-world usage.
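For example (a sketch using OrderBy with a seeded Random, so a run can be repeated):
// Shuffle the task list before starting so reads and writes interleave
// unpredictably; the fixed seed makes a given run reproducible.
var rnd = new Random(12345);
tasks = tasks.OrderBy(t => rnd.Next()).ToList();
// ...then start and wait on the tasks exactly as before.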
Ultimately, your question is about the difference between optimistic concurrency (the Concurrent* classes) and pessimistic concurrency (based on a lock). As a rule of thumb, when the chance of simultaneous access to a shared resource is low, prefer optimistic concurrency; when the chance is high, prefer pessimistic concurrency.
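To illustrate the two styles on something simpler than a list, here is a sketch of a shared counter incremented both ways (hypothetical code, just to show the shapes):
private static readonly object _sync = new object();
private static int _count;

static void PessimisticIncrement()
{
    // Pessimistic: assume contention and always take the lock.
    lock (_sync) { _count++; }
}

static void OptimisticIncrement()
{
    // Optimistic: attempt the update lock-free and retry only if
    // another thread changed the value in the meantime.
    int current;
    do
    {
        current = _count;
    } while (Interlocked.CompareExchange(ref _count, current + 1, current) != current);
}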
Related
I've googled this plenty but I'm afraid I don't fully understand the consequences of concurrency and parallelism.
I have about 3000 rows of database objects, each with an average of 2-4 logical objects attached that need to be validated as part of a search query, meaning the validation service needs to execute approximately 3 × 3000 times. E.g. if the user has filtered on color, each row needs to validate the color and return the result. The loop cannot break when a match has been found; all logical objects always need to be evaluated, because relevance is calculated for every object (it is not just about finding a match).
This is done on-demand when the user selects various properties, meaning performance is key here.
I'm currently doing this using Parallel.ForEach, but I wonder if it would be smarter to use async behavior instead?
Current way
var validatorService = new LogicalGroupValidatorService();
ConcurrentBag<StandardSearchResult> results = new ConcurrentBag<StandardSearchResult>();
Parallel.ForEach(searchGroups, (group) =>
{
var searchGroupResult = validatorService.ValidateLogicGroupRecursivly(
propertySearchQuery, group.StandardPropertyLogicalGroup);
results.Add(new StandardSearchResult(searchGroupResult));
});
Async example code
var validatorService = new LogicalGroupValidatorService();
List<StandardSearchResult> results = new List<StandardSearchResult>();
var tasks = new List<Task<StandardPropertyLogicalGroupSearchResult>>();
foreach (var group in searchGroups)
{
tasks.Add(validatorService.ValidateLogicGroupRecursivlyAsync(
propertySearchQuery, group.StandardPropertyLogicalGroup));
}
await Task.WhenAll(tasks);
results = tasks.Select(logicalGroupResultTask =>
new StandardSearchResult(logicalGroupResultTask.Result)).ToList();
The difference between parallel and async is this:
Parallel: Spin up multiple threads and divide the work over each thread
Async: Do the work in a non-blocking manner.
Whether this makes a difference depends on what it is that blocks in the async way. If you're doing work on the CPU, it's the CPU that is blocking you, so you will still end up with multiple threads. If it's I/O (or anything else besides the CPU), the same thread can be reused while the work completes elsewhere.
For your particular example that means the following:
Parallel.ForEach => Spin up worker threads (the number of threads is managed by the CLR) and divide the items in the list over them, executing items concurrently
async/await => Do this bit of work, and let me continue execution in the meantime. Since you have many items, that means saying this multiple times. What happens then depends on the kind of work:
If the work is on the CPU, the effect is the same
Otherwise, you'll just use a single thread while the work is being done somewhere else
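Put concretely (a sketch; items, Crunch and FetchAsync are hypothetical placeholders, not from the code above):
// CPU-bound: parallelism helps; worker threads do the actual work.
Parallel.ForEach(items, item => Crunch(item));

// I/O-bound: async helps; no thread is blocked while each operation
// completes, so one thread can service many outstanding items.
await Task.WhenAll(items.Select(item => FetchAsync(item)));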
I have certain objects on which certain tasks need to be performed, and all tasks need to be performed on all objects. I want to employ multiple threads, say N parallel threads.
Say I have object identifiers like A, B, C (objects can be in the 100K range; keys can be long or string).
And tasks can be T1, T2, T3, ..., TN (tasks are at most 20 in number).
Conditions for task execution -
Tasks can be executed in parallel even for the same object.
But for the same object, for a given task, it should be executed in series.
Example: say the objects on which tasks are performed are A, B, A
and the tasks are T1, T2.
So T1(A), T2(A) or T1(A), T2(B) are possible, but two executions of T1(A) shouldn't be allowed to run at the same time.
How can I ensure that my conditions are met? I know I have to use some sort of hashing.
I read about hashing, so my hash function could be:
return ObjectIdentifier.GetHashCode() + TaskIdentifier.GetHashCode()
or alternatively a^3 + b^2 (where a and b are the hashes of the object identifier and task identifier respectively).
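For example, I guess I could avoid hand-rolled hash combining entirely with a composite key that already has structural equality (just a sketch):
string objectIdentifier = "A";   // hypothetical identifiers
string taskIdentifier = "T1";

// Tuple<,> has structural equality and a combined hash code, so it can
// serve directly as the (object, task) key.
var key = Tuple.Create(objectIdentifier, taskIdentifier);

// One queue of pending work per key: work for the same key runs in
// series, while different keys can be drained in parallel.
var queues = new ConcurrentDictionary<Tuple<string, string>, ConcurrentQueue<Action>>();
queues.GetOrAdd(key, k => new ConcurrentQueue<Action>())
      .Enqueue(() => Console.WriteLine("T1 on A"));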
What would be the best strategy? Any suggestions?
My tasks don't involve any I/O, and as of now I am using one thread for each task.
So is my current design OK, or should I try to optimize it based on the number of processors (a fixed number of threads)?
You can do a Parallel.ForEach on one of the lists, and a regular foreach on the other list, for example:
Parallel.ForEach (myListOfObjects, currentObject =>
{
foreach(var task in myListOfTasks)
{
task.DoSomething(currentObject);
}
});
I must say that I really like Rufus L's answer. You have to be smart about the things you parallelise and not over-encumber your implementation with excessive thread synchronisation and memory-intensive constructs - those things diminish the benefit of parallelisation. Given the large size of the item pool and the CPU-bound nature of the work, Parallel.ForEach with a sequential inner loop should provide very reasonable performance while keeping the implementation dead simple. It's a win.
Having said that, I have a pretty trivial LINQ-based tweak to Rufus' answer which addresses your other requirement (that for the same object, a given task should be executed in series). The solution works provided that the following assumptions hold:
The order in which the tasks are executed is not significant.
The work to be performed (all combinations of task x object) is known in advance and cannot change.
(Sorry for stating the obvious) The work which you want to parallelise can be parallelised - i.e. there are no shared resources / side-effects are completely isolated.
With those assumptions in mind, consider the following:
// Cartesian product of the two sets (*objects* and *tasks*).
var workItems = objects.SelectMany(
o => tasks.Select(t => new { Object = o, Task = t })
);
// Group *work items* and materialise *work item groups*.
var workItemGroups = workItems
.GroupBy(i => i, (key, items) => items.ToArray())
.ToArray();
Parallel.ForEach(workItemGroups, workItemGroup =>
{
// Execute non-unique *task* x *object*
// combinations sequentially.
foreach (var workItem in workItemGroup)
{
workItem.Task.Execute(workItem.Object);
}
});
Note that I am not limiting the degree of parallelism in Parallel.ForEach. Since all work is CPU-bound, it will work out the best number of threads on its own.
I'm trying to understand why Parallel.For is able to outperform a number of threads in the following scenario: consider a batch of jobs that can be processed in parallel. While processing these jobs, new work may be added, which then needs to be processed as well. The Parallel.For solution would look as follows:
var jobs = new List<Job> { firstJob };
int startIdx = 0, endIdx = jobs.Count;
while (startIdx < endIdx) {
Parallel.For(startIdx, endIdx, i => WorkJob(jobs[i]));
startIdx = endIdx; endIdx = jobs.Count;
}
This means that there are multiple points where the Parallel.For needs to synchronize. Consider a breadth-first graph algorithm; the number of synchronizations would be quite large. A waste of time, no?
Trying the same in the old-fashioned threading approach:
var queue = new ConcurrentQueue<Job>(new[] { firstJob });
var threads = new List<Thread>();
var waitHandle = new AutoResetEvent(false);
int numBusy = 0;
for (int i = 0; i < maxThreads; i++)
threads.Add(new Thread(new ThreadStart(delegate {
while (!queue.IsEmpty || numBusy > 0) {
if (queue.IsEmpty)
// numBusy > 0 implies more work may arrive
waitHandle.WaitOne();
Job job;
if (queue.TryDequeue(out job)) {
Interlocked.Increment(ref numBusy);
WorkJob(job); // WorkJob does a waitHandle.Set() when more work was found
Interlocked.Decrement(ref numBusy);
}
}
// others are possibly waiting for us to enable more work which won't happen
waitHandle.Set();
})));
threads.ForEach(t => t.Start());
threads.ForEach(t => t.Join());
The Parallel.For code is of course much cleaner, but what I cannot comprehend is that it's even faster as well! Is the task scheduler just that good? The synchronizations were eliminated, there's no busy waiting, yet the threaded approach is consistently slower (for me). What's going on? Can the threading approach be made faster?
Edit: thanks for all the answers, I wish I could pick multiple ones. I chose to go with the one that also shows an actual possible improvement.
The two code samples are not really the same.
The Parallel.For() will use a limited number of threads and re-use them. The second sample starts way behind by having to create a number of threads first. That takes time.
And what is the value of maxThreads? Very critical; in Parallel.For() it is dynamic.
Is the task scheduler just that good?
It is pretty good. And the TPL uses work-stealing and other adaptive technologies. You'll have a hard time doing any better.
Parallel.For doesn't actually break the items up into single units of work. It breaks up all the work (early on) based on the number of threads it plans to use and the number of iterations to be executed. It then has each thread synchronously process its batch (possibly using work stealing or saving some extra items to load-balance near the end). With this approach the worker threads are virtually never waiting on each other, while your threads are constantly waiting on each other due to the heavy synchronization you're doing before and after every single iteration.
On top of that, since it's using thread-pool threads, many of the threads it needs are likely already created, which is another advantage in its favor.
As for synchronization, the entire point of a Parallel.For is that all of the iterations can be done in parallel, so there is almost no synchronization that needs to take place (at least in their code).
Then of course there is the issue of the number of threads. The thread pool has a lot of very good algorithms and heuristics to help it determine how many threads are needed at that instant in time, based on the current hardware, the load from other applications, and so on. It's possible that you're using too many threads, or not enough.
Also, since the number of items you have isn't known before you start, I would suggest using Parallel.ForEach rather than several Parallel.For loops. It is simply designed for the situation you're in, so its heuristics will apply better. (It also makes for even cleaner code.)
BlockingCollection<Job> queue = new BlockingCollection<Job>();
//add jobs to queue, possibly in another thread
//call queue.CompleteAdding() when there are no more jobs to run
Parallel.ForEach(queue.GetConsumingEnumerable(),
job => job.DoWork());
You're creating a bunch of new threads, while Parallel.For uses the thread pool. You'd see better performance if you utilized the thread pool yourself, but there really is no point in doing that.
I would shy away from rolling your own solution; if there is a corner case where you need customization, use the TPL and customize.
I have 3 main processing threads, each of them performing operations on the values of ConcurrentDictionaries by means of Parallel.ForEach. The dictionaries vary in size from 1,000 elements to 250,000 elements.
TaskFactory factory = new TaskFactory();
Task t1 = factory.StartNew(() =>
{
Parallel.ForEach(dict1.Values, item => ProcessItem(item));
});
Task t2 = factory.StartNew(() =>
{
Parallel.ForEach(dict2.Values, item => ProcessItem(item));
});
Task t3 = factory.StartNew(() =>
{
Parallel.ForEach(dict3.Values, item => ProcessItem(item));
});
t1.Wait();
t2.Wait();
t3.Wait();
I compared the performance (total execution time) of this construct with just running the Parallel.ForEach calls on the main thread, and the performance improved a lot (the execution time was reduced approximately 5 times).
My questions are:
Is there something wrong with the approach above? If yes, what is it and how can it be improved?
What is the reason for the different execution times?
What is a good way to debug/analyze such a situation?
EDIT: To further clarify the situation: I am mocking the client calls on a WCF service, each of which comes in on a separate thread (the reason for the Tasks). I also tried using ThreadPool.QueueUserWorkItem instead of Task, without a performance improvement. The objects in the dictionaries have between 20 and 200 properties (just decimals and strings) and there is no I/O activity.
I solved the problem by queuing the processing requests in a BlockingCollection and processing them one at a time.
You're probably over-parallelizing.
You don't need to create 3 tasks if you already have good (and balanced) parallelization inside each one of them.
Parallel.ForEach already tries to use the right number of threads to exploit the full CPU potential without saturating it. By creating other tasks that each run a Parallel.ForEach, you're probably saturating it.
(EDIT: as Henk said, the loops probably have trouble coordinating the number of threads to spawn when run in parallel, and at the very least this leads to bigger overhead.)
Have a look here for some hints.
First of all, a Task is not a Thread.
Your Parallel.ForEach() calls are run by a scheduler that uses the ThreadPool and should try to optimize thread usage. ForEach applies a Partitioner. When you run these loops in parallel, they cannot coordinate very well.
Only if there is a performance problem should you consider helping with extra tasks or DegreeOfParallelism directives. And then always profile and analyze first.
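For example, a DegreeOfParallelism directive looks like this (a sketch reusing dict1 and ProcessItem from the question; the cap of 2 is an arbitrary example and should come from profiling):
// Cap each loop explicitly so that three concurrent ForEach calls
// cannot oversubscribe the machine.
var options = new ParallelOptions { MaxDegreeOfParallelism = 2 };
Parallel.ForEach(dict1.Values, options, item => ProcessItem(item));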
An explanation of your results is difficult; they could be caused by many factors (I/O, for example), but the advantage of the 'single main task' approach is that the scheduler has more control and the CPU and cache are used better (locality).
The dictionaries vary widely in size, and by the looks of it (given that everything finishes in under 5 s) the amount of processing work is small. Without knowing more, it's hard to say what's actually going on. How big are your dictionary items? The main-thread scenario you're comparing this to looks like this, right?
Parallel.ForEach(dict1.Values, item => ProcessItem(item));
Parallel.ForEach(dict2.Values, item => ProcessItem(item));
Parallel.ForEach(dict3.Values, item => ProcessItem(item));
By adding the Tasks around each ForEach, you're adding more overhead to manage the tasks, and probably causing memory contention as dict1, dict2 and dict3 all try to be in memory and hot in cache at the same time. Remember: CPU cycles are cheap, cache misses are not.
Before I started a project, I wrote a simple test to compare the performance of ConcurrentBag (from System.Collections.Concurrent) relative to locking and lists. I am extremely surprised that ConcurrentBag is over 10 times slower than locking with a simple List. From what I understand, ConcurrentBag works best when the reader and writer are the same thread. However, I hadn't thought its performance would be so much worse than traditional locks.
I have run a test with two Parallel.For loops writing to and reading from a list/bag. However, the write by itself shows a huge difference:
private static void ConcurrentBagTest()
{
int collSize = 10000000;
Stopwatch stopWatch = new Stopwatch();
ConcurrentBag<int> bag1 = new ConcurrentBag<int>();
stopWatch.Start();
Parallel.For(0, collSize, delegate(int i)
{
bag1.Add(i);
});
stopWatch.Stop();
Console.WriteLine("Elapsed Time = {0}",
stopWatch.Elapsed.TotalSeconds);
}
On my box, this takes between 3-4 secs to run, compared to 0.5 - 0.9 secs of this code:
private static void LockCollTest()
{
int collSize = 10000000;
object list1_lock=new object();
List<int> lst1 = new List<int>(collSize);
Stopwatch stopWatch = new Stopwatch();
stopWatch.Start();
Parallel.For(0, collSize, delegate(int i)
{
lock(list1_lock)
{
lst1.Add(i);
}
});
stopWatch.Stop();
Console.WriteLine("Elapsed = {0}",
stopWatch.Elapsed.TotalSeconds);
}
As I mentioned, doing concurrent reads and writes doesn't help the concurrent bag test. Am I doing something wrong or is this data structure just really slow?
[EDIT] - I removed the Tasks because I don't need them here (the full code had another task reading).
[EDIT]
Thanks a lot for the answers. I am having a hard time picking "the right answer" since it seems to be a mix of a few answers.
As Michael Goldshteyn pointed out, the speed really depends on the data.
Darin pointed out that there should be more contention for ConcurrentBag to be faster, and that Parallel.For doesn't necessarily start the same number of threads. One point to take away is not to do anything you don't have to inside a lock. In the above case, I don't see myself doing anything inside the lock except maybe assigning the value to a temp variable.
Additionally, sixlettervariables pointed out that the number of threads that happen to be running may also affect results, although I tried running the original test in reverse order and ConcurrentBag was still slower.
I ran some tests starting 15 Tasks, and the results depended on the collection size among other things. However, ConcurrentBag performed almost as well as, or better than, locking a list for up to 1 million insertions. Above 1 million, locking sometimes seemed much faster, but I'll probably never have a larger data structure for my project.
Here's the code I ran:
int collSize = 1000000;
object list1_lock=new object();
List<int> lst1 = new List<int>();
ConcurrentBag<int> concBag = new ConcurrentBag<int>();
int numTasks = 15;
Stopwatch sWatch = new Stopwatch();
sWatch.Start();
//First, try locks
Task.WaitAll(Enumerable.Range(1, numTasks)
.Select(x => Task.Factory.StartNew(() =>
{
for (int i = 0; i < collSize / numTasks; i++)
{
lock (list1_lock)
{
lst1.Add(x);
}
}
})).ToArray());
sWatch.Stop();
Console.WriteLine("lock test. Elapsed = {0}",
sWatch.Elapsed.TotalSeconds);
// now try concurrentBag
sWatch.Restart();
Task.WaitAll(Enumerable.Range(1, numTasks).
Select(x => Task.Factory.StartNew(() =>
{
for (int i = 0; i < collSize / numTasks; i++)
{
concBag.Add(x);
}
})).ToArray());
sWatch.Stop();
Console.WriteLine("Conc Bag test. Elapsed = {0}",
sWatch.Elapsed.TotalSeconds);
Let me ask you this: how realistic is it that you'd have an application which is constantly adding to a collection and never reading from it? What's the use of such a collection? (This is not a purely rhetorical question. I could imagine there being uses where, e.g., you only read from the collection on shutdown (for logging) or when requested by the user. I believe these scenarios are fairly rare, though.)
This is what your code is simulating. Calling List<T>.Add is going to be lightning-fast in all but the occasional case where the list has to resize its internal array; but this is smoothed out by all the other adds that happen quite quickly. So you're not likely to see a significant amount of contention in this context, especially testing on a personal PC with, e.g., even 8 cores (as you stated you have in a comment somewhere). Maybe you might see more contention on something like a 24-core machine, where many cores can be trying to add to the list literally at the same time.
Contention is much more likely to creep in where you read from your collection, especially in foreach loops (or LINQ queries, which amount to foreach loops under the hood), which require locking the entire operation so that you aren't modifying your collection while iterating over it.
If you can realistically reproduce this scenario, I believe you will see ConcurrentBag<T> scale much better than your current test is showing.
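To see why, compare what a reader has to do in each case (a sketch, not the benchmark program itself):
var listLock = new object();
var list = new List<double>();
var bag = new ConcurrentBag<double>();

// With a plain List<T> the reader must hold the lock for the entire
// enumeration, blocking every writer (and every other reader) meanwhile.
double listSum;
lock (listLock)
{
    listSum = list.Sum();
}

// ConcurrentBag<T> enumerates a moment-in-time snapshot, so readers
// never block writers.
double bagSum = bag.Sum();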
Update: Here is a program I wrote to compare these collections in the scenario I described above (multiple writers, many readers). Running 25 trials with a collection size of 10000 and 8 reader threads, I got the following results:
Took 529.0095 ms to add 10000 elements to a List<double> with 8 reader threads.
Took 39.5237 ms to add 10000 elements to a ConcurrentBag<double> with 8 reader threads.
Took 309.4475 ms to add 10000 elements to a List<double> with 8 reader threads.
Took 81.1967 ms to add 10000 elements to a ConcurrentBag<double> with 8 reader threads.
Took 228.7669 ms to add 10000 elements to a List<double> with 8 reader threads.
Took 164.8376 ms to add 10000 elements to a ConcurrentBag<double> with 8 reader threads.
[ ... ]
Average list time: 176.072456 ms.
Average bag time: 59.603656 ms.
So clearly it depends on exactly what you're doing with these collections.
There seems to be a bug in the .NET Framework 4 that Microsoft fixed in 4.5; it seems they didn't expect ConcurrentBag to be used a lot.
See the following Ayende post for more info
http://ayende.com/blog/156097/the-high-cost-of-concurrentbag-in-net-4-0
As a general answer:
Concurrent collections that use locking can be very fast if there is little or no contention for their data (i.e., for the locks). This is because such collection classes are often built using very inexpensive locking primitives, which are especially cheap when uncontended.
Lock-free collections can be slower because of the tricks used to avoid locks and because of other bottlenecks such as false sharing, the complexity required to implement their lock-free nature leading to cache misses, and so on.
To summarize, which way is faster depends strongly on the data structures employed and on the amount of contention for the locks, among other issues (e.g., the number of readers vs. writers in a shared/exclusive arrangement).
Your particular example has a very high degree of contention, so I must say I am surprised by the behavior. On the other hand, the amount of work done while the lock is kept is very small, so maybe there is little contention for the lock itself, after all. There could also be deficiencies in the implementation of ConcurrentBag's concurrency handling which makes your particular example (with frequent inserts and no reads) a bad use case for it.
Looking at the program using MS's contention visualizer shows that ConcurrentBag<T> has a much higher cost associated with parallel insertion than simply locking on a List<T>. One thing I noticed is there appears to be a cost associated with spinning up the 6 threads (used on my machine) to begin the first ConcurrentBag<T> run (cold run). 5 or 6 threads are then used with the List<T> code, which is faster (warm run). Adding another ConcurrentBag<T> run after the list shows it takes less time than the first (warm run).
From what I'm seeing in the contention, a lot of time is spent in the ConcurrentBag<T> implementation allocating memory. Removing the explicit allocation of size from the List<T> code slows it down, but not enough to make a difference.
EDIT: it appears that ConcurrentBag<T> internally keeps a list per Thread.CurrentThread, locks 2-4 times depending on whether it is running on a new thread, and performs at least one Interlocked.Exchange. As noted on MSDN, it is "optimized for scenarios where the same thread will be both producing and consuming data stored in the bag." This is the most likely explanation for your performance decrease versus a raw list.
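In other words, the bag is built for mixed produce/consume work on the same thread, along these lines (a sketch), rather than for an add-only benchmark:
var bag = new ConcurrentBag<int>();

// Each worker adds and takes on the same thread, so it mostly hits its
// own thread-local list and rarely needs to lock or steal from others.
Parallel.For(0, 1000000, i =>
{
    bag.Add(i);
    int value;
    if (bag.TryTake(out value))
    {
        // consume value...
    }
});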
This is already resolved in .NET 4.5. The underlying issue was that ThreadLocal, which ConcurrentBag uses, didn't expect to have a lot of instances. That has been fixed, and it can now run fairly fast.
source - The HIGH cost of ConcurrentBag in .NET 4.0
As @Darin-Dimitrov said, I suspect that your Parallel.For isn't actually spawning the same number of threads in each of the two tests. Try manually creating N threads to ensure that you are actually seeing thread contention in both cases.
You basically have very few concurrent writes and no contention (Parallel.For doesn't necessarily mean many threads). Try parallelizing the writes and you will observe different results:
class Program
{
private static object list1_lock = new object();
private const int collSize = 1000;
static void Main()
{
ConcurrentBagTest();
LockCollTest();
}
private static void ConcurrentBagTest()
{
var bag1 = new ConcurrentBag<int>();
var stopWatch = Stopwatch.StartNew();
Task.WaitAll(Enumerable.Range(1, collSize).Select(x => Task.Factory.StartNew(() =>
{
Thread.Sleep(5);
bag1.Add(x);
})).ToArray());
stopWatch.Stop();
Console.WriteLine("Elapsed Time = {0}", stopWatch.Elapsed.TotalSeconds);
}
private static void LockCollTest()
{
var lst1 = new List<int>(collSize);
var stopWatch = Stopwatch.StartNew();
Task.WaitAll(Enumerable.Range(1, collSize).Select(x => Task.Factory.StartNew(() =>
{
lock (list1_lock)
{
Thread.Sleep(5);
lst1.Add(x);
}
})).ToArray());
stopWatch.Stop();
Console.WriteLine("Elapsed = {0}", stopWatch.Elapsed.TotalSeconds);
}
}
My guess is that the locks don't experience much contention. I would recommend reading the following article: Java theory and practice: Anatomy of a flawed microbenchmark. The article discusses a lock microbenchmark. As stated in the article, there are a lot of things to take into consideration in this kind of situation.
It would be interesting to see the scaling between the two of them.
Two questions:
1) how fast is bag vs. list for reading (remember to put a lock on the list)?
2) how fast is bag vs. list for reading while another thread is writing?
Because the loop body is small, you could try using the Partitioner class's Create method...
which enables you to provide a sequential loop for the delegate body, so that the delegate is invoked only once per partition, instead of once per iteration
How to: Speed Up Small Loop Bodies
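Applied to the add-only test, that would look something like this (a sketch reusing bag1 and collSize from the question; Partitioner.Create(0, collSize) yields index ranges as tuples):
// Hand each delegate invocation a whole index range instead of a single
// index, amortizing the per-iteration delegate overhead of the tiny body.
Parallel.ForEach(Partitioner.Create(0, collSize), range =>
{
    for (int i = range.Item1; i < range.Item2; i++)
    {
        bag1.Add(i);
    }
});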
It appears that ConcurrentBag is just slower than the other concurrent collections.
I think it's an implementation problem: ANTS Profiler shows that it gets bogged down in a couple of places, including an array copy.
Using ConcurrentDictionary is thousands of times faster.
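For illustration, here is a sketch of that substitution for the add-only test (reusing collSize from the question; it is only a fair stand-in when every item is unique, since a dictionary rejects duplicate keys):
// ConcurrentDictionary used as a set: the value is a throwaway byte.
// Its striped locking avoids ConcurrentBag's per-thread-list bookkeeping.
var dict = new ConcurrentDictionary<int, byte>();
Parallel.For(0, collSize, i =>
{
    dict.TryAdd(i, 0);
});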