Using parallelization for relatively large loops - C#

I have an 8-core CPU machine with 8 GB of memory. Logically, the following code can be done in parallel, but two things give me pause. First, the loop exposes more than enough opportunities for parallelism, since I have far fewer cores available than the loop has iterations. Second, every delegate expression allocates some memory to hold the free variables.
Is it recommended to use Parallel.For in this case?
Also, will separating the two Parallel.For loops into two tasks improve performance in this case?
private static void DoWork()
{
    int end1 = 100; // minimum of 100 values
    int end2 = 100; // minimum of 100 values

    Task a = Task.Factory.StartNew(delegate
    {
        Parallel.For(0, end1, delegate(int i)
        {
            // independent work
        });
    });

    Task b = Task.Factory.StartNew(delegate
    {
        Parallel.For(0, end2, delegate(int i)
        {
            // independent work
        });
    });

    a.Wait();
    b.Wait();
}

Also, will separating the two Parallel.For loops into two tasks improve performance in this case?
Not noticeably, and you could easily harm performance.
The TPL is specifically designed to provide load balancing; let it do its job.
The main points of concern here are:
the 'work' should really be independent
the 'work' should be non-trivial, i.e., computationally intensive and considerably more than just adding a few numbers
the 'work' should avoid I/O (as much as possible)

If the per-iteration work is small, you can also reduce the per-delegate overhead by chunking the range with Partitioner.Create, while still leaving the actual load balancing to the framework.
Try creating a ParallelOptions object and passing it to Parallel.For. Experiment with different MaxDegreeOfParallelism values and tune your code based on the results; this number can be higher than the number of cores in your system. This has worked for me.
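A minimal sketch of that tuning suggestion (the value 16 is illustrative, not a recommendation; end1 is taken from the question's code):

// Cap the degree of parallelism explicitly, measure, and tune the value;
// it may legitimately exceed the core count if the work ever blocks.
var options = new ParallelOptions { MaxDegreeOfParallelism = 16 };
Parallel.For(0, end1, options, i =>
{
    // independent work
});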

Related

Asynchronous programming for method loops

I am trying to learn about asynchronous programming and how I can benefit from it.
My hope is that I can use it to improve performance whenever I'm looping over a method that takes a significant amount of time to complete, like the following method.
string AddStrings()
{
    string result = "";
    for (int i = 0; i < 10000; i++)
    {
        result += "hi";
    }
    return result;
}
Obviously this method doesn't have much value, and I purposely made it inefficient, in order to test the benefits of asynchronous programming. The test is done by looping over the method 100 times, first synchronously and then asynchronously.
Stopwatch watch = new Stopwatch();
watch.Start();
List<string> results = WorkSync();
//List<string> results = await WorkAsyncParallel();
watch.Stop();
Console.WriteLine(watch.ElapsedMilliseconds);

List<string> WorkSync()
{
    var stringList = new List<string>();
    for (int i = 0; i < 100; i++)
    {
        stringList.Add(AddStrings());
    }
    return stringList;
}

async Task<List<string>> WorkAsyncParallel()
{
    var taskList = new List<Task<string>>();
    for (int i = 0; i < 100; i++)
    {
        taskList.Add(Task.Run(() => AddStrings()));
    }
    var results = (await Task.WhenAll(taskList)).ToList();
    return results;
}
Super optimistically (naively), I was hoping that the asynchronous loop would be 100 times as fast as the synchronous loop, as all the tasks are running at the same time. While that didn't exactly turn out to be the case, the time of the loop decreased by more than two thirds, from around 5000 milliseconds to 1500 milliseconds!
Now my questions are:
What makes the asynchronous loop faster than the synchronous loop, but not nearly 100 times faster? I'm guessing each of the 100 tasks is fighting for a limited amount of CPU?
Is this a valid method to improve performance when looping methods?
Thank you in advance.
My hope is that I can use it to improve performance
Not really. Concurrency (parallel or asynchronous) can improve performance, but asynchronous code on its own is more about freeing up threads. Asynchronous code provides two main benefits:
Server-side apps get better scalability. By using fewer threads, asynchronous code can scale further and faster than synchronous code.
UI apps get better responsiveness. By freeing up the UI thread, asynchronous code provides a better user experience.
Neither of these has much to do with performance. E.g., when comparing a synchronous server-side request handler with its basic asynchronous counterpart, the asynchronous one is usually slightly slower, but the server as a whole scales better.
That said, asynchronous code can enable natural concurrency (e.g., Task.WhenAll). And if you do end up adding concurrency, then you can see some performance benefits - sometimes quite drastic ones.
The test is done by looping over the method 100 times, first synchronously and then asynchronously.
Technically, you're comparing single-threaded and parallel code here.
As Jeroen pointed out in the comments, Parallel or PLINQ are the proper tools to use if you have a parallel problem. While you can use Task.Run, it's a very low-level tool for parallel programming. Task.Run is more commonly used for "shove this one thing to the thread pool", not "shove these 100 things to the thread pool".
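For instance, a minimal sketch of the same test written with PLINQ instead of 100 Task.Run calls, assuming the AddStrings method from the question:

// PLINQ partitions the 100 calls across the available cores for us;
// add .AsOrdered() if the order of the results matters.
List<string> WorkParallel()
{
    return Enumerable.Range(0, 100)
        .AsParallel()
        .Select(_ => AddStrings())
        .ToList();
}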
What makes the asynchronous loop faster than the synchronous loop, but not nearly 100 times faster? I'm guessing each of the 100 tasks is fighting for a limited amount of CPU?
Yes. Most machines these days have multi-core CPUs, and each core can do one thing at a time. So if you have 4 or 8 cores, that's the limit of your parallelism.
Is this a valid method to improve performance when looping methods?
If you have a lot of work to do in parallel, then parallel programming is acceptable. Again, I recommend using higher-level constructs like Parallel or PLINQ, whose built-in partitioning strategies and other optimizations will be more efficient than throwing a bunch of tasks at the thread pool.
Parallel programming does have its caveats:
If your work is too fine-grained, then parallel work can end up being slower, as the overhead of partitioning, queueing, and scheduling erases the gains from concurrency.
You generally want to avoid parallelism in some situations such as handling a request on the server side. It's just usually not a good idea to allow one request to consume all the CPU resources of the whole server.
Now, if you want to test asynchronous benefits, then I recommend using an operation that is inherently asynchronous (e.g., a client web request). Say your code was hitting 100 URLs: that would be naturally asynchronous work that doesn't require thread pool threads.
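A minimal sketch of that kind of test, assuming a collection of URLs and using HttpClient:

// All requests are in flight concurrently, and no thread pool thread
// is blocked while waiting on the network.
async Task<string[]> FetchAllAsync(IEnumerable<string> urls)
{
    using var client = new HttpClient();
    var tasks = urls.Select(url => client.GetStringAsync(url));
    return await Task.WhenAll(tasks);
}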

CPU benchmark test: Tasks vs ThreadPool vs Thread

I posted another SO question here, and as a follow-up, my colleague did a test, seen below, as some form of "counter" to the argument for async/await/Tasks.
(I am aware that the lock on resultList isn't needed, disregard that)
I am aware that async/await and Tasks are not made to handle CPU-intensive tasks, but instead to handle I/O operations that are done by the OS. The benchmark below is a CPU-intensive task, so the test is flawed from the start.
However, as I understand it, using new Task().Start() will schedule the operation on the ThreadPool and execute the test code on different ThreadPool threads. Wouldn't that mean that the first and second tests are more or less the same? (I'm guessing not; please explain.)
Why, then, the big difference between them?
some form of "counter" to the argument for async/await/Tasks.
The posted code has absolutely nothing to do with async or await. It's comparing three different kinds of parallelism:
Dynamic Task Parallelism.
Direct threadpool access.
Manual multithreading with manual partitioning.
The first two are somewhat comparable. Of course, direct threadpool access will be faster than Dynamic Task Parallelism. But what these tests don't show is that direct threadpool access is much harder to do correctly. In particular, when you are running real-world code and need to handle exceptions and return values, you have to add in boilerplate code and object instances to the direct threadpool access code that slows it down.
The third one is not comparable at all. It just uses 10 manual threads. Again, this example ignores the additional complexity necessary in real-world code; specifically, the need to handle exceptions and return values. It also assumes a partition size, which is problematic; real-world code does not have that luxury. If you're managing your own set of threads, then you have to decide things like how quickly you should increase the number of threads when the queue has many items, and how quickly you should end threads when the queue is empty. These are all difficult questions that add lots of code to the #3 test before you're really comparing the same thing.
And that's not even to say anything about the cost of maintenance. In my experience (i.e., as an application developer), micro-optimizations are just not worth it. Even if you took the "worst" (#1) approach, you're losing about 7 microseconds per item. That is an unimaginably small amount of savings. As a general rule, developer time is far more valuable to your company than user time. If your users have to process a hundred thousand items, the difference would barely be perceptible. If you were to adopt the "best" (#3) approach, the code would be much less maintainable, particularly considering the boilerplate and thread management code necessary in production code and not shown here. Going with #3 would probably cost your company far more in terms of developer time just writing or reading the code than it would ever save in terms of user time.
Oh, and the funniest part of all this is that with all these different kinds of parallelism compared, they didn't even include the one that is most suitable for this test: PLINQ.
static void Main(string[] args)
{
    TaskParallelLibrary();
    ManualThreads();
    Console.ReadKey();
}

static void ManualThreads()
{
    var queue = new List<string>();
    for (int i = 0; i != 1000000; ++i)
        queue.Add("string" + i);
    var resultList = new List<string>();
    var stopwatch = Stopwatch.StartNew();
    var counter = 0;
    for (int i = 0; i != 10; ++i)
    {
        new Thread(() =>
        {
            while (true)
            {
                var t = "";
                lock (queue)
                {
                    if (counter >= queue.Count)
                        break;
                    t = queue[counter];
                    ++counter;
                }
                t = t.Substring(0, 5);
                string t2 = t.Substring(0, 2) + t;
                lock (resultList)
                    resultList.Add(t2);
            }
        }).Start();
    }
    while (resultList.Count < queue.Count)
        Thread.Sleep(1);
    stopwatch.Stop();
    Console.WriteLine($"Manual threads: Processed {resultList.Count} in {stopwatch.Elapsed}");
}

static void TaskParallelLibrary()
{
    var queue = new List<string>();
    for (int i = 0; i != 1000000; ++i)
        queue.Add("string" + i);
    var stopwatch = Stopwatch.StartNew();
    var resultList = queue.AsParallel().Select(t =>
    {
        t = t.Substring(0, 5);
        return t.Substring(0, 2) + t;
    }).ToList();
    stopwatch.Stop();
    Console.WriteLine($"Parallel: Processed {resultList.Count} in {stopwatch.Elapsed}");
}
On my machine, after running this code several times, I find that the PLINQ code outperforms the Manual Threads by about 30%. Sample output on .NET Core 3.0 preview5-27626-15, built for Release, run standalone:
Parallel: Processed 1000000 in 00:00:00.3629408
Manual threads: Processed 1000000 in 00:00:00.5119985
And, of course, the PLINQ code is:
Shorter
More maintainable
More robust (handles exceptions and return types)
Less awkward (no need to poll for completion)
More portable (partitions based on number of processors)
More flexible (automatically adjusts the thread pool as necessary based on amount of work)

What determines the number of threads for TaskFactory-spawned jobs?

I have the following code:
var factory = new TaskFactory();
for (int i = 0; i < 100; i++)
{
    var i1 = i;
    factory.StartNew(() => foo(i1));
}

static void foo(int i)
{
    Thread.Sleep(1000);
    Console.WriteLine($"foo{i} - on thread {Thread.CurrentThread.ManagedThreadId}");
}
I can see it only does 4 threads at a time (based on observation). My questions:
What determines the number of threads used at a time?
How can I retrieve this number?
How can I change this number?
P.S. My box has 4 cores.
P.P.S. I needed to have a specific number of tasks (and no more) processed concurrently by the TPL, and ended up with the following code:
private static int count = 0; // keep track of how many concurrent tasks are running

private static void SemaphoreImplementation()
{
    var s = new Semaphore(20, 20); // allow 20 tasks at a time
    for (int i = 0; i < 1000; i++)
    {
        var i1 = i;
        Task.Factory.StartNew(() =>
        {
            try
            {
                s.WaitOne();
                Interlocked.Increment(ref count);
                foo(i1);
            }
            finally
            {
                s.Release();
                Interlocked.Decrement(ref count);
            }
        }, TaskCreationOptions.LongRunning);
    }
}

static void foo(int i)
{
    Thread.Sleep(100);
    Console.WriteLine($"foo{i:00} - on thread " +
        $"{Thread.CurrentThread.ManagedThreadId:00}. Executing concurrently: {count}");
}
When you are using a Task in .NET, you are telling the TPL to schedule a piece of work (via TaskScheduler) to be executed on the ThreadPool. Note that the work will be scheduled at its earliest opportunity and however the scheduler sees fit. This means that the TaskScheduler will decide how many threads will be used to run n number of tasks and which task is executed on which thread.
The TPL is very well tuned and continues to adjust its algorithm as it executes your tasks. So, in most cases, it tries to minimize contention. What this means is that if you are running 100 tasks and only have 4 cores (which you can get using Environment.ProcessorCount), it would not make sense to execute more than 4 threads at any given time; otherwise it would need to do more context switching. Now there are times when you want to explicitly override this behaviour, say when you need to wait for some sort of I/O to finish, which is a whole different story.
In summary, trust the TPL. But if you are adamant about spawning a thread per task (not always a good idea!), you can use:
Task.Factory.StartNew(
() => /* your piece of work */,
TaskCreationOptions.LongRunning);
This tells the default TaskScheduler to explicitly spawn a new thread for that piece of work.
You can also use your own TaskScheduler and pass it to the TaskFactory. You can find a whole bunch of schedulers HERE.
Note that another alternative is PLINQ, which by default analyses your query and decides whether parallelizing it would yield any benefit. In the case of blocking I/O, where you are certain that starting multiple threads will result in better execution, you can force parallelism with WithExecutionMode(ParallelExecutionMode.ForceParallelism). You can then use WithDegreeOfParallelism to give a hint about how many threads to use, but remember there is no guarantee you will get that many threads; as MSDN says:
Sets the degree of parallelism to use in a query. Degree of
parallelism is the maximum number of concurrently executing tasks that
will be used to process the query.
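Put together, a sketch of those PLINQ knobs (the items sequence and DoBlockingWork are hypothetical, and 8 is just an example value):

// Force parallel execution for blocking I/O work and hint at a degree
// of parallelism; PLINQ treats the degree as a maximum, not a promise.
var results = items.AsParallel()
    .WithExecutionMode(ParallelExecutionMode.ForceParallelism)
    .WithDegreeOfParallelism(8)
    .Select(item => DoBlockingWork(item))
    .ToList();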
Finally, I highly recommend having a read of THIS great series of articles on Threading and TPL.
If you increase the number of tasks to, for example, 1,000,000, you will see a lot more threads spawned over time; the TPL tends to inject one every 500 ms.
The TPL thread pool does not understand I/O-bound workloads (and a sleeping thread looks just like one blocked on I/O). It's not a good idea to rely on the TPL to pick the right degree of parallelism in these cases. The TPL is completely clueless here: it injects more threads based on vague guesses about throughput, and also to avoid deadlocks.
Here, the TPL policy clearly is not useful, because the more threads you add, the more throughput you get. Each thread can process one item per second in this contrived case, and the TPL has no idea about that. It makes no sense to limit the thread count to the number of cores.
What determines the number of threads used at a time?
Barely documented TPL heuristics, and they frequently go wrong. In particular, they will spawn an unlimited number of threads over time in this case. Use Task Manager to see for yourself: let this run for an hour and you'll have thousands of threads.
How can I retrieve this number? How can I change this number?
You can retrieve some of these numbers, but that's not the right way to go. If you need a guaranteed DOP, you can use AsParallel().WithDegreeOfParallelism(...) or a custom task scheduler. You can also manually start LongRunning tasks. Do not mess with process-global settings.
I would suggest using SemaphoreSlim, because it doesn't use the Windows kernel (so it can be used in Linux C# microservices) and also has a SemaphoreSlim.CurrentCount property that tells how many slots remain, so you don't need Interlocked.Increment or Interlocked.Decrement. Note that the lambda still needs a per-iteration copy of i (i1 below): a lambda inside a for loop captures the loop variable itself, not its value at that iteration, so without the copy the tasks could all observe later values of i. Also, with an async lambda, Task.Run is a better fit than Task.Factory.StartNew with TaskCreationOptions.LongRunning: StartNew returns a nested Task<Task>, and a LongRunning thread would be given up at the first await anyway:
private static void SemaphoreImplementation()
{
    var maxTasksCount = 20; // allow 20 tasks at a time
    var semaphoreSlim = new SemaphoreSlim(maxTasksCount, maxTasksCount);
    for (int i = 0; i < 1000; i++)
    {
        var i1 = i; // per-iteration copy; the lambda would otherwise capture the loop variable
        Task.Run(async () =>
        {
            await semaphoreSlim.WaitAsync();
            try
            {
                // CurrentCount is the number of free slots, so this is how many tasks hold one
                var count = maxTasksCount - semaphoreSlim.CurrentCount;
                await foo(i1, count);
            }
            finally
            {
                semaphoreSlim.Release();
            }
        });
    }
}

static async Task foo(int i, int count)
{
    await Task.Delay(100);
    Console.WriteLine($"foo{i:00} - on thread " +
        $"{Thread.CurrentThread.ManagedThreadId:00}. Executing concurrently: {count}");
}

ThreadPool frustrations - Thread creation exceeding SetMaxThreads

I've got an I/O intensive operation.
I only want a MAX of 5 threads ever running at one time.
I've got 8000 tasks to queue and complete.
Each task takes approximately 15-20 seconds to execute.
I've looked around at ThreadPool, but
ThreadPool.SetMaxThreads(5, 0);
List<task> tasks = GetTasks();
int toProcess = tasks.Count;
ManualResetEvent resetEvent = new ManualResetEvent(false);
for (int i = 0; i < tasks.Count; i++)
{
    ReportGenerator worker = new ReportGenerator(tasks[i].Code, id);
    ThreadPool.QueueUserWorkItem(x =>
    {
        worker.Go();
        if (Interlocked.Decrement(ref toProcess) == 0)
            resetEvent.Set();
    });
}
resetEvent.WaitOne();
I cannot figure out why my code is executing more than 5 threads at one time. I've tried SetMaxThreads and SetMinThreads, but it keeps executing more than 5 threads.
What is happening? What am I missing? Should I be doing this in another way?
Thanks
There is a limitation in SetMaxThreads in that you can never set it lower than the number of processors on the system. If you have 8 processors, setting it to 5 is the same as not calling the function at all.
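You can see the rejection directly, because SetMaxThreads reports failure through its return value:

// On a machine with more than 5 processors this call is rejected
// and returns false, leaving the pool's limits unchanged.
bool accepted = ThreadPool.SetMaxThreads(5, 0);
Console.WriteLine($"Accepted: {accepted} (processors: {Environment.ProcessorCount})");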
Task Parallel Library can help you:
List<task> tasks = GetTasks();
Parallel.ForEach(tasks, new ParallelOptions { MaxDegreeOfParallelism = 5 },
    task =>
    {
        ReportGenerator worker = new ReportGenerator(task.Code, id);
        worker.Go();
    });
What does MaxDegreeOfParallelism do?
I think there's a different and better way to approach this. (Pardon me if I accidentally Java-ize some of the syntax)
The main thread here has a lists of things to do in "Tasks" -- instead of creating threads for each task, which is really not efficient when you have so many items, create the desired number of threads and then have them request tasks from the list as needed.
The first thing to do is add a variable to the class this code comes from, for use as a pointer into the list. We'll also add one for the maximum desired thread count.
// New variables in your class definition
private int taskStackPointer;
private const int MAX_THREADS = 5;
Create a method that returns the next task in the list and increments the stack pointer. Then create a new interface for this:
// Make sure that only one thread has access at a time
[MethodImpl(MethodImplOptions.Synchronized)]
public task getNextTask()
{
    if (taskStackPointer < tasks.Count)
        return tasks[taskStackPointer++];
    else
        return null;
}
Alternatively, you could return tasks[taskStackPointer++].code if there's a value you can designate as meaning "end of list". It's probably easier to do it this way, however.
The interface:
public interface TaskDispatcher
{
    [MethodImpl(MethodImplOptions.Synchronized)]
    task getNextTask();
}
Within the ReportGenerator class, change the constructor to accept the dispatcher object:
public ReportGenerator( TaskDispatcher td, int idCode )
{
...
}
You'll also need to alter the ReportGenerator class so that the processing has an outer loop that starts off by calling td.getNextTask() to request a new task, and which exits the loop when it gets back null.
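That outer loop might look roughly like this (a sketch only: the dispatcher field, the task type, and the Process method are assumed from the surrounding discussion):

// Each worker thread pulls tasks until the dispatcher runs dry, then exits.
public void Go()
{
    new Thread(() =>
    {
        task t;
        while ((t = dispatcher.getNextTask()) != null)
        {
            Process(t); // the actual report-generation work (assumed)
        }
    }).Start();
}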
Finally, alter the thread-creation code to something like this (just to give you an idea):
taskStackPointer = 0;
for (int i = 0; i < MAX_THREADS; i++)
{
    ReportGenerator worker = new ReportGenerator(this, id);
    worker.Go();
}
That way you create the desired number of threads and keep them all working at max capacity.
(I'm not sure I got the usage of "[MethodImpl(MethodImplOptions.Synchronized)]" exactly right... I am more used to Java than C#)
Your tasks list will have 8k items in it because you told the code to put them there:
List<task> tasks = GetTasks();
That said, this number has nothing to do with how many threads are being used; the debugger is simply showing how many items you added to the list.
There are various ways to determine how many threads are in use. Perhaps one of the simplest is to break into the application with the debugger and take a look at the threads window. Not only will you get a count, but you'll see what each thread is doing (or not) which leads me to...
There is significant discussion to be had about what your tasks are doing and how you arrived at a number to 'throttle' the thread pool. In most use cases, the thread pool is going to do the right thing.
Now to answer your specific question...
To explicitly control the number of concurrent tasks, consider a trivial implementation: change your task collection from a List to a BlockingCollection (internally backed by a ConcurrentQueue) and use the following code to consume the work:
var parallelOptions = new ParallelOptions
{
    MaxDegreeOfParallelism = 5
};

Parallel.ForEach(collection.GetConsumingEnumerable(), parallelOptions, x =>
{
    // Do work here...
});
Change MaxDegreeOfParallelism to whatever concurrent value you have determined is appropriate for the work you are doing.
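On the producer side, a sketch of how the collection might be filled (reusing the task type and GetTasks from the question); calling CompleteAdding is what lets GetConsumingEnumerable, and therefore the Parallel.ForEach above, finish:

// Fill the collection, then mark it complete so consumers
// drain the remaining items and stop.
var collection = new BlockingCollection<task>();
foreach (var t in GetTasks())
    collection.Add(t);
collection.CompleteAdding();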
The following might be of interest to you:
Parallel.ForEach Method
BlockingCollection
Chris
It works for me. This way you can't set the number of worker threads lower than minWorkerThreads. The problem is that if you need a maximum of five worker threads and minWorkerThreads is six, it doesn't work.
ThreadPool.GetMinThreads(out int minWorkerThreads, out int minPortThreads);
ThreadPool.SetMaxThreads(minWorkerThreads, minPortThreads);
MSDN
Remarks
You cannot set the maximum number of worker threads or I/O completion threads to a number smaller than the number of processors on the computer. To determine how many processors are present, retrieve the value of the Environment.ProcessorCount property. In addition, you cannot set the maximum number of worker threads or I/O completion threads to a number smaller than the corresponding minimum number of worker threads or I/O completion threads. To determine the minimum thread pool size, call the GetMinThreads method.
If the common language runtime is hosted, for example by Internet Information Services (IIS) or SQL Server, the host can limit or prevent changes to the thread pool size.
Use caution when changing the maximum number of threads in the thread pool. While your code might benefit, the changes might have an adverse effect on code libraries you use.
Setting the thread pool size too large can cause performance problems. If too many threads are executing at the same time, the task switching overhead becomes a significant factor.

Why is ConcurrentBag<T> so slow in .Net (4.0)? Am I doing it wrong?

Before I started a project, I wrote a simple test to compare the performance of ConcurrentBag (from System.Collections.Concurrent) with locking and lists. I am extremely surprised that ConcurrentBag is over 10 times slower than locking with a simple List. From what I understand, ConcurrentBag works best when the reader and the writer are the same thread. However, I hadn't thought its performance would be so much worse than traditional locks.
I have run a test with two Parallel.For loops writing to and reading from a list/bag. However, the write by itself shows a huge difference:
private static void ConcurrentBagTest()
{
    int collSize = 10000000;
    Stopwatch stopWatch = new Stopwatch();
    ConcurrentBag<int> bag1 = new ConcurrentBag<int>();

    stopWatch.Start();
    Parallel.For(0, collSize, delegate(int i)
    {
        bag1.Add(i);
    });
    stopWatch.Stop();

    Console.WriteLine("Elapsed Time = {0}", stopWatch.Elapsed.TotalSeconds);
}
On my box, this takes 3-4 seconds to run, compared to 0.5-0.9 seconds for this code:
private static void LockCollTest()
{
    int collSize = 10000000;
    object list1_lock = new object();
    List<int> lst1 = new List<int>(collSize);
    Stopwatch stopWatch = new Stopwatch();

    stopWatch.Start();
    Parallel.For(0, collSize, delegate(int i)
    {
        lock (list1_lock)
        {
            lst1.Add(i);
        }
    });
    stopWatch.Stop();

    Console.WriteLine("Elapsed = {0}", stopWatch.Elapsed.TotalSeconds);
}
As I mentioned, doing concurrent reads and writes doesn't help the concurrent bag test. Am I doing something wrong or is this data structure just really slow?
[EDIT] - I removed the Tasks because I don't need them here (the full code had another task reading)
[EDIT]
Thanks a lot for the answers. I am having a hard time picking "the right answer" since it seems to be a mix of a few answers.
As Michael Goldshteyn pointed out, the speed really depends on the data.
Darin pointed out that there should be more contention for ConcurrentBag to be faster, and that Parallel.For doesn't necessarily start the same number of threads. One point to take away is to do nothing you don't have to inside a lock. In the above case, I don't see myself doing anything inside the lock except maybe assigning the value to a temp variable.
Additionally, sixlettervariables pointed out that the number of threads that happen to be running may also affect results, although I tried running the original test in reverse order and ConcurrentBag was still slower.
I ran some tests starting 15 Tasks, and the results depended on the collection size, among other things. However, ConcurrentBag performed almost as well as or better than locking a list for up to 1 million insertions. Above 1 million, locking sometimes seemed to be much faster, but I'll probably never have a larger data structure for my project.
Here's the code I ran:
int collSize = 1000000;
object list1_lock = new object();
List<int> lst1 = new List<int>();
ConcurrentBag<int> concBag = new ConcurrentBag<int>();
int numTasks = 15;

Stopwatch sWatch = new Stopwatch();
sWatch.Start();

// First, try locks
Task.WaitAll(Enumerable.Range(1, numTasks)
    .Select(x => Task.Factory.StartNew(() =>
    {
        for (int i = 0; i < collSize / numTasks; i++) // loop counter is local to each task
        {
            lock (list1_lock)
            {
                lst1.Add(x);
            }
        }
    })).ToArray());
sWatch.Stop();
Console.WriteLine("lock test. Elapsed = {0}", sWatch.Elapsed.TotalSeconds);

// Now try ConcurrentBag
sWatch.Restart();
Task.WaitAll(Enumerable.Range(1, numTasks)
    .Select(x => Task.Factory.StartNew(() =>
    {
        for (int i = 0; i < collSize / numTasks; i++)
        {
            concBag.Add(x);
        }
    })).ToArray());
sWatch.Stop();
Console.WriteLine("Conc Bag test. Elapsed = {0}", sWatch.Elapsed.TotalSeconds);
Let me ask you this: how realistic is it that you'd have an application which is constantly adding to a collection and never reading from it? What's the use of such a collection? (This is not a purely rhetorical question. I could imagine there being uses where, e.g., you only read from the collection on shutdown (for logging) or when requested by the user. I believe these scenarios are fairly rare, though.)
This is what your code is simulating. Calling List<T>.Add is going to be lightning-fast in all but the occasional case where the list has to resize its internal array; but this is smoothed out by all the other adds that happen quite quickly. So you're not likely to see a significant amount of contention in this context, especially testing on a personal PC with, e.g., even 8 cores (as you stated you have in a comment somewhere). Maybe you might see more contention on something like a 24-core machine, where many cores can be trying to add to the list literally at the same time.
Contention is much more likely to creep in where you read from your collection, esp. in foreach loops (or LINQ queries which amount to foreach loops under the hood) which require locking the entire operation so that you aren't modifying your collection while iterating over it.
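To make that concrete, a sketch of the difference, reusing the lst1/list1_lock/concBag names from the test code above:

// A plain List<T> must be locked for the whole enumeration,
// blocking writers until every reader finishes:
lock (list1_lock)
{
    foreach (var item in lst1)
    {
        // read item
    }
}

// ConcurrentBag<T> enumerates a moment-in-time snapshot,
// so readers don't block writers:
foreach (var item in concBag)
{
    // read item
}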
If you can realistically reproduce this scenario, I believe you will see ConcurrentBag<T> scale much better than your current test is showing.
Update: Here is a program I wrote to compare these collections in the scenario I described above (multiple writers, many readers). Running 25 trials with a collection size of 10000 and 8 reader threads, I got the following results:
Took 529.0095 ms to add 10000 elements to a List<double> with 8 reader threads.
Took 39.5237 ms to add 10000 elements to a ConcurrentBag<double> with 8 reader threads.
Took 309.4475 ms to add 10000 elements to a List<double> with 8 reader threads.
Took 81.1967 ms to add 10000 elements to a ConcurrentBag<double> with 8 reader threads.
Took 228.7669 ms to add 10000 elements to a List<double> with 8 reader threads.
Took 164.8376 ms to add 10000 elements to a ConcurrentBag<double> with 8 reader threads.
[ ... ]
Average list time: 176.072456 ms.
Average bag time: 59.603656 ms.
So clearly it depends on exactly what you're doing with these collections.
There seems to be a bug in the .NET Framework 4 that Microsoft fixed in 4.5; it seems they didn't expect ConcurrentBag to be used this heavily.
See the following Ayende post for more info
http://ayende.com/blog/156097/the-high-cost-of-concurrentbag-in-net-4-0
As a general answer:
Concurrent collections that use locking can be very fast if there is little or no contention for their data (i.e., for the locks). This is because such collection classes are often built using very inexpensive locking primitives, especially when uncontended.
Lockless collections can be slower because of the tricks used to avoid locks, and because of other bottlenecks such as false sharing and the complexity required to implement their lockless nature, which leads to cache misses, etc.
To summarize, which way is faster depends heavily on the data structures employed and the amount of contention for the locks, among other issues (e.g., the number of readers vs. writers in a shared/exclusive arrangement).
Your particular example has a very high degree of contention, so I must say I am surprised by the behavior. On the other hand, the amount of work done while the lock is held is very small, so maybe there is little contention for the lock itself after all. There could also be deficiencies in ConcurrentBag's concurrency handling that make your particular example (with frequent inserts and no reads) a bad use case for it.
Looking at the program using MS's contention visualizer shows that ConcurrentBag<T> has a much higher cost associated with parallel insertion than simply locking on a List<T>. One thing I noticed is there appears to be a cost associated with spinning up the 6 threads (used on my machine) to begin the first ConcurrentBag<T> run (cold run). 5 or 6 threads are then used with the List<T> code, which is faster (warm run). Adding another ConcurrentBag<T> run after the list shows it takes less time than the first (warm run).
From what I'm seeing in the contention, a lot of time is spent in the ConcurrentBag<T> implementation allocating memory. Removing the explicit allocation of size from the List<T> code slows it down, but not enough to make a difference.
EDIT: it appears that ConcurrentBag<T> internally keeps a list per Thread.CurrentThread, locks 2-4 times depending on whether it is running on a new thread, and performs at least one Interlocked.Exchange. As noted in MSDN, it is "optimized for scenarios where the same thread will be both producing and consuming data stored in the bag." This is the most likely explanation for your performance decrease versus a raw list.
This has already been resolved in .NET 4.5. The underlying issue was that ThreadLocal, which ConcurrentBag uses, didn't expect to have a lot of instances. That has been fixed, and it can now run fairly fast.
source - The HIGH cost of ConcurrentBag in .NET 4.0
As Darin Dimitrov said, I suspect that your Parallel.For isn't actually spawning the same number of threads in each of the two tests. Try manually creating N threads to ensure that you are actually seeing thread contention in both cases.
You basically have very few concurrent writes and no contention (Parallel.For doesn't necessarily mean many threads). Try parallelizing the writes and you will observe different results:
class Program
{
    private static object list1_lock = new object();
    private const int collSize = 1000;

    static void Main()
    {
        ConcurrentBagTest();
        LockCollTest();
    }

    private static void ConcurrentBagTest()
    {
        var bag1 = new ConcurrentBag<int>();
        var stopWatch = Stopwatch.StartNew();
        Task.WaitAll(Enumerable.Range(1, collSize).Select(x => Task.Factory.StartNew(() =>
        {
            Thread.Sleep(5);
            bag1.Add(x);
        })).ToArray());
        stopWatch.Stop();
        Console.WriteLine("Elapsed Time = {0}", stopWatch.Elapsed.TotalSeconds);
    }

    private static void LockCollTest()
    {
        var lst1 = new List<int>(collSize);
        var stopWatch = Stopwatch.StartNew();
        Task.WaitAll(Enumerable.Range(1, collSize).Select(x => Task.Factory.StartNew(() =>
        {
            lock (list1_lock)
            {
                Thread.Sleep(5);
                lst1.Add(x);
            }
        })).ToArray());
        stopWatch.Stop();
        Console.WriteLine("Elapsed = {0}", stopWatch.Elapsed.TotalSeconds);
    }
}
My guess is that the locks don't experience much contention. I would recommend reading the following article: Java theory and practice: Anatomy of a flawed microbenchmark. The article discusses lock microbenchmarks; as it explains, there are a lot of things to take into consideration in this kind of situation.
It would be interesting to see how the two of them scale.
Two questions:
1) How fast is bag vs. list for reading? (Remember to put a lock on the list.)
2) How fast is bag vs. list for reading while another thread is writing?
Because the loop body is small, you could try using the Partitioner class's Create method...
which enables you to provide a sequential loop for the delegate body, so that the delegate is invoked only once per partition, instead of once per iteration
How to: Speed Up Small Loop Bodies
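A sketch of what that looks like for the bag test in the question, reusing its collSize and bag1 names:

// The delegate now runs once per partition rather than once per iteration;
// each partition does a tight sequential inner loop, amortizing the
// delegate-invocation overhead.
Parallel.ForEach(Partitioner.Create(0, collSize), range =>
{
    for (int i = range.Item1; i < range.Item2; i++)
    {
        bag1.Add(i);
    }
});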
It appears that ConcurrentBag is just slower than the other concurrent collections.
I think it's an implementation problem: ANTS Profiler shows that it gets bogged down in a couple of places, including an array copy.
Using ConcurrentDictionary is thousands of times faster.
