I have some basic questions about Parallel.ForEach with the partition (thread-local) approach, and I'm running into some problems with it, so I'd like to understand how this code works and what its flow is.
Code sample
var result = new StringBuilder();
Parallel.ForEach(Enumerable.Range(1, 5), () => new StringBuilder(), (x, option, sb) =>
{
sb.Append(x);
return sb;
}, sb =>
{
lock (result)
{
result.Append(sb.ToString());
}
});
Questions related to the code above:
Is some partitioning work happening inside Parallel.ForEach?
When I debug the code, I can see that the delegates execute more than 5 times in total, but as I understand it, the body is supposed to fire only 5 times - Enumerable.Range(1, 5).
When will this code be fired? In both Parallel.ForEach and Parallel.For there are two blocks separated by {}. How do these two blocks execute and interact with each other?
lock (result)
{
result.Append(sb.ToString());
}
Bonus question:
In this block of code, 5 iterations do not occur; rather, more iterations take place when I use Parallel.For instead of ForEach. See the code and tell me where I made the mistake.
var result = new StringBuilder();
Parallel.For(1, 5, () => new StringBuilder(), (x, option, sb) =>
{
sb.Append("line " + x + System.Environment.NewLine);
MessageBox.Show("aaa"+x.ToString());
return sb;
}, sb =>
{
lock (result)
{
result.Append(sb.ToString());
}
});
There are several misunderstandings regarding how Parallel.XYZ works.
A couple of great points and suggestions have been mentioned in the comments, so I won't repeat them. Rather, I would like to share some thoughts about parallel programming.
The Parallel Class
Whenever we are talking about parallel programming we are usually distinguishing two kinds: Data parallelism and Task parallelism. The former is executing the same function(s) over a chunk of data in parallel. The latter is executing several independent functions in parallel.
(There is also a third model called pipeline, which is kind of a mixture of these two. I won't spend time on it here; if you are interested in it, I would suggest searching for the Task Parallel Library's Dataflow or System.Threading.Channels.)
The Parallel class supports both models: the For and ForEach methods are designed for data parallelism, while Invoke is for task parallelism.
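For illustration, here is a minimal sketch of the task-parallel side (the three actions are made up):

using System;
using System.Threading.Tasks;

class InvokeDemo
{
    static void Main()
    {
        // Task parallelism: three independent actions run concurrently,
        // and Invoke returns only when all of them have completed.
        Parallel.Invoke(
            () => Console.WriteLine("loading configuration"),
            () => Console.WriteLine("warming up the cache"),
            () => Console.WriteLine("pinging a remote service"));
    }
}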
Partitioning
In the case of data parallelism, the tricky part is how you slice your data to get the best throughput/performance. You have to take into account the size of the data collection, the structure of the data, the processing logic and the available cores (and many other aspects as well). So there is no one-rule-for-all suggestion.
The main concern in partitioning is to neither under-use the resources (some cores idle while others work hard) nor over-use them (far more waiting jobs than available cores, so the synchronization overhead becomes significant).
Let's suppose your processing logic is stable (in other words, varying the input data will not significantly change the processing time). In this case you can load-balance the data between the executors: whenever an executor finishes, it grabs the next piece of data to process.
The way you choose which data goes to which executor can be defined by a Partitioner. By default .NET supports Range, Chunk, Hash and Striped partitioning. Some are static (the partitioning is done before any processing starts) and some are dynamic (depending on processing speed, some executors may receive more data than others).
The following two excellent articles can give you better insight into how each kind of partitioning works; a small usage sketch follows the links:
Dixin's blog
Nima's blog
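As a minimal sketch of the API surface (the workload here is made up): range partitioning via Partitioner.Create hands each worker a contiguous chunk, which cuts the per-item overhead.

using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

class PartitionDemo
{
    static void Main()
    {
        int[] data = new int[1000000];

        // Range partitioning: each worker receives a contiguous [from, to)
        // slice, avoiding per-element synchronization with the source.
        var ranges = Partitioner.Create(0, data.Length);

        Parallel.ForEach(ranges, range =>
        {
            for (int i = range.Item1; i < range.Item2; i++)
                data[i] = i * i;
        });

        Console.WriteLine(data[10]); // 100
    }
}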
Thread Safety
If each executor can perform its processing task without needing to interact with the others, they are considered independent. If you can design your algorithm to have independent processing units, you minimize synchronization.
In the case of For and ForEach, each partition can have its own partition-local storage. That means the computations are independent, because the intermediate results are stored in partition-aware storage. But as usual, you eventually want to merge these into a single collection, or even a single value.
That's the reason these Parallel methods have body and localFinally parameters. The former defines the individual processing, while the latter is the aggregate-and-merge function. (It is somewhat similar to the Map-Reduce approach.) In the latter you have to take care of thread safety yourself.
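Mapping these parameters onto the code from the question (comments mark which delegate is which; localInit and localFinally run once per partition, not once per element, which is why you see more than 5 executions in the debugger):

var result = new StringBuilder();

Parallel.ForEach(
    Enumerable.Range(1, 5),          // source: 5 elements, split into partitions
    () => new StringBuilder(),       // localInit: runs once per partition
    (x, state, sb) =>                // body: runs once per element
    {
        sb.Append(x);                // writes to this partition's local storage
        return sb;
    },
    sb =>                            // localFinally: runs once per partition
    {
        lock (result)                // merging touches shared state, hence the lock
        {
            result.Append(sb.ToString());
        }
    });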
PLINQ
I don't want to explore this topic, which is outside the scope of the question, but I would like to give you a nudge about where to get started:
MS Whitepaper about when to use Parallel and when to use PLINQ
Common pitfalls of PLINQ
Useful resources
Joe Albahari's Parallel Programming
BlackWasp's Parallel Programming
EDIT: How do you decide whether it's worth running in parallel?
There is no single formula (at least to my knowledge) which will tell you when it makes sense to use parallel execution. As I tried to highlight in the Partitioning section, it is quite a complex topic, so several experiments and fine-tuning are needed to find the optimal solution.
I highly encourage you to measure and try several different settings.
Here is my guideline how you should tackle this:
Try to understand the current characteristics of your application
Perform several different measurements to spot the execution bottleneck
Capture the current solution's performance metrics as your baseline
If possible, try to extract that piece of code from the code base to ease the fine-tuning
Try to tackle the same problem with several different aspects and with various inputs
Measure them and compare them to your baseline
If you are satisfied with the result then put that piece of code into your code base and measure again under different workloads
Try to capture as many relevant metrics as you can
If possible, consider executing both (sequential and parallel) solutions and comparing their results.
If you are satisfied then get rid of the sequential code
Details
There are several really good tools that can give you insight into your application. For .NET profiling I would encourage you to give CodeTrack a try. Concurrency Visualizer is also a good tool if you don't need custom metrics.
By several measurements I mean that you should measure several times, with several different tools, to exclude special circumstances. If you measure only once you can get a false-positive result. So: measure twice, cut once.
Your sequential processing should serve as the baseline. Over-parallelization can introduce overhead, which is why it makes sense to be able to compare your shiny new solution with the current one. Under-utilization can also cause significant performance degradation.
If you can extract the problematic code, then you can perform micro-benchmarks. I encourage you to take a look at the awesome BenchmarkDotNet tool to create benchmarks.
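If you want a concrete starting point, here is a minimal sketch (assuming the BenchmarkDotNet NuGet package; the summing workload is made up) that compares a sequential baseline against a partition-local Parallel.For:

using System.Linq;
using System.Threading;
using System.Threading.Tasks;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

public class SumBenchmark
{
    private int[] data;

    [GlobalSetup]
    public void Setup() => data = Enumerable.Range(0, 1000000).ToArray();

    [Benchmark(Baseline = true)]
    public long Sequential() => data.Sum(x => (long)x);

    [Benchmark]
    public long ParallelSum()
    {
        long total = 0;
        Parallel.For(0, data.Length,
            () => 0L,                               // localInit: per-partition sum
            (i, state, local) => local + data[i],   // body
            local => Interlocked.Add(ref total, local)); // localFinally: merge
        return total;
    }
}

class Program
{
    static void Main() => BenchmarkRunner.Run<SumBenchmark>();
}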
The same problem can be solved in many ways, so try several different approaches (for example, Parallel and PLINQ can be used for more or less the same problems).
As I said earlier: measure, measure and measure. You should also keep in mind that .NET tries to be smart. For example, AsParallel does not guarantee that the query will actually run in parallel: .NET analyzes your solution and data structure and decides how to run it. On the other hand, you can enforce parallel execution if you are certain it will help.
There are libraries like Scientist.NET which can help you perform this sort of parallel run-and-compare process.
Enjoy :D
Related
I have built a complex application using a multi-tiered producer-consumer pattern, with multiple consumers performing specialized tasks before enqueuing data to the next group of consumers. The ultimate job of the application is to break down a raw data file into normalized test records for individual units.
The base of the P-C pattern uses Dustin Hyun's pattern from http://dustin-hyun.blogspot.com/2013_07_01_archive.html. I have made numerous modifications because of the multi-tiered approach, among other things. The code is too complex to post here; perhaps I could post snippets upon request to help clarify and answer questions.
I have employed two tools to speed up how a file gets processed. First, multiple instances of any tier of consumer: there could be eight "index" consumers running whose job is to convert the test data from unit IDs and test names to unit indices and test-name indices, normalizing the results to load into the DB. Second, bundling units into merged DataTables at two points in the operation.
I have identified that data is lost intermittently, but in a fairly predictable pattern: it appears to be the last, incomplete bundle where the data was expected to have been. After the standard loop pattern, I check a boolean that I use to flag whether there is an incomplete bundle, and it works:
if (dataToSend) // Check for an incomplete bundle to process & send prior to ending consumer operation.
{
UpdateLimitsIndices(bundleNlu);
Enqueue(StdfQType.Func, new BundledNamedTables((N_ParamRes)bundlePR.Copy(), (N_FuncRes)bundleFR.Copy(), numUnitsInCurrBundle));
}
I have also put locks on every place I can see where any of the P-C entities read or write anything from any of the shared queue members. With just the locks, there appeared to be no real impact. On a whim, I started to play with the sleep time before the loop re-spins. So far, test conditions that caused data loss with a 1 ms sleep did not cause data loss with a 100 ms sleep, or even a 10 ms sleep, during limited testing. Could it be that the longer sleep is allowing the last piece/bundle of data to be properly processed?
I recognize that this question is vague and has few specifics because the application is too complex to post. I do hope I gave enough information for a dialog to start, however. I look forward to hearing your thoughts.
Jeff
I would suggest that because you are not using thread-safe collections (and neither does the author whose code you are basing yours on), this may be the cause of your lost data: a concurrent write operation that fails silently.
Luckily, along with the Task Parallel Library (TPL) .NET 4.0 gives us a whole bunch of concurrent collections which ARE thread-safe for multi-threaded environments.
Have a look at the collections in System.Collections.Concurrent as they are all thread-safe and their locking mechanisms are a lot faster than traditional lock-based objects.
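For example, here is a minimal sketch of a bounded producer-consumer built on BlockingCollection (the int payload is made up; Task.Run requires .NET 4.5+, otherwise use Task.Factory.StartNew). Note that blocking on GetConsumingEnumerable removes the need for the sleep/re-spin loop entirely:

using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

class PipelineDemo
{
    static void Main()
    {
        // BlockingCollection wraps a ConcurrentQueue and blocks consumers
        // until data arrives, removing the need for Sleep/re-spin loops.
        using (var queue = new BlockingCollection<int>(boundedCapacity: 100))
        {
            Task producer = Task.Run(() =>
            {
                for (int i = 0; i < 1000; i++) queue.Add(i);
                queue.CompleteAdding();              // signal: no more items
            });

            Task consumer = Task.Run(() =>
            {
                // Blocks until items arrive; ends when CompleteAdding is called.
                foreach (int item in queue.GetConsumingEnumerable())
                    Console.WriteLine(item);
            });

            Task.WaitAll(producer, consumer);
        }
    }
}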
Threading is very difficult to get right, and it appears that you have not gotten it right. Also, why are you (and the author of that blog post) using sleep intervals rather than Monitor.Pulse()?
Rather than trying to implement this yourself, why not use a library that will give you a slightly higher level of abstraction above the underlying thread coordination mechanism?
TPL Dataflow
Reactive Extensions
I am about to start a project which will take blocks of text, parse a lot of data out of them into some sort of object which can then be serialized, stored, and mined for statistics/data. This needs to be as fast as possible, as I have more than 10,000,000 blocks of text to start with and will be getting hundreds of thousands more a day.
I am running this on a system with 12 Xeon cores + hyper-threading. I also have access to / know a bit about CUDA programming, but for string work I think it's not appropriate. From each string I need to parse a lot of data; some of it I know the exact positions of, and some I don't, so I need to use regexes or something smart.
So consider something like this:
object[] ParseAll(string[] stringsToParse)
{
    var results = new object[stringsToParse.Length];
    Parallel.For(0, stringsToParse.Length, n =>
    {
        results[n] = Parse(stringsToParse[n]);
    });
    return results;
}
object Parse(string s)
{
    // try to use exact positions / Substring etc. here instead of regexes
    return null;
}
So my questions are:
How much slower are regexes compared to substring operations?
Is .NET going to be significantly slower than other languages?
What sort of optimizations (if any) can I do to maximize parallelism?
Anything else I haven't considered?
Thanks for any help! Sorry if this is long winded.
How much slower are regexes compared to substring operations?
If you are looking for an exact string, substring operations will be faster. Regular expressions, however, are highly optimized. They (or at least parts of them) are compiled to IL, and you can even store these compiled versions in a separate assembly using Regex.CompileToAssembly. See http://msdn.microsoft.com/en-us/library/9ek5zak6.aspx for more information.
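As a hedged illustration (the pattern and names here are made up), compiling a regex you apply repeatedly usually pays off:

using System.Text.RegularExpressions;

static class SerialParser
{
    // Hypothetical pattern: RegexOptions.Compiled emits IL for the regex once,
    // which usually pays off when it is applied to millions of inputs.
    private static readonly Regex SerialNumber =
        new Regex(@"SN:\d{8}", RegexOptions.Compiled);

    public static bool HasSerial(string block) => SerialNumber.IsMatch(block);
}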
What you really need to do is perform measurements. Using something like Stopwatch is by far the easiest way to verify whether one code construct works faster than another.
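A minimal sketch of that measurement pattern (the loop body is a placeholder):

using System;
using System.Diagnostics;

var sw = Stopwatch.StartNew();
for (int i = 0; i < 1000000; i++)
{
    // ... the code construct under test goes here ...
}
sw.Stop();
Console.WriteLine("Elapsed: " + sw.ElapsedMilliseconds + " ms");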
What sort of optimizations (if any) can I do to maximize parallelism?
With Task.Factory.StartNew, you can schedule tasks to run on the thread pool. You may also have a look at the TPL (Task Parallel Library, of which Task is a part). This has lots of constructs that help you parallelize work and allows constructs like Parallel.ForEach() to execute an iteration on multiple threads. See http://msdn.microsoft.com/en-us/library/dd460717.aspx for more information.
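For instance, a sketch that schedules each parse on the thread pool (reusing stringsToParse and Parse from the question's pseudocode):

using System.Linq;
using System.Threading.Tasks;

// Each call is scheduled on the thread pool; WaitAll blocks until all finish.
Task<object>[] tasks = stringsToParse
    .Select(s => Task.Factory.StartNew(() => Parse(s)))
    .ToArray();

Task.WaitAll(tasks);
object[] parsed = tasks.Select(t => t.Result).ToArray();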
Anything else I haven't considered?
One of the things that will hurt you with this volume of data is memory management. A few things to take into account:
Limit memory allocation: try to re-use the same buffers for a single document instead of copying them when you only need a part. Say you need to work on a range starting at char 1000 and ending at 2000: don't copy that range into a new buffer, but construct your code to work only on that range. This will make your code more complex, but it saves you memory allocations (see the sketch after this list);
StringBuilder is an important class. If you don't know of it yet, have a look.
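To illustrate the buffer-reuse point above (document and the search term are hypothetical):

using System;

// Copying the range allocates a second buffer...
string slice = document.Substring(1000, 1000);
bool found = slice.Contains("FAIL");

// ...while the start/count overloads search within the original string:
bool foundNoCopy =
    document.IndexOf("FAIL", 1000, 1000, StringComparison.Ordinal) >= 0;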
I don't know what kind of processing you're doing here, but if you're talking hundreds of thousands of strings per day, it seems like a pretty small number. Let's assume that you get 1 million new strings to process every day, and you can fully task 10 of those 12 Xeon cores. That's 100,000 strings per core per day. There are 86,400 seconds in a day, so we're talking 0.864 seconds per string. That's a lot of parsing.
I'll echo the recommendations made by @Pieter, especially where he suggests making measurements to see how long it takes to do your processing. Your best bet is to get something up and working, then figure out how to make it faster if you need to. I think you'll be surprised at how often you don't need to do any optimization. (I know that's heresy to the optimization wizards, but processor time is cheap and programmer time is expensive.)
How much slower are regexes compared to substring operations?
That depends entirely on how complex your regexes are. As @Pieter said, if you're looking for a single string, String.Contains will probably be faster. You might also consider using String.IndexOfAny if you're looking for constant strings. Regular expressions aren't necessary unless you're looking for patterns that can't be represented as constant strings.
Is .NET going to be significantly slower than other languages?
In processor-intensive applications, .NET can be slower than native apps. Sometimes. If so, it's typically in the range of 5 to 20 percent, and most often between 7 and 12 percent. That's just the code executing in isolation. You have to take into account other factors like how long it takes you to build the program in that other language and how difficult it is to share data between the native app and the rest of your system.
Google recently announced its internal text-processing language (which seems like a Python/Perl subset made for heavily parallel processing):
http://code.google.com/p/szl/ - Sawzall.
If you want to do fast string parsing in C#, you might want to have a look at the new NLib project. It contains string extensions that facilitate searching strings rapidly in various ways, such as IndexOfAny(string[]) and IndexOfNotAny. They include overloads with a StringComparison argument too.
I'm writing a book on multicore programming using .NET 4 and I'm curious to know what parts of multicore programming people have found difficult to grok or anticipate being difficult to grok?
What's a useful unit of work to parallelize, and how do I find/organize one?
All these parallelism primitives aren't helpful if you fork a piece of work that is smaller than the forking overhead; in fact, that buys you a nice slowdown instead of what you are expecting.
So one of the big problems is finding units of work that are obviously more expensive than the parallelism primitives. A key problem here is that nobody knows what anything costs to execute, including the parallelism primitives themselves. Clearly, calibrating these costs would be very helpful. (As an aside, we designed, implemented, and daily use a parallel programming language, PARLANSE, whose objective was to minimize the cost of the parallelism primitives by allowing the compiler to generate and optimize them, with the goal of making smaller bits of work "more parallelizable".)
One might also consider discussing big-O notation and its applications. We all hope that the parallelism primitives have cost O(1). If that's the case, then if you find work with cost O(x) > O(1), that work is a good candidate for parallelization. If your proposed work is also O(1), then whether it is effective or not depends on the constant factors, and we are back to calibration as above.
There's the problem of collecting work into large enough units, if none of the pieces are large enough. Code motion, algorithm replacement, ... are all useful ideas to achieve this effect.
Lastly, there's the problem of synchronization: when do my parallel units have to interact, what primitives should I use, and how much do those primitives cost? (More than you expect!)
I guess some of it depends on how basic or advanced the book/audience is. When you go from single-threaded to multi-threaded programming for the first time, you typically fall off a huge cliff (and many never recover, see e.g. all the muddled questions about Control.Invoke).
Anyway, to add some thoughts that are less about the programming itself, and more about the other related tasks in the software process:
Measuring: deciding what metric you are aiming to improve, measuring it correctly (it is so easy to accidentally measure the wrong thing), using the right tools, differentiating signal versus noise, interpreting the results and understanding why they are as they are.
Testing: how to write tests that tolerate unimportant non-determinism/interleavings, but still pin down correct program behavior.
Debugging: tools, strategies, when "hard to debug" implies feedback to improve your code/design and better partition mutable state, etc.
Physical versus logical thread affinity: understanding the GUI thread, understanding how e.g. an F# MailboxProcessor/agent can encapsulate mutable state and run on multiple threads but always with only a single logical thread (one program counter).
Patterns (and when they apply): fork-join, map-reduce, producer-consumer, ...
I expect that there will be a large audience for e.g. "help, I've got a single-threaded app with 12% CPU utilization, and I want to learn just enough to make it go 4x faster without much work" and a smaller audience for e.g. "my app is scaling sub-linearly as we add cores because there seems to be contention here, is there a better approach to use?", and so a bit of the challenge may be serving each of those audiences.
Since you are writing a whole book on multi-core programming in .NET, I think you could also go a little beyond multi-core.
For example, you could include a chapter about parallel computing in distributed systems in .NET. Unfortunately, there are no mature frameworks in .NET yet; DryadLINQ is the closest. (On the other side, Hadoop and its friends on the Java platform are really good.)
You could also include a chapter demonstrating some GPU computing.
One thing that has tripped me up is which approach to use to solve a particular type of problem. There's agents, there's tasks, async computations, MPI for distribution - for many problems you could use multiple methods but I'm having difficulty understanding why I should use one over another.
Difficult to understand: low-level memory details, like the difference between acquire and release semantics.
Most of the rest of the concepts and ideas (anything can interleave, race conditions, ...) are not that difficult with a little usage.
Of course the practice, especially when something fails only sometimes, is very hard, as you need to work at multiple levels of abstraction to understand what is going on. So keep your design simple, and as far as possible design out the need for locking (e.g. by using immutable data and higher-level abstractions).
It's not so much the theoretical details as the practical implementation details that trip people up.
What's the deal with immutable data structures?
All the time, people try to update a data structure from multiple threads, find it too hard, and someone chimes in "use immutable data structures!", and so our persistent coder writes this:
ImmutableSet<Customer> set = ImmutableSet<Customer>.Empty;

void ThreadLoop1()
{
    foreach (Customer c in dataStore1)
        set = set.Add(ProcessCustomer(c));   // read-modify-write race on 'set'
}

void ThreadLoop2()
{
    foreach (Customer c in dataStore2)
        set = set.Add(ProcessCustomer(c));   // updates from the other thread get lost
}
The coder has heard all their life that immutable data structures can be updated without locking, but the new code doesn't work, for obvious reasons: the read-modify-write on set is not atomic.
Even if you're targeting academics and experienced devs, a little primer on the basics of immutable programming idioms can't hurt.
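For instance, one correct approach is a compare-and-swap retry loop. This is only a sketch, assuming the System.Collections.Immutable package and reusing Customer/ProcessCustomer from the example above:

using System.Collections.Immutable;
using System.Threading;

class Customer { }

class CustomerAggregator
{
    // Shared field: readers always see a consistent snapshot of the set.
    private ImmutableHashSet<Customer> set = ImmutableHashSet<Customer>.Empty;

    public void AddProcessed(Customer c)
    {
        Customer processed = ProcessCustomer(c);
        ImmutableHashSet<Customer> current, updated;
        do
        {
            current = set;                     // snapshot the current version
            updated = current.Add(processed);  // build a new version
        }
        // Publish only if no other thread swapped the set in the meantime;
        // otherwise retry against the fresh snapshot.
        while (Interlocked.CompareExchange(ref set, updated, current) != current);
    }

    private Customer ProcessCustomer(Customer c) => c; // placeholder
}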
How to partition roughly equal amounts of work between threads?
Getting this step right is hard. Sometimes you break up a single process into 10,000 steps which can be executed in parallel, but not all steps take the same amount of time. If you split the work across 4 threads, and the first 3 threads finish in 1 second while the last thread takes 60 seconds, your multithreaded program isn't much better than the single-threaded version, right?
So how do you partition problems with roughly equal amounts of work between all threads? Lots of good heuristics for bin-packing problems should be relevant here.
How many threads?
If your problem is nicely parallelizable, adding more threads should make it faster, right? Well not really, lots of things to consider here:
Even on a single-core processor, adding more threads can make a program faster, because more threads give the OS more opportunities to schedule your work, so it gets more execution time than the single-threaded program. But by the law of diminishing returns, adding more threads increases context switching, so at a certain point, even if your program gets the most execution time, performance can still be worse than the single-threaded version.
So how do you spin off just enough threads to minimize execution time?
And if there are lots of other apps spinning up threads and competing for resources, how do you detect performance changes and adjust your program automagically?
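There is no universal answer, but a common starting point is to cap parallelism at the core count and measure from there. A sketch (workItems and Process are hypothetical placeholders):

using System;
using System.Threading.Tasks;

// A common starting point, not a definitive answer: cap the parallelism
// at the number of logical cores and measure from there.
var options = new ParallelOptions
{
    MaxDegreeOfParallelism = Environment.ProcessorCount
};

Parallel.ForEach(workItems, options, item => Process(item));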
I find the conceptions of synchronized data moving across worker nodes in complex patterns very hard to visualize and program.
Usually I find debugging to be a bear, also.
I have been asked to show the benefits and limitations of parallelism and evaluate it for use within our company. We are predominantly a data-orientated business, and essentially load objects from the database, put them through some business logic, display them to the user, then save them back to the DB. In my mind, there isn't too much in that pipeline that would benefit from running in parallel, but being fairly new to the concept, I could be completely wrong. Would any part of that simple pipeline benefit from running in parallel? And are there any guidelines for how to implement this style of programming?
Also, are there any tools (preferably ones that come with VS2010) that would show where bottlenecks occur and could visually show what's going on when I click "Go" on a simple app that runs a given number of loops (pre-written simple maths loops, e.g. for i as integer = 1 to 1000 - do some calculations) in parallel, then in series?
I need to be able to display the difference using a decent profiling tool.
Yes, even with that simple model you could benefit greatly from parallelism.
Say, for instance, that during a load of your data you're doing something like this:
foreach(var datarow in someDataSet)
{
//put your data into some business objects here
}
you could optimize this with parallelism by doing something like this:
Parallel.ForEach(someDataSet, datarow =>
{
//put your data into some business objects here
});
This could greatly increase your performance, depending on how much data you're processing.
Each data row will now be processed in parallel rather than in sequence, as in the typical foreach loop.
My suggestion to you would be to run some simple performance tests on an example as simple as this one and see what kind of results you get. Plot them out in a spreadsheet or something and show them to your team. You might be surprised by the results.
You may reap more benefit from implementing a caching layer (distributed or otherwise) than parallelizing your current pipeline.
With a caching layer, the objects you use frequently will reside in the in-memory cache, allowing for much greater read/write performance. There are a number of options for keeping the cache in sync, and these will vary depending on which vendor you choose.
I'd suggest having a look at MemCached and NCache and see if you think they would be a good fit.
EDIT: As far as profiling tools go, I've used dotTrace extensively and would highly recommend it. You can download a 30 day trial from JetBrains' website.
Certainly there are many tasks that can be parallelized; a detailed analysis will help, but bottlenecks are the usual candidates.
This material can help you: Patterns for Parallel Programming: Understanding and Applying Parallel Patterns with the .NET Framework 4.
Possibly, but my general response to this sort of query would be: do you have any performance problems in your application(s)? If yes, then by all means investigate why, and consider whether parallel execution can help. If not, then your time is probably best spent elsewhere.
Have you checked out Microsoft's Parallel Computing with Managed Code site? It contains several articles on implementation guidelines discussing both when and how to use .Net 4's parallel features.
I've been tasked with taking an existing single-threaded Monte Carlo simulation and optimising it. This is a C# console app with no DB access; it loads data once from a CSV file and writes it out at the end, so it's pretty much just CPU bound. It also only uses about 50 MB of memory.
I've run it through JetBrains' dotTrace profiler. Of the total execution time, about 30% is spent generating uniform random numbers and 24% translating uniform random numbers into normally distributed ones.
The basic algorithm is a whole lot of nested for loops, with random-number calls and matrix multiplication at the centre. Each iteration returns a double which is added to a results list; that list is periodically sorted and tested against some convergence criteria (at checkpoints every 5% of the total iteration count). If the criteria are met, the program breaks out of the loops and writes the results; otherwise it proceeds to the end.
I'd like developers to weigh in on:
should I use new Thread vs. the ThreadPool?
should I look at the Microsoft Parallel Extensions library?
should I look at AForge.Net's Parallel.For (http://code.google.com/p/aforge/), or any other libraries?
Some links to tutorials on the above would be most welcome as I've never written any parallel or multi-threaded code.
best strategies for generating normally distributed random numbers en masse, and then consuming them. Uniform random numbers are never used in this state by the app; they are always translated to normally distributed ones and then consumed.
good fast libraries (parallel?) for random number generation
memory considerations as I take this parallel: how much extra will I require?
The current app takes 2 hours for 500,000 iterations; the business needs this to scale to 3,000,000 iterations and be called multiple times a day, so it needs some heavy optimisation.
I'd particularly like to hear from people who have used Microsoft Parallel Extensions or AForge.Net's Parallel.
This needs to be productionised fairly quickly, so the .NET 4 beta is out, even though I know it has concurrency libraries baked in; we can look at migrating to .NET 4 later down the track once it's released. For the moment the server has .NET 2; I've submitted for review an upgrade to .NET 3.5 SP1, which my dev box has.
Thanks
Update
I've just tried the Parallel.For implementation but it comes up with some weird results.
Single threaded:
IRandomGenerator rnd = new MersenneTwister();
IDistribution dist = new DiscreteNormalDistribution(discreteNormalDistributionSize);
List<double> results = new List<double>();
for (int i = 0; i < CHECKPOINTS; i++)
{
results.AddRange(Oblist.Simulate(rnd, dist, n));
}
To:
Parallel.For(0, CHECKPOINTS, i =>
{
results.AddRange(Oblist.Simulate(rnd, dist, n));
});
Inside Simulate there are many calls to rnd.NextUniform(). I think I am getting many values that are the same; is this likely to happen because the loop is now parallel?
Also, maybe there are issues with the List.AddRange call not being thread safe? I see that System.Threading.Collections.BlockingCollection might be worth using, but it only has an Add method, no AddRange, so I'd have to loop over the results and add them in a thread-safe manner. Any insight from someone who has used Parallel.For would be much appreciated. I switched to System.Random for my calls temporarily, as I was getting an exception when calling NextUniform with my Mersenne Twister implementation; perhaps it wasn't thread safe, since a certain array was getting an index out of bounds.
First you need to understand why you think that using multiple threads is an optimization - when it is, in fact, not. Using multiple threads will make your workload complete faster only if you have multiple processors, and then at most as many times faster as you have CPUs available (this is called the speed-up). The work is not "optimized" in the traditional sense of the word (i.e. the amount of work isn't reduced - in fact, with multithreading, the total amount of work typically grows because of the threading overhead).
So in designing your application, you have to find pieces of work that can be done in a parallel or overlapping fashion. It may be possible to generate random numbers in parallel (by having multiple RNGs run on different CPUs), but that would also change the results, as you would get different random numbers. Another option is to generate the random numbers on one CPU, and do everything else on different CPUs. This can give you a maximum speedup of 3, as the RNG will still run sequentially and still take 30% of the load.
So if you go for this parallelization, you end up with 3 threads: thread 1 runs the RNG, thread 2 produces normal distribution, and thread 3 does the rest of the simulation.
For this architecture, a producer-consumer architecture is most appropriate. Each thread will read its input from a queue, and produce its output into another queue. Each queue should be blocking, so if the RNG thread falls behind, the normalization thread will automatically block until new random numbers are available. For efficiency, I would pass the random numbers in array of, say, 100 (or larger) across threads, to avoid synchronizations on every random number.
For this approach, you don't need any advanced threading. Just use the regular Thread class: no pool, no library. The only thing you need that is (unfortunately) not in the standard library is a blocking queue class (the Queue class in System.Collections is no good). CodeProject provides a reasonable-looking implementation of one; there are probably others.
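For reference, a minimal blocking queue along those lines can be sketched with Monitor.Wait/Pulse (on .NET 4 and later, BlockingCollection<T> provides this out of the box):

using System.Collections.Generic;
using System.Threading;

public class BlockingQueue<T>
{
    private readonly Queue<T> queue = new Queue<T>();

    public void Enqueue(T item)
    {
        lock (queue)
        {
            queue.Enqueue(item);
            Monitor.Pulse(queue);    // wake one waiting consumer
        }
    }

    public T Dequeue()
    {
        lock (queue)
        {
            while (queue.Count == 0)
                Monitor.Wait(queue); // releases the lock while waiting
            return queue.Dequeue();
        }
    }
}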
List<double> is definitely not thread-safe. See the section "thread safety" in the System.Collections.Generic.List documentation. The reason is performance: adding thread safety is not free.
Your random number implementation also isn't thread-safe; getting the same numbers multiple times is exactly what you'd expect in this case. Let's use the following simplified model of rnd.NextUniform() to understand what is happening:
1. Calculate a pseudo-random number from the current state of the object.
2. Update the state of the object so the next call yields a different number.
3. Return the pseudo-random number.
Now, if two threads execute this method in parallel, something like this may happen:
1. Thread A calculates a random number as in step 1.
2. Thread B calculates a random number as in step 1. Thread A has not yet updated the state of the object, so the result is the same.
3. Thread A updates the state of the object as in step 2.
4. Thread B updates the state of the object as in step 2, trampling over A's state changes or maybe giving the same result.
As you can see, any reasoning you can do to prove that rnd.NextUniform() works is no longer valid because two threads are interfering with each other. Worse, bugs like this depend on timing and may appear only rarely as "glitches" under certain workloads or on certain systems. Debugging nightmare!
One possible solution is to eliminate the state sharing: give each task its own random number generator initialized with another seed (assuming that instances are not sharing state through static fields in some way).
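A sketch of that first option using Parallel.For's local-state overload (assuming your MersenneTwister exposes a seed constructor; the seeding scheme here is purely illustrative, and Oblist, dist, n and CHECKPOINTS come from the question):

// Each partition creates its own MersenneTwister in localInit, so no
// generator state is shared between threads.
object resultsLock = new object();
List<double> results = new List<double>();

Parallel.For(0, CHECKPOINTS,
    () => new MersenneTwister(Guid.NewGuid().GetHashCode()), // per-partition RNG
    (i, state, rnd) =>
    {
        var partial = Oblist.Simulate(rnd, dist, n);
        lock (resultsLock)
        {
            results.AddRange(partial); // the only shared-state touch point
        }
        return rnd;
    },
    rnd => { });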
Another (inferior) solution is to create a field holding a lock object in your MersenneTwister class like this:
private object lockObject = new object();
Then use this lock in your MersenneTwister.NextUniform() implementation:
public double NextUniform()
{
lock(lockObject)
{
// original code here
}
}
This will prevent two threads from executing the NextUniform() method in parallel. The problem with the list in your Parallel.For can be addressed in a similar manner: separate the Simulate call and the AddRange call, and then add locking around the AddRange call.
My recommendation: avoid sharing any mutable state (like the RNG state) between parallel tasks if at all possible. If no mutable state is shared, no threading issues occur. This also avoids locking bottlenecks: you don't want your "parallel" tasks to wait on a single random number generator that doesn't work in parallel at all, especially if 30% of the time is spent acquiring random numbers.
Limit state sharing and locking to places where you can't avoid it, like when aggregating the results of parallel execution (as in your AddRange calls).
Threading is going to be complicated. You will have to break your program into logical units that can each be run on their own threads, and you will have to deal with any concurrency issues that emerge.
The Parallel Extension Library should allow you to parallelize your program by changing some of your for loops to Parallel.For loops. If you want to see how this works, Anders Hejlsberg and Joe Duffy provide a good introduction in their 30 minute video here:
http://channel9.msdn.com/shows/Going+Deep/Programming-in-the-Age-of-Concurrency-Anders-Hejlsberg-and-Joe-Duffy-Concurrent-Programming-with/
Threading vs. ThreadPool
The ThreadPool, as its name implies, is a pool of threads. Using the ThreadPool to obtain your threads has some advantages: thread pooling enables you to use threads more efficiently by providing your application with a pool of worker threads that are managed by the system.
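A minimal sketch of handing work to the pool:

using System;
using System.Threading;

class PoolDemo
{
    static void Main()
    {
        // Queue work to the shared pool instead of creating a dedicated thread.
        ThreadPool.QueueUserWorkItem(state =>
        {
            Console.WriteLine("running on a pool thread");
        });

        Console.ReadLine(); // keep the process alive long enough for the demo
    }
}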