I am playing with a demo of rolling a die in JavaScript, where I select 3 numbers out of 6 and use the following snippet to get the random result:
let randomNumber = Math.floor(Math.random() * 6 + 1); // Random number between 1 and 6
This is a global random, meaning that one player can have a much higher win rate than 50% and another a lower one, etc.
How would I implement something that would keep every player's win rate at 50% and not higher? At the moment I am experiencing huge win streaks while rolling, which means the same would happen in real life. So how would I limit myself to never exceed, for example, a 55% win rate, with the rate slowly pushed back towards 50% if it does, and vice versa if it drops below 50%?
Is there an API service that, when I provide a player's ID, would keep the count of wins and losses and keep the win rate around 50%? I.e. keep the historical data for the user and decide what the next number should be based on the win/loss ratio and the previous rolls, but do an honest random roll when the record stands at 50 wins / 50 losses. This would destroy win streaks, though, I guess.
So I think I would have to do something like:
if the win/loss ratio is 50%, do legit random rolls until it reaches 51%, then decide against the player until it is back at 50%. I know this wouldn't be honest, but are there any honest ways to really keep the randomization at 50% by just running the snippet above?
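For illustration, the (dishonest) steering I have in mind would look roughly like this in C# (the question is tagged C# as well). This is only a sketch of the idea; all names are mine:
using System;
using System.Collections.Generic;

// Hypothetical sketch of the "steering" idea: honest rolls inside a
// 45%..55% band, one biased re-roll outside it. All names are mine.
class SteeredDice
{
    private readonly Random _rng = new Random();
    private readonly Dictionary<string, int[]> _stats = new Dictionary<string, int[]>(); // [wins, losses]

    // chosen: the 3 of 6 numbers the player picked, so a fair roll wins 50% of the time
    public bool Roll(string playerId, HashSet<int> chosen)
    {
        int[] s;
        if (!_stats.TryGetValue(playerId, out s))
            _stats[playerId] = s = new int[2];

        bool win = chosen.Contains(_rng.Next(1, 7)); // honest roll, 1..6
        double total = s[0] + s[1];
        double rate = total == 0 ? 0.5 : s[0] / total;

        // Outside the band, re-roll once in the direction of 50%:
        // above the band this cuts the win chance to 0.25 for this roll,
        // below it the chance rises to 0.75.
        if (rate > 0.55 && win) win = chosen.Contains(_rng.Next(1, 7));
        else if (rate < 0.45 && !win) win = chosen.Contains(_rng.Next(1, 7));

        if (win) s[0]++; else s[1]++;
        return win;
    }
}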
How long are those win streaks? When I test this:
console.log(Math.random());
I get four or five values in a row that are bigger than 0.5, then one below 0.5, then another bigger than 0.5. If this is a problem for the game logic, then you don't need random numbers but plot armor. If you use only one random number generator for all players, then of course some players can steal another player's "destiny", and those players can have very bad or very good dice rolls.
To overcome the issue of "fairness" between players, you can have a unique seed for each player's random number generator. This is possible (and fast) in C++, and calling a compiled C++ console program from Node.js is easy. The only issue would be optimizing it for millions of concurrent players. C++'s std::mt19937 is fast enough and takes a seed value (from the OS too).
Since you tagged C#, the same thing can be done within C# too: new Random(some seed here) should give similar results. Then you can host the algorithm as a micro-service accessed by Node.js (assuming the main backend part of the app is on Node.js).
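A minimal sketch of that per-player seeding in C# (class name and seeding scheme are mine):
using System;
using System.Collections.Generic;

// One independently seeded RNG per player, so players no longer draw
// from (and disturb) a single shared random stream.
class PerPlayerDice
{
    private readonly Dictionary<int, Random> _rngs = new Dictionary<int, Random>();

    public int Roll(int playerId)
    {
        Random rng;
        if (!_rngs.TryGetValue(playerId, out rng))
        {
            // Any stable per-player seed works; the player id itself is simplest.
            _rngs[playerId] = rng = new Random(playerId);
        }
        return rng.Next(1, 7); // uniform 1..6
    }
}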
Lastly, you can have one micro-service process per player for the RNG. This would give everyone their own seed for random number generation. But too many players would mean too many threads, which is very bad for performance unless dice rolls are very rare. For example, I can start 100 processes on my 8-core CPU with this:
"use strict";
var cluster = require('cluster');
if(cluster.isMaster){for(let i=0;i<100;i++) { cluster.fork(); }}
console.log(Math.random());
but it starts very slowly, and hosting an RNG server per process could be even slower. So only the main process should host the RNG service and communicate internally with the worker processes; the cluster module of Node.js lets worker processes communicate with the main process. Purely event-driven communication (no spin-waiting whatsoever) between the worker (RNG) processes and the main process should be CPU-friendly, but if all of the (millions of) players throw dice at the same time, the process/thread-switching overhead will become visible, and each process takes a certain amount of memory, so RAM capacity becomes important too. There are custom random number generators for Node.js that can take a seed, at much lower CPU/RAM cost.
I need to generate identifiers in a distributed system.
Duplicates will be detected by the system and will cause the operation that created that identifier to fail. I need to minimize the probability of failing operations by generating identifiers with low collision probability.
I'd also like to be able to describe mathematically how likely it is that a duplicate number is generated. I'm not sure what such a description would look like, preferably I'd like to know the X in something like:
When generating 1000 random numbers per second for 10 years no more than X duplicates should have been generated.
These random numbers can only have 35 significant bits. The system is written in C# and runs on top of Microsoft's .NET platform.
So this is actually two questions in one (but I guess they depend on each other):
What component/pattern should I use to generate identifiers?
How can I compute the X value?
For (1) I see the following candidates:
System.Random
System.Guid
System.Security.Cryptography.RNGCryptoServiceProvider
The fact that I need numbers to have 35 significant bits is not a problem when it comes to generating values, as it is fine to generate a larger number and then just extract 35 of its bits. However, it does affect the mathematical computation, I presume.
UPDATE
I can see now that 35 bits aren't nearly enough for my description above. I don't really need 1 number per millisecond for 10 years. That was an overstatement.
What I really need is a way to distributively generate identifiers that have 35 significant bits with as low probability of a conflict as possible. As time goes by the system will "clean up" identifiers so that it is possible for the same number to be used again without it causing a failure.
I understand that I could of course implement some kind of centralized counter. But I would like to be able to avoid that if possible. I want to minimize the number of network operations needed to maintain the identifiers.
Any suggestions are welcome!
You want to generate 1000 numbers each second for 10 years, so you will generate
1000*60*60*24*365*10 = 315360000000
values in total. You want to use numbers with 35 bits, and there are only
2**35 = 34359738368
possible 35-bit values.
The minimum number of duplicates that you will generate is 315360000000 - 34359738368 = 281000261632. That is a lower bound on X, and it is self-evident: suppose that by some amazing freak you manage to sample each and every one of the 2**35 possible values; from then on, every further sample you make is a duplicate.
I guess we can safely conclude that 35 bits is not enough.
As far as generating good-quality pseudo-random numbers goes, it should be fairly obvious that System.Security.Cryptography.RNGCryptoServiceProvider is the best choice of the three that you present.
If you really want uniqueness, then I suggest that you do the following:
Allocate to each distributed node a unique range of IDs.
Have each node allocate IDs uniquely from its pool of values. For instance, the node starts at the first value and increments the ID by one every time it is asked to generate a new one.
This is really the best strategy if uniqueness matters. But you will likely need to dedicate more bits for your IDs.
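A minimal sketch of that strategy, assuming the node count is fixed and known up front (all names are illustrative):
using System;

// Each node owns a disjoint slice of the 35-bit space and hands out
// IDs sequentially from it, so collisions are impossible by construction.
class NodeIdAllocator
{
    private readonly long _end;   // exclusive upper bound of this node's range
    private long _next;

    public NodeIdAllocator(int nodeIndex, int totalNodes)
    {
        long rangeSize = (1L << 35) / totalNodes;
        _next = nodeIndex * rangeSize;
        _end = _next + rangeSize;
    }

    public long NextId()
    {
        if (_next >= _end)
            throw new InvalidOperationException("This node's ID range is exhausted.");
        return _next++;
    }
}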
Since the probability of collisions steadily increases with a random allocation as you use up more addresses, the system steadily degrades in performance. There is also the looming specter of a non-zero probability that your random selection never terminates, because it never chooses a non-conflicting ID (for any given seed, a PRNG's cycle length is much smaller than its theoretical full range of output). Whether this is a problem in practice depends, of course, on how saturated you expect your address space to be in the long run.
If the IDs don't need to be random, then you almost certainly want to rely on some form of coordination to assign IDs (such as partitioning the address space or using a coordinating manager of some sort to assign IDs) rather than creating random numbers and reconciling collisions after they happen. It will be easier to implement, probably more performant and will allow better saturation of your address space.
In response to comment:
The design of a specific coordination mechanism depends on a lot of factors, such as how many nodes you expect to have, how flexible you need to be about adding/dropping nodes, how long the IDs need to remain unique (i.e. what your strategy for managing ID lifetime is), etc. It's a complex problem that warrants a careful analysis of your expected use cases, including a look at your future scalability requirements. A simple partitioning scheme is sufficient if your number of nodes and/or number of IDs is small, but if you need to scale to larger volumes, it's a much more challenging problem, perhaps requiring more complex allocation strategies.
One possible partitioning design is that you have a centralized manager that allocates IDs in blocks. Each node can then freely increment IDs within that block and only needs to request a new block when it runs out. This can scale well if you expect your ID lifetime to correlate with age, as that generally means that whole blocks will be freed up over time. If ID lifetime is more randomly distributed, though, this could potentially lead to fragmentation and exhaustion of available blocks. So, again, it's a matter of understanding your requirements so that you can design for the scale and usage patterns your application requires.
You can't use random numbers in your case: the Birthday Paradox says that the first collision is expected after roughly
sqrt(2 * N)
samples; in your case:
sqrt(2 * 2^35) = sqrt(2^36) = 2^18 = 262144, i.e. roughly 260,000 items before the first collision
So GUID-based value is the best choice.
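If you want to put an actual number on X, the usual birthday-problem approximations are easy to evaluate. A small sketch (the draw count is just the sqrt(2N) figure from above):
using System;

// Sketch: birthday-problem estimates for k uniform draws from a space of
// size N = 2^35, using the standard exponential approximations.
class CollisionEstimate
{
    static void Main()
    {
        double N = Math.Pow(2, 35);
        double k = 262144; // e.g. the sqrt(2N) figure from above

        // P(at least one collision) ≈ 1 - exp(-k(k-1) / (2N))
        double pAny = 1 - Math.Exp(-k * (k - 1) / (2 * N));

        // Expected number of duplicate draws ≈ k - N(1 - exp(-k/N))
        double expectedDups = k - N * (1 - Math.Exp(-k / N));

        Console.WriteLine("P(collision) ≈ {0:F4}", pAny);
        Console.WriteLine("E[duplicates] ≈ {0:F1}", expectedDups);
    }
}
At k = sqrt(2N) draws the collision probability comes out near 63% and the expected number of duplicates is about 1, which is where that rule of thumb comes from.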
I think for your particular problem all of those random number providers will work about equally well - all should generate a nearly ideal uniform distribution of values.
I've heard that GUID generation can include the MAC address as part of the value, so it might make some parts less random than others, but I'm not sure. Most likely it is evenly distributed as well, but you should check that before relying on it.
The main question you should answer is: do you really need random numbers, or are consecutive ones fine? Maybe consecutive addresses will work better and perform better because of caching. It might then be good to distribute the address space among your machines, so you have a full guarantee of when a collision will occur and can handle it appropriately.
I have been having some difficulty identifying the right configuration for effectively scaling my cloud service. Am I right in assuming we just have to use the scale section of the management portal, and nothing programmatically?
My current configuration for Web Role is
Medium-sized VM (4 GB RAM)
Autoscale - CPU
Instance range - 1 to 10
Target CPU - 50 to 80
Scale up and down by 1 instance at a time
Scale up and down wait time - 5 mins
I used http://loader.io/ to do load testing by sending concurrent requests to an API, and it could support only 50-100 users; after that I was getting timeout (10 secs) errors.
My app will be targeting millions of users on a huge scale, so am not really sure how I can efficiently scale to cater to that much load on the server.
I think the problem could be the scale-up wait time, which is 5 mins (I think that's very high), and in the management portal the lowest option is 5 mins, so I don't know how I can reduce it.
Any suggestions?
Azure's auto-scaling engine examines 60-minute CPU-utilization averages every 5 minutes. This means that every 5 minutes it has a chance to decide if your CPU utilization is too high and scale you up.
If you need something more robust, I'd recommend thinking about the following:
- CPU usage is rarely a good indicator for scaling of websites. Look into requests/sec or requests/current instead of CPU utilization.
- Consider examining the need to scale more frequently (every 1 min?). The Azure portal cannot do this; you'll need either WASABi or AzureWatch for this.
- Depending on your usage patterns, consider looking at shorter time averages to make a decision (i.e. an average over 20 minutes, not 60 minutes). Once again, your choices here are WASABi or AzureWatch.
- Consider looking at the rate of increase in the metrics and not just the latest averages themselves, e.g. requests/sec rose by 20% in the last 20 minutes. Once again, the Azure autoscaling engine cannot do this; consider either WASABi (which may do this) or AzureWatch (which definitely can).
WASABi is an application block from Microsoft (ie: a DLL) that you'll need to configure, host and monitor somewhere yourself. It is pretty flexible and you can override whatever functionality since it is open source.
AzureWatch is a third-party managed service that monitors/autoscales/heals your Azure roles/Virtual Machines/Websites/SQL Azure/etc. It costs money but you let someone else do all the dirty work.
I recently wrote a blog post comparing the three products.
Disclosure: I'm affiliated with AzureWatch
HTH
Another reason why the minimum time is 5 minutes is that it takes Azure some time to assign additional machines to your Cloud Service and replicate your software onto them. (Web Apps don't have that problem.)
In my work as a SaaS admin I have found that for Cloud Services this ramp-up time after scaling can be around 3-5 minutes for our software package.
If you want to configure scaling within the Azure portal, then my suggestion would be to significantly lower your CPU ranges. As Igorek mentioned, Azure scaling looks at the average over the last 60 minutes.
If a Cloud Service is running at 5% CPU for most of the time, then suddenly it peaks and runs at 99%, it will take some time for the Average to go up and trigger your scale settings. Leaving it at 80% will cause scaling to happen far too late.
RL example:
I manage a portal that runs some CPU intensive calculations. At normal usage our Cloud Services tend to run at 2-5% CPU but on rare occasion we've seen it go up to 99% and stay there for a while.
My first scaling attempt was 2 instances, scaling up by 2 at 80% average CPU, but then it took around 40 minutes for the event to trigger because the average CPU did not rise that fast. Right now I have everything set to scale when average CPU goes over 25%, and what I see is that our services scale up after 10-12 minutes.
I'm not saying 25% is the magic number, I'm saying keep in mind that you're working with "average over 60 minutes"
The second thing is that the Azure portal only shows a limited set of scaling options; scaling can be configured in greater detail via PowerShell / REST. The 60-minute interval over which the average is calculated, for example, can be lowered.
I'm working on my 10th grade science fair project right now and I've kind of hit a wall. My project is testing the effect of parallelism on the efficiency of brute-forcing MD5 password hashes. I'll be calculating the number of password combinations per second tested to see how efficient it is, using 1, 4, 16, 32, 64, 128, 512, and 1024 threads. I'm not sure if I'll do a dictionary brute force or a pure brute force. I figure that the dictionary would be easier to parallelize; just split the list up into equal parts for each thread. I haven't written much code yet; I'm just trying to plan it out before I start coding.
My questions are:
Is calculating the password combinations tested/second the best way to determine the performance based on # of threads?
Dictionary or pure brute force? If pure brute force, how would you split up the task into a variable number of threads?
Any other suggestions?
I'm not trying to dampen your enthusiasm, but this is already quite a well understood problem. I'll try to explain what to expect below. But maybe it would be better to do your project in another area. How about "Maximising MD5 hashing throughput"? Then you wouldn't be restricted to just looking at threading.
I think that when you write up your project, you'll need to offer some kind of analysis as to when parallel processing is appropriate and when it isn't.
Each time that your CPU switches to another thread, it has to persist the current thread context and load the new thread's context. This overhead does not occur in a single-threaded process (aside from managed services like garbage collection). So, all else being equal, adding threads won't improve performance, because the CPU must do the original workload plus all of the context switching.
But if you have multiple CPUs (cores) at your disposal, creating one thread per CPU will mean that you can parallelize your calculations without incurring context switching costs. If you have more threads than CPUs then context switching will become an issue.
There are 2 classes of computation: IO-bound and compute-bound. An IO-bound computation can spend large amounts of time waiting for a response from some hardware like a network card or a hard disk. Because the CPU is idle during those waits, you can increase the number of threads to the point where the CPU is maxed out again, and that gain can cancel out the cost of context switching. However, there is a limit to the number of threads, beyond which context switching takes up more time than the threads spend blocking for IO.
Compute-bound computations simply require CPU time for number crunching. This is the kind of computation used by a password cracker. Compute-bound operations do not get blocked, so adding more threads than CPUs will slow down your overall throughput.
The C# ThreadPool already takes care of all of this for you - you just add tasks, and it queues them until a Thread is available. New Threads are only created when a thread is blocked. That way, context switches are minimised.
I have a quad-core machine - breaking the problem into 4 threads, each executing on its own core, will be more or less as fast as my machine can brute force passwords.
To seriously parallelize this problem, you're going to need a lot of CPUs. I've read about using the GPU of a graphics card to attack this problem.
There's an analysis of attack vectors that I wrote up here if it's any use to you. Rainbow tables and the processor/memory trade offs would be another interesting area to do a project in.
To answer your question:
1) There is no single best way to measure thread performance; different problems scale differently with threads, depending on how independent each operation in the target problem is. So you can try the dictionary approach, but when you analyse the results, keep in mind that they might not be applicable to all problems. One very popular benchmark, however, is a shared counter that each thread increments a fixed number of times.
2) Brute force will cover a large number of cases; in fact, without constraints there is an effectively infinite number of possibilities, so you might have to limit your passwords by constraints like a maximum length. One way to distribute brute force is to assign each thread a different starting character for the password (see the sketch after this list). The thread then tests all possible passwords for that starting character. Once the thread finishes its work, it gets another starting character, until all possible starting symbols are used.
3) One suggestion I would like to give you is to test with a smaller number of threads. You are going up to 1024 threads; that is not a good idea. The number of cores on a machine is generally 4 to 10, so try not to exceed the number of cores by a huge margin: a core runs only one thread at any given time. Instead, try to measure performance for different schemes of assigning the problem to the threads.
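As a concrete illustration of point 2, here is a hedged sketch in C# (the target hash, the lowercase alphabet, and the maximum length of 4 are all assumptions):
using System;
using System.Linq;
using System.Security.Cryptography;
using System.Text;
using System.Threading;

// One worker thread per starting character; each thread exhaustively
// tests every candidate that begins with its assigned character.
class PrefixBruteForce
{
    const string Alphabet = "abcdefghijklmnopqrstuvwxyz";
    static readonly byte[] Target =
        MD5.Create().ComputeHash(Encoding.ASCII.GetBytes("code")); // assumed target

    static void Main()
    {
        var threads = Alphabet.Select(first => new Thread(() =>
        {
            using (var md5 = MD5.Create())   // one hasher per thread: MD5 is not thread-safe
                Search(md5, first.ToString(), 4);
        })).ToArray();

        foreach (var t in threads) t.Start();
        foreach (var t in threads) t.Join();
    }

    static void Search(MD5 md5, string candidate, int maxLength)
    {
        byte[] hash = md5.ComputeHash(Encoding.ASCII.GetBytes(candidate));
        if (hash.SequenceEqual(Target))
            Console.WriteLine("Found: " + candidate);
        if (candidate.Length < maxLength)
            foreach (char c in Alphabet)
                Search(md5, candidate + c, maxLength);
    }
}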
Let me know if this helps!
One solution that will work for both a dictionary and a brute force of all possible passwords is an approach based on dividing the job up into work units. Have a shared object responsible for dividing the problem space into units of work - ideally something like 100 ms to 5 seconds' worth of work each - and give a reference to this object to each thread you start. Each thread then operates in a loop like this:
for work_block in work_block_generator.get():
    for item in work_block:
        test_password(item)  # placeholder for the real per-candidate work
The advantage of this over just parcelling up the whole workspace into one chunk per thread up-front is that if one thread works faster than others, it won't run out of work and just sit idle - it'll pick up more chunks.
Ideally your work item generator would have an interface that, when called, returns an iterator, which itself returns individual passwords to test. The dictionary-based one, then, selects a range from the dictionary, while the brute force one selects a prefix to test for each batch. You'll need to use synchronization primitives to stop races between different threads trying to grab work units, of course.
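A hedged C# sketch of that shape (the chunking into string lists is illustrative; the synchronized hand-out is the point):
using System.Collections.Generic;

// Shared work-unit generator: hands out chunks of candidates under a lock,
// so a fast thread simply comes back for more instead of sitting idle.
class WorkBlockGenerator
{
    private readonly IEnumerator<List<string>> _blocks;
    private readonly object _gate = new object();

    public WorkBlockGenerator(IEnumerable<List<string>> blocks)
    {
        _blocks = blocks.GetEnumerator();
    }

    // Returns null when the whole search space has been handed out.
    public List<string> NextBlock()
    {
        lock (_gate)
            return _blocks.MoveNext() ? _blocks.Current : null;
    }
}

class Worker
{
    public static void Run(WorkBlockGenerator gen)
    {
        List<string> block;
        while ((block = gen.NextBlock()) != null)
            foreach (string candidate in block)
            {
                // hash `candidate` and compare it with the target here
            }
    }
}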
In both the dictionary and brute force methods, the problem is Embarrassingly Parallel.
To divide the problem for brute force with n threads, just split, say, the first two (or three) letters (the "prefix") into n pieces. Then each thread has a set of assigned prefixes, like "aa - fz", and is responsible only for testing everything that follows its prefixes.
Dictionary is usually statistically slightly better in practice for cracking more passwords, but brute force, since it covers everything, cannot miss a password within the target length.
I have an environment that serves many devices spread across 3 time zones by receiving and sending data during the wee hours of the night. The distribution of these devices was determined pseudo-randomly based on an identification number and a simple calculation using a modulo operation. The result of such a calculation creates an unnecessary artificial peak which consumes more resources than I'd like during certain hours of the night.
As part of our protocol I can instruct devices when to connect to our system on subsequent nights.
I am seeking an algorithm which can generally flatten the peak into a more level line (albeit generally higher at most times), or at least a shove in the right direction - that is, what sort of terminology should I spend my time reading about? I have available to me the devices' identification numbers, the current time, and each device's time zone as inputs for the calculation. I can also perform some up-front analytical calculations to create pools from which to draw slots, though I feel this approach may be less elegant than I am hoping for (though a learning algorithm may not be a bad thing...).
(Ultimately and somewhat less relevant I will be implementing this algorithm using C#.)
If you want to avoid the spikes associated with using random times, look at the various hash functions used for hash tables. Your reading might start at the Wikipedia article on the subject:
http://en.wikipedia.org/wiki/Hash_function
Basically, divide whatever you want your update window to be into the appropriate number of buckets. One option might be 3 hours * 60 minutes * 60 seconds = 10800 buckets. Then use that as your hashtable size, for the chosen hashing function. Your unique input might be device ID. Don't forget to use GMT for the chosen time. Your programming language of choice probably has a number of built in hashing functions, but the article should provide some links to get you started if you want to implement one from scratch.
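A sketch of that bucket mapping (FNV-1a is just a stand-in for whichever hash function you pick, and the window start time is an assumption):
using System;

// Map each device id to one of 10800 one-second buckets inside a
// 3-hour window, using FNV-1a as an example hash.
class UpdateSlot
{
    const int Buckets = 3 * 60 * 60; // 10800 one-second slots

    static int BucketFor(string deviceId)
    {
        uint hash = 2166136261;          // FNV-1a offset basis
        foreach (char c in deviceId)
        {
            hash ^= c;
            hash *= 16777619;            // FNV-1a prime
        }
        return (int)(hash % Buckets);
    }

    static void Main()
    {
        // Assumed window start, expressed in GMT/UTC as suggested above.
        var windowStartUtc = new DateTime(2024, 1, 1, 1, 0, 0, DateTimeKind.Utc);
        string deviceId = "device-12345";
        DateTime slot = windowStartUtc.AddSeconds(BucketFor(deviceId));
        Console.WriteLine("{0} connects at {1:u}", deviceId, slot);
    }
}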
This approach is superior to the earlier answer of random access times because it has much better evenness properties, and ensures that your access patterns will be approximately flat, as compared to the random function which is likely to sometimes exhibit spikes.
Here's some more specific information on how to implement various functions:
http://www.partow.net/programming/hashfunctions/index.html
You say that you can tell devices what time to connect, so I don't see why you need anything random or modulo-based. When each device connects, pick a time tomorrow which currently doesn't have many devices assigned to it, and assign the device to that time. If the devices all take about the same amount of resources to service, then a trivial greedy algorithm will produce a completely smooth distribution: assign each device to whatever time is currently least congested. If the server handles other work besides these devices, then you'd want to start with its typical load profile and add the device load to that. I wouldn't really call this "analytical calculations", just storing a histogram of expected load against time for the next 24 hours.
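A tiny sketch of that greedy loop (the slot granularity and names are assumptions):
// Greedy smoothing: always assign the next device to the currently
// least-loaded slot in tomorrow's window.
class GreedyScheduler
{
    private readonly int[] _load;

    public GreedyScheduler(int slotCount)
    {
        _load = new int[slotCount]; // optionally pre-seed with the server's base load profile
    }

    public int AssignSlot()
    {
        int best = 0;
        for (int i = 1; i < _load.Length; i++)
            if (_load[i] < _load[best])
                best = i;
        _load[best]++;
        return best; // send this slot back to the device as its next connect time
    }
}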
Or do you have the problem that the device might not obey instructions (for example it might be offline at its assigned time, and then connect whenever it's next on)? Obviously if your users in a particular time zone all start work at the same time in the morning, then that would be a problematic strategy.
Simply take the number of devices, n, divide your time interval into n equal segments, and allocate each segment to a device, informing it when to connect the next time it checks in.
This will give you an optimally uniform distribution in all cases.
Normalize all times to GMT; why would you care about time zones or daylight saving time? Now is now no matter what time zone you're in.
Adding a random distribution can lead to clumping (a uniform random distribution is only uniform in the limit, not necessarily for any particular sample), and it really should only be used if there's no feedback mechanism. Since you can control to some extent when they connect, a random component is not at all necessary and is not even remotely optimal.
If you're concerned about clock drift across devices, consider that even if you added randomness, it wouldn't decrease the randomness of your clock drift in any way and would only contribute to an even less optimal allocation.
If you want to ensure a stable distribution of devices by region, then compute the ratio of devices per region and distribute the slot allocations accordingly. For instance, if you have a 50/25/25 split by time zone, assign two slots to the first time zone and one slot to each of the remaining time zones, then repeat.
I've been tasked with taking an existing single-threaded Monte Carlo simulation and optimising it. This is a C# console app with no DB access; it loads data once from a CSV file and writes the results out at the end, so it's pretty much just CPU-bound, and it only uses about 50 MB of memory.
I've run it through the JetBrains dotTrace profiler. Of the total execution time, about 30% is spent generating uniform random numbers and 24% translating uniform random numbers into normally distributed random numbers.
The basic algorithm is a whole lot of nested for loops, with random number calls and matrix multiplication at the centre. Each iteration returns a double which is added to a results list, and this list is periodically sorted and tested against some convergence criteria (at checkpoints every 5% of the total iteration count). If the criteria are met, the program breaks out of the loops and writes the results; otherwise it proceeds to the end.
I'd like developers to weigh in on:
should I use new Thread v ThreadPool
should I look at the Microsoft Parallel Extensions library
should I look at AForge.Net Parallel.For (http://code.google.com/p/aforge/), or any other libraries?
Some links to tutorials on the above would be most welcome as I've never written any parallel or multi-threaded code.
best strategies for generating normally distributed random numbers en masse, and then consuming them. Uniform random numbers are never used in this state by the app; they are always translated to normally distributed values and then consumed.
good fast libraries (parallel?) for random number generation
memory considerations as I take this parallel, how much extra will I require.
The current app takes 2 hours for 500,000 iterations. The business needs this to scale to 3,000,000 iterations and be called multiple times a day, so it needs some heavy optimisation.
I would particularly like to hear from people who have used Microsoft Parallel Extensions or AForge.Net Parallel.
This needs to be productionised fairly quickly, so the .NET 4 beta is out, even though I know it has concurrency libraries baked in; we can look at migrating to .NET 4 later down the track once it's released. For the moment the server has .NET 2; I've submitted for review an upgrade to .NET 3.5 SP1, which my dev box already has.
Thanks
Update
I've just tried the Parallel.For implementation but it comes up with some weird results.
Single threaded:
IRandomGenerator rnd = new MersenneTwister();
IDistribution dist = new DiscreteNormalDistribution(discreteNormalDistributionSize);
List<double> results = new List<double>();
for (int i = 0; i < CHECKPOINTS; i++)
{
results.AddRange(Oblist.Simulate(rnd, dist, n));
}
To:
Parallel.For(0, CHECKPOINTS, i =>
{
results.AddRange(Oblist.Simulate(rnd, dist, n));
});
Inside Simulate there are many calls to rnd.nextUniform(), and I think I am getting many values that are the same. Is this likely to happen because the loop is now parallel?
Also, maybe there are issues with the List AddRange call not being thread-safe? I see that System.Threading.Collections.BlockingCollection might be worth using, but it only has an Add method, no AddRange, so I'd have to loop over the results and add them in a thread-safe manner. Any insight from someone who has used Parallel.For would be much appreciated. I switched to System.Random for my calls temporarily, as I was getting an exception when calling nextUniform with my Mersenne Twister implementation; perhaps it wasn't thread-safe, since a certain array was getting an index out of bounds...
First you need to understand why you think that using multiple threads is an optimization - when it is, in fact, not. Using multiple threads will make your workload complete faster only if you have multiple processors, and then at most as many times faster as you have CPUs available (this is called the speed-up). The work is not "optimized" in the traditional sense of the word (i.e. the amount of work isn't reduced - in fact, with multithreading, the total amount of work typically grows because of the threading overhead).
So in designing your application, you have to find pieces of work that can be done in a parallel or overlapping fashion. It may be possible to generate random numbers in parallel (by having multiple RNGs run on different CPUs), but that would also change the results, as you'd get different random numbers. Another option is to have generation of the random numbers on one CPU and everything else on different CPUs. This can give you a maximum speedup of about 3, as the RNG will still run sequentially and still takes 30% of the load.
So if you go for this parallelization, you end up with 3 threads: thread 1 runs the RNG, thread 2 produces normal distribution, and thread 3 does the rest of the simulation.
For this architecture, a producer-consumer design is most appropriate. Each thread reads its input from a queue and produces its output into another queue. Each queue should be blocking, so that if the RNG thread falls behind, the normalization thread automatically blocks until new random numbers are available. For efficiency, I would pass the random numbers in arrays of, say, 100 (or larger) across threads, to avoid synchronization on every single random number.
For this approach, you don't need any advanced threading. Just use the regular Thread class - no pool, no library. The only thing you need that is (unfortunately) not in the standard library is a blocking queue class (the Queue class in System.Collections is no good). CodeProject provides a reasonable-looking implementation of one; there are probably others.
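For reference, a blocking queue along those lines is only a few lines of Monitor code. This is a sketch for the .NET 2/3.5 era described here (BlockingCollection<T> later shipped in .NET 4):
using System.Collections.Generic;
using System.Threading;

// Minimal bounded blocking queue: producers block when full,
// consumers block when empty. Monitor.Wait/PulseAll do the signalling.
class BlockingQueue<T>
{
    private readonly Queue<T> _items = new Queue<T>();
    private readonly int _capacity;

    public BlockingQueue(int capacity)
    {
        _capacity = capacity;
    }

    public void Enqueue(T item)
    {
        lock (_items)
        {
            while (_items.Count >= _capacity)
                Monitor.Wait(_items);
            _items.Enqueue(item);
            Monitor.PulseAll(_items);
        }
    }

    public T Dequeue()
    {
        lock (_items)
        {
            while (_items.Count == 0)
                Monitor.Wait(_items);
            T item = _items.Dequeue();
            Monitor.PulseAll(_items);
            return item;
        }
    }
}
The RNG thread would Enqueue double[] batches of, say, 100 numbers, and the normalization thread would Dequeue them, matching the batching suggested above.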
List<double> is definitely not thread-safe. See the section "thread safety" in the System.Collections.Generic.List documentation. The reason is performance: adding thread safety is not free.
Your random number implementation also isn't thread-safe; getting the same numbers multiple times is exactly what you'd expect in this case. Let's use the following simplified model of rnd.NextUniform() to understand what is happening:
1. Calculate a pseudo-random number from the current state of the object.
2. Update the state of the object so the next call yields a different number.
3. Return the pseudo-random number.
Now, if two threads execute this method in parallel, something like this may happen:
1. Thread A calculates a random number as in step 1.
2. Thread B calculates a random number as in step 1. Thread A has not yet updated the state of the object, so the result is the same.
3. Thread A updates the state of the object as in step 2.
4. Thread B updates the state of the object as in step 2, trampling over A's state changes or maybe giving the same result.
As you can see, any reasoning you can do to prove that rnd.NextUniform() works is no longer valid because two threads are interfering with each other. Worse, bugs like this depend on timing and may appear only rarely as "glitches" under certain workloads or on certain systems. Debugging nightmare!
One possible solution is to eliminate the state sharing: give each task its own random number generator initialized with another seed (assuming that instances are not sharing state through static fields in some way).
Another (inferior) solution is to create a field holding a lock object in your MersenneTwister class like this:
private object lockObject = new object();
Then use this lock in your MersenneTwister.NextUniform() implementation:
public double NextUniform()
{
lock(lockObject)
{
// original code here
}
}
This will prevent two threads from executing the NextUniform() method in parallel. The problem with the list in your Parallel.For can be addressed in a similar manner: separate the Simulate call and the AddRange call, and then add locking around the AddRange call.
My recommendation: avoid sharing any mutable state (like the RNG state) between parallel tasks if at all possible. If no mutable state is shared, no threading issues occur. This also avoids locking bottlenecks: you don't want your "parallel" tasks to wait on a single random number generator that doesn't work in parallel at all. Especially if 30% of the time is spend acquiring random numbers.
Limit state sharing and locking to places where you can't avoid it, like when aggregating the results of parallel execution (as in your AddRange calls).
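Putting both recommendations together, the loop from the question might look like this sketch (it assumes your MersenneTwister exposes a seed-taking constructor and that a DiscreteNormalDistribution can be built per task - both assumptions about your code):
int baseSeed = Environment.TickCount;
List<double> results = new List<double>();

Parallel.For(0, CHECKPOINTS, i =>
{
    // Per-task RNG and distribution: no mutable state shared across tasks.
    IRandomGenerator localRnd = new MersenneTwister(baseSeed + i);
    IDistribution localDist = new DiscreteNormalDistribution(discreteNormalDistributionSize);

    var local = Oblist.Simulate(localRnd, localDist, n);

    // List<double> is not thread-safe, so only the cheap aggregation is locked.
    lock (results)
        results.AddRange(local);
});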
Threading is going to be complicated. You will have to break your program into logical units that can each be run on their own threads, and you will have to deal with any concurrency issues that emerge.
The Parallel Extension Library should allow you to parallelize your program by changing some of your for loops to Parallel.For loops. If you want to see how this works, Anders Hejlsberg and Joe Duffy provide a good introduction in their 30 minute video here:
http://channel9.msdn.com/shows/Going+Deep/Programming-in-the-Age-of-Concurrency-Anders-Hejlsberg-and-Joe-Duffy-Concurrent-Programming-with/
Threading vs. ThreadPool
The ThreadPool, as its name implies, is a pool of threads. Using the ThreadPool to obtain your threads has some advantages. Thread pooling enables you to use threads more efficiently by providing your application with a pool of worker threads that are managed by the system.
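For example, queueing work items looks like this minimal sketch (the work itself is a placeholder):
using System;
using System.Threading;

class ThreadPoolDemo
{
    static void Main()
    {
        const int items = 8;
        int pending = items;
        ManualResetEvent allDone = new ManualResetEvent(false);

        for (int i = 0; i < items; i++)
        {
            int id = i; // capture a copy for the closure
            ThreadPool.QueueUserWorkItem(delegate
            {
                Console.WriteLine("work item {0} ran on pool thread {1}",
                    id, Thread.CurrentThread.ManagedThreadId);
                if (Interlocked.Decrement(ref pending) == 0)
                    allDone.Set(); // last item signals completion
            });
        }

        allDone.WaitOne(); // wait for all queued items to finish
    }
}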