Parallel Shuffle in C# 4?

As noted in this question: Randomize a List<T>, you can implement a shuffle extension method on a list, as one of the answers shows:
using System.Security.Cryptography;
...
public static void Shuffle<T>(this IList<T> list)
{
    RNGCryptoServiceProvider provider = new RNGCryptoServiceProvider();
    int n = list.Count;
    while (n > 1)
    {
        // Rejection sampling: keep drawing a byte until it falls in a range
        // that divides evenly by n, so the modulo below is unbiased.
        byte[] box = new byte[1];
        do provider.GetBytes(box);
        while (!(box[0] < n * (Byte.MaxValue / n)));
        int k = box[0] % n;
        n--;
        // Swap the chosen element into the last unshuffled position.
        T value = list[k];
        list[k] = list[n];
        list[n] = value;
    }
}
Does anyone know if it's possible to "Parallel-ize" this using some of the new features in C# 4?
Just a curiosity.
Thanks all,
-R.

Your question is ambiguous, but you can use the Parallel Framework to help run some operations in parallel. It depends on whether you want to get the bytes in parallel and then shuffle, or whether you want to shuffle multiple lists at one time.
If it is the former, I would suggest that you first break your code into smaller functions so you can do some analysis to see where the bottlenecks are; if getting the bytes is the bottleneck, then doing that part in parallel may make a difference. By having tests in place, you can try new functions and decide whether the added complexity is worth it.
static RNGCryptoServiceProvider provider = new RNGCryptoServiceProvider();

private static byte[] GetBytes(int n)
{
    byte[] box = new byte[1];
    do provider.GetBytes(box);
    while (!(box[0] < n * (Byte.MaxValue / n)));
    return box;
}

private static IList<T> InnerLoop<T>(int n, IList<T> list)
{
    var box = GetBytes(n);
    int k = box[0] % n;
    // Swap the chosen element with the last unshuffled position.
    T value = list[k];
    list[k] = list[n - 1];
    list[n - 1] = value;
    return list;
}

public static void Shuffle<T>(this IList<T> list)
{
    int n = list.Count;
    while (n > 1)
    {
        list = InnerLoop(n, list);
        n--;
    }
}
This is a rough idea of how to split your function up so you can replace the GetBytes function, but you may need to make some other changes to test it.
Getting some numbers is crucial to make certain that you are getting enough of a benefit to warrant adding complexity, though.
You may want to move the lines in InnerLoop that deal with the list into a separate function, so you can see whether that part is slow and perhaps swap out some algorithms to improve it; but you need to have an idea how fast the entire shuffle operation needs to be.
If you just want to shuffle multiple lists at once, that is easy; you may want to look at PLINQ or Parallel.ForEach for that, as in the sketch below.
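A minimal sketch of the multiple-lists case, assuming the Shuffle<T> extension from the question; each list is shuffled independently on whatever thread the scheduler picks, and since Shuffle creates its own RNGCryptoServiceProvider per call, the parallel invocations do not share RNG state:
using System.Collections.Generic;
using System.Threading.Tasks;

public static void ShuffleAll<T>(IEnumerable<IList<T>> lists)
{
    // Each list is an independent unit of work, so this parallelizes cleanly.
    Parallel.ForEach(lists, list => list.Shuffle());
}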
UPDATE
The code above is meant only to show how the work can be broken into smaller functions, not to be a working solution. If the provider that I put into a static variable needs to be moved into a function and passed as a parameter, then that may need to be done. I didn't test the code; my suggestion is to profile first and then look at how to improve the performance, especially since I am not certain which part was meant to run in parallel. It may be necessary to build up an array of random values in parallel and then do the shuffle, but first see how long each operation takes, and then decide whether doing it in parallel is warranted.
If you use multiple threads you may also need concurrent data structures, so you don't pay a penalty by having to synchronize access yourself; but again, that may not be needed, depending on where the bottleneck is.
UPDATE:
Based on the answer to my comment, you may want to look at the various functions in the parallel library; this page may help: http://msdn.microsoft.com/en-us/library/dd537609.aspx
You can create a Func version of your function and pass that in as a parameter. There are multiple ways to use this library to run the function in parallel; since you don't have any other global state, just lose the static provider variable.
You will want to gather numbers as you add more threads, though, to see where performance starts to drop off; you won't see a 1:1 improvement, so adding a second thread won't make it twice as fast. Just test and see at what point adding threads becomes a problem. Since your function is CPU bound, one thread per core is a rough starting point.

An in-place shuffle is not well suited to parallelization, especially since a shuffle requires a random component over the whole range, so there is no way to localize parts of the problem.

Well, you could fairly easily parallelize the code which is generating the random numbers - which could be the bottleneck when using a cryptographically secure random number generator. You could then use that sequence of random numbers (which would need to have its order preserved) within a single thread performing the swaps.
One problem which has just occurred to me though, is that RNGCryptoServiceProvider isn't thread-safe (and neither is System.Random). You'd need as many random number generators as threads to make this work. Basically it becomes a bit ugly :(
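A rough sketch of that idea (untested; it keeps the question's single-byte rejection sampling, so it inherits the same limitation that the list must have fewer than 256 elements, and it uses one RNGCryptoServiceProvider per worker via Parallel.For's thread-local overload). The random indices are produced in parallel into a plain array, which preserves their order, and the swaps are then applied on a single thread:
using System;
using System.Collections.Generic;
using System.Security.Cryptography;
using System.Threading.Tasks;

public static void ParallelShuffle<T>(IList<T> list)
{
    int count = list.Count;
    int[] k = new int[count];                    // k[i] is the random index for step i

    // Phase 1 (parallel): generate the random indices; each worker owns a provider.
    Parallel.For(0, count - 1,
        () => new RNGCryptoServiceProvider(),
        (i, state, provider) =>
        {
            int n = count - i;                   // size of the unshuffled prefix at step i
            byte[] box = new byte[1];
            do provider.GetBytes(box);
            while (!(box[0] < n * (Byte.MaxValue / n)));
            k[i] = box[0] % n;
            return provider;
        },
        provider => provider.Dispose());

    // Phase 2 (sequential): apply the swaps in the original Fisher-Yates order.
    for (int i = 0; i < count - 1; i++)
    {
        int n = count - i - 1;
        T value = list[k[i]];
        list[k[i]] = list[n];
        list[n] = value;
    }
}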


Parallel.For() with Interlocked.CompareExchange(): poorer performance and slightly different results to serial version

I experimented with calculating the mean of a list using Parallel.For(). I decided against it as it is about four times slower than a simple serial version. Yet I am intrigued by the fact that it does not yield exactly the same result as the serial one and I thought it would be instructive to learn why.
My code is:
public static double Mean(this IList<double> list)
{
    double sum = 0.0;
    Parallel.For(0, list.Count, i =>
    {
        double initialSum;
        double incrementedSum;
        SpinWait spinWait = new SpinWait();
        // Try incrementing the sum until the loop finds the initial sum unchanged,
        // so that it can safely replace it with the incremented one.
        while (true)
        {
            initialSum = sum;
            incrementedSum = initialSum + list[i];
            if (initialSum == Interlocked.CompareExchange(ref sum, incrementedSum, initialSum)) break;
            spinWait.SpinOnce();
        }
    });
    return sum / list.Count;
}
When I run the code on a random sequence of 2,000,000 points, I get results that differ in the last 2 digits from the serial mean.
I searched Stack Overflow and found this: VB.NET running sum in nested loop inside Parallel.for Synclock loses information. My case, however, is different from the one described there. There, a thread-local variable temp is the cause of the inaccuracy, but I use a single sum that is updated (I hope) according to the textbook Interlocked.CompareExchange() pattern. The question is of course moot because of the poor performance (which surprises me, but I am aware of the overhead), yet I am curious whether there is something to be learnt from this case.
Your thoughts are appreciated.
Using double is the underlying problem; you can convince yourself that the synchronization is not the cause by using long instead. The results you got are in fact correct, but that never makes a programmer happy.
You discovered that floating point math is commutative but not associative. In other words, a + b == b + a, but a + b + c != a + c + b. Implicit in your code is that the order in which the numbers are added is quite random.
This C++ question talks about it as well.
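A tiny illustration of the non-associativity (the values here are simply chosen to make the effect obvious):
double a = 1e20, b = -1e20, c = 1.0;

Console.WriteLine((a + b) + c);   // prints 1
Console.WriteLine(a + (b + c));   // prints 0 -- c vanishes when it is added to b first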
The accuracy issue is very well addressed in the other answers, so I won't repeat it here, other than to say: never trust the low bits of your floating point values. Instead, I'll try to explain the performance hit you're seeing and how to avoid it.
Since you haven't shown your sequential code, I'll assume the absolute simplest case:
double sum = list.Sum();
This is a very simple operation that should work about as fast as it is possible to go on one CPU core. With a very large list it seems like it should be possible to leverage multiple cores to sum the list. And, as it turns out, you can:
double sum = list.AsParallel().Sum();
A few runs of this on my laptop (i3 with 2 cores / 4 logical processors) yield a speedup of about 2.6 times against 2 million random numbers (same list, multiple runs).
Your code however is much, much slower than the simple case above. Instead of simply breaking the list into blocks that are summed independently and then summing the results you are introducing all sorts of blocking and waiting in order to have all of the threads update a single running sum.
Those extra waits, the much more complex code that supports them, creating objects and adding more work for the garbage collector all contribute to a much slower result. Not only are you wasting a whole lot of time on each item in the list but you are essentially forcing the program to do a sequential operation by making it wait for the other threads to leave the sum variable alone long enough for you to update it.
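If you do want to stay with Parallel.For, its thread-local overload sidesteps that contention: each worker accumulates its own partial sum, and the shared total is touched only once per worker at the end. A sketch (it still has the same order-dependent rounding as any parallel sum):
public static double MeanThreadLocal(IList<double> list)
{
    double sum = 0.0;
    object gate = new object();

    Parallel.For(0, list.Count,
        () => 0.0,                                    // per-worker partial sum
        (i, state, local) => local + list[i],         // accumulate without any locking
        local => { lock (gate) { sum += local; } });  // combine once per worker

    return sum / list.Count;
}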
Assuming that the operation you are actually performing is more complex than a simple Sum() can handle, you may find that the Aggregate() method is more useful to you than Parallel.For.
There are several overloads of the Aggregate extension, including one that is effectively a Map Pattern implementation, with similarities to how bigdata systems like MapReduce work. Documentation is here.
This version of Aggregate uses an accumulator seed (the starting value for each thread) and three functions:
updateAccumulatorFunc is called for each item in the sequence and returns an updated accumulator value
combineAccumulatorsFunc is used to combine the accumulators from each partition (thread) in your parallel enumerable
resultSelector selects the final output value from the accumulated result.
A parallel sum using this method looks something like this:
double sum = list.AsParallel().Aggregate(
    // seed value for accumulators
    (double)0,
    // add val to accumulator
    (acc, val) => acc + val,
    // add accumulators
    (acc1, acc2) => acc1 + acc2,
    // just return the final accumulator
    acc => acc
);
For simple aggregations that works fine. For a more complex aggregate that uses a non-trivial accumulator, there is a variant that accepts a factory function to create the initial accumulator state for each thread. This is useful, for example, in an Average implementation:
public class avg_acc
{
    public int count;
    public double sum;
}

public double ParallelAverage(IEnumerable<double> list)
{
    double avg = list.AsParallel().Aggregate(
        // accumulator factory method, called once per thread:
        () => new avg_acc { count = 0, sum = 0 },
        // update count and sum
        (acc, val) => { acc.count++; acc.sum += val; return acc; },
        // combine accumulators
        (ac1, ac2) => new avg_acc { count = ac1.count + ac2.count, sum = ac1.sum + ac2.sum },
        // calculate average
        acc => acc.sum / acc.count
    );
    return avg;
}
While not as fast as the standard Average extension (~1.5 times faster than sequential, 1.6 times slower than parallel) this shows how you can do quite complex operations in parallel without having to lock outputs or wait on other threads to stop messing with them, and how to use a complex accumulator to hold intermediate results.

Random access on .NET lists is slow, but what if I always reference the first element?

I know that in general, .NET Lists are not good for random access. I've always been told that an array would be best for that. I have a program that needs to continually (like more than a billion times) access the first element of a .NET list, and I am wondering if this will slow anything down, or it won't matter because it's the first element in the list. I'm also doing a lot of other things like adding and removing items from the list as I go along, but the List is never empty.
I'm using F#, but I think this applies to any .NET language (I am using .NET Lists, not F# Lists). My list is about 100 elements long.
In F#, the .NET list (System.Collections.Generic.List) is aptly aliased as ResizeArray, which leaves little doubt as to what to expect. It's an array that can resize itself, and not really a list in the CS-classroom understanding of the term. Any performance differences between it and a simple array most likely come from the fact that compiler can be more aggressive about optimizing array usage.
Back to your question. If you only access the first element of a list, it doesn't matter what you choose. Both a ResizeArray and a list (using F# lingo) have O(1) access to the first element (head).
A list would be the preferable choice if your other operations also work on the head element, i.e. you only add elements at the head. If you want to append elements to the end of the list, or mutate elements that are already in it, you'd get better mileage out of a ResizeArray.
That said, a ResizeArray in idiomatic F# code is a rare sight. The usual approach favors (and doesn't suffer from using) immutable data structures, so seeing one would usually be a minor red flag for me.
There is not much difference between the performance of random access for an array and a list. Here's a test on my machine.
var list = Enumerable.Range(1, 100).ToList();
var array = Enumerable.Range(1, 100).ToArray();

int total = 0;
var sw = Stopwatch.StartNew();
for (int i = 0; i < 1000000000; i++) {
    total ^= list[0];
}
Console.WriteLine("Time for list: {0}", sw.Elapsed);

sw.Restart();
for (int i = 0; i < 1000000000; i++) {
    total ^= array[0];
}
Console.WriteLine("Time for array: {0}", sw.Elapsed);
This produces this output:
Time for list: 00:00:05.2002620
Time for array: 00:00:03.0159816
If you know you have a fixed-size list, it makes sense to use an array; otherwise, there's not much cost to the list. (See the update below.)
Update!
I found some pretty significant new information. After executing the script in release mode, the story changes quite a bit.
Time for list: 00:00:02.3048339
Time for array: 00:00:00.0805705
In this case, the performance of the array totally dominates the list. I'm pretty surprised, but the numbers don't lie.
Go with the array.

Parallel while loop in C#

I am trying to write a BBS (Blum Blum Shub) generator, but it is too slow, so I would like to parallelize the while loop. I don't know how to do that. Please help me :).
public static List<BigInteger> GenerateFullCycle(long N, long x0)
{
    List<BigInteger> randomNumbers = new List<BigInteger>();
    var bbs = new BBS_generator(N, x0);
    BigInteger rvalue = bbs.Generate();
    while (randomNumbers.Contains(rvalue) == false)
    {
        randomNumbers.Add(rvalue);
        rvalue = bbs.Generate();
    }
    return randomNumbers;
}
You might want to look into improving your algorithm first. How large does the randomNumbers list get? Contains is pretty slow on larger lists. You might want to use a HashSet instead.
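For example, something along these lines (just a sketch; BBS_generator is your own class, and the list still records the values in generation order):
public static List<BigInteger> GenerateFullCycle(long N, long x0)
{
    var randomNumbers = new List<BigInteger>();
    var seen = new HashSet<BigInteger>();   // O(1) duplicate checks instead of List.Contains
    var bbs = new BBS_generator(N, x0);

    BigInteger rvalue = bbs.Generate();
    while (seen.Add(rvalue))                // Add returns false once rvalue repeats
    {
        randomNumbers.Add(rvalue);
        rvalue = bbs.Generate();
    }
    return randomNumbers;
}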
BBS is slow, as simple as that. Using BigInteger really doesn't help at all - also quite slow.
Apart from that, there is no way to parallelize this code. The second iteration on the loop depends on the first, the third on the second etc.. You can parallelize the GenerateFullCycle calls themselves, but that's about it.
Parallelization isn't a solve-it-all thing. Many things aren't parallelizable at all, and how much do you gain from it? Twice as fast? Even if you managed to separate the work completely and run at 100% efficiency, a 4-core CPU still only makes you four times as fast. It does have its applications, but I fail to see how it could help a BBS generator.
And the number one rule of performance tweaking - profile. Where is your application actually spending time?
Maybe you can use something like a Background Worker:
http://msdn.microsoft.com/en-us/library/cc221403(v=vs.95).aspx

Simple round robin (moving average) array in C#

As a diagnostic, I want to display the number of cycles per second in my app. (Think frames-per-second in a first-person-shooter.)
But I don't want to display the most recent value, or the average since launch. What I want to calculate is the mean of the last X values.
My question is, I suppose, about the best way to store these values. My first thought was to create a fixed size array, so each new value would push out the oldest. Is this the best way to do it? If so, how would I implement it?
EDIT:
Here's the class I wrote: RRQueue. It inherits Queue, but enforces the capacity and dequeues if necessary.
EDIT 2:
Pastebin is so passé. Now on a GitHub repo.
The easiest option for this is probably to use a Queue<T>, as this provides the first-in, first-out behavior you're after. Just Enqueue() your items, and when you have more than X items, Dequeue() the extra item(s).
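A minimal sketch of that approach (the class name, capacity handling, and double samples are my own choices, not from the question):
using System.Collections.Generic;
using System.Linq;

public class MovingAverage
{
    private readonly Queue<double> _window = new Queue<double>();
    private readonly int _capacity;

    public MovingAverage(int capacity) { _capacity = capacity; }

    public double Add(double value)
    {
        _window.Enqueue(value);
        if (_window.Count > _capacity)
            _window.Dequeue();          // drop the oldest sample
        return _window.Average();       // mean of the last <= capacity samples
    }
}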
Possibly use a filter:
average = 0.9*average + 0.1*value
where 'value' is the most recent measurement
Vary the 0.9 and 0.1 as you like (as long as the two sum to 1).
This is not exactly an average, but it does filter out spikes and transients, and it does not require an array for storage.
Greetings,
Karel
If you need the fastest implementation, then yes, a fixed-size array (with a separate count) would be fastest.
You should take a look at the performance monitoring built into Windows :D.
MSDN
The API will feel a bit wonky if you haven't played with it before, but it's fast, powerful, extensible, and it makes quick work of getting usable results.
my implementation:
class RoundRobinAverage
{
    int[] buffer;
    byte _size;
    byte _idx = 0;

    public RoundRobinAverage(byte size)
    {
        _size = size;
        buffer = new int[size];
    }

    public double Calc(int probeValue)
    {
        buffer[_idx++] = probeValue;
        if (_idx >= _size)
            _idx = 0;
        // Cast to double so the division isn't truncated to an integer
        // (buffer.Sum() requires System.Linq).
        return (double)buffer.Sum() / _size;
    }
}
usage:
private RoundRobinAverage avg = new RoundRobinAverage(10);
...
var average = avg.Calc(123);

Fastest way to calculate primes in C#?

I actually have an answer to my question, but it is not parallelized, so I am interested in ways to improve the algorithm. It might be useful as-is for some people anyway.
int Until = 20000000;
BitArray PrimeBits = new BitArray(Until, true);

/*
 * Sieve of Eratosthenes
 * PrimeBits is a simple BitArray where each bit represents an integer,
 * and we mark composite numbers as false
 */
PrimeBits.Set(0, false); // You don't actually need this, just
PrimeBits.Set(1, false); // reminding you that 2 is the smallest prime

for (int P = 2; P < (int)Math.Sqrt(Until) + 1; P++)
    if (PrimeBits.Get(P))
        // These are going to be the multiples of P if it is a prime
        for (int PMultiply = P * 2; PMultiply < Until; PMultiply += P)
            PrimeBits.Set(PMultiply, false);

// We use this to store the actual prime numbers
List<int> Primes = new List<int>();
for (int i = 2; i < Until; i++)
    if (PrimeBits.Get(i))
        Primes.Add(i);
Maybe I could use multiple BitArrays and BitArray.And() them together?
You might save some time by cross-referencing your bit array with a doubly-linked list, so you can more quickly advance to the next prime.
Also, in eliminating later composites once you hit a new prime p for the first time - the first composite multiple of p remaining will be p*p, since everything before that has already been eliminated. In fact, you only need to multiply p by all the remaining potential primes that are left after it in the list, stopping as soon as your product is out of range (larger than Until).
There are also some good probabilistic algorithms out there, such as the Miller-Rabin test. The wikipedia page is a good introduction.
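As a sketch of that direction (separate from the question's sieve): a Miller-Rabin test for a single number, using BigInteger.ModPow to avoid overflow. With the witness bases 2, 7 and 61 it is known to be deterministic for numbers below 4,759,123,141:
using System.Numerics;

static bool IsPrimeMillerRabin(long n)
{
    if (n < 2) return false;
    if (n % 2 == 0) return n == 2;

    // Write n - 1 as d * 2^r with d odd.
    long d = n - 1;
    int r = 0;
    while ((d & 1) == 0) { d >>= 1; r++; }

    foreach (long a in new long[] { 2, 7, 61 })
    {
        if (a % n == 0) continue;                 // skip bases that are multiples of n
        BigInteger x = BigInteger.ModPow(a, d, n);
        if (x == 1 || x == n - 1) continue;

        bool isWitness = true;
        for (int i = 1; i < r && isWitness; i++)
        {
            x = BigInteger.ModPow(x, 2, n);
            if (x == n - 1) isWitness = false;
        }
        if (isWitness) return false;              // a proves that n is composite
    }
    return true;
}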
Parallelisation aside, you don't want to be calculating sqrt(Until) on every iteration. You can also skip multiples of 2, 3 and 5 and only test N % 6 in {1, 5} or N % 30 in {1, 7, 11, 13, 17, 19, 23, 29}.
You should be able to parallelize the factoring algorithm quite easily, since the Nth stage only depends on the sqrt(n)th result, so after a while there won't be any conflicts. But that's not a good algorithm, since it requires lots of division.
You should also be able to parallelize the sieve algorithms, if you have writer work packets which are guaranteed to complete before a read. Mostly the writers shouldn't conflict with the reader - at least once you've done a few entries, they should be working at least N above the reader, so you only need a synchronized read fairly occasionally (when N exceeds the last synchronized read value). You shouldn't need to synchronize the bool array across any number of writer threads, since write conflicts don't arise (at worst, more than one thread will write a true to the same place).
The main issue would be to ensure that any worker being waited on to write has completed. In C++ you'd use a compare-and-set to switch to the worker which is being waited for at any point. I'm not a C# wonk so I don't know how to do it in that language, but the Win32 InterlockedCompareExchange function should be available.
You might also try an actor-based approach, since that way you can schedule the actors working with the lowest values, which may make it easier to guarantee that you're reading valid parts of the sieve without having to lock the bus on each increment of N.
Either way, you have to ensure that all workers have got above entry N before you read it, and the cost of doing that is where the trade-off between parallel and serial is made.
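One simple way to get some parallelism without that reader/writer bookkeeping is to sieve the primes up to sqrt(Until) sequentially first, and then mark the composites for each of those primes in parallel; every write stores the same value (false), so there is nothing to synchronize. A sketch, assuming a bool[] rather than a BitArray (BitArray packs bits into shared ints, so concurrent writes to it could be lost):
// Needs System, System.Collections.Generic and System.Threading.Tasks.
int Until = 20000000;
bool[] isPrime = new bool[Until];
for (int i = 2; i < Until; i++) isPrime[i] = true;

int limit = (int)Math.Sqrt(Until) + 1;

// Phase 1 (sequential): sieve the small primes up to sqrt(Until).
var smallPrimes = new List<int>();
for (int p = 2; p < limit; p++)
{
    if (!isPrime[p]) continue;
    smallPrimes.Add(p);
    for (long m = (long)p * p; m < limit; m += p)
        isPrime[m] = false;
}

// Phase 2 (parallel): each small prime marks its multiples in the rest of the array.
// Re-marking an already-false entry is harmless, so the workers never conflict.
Parallel.ForEach(smallPrimes, p =>
{
    for (long m = (long)p * p; m < Until; m += p)
        isPrime[m] = false;
});

// Collect the primes (unchanged from the question).
var primes = new List<int>();
for (int i = 2; i < Until; i++)
    if (isPrime[i]) primes.Add(i);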
Without profiling we cannot tell which bit of the program needs optimizing.
If you were in a large system, then one would use a profiler to find that the prime number generator is the part that needs optimizing.
Profiling a loop with a dozen or so instructions in it is not usually worth while - the overhead of the profiler is significant compared to the loop body, and about the only ways to improve a loop that small is to change the algorithm to do fewer iterations. So IME, once you've eliminated any expensive functions and have a known target of a few lines of simple code, you're better off changing the algorithm and timing an end-to-end run than trying to improve the code by instruction level profiling.
#DrPizza: Profiling only really helps improve an implementation; it doesn't reveal opportunities for parallel execution or suggest better algorithms (unless you have experience to the contrary, in which case I'd really like to see your profiler).
I only have single-core machines at home, but I ran a Java equivalent of your BitArray sieve, and a single-threaded version of the inversion of the sieve - holding the marking primes in an array, using a wheel to reduce the search space by a factor of five, and then marking a bit array in increments of the wheel using each marking prime. It also reduces storage to O(sqrt(N)) instead of O(N), which helps both in terms of the largest N, paging, and bandwidth.
For medium values of N (1e8 to 1e12), the primes up to sqrt(N) can be found quite quickly, and after that you should be able to parallelise the subsequent search on the CPU quite easily. On my single core machine, the wheel approach finds primes up to 1e9 in 28s, whereas your sieve (after moving the sqrt out of the loop) takes 86s - the improvement is due to the wheel; the inversion means you can handle N larger than 2^32 but makes it slower. Code can be found here. You could parallelise the output of the results from the naive sieve after you go past sqrt(N) too, as the bit array is not modified after that point; but once you are dealing with N large enough for it to matter the array size is too big for ints.
You should also consider a possible change of algorithms.
Consider that it may be cheaper to simply add the elements to your list as you find them.
Perhaps preallocating space for your list will make it cheaper to build and populate.
Are you trying to find new primes? This may sound stupid, but you might be able to load up some sort of a data structure with known primes. I am sure someone out there has a list. It might be a much easier problem to find existing numbers that calculate new ones.
You might also look at Microsoft's Parallel FX Library for making your existing code multi-threaded to take advantage of multi-core systems. With minimal code changes you can make your for loops multi-threaded.
There's a very good article about the Sieve of Eratosthenes: The Genuine Sieve of Eratosthenes
It's in a functional setting, but most of the optimizations also apply to a procedural implementation in C#.
The two most important optimizations are to start crossing out at P^2 instead of 2*P and to use a wheel for the next prime numbers.
For concurrency, you can process all numbers till P^2 in parallel to P without doing any unnecessary work.
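A sketch of those two changes applied to the question's sieve (the sqrt hoisted out of the loop condition and the inner loop starting at P * P; the wheel is left out for brevity):
int Until = 20000000;
BitArray PrimeBits = new BitArray(Until, true);

int limit = (int)Math.Sqrt(Until) + 1;   // hoisted out of the loop condition
for (int P = 2; P < limit; P++)
    if (PrimeBits.Get(P))
        // Everything below P * P has already been crossed out by smaller primes.
        for (long PMultiply = (long)P * P; PMultiply < Until; PMultiply += P)
            PrimeBits.Set((int)PMultiply, false);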
void PrimeNumber(long number)
{
    bool isPrimeNumber = true;
    long value = (long)Math.Sqrt(number);

    if (number < 2)
    {
        isPrimeNumber = false;
    }
    else if (number % 2 == 0)
    {
        isPrimeNumber = (number == 2);   // 2 is the only even prime
    }
    else
    {
        for (long i = 3; i <= value; i = i + 2)
        {
            if (number % i == 0)
            {
                MessageBox.Show("It is divisible by " + i);
                isPrimeNumber = false;
                break;
            }
        }
    }

    if (isPrimeNumber)
    {
        MessageBox.Show("Yes, it is a prime number");
    }
    else
    {
        MessageBox.Show("No, it is not a prime number");
    }
}
