I'm just looking into the new .NET 4.0 features. With that, I'm attempting a simple calculation using Parallel.For and a normal for(x;x;x) loop.
However, I'm getting different results about 50% of the time.
long sum = 0;
Parallel.For(1, 10000, y =>
{
sum += y;
}
);
Console.WriteLine(sum.ToString());
sum = 0;
for (int y = 1; y < 10000; y++)
{
sum += y;
}
Console.WriteLine(sum.ToString());
My guess is that the threads are trying to update "sum" at the same time.
Is there an obvious way around it?
You can't do this. sum is being shared across your parallel threads. You need to make sure that the sum variable is only accessed by one thread at a time:
// DON'T DO THIS!
Parallel.For(0, data.Count, i =>
{
Interlocked.Add(ref sum, data[i]);
});
BUT... This is an anti-pattern because you've effectively serialised the loop: every thread contends on the same Interlocked.Add.
What you need to do is add sub totals and merge them at the end like this:
Parallel.For<int>(0, data.Count, () => 0, (i, loop, subtotal) =>
{
subtotal += data[i];
return subtotal;
},
(x) => Interlocked.Add(ref sum, x)
);
You can find further discussion of this on MSDN: http://msdn.microsoft.com/en-us/library/dd460703.aspx
PLUG: You can find more on this in Chapter 2 of A Guide to Parallel Programming
The following is also definitely worth a read...
Patterns for Parallel Programming: Understanding and Applying Parallel Patterns with the .NET Framework 4 - Stephen Toub
sum += y; is actually sum = sum + y;. You are getting incorrect results because of the following race condition:
Thread1 reads sum
Thread2 reads sum
Thread1 calculates sum+y1, and stores the result in sum
Thread2 calculates sum+y2, and stores the result in sum
sum is now equal to sum+y2, instead of sum+y1+y2.
Your surmise is correct.
When you write sum += y, the runtime does the following:
Read the field onto the stack
Add y to the value on the stack
Write the result back to the field
If two threads read the field at the same time, the change made by the first thread will be overwritten by the second thread.
You need to use Interlocked.Add, which performs the addition as a single atomic operation.
Incrementing a long isn't an atomic operation.
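Applied to the question's loop, a minimal sketch (not from the original answers) might look like this; note that a local captured by the lambda can be passed by ref:

long sum = 0;
Parallel.For(1, 10000, y =>
{
    // Atomic read-modify-write: no lost updates, no torn 64-bit writes.
    Interlocked.Add(ref sum, y);
});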
I think it's important to point out that this loop, as written, is not capable of being partitioned for parallelism, because, as has been mentioned above, each iteration depends on the result of the prior one. The parallel for is designed for explicitly parallel tasks, such as pixel scaling, where no iteration of the loop has data dependencies outside its own iteration.
Parallel.For(0, input.Length, x =>
{
    // Each output element depends only on the matching input element,
    // so iterations are fully independent.
    output[x] = input[x] * scalingFactor;
});
The above is an example of code that allows for easy partitioning for parallelism. A word of warning, though: parallelism comes with a cost. Even the loop I used as an example above is far too simple to bother with a parallel for, because the set-up time takes longer than the time saved via parallelism.
An important point no-one seems to have mentioned: For data-parallel operations (such as the OP's), it is often better (in terms of both efficiency and simplicity) to use PLINQ instead of the Parallel class. The OP's code is actually trivial to parallelize:
long sum = Enumerable.Range(1, 9999).AsParallel().Sum();
The above snippet uses the ParallelEnumerable.Sum method, although one could also use Aggregate for more general scenarios. Refer to the Parallel Loops chapter for an explanation of these approaches.
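As a hedged sketch of the Aggregate route for the same sum (using the four-argument overload that a later answer explains in detail):

long sum = Enumerable.Range(1, 9999).AsParallel().Aggregate(
    0L,                              // seed for each partition
    (subtotal, y) => subtotal + y,   // fold each element into a partition-local subtotal
    (a, b) => a + b,                 // combine the partition subtotals
    total => total);                 // final projection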
What if there are two variables in this code?
For example
long sum1 = 0;
long sum2 = 0;
Parallel.For(1, 10000, y =>
{
sum1 += y;
sum2 = sum1 * y;
}
);
What should we do then? I am guessing that we have to use an array!
I tried to create a multithreaded dice rolling simulation - just for curiosity, the joy of multithreaded programming, and to show others the effect of "random results" (many people can't understand that if you roll a Laplace die six times in a row and have already had 1, 2, 3, 4, 5, the next roll is NOT bound to be a 6). To show them the distribution of n rolls with m dice I created this code.
Well, the result is fine, BUT even though I create a new task for each die, the program runs single-threaded.
Multithreading would be reasonable to simulate "millions" of rerolls with 6 or more dice, as the time to finish grows rapidly.
I read several examples from MSDN that all indicate that there should be several tasks running simultaneously.
Can someone please give me a hint why this code does not utilize many threads / cores? (Not even when I try to run it with 400 dice at once and 10 million rerolls.)
At first I initialize the jagged array that stores the results. 1st dimension: one entry per die; the second dimension will be the distribution of eyes rolled with each die.
Next I create an array of tasks that each return an array of results (the second dimension, as described above).
Each of these arrays has 6 entries that represent each side of a Laplace W6 die. If a roll results in 1 eye, the first entry [0] is increased by +1. So you can visualize how often each value has been rolled.
Then I use a plain for-loop to start all tasks. There is no instruction to wait for a task before all are started.
At the end I wait for all to finish and sum up the results. It does not make any difference if I change
Task.WaitAll(tasks); to
Task.WhenAll(tasks);
Again my question: why doesn't this code utilize more than one core of my CPU? What do I have to change?
Thanks in advance!
Here's the code:
private void buttonStart_Click(object sender, RoutedEventArgs e)
{
int tries = 1000000;
int numberofdice = 20;
int numberofsides = 6; // W6 = 6
var rnd = new Random();
int[][] StoreResult = new int[numberofdice][];
for (int i = 0; i < numberofdice; i++)
{
StoreResult[i] = new int[numberofsides];
}
Task<int[]>[] tasks = new Task<int[]>[numberofdice];
for (int ctr = 0; ctr < numberofdice; ctr++)
{
tasks[ctr] = Task.Run(() =>
{
int newValue = 0;
int[] StoreTemp = new int[numberofsides]; // Array that represents how often each value appeared
for (int i = 1; i <= tries; i++) // how often to roll the dice
{
newValue = rnd.Next(1, numberofsides + 1); // Roll the dice; UpperLimit for random int EXCLUDED from range
StoreTemp[newValue-1] = StoreTemp[newValue-1] + 1; //increases value corresponding to eyes on dice by +1
}
return StoreTemp;
});
StoreResult[ctr] = tasks[ctr].Result; // Summing up the individual results for each die in an array
}
Task.WaitAll(tasks);
// do something to visualize the results - not important for the question
}
The issue here is tasks[ctr].Result. The .Result portion itself waits for the function to complete before storing the resulting int array into StoreResult. Instead, make a new loop after Task.WaitAll to get your results.
You may consider doing a Parallel.ForEach loop instead of manually creating separate tasks for this.
As others have indicated, when you try to aggregate this you just end up waiting for each individual task to finish, so this isn't actually multi-threaded.
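A hedged sketch of that fix, reusing the question's variables (RollOneDie is a hypothetical helper holding the original lambda body):

// Start every task first; do not block inside the starting loop.
for (int ctr = 0; ctr < numberofdice; ctr++)
{
    // RollOneDie is hypothetical: it wraps the dice-rolling lambda body from the question.
    tasks[ctr] = Task.Run(() => RollOneDie(tries, numberofsides));
}

Task.WaitAll(tasks); // block once, after all tasks are already running in parallel

// Only now collect the results; .Result no longer serializes the tasks.
for (int ctr = 0; ctr < numberofdice; ctr++)
{
    StoreResult[ctr] = tasks[ctr].Result;
}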
Very important note: The C# random number generator is not thread safe (see also this MSDN blog post for discussion on the topic). Don't share the same instance between multiple threads. From the documentation:
...Random objects are not thread safe. If your app calls Random methods from multiple threads, you must use a synchronization object to ensure that only one thread can access the random number generator at a time. If you don't ensure that the Random object is accessed in a thread-safe way, calls to methods that return random numbers return 0.
Also, just to be nit-picky, using a Task is not really the same thing as doing multithreading; while you are, in fact, doing multithreading here, it's also possible to do purely asynchronous, non-multithreaded code with async/await. This is used mostly for I/O-bound operations where it's largely pointless to create a separate thread just to wait for a result (but it's desirable to allow the calling thread to do other work while it's waiting for the result).
I don't think you should have to worry about thread safety while assigning to the main array (assuming that each thread is assigning only to a specific index in the array and that no one else is assigning to the same memory location); you only have to worry about locking when multiple threads are accessing/modifying shared mutable state at the same time. If I'm reading this correctly, this is mutable state (but it's not shared mutable state).
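On the Random point, a hedged sketch of one common workaround - give each task its own instance with a distinct seed (the Guid-based seeding trick is illustrative, not from the original answer):

tasks[ctr] = Task.Run(() =>
{
    // One Random per task: instances created in quick succession can end up
    // with the same time-based seed, so derive a distinct seed per task.
    var localRnd = new Random(Guid.NewGuid().GetHashCode());
    int[] storeTemp = new int[numberofsides];
    for (int i = 0; i < tries; i++)
    {
        storeTemp[localRnd.Next(1, numberofsides + 1) - 1]++; // tally the rolled side
    }
    return storeTemp;
});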
I experimented with calculating the mean of a list using Parallel.For(). I decided against it as it is about four times slower than a simple serial version. Yet I am intrigued by the fact that it does not yield exactly the same result as the serial one and I thought it would be instructive to learn why.
My code is:
public static double Mean(this IList<double> list)
{
double sum = 0.0;
Parallel.For(0, list.Count, i => {
double initialSum;
double incrementedSum;
SpinWait spinWait = new SpinWait();
// Try incrementing the sum until the loop finds the initial sum unchanged so that it can safely replace it with the incremented one.
while (true) {
initialSum = sum;
incrementedSum = initialSum + list[i];
if (initialSum == Interlocked.CompareExchange(ref sum, incrementedSum, initialSum)) break;
spinWait.SpinOnce();
}
});
return sum / list.Count;
}
When I run the code on a random sequence of 2000000 points, I get results that differ in the last 2 digits from the serial mean.
I searched stackoverflow and found this: VB.NET running sum in nested loop inside Parallel.for Synclock loses information. My case, however, is different from the one described there. There, a thread-local variable temp is the cause of inaccuracy, but I use a single sum that is updated (I hope) according to the textbook Interlocked.CompareExchange() pattern. The question is of course moot because of the poor performance (which surprises me, but I am aware of the overhead), yet I am curious whether there is something to be learnt from this case.
Your thoughts are appreciated.
Using double is the underlying problem; you can convince yourself that the synchronization is not the cause by using long instead. The results you got are in fact correct, but that never makes a programmer happy.
You discovered that floating point math is commutative but not associative. In other words, a + b == b + a, but (a + b) + c != a + (b + c). Implicit in your code is that the order in which the numbers are added is quite random.
This C++ question talks about it as well.
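A minimal illustration (not from the original answer) of the missing associativity:

double a = 0.1, b = 0.2, c = 0.3;
Console.WriteLine((a + b) + c == a + (b + c)); // False: 0.6000000000000001 vs 0.6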
The accuracy issue is very well addressed in the other answers, so I won't repeat it here, other than to say: never trust the low bits of your floating point values. Instead I'll try to explain the performance hit you're seeing and how to avoid it.
Since you haven't shown your sequential code, I'll assume the absolute simplest case:
double sum = list.Sum();
This is a very simple operation that should work about as fast as it is possible to go on one CPU core. With a very large list it seems like it should be possible to leverage multiple cores to sum the list. And, as it turns out, you can:
double sum = list.AsParallel().Sum();
A few runs of this on my laptop (i3 with 2 cores/4 logical procs) yield a speedup of about 2.6x against the same list of 2 million random numbers.
Your code however is much, much slower than the simple case above. Instead of simply breaking the list into blocks that are summed independently and then summing the results you are introducing all sorts of blocking and waiting in order to have all of the threads update a single running sum.
Those extra waits, the much more complex code that supports them, creating objects and adding more work for the garbage collector all contribute to a much slower result. Not only are you wasting a whole lot of time on each item in the list but you are essentially forcing the program to do a sequential operation by making it wait for the other threads to leave the sum variable alone long enough for you to update it.
Assuming that the operation you are actually performing is more complex than a simple Sum() can handle, you may find that the Aggregate() method is more useful to you than Parallel.For.
There are several overloads of the Aggregate extension, including one that is effectively a Map pattern implementation, similar in spirit to how big data systems like MapReduce work. Documentation is here.
This version of Aggregate uses an accumulator seed (the starting value for each thread) and three functions:
updateAccumulatorFunc is called for each item in the sequence and returns an updated accumulator value
combineAccumulatorsFunc is used to combine the accumulators from each partition (thread) in your parallel enumerable
resultSelector selects the final output value from the accumulated result.
A parallel sum using this method looks something like this:
double sum = list.AsParallel().Aggregate(
// seed value for accumulators
(double)0,
// add val to accumulator
(acc, val) => acc + val,
// add accumulators
(acc1, acc2) => acc1 + acc2,
// just return the final accumulator
acc => acc
);
For simple aggregations that works fine. For a more complex aggregate that uses an accumulator that is non-trivial there is a variant that accepts a function that creates accumulators for the initial state. This is useful for example in an Average implementation:
public class avg_acc
{
public int count;
public double sum;
}
public double ParallelAverage(IEnumerable<double> list)
{
double avg = list.AsParallel().Aggregate(
// accumulator factory method, called once per thread:
() => new avg_acc { count = 0, sum = 0 },
// update count and sum
(acc, val) => { acc.count++; acc.sum += val; return acc; },
// combine accumulators
(ac1, ac2) => new avg_acc { count = ac1.count + ac2.count, sum = ac1.sum + ac2.sum },
// calculate average
acc => acc.sum / acc.count
);
return avg;
}
While not as fast as the standard Average extension (~1.5 times faster than sequential, 1.6 times slower than parallel) this shows how you can do quite complex operations in parallel without having to lock outputs or wait on other threads to stop messing with them, and how to use a complex accumulator to hold intermediate results.
We would like to optionally control the number of "threads" on our parallel loops to avoid overwhelming a web service (for example).
Is it possible to specify a custom MaxDegreeOfParallelism on a Parallel.ForEach loop, but also to revert to the default value as required? Seemingly zero (0) is an invalid value for MaxDegreeOfParallelism, whereas I was hoping it could simply mean "ignore".
In other words, can you avoid writing this type of code?
int numParallelOperations = GetNumParallelOperations();
if (numParallelOperations > 0)
{
ParallelOptions options = new ParallelOptions();
options.MaxDegreeOfParallelism = numParallelOperations;
Parallel.ForEach(items, options, i =>
{
Foo(i);
});
}
else
{
Parallel.ForEach(items, i =>
{
Foo(i);
});
}
Do you mean -1 as per MSDN:
The MaxDegreeOfParallelism limits the number of concurrent operations run by Parallel method calls that are passed this ParallelOptions instance to the set value, if it is positive. If MaxDegreeOfParallelism is -1, then there is no limit placed on the number of concurrently running operations.
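If -1 is acceptable as the "no limit" sentinel, the if/else in the question collapses to a single call - a sketch assuming the question's numParallelOperations, items, and Foo:

ParallelOptions options = new ParallelOptions
{
    // A positive value limits concurrency; -1 means "no limit".
    MaxDegreeOfParallelism = numParallelOperations > 0 ? numParallelOperations : -1
};
Parallel.ForEach(items, options, i =>
{
    Foo(i);
});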
You can control the approximate number of threads like this:
// use only (approximately) one core:
int degreeOfParallelism = 1;
// leave (approximately) one core idle:
int degreeOfParallelism = Environment.ProcessorCount - 1;
// use (approximately) half of the cores:
int degreeOfParallelism = Environment.ProcessorCount > 1 ?
    Environment.ProcessorCount / 2 : 1;
// run at full speed:
int degreeOfParallelism = -1;
var options = new ParallelOptions();
options.MaxDegreeOfParallelism = degreeOfParallelism;
Parallel.For(0, x, options, y =>
{
    // ...
});
This may not be a definitive answer, as I cannot add a comment due to having only just joined StackOverflow. I don't believe it is possible to do what you're asking, but I do know that the MSDN documentation states that -1 is the value that places no limit on the number of concurrent tasks run by the ForEach. From my experience, it is best to leave the CLR to determine how many concurrent tasks will be run unless you really know what you are doing. The Parallel library is high level; if you really needed this kind of control you should be coding at a lower level, in control of your own threads rather than leaving it up to a TaskScheduler or ThreadPool, and that takes a lot of experimentation to get your own algorithms running effectively.
The only thing I can suggest is wrapping the Parallel.ForEach method to include the setting of ParallelOptions.MaxDegreeOfParallelism, to cut down on the duplicate code and enable you to add an interface and test the asynchronous code in a synchronous manner.
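As a hedged sketch of such a wrapper (all names here are hypothetical):

public static class ParallelRunner
{
    // Applies the degree-of-parallelism policy in one place;
    // a non-positive maxDegree falls back to "no limit" (-1).
    public static void ForEach<T>(IEnumerable<T> items, int maxDegree, Action<T> body)
    {
        var options = new ParallelOptions
        {
            MaxDegreeOfParallelism = maxDegree > 0 ? maxDegree : -1
        };
        Parallel.ForEach(items, options, body);
    }
}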
Apologies for not providing a more positive response!
I've been learning a little about parallelism in the last few days, and I came across this example.
I put it side by side with a sequential for loop like this:
private static void NoParallelTest()
{
int[] nums = Enumerable.Range(0, 1000000).ToArray();
long total = 0;
var watch = Stopwatch.StartNew();
for (int i = 0; i < nums.Length; i++)
{
total += nums[i];
}
Console.WriteLine("NoParallel");
Console.WriteLine(watch.ElapsedMilliseconds);
Console.WriteLine("The total is {0}", total);
}
I was surprised to see that the NoParallel method finished way faster than the parallel example given at the site.
I have an i5 PC.
I really thought that the Parallel method would finish faster.
Is there a reasonable explanation for this? Maybe I misunderstood something?
The sequential version was faster because the time spent doing operations on each iteration in your example is very small and there is a fairly significant overhead involved with creating and managing multiple threads.
Parallel programming only increases efficiency when each iteration is sufficiently expensive in terms of processor time.
I think that's because the loop performs a very simple, very fast operation.
In the case of the non-parallel version that's all it does. But the parallel version has to invoke a delegate. Invoking a delegate is quite fast and usually you don't have to worry how often you do that. But in this extreme case, it's what makes the difference. I can easily imagine that invoking a delegate will be, say, ten times slower (or more, I have no idea what the exact ratio is) than adding a number from an array.
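One standard mitigation (a sketch, not from these answers) is range partitioning via Partitioner.Create (in System.Collections.Concurrent), so the delegate cost is paid once per chunk instead of once per element, with a thread-local subtotal merged at the end; this assumes the question's nums array:

long total = 0;
Parallel.ForEach(
    Partitioner.Create(0, nums.Length),   // split the index range into chunks
    () => 0L,                             // thread-local subtotal
    (range, state, subtotal) =>
    {
        // One delegate invocation covers a whole range of indices.
        for (int i = range.Item1; i < range.Item2; i++)
            subtotal += nums[i];
        return subtotal;
    },
    subtotal => Interlocked.Add(ref total, subtotal)); // merge each thread's subtotal once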
I have a program with two methods. The first method takes two arrays as parameters, and performs an operation in which values from one array are conditionally written into the other, like so:
void Blend(int[] dest, int[] src, int offset)
{
for (int i = 0; i < src.Length; i++)
{
int rdr = dest[i + offset];
dest[i + offset] = src[i] > rdr ? src[i] : rdr;
}
}
The second method creates two separate sets of int arrays and iterates through them such that each array of one set is Blended with each array from the other set, like so:
void CrossBlend()
{
int[][] set1 = new int[150][75000]; // we'll pretend this actually compiles
int[][] set2 = new int[25][10000]; // we'll pretend this actually compiles
for (int i1 = 0; i1 < set1.Length; i1++)
{
for (int i2 = 0; i2 < set2.Length; i2++)
{
Blend(set1[i1], set2[i2], 0); // or any offset, doesn't matter
}
}
}
First question: Since this approach is an obvious candidate for parallelization, is it intrinsically thread-safe? It seems like no, since I can conceive of a scenario (unlikely, I think) where one thread's changes are lost because of a different thread's ~simultaneous operation.
If no, would this:
void Blend(int[] dest, int[] src, int offset)
{
lock (dest)
{
for (int i = 0; i < src.Length; i++)
{
int rdr = dest[i + offset];
dest[i + offset] = src[i] > rdr ? src[i] : rdr;
}
}
}
be an effective fix?
Second question: If so, what would be the likely performance cost of using locks like this? I assume that with something like this, if a thread attempts to lock a destination array that is currently locked by another thread, the first thread would block until the lock was released instead of continuing to process something.
Also, how much time does it actually take to acquire a lock? Nanosecond scale, or worse than that? Would this be a major issue in something like this?
Third question: How would I best approach this problem in a multi-threaded way that would take advantage of multi-core processors (and this is based on the potentially wrong assumption that a multi-threaded solution would not speed up this operation on a single core processor)? I'm guessing that I would want to have one thread running per core, but I don't know if that's true.
The potential contention in CrossBlend is set1 - the destination of the blend. Rather than using a lock, which is going to be comparatively expensive relative to the amount of work you are doing, arrange for each thread to work on its own destination. That is, a given destination (the array at some index in set1) is owned by a given task. This is possible since the outcome is independent of the order in which CrossBlend processes the arrays.
Each task should then run just the inner loop of CrossBlend, parameterized with the index (or range of indices) of the dest array in set1 to use.
You can also parallelize the Blend method, since each index is computed independently of the others, so there is no contention there. But on today's machines, with fewer than 40 cores, you will get sufficient parallelism just by threading the CrossBlend method.
To run effectively on multi-core you can either
for N cores, divide the problem into N parts. Given that set1 is reasonably large compared to the number of cores, you could just divide set1 into N ranges and pass each range of indices to one of N threads running the inner CrossBlend loop. That will give you fairly good parallelism, but it's not optimal. (Some threads will finish sooner and end up with no work to do.)
A more involved scheme is to make each iteration of the CrossBlend inner loop a separate task. Have N queues (for N cores), and distribute the tasks among the queues. Start N threads, with each thread reading its tasks from its own queue. If a thread's queue becomes empty, it takes a task from some other thread's queue.
The second approach is best suited to irregularly sized tasks, or where the system is being used for other work, so some cores may be time-slicing between other processes, and you cannot expect equal amounts of work to complete in roughly the same time on different cores.
The first approach is much simpler to code, and will give you a good level of parallelism.
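A minimal sketch of the first approach, letting Parallel.For do the range splitting over set1 (assuming the question's Blend, set1, and set2):

// Each parallel iteration owns exactly one destination array (set1[i1]),
// so no two threads ever write to the same element and no lock is needed.
Parallel.For(0, set1.Length, i1 =>
{
    for (int i2 = 0; i2 < set2.Length; i2++)
    {
        Blend(set1[i1], set2[i2], 0);
    }
});

Parallel.For's default partitioner also hands out index ranges dynamically, which goes a long way toward the work-balancing goal of the second approach without hand-rolled queues.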