I'm using ActionBlock, and I tested whether it works properly as shown below. Sometimes the ActionBlock misses actions, which shouldn't happen at all.
Why is this happening and how can I fix it?
var n = 0;
var action = new Action<int>((i) =>
{
n++;
//...job...
});
for (int i = 0; i < size; i++)
{
var block = new ActionBlock<int>(item => action(item),
new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 6 });
n = 0;
foreach (var a in list)
block.Post(a);
block.Complete();
block.Completion.Wait();
if (n != list.Count)
ShowError(); //it's called sometimes
}
ActionBlock can execute operations in a parallel manner (and I believe it does exactly that in your case). So in this case you just have a data race on the n++ operation.
So the ActionBlock does not actually miss anything; you are just calculating n incorrectly and sometimes (maybe almost all the time) get an incorrect count at the end.
To get the correct value of n you can replace n++ with Interlocked.Increment(ref n) or just add a lock.
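For example, a minimal sketch of a corrected version of the test above (assuming the same list and ShowError from the question, plus using System.Threading; and using System.Threading.Tasks.Dataflow;):
var n = 0;
var block = new ActionBlock<int>(item =>
{
    Interlocked.Increment(ref n); // atomic increment; concurrent executions cannot lose updates
    //...job...
},
new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 6 });

foreach (var a in list)
    block.Post(a);

block.Complete();
block.Completion.Wait();

if (n != list.Count)
    ShowError(); // should no longer fire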
Related
I'm playing around with Parallel.ForEach in a C# console application, but I can't seem to get it right. I'm creating an array with random numbers, and I have a sequential foreach and a Parallel.ForEach that find the largest value in the array. With approximately the same code in C++ I started to see a benefit from using several threads at around 3M values in the array. But the Parallel.ForEach is twice as slow even at 100M values. What am I doing wrong?
class Program
{
static void Main(string[] args)
{
dostuff();
}
static void dostuff() {
Console.WriteLine("How large do you want the array to be?");
int size = int.Parse(Console.ReadLine());
int[] arr = new int[size];
Random rand = new Random();
for (int i = 0; i < size; i++)
{
arr[i] = rand.Next(0, int.MaxValue);
}
var watchSeq = System.Diagnostics.Stopwatch.StartNew();
var largestSeq = FindLargestSequentially(arr);
watchSeq.Stop();
var elapsedSeq = watchSeq.ElapsedMilliseconds;
Console.WriteLine("Finished sequential in: " + elapsedSeq + "ms. Largest = " + largestSeq);
var watchPar = System.Diagnostics.Stopwatch.StartNew();
var largestPar = FindLargestParallel(arr);
watchPar.Stop();
var elapsedPar = watchPar.ElapsedMilliseconds;
Console.WriteLine("Finished parallel in: " + elapsedPar + "ms Largest = " + largestPar);
dostuff();
}
static int FindLargestSequentially(int[] arr) {
int largest = arr[0];
foreach (int i in arr) {
if (largest < i) {
largest = i;
}
}
return largest;
}
static int FindLargestParallel(int[] arr) {
int largest = arr[0];
Parallel.ForEach<int, int>(arr, () => 0, (i, loop, subtotal) =>
{
if (i > subtotal)
subtotal = i;
return subtotal;
},
(finalResult) => {
Console.WriteLine("Thread finished with result: " + finalResult);
if (largest < finalResult) largest = finalResult;
}
);
return largest;
}
}
This is a performance ramification of having a very small delegate body.
We can achieve better performance by using partitioning. In that case the body delegate performs work over a large volume of data per invocation.
static int FindLargestParallelRange(int[] arr)
{
object locker = new object();
int largest = arr[0];
Parallel.ForEach(Partitioner.Create(0, arr.Length), () => arr[0], (range, loop, subtotal) =>
{
for (int i = range.Item1; i < range.Item2; i++)
if (arr[i] > subtotal)
subtotal = arr[i];
return subtotal;
},
(finalResult) =>
{
lock (locker)
if (largest < finalResult)
largest = finalResult;
});
return largest;
}
Pay attention to synchronizing the localFinally delegate. Also note the need for proper initialization of the localInit: () => arr[0] instead of () => 0, since a hard-coded seed of 0 would be wrong for an array containing only negative numbers.
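A small hypothetical repro (not part of the original answer) makes the localInit point concrete: with an all-negative array and a localInit of () => 0, every partition reports 0, which never occurs in the data.
int[] negatives = { -5, -3, -9 };
object sync = new object();
int largest = negatives[0];
Parallel.ForEach(Partitioner.Create(0, negatives.Length), () => 0, (range, loop, subtotal) =>
{
    for (int i = range.Item1; i < range.Item2; i++)
        if (negatives[i] > subtotal)
            subtotal = negatives[i];
    return subtotal;
},
finalResult =>
{
    lock (sync)
        if (largest < finalResult)
            largest = finalResult;
});
Console.WriteLine(largest); // prints 0, but the true maximum is -3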
Partitioning with PLINQ:
static int FindLargestPlinqRange(int[] arr)
{
return Partitioner.Create(0, arr.Length)
.AsParallel()
.Select(range =>
{
int largest = arr[0];
for (int i = range.Item1; i < range.Item2; i++)
if (arr[i] > largest)
largest = arr[i];
return largest;
})
.Max();
}
I highly recommend the free book Patterns of Parallel Programming by Stephen Toub.
As the other answerers have mentioned, the action you're trying to perform against each item here is so insignificant that there are a variety of other factors which end up carrying more weight than the actual work you're doing. These may include:
JIT optimizations
CPU branch prediction
I/O (outputting thread results while the timer is running)
the cost of invoking delegates
the cost of task management
the system incorrectly guessing what thread strategy will be optimal
memory/cpu caching
memory pressure
environment (debugging)
etc.
Running each approach a single time is not an adequate way to test, because it enables a number of the above factors to weigh more heavily on one iteration than on another. You should start with a more robust benchmarking strategy.
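For example (a suggestion of mine, not something the original poster used), a harness such as BenchmarkDotNet takes care of warm-up, repeated iterations, and statistical analysis. A rough sketch might look like this, with the sizes and method bodies as placeholders:
using System;
using System.Linq;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

public class MaxBenchmarks
{
    private int[] arr;

    [Params(1_000_000, 100_000_000)] // placeholder sizes
    public int Size;

    [GlobalSetup]
    public void Setup()
    {
        var rand = new Random(42);
        arr = Enumerable.Range(0, Size).Select(_ => rand.Next()).ToArray();
    }

    [Benchmark(Baseline = true)]
    public int Sequential()
    {
        int largest = arr[0];
        foreach (int i in arr)
            if (i > largest) largest = i;
        return largest;
    }

    [Benchmark]
    public int Plinq() => arr.AsParallel().Max();
}

public class Program
{
    public static void Main() => BenchmarkRunner.Run<MaxBenchmarks>();
}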
Furthermore, your implementation is actually dangerously incorrect. The documentation specifically says:
The localFinally delegate is invoked once per task to perform a final action on each task’s local state. This delegate might be invoked concurrently on multiple tasks; therefore, you must synchronize access to any shared variables.
You have not synchronized your final delegate, so your function is prone to race conditions that would make it produce incorrect results.
As in most cases, the best approach to this one is to take advantage of work done by people smarter than we are. In my testing, the following approach appears to be the fastest overall:
return arr.AsParallel().Max();
The Parallel.ForEach loop should be running slower because the algorithm used is not truly parallel and a lot more work is being done to run it.
In the single-threaded case, to find the max value we can take the first number as our max and compare it to every other number in the array. If one of the numbers is larger than our current max, we swap it in and continue. This way we access each number in the array once, for a total of N comparisons.
In the Parallel loop above, the algorithm creates overhead because each operation is wrapped inside a function call with a return value. So in addition to doing the comparisons, it incurs the overhead of pushing and popping these calls on the call stack. In addition, since each call depends on the value of the call before it, it has to run in sequence.
In the Parallel.For loop below, the array is divided into an explicit number of partitions determined by the variable threadNumber. This limits the overhead of the function calls to a low number.
Note that for small inputs the parallel loop performs slower. However, for 100M elements there is a decrease in elapsed time.
static int FindLargestParallel(int[] arr)
{
var answers = new ConcurrentBag<int>();
int threadNumber = 4;
int partitionSize = arr.Length/threadNumber;
Parallel.For(0, /* starting number */
threadNumber+1, /* Adding 1 to threadNumber in case array.Length not evenly divisible by threadNumber */
i =>
{
if (i*partitionSize < arr.Length) /* check in case # in array is divisible by # threads */
{
var max = arr[i*partitionSize];
for (var x = i*partitionSize;
x < (i + 1)*partitionSize && x < arr.Length;
++x)
{
if (arr[x] > max)
max = arr[x];
}
answers.Add(max);
}
});
/* note the shortcut in finding max in the bag */
return answers.Max(i=>i);
}
Some thoughts here: In the parallel case, there is thread management logic involved that determines how many threads it wants to use. This thread management logic presumably runs on your main thread. Every time a thread returns with a new maximum value, the management logic kicks in and determines the next work item (the next number to process in your array). I'm pretty sure this requires some kind of locking. In any case, determining the next item may even cost more than performing the comparison operation itself.
That sounds like an order of magnitude more work (overhead) to me than a single thread that processes one number after the other. In the single-threaded case there are a number of optimizations at play: no bounds checks, the CPU can load data into its first-level cache, etc. I'm not sure which of these optimizations apply in the parallel case.
Keep in mind that on a typical desktop machine there are only 2 to 4 physical CPU cores available so you will never have more than that actually doing work. So if the parallel processing overhead is more than 2-4 times of a single-threaded operation, the parallel version will inevitably be slower, which you are observing.
Have you attempted to run this on a 32 core machine? ;-)
A better solution would be determine non-overlapping ranges (start + stop index) covering the entire array and let each parallel task process one range. This way, each parallel task can internally do a tight single-threaded loop and only return once the entire range has been processed. You could probably even determine a near optimal number of ranges based on the number of logical cores of the machine. I haven't tried this but I'm pretty sure you will see an improvement over the single-threaded case.
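A rough sketch of that idea (my own illustration, not tested against the poster's data): split the array into one contiguous range per logical core, run a tight sequential loop over each range in its own task, and combine the per-range maxima at the end.
static int FindLargestByRanges(int[] arr)
{
    int rangeCount = Environment.ProcessorCount;             // near-optimal number of ranges
    int chunk = (arr.Length + rangeCount - 1) / rangeCount;  // ceiling division
    var tasks = new Task<int>[rangeCount];

    for (int r = 0; r < rangeCount; r++)
    {
        int start = r * chunk;
        int end = Math.Min(start + chunk, arr.Length);
        tasks[r] = Task.Run(() =>
        {
            if (start >= end) return int.MinValue;            // empty trailing range
            int localMax = arr[start];
            for (int i = start + 1; i < end; i++)             // tight single-threaded loop
                if (arr[i] > localMax) localMax = arr[i];
            return localMax;
        });
    }

    Task.WaitAll(tasks);
    return tasks.Max(t => t.Result);
}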
Try splitting the set into batches and running the batches in parallel, where the number of batches corresponds to your number of CPU cores.
I ran some equations 1K, 10K and 1M times using the following methods:
A "for" loop.
A "Parallel.For" from the System.Threading.Tasks lib, across the entire set.
A "Parallel.For" across 4 batches.
A "Parallel.ForEach" from the System.Threading.Tasks lib, across the entire set.
A "Parallel.ForEach" across 4 batches.
Results: (Measured in seconds)
Conclusion:
Processing batches in parallel using the "Parallel.ForEach" has the best outcome in cases above 10K records. I believe the batching helps because it utilizes all CPU cores (4 in this example), but also minimizes the amount of threading overhead associated with parallelization.
Here is my code:
public void ParallelSpeedTest()
{
var rnd = new Random(56);
int range = 1000000;
int numberOfCores = 4;
int batchSize = range / numberOfCores;
int[] rangeIndexes = Enumerable.Range(0, range).ToArray();
double[] inputs = rangeIndexes.Select(n => rnd.NextDouble()).ToArray();
double[] weights = rangeIndexes.Select(n => rnd.NextDouble()).ToArray();
double[] outputs = new double[rangeIndexes.Length];
/// Series "for"...
var startTimeSeries = DateTime.Now;
for (var i = 0; i < range; i++)
{
outputs[i] = Math.Sqrt(Math.Pow(inputs[i] * weights[i], 2));
}
var durationSeries = DateTime.Now - startTimeSeries;
/// "Parallel.For"...
var startTimeParallel = DateTime.Now;
Parallel.For(0, range, (i) => {
outputs[i] = Math.Sqrt(Math.Pow(inputs[i] * weights[i], 2));
});
var durationParallelFor = DateTime.Now - startTimeParallel;
/// "Parallel.For" in Batches...
var startTimeParallel2 = DateTime.Now;
Parallel.For(0, numberOfCores, (c) => {
var endValue = (c == numberOfCores - 1) ? range : (c + 1) * batchSize;
var startValue = c * batchSize;
for (var i = startValue; i < endValue; i++)
{
outputs[i] = Math.Sqrt(Math.Pow(inputs[i] * weights[i], 2));
}
});
var durationParallelForBatches = DateTime.Now - startTimeParallel2;
/// "Parallel.ForEach"...
var startTimeParallelForEach = DateTime.Now;
Parallel.ForEach(rangeIndexes, (i) => {
outputs[i] = Math.Sqrt(Math.Pow(inputs[i] * weights[i], 2));
});
var durationParallelForEach = DateTime.Now - startTimeParallelForEach;
/// Parallel.ForEach in Batches...
List<Tuple<int,int>> ranges = new List<Tuple<int, int>>();
for (var i = 0; i < numberOfCores; i++)
{
int start = i * batchSize;
int end = (i == numberOfCores - 1) ? range : (i + 1) * batchSize;
ranges.Add(new Tuple<int,int>(start, end));
}
var startTimeParallelBatches = DateTime.Now;
Parallel.ForEach(ranges, (range) => {
for (var i = range.Item1; i < range.Item2; i++) {
outputs[i] = Math.Sqrt(Math.Pow(inputs[i] * weights[i], 2));
}
});
var durationParallelForEachBatches = DateTime.Now - startTimeParallelBatches;
Debug.Print($"=================================================================");
Debug.Print($"Given: Set-size: {range}, number-of-batches: {numberOfCores}, batch-size: {batchSize}");
Debug.Print($".................................................................");
Debug.Print($"Series For: {durationSeries}");
Debug.Print($"Parallel For: {durationParallelFor}");
Debug.Print($"Parallel For Batches: {durationParallelForBatches}");
Debug.Print($"Parallel ForEach: {durationParallelForEach}");
Debug.Print($"Parallel ForEach Batches: {durationParallelForEachBatches}");
Debug.Print($"");
}
I want to assign an index in the range (0 to MaxDegreeOfParallelism) to each incoming thread in a Parallel.For. I have tried using CurrentThread.ManagedThreadId and CurrentThread.Name, but they are not in a given range and I think they sometimes change values (is this right?). What I am using now is a ConcurrentStack in the following way, which works, but I think it is making my loop horribly slow. Any idea how I could solve this problem? Thanks!!
int nCPU = 16; //number of CORES
ConcurrentStack<int> cs = new ConcurrentStack<int>();
for (int i = 0; i < nCPU; i++) { cs.Push(i); }
var options = new ParallelOptions();
options.MaxDegreeOfParallelism = nCPU - 1;
Parallel.For(0, 10000, options, tt =>
{
int threadIndex;
cs.TryPop(out threadIndex); //assign an index to the incoming thread
//..... do stuff with all my variables matrix1[threadIndex][i][j], matrix2.......//
cs.Push(threadIndex); //the thread leaves the previously assigned index
});
I want to do something like the following:
int firstLoopMaxThreads = 1; // or -1
int secondLoopMaxThreads = firstLoopMaxThreads == 1 ? -1 : 1;
Parallel.For(0, m, new ParallelOptions() { MaxDegreeOfParallelism = firstLoopMaxThreads }, i =>
{
//do some processor and/or memory-intensive stuff
Parallel.For(0, n, new ParallelOptions() { MaxDegreeOfParallelism = secondLoopMaxThreads }, j =>
{
//do some other processor and/or memory-intensive stuff
});
});
Would it be worth it, performance-wise, to swap the inner Parallel.For loop with a normal for loop when secondLoopMaxThreads = 1? What is the performance difference between a regular for loop and a Parallel.For loop with MaxDegreeOfParallelism = 1?
Whether it is worth it depends on how many iterations you're talking about and what level of performance you need. In your context, is 1ms considered a lot or a little?
I did a rudimentary test as below (since Thread.Sleep is not entirely accurate, although the for loop measured 15,000ms to within 1ms every time). Over 15,000 iterations repeated 5 times it generally added about 4ms of overhead compared to a standard for loop... but of course results will differ depending on the environment.
for (int z = 0; z < 5; z++)
{
int iterations = 15000;
Stopwatch s = Stopwatch.StartNew();
for (int i = 0; i < iterations; i++)
Thread.Sleep(1);
s.Stop();
Console.WriteLine("#{0}:Elapsed (for): {1:#,0}ms", z, ((double)s.ElapsedTicks / (double)Stopwatch.Frequency) * 1000);
var options = new ParallelOptions() { MaxDegreeOfParallelism = 1 };
s = Stopwatch.StartNew();
Parallel.For(0, iterations, options, (i) => Thread.Sleep(1));
s.Stop();
Console.WriteLine("#{0}: Elapsed (parallel): {1:#,0}ms", z, ((double)s.ElapsedTicks / (double)Stopwatch.Frequency) * 1000);
}
The loop body performs equally well in both versions, but the loop itself is drastically slower with Parallel.For, even for single-threaded execution. Each iteration needs to invoke a delegate, which is much slower than incrementing a loop counter.
If your loop body does anything meaningful, the loop overhead will be dwarfed by the useful work. Just ensure that your work items are not too small and you won't notice a difference.
Nesting parallel loops is rarely a good idea. A single parallel loop is usually enough, provided the work items are neither too small nor too big.
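For illustration only (DoOuterWork and DoInnerWork stand in for the question's own code), the nested loops can usually be flattened into a single parallel outer loop with a plain sequential inner loop:
Parallel.For(0, m, i =>
{
    DoOuterWork(i);               // processor/memory-intensive stuff
    for (int j = 0; j < n; j++)   // plain loop: no per-iteration delegate overhead
        DoInnerWork(i, j);
});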
I wrote this experiment to demonstrate to someone that accessing shared data concurrently from multiple threads is a big no-no. To my surprise, regardless of how many threads I created, I was not able to create a concurrency issue and the value always ended up balanced at 0. I know that the increment operator is not thread-safe, which is why there are methods like Interlocked.Increment() and Interlocked.Decrement() (also noted here: Is the ++ operator thread safe?).
If the increment/decrement operator is not thread-safe, then why does the code below execute without any issues and result in the expected value?
The snippet below creates 2,000 threads: 1,000 constantly incrementing and 1,000 constantly decrementing, to ensure that the data is being accessed by multiple threads at the same time. What makes it worse is that in a normal program you would not have nearly as many threads. Yet despite the exaggerated numbers in this effort to create a concurrency issue, the value always ends up balanced at 0.
static void Main(string[] args)
{
Random random = new Random();
int value = 0;
for (int x=0; x<1000; x++)
{
Thread incThread = new Thread(() =>
{
for (int y=0; y<100; y++)
{
Console.WriteLine("Incrementing");
value++;
}
});
Thread decThread = new Thread(() =>
{
for (int z=0; z<100; z++)
{
Console.WriteLine("Decrementing");
value--;
}
});
incThread.Start();
decThread.Start();
}
Thread.Sleep(TimeSpan.FromSeconds(15));
Console.WriteLine(value);
Console.ReadLine();
}
I'm hoping someone can provide me with an explanation so that I know that all my effort put into writing thread-safe software is not in vain, or perhaps this experiment is flawed in some way. I have also tried with all threads incrementing and with ++i instead of i++. The value always ends up as the expected value.
You'll usually only see issues if you have two threads which are incrementing and decrementing at very close times. (There are also memory model issues, but they're separate.) That means you want them spending most of the time incrementing and decrementing, in order to give you the best chance of the operations colliding.
Currently, your threads will be spending the vast majority of the time sleeping or writing to the console. That's massively reducing the chances of collision.
Additionally, I'd note that absence of evidence is not evidence of absence - concurrency issues can indeed be hard to provoke, particularly if you happen to be running on a CPU with a strong memory model and internally-atomic increment/decrement instructions that the JIT can use. It could be that you'll never provoke the problem on your particular machine - but that the same program could fail on another machine.
IMO these loops are too short. I bet that by the time the second thread starts the first thread has already finished executing its loop and exited. Try to drastically increase the number of iterations that each thread executes. At this point you could even spawn just two threads (remove the outer loop) and it should be enough to see wrong values.
For example, with the following code I'm getting totally wrong results on my system:
static void Main(string[] args)
{
Random random = new Random();
int value = 0;
Thread incThread = new Thread(() =>
{
for (int y = 0; y < 2000000; y++)
{
value++;
}
});
Thread decThread = new Thread(() =>
{
for (int z = 0; z < 2000000; z++)
{
value--;
}
});
incThread.Start();
decThread.Start();
incThread.Join();
decThread.Join();
Console.WriteLine(value);
}
In addition to Jon Skeet's answer:
A simple test that, at least on my little dual-core machine, shows the problem easily:
Sub Main()
Dim i As Long = 1
Dim j As Long = 1
Dim f = Sub()
While Interlocked.Read(j) < 10 * 1000 * 1000
i += 1
Interlocked.Increment(j)
End While
End Sub
Dim l As New List(Of Task)
For n = 1 To 4
l.Add(Task.Run(f))
Next
Task.WaitAll(l.ToArray)
Console.WriteLine("i={0} j={1}", i, j)
Console.ReadLine()
End Sub
i and j should both have the same final value. But they don't!
EDIT
And in case you think that C# is cleverer than VB:
static void Main(string[] args)
{
long i = 1;
long j = 1;
Task[] t = new Task[4];
for (int k = 0; k < 4; k++)
{
t[k] = Task.Run(() => {
while (Interlocked.Read(ref j) < (long)(10*1000*1000))
{
i++;
Interlocked.Increment(ref j);
}});
}
Task.WaitAll(t);
Console.WriteLine("i = {0} j = {1}", i, j);
Console.ReadLine();
}
it isn't ;)
The result: i is around 15% (percent!) lower than j, on my machine. An eight-thread machine would probably make the effect even more pronounced, because the error is more likely to happen if several tasks run truly in parallel and are not just pre-empted.
The above code is flawed, of course :(
If a task is pre-empted just after i++, all other tasks continue to increment i and j, so i is expected to differ from j even if "++" were atomic. There is a simple solution though:
static void Main(string[] args)
{
long i = 0;
int runs = 10*1000*1000;
Task[] t = new Task[Environment.ProcessorCount];
Stopwatch stp = Stopwatch.StartNew();
for (int k = 0; k < t.Length; k++)
{
t[k] = Task.Run(() =>
{
for (int j = 0; j < runs; j++ )
{
i++;
}
});
}
Task.WaitAll(t);
stp.Stop();
Console.WriteLine("i = {0} should be = {1} ms={2}", i, runs * t.Length, stp.ElapsedMilliseconds);
Console.ReadLine();
}
Now a task could be pre-empted somewhere in the loop statements. But that wouldn't affect i. So the only way to see an effect on i would be if a task is pre-empted right at the i++ statement. And that's what was to be shown: it CAN happen, and it's more likely to happen when you have fewer but longer-running tasks.
If you write Interlocked.Increment(ref i); instead of i++, the code runs much longer (because of the locking), but i is exactly what it should be!
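For reference, the only change to the last snippet is in the loop body:
for (int j = 0; j < runs; j++)
{
    Interlocked.Increment(ref i); // atomic read-modify-write, so no increments are lost
}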
I am doing this:
private static void Main(string[] args)
{
var dict1 = new Dictionary<int, string>();
var dict2 = new Dictionary<int, string>();
DateTime t1 = DateTime.Now;
for (int i = 1; i < 1000000; i++)
{
Parallel.Invoke(
() => dict1.Add(i, "Test" + i),
() => dict2.Add(i, "Test" + i) );
}
TimeSpan t2 = DateTime.Now.Subtract(t1);
Console.WriteLine(t2.TotalMilliseconds);
Console.ReadLine();
}
So doing a for loop 1 million time and adding items to two different dictionaries.
The problem is that it takes 11 seconds, which is more than 5 times as long as the normal sequential method (without tasks/threads), which takes only 2 seconds.
I don't know why.
Like others have said or implied, parallel code is not always faster due to the overhead of parallelization.
That being said, your code is adding an item to 2 dictionaries in parallel 1M times while you should be adding 1M items to 2 dictionaries in parallel. The difference is subtle but the end result is code that is ~10% faster (on my machine) than your sequential case.
Parallel.Invoke(() => FillDictionary(dict1, 1000000), () => FillDictionary(dict2, 1000000));
...
private static void FillDictionary(Dictionary<int, string> toFill, int itemCount)
{
for(int i = 0 ; i < itemCount; i++)
toFill.Add(i, "test" + i);
}
There is a certain overhead to parallel invocation, and a benefit from distributing work across multiple cores/CPUs. In this case the overhead is bigger than the actual benefit from distributing the useful work, which is why you see such a significant difference. Try using heavier operations and you will see the difference.
Rewrite that to something like:
Parallel.Invoke(
() =>
{
for (int i = 0; i < 1000000; i++)
{
dict1.Add(i, "Test" + i);
}
},
() =>
{
for (int i = 0; i < 1000000; i++)
{
dict2.Add(i, "Test" + i);
}
}
);
That should be a lot faster because the two threads are initialized exactly once and then run to completion. In your version you are calling each lambda expression 1000000 times, each time waiting for both to finish before continuing.
Parallel.Invoke is really meant to be used for longer running operations. Otherwise the overhead of setting up the parallel tasks and waiting for them all to finish just kills any performance gained by running the tasks in parallel.
Parallel.Invoke means "perform all these tasks and wait until they are done." In this case, you are only ever doing two tasks in parallel. Thus the overhead of parallel invocation is greater than the potential gains from concurrency.
If what you want to do is add 1,000,000 items to two different dictionaries, you should split the workload between the tasks, not by method. The method below does the same as your code, but on my machine it is much faster than your implementation:
var dict1 = new ConcurrentDictionary<int, string>();
var dict2 = new ConcurrentDictionary<int, string>();
Parallel.Invoke(() =>
{
for(int i = 0; i < 500000; i++)
{
dict1[i] = "Test" +i;
dict2[i] ="Test" +i;
}
},
() =>
{
for(int i = 500000; i < 1000000; i++)
{
dict1[i] ="Test" +i;
dict2[i] = "Test" +i;
}
});