Aggregation of Parallel.For does not capture all iterations - C#

I have code that works great using a simple for loop, but I'm trying to speed it up by adapting it to use multiple cores, and I landed on Parallel.For.
At a high level, I'm collecting the results from CalcRoutine for several thousand accounts and storing the results in a 6-element array. I then re-run this process 1,000 times. The order of the elements within each 6-element array is important, but the order of the final 1,000 iterations of these 6-element arrays is not. When I run the code using a for loop, I get a list 6,000 elements long. However, when I try the Parallel.For version, I'm getting something closer to 600. I've confirmed that the line "return localResults" gets called 1,000 times, but for some reason not all of the 6-element arrays get added to the list TotalResults. Any insight as to why this isn't working would be greatly appreciated.
object locker = new object();
Parallel.For(0, iScenarios, () => new double[6], (int k, ParallelLoopState state, double[] localResults) =>
{
    List<double> CalcResults = new List<double>();
    for (int n = iStart; n < iEnd; n++)
    {
        CalcResults.AddRange(CalcRoutine(n, k));
    }
    localResults = this.SumOfResults(CalcResults);
    return localResults;
},
(double[] localResults) =>
{
    lock (locker)
    {
        TotalResults.AddRange(localResults);
    }
});
EDIT: Here's the "non parallel" version:
for (int k = 0; k < iScenarios; k++)
{
    CalcResults.Clear();
    for (int n = iStart; n < iEnd; n++)
    {
        CalcResults.AddRange(CalcRoutine(n, k));
    }
    TotalResults.AddRange(SumOfResults(CalcResults));
}
The output for 1 scenario is a list of 6 doubles, 2 scenarios is a list of 12 doubles, ... n scenarios 6n doubles.
Also per one of the questions, I checked the number of times "TotalResults.AddRange..." gets called, and it's not the full 1,000 times. Why wouldn't this be called each time? With the lock, shouldn't each thread wait for this section to become available?

Check the documentation for Parallel.For:
"These initial states are passed to the first body invocations on each task. Then, every subsequent body invocation returns a possibly modified state value that is passed to the next body invocation. Finally, the last body invocation on each task returns a state value that is passed to the localFinally delegate."
But your body delegate is ignoring the incoming value of localResults that the previous iteration within the same task returned. Having the loop state be an array makes it tricky to write a correct version. This will work, but it looks messy:
// EDIT - Create a zero-length array here as the input to the first iteration
Parallel.For(0, iScenarios, () => new double[0],
    (int k, ParallelLoopState state, double[] localResults) =>
    {
        List<double> CalcResults = new List<double>();
        for (int n = iStart; n < iEnd; n++)
        {
            CalcResults.AddRange(CalcRoutine(n, k));
        }
        // Append this iteration's results to whatever the task has accumulated so far
        localResults = localResults.Concat(
            this.SumOfResults(CalcResults)
        ).ToArray();
        return localResults;
    },
    (double[] localResults) =>
    {
        lock (locker)
        {
            TotalResults.AddRange(localResults);
        }
    });
(This assumes LINQ's enumerable extensions are in scope, for Concat.)
I'd suggest using a different data structure for the state (e.g. a List<double> rather than double[]) that more naturally allows elements to be appended - but that would mean changing SumOfResults, which you've not shown. Or just keep it all a bit more abstract:
Parallel.For(0, iScenarios, () => Enumerable.Empty<double>(),
    (int k, ParallelLoopState state, IEnumerable<double> localResults) =>
    {
        List<double> CalcResults = new List<double>();
        for (int n = iStart; n < iEnd; n++)
        {
            CalcResults.AddRange(CalcRoutine(n, k));
        }
        return localResults.Concat(this.SumOfResults(CalcResults));
    },
    (IEnumerable<double> localResults) =>
    {
        lock (locker)
        {
            TotalResults.AddRange(localResults);
        }
    });
(If it had worked the way you seem to have assumed, why would they have you provide two separate delegates, if all it did, on the return from body, was to immediately invoke localFinally with the return value?)
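For reference, here's a self-contained sketch of the List<double>-based state suggested above. CalcRoutine and SumOfResults aren't shown in the question, so the stubs below are assumptions that just produce 6 values per scenario:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

class ListStateDemo
{
    // Stubs standing in for the question's methods (assumptions, not the real code):
    static IEnumerable<double> CalcRoutine(int n, int k) => new[] { (double)(n + k) };
    static double[] SumOfResults(List<double> source) => new double[6]; // 6 values per scenario

    public static int Run(int iScenarios, int iStart, int iEnd)
    {
        var totalResults = new List<double>();
        object locker = new object();

        Parallel.For(0, iScenarios,
            () => new List<double>(),               // localInit: one list per task
            (k, state, localResults) =>
            {
                var calcResults = new List<double>();
                for (int n = iStart; n < iEnd; n++)
                    calcResults.AddRange(CalcRoutine(n, k));
                localResults.AddRange(SumOfResults(calcResults)); // append, don't overwrite
                return localResults;                // the same list flows to the next iteration
            },
            localResults =>
            {
                lock (locker) { totalResults.AddRange(localResults); } // merge once per task
            });

        return totalResults.Count;
    }

    static void Main() => Console.WriteLine(Run(1000, 0, 10)); // 6000
}
```

Because the list is appended to rather than replaced, every scenario's 6 results survive until the per-task merge.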

Try this:
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

class Program
{
    static void Main(string[] args)
    {
        var iScenarios = 6;
        var iStart = 0;
        var iEnd = 1000;
        var totalResults = new List<double>();
        Parallel.For(0, iScenarios, k =>
        {
            List<double> calcResults = new List<double>();
            for (int n = iStart; n < iEnd; n++)
                calcResults.AddRange(CalcRoutine(n, k));
            lock (totalResults)
            {
                totalResults.AddRange(calcResults);
            }
        });
    }

    static IEnumerable<double> CalcRoutine(int a, int b)
    {
        yield return 0;
    }

    static double[] SumOfResults(IEnumerable<double> source)
    {
        return source.ToArray();
    }
}

Related

Create Hashset with a large number of elements (1M)

I have to create a HashSet with the elements from 1 to N+1, where N is a large number (1M).
For example, if N = 5, the HashSet will have then integers {1, 2, 3, 4, 5, 6 }.
The only way I have found is:
HashSet<int> numbers = new HashSet<int>(N);
for (int i = 1; i <= (N + 1); i++)
{
    numbers.Add(i);
}
Are there another faster (more efficient) ways to do it?
6 is a tiny number of items, so I suspect the real problem is adding a few thousand items. The delays in this case are caused by buffer reallocations, not the speed of Add itself.
The solution is to specify a capacity, even an approximate one, when constructing the HashSet:
var set = new HashSet<int>(1000);
If, and only if, the input implements ICollection<T>, the HashSet<T>(IEnumerable<T>) constructor will check the size of the input collection and use it as its capacity:
if (collection is ICollection<T> coll)
{
    int count = coll.Count;
    if (count > 0)
    {
        Initialize(count);
    }
}
Explanation
Most containers in .NET use buffers internally to store data. This is far faster than implementing containers with pointers, nodes, etc., due to CPU cache and RAM access delays. Accessing the next item in the CPU's cache is far faster than chasing a pointer in RAM, on all CPUs.
The downside is that each time the buffer is full, a new one has to be allocated, typically with twice the size of the original. Adding items one by one can result in log2(N) reallocations. This works fine for a moderate number of items but can result in a lot of orphaned buffers when adding e.g. 1000 items one by one. All those temporary buffers will have to be garbage collected at some point, causing additional delays.
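HashSet<T> doesn't expose its capacity directly, but List<T> follows the same doubling strategy and does, so a quick sketch can make the growth visible (exact buffer sizes are implementation details and may vary by runtime):

```csharp
using System;
using System.Collections.Generic;

class CapacityGrowthDemo
{
    static void Main()
    {
        // Report each time List<T>'s internal buffer grows while adding 1000 items.
        var list = new List<int>();
        int lastCapacity = -1;
        int reallocations = 0;
        for (int i = 0; i < 1000; i++)
        {
            list.Add(i);
            if (list.Capacity != lastCapacity)
            {
                Console.WriteLine($"count={list.Count,4}  capacity={list.Capacity}");
                lastCapacity = list.Capacity;
                reallocations++;
            }
        }
        Console.WriteLine($"buffer grew {reallocations} times");

        // A preallocated list allocates its buffer exactly once:
        var preallocated = new List<int>(1000);
        Console.WriteLine(preallocated.Capacity); // 1000
    }
}
```

On current .NET the unsized list reallocates roughly log2(1000) times (4, 8, 16, ... 1024), while the preallocated one never does.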
Here's the code to test the three options:
var N = 1000000;
var trials = new List<(int method, TimeSpan duration)>();
for (var i = 0; i < 100; i++)
{
    var sw = Stopwatch.StartNew();
    HashSet<int> numbers1 = new HashSet<int>(Enumerable.Range(1, N + 1));
    sw.Stop();
    trials.Add((1, sw.Elapsed));

    sw = Stopwatch.StartNew();
    HashSet<int> numbers2 = new HashSet<int>(N);
    for (int n = 1; n < N + 1; n++)
        numbers2.Add(n);
    sw.Stop();
    trials.Add((2, sw.Elapsed));

    sw = Stopwatch.StartNew(); // restart the stopwatch for the third trial
    HashSet<int> numbers3 = new HashSet<int>(N);
    foreach (int n in Enumerable.Range(1, N + 1))
        numbers3.Add(n);
    sw.Stop();
    trials.Add((3, sw.Elapsed));
}
for (int j = 1; j <= 3; j++)
    Console.WriteLine(trials.Where(x => x.method == j).Average(x => x.duration.TotalMilliseconds));
Typical output is this:
31.314788
16.493208
16.493208
It is nearly twice as fast to preallocate the capacity of the HashSet<int>.
In this run there was no measurable difference between the traditional loop and the LINQ foreach option.
To build on @Enigmativity's answer, here's a proper benchmark using BenchmarkDotNet:
public class Benchmark
{
    private const int N = 1000000;

    [Benchmark]
    public HashSet<int> EnumerableRange() => new HashSet<int>(Enumerable.Range(1, N + 1));

    [Benchmark]
    public HashSet<int> NoPreallocation()
    {
        var result = new HashSet<int>();
        for (int n = 1; n < N + 1; n++)
        {
            result.Add(n);
        }
        return result;
    }

    [Benchmark]
    public HashSet<int> Preallocation()
    {
        var result = new HashSet<int>(N);
        for (int n = 1; n < N + 1; n++)
        {
            result.Add(n);
        }
        return result;
    }
}

public class Program
{
    public static void Main(string[] args)
    {
        BenchmarkRunner.Run(typeof(Program).Assembly);
    }
}
With the results:

Method           Mean      Error     StdDev
EnumerableRange  29.17 ms  0.743 ms  2.179 ms
NoPreallocation  23.96 ms  0.471 ms  0.775 ms
Preallocation    11.68 ms  0.233 ms  0.665 ms
As we can see, using LINQ is a bit slower than not using it (as expected), and preallocating saves a significant amount of time.

Multiple thread accessing and editing the same double array

I need to iterate through every double in an array to do "Laplacian smoothing", mixing values with neighbouring doubles.
I'll keep the computed values in a temporary clone of the array and update the original at the end.
Pseudo code:
double[] A = new double[1000];
// Filling A with values...
double[] B = A.Clone() as double[];
for (int loops = 0; loops < 10; loops++) // start of the loop
{
    for (int i = 0; i < 1000; i++) // iterating through all doubles in the array
    // Parallel.For(0, 1000, (i) => {
    {
        double v = A[i];
        B[i] -= v;
        B[i + 1] += v / 2;
        B[i - 1] += v / 2;
        // here I'm going out of array bounds, I know. Pseudo code, not relevant.
    }
    // });
}
A = B.Clone() as double[];
With for it works correctly, "smoothing" the values in the array.
With Parallel.For() I have some synchronization problems: threads are colliding, and some values are not stored correctly. Threads access and edit the array at the same index many times.
(I haven't tested this on a linear array; I'm actually working on a multidimensional array [x,y,z].)
How can I solve this?
I was thinking of making a separate array for each thread and doing the sum later... but I'd need to know the thread index, and I haven't found that anywhere on the web. (I'm still interested in whether a "thread index" exists, even with a totally different solution...)
I'll accept any solution.
You probably need one of the more advanced overloads of the Parallel.For method:
public static ParallelLoopResult For<TLocal>(int fromInclusive, int toExclusive,
ParallelOptions parallelOptions, Func<TLocal> localInit,
Func<int, ParallelLoopState, TLocal, TLocal> body,
Action<TLocal> localFinally);
Executes a for loop with thread-local data in which iterations may run in parallel, loop options can be configured, and the state of the loop can be monitored and manipulated.
This looks quite intimidating with all the various lambdas it expects. The idea is to have each thread work with local data, and finally merge the data
at the end. Here is how you could use this method to solve your problem:
double[] A = new double[1000];
double[] B = (double[])A.Clone();
object locker = new object();
var parallelOptions = new ParallelOptions()
{
    MaxDegreeOfParallelism = Environment.ProcessorCount
};
Parallel.For(0, A.Length, parallelOptions,
    localInit: () => new double[A.Length], // create a temp array per thread
    body: (i, state, temp) =>
    {
        double v = A[i];
        temp[i] -= v;
        temp[i + 1] += v / 2; // boundary indexes left unguarded, as in the question's pseudo code
        temp[i - 1] += v / 2;
        return temp; // return a reference to the same temp array
    },
    localFinally: localB =>
    {
        // Can be called in parallel with other threads, so we need to lock
        lock (locker)
        {
            for (int i = 0; i < localB.Length; i++)
            {
                B[i] += localB[i];
            }
        }
    });
I should mention that the workload of the above example is too granular, so I wouldn't expect large improvements in performance from the parallelization. Hopefully your actual workload is chunkier. If, for example, you have two nested loops, parallelizing only the outer loop works well, because the inner loop provides the much-needed chunkiness.
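To illustrate the point about chunkiness, here's a hypothetical nested loop where only the outer loop is parallel, so each work item is a whole row:

```csharp
using System;
using System.Threading.Tasks;

class OuterLoopDemo
{
    public static double[,] Fill(int rows, int cols)
    {
        var grid = new double[rows, cols];
        // Only the outer loop is parallel: each task handles a full row,
        // so scheduling overhead is paid per row, not per element.
        Parallel.For(0, rows, r =>
        {
            for (int c = 0; c < cols; c++)   // the inner loop provides the chunkiness
                grid[r, c] = Math.Sqrt(r * c);
        });
        return grid;
    }

    static void Main()
    {
        var grid = Fill(100, 100_000);
        Console.WriteLine(grid[1, 4]); // 2
    }
}
```

Each task here writes to a disjoint set of rows, so no locking is needed at all.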
Alternative solution: Instead of creating auxiliary arrays per thread, you could just update directly the B array, and use locks only when processing an index in the dangerous zone near the boundaries of the partitions:
Parallel.ForEach(Partitioner.Create(0, A.Length), parallelOptions, range =>
{
    bool lockTaken = false;
    try
    {
        for (int i = range.Item1; i < range.Item2; i++)
        {
            bool shouldLock = i < range.Item1 + 1 || i >= range.Item2 - 1;
            if (shouldLock) Monitor.Enter(locker, ref lockTaken);
            double v = A[i];
            B[i] -= v;
            B[i + 1] += v / 2;
            B[i - 1] += v / 2;
            if (shouldLock) { Monitor.Exit(locker); lockTaken = false; }
        }
    }
    finally
    {
        if (lockTaken) Monitor.Exit(locker);
    }
});
OK, it appears that modulus can solve pretty much all my problems.
Here is a really simplified version of the working code:
(the full script is 3D and unfinished...)
private void RunScript(bool Go, ref object Results)
{
    if (Go)
    {
        LaplacianSmooth(100);
        // Needed to restart "RunScript" over and over
        this.Component.ExpireSolution(true);
    }
    else
    {
        A = new double[count];
        A[100] = 10000;
        A[500] = 10000;
    }
    Results = A;
}

// <Custom additional code>
public static int T = Environment.ProcessorCount;
public static int count = 1000;
public double[] A = new double[count];
public double[,] B = new double[count, T];

public void LaplacianSmooth(int loops)
{
    for (int loop = 0; loop < loops; loop++)
    {
        B = new double[count, T];
        // Copying values to the first column of the temp multidimensional array
        Parallel.For(0, count, new ParallelOptions { MaxDegreeOfParallelism = T }, i =>
        {
            B[i, 0] = A[i];
        });
        // Applying Laplacian smoothing
        Parallel.For(0, count, new ParallelOptions { MaxDegreeOfParallelism = T }, i =>
        {
            int t = i % T; // column index must stay within B's T columns
            // Wrapped next and previous element indexes
            int n = (i + 1) % count;
            int p = (i + count - 1) % count;
            double v = A[i] * 0.5;
            B[i, t] -= v;
            B[p, t] += v / 2;
            B[n, t] += v / 2;
        });
        // Copying values back to the main array
        Parallel.For(0, count, new ParallelOptions { MaxDegreeOfParallelism = T }, i =>
        {
            double val = 0;
            for (int t = 0; t < T; t++)
            {
                val += B[i, t];
            }
            A[i] = val;
        });
    }
}
There are no "collisions" with the threads, as confirmed by the result of "Mass Addition" (a sum) that is constant at 20000.
Thanks everyone for the tips!

Linq to objects is 20 times slower than plain C#. Is there a way to speed it up?

If I need just the maximum (or the 3 biggest items) of an array, and I do it with myArray.OrderBy(...).First() (or myArray.OrderBy(...).Take(3)), it is 20 times slower than calling myArray.Max(). Is there a way to write a faster LINQ query? This is my sample:
using System;
using System.Linq;

namespace ConsoleApp1
{
    class Program
    {
        static void Main(string[] args)
        {
            var array = new int[1000000];
            for (int i = 0; i < array.Length; i++)
            {
                array[i] = i;
            }
            var maxResults = new int[10];
            var linqResults = new int[10];
            var start = DateTime.Now;
            for (int i = 0; i < maxResults.Length; i++)
            {
                maxResults[i] = array.Max();
            }
            var maxEnd = DateTime.Now;
            for (int i = 0; i < maxResults.Length; i++)
            {
                linqResults[i] = array.OrderByDescending(it => it).First();
            }
            var linqEnd = DateTime.Now;
            // 00:00:00.0748281
            // 00:00:01.5321276
            Console.WriteLine(maxEnd - start);
            Console.WriteLine(linqEnd - maxEnd);
            Console.ReadKey();
        }
    }
}
You sort the initial array 10 times in a loop:
for (int i = 0; i < maxResults.Length; i++)
{
    linqResults[i] = array.OrderByDescending(it => it).First();
}
Let's do it once:
// 10 top items of the array
var linqResults = array
    .OrderByDescending(it => it)
    .Take(10)
    .ToArray();
Please note that
for (int i = 0; i < maxResults.Length; i++)
{
    maxResults[i] = array.Max();
}
just computes the same Max value 10 times (it doesn't return the 10 top items).
The Max method runs in O(n) time, while ordering is O(n log n) at best.
The first problem with your code is that you are sorting 10 times, which is the worst-case scenario. You can sort once and take the top 10, as Dmitry answered.
Also, calling the Max method 10 times does not give you the 10 biggest values, just the biggest value 10 times.
However, Max iterates the list once and keeps the maximum in a separate variable. You can rewrite this approach to iterate your array once while keeping the 10 biggest values seen so far in maxResults; this is the fastest way to get the result.
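A hedged sketch of that single-pass idea: keep the k largest values seen so far in a small sorted buffer, so most elements are rejected after one comparison (the TopK helper name is made up for illustration):

```csharp
using System;
using System.Linq;

class TopKDemo
{
    // Single pass over the array, maintaining the k largest values seen so far.
    // top[] is kept sorted ascending, so top[0] is the smallest of the current
    // top-k and most elements are rejected with one comparison.
    public static int[] TopK(int[] source, int k)
    {
        var top = new int[k];
        for (int i = 0; i < k; i++) top[i] = int.MinValue;
        foreach (int value in source)
        {
            if (value <= top[0]) continue;   // not in the top k
            // Insert: shift smaller entries down, keeping top[] sorted ascending.
            int pos = 0;
            while (pos + 1 < k && top[pos + 1] < value)
            {
                top[pos] = top[pos + 1];
                pos++;
            }
            top[pos] = value;
        }
        Array.Reverse(top);                  // largest first
        return top;
    }

    static void Main()
    {
        var array = Enumerable.Range(0, 1_000_000).ToArray();
        var top3 = TopK(array, 3);
        Console.WriteLine(string.Join(",", top3)); // 999999,999998,999997
    }
}
```

This is O(n) for typical data (each element usually costs one comparison), with a worst case of O(n * k), which is still far cheaper than a full O(n log n) sort for small k.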
It seems that others have filled the efficiency gap that Microsoft has left in linq-to-objects:
https://morelinq.github.io/3.1/ref/api/html/M_MoreLinq_MoreEnumerable_PartialSort__1_3.htm

ThreadPool behaves different for debug mode and runtime

I want to use ThreadPool to complete long-running jobs in less time. My method does more work, of course, but I prepared a simple example so you can understand my situation. If I run this application, it throws an ArgumentOutOfRangeException on the commented line, and the debugger shows that i is equal to 10. How can it enter the for loop if i is 10?
If I debug this code instead of running it, it does not throw an exception and works fine.
public void Test()
{
    List<int> list1 = new List<int>();
    List<int> list2 = new List<int>();
    for (int i = 0; i < 10; i++) list1.Add(i);
    for (int i = 0; i < 10; i++) list2.Add(i);
    int toProcess = list1.Count;
    using (ManualResetEvent resetEvent = new ManualResetEvent(false))
    {
        for (int i = 0; i < list1.Count; i++)
        {
            ThreadPool.QueueUserWorkItem(
                new WaitCallback(delegate(object state)
                {
                    // ArgumentOutOfRangeException with i=10
                    Sum(list1[i], list2[i]);
                    if (Interlocked.Decrement(ref toProcess) == 0)
                        resetEvent.Set();
                }), null);
        }
        resetEvent.WaitOne();
    }
    MessageBox.Show("Done");
}

private void Sum(int p, int p2)
{
    int sum = p + p2;
}
What is the problem here?
The problem is that i==10, but your lists have 10 items (i.e. a maximum index of 9).
This is because you have a race condition over a captured variable that is being changed before your delegate runs. Will the next iteration of the loop increment the value before the delegate runs, or will your delegate run before the loop increments the value? It's all down to the timing of that specific run.
Your instinct is that i will have a value of 0-9. However, when the loop reaches its termination, i will have a value of 10. Because the delegate captures i, the value of i may well be used after the loop has terminated.
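A minimal standalone sketch of that capture behavior (for loops share one i across all iterations, even in current C# - only foreach got a fresh variable per iteration, in C# 5):

```csharp
using System;
using System.Collections.Generic;

class ClosureDemo
{
    public static List<int> SharedCapture()
    {
        var results = new List<int>();
        var actions = new List<Action>();
        for (int i = 0; i < 3; i++)
            actions.Add(() => results.Add(i)); // every delegate captures the SAME i
        foreach (var a in actions) a();        // runs after the loop, when i == 3
        return results;                        // 3, 3, 3
    }

    public static List<int> CopiedCapture()
    {
        var results = new List<int>();
        var actions = new List<Action>();
        for (int i = 0; i < 3; i++)
        {
            int idx = i;                       // fresh variable per iteration
            actions.Add(() => results.Add(idx));
        }
        foreach (var a in actions) a();
        return results;                        // 0, 1, 2
    }

    static void Main()
    {
        Console.WriteLine(string.Join(" ", SharedCapture()));  // 3 3 3
        Console.WriteLine(string.Join(" ", CopiedCapture()));  // 0 1 2
    }
}
```

The same mechanism explains the ThreadPool case: the work items run after the loop has finished, so the shared i is already 10.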
Change your loop as follows:
for (int i = 0; i < list1.Count; i++)
{
    var idx = i; // copy the loop variable; each delegate captures its own idx
    ThreadPool.QueueUserWorkItem(
        new WaitCallback(delegate(object state)
        {
            Sum(list1[idx], list2[idx]);
            if (Interlocked.Decrement(ref toProcess) == 0)
                resetEvent.Set();
        }), null);
}
Now your delegate is getting a "private", independent copy of i instead of referring to a single, changing value that is shared between all invocations of the delegate.
I wouldn't worry too much about the difference in behaviour between debug and non-debug modes. That's the nature of race conditions.
What is the problem here?
Closure. You're capturing the i variable which isn't doing what you expect it to do.
You'll need to create a copy inside your for loop:
var currentIndex = i;
Sum(list1[currentIndex], list2[currentIndex]);

Creating list of lists with Parallel.For - what am I doing wrong?

I'm trying to populate a list of lists within a Parallel.For loop, but when the loop completes the list of lists is empty. What am I doing wrong?
int[] nums = Enumerable.Range(0, 10).ToArray();
IList<IList<double>> bins = new List<IList<double>>();
Parallel.For<IList<IList<double>>>(0, nums.Length, () => new List<IList<double>>(), (i, loop, bin) =>
{
    Random random = new Random();
    IList<double> list = new List<double>();
    for (int j = 0; j < 5; j++)
        list.Add(random.NextDouble() + i);
    bin.Add(list);
    return bin;
},
(bin) =>
{
    lock (bins)
    {
        bins.Concat(bin);
    }
});
This line is wrong:
bins.Concat(bin);
That just concatenates two enumerable sequences and returns the concatenated result (which you are throwing away).
I think it should be:
foreach (var x in bin)
bins.Add(x);
Part of your problem is from using IList<...> bins instead of List<...> bins. There's no benefit from restricting yourself to the interface in this context.
The minimal change would be this:
//IList<IList<double>> bins = new List<IList<double>>();
List<IList<double>> bins = new List<IList<double>>();
...
lock (bins)
{
    bins.AddRange(bin);
}
On a side note, Random random = new Random(); inside the tasks means you will have (at least some) identical sub-sequences.
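A sketch of one way around that: seed each iteration's Random deterministically from the loop index (on .NET 6+ the thread-safe Random.Shared is another option):

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

class RandomPerIterationDemo
{
    public static List<IList<double>> Fill(int iterations)
    {
        var bins = new List<IList<double>>();
        object locker = new object();

        Parallel.For(0, iterations, i =>
        {
            // new Random() is clock-seeded, so iterations starting in the same
            // tick would repeat sub-sequences; seeding from i keeps them distinct.
            var random = new Random(i);
            IList<double> list = new List<double>();
            for (int j = 0; j < 5; j++)
                list.Add(random.NextDouble() + i);
            lock (locker) { bins.Add(list); }
        });

        return bins;
    }

    static void Main() => Console.WriteLine(Fill(10).Count); // 10
}
```

Seeding from the index also makes the output reproducible across runs, which can help when debugging parallel code.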
