I need to iterate through every double in an array to do the "Laplacian Smoothing", "mixing values" with neighbour doubles.
I'll keep the updated values in a temporary clone of the array and update the original at the end.
Pseudo code:
double[] A = new double[1000];
// Filling A with values...
double[] B = (double[])A.Clone();
for(int loops=0;loops<10;loops++){ // start of the loop
for(int i=0;i<1000;i++){ // iterating through all doubles in the array
// Parallel.For(0, 1000, (i) => {
double v= A[i];
B[i]-=v;
B[i+1]+=v/2;
B[i-1]+=v/2;
// Here I'm going out of the array bounds, I know. It's pseudo code, so not relevant.
}
// });
}
A = (double[])B.Clone();
With a plain for loop it works correctly, "smoothing" the values in the array.
With Parallel.For() I have access synchronization problems: threads collide and some values are not stored correctly, because several threads read and modify the array at the same index.
(I haven't tested this on a linear array; I'm actually working on a multidimensional array[x,y,z].)
How can I solve this?
I was thinking of making a separate array for each thread and doing the sum later, but for that I need to know the thread index, and I haven't found how to get it anywhere on the web. (I'm still interested in whether a "thread index" exists, even if the solution ends up being totally different.)
I'll accept any solution.
You probably need one of the more advanced overloads of the Parallel.For method:
public static ParallelLoopResult For<TLocal>(int fromInclusive, int toExclusive,
ParallelOptions parallelOptions, Func<TLocal> localInit,
Func<int, ParallelLoopState, TLocal, TLocal> body,
Action<TLocal> localFinally);
Executes a for loop with thread-local data in which iterations may run in parallel, loop options can be configured, and the state of the loop can be monitored and manipulated.
This looks quite intimidating with all the various lambdas it expects. The idea is to have each thread work with local data, and finally merge the data at the end. Here is how you could use this method to solve your problem:
double[] A = new double[1000];
double[] B = (double[])A.Clone();
object locker = new object();
var parallelOptions = new ParallelOptions()
{
MaxDegreeOfParallelism = Environment.ProcessorCount
};
Parallel.For(0, A.Length, parallelOptions,
localInit: () => new double[A.Length], // create temp array per thread
body: (i, state, temp) =>
{
double v = A[i];
temp[i] -= v;
temp[i + 1] += v / 2;
temp[i - 1] += v / 2;
return temp; // return a reference to the same temp array
}, localFinally: (localB) =>
{
// Can be called in parallel with other threads, so we need to lock
lock (locker)
{
for (int i = 0; i < localB.Length; i++)
{
B[i] += localB[i];
}
}
});
I should mention that the workload of the above example is too granular, so I wouldn't expect large performance improvements from the parallelization. Hopefully your actual workload is chunkier. If, for example, you have two nested loops, parallelizing only the outer loop will work well, because the inner loop provides the much-needed chunkiness.
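As a minimal, runnable sketch of the thread-local approach, the version below starts at index 1 and stops before the last index so that i - 1 and i + 1 stay inside the array. That boundary handling is an assumption (the snippet above leaves it open), and the names SmoothingSketch and SmoothOnce are made up for illustration.

```csharp
using System;
using System.Threading.Tasks;

public static class SmoothingSketch
{
    // One smoothing pass: every interior element gives its value away,
    // half to each neighbour. The two end elements only receive (assumption).
    public static double[] SmoothOnce(double[] a)
    {
        double[] b = (double[])a.Clone();
        object locker = new object();

        Parallel.For<double[]>(1, a.Length - 1,
            () => new double[a.Length],           // one scratch array per worker task
            (i, state, temp) =>
            {
                double v = a[i];
                temp[i] -= v;
                temp[i + 1] += v / 2;
                temp[i - 1] += v / 2;
                return temp;                      // same array, handed to the next iteration
            },
            temp =>
            {
                lock (locker)                     // merge runs once per worker task
                {
                    for (int i = 0; i < temp.Length; i++) b[i] += temp[i];
                }
            });

        return b;
    }

    public static void Main()
    {
        var a = new double[1000];
        a[100] = 10000;
        a[500] = 10000;
        double[] b = SmoothOnce(a);
        Console.WriteLine(b[99] + " " + b[100] + " " + b[101]);
    }
}
```

Only the scratch arrays are written inside the loop body, so the lock is needed just for the merge. The cost is one array of a.Length doubles per worker task.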
Alternative solution: Instead of creating auxiliary arrays per thread, you could update the B array directly, and use locks only when processing an index in the dangerous zone near the boundaries of the partitions:
Parallel.ForEach(Partitioner.Create(0, A.Length), parallelOptions, range =>
{
bool lockTaken = false;
try
{
for (int i = range.Item1; i < range.Item2; i++)
{
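// The first two and the last two iterations of a range write to elements
// that a neighbouring range also writes to, so all four must take the lock.
bool shouldLock = i < range.Item1 + 2 || i >= range.Item2 - 2;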
if (shouldLock) Monitor.Enter(locker, ref lockTaken);
double v = A[i];
B[i] -= v;
B[i + 1] += v / 2;
B[i - 1] += v / 2;
if (shouldLock) { Monitor.Exit(locker); lockTaken = false; }
}
}
finally
{
if (lockTaken) Monitor.Exit(locker);
}
});
OK, it appears that the modulus operator can solve pretty much all my problems.
Here is a heavily simplified version of the working code:
(the big script is 3D and unfinished...)
private void RunScript(bool Go, ref object Results)
{
if(Go){
LaplacianSmooth(100);
// Needed to restart "RunScript" over and over
this.Component.ExpireSolution(true);
}
else{
A = new double[count];
A[100] = 10000;
A[500] = 10000;
}
Results = A;
}
// <Custom additional code>
public static int T = Environment.ProcessorCount;
public static int count = 1000;
public double[] A = new double[count];
public double[,] B = new double[count, T];
public void LaplacianSmooth(int loops){
for(int loop = 0;loop < loops;loop++){
B = new double[count, T];
// Copying values to first column of temp multidimensional-array
Parallel.For(0, count, new ParallelOptions { MaxDegreeOfParallelism = T }, i => {
B[i, 0] = A[i];
});
// Applying Laplacian smoothing
Parallel.For(0, count, new ParallelOptions { MaxDegreeOfParallelism = T }, i => {
int t = i % T; // column index; use T rather than a hard-coded 16 so it always fits B's second dimension
// Wrapped next and previous element indexes
int n = (i + 1) % count;
int p = (i + count - 1) % count;
double v = A[i] * 0.5;
B[i, t] -= v;
B[p, t] += v / 2;
B[n, t] += v / 2;
});
// Copying values back to main array
Parallel.For(0, count, new ParallelOptions { MaxDegreeOfParallelism = T }, i => {
double val = 0;
for(int t = 0;t < T;t++){
val += B[i, t];
}
A[i] = val;
});
}
}
There are no "collisions" with the threads, as confirmed by the result of "Mass Addition" (a sum) that is constant at 20000.
Thanks everyone for the tips!
Related
I tried to refactor a nested sequential for loop into a nested Parallel.For loop.
But even when following the recommended parallel patterns and locks, the overall result was too low compared with the sequential result.
The problem was caused by wrong or inconsistent use of the BigInteger calculation methods.
For BigInteger you need to use the ++ operator or BigInteger methods like BigInteger.Add().
My sources:
How to: Write a Parallel.For Loop with Thread-Local Variables
Threading in C# - Parallel Programming - The Parallel Class - For and ForEach
Please find sample code below:
internal static class Program
{
static Object lockObj = new Object();
static void Main()
{
//target result: 575
NestedLoopAggregationTest();
return;
}
private static void NestedLoopAggregationTest()
{
BigInteger totalSequential = 0;
BigInteger totalRecomandedPattern = 0;
BigInteger totalAntiPattern = 0;
const int iEnd1 = 5;
const int iEnd2 = 10;
const int iEnd3 = 15;
for (int iCn1 = 1; iCn1 <= iEnd1; iCn1++)
{
for (int iCn2 = 1; iCn2 <= iEnd2; iCn2++)
{
for (int iCn3 = iCn2 - 1; iCn3 <= iEnd3; iCn3++)
{
totalSequential++;
}
}
}
Parallel.For(1, iEnd1 + 1, (iCn1) =>
{
Parallel.For(1, iEnd2 + 1, (iCn2) =>
{
Parallel.For<BigInteger>(iCn2 - 1, iEnd3 + 1, () => 0, (iCn3, state, subtotal) =>
{
//Solution:
//for BigInteger use ++-operator or BigInteger.Add()
subtotal = BigInteger.Add(subtotal, 1);
return subtotal;
},
(subtotal) =>
{
lock (lockObj)
{
totalRecomandedPattern = BigInteger.Add(totalRecomandedPattern, subtotal);
}
}
);
});
});
MessageBox.Show(totalSequential.ToString() + Environment.NewLine + totalRecomandedPattern.ToString() +
}
}
Your current parallel implementation requires a lock every time subtotal is modified in the inner loop. This modified approach is faster than both your serial and parallel implementations because it avoids a lock in the innermost loop:
Parallel.For(1, iEnd1 + 1, (iCn1) =>
{
Parallel.For(1, iEnd2 + 1, (iCn2) =>
{
BigInteger subtotal = 0;
for (var iCnt3 = iCn2 - 1; iCnt3 < iEnd3 + 1; iCnt3++)
{
//Solution:
//for BigInteger use ++-operator or BigInteger.Add()
subtotal = BigInteger.Add(subtotal, 1);
}
lock (lockObj)
{
totalRecomandedPatternModified = BigInteger.Add(totalRecomandedPatternModified, subtotal);
}
});
});
I increased each of the endpoints by a factor of 10 so the runtime is long enough to be measured on my hardware, then got the following average times:
Serial: 9ms
Parallel: 11ms
Modified: 2ms
I'm having an issue with the following code. The code works with no errors but I'm receiving different output values when using a parallel for loop vs a regular for loop. I need to get the parallel for loop working properly because I run this code thousands of times. Does anyone know why my parallel for loop is returning different outputs?
private object _lock = new object();
public double CalculatePredictedRSquared()
{
double press = 0, tss = 0, press2 = 0, press1 = 0;
Vector<double> output = CreateVector.Dense(Enumerable.Range(0, 400).Select(i => Convert.ToDouble(i)).ToArray());
List<double> input1 = new List<double>(Enumerable.Range(0, 400).Select(i => Convert.ToDouble(i)));
List<double> input2 = new List<double>(Enumerable.Range(200, 400).Select(i => Convert.ToDouble(i)));
Parallel.For(0, output.Count, i =>
{
ConcurrentBag<MultipleRegressionInfo> listMRInfoBag = new ConcurrentBag<MultipleRegressionInfo>(listMRInfo);
ConcurrentBag<double> vectorArrayBag = new ConcurrentBag<double>(output);
ConcurrentBag<double[]> matrixList = new ConcurrentBag<double[]>();
lock (_lock)
{
matrixList.Add(input1.Where((v, k) => k != i).ToArray());
matrixList.Add(input2.Where((v, k) => k != i).ToArray());
}
var matrixArray2 = CreateMatrix.DenseOfColumnArrays(matrixList);
var actualResult = vectorArrayBag.ElementAt(i);
var newVectorArray = CreateVector.Dense(vectorArrayBag.Where((v, j) => j != i).ToArray());
var items = FindBestMRSolution(matrixArray2, newVectorArray);
double estimate1 = 0;
if (items != null)
{
lock (_lock)
{
var y = 0d;
var independentCount = matrixArray2.RowCount;
var dependentCount = newVectorArray.Count;
if (independentCount == dependentCount)
{
var populationCount = independentCount;
y = newVectorArray.Average();
for (int l = 0; l < matrixArray2.ColumnCount; l++)
{
var avg = matrixArray2.Column(l).Average();
y -= avg * items[l];
}
}
for (int m = 0; m < 2; m++)
{
var coefficient = items[m];
if (m == 0)
{
estimate1 += input1.ElementAt(i) * coefficient;
}
else
{
estimate1 += input2.ElementAt(i) * coefficient;
}
}
estimate1 += y;
}
}
else
{
lock (_lock)
{
estimate1 = 0;
}
}
lock (_lock)
{
press1 += Math.Pow(actualResult - estimate1, 2);
}
});
for (int i = 0; i < output.Count; i++)
{
List<double[]> matrixList = new List<double[]>();
matrixList.Add(input1.Where((v, k) => k != i).ToArray());
matrixList.Add(input2.Where((v, k) => k != i).ToArray());
var matrixArray = CreateMatrix.DenseOfColumnArrays(matrixList);
var actualResult = output.ElementAt(i);
var newVectorArray = CreateVector.Dense(output.Where((v, j) => j != i).ToArray());
var items = FindBestMRSolution(matrixArray, newVectorArray);
double estimate = 0;
if (items != null)
{
var y = CalculateYIntercept(matrixArray, newVectorArray, items);
for (int m = 0; m < 2; m++)
{
var coefficient = items[m];
if (m == 0)
{
estimate += input1.ElementAt(i) * coefficient;
}
else
{
estimate += input2.ElementAt(i) * coefficient;
}
}
}
else
{
estimate = 0;
}
press2 += Math.Pow(actualResult - estimate, 2);
}
tss = CalculateTotalSumOfSquares(vectorArray.ToList());
var test1 = 1 - (press1 / tss);
var test2 = 1 - (press2 / tss);
}
public Vector<double> CalculateWithQR(Matrix<double> x, Vector<double> y)
{
Vector<double> result = null;
result = MultipleRegression.QR(x, y);
for (int i = 0; i < result.Count; i++)
{
var value = result.ElementAt(i);
if (Double.IsNaN(value) || Double.IsInfinity(value))
{
return null;
}
}
return result;
}
public Vector<double> CalculateWithNormal(Matrix<double> x, Vector<double> y)
{
Vector<double> result = null;
result = MultipleRegression.NormalEquations(x, y);
for (int i = 0; i < result.Count; i++)
{
var value = result.ElementAt(i);
if (Double.IsNaN(value) || Double.IsInfinity(value))
{
return null;
}
}
return result;
}
public Vector<double> CalculateWithSVD(Matrix<double> x, Vector<double> y)
{
Vector<double> result = null;
result = MultipleRegression.Svd(x, y);
for (int i = 0; i < result.Count; i++)
{
var value = result.ElementAt(i);
if (Double.IsNaN(value) || Double.IsInfinity(value))
{
return null;
}
}
return result;
}
public Vector<double> FindBestMRSolution(Matrix<double> x, Vector<double> y)
{
Vector<double> result = null;
result = CalculateWithNormal(x, y);
if (result != null)
{
return result;
}
else
{
result = CalculateWithSVD(x, y);
if (result != null)
{
return result;
}
else
{
result = CalculateWithQR(x, y);
if (result != null)
{
return result;
}
}
}
return result;
}
public double CalculateTotalSumOfSquares(List<double> dependentVariables)
{
double tts = 0;
for (int i = 0; i < dependentVariables.Count; i++)
{
tts += Math.Pow(dependentVariables.ElementAt(i) - dependentVariables.Average(), 2);
}
return tts;
}
Actual Output (Updated results):
test1 = 137431.12889999992 (parallel for loop)
test2 = 7.3770258447689254E- (regular for loop)
Epilogue: How to set up MCVE-compliant testing
A fair way to prepare a fully reproducible setup is to put the MCVE code, together with the A/B/C/... data sets, inside a ready-to-run [IDE & Testing Sandbox, hyperlinked here][1], so that Community members can click a re-run button and focus on root-cause analysis, not on decoding and re-engineering heaps of incomplete SLOCs.
If it runs for the O/P, it will also run for the other Community members whom the O/P has asked for an answer or help.
Try it online!
My new version of the code:
public double CalculatePredictedRSquared()
{
Vector<double> output = CreateVector.Dense(Enumerable.Range(0, 400).Select(i => Convert.ToDouble(i)).ToArray());
List<double> input1 = new List<double>(Enumerable.Range(0, 400).Select(i => Convert.ToDouble(i)));
List<double> input2 = new List<double>(Enumerable.Range(200, 400).Select(i => Convert.ToDouble(i)));
double tss = CalculateTotalSumOfSquares(output.ToList());
IEnumerable<int> range = Enumerable.Range(0, output.Count);
var query = range.Select(i => DoIt(i, output, input1, input2));
var result = 1 - (query.Sum() / tss);
return result;
}
public double DoIt(int i, Vector<double> output, List<double> input1, List<double> input2)
{
List<double[]> matrixList = new List<double[]>
{
input1.Where((v, k) => k != i).ToArray(),
input2.Where((v, k) => k != i).ToArray()
};
var matrixArray = CreateMatrix.DenseOfColumnArrays(matrixList);
var actualResult = output.ElementAt(i);
var newVectorArray = CreateVector.Dense(output.Where((v, j) => j != i).ToArray());
var items = FindBestMRSolution(matrixArray, newVectorArray);
double estimate = 0;
if (items != null)
{
var y = CalculateYIntercept(matrixArray, newVectorArray, items);
for (int m = 0; m < 2; m++)
{
var coefficient = items[m];
if (m == 0)
{
estimate += input1.ElementAt(i) * coefficient;
}
else
{
estimate += input2.ElementAt(i) * coefficient;
}
}
}
else
{
estimate = 0;
}
return Math.Pow(actualResult - estimate, 2);
}
This whole thing is a dog's breakfast; you should abandon that attempt at parallelism entirely.
Start over. Here's what I want you to do. I want you to write a method DoIt that returns double and takes an int i and whatever other state is required to do a single iteration of the loop.
You will then rewrite your method as follows:
public double CalculatePredictedRSquared()
{
Vector<double> output = whatever;
// Whatever other state you need here
IEnumerable<int> range = Enumerable.Range(0, output.Count);
var query = range.Select(i => DoIt(i, whatever_other_state));
return query.Sum();
}
Got it? DoIt is the thing that is inside your loop right now. It must take in i, and output and whatever other vectors you need to pass into it. It must only compute a double -- in this case, the square of the estimate error -- and return that double.
It must be pure: It must not read or write any non-local variable, it must not call any non-pure method, and it must give exactly the same results when given the same inputs, every time. Pure methods are the easiest methods to write, to read, to understand, to test and to parallelize; always try to write pure methods when doing math computations.
Write test cases for DoIt, and test the heck out of it. It's a pure method; you should be able to write lots of test cases. Similarly test any of the pure methods called by DoIt.
Once you are satisfied that DoIt is both correct and pure, then the magic happens. Just change it to:
range.AsParallel().Select...
Then compare the parallel and non-parallel versions. They should produce the same result; if not, then something was impure. Figure out what it was.
Then, verify that the parallel version was faster. If not, then you have failed to do enough work in DoIt to justify parallelism; see https://en.wikipedia.org/wiki/Amdahl%27s_law for details.
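To make the recipe concrete, here is a small sketch with a stand-in for DoIt. SquaredError, SumSequential and SumParallel are hypothetical names, and the inputs are plain arrays rather than Math.NET vectors. The point is only that switching to AsParallel() leaves the pure method untouched.

```csharp
using System;
using System.Linq;

public static class PureSumSketch
{
    // Pure: reads only its arguments and writes nothing shared.
    public static double SquaredError(int i, double[] actual, double[] estimate)
    {
        double d = actual[i] - estimate[i];
        return d * d;
    }

    public static double SumSequential(double[] actual, double[] estimate) =>
        Enumerable.Range(0, actual.Length)
                  .Select(i => SquaredError(i, actual, estimate))
                  .Sum();

    // Identical query, with AsParallel() as the only change.
    public static double SumParallel(double[] actual, double[] estimate) =>
        Enumerable.Range(0, actual.Length)
                  .AsParallel()
                  .Select(i => SquaredError(i, actual, estimate))
                  .Sum();

    public static void Main()
    {
        double[] actual = Enumerable.Range(0, 400).Select(i => (double)i).ToArray();
        double[] estimate = actual.Select(x => x + 0.5).ToArray();
        double s = SumSequential(actual, estimate);
        double p = SumParallel(actual, estimate);
        // Floating-point addition is not associative, so compare with a tolerance.
        Console.WriteLine(Math.Abs(s - p) < 1e-9);
    }
}
```

When comparing the two versions on real data, allow a small tolerance: the parallel sum may add the terms in a different order, which can change the last few bits of a double without anything being impure.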
A few things:
lock (_lock)
{
matrixList.Add(input1.Where((v, k) => k != i).ToArray());
matrixList.Add(input2.Where((v, k) => k != i).ToArray());
}
You're adding items to a collection that is already thread-safe by design, so no need to lock. While List is not thread-safe, it should be OK to read from it concurrently. From the documentation:
It is safe to perform multiple read operations on a List, but issues can occur if the collection is modified while it’s being read. To ensure thread safety, lock the collection during a read or write operation. To enable a collection to be accessed by multiple threads for reading and writing, you must implement your own synchronization.
Also note that matrixList is stored in a local variable; in this case, the collection can't be called from multiple threads because the entire body of the delegate is guaranteed to run on the same thread - it won't be the case that half of the body of the Parallel.For loop will be run on thread A and that the other half will be run on thread B, for example.
Similarly, there's no reason to lock while making changes to estimate1 because it can't possibly be modified from other threads.
Disclaimer: there's no guarantee as to the degree of parallelism of the Parallel.For loop overall. There's not even a guarantee that it will run in parallel at all.
press1 and press2, however, are not local variables, so you do need to synchronize these somehow. (It would be better if you found some way to avoid locking every time, though, because that'll at least partially kill the point of multithreading).
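One possible way to avoid taking the lock on every iteration is to let each worker keep a private subtotal and merge it once at the end. The sketch below uses the Parallel.For overload with thread-local state. PressSketch and SumSquaredErrors are hypothetical names, and the regression step is replaced by a precomputed estimate array purely for illustration.

```csharp
using System;
using System.Threading.Tasks;

public static class PressSketch
{
    // Sums squared errors. Each worker accumulates into its own subtotal
    // and takes the lock only once, when it has finished its share.
    public static double SumSquaredErrors(double[] actual, double[] estimate)
    {
        double total = 0;
        object gate = new object();

        Parallel.For<double>(0, actual.Length,
            () => 0.0,                            // private subtotal per worker task
            (i, state, subtotal) =>
            {
                double d = actual[i] - estimate[i];
                return subtotal + d * d;          // no shared state touched here
            },
            subtotal =>
            {
                lock (gate) { total += subtotal; }
            });

        return total;
    }

    public static void Main()
    {
        double[] actual = { 1, 2, 3, 4 };
        double[] estimate = { 1.5, 2.5, 3.5, 4.5 };
        Console.WriteLine(SumSquaredErrors(actual, estimate));
    }
}
```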
Perhaps most critically, ConcurrentBag is an unordered collection. You don't show all of the operations you're doing on your matrices, but if you're doing matrix multiplication anywhere this could easily cause wrong results. There is no guarantee that matrix multiplication will commute. While A * B = B * A for integers, this is not true in general for matrices. It's quite possible that your logic is subtly dependent on operations occurring in a particular order (and they won't because ConcurrentBag is unordered).
How can I make the for loop of this function to use the GPU with OpenCL?
public static double[] Calculate(double[] num, int period)
{
var final = new double[num.Length];
double sum = num[0];
double coeff = 2.0 / (1.0 + period);
for (int i = 0; i < num.Length; i++)
{
sum += coeff * (num[i] - sum);
final[i] = sum;
}
return final;
}
Your problem as written does not fit well with something that would work on a GPU. You cannot parallelize (in a way that improves performance) the operation on a single array because the value of the nth element depends on elements 1 to n. However, you can utilize the GPU to process multiple arrays, where each GPU core operates on a separate array.
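To see the dependency concretely, the loop can be unrolled: each step computes sum = (1 - coeff) * sum + coeff * num[i], so output n is a weighted combination of every input from 0 to n. The sketch below compares the original loop with that unrolled form. Direct and EmaExpansionSketch are hypothetical names used only for this check.

```csharp
using System;

public static class EmaExpansionSketch
{
    // The loop from the question, unchanged.
    public static double[] Calculate(double[] num, int period)
    {
        var final = new double[num.Length];
        double sum = num[0];
        double coeff = 2.0 / (1.0 + period);
        for (int i = 0; i < num.Length; i++)
        {
            sum += coeff * (num[i] - sum);
            final[i] = sum;
        }
        return final;
    }

    // Output n written out directly: it needs every input from 0 to n.
    public static double Direct(double[] num, int period, int n)
    {
        double c = 2.0 / (1.0 + period);
        double d = 1.0 - c;
        double value = Math.Pow(d, n) * num[0];   // the seed carries no coeff factor
        for (int k = 1; k <= n; k++)
            value += c * Math.Pow(d, n - k) * num[k];
        return value;
    }

    public static void Main()
    {
        double[] num = { 0.3, 0.7, 0.1, 0.9, 0.5 };
        double[] loop = Calculate(num, 14);
        for (int n = 0; n < num.Length; n++)
            Console.WriteLine(Math.Abs(loop[n] - Direct(num, 14, n)) < 1e-12);
    }
}
```

Computing one output this way costs O(n) work on its own, which is why a single array gains nothing from being split across cores in the obvious way.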
The full code for the solution is at the end of the answer, but the results of the test, to calculate on 10,000 arrays each of which has 10,000 elements, generates the following (on a GTX1080M and an i7 7700k with 32GB RAM):
Task Generating Data: 1096.4583ms
Task CPU Single Thread: 596.2624ms
Task CPU Parallel: 179.1717ms
GPU CPU->GPU: 89ms
GPU Execute: 86ms
GPU GPU->CPU: 29ms
Task Running GPU: 921.4781ms
Finished
In this test, we measure the speed at which we can generate results into a managed C# array using the CPU with one thread, the CPU with all threads, and finally the GPU using all cores. We validate that the results from each test are identical, using the function AreTheSame.
The fastest time is processing the arrays on the CPU using all threads (Task CPU Parallel: 179ms).
The GPU is actually the slowest (Task Running GPU: 922ms), but this is because of the time taken to reformat the C# arrays in a way that they can be transferred onto the GPU.
If this bottleneck were removed (which is quite possible, depending on your use case), the GPU could potentially be the fastest. If the data were already formatted in a way that can immediately be transferred onto the GPU, the total processing time for the GPU would be 204ms (CPU->GPU: 89ms + Execute: 86ms + GPU->CPU: 29ms = 204ms). This is still slower than the parallel CPU option, but on a different sort of data set, it might be faster.
To get the data back from the GPU (the most important part of actually using the GPU), we use the function ComputeCommandQueue.Read. This transfers the altered array on the GPU back to the CPU.
To run the following code, reference the Cloo NuGet package (I used 0.9.1), and make sure to compile for x64 (you will need the memory). You may also need to update your graphics card driver if it fails to find an OpenCL device.
class Program
{
static string CalculateKernel
{
get
{
return @"
kernel void Calc(global int* offsets, global int* lengths, global double* doubles, double periodFactor)
{
int id = get_global_id(0);
int start = offsets[id];
int length = lengths[id];
int end = start + length;
double sum = doubles[start];
for(int i = start; i < end; i++)
{
sum = sum + periodFactor * ( doubles[i] - sum );
doubles[i] = sum;
}
}";
}
}
public static double[] Calculate(double[] num, int period)
{
var final = new double[num.Length];
double sum = num[0];
double coeff = 2.0 / (1.0 + period);
for (int i = 0; i < num.Length; i++)
{
sum += coeff * (num[i] - sum);
final[i] = sum;
}
return final;
}
static void Main(string[] args)
{
int maxElements = 10000;
int numArrays = 10000;
int computeCores = 2048;
double[][] sets = new double[numArrays][];
using (Timer("Generating Data"))
{
Random elementRand = new Random(1);
for (int i = 0; i < numArrays; i++)
{
sets[i] = GetRandomDoubles(elementRand.Next((int)(maxElements * 0.9), maxElements), randomSeed: i);
}
}
int period = 14;
double[][] singleResults;
using (Timer("CPU Single Thread"))
{
singleResults = CalculateCPU(sets, period);
}
double[][] parallelResults;
using (Timer("CPU Parallel"))
{
parallelResults = CalculateCPUParallel(sets, period);
}
if (!AreTheSame(singleResults, parallelResults)) throw new Exception();
double[][] gpuResults;
using (Timer("Running GPU"))
{
gpuResults = CalculateGPU(computeCores, sets, period);
}
if (!AreTheSame(singleResults, gpuResults)) throw new Exception();
Console.WriteLine("Finished");
Console.ReadKey();
}
public static bool AreTheSame(double[][] a1, double[][] a2)
{
if (a1.Length != a2.Length) return false;
for (int i = 0; i < a1.Length; i++)
{
var ar1 = a1[i];
var ar2 = a2[i];
if (ar1.Length != ar2.Length) return false;
for (int j = 0; j < ar1.Length; j++)
if (Math.Abs(ar1[j] - ar2[j]) > 0.0000001) return false;
}
return true;
}
public static double[][] CalculateGPU(int partitionSize, double[][] sets, int period)
{
ComputeContextPropertyList cpl = new ComputeContextPropertyList(ComputePlatform.Platforms[0]);
ComputeContext context = new ComputeContext(ComputeDeviceTypes.Gpu, cpl, null, IntPtr.Zero);
ComputeProgram program = new ComputeProgram(context, new string[] { CalculateKernel });
program.Build(null, null, null, IntPtr.Zero);
ComputeCommandQueue commands = new ComputeCommandQueue(context, context.Devices[0], ComputeCommandQueueFlags.None);
ComputeEventList events = new ComputeEventList();
ComputeKernel kernel = program.CreateKernel("Calc");
double[][] results = new double[sets.Length][];
double periodFactor = 2d / (1d + period);
Stopwatch sendStopWatch = new Stopwatch();
Stopwatch executeStopWatch = new Stopwatch();
Stopwatch recieveStopWatch = new Stopwatch();
int offset = 0;
while (true)
{
int first = offset;
int last = Math.Min(offset + partitionSize, sets.Length);
int length = last - first;
var merged = Merge(sets, first, length);
sendStopWatch.Start();
ComputeBuffer<int> offsetBuffer = new ComputeBuffer<int>(
context,
ComputeMemoryFlags.ReadWrite | ComputeMemoryFlags.UseHostPointer,
merged.Offsets);
ComputeBuffer<int> lengthsBuffer = new ComputeBuffer<int>(
context,
ComputeMemoryFlags.ReadWrite | ComputeMemoryFlags.UseHostPointer,
merged.Lengths);
ComputeBuffer<double> doublesBuffer = new ComputeBuffer<double>(
context,
ComputeMemoryFlags.ReadWrite | ComputeMemoryFlags.UseHostPointer,
merged.Doubles);
kernel.SetMemoryArgument(0, offsetBuffer);
kernel.SetMemoryArgument(1, lengthsBuffer);
kernel.SetMemoryArgument(2, doublesBuffer);
kernel.SetValueArgument(3, periodFactor);
sendStopWatch.Stop();
executeStopWatch.Start();
commands.Execute(kernel, null, new long[] { merged.Lengths.Length }, null, events);
executeStopWatch.Stop();
using (var pin = Pinned(merged.Doubles))
{
recieveStopWatch.Start();
commands.Read(doublesBuffer, false, 0, merged.Doubles.Length, pin.Address, events);
commands.Finish();
recieveStopWatch.Stop();
}
for (int i = 0; i < merged.Lengths.Length; i++)
{
int len = merged.Lengths[i];
int off = merged.Offsets[i];
var res = new double[len];
Array.Copy(merged.Doubles,off,res,0,len);
results[first + i] = res;
}
offset += partitionSize;
if (offset >= sets.Length) break;
}
Console.WriteLine("GPU CPU->GPU: " + recieveStopWatch.ElapsedMilliseconds + "ms");
Console.WriteLine("GPU Execute: " + executeStopWatch.ElapsedMilliseconds + "ms");
Console.WriteLine("GPU GPU->CPU: " + sendStopWatch.ElapsedMilliseconds + "ms");
return results;
}
public static PinnedHandle Pinned(object obj) => new PinnedHandle(obj);
public class PinnedHandle : IDisposable
{
public IntPtr Address => handle.AddrOfPinnedObject();
private GCHandle handle;
public PinnedHandle(object val)
{
handle = GCHandle.Alloc(val, GCHandleType.Pinned);
}
public void Dispose()
{
handle.Free();
}
}
public class MergedResults
{
public double[] Doubles { get; set; }
public int[] Lengths { get; set; }
public int[] Offsets { get; set; }
}
public static MergedResults Merge(double[][] sets, int offset, int length)
{
List<int> lengths = new List<int>(length);
List<int> offsets = new List<int>(length);
for (int i = 0; i < length; i++)
{
var arr = sets[i + offset];
lengths.Add(arr.Length);
}
var totalLength = lengths.Sum();
double[] doubles = new double[totalLength];
int dataOffset = 0;
for (int i = 0; i < length; i++)
{
var arr = sets[i + offset];
Array.Copy(arr, 0, doubles, dataOffset, arr.Length);
offsets.Add(dataOffset);
dataOffset += arr.Length;
}
return new MergedResults()
{
Doubles = doubles,
Lengths = lengths.ToArray(),
Offsets = offsets.ToArray(),
};
}
public static IDisposable Timer(string name)
{
return new SWTimer(name);
}
public class SWTimer : IDisposable
{
private Stopwatch _sw;
private string _name;
public SWTimer(string name)
{
_name = name;
_sw = Stopwatch.StartNew();
}
public void Dispose()
{
_sw.Stop();
Console.WriteLine("Task " + _name + ": " + _sw.Elapsed.TotalMilliseconds + "ms");
}
}
public static double[][] CalculateCPU(double[][] arrays, int period)
{
double[][] results = new double[arrays.Length][];
for (var index = 0; index < arrays.Length; index++)
{
var arr = arrays[index];
results[index] = Calculate(arr, period);
}
return results;
}
public static double[][] CalculateCPUParallel(double[][] arrays, int period)
{
double[][] results = new double[arrays.Length][];
Parallel.For(0, arrays.Length, i =>
{
var arr = arrays[i];
results[i] = Calculate(arr, period);
});
return results;
}
static double[] GetRandomDoubles(int num, int randomSeed)
{
Random r = new Random(randomSeed);
var res = new double[num];
for (int i = 0; i < num; i++)
res[i] = r.NextDouble() * 0.9 + 0.05;
return res;
}
}
As commenter Cory stated, refer to this link for setup:
How to use your GPU in .NET
Here is how you would use this project:
Add the Nuget Package Cloo
Add reference to OpenCLlib.dll
Download OpenCLLib.zip
Add using OpenCL
static void Main(string[] args)
{
int[] Primes = { 1,2,3,4,5,6,7 };
EasyCL cl = new EasyCL();
cl.Accelerator = AcceleratorDevice.GPU;
cl.LoadKernel(IsPrime);
cl.Invoke("GetIfPrime", 0, Primes.Length, Primes, 1.0);
}
static string IsPrime
{
get
{
return @"
kernel void GetIfPrime(global int* num, int period)
{
int index = get_global_id(0);
int sum = (2.0 / (1.0 + period)) * (num[index] - num[0]);
printf("" %d \n"",sum);
}";
}
}
for (int i = 0; i < num.Length; i++)
{
sum += coeff * (num[i] - sum);
final[i] = sum;
}
means that final[i] = (1 - coeff) * final[i-1] + coeff * num[i], with final[0] = num[0]. So the first element reaches the 2nd output multiplied by (1 - coeff), the 3rd output multiplied by the square of (1 - coeff), the 4th output multiplied by its cube, and so on. Every later element is additionally multiplied by coeff.
Writing c for coeff and d for (1 - coeff), it goes like this:
e0*d + e1*c = f1
e0*d*d + e1*c*d + e2*c = f2
e0*d*d*d + e1*c*d*d + e2*c*d + e3*c = f3
For each element, scan through all elements with a smaller id and compute this:
take the difference of the id values (let's call it k) and multiply the earlier value by the k-th power of (1 - coeff), and also by coeff unless it is element 0. Add all of these up. Lastly, multiply the current num value by coeff and add it to the current cell. The current cell value is final(i).
This is O(N*N) and looks like an all-pairs compute kernel. An example using an open-source C# OpenCL project:
ClNumberCruncher cruncher = new ClNumberCruncher(ClPlatforms.all().gpus(), @"
__kernel void foo(__global double * num, __global double * final, __global int *parameters)
{
int threadId = get_global_id(0);
int period = parameters[0];
double coeff = 2.0 / (1.0 + period);
double decay = 1.0 - coeff;
// element 0 seeds the running value, so it carries no coeff factor
double sumOfElements = pow(decay, (double)threadId) * num[0];
for(int i=1;i<threadId;i++)
{
// weight of an earlier element: coeff times the k-th power of (1 - coeff), k = threadId - i
sumOfElements += coeff * pow(decay, (double)(threadId-i)) * num[i];
}
// for threadId 0 the seed term above is already the whole result
final[threadId] = (threadId == 0) ? sumOfElements : sumOfElements + num[threadId] * coeff;
}
");
cruncher.performanceFeed = true; // getting benchmark feedback on console
double[] numArray = new double[10000];
double[] finalArray = new double[10000];
int[] parameters = new int[10];
int period = 15;
parameters[0] = period;
ClArray<double> numGpuArray = numArray;
numGpuArray.readOnly = true; // gpus read this from host
ClArray<double> finalGpuArray = finalArray; // finalArray will have results
finalGpuArray.writeOnly = true; // gpus write this to host
ClArray<int> parametersGpu = parameters;
parametersGpu.readOnly = true;
// calculate kernels with exact same ordering of parameters
// num(double),final(double),parameters(int)
// finalGpuArray points to __global double * final
numGpuArray.nextParam(finalGpuArray, parametersGpu).compute(cruncher, 1, "foo", 10000, 100);
// first compute always lags because of compiling the kernel so here are repeated computes to get actual performance
numGpuArray.nextParam(finalGpuArray, parametersGpu).compute(cruncher, 1, "foo", 10000, 100);
numGpuArray.nextParam(finalGpuArray, parametersGpu).compute(cruncher, 1, "foo", 10000, 100);
Results are on finalArray array for 10000 elements, using 100 workitems per workitem-group.
The GPGPU part takes 82ms on an RX550 GPU, which has a very low ratio of 64-bit to 32-bit compute performance (the newer consumer gaming cards are not good at double precision). An Nvidia Tesla or an AMD Vega would easily compute this kernel without crippled performance. An FX8150 (8 cores) completes in 683ms. If you need to specifically select only an integrated GPU and its CPU, you can use
ClPlatforms.all().gpus().devicesWithHostMemorySharing() + ClPlatforms.all().cpus() when creating ClNumberCruncher instance.
binaries of api:
https://www.codeproject.com/Articles/1181213/Easy-OpenCL-Multiple-Device-Load-Balancing-and-Pip
or source code to compile on your pc:
https://github.com/tugrul512bit/Cekirdekler
If you have multiple GPUs, it uses them without any extra code. Including a CPU in the computations would pull GPU effectiveness down in this sample for the first iteration (repetitions complete in 76ms with CPU+GPU), so it's better to use 2-3 GPUs instead of CPU+GPU.
I didn't check numerical stability (you should use Kahan summation when adding millions of values or more into the same variable, but I left it out for readability, and I don't know whether 64-bit values need it the way 32-bit ones do) or the correctness of the values; you should do that. Also, the foo kernel is not optimized. It leaves cores idle about 50% of the time, so it should be better scheduled like this:
thread-0: compute element 0 and element N-1
thread-1: compute element 1 and element N-2
thread-m: compute element N/2-1 and element N/2
That way all work-items get a similar amount of work. On top of this, a work-group size of 100 is not optimal. It should be something like 128, 256, 512 or 1024 (for Nvidia), but this means the array size should also be an integer multiple of that value. The kernel would then need extra control logic to stay within the array bounds. For even more performance, the for loop could keep multiple partial sums as a form of loop unrolling.
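The pairing idea above can be sketched on the CPU side in C#. This is only an illustration under assumptions: the real foo kernel is not reproduced here, so Work is a hypothetical stand-in whose cost grows with the element index, and BalancedPairs and Compute are made-up names.

```csharp
using System;
using System.Threading.Tasks;

static class BalancedPairs
{
    // Hypothetical stand-in for the kernel body: element i costs (i + 1) loop steps,
    // so late elements are much more expensive than early ones.
    static double Work(int i)
    {
        double sum = 0;
        for (int k = 0; k <= i; k++) sum += k;
        return sum;
    }

    // Each parallel iteration t handles element t and element n-1-t,
    // so every iteration performs roughly the same total number of steps.
    public static double[] Compute(int n)
    {
        var result = new double[n];
        int pairs = (n + 1) / 2;
        Parallel.For(0, pairs, t =>
        {
            int lo = t;
            int hi = n - 1 - t;
            result[lo] = Work(lo);
            if (hi != lo) result[hi] = Work(hi); // middle element of an odd-sized array is computed once
        });
        return result;
    }

    static void Main()
    {
        double[] r = Compute(10);
        Console.WriteLine(string.Join(", ", r));
    }
}
```

Every iteration writes only to its own two indices, so no locking is needed. The same index mapping would carry over to an OpenCL kernel, with t taken from the global work-item id.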
I am struggling to use Parallel.For instead of the for loop in the code below.
Since the CoefficientVector vector array is rather big, it makes sense to me to reset the values of the array elements rather than create a new array for each iteration.
I am trying to replace the outer loop with Parallel.For. Assuming each partition of the Parallel.For is run by a separate thread, it makes sense (?) to me to have one instance of the CoefficientVector object per thread and to reset the vector elements rather than recreate the array. I find it hard, though, to do this optimisation (?) with Parallel.For. Could anyone help, please?
using System;
class Program
{
static void Main(string[] args)
{
System.Diagnostics.Stopwatch timer = new System.Diagnostics.Stopwatch();
timer.Start();
int numIterations = 20000;
int numCalpoints = 5000;
int vecSize = 10000;
CalcPoint[] calcpoints = new CalcPoint[numCalpoints];
CoefficientVector coeff = new CoefficientVector();
coeff.vectors = new Vector[vecSize];
//not sure how to correctly use Parallel.For here
//Parallel.For(0, numCalpoints, =>){
for (int i = 0; i < numCalpoints;i++)
{
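// CalcPoint is a struct, so write through the array element instead of a local copy
//coeff.vectors = new Vector[vecSize];
coeff.ResetVectors();
//doing some operation on the matrix n times
// stay inside the vector array: numIterations (20000) is larger than vecSize (10000)
for (int n = 0; n < numIterations && n < coeff.vectors.Length; n++)
{
coeff.vectors[n].x += n;
coeff.vectors[n].y += n;
coeff.vectors[n].z += n;
}
calcpoints[i].result = coeff.GetResults();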
}
Console.Write(timer.Elapsed);
Console.Read();
}
}
class CoefficientVector
{
public Vector[] vectors;
public void ResetVectors()
{
for (int i = 0; i < vectors.Length; i++)
{
vectors[i].x = vectors[i].y = vectors[i].z = 0;
}
}
public double GetResults()
{
double result = 0;
for (int i = 0; i < vectors.Length; i++)
{
result += vectors[i].x * vectors[i].y * vectors[i].z;
}
return result;
}
}
struct Vector
{
public double x;
public double y;
public double z;
}
struct CalcPoint
{
public double result;
}
The Parallel.For method currently has 12 overloads. Besides the variations with int, long, ParallelOptions and ParallelLoopState arguments, you will notice several that take an additional generic argument TLocal, like this:
public static ParallelLoopResult For<TLocal>(
int fromInclusive,
int toExclusive,
Func<TLocal> localInit,
Func<int, ParallelLoopState, TLocal, TLocal> body,
Action<TLocal> localFinally
)
Executes a for loop with thread-local data in which iterations may run in parallel, and the state of the loop can be monitored and manipulated.
In other words, TLocal allows you to allocate, use and release some thread-local state, i.e. exactly what you need (TLocal will be your CoefficientVector instance per thread).
So you can remove the coeff local variable and use the aforementioned overload like this:
CalcPoint[] calcpoints = new CalcPoint[numCalpoints];
Parallel.For(0, numCalpoints,
() => new CoefficientVector { vectors = new Vector[vecSize] }, // localInit
(i, loopState, coeff) => // body
{
coeff.ResetVectors();
//doing some operation on the matrix
for (int n = 0; n < coeff.vectors.Length; n++)
{
coeff.vectors[n].x += n;
coeff.vectors[n].y += n;
coeff.vectors[n].z += n;
}
calcpoints[i].result = coeff.GetResults();
return coeff; // required by the body Func signature
},
coeff => { } // required by the overload, do nothing in this case
);
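In the example above localFinally does nothing, because each iteration writes its own calcpoints[i]. When the per-thread state has to be combined into one shared value, localFinally is the place to merge it. Here is a minimal sketch of that pattern; ThreadLocalSum and SumOfSquares are made-up names used only for illustration:

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

static class ThreadLocalSum
{
    public static long SumOfSquares(int count)
    {
        long total = 0;
        Parallel.For(0, count,
            () => 0L,                                      // localInit: each task starts with its own partial sum
            (i, loopState, local) => local + (long)i * i,  // body: touches only the local partial sum, no locking
            local => Interlocked.Add(ref total, local)     // localFinally: merge each partial sum exactly once
        );
        return total;
    }

    static void Main()
    {
        Console.WriteLine(SumOfSquares(1000)); // prints 332833500
    }
}
```

Note that localInit and localFinally run once per task the loop creates, which is not necessarily once per physical thread, so the only shared write is the single Interlocked.Add per task.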
In C#, there's System.Threading.Tasks.Parallel.For(...), which does the same as a for loop, in no particular order, but on multiple threads.
The thing is, it only works with long and int, and I want to work with ulong. Okay, I can typecast, but I have some trouble with the boundaries.
Let's say I want a loop from long.MaxValue-10 to long.MaxValue+10 (remember, I'm talking about ulong). How do I do that?
An example:
for (long i = long.MaxValue - 10; i < long.MaxValue; ++i)
{
Console.WriteLine(i);
}
//does the same as
System.Threading.Tasks.Parallel.For(long.MaxValue - 10, long.MaxValue, delegate(long i)
{
Console.WriteLine(i);
});
// except for the order, but there's no equivalent for
long max = long.MaxValue;
for (ulong i = (ulong)max - 10; i < (ulong)max + 10; ++i)
{
Console.WriteLine(i);
}
You can always write to Microsoft and ask them to add Parallel.For(ulong, ulong, Action<ulong>) to the next version of the .NET Framework. Until that comes out, you'll have to resort to something like this:
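// the cast of a negative x wraps around, so keep the arithmetic explicitly unchecked
Parallel.For(-10L, 10L, x => { var index = unchecked((ulong)long.MaxValue + (ulong)x); });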
Or you can create a custom range for Parallel.ForEach:
public static IEnumerable<ulong> Range(ulong fromInclusive, ulong toExclusive)
{
for (var i = fromInclusive; i < toExclusive; i++) yield return i;
}
public static void ParallelFor(ulong fromInclusive, ulong toExclusive, Action<ulong> body)
{
Parallel.ForEach(
Range(fromInclusive, toExclusive),
new ParallelOptions { MaxDegreeOfParallelism = 4 },
body);
}
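This works for every long value from long.MinValue (inclusive) to long.MaxValue (exclusive). It relies on the default unchecked arithmetic and shifts each value by 2^63, so that long.MinValue maps to 0: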
Parallel.For(long.MinValue, long.MaxValue, x =>
{
ulong u = (ulong)(x + (-(long.MinValue + 1))) + 1;
Console.WriteLine(u);
});
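Another way to look at it is to loop over the offset from the start of the range rather than over the ulong value itself. The following is only a sketch: ULongParallel.For is a hypothetical helper, not part of the .NET API, and it assumes the range contains at most long.MaxValue iterations.

```csharp
using System;
using System.Threading.Tasks;

static class ULongParallel
{
    // Hypothetical helper: runs body for every value in [fromInclusive, toExclusive).
    public static void For(ulong fromInclusive, ulong toExclusive, Action<ulong> body)
    {
        if (toExclusive <= fromInclusive) return; // empty range, nothing to do

        ulong length = toExclusive - fromInclusive;
        if (length > (ulong)long.MaxValue)
            throw new ArgumentOutOfRangeException(nameof(toExclusive),
                "The range must not contain more than long.MaxValue iterations.");

        // Iterate over the non-negative offset, which always fits in a long,
        // and rebuild the ulong index inside the body.
        Parallel.For(0L, (long)length, offset => body(fromInclusive + (ulong)offset));
    }

    static void Main()
    {
        ulong start = (ulong)long.MaxValue - 10;
        ulong end = (ulong)long.MaxValue + 10;
        For(start, end, i => Console.WriteLine(i)); // 20 values, in no particular order
    }
}
```

Because the offset is never negative, no wrap-around casts are involved, and the built-in range partitioning of Parallel.For is kept.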