C# OpenCL GPU implementation for double array math

How can I make the for loop of this function run on the GPU with OpenCL?
public static double[] Calculate(double[] num, int period)
{
    var final = new double[num.Length];
    double sum = num[0];
    double coeff = 2.0 / (1.0 + period);
    for (int i = 0; i < num.Length; i++)
    {
        sum += coeff * (num[i] - sum);
        final[i] = sum;
    }
    return final;
}

Your problem as written is not a good fit for a GPU. You cannot parallelize the operation on a single array in a way that improves performance, because the value of the nth element depends on elements 1 to n. However, you can use the GPU to process multiple arrays, where each GPU core operates on a separate array.
The full code for the solution is at the end of the answer. The test runs the calculation on 10,000 arrays of roughly 9,000 to 10,000 elements each and produces the following output (on a GTX1080M and an i7 7700k with 32GB RAM):
Task Generating Data: 1096.4583ms
Task CPU Single Thread: 596.2624ms
Task CPU Parallel: 179.1717ms
GPU CPU->GPU: 89ms
GPU Execute: 86ms
GPU GPU->CPU: 29ms
Task Running GPU: 921.4781ms
Finished
In this test, we measure the speed at which we can generate results into a managed C# array using the CPU with one thread, the CPU with all threads, and finally the GPU using all cores. We validate that the results from each test are identical, using the function AreTheSame.
The fastest time is processing the arrays on the CPU using all threads (Task CPU Parallel: 179ms).
The GPU is actually the slowest (Task Running GPU: 922ms), but this is because of the time taken to reformat the C# arrays in a way that they can be transferred onto the GPU.
If this bottleneck were removed (which is quite possible, depending on your use case), the GPU could potentially be the fastest. If the data were already formatted so that it could be transferred to the GPU immediately, the total processing time for the GPU would be 204ms (CPU->GPU: 89ms + Execute: 86ms + GPU->CPU: 29ms = 204ms). This is still slower than the parallel CPU option, but on a different sort of data set, it might be faster.
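The reformatting step discussed above is the packing of a jagged double[][] into one contiguous buffer, with an offset and a length recorded per array so that each work item can find its own slice. The following is a minimal sketch of that layout only; FlattenDemo and Flatten are illustrative names introduced here, not part of Cloo.

```csharp
using System;

static class FlattenDemo
{
    // Packs a jagged array into one contiguous buffer and records where each row starts.
    public static (double[] Data, int[] Offsets, int[] Lengths) Flatten(double[][] sets)
    {
        var offsets = new int[sets.Length];
        var lengths = new int[sets.Length];
        int total = 0;
        for (int i = 0; i < sets.Length; i++)
        {
            offsets[i] = total;          // row i begins where the previous rows end
            lengths[i] = sets[i].Length;
            total += sets[i].Length;
        }

        var data = new double[total];
        for (int i = 0; i < sets.Length; i++)
            Array.Copy(sets[i], 0, data, offsets[i], lengths[i]);

        return (data, offsets, lengths);
    }

    static void Main()
    {
        var sets = new[] { new[] { 1.0, 2.0 }, new[] { 3.0 }, new[] { 4.0, 5.0, 6.0 } };
        var flat = Flatten(sets);
        Console.WriteLine(string.Join(",", flat.Offsets)); // 0,2,3
        Console.WriteLine(string.Join(",", flat.Lengths)); // 2,1,3
    }
}
```

If the data can be produced in this flat form from the start, the copy that dominates the "Running GPU" time is not needed.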
To get the data back from the GPU (the most important part of actually using the GPU), we use the function ComputeCommandQueue.Read. This transfers the altered array on the GPU back to the CPU.
To run the following code, reference the Cloo NuGet package (I used 0.9.1) and make sure to compile for x64 (you will need the memory). You may also need to update your graphics card driver if it fails to find an OpenCL device.
class Program
{
static string CalculateKernel
{
get
{
return @"
kernel void Calc(global int* offsets, global int* lengths, global double* doubles, double periodFactor)
{
int id = get_global_id(0);
int start = offsets[id];
int length = lengths[id];
int end = start + length;
double sum = doubles[start];
for(int i = start; i < end; i++)
{
sum = sum + periodFactor * ( doubles[i] - sum );
doubles[i] = sum;
}
}";
}
}
public static double[] Calculate(double[] num, int period)
{
var final = new double[num.Length];
double sum = num[0];
double coeff = 2.0 / (1.0 + period);
for (int i = 0; i < num.Length; i++)
{
sum += coeff * (num[i] - sum);
final[i] = sum;
}
return final;
}
static void Main(string[] args)
{
int maxElements = 10000;
int numArrays = 10000;
int computeCores = 2048;
double[][] sets = new double[numArrays][];
using (Timer("Generating Data"))
{
Random elementRand = new Random(1);
for (int i = 0; i < numArrays; i++)
{
sets[i] = GetRandomDoubles(elementRand.Next((int)(maxElements * 0.9), maxElements), randomSeed: i);
}
}
int period = 14;
double[][] singleResults;
using (Timer("CPU Single Thread"))
{
singleResults = CalculateCPU(sets, period);
}
double[][] parallelResults;
using (Timer("CPU Parallel"))
{
parallelResults = CalculateCPUParallel(sets, period);
}
if (!AreTheSame(singleResults, parallelResults)) throw new Exception();
double[][] gpuResults;
using (Timer("Running GPU"))
{
gpuResults = CalculateGPU(computeCores, sets, period);
}
if (!AreTheSame(singleResults, gpuResults)) throw new Exception();
Console.WriteLine("Finished");
Console.ReadKey();
}
public static bool AreTheSame(double[][] a1, double[][] a2)
{
if (a1.Length != a2.Length) return false;
for (int i = 0; i < a1.Length; i++)
{
var ar1 = a1[i];
var ar2 = a2[i];
if (ar1.Length != ar2.Length) return false;
for (int j = 0; j < ar1.Length; j++)
if (Math.Abs(ar1[j] - ar2[j]) > 0.0000001) return false;
}
return true;
}
public static double[][] CalculateGPU(int partitionSize, double[][] sets, int period)
{
ComputeContextPropertyList cpl = new ComputeContextPropertyList(ComputePlatform.Platforms[0]);
ComputeContext context = new ComputeContext(ComputeDeviceTypes.Gpu, cpl, null, IntPtr.Zero);
ComputeProgram program = new ComputeProgram(context, new string[] { CalculateKernel });
program.Build(null, null, null, IntPtr.Zero);
ComputeCommandQueue commands = new ComputeCommandQueue(context, context.Devices[0], ComputeCommandQueueFlags.None);
ComputeEventList events = new ComputeEventList();
ComputeKernel kernel = program.CreateKernel("Calc");
double[][] results = new double[sets.Length][];
double periodFactor = 2d / (1d + period);
Stopwatch sendStopWatch = new Stopwatch();
Stopwatch executeStopWatch = new Stopwatch();
Stopwatch recieveStopWatch = new Stopwatch();
int offset = 0;
while (true)
{
int first = offset;
int last = Math.Min(offset + partitionSize, sets.Length);
int length = last - first;
var merged = Merge(sets, first, length);
sendStopWatch.Start();
ComputeBuffer<int> offsetBuffer = new ComputeBuffer<int>(
context,
ComputeMemoryFlags.ReadWrite | ComputeMemoryFlags.UseHostPointer,
merged.Offsets);
ComputeBuffer<int> lengthsBuffer = new ComputeBuffer<int>(
context,
ComputeMemoryFlags.ReadWrite | ComputeMemoryFlags.UseHostPointer,
merged.Lengths);
ComputeBuffer<double> doublesBuffer = new ComputeBuffer<double>(
context,
ComputeMemoryFlags.ReadWrite | ComputeMemoryFlags.UseHostPointer,
merged.Doubles);
kernel.SetMemoryArgument(0, offsetBuffer);
kernel.SetMemoryArgument(1, lengthsBuffer);
kernel.SetMemoryArgument(2, doublesBuffer);
kernel.SetValueArgument(3, periodFactor);
sendStopWatch.Stop();
executeStopWatch.Start();
commands.Execute(kernel, null, new long[] { merged.Lengths.Length }, null, events);
executeStopWatch.Stop();
using (var pin = Pinned(merged.Doubles))
{
recieveStopWatch.Start();
commands.Read(doublesBuffer, false, 0, merged.Doubles.Length, pin.Address, events);
commands.Finish();
recieveStopWatch.Stop();
}
for (int i = 0; i < merged.Lengths.Length; i++)
{
int len = merged.Lengths[i];
int off = merged.Offsets[i];
var res = new double[len];
Array.Copy(merged.Doubles,off,res,0,len);
results[first + i] = res;
}
offset += partitionSize;
if (offset >= sets.Length) break;
}
Console.WriteLine("GPU CPU->GPU: " + sendStopWatch.ElapsedMilliseconds + "ms");
Console.WriteLine("GPU Execute: " + executeStopWatch.ElapsedMilliseconds + "ms");
Console.WriteLine("GPU GPU->CPU: " + recieveStopWatch.ElapsedMilliseconds + "ms");
return results;
}
public static PinnedHandle Pinned(object obj) => new PinnedHandle(obj);
public class PinnedHandle : IDisposable
{
public IntPtr Address => handle.AddrOfPinnedObject();
private GCHandle handle;
public PinnedHandle(object val)
{
handle = GCHandle.Alloc(val, GCHandleType.Pinned);
}
public void Dispose()
{
handle.Free();
}
}
public class MergedResults
{
public double[] Doubles { get; set; }
public int[] Lengths { get; set; }
public int[] Offsets { get; set; }
}
public static MergedResults Merge(double[][] sets, int offset, int length)
{
List<int> lengths = new List<int>(length);
List<int> offsets = new List<int>(length);
for (int i = 0; i < length; i++)
{
var arr = sets[i + offset];
lengths.Add(arr.Length);
}
var totalLength = lengths.Sum();
double[] doubles = new double[totalLength];
int dataOffset = 0;
for (int i = 0; i < length; i++)
{
var arr = sets[i + offset];
Array.Copy(arr, 0, doubles, dataOffset, arr.Length);
offsets.Add(dataOffset);
dataOffset += arr.Length;
}
return new MergedResults()
{
Doubles = doubles,
Lengths = lengths.ToArray(),
Offsets = offsets.ToArray(),
};
}
public static IDisposable Timer(string name)
{
return new SWTimer(name);
}
public class SWTimer : IDisposable
{
private Stopwatch _sw;
private string _name;
public SWTimer(string name)
{
_name = name;
_sw = Stopwatch.StartNew();
}
public void Dispose()
{
_sw.Stop();
Console.WriteLine("Task " + _name + ": " + _sw.Elapsed.TotalMilliseconds + "ms");
}
}
public static double[][] CalculateCPU(double[][] arrays, int period)
{
double[][] results = new double[arrays.Length][];
for (var index = 0; index < arrays.Length; index++)
{
var arr = arrays[index];
results[index] = Calculate(arr, period);
}
return results;
}
public static double[][] CalculateCPUParallel(double[][] arrays, int period)
{
double[][] results = new double[arrays.Length][];
Parallel.For(0, arrays.Length, i =>
{
var arr = arrays[i];
results[i] = Calculate(arr, period);
});
return results;
}
static double[] GetRandomDoubles(int num, int randomSeed)
{
Random r = new Random(randomSeed);
var res = new double[num];
for (int i = 0; i < num; i++)
res[i] = r.NextDouble() * 0.9 + 0.05;
return res;
}
}

As commenter Cory stated, refer to this link for setup:
How to use your GPU in .NET
Here is how you would use this project:
Add the NuGet package Cloo
Download OpenCLLib.zip
Add a reference to OpenCLlib.dll
Add using OpenCL
static void Main(string[] args)
{
int[] Primes = { 1,2,3,4,5,6,7 };
EasyCL cl = new EasyCL();
cl.Accelerator = AcceleratorDevice.GPU;
cl.LoadKernel(IsPrime);
cl.Invoke("GetIfPrime", 0, Primes.Length, Primes, 1);
}
static string IsPrime
{
get
{
return @"
kernel void GetIfPrime(global int* num, int period)
{
int index = get_global_id(0);
int sum = (2.0 / (1.0 + period)) * (num[index] - num[0]);
printf("" %d \n"",sum);
}";
}
}

for (int i = 0; i < num.Length; i++)
{
sum += coeff * (num[i] - sum);
final[i] = sum;
}
means that each step computes sum = (1 - coeff) * sum + coeff * num[i], so every output is a weighted sum of all earlier inputs. An earlier element is multiplied by coeff once and by (1 - coeff) once for every step that separates it from the current element. The first element is the exception: it also seeds sum, so it is weighted by the powers of (1 - coeff) only.
With c = coeff and d = 1 - c, it goes like this:
e0 = f0
e0*d + e1*c = f1
e0*d*d + e1*c*d + e2*c = f2
e0*d*d*d + e1*c*d*d + e2*c*d + e3*c = f3
For each element, scan through all elements with a smaller id and compute this:
take the difference of the id values (call it k), multiply the earlier element by c and by the k-th power of d, and add it to the current cell (element 0 is multiplied by the k-th power of d only). Lastly, multiply the current num value by the coefficient and add it to the current cell. The current cell value is final(i).
This is O(N*N) and looks like an all-pairs compute kernel. An example using an open-source C# OpenCL project:
ClNumberCruncher cruncher = new ClNumberCruncher(ClPlatforms.all().gpus(), @"
__kernel void foo(__global double * num, __global double * final, __global int *parameters)
{
    int threadId = get_global_id(0);
    int period = parameters[0];
    double coeff = 2.0 / (1.0 + period);
    double decay = 1.0 - coeff;
    // element 0 seeds the running sum, so it is weighted by decay^threadId only
    double sumOfElements = pow(decay, (double)threadId) * num[0];
    for(int i=1;i<=threadId;i++)
    {
        // an element k steps back is weighted by coeff * decay^k
        double powKofDecay = pow(decay, (double)(threadId-i));
        sumOfElements += coeff * powKofDecay * num[i];
    }
    final[threadId] = sumOfElements;
}
");
cruncher.performanceFeed = true; // getting benchmark feedback on console
double[] numArray = new double[10000];
double[] finalArray = new double[10000];
int[] parameters = new int[10];
int period = 15;
parameters[0] = period;
ClArray<double> numGpuArray = numArray;
numGpuArray.readOnly = true; // gpus read this from host
ClArray<double> finalGpuArray = finalArray; // finalArray will have results
finalGpuArray.writeOnly = true; // gpus write this to host
ClArray<int> parametersGpu = parameters;
parametersGpu.readOnly = true;
// calculate kernels with exact same ordering of parameters
// num(double),final(double),parameters(int)
// finalGpuArray points to __global double * final
numGpuArray.nextParam(finalGpuArray, parametersGpu).compute(cruncher, 1, "foo", 10000, 100);
// first compute always lags because of compiling the kernel so here are repeated computes to get actual performance
numGpuArray.nextParam(finalGpuArray, parametersGpu).compute(cruncher, 1, "foo", 10000, 100);
numGpuArray.nextParam(finalGpuArray, parametersGpu).compute(cruncher, 1, "foo", 10000, 100);
The results end up in finalArray for all 10000 elements, using 100 workitems per workgroup.
The GPGPU part takes 82ms on an RX550 GPU, which has a very low ratio of 64-bit to 32-bit compute performance (recent consumer gaming cards are not good at double precision). An Nvidia Tesla or an AMD Vega would easily compute this kernel without crippled performance. An FX8150 (8 cores) completes it in 683ms. If you need to specifically select only an integrated GPU and its CPU, you can use
ClPlatforms.all().gpus().devicesWithHostMemorySharing() + ClPlatforms.all().cpus() when creating the ClNumberCruncher instance.
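Because an all-pairs kernel replaces the running sum with an explicit weighted sum, it is worth confirming on the host that both forms agree before trusting any timings. The sketch below compares the sequential loop from the question with the all-pairs form, in which element 0 is weighted by (1 - coeff)^n and every later element k by coeff * (1 - coeff)^(n - k). EmaCheck and its method names are assumptions made for this illustration.

```csharp
using System;

static class EmaCheck
{
    // Sequential reference: the loop from the question.
    public static double[] Sequential(double[] num, int period)
    {
        var result = new double[num.Length];
        double sum = num[0];
        double coeff = 2.0 / (1.0 + period);
        for (int i = 0; i < num.Length; i++)
        {
            sum += coeff * (num[i] - sum);
            result[i] = sum;
        }
        return result;
    }

    // All-pairs form: every output is computed independently, as one work item would do it.
    public static double[] AllPairs(double[] num, int period)
    {
        var result = new double[num.Length];
        double coeff = 2.0 / (1.0 + period);
        double decay = 1.0 - coeff;
        for (int n = 0; n < num.Length; n++)
        {
            double acc = Math.Pow(decay, n) * num[0]; // seed term, no coeff factor
            for (int k = 1; k <= n; k++)
                acc += coeff * Math.Pow(decay, n - k) * num[k];
            result[n] = acc;
        }
        return result;
    }

    public static double MaxAbsDifference(double[] a, double[] b)
    {
        double max = 0.0;
        for (int i = 0; i < a.Length; i++)
            max = Math.Max(max, Math.Abs(a[i] - b[i]));
        return max;
    }

    static void Main()
    {
        var rng = new Random(1);
        var data = new double[200];
        for (int i = 0; i < data.Length; i++)
            data[i] = rng.NextDouble();

        double diff = MaxAbsDifference(Sequential(data, 15), AllPairs(data, 15));
        Console.WriteLine("max abs difference: " + diff);
    }
}
```

The two results should differ only by floating-point rounding; a large difference points to a wrong weight in the expansion.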
Binaries of the API:
https://www.codeproject.com/Articles/1181213/Easy-OpenCL-Multiple-Device-Load-Balancing-and-Pip
or source code to compile on your PC:
https://github.com/tugrul512bit/Cekirdekler
If you have multiple GPUs, it uses them without any extra code. Including a CPU in the computation pulls GPU effectiveness down in this sample for the first iteration (repetitions complete in 76ms with CPU+GPU), so it is better to use 2-3 GPUs instead of CPU+GPU.
I didn't check numerical stability (you should use Kahan summation when adding millions or more values into the same variable, but I left it out for readability, and I don't know whether 64-bit values need it as much as 32-bit ones do) or any value correctness; you should do that. Also, the foo kernel is not optimized. It leaves cores idle about 50% of the time, so it would be better scheduled like this:
thread-0: compute element 0 and element N-1
thread-1: compute element 1 and element N-2
thread-m: compute element N/2-1 and element N/2
so that all workitems get a similar amount of work. On top of this, using 100 for the workgroup size is not optimal. It should be something like 128, 256, 512 or 1024 (for Nvidia), but this means the array size should also be an integer multiple of it. Otherwise the kernel would need extra control logic to avoid going past the array bounds. For even more performance, the for loop could keep multiple partial sums as a form of loop unrolling.
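The pairing above balances the load because the inner loop for output element j runs j times, so pairing element m with element N-1-m always adds up to N-1 iterations. A small sketch of that arithmetic, assuming N is even; BalancedSchedule and its methods are hypothetical helpers, not part of the library used above.

```csharp
using System;

static class BalancedSchedule
{
    // Loop iterations the all-pairs kernel spends on output element j
    // (it visits every element with a smaller id).
    public static int IterationsForElement(int j) => j;

    // Work item m takes one cheap element (m) and one expensive element (n - 1 - m).
    public static int IterationsForWorkItem(int m, int n) =>
        IterationsForElement(m) + IterationsForElement(n - 1 - m);

    static void Main()
    {
        int n = 10000; // assumed even, so n / 2 work items cover every element exactly once
        Console.WriteLine(IterationsForWorkItem(0, n));         // 9999
        Console.WriteLine(IterationsForWorkItem(n / 2 - 1, n)); // 9999
    }
}
```

With this mapping the kernel would be launched with N/2 work items instead of N, each writing two outputs.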

Related

Trying to find large prime numbers with Alea GPU

An exception occurs when I try to find the 100,000th prime number using Alea GPU. The algorithm works fine if I try to find a smaller prime number e.g. the 10,000th prime number.
I am using Alea v3.0.4, NVIDIA GTX 970, Cuda 9.2 drivers.
I am new to GPU programming. Any help would be greatly appreciated.
long[] primeNumber = new long[1]; // nth prime number to find
int n = 100000; // find the 100,000th prime number
var worker = Gpu.Default; // GTX 970 CUDA v9.2 drivers
long count = 0;
worker.LongFor(count, n, x =>
{
long a = 2;
while (count < n)
{
long b = 2;
long prime = 1;
while (b * b <= a)
{
if (a % b == 0)
{
prime = 0;
break;
}
b++;
}
if (prime > 0)
{
count++;
}
a++;
}
primeNumber[0] = (a - 1);
}
);
Here are the exception details:
System.Exception occurred HResult=0x80131500 Message=[CUDAError]
CUDA_ERROR_LAUNCH_FAILED Source=Alea StackTrace: at
Alea.CUDAInterop.cuSafeCall#2939.Invoke(String message) at
Alea.CUDAInterop.cuSafeCall(cudaError_enum result) at
A.cf5aded17df9f7cc4c132234dda010fa7.Copy#918-22.Invoke(Unit _arg9)
at Alea.Memory.Copy(FSharpOption1 streamOpt, Memory src, IntPtr
srcOffset, Memory dst, IntPtr dstOffset, FSharpOption1 lengthOpt)
at
Alea.ImplicitMemoryTrackerEntry.cdd2cd00c052408bcdbf03958f14266ca(FSharpFunc2
c600c458623dca7db199a0e417603dff4, Object
cd5116337150ebaa6de788dacd82516fa) at
Alea.ImplicitMemoryTrackerEntry.c6a75c171c9cccafb084beba315394985(FSharpFunc2
c600c458623dca7db199a0e417603dff4, Object
cd5116337150ebaa6de788dacd82516fa) at
Alea.ImplicitMemoryTracker.HostReadWriteBarrier(Object instance) at
Alea.GlobalImplicitMemoryTracker.HostReadWriteBarrier(Object instance)
at A.cf5aded17df9f7cc4c132234dda010fa7.clo#2359-624.Invoke(Object
arg00) at
Microsoft.FSharp.Collections.SeqModule.Iterate[T](FSharpFunc2 action,
IEnumerable1 source) at Alea.Kernel.LaunchRaw(LaunchParam lp,
FSharpOption1 instanceOpt, FSharpList1 args) at
Alea.Parallel.Device.DeviceFor.For(Gpu gpu, Int64 fromInclusive, Int64
toExclusive, Action1 op) at Alea.Parallel.GpuExtension.LongFor(Gpu
gpu, Int64 fromInclusive, Int64 toExclusive, Action1 op) at
TestingGPU.Program.Execute(Int32 t) in
C:\Users..\source\repos\TestingGPU\TestingGPU\Program.cs:line 148
at TestingGPU.Program.Main(String[] args)
Working Solution:
static void Main(string[] args)
{
var devices = Device.Devices;
foreach (var device in devices)
{
Console.WriteLine(device.ToString());
}
while (true)
{
Console.WriteLine("Enter a number to check if it is a prime number:");
string line = Console.ReadLine();
long checkIfPrime = Convert.ToInt64(line);
Stopwatch sw = new Stopwatch();
sw.Start();
bool GPUisPrime = GPUIsItPrime(checkIfPrime+1);
sw.Stop();
Stopwatch sw2 = new Stopwatch();
sw2.Start();
bool CPUisPrime = CPUIsItPrime(checkIfPrime+1);
sw2.Stop();
Console.WriteLine($"GPU: is {checkIfPrime} prime? {GPUisPrime} Time Elapsed: {sw.ElapsedMilliseconds.ToString()}");
Console.WriteLine($"CPU: is {checkIfPrime} prime? {CPUisPrime} Time Elapsed: {sw2.ElapsedMilliseconds.ToString()}");
}
}
[GpuManaged]
private static bool GPUIsItPrime(long n)
{
//Sieve of Eratosthenes Algorithm
bool[] isComposite = new bool[n];
var worker = Gpu.Default;
worker.LongFor(2, n, i =>
{
if (!(isComposite[i]))
{
for (long j = 2; (j * i) < isComposite.Length; j++)
{
isComposite[j * i] = true;
}
}
});
return !isComposite[n-1];
}
private static bool CPUIsItPrime(long n)
{
//Sieve of Eratosthenes Algorithm
bool[] isComposite = new bool[n];
for (int i = 2; i < n; i++)
{
if (!isComposite[i])
{
for (long j = 2; (j * i) < n; j++)
{
isComposite[j * i] = true;
}
}
}
return !isComposite[n-1];
}

Your code doesn't look right. Given a parallel for-loop method (here LongFor), Alea will spawn "n" threads, with an index "x" that identifies the thread number. For example, For(0, n, x => a[x] = x); uses "x" to initialize a[] with { 0, 1, 2, ...., n - 1}. But your kernel code does not use "x" anywhere, so you run the same code "n" times with absolutely no difference. Why run it on a GPU then? What I think you want to do is compute in thread "x" whether "x" is prime and, with the result in hand, set bool prime[x] = true or false. Afterwards, still in the kernel, add a sync call, followed by a test that uses a single thread (e.g., x == 0) to go through prime[] and pick the largest prime from the array. Otherwise, there are a lot of collisions on 'primeNumber[0] = (a - 1);' from the n threads on the GPU, and I can't imagine how you would ever get the right result. Finally, you probably want to use some Alea call to make sure that prime[] is never copied to/from the GPU, but I don't know how you do that in Alea. The compiler might be smart enough to know that prime[] is only used in the kernel code.
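The structure described above (one independent primality test per index, then a single pass over the flags) can be sketched on the CPU as follows. This is only an illustration of the data flow: Parallel.For stands in for the GPU loop, the final pass counts up to the nth prime rather than taking the largest one, and NthPrimeSketch with its limit parameter is a name and an assumption introduced here, not Alea API.

```csharp
using System;
using System.Threading.Tasks;

static class NthPrimeSketch
{
    // Trial division; each call depends only on x, so calls can run in any order.
    static bool IsPrime(long x)
    {
        if (x < 2) return false;
        for (long d = 2; d * d <= x; d++)
            if (x % d == 0) return false;
        return true;
    }

    // limit is an assumed upper bound that must exceed the nth prime.
    public static long NthPrime(int n, int limit)
    {
        var isPrime = new bool[limit];

        // Parallel phase: index x writes only isPrime[x], so there are no collisions.
        Parallel.For(0, limit, x =>
        {
            isPrime[x] = IsPrime(x);
        });

        // Sequential phase: a single pass picks out the nth flagged index.
        int count = 0;
        for (int x = 0; x < limit; x++)
            if (isPrime[x] && ++count == n)
                return x;

        throw new ArgumentException("limit is too small to contain the nth prime");
    }

    static void Main()
    {
        Console.WriteLine(NthPrime(100, 1000)); // 541
    }
}
```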

Quick Sort Implementation with large numbers [duplicate]

I learnt about quicksort and how it can be implemented both recursively and iteratively.
In the iterative method:
Push the range (0...n) onto the stack
Pop the top range
Partition that range of the array around a pivot
Push the resulting partitions (index ranges) onto the stack if they have more than one element
Repeat the last three steps until the stack is empty
And the recursive version is the normal one defined in the wiki.
I learnt that recursive algorithms are always slower than their iterative counterparts.
So, which method is preferred in terms of time complexity (memory is not a concern)?
Which one is fast enough to use in a programming contest?
Does the C++ STL sort() use a recursive approach?

In terms of (asymptotic) time complexity, they are both the same.
"Recursive is slower than iterative": the rationale behind this statement is the overhead of the recursive stack (saving and restoring the environment between calls).
However, this is a constant number of operations per call and does not change the number of "iterations".
Both recursive and iterative quicksort are O(n log n) average case and O(n^2) worst case.
EDIT:
Just for the fun of it, I ran a benchmark with the (Java) code attached to this post, and then ran a Wilcoxon statistical test to check the probability that the running times are indeed distinct.
The results may be conclusive (P_VALUE=2.6e-34, https://en.wikipedia.org/wiki/P-value. Remember that the P_VALUE is P(T >= t | H), where T is the test statistic and H is the null hypothesis). But the answer is not what you expected.
The average for the iterative solution was 408.86 ms, while for the recursive one it was 236.81 ms.
(Note: I used Integer and not int as the argument type of recursiveQsort(). Otherwise the recursive version would have done much better, because it would not have to box a lot of integers, which is also time consuming. I did it because the iterative solution has no choice but to box.)
Thus your assumption is not true: the recursive solution is faster (on my machine and with Java, at the very least) than the iterative one, with P_VALUE=2.6e-34.
public static void recursiveQsort(int[] arr,Integer start, Integer end) {
if (end - start < 2) return; //stop clause
int p = start + ((end-start)/2);
p = partition(arr,p,start,end);
recursiveQsort(arr, start, p);
recursiveQsort(arr, p+1, end);
}
public static void iterativeQsort(int[] arr) {
Stack<Integer> stack = new Stack<Integer>();
stack.push(0);
stack.push(arr.length);
while (!stack.isEmpty()) {
int end = stack.pop();
int start = stack.pop();
if (end - start < 2) continue;
int p = start + ((end-start)/2);
p = partition(arr,p,start,end);
stack.push(p+1);
stack.push(end);
stack.push(start);
stack.push(p);
}
}
private static int partition(int[] arr, int p, int start, int end) {
int l = start;
int h = end - 2;
int piv = arr[p];
swap(arr,p,end-1);
while (l < h) {
if (arr[l] < piv) {
l++;
} else if (arr[h] >= piv) {
h--;
} else {
swap(arr,l,h);
}
}
int idx = h;
if (arr[h] < piv) idx++;
swap(arr,end-1,idx);
return idx;
}
private static void swap(int[] arr, int i, int j) {
int temp = arr[i];
arr[i] = arr[j];
arr[j] = temp;
}
public static void main(String... args) throws Exception {
Random r = new Random(1);
int SIZE = 1000000;
int N = 100;
int[] arr = new int[SIZE];
int[] millisRecursive = new int[N];
int[] millisIterative = new int[N];
for (int t = 0; t < N; t++) {
for (int i = 0; i < SIZE; i++) {
arr[i] = r.nextInt(SIZE);
}
int[] tempArr = Arrays.copyOf(arr, arr.length);
long start = System.currentTimeMillis();
iterativeQsort(tempArr);
millisIterative[t] = (int)(System.currentTimeMillis()-start);
tempArr = Arrays.copyOf(arr, arr.length);
start = System.currentTimeMillis();
recursiveQsort(tempArr,0,arr.length);
millisRecursive[t] = (int)(System.currentTimeMillis()-start);
}
int sum = 0;
for (int x : millisRecursive) {
System.out.println(x);
sum += x;
}
System.out.println("end of recursive. AVG = " + ((double)sum)/millisRecursive.length);
sum = 0;
for (int x : millisIterative) {
System.out.println(x);
sum += x;
}
System.out.println("end of iterative. AVG = " + ((double)sum)/millisIterative.length);
}

Recursion is NOT always slower than iteration, and quicksort is a perfect example of that. The only way to do it iteratively is to create a stack structure. In other words, you do the same thing the compiler does when you use recursion, and you will probably do it worse than the compiler. There will also be more jumps if you don't use recursion (to pop and push values on the stack).

That's the solution I came up with in JavaScript. I think it works.
const myArr = [33, 103, 3, 726, 200, 984, 198, 764, 9]
document.write('initial order :', JSON.stringify(myArr), '<br><br>')
qs_iter(myArr)
document.write('_Final order :', JSON.stringify(myArr))
function qs_iter(items) {
if (!items || items.length <= 1) {
return items
}
var stack = []
var low = 0
var high = items.length - 1
stack.push([low, high])
while (stack.length) {
var range = stack.pop()
low = range[0]
high = range[1]
if (low < high) {
var pivot = Math.floor((low + high) / 2)
stack.push([low, pivot])
stack.push([pivot + 1, high])
while (low < high) {
while (low < pivot && items[low] <= items[pivot]) low++
while (high > pivot && items[high] > items[pivot]) high--
if (low < high) {
var tmp = items[low]
items[low] = items[high]
items[high] = tmp
}
}
}
}
return items
}
Let me know if you find a mistake :)
Mister Jojo UPDATE:
This code just shuffles values around, which can lead to a sorted array only in rare cases; in other words, never.
For those who have a doubt, I put it in a snippet.
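For comparison, here is a minimal sketch of an iterative quicksort in which every popped range is fully partitioned before its two halves are pushed, which is the step the snippet above does not guarantee. It uses a Lomuto partition with the middle element as pivot; that choice and the IterativeQuicksort class name are assumptions of this example, not taken from the answers in this thread.

```csharp
using System;
using System.Collections.Generic;

static class IterativeQuicksort
{
    public static void Sort(int[] items)
    {
        if (items == null || items.Length < 2) return;

        // The explicit stack holds inclusive index ranges that still need sorting.
        var stack = new Stack<(int Low, int High)>();
        stack.Push((0, items.Length - 1));

        while (stack.Count > 0)
        {
            var (low, high) = stack.Pop();
            if (low >= high) continue;

            int p = Partition(items, low, high);
            stack.Push((low, p - 1));   // the pivot at p is already in its final place
            stack.Push((p + 1, high));
        }
    }

    // Lomuto partition: moves the pivot to its final index and returns that index.
    static int Partition(int[] items, int low, int high)
    {
        int mid = low + (high - low) / 2;
        Swap(items, mid, high);
        int pivot = items[high];
        int store = low;
        for (int i = low; i < high; i++)
        {
            if (items[i] < pivot)
            {
                Swap(items, i, store);
                store++;
            }
        }
        Swap(items, store, high);
        return store;
    }

    static void Swap(int[] items, int i, int j)
    {
        int tmp = items[i];
        items[i] = items[j];
        items[j] = tmp;
    }

    static void Main()
    {
        var myArr = new[] { 33, 103, 3, 726, 200, 984, 198, 764, 9 };
        Sort(myArr);
        Console.WriteLine(string.Join(",", myArr)); // 3,9,33,103,198,200,726,764,984
    }
}
```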

Fast sort partially sorted array

Firstly, this is not about an array with subsequences that may already be in some order before we start sorting; it is about an array with a special structure.
I'm writing a simple method that sorts data. Until now I used Array.Sort, but PLINQ's OrderBy outperforms the standard Array.Sort on large arrays.
So I decided to write my own implementation of a multithreaded sort. The idea is simple: split the array into partitions, sort each partition in parallel, then merge all the results into one array.
So far I'm done with partitioning and sorting:
public class PartitionSorter
{
public static void Sort(int[] arr)
{
var ranges = Range.FromArray(arr);
var allDone = new ManualResetEventSlim(false, ranges.Length*2);
int completed = 0;
foreach (var range in ranges)
{
ThreadPool.QueueUserWorkItem(r =>
{
var rr = (Range) r;
Array.Sort(arr, rr.StartIndex, rr.Length);
if (Interlocked.Increment(ref completed) == ranges.Length)
allDone.Set();
}, range);
}
allDone.Wait();
}
}
public class Range
{
public int StartIndex { get; }
public int Length { get; }
public Range(int startIndex, int length)
{
StartIndex = startIndex;
Length = length;
}
public static Range[] FromArray<T>(T[] source)
{
int processorCount = Environment.ProcessorCount;
int partitionLength = (int) (source.Length/(double) processorCount);
var result = new Range[processorCount];
int start = 0;
for (int i = 0; i < result.Length - 1; i++)
{
result[i] = new Range(start, partitionLength);
start += partitionLength;
}
result[result.Length - 1] = new Range(start, source.Length - start);
return result;
}
}
As result I get an array with special structure, for example
[1 3 5 | 2 4 7 | 6 8 9]
Now how can I use this information to finish sorting? Insertion sort and similar algorithms don't exploit the fact that the data within each block is already sorted, so that the blocks only need to be merged together. I tried to apply some algorithms from merge sort, but failed.
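One way to finish from this state is a k-way merge: keep a read position per sorted block and repeatedly take the smallest head element. The sketch below scans the block heads linearly, which is reasonable when the number of blocks is small (for example, one per processor); a heap-based merge would scale better with many blocks. BlockMerge and the starts array are illustrative assumptions, not part of the code above.

```csharp
using System;

static class BlockMerge
{
    // starts[b] is the index of the first element of sorted block b.
    // Blocks are contiguous, in order, and together cover the whole array.
    public static int[] Merge(int[] arr, int[] starts)
    {
        int k = starts.Length;
        var pos = (int[])starts.Clone();   // next unread index in each block
        var result = new int[arr.Length];

        for (int outIdx = 0; outIdx < result.Length; outIdx++)
        {
            int best = -1;
            for (int b = 0; b < k; b++)
            {
                int end = b + 1 < k ? starts[b + 1] : arr.Length;
                if (pos[b] < end && (best < 0 || arr[pos[b]] < arr[pos[best]]))
                    best = b;              // block b currently has the smallest head
            }
            result[outIdx] = arr[pos[best]++];
        }
        return result;
    }

    static void Main()
    {
        var arr = new[] { 1, 3, 5, 2, 4, 7, 6, 8, 9 };   // [1 3 5 | 2 4 7 | 6 8 9]
        var merged = Merge(arr, new[] { 0, 3, 6 });
        Console.WriteLine(string.Join(" ", merged));      // 1 2 3 4 5 6 7 8 9
    }
}
```

The merge needs a second buffer of the same size, and its cost is O(n * k) comparisons for k blocks.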

I've done some testing with a parallel Quicksort implementation.
I tested the following code with a RELEASE build on Windows 10 x64, compiled with C# 6 (Visual Studio 2015), .NET 4.6.1, and run outside any debugger.
My processor is quad core with hyperthreading (which is certainly going to help any parallel implementation!)
The array size is 20,000,000 (so a fairly large array).
I got these results:
LINQ OrderBy() took 00:00:14.1328090
PLINQ OrderBy() took 00:00:04.4484305
Array.Sort() took 00:00:02.3695607
Sequential took 00:00:02.7274400
Parallel took 00:00:00.7874578
PLINQ OrderBy() is much faster than LINQ OrderBy(), but slower than Array.Sort().
QuicksortSequential() is around the same speed as Array.Sort()
But the interesting thing here is that QuicksortParallelOptimised() is noticeably faster on my system - so it's definitely an efficient way of sorting if you have enough processor cores.
Here's the full compilable console app. Remember to run it in RELEASE mode - if you run it in DEBUG mode the timing results will be woefully incorrect.
using System;
using System.Diagnostics;
using System.Linq;
using System.Threading.Tasks;
namespace Demo
{
class Program
{
static void Main()
{
int n = 20000000;
int[] a = new int[n];
var rng = new Random(937525);
for (int i = 0; i < n; ++i)
a[i] = rng.Next();
var b = a.ToArray();
var d = a.ToArray();
var sw = new Stopwatch();
sw.Restart();
var c = a.OrderBy(x => x).ToArray(); // Need ToArray(), otherwise it does nothing.
Console.WriteLine("LINQ OrderBy() took " + sw.Elapsed);
sw.Restart();
var e = a.AsParallel().OrderBy(x => x).ToArray(); // Need ToArray(), otherwise it does nothing.
Console.WriteLine("PLINQ OrderBy() took " + sw.Elapsed);
sw.Restart();
Array.Sort(d);
Console.WriteLine("Array.Sort() took " + sw.Elapsed);
sw.Restart();
QuicksortSequential(a, 0, a.Length-1);
Console.WriteLine("Sequential took " + sw.Elapsed);
sw.Restart();
QuicksortParallelOptimised(b, 0, b.Length-1);
Console.WriteLine("Parallel took " + sw.Elapsed);
// Verify that our sort implementation is actually correct!
Trace.Assert(a.SequenceEqual(c));
Trace.Assert(b.SequenceEqual(c));
}
static void QuicksortSequential<T>(T[] arr, int left, int right)
where T : IComparable<T>
{
if (right > left)
{
int pivot = Partition(arr, left, right);
QuicksortSequential(arr, left, pivot - 1);
QuicksortSequential(arr, pivot + 1, right);
}
}
static void QuicksortParallelOptimised<T>(T[] arr, int left, int right)
where T : IComparable<T>
{
const int SEQUENTIAL_THRESHOLD = 2048;
if (right > left)
{
if (right - left < SEQUENTIAL_THRESHOLD)
{
QuicksortSequential(arr, left, right);
}
else
{
int pivot = Partition(arr, left, right);
Parallel.Invoke(
() => QuicksortParallelOptimised(arr, left, pivot - 1),
() => QuicksortParallelOptimised(arr, pivot + 1, right));
}
}
}
static int Partition<T>(T[] arr, int low, int high) where T : IComparable<T>
{
int pivotPos = (high + low) / 2;
T pivot = arr[pivotPos];
Swap(arr, low, pivotPos);
int left = low;
for (int i = low + 1; i <= high; i++)
{
if (arr[i].CompareTo(pivot) < 0)
{
left++;
Swap(arr, i, left);
}
}
Swap(arr, low, left);
return left;
}
static void Swap<T>(T[] arr, int i, int j)
{
T tmp = arr[i];
arr[i] = arr[j];
arr[j] = tmp;
}
}
}

Genetic Algorithm implementation in C#

I've recently started working with C# and I'm currently trying to implement a version of a GA to solve Schwefel’s function (see code below). The code is based on working Processing code that I built.
The first generation(first 100 individuals) seems to work fine but after that the fitness function gets repetitive values. I'm sure I'm missing something here but does anyone know what might be the problem?
public void button21_Click(object sender, EventArgs e)
{
Population p;
// populationNum = 100;
p = new Population();
int gen = 0;
while (gen < 8000)
{
p.evolve();
}
++gen;
}
//Class Genotype
public partial class Genotype
{
public int[] genes;
public Genotype()
{
genes = new int[2];
for (int i = 0; i < genes.Length; i++)
{
Random rnd = new Random(int.Parse(Guid.NewGuid().ToString().Substring(0, 8), System.Globalization.NumberStyles.HexNumber));
//Random rnd = new Random(0);
int random = rnd.Next(256);
genes[i] = (int)random;
}
}
public void mutate()
{
//5% mutation rate
for (int i = 0; i < genes.Length; i++)
{
Random rnd = new Random(int.Parse(Guid.NewGuid().ToString().Substring(0, 8), System.Globalization.NumberStyles.HexNumber));
int random = rnd.Next(100);
if (random < 5)
{
//Random genernd = new Random();
int generandom = rnd.Next(256);
genes[i] = (int)generandom;
}
}
}
}
static Genotype crossover(Genotype a, Genotype b)
{
Genotype c = new Genotype();
for (int i = 0; i < c.genes.Length; i++)
{
//50-50 chance of selection
Random rnd = new Random(int.Parse(Guid.NewGuid().ToString().Substring(0, 8), System.Globalization.NumberStyles.HexNumber));
float random = rnd.Next(0, 1);
if (random < 0.5)
{
c.genes[i] = a.genes[i];
}
else
{
c.genes[i] = b.genes[i];
}
}
return c;
}
//Class Phenotype
public partial class Phenotype
{
double i_x;
double i_y;
public Phenotype(Genotype g)
{
i_x = g.genes[0] * 500 / 256;
i_y = g.genes[1] * 500 / 256;
}
public double evaluate()
{
double fitness = 0;
fitness -= (-1.0*i_x * Math.Sin(Math.Sqrt(Math.Abs(i_x)))) + (-1.0*i_y * Math.Sin(Math.Sqrt(Math.Abs(i_y))));
Console.WriteLine(fitness);
return fitness;
}
}
//Class Individual
public partial class Individual : IComparable<Individual>
{
public Genotype i_genotype;
public Phenotype i_phenotype;
double i_fitness;
public Individual()
{
this.i_genotype = new Genotype();
this.i_phenotype = new Phenotype(i_genotype);
this.i_fitness = 0;
}
public void evaluate()
{
i_fitness = i_phenotype.evaluate();
}
int IComparable<Individual>.CompareTo(Individual objI)
{
Individual iToCompare = (Individual)objI;
if (i_fitness < iToCompare.i_fitness)
{
return -1; //if I am less fit than iCompare return -1
}
else if (i_fitness > iToCompare.i_fitness)
{
return 1; //if I am fitter than iCompare return 1
}
return 0; // if we are equally return 0
}
}
static Individual breed(Individual a, Individual b)
{
Individual c = new Individual();
c.i_genotype = crossover(a.i_genotype, b.i_genotype);
c.i_genotype.mutate();
c.i_phenotype = new Phenotype(c.i_genotype);
return c;
}
//Class Population
public class Population
{
Individual[] pop;
int populationNum = 100;
public Population()
{
pop = new Individual[populationNum];
for (int i = 0; i < populationNum; i++)
{
this.pop[i] = new Individual();
pop[i].evaluate();
}
Array.Sort(this.pop);
}
public void evolve()
{
Individual a = select();
Individual b = select();
//breed the two selected individuals
Individual x = breed(a, b);
//place the offspring in the lowest position in the population, thus replacing the previously weakest offspring
pop[0] = x;
//evaluate the new individual (grow)
x.evaluate();
//the fitter offspring will find its way in the population ranks
Array.Sort(this.pop);
//rnd = new Random(0);
}
Individual select()
{
Random rnd = new Random(int.Parse(Guid.NewGuid().ToString().Substring(0, 8), System.Globalization.NumberStyles.HexNumber));
float random = rnd.Next(0, 1);
//skew distribution; multiplying by 99.999999 scales a number from 0-1 to 0-99, BUT NOT 100
//the sqrt of a number between 0-1 has bigger possibilities of giving us a smaller number
//if we subtract that squares number from 1 the opposite is true-> we have bigger possibilities of having a larger number
int which = (int)Math.Floor(((float)populationNum - 1e-6) * (1.0 - Math.Pow(random, random)));
return pop[which];
}
}

This is the updated code, which I think performs well:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using System.Threading;
namespace ConsoleApplication8
{
class Program
{
static Random random = new Random();
static void Main(string[] args)
{
Population p;
System.IO.StreamWriter file = new System.IO.StreamWriter("c:\\test.txt");
int population = 100;
p = new Population(file, population);
int gen = 0;
while (gen <= 1000)
{
p.evolve(file);
++gen;
}
file.Close();
}
public static double GetRandomNumber(double min, double max)
{
return (random.NextDouble() * (max - min)) + min;
//return random.NextDouble() *random.Next(min,max);
}
//Class Genotype
public class Genotype
{
public int[] genes;
public Genotype()
{
this.genes = new int[2];
for (int i = 0; i < genes.Length; i++)
{
this.genes[i] = (int)GetRandomNumber(-500.0, 500.0);
}
}
public void mutate()
{
//5% mutation rate
for (int i = 0; i < genes.Length; i++)
{
if (GetRandomNumber(0.0, 100) < 5)
{
//mutate within the same range used when the genes are initialised
this.genes[i] = (int)GetRandomNumber(-500.0, 500.0);
}
}
}
}
static Genotype crossover(Genotype a, Genotype b)
{
Genotype c = new Genotype();
for (int i = 0; i < c.genes.Length; i++)
{
//50-50 chance of selection
if (GetRandomNumber(0.0, 1) < 0.5)
{
c.genes[i] = a.genes[i];
}
else
{
c.genes[i] = b.genes[i];
}
}
return c;
}
//Class Phenotype
public class Phenotype
{
double i_x;
double i_y;
public Phenotype(Genotype g)
{
this.i_x = g.genes[0];
this.i_y = g.genes[1];
}
public double evaluate(System.IO.StreamWriter file)
{
double fitness = 0;
//fitness -= i_x + i_y;
fitness -= (i_x*Math.Sin(Math.Sqrt(Math.Abs(i_x)))) + i_y*(Math.Sin(Math.Sqrt(Math.Abs(i_y))));
file.WriteLine(fitness);
return fitness;
}
}
//Class Individual
public class Individual : IComparable<Individual>
{
public Genotype i_genotype;
public Phenotype i_phenotype;
double i_fitness;
public Individual()
{
this.i_genotype = new Genotype();
this.i_phenotype = new Phenotype(i_genotype);
this.i_fitness = 0.0;
}
public void evaluate(System.IO.StreamWriter file)
{
this.i_fitness = i_phenotype.evaluate(file);
}
int IComparable<Individual>.CompareTo(Individual objI)
{
Individual iToCompare = (Individual)objI;
if (i_fitness < iToCompare.i_fitness)
{
return -1; //if I am less fit than iCompare return -1
}
else if (i_fitness > iToCompare.i_fitness)
{
return 1; //if I am fitter than iCompare return 1
}
return 0; // if we are equally return 0
}
}
public static Individual breed(Individual a, Individual b)
{
Individual c = new Individual();
c.i_genotype = crossover(a.i_genotype, b.i_genotype);
c.i_genotype.mutate();
c.i_phenotype = new Phenotype(c.i_genotype);
return c;
}
//Class Population
public class Population
{
Individual[] pop;
//int populationNum = 100;
public Population(System.IO.StreamWriter file, int populationNum)
{
this.pop = new Individual[populationNum];
for (int i = 0; i < populationNum; i++)
{
this.pop[i] = new Individual();
this.pop[i].evaluate(file);
}
Array.Sort(pop);
}
public void evolve(System.IO.StreamWriter file)
{
Individual a = select(pop.Length);
Individual b = select(pop.Length);
//breed the two selected individuals
Individual x = breed(a, b);
//place the offspring in the lowest position in the population, thus replacing the previously weakest offspring
this.pop[0] = x;
//evaluate the new individual (grow)
x.evaluate(file);
//the fitter offspring will find its way in the population ranks
Array.Sort(pop);
}
Individual select(int popNum)
{
//skew distribution; multiplying by (popNum - 1E-6) scales a number from 0-1 to 0..popNum-1, BUT NOT popNum
//squaring a number between 0 and 1 tends to give a smaller number
//subtracting that square from 1 does the opposite -> larger indices (the fitter end of the sorted array) are more likely
int which = (int)Math.Floor(((float)popNum - 1E-6) * (1.0 - Math.Pow(GetRandomNumber(0.0, 1.0), 2)));
return pop[which];
}
}
}
}

This is a problem:
float random = rnd.Next(0, 1); // returns an integer from 0 to 0 as a float
// Documentation states the second argument is exclusive
Try
float random = (float)rnd.NextDouble(); // rnd should be static, init'd once.
Also replace all instances of Individual[] with List<Individual>, which wraps an array and provides easy Add(), Insert() and RemoveAt() methods.
PS. The common convention is to use PascalCase for all methods and properties.
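To illustrate the point about Random.Next(0, 1), here is a minimal sketch. The class and method names (RandomDemo, NextIntIsAlwaysZero, CoinFlip) are made up for this example and are not part of the question's code. It uses one shared Random instance; note that Random is not thread-safe, so this assumes single-threaded use.

```csharp
using System;

static class RandomDemo
{
    // One shared generator, created once (hypothetical helper, not from the question).
    static readonly Random Rng = new Random();

    // Random.Next(0, 1) has an exclusive upper bound, so it can only return 0.
    public static bool NextIntIsAlwaysZero(int samples)
    {
        for (int i = 0; i < samples; i++)
        {
            if (Rng.Next(0, 1) != 0) return false;
        }
        return true;
    }

    // Fair coin flip for crossover: NextDouble() returns a value in [0, 1).
    public static bool CoinFlip()
    {
        return Rng.NextDouble() < 0.5;
    }
}
```

With this in place, the crossover loop could call RandomDemo.CoinFlip() instead of comparing rnd.Next(0, 1) against 0.5, which always takes the first branch.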

I think the biggest issue is with your select function.
The success of a GA depends a lot on picking the right mutation, evaluation and selection techniques. At first glance your selection function seems an elegant way to skew the distribution, but you're only skewing it based on relative position (i.e. Pop[0] < Pop[1]); you're not taking into account how different the individuals are from each other.
In a GA there's a HUGE difference between the best individual having a fitness of 100.0 and the second 99.9, and the best having 100.0 and the second 75.0. Your selection function completely ignores this.
The reason you see repetitive fitness values is that you're picking roughly the same individuals over and over, which makes your gene pool stagnant and stalls it in a local minimum (or maximum, whichever you're looking for).
A method like roulette wheel selection (http://en.wikipedia.org/wiki/Fitness_proportionate_selection) sets each individual's probability of being picked to its fitness divided by the total fitness, sharing the 'chance' of being picked among more individuals depending on how they perform. Although this method can also get trapped in local optima, it is far less prone to it than what you currently have, and it should give you a very good boost in exploring the search space.
TL;DR - The selection function is not good enough as it is: it skews the distribution too harshly and only takes relative comparisons into account.
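A rough sketch of fitness-proportionate selection, assuming higher fitness is better as in the question's code. The Roulette class and its Select method are hypothetical names for this example. Because Schwefel fitness values can be negative, the sketch shifts them so the worst individual gets a tiny positive weight; that shift is an assumption of this example, not part of the textbook method.

```csharp
using System;
using System.Linq;

static class Roulette
{
    // Returns an index chosen with probability proportional to its weight.
    // Fitness values may be negative, so they are shifted so that the worst
    // individual gets a small positive weight (an assumption of this sketch).
    public static int Select(double[] fitness, Random rng)
    {
        double min = fitness.Min();
        const double epsilon = 1e-9;
        double[] weights = fitness.Select(f => f - min + epsilon).ToArray();
        double total = weights.Sum();

        double spin = rng.NextDouble() * total; // a point on the wheel
        double cumulative = 0.0;
        for (int i = 0; i < weights.Length; i++)
        {
            cumulative += weights[i];
            if (spin < cumulative) return i;
        }
        return weights.Length - 1; // guard against floating-point round-off
    }
}
```

In Population.evolve, the two parents could then be picked with something like pop[Roulette.Select(fitnessValues, random)], where fitnessValues holds each individual's fitness in the same order as pop. The array must not be empty, since Min() throws on an empty sequence.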

Random.Next(int min, int max) generates only integers between the min and max values, and the max value is exclusive.
Try rnd.NextDouble() to generate a random number between 0 and 1.
That's all I can help with right now :)

Is this parallel sort merge implemented correctly?

Is this parallel merge sort implemented correctly? It looks correct; I took 40 seconds to write a test and it hasn't failed.
The gist of it is that I need to sort by splitting the array in half every time. I then tried to make sure I hadn't gone wrong and asked a question as a sanity check (for my own sanity). I wanted an in-place sort but decided it was way too complicated when I saw the answer, so I implemented the code below.
Granted, there's no point creating a task/thread to sort a 4-byte array, but this is to learn threading. Is there anything wrong, or anything I can change to make this better? To me it looks perfect, but I'd like some general feedback.
static void Main(string[] args)
{
var start = DateTime.Now;
//for (int z = 0; z < 1000000; z++)
int z = 0;
while(true)
{
var curr = DateTime.Now;
if (curr - start > TimeSpan.FromMinutes(1))
break;
var arr = new byte[] { 5, 3, 1, 7, 8, 5, 3, 2, 6, 7, 9, 3, 2, 4, 2, 1 };
Sort(arr, 0, arr.Length, new byte[arr.Length]);
//Console.Write(BitConverter.ToString(arr));
for (int i = 1; i < arr.Length; ++i)
{
if (arr[i - 1] > arr[i]) // compare each element with its predecessor
{
System.Diagnostics.Debug.Assert(false);
throw new Exception("Sort was incorrect " + BitConverter.ToString(arr));
}
}
++z;
}
Console.WriteLine("Tried {0} times with success", z);
}
static void Sort(byte[] arr, int leftPos, int rightPos, byte[] tempArr)
{
var len = rightPos - leftPos;
if (len < 2)
return;
if (len == 2)
{
if (arr[leftPos] > arr[leftPos + 1])
{
var t = arr[leftPos];
arr[leftPos] = arr[leftPos + 1];
arr[leftPos + 1] = t;
}
return;
}
var rStart = leftPos+len/2;
var t1 = new Thread(delegate() { Sort(arr, leftPos, rStart, tempArr); });
var t2 = new Thread(delegate() { Sort(arr, rStart, rightPos, tempArr); });
t1.Start();
t2.Start();
t1.Join();
t2.Join();
var l = leftPos;
var r = rStart;
var z = leftPos;
while (l<rStart && r<rightPos)
{
if (arr[l] < arr[r])
{
tempArr[z] = arr[l];
l++;
}
else
{
tempArr[z] = arr[r];
r++;
}
z++;
}
if (l < rStart)
Array.Copy(arr, l, tempArr, z, rStart - l);
else
Array.Copy(arr, r, tempArr, z, rightPos - r);
Array.Copy(tempArr, leftPos, arr, leftPos, rightPos - leftPos);
}

You could use the Task Parallel Library (TPL) to give you a better abstraction over threads and cleaner code. The example below, a parallel quicksort, uses it.
The main difference from your code, other than using the TPL, is that it has a cutoff threshold below which a sequential implementation is used regardless of the number of threads that have started. This prevents the creation of threads that do a very small amount of work. There is also a depth cutoff below which new threads are not created. This prevents more threads being created than the hardware can handle, based on the number of available logical cores (Environment.ProcessorCount).
I would recommend implementing both of these approaches in your code to prevent thread explosion for large arrays and the inefficient creation of threads that do very small amounts of work, even for small array sizes. It will also give you better performance.
public static class Sort
{
public static int Threshold = 150;
public static void InsertionSort(int[] array, int from, int to)
{
// ...
}
static void Swap(int[] array, int i, int j)
{
// ...
}
static int Partition(int[] array, int from, int to, int pivot)
{
// ...
}
public static void ParallelQuickSort(int[] array)
{
ParallelQuickSort(array, 0, array.Length,
(int) Math.Log(Environment.ProcessorCount, 2) + 4);
}
static void ParallelQuickSort(int[] array, int from, int to, int depthRemaining)
{
if (to - from <= Threshold)
{
InsertionSort(array, from, to);
}
else
{
int pivot = from + (to - from) / 2; // could be anything, use middle
pivot = Partition(array, from, to, pivot);
if (depthRemaining > 0)
{
Parallel.Invoke(
() => ParallelQuickSort(array, from, pivot, depthRemaining - 1),
() => ParallelQuickSort(array, pivot + 1, to, depthRemaining - 1));
}
else
{
ParallelQuickSort(array, from, pivot, 0);
ParallelQuickSort(array, pivot + 1, to, 0);
}
}
}
}
The full source is available on http://parallelpatterns.codeplex.com/
You can read a discussion of the implementation on http://msdn.microsoft.com/en-us/library/ff963551.aspx
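Applied to the merge sort from the question, the same two cutoffs might look roughly like the sketch below. This is an illustration, not code from the linked project: the class name ParallelMergeSort, the Threshold value of 32 and the use of Array.Sort for small ranges are all assumptions made for this example, and the threshold would need tuning by measurement.

```csharp
using System;
using System.Threading.Tasks;

static class ParallelMergeSort
{
    const int Threshold = 32; // hypothetical cutoff; at or below this size, sort sequentially

    public static void Sort(byte[] arr)
    {
        // Same depth formula as the quicksort example above.
        int depth = (int)Math.Log(Environment.ProcessorCount, 2) + 4;
        Sort(arr, 0, arr.Length, new byte[arr.Length], depth);
    }

    static void Sort(byte[] arr, int left, int right, byte[] temp, int depthRemaining)
    {
        int len = right - left;
        if (len < 2) return;
        if (len <= Threshold)
        {
            Array.Sort(arr, left, len); // small range: not worth a task
            return;
        }
        int mid = left + len / 2;
        if (depthRemaining > 0)
        {
            // The two halves touch disjoint ranges of arr and temp, so they can run in parallel.
            Parallel.Invoke(
                () => Sort(arr, left, mid, temp, depthRemaining - 1),
                () => Sort(arr, mid, right, temp, depthRemaining - 1));
        }
        else
        {
            Sort(arr, left, mid, temp, 0);
            Sort(arr, mid, right, temp, 0);
        }
        Merge(arr, left, mid, right, temp);
    }

    static void Merge(byte[] arr, int left, int mid, int right, byte[] temp)
    {
        int l = left, r = mid, z = left;
        while (l < mid && r < right)
            temp[z++] = arr[l] <= arr[r] ? arr[l++] : arr[r++];
        // Copy whichever half still has elements, then write the merged range back.
        if (l < mid) Array.Copy(arr, l, temp, z, mid - l);
        else Array.Copy(arr, r, temp, z, right - r);
        Array.Copy(temp, left, arr, left, right - left);
    }
}
```

Calling ParallelMergeSort.Sort(arr) sorts the array in place using a single temporary buffer, like the question's version, but without creating a thread for every pair of elements.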
