Fast sort partially sorted array

Fast sort partially sorted array - c#

Firstly, it's not about an array with subsequences that may be in some order before we start sort, it's an about array of special structure.
I'm writing now a simple method that sorts data. Until now, I used Array.Sort, but PLINQ's OrderBy outperform standard Array.Sort on large arrays.
So i decide to write my own implementation of multithreading sort. Idea was simple: split an array on partitions, parallel sort each partition, then merge all results in one array.
Now i'm done with partitioning and sorting:
public class PartitionSorter
{
public static void Sort(int[] arr)
{
var ranges = Range.FromArray(arr);
var allDone = new ManualResetEventSlim(false, ranges.Length*2);
int completed = 0;
foreach (var range in ranges)
{
ThreadPool.QueueUserWorkItem(r =>
{
var rr = (Range) r;
Array.Sort(arr, rr.StartIndex, rr.Length);
if (Interlocked.Increment(ref completed) == ranges.Length)
allDone.Set();
}, range);
}
allDone.Wait();
}
}
public class Range
{
public int StartIndex { get; }
public int Length { get; }
public Range(int startIndex, int endIndex)
{
StartIndex = startIndex;
Length = endIndex;
}
public static Range[] FromArray<T>(T[] source)
{
int processorCount = Environment.ProcessorCount;
int partitionLength = (int) (source.Length/(double) processorCount);
var result = new Range[processorCount];
int start = 0;
for (int i = 0; i < result.Length - 1; i++)
{
result[i] = new Range(start, partitionLength);
start += partitionLength;
}
result[result.Length - 1] = new Range(start, source.Length - start);
return result;
}
}
As result I get an array with special structure, for example
[1 3 5 | 2 4 7 | 6 8 9]
Now how can I use this information and finish sorting? Insertion sorts and others doesn't use information that data in blocks is already sorted, and we just need to merge them together. I tried to apply some algorithms from Merge sort, but failed.

I've done some testing with a parallel Quicksort implementation.
I tested the following code with a RELEASE build on Windows x64 10, compiled with C#6 (Visual Studio 2015), .Net 4.61, and run outside any debugger.
My processor is quad core with hyperthreading (which is certainly going to help any parallel implementation!)
The array size is 20,000,000 (so a fairly large array).
I got these results:
LINQ OrderBy() took 00:00:14.1328090
PLINQ OrderBy() took 00:00:04.4484305
Array.Sort() took 00:00:02.3695607
Sequential took 00:00:02.7274400
Parallel took 00:00:00.7874578
PLINQ OrderBy() is much faster than LINQ OrderBy(), but slower than Array.Sort().
QuicksortSequential() is around the same speed as Array.Sort()
But the interesting thing here is that QuicksortParallelOptimised() is noticeably faster on my system - so it's definitely an efficient way of sorting if you have enough processor cores.
Here's the full compilable console app. Remember to run it in RELEASE mode - if you run it in DEBUG mode the timing results will be woefully incorrect.
using System;
using System.Diagnostics;
using System.Linq;
using System.Threading.Tasks;
namespace Demo
{
class Program
{
static void Main()
{
int n = 20000000;
int[] a = new int[n];
var rng = new Random(937525);
for (int i = 0; i < n; ++i)
a[i] = rng.Next();
var b = a.ToArray();
var d = a.ToArray();
var sw = new Stopwatch();
sw.Restart();
var c = a.OrderBy(x => x).ToArray(); // Need ToArray(), otherwise it does nothing.
Console.WriteLine("LINQ OrderBy() took " + sw.Elapsed);
sw.Restart();
var e = a.AsParallel().OrderBy(x => x).ToArray(); // Need ToArray(), otherwise it does nothing.
Console.WriteLine("PLINQ OrderBy() took " + sw.Elapsed);
sw.Restart();
Array.Sort(d);
Console.WriteLine("Array.Sort() took " + sw.Elapsed);
sw.Restart();
QuicksortSequential(a, 0, a.Length-1);
Console.WriteLine("Sequential took " + sw.Elapsed);
sw.Restart();
QuicksortParallelOptimised(b, 0, b.Length-1);
Console.WriteLine("Parallel took " + sw.Elapsed);
// Verify that our sort implementation is actually correct!
Trace.Assert(a.SequenceEqual(c));
Trace.Assert(b.SequenceEqual(c));
}
static void QuicksortSequential<T>(T[] arr, int left, int right)
where T : IComparable<T>
{
if (right > left)
{
int pivot = Partition(arr, left, right);
QuicksortSequential(arr, left, pivot - 1);
QuicksortSequential(arr, pivot + 1, right);
}
}
static void QuicksortParallelOptimised<T>(T[] arr, int left, int right)
where T : IComparable<T>
{
const int SEQUENTIAL_THRESHOLD = 2048;
if (right > left)
{
if (right - left < SEQUENTIAL_THRESHOLD)
{
QuicksortSequential(arr, left, right);
}
else
{
int pivot = Partition(arr, left, right);
Parallel.Invoke(
() => QuicksortParallelOptimised(arr, left, pivot - 1),
() => QuicksortParallelOptimised(arr, pivot + 1, right));
}
}
}
static int Partition<T>(T[] arr, int low, int high) where T : IComparable<T>
{
int pivotPos = (high + low) / 2;
T pivot = arr[pivotPos];
Swap(arr, low, pivotPos);
int left = low;
for (int i = low + 1; i <= high; i++)
{
if (arr[i].CompareTo(pivot) < 0)
{
left++;
Swap(arr, i, left);
}
}
Swap(arr, low, left);
return left;
}
static void Swap<T>(T[] arr, int i, int j)
{
T tmp = arr[i];
arr[i] = arr[j];
arr[j] = tmp;
}
}
}

Related

Quicksort algorithm (C#)

I would like to ask what's actually wrong on this code. I tried to understand quicksort (2-way) by myself so I looked into this page: http://me.dt.in.th/page/Quicksort/#disqus_thread after that I tried to code it by myself and landed here:
public void Sort(Comparison<TList> del, long l, long r)
{
// inspired by: http://me.dt.in.th/page/Quicksort/
if (l >= r) return;
// partitioning
for(long i = l + 1; i <= r; i++)
{
if (del.Invoke(this[i], this[l]) < 0)
{
Swap(i, l);
}
}
// recursion
Sort(del, l, l - 1);
Sort(del, l + 1, r);
}
Then I looked into the comments on the mentioned website and found this:
void qsort(char *v[], int left, int right)
{
int i, last;
void swap(char *v[], int i, int j);
if (left >= right)
return;
swap(v, left, (left + right) / 2);
last = left;
for (i = left + 1; i <= right; i++)
if (strcmp(v[i], v[left]) < 0)
swap(v, ++last, i);
swap(v, left, last);
qsort(v, left, last - 1);
qsort(v, last + 1, right);
}
and now I'm really curious why my code is still working, tested it with this (it's included in a linked list by the way):
static void Main(string[] args)
{
MyList<int> obj;
do
{
obj = MyList.Random(100, 0, 100);
obj.Sort(stdc);
obj.Sort(stdc);
} while (obj.IsSorted(stdc));
Log("Not sorted", obj);
Console.ReadKey(true);
}
and this:
public bool IsSorted(Comparison<TList> del)
{
var el = start;
if (el != null)
{
while (el.Next != null)
{
if (del.Invoke(el.Value, el.Next.Value) > 0) // eq. to this[i] > this[i + 1]
return false;
el = el.Next;
}
}
return true;
}
and this:
public static MyList<int> Random(int num, int min = 0, int max = 1)
{
var res = new MyList<int>();
var rand = new Random();
while (num > 0)
{
res.Add(rand.Next(min, max));
num--;
}
return res;
}

Your quicksort code is not really quicksort, and it will be slow. The line
Sort(del, l, l - 1);
will not do anything, since the recursed into call will detect l >= (l-1). The line
Sort(del, l + 1, r);
only reduces the size of the partition [l,r] by one to [l+1,r], so the time complexity will always be O(n^2), but it looks like it will work, just slowly.
The qsort() function swaps the left and middle values, this will avoid worst case issue if data is already sorted, but doesn't do much else. The key is that it updates "last", so that when partition is completed, the value at v[last] will be the pivot value, and then it makes recursive calls using "last" instead of "left" or in your code's case, "l".
The qsort shown is a variation of Lomuto partition scheme. Take a look at the wiki article which also includes the alternate Hoare partition scheme:
https://en.wikipedia.org/wiki/Quicksort#Algorithm

C# OpenCL GPU implementation for double array math

How can I make the for loop of this function to use the GPU with OpenCL?
public static double[] Calculate(double[] num, int period)
{
var final = new double[num.Length];
double sum = num[0];
double coeff = 2.0 / (1.0 + period);
for (int i = 0; i < num.Length; i++)
{
sum += coeff * (num[i] - sum);
final[i] = sum;
}
return final;
}

Your problem as written does not fit well with something that would work on a GPU. You cannot parallelize (in a way that improves performance) the operation on a single array because the value of the nth element depends on elements 1 to n. However, you can utilize the GPU to process multiple arrays, where each GPU core operates on a separate array.
The full code for the solution is at the end of the answer, but the results of the test, to calculate on 10,000 arrays each of which has 10,000 elements, generates the following (on a GTX1080M and an i7 7700k with 32GB RAM):
Task Generating Data: 1096.4583ms
Task CPU Single Thread: 596.2624ms
Task CPU Parallel: 179.1717ms
GPU CPU->GPU: 89ms
GPU Execute: 86ms
GPU GPU->CPU: 29ms
Task Running GPU: 921.4781ms
Finished
In this test, we measure the speed at which we can generate results into a managed C# array using the CPU with one thread, the CPU with all threads, and finally the GPU using all cores. We validate that the results from each test are identical, using the function AreTheSame.
The fastest time is processing the arrays on the CPU using all threads (Task CPU Parallel: 179ms).
The GPU is actually the slowest (Task Running GPU: 922ms), but this is because of the time taken to reformat the C# arrays in a way that they can be transferred onto the GPU.
If this bottleneck were removed (which is quite possible, depending on your use case), the GPU could potentially be the fastest. If the data were already formatted in a manner that can be immediately be transferred onto the GPU, the total processing time for the GPU would be 204ms (CPU->GPU: 89ms + Execute: 86ms + GPU->CPU: 29 ms = 204ms). This is still slower than the parallel CPU option, but on a different sort of data set, it might be faster.
To get the data back from the GPU (the most important part of actually using the GPU), we use the function ComputeCommandQueue.Read. This transfers the altered array on the GPU back to the CPU.
To run the following code, reference the Cloo Nuget Package (I used 0.9.1). And make sure to compile on x64 (you will need the memory). You may need to update your graphics card driver too if it fails to find an OpenCL device.
class Program
{
static string CalculateKernel
{
get
{
return #"
kernel void Calc(global int* offsets, global int* lengths, global double* doubles, double periodFactor)
{
int id = get_global_id(0);
int start = offsets[id];
int length = lengths[id];
int end = start + length;
double sum = doubles[start];
for(int i = start; i < end; i++)
{
sum = sum + periodFactor * ( doubles[i] - sum );
doubles[i] = sum;
}
}";
}
}
public static double[] Calculate(double[] num, int period)
{
var final = new double[num.Length];
double sum = num[0];
double coeff = 2.0 / (1.0 + period);
for (int i = 0; i < num.Length; i++)
{
sum += coeff * (num[i] - sum);
final[i] = sum;
}
return final;
}
static void Main(string[] args)
{
int maxElements = 10000;
int numArrays = 10000;
int computeCores = 2048;
double[][] sets = new double[numArrays][];
using (Timer("Generating Data"))
{
Random elementRand = new Random(1);
for (int i = 0; i < numArrays; i++)
{
sets[i] = GetRandomDoubles(elementRand.Next((int)(maxElements * 0.9), maxElements), randomSeed: i);
}
}
int period = 14;
double[][] singleResults;
using (Timer("CPU Single Thread"))
{
singleResults = CalculateCPU(sets, period);
}
double[][] parallelResults;
using (Timer("CPU Parallel"))
{
parallelResults = CalculateCPUParallel(sets, period);
}
if (!AreTheSame(singleResults, parallelResults)) throw new Exception();
double[][] gpuResults;
using (Timer("Running GPU"))
{
gpuResults = CalculateGPU(computeCores, sets, period);
}
if (!AreTheSame(singleResults, gpuResults)) throw new Exception();
Console.WriteLine("Finished");
Console.ReadKey();
}
public static bool AreTheSame(double[][] a1, double[][] a2)
{
if (a1.Length != a2.Length) return false;
for (int i = 0; i < a1.Length; i++)
{
var ar1 = a1[i];
var ar2 = a2[i];
if (ar1.Length != ar2.Length) return false;
for (int j = 0; j < ar1.Length; j++)
if (Math.Abs(ar1[j] - ar2[j]) > 0.0000001) return false;
}
return true;
}
public static double[][] CalculateGPU(int partitionSize, double[][] sets, int period)
{
ComputeContextPropertyList cpl = new ComputeContextPropertyList(ComputePlatform.Platforms[0]);
ComputeContext context = new ComputeContext(ComputeDeviceTypes.Gpu, cpl, null, IntPtr.Zero);
ComputeProgram program = new ComputeProgram(context, new string[] { CalculateKernel });
program.Build(null, null, null, IntPtr.Zero);
ComputeCommandQueue commands = new ComputeCommandQueue(context, context.Devices[0], ComputeCommandQueueFlags.None);
ComputeEventList events = new ComputeEventList();
ComputeKernel kernel = program.CreateKernel("Calc");
double[][] results = new double[sets.Length][];
double periodFactor = 2d / (1d + period);
Stopwatch sendStopWatch = new Stopwatch();
Stopwatch executeStopWatch = new Stopwatch();
Stopwatch recieveStopWatch = new Stopwatch();
int offset = 0;
while (true)
{
int first = offset;
int last = Math.Min(offset + partitionSize, sets.Length);
int length = last - first;
var merged = Merge(sets, first, length);
sendStopWatch.Start();
ComputeBuffer<int> offsetBuffer = new ComputeBuffer<int>(
context,
ComputeMemoryFlags.ReadWrite | ComputeMemoryFlags.UseHostPointer,
merged.Offsets);
ComputeBuffer<int> lengthsBuffer = new ComputeBuffer<int>(
context,
ComputeMemoryFlags.ReadWrite | ComputeMemoryFlags.UseHostPointer,
merged.Lengths);
ComputeBuffer<double> doublesBuffer = new ComputeBuffer<double>(
context,
ComputeMemoryFlags.ReadWrite | ComputeMemoryFlags.UseHostPointer,
merged.Doubles);
kernel.SetMemoryArgument(0, offsetBuffer);
kernel.SetMemoryArgument(1, lengthsBuffer);
kernel.SetMemoryArgument(2, doublesBuffer);
kernel.SetValueArgument(3, periodFactor);
sendStopWatch.Stop();
executeStopWatch.Start();
commands.Execute(kernel, null, new long[] { merged.Lengths.Length }, null, events);
executeStopWatch.Stop();
using (var pin = Pinned(merged.Doubles))
{
recieveStopWatch.Start();
commands.Read(doublesBuffer, false, 0, merged.Doubles.Length, pin.Address, events);
commands.Finish();
recieveStopWatch.Stop();
}
for (int i = 0; i < merged.Lengths.Length; i++)
{
int len = merged.Lengths[i];
int off = merged.Offsets[i];
var res = new double[len];
Array.Copy(merged.Doubles,off,res,0,len);
results[first + i] = res;
}
offset += partitionSize;
if (offset >= sets.Length) break;
}
Console.WriteLine("GPU CPU->GPU: " + recieveStopWatch.ElapsedMilliseconds + "ms");
Console.WriteLine("GPU Execute: " + executeStopWatch.ElapsedMilliseconds + "ms");
Console.WriteLine("GPU GPU->CPU: " + sendStopWatch.ElapsedMilliseconds + "ms");
return results;
}
public static PinnedHandle Pinned(object obj) => new PinnedHandle(obj);
public class PinnedHandle : IDisposable
{
public IntPtr Address => handle.AddrOfPinnedObject();
private GCHandle handle;
public PinnedHandle(object val)
{
handle = GCHandle.Alloc(val, GCHandleType.Pinned);
}
public void Dispose()
{
handle.Free();
}
}
public class MergedResults
{
public double[] Doubles { get; set; }
public int[] Lengths { get; set; }
public int[] Offsets { get; set; }
}
public static MergedResults Merge(double[][] sets, int offset, int length)
{
List<int> lengths = new List<int>(length);
List<int> offsets = new List<int>(length);
for (int i = 0; i < length; i++)
{
var arr = sets[i + offset];
lengths.Add(arr.Length);
}
var totalLength = lengths.Sum();
double[] doubles = new double[totalLength];
int dataOffset = 0;
for (int i = 0; i < length; i++)
{
var arr = sets[i + offset];
Array.Copy(arr, 0, doubles, dataOffset, arr.Length);
offsets.Add(dataOffset);
dataOffset += arr.Length;
}
return new MergedResults()
{
Doubles = doubles,
Lengths = lengths.ToArray(),
Offsets = offsets.ToArray(),
};
}
public static IDisposable Timer(string name)
{
return new SWTimer(name);
}
public class SWTimer : IDisposable
{
private Stopwatch _sw;
private string _name;
public SWTimer(string name)
{
_name = name;
_sw = Stopwatch.StartNew();
}
public void Dispose()
{
_sw.Stop();
Console.WriteLine("Task " + _name + ": " + _sw.Elapsed.TotalMilliseconds + "ms");
}
}
public static double[][] CalculateCPU(double[][] arrays, int period)
{
double[][] results = new double[arrays.Length][];
for (var index = 0; index < arrays.Length; index++)
{
var arr = arrays[index];
results[index] = Calculate(arr, period);
}
return results;
}
public static double[][] CalculateCPUParallel(double[][] arrays, int period)
{
double[][] results = new double[arrays.Length][];
Parallel.For(0, arrays.Length, i =>
{
var arr = arrays[i];
results[i] = Calculate(arr, period);
});
return results;
}
static double[] GetRandomDoubles(int num, int randomSeed)
{
Random r = new Random(randomSeed);
var res = new double[num];
for (int i = 0; i < num; i++)
res[i] = r.NextDouble() * 0.9 + 0.05;
return res;
}
}

as commenter Cory stated refer to this link for setup.
How to use your GPU in .NET
Here is how you would use this project:
Add the Nuget Package Cloo
Add reference to OpenCLlib.dll
Download OpenCLLib.zip
Add using OpenCL
static void Main(string[] args)
{
int[] Primes = { 1,2,3,4,5,6,7 };
EasyCL cl = new EasyCL();
cl.Accelerator = AcceleratorDevice.GPU;
cl.LoadKernel(IsPrime);
cl.Invoke("GetIfPrime", 0, Primes.Length, Primes, 1.0);
}
static string IsPrime
{
get
{
return #"
kernel void GetIfPrime(global int* num, int period)
{
int index = get_global_id(0);
int sum = (2.0 / (1.0 + period)) * (num[index] - num[0]);
printf("" %d \n"",sum);
}";
}
}

for (int i = 0; i < num.Length; i++)
{
sum += coeff * (num[i] - sum);
final[i] = sum;
}
means first element is multiplied by coeff 1 time and subtracted from 2nd element. First element also multiplied by square of coeff and this time added to 3rd element. Then first element multiplied by cube of coeff and subtracted from 4th element.
This is going like this:
-e0*c*c*c + e1*c*c - e2*c = f3
e0*c*c*c*c - e1*c*c*c + e2*c*c - e3*c = f4
-e0*c*c*c*c*c + e1*c*c*c*c - e2*c*c*c + e3*c*c - e4*c =f5
For all elements, scan through for all smaller id elements and compute this:
if difference of id values(lets call it k) of elements is odd, take subtraction, if not then take addition. Before addition or subtraction, multiply that value by k-th power of coeff. Lastly, multiply the current num value by coefficient and add it to current cell. Current cell value is final(i).
This is O(N*N) and looks like an all-pairs compute kernel. An example using an open-source C# OpenCL project:
ClNumberCruncher cruncher = new ClNumberCruncher(ClPlatforms.all().gpus(), #"
__kernel void foo(__global double * num, __global double * final, __global int *parameters)
{
int threadId = get_global_id(0);
int period = parameters[0];
double coeff = 2.0 / (1.0 + period);
double sumOfElements = 0.0;
for(int i=0;i<threadId;i++)
{
// negativity of coeff is to select addition or subtraction for different powers of coeff
double powKofCoeff = pow(-coeff,threadId-i);
sumOfElements += powKofCoeff * num[i];
}
final[threadId] = sumOfElements + num[threadId] * coeff;
}
");
cruncher.performanceFeed = true; // getting benchmark feedback on console
double[] numArray = new double[10000];
double[] finalArray = new double[10000];
int[] parameters = new int[10];
int period = 15;
parameters[0] = period;
ClArray<double> numGpuArray = numArray;
numGpuArray.readOnly = true; // gpus read this from host
ClArray<double> finalGpuArray = finalArray; // finalArray will have results
finalGpuArray.writeOnly = true; // gpus write this to host
ClArray<int> parametersGpu = parameters;
parametersGpu.readOnly = true;
// calculate kernels with exact same ordering of parameters
// num(double),final(double),parameters(int)
// finalGpuArray points to __global double * final
numGpuArray.nextParam(finalGpuArray, parametersGpu).compute(cruncher, 1, "foo", 10000, 100);
// first compute always lags because of compiling the kernel so here are repeated computes to get actual performance
numGpuArray.nextParam(finalGpuArray, parametersGpu).compute(cruncher, 1, "foo", 10000, 100);
numGpuArray.nextParam(finalGpuArray, parametersGpu).compute(cruncher, 1, "foo", 10000, 100);
Results are on finalArray array for 10000 elements, using 100 workitems per workitem-group.
GPGPU part takes 82ms on a rx550 gpu which has very low ratio of 64bit-to-32bit compute performance(because consumer gaming cards are not good at double precision for new series). An Nvidia Tesla or an Amd Vega would easily compute this kernel without crippled performance. Fx8150(8 cores) completes in 683ms. If you need to specifically select only an integrated-GPU and its CPU, you can use
ClPlatforms.all().gpus().devicesWithHostMemorySharing() + ClPlatforms.all().cpus() when creating ClNumberCruncher instance.
binaries of api:
https://www.codeproject.com/Articles/1181213/Easy-OpenCL-Multiple-Device-Load-Balancing-and-Pip
or source code to compile on your pc:
https://github.com/tugrul512bit/Cekirdekler
if you have multiple gpus, it uses them without any extra code. Including a cpu to the computations would pull gpu effectiveness down in this sample for first iteration (repeatations complete in 76ms with cpu+gpu) so its better to use 2-3 GPU instead of CPU+GPU.
I didn't check numerical stability(you should use Kahan-Summation when adding millions or more values into same variable but I didn't use it for readability and don't have an idea about if 64-bit values need this too like 32-bit ones) or any value correctness, you should do it. Also foo kernel is not optimized. It makes %50 of core times idle so it should be better scheduled like this:
thread-0: compute element 0 and element N-1
thread-1: compute element 1 and element N-2
thread-m: compute element N/2-1 and element N/2
so all workitems get similar amount of work. On top of this, using 100 for workgroup size is not optimal. It should be something like 128,256,512 or 1024(for Nvidia) but this means array size should also be an integer multiple of this too. Then it would need extra control logic in the kernel to not go out of array borders. For even more performance, for loop could have multiple partial sums to do a "loop unrolling".

Quick Sort Implementation with large numbers [duplicate]

I learnt about quick sort and how it can be implemented in both Recursive and Iterative method.
In Iterative method:
Push the range (0...n) into the stack
Partition the given array with a pivot
Pop the top element.
Push the partitions (index range) onto a stack if the range has more than one element
Do the above 3 steps, till the stack is empty
And the recursive version is the normal one defined in wiki.
I learnt that recursive algorithms are always slower than their iterative counterpart.
So, Which method is preferred in terms of time complexity (memory is not a concern)?
Which one is fast enough to use in Programming contest?
Is c++ STL sort() using a recursive approach?

In terms of (asymptotic) time complexity - they are both the same.
"Recursive is slower then iterative" - the rational behind this statement is because of the overhead of the recursive stack (saving and restoring the environment between calls).
However -these are constant number of ops, while not changing the number of "iterations".
Both recursive and iterative quicksort are O(nlogn) average case and O(n^2) worst case.
EDIT:
just for the fun of it I ran a benchmark with the (java) code attached to the post , and then I ran wilcoxon statistic test, to check what is the probability that the running times are indeed distinct
The results may be conclusive (P_VALUE=2.6e-34, https://en.wikipedia.org/wiki/P-value. Remember that the P_VALUE is P(T >= t | H) where T is the test statistic and H is the null hypothesis). But the answer is not what you expected.
The average of the iterative solution was 408.86 ms while of recursive was 236.81 ms
(Note - I used Integer and not int as argument to recursiveQsort() - otherwise the recursive would have achieved much better, because it doesn't have to box a lot of integers, which is also time consuming - I did it because the iterative solution has no choice but doing so.
Thus - your assumption is not true, the recursive solution is faster (for my machine and java for the very least) than the iterative one with P_VALUE=2.6e-34.
public static void recursiveQsort(int[] arr,Integer start, Integer end) {
if (end - start < 2) return; //stop clause
int p = start + ((end-start)/2);
p = partition(arr,p,start,end);
recursiveQsort(arr, start, p);
recursiveQsort(arr, p+1, end);
}
public static void iterativeQsort(int[] arr) {
Stack<Integer> stack = new Stack<Integer>();
stack.push(0);
stack.push(arr.length);
while (!stack.isEmpty()) {
int end = stack.pop();
int start = stack.pop();
if (end - start < 2) continue;
int p = start + ((end-start)/2);
p = partition(arr,p,start,end);
stack.push(p+1);
stack.push(end);
stack.push(start);
stack.push(p);
}
}
private static int partition(int[] arr, int p, int start, int end) {
int l = start;
int h = end - 2;
int piv = arr[p];
swap(arr,p,end-1);
while (l < h) {
if (arr[l] < piv) {
l++;
} else if (arr[h] >= piv) {
h--;
} else {
swap(arr,l,h);
}
}
int idx = h;
if (arr[h] < piv) idx++;
swap(arr,end-1,idx);
return idx;
}
private static void swap(int[] arr, int i, int j) {
int temp = arr[i];
arr[i] = arr[j];
arr[j] = temp;
}
public static void main(String... args) throws Exception {
Random r = new Random(1);
int SIZE = 1000000;
int N = 100;
int[] arr = new int[SIZE];
int[] millisRecursive = new int[N];
int[] millisIterative = new int[N];
for (int t = 0; t < N; t++) {
for (int i = 0; i < SIZE; i++) {
arr[i] = r.nextInt(SIZE);
}
int[] tempArr = Arrays.copyOf(arr, arr.length);
long start = System.currentTimeMillis();
iterativeQsort(tempArr);
millisIterative[t] = (int)(System.currentTimeMillis()-start);
tempArr = Arrays.copyOf(arr, arr.length);
start = System.currentTimeMillis();
recursvieQsort(tempArr,0,arr.length);
millisRecursive[t] = (int)(System.currentTimeMillis()-start);
}
int sum = 0;
for (int x : millisRecursive) {
System.out.println(x);
sum += x;
}
System.out.println("end of recursive. AVG = " + ((double)sum)/millisRecursive.length);
sum = 0;
for (int x : millisIterative) {
System.out.println(x);
sum += x;
}
System.out.println("end of iterative. AVG = " + ((double)sum)/millisIterative.length);
}

Recursion is NOT always slower than iteration. Quicksort is perfect example of it. The only way to do this in iterate way is create stack structure. So in other way do the same that the compiler do if we use recursion, and propably you will do this worse than compiler. Also there will be more jumps if you don't use recursion (to pop and push values to stack).

That's the solution i came up with in Javascript. I think it works.
const myArr = [33, 103, 3, 726, 200, 984, 198, 764, 9]
document.write('initial order :', JSON.stringify(myArr), '<br><br>')
qs_iter(myArr)
document.write('_Final order :', JSON.stringify(myArr))
function qs_iter(items) {
if (!items || items.length <= 1) {
return items
}
var stack = []
var low = 0
var high = items.length - 1
stack.push([low, high])
while (stack.length) {
var range = stack.pop()
low = range[0]
high = range[1]
if (low < high) {
var pivot = Math.floor((low + high) / 2)
stack.push([low, pivot])
stack.push([pivot + 1, high])
while (low < high) {
while (low < pivot && items[low] <= items[pivot]) low++
while (high > pivot && items[high] > items[pivot]) high--
if (low < high) {
var tmp = items[low]
items[low] = items[high]
items[high] = tmp
}
}
}
}
return items
}
Let me know if you found a mistake :)
Mister Jojo UPDATE :
this code just mixes values that can in rare cases lead to a sort, in other words never.
For those who have a doubt, I put it in snippet.

Generating permutations of a set (most efficiently)

I would like to generate all permutations of a set (a collection), like so:
Collection: 1, 2, 3
Permutations: {1, 2, 3}
{1, 3, 2}
{2, 1, 3}
{2, 3, 1}
{3, 1, 2}
{3, 2, 1}
This isn't a question of "how", in general, but more about how most efficiently.
Also, I wouldn't want to generate ALL permutations and return them, but only generating a single permutation, at a time, and continuing only if necessary (much like Iterators - which I've tried as well, but turned out to be less efficient).
I've tested many algorithms and approaches and came up with this code, which is most efficient of those I tried:
public static bool NextPermutation<T>(T[] elements) where T : IComparable<T>
{
// More efficient to have a variable instead of accessing a property
var count = elements.Length;
// Indicates whether this is the last lexicographic permutation
var done = true;
// Go through the array from last to first
for (var i = count - 1; i > 0; i--)
{
var curr = elements[i];
// Check if the current element is less than the one before it
if (curr.CompareTo(elements[i - 1]) < 0)
{
continue;
}
// An element bigger than the one before it has been found,
// so this isn't the last lexicographic permutation.
done = false;
// Save the previous (bigger) element in a variable for more efficiency.
var prev = elements[i - 1];
// Have a variable to hold the index of the element to swap
// with the previous element (the to-swap element would be
// the smallest element that comes after the previous element
// and is bigger than the previous element), initializing it
// as the current index of the current item (curr).
var currIndex = i;
// Go through the array from the element after the current one to last
for (var j = i + 1; j < count; j++)
{
// Save into variable for more efficiency
var tmp = elements[j];
// Check if tmp suits the "next swap" conditions:
// Smallest, but bigger than the "prev" element
if (tmp.CompareTo(curr) < 0 && tmp.CompareTo(prev) > 0)
{
curr = tmp;
currIndex = j;
}
}
// Swap the "prev" with the new "curr" (the swap-with element)
elements[currIndex] = prev;
elements[i - 1] = curr;
// Reverse the order of the tail, in order to reset it's lexicographic order
for (var j = count - 1; j > i; j--, i++)
{
var tmp = elements[j];
elements[j] = elements[i];
elements[i] = tmp;
}
// Break since we have got the next permutation
// The reason to have all the logic inside the loop is
// to prevent the need of an extra variable indicating "i" when
// the next needed swap is found (moving "i" outside the loop is a
// bad practice, and isn't very readable, so I preferred not doing
// that as well).
break;
}
// Return whether this has been the last lexicographic permutation.
return done;
}
It's usage would be sending an array of elements, and getting back a boolean indicating whether this was the last lexicographical permutation or not, as well as having the array altered to the next permutation.
Usage example:
var arr = new[] {1, 2, 3};
PrintArray(arr);
while (!NextPermutation(arr))
{
PrintArray(arr);
}
The thing is that I'm not happy with the speed of the code.
Iterating over all permutations of an array of size 11 takes about 4 seconds.
Although it could be considered impressive, since the amount of possible permutations of a set of size 11 is 11! which is nearly 40 million.
Logically, with an array of size 12 it will take about 12 times more time, since 12! is 11! * 12, and with an array of size 13 it will take about 13 times more time than the time it took with size 12, and so on.
So you can easily understand how with an array of size 12 and more, it really takes a very long time to go through all permutations.
And I have a strong hunch that I can somehow cut that time by a lot (without switching to a language other than C# - because compiler optimization really does optimize pretty nicely, and I doubt I could optimize as good, manually, in Assembly).
Does anyone know any other way to get that done faster?
Do you have any idea as to how to make the current algorithm faster?
Note that I don't want to use an external library or service in order to do that - I want to have the code itself and I want it to be as efficient as humanly possible.

This might be what you're looking for.
private static bool NextPermutation(int[] numList)
{
/*
Knuths
1. Find the largest index j such that a[j] < a[j + 1]. If no such index exists, the permutation is the last permutation.
2. Find the largest index l such that a[j] < a[l]. Since j + 1 is such an index, l is well defined and satisfies j < l.
3. Swap a[j] with a[l].
4. Reverse the sequence from a[j + 1] up to and including the final element a[n].
*/
var largestIndex = -1;
for (var i = numList.Length - 2; i >= 0; i--)
{
if (numList[i] < numList[i + 1]) {
largestIndex = i;
break;
}
}
if (largestIndex < 0) return false;
var largestIndex2 = -1;
for (var i = numList.Length - 1 ; i >= 0; i--) {
if (numList[largestIndex] < numList[i]) {
largestIndex2 = i;
break;
}
}
var tmp = numList[largestIndex];
numList[largestIndex] = numList[largestIndex2];
numList[largestIndex2] = tmp;
for (int i = largestIndex + 1, j = numList.Length - 1; i < j; i++, j--) {
tmp = numList[i];
numList[i] = numList[j];
numList[j] = tmp;
}
return true;
}

Update 2018-05-28:
A new multithreaded version (lot faster) is available below as another answer.
Also an article about permutation: Permutations: Fast implementations and a new indexing algorithm allowing multithreading
A little bit too late...
According to recent tests (updated 2018-05-22)
Fastest is mine BUT not in lexicographic order
For fastest lexicographic order, Sani Singh Huttunen solution seems to be the way to go.
Performance test results for 10 items (10!) in release on my machine (millisecs):
Ouellet : 29
SimpleVar: 95
Erez Robinson : 156
Sani Singh Huttunen : 37
Pengyang : 45047
Performance test results for 13 items (13!) in release on my machine (seconds):
Ouellet : 48.437
SimpleVar: 159.869
Erez Robinson : 327.781
Sani Singh Huttunen : 64.839
Advantages of my solution:
Heap's algorithm (Single swap per permutation)
No multiplication (like some implementations seen on the web)
Inlined swap
Generic
No unsafe code
In place (very low memory usage)
No modulo (only first bit compare)
My implementation of Heap's algorithm:
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;
using System.Runtime.CompilerServices;
namespace WpfPermutations
{
/// <summary>
/// EO: 2016-04-14
/// Generator of all permutations of an array of anything.
/// Base on Heap's Algorithm. See: https://en.wikipedia.org/wiki/Heap%27s_algorithm#cite_note-3
/// </summary>
public static class Permutations
{
/// <summary>
/// Heap's algorithm to find all pmermutations. Non recursive, more efficient.
/// </summary>
/// <param name="items">Items to permute in each possible ways</param>
/// <param name="funcExecuteAndTellIfShouldStop"></param>
/// <returns>Return true if cancelled</returns>
public static bool ForAllPermutation<T>(T[] items, Func<T[], bool> funcExecuteAndTellIfShouldStop)
{
int countOfItem = items.Length;
if (countOfItem <= 1)
{
return funcExecuteAndTellIfShouldStop(items);
}
var indexes = new int[countOfItem];
// Unecessary. Thanks to NetManage for the advise
// for (int i = 0; i < countOfItem; i++)
// {
// indexes[i] = 0;
// }
if (funcExecuteAndTellIfShouldStop(items))
{
return true;
}
for (int i = 1; i < countOfItem;)
{
if (indexes[i] < i)
{ // On the web there is an implementation with a multiplication which should be less efficient.
if ((i & 1) == 1) // if (i % 2 == 1) ... more efficient ??? At least the same.
{
Swap(ref items[i], ref items[indexes[i]]);
}
else
{
Swap(ref items[i], ref items[0]);
}
if (funcExecuteAndTellIfShouldStop(items))
{
return true;
}
indexes[i]++;
i = 1;
}
else
{
indexes[i++] = 0;
}
}
return false;
}
/// <summary>
/// This function is to show a linq way but is far less efficient
/// From: StackOverflow user: Pengyang : http://stackoverflow.com/questions/756055/listing-all-permutations-of-a-string-integer
/// </summary>
/// <typeparam name="T"></typeparam>
/// <param name="list"></param>
/// <param name="length"></param>
/// <returns></returns>
static IEnumerable<IEnumerable<T>> GetPermutations<T>(IEnumerable<T> list, int length)
{
if (length == 1) return list.Select(t => new T[] { t });
return GetPermutations(list, length - 1)
.SelectMany(t => list.Where(e => !t.Contains(e)),
(t1, t2) => t1.Concat(new T[] { t2 }));
}
/// <summary>
/// Swap 2 elements of same type
/// </summary>
/// <typeparam name="T"></typeparam>
/// <param name="a"></param>
/// <param name="b"></param>
[MethodImpl(MethodImplOptions.AggressiveInlining)]
static void Swap<T>(ref T a, ref T b)
{
T temp = a;
a = b;
b = temp;
}
/// <summary>
/// Func to show how to call. It does a little test for an array of 4 items.
/// </summary>
public static void Test()
{
ForAllPermutation("123".ToCharArray(), (vals) =>
{
Console.WriteLine(String.Join("", vals));
return false;
});
int[] values = new int[] { 0, 1, 2, 4 };
Console.WriteLine("Ouellet heap's algorithm implementation");
ForAllPermutation(values, (vals) =>
{
Console.WriteLine(String.Join("", vals));
return false;
});
Console.WriteLine("Linq algorithm");
foreach (var v in GetPermutations(values, values.Length))
{
Console.WriteLine(String.Join("", v));
}
// Performance Heap's against Linq version : huge differences
int count = 0;
values = new int[10];
for (int n = 0; n < values.Length; n++)
{
values[n] = n;
}
Stopwatch stopWatch = new Stopwatch();
ForAllPermutation(values, (vals) =>
{
foreach (var v in vals)
{
count++;
}
return false;
});
stopWatch.Stop();
Console.WriteLine($"Ouellet heap's algorithm implementation {count} items in {stopWatch.ElapsedMilliseconds} millisecs");
count = 0;
stopWatch.Reset();
stopWatch.Start();
foreach (var vals in GetPermutations(values, values.Length))
{
foreach (var v in vals)
{
count++;
}
}
stopWatch.Stop();
Console.WriteLine($"Linq {count} items in {stopWatch.ElapsedMilliseconds} millisecs");
}
}
}
An this is my test code:
Task.Run(() =>
{
int[] values = new int[12];
for (int n = 0; n < values.Length; n++)
{
values[n] = n;
}
// Eric Ouellet Algorithm
int count = 0;
var stopwatch = new Stopwatch();
stopwatch.Reset();
stopwatch.Start();
Permutations.ForAllPermutation(values, (vals) =>
{
foreach (var v in vals)
{
count++;
}
return false;
});
stopwatch.Stop();
Console.WriteLine($"This {count} items in {stopwatch.ElapsedMilliseconds} millisecs");
// Simple Plan Algorithm
count = 0;
stopwatch.Reset();
stopwatch.Start();
PermutationsSimpleVar permutations2 = new PermutationsSimpleVar();
permutations2.Permutate(1, values.Length, (int[] vals) =>
{
foreach (var v in vals)
{
count++;
}
});
stopwatch.Stop();
Console.WriteLine($"Simple Plan {count} items in {stopwatch.ElapsedMilliseconds} millisecs");
// ErezRobinson Algorithm
count = 0;
stopwatch.Reset();
stopwatch.Start();
foreach(var vals in PermutationsErezRobinson.QuickPerm(values))
{
foreach (var v in vals)
{
count++;
}
};
stopwatch.Stop();
Console.WriteLine($"Erez Robinson {count} items in {stopwatch.ElapsedMilliseconds} millisecs");
});
Usage examples:
ForAllPermutation("123".ToCharArray(), (vals) =>
{
Console.WriteLine(String.Join("", vals));
return false;
});
int[] values = new int[] { 0, 1, 2, 4 };
ForAllPermutation(values, (vals) =>
{
Console.WriteLine(String.Join("", vals));
return false;
});

Well, if you can handle it in C and then translate to your language of choice, you can't really go much faster than this, because the time will be dominated by print:
void perm(char* s, int n, int i){
if (i >= n-1) print(s);
else {
perm(s, n, i+1);
for (int j = i+1; j<n; j++){
swap(s[i], s[j]);
perm(s, n, i+1);
swap(s[i], s[j]);
}
}
}
perm("ABC", 3, 0);

Update 2018-05-28, a new version, the fastest ... (multi-threaded)
Time taken for fastest algorithms
Need: Sani Singh Huttunen (fastest lexico) solution and my new OuelletLexico3 which support indexing
Indexing has 2 main advantages:
allows to get anyone permutation directly
allows multi-threading (derived from the first advantage)
Article: Permutations: Fast implementations and a new indexing algorithm allowing multithreading
On my machine (6 hyperthread cores : 12 threads) Xeon E5-1660 0 # 3.30Ghz, tests algorithms running with empty stuff to do for 13! items (time in millisecs):
53071: Ouellet (implementation of Heap)
65366: Sani Singh Huttunen (Fastest lexico)
11377: Mix OuelletLexico3 - Sani Singh Huttunen
A side note: using shares properties/variables between threads for permutation action will strongly impact performance if their usage is modification (read / write). Doing so will generate "false sharing" between threads. You will not get expected performance. I got this behavior while testing. My experience showed problems when I try to increase the global variable for the total count of permutation.
Usage:
PermutationMixOuelletSaniSinghHuttunen.ExecuteForEachPermutationMT(
new int[] {1, 2, 3, 4},
p =>
{
Console.WriteLine($"Values: {p[0]}, {p[1]}, p[2]}, {p[3]}");
});
Code:
using System;
using System.Runtime.CompilerServices;
namespace WpfPermutations
{
public class Factorial
{
// ************************************************************************
protected static long[] FactorialTable = new long[21];
// ************************************************************************
static Factorial()
{
FactorialTable[0] = 1; // To prevent divide by 0
long f = 1;
for (int i = 1; i <= 20; i++)
{
f = f * i;
FactorialTable[i] = f;
}
}
// ************************************************************************
[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static long GetFactorial(int val) // a long can only support up to 20!
{
if (val > 20)
{
throw new OverflowException($"{nameof(Factorial)} only support a factorial value <= 20");
}
return FactorialTable[val];
}
// ************************************************************************
}
}
namespace WpfPermutations
{
public class PermutationSaniSinghHuttunen
{
public static bool NextPermutation(int[] numList)
{
/*
Knuths
1. Find the largest index j such that a[j] < a[j + 1]. If no such index exists, the permutation is the last permutation.
2. Find the largest index l such that a[j] < a[l]. Since j + 1 is such an index, l is well defined and satisfies j < l.
3. Swap a[j] with a[l].
4. Reverse the sequence from a[j + 1] up to and including the final element a[n].
*/
var largestIndex = -1;
for (var i = numList.Length - 2; i >= 0; i--)
{
if (numList[i] < numList[i + 1])
{
largestIndex = i;
break;
}
}
if (largestIndex < 0) return false;
var largestIndex2 = -1;
for (var i = numList.Length - 1; i >= 0; i--)
{
if (numList[largestIndex] < numList[i])
{
largestIndex2 = i;
break;
}
}
var tmp = numList[largestIndex];
numList[largestIndex] = numList[largestIndex2];
numList[largestIndex2] = tmp;
for (int i = largestIndex + 1, j = numList.Length - 1; i < j; i++, j--)
{
tmp = numList[i];
numList[i] = numList[j];
numList[j] = tmp;
}
return true;
}
}
}
using System;
namespace WpfPermutations
{
public class PermutationOuelletLexico3<T> // Enable indexing
{
// ************************************************************************
private T[] _sortedValues;
private bool[] _valueUsed;
public readonly long MaxIndex; // long to support 20! or less
// ************************************************************************
public PermutationOuelletLexico3(T[] sortedValues)
{
_sortedValues = sortedValues;
Result = new T[_sortedValues.Length];
_valueUsed = new bool[_sortedValues.Length];
MaxIndex = Factorial.GetFactorial(_sortedValues.Length);
}
// ************************************************************************
public T[] Result { get; private set; }
// ************************************************************************
/// <summary>
/// Sort Index is 0 based and should be less than MaxIndex. Otherwise you get an exception.
/// </summary>
/// <param name="sortIndex"></param>
/// <param name="result">Value is not used as inpu, only as output. Re-use buffer in order to save memory</param>
/// <returns></returns>
public void GetSortedValuesFor(long sortIndex)
{
int size = _sortedValues.Length;
if (sortIndex < 0)
{
throw new ArgumentException("sortIndex should greater or equal to 0.");
}
if (sortIndex >= MaxIndex)
{
throw new ArgumentException("sortIndex should less than factorial(the lenght of items)");
}
for (int n = 0; n < _valueUsed.Length; n++)
{
_valueUsed[n] = false;
}
long factorielLower = MaxIndex;
for (int index = 0; index < size; index++)
{
long factorielBigger = factorielLower;
factorielLower = Factorial.GetFactorial(size - index - 1); // factorielBigger / inverseIndex;
int resultItemIndex = (int)(sortIndex % factorielBigger / factorielLower);
int correctedResultItemIndex = 0;
for(;;)
{
if (! _valueUsed[correctedResultItemIndex])
{
resultItemIndex--;
if (resultItemIndex < 0)
{
break;
}
}
correctedResultItemIndex++;
}
Result[index] = _sortedValues[correctedResultItemIndex];
_valueUsed[correctedResultItemIndex] = true;
}
}
// ************************************************************************
}
}
using System;
using System.Collections.Generic;
using System.Threading.Tasks;
namespace WpfPermutations
{
public class PermutationMixOuelletSaniSinghHuttunen
{
// ************************************************************************
private long _indexFirst;
private long _indexLastExclusive;
private int[] _sortedValues;
// ************************************************************************
public PermutationMixOuelletSaniSinghHuttunen(int[] sortedValues, long indexFirst = -1, long indexLastExclusive = -1)
{
if (indexFirst == -1)
{
indexFirst = 0;
}
if (indexLastExclusive == -1)
{
indexLastExclusive = Factorial.GetFactorial(sortedValues.Length);
}
if (indexFirst >= indexLastExclusive)
{
throw new ArgumentException($"{nameof(indexFirst)} should be less than {nameof(indexLastExclusive)}");
}
_indexFirst = indexFirst;
_indexLastExclusive = indexLastExclusive;
_sortedValues = sortedValues;
}
// ************************************************************************
public void ExecuteForEachPermutation(Action<int[]> action)
{
// Console.WriteLine($"Thread {System.Threading.Thread.CurrentThread.ManagedThreadId} started: {_indexFirst} {_indexLastExclusive}");
long index = _indexFirst;
PermutationOuelletLexico3<int> permutationOuellet = new PermutationOuelletLexico3<int>(_sortedValues);
permutationOuellet.GetSortedValuesFor(index);
action(permutationOuellet.Result);
index++;
int[] values = permutationOuellet.Result;
while (index < _indexLastExclusive)
{
PermutationSaniSinghHuttunen.NextPermutation(values);
action(values);
index++;
}
// Console.WriteLine($"Thread {System.Threading.Thread.CurrentThread.ManagedThreadId} ended: {DateTime.Now.ToString("yyyyMMdd_HHmmss_ffffff")}");
}
// ************************************************************************
public static void ExecuteForEachPermutationMT(int[] sortedValues, Action<int[]> action)
{
int coreCount = Environment.ProcessorCount; // Hyper treading are taken into account (ex: on a 4 cores hyperthreaded = 8)
long itemsFactorial = Factorial.GetFactorial(sortedValues.Length);
long partCount = (long)Math.Ceiling((double)itemsFactorial / (double)coreCount);
long startIndex = 0;
var tasks = new List<Task>();
for (int coreIndex = 0; coreIndex < coreCount; coreIndex++)
{
long stopIndex = Math.Min(startIndex + partCount, itemsFactorial);
PermutationMixOuelletSaniSinghHuttunen mix = new PermutationMixOuelletSaniSinghHuttunen(sortedValues, startIndex, stopIndex);
Task task = Task.Run(() => mix.ExecuteForEachPermutation(action));
tasks.Add(task);
if (stopIndex == itemsFactorial)
{
break;
}
startIndex = startIndex + partCount;
}
Task.WaitAll(tasks.ToArray());
}
// ************************************************************************
}
}

The fastest permutation algorithm that i know of is the QuickPerm algorithm.
Here is the implementation, it uses yield return so you can iterate one at a time like required.
Code:
public static IEnumerable<IEnumerable<T>> QuickPerm<T>(this IEnumerable<T> set)
{
int N = set.Count();
int[] a = new int[N];
int[] p = new int[N];
var yieldRet = new T[N];
List<T> list = new List<T>(set);
int i, j, tmp; // Upper Index i; Lower Index j
for (i = 0; i < N; i++)
{
// initialize arrays; a[N] can be any type
a[i] = i + 1; // a[i] value is not revealed and can be arbitrary
p[i] = 0; // p[i] == i controls iteration and index boundaries for i
}
yield return list;
//display(a, 0, 0); // remove comment to display array a[]
i = 1; // setup first swap points to be 1 and 0 respectively (i & j)
while (i < N)
{
if (p[i] < i)
{
j = i%2*p[i]; // IF i is odd then j = p[i] otherwise j = 0
tmp = a[j]; // swap(a[j], a[i])
a[j] = a[i];
a[i] = tmp;
//MAIN!
for (int x = 0; x < N; x++)
{
yieldRet[x] = list[a[x]-1];
}
yield return yieldRet;
//display(a, j, i); // remove comment to display target array a[]
// MAIN!
p[i]++; // increase index "weight" for i by one
i = 1; // reset index i to 1 (assumed)
}
else
{
// otherwise p[i] == i
p[i] = 0; // reset p[i] to zero
i++; // set new index value for i (increase by one)
} // if (p[i] < i)
} // while(i < N)
}

Here is the fastest implementation I ended up with:
public class Permutations
{
private readonly Mutex _mutex = new Mutex();
private Action<int[]> _action;
private Action<IntPtr> _actionUnsafe;
private unsafe int* _arr;
private IntPtr _arrIntPtr;
private unsafe int* _last;
private unsafe int* _lastPrev;
private unsafe int* _lastPrevPrev;
public int Size { get; private set; }
public bool IsRunning()
{
return this._mutex.SafeWaitHandle.IsClosed;
}
public bool Permutate(int start, int count, Action<int[]> action, bool async = false)
{
return this.Permutate(start, count, action, null, async);
}
public bool Permutate(int start, int count, Action<IntPtr> actionUnsafe, bool async = false)
{
return this.Permutate(start, count, null, actionUnsafe, async);
}
private unsafe bool Permutate(int start, int count, Action<int[]> action, Action<IntPtr> actionUnsafe, bool async = false)
{
if (!this._mutex.WaitOne(0))
{
return false;
}
var x = (Action)(() =>
{
this._actionUnsafe = actionUnsafe;
this._action = action;
this.Size = count;
this._arr = (int*)Marshal.AllocHGlobal(count * sizeof(int));
this._arrIntPtr = new IntPtr(this._arr);
for (var i = 0; i < count - 3; i++)
{
this._arr[i] = start + i;
}
this._last = this._arr + count - 1;
this._lastPrev = this._last - 1;
this._lastPrevPrev = this._lastPrev - 1;
*this._last = count - 1;
*this._lastPrev = count - 2;
*this._lastPrevPrev = count - 3;
this.Permutate(count, this._arr);
});
if (!async)
{
x();
}
else
{
new Thread(() => x()).Start();
}
return true;
}
private unsafe void Permutate(int size, int* start)
{
if (size == 3)
{
this.DoAction();
Swap(this._last, this._lastPrev);
this.DoAction();
Swap(this._last, this._lastPrevPrev);
this.DoAction();
Swap(this._last, this._lastPrev);
this.DoAction();
Swap(this._last, this._lastPrevPrev);
this.DoAction();
Swap(this._last, this._lastPrev);
this.DoAction();
return;
}
var sizeDec = size - 1;
var startNext = start + 1;
var usedStarters = 0;
for (var i = 0; i < sizeDec; i++)
{
this.Permutate(sizeDec, startNext);
usedStarters |= 1 << *start;
for (var j = startNext; j <= this._last; j++)
{
var mask = 1 << *j;
if ((usedStarters & mask) != mask)
{
Swap(start, j);
break;
}
}
}
this.Permutate(sizeDec, startNext);
if (size == this.Size)
{
this._mutex.ReleaseMutex();
}
}
private unsafe void DoAction()
{
if (this._action == null)
{
if (this._actionUnsafe != null)
{
this._actionUnsafe(this._arrIntPtr);
}
return;
}
var result = new int[this.Size];
fixed (int* pt = result)
{
var limit = pt + this.Size;
var resultPtr = pt;
var arrayPtr = this._arr;
while (resultPtr < limit)
{
*resultPtr = *arrayPtr;
resultPtr++;
arrayPtr++;
}
}
this._action(result);
}
private static unsafe void Swap(int* a, int* b)
{
var tmp = *a;
*a = *b;
*b = tmp;
}
}
Usage and testing performance:
var perms = new Permutations();
var sw1 = Stopwatch.StartNew();
perms.Permutate(0,
11,
(Action<int[]>)null); // Comment this line and...
//PrintArr); // Uncomment this line, to print permutations
sw1.Stop();
Console.WriteLine(sw1.Elapsed);
Printing method:
private static void PrintArr(int[] arr)
{
Console.WriteLine(string.Join(",", arr));
}
Going deeper:
I did not even think about this for a very long time, so I can only explain my code so much, but here's the general idea:
Permutations aren't lexicographic - this allows me to practically perform less operations between permutations.
The implementation is recursive, and when the "view" size is 3, it skips the complex logic and just performs 6 swaps to get the 6 permutations (or sub-permutations, if you will).
Because the permutations aren't in a lexicographic order, how can I decide which element to bring to the start of the current "view" (sub permutation)? I keep record of elements that were already used as "starters" in the current sub-permutation recursive call and simply search linearly for one that wasn't used in the tail of my array.
The implementation is for integers only, so to permute over a generic collection of elements you simply use the Permutations class to permute indices instead of your actual collection.
The Mutex is there just to ensure things don't get screwed when the execution is asynchronous (notice that you can pass an UnsafeAction parameter that will in turn get a pointer to the permuted array. You must not change the order of elements in that array (pointer)! If you want to, you should copy the array to a tmp array or just use the safe action parameter which takes care of that for you - the passed array is already a copy).
Note:
I have no idea how good this implementation really is - I haven't touched it in so long.
Test and compare to other implementations on your own, and let me know if you have any feedback!
Enjoy.

Here is a generic permutation finder that will iterate through every permutation of a collection and call an evalution function. If the evalution function returns true (it found the answer it was looking for), the permutation finder stops processing.
public class PermutationFinder<T>
{
private T[] items;
private Predicate<T[]> SuccessFunc;
private bool success = false;
private int itemsCount;
public void Evaluate(T[] items, Predicate<T[]> SuccessFunc)
{
this.items = items;
this.SuccessFunc = SuccessFunc;
this.itemsCount = items.Count();
Recurse(0);
}
private void Recurse(int index)
{
T tmp;
if (index == itemsCount)
success = SuccessFunc(items);
else
{
for (int i = index; i < itemsCount; i++)
{
tmp = items[index];
items[index] = items[i];
items[i] = tmp;
Recurse(index + 1);
if (success)
break;
tmp = items[index];
items[index] = items[i];
items[i] = tmp;
}
}
}
}
Here is a simple implementation:
class Program
{
static void Main(string[] args)
{
new Program().Start();
}
void Start()
{
string[] items = new string[5];
items[0] = "A";
items[1] = "B";
items[2] = "C";
items[3] = "D";
items[4] = "E";
new PermutationFinder<string>().Evaluate(items, Evaluate);
Console.ReadLine();
}
public bool Evaluate(string[] items)
{
Console.WriteLine(string.Format("{0},{1},{2},{3},{4}", items[0], items[1], items[2], items[3], items[4]));
bool someCondition = false;
if (someCondition)
return true; // Tell the permutation finder to stop.
return false;
}
}

Here is a recursive implementation with complexity O(n * n!)1 based on swapping of the elements of an array. The array is initialised with values from 1, 2, ..., n.
using System;
namespace Exercise
{
class Permutations
{
static void Main(string[] args)
{
int setSize = 3;
FindPermutations(setSize);
}
//-----------------------------------------------------------------------------
/* Method: FindPermutations(n) */
private static void FindPermutations(int n)
{
int[] arr = new int[n];
for (int i = 0; i < n; i++)
{
arr[i] = i + 1;
}
int iEnd = arr.Length - 1;
Permute(arr, iEnd);
}
//-----------------------------------------------------------------------------
/* Method: Permute(arr) */
private static void Permute(int[] arr, int iEnd)
{
if (iEnd == 0)
{
PrintArray(arr);
return;
}
Permute(arr, iEnd - 1);
for (int i = 0; i < iEnd; i++)
{
swap(ref arr[i], ref arr[iEnd]);
Permute(arr, iEnd - 1);
swap(ref arr[i], ref arr[iEnd]);
}
}
}
}
On each recursive step we swap the last element with the current element pointed to by the local variable in the for loop and then we indicate the uniqueness of the swapping by: incrementing the local variable of the for loop and decrementing the termination condition of the for loop, which is initially set to the number of the elements in the array, when the latter becomes zero we terminate the recursion.
Here are the helper functions:
//-----------------------------------------------------------------------------
/*
Method: PrintArray()
*/
private static void PrintArray(int[] arr, string label = "")
{
Console.WriteLine(label);
Console.Write("{");
for (int i = 0; i < arr.Length; i++)
{
Console.Write(arr[i]);
if (i < arr.Length - 1)
{
Console.Write(", ");
}
}
Console.WriteLine("}");
}
//-----------------------------------------------------------------------------
/*
Method: swap(ref int a, ref int b)
*/
private static void swap(ref int a, ref int b)
{
int temp = a;
a = b;
b = temp;
}
1. There are n! permutations of n elements to be printed.

I would be surprised if there are really order of magnitude improvements to be found. If there are, then C# needs fundamental improvement. Furthermore doing anything interesting with your permutation will generally take more work than generating it. So the cost of generating is going to be insignificant in the overall scheme of things.
That said, I would suggest trying the following things. You have already tried iterators. But have you tried having a function that takes a closure as input, then then calls that closure for each permutation found? Depending on internal mechanics of C#, this may be faster.
Similarly, have you tried having a function that returns a closure that will iterate over a specific permutation?
With either approach, there are a number of micro-optimizations you can experiment with. For instance you can sort your input array, and after that you always know what order it is in. For example you can have an array of bools indicating whether that element is less than the next one, and rather than do comparisons, you can just look at that array.

There's an accessible introduction to the algorithms and survey of implementations in Steven Skiena's Algorithm Design Manual (chapter 14.4 in the second edition)
Skiena references D. Knuth. The Art of Computer Programming, Volume 4 Fascicle 2: Generating All Tuples and Permutations. Addison Wesley, 2005.

I created an algorithm slightly faster than Knuth's one:
11 elements:
mine: 0.39 seconds
Knuth's: 0.624 seconds
13 elements:
mine: 56.615 seconds
Knuth's: 98.681 seconds
Here's my code in Java:
public static void main(String[] args)
{
int n=11;
int a,b,c,i,tmp;
int end=(int)Math.floor(n/2);
int[][] pos = new int[end+1][2];
int[] perm = new int[n];
for(i=0;i<n;i++) perm[i]=i;
while(true)
{
//this is where you can use the permutations (perm)
i=0;
c=n;
while(pos[i][1]==c-2 && pos[i][0]==c-1)
{
pos[i][0]=0;
pos[i][1]=0;
i++;
c-=2;
}
if(i==end) System.exit(0);
a=(pos[i][0]+1)%c+i;
b=pos[i][0]+i;
tmp=perm[b];
perm[b]=perm[a];
perm[a]=tmp;
if(pos[i][0]==c-1)
{
pos[i][0]=0;
pos[i][1]++;
}
else
{
pos[i][0]++;
}
}
}
The problem is my algorithm only works for odd numbers of elements. I wrote this code quickly so I'm pretty sure there's a better way to implement my idea to get better performance, but I don't really have the time to work on it right now to optimize it and solve the issue when the number of elements is even.
It's one swap for every permutation and it uses a really simple way to know which elements to swap.
I wrote an explanation of the method behind the code on my blog: http://antoinecomeau.blogspot.ca/2015/01/fast-generation-of-all-permutations.html

As the author of this question was asking about an algorithm:
[...] generating a single permutation, at a time, and continuing only if necessary
I would suggest considering Steinhaus–Johnson–Trotter algorithm.
Steinhaus–Johnson–Trotter algorithm on Wikipedia
Beautifully explained here

It's 1 am and I was watching TV and thought of this same question, but with string values.
Given a word find all permutations. You can easily modify this to handle an array, sets, etc.
Took me a bit to work it out, but the solution I came up was this:
string word = "abcd";
List<string> combinations = new List<string>();
for(int i=0; i<word.Length; i++)
{
for (int j = 0; j < word.Length; j++)
{
if (i < j)
combinations.Add(word[i] + word.Substring(j) + word.Substring(0, i) + word.Substring(i + 1, j - (i + 1)));
else if (i > j)
{
if(i== word.Length -1)
combinations.Add(word[i] + word.Substring(0, i));
else
combinations.Add(word[i] + word.Substring(0, i) + word.Substring(i + 1));
}
}
}
Here's the same code as above, but with some comments
string word = "abcd";
List<string> combinations = new List<string>();
//i is the first letter of the new word combination
for(int i=0; i<word.Length; i++)
{
for (int j = 0; j < word.Length; j++)
{
//add the first letter of the word, j is past i so we can get all the letters from j to the end
//then add all the letters from the front to i, then skip over i (since we already added that as the beginning of the word)
//and get the remaining letters from i+1 to right before j.
if (i < j)
combinations.Add(word[i] + word.Substring(j) + word.Substring(0, i) + word.Substring(i + 1, j - (i + 1)));
else if (i > j)
{
//if we're at the very last word no need to get the letters after i
if(i== word.Length -1)
combinations.Add(word[i] + word.Substring(0, i));
//add i as the first letter of the word, then get all the letters up to i, skip i, and then add all the lettes after i
else
combinations.Add(word[i] + word.Substring(0, i) + word.Substring(i + 1));
}
}
}

//+------------------------------------------------------------------+
//| |
//+------------------------------------------------------------------+
/**
* http://marknelson.us/2002/03/01/next-permutation/
* Rearranges the elements into the lexicographically next greater permutation and returns true.
* When there are no more greater permutations left, the function eventually returns false.
*/
// next lexicographical permutation
template <typename T>
bool next_permutation(T &arr[], int firstIndex, int lastIndex)
{
int i = lastIndex;
while (i > firstIndex)
{
int ii = i--;
T curr = arr[i];
if (curr < arr[ii])
{
int j = lastIndex;
while (arr[j] <= curr) j--;
Swap(arr[i], arr[j]);
while (ii < lastIndex)
Swap(arr[ii++], arr[lastIndex--]);
return true;
}
}
return false;
}
//+------------------------------------------------------------------+
//| |
//+------------------------------------------------------------------+
/**
* Swaps two variables or two array elements.
* using references/pointers to speed up swapping.
*/
template<typename T>
void Swap(T &var1, T &var2)
{
T temp;
temp = var1;
var1 = var2;
var2 = temp;
}
//+------------------------------------------------------------------+
//| |
//+------------------------------------------------------------------+
// driver program to test above function
#define N 3
void OnStart()
{
int i, x[N];
for (i = 0; i < N; i++) x[i] = i + 1;
printf("The %i! possible permutations with %i elements:", N, N);
do
{
printf("%s", ArrayToString(x));
} while (next_permutation(x, 0, N - 1));
}
// Output:
// The 3! possible permutations with 3 elements:
// "1,2,3"
// "1,3,2"
// "2,1,3"
// "2,3,1"
// "3,1,2"
// "3,2,1"

// Permutations are the different ordered arrangements of an n-element
// array. An n-element array has exactly n! full-length permutations.
// This iterator object allows to iterate all full length permutations
// one by one of an array of n distinct elements.
// The iterator changes the given array in-place.
// Permutations('ABCD') => ABCD DBAC ACDB DCBA
// BACD BDAC CADB CDBA
// CABD ADBC DACB BDCA
// ACBD DABC ADCB DBCA
// BCAD BADC CDAB CBDA
// CBAD ABDC DCAB BCDA
// count of permutations = n!
// Heap's algorithm (Single swap per permutation)
// http://www.quickperm.org/quickperm.php
// https://stackoverflow.com/a/36634935/4208440
// https://en.wikipedia.org/wiki/Heap%27s_algorithm
// My implementation of Heap's algorithm:
template<typename T>
class PermutationsIterator
{
int b, e, n;
int c[32]; /* control array: mixed radix number in rising factorial base.
the i-th digit has base i, which means that the digit must be
strictly less than i. The first digit is always 0, the second
can be 0 or 1, the third 0, 1 or 2, and so on.
ArrayResize isn't strictly necessary, int c[32] would suffice
for most practical purposes. Also, it is much faster */
public:
PermutationsIterator(T &arr[], int firstIndex, int lastIndex)
{
this.b = firstIndex; // v.begin()
this.e = lastIndex; // v.end()
this.n = e - b + 1;
ArrayInitialize(c, 0);
}
// Rearranges the input array into the next permutation and returns true.
// When there are no more permutations left, the function returns false.
bool next(T &arr[])
{
// find index to update
int i = 1;
// reset all the previous indices that reached the maximum possible values
while (c[i] == i)
{
c[i] = 0;
++i;
}
// no more permutations left
if (i == n)
return false;
// generate next permutation
int j = (i & 1) == 1 ? c[i] : 0; // IF i is odd then j = c[i] otherwise j = 0.
swap(arr[b + j], arr[b + i]); // generate a new permutation from previous permutation using a single swap
// Increment that index
++c[i];
return true;
}
};

I found this algo on rosetta code and it is really the fastest one I tried. http://rosettacode.org/wiki/Permutations#C
/* Boothroyd method; exactly N! swaps, about as fast as it gets */
void boothroyd(int *x, int n, int nn, int callback(int *, int))
{
int c = 0, i, t;
while (1) {
if (n > 2) boothroyd(x, n - 1, nn, callback);
if (c >= n - 1) return;
i = (n & 1) ? 0 : c;
c++;
t = x[n - 1], x[n - 1] = x[i], x[i] = t;
if (callback) callback(x, nn);
}
}
/* entry for Boothroyd method */
void perm2(int *x, int n, int callback(int*, int))
{
if (callback) callback(x, n);
boothroyd(x, n, n, callback);
}

If you just want to calculate the number of possible permutations you can avoid all that hard work above and use something like this (contrived in c#):
public static class ContrivedUtils
{
public static Int64 Permutations(char[] array)
{
if (null == array || array.Length == 0) return 0;
Int64 permutations = array.Length;
for (var pos = permutations; pos > 1; pos--)
permutations *= pos - 1;
return permutations;
}
}
You call it like this:
var permutations = ContrivedUtils.Permutations("1234".ToCharArray());
// output is: 24
var permutations = ContrivedUtils.Permutations("123456789".ToCharArray());
// output is: 362880

Simple C# recursive solution by swapping, for the initial call the index must be 0
static public void Permute<T>(List<T> input, List<List<T>> permutations, int index)
{
if (index == input.Count - 1)
{
permutations.Add(new List<T>(input));
return;
}
Permute(input, permutations, index + 1);
for (int i = index+1 ; i < input.Count; i++)
{
//swap
T temp = input[index];
input[index] = input[i];
input[i] = temp;
Permute(input, permutations, index + 1);
//swap back
temp = input[index];
input[index] = input[i];
input[i] = temp;
}
}

Is this parallel sort merge implemented correctly?

Is this parallel merge sort implemented correctly? It looks correct, I took the 40seconds to write a test and it hasnt failed.
The gist of it is i need to sort by splitting the array in half every time. Then i tried to make sure i go wrong and asked a question for a sanity check (my own sanity). I wanted an in place sort but decided that it was way to complicated when seeing the answer, so i implemented the below.
Granted there's no point creating a task/thread to sort a 4 byte array but its to learn threading. Is there anything wrong or anything i can change to make this better. To me it looks perfect but i'd like some general feedback.
static void Main(string[] args)
{
var start = DateTime.Now;
//for (int z = 0; z < 1000000; z++)
int z = 0;
while(true)
{
var curr = DateTime.Now;
if (curr - start > TimeSpan.FromMinutes(1))
break;
var arr = new byte[] { 5, 3, 1, 7, 8, 5, 3, 2, 6, 7, 9, 3, 2, 4, 2, 1 };
Sort(arr, 0, arr.Length, new byte[arr.Length]);
//Console.Write(BitConverter.ToString(arr));
for (int i = 1; i < arr.Length; ++i)
{
if (arr[i] > arr[i])
{
System.Diagnostics.Debug.Assert(false);
throw new Exception("Sort was incorrect " + BitConverter.ToString(arr));
}
}
++z;
}
Console.WriteLine("Tried {0} times with success", z);
}
static void Sort(byte[] arr, int leftPos, int rightPos, byte[] tempArr)
{
var len = rightPos - leftPos;
if (len < 2)
return;
if (len == 2)
{
if (arr[leftPos] > arr[leftPos + 1])
{
var t = arr[leftPos];
arr[leftPos] = arr[leftPos + 1];
arr[leftPos + 1] = t;
}
return;
}
var rStart = leftPos+len/2;
var t1 = new Thread(delegate() { Sort(arr, leftPos, rStart, tempArr); });
var t2 = new Thread(delegate() { Sort(arr, rStart, rightPos, tempArr); });
t1.Start();
t2.Start();
t1.Join();
t2.Join();
var l = leftPos;
var r = rStart;
var z = leftPos;
while (l<rStart && r<rightPos)
{
if (arr[l] < arr[r])
{
tempArr[z] = arr[l];
l++;
}
else
{
tempArr[z] = arr[r];
r++;
}
z++;
}
if (l < rStart)
Array.Copy(arr, l, tempArr, z, rStart - l);
else
Array.Copy(arr, r, tempArr, z, rightPos - r);
Array.Copy(tempArr, leftPos, arr, leftPos, rightPos - leftPos);
}

You could use the Task Parallel Library to give you a better abstraction over threads and cleaner code. The example below uses this.
The main difference from your code, other than using the TPL, is that it has a cutoff threshold below which a sequential implementation is used regardless of the number of threads that have started. This prevents creation of threads that are doing a very small amount of work. There is also a depth cutoff below which new threads are not created. This prevents more threads being created than the hardware can handle based on the number of available logical cores (Environment.ProcessCount).
I would recommend implementing both these approaches in your code to prevent thread explosion for large arrays and innefficient creation of threads which do very small amounts of work, even for small array sizes. It will also give you better performance.
public static class Sort
{
public static int Threshold = 150;
public static void InsertionSort(int[] array, int from, int to)
{
// ...
}
static void Swap(int[] array, int i, int j)
{
// ...
}
static int Partition(int[] array, int from, int to, int pivot)
{
// ...
}
public static void ParallelQuickSort(int[] array)
{
ParallelQuickSort(array, 0, array.Length,
(int) Math.Log(Environment.ProcessorCount, 2) + 4);
}
static void ParallelQuickSort(int[] array, int from, int to, int depthRemaining)
{
if (to - from <= Threshold)
{
InsertionSort(array, from, to);
}
else
{
int pivot = from + (to - from) / 2; // could be anything, use middle
pivot = Partition(array, from, to, pivot);
if (depthRemaining > 0)
{
Parallel.Invoke(
() => ParallelQuickSort(array, from, pivot, depthRemaining - 1),
() => ParallelQuickSort(array, pivot + 1, to, depthRemaining - 1));
}
else
{
ParallelQuickSort(array, from, pivot, 0);
ParallelQuickSort(array, pivot + 1, to, 0);
}
}
}
}
The full source is available on http://parallelpatterns.codeplex.com/
You can read a discussion of the implementation on http://msdn.microsoft.com/en-us/library/ff963551.aspx

We Keep Coding

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.

Fast sort partially sorted array - c#

Related

Quicksort algorithm (C#)

C# OpenCL GPU implementation for double array math

Quick Sort Implementation with large numbers [duplicate]

Generating permutations of a set (most efficiently)

Is this parallel sort merge implemented correctly?

Categories

Resources