I am trying to implement a Quick Sort algorithm from scratch. I have been browsing different posts, and this seems like a popular topic; I'm just not entirely sure how to interpret what other people have done with it to get mine working.
public static void QuickSort(int[] A)
{
    int i, j, pivot, counter, temp;
    counter = 1; //Sets initial counter to allow loop to execute
    while (counter != 0)
    {
        i = -1; //Left bound
        j = 0;  //Right bound
        counter = 0; //Count of transpositions per cycle set to zero
        Random rand = new Random();
        pivot = rand.Next(0, A.Length - 1); //Random pivot value generated
        for (int x = 0; x < A.Length; x++) //Executes contents for each item in the array
        {
            if (A[pivot] > A[j]) //Checks if pivot is greater than right bound
            {
                i++; //Left bound incremented
                temp = A[j];
                A[j] = A[i];
                A[i] = temp; //Swaps left and right bound values
                j++; //Right bound incremented
                counter++; //Increments number of transpositions for this cycle
            }
            else j++; //Else right bound is incremented
        }
        //Here's where it gets sketchy
        temp = A[i + 1];
        A[i + 1] = A[pivot]; //Pivot value is placed in index 1 + left bound
        for (int x = (i + 2); x < A.Length; x++) //Shifts the remaining values in the array from index of pivot (1 + left bound) over one position to the end of the array (not functional)
        {
            temp = A[x];
            A[x + 1] = temp;
        }
    }
}
As you can see, I've sort of hit a wall with the shifting portion of the algorithm, and I'm honestly not sure how to continue without just copying someone's solution from the web.
First step -- Actually understand what is happening in a Quicksort...
Basic Quicksort Algorithm
From your collection of items, select one (called the pivot). (How it is chosen is theoretically irrelevant but may matter on a practical level)
Separate the items in your collection into two new collections: Items less than the pivot, and Items greater than the pivot.
Sort the two new collections separately (how is irrelevant -- most implementations call themselves recursively).
Join the now-sorted lower collection, the pivot, and the now-sorted upper collection.
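To make those four steps concrete, here is a minimal (not in-place) sketch in C#. The method name is mine, it assumes using System.Collections.Generic and System.Linq, and it picks the middle element as the pivot:

static List<int> QuickSortSimple(List<int> items)
{
    if (items.Count <= 1)
        return items;                                  // already sorted

    int pivotIndex = items.Count / 2;
    int pivot = items[pivotIndex];                     // 1. pick the pivot

    // 2. Separate into items <= pivot and items > pivot (the pivot itself excluded).
    var lower = items.Where((x, i) => i != pivotIndex && x <= pivot).ToList();
    var upper = items.Where((x, i) => i != pivotIndex && x > pivot).ToList();

    // 3. Sort each part recursively, then 4. join: lower + pivot + upper.
    var result = QuickSortSimple(lower);
    result.Add(pivot);
    result.AddRange(QuickSortSimple(upper));
    return result;
}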
Now, this particular implementation is a bit bizarre. It's trying to do the partitioning in place, which is OK, but it doesn't seem to ever move the pivot, which means the pivot is going to be overwritten (and therefore change in the middle of the partitioning).
A few nit-picky things.
Don't create a new Random object in the middle of the loop. Create it once outside the loop and reuse it. There are several reasons for this: 1) it takes a (relatively) long time, and 2) it's seeded with the time in milliseconds, so two created within a millisecond of each other will produce the same sequence -- here you'll always get the same pivot.
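For example, a sketch of that change (note, as an aside, that Random.Next's upper bound is exclusive, so Next(0, A.Length - 1) can never select the last index):

Random rand = new Random();     // created once, outside the loop, and reused
while (counter != 0)
{
    // ...
    pivot = rand.Next(0, A.Length);   // upper bound is exclusive, so this covers every index
    // ...
}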
Actually, it's best not to use a random pivot at all. This was probably inspired by a basic misunderstanding of the algorithm. Originally, it was proposed that, since it didn't matter which item was picked as the pivot, you might as well pick the first item, because that was easiest. But then it was discovered that an already-sorted collection gives worst-case time with that pivot. So, naturally, people went completely the other way and decided to go with a random pivot. That's dumb. For any given collection, there exists a pivot that causes worst-case time. By using a random pivot, you increase your chance of hitting it by accident.
The better choice: for an indexable collection (like an array), go with the item physically in the middle of the collection. It gives you best-case time for an already-sorted collection, and its worst case is a pathological ordering that you're unlikely to hit upon. For a non-indexable collection (like a linked list -- betcha' didn't know you could Quicksort a linked list), you pretty much have to use the first item, so be careful when you use it.
If, the first time through the loop, A[0] is less than the pivot, i is equal to j, so you swap A[0] with itself.
You should implement IComparer.
This interface is used in conjunction with the Array.Sort and
Array.BinarySearch methods. It provides a way to customize the sort
order of a collection. See the Compare method for notes on parameters
and return value. The default implementation of this interface is the
Comparer class. For the generic version of this interface, see
System.Collections.Generic.IComparer<T>.
public class Reverse : IComparer
{
    int IComparer.Compare(Object x, Object y) =>
        ((new CaseInsensitiveComparer()).Compare(y, x));
}
The above method is an example, then you would:
IComparer reverse = new Reverse();
collection.Sort(reverse);
Simply pass your comparer to Sort on the collection, and it will be sorted in reverse order. You can go even more in depth for greater control and sort based on specific model properties, etc.
I would look into Sort and the IComparer above. More detail about IComparer here.
Related
I am given three arrays of equal length. Each array represents the distance to a specific type of attraction (i.e. the first array is only theme parks, the second is only museums, the third is only beaches) on a road trip I am taking. I want to determine all possible trips stopping at one of each type of attraction on each trip, never driving backwards, and never visiting the same attraction twice.
I.e. if I have the following three arrays:
[29, 50]
[61, 37]
[37, 70]
The function would return 3, because the possible combinations would be: (29, 61, 70), (29, 37, 70), (50, 61, 70).
What I've got so far:
public int test(int[] A, int[] B, int[] C) {
    int firstStop = 0;
    int secondStop = 0;
    int thirdStop = 0;

    List<List<int>> possibleCombinations = new List<List<int>>();

    for (int i = 0; i < A.Length; i++)
    {
        firstStop = A[i];
        for (int j = 0; j < B.Length; j++)
        {
            if (firstStop < B[j])
            {
                secondStop = B[j];
                for (int k = 0; k < C.Length; k++)
                {
                    if (secondStop < C[k])
                    {
                        thirdStop = C[k];
                        possibleCombinations.Add(new List<int> { firstStop, secondStop, thirdStop });
                    }
                }
            }
        }
    }

    return possibleCombinations.Count();
}
This works for the following test cases:
Example test: ([29, 50], [61, 37], [37, 70])
OK Returns 3
Example test: ([5], [5], [5])
OK Returns 0
Example test: ([61, 62], [37, 38], [29, 30])
FAIL Returns 0
What is the correct algorithm to calculate this correctly?
What is the best performing algorithm?
How can I determine this algorithm's time complexity (i.e. is it O(N*log(N)))?
UPDATE: The question has been rewritten with new details and still is completely unclear and self-contradictory; attempts to clarify the problem with the original poster have been unsuccessful, and the original poster admits to having started coding before understanding the problem themselves. The solution below is correct for the problem as it was originally stated; what the solution to the real problem looks like, no one can say, because no one can say what the real problem is. I'll leave this here for historical purposes.
Let's re-state the problem:
We are given three arrays of distances to attractions along a road.
We wish to enumerate all sequences of possible stops at attractions that do not backtrack. (NOTE: The statement of the problem is to enumerate them; the wrong algorithm given counts them. These are completely different problems. Counting them can be extremely fast. Enumerating them is extremely slow! If the problem is to count them then clarify the problem.)
No other constraints are given in the problem. (For example, it is not given in the problem that we stop at no more than one beach, or that we must stop at one of every kind, or that we must go to a beach before we go to a museum. If those are constraints then they must be stated in the problem)
Suppose there are a total of n attractions. For each attraction either we visit it or we do not. It might seem that there are 2^n possibilities. However, there's a problem. Suppose we have two museums, M1 and M2, both 5 km down the road. The possible routes are:
(Start, End) -- visit no attractions on your road trip
(Start, M1, End)
(Start, M2, End)
(Start, M1, M2, End)
(Start, M2, M1, End)
There are five non-backtracking possibilities, not four.
The algorithm you want is:
Partition the attractions by distance, so that all the partitions contain the attractions that are at the same distance.
For each partition, generate a set of all the possible orderings of all the subsets within that partition. Do not forget that "skip all of them" is a possible ordering.
The combinations you want are the Cartesian product of all the partition ordering sets.
That should give you enough hints to make progress. You have several problems to solve here: partitioning, permuting within a partition, and then taking the cross product of arbitrarily many sets. I and many others have written articles on all of these subjects, so do some research if you do not know how to solve these sub-problems yourself.
As for the asymptotic performance: as noted above, the problem given is to enumerate the solutions. The best possible case is, as noted before, 2^n for cases where there are no attractions at the same distance, so we are at least exponential. If there are collisions then it becomes a product of many factorials; I leave it to you to work it out, but it's big.
Again: if the problem is to work out the number of solutions, that's much easier. You don't have to enumerate them to know how many solutions there are! Just figure out the number of orderings at each partition and then multiply all the counts together. I leave figuring out the asymptotic performance of partitioning, working out the number of orderings, and multiplying them together as an exercise.
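If counting really is the goal, a hedged sketch of that calculation might look like this (the method name is mine; it assumes using System.Collections.Generic, System.Linq, and System.Numerics, and that distances holds the distance of every attraction regardless of type):

static System.Numerics.BigInteger CountRoutes(IEnumerable<int> distances)
{
    System.Numerics.BigInteger total = 1;

    // Group attractions that sit at the same distance (the "partitions" above).
    foreach (var partition in distances.GroupBy(d => d))
    {
        int m = partition.Count();

        // Ways to visit an ordered subset of this partition (including visiting none):
        // sum over k = 0..m of m! / (m - k)!  -- choose k attractions and order them.
        System.Numerics.BigInteger orderings = 0;
        System.Numerics.BigInteger perm = 1;   // falling factorial m * (m-1) * ... * (m-k+1)
        for (int k = 0; k <= m; k++)
        {
            orderings += perm;
            perm *= (m - k);
        }

        // The full answer is the Cartesian product of the per-partition choices.
        total *= orderings;
    }
    return total;
}

For the two-museums example above, the single partition of size 2 gives 1 + 2 + 2 = 5 orderings, matching the five routes listed.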
Your solution runs in O(n^3). But if you need to generate all possible combinations and the distances are sorted row- and column-wise, e.g.
[1, 2, 3]
[4, 5, 6]
[7, 8, 9]
then every solution degrades to O(n^3), since it must compute all possible subsequences.
If the input has lots of data and the distances are spread relatively far apart, then a sort + binary search + recursion solution might be faster.
static List<List<int>> answer = new List<List<int>>();

static void findPaths(List<List<int>> distances, List<int> path, int rowIndex = 0, int previousValue = -1)
{
    if (rowIndex == distances.Count)
    {
        answer.Add(path);
        return;
    }

    previousValue = previousValue == -1 ? distances[0][0] : previousValue;

    int startIndex = distances[rowIndex].BinarySearch(previousValue);
    startIndex = startIndex < 0 ? Math.Abs(startIndex) - 1 : startIndex;

    // No further destination can be added
    if (startIndex == distances[rowIndex].Count)
        return;

    for (int i = startIndex; i < distances[rowIndex].Count; ++i)
    {
        var temp = new List<int>(path);
        int currentValue = distances[rowIndex][i];
        temp.Add(currentValue);
        findPaths(distances, temp, rowIndex + 1, currentValue);
    }
}
The majority of the savings in this solution comes from the fact that, since the data is already sorted, we need not look at destinations in the next row whose distance is less than the previous value.
For smaller datasets with closely spaced distances this might be overkill, with the additional sorting and binary search overhead making it slower than the straightforward brute-force approach.
Ultimately I think this comes down to what your data looks like; you can try out both approaches and see which one is faster for you.
Note: This solution does not assume strictly increasing distances, i.e. [29, 37, 37] is valid here. If you do not want such solutions you'll have to change the binary search to find an upper bound as opposed to a lower bound.
Use dynamic programming with state. As there are only 3 arrays, there are only 2*2*2 = 8 states.
Combine the arrays and sort them: [29, 37, 37, 50, 61, 70]. Then we make a 2D array dp[0..6][0..7]. There are 8 states:
001 means we have chosen 1st array.
010 means we have chosen 2nd array.
011 means we have chosen 1st and 2nd array.
.....
111 means we have chosen 1st, 2nd, 3rd array.
The complexity is O(n*8)=O(n)
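A hedged sketch of that idea, assuming the goal is to count selections of one attraction from each array, visited in strictly increasing order of distance (in any array order); the method name is mine and it assumes using System.Linq:

static long CountTrips(int[] a, int[] b, int[] c)
{
    // Tag each distance with the array it came from, then sort by distance.
    var items = a.Select(d => (dist: d, arr: 0))
                 .Concat(b.Select(d => (dist: d, arr: 1)))
                 .Concat(c.Select(d => (dist: d, arr: 2)))
                 .OrderBy(x => x.dist)
                 .ToArray();

    // dp[state] = number of ways to have picked one attraction from each array
    // in 'state', with strictly increasing distances, using the items seen so far.
    long[] dp = new long[8];
    dp[0] = 1;

    int i = 0;
    while (i < items.Length)
    {
        // Process every attraction at the same distance against a snapshot of dp,
        // so that two attractions at equal distance are never both chosen.
        long[] snapshot = (long[])dp.Clone();
        int j = i;
        while (j < items.Length && items[j].dist == items[i].dist)
        {
            int bit = 1 << items[j].arr;
            for (int state = 0; state < 8; state++)
                if ((state & bit) == 0)
                    dp[state | bit] += snapshot[state];
            j++;
        }
        i = j;
    }
    return dp[7];    // all three arrays used
}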
I have an Array in C# which contains numbers (e.g. int, float or double); I have another array of ranges (each defined as a lower and upper bound). My current implementation is something like this.
foreach (var v in data)
{
    foreach (var row in ranges)
    {
        if (v >= row.lower && v <= row.high)
        {
            statistics[row]++;
            break;
        }
    }
}
So the algorithm is O(mn), where m is the number of ranges and n is the number of values.
Can this be improved? In practice n is big, and I want this to be as fast as possible.
Sort the data array, then for each interval find the first index in data that falls in the range and the last one (both using binary search). The number of elements in this interval is then easily computed as lastIdx - firstIdx (or +1, depending on whether lastIdx is inclusive or not).
This is done in O(m log m + n log m), where m is the number of data points and n the number of intervals.
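A minimal sketch of that approach in C# (the LowerBound/UpperBound helpers are hand-rolled, since Array.BinarySearch does not directly return the first or last matching index):

static int CountInRange(double[] sortedData, double lower, double upper)
{
    int first = LowerBound(sortedData, lower);   // first index with data[i] >= lower
    int last  = UpperBound(sortedData, upper);   // first index with data[i] >  upper
    return last - first;                         // elements inside [lower, upper]
}

static int LowerBound(double[] a, double value)
{
    int lo = 0, hi = a.Length;
    while (lo < hi)
    {
        int mid = (lo + hi) / 2;
        if (a[mid] < value) lo = mid + 1; else hi = mid;
    }
    return lo;
}

static int UpperBound(double[] a, double value)
{
    int lo = 0, hi = a.Length;
    while (lo < hi)
    {
        int mid = (lo + hi) / 2;
        if (a[mid] <= value) lo = mid + 1; else hi = mid;
    }
    return lo;
}

Sort the data once with Array.Sort(data) up front, then call CountInRange for each interval.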
Bonus: If the data is changing constantly, you can use an order statistics tree with the same approach (since such a tree lets you easily find the index of each element, and supports modifying the data).
Bonus2: Optimality proof
Using comparison-based algorithms, this cannot be done better, since if we could, we could also solve the element distinctness problem faster.
Element Distinctness Problem:
Given an array a_1, a_2, ..., a_n, find out if there are indices i, j such that i != j and a_i = a_j.
This problem is known to have an Omega(n log n) time bound using comparison-based algorithms.
Reduction:
Given an instance of the element distinctness problem a_1, ..., a_n, create data = a_1, ..., a_n and the intervals [a_1, a_1], [a_2, a_2], ..., [a_n, a_n], and run the algorithm.
If there are more than n matches, there are duplicates; otherwise there are none.
The complexity of this procedure is O(n + f(n)), where n is the number of elements and f(n) is the complexity of the range-counting algorithm. The whole thing has to be Omega(n log n), so f(n) must be as well, and we can conclude there is no more efficient algorithm.
Assuming the ranges are ordered, you always take the first range that fits, right?
This means that you could easily build a binary tree of the lower bounds. You find the highest lower bound that's lower than your number, and check if it fits the upper bound. If the tree is properly balanced, this can get you quite close to O(n log m). Of course, if you don't need to change the ranges frequently, a simple ordered array will do -- just use the usual binary search methods.
Using a hashtable instead could get you pretty close to O(n), depending on how the ranges are structured. If data is also ordered, you could get even better results.
An alternate solution that doesn't involve sorting the data:
var dictionary = new Dictionary<int, int>();

foreach (var v in data) {
    if (dictionary.ContainsKey(v)) {
        dictionary[v]++;
    } else {
        dictionary[v] = 1;
    }
}

foreach (var row in ranges) {
    for (var i = row.lower; i <= row.higher; i++) {
        if (dictionary.TryGetValue(i, out var count)) {  // skip values that never occur in data
            statistics[row] += count;
        }
    }
}
Get a count of the number of occurrences of each value in data, and then sum the counts between the bounds of your range.
I want to shuffle a big dataset (of type List<Record>), then iterate over it many times. Typically, shuffling a list only shuffles the references, not the data. My algorithm's performance suffers tremendously (3x) because of frequent cache misses. I can do a deep copy of the shuffled data to make it cache friendly. However, that would double the memory usage.
Is there a more memory-efficient way to shuffle or re-order data so that the shuffled data is cache friendly?
Option 1:
Make Record a struct so the List<Record> holds contiguous data in memory.
Then either sort it directly, or (if the records are large) make an array of indices (initially just {0, 1, ..., n - 1}) and sort the indices by making the comparator compare the elements they refer to. Finally, if you need the reordered array, you can copy the elements in the shuffled order by looking at the indices.
Note that this may be more cache-unfriendly than directly sorting the structs, but at least it'll be a single pass through the data, so it is more likely to be faster, depending on the struct size. You can't really avoid it if the struct is large, so if you're not sure whether Record is large, you'll have to try both approaches and see whether sorting the records directly is more efficient.
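A generic sketch of the index-array idea applied to a shuffle (the method name is mine; it assumes using System, System.Collections.Generic, and System.Linq, and it does materialize one reordered copy, which is the trade-off discussed above):

static T[] ReorderByIndices<T>(List<T> records, Random rnd)
{
    int n = records.Count;
    int[] indices = Enumerable.Range(0, n).ToArray();

    // Fisher-Yates shuffle of the small index array instead of moving the
    // (possibly large) elements themselves.
    for (int i = n - 1; i > 0; i--)
    {
        int j = rnd.Next(i + 1);
        (indices[i], indices[j]) = (indices[j], indices[i]);
    }

    // One pass to materialize the reordered copy; if T is a struct, the
    // resulting array is contiguous and cache friendly.
    var result = new T[n];
    for (int i = 0; i < n; i++)
        result[i] = records[indices[i]];
    return result;
}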
If you can't change the type, then your only solution is to somehow make them contiguous in memory. The only realistic way of doing that is to perform an initial garbage collection, then allocate them in order, and keep your fingers crossed hoping that the runtime will allocate them contiguously. I can't think of any other way that could work if you can't make it a struct.
If you think another garbage collection run in the middle might mess up the order, you can try making a second array of GCHandle with pinned references to these objects. I don't recommend this, but it might be your only solution at that point.
Option 2:
Are you really using the entire record for sorting? That's unlikely. If not, then just extract the portion of each record that is relevant, sort those, and then re-shuffle the original data.
It is better for you not to touch the List. Instead, create an accessor method for your list. First, create an array of the n indices in a random order, e.g. something like var arr = [2, 5, ..., n - 1, 0];
Then you create an access method:
Record get(List<Record> list, int i) {
    return list[arr[i]];
}
By doing so the list remains untouched, but you get a random Record at every index.
Edit: to create a random order array:
int[] arr = new int[n];

// Fill the array with the indices 0 to n - 1 (so they are valid list indices):
for (int i = 0; i < arr.Length; i++)
    arr[i] = i;

// Swap pairs of values for an unbiased uniform random permutation (Fisher-Yates):
Random rnd = new Random();
for (int i = 0; i < arr.Length - 1; i++) {
    int j = rnd.Next(i, arr.Length);
    int temp = arr[i];
    arr[i] = arr[j];
    arr[j] = temp;
}
In one of the functions in my program, which is called every second, I get a float value that represents some X strength.
These values keep coming in at intervals, and I am looking to store a history of the last 30 values and check if there's a downward/decreasing trend in the values (there might be 2 or 3 false positives as well, so those have to be neglected). If there's a downward trend and (the most recent value minus the first value in the history) passes a threshold of 50 (say), I want to call another function. How can such a thing be implemented in C#? Which structure can store a history of 30 values, and how can the downward trend be analysed/deduced?
You have several choices. If you only need to call this once per second, you can use a Queue<float>, like this:
Queue<float> theQueue = new Queue<float>(30);

// once per second:

// if the queue is full, remove an item
if (theQueue.Count >= 30)
{
    theQueue.Dequeue();
}

// add the new item to the queue
theQueue.Enqueue(newValue);

// now analyze the items in the queue to detect a downward trend
foreach (float f in theQueue)
{
    // do your analysis
}
That's easy to implement and will be plenty fast enough to run once per second.
How you analyze the downward trend really depends on your definition.
It occurs to me that the Queue<float> enumerator might not be guaranteed to return things in the order that they were inserted. If it doesn't, then you'll have to implement your own circular buffer.
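Assuming the enumeration-order concern above is not an issue (or that you copy into your own buffer in insertion order), here is one possible, hedged interpretation of the check: treat "downward trend" as a drop of at least the threshold between the oldest and newest values, with at most a few up-ticks tolerated in between. Both definitions are assumptions, not from the question:

static bool IsDownwardTrend(Queue<float> history, float threshold = 50f, int allowedUpticks = 3)
{
    if (history.Count < 2)
        return false;

    float[] values = history.ToArray();    // copies in queue (insertion) order, oldest first

    int upticks = 0;
    for (int i = 1; i < values.Length; i++)
        if (values[i] > values[i - 1])
            upticks++;                      // tolerate a few "false positive" rises

    float drop = values[0] - values[values.Length - 1];   // oldest minus newest
    return drop >= threshold && upticks <= allowedUpticks;
}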
I don't know C#, but I'd probably store the values as a List of some sort. Here's some pseudo code for the trend checking:
if last value - first value < threshold
    return

counter = 0
for int i = 1; i < 30; i++
    if val[i] > val[i-1]
        counter++

if counter < false_positive_threshold
    //do stuff
A circular list is the best data structure to store the last X values. There doesn't seem to be one in the standard library, but there are several questions on SO about how to build one.
You need to define "downward trend". It seems to me that according to your current definition (the most recent value minus the first value in the history), the sequence "100, 150, 155, 175, 180, 182" is a downward trend. With that definition you only need the latest and the first value in the history from the circular list, simplifying it somewhat.
But you probably need a more elaborate algorithm to identify a downward trend.
I have a program that needs to repeatedly compute the approximate percentile (order statistic) of a dataset in order to remove outliers before further processing. I'm currently doing so by sorting the array of values and picking the appropriate element; this is doable, but it's a noticeable blip on the profiles despite being a fairly minor part of the program.
More info:
The data set contains on the order of up to 100000 floating point numbers, and is assumed to be "reasonably" distributed - there are unlikely to be duplicates or huge spikes in density near particular values; and if for some odd reason the distribution is odd, it's OK for an approximation to be less accurate, since the data is probably messed up anyhow and further processing dubious. However, the data isn't necessarily uniformly or normally distributed; it's just very unlikely to be degenerate.
An approximate solution would be fine, but I do need to understand how the approximation introduces error to ensure it's valid.
Since the aim is to remove outliers, I'm computing two percentiles over the same data at all times: e.g. one at 95% and one at 5%.
The app is in C# with bits of heavy lifting in C++; pseudocode or a preexisting library in either would be fine.
An entirely different way of removing outliers would be fine too, as long as it's reasonable.
Update: It seems I'm looking for an approximate selection algorithm.
Although this is all done in a loop, the data is (slightly) different every time, so it's not easy to reuse a datastructure as was done for this question.
Implemented Solution
Using the wikipedia selection algorithm as suggested by Gronim reduced this part of the run-time by about a factor of 20.
Since I couldn't find a C# implementation, here's what I came up with. Even for small inputs it's faster than Array.Sort; at 1000 elements it's 25 times faster.
public static double QuickSelect(double[] list, int k) {
    return QuickSelect(list, k, 0, list.Length);
}

public static double QuickSelect(double[] list, int k, int startI, int endI) {
    while (true) {
        // Assume startI <= k < endI
        int pivotI = (startI + endI) / 2; //arbitrary, but good if sorted
        int splitI = partition(list, startI, endI, pivotI);
        if (k < splitI)
            endI = splitI;
        else if (k > splitI)
            startI = splitI + 1;
        else //if (k == splitI)
            return list[k];
    }
    //when this returns, all elements of list[i] <= list[k] iff i <= k
}

static int partition(double[] list, int startI, int endI, int pivotI) {
    double pivotValue = list[pivotI];
    list[pivotI] = list[startI];
    list[startI] = pivotValue;

    int storeI = startI + 1; //no need to store # pivot item, it's good already.
    //Invariant: startI < storeI <= endI
    while (storeI < endI && list[storeI] <= pivotValue) ++storeI; //fast if sorted
    //now storeI == endI || list[storeI] > pivotValue
    //so elem #storeI is either irrelevant or too large.
    for (int i = storeI + 1; i < endI; ++i)
        if (list[i] <= pivotValue) {
            list.swap_elems(i, storeI);
            ++storeI;
        }

    int newPivotI = storeI - 1;
    list[startI] = list[newPivotI];
    list[newPivotI] = pivotValue;
    //now [startI, newPivotI] are <= to pivotValue && list[newPivotI] == pivotValue.
    return newPivotI;
}

static void swap_elems(this double[] list, int i, int j) {
    double tmp = list[i];
    list[i] = list[j];
    list[j] = tmp;
}
Thanks, Gronim, for pointing me in the right direction!
The histogram solution from Henrik will work. You can also use a selection algorithm to efficiently find the k largest or smallest elements in an array of n elements in O(n). To use this for the 95th percentile set k=0.05n and find the k largest elements.
Reference:
http://en.wikipedia.org/wiki/Selection_algorithm#Selecting_k_smallest_or_largest_elements
According to its creator a SoftHeap can be used to:
compute exact or approximate medians
and percentiles optimally. It is also
useful for approximate sorting...
I used to identify outliers by calculating the standard deviation. Everything more than 2 (or 3) standard deviations away from the average is an outlier. 2 standard deviations covers about 95%.
Since you are already calculating the average, it's also very easy and fast to calculate the standard deviation.
You could also use only a subset of your data to calculate the numbers.
You could estimate your percentiles from just a part of your dataset, like the first few thousand points.
The Glivenko–Cantelli theorem ensures that this would be a fairly good estimate, if you can assume your data points to be independent.
Divide the interval between minimum and maximum of your data into (say) 1000 bins and calculate a histogram. Then build partial sums and see where they first exceed 5000 or 95000.
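A hedged sketch of that histogram approach (the method name is mine; it assumes using System and System.Linq, that max > min, and that up to one bin width of error per percentile is acceptable):

static (double p5, double p95) ApproximatePercentiles(double[] data, int binCount = 1000)
{
    double min = data.Min(), max = data.Max();
    if (min == max) return (min, max);            // degenerate: all values identical

    double width = (max - min) / binCount;
    var bins = new int[binCount];
    foreach (double v in data)
        bins[Math.Min((int)((v - min) / width), binCount - 1)]++;   // clamp max into the last bin

    double p5 = max, p95 = max;
    bool lowFound = false;
    int cumulative = 0;
    for (int b = 0; b < binCount; b++)
    {
        cumulative += bins[b];
        if (!lowFound && cumulative >= 0.05 * data.Length)
        {
            p5 = min + (b + 1) * width;           // upper edge of the bin
            lowFound = true;
        }
        if (cumulative >= 0.95 * data.Length)
        {
            p95 = min + (b + 1) * width;
            break;
        }
    }
    return (p5, p95);
}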
There are a couple basic approaches I can think of. First is to compute the range (by finding the highest and lowest values), project each element to a percentile ((x - min) / range) and throw out any that evaluate to lower than .05 or higher than .95.
The second is to compute the mean and standard deviation. A span of 2 standard deviations from the mean (in both directions) will enclose 95% of a normally-distributed sample space, meaning your outliers would be below the 2.5th and above the 97.5th percentiles. Calculating the mean of a series is linear, as is the standard deviation (square root of the mean of the squared differences between each element and the mean). Then, subtract 2 sigmas from the mean, add 2 sigmas to the mean, and you've got your outlier limits.
Both of these will compute in roughly linear time; the first one requires two passes, the second one takes three (once you have your limits you still have to discard the outliers). Since this is a list-based operation, I do not think you will find anything with logarithmic or constant complexity; any further performance gains would require either optimizing the iteration and calculation, or introducing error by performing the calculations on a sub-sample (such as every third element).
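For instance, a minimal sketch of the second (mean and standard deviation) approach, assuming using System and System.Linq; it makes the three passes described above:

static double[] RemoveOutliers(double[] data, double sigmas = 2.0)
{
    double mean = data.Average();                                        // pass 1: mean
    double variance = data.Sum(x => (x - mean) * (x - mean)) / data.Length;  // pass 2: variance
    double std = Math.Sqrt(variance);

    double lower = mean - sigmas * std;
    double upper = mean + sigmas * std;
    return data.Where(x => x >= lower && x <= upper).ToArray();          // pass 3: discard outliers
}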
A good general answer to your problem seems to be RANSAC.
Given a model, and some noisy data, the algorithm efficiently recovers the parameters of the model.
You will have to choose a simple model that can fit your data. Anything smooth should be fine, say a mixture of a few Gaussians. RANSAC will set the parameters of your model and estimate a set of inliers at the same time. Then throw away whatever doesn't fit the model properly.
You could filter out points at 2 or 3 standard deviations even if the data is not normally distributed; at least it will be done in a consistent manner, which should be important.
As you remove the outliers, the std dev will change, so you could do this in a loop until the change in std dev is minimal. Whether or not you want to do this depends on why you are manipulating the data this way. Some statisticians have major reservations about removing outliers. But some remove the outliers to show that the data is fairly normally distributed.
Not an expert, but my memory suggests:
to determine percentile points exactly you need to sort and count
taking a sample from the data and calculating the percentile values sounds like a good plan for decent approximation if you can get a good sample
if not, as suggested by Henrik, you can avoid the full sort if you do the buckets and count them
One set of data of 100k elements takes almost no time to sort, so I assume you have to do this repeatedly. If the data set is the same set just updated slightly, you're best off building a tree (O(N log N)) and then removing and adding new points as they come in (O(K log N) where K is the number of points changed). Otherwise, the kth largest element solution already mentioned gives you O(N) for each dataset.