Speed up loops on large data in C# - c#

I have three nested loops from zero to n, where n is a large number (around 12,000). The three loops work on a 2D list (List<List<int>>); it is actually the Floyd algorithm. On data this large it takes a long time. Could you advise me how to improve it? Thank you. (Sorry for my English :) )
List<List<int>> distance = new List<List<int>>();
...
for (int i = 0; i < n; i++)
    for (int v = 0; v < n; v++)
        for (int w = 0; w < n; w++)
        {
            if (distance[v][i] != int.MaxValue &&
                distance[i][w] != int.MaxValue)
            {
                int d = distance[v][i] + distance[i][w];
                if (distance[v][w] > d)
                    distance[v][w] = d;
            }
        }

The first part of your if statement distance[v][i] != int.MaxValue can be moved outside of the iteration over w to reduce overhead in some cases. However, I have no idea how often your values are at int.MaxValue

You cannot change Floyd's algorithm; its O(n^3) complexity is fixed, and no substantially faster algorithm is known for the general problem of finding all pairwise shortest path distances in a graph with negative edge weights.
You can only improve the runtime by making the problem more specific or the data set smaller. For a general solution you’re stuck with what you have.

Normally I would suggest using Parallel LINQ (for example, as in the Ray Tracer example); however, this assumes that the items you're operating on are independent. In your example each pass of the outer loop uses results from the previous pass, so the outer loop cannot simply be run in parallel.
As your code is quite simple and there isn't really any overhead, there's not really anything you can do to speed that up. As mentioned you could switch the Lists to arrays. You might also want to compare Double arithmetic to Integer arithmetic on your target machine.

After a quick look at your code, it seems that you might be heading for an overflow, as the condition check would not be able to prevent it.
In your code, the condition below does not guard against overflow, since we can have distance[v][i] < int.MaxValue and distance[i][w] < int.MaxValue but distance[v][i] + distance[i][w] > int.MaxValue.
if (distance[v][i] != int.MaxValue && distance[i][w] != int.MaxValue)

As the others have mentioned, the complexity is fixed, so you don't have many options there. However, you can:
Use arrays instead of lists, if possible.
Use an "unsafe" block with pointer semantics; this should decrease the time required to access your array data.
Check if you can parallelize your algorithm. In your case you could use multiple copies of your data (to get rid of the need for synchronisation) and have several threads work on it, e.g. by splitting the range of the outer loop into subranges (1-1000, 1001-2000, and so on).
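A minimal sketch that combines several of the suggestions in this thread: arrays instead of lists, the first MaxValue check hoisted out of the innermost loop, a long sum to sidestep the overflow, and Parallel.For over the middle loop. The outer loop stays sequential here because each pass depends on the previous one. The sketch assumes that every distance[v][v] is 0 and that the graph has no negative cycles, so row i and column i do not change during pass i. The class and method names are made up for illustration.

```csharp
using System;
using System.Threading.Tasks;

public static class FloydSketch
{
    public const int Inf = int.MaxValue; // "no path" marker, as in the question

    // dist is an n x n jagged array; dist[v][v] is assumed to be 0
    // and the graph is assumed to contain no negative cycles.
    public static void Run(int[][] dist)
    {
        int n = dist.Length;
        for (int i = 0; i < n; i++)            // intermediate vertex: sequential
        {
            int[] rowI = dist[i];
            Parallel.For(0, n, v =>            // rows are independent within pass i
            {
                int[] rowV = dist[v];
                int dvi = rowV[i];
                if (dvi == Inf) return;        // hoisted out of the w loop
                for (int w = 0; w < n; w++)
                {
                    int diw = rowI[w];
                    if (diw == Inf) continue;
                    long d = (long)dvi + diw;  // long avoids int overflow
                    if (d < rowV[w]) rowV[w] = (int)d;
                }
            });
        }
    }
}
```

Whether the parallel version actually wins depends on n and on the machine, so it is worth timing against the plain loop.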

Related

Algorithms and Datastructures - Am I solving these complexity questions right?

This is my first post, so I am sorry if I could've done something better, please tell tho ;)
I am currently practicing for Algorithms and Data structures, and we need to calculate Time and Space complexity. For some reason I find it pretty hard and can't get any confirmation/guidance if I'm going the right way or totally the wrong way. So I thought I'd give it a go here.
The First method I'm trying to solve is the following:
static void f1(int n) {
    for (int i = 0; i < n; i++) {
        for (int j = i; j < i * i; j++) {
            for (int k = 2; k < j; k++) {
                System.out.println("*");
            }
        }
    }
}
Time complexity
The steps I made so far:
The first for loop has a time complexity of O(n), because of the i < n, it iterates n times.
The second for loop has a time complexity of O(i^2) because of the j < i*i, it iterates i^2 times.
The third loop does j-2 times, because k=2, so I think.. it is O(j) complexity
So I am not sure if all of the above is correct, but now the step is to multiply all the complexities, so the total complexity would be Big O(n * i^2 * j) which then would be i^2?
So if I'm doing this right it would be Big O (i^2) time complexity.
Space complexity
I'm not too sure how to start with the space complexity but I guess it is just Big O(j) because it is only saving the function call and the for loops, which is O(1) and the printing of the last for loop is the j - 2 times that the printline is called.
I'm totally unsure if this is the right way to think but if not, it's also good for me to know ;)
Thanks in advance for the help!
Kind regards
Your answers cannot be correct because they mention variables i and j which don't appear in the input. Clearly any correct answer must only mention n.
The outer-most loop has an iteration count linear in n, whereas the two inner loops both have iteration counts that grow with n^2, so the three loops nested together are O(n^5). Since the argument of the println-statement is independent of n, it's O(1), and hence we have O(n^5) in total for time complexity.
As for space complexity, there are only three single-valued counting variables declared in the function, so it's O(1).
The innermost loop has the O(j) time complexity, right.
Now the middle loop is a bit more tricky. It runs from i to i^2, and each iteration is linear in terms of j. You may express it as Sum[i..i^2] j. This is a sum of an arithmetic progression. Recall that Sum[a..b] j = O(b^2 - a^2). Substituting the limits (a = i, b = i^2), we obtain the middle loop complexity O((i^2)^2 - i^2) = O(i^4).
Finally, the outer loop runs from 0 to n, and each iteration is O(i^4). It is important to know that the sum of polynomial of degree k yields the polynomial of degree k+1. This means that the time complexity of the entire function is O(n^5).
Regarding space complexity, it cannot possibly be O(j), because j is a free variable. It doesn't exist outside the function. Try to prove that it is O(1).
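For reference, the counting argument above can be written out as a sum. This is only a sketch of the derivation; T(n) is a name introduced here for the number of println calls, and the max(0, j - 2) term accounts for the innermost loop not running at all when j is 2 or less.

```latex
T(n) = \sum_{i=0}^{n-1} \; \sum_{j=i}^{i^2-1} \max(0,\, j-2)

\sum_{j=i}^{i^2-1} j = \frac{(i^2-1)\,i^2 - (i-1)\,i}{2}
                     = \frac{i^4 - 2i^2 + i}{2} = \Theta(i^4)

T(n) = \sum_{i=0}^{n-1} \Theta(i^4) = \Theta(n^5)
```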

Can I have a faster nested loop just lowering the algorithm complexity?

I have a newbie question guys.
Let's suppose that I have the following simple nested loop, where m and n are not necessarily the same, but are really big numbers:
x = 0;
for (i=0; i<m; i++)
{
for (j=0; j<n; j++)
{
delta = CalculateDelta(i,j);
x = x + j + i + delta;
}
}
And now I have this:
x = 0;
for (i=0; i<m; i++)
{
for (j=0; j<n; j++)
{
delta = CalculateDelta(i,j);
x = x + j + i + delta;
j++;
delta = CalculateDelta(i,j);
x = x + j + i + delta;
}
}
Rule: I do need to go through all the elements of the loop, because of this delta calculation.
My questions are:
1) Is the second algorithm faster than the first one, or is it the same?
I have this doubt because to me the first algorithm has a complexity of O(m * n) and the second one O(m * n/2). Or does lower complexity not necessarily make it faster?
2) Is there any other way to make this faster without something like a Parallel.For?
3) If I use a Parallel.For, would it really make it faster, since I would probably need a synchronization lock on the x variable?
Thanks!
Definitely not, since time complexity is presumably dominated by the number of times CalculateDelta() is called, it doesn't matter whether you make the calls inline, within a single loop or any number of nested loops, the call gets made m*n times.
And now you have a bug (which is the reason I decided to add an answer after @Peter-Duniho had already done so quite comprehensively).
If n is odd, you do more iterations than intended - almost certainly getting the wrong answer or crashing your program...
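If the goal really is to handle two elements per pass, a sketch of an unrolled loop that stays in range for odd n could look like the following. CalculateDelta is passed in as a delegate because its real definition is not shown in the question; the class name is hypothetical.

```csharp
using System;

public static class UnrolledSum
{
    public static long Sum(int m, int n, Func<int, int, long> calculateDelta)
    {
        long x = 0;
        for (int i = 0; i < m; i++)
        {
            int j = 0;
            // Two elements per pass, only while both j and j + 1 are in range.
            for (; j + 1 < n; j += 2)
            {
                x += j + i + calculateDelta(i, j);
                x += (j + 1) + i + calculateDelta(i, j + 1);
            }
            // Leftover element when n is odd.
            if (j < n)
                x += j + i + calculateDelta(i, j);
        }
        return x;
    }
}
```

The number of CalculateDelta calls is still m*n, so this changes the loop overhead at most, not the amount of real work.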
Asking three questions in a single post is pushing the limits of "too broad". But in your case, the questions are reasonably simple, so…
1) Is the second algorithm faster than the first one, or is it the same? I have this doubt because to me the first algorithm has a complexity of O(m * n) and the second one O(m * n/2). Or does lower complexity not necessarily make it faster?
Complexity ignores coefficients like 1/2. So there's no difference between O(m * n) and O(m * n/2). The latter should have been reduced to O(m * n), which is obviously the same as the former.
And the second isn't really O(m * n/2) anyway, because you didn't really remove work. You just partially unrolled your loop. These kinds of meaningless transformations are one of the reasons we ignore coefficients in big-O notation in the first place. It's too easy to fiddle with the coefficients without really changing the actual computational work.
2) Is there any other way to make this faster without something like a Parallel.For?
That's definitely too broad a question. "Any other way"? Probably. But you haven't provided enough context.
The only obvious potential improvement I can see in the code you posted is that you are computing j + i repeatedly, when you could instead just observe that that component of the whole computation increases by 1 with each iteration and so you could keep a separate incrementing variable instead of adding i and j each time. But a) it's far from clear making that change would speed anything up (whether it would, would depend a lot on specifics in the CPU's own optimization logic), and b) if it did so reliably, it's possible that a good optimizing JIT compiler would make that transformation to the code for you.
But beyond that, the CalculateDelta() method is a complete unknown here. It could be a simple one-liner that the compiler inlines for you, or it could be some enormous computation that dominates the whole loop.
There's no way for any of us to tell you if there is "any other way" to make the loop faster. For that matter, it's not even clear that the change you made makes the loop faster. Maybe it did, maybe it didn't.
3) If I use a Parallel.For, would it really make it faster, since I would probably need a synchronization lock on the x variable?
That at least depends on what CalculateDelta() is doing. If it's expensive enough, then the synchronization on x might not matter. The bigger issue is that each calculation of x depends on the previous one. It's impossible to parallelize the computation, because it's inherently a serial computation.
What you could do is compute all the deltas in parallel, since they don't depend on x (at least, they don't in the code you posted). The other element of the sum is constant (i + j for known m and n), so in the end it's just the sum of the deltas and that constant. Again, whether this is worth doing depends somewhat on how costly CalculateDelta() is. The less costly that method is, the less likely you're going to see much if any improvement by parallelizing execution of it.
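A sketch of that idea, assuming CalculateDelta is a pure function of i and j (the question does not say so). Each worker sums into its own local total, so x is only touched once per worker rather than once per element, and the i + j part is added as a closed-form constant. The class name and the delegate parameter are hypothetical.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class ParallelDeltaSum
{
    public static long Sum(int m, int n, Func<int, int, long> calculateDelta)
    {
        // Sum of (i + j) over 0 <= i < m, 0 <= j < n, in closed form.
        long total = (long)m * n * (m + n - 2) / 2;

        Parallel.For(0, m,
            () => 0L,                                   // per-worker running sum
            (i, state, local) =>
            {
                for (int j = 0; j < n; j++)
                    local += calculateDelta(i, j);
                return local;
            },
            local => Interlocked.Add(ref total, local)); // one merge per worker

        return total;
    }
}
```

If CalculateDelta is cheap, the threading overhead may well outweigh the gain, as noted above.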
One advantageous transformation is to extract the arithmetic sum of the contributions of the i and j terms using the double arithmetic series formula. This saves quite a bit of work, reducing the complexity of that portion of the calculation to O(1) from O(m*n).
x = 0;
for (i=0; i<m; i++)
{
for (j=0; j<n; j++)
{
delta = CalculateDelta(i,j);
x = x + j + i + delta;
}
}
can become
x = n * m * (n + m - 2) / 2;
for (i=0; i<m; i++)
{
for (j=0; j<n; j++)
{
x += CalculateDelta(i,j);
}
}
Optimizing further depends entirely on what CalculateDelta does, which you have not disclosed. If it has side effects, then that's a problem. But if it's a pure function (where its result is dependent only on the inputs i and j) then there's a good chance it can be computed directly as well.
the first for() will send you to the second for()
the second for() will loop till jn) and the second mn/2

Best performing algorithm for unique trip selection using arrays?

Suppose I am given three arrays of equal length. Each array represents the distances to a specific kind of attraction (i.e. the first array is only theme parks, the second is only museums, the third is only beaches) on a road trip I am taking. I want to determine all possible trips stopping at one of each type of attraction on each trip, never driving backwards, and never visiting the same attraction twice.
IE if I have the following three arrays:
[29 50]
[61 37]
[37 70]
The function would return 3 because the possible combinations would be: (29,61,70)(29,37,70)(50,61,70)
What I've got so far:
public int test(int[] A, int[] B, int[] C) {
int firstStop = 0;
int secondStop = 0;
int thirdStop = 0;
List<List<int>> possibleCombinations = new List<List<int>>();
for(int i = 0; i < A.Length; i++)
{
firstStop = A[i];
for(int j = 0; j < B.Length; j++)
{
if(firstStop < B[j])
{
secondStop = B[j];
for(int k = 0; k < C.Length; k++)
{
if(secondStop < C[k])
{
thirdStop = C[k];
possibleCombinations.Add(new List<int>{firstStop, secondStop, thirdStop});
}
}
}
}
}
return possibleCombinations.Count();
}
This works for the following test cases:
Example test: ([29, 50], [61, 37], [37, 70])
OK Returns 3
Example test: ([5], [5], [5])
OK Returns 0
Example test: ([61, 62], [37, 38], [29, 30])
FAIL Returns 0
What is the correct algorithm to calculate this correctly?
What is the best performing algorithm?
How can I tell the performance of this algorithm's time complexity (ie is it O(N*log(N))?)
UPDATE: The question has been rewritten with new details and still is completely unclear and self-contradictory; attempts to clarify the problem with the original poster have been unsuccessful, and the original poster admits to having started coding before understanding the problem themselves. The solution below is correct for the problem as it was originally stated; what the solution to the real problem looks like, no one can say, because no one can say what the real problem is. I'll leave this here for historical purposes.
Let's re-state the problem:
We are given three arrays of distances to attractions along a road.
We wish to enumerate all sequences of possible stops at attractions that do not backtrack. (NOTE: The statement of the problem is to enumerate them; the wrong algorithm given counts them. These are completely different problems. Counting them can be extremely fast. Enumerating them is extremely slow! If the problem is to count them then clarify the problem.)
No other constraints are given in the problem. (For example, it is not given in the problem that we stop at no more than one beach, or that we must stop at one of every kind, or that we must go to a beach before we go to a museum. If those are constraints then they must be stated in the problem)
Suppose there are a total of n attractions. For each attraction either we visit it or we do not. It might seem that there are 2^n possibilities. However, there's a problem. Suppose we have two museums, M1 and M2 both 5 km down the road. The possible routes are:
(Start, End) -- visit no attractions on your road trip
(Start, M1, End)
(Start, M2, End)
(Start, M1, M2, End)
(Start, M2, M1, End)
There are five non-backtracking possibilities, not four.
The algorithm you want is:
Partition the attractions by distance, so that all the partitions contain the attractions that are at the same distance.
For each partition, generate a set of all the possible orderings of all the subsets within that partition. Do not forget that "skip all of them" is a possible ordering.
The combinations you want are the Cartesian product of all the partition ordering sets.
That should give you enough hints to make progress. You have several problems to solve here: partitioning, permuting within a partition, and then taking the cross product of arbitrarily many sets. I and many others have written articles on all of these subjects, so do some research if you do not know how to solve these sub-problems yourself.
As for the asymptotic performance: As noted above, the problem given is to enumerate the solutions. The best possible case is, as noted before, 2^n for cases where there are no attractions at the same distance, so we are at least exponential. If there are collisions then it becomes a product of many factorials; I leave it to you to work it out, but it's big.
Again: if the problem is to work out the number of solutions, that's much easier. You don't have to enumerate them to know how many solutions there are! Just figure out the number of orderings at each partition and then multiply all the counts together. I leave figuring out the asymptotic performance of partitioning, working out the number of orderings, and multiplying them together as an exercise.
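A sketch of the counting version under the same reading of the problem (any subset of attractions, no backtracking, attractions at the same distance can be visited in any order). For a partition of s attractions at one distance, the number of ordered subsets is the sum over k of s!/(s-k)!, which gives 5 for the two-museum example above. The class and method names are made up here.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Numerics;

public static class RouteCounter
{
    // Ordered subsets of s same-distance attractions: sum over k of s!/(s-k)!.
    static BigInteger OrderedSubsets(int s)
    {
        BigInteger total = 1;   // k = 0: skip all of them
        BigInteger term = 1;
        for (int k = 1; k <= s; k++)
        {
            term *= s - k + 1;  // term is now s!/(s-k)!
            total += term;
        }
        return total;
    }

    // Multiply the per-partition counts; nothing is enumerated.
    public static BigInteger Count(IEnumerable<int> distances) =>
        distances.GroupBy(d => d)
                 .Select(g => OrderedSubsets(g.Count()))
                 .Aggregate(BigInteger.One, (acc, c) => acc * c);
}
```

BigInteger is used because the count grows at least as fast as 2^n.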
Your solution runs in O(n^3). But if you need to generate all possible combinations and the distances are sorted row- and column-wise, i.e.
[1, 2, 3]
[4, 5, 6]
[7, 8, 9]
all solutions will degrade to O(n^3), as they have to produce all possible subsequences.
If the input has lots of data and the distances are relatively far apart, then a sort + binary search + recursive solution might be faster.
static List<List<int>> answer = new List<List<int>>();
static void findPaths(List<List<int>> distances, List<int> path, int rowIndex = 0, int previousValue = -1)
{
if(rowIndex == distances.Count)
{
answer.Add(path);
return;
}
previousValue = previousValue == -1 ? distances[0][0] : previousValue;
int startIndex = distances[rowIndex].BinarySearch(previousValue);
startIndex = startIndex < 0 ? Math.Abs(startIndex) - 1 : startIndex;
// No further destination can be added
if (startIndex == distances[rowIndex].Count)
return;
for(int i=startIndex; i < distances[rowIndex].Count; ++i)
{
var temp = new List<int>(path);
int currentValue = distances[rowIndex][i];
temp.Add(currentValue);
findPaths(distances, temp, rowIndex + 1, currentValue);
}
}
The majority of the savings in this solution comes from the fact that, since the data is already sorted, we need not look at distances in the next destinations that are less than the previous value.
For smaller and more closely spaced distances this might be overkill, with the additional sorting and binary search overhead making it slower than the straightforward brute-force approach.
Ultimately I think this comes down to what your data looks like; you can try both approaches and see which one is faster for you.
Note: This solution does not assume strictly increasing distances, i.e. [29, 37, 37] is valid here. If you do not want such solutions, you'll have to change the binary search to find an upper bound as opposed to a lower bound.
Use dynamic programming with state. As there are only 3 arrays, there are only 2*2*2 states.
Combine the arrays and sort the result: [29, 37, 37, 50, 61, 70]. Then make a 2D array dp[0..6][0..7]. There are 8 states:
001 means we have chosen 1st array.
010 means we have chosen 2nd array.
011 means we have chosen 1st and 2nd array.
.....
111 means we have chosen 1st, 2nd, 3rd array.
The complexity is O(n*8)=O(n)
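One way this state idea could be turned into code, as a sketch only. It assumes an interpretation the question never confirms: the three attraction types may be visited in any order, and distances must be strictly increasing. Under that reading the first example yields 6 rather than 3, because the question's own loop fixes the order A, B, C. A rolling array of 8 counters replaces the 2D table, and the names are illustrative.

```csharp
using System.Linq;

public static class TripDp
{
    public static long Count(int[] a, int[] b, int[] c)
    {
        // Tag every distance with its array (bit 0, 1 or 2), then walk the
        // distinct distances in increasing order.
        var groups = a.Select(d => (Dist: d, Type: 0))
            .Concat(b.Select(d => (Dist: d, Type: 1)))
            .Concat(c.Select(d => (Dist: d, Type: 2)))
            .GroupBy(s => s.Dist)
            .OrderBy(g => g.Key);

        var dp = new long[8];   // dp[mask] = ways to have visited the types in mask
        dp[0] = 1;
        foreach (var group in groups)
        {
            // Stops at the same distance all extend the counts from before the
            // group, so two equal distances never land in one trip.
            var next = (long[])dp.Clone();
            foreach (var stop in group)
            {
                int bit = 1 << stop.Type;
                for (int mask = 0; mask < 8; mask++)
                    if ((mask & bit) == 0)
                        next[mask | bit] += dp[mask];
            }
            dp = next;
        }
        return dp[7];           // all three types visited
    }
}
```

After the sort, which costs O(n log n), the pass itself does 8 updates per stop, matching the O(n*8) figure above.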

Optimize if-statement (a > 0 && b > 0 && a + b == c) in C#

I'm currently doing some graph calculations that involves adjacency matrices, and I'm in the process of optimizing every little bit of it.
One of the instructions that I think can be optimized is the one in the title, in its original form:
if ((adjMatrix[i][k] > 0) && (adjMatrix[k][j] > 0) && (adjMatrix[i][k] + adjMatrix[k][j] == w))
But for ease I'll stick to the form provided in the title:
if (a > 0 && b > 0 && a + b == c)
What I don't like is the > 0 part (being an adjacency matrix, in its initial form it contains only 0 and 1, but as the program progresses, zeros are replaced with numbers from 2 onwards, until there are no more zeros).
I've done a test and removed the > 0 part for both a and b, and there was a significant improvement. Over 60088 iterations there was a decrease of 792ms, from 3672ms to 2880ms, which is 78% of the original time, which to me is excellent.
So my question is: can you think of some way of optimizing a statement like this and having the same result, in C#? Maybe some bitwise operations or something similar, I'm not quite familiar with them.
Answer with every idea that crosses your mind, even if it's not suitable. I'll do the speed testing myself and let you know of the results.
EDIT: This is for a program that I'm going to run myself on my computer. What I just described is not a problem or bottleneck that I'm complaining about. The program in its current form runs fine for my needs, but I just want to push it further and make it as basic and optimized as possible. Hope this clarifies things a little.
EDIT: I believe providing the full code is useful, so here it is, but keep in mind what I said in the bold below. I want to concentrate strictly on the if statement. The program essentially takes an adjacency matrix and stores all the route combinations that exist. These are then sorted and trimmed according to some coefficients, but I didn't include that part.
int w, i, j, li, k;
int[][] adjMatrix = Data.AdjacencyMatrix;
List<List<List<int[]>>> output = new List<List<List<int[]>>>(c);
for (w = 2; w <= 5; w++)
{
int[] plan;
for (i = 0; i < c; i++)
{
for (j = 0; j < c; j++)
{
if (j == i) continue;
if (adjMatrix[i][j] == 0)
{
for (k = 0; k < c; k++) // 11.7%
{
if (
adjMatrix[i][k] > 0 &&
adjMatrix[k][j] > 0 &&
adjMatrix[i][k] + adjMatrix[k][j] == w) // 26.4%
{
adjMatrix[i][j] = w;
foreach (int[] first in output[i][k])
foreach (int[] second in output[k][j]) // 33.9%
{
plan = new int[w - 1];
li = 0;
foreach (int l in first) plan[li++] = l;
plan[li++] = k;
foreach (int l in second) plan[li++] = l;
output[i][j].Add(plan);
}
}
}
// Here the sorting and trimming occurs, but for the sake of
// discussion, this is only a simple IEnumerable<T>.Take()
if (adjMatrix[i][j] == w)
output[i][j] = output[i][j].Take(10).ToList();
}
}
}
}
Added comments with profiler results in optimized build.
By the way, the timing results were obtained with exactly this piece of code (without the sorting and trimming, which dramatically increases execution time). No other parts were included in my measurement. There is a Stopwatch.StartNew() right before this code, and a Console.WriteLine(ElapsedMilliseconds) just after.
If you want to get an idea of the size, the adjacency matrix has 406 rows / columns. So basically there are only nested for-loops which execute a great many iterations, so I haven't got many options for optimizing. Speed is not currently a problem, but I want to make sure I'm ready when it becomes one.
And to rule out the 'optimize other parts' objection: there is room for discussion on that subject too, but for this specific matter I just want to find a solution for this as an abstract problem / concept. It may help me and others understand how the C# compiler works and treats if-statements and comparisons; that's my goal here.
You can replace a>0 && b>0 with ((a-1)|(b-1)) >= 0 for signed variables a and b. (The outer parentheses are needed because >= binds more tightly than | in C#.)
Likewise, the condition x == w can be expressed as ((x - w)|(w - x)) >= 0, since when x != w either the left or the right part of the expression will set the sign bit, which is preserved by bitwise or. Everything put together would be ((a-1)|(b-1)|(a+b-w)|(w-a-b)) >= 0, expressed as a single comparison.
Alternatively, a slight speed advantage may come from putting the tests in order of increasing probability:
which is more likely, (a|b)>=0 or (a+b)==w?
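For anyone who wants to try the single-comparison form, here is a small sketch that puts it next to the original condition so the two can be checked against each other and timed. The class and method names are invented for this example, and it ignores values close to int.MinValue and int.MaxValue, where the subtractions would wrap around.

```csharp
public static class BranchlessCheck
{
    // The condition from the question, with short-circuit evaluation.
    public static bool Plain(int a, int b, int w) =>
        a > 0 && b > 0 && a + b == w;

    // Every term is non-negative exactly when its sub-condition holds, so the
    // OR of all four has a clear sign bit only when all sub-conditions hold.
    public static bool Branchless(int a, int b, int w) =>
        ((a - 1) | (b - 1) | (a + b - w) | (w - a - b)) >= 0;
}
```

Whether the branchless form is faster depends on the JIT and the data, so it needs measuring in the real loop.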
I don't know how well C# optimizes things like this, but it's not so difficult to try to store adjMatrix[i][k] and adjMatrix[k][j] in temporary variables not to read memory twice. See if that changes things in any way.
It's hard to believe that arithmetic and comparison operations are the bottleneck here. Most likely it's memory access or branching. Ideally memory should be accessed in a linear fashion. Can you do something to make it more linear?
It would be good to see more code to suggest something more concrete.
Update: You could try to use two-dimensional array (int[,]) instead of a jagged one (int[][]). This might improve memory locality and element access speed.
The order of the logical tests could be important (as noted in other answers). Since you are using the short-circuit logical operator (&& instead of &), the conditions are evaluated from left to right, and the first one found to be false causes the program to stop evaluating the conditional and carry on (without executing the if block). So if one condition is far more likely to be false than the rest, that one should go first, the next most likely to be false should go second, and so on.
Another good optimization (which I suspect is really what gave you your performance increase --rather than simply dropping out some of the conditions) is to assign the values you are pulling from the arrays to local variables.
You are using adjMatrix[i][k] twice (as well as adjMatrix[k][j]) which is forcing the computer to dig through the array to get the value. Instead, before the if statement, set each of those to a local variable each time, then do your logic test against those variables.
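A sketch of that change applied to a loop shaped like the k loop in the question. The method only counts matches so that it stays self-contained; the name CountMatches and its parameters are hypothetical, not taken from the original program.

```csharp
public static class LocalCache
{
    // Counts the k values for which both hops exist and their lengths add up to w.
    public static int CountMatches(int[][] adjMatrix, int i, int j, int w)
    {
        int count = 0;
        int[] rowI = adjMatrix[i];      // row i is fixed for the whole k loop
        for (int k = 0; k < adjMatrix.Length; k++)
        {
            int a = rowI[k];            // read each element once
            int b = adjMatrix[k][j];
            if (a > 0 && b > 0 && a + b == w)
                count++;
        }
        return count;
    }
}
```

The JIT may already do some of this on its own, so the effect is worth checking with the profiler.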
I agree with others who say it's unlikely that this simple statement is your bottleneck and suggest profiling before you decide on optimizing this specific line. But, as a theoretical experiment, you can do a couple of things:
Zero-checks: checking for a != 0 && b != 0 will probably be somewhat faster than a > 0 && b > 0. Since your adjacency matrix is non-negative, you can safely do this.
Reordering: if testing just a + b == c is faster, try using this test first and only then test for a and b individually. I doubt this will be faster because addition and equality check is more expensive than zero checks, but it might work for your particular case.
Avoid double indexing: look at the resulting IL with ILDASM or an equivalent to ensure that the array indexes are only dereferenced once, not twice. If they aren't, try putting them in local variables before the check.
Unless you're calling a function, you don't optimize conditionals. It's pointless. However, if you really want to, there are a few easy things to keep in mind.
Conditions check whether something is zero (or not) or whether the highest bit is set (or not), and a compare (== or !=) is essentially a - b followed by a check for zero (== 0) or not (!= 0). So if a is unsigned, then a>0 is the same as a!=0. If a is signed, then a<0 is pretty good (it uses the check on the highest bit) and is better than a<=0. Anyway, just knowing those rules may help.
Also fire up a profiler; you'll see conditionals take .001% of the time. If anything, you should ask how to write something that doesn't require conditionals.
Have you considered reversing the logic?
if (a > 0 && b > 0 && a + b == c)
could be rewritten to:
if (a == 0 || b == 0 || a + b != c) continue;
Since you don't want to do anything in the loop if any of the conditions is false, try to bail out as soon as possible (if the runtime is that smart, which I assume it is).
The heaviest operation should be last, because if the first condition is true, the others don't need to be checked. I assumed that the addition is the heaviest part, but profiling might tell a different story.
However, I haven't profiled these scenarios myself, and with such trivial conditionals it might even be a drawback. It would be interesting to see your findings.

Fast Algorithm for computing percentiles to remove outliers

I have a program that needs to repeatedly compute the approximate percentile (order statistic) of a dataset in order to remove outliers before further processing. I'm currently doing so by sorting the array of values and picking the appropriate element; this is doable, but it's a noticeable blip on the profiles despite being a fairly minor part of the program.
More info:
The data set contains on the order of up to 100000 floating point numbers, and assumed to be "reasonably" distributed - there are unlikely to be duplicates nor huge spikes in density near particular values; and if for some odd reason the distribution is odd, it's OK for an approximation to be less accurate since the data is probably messed up anyhow and further processing dubious. However, the data isn't necessarily uniformly or normally distributed; it's just very unlikely to be degenerate.
An approximate solution would be fine, but I do need to understand how the approximation introduces error to ensure it's valid.
Since the aim is to remove outliers, I'm computing two percentiles over the same data at all times: e.g. one at 95% and one at 5%.
The app is in C# with bits of heavy lifting in C++; pseudocode or a preexisting library in either would be fine.
An entirely different way of removing outliers would be fine too, as long as it's reasonable.
Update: It seems I'm looking for an approximate selection algorithm.
Although this is all done in a loop, the data is (slightly) different every time, so it's not easy to reuse a datastructure as was done for this question.
Implemented Solution
Using the wikipedia selection algorithm as suggested by Gronim reduced this part of the run-time by about a factor 20.
Since I couldn't find a C# implementation, here's what I came up with. It's faster even for small inputs than Array.Sort; and at 1000 elements it's 25 times faster.
public static double QuickSelect(double[] list, int k) {
    return QuickSelect(list, k, 0, list.Length);
}
public static double QuickSelect(double[] list, int k, int startI, int endI) {
    while (true) {
        // Assume startI <= k < endI
        int pivotI = (startI + endI) / 2; //arbitrary, but good if sorted
        int splitI = partition(list, startI, endI, pivotI);
        if (k < splitI)
            endI = splitI;
        else if (k > splitI)
            startI = splitI + 1;
        else //if (k == splitI)
            return list[k];
    }
    //when this returns, all elements of list[i] <= list[k] iff i <= k
}
static int partition(double[] list, int startI, int endI, int pivotI) {
    double pivotValue = list[pivotI];
    list[pivotI] = list[startI];
    list[startI] = pivotValue;
    int storeI = startI + 1;//no need to store # pivot item, it's good already.
    //Invariant: startI < storeI <= endI
    while (storeI < endI && list[storeI] <= pivotValue) ++storeI; //fast if sorted
    //now storeI == endI || list[storeI] > pivotValue
    //so elem #storeI is either irrelevant or too large.
    for (int i = storeI + 1; i < endI; ++i)
        if (list[i] <= pivotValue) {
            list.swap_elems(i, storeI);
            ++storeI;
        }
    int newPivotI = storeI - 1;
    list[startI] = list[newPivotI];
    list[newPivotI] = pivotValue;
    //now [startI, newPivotI] are <= to pivotValue && list[newPivotI] == pivotValue.
    return newPivotI;
}
static void swap_elems(this double[] list, int i, int j) {
    double tmp = list[i];
    list[i] = list[j];
    list[j] = tmp;
}
Thanks, Gronim, for pointing me in the right direction!
The histogram solution from Henrik will work. You can also use a selection algorithm to efficiently find the k largest or smallest elements in an array of n elements in O(n). To use this for the 95th percentile set k=0.05n and find the k largest elements.
Reference:
http://en.wikipedia.org/wiki/Selection_algorithm#Selecting_k_smallest_or_largest_elements
According to its creator, a SoftHeap can be used to:
"compute exact or approximate medians and percentiles optimally. It is also useful for approximate sorting..."
I used to identify outliers by calculating the standard deviation. Everything more than 2 (or 3) standard deviations away from the average is an outlier. 2 standard deviations cover about 95% of the data.
Since you are calculating the average anyway, it's also very easy and fast to calculate the standard deviation.
You could also use only a subset of your data to calculate the numbers.
You could estimate your percentiles from just a part of your dataset, like the first few thousand points.
The Glivenko–Cantelli theorem ensures that this would be a fairly good estimate, if you can assume your data points to be independent.
Divide the interval between minimum and maximum of your data into (say) 1000 bins and calculate a histogram. Then build partial sums and see where they first exceed 5000 or 95000.
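A sketch of the histogram approach, generalised from the fixed 5000 / 95000 thresholds to a fraction of the element count. It assumes a non-empty array without NaN values, and the names are made up. Because it returns the upper edge of the bin in which the running count first reaches the target, the result should be off by roughly one bin width, (max - min) / bins, at most. That bound only helps if the data is not concentrated in a few bins.

```csharp
using System;
using System.Linq;

public static class HistogramPercentile
{
    // fraction is e.g. 0.05 or 0.95; data must be non-empty.
    public static double Estimate(double[] data, double fraction, int bins = 1000)
    {
        double min = data.Min(), max = data.Max();
        if (max == min) return min;               // all values identical

        double width = (max - min) / bins;
        var counts = new int[bins];
        foreach (double x in data)
        {
            // Clamp so that x == max falls into the last bin.
            int bin = Math.Min((int)((x - min) / width), bins - 1);
            counts[bin]++;
        }

        double target = fraction * data.Length;
        int cumulative = 0;
        for (int b = 0; b < bins; b++)
        {
            cumulative += counts[b];              // partial sums
            if (cumulative >= target)
                return min + (b + 1) * width;     // upper edge of this bin
        }
        return max;
    }
}
```

Both percentiles can reuse one histogram if the counting step is split out, which keeps the whole thing at two passes over the data.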
There are a couple basic approaches I can think of. First is to compute the range (by finding the highest and lowest values), project each element to a percentile ((x - min) / range) and throw out any that evaluate to lower than .05 or higher than .95.
The second is to compute the mean and standard deviation. A span of 2 standard deviations from the mean (in both directions) will enclose about 95% of a normally-distributed sample space, meaning your outliers would be in the <2.5 and >97.5 percentiles. Calculating the mean of a series is linear, as is the standard deviation (the square root of the mean of the squared differences between each element and the mean). Then, subtract 2 sigmas from the mean, and add 2 sigmas to the mean, and you've got your outlier limits.
Both of these will compute in roughly linear time; the first one requires two passes, the second one takes three (once you have your limits you still have to discard the outliers). Since this is a list-based operation, I do not think you will find anything with logarithmic or constant complexity; any further performance gains would require either optimizing the iteration and calculation, or introducing error by performing the calculations on a sub-sample (such as every third element).
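The mean and standard deviation variant might look like this sketch. It uses the population standard deviation and a configurable number of sigmas; the class and method names are illustrative, and the roughly 95% coverage only holds if the data is close to normally distributed.

```csharp
using System;
using System.Linq;

public static class SigmaFilter
{
    // Keeps the values within `sigmas` standard deviations of the mean.
    public static double[] RemoveOutliers(double[] data, double sigmas = 2.0)
    {
        double mean = data.Average();                              // pass 1
        double variance =
            data.Sum(x => (x - mean) * (x - mean)) / data.Length;  // pass 2
        double sd = Math.Sqrt(variance);

        double lo = mean - sigmas * sd;
        double hi = mean + sigmas * sd;
        return data.Where(x => x >= lo && x <= hi).ToArray();      // pass 3
    }
}
```

Note that a single extreme value inflates both the mean and the standard deviation, so the limits themselves are sensitive to the outliers being removed.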
A good general answer to your problem seems to be RANSAC.
Given a model, and some noisy data, the algorithm efficiently recovers the parameters of the model.
You will have to choose a simple model that can map your data. Anything smooth should be fine, let's say a mixture of a few Gaussians. RANSAC will set the parameters of your model and estimate a set of inliers at the same time. Then throw away whatever doesn't fit the model properly.
You could filter out 2 or 3 standard deviation even if the data is not normally distributed; at least, it will be done in a consistent manner, that should be important.
As you remove the outliers, the std dev will change, you could do this in a loop until the change in std dev is minimal. Whether or not you want to do this depends upon why are you manipulating the data this way. There are major reservations by some statisticians to removing outliers. But some remove the outliers to prove that the data is fairly normally distributed.
Not an expert, but my memory suggests:
to determine percentile points exactly you need to sort and count
taking a sample from the data and calculating the percentile values sounds like a good plan for decent approximation if you can get a good sample
if not, as suggested by Henrik, you can avoid the full sort if you do the buckets and count them
One set of data of 100k elements takes almost no time to sort, so I assume you have to do this repeatedly. If the data set is the same set just updated slightly, you're best off building a tree (O(N log N)) and then removing and adding new points as they come in (O(K log N) where K is the number of points changed). Otherwise, the kth largest element solution already mentioned gives you O(N) for each dataset.
