I'm currently doing some graph calculations that involve adjacency matrices, and I'm in the process of optimizing every little bit of it.
One of the instructions that I think can be optimized is the one in the title, in its original form:
if ((adjMatrix[i][k] > 0) && (adjMatrix[k][j] > 0) && (adjMatrix[i][k] + adjMatrix[k][j] == w))
But for ease I'll stick to the form provided in the title:
if (a > 0 && b > 0 && a + b == c)
What I don't like is the > 0 part (being an adjacency matrix, in its initial form it contains only 0 and 1, but as the program progresses, zeros are replaced with numbers from 2 onwards, until there are no more zeros).
I've done a test and removed the > 0 part for both a and b, and there was a significant improvement. Over 60088 iterations there was a decrease of 792 ms, from 3672 ms to 2880 ms, which is about 78% of the original time, which to me is excellent.
So my question is: can you think of a way to optimize a statement like this in C# while keeping the same result? Maybe some bitwise operations or something similar; I'm not very familiar with them.
Answer with every idea that crosses your mind, even if it doesn't seem suitable. I'll do the speed testing myself and let you know the results.
EDIT: This is for a compiler that I'm going to run myself on my computer. What I just described isn't a problem / bottleneck that I'm complaining about. The program in its current form runs fine for my needs, but I just want to push it further and make it as basic and optimized as possible. I hope this clarifies things a little.
EDIT: I believe providing the full code is useful, so here it is, but keep in mind what follows: I want to concentrate strictly on the if statement. The program essentially takes an adjacency matrix and stores all the route combinations that exist. They are then sorted and trimmed according to some coefficients, but I didn't include that part.
int w, i, j, li, k;
int[][] adjMatrix = Data.AdjacencyMatrix;
List<List<List<int[]>>> output = new List<List<List<int[]>>>(c);
for (w = 2; w <= 5; w++)
{
    int[] plan;
    for (i = 0; i < c; i++)
    {
        for (j = 0; j < c; j++)
        {
            if (j == i) continue;
            if (adjMatrix[i][j] == 0)
            {
                for (k = 0; k < c; k++) // 11.7%
                {
                    if (
                        adjMatrix[i][k] > 0 &&
                        adjMatrix[k][j] > 0 &&
                        adjMatrix[i][k] + adjMatrix[k][j] == w) // 26.4%
                    {
                        adjMatrix[i][j] = w;
                        foreach (int[] first in output[i][k])
                            foreach (int[] second in output[k][j]) // 33.9%
                            {
                                plan = new int[w - 1];
                                li = 0;
                                foreach (int l in first) plan[li++] = l;
                                plan[li++] = k;
                                foreach (int l in second) plan[li++] = l;
                                output[i][j].Add(plan);
                            }
                    }
                }
                // Here the sorting and trimming occurs, but for the sake of
                // discussion, this is only a simple IEnumerable<T>.Take()
                if (adjMatrix[i][j] == w)
                    output[i][j] = output[i][j].Take(10).ToList();
            }
        }
    }
}
I added comments with profiler results from an optimized build.
By the way, the timing results were obtained with exactly this piece of code (without the sorting and trimming, which dramatically increases execution time). No other parts were included in my measurement: there is a Stopwatch.StartNew() right before this code and a Console.WriteLine(ElapsedMilliseconds) right after it.
To give an idea of the size, the adjacency matrix has 406 rows / columns. So basically there are only nested for loops executing many, many iterations, which doesn't leave me many options for optimizing. Speed is not currently a problem, but I want to make sure I'm ready when it becomes one.
And to rule out the 'optimize other parts' objection: there is room for discussion on that subject too, but for this specific matter I just want to find a solution for this as an abstract problem / concept. It may help me and others understand how the C# compiler works and treats if statements and comparisons; that's my goal here.
You can replace a > 0 && b > 0 with ((a - 1) | (b - 1)) >= 0 for signed variables a and b (note the parentheses: in C#, >= binds tighter than |).
Likewise, the condition x == w can be expressed as ((x - w) | (w - x)) >= 0, since when x != w either the left or the right part of the expression will set the sign bit, which is preserved by bitwise OR. Everything put together, ((a - 1) | (b - 1) | (a + b - w) | (w - a - b)) >= 0 expresses the whole test as a single comparison.
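A minimal sketch of what that looks like in C#, assuming the values are small enough that none of the subtractions or the sum can overflow an int (true for path lengths up to 5 as in the question):
static bool CombinedCheck(int a, int b, int w)
{
    // a > 0 && b > 0 && a + b == w folded into one sign test:
    // the OR is non-negative only when no term has its sign bit set.
    return ((a - 1) | (b - 1) | (a + b - w) | (w - a - b)) >= 0;
}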
Alternatively, a slight speed advantage may come from putting the tests in increasing order of probability: which is more likely to hold, (a | b) >= 0 or (a + b) == w? With short-circuit evaluation, the test that is most likely to fail should go first.
I don't know how well C# optimizes things like this, but it's not difficult to store adjMatrix[i][k] and adjMatrix[k][j] in temporary variables so that memory isn't read twice. See if that changes things in any way.
It's hard to believe that arithmetic and comparison operations are the bottleneck here. Most likely it's memory access or branching. Ideally, memory should be accessed in a linear fashion. Can you do something to make the access pattern more linear?
It would be good to see more code in order to suggest something more concrete.
Update: You could try using a two-dimensional array (int[,]) instead of a jagged one (int[][]). This might improve memory locality and element access speed.
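A hypothetical sketch of what the inner test might look like with a rectangular array, assuming the source data can be loaded into new int[c, c] instead of int[][]:
int[,] adjMatrix = new int[c, c];   // instead of int[][]
// ...
if (adjMatrix[i, k] > 0 &&
    adjMatrix[k, j] > 0 &&
    adjMatrix[i, k] + adjMatrix[k, j] == w) // same test, rectangular indexing
{
    adjMatrix[i, j] = w;
    // ... rest of the loop body unchanged
}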
The order of the logical tests could be important (as noted in other answers). Since you are using the short-circuit logical operator (&& instead of &), the conditions are evaluated from left to right, and the first one found to be false stops evaluation of the conditional, so the if block is skipped. Therefore, if one condition is far more likely to be false than the rest, it should go first, followed by the next most likely to be false, and so on.
Another good optimization (which I suspect is really what gave you your performance increase, rather than simply dropping some of the conditions) is to assign the values you are pulling from the arrays to local variables.
You are using adjMatrix[i][k] twice (as well as adjMatrix[k][j]), which forces the computer to dig through the array to get the value each time. Instead, before the if statement, assign each of those to a local variable and do your logic test against those variables.
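A hedged sketch of that suggestion applied to the loop from the question (the local names rowI, aik and akj are mine):
int[] rowI = adjMatrix[i];          // hoist the row lookup out of the k loop
for (k = 0; k < c; k++)
{
    int aik = rowI[k];              // read each cell exactly once
    int akj = adjMatrix[k][j];
    if (aik > 0 && akj > 0 && aik + akj == w)
    {
        // ... same body as before, reusing aik and akj where needed
    }
}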
I agree with others who say it's unlikely that this simple statement is your bottleneck, and I suggest profiling before you decide to optimize this specific line. But, as a theoretical experiment, you can do a couple of things:
Zero checks: checking a != 0 && b != 0 will probably be somewhat faster than a > 0 && b > 0. Since your adjacency matrix is non-negative, you can safely do this.
Reordering: if testing just a + b == c is faster, try using that test first and only then test a and b individually. I doubt this will be faster, because an addition plus an equality check is more expensive than zero checks, but it might work in your particular case.
Avoid double indexing: look at the resulting IL with ILDASM or an equivalent tool to ensure that the array elements are only dereferenced once, not twice. If they aren't, put them in local variables before the check.
Unless you're calling a function, you don't optimize conditionals; it's pointless. However, if you really want to, there are a few easy things to keep in mind.
Conditions check whether something is zero (or not) or whether the highest bit is set (or not), and a compare (== or !=) is essentially a - b followed by a check of whether the result is zero (== 0) or not (!= 0). So if a is unsigned, then a > 0 is the same as a != 0. If a is signed, then a < 0 is cheap (it only checks the highest bit) and is better than a <= 0. Anyway, just knowing those rules may help.
Also, fire up a profiler; you'll see that these conditionals take a negligible fraction of the time. If anything, you should ask how to write something that doesn't require conditionals at all.
Have you considered reversing the logic?
if (a > 0 && b > 0 && a + b == c)
could be rewritten to:
if (a == 0 || b == 0 || a + b != c) continue;
Since you don't want to do anything in the loop if any of the conditions is false, try to bail out as soon as possible (assuming the runtime is that smart, which I believe it is).
The heaviest operation should be last, because if the first test is true, the others don't need to be checked. I assumed that the addition is the heaviest part, but profiling might tell a different story.
However, I haven't profiled these scenarios myself, and with such trivial conditions it might even be a drawback. It would be interesting to see your findings.
Related
I was digging around in .NET's implementation of Dictionary and found one function I'm curious about: HashHelpers.GetPrime.
Most of what it does is quite straightforward: it looks for a prime number above some minimum that is passed to it as a parameter, apparently for the specific purpose of being used as the number of buckets in a hashtable-like structure. But there's one mysterious part:
if (HashHelpers.IsPrime(j) && (j - 1) % 101 != 0)
{
    return j;
}
What is the purpose of the (j - 1) % 101 != 0 check? i.e. Why do we apparently want to avoid having a number of buckets which is 1 more than a multiple of 101?
The comments explain it pretty well:
'InitHash' is basically an implementation of classic DoubleHashing (see http://en.wikipedia.org/wiki/Double_hashing).
1) The only 'correctness' requirement is that the 'increment' used to probe a. be non-zero and b. be relatively prime to the table size 'hashSize'. (This is needed to insure you probe all entries in the table before you 'wrap' and visit entries already probed.)
2) Because we choose table sizes to be primes, we just need to insure that the increment is 0 < incr < hashSize.
Thus this function would work: Incr = 1 + (seed % (hashSize - 1))
While this works well for 'uniformly distributed' keys, in practice non-uniformity is common. In particular, in practice we can see 'mostly sequential' keys, where you get long clusters of keys that 'pack'. To avoid bad behavior you want the increment to be 'large' even for 'small' values (because small values tend to happen more in practice). Thus we multiply 'seed' by a number that will make these small values bigger (and not hurt large values). We picked HashPrime (101) because it is prime, and if 'hashSize - 1' is not a multiple of HashPrime (enforced in GetPrime), then incr has the potential of being every value from 1 to hashSize - 1. The choice was largely arbitrary.
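An illustrative sketch of the probing scheme the comment describes (not the actual framework source; the method name and signature are my own):
const int HashPrime = 101;

// hashSize is prime, and GetPrime guarantees (hashSize - 1) % 101 != 0,
// so incr can still reach every value in 1..hashSize-1 even after the
// seed has been multiplied by HashPrime.
static void InitProbe(int hashCode, int hashSize, out uint bucket, out uint incr)
{
    uint seed = (uint)hashCode & 0x7FFFFFFF;
    bucket = seed % (uint)hashSize;
    incr = 1 + (seed * (uint)HashPrime) % ((uint)hashSize - 1);
}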
I have three nested loops from zero to n, where n is a large number, around 12,000. The three loops work on a 2D List. It is actually the Floyd algorithm. With data this large it takes a long time; could you advise me how to improve it? Thank you. (Sorry for my English.)
List<List<int>> distance = new List<List<int>>();
...
for (int i = 0; i < n; i++)
    for (int v = 0; v < n; v++)
        for (int w = 0; w < n; w++)
        {
            if (distance[v][i] != int.MaxValue &&
                distance[i][w] != int.MaxValue)
            {
                int d = distance[v][i] + distance[i][w];
                if (distance[v][w] > d)
                    distance[v][w] = d;
            }
        }
The first part of your if statement, distance[v][i] != int.MaxValue, can be moved outside the iteration over w to reduce overhead in some cases. However, I have no idea how often your values are at int.MaxValue.
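A rough sketch of that hoisting, also caching the row references outside the inner loop (the names rowI, rowV and dvi are mine):
for (int i = 0; i < n; i++)
{
    List<int> rowI = distance[i];
    for (int v = 0; v < n; v++)
    {
        List<int> rowV = distance[v];
        int dvi = rowV[i];
        if (dvi == int.MaxValue)
            continue;                 // skip the whole w loop for this v
        for (int w = 0; w < n; w++)
        {
            int diw = rowI[w];
            if (diw == int.MaxValue)
                continue;
            int d = dvi + diw;
            if (rowV[w] > d)
                rowV[w] = d;
        }
    }
}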
You cannot change Floyd's algorithm; its complexity is fixed (and it's provably the most efficient solution to the general problem of finding all pairwise shortest-path distances in a graph with negative edge weights).
You can only improve the runtime by making the problem more specific or the data set smaller. For a general solution you’re stuck with what you have.
Normally I would suggest Parallel LINQ (for example, the Ray Tracer sample), but that assumes the items you're operating on are independent. In your example you are using results from a previous iteration in the current one, which makes it impossible to parallelize.
As your code is quite simple and there isn't really any overhead, there's not much you can do to speed it up. As mentioned, you could switch the Lists to arrays. You might also want to compare double arithmetic to integer arithmetic on your target machine.
After a quick look at your code, it seems that you might be heading for an overflow, which the condition check cannot prevent.
The condition below does not protect against overflow, since we can have distance[v][i] < int.MaxValue && distance[i][w] < int.MaxValue and still distance[v][i] + distance[i][w] > int.MaxValue.
if (distance[v][i] != int.MaxValue && distance[i][w] != int.MaxValue)
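One possible guard, as a sketch: do the addition in 64-bit before comparing, so the sum cannot wrap around.
if (distance[v][i] != int.MaxValue && distance[i][w] != int.MaxValue)
{
    long d = (long)distance[v][i] + distance[i][w];   // cannot overflow a long
    if (d < distance[v][w])
        distance[v][w] = (int)d;   // safe: d < distance[v][w] <= int.MaxValue
}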
As the others have mentioned, the complexity is fixed, so you don't exactly have many options there. However, you can:
Use arrays instead of lists, if possible.
Use an "unsafe" block with pointer semantics; this should decrease the time required to access your array data.
Check whether you can parallelize your algorithm. In your case you could use multiple copies of your data (multiple copies to get rid of the need for synchronisation) and have several threads work on it, e.g. by splitting the range of the outer loop into subranges (1-1000, 1001-2000, and so on); see the sketch below.
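As a hedged illustration of the threading idea (a different decomposition than copying the data): for each fixed intermediate vertex i, the updates of the different rows v don't conflict with each other, so the middle loop can be split across threads, for example with Parallel.For, using arrays as the first point suggests.
using System.Threading.Tasks;

static void FloydParallel(int[][] dist, int n)
{
    for (int i = 0; i < n; i++)
    {
        int[] rowI = dist[i];
        int li = i;                        // local copy for the lambda
        // Parallel.For blocks until all rows are processed for this i.
        Parallel.For(0, n, v =>
        {
            int[] rowV = dist[v];
            int dvi = rowV[li];
            if (dvi == int.MaxValue) return;
            for (int w = 0; w < n; w++)
            {
                int diw = rowI[w];
                if (diw == int.MaxValue) continue;
                int d = dvi + diw;
                if (d < rowV[w]) rowV[w] = d;
            }
        });
    }
}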
I don't know what overhead there is in int array lookups. Which would perform better (in C#):
a = aLookup[i];
b = (a % 6) == 5;
c = (b ? a+1 : a-1) >> 1; // (a + 1) / 2 or (a - 1) / 2
Or
a = aLookup[i];
b = bLookup[i];
c = cLookup[i];
Would an array lookup actually save that much time for either b or c?
Edit: I profiled it several ways. The result is that array lookups are almost four times faster.
It is so extremely unlikely to matter. You should go with what is most readable. And I can tell you that
c = (b ? a+1 : a-1) >> 1;
is pointless, as you aren't buying any performance and your code is less readable. Just divide by two explicitly.
That said, just try it for yourself in a profiler if you really care.
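If you do want to measure it, a rough Stopwatch harness might look like the sketch below (hypothetical table sizes and contents; the usual micro-benchmark caveats about JIT warm-up and dead-code elimination apply).
using System;
using System.Diagnostics;

int n = 10000000;
int[] aLookup = new int[n];
int[] cLookup = new int[n];
var rnd = new Random(42);
for (int i = 0; i < n; i++)
{
    int a = rnd.Next(1000);
    aLookup[i] = a;
    cLookup[i] = ((a % 6) == 5 ? a + 1 : a - 1) >> 1;   // precomputed c
}

var sw = Stopwatch.StartNew();
long sum1 = 0;
for (int i = 0; i < n; i++)
{
    int a = aLookup[i];
    bool b = (a % 6) == 5;
    sum1 += (b ? a + 1 : a - 1) >> 1;     // compute c on the fly
}
Console.WriteLine("compute: " + sw.ElapsedMilliseconds + " ms");

sw.Restart();
long sum2 = 0;
for (int i = 0; i < n; i++)
    sum2 += cLookup[i];                   // read precomputed c
Console.WriteLine("lookup:  " + sw.ElapsedMilliseconds + " ms (checksums " + sum1 + "/" + sum2 + ")");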
A:
depends on
element type
length of array
cache locality
processor affinity, L2 cache size
cache duration (or more importantly: how many times used till cache eviction?)
B:
you need to ... Profile! (What Are Some Good .NET Profilers?)
Both are O(1) conceptually, although the array access includes a bounds check.
I don't think this will be your bottleneck either way; I would go with whatever is more readable and shows your intent better.
Also, if you use Reflector to check the implementation of the % operator,
you will find it is extremely inefficient and not to be used in time-critical, high-frequency code, so C# game programmers tend to avoid % and use:
while (x >= n) x -= n;
but they can make assumptions about the range of x (which are verified in debug builds).
Unless you are doing 10,000+ of these per second, I wouldn't worry about it.
I have a few inequalities involving {x, y} that must satisfy the following:
x>=0
y>=0
f(x,y)=x^2+y^2>=100
g(x,y)=x^2+y^2<=200
Note that x and y must be integers.
Graphically it can be represented as follows; the blue region is the region that satisfies the above inequalities:
The question now is: is there any function in Matlab that finds every admissible pair {x, y}? If there is an algorithm for doing this kind of thing, I would be glad to hear about it as well.
Of course, one approach we can always use is brute force, where we test every possible combination of {x, y} to see whether the inequalities are satisfied. But that is a last resort, because it is time consuming. I'm looking for a clever algorithm that does this, or, in the best case, an existing library that I can use straight away.
The x^2 + y^2 >= 100 and x^2 + y^2 <= 200 are just examples; in reality f and g can be polynomials of any degree.
Edit: C# code is welcome as well.
This is surely not possible to do in general, for an arbitrary set of polynomial inequalities, by any method other than enumerative search, even if there are a finite number of solutions. (Perhaps I should say not trivial, as it is possible: enumerative search will work, subject to floating point issues.) Note that the domain of interest need not be simply connected for higher-order inequalities.
Edit: The OP has asked about how one might proceed to do a search.
Consider the problem
x^3 + y^3 >= 1e12
x^4 + y^4 <= 1e16
x >= 0, y >= 0
Solve for all integer solutions of this system. Note that integer programming in ANY form will not suffice here, since ALL integer solutions are requested.
Use of meshgrid here would force us to look at points in the domain (0:10000)X(0:10000). So it would force us to sample a set of 1e8 points, testing every point to see whether it satisfies the constraints.
A simple loop can potentially be more efficient than that, although it will still require some effort.
% Note that I will store these in a cell array,
% since I cannot preallocate the results.
tic
xmax = 10000;
xy = cell(1,xmax);
for x = 0:xmax
    % solve for y, given x. This requires us to
    % solve for those values of y such that
    %   y^3 >= 1e12 - x.^3
    %   y^4 <= 1e16 - x.^4
    % These are simple expressions to solve for.
    y = ceil((1e12 - x.^3).^(1/3)):floor((1e16 - x.^4).^0.25);
    n = numel(y);
    if n > 0
        xy{x+1} = [repmat(x,1,n);y];
    end
end
% flatten the cell array
xy = cell2mat(xy);
toc
The time required was...
Elapsed time is 0.600419 seconds.
Of the 100020001 combinations that we might have tested for, how many solutions did we find?
size(xy)
ans =
2 4371264
Admittedly, the exhaustive search is simpler to write.
tic
[x,y] = meshgrid(0:10000);
k = (x.^3 + y.^3 >= 1e12) & (x.^4 + y.^4 <= 1e16);
xy = [x(k),y(k)];
toc
I ran this on a 64-bit machine with 8 GB of RAM, but even so the test itself was a CPU hog.
Elapsed time is 50.182385 seconds.
Note that floating point considerations will sometimes cause a different number of points to be found, depending on how the computations are done.
Finally, if your constraint equations are more complex, you might need to use roots in the expression for the bounds on y, to help identify where the constraints are satisfied. The nice thing here is that it still works for more complicated polynomial bounds.
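Since the question says C# code is welcome too, here is a hedged C# translation of the same per-x bound computation for the example constraints (the names and the particular power/rounding calls are my choices; the same floating point caveats apply):
using System;
using System.Collections.Generic;

var xy = new List<int[]>();
const double fBound = 1e12;   // x^3 + y^3 >= 1e12
const double gBound = 1e16;   // x^4 + y^4 <= 1e16
int xmax = 10000;
for (int x = 0; x <= xmax; x++)
{
    double x3 = (double)x * x * x;
    double x4 = x3 * x;
    if (x4 > gBound) break;    // no admissible y for this or any larger x
    // solve the two constraints for the admissible range of y
    int yLow = (int)Math.Ceiling(Math.Pow(Math.Max(0.0, fBound - x3), 1.0 / 3.0));
    int yHigh = (int)Math.Floor(Math.Pow(gBound - x4, 0.25));
    for (int y = yLow; y <= yHigh; y++)
        xy.Add(new[] { x, y });
}
Console.WriteLine(xy.Count + " admissible pairs");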
I have a program that needs to repeatedly compute the approximate percentile (order statistic) of a dataset in order to remove outliers before further processing. I'm currently doing so by sorting the array of values and picking the appropriate element; this is doable, but it's a noticeable blip on the profiles despite being a fairly minor part of the program.
More info:
The data set contains on the order of up to 100,000 floating point numbers and is assumed to be "reasonably" distributed: there are unlikely to be duplicates or huge spikes in density near particular values. And if for some odd reason the distribution is odd, it's OK for the approximation to be less accurate, since the data is probably messed up anyhow and further processing dubious. However, the data isn't necessarily uniformly or normally distributed; it's just very unlikely to be degenerate.
An approximate solution would be fine, but I do need to understand how the approximation introduces error to ensure it's valid.
Since the aim is to remove outliers, I'm computing two percentiles over the same data at all times: e.g. one at 95% and one at 5%.
The app is in C# with bits of heavy lifting in C++; pseudocode or a preexisting library in either would be fine.
An entirely different way of removing outliers would be fine too, as long as it's reasonable.
Update: It seems I'm looking for an approximate selection algorithm.
Although this is all done in a loop, the data is (slightly) different every time, so it's not easy to reuse a datastructure as was done for this question.
Implemented Solution
Using the Wikipedia selection algorithm, as suggested by Gronim, reduced this part of the run time by about a factor of 20.
Since I couldn't find a C# implementation, here's what I came up with. It's faster than Array.Sort even for small inputs, and at 1000 elements it's 25 times faster.
public static double QuickSelect(double[] list, int k) {
    return QuickSelect(list, k, 0, list.Length);
}
public static double QuickSelect(double[] list, int k, int startI, int endI) {
    while (true) {
        // Assume startI <= k < endI
        int pivotI = (startI + endI) / 2; // arbitrary, but good if sorted
        int splitI = partition(list, startI, endI, pivotI);
        if (k < splitI)
            endI = splitI;
        else if (k > splitI)
            startI = splitI + 1;
        else // k == splitI
            return list[k];
    }
    // when this returns, all elements of list[i] <= list[k] iff i <= k
}
static int partition(double[] list, int startI, int endI, int pivotI) {
    double pivotValue = list[pivotI];
    list[pivotI] = list[startI];
    list[startI] = pivotValue;
    int storeI = startI + 1; // no need to store # pivot item, it's good already
    // Invariant: startI < storeI <= endI
    while (storeI < endI && list[storeI] <= pivotValue) ++storeI; // fast if sorted
    // now storeI == endI || list[storeI] > pivotValue
    // so element #storeI is either irrelevant or too large
    for (int i = storeI + 1; i < endI; ++i)
        if (list[i] <= pivotValue) {
            list.swap_elems(i, storeI);
            ++storeI;
        }
    int newPivotI = storeI - 1;
    list[startI] = list[newPivotI];
    list[newPivotI] = pivotValue;
    // now [startI, newPivotI] are <= pivotValue && list[newPivotI] == pivotValue
    return newPivotI;
}
static void swap_elems(this double[] list, int i, int j) {
    double tmp = list[i];
    list[i] = list[j];
    list[j] = tmp;
}
Thanks, Gronim, for pointing me in the right direction!
The histogram solution from Henrik will work. You can also use a selection algorithm to efficiently find the k largest or smallest elements in an array of n elements in O(n). To use this for the 95th percentile, set k = 0.05n and find the k largest elements.
Reference:
http://en.wikipedia.org/wiki/Selection_algorithm#Selecting_k_smallest_or_largest_elements
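For example, with a QuickSelect like the one shown in the question, the two cut-offs could be obtained roughly like this (the index rounding convention and the defensive copy are my choices; QuickSelect reorders the array in place, and data stands for the hypothetical input array):
using System.Linq;

double[] work = (double[])data.Clone();   // QuickSelect reorders its input
int n = work.Length;
double p05 = QuickSelect(work, (int)(0.05 * (n - 1)));
double p95 = QuickSelect(work, (int)(0.95 * (n - 1)));
double[] trimmed = data.Where(x => x >= p05 && x <= p95).ToArray();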
According to its creator, a SoftHeap can be used to:
compute exact or approximate medians and percentiles optimally. It is also useful for approximate sorting...
I used to identify outliers by calculating the standard deviation. Everything with a distance of more than 2 (or 3) times the standard deviation from the average is an outlier; 2 times covers about 95%.
Since you are already calculating the average, calculating the standard deviation as well is very easy and fast.
You could also use only a subset of your data to calculate the numbers.
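A minimal sketch of that approach, assuming the outliers are simply dropped (the factor k is the 2 or 3 mentioned above):
using System;
using System.Linq;

static double[] RemoveOutliers(double[] data, double k)
{
    double mean = data.Average();
    // population standard deviation: sqrt of the mean squared deviation
    double sigma = Math.Sqrt(data.Sum(x => (x - mean) * (x - mean)) / data.Length);
    return data.Where(x => Math.Abs(x - mean) <= k * sigma).ToArray();
}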
You could estimate your percentiles from just a part of your dataset, like the first few thousand points.
The Glivenko–Cantelli theorem ensures that this would be a fairly good estimate, if you can assume your data points to be independent.
Divide the interval between minimum and maximum of your data into (say) 1000 bins and calculate a histogram. Then build partial sums and see where they first exceed 5000 or 95000.
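A minimal sketch of that histogram idea, generalized to an arbitrary percentile p (the bin count and the "upper edge of the crossing bin" convention are arbitrary choices of mine):
using System;
using System.Linq;

static double ApproxPercentile(double[] data, double p, int bins = 1000)
{
    double min = data.Min(), max = data.Max();
    if (min == max) return min;
    double width = (max - min) / bins;
    int[] counts = new int[bins];
    foreach (double x in data)
        counts[Math.Min((int)((x - min) / width), bins - 1)]++;  // clamp max into last bin
    double threshold = p * data.Length;       // partial-sum target, e.g. 0.05 * N
    int running = 0;
    for (int b = 0; b < bins; b++)
    {
        running += counts[b];
        if (running >= threshold)
            return min + (b + 1) * width;     // upper edge of the first bin to cross p
    }
    return max;
}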
There are a couple of basic approaches I can think of. The first is to compute the range (by finding the highest and lowest values), project each element to a percentile ((x - min) / range), and throw out any that evaluate to lower than .05 or higher than .95.
The second is to compute the mean and standard deviation. A span of 2 standard deviations from the mean (in both directions) will enclose 95% of a normally-distributed sample space, meaning your outliers would be below the 2.5th and above the 97.5th percentiles. Calculating the mean of a series is linear, as is the standard deviation (the square root of the mean of the squared differences of each element from the mean). Then subtract 2 sigmas from the mean and add 2 sigmas to the mean, and you've got your outlier limits.
Both of these will compute in roughly linear time; the first one requires two passes, the second takes three (once you have your limits you still have to discard the outliers). Since this is a list-based operation, I do not think you will find anything with logarithmic or constant complexity; any further performance gains would require either optimizing the iteration and calculation, or introducing error by performing the calculations on a sub-sample (such as every third element).
A good general answer to your problem seems to be RANSAC.
Given a model, and some noisy data, the algorithm efficiently recovers the parameters of the model.
You will have to choose a simple model that can map your data. Anything smooth should be fine, say a mixture of a few Gaussians. RANSAC will fit the parameters of your model and estimate a set of inliers at the same time. Then throw away whatever doesn't fit the model properly.
You could filter out points 2 or 3 standard deviations away even if the data is not normally distributed; at least it will be done in a consistent manner, which should be important.
As you remove the outliers, the std dev will change; you could do this in a loop until the change in std dev is minimal. Whether or not you want to do this depends on why you are manipulating the data this way. Some statisticians have major reservations about removing outliers, but some remove the outliers to prove that the data is fairly normally distributed.
Not an expert, but my memory suggests:
to determine percentile points exactly, you need to sort and count
taking a sample from the data and calculating the percentile values sounds like a good plan for a decent approximation if you can get a good sample
if not, as suggested by Henrik, you can avoid the full sort if you build the buckets and count them
One set of data of 100k elements takes almost no time to sort, so I assume you have to do this repeatedly. If the data set is the same set, just updated slightly, you're best off building a tree (O(N log N)) and then removing and adding new points as they come in (O(K log N), where K is the number of points changed). Otherwise, the kth-largest-element solution already mentioned gives you O(N) for each dataset.