I don't know what overhead there is in int array lookups. Which would perform better (in C#):
a = aLookup[i];
b = (a % 6) == 5;
c = (b ? a+1 : a-1) >> 1; // (a + 1) / 2 or (a - 1) / 2
Or
a = aLookup[i];
b = bLookup[i];
c = cLookup[i];
Would an array lookup actually save that much time for either b or c?
Edit: I profiled it several ways. The result is that array lookups are almost four times faster.
It is so extremely unlikely to matter. You should go with what is most readable. And I can tell you that
c = (b ? a+1 : a-1) >> 1;
is pointless: you aren't buying any performance, and you're making the code less readable. Just go with explicitly dividing by two.
That said, just try it for yourself in a profiler if you really care.
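If you do want to measure it, a rough Stopwatch micro-benchmark along these lines would do; the table size, fill values, and iteration count here are made up for illustration, and the lookup-table version should be timed the same way for comparison:
int[] aLookup = new int[1024];
var rand = new Random(42);
for (int i = 0; i < aLookup.Length; i++) aLookup[i] = rand.Next(1000);

var sw = Stopwatch.StartNew();
long sink = 0;                            // keep results live so the work isn't optimized away
for (int i = 0; i < 10000000; i++)
{
    int a = aLookup[i & 1023];
    bool b = (a % 6) == 5;
    int c = (b ? a + 1 : a - 1) >> 1;
    sink += c;
}
Console.WriteLine("compute: " + sw.ElapsedMilliseconds + " ms (" + sink + ")");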
A:
it depends on:
element type
length of the array
cache locality
processor affinity, L2 cache size
cache duration (or more importantly: how many times is it used before cache eviction?)
B:
you need to ... Profile! (What Are Some Good .NET Profilers?)
Both are O(1) conceptually, although you have an out of bounds check with the array access.
I don't think this will be your bottleneck either way; I would go with whatever is more readable and shows your intent better.
Also, if you use Reflector to check the implementation of the % operator,
you will find it is quite inefficient and best avoided in high-frequency, time-critical code, so C# game programmers tend to avoid % and use:
while (x >= n) x -= n;
but they can make assumptions about the range of x (which are verified in debug builds)
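For illustration, the range assumption could be checked in debug builds along these lines (a sketch; the bound 2 * n is an assumed caller guarantee, not something from the original post):
Debug.Assert(x >= 0 && x < 2 * n);   // assumed: caller keeps x in [0, 2n)
while (x >= n) x -= n;               // cheaper than x %= n when x is already near range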
unless you are doing 10,000+ of these per second I wouldn't worry about it
I have a newbie question guys.
Let's suppose that I have the following simple nested loop, where m and n are not necessarily the same, but are really big numbers:
x = 0;
for (i=0; i<m; i++)
{
for (j=0; j<n; j++)
{
delta = CalculateDelta(i,j);
x = x + j + i + delta;
}
}
And now I have this:
x = 0;
for (i=0; i<m; i++)
{
for (j=0; j<n; j++)
{
delta = CalculateDelta(i,j);
x = x + j + i + delta;
j++;
delta = CalculateDelta(i,j);
x = x + j + i + delta;
}
}
Rule: I do need to go through all the elements of the loop, because of this delta calculation.
My questions are:
1) Is the second algorithm faster than the first one, or is it the same?
I have this doubt because to me the first algorithm has a complexity of O(m * n) and the second one O(m * n/2). Or does lower complexity not necessarily make it faster?
2) Is there any other way to make this faster without something like a Parallel.For?
3) If I make use of a Parallel.For, would it really make it faster, since I would probably need a synchronization lock on the x variable?
Thanks!
Definitely not: since the time is presumably dominated by the number of times CalculateDelta() is called, it doesn't matter whether you make the calls inline, within a single loop, or within any number of nested loops; the call gets made m*n times either way.
And now you have a bug (which is the reason I decided to add an answer after #Peter-Duniho had already done so quite comprehensively)
If n is odd, you do more iterations than intended - almost certainly getting the wrong answer or crashing your program...
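If you did want to keep a manual two-at-a-time unroll, it would have to stop one short and handle a possible leftover element, along these lines (a sketch, not a recommendation):
x = 0;
for (i = 0; i < m; i++)
{
    for (j = 0; j + 1 < n; j += 2)        // process pairs j, j+1
    {
        x += j + i + CalculateDelta(i, j);
        x += (j + 1) + i + CalculateDelta(i, j + 1);
    }
    if (j < n)                            // leftover element when n is odd
        x += j + i + CalculateDelta(i, j);
}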
Asking three questions in a single post is pushing the limits of "too broad". But in your case, the questions are reasonably simple, so…
1) Is the second algorithm faster than the first one, or is it the same? I have this doubt because to me the first algorithm has a complexity of O(m * n) and the second one O(m * n/2). Or does lower complexity not necessarily make it faster?
Complexity ignores coefficients like 1/2. So there's no difference between O(m * n) and O(m * n/2). The latter should have been reduced to O(m * n), which is obviously the same as the former.
And the second isn't really O(m * n/2) anyway, because you didn't really remove work. You just partially unrolled your loop. These kinds of meaningless transformations are one of the reasons we ignore coefficients in big-O notation in the first place. It's too easy to fiddle with the coefficients without really changing the actual computational work.
2) Is there any other way to make this faster without something like a Parallel.For?
That's definitely too broad a question. "Any other way"? Probably. But you haven't provided enough context.
The only obvious potential improvement I can see in the code you posted is that you are computing j + i repeatedly, when you could instead just observe that that component of the whole computation increases by 1 with each iteration and so you could keep a separate incrementing variable instead of adding i and j each time. But a) it's far from clear making that change would speed anything up (whether it would, would depend a lot on specifics in the CPU's own optimization logic), and b) if it did so reliably, it's possible that a good optimizing JIT compiler would make that transformation to the code for you.
But beyond that, the CalculateDelta() method is a complete unknown here. It could be a simple one-liner that the compiler inlines for you, or it could be some enormous computation that dominates the whole loop.
There's no way for any of us to tell you if there is "any other way" to make the loop faster. For that matter, it's not even clear that the change you made makes the loop faster. Maybe it did, maybe it didn't.
3) If I make use of a Parallel.For, would it really make it faster, since I would probably need a synchronization lock on the x variable?
That at least depends on what CalculateDelta() is doing. If it's expensive enough, then the synchronization on x might not matter. The bigger issue is that each calculation of x depends on the previous one. It's impossible to parallelize the computation, because it's inherently a serial computation.
What you could do is compute all the deltas in parallel, since they don't depend on x (at least, they don't in the code you posted). The other element of the sum is constant (i + j for known m and n), so in the end it's just the sum of the deltas and that constant. Again, whether this is worth doing depends somewhat on how costly CalculateDelta() is. The less costly that method is, the less likely you're going to see much if any improvement by parallelizing execution of it.
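A sketch of that idea, assuming CalculateDelta is pure and thread-safe and its results can be accumulated in a long (all assumptions; none of this is shown in the question):
long deltaSum = 0;
Parallel.For(0, m,
    () => 0L,                                   // per-thread partial sum
    (i, loopState, local) =>
    {
        for (int j = 0; j < n; j++)
            local += CalculateDelta(i, j);
        return local;
    },
    local => Interlocked.Add(ref deltaSum, local));
// x = deltaSum + the closed-form sum of the i + j terms (see the transformation below)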
One advantageous transformation is to extract the arithmetic sum of the contributions of the i and j terms using the double arithmetic series formula. This saves quite a bit of work, reducing the complexity of that portion of the calculation to O(1) from O(m*n).
x = 0;
for (i=0; i<m; i++)
{
for (j=0; j<n; j++)
{
delta = CalculateDelta(i,j);
x = x + j + i + delta;
}
}
can become
x = n * m * (n + m - 2) / 2;
for (i=0; i<m; i++)
{
for (j=0; j<n; j++)
{
x += CalculateDelta(i,j);
}
}
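The closed form comes from the standard arithmetic series identity, assuming the loops run over i = 0..m-1 and j = 0..n-1 as in the question:
sum of (i + j) over all i, j
  = n * (0 + 1 + ... + (m - 1)) + m * (0 + 1 + ... + (n - 1))
  = n * m * (m - 1) / 2 + m * n * (n - 1) / 2
  = n * m * ((m - 1) + (n - 1)) / 2
  = n * m * (n + m - 2) / 2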
Optimizing further depends entirely on what CalculateDelta does, which you have not disclosed. If it has side effects, then that's a problem. But if it's a pure function (where its result is dependent only on the inputs i and j) then there's a good chance it can be computed directly as well.
The first for() will send you into the second for(); the second for() will loop until j < n. So the first version runs m*n inner iterations and the second runs m*n/2, but each of those iterations does twice the work, so the total is the same.
Essentially, I'm not sure how to store a 3D data structure for the fastest access possible, since I'm not sure what is going on under the hood with multi-dimensional arrays.
NOTE: The arrays will be a constant and known size each and every time, and each element will be exactly 16 bits.
Option one is to have a multi-dimensional array, data[16, 16, 16], and simply access it via data[x, y, z]. Option two is to have a single-dimensional array, data[16 * 16 * 16], and access it via data[x + (y * 16) + (z * 16 * 16)].
Each element should only be 16 bits long, and I have a suspicion that a multi-dimensional array would internally store a lot of references to other arrays, at a minimum of 32 bits each, which is a lot of wasted memory. However, I fear it may be faster than running the equation from option two on each access, and speed is key to this project.
So, can anyone enlighten me as to how much difference in speed there would likely to be compared to how much difference in memory consumption?
C# stores multidimensional arrays as a single block of memory, so they compile to almost the same thing. (One difference is that there are three sets of bounds to check).
I.e. for an array with dimensions [nx, ny, nz], arr[x, y, z] is just about equivalent to a flat arr[z + y*nz + x*ny*nz] and will generally have similar performance characteristics.
The exact performance however will be dominated by the pattern of memory access, and how this affects cache coherence (at least for large amounts of data). You may find that nested loops over x, then y then z may be faster or slower than doing the loops in a different order, if one does a better job of keeping currently used data in the processor cache.
This is highly dependent on the exact algorithm, so it isn't possible to give an answer which is correct for all algorithms.
The other cause of any speed reduction versus C or C++ is the bounds-checking, which will still be needed in the one-dimensional array case. However these will often, but not always, be removed automatically.
https://blogs.msdn.microsoft.com/clrcodegeneration/2009/08/13/array-bounds-check-elimination-in-the-clr/
Again, the exact algorithm will affect whether the optimiser is able to remove the bounds checks.
Your course of action should be as follows:
Write a naïve version of the algorithm with arr[x,y,z].
If it's fast enough you can stop.
Otherwise profile the algorithm to check it is actually array accesses which are the issue, analyse the memory access patterns and so on.
I think it's worth pointing out that if your array dimensions are really all 16, then you can calculate the index for the array from (x, y, z) much more efficiently:
int index = x | y << 4 | z << 8;
And the inverse:
int x = index & 0xf;
int y = (index >> 4) & 0xf;
int z = (index >> 8) & 0xf;
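For illustration, a minimal wrapper around that layout might look like this (the Grid16 name and field layout are assumptions for the sketch, not anything from the question):
// Flat 16x16x16 block of 16-bit values; index = x | y << 4 | z << 8
struct Grid16
{
    private readonly ushort[] data;

    public Grid16(ushort[] backing)
    {
        data = backing;                  // expects backing.Length == 16 * 16 * 16
    }

    public ushort this[int x, int y, int z]
    {
        get { return data[x | (y << 4) | (z << 8)]; }
        set { data[x | (y << 4) | (z << 8)] = value; }
    }
}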
If this is the case, then I recommend using the single-dimensional array since it will almost certainly be faster.
Note that it's entirely possible that the JIT compiler would perform this optimisation anyway (assuming that the multiplication is hard-coded as per your OP), but it's worth doing explicitly.
The reason that I say the single-dimensional array would be faster is because the latest compiler is lacking some of the optimisations for multi-dimensional array access, as discussed in this thread.
That said, you should perform careful timings to see what really is the fastest.
As Eric Lippert says: "If you want to know which horse is faster, race your horses".
I would vote for the single-dimensional array; it should work much faster. Basically you can write some tests, performing your most common tasks and measuring the time spent.
Also, if you have 2^n array sizes, it is much faster to compute the element position using a left-shift operation instead of a multiplication.
I was digging around in .NET's implementation of Dictionaries, and found one function that I'm curious about: HashHelpers.GetPrime.
Most of what it does is quite straightforward, it looks for a prime number above some minimum which is passed to it as a parameter, apparently for the specific purpose of being used as a number of buckets in a hashtable-like structure. But there's one mysterious part:
if (HashHelpers.IsPrime(j) && (j - 1) % 101 != 0)
{
return j;
}
What is the purpose of the (j - 1) % 101 != 0 check? i.e. Why do we apparently want to avoid having a number of buckets which is 1 more than a multiple of 101?
The comments explain it pretty well:
‘InitHash’ is basically an implementation of classic DoubleHashing
(see http://en.wikipedia.org/wiki/Double_hashing)
1) The only ‘correctness’ requirement is that the ‘increment’ used to
probe a. Be non-zero b. Be relatively prime to the table size
‘hashSize’. (This is needed to insure you probe all entries in the
table before you ‘wrap’ and visit entries already probed)
2) Because
we choose table sizes to be primes, we just need to insure that the
increment is 0 < incr < hashSize
Thus this function would work: Incr = 1 + (seed % (hashSize-1))
While this works well for ‘uniformly distributed’ keys, in practice,
non-uniformity is common. In particular in practice we can see
‘mostly sequential’ where you get long clusters of keys that ‘pack’.
To avoid bad behavior you want it to be the case that the increment is
‘large’ even for ‘small’ values (because small values tend to happen
more in practice). Thus we multiply ‘seed’ by a number that will make
these small values bigger (and not hurt large values). We picked
HashPrime (101) because it was prime, and if ‘hashSize-1’ is not a
multiple of HashPrime (enforced in GetPrime), then incr has the
potential of being every value from 1 to hashSize-1. The choice was
largely arbitrary.
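To make that concrete, here is a rough sketch of a double-hashing probe in the spirit of those comments (an illustration only, not the actual Dictionary/Hashtable source):
// hashSize is prime; 101 is the HashPrime mentioned above.
// Since 1 <= incr < hashSize and hashSize is prime, incr is coprime to hashSize,
// so the probe sequence visits every bucket before repeating.
static int FindBucket(int hashCode, int hashSize, bool[] occupied)   // assumes a free bucket exists
{
    uint seed = (uint)(hashCode & 0x7FFFFFFF);
    uint incr = 1 + (seed * 101) % ((uint)hashSize - 1);
    int bucket = (int)(seed % (uint)hashSize);
    while (occupied[bucket])
        bucket = (int)(((uint)bucket + incr) % (uint)hashSize);
    return bucket;
}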
I'm currently doing some graph calculations that involves adjacency matrices, and I'm in the process of optimizing every little bit of it.
One of the instructions that I think can be optimized is the one in the title, in its original form:
if ((adjMatrix[i][k] > 0) && (adjMatrix[k][j] > 0) && (adjMatrix[i][k] + adjMatrix[k][j] == w))
But for ease I'll stick to the form provided in the title:
if (a > 0 && b > 0 && a + b == c)
What I don't like is the > 0 part (being an adjacency matrix, in its initial form it contains only 0 and 1, but as the program progresses, zeros are replaced with numbers from 2 onwards, until there are no more zeros).
I've done a test and removed the > 0 part for both a and b, and there was a significant improvement. Over 60088 iterations there was a decrease of 792 ms, from 3672 ms to 2880 ms, which is 78% of the original time; to me that's excellent.
So my question is: can you think of some way of optimizing a statement like this and having the same result, in C#? Maybe some bitwise operations or something similar, I'm not quite familiar with them.
Answer with every idea that crosses your mind, even if it's not suitable. I'll do the speed testing myself and let you know of the results.
EDIT: This is for a program that I'm going to compile and run myself on my own computer. What I just described is not a problem or bottleneck that I'm complaining about. The program in its current form runs fine for my needs, but I just want to push it forward and make it as basic and optimized as possible. Hope this clarifies it a little bit.
EDIT: I believe providing the full code is useful, so here it is, but keep in mind what I said below: I want to concentrate strictly on the if statement. The program essentially takes an adjacency matrix and stores all the route combinations that exist. These are then sorted and trimmed according to some coefficients, but I didn't include that part.
int w, i, j, li, k;
int[][] adjMatrix = Data.AdjacencyMatrix;
List<List<List<int[]>>> output = new List<List<List<int[]>>>(c);
for (w = 2; w <= 5; w++)
{
int[] plan;
for (i = 0; i < c; i++)
{
for (j = 0; j < c; j++)
{
if (j == i) continue;
if (adjMatrix[i][j] == 0)
{
for (k = 0; k < c; k++) // 11.7%
{
if (
adjMatrix[i][k] > 0 &&
adjMatrix[k][j] > 0 &&
adjMatrix[i][k] + adjMatrix[k][j] == w) // 26.4%
{
adjMatrix[i][j] = w;
foreach (int[] first in output[i][k])
foreach (int[] second in output[k][j]) // 33.9%
{
plan = new int[w - 1];
li = 0;
foreach (int l in first) plan[li++] = l;
plan[li++] = k;
foreach (int l in second) plan[li++] = l;
output[i][j].Add(plan);
}
}
}
// Here the sorting and trimming occurs, but for the sake of
// discussion, this is only a simple IEnumerable<T>.Take()
if (adjMatrix[i][j] == w)
output[i][j] = output[i][j].Take(10).ToList();
}
}
}
}
I've added comments with profiler results from an optimized build.
By the way, the timing results were obtained with exactly this piece of code (without the sorting and trimming, which dramatically increases execution time). No other parts were included in my measurement: there is a Stopwatch.StartNew() exactly before this code, and a Console.WriteLine(ElapsedMilliseconds) just after.
If you want an idea of the size, the adjacency matrix has 406 rows / columns. So basically it is only nested for-instructions executing many, many iterations, so I haven't got many options for optimizing. Speed is not currently a problem, but I want to make sure I'm ready when it becomes one.
And to rule out the 'optimize other parts' problem, there is room for discussion on that subject too, but for this specific matter I just want to find a solution for it as an abstract problem / concept. It may help me and others understand how the C# compiler works and treats if-statements and comparisons; that's my goal here.
You can replace a > 0 && b > 0 with ((a - 1) | (b - 1)) >= 0 for signed variables a and b. (Note the extra parentheses: in C#, | binds more loosely than >=.)
Likewise, the condition x == w can be expressed as ((x - w) | (w - x)) >= 0, since when x != w either the left or the right part of the expression will set the sign bit, which is preserved by bitwise or. Everything put together would be ((a - 1) | (b - 1) | (a + b - w) | (w - a - b)) >= 0, expressed as a single comparison.
Alternatively, a slight speed advantage may come from evaluating the tests in increasing order of probability, so the test most likely to fail comes first:
which is more likely to hold, (a | b) >= 0 or (a + b) == w?
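Spelled out in C# (a sketch; as noted above, it assumes a, b and c stay far enough from int.MinValue / int.MaxValue that none of the subtractions or the addition overflow):
// Branchless equivalent of (a > 0 && b > 0 && a + b == c):
// the OR is non-negative only if every term is non-negative.
bool match = ((a - 1) | (b - 1) | (a + b - c) | (c - a - b)) >= 0;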
I don't know how well C# optimizes things like this, but it's not so difficult to try storing adjMatrix[i][k] and adjMatrix[k][j] in temporary variables so that memory isn't read twice. See if that changes things in any way.
It's hard to believe that arithmetic and comparison operations are the bottleneck here. Most likely it's memory access or branching. Ideally memory should be accessed in a linear fashion. Can you do something to make it more linear?
It would be good to see more code to suggest something more concrete.
Update: You could try using a rectangular two-dimensional array (int[,]) instead of a jagged one (int[][]). This might improve memory locality and element access speed.
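For reference, the difference in declaration and access looks roughly like this (adjRect and the copy step are hypothetical; the question gets its jagged matrix from Data.AdjacencyMatrix):
int[][] adjJagged = Data.AdjacencyMatrix;   // jagged: outer array of separate row arrays
int a1 = adjJagged[i][k];                   // two dereferences: row reference, then element

int[,] adjRect = new int[c, c];             // rectangular: one contiguous block
adjRect[i, k] = adjJagged[i][k];            // (fill it once from the jagged data)
int a2 = adjRect[i, k];                     // single object; offset computed internally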
The order of the logical tests could be important (as noted in other answers). Since you are using the short-circuit logical test (&& instead of &), the conditions are evaluated from left to right, and the first one found to be false will cause the program to stop evaluating the conditional and continue executing (without executing the if block). So if there is one condition that is far more likely to be false than the rest, it should go first, and the next should be the next most likely to be false, and so on.
Another good optimization (which I suspect is really what gave you your performance increase, rather than simply dropping some of the conditions) is to assign the values you are pulling from the arrays to local variables.
You are using adjMatrix[i][k] twice (as well as adjMatrix[k][j]), which forces the computer to dig through the array to get the value each time. Instead, before the if statement, assign each of those to a local variable, then do your logic test against those variables.
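A minimal sketch of that change, using the names from the question:
int ik = adjMatrix[i][k];   // read each matrix element once
int kj = adjMatrix[k][j];
if (ik > 0 && kj > 0 && ik + kj == w)
{
    // ... same body as before, using ik and kj ...
}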
I agree with others who say it's unlikely that this simple statement is your bottleneck and suggest profiling before you decide on optimizing this specific line. But, as a theoretical experiment, you can do a couple of things:
Zero checks: checking a != 0 && b != 0 will probably be somewhat faster than a > 0 && b > 0. Since your adjacency matrix is non-negative, you can safely do this.
Reordering: if testing just a + b == c turns out to be faster, try using that test first and only then test a and b individually. I doubt this will be faster, because an addition plus an equality check is more expensive than zero checks, but it might work for your particular case.
Avoid double indexing: look at the resulting IL with ILDASM or an equivalent to ensure that the array indexes are only dereferenced once, not twice. If they aren't, try putting them in local variables before the check.
Unless you're calling a function, you don't optimize conditionals; it's pointless. However, if you really want to, there are a few easy things to keep in mind.
Conditions check whether something is zero (or not), or whether the highest bit is set (or not); a compare (== or !=) is essentially computing a - b and checking whether the result is zero (==) or not (!=). So if a is unsigned, then a > 0 is the same as a != 0. If a is signed, then a < 0 is pretty good (it uses the check on the highest bit) and is better than a <= 0. Anyway, just knowing those rules may help.
Also, fire up a profiler; you'll see conditionals take 0.001% of the time. If anything, you should ask how to write something that doesn't require conditionals.
Have you considered reversing the logic?
if (a > 0 && b > 0 && a + b == c)
could be rewritten to:
if (a == 0 || b == 0 || a + b != c) continue;
Since you don't want to do anything in the loop if any of the statements are false, then try to abort as soon as possible (if the runtime is that smart, which I assume).
The heaviest operation should be last, because if the first statement is true, the others don't need to be checked. I assumed that the addition is the heaviest part, but profiling it might tell a different story.
However, I haven't profiled these scenarios myself, and with such trivial conditionals it might even be a drawback. It would be interesting to see your findings.
Does anyone know if the multiply operator is faster than using the Math.Pow method? Like:
n * n * n
vs
Math.Pow ( n, 3 )
I just reinstalled Windows, so Visual Studio is not installed and the code is ugly.
using System;
using System.Diagnostics;
public static class test{
public static void Main(string[] args){
MyTest();
PowTest();
}
static void PowTest(){
var sw = Stopwatch.StartNew();
double res = 0;
for (int i = 0; i < 333333333; i++){
res = Math.Pow(i,30); //pow(i,30)
}
Console.WriteLine("Math.Pow: " + sw.ElapsedMilliseconds + " ms: " + res);
}
static void MyTest(){
var sw = Stopwatch.StartNew();
double res = 0;
for (int i = 0; i < 333333333; i++){
res = MyPow(i,30);
}
Console.WriteLine("MyPow: " + sw.ElapsedMilliseconds + " ms: " + res);
}
static double MyPow(double num, int exp)
{
double result = 1.0;
while (exp > 0)
{
if (exp % 2 == 1)
result *= num;
exp >>= 1;
num *= num;
}
return result;
}
}
The results:
csc /o test.cs
test.exe
MyPow: 6224 ms: 4.8569351667866E+255
Math.Pow: 43350 ms: 4.8569351667866E+255
Exponentiation by squaring (see https://stackoverflow.com/questions/101439/the-most-efficient-way-to-implement-an-integer-based-power-function-powint-int) is much faster than Math.Pow in my test (my CPU is a Pentium T3200 at 2 GHz).
EDIT: .NET version is 3.5 SP1, OS is Vista SP1 and power plan is high performance.
Basically, you should benchmark to see.
Educated Guesswork (unreliable):
In case it's not optimized to the same thing by some compiler...
It's very likely that x * x * x is faster than Math.Pow(x, 3) as Math.Pow has to deal with the problem in its general case, dealing with fractional powers and other issues, while x * x * x would just take a couple multiply instructions, so it's very likely to be faster.
A few rules of thumb from 10+ years of optimization in image processing & scientific computing:
Optimizations at an algorithmic level beat any amount of optimization at a low level. Despite the "Write the obvious, then optimize" conventional wisdom this must be done at the start. Not after.
Hand coded math operations (especially SIMD SSE+ types) will generally outperform the fully error checked, generalized inbuilt ones.
Any operation where the compiler knows beforehand what needs to be done is optimized by the compiler. These include:
1. Memory operations such as Array.Copy()
2. For loops over arrays where the array length is given. As in for (..; i<array.Length;..)
Always set unrealistic goals (if you want to).
I just happened to have tested this yesterday, then saw your question now.
On my machine, a Core 2 Duo running 1 test thread, it is faster to use multiplication up to a power of 9. At a power of 10, Math.Pow(b, e) is faster.
However, even at a power of 2, the results are often not identical. There are rounding errors.
Some algorithms are highly sensitive to rounding errors. I had to literally run over a million random tests until I discovered this.
This is so micro that you should probably benchmark it for specific platforms, I don't think the results for a Pentium Pro will be necessarily the same as for an ARM or Pentium II.
All in all, it's most likely to be totally irrelevant.
I checked, and Math.Pow() is defined to take two doubles. This means that it can't do repeated multiplications, but has to use a more general approach. If there were a Math.Pow(double, int), it could probably be more efficient.
That being said, the performance difference is almost certainly absolutely trivial, and so you should use whichever is clearer. Micro-optimizations like this are almost always pointless, can be introduced at virtually any time, and should be left for the end of the development process. At that point, you can check if the software is too slow, where the hot spots are, and spend your micro-optimization effort where it will actually make a difference.
Let's use the convention x^n. Let's assume n is always an integer.
For small values of n, boring multiplication will be faster, because Math.Pow (likely, implementation dependent) uses fancy algorithms to allow for n to be non-integral and/or negative.
For large values of n, Math.Pow will likely be faster, but if your library isn't very smart it will use the same algorithm, which is not ideal if you know that n is always an integer. For that you could code up an implementation of exponentiation by squaring or some other fancy algorithm.
Of course modern computers are very fast and you should probably stick to the simplest, easiest to read, least likely to be buggy method until you benchmark your program and are sure that you will get a significant speedup by using a different algorithm.
Math.Pow(x, y) is typically calculated internally as Math.Exp(Math.Log(x) * y). Every power calculation therefore requires finding a natural log, a multiplication, and raising e to a power.
As I mentioned in my previous answer, only at a power of 10 does Math.Pow() become faster, but accuracy will be compromised if using a series of multiplications.
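As a rough illustration of the identity described above (not the actual Math.Pow implementation, and the result can differ from Math.Pow in the last few bits):
// x^y == e^(y * ln x) for x > 0
static double PowViaExpLog(double x, double y)
{
    return Math.Exp(Math.Log(x) * y);
}
// e.g. PowViaExpLog(2.0, 10.0) is approximately 1024.0, up to floating-point rounding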
I disagree that hand-built functions are always faster. The cosine functions are way faster and more accurate than anything I could write. As for pow(), I did a quick test to see how slow Math.pow() was in JavaScript, because Mehrdad cautioned against guesswork:
for (i3 = 0; i3 < 50000; ++i3) {
for(n=0; n < 9000;n++){
x=x*Math.cos(i3);
}
}
Here are the results:
Each function was run 50000 times.
time for 50000 Math.cos(i) calls = 8 ms
time for 50000 Math.pow(Math.cos(i),9000) calls = 21 ms
time for 50000 Math.pow(Math.cos(i),9000000) calls = 16 ms
time for 50000 homemade for loop calls 1065 ms
If you don't agree, try the program at http://www.m0ose.com/javascripts/speedtests/powSpeedTest.html