Please ignore code readability in this question.
In terms of performance, should the following code be written like this:
int maxResults = criteria.MaxResults;
if (maxResults > 0)
{
    while (accounts.Count > maxResults)
        accounts.RemoveAt(maxResults);
}
or like this:
if (criteria.MaxResults > 0)
{
    while (accounts.Count > criteria.MaxResults)
        accounts.RemoveAt(criteria.MaxResults);
}
?
Edit: criteria is a class, and MaxResults is a simple integer property (i.e., public int MaxResults { get { return _maxResults; } }).
Does the C# compiler treat MaxResults as a black box and evaluate it every time? Or is it smart enough to figure out that I've got 3 calls to the same property with no modification of that property between the calls? What if MaxResults was a field?
One of the laws of optimization is precalculation, so I instinctively wrote this code like the first listing, but I'm curious if this kind of thing is being done for me automatically (again, ignore code readability).
(Note: I'm not interested in hearing the 'micro-optimization' argument, which may be valid in the specific case I've posted. I'd just like some theory behind what's going on or not going on.)
First off, the only way to actually answer performance questions is to actually try it both ways and test the results in realistic conditions.
That said, the other answers which say that "the compiler" does not do this optimization because the property might have side effects are both right and wrong. The problem with the question (aside from the fundamental problem that it simply cannot be answered without actually trying it and measuring the result) is that "the compiler" is actually two compilers: the C# compiler, which compiles to MSIL, and the JIT compiler, which compiles IL to machine code.
The C# compiler never ever does this sort of optimization; as noted, doing so would require that the compiler peer into the code being called and verify that the result it computes does not change over the lifetime of the callee's code. The C# compiler does not do so.
The JIT compiler might. No reason why it couldn't. It has all the code sitting right there. It is completely free to inline the property getter, and if the jitter determines that the inlined property getter returns a value that can be cached in a register and re-used, then it is free to do so. (If you don't want it to do so because the value could be modified on another thread then you already have a race condition bug; fix the bug before you worry about performance.)
Whether the jitter actually does inline the property fetch and then enregister the value, I have no idea. I know practically nothing about the jitter. But it is allowed to do so if it sees fit. If you are curious about whether it does so or not, you can either (1) ask someone who is on the team that wrote the jitter, or (2) examine the jitted code in the debugger.
And finally, let me take this opportunity to note that computing results once, storing the result and re-using it is not always an optimization. This is a surprisingly complicated question. There are all kinds of things to optimize for:
execution time
executable code size -- this has a major effect on execution time because big code takes longer to load, increases the working set size, puts pressure on processor caches, RAM and the page file. Small slow code is often in the long run faster than big fast code in important metrics like startup time and cache locality.
register allocation -- this also has a major effect on execution time, particularly in architectures like x86 which have a small number of available registers. Enregistering a value for fast re-use can mean that there are fewer registers available for other operations that need optimization; perhaps optimizing those operations instead would be a net win.
and so on. It gets real complicated real fast.
In short, you cannot possibly know whether writing the code to cache the result rather than recomputing it is actually (1) faster, or (2) better performing. Better performance does not always mean making execution of a particular routine faster. Better performance is about figuring out what resources are important to the user -- execution time, memory, working set, startup time, and so on -- and optimizing for those things. You cannot do that without (1) talking to your customers to find out what they care about, and (2) actually measuring to see if your changes are having a measurable effect in the desired direction.
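The "try it both ways and measure" advice could be sketched roughly as follows. This is only a minimal harness under stated assumptions: Criteria, TrimBenchmark, TrimCached, TrimDirect and Measure are names invented for illustration, the list size of 100 and the limit of 10 are arbitrary, and Stopwatch timings from a loop like this are noisy. Build in Release mode and run without a debugger attached, otherwise the jitter's optimizations are not what you are measuring.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

// Hypothetical stand-in for the asker's criteria class.
public sealed class Criteria
{
    private readonly int _maxResults;
    public Criteria(int maxResults) { _maxResults = maxResults; }
    public int MaxResults { get { return _maxResults; } }
}

public static class TrimBenchmark
{
    // First listing: read the property once into a local.
    public static void TrimCached(List<int> accounts, Criteria criteria)
    {
        int maxResults = criteria.MaxResults;
        if (maxResults > 0)
        {
            while (accounts.Count > maxResults)
                accounts.RemoveAt(maxResults);
        }
    }

    // Second listing: read the property on every use.
    public static void TrimDirect(List<int> accounts, Criteria criteria)
    {
        if (criteria.MaxResults > 0)
        {
            while (accounts.Count > criteria.MaxResults)
                accounts.RemoveAt(criteria.MaxResults);
        }
    }

    // Times only the trim calls; building each list happens outside the timed region.
    public static TimeSpan Measure(Action<List<int>, Criteria> trim, int iterations)
    {
        var criteria = new Criteria(10);
        var stopwatch = new Stopwatch();
        for (int i = 0; i < iterations; i++)
        {
            var accounts = new List<int>(100);
            for (int n = 0; n < 100; n++) accounts.Add(n);

            stopwatch.Start();
            trim(accounts, criteria);
            stopwatch.Stop();
        }
        return stopwatch.Elapsed;
    }
}
```

A caller would compare TrimBenchmark.Measure(TrimBenchmark.TrimCached, 100000) against TrimBenchmark.Measure(TrimBenchmark.TrimDirect, 100000), ideally several times each, and only trust a difference that shows up consistently.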
If MaxResults is a property then no, it will not optimize it, because the getter may have complex logic, say:
private int _maxResults;
public int MaxResults {
    get { return _maxResults++; }
    set { _maxResults = value; }
}
See how the behavior would change if it in-lines your code?
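To make the difference concrete, here is a small sketch built around a getter with the same side effect. CountingCriteria and SideEffectDemo are hypothetical names used only for this illustration.

```csharp
using System;

// Hypothetical class whose getter has a side effect, as in the snippet above.
public sealed class CountingCriteria
{
    private int _maxResults;
    public int MaxResults
    {
        get { return _maxResults++; } // every read changes the stored value
        set { _maxResults = value; }
    }
}

public static class SideEffectDemo
{
    // Reads the property three times; each read observes a different value.
    public static int[] ReadThreeTimes(CountingCriteria criteria)
    {
        return new[] { criteria.MaxResults, criteria.MaxResults, criteria.MaxResults };
    }

    // Reads the property once and reuses the local; all three values are equal.
    public static int[] ReadOnceAndReuse(CountingCriteria criteria)
    {
        int cached = criteria.MaxResults;
        return new[] { cached, cached, cached };
    }
}
```

Starting from MaxResults = 5, ReadThreeTimes yields 5, 6, 7 while ReadOnceAndReuse yields 5, 5, 5, so a compiler that silently swapped one form for the other would change what the program does.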
If there's no logic...either method you wrote is fine, it's a very minute difference and all about how readable it is TO YOU (or your team)...you're the one looking at it.
Your two code samples are only guaranteed to have the same result in single-threaded environments, which .Net isn't, and if MaxResults is a field (not a property). The compiler can't assume, unless you use the synchronization features, that criteria.MaxResults won't change during the course of your loop. If it's a property, it can't assume that using the property doesn't have side effects.
Eric Lippert points out quite correctly that it depends a lot on what you mean by "the compiler". The C# -> IL compiler? Or the IL -> machine code (JIT) compiler? And he's right to point out that the JIT may well be able to optimize the property getter, since it has all of the information (whereas the C# -> IL compiler doesn't, necessarily). It won't change the situation with multiple threads, but it's a good point nonetheless.
It will be called and evaluated every time. The compiler has no way of determining if a method (or getter) is deterministic and pure (no side effects).
Note that actual evaluation of the property may be inlined by the JIT compiler, making it effectively as fast as a simple field.
It's good practise to make property evaluation an inexpensive operation. If you do some heavy calculation in the getter, consider caching the result manually, or changing it to a method.
why not test it?
Just set up two console apps, make each one loop 10 million times, and compare the results ... remember to run them as properly released apps that have been installed properly, or else you cannot guarantee that you are not just running the MSIL.
Really you are probably going to get about 5 answers saying 'you shouldn't worry about optimisation'. they clearly do not write routines that need to be as fast as possible before being readable (eg games).
If this piece of code is part of a loop that is executed billions of times then this optimisation could be worthwhile. For instance max results could be an overridden method and so you may need to discuss virtual method calls.
Really the ONLY way to answer any of these questions is to figure out if this is a piece of code that will benefit from optimisation. Then you need to know the kinds of things that are increasing the time to execute. Really us mere mortals cannot do this a priori and so have to simply try 2-3 different versions of the code and then test it.
If criteria is a class type, I doubt it would be optimized, because another thread could always change that value in the meantime. For structs I'm not sure, but my gut feeling is that it won't be optimized, but I think it wouldn't make much difference in performance in that case anyhow.
Related
I have a very small C# method marked as inline, but it doesn't get inlined.
I have seen that the longest function generates more than 32 bytes of IL code. Is the limit of 32 bytes too short?
// inlined
[MethodImpl(MethodImplOptions.AggressiveInlining)]
static public bool INL_IsInRange (this byte pValue, byte pMin) {
    return(pValue>=pMin);
}
// NOT inlined
[MethodImpl(MethodImplOptions.AggressiveInlining)]
static public bool INL_IsInRange (this byte pValue, byte pMin, byte pMax) {
    return(pValue>=pMin&&pValue<=pMax);
}
Is it possible to change that limit?
I am looking for inline function criteria also. In your case, I believe that JIT optimization timed out before it could reach the decision to inline your second function. For JIT, it's not a priority to inline a function, so it was busy analyzing your long code. However, if you place your calls inside tight loops, JIT will probably inline them, as inner calls gain priority to inline. If you really care about this type of micro-optimization, it's time to switch to C++. It's a whole new brave world out there for you to explore and exploit!
I noticed that the question had been edited right after this answer had been posted, meaning a high level of interactivity. Well, I don't know why there is a limit of 32 bytes, but that seems to be exactly the size of a CPU cache block, conservatively speaking. What a coincidence! In any case, code optimization must be done with a particular hardware configuration, better saved in an extra file side by side with its assembly. The timeout policy is stupid, because optimization is not supposed to be done at run-time, competing against the precious code execution time. Optimization is supposed to be done at application load-time, only the first time it's run on the machine, once for all. It can be triggered again when hardware configuration change is detected. Again, if you really need performance, just go with C/C++. C# is not designed for performance and will never make performance its top priority. Like Java, C# is designed for safety, with a much stronger caution against possible negative performance impacts.
Up to the "32-bytes of IL" limit, there are a number of other factors which affect whether a method would be inlined or not. There are at least a couple of articles that describe these factors.
One article explains that a scoring heuristic is used to adjust an initial guess about the relative size of the code when inlined vs not (i.e. whether the call site is larger or smaller than the inlined code itself):
If inlining makes code smaller than the call it replaces, it is ALWAYS good. Note that we are talking about the NATIVE code size, not the IL code size (which can be quite different).
The more a particular call site is executed, the more it will benefit from inlining. Thus code in loops deserves to be inlined more than code that is not in loops.
If inlining exposes important optimizations, then inlining is more desirable. In particular methods with value types arguments benefit more than normal because of optimizations like this and thus having a bias to inline these methods is good.
Thus the heuristic the X86 JIT compiler uses is, given an inline candidate:
Estimate the size of the call site if the method were not inlined.
Estimate the size of the call site if it were inlined (this is an estimate based on the IL, we employ a simple state machine (Markov Model), created using lots of real data to form this estimator logic)
Compute a multiplier. By default it is 1
Increase the multiplier if the code is in a loop (the current heuristic bumps it to 5 in a loop)
Increase the multiplier if it looks like struct optimizations will kick in.
If InlineSize <= NonInlineSize * Multiplier do the inlining.
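The quoted steps can be written down as a tiny decision function. This is only a sketch of the comparison as described in the quote, not the jitter's actual code: InlineHeuristicSketch and ShouldInline are invented names, the two sizes stand for the native-code size estimates the article mentions, and structBonus is a placeholder assumption because the article does not say by how much the multiplier grows for struct optimizations.

```csharp
public static class InlineHeuristicSketch
{
    // inlineSize / nonInlineSize: estimated native size of the call site with and without inlining.
    // structBonus: hypothetical increase for expected struct optimizations (amount not given in the article).
    public static bool ShouldInline(int inlineSize, int nonInlineSize, bool inLoop, double structBonus)
    {
        double multiplier = 1.0;           // default multiplier
        if (inLoop) multiplier = 5.0;      // "the current heuristic bumps it to 5 in a loop"
        multiplier += structBonus;         // assumption: modeled as a simple additive increase
        return inlineSize <= nonInlineSize * multiplier;
    }
}
```

With these assumptions, a candidate estimated at 30 versus 10 is rejected outside a loop but accepted inside one, which matches the article's point that call sites in loops get a bias toward inlining.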
Another article explains several conditions that will prevent a method from being inlined based on their mere existence (including the "32-bytes of IL" limit):
These are some of the reasons for which we won't inline a method:
Method is marked as not inline with the CompilerServices.MethodImpl attribute.
Size of inlinee is limited to 32 bytes of IL: This is a heuristic, the rationale behind it is that usually, when you have methods bigger than that, the overhead of the call will not be as significant compared to the work the method does. Of course, as a heuristic, it fails in some situations. There have been suggestions for us adding an attribute to control this threshold. For Whidbey, that attribute has not been added (it has some very bad properties: it's x86 JIT specific and its long-term value, as compilers get smarter, is dubious).
Virtual calls: We don't inline across virtual calls. The reason for not doing this is that we don't know the final target of the call. We could potentially do better here (for example, if 99% of calls end up in the same target, you can generate code that does a check on the method table of the object the virtual call is going to execute on, if it's not the 99% case, you do a call, else you just execute the inlined code), but unlike the J language, most of the calls in the primary languages we support, are not virtual, so we're not forced to be so aggressive about optimizing this case.
Valuetypes: We have several limitations regarding value types and inlining. We take the blame here, this is a limitation of our JIT, we could do better and we know it. Unfortunately, when stack ranked against other features of Whidbey, getting some statistics on how frequently methods cannot be inlined due to this reason and considering the cost of making this area of the JIT significantly better, we decided that it made more sense for our customers to spend our time working in other optimizations or CLR features. Whidbey is better than previous versions in one case: value types that only have a pointer size int as a member, this was (relatively) not expensive to make better, and helped a lot in common value types such as pointer wrappers (IntPtr, etc).
MarshalByRef: Call targets that are in MarshalByRef classes won't be inlined (call has to be intercepted and dispatched). We've got better in Whidbey for this scenario
VM restrictions: These are mostly security, the JIT must ask the VM for permission to inline a method (see CEEInfo::canInline in Rotor source to get an idea of what kind of things the VM checks for).
Complicated flowgraph: We don't inline loops, methods with exception handling regions, etc...
If basic block that has the call is deemed as it won't execute frequently (for example, a basic block that has a throw, or a static class constructor), inlining is much less aggressive (as the only real win we can make is code size)
Other: Exotic IL instructions, security checks that need a method frame, etc...
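For reference, the MethodImpl attribute mentioned in the first item works in both directions. The sketch below uses invented names (RangeChecks, IsInRange, IsInRangeNoInline). As far as I know, AggressiveInlining is only a request: it asks the jitter to skip its size heuristics, and the other restrictions listed above can still prevent inlining, so the only way to be sure is to look at the jitted code.

```csharp
using System.Runtime.CompilerServices;

public static class RangeChecks
{
    // Asks the JIT to inline if it can; this is a hint, not a guarantee.
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static bool IsInRange(this byte value, byte min, byte max)
    {
        return value >= min && value <= max;
    }

    // Tells the JIT never to inline this method, e.g. to keep it visible in stack traces.
    [MethodImpl(MethodImplOptions.NoInlining)]
    public static bool IsInRangeNoInline(byte value, byte min, byte max)
    {
        return value >= min && value <= max;
    }
}
```

Both methods compute the same result; only the inlining request differs.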
Back in 2009 I posted this answer to a question about optimisations for nested try/catch/finally blocks.
Thinking about this again some years later, it seems the question could be extended to that other control flow, not only try/catch/finally, but also if/else.
At each of these junctions, execution will follow one path. Code must be generated for both, obviously, but the order in which they're placed in memory, and the number of jumps required to navigate through them will differ.
The order generated code is laid out in memory has implications for the miss rate on the CPU's instruction cache. Having the instruction pipeline stalled, waiting for memory reads, can really kill loop performance.
I don't think loops (for/foreach/while) are such a good fit unless you expect the loop to have zero iterations more often than it has some, as the natural generation order seems pretty optimal.
Some questions:
In what ways do the available .NET JITs optimise for generated instruction order?
How much difference can this make in practice to common code? What about perfectly suited cases?
Is there anything the developer can do to influence this layout? What about mangling with the forbidden goto?
Does the specific JIT being used make much difference to layout?
Does the method inlining heuristic come into play here too?
Basically anything interesting related to this aspect of the JIT!
Some initial thoughts:
Moving catch blocks out of line is an easy job, as they're supposed to be the exceptional case by definition. Not sure this happens.
For some loops I suspect you can increase performance non-trivially. However in general I don't think it'll make that much difference.
I don't know how the JIT decides the order of generated code. In C on Linux you have likely(cond) and unlikely(cond) which you can use to tell to the compiler which branch is the common path to optimise for. I'm not sure that all compilers respect these macros.
Instruction ordering is distinct from the problem of branch prediction, in which the CPU guesses (on its own, afaik) which branch will be taken in order to start the pipeline (oversimplified steps: decode, fetch operands, execute, write back) on instructions, before the execute step has determined the value of the condition variable.
I can't think of any way to influence this order in the C# language. Perhaps you can manipulate it a bit by gotoing to labels explicitly, but is this portable, and are there any other problems with it?
Perhaps this is what profile guided optimisation is for. Do we have that in the .NET ecosystem, now or in plan? Maybe I'll go and have a read about LLILC.
The optimization you are referring to is called the code layout optimization which is defined as follows:
Those pieces of code that are executed close in time in the same thread should be close in the virtual address space so that they fit in a single or few consecutive cache lines. This reduces cache misses.
Those pieces of code that are executed close in time in different threads should be close in the virtual address space so that they fit in a single or few consecutive cache lines as long as there is no self-modifying code. This gets lower priority than the previous one. This reduces cache misses.
Those pieces of code that are executed frequently (hot code) should be close in the virtual address space so that they fit in as few virtual pages as possible. This reduces page faults and working set size.
Those pieces of code that are rarely executed (cold code) should be close in the virtual address space so that they fit in as few virtual pages as possible. This reduces page faults and working set size.
Now to your questions.
In what ways do the available .NET JITs optimise for generated instruction order?
"Instruction order" is really a very general term. Many optimizations affect instruction order. I'll assume that you're referring to code layout.
JITters by design should take the minimum amount of time to compile code while at the same time produce high-quality code. To achieve this, they only perform the most important optimizations so that it's really worth spending time doing them. Code layout optimization is not one of them because without profiling, it may not be beneficial. While a JITter can certainly perform profiling and dynamic optimization, there is a generally preferred way.
How much difference can this make in practice to common code? What about perfectly suited cases?
Code layout optimization by itself can improve overall performance typically by -1% (negative one) to 4%, which is enough to make compiler writers happy. I would like to add that it reduces energy consumption indirectly by reducing cache misses. The reduction in miss ratio of the instruction cache can be typically up to 35%.
Is there anything the developer can do to influence this layout? What about mangling with the forbidden goto?
Yes, there are numerous ways. I would like to mention the generally recommended one which is mpgo.exe. Please do not use goto for this purpose. It's forbidden.
Does the specific JIT being used make much difference to layout?
No.
Does the method inlining heuristic come into play here too?
Inlining can indeed improve code layout with respect to function calls. It's one of the most important optimizations and all .NET JITs perform it.
Moving catch blocks out of line is an easy job, as they're supposed to be the exceptional case by definition. Not sure this happens.
Yes it might be "easy", but what is the potential gained benefit? catch blocks are typically small in size (containing a call to a function that handles the exception). Handling this particular case of code layout does not seem promising. If you really care, use mpgo.exe.
I don't know how the JIT decides the order of generated code. In C on Linux you have likely(cond) and unlikely(cond) which you can use to tell to the compiler which branch is the common path to optimise for.
Using PGO is much more preferable over using likely(cond) and unlikely(cond) for two reasons:
The programmer might inadvertently make mistakes while placing likely(cond) and unlikely(cond) in the code. It actually happens a lot. Making big mistakes while trying to manually optimize the code is very typical.
Adding likely(cond) and unlikely(cond) all over the code makes it less maintainable in the future. You'll have to make sure that these hints hold every time you change the source code. In large code bases, this could be (or rather is) a nightmare.
Instruction ordering is distinct from the problem of branch prediction...
Assuming you are talking about code layout, yes they are distinct. But code layout optimization is usually guided by a profile which really includes branch statistics. Hardware branch prediction is of course totally different.
Maybe I'll go and have a read about LLILC.
While using mpgo.exe is the mainstream way of performing this optimization, you can use LLILC also since LLVM supports profile-guided optimization as well. But I don't think you need to go this far.
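Separately from profile-guided tools, one source-level pattern that is sometimes used to keep rarely executed code away from a hot path is to move it into its own method and forbid inlining of that method. This is not what the answer above recommends, and whether it improves layout for a given JIT is an assumption that has to be measured. Guard, CheckedIndex and ThrowIndexOutOfRange are hypothetical names.

```csharp
using System;
using System.Runtime.CompilerServices;

public static class Guard
{
    // Hot path: a comparison and a rarely taken call; the throw itself lives elsewhere.
    public static int CheckedIndex(int index, int length)
    {
        if (index < 0 || index >= length)
            ThrowIndexOutOfRange(index, length);
        return index;
    }

    // Cold path: kept out of line so the exception construction is not part of the caller's body.
    [MethodImpl(MethodImplOptions.NoInlining)]
    private static void ThrowIndexOutOfRange(int index, int length)
    {
        throw new ArgumentOutOfRangeException(
            "index", index, "Index must be in [0, " + length + ").");
    }
}
```

The behaviour is the same as writing the throw inline; only the placement of the rarely executed code changes.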
Does anyone have advice for using params in C# for method argument passing? I'm contemplating making overloads for the first 6 arguments and then a 7th using the params feature. My reasoning is to avoid the extra array allocation the params feature requires. This is for some high-performance utility methods. Any advice? Is it a waste of code to create all the overloads?
Honestly, I'm a little bothered by everyone shouting "premature optimization!" Here's why.
What you say makes perfect sense, particularly as you have already indicated you are working on a high-performance library.
Even BCL classes follow this pattern. Consider all the overloads of string.Format or Console.WriteLine.
This is very easy to get right. The whole premise behind the movement against premature optimization is that when you do something tricky for the purposes of optimizing performance, you're liable to break something by accident and make your code less maintainable. I don't see how that's a danger here; it should be very straightforward what you're doing, to yourself as well as any future developer who may deal with your code.
Also, even if you profiled the results of both approaches and saw only a very small difference in speed, there's still the issue of memory allocation. Creating a new array for every method call entails allocating more memory that will need to be garbage collected later. And in some scenarios where "nearly" real-time behavior is desired (such as algorithmic trading, the field I'm in), minimizing garbage collections is just as important as maximizing execution speed.
So, even if it earns me some downvotes: I say go for it.
(And to those who claim "the compiler surely already does something like this"--I wouldn't be so sure. Firstly, if that were the case, I fail to see why BCL classes would follow this pattern, as I've already mentioned. But more importantly, there is a very big semantic difference between a method that accepts multiple arguments and one that accepts an array. Just because one can be used as a substitute for the other doesn't mean the compiler would, or should, attempt such a substitution).
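A minimal sketch of the overload pattern being discussed, cut down to two fixed-arity overloads for brevity (the question proposes six). MathUtil and Sum are made-up names; the point is only that calls with two or three arguments bind to the fixed overloads and so never allocate an array, while longer argument lists fall back to params.

```csharp
public static class MathUtil
{
    // Fixed-arity overloads: no array is allocated for these calls.
    public static int Sum(int a, int b) { return a + b; }
    public static int Sum(int a, int b, int c) { return a + b + c; }

    // Fallback: the compiler wraps the arguments in a new int[] at the call site.
    public static int Sum(params int[] values)
    {
        int total = 0;
        for (int i = 0; i < values.Length; i++) total += values[i];
        return total;
    }
}
```

MathUtil.Sum(1, 2) and MathUtil.Sum(1, 2, 3) pick the fixed overloads because overload resolution prefers a method applicable in its normal form over one applicable only in its expanded params form; MathUtil.Sum(1, 2, 3, 4) uses the array version.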
Yes, that's the strategy that the .NET framework uses. String.Concat() would be a good example. It has overloads for up to 4 strings, plus a fallback one that takes a params string[]. Pretty important here, Concat needs to be fast and is there to help the user fall in the pit of success when he uses the + operator instead of a StringBuilder.
The code duplication you'll get is the price. You'd profile them to see if the speedup is worth the maintenance headache.
Fwiw: there are plenty of micro-optimizations like this in the .NET framework. Somewhat necessary because the designers could not really predict how their classes were going to be used. String.Concat() is just as likely to be used in a tight inner loop that is critical to program perf as, say, a config reader that only runs once at startup. As the end-user of your own code, you typically have the luxury of not having to worry about that. The reverse is also true, the .NET framework code is remarkably free of micro-optimizations when it is unlikely that their benefit would be measurable. Like providing overloads when the core code is slow anyway.
You can always pass Tuple as a parameter, or if the types of the parameters are always the same, an IList<T>.
As other answers and comments have said, you should only optimize after:
Ensuring correct behavior.
Determining the need to optimize.
My point is, if your method is capable of taking an unlimited number of parameters, then the logic inside it works array-style. So having overloads for a limited number of parameters wouldn't help, unless you can implement the limited-parameter versions in a whole different way that is much faster.
For example, if you're handing the parameters to a Console.WriteLine, there's a hidden array creation in there too, so either way you end up having an array.
And, sorry for bothering Dan Tao, I also feel like it is premature optimization. Because you need to know what difference would it make to have overloads with limited number of parameters. If your application is that much performance-critical, you'd need to implement both ways and try to run a test and compare execution times.
Don't even think about performance at this stage. Create whatever overloads will make your code easier to write and easier to understand at 4am two years from now. Sometimes that means params, sometimes that means avoiding it.
After you've got something that works, figure out if these are a performance problem. It's not hard to make the parameters more complicated, but if you add unnecessary complexity now, you'll never make them less so later.
You can try something like this to benchmark the performance so you have some concrete numbers to make decisions with.
In general, object allocation is slightly faster than in C/C++ and deletion is much, much faster for small objects -- until you have tens of thousands of them being made per second. Here's an old article regarding memory allocation performance.
I find a lot of cases where I think to myself that I could use reflection to solve a problem, but I usually don't because I hear a lot along the lines of "don't use reflection, it's too inefficient".
Now I'm in a position where I have a problem where I can't find any other solution than to use reflection with new T(), as outlined in this question & answer.
So I'm wondering if somebody can tell me reflection's specific intended usage, and if there's a set of guidelines to indicate when it's appropriate and when it isn't?
It is often "fast enough", and if you need faster (for tight loops etc) you can do meta-programming with Expression or ILGenerator (perhaps via DynamicMethod), to make extremely fast code (including some tricks you can't do in C#).
Reflection is more commonly used for framework/library scenarios, where the library by definition knows nothing about the caller, and must work based on configuration, attributes or patterns.
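As an illustration of the Expression route for the new T() case from the question, here is a minimal sketch. Factory<T> and Create are invented names, and it assumes T has a public parameterless constructor; if it does not, Expression.New throws and the failure surfaces when the type is first used.

```csharp
using System;
using System.Linq.Expressions;

public static class Factory<T>
{
    // Compiled once per closed generic type, the first time Factory<T> is touched.
    private static readonly Func<T> Creator =
        Expression.Lambda<Func<T>>(Expression.New(typeof(T))).Compile();

    // Each call just invokes the compiled delegate.
    public static T Create()
    {
        return Creator();
    }
}
```

Factory<StringBuilder>.Create() then returns a fresh instance on each call. Whether this is actually faster than new T() in a particular runtime is something to measure rather than assume.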
If there's one thing that I hate hearing it's "don't use reflection, it's too inefficient".
Too inefficient for what? If you're writing a console application that's run once a month and isn't time critical, does it really matter if it takes 30 seconds instead of 28, because of you using reflection?
Guidelines for when it's inappropriate to use are ones that only you can really put together as they're heavily dependent on what you're doing and how efficient/performant alternatives are.
A useful abstraction for code efficiency is to partition it in three categories of time, each about 3 orders of magnitude apart.
First is human-time. There's a lot you can do when you only need to keep a person happy with the performance of your code. Humans cannot perceive the difference between code that needs 10 milliseconds or 20 milliseconds, both look instant. And a human is forgiving when a program needs 6 seconds instead of 5, roughly 3 billion machine instructions more. Common examples of programs that run at human-time are compilers and point-and-click designers. Using reflection is never a problem.
Then there is I/O-time. When your program needs to hit the disk or the network. I/O is slow, restricted by mechanical motion in the case of the disk, bandwidth and latency in the case of a network. You can always tell when I/O is the bottleneck, your program is running but it isn't driving up the CPU load much. The operating system is constantly blocking the thread, making it wait until the I/O request is complete.
Reflection operates at I/O-time. To retrieve type data, the CLR must read the assembly metadata. And when that wasn't done before, your program will cause a page-fault, requiring the operating system to read the data from disk. What follows is that, roughly, reflection can make I/O bound code only twice as slow. Usually better because after the first perf hit, the metadata is cached and can be retrieved a lot quicker. Reflection is thus often an acceptable trade-off. The canonical examples are serialization and dbase ORMs.
Then there's machine-time. The raw performance of a CPU core is stupendous. A property getter can execute in somewhere between 0 and 1/2 a nanosecond. PropertyInfo.GetValue(), say, does not compare favorably with this. Both will keep the CPU busy, you'll see the CPU load for the core at 100%. But GetValue() costs hundreds if not thousands of machine code instructions, not counting the time needed to page in the metadata. While each call does not take much time, it builds up fast when you loop.
If you cannot classify your reflection code in the human-time or I/O-time categories then reflection is unlikely to be an appropriate substitute for regular code.
The key to keeping reflection from slowing down your program is to not use it inside a loop. If you want to read a property from an object during startup (happens once), use reflection. You want to read a property from a list of 10,000 objects of unknown type, use reflection to get the property getter delegate once (search term: PropertyInfo.GetGetMethod), then call the delegate 10,000 times. There are plenty of examples of this on StackOverflow.
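A rough sketch of the "get the getter delegate once" idea. For simplicity it assumes the object type and property type are available as generic parameters, which is a narrower case than the fully unknown type the answer mentions; Account, GetterCache and CreateGetter are names invented here.

```csharp
using System;
using System.Reflection;

// Hypothetical type used only to demonstrate the technique.
public sealed class Account
{
    public string Name { get; set; }
}

public static class GetterCache
{
    // Does the reflection lookup once and returns a strongly typed delegate for the getter.
    public static Func<T, TValue> CreateGetter<T, TValue>(string propertyName) where T : class
    {
        PropertyInfo property = typeof(T).GetProperty(propertyName);
        if (property == null)
            throw new ArgumentException("No public property named " + propertyName);

        MethodInfo getter = property.GetGetMethod();
        if (getter == null)
            throw new ArgumentException("Property has no public getter: " + propertyName);

        // Open instance delegate: the first argument becomes the target object.
        return (Func<T, TValue>)Delegate.CreateDelegate(typeof(Func<T, TValue>), getter);
    }
}
```

The caller creates the delegate once, for example with GetterCache.CreateGetter<Account, string>("Name"), and then invokes it inside the loop, so the per-item cost is a delegate call rather than a reflection lookup.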
Reflection is not inefficient. It is less efficient than direct calls. So personally I use reflection when there's no equivalent compile time safe method. IMHO the problem with reflection is not so much the efficiency but the fragility of the code as it uses magic strings which are very refactor unfriendly.
I use it for plugin architecture - looking through assemblies in the plugin folder for methods marked with a custom attribute indicating info about the plugin - and in a logging framework. The framework detects a custom attribute on the assembly itself which holds information about the author of the assembly, the project, version information, and other tags that are logged along with everything in the stack trace.
Going to give away a 'trade secret', but it's a good one. The framework allows you to tag each method or class with a 'Story ref', e.g.
[StoryRef(Ref="ImportCSV1")]
...and the idea is it would integrate into our agile project management framework: if there were any exceptions thrown within that class/method, the logging method would use reflection to check for a StoryRef attribute in the stack trace, and if so that would be logged as an exception against that story. In the PM software you could see exceptions by Story (a story is like an extreme/agile use case).
I think that's a valid use, at least! Basically, when it just seems the most neat, and appropriate way to do it, I use reflection. Nothing else really comes into it - I can't think of an occasion you'd be using reflection to make that many calls that efficiency would come into it.
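For readers who want to picture the attribute lookup, here is a small sketch. It is not the poster's framework: StoryRefAttribute is reconstructed from the one-line usage above, and StoryRefLookup, Find and CsvImporter are hypothetical names.

```csharp
using System;
using System.Reflection;

// Assumed shape of the attribute, inferred from [StoryRef(Ref="ImportCSV1")].
[AttributeUsage(AttributeTargets.Class | AttributeTargets.Method)]
public sealed class StoryRefAttribute : Attribute
{
    public string Ref { get; set; }
}

public static class StoryRefLookup
{
    // Returns the story reference on the method, else on its declaring type, else null.
    public static string Find(MethodBase method)
    {
        var onMethod = (StoryRefAttribute)Attribute.GetCustomAttribute(method, typeof(StoryRefAttribute));
        if (onMethod != null) return onMethod.Ref;

        Type declaringType = method.DeclaringType;
        if (declaringType == null) return null;

        var onType = (StoryRefAttribute)Attribute.GetCustomAttribute(declaringType, typeof(StoryRefAttribute));
        return onType == null ? null : onType.Ref;
    }
}

// Hypothetical tagged class.
public class CsvImporter
{
    [StoryRef(Ref = "ImportCSV1")]
    public void Import() { }
}
```

A logger could call Find on each frame.GetMethod() from new StackTrace(exception).GetFrames() and record the first non-null result; note that frames for inlined methods may be missing from the trace.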
"So I'm wondering if somebody can tell me reflection's specific intended usage, and if there's a set of guidelines to indicate when it's appropriate and when it isn't?"
A bad example of reflection is this one from Wikipedia:
//Without reflection
Foo foo = new Foo();
foo.Hello();
//With reflection
Type t = Type.GetType("FooNamespace.Foo");
object foo = Activator.CreateInstance(t);
t.InvokeMember("Hello", BindingFlags.InvokeMethod, null, foo, null);
Here, there is no advantage to using reflection: the code without reflection is not only more efficient, but also easier to understand.
Good uses of reflection are things like serialization and object-relational mapping, which are easy to implement if you have a list of a class's properties, but otherwise require a custom-written function for each class.
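To illustrate the serialization case, here is a small sketch that turns any object into a "Name=Value" string by enumerating its public properties. SimpleSerializer and ToKeyValueString are hypothetical names, and the output format is invented for the example; a real serializer would also deal with escaping, nesting and nulls.

```csharp
using System;
using System.Collections.Generic;
using System.Reflection;

// Hypothetical helper: one function serves every class, at the cost of reflection.
public static class SimpleSerializer
{
    public static string ToKeyValueString(object instance)
    {
        if (instance == null) throw new ArgumentNullException("instance");

        var parts = new List<string>();
        PropertyInfo[] properties =
            instance.GetType().GetProperties(BindingFlags.Public | BindingFlags.Instance);

        foreach (PropertyInfo property in properties)
        {
            // Skip write-only properties and indexers.
            if (!property.CanRead || property.GetIndexParameters().Length > 0) continue;
            parts.Add(property.Name + "=" + property.GetValue(instance, null));
        }
        // The order of GetProperties is not guaranteed, so neither is the order here.
        return string.Join(";", parts);
    }
}
```

Without reflection, each class would need its own hand-written version of this function.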
Consider:
if (condition1)
{
// Code block 1
}
else
{
// Code block 2
}
If I know that condition1 will be true the majority of the time, then I should code the logic as written, instead of:
if (!condition1)
{
// Code block 2
}
else
{
// Code block 1
}
since I will avoid the penalty of the jump to the second code block (note: I have limited knowledge of assembly language). Does this idea carry forward to switch statements and case labels?
switch (myCaseValue)
{
case Case1:
// Code block 1
break;
case Case2:
// Code block 2
break;
// etc.
}
If I know that one of the cases will happen more often, can I rearrange the order of the case labels so that it's more efficient? Should I? In my code I've been ordering the case labels alphabetically for code readability without really thinking about it. Is this micro-optimization?
Some facts for modern hardware like x86 or x86_64:
An unconditionally taken branch has almost no additional cost besides the decoding. If you want a number, it's about a quarter of a clock cycle.
A conditional branch, which was correctly predicted, has almost no additional costs.
A conditional branch, which was not correctly predicted, has a penalty equal to the length of the processor pipeline; this is about 12-20 clocks, depending on the hardware.
The prediction mechanisms are very sophisticated. Loops with a low number of iterations (on Core 2, for example, up to 64) can be perfectly predicted. Small repeating patterns like "taken-taken-nottaken-taken" can be predicted if they are not too long (IIRC 6 on Core 2).
You can read more about branch prediction in Agner Fog's excellent manual.
Switch statements are usually replaced by a jump table by the compiler. In most cases the order of cases won't make a difference at all. There are prediction mechanisms for indirect jumps as well.
So the question isn't if your jumps are more likely to be taken, it is if they are well predictable, at least for the hardware you intend to run your code on. This isn't an easy question at all. But if you have branches depending on a random (or pseudo random) condition, you could try to reformulate it as a branchless statement if possible.
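As a sketch of what "reformulating as a branchless statement" can look like, the two methods below sum the elements that are at least a threshold. The names are made up for illustration. The second version replaces the data-dependent if with a bit mask; it assumes threshold - 1 - v does not overflow an int. Whether it is actually faster depends on the data and on what the JIT generates, so measure before adopting it.

```csharp
using System;

public static class BranchlessSum
{
    // Straightforward version: one data-dependent branch per element.
    public static long SumAtLeastBranching(int[] values, int threshold)
    {
        long sum = 0;
        foreach (int v in values)
        {
            if (v >= threshold) sum += v;
        }
        return sum;
    }

    // Branchless version: the comparison becomes arithmetic.
    public static long SumAtLeastBranchless(int[] values, int threshold)
    {
        long sum = 0;
        foreach (int v in values)
        {
            // (threshold - 1 - v) is negative exactly when v >= threshold,
            // so the arithmetic shift yields -1 (all bits set) or 0.
            int mask = (threshold - 1 - v) >> 31;
            sum += v & mask;
        }
        return sum;
    }
}
```

The loop itself still contains a branch, but that one is easy to predict; only the branch that depended on the (possibly random) data has been removed.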
Your conclusion regarding the if statements will not be true on most of the hardware I'm familiar with. The problem is not that you are jumping, but that you are branching. The code could go two different ways, depending on the result of a comparison. This can stall the pipeline on most modern CPUs. Branch prediction is common, and fixes the problem most of the time, but has nothing to do with your example. The predictor can equally well predict that a comparison will be false as it can that it will be true.
As usual, see Wikipedia: Branch Predictor
It depends. The compiler will use a bunch of internal implementation-dependent criteria to decide whether to implement the switch as a sequence of if-like tests, or as a jump table. This might depend, for example, on how "compact" your set of case labels is. If your case label values form a "dense" set, the compiler is probably more likely to use a jump table, in which case the ordering of case labels won't matter. If it decides to go with what resembles a sequence of if-else tests, the order might matter.
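To make "dense" versus "sparse" concrete, here are two hypothetical switches. The comments describe what a compiler is likely to do, not what it is guaranteed to do; the only way to know is to inspect the generated IL or machine code for your compiler and settings.

```csharp
public static class SwitchShapes
{
    // Dense labels (0, 1, 2, 3): a good candidate for a jump table,
    // in which case the order of the case labels should not matter.
    public static string Dense(int code)
    {
        switch (code)
        {
            case 0: return "zero";
            case 1: return "one";
            case 2: return "two";
            case 3: return "three";
            default: return "other";
        }
    }

    // Sparse labels (1, 100, 10000): a table would be mostly empty,
    // so a compiler is more likely to emit a series of comparisons.
    public static string Sparse(int code)
    {
        switch (code)
        {
            case 1: return "one";
            case 100: return "hundred";
            case 10000: return "ten thousand";
            default: return "other";
        }
    }
}
```

Even in the sparse case the compiler may choose its own comparison order (for example, a binary search over the labels), so the source order is not necessarily the order in which the tests run.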
Keep in mind, though, that the body of a switch is one large statement, with case labels providing multiple entry points into that statement. For that reason, the compiler's ability (as well as yours) to rearrange the case "sub-blocks" within that statement might be limited.
Case labels should be ordered in the most efficient way for readability.
Reordering case labels for efficiency is a case of premature optimization unless a profiler has specifically told you this is a problem.
I think that even your initial premise - that you can optimize the if statement by rearranging the conditional - may well be faulty. In a non-optimized build you might find doing what you're talking about has some value - maybe. In the general case you're going to have to jump at least once for either case, so there's no advantage (in general) to arranging the conditional anyway. But that's for non-optimized builds, so who cares about that optimization?
In optimized builds, I think you might be surprised by what a compiler sometimes generates for an if statement. The compiler may move one or the other (or both) cases to somewhere out-of-line. Trying to optimize this naively by playing with which condition 'comes first' won't necessarily do what you want. At best you should do this only after examining what the compiler is generating. And, of course, this becomes an expensive process, since even the slightest change you make around the statement can change how the compiler decides to generate the output code.
Now, as far as the switch statement is concerned, I'd always go with using a switch when it makes the code more readable. The worst that a compiler should do with a switch statement that is equivalent to an if statement is to generate the same code. For more than a few cases, switch statements will generally be compiled as a jump table. But then again, a set of if tests that compare a single variable to a set of values might very well be recognized by a compiler such that it'll do the same. However, I'd guess that using a switch will enable the compiler to recognize the situation much more readily.
If you're really interested in getting the most out of the performance of that conditional, you might consider using something like MSVC's Profile Guided Optimization (PGO or 'pogo'), which uses the results of profiling runs to optimize how conditionals get generated. I don't know whether GCC has similar capabilities.
I'm not sure about the C# compiler, but I know that in assembly a switch statement can actually be programmed as a jump to a specific line, rather than evaluating the expression like an if statement. Since the case labels in a switch are all constants, it just treats cases as line numbers and you jump directly to the line number (case value) passed in without any evaluation. This makes the order of the case statements not really matter at all.
I assume you're aware that it will only matter if this is a hotspot. The best way to tell if it's a hotspot is to run the code, sample the program counter, and see if it's in there more than 10% of the time. If it is a hotspot, see how much time is spent in doing the if or switch. Usually it is negligible, unless your Block 1 and/or Block 2 do almost nothing. You can use a profiler for this. I just pause it repeatedly.
If you're not familiar with assembly language I would suggest learning it, enough to understand what the compiler generates. It's interesting and not hard.
As others have said, it depends on lots of things, including how many cases there are, how optimization is done, and the architecture you're running on. For an interesting overview, see http://ols.fedoraproject.org/GCC/Reprints-2008/sayle-reprint.pdf
If you put the cases that happen most often first, this will optimize the code slightly; because of the way switch statements work, the same is true there. When the program goes into the switch and finds a case that matches, it will execute it and hit break, which will exit the switch. Your thinking is correct.
However, I do think this optimization is pretty minimal, and if it slows your development time to do this, it's probably not worth it. Also if you have to modify your program flow drastically to accommodate this, it's probably not worth it. You're only saving a couple cycles at most and likely would never see the improvement.