I'm writing an XNA game where I do per-pixel collision checks. The loop which checks this does so by shifting an int and bitwise ORing and is generally difficult to read and understand.
I would like to add private methods such as private bool IsTransparent(int pixelColorValue) to make the loop more readable, but I don't want the overhead of method calls since this is very performance sensitive code.
Is there a way to force the compiler to inline this call, or do I just have to hope that the compiler will do this optimization?
If there isn't a way to force this, is there a way to check if the method was inlined, short of reading the disassembly? Will the method show up in reflection if it was inlined and no other callers exist?
Edit: I can't force it, so can I detect it?
No, you can't. What's more, the one that decides on inlining isn't the Visual Studio compiler, which takes your code and converts it into IL, but the JIT compiler, which takes IL and converts it to machine code. This is because only the JIT compiler knows enough about the processor architecture to decide whether inlining a method is appropriate, as it's a tradeoff between instruction pipelining and cache size.
So even looking in .NET Reflector will not help you.
"You can check
System.Reflection.MethodBase.GetCurrentMethod().Name.
If the method is inlined, it will
return the name of the caller
instead."
--Joel Coehoorn
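A minimal sketch of the check quoted above. The class and method names are hypothetical, and whether the JIT ever inlines a method that itself calls GetCurrentMethod is an assumption that varies by runtime, so only the NoInlining case has a guaranteed result.

```csharp
using System.Reflection;
using System.Runtime.CompilerServices;

public static class InlineProbe
{
    // NoInlining guarantees this method keeps its own frame,
    // so it always reports its own name.
    [MethodImpl(MethodImplOptions.NoInlining)]
    public static string Pinned()
    {
        return MethodBase.GetCurrentMethod().Name;
    }

    // No attribute: the JIT decides. If it inlines this method, the
    // reported name is the caller's rather than "Candidate".
    public static string Candidate()
    {
        return MethodBase.GetCurrentMethod().Name;
    }

    // True when the probe reported some other method's name.
    public static bool CandidateWasInlined()
    {
        return Candidate() != "Candidate";
    }
}
```

Calling InlineProbe.CandidateWasInlined() in a Release build without a debugger attached gives the most representative answer, since Debug builds disable JIT optimizations.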
There is a new way to encourage more aggressive inlining in .NET 4.5 that is described here: http://blogs.microsoft.co.il/blogs/sasha/archive/2012/01/20/aggressive-inlining-in-the-clr-4-5-jit.aspx
Basically it is just a flag that tells the JIT compiler to inline if possible. Unfortunately, it's not available in the current version of XNA (Game Studio 4.0) but should be available when XNA catches up to VS 2012 this year some time. It is already available if you are somehow running on Mono.
// Requires: using System.Runtime.CompilerServices;
[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static int LargeMethod(int i, int j)
{
    if (i + 14 > j)
    {
        return i + j;
    }
    else if (j * 12 < i)
    {
        return 42 + i - j * 7;
    }
    else
    {
        return i % 14 - j;
    }
}
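Applied to the helper from the question, the attribute might be used as in the sketch below. The pixel layout (alpha in the top 8 bits of a packed 32-bit value) and the Collides helper are assumptions for illustration, not something the question specifies, and the attribute is only a hint that the JIT may still decline.

```csharp
using System.Runtime.CompilerServices;

public static class PixelTests
{
    // Assumption: alpha occupies the top 8 bits of the packed pixel value.
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static bool IsTransparent(int pixelColorValue)
    {
        // Mask after shifting so sign extension cannot leak into the result.
        return ((pixelColorValue >> 24) & 0xFF) == 0;
    }

    // Hypothetical helper: two overlapping pixels collide only when
    // neither of them is fully transparent.
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static bool Collides(int pixelA, int pixelB)
    {
        return !IsTransparent(pixelA) && !IsTransparent(pixelB);
    }
}
```

The collision loop can then read as `if (PixelTests.Collides(a[i], b[j])) ...` instead of repeating the shift-and-mask expression at each site.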
Be aware that the Xbox works differently.
A Google search turned up this:
"The inline method which mitigates the overhead of a call of a method.
JIT forms into an inline what fulfills the following conditions.
The IL code size is 16 bytes or less.
The branch command is not used (if
sentence etc.).
The local variable is not used.
Exception handling has not been
carried out (try, catch, etc.).
float is not used as the argument or
return value of a method (probably by
the Xbox 360, not applied).
When two or more arguments are in a
method, it uses for the turn
declared.
However, a virtual function is not formed into an inline."
http://xnafever.blogspot.com/2008/07/inline-method-by-xna-on-xbox360.html
I have no idea if he is correct. Anyone?
Nope, you can't.
Basically, you can't do that in most modern C++ compilers either. inline is just an offer to the compiler. It's free to take it or not.
The C# compiler does not do any special inlining at the IL level. JIT optimizer is the one that will do it.
Why not use unsafe code ("inline C", as it's known) and make use of C/C++-style pointers? This is safe from the GC (i.e. not affected by collection) but comes with its own security implications (it can't be used for Internet-zone apps). It is excellent for the kind of thing you appear to be trying to achieve, especially for performance, and even more so with arrays and bitwise operations.
To summarise: you want performance for a small part of your app? Using unsafe code and pointers seems the best option to me.
EDIT: a bit of a starter:
http://msdn.microsoft.com/en-us/library/aa288474(VS.71).aspx
The only way to check this is to get or write a profiler and hook into the JIT events. You must also make sure inlining is not turned off, as it is by default when profiling.
You can detect it at runtime with the aforementioned GetCurrentMethod call. But that would seem to be a bit of a waste[1]. The easiest thing to do would be to just ILDASM the MSIL and check there.
Note that this is specifically for the compiler inlining the call, and is covered in the various Reflection docs on MSDN.
If the method that calls the GetCallingAssembly method is expanded inline by the compiler (that is, if the compiler inserts the function body into the emitted Microsoft intermediate language (MSIL), rather than emitting a function call), then the assembly returned by the GetCallingAssembly method is the assembly containing the inline code. This might be different from the assembly that contains the original method. To ensure that a method that calls the GetCallingAssembly method is not inlined by the compiler, you can apply the MethodImplAttribute attribute with MethodImplOptions.NoInlining.
However, the JITter is also free to inline calls - but I think a disassembler would be the only way to verify what is and isn't done at that level.
Edit: Just to clear up some confusion in this thread, csc.exe will inline MSIL calls - though the JITter will (probably) be more aggressive in it.
[1] By waste, I mean that (a) it defeats the purpose of the inlining (better performance) because of the Reflection lookup, and (b) it would probably change the inlining behavior so that the method is no longer inlined anyway. And before you think you can just turn it on in Debug builds with an Assert or something, realize that the method will not be inlined in Debug, but may be in Release.
Is there a way to force the compiler to inline this call, or do I just have to hope that the compiler will do this optimization?
If it is cheaper to inline the function, it will. So don't worry about it unless your profiler says that it actually is a problem.
For more information
JIT Enhancements in .NET 3.5 SP1
For simple code, you can try to get asm even online: https://sharplab.io/
For more complex cases, try https://github.com/szehetner/InliningAnalyzer (I've not tried it yet).
Please ignore code readability in this question.
In terms of performance, should the following code be written like this:
int maxResults = criteria.MaxResults;
if (maxResults > 0)
{
    while (accounts.Count > maxResults)
        accounts.RemoveAt(maxResults);
}
or like this:
if (criteria.MaxResults > 0)
{
    while (accounts.Count > criteria.MaxResults)
        accounts.RemoveAt(criteria.MaxResults);
}
?
Edit: criteria is a class, and MaxResults is a simple integer property (i.e., public int MaxResults { get { return _maxResults; } }).
Does the C# compiler treat MaxResults as a black box and evaluate it every time? Or is it smart enough to figure out that I've got 3 calls to the same property with no modification of that property between the calls? What if MaxResults was a field?
One of the laws of optimization is precalculation, so I instinctively wrote this code like the first listing, but I'm curious if this kind of thing is being done for me automatically (again, ignore code readability).
(Note: I'm not interested in hearing the 'micro-optimization' argument, which may be valid in the specific case I've posted. I'd just like some theory behind what's going on or not going on.)
First off, the only way to actually answer performance questions is to actually try it both ways and test the results in realistic conditions.
That said, the other answers which say that "the compiler" does not do this optimization because the property might have side effects are both right and wrong. The problem with the question (aside from the fundamental problem that it simply cannot be answered without actually trying it and measuring the result) is that "the compiler" is actually two compilers: the C# compiler, which compiles to MSIL, and the JIT compiler, which compiles IL to machine code.
The C# compiler never ever does this sort of optimization; as noted, doing so would require that the compiler peer into the code being called and verify that the result it computes does not change over the lifetime of the callee's code. The C# compiler does not do so.
The JIT compiler might. No reason why it couldn't. It has all the code sitting right there. It is completely free to inline the property getter, and if the jitter determines that the inlined property getter returns a value that can be cached in a register and re-used, then it is free to do so. (If you don't want it to do so because the value could be modified on another thread then you already have a race condition bug; fix the bug before you worry about performance.)
Whether the jitter actually does inline the property fetch and then enregister the value, I have no idea. I know practically nothing about the jitter. But it is allowed to do so if it sees fit. If you are curious about whether it does so or not, you can either (1) ask someone who is on the team that wrote the jitter, or (2) examine the jitted code in the debugger.
And finally, let me take this opportunity to note that computing results once, storing the result and re-using it is not always an optimization. This is a surprisingly complicated question. There are all kinds of things to optimize for:
execution time
executable code size -- this has a major effect on execution time because big code takes longer to load, increases the working set size, and puts pressure on processor caches, RAM and the page file. Small slow code is often in the long run faster than big fast code in important metrics like startup time and cache locality.
register allocation -- this also has a major effect on execution time, particularly in architectures like x86 which have a small number of available registers. Enregistering a value for fast re-use can mean that there are fewer registers available for other operations that need optimization; perhaps optimizing those operations instead would be a net win.
and so on. It gets real complicated real fast.
In short, you cannot possibly know whether writing the code to cache the result rather than recomputing it is actually (1) faster, or (2) better performing. Better performance does not always mean making execution of a particular routine faster. Better performance is about figuring out what resources are important to the user -- execution time, memory, working set, startup time, and so on -- and optimizing for those things. You cannot do that without (1) talking to your customers to find out what they care about, and (2) actually measuring to see if your changes are having a measurable effect in the desired direction.
If MaxResults is a property then no, it will not optimize it, because the getter may have complex logic, say:
private int _maxResults;
public int MaxResults {
    get { return _maxResults++; }
    set { _maxResults = value; }
}
See how the behavior would change if it in-lines your code?
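To make the difference concrete, here is a small sketch built around the getter above. The Criteria constructor and the GetterDemo class are hypothetical scaffolding added for illustration.

```csharp
public class Criteria
{
    private int _maxResults;

    public Criteria(int start) { _maxResults = start; }

    // Getter with a side effect: every read bumps the backing field.
    public int MaxResults
    {
        get { return _maxResults++; }
        set { _maxResults = value; }
    }
}

public static class GetterDemo
{
    // Reads the property three times, as the second listing in the question does.
    public static int[] ReadThreeTimes(Criteria c)
    {
        return new[] { c.MaxResults, c.MaxResults, c.MaxResults };
    }

    // Reads the property once and reuses the value, as the first listing does.
    public static int[] ReadOnceAndReuse(Criteria c)
    {
        int cached = c.MaxResults;
        return new[] { cached, cached, cached };
    }
}
```

Starting from 5, ReadThreeTimes yields 5, 6, 7 while ReadOnceAndReuse yields 5, 5, 5, so a compiler that silently rewrote one form into the other would change observable behavior.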
If there's no logic...either method you wrote is fine, it's a very minute difference and all about how readable it is TO YOU (or your team)...you're the one looking at it.
Your two code samples are only guaranteed to have the same result in single-threaded environments, which .Net isn't, and if MaxResults is a field (not a property). The compiler can't assume, unless you use the synchronization features, that criteria.MaxResults won't change during the course of your loop. If it's a property, it can't assume that using the property doesn't have side effects.
Eric Lippert points out quite correctly that it depends a lot on what you mean by "the compiler". The C# -> IL compiler? Or the IL -> machine code (JIT) compiler? And he's right to point out that the JIT may well be able to optimize the property getter, since it has all of the information (whereas the C# -> IL compiler doesn't, necessarily). It won't change the situation with multiple threads, but it's a good point nonetheless.
It will be called and evaluated every time. The compiler has no way of determining if a method (or getter) is deterministic and pure (no side effects).
Note that actual evaluation of the property may be inlined by the JIT compiler, making it effectively as fast as a simple field.
It's good practise to make property evaluation an inexpensive operation. If you do some heavy calculation in the getter, consider caching the result manually, or changing it to a method.
why not test it?
Just set up two console apps, make each loop 10 million times and compare the results... Remember to run them as proper release builds that have been installed properly, or else you cannot guarantee that you are not just running unoptimized MSIL.
Really, you are probably going to get about 5 answers saying 'you shouldn't worry about optimisation'. They clearly do not write routines that need to be as fast as possible before being readable (e.g. games).
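A rough sketch of such a test in a single program, using Stopwatch. The SearchCriteria class, the list size of 100 and the limit of 10 are all made up for illustration, and both variants pay the same list-allocation cost inside the timed loop, so only the relative numbers mean anything.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

public sealed class SearchCriteria
{
    private readonly int _maxResults;
    public SearchCriteria(int maxResults) { _maxResults = maxResults; }
    public int MaxResults { get { return _maxResults; } }
}

public static class TrimBenchmark
{
    // First listing: property read once into a local.
    public static void TrimWithLocal(List<int> accounts, SearchCriteria criteria)
    {
        int maxResults = criteria.MaxResults;
        if (maxResults > 0)
            while (accounts.Count > maxResults)
                accounts.RemoveAt(maxResults);
    }

    // Second listing: property read on every use.
    public static void TrimWithProperty(List<int> accounts, SearchCriteria criteria)
    {
        if (criteria.MaxResults > 0)
            while (accounts.Count > criteria.MaxResults)
                accounts.RemoveAt(criteria.MaxResults);
    }

    // Builds a fresh 100-element list for each run.
    public static List<int> NewList()
    {
        var list = new List<int>(100);
        for (int i = 0; i < 100; i++) list.Add(i);
        return list;
    }

    // Times `iterations` runs of one variant after a warm-up call,
    // so that JIT compilation is not part of the measurement.
    public static TimeSpan Time(Action<List<int>, SearchCriteria> trim, int iterations)
    {
        var criteria = new SearchCriteria(10);
        trim(NewList(), criteria);
        var watch = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
            trim(NewList(), criteria);
        watch.Stop();
        return watch.Elapsed;
    }
}
```

Comparing TrimBenchmark.Time(TrimBenchmark.TrimWithLocal, n) against TrimBenchmark.Time(TrimBenchmark.TrimWithProperty, n) in a Release build, run several times, gives a first impression; a dedicated benchmarking harness would be more trustworthy.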
If this piece of code is part of a loop that is executed billions of times then this optimisation could be worthwhile. For instance max results could be an overridden method and so you may need to discuss virtual method calls.
Really, the ONLY way to answer any of these questions is to figure out if this is a piece of code that will benefit from optimisation. Then you need to know the kinds of things that are increasing the time to execute. Really, we mere mortals cannot do this a priori and so have to simply try 2-3 different versions of the code and then test them.
If criteria is a class type, I doubt it would be optimized, because another thread could always change that value in the meantime. For structs I'm not sure, but my gut feeling is that it won't be optimized, but I think it wouldn't make much difference in performance in that case anyhow.
I have a very small piece of C# code marked as inline, but it doesn't work.
I have seen that the longest function generates more than 32 bytes of IL code. Is the limit of 32 bytes too short?
// inlined
[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static bool INL_IsInRange(this byte pValue, byte pMin) {
    return pValue >= pMin;
}
// NOT inlined
[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static bool INL_IsInRange(this byte pValue, byte pMin, byte pMax) {
    return pValue >= pMin && pValue <= pMax;
}
Is it possible to change that limit?
I am looking for inline function criteria also. In your case, I believe that JIT optimization timed out before it could reach the decision to inline your second function. For JIT, it's not a priority to inline a function, so it was busy analyzing your long code. However, if you place your calls inside tight loops, JIT will probably inline them, as inner calls gain priority to inline. If you really care about this type of micro-optimization, it's time to switch to C++. It's a whole new brave world out there for you to explore and exploit!
I noticed that the question had been edited right after this answer had been posted, meaning a high level of interactivity. Well, I don't know why there is a limit of 32 bytes, but that seems to be exactly the size of a CPU cache block, conservatively speaking. What a coincidence! In any case, code optimization must be done with a particular hardware configuration, better saved in an extra file side by side with its assembly. The timeout policy is stupid, because optimization is not supposed to be done at run-time, competing against the precious code execution time. Optimization is supposed to be done at application load-time, only the first time it's run on the machine, once for all. It can be triggered again when hardware configuration change is detected. Again, if you really need performance, just go with C/C++. C# is not designed for performance and will never make performance its top priority. Like Java, C# is designed for safety, with a much stronger caution against possible negative performance impacts.
Besides the "32 bytes of IL" limit, there are a number of other factors which affect whether a method will be inlined or not. There are at least a couple of articles that describe these factors.
One article explains that a scoring heuristic is used to adjust an initial guess about the relative size of the code when inlined vs not (i.e. whether the call site is larger or smaller than the inlined code itself):
If inlining makes code smaller then the call it replaces, it is ALWAYS good. Note that we are talking about the NATIVE code size, not the IL code size (which can be quite different).
The more a particular call site is executed, the more it will benefit from inlning. Thus code in loops deserves to be inlined more than code that is not in loops.
If inlining exposes important optimizations, then inlining is more desirable. In particular methods with value types arguments benefit more than normal because of optimizations like this and thus having a bias to inline these methods is good.
Thus the heuristic the X86 JIT compiler uses is, given an inline candidate.
Estimate the size of the call site if the method were not inlined.
Estimate the size of the call site if it were inlined (this is an estimate based on the IL, we employ a simple state machine (Markov Model), created using lots of real data to form this estimator logic)
Compute a multiplier. By default it is 1
Increase the multiplier if the code is in a loop (the current heuristic bumps it to 5 in a loop)
Increase the multiplier if it looks like struct optimizations will kick in.
If InlineSize <= NonInlineSize * Multiplier do the inlining.
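The quoted steps can be sketched as a small function. This is a simplification for illustration, not the JIT's code: the sizes are passed in instead of estimated, and the struct bonus of 2 is a made-up value, since the article only says the multiplier is increased.

```csharp
public static class InlineHeuristic
{
    // From the quoted text: the multiplier is bumped to 5 inside a loop.
    public const double LoopMultiplier = 5.0;

    // Hypothetical value: the article gives no number for the struct case.
    public const double StructBonus = 2.0;

    // Returns true when the estimated inlined size is within the
    // allowed multiple of the estimated call-site size.
    public static bool ShouldInline(int inlineSize, int nonInlineSize,
                                    bool inLoop, bool structOptimizationsLikely)
    {
        double multiplier = 1.0;
        if (inLoop) multiplier = LoopMultiplier;
        if (structOptimizationsLikely) multiplier *= StructBonus;
        return inlineSize <= nonInlineSize * multiplier;
    }
}
```

With an inlined estimate of 30 bytes against a 10-byte call site, the sketch rejects the candidate outside a loop (30 > 10) and accepts it inside one (30 <= 50), which matches the earlier advice that calls in tight loops are more likely to be inlined.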
Another article explains several conditions that will prevent a method from being inlined based on their mere existence (including the "32-bytes of IL" limit):
These are some of the reasons for which we won't inline a method:
Method is marked as not inline with the CompilerServices.MethodImpl attribute.
Size of inlinee is limited to 32 bytes of IL: This is a heuristic, the rationale behind it is that usually, when you have methods bigger than that, the overhead of the call will not be as significative compared to the work the method does. Of course, as a heuristic, it fails in some situations. There have been suggestions for us adding an attribute to control these threshold. For Whidbey, that attribute has not been added (it has some very bad properties: it's x86 JIT specific and it's longterm value, as compilers get smarter, is dubious).
Virtual calls: We don't inline across virtual calls. The reason for not doing this is that we don't know the final target of the call. We could potentially do better here (for example, if 99% of calls end up in the same target, you can generate code that does a check on the method table of the object the virtual call is going to execute on, if it's not the 99% case, you do a call, else you just execute the inlined code), but unlike the J language, most of the calls in the primary languages we support, are not virtual, so we're not forced to be so aggressive about optimizing this case.
Valuetypes: We have several limitations regarding value types an inlining. We take the blame here, this is a limitation of our JIT, we could do better and we know it. Unfortunately, when stack ranked against other features of Whidbey, getting some statistics on how frequently methods cannot be inlined due to this reason and considering the cost of making this area of the JIT significantly better, we decided that it made more sense for our customers to spend our time working in other optimizations or CLR features. Whidbey is better than previous versions in one case: value types that only have a pointer size int as a member, this was (relatively) not expensive to make better, and helped a lot in common value types such as pointer wrappers (IntPtr, etc).
MarshalByRef: Call targets that are in MarshalByRef classes won't be inlined (call has to be intercepted and dispatched). We've got better in Whidbey for this scenario
VM restrictions: These are mostly security, the JIT must ask the VM for permission to inline a method (see CEEInfo::canInline in Rotor source to get an idea of what kind of things the VM checks for).
Complicated flowgraph: We don't inline loops, methods with exception handling regions, etc...
If basic block that has the call is deemed as it won't execute frequently (for example, a basic block that has a throw, or a static class constructor), inlining is much less aggressive (as the only real win we can make is code size)
Other: Exotic IL instructions, security checks that need a method frame, etc...
using System;
namespace ConsoleApplication1
{
    class TestMath
    {
        static void Main()
        {
            double res = 0.0;
            for (int i = 0; i < 1000000; ++i)
                res += System.Math.Sqrt(2.0);
            Console.WriteLine(res);
            Console.ReadKey();
        }
    }
}
Benchmarking this code against the C++ version, I discovered that it is 10 times slower than the C++ version. I have no problem with that, but it leads me to the following question:
It seems (after a bit of searching) that the JIT compiler can't optimize this code the way the C++ compiler can, namely by calling sqrt just once and multiplying the result by 1000000.
Is there a way to force the JIT to do it?
I can reproduce this: I clock the C++ version at 1.2 msec, the C# version at 12.2 msec. The reason is readily visible if you take a look at the machine code the C++ code generator and optimizer emits. It rewrites the loop like this (using the C# equivalent):
double temp = Math.Sqrt(2.0);
for (int i = 0; i < 1000000; ++i) {
res += temp;
}
That's a combination of two optimizations, called "invariant code motion" and "loop hoisting". In other words, the C++ compiler knows enough about the sqrt() function to know that its return value is not affected by the surrounding code, so the call can be moved at will. And that it is then worthwhile to move that code outside of the loop and create an extra local variable to store the result. And that calculating sqrt() is slower than adding. That sounds obvious, but it's a rule that has to be built into the optimizer and has to be considered, one of many, many rules.
And yes, the jitter optimizer misses that one. It is guilty of not being able to spend the same amount of time as the C++ optimizer; it operates under heavy time constraints, because if it takes too long then the program takes too much time getting started.
Tongue in cheek: a C# programmer needs to be a bit smarter than the code generator and recognize these optimization opportunities himself. This is a fairly obvious one. Well, now that you know about it anyway :)
To do the optimization you want, the compiler has to assure that the function Sqrt() will always return the same value for a certain input.
The compiler can do all kinds of checks that the function isn't using any other "outer" variables to see if it's stateless. But that also doesn't always mean that it can't be affected by side effects.
When a function is called in a loop it should be called in each iteration (think of a multithreaded environment to see why this is important). So usually it's up to the user to take constant stuff out of the loop if he wants that kind of optimization.
Back to the C++ compiler - the compiler might have certain optimization for its library functions. A lot of compilers try to optimize important libraries like the math library, so that might be compiler specific.
Another big difference is in C++ you usually include that kinda stuff from a header file. This means the compiler may have all the information it needs to decide if the function call doesn't change between calls.
The .NET compiler (at compile time, in Visual Studio) doesn't always have all the code to parse. Most of the library functions are already compiled (into IL, the first stage), so it might not be able to do deep optimizations involving third-party DLLs. And at JIT (runtime) compilation it will probably be too costly to do these kinds of optimizations across assemblies.
It might help the JIT (or even the C# compiler) if Math.Sqrt was annotated as [Pure]. Then, assuming the arguments to the function are constant as they are in your example, the calculation of the value could be lifted outside the loop.
What's more, such a loop could reasonably be converted into the code:
double res = 1000000 * Math.Sqrt(2.0);
In theory the compiler or JIT could perform this automatically. However I suspect that it would be optimising for a pattern that happens rarely in actual code.
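A sketch putting the three forms discussed in this thread side by side (the class and method names are hypothetical). Note that the closed form replaces a million rounded additions with one multiplication, so it is only approximately equal to the loop result; that is one plausible reason a compiler would not make this rewrite on its own.

```csharp
using System;

public static class SqrtLoop
{
    // The original loop: Math.Sqrt is called on every iteration.
    public static double Naive(int n)
    {
        double res = 0.0;
        for (int i = 0; i < n; ++i)
            res += Math.Sqrt(2.0);
        return res;
    }

    // Manually hoisted: the invariant call is moved out of the loop.
    public static double Hoisted(int n)
    {
        double temp = Math.Sqrt(2.0);
        double res = 0.0;
        for (int i = 0; i < n; ++i)
            res += temp;
        return res;
    }

    // Closed form: no loop at all, but the rounding differs slightly.
    public static double ClosedForm(int n)
    {
        return n * Math.Sqrt(2.0);
    }
}
```

Hoisted is the version a C# programmer would normally write by hand; ClosedForm is appropriate only when a tiny difference in the last digits does not matter.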
I opened a feature request for ReSharper, suggesting that the design-time tool suggests such a refactoring.
Is there any way I can check (not force) if a given method or property getter is being inlined in a release build?
No - because it doesn't happen at build time; it happens at JIT time. The C# compiler won't perform any inlining; it's up to the CLR that the code ends up running on.
You can discover this using cordbg with all JIT optimizations turned on, but you'll need to dig through the assembly code. I don't know of any way of discovering this within code. (It's possible you could do so with the debugger API, although that may well disable some inlining to start with.)
They're never inlined by the C# compiler. Only const fields are.
You can take a look at the C# compiler optimizations here.
You can make sure that a method or property accessor is never inlined with this attribute applied to it:
[MethodImpl(MethodImplOptions.NoInlining)]
You'd have to look at the machine code. Set a breakpoint on method call and when it hits, right-click and choose Go To Assembly. If you don't see the CALL statement then it got inlined. You'll have to be up to speed a little on reading machine code to be really sure though, you might see a call that was in the inlined method.
To make this accurate, you'll have to use Tools + Options, Debugging, General, untick "Suppress JIT optimization on module load". Which ensures the jitter behaves as it does without the debugger, methods won't be inlined when the optimizer is turned off.
Add code within the method body to examine the stack trace using StackFrame. In my experience, inlined methods are excluded from this stack trace.
I know this post is rather old, but you could just print out the stack at the point where you call the function and again inside the function itself. This is probably the easiest way, because inlining happens at JIT compilation time.
If the two printed stacks match, you can be sure that the function was inlined.
To print out the stack you can use System.Environment.StackTrace or the VS variables $caller and $callstack (https://msdn.microsoft.com/en-us/library/5557y8b4.aspx#BKMK_Print_to_the_Output_window_with_tracepoints)
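A minimal sketch of the stack-based check described in the last two answers. The names are hypothetical, and it assumes a runtime where stack frames carry method information; only the NoInlining variant has a guaranteed result.

```csharp
using System.Diagnostics;
using System.Runtime.CompilerServices;

public static class StackProbe
{
    // NoInlining forces a real frame, so the top frame is this method.
    [MethodImpl(MethodImplOptions.NoInlining)]
    public static string TopFrameWhenPinned()
    {
        return new StackTrace().GetFrame(0).GetMethod().Name;
    }

    // No attribute: if the JIT inlines this method, the top frame
    // belongs to the caller and a different name comes back.
    public static string TopFrameOfCandidate()
    {
        return new StackTrace().GetFrame(0).GetMethod().Name;
    }
}
```

If StackProbe.TopFrameOfCandidate() returns anything other than "TopFrameOfCandidate" in a Release build run outside the debugger, the method was inlined at that call site. As with the GetCurrentMethod approach, adding the probe may itself change the JIT's decision.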
It's possible without looking at the assembly code:
http://blogs.msdn.com/b/clrcodegeneration/archive/2009/05/11/jit-etw-tracing-in-net-framework-4.aspx
I'm writing my own scripting language in C#, with some features I like, and I chose to use MSIL as the output bytecode (Reflection.Emit is quite useful, and I don't have to think up another bytecode). It works, emits an executable which can be run (even decompiled with Reflector :) ) and is quite fast.
But I want to run multiple 'processes' in one process and one thread, and control their assigned CPU time manually (and also implement much more robust IPC than is offered by the .NET Framework). Is there any way to entirely disable JIT and create own VM, stepping instruction-after-instruction using .NET framework (and control memory usage, etc.), without need to write anything on my own, or to achieve this I must write entire MSIL interpret?
EDIT 1): I know that interpreting IL isn't the fastest thing in the universe :)
EDIT 2): To clarify: I want my VM to be some kind of 'operating system'. It gets some CPU time and divides it between processes, controls memory allocation for them, and so on. It doesn't have to be fast or efficient; it is just a proof of concept for some of my experiments. I don't need to implement it at the level of processing every instruction. If that can be done by .NET, I won't mind; I just want to be able to say: step one instruction, and wait until I tell you to step to the next.
EDIT 3): I realized that ICorDebug might meet my needs; I'm now looking at the implementation of Mono's runtime.
You could use Mono - I believe that allows an option to interpret the IL instead of JITting it. The fact that it's open source means (subject to licensing) that you should be able to modify it according to your needs, too.
Mono doesn't have all of .NET's functionality, admittedly - but it may do all you need.
Beware that MSIL was designed to be parsed by a JIT compiler. It is not very suitable for an interpreter. A good example is perhaps the ADD instruction. It is used to add a wide variety of value type values: byte, short, int32, int64, ushort, uint32, uint64. Your compiler knows what kind of add is required but you'll lose that type info when generating the MSIL.
Now you need to find it back at runtime and that requires checking the types of the values on the evaluation stack. Very slow.
An easily interpreted IL has dedicated ADD instructions like ADD8, ADD16, etc.
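The cost difference can be sketched with two tiny handlers. This is an illustration only, not how any real interpreter is written; the boxed object stack and the typed AddInt32 handler (a stand-in for an ADD32-style instruction) are assumptions.

```csharp
using System;
using System.Collections.Generic;

public static class TinyInterpreter
{
    // Generic "add" in the MSIL style: the handler has to inspect the
    // operand types on the evaluation stack before it can do the work.
    public static void Add(Stack<object> stack)
    {
        object right = stack.Pop();
        object left = stack.Pop();
        if (left is int && right is int)
            stack.Push((int)left + (int)right);
        else if (left is long && right is long)
            stack.Push((long)left + (long)right);
        else if (left is double && right is double)
            stack.Push((double)left + (double)right);
        else
            throw new InvalidOperationException("Unsupported operand types for add");
    }

    // Typed add: the instruction itself says what the operands are,
    // so there is no type check and no boxing.
    public static void AddInt32(Stack<int> stack)
    {
        int right = stack.Pop();
        int left = stack.Pop();
        stack.Push(left + right);
    }
}
```

Both handlers compute 2 + 3 = 5, but the generic one pays for type tests and boxing on every instruction, which is the overhead the answer above warns about.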
Microsoft's implementation of the Common Language Runtime has only one execution system, the JIT. Mono, on the other hand, comes with both a JIT and an interpreter.
I do not, however, fully understand what exactly you want to do yourself and what you would like to leave to Microsoft's implementation:
Is there any way to entirely disable JIT and create own VM?
and
... without need to write anything on my own, or to achieve this I must write entire MSIL interpret?
are somewhat contradictory.
If you think you can write a better execution system than Microsoft's JIT, you will have to write it from scratch. Bear in mind, however, that both Microsoft's and Mono's JITs are highly optimized compilers. (Programming language shootout)
Being able to schedule CPU time for operating system processes exactly is not possible from user mode. That's the operating systems task.
Some implementation of green threads might be an idea, but that is definitely a topic for unmanaged code. If that's what you want, have a look at the CLR hosting API.
I would suggest, you try to implement your language in CIL. After all, it gets compiled down to raw x86. If you don't care about verifiability, you can use pointers where necessary.
One thing you could consider doing is generating code in a state-machine style. Let me explain what I mean by this.
When you write generator methods in C# with yield return, the method is compiled into an inner IEnumerator class that implements a state machine. The method's code is compiled into logical blocks that are terminated with a yield return or yield break statement, and each block corresponds to a numbered state. Because each yield return must provide a value, each block ends by storing a value in a local field. The enumerator object, in order to generate its next value, calls a method that consists of a giant switch statement on the current state number in order to run the current block, then advances the state and returns the value of the local field.
Your scripting language could generate its methods in a similar style, where a method corresponds to a state machine object, and the VM allocates time by advancing the state machine during the time allotted. A few tricky parts to this method: implementing things like method calls and try/finally blocks are harder than generating straight-up MSIL.
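As a rough illustration of the idea, C# iterators can stand in for the generated state machines: each yield return is a point where the VM regains control. The RoundRobinVm class and the string-valued steps are hypothetical; a real VM would track time or instruction budgets rather than giving each process exactly one step per turn.

```csharp
using System.Collections.Generic;

public static class RoundRobinVm
{
    // A script "process": each yield return marks a point where the
    // scheduler may switch to another process.
    public static IEnumerable<string> Script(string name, int steps)
    {
        for (int i = 1; i <= steps; i++)
            yield return name + ":" + i;
    }

    // Runs all processes on one thread, one step per turn, and returns
    // the order in which the steps were executed.
    public static List<string> Run(params IEnumerable<string>[] scripts)
    {
        var log = new List<string>();
        var ready = new Queue<IEnumerator<string>>();
        foreach (var script in scripts)
            ready.Enqueue(script.GetEnumerator());

        while (ready.Count > 0)
        {
            IEnumerator<string> process = ready.Dequeue();
            if (process.MoveNext())
            {
                log.Add(process.Current);   // the process ran one step
                ready.Enqueue(process);     // back of the queue
            }
            else
            {
                process.Dispose();          // the process has finished
            }
        }
        return log;
    }
}
```

Running RoundRobinVm.Run(RoundRobinVm.Script("A", 2), RoundRobinVm.Script("B", 3)) interleaves the steps as A:1, B:1, A:2, B:2, B:3, all on a single thread and without touching the JIT.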