Is there an easy way of JIT-ing C# code up front, rather than waiting for the first time the code is invoked? I have read about NGEN but I don't think that's going to help me.
My application waits and responds to a specific external event that comes from a UDP port, and none of the critical-path code is (a) run before the event arrives, or (b) ever run again, so the cost of JIT is high in this scenario. Checking with ANTS profiler the overhead for JIT is around 40-50%, sometimes it's as high as 90%. My application is highly latency sensitive, and every millisecond counts.
My initial thought is that I could add a bool parameter to every critical path method, and call those methods before the event occurs, in order to initiate the JIT compile. However, is there a prettier and less hacky way?
Many thanks
I'd say use NGEN and if it doesn't work, you likely have deeper problems.
But, to answer your question, this article on how to pre-jit uses System.Runtime.CompilerServices.RuntimeHelpers.PrepareMethod to force a JIT. It includes sample code to use reflection to get the method handles.
What happens the second time the event arrives? Is it faster then or just as slow. If its still slow then JIT is not the problem, because the code gets "JIT"ed only once, the first time it is run.
NGEN would provide the answer for you. My suggestion is to take the bare minimum of the code you need, the critical path if you will, and put it in dummy/sandbox project. Start profiling/NGenning this code and seeing the performance.
If this bare minimum code, even after being NGEN'ed performs poorly on multiple calls, then pre-compiling isn't going to help you. Its something else in the code that is causing performance bottle necks.
Related
Back in 2009 I posted this answer to a question about optimisations for nested try/catch/finally blocks.
Thinking about this again some years later, it seems the question could be extended to that other control flow, not only try/catch/finally, but also if/else.
At each of these junctions, execution will follow one path. Code must be generated for both, obviously, but the order in which they're placed in memory, and the number of jumps required to navigate through them will differ.
The order generated code is laid out in memory has implications for the miss rate on the CPU's instruction cache. Having the instruction pipeline stalled, waiting for memory reads, can really kill loop performance.
I don't think loops (for/foreach/while) are a such a good fit unless you expect the loop have zero iterations more often than it has some, as the natural generation order seems pretty optimal.
Some questions:
In what ways do the available .NET JITs optimise for generated instruction order?
How much difference can this make in practice to common code? What about perfectly suited cases?
Is there anything the developer can do to influence this layout? What about mangling with the forbidden goto?
Does the specific JIT being used make much difference to layout?
Does the method inlining heuristic come into play here too?
Basically anything interesting related to this aspect of the JIT!
Some initial thoughts:
Moving catch blocks out of line is an easy job, as they're supposed to be the exceptional case by definition. Not sure this happens.
For some loops I suspect you can increase performance non-trivially. However in general I don't think it'll make that much difference.
I don't know how the JIT decides the order of generated code. In C on Linux you have likely(cond) and unlikely(cond) which you can use to tell to the compiler which branch is the common path to optimise for. I'm not sure that all compilers respect these macros.
Instruction ordering is distinct from the problem of branch prediction, in which the CPU guesses (on its own, afaik) which branch will be taken in order to start the pipeline (oversimplied steps: decode, fetch operands, execute, write back) on instructions, before the execute step has determined the value of the condition variable.
I can't think of any way to influence this order in the C# language. Perhaps you can manipulate it a bit by gotoing to labels explicitly, but is this portable, and are there any other problems with it?
Perhaps this is what profile guided optimisation is for. Do we have that in the .NET ecosystem, now or in plan? Maybe I'll go and have a read about LLILC.
The optimization you are referring to is called the code layout optimization which is defined as follows:
Those pieces of code that are executed close in time in the same thread should be be close in the virtual address space so that they fit in a single or few consecutive cache lines. This reduces cache misses.
Those pieces of code that are executed close in time in different threads should be be close in the virtual address space so that they fit in a single or few consecutive cache lines as long as there is no self-modifying code. This gets lower priority than the previous one. This reduces cache misses.
Those pieces of code that are executed frequently (hot code) should be close in the virtual address space so that they fit in as few virtual pages as possible. This reduces page faults and working set size.
Those pieces of code that are rarely executed (cold code) should be close in the virtual address space so that they fit in as few virtual pages as possible. This reduces page faults and working set size.
Now to your questions.
In what ways do the available .NET JITs optimise for generated
instruction order?
"Instruction order" is really a very general term. Many optimizations affect instruction order. I'll assume that you're referring to code layout.
JITters by design should take the minimum amount of time to compile code while at the same time produce high-quality code. To achieve this, they only perform the most important optimizations so that it's really worth spending time doing them. Code layout optimization is not one of them because without profiling, it may not be beneficial. While a JITter can certainly perform profiling and dynamic optimization, there is a generally preferred way.
How much difference can this make in practice to common code? What
about perfectly suited cases?
Code layout optimization by itself can improve overall performance typically by -1% (negative one) to 4%, which is enough to make compiler writers happy. I would like to add that it reduces energy consumption indirectly by reducing cache misses. The reduction in miss ratio of the instruction cache can be typically up to 35%.
Is there anything the developer can do to influence this layout? What
about mangling with the forbidden goto?
Yes, there are numerous ways. I would like to mention the generally recommended one which is mpgo.exe. Please do not use goto for this purpose. It's forbidden.
Does the specific JIT being used make much difference to layout?
No.
Does the method inlining heuristic come into play here too?
Inlining can indeed improve code layout with respect to function calls. It's one of the most important optimizations and all .NET JITs perform it.
Moving catch blocks out of line is an easy job, as they're supposed to
be the exceptional case by definition. Not sure this happens.
Yes it might be "easy", but what is the potential gained benefit? catch blocks are typically small in size (containing a call to a function that handles the exception). Handling this particular case of code layout does not seem promising. If you really care, use mpgo.exe.
I don't know how the JIT decides the order of generated code. In C on
Linux you have likely(cond) and unlikely(cond) which you can use to
tell to the compiler which branch is the common path to optimise for.
Using PGO is much more preferable over using likely(cond) and unlikely(cond) for two reasons:
The programmer might inadvertently make mistakes while placing likely(cond) and unlikely(cond) in the code. It actually happens a lot. Making big mistakes while trying to manually optimize the code is very typical.
Adding likely(cond) and unlikely(cond) all over the code makes it less maintainable in the future. You'll have to make sure that these hints hold every time you change the source code. In large code bases, this could be ( or rather is) a nightmare.
Instruction ordering is distinct from the problem of branch
prediction...
Assuming you are talking about code layout, yes they are distinct. But code layout optimization is usually guided by a profile which really includes branch statistics. Hardware branch prediction is of course totally different.
Maybe I'll go and have a read about LLILC.
While using mpgo.exe is the mainstream way of performing this optimization, you can use LLILC also since LLVM support profile-guided optimization as well. But I don't think you need to go this far.
I find timing my code for performance to be a bit of a headache and I have to assume this is a common problem for folks.
Adding in a bunch of code to time my loops is great but often there is more overhead in adding the timing checks than adding a parallel task.
making a loop parallel can take seconds whereas adding in timing chunks requires several lines of code and be dangerous if forgotten.
So... My question is why dont IDE's include this ability to time chunks of code like a break point?
Something like i mark a section of code and tell it to dump the timing output to a console with some options to make it more readable. averages and such.
I know break points can seriously hamper performance but there must be an easier way to time specific chunks of code then to write all my own timing routines for each chunk of code i want to check.
First, VS 2015 does offer timing indicators while debugging.
(source: s-msft.com)
But I say don't use it for serious profiling purposes. Use it to get a very rough idea of what's going on.
Timing anything in a debug build yields unreliable figures, because of many factors (disabled compiler optimizations, disabled JIT optimizations, nop instructions everywhere, breakpoints, etc). Garbage in, garbage out.
Just do your profiling properly, it's a hard thing to get right. And use a real profiler for that if you need detailed performance data.
I am trying to make the loading part of a C# program faster. Currently it takes like 15 seconds to load up.
On first glimpse, things that are done during the loading part includes constructing many 3rd Party UI components, loading layout files, xmls, DLLs, resources files, reflections, waiting for WndProc... etc.
I used something really simple to see the time some part takes,
i.e. breakpointing at a double which holds the total milliseconds of a TimeSpan which is the difference of a DateTime.Now at the start and a DateTime.Now at the end.
Trying that a few times will give me sth like,
11s 13s 12s 12s 7s 11s 12s 11s 7s 13s 7s.. (Usually 12s, but 7s sometimes)
If I add SuspendLayout, BeginUpdate like hell; call things in reflections once instead of many times; reduce some redundant redundant computation redundancy. The time are like 3s 4s 3s 4s 3s 10s 4s 4s 3s 4s 10s 3s 10s.... (Usually 4s, but 10s sometimes)
In both cases, the times are not consistent but more like, a bimodal distribution? It really made me unsure whether my correction of the code is really making it faster.
So I would like to know what will cause such result.
Debug mode?
The "C# hve to compile/interpret the code on the 1st time it runs, but the following times will be faster" thing?
The waiting of WndProc message?
The reflections? PropertyInfo? Reflection.Assembly?
Loading files? XML? DLL? resource file?
UI Layouts?
(There are surely no internet/network/database access in that part)
Thanks.
Profiling by stopping in the debugger is not a reliable way to get timings, as you've discovered.
Profiling by writing times to a log works fine, although why do all this by hand when you can just launch the program in dotTrace? (Free trial, fully functional).
Another thing that works when you don't have access to a profiler is what I call the binary approach - look at what happens in the code and try to disable about half of it by using comments. Note the effect on the running time. If it appears significant, repeat this process with half of that half, and so on recursively until you narrow in on the most significant piece of work. The difficulty is in simulating the side effects of the missing code so that that the remaining code can still work, so this is still harder than using a debugger, but can be quicker than adding a lot of manually time logging, because the binary approach lets you zero in on the slowest place in logarithmic time.
Raymond Chen's advise is good here. When people ask him "How can I make my application start up faster?" he says "Do less stuff."
(And ALWAYS profile the release build - profiling the debug build is generally a wasted effort).
Profile it. you can use eqatec its free
Well, the best thing is to run your application through a profiler and see what the bottlenecks are. I've personally used dotTrace, there are plenty of others you can find on the web.
Debug mode turns off a lot of JIT optimizations, so apps will run a lot slower than release builds. Whatever the mode, JITting has to happen, so I'd discount that as a significant factor. Time to read files from disk can vary based on the OS's caching mechanism, and whether you're doing a cold start or a warm start.
If you have to use timers to profile, I'd suggest repeating the experiment a large number of times and taking the average.
Profiling you code is definitely the best way to identify which areas are taking the longest to run.
As for the other part of your question about the inconsistent timings: timings in an multitasking O/S are inherently inconsistent, and working with managed code throws the garbage collector into the equation too. It could be that the GC is kicking in during your timing which will obviously slow things down.
If you want to try and get a "purer" timing try putting a GC collect before you start your timers, this way it is less likely to start in your timing section. Do remember to remove the timers after, as second guessing when the GC should run normally results in poorer performance.
Apart from the obvious (profiling), which will tell you precisely where time is being spent, there are some other points that spring to mind:
To get reasonable timing results with the approach you are using, run a release build of your program, and have it dump the timing results to a file (e.g. with Trace.WriteLine). Timing a debug version will give you spurious results. When running the timing tests, quit all other applications (including your debugger) to minimise the load on your computer and get more consistent results. Run the program many times and look at the average timings. Finally, bear in mind that Windows caches a lot of stuff, so the first run will be slow and subsequent runs will be much faster. This will at least give you a more consistent basis to tell whether your improvements are making a significant difference.
Don't try and optimise code that shouldn't be run in the first place - Can you defer any of the init tasks? You may find that some of the work can simply be removed from the init sequence. e.g. if you are loading a data file, check whether it is needed immediately - if not, then you could load it the first time it is needed instead of during program startup.
In my website I have a rather complicated business logic part encapsulated in a single DLL. Because CPU usage hits the roof when I call certain methods of it I need to measure and optimize.
I know about the performance profiler, but I don't know how to set it up for a library.
Furthermore I don't seem to find any usefull resources about it.
How do you approach this problem?
You can create a simple exe that runs the main methods of your library.
Although it requires some understanding to know which method to call it can help you focus on speciifc scenarios and check their bottlenecks.
You can also put some performance counters: look into msdn, or open a debugger and use old system: create a Stopwatch and do Debug.Writeline to see what's happening.
As Dror says, run it stand-alone under a simple exe. I would add, run THAT under the IDE, and while it's being slow, just pause it, several times, and each time look in detail at what it's doing. It's counterintuitive but very effective.
I'm writing a DSP application in C# (basically a multitrack editor). I've been profiling it for quite some time on different machines and I've noticed some 'curious' things.
On my home machine, the first run of the playback loop takes up about 50%-60% of the available time, (I assume it's due to the JIT doing its job), then for the subsequent loops it goes down to a steady 5% consumption. The problem is, if I run the application on a slower computer, the first run takes up more than the available time, causing the playback to get interrupted and messing the output audio, which is unacceptable. After that, it goes down to a 8%-10% consumption.
Even after the first run, the application keeps calling some time-consuming routines from time to time (every 2 seconds more or less), which causes the steady 5% consumption to experience very short peaks of 20%-25%. I've noticed that if I let the application run for a while these peaks will also go down to a 7%-10%. (I'm not sure if it's due to the JIT recompiling these portions of code).
So, I have a serious problem with the JIT. While the application will behave nicely even in very slow machines, these 'compiling storms' are going to be a big problem. I'm trying to figure out how to resolve this issue and I've come up with an idea, which is to mark all the 'sensible' routines with an attribute that will tell the application to 'squeeze' them beforehand during start-up, so they'll be fully optimized when they're really needed. But this is only an idea (and I don't like it too much either) and I wonder if there's a better solution to the whole problem.
I'd like to hear what you guys think.
(NGEN the application is not an option, I like and want all the JIT optimizations I can get.)
EDIT:
Memory consumption and garbage collection kicks are not an issue, I'm using object pools and the maximum peak of memory during playback is 304 Kb.
You can trigger the JIT compiler to compile your entire set of assemblies during your application's initialization routine using the PrepareMethod ... method (without having to use NGen).
This solution is described in more detail here: Forcing JIT Compilation During Runtime.
The initial speed indeed sounds like Fusion+JIT, which would be helped by ILMerge (for Fusion) and NGEN (for JIT); you could always play a silent track through the system at startup so that this does all the hard work without the user noticing any distortion?
NGEN is a good option; is there a reason you can't use it?
The issues you mention after the initial load do not sound like they are related to JIT. Perhaps garbage collection.
Have you tried profiling? Both CPU and memory (collections)?
As Marc mentioned, the ongoing spikes do not sound like JIT issues. Other things to look for:
Garbage collection - are you allocating memory during your audio processing? If you're creating a lot of garbage, or even objects which survive a Gen 0 collection, this might cause noticible spikes. It sounds like you are doing some kind of pre-allocation, but watch out for hidden allocations in library code (even a foreach loop can allocate!)
Denormals. There is an issue with certain types of processors when dealing with very small floating point numbers which can cause CPU spikes. See http://www.musicdsp.org/files/denormal.pdf for details.
Edit:
Even if you don't want to use NGen, at least compare an NGen'd version so you can see what difference JITing makes
If you believe you are being impacted by JIT, then precompile your app with NGEN and run the tests again. There is no JIT overhead in code that has been compiled by NGEN. If you still see spikes in the NGEN'd app, then you know they are not caused by JIT.