Building profiling support into the code - c#

I wish (I don't know if it's possible) to build profiling support into my code instead of using an external profiler. I have heard that there is a profiler API that most profiler writers use. Can that API be used to profile from within the code that is being executed? Are there any other considerations?

If you don't want to use a regular profiler, you could have your application output performance counters.
You may find this blog entry useful to get started: Link

The EQATEC Profiler builds an instrumented version of your app that runs and collects profiling statistics entirely on its own - you don't need to attach the profiler. By default your app will simply dump the statistics into plain-text XML files.
This means that you can build a profiled version of your app, deploy it at your customer's site, and have them run it and send back the statistics-reports to you. No need for them to install anything special or run a profiler or anything.
Also, if you can reach your deployed app's machine via a network-connection and it allows incoming connections then you can even take snapshots of the running profiled app yourself, sitting at home with the profiler. All you need is a socket-connection - you decide the port-number yourself and the control-protocol itself is plain http, so it's pretty likely to make it past even content-filtering gateways.

The .NET framework profiler API is a COM object that intercepts calls before .NET handles them. My understanding is that it cannot be hosted in managed (C#) code.
Depending on what you want to do, you can insert Stopwatch timers to measure length of calls, or add Performance Counters to your application so that you can monitor the performance of the application from the Performance Monitor.
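A minimal sketch of the Stopwatch approach (the helper name is mine, not a framework API):

```csharp
using System;
using System.Diagnostics;

static class TimingDemo
{
    // Returns the wall-clock time taken by a single call to `work`.
    public static double TimeMilliseconds(Action work)
    {
        var sw = Stopwatch.StartNew();
        work();
        sw.Stop();
        return sw.Elapsed.TotalMilliseconds;
    }

    static void Main()
    {
        double ms = TimeMilliseconds(() =>
        {
            long sum = 0;
            for (int i = 0; i < 1_000_000; i++) sum += i;  // work being measured
        });
        Console.WriteLine($"work took {ms:F2} ms");
    }
}
```

Note that Stopwatch measures wall-clock time, not CPU time, so anything that blocks (I/O, locks) is included in the reading.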

There's a GameDev article that discusses how to build profiling infrastructure in a C++ program. You may be able to adapt this approach to C#, provided the scope-exit cleanup (which C++ gets from stack objects' destructors) is reproduced deterministically rather than left for the garbage collector.
http://www.gamedev.net/reference/programming/features/enginuity3/
Even if you can't take the whole technique, there may be some useful ideas.
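One way the article's RAII-style scope timers could be approximated in C# is with using/IDisposable, so the bookkeeping runs on scope exit instead of waiting for the GC. A rough sketch under that assumption (all names are mine):

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

// Accumulates inclusive time per named scope. Dispose runs on scope exit,
// standing in for the C++ destructor the article relies on.
sealed class ScopeTimer : IDisposable
{
    public static readonly Dictionary<string, TimeSpan> Totals =
        new Dictionary<string, TimeSpan>();

    private readonly string _name;
    private readonly Stopwatch _sw = Stopwatch.StartNew();

    public ScopeTimer(string name) { _name = name; }

    public void Dispose()
    {
        _sw.Stop();
        Totals.TryGetValue(_name, out TimeSpan soFar);
        Totals[_name] = soFar + _sw.Elapsed;   // add this scope's elapsed time
    }
}

class Demo
{
    static void Main()
    {
        using (new ScopeTimer("Demo.Main"))
        {
            // ... code being profiled ...
        }
        Console.WriteLine(ScopeTimer.Totals["Demo.Main"]);
    }
}
```

The using block still allocates one small object per scope, so unlike the C++ version this is not entirely free of GC pressure.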

What I've done when I can't use my favorite technique is this. It's clumsy and gives low-resolution information, but it works. First, have a global stack of strings. This is in C, but you can adapt it to C#:
int nStack = 0;
char* stack[10000];
Then, on entry and exit to each routine you have source code for, push/pop the name of the routine:
void EveryFunction(){
int iStack = nStack++; stack[iStack] = "EveryFunction";
... code inside function
nStack = iStack; stack[iStack] = NULL;
}
So now stack[0..nStack] keeps a running call stack (minus the line numbers of where functions are called from), so it's not as good as a real call stack, but better than nothing.
Now you need a way to take snapshots of it at random or pseudo-random times. Have another global variable and a routine to look at it:
time_t timeToSnap;
void CheckForSnap(){
time_t now = time(NULL);
if (now >= timeToSnap){
if (now - timeToSnap > 10000) timeToSnap = now; // don't take snaps since 1970
timeToSnap += 1; // setup time for next snapshot
// print stack to snapshot file
}
}
Now, sprinkle calls to CheckForSnap throughout your code, especially in the low-level routines. When the run is finished, you have a file of stack samples. You can look at those for unexpected behavior. For example, any function showing up on a significant fraction of samples has inclusive time roughly equal to that fraction.
Like I said, this is better than nothing. It does have shortcomings:
It does not capture line-numbers where calls come from, so if you find a function with suspiciously large time, you need to rummage within it for the time-consuming code.
It adds significant overhead of its own, namely all the calls to time(NULL), so once you have removed all your big problems, it will be harder to find the small ones.
If your program spends significant time waiting for I/O or for user input, you will see a bunch of samples piled up after that I/O. If it's file I/O, that's useful information, but if it's user input, you will have to discard those samples, because all they say is that you take time.
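For what it's worth, the same technique can be sketched in C# with a List&lt;string&gt; and a Stopwatch instead of the C globals (names are mine; a multithreaded app would need a per-thread stack, e.g. [ThreadStatic]):

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

static class PoorMansSampler
{
    static readonly List<string> Stack = new List<string>();
    static readonly Stopwatch Clock = Stopwatch.StartNew();
    static long nextSnapMs;                       // time of the next snapshot

    public static readonly List<string> Snapshots = new List<string>();

    public static void Enter(string routine) => Stack.Add(routine);
    public static void Exit() => Stack.RemoveAt(Stack.Count - 1);

    // Sprinkle calls to this in low-level routines, as with CheckForSnap above.
    public static void CheckForSnap()
    {
        long now = Clock.ElapsedMilliseconds;
        if (now >= nextSnapMs)
        {
            nextSnapMs = now + 1000;                    // next snap in ~1 s
            Snapshots.Add(string.Join(" -> ", Stack));  // record the call stack
        }
    }
}

class Example
{
    static void Inner()
    {
        PoorMansSampler.Enter("Inner");
        PoorMansSampler.CheckForSnap();
        PoorMansSampler.Exit();
    }

    static void Main()
    {
        PoorMansSampler.Enter("Main");
        Inner();
        PoorMansSampler.Exit();
        foreach (string snap in PoorMansSampler.Snapshots)
            Console.WriteLine(snap);   // e.g. "Main -> Inner"
    }
}
```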
It is important to understand a few things:
Contrary to popular accepted wisdom, accuracy of time measurement (and thus a large number of samples) is not important. What is important is that samples occur during the time when you are waiting for the program to do its work.
Also contrary to accepted wisdom, you are not looking for a call graph, you don't need to care about recursion, you don't need to care about how many milliseconds any routine takes or how many times it is called, and you don't need to care about the distinction between inclusive and exclusive time, or the distinction between CPU and wall-clock time. What you do need to care about is, for any routine, what percent of time it is on the stack, because that is how much time it is responsible for, in the sense that if you could somehow make that routine take no time, that is how much your total time would decrease.

Related

how does a c# profiler work?

I'm curious how does a typical C# profiler work?
Are there special hooks in the virtual machine?
Is it easy to scan the byte code for function calls and inject calls to start/stop timer?
Or is it really hard and that's why people pay for tools to do this?
(As a side note, I find this a bit interesting because it's so rare - Google misses the boat completely: searching "how does a c# profiler work?" doesn't work at all - the results are about air conditioners...)
There is a free CLR Profiler by Microsoft, version 4.0.
https://www.microsoft.com/downloads/en/details.aspx?FamilyID=be2d842b-fdce-4600-8d32-a3cf74fda5e1
BTW, there's a nice section in the CLR Profiler doc that describes how it works in detail, starting on page 103. The source is included in the distribution.
Is it easy to scan the byte code for function calls and inject calls to start/stop timer?
Or is it really hard and that's why people pay for tools to do this?
Injecting calls is hard enough that tools are needed to do it.
Not only is it hard, it's a very indirect way to find bottlenecks.
The reason is that a bottleneck is one or a small number of statements in your code that are responsible for a good percentage of the time being spent - time that could be reduced significantly, i.e. time that is not truly necessary, i.e. wasteful.
IF you can tell the average inclusive time of one of your routines (including IO time), and IF you can multiply it by how many times it has been called, and divide by the total time, you can tell what percent of time the routine takes.
If the percent is small (like 10%) you probably have bigger problems elsewhere.
If the percent is larger (like 20% to 99%) you could have a bottleneck inside the routine.
So now you have to hunt inside the routine for it, looking at things it calls and how much time they take. Also you want to avoid being confused by recursion (the bugaboo of call graphs).
There are profilers (such as Zoom for Linux, Shark, & others) that work on a different principle.
The principle is that there is a function call stack, and during all the time a routine is responsible for (either doing work or waiting for other routines to do work that it requested) it is on the stack.
So if it is responsible for 50% of the time (say), then that's the amount of time it is on the stack,
regardless of how many times it was called, or how much time it took per call.
Not only is the routine on the stack, but the specific lines of code costing the time are also on the stack.
You don't need to hunt for them.
Another thing you don't need is precision of measurement.
If you took 10,000 stack samples, the guilty lines would be measured at 50 +/- 0.5 percent.
If you took 100 samples, they would be measured as 50 +/- 5 percent.
If you took 10 samples, they would be measured as 50 +/- 16 percent.
In every case you find them, and that is your goal.
(And recursion doesn't matter. All it means is that a given line can appear more than once in a given stack sample.)
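Those error bars are just the binomial standard error, 100 * sqrt(p * (1 - p) / n) percent for an observed fraction p over n samples. A quick check (helper name is mine):

```csharp
using System;

static class StackSampleError
{
    // Standard error (in percent) of an observed fraction p over n samples.
    public static double StdErrPercent(double p, int n) =>
        100.0 * Math.Sqrt(p * (1.0 - p) / n);

    static void Main()
    {
        foreach (int n in new[] { 10000, 100, 10 })
            Console.WriteLine($"{n,6} samples: 50 +/- {StdErrPercent(0.5, n):F1} %");
    }
}
```

For p = 0.5 this gives 0.5% at 10,000 samples, 5% at 100, and about 15.8% at 10, matching the figures quoted above.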
On this subject, there is lots of confusion. At any rate, the profilers that are most effective for finding bottlenecks are the ones that sample the stack, on wall-clock time, and report percent by line. (This is easy to see if certain myths about profiling are put in perspective.)
1) There's no such thing as "typical". People collect profile information by a variety of means: time sampling the PC, inspecting stack traces, capturing execution counts of methods/statements/compiled instructions, inserting probes in code to collect counts and optionally calling contexts to get profile data on a call-context basis. Each of these techniques might be implemented in different ways.
2) There's profiling "C#" and profiling "CLR". In the MS world, you could profile CLR and back-translate CLR instruction locations to C# code. I don't know if Mono uses the same CLR instruction set; if they did not, then you could not use the MS CLR profiler; you'd have to use a Mono IL profiler. Or, you could instrument C# source code to collect the profiling data, and then compile/run/collect that data on either MS, Mono, or somebody's C# compatible custom compiler, or C# running in embedded systems such as WinCE where space is precious and features like CLR-built-ins tend to get left out.
One way to instrument source code is to use source-to-source transformations, mapping the code from its initial state to code that contains data-collecting code as well as the original program. This paper on instrumenting code to collect test coverage data shows how a program transformation system can be used to insert test coverage probes - statements that set block-specific boolean flags when a block of code is executed. A counting profiler substitutes counter-incrementing instructions for those probes. A timing profiler inserts clock-snapshot/delta computations for those probes. Our C# Profiler implements both counting and timing profiling for C# source code; it also collects call-graph data by using more sophisticated probes that record the execution path, and can thus produce timing data on call graphs. This scheme works anywhere you can get your hands on a halfway-decent-resolution time value.
This is a link to a lengthy article that discusses both instrumentation and sampling methods:
http://smartbear.com/support/articles/aqtime/profiling/

c# : runtime code profiling - any existing libraries?

I have an application where the user can connect nodes together to perform realtime calculations.
I'd like to be able to show the user a CPU-usage percentage to show how much of available CPU-time is being used, and a per-node breakdown to be able to spot the problem-areas.
Are there any available open source implementations for a runtime profiler like this ?
I can write my own using System.Diagnostics.Process.TotalProcessorTime, stopwatches / performancecounters, but I'd rather go with something tried & tested that could maybe offer me more detailed information later if possible.
Edit:
I'm not looking for a stand-alone profiler since I want to show the realtime stats in the UI of my application.
You can try commercial GlowCode profiler that has such feature.
Or open source SlimTune, but it is still in beta.
I assume you are running off a timer, like say 10 calculations per second, because otherwise you are simply using 100% of the CPU (unless you're also doing I/O).
Can you set an alarm-clock interrupt to go off at some reasonable frequency, like 10 or 100 Hz, independent of whatever else the program is doing, and even especially during I/O or other blocked time?
Then for each block just keep a count of how many times out of the last 100 interrupts it was active. That's your percent, and the cost of acquiring it is minimal.
Do blocks call each other as subroutines? In that case, on each interrupt, you may want to capture the call stack among blocks, and a block is "active" if it is somewhere on the stack, and it is "crunching" if it is at the end of the stack (not in the process of calling another block, and not in I/O). Then you have a choice on each block of indicating the percent of time it is "crunching" (which will not exceed 100% when summed over blocks) or "active" (which probably will exceed 100% when summed over blocks).
The value of the latter number is it doesn't tell you so much "where" the time is spent, it tells you "why". That can answer questions like "I see that foo is taking a lot of time, but how did I get there?" Same for I/O. That's a process too, it just takes place on other hardware. You don't want to ignore it, because if you do you could end up saying "How come I'm only using a small fraction of the CPU? What's the holdup?"
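One possible sketch of the alarm-clock idea in C#, using a System.Threading.Timer that fires at 10 Hz independently of the worker threads and tallies whichever block is currently marked active (all names here are hypothetical, not from any library):

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;

// An "alarm clock" that fires 10 times a second regardless of what the
// worker threads are doing, and tallies whichever block is marked active.
static class BlockSampler
{
    public static volatile string ActiveBlock = "(idle)";

    static readonly ConcurrentDictionary<string, int> Hits =
        new ConcurrentDictionary<string, int>();
    static int ticks;

    // Keep a reference to the returned timer, or it may be collected.
    public static Timer Start() =>
        new Timer(state =>
        {
            Interlocked.Increment(ref ticks);
            Hits.AddOrUpdate(ActiveBlock, 1, (key, n) => n + 1);
        }, null, 0, 100);                       // 100 ms period => 10 Hz

    // Percent of sampled ticks in which `block` was the active one.
    public static double Percent(string block) =>
        ticks == 0 ? 0.0 : 100.0 * Hits.GetOrAdd(block, 0) / ticks;
}
```

Each calculation node would set BlockSampler.ActiveBlock on entry; the UI polls Percent per node. Extending this to "active anywhere on the stack" as described above would mean snapshotting a shared block stack in the callback instead of a single string.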

Visualization of the timestamps in a call stack

I'm trying to tune the performance of my application. And I'm curious what methods are taking the longest to process, and thus should be looked over for any optimization opportunities.
Are there any existing free tools that will help me visualize the call stack and how long each method is taking to complete? I'm thinking of something that displays the call stack as a stacked bar graph or a treemap, so it's easy to see that MethodA() took 10 seconds to complete because it called MethodB() and MethodC(), which took 3 and 7 seconds to complete.
Does something like this exist?
Yes, they are called performance profilers. Check out RedGate ANTS and JetBrains' dotTrace, not free but pretty cheap and way better than any free alternative I have seen.
Several profilers are available. I'm only familiar with one (the one included in Visual Studio Team Suite), but it's not free. I must say, it's been powerful/reliable enough that I've had no desire to try switching to another. The other one I've heard about is Red Gate's .NET profiler (also not free). Both of these have feature sets including, but definitely not limited to, what you're after. If this is for commercial application development, definitely check them out. :)
The way this is measured is by either instrumentation or sampling within a profiler.
Instrumentation
Instrumentation does the rough equivalent of rewriting all your code to do the following:
public void Foo()
{
//Code
}
into
public void Foo()
{
InformTheProfilingCodeThatFooStarted();
//Code
InformTheProfilingCodeThatFooEnded();
}
Since the profiler knows when everything starts and stops, it can maintain a stack of those start/stop events and supply this information later. Many profilers allow you to do this at a line level (by doing much the same thing, but instrumenting before each line as well).
This gets you 100% accurate information on the call graph in your application, but does so at a cost: it prevents inlining of methods and adds considerable overhead to each method call.
Sampling
An alternate approach is Sampling.
Instead of trying to get a 100% accurate call graph with less-than-accurate actual times, this approach works on the basis that, if it checks on a regular basis what is going on in your app, it can give you a good idea of how much time the app spends in various functions without actually having to spend much effort working this out. Most sampling profilers know how to parse the call stack when they interrupt the program, so they can still give a reasonable idea of what is calling which function and how much time it seems to take, but they will not be able to tell you whether this was (say) ten calls to Foo(), each making one call to Bar(), or one call to Foo() containing a single call to Bar() that just happened to last so long it was sampled ten times.
Both approaches have their pros and cons and solve different problems. In general the sampling method is the best one to start with, since it is less invasive and thus gives more accurate information on what is taking time - which is often the most important first question, before working out why.
I know of only one free sampling profiler for .NET code: the freely redistributable profiler that ships with the VS 2008 Team System release (but can be downloaded separately). The resulting output cannot easily be viewed with anything but the (very expensive) Team System edition of Visual Studio.
Red Gate ANTS does not support sampling (at this time), Jet Brains (dotTrace) and MS Visual Studio Team System have profilers that support both styles. Which you prefer on a cost benefit basis is a matter of opinion.
This is the method I use. If you have an IDE with a pause button it costs nothing and works very well.
What it tells you is roughly what % of wall-clock time is spent in each routine, and more precisely, in each statement. That is more important than the average duration of executing the routine or statement, because it automatically factors in the invocation count. By sampling wall-clock time it automatically includes CPU, IO, and other kinds of system time.
Even more importantly, if you look at the samples in which your routine is on the call stack, you can see not only what it is doing, but why. The reason that is important is that what you are really looking for is time being spent that could be replaced with something faster. Without the "why" information, you have to guess what that is.
BTW: This technique is little-known mainly because professors do not teach it (even if they know it) because they seldom have to work with monstrous software like we have in the real world, so they treat gprof as the foundation paradigm of profiling.
Here's an example of using it.
P.S. Expect the percents to add up to a lot more than 100%. The way to think about the percents is, if a statement or routine is on the stack X% of the time (as estimated from a small number of samples), that is roughly how much wall-clock time will shrink if the statement or routine can be made to take a lot less time.

Which of the following tasks don't always spend a consistent amount of time?

I am trying to make the loading part of a C# program faster. Currently it takes like 15 seconds to load up.
On first glimpse, things that are done during the loading part includes constructing many 3rd Party UI components, loading layout files, xmls, DLLs, resources files, reflections, waiting for WndProc... etc.
I used something really simple to see the time some part takes,
i.e. setting a breakpoint on a double that holds the total milliseconds of a TimeSpan, computed as the difference between a DateTime.Now taken at the start and one taken at the end.
Trying that a few times will give me something like:
11s 13s 12s 12s 7s 11s 12s 11s 7s 13s 7s... (usually 12s, but sometimes 7s)
If I add SuspendLayout and BeginUpdate liberally, call things via reflection once instead of many times, and reduce some redundant computation, the times are like:
3s 4s 3s 4s 3s 10s 4s 4s 3s 4s 10s 3s 10s... (usually 4s, but sometimes 10s)
In both cases, the times are not consistent but more like, a bimodal distribution? It really made me unsure whether my correction of the code is really making it faster.
So I would like to know what will cause such result.
Debug mode?
The "C# has to JIT-compile the code the first time it runs, but the following times will be faster" thing?
The waiting of WndProc message?
The reflections? PropertyInfo? Reflection.Assembly?
Loading files? XML? DLL? resource file?
UI Layouts?
(There are surely no internet/network/database access in that part)
Thanks.
Profiling by stopping in the debugger is not a reliable way to get timings, as you've discovered.
Profiling by writing times to a log works fine, although why do all this by hand when you can just launch the program in dotTrace? (Free trial, fully functional).
Another thing that works when you don't have access to a profiler is what I call the binary approach - look at what happens in the code and try to disable about half of it by using comments. Note the effect on the running time. If it appears significant, repeat this process with half of that half, and so on recursively until you narrow in on the most significant piece of work. The difficulty is in simulating the side effects of the missing code so that the remaining code can still work, so this is still harder than using a debugger, but it can be quicker than adding a lot of manual time logging, because the binary approach lets you zero in on the slowest place in logarithmic time.
Raymond Chen's advice is good here. When people ask him "How can I make my application start up faster?" he says "Do less stuff."
(And ALWAYS profile the release build - profiling the debug build is generally a wasted effort).
Profile it. You can use EQATEC; it's free.
Well, the best thing is to run your application through a profiler and see what the bottlenecks are. I've personally used dotTrace, there are plenty of others you can find on the web.
Debug mode turns off a lot of JIT optimizations, so apps will run a lot slower than release builds. Whatever the mode, JITting has to happen, so I'd discount that as a significant factor. Time to read files from disk can vary based on the OS's caching mechanism, and whether you're doing a cold start or a warm start.
If you have to use timers to profile, I'd suggest repeating the experiment a large number of times and taking the average.
Profiling you code is definitely the best way to identify which areas are taking the longest to run.
As for the other part of your question about the inconsistent timings: timings in a multitasking OS are inherently inconsistent, and working with managed code throws the garbage collector into the equation too. It could be that the GC is kicking in during your timing, which will obviously slow things down.
If you want to try to get a "purer" timing, force a GC collection before you start your timers; this makes it less likely that a collection will start inside your timed section. Remember to remove this afterwards - second-guessing when the GC should run normally results in poorer performance.
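A common way to push the GC to a quiescent state before timing looks like this (the timed section is a placeholder):

```csharp
using System;
using System.Diagnostics;

class CleanTiming
{
    static void Main()
    {
        // Force a full collection, let finalizers run, then collect again
        // so a GC is less likely to land inside the timed region.
        GC.Collect();
        GC.WaitForPendingFinalizers();
        GC.Collect();      // collect anything resurrected by finalizers

        var sw = Stopwatch.StartNew();
        // ... the code being timed ...
        sw.Stop();
        Console.WriteLine(sw.Elapsed);
    }
}
```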
Apart from the obvious (profiling), which will tell you precisely where time is being spent, there are some other points that spring to mind:
To get reasonable timing results with the approach you are using:
Run a release build of your program, and have it dump the timing results to a file (e.g. with Trace.WriteLine). Timing a debug build will give you spurious results.
Quit all other applications (including your debugger) when running the timing tests, to minimise the load on your computer and get more consistent results.
Run the program many times and look at the average timings.
Finally, bear in mind that Windows caches a lot of stuff, so the first run will be slow and subsequent runs will be much faster.
This will at least give you a more consistent basis for telling whether your improvements are making a significant difference.
Don't try and optimise code that shouldn't be run in the first place - Can you defer any of the init tasks? You may find that some of the work can simply be removed from the init sequence. e.g. if you are loading a data file, check whether it is needed immediately - if not, then you could load it the first time it is needed instead of during program startup.

Variation in execution time

I've been profiling a method using the stopwatch class, which is sub-millisecond accurate. The method runs thousands of times, on multiple threads.
I've discovered that most calls (90%+) take 0.1ms, which is acceptable. Occasionally, however, I find that the method takes several orders of magnitude longer, so that the average time for the call is actually more like 3-4ms.
What could be causing this?
The method itself is run from a delegate, and is essentially an event handler.
There are not many possible execution paths, and I've not yet discovered a path that would be conspicuously complicated.
I'm suspecting garbage collection, but I don't know how to detect whether it has occurred.
Finally, I am also considering whether the logging method itself is causing the problem. (The logger is basically a call to a static class + event listener that writes to the console.)
Just because Stopwatch has a high accuracy doesn't mean that other things can't get in the way - like the OS interrupting that thread to do something else. Garbage collection is another possibility. Writing to the console could easily cause delays like that.
Are you actually interested in individual call times, or is it overall performance that is important? It's generally more useful to run a method thousands of times and look at the total time - that's much more indicative of overall performance than individual calls, which can be affected by any number of things on the computer.
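A sketch of that run-it-thousands-of-times approach (the helper name is mine):

```csharp
using System;
using System.Diagnostics;

static class Bench
{
    // Mean wall-clock time per call, in microseconds. One outer Stopwatch
    // avoids the per-call overhead of starting/stopping a timer.
    public static double AverageMicroseconds(Action work, int iterations)
    {
        work();                                  // warm-up: exclude JIT cost
        var sw = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++) work();
        sw.Stop();
        return sw.Elapsed.TotalMilliseconds * 1000.0 / iterations;
    }

    static void Main()
    {
        double us = AverageMicroseconds(() => Math.Sqrt(12345.6789), 100_000);
        Console.WriteLine($"{us:F3} us/call on average");
    }
}
```

Averaging this way smooths out the occasional slow call caused by GC pauses, context switches, or console logging, which is exactly the variance described in the question.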
As I commented, you really should at least describe what your method does, if you're not willing to post some code (which would be best).
That said, one way you can tell if garbage collection has occurred (from Windows):
Run perfmon (Start->Run->perfmon)
Right-click on the graph; select "Add Counters..."
Under "Performance object", select ".NET CLR Memory"
From there you can select # Gen 0, 1, and 2 collections and click "Add"
Now on the graph you will see a graph of all .NET CLR garbage collections
Just keep this graph open while you run your application
EDIT: If you want to know if a collection occurred during a specific execution, why not do this?
int initialGen0Collections = GC.CollectionCount(0);
int initialGen1Collections = GC.CollectionCount(1);
int initialGen2Collections = GC.CollectionCount(2);
// run your method
if (GC.CollectionCount(0) > initialGen0Collections)
{
    // gen 0 collection occurred
}
if (GC.CollectionCount(1) > initialGen1Collections)
{
    // gen 1 collection occurred
}
if (GC.CollectionCount(2) > initialGen2Collections)
{
    // gen 2 collection occurred
}
SECOND EDIT: A couple of points on how to reduce garbage collections within your method:
You mentioned in a comment that your method adds the object passed in to "a big collection." Depending on the type you use for said big collection, it may be possible to reduce garbage collections. For instance, if you use a List<T>, then there are two possibilities:
a. If you know in advance how many objects you'll be processing, you should set the list's capacity upon construction:
List<T> bigCollection = new List<T>(numObjects);
b. If you don't know how many objects you'll be processing, consider using something like a LinkedList<T> instead of a List<T>. The reason for this is that a List<T> automatically resizes itself whenever a new item is added beyond its current capacity; this results in a leftover array that (eventually) will need to be garbage collected. A LinkedList<T> does not use an array internally (it uses LinkedListNode<T> objects), so it will not result in this garbage collection.
If you are creating objects within your method (i.e., somewhere in your method you have one or more lines like Thing myThing = new Thing();), consider using a resource pool to eliminate the need for constantly constructing objects and thereby allocating more heap memory. If you need to know more about resource pooling, check out the Wikipedia article on Object Pools and the MSDN documentation on the ConcurrentBag<T> class, which includes a sample implementation of an ObjectPool<T>.
That can depend on many things, and you really have to figure out which one you are dealing with.
I'm not terribly familiar with what triggers garbage collection and what thread it runs on, but that sounds like a possibility.
My first thought around this is with paging. If this is the first time the method runs and the application needs to page in some code to run the method, it would be waiting on that. Or, it could be the data that you're using within the method that triggered a cache miss and now you have to wait for that.
Maybe you're doing an allocation and the allocator did some extra reshuffling in order to get you the allocation you requested.
Not sure how thread time is calculated with Stopwatch, but a context switch might be what you're seeing.
Or...it could be something completely different...
Basically, it could be one of several things and you really have to look at the code itself to see what is causing your occasional slow-down.
It could well be GC. If you use a profiler application such as Red Gate's ANTS profiler, you can chart % time in GC alongside your application's performance to see what's going on.
In addition, you can use the CLRProfiler...
https://github.com/MicrosoftArchive/clrprofiler
Finally, Windows Performance Monitor will show the % time in GC for a given running application too.
These tools will help you get a holistic view of what's going on in your app as well as the OS in general.
I'm sure you know this stuff already, but microbenchmarking like this is sometimes useful for determining how fast one line of code is compared to an alternative you might write - you just generally want to profile your application under typical load too.
Knowing that a given line of code is 10 times faster than another is useful, but if that line of code is easier to read and not part of a tight loop then the 10x performance hit might not be a problem.
What you need is a performance profiler to tell you exactly what causes the slow-down. Here is a quick list, and of course here is the ANTS profiler.
Without knowing what your operation is doing, it sounds like it could be garbage collection. However, that might not be the only reason. If you are reading from or writing to the disk, it is possible your application has to wait while something else is using the disk.
Timing issues may also occur if you have a multi-threaded application and another thread occasionally takes some processor time - which would explain a slow-down that shows up only 10% of the time. This is why a profiler would help.
If you're only running the code "thousands" of times on a pretty quick function, the occasional longer time could easily be due to transient events on the system (maybe Windows decided it was time to cache something).
That being said, I would suggest the following:
Run the function many many more times, and take an average.
In the code that uses the function, determine if the function in question actually is a bottleneck. Use a profiler for this.
It can be dependent on your OS, environment, page reads, CPU ticks per second and so on.
The most realistic way is to run an execution path several thousand times and take the average.
However, if that logging class is only called occasionally and it logs to disk, that is quite likely to be a slow-down factor if it has to seek on the drive first.
A read of http://en.wikipedia.org/wiki/Profiling_%28computer_programming%29 may give you an insight into more techniques for determining slowdowns in your applications, while a list of profiling tools that you may find useful is at:
http://en.wikipedia.org/wiki/Visual_Studio_Team_System_Profiler (particularly relevant if you're doing C# stuff).
Hope that helps!