Can anyone recommend a LOP on Windows? Something similar to Linux's OProfile or OS X's Shark. It needs to meet these requirements:
It must be able to sample non-instrumented binaries.
It must be capable of resolving CLR stacks.
Preferably it supports delayed PDB resolution of symbols.
Its impact must be low enough to get a decent reading on live, production systems.
The Visual Studio Team Suite profiler is amazing. It's so good at its job that it makes me seem better at mine.
Redgate has a performance profiler and memory profiler which I haven't used.
Automated QA's AQTime has saved my butt. I used it to figure out a problem with a .NET web service calling some nasty old C code, and it did it well.
This is what I use. Although it is not suitable for live production use, it answers your other needs.
For live production use, you need something that samples the stack. In my opinion, it's OK if it has some small overhead. My goal is to discover the activities that need optimization, and for that I'm willing to pay a temporary price in speed.
There is always at least one interval of interest, like the interval between when a request is received and when the response goes out. It's surprising how few samples you need within such an interval to find out what's taking the time.
High timing precision is not needed. If some activity X is going on that, once optimized, would save you, say, 50% of the interval, then roughly that fraction of the samples will land in X and show it to you.
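To see why so few samples suffice, here is a minimal back-of-the-envelope sketch (my own arithmetic, not tied to any particular profiler) of the binomial error on a sample-based estimate:

using System;

// If activity X really occupies fraction p of the interval of interest, and you
// take n independent stack samples, the observed fraction of samples landing in X
// has a standard error of sqrt(p * (1 - p) / n).
double p = 0.5;
foreach (int n in new[] { 10, 100, 10_000 })
{
    double stdErr = Math.Sqrt(p * (1 - p) / n);
    Console.WriteLine($"{n,6} samples: {p:P0} +/- {stdErr:P1}");
}
// Even 10 samples pin a 50% activity down to roughly 50% +/- 16%,
// which is plenty to tell you where to look.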
I have a multithreaded program which consists of a C# interop layer over C++ code.
I am setting thread affinity (like in this post) and it works on part of my code, but on the second part it doesn't. Can the Intel Compiler / IPP / MKL libraries / inline assembly interfere with an externally set affinity?
UPDATE:
I can't post code, as it is a whole environment with many, many DLLs. I set the environment variables OMP_NUM_THREADS=1, MKL_NUM_THREADS=1, IPP_NUM_THREADS=1. When it runs in a single thread it runs OK, but when I use a number of C# threads and set affinity per thread (on a quad-core machine), initialization goes fine on separate cores, but during processing all the threads start using the same core. Hope I am clear enough.
Thanks.
We've had this exact problem; we'd set our thread affinity to what we wanted, and the IPP/MKL functions would blow that away! The answer to your question is 'yes'.
Auto Parallelism
The issue is that, by default, the Intel libraries like to automatically execute multi-threaded versions of their routines. So a single FFT gets computed by a number of threads set up by the library specifically for that purpose.
Intel's intent is that the programmer can get on with the job of writing a single-threaded application, and the library will allow that single thread to benefit from a multicore processor by creating a number of threads for the maths work. A noble intent (your source code then need know nothing about the runtime hardware to extract the best achievable performance - handy sometimes), but a right bloody nuisance when one is doing one's own threading for one's own reasons.
Controlling the Library's Behaviour
Take a look at these Intel docs, section Support Functions / Threading Support Functions. You can either programmatically control the library's threading tendencies, or there are some environment variables you can set (like MKL_NUM_THREADS) before your program runs. Setting the number of threads was (as far as I recall) enough to stop the library doing its own thing.
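A minimal sketch of what that can look like from the C# side (treat the DLL name, the export, and whether a late Environment.SetEnvironmentVariable call is still seen by the native runtime as assumptions to verify against your MKL/IPP versions; setting the variables in the launcher before the process starts is the safest route):

using System;
using System.Runtime.InteropServices;

static class IntelThreadingConfig
{
    // Hypothetical P/Invoke into MKL's threading support function; the exact
    // DLL name and export depend on how MKL is linked in your environment.
    [DllImport("mkl_rt", CallingConvention = CallingConvention.Cdecl)]
    static extern void mkl_set_num_threads(int numThreads);

    public static void ForceSingleThreaded()
    {
        // Must run before the native libraries create their worker pools.
        Environment.SetEnvironmentVariable("OMP_NUM_THREADS", "1");
        Environment.SetEnvironmentVariable("MKL_NUM_THREADS", "1");
        Environment.SetEnvironmentVariable("IPP_NUM_THREADS", "1");

        try { mkl_set_num_threads(1); }
        catch (DllNotFoundException) { /* not linked against mkl_rt; rely on the environment variables */ }
    }
}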
Philosophical Essay Inspired By Answering Your Question (best ignored)
More or less everything Intel is doing in CPU design and software (e.g. IPP/MKL) is aimed at making it unnecessary for the programmer to Worry About Threads. You want good math performance? Use MKL. You want that for loop to go fast? Turn on Auto Parallelisation in ICC. You want to make the best use of cache? That's what Hyperthreading is for.
It's not a bad approach, and personally speaking I think that they've done a pretty good job. AMD too. Their architectures are quite good at delivering good real world performance improvements to the "Average Programmer" for the minimal investment in learning, re-writing and code development.
Irritation
However, the thing that irritates me a little bit (though I don't want to appear ungrateful!) is that whilst this approach works for the majority of programmers out there (which is where the profitable market is), it just throws more obstacles in the way of those programmers who want to spin their own parallelism. I can't blame Intel for that of course, they've done exactly the right thing; they're a market led company, they need to make things that will sell.
By offering these easy features, the situation of there being too many under-skilled and under-trained programmers becomes more entrenched. If all programmers can get good performance without having to learn what auto parallelism is actually doing, then we'll never move on. The pool of really good programmers who actually know that stuff will remain really small.
Problem
I see this as a problem (though only a small one, as I'll explain later). Computing needs to become more efficient for both economic and environmental reasons. Intel's approach allows for increased performance, and better silicon manufacturing techniques produce lower power consumption, but I always feel like it's not quite as efficient as it could be.
Example
Take the Cell processor at the heart of the PS3. It's something that I like to witter on about endlessly! However, IBM developed it with a completely different philosophy from Intel's. They gave you no cache (just some fast static RAM instead, to use as you saw fit), the architecture was pretty much pure NUMA, you had to do all your own parallelisation, etc. etc. The result was that if you really knew what you were doing you could get about 250 GFLOPS out of the thing (I think the non-PS3 variants went to 320 GFLOPS), for 80 watts, all the way back in 2005.
It's taken Intel chips another 6 or 7 years or so for a single device to get to that level of performance. That's a lot of Moore's law growth. If the Cell were manufactured on Intel's latest silicon fab and given as many transistors as Intel puts into their big Xeons, it would still blow everything else away.
No Market
However, apart from the PS3, Cell was a non-starter as a market proposition. IBM decided that it would never be a big enough seller to be worth their while. There just weren't enough programmers out there who could really use it, and indulging the few of us who could would have made no commercial sense, which wouldn't have pleased the shareholders.
Small Problem, Bigger Problem
I said earlier that this was only a small problem. Well, most of the world's computing isn't about high maths performance any more; it's Facebook, Twitter, etc. That sort of work is all about I/O performance, and for that you don't need high maths performance. So in that sense the dependence on Intel Doing Everything For You so that the average programmer can get good maths performance matters very little. There's just not enough maths being done to warrant a change in design philosophy.
In fact, I strongly suspect that the world will ultimately decide that you don't need a large chip at all; an ARM should do just fine. If that does come to pass, then the market for Intel's very large chips with very good general-purpose maths compute performance will vanish. Effectively, those of us who want good maths performance are being heavily subsidised by those who want to fill enormous data centres with Intel-based hardware and put Intel PCs on every desktop.
We're simply lucky that Intel apparently has a desire to make sure that every big CPU they build is good at maths regardless of whether most of their users actually use that maths performance. I'm sure that desire has its foundations in marketing prowess and wanting the bragging rights, but those are not hard, commercially tangible artifacts that bring shareholder value.
So if those data centre guys decide that, actually, they'd rather save electricity and fill their data centres with ARMs, where does that leave Intel? ARMs are fine devices for the purpose for which they're intended, but they're not at the top of my list of Supercomputer chips. So where does that leave us?
Trend
My take on the current market trend is that 'Workstations' (PCs as we call them now) are going to start costing lots and lots of money, just like they did in the 1980s / early 90s.
I think that better supercomputers will become unaffordable because no one can spare the $10 billion it would take to do the next big chip. If people stop having PCs, there won't be a mass market for large all-out GPUs, so we won't even be able to use those instead. They're an exclusive thing, but supercomputers do play a vital role in our world and we do need them to get better. So who is going to pay for that? Not me, that's for sure.
Oops, that went on for quite a while...
I'm curious: how does a typical C# profiler work?
Are there special hooks in the virtual machine?
Is it easy to scan the byte code for function calls and inject calls to start/stop timer?
Or is it really hard and that's why people pay for tools to do this?
(As a side note, I find this a bit interesting because it's so rare - Google misses the boat completely here. The search "how does a c# profiler work?" doesn't work at all - the results are about air conditioners...)
There is a free CLR Profiler by Microsoft, version 4.0.
https://www.microsoft.com/downloads/en/details.aspx?FamilyID=be2d842b-fdce-4600-8d32-a3cf74fda5e1
BTW, there's a nice section in the CLR Profiler doc that describes in detail how it works (page 103). Source is included as part of the distribution.
Is it easy to scan the byte code for function calls and inject calls to start/stop timer?
Or is it really hard and that's why people pay for tools to do this?
Injecting calls is hard enough that tools are needed to do it.
Not only is it hard, it's a very indirect way to find bottlenecks.
The reason is that a bottleneck is one statement, or a small number of statements, in your code that are responsible for a good percentage of the time being spent, time that could be reduced significantly, i.e. that is not truly necessary, i.e. that is wasteful.
IF you can tell the average inclusive time of one of your routines (including IO time), and IF you can multiply it by the number of times it has been called, and divide by the total time, then you can tell what percent of the time the routine takes.
If the percent is small (like 10%) you probably have bigger problems elsewhere.
If the percent is larger (like 20% to 99%) you could have a bottleneck inside the routine.
So now you have to hunt inside the routine for it, looking at things it calls and how much time they take. Also you want to avoid being confused by recursion (the bugaboo of call graphs).
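To make that inclusive-time arithmetic concrete, here is a minimal sketch with made-up numbers (not the output of any particular tool):

using System;

double avgInclusiveMs = 2.5;    // hypothetical average inclusive time per call, including IO
long   callCount      = 40_000; // hypothetical number of calls
double totalRunMs     = 200_000;

double percent = avgInclusiveMs * callCount / totalRunMs * 100.0;
Console.WriteLine($"Routine accounts for roughly {percent:F1}% of total time"); // ~50%, worth hunting inside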
There are profilers (such as Zoom for Linux, Shark, & others) that work on a different principle.
The principle is that there is a function call stack, and during all the time a routine is responsible for (either doing work or waiting for other routines to do work that it requested) it is on the stack.
So if it is responsible for 50% of the time (say), then that's the amount of time it is on the stack,
regardless of how many times it was called, or how much time it took per call.
Not only is the routine on the stack, but the specific lines of code costing the time are also on the stack.
You don't need to hunt for them.
Another thing you don't need is precision of measurement.
If you took 10,000 stack samples, the guilty lines would be measured at 50 +/- 0.5 percent.
If you took 100 samples, they would be measured as 50 +/- 5 percent.
If you took 10 samples, they would be measured as 50 +/- 16 percent.
In every case you find them, and that is your goal.
(And recursion doesn't matter. All it means is that a given line can appear more than once in a given stack sample.)
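To make that counting rule concrete, here is a minimal sketch (hypothetical samples, not any particular profiler's format) that scores each line of code by the fraction of stack samples it appears in, counting a line at most once per sample so that recursion cannot inflate it:

using System;
using System.Collections.Generic;
using System.Linq;

// Each sample is the list of source lines on the call stack at that instant.
var samples = new List<string[]>
{
    new[] { "Main:12", "Parse:88", "Lookup:40" },
    new[] { "Main:12", "Parse:88", "Parse:88", "Lookup:40" }, // recursion: Parse:88 appears twice
    new[] { "Main:12", "Render:7" },
};

var percentByLine = samples
    .SelectMany(s => s.Distinct())   // count each line at most once per sample
    .GroupBy(line => line)
    .ToDictionary(g => g.Key, g => 100.0 * g.Count() / samples.Count);

foreach (var kv in percentByLine.OrderByDescending(kv => kv.Value))
    Console.WriteLine($"{kv.Key}: on the stack in {kv.Value:F0}% of samples");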
On this subject, there is lots of confusion. At any rate, the profilers that are most effective for finding bottlenecks are the ones that sample the stack, on wall-clock time, and report percent by line. (This is easy to see if certain myths about profiling are put in perspective.)
1) There's no such thing as "typical". People collect profile information by a variety of means: time sampling the PC, inspecting stack traces, capturing execution counts of methods/statements/compiled instructions, inserting probes in code to collect counts and optionally calling contexts to get profile data on a call-context basis. Each of these techniques might be implemented in different ways.
2) There's profiling "C#" and profiling "CLR". In the MS world, you could profile CLR and back-translate CLR instruction locations to C# code. I don't know if Mono uses the same CLR instruction set; if they did not, then you could not use the MS CLR profiler; you'd have to use a Mono IL profiler. Or, you could instrument C# source code to collect the profiling data, and then compile/run/collect that data on either MS, Mono, or somebody's C# compatible custom compiler, or C# running in embedded systems such as WinCE where space is precious and features like CLR-built-ins tend to get left out.
One way to instrument source code is to use source-to-source transformations, mapping the code from its initial state to code that contains data-collecting code as well as the original program. This paper on instrumenting code to collect test coverage data shows how a program transformation system can be used to insert test coverage probes by inserting statements that set block-specific boolean flags when a block of code is executed. A counting profiler substitutes counter-incrementing instructions for those probes. A timing profiler inserts clock-snapshot/delta computations for those probes. Our C# Profiler implements both counting and timing profiling for C# source code in both of these ways; it also collects call graph data by using more sophisticated probes that record the execution path, so it can produce timing data on call graphs as well. This scheme works anywhere you can get your hands on a time value of halfway decent resolution.
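As an illustration of what such a transformation might emit (hand-written here; a real tool's probe names and bookkeeping will differ), a combined counting-and-timing probe around a method looks roughly like this:

using System;
using System.Diagnostics;

static class Probes   // hypothetical runtime support the transformed code would link against
{
    public static long FooCount;
    public static long FooElapsedTicks;
}

class Example
{
    public void Foo()
    {
        Probes.FooCount++;                      // counting profiler: bump a per-block counter
        long start = Stopwatch.GetTimestamp();  // timing profiler: snapshot the clock
        try
        {
            // ... original body of Foo ...
        }
        finally
        {
            Probes.FooElapsedTicks += Stopwatch.GetTimestamp() - start;  // clock delta
        }
    }
}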
This is a link to a lengthy article that discusses both instrumentation and sampling methods:
http://smartbear.com/support/articles/aqtime/profiling/
I'm searching for software to do benchmarking and analysis of code and application performance.
Something like Intel VTune.
Can anyone give me some names, free or paid, targeting C# apps?
Thanks
Redgate Performance Profiler
...their Memory Profiler will undoubtedly help as well.
It sounds like you have multiple purposes.
If you want to monitor the performance health of programs, benchmarks and measuring profilers are probably what you want.
If you need to make a program faster, and you have the source code for it, I think your best bet is to get something that samples the call stack and gives you line-level percent of wall-clock time. For this, don't look for high precision of timing. Look instead for high precision of pinpointing code that is responsible for high percentages of time. Especially in larger programs, these lines are function or method calls that you can find a way to avoid. Don't fall into the myth of thinking that the only problems are hotspots where the program counter lives, and don't fall into the myth that all I/O is necessary I/O.
There's a totally manual method that a few people use and works very well.
Visual Studio, in its higher SKUs, includes multiple profilers which can be used for this purpose. If your application uses multiple threads, VS 2010's new Concurrency Profiler is incredible for profiling concurrent C# code.
There are many other performance profilers on the market which are useful as well. I personally like dotTrace and think it has a very clean interface.
Such a tool is usually called a Profiler.
The (free) MS CLR Profiler (up to Fx 3.5) isn't bad, and a good place to start.
Update: a 2022 blog post about this.
I've used the EQATEC profiler which is free for personal use. It benchmarks run time and helped point out the inefficiencies that were causing bottlenecks in my code.
I'm trying to tune the performance of my application. And I'm curious what methods are taking the longest to process, and thus should be looked over for any optimization opportunities.
Are there any existing free tools that will help me visualize the call stack and how long each method is taking to complete? I'm thinking of something that displays the call stack as a stacked bar graph or a treemap, so it's easy to see that MethodA() took 10 seconds to complete because it called MethodB() and MethodC(), which took 3 and 7 seconds to complete.
Does something like this exist?
Yes, they are called performance profilers. Check out RedGate ANTS and JetBrains' dotTrace, not free but pretty cheap and way better than any free alternative I have seen.
Several profilers are available. I'm only familiar with one (the one included in Visual Studio Team Suite), but it's not free. I must say, it's been powerful/reliable enough that I've had no desire to try switching to another. The other one I've heard about is Red Gate's .NET profiler (also not free). Both of these have feature sets including, but definitely not limited to, what you're after. If this is for commercial application development, definitely check them out. :)
The way this is measured is by either instrumentation or sampling within a profiler.
Instrumentation
Instrumentation does the rough equivalent of rewriting all your code to do the following:
public void Foo()
{
//Code
}
into
public void Foo()
{
InformTheProfilingCodeThatFooStarted();
//Code
InformTheProfilingCodeThatFooEnded();
}
Since the profiler knows when everything starts and stops, it can maintain a stack of these start/stop events and supply the information later. Many profilers allow you to do this at the line level (by doing much the same thing, but instrumenting even more, before each line).
This gets you 100% accurate information on the 'call graph' in your application, but does so at a cost: it prevents inlining of methods and adds considerable overhead to each method call.
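A rough sketch of what those injected calls might do on the profiler's side (my own illustration, not any specific product: a per-thread stack of start timestamps so that nested calls pair up correctly):

using System;
using System.Collections.Generic;
using System.Diagnostics;

static class ProfilerRuntime
{
    [ThreadStatic] static Stack<(string Name, long Start)> _frames;          // one frame stack per thread
    static readonly Dictionary<string, long> _inclusiveTicks = new Dictionary<string, long>();

    public static void Enter(string method)   // what InformTheProfilingCodeThatFooStarted() would call
    {
        (_frames ??= new Stack<(string, long)>()).Push((method, Stopwatch.GetTimestamp()));
    }

    public static void Exit()                 // what InformTheProfilingCodeThatFooEnded() would call
    {
        var (name, start) = _frames.Pop();
        long elapsed = Stopwatch.GetTimestamp() - start;
        lock (_inclusiveTicks)
        {
            _inclusiveTicks.TryGetValue(name, out long total);
            _inclusiveTicks[name] = total + elapsed;   // accumulate inclusive time per method
        }
    }
}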
Sampling
An alternate approach is Sampling.
Instead of trying to get a 100% accurate call graph at the cost of less-than-accurate actual times, this approach works on the basis that, if it checks on a regular basis what is going on in your app, it can give you a good idea of how much time the app spends in various functions without having to spend much effort working this out exactly. Most sampling profilers know how to 'parse' the call stack when they interrupt the program, so they can still give a reasonable idea of what is calling which function and how much time that seems to take, but they will not be able to tell you whether this was (say) 10 calls to Foo(), each making one call to Bar(), or one call to Foo() containing one call to Bar() that just happened to last so long it was sampled 10 times.
Both approaches have their pros and cons and solve different problems. In general the sampling method is the best one to start with, since it is less invasive and thus should give more accurate information on what is taking time, which is often the most important first question, before working out why.
I know of only one free sampling profiler for .NET code, which is the free redistributable profiler linked with the VS 2008 Team System release (but which can be downloaded separately). The resulting output cannot be easily viewed with anything but the (very expensive) Team System edition of Visual Studio.
Red Gate ANTS does not support sampling (at this time), Jet Brains (dotTrace) and MS Visual Studio Team System have profilers that support both styles. Which you prefer on a cost benefit basis is a matter of opinion.
This is the method I use. If you have an IDE with a pause button it costs nothing and works very well.
What it tells you is roughly what % of wall-clock time is spent in each routine, and more precisely, in each statement. That is more important than the average duration of executing the routine or statement, because it automatically factors in the invocation count. By sampling wall-clock time it automatically includes CPU, IO, and other kinds of system time.
Even more importantly, if you look at the samples in which your routine is on the call stack, you can see not only what it is doing, but why. The reason that is important is that what you are really looking for is time being spent that could be replaced with something faster. Without the "why" information, you have to guess what that is.
BTW: This technique is little-known mainly because professors do not teach it (even if they know it); they seldom have to work with monstrous software like we have in the real world, so they treat gprof as the foundational paradigm of profiling.
Here's an example of using it.
P.S. Expect the percents to add up to a lot more than 100%. The way to think about the percents is, if a statement or routine is on the stack X% of the time (as estimated from a small number of samples), that is roughly how much wall-clock time will shrink if the statement or routine can be made to take a lot less time.
I have a C# fat-client application running on Citrix/Terminal Server.
How to measure actual memory usage per-session?
What can I do to reduce the memory usage in total?
We are still on .NET 1.1. Does it make a difference if we upgrade our .NET runtime?
It's very difficult to get metrics on actual memory usage in .NET. About the closest approximation you can get is on a per-object basis by calling Marshal.SizeOf(). My understanding of that method is that it essentially measures the size of a marshaled (unmanaged) representation of the object; the in-memory footprint may be close to that, but it's not exact. Still, it's a reasonable estimate.
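For what it's worth, a minimal sketch of that per-object estimate (the SessionRecord type is a made-up example; Marshal.SizeOf reports the marshaled size of a type with a defined layout and throws for arbitrary managed objects, so treat the numbers as rough):

using System;
using System.Runtime.InteropServices;

[StructLayout(LayoutKind.Sequential)]
struct SessionRecord    // hypothetical per-session data
{
    public int Id;
    public long Flags;
    public double LastActivity;
}

class SizeEstimate
{
    static void Main()
    {
        // Marshaled size of one record; the managed in-memory footprint may differ slightly.
        Console.WriteLine(Marshal.SizeOf(typeof(SessionRecord)));

        // A complementary rough measure: managed heap growth around an allocation.
        long before = GC.GetTotalMemory(true);
        var records = new SessionRecord[100000];
        long after = GC.GetTotalMemory(true);
        Console.WriteLine("~" + (after - before) + " bytes for the array");
        GC.KeepAlive(records);
    }
}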
You may also want to investigate the SMS APIs under .NET. They provide ways to query various memory statistics from the operating system about your process (or other processes). It's the same library that is used by "perfmon". You may be able to use that to inspect your process programmatically.
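One simple way to poll per-process memory figures programmatically (a sketch using System.Diagnostics, which surfaces numbers comparable to what perfmon shows; adjust to whichever API you settle on):

using System;
using System.Diagnostics;

class MemoryProbe
{
    static void Main()
    {
        Process p = Process.GetCurrentProcess();
        Console.WriteLine("Working set:   " + (p.WorkingSet64 / 1024) + " KB");
        Console.WriteLine("Private bytes: " + (p.PrivateMemorySize64 / 1024) + " KB");
        Console.WriteLine("Managed heap:  " + (GC.GetTotalMemory(false) / 1024) + " KB");
    }
}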
Also, you'll want to invest in a good profiling tool for .NET. I've evaluated ANTS and dotTrace. They are both very good. I preferred dotTrace for its simplicity. Most profilers have very non-intuitive interfaces. That's something I've just come to expect. dotTrace is actually quite good, by those standards. ANTS I think is probably more advanced (not sure, just my opinion from the brief eval I did on both of them).
Here is an article from MSDN on measuring application performance. It includes doing measurements on memory usage.
I don't think upgrading the .NET runtime will have a significant effect on the application as it is; you will probably be better off optimising your application in other ways. Try to dispose of resources when you no longer need them.
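For instance, a minimal illustration of that dispose advice (my own example, using the classic using block, which works on .NET 1.1 as well):

using System;
using System.IO;

class ReportWriter
{
    static void WriteReport(string path, string text)
    {
        // Release the file handle deterministically as soon as we're done,
        // instead of waiting for the garbage collector to finalize the writer.
        using (StreamWriter writer = new StreamWriter(path))
        {
            writer.WriteLine(text);
        }
    }
}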