I'm looking for a way to benchmark method calls in C#.
I have coded a data structure for a university assignment and just came up with a way to optimize it a bit, but in a way that would add a little overhead in all situations while turning an O(n) call into O(1) in some.
Now I want to run both versions against the test data to see if the optimization is worth implementing. I know that in Ruby you can wrap the code in a Benchmark block and have it output the time needed to execute the block to the console - is there something like that available for C#?
Stolen (and modified) from Yuriy's answer:
private static void Benchmark(Action act, int iterations)
{
    GC.Collect();
    act.Invoke(); // run once outside of loop to avoid initialization costs
    Stopwatch sw = Stopwatch.StartNew();
    for (int i = 0; i < iterations; i++)
    {
        act.Invoke();
    }
    sw.Stop();
    // floating-point division, so fast methods don't truncate to 0 ms
    Console.WriteLine(((double)sw.ElapsedMilliseconds / iterations).ToString());
}
Often a particular method has to initialize some things, and you don't always want to include those initialization costs in your overall benchmark. Also, you want to divide the total execution time by the number of iterations, so that your estimate is more-or-less independent of the number of iterations.
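For example (the data structure and method names here are hypothetical, purely to show the call):

Benchmark(() => structureV1.Lookup(42), 1000000);
Benchmark(() => structureV2.Lookup(42), 1000000);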
Here are some things I've found by trial and error.
Discard the first batch of iterations (thousands of them). They will most likely be affected by the JITter.
Running the benchmark on a separate Thread object can give better and more stable results. I don't know why.
I've seen some people use Thread.Sleep, for whatever reason, before executing the benchmark. This will only make things worse. I don't know why; possibly it is due to the JITter.
Never run the benchmark with debugging enabled. The code will most likely run orders of magnitude slower.
Compile your application with all optimizations enabled. Some code can be drastically affected by optimization, while other code will not be, so compiling without optimization will affect the reliability of your benchmark.
When compiling with optimizations enabled, it is sometimes necessary to somehow evaluate the output of the benchmark (e.g. print a value, etc). Otherwise the compiler may 'figure out' some computations are useless and will simply not perform them.
Invocation of delegates can have noticeable overhead when performing certain benchmarks. It is better to put more than one iteration inside the delegate, so that the overhead has little effect on the result of the benchmark (the sketch after this list does this).
Profilers can have their own overhead. They're good at telling you which parts of your code are bottlenecks, but they're not good at actually benchmarking two different things reliably.
In general, fancy benchmarking solutions can have noticeable overhead. For example, if you want to benchmark many objects using one interface, it may be tempting to wrap every object in a class. However, remember that the class constructor also has overhead that must be taken into account. It is better to keep everything as simple and direct as possible.
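Putting several of these tips together (discarding a warm-up batch, running on a separate thread, batching iterations inside the delegate, and consuming the result so the optimizer can't drop the work), here is a minimal sketch. The names (BenchRunner, BenchmarkOnThread) and the batch sizes are my own choices, not from any answer above:

using System;
using System.Diagnostics;
using System.Threading;

static class BenchRunner
{
    // Runs the batched action on a dedicated thread, discarding warm-up
    // batches (JIT noise) and averaging over the timed batches.
    static double BenchmarkOnThread(Action batch, int warmupBatches, int timedBatches)
    {
        double msPerBatch = 0;
        Thread t = new Thread(() =>
        {
            for (int i = 0; i < warmupBatches; i++)
                batch(); // discarded: dominated by the JITter
            GC.Collect();
            Stopwatch sw = Stopwatch.StartNew();
            for (int i = 0; i < timedBatches; i++)
                batch();
            sw.Stop();
            msPerBatch = (double)sw.ElapsedMilliseconds / timedBatches;
        });
        t.Start();
        t.Join();
        return msPerBatch;
    }

    static void Main()
    {
        long sink = 0; // printed below so the work can't be optimized away
        Action batch = () =>
        {
            for (int i = 0; i < 100000; i++) // many iterations per delegate call
                sink += i * i;
        };
        Console.WriteLine(BenchmarkOnThread(batch, 10, 100) + " ms/batch (sink=" + sink + ")");
    }
}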
I stole most of the following from Jon Skeet's method for benchmarking:
private static void Benchmark(Action act, int interval)
{
    GC.Collect();
    Stopwatch sw = Stopwatch.StartNew();
    for (int i = 0; i < interval; i++)
    {
        act.Invoke();
    }
    sw.Stop();
    Console.WriteLine(sw.ElapsedMilliseconds);
}
You could use the built-in Stopwatch class, which "provides a set of methods and properties that you can use to accurately measure elapsed time", if you are looking for a manual way to do it. Not sure about an automated way, though.
Sounds like you want a profiler. I would strongly recommend the EQATEC profiler myself, it being the best free one I've tried. The nice thing about this method over a simple stopwatch one is that it also provides a breakdown of performance over certain methods/blocks.
Profilers give the most complete picture, since they diagnose all your code; however, they slow it down a lot. Profilers are best used for finding bottlenecks.
For optimizing an algorithm, when you know where the bottlenecks are, use a dictionary of name --> Stopwatch to keep track of the performance-critical sections at run-time, as sketched below.
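A minimal sketch of that idea; the class and method names (SectionTimers, Time, Report) are my own:

using System;
using System.Collections.Generic;
using System.Diagnostics;

static class SectionTimers
{
    // Not thread-safe; fine for timing a single-threaded run.
    static readonly Dictionary<string, Stopwatch> timers =
        new Dictionary<string, Stopwatch>();

    // Accumulates elapsed time under a section name across calls.
    public static void Time(string name, Action section)
    {
        Stopwatch sw;
        if (!timers.TryGetValue(name, out sw))
        {
            sw = new Stopwatch();
            timers[name] = sw;
        }
        sw.Start();
        try { section(); }
        finally { sw.Stop(); }
    }

    public static void Report()
    {
        foreach (KeyValuePair<string, Stopwatch> kvp in timers)
            Console.WriteLine(kvp.Key + ": " + kvp.Value.ElapsedMilliseconds + " ms");
    }
}

Usage would look like SectionTimers.Time("rebalance", () => tree.Rebalance()); where tree.Rebalance() stands in for whatever critical section you care about.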
Suppose I have two methods, Foo and Bar, that do roughly the same thing, and I want to measure which one is faster. Also, a single execution of either Foo or Bar is too fast to measure reliably.
Normally, I'd simply run them both a huge number of times like this:
var sw = new Stopwatch();
sw.Start();
for (int ii = 0; ii < HugeNumber; ++ii)
    Foo();
sw.Stop();
Console.WriteLine("Foo: " + sw.ElapsedMilliseconds);
// and the same code for Bar
But in this way, every run of Foo after the first will probably be working from the processor cache, not actual memory, which is probably much faster than in a real application. What can I do to ensure that my method runs cold every time?
Clarification
By "roughly the same thing" I mean the both methods are used in the same way, but actual algorithm may differ significantly. For example, Foo might be doing some tricky math, while Bar skips it by using more memory.
And yes, I understand that methods running cold will not have much effect on overall performance. I'm still interested in which one is faster.
First of all, if Foo is working with the processor cache, then Bar will also be working with the processor cache, so both of your functions get the same privilege. Now suppose the first run of Foo takes time A, and each subsequent run takes an average time B because it is working with the processor cache. The total time will be
A + B*(hugenumber-1)
Similarly for Bar it will be
C + D*(hugenumber-1) // where C is the first runtime and D is the avg runtime using the processor cache
If I am not wrong, the result depends on B and D, and both of them are average runtimes using the processor cache. So if you want to work out which of your functions is better, I think the processor cache is not a problem, since both functions are supposed to use it.
Edited:
I think now it's clear. As Bar skips some tricky maths by using more memory, it will have a slight (maybe nano/picosecond) advantage. To cancel that out, you have to flush your CPU cache inside your for loop; since both loops then do the same extra work, you will get a better idea of which function is better. There is already a Stack Overflow discussion on how to flush the CPU cache; please visit that link.
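A minimal sketch of that idea: walk a buffer much larger than the CPU cache between timed calls to evict the working set. Both the buffer size and the 64-byte line stride are assumptions to adjust for your hardware, and note that the flush adds the same constant cost to both loops:

static readonly byte[] evictionBuffer = new byte[64 * 1024 * 1024]; // bigger than a typical L3 cache

// Touch one byte per cache line so previously cached data gets evicted.
static long FlushCpuCache()
{
    long sum = 0;
    for (int i = 0; i < evictionBuffer.Length; i += 64)
        sum += evictionBuffer[i];
    return sum; // returned so the reads can't be optimized away
}

// inside the timing loop:
// for (int ii = 0; ii < HugeNumber; ++ii) { FlushCpuCache(); Foo(); }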
hope it helps.
But assuming Foo and Bar are similar enough, any cache speedup (or any other environmental factor) should affect both equally. So even though you might not be getting an accurate absolute measure of cold performance, you should still observe a relative difference between the algorithms if one exists.
Also remember that if these functions are called in the inner loop of your system (otherwise why would you care so much about their performance?), in the real world they're likely to be kept in the cache anyway, so by using your code you're likely to get a decent approximation of real-world performance.
Which is better, performance-wise, when you potentially have an empty list?
if (_myList != null && _myList.Count > 0)
{
    foreach (Thing t in _myList)
    {
        ...
or, without checking whether _myList contains anything:
if (_myList != null)
{
    foreach (Thing t in _myList)
    {
        ...
I'm guessing there's probably not much in it, but that the first one is slightly quicker (?)
Thanks
edit:
To clarify, I mean a list like this: List<Thing>
There is only one way to answer a performance question:
Measure
Measure
Measure
The only way you can know if:
I can improve the code
I have improved the code
And most importantly: What code to improve first
is to measure how much time different parts of the program are taking, and then improve the top items first.
To answer your question, I would assume that the minuscule overhead of a few extra objects is indeed going to cost you some cycles compared to just calling Count (assuming that is a fast call, a field read for instance).
However, since you're asking this question, it tells me that you don't have enough information about the state of your program and your code, so the chance that improving that minuscule overhead will have a noticeable effect for your users is so slim I wouldn't bother.
I can guarantee you have bigger fish to fry performance-wise, so tackle those first.
Personally, I don't use null references except when dealing with databases or in a few lines of code to signal "not initialized yet"; other than that I use empty lists, strings, etc. Your code is much easier to read and understand, and the benefit of micro-optimization at this level will never be noticed.
Unless you are calling your code in a tight loop, the difference will be insignificant. However, be advised that there is a difference: the check for _myList.Count > 0 avoids the call to GetEnumerator, the creation of an object implementing IEnumerator (a heap allocation, though only when the list is accessed through an interface such as IEnumerable<T>; List<T> itself exposes a struct enumerator) and a call to that enumerator's MoveNext() method.
If you are in a tight spot performance-wise, the avoided heap allocation and virtual method calls might help, but in general your code is shorter and easier to understand if you avoid the explicit check on _myList.Count.
Compulsory disclaimer: you should have already identified this as a problem area via profiling before attempting to "optimise it", and hence you'll already have the tools at hand to determine quickly and easily which method is faster. Odds are, neither will make an appreciable difference to your application's performance.
But that being said, checking Count for System.Collections.Generic.List<> will almost certainly be quicker.
Although optimisation improves things greatly (don't be scared of using foreach! it's nearly free), foreach more or less involves:
var enumerator = _myList.GetEnumerator();
try
{
    while (enumerator.MoveNext())
    {
        // loop body goes here
    }
}
finally
{
    enumerator.Dispose();
}
which is a lot more complicated than merely comparing a simple property (safe assumption that List.Count is a simple property) with a constant.
During a recent code review a colleague suggested that, for a class with 4 int properties, assigning each to zero in the constructor would result in a performance penalty.
For example,
public Example()
{
    this.major = 0;
    this.minor = 0;
    this.revision = 0;
    this.build = 0;
}
His point was that this is redundant as they will be set to zero by default and you are introducing overhead by essentially performing the same task twice. My point was that the performance hit would be negligible if one existed at all and this is more readable (there are several constructors) as the intention of the state of the object after calling this constructor is very clear.
What do you think? Is there a performance gain worth caring about here?
No, there is not. The compiler will optimize out these operations; the same task will not be performed twice. Your colleague is wrong.
[Edit based upon input from the always-excellent Jon Skeet]
The compiler SHOULD optimize out the operations, but apparently they are not completely optimized out; however, the gain from omitting them is negligible, and the benefit of having the assignments be explicit is real. Your colleague may not be completely wrong, but they're focusing on a trivial optimization.
I don't believe they're the same operation, and there is a performance difference. Here's a microbenchmark to show it:
using System;
using System.Diagnostics;

class With
{
    int x;

    public With()
    {
        x = 0;
    }
}

class Without
{
    int x;

    public Without()
    {
    }
}

class Test
{
    static void Main(string[] args)
    {
        int iterations = int.Parse(args[0]);
        Stopwatch sw = Stopwatch.StartNew();
        if (args[1] == "with")
        {
            for (int i = 0; i < iterations; i++)
            {
                new With();
            }
        }
        else
        {
            for (int i = 0; i < iterations; i++)
            {
                new Without();
            }
        }
        sw.Stop();
        Console.WriteLine(sw.ElapsedMilliseconds);
    }
}
Results:
c:\Users\Jon\Test>test 1000000000 with
8427
c:\Users\Jon\Test>test 1000000000 without
881
c:\Users\Jon\Test>test 1000000000 with
7568
c:\Users\Jon\Test>test 1000000000 without
819
Now, would that make me change the code? Absolutely not. Write the most readable code first. If it's more readable with the assignment, keep the assignment there. Even though a microbenchmark shows it has a cost, that's still a small cost in the context of doing any real work. Even though the proportional difference is high, it's still creating a billion instances in 8 seconds in the "slow" route. My guess is that there's actually some sort of optimization for completely-empty constructors chaining directly to the completely empty object() constructor. The difference between assigning to two fields and only assigning to one field is much smaller.
As for why the compiler can't simply optimize it out: bear in mind that a base constructor could be modifying the value via reflection, or perhaps a virtual method call. The compiler could potentially notice those cases, but it seems a strange optimization.
My understanding is that an object's memory is cleared to zero with a simple and very fast memory wipe; these explicit assignments, however, will take additional IL. Indeed, some tools will spot you assigning the default value (in a field initialiser) and advise against it.
So I would say: don't do this - it is potentially marginally slower, though not by much. In short, I think your colleague is correct.
Unfortunately I'm on a mobile device right now, without the right tools to prove it.
You should focus on code clarity, that is the most important thing. If performance becomes an issue, then measure performance, and see what your bottlenecks are, and improve them. It's not worth it to spend so much time worrying about performance when ease of understanding code is more important.
You can initialize them as fields directly:
public int number = 0;
which is also clear.
The more important question is: is there really a readability gain? If the people maintaining the code already know that ints default to zero, this is just more code they have to parse. Perhaps the code would be cleaner without lines that do nothing.
In fact, I'd keep the assignments in the constructor, purely for readability and to mark the 'I didn't forget to initialize those' intention. Relying on default behavior tends to confuse other developers.
I don't think you should care about the performance hit; there are usually many other places where a program can be optimized. On the other hand, I don't see any gain from assigning these values in the constructor, since they are going to be set to 0 anyway.
I've been poking around mscorlib to see how the generic collection optimized their enumerators and I stumbled on this:
// in List<T>.Enumerator
public bool MoveNext()
{
    List<T> list = this.list;
    if ((this.version == list._version) && (this.index < list._size))
    {
        this.current = list._items[this.index];
        this.index++;
        return true;
    }
    return this.MoveNextRare();
}
The stack size is 3, and the size of the bytecode should be 80 bytes. The naming of the MoveNextRare method got me on my toes, and it contains an error case as well as an empty collection case, so obviously this is breaching separation of concerns.
I assume the MoveNext method is split this way to optimize stack space and help the JIT, and I'd like to do the same for some of my perf bottlenecks, but without hard data, I don't want my voodoo programming turning into cargo-cult ;)
Thanks!
Florian
If you're going to think about ways in which List<T>.Enumerator is "odd" for the sake of performance, consider this first: it's a mutable struct. Feel free to recoil with horror; I know I do.
Ultimately, I wouldn't start mimicking optimisations from the BCL without benchmarking/profiling what difference they make in your specific application. It may well be appropriate for the BCL but not for you; don't forget that the BCL goes through the whole NGEN-alike service on install. The only way to find out what's appropriate for your application is to measure it.
You say you want to try the same kind of thing for your performance bottlenecks: that suggests you already know the bottlenecks, which suggests you've got some sort of measurement in place. So, try this optimisation and measure it, then see whether the gain in performance is worth the pain of readability/maintenance which goes with it.
There's nothing cargo-culty about trying something and measuring it, then making decisions based on that evidence.
Separating it into two functions has some advantages (see the sketch after this list):
If the method were to be inlined, only the fast path would be inlined and the error handling would still be a function call. This prevents inlining from costing too much extra space. But 80 bytes of IL is probably still above the threshold for inlining (it was once documented as 32 bytes, don't know if it's changed since .NET 2.0).
Even if it isn't inlined, the function will be smaller and fit within the CPU's instruction cache more easily, and since the slow path is separate, it won't have to be fetched into cache every time the fast path is.
It may help the CPU branch predictor optimize for the more common path (returning true).
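To illustrate the pattern (this is my own toy type, not the BCL's code): keep the hot method's body small and push the rare case into its own method:

using System;

public sealed class RingCursor
{
    private readonly int[] items;
    private int index;

    public RingCursor(int[] items)
    {
        this.items = items;
    }

    public int Current { get; private set; }

    // Fast path: small body, more likely to be inlined and to stay
    // resident in the CPU's instruction cache.
    public bool MoveNext()
    {
        if (index < items.Length)
        {
            Current = items[index];
            index++;
            return true;
        }
        return MoveNextRare();
    }

    // Slow path: taken at most once per enumeration, kept out of the hot method.
    private bool MoveNextRare()
    {
        index = items.Length + 1; // mark the cursor as finished
        Current = default(int);
        return false;
    }
}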
I think that MoveNextRare is always going to return false, but by structuring it like this it becomes a tail call, and if it's private and can only be called from here then the JIT could theoretically build a custom calling convention between these two methods that consists of just a jmp instruction with no prologue and no duplication of epilogue.
Given an interface
public interface IValueProvider
{
object GetValue(int index);
}
and a tree structure of instances of IValueProvider similar to a math expression tree.
I want to measure the time that is spent in the GetValue method of each node at runtime without an external profiler.
GetValue could do anything that I don't know at design time: collecting values from other IValueProviders, running an IronPython expression, or even calling an external plugin. I want to present statistics about the node timings to the user.
For this I can create a proxy class that wraps an IValueProvider:
public class ValueProviderProfiler : IValueProvider
{
    private IValueProvider valueProvider;

    public ValueProviderProfiler(IValueProvider valueProvider)
    {
        this.valueProvider = valueProvider;
    }

    public object GetValue(int index)
    {
        // ... start measuring
        try
        {
            return this.valueProvider.GetValue(index);
        }
        finally
        {
            // ... stop measuring
        }
    }
}
What is the best way to measure the time spent in a node, without distortion caused by external processes, with good accuracy, and with respect to the fact that nodes are evaluated in parallel?
Just using the Stopwatch class won't work, and looking at the process' processor time doesn't account for the fact that CPU time could have been consumed by another node.
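For what it's worth, here is one sketch of the measuring part, filling in the proxy above with Stopwatch.GetTimestamp and Interlocked so that parallel evaluation doesn't corrupt the counters. Note that this records wall-clock time, not CPU time, so it does not by itself solve the distortion problem the question asks about:

using System;
using System.Diagnostics;
using System.Threading;

public class ValueProviderProfiler : IValueProvider
{
    private readonly IValueProvider valueProvider;
    private long totalTicks; // accumulated Stopwatch ticks across all calls
    private long callCount;

    public ValueProviderProfiler(IValueProvider valueProvider)
    {
        this.valueProvider = valueProvider;
    }

    public object GetValue(int index)
    {
        long start = Stopwatch.GetTimestamp();
        try
        {
            return valueProvider.GetValue(index);
        }
        finally
        {
            // Interlocked keeps the counters consistent under parallel evaluation.
            Interlocked.Add(ref totalTicks, Stopwatch.GetTimestamp() - start);
            Interlocked.Increment(ref callCount);
        }
    }

    public TimeSpan TotalTime
    {
        get { return TimeSpan.FromSeconds((double)Interlocked.Read(ref totalTicks) / Stopwatch.Frequency); }
    }

    public long Calls
    {
        get { return Interlocked.Read(ref callCount); }
    }
}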
If you're trying to analyze performance, instead of starting with a given method, get an actual profiler like ANTS Profiler and see where the real bottlenecks are. Many times, when you assume you know why your application isn't performant, you end up looking at and optimizing all of the wrong places and just waste a lot of time.
You don't say how quickly you expect each GetValue call to finish, so it's hard to give any definite advice...
For things that take some number of milliseconds (disk accesses, filling in controls, network transfers, etc.) I've used DateTime.Now.Ticks. It seems to work reasonably well, and the claimed resolution of 10,000,000 ticks per second sounds pretty good. (I doubt it's really that precise, though; I don't know what facility it is backed by.)
I'm not aware of any way of avoiding distortions introduced by execution of other processes. I usually just take the mean time spent running each particular section I'm interested in, averaged out over as many runs as possible (to smooth out variations caused by other processes and any timer inaccuracies).
(In native code, for profiling things that don't take long to execute, I use the CPU cycle counter via the RDTSC instruction. So if you're timing stuff that is over too soon for other timers to get a useful reading, but doesn't finish so quickly that call overhead is an issue, and you don't mind getting your readings in CPU cycles rather than any standard time units, it could be worth writing a little native function that returns the cycle counter value in a UInt64. I haven't needed to do this in managed code myself though...)