While performance testing, I noticed something interesting.
I noticed that the very first insertion into a LinkedList<T> (C# generics) is much slower than any other insertion at the head of the list. I simply used the generic LinkedList<T> and called AddFirst() for each insertion. Why is the very first insertion the slowest?
First Five Insertion Results:
First insertion into list: 0.0152 milliseconds
Second insertion into list(at head): 0.0006 milliseconds
Third insertion into list(at head): 0.0003 milliseconds
Fourth insertion into list(at head): 0.0006 milliseconds
Fifth insertion into list(at head): 0.0006 milliseconds
Performance Testing Code:
using (StreamReader readText = new StreamReader("MillionNumbers.txt"))
{
String line;
Int32 counter = 0;
while ((line = readText.ReadLine()) != null)
{
watchTime.Start();
theList.AddFirst(line);
watchTime.Stop();
Double time = watchTime.Elapsed.TotalMilliseconds;
totalTime = totalTime + time;
Console.WriteLine(time);
watchTime.Reset();
++counter;
}
Console.WriteLine(totalTime);
Console.WriteLine(counter);
Console.WriteLine(totalTime / counter);
}
Timing a single operation is very dangerous - the slightest stutter can make a huge difference in results. Additionally, it's not clear that you've done anything with LinkedList<T> before this code, which means you'd be timing the JITting of AddFirst and possibly even whole other types involved.
Timing just the first insert is rather difficult as once you've done it, you can't easily repeat it. However, you can time "insert and remove" repeatedly, as this code does:
using System;
using System.Collections.Generic;
using System.Diagnostics;
class Program
{
public static void Main(string[] args)
{
// Make sure we've JITted the LinkedList code
new LinkedList<string>().AddFirst("ignored");
LinkedList<string> list = new LinkedList<string>();
TimeInsert(list);
list.AddFirst("x");
TimeInsert(list);
list.AddFirst("x");
TimeInsert(list);
list.AddFirst("x");
}
const int Iterations = 100000000;
static void TimeInsert(LinkedList<string> list)
{
GC.Collect();
GC.WaitForPendingFinalizers();
Stopwatch sw = Stopwatch.StartNew();
for (int i = 0; i < Iterations; i++)
{
list.AddFirst("item");
list.RemoveFirst();
}
sw.Stop();
Console.WriteLine("Initial size: {0}; Ticks: {1}",
list.Count, sw.ElapsedTicks);
}
}
My results:
Initial size: 0; Ticks: 5589583
Initial size: 1; Ticks: 8137963
Initial size: 2; Ticks: 8399579
This is what I'd expect, as depending on the internal representation there's very slightly more work to do in terms of hooking up the "previous head" when adding and removing to an already-populated list.
My guess is you're seeing JIT time, but your code doesn't time accurately enough to be useful, IMO.
Recently, I needed to choose between using SortedDictionary and SortedList, and settled on SortedList.
However, now I discovered that my C# program is slowing to a crawl when performing SortedList.Count, which I check using a function/method called thousands of times.
Usually my program would call the function 10,000 times within 35 ms, but while using SortedList.Count, it slowed to 300-400 ms, basically 10x slower.
I also tried SortedList.Keys.Count, but this reduced my performance another 10 times, to over 3000 ms.
I have only ~5000 keys/objects in SortedList<DateTime, object_name>.
I can easily and instantly retrieve data from my sorted list by SortedList[date] (in 35 ms), so I haven't found any problem with the list structure or the objects it's holding.
Is this performance normal?
What else can I use to obtain the number of records in the list, or just to check that the list is populated?
(besides adding a separate tracking flag, which I may do for now)
CORRECTION:
Sorry, I'm actually using:
ConcurrentDictionary<string, SortedList<DateTime, string>> dict_list = new ConcurrentDictionary<string, SortedList<DateTime, string>>();
And I had various counts in various places, sometimes checking items in the list and other times in the ConcurrentDictionary. So the issue applies to ConcurrentDictionary, and I wrote quick test code to confirm this, which takes 350 ms without using concurrency.
Here is the test with ConcurrentDictionary, showing 350 ms:
public static void CountTest()
{
//Create test ConcurrentDictionary
ConcurrentDictionary<int, string> test_dict = new ConcurrentDictionary<int, string>();
for (int i = 0; i < 50000; i++)
{
test_dict[i] = "ABC";
}
//Access .Count property 10,000 times
int tick_count = Environment.TickCount;
for (int i = 1; i <= 10000; i++)
{
int dict_count = test_dict.Count;
}
Console.WriteLine(string.Format("Time: {0} ms", Environment.TickCount - tick_count));
Console.ReadKey();
}
this article recommends calling this instead:
dictionary.Skip(0).Count()
The count could be invalid as soon as the call from the method returns. If you want to write the count to a log for tracing purposes, for example, you can use alternative methods, such as the lock-free enumerator
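A minimal sketch of the enumerator-based count the quote refers to, written as an explicit loop so it does not depend on how LINQ's Skip and Count happen to be implemented in a given runtime. The helper name ApproximateCount is made up for illustration, and the result is only an estimate if other threads are writing at the same time.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;

public static class ConcurrentCountSketch
{
    // Hypothetical helper: counts entries by walking the dictionary's enumerator,
    // which is documented as safe to use alongside concurrent reads and writes
    // but is not a moment-in-time snapshot.
    public static int ApproximateCount<TKey, TValue>(ConcurrentDictionary<TKey, TValue> dict)
    {
        int count = 0;
        foreach (KeyValuePair<TKey, TValue> pair in dict)
        {
            count++;
        }
        return count;
    }

    public static void Main()
    {
        var dict = new ConcurrentDictionary<int, string>();
        for (int i = 0; i < 50000; i++)
        {
            dict[i] = "ABC";
        }
        Console.WriteLine(ApproximateCount(dict));
    }
}
```

Note that this walks every entry, so it trades lock acquisition for O(n) work per call; whether that is a win depends on the dictionary size and on how often the count is needed.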
Well, ConcurrentDictionary<TKey, TValue> must work properly with many threads at once, so it needs some synchronization overhead.
Source code for Count property: https://referencesource.microsoft.com/#mscorlib/system/Collections/Concurrent/ConcurrentDictionary.cs,40c23c8c36011417
public int Count
{
[SuppressMessage("Microsoft.Concurrency", "CA8001", Justification = "ConcurrencyCop just doesn't know about these locks")]
get
{
int count = 0;
int acquiredLocks = 0;
try
{
// Acquire all locks
AcquireAllLocks(ref acquiredLocks);
// Compute the count, we allow overflow
for (int i = 0; i < m_tables.m_countPerLock.Length; i++)
{
count += m_tables.m_countPerLock[i];
}
}
finally
{
// Release locks that have been acquired earlier
ReleaseLocks(0, acquiredLocks);
}
return count;
}
}
Looks like you need to refactor your existing code. Since you didn't provide any code, we can't tell you what you could optimize.
For performance-sensitive code I would not recommend using the ConcurrentDictionary.Count property, as its implementation takes locks. You could maintain the count yourself with Interlocked.Increment instead.
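One way to keep the count yourself, as a rough sketch: CountedDictionary is a hypothetical wrapper (not a framework type) that bumps an int with Interlocked whenever an add or remove actually succeeds. Under concurrent writes the counter can briefly lag behind the dictionary's contents, so treat it as an estimate, much like Count itself.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;

// Hypothetical wrapper: keeps its own item count so that reading the count
// is a single volatile read instead of acquiring every internal lock.
public sealed class CountedDictionary<TKey, TValue>
{
    private readonly ConcurrentDictionary<TKey, TValue> _inner =
        new ConcurrentDictionary<TKey, TValue>();
    private int _count;

    public int Count => Volatile.Read(ref _count);

    public bool TryAdd(TKey key, TValue value)
    {
        if (!_inner.TryAdd(key, value)) return false;
        Interlocked.Increment(ref _count); // only count successful adds
        return true;
    }

    public bool TryRemove(TKey key, out TValue value)
    {
        if (!_inner.TryRemove(key, out value)) return false;
        Interlocked.Decrement(ref _count); // only count successful removes
        return true;
    }

    public bool TryGetValue(TKey key, out TValue value) => _inner.TryGetValue(key, out value);
}

public static class CountedDictionaryDemo
{
    public static void Main()
    {
        var dict = new CountedDictionary<int, string>();
        for (int i = 0; i < 50000; i++) dict.TryAdd(i, "ABC");
        dict.TryRemove(0, out _);
        Console.WriteLine(dict.Count); // 49999
    }
}
```

The wrapper only exposes the operations it can count reliably; indexer-style upserts would need AddOrUpdate-based handling that is left out here.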
In C#, suppose I have a foreach loop and it is possible that the iterator will be empty. Following the loop, more actions need to be taken only if the iterator was not empty. So I declare bool res = false; before the loop. Is it faster to just set res = true; in each loop iteration, or to test if it's been done yet, as in if (!res) res = true;. I suppose the question could more succinctly be stated as "is it faster to set a bool's value or test its value?"
In addition, even if one is slightly faster than the other, is it feasible to have so many iterations in the loop that the impact on performance is not negligible?
To kill a few minutes:
static void Main(string[] args)
{
bool test = false;
Stopwatch sw = new Stopwatch();
sw.Start();
for (long i = 0; i < 100000000; i++)
{
if (!test)
test = true;
}
sw.Stop();
Console.WriteLine(sw.ElapsedMilliseconds + ". Hi, I'm just using test somehow:" + test);
sw.Reset();
bool test2 = false;
sw.Start();
for (long i = 0; i < 100000000; i++)
{
test2 = true;
}
sw.Stop();
Console.WriteLine(sw.ElapsedMilliseconds + ". Hi, I'm just using test2 somehow:" + test2);
Console.ReadKey();
}
Output:
448
379
So, unless I missed something, just setting the value is faster than checking and then setting it. Is that what you wanted to test?
EDIT:
Fixed an error pointed out in the comments. As a side note, I ran this test a few times, and even when the milliseconds changed, the second case was always slightly faster.
if (!res) res = true is redundant.
The compiler should be smart enough to know that res will always end up being true and remove your if statement and/or completely remove the set altogether if you compile with Release / Optimize Code.
To your question itself: it should be faster to set a primitive value than to compare and then set. I highly doubt you would be able to measure the time difference accurately on a primitive, and just thinking about this has consumed more time than the process will in x exaggerated iterations.
Given this simple piece of code and an array of 10 million random numbers:
static int Main(string[] args)
{
int size = 10000000;
int num = 10; //increase num to reduce number of buckets
int numOfBuckets = size/num;
int[] ar = new int[size];
Random r = new Random(); //initialize with random numbers
for (int i = 0; i < size; i++)
ar[i] = r.Next(size);
var s = new Stopwatch();
s.Start();
var group = ar.GroupBy(i => i / num);
var l = group.Count();
s.Stop();
Console.WriteLine(s.ElapsedMilliseconds);
Console.ReadLine();
return 0;
}
I did some performance tests on grouping: with 10k buckets the execution time is about 0.7s, with 100k buckets it is 2s, and with 1m buckets it is 7.5s.
I wonder why that is. I imagine that if GroupBy is implemented using a hash table, there might be a problem with collisions. For example, the hash table may initially be prepared for, say, 1000 groups, and as the number of groups grows it needs to increase its size and rehash. If this were the case, I could write my own grouping that initializes the hash table with the expected number of buckets. I did that, but it was only slightly faster.
So my question is: why does the number of buckets influence GroupBy performance that much?
EDIT:
Running in release mode changes the results to 0.55s, 1.6s, 6.5s respectively.
I also replaced the group.ToArray call with the code below, just to force execution of the grouping:
foreach (var g in group)
array[g.Key] = 1;
where array is initialized before timer with appropriate size, the results stayed almost the same.
EDIT2:
You can see the working code from mellamokb in here pastebin.com/tJUYUhGL
I'm pretty certain this is showing the effects of memory locality (various levels of caching) and also object allocation.
To verify this, I took three steps:
Improve the benchmarking to avoid unnecessary parts and to garbage collect between tests
Remove the LINQ part by populating a Dictionary (which is effectively what GroupBy does behind the scenes)
Remove even Dictionary<,> and show the same trend for plain arrays.
In order to show this for arrays, I needed to increase the input size, but it does show the same kind of growth.
Here's a short but complete program which can be used to test both the dictionary and the array side - just flip which line is commented out in the middle:
using System;
using System.Collections.Generic;
using System.Diagnostics;
class Test
{
const int Size = 100000000;
const int Iterations = 3;
static void Main()
{
int[] input = new int[Size];
// Use the same seed for repeatability
var rng = new Random(0);
for (int i = 0; i < Size; i++)
{
input[i] = rng.Next(Size);
}
// Switch to PopulateArray to change which method is tested
Func<int[], int, TimeSpan> test = PopulateDictionary;
for (int buckets = 10; buckets <= Size; buckets *= 10)
{
TimeSpan total = TimeSpan.Zero;
for (int i = 0; i < Iterations; i++)
{
// Switch which line is commented to change the test
// total += PopulateDictionary(input, buckets);
total += PopulateArray(input, buckets);
GC.Collect();
GC.WaitForPendingFinalizers();
}
Console.WriteLine("{0,9}: {1,7}ms", buckets, (long) total.TotalMilliseconds);
}
}
static TimeSpan PopulateDictionary(int[] input, int buckets)
{
int divisor = input.Length / buckets;
var dictionary = new Dictionary<int, int>(buckets);
var stopwatch = Stopwatch.StartNew();
foreach (var item in input)
{
int key = item / divisor;
int count;
dictionary.TryGetValue(key, out count);
count++;
dictionary[key] = count;
}
stopwatch.Stop();
return stopwatch.Elapsed;
}
static TimeSpan PopulateArray(int[] input, int buckets)
{
int[] output = new int[buckets];
int divisor = input.Length / buckets;
var stopwatch = Stopwatch.StartNew();
foreach (var item in input)
{
int key = item / divisor;
output[key]++;
}
stopwatch.Stop();
return stopwatch.Elapsed;
}
}
Results on my machine:
PopulateDictionary:
10: 10500ms
100: 10556ms
1000: 10557ms
10000: 11303ms
100000: 15262ms
1000000: 54037ms
10000000: 64236ms // Why is this slower? See later.
100000000: 56753ms
PopulateArray:
10: 1298ms
100: 1287ms
1000: 1290ms
10000: 1286ms
100000: 1357ms
1000000: 2717ms
10000000: 5940ms
100000000: 7870ms
An earlier version of PopulateDictionary used an Int32Holder class, and created one for each bucket (when the lookup in the dictionary failed). This was faster when there was a small number of buckets (presumably because we were only going through the dictionary lookup path once per iteration instead of twice) but got significantly slower, and ended up running out of memory. This would contribute to fragmented memory access as well, of course. Note that PopulateDictionary specifies the capacity to start with, to avoid effects of data copying within the test.
The aim of using the PopulateArray method is to remove as much framework code as possible, leaving less to the imagination. I haven't yet tried using an array of a custom struct (with various different struct sizes) but that may be something you'd like to try too.
EDIT: I can reproduce the oddity of the slower result for 10000000 than 100000000 at will, regardless of test ordering. I don't understand why yet. It may well be specific to the exact processor and cache I'm using...
--EDIT--
The reason why 10000000 is slower than the 100000000 results has to do with the way hashing works. A few more tests explain this.
First off, let's look at the operations. There's Dictionary.FindEntry, which is used in the [] indexing and in Dictionary.TryGetValue, and there's Dictionary.Insert, which is used in the [] indexing and in Dictionary.Add. If we just do a FindEntry, the timings go up as expected:
static TimeSpan PopulateDictionary1(int[] input, int buckets)
{
int divisor = input.Length / buckets;
var dictionary = new Dictionary<int, int>(buckets);
var stopwatch = Stopwatch.StartNew();
foreach (var item in input)
{
int key = item / divisor;
int count;
dictionary.TryGetValue(key, out count);
}
stopwatch.Stop();
return stopwatch.Elapsed;
}
This implementation doesn't have to deal with hash collisions (because there are none), which makes the behavior as we expect it. Once we start dealing with collisions, the timings start to drop. If we have as many buckets as elements, there are obviously fewer collisions... To be exact, we can figure out exactly how many collisions there are by doing:
static TimeSpan PopulateDictionary(int[] input, int buckets)
{
int divisor = input.Length / buckets;
int c1, c2;
c1 = c2 = 0;
var dictionary = new Dictionary<int, int>(buckets);
var stopwatch = Stopwatch.StartNew();
foreach (var item in input)
{
int key = item / divisor;
int count;
if (!dictionary.TryGetValue(key, out count))
{
dictionary.Add(key, 1);
++c1;
}
else
{
count++;
dictionary[key] = count;
++c2;
}
}
stopwatch.Stop();
Console.WriteLine("{0}:{1}", c1, c2);
return stopwatch.Elapsed;
}
The result is something like this:
10:99999990
10: 4683ms
100:99999900
100: 4946ms
1000:99999000
1000: 4732ms
10000:99990000
10000: 4964ms
100000:99900000
100000: 7033ms
1000000:99000000
1000000: 22038ms
9999538:90000462 <<-
10000000: 26104ms
63196841:36803159 <<-
100000000: 25045ms
Note the value of '36803159'. This answers the question of why the last result is faster than the one before it: it simply has to do fewer operations, and since caching fails anyway, that factor doesn't make a difference anymore.
10k the estimated execution time is 0.7s, for 100k buckets it is 2s, for 1m buckets it is 7.5s.
This is an important pattern to recognize when you profile code. It is one of the standard size vs execution time relationships in software algorithms. Just from seeing the behavior, you can tell a lot about the way the algorithm was implemented. And the other way around of course, from the algorithm you can predict the expected execution time. A relationship that's annotated in the Big Oh notation.
Speediest code you can get is amortized O(1), execution time barely increases when you double the size of the problem. The Dictionary<> class behaves that way, as John demonstrated. The increases in time as the problem set gets large is the "amortized" part. A side-effect of Dictionary having to perform linear O(n) searches in buckets that keep getting bigger.
A very common pattern is O(n). That tells you that there is a single for() loop in the algorithm that iterates over the collection. O(n^2) tells you there are two nested for() loops. O(n^3) has three, etcetera.
What you got is the one in between, O(log n). It is the standard complexity of a divide-and-conquer algorithm. In other words, each pass splits the problem in two, continuing with the smaller set. This is very common; you see it in sorting algorithms, and binary search is the example you'll find in your textbook. Note how log₂(10) ≈ 3.3, very close to the increment you see in your test. Perf starts to tank a bit for very large sets due to poor locality of reference, a CPU cache problem that's always associated with O(log n) algorithms.
The one thing that John's answer demonstrates is that his guess cannot be correct, GroupBy() certainly does not use a Dictionary<>. And it is not possible by design, Dictionary<> cannot provide an ordered collection. Where GroupBy() must be ordered, it says so in the MSDN Library:
The IGrouping objects are yielded in an order based on the order of the elements in source that produced the first key of each IGrouping. Elements in a grouping are yielded in the order they appear in source.
Not having to maintain order is what makes Dictionary<> fast. Keeping order always costs O(log n), as with the binary tree in your textbook.
Long story short, if you don't actually care about order, and you surely would not for random numbers, then you don't want to use GroupBy(). You want to use a Dictionary<>.
There are (at least) two influence factors: First, a hash table lookup only takes O(1) if you have a perfect hash function, which does not exist. Thus, you have hash collisions.
More important, I guess, are caching effects. Modern CPUs have large caches, so for the smaller bucket counts the hash table itself might fit into the cache. As the hash table is accessed frequently, this might have a strong influence on performance. If there are more buckets, more accesses to RAM might be necessary, and those are slow compared to a cache hit.
There are a few factors at work here.
Hashes and groupings
The way grouping works is by creating a hash table. Each individual group then supports an 'add' operation, which adds an element to the group's list. To put it bluntly, it's like a Dictionary<Key, List<Value>>.
Hash tables are always overallocated. If you add an element to the hash, it checks if there is enough capacity, and if not, recreates the hash table with a larger capacity (To be exact: new capacity = count * 2 with count the number of groups). However, a larger capacity means that the bucket index is no longer correct, which means you have to re-build the entries in the hash table. The Resize() method in Lookup<Key, Value> does this.
The 'groups' themselves work like a List<T>. These too are overallocated, but are easier to reallocate. To be precise: the data is simply copied (with Array.Copy in Array.Resize) and a new element is added. Since there's no re-hashing or calculation involved, this is quite a fast operation.
The initial capacity of a grouping is 7. This means, for 10 elements you need to reallocate 1 time, for 100 elements 4 times, for 1000 elements 8 times, and so on. Because you have to re-hash more elements each time, your code gets a bit slower each time the number of buckets grows.
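The reallocation counts above can be reproduced with a few lines of arithmetic. This is only a sketch of the model described in this answer (start at 7, double whenever the capacity is too small); the helper name CountResizes is made up, and the framework's real growth policy may differ in detail.

```csharp
using System;

public static class ResizeCountSketch
{
    // Counts how many times a capacity that starts at initialCapacity and
    // doubles when too small must grow before it can hold itemCount items.
    // Assumption: pure doubling, as described in the text above.
    public static int CountResizes(int itemCount, int initialCapacity = 7)
    {
        int capacity = initialCapacity;
        int resizes = 0;
        while (capacity < itemCount)
        {
            capacity *= 2;
            resizes++;
        }
        return resizes;
    }

    public static void Main()
    {
        foreach (int n in new[] { 10, 100, 1000 })
        {
            Console.WriteLine("{0} elements: {1} resizes", n, CountResizes(n));
        }
        // Prints 1, 4 and 8 resizes respectively.
    }
}
```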
I think these overallocations are the largest contributors to the small growth in the timings as the number of buckets grows. The easiest way to test this theory is to do no overallocations at all (test 1), and simply put counters in an array. The result can be seen below in the code for FixArrayTest (or, if you like, FixBucketTest, which is closer to how groupings work). As you can see, the timings for # buckets = 10...10000 are the same, which is correct according to this theory.
Cache and random
Caching and random number generators aren't friends.
Our little test also shows that when the number of buckets grows above a certain threshold, memory comes into play. On my computer this is at an array size of roughly 4 MB (4 * number of buckets). Because the data is random, random chunks of RAM will be loaded into and evicted from the cache, which is a slow process. This also explains the large jump in the timings. To see this in action, change the random numbers to a sequence (called 'test 2'), and, because the data pages can now be cached, the speed will remain the same overall.
Note that hashes overallocate, so you will hit the mark before you have a million entries in your grouping.
Test code
static void Main(string[] args)
{
int size = 10000000;
int[] ar = new int[size];
//random number init with numbers [0,size-1]
var r = new Random();
for (var i = 0; i < size; i++)
{
ar[i] = r.Next(0, size);
//ar[i] = i; // Test 2 -> uncomment to see the effects of caching more clearly
}
Console.WriteLine("Fixed dictionary:");
for (var numBuckets = 10; numBuckets <= 1000000; numBuckets *= 10)
{
var num = (size / numBuckets);
var timing = 0L;
for (var i = 0; i < 5; i++)
{
timing += FixBucketTest(ar, num);
//timing += FixArrayTest(ar, num); // test 1
}
var avg = ((float)timing) / 5.0f;
Console.WriteLine("Avg Time: " + avg + " ms for " + numBuckets);
}
Console.WriteLine("Fixed array:");
for (var numBuckets = 10; numBuckets <= 1000000; numBuckets *= 10)
{
var num = (size / numBuckets);
var timing = 0L;
for (var i = 0; i < 5; i++)
{
timing += FixArrayTest(ar, num); // test 1
}
var avg = ((float)timing) / 5.0f;
Console.WriteLine("Avg Time: " + avg + " ms for " + numBuckets);
}
}
static long FixBucketTest(int[] ar, int num)
{
// This test shows that timings will not grow for the smaller numbers of buckets if you don't have to re-allocate
System.Diagnostics.Stopwatch s = new Stopwatch();
s.Start();
var grouping = new Dictionary<int, List<int>>(ar.Length / num + 1); // exactly the right size
foreach (var item in ar)
{
int idx = item / num;
List<int> ll;
if (!grouping.TryGetValue(idx, out ll))
{
grouping.Add(idx, ll = new List<int>());
}
//ll.Add(item); //-> this would complete a 'grouper'; however, we don't want the overallocator of List to kick in
}
s.Stop();
return s.ElapsedMilliseconds;
}
// Test with arrays
static long FixArrayTest(int[] ar, int num)
{
System.Diagnostics.Stopwatch s = new Stopwatch();
s.Start();
int[] buf = new int[(ar.Length / num + 1) * 10];
foreach (var item in ar)
{
int code = (item & 0x7FFFFFFF) % buf.Length;
buf[code]++;
}
s.Stop();
return s.ElapsedMilliseconds;
}
When executing bigger calculations, less physical memory is available on the computer, and counting the buckets will be slower with less memory. As you expand the number of buckets, your available memory decreases.
Try something like the following:
int size = 2500000; //10000000 divided by 4
int[] ar = new int[size];
//random number init with numbers [0,size-1]
System.Diagnostics.Stopwatch s = new Stopwatch();
s.Start();
for (int i = 0; i<4; i++)
{
var group = ar.GroupBy(x => x / num); // lambda parameter renamed: i would clash with the loop variable
//the number of expected buckets is size / num.
var l = group.ToArray();
}
s.Stop();
This calculates four times with smaller inputs.
Even though it is good to check the performance of code in terms of algorithmic analysis and Big-O notation, I wanted to see how long the code takes to execute on my PC. I initialized a List with 9999 elements and removed the even elements from it. Sadly, the time span to execute this seems to be 0:0:0. I'm surprised by the result, so there must be something wrong in the way I time the execution. Could someone help me time the code correctly?
IList<int> source = new List<int>(100);
for (int i = 0; i < 9999; i++)
{
source.Add(i);
}
TimeSpan startTime, duration;
startTime = Process.GetCurrentProcess().Threads[0].UserProcessorTime;
RemoveEven(ref source);
duration = Process.GetCurrentProcess().Threads[0].UserProcessorTime.Subtract(startTime);
Console.WriteLine(duration.Milliseconds);
Console.Read();
The most appropriate thing to use there would be Stopwatch - anything involving TimeSpan has nowhere near enough precision for this:
var watch = Stopwatch.StartNew();
// something to time
watch.Stop();
Console.WriteLine(watch.ElapsedMilliseconds);
However, a modern CPU is very fast, and it would not surprise me if it can remove them in that time. Normally, for timing, you need to repeat an operation a large number of times to get a reasonable measurement.
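As a rough illustration of repeating the operation, here is a small helper that warms up once and then averages over many runs. AverageMilliseconds is a made-up name, and List<T>.RemoveAll stands in for the question's RemoveEven method, whose body isn't shown.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

public static class RepeatTimingSketch
{
    // Hypothetical helper: runs the action once untimed (so JIT compilation is
    // excluded), then times `iterations` runs and returns the mean in milliseconds.
    public static double AverageMilliseconds(Action action, int iterations)
    {
        action(); // warm-up run, not measured
        Stopwatch watch = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            action();
        }
        watch.Stop();
        return watch.Elapsed.TotalMilliseconds / iterations;
    }

    public static void Main()
    {
        double mean = AverageMilliseconds(() =>
        {
            // The list is rebuilt on every run, so its setup cost is part of the measurement.
            var source = new List<int>();
            for (int i = 0; i < 9999; i++) source.Add(i);
            source.RemoveAll(x => x % 2 == 0); // stand-in for the question's RemoveEven
        }, 1000);
        Console.WriteLine("{0:F4} ms per run", mean);
    }
}
```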
Aside: the ref in RemoveEven(ref source) is almost certainly not needed.
In .Net 2.0 you can use the Stopwatch class
IList<int> source = new List<int>(100);
for (int i = 0; i < 9999; i++)
{
source.Add(i);
}
Stopwatch watch = new Stopwatch();
watch.Start();
RemoveEven(ref source);
watch.Stop();
//watch.ElapsedMilliseconds now contains the execution time in ms
Adding to previous answers:
var sw = Stopwatch.StartNew();
// instructions to time
sw.Stop();
sw.ElapsedMilliseconds returns a long and has a resolution of:
1 millisecond = 1000000 nanoseconds
sw.Elapsed.TotalMilliseconds returns a double and has a resolution equal to the inverse of Stopwatch.Frequency. On my PC for example Stopwatch.Frequency has a value of 2939541 ticks per second, that gives sw.Elapsed.TotalMilliseconds a resolution of:
1/2939541 seconds = 3,401891655874165e-7 seconds = 340 nanoseconds
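To check the figure for your own machine, something like the following should do. It only reads Stopwatch.Frequency and Stopwatch.IsHighResolution; the helper NanosecondsPerTick is an illustrative name, not part of the framework.

```csharp
using System;
using System.Diagnostics;

public static class StopwatchResolutionSketch
{
    // Nanoseconds represented by one Stopwatch tick at the given frequency (ticks per second).
    public static double NanosecondsPerTick(long frequency)
    {
        return 1e9 / frequency;
    }

    public static void Main()
    {
        Console.WriteLine("High resolution: {0}", Stopwatch.IsHighResolution);
        Console.WriteLine("Frequency: {0} ticks/s", Stopwatch.Frequency);
        Console.WriteLine("Resolution: {0:F1} ns per tick", NanosecondsPerTick(Stopwatch.Frequency));
    }
}
```

With the frequency quoted above (2939541 ticks per second) the helper gives roughly 340 ns per tick, matching the hand calculation.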
Is this a valid way to do performance analysis? I want to get nanosecond accuracy and determine the performance of typecasting:
class PerformanceTest
{
static double last = 0.0;
static List<object> numericGenericData = new List<object>();
static List<double> numericTypedData = new List<double>();
static void Main(string[] args)
{
double totalWithCasting = 0.0;
double totalWithoutCasting = 0.0;
for (double d = 0.0; d < 1000000.0; ++d)
{
numericGenericData.Add(d);
numericTypedData.Add(d);
}
Stopwatch stopwatch = new Stopwatch();
for (int i = 0; i < 10; ++i)
{
stopwatch.Start();
testWithTypecasting();
stopwatch.Stop();
totalWithCasting += stopwatch.ElapsedTicks;
stopwatch.Start();
testWithoutTypeCasting();
stopwatch.Stop();
totalWithoutCasting += stopwatch.ElapsedTicks;
}
Console.WriteLine("Avg with typecasting = {0}", (totalWithCasting/10));
Console.WriteLine("Avg without typecasting = {0}", (totalWithoutCasting/10));
Console.ReadKey();
}
static void testWithTypecasting()
{
foreach (object o in numericGenericData)
{
last = ((double)o*(double)o)/200;
}
}
static void testWithoutTypeCasting()
{
foreach (double d in numericTypedData)
{
last = (d * d)/200;
}
}
}
The output is:
Avg with typecasting = 468872.3
Avg without typecasting = 501157.9
I'm a little suspicious... it looks like there is nearly no impact on the performance. Is casting really that cheap?
Update:
class PerformanceTest
{
static double last = 0.0;
static object[] numericGenericData = new object[100000];
static double[] numericTypedData = new double[100000];
static Stopwatch stopwatch = new Stopwatch();
static double totalWithCasting = 0.0;
static double totalWithoutCasting = 0.0;
static void Main(string[] args)
{
for (int i = 0; i < 100000; ++i)
{
numericGenericData[i] = (double)i;
numericTypedData[i] = (double)i;
}
for (int i = 0; i < 10; ++i)
{
stopwatch.Start();
testWithTypecasting();
stopwatch.Stop();
totalWithCasting += stopwatch.ElapsedTicks;
stopwatch.Reset();
stopwatch.Start();
testWithoutTypeCasting();
stopwatch.Stop();
totalWithoutCasting += stopwatch.ElapsedTicks;
stopwatch.Reset();
}
Console.WriteLine("Avg with typecasting = {0}", (totalWithCasting/(10.0)));
Console.WriteLine("Avg without typecasting = {0}", (totalWithoutCasting / (10.0)));
Console.ReadKey();
}
static void testWithTypecasting()
{
foreach (object o in numericGenericData)
{
last = ((double)o * (double)o) / 200;
}
}
static void testWithoutTypeCasting()
{
foreach (double d in numericTypedData)
{
last = (d * d) / 200;
}
}
}
The output is:
Avg with typecasting = 4791
Avg without typecasting = 3303.9
Note that it's not typecasting that you are measuring, it's unboxing. The values are doubles all along, there is no type casting going on.
You forgot to reset the stopwatch between tests, so you are adding the accumulated time of all previous tests over and over. If you convert the ticks to actual time, you see that it adds up to much more than the time it took to run the test.
If you add a stopwatch.Reset(); before each stopwatch.Start();, you get a much more reasonable result like:
Avg with typecasting = 41027,1
Avg without typecasting = 20594,3
Unboxing a value is not so expensive, it only has to check that the data type in the object is correct, then get the value. Still it's a lot more work than when the type is already known. Remember that you are also measuring the looping, calculation and assigning of the result, which is the same for both tests.
Boxing a value is more expensive than unboxing it, as that allocates an object on the heap.
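A small sketch of what boxing and unboxing look like in code, including the type check that unboxing performs. CanUnboxToInt is a made-up helper used only to make that check visible.

```csharp
using System;

public static class UnboxingSketch
{
    // Returns true if unboxing `boxed` to int succeeds, false if the runtime's
    // type check rejects it with an InvalidCastException.
    public static bool CanUnboxToInt(object boxed)
    {
        try
        {
            int value = (int)boxed;
            return true;
        }
        catch (InvalidCastException)
        {
            return false;
        }
    }

    public static void Main()
    {
        object boxed = 1.5;             // boxing: allocates a heap object holding a double
        double unboxed = (double)boxed; // unboxing: type check, then copy the value out
        Console.WriteLine(unboxed);
        Console.WriteLine(CanUnboxToInt(boxed)); // False: a boxed double is not a boxed int
        Console.WriteLine(CanUnboxToInt(42));    // True
    }
}
```

The failed unbox shows that the cast is a checked operation rather than a numeric conversion: a boxed double can only be unboxed as a double.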
1) Yes, casting is usually (very) cheap.
2) You are not going to get nanosecond accuracy in a managed language. Or in an unmanaged language under most operating systems.
Consider
other processes
garbage collection
different JITters
different CPUs
And, your measurement includes the foreach loop, looks like 50% or more to me. Maybe 90%.
When you call Stopwatch.Start it is letting the timer continue to run from wherever it left off. You need to call Stopwatch.Reset() to set the timers back to zero before starting again. Personally I just use stopwatch = Stopwatch.StartNew() whenever I want to start a timer to avoid this sort of confusion.
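The accumulation is easy to see in isolation. In this sketch (the method name TimeTwice and the 50 ms sleeps are arbitrary choices for illustration), the same Stopwatch times two intervals, with and without a Reset in between.

```csharp
using System;
using System.Diagnostics;
using System.Threading;

public static class StopwatchAccumulationSketch
{
    // Times two Sleep calls on one Stopwatch. If resetBetween is false the second
    // reading includes the first interval; if true it covers only the second.
    public static long[] TimeTwice(bool resetBetween)
    {
        var watch = new Stopwatch();
        watch.Start();
        Thread.Sleep(50);
        watch.Stop();
        long first = watch.ElapsedMilliseconds;

        if (resetBetween) watch.Reset();
        watch.Start();
        Thread.Sleep(50);
        watch.Stop();
        long second = watch.ElapsedMilliseconds;
        return new[] { first, second };
    }

    public static void Main()
    {
        long[] accumulated = TimeTwice(false);
        long[] separate = TimeTwice(true);
        Console.WriteLine("No reset: {0} ms then {1} ms", accumulated[0], accumulated[1]);
        Console.WriteLine("Reset:    {0} ms then {1} ms", separate[0], separate[1]);
    }
}
```

Without the reset, the second reading is roughly the sum of both intervals, which is the effect that inflates the averages in the question.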
Furthermore, you probably want to call both of your test methods before starting the "timing loop" so that they get a fair chance to "warm up" that piece of code and ensure that the JIT has had a chance to run to even the playing field.
When I do that on my machine, I see that testWithoutTypeCasting runs in approximately half the time of testWithTypecasting.
That being said, the cast itself is not likely to be the most significant part of that performance penalty. The testWithTypecasting method operates on a list of boxed doubles, which means that an additional level of indirection is required to retrieve each value (following a reference to the value somewhere else in memory), in addition to increasing the total amount of memory consumed. This increases the time spent on memory access and is likely to be a bigger effect than the CPU time spent "in the cast" itself.
Look into performance counters in the System.Diagnostics namespace. When you create a new counter, you first create a category, and then specify one or more counters to be placed in it.
// Create a collection of type CounterCreationDataCollection.
System.Diagnostics.CounterCreationDataCollection CounterDatas =
new System.Diagnostics.CounterCreationDataCollection();
// Create the counters and set their properties.
System.Diagnostics.CounterCreationData cdCounter1 =
new System.Diagnostics.CounterCreationData();
System.Diagnostics.CounterCreationData cdCounter2 =
new System.Diagnostics.CounterCreationData();
cdCounter1.CounterName = "Counter1";
cdCounter1.CounterHelp = "help string1";
cdCounter1.CounterType = System.Diagnostics.PerformanceCounterType.NumberOfItems64;
cdCounter2.CounterName = "Counter2";
cdCounter2.CounterHelp = "help string 2";
cdCounter2.CounterType = System.Diagnostics.PerformanceCounterType.NumberOfItems64;
// Add both counters to the collection.
CounterDatas.Add(cdCounter1);
CounterDatas.Add(cdCounter2);
// Create the category and pass the collection to it.
System.Diagnostics.PerformanceCounterCategory.Create(
"Multi Counter Category", "Category help", CounterDatas);
see MSDN docs
Just a thought but sometimes identical machine code can take a different number of cycles to execute depending on its alignment in memory so you might want to add a control or controls.
Don't "do" C# myself but in C for x86-32 and later the rdtsc instruction is usually available which is much more accurate than OS ticks. More info on rdtsc can be found by searching Stack Overflow. Under C it is usually available as an intrinsic or built-in function and returns the number of clock cycles (in an 8 byte - long long/__int64 - unsigned integer) since the computer was powered up. So if the CPU has a clock speed of 3 GHz the underlying counter is incremented 3 billion times per second. Save for a few early AMD processors, all multi-core CPUs will have their counters synchronized.
If C# does not have it you might consider writing a VERY short C function to access it from C#. There is a great deal of overhead if you access the instruction through a function vs inline. The difference between two back-to-back calls to the function will be the basic measurement overhead. If you're thinking of metering your application you'll have to determine several more complex overhead values.
You might consider shutting off the CPU's energy-saving mode (and restarting the PC), as it lowers the clock frequency fed to the CPU during periods of low activity. This matters because it can cause the time stamp counters of the different cores to become unsynchronized.