Adding to List<t> becomes very slow over time - c#

I'm parsing an html table that has about 1000 rows. I'm adding ~10 char string from one <td> in each row to a list<string> object. It's very quick for the first 200 or so loops but then becomes slower and slower over time.
This is the code i'm using:
List<string> myList = new List<string>();
int maxRows = numRows;
for (int i = 1; i < maxRows; i++)
{
TableRow newTable = myTable.TableRows[i];
string coll = string.Format("{0},{1},{2},{3},{4}",newTable.TableCells[0].Text,newTable.TableCells[1].Text,newTable.TableCells[2].Text,newTable.TableCells[3].Text,newTable.TableCells[4].Text);
myList.Add(coll);
label1.Text = i.ToString();
}
Should I use an array instead?
Edit: I threw the above code in a new method that gets run on a new Thread and then updated my label control with this code:
label1.Invoke((MethodInvoker)delegate
{
label1.Text = i.ToString();
});
Program runs at a consistent speed and doesn't block the UI.

If you roughly know the range (number of items) in your collection it is better to use an array.
Reason : Every time you add an element to the List if the list is full it allocates new block of memory to hold the double the current space and copies everything there and then keeps appending the additional entries till it becomes full, and one more allocation copy cycle.
Following is how it works AFAIK, start with 16 elements by default,
when you add 17th element to the list it allocates 32 elemnts and copies 16 there then continues for 17 to 32. and repeats this process, so it is slower but offer flexibility of not having to determine the length beforehand. This might be the reason you're seeing the drag.
Thanks #Dyppl
var list = new List<int>(1000); This is one elegant option too, as #Dyppl suggested it is best of both the worlds.

I tested adding strings to a list, and benchmarked it with a LIST_SIZE of 1000000 (one million) items and a LIST_SIZE of 100000 (one hundred thousands) items. This way we can compare how it scales.
I ran each test 5 times and averaged the running times.
var l = new List<string>();
for (var i = 0; i < LIST_SIZE; ++i) {
l.Add("i = " + i.ToString());
}
LIST_SIZE of 1000000 takes 1519 ms
LIST_SIZE of 100000 takes 96 ms
var l = new List<string>(LIST_SIZE);
for (var i = 0; i < LIST_SIZE; ++i) {
l.Add("i = " + i.ToString());
}
LIST_SIZE of 1000000 takes 1386 ms
LIST_SIZE of 100000 takes 65 ms
var l = new string[LIST_SIZE];
for (var i = 0; i < LIST_SIZE; ++i) {
l[i] = "i = " + i.ToString();
}
LIST_SIZE of 1000000 takes 1510 ms
LIST_SIZE of 100000 takes 66 ms
So, we can notice 2 things:
it really takes more time to add each items the longer the list gets larger
the difference shouldn't be noticeable in a 1000 items list
I would conclude then that the bottleneck is in one of the other methods you call.

Initialize the List with the capacity you expect it to consume:
List<string> myList = new List<string>(maxRows);
Sidenote: If you generate 'very' large lists, the internally increasing storage arrays over time sum up to twice the storage you really need. But if for 1000 entries you already slow down, I suggest investigating the true reason for it with a profiler. May the strings grow to large ?

Related

Why is Queue consuming so much memory?

Basically I was doing a code kata on codewars site to kinda of 'warm up' before starting to code, and noticed a problem that I don't know if its because of my code, or just regular thing.
public static string WhoIsNext(string[] names, long n)
{
Queue<string> fifo = new Queue<string>(names);
for(int i = 0; i < n - 1; i++)
{
var name = fifo.Dequeue();
fifo.Enqueue(name);
fifo.Enqueue(name);
}
return fifo.Peek();
}
And Is called like this:
// Test 1
string[] names = { "Sheldon", "Leonard", "Penny", "Rajesh", "Howard" };
long n = 1;
var nth = CodeKata.WhoIsNext(names, n); // n = 1 Should return sheldon.
// test 2
string[] names = { "Sheldon", "Leonard", "Penny", "Rajesh", "Howard" };
long n = 52;
var nth = CodeKata.WhoIsNext(names, n); // n = 52 Should return Penny.
// test 3
string[] names = { "Sheldon", "Leonard", "Penny", "Rajesh", "Howard" };
long n = 7230702951;
var nth = CodeKata.WhoIsNext(names, n); // n = 52 Should return Leonard.
In this code When I put the long n with the value 7230702951 (a really high number...), it throws an out of memory exception. Is the number that high, or is the queue just not optimized for such numbers.
I say this because I tried using a List and the list memory usage stayed under 500 MB (the plateu was around 327MB btw), and this running for about 2/3min, whereas the queue throwed the exception in a matter of seconds, and went over 2GB in just that time alone.
Can someone explain to me the why of this happening, I just curious?
edit 1
I forgot to add the List code:
public static string WhoIsNext(string[] names, long n)
{
List<string> test = new List<string>(names);
for(int i = 0; i < n - 1; i++)
{
var name = test[0];
test.RemoveAt(0);
test.Add(name);
test.Add(name);
}
return test[0];
}
edit 2
For those saying that the code doubles the names and is inneficient, I already know that, the code isn't made to be useful, is just a kata. (I updated the link now!)
My question is as to why is Queue so much more inneficient thatn List with high count numbers.
Part of the reason is that the queue code is way faster than the List code, because queues are optimised for deletes due to the fact that they are a circular buffer. Lists aren't - the list copies the array contents every time you remove that first element.
Change the input value to 72307000 for example. On my machine, the queue finishes that in less than a second. The list is still chugging away minutes (and at this rate, hours) later. In 4 minutes i is now at 752408 - it has done almost 1% of the work).
Thus, I am not sure the queue is less memory efficient. It is just so fast that you run into the memory issue sooner. The list almost certainly has the same issue (the way that List and Queue do array size doubling is very similar) - it will just likely take days to run into it.
To a certain extent, you could predict this even without running your code. A queue with 7230702951 entries in it (running 64-bit) will take a minimum of 8 bytes per entry. So 57845623608 bytes. Which is larger than 50GB. Clearly your machine is going to struggle to fit that in RAM (plus .NET won't let you have an array that large)...
Additionally, your code has a subtle bug. The loop can't ever end (if n is greater than int.MaxValue). Your loop variable is an int but the parameter is a long. Your int will overflow (from int.MaxValue to int.MinValue with i++). So the loop will never exit, for large values of n (meaning the queue will grow forever). You likely should change the type of i to long.

Different time for getting value from two arrays different length

I was surprised that getting value by index from the first array need more time then from second one. It is not depend on arrays lenght, in my tests it is true for any combinations. I guess that it is depends on some low level optimizations. Can somebody explain it?
code example is bellow:
var a1 = new int[10];
var a2 = new int[1000000];
#region init
var random = new Random(12345);
for (int i = 0; i < a1.Length; i++)
a1[i] = random.Next(1000000000);
for (int i = 0; i < a2.Length; i++)
a2[i] = random.Next(1000000000);
#endregion
Console.WriteLine("a1 Length = " + a1.Length);
var watcher = Stopwatch.StartNew();
var t1 = a1[a1.Length / 2];
watcher.Stop();
Console.WriteLine("a1 timestamp = " + watcher.ElapsedTicks); // average value 130-150 ticks
Console.WriteLine("a2 Length = " + a2.Length);
watcher = Stopwatch.StartNew();
var t2 = a2[a2.Length / 2];
watcher.Stop();
Console.WriteLine("a2 timestamp = " + watcher.ElapsedTicks); //average value 10 - 15 ticks
Console.ReadLine();
My result is:
- getting value by the index from array with lenght 10 is ~130-150 ticks
- getting value by the index from array with lenght 1000000 is ~10-15 ticks
I would suggest you to change how you measuring the performance, but let's assume that your measurement is correct. It could be a few reasons here and one of them is branch prediction. In short, modern processors are using branch prediction for their computations.
As it says on Wikipedia:
The purpose of the branch predictor is to improve the flow in the
instruction pipeline. Branch predictors play a critical role in
achieving high effective performance in many modern pipelined
microprocessor architectures such as x86.
So the digital circuit is trying to identify a pattern and follow it. If you guess right every time, the execution will never have to stop and it goes fast and if you guess wrong too often, you spend a lot of time rolling back and restarting. For the same reason processing sorted array is faster than processing unsorted array.

Why property "ElapsedTicks" of List not equal to "ElapsedTicks" of Array?

For example I have the following code implements Stopwatch:
var list = new List<int>();
var array = new ArrayList();
Stopwatch listStopwatch = new Stopwatch(), arrayStopwatch = new Stopwatch();
listStopwatch.Start();
for (int i =0; i <=10000;i++)
{
list.Add(10);
}
listStopwatch.Stop();
arrayStopwatch.Start();
for (int i = 0; i <= 10000; i++)
{
list.Add(10);
}
arrayStopwatch.Stop();
Console.WriteLine(listStopwatch.ElapsedTicks > arrayStopwatch.ElapsedTicks);
Why this values are not equal?
Different code is expected to produce different timing.
Second loop adds to array as question imply
One most obvious difference is boxing in ArrayList - each int is stored as boxed value (created on heap instead of inline for List<int>).
Second loop adds to list as sample shows
growing list requires re-allocation and copying all elements which may be slower for second set of elements if in particular range it will hit more re-allocations (as copy operation need to copy a lot more elements each time).
Note that on average (as hinted by Adam Houldsworth) re-allocation cost the same (as they happen way lest often when array grows), but one can find set of numbers when there are extra re-allocation in on of the cases to get one number consistently different than another. One would need much higher number of items to add for difference to be consistent.

Take pages and combine to list of pages

I have a list, let's say it contains 1000 items. I want to end up with a list of 10 times 100 items with something like:
myList.Select(x => x.y).Take(100) (until list is empty)
So I want Take(100) to run ten times, since the list contains 1000 items, and end up with list containing 10 lists which each contains 100 items.
You need to Skip the number of records you have already taken, you can keep track of this number and use it when you query
alreadyTaken = 0;
while (alreadyTaken < 1000) {
var pagedList = myList.Select(x => x.y).Skip(alreadyTaken).Take(100);
...
alreadyTaken += 100;
}
This can be achieved with a simple paging extension method.
public static List<T> GetPage<T>(this List<T> dataSource, int pageIndex, int pageSize = 100)
{
return dataSource.Skip(pageIndex * pageSize)
.Take(pageSize)
.ToList();
}
Of course, you can extend it to accept and/or return any kind of IEnumerable<T>.
As already posted you can use a for loop and Skip some elements and Take some elements. In this way you create a new query in every for loop. But a problem raises if you also want to go through each of those queries, because this will be very inefficient. Lets assume you just have 50 entries and you want to go through your list with ten elements every loop. You will have 5 loops doing
.Skip(0).Take(10)
.Skip(10).Take(10)
.Skip(20).Take(10)
.Skip(30).Take(10)
.Skip(40).Take(10)
Here two problem raises.
Skiping elements can still lead to computation. In your first query you just calculate the needed 10 elements, but in your second loop you calculated 20 elements and throwing 10 away, and so on. If you sum all 5 loops together you already computed 10 + 20 + 30 + 40 + 50 = 150 elements even you only had 50 elements. This result in an O(n^2) performance.
Not every IEnumerable does the above thing. Some IEnumerable like a database for example can optimize a Skip, for example they use an Offset (MySQL) definition in the SQL query. But that still doesn't solve the problem. The main problem you still have is that you will create 5 different Queries and execute all 5 of them. Those five queries will now take the most time. Because a simple Query to a database is even a lot slower than just Skipping some in-memory elements or some computations.
Because of all these problems it makes sense to not use a for loop with multiple .Skip(x).Take(y) if you also want to evaluate every query in every loop. Instead your algorithm should only go through your IEnumerable once, executing the query once, and on the first iteration return the first 10 elements. The next iteration returns the next 10 elements and so on, until it runs out of elements.
The following Extension Method does exactly this.
public static IEnumerable<IReadOnlyList<T>> Combine<T>(this IEnumerable<T> source, int amount) {
var combined = new List<T>();
var counter = 0;
foreach ( var entry in source ) {
combined.Add(entry);
if ( ++counter >= amount ) {
yield return combined;
combined = new List<T>();
counter = 0;
}
}
if ( combined.Count > 0 )
yield return combined;
}
With this you can just do
someEnumerable.Combine(100)
and you get a new IEnumerable<IReadOnlyList<T>> that goes through your enumeration just once slicing everything into chunks with a maximum of 100 elements.
Just to show how much difference the performance could be:
var numberCount = 100000;
var combineCount = 100;
var nums = Enumerable.Range(1, numberCount);
var count = 0;
// Bechmark with Combine() Extension
var swCombine = Stopwatch.StartNew();
var sumCombine = 0L;
var pages = nums.Combine(combineCount);
foreach ( var page in pages ) {
sumCombine += page.Sum();
count++;
}
swCombine.Stop();
Console.WriteLine("Count: {0} Sum: {1} Time Combine: {2}", count, sumCombine, swCombine.Elapsed);
// Doing it with .Skip(x).Take(y)
var swTakes = Stopwatch.StartNew();
count = 0;
var sumTaken = 0L;
var alreadyTaken = 0;
while ( alreadyTaken < numberCount ) {
sumTaken += nums.Skip(alreadyTaken).Take(combineCount).Sum();
alreadyTaken += combineCount;
count++;
}
swTakes.Stop();
Console.WriteLine("Count: {0} Sum: {1} Time Takes: {2}", count, sumTaken, swTakes.Elapsed);
The usage with the Combine() Extension Methods runs in 3 milliseconds on my computer (i5 # 4Ghz) while the for loop already needs 178 milliseconds
If you have a lot more elements or the slicing is smaller it gets even more worse. For example if combineCount is set to 10 instead of 100 the runtime changes to 4 milliseconds and 1800 milliseconds (1.8 seconds)
Now you could possibly say that you don't have so much elements or your slicing never gets so small. But remember, in this this example i just generated a sequence of numbers that has nearly zero computation time. The whole overhead from 4 milliseconds to 178 milliseconds is only caused of the re-evaluation and Skiping of values. If you have some more complex stuff going on behind the scenes the Skipping creates the most overhead, and also if an IEnumerable can implement Skip, like a database as explained above, that example will still get more worse, because the most overhead will be the execution of the query itself.
And the amount of queries can go really fast up. With 100.000 elements and a slicing/chunking of 100 you already will execute 1.000 queries. The Combine Extension provided above on the other hand will always execute your query once. And will never suffer of any of those problems described above.
All of that doesn't mean that Skip and Take should be avoided. They have their place. But if you really plan to go through every element you should avoid using Skip and Take to get your slicing done.
If the only thing you want is just to slice everything into pages with 100 elements, and you just want to fetch the third page, for example. You just should calculate how much elements you need to Skip.
var pageCount = 100;
var pageNumberToGet = 3;
var thirdPage = yourEnumerable.Skip(pageCount * (pageNumberToGet-1)).take(pageCount);
In this way you will get the elements from 200 to 300 in a single query. Also an IEnumerable with a databse can optimize that and you just have a single-query. So, if you only want a specific range of elements from your IEnumerable than you should use Skip and Take and do it like above instead of using the Combine Extension Method that i provided.

Why .NET group by is (much) slower when the number of buckets grows

Given this simple piece of code and 10mln array of random numbers:
static int Main(string[] args)
{
int size = 10000000;
int num = 10; //increase num to reduce number of buckets
int numOfBuckets = size/num;
int[] ar = new int[size];
Random r = new Random(); //initialize with randum numbers
for (int i = 0; i < size; i++)
ar[i] = r.Next(size);
var s = new Stopwatch();
s.Start();
var group = ar.GroupBy(i => i / num);
var l = group.Count();
s.Stop();
Console.WriteLine(s.ElapsedMilliseconds);
Console.ReadLine();
return 0;
}
I did some performance on grouping, so when the number of buckets is 10k the estimated execution time is 0.7s, for 100k buckets it is 2s, for 1m buckets it is 7.5s.
I wonder why is that. I imagine that if the GroupBy is implemented using HashTable there might be problem with collisions. For example initially the hashtable is prepard to work for let's say 1000 groups and then when the number of groups is growing it needs to increase the size and do the rehashing. If these was the case I could then write my own grouping where I would initialize the HashTable with expected number of buckets, I did that but it was only slightly faster.
So my question is, why number of buckets influences groupBy performance that much?
EDIT:
running under release mode change the results to 0.55s, 1.6s, 6.5s respectively.
I also changed the group.ToArray to piece of code below just to force execution of grouping :
foreach (var g in group)
array[g.Key] = 1;
where array is initialized before timer with appropriate size, the results stayed almost the same.
EDIT2:
You can see the working code from mellamokb in here pastebin.com/tJUYUhGL
I'm pretty certain this is showing the effects of memory locality (various levels of caching) and also object allocation.
To verify this, I took three steps:
Improve the benchmarking to avoid unnecessary parts and to garbage collect between tests
Remove the LINQ part by populating a Dictionary (which is effecively what GroupBy does behind the scenes)
Remove even Dictionary<,> and show the same trend for plain arrays.
In order to show this for arrays, I needed to increase the input size, but it does show the same kind of growth.
Here's a short but complete program which can be used to test both the dictionary and the array side - just flip which line is commented out in the middle:
using System;
using System.Collections.Generic;
using System.Diagnostics;
class Test
{
const int Size = 100000000;
const int Iterations = 3;
static void Main()
{
int[] input = new int[Size];
// Use the same seed for repeatability
var rng = new Random(0);
for (int i = 0; i < Size; i++)
{
input[i] = rng.Next(Size);
}
// Switch to PopulateArray to change which method is tested
Func<int[], int, TimeSpan> test = PopulateDictionary;
for (int buckets = 10; buckets <= Size; buckets *= 10)
{
TimeSpan total = TimeSpan.Zero;
for (int i = 0; i < Iterations; i++)
{
// Switch which line is commented to change the test
// total += PopulateDictionary(input, buckets);
total += PopulateArray(input, buckets);
GC.Collect();
GC.WaitForPendingFinalizers();
}
Console.WriteLine("{0,9}: {1,7}ms", buckets, (long) total.TotalMilliseconds);
}
}
static TimeSpan PopulateDictionary(int[] input, int buckets)
{
int divisor = input.Length / buckets;
var dictionary = new Dictionary<int, int>(buckets);
var stopwatch = Stopwatch.StartNew();
foreach (var item in input)
{
int key = item / divisor;
int count;
dictionary.TryGetValue(key, out count);
count++;
dictionary[key] = count;
}
stopwatch.Stop();
return stopwatch.Elapsed;
}
static TimeSpan PopulateArray(int[] input, int buckets)
{
int[] output = new int[buckets];
int divisor = input.Length / buckets;
var stopwatch = Stopwatch.StartNew();
foreach (var item in input)
{
int key = item / divisor;
output[key]++;
}
stopwatch.Stop();
return stopwatch.Elapsed;
}
}
Results on my machine:
PopulateDictionary:
10: 10500ms
100: 10556ms
1000: 10557ms
10000: 11303ms
100000: 15262ms
1000000: 54037ms
10000000: 64236ms // Why is this slower? See later.
100000000: 56753ms
PopulateArray:
10: 1298ms
100: 1287ms
1000: 1290ms
10000: 1286ms
100000: 1357ms
1000000: 2717ms
10000000: 5940ms
100000000: 7870ms
An earlier version of PopulateDictionary used an Int32Holder class, and created one for each bucket (when the lookup in the dictionary failed). This was faster when there was a small number of buckets (presumably because we were only going through the dictionary lookup path once per iteration instead of twice) but got significantly slower, and ended up running out of memory. This would contribute to fragmented memory access as well, of course. Note that PopulateDictionary specifies the capacity to start with, to avoid effects of data copying within the test.
The aim of using the PopulateArray method is to remove as much framework code as possible, leaving less to the imagination. I haven't yet tried using an array of a custom struct (with various different struct sizes) but that may be something you'd like to try too.
EDIT: I can reproduce the oddity of the slower result for 10000000 than 100000000 at will, regardless of test ordering. I don't understand why yet. It may well be specific to the exact processor and cache I'm using...
--EDIT--
The reason why 10000000 is slower than the 100000000 results has to do with the way hashing works. A few more tests explain this.
First off, let's look at the operations. There's Dictionary.FindEntry, which is used in the [] indexing and in Dictionary.TryGetValue, and there's Dictionary.Insert, which is used in the [] indexing and in Dictionary.Add. If we would just do a FindEntry, the timings would go up as we expect it:
static TimeSpan PopulateDictionary1(int[] input, int buckets)
{
int divisor = input.Length / buckets;
var dictionary = new Dictionary<int, int>(buckets);
var stopwatch = Stopwatch.StartNew();
foreach (var item in input)
{
int key = item / divisor;
int count;
dictionary.TryGetValue(key, out count);
}
stopwatch.Stop();
return stopwatch.Elapsed;
}
This is implementation doesn't have to deal with hash collisions (because there are none), which makes the behavior as we expect it. Once we start dealing with collisions, the timings start to drop. If we have as much buckets as elements, there are obviously less collisions... To be exact, we can figure out exactly how many collisions there are by doing:
static TimeSpan PopulateDictionary(int[] input, int buckets)
{
int divisor = input.Length / buckets;
int c1, c2;
c1 = c2 = 0;
var dictionary = new Dictionary<int, int>(buckets);
var stopwatch = Stopwatch.StartNew();
foreach (var item in input)
{
int key = item / divisor;
int count;
if (!dictionary.TryGetValue(key, out count))
{
dictionary.Add(key, 1);
++c1;
}
else
{
count++;
dictionary[key] = count;
++c2;
}
}
stopwatch.Stop();
Console.WriteLine("{0}:{1}", c1, c2);
return stopwatch.Elapsed;
}
The result is something like this:
10:99999990
10: 4683ms
100:99999900
100: 4946ms
1000:99999000
1000: 4732ms
10000:99990000
10000: 4964ms
100000:99900000
100000: 7033ms
1000000:99000000
1000000: 22038ms
9999538:90000462 <<-
10000000: 26104ms
63196841:36803159 <<-
100000000: 25045ms
Note the value of '36803159'. This answers the question why the last result is faster than the first result: it simply has to do less operations -- and since caching fails anyways, that factor doesn't make a difference anymore.
10k the estimated execution time is 0.7s, for 100k buckets it is 2s, for 1m buckets it is 7.5s.
This is an important pattern to recognize when you profile code. It is one of the standard size vs execution time relationships in software algorithms. Just from seeing the behavior, you can tell a lot about the way the algorithm was implemented. And the other way around of course, from the algorithm you can predict the expected execution time. A relationship that's annotated in the Big Oh notation.
Speediest code you can get is amortized O(1), execution time barely increases when you double the size of the problem. The Dictionary<> class behaves that way, as John demonstrated. The increases in time as the problem set gets large is the "amortized" part. A side-effect of Dictionary having to perform linear O(n) searches in buckets that keep getting bigger.
A very common pattern is O(n). That tells you that there is a single for() loop in the algorithm that iterates over the collection. O(n^2) tells you there are two nested for() loops. O(n^3) has three, etcetera.
What you got is the one in between, O(log n). It is the standard complexity of a divide-and-conquer algorithm. In other words, each pass splits the problem in two, continuing with the smaller set. Very common, you see it back in sorting algorithms. Binary search is the one you find back in your text book. Note how logâ‚‚(10) = 3.3, very close to the increment you see in your test. Perf starts to tank a bit for very large sets due to the poor locality of reference, a cpu cache problem that's always associated with O(log n) algoritms.
The one thing that John's answer demonstrates is that his guess cannot be correct, GroupBy() certainly does not use a Dictionary<>. And it is not possible by design, Dictionary<> cannot provide an ordered collection. Where GroupBy() must be ordered, it says so in the MSDN Library:
The IGrouping objects are yielded in an order based on the order of the elements in source that produced the first key of each IGrouping. Elements in a grouping are yielded in the order they appear in source.
Not having to maintain order is what makes Dictionary<> fast. Keeping order always cost O(log n), a binary tree in your text book.
Long story short, if you don't actually care about order, and you surely would not for random numbers, then you don't want to use GroupBy(). You want to use a Dictionary<>.
There are (at least) two influence factors: First, a hash table lookup only takes O(1) if you have a perfect hash function, which does not exist. Thus, you have hash collisions.
I guess more important, though, are caching effects. Modern CPUs have large caches, so for the smaller bucket count, the hash table itself might fit into the cache. As the hash table is frequently accessed, this might have a strong influence on the performance. If there are more buckets, more accesses to the RAM might be neccessary, which are slow compared to a cache hit.
There are a few factors at work here.
Hashes and groupings
The way grouping works is by creating a hash table. Each individual group then supports an 'add' operation, which adds an element to the add list. To put it bluntly, it's like a Dictionary<Key, List<Value>>.
Hash tables are always overallocated. If you add an element to the hash, it checks if there is enough capacity, and if not, recreates the hash table with a larger capacity (To be exact: new capacity = count * 2 with count the number of groups). However, a larger capacity means that the bucket index is no longer correct, which means you have to re-build the entries in the hash table. The Resize() method in Lookup<Key, Value> does this.
The 'groups' themselves work like a List<T>. These too are overallocated, but are easier to reallocate. To be precise: the data is simply copied (with Array.Copy in Array.Resize) and a new element is added. Since there's no re-hashing or calculation involved, this is quite a fast operation.
The initial capacity of a grouping is 7. This means, for 10 elements you need to reallocate 1 time, for 100 elements 4 times, for 1000 elements 8 times, and so on. Because you have to re-hash more elements each time, your code gets a bit slower each time the number of buckets grows.
I think these overallocations are the largest contributors to the small growth in the timings as the number of buckets grow. The easiest way to test this theory is to do no overallocations at all (test 1), and simply put counters in an array. The result can be shown below in the code for FixArrayTest (or if you like FixBucketTest which is closer to how groupings work). As you can see, the timings of # buckets = 10...10000 are the same, which is correct according to this theory.
Cache and random
Caching and random number generators aren't friends.
Our little test also shows that when the number of buckets grows above a certain threshold, memory comes into play. On my computer this is at an array size of roughly 4 MB (4 * number of buckets). Because the data is random, random chunks of RAM will be loaded and unloaded into the cache, which is a slow process. This is also the large jump in the speed. To see this in action, change the random numbers to a sequence (called 'test 2'), and - because the data pages can now be cached - the speed will remain the same overall.
Note that hashes overallocate, so you will hit the mark before you have a million entries in your grouping.
Test code
static void Main(string[] args)
{
int size = 10000000;
int[] ar = new int[size];
//random number init with numbers [0,size-1]
var r = new Random();
for (var i = 0; i < size; i++)
{
ar[i] = r.Next(0, size);
//ar[i] = i; // Test 2 -> uncomment to see the effects of caching more clearly
}
Console.WriteLine("Fixed dictionary:");
for (var numBuckets = 10; numBuckets <= 1000000; numBuckets *= 10)
{
var num = (size / numBuckets);
var timing = 0L;
for (var i = 0; i < 5; i++)
{
timing += FixBucketTest(ar, num);
//timing += FixArrayTest(ar, num); // test 1
}
var avg = ((float)timing) / 5.0f;
Console.WriteLine("Avg Time: " + avg + " ms for " + numBuckets);
}
Console.WriteLine("Fixed array:");
for (var numBuckets = 10; numBuckets <= 1000000; numBuckets *= 10)
{
var num = (size / numBuckets);
var timing = 0L;
for (var i = 0; i < 5; i++)
{
timing += FixArrayTest(ar, num); // test 1
}
var avg = ((float)timing) / 5.0f;
Console.WriteLine("Avg Time: " + avg + " ms for " + numBuckets);
}
}
static long FixBucketTest(int[] ar, int num)
{
// This test shows that timings will not grow for the smaller numbers of buckets if you don't have to re-allocate
System.Diagnostics.Stopwatch s = new Stopwatch();
s.Start();
var grouping = new Dictionary<int, List<int>>(ar.Length / num + 1); // exactly the right size
foreach (var item in ar)
{
int idx = item / num;
List<int> ll;
if (!grouping.TryGetValue(idx, out ll))
{
grouping.Add(idx, ll = new List<int>());
}
//ll.Add(item); //-> this would complete a 'grouper'; however, we don't want the overallocator of List to kick in
}
s.Stop();
return s.ElapsedMilliseconds;
}
// Test with arrays
static long FixArrayTest(int[] ar, int num)
{
System.Diagnostics.Stopwatch s = new Stopwatch();
s.Start();
int[] buf = new int[(ar.Length / num + 1) * 10];
foreach (var item in ar)
{
int code = (item & 0x7FFFFFFF) % buf.Length;
buf[code]++;
}
s.Stop();
return s.ElapsedMilliseconds;
}
When executing bigger calculations, less physical memory is available on the computer, counting the buckets will be slower with less memory, as you expend the buckets, your memory will decrease.
Try something like the following:
int size = 2500000; //10000000 divided by 4
int[] ar = new int[size];
//random number init with numbers [0,size-1]
System.Diagnostics.Stopwatch s = new Stopwatch();
s.Start();
for (int i = 0; i<4; i++)
{
var group = ar.GroupBy(i => i / num);
//the number of expected buckets is size / num.
var l = group.ToArray();
}
s.Stop();
calcuting 4 times with lower numbers.

Categories