Enumerating hash-based collections - C#

From what I have read and understood, the Dictionary internally hashes a key to some memory location (called a bucket) and stores the entry there in a linked list. If another key hashes to the same location, it is appended to that linked list.
I am fairly sure there are enough buckets to make insertions and removals in the dictionary fast, but besides that there is also some housekeeping going on. Every time a key is added to the dictionary, the Dictionary.Keys and Dictionary.Values properties are updated as well. I wanted to understand the complexity of that.
How does the foreach loop enumerate all the keys or values in a dictionary? I tried to Google the answer but never reached a concrete one. My research comes down to answering this question: "How has Microsoft probably implemented the KeyCollection class, or the ValueCollection class for that matter?"
It might be using a List (essentially an array) or a LinkedList underneath. But then, wouldn't removing a key from the dictionary cause inefficient housekeeping?
There is a similar situation with HashSet. Elements are inserted in a seemingly random order based on their hash codes. How does foreach enumerate every element when there is no order and no links from one element to another?
Can someone please give me some insight?

Related

Maintaining data locality in a Dictionary<TKey,TValue>

I'm making a game, and I decided that, for reasons, I'd give each game object an int entity ID that I could easily look it up by, instead of having to linearly search a list or, worse, many lists. The idea was inspired by the ECS pattern, and I figured that if I made sure to reuse ints when objects were destroyed, it would help keep all the data close together in memory and reduce cache misses a bit. (I know that depends more on access order; just thinking in the abstract here.) The problem is that I'm now doubting myself, and I've read so much that I can't keep the ideas straight in my head.
The question is essentially if I keep endlessly adding higher numbered keys to a Dictionary<int, SomeClass>, will the speed/memory usage be worse than if I try to re-use lower numbers?
Note: I feel like the answer is going to be "write your own class" but I was trying to avoid that and I don't think I'd do a good job if I don't understand this concept.
No, it makes no difference at all. From MSDN:
The Dictionary generic class provides a mapping from a set of keys to a set of values. Each addition to the dictionary consists of a value and its associated key. Retrieving a value by using its key is very fast, close to O(1), because the Dictionary class is implemented as a hash table.
So, lookups will stay close to O(1) because it internally uses a hash table; the value of the key doesn't affect it at all.
The only problem you could face is reaching int.MaxValue, but that's up to your scenario.
Okay here's my best effort at answering this myself, apologies if I get anything wrong.
Short answer: no. If you add higher-numbered keys, they just land somewhere in the underlying array until it fills up. The solution to the example problem is to just replace the dictionary with a GameObject array and use the int as an index, and if necessary write a class that handles expanding it.
Longer answer: I think my confusion came from reading somewhere that a dictionary is just a pair of parallel arrays, or something like that. I guess that's true, but since it's indexed by hash codes, it's not intended for contiguous index values. So it's doing a bunch of redundant work to handle cases that I'm never going to use it for.
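To make the "array instead of dictionary" conclusion concrete, here is a minimal sketch of an array-backed store with ID reuse. The class and member names are invented for illustration, not from any real engine API:

```csharp
using System;
using System.Collections.Generic;

// Entity IDs are direct indices into an array; destroyed IDs go on a
// free list so they get reused, keeping the occupied slots dense.
class EntityStore
{
    private object[] _items = new object[16];
    private readonly Stack<int> _freeIds = new Stack<int>();
    private int _nextId;

    public int Add(object item)
    {
        int id = _freeIds.Count > 0 ? _freeIds.Pop() : _nextId++;
        if (id >= _items.Length)
            Array.Resize(ref _items, _items.Length * 2); // grow like List<T>
        _items[id] = item;
        return id;
    }

    public object Get(int id) => _items[id];

    public void Remove(int id)
    {
        _items[id] = null;
        _freeIds.Push(id); // mark the slot for reuse
    }
}
```

Reusing freed IDs keeps the occupied indices dense, which is what makes the direct array indexing cache-friendly in the first place.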

Dictionary not preserving order if removed key is added later [duplicate]

I have read this in answer to many questions on here. But what exactly does it mean?
var test = new Dictionary<int, string>();
test.Add(0, "zero");
test.Add(1, "one");
test.Add(2, "two");
test.Add(3, "three");
Assert(test.ElementAt(2).Value == "two");
The above code seems to work as expected. So in what manner is a dictionary considered unordered? Under what circumstances could the above code fail?
Well, for one thing it's not clear whether you expect this to be insertion-order or key-order. For example, what would you expect the result to be if you wrote:
var test = new Dictionary<int, string>();
test.Add(3, "three");
test.Add(2, "two");
test.Add(1, "one");
test.Add(0, "zero");
Console.WriteLine(test.ElementAt(0).Value);
Would you expect "three" or "zero"?
As it happens, I think the current implementation preserves insertion ordering so long as you never delete anything - but you must not rely on this. It's an implementation detail, and that could change in the future.
Deletions also affect this. For example, what would you expect the result of this program to be?
using System;
using System.Collections.Generic;
class Test
{
static void Main()
{
var test = new Dictionary<int, string>();
test.Add(3, "three");
test.Add(2, "two");
test.Add(1, "one");
test.Add(0, "zero");
test.Remove(2);
test.Add(5, "five");
foreach (var pair in test)
{
Console.WriteLine(pair.Key);
}
}
}
It's actually (on my box) 3, 5, 1, 0. The new entry for 5 has used the vacated entry previously used by 2. That's not going to be guaranteed either though.
Rehashing (when the dictionary's underlying storage needs to be expanded) could affect things... all kinds of things do.
Just don't treat it as an ordered collection. It's not designed for that. Even if it happens to work now, you're relying on undocumented behaviour which goes against the purpose of the class.
A Dictionary<TKey, TValue> is implemented as a hash table, and in a hash table there is no notion of order.
The documentation explains it pretty well:
For purposes of enumeration, each item in the dictionary is treated as a KeyValuePair structure representing a value and its key. The order in which the items are returned is undefined.
There are a lot of good ideas here, but they're scattered, so I'm going to try to write an answer that lays it all out, even though the question has been answered.
First, a Dictionary has no guaranteed order, so you use it only to quickly look up a key and find a corresponding value, or you enumerate through all the key-value pairs without caring what the order is.
If you want order, use a sorted dictionary (SortedDictionary<TKey,TValue> in .NET), but the tradeoff is that lookup is slower; if you don't need order, don't pay for it.
Dictionaries (and HashMap in Java) use hashing, which is O(1) expected time regardless of the size of your table. Sorted dictionaries typically use some sort of balanced tree, which is O(log n), so access gets slower as your data grows. For comparison: 1 million elements is on the order of 2^20, so a tree lookup takes on the order of 20 steps versus 1 for a hash map. That's a LOT faster.
Hashing is deterministic. Non-determinism would mean that hashing 5 the first time and hashing 5 the next time could land in different places; that would be completely useless.
What people mean is that if you add things to a dictionary, the order is complicated, and subject to change any time you add (or potentially remove) an element. For example, imagine the hash table has 500k slots and currently holds 400k values. When you add one more, you cross the critical threshold (a hash table needs roughly 20% empty space to stay efficient), so it allocates a bigger table (say, 1 million slots) and re-hashes all the values. Now they are all in different locations than they were before.
If you build the same Dictionary twice (read my statement carefully, THE SAME), you will get the same order. But as Jon correctly says, don't count on it. Too many things can make it not the same, even the initially allocated size.
This brings up an excellent point: resizing a hash map is really, really expensive. It means allocating a bigger table and re-inserting every key-value pair. So it is well worth allocating 10x the memory you need rather than letting even a single grow happen. Know the size of your hash map and preallocate if at all possible; it's a huge performance win. And if you have a bad implementation that doesn't resize, picking too small a size can be a disaster.
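The preallocation advice above maps directly to the capacity constructor that Dictionary<TKey,TValue> provides. A small sketch (the element count is arbitrary):

```csharp
using System;
using System.Collections.Generic;

// Passing an expected size up front sizes the internal arrays once,
// avoiding the repeated grow-and-rehash cycles described above.
var preallocated = new Dictionary<int, string>(capacity: 1_000_000);
for (int i = 0; i < 1_000_000; i++)
    preallocated.Add(i, i.ToString());

Console.WriteLine(preallocated.Count); // 1000000
```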
Now what Jon argued with me about in my comment in his answer was that if you add objects to a Dictionary in two different runs, you will get two different orderings. True, but that's not the dictionary's fault.
When you say:
new Foo();
you are creating a new object at a new location in memory.
If you use a Foo object as the key in a dictionary, with no other information, the only thing the dictionary can do is use the object's identity (effectively its location in memory) as the key.
That means that
var f1 = new Foo(1);
var f2 = new Foo(1);
f1 and f2 are not the same object, even if they have the same values.
So if you were to put them into Dictionaries:
var test = new Dictionary<Foo, string>();
test.Add(f1, "zero");
don't expect it to be the same as:
var test = new Dictionary<Foo, string>();
test.Add(f2, "zero");
even if both f1 and f2 have the same values. That has nothing to do with the deterministic behavior of the Dictionary.
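The flip side: if Foo overrides Equals and GetHashCode so that equality is based on the stored value rather than object identity, two distinct instances do act as the same key. A sketch (Foo here stands in for the hypothetical class from the answer above):

```csharp
using System;
using System.Collections.Generic;

var test = new Dictionary<Foo, string>();
test.Add(new Foo(1), "zero");

// A different instance with the same value finds the entry,
// because hashing and equality now look at Value, not identity.
Console.WriteLine(test[new Foo(1)]); // zero

class Foo : IEquatable<Foo>
{
    public int Value { get; }
    public Foo(int value) => Value = value;

    public bool Equals(Foo other) => other != null && other.Value == Value;
    public override bool Equals(object obj) => Equals(obj as Foo);
    public override int GetHashCode() => Value.GetHashCode();
}
```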
Hashing is an awesome topic in computer science, my favorite to teach in data structures.
Check out Introduction to Algorithms by Cormen, Leiserson, Rivest, and Stein for a high-end book on red-black trees vs. hashing.
This guy named Bob has a great site about hashing, and optimal hashes: http://burtleburtle.net/bob
The order is non-deterministic.
From here
For purposes of enumeration, each item in the dictionary is treated as a KeyValuePair structure representing a value and its key. The order in which the items are returned is undefined.
Maybe OrderedDictionary is what you need.
I don't know C# or any of .NET, but the general concept of a Dictionary is that it's a collection of key-value pairs.
You don't access a dictionary sequentially the way you would iterate, for example, a list or an array.
You access by having a key, then finding whether there's a value for that key on the dictionary and what is it.
In your example you posted a dictionary with numerical keys which happen to be sequential, without gaps and in ascending order of insertion.
But no matter in which order you insert a value for key '2', you will always get the same value when querying for key '2'.
I don't know whether C# permits key types other than numbers (I assume it does), but in that case it's the same: there is no explicit order on the keys.
The analogy with a real-life dictionary could be confusing, because its keys, the words, are alphabetically ordered so we can find them faster. But if they weren't, the dictionary would work anyway, because the definition of the word "Aardvark" would have the same meaning even if it came after "Zebra". Think of a novel, on the other hand: changing the order of the pages wouldn't make any sense, as a novel is an ordered collection in essence.
The class Dictionary<TKey,TValue> is implemented using an array-backed index-linked list. If no items are ever removed, the backing store will hold items in order. When an item is removed, however, the space will be marked for reuse before the array is expanded. As a consequence, if e.g. ten items are added to a new dictionary, the fourth item is deleted, a new item is added, and the dictionary is enumerated, the new item will likely appear fourth rather than tenth, but there is no guarantee that different versions of Dictionary will handle things the same way.
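A rough sketch of the layout that answer describes; the field names here are illustrative, not the exact identifiers in the BCL source:

```csharp
// buckets: an int[] mapping (hash % buckets.Length) -> index of first entry.
// entries: an Entry[] filled in insertion order; removals put slots on a free list.
struct Entry<TKey, TValue>
{
    public int HashCode;  // cached hash code of the key
    public int Next;      // index of the next entry in the same bucket's chain
    public TKey Key;
    public TValue Value;
}
// Enumeration simply walks entries[0..count), which is why insertion order
// survives until a removal recycles a slot in the middle of the array.
```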
IMHO, it would have been helpful for Microsoft to document that a dictionary from which no items are ever deleted will enumerate items in their original order, but that once any item is deleted, future changes to the dictionary may arbitrarily permute the items within it. Upholding such a guarantee as long as no items are deleted would be relatively cheap for most reasonable dictionary implementations; continuing to uphold it after items are deleted would be much more expensive.
Alternatively, it might have been helpful to have an AddOnlyDictionary which would be thread-safe for a single writer operating simultaneously with any number of readers, and which would guarantee to retain items in sequence (note that if items are only ever added--never deleted or otherwise modified--one may take a "snapshot" merely by noting how many items it presently contains). Making a general-purpose dictionary thread-safe is expensive, but adding the above level of thread-safety would be cheap. Note that efficient multi-writer multi-reader usage would not require a reader-writer lock, but could simply be handled by having writers lock and readers not bother to.
Microsoft didn't implement an AddOnlyDictionary as described above, of course, but it's interesting to note that the thread-safe ConditionalWeakTable has add-only semantics, probably because--as noted--it's much easier to add concurrency to add-only collections than to collections which allow deletion.
Dictionary<string, Obj>, not SortedDictionary<string, Obj>, happens to enumerate in insertion order by default (an implementation detail, as noted above). Strangely enough, you need to specifically declare a SortedDictionary to get a dictionary sorted by key order:
public SortedDictionary<string, Row> forecastMTX = new SortedDictionary<string, Row>();


Fastest way to get any element from a Dictionary

I'm implementing A* in C# (not for pathfinding), and I need a Dictionary to hold the open nodes, because I need fast insertion and fast lookup. I want to get the first open node from the Dictionary (it can be any node). Using Dictionary.First() is very slow. If I use an iterator, MoveNext() still uses 15% of my program's total CPU time. What is the fastest way to get an arbitrary element from a Dictionary?
I suggest you use a specialized data structure for this purpose, as the regular Dictionary was not made for this.
In Java, I would probably recommend LinkedHashMap, for which there are custom C# equivalents (not built-in sadly) (see).
It is, however, rather easy to implement this yourself in a reasonable fashion. You could, for instance, use a regular dictionary whose values are tuples holding the actual data plus a pointer to the next element. Or you could keep a secondary stack that simply stores all keys in order of addition. Just some ideas; I never implemented or profiled this myself, but I'm sure you'll find a good way.
Oh, and if you didn't already, you might also want to check the hash code distribution, to make sure there is no problem there.
Finding the first element of a dictionary (or the element at a given index) is actually O(n), because the enumerator has to iterate over buckets until a non-empty one is found, so MoveNext will actually be the fastest way.
If this were a problem, I would consider using something like a stack, where pop is an O(1) operation.
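One way to combine hash-based membership with an O(1) "give me any node" operation is to mirror the keys in a stack and skip stale entries on pop. This is a sketch, not a tuned implementation; OpenSet and its members are invented names:

```csharp
using System;
using System.Collections.Generic;

// A HashSet for membership tests, plus a stack of candidates for popping.
// Entries removed from the set but still on the stack are skipped on pop.
class OpenSet<T>
{
    private readonly HashSet<T> _members = new HashSet<T>();
    private readonly Stack<T> _order = new Stack<T>();

    public bool Add(T item)
    {
        if (!_members.Add(item)) return false;
        _order.Push(item);
        return true;
    }

    public bool Contains(T item) => _members.Contains(item);

    public T PopAny()
    {
        while (_order.Count > 0)
        {
            var item = _order.Pop();
            if (_members.Remove(item)) return item; // skip stale entries
        }
        throw new InvalidOperationException("Set is empty.");
    }
}
```

Stale entries cost a little memory, but each stack entry is popped at most once, so PopAny stays amortized O(1).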
Try
Enumerable.ToList(dictionary.Values)[new Random().Next(dictionary.Count)]
Should have pretty good performance, but watch out for memory usage if your dictionary is huge. Obviously, take care not to create the Random object every time, and you might be able to cache the return value of Enumerable.ToList if its members don't change too frequently.

What is the fastest way of changing Dictionary<K,V>?

This is an algorithmic question.
I have got Dictionary<object,Queue<object>>. Each queue contains one or more elements in it. I want to remove all queues with only one element from the dictionary. What is the fastest way to do it?
Pseudo-code: foreach(item in dict) if(item.Length==1) dict.Remove(item);
It is easy to do it in a loop (not foreach, of course), but I'd like to know which approach is the fastest one here.
Why I want it: I use that dictionary to find duplicate elements in a large set of objects. The Key in dictionary is kind of a hash of the object, the Value is a queue of all objects found with the same hash. Since I want only duplicates, I need to remove all items with just a single object in associated queue.
Update:
It may be important to know that in the typical case there are just a few duplicates in a large set of objects; let's assume 1% or less. So possibly it could be faster to leave the Dictionary as is and create a new one from scratch with just the selected elements from the first one, then delete the first Dictionary completely. I think it depends on the complexity of the Dictionary methods used in each particular algorithm.
I really want to see this problem on a theoretical level because as a teacher I want to discuss it with students. I didn't provide any concrete solution myself because I think it is really easy to do it. The question is which approach is the best, the fastest.
var itemsWithOneEntry = dict.Where(x => x.Value.Count == 1)
                            .Select(x => x.Key)
                            .ToList();
foreach (var item in itemsWithOneEntry) {
    dict.Remove(item);
}
Instead of trying to optimize traversal of the collection, how about optimizing the collection's contents so that it only ever contains the duplicates? That would require changing your collection-building algorithm to something like this:
var duplicates = new Dictionary<object, Queue<object>>();
var possibleDuplicates = new Dictionary<object, object>();
foreach (var item in original) {
    if (possibleDuplicates.ContainsKey(item)) {
        // second sighting: promote to the duplicates dictionary
        duplicates.Add(item, new Queue<object>(new[] { possibleDuplicates[item], item }));
        possibleDuplicates.Remove(item);
    } else if (duplicates.ContainsKey(item)) {
        duplicates[item].Enqueue(item);
    } else {
        possibleDuplicates.Add(item, item);
    }
}
Note that you should probably measure the impact of this on the performance in a realistic scenario before you bother to make your code any more complex than it really needs to be. Most imagined performance problems are not in fact the real cause of slow code.
But supposing you do find that you could get a speed advantage by avoiding a linear search for queues of length 1, you could solve this problem with a technique called indexing.
As well as your dictionary containing all the queues, you maintain an index container (probably another dictionary) that only contains the queues of length 1, so when you need them they are already available separately.
To do this, you need to enhance all the operations that modify the length of the queue, so that they have the side-effect of updating the index container.
One way to do it is to define a class ObservableQueue. This would be a thin wrapper around Queue except it also has a ContentsChanged event that fires when the number of items in the queue changes. Use ObservableQueue everywhere instead of the plain Queue.
Then, when you create a new queue, subscribe to its ContentsChanged event a handler that checks whether the queue has exactly one item, and inserts the queue into or removes it from the index container accordingly.
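A minimal sketch of that ObservableQueue idea: a thin wrapper around Queue&lt;T&gt; that raises an event whenever the count changes. All names here are illustrative, not from any library:

```csharp
using System;
using System.Collections.Generic;

// Wraps Queue<T> and fires ContentsChanged after every mutation, so an
// index of single-element queues can be kept up to date as a side effect.
class ObservableQueue<T>
{
    private readonly Queue<T> _inner = new Queue<T>();

    public event Action<ObservableQueue<T>> ContentsChanged;

    public int Count => _inner.Count;

    public void Enqueue(T item)
    {
        _inner.Enqueue(item);
        ContentsChanged?.Invoke(this);
    }

    public T Dequeue()
    {
        var item = _inner.Dequeue();
        ContentsChanged?.Invoke(this);
        return item;
    }
}
```

The index container's handler would check queue.Count == 1 and add or remove the queue from the index accordingly.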
