Dictionary not preserving order if removed key is added later [duplicate] - c#

I have read this in answer to many questions on here. But what exactly does it mean?
var test = new Dictionary<int, string>();
test.Add(0, "zero");
test.Add(1, "one");
test.Add(2, "two");
test.Add(3, "three");
Assert(test.ElementAt(2).Value == "two");
The above code seems to work as expected. So in what manner is a dictionary considered unordered? Under what circumstances could the above code fail?

Well, for one thing it's not clear whether you expect this to be insertion-order or key-order. For example, what would you expect the result to be if you wrote:
var test = new Dictionary<int, string>();
test.Add(3, "three");
test.Add(2, "two");
test.Add(1, "one");
test.Add(0, "zero");
Console.WriteLine(test.ElementAt(0).Value);
Would you expect "three" or "zero"?
As it happens, I think the current implementation preserves insertion ordering so long as you never delete anything - but you must not rely on this. It's an implementation detail, and that could change in the future.
Deletions also affect this. For example, what would you expect the result of this program to be?
using System;
using System.Collections.Generic;
class Test
{
static void Main()
{
var test = new Dictionary<int, string>();
test.Add(3, "three");
test.Add(2, "two");
test.Add(1, "one");
test.Add(0, "zero");
test.Remove(2);
test.Add(5, "five");
foreach (var pair in test)
{
Console.WriteLine(pair.Key);
}
}
}
It's actually (on my box) 3, 5, 1, 0. The new entry for 5 has used the vacated entry previously used by 2. That's not going to be guaranteed either though.
Rehashing (when the dictionary's underlying storage needs to be expanded) could affect things... all kinds of things do.
Just don't treat it as an ordered collection. It's not designed for that. Even if it happens to work now, you're relying on undocumented behaviour which goes against the purpose of the class.
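If you do need a particular order when enumerating, ask for it explicitly rather than relying on the internal layout; a minimal sketch (using the test dictionary from above):

using System.Linq;

// Sort explicitly at enumeration time instead of relying on the
// dictionary's internal, undocumented ordering.
foreach (var pair in test.OrderBy(p => p.Key))
{
    Console.WriteLine(pair.Key + ": " + pair.Value);
}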

A Dictionary<TKey, TValue> represents a hash table, and in a hash table there is no notion of order.
The documentation explains it pretty well:
For purposes of enumeration, each item in the dictionary is treated as a KeyValuePair structure representing a value and its key. The order in which the items are returned is undefined.

There are a lot of good ideas here, but they're scattered, so I'm going to try to write an answer that lays them out better, even though the question has already been answered.
First, a Dictionary has no guaranteed order, so you use it only to quickly look up a key and find a corresponding value, or you enumerate through all the key-value pairs without caring what the order is.
If you want order, you use an OrderedDictionary, but the tradeoff is that lookup is slower, so if you don't need order, don't ask for it.
Dictionaries (and HashMap in Java) use hashing. That is O(1) time regardless of the size of your table. Ordered dictionaries typically use some sort of balanced tree, which is O(log2(n)), so access gets slower as your data grows. To compare: 1 million elements is on the order of 2^20, so a tree lookup costs on the order of 20 comparisons, versus 1 hash computation for a hash map. That's a LOT faster.
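In .NET terms the two options look like this (a sketch; SortedDictionary is the tree-based one):

using System.Collections.Generic;

// Hash-based: O(1) expected lookup, no meaningful enumeration order.
var byHash = new Dictionary<int, string> { [2] = "two", [1] = "one" };

// Tree-based: O(log n) lookup, but enumerates in sorted key order.
var byTree = new SortedDictionary<int, string> { [2] = "two", [1] = "one" };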
Hashing is deterministic. Non-determinism would mean that hashing 5 the first time and hashing 5 the next time gave you a different place. That would be completely useless.
What people meant to say is that if you add things to a dictionary, the order is complicated, and subject to change any time you add (or potentially remove) an element. For example, imagine the hash table has 500k slots and holds 400k values. When you add one more, you cross the critical threshold: the table needs about 20% empty space to be efficient, so it allocates a bigger table (say, 1 million entries) and re-hashes all the values. Now they are all in different locations than they were before.
If you build the same Dictionary twice (read my statement carefully, THE SAME), you will get the same order. But as Jon correctly says, don't count on it. Too many things can make it not the same, even the initially allocated size.
This brings up an excellent point. It is really, really expensive to resize a hash map: you have to allocate a bigger table and re-insert every key-value pair. So it is well worth allocating 10x the memory it needs rather than letting even a single resize happen. Know the size of your hash map and preallocate enough if at all possible; it's a huge performance win. And if you have a bad implementation that doesn't resize, picking too small a size can be a disaster.
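In C#, for example, you can pass the expected capacity to the Dictionary<TKey, TValue> constructor so the table is allocated once up front; a minimal sketch:

// Pre-size the dictionary so the internal table is allocated once,
// instead of growing and re-hashing repeatedly as items are added.
var lookup = new Dictionary<int, string>(1000000);
for (int i = 0; i < 1000000; i++)
{
    lookup.Add(i, i.ToString());
}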
Now what Jon argued with me about in my comment in his answer was that if you add objects to a Dictionary in two different runs, you will get two different orderings. True, but that's not the dictionary's fault.
When you say:
new Foo();
you are creating a new object at a new location in memory.
If you use a Foo object as the key in a dictionary, with no other information, the only thing the dictionary can do is use the object's identity, via the default GetHashCode, as the key.
That means that
var f1 = new Foo(1);
var f2 = new Foo(1);
f1 and f2 are not the same object, even if they have the same values.
So if you were to put them into Dictionaries:
var test = new Dictionary<Foo, string>();
test.Add(f1, "zero");
don't expect it to be the same as:
var test = new Dictionary<Foo, string>();
test.Add(f2, "zero");
even if both f1 and f2 have the same values. That has nothing to do with the deterministic behavior of the Dictionary.
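To make f1 and f2 behave as the same key, the type itself has to define value equality. A minimal sketch, assuming a hypothetical Foo that wraps a single int:

class Foo
{
    public int Value { get; }
    public Foo(int value) { Value = value; }

    // Value equality: two Foos with the same Value count as the same key.
    public override bool Equals(object obj) => obj is Foo other && other.Value == Value;
    public override int GetHashCode() => Value.GetHashCode();
}

var test = new Dictionary<Foo, string>();
test.Add(new Foo(1), "zero");
Console.WriteLine(test[new Foo(1)]); // finds "zero" despite being a different instance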
Hashing is an awesome topic in computer science, my favorite to teach in data structures.
Check out Cormen, Leiserson, Rivest, and Stein (Introduction to Algorithms) for a high-end book on red-black trees vs. hashing.
This guy named Bob has a great site about hashing, and optimal hashes: http://burtleburtle.net/bob

The order is undefined.
From the documentation:
For purposes of enumeration, each item in the dictionary is treated as a KeyValuePair structure representing a value and its key. The order in which the items are returned is undefined.
Maybe OrderedDictionary is what you need.

I don't know C# or any of .NET, but the general concept of a Dictionary is that it's a collection of key-value pairs.
You don't access a dictionary sequentially, as you would when iterating a list or an array, for example.
You access it by key: given a key, you find whether there's a value for that key in the dictionary and what it is.
In your example you posted a dictionary with numerical keys which happen to be sequential, without gaps and in ascending order of insertion.
But no matter in which order you insert a value for key '2', you will always get the same value when querying for key '2'.
I don't know if C# permits key types other than numbers (I'd guess it does), but in that case it's the same: there's no explicit order on the keys.
The analogy with a real-life dictionary could be confusing, as the keys, which are the words, are alphabetically ordered so we can find them faster; but if they weren't, the dictionary would work anyway, because the definition of the word "Aardvark" would have the same meaning even if it came after "Zebra". Think of a novel, on the other hand: changing the order of the pages wouldn't make any sense, as a novel is an ordered collection in essence.

The class Dictionary<TKey,TValue> is implemented using an array-backed index-linked list. If no items are ever removed, the backing store will hold items in order. When an item is removed, however, the space will be marked for reuse before the array is expanded. As a consequence, if e.g. ten items are added to a new dictionary, the fourth item is deleted, a new item is added, and the dictionary is enumerated, the new item will likely appear fourth rather than tenth, but there is no guarantee that different versions of Dictionary will handle things the same way.
IMHO, it would have been helpful for Microsoft to document that a dictionary from which no items are ever deleted will enumerate items in the original order, but that once any items are deleted, any future changes to the dictionary may arbitrarily permute the items therein. Upholding such a guarantee as long as no items are deleted would be relatively cheap for most reasonable dictionary implementations; continuing to uphold the guarantee after items are deleted would be much more expensive.
Alternatively, it might have been helpful to have an AddOnlyDictionary which would be thread-safe for a single writer running simultaneously with any number of readers, and which would guarantee to retain items in sequence (note that if items are only added, never deleted or otherwise modified, one may take a "snapshot" merely by noting how many items it presently contains). Making a general-purpose dictionary thread-safe is expensive, but adding the above level of thread-safety would be cheap. Note that efficient multi-writer multi-reader usage would not require a reader-writer lock, but could simply be handled by having writers lock and having readers not bother to lock.
Microsoft didn't implement an AddOnlyDictionary as described above, of course, but it's interesting to note that the thread-safe ConditionalWeakTable has add-only semantics, probably because--as noted--it's much easier to add concurrency to add-only collections than to collections which allow deletion.
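For reference, a small usage sketch of ConditionalWeakTable (a real type in System.Runtime.CompilerServices; the key and value classes here are made up for illustration):

using System.Runtime.CompilerServices;

class Widget { }                          // hypothetical key type (must be a class)
class WidgetData { public string Note; }  // hypothetical value type

var table = new ConditionalWeakTable<Widget, WidgetData>();
var widget = new Widget();

// Add-only usage: attach extra data to an existing object.
table.Add(widget, new WidgetData { Note = "attached" });

if (table.TryGetValue(widget, out WidgetData data))
{
    Console.WriteLine(data.Note); // prints "attached"
}
// The entry disappears automatically once 'widget' is garbage-collected.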

Dictionary<string, Obj>, not SortedDictionary<string, Obj>, enumerates by default in insertion order (an implementation detail, as noted above). Strangely enough, you need to specifically declare a SortedDictionary to get a dictionary that is sorted by key order:
public SortedDictionary<string, Row> forecastMTX = new SortedDictionary<string, Row>();

Related

Is Dictionary.Keys order guaranteed to be the same if the Dictionary has not been modified?

Based on the docs:
The order of the keys in the Dictionary.KeyCollection is unspecified.
OK, I'm fine with that. But what if I did not modify the dictionary's keys or its values?
Let's say I do:
dictionary.Keys.ToList();
Thread.Sleep(5000);
dictionary.Keys.ToList();
can I safely assume the order would be the same?
Iterating a Dictionary is a deterministic process. It is based on the implementation-specific way the items are organized inside hash "buckets", tie resolution, and insertion order.
However, the order of iterating a dictionary does not depend on anything arbitrary that could change between iterations. You can see how the iteration is done in the source of Dictionary.Enumerator here: its bool MoveNext() method walks the dictionary.entries array one element at a time, stopping when the end of the used region is reached. Hence you can safely assume that the order of iterating a dictionary is not going to change when you do not modify the dictionary between iterations.
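As a rough illustration, the logic described above looks something like this (a simplified sketch, not the actual BCL source):

// Walks the flat entries array in insertion-slot order, skipping slots
// freed by removals, until the end of the used region is reached.
public bool MoveNext()
{
    while (index < dictionary.count)
    {
        if (dictionary.entries[index].hashCode >= 0) // slot is in use
        {
            current = new KeyValuePair<TKey, TValue>(
                dictionary.entries[index].key,
                dictionary.entries[index].value);
            index++;
            return true;
        }
        index++;
    }
    return false;
}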

Enumerating hash based collections

As much as I have read and understood, the Dictionary internally uses hashing to map a key to some memory location (called a bucket) and stores the value there in a linked list. If another key hashes to the same location, it is probably appended to that linked list.
I am pretty sure there are enough buckets to make the insertions and removals in the dictionary fast enough but besides that there is also some housekeeping going on. Every time a key is added to the dictionary, the Dictionary.Keys and Dictionary.Values properties are also updated. I wanted to understand the complexity of that.
How does the foreach loop enumerate through all the keys or values in a dictionary? I tried to Google the answer but didn't really reach a concrete one. My research comes down to answering this question: "How has Microsoft probably implemented the KeyCollection class, or even the ValueCollection class for that matter?"
It might be using a List (essentially an array) or a LinkedList underneath. But then removing a key from the dictionary would cause inefficient housekeeping, wouldn't it?
There is a similar situation with HashSet. The elements are inserted in a seemingly arbitrary fashion based on some hash code. How does foreach enumerate every element when there is no order and no links from one element to another?
Can someone please give me an insight?
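(For reference, the layout being described is roughly the following: a common chained hash table design in which collisions are linked through indices into one flat entries array. The field names are illustrative, not the actual BCL source.)

// buckets[h] holds the index of the first entry whose hash maps to bucket h;
// each entry's 'next' field is the index of the next entry in the same chain.
struct Entry<TKey, TValue>
{
    public int hashCode;  // cached hash code of the key
    public int next;      // index of the next entry in the chain, or -1
    public TKey key;
    public TValue value;
}

// Enumeration ignores the buckets entirely and simply walks entries[0..count),
// which is why foreach needs no ordering links between elements.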

What is the fastest way of changing Dictionary<K,V>?

This is an algorithmic question.
I have got Dictionary<object,Queue<object>>. Each queue contains one or more elements in it. I want to remove all queues with only one element from the dictionary. What is the fastest way to do it?
Pseudo-code: foreach (item in dict) if (item.Value.Count == 1) dict.Remove(item.Key);
It is easy to do it in a loop (not foreach, of course), but I'd like to know which approach is the fastest one here.
Why I want it: I use that dictionary to find duplicate elements in a large set of objects. The Key in dictionary is kind of a hash of the object, the Value is a queue of all objects found with the same hash. Since I want only duplicates, I need to remove all items with just a single object in associated queue.
Update:
It may be important to know that in the regular case there are just a few duplicates in a large set of objects. Let's assume 1% or less. So possibly it could be faster to leave the Dictionary as is and create a new one from scratch with just the selected elements from the first one... and then delete the first Dictionary completely. I think it depends on the complexity of the Dictionary class's methods used in the particular algorithms.
I really want to see this problem on a theoretical level because as a teacher I want to discuss it with students. I didn't provide any concrete solution myself because I think it is really easy to do it. The question is which approach is the best, the fastest.
var itemsWithOneEntry = dict.Where(x => x.Value.Count == 1)
                            .Select(x => x.Key)
                            .ToList();
foreach (var item in itemsWithOneEntry)
{
    dict.Remove(item);
}
Instead of trying to optimize the traversal of the collection, how about optimizing the content of the collection so that it only ever contains the duplicates? That would mean changing your collection algorithm to something like this:
var duplicates = new Dictionary<object, Queue<object>>();
var possibleDuplicates = new Dictionary<object, object>();
foreach (var item in original)
{
    if (possibleDuplicates.ContainsKey(item))
    {
        // Second occurrence: promote to the duplicates dictionary.
        var queue = new Queue<object>();
        queue.Enqueue(possibleDuplicates[item]);
        queue.Enqueue(item);
        duplicates.Add(item, queue);
        possibleDuplicates.Remove(item);
    }
    else if (duplicates.ContainsKey(item))
    {
        // Third or later occurrence: append to the existing queue.
        duplicates[item].Enqueue(item);
    }
    else
    {
        // First occurrence: remember it as a possible duplicate.
        possibleDuplicates.Add(item, item);
    }
}
Note that you should probably measure the impact of this on the performance in a realistic scenario before you bother to make your code any more complex than it really needs to be. Most imagined performance problems are not in fact the real cause of slow code.
But supposing you do find that you could get a speed advantage by avoiding a linear search for queues of length 1, you could solve this problem with a technique called indexing.
As well as your dictionary containing all the queues, you maintain an index container (probably another dictionary) that only contains the queues of length 1, so when you need them they are already available separately.
To do this, you need to enhance all the operations that modify the length of the queue, so that they have the side-effect of updating the index container.
One way to do it is to define a class ObservableQueue. This would be a thin wrapper around Queue except it also has a ContentsChanged event that fires when the number of items in the queue changes. Use ObservableQueue everywhere instead of the plain Queue.
Then when you create a new queue, enlist on its ContentsChanged event a handler that checks to see if the queue only has one item. Based on this you can either insert or remove it from the index container.
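A minimal sketch of that idea (the class and event names follow the answer's description; the details are illustrative):

using System;
using System.Collections.Generic;

// Thin wrapper around Queue<T> that raises an event whenever its count changes.
class ObservableQueue<T>
{
    private readonly Queue<T> inner = new Queue<T>();
    public event Action<ObservableQueue<T>> ContentsChanged;

    public int Count => inner.Count;

    public void Enqueue(T item)
    {
        inner.Enqueue(item);
        ContentsChanged?.Invoke(this);
    }

    public T Dequeue()
    {
        T item = inner.Dequeue();
        ContentsChanged?.Invoke(this);
        return item;
    }
}

The handler you enlist on ContentsChanged then adds the queue to the index container when Count == 1 and removes it otherwise, so the single-element queues are always available without a scan.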

LINQ ToDictionary initial capacity

I regularly use the LINQ extension method ToDictionary, but am wondering about the performance. There is no parameter to define the capacity for the dictionary and with a list of 100k items or more, this could become an issue:
IList<int> list = new List<int> { 1, 2, ... , 1000000 };
IDictionary<int, string> dictionary = list.ToDictionary(x => x, x => x.ToString("D7"));
Does the implementation actually take the list.Count and passes it to the constructor for the dictionary?
Or is the resizing of the dictionary fast enough, so I don't really have to worry about it?
Does the implementation actually take the list.Count and passes it to
the constructor for the dictionary?
No. According to ILSpy, the implementation is basically this:
Dictionary<TKey, TElement> dictionary = new Dictionary<TKey, TElement>(comparer);
foreach (TSource current in source)
{
    dictionary.Add(keySelector(current), elementSelector(current));
}
return dictionary;
If you profile your code and determine that the ToDictionary operation is your bottleneck, it's trivial to make your own function based on the above code.
Does the implementation actually take the list.Count and passes it to the constructor for the dictionary?
This is an implementation detail and it shouldn't matter to you.
Or is the resizing of the dictionary fast enough, so I don't really have to worry about it?
Well, I don't know. Only you know whether or not this is actually a bottleneck in your application, and whether or not the performance is acceptable. If you want to know if it's fast enough, write the code and time it. As Eric Lippert is wont to say, if you want to know how fast two horses are, do you pit them in a race against each other, or do you ask random strangers on the Internet which one is faster?
That said, I'm having a really hard time imaging this being a bottleneck in any realistic application. If adding items to a dictionary is a bottleneck in your application, you're doing something wrong.
I don't think it'll be a bottleneck TBH. And in case you have real complaints and issues, you should look into it at that time to see if you can improve it, may be you can do paging instead of converting everything at once.
I don't know about resizing the dictionary, but checking the implementation with dotPeek.exe suggests that the implementation does not take the list length.
What the code basically does is:
create a new dictionary
iterate over sequence and add items
If you find this a bottleneck, it would be trivial to create your own extension method ToDictionaryWithCapacity that works on something that can have its length actually computed without iterating the whole thing.
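For instance, a sketch of such an extension method (the name comes from the answer above; the implementation is illustrative and assumes the source can report its count cheaply):

using System;
using System.Collections.Generic;

static class DictionaryExtensions
{
    // Pre-sizes the dictionary from the collection's known count so the
    // internal table never has to grow while it is being populated.
    public static Dictionary<TKey, TValue> ToDictionaryWithCapacity<TSource, TKey, TValue>(
        this IReadOnlyCollection<TSource> source,
        Func<TSource, TKey> keySelector,
        Func<TSource, TValue> valueSelector)
    {
        var dictionary = new Dictionary<TKey, TValue>(source.Count);
        foreach (var item in source)
        {
            dictionary.Add(keySelector(item), valueSelector(item));
        }
        return dictionary;
    }
}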
Just scanned the Dictionary implementation. Basically, when it starts to fill up, the internal list is resized by roughly doubling it to a near prime. So that should not happen too frequently.
Does the implementation actually take the list.Count and passes it to the constructor for the dictionary?
It doesn't. That's because calling Count() would enumerate the source, and then adding the items to the dictionary would enumerate the source a second time. It's not a good idea to enumerate the source twice; for example, this would fail on DataReaders.
Or is the resizing of the dictionary fast enough, so I don't really have to worry about it?
The Dictionary.Resize method is used to expand the dictionary. It allocates new internal arrays and copies the existing entries into them (using Array.Copy). The dictionary size is increased in prime-number steps.
This is not the fastest way, but fast enough if you do not know the size.
