Let's assume I have data of size N (i.e. N elements) and the dictionary was created with capacity N. What is the complexity of:
space -- of entire dictionary
time -- adding entry to dictionary
MS revealed only that entry retrieval is close to O(1). But what about the rest?
What is the complexity of space -- of entire dictionary
Dictionary uses an associative array data structure, which has O(N) space complexity.
MSDN says: "the Dictionary class is implemented as a hash table", and a hash table is in turn an implementation of an associative array.
What is the complexity of time -- adding entry to dictionary
A single add takes amortized O(1) time. In most cases it is O(1); it becomes O(N) when the underlying data structure needs to grow. As the latter happens only infrequently, people use the word "amortized".
The time complexity of adding a new entry is documented under Dictionary<TKey,TValue>.Add():
If Count is less than the capacity, this method approaches an O(1) operation. If the capacity must be increased to accommodate the new element, this method becomes an O(n) operation, where n is Count.
It is not formally documented, but widely stated (and visible via disassembly) that the underlying storage is an array of name-value pairs. Thus space complexity is O(n).
As it is not part of the specification this could, in theory, change; in practice it is highly unlikely to, because it would change the performance of various operations (e.g. enumeration) in ways that could be visible.
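As a rough illustration of both points (the element count of one million is arbitrary), pre-sizing the dictionary keeps every Add close to O(1), while the default constructor occasionally pays for an O(n) resize along the way:

using System.Collections.Generic;

// Capacity chosen up front: the internal array never has to grow,
// so each Add stays close to O(1).
var presized = new Dictionary<int, string>(1000000);
for (int i = 0; i < 1000000; i++)
    presized.Add(i, i.ToString());

// Default capacity: the same adds occasionally trigger an O(n) resize,
// which is why the cost is described as amortized O(1).
var growing = new Dictionary<int, string>();
for (int i = 0; i < 1000000; i++)
    growing.Add(i, i.ToString());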
Related
Consider for example the documentation for the .NET Framework 4.5 Dictionary<TKey, TValue> class:
In the remarks for the .ContainsKey method, they state that
This method approaches an O(1) operation.
And in the remarks for the .Count property, they state that
Retrieving the value of this property is an O(1) operation.
Note that I am not necessarily asking for the details of C#, .NET, Dictionary or what Big O notation is in general. I just found this distinction of "approaches" intriguing.
Is there any difference? If so, how significant can it potentially be? Should I pay attention to it?
If the hash function used by the underlying objects in the hash table is "good", collisions will be very rare. Odds are there will be only one item at a given hash bucket, maybe two, and almost never more. If you could say, for certain, that there will never be more than c items (where c is a constant) in a bucket, then the operation would be O(c) (which is O(1)). But that assurance can't be made. It's possible that you just happen to have n different items that, unluckily enough, all collide and end up in the same bucket, and in that case ContainsKey is O(n). It's also possible that the hash function isn't "good" and results in frequent hash collisions, which can make the actual contains check worse than just O(1).
It's because Dictionary is an implementation of a hash table. This means that a key lookup is done by using a hashing function that tells you which bucket, among the many buckets contained in the data structure, holds the value you're looking up. Usually, for a good hashing function and a large enough set of buckets, each bucket contains only a single element, in which case the complexity is indeed O(1). Unfortunately this is not always true: the hashing function may have clashes, in which case a bucket may contain more than one entry, and the algorithm has to iterate through the bucket until it finds the entry you're looking for. So it's no longer O(1) for these (hopefully) rare cases.
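To make the rare worst case concrete, here is a contrived sketch (the BadKey type is invented for illustration): a key whose GetHashCode always returns the same value forces every entry into one bucket, so ContainsKey degrades to a linear scan.

using System.Collections.Generic;

// Contrived key type: every instance hashes to the same bucket.
class BadKey
{
    public int Id;
    public override int GetHashCode() => 0;   // all keys collide
    public override bool Equals(object obj) => obj is BadKey other && other.Id == Id;
}

class CollisionDemo
{
    static void Main()
    {
        var d = new Dictionary<BadKey, string>();
        for (int i = 0; i < 10000; i++)
            d.Add(new BadKey { Id = i }, "value");             // all entries share one bucket

        bool found = d.ContainsKey(new BadKey { Id = 9999 });  // walks the whole chain: O(n)
    }
}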
O(1) is a constant running time; "approaches O(1)" is close to constant but not quite, and for most purposes the difference is negligible. You should not pay attention to it.
Something either is or isn't O(1). I think what they're trying to say is that running time is approximately O(1) per operation for a large number of operations.
I have a game loop which draws a list of objects. The list, called "mylist", holds something like 1000 objects; objects (especially bullets, which fly fast and hit things) need to be added to and removed from the list constantly, a few objects every second.
If I understand correctly, insertion into a List is practically free if the list capacity is large enough. The problem is with removal, which is O(n): first of all I need to find the item in the list, and second all the items after it have to be shifted after the removal.
If I could aggregate all the removals and apply them once per frame, it would be efficient, because I could use mylist.Except(listToRemove) and that would be O(n).
But unfortunately I can't do that.
A linked list is also problematic, because I still need to find the object in the list.
Does anyone have a better suggestion?
What about a HashSet? It supports O(1) insert and removal. If you need ordering, use a tree data structure or a LinkedList plus a Dictionary that allows you to find nodes quickly.
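A rough sketch of the LinkedList-plus-Dictionary idea (the GameObject type and class name are placeholders): the dictionary maps each object to its LinkedListNode, so removal is O(1) and iteration order is preserved for the draw loop.

using System.Collections.Generic;

class GameObject { /* stand-in for the real game object type */ }

// Insertion-ordered collection with O(1) add and O(1) remove.
class OrderedGameObjects
{
    private readonly LinkedList<GameObject> order = new LinkedList<GameObject>();
    private readonly Dictionary<GameObject, LinkedListNode<GameObject>> nodes =
        new Dictionary<GameObject, LinkedListNode<GameObject>>();

    public void Add(GameObject obj) => nodes.Add(obj, order.AddLast(obj));   // O(1)

    public void Remove(GameObject obj)
    {
        if (nodes.TryGetValue(obj, out var node))
        {
            order.Remove(node);   // O(1): we already hold the node, no search needed
            nodes.Remove(obj);    // O(1) on average
        }
    }

    public IEnumerable<GameObject> InOrder() => order;   // iterate this in the draw loop
}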
The BCL has a SortedList, a SortedSet and a SortedDictionary. SortedSet and SortedDictionary are tree-based internally and insert and remove in O(log n); SortedList is backed by sorted arrays, so it is the slow one for inserts and removals (O(n)).
If you are doing searching then your best bet is a Dictionary, which has an average case of O(1) and a worst case of O(n) (when everything hashes to the same bucket).
If you need to preserve order you can use any implementation of a balanced binary search tree and your performance will be O(log n).
In .NET this simply comes down to:
Dictionary -- insertion/removal O(1) (average case)
SortedDictionary -- preserves order for ordered operations, insertion/removal O(log n)
So ideally you would want a data structure that has O(1) or O(log n) remove and insert. A dictionary-type data structure would probably be ideal due to its constant average-case insert and delete time complexity. I would recommend a HashSet, overriding GetHashCode() (and Equals()) on your game object base class.
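A minimal sketch of that suggestion (the Id property and class names are assumptions, not part of any existing API): give the base class a cheap, stable hash so HashSet operations stay O(1) on average.

using System.Collections.Generic;

abstract class GameObjectBase
{
    public int Id { get; }                      // assumed: unique, assigned once at creation
    protected GameObjectBase(int id) { Id = id; }

    public override int GetHashCode() => Id;    // cheap, stable hash
    public override bool Equals(object obj) =>
        obj is GameObjectBase other && other.Id == Id;
}

// Usage: average O(1) add/remove, no ordering guarantees.
// var live = new HashSet<GameObjectBase>();
// live.Add(bullet);
// live.Remove(bullet);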
Here are more details on all the different dictionary/hash-table data structures.
Time complexity
Here are all the 'dictionary-like' data structures and their time-complexities:
Type              Find by key   Remove     Add
HashSet           O(1)*         O(1)*      O(1)**
Dictionary        O(1)*         O(1)*      O(1)**
SortedList        O(log n)      O(n)       O(n)
SortedDictionary  O(log n)      O(log n)   O(log n)
* O(n) with collision
** O(n) with collision or when adding beyond the array's capacity.
HashSet vs Dictionary
The difference between a HashSet and a Dictionary is that a Dictionary works on a KeyValuePair whereas with a HashSet the key is the object itself (its GetHashCode() method).
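For example (the element values are arbitrary):

using System.Collections.Generic;

// HashSet: the element is its own key.
var ids = new HashSet<int> { 1, 2, 3 };
bool hasTwo = ids.Contains(2);                  // hashes the element itself

// Dictionary: a separate key maps to a stored value.
var scores = new Dictionary<int, float> { { 1, 0.5f }, { 2, 0.9f } };
bool hasKeyTwo = scores.ContainsKey(2);         // hashes the key
float score = scores[2];                        // value retrieved alongside the key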
SortedList vs SortedDictionary
You should only consider these if you need to maintain an order. The difference is that SortedDictionary uses a red-black tree and SortedList uses sorted arrays for its keys and values.
References
My blog post - .NET simple collections and dictionaries
MSDN - HashSet
MSDN - Dictionary
MSDN - SortedDictionary
MSDN - SortedList
Performance-wise, you'll probably see the best results by simply not inserting or removing anything from any kind of list. Allocate a sufficiently large array and tag all your objects with whether they should be rendered or not.
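A minimal sketch of that approach (the IsActive flag and Draw method are invented for illustration):

class GameObject
{
    public bool IsActive;                 // false means "skip me when drawing"
    public void Draw() { /* render this object */ }
}

class Game
{
    // Pre-allocated pool: nothing is ever inserted or removed during the game loop.
    private readonly GameObject[] pool = new GameObject[2048];

    public Game()
    {
        for (int i = 0; i < pool.Length; i++)
            pool[i] = new GameObject();
    }

    public void RenderFrame()
    {
        foreach (var obj in pool)
            if (obj.IsActive)
                obj.Draw();
    }

    // "Removing" a bullet is just clearing its flag: O(1), no allocation, no shifting.
    public void Kill(GameObject obj) => obj.IsActive = false;
}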
Given your requirements, however, this is all probably premature optimisation; you could very well re-create the entire list of 1000 elements 30 times per second and not see any performance drop.
I suggest reading the slides of this presentation given by Braid's creator, on using the right data structures in independent video games: http://the-witness.net/news/2011/06/how-to-program-independent-games/ (tl;dr just use arrays for everything).
The .NET Dictionary<TKey,TValue>'s internal structure and process is a highly optimized design, as discussed by Simon Cooper in this excellent blog post.
The MSDN docs state that Add(TKey,TValue) is O(1) -- unless the Dictionary's element count is at capacity, necessitating that a dynamic resize operation first be performed, thus making Add() O(n) at these junctures.
As the Dictionary grows, resize operations become progressively more infrequent, and therefore it may be said that averaged over large n, Add() approaches O(1).
This is evidenced by Cooper in this graph of total elapsed time for n/2 add operations as a function of n.
However, the average worst case performance of Add() is O(n).
My question : Is it possible to design a more consistently performant data structure than the .NET Dictionary ?
Specifically, I want average worst case performance of add, delete, retrieve operations to all be O(1).
Note that consistency of performance ( "big O" ) is the only relevant design criteria. Memory utilization and absolute performance ( including degree of clustering & cache performance ) are not relevant design criteria.
Choosing an initial capacity much larger than anticipated needs is one option, but that is an initialization step, and I am looking for a data structure design.
Have you simply tried creating a dictionary with a big enough size?
Generic O(1) is basically an array: access via index. Or hierarchical arrays (for a 4-byte key, an array pointing to an array pointing to an array pointing to an array, pretty much, to save space).
Anything else -- no, sorry.
Lots of high performance stuff uses preallocated arrays.
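For small, dense, non-negative integer keys the "array as dictionary" idea is as simple as it gets (a sketch; the key bound is assumed to be known up front):

// Works only when keys are non-negative integers with a known upper bound.
const int MaxKey = 1000000;
string[] table = new string[MaxKey + 1];

table[42] = "value";       // add/update: O(1) always -- no hashing, no resize
string v = table[42];      // retrieve:   O(1) always
table[42] = null;          // delete:     O(1) always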
I'm struggling with the concept of when to use binary search trees and when to use dictionaries.
In my application I did a little experiment which used the C5 library TreeDictionary (which I believe is a red-black binary search tree), and the C# dictionary. The dictionary was always faster at add/find operations and also always used less memory space. For example, at 16809 <int, float> entries, the dictionary used 342 KiB whilst the tree used 723 KiB.
I thought that BST's were supposed to be more memory efficient, but it seems that one node of the tree requires more bytes than one entry in a dictionary. What gives? Is there a point at where BST's are better than dictionaries?
Also, as a side question, does anyone know if there is a faster + more memory efficient data structure for storing <int, float> pairs for dictionary type access than either of the mentioned structures?
I thought that BST's were supposed to be more memory efficient, but it seems that one node of the tree requires more bytes than one entry in a dictionary. What gives? Is there a point at where BST's are better than dictionaries?
I've personally never heard of such a principle. Even so, it's only a general principle, not a categorical fact etched into the fabric of the universe.
Generally, Dictionaries are really just a fancy wrapper around an array of linked lists. You insert into the dictionary something like:
LinkedList<Tuple<TKey, TValue>> list =
    internalArray[Math.Abs(key.GetHashCode()) % internalArray.Length];  // pick the bucket
if (list.Any(pair => pair.Item1.Equals(key)))                          // key already present?
    throw new ArgumentException("Key already exists");
list.AddLast(Tuple.Create(key, value));
So it's nearly an O(1) operation. The dictionary uses O(internalArray.Length + n) memory, where n is the number of items in the collection.
In general BSTs can be implemented as:
linked-lists, which use O(n) space, where n is the number items in the collection.
arrays, which use O(2^h - n) space, where h is the height of the tree and n is the number of items in the collection.
Since red-black trees have a bounded height of O(1.44 * log2 n), an array implementation should have a bounded memory usage of about O(2^(1.44 * log2 n) - n)
Odds are, the C5 TreeDictionary is implemented using arrays, which is probably responsible for the wasted space.
What gives? Is there a point at where BST's are better than dictionaries?
Dictionaries have some undesirable properties:
There may not be enough contiguous blocks of memory to hold your dictionary, even if its memory requirements are much less than the total available RAM.
Evaluating the hash function can take an arbitrarily long time. Strings, for example: use Reflector to examine the System.String.GetHashCode method and you'll notice that hashing a string always takes O(n) time, which means it can take considerable time for very long strings. On the other hand, comparing strings for inequality is almost always faster than hashing, since it may require looking at just the first few chars. It's entirely possible for tree inserts to be faster than dictionary inserts if hash code evaluation takes too long.
Int32's GetHashCode method is literally just return this, so you'd be hard-pressed to find a case where a hash table with int keys is slower than a tree dictionary.
RB Trees have some desirable properties:
You can find/remove the Min and Max elements in O(log n) time, compared to O(n) time using a dictionary.
If a tree is implemented as linked list rather than an array, the tree is usually more space efficient than a dictionary.
Likewise, it's ridiculously easy to write immutable versions of trees which support insert/lookup/delete in O(log n) time. Dictionaries do not adapt well to immutability, since you need to copy the entire internal array for every operation (actually, I have seen some array-based implementations of immutable finger trees, a kind of general-purpose dictionary data structure, but the implementation is very complex).
You can traverse all the elements in a tree in sorted order in constant space and O(n) time, whereas you'd need to dump a hash table into an array and sort it to get the same effect.
So, the choice of data structure really depends on what properties you need. If you just want an unordered bag and can guarantee that your hash function evaluates quickly, go with a .NET Dictionary. If you need an ordered bag or have a slow-running hash function, go with a TreeDictionary.
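For instance, with the BCL's tree-based collections (a sketch; the values are arbitrary):

using System;
using System.Collections.Generic;

var tree = new SortedDictionary<int, string> { { 3, "c" }, { 1, "a" }, { 2, "b" } };

// In-order traversal: keys come out already sorted, O(n) time, constant extra space.
foreach (var pair in tree)
    Console.WriteLine($"{pair.Key} -> {pair.Value}");   // 1, 2, 3

// SortedSet exposes the extremes directly.
var set = new SortedSet<int> { 3, 1, 2 };
int min = set.Min;   // 1
int max = set.Max;   // 3

// A plain Dictionary would need an explicit sort to get the same ordering:
// foreach (var pair in dict.OrderBy(p => p.Key)) ...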
It does make sense that a tree node would require more storage than a dictionary entry. A binary tree node needs to store the value and both the left and right subtrees. The generic Dictionary<TKey, TValue> is implemented as a hash table which - I'm assuming - either uses a linked list for each bucket (value plus one pointer/reference) or some sort of remapping (just the value). I'd have to have a peek in Reflector to be sure, but for the purpose of this question I don't think it's that important.
The sparser the hash table, the less efficient in terms of storage/memory. If you create a hash table (dictionary) and initialize its capacity to 1 million, and only fill it with 10,000 elements, then I'm pretty sure it would eat up a lot more memory than a BST with 10,000 nodes.
Still, I wouldn't worry about any of this if the amount of nodes/keys is only in the thousands. That's going to be measured in the kilobytes, compared to gigabytes of physical RAM.
If the question is "why would you want to use a binary tree instead of a hash table?" Then the best answer IMO is that binary trees are ordered whereas hash tables are not. You can only search a hash table for keys that are exactly equal to something; with a tree, you can search for a range of values, nearest value, etc. This is a pretty important distinction if you're creating an index or something similar.
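For example, the tree-based SortedSet can answer a range query directly, which a hash-based set cannot (a sketch; the numbers are arbitrary):

using System;
using System.Collections.Generic;

var prices = new SortedSet<int> { 5, 12, 18, 25, 40 };

// Range query: everything between 10 and 30, already in sorted order.
foreach (int p in prices.GetViewBetween(10, 30))
    Console.WriteLine(p);                 // 12, 18, 25

// A hash-based set can only answer exact-match questions.
var lookup = new HashSet<int> { 5, 12, 18, 25, 40 };
bool exact = lookup.Contains(18);         // fine
// "everything between 10 and 30" would require scanning every entry.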
It seems to me you're doing a premature optimization.
What I'd suggest to you is to create an interface to isolate which structure you're actually using, and then implement the interface using the Dictionary (which seems to work best).
If memory/performance becomes an issue (which it probably will not for around 20k entries), you can then create other implementations of the interface and check which one works best. You won't need to change almost anything in the rest of the code (except which implementation you're using).
The interface for a tree and a hash table (which I'm guessing is what your Dictionary is based on) should be very similar, always revolving around keyed lookups.
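Something along these lines (the interface and class names are invented for the sketch):

using System.Collections.Generic;

// The rest of the code depends only on this interface.
interface IKeyedStore<TKey, TValue>
{
    void Add(TKey key, TValue value);
    bool TryGet(TKey key, out TValue value);
    void Remove(TKey key);
}

// First implementation: backed by Dictionary. A tree-backed version
// (e.g. wrapping a SortedDictionary) could be swapped in later without
// touching the calling code.
class DictionaryStore<TKey, TValue> : IKeyedStore<TKey, TValue>
{
    private readonly Dictionary<TKey, TValue> inner = new Dictionary<TKey, TValue>();

    public void Add(TKey key, TValue value) => inner.Add(key, value);
    public bool TryGet(TKey key, out TValue value) => inner.TryGetValue(key, out value);
    public void Remove(TKey key) => inner.Remove(key);
}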
I had always thought a Dictionary was better for creating things once and then doing lots of lookups on it, while a tree was better if you were modifying it significantly. However, I don't know where I picked that idea up from.
(Functional languages often use trees as the basis for their collections, since you can re-use most of the tree if you make small modifications to it.)
You're not comparing apples with apples: a BST gives you an ordered representation, while a dictionary allows you to do a lookup on a key-value pair (in your case <int, float>).
I wouldn't expect much difference in the memory footprint between the two, but the dictionary will give you a much faster lookup. To find an item in a BST you (potentially) need to traverse the tree, but to do a dictionary lookup you simply look up based on the key.
A balanced BST is preferable if you need to protect your data structure from latency spikes and hash collision attacks.
The former happen when an array-backed structure grows and gets resized; the latter are an inevitable property of any hashing algorithm, which is a projection from an infinite space onto a limited integer range.
Another problem in .NET is the LOH (Large Object Heap): with a sufficiently large dictionary you run into LOH fragmentation. In this case you can use a BST, paying the price of a larger algorithmic complexity class.
In short, with a BST backed by the allocation heap you get O(log(N)) worst-case time; with a hash table you get O(N) worst-case time.
A BST comes at the price of O(log(N)) average time, worse cache locality and more heap allocations, but it has latency guarantees and is protected from dictionary attacks and memory fragmentation.
It is worth noting that a BST is also subject to memory fragmentation on other platforms that do not use a compacting garbage collector.
As for the memory size, the .NET Dictionary`2 class is more memory efficient, because it stores its entries as a linked list embedded in flat arrays, where each entry stores only the value and offset information rather than being a separate heap object.
A BST has to store an object header (as each node is a class instance on the heap), two child pointers, and some augmented data for balanced trees; a red-black tree, for example, needs a boolean interpreted as a color (red or black). That is at least 7 machine words, if I'm not mistaken. So each node of a red-black tree on a 64-bit system is a minimum of:
3 words for the header = 24 bytes
2 words for the child pointers = 16 bytes
1 word for the color = 8 bytes
at least 1 word for the value = 8+ bytes
= 24 + 16 + 8 + 8 = 56 bytes (+8 bytes if the tree uses a parent node pointer).
At the same time, the minimum size of the dictionary entry would be just 16 bytes.
Are C# lists fast? What are the good and bad sides of using lists to handle objects?
Extensive use of lists will make software slower? What are the alternatives to lists in C#?
How many objects is "too many objects" for lists?
List<T> uses a backing array to hold items:
Indexer access (i.e. fetch/update) is O(1)
Remove from tail is O(1)
Remove from elsewhere requires existing items to be shifted up, so O(n) effectively
Add to end is O(1) unless it requires resizing, in which case it's O(n). (This doubles the size of the buffer, so the amortized cost is O(1).)
Add to elsewhere requires existing items to be shifted down, so O(n) effectively
Finding an item is O(n) unless it's sorted, in which case a binary search gives O(log n)
It's generally fine to use lists fairly extensively. If you know the final size when you start populating a list, it's a good idea to use the constructor which lets you specify the capacity, to avoid resizing. Beyond that: if you're concerned, break out the profiler...
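For example (the sizes here are arbitrary):

using System.Collections.Generic;

// Known final size: pass the capacity up front so the backing array never resizes.
var points = new List<int>(100000);
for (int i = 0; i < 100000; i++)
    points.Add(i);                        // O(1) per add, no reallocation

// On a sorted list, BinarySearch gives O(log n) lookup instead of an O(n) scan.
int index = points.BinarySearch(42);      // >= 0 when the item is found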
Compared to what?
If you mean List<T>, then that is essentially a wrapper around an array; so fast to read/write by index, relatively fast to append (since it allows extra space at the end, doubling in size when necessary) and remove from the end, but more expensive to do other operations (insert/delete other than the end)
An array is again fast by index, but fixed size (no append/delete)
Dictionary<,> etc offer better access by key
A list isn't intrinsically slow; especially if you know you always need to look at all the data, or can access it by index. But for large lists it may be better (and more convenient) to search via a key. There are various dictionary implementations in .NET, each with different costs re size / performance.
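As a quick illustration of that trade-off (values are arbitrary):

using System.Collections.Generic;

// Linear search in a list: O(n), every element may be inspected.
var names = new List<string> { "alice", "bob", "carol" };
bool inList = names.Contains("carol");

// Keyed lookup in a dictionary: O(1) on average.
var ages = new Dictionary<string, int> { { "alice", 30 }, { "bob", 25 } };
bool inDict = ages.ContainsKey("bob");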