C# data structure with SortedDictionary and node.Next() functionality?

How can I construct or obtain a data structure with the following capabilities:
Stores (key, value) nodes, with keys implementing IComparable.
Fast (log N) insertion and retrieval.
Fast (log N) retrieval of the next higher/next lower node from any node. [EXAMPLE: if the (key, value) pairs inserted are (7, cat), (4, dog), (12, ostrich), (13, goldfish), and keyVal refers to (7, cat), then keyVal.Next() should return a reference to (12, ostrich).]
A solution with an enumerator starting from an arbitrary key would of course also suffice. Note that standard SortedDictionary functionality will not suffice, since it can only return an enumerator over the entire set, which makes finding keyVal.Next() require N operations at worst.
Could a self-implemented balanced binary search tree (e.g. a red-black tree) be fitted with node.Next() functionality? Any good references for doing this? Any solutions that would take less coding time?

I once had similar requirements and was unable to find something suitable, so I implemented an AVL tree. Here is some advice for doing it with performance in mind:
Do not use recursion for walking the tree (insert, update, delete, next). It is better to use a stack array to store the path up to the root, which is needed for the balancing operations.
Do not store parent nodes. All operations start from the root node and walk down; parent pointers are not needed if the implementation is done carefully.
To find the Next() node of an existing one, Find() is usually called first. The stack produced by Find() should then be reused for Next().
By following these rules, I was able to implement the AVL tree. It works very efficiently even for very large data sets. I would be willing to share it, but it would need some modifications, since it does not store values (very easy to add) and does not rely on IComparable but on a fixed key type of int.
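Here is a minimal sketch of that stack-based Find()/Next() pattern. It is illustrative only: the AVL balancing is omitted entirely, and Node, StackTree, Insert, Find and Next are hypothetical names. The point is how the path stack recorded by Find() is reused by Next().

using System.Collections.Generic;

// Illustrative only: an unbalanced BST showing how a path stack
// recorded by Find() can be reused by Next(). Balancing is omitted.
class Node
{
    public int Key;
    public Node Left, Right;
}

class StackTree
{
    private Node root;
    // Path from the root to the current node, reused between Find and Next.
    private readonly Stack<Node> path = new Stack<Node>();

    public void Insert(int key)
    {
        if (root == null) { root = new Node { Key = key }; return; }
        Node n = root;
        while (true)
        {
            if (key < n.Key)
            {
                if (n.Left == null) { n.Left = new Node { Key = key }; return; }
                n = n.Left;
            }
            else
            {
                if (n.Right == null) { n.Right = new Node { Key = key }; return; }
                n = n.Right;
            }
        }
    }

    public Node Find(int key)
    {
        path.Clear();
        Node n = root;
        while (n != null)
        {
            path.Push(n);
            if (key == n.Key) return n;
            n = key < n.Key ? n.Left : n.Right;
        }
        return null;
    }

    // In-order successor of the node found by the last Find().
    public Node Next()
    {
        if (path.Count == 0) return null;
        Node n = path.Peek();
        if (n.Right != null)
        {
            // Successor is the leftmost node of the right subtree.
            n = n.Right;
            path.Push(n);
            while (n.Left != null) { n = n.Left; path.Push(n); }
            return n;
        }
        // Otherwise walk up until we leave a left subtree.
        Node child = path.Pop();
        while (path.Count > 0 && path.Peek().Right == child)
            child = path.Pop();
        return path.Count > 0 ? path.Peek() : null;
    }
}

With the example data from the question, tree.Find(7) followed by tree.Next() returns the node with key 12, and each successor step costs O(log N) at worst.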

The OrderedDictionary in PowerCollections provides a "get an iterator starting at or before a key" operation that takes O(log N) time to return the first value. That makes it very fast to, say, scan the 1,000 items near the middle of a 50-million-item set. With SortedDictionary you would have to guess whether to start from the beginning or the end (both equally bad choices) and iterate over around 25 million items; OrderedDictionary can do it while iterating just the 1,000 items.
There is a problem in OrderedDictionary, though: it uses yield in a way that causes O(n^2) performance and out-of-memory conditions when iterating a 50-million-item set in a 32-bit process. There is a quite simple fix for that, which I will document later.
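PowerCollections is a third-party library; if you can use .NET 4 or later, the closest built-in analogue to "enumerate starting at a key" is SortedSet<T>.GetViewBetween, as in this small sketch (the data and numbers are arbitrary):

using System;
using System.Collections.Generic;
using System.Linq;

class RangeScanDemo
{
    static void Main()
    {
        var set = new SortedSet<int>(Enumerable.Range(0, 1_000_000));

        // Enumerate items starting at 500_000 without walking the first half.
        foreach (int item in set.GetViewBetween(500_000, int.MaxValue).Take(5))
            Console.WriteLine(item);  // 500000 ... 500004
    }
}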

Related

In C#, what is the best generic collection to use for many inserts and deletes?

I have a game loop which draws a list of objects. The list, called "mylist", holds something like 1000 objects; objects (especially bullets, which fly fast and hit things) need to be added to and removed from the list constantly, a few objects every second.
If I understand correctly, insertion into a List is practically free if the list capacity is large enough. The problem is removal, which is O(n): first I need to find the item in the list, and second all subsequent elements are shifted down after the removal.
If I could aggregate all the removals and perform them once per frame, it would be efficient, because I could use mylist.Except(listToRemove), which is O(n).
But unfortunately I can't do that.
A linked list is also problematic, because I would first need to find the object in the list.
Does anyone have a better suggestion?
What about a HashSet? It supports O(1) insert and removal. If you need ordering, use a tree data structure, or a LinkedList plus a Dictionary that lets you find nodes quickly.
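For example, a minimal sketch of the HashSet approach (Bullet, World, Spawn and Update are hypothetical names chosen for illustration):

using System.Collections.Generic;

// Track live bullets in a HashSet so per-frame adds/removes are O(1) average.
class Bullet
{
    public float X, Y;
    public bool Dead;
}

class World
{
    private readonly HashSet<Bullet> bullets = new HashSet<Bullet>();

    public void Spawn(Bullet b) => bullets.Add(b);      // O(1) average
    public void Despawn(Bullet b) => bullets.Remove(b); // O(1) average

    public void Update()
    {
        // RemoveWhere deletes all matching items in a single O(n) pass,
        // avoiding the "modify while enumerating" problem.
        bullets.RemoveWhere(b => b.Dead);
    }
}

Note that for reference types like this, the default identity-based GetHashCode() is already a perfectly good hash, so no override is required unless you want value-based equality.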
The BCL has SortedList, SortedSet, and SortedDictionary. The "good" one for frequent inserts and deletes is SortedDictionary: it is tree-based internally, giving O(log n) insertion and removal, whereas SortedList is backed by arrays and pays O(n) per insert or delete. (SortedSet is also a red-black tree, but it stores bare values rather than key/value pairs.)
If you are doing lookups then your best bet is a Dictionary, which has an average case of O(1) and a worst case of O(n) (when everything hashes to the same bucket).
If you need to preserve order, you can use any implementation of a balanced binary search tree and your performance will be O(log n).
In .NET this simply comes down to:
Dictionary - insertion/removal in O(1) (average case)
SortedDictionary - preserves order for ordered operations; insertion/removal in O(log n)
So ideally you would want a data structure that has O(1) or O(log n) remove and insert. A dictionary-type data structure would probably be ideal due to its constant average-case insert and delete time. I would recommend a HashSet, overriding GetHashCode() (and Equals()) on your game object base class if you need value-based identity.
Here are more details on all the different dictionary/hash table data structures.
Time complexity
Here are all the 'dictionary-like' data structures and their time-complexities:
Type              Find by key  Remove     Add
HashSet           O(1)*        O(1)*      O(1)**
Dictionary        O(1)*        O(1)*      O(1)**
SortedList        O(log n)     O(n)       O(n)
SortedDictionary  O(log n)     O(log n)   O(log n)

*  O(n) with collisions
** O(n) with collisions, or when adding beyond the array's capacity.
HashSet vs Dictionary
The difference between a HashSet and a Dictionary is that a Dictionary stores KeyValuePairs, whereas in a HashSet the key is the element itself (compared via its GetHashCode() and Equals() methods).
SortedList vs SortedDictionary
You should only consider these if you need to maintain an order. The difference is that SortedDictionary uses a red-black tree, while SortedList uses sorted arrays for its keys and values (which is why its inserts and removals are O(n)).
References
My blog post - .NET simple collections and dictionaries
MSDN - HashSet
MSDN - Dictionary
MSDN - SortedDictionary
MSDN - SortedList
Performance-wise, you'll probably see the best results by simply not inserting or removing anything from any kind of list. Allocate a sufficiently large array and tag each object with whether it should be rendered or not (see the sketch below).
Given your requirements, however, this is all probably premature optimisation; you could very well re-create the entire list of 1000 elements 30 times per second and not see any performance drop.
I suggest reading the slides of this presentation given by Braid's creator, on using the right data structures in independent video games: http://the-witness.net/news/2011/06/how-to-program-independent-games/ (tl;dr just use arrays for everything).
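A minimal sketch of that "flag instead of remove" approach, under assumed names (Entity, Scene, Spawn, Draw are all illustrative):

using System;

// Objects are never removed from the array; dead ones are flagged and skipped.
class Entity
{
    public bool Active;
    public void Render() { /* draw the object */ }
}

class Scene
{
    private readonly Entity[] entities = new Entity[1000];
    private int count;

    public Entity Spawn()
    {
        // Reuse an inactive slot instead of inserting into a list.
        for (int i = 0; i < count; i++)
            if (!entities[i].Active) { entities[i].Active = true; return entities[i]; }
        var e = new Entity { Active = true };
        entities[count++] = e;
        return e;
    }

    public void Draw()
    {
        for (int i = 0; i < count; i++)
            if (entities[i].Active)
                entities[i].Render();
    }
}

"Removal" then is just entity.Active = false, which is O(1) and allocation-free.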

What is faster: a sorted collection, or a list with LINQ queries over it (with intensive insertion/deletion)?

I have a dilemma. I have to implement a prioritized queue (custom sort order), and I need to insert/process/delete a lot of messages per second through it (~100-1000).
Which design is faster at run time?
1) a custom collection (list) kept sorted by priority
2) a non-sorted list, plus a LINQ query every time I need to process (dequeue) a message
3) something else
ADDED:
SOLUTION:
A list (dictionary) of queues keyed by priority: SortedList<int, VPair<bool, Queue<MyMessage>>>
where int is the priority and bool is true if the queue is non-empty.
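A minimal sketch of that design, assuming a lower int means a higher priority. Instead of the bool emptiness flag, this version simply removes a priority level when its queue drains, and it uses string in place of MyMessage for brevity:

using System;
using System.Collections.Generic;

// Queue of queues: SortedList keeps priority levels ordered, each level is FIFO.
class PriorityMessageQueue
{
    private readonly SortedList<int, Queue<string>> levels =
        new SortedList<int, Queue<string>>();

    public void Enqueue(int priority, string msg)
    {
        if (!levels.TryGetValue(priority, out var q))
            levels[priority] = q = new Queue<string>();
        q.Enqueue(msg);
    }

    public string Dequeue()
    {
        if (levels.Count == 0) throw new InvalidOperationException("queue is empty");
        // Keys are kept sorted, so index 0 is always the highest priority.
        var q = levels.Values[0];
        string msg = q.Dequeue();
        // RemoveAt(0) shifts the backing array, but only fires when a level drains,
        // and the number of distinct priorities is usually small.
        if (q.Count == 0) levels.RemoveAt(0);
        return msg;
    }
}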
What's your read/write ratio? Are multiple threads involved, and if so, how?
As always when asking about performance, benchmark both code-paths and see for yourself (this is especially true the more specific your problem domain is).
The only way to know for sure is to measure the performance for yourself.
Well, finding an element in an unsorted data structure takes O(n) on average (one pass over the data structure). Binary search trees have an average insertion complexity of O(log n) and also an average lookup complexity of O(log n), so in theory something like that would be faster. In practice, the constant-factor overhead or the shape of the data might kill the theoretical advantage.
Also, if your custom sort order can change at runtime, you might have to rebuild the sorted data structure, which is an additional performance hit.
In the end: If it is important for your application then try the different approaches and benchmark it yourself - it's the only way to be certain that it works.
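For instance, a rough Stopwatch harness along these lines can compare the two designs. The workload below (10,000 random priorities, then draining the queue) is an arbitrary assumption; adapt it to your real message mix before drawing conclusions:

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;

class Bench
{
    static void Main()
    {
        const int n = 10_000;
        var rnd = new Random(42);
        var priorities = new int[n];
        for (int i = 0; i < n; i++) priorities[i] = rnd.Next();

        // Design 1: keep messages sorted on insert (O(log n) per operation).
        var sw = Stopwatch.StartNew();
        var sorted = new SortedDictionary<int, string>();
        foreach (int p in priorities) sorted[p] = "msg";
        while (sorted.Count > 0)
        {
            var first = sorted.First();   // lowest key = highest priority
            sorted.Remove(first.Key);
        }
        Console.WriteLine($"SortedDictionary: {sw.ElapsedMilliseconds} ms");

        // Design 2: unsorted list, LINQ scan for the minimum on each dequeue.
        sw.Restart();
        var list = new List<int>(priorities);
        while (list.Count > 0)
        {
            int min = list.Min();         // O(n) scan per dequeue
            list.Remove(min);             // plus an O(n) removal
        }
        Console.WriteLine($"List + LINQ: {sw.ElapsedMilliseconds} ms");
    }
}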
Introducing sorting will always incur an insertion overhead as far as I'm aware. If there's no need for sorting, then use a plain generic Dictionary, which provides a quick lookup based on your unique key.

SkipList<T> vs Dictionary<TKey,TValue>

I've been reading about Skip Lists lately.
I have a web application that executes quite complex SQL queries against static datasets.
I want to implement a caching system whereby I generate an MD5 hash of the SQL query and then return a cached dataset for that query if it exists in the collection.
Which data structure would be better, a Dictionary or a SkipList? Why?
http://msdn.microsoft.com/en-us/library/ms379573%28VS.80%29.aspx#datastructures20_4_topic4
The reason you would use a SkipList<T> vs Dictionary<TKey,TValue> is that a skip list keeps its items in order. If you regularly need to enumerate the items in order, a skip list is good because it can enumerate in O(n), with O(1) per step.
If you want in-order enumeration but don't mind that stepping to the next element can cost O(lg n) rather than O(1), a SortedSet<T> (or more likely a SortedDictionary<TKey, TValue>) would be what you'd want, because they use red-black trees (balanced binary trees) and are already in the standard library.
Since it's extremely unlikely that you will want to enumerate your cache in order (or at all), a skip list, and likewise a binary tree, is unnecessary.
Dictionary, definitely. Two reasons:
Dictionary<TKey, TValue> uses a hash table, making retrieval O(1) (i.e. constant time), compared to O(log n) in a skip list.
Dictionary<TKey, TValue> already exists and is well-tested and optimised, whereas a skip list class doesn't exist in the BCL to my knowledge, so you would have to implement your own, which takes effort to get right and to test thoroughly.
Memory consumption is about the same for both (certainly the same complexity, namely O(n)).
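A minimal sketch of the cache described in the question, keyed by the MD5 hex digest of the SQL text. QueryCache is an illustrative name, and object stands in for whatever dataset type you cache; a real version would also need an eviction policy:

using System;
using System.Collections.Generic;
using System.Security.Cryptography;
using System.Text;

class QueryCache
{
    private readonly Dictionary<string, object> cache =
        new Dictionary<string, object>();

    private static string Md5(string sql)
    {
        using (var md5 = MD5.Create())
        {
            byte[] hash = md5.ComputeHash(Encoding.UTF8.GetBytes(sql));
            return BitConverter.ToString(hash); // hex digest used as the key
        }
    }

    public bool TryGet(string sql, out object result) =>
        cache.TryGetValue(Md5(sql), out result);

    public void Add(string sql, object result) =>
        cache[Md5(sql)] = result;
}

Strictly speaking, the Dictionary could key on the SQL string directly, since it hashes its keys itself; the MD5 step only pays off if you also want a compact, persistable cache key.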
Skip lists give an average of O(log n) on all dictionary operations. If the number of items is fixed, then a lock-stripped hash table will do great. An in-memory splay tree is also good, since caching is the goal here: splay trees are faster for recently accessed items. In dictionary operations such as find, skip lists were slow compared to splay trees, which in turn were slow compared to hash tables (see http://harisankar-krishnaswamy.blogspot.in/2012/04/skip-list-runtime-on-dictionay.html).
If locality in the data structure is needed, then skip lists can be useful, for example for finding flights around a date, etc. But a cache is in memory, so a splay tree is fine. Hash tables and splay trees don't provide that kind of locality.

C# Binary Trees and Dictionaries

I'm struggling with the concept of when to use binary search trees and when to use dictionaries.
In my application I ran a little experiment using the C5 library's TreeDictionary (which I believe is a red-black binary search tree) and the C# Dictionary. The Dictionary was always faster at add/find operations and always used less memory. For example, with 16,809 <int, float> entries, the Dictionary used 342 KiB whilst the tree used 723 KiB.
I thought BSTs were supposed to be more memory efficient, but it seems that one node of the tree requires more bytes than one entry in a dictionary. What gives? Is there a point where BSTs are better than dictionaries?
Also, as a side question: does anyone know of a faster and more memory-efficient data structure for storing <int, float> pairs with dictionary-style access than either of the mentioned structures?
I thought that BSTs were supposed to be more memory efficient, but it seems that one node of the tree requires more bytes than one entry in a dictionary. What gives? Is there a point where BSTs are better than dictionaries?
I've personally never heard of such a principle. Even so, it's only a general principle, not a categorical fact etched into the fabric of the universe.
Generally, dictionaries are really just a fancy wrapper around an array of linked lists. You insert into the dictionary with something like:

// Pick a bucket (mask the sign bit so the index is non-negative).
LinkedList<Tuple<TKey, TValue>> list =
    internalArray[(key.GetHashCode() & 0x7FFFFFFF) % internalArray.Length];
if (list.Any(pair => pair.Item1.Equals(key)))   // LinkedList has no Exists; use LINQ Any
    throw new Exception("Key already exists");
list.AddLast(Tuple.Create(key, value));

So it's a nearly O(1) operation. The dictionary uses O(internalArray.Length + n) memory, where n is the number of items in the collection.
In general, BSTs can be implemented as:
linked structures, which use O(n) space, where n is the number of items in the collection.
arrays, which use O(2^h - n) space, where h is the height of the tree and n is the number of items in the collection.
Since AVL-style balanced trees have a height bound of about 1.44 * lg(n) (red-black trees are bounded by 2 * lg(n)), an array implementation would have a bounded memory usage of about O(2^(1.44 * lg n) - n) = O(n^1.44 - n).
Odds are, the C5 TreeDictionary is implemented using arrays, which is probably responsible for the wasted space.
What gives? Is there a point where BSTs are better than dictionaries?
Dictionaries have some undesirable properties:
There may not be enough contiguous blocks of memory to hold your dictionary, even if its memory requirement is much less than the total available RAM.
Evaluating the hash function can take an arbitrarily long time. Strings, for example: use Reflector to examine System.String.GetHashCode, and you'll notice that hashing a string always takes O(n) time, which means it can take considerable time for very long strings. Comparing strings for inequality, on the other hand, is almost always faster than hashing, since it may require looking at just the first few chars. It's wholly possible for tree inserts to be faster than dictionary inserts if hash code evaluation takes too long.
Int32's GetHashCode method is literally just return this, so you'd be hard-pressed to find a case where a hash table with int keys is slower than a tree dictionary.
RB trees have some desirable properties:
You can find/remove the min and max elements in O(log n) time, compared to O(n) time using a dictionary.
If a tree is implemented as a linked structure rather than an array, the tree is usually more space efficient than a dictionary.
Likewise, it's ridiculously easy to write immutable versions of trees which support insert/lookup/delete in O(log n) time. Dictionaries do not adapt well to immutability, since you need to copy the entire internal array for every operation. (I have seen some array-based implementations of immutable finger trees, a kind of general-purpose dictionary data structure, but the implementation is very complex.)
You can traverse all the elements of a tree in sorted order in constant space and O(n) time, whereas you'd need to dump a hash table into an array and sort it to get the same effect.
So the choice of data structure really depends on what properties you need. If you just want an unordered bag and can guarantee that your hash function evaluates quickly, go with the .NET Dictionary. If you need an ordered bag, or have a slow-running hash function, go with TreeDictionary.
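For illustration, the BCL's red-black-tree-backed SortedDictionary exhibits exactly these ordered-bag properties. The data is arbitrary, and note that First()/Last() here are LINQ helpers over the ordered enumerator (Last() walks the whole tree); the point is the ordering, not the constants:

using System;
using System.Collections.Generic;
using System.Linq;

class OrderedBagDemo
{
    static void Main()
    {
        var tree = new SortedDictionary<int, string>
        {
            [7] = "cat", [4] = "dog", [12] = "ostrich"
        };

        Console.WriteLine(tree.First().Key);  // 4  (min key)
        Console.WriteLine(tree.Last().Key);   // 12 (max key)

        // In-order traversal comes out sorted for free.
        foreach (var pair in tree)
            Console.WriteLine($"{pair.Key}: {pair.Value}");
    }
}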
It does make sense that a tree node would require more storage than a dictionary entry. A binary tree node needs to store the value and both the left and right subtrees. The generic Dictionary<TKey, TValue> is implemented as a hash table which - I'm assuming - either uses a linked list for each bucket (value plus one pointer/reference) or some sort of remapping (just the value). I'd have to have a peek in Reflector to be sure, but for the purpose of this question I don't think it's that important.
The sparser the hash table, the less efficient in terms of storage/memory. If you create a hash table (dictionary) and initialize its capacity to 1 million, and only fill it with 10,000 elements, then I'm pretty sure it would eat up a lot more memory than a BST with 10,000 nodes.
Still, I wouldn't worry about any of this if the amount of nodes/keys is only in the thousands. That's going to be measured in the kilobytes, compared to gigabytes of physical RAM.
If the question is "why would you want to use a binary tree instead of a hash table?", then the best answer IMO is that binary trees are ordered whereas hash tables are not. You can only search a hash table for keys that are exactly equal to something; with a tree, you can search for a range of values, the nearest value, etc. This is a pretty important distinction if you're creating an index or something similar.
It seems to me you're doing a premature optimization.
What I'd suggest to you is to create an interface to isolate which structure you're actually using, and then implement the interface using the Dictionary (which seems to work best).
If memory/performance becomes an issue (which it probably will not for ~20k items), then you can create other implementations of the interface and check which one works best. You won't need to change almost anything in the rest of the code (except which implementation you're using).
The interface for a tree and a hash table (which I'm guessing is what your Dictionary is based on) should be very similar, always revolving around keyed lookups.
I had always thought a Dictionary was better for creating things once and then doing lots of lookups on it, while a tree was better if you would be modifying it significantly. However, I don't know where I picked that idea up from.
(Functional languages often use trees as the basis for their collections, as you can re-use most of the tree when you make small modifications to it.)
You're not comparing apples with apples: a BST gives you an ordered representation, while a dictionary allows you to do a lookup on a key/value pair (in your case <int, float>).
I wouldn't expect much difference in the memory footprint between the two, but the dictionary will give you a much faster lookup. To find an item in a BST you need to walk a root-to-leaf path, which is O(log n) when the tree is balanced; a dictionary lookup simply hashes the key and goes straight to the bucket.
A balanced BST is preferable if you need to protect your data structure from latency spikes and hash collision attacks.
The former happens when an array-backed structure grows and gets resized; the latter is an inevitable property of any hashing algorithm, as a projection from an infinite space onto a limited integer range.
Another problem in .NET is the LOH (Large Object Heap): with a sufficiently large dictionary you run into LOH fragmentation. In this case you can use a BST, paying the price of a larger algorithmic complexity class.
In short, with a BST whose nodes are allocated on the heap you get worst-case O(log N) time; with a hash table you get O(N) worst-case time.
A BST comes at the price of O(log N) average time, worse cache locality, and more heap allocations, but in exchange it has latency guarantees and is protected from dictionary attacks and memory fragmentation.
It is worth noting that a BST is also subject to memory fragmentation on other platforms that do not use a compacting garbage collector.
As for memory size, the .NET Dictionary<TKey, TValue> class is more memory efficient, because it stores its entries in arrays chained by integer indices, so each entry holds only the key, value, and offset information rather than being a separate heap object.
A BST has to store an object header (as each node is a class instance on the heap), two child pointers, the key, and some augmented data for balancing; a red-black tree, for example, needs a boolean interpreted as a color (red or black). This comes to at least 7 machine words, if I'm not mistaken. So each node of a red-black tree on a 64-bit system costs a minimum of:
2 words for the object header = 16 bytes
2 words for the child pointers = 16 bytes
1 word for the key = 8 bytes
1 word for the color = 8 bytes (a bool, but padded to word size)
at least 1 word for the value = 8+ bytes
= 16+16+8+8+8 = 56 bytes (+8 bytes if the tree uses a parent node pointer).
At the same time, the minimum size of a dictionary entry is just 16 bytes (two 32-bit ints for the cached hash code and the next-entry index, plus a small key and value such as int/int).

Efficient insertion and search of strings

In an application I will have between about 3,000 and 30,000 strings.
After creation (they are read from files, unordered) strings will not be added often, but sometimes they WILL be. Deletion of strings will also not happen often.
Comparing a string with the ones stored will occur frequently.
What kind of structure is best: a hashtable, a tree (red-black, splay, ...), or just an ordered list (maybe a StringArray)?
(An additional remark: a link to a good C# implementation would be appreciated as well.)
It sounds like you simply need a hashtable. The HashSet<T> would thus seem to be the ideal choice. (You don't seem to require keys, but Dictionary<TKey, TValue> would be the right option if you did, of course.)
Here's a summary of the time complexities of the different operations on a HashSet<T> of size n. They're partially based on the fact that the type uses an array as the backing data structure.
Insertion: Typically O(1), but potentially O(n) if the array needs to be resized.
Deletion: O(1)
Exists (Contains): O(1) (given ideal hashtable buckets)
Someone correct me if any of these are wrong please. They are just my best guesses from what I know of the implementation/hashtables in general.
HashSet is very good for fast insertion and search speeds. Add, Remove and Contains are O(1).
Edit: Add assumes the array does not need to be resized; if it does, then, as Noldorin has stated, it is O(n).
I used a HashSet on a recent upgrade project from VB6 (which I didn't write) to .NET 3.5, where I was iterating over a collection that had child items, and each child item could appear in more than one parent item. The application processed a list of items that I wanted to send to an API that charges a lot of money per call.
I basically used the HashSet to keep track of the items I'd already sent, to prevent us incurring unnecessary charges. As the process was invoked several times (it is basically a batch job with multiple commands), I serialized the HashSet between invocations. This worked very well: I had a requirement to reuse as much of the existing code as possible, as it had been thoroughly tested. The HashSet certainly performed very fast.
If you're looking for real-time performance or optimal memory efficiency, I'd recommend a radix tree or an explicit suffix or prefix tree. Otherwise I'd probably use a hash.
Trees have the advantage of fixed bounds on worst-case lookup, insertion and deletion times (based on the length of the pattern you're looking up). Hash-based solutions have the advantage of being a whole lot easier to code (you get these out of the box in C#), cheaper to construct initially, and, if properly configured, similar average-case performance. However, they do tend to use more memory and have non-deterministic lookup and insertion times (and, depending on the implementation, possibly deletion times).
The answers recommending HashSet<T> are spot on if your comparisons are just "is this string present in the set or not". You could even use different IEqualityComparer<string> implementations (probably choosing from the ones in StringComparer) for case-sensitivity etc.
Is this the only type of comparison you need, or do you need things like "where would this string appear in the set if it were actually an ordered list?" If you need that sort of check, then you'll probably want to do a binary search. (List<T> provides a BinarySearch method; I don't know why SortedList and SortedDictionary don't, as both would be able to search pretty easily. Admittedly a SortedDictionary search wouldn't be quite the same as a normal binary search, but it would still usually have similar characteristics I believe.)
As I say, if you only want "in the set or not" checking, the HashSet<T> is your friend. I just thought I'd bring up the rest in case :)
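A small sketch of both checks just discussed: membership via HashSet with one of the StringComparer implementations, and "where would it sort" via List<T>.BinarySearch (the words are arbitrary sample data):

using System;
using System.Collections.Generic;

class StringLookupDemo
{
    static void Main()
    {
        // Membership check with a case-insensitive comparer.
        var words = new HashSet<string>(StringComparer.OrdinalIgnoreCase)
            { "apple", "Banana", "cherry" };
        Console.WriteLine(words.Contains("BANANA"));   // True

        // "Where would it sort?" via binary search on an already-sorted List<T>.
        var sorted = new List<string> { "apple", "banana", "cherry" };
        int index = sorted.BinarySearch("berry", StringComparer.Ordinal);
        if (index < 0)
            // A negative result encodes the insertion point as its bitwise complement.
            Console.WriteLine("would insert at " + ~index);  // would insert at 2
    }
}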
If you need to know "where would this string appear in the set if it were actually an ordered list" (as in Jon Skeet's answer), you could consider a trie (a minimal sketch appears below). A trie can only be used for certain types of "string-like" data, and if the "alphabet" is large compared to the number of strings, it can quickly lose its advantages. Cache locality can also be a problem.
This could be over-engineered for a set of only N = 30,000 items that is largely precomputed, however. You might even do better just allocating an array of k * N nullable slots and filling it by skipping k spaces between each actual item. That reduces the probability that your rare insertions will require reallocation, still leaves you with a variant of binary search, and keeps your items in sorted order. If you need a precise "where would this string appear in the set", though, this wouldn't work: you would need O(n) time to examine each space before the item to check whether it was blank, or O(n) time on insert to update a "how many items are really before me" counter in each slot. It could provide very fast imprecise indexes, though, and those indexes would be stable across insertions/deletions.
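Returning to the trie suggestion above, here is a minimal sketch supporting insert and membership only. Trie and TrieNode are illustrative names; a production version would add the prefix/range queries that are the whole point of the structure:

using System.Collections.Generic;

// A node per character; a path from the root spells a prefix.
class TrieNode
{
    public readonly Dictionary<char, TrieNode> Children =
        new Dictionary<char, TrieNode>();
    public bool IsWord;
}

class Trie
{
    private readonly TrieNode root = new TrieNode();

    public void Insert(string word)
    {
        TrieNode node = root;
        foreach (char c in word)
        {
            TrieNode next;
            if (!node.Children.TryGetValue(c, out next))
                node.Children[c] = next = new TrieNode();
            node = next;
        }
        node.IsWord = true;
    }

    // Lookup cost is O(length of word), independent of how many strings are stored.
    public bool Contains(string word)
    {
        TrieNode node = root;
        foreach (char c in word)
            if (!node.Children.TryGetValue(c, out node))
                return false;
        return node.IsWord;
    }
}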
