Are C# lists fast? What are the good and bad sides of using lists to handle objects?
Will extensive use of lists make software slower? What are the alternatives to lists in C#?
How many objects is "too many objects" for lists?
List<T> uses a backing array to hold items:
Indexer access (i.e. fetch/update) is O(1)
Remove from tail is O(1)
Remove from elsewhere requires existing items to be shifted up, so O(n) effectively
Add to end is O(1) unless it requires resizing, in which case it's O(n). (This doubles the size of the buffer, so the amortized cost is O(1).)
Add to elsewhere requires existing items to be shifted down, so O(n) effectively
Finding an item is O(n) unless it's sorted, in which case a binary search gives O(log n)
It's generally fine to use lists fairly extensively. If you know the final size when you start populating a list, it's a good idea to use the constructor which lets you specify the capacity, to avoid resizing. Beyond that: if you're concerned, break out the profiler...
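As a minimal sketch of the two tips above (pre-sizing, and binary search on sorted data): the class and method names here are made up for illustration, while `List<T>(int capacity)` and `List<T>.BinarySearch` are the standard BCL members.

```csharp
using System.Collections.Generic;

public static class ListCapacityDemo
{
    // Builds a list of the integers 0..count-1 with the backing array sized up front.
    public static List<int> BuildPresized(int count)
    {
        var list = new List<int>(count); // capacity argument: room for 'count' items
        for (int i = 0; i < count; i++)
        {
            list.Add(i); // no reallocation happens while staying within capacity
        }
        return list;
    }

    // O(log n) lookup; only valid because the list is sorted in ascending order.
    public static bool ContainsSorted(List<int> sortedList, int value)
    {
        return sortedList.BinarySearch(value) >= 0;
    }
}
```

If the final size is only a rough estimate, passing the estimate still helps: the list simply falls back to its usual doubling once the estimate is exceeded.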
Compared to what?
If you mean List<T>, then that is essentially a wrapper around an array; so fast to read/write by index, relatively fast to append (since it allows extra space at the end, doubling in size when necessary) and remove from the end, but more expensive to do other operations (insert/delete other than the end)
An array is again fast by index, but fixed size (no append/delete)
Dictionary<,> etc offer better access by key
A list isn't intrinsically slow; especially if you know you always need to look at all the data, or can access it by index. But for large lists it may be better (and more convenient) to search via a key. There are various dictionary implementations in .NET, each with different costs re size / performance.
I have a game loop which draws a list of objects. The list, called "mylist", holds something like 1000 objects. Objects (especially bullets, which fly fast and hit things) need to be added to and removed from the list constantly, a few objects every second.
If I understand correctly, an insert into a List is practically free if the list's capacity is large enough. The problem is removal, which is O(n) because first I need to find the item in the list, and second it creates a new list after the removal.
If I could aggregate all the removals and do them once per frame, it would be efficient, because I would use mylist.Except(listToRemove), which is O(n).
Unfortunately, I can't do that.
A linked list is also problematic because I need to find the object in the list.
Does anyone have a better suggestion?
What about a HashSet? It supports O(1) insert and removal. If you need ordering, use a tree data structure or a LinkedList plus a Dictionary that allows you to find nodes quickly.
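The "LinkedList plus a Dictionary" idea could look roughly like this. It is only a sketch: `OrderedSet<T>` and its members are hypothetical names, and it keeps insertion order rather than a sorted order.

```csharp
using System.Collections.Generic;

// Keeps items in insertion order while still allowing O(1) add and remove.
public sealed class OrderedSet<T>
{
    private readonly LinkedList<T> _order = new LinkedList<T>();
    private readonly Dictionary<T, LinkedListNode<T>> _nodes =
        new Dictionary<T, LinkedListNode<T>>();

    public int Count => _nodes.Count;

    public bool Add(T item)
    {
        if (_nodes.ContainsKey(item)) return false;
        _nodes[item] = _order.AddLast(item); // remember the node for later removal
        return true;
    }

    public bool Remove(T item)
    {
        if (!_nodes.TryGetValue(item, out var node)) return false;
        _order.Remove(node); // O(1): unlinks the node directly, no search
        _nodes.Remove(item);
        return true;
    }

    public IEnumerable<T> InInsertionOrder() => _order;
}
```

The dictionary answers "where is this object in the list?" without scanning, which is the part that makes a bare linked list awkward.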
The BCL has a SortedList and a SortedSet and a SortedDictionary. All but one were very slow but I can never remember which one was the "good" one. It is tree-based internally.
If you are doing searching, then your best bet is to use a Dictionary, which has an average case of O(1) and a worst case of O(n) (when everything hashes to the same bucket).
If you need to preserve order, you can use any implementation of a balanced binary search tree and your performance will be O(log n).
In .NET this simply comes down to
Dictionary - Insertion/Removal
O(1) (average case)
SortedDictionary - preserves order for ordered operations, insertion/removal O(logn)
So ideally you would want to use a data structure that has O(1) or O(log n) remove and insert. A dictionary-type data structure would probably be ideal due to its constant average-case insert and delete time complexity. I would recommend a HashSet, overriding GetHashCode() (together with Equals()) on your game object base class.
Here are more details on all the different dictionary/hash table data structures.
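A rough sketch of that recommendation follows. `GameObject`, its `Id` property and `SceneDemo` are assumptions made up for the example; the point is only that `Equals` and `GetHashCode` are overridden consistently so the set can find an object by identity key.

```csharp
using System.Collections.Generic;

public class GameObject
{
    public int Id { get; }

    public GameObject(int id) { Id = id; }

    // Two objects count as the same set member when their ids match.
    public override bool Equals(object obj) => obj is GameObject other && other.Id == Id;

    // Must agree with Equals: equal objects return equal hash codes.
    public override int GetHashCode() => Id;
}

public static class SceneDemo
{
    public static int CountAfterChurn()
    {
        var live = new HashSet<GameObject>();
        for (int id = 0; id < 1000; id++) live.Add(new GameObject(id));

        live.Remove(new GameObject(42));  // located via the hash code, no scan of the set
        live.Add(new GameObject(1000));   // average O(1)
        return live.Count;
    }
}
```

If reference identity is all you need (the same instance is added and later removed), the default `object` implementations already work and no override is required.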
Time complexity
Here are all the 'dictionary-like' data structures and their time-complexities:
Type               Find by key   Remove      Add
HashSet            O(1)*         O(1)*       O(1)**
Dictionary         O(1)*         O(1)*       O(1)**
SortedList         O(log n)      O(n)        O(n)
SortedDictionary   O(log n)      O(log n)    O(log n)

* O(n) with collision
** O(n) with collision or when adding beyond the array's capacity.
HashSet vs Dictionary
The difference between a HashSet and a Dictionary is that a Dictionary stores key/value pairs (KeyValuePair), whereas in a HashSet the object itself acts as the key (hashed via its GetHashCode() method).
SortedList vs SortedDictionary
You should only consider these if you need to maintain an order. The difference is that SortedDictionary uses a red-black tree and SortedList uses sorted arrays for its keys and values.
Performance-wise, you'll probably see the best results by simply not inserting or removing anything from any kind of list. Allocate a sufficiently large array and tag all your objects with whether they should be rendered or not.
Given your requirements, however, this is all probably premature optimisation; you could very well re-create the entire list of 1000 elements 30 times per second and not see any performance drop.
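To make the "allocate an array and tag the objects" idea concrete, here is one possible sketch. `Bullet`, `BulletPool` and the member names are invented for the example, and the linear scan in `Spawn` is a deliberate simplification.

```csharp
public struct Bullet
{
    public float X;
    public float Y;
    public bool Active; // the "should this be rendered?" tag
}

public sealed class BulletPool
{
    private readonly Bullet[] _bullets;

    public BulletPool(int capacity) { _bullets = new Bullet[capacity]; }

    // Reuses the first inactive slot; returns its index, or -1 if the pool is full.
    public int Spawn(float x, float y)
    {
        for (int i = 0; i < _bullets.Length; i++)
        {
            if (!_bullets[i].Active)
            {
                _bullets[i] = new Bullet { X = x, Y = y, Active = true };
                return i;
            }
        }
        return -1;
    }

    // "Removal" is just clearing the flag: nothing is shifted or reallocated.
    public void Despawn(int index) { _bullets[index].Active = false; }

    public int CountActive()
    {
        int count = 0;
        foreach (var b in _bullets) if (b.Active) count++;
        return count;
    }
}
```

The draw loop would walk the array and skip inactive entries. With around 1000 slots the scan cost is negligible; a free-index stack could replace the scan if it ever shows up in a profiler.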
I suggest reading the slides of this presentation given by Braid's creator, on using the right data structures in independent video games: http://the-witness.net/news/2011/06/how-to-program-independent-games/ (tl;dr just use arrays for everything).
The .NET Dictionary<TKey,TValue>'s internal structure and process is a highly optimized design, as discussed by Simon Cooper in this excellent blog post.
The MSDN docs state that Add(TKey,TValue) is O(1) -- unless the Dictionary's element count is at capacity, necessitating that a dynamic resize operation first be performed, thus making Add() O(n) at these junctures.
As the Dictionary grows, resize operations become progressively more infrequent, and therefore it may be said that averaged over large n, Add() approaches O(1).
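The amortized argument can be checked with a short calculation. This is an idealized sketch: it assumes the capacity starts at 1 and exactly doubles on every resize, whereas the real Dictionary (as far as I know) grows to a prime near twice the old size, which changes the constants but not the conclusion.

```latex
% n >= 2 adds; resizes happen when the count reaches 1, 2, 4, ..., 2^k,
% where k is the largest integer with 2^k < n.
% A resize at size 2^i copies 2^i existing entries.
\[
\text{copies}(n) \;=\; \sum_{i=0}^{k} 2^{i} \;=\; 2^{k+1} - 1 \;<\; 2n
\]
\[
\text{total cost of } n \text{ adds} \;\le\; n + 2n \;=\; 3n
\quad\Longrightarrow\quad
\text{amortized cost per add} \;\le\; 3 \;=\; O(1)
\]
```

Any single Add that triggers a resize is still O(n); only the average over the whole sequence is constant.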
This is evidenced by Cooper in this graph of total elapsed time for n/2 add operations as a function of n.
However, the average worst case performance of Add() is O(n).
My question : Is it possible to design a more consistently performant data structure than the .NET Dictionary ?
Specifically, I want average worst case performance of add, delete, retrieve operations to all be O(1).
Note that consistency of performance ("big O") is the only relevant design criterion. Memory utilization and absolute performance (including degree of clustering and cache performance) are not relevant design criteria.
Choosing an initial capacity much larger than anticipated needs is one option, but that is an initialization step, and I am looking for a data structure design.
Have you simply tried creating a dictionary with a big enough size?
Generic O(1) is basically an array: access via index. Or hierarchical arrays (for a 4-byte key, pretty much an array pointing to an array pointing to an array pointing to an array, to save space).
Anything else: no, sorry.
Lots of high performance stuff uses preallocated arrays.
I am looking for the most efficient way to store a collection of integers. Right now they're being stored in a HashSet<T>, but profiling has shown that these collections weigh heavily on some performance-critical code and I suspect there's a better option.
Some more details:
Random lookups must be O(1) or close to it.
The collections can grow large, so space efficiency is desirable.
The values are uniformly distributed in a 64-bit space.
Mutability is not needed.
There's no clear upper bound on size, but tens of millions of elements is not uncommon.
The most painful performance hit right now is creating them. That seems to be allocation-related - clearing and reusing HashSets helps a lot in benchmarks, but unfortunately that is not a feasible option in the application code.
(added) Implementing a data structure that's tailored to the task is fine. Is a hash table still the way to go? A trie also seems like a possibility at first glance, but I don't have any practical experience with them.
HashSet is usually the best general purpose collection in this case.
If you have any specific information about your collection you may have better options.
If you have a fixed upper bound that is not incredibly large you can use a bit vector of suitable size.
If you have a very dense collection you can instead store the missing values.
If you have very small collections, <= 4 items or so, you can store them in a regular array. A full scan of such small array may be faster than the hashing required to use the hash-set.
If you don't have any more specific characteristics of your data than "large collections of int" HashSet is the way to go.
If the size of the values is bounded you could use a bitset. It stores one bit per possible integer, so in total the memory use would be n bits, with n being the greatest integer.
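A bitset along those lines can be sketched with the BCL's `System.Collections.BitArray`. The wrapper class `BoundedIntSet` is a hypothetical name, and the sketch only covers non-negative `int` values up to a known maximum, so it would not fit values spread over a 64-bit space.

```csharp
using System.Collections;

public sealed class BoundedIntSet
{
    private readonly BitArray _bits;

    // One bit per possible value in the range 0..maxValue.
    public BoundedIntSet(int maxValue) { _bits = new BitArray(maxValue + 1); }

    // Throws ArgumentOutOfRangeException if value is outside 0..maxValue.
    public void Add(int value) { _bits[value] = true; }

    // O(1): a single bit test.
    public bool Contains(int value) =>
        value >= 0 && value < _bits.Length && _bits[value];
}
```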
Another option is a bloom filter. Bloom filters are very compact but you have to be prepared for an occasional false positive in lookups. You can find more about them in wikipedia.
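For a feel of how a Bloom filter behaves, here is a small sketch. Everything in it is an assumption made for illustration: the class name, the choice of double hashing to derive the k probe positions, and the 64-bit mixing function. The bit count and hash count would need tuning for a real false-positive target.

```csharp
using System.Collections;

public sealed class BloomFilterSketch
{
    private readonly BitArray _bits;
    private readonly int _hashCount;

    public BloomFilterSketch(int bitCount, int hashCount)
    {
        _bits = new BitArray(bitCount);
        _hashCount = hashCount;
    }

    public void Add(long value)
    {
        for (int i = 0; i < _hashCount; i++) _bits[IndexFor(value, i)] = true;
    }

    // false means "definitely absent"; true means "probably present".
    public bool MightContain(long value)
    {
        for (int i = 0; i < _hashCount; i++)
        {
            if (!_bits[IndexFor(value, i)]) return false;
        }
        return true;
    }

    // Double hashing: the i-th probe position is (h1 + i * h2) mod bitCount.
    private int IndexFor(long value, int i)
    {
        unchecked
        {
            ulong x = (ulong)value;
            ulong h1 = Mix(x);
            ulong h2 = Mix(x ^ 0x9E3779B97F4A7C15UL) | 1UL;
            ulong combined = h1 + (ulong)i * h2;
            return (int)(combined % (ulong)_bits.Length);
        }
    }

    // A 64-bit integer mixer; any well-distributed hash would do here.
    private static ulong Mix(ulong x)
    {
        unchecked
        {
            x ^= x >> 33;
            x *= 0xFF51AFD7ED558CCDUL;
            x ^= x >> 33;
            x *= 0xC4CEB9FE1A85EC53UL;
            x ^= x >> 33;
            return x;
        }
    }
}
```

Values that were added always report true; the only error mode is an occasional true for a value that was never added.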
A third option is using a simple sorted array. Lookups are O(log n), with n being the number of integers. It may be fast enough.
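Since mutability is not needed, the sorted-array option is short to write. A sketch, with `SortedLongSet` as a made-up name and `Array.Sort` / `Array.BinarySearch` doing the actual work:

```csharp
using System;

public sealed class SortedLongSet
{
    private readonly long[] _sorted;

    public SortedLongSet(long[] values)
    {
        _sorted = (long[])values.Clone(); // leave the caller's array untouched
        Array.Sort(_sorted);              // one-off O(n log n) build cost
    }

    // O(log n); BinarySearch returns a negative number when the value is absent.
    public bool Contains(long value) => Array.BinarySearch(_sorted, value) >= 0;
}
```

The storage is exactly 8 bytes per element with no per-entry overhead, which is about as compact as an exact set of 64-bit values gets without compression.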
I decided to try and implement a special purpose hash-based set class that uses linear probing to handle collisions:
Backing store is a simple array of longs
The array is sized to be larger than the expected number of elements to be stored.
For a value's hash code, use the least-significant 31 bits.
Searching for the position of a value in the backing store is done using a basic linear probe, like so:
    int FindIndex(long value)
    {
        // Hash: the least-significant 31 bits, reduced modulo the table size.
        var index = (int)(value & 0x7FFFFFFF) % _storage.Length;
        var slotValue = _storage[index];
        // 0 marks an empty slot.
        if (slotValue == 0x0 || slotValue == value) return index;
        // Linear probe, wrapping around at the end of the array.
        for (++index; ; index++)
        {
            if (index == _storage.Length) index = 0;
            slotValue = _storage[index];
            if (slotValue == 0x0 || slotValue == value) return index;
        }
    }
(I was able to determine that the data being stored will never include 0, so that number is safe to use for empty slots.)
The array needs to be larger than the number of elements stored. (Load factor less than 1.) If the set is ever completely filled then FindIndex() will go into an infinite loop if it's used to search for a value that isn't already in the set. In fact, it will want to have quite a lot of empty space, otherwise search and retrieval may suffer as the data starts to form large clumps.
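For readers who want to try the approach, here is how the pieces described above might fit into a complete class. This is a reconstruction under assumptions, not the original implementation: the class name, the constructor parameters, and the `Add`/`Contains` methods are all guesses built around the probing routine, and 0 is reserved as the empty-slot marker as described.

```csharp
using System;

public sealed class LongProbeSet
{
    private readonly long[] _storage; // 0 means "empty slot"

    public int Count { get; private set; }

    // Sizes the array so that expectedCount items stay at or below the load factor.
    public LongProbeSet(int expectedCount, double loadFactor = 0.5)
    {
        if (expectedCount < 0) throw new ArgumentOutOfRangeException(nameof(expectedCount));
        if (loadFactor <= 0 || loadFactor > 1) throw new ArgumentOutOfRangeException(nameof(loadFactor));
        _storage = new long[(int)(expectedCount / loadFactor) + 1];
    }

    public bool Add(long value)
    {
        if (value == 0) throw new ArgumentException("0 is reserved for empty slots.");
        int index = FindIndex(value);
        if (_storage[index] == value) return false; // already present
        // Always keep one empty slot so that FindIndex is guaranteed to terminate.
        if (Count == _storage.Length - 1) throw new InvalidOperationException("Set is full.");
        _storage[index] = value;
        Count++;
        return true;
    }

    public bool Contains(long value) =>
        value != 0 && _storage[FindIndex(value)] == value;

    // Returns the slot holding 'value', or the empty slot where it would go.
    private int FindIndex(long value)
    {
        int index = (int)(value & 0x7FFFFFFF) % _storage.Length;
        while (_storage[index] != 0 && _storage[index] != value)
        {
            index++;
            if (index == _storage.Length) index = 0; // wrap around
        }
        return index;
    }
}
```

There is no removal here, which matches the "mutability is not needed" requirement; deleting from a linear-probing table needs tombstones or re-insertion of the following cluster.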
I'm sure there's still room for optimization, and I may get stuck using some sort of BigArray<T> or sharding for the backing store on large sets. But initial results are promising. It performs over twice as fast as HashSet<T> at a load factor of 0.5, nearly twice as fast with a load factor of 0.8, and even at 0.9 it's still working 40% faster in my tests.
Overhead is 1 / load factor, so if those performance figures hold out in the real world then I believe it will also be more memory-efficient than HashSet<T>. I haven't done a formal analysis, but judging by the internal structure of HashSet<T> I'm pretty sure its overhead is well above 10%.
--
So I'm pretty happy with this solution, but I'm still curious if there are other possibilities. Maybe some sort of trie?
--
Epilogue: Finally got around to doing some competitive benchmarks of this vs. HashSet<T> on live data. (Before I was using synthetic test sets.) It's even beating my optimistic expectations from before. Real-world performance is turning out to be as much as 6x faster than HashSet<T>, depending on collection size.
What I would do is just create an array of integers large enough to handle however many integers you need. Is there any reason for staying away from the generic List<T>? http://msdn.microsoft.com/en-us/library/6sh2ey19.aspx
The most painful performance hit right now is creating them...
As you've obviously observed, HashSet<T> does not have a constructor that takes a capacity argument to initialize its capacity.
One trick which I believe would work is the following:
    int capacity = ... some appropriate number;
    int[] items = new int[capacity];
    HashSet<int> hashSet = new HashSet<int>(items);
    hashSet.Clear();
    ...
Looking at the implementation with reflector, this will initialize the capacity to the size of the items array, ignoring the fact that this array contains duplicates. It will, however, only actually add one value (zero), so I'd assume that initializing and clearing should be reasonably efficient.
I haven't tested this so you'd have to benchmark it. And be willing to take the risk of depending on an undocumented internal implementation detail.
It would be interesting to know why Microsoft didn't provide a constructor with a capacity argument like they do for other collection types.
I'm struggling with the concept of when to use binary search trees and when to use dictionaries.
In my application I did a little experiment which used the C5 library TreeDictionary (which I believe is a red-black binary search tree), and the C# dictionary. The dictionary was always faster at add/find operations and also always used less memory space. For example, at 16809 <int, float> entries, the dictionary used 342 KiB whilst the tree used 723 KiB.
I thought that BST's were supposed to be more memory efficient, but it seems that one node of the tree requires more bytes than one entry in a dictionary. What gives? Is there a point at where BST's are better than dictionaries?
Also, as a side question, does anyone know if there is a faster + more memory efficient data structure for storing <int, float> pairs for dictionary type access than either of the mentioned structures?
"I thought that BST's were supposed to be more memory efficient, but it seems that one node of the tree requires more bytes than one entry in a dictionary. What gives? Is there a point at where BST's are better than dictionaries?"
I've personally never heard of such a principle. Even so, it's only a general principle, not a categorical fact etched in the fabric of the universe.
Generally, Dictionaries are really just a fancy wrapper around an array of linked lists. You insert into the dictionary something like:
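    // Simplified picture of separate chaining, not the actual BCL source.
    // (Uses System.Linq for Any.)
    int bucket = (key.GetHashCode() & 0x7FFFFFFF) % internalArray.Length;
    LinkedList<KeyValuePair<TKey, TValue>> list = internalArray[bucket];
    if (list.Any(x => EqualityComparer<TKey>.Default.Equals(x.Key, key)))
        throw new ArgumentException("Key already exists");
    list.AddLast(new KeyValuePair<TKey, TValue>(key, value));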
So it's a nearly O(1) operation. The dictionary uses O(internalArray.Length + n) memory, where n is the number of items in the collection.
In general BSTs can be implemented as:
linked-lists, which use O(n) space, where n is the number items in the collection.
arrays, which use O(2^h - n) space, where h is the height of the tree and n is the number of items in the collection.
Since red-black trees have a height bounded by 2 * log2(n + 1), an array implementation has a worst-case memory usage of about O(2^(2 * log2(n + 1)) - n), which is O(n^2).
Odds are, the C5 TreeDictionary is implemented using arrays, which is probably responsible for the wasted space.
"What gives? Is there a point at where BST's are better than dictionaries?"
Dictionaries have some undesirable properties:
There may not be enough contiguous blocks of memory to hold your dictionary, even if its memory requirements are much less than the total available RAM.
Evaluating the hash function can take an arbitrarily long time. Take strings, for example: use Reflector to examine the System.String.GetHashCode method and you'll notice that hashing a string always takes O(n) time, which can be considerable for very long strings. On the other hand, comparing strings for inequality is almost always faster than hashing, since it may require looking at just the first few chars. It's wholly possible for tree inserts to be faster than dictionary inserts if hash code evaluation takes too long.
Int32's GetHashCode method is literally just return this, so you'd be hard-pressed to find a case where a hashtable with int keys is slower than a tree dictionary.
RB Trees have some desirable properties:
You can find/remove the Min and Max elements in O(log n) time, compared to O(n) time using a dictionary.
If a tree is implemented as linked list rather than an array, the tree is usually more space efficient than a dictionary.
Likewise, it's ridiculously easy to write immutable versions of trees which support insert/lookup/delete in O(log n) time. Dictionaries do not adapt well to immutability, since you need to copy the entire internal array for every operation (actually, I have seen some array-based implementations of immutable finger trees, a kind of general purpose dictionary data structure, but the implementation is very complex).
You can traverse all the elements in a tree in sorted order in constant space and O(n) time, whereas you'd need to dump a hash table into an array and sort it to get the same effect.
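The min/max and sorted-traversal properties are easy to see with the BCL's tree-based `SortedSet<T>` (used here as a stand-in; the C5 TreeDictionary from the question is not shown). `SortedSetDemo` and `Describe` are names invented for this sketch.

```csharp
using System.Collections.Generic;

public static class SortedSetDemo
{
    // Reports the smallest and largest element and the elements in sorted order.
    public static string Describe(IEnumerable<int> values)
    {
        var set = new SortedSet<int>(values); // duplicates are dropped
        // Min and Max come straight from the tree; enumeration is already in sorted order.
        return $"min={set.Min} max={set.Max} ordered={string.Join(",", set)}";
    }
}
```

For example, `SortedSetDemo.Describe(new[] { 5, 1, 4, 1, 3 })` returns `min=1 max=5 ordered=1,3,4,5`. Getting the same from a Dictionary or HashSet would mean scanning every element for the extremes and sorting a copy for the ordered view.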
So, the choice of data structure really depends on what properties you need. If you just want an unordered bag and can guarantee that your hash function evaluates quickly, go with a .NET Dictionary. If you need an ordered bag or have a slow-running hash function, go with TreeDictionary.
It does make sense that a tree node would require more storage than a dictionary entry. A binary tree node needs to store the value and both the left and right subtrees. The generic Dictionary<TKey, TValue> is implemented as a hash table which - I'm assuming - either uses a linked list for each bucket (value plus one pointer/reference) or some sort of remapping (just the value). I'd have to have a peek in Reflector to be sure, but for the purpose of this question I don't think it's that important.
The sparser the hash table, the less efficient in terms of storage/memory. If you create a hash table (dictionary) and initialize its capacity to 1 million, and only fill it with 10,000 elements, then I'm pretty sure it would eat up a lot more memory than a BST with 10,000 nodes.
Still, I wouldn't worry about any of this if the amount of nodes/keys is only in the thousands. That's going to be measured in the kilobytes, compared to gigabytes of physical RAM.
If the question is "why would you want to use a binary tree instead of a hash table?" Then the best answer IMO is that binary trees are ordered whereas hash tables are not. You can only search a hash table for keys that are exactly equal to something; with a tree, you can search for a range of values, nearest value, etc. This is a pretty important distinction if you're creating an index or something similar.
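Range and nearest-value queries of that kind can be sketched with `SortedSet<T>.GetViewBetween` from the BCL. The helper names `InRange` and `Ceiling` are hypothetical, and the example uses `int` keys for brevity.

```csharp
using System.Collections.Generic;

public static class RangeQueryDemo
{
    // All values v with low <= v <= high, in ascending order.
    public static List<int> InRange(SortedSet<int> set, int low, int high)
    {
        return new List<int>(set.GetViewBetween(low, high)); // bounds are inclusive
    }

    // The smallest value >= target, or null if there is none.
    public static int? Ceiling(SortedSet<int> set, int target)
    {
        foreach (int value in set.GetViewBetween(target, int.MaxValue))
        {
            return value; // the first element of the view is the nearest one
        }
        return null;
    }
}
```

A hash table cannot answer either question without examining every key, because hashing deliberately destroys the ordering.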
It seems to me you're doing a premature optimization.
What I'd suggest to you is to create an interface to isolate which structure you're actually using, and then implement the interface using the Dictionary (which seems to work best).
If memory/performance becomes an issue (which it probably will not for around 20k numbers), then you can create other interface implementations and check which one works best. You will hardly need to change anything in the rest of the code (except which implementation you're using).
The interface for a Tree and a Hash table (which I'm guessing is what your Dictionary is based on) should be very similar, always revolving around keyed lookups.
I had always thought a Dictionary was better for creating things once and then doing lots of lookups on it, while a Tree was better if you were modifying it significantly. However, I don't know where I picked that idea up from.
(Functional languages often use trees as the basis for their collections, as you can re-use most of the tree if you make small modifications to it.)
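One way the interface suggestion could look, as a sketch only: `IKeyedStore`, `HashStore` and `TreeStore` are invented names, and the BCL's `SortedDictionary` stands in for the tree-based implementation.

```csharp
using System.Collections.Generic;

// The rest of the code depends only on this interface.
public interface IKeyedStore<TKey, TValue>
{
    void Set(TKey key, TValue value);
    bool TryGet(TKey key, out TValue value);
}

// Hash-table-backed implementation.
public sealed class HashStore<TKey, TValue> : IKeyedStore<TKey, TValue>
{
    private readonly Dictionary<TKey, TValue> _map = new Dictionary<TKey, TValue>();

    public void Set(TKey key, TValue value) { _map[key] = value; }
    public bool TryGet(TKey key, out TValue value) => _map.TryGetValue(key, out value);
}

// Tree-backed implementation with the same surface.
public sealed class TreeStore<TKey, TValue> : IKeyedStore<TKey, TValue>
{
    private readonly SortedDictionary<TKey, TValue> _map = new SortedDictionary<TKey, TValue>();

    public void Set(TKey key, TValue value) { _map[key] = value; }
    public bool TryGet(TKey key, out TValue value) => _map.TryGetValue(key, out value);
}
```

Swapping implementations is then a one-line change at the construction site, which makes it cheap to measure both against real data.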
You're not comparing "apples with apples": a BST will give you an ordered representation, while a dictionary allows you to do a lookup on a key/value pair (in your case <int, float>).
I wouldn't expect much difference in the memory footprint between the two, but the dictionary will give you a much faster lookup. To find an item in a BST you need to walk from the root down to the item, which takes O(log n) comparisons in a balanced tree (and potentially O(n) in a degenerate one). A dictionary lookup simply hashes the key and goes straight to its bucket.
A balanced BST is preferable if you need to protect your data structure from latency spikes and hash collision attacks.
The former happens when an array-backed structure grows and gets resized; the latter is an inevitable property of a hashing algorithm as a projection from an infinite space to a limited integer range.
Another problem in .NET is the Large Object Heap (LOH): with a sufficiently large dictionary you run into LOH fragmentation. In this case you can use a BST, paying the price of a larger algorithmic complexity class.
In short, with a BST backed by the allocation heap you get worst case O(log(N)) time, with hashtable you get O(N) worst case time.
BST comes at a price of O(log(N)) average time, worse cache locality and more heap allocations, but it has latency guarantees and is protected from dictionary attacks and memory fragmentation.
It is worth noting that a BST is also subject to memory fragmentation on other platforms that do not use a compacting garbage collector.
As for the memory size, the .NET Dictionary`2 class is more memory efficient, because it stores its entries as structs in a single array, chained by index like a linked list, so each entry only holds the key, the value, the cached hash code and the index of the next entry.
A BST has to store an object header (as each node is a class instance on the heap), two pointers, and some augmented tree data for balanced trees. For example, a red-black tree would need a boolean interpreted as color (red or black). This is at least 7 machine words, if I'm not mistaken. So, each node in a red-black tree on a 64-bit system is a minimum of:
3 words for the header = 24 bytes
2 words for the child pointers = 16 bytes
1 word for the color = 8 bytes
at least 1 word for the value = 8+ bytes
= 24+16+8+8 = 56 bytes (+8 bytes if the tree uses a parent node pointer).
At the same time, the minimum size of the dictionary entry would be just 16 bytes.
In an application I will have between about 3000 and 30000 strings.
After creation (read from files unordered) there will not be many strings that will be added often (but there WILL be sometimes!). Deletion of strings will also not happen often.
Comparing a string with the ones stored will occur frequently.
What kind of structure can I best use: a hashtable, a tree (Red-Black, Splay, ...) or just an ordered list (maybe a StringArray)?
(Additional remark : a link to a good C# implementation would be appreciated as well)
It sounds like you simply need a hashtable. The HashSet<T> would thus seem to be the ideal choice. (You don't seem to require keys, but Dictionary<TKey, TValue> would be the right option if you did, of course.)
Here's a summary of the time complexities of the different operations on a HashSet<T> of size n. They're partially based off the fact that the type uses an array as the backing data structure.
Insertion: Typically O(1), but potentially O(n) if the array needs to be resized.
Deletion: O(1)
Exists (Contains): O(1) (given ideal hashtable buckets)
Someone correct me if any of these are wrong please. They are just my best guesses from what I know of the implementation/hashtables in general.
HashSet is very good for fast insertion and search speeds. Add, Remove and Contains are O(1).
Edit- Add assumes the array does not need to be resized. If that's the case as Noldorin has stated it is O(n).
I used HashSet on a recent VB 6 (I didn't write it) to .NET 3.5 upgrade project where I was iterating round a collection that had child items, and each child item could appear in more than one parent item. The application processed a list of items I wanted to send to an API that charges a lot of money per call.
I basically used the HashSet to keep track of items I'd already sent, to prevent us incurring an unnecessary charge. As the process was invoked several times (it is basically a batch job with multiple commands), I serialized the HashSet between invocations. This worked very well. I had a requirement to reuse as much of the existing code as possible, as this had been thoroughly tested. The HashSet certainly performed very fast.
If you're looking for real-time performance or optimal memory efficiency I'd recommend a radix tree or explicit suffix or prefix tree. Otherwise I'd probably use a hash.
Trees have the advantage of having fixed bounds on worst case lookup, insertion and deletion times (based on the length of the pattern you're looking up). Hash based solutions have the advantage of being a whole lot easier to code (you get these out of the box in C#), cheaper to construct initially and if properly configured have similar average-case performance. However, they do tend to use more memory and have non-deterministic time lookups, insertions (and depending on the implementation possibly deletions).
The answers recommending HashSet<T> are spot on if your comparisons are just "is this string present in the set or not". You could even use different IEqualityComparer<string> implementations (probably choosing from the ones in StringComparer) for case-sensitivity etc.
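A short sketch of the comparer idea, using the built-in `StringComparer.OrdinalIgnoreCase`; the wrapper `StringSetDemo.BuildCaseInsensitive` is just an illustrative name.

```csharp
using System;
using System.Collections.Generic;

public static class StringSetDemo
{
    // Builds a set in which "Apple", "APPLE" and "apple" are the same member.
    public static HashSet<string> BuildCaseInsensitive(IEnumerable<string> items)
    {
        return new HashSet<string>(items, StringComparer.OrdinalIgnoreCase);
    }
}
```

Membership tests then go through the same comparer, so `set.Contains("APPLE")` succeeds for a set built from `"apple"`. The comparer has to be chosen at construction time; it cannot be swapped on an existing set.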
Is this the only type of comparison you need, or do you need things like "where would this string appear in the set if it were actually an ordered list?" If you need that sort of check, then you'll probably want to do a binary search. (List<T> provides a BinarySearch method; I don't know why SortedList and SortedDictionary don't, as both would be able to search pretty easily. Admittedly a SortedDictionary search wouldn't be quite the same as a normal binary search, but it would still usually have similar characteristics I believe.)
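The "where would this string appear" check falls out of `List<T>.BinarySearch`, which returns the bitwise complement of the insertion index when the item is missing. A minimal sketch, assuming the list is already sorted with the same ordinal comparer (`InsertionPointDemo` is a made-up name):

```csharp
using System;
using System.Collections.Generic;

public static class InsertionPointDemo
{
    // Index of 'value' if present, otherwise the index at which it would be inserted.
    public static int IndexOrInsertionPoint(List<string> sorted, string value)
    {
        int i = sorted.BinarySearch(value, StringComparer.Ordinal);
        return i >= 0 ? i : ~i; // a negative result encodes the insertion point
    }
}
```

With the sorted list `apple, cherry, pear`, looking up `banana` gives 1 (it would go before `cherry`), `cherry` gives 1 (found), and `zebra` gives 3 (past the end).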
As I say, if you only want "in the set or not" checking, the HashSet<T> is your friend. I just thought I'd bring up the rest in case :)
If you need to know "where would this string appear in the set if it were actually an ordered list" (as in Jon Skeet's answer), you could consider a trie. This solution can only be used for certain types of "string-like" data, and if the "alphabet" is large compared to the number of strings it can quickly lose its advantages. Cache locality could also be a problem.
This could be over-engineered for a set of only N = 30,000 things that is largely precomputed, however. You might even do better just allocating an array of k * N optional slots and filling it by skipping k spaces between each actual thing (thus reducing the probability that your rare insertions will require reallocation, still leaving you with a variant of binary search, and keeping your items in sorted order). If you need precise "where would this string appear in the set", though, this wouldn't work, because you would need O(n) time to examine each space before the item to check whether it was blank, or O(n) time on insert to update a "how many items are really before me" counter in each slot. It could provide you with very fast imprecise indexes, though, and those indexes would be stable between insertions/deletions.