The generic List<T> in .NET has a BinarySearch() method. BinarySearch() is an efficient algorithm for searching large datasets. I think I read that if everyone in the world was listed in a phone book then binary search could find any person within 35 steps. At what point should a BinarySearch() be used on a List as opposed to using the standard .Where clause with a lambda? How big should the data set be before switching from Where to BinarySearch? Or does Where already use a binary search behind the scenes?
When to use List<T>.BinarySearch?
As you can read in the documentation manual:
Searches the entire sorted List<T> for an element using the default comparer and returns the zero-based index of the element.
Furthermore it can only be used to match a given element, not a predicate since a generic predicate would defeat the order constraint.
So the list has to be sorted, either by the default comparator, or by a given comparator:
public int BinarySearch(T item) //default comparator
public int BinarySearch(T item, IComparer<T> comparer) //given comparator
The algorithm runs in O(log n) time whereas the where clause runs in O(n) time, which means in practice it will nearly always outperform the second (unless it is very likely the element will be found in the front of the list).
Or does .Where already use a binary search behind the scenes?
No, it can't since. A List<T> is not always sorted. First checking whether the list is sorted, or sorting the the list would require a computational effort of O(n) or O(n log n) respectively which would be the same or even more expensive than linear search. In other words, first checking whether the list is sorted and then - if that's the case - perform binary search would be more expensive than performing a linear search. A linear search can handle both unordered lists and predicates but at a much larger cost.
At what point should a BinarySearch() be used on a List as opposed to using the standard Where clause with a lambda?
Any time the list is sorted relative to the value you're searching for, BinarySearch will (on average) give you better performance than Where regardless of size.
If the list is unordered, or in an order that does not correspond to the value you're looking for (you can't use a binary search to find a person in the phone book by first name) then BinarySearch will not give you the right results.
Note that it only returns one index, while Where will give you all items that match the criteria, but you can search on either side of the found element if there are duplicates (BinarySearch gives you one index that matches, not necessarily the first index).
Obviously the bigger the list, the more improvement BinarySearch is going to give you.
does Where already use a binary search behind the scenes?
No - it iterates through the list in physical order.
Related
I have a game loop which draws a list of objects, the list called "mylist" and holds something like 1000 objects, objects(especially bullets which fly fast and hit things) needs to be added and removed from the list constantly, few objects every second.
If i understand correctly the insert in List is practically free if the list capacity large enough, the problem is with removal which is O(n) because first of all i need to find the item in the list , and second it creates a new list after the removal.
If i could aggregate all the removals and make them once per frame, it would be efficient because i would use mylist.Except(listToRemove) and this would be in O(n).
but unfortunately i can't do it.
Linked list is also problematic because i need to find the Object in the list.
Any one has better suggestion ?
What about a HashSet? It supports O(1) insert and removal. If you need ordering, use a tree data structure or a LinkedList plus a Dictionary that allows you to find nodes quickly.
The BCL has a SortedList and a SortedSet and a SortedDictionary. All but one were very slow but I can never remember which one was the "good" one. It is tree-based internally.
If you are doing searching then your best bet is to use either a Dictionary that has average case of O(1) and worst case of O(n) (when you hash to the same bucket).
If you need to preserve order you can use any implentation of a balanced binary search tree and your performance will be O(logn).
In .NET this simply comes down to
Dictionary - Insertion/Removal
O(1) (average case)
SortedDictionary - preserves order for ordered operations, insertion/removal O(logn)
So ideally you would want to use a data structure that has O(1) or O(log n) remove and insert. A dictionary-type data structure would probably be ideal due to their constant average-case insert and delete time complexity. I would recommend either a HashSet and then overriding GetHashCode() on your game object base class.
Here is more details on all the different dictionary/hash table data structures.
Time complexity
Here are all the 'dictionary-like' data structures and their time-complexities:
Type Find by key Remove Add
HashSet O(1)* O(1)* O(1)**
Dictionary O(1)* O(1)* O(1)**
SortedList O(log n) O(n) O(n)
SortedDictionary O(log n) O(log n) O(log n)
* O(n) with collision
** O(n) with collision or when adding beyond the array's capacity.
HashSet vs Dictionary
The difference between a HashSet and a Dictionary is that a Dictionary works on a KeyValuePair whereas with a HashSet the key is the object itself (its GetHashCode() method).
SortedList vs SortedDictionary
You should only consider these if you need to maintain an order. The difference is that SortedDictionary uses a red-black tree and SortedList uses sorted arrays for its keys and values.
References
My blog post - .NET simple collections and dictionaries
MSDN - HashSet
MSDN - Dictionary
MSDN - SortedDictionary
MSDN - SortedList
Performance-wise, you'll probably see the best results by simply not inserting or removing anything from any kind of list. Allocate a sufficiently large array and tag all your objects with whether they should be rendered or not.
Given your requirements, however, this is all probably premature optimisation; you could very well re-create the entire list of 1000 elements 30 times per second and not see any performance drop.
I suggest reading the slides of this presentation given by Braid's creator, on using the right data structures in independent video games: http://the-witness.net/news/2011/06/how-to-program-independent-games/ (tl;dr just use arrays for everything).
How to construct/obtain a datastructure with the following capabilities:
Stores (key,value) nodes, keys implement IComparable.
Fast (log N) insertion and retrieval.
Fast (log N) method to retrieve the next higher/next lower node from any node. [EXAMPLE: if
the key values inserted are (7,cat), (4,dog),(12,ostrich), (13,goldfish) then if keyVal referred to (7,cat), keyVal.Next() should return a reference to (12,ostrich) ].
A solution with an enumerator from an arbitrary key would of course also suffice. Note that standard SortedDictionary functionality will not suffice, since only an enumerator over the entire set can be returned, which makes finding keyVal.next require N operations at worst.
Could a self-implemented balanced binary search tree (red-black tree) be fitted with node.next() functionality? Any good references for doing this? Any less coding-time consuming solutions?
I once had similar requirements and was unable to find something suitable. So I implemented an AVL tree. Here come some advices to do it with performance in mind:
Do not use recursion for walking the tree (insert, update, delete, next). Better use a stack array to store the way up to the root which is needed for balancing operations.
Do not store parent nodes. All operations will start from the root node and walk further down. Parents are not needed, if implemented carefully.
In order to find the Next() node of an existing one, usually Find() is first called. The stack produced by that, should be reused for Next() than.
By following these rules, I was able to implement the AVL tree. It is working very efficiently even for very large data sets. I would be willing to share, but it would need some modifications, since it does not store values (very easy) and does not rely on IComparable but on fixed key types of int.
The OrderedDictionary in PowerCollections provides a "get iterator starting at or before key" function that takes O(log N) time to return the first value. That makes it very fast to, say, scan the 1,000 items that are near the middle of a 50 million item set (which with SortedDictionary would require guessing to start at the start or the end, both of which are equally bad choices and would require iterator around 25 million items). OrderedDictionary can to that with just 1,000 items iterated.
There is a problem in OrderedDictionary though in that it uses yield which causes O(n^2) performance and out of memory conditions when iterating a 50 million item set in a 32 bit process. There is a quite simple fix for that while I will document later.
I've been reading about Skip Lists lately.
I have a web application that executes quite complex Sql queries against static datasets.
I want to implement a caching system whereby I generate an md5 hash of the sql query and then return a cached dataset for the query if it exists in the collection.
Which algorithm would be better, Dictionary or a SkipList? Why?
http://msdn.microsoft.com/en-us/library/ms379573%28VS.80%29.aspx#datastructures20_4_topic4
The reason you would use a SkipList<T> vs Dictionary<TKey,TValue> is that a skip list keeps its items in order. If you regularly need to enumerate the items in order, a skip list is good because it can enumerate in O(n).
If you wanted to be able to enumerate in order but didn't care if enumeration is O(n lg n), a SortedSet<T> (or more likely a SortedDictionary<TKey, TValue>) would be what you'd want because they use red-black trees (balanced binary trees) and they are already in the standard library.
Since it's extremely unlikely that you will want to enumerate your cache in order (or at all), a skip list (and likewise a binary tree) is unnecessary.
Dictionary, definitely. Two reasons:
Dictionary<TKey, TValue> uses a hash table, making retrieval O(1) (i.e. constant time), compared to O(log n) in a skip list.
Dictionary<TKey, TValue> already exists and is well-tested and optimised, whereas a skip list class doesn’t exist to my knowledge, so you would have to implement your own, which takes effort to get it right and test it thoroughly.
Memory consumption is about the same for both (certainly the same complexity, namely O(n)).
Skip List gives an average of Log(n) on all dictionary operations. If the number of items is fixed then a lock stripped hash table will do great. An in memory splay tree is also good since cache is the word. Splay trees give faster for the recently accessed item. As such in a dictionary operation such as find; [skip lists were slow compared to splay tree which again were slow compared to hash tables.][1][1]: http://harisankar-krishnaswamy.blogspot.in/2012/04/skip-list-runtime-on-dictionay.html
If localization in data structure is needed then skip lists can be useful. For example, finding flights around a date etc. But, a cache is in memory so a splay is fine. Hashtable and splay trees dont provide localization.
In some library code, I have a List that can contain 50,000 items or more.
Callers of the library can invoke methods that result in strings being added to the list. How do I efficiently check for uniqueness of the strings being added?
Currently, just before adding a string, I scan the entire list and compare each string to the to-be-added string. This starts showing scale problems above 10,000 items.
I will benchmark this, but interested in insight.
if I replace the List<> with a Dictionary<> , will ContainsKey() be appreciably faster as the list grows to 10,000 items and beyond?
if I defer the uniqueness check until after all items have been added, will it be faster? At that point I would need to check every element against every other element, still an n^^2 operation.
EDIT
Some basic benchmark results. I created an abstract class that exposes 2 methods: Fill and Scan. Fill just fills the collection with n items (I used 50,000). Scan scans the list m times (I used 5000) to see if a given value is present. Then I built an implementation of that class for List, and another for HashSet.
The strings used were uniformly 11 characters in length, and randomly generated via a method in the abstract class.
A very basic micro-benchmark.
Hello from Cheeso.Tests.ListTester
filling 50000 items...
scanning 5000 items...
Time to fill: 00:00:00.4428266
Time to scan: 00:00:13.0291180
Hello from Cheeso.Tests.HashSetTester
filling 50000 items...
scanning 5000 items...
Time to fill: 00:00:00.3797751
Time to scan: 00:00:00.4364431
So, for strings of that length, HashSet is roughly 25x faster than List , when scanning for uniqueness. Also, for this size of collection, HashSet has zero penalty over List when adding items to the collection.
The results are interesting and not valid. To get valid results, I'd need to do warmup intervals, multiple trials, with random selection of the implementation. But I feel confident that that would move the bar only slightly.
Thanks everyone.
EDIT2
After adding randomization and multple trials, HashSet consistently outperforms List in this case, by about 20x.
These results don't necessarily hold for strings of variable length, more complex objects, or different collection sizes.
You should use the HashSet<T> class, which is specifically designed for what you're doing.
Use HashSet<string> instead of List<string>, then it should scale very well.
From my tests, HashSet<string> takes no time compared to List<string> :)
Possibly off-topic, but if you want to scale very large unique sets of strings (millions+) in a language-independent way, you might check out Bloom Filters.
Does the Contains(T) function not work for you?
I have read that dictionary<> is implemented as an associative array. In some languages (not necessarily anything related to .NET), string indexes are stored as a tree structure that forks at each node based upon the character in the node. Please see http://en.wikipedia.org/wiki/Associative_arrays.
A similar data structure was devised by Aho and Corasick in 1973 (I think). If you store 50,000 strings in such a structure, then it matters not how many strings you are storing. It matters more the length of the strings. If they are are about the same length, then you will likely never see a slow-down in lookups because the search algorithm is linear in run-time with respect to the length of the string you are searching for. Even for a red-black tree or AVL tree, the search run-time depends more upon the length of the string you are searching for rather than the number of elements in the index. However, if you choose to implement your index keys with a hash function, you now incurr the cost of hashing the string (going to be O(m), m = string length) and also the lookup of the string in the index, which will likely be on the order of O(log(n)), n = number of elements in the index.
edit: I'm not a .NET guru. Other more experienced people suggest another structure. I would take their word over mine.
edit2: your analysis is a little off for comparing uniqueness. If you use a hashing structure or dictionary, then it will not be an O(n^2) operation because of the reasoning I posted above. If you continue to use a list, then you are correct that it is O(n^2) * (max length of a string in your set) because you must examine each element in the list each time.
I have 60k items that need to be checked against a 20k lookup list. Is there a collection object (like List, HashTable) that provides an exceptionly fast Contains() method? Or will I have to write my own? In otherwords, is the default Contains() method just scan each item or does it use a better search algorithm.
foreach (Record item in LargeCollection)
{
if (LookupCollection.Contains(item.Key))
{
// Do something
}
}
Note. The lookup list is already sorted.
In the most general case, consider System.Collections.Generic.HashSet as your default "Contains" workhorse data structure, because it takes constant time to evaluate Contains.
The actual answer to "What is the fastest searchable collection" depends on your specific data size, ordered-ness, cost-of-hashing, and search frequency.
If you don't need ordering, try HashSet<Record> (new to .Net 3.5)
If you do, use a List<Record> and call BinarySearch.
Have you considered List.BinarySearch(item)?
You said that your large collection is already sorted so this seems like the perfect opportunity? A hash would definitely be the fastest, but this brings about its own problems and requires a lot more overhead for storage.
You should read this blog that speed tested several different types of collections and methods for each using both single and multi-threaded techniques.
According to the results, a BinarySearch on a List and SortedList were the top performers constantly running neck-in-neck when looking up something as a "value".
When using a collection that allows for "keys", the Dictionary, ConcurrentDictionary, Hashset, and HashTables performed the best overall.
I've put a test together:
First - 3 chars with all of the possible combinations of A-Z0-9
Fill each of the collections mentioned here with those strings
Finally - search and time each collection for a random string (same string for each collection).
This test simulates a lookup when there is guaranteed to be a result.
Then I changed the initial collection from all possible combinations to only 10,000 random 3 character combinations, this should induce a 1 in 4.6 hit rate of a random 3 char lookup, thus this is a test where there isn't guaranteed to be a result, and ran the test again:
IMHO HashTable, although fastest, isn't always the most convenient; working with objects. But a HashSet is so close behind it's probably the one to recommend.
Just for fun (you know FUN) I ran with 1.68M rows (4 characters):
Keep both lists x and y in sorted order.
If x = y, do your action, if x < y, advance x, if y < x, advance y until either list is empty.
The run time of this intersection is proportional to min (size (x), size (y))
Don't run a .Contains () loop, this is proportional to x * y which is much worse.
If it's possible to sort your items then there is a much faster way to do this then doing key lookups into a hashtable or b-tree. Though if you're items aren't sortable you can't really put them into a b-tree anyway.
Anyway, if sortable sort both lists then it's just a matter of walking the lookup list in order.
Walk lookup list
While items in check list <= lookup list item
if check list item = lookup list item do something
Move to next lookup list item
If you're using .Net 3.5, you can make cleaner code using:
foreach (Record item in LookupCollection.Intersect(LargeCollection))
{
//dostuff
}
I don't have .Net 3.5 here and so this is untested. It relies on an extension method. Not that LookupCollection.Intersect(LargeCollection) is probably not the same as LargeCollection.Intersect(LookupCollection) ... the latter is probably much slower.
This assumes LookupCollection is a HashSet
If you aren't worried about squeaking every single last bit of performance the suggestion to use a HashSet or binary search is solid. Your datasets just aren't large enough that this is going to be a problem 99% of the time.
But if this just one of thousands of times you are going to do this and performance is critical (and proven to be unacceptable using HashSet/binary search), you could certainly write your own algorithm that walked the sorted lists doing comparisons as you went. Each list would be walked at most once and in the pathological cases wouldn't be bad (once you went this route you'd probably find that the comparison, assuming it's a string or other non-integral value, would be the real expense and that optimizing that would be the next step).