I have a file of records, sorted alphabetically:
Andrew d432
Ben x127
...
...
Zac b332
The first field is a person name, the second field is some id. Once I read the file, I do not need to make any changes to the data.
I want to treat each record as a Key-Value pair, where the person name is the Key. I don't know which class to use in order to access a record (as fast as possible). Dictionary does not have a binary search. On the other hand, as I understand, SortedList and SortedDictionary should be used only when I need to insert/remove data.
Edit: To clarify, I'm talking about simply accessing a record, like:
x = MyDic["Zac"];
What no one has stated is why dictionaries are O(1) and why that IS faster than a binary search. One side point is that dictionaries are not sorted by the key. The whole point of a dictionary is to go to the exact* (for all practical purposes) location of the item that is referenced by the key value. It does not "search" for the item; it knows the exact location of the item you want.
So a binary search would be pointless on a hash-based dictionary because there is no need to "search" for an item when the collection already knows exactly where it is.
*This isn't completely true in the case of hash collisions, but the principle of the dictionary is to get the item directly, and any additional lookups are an implementation detail and should be rare.
On the other hand, as I understand, SortedList and SortedDictionary should be used only when I need to insert/remove data.
They should be used when you want the data automatically sorted when adding or removing data. Note that SortedDictionary loses the performance gain of a "normal" dictionary because it now has to search for the location using the key value. Its primary use is to allow you to iterate over the keys in order.
If you have a unique key value per item, don't need to iterate the items in any particular order, and want the fastest "get" performance, then Dictionary is the way to go.
In general, dictionary lookup will be faster than binary search of a collection. There are two specific cases when that's not true:
1. If the list is small (fewer than 15 (possibly as low as 10) items, in my tests), then the overhead of computing a hash code and going through the dictionary lookup will be slower than binary search on an array. But beyond 15 items, dictionary lookup beats binary search, hands down.
2. If there are many hash collisions (due either to a bad hash function or a dictionary with a high load factor), then dictionary lookup slows down. If it's really bad, then binary search could potentially beat dictionary lookup.
In 15 years working with .NET dictionaries holding all kinds of data, I've never seen #2 be a problem when using the standard String.GetHashCode() method with real world data. The only time I've run into trouble is when I created a bad GetHashCode() method.
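As a rough sketch of the Dictionary approach the answers above recommend, the records could be loaded once and then looked up by name. The `lines` array is a hypothetical stand-in for the file contents, and the single-space separator is an assumption about the format; the code is written as C# top-level statements.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for the file; real code might use File.ReadLines(path).
string[] lines = { "Andrew d432", "Ben x127", "Zac b332" };

// Build the dictionary once. The alphabetical order of the input is
// irrelevant here, because a hash table does not rely on ordering.
var idByName = new Dictionary<string, string>(lines.Length);
foreach (string line in lines)
{
    // Assumes exactly one space between the name and the id.
    string[] parts = line.Split(' ');
    idByName[parts[0]] = parts[1];
}

// Average-case O(1) lookup; TryGetValue avoids an exception for unknown names.
if (idByName.TryGetValue("Zac", out var id))
{
    Console.WriteLine(id); // b332
}
```

If a name can occur more than once in the file, the indexer assignment above silently keeps the last id; `Add` would throw instead.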
I need to store a set of elements. What I need is functionality to
1. remove (single) elements,
2. add (sets of) elements,
3. ensure each object is in the set only once, and
4. get a random element from the set.
I chose the HashSet (C#) since it sports fast methods for removing elements (hashSet.Remove(element)) and adding sets (hashSet.UnionWith(anotherHashSet)), and the nature of a HashSet guarantees that there are no duplicates, so requirements 1 to 3 are taken care of.
The only way I found to get a random element is
var element = hashSet.ElementAt(rnd.Next(hashSet.Count));
But this is very slow, since I call it once for every pixel of my map (creating a random flood fill from multiple starting points; mapsize 500x500 at the moment but I'd like to go bigger) and the hashset holds rather many items. (A quick test shows it blows up to 5752 entries before shrinking again.)
Profiling (CPU sampling) tells me my ElementAt calls take over 50%.
I realize 500x500 operations over a big hashset is no easy task, but other operations (Remove and UnionWith) are called as often as ElementAt, so the main problem seems to be the operation and not the number of calls.
I vaguely understand why getting a certain element from a HashSet is very expensive (when compared to getting it from a list or another ordered data structure), but I just want a random pick. Can it really be so hard, and is there no way around it? Is there a better data structure for my purpose?
Changing everything to Lists doesn't help because now other methods become bottlenecks and it takes even longer.
Converting the HashSet to an array and picking my random element from there expectedly doesn't help: while picking a random element from an array is quick, converting the HashSet to the array in the first place takes longer than running hashSet.ElementAt by itself.
If you want to understand better what I am trying to do: A link to my question and the answer.
I think that OrderedDictionary might suit your purposes:
var dict = new OrderedDictionary();
dict.Add("My String Key", "My String");
dict.Add(12345, 54321);
Console.WriteLine(dict[0]); // Prints "My String"
Console.WriteLine(dict[1]); // Prints 54321
Console.WriteLine(dict["My String Key"]); // Prints "My String"
Console.WriteLine(dict[(object)12345]); // Prints 54321 (note the need to cast!)
This has fast add and remove, and O(1) indexing. It only works with object keys and values though - there's no generic version.
[EDIT] Many years later: We now have the strongly-typed generic OrderedDictionary<TKey, TValue> (added in .NET 9), which might be better.
The basic problem is the indexing.
In an array or a list, the data is indexed by its coördinate - usually just a simple int index. In a HashSet, you pick the index yourself - the key. The side-effect is, though, that there is no "coördinate" - the question "element at index 3" doesn't make sense, really. The way it's actually implemented is that the whole HashSet is enumerated, item after item, and the n-th item is returned. This means that to get the 1000th item, you have to enumerate all the 999 items before that as well. This hurts.
The best way to solve this would be to pick the random based on an actual key of the HashSet. Of course, this only works if it's reasonable to pick random keys just like that.
If you can't pick the key at random in a satisfactory way, you'll probably want to keep two separate lists - whenever you add a new item to a HashSet, add its key to a List<TKey>; you can then easily pick a random key from the List, and follow it. Depending on your requirements, duplicates may not be much of a problem.
And of course, you could save on the ElementAt enumerations if you only do the enumeration once - for example, before searching the HashSet, you could convert it to List. This only makes sense if you're picking multiple random indices at once, of course (e.g. if you pick 5 indices at random at once, you'll save about 1/5th of the time on average) - if you're always picking one, then modifying the HashSet and picking another, it's not going to help.
Depending on your exact use case, it might also be worth having a look at SortedSet. It works in a similar way to HashSet, but it maintains order in the keys. The helpful part is that you can use the GetViewBetween method to get a whole range of keys - you could use this quite effectively if your keys are sparse, but well balanced between arbitrary ranges. You'd just first pick a range at random, then get the items in range with GetViewBetween, and pick a random one out of those as well. In effect, this will allow you to partition the search results, and should save quite a bit of time.
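One way to act on the "keep a separate list" suggestion above is sketched below. Note that this is a variation rather than exactly what the answer describes: instead of a HashSet plus a List of keys, it pairs a List with a Dictionary that remembers each item's position, and removes by swapping the last element into the gap so that add, remove and random pick all stay O(1) on average. The names `items`, `indexOf`, `Add`, `Remove` and `PickRandom` are made up for the illustration, and `int` stands in for whatever element type the map uses.

```csharp
using System;
using System.Collections.Generic;

// The list gives O(1) random indexing; the dictionary maps each item to its
// current position in the list, giving O(1) membership tests and removal.
var items = new List<int>();
var indexOf = new Dictionary<int, int>();
var rnd = new Random();

bool Add(int item)
{
    if (indexOf.ContainsKey(item)) return false; // already present: keep items unique
    indexOf[item] = items.Count;
    items.Add(item);
    return true;
}

bool Remove(int item)
{
    if (!indexOf.TryGetValue(item, out int index)) return false;
    int lastIndex = items.Count - 1;
    int last = items[lastIndex];
    // Move the last item into the hole, then drop the tail: nothing is shifted.
    items[index] = last;
    indexOf[last] = index;
    items.RemoveAt(lastIndex);
    indexOf.Remove(item);
    return true;
}

// Throws if the collection is empty, so check items.Count first.
int PickRandom() => items[rnd.Next(items.Count)];

foreach (int p in new[] { 10, 20, 30, 40 }) Add(p);
Remove(20);
Console.WriteLine(PickRandom()); // one of 10, 30, 40
```

The equivalent of UnionWith is simply calling Add for each element of the other set. The price of this layout is that the list order changes on every removal, which does not matter for a random pick.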
I'm currently using an ICollection to return all items where location.path.StartsWith(value).
The collection itself is kept in a singleton object and is hydrated on instantiation of the object from a sproc call to a Sql database. While the count of items is only around 1300 the collection itself has the potential to be searched often (I can't define often - maybe 100,000 maybe 1 million - it varies).
Given the details above, perhaps more are needed, what would be the most efficient collection type to use to find all items where path.StartsWith(value)?
I think you are looking for a trie, which lets you add an item associated with a key and then find all items whose key starts with your search term.
From the page:
As discussed below, a trie has a number of advantages over binary search trees. A trie can also be used to replace a hash table, over which it has the following advantages:
Looking up data in a trie is faster in the worst case, O(m) time (where m is the length of a search string), compared to an imperfect hash table. An imperfect hash table can have key collisions. A key collision is the hash function mapping of different keys to the same position in a hash table. The worst-case lookup speed in an imperfect hash table is O(N) time, but far more typically is O(1), with O(m) time spent evaluating the hash.
IIRC, I had looked for some C# code and found this implementation to work reasonably well.
Edit for comment:
With a dictionary, you would need to scan all keys to see whether they start with your search string. In a trie, you ask for the node that matches your search string, and you are then guaranteed that all elements under that node have a key that starts with the search string you gave.
Here you can see that a search for te would need to drill down two nodes inside the trie, arriving at a node where all descendants start with te.
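To make the prefix lookup concrete, here is a minimal trie sketch. It is not the implementation linked above; it is an illustrative array-backed version in which nodes are identified by an index, and the names `children`, `terminal`, `Insert` and `FindByPrefix` are invented for the example. Plain strings stand in for the items' paths.

```csharp
using System;
using System.Collections.Generic;

// Node i is described by children[i] (next character -> child node index)
// and terminal[i] (the paths that end exactly at node i). Node 0 is the root.
var children = new List<Dictionary<char, int>> { new Dictionary<char, int>() };
var terminal = new List<List<string>> { new List<string>() };

void Insert(string path)
{
    int node = 0;
    foreach (char c in path)
    {
        if (!children[node].TryGetValue(c, out int next))
        {
            // No edge for this character yet: create a new node.
            next = children.Count;
            children.Add(new Dictionary<char, int>());
            terminal.Add(new List<string>());
            children[node][c] = next;
        }
        node = next;
    }
    terminal[node].Add(path);
}

List<string> FindByPrefix(string prefix)
{
    var results = new List<string>();
    int node = 0;
    // Walk down one node per character of the prefix: O(m).
    foreach (char c in prefix)
    {
        if (!children[node].TryGetValue(c, out int next)) return results; // nothing matches
        node = next;
    }
    // Everything stored at or below this node starts with the prefix.
    var pending = new Stack<int>();
    pending.Push(node);
    while (pending.Count > 0)
    {
        int current = pending.Pop();
        results.AddRange(terminal[current]);
        foreach (int child in children[current].Values) pending.Push(child);
    }
    return results;
}

foreach (string s in new[] { "tea", "ted", "ten", "to", "inn" }) Insert(s);
Console.WriteLine(FindByPrefix("te").Count); // 3
```

The order of the returned matches is unspecified. With only about 1,300 items a linear StartsWith scan may well be fast enough, so it is worth measuring before switching.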
I have a game loop that draws a list of objects. The list, called "mylist", holds something like 1000 objects. Objects (especially bullets, which fly fast and hit things) need to be added to and removed from the list constantly, a few objects every second.
If I understand correctly, insertion into a List is practically free if the list capacity is large enough. The problem is removal, which is O(n): first I need to find the item in the list, and then it creates a new list after the removal.
If I could aggregate all the removals and perform them once per frame, it would be efficient, because I could use mylist.Except(listToRemove), which is O(n).
Unfortunately, I can't do that.
A linked list is also problematic because I need to find the object in the list.
Does anyone have a better suggestion?
What about a HashSet? It supports O(1) insert and removal. If you need ordering, use a tree data structure or a LinkedList plus a Dictionary that allows you to find nodes quickly.
The BCL has a SortedList and a SortedSet and a SortedDictionary. All but one were very slow but I can never remember which one was the "good" one. It is tree-based internally.
If you are searching, then your best bet is a Dictionary, which has an average case of O(1) and a worst case of O(n) (when everything hashes to the same bucket).
If you need to preserve order, you can use any implementation of a balanced binary search tree, and your performance will be O(log n).
In .NET this simply comes down to:
Dictionary - insertion/removal O(1) (average case)
SortedDictionary - preserves order for ordered operations, insertion/removal O(log n)
So ideally you would want to use a data structure that has O(1) or O(log n) remove and insert. A dictionary-type data structure would probably be ideal due to its constant average-case insert and delete time complexity. I would recommend a HashSet, with GetHashCode() overridden on your game object base class.
Here are more details on all the different dictionary/hash table data structures.
Time complexity
Here are all the 'dictionary-like' data structures and their time-complexities:
Type              Find by key   Remove     Add
HashSet           O(1)*         O(1)*      O(1)**
Dictionary        O(1)*         O(1)*      O(1)**
SortedList        O(log n)      O(n)       O(n)
SortedDictionary  O(log n)      O(log n)   O(log n)
* O(n) with collision
** O(n) with collision or when adding beyond the array's capacity.
HashSet vs Dictionary
The difference between a HashSet and a Dictionary is that a Dictionary works on a KeyValuePair whereas with a HashSet the key is the object itself (its GetHashCode() method).
SortedList vs SortedDictionary
You should only consider these if you need to maintain an order. The difference is that SortedDictionary uses a red-black tree and SortedList uses sorted arrays for its keys and values.
Performance-wise, you'll probably see the best results by simply not inserting or removing anything from any kind of list. Allocate a sufficiently large array and tag all your objects with whether they should be rendered or not.
Given your requirements, however, this is all probably premature optimisation; you could very well re-create the entire list of 1000 elements 30 times per second and not see any performance drop.
I suggest reading the slides of this presentation given by Braid's creator, on using the right data structures in independent video games: http://the-witness.net/news/2011/06/how-to-program-independent-games/ (tl;dr just use arrays for everything).
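The "fixed array plus a flag" idea from this answer might look roughly like the sketch below. The capacity, the single `x` coordinate and the names `Spawn`, `Despawn` and `CountActive` are all assumptions made for the illustration; a real game would keep more state per object.

```csharp
using System;

const int Capacity = 1024;          // hypothetical upper bound on live objects
var x = new float[Capacity];        // per-object state kept in a parallel array
var active = new bool[Capacity];    // the "should this be rendered?" tag

// Claims the first free slot and returns its index, or -1 if the pool is full.
// This is a linear scan, which is cheap for around a thousand slots.
int Spawn(float startX)
{
    for (int i = 0; i < Capacity; i++)
    {
        if (!active[i])
        {
            x[i] = startX;
            active[i] = true;
            return i;
        }
    }
    return -1;
}

// "Removing" an object just clears its flag: O(1), no allocation, no shifting.
void Despawn(int slot) => active[slot] = false;

int CountActive()
{
    int n = 0;
    foreach (bool a in active) if (a) n++;
    return n;
}

int first = Spawn(1f);
int second = Spawn(2f);
Despawn(first);
int reused = Spawn(3f);             // takes over the slot that was just freed
Console.WriteLine(reused == first); // True
```

The draw loop then iterates over the whole array and skips slots whose flag is false.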
I've been working on a project where I need to iterate through a collection of data and remove entries where the "primary key" is duplicated. I have tried using a
List<int>
and
Dictionary<int, bool>
With the dictionary I found slightly better performance, even though I never need the Boolean tagged with each entry. My expectation is that this is because a List allows for indexed access and a Dictionary does not. What I was wondering is whether there is a better solution to this problem. I do not need to access the entries again; I only need to track which "primary keys" I have seen and make sure I only perform additional work on entries that have a new primary key. I'm using C# and .NET 2.0, and I have no control over fixing the input data to remove the duplicates from the source (unfortunately!). So that you can get a feel for scaling: overall I'm checking for duplicates about 1,000,000 times in the application, but in subsets of no more than about 64,000 that need to be unique.
They added the HashSet class in .NET 3.5, but I guess it will be on par with the Dictionary. If you have fewer than, say, 100 elements, a List will probably perform better.
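For reference, a sketch of what the HashSet version could look like on .NET 3.5 or later (on .NET 2.0 the Dictionary<int, bool> from the question with ContainsKey plays the same role). The `keys` array is hypothetical sample input.

```csharp
using System;
using System.Collections.Generic;

int[] keys = { 7, 3, 7, 9, 3, 1 };   // hypothetical input containing repeats

var seen = new HashSet<int>();
var firstOccurrences = new List<int>();
foreach (int key in keys)
{
    // Add returns false when the key is already in the set,
    // so a single call both tests for the key and records it.
    if (seen.Add(key))
    {
        firstOccurrences.Add(key);   // do the "additional work" only here
    }
}

Console.WriteLine(string.Join(", ", firstOccurrences)); // 7, 3, 9, 1
```

Compared with the dictionary, this avoids storing an unused Boolean per key and needs only one hash lookup per entry.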
Edit: Never mind my comment. I thought you were talking about C++. I have no idea if my post is relevant in the C# world.
A hash table could be a tad faster. Binary trees (that's what is used in the dictionary) tend to be relatively slow because of the way the memory gets accessed. This is especially true if your tree becomes very large.
However, before you change your data-structure, have you tried to use a custom pool allocator for your dictionary? I bet the time is not spent traversing the tree itself but in the millions of allocations and deallocations the dictionary will do for you.
You may see a factor-of-10 speed boost just by plugging a simple pool allocator into the dictionary template. AFAIK Boost has a component that can be used directly.
Another option: if you know that only 64,000 distinct integers exist, you can write those to a file and create a perfect hash function for them. That way you can just use the hash function to map your integers into the 0 to 64,000 range and index a bit array.
Probably the fastest way, but less flexible. You have to redo your perfect hash function (can be done automatically) each time your set of integers changes.
I don't really get what you are asking.
Firstly, it is just the opposite of what you say: the Dictionary has indexed access by key (it is a hash table), while the List does not.
If you already have the data in a dictionary, then all keys are unique; there can be no duplicates.
I suspect you have the data stored in another data type and you're copying it into the dictionary. If that's the case, inserting the data will work with two dictionaries:
// MyDataDict and MyDuplicatesDict are assumed to be Dictionary<int, bool>,
// as in the question; the bool value is unused.
foreach (int key in keys)
{
    if (MyDataDict.ContainsKey(key))
    {
        // Seen before: record it as a duplicate (once).
        if (!MyDuplicatesDict.ContainsKey(key))
            MyDuplicatesDict.Add(key, true);
    }
    else
    {
        // First occurrence: keep it.
        MyDataDict.Add(key, true);
    }
}
If you are checking for uniqueness of integers, and the range of integers is constrained enough then you could just use an array.
For better packing you could implement a bitmap data structure (basically an array, but each int in the array represents 32 ints in the key space by using 1 bit per key). That way, if your maximum number is 1,000,000, you only need about 31,250 ints (roughly 122 KB of memory) for the data structure.
Performance of a bitmap would be O(1) (per check), which is hard to beat.
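A minimal sketch of such a bitmap, assuming the keys are non-negative and bounded by a known maximum (`MaxKey` and `MarkSeen` are names invented for this example):

```csharp
using System;

const int MaxKey = 1_000_000;              // assumed inclusive upper bound on keys
var bits = new uint[MaxKey / 32 + 1];      // one bit per possible key: 31,251 32-bit words

// Marks the key as seen and returns true if it had not been seen before.
// Keys outside 0..MaxKey are not handled by this sketch.
bool MarkSeen(int key)
{
    int word = key >> 5;                   // key / 32: which array element
    uint mask = 1u << (key & 31);          // key % 32: which bit inside it
    bool isNew = (bits[word] & mask) == 0;
    bits[word] |= mask;
    return isNew;
}

Console.WriteLine(MarkSeen(42)); // True
Console.WriteLine(MarkSeen(42)); // False
```

System.Collections.BitArray offers the same storage scheme ready-made, if writing the shifts by hand is not appealing.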
There was a question awhile back on removing duplicates from an array. For the purpose of the question performance wasn't much of a consideration, but you might want to take a look at the answers as they might give you some ideas. Also, I might be off base here, but if you are trying to remove duplicates from the array then a LINQ command like Enumerable.Distinct might give you better performance than something that you write yourself. As it turns out there is a way to get LINQ working on .NET 2.0 so this might be a route worth investigating.
If you're going to use a List, use BinarySearch:
// initialize to a size if you know your set size
List<int> FoundKeys = new List<int>( 64000 );
Dictionary<int,int> DuplicateKeys = new Dictionary<int,int>();
foreach ( int Key in MyKeys )
{
    // this is an O(log N) operation
    int index = FoundKeys.BinarySearch( Key );
    if ( index < 0 )
    {
        // if the Key is not in our list, index is the bitwise complement
        // of the index of the next larger element, i.e. the position the Key
        // should occupy, so inserting there maintains sorted-ness
        // (note that the insert itself is O(N))
        FoundKeys.Insert( ~index, Key );
    }
    else
    {
        // the Key was seen before: count the repeat
        if ( DuplicateKeys.ContainsKey( Key ) )
        {
            DuplicateKeys[Key]++;
        }
        else
        {
            DuplicateKeys.Add( Key, 1 );
        }
    }
}
You can also use this for any type for which you can define an IComparer by using an overload: BinarySearch( T item, IComparer<T> comparer );
Which would be faster for, say, 500 elements?
Or what's the faster data structure/collection for retrieving elements?
List<MyObj> myObjs = new List<MyObj>();
int i = myObjs.BinarySearch(myObjsToFind);
MyObj obj = myObjs[i];
Or
Dictionary<MyObj, MyObj> myObjss = new Dictionary<MyObj, MyObj>();
MyObj value;
myObjss.TryGetValue(myObjsToFind, out value);
I assume in your real code you'd actually populate myObjs - and sort it.
Have you just tried it? It will depend on several factors:
Do you need to sort the list for any other reason?
How fast is MyObj.CompareTo(MyObj)?
How fast is MyObj.GetHashCode()?
How fast is MyObj.Equals()?
How likely are you to get hash collisions?
Does it actually make a significant difference to you?
It'll take around 8 or 9 comparisons in the binary search case, against a single call to GetHashCode and some number of calls to Equals (depending on hash collisions) in the dictionary case. Then there's the intrinsic calculations (accessing arrays etc) involved in both cases.
Is this really a bottleneck for you though?
I'd expect Dictionary to be a bit faster at 500 elements, but not very much faster. As the collection grows, the difference will obviously grow.
I have been doing some real-world tests with an in-memory collection of about 500k items.
Binary search wins in every way.
Dictionary slows down the more hash collisions you have. Binary search technically slows down too, but nowhere near as fast as the dictionary's algorithm.
The neat thing about binary search is that it tells you exactly where to insert the item into the list if it is not found, so building the sorted list is pretty fast too (though not as fast).
Dictionaries that large also consume a lot of memory compared to a sorted list with binary search. In my tests the sorted list consumed about 27% of the memory a dictionary did (so the dictionary claimed 3.7x the memory).
For a smallish list a dictionary is just fine; once it gets largish it may not be the best choice.
The latter.
A binary search runs at O(log n) while a hashtable will be O(1).
Big 'O' notation, as used by some of the commenters, is a great guideline to use. In practice, though, the only way to be sure which way is faster in a particular situation is to time your own code before and after a change (as hinted at by Jon).
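A quick way to follow that advice is a Stopwatch harness along these lines. It is only a sketch: it uses int keys rather than the MyObj type from the question, the lookup count is arbitrary, and a serious measurement would add warm-up runs and repeat the timing several times.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

const int Count = 500;          // collection size from the question
const int Lookups = 1_000_000;  // hypothetical number of lookups to time

var sortedList = new List<int>(Count);
var dictionary = new Dictionary<int, int>(Count);
for (int i = 0; i < Count; i++)
{
    sortedList.Add(i * 2);      // added in ascending order, as BinarySearch requires
    dictionary[i * 2] = i;      // key -> its position in the list
}

long checksum = 0;              // consumed below so the loops do real work

var sw = Stopwatch.StartNew();
for (int i = 0; i < Lookups; i++)
{
    checksum += sortedList.BinarySearch((i % Count) * 2);
}
sw.Stop();
Console.WriteLine($"BinarySearch: {sw.ElapsedMilliseconds} ms");

sw.Restart();
for (int i = 0; i < Lookups; i++)
{
    checksum += dictionary[(i % Count) * 2];
}
sw.Stop();
Console.WriteLine($"Dictionary:   {sw.ElapsedMilliseconds} ms");
Console.WriteLine(checksum);
```

With a custom key type, the cost of CompareTo, GetHashCode and Equals will dominate, so the numbers from int keys should not be assumed to carry over.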
BinarySearch requires the list to already be sorted. [edit: Forgot that dictionary is a hashtable. So lookup is O(1)]. The two are not really the same either. The first one really just checks whether the item exists in the list and where it is. If you just want to check existence in a dictionary, use the ContainsKey method.