I've used generic dictionaries in C# a fair bit. Things like:
var example = new Dictionary<int, string> {
{ 0, "Test0" },
{ 1, "Test1" } };
I vaguely remember being told that, before generics came along, you could use a Hashtable(). Basically the same thing, but without a specific type (so value types are going to be boxed, I think).
var example2 = new Hashtable {
{0, "Test0"},
{1, "Test1"} };
And there are questions like this one discussing why we prefer Dictionary over Hashtables (Why is Dictionary preferred over hashtable?).
But what about all the other 'dictionary' types?
SortedDictionary<K,V> - Seems to work like Dictionary but its .Keys collection is sorted. I'm not sure why you'd care though.
OrderedDictionary is non-generic like a Hashtable, but I can't wrap my head around how it differs from a Hashtable. http://msdn.microsoft.com/en-us/library/system.collections.specialized.ordereddictionary.aspx mentions that its keys are not sorted like a SortedDictionary's, so I just plain don't see why or when to use this.
ListDictionary - Smaller/Faster than Hashtable (but is it faster than a generic Dictionary?) when the number of elements is less than 10. Again, I'm at a loss for when you'd use this.
I'm also confused about SortedList<K,V>. When I hear List I don't think key/value pairs (maybe I should?). It implements IDictionary<TKey,TValue>. From this question, I can see that it differs from SortedDictionary in its performance characteristics (What's the difference between SortedList and SortedDictionary?)
Can Someone Briefly Explain When To Use Which Dictionary Type?
For the sake of simplicity, assume I have access to .Net 4.5 or higher...so maybe there is no situation where Hashtable is useful any more?
Both Dictionary and Hashtable indicate the use of some kind of indexing of the data. Consulting the index takes some time, making it slower at small numbers of elements.
A List does not use an index, and items are typically added at the end. When inserting items, the other items "physically" move to create room for the new element, and when removing items, the other items move to close the gap.
A Dictionary typically does not preserve order, and may contain gaps in memory. When adding items, these gaps may be filled by the new item. Iterating over the Dictionary would then return the items in a different order.
Sorting is a different kind of ordering - it does not preserve the order in which items were added, but follows rules to determine the place of added items.
It's funny that ArrayList became List<T> when the genericalisation happened, and Hashtable became Dictionary<T, U> - both removing the technical aspect from the name, leaving only the name of the abstraction.
Use Dictionary<TKey,TValue>. There's no reason to use the older non-generic hash table.
Ordered Dictionary
If the insertion order of the items in the dictionary matters, then use the OrderedDictionary.
Say I have a mapping of children to their favorite ice cream.
OrderedDictionary childToIcecream = new OrderedDictionary();
childToIcecream["Jake"] = "Vanilla";
childToIcecream["Kevin"] = "Chocolate";
childToIcecream["Megan"] = "Strawberry";
Each day one child gets an extra scoop in rotation. We could take the day number (Sunday = 0, Monday = 1..) mod it by the number of children, and pull their index from the dictionary to select whose lucky day it is. This of course only works if the dictionary maintains the order. Otherwise I would need a separate List<string> just for maintaining the order. You get key/value pairs and order in one container.
It's unfortunate there's no generic ordered dictionary, but someone posted an implementation here.
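The rotation described above can be sketched roughly as follows. This is only an illustration under a few assumptions: the helper name LuckyChild is hypothetical, and the keys are copied into an array because the int indexer of the non-generic OrderedDictionary returns the value at that position, not the key.

```csharp
using System;
using System.Collections.Specialized;

// Insertion order is preserved: Jake = 0, Kevin = 1, Megan = 2.
var childToIcecream = new OrderedDictionary();
childToIcecream["Jake"] = "Vanilla";
childToIcecream["Kevin"] = "Chocolate";
childToIcecream["Megan"] = "Strawberry";

// Hypothetical helper: picks the child whose turn it is on the given day.
string LuckyChild(OrderedDictionary map, DayOfWeek day)
{
    int index = (int)day % map.Count;   // Sunday = 0, Monday = 1, ...
    var names = new string[map.Count];
    map.Keys.CopyTo(names, 0);          // keys are copied in insertion order
    return names[index];
}

string child = LuckyChild(childToIcecream, DayOfWeek.Thursday); // 4 % 3 = 1
string flavor = (string)childToIcecream[child];                 // lookup by key
Console.WriteLine($"{child} gets an extra scoop of {flavor}");
```

With a plain Dictionary the position of "Kevin" would not be guaranteed, so the same modulo trick would need a separate list of names.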
Sorted Dictionary
Same for sorted dictionary. If you have a requirement that the key/value pairs be sorted by key, keeping the collection sorted at all times saves you from running an expensive sort operation every time you need the sorted view.
SortedDictionary<char, string> letterToWord = new SortedDictionary<char, string>();
letterToWord['b'] = "bat";
letterToWord['c'] = "cat";
letterToWord['a'] = "apple";
Say you have a dictionary like the above, except the user can build the letter associations at runtime. You always want to display it in alphabetical order, so it makes sense to always keep it sorted as each new item is added.
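As a small illustration of that point, the snippet below inserts the keys out of order and then enumerates them. It is a sketch that reuses the letterToWord example; the variable names keysInOrder and wordsInOrder are made up for this demonstration.

```csharp
using System;
using System.Collections.Generic;

var letterToWord = new SortedDictionary<char, string>();
letterToWord['b'] = "bat";
letterToWord['c'] = "cat";
letterToWord['a'] = "apple";

// Enumeration follows key order, not insertion order.
string keysInOrder = string.Join(",", letterToWord.Keys);    // a,b,c
string wordsInOrder = string.Join(",", letterToWord.Values); // apple,bat,cat
Console.WriteLine(keysInOrder);
Console.WriteLine(wordsInOrder);
```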
TLDR; Always use Dictionary<TKey, TValue> unless you have a circumstance that requires it to be ordered or sorted.
Related
I need to store a set of elements. What I need is functionality to
remove (single) elements and
add (sets of) elements and
each object should only be in the set once and
get a random element from the set
I chose the HashSet (C#) since it sports fast methods for removing elements (hashSet.Remove(element)) and adding sets (hashSet.UnionWith(anotherHashSet)), and the nature of a HashSet guarantees that there are no duplicates, so requirements 1 to 3 are taken care of.
The only way I found to get a random element is
object element = hashSet.ElementAt(rnd.Next(hashSet.Count));
But this is very slow, since I call it once for every pixel of my map (creating a random flood fill from multiple starting points; mapsize 500x500 at the moment but I'd like to go bigger) and the hashset holds rather many items. (A quick test shows it blows up to 5752 entries before shrinking again.)
Profiling (CPU sampling) tells me my ElementAt calls take over 50%.
I realize 500x500 operations over a big hashset is no easy task, but other operations (Remove and UnionWith) are called as often as ElementAt, so the main problem seems to be the operation and not the number of calls.
I vaguely understand why getting a certain element from a HashSet is very expensive (when compared to getting it from a list or another ordered data structure), but I just want a random pick. Can it really be so hard and is there no way around it? Is there a better data structure for my purpose?
Changing everything to Lists doesn't help because now other methods become bottlenecks and it takes even longer.
Converting the HashSet to an array and picking my random element from there expectedly doesn't help: while picking a random element from an array is quick, converting the HashSet to the array in the first place takes longer than running hashSet.ElementAt by itself.
If you want to understand better what I am trying to do: A link to my question and the answer.
I think that OrderedDictionary might suit your purposes:
var dict = new OrderedDictionary();
dict.Add("My String Key", "My String");
dict.Add(12345, 54321);
Console.WriteLine(dict[0]); // Prints "My String"
Console.WriteLine(dict[1]); // Prints 54321
Console.WriteLine(dict["My String Key"]); // Prints "My String"
Console.WriteLine(dict[(object)12345]); // Prints 54321 (note the need to cast!)
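This has fast add and O(1) lookup by index or by key. Be aware that removal is O(n), because the non-generic OrderedDictionary keeps its entries in an array-backed list that has to be searched and shifted. It only works with object keys and values though - there's no generic version.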
[EDIT] Many years later: We now have the strongly-typed generic OrderedDictionary<TKey, TValue> (added in .NET 9) which might be better.
The basic problem is the indexing.
In an array or a list, the data is indexed by its coördinate - usually just a simple int index. In a HashSet, you pick the index yourself - the key. The side-effect is, though, that there is no "coördinate" - the question "element at index 3" doesn't make sense, really. The way it's actually implemented is that the whole HashSet is enumerated, item after item, and the n-th item is returned. This means that to get the 1000th item, you have to enumerate all the 999 items before that as well. This hurts.
The best way to solve this would be to pick the random based on an actual key of the HashSet. Of course, this only works if it's reasonable to pick random keys just like that.
If you can't pick the key at random in a satisfactory way, you'll probably want to keep two separate lists - whenever you add a new item to a HashSet, add its key to a List<TKey>; you can then easily pick a random key from the List, and follow it. Depending on your requirements, duplicates may not be much of a problem.
And of course, you could save on the ElementAt enumerations if you only do the enumeration once - for example, before searching the HashSet, you could convert it to a List. This only makes sense if you're picking multiple random indices at once, of course (e.g. if you pick 5 indices at random at once, you pay for one full enumeration instead of five partial ones) - if you're always picking one, then modifying the HashSet and picking another, it's not going to help.
Depending on your exact use case, it might also be worth having a look at SortedSet. It works in a similar way to HashSet, but it maintains order in the keys. The helpful part is that you can use the GetViewBetween method to get a whole range of keys - you could use this quite effectively if your keys are sparse, but well balanced between arbitrary ranges. You'd just first pick a range at random, then get the items in range with GetViewBetween, and pick a random one out of those as well. In effect, this will allow you to partition the search results, and should save quite a bit of time.
What data structure could I use in C# to allow quick insertion/deletion as well as uniform random selection? A List has slow deletion by element (since it needs to find the index of the element each time), while a HashSet does not seem to allow random selection of an element (without copying to a list.)
The data structure will be updated continuously, so insertion and deletion need to be online procedures. It seems as if there should be a way to make insertion, deletion, and random selection all O(log n).
A binary search tree with arbitrary integer keys assigned to the objects would solve all of these problems, but I can't find the appropriate class in the C# standard library. Is there a canonical way to solve this without writing a custom binary search tree?
There is already a BST in the C# BCL: it's called SortedDictionary<TKey, TValue>. If you don't want key/value pairs but single items instead, you can use SortedSet<T> (available since .NET 4.0).
It sounds like from your example you'd want a SortedDictionary<int, WhateverValueType>. Though I'm not sure exactly what you are after when you say "uniform random selection".
Of course, the Dictionary<TKey, TValue> is O(1) which is much faster. So unless you have a need for sorted order of the keys, I'd use that.
UPDATE: From the sounds of your needs, you're going to have a catch-22 on efficiency. To be able to jump into a random contiguous index in the data structure, how often will you be inserting/deleting? If not often, you could use an array and just Sort() after (O(n log n)), or always insert/delete in order (O(n)).
Or, you could wrap a Dictionary<int, YourType> and keep a parallel List<int> and update it after every Add/Delete:
_dictionary.Add(newIndex, newValue);
_indexes.Add(newIndex);
And then just access a random index from the list on lookups. The nice thing is that with this method the Add() will be ~ O(1) (unless the List resizes, but you can set an initial capacity to avoid some of that), but you would incur an O(n) cost on removes.
I'm afraid the problem is you'll either sacrifice times on the lookups, or on the deletes/inserts. The problem is all the best access-time containers are non-contiguous. With the dual List<int>/Dictionary<int, YourValue> combo, though, you'd have a pretty good mix.
UPDATE 2: It sounds like from our continued discussion that if absolute performance is your requirement you may have better luck rolling your own. Was fun to think about though, I'll update if I think of anything else.
Binary search trees and derived structures, like SortedDictionary or SortedSet, operate by comparing keys.
Your objects are not comparable by themselves, but they offer object identity and a hash value. Therefore, a HashSet is the right data structure. Note: A Dictionary<int,YourType> is not appropriate because removal becomes a linear search (O(n)), and doesn't solve the random problem after removals.
Insert is O(1)
Remove is O(1)
RandomElement is O(n). It can easily be implemented, e.g.
set.ElementAt(random.Next(set.Count))
No copying to an intermediate list is necessary.
I realize that this question is over 3 years old, but just for people who come across this page:
If you don't need to keep the items in the data set sorted, you can just use a List<ItemType>.
Insertion and random selection are O(1). You can do deletion in O(1) by just moving the last item to the position of the item you want to delete and removing it from the end.
Code:
using System; // For the Random
using System.Collections.Generic; // The List
// List:
List<ItemType> list = new List<ItemType>();
// Add x:
ItemType x = ...; // The item to insert into the list
list.Add( x );
// Random selection
Random r = ...; // Probably get this from somewhere else
int index = r.Next( list.Count );
ItemType y = list[index];
// Remove item at index
list[index] = list[list.Count - 1]; // Copy last item to index
list.RemoveAt( list.Count - 1 ); // Remove from end of list
EDIT: Of course, to remove an element from the List<ItemType> you'll need to know its index. If you want to remove a random element, you can use a random index (as done in the example above). If you want to remove a given item, you can keep a Dictionary<ItemType,int> which maps the items to their indices. Adding, removing and updating these indices can all be done in O(1) (amortized).
Together this results in a complexity of O(1) (amortized) for all operations.
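The combination of a List and an index Dictionary described above could look roughly like this. It is a sketch under assumptions, not the answer author's code: the names items, indexOf, Add, Remove and PickRandom are made up for illustration, and string stands in for ItemType.

```csharp
using System;
using System.Collections.Generic;

var items = new List<string>();               // dense storage for O(1) random picks
var indexOf = new Dictionary<string, int>();  // item -> its current position in items
var random = new Random();

bool Add(string item)
{
    if (indexOf.ContainsKey(item)) return false; // keep elements unique
    indexOf[item] = items.Count;
    items.Add(item);
    return true;
}

bool Remove(string item)
{
    if (!indexOf.TryGetValue(item, out int index)) return false;
    int last = items.Count - 1;
    string moved = items[last];
    items[index] = moved;     // overwrite the removed slot with the last item
    indexOf[moved] = index;   // record the moved item's new position
    items.RemoveAt(last);     // dropping the tail needs no shifting
    indexOf.Remove(item);
    return true;
}

string PickRandom() => items[random.Next(items.Count)];

Add("a"); Add("b"); Add("c"); Add("d");
Remove("b");                      // "d" moves into the slot "b" occupied
Console.WriteLine(PickRandom());  // one of "a", "c", "d"
```

The order of the two dictionary updates in Remove matters: when the removed item is itself the last one, writing the moved index first and deleting the entry afterwards still leaves the dictionary consistent.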
I know Dictionaries don't store their key-value pairs in the order that they are added, but if I add the same key-value pairs (potentially in different orders) to two different Dictionaries and serialize the results, will the data on file be the same?
Edit: To clarify, I'm asking specifically about the output of GetObjectData(), not any particular serializer. Consider the following code:
Dictionary<string,List<string>> dict1 = new Dictionary<string,List<string>>();
Dictionary<string,List<string>> dict2 = new Dictionary<string,List<string>>();
string key11 = "key1";
string key12 = "key1";
string key21 = "key2";
string key22 = "key2";
List<string> values11 = new List<string>(1);
List<string> values12 = new List<string>(1);
List<string> values21 = new List<string>(1);
List<string> values22 = new List<string>(1);
values11.Add("value1");
values12.Add("value1");
values21.Add("value2");
values22.Add("value2");
dict1.Add(key11, values11);
dict2.Add(key22, values22);
dict1.Add(key21, values21);
dict2.Add(key12, values12);
Will dict1 and dict2 return the same thing for GetObjectData()? If not, why not?
Whether or not the output is the same would end up being an implementation detail that likely would not be guaranteed in future versions and/or alternate implementations. As such, I would recommend at the very least having a test written that verifies it and can run as part of your standard tests. But if you are implementing a solution that absolutely depends on it, then it may be worth writing your own serializer...
Almost certainly not! The Dictionary works by hashing, and there has to be some method of hash collision resolution. So let's say that the first time you go through the dictionary you add key1 and key2 in that order. key1 ends up in the "normal" spot for keys that hash to that particular value. key2 is stored "somewhere else" (dependent on the implementation).
Now change the order that you add keys. key2 goes in the normal spot and key1 goes "somewhere else."
You cannot make any assumptions about the order of the items in your dictionary.
Even if you could guarantee the order, that guarantee could be invalidated with the next change to the .NET Framework because the implementation of string.GetHashCode might change (it has in the past). That would completely change the order in which keys are stored in the dictionary's underlying data structures, so any saved data created by a previous version of the Framework would likely not agree with data you create when running with the new version.
That would depend on internal implementation details of the particular (version of the) Dictionary.
So in general, No.
There have been topics here about using Serialization to determine Equality; it fails on several corner cases.
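If the underlying goal is to find out whether two dictionaries hold the same data, one alternative that does not depend on storage order is to compare them entry by entry. The helper below is a hypothetical sketch (SameContent is not a framework method), and it assumes the value lists should match element by element.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper: true when both dictionaries hold the same keys
// and each key maps to lists with equal elements in the same order.
bool SameContent(Dictionary<string, List<string>> a, Dictionary<string, List<string>> b)
{
    if (a.Count != b.Count) return false;
    foreach (var pair in a)
    {
        if (!b.TryGetValue(pair.Key, out var other)) return false;
        if (!pair.Value.SequenceEqual(other)) return false;
    }
    return true;
}

var dict1 = new Dictionary<string, List<string>>();
var dict2 = new Dictionary<string, List<string>>();

// Same pairs, added in opposite orders.
dict1.Add("key1", new List<string> { "value1" });
dict1.Add("key2", new List<string> { "value2" });
dict2.Add("key2", new List<string> { "value2" });
dict2.Add("key1", new List<string> { "value1" });

bool equal = SameContent(dict1, dict2);
Console.WriteLine(equal);
```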
I have 60k items that need to be checked against a 20k lookup list. Is there a collection object (like List, HashTable) that provides an exceptionally fast Contains() method? Or will I have to write my own? In other words, does the default Contains() method just scan each item, or does it use a better search algorithm?
foreach (Record item in LargeCollection)
{
if (LookupCollection.Contains(item.Key))
{
// Do something
}
}
Note. The lookup list is already sorted.
In the most general case, consider System.Collections.Generic.HashSet as your default "Contains" workhorse data structure, because it takes constant time to evaluate Contains.
The actual answer to "What is the fastest searchable collection" depends on your specific data size, ordered-ness, cost-of-hashing, and search frequency.
If you don't need ordering, try HashSet<Record> (new to .Net 3.5)
If you do, use a List<Record> and call BinarySearch.
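Both suggestions can be tried side by side. The sketch below uses made-up integer keys in place of Record (lookupList and largeCollection are stand-ins for the collections in the question) and only counts the matches; it is not a benchmark.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Made-up stand-in data: 20,000 sorted lookup keys and 60,000 keys to check.
List<int> lookupList = Enumerable.Range(0, 20000).Select(i => i * 3).ToList();
List<int> largeCollection = Enumerable.Range(0, 60000).ToList();

// Option 1: hash-based lookup, order does not matter.
var lookupSet = new HashSet<int>(lookupList);
int hitsWithSet = largeCollection.Count(key => lookupSet.Contains(key));

// Option 2: binary search, requires lookupList to be sorted.
int hitsWithSearch = largeCollection.Count(key => lookupList.BinarySearch(key) >= 0);

Console.WriteLine($"{hitsWithSet} / {hitsWithSearch}");
```

BinarySearch returns a negative number when the item is absent, which is why the check is >= 0 rather than a comparison against -1.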
Have you considered List.BinarySearch(item)?
You said that your large collection is already sorted so this seems like the perfect opportunity? A hash would definitely be the fastest, but this brings about its own problems and requires a lot more overhead for storage.
You should read this blog that speed tested several different types of collections and methods for each using both single and multi-threaded techniques.
According to the results, BinarySearch on a List and on a SortedList were the top performers, consistently running neck and neck when looking up something as a "value".
When using a collection that allows for "keys", the Dictionary, ConcurrentDictionary, HashSet, and Hashtable performed the best overall.
I've put a test together:
First - 3 chars with all of the possible combinations of A-Z0-9
Fill each of the collections mentioned here with those strings
Finally - search and time each collection for a random string (same string for each collection).
This test simulates a lookup when there is guaranteed to be a result.
Then I changed the initial collection from all possible combinations to only 10,000 random 3 character combinations, this should induce a 1 in 4.6 hit rate of a random 3 char lookup, thus this is a test where there isn't guaranteed to be a result, and ran the test again:
IMHO HashTable, although fastest, isn't always the most convenient; working with objects. But a HashSet is so close behind it's probably the one to recommend.
Just for fun (you know FUN) I ran with 1.68M rows (4 characters):
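Keep both lists x and y in sorted order.
Compare the current elements: if x = y, do your action and advance; if x < y, advance x; if y < x, advance y. Repeat until either list is exhausted.
The run time of this intersection is proportional to size(x) + size(y) in the worst case, since every step advances at least one of the lists.
Don't run a .Contains() loop over a plain list; that is proportional to size(x) * size(y), which is much worse.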
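The walk over two sorted lists can be sketched like this. The function name IntersectSorted is hypothetical, int keys stand in for the real records, and the sketch assumes each list holds distinct values, so both positions advance on a match.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: walks two ascending lists once and collects common values.
List<int> IntersectSorted(IReadOnlyList<int> x, IReadOnlyList<int> y)
{
    var result = new List<int>();
    int i = 0, j = 0;
    while (i < x.Count && j < y.Count)
    {
        int cmp = x[i].CompareTo(y[j]);
        if (cmp == 0)
        {
            result.Add(x[i]);   // "do your action" goes here
            i++;
            j++;
        }
        else if (cmp < 0)
        {
            i++;                // x is behind, advance x
        }
        else
        {
            j++;                // y is behind, advance y
        }
    }
    return result;
}

var common = IntersectSorted(new[] { 1, 3, 5, 7, 9 }, new[] { 3, 4, 5, 6, 9, 10 });
Console.WriteLine(string.Join(",", common)); // 3,5,9
```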
If it's possible to sort your items then there is a much faster way to do this than doing key lookups into a hashtable or b-tree. Though if your items aren't sortable you can't really put them into a b-tree anyway.
Anyway, if sortable sort both lists then it's just a matter of walking the lookup list in order.
Walk lookup list
While items in check list <= lookup list item
if check list item = lookup list item do something
Move to next lookup list item
If you're using .Net 3.5, you can make cleaner code using:
foreach (Record item in LookupCollection.Intersect(LargeCollection))
{
//dostuff
}
I don't have .Net 3.5 here and so this is untested. It relies on an extension method. Note that LookupCollection.Intersect(LargeCollection) is probably not the same as LargeCollection.Intersect(LookupCollection) ... the latter is probably much slower.
This assumes LookupCollection is a HashSet
If you aren't worried about squeezing out every last bit of performance, the suggestion to use a HashSet or binary search is solid. Your datasets just aren't large enough that this is going to be a problem 99% of the time.
But if this is just one of thousands of times you are going to do this and performance is critical (and proven to be unacceptable using HashSet/binary search), you could certainly write your own algorithm that walks the sorted lists doing comparisons as it goes. Each list would be walked at most once, and the pathological cases wouldn't be bad (once you went this route you'd probably find that the comparison, assuming it's a string or other non-integral value, would be the real expense and that optimizing that would be the next step).
I've been working on a project where I need to iterate through a collection of data and remove entries where the "primary key" is duplicated. I have tried using a
List<int>
and
Dictionary<int, bool>
With the dictionary I found slightly better performance, even though I never need the Boolean tagged with each entry. My expectation is that this is because a List allows for indexed access and a Dictionary does not. What I was wondering is, is there a better solution to this problem? I do not need to access the entries again, I only need to track what "primary keys" I have seen and make sure I only perform additional work on entries that have a new primary key. I'm using C# and .NET 2.0. And I have no control over fixing the input data to remove the duplicates from the source (unfortunately!). And so you can have a feel for scaling, overall I'm checking for duplicates about 1,000,000 times in the application, but in subsets of no more than about 64,000 that need to be unique.
They have added the HashSet class in .NET 3.5. But I guess it will be on par with the Dictionary. If you have fewer than, say, 100 elements, a List will probably perform better.
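For readers who can move to .NET 3.5 or later (the question itself targets .NET 2.0), a HashSet-based duplicate check might look like the sketch below. The input array and the names seen and firstOccurrences are made up for illustration.

```csharp
using System;
using System.Collections.Generic;

int[] keys = { 5, 3, 5, 7, 3 };   // made-up input with duplicate "primary keys"
var seen = new HashSet<int>();
var firstOccurrences = new List<int>();

foreach (int key in keys)
{
    // Add returns false when the key is already in the set.
    if (seen.Add(key))
    {
        firstOccurrences.Add(key); // do the additional work only once per key
    }
}

Console.WriteLine(string.Join(",", firstOccurrences)); // 5,3,7
```

Using the return value of Add avoids a separate Contains call, and no dummy Boolean value has to be stored per key.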
Edit: Nevermind my comment. I thought you're talking about C++. I have no idea if my post is relevant in the C# world..
A hash-table could be a tad faster. Binary trees (that's what is used in the dictionary) tend to be relatively slow because of the way the memory gets accessed. This is especially true if your tree becomes very large.
However, before you change your data-structure, have you tried to use a custom pool allocator for your dictionary? I bet the time is not spent traversing the tree itself but in the millions of allocations and deallocations the dictionary will do for you.
You may see a factor 10 speed-boost just plugging a simple pool allocator into the dictionary template. Afaik boost has a component that can be directly used.
Another option: If you know only 64.000 entries in your integers exist you can write those to a file and create a perfect hash function for it. That way you can just use the hash function to map your integers into the 0 to 64.000 range and index a bit-array.
Probably the fastest way, but less flexible. You have to redo your perfect hash function (can be done automatically) each time your set of integers changes.
I don't really get what you are asking.
Firstly, it's just the opposite of what you say. The dictionary has indexed access (it is a hash table) while the List doesn't.
If you already have the data in a dictionary then all keys are unique, there can be no duplicates.
I suspect you have the data stored in another data type and you're storing it into the dictionary. If that's the case, then inserting the data will work with two dictionaries.
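foreach (int key in keys)
{
    if (!MyDataDict.ContainsKey(key))
    {
        // First time this key is seen (true is only a placeholder value)
        MyDataDict.Add(key, true);
    }
    else if (!MyDuplicatesDict.ContainsKey(key))
    {
        // Key was seen before: record it once as a duplicate
        MyDuplicatesDict.Add(key, true);
    }
}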
If you are checking for uniqueness of integers, and the range of integers is constrained enough then you could just use an array.
For better packing you could implement a bitmap data structure (basically an array, but each int in the array represents 32 ints in the key space by using 1 bit per key). That way if your maximum number is 1,000,000 you only need 31,250 ints, i.e. ~122 KB of memory, for the data structure.
Performance of a bitmap would be O(1) (per check) which is hard to beat.
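A hand-rolled bitmap along those lines might look like this. It is a minimal sketch assuming non-negative keys up to a known maximum; MaxKey, bits and MarkSeen are hypothetical names chosen for the example.

```csharp
using System;

const int MaxKey = 1000000;
var bits = new uint[(MaxKey >> 5) + 1];   // one bit per key in 0..MaxKey

// Hypothetical helper: marks the key and reports whether it was marked before.
bool MarkSeen(int key)
{
    int word = key >> 5;                  // which 32-bit word holds the key
    uint mask = 1u << (key & 31);         // which bit inside that word
    bool seenBefore = (bits[word] & mask) != 0;
    bits[word] |= mask;
    return seenBefore;
}

bool first = MarkSeen(42);    // false: not seen yet
bool second = MarkSeen(42);   // true: duplicate
Console.WriteLine($"{first} {second}");
```

The framework's System.Collections.BitArray offers the same one-bit-per-key storage if a ready-made class is preferred.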
There was a question awhile back on removing duplicates from an array. For the purpose of the question performance wasn't much of a consideration, but you might want to take a look at the answers as they might give you some ideas. Also, I might be off base here, but if you are trying to remove duplicates from the array then a LINQ command like Enumerable.Distinct might give you better performance than something that you write yourself. As it turns out there is a way to get LINQ working on .NET 2.0 so this might be a route worth investigating.
If you're going to use a List, use the BinarySearch:
// initialize to a size if you know your set size
List<int> FoundKeys = new List<int>( 64000 );
Dictionary<int,int> FoundDuplicates = new Dictionary<int,int>();
foreach ( int Key in MyKeys )
{
// this is an O(log N) operation
int index = FoundKeys.BinarySearch( Key );
if ( index < 0 )
{
// if the Key is not in our list, index is the bitwise complement
// of the index of the next larger value in the list,
// i.e. the position it should occupy, and we maintain sorted-ness!
FoundKeys.Insert( ~index, Key );
}
else
{
// the Key was seen before: count the repeat
if ( FoundDuplicates.ContainsKey( Key ) )
{
FoundDuplicates[Key]++;
}
else
{
FoundDuplicates.Add( Key, 1 );
}
}
}
You can also use this for any type for which you can define an IComparer by using an overload: BinarySearch( T item, IComparer< T > );