In C#.NET, I like using HashSets because of their supposed O(1) time complexity for lookups. If I have a large set of data that is going to be queried, I often prefer using a HashSet to a List, since it has this time complexity.
What confuses me is the constructor for the HashSet, which takes IEqualityComparer as an argument:
http://msdn.microsoft.com/en-us/library/bb359100.aspx
In the link above, the remarks note that the "constructor is an O(1) operation," but if this is the case, I am curious if lookup is still O(1).
In particular, it seems to me that, if I were to write a Comparer to pass in to the constructor of a HashSet, whenever I perform a lookup, the Comparer code would have to be executed on every key to check to see if there was a match. This would not be O(1), but O(n).
Does the implementation internally construct a lookup table as elements are added to the collection?
In general, how might I ascertain information about complexity of .NET data structures?
A HashSet works by hashing the objects you insert (via IEqualityComparer.GetHashCode) and tossing them into buckets according to the hash. The buckets themselves are stored in an array, hence the O(1) part.
For example (this is not necessarily exactly how the C# implementation works, it just gives a flavor) it takes the first character of the hash and throws everything with a hash starting with 1 into bucket 1. Hash of 2, bucket 2, and so on. Inside that bucket is another array of buckets that divvy up by the second character in the hash. So on for every character in the hash....
Now, when you look something up, it hashes it and jumps through the appropriate buckets. It has to do several array lookups (one for each character in the hash), but the cost does not grow as a function of N, the number of objects you've added, hence the O(1) rating.
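To make that concrete, here is a minimal sketch (my own illustration, not the real HashSet<T> internals): hashing picks the bucket directly, so the cost of finding it does not depend on how many items were added.

using System.Collections.Generic;

// Sketch only: map an item to a bucket index in constant time.
static int GetBucketIndex<T>(T item, IEqualityComparer<T> comparer, int bucketCount)
{
    int hash = comparer.GetHashCode(item) & 0x7FFFFFFF; // clear the sign bit
    return hash % bucketCount;                          // one array index: O(1)
}

int bucket = GetBucketIndex("hello", EqualityComparer<string>.Default, 127);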
To your other question, here is a blog post with the complexity of a number of collections' operations: http://c-sharp-snippets.blogspot.com/2010/03/runtime-complexity-of-net-generic.html
if I were to write a Comparer to pass in to the constructor of a HashSet, whenever I perform a lookup, the Comparer code would have to be executed on every key to check to see if there was a match. This would not be O(1), but O(n).
Let's call the value you are searching for the "query" value.
Can you explain why you believe the comparer has to be executed on every key to see if it matches the query?
This belief is false. (Unless of course the hash code supplied by the comparer is the same for every key!) The search algorithm executes the equality comparer on every key whose hash code matches the query's hash code, modulo the number of buckets in the hash table. That's how hash tables get O(1) lookup time.
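In rough C#, a chained hash-table lookup looks like this (a sketch of the general technique; TinySet, Entry, and the field names are illustrative, not the actual Dictionary/HashSet source, and Add doesn't bother to reject duplicates):

using System.Collections.Generic;

class TinySet<T>
{
    private class Entry
    {
        public T Value;
        public int HashCode;
        public Entry Next;
    }

    private readonly Entry[] _buckets = new Entry[127];
    private readonly IEqualityComparer<T> _comparer;

    public TinySet(IEqualityComparer<T> comparer) => _comparer = comparer;

    public void Add(T item)
    {
        int hash = _comparer.GetHashCode(item);
        int b = (hash & 0x7FFFFFFF) % _buckets.Length;
        _buckets[b] = new Entry { Value = item, HashCode = hash, Next = _buckets[b] };
    }

    public bool Contains(T query)
    {
        int hash = _comparer.GetHashCode(query);
        int b = (hash & 0x7FFFFFFF) % _buckets.Length;
        // Equals runs only on entries in this one bucket whose stored hash
        // matches the query's hash, never on every key in the set.
        for (Entry e = _buckets[b]; e != null; e = e.Next)
            if (e.HashCode == hash && _comparer.Equals(e.Value, query))
                return true;
        return false;
    }
}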
Does the implementation internally construct a lookup table as elements are added to the collection?
Yes.
In general, how might I ascertain information about complexity of .NET data structures?
Read the documentation.
Actually the lookup time of a HashSet<T> isn't always O(1).
As others have already mentioned a HashSet uses IEqualityComparer<T>.GetHashCode().
Now consider a struct or object which always returns the same hash code x.
If you add n items to your HashSet there will be n items with the same hash in it (as long as the objects aren't equal).
So if you were to check whether an element with the hash code x exists in your HashSet, it will run equality checks for all objects with the hash code x to test whether the HashSet contains the element.
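You can provoke this worst case on purpose. A sketch, assuming a deliberately broken comparer:

using System.Collections.Generic;

var set = new HashSet<string>(new ConstantHashComparer());
for (int i = 0; i < 10000; i++) set.Add(i.ToString());
bool found = set.Contains("9999"); // runs Equals down one long chain: O(n)

// Pathological: every string gets the same hash code, so all items
// land in a single bucket and lookup degrades to a linear scan.
class ConstantHashComparer : IEqualityComparer<string>
{
    public bool Equals(string a, string b) => string.Equals(a, b);
    public int GetHashCode(string s) => 42; // legal, but defeats the hashing
}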
It would depend on the quality of the hash function (GetHashCode()) that your IEqualityComparer implementation provides. An ideal hash function should provide a well-distributed, random-looking set of hash codes. These hash codes will be used as an index which allows mapping a key to a value, so searching for a value by key becomes more efficient, especially when the key is a complex object/structure.
the Comparer code would have to be executed on every key to check to see if there was a match. This would not be O(1), but O(n).
This is not how a hash table works; that would be a straightforward brute-force search. With a hash table you have a more intelligent approach, which searches by index (the hash code).
Lookup is still O(1) if you pass an IEqualityComparer. The hash set still uses the same logic as if you don't pass an IEqualityComparer; it just uses the IEqualityComparer's implementations of GetHashCode and Equals instead of the instance methods of System.Object (or the overrides provided by the object in question).
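For example, a case-insensitive HashSet<string> keeps O(1) lookups because the built-in StringComparer.OrdinalIgnoreCase supplies a matching GetHashCode and Equals pair:

using System;
using System.Collections.Generic;

// Hashing and equality both ignore case, so they stay consistent.
var names = new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "Alice", "Bob" };
Console.WriteLine(names.Contains("ALICE")); // True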
Related
Can anyone explain what is the complexity of
Dictionary.ContainsValue?
I know that Dictionary.ContainsKey complexity is O(1).
Short answer: O(n)
Long answer:
This method performs a linear search; therefore, the average execution time is proportional to Count. That is, this method is an O(n) operation, where n is Count.
Official documentation
O(n).
In particular, in the Dictionary structure a hash code is computed from the key. When you call ContainsKey, it computes the hash code of its argument and checks whether that hash code exists in the hash table.
The values of the dictionary are not hashed, so when you call ContainsValue the algorithm has to iterate over the stored values until it finds the first match (if one exists).
Note: if you find yourself using this function, you might need another structure with better complexity. You could, for example, additionally store all values in a HashSet<T>.
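A sketch of that idea (assuming you control all insertions, so the side set can be kept in sync):

using System.Collections.Generic;

var byKey = new Dictionary<string, string>();
var values = new HashSet<string>();

void Add(string key, string value)
{
    byKey.Add(key, value);
    values.Add(value); // maintain the side index on every insert
}

Add("a", "x");

bool slow = byKey.ContainsValue("x"); // O(n): scans every stored value
bool fast = values.Contains("x");     // O(1): hash lookup in the side set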
I have a file of records, sorted alphabetically:
Andrew d432
Ben x127
...
...
Zac b332
The first field is a person name, the second field is some id. Once I read the file, I do not need to make any changes to the data.
I want to treat each record as a Key-Value pair, where the person's name is the Key. I don't know which class to use in order to access a record (as fast as possible). Dictionary does not have a binary search. On the other hand, as I understand it, SortedList and SortedDictionary should be used only when I need to insert/remove data.
Edit: To clarify, I'm talking about simply accessing a record, like:
x = MyDic["Zac"]
What no one has stated is why dictionaries are O(1) and why it IS faster than a binary search. One side point is that dictionaries are not sorted by the key. The whole point of a dictionary is to go to the exact* (for all practical purposes) location of the item that is referenced by the key value. It does not "search" for the item - it knows the exact location of the item you want.
So a binary search would be pointless on a hash-based dictionary because there is no need to "search" for an item when the collection already knows exactly where it is.
*This isn't completely true in the case of hash collisions, but the principle of the dictionary is to get the item directly, and any additional lookups are an implementation detail and should be rare.
On the other hand, as I understand, SortedList and SortedDictionary should be used only when I need to insert/remove data.
They should be used when you want the data automatically sorted when adding or removing data. Note that SortedDictionary loses the performance gain of a "normal" dictionary because it now has to search for the location using the key value. Its primary use is to allow you to iterate over the keys in order.
If you have a unique key value per item, don't need to iterate the items in any particular order, and want the fastest "get" performance, then Dictionary is the way to go.
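A minimal sketch of that approach for the file above (the file name records.txt and the single-space field separator are assumptions taken from the question's sample data):

using System;
using System.Collections.Generic;
using System.IO;

// Build the dictionary once; every later access is an O(1) hash lookup.
var byName = new Dictionary<string, string>();
foreach (var line in File.ReadLines("records.txt"))
{
    var parts = line.Split(new[] { ' ' }, 2); // "Zac b332" -> ["Zac", "b332"]
    byName[parts[0]] = parts[1];              // name -> id
}

Console.WriteLine(byName["Zac"]); // direct access, no binary search needed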
In general dictionary lookup will be faster than binary search of a collection. There are two specific cases when that's not true:
If the list is small (fewer than 15 (possibly as low as 10) items, in my tests), then the overhead of computing a hash code and going through the dictionary lookup will be slower than binary search on an array. But beyond 15 items, dictionary lookup beats binary search, hands down.
If there are many hash collisions (due either to a bad hash function or a dictionary with a high load factor), then dictionary lookup slows down. If it's really bad, then binary search could potentially beat dictionary lookup.
In 15 years working with .NET dictionaries holding all kinds of data, I've never seen #2 be a problem when using the standard String.GetHashCode() method with real world data. The only time I've run into trouble is when I created a bad GetHashCode() method.
GetHashCode() returns an Int32 as the hash.
I was wondering how it would work when the number of elements exceeds int.MaxValue, since all of them would have returned some integer <= int.MaxValue?
There is no requirement that if object1.GetHashCode() == object2.GetHashCode(), then object1.Equals(object2). Any container type that uses hash codes must be prepared to deal with hash collisions. One possible way to do that is to store all different objects with the same hash code in a list, and when looking up an object, first look up the hash code, and then iterate over the objects in the associated list, calling Equals for every object until you find a match.
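Collisions cost speed, not correctness. A sketch with a deliberately coarse comparer (only 10 possible hash codes for every int) still returns the right answers, because Equals disambiguates within a bucket:

using System.Collections.Generic;

var set = new HashSet<int>(new CoarseComparer());
for (int i = 0; i < 1000; i++) set.Add(i); // 1000 items, only 10 distinct hashes
bool found = set.Contains(537);            // still true: Equals resolves collisions

// Deliberately coarse: collisions are guaranteed, yet lookups stay correct,
// just slower than with a well-distributed hash.
class CoarseComparer : IEqualityComparer<int>
{
    public bool Equals(int a, int b) => a == b;
    public int GetHashCode(int n) => n % 10;
}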
As already mentioned, GetHashCode does not produce unique results.
A dictionary stores a KeyValuePair at each location, so when collisions occur, the items with the same hash code (modded to the size of the underlying array) are chained, and the actual key you are looking to retrieve is searched for within that chain.
A Dictionary is O(1) in the best case, and even the average case, but in the worst case it is O(n).
Consider for example the documentation for the .NET Framework 4.5 Dictionary<TKey, TValue> class:
In the remarks for the .ContainsKey method, they state that
This method approaches an O(1) operation.
And in the remarks for the .Count property, they state that
Retrieving the value of this property is an O(1) operation.
Note that I am not necessarily asking for the details of C#, .NET, Dictionary or what Big O notation is in general. I just found this distinction of "approaches" intriguing.
Is there any difference? If so, how significant can it potentially be? Should I pay attention to it?
If the hash function used by the underlying objects in the hash table is "good", it means collisions will be very rare. Odds are good that there will be only one item in a given hash bucket, maybe two, and almost never more. If you could say for certain that there will never be more than c items (where c is a constant) in a bucket, then the operation would be O(c) (which is O(1)). But that assurance can't be made.
It's possible that you just happen to have n different items that, unluckily enough, all collide and end up in the same bucket, in which case ContainsKey is O(n). It's also possible that the hash function isn't "good" and results in frequent hash collisions, which can make the actual contains check worse than O(1).
It's because Dictionary is an implementation of a hash table; this means that a key lookup is done by using a hashing function that tells you which bucket, among the many buckets in the data structure, contains the value you're looking up. Usually, for a good hashing function and a large enough set of buckets, each bucket contains only a single element, in which case the complexity is indeed O(1). Unfortunately this is not always true: the hashing function may have clashes, in which case a bucket may contain more than one entry, and the algorithm has to iterate through the bucket until it finds the entry you're looking for. So it's no longer O(1) for these (hopefully rare) cases.
O(1) is constant running time; "approaching O(1)" is close to constant, but not quite. For most purposes the difference is negligible, and you should not pay attention to it.
Something either is or isn't O(1). I think what they're trying to say is that running time is approximately O(1) per operation for a large number of operations.
How is that integer hash generated by the GetHashCode() function? Is it a random value which is not unique?
In String, it is overridden to make sure that there exists only one hash code for a particular string. How is that done?
How is searching for a specific key in a hash table sped up using the hash code?
What are the advantages of using a hash code over using an index directly in the collection (like in arrays)?
Can someone help?
Basically, hash functions use some generic function to digest data and generate a fingerprint (an integer number here) for that data. Unlike an index, this fingerprint depends ONLY on the data, and should be free of any predictable ordering based on the data. Any change to a single bit of the data should also change the fingerprint considerably.
Notice that nowhere does this guarantee that different data won't give the same hash. In fact, quite the opposite: this happens very often, and is called a collision. But with an integer, the odds are roughly 1 in 4 billion against this (1 in 2^32). If a collision happens, you just compare the actual objects you are hashing to see if they match.
This fingerprint can then be used as an index to an array (or arraylist) of stored values. Because the fingerprint is dependent only on the data, you can compute a hash for something and just check the array element for that hash value to see if it has been stored already. Otherwise, you'd have to go through the whole array checking if it matches an item.
You can also VERY quickly do associative arrays by using 2 arrays, one with Key values (indexed by hash), and a second with values mapped to those keys. If you use a hash, you just need to know the key's hash to find the matching value for the key. This is much faster than doing a binary search on a sorted key list, or a scan of the whole array to find matching keys.
There are MANY ways to generate a hash, and all of them have various merits, but few are simple. I suggest consulting the wikipedia page on hash functions for more info.
A hash code IS an index, and a hash table, at its very lowest level, IS an array. But for a given key value, we determine the index into the hash table differently, to make for much faster data retrieval.
Example: You have 1,000 words and their definitions. You want to store them so that you can retrieve the definition for a word very, very quickly -- faster than a binary search, which is what you would have to do with an array.
So you create a hash table. You start with an array substantially bigger than 1,000 entries -- say 5,000 (the bigger, the more time-efficient).
The way you'll use your table is, you take the word to look up, and convert it to a number between 0 and 4,999. You choose the algorithm for doing this; that's the hashing algorithm. But you could doubtless write something that would be very fast.
Then you use the converted number as an index into your 5,000-element array, and insert/find your definition at that index. There's no searching at all: you've created the index directly from the search word.
All of the operations I've described are constant time; none of them takes longer when we increase the number of entries. We just need to make sure that there is sufficient space in the hash to minimize the chance of "collisions", that is, the chance that two different words will convert to the same integer index. Because that can happen with any hashing algorithm, we need to add checks to see if there is a collision, and do something special (if "hello" and "world" both hash to 1,234 and "hello" is already in the table, what will we do with "world"? Simplest is to put it in 1,235, and adjust our lookup logic to allow for this possibility.)
Edit: after re-reading your post: a hashing algorithm is most definitely not random, it must be deterministic. The index generated for "hello" in my example must be 1,234 every single time; that's the only way the lookup can work.
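Here is a toy version of that word/definition table (a teaching sketch, not production code: no resizing, no deletion), using the simplest collision policy described above, i.e. trying the next slot:

var table = new ToyTable();
table.Add("hello", "a greeting");
string def = table.Find("hello"); // "a greeting", no searching involved

// Toy open-addressing hash table for the 5,000-slot example.
class ToyTable
{
    private readonly string[] _keys = new string[5000];
    private readonly string[] _values = new string[5000];

    private int IndexFor(string key) =>
        (key.GetHashCode() & 0x7FFFFFFF) % _keys.Length;

    public void Add(string word, string definition)
    {
        int i = IndexFor(word);
        while (_keys[i] != null)        // collision: take the next free slot
            i = (i + 1) % _keys.Length;
        _keys[i] = word;
        _values[i] = definition;
    }

    public string Find(string word)
    {
        int i = IndexFor(word);
        while (_keys[i] != null)
        {
            if (_keys[i] == word) return _values[i];
            i = (i + 1) % _keys.Length; // probe past colliding entries
        }
        return null;                    // not in the table
    }
}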
Answering each one of your questions directly:
How is that integer hash generated by the GetHashCode() function? Is it a random value which is not unique?
An integer hash is generated by whatever method is appropriate for the object.
The generation method is not random but must follow consistent rules, ensuring that a hash generated for one particular object will equal the hash generated for an equivalent object. As an example, a hash function for an integer would be to simply return that integer.
In String, it is overridden to make sure that there exists only one hash code for a particular string. How is that done?
There are many ways this can be done. Here's an example I'm thinking of on the spot:
int hash = 0;
for (int i = 0; i < theString.Length; ++i)
{
    hash ^= theString[i]; // XOR each character's code into the running hash
}
This is a valid hash algorithm, because the same sequence of characters will always produce the same hash number. It's not a good hash algorithm (an extreme understatement), because many strings will produce the same hash. A valid hash algorithm doesn't have to guarantee uniqueness. A good hash algorithm will make a chance of two differing objects producing the same number extremely unlikely.
How is searching for a specific key in a hash table sped up using the hash code?
What are the advantages of using a hash code over using an index directly in the collection (like in arrays)?
A hash code is typically used in hash tables. A hash table is an array, but each entry in the array is a "bucket" of items, not just one item. If you have an object and you want to know which bucket it belongs in, calculate
hash_value MOD hash_table_size.
Then you simply have to compare the object with every item in the bucket. So a hash table lookup will most likely have a search time of O(1), as opposed to O(log(N)) for a sorted list or O(N) for an unsorted list.
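A minimal separate-chaining sketch of that bucket idea (illustrative only; the real Dictionary/HashSet internals differ):

using System.Collections.Generic;

// Each array slot is a "bucket" holding every item whose hash maps there.
var buckets = new List<string>[101];

int BucketFor(string s) => (s.GetHashCode() & 0x7FFFFFFF) % buckets.Length;

void Add(string item) =>
    (buckets[BucketFor(item)] ??= new List<string>()).Add(item);

bool Contains(string item) =>
    // compare only against the handful of items in this one bucket
    buckets[BucketFor(item)]?.Contains(item) ?? false;

Add("hello");
bool hit = Contains("hello"); // true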