So I need to create a dictionary whose keys are objects with a custom Equals() method, and I discovered I need to override GetHashCode() too. I've heard that for optimal performance you should have hash codes that don't collide, but that seems counterintuitive. I might be misunderstanding it, but the entire point of using hash codes is to group items into buckets, and if the hash codes never collide, each bucket will only have one item, which seems to defeat the purpose.
So should I intentionally make my hash codes collide occasionally? Performance is important: this dictionary will probably grow to several million items, and I'll be doing lookups very often.
The goal of a hash code is to give you an index into an array, each slot of which is a bucket that may contain zero, one, or more items. The performance of the lookup then depends on the number of elements in the bucket: the fewer the better, since once you're in the bucket it's an O(n) search (where n is the number of elements in the bucket). Therefore it's ideal for the hash code to prevent collisions as much as possible, giving you the optimal O(1) lookup time as often as possible.
Dictionaries store data in buckets, but there isn't one bucket per hash code. The number of buckets is based on the capacity, and values are placed into buckets based on the hash code modulo the number of buckets.
Let's say you have a GetHashCode() method that produces these hash codes for five objects:
925
10641
14316
17213
28624
Hash codes should be spread out. So these look spread out, right? If we have 7 buckets, then we end up calculating the modulus of each which gives us:
1
1
1
0
1
So we end up with buckets:
0 - 1 item
1 - 4 items
2 - 0 items
3 - 0 items
4 - 0 items
5 - 0 items
6 - 0 items
oops, not so well spread out now.
This is not made up data. These are actual hash codes.
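If you want to see this for yourself, here's a quick console sketch (my own illustration, not part of the original answer) that reproduces the distribution above:

using System;

class BucketDemo
{
    static void Main()
    {
        int[] hashCodes = { 925, 10641, 14316, 17213, 28624 };
        const int bucketCount = 7;

        // Tally how many of the hash codes land in each bucket (hash % bucketCount).
        int[] buckets = new int[bucketCount];
        foreach (int hash in hashCodes)
            buckets[hash % bucketCount]++;

        for (int i = 0; i < bucketCount; i++)
            Console.WriteLine("{0} - {1} item(s)", i, buckets[i]);
    }
}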
Here's a sample of how to generate a hash code from contained data (not the formula used for the hash codes above, but a better one):
https://stackoverflow.com/a/263416/118703
You must ensure that the following holds:
(GetHashCode(a) != GetHashCode(b)) => !Equals(a, b)
The contrapositive says exactly the same thing:
Equals(a, b) => (GetHashCode(a) == GetHashCode(b))
Apart from that, generate as few collisions as possible. A collision is defined as:
(GetHashCode(a) == GetHashCode(b)) && !Equals(a, b)
A collision does not affect correctness, only performance. For example, a GetHashCode that always returns zero would be correct, but slow.
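For completeness, here's a minimal sketch of what a conforming Equals/GetHashCode pair might look like for a hypothetical two-field key type (the type and its fields are invented for illustration; HashCode.Combine is available from .NET Core 2.1 / .NET Standard 2.1 onwards, and the linked answer shows an equivalent manual formula for older frameworks):

using System;

// Hypothetical key type, used only for illustration.
public sealed class PersonKey : IEquatable<PersonKey>
{
    public string FirstName { get; }
    public string LastName { get; }

    public PersonKey(string firstName, string lastName)
    {
        FirstName = firstName;
        LastName = lastName;
    }

    public bool Equals(PersonKey other) =>
        other != null &&
        FirstName == other.FirstName &&
        LastName == other.LastName;

    public override bool Equals(object obj) => Equals(obj as PersonKey);

    // Equal objects must return equal hash codes; unequal objects should
    // collide as rarely as possible.
    public override int GetHashCode() =>
        HashCode.Combine(FirstName, LastName);
}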
The HashSet<T>.Contains implementation in .NET is:
/// <summary>
/// Checks if this hashset contains the item
/// </summary>
/// <param name="item">item to check for containment</param>
/// <returns>true if item contained; false if not</returns>
public bool Contains(T item) {
    if (m_buckets != null) {
        int hashCode = InternalGetHashCode(item);
        // see note at "HashSet" level describing why "- 1" appears in for loop
        for (int i = m_buckets[hashCode % m_buckets.Length] - 1; i >= 0; i = m_slots[i].next) {
            if (m_slots[i].hashCode == hashCode && m_comparer.Equals(m_slots[i].value, item)) {
                return true;
            }
        }
    }
    // either m_buckets is null or wasn't found
    return false;
}
And I read in a lot of places "search complexity in hashset is O(1)". How?
Then why does that for-loop exist?
Edit: .net reference link: https://github.com/microsoft/referencesource/blob/master/System.Core/System/Collections/Generic/HashSet.cs
The classic implementation of a hash table works by assigning elements to one of a number of buckets, based on the hash of the element. If the hashing was perfect, i.e. no two elements had the same hash, then we'd be living in a perfectly perfect world where we wouldn't need to care about anything - any lookup would be O(1) always, because we'd only need to compute the hash, get the bucket and say if something is inside.
We're not living in a perfectly perfect world. First off, consider string hashing. In .NET, there are (2^16)^n possible strings of length n; GetHashCode returns an int, and there are 2^32 possible values of int. That's exactly enough to hash every string of length 2 to a unique int, but if we want strings longer than that, there must exist two different values that give the same hash - this is called a collision. Also, we don't want to maintain 2^32 buckets at all times anyway. The usual way of dealing with that is to take the hash code and compute its value modulo the number of buckets to determine the bucket's number [1]. So, the takeaway is - we need to allow for collisions.
The referenced .NET Framework implementation uses the simplest way of dealing with collisions - every bucket holds a linked list of all objects that result in the particular hash. You add object A, it's assigned to a bucket i. You add object B, it has the same hash, so it's added to the list in bucket i right after A. Now if you lookup for any element, you need to traverse the list of all objects and call the actual Equals method to find out if that thing is actually the one you're looking for. That explains the for loop - in the worst case you have to go through the entire list.
Okay, so how can "search complexity in hashset" be O(1)? It's not. The worst-case complexity is proportional to the number of items. It's O(1) on average [2]. If all objects fall into the same bucket, asking for the elements at the end of the list (or for ones that are not in the structure but would fall into the same bucket) will be O(n).
So what do people mean by "it's O(1) on average"? The structure monitors how many objects it holds relative to the number of buckets, and if that ratio exceeds some threshold, called the load factor, it resizes. It's easy to see that this makes the average lookup time proportional to the load factor.
That's why it's important for hash functions to be uniform, meaning that the probability that two randomly chosen different objects get the same hash code assigned is 1/2^32 [3]. That keeps the distribution of objects in a hash table uniform, so we avoid pathological cases where one bucket contains a huge number of items.
Note that if you know the hash function and the algorithm used by the hash table, you can force such a pathological case and O(n) lookups. If a server takes inputs from a user and stores them in a hash table, an attacker who knows the hash function and the hash table implementation could use this as a vector for a denial-of-service attack. There are ways of dealing with that too. Treat this as a demonstration that yes, the worst case can be O(n) and that people are generally aware of that.
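As a quick illustration (my own sketch, not from the answer), here is how a perfectly correct but degenerate GetHashCode forces the worst case: every key lands in the same bucket, so each insertion has to walk an ever-growing chain and the whole loop becomes quadratic.

using System;
using System.Collections.Generic;
using System.Diagnostics;

// Every instance reports the same hash code, so all keys share a single bucket.
sealed class DegenerateKey : IEquatable<DegenerateKey>
{
    public int Value { get; }
    public DegenerateKey(int value) => Value = value;
    public bool Equals(DegenerateKey other) => other != null && Value == other.Value;
    public override bool Equals(object obj) => Equals(obj as DegenerateKey);
    public override int GetHashCode() => 42; // correct, but catastrophic for performance
}

class WorstCaseDemo
{
    static void Main()
    {
        var sw = Stopwatch.StartNew();
        var set = new HashSet<DegenerateKey>();
        for (int i = 0; i < 20000; i++)
            set.Add(new DegenerateKey(i)); // each Add walks the single, ever-growing chain
        sw.Stop();
        Console.WriteLine("Inserting 20,000 keys with a constant hash took {0} ms", sw.ElapsedMilliseconds);
    }
}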
There are dozens of other, more complicated ways hash tables can be implemented. If you're interested, you'll need to research them on your own. Since lookup structures are so commonplace in computer science, people have come up with all sorts of crazy optimisations that minimise not only the theoretical number of operations, but also things like CPU cache misses.
[1] That's exactly what's happening in the statement int i = m_buckets[hashCode % m_buckets.Length] - 1
[2] At least not for hash tables that use naive chaining, like this one. There exist hash tables with worst-case constant lookup time, but they're usually worse in practice than the theoretically (in terms of time complexity) slower implementations, mainly due to CPU cache misses.
[3] I'm assuming the domain of possible hashes is the set of all ints, so there are 2^32 of them, but everything I wrote generalises to any other non-empty, finite set of values.
GetHashCode() returns an Int32 as the hash.
I was wondering how this would work when the number of elements exceeds int.MaxValue, since all of them would have returned some integer <= int.MaxValue?
There is no requirement that if object1.GetHashCode() == object2.GetHashCode(), then object1.Equals(object2). Any container type that uses hash codes must be prepared to deal with hash collisions. One possible way to do that is to store all different objects with the same hash code in a list, and when looking up an object, first look up the hash code, and then iterate over the objects in the associated list, calling Equals for every object until you find a match.
As already mentioned, GetHashCode does not produce unique results.
A dictionary stores a key/value pair at each location, so when collisions occur, the items whose hash codes map to the same bucket (hash code modulo the size of the underlying array) are chained together, and the actual key you are looking for is found by comparing with Equals.
A Dictionary is O(1) in the best case, and even on average, but O(n) in the worst case.
My program retrieves a finite and complete list of elements I want to refer to by a string ID. I'm using a .Net Dictionary<string, MyClass> to store these elements. I personally have no idea how many elements there will be. It could be a few. It could be thousands.
Given that the program knows exactly how many elements it will be putting in the hash table, what should it specify as the table's capacity? Clearly it should be at least the number of elements it will contain, but using only that number will likely lead to numerous collisions.
Is there a guide to selecting the capacity of a hash table for a known number of elements to balance hash collisions and memory wastage?
EDIT: I'm aware a hash table's size can change. What I'm avoiding first and foremost is leaving it with the default allocation, then immediately adding thousands of elements causing countless resize operations. I won't be adding or removing elements once it's populated. If I know what's going in, I can ensure there's sufficient space upfront. My question relates to the balance of hash collisions versus memory wastage.
Your question seems to imply a false assumption, namely that the dictionary's capacity is fixed. It isn't.
If you know in any given case that a dictionary will hold at least some number of elements, then you can specify that number as the dictionary's initial capacity. The dictionary's capacity is always at least as large as its item count (this is true for .NET 2 through 4, at least; I believe this is an undocumented implementation detail that's subject to change).
Specifying the initial capacity reduces the number of memory allocations by eliminating those that would have occurred as the dictionary grows from its default initial capacity to the capacity you have chosen.
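Concretely, that just means passing the known element count to the constructor. A minimal sketch (the element type and its Id property are stand-ins for the question's MyClass, invented here for illustration):

using System.Collections.Generic;

// Stand-in for the question's MyClass; only the string ID matters here.
public sealed class MyClass
{
    public string Id { get; set; }
}

public static class LookupBuilder
{
    // Passing the known count as the initial capacity avoids the repeated
    // resize operations the question is worried about.
    public static Dictionary<string, MyClass> Build(IReadOnlyCollection<MyClass> elements)
    {
        var lookup = new Dictionary<string, MyClass>(elements.Count);
        foreach (var element in elements)
            lookup.Add(element.Id, element);
        return lookup;
    }
}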
If the hash function in use is well chosen, the number of collisions should be relatively small and should have a minimal impact on performance. Specifying an over-large capacity might help in some contrived situations, but I would definitely not give this any thought unless profiling showed that the dictionary's lookups were having a significant impact on performance.
(As an example of a contrived situation, consider a dictionary with int keys with a capacity of 10007, all of whose keys are a multiple of 10007. With the current implementation, all of the items would be stored in a single bucket, because the bucket is chosen by dividing the hash code by the capacity and taking the remainder. In this case, the dictionary would function as a linked list, and forcing it to use a different capacity would fix that.)
This is a bit of a subjective question, but let me try my best to answer it (from the perspective of CLR 2.0 only, as I have not yet explored whether there have been any changes to Dictionary in CLR 4.0).
You are using a dictionary keyed on string. Since there are infinitely many possible strings, it is reasonable to assume that every possible hash code is 'equally likely', or in other words that each of the 2^32 hash codes (the range of int) is equally likely for the string class. The current version of Dictionary in the BCL drops the 32nd (sign) bit from any 32-bit hash code thus obtained, to essentially get a 31-bit hash code. Hence the range we are dealing with is 2^31 unique, equally likely hash codes.
Note that the range of the hash codes is not dependent on the number of elements the dictionary contains or can contain.
The Dictionary class will use this hash code to assign a bucket to the MyClass object. So essentially, if two different strings return the same 31 bits of hash code (assuming the BCL designers have chosen the string hash function wisely, such instances should be fairly spread out), both will be assigned the same bucket. In such a hash collision, nothing can be done.
Now, in the current implementation of the Dictionary class, even different (31-bit) hash codes may still end up in the same bucket. The bucket index is determined as follows:
hash = <31-bit hash code>
pr = <smallest prime number greater than or equal to the current dictionary capacity>
bucket_index = hash % pr
Hence every hash code of the form (pr * factor + bucket_index) will end up in the same bucket, irrespective of the factor part.
If you want to be absolutely sure that all possible 31-bit hash codes end up in different buckets, the only way is to force pr to be greater than the largest possible 31-bit hash code. In other words, ensure that every hash code is of the form (pr * 0 + hash_code), i.e. pr should be at least 2^31. This by extension means that the dictionary capacity should be at least 2^31.
Note that the capacity required to minimize hash collisions is not at all dependent on the number of elements you want to store in the dictionary, but on the range of the possible hash codes.
As you can imagine, 2^31 is a huge memory allocation. In fact, if you try to specify 2^31 as the capacity, there will be two arrays of length 2^31. Consider that on a 32-bit machine the highest possible RAM address is 2^32!
If, for some reason, the default behavior of the dictionary is not acceptable to you and it is critical for you to minimize hash collisions (or rather, I would say, bucket collisions), the only hope you have is to provide your own hash code (i.e. not use string as the key). Such a hash code should keep the bucket-index formula above in mind and strive to minimize the range of possible hash codes. The simplest approach is to incrementally assign a number/index to your unique MyClass instances and use that number as the hash code. You can then specify the total number of MyClass instances as the dictionary capacity. Though, in such a case, an array could easily be maintained instead of a dictionary, since you know each object's 'index' and indexes are incremental.
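A minimal sketch of that incremental-ID idea (the class shape is mine, purely for illustration):

using System.Threading;

// Each instance receives a small sequential ID and uses it as its hash code, so the
// hash codes occupy the dense range 0..instanceCount-1 and bucket collisions are
// minimized whenever the dictionary capacity is at least instanceCount.
public sealed class MyClass
{
    private static int s_lastId = -1;
    private readonly int _id = Interlocked.Increment(ref s_lastId);

    public override int GetHashCode() => _id;

    public override bool Equals(object obj) => obj is MyClass other && other._id == _id;
}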
In the end, I would like to reiterate what others have said: 'there will not be countless resizes'. The dictionary doubles its capacity (rounded up to the nearest prime greater than or equal to the new capacity) each time it finds itself short of space. To save some processing, you can very well set the capacity to the number of MyClass instances you have, since the dictionary will require that much capacity to store the instances in any case; this will not minimize 'hash collisions', but in normal circumstances it will be fast enough.
Data structures like hash tables are designed for dynamic memory allocation. You can, however, specify an initial size for some of them, but when you add new elements beyond that, they will expand. There is no way to implicitly restrict the size.
There are many data structures available, each with its own advantages and disadvantages, and you need to select the best one for your case. Limiting the size does not affect performance; what makes the difference is the cost of Add, Delete and Search.
I am looking for the most efficient way to store a collection of integers. Right now they're being stored in a HashSet<T>, but profiling has shown that these collections weigh heavily on some performance-critical code and I suspect there's a better option.
Some more details:
Random lookups must be O(1) or close to it.
The collections can grow large, so space efficiency is desirable.
The values are uniformly distributed in a 64-bit space.
Mutability is not needed.
There's no clear upper bound on size, but tens of millions of elements is not uncommon.
The most painful performance hit right now is creating them. That seems to be allocation-related - clearing and reusing HashSets helps a lot in benchmarks, but unfortunately that is not a feasible option in the application code.
(added) Implementing a data structure that's tailored to the task is fine. Is a hash table still the way to go? A trie also seems like a possibility at first glance, but I don't have any practical experience with them.
HashSet is usually the best general purpose collection in this case.
If you have any specific information about your collection you may have better options.
If you have a fixed upper bound that is not incredibly large you can use a bit vector of suitable size.
If you have a very dense collection you can instead store the missing values.
If you have very small collections, <= 4 items or so, you can store them in a regular array. A full scan of such a small array may be faster than the hashing required to use the hash set.
If you don't have any more specific characteristics of your data than "large collections of int" HashSet is the way to go.
If the size of the values is bounded you could use a bitset. It stores one bit per possible value, so in total the memory use would be about n bits, with n being the greatest possible value.
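A minimal sketch of that bitset idea using System.Collections.BitArray, assuming the values are non-negative and bounded by a known maximum:

using System.Collections;

// Membership set for integers in [0, maxValue], one bit per possible value.
public sealed class IntBitSet
{
    private readonly BitArray _bits;

    public IntBitSet(int maxValue) => _bits = new BitArray(maxValue + 1);

    public void Add(int value) => _bits[value] = true;

    public bool Contains(int value) => _bits[value];
}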
Another option is a Bloom filter. Bloom filters are very compact, but you have to be prepared for an occasional false positive in lookups. You can find more about them on Wikipedia.
A third option is using a simple sorted array. Lookups are then O(log n) binary searches, with n being the number of integers. That may be fast enough.
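And the sorted-array option in code, assuming the values are all known up front so the set can be built once and then queried with Array.BinarySearch:

using System;

// Immutable membership structure: O(n log n) to build, O(log n) per lookup,
// and only 8 bytes of storage per element.
public sealed class SortedLongSet
{
    private readonly long[] _values;

    public SortedLongSet(long[] values)
    {
        _values = (long[])values.Clone();
        Array.Sort(_values);
    }

    public bool Contains(long value) => Array.BinarySearch(_values, value) >= 0;
}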
I decided to try and implement a special purpose hash-based set class that uses linear probing to handle collisions:
Backing store is a simple array of longs
The array is sized to be larger than the expected number of elements to be stored.
For a value's hash code, use the least-significant 31 bits.
Searching for the position of a value in the backing store is done using a basic linear probe, like so:
int FindIndex(long value)
{
    // Hash: take the least-significant 31 bits of the value, reduced modulo the array length.
    var index = (int)(value & 0x7FFFFFFF) % _storage.Length;
    var slotValue = _storage[index];
    if (slotValue == 0x0 || slotValue == value) return index;

    // Linear probe: walk forward (wrapping at the end) until we hit the value or an empty slot.
    for (++index; ; index++)
    {
        if (index == _storage.Length) index = 0;
        slotValue = _storage[index];
        if (slotValue == 0x0 || slotValue == value) return index;
    }
}
(I was able to determine that the data being stored will never include 0, so that number is safe to use for empty slots.)
The array needs to be larger than the number of elements stored. (Load factor less than 1.) If the set is ever completely filled then FindIndex() will go into an infinite loop if it's used to search for a value that isn't already in the set. In fact, it will want to have quite a lot of empty space, otherwise search and retrieval may suffer as the data starts to form large clumps.
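For context, a minimal sketch of how Contains and Add might sit on top of FindIndex (my reconstruction, not the asker's actual code; it assumes the backing array was pre-sized so it never fills up, and that 0 is never stored):

// Sketch only: relies on FindIndex returning either the slot already holding
// 'value' or the first empty (zero) slot in its probe sequence.
public bool Contains(long value)
{
    return _storage[FindIndex(value)] == value;
}

public bool Add(long value)
{
    int index = FindIndex(value);
    if (_storage[index] == value) return false; // already present
    _storage[index] = value;                    // claim the empty slot
    return true;
}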
I'm sure there's still room for optimization, and I may end up using some sort of BigArray<T> or sharding the backing store for very large sets. But initial results are promising. It performs over twice as fast as HashSet<T> at a load factor of 0.5, nearly twice as fast with a load factor of 0.8, and even at 0.9 it's still working 40% faster in my tests.
Overhead is 1 / load factor, so if those performance figures hold out in the real world then I believe it will also be more memory-efficient than HashSet<T>. I haven't done a formal analysis, but judging by the internal structure of HashSet<T> I'm pretty sure its overhead is well above 10%.
--
So I'm pretty happy with this solution, but I'm still curious if there are other possibilities. Maybe some sort of trie?
--
Epilogue: Finally got around to doing some competitive benchmarks of this vs. HashSet<T> on live data. (Before I was using synthetic test sets.) It's even beating my optimistic expectations from before. Real-world performance is turning out to be as much as 6x faster than HashSet<T>, depending on collection size.
What I would do is just create an array of integers with a sufficient size to handle however many integers you need. Is there any reason for staying away from the generic List<T>? http://msdn.microsoft.com/en-us/library/6sh2ey19.aspx
The most painful performance hit right now is creating them...
As you've obviously observed, HashSet<T> (prior to .NET Framework 4.7.2) does not have a constructor that takes a capacity argument to initialize its capacity.
One trick which I believe would work is the following:
int capacity = ... some appropriate number;
int[] items = new int[capacity];
HashSet<int> hashSet = new HashSet<int>(items);
hashSet.Clear();
...
Looking at the implementation with reflector, this will initialize the capacity to the size of the items array, ignoring the fact that this array contains duplicates. It will, however, only actually add one value (zero), so I'd assume that initializing and clearing should be reasonably efficient.
I haven't tested this so you'd have to benchmark it. And be willing to take the risk of depending on an undocumented internal implementation detail.
It would be interesting to know why Microsoft didn't provide a constructor with a capacity argument like they do for other collection types.
I want to know the probability of getting duplicate values when calling the GetHashCode() method on string instances. For instance, according to this blog post, blair and brainlessness have the same hashcode (1758039503) on an x86 machine.
Large.
(Sorry Jon!)
The probability of getting a hash collision among short strings is extremely large. Given a set of only ten thousand distinct short strings drawn from common words, the probability of there being at least one collision in the set is approximately 1%. If you have eighty thousand strings, the probability of there being at least one collision is over 50%.
For a graph showing the relationship between set size and probability of collision, see my article on the subject:
https://learn.microsoft.com/en-us/archive/blogs/ericlippert/socks-birthdays-and-hash-collisions
Small - if you're talking about the chance of any two arbitrary unequal strings having a collision. (It will depend on just how "arbitrary" the strings are, of course - different contexts will be using different strings.)
Large - if you're talking about the chance of there being at least one collision in a large pool of arbitrary strings. The small individual probabilities are no match for the birthday problem.
That's about all you need to know. There are definitely cases where there will be collisions, and there have to be, given that there are only 2^32 possible hash codes and more than that many strings - so the pigeonhole principle proves that at least one hash code must have more than one string which generates it. However, you should trust that the hash has been designed to be pretty reasonable.
You can rely on it as a pretty good way of narrowing down the possible matches for a particular string. It would be an unusual set of naturally-occurring strings which generated a lot of collisions - and even when there are some collisions, obviously if you can narrow a candidate search set down from 50K to fewer than 10 strings, that's a pretty big win. But you must not rely on it as a unique value for any string.
Note that the algorithm used in .NET 4 differs between x86 and x64, so that example probably isn't valid on both platforms.
I think all that's possible to say is "small, but finite and definitely not zero" -- in other words you must not rely on GetHashCode() ever returning unique values for two different instances.
To my mind, hash codes are best used when you want to tell quickly whether two instances are different - not whether they're the same.
In other words, if two objects have different hash codes, you know they are different and need not do a (possibly expensive) deeper comparison.
However, if the hash codes for two objects are the same, you must go on to compare the objects themselves to see if they're actually the same.
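In code, that idea is just a cheap short-circuit in front of the real comparison (a generic sketch, not any particular library's API):

// Hash codes can only prove inequality cheaply; proving equality still needs Equals.
// (Assumes non-null arguments.)
static bool AreEqual(object a, object b)
{
    if (a.GetHashCode() != b.GetHashCode())
        return false;      // different hashes: definitely not equal
    return a.Equals(b);    // same hash: fall back to the (possibly expensive) comparison
}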
I ran a test on a database of 466k English words and got 48 collisions with string.GetHashCode(). MurmurHash gives slightly better results. More results are here: https://github.com/jitbit/MurmurHash.net
Just in case your question is about the probability of a collision in a group of strings:
For n available slots and m occupying items:
Prob. of no collision on first insertion is 1.
Prob. of no collision on 2nd insertion is ( n - 1 ) / n
Prob. of no collision on 3rd insertion is ( n - 2 ) / n
Prob. of no collision on mth insertion is ( n - ( m - 1 ) ) / n
The probability of no collision after m insertions is the product of the above values: (n - 1)!/((n - m)! * n^(m - 1)).
which can also be written as ( n choose m ) * m! / n^m.
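Plugging in the 32-bit hash space, the standard birthday-problem approximation 1 - e^(-m(m-1)/(2n)) reproduces the figures quoted earlier (roughly 1% for ten thousand strings and over 50% for eighty thousand), assuming the hash behaves like a uniform random function:

using System;

class BirthdayApproximation
{
    static void Main()
    {
        double n = Math.Pow(2, 32); // number of possible hash codes

        foreach (int m in new[] { 10000, 80000 })
        {
            // P(at least one collision) is approximately 1 - exp(-m(m-1) / (2n)).
            double p = 1 - Math.Exp(-(double)m * (m - 1) / (2 * n));
            Console.WriteLine("{0} strings: about {1:P1} chance of at least one collision", m, p);
        }
    }
}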
And everybody is right: you can't assume zero collisions, so saying the probability is "low" may be true, but it doesn't allow you to assume there will be no collisions. If you're looking at a hash table, I think the standard rule of thumb is that you begin to have trouble with significant collisions when your hash table is about two-thirds full.
The probability of a collision between two randomly chosen strings is 1 / 2^(bits in hash code) if the hash is ideal (i.e. uniformly distributed), which real-world hash functions only approximate.