The HashSet<T>.Contains implementation in .NET is:

/// <summary>
/// Checks if this hashset contains the item
/// </summary>
/// <param name="item">item to check for containment</param>
/// <returns>true if item contained; false if not</returns>
public bool Contains(T item) {
    if (m_buckets != null) {
        int hashCode = InternalGetHashCode(item);
        // see note at "HashSet" level describing why "- 1" appears in for loop
        for (int i = m_buckets[hashCode % m_buckets.Length] - 1; i >= 0; i = m_slots[i].next) {
            if (m_slots[i].hashCode == hashCode && m_comparer.Equals(m_slots[i].value, item)) {
                return true;
            }
        }
    }
    // either m_buckets is null or wasn't found
    return false;
}
And I read in a lot of places "search complexity in hashset is O(1)". How?
Then why does that for-loop exist?
Edit: .net reference link: https://github.com/microsoft/referencesource/blob/master/System.Core/System/Collections/Generic/HashSet.cs
The classic implementation of a hash table works by assigning elements to one of a number of buckets, based on the hash of the element. If the hashing was perfect, i.e. no two elements had the same hash, then we'd be living in a perfectly perfect world where we wouldn't need to care about anything - any lookup would be O(1) always, because we'd only need to compute the hash, get the bucket and say if something is inside.
We're not living in a perfectly perfect world. First off, consider string hashing. In .NET, there are (2^16)^n possible strings of length n; GetHashCode returns an int, and there are 2^32 possible values of int. That's exactly enough to hash every string of length 2 to a unique int, but for any longer strings there must exist two different values that give the same hash - this is called a collision. Also, we don't want to maintain 2^32 buckets at all times anyway. The usual way of dealing with that is to take the hash code and compute its value modulo the number of buckets to determine the bucket's number [1]. So the takeaway is: we need to allow for collisions.
The referenced .NET Framework implementation uses the simplest way of dealing with collisions - every bucket holds a linked list of all objects that result in the particular hash. You add object A, it's assigned to a bucket i. You add object B, it has the same hash, so it's added to the list in bucket i right after A. Now if you lookup for any element, you need to traverse the list of all objects and call the actual Equals method to find out if that thing is actually the one you're looking for. That explains the for loop - in the worst case you have to go through the entire list.
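The chaining idea can be sketched in a few lines of C#. This is a toy illustration only - the class and field names are made up for the example, and the real HashSet<T> flattens its chains into parallel arrays, as the m_slots code above shows:

```csharp
using System;
using System.Collections.Generic;

// Toy chained hash set: each bucket heads a linked chain of entries,
// and lookup walks the chain calling Equals on each candidate.
class ChainedSet<T>
{
    private sealed class Node
    {
        public T Value; public int HashCode; public Node Next;
        public Node(T value, int hashCode, Node next) { Value = value; HashCode = hashCode; Next = next; }
    }

    private readonly Node[] _buckets;
    private readonly IEqualityComparer<T> _comparer = EqualityComparer<T>.Default;

    public ChainedSet(int bucketCount) => _buckets = new Node[bucketCount];

    public void Add(T item)
    {
        int hash = _comparer.GetHashCode(item) & 0x7FFFFFFF;
        int i = hash % _buckets.Length;
        _buckets[i] = new Node(item, hash, _buckets[i]);  // prepend to the chain
    }

    public bool Contains(T item)
    {
        int hash = _comparer.GetHashCode(item) & 0x7FFFFFFF;
        // This walk is the analogue of the for loop in the question:
        // O(1) when chains are short, O(n) if everything collides.
        for (Node n = _buckets[hash % _buckets.Length]; n != null; n = n.Next)
            if (n.HashCode == hash && _comparer.Equals(n.Value, item))
                return true;
        return false;
    }
}
```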
Okay, so how can "search complexity in hashset is O(1)" be true? It isn't, in the worst case: the worst-case complexity is proportional to the number of items. It's O(1) on average [2]. If all objects fall into the same bucket, asking for the elements at the end of the list (or for ones that are not in the structure but would fall into the same bucket) will be O(n).
So what do people mean by "it's O(1) on average"? The structure monitors how many objects are there proportional to the number of buckets and if that exceeds some threshold, called the load factor, it resizes. It's easy to see that this makes the average lookup time proportional to the load factor.
That's why it's important for hash functions to be uniform, meaning that the probability that two randomly chosen different objects get the same int assigned is 1/2^32 [3]. That keeps the distribution of objects in a hash table uniform, so we avoid pathological cases where one bucket contains a huge number of items.
Note that if you know the hash function and the algorithm used by the hash table, you can force such a pathological case and O(n) lookups. If a server takes inputs from a user and stores them in a hash table, an attacker knowing the hash function and the hash table implementations could use this as a vector for a DDoS attack. There are ways of dealing with that too. Treat this as a demonstration that yes, the worst case can be O(n) and that people are generally aware of that.
There are dozens of other, more complicated ways hash tables can be implemented; if you're interested, you'll need to research them on your own. Since lookup structures are so commonplace in computer science, people have come up with all sorts of crazy optimisations that minimise not only the theoretical number of operations but also things like CPU cache misses.
[1] That's exactly what's happening in the statement int i = m_buckets[hashCode % m_buckets.Length] - 1
[2] At least the ones using naive chaining are not. There exist hash tables with worst-case constant time complexity. But usually they're worse in practice compared to the theoretically (in regards to time complexity) slower implementations, mainly due to CPU cache misses.
[3] I'm assuming the domain of possible hashes is the set of all ints, so there are 2^32 of them, but everything I wrote generalises to any other non-empty, finite set of values.
I am looking for a comparison of, and the performance considerations for, a list of integers versus a hash set of integers. This is what "What is the difference between HashSet<T> and List<T>?" talks about, for T as integer.
I will have up to several thousand integers, and I want to find out, for individual integers, whether they are contained in this set.
Now of course this screams for a hash set, but I wonder whether hashing is beneficial here, since they are just integers to start with. Would hashing them first not add unnecessary overhead here?
Or in other words: Is using a hash set beneficial, even for sets of integers?
Hashing an integer is very cheap, as you can see in the source code of the Int32.GetHashCode method:
// The absolute value of the int contained.
public override int GetHashCode()
{
    return m_value;
}
The hash of the number is the number itself. It can't get any cheaper than that. So there is no reason to be concerned about the overhead. Put your numbers in a HashSet, and enjoy searching with O(1) computational complexity.
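Both halves of that claim are easy to sanity-check in a few lines:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// An int's hash code is the int itself, so hashing adds no real overhead.
int n = 42;
Console.WriteLine(n.GetHashCode() == n);  // True

var set = new HashSet<int>(Enumerable.Range(0, 10_000));
Console.WriteLine(set.Contains(9_999));   // True
Console.WriteLine(set.Contains(10_000));  // False
```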
Whatever T is, there is a simple but effective rule of thumb:

The collection is mainly used for adding and iterating, with very few searches => use List
The collection is heavily used for searching => use HashSet
Consider for example the documentation for the .NET Framework 4.5 Dictionary<TKey, TValue> class:
In the remarks for the .ContainsKey method, they state that
This method approaches an O(1) operation.
And in the remarks for the .Count property, they state that
Retrieving the value of this property is an O(1) operation.
Note that I am not necessarily asking for the details of C#, .NET, Dictionary or what Big O notation is in general. I just found this distinction of "approaches" intriguing.
Is there any difference? If so, how significant can it potentially be? Should I pay attention to it?
If the hash function used by the underlying objects in the hash code is "good" it means collisions will be very rare. Odds are good that there will be only one item at a given hash bucket, maybe two, and almost never more. If you could say, for certain, that there will never be more than c items (where c is a constant) in a bucket, then the operation would be O(c) (which is O(1)). But that assurance can't be made. It's possible, that you just happen to have n different items that, unluckily enough, all collide, and all end up in the same bucket, and in that case, ContainsKey is O(n). It's also possible that the hash function isn't "good" and results in hash collisions frequently, which can make the actual contains check worse than just O(1).
It's because Dictionary is an implementation of hash tables: a key lookup uses a hashing function to tell you which bucket, among the many buckets in the data structure, contains the value you're looking up. Usually, for a good hashing function and a large enough set of buckets, each bucket contains only a single element, in which case the complexity is indeed O(1). Unfortunately this is not always true: the hashing function may have clashes, in which case a bucket may contain more than one entry, and the algorithm has to iterate through the bucket until it finds the entry you're looking for. So it's no longer O(1) for these (hopefully) rare cases.
O(1) is a constant running time, approaching O(1) is close to constant but not quite, but for most purposes is a negligible increase. You should not pay attention to it.
Something either is or isn't O(1). I think what they're trying to say is that running time is approximately O(1) per operation for a large number of operations.
My program retrieves a finite and complete list of elements I want to refer to by a string ID. I'm using a .Net Dictionary<string, MyClass> to store these elements. I personally have no idea how many elements there will be. It could be a few. It could be thousands.
Given that the program knows exactly how many elements it will be putting in the hash table, what should it specify as the table's capacity? Clearly it should be at least the number of elements it will contain, but using only that number will likely lead to numerous collisions.
Is there a guide to selecting the capacity of a hash table for a known number of elements to balance hash collisions and memory wastage?
EDIT: I'm aware a hash table's size can change. What I'm avoiding first and foremost is leaving it with the default allocation, then immediately adding thousands of elements causing countless resize operations. I won't be adding or removing elements once it's populated. If I know what's going in, I can ensure there's sufficient space upfront. My question relates to the balance of hash collisions versus memory wastage.
Your question seems to imply a false assumption, namely that the dictionary's capacity is fixed. It isn't.
If you know in any given case that a dictionary will hold at least some number of elements, then you can specify that number as the dictionary's initial capacity. The dictionary's capacity is always at least as large as its item count (this is true for .NET 2 through 4, at least; I believe this is an undocumented implementation detail that's subject to change).
Specifying the initial capacity reduces the number of memory allocations by eliminating those that would have occurred as the dictionary grows from its default initial capacity to the capacity you have chosen.
If the hash function in use is well chosen, the number of collisions should be relatively small and should have a minimal impact on performance. Specifying an over-large capacity might help in some contrived situations, but I would definitely not give this any thought unless profiling showed that the dictionary's lookups were having a significant impact on performance.
(As an example of a contrived situation, consider a dictionary with int keys with a capacity of 10007, all of whose keys are a multiple of 10007. With the current implementation, all of the items would be stored in a single bucket, because the bucket is chosen by dividing the hash code by the capacity and taking the remainder. In this case, the dictionary would function as a linked list, and forcing it to use a different capacity would fix that.)
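That contrived case is easy to reproduce on paper, because every key that is a multiple of the bucket count leaves remainder 0. A quick sketch - this mirrors the bucket-selection formula described above, it doesn't call into Dictionary's internals:

```csharp
using System;

const int capacity = 10007;  // the bucket count from the example above
int[] keys = { 0, 10007, 20014, 30021, 40028 };  // all multiples of 10007

foreach (int key in keys)
{
    // Bucket selection as described: hash modulo table size;
    // for int keys the hash code is the key itself.
    int bucket = (key & 0x7FFFFFFF) % capacity;
    Console.WriteLine($"{key} -> bucket {bucket}");  // every key lands in bucket 0
}
```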
This is a bit of a subjective question, but let me try my best to answer it (from the perspective of CLR 2.0 only, as I have not yet explored whether there have been any changes in Dictionary for CLR 4.0).
You are using a dictionary keyed on string. Since there are infinitely many possible strings, it is reasonable to assume that every possible hash code is 'equally likely'. Or in other words, each of the 2^32 hash codes (the range of int) is equally likely for the string class. The current version of Dictionary in the BCL drops the 32nd bit from any 32-bit hash code thus obtained, to essentially get a 31-bit hash code. Hence the range we are dealing with is 2^31 unique, equally likely hash codes.
Note that the range of the hash codes is not dependent on the number of elements dictionary contains or can contain.
The Dictionary class will use this hash code to allocate a bucket to the 'MyClass' object. So essentially, if two different strings return the same 31 bits of hash code (assuming the BCL designers have chosen the string hash function wisely, such instances should be fairly spread out), both will be allocated the same bucket. In such a hash collision, nothing can be done.
Now, in current implementation of the Dictionary class, it may happen that even different hash codes (again 31 bit) still end up in the same bucket. The bucket index is identified as follows:
hash = <31-bit hash code>
pr = <least prime number greater than or equal to current dictionary capacity>
bucket_index = hash mod pr
Hence every hash code of the form (pr*factor + bucket_index) will end up in same bucket irrespective of the factor part.
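A tiny illustration of that: any two hash codes that differ by a multiple of pr share a bucket (pr here is an arbitrary small prime chosen for the example, not the real capacity):

```csharp
using System;

int pr = 17;  // stand-in for the dictionary's current prime capacity
// Hash codes of the form (pr * factor + r) all map to bucket r,
// no matter what the factor is:
foreach (int factor in new[] { 0, 1, 2, 100 })
{
    int hash = pr * factor + 5;
    Console.WriteLine($"hash {hash} -> bucket {hash % pr}");  // bucket 5 every time
}
```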
If you want to be absolutely sure that all different possible 31-bit hash codes end up in different buckets, the only way is to force pr to be greater than or equal to the largest possible 31-bit hash code. Or in other words, ensure that every hash code is of the form (pr*0 + hash_code), i.e. pr should be greater than 2^31. This by extension means that the dictionary capacity should be at least 2^31.
Note that the capacity required to minimize hash collisions is not at all dependent on the number of elements you want to store in the dictionary but on the range of the possible hash codes.
As you can imagine, 2^31 is a huge memory allocation. In fact, if you try to specify 2^31 as the capacity, there will be two arrays of 2^31 length. Consider that on a 32-bit machine the highest possible address in RAM is 2^32!
If, for some reason, the default behavior of the dictionary is not acceptable to you and it is critical for you to minimize hash collisions (or rather, I would say, bucket collisions), the only hope you have is to provide your own hash code (i.e. you cannot use string as the key). Such a hash code should keep the formula for obtaining the bucket index in mind and strive to minimize the range of possible hash codes. The simplest approach is to incrementally assign a number/index to your unique MyClass instances and use this number as your hash code. Then you can specify the total number of MyClass instances as the dictionary capacity. Though in such a case an array could easily be maintained instead of a dictionary, as you know the 'index' of the object and the index is incremental.
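A hypothetical MyClass along those lines might look like this - a sketch of the suggestion above, not code from the BCL:

```csharp
using System;

// Hypothetical key type: each instance receives a sequential id at
// construction time and returns it as its hash code.
class MyClass
{
    private static int s_nextId;
    public int Id { get; } = s_nextId++;
    public override int GetHashCode() => Id;
    public override bool Equals(object obj) => obj is MyClass other && other.Id == Id;
}
```

With ids 0..n-1 and a dictionary capacity of at least n, the bucket index (id mod pr) is unique per instance, so bucket collisions disappear entirely; but, as noted above, at that point a plain array indexed by Id would be simpler than a dictionary.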
In the end, I would like to reiterate what others have said: 'there will not be countless resizes'. The dictionary doubles its capacity (rounded up to the nearest prime greater than or equal to the new capacity) each time it finds itself short of space. To save some processing, you can very well set the capacity to the number of MyClass instances you have, as in any case the dictionary will require this much capacity to store the instances; this will not minimize 'hash collisions', but for normal circumstances it will be fast enough.
Data structures like hash tables are meant for dynamic memory allocation. You can, however, specify the initial size in some structures, but when you add new elements they will expand in size; there is no way to restrict the size implicitly.

There are many data structures available, each with its own advantages and disadvantages, and you need to select the best one. Limiting the size does not affect performance; what makes the difference is how you add, delete, and search.
I am looking for the most efficient way to store a collection of integers. Right now they're being stored in a HashSet<T>, but profiling has shown that these collections weigh heavily on some performance-critical code and I suspect there's a better option.
Some more details:
Random lookups must be O(1) or close to it.
The collections can grow large, so space efficiency is desirable.
The values are uniformly distributed in a 64-bit space.
Mutability is not needed.
There's no clear upper bound on size, but tens of millions of elements is not uncommon.
The most painful performance hit right now is creating them. That seems to be allocation-related - clearing and reusing HashSets helps a lot in benchmarks, but unfortunately that is not a feasible option in the application code.
(added) Implementing a data structure that's tailored to the task is fine. Is a hash table still the way to go? A trie also seems like a possibility at first glance, but I don't have any practical experience with them.
HashSet is usually the best general purpose collection in this case.
If you have any specific information about your collection you may have better options.
If you have a fixed upper bound that is not incredibly large you can use a bit vector of suitable size.
If you have a very dense collection you can instead store the missing values.
If you have very small collections, <= 4 items or so, you can store them in a regular array. A full scan of such a small array may be faster than the hashing required to use the hash set.
If you don't have any more specific characteristics of your data than "large collections of int" HashSet is the way to go.
If the size of the values is bounded, you could use a bitset. It stores one bit per integer; in total the memory use would be n bits, with n being the greatest integer.
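A minimal bit-vector set along those lines, assuming non-negative values with a known upper bound (the class name is made up for the sketch):

```csharp
using System;

// One bit per possible value in [0, maxValue].
class BitSet
{
    private readonly ulong[] _bits;
    public BitSet(int maxValue) => _bits = new ulong[maxValue / 64 + 1];
    // value >> 6 picks the 64-bit word; value & 63 picks the bit within it.
    public void Add(int value) => _bits[value >> 6] |= 1UL << (value & 63);
    public bool Contains(int value) => (_bits[value >> 6] & (1UL << (value & 63))) != 0;
}
```

Membership is a shift, a mask, and an array index; the cost is that memory depends on the largest representable value, not on how many values are actually stored.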
Another option is a bloom filter. Bloom filters are very compact but you have to be prepared for an occasional false positive in lookups. You can find more about them in wikipedia.
A third option is using a simple sorted array. Lookups are O(log n), with n being the number of integers. It may be fast enough.
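The sorted-array option is nearly a one-liner with Array.BinarySearch, which returns a non-negative index exactly when the value is present:

```csharp
using System;

long[] values = { 3, 8, 15, 42, 99 };  // must be kept sorted

// BinarySearch returns the element's index if found, or a negative number
// (the bitwise complement of the insertion point) if not.
bool found = Array.BinarySearch(values, 42L) >= 0;
bool missing = Array.BinarySearch(values, 7L) >= 0;
Console.WriteLine($"{found} {missing}");  // True False
```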
I decided to try and implement a special purpose hash-based set class that uses linear probing to handle collisions:
Backing store is a simple array of longs
The array is sized to be larger than the expected number of elements to be stored.
For a value's hash code, use the least-significant 31 bits.
Searching for the position of a value in the backing store is done using a basic linear probe, like so:
int FindIndex(long value)
{
    var index = (int)(value & 0x7FFFFFFF) % _storage.Length;
    var slotValue = _storage[index];
    if (slotValue == 0x0 || slotValue == value) return index;
    for (++index; ; index++)
    {
        if (index == _storage.Length) index = 0;
        slotValue = _storage[index];
        if (slotValue == 0x0 || slotValue == value) return index;
    }
}
(I was able to determine that the data being stored will never include 0, so that number is safe to use for empty slots.)
The array needs to be larger than the number of elements stored. (Load factor less than 1.) If the set is ever completely filled then FindIndex() will go into an infinite loop if it's used to search for a value that isn't already in the set. In fact, it will want to have quite a lot of empty space, otherwise search and retrieval may suffer as the data starts to form large clumps.
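Putting the pieces together, here is a self-contained sketch of the whole set. FindIndex is the probe shown above; Add and Contains are my own additions inferred from the description, not the author's actual code:

```csharp
using System;

// Sketch of the linear-probing set described above.
// Assumes 0 is never stored (it marks an empty slot) and that the
// capacity stays comfortably above the element count (load factor < 1),
// otherwise the probe can loop forever on a full table.
class LinearProbeSet
{
    private readonly long[] _storage;
    public LinearProbeSet(int capacity) => _storage = new long[capacity];

    // Returns either the slot holding the value or the first empty
    // slot in its probe chain.
    private int FindIndex(long value)
    {
        int index = (int)(value & 0x7FFFFFFF) % _storage.Length;
        while (true)
        {
            long slot = _storage[index];
            if (slot == 0L || slot == value) return index;
            if (++index == _storage.Length) index = 0;  // wrap around
        }
    }

    public bool Add(long value)
    {
        int index = FindIndex(value);
        if (_storage[index] == value) return false;  // already present
        _storage[index] = value;                     // claim the empty slot
        return true;
    }

    public bool Contains(long value) => _storage[FindIndex(value)] == value;
}
```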
I'm sure there's still room for optimization, and I may yet get stuck using some sort of BigArray<T> or sharding for the backing store on large sets. But initial results are promising: it performs over twice as fast as HashSet<T> at a load factor of 0.5, nearly twice as fast at a load factor of 0.8, and even at 0.9 it's still working 40% faster in my tests.
Overhead is 1 / load factor, so if those performance figures hold out in the real world then I believe it will also be more memory-efficient than HashSet<T>. I haven't done a formal analysis, but judging by the internal structure of HashSet<T> I'm pretty sure its overhead is well above 10%.
--
So I'm pretty happy with this solution, but I'm still curious if there are other possibilities. Maybe some sort of trie?
--
Epilogue: Finally got around to doing some competitive benchmarks of this vs. HashSet<T> on live data. (Before I was using synthetic test sets.) It's even beating my optimistic expectations from before. Real-world performance is turning out to be as much as 6x faster than HashSet<T>, depending on collection size.
What I would do is just create an array of integers with a sufficient size to handle however many integers you need. Is there any reason to stay away from the generic List<T>? http://msdn.microsoft.com/en-us/library/6sh2ey19.aspx
The most painful performance hit right now is creating them...
As you've obviously observed, HashSet<T> does not have a constructor that takes a capacity argument to initialize its capacity.
One trick which I believe would work is the following:
int capacity = ... some appropriate number;
int[] items = new int[capacity];
HashSet<int> hashSet = new HashSet<int>(items);
hashSet.Clear();
...
Looking at the implementation with reflector, this will initialize the capacity to the size of the items array, ignoring the fact that this array contains duplicates. It will, however, only actually add one value (zero), so I'd assume that initializing and clearing should be reasonably efficient.
I haven't tested this so you'd have to benchmark it. And be willing to take the risk of depending on an undocumented internal implementation detail.
It would be interesting to know why Microsoft didn't provide a constructor with a capacity argument like they do for other collection types.
I'm struggling with the concept of when to use binary search trees and when to use dictionaries.
In my application I did a little experiment which used the C5 library TreeDictionary (which I believe is a red-black binary search tree), and the C# dictionary. The dictionary was always faster at add/find operations and also always used less memory space. For example, at 16809 <int, float> entries, the dictionary used 342 KiB whilst the tree used 723 KiB.
I thought that BST's were supposed to be more memory efficient, but it seems that one node of the tree requires more bytes than one entry in a dictionary. What gives? Is there a point at where BST's are better than dictionaries?
Also, as a side question, does anyone know if there is a faster + more memory efficient data structure for storing <int, float> pairs for dictionary type access than either of the mentioned structures?
I thought that BST's were supposed to be more memory efficient, but it seems that one node of the tree requires more bytes than one entry in a dictionary. What gives? Is there a point at where BST's are better than dictionaries?
I've personally never heard of such a principle. Even still, it's only a general principle, not a categorical fact etched into the fabric of the universe.
Generally, Dictionaries are really just a fancy wrapper around an array of linked lists. You insert into the dictionary with something like:

LinkedList<KeyValuePair<TKey, TValue>> list =
    internalArray[key.GetHashCode() % internalArray.Length];
if (list.Any(pair => pair.Key.Equals(key)))
    throw new Exception("Key already exists");
list.AddLast(new KeyValuePair<TKey, TValue>(key, value));

So it's a nearly O(1) operation. The dictionary uses O(internalArray.Length + n) memory, where n is the number of items in the collection.
In general BSTs can be implemented as:
linked-lists, which use O(n) space, where n is the number items in the collection.
arrays, which use O(2^h - n) space, where h is the height of the tree and n is the number of items in the collection.
Since red-black trees have a height bounded by 2 * log2(n), an array implementation would have a bounded memory usage of about O(2^(2 * log2 n) - n) = O(n^2 - n)
Odds are, the C5 TreeDictionary is implemented using arrays, which is probably responsible for the wasted space.
What gives? Is there a point at where BST's are better than dictionaries?
Dictionaries have some undesirable properties:
There may not be enough contiguous blocks of memory to hold your dictionary, even if its memory requirements are much less than the total available RAM.
Evaluating the hash function can take an arbitrarily long time. Strings, for example: use Reflector to examine the System.String.GetHashCode method, and you'll notice hashing a string always takes O(n) time, which means it can take considerable time for very long strings. On the other hand, comparing strings for inequality is almost always faster than hashing, since it may require looking at just the first few chars. It's wholly possible for tree inserts to be faster than dictionary inserts if hash code evaluation takes too long.
Int32's GetHashCode method is literally just return this, so you'd be hard-pressed to find a case where a hash table with int keys is slower than a tree dictionary.
RB Trees have some desirable properties:
You can find/remove the Min and Max elements in O(log n) time, compared to O(n) time using a dictionary.
If a tree is implemented as linked list rather than an array, the tree is usually more space efficient than a dictionary.
Likewise, it's ridiculously easy to write immutable versions of trees which support insert/lookup/delete in O(log n) time. Dictionaries do not adapt well to immutability, since you need to copy the entire internal array for every operation (actually, I have seen some array-based implementations of immutable finger trees, a kind of general-purpose dictionary data structure, but the implementation is very complex).
You can traverse all the elements in a tree in sorted order in constant space and O(n) time, whereas you'd need to dump a hash table into an array and sort it to get the same effect.
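That last difference is easy to demonstrate with the BCL's own tree-based SortedDictionary versus the hash-based Dictionary:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

var tree = new SortedDictionary<int, float>();
var hash = new Dictionary<int, float>();
foreach (int k in new[] { 5, 1, 3 }) { tree[k] = k; hash[k] = k; }

// The tree enumerates its keys in sorted order for free, in O(n) time...
Console.WriteLine(string.Join(",", tree.Keys));                  // 1,3,5
// ...while the hash table's order is unspecified; sorting costs O(n log n).
Console.WriteLine(string.Join(",", hash.Keys.OrderBy(k => k)));  // 1,3,5
```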
So, the choice of data structure really depends on what properties you need. If you just want an unordered bag and can guarantee that your hash function evaluates quickly, go with a .NET Dictionary. If you need an ordered bag or have a slow-running hash function, go with TreeDictionary.
It does make sense that a tree node would require more storage than a dictionary entry. A binary tree node needs to store the value and both the left and right subtrees. The generic Dictionary<TKey, TValue> is implemented as a hash table which - I'm assuming - either uses a linked list for each bucket (value plus one pointer/reference) or some sort of remapping (just the value). I'd have to have a peek in Reflector to be sure, but for the purpose of this question I don't think it's that important.
The sparser the hash table, the less efficient in terms of storage/memory. If you create a hash table (dictionary) and initialize its capacity to 1 million, and only fill it with 10,000 elements, then I'm pretty sure it would eat up a lot more memory than a BST with 10,000 nodes.
Still, I wouldn't worry about any of this if the amount of nodes/keys is only in the thousands. That's going to be measured in the kilobytes, compared to gigabytes of physical RAM.
If the question is "why would you want to use a binary tree instead of a hash table?" Then the best answer IMO is that binary trees are ordered whereas hash tables are not. You can only search a hash table for keys that are exactly equal to something; with a tree, you can search for a range of values, nearest value, etc. This is a pretty important distinction if you're creating an index or something similar.
It seems to me you're doing a premature optimization.
What I'd suggest to you is to create an interface to isolate which structure you're actually using, and then implement the interface using the Dictionary (which seems to work best).
If memory/performance becomes an issue (which it probably won't for ~20k numbers), then you can create other interface implementations and check which one works best. You won't need to change almost anything in the rest of the code (except which implementation you're using).
The interface for a tree and a hash table (which I'm guessing is what your Dictionary is based on) should be very similar, always revolving around keyed lookups.
I had always thought a Dictionary was better for creating things once and then doing lots of lookups on it, while a tree was better if you were modifying it significantly. However, I don't know where I picked that idea up from.
(Functional languages often use trees as the basis for their collections, as you can re-use most of the tree if you make small modifications to it.)
You're not comparing "apples with apples": a BST gives you an ordered representation, while a dictionary allows you to do a lookup on a key/value pair (in your case <int, float>).

I wouldn't expect much difference in the memory footprint between the two, but the dictionary will give you a much faster lookup. To find an item in a BST you need to walk a path down the tree, which is O(log n) comparisons in a balanced tree (and potentially worse in an unbalanced one). But to do a dictionary lookup you simply go straight to the bucket for the key.
A balanced BST is preferable if you need to protect your data structure from latency spikes and hash collisions attacks.
The former happens when an array-backed structure grows and gets resized; the latter is an inevitable property of hashing algorithms, being a projection from an infinite space to a limited integer range.
Another problem in .NET is that there is LOH, and with a sufficiently large dictionary you run into a LOH fragmentation. In this case you can use a BST, paying a price of larger algorithmic complexity class.
In short, with a BST backed by the allocation heap you get worst case O(log(N)) time, with hashtable you get O(N) worst case time.
BST comes at a price of O(log(N)) average time, worse cache locality and more heap allocations, but it has latency guarantees and is protected from dictionary attacks and memory fragmentation.
Worth noting that a BST is also subject to memory fragmentation on other platforms that don't use a compacting garbage collector.
As for the memory size, the .NET Dictionary`2 class is more memory efficient, because it stores its entries as structs in flat arrays forming an implicit linked list, so each entry only stores the value and offset information rather than a full node object.
A BST has to store an object header (as each node is a class instance on the heap), two child pointers, and some augmented data for balanced trees; for example, a red-black tree needs a boolean interpreted as a color (red or black). That is at least 7 machine words, if I'm not mistaken. So each node in a red-black tree on a 64-bit system is a minimum of:

3 words for the header = 24 bytes
2 words for the child pointers = 16 bytes
1 word for the color = 8 bytes
at least 1 word for the value = 8+ bytes

= 24 + 16 + 8 + 8 = 56 bytes (+8 bytes if the tree uses a parent node pointer).
At the same time, the minimum size of the dictionary entry would be just 16 bytes.