Hashtable collision rehashing - how are values read? - C#

I am trying to understand how Hashtables work in C#. I read the MSDN article, and I understand that the C# Hashtable uses 'rehashing' to resolve collisions: if inserting a key/value pair with hash function H1 results in a collision, it tries hash function H2, then H3, and so on, until no collision occurs.
MSDN quote:
The Hashtable class uses a different technique referred to as rehashing. (Some sources refer to rehashing as double hashing.) Rehashing works as follows: there is a set of different hash functions, H1 ... Hn, and when inserting or retrieving an item from the hash table, initially the H1 hash function is used. If this leads to a collision, H2 is tried instead, and onwards up to Hn if needed. The previous section showed only one hash function, which is the initial hash function (H1). The other hash functions are very similar to this function, only differentiating by a multiplicative factor. In general, the hash function Hk is defined as:

Hk(key) = [GetHash(key) + k * (1 + (((GetHash(key) >> 5) + 1) % (hashsize - 1)))] % hashsize
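For reference, here is my reading of that formula transcribed into C# (a sketch, not the real implementation; I'm assuming the hash code has already been reduced to a non-negative int):

static int Hk(int hash, int k, int hashsize)
{
    // The step size depends only on the initial hash code (H1), so every
    // probe for a given key advances by the same amount; only k changes.
    int increment = 1 + (((hash >> 5) + 1) % (hashsize - 1));
    // long arithmetic avoids int overflow for large k.
    return (int)(((long)hash + (long)k * increment) % hashsize);
}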
However, taking the example from the MSDN site:
private static Hashtable employees = new Hashtable();

public static void Main()
{
    // Add some values to the Hashtable, indexed by a string key
    employees.Add("111-22-3333", "Scott");
    employees.Add("222-33-4444", "Sam");
}
Let's assume that adding the second key will result in a collision, so H2 will have to be used. However, when I call employees["222-33-4444"], how does the hashtable know to use H2? Is there a separate mapping? Thanks.

I think you misunderstand rehashing. There's only one hash function: the virtual object.GetHashCode() (or, if you supply an IHashCodeProvider or IEqualityComparer, it uses that object to calculate the hash code). When the hash table is full, it expands its capacity and redistributes the elements over the new, larger arrays. The private method that does this is called Rehash(), but it doesn't recalculate hash codes.
CORRECTION
The rehashing does not use a new function, but rather operates on the preceding value of the hash code; this has the effect of searching subsequent slots until an empty one is found (for insert/set) or until all keys with the same (initial) hash code have been checked for equality with the index key (for retrieval).
EDIT
To answer your question directly:
Let's assume that adding the second key will result in a collision, so H2 will have to be used. However, when I call employees["222-33-4444"], how does the hashtable know to use H2? Is there a separate mapping? Thanks.
1. Calculate the correct bucket based on the hash code of the passed key.
2. If that bucket is empty, fail.
3. If the bucket's key matches the passed key, return the bucket's value.
4. If the hash collision count is zero, fail.
5. Calculate the next hash code from the current hash code.
6. Calculate the correct bucket based on the new hash code.
7. Go to step 2.
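Put as code, the retrieval loop might look like this simplified sketch (not the real Hashtable source; Bucket, GetHash, and the field names are invented for illustration, and the early-out on the collision count from step 4 is omitted):

class ProbingTable
{
    struct Bucket { public object Key, Value; public int HashCode; }

    private Bucket[] _buckets = new Bucket[11];   // the real table uses a prime size

    private static int GetHash(object key) => key.GetHashCode() & 0x7FFFFFFF;

    public object Lookup(object key)
    {
        int hash = GetHash(key);                                      // H1
        int step = 1 + (((hash >> 5) + 1) % (_buckets.Length - 1));   // probe step
        for (int k = 0; k < _buckets.Length; k++)
        {
            // Hk(key); long arithmetic avoids int overflow.
            int index = (int)(((long)hash + (long)k * step) % _buckets.Length);
            Bucket b = _buckets[index];
            if (b.Key == null)
                return null;                       // step 2: empty bucket, fail
            if (b.HashCode == hash && b.Key.Equals(key))
                return b.Value;                    // step 3: key matches, done
            // steps 5-7: advance to the next hash function and try again
        }
        return null;                               // every slot probed; not present
    }
}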

Hash tables store both the key and the value in the table itself, so that during operations such as look-ups it can be verified that the value found is the one that belongs to the key used for the look-up. Hash tables use a simple "try a method of look-up until success" approach. In this case, the method of look-up is "use hash function X", where X changes on each failure.
In other schemes, the method of look-up is "look at table entry X" (as determined by a hash function), where X just increases by one, wrapping around, on each failure.
The nagging question is what happens when the value ISN'T in the table. That can be rather ugly: only once you've hit an empty entry, or, even worse, iterated through as many entries as the table stores, can you be sure the entry isn't there -- and that can take a while in the worst case.
Keep in mind that since only one value can be associated with one key, once you've found the key, you've found the value. The worst a hash table can do is the equivalent of a cache-unfriendly linear search over all the values in the table itself... but ultimately it will find the value if it's there, because it compares the stored key to the requested key to test if it's there. The only optimization closed hash tables make is where to look first -- in this case, where hash function 1 says, and then 2, and then 3...

It will first try H1. If it does not find a match, it will use H2. And so on.

Related

Calculate a checksum for a string

I have a string of arbitrary length (let's say 5 to 2000 characters) for which I would like to calculate a checksum.
Requirements
The same checksum must be returned each time a calculation is done for a string
The checksum must be unique (no collisions)
I can not store previous IDs to check for collisions
Which algorithm should I use?
Update:
Is there an approach which is reasonably unique? i.e. the likelihood of a collision is very small.
The checksum should be alphanumeric
The strings are unicode
The strings are actually texts that should be translated and the checksum is stored with each translation (so a translated text can be matched back to the original text).
The length of the checksum is not important for me (the shorter, the better)
Update2
Let's say that I got the following string "Welcome to this website. Navigate using the flashy but useless menu above".
The string is used in a view in a similar way to gettext on Linux, i.e. the user just writes (in a Razor view)
#T("Welcome to this website. Navigate using the flashy but useless menu above")
Now I need a way to identify that string so that I can fetch it from a data source (there are several implementations of the data source). Having to use the entire string as a key seems a bit inefficient, and I'm therefore looking for a way to generate a key out of it.
That's not possible.
If you can't store previous values, it's not possible to create a unique checksum that is smaller than the information in the string.
Update:
The term "reasonably unique" doesn't make sense, either it's unique or it's not.
To get a reasonably low risk of hash collisions, you can use a resonably large hash code.
The MD5 algorithm for example produces a 16 byte hash code. Convert the string to a byte array using some encoding that preserves all characters, for example UTF-8, calculate the hash code using the MD5 class, then convert the hash code byte array into a string using the BitConverter class:
string theString = "asdf";
string hash;
using (System.Security.Cryptography.MD5 md5 = System.Security.Cryptography.MD5.Create())
{
    hash = BitConverter.ToString(
        md5.ComputeHash(Encoding.UTF8.GetBytes(theString))
    ).Replace("-", String.Empty);
}
Console.WriteLine(hash);
Output:
912EC803B2CE49E4A541068D495AB570
You can use cryptographic hash functions for this. Most of them are available in .NET.
For example:
// Compute a SHA-1 hash of the UTF-8 bytes of the string.
var sha1 = System.Security.Cryptography.SHA1.Create();
byte[] buf = System.Text.Encoding.UTF8.GetBytes("test");
byte[] hash = sha1.ComputeHash(buf, 0, buf.Length);
//var hashstr = Convert.ToBase64String(hash);
var hashstr = System.BitConverter.ToString(hash).Replace("-", "");
Note: This is an answer to the original question.
Assuming you want the checksum to be stored in a variable of fixed size (i.e. an integer), you cannot satisfy your second constraint.
The checksum must be unique (no collisions)
You cannot avoid collisions because there will be more distinct strings than there are possible checksum values.
I realize this post is practically ancient, but I stumbled upon it and have run into an almost identical issue in the past. We had an nvarchar(8000) field that we needed to look up against.
Our solution was to create a persisted computed column using CHECKSUM of the nasty lookup field. We had an auto-incrementing ID field and keyed on (checksum, id).
When reading from the table, we wrote a proc that took the lookup text, computed the checksum and then took where the checksums were equal and the text was equal.
You could easily perform the checksum portions at the application level based on the answer above and store them manually instead of using our DB-centric solution. But the point is to get a reasonably sized key for indexing so that your text comparison runs against a bucket of collisions instead of the entire dataset.
Good luck!
To guarantee uniqueness for strings of nearly unbounded length, treat the variable-length string as a set of concatenated substrings, each at most x characters in length. Your hash function then only needs to determine uniqueness for a maximum substring length, generating a series of checksum values, one per substring. Think of it as the equivalent of a network IP address: a set of checksum numbers.
Your issue with collisions rests on the assumption that a collision forces a slower search method to resolve it. If the number of possible collisions is insignificant compared to the number of hashed objects, then the extra overhead is negligible overall. A collision is due to sizing the table smaller than the maximum number of objects. This doesn't have to be the case, because the table may have "holes", and each slot in the table may carry a reference count of the objects hashing there. Only if this count is greater than 1 does a collision occur (or do multiple instances of the same substring exist).

What is the lookup time complexity of HashSet<T>(IEqualityComparer<T>)?

In C#.NET, I like using HashSets because of their supposed O(1) time complexity for lookups. If I have a large set of data that is going to be queried, I often prefer using a HashSet to a List, since it has this time complexity.
What confuses me is the constructor for the HashSet, which takes IEqualityComparer as an argument:
http://msdn.microsoft.com/en-us/library/bb359100.aspx
In the link above, the remarks note that the "constructor is an O(1) operation," but if this is the case, I am curious if lookup is still O(1).
In particular, it seems to me that, if I were to write a Comparer to pass in to the constructor of a HashSet, whenever I perform a lookup, the Comparer code would have to be executed on every key to check to see if there was a match. This would not be O(1), but O(n).
Does the implementation internally construct a lookup table as elements are added to the collection?
In general, how might I ascertain information about complexity of .NET data structures?
A HashSet works by hashing the objects you insert (via IEqualityComparer.GetHashCode) and tossing them into buckets according to the hash. The buckets themselves are stored in an array, hence the O(1) part.
For example (this is not necessarily exactly how the C# implementation works, it just gives a flavor) it takes the first character of the hash and throws everything with a hash starting with 1 into bucket 1. Hash of 2, bucket 2, and so on. Inside that bucket is another array of buckets that divvy up by the second character in the hash. So on for every character in the hash....
Now, when you look something up, it hashes it and jumps through the appropriate buckets. It has to do several array lookups (one for each character in the hash), but the cost does not grow as a function of N, the number of objects you've added, hence the O(1) rating.
To your other question, here is a blog post with the complexity of a number of collections' operations: http://c-sharp-snippets.blogspot.com/2010/03/runtime-complexity-of-net-generic.html
if I were to write a Comparer to pass in to the constructor of a HashSet, whenever I perform a lookup, the Comparer code would have to be executed on every key to check to see if there was a match. This would not be O(1), but O(n).
Let's call the value you are searching for the "query" value.
Can you explain why you believe the comparer has to be executed on every key to see if it matches the query?
This belief is false. (Unless of course the hash code supplied by the comparer is the same for every key!) The search algorithm executes the equality comparer on every key whose hash code matches the query's hash code, modulo the number of buckets in the hash table. That's how hash tables get O(1) lookup time.
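To make that concrete, the bucket selection amounts to something like this sketch (the method and parameter names are invented; the real implementation differs in detail):

static int BucketIndex<T>(T query, System.Collections.Generic.IEqualityComparer<T> comparer, int bucketCount)
{
    // Clear the sign bit, then reduce modulo the bucket count. Only keys
    // that land in this one bucket are ever handed to comparer.Equals.
    int hashCode = comparer.GetHashCode(query) & 0x7FFFFFFF;
    return hashCode % bucketCount;
}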
Does the implementation internally construct a lookup table as elements are added to the collection?
Yes.
In general, how might I ascertain information about complexity of .NET data structures?
Read the documentation.
Actually the lookup time of a HashSet<T> isn't always O(1).
As others have already mentioned a HashSet uses IEqualityComparer<T>.GetHashCode().
Now consider a struct or object which always returns the same hash code x.
If you add n items to your HashSet there will be n items with the same hash in it (as long as the objects aren't equal).
So if you were to check whether an element with the hash code x exists in your HashSet, it will run equality checks against all stored objects with the hash code x to test whether the HashSet contains the element.
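To see this in practice, here's a hedged demonstration (BadHash is invented for illustration; every instance hashes to the same value, so Contains degrades to a linear scan):

using System;
using System.Collections.Generic;
using System.Diagnostics;

class BadHash
{
    public int Id;
    public override int GetHashCode() => 42;   // every instance collides
    public override bool Equals(object o) => o is BadHash b && b.Id == Id;
}

class Program
{
    static void Main()
    {
        var set = new HashSet<BadHash>();
        for (int i = 0; i < 10000; i++)
            set.Add(new BadHash { Id = i });    // all items land in one bucket

        var sw = Stopwatch.StartNew();
        set.Contains(new BadHash { Id = -1 });  // Equals-checks every item: O(n)
        Console.WriteLine(sw.Elapsed);
    }
}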
It depends on the quality of the hash function (GetHashCode()) your IEqualityComparer implementation provides. An ideal hash function should provide a well-distributed set of hash codes. These hash codes are used as an index which allows mapping a key to a value, so searching for a value by key becomes more efficient, especially when the key is a complex object/structure.
the Comparer code would have to be executed on every key to check to see if there was a match. This would not be O(1), but O(n).
This is not how a hashtable works; that would be a straightforward brute-force search. A hashtable takes a more intelligent approach: it searches by index (the hash code).
Lookup is still O(1) if you pass an IEqualityComparer. The hash set still uses the same logic as if you don't pass an IEqualityComparer; it just uses the IEqualityComparer's implementations of GetHashCode and Equals instead of the instance methods of System.Object (or the overrides provided by the object in question).
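For example, here is a sketch of a case-insensitive comparer (the key point is that GetHashCode must agree with Equals, so that keys that compare equal land in the same bucket):

using System;
using System.Collections.Generic;

class CaseInsensitiveStringComparer : IEqualityComparer<string>
{
    public bool Equals(string x, string y) =>
        string.Equals(x, y, StringComparison.OrdinalIgnoreCase);

    // Hash the normalized form so strings that are equal ignoring case
    // share a bucket.
    public int GetHashCode(string s) => s.ToUpperInvariant().GetHashCode();
}

class Program
{
    static void Main()
    {
        var set = new HashSet<string>(new CaseInsensitiveStringComparer()) { "Hello" };
        Console.WriteLine(set.Contains("HELLO"));   // True -- still O(1) on average
    }
}

(.NET already ships StringComparer.OrdinalIgnoreCase, which does essentially this.)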

Hash table collision, how to get the right value?

For example, the hashes of 'a' (key) and 'b' (key) both refer to position 10, and I use the '+1' method to handle the collision, so 'b' is now stored at position 11.
So, if I try to get 'b' (key), the hash function returns 10; how do I end up at position 11, where the value is actually stored?
You have to check the stored key and verify that it matches. Otherwise "use '+1' method" and try again.
You need to compare the key at position 10 with the key you are searching for, and if they are not the same, go to the next position (11 in this case) and repeat. Hash tables generally require that the keys being stored can be tested for equality.
However, this flavour of hashing has lots of problems - you are better off storing a table of lists.
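A minimal sketch of that "+1" (linear probing) lookup, assuming parallel keys/values arrays (all names invented):

class LinearProbingTable
{
    private object[] keys = new object[16];
    private object[] values = new object[16];

    public object Get(object key)
    {
        int size = keys.Length;
        int start = (key.GetHashCode() & 0x7FFFFFFF) % size;
        for (int i = 0; i < size; i++)
        {
            int slot = (start + i) % size;        // the "+1" step, wrapping around
            if (keys[slot] == null) return null;  // empty slot: key was never stored
            if (keys[slot].Equals(key)) return values[slot];  // stored key matches
        }
        return null;                              // table is full and the key is absent
    }
}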

Why do we use a hash code in a Hashtable instead of an index?

How is that integer hash generated by the GetHashCode() function? Is it a random value which is not unique?
In string, it is overridden to make sure that there exists only one hash code for a particular string.
How is that done?
How is searching for a specific key in a hash table sped up using the hash code?
What are the advantages of using a hash code over using an index directly into the collection (like in arrays)?
Can someone help?
Basically, hash functions use some generic function to digest data and generate a fingerprint (an integer number here) for that data. Unlike an index, this fingerprint depends ONLY on the data, and should be free of any predictable ordering based on the data. Any change to a single bit of the data should also change the fingerprint considerably.
Notice that nowhere does this guarantee that different data won't give the same hash. In fact, quite the opposite: this happens very often, and is called a collision. But with an integer, the probability of a collision is roughly 1 in 4 billion (1 in 2^32). If a collision happens, you just compare the actual objects you are hashing to see if they match.
This fingerprint can then be used as an index to an array (or arraylist) of stored values. Because the fingerprint is dependent only on the data, you can compute a hash for something and just check the array element for that hash value to see if it has been stored already. Otherwise, you'd have to go through the whole array checking if it matches an item.
You can also VERY quickly do associative arrays by using 2 arrays, one with Key values (indexed by hash), and a second with values mapped to those keys. If you use a hash, you just need to know the key's hash to find the matching value for the key. This is much faster than doing a binary search on a sorted key list, or a scan of the whole array to find matching keys.
There are MANY ways to generate a hash, and all of them have various merits, but few are simple. I suggest consulting the Wikipedia page on hash functions for more info.
A hash code IS an index, and a hash table, at its very lowest level, IS an array. But for a given key value, we determine the index into a hash table differently, to make for much faster data retrieval.
Example: You have 1,000 words and their definitions. You want to store them so that you can retrieve the definition for a word very, very quickly -- faster than a binary search, which is what you would have to do with an array.
So you create a hash table. You start with an array substantially bigger than 1,000 entries -- say 5,000 (the bigger, the more time-efficient).
The way you'll use your table is, you take the word to look up, and convert it to a number between 0 and 4,999. You choose the algorithm for doing this; that's the hashing algorithm. But you could doubtless write something that would be very fast.
Then you use the converted number as an index into your 5,000-element array, and insert/find your definition at that index. There's no searching at all: you've created the index directly from the search word.
All of the operations I've described are constant time; none of them takes longer when we increase the number of entries. We just need to make sure that there is sufficient space in the hash to minimize the chance of "collisions", that is, the chance that two different words will convert to the same integer index. Because that can happen with any hashing algorithm, we need to add checks to see if there is a collision, and do something special (if "hello" and "world" both hash to 1,234 and "hello" is already in the table, what will we do with "world"? Simplest is to put it in 1,235, and adjust our lookup logic to allow for this possibility.)
Edit: after re-reading your post: a hashing algorithm is most definitely not random, it must be deterministic. The index generated for "hello" in my example must be 1,234 every single time; that's the only way the lookup can work.
Answering each one of your questions directly:
How is that integer hash generated by the GetHashCode() function? Is it a random value which is not unique?
An integer hash is generated by whatever method is appropriate for the object.
The generation method is not random but must follow consistent rules, ensuring that a hash generated for one particular object will equal the hash generated for an equivalent object. As an example, a hash function for an integer would be to simply return that integer.
In string, it is overridden to make sure that there exists only one hash code for a particular string. How is that done?
There are many ways this can be done. Here's an example I'm thinking of on the spot:
int hash = 0;
for (int i = 0; i < theString.Length; ++i)
{
    hash ^= theString[i];
}
This is a valid hash algorithm, because the same sequence of characters will always produce the same hash number. It's not a good hash algorithm (an extreme understatement), because many strings will produce the same hash. A valid hash algorithm doesn't have to guarantee uniqueness. A good hash algorithm makes the chance of two differing objects producing the same number extremely small.
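A better-distributed (though still simple) scheme is the common multiply-by-31 polynomial hash, sketched below. This is the general idea behind many string hash implementations, not the exact .NET one:

int hash = 17;
for (int i = 0; i < theString.Length; ++i)
{
    // Multiplying before mixing in each character makes position matter,
    // so "ab" and "ba" no longer collide the way they do with plain XOR.
    unchecked { hash = hash * 31 + theString[i]; }
}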
How is searching for a specific key in a hash table sped up using the hash code?
What are the advantages of using a hash code over using an index directly into the collection (like in arrays)?
A hash code is typically used in hash tables. A hash table is an array, but each entry in the array is a "bucket" of items, not just one item. If you have an object and you want to know which bucket it belongs in, calculate
hash_value MOD hash_table_size.
Then you simply have to compare the object with every item in the bucket. So a hash table lookup will most likely have a search time of O(1), as opposed to O(log(N)) for a sorted list or O(N) for an unsorted list.
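Here is a sketch of that bucket scheme (separate chaining) as a tiny class; the names are invented and real implementations add resizing, but the shape of the lookup is the same:

using System.Collections.Generic;

class ChainedTable
{
    // Each slot is a "bucket": the list of entries whose hashes landed there.
    private readonly List<KeyValuePair<string, string>>[] buckets;

    public ChainedTable(int size)
    {
        buckets = new List<KeyValuePair<string, string>>[size];
        for (int i = 0; i < size; i++)
            buckets[i] = new List<KeyValuePair<string, string>>();
    }

    private int BucketOf(string key) =>
        (key.GetHashCode() & 0x7FFFFFFF) % buckets.Length;   // hash_value MOD hash_table_size

    public void Add(string key, string value) =>
        buckets[BucketOf(key)].Add(new KeyValuePair<string, string>(key, value));

    public string Find(string key)
    {
        // Only one bucket is scanned -- typically a handful of items at most,
        // which is why the average lookup is O(1).
        foreach (var entry in buckets[BucketOf(key)])
            if (entry.Key == key) return entry.Value;
        return null;
    }
}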

C# getting unique hash from all objects

I want to be able to get a unique hash from all objects. What's more,
in the case of
Dictionary<string, MyObject> foo
I want the unique keys for:
string
MyObject
Properties in MyObject
foo[someKey]
foo
etc..
object.GetHashCode() does not guarantee unique return values for different objects.
That's what I need.
Any idea? Thank you
"Unique hash" is generally a contradiction in terms, even in general terms (and it's more obviously impossible if you're trying to use an Int32 as the hash value). From the wikipedia entry:
A hash function is any well-defined procedure or mathematical function that converts a large, possibly variable-sized amount of data into a small datum, usually a single integer that may serve as an index to an array. The values returned by a hash function are called hash values, hash codes, hash sums, or simply hashes.
Note the "small datum" bit - in other words, there will be more possible objects than there are possible hash values, so you can't possibly have uniqueness.
Now, it sounds like you actually want the hash to be a string... which means it won't be of a fixed size (but will have to be under 2GB or whatever the limit is). The simplest way of producing this "unique hash" would be to serialize the object and convert the result into a string, e.g. using Base64 if it's a binary serialization format, or just the text if it's a text-based one such as JSON. However, that's not what anyone else would really recognise as "hashing".
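For what it's worth, a sketch of that serialize-then-encode idea (assuming a JSON serializer such as System.Text.Json is available; any serializer that round-trips your objects would do):

using System;
using System.Text;
using System.Text.Json;

class Program
{
    static string UniqueToken(object obj)
    {
        // Serialize the full object graph to JSON text...
        string json = JsonSerializer.Serialize(obj);
        // ...then Base64-encode it so the result is one opaque string token.
        return Convert.ToBase64String(Encoding.UTF8.GetBytes(json));
    }

    static void Main()
    {
        Console.WriteLine(UniqueToken(new { Name = "Scott", Id = 42 }));
    }
}

Again, this is reversible encoding rather than hashing: the output grows with the input, which is exactly why it can be unique.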
Simply put this is not possible. The GetHashCode function returns a signed integer which contains 2^32 possible unique values. On a 64 bit platform you can have many more than 2^32 different objects running around and hence they cannot all have unique hash codes.
The only way to approach this is to create a different hashing function which returns a type with the capacity greater than or equal to the number of values that could be created in the running system.
A unique hash code is impossible without constraints on your input space. This is because Object.GetHashCode is an int. If you have more than Int32.MaxValue objects, then at least two of them must map to the same hash code (by the pigeonhole principle).
Define a custom type with a constrained input space (i.e., the number of possible distinct objects, up to equality, is less than Int32.MaxValue), and then, and only then, is it possible to produce a unique hash code. That's not saying it will be easy, just possible.
Alternatively, don't use the Object.GetHashCode mechanism but instead some other way of representing hashes and you might be able to do what you want. We need clear details on what you want and are using it for to be able to help you here.
As others have stated, a hash code will never be unique; that's not the point.
The point is to help your Dictionary<string, MyObject> foo to find the exact instance faster. It will use the hash code to narrow the search to a smaller set of objects, and then check them for equality.
You can use the Guid class to get unique strings, if what you need is a unique key. But that is not a hash code.
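A quick sketch of that Guid approach (note it is a random identifier generated per call, derived from nothing in the object, so it is a key, not a hash):

// Each call yields a fresh 128-bit identifier, e.g. "0f8fad5b-d9cb-...".
string key = Guid.NewGuid().ToString();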
