Hash table collision, how to get the right value? - c#

For example, the hashes of key 'a' and key 'b' both map to position 10, and I use the '+1' method to handle the collision, so 'b' is now stored at position 11.
So, if I try to get key 'b', the hash function returns 10; how do I get to position 11, where the value actually is?

You have to check the stored key and verify that it matches. If it doesn't, use the '+1' method again and retry.

You need to compare the key at position 10 with the key you are searching for, and if they are not the same, go to the next position (11 in this case) and repeat. Hash tables generally require that the keys being stored can be tested for equality.
However, this flavour of hashing has lots of problems; you are usually better off storing a table of lists (separate chaining).
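As a rough sketch of that probe-and-compare loop in C# (the slot array and the Find method are invented for illustration, not taken from any library):

using System.Collections.Generic;

// Each slot stores both the key and the value so the key can be verified on lookup.
KeyValuePair<string, string>?[] slots = new KeyValuePair<string, string>?[16];

string Find(string key)
{
    int i = (key.GetHashCode() & 0x7FFFFFFF) % slots.Length;
    while (slots[i] != null)                      // an empty slot ends the search
    {
        if (slots[i].Value.Key == key)            // verify the stored key matches
            return slots[i].Value.Value;
        i = (i + 1) % slots.Length;               // the "+1" method: probe the next position
    }
    return null;                                  // key not present
    // (a completely full table would loop forever; real code caps the probe count)
}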

Related

Hash key multiplying by 9 and modulus

I have this peculiar piece of code that is bothering me,
// exbPtr points to 128-bit unsigned integer
// lgID is a "short" with 0xFFFF being the max value
int hash = (*exbPtr + (int)lgID * 9) & tlpLengthMask;
Initially this "hash table", which is really an array, is initialized to 256 elements, and tlpLengthMask is set to 255.
Then there is this mysterious code, with a comment right above it saying "if we reached here .. there has been a collision". And then it starts looping back again, so it looks like this is handling a hash collision by rehashing?
hash = (hash + (int)lgID * 2 + 1) & tlpLengthMask;
In addition, there is a ton of debug code that says that the length of this array should be a power of 2 because we're using mask as a modulus.
Can someone explain what the author's intent was? What is the reasoning behind this?
EDIT -- what I'm trying to discern is why he multiplied by 9, and then why he multiplies by 2 to re-hash.
There are three possibilities:
1) The original author just constructed the hashing functions more or less randomly, saw that they worked well enough, and left it at that.
2) The original author had test data that well represented the actual data and saw that these functions worked extremely well for his exact application.
3) This code is performing very poorly and his hash table is not operating efficiently at all.
The only real requirement is that the output look evenly distributed over the hash table for whatever input he actually encounters and always produce the same output for the same input. While these kinds of functions generally perform poorly, they may be good enough for this specific application.
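On the power-of-two comment: x & (n - 1) only equals x % n when n is a power of two, which is exactly why the debug code enforces it. A quick illustration:

int tlpLengthMask = 255;        // table length 256 = 2^8, so mask = 256 - 1
int hash = 1000;
int a = hash & tlpLengthMask;   // 232
int b = hash % 256;             // 232 -- identical result, but & is a single cheap instruction
// If the length were not a power of two (say 250), & (length - 1) would no longer
// behave like % length, and the "modulus" would skip some slots entirely.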
By the way, this type of open hashing doesn't work in the face of deletions. For example, say you add one record to the table. Then you go to add a second, but it collides with the first, so you skip forward to add the second. Everything's fine now -- you can find both the first record (directly) and the second record (by skipping over the first when you find it at the second record's hash location).
But if you delete the first record, how do you find the second? When you look at the second record's hash location, you find nothing. Do you try skipping? If so, how many times?
There are workarounds to these problems, but they tend to be very easy to do incorrectly.
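The classic workaround is a "tombstone": deletion marks the slot as deleted instead of emptying it, so probe chains stay intact. A minimal sketch, assuming a simple parallel-array table (all names here are invented):

enum SlotState { Empty, Occupied, Deleted }   // Deleted = tombstone

SlotState[] state = new SlotState[16];
string[] keys = new string[16];

void Delete(string key)
{
    int i = (key.GetHashCode() & 0x7FFFFFFF) % state.Length;
    while (state[i] != SlotState.Empty)            // only Empty ends a probe chain
    {
        if (state[i] == SlotState.Occupied && keys[i] == key)
        {
            state[i] = SlotState.Deleted;          // mark, don't empty: records further
            return;                                // along the chain stay reachable
        }
        i = (i + 1) % state.Length;
    }
}
// Lookups skip Deleted slots and stop only at Empty; inserts may reuse a Deleted
// slot, but only after probing the rest of the chain to rule out a duplicate key.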

Hashtable collision rehashing - how are values read?

I am trying to understand how Hashtables work in C#. I read the MSDN article and I understand that C# Hashtables use 'rehashing' for collisions, i.e. if inserting a key/value pair using hash function H1 results in a collision, then it will try hash function H2, H3, etc., until no collisions are found.
MSDN quote:
The Hashtable class uses a different technique referred to as rehashing. (Some sources refer to rehashing as double hashing.) Rehashing works as follows: there is a set of different hash functions, H1 ... Hn, and when inserting or retrieving an item from the hash table, initially the H1 hash function is used. If this leads to a collision, H2 is tried instead, and onwards up to Hn if needed. The previous section showed only one hash function, which is the initial hash function (H1). The other hash functions are very similar to this function, only differentiating by a multiplicative factor. In general, the hash function Hk is defined as:
Hk(key) = [GetHash(key) + k * (1 + (((GetHash(key) >> 5) + 1) % (hashsize - 1)))] % hashsize
However, taking the example from the MSDN site:
private static Hashtable employees = new Hashtable();

public static void Main()
{
    // Add some values to the Hashtable, indexed by a string key
    employees.Add("111-22-3333", "Scott");
    employees.Add("222-33-4444", "Sam");
}
Let's assume that adding the second key will result in a collision, so H2 will have to be used. However, when I call employees["222-33-4444"], how does the hashtable know to use H2? Is there a separate mapping? Thanks.
I think you misunderstand rehashing. There's only one hash function: the virtual object.GetHashCode() (or, if you supply an IHashCodeProvider or IEqualityComparer, it uses that object to calculate the hash code). When the hash table is full, it expands its capacity and redistributes the elements over the new, larger arrays. The private method that does this is called Rehash(), but it doesn't recalculate hash codes.
CORRECTION
The rehashing does not use a new function, but rather operates on the preceding value of the hash code; this has the effect of searching subsequent slots until an empty one is found (for insert/set) or until all keys with the same (initial) hash code have been checked for equality with the index key (for retrieval).
EDIT
To answer your question directly:
Let's assume that adding the second key will result in a collision, so H2 will have to be used. However, when I call employees["222-33-4444"], how does the hashtable know to use H2? Is there a separate mapping? Thanks.
1. Calculate the correct bucket based on the hash code of the passed key.
2. If that bucket is empty, fail.
3. If the bucket's key matches the passed key, return the bucket's value.
4. If the hash collision count is zero, fail.
5. Calculate the next hash code from the current hash code.
6. Calculate the correct bucket based on the new hash code.
7. Go to step 2.
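Sketched in C# against the Hk formula quoted above (the Bucket layout and all names are invented for illustration; the real Hashtable internals differ in detail):

struct Bucket { public object key, val; public int hash; public bool collided; }
Bucket[] buckets = new Bucket[11];   // hashsize

object Get(object key)
{
    int hash = key.GetHashCode() & 0x7FFFFFFF;
    int incr = 1 + (((hash >> 5) + 1) % (buckets.Length - 1));   // step size from the Hk formula
    int i = hash % buckets.Length;                               // H1: initial bucket
    for (int attempt = 0; attempt < buckets.Length; attempt++)
    {
        if (buckets[i].key == null) return null;          // step 2: empty bucket, fail
        if (buckets[i].hash == hash && buckets[i].key.Equals(key))
            return buckets[i].val;                        // step 3: key matches
        if (!buckets[i].collided) return null;            // step 4: no collision ever here, fail
        i = (i + incr) % buckets.Length;                  // steps 5-6: advance to Hk+1
    }
    return null;
}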
Hash tables store both the key and the value in the hash table itself. This way later on during operations such as hash table look-ups it can be guaranteed that the value found is the one that matches the index used for the look-up. Hash tables use a simple "try the basic method of look-up until success" methodology. In this case, the method of look-up is "use hash function X" where X changes on failure.
In other schemes, the method of look-up is "look at the table entry X" (as determined by a hash function) where X just increases by one in a wrapping manner each failure.
The nagging question now is what happens when the value ISN'T in the table? Well, that can be rather ugly: When you've either hit an entry in the table which is missing or, even worse, when you've iterated through as many entries as are stored in the table, you can be sure the entry isn't there -- but that can take "a while" in the worst case.
Keep in mind that since only one value can be associated with one key, once you've found the key, you've found the value. The worst a hash table can do is to perform the equivalent of a cache-unfriendly linear search over all the values in the hash table itself... but ultimately, it will find the value if it's there, because it compares the stored key to the requested key to test for a match. The only optimization closed hash tables make is where to look first -- in this case, where hash function 1 says, and then 2, and then 3...
It will first try H1. If it does not find a match, it will use H2. And so on.

Creating a character variation algorithm for a synonym table

I have a need to create a variation/synonym table for a client who needs to make sure that if someone enters an incorrect variation, we can return the correct part.
For example, say we have a part ID of GRX7-00C. When the client enters this into the part table, they would like to automatically create a variation table that stores the variations this product could be entered as, like GRX7-OOC (letter O instead of number 0). Or, if the ID contains the number 1, to be able to use L or I.
So if we have part GRL8-OOI we could have the following associated to it in the variation table:
GRI8-OOI
GRL8-0OI
GRL8-O0I
GRL8-OOI
etc....
I currently have a manual entry process for this, but there could be a ton of variations of these parts. So, would anyone have a good idea of how I can create an automatic process for this?
How can I do this in C# and/or SQL?
I'm not a C# programmer, but for other .NET languages it would make more sense to me to create a list of CHARACTERS that are similar, and group those together, and use RegEx to evaluate if it matches.
i.e. for your example:
Original:
GRL8-001
Regex-ploded:
GR(l|L|1)(8|b|B)-(0|o|O)(0|o|O)(1|l|L)
You could accomplish this by having a table of interchangeable characters and running a replace function to sub the RegEx for the character automatically.
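A sketch of that replace step in C# (the character groups are illustrative; a character class like [0oO] is used instead of the (0|o|O) alternation, which matches the same thing):

using System;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical groups of interchangeable characters.
static readonly string[] Groups = { "0oO", "1lLiI", "8bB" };

// Expand each character of a part ID into its look-alike character class.
static string ToPattern(string partId)
{
    var sb = new StringBuilder("^");
    foreach (char c in partId)
    {
        string g = Array.Find(Groups, s => s.IndexOf(c) >= 0);
        sb.Append(g == null ? Regex.Escape(c.ToString()) : "[" + g + "]");
    }
    return sb.Append('$').ToString();
}

// Regex.IsMatch("GRL8-OO1", ToPattern("GRL8-001")) -> true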
A Lookex function (works like Soundex, but for look-alikes instead of sound-alikes), sketched in C#:

string Lookex(string input)
{
    var sb = new StringBuilder();
    foreach (char c in input)
        if ("O0Q".IndexOf(c) >= 0) sb.Append('O');       // O, 0 and Q look alike
        else if ("IL1".IndexOf(c) >= 0) sb.Append('I');  // I, L and 1 look alike
        else sb.Append(c);                               // etc. for other groups
    return sb.ToString();
}
Compute a single Lookex code and store that with each product ID. If the user's entry doesn't match a product ID, compute the Lookex code on their entry and search for all products having that code (there could be more than one). This would consume minimal space, be quite fast with a single index, and be inexpensive to compute as well.
Given your input above, what I would do is not store a table of synonyms, but instead, have a set of rules checked against a master dictionary. So for example, if the user types in a value that is not found in the dictionary, change O to 0, and check for that existing in the dictionary. Change GR to GB and check for that. Etc. All the variations they want to allow described above can be explained as rules that you can apply one at a time or in combination and check if the resulting entry exists. That way you do not have to have a massive dictionary of synonyms to maintain and update.
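A rough C# sketch of that rule-driven search (the substitution rules and the master dictionary here are invented for illustration):

using System.Collections.Generic;

// Apply single-character substitution rules, singly and in combination
// (breadth-first), until some candidate exists in the master dictionary.
var rules = new (char src, char dst)[] { ('O', '0'), ('0', 'O'), ('1', 'L'), ('L', '1'), ('1', 'I'), ('I', '1') };
var masterParts = new HashSet<string> { "GRX7-00C", "GRL8-OOI" };   // hypothetical data

string Resolve(string entry)
{
    if (masterParts.Contains(entry)) return entry;   // exact match: nothing to do
    var seen = new HashSet<string> { entry };
    var queue = new Queue<string>(seen);
    while (queue.Count > 0)
    {
        string current = queue.Dequeue();
        for (int i = 0; i < current.Length; i++)
            foreach (var (src, dst) in rules)
                if (current[i] == src)
                {
                    string candidate = current.Remove(i, 1).Insert(i, dst.ToString());
                    if (masterParts.Contains(candidate)) return candidate;
                    if (seen.Add(candidate)) queue.Enqueue(candidate);   // try combinations later
                }
    }
    return null;   // no combination of rules produced a known part
}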
I wouldn't go the synonym route at all.
I would cleanse all values in the database using a standard rule set.
For every value that exists, replace all '0's with 'O's, strip out dashes, etc., so that for each real value you have only one modified value, and store that in a separate field/table.
Then I would cleanse the input the same way, and do a two-part match. Check the actual input string against the actual database values (this will get you exact matches), and secondly check the cleansed input against the cleansed values. Then order the output against the actual database values using a distance calculation such as Levenshtein distance to get the most likely match.
Now for the input:
GRL8-OO1
With parts:
GRL8-00I & GRL8-OOI
These would all normalize to the same value GRL8OOI, though the distance match would be closer for GRL8-OOI, so that would be your closest bet.
Granted this dramatically reduces the "uniqueness" of your part numbers, but the combo of the two-part match and the Levenshtein should get you what you are looking for.
There are several T-SQL implementations of Levenshtein available
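A minimal C# sketch of the two parts (the cleansing rules follow the ones described above; the Levenshtein implementation is the standard dynamic-programming one):

using System;
using System.Text;

// Cleansing: one canonical form per value.
static string Cleanse(string s)
{
    var sb = new StringBuilder();
    foreach (char c in s.ToUpperInvariant())
    {
        if (c == '-') continue;             // strip dashes
        if (c == '0') sb.Append('O');       // 0 -> O
        else if (c == '1') sb.Append('I');  // 1 -> I, etc. for the other rules
        else sb.Append(c);
    }
    return sb.ToString();
}

// Levenshtein distance, for ranking candidates against the raw part numbers.
static int Levenshtein(string a, string b)
{
    var d = new int[a.Length + 1, b.Length + 1];
    for (int i = 0; i <= a.Length; i++) d[i, 0] = i;
    for (int j = 0; j <= b.Length; j++) d[0, j] = j;
    for (int i = 1; i <= a.Length; i++)
        for (int j = 1; j <= b.Length; j++)
            d[i, j] = Math.Min(Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),
                               d[i - 1, j - 1] + (a[i - 1] == b[j - 1] ? 0 : 1));
    return d[a.Length, b.Length];
}

Candidates whose cleansed form equals the cleansed input are then ordered by Levenshtein distance against the raw part numbers to pick the closest match.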

Why we use Hash Code in HashTable instead of an Index?

How is that integer hash generated by the GetHashCode() function? Is it a random value which is not unique?
In String, it is overridden to make sure that there exists only one hash code for a particular string.
How is that done?
How is searching for a specific key in a hash table sped up using the hash code?
What are the advantages of using a hash code over using an index directly into the collection (like in arrays)?
Can someone help?
Basically, hash functions use some generic function to digest data and generate a fingerprint (an integer here) for that data. Unlike an index, this fingerprint depends ONLY on the data, and should be free of any predictable ordering based on the data. Any change to a single bit of the data should also change the fingerprint considerably.
Notice that nowhere does this guarantee that different data won't give the same hash. In fact, quite the opposite: this happens very often, and is called a collision. But, with an integer, the probability of any given pair colliding is roughly 1 in 4 billion (1 in 2^32). If a collision happens, you just compare the actual objects you are hashing to see if they match.
This fingerprint can then be used as an index to an array (or arraylist) of stored values. Because the fingerprint is dependent only on the data, you can compute a hash for something and just check the array element for that hash value to see if it has been stored already. Otherwise, you'd have to go through the whole array checking if it matches an item.
You can also VERY quickly do associative arrays by using 2 arrays, one with Key values (indexed by hash), and a second with values mapped to those keys. If you use a hash, you just need to know the key's hash to find the matching value for the key. This is much faster than doing a binary search on a sorted key list, or a scan of the whole array to find matching keys.
There are MANY ways to generate a hash, and all of them have various merits, but few are simple. I suggest consulting the wikipedia page on hash functions for more info.
A hash code IS an index, and a hash table, at its very lowest level, IS an array. But for a given key value, we determine the index into a hash table differently, to make for much faster data retrieval.
Example: You have 1,000 words and their definitions. You want to store them so that you can retrieve the definition for a word very, very quickly -- faster than a binary search, which is what you would have to do with an array.
So you create a hash table. You start with an array substantially bigger than 1,000 entries -- say 5,000 (the bigger, the more time-efficient).
The way you'll use your table is, you take the word to look up, and convert it to a number between 0 and 4,999. You choose the algorithm for doing this; that's the hashing algorithm. But you could doubtless write something that would be very fast.
Then you use the converted number as an index into your 5,000-element array, and insert/find your definition at that index. There's no searching at all: you've created the index directly from the search word.
All of the operations I've described are constant time; none of them takes longer when we increase the number of entries. We just need to make sure that there is sufficient space in the hash to minimize the chance of "collisions", that is, the chance that two different words will convert to the same integer index. Because that can happen with any hashing algorithm, we need to add checks to see if there is a collision, and do something special (if "hello" and "world" both hash to 1,234 and "hello" is already in the table, what will we do with "world"? Simplest is to put it in 1,235, and adjust our lookup logic to allow for this possibility.)
Edit: after re-reading your post: a hashing algorithm is most definitely not random, it must be deterministic. The index generated for "hello" in my example must be 1,234 every single time; that's the only way the lookup can work.
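To make the example concrete, here is a toy version of that word-to-index step in C# (the character-sum hash is invented for illustration and deliberately weak):

string[] definitions = new string[5000];   // the oversized table from the example

// A deterministic hashing algorithm: sum the character codes, then fold the
// result into the range 0..4999. "hello" always maps to the same index.
int IndexFor(string word)
{
    int sum = 0;
    foreach (char c in word) sum += c;
    return sum % definitions.Length;
}

definitions[IndexFor("hello")] = "a greeting";   // insert: no searching involved
string def = definitions[IndexFor("hello")];     // lookup: same index, constant time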
Answering each one of your questions directly:
How is that integer hash generated by the GetHashCode() function? Is it a random value which is not unique?
An integer hash is generated by whatever method is appropriate for the object.
The generation method is not random but must follow consistent rules, ensuring that a hash generated for one particular object will equal the hash generated for an equivalent object. As an example, a hash function for an integer would be to simply return that integer.
In String, it is overridden to make sure that there exists only one hash code for a particular string. How is that done?
There are many ways this can be done. Here's an example I'm thinking of on the spot:
int hash = 0;
for (int i = 0; i < theString.Length; ++i)
{
    hash ^= theString[i];
}
This is a valid hash algorithm, because the same sequence of characters will always produce the same hash number. It's not a good hash algorithm (an extreme understatement), because many strings will produce the same hash. A valid hash algorithm doesn't have to guarantee uniqueness; a good hash algorithm makes it extremely unlikely that two differing objects produce the same number.
How is searching for a specific key in a hash table sped up using the hash code?
What are the advantages of using a hash code over using an index directly into the collection (like in arrays)?
A hash code is typically used in hash tables. A hash table is an array, but each entry in the array is a "bucket" of items, not just one item. If you have an object and you want to know which bucket it belongs in, calculate
hash_value MOD hash_table_size.
Then you simply have to compare the object with every item in the bucket. So a hash table lookup will most likely have a search time of O(1), as opposed to O(log(N)) for a sorted list or O(N) for an unsorted list.
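As a hedged sketch of that bucket scheme in C# (the chained-list table and the Lookup method are illustrative, not actual Hashtable internals):

using System.Collections.Generic;

// Each entry in the array is a "bucket": a small list of (key, value) pairs.
List<KeyValuePair<string, string>>[] table = new List<KeyValuePair<string, string>>[97];

string Lookup(string key)
{
    int bucket = (key.GetHashCode() & 0x7FFFFFFF) % table.Length;  // hash_value MOD hash_table_size
    if (table[bucket] == null) return null;
    foreach (var pair in table[bucket])        // compare with every item in the bucket
        if (pair.Key == key) return pair.Value;
    return null;
}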

C# getting unique hash from all objects

I want to be able to get a unique hash from all objects. What's more, in the case of
Dictionary<string, MyObject> foo
I want the unique keys for:
string
MyObject
Properties in MyObject
foo[someKey]
foo
etc.
object.GetHashCode() does not guarantee unique return values for different objects.
That's what I need.
Any idea? Thank you
"Unique hash" is generally a contradiction in terms, even in general terms (and it's more obviously impossible if you're trying to use an Int32 as the hash value). From the wikipedia entry:
A hash function is any well-defined procedure or mathematical function that converts a large, possibly variable-sized amount of data into a small datum, usually a single integer that may serve as an index to an array. The values returned by a hash function are called hash values, hash codes, hash sums, or simply hashes.
Note the "small datum" bit - in other words, there will be more possible objects than there are possible hash values, so you can't possibly have uniqueness.
Now, it sounds like you actually want the hash to be a string... which means it won't be of a fixed size (but will have to be under 2GB or whatever the limit is). The simplest way of producing this "unique hash" would be to serialize the object and convert the result into a string, e.g. using Base64 if it's a binary serialization format, or just the text if it's a text-based one such as JSON. However, that's not what anyone else would really recognise as "hashing".
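For instance, taking the text-based route via System.Text.Json (a sketch; it assumes the object is serializable and that equal objects serialize to identical JSON):

using System.Text.Json;

// The JSON text itself serves as the reversible, "unique" representation.
static string UniqueString<T>(T obj) => JsonSerializer.Serialize(obj);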
Simply put this is not possible. The GetHashCode function returns a signed integer which contains 2^32 possible unique values. On a 64 bit platform you can have many more than 2^32 different objects running around and hence they cannot all have unique hash codes.
The only way to approach this is to create a different hashing function which returns a type with the capacity greater than or equal to the number of values that could be created in the running system.
A unique hash code is impossible without constraints on your input space. This is because Object.GetHashCode is an int. If you have more than Int32.MaxValue objects then at least two of them must map to the same hash code (by the pigeonhole principle).
Define a custom type with restrained input (i.e., the number of possible different objects up to equality is less than Int32.MaxValue) and then, and only then, is it possible to produce a unique hash code. That's not saying it will be easy, just possible.
Alternatively, don't use the Object.GetHashCode mechanism but instead some other way of representing hashes and you might be able to do what you want. We need clear details on what you want and are using it for to be able to help you here.
As others have stated, a hash code will never be unique; that's not the point.
The point is to help your Dictionary<string, MyObject> foo to find the exact instance faster. It will use the hash code to narrow the search to a smaller set of objects, and then check them for equality.
You can use the Guid class to get unique strings, if what you need is a unique key. But that is not a hash code.
