I want to know the probability of getting duplicate values when calling the GetHashCode() method on string instances. For instance, according to this blog post, blair and brainlessness have the same hashcode (1758039503) on an x86 machine.
Large.
(Sorry Jon!)
The probability of getting a hash collision among short strings is extremely large. Given a set of only ten thousand distinct short strings drawn from common words, the probability of there being at least one collision in the set is approximately 1%. If you have eighty thousand strings, the probability of there being at least one collision is over 50%.
For a graph showing the relationship between set size and probability of collision, see my article on the subject:
https://learn.microsoft.com/en-us/archive/blogs/ericlippert/socks-birthdays-and-hash-collisions
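As a rough sanity check on those figures, here is a minimal sketch using the standard birthday-problem approximation, under the assumption that the 2^32 possible hash codes are uniformly distributed (real string hashes only approximate this):

using System;

class BirthdayApprox
{
    static void Main()
    {
        double n = Math.Pow(2, 32); // number of possible 32-bit hash codes

        foreach (int m in new[] { 10_000, 80_000 })
        {
            // P(at least one collision) ~= 1 - exp(-m * (m - 1) / (2n))
            double p = 1.0 - Math.Exp(-(double)m * (m - 1) / (2.0 * n));
            Console.WriteLine($"{m,6} strings -> collision probability ~ {p:P1}");
        }
        // Prints roughly 1.2% for 10,000 strings and about 53% for 80,000,
        // in line with the figures quoted above.
    }
}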
Small - if you're talking about the chance of any two arbitrary unequal strings having a collision. (It will depend on just how "arbitrary" the strings are, of course - different contexts will be using different strings.)
Large - if you're talking about the chance of there being at least one collision in a large pool of arbitrary strings. The small individual probabilities are no match for the birthday problem.
That's about all you need to know. There are definitely cases where there will be collisions, and there have to be, given that there are only 2^32 possible hash codes and more than that many strings - so the pigeonhole principle proves that at least one hash code must have more than one string which generates it. However, you should trust that the hash has been designed to be pretty reasonable.
You can rely on it as a pretty good way of narrowing down the possible matches for a particular string. It would be an unusual set of naturally-occurring strings which generated a lot of collisions - and even when there are some collisions, obviously if you can narrow a candidate search set down from 50K to fewer than 10 strings, that's a pretty big win. But you must not rely on it as a unique value for any string.
Note that the algorithm used in .NET 4 differs between x86 and x64, so that example probably isn't valid on both platforms.
I think all that's possible to say is "small, but finite and definitely not zero" -- in other words you must not rely on GetHashCode() ever returning unique values for two different instances.
To my mind, hashcodes are best used when you want to tell quickly if two instances are different -- not if they're the same.
In other words, if two objects have different hash codes, you know they are different and need not do a (possibly expensive) deeper comparison.
However, if the hash codes for two objects are the same, you must go on to compare the objects themselves to see if they're actually the same.
I ran a test on a database of 466k English words and got 48 collisions with string.GetHashCode(). MurmurHash gives slightly better results. More results are here: https://github.com/jitbit/MurmurHash.net
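For reference, here is a sketch of the kind of test described - the words.txt path is a placeholder, and note that string.GetHashCode differs across platforms (and is randomized per process on .NET Core, so your counts will vary):

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

class CollisionCount
{
    static void Main()
    {
        // Hypothetical input: one word per line.
        string[] words = File.ReadAllLines("words.txt").Distinct().ToArray();

        var byHash = new Dictionary<int, List<string>>();
        foreach (string w in words)
        {
            int h = w.GetHashCode();
            if (!byHash.TryGetValue(h, out var list))
                byHash[h] = list = new List<string>();
            list.Add(w);
        }

        // Count words that share a hash code with at least one other word.
        int colliding = byHash.Values.Where(l => l.Count > 1).Sum(l => l.Count - 1);
        Console.WriteLine($"{words.Length} words, {colliding} collisions");
    }
}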
Just in case your question is meant to be what is the probability of a collision in a group of strings,
For n available slots and m occupying items:
Prob. of no collision on first insertion is 1.
Prob. of no collision on 2nd insertion is ( n - 1 ) / n
Prob. of no collision on 3rd insertion is ( n - 2 ) / n
Prob. of no collision on mth insertion is ( n - ( m - 1 ) ) / n
The probability of no collision after m insertions is the product of the above values: (n - 1)! / ((n - m)! * n^(m - 1)), which can also be written as ( n choose m ) * m! / n^m.
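A short sketch that evaluates this product directly (rather than via factorials, which overflow quickly):

using System;

class NoCollisionProbability
{
    // P(no collision after m insertions into n slots), computed as the running
    // product (n / n) * ((n - 1) / n) * ... * ((n - m + 1) / n).
    static double NoCollision(double n, int m)
    {
        double p = 1.0;
        for (int k = 0; k < m; k++)
            p *= (n - k) / n;
        return p;
    }

    static void Main()
    {
        // Classic birthday check: 23 people, 365 days -> ~0.51 chance of a shared birthday.
        Console.WriteLine(1.0 - NoCollision(365, 23));
        // Same formula with n = 2^32 hash codes and m = 80,000 strings -> ~0.53.
        Console.WriteLine(1.0 - NoCollision(Math.Pow(2, 32), 80_000));
    }
}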
And everybody is right, you can't assume 0 collisions, so saying the probability is "low" may be true but doesn't allow you to assume that there will be no collisions. If you're looking at a hashtable, the usual rule of thumb is that you begin to have trouble with significant collisions once your hashtable is about two-thirds full.
The probability of a collision between two randomly chosen strings is 1 / 2^(bits in hash code) if the hash is ideal (uniformly distributed), which real hash functions only approximate.
The HashSet<T>.Contains implementation in .NET is:
/// <summary>
/// Checks if this hashset contains the item
/// </summary>
/// <param name="item">item to check for containment</param>
/// <returns>true if item contained; false if not</returns>
public bool Contains(T item) {
    if (m_buckets != null) {
        int hashCode = InternalGetHashCode(item);
        // see note at "HashSet" level describing why "- 1" appears in for loop
        for (int i = m_buckets[hashCode % m_buckets.Length] - 1; i >= 0; i = m_slots[i].next) {
            if (m_slots[i].hashCode == hashCode && m_comparer.Equals(m_slots[i].value, item)) {
                return true;
            }
        }
    }
    // either m_buckets is null or wasn't found
    return false;
}
And I read in a lot of places "search complexity in hashset is O(1)". How?
Then why does that for-loop exist?
Edit: .NET reference source link: https://github.com/microsoft/referencesource/blob/master/System.Core/System/Collections/Generic/HashSet.cs
The classic implementation of a hash table works by assigning elements to one of a number of buckets, based on the hash of the element. If the hashing was perfect, i.e. no two elements had the same hash, then we'd be living in a perfectly perfect world where we wouldn't need to care about anything - any lookup would be O(1) always, because we'd only need to compute the hash, get the bucket and say if something is inside.
We're not living in a perfectly perfect world. First off, consider string hashing. In .NET, there are (2^16)^n possible strings of length n; GetHashCode returns an int, and there are 2^32 possible values of int. That's exactly enough to hash every string of length 2 to a unique int, but if we want strings longer than that, there must exist two different values that give the same hash - this is called a collision. Also, we don't want to maintain 2^32 buckets at all times anyway. The usual way of dealing with that is to take the hash code and compute its value modulo the number of buckets to determine the bucket's number [1]. So, the takeaway is - we need to allow for collisions.
The referenced .NET Framework implementation uses the simplest way of dealing with collisions - every bucket holds a linked list of all objects that result in the particular hash. You add object A, it's assigned to a bucket i. You add object B, it has the same hash, so it's added to the list in bucket i right after A. Now if you lookup for any element, you need to traverse the list of all objects and call the actual Equals method to find out if that thing is actually the one you're looking for. That explains the for loop - in the worst case you have to go through the entire list.
Okay, so how is "search complexity in hashset O(1)"? It's not. The worst case complexity is proportional to the number of items. It's O(1) on average. [2] If all objects fall to the same bucket, asking for the elements at the end of the list (or for ones that are not in the structure but would fall into the same bucket) will be O(n).
So what do people mean by "it's O(1) on average"? The structure monitors the ratio of objects to buckets - the load factor - and resizes once that ratio exceeds some threshold. It's easy to see that this keeps the average lookup time proportional to the load factor, which is bounded by a constant.
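To make the chaining and resizing concrete, here is a deliberately minimal sketch of such a table - it is not how HashSet<T> is actually implemented (the real one uses slot and bucket arrays rather than per-bucket lists), just an illustration of the idea:

using System.Collections.Generic;

// Minimal chained hash set: an array of buckets, each bucket a list of items.
// (Null handling and removal are omitted for brevity.)
class TinyHashSet<T>
{
    private List<T>[] _buckets = new List<T>[4];
    private int _count;
    private const double MaxLoadFactor = 0.75;

    private int BucketIndex(T item, int bucketCount) =>
        (item.GetHashCode() & 0x7FFFFFFF) % bucketCount;

    public bool Contains(T item)
    {
        var bucket = _buckets[BucketIndex(item, _buckets.Length)];
        if (bucket == null) return false;
        // The loop over the chain is what makes the worst case O(n):
        // every colliding item falls into the same bucket.
        foreach (T candidate in bucket)
            if (EqualityComparer<T>.Default.Equals(candidate, item))
                return true;
        return false;
    }

    public bool Add(T item)
    {
        if (Contains(item)) return false;
        if (_count + 1 > _buckets.Length * MaxLoadFactor)
            Resize(_buckets.Length * 2);
        int i = BucketIndex(item, _buckets.Length);
        (_buckets[i] ??= new List<T>()).Add(item);
        _count++;
        return true;
    }

    // Doubling the bucket array keeps the average chain length (the load factor)
    // bounded, which is what makes lookups O(1) on average.
    private void Resize(int newSize)
    {
        var newBuckets = new List<T>[newSize];
        foreach (var bucket in _buckets)
        {
            if (bucket == null) continue;
            foreach (T item in bucket)
            {
                int i = BucketIndex(item, newSize);
                (newBuckets[i] ??= new List<T>()).Add(item);
            }
        }
        _buckets = newBuckets;
    }
}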
That's why it's important for hash functions to be uniform, meaning that the probability that two randomly chosen different objects get the same int assigned is 1/2^32 [3]. That keeps the distribution of objects in a hash table uniform, so we avoid pathological cases where one bucket contains a huge number of items.
Note that if you know the hash function and the algorithm used by the hash table, you can force such a pathological case and O(n) lookups. If a server takes inputs from a user and stores them in a hash table, an attacker knowing the hash function and the hash table implementation could use this as a vector for a denial-of-service attack. There are ways of dealing with that too. Treat this as a demonstration that yes, the worst case can be O(n) and that people are generally aware of that.
There are dozens of other, more complicated ways hash tables can be implemented. If you're interested you need to research on your own. Since lookup structures are so commonplace in computer science, people have come up with all sorts of crazy optimisations that minimise not only the theoretical number of operations, but also things like CPU cache misses.
[1] That's exactly what's happening in the statement int i = m_buckets[hashCode % m_buckets.Length] - 1
[2] At least the ones using naive chaining are not. There exist hash tables with worst-case constant time complexity. But usually they're worse in practice compared to the theoretically (in regards to time complexity) slower implementations, mainly due to CPU cache misses.
[3] I'm assuming the domain of possible hashes is the set of all ints, so there are 2^32 of them, but everything I wrote generalises to any other non-empty, finite set of values.
So I need to create a dictionary with keys that are objects with a custom Equals() function. I discovered I need to override GetHashCode() too. I heard that for optimal performance you should have hash codes that don't collide, but that seems counterintuitive. I might be misunderstanding it, but it seems the entire point of using hash codes is to group items into buckets, and if the hash codes never collide each bucket will only have one item, which seems to defeat the purpose.
So should I intentionally make my hash codes collide occasionally? Performance is important. This will be a dictionary that will probably grow to multiple million items and I'll be doing lookups very often.
The goal of a hash code is to give you an index into an array, each of which is a bucket that may contain zero, one, or more items. The performance of the lookup then is dependent on the number of elements in the bucket. The fewer the better, since once you're in the bucket, it's an O(n) search (where n is the number of elements in the bucket). Therefore, it's ideal if the hashcode prevents collisions as much as possible, allowing for the optimal O(1) time as much as possible.
Dictionaries store data in buckets but there isn't one bucket for each hashcode. The number of buckets is based on the capacity. Values are put into buckets based on the modulus of the hashcode and number of buckets.
Let's say you have a GetHashCode() method that produces these hash codes for five objects:
925
10641
14316
17213
28624
Hash codes should be spread out. So these look spread out, right? If we have 7 buckets, then we end up calculating the modulus of each which gives us:
1
1
1
0
1
So we end up with buckets:
0 - 1 item
1 - 4 items
2 - 0 items
3 - 0 items
4 - 0 items
5 - 0 items
6 - 0 items
oops, not so well spread out now.
This is not made up data. These are actual hash codes.
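A quick sketch that reproduces the bucket assignment shown above:

using System;

class BucketDemo
{
    static void Main()
    {
        int[] hashCodes = { 925, 10641, 14316, 17213, 28624 };
        const int bucketCount = 7;

        var counts = new int[bucketCount];
        foreach (int h in hashCodes)
            counts[h % bucketCount]++;   // 925 % 7 == 1, 17213 % 7 == 0, etc.

        for (int b = 0; b < bucketCount; b++)
            Console.WriteLine($"{b} - {counts[b]} item(s)");
        // Prints 1 item in bucket 0, 4 items in bucket 1, and empty buckets elsewhere.
    }
}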
Here's a sample of how to generate a hash code from contained data (not the formula used for the above hash codes, a better one).
https://stackoverflow.com/a/263416/118703
You must ensure that the following holds:
(GetHashCode(a) != GetHashCode(b)) => !Equals(a, b)
The contrapositive says the same thing:
Equals(a, b) => (GetHashCode(a) == GetHashCode(b))
Apart from that, generate as few collisions as possible. A collision is defined as:
(GetHashCode(a) == GetHashCode(b)) && !Equals(a, b)
A collision does not affect correctness, only performance. A GetHashCode that always returned zero would be correct, for example, but slow.
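As a concrete illustration (the OrderKey type here is hypothetical), a key with a custom Equals and a matching GetHashCode might look like this; HashCode.Combine is available on newer frameworks, and a hand-rolled prime-multiplication mix works elsewhere:

using System;

// Hypothetical key type, for illustration only.
public sealed class OrderKey : IEquatable<OrderKey>
{
    public int CustomerId { get; }
    public string Region { get; }

    public OrderKey(int customerId, string region)
    {
        CustomerId = customerId;
        Region = region;
    }

    public bool Equals(OrderKey other) =>
        other != null &&
        CustomerId == other.CustomerId &&
        string.Equals(Region, other.Region, StringComparison.Ordinal);

    public override bool Equals(object obj) => Equals(obj as OrderKey);

    // Must agree with Equals: equal keys return equal hash codes.
    // Collisions between unequal keys are allowed; they only cost performance.
    public override int GetHashCode() =>
        HashCode.Combine(CustomerId, Region);
}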
I have this peculiar piece of code that is bothering me,
// exbPtr points to 128-bit unsigned integer
// lgID is a "short" with 0xFFFF being the max value
int hash = (*exbPtr + (int)lgID * 9) & tlpLengthMask;
Initially this "hash table", which is really an array, is initialized to 256 elements, and tlpLengthMask is set to 255.
Then there is this mysterious code, with a comment right above it saying "if we reached here, there has been a collision". It then starts looping back, so it looks like this is a hash collision being handled by re-hashing?
hash = (hash + (int)lgID * 2 + 1) & tlpLengthMask;
In addition, there is a ton of debug code asserting that the length of this array should be a power of 2, because we're using the mask as a modulus.
Can someone explain what the authors intent was? What is the reasoning behind this?
EDIT -- what I'm trying to discern is why he multiplied by 9, and then why multiply by 2 to re-hash.
There are three possibilities:
1) The original author just constructed the hashing functions more or less randomly, saw that they worked well enough, and left it at that.
2) The original author had test data that well represented the actual data and saw that these functions worked extremely well for his exact application.
3) This code is performing very poorly and his hash table is not operating efficiently at all.
The only real requirement is that the output look evenly distributed over the hash table for whatever input he actually encounters and always produce the same output for the same input. While these kinds of functions generally perform poorly, they may be good enough for this specific application.
By the way, this type of open addressing doesn't handle deletions well. For example, say you add one record to the table. Then you go to add a second, but it collides with the first, so you skip forward to add the second. Everything's fine now -- you can find both the first record (directly) and the second record (by skipping over the first when you find it at the second record's hash location).
But if you delete the first record, how do you find the second? When you look at the second record's hash location, you find nothing. Do you try skipping? If so, how many times?
There are workarounds to these problems, but they tend to be very easy to do incorrectly.
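To illustrate the scheme being discussed - this is a reconstruction of the pattern in the question, not the original author's code - an open-addressed insert over a power-of-two table looks roughly like this:

// Sketch of open addressing with the probe step from the question.
// The table length must be a power of two so that "& mask" behaves like "% length".
// Assumes the table is never completely full.
class OpenAddressedTable
{
    private readonly long?[] _slots;   // stand-in for the 128-bit entries
    private readonly int _mask;

    public OpenAddressedTable(int powerOfTwoLength)
    {
        _slots = new long?[powerOfTwoLength];
        _mask = powerOfTwoLength - 1;
    }

    public void Insert(long value, short lgID)
    {
        // Initial slot: mix the value with lgID * 9, then mask into range.
        int hash = (int)((value + lgID * 9) & _mask);

        // On a collision, step forward by (lgID * 2 + 1). Adding 1 makes the step
        // odd, and an odd step is coprime with a power-of-two table length, so the
        // probe sequence visits every slot before repeating.
        while (_slots[hash].HasValue)
            hash = (hash + lgID * 2 + 1) & _mask;

        _slots[hash] = value;
    }
}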
A SO post about generating all the permutations got me thinking about a few alternative approaches. I was thinking about using space/run-time trade offs and was wondering if people could critique this approach and possible hiccups while trying to implement it in C#.
The steps go as follows:
Given a data-structure of homogeneous elements, count the number of elements in the structure.
Assuming the permutation consists of all the elements of the structure, calculate the factorial of the value from step 1.
Instantiate a new structure (a Dictionary) of type <key (some hash of the collection), Collection<data-structure of homogeneous elements>> and initialize a counter.
Hash(???) the seed structure from step 1, and insert the key/value pair of hash and collection into the Dictionary. Increment the counter by 1.
Randomly shuffle(???) the order of the seed structure, hash it and then try to insert it into the Dictionary from step 3.
If there is a conflict in hashes, repeat step 5 again to get a new order and hash and check for conflict. Upon successful insertion increment the counter by 1.
Repeat steps 5 & 6 until the counter equals the factorial calculated in step 2.
It seems like doing it this way using some sort of randomizer(which is a black box to me at the moment) might help with getting all the permutations within a decent timeframe for datasets of obscene sizes.
It will be great to get some feedback from the great minds of SO to further analyze this approach whose objective is to deviate from the traditional brute-force approach prevalent in algorithms of such nature and also the repercussions of implementing such an algorithm using C#.
Thanks
This method of generating all permutations does not fare well as compared to the standard known methods.
Say you had n items and M=n! permutations.
This method of generation is expected to generate M*lnM permutations before discovering all M.
(See this answer for a possible explanation: Programming Pearls - Random Select algorithm)
Also, what would the hash function be? For a reasonable hash function, we might have to start dealing with very large integer issues pretty soon (any n > 50 for sure; I don't remember the exact cut-off point).
This method uses up a lot of memory too (the hashtable of all permutations).
Even assuming the hash is perfect, this method would take expected Omega(nM log M) operations and guaranteed Omega(nM) space, while standard well-known methods can do it in O(M) time and O(n) space.
As a starting point I suggest one can read: Systematic Generation of All Permutations, which I believe is O(nM) time and O(n) space and still much better than this method.
Note that if one has to generate all permutations, any algorithm will necessarily take Omega(M) steps, so the method I refer to above is optimal!
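For comparison, one standard method is Heap's algorithm, which visits every permutation with a single swap per step and O(n) extra space; a minimal sketch:

using System;

class HeapsAlgorithm
{
    // Generates all permutations of items in place, invoking emit for each one.
    static void Permute<T>(T[] items, int k, Action<T[]> emit)
    {
        if (k == 1)
        {
            emit(items);
            return;
        }
        for (int i = 0; i < k - 1; i++)
        {
            Permute(items, k - 1, emit);
            // Heap's rule: swap position k-1 with i (k even) or 0 (k odd).
            int j = (k % 2 == 0) ? i : 0;
            (items[j], items[k - 1]) = (items[k - 1], items[j]);
        }
        Permute(items, k - 1, emit);
    }

    static void Main()
    {
        var data = new[] { 'A', 'B', 'C' };
        Permute(data, data.Length, p => Console.WriteLine(string.Join("", p)));
        // Prints all 3! = 6 orderings, each differing from the previous by one swap.
    }
}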
It seems like a complicated way to randomise the order of the generated permutations. In terms of time efficiency, you can't do much better than the 'brute force' approach.
I have a collection of structures. Each structure has 2 keys. If I query using key #1, I should get key #2 in return and vice versa.
It's easy to write code on the desktop when you have the power of the .NET Framework behind you. I am writing code for the .NET Micro Framework, which is a very, very limited subset of the framework. For instance, as far as collections go, I only have arrays and ArrayList objects at my disposal.
So for example here is the list of structures:
Key #1 Key #2
6 A
7 F
8 Z
9 B
So when I query for 8, I should get Z.
When I query for Z, I should get 8.
I am looking to do the fastest and least processor intensive lookup using either arrays or ArrayList. The device I am coding against is a low-end ARM processor, thus I need to optimize early.
If the set is fixed, look into perfect hash functions.
Any reason you can't write your own hashmap?
It depends on the number of entries and your access pattern.
Given that your access pattern is random access, if you don't have too many elements you could keep 2 sorted arrays of pairs and then use
Array.BinarySearch()
Well... if you want the fastest and aren't too concerned about memory, just use two hash tables. One going one way, one going to other. Not sure if there's a more memory efficient way...
Or use just one hash table but have the entries for both directions in there.
Is it not as simple as:
find the key in the array you're querying
return the key at the same index in the opposite array
I would keep it as simple as possible and just iterate through the array you're searching. You'll probably only see a benefit from implementing some hashing routines if your list is (plucks figure from air) over 1k+ elements, with the added complexity of your own hashing routines slowing things down somewhat.
Several solutions:
Keep 2 lists in sync, do a linear search. Works well unless your collections are very large, or you're searching repeatedly.
Two hashtables. Writing your own is fairly easy -- it is just a fixed array of buckets (each bucket can be an ArrayList). Map an item to a bucket by doing object.GetHashCode() % numBuckets. (A minimal sketch follows after this list.)
Two arrays the size of the range of values. If your numbers are in a fixed range, allocate an array the size of the range, with elements being items from the other group. Super quick and easy, but uses a lot of memory.
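Here is a minimal sketch of the two-hashtable option, sticking to plain arrays and ArrayList as the question requires (build one instance per lookup direction; the TinyMap name is made up):

using System.Collections;

// Fixed-size bucket map: object.GetHashCode() % bucket count picks the bucket,
// and each bucket is an ArrayList of key/value pairs searched linearly.
class TinyMap
{
    private readonly ArrayList[] _buckets;

    public TinyMap(int bucketCount)
    {
        _buckets = new ArrayList[bucketCount];
    }

    private int BucketOf(object key)
    {
        return (key.GetHashCode() & 0x7FFFFFFF) % _buckets.Length;
    }

    public void Add(object key, object value)
    {
        int i = BucketOf(key);
        if (_buckets[i] == null) _buckets[i] = new ArrayList();
        _buckets[i].Add(new object[] { key, value });
    }

    public object Find(object key)
    {
        ArrayList bucket = _buckets[BucketOf(key)];
        if (bucket == null) return null;
        foreach (object[] pair in bucket)
            if (pair[0].Equals(key)) return pair[1];
        return null;
    }
}

// Usage: one map from key #1 to key #2, and a second map from key #2 back to key #1.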
If it's a fixed set, consider using switch. See the answer to a similar question here.
I had this problem several years ago when programming in C: we needed to find a (numeric) barcode quickly in about 10 thousand rows (at the time, using a file as the database, as it was a handheld device).
I created my own search that, instead of iterating one by one, would always start in the middle...
searching for 4050 in a 10,000-item stack
start on 5000 ( 10 000 / 2 )
now, is the number higher or lower ... lower
start on 2500 ( 5000 / 2 )
now, is the number higher or lower ... higher
start on 3750 ( 2500 + 2500 / 2 )
now, is the number higher or lower ... higher
start on 4375 ( 3750 + 1250 / 2 )
now, is the number higher or lower ... lower
start on 4063 ( 4375 - 625 / 2 )
now, is the number higher or lower ... lower
start on 3907 ( 4063 - 312 / 2 )
now, is the number higher or lower ... higher
start on 3985 ( 3907 + 156 / 2 )
now, is the number higher or lower ... higher
start on 4024 ( 3985 + 78 / 2 )
now, is the number higher or lower ... higher
...
until you get the value... you will need to search about 14 times instead of 4050 iterations
as for the letters... they can all be represented as numeric values as well...
Hope it helps
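A compact sketch of that halving search over a sorted array (the method and array names are just for illustration):

// Iterative binary search over a sorted array, as described above:
// ~14 comparisons cover 10,000 entries because each step halves the range.
static int FindIndex(long[] sortedBarcodes, long target)
{
    int low = 0, high = sortedBarcodes.Length - 1;
    while (low <= high)
    {
        int mid = low + (high - low) / 2;                   // start in the middle of the range
        if (sortedBarcodes[mid] == target) return mid;
        if (sortedBarcodes[mid] < target) low = mid + 1;    // target is higher
        else high = mid - 1;                                // target is lower
    }
    return -1; // not found
}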