I am wondering about the quality and the stability of the hash codes produced by the String.GetHashCode() implementation in .NET.
Concerning quality, I am focusing on algorithmic aspects (hence, the quality of the hash as it impacts large hash tables, not security concerns).
Concerning stability, I am wondering about the potential versioning issues that might arise from one .NET version to the next.
Some light on these two aspects would be much appreciated.
I can't give you any details about the quality (though I would assume it is pretty good given that string is one of the framework's core classes that is likely to be used as a hash key).
However, regarding the stability, the hash code produced on different versions of the framework is not guaranteed to be the same, and it has changed in the past, so you absolutely must not rely on the hash code being stable between versions (see here for a reference that it changed between 1.1 and 2.0). In fact, it even differs between the 32-bit and 64-bit versions of the same framework version; from the docs:
The value returned by GetHashCode is platform-dependent. For a specific string value, it differs on the 32-bit and 64-bit versions of the .NET Framework.
This is an old question, but I'd like to contribute by mentioning this Microsoft bug report about hash quality.
Summary: on 64-bit, hash quality is very low when your string contains '\0' characters. Basically, only the start of the string (up to the first '\0') will be hashed.
If, like me, you have to use .NET strings to represent binary data as keys for high-performance dictionaries, you need to be aware of this bug.
Too bad, it's a WONTFIX... As a side note, I don't understand how they can call modifying the hash code a breaking change, when the code includes
// We want to ensure we can change our hash function daily.
// This is perfectly fine as long as you don't persist the
// value from GetHashCode to disk or count on String A
// hashing before string B. Those are bugs in your code.
hash1 ^= ThisAssembly.DailyBuildNumber;
and the hash code already differs between x86 and x64 anyway.
I just came across a related problem to this. On one of my computers (a 64-bit one) I had a problem that I tracked down to two different objects being identical except for their (stored) hash code. That hash code was created from a string... the same string!
m_storedhash = astring.GetHashCode();
I don't know how these two objects ended up with different hash codes for the same string. However, I suspect what happened is this: within the same .NET exe, one of the class-library projects I depend upon was set to x86 and another to AnyCPU, and one of the objects was created in a method inside the x86 class library while the other (same input data, same everything) was created in a method inside the AnyCPU class library.
So, does this sound plausible: within the same executable in memory (not between processes), some of the code could be running with the x86 framework's string.GetHashCode() and other code with the x64 framework's string.GetHashCode()?
I know that this isn't really covered by the meanings of quality and stability that you specified, but it's worth being aware that hashing extremely large strings can produce an OutOfMemoryException.
https://connect.microsoft.com/VisualStudio/feedback/details/517457/stringcomparers-gethashcode-string-throws-outofmemoryexception-with-plenty-of-ram-available
The quality of the hash codes is good enough for their intended purpose, i.e. they don't cause too many collisions when you use strings as keys in a dictionary. I suspect that it only uses the entire string for calculating the hash code if the string length is reasonably short; for huge strings it will probably only use the first part.
There is no guarantee for stability across versions. The documentation clearly says that the hashing algorithm may change from one version to the next, so that the hash codes are for short term use.
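If stability matters to you, a common approach is to compute a hash yourself that depends only on the string's contents, instead of relying on GetHashCode(). A minimal sketch using 32-bit FNV-1a over the string's UTF-16 code units (the class and method names here are mine, not from the answers above):

```csharp
using System;

static class StableStringHash
{
    // 32-bit FNV-1a over the string's UTF-16 code units.
    // Depends only on the input, so it is identical across .NET
    // versions, runs, and 32/64-bit architectures.
    public static int Fnv1a32(string s)
    {
        unchecked
        {
            uint hash = 2166136261;           // FNV-1a offset basis
            foreach (char c in s)
            {
                hash = (hash ^ c) * 16777619; // FNV-1a prime
            }
            return (int)hash;
        }
    }
}
```

Unlike string.GetHashCode(), this value is safe to persist, at the cost of not being hardened against deliberate collision attacks.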
Consider the following code:
Console.WriteLine("Hello, World!".GetHashCode());
First run:
139068974
Second run:
-263623806
Now consider the same thing written in Kotlin:
println("Hello, World!".hashCode())
First run:
1498789909
Second run:
1498789909
Why do hash codes for string change for every execution in .NET, but not on other runtimes like the JVM?
Why do hash codes for string change for every execution in .NET
In short, to prevent hash-collision attacks. You can find the reason in the docs for the <UseRandomizedStringHashAlgorithm> configuration element:
The string lookup in a hash table is typically an O(1) operation. However, when a large number of collisions occur, the lookup can become an O(n) operation. You can use the configuration element to generate a random hashing algorithm per application domain, which in turn limits the number of potential collisions, particularly when the keys from which the hash codes are calculated are based on data input by users.
but not on other runtimes like the JVM?
Not exactly; for example, Python's hash function is also randomized. C# also produces a deterministic hash in .NET Framework, .NET Core 1.0, and .NET Core 2.0 when <UseRandomizedStringHashAlgorithm> is not enabled.
For Java, it's perhaps a historical issue: the algorithm is public, and it's not a particularly good one; read this.
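For reference, on .NET Framework 4.5 and later, the randomized algorithm is opted into via app.config (on .NET Core and later, string hashes are randomized by default, so no setting is needed):

```xml
<configuration>
  <runtime>
    <!-- Generate a randomized string hash per application domain -->
    <UseRandomizedStringHashAlgorithm enabled="1" />
  </runtime>
</configuration>
```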
Why do hash codes change for every execution in .NET?
Because changing the hash code of strings (and other objects!) on each run is a very strong hint to developers that hash codes do not have any meaning outside of the process that generated the hash.
Specifically, the documentation says:
Furthermore, .NET does not guarantee the default implementation of the GetHashCode method, and the value this method returns may differ between .NET implementations, such as different versions of .NET Framework and .NET Core, and platforms, such as 32-bit and 64-bit platforms. For these reasons, do not use the default implementation of this method as a unique object identifier for hashing purposes.
Two consequences follow from this:
You should not assume that equal hash codes imply object equality.
You should never persist or use a hash code outside the application domain in which it was created, because the same object may hash differently across application domains, processes, and platforms.
By changing the hash code of a given object from one run to the next, the runtime is telling the developer not to use the hash code for anything that crosses a process/app-domain boundary. That will help to insulate developers from bugs stemming from changes to the GetHashCode algorithms used by standard classes.
Having hash codes change from one run to the next also discourages things like persisting the hash code for use as a "did this thing change" short-cut. This both prevents bugs from changes to the underlying algorithms and bugs from assuming that two objects of the same type with the same hash code are equal, when no such guarantee is made (in fact, no such guarantee can be made for any data structure which requires or allows more than 32 bits, due to the pigeonhole principle).
Why do other languages generate stable hash codes?
Without a thorough language-by-language review, I can only speculate, but the major reasons are likely to be some combination of:
historical inertia (read: "backwards compatibility")
the disadvantages of stable hash codes were insufficiently understood when the language spec was defined
adding instability to hash codes was too computationally expensive when the language spec was defined
hash codes were less visible to developers
I need a cross platform, cross .NET versions hash function.
Note that most (if not all) regular hashing may produce different results on different machines, probably as a result of different settings on the OS, the compiler used, 32/64-bit, etc.
What I need is an all-around C# method that will hash a string such that the hash value will be the same when produced on any of the many machines that take part in my system. (They all use .NET 3.5 and above.)
If performance is not an issue, try one of the cryptographic hash functions that come with the .NET Framework library: MD5, SHA256, RIPEMD160. If performance is an issue, you could perhaps go for something like MurMurHash3. All of these are dependent only on the input.
(If you want to hash for security purposes, it's worth noting that you should only use cryptographic hash functions and that MD5 and older versions of SHA have known vulnerabilities and should be avoided.)
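As a sketch of the cryptographic route mentioned above (the helper name is mine), you can derive a 32-bit value from the SHA-256 digest of the UTF-8 bytes, packing the first four digest bytes explicitly so the result does not depend on machine endianness:

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

static class PortableHash
{
    // Stable across machines and .NET versions: only the input bytes matter.
    public static int Sha256Hash32(string s)
    {
        using (var sha = SHA256.Create())
        {
            byte[] d = sha.ComputeHash(Encoding.UTF8.GetBytes(s));
            // Pack the first four digest bytes explicitly (endian-independent).
            return d[0] | (d[1] << 8) | (d[2] << 16) | (d[3] << 24);
        }
    }
}
```

This is slower than a simple arithmetic hash like FNV-1a or MurmurHash3, but it depends only on the input.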
We have an application that
Generates a hash code on a string
Saves that hash code into a DB along with associated data
Later, it queries the DB using the string hash code for retrieving the data
This is obviously a bug, because the value returned from string.GetHashCode() varies across .NET versions and architectures (32/64-bit). To complicate matters, we're too close to a release to refactor our application to stop serializing hash codes and just query on the strings instead. What we'd like to do is come up with a quick and dirty fix for now, and refactor the code later to do it the right way.
The quick and dirty fix seems like creating a static GetInvariantHashCode(string s) helper method that is consistent across architectures.
Can suggest an algorithm for generating a hashcode on a string that is equivalent on 32 bit and 64 bit architecture?
A few more notes:
I'm aware that HashCodes are not unique. If a hashcode returns a match on two different strings, we post process the results to find the exact match. It is not used as a primary key.
I believe the architect's intent was to speed up the searches by querying on a long instead of an NVarChar
I'm aware that HashCodes are not unique. If a hashcode returns a match on two different strings, we post process the results to find the exact match. It is not used as a primary key.
I believe the architect's intent was to speed up the searches by querying on a long instead of an NVarChar
Then just let the database index the strings for you!
Look, I have no idea how large your domain is, but you're going to get collisions very rapidly with very high likelihood if it's of any decent size at all. It's the birthday problem with a lot of people relative to the number of birthdays. You're going to have collisions, and lose any speed gain you think you're getting by not just indexing the strings in the first place.
Anyway, you don't need us if you're stuck a few days away from release and you really need an invariant hash code across platform. There are really dumb, really fast implementations of hash code out there that you can use. Hell, you could come up with one yourself in the blink of an eye:
string s = "Hello, world!";
int hash = 17;
foreach (char c in s)
{
    // Use the char value itself; char.GetHashCode() carries the same
    // "no cross-version guarantee" caveat as string.GetHashCode().
    unchecked { hash = hash * 23 + c; }
}
Or you could use the old Bernstein hash. And on and on. Are they going to give you the performance gain you're looking for? I don't know, they weren't meant to be used for this purpose. They were meant to be used for balancing hash tables. You're not balancing a hash table. You're using the wrong concept.
Edit (the below was written before the question was edited with new salient information):
You can't do this, at all, theoretically, without some kind of restriction on your input space. Your problem is far more severe than String.GetHashCode differing from platform to platform.
There are a lot of instances of string. In fact, way more instances than there are instances of Int32. So, because of the pigeonhole principle, you will have collisions. You can't avoid this: your strings are pigeons and your Int32 hash codes are pigeonholes, and there are too many pigeons to go in the pigeonholes without some pigeonhole getting more than one pigeon. Because of collision problems, you can't use hash codes as unique keys for strings. It doesn't work. Period.
The only way you can make your current proposed design work (using Int32 as an identifier for instances of string) is if you restrict your input space of strings to something that has at size less than or equal to the number of Int32s. Even then, you'll have difficulty coming up with an algorithm that maps your input space of strings to Int32 in a unique way.
Even if you try to increase the number of pigeonholes by using SHA-512 or whatever, you still have the possibility of collisions. I doubt you considered that possibility previously in your design; this design path is DOA. And that's not what SHA-512 is for anyway, it's not to be used for unique identification of messages. It's just to reduce the likelihood of message forgery.
To complicate matters, we're too close to a release to refactor our application to stop serializing hash codes and just query on the strings instead.
Well, then you have a tremendous amount of work ahead of you. I'm sorry you discovered this so late in the game.
I note the documentation for String.GetHashCode:
The behavior of GetHashCode is dependent on its implementation, which might change from one version of the common language runtime to another. A reason why this might happen is to improve the performance of GetHashCode.
And from Object.GetHashCode:
The GetHashCode method is suitable for use in hashing algorithms and data structures such as a hash table.
Hash codes are for balancing hash tables. They are not for identifying objects. You could have caught this sooner if you had used the concept for what it is meant to be used for.
You should just use SHA512.
Note that hashes are not (and cannot be) unique.
If you need it to be unique, just use the identity function as your hash.
You can use one of the managed cryptography classes (such as SHA512Managed) to compute a platform-independent hash via ComputeHash. This requires converting the string to a byte array (e.g. using Encoding.GetBytes or some other method) and will be slow, but it is consistent.
That being said, a hash is not guaranteed unique, and is really not a proper mechanism for uniqueness in a database. Using a hash to store data is likely to cause data to get lost, as the first hash collision will overwrite old data (or throw away new data).
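A sketch of that approach (the method name is mine): hash the UTF-8 bytes with SHA512Managed and render the digest as hex for storage.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

static class DbKeyHash
{
    // Platform-independent: the digest depends only on the input bytes.
    public static string Sha512Hex(string s)
    {
        using (var sha = new SHA512Managed())
        {
            byte[] digest = sha.ComputeHash(Encoding.UTF8.GetBytes(s));
            // BitConverter.ToString yields "CF-83-E1-..."; strip the dashes.
            return BitConverter.ToString(digest).Replace("-", "");
        }
    }
}
```

Remember the caveat above: even a 512-bit hash is not a uniqueness guarantee, so equality must still be confirmed against the underlying strings.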
What are the technical reasons behind the difference between the 32-bit and 64-bit versions of string.GetHashCode()?
More importantly, why does the 64-bit version seem to terminate its algorithm when it encounters the NUL character? For example, the following expressions all return true when run under the 64-bit CLR.
"\0123456789".GetHashCode() == "\0987654321".GetHashCode()
"\0AAAAAAAAA".GetHashCode() == "\0BBBBBBBBB".GetHashCode()
"\0The".GetHashCode() == "\0Game".GetHashCode()
This behavior (bug?) manifested as a performance issue when we used such strings as keys in a Dictionary.
This looks like a known issue which Microsoft would not fix:
As you have mentioned this would be a breaking change for some programs (even though they shouldn't really be relying on this), the risk of this was deemed too high to fix this in the current release.
I agree that the rate of collisions that this will cause in the default Dictionary<String, Object> will be inflated by this. If this is adversely affecting your application's performance, I would suggest trying to work around it by using one of the Dictionary constructors that takes an IEqualityComparer, so you can provide a more appropriate GetHashCode implementation. I know this isn't ideal and would like to get this fixed in a future version of the .NET Framework.
Source: Microsoft Connect - String.GetHashCode ignores any characters in the string beyond the first null byte in x64 runtime
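A sketch of the IEqualityComparer workaround mentioned above (the class name is mine; the hash is a simple FNV-1a that visits every character, including those after a '\0'):

```csharp
using System;
using System.Collections.Generic;

// Equality comparer whose hash covers the whole string, working around
// the truncated 64-bit string hash for strings containing '\0'.
sealed class FullStringComparer : IEqualityComparer<string>
{
    public bool Equals(string x, string y)
    {
        return string.Equals(x, y, StringComparison.Ordinal);
    }

    public int GetHashCode(string s)
    {
        if (s == null) return 0;
        unchecked
        {
            uint hash = 2166136261;           // FNV-1a offset basis
            foreach (char c in s)
            {
                hash = (hash ^ c) * 16777619; // FNV-1a prime
            }
            return (int)hash;
        }
    }
}

// Usage: var map = new Dictionary<string, object>(new FullStringComparer());
```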
Eric Lippert has a wonderful blog post on this:
Curious property in String
Curious property Revealed
I need to split data evenly across n nodes in a distributed cache.
The following code will take a cache key and determine which Node to use:
public static int GetNodeIDByCacheKey(string key)
{
    return Math.Abs(key.GetHashCode()) % TotalNumberOfNodes();
}
Unfortunately, the code isn't reliable across different machine instances: in testing, it will sometimes return a different node for the same key.
Any thoughts or ideas on getting something to work better?
You should not rely on the implementation of string's GetHashCode() beyond the fact that strings of equal value will produce the same hash code. What the particular value of the hash code will be is only required to be consistent, per the documentation, for the current execution of an application; a different hash code can be returned if the application is run again.
Also the implementation of GetHashCode might be different if you have different .NET CLR versions on the machines in question:
The behavior of GetHashCode is dependent on its implementation, which
might change from one version of the common language runtime to
another. A reason why this might happen is to improve the performance
of GetHashCode.
Instead, you could define a consistent mapping from your string key to a numeric value, which would allow you to bin your nodes consistently across restarts and machine boundaries. This could be achieved, for example, by converting the string into a byte array (e.g. using Encoding.UTF8.GetBytes()) and then converting the byte array to a number (either a lossy conversion using just 64 bits, or using BigInteger).
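A sketch of such a mapping (the names are mine): hash the key's UTF-8 bytes with a fixed algorithm (64-bit FNV-1a here) and take the remainder by the node count. Using an unsigned hash also sidesteps the overflow that Math.Abs(int.MinValue) would throw in the original code.

```csharp
using System;
using System.Text;

static class NodeMap
{
    // Deterministic across machines and runs: depends only on the
    // key's UTF-8 bytes, not on string.GetHashCode().
    public static int GetNodeIdByCacheKey(string key, int totalNodes)
    {
        unchecked
        {
            ulong hash = 14695981039346656037UL;     // FNV-1a 64-bit offset basis
            foreach (byte b in Encoding.UTF8.GetBytes(key))
            {
                hash = (hash ^ b) * 1099511628211UL; // FNV-1a 64-bit prime
            }
            return (int)(hash % (ulong)totalNodes);
        }
    }
}
```

Note that, as with any fixed mapping, changing the total number of nodes will remap most keys; if nodes come and go, consistent hashing is the usual remedy.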
A particular instance (an instantiated string) will generate the same hash within a run, but two instances of the same string (for instance "Hello" on Machine A and on Machine B) may very well have different hash codes. I think you will need to implement your own hash function that uses only the contents of the string if you want identical behavior across machines and instances.