I need a cross-platform, cross-.NET-version hash function.
Note that most regular hashing may produce different results on different machines, for example as a result of different OS settings, the compiler used, 32/64-bit, etc.
What I need is a general-purpose C# method that hashes a string such that the hash value is the same no matter which of the many machines in my system produces it. (They all use .NET 3.5 and above.)
If performance is not an issue, try one of the cryptographic hash functions that come with the .NET Framework library: MD5, SHA256, RIPEMD160. If performance is an issue, you could go for something like MurmurHash3. All of these depend only on the input.
(If you want to hash for security purposes, it's worth noting that you should only use cryptographic hash functions and that MD5 and older versions of SHA have known vulnerabilities and should be avoided.)
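For example, a minimal sketch of a machine-independent string hash built on SHA-256 (the encoding is pinned to UTF-8 so every machine hashes the same bytes; the class and method names are just illustrative):

using System;
using System.Security.Cryptography;
using System.Text;

static class StableHash
{
    // Deterministic across machines, OS versions, and .NET versions,
    // because both UTF-8 and SHA-256 are fully specified.
    public static string Sha256Hex(string input)
    {
        byte[] bytes = Encoding.UTF8.GetBytes(input);
        using (SHA256 sha = SHA256.Create())
        {
            byte[] hash = sha.ComputeHash(bytes);
            return BitConverter.ToString(hash).Replace("-", "");
        }
    }
}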
Related
Consider the following code:
Console.WriteLine("Hello, World!".GetHashCode());
First run:
139068974
Second run:
-263623806
Now consider the same thing written in Kotlin:
println("Hello, World!".hashCode())
First run:
1498789909
Second run:
1498789909
Why do hash codes for string change for every execution in .NET, but not on other runtimes like the JVM?
Why do hash codes for string change for every execution in .NET
In short, to prevent hash-collision attacks. You can find the reasoning in the docs for the <UseRandomizedStringHashAlgorithm> configuration element:
The string lookup in a hash table is typically an O(1) operation. However, when a large number of collisions occur, the lookup can become an O(n²) operation. You can use the configuration element to generate a random hashing algorithm per application domain, which in turn limits the number of potential collisions, particularly when the keys from which the hash codes are calculated are based on data input by users.
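For reference, opting in looks something like this in an app.config (the element is honored by .NET Framework 4.5 and later; shown here as a sketch):

<?xml version="1.0" encoding="utf-8"?>
<configuration>
  <runtime>
    <!-- Use a randomized string hash algorithm per application domain. -->
    <UseRandomizedStringHashAlgorithm enabled="1" />
  </runtime>
</configuration>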
but not on other runtimes like the JVM?
Not exactly. Python's hash function, for example, is randomized. C# also produces a stable hash in .NET Framework, Core 1.0, and Core 2.0 when <UseRandomizedStringHashAlgorithm> is not enabled.
For Java it may be a historical issue: the hashing arithmetic is part of the public specification, and that's not a good thing.
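To illustrate what "the arithmetic is public" means: Java's String.hashCode() is fixed by the language specification as h = 31*h + c over the string's UTF-16 code units, so any platform can reproduce it. A quick C# sketch of that same arithmetic:

// Reimplementation of the JVM's documented String.hashCode().
// Stable everywhere, because the algorithm is fixed by specification.
static int JvmStringHash(string s)
{
    int h = 0;
    foreach (char c in s)
        h = unchecked(31 * h + c); // wraps on overflow, as on the JVM
    return h;
}

JvmStringHash("Hello, World!") returns 1498789909, matching the Kotlin output above.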
Why do hash codes change for every execution in .NET?
Because changing the hash code of strings (and other objects!) on each run is a very strong hint to developers that hash codes do not have any meaning outside of the process that generated the hash.
Specifically, the documentation says:
Furthermore, .NET does not guarantee the default implementation of the GetHashCode method, and the value this method returns may differ between .NET implementations, such as different versions of .NET Framework and .NET Core, and platforms, such as 32-bit and 64-bit platforms. For these reasons, do not use the default implementation of this method as a unique object identifier for hashing purposes.
Two consequences follow from this:
You should not assume that equal hash codes imply object equality.
You should never persist or use a hash code outside the application domain in which it was created, because the same object may hash differently across application domains, processes, and platforms.
By changing the hash code of a given object from one run to the next, the runtime is telling the developer not to use the hash code for anything that crosses a process/app-domain boundary. That will help to insulate developers from bugs stemming from changes to the GetHashCode algorithms used by standard classes.
Having hash codes change from one run to the next also discourages things like persisting the hash code as a "did this thing change?" shortcut. This prevents both bugs caused by changes to the underlying algorithms and bugs caused by assuming that two objects of the same type with the same hash code are equal, when no such guarantee is made (in fact, no such guarantee can be made for any data structure which requires or allows more than 32 bits, due to the pigeonhole principle).
Why do other languages generate stable hash codes?
Without a thorough language-by-language review, I can only speculate, but the major reasons are likely to be some combination of:
historical inertia (read: "backwards compatibility")
the disadvantages of stable hash codes were insufficiently understood when the language spec was defined
adding instability to hash codes was too computationally expensive when the language spec was defined
hash codes were less visible to developers
In my application a CRC value is computed for a file using System.Security.Cryptography.MD5 (C#). It is used as a compact digital fingerprint.
The MD5 class is declared non-FIPS compliant and "everything" works fine if the following Windows Local Policy is disabled:
"System Cryptography: Use FIPS compliant algorithms for encryption, hashing and signing".
Now, I need to enable the above system policy, but the MD5 class fails when called.
Is there a way to compute the CRC value exactly as if you are using the System.Security.Cryptography.MD5?
As Damien_The_Unbeliever mentions above, your requirements are incompatible. But a slightly more detailed answer would be "yes and no".
No: MD5 should not be used any more as there are known collisions. It is broken for pretty much all cryptographic purposes. If those fingerprints are used in any security relevant context then you're well advised to change to a secure hash function. SHA2 and SHA3 are secure and FIPS certified. Switching an entire application to a different hash function may cause you some pain now but the alternative is more pain later.
Yes: It is possible - you could reimplement MD5 yourself or use a library that does not check for the Windows policy. All you'd have to do is ensure a correct data format. However, I would strongly advise against this option. MD5 is broken.
Since you've stated that you have to enable the policy for FIPS compliant cryptography, I would assume that this is either a customer or sales requirement which leaves you with no choice but to switch to SHA2 or SHA3.
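As an illustration, a sketch of what the switch to SHA-256 might look like for a file fingerprint (FileFingerprint is just an illustrative helper name; SHA256.Create() returns whatever implementation is configured, and if it still throws under the FIPS policy on an older Framework, SHA256CryptoServiceProvider is the CAPI-backed, FIPS-validated alternative):

using System;
using System.IO;
using System.Security.Cryptography;

static string FileFingerprint(string path)
{
    using (FileStream stream = File.OpenRead(path))
    using (SHA256 sha = SHA256.Create())
    {
        // Hex-string fingerprint of the file contents.
        return BitConverter.ToString(sha.ComputeHash(stream)).Replace("-", "");
    }
}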
I am using .NET 3.5 and I'm trying to make my app FIPS compliant. I don't use any of the non-FIPS algorithms, but I still get this error when I run it on the production server.
This implementation is not part of the Windows Platform FIPS validated cryptographic algorithms.
Here is the list of algorithms that I have checked and am sure I haven't used:
HMACMD5
HMACRIPEMD160
HMACSHA256
HMACSHA384
HMACSHA512
MD5CryptoServiceProvider
RC2CryptoServiceProvider
RijndaelManaged
RIPEMD160Managed
SHA1Managed
How can I find exactly where the problem is? Any other ideas?
When you say "FIPS compliant", I assume you want to enforce FIPS 140 compliance in Windows and .Net cryptographic libraries mode by changing the Local Security Policy settings.
The challenge with FIPS 140 compliance (usually level 1 of the latest version of the standard, FIPS 140-2) using this mechanism, as you have discovered, is that it prevents the instantiation of non-FIPS 140 compliant algorithms, even if they are not used for a security-related purpose.
Presumably you have checked your code for any references to non-compliant algorithms using a tool like ildasm or Reflector. Otherwise, debug your code and look at the stack trace of the thrown InvalidOperationException to see where the problem lies.
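For example, a rough sketch of that debugging approach (RunSuspectCodePath is a hypothetical stand-in for whatever path in your app triggers the error):

try
{
    RunSuspectCodePath(); // hypothetical entry point into your code
}
catch (InvalidOperationException ex)
{
    // The stack trace points at the constructor of the offending
    // non-FIPS algorithm, possibly inside a third-party library.
    Console.Error.WriteLine(ex);
    throw;
}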
One easy way to accomplish this is to use the generic classes and avoid calling constructors directly. For example, if you want to use the Advanced Encryption Standard (AES), instead of:
// Use the faster .NET implementation of AES. Not FIPS 140 compliant.
using (AesManaged aesManaged = new AesManaged())
{
// Do something
}
use:
// Let .NET work out which implementation of AES to use. It will use
// a FIPS compliant implementation if FIPS is turned on.
using (Aes aes = Aes.Create())
{
// Do something
}
Beyond your code, check third party libraries you use. You can use similar tools to the above to check any references from their code. If you have checked your code thoroughly, this is likely where the problem lies. Note that disassembling third party code could be a breach of copyright or license agreements.
Also check your SSL configuration. For example, the digital certificate used for SSL cannot use MD5. You also must use TLS 1.0 or later.
However, forcing Windows FIPS 140 compliance is doing it the hard way. Most customers, including the US government, do not require only FIPS compliant algorithms (or technically, implementations of these algorithms) to be used. For example, they are perfectly happy for you to use MD5 to create a hash key of a string.
Instead, customers want anything your product protects using cryptography to be protected by FIPS 140 compliant implementations of approved algorithms. In other words:
Identify each thing your product should protect
Protect them using FIPS 140 compliant libraries
Use tooling (e.g. static analysis), code review and/or third party audit to demonstrate enforcement.
Also note that turning on FIPS 140 mode does not necessarily make Windows or your product more secure. Security is much more complicated than choosing one cryptographic algorithm over another (or, specifically, a particular implementation of an algorithm over another implementation). Microsoft no longer recommends this be turned on by default.
I'm trying to understand some C# code I have been handed, which deals with cryptography and specifically uses PasswordDeriveBytes from System.Security.Cryptography.
In the .NET docs, it says that PasswordDeriveBytes uses "an extension of the PBKDF1 algorithm", which is later in the document specified as "the PKCS#5 v2.0 standard", which is PBKDF2 (as far as I can tell). Everywhere on the net I've looked (including here on Stack Exchange), though, everyone says "use Rfc2898DeriveBytes, because Password* is deprecated and uses PBKDF1". But the only difference in the docs at msdn.microsoft.com seems to be that the Rfc* version specifically mentions PBKDF2, whereas Password* says "extension of PBKDF1" and "PKCS#5 v2.0".
So, can anyone tell me what the difference is between the two classes (if any) and why I should use one rather than the other for PBKDF2 password key derivation?
Now, other code that deals with the same data explicitly uses PBKDF2 and works, which would suggest that PasswordDeriveBytes indeed also uses PBKDF2, or that PBKDF2 is simply compatible with PBKDF1 under certain circumstances. But I want to know for sure that it's not some side effect of something random, with things just magically working (and eventually, probably, magically and spectacularly breaking) without anyone really understanding why.
If you instantiate PasswordDeriveBytes and make a single call to the GetBytes method passing a value which is smaller than the output size of the underlying digest algorithm then you get back a value from the PBKDF1 algorithm.
If you make two calls to GetBytes for the same object you may encounter a counting bug in the implementation.
PBKDF1 is only described to output up to the size of the hash algorithm (e.g. 20 bytes for SHA-1), but the PasswordDeriveBytes class has made up a formula to support up to 1000 times the hash output size. So a large value produced by this class may not be easily reproducible on another platform.
If you instantiate Rfc2898DeriveBytes you get a streaming implementation of the PBKDF2 algorithm. The most obvious difference of PBKDF2 over PBKDF1 is that PBKDF2 allows the generation of an arbitrary amount of data (the limit is (2^32-1)*hashOutputSize; or for SHA-1 85,899,345,900 bytes). PBKDF2 also uses a more complex construction (in particular, HMAC over direct digest) to make recovering the input password from an output value more difficult.
The "streaming" in the implementation is that the concatenation of GetBytes(5) and GetBytes(3) is the same as GetBytes(8). Unlike in PasswordDeriveBytes, this works correctly in Rfc2898DeriveBytes.
PBKDF1 was originally created to generate DES keys, published in PKCS #5 v1.5 in 1993.
PBKDF2 was published in PKCS #5 v2.0 (which was republished as RFC 2898) in 1999. A slide deck which should be found at ftp://ftp.rsasecurity.com/pub/pkcs/pkcs-5v2/pkcs5v2-0.pdf (but seems to be having issues, so ftp://ftp.dfn-cert.de/pub/pca/docs/PKCS/ftp.rsa.com/99workshop/pkcs5_v2.0.ppt may have to do) further summarizes the differences. (The slide deck was written by RSA Security, the creators of PBKDF1 and PBKDF2, and they are the people who recommend PBKDF2 over PBKDF1.)
I think a great answer to this would be found here:
C# PasswordDeriveBytes Confusion
But to sum up:
Microsoft's implementation of the original PKCS #5 (a.k.a. PBKDF1) includes insecure extensions to provide more bytes than the hash function can provide (see the bug reports here and here).
Even if it were not buggy, you should avoid undocumented, proprietary extensions to standards (or you might never be able to decrypt your data in the future, at least not outside Windows).
I strongly suggest you use the newer Rfc2898DeriveBytes, which implements PBKDF2 (PKCS #5 v2.0) and has been available since .NET 2.0.
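Typical usage looks something like this (values are illustrative; pick an iteration count appropriate for your hardware):

using System.Security.Cryptography;

// Derive a 256-bit key from a password with PBKDF2. This constructor
// generates a random 16-byte salt; persist kdf.Salt alongside the
// derived key so the derivation can be repeated later.
var kdf = new Rfc2898DeriveBytes("correct horse battery staple", 16, 100000);
byte[] key = kdf.GetBytes(32); // e.g. an AES-256 key
byte[] salt = kdf.Salt;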
Here's a blog post detailing the differences:
http://blogs.msdn.com/b/shawnfa/archive/2004/04/14/generating-a-key-from-a-password.aspx
PBKDF2 can be used to generate keys of any length, which is very useful for password-based encryption (it can generate any key length required by the symmetric cipher) but matters less for secure password storage. It also applies the salt using HMAC instead of concatenation as in PBKDF1, which has better security properties in cases of weak salts.
PKCS #5 v2.0 defines both PBKDF1 and PBKDF2; the former is retained for reasons of backwards compatibility, and the standard recommends PBKDF2 for new applications. I've no idea why the latter is better than the former, but the two .NET classes do seem to use different but interoperable algorithms (possibly because only the resulting key is being exchanged, not the inputs + KDF).
I am wondering about the quality and the stability of the hashes produced by the String.GetHashCode() implementation in .NET.
Concerning quality, I am focusing on algorithmic aspects (hence, the quality of the hash as it impacts large hash tables, not security concerns).
Concerning stability, I am wondering about the potential versioning issues that might arise from one .NET version to the next.
Some light on those two aspects would be much appreciated.
I can't give you any details about the quality (though I would assume it is pretty good given that string is one of the framework's core classes that is likely to be used as a hash key).
However, regarding the stability, the hash code produced on different versions of the framework is not guaranteed to be the same, and it has changed in the past, so you absolutely must not rely on the hash code being stable between versions (see here for a reference that it changed between 1.1 and 2.0). In fact, it even differs between the 32-bit and 64-bit versions of the same framework version; from the docs:
The value returned by GetHashCode is platform-dependent. For a specific string value, it differs on the 32-bit and 64-bit versions of the .NET Framework.
This is an old question, but I'd like to contribute by mentioning this Microsoft bug about hash quality.
Summary: on 64-bit, hash quality is very low when your string contains '\0' bytes. Basically, only the start of the string will be hashed.
If, like me, you have to use .NET strings to represent binary data as keys for high-performance dictionaries, you need to be aware of this bug.
Too bad it's a WONTFIX. As a side note, I don't understand how they can call modifying the hash code a breaking change when the code includes
// We want to ensure we can change our hash function daily.
// This is perfectly fine as long as you don't persist the
// value from GetHashCode to disk or count on String A
// hashing before string B. Those are bugs in your code.
hash1 ^= ThisAssembly.DailyBuildNumber;
and the hash code is already different on x86 and x64 anyway.
I just came across a related problem to this. On one of my computers (a 64-bit one) I had a problem that I tracked down to two different objects being identical except for the (stored) hash code. That hash code was created from a string... the same string!
m_storedhash = astring.GetHashCode();
I don't know how these two objects ended up with different hash codes given they were created from the same string. However, I suspect what happened is that within the same .NET exe, one of the class library projects I depend upon was set to x86 and another to AnyCPU, and one of these objects was created in a method inside the x86 class library while the other object (same input data, same everything) was created in a method inside the AnyCPU class library.
So, does this sound plausible: within the same executable in memory (not between processes), some of the code could be running with the x86 Framework's string.GetHashCode() and other code with the x64 Framework's string.GetHashCode()?
I know that this isn't really included in the meanings of quality and stability that you specified, but it's worth being aware that hashing extremely large strings can produce an OutOfMemoryException.
https://connect.microsoft.com/VisualStudio/feedback/details/517457/stringcomparers-gethashcode-string-throws-outofmemoryexception-with-plenty-of-ram-available
The quality of the hash codes is good enough for their intended purpose, i.e. they don't cause too many collisions when you use strings as keys in a dictionary. I suspect that it only uses the entire string for calculating the hash code if the string length is reasonably short; for huge strings it probably only uses the first part.
There is no guarantee of stability across versions. The documentation clearly says that the hashing algorithm may change from one version to the next, so the hash codes are only suitable for short-term use.