For C++ I've always been using Boost.Functional/Hash to create good hash values without having to deal with bit shifts, XORs and prime numbers. Are there any libraries that produce good (I'm not asking for optimal) hash values for C#/.NET? I would use this utility to implement GetHashCode(), not cryptographic hashes.
To clarify why I think this is useful, here's the implementation of boost::hash_combine, which combines two hash values (of course a very common operation when implementing GetHashCode()):
seed ^= hash_value(v) + 0x9e3779b9 + (seed << 6) + (seed >> 2);
Clearly, this sort of code doesn't belong in the implementation of GetHashCode() and should therefore be implemented elsewhere.
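For reference, a direct C# port of that combine step is only a few lines. This is a sketch, not an existing library; the class and method names here are made up:

internal static class HashCombiner
{
    // Mirrors boost::hash_combine:
    // seed ^= hash + 0x9e3779b9 + (seed << 6) + (seed >> 2)
    public static int Combine(int seed, int hash)
    {
        unchecked
        {
            // work in uint so >> is a logical shift, as in the C++ original
            uint s = (uint)seed;
            s ^= (uint)hash + 0x9e3779b9u + (s << 6) + (s >> 2);
            return (int)s;
        }
    }
}

// e.g. Combine(Combine(0, field1.GetHashCode()), field2.GetHashCode())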
I wouldn't use a separate library just for that. As mentioned before, for the GetHashCode method it is essential to be fast and stable. Usually I prefer to write an inline implementation, but it might actually be a good idea to use a helper class:
internal static class HashHelper
{
    private const int InitialHash = 17; // Prime number
    private const int Multiplier = 23;  // Different prime number

    public static Int32 GetHashCode(params object[] values)
    {
        unchecked // overflow is fine
        {
            int hash = InitialHash;
            if (values != null)
            {
                for (int i = 0; i < values.Length; i++)
                {
                    object currentValue = values[i];
                    hash = hash * Multiplier
                         + (currentValue != null ? currentValue.GetHashCode() : 0);
                }
            }
            return hash;
        }
    }
}
This way the common hash-calculation logic can be reused:
public override int GetHashCode()
{
    return HashHelper.GetHashCode(field1, field2);
}
The answers to this question contain some examples of helper classes that resemble Boost.Functional/Hash. None looks quite as elegant, though.
I am not aware of any real .NET library that provides the equivalent.
Unless you have very specific requirements, you don't need to calculate your type's hashcode from first principles. Rather, combine the hash codes of the fields/properties you use for equality determination in one of the simple ways, something like:
int hash = field1.GetHashCode();
hash = (hash * 37) + field2.GetHashCode();
(Combination function taken from §3.3.2 of C# in Depth, 2nd edition, by Jon Skeet.)
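For instance, a full override built on that combine might look like this (a minimal sketch; the field names are placeholders, and unchecked keeps the arithmetic from throwing in checked builds):

public override int GetHashCode()
{
    unchecked
    {
        int hash = field1.GetHashCode();
        hash = (hash * 37) + field2.GetHashCode();
        return hash;
    }
}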
To avoid the boxing issue, chain your calls using a generic extension method on Int32:
public static class HashHelper
{
    public const int InitialHash = 17;  // Prime number
    private const int Multiplier = 23;  // Different prime number

    // requires: using System.Runtime.CompilerServices;
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static Int32 GetHashCode<T>(this Int32 source, T next)
    {
        // comparing value types to null is okay. See
        // http://stackoverflow.com/questions/1972262/c-sharp-okay-with-comparing-value-types-to-null
        if (next == null)
        {
            return source;
        }
        unchecked
        {
            // multiply by the prime before adding, so the declared
            // Multiplier is actually used (the original snippet only added)
            return source * Multiplier + next.GetHashCode();
        }
    }
}
then you can do
HashHelper
    .InitialHash
    .GetHashCode(field0)
    .GetHashCode(field1)
    .GetHashCode(field2);
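In a real type that chain would feed the override, e.g. (a sketch with hypothetical fields):

public override int GetHashCode()
{
    return HashHelper
        .InitialHash
        .GetHashCode(field0)
        .GetHashCode(field1)
        .GetHashCode(field2);
}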
Have a look at this link; it describes MD5 hashing. Otherwise, just use GetHashCode().
Related
I've decided to implement a caching facade in one of our applications - the purpose is to eventually reduce the network overhead and limit the number of DB hits. We are using Castle.Windsor as our IoC container and we have decided to go with interceptors to add the caching functionality on top of our services layer, using the System.Runtime.Caching namespace.
At this moment I can't exactly figure out what's the best approach for constructing the cache key. The goal is to make a distinction between different methods and also include passed argument values - meaning that these two method calls should be cached under two different keys:
IEnumerable<MyObject> GetMyObjectByParam(56); // key1
IEnumerable<MyObject> GetMyObjectByParam(23); // key2
For now I can see two possible implementations:
Option 1:
assembly | class | method return type | method name | argument types | argument hash codes
"MyAssembly.MyClass IEnumerable<MyObject> GetMyObjectByParam(long) { 56 }";
Option 2:
MD5 or SHA-256 computed hash based on the method's fully-qualified name and passed argument values
string key = new SHA256Managed().ComputeHash(name + args).ToString();
I'm leaning towards the first option, as the second one requires more processing time - on the other hand, the second option enforces exactly the same length for all generated keys.
Is it safe to assume that the first option will generate a unique key for methods using complex argument types? Or maybe there is a completely different way of doing this?
Help and opinions will be highly appreciated!
Based on some very useful links that I've found here and here I've decided to implement it more-or-less like this:
public sealed class CacheKey : IEquatable<CacheKey>
{
    private readonly Type reflectedType;
    private readonly Type returnType;
    private readonly string name;
    private readonly Type[] parameterTypes;
    private readonly object[] arguments;

    public CacheKey(Type reflectedType, Type returnType, string name,
        Type[] parameterTypes, object[] arguments)
    {
        // check for null, incorrect values etc.
        this.reflectedType = reflectedType;
        this.returnType = returnType;
        this.name = name;
        this.parameterTypes = parameterTypes;
        this.arguments = arguments;
    }

    public override bool Equals(object obj)
    {
        return Equals(obj as CacheKey);
    }

    public bool Equals(CacheKey other)
    {
        if (other == null)
        {
            return false;
        }
        // guard against different argument counts before indexing
        if (parameterTypes.Length != other.parameterTypes.Length ||
            arguments.Length != other.arguments.Length)
        {
            return false;
        }
        for (int i = 0; i < parameterTypes.Length; i++)
        {
            if (!parameterTypes[i].Equals(other.parameterTypes[i]))
            {
                return false;
            }
        }
        for (int i = 0; i < arguments.Length; i++)
        {
            if (!arguments[i].Equals(other.arguments[i]))
            {
                return false;
            }
        }
        return reflectedType.Equals(other.reflectedType) &&
               returnType.Equals(other.returnType) &&
               name.Equals(other.name);
    }

    // note: an override must not be private
    public override int GetHashCode()
    {
        unchecked
        {
            int hash = 17;
            hash = hash * 31 + reflectedType.GetHashCode();
            hash = hash * 31 + returnType.GetHashCode();
            hash = hash * 31 + name.GetHashCode();
            for (int i = 0; i < parameterTypes.Length; i++)
            {
                hash = hash * 31 + parameterTypes[i].GetHashCode();
            }
            for (int i = 0; i < arguments.Length; i++)
            {
                hash = hash * 31 + arguments[i].GetHashCode();
            }
            return hash;
        }
    }
}
Basically it's just a general idea - the above code can easily be rewritten to a more generic version with a single collection of fields; the same rules would have to be applied to each element of the collection. I can share the full code.
An option you seem to have skipped is using the built-in .NET GetHashCode() function for the string. I'm fairly certain this is what goes on behind the scenes in a C# Dictionary with a string as the TKey (I mention that because you've tagged the question with dictionary). I'm not sure how the .NET Dictionary class relates to your Castle.Windsor or the System.Runtime.Caching interface you mention.
The reason you wouldn't want to use GetHashCode as a persistent hash key is that Microsoft specifically reserves the right to change its behaviour between versions without warning (for example, to provide a more unique or faster-executing function). If this cache will live strictly in memory, then this is not a concern, because upgrading the .NET Framework would necessitate a restart of your application, wiping the cache.
To clarify, just using the concatenated string (Option 1) should be sufficiently unique. It looks like you've added everything possible to uniquely qualify your methods.
If you end up feeding the string of an MD5 or SHA-256 into a dictionary key, the program would probably rehash the string behind the scenes anyway. It's been a while since I read about the inner workings of the Dictionary class, but if you leave it as a Dictionary<String, IEnumerable<MyObject>> (as opposed to calling GetHashCode() on the strings yourself and using the int return value as the key), then the dictionary will handle collisions of the hash code itself.
Also note that (at least according to a benchmark program run on my machine), MD5 is around 10% faster than SHA1 and twice as fast as SHA256. String.GetHashCode() is around 20 times faster than MD5 (it's not cryptographically secure). Tests were taken for the total time to compute the hashes for the same 100,000 randomly generated strings of length between 32 and 1024 characters. But regardless of the exact numbers, using a cryptographically secure hash function as a key will only slow down your program.
I can post the source code for my comparisons if you like.
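In the meantime, a minimal sketch of that kind of timing harness (not the original benchmark; the structure and names here are made up) could look like this:

using System;
using System.Diagnostics;
using System.Security.Cryptography;
using System.Text;

static class HashTiming
{
    static void Main()
    {
        // 100,000 random strings of length 32-1024, as described above
        var rng = new Random(42);
        var inputs = new string[100000];
        for (int i = 0; i < inputs.Length; i++)
        {
            var chars = new char[rng.Next(32, 1025)];
            for (int j = 0; j < chars.Length; j++)
                chars[j] = (char)rng.Next('a', 'z' + 1);
            inputs[i] = new string(chars);
        }

        Time("String.GetHashCode", inputs, s => s.GetHashCode());

        using (var md5 = MD5.Create())
            Time("MD5", inputs, s => md5.ComputeHash(Encoding.UTF8.GetBytes(s)));

        using (var sha1 = SHA1.Create())
            Time("SHA-1", inputs, s => sha1.ComputeHash(Encoding.UTF8.GetBytes(s)));

        using (var sha256 = SHA256.Create())
            Time("SHA-256", inputs, s => sha256.ComputeHash(Encoding.UTF8.GetBytes(s)));
    }

    static void Time(string name, string[] inputs, Func<string, object> hash)
    {
        var sw = Stopwatch.StartNew();
        foreach (var s in inputs) hash(s);
        Console.WriteLine("{0}: {1} ms", name, sw.ElapsedMilliseconds);
    }
}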
I've asked a question about this class before, but here is another one.
I've created a Complex class:
public class Complex
{
    public double Real { get; set; }
    public double Imaginary { get; set; }
}
And I'm implementing the Equals and GetHashCode functions, where Equals takes into account a certain precision. I use the following logic for that:
public override bool Equals(object obj)
{
    // Some default null checking etc. here; complex is obj cast to Complex.
    var complex = (Complex)obj;
    return Math.Abs(complex.Imaginary - Imaginary) <= 0.00001 &&
           Math.Abs(complex.Real - Real) <= 0.00001;
}
This works: when the imaginary and the real parts are really close to each other, it says they are the same.
Now I'm trying to implement the GetHashCode function. I've used some of the examples Jon Skeet posted here, and currently I have the following:
public override int GetHashCode()
{
    var hash = 17;
    hash = hash * 23 + Real.GetHashCode();
    hash = hash * 23 + Imaginary.GetHashCode();
    return hash;
}
However, this does not take into account the precision I want to use. So basically the following two instances:
Complex1[Real = 1.123456; Imaginary = 1.123456]
Complex2[Real = 1.123457; Imaginary = 1.123457]
are equal but do not produce the same hash code. How can I achieve that?
First of all, your Equals() implementation is broken. Read here to see why.
Second, such a "fuzzy equals" breaks the contract of Equals() (it's not transitive, for one thing), so using it with Hashtable will not work, no matter how you implement GetHashCode().
For this kind of thing, you really need a spatial index such as an R-Tree.
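To see the transitivity problem concretely (values chosen to straddle the 0.00001 tolerance):

var a = new Complex { Real = 1.000000, Imaginary = 0 };
var b = new Complex { Real = 1.000008, Imaginary = 0 };
var c = new Complex { Real = 1.000016, Imaginary = 0 };

// a.Equals(b) and b.Equals(c) are true (each difference is 0.000008 <= 0.00001),
// but a.Equals(c) is false (0.000016 > 0.00001). Equals is not transitive,
// so hash-based containers cannot partition these values consistently.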
Just drop precision when you calculate the hash value.
public override int GetHashCode()
{
    var hash = 17;
    hash = hash * 23 + Math.Round(Real, 5).GetHashCode();
    hash = hash * 23 + Math.Round(Imaginary, 5).GetHashCode();
    return hash;
}
where 5 is your precision value.
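One caveat worth noting (not raised in this answer): two values that are equal under the tolerance can still round differently, so this does not fully repair the contract. For example:

// |1.123456 - 1.123454| = 0.000002 <= 0.00001, so Equals reports them equal,
// but Math.Round(1.123454, 5) == 1.12345 while Math.Round(1.123456, 5) == 1.12346,
// so the two "equal" values get different hash codes.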
I see two simple options:
Use Decimal instead of double
Instead of using Real.GetHashCode, use Real.RoundTo6Ciphers().GetHashCode().
Then you'll have the same hashcode.
I would create read-only properties that round Real and Imaginary to the nearest hundred-thousandth, and then implement Equals and GetHashCode against those getter properties.
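A sketch of that suggestion (note that it makes Equals compare the rounded values rather than using a tolerance, which is exactly what keeps it consistent with GetHashCode):

public class Complex
{
    public double Real { get; set; }
    public double Imaginary { get; set; }

    // round to the nearest hundred-thousandth, as suggested above
    private double RoundedReal { get { return Math.Round(Real, 5); } }
    private double RoundedImaginary { get { return Math.Round(Imaginary, 5); } }

    public override bool Equals(object obj)
    {
        var other = obj as Complex;
        return other != null &&
               RoundedReal == other.RoundedReal &&
               RoundedImaginary == other.RoundedImaginary;
    }

    public override int GetHashCode()
    {
        unchecked
        {
            int hash = 17;
            hash = hash * 23 + RoundedReal.GetHashCode();
            hash = hash * 23 + RoundedImaginary.GetHashCode();
            return hash;
        }
    }
}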
I need to generate a fast hash code in GetHashCode for a BitArray. I have a Dictionary where the keys are BitArrays, and all the BitArrays are of the same length.
Does anyone know of a fast way to generate a good hash from a variable number of bits, as in this scenario?
UPDATE:
The approach I originally took was to access the internal array of ints directly through reflection (speed is more important than encapsulation in this case) and then XOR those values. The XOR approach seems to work well, i.e. my Equals method isn't called excessively when searching in the Dictionary:
public int GetHashCode(BitArray array)
{
    int hash = 0;
    foreach (int value in array.GetInternalValues())
    {
        hash ^= value;
    }
    return hash;
}
However, the approach suggested by Mark Byers and seen elsewhere on StackOverflow was slightly better (16570 Equals calls vs 16608 for the XOR for my test data). Note that this approach fixes a bug in the previous one where bits beyond the end of the bit array could affect the hash value. This could happen if the bit array was reduced in length.
public int GetHashCode(BitArray array)
{
    UInt32 hash = 17;
    int bitsRemaining = array.Length;
    foreach (int value in array.GetInternalValues())
    {
        UInt32 cleanValue = (UInt32)value;
        if (bitsRemaining < 32)
        {
            // clear any bits that are beyond the end of the array
            int bitsToWipe = 32 - bitsRemaining;
            cleanValue <<= bitsToWipe;
            cleanValue >>= bitsToWipe;
        }
        hash = hash * 23 + cleanValue;
        bitsRemaining -= 32;
    }
    return (int)hash;
}
The GetInternalValues extension method is implemented like this:
public static class BitArrayExtensions
{
    // requires: using System.Reflection;
    static FieldInfo _internalArrayGetter = GetInternalArrayGetter();

    static FieldInfo GetInternalArrayGetter()
    {
        return typeof(BitArray).GetField("m_array", BindingFlags.NonPublic | BindingFlags.Instance);
    }

    static int[] GetInternalArray(BitArray array)
    {
        return (int[])_internalArrayGetter.GetValue(array);
    }

    public static IEnumerable<int> GetInternalValues(this BitArray array)
    {
        return GetInternalArray(array);
    }

    // ... more extension methods
}
Any suggestions for improvement are welcome!
BitArray is a terrible class to act as a key in a Dictionary. The only reasonable way to implement GetHashCode() is by using its CopyTo() method to copy the bits into a byte[]. That's not great; it creates a ton of garbage.
Beg, steal or borrow to use a BitVector32 instead. It has a good implementation for GetHashCode(). If you've got more than 32 bits, then consider spinning your own class so you can get to the underlying array without having to copy.
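For up to 32 bits, the usage is straightforward (a minimal sketch; BitVector32 lives in System.Collections.Specialized and is a struct with value-based Equals/GetHashCode, so it works directly as a dictionary key):

using System.Collections.Generic;
using System.Collections.Specialized;

class Example
{
    static void Main()
    {
        var flags = new BitVector32(0);
        int bit0 = BitVector32.CreateMask();     // mask for the first bit
        int bit1 = BitVector32.CreateMask(bit0); // mask for the second bit
        flags[bit0] = true;

        var cache = new Dictionary<BitVector32, string>();
        cache[flags] = "entry";
    }
}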
If the bit arrays are 32 bits or shorter, then you just need to convert them to 32-bit integers (padding with zero bits if necessary).
If they can be longer, then you can either convert them to a series of 32-bit integers and XOR them, or better: use the algorithm described in Effective Java.
public int GetHashCode()
{
    int hash = 17;
    hash = hash * 23 + field1.GetHashCode();
    hash = hash * 23 + field2.GetHashCode();
    hash = hash * 23 + field3.GetHashCode();
    return hash;
}
Taken from here. field1, field2, ... correspond to the first 32 bits, the second 32 bits, etc.
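If you'd rather avoid the reflection from the earlier snippets, the conversion step can use BitArray.CopyTo, which accepts an int[] target. A sketch (depending on the runtime, stale bits past array.Length in the last word may still need masking, as shown earlier):

// requires: using System.Collections;
public static int GetFastHashCode(BitArray array)
{
    // pack the bits into 32-bit words (one allocation per call)
    int[] words = new int[(array.Length + 31) / 32];
    array.CopyTo(words, 0);

    unchecked
    {
        int hash = 17;
        foreach (int word in words)
        {
            hash = hash * 23 + word;
        }
        return hash;
    }
}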
Testing the Equals method is pretty much straightforward (as far as I know). But how on earth do you test the GetHashCode method?
Test that two distinct objects which are equal have the same hash code (for various values). Check that non-equal objects give different hash codes, varying one aspect/property at a time. While the hash codes don't have to be different, you'd be really unlucky to pick different values for properties which happen to give the same hash code unless you've got a bug.
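A minimal NUnit-style sketch of those checks, reusing the Spot class from the next answer (and assuming it also overrides Equals, which the snippet below does not show):

[Test]
public void EqualObjectsHaveEqualHashCodes()
{
    var a = new Spot(1, 2);
    var b = new Spot(1, 2);
    Assert.AreEqual(a, b); // needs an Equals override
    Assert.AreEqual(a.GetHashCode(), b.GetHashCode());
}

[Test]
public void VaryingOnePropertyUsuallyChangesTheHashCode()
{
    var a = new Spot(1, 2);
    var c = new Spot(1, 3);
    // not required by the contract, but a collision here would be suspicious
    Assert.AreNotEqual(a.GetHashCode(), c.GetHashCode());
}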
Gallio/MbUnit v3.2 comes with convenient contract verifiers which are able to test your implementation of GetHashCode() and IEquatable<T>. More specifically, you may be interested in the EqualityContract and the HashCodeAcceptanceContract. See here, here and there for more details.
public class Spot
{
    private readonly int x;
    private readonly int y;

    public Spot(int x, int y)
    {
        this.x = x;
        this.y = y;
    }

    public override int GetHashCode()
    {
        int h = -2128831035;
        h = (h * 16777619) ^ x;
        h = (h * 16777619) ^ y;
        return h;
    }
}
Then you declare your contract verifier like this:
[TestFixture]
public class SpotTest
{
    [VerifyContract]
    public readonly IContract HashCodeAcceptanceTests = new HashCodeAcceptanceContract<Spot>()
    {
        CollisionProbabilityLimit = CollisionProbability.VeryLow,
        UniformDistributionQuality = UniformDistributionQuality.Excellent,
        DistinctInstances = DataGenerators.Join(Enumerable.Range(0, 1000), Enumerable.Range(0, 1000))
            .Select(o => new Spot(o.First, o.Second))
    };
}
It would be fairly similar to Equals(). You'd want to make sure two objects which were the "same" at least had the same hash code. That means if .Equals() returns true, the hash codes should be identical as well. As far as what the proper hashcode values are, that depends on how you're hashing.
From personal experience: aside from obvious things like the same objects giving you the same hash codes, you need to create a large enough array of unique objects and count the unique hash codes among them. If the unique hash codes make up less than, say, 50% of the overall object count, then you are in trouble, as your hash function is not good.
List<int> hashList = new List<int>(testObjectList.Count);
for (int i = 0; i < testObjectList.Count; i++)
{
    hashList.Add(testObjectList[i].GetHashCode()); // store the hash, not the object
}
hashList.Sort();
int differentValues = 1; // the first value is always distinct
int curValue = hashList[0];
for (int i = 1; i < hashList.Count; i++)
{
    if (hashList[i] != curValue)
    {
        differentValues++;
        curValue = hashList[i];
    }
}
Assert.Greater(differentValues, hashList.Count / 2);
In addition to checking that object equality implies equality of hash codes, and that the distribution of hashes is fairly flat as suggested by Yann Trevin (if performance is a concern), you may also wish to consider what happens if you change a property of the object.
Suppose your object changes while it's in a dictionary/hashset. Do you want Contains(object) to still be true? If so, then your GetHashCode had better not depend on the mutable property that was changed.
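A short sketch of that failure mode (the Person type here is hypothetical):

using System.Collections.Generic;

public class Person
{
    public string Name { get; set; }

    public override bool Equals(object obj)
    {
        var other = obj as Person;
        return other != null && Name == other.Name;
    }

    public override int GetHashCode()
    {
        return Name == null ? 0 : Name.GetHashCode();
    }
}

// ...

var set = new HashSet<Person>();
var p = new Person { Name = "Ann" };
set.Add(p);

p.Name = "Bea";               // the key's hash code changes while it is in the set
bool found = set.Contains(p); // false: the lookup now probes the wrong bucket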
I would pre-supply a known/expected hash and compare it with the result of GetHashCode.
You create separate instances with the same value and check that GetHashCode returns the same value for both instances, and that repeated calls on the same instance return the same value.
That is the only requirement for a hash code to work. To work well, the hash codes should of course have a good distribution, but testing for that requires a lot of testing...
I have an object for which I want to generate a unique hash (override GetHashCode()) but I want to avoid overflows or something unpredictable.
The code should be the result of combining the hash codes of a small collection of strings.
The hash codes will be part of generating a cache key, so ideally they should be unique; however, the number of possible values that are being hashed is small, so I THINK probability is in my favour here.
Would something like this be sufficient AND is there a better way of doing this?
int hash = 0;
foreach (string item in collection)
{
    hash += item.GetHashCode() / collection.Count;
}
return hash;
EDIT: Thanks for answers so far.
@Jon Skeet: No, order is not important
I guess this is almost another question, but since I am using the result to generate a cache key (a string), would it make sense to use a cryptographic hash function like MD5, or to just use the string representation of this int?
The fundamentals pointed out by Marc and Jon are not bad, but they are far from optimal in terms of the evenness of the distribution of the results. Sadly, the 'multiply by primes' approach copied by so many people from Knuth is not the best choice; in many cases better distribution can be achieved by cheaper-to-calculate functions (though this matters very little on modern hardware). In fact, throwing primes into many aspects of hashing is no panacea.
If this data is used for significantly sized hash tables, I recommend reading Bret Mulvey's excellent study and explanation of various modern (and not-so-modern) hashing techniques, handily done in C#.
Note that the behaviour of the various hash functions on strings is heavily biased towards whether the strings are short (roughly speaking, how many characters are hashed before the bits begin to overflow) or long.
One of the simplest and easiest to implement is also one of the best: the Jenkins one-at-a-time hash.
private static unsafe void Hash(byte* d, int len, ref uint h)
{
    for (int i = 0; i < len; i++)
    {
        h += d[i];
        h += (h << 10);
        h ^= (h >> 6);
    }
}

public unsafe static void Hash(ref uint h, string s)
{
    fixed (char* c = s)
    {
        byte* b = (byte*)(void*)c;
        Hash(b, s.Length * 2, ref h); // each char is two bytes
    }
}

public unsafe static int Avalanche(uint h)
{
    // final mix to spread entropy into the low-order bits
    h += (h << 3);
    h ^= (h >> 11);
    h += (h << 15);
    return *((int*)(void*)&h);
}
you can then use this like so:
uint h = 0;
foreach (string item in collection)
{
    Hash(ref h, item);
}
return Avalanche(h);
you can merge multiple different types like so:
public unsafe static void Hash(ref uint h, int data)
{
    byte* d = (byte*)(void*)&data;
    Hash(d, sizeof(int), ref h);
}

public unsafe static void Hash(ref uint h, long data)
{
    byte* d = (byte*)(void*)&data;
    Hash(d, sizeof(long), ref h);
}
If you only have access to each field as an object, with no knowledge of its internals, you can simply call GetHashCode() on each one and combine the values like so:
uint h = 0;
foreach (var item in collection)
{
    Hash(ref h, item.GetHashCode());
}
return Avalanche(h);
Sadly you can't do sizeof(T), so you must handle each struct individually.
If you wish to use reflection, you can construct, on a per-type basis, a function which does structural identity and hashing on all fields.
If you wish to avoid unsafe code, you can use bit-masking techniques to pull out individual bits from ints (and from chars, if dealing with strings) with not too much extra hassle.
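For instance, a managed variant of the one-at-a-time hash over a string's chars, feeding each char's two bytes in the same low-byte-first order as the unsafe version above (a sketch):

public static uint Hash(uint h, string s)
{
    foreach (char c in s)
    {
        h += (uint)(c & 0xFF);   // low byte first, matching the little-endian walk
        h += (h << 10);
        h ^= (h >> 6);
        h += (uint)(c >> 8);     // then the high byte
        h += (h << 10);
        h ^= (h >> 6);
    }
    return h;
}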
Hashes aren't meant to be unique - they're just meant to be well distributed in most situations, and they're meant to be consistent. Note that overflows shouldn't be a problem.
Just adding isn't generally a good idea, and dividing certainly isn't. Here's the approach I usually use:
int result = 17;
foreach (string item in collection)
{
    result = result * 31 + item.GetHashCode();
}
return result;
If you're otherwise in a checked context, you might want to deliberately make it unchecked.
Note that this assumes that order is important, i.e. that { "a", "b" } should be different from { "b", "a" }. Please let us know if that's not the case.
There is nothing wrong with this approach as long as the members whose hashcodes you are combining follow the rules of hash codes. In short ...
The hash code of the private members should not change for the lifetime of the object
The container must not change the object the private members point to lest it in turn change the hash code of the container
If the order of the items is not important (i.e. {"a","b"} is the same as {"b","a"}) then you can use exclusive or to combine the hash codes:
hash ^= item.GetHashCode();
[Edit: As Mark pointed out in a comment to a different answer, this has the drawback of also giving collections like {"a"} and {"a","b","b"} the same hash code.]
If the order is important, you can instead multiply by a prime number and add:
hash *= 11;
hash += item.GetHashCode();
(When you multiply you will sometimes get an overflow that is ignored, but by multiplying by a prime number you lose a minimum of information. If you instead multiplied by a number like 16, you would lose four bits of information each time, so after eight items the hash code from the first item would be completely gone.)