Testing the Equals method is pretty much straight forward (as far as I know). But how on earth do you test the GetHashCode method?
Test that two distinct objects which are equal have the same hash code (for various values). Check that non-equal objects give different hash codes, varying one aspect/property at a time. While the hash codes don't have to be different, you'd be really unlucky to pick different values for properties which happen to give the same hash code unless you've got a bug.
Gallio/MbUnit v3.2 comes with convenient contract verifiers which are able to test your implementation of GetHashCode() and IEquatable<T>. More specifically you may be interested by the EqualityContract and the HashCodeAcceptanceContract. See here, here and there for more details.
public class Spot
{
private readonly int x;
private readonly int y;
public Spot(int x, int y)
{
this.x = x;
this.y = y;
}
public override int GetHashCode()
{
int h = -2128831035;
h = (h * 16777619) ^ x;
h = (h * 16777619) ^ y;
return h;
}
}
Then you declare your contract verifier like this:
[TestFixture]
public class SpotTest
{
[VerifyContract]
public readonly IContract HashCodeAcceptanceTests = new HashCodeAcceptanceContract<Spot>()
{
CollisionProbabilityLimit = CollisionProbability.VeryLow,
UniformDistributionQuality = UniformDistributionQuality.Excellent,
DistinctInstances = DataGenerators.Join(Enumerable.Range(0, 1000), Enumerable.Range(0, 1000)).Select(o => new Spot(o.First, o.Second))
};
}
It would be fairly similar to Equals(). You'd want to make sure two objects which were the "same" at least had the same hash code. That means if .Equals() returns true, the hash codes should be identical as well. As far as what the proper hashcode values are, that depends on how you're hashing.
From personal experience. Aside from obvious things like same objects giving you same hash codes, you need to create large enough array of unique objects and count unique hash codes among them. If unique hash codes make less than, say 50% of overall object count, then you are in trouble, as your hash function is not good.
List<int> hashList = new List<int>(testObjectList.Count);
for (int i = 0; i < testObjectList.Count; i++)
{
hashList.Add(testObjectList[i]);
}
hashList.Sort();
int differentValues = 0;
int curValue = hashList[0];
for (int i = 1; i < hashList.Count; i++)
{
if (hashList[i] != curValue)
{
differentValues++;
curValue = hashList[i];
}
}
Assert.Greater(differentValues, hashList.Count/2);
In addition to checking that object equality implies equality of hashcodes, and the distribution of hashes is fairly flat as suggested by Yann Trevin (if performance is a concern), you may also wish to consider what happens if you change a property of the object.
Suppose your object changes while it's in a dictionary/hashset. Do you want the Contains(object) to still be true? If so then your GetHashCode had better not depend on the mutable property that was changed.
I would pre-supply a known/expected hash and compare what the result of GetHashCode is.
You create separate instances with the same value and check that the GetHashCode for the instances returns the same value, and that repeated calls on the same instance returns the same value.
That is the only requirement for a hash code to work. To work well the hash codes should of course have a good distribution, but testing for that requires a lot of testing...
Related
After executing this piece of code:
int a = 50;
float b = 50.0f;
Console.WriteLine(a.GetHashCode() == b.GetHashCode());
We get False, which is expected, since we are dealing with different objects, hence we should get different hashes.
However, if we execute this:
int a = 0;
float b = 0.0f;
Console.WriteLine(a.GetHashCode() == b.GetHashCode());
We get True. Both obejcts return the same hash code: 0.
Why does this happen? Aren't they supposed to return different hashes?
The GetHashCode of System.Int32 works like:
public override int GetHashCode()
{
return this;
}
Which of course with this being 0, it will return 0.
System.Single's (float is alias) GetHashCode is:
public unsafe override int GetHashCode()
{
float num = this;
if (num == 0f)
{
return 0;
}
return *(int*)(&num);
}
Like you see, at 0f it will return 0.
Program used is ILSpy.
From MSDN Documentation:
Two objects that are equal return hash codes that are equal. However,
the reverse is not true: equal hash codes do not imply object
equality, because different (unequal) objects can have identical hash
codes.
Objects that are conceptually equal are obligated to return the same hashes. Objects that are different are not obligated to return different hashes. That would only be possible if there were less than 2^32 objects that could ever possibly exist. There are more than that. When objects that are different result in the same hash it is called a "collision". A quality hash algorithm minimizes collisions as much as possible, but they can never be removed entirely.
Why should they? Hash codes are a finite set; as many as you can fit in an Int32. There are many many doubles that will have the same hash code as any given int or any other given double.
Hash codes basically have to follow two simple rules:
If two objects are equal, they should have the same hash code.
If an object does not mutate its internal state then the hash code should remain the same.
Nothing obliges two objects that are not equal to have different hash codes; it is mathematically impossible.
After executing this piece of code:
int a = 50;
float b = 50.0f;
Console.WriteLine(a.GetHashCode() == b.GetHashCode());
We get False, which is expected, since we are dealing with different objects, hence we should get different hashes.
However, if we execute this:
int a = 0;
float b = 0.0f;
Console.WriteLine(a.GetHashCode() == b.GetHashCode());
We get True. Both obejcts return the same hash code: 0.
Why does this happen? Aren't they supposed to return different hashes?
The GetHashCode of System.Int32 works like:
public override int GetHashCode()
{
return this;
}
Which of course with this being 0, it will return 0.
System.Single's (float is alias) GetHashCode is:
public unsafe override int GetHashCode()
{
float num = this;
if (num == 0f)
{
return 0;
}
return *(int*)(&num);
}
Like you see, at 0f it will return 0.
Program used is ILSpy.
From MSDN Documentation:
Two objects that are equal return hash codes that are equal. However,
the reverse is not true: equal hash codes do not imply object
equality, because different (unequal) objects can have identical hash
codes.
Objects that are conceptually equal are obligated to return the same hashes. Objects that are different are not obligated to return different hashes. That would only be possible if there were less than 2^32 objects that could ever possibly exist. There are more than that. When objects that are different result in the same hash it is called a "collision". A quality hash algorithm minimizes collisions as much as possible, but they can never be removed entirely.
Why should they? Hash codes are a finite set; as many as you can fit in an Int32. There are many many doubles that will have the same hash code as any given int or any other given double.
Hash codes basically have to follow two simple rules:
If two objects are equal, they should have the same hash code.
If an object does not mutate its internal state then the hash code should remain the same.
Nothing obliges two objects that are not equal to have different hash codes; it is mathematically impossible.
For C++ I've always been using Boost.Functional/Hash to create good hash values without having to deal with bit shifts, XORs and prime numbers. Is there any libraries that produces good (I'm not asking for optimal) hash values for C#/.NET? I would use this utility to implement GetHashCode(), not cryptographic hashes.
To clarify why I think this is useful, here's the implementation of boost::hash_combine which combines to hash values (ofcourse a very common operation when implementing GetHashCode()):
seed ^= hash_value(v) + 0x9e3779b9 + (seed << 6) + (seed >> 2);
Clearly, this sort of code doesn't belong in the implementation of GetHashCode() and should therefor be implemented elsewhere.
I wouldn't used a separate library just for that. As mentioned before, for the GetHashCode method it is essential to be fast and stable. Usually I prefer to write inline implementation, but it might be actually a good idea to use a helper class:
internal static class HashHelper
{
private static int InitialHash = 17; // Prime number
private static int Multiplier = 23; // Different prime number
public static Int32 GetHashCode(params object[] values)
{
unchecked // overflow is fine
{
int hash = InitialHash;
if (values != null)
for (int i = 0; i < values.Length; i++)
{
object currentValue = values[i];
hash = hash * Multiplier
+ (currentValue != null ? currentValue.GetHashCode() : 0);
}
return hash;
}
}
}
This way common hash-calculation logic can be used:
public override int GetHashCode()
{
return HashHelper.GetHashCode(field1, field2);
}
The answers to this question contains some examples of helper-classes that resembles Boost.Functional/Hash. None looks quite as elegant, though.
I am not aware of any real .NET library that provides the equivalent.
Unless you have very specific requirements you don't need to calculate your type's hashcode from first principles. Rather combine the hash codes of the fields/properties you use for equality determination in one of the simple ways, something like:
int hash = field1.GetHashCode();
hash = (hash *37) + field2.GetHashCode();
(Combination function taken from ยง3.3.2 C# in Depth, 2nd Ed, Jon Skeet).
To avoid the boxing issue chain your calls using a generic extension method on Int32
public static class HashHelper
{
public static int InitialHash = 17; // Prime number
private static int Multiplier = 23; // Different prime number
[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static Int32 GetHashCode<T>( this Int32 source, T next )
{
// comparing null of value objects is ok. See
// http://stackoverflow.com/questions/1972262/c-sharp-okay-with-comparing-value-types-to-null
if ( next == null )
{
return source;
}
unchecked
{
return source + next.GetHashCode();
}
}
}
then you can do
HashHelper
.InitialHash
.GetHashCode(field0)
.GetHashCode(field1)
.GetHashCode(field2);
Have a look at this link, it describes MD5 hashing.
Otherwise use GetHashCode().
I've asked a question about this class before, but here is one again.
I've created a Complex class:
public class Complex
{
public double Real { get; set; }
public double Imaginary { get; set; }
}
And I'm implementing the Equals and the Hashcode functions, and the Equal function takes in account a certain precision. I use the following logic for that:
public override bool Equals(object obj)
{
//Some default null checkint etc here, the next code is all that matters.
return Math.Abs(complex.Imaginary - Imaginary) <= 0.00001 &&
Math.Abs(complex.Real - Real) <= 0.00001;
}
Well this works, when the Imaginary and the Real part are really close to each other, it says they are the same.
Now I was trying to implement the HashCode function, I've used some examples John skeet used here, currently I have the following.
public override int GetHashCode()
{
var hash = 17;
hash = hash*23 + Real.GetHashCode();
hash = hash*23 + Imaginary.GetHashCode();
return hash;
}
However, this does not take in account the certain precision I want to use. So basically the following two classes:
Complex1[Real = 1.123456; Imaginary = 1.123456]
Complex2[Real = 1.123457; Imaginary = 1.123457]
Are Equal but do not provide the same HashCode, how can I achieve that?
First of all, your Equals() implementation is broken. Read here to see why.
Second, such a "fuzzy equals" breaks the contract of Equals() (it's not transitive, for one thing), so using it with Hashtable will not work, no matter how you implement GetHashCode().
For this kind of thing, you really need a spatial index such as an R-Tree.
Just drop precision when you calculate the hash value.
public override int GetHashCode()
{
var hash = 17;
hash = hash*23 + Math.Round(Real, 5).GetHashCode();
hash = hash*23 + Math.Round(Imaginary, 5).GetHashCode();
return hash;
}
where 5 is you precision value
I see two simple options:
Use Decimal instead of double
Instead of using Real.GetHashCode, use Real.RoundTo6Ciphers().GetHashCode().
Then you'll have the same hashcode.
I would create read-only properties that round Real and Imaginary to the nearest hundred-thousandth and then do equals and hashcode implementations on those getter properties.
I have an immutable Value Object, IPathwayModule, whose value is defined by:
(int) Block;
(Entity) Module, identified by (string) ModuleId;
(enum) Status; and
(entity) Class, identified by (string) ClassId - which may be null.
Here's my current IEqualityComparer implementation which seems to work in a few unit tests. However, I don't think I understand what I'm doing well enough to know whether I am doing it right. A previous implementation would sometimes fail on repeated test runs.
private class StandardPathwayModuleComparer : IEqualityComparer<IPathwayModule>
{
public bool Equals(IPathwayModule x, IPathwayModule y)
{
int hx = GetHashCode(x);
int hy = GetHashCode(y);
return hx == hy;
}
public int GetHashCode(IPathwayModule obj)
{
int h;
if (obj.Class != null)
{
h = obj.Block.GetHashCode() + obj.Module.ModuleId.GetHashCode() + obj.Status.GetHashCode() + obj.Class.ClassId.GetHashCode();
}
else
{
h = obj.Block.GetHashCode() + obj.Module.ModuleId.GetHashCode() + obj.Status.GetHashCode() + "NOCLASS".GetHashCode();
}
return h;
}
}
IPathwayModule is definitely immutable and different instances with the same values should be equal and produce the same HashCode since they are used as items within HashSets.
I suppose my questions are:
Am I using the interface correctly in this case?
Are there cases where I might not see the desired behaviour?
Is there any way to improve the robustness, performance?
Are there any good practices that I am not following?
Don't do the Equals in terms of the Hash function's results it's too fragile. Rather do a field value comparison for each of the fields. Something like:
return x != null && y != null && x.Name.Equals(y.Name) && x.Type.Equals(y.Type) ...
Also, the hash functions results aren't really amenable to addition. Try using the ^ operator instead.
return obj.Name.GetHashCode() ^ obj.Type.GetHashCode() ...
You don't need the null check in GetHashCode. If that value is null, you've got bigger problems, no use trying to recover from something over which you have no control...
The only big problem is the implementation of Equals. Hash codes are not unique, you can get the same hash code for objects which are different. You should compare each field of IPathwayModule individually.
GetHashCode() can be improved a bit. You don't need to call GetHashCode() on an int. The int itself is a good hash code. The same for enum values. Your GetHashCode could be then implemented like this:
public int GetHashCode(IPathwayModule obj)
{
unchecked {
int h = obj.Block + obj.Module.ModeleId.GetHashCode() + (int) obj.Status;
if (obj.class != null)
h += obj.Class.ClassId.GetHashCode();
return h;
}
}
The 'unchecked' block is necessary because there may be overflows in the arithmetic operations.
You shouldn't use GetHashCode() as the main way of comparison objects. Compare it field-wise.
There could be multiple objects with the same hash code (this is called 'hash code collisions').
Also, be careful when add together multiple integer values, since you can easily cause an OverflowException. Use 'exclusive or' (^) to combine hashcodes or wrap code into 'unchecked' block.
You should implement better versions of Equals and GetHashCode.
For instance, the hash code of enums is simply their numerical value.
In other words, with these two enums:
public enum A { x, y, z }
public enum B { k, l, m }
Then with your implementation, the following value type:
public struct AB {
public A;
public B;
}
the following two values would be considered equal:
AB ab1 = new AB { A = A.x, B = B.m };
AB ab2 = new AB { A = A.z, B = B.k };
I'm assuming you don't want that.
Also, passing the value types as interfaces will box them, this could have performance concerns, although probably not much. You might consider making the IEqualityComparer implementation take your value types directly.
Assuming that two objects are equal because their hash code is equal is wrong. You need to compare all members individually
It is proabably better to use ^ rather than + to combine the hash codes.
If I understand you well, you'd like to hear some comments on your code. Here're my remarks:
GetHashCode should be XOR'ed together, not added. XOR (^) gives a better chance of preventing collisions
You compare hashcodes. That's good, but only do this if the underlying object overrides the GetHashCode. If not, use properties and their hashcodes and combine them.
Hash codes are important, they make a quick compare possible. But if hash codes are equal, the object can still be different. This happens rarely. But you'll need to compare the fields of your object if hash codes are equal.
You say your value types are immutable, but you reference objects (.Class), which are not immutable
Always optimize comparison by adding reference comparison as first test. References unequal, the objects are unequal, then the structs are unequal.
Point 5 depends on whether the you want the objects that you reference in your value type to return not equal when not the same reference.
EDIT: you compare many strings. The string comparison is optimized in C#. You can, as others suggested, better use == with them in your comparison. For the GetHashCode, use OR ^ as suggested by others as well.
Thanks to all who responded. I have aggregated the feedback from everyone who responded and my improved IEqualityComparer now looks like:
private class StandardPathwayModuleComparer : IEqualityComparer<IPathwayModule>
{
public bool Equals(IPathwayModule x, IPathwayModule y)
{
if (x == y) return true;
if (x == null || y == null) return false;
if ((x.Class == null) ^ (y.Class == null)) return false;
if (x.Class == null) //and implicitly y.Class == null
{
return x.Block.Equals(y.Block) && x.Status.Equals(y.Status) && x.Module.ModuleId.Equals(y.Module.ModuleId);
}
return x.Block.Equals(y.Block) && x.Status.Equals(y.Status) && x.Module.ModuleId.Equals(y.Module.ModuleId) && x.Class.ClassId.Equals(y.Class.ClassId);
}
public int GetHashCode(IPathwayModule obj)
{
unchecked {
int h = obj.Block ^ obj.Module.ModuleId.GetHashCode() ^ (int) obj.Status;
if (obj.Class != null)
{
h ^= obj.Class.ClassId.GetHashCode();
}
return h;
}
}
}