I have an immutable Value Object, IPathwayModule, whose value is defined by:
(int) Block;
(Entity) Module, identified by (string) ModuleId;
(enum) Status; and
(entity) Class, identified by (string) ClassId - which may be null.
Here's my current IEqualityComparer implementation which seems to work in a few unit tests. However, I don't think I understand what I'm doing well enough to know whether I am doing it right. A previous implementation would sometimes fail on repeated test runs.
private class StandardPathwayModuleComparer : IEqualityComparer<IPathwayModule>
{
public bool Equals(IPathwayModule x, IPathwayModule y)
{
int hx = GetHashCode(x);
int hy = GetHashCode(y);
return hx == hy;
}
public int GetHashCode(IPathwayModule obj)
{
int h;
if (obj.Class != null)
{
h = obj.Block.GetHashCode() + obj.Module.ModuleId.GetHashCode() + obj.Status.GetHashCode() + obj.Class.ClassId.GetHashCode();
}
else
{
h = obj.Block.GetHashCode() + obj.Module.ModuleId.GetHashCode() + obj.Status.GetHashCode() + "NOCLASS".GetHashCode();
}
return h;
}
}
IPathwayModule is definitely immutable and different instances with the same values should be equal and produce the same HashCode since they are used as items within HashSets.
I suppose my questions are:
Am I using the interface correctly in this case?
Are there cases where I might not see the desired behaviour?
Is there any way to improve the robustness, performance?
Are there any good practices that I am not following?
Don't do the Equals in terms of the Hash function's results it's too fragile. Rather do a field value comparison for each of the fields. Something like:
return x != null && y != null && x.Name.Equals(y.Name) && x.Type.Equals(y.Type) ...
Also, the hash functions results aren't really amenable to addition. Try using the ^ operator instead.
return obj.Name.GetHashCode() ^ obj.Type.GetHashCode() ...
You don't need the null check in GetHashCode. If that value is null, you've got bigger problems, no use trying to recover from something over which you have no control...
The only big problem is the implementation of Equals. Hash codes are not unique, you can get the same hash code for objects which are different. You should compare each field of IPathwayModule individually.
GetHashCode() can be improved a bit. You don't need to call GetHashCode() on an int. The int itself is a good hash code. The same for enum values. Your GetHashCode could be then implemented like this:
public int GetHashCode(IPathwayModule obj)
{
unchecked {
int h = obj.Block + obj.Module.ModeleId.GetHashCode() + (int) obj.Status;
if (obj.class != null)
h += obj.Class.ClassId.GetHashCode();
return h;
}
}
The 'unchecked' block is necessary because there may be overflows in the arithmetic operations.
You shouldn't use GetHashCode() as the main way of comparison objects. Compare it field-wise.
There could be multiple objects with the same hash code (this is called 'hash code collisions').
Also, be careful when add together multiple integer values, since you can easily cause an OverflowException. Use 'exclusive or' (^) to combine hashcodes or wrap code into 'unchecked' block.
You should implement better versions of Equals and GetHashCode.
For instance, the hash code of enums is simply their numerical value.
In other words, with these two enums:
public enum A { x, y, z }
public enum B { k, l, m }
Then with your implementation, the following value type:
public struct AB {
public A;
public B;
}
the following two values would be considered equal:
AB ab1 = new AB { A = A.x, B = B.m };
AB ab2 = new AB { A = A.z, B = B.k };
I'm assuming you don't want that.
Also, passing the value types as interfaces will box them, this could have performance concerns, although probably not much. You might consider making the IEqualityComparer implementation take your value types directly.
Assuming that two objects are equal because their hash code is equal is wrong. You need to compare all members individually
It is proabably better to use ^ rather than + to combine the hash codes.
If I understand you well, you'd like to hear some comments on your code. Here're my remarks:
GetHashCode should be XOR'ed together, not added. XOR (^) gives a better chance of preventing collisions
You compare hashcodes. That's good, but only do this if the underlying object overrides the GetHashCode. If not, use properties and their hashcodes and combine them.
Hash codes are important, they make a quick compare possible. But if hash codes are equal, the object can still be different. This happens rarely. But you'll need to compare the fields of your object if hash codes are equal.
You say your value types are immutable, but you reference objects (.Class), which are not immutable
Always optimize comparison by adding reference comparison as first test. References unequal, the objects are unequal, then the structs are unequal.
Point 5 depends on whether the you want the objects that you reference in your value type to return not equal when not the same reference.
EDIT: you compare many strings. The string comparison is optimized in C#. You can, as others suggested, better use == with them in your comparison. For the GetHashCode, use OR ^ as suggested by others as well.
Thanks to all who responded. I have aggregated the feedback from everyone who responded and my improved IEqualityComparer now looks like:
private class StandardPathwayModuleComparer : IEqualityComparer<IPathwayModule>
{
public bool Equals(IPathwayModule x, IPathwayModule y)
{
if (x == y) return true;
if (x == null || y == null) return false;
if ((x.Class == null) ^ (y.Class == null)) return false;
if (x.Class == null) //and implicitly y.Class == null
{
return x.Block.Equals(y.Block) && x.Status.Equals(y.Status) && x.Module.ModuleId.Equals(y.Module.ModuleId);
}
return x.Block.Equals(y.Block) && x.Status.Equals(y.Status) && x.Module.ModuleId.Equals(y.Module.ModuleId) && x.Class.ClassId.Equals(y.Class.ClassId);
}
public int GetHashCode(IPathwayModule obj)
{
unchecked {
int h = obj.Block ^ obj.Module.ModuleId.GetHashCode() ^ (int) obj.Status;
if (obj.Class != null)
{
h ^= obj.Class.ClassId.GetHashCode();
}
return h;
}
}
}
Related
I have a class that is similar to this:
public class Int16_2D
{
public Int16 a, b;
public override bool Equals(Object other)
{
return other is Int16_2D &&
a == ((Int16_2D)other).a &&
b == ((Int16_2D)other).b;
}
}
This works in HashSet<Int16_2D>. However in Dictionary<Int16_2D, myType>, .ContainsKey returns false when it shouldn't. Am I missing something in my implementation of ==?
For a class to work in a hash table or dictionary, you need to implement GetHashCode()! I have no idea why it's working in HashSet; I would guess it was just luck.
Note that it's dangerous to use mutable fields for calculating Equals or GetHashCode(). Why? Consider this:
var x = new Int16_2D { a = 1, b = 2 };
var set = new HashSet<Int16_2D> { x };
var y = new Int16_2D { a = 1, b = 2 };
Console.WriteLine(set.Contains(y)); // True
x.a = 3;
Console.WriteLine(set.Contains(y)); // False
Console.WriteLine(set.Contains(x)); // Also false!
In other words, when you set x.a = 3; you're changing x's hash code. But x's location in the hash table is based on its old hash code, so x is basically lost now. See this in action at http://ideone.com/QQw08
Also, as svick notes, implementing Equals does not implement ==. If you don't implement ==, the == operator will provide a reference comparison, so:
var x = new Int16_2d { a = 1, b = 2 };
var y = new Int16_2d { a = 1, b = 2 };
Console.WriteLine(x.Equals(y)); //True
Console.WriteLine(x == y); //False
In conclusion, you're better off making this an immutable type; since it's only 4 bytes long, I'd probably make it an immutable struct.
You need to override GetHashCode(). The fact that it works with HashSet<T> is probably just a lucky coincidence.
Both collections use the hash code obtained from GetHashCode to find a bucket (ie. list of objects), where the object should be placed. Then it searches that bucket to find the object, and uses Equals to ensure equality. This is what gives the nice fast lookup properties of the Dictionary and HashSet. However, this also means, that if GetHashCode is not overridden so that it corresponds to the types Equals method, you will not be able to find such an object in one of the collections.
You should, almost always, implement both GetHashCode and Equals, or none of them.
You need to override GetHashCode as well for the dictionary to work.
You have to override GetHashCode() as well - this goes hand in hand with overriding Equals. Dictionary is using GetHashCode() to determine what bin a value would fall into - only if a suitable item is found in that bin it checks on actual equality of the items.
Referring to the question that I previously asked:
Compare two lists that contain a lot of objects
It is impressive to see how fast that comparison is maide by implementing the IEqualityComparer interface: example here
As I mentioned in my other question this comparison helps me to backup a sourse folder on a destination folder. Know I want to sync to folders therefore I need to compare the dates of the files. Whenever I do something like:
public class MyFileComparer2 : IEqualityComparer<MyFile>
{
public bool Equals(MyFile s, MyFile d)
{
return
s.compareName.Equals(d.compareName) &&
s.size == d.size &&
s.deepness == d.deepness &&
s.dateModified.Date <= d.dateModified.Date; // This line does not work.
// I also tried comparing the strings by converting it to a string and it does
// not work. It does not give me an error but it does not seem to include the files
// where s.dateModified.Date < d.dateModified.Date
}
public int GetHashCode(MyFile a)
{
int rt = (a.compareName.GetHashCode() * 251 + a.size.GetHashCode() * 251 + a.deepness.GetHashCode() + a.dateModified.Date.GetHashCode());
return rt;
}
}
It will be nice if I could do something similar using greater or equal than signs. I also tried using the tick property and it does not work. Maybe I am doing something wrong. I believe it is not possible to compare things with the less than equal sign implementing this interface. Moreover, I don't understand how this Class works; I just know it is impressive how fast it iterates through the whole list.
Your whole approach is fundementally flawed because your IEqualityComparer.Equals method is not symmetric. This means Equals(file1, file2) does not equal Equals(file2, file1) because of the way you are using the less than operator.
The documentation:
IEqualityComparer.Equals Method
clearly states:
Notes to Implementers
The Equals method is reflexive, symmetric, and transitive. That is, it returns true if used to compare an object with itself; true for two objects x and y if it is true for y and x; and true for two objects x and z if it is true for x and y and also true for y and z.
Implementations are required to ensure that if the Equals method returns true for two objects x and y, then the value returned by the GetHashCode method for x must equal the value returned for y.
Instead you need to use the IComparable interface or IEqualityComparer in combination with date comparisions. If you do not, things might seem to work for a while but you will regret it later.
Since the DateTime objects are different in the case when one DateTime is less than the other, you get different hashcodes for the objects s and d and the Equals method is not called. In order for your comparison of the dates to work, you should remove the date part from the GetHashCode method:
public int GetHashCode(MyFile a)
{
int rt = ((a.compareName.GetHashCode() * 251 + a.size.GetHashCode())
* 251 + a.deepness.GetHashCode()) *251;
return rt;
}
Your GetHashCode has a problem:
public int GetHashCode(MyFile a)
{
int rt = (((a.compareName.GetHashCode() * 251)
+ a.size.GetHashCode() * 251)
+ a.deepness.GetHashCode() *251)
+ a.dateModified.Date.GetHashCode();
return rt;
}
I changed the date part because I also needed the time therefore I use the ticks property instead. I got rid of the dateModified hashed code and it works great. here is how I modified my program. I was having trouble comparing the dates therefore I used the Ticks property.
public class MyFileComparer2 : IEqualityComparer<MyFile>
{
public bool Equals(MyFile s, MyFile d)
{
return
s.compareName.Equals(d.compareName) &&
s.size == d.size &&
s.deepness == d.deepness &&
//s.dateModified.Date <= d.dateModified.Date &&
s.dateModified.Ticks >= d.dateModified.Ticks
;
}
public int GetHashCode(MyFile a)
{
int rt = (((a.compareName.GetHashCode() * 251)
+ a.size.GetHashCode() * 251)
+ a.deepness.GetHashCode() * 251)
//+ a.dateModified.Ticks.GetHashCode();
;
return rt;
}
}
I still don't know how this hash code function works. The nice thing is that it works great.
What would be the best way to override the GetHashCode function for the case, when
my objects are considered equal if there is at least ONE field match in them.
In the case of generic Equals method the example might look like this:
public bool Equals(Whatever other)
{
if (ReferenceEquals(null, other)) return false;
if (ReferenceEquals(this, other)) return true;
// Considering that the values can't be 'null' here.
return other.Id.Equals(Id) || Equals(other.Money, Money) ||
Equals(other.Code, Code);
}
Still, I'm confused about making a good GetHashCode implementation for this case.
How should this be done?
Thank you.
This is a terrible definition of Equals because it is not transitive.
Consider
x = { Id = 1, Money = 0.1, Code = "X" }
y = { Id = 1, Money = 0.2, Code = "Y" }
z = { Id = 3, Money = 0.2, Code = "Z" }
Then x == y and y == z but x != z.
Additionally, we can establish that the only reasonable implementation of GetHashCode is a constant map.
Suppose that x and y are distinct objects. Let z be the object
z = { Id = x.Id, Money = y.Money, Code = "Z" }
Then x == z and y == z so that x.GetHashCode() == z.GetHashCode() and y.GetHashCode() == z.GetHashCode() establishing that x.GetHashCode() == y.GetHashCode(). Since x and y were arbitrary we have established that GetHashCode is constant.
Thus, we have shown that the only possible implementation of GetHashCode is
private readonly int constant = 17;
public override int GetHashCode() {
return constant;
}
All of this put together makes it clear that you need to rethink the concept you are trying model, and come up with a different definition of Equals.
I don't think you should be using Equals for this. People have a very explicit notion of what equals means, and if the Ids are different but the code or name are the same, I would not consider those "Equal". Maybe you need a different method like "IsCompatible".
If you want to be able to group them, you could use the extension method ToLookup() on a list of these objects, to use a predicate which would be your IsCompatible method. Then they would be grouped.
The golden rule is: if the objects compare equal, they must produce the same hash code.
Therefore a conforming (but let's say, undesirable) implementation would be
public override int GetHashCode()
{
return 0;
}
Frankly, if Id, Name and Code are independent of each other then I don't know if you can do any better. Putting objects of this type in a hash table is going to be painful.
Testing the Equals method is pretty much straight forward (as far as I know). But how on earth do you test the GetHashCode method?
Test that two distinct objects which are equal have the same hash code (for various values). Check that non-equal objects give different hash codes, varying one aspect/property at a time. While the hash codes don't have to be different, you'd be really unlucky to pick different values for properties which happen to give the same hash code unless you've got a bug.
Gallio/MbUnit v3.2 comes with convenient contract verifiers which are able to test your implementation of GetHashCode() and IEquatable<T>. More specifically you may be interested by the EqualityContract and the HashCodeAcceptanceContract. See here, here and there for more details.
public class Spot
{
private readonly int x;
private readonly int y;
public Spot(int x, int y)
{
this.x = x;
this.y = y;
}
public override int GetHashCode()
{
int h = -2128831035;
h = (h * 16777619) ^ x;
h = (h * 16777619) ^ y;
return h;
}
}
Then you declare your contract verifier like this:
[TestFixture]
public class SpotTest
{
[VerifyContract]
public readonly IContract HashCodeAcceptanceTests = new HashCodeAcceptanceContract<Spot>()
{
CollisionProbabilityLimit = CollisionProbability.VeryLow,
UniformDistributionQuality = UniformDistributionQuality.Excellent,
DistinctInstances = DataGenerators.Join(Enumerable.Range(0, 1000), Enumerable.Range(0, 1000)).Select(o => new Spot(o.First, o.Second))
};
}
It would be fairly similar to Equals(). You'd want to make sure two objects which were the "same" at least had the same hash code. That means if .Equals() returns true, the hash codes should be identical as well. As far as what the proper hashcode values are, that depends on how you're hashing.
From personal experience. Aside from obvious things like same objects giving you same hash codes, you need to create large enough array of unique objects and count unique hash codes among them. If unique hash codes make less than, say 50% of overall object count, then you are in trouble, as your hash function is not good.
List<int> hashList = new List<int>(testObjectList.Count);
for (int i = 0; i < testObjectList.Count; i++)
{
hashList.Add(testObjectList[i]);
}
hashList.Sort();
int differentValues = 0;
int curValue = hashList[0];
for (int i = 1; i < hashList.Count; i++)
{
if (hashList[i] != curValue)
{
differentValues++;
curValue = hashList[i];
}
}
Assert.Greater(differentValues, hashList.Count/2);
In addition to checking that object equality implies equality of hashcodes, and the distribution of hashes is fairly flat as suggested by Yann Trevin (if performance is a concern), you may also wish to consider what happens if you change a property of the object.
Suppose your object changes while it's in a dictionary/hashset. Do you want the Contains(object) to still be true? If so then your GetHashCode had better not depend on the mutable property that was changed.
I would pre-supply a known/expected hash and compare what the result of GetHashCode is.
You create separate instances with the same value and check that the GetHashCode for the instances returns the same value, and that repeated calls on the same instance returns the same value.
That is the only requirement for a hash code to work. To work well the hash codes should of course have a good distribution, but testing for that requires a lot of testing...
I have a class which looks like this:
public class NumericalRange:IEquatable<NumericalRange>
{
public double LowerLimit;
public double UpperLimit;
public NumericalRange(double lower, double upper)
{
LowerLimit = lower;
UpperLimit = upper;
}
public bool DoesLieInRange(double n)
{
if (LowerLimit <= n && n <= UpperLimit)
return true;
else
return false;
}
#region IEquatable<NumericalRange> Members
public bool Equals(NumericalRange other)
{
if (Double.IsNaN(this.LowerLimit)&& Double.IsNaN(other.LowerLimit))
{
if (Double.IsNaN(this.UpperLimit) && Double.IsNaN(other.UpperLimit))
{
return true;
}
}
if (this.LowerLimit == other.LowerLimit && this.UpperLimit == other.UpperLimit)
return true;
return false;
}
#endregion
}
This class holds a neumerical range of values. This class should also be able to hold a default range, where both LowerLimit and UpperLimit are equal to Double.NaN.
Now this class goes into a Dictionary
The Dictionary works fine for 'non-NaN' numerical range values, but when the Key is {NaN,NaN} NumericalRange Object, then the dictionary throws a KeyNotFoundException.
What am I doing wrong? Is there any other interface that I have to implement?
Based on your comment, you haven't implemented GetHashCode. I'm amazed that the class works at all in a dictionary, unless you're always requesting the identical key that you put in. I would suggest an implementation of something like:
public override int GetHashCode()
{
int hash = 17;
hash = hash * 23 + UpperLimit.GetHashCode();
hash = hash * 23 + LowerLimit.GetHashCode();
return hash;
}
That assumes Double.GetHashCode() gives a consistent value for NaN. There are many values of NaN of course, and you may want to special case it to make sure they all give the same hash.
You should also override the Equals method inherited from Object:
public override bool Equals(Object other)
{
return other != null &&
other.GetType() == GetType() &&
Equals((NumericalRange) other);
}
Note that the type check can be made more efficient by using as if you seal your class. Otherwise you'll get interesting asymmetries between x.Equals(y) and y.Equals(x) if someone derives another class from yours. Equality becomes tricky with inheritance.
You should also make your fields private, exposing them only as propertes. If this is going to be used as a key in a dictionary, I strongly recommend that you make them readonly, too. Changing the contents of a key when it's used in a dictionary is likely to lead to it being "unfindable" later.
The default implementation of the GetHashCode method uses the reference of the object rather than the values in the object. You have to use the same instance of the object as you used to put the data in the dictionary for that to work.
An implementation of GetHashCode that works simply creates a code from the hash codes of it's data members:
public int GetHashCode() {
return LowerLimit.GetHashCode() ^ UpperLimit.GetHashCode();
}
(This is the same implementation that the Point structure uses.)
Any implementation of the method that always returns the same hash code for any given parameter values works when used in a Dictionary. Just returning the same hash code for all values actually also works, but then the performance of the Dictionary gets bad (looking up a key becomes an O(n) operation instead of an O(1) operation. To give the best performance, the method should distribute the hash codes evenly within the range.
If your data is strongly biased, the above implementation might not give the best performance. If you for example have a lot of ranges where the lower and upper limits are the same, they will all get the hash code zero. In that case something like this might work better:
public int GetHashCode() {
return (LowerLimit.GetHashCode() * 251) ^ UpperLimit.GetHashCode();
}
You should consider making the class immutable, i.e. make it's properties read-only and only setting them in the constructor. If you change the properties of an object while it's in a Dictionary, it's hash code will change and you will not be able to access the object any more.