Why should I *not* override GetHashCode()? - c#

My search for a helper to correctly combine constituent hashcodes for GetHashCode() seemed to garner some hostility. I got the impression from the comments that some C# developers don't think you should override GetHashCode() often - certainly some commenters seemed to think that a library for helping get the behaviour right would be useless. Such functionality was considered useful enough in Java for the Java community to ask for it to be added to the JDK, and it's now in JDK 7.
Is there some fundamental reason that in C# you don't need to - or should definitely not - override GetHashCode() (and correspondingly, Equals()) as often as in Java? I find myself doing this often with Java, for example whenever I create a type that I know I want to keep in a HashSet or use as a key in a HashMap (equivalently, .net Dictionary).

C# has built-in value types which provide value equality, whereas Java does not. So writing your own hashcode in Java may be a necessity, whereas doing it in C# may be a premature optimisation.
It's common to write a type to use as a composite key in a Dictionary/HashMap. Often such types need value equality (equivalence) as opposed to reference equality (identity), for example:
IDictionary<Person, IList<Movie> > moviesByActor; // e.g. initialised from DB
// elsewhere...
Person p = new Person("Chuck", "Norris");
IList<Movie> chuckNorrisMovies = moviesByActor[p];
Here, if I need to create a new instance of Person to do the lookup, I need Person to implement value equality otherwise it won't match existing entries in the Dictionary as they have a different identity.
To get value equality, you need an overridden Equals() and GetHashCode(), in both languages.
C#'s structs (value types) implement value equality for you (albeit a potentially inefficient one), and provide a consistent implementation of GetHashCode. This may suffice for many people's needs and they won't go further to implement their own improved version unless performance problems dictate otherwise.
Java has no such built-in language feature. If you want to create a type with value equality semantics to use as a composite key, you must implement equals() and correspondingly hashCode() yourself. (There are third-party helpers and libraries to help you do this, but nothing built into the language itself).
I've described C# value types as 'potentially inefficient' for use in a Dictionary because:
The implementation of ValueType.Equals itself can sometimes be slow. This is used in Dictionary lookups.
The implementation of ValueType.GetHashCode, whilst correct, can yield many collisions leading to very poor Dictionary performance also. Have a look at this answer to a Q by Jon Skeet, which demonstrates that KeyValuePair<ushort, uint> appears to always yield the same hashCode!
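The built-in value equality of structs described above can be sketched as follows. This is a minimal illustration, not code from the answer: `ActorKey` and `StructKeyDemo` are hypothetical names, and the movie title is a placeholder.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical composite key. As a struct it inherits value-based
// Equals and GetHashCode from System.ValueType without any overrides.
public struct ActorKey
{
    public readonly string FirstName;
    public readonly string LastName;

    public ActorKey(string firstName, string lastName)
    {
        FirstName = firstName;
        LastName = lastName;
    }
}

public static class StructKeyDemo
{
    // Returns true when a freshly constructed key finds the existing entry.
    public static bool LookupWithNewInstance()
    {
        var moviesByActor = new Dictionary<ActorKey, List<string>>();
        moviesByActor[new ActorKey("Chuck", "Norris")] = new List<string> { "Movie A" };

        // A different instance with the same field values.
        var probe = new ActorKey("Chuck", "Norris");
        return moviesByActor.ContainsKey(probe);
    }
}
```

The lookup succeeds with no hand-written equality code, which is the convenience being described; the performance caveats listed above still apply to the default implementation.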

If your object represents a value or type, then you SHOULD override the GetHashCode() along with Equals. I never override hash codes for control classes, like "App". Though I see no reason why even overriding GetHashCode() in those circumstances would be a problem as they will never be put in a position to interfere with collection indexing or comparisons.
Example:
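public class ePoint : eViewModel, IEquatable<ePoint>
{
public double X;
public double Y;
// Methods
#region IEquatable Overrides
public override bool Equals(object obj)
{
if (Object.ReferenceEquals(obj, null)) { return false; }
if (Object.ReferenceEquals(this, obj)) { return true; }
if (!(obj is ePoint)) { return false; }
return Equals((ePoint)obj);
}
public bool Equals(ePoint other)
{
// Guard against null so that a direct call with null returns false instead of throwing.
if (Object.ReferenceEquals(other, null)) { return false; }
return X == other.X && Y == other.Y;
}
public override int GetHashCode()
{
// Combine both fields. Math.Pow(X, Y) collides heavily (for example, every point with X == 1 hashes to 1).
unchecked
{
int hash = 17;
hash = hash * 23 + X.GetHashCode();
hash = hash * 23 + Y.GetHashCode();
return hash;
}
}
#endregion
}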

I wrote a helper class to implement GetHashCode(), Equals(), and CompareTo() using value semantics from an array of properties.

Related

C# Immutability and Equality

I'm trying to create and use only immutable classes where all fields are readonly immutable types, though there may be additional fields which are mutable and not considered to be part of the object's state (mainly a cached hashcode).
When implementing IEquatable I do the same as I would for non-immutable objects, i.e.:
public bool Equals(MyImmutableType o) =>
object.Equals(this.x, o.x) && object.Equals(this.y, o.y);
For an immutable type this seems inefficient: the object will never change, so if I could calculate and store some unique fingerprint of it, I could simply compare fingerprints instead of whole fields (which may call their own Equals, etc.).
What would be a good solution for this? Would BinaryFormatter + MD5 be worth exploring?
Since you've already overridden Equals, you are required to also override GetHashCode. Remember, the fundamental rule of GetHashCode is that equal objects have equal hashes.
Therefore, you have overridden GetHashCode.
Since equal objects are required to have equal hash codes, you can implement Equals as:
public static bool Equals(M a, M b)
{
if (object.ReferenceEquals(a, b)) return true;
// If both of them are null, we're done, but maybe one is.
if (object.ReferenceEquals(null, a)) return false;
if (object.ReferenceEquals(null, b)) return false;
// Both are not null.
if (a.GetHashCode() != b.GetHashCode()) return false;
if (!object.Equals(a.x, b.x)) return false;
if (!object.Equals(a.y, b.y)) return false;
return true;
}
And now you can implement as many instance versions of Equals as you like by calling the static helper. Also overload == and != while you're at it.
That implementation takes as many early outs as possible. Of course, the worst-performing case is the case where we have value equality but not reference equality, but that's also the rarest case! In practice, most objects are unequal to each other, and most objects that are equal to each other are reference equal. In those 99% cases we get the right answer in four or fewer highly efficient comparisons.
If you are in a scenario where it is extremely common for there to be objects that are value equal but not reference equal, then solve the problem in the factory; memoize the factory!
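One way to read "memoize the factory" is sketched below. The names `InternedPoint` and `Create` are hypothetical, and the unbounded cache is an assumption made for brevity; the idea is that the factory returns the same instance for the same values, so value-equal objects are also reference-equal.

```csharp
using System;
using System.Collections.Concurrent;

// Hypothetical immutable type whose instances are interned by a memoized factory.
public sealed class InternedPoint
{
    // Cache keyed by the values that define equality. It is never trimmed here.
    private static readonly ConcurrentDictionary<(int X, int Y), InternedPoint> Cache =
        new ConcurrentDictionary<(int X, int Y), InternedPoint>();

    public int X { get; }
    public int Y { get; }

    // Private constructor: instances can only be obtained through Create.
    private InternedPoint(int x, int y)
    {
        X = x;
        Y = y;
    }

    // Returns the cached instance for (x, y), creating it on first use.
    public static InternedPoint Create(int x, int y) =>
        Cache.GetOrAdd((x, y), key => new InternedPoint(key.X, key.Y));
}
```

With this in place the inherited reference-based Equals and GetHashCode already behave like value equality, at the cost of keeping every created instance alive for as long as the cache exists.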

Should GetHashCode be implemented for IEquatable<T> on mutable types?

I'm implementing IEquatable<T>, and I am having difficulty finding consensus on the GetHashCode override on a mutable class.
The following resources all provide an implementation where GetHashCode would return different values during the object's lifetime if the object changes:
https://stackoverflow.com/a/13906125/197591
https://csharp.2000things.com/tag/iequatable/
http://broadcast.oreilly.com/2010/09/understanding-c-equality-iequa.html
However, this link states that GetHashCode should not be implemented for mutable types for the reason that it could cause undesirable behaviour if the object is part of a collection (and this has always been my understanding also).
Interestingly, the MSDN example implements the GetHashCode using only immutable properties which is in line with my understanding. But I'm confused as to why the other resources don't cover this. Are they simply wrong?
And if a type has no immutable properties at all, the compiler warns that GetHashCode is missing when I override Equals(object). In this case, should I implement it and just call base.GetHashCode() or just disable the compiler warning, or have I missed something and GetHashCode should always be overridden and implemented? In fact, if the advice is that GetHashCode should not be implemented for mutable types, why bother implementing for immutable types? Is it simply to reduce collisions compared to the default GetHashCode implementation, or does it actually add more tangible functionality?
To summarise my Question, my dilemma is that using GetHashCode on mutable objects means it can return different values during the lifetime of the object if properties on it change. But not using it means that the benefit of comparing objects that might be equivalent is lost because it will always return a unique value and thus collections will always fall back to using Equals for its operations.
Having typed this Question out, another Question popped up in the 'Similar Questions' box that seems to address the same topic. The answer there seems to be quite explicit in that only immutable properties should be used in a GetHashCode implementation. If there are none, then simply don't write one. Dictionary<TKey, TValue> will still function correctly albeit not at O(1) performance.
Mutable classes work quite badly with Dictionaries and other classes that rely on GetHashCode and Equals.
In the scenario you are describing, with mutable object, I suggest one of the following:
class ConstantHasCode: IEquatable<ConstantHasCode>
{
public int SomeVariable;
public virtual bool Equals(ConstantHasCode other)
{
return other.SomeVariable == SomeVariable;
}
public override int GetHashCode()
{
return 0;
}
}
or
class ThrowHasCode: IEquatable<ThrowHasCode>
{
public int SomeVariable;
public virtual bool Equals(ThrowHasCode other)
{
return other.SomeVariable == SomeVariable;
}
public override int GetHashCode()
{
throw new ApplicationException("this class does not support GetHashCode and should not be used as a key for a dictionary");
}
}
With the first, Dictionary works (almost) as expected, with a performance penalty in lookup and insertion: in both cases, Equals will be called for every element already in the dictionary until a comparison returns true. You are effectively reverting to the performance of a List.
The second is a way to tell the programmers who will use your class "no, you cannot use this within a dictionary".
Unfortunately, as far as I know there is no way to detect this at compile time, but it will fail the first time the code adds an element to the dictionary, very likely quite early during development. It is not the kind of bug that happens only in a production environment with an unpredicted set of inputs.
Last but not least, you can ignore the "mutable" problem and implement GetHashCode using member variables: now you have to be aware that you are not free to modify the object while it is used within a Dictionary. In some scenarios this can be acceptable, in others it is not.
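The first option (a constant hash code) can be checked with a small sketch. `MutableTag` and `ConstantHashDemo` are hypothetical names, and unlike the classes above this one also overrides Equals(object) so that the two overrides stay consistent.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical mutable type that opts out of hashing by returning a constant.
public sealed class MutableTag : IEquatable<MutableTag>
{
    public int Value;

    public bool Equals(MutableTag other) => other != null && other.Value == Value;

    public override bool Equals(object obj) => Equals(obj as MutableTag);

    // Every instance lands in the same bucket, so mutation cannot move it.
    public override int GetHashCode() => 0;
}

public static class ConstantHashDemo
{
    // Returns true if the instance is still found after being mutated inside the set.
    public static bool StillFoundAfterMutation()
    {
        var tag = new MutableTag { Value = 1 };
        var set = new HashSet<MutableTag> { tag };

        tag.Value = 2; // mutate while stored in the set

        return set.Contains(tag) && set.Contains(new MutableTag { Value = 2 });
    }
}
```

The lookup keeps working after the mutation because the hash never changes; the price is the linear scan over one bucket described above.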
It all depends on what kind of collection type you are talking about. For my answer I will assume you are talking about hash-table-based collections, and in particular I will address the .NET Dictionary and its key calculation.
The best way to see what happens if you modify a key (given that the key is a class with a custom hash code calculation) is to look at the .NET source. There we can see that the key/value pair is wrapped in an Entry struct, which carries the hash code that was calculated when the value was added. This means that if the key's hash code changes after the key was added, the dictionary will no longer be able to find the value.
Code to prove it:
static void Main()
{
var myKey = new MyKey { MyBusinessKey = "Ohai" };
var dic = new Dictionary<MyKey, int>();
dic.Add(myKey, 1);
Console.WriteLine(dic[myKey]);
myKey.MyBusinessKey = "Changing value";
Console.WriteLine(dic[myKey]); // throws KeyNotFoundException
}
public class MyKey
{
public string MyBusinessKey { get; set; }
public override int GetHashCode()
{
return MyBusinessKey.GetHashCode();
}
}
.NET source reference.
So, to answer your question: you want to base your hash code calculation on immutable values.
Another point: if you do not override GetHashCode, the hash code for a custom class will be based on the reference of the object. So the concern about different objects with identical underlying values returning different hash codes can be addressed by overriding GetHashCode and calculating the hash code from your business keys. For example, if you have two string properties, you could concatenate the strings and call string's GetHashCode method on the result. This guarantees that you get the same hash code for the same underlying values of the object.
After much discussion and reading other SO answers on the topic, it was eventually this ReSharper help page that summarised it very well for me:
MSDN documentation of the GetHashCode() method does not explicitly require that your override of this method returns a value that never changes during the object's lifetime. Specifically, it says:
The GetHashCode method for an object must consistently return the same hash code as long as there is no modification to the object state that determines the return value of the object's Equals method.
On the other hand, it says that the hash code should not change at least when your object is in a collection:
You can override GetHashCode for immutable reference types. In general, for mutable reference types, you should override GetHashCode only if:
You can compute the hash code from fields that are not mutable; or
You can ensure that the hash code of a mutable object does not change while the object is contained in a collection that relies on its hash code.
But why do you need to override GetHashCode() in the first place? Normally, you will do it if your object is going to be used in a Hashtable, as a key in a dictionary, etc., and it's quite hard to predict when your object will be added to a collection and how long it will be kept there.
With all that said, if you want to be on the safe side make sure that your override of GetHashCode() returns the same value during the object's lifetime. ReSharper will help you here by pointing at each non-readonly field or non-get-only property in your implementation of GetHashCode(). If possible, ReSharper will also suggest quick-fixes to make these members read-only/get-only.
Of course, it doesn't suggest what to do if the quick-fixes are not possible. However, it does indicate that those quick-fixes should only be used "if possible", which implies that the inspection could be suppressed. Gian Paolo's answer on this suggests throwing an exception, which would prevent the class from being used as a key and would show up early in development if it was inadvertently used as one.
However, GetHashCode is used in other circumstances such as when an instance of your object is passed as a parameter to a mock method setup. Therefore, the only viable option is to implement GetHashCode using the mutable values and put the onus on the rest of the code to ensure the object is not mutated while it is being used as a key, or to not use it as a key at all.

Do I have to implement GetHashCode() to follow best practice?

In my class I have the following override.
public override bool Equals(object input)
{
Occasion comparee = input as Occasion;
return comparee != null && comparee.Id == Id;
}
It's being used in the GUI to determine a combo box's pre-selected value and it works perfectly. However, R# nags and suggests that, if I have overridden Equals(object), then I should also override GetHashCode(). When I add the following code, it nags that the code is calling the base method.
public override int GetHashCode()
{
return base.GetHashCode();
}
I have to return something, so basically it wants me to implement a dummy method returning an arbitrary integer (it's not being used anywhere in the code as far as I can tell, because I get the same behavior regardless of whether I have it commented out or not).
Since best practice is to omit the code that isn't needed, I'm confused on what the appropriate design would be.
The MSDN page for GetHashCode states that
If you override the GetHashCode method, you should also override Equals, and vice versa. If your overridden Equals method returns true when two objects are tested for equality, your overridden GetHashCode method must return the same value for the two objects.
If you are overriding the equals so that it returns true when the object ids are equal, you must define your GetHashCode method to return the same hash code when given the same id.
The hashcode is used for insertion and lookup in collections, so it's unlikely to be used directly by any code you write, but it will be used by other .NET code. If you don't implement it correctly you may find performance issues or worse when dealing with collections of your object. For example, if you get it wrong or don't implement it, you might not be able to determine whether an instance of your class is in the collection, or you might not be able to retrieve the instance you put in.
In your case the method is simply:
public override int GetHashCode()
{
return Id.GetHashCode();
}
as you are just comparing the id's in the Equals method.
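To see why the override matters even when nothing in your own code calls it, here is a minimal sketch. It uses a simplified stand-in for the `Occasion` class from the question and assumes `Id` is an `int` (the question does not show its type); `OccasionDemo` is a hypothetical helper.

```csharp
using System;
using System.Collections.Generic;

// Simplified stand-in for the Occasion class from the question.
public sealed class Occasion
{
    public int Id { get; } // assumed to be an int

    public Occasion(int id)
    {
        Id = id;
    }

    public override bool Equals(object input)
    {
        Occasion comparee = input as Occasion;
        return comparee != null && comparee.Id == Id;
    }

    // Equal Ids give equal hash codes, matching Equals above.
    public override int GetHashCode()
    {
        return Id.GetHashCode();
    }
}

public static class OccasionDemo
{
    // Returns true when a distinct but equal instance is found in a hash-based set.
    public static bool FindsEqualInstance()
    {
        var set = new HashSet<Occasion> { new Occasion(42) };
        return set.Contains(new Occasion(42));
    }
}
```

Without the GetHashCode override, the two instances would normally get different identity-based hash codes and the set would usually fail to find the second one, even though Equals says they are equal.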

Overriding GetHashCode()

In this article, Jon Skeet mentioned that he usually uses this kind of algorithm for overriding GetHashCode().
public override int GetHashCode()
{
unchecked // Overflow is fine, just wrap
{
int hash = 17;
// Suitable nullity checks etc, of course :)
hash = hash * 23 + Id.GetHashCode();
return hash;
}
}
Now, I've tried using this, but Resharper tells me that the method GetHashCode() should be hashing using only read-only fields (it compiles fine, though). What would be a good practice, because right now I can't really have my fields to be read-only?
I tried generating this method by Resharper, here's the result.
public override int GetHashCode()
{
return base.GetHashCode();
}
This doesn't contribute much, to be honest...
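For reference, the pattern quoted above extends to several fields as sketched below. This assumes the fields can be made read-only, which is exactly what the question says is not possible in its case; `Customer` and its fields are hypothetical.

```csharp
using System;

// Hypothetical type with read-only fields, so the hash never changes after construction.
public sealed class Customer
{
    private readonly int _id;
    private readonly string _name; // may be null

    public Customer(int id, string name)
    {
        _id = id;
        _name = name;
    }

    public override bool Equals(object obj)
    {
        var other = obj as Customer;
        return other != null && other._id == _id && string.Equals(other._name, _name);
    }

    public override int GetHashCode()
    {
        unchecked // overflow is fine, just wrap
        {
            int hash = 17;
            hash = hash * 23 + _id.GetHashCode();
            // Nullity check: a null field contributes 0.
            hash = hash * 23 + (_name == null ? 0 : _name.GetHashCode());
            return hash;
        }
    }
}
```

Each field that takes part in Equals also takes part in the hash, which keeps the two methods consistent.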
If all your fields are mutable and you have to implement GetHashCode method, I am afraid this is the implementation you would need to have.
public override int GetHashCode()
{
return 1;
}
Yes, this is inefficient but this is at least correct.
The problem is that GetHashCode is being used by Dictionary and HashSet collections to place each item in a bucket. If hashcode is calculated based on some mutable fields and the fields are really changed after the object is placed into the HashSet or Dictionary, the object can no longer be found from the HashSet or Dictionary.
Note that with all the objects returning the same HashCode 1, this basically means all the objects are being put in the same bucket in the HashSet or Dictionary. So, there is always only one single bucket in the HashSet or Dictionary. When trying to lookup the object, it will do a equality check on each of the objects inside the only bucket. This is like doing a search in a linked list.
Somebody may argue that implementing the hashcode based on mutable fields can be fine if we can make sure the fields are never changed after the objects are added to a HashSet or Dictionary collection. My personal view is that this is error-prone. Somebody taking over your code two years later might not be aware of this and break the code accidentally.
Please note that your GetHashCode must go hand in hand with your Equals method. And if you can just use reference equality (when you'd never have two different instances of your class that can be equal) then you can safely use the Equals and GetHashCode that are inherited from Object. This would work much better than simply returning 1 from GetHashCode.
I personally tend to return a different numeric value for each implementation of GetHashCode() in a class which has no immutable fields. This means if I have a dictionary containing different implementing types, there is a chance the different instances of different types will be put in different buckets.
For example
public class A
{
// TODO Equals override
public override int GetHashCode()
{
return 21313;
}
}
public class B
{
// TODO Equals override
public override int GetHashCode()
{
return 35507;
}
}
Then if I have a Dictionary<object, TValue> containing instances of A, B and other types, the performance of the lookup will be better than if all the implementations of GetHashCode returned the same numeric value.
It should also be noted that I make use of prime numbers to get a better distribution.
As per the comments I have provided a LINQPad sample here which demonstrates the performance difference between using return 1 for different types and returning a different value for each type.

Why does C# not implement GetHashCode for Collections?

I am porting something from Java to C#. In Java the hashcode of an ArrayList depends on the items in it. In C# I always get the same hashcode from a List...
Why is this?
For some of my objects the hashcode needs to be different because the objects in their list property make the objects non-equal. I would expect that a hashcode is always unique for the object's state and only equals another hashcode when the object is equal. Am I wrong?
In order to work correctly, hashcodes must be immutable – an object's hash code must never change.
If an object's hashcode does change, any dictionaries containing the object will stop working.
Since collections are not immutable, they cannot implement GetHashCode.
Instead, they inherit the default GetHashCode, which returns a (hopefully) unique value for each instance of an object. (Typically based on a memory address)
Hashcodes must depend upon the definition of equality being used so that if A == B then A.GetHashCode() == B.GetHashCode() (but not necessarily the inverse; A.GetHashCode() == B.GetHashCode() does not entail A == B).
By default, the equality definition of a value type is based on its value, and that of a reference type is based on its identity (that is, by default an instance of a reference type is only equal to itself). Hence the default hashcode for a value type depends on the values of the fields it contains*, and for reference types it depends on the identity. Indeed, since we ideally want the hashcodes for non-equal objects to be different, particularly in the low-order bits (most likely to affect the value of a re-hashing), we generally want two equivalent but non-equal objects to have different hashes.
Since an object will remain equal to itself, it should also be clear that this default implementation of GetHashCode() will continue to have the same value, even when the object is mutated (identity does not mutate even for a mutable object).
Now, in some cases reference types (or value types) re-define equality. An example of this is string, where for example "ABC" == "AB" + "C". Though there are two different instances of string compared, they are considered equal. In this case GetHashCode() must be overridden so that the value relates to the state upon which equality is defined (in this case, the sequence of characters contained).
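The string case can be checked directly. This is a minimal sketch with a hypothetical `StringHashDemo` helper; it builds one of the strings from a char array so that the two operands are guaranteed to be separate instances.

```csharp
using System;

public static class StringHashDemo
{
    // Returns true when two distinct string instances with the same characters
    // are equal and report the same hash code.
    public static bool EqualStringsShareHash()
    {
        string literal = "ABC";
        string built = new string(new[] { 'A', 'B', 'C' }); // a separate instance

        bool distinctInstances = !object.ReferenceEquals(literal, built);
        bool equalByValue = literal == built;
        bool sameHash = literal.GetHashCode() == built.GetHashCode();

        return distinctInstances && equalByValue && sameHash;
    }
}
```

The actual hash value may differ between processes or runtime versions; only the agreement between equal strings within one process is relied upon here.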
While it is more common to do this with types that also are immutable, for a variety of reasons, GetHashCode() does not depend upon immutability. Rather, GetHashCode() must remain consistent in the face of mutability - change a value that we use in determining the hash, and the hash must change accordingly. Note though, that this is a problem if we are using this mutable object as a key into a structure using the hash, as mutating the object changes the position in which it should be stored, without moving it to that position (it's also true of any other case where the position of an object within a collection depends on its value - e.g. if we sort a list and then mutate one of the items in the list, the list is no longer sorted). However, this doesn't mean that we must only use immutable objects in dictionaries and hashsets. Rather it means that we must not mutate an object that is in such a structure, and making it immutable is a clear way to guarantee this.
Indeed, there are quite a few cases where storing mutable objects in such structures is desirable, and as long as we don't mutate them during this time, this is fine. Since we don't have the guarantee immutability brings, we then want to provide it another way (spending a short time in the collection and being accessible from only one thread, for example).
Hence mutability of key values is one of those cases where something is possible, but generally a bad idea. To the person defining the hashcode algorithm though, it's not for them to assume any such case will always be a bad idea (they don't even know the mutation happened while the object was stored in such a structure); it's for them to implement a hashcode defined on the current state of the object, whether calling it at a given point is good or not. Hence, for example, a hashcode should not be memoised on a mutable object unless the memoisation is cleared on every mutation. (It's generally a waste to memoise hashes anyway, as structures that hit the same object's hashcode repeatedly will have their own memoisation of it.)
Now, in the case in hand, ArrayList operates on the default case of equality being based on identity, e.g.:
ArrayList a = new ArrayList();
ArrayList b = new ArrayList();
for(int i = 0; i != 10; ++i)
{
a.Add(i);
b.Add(i);
}
return a == b;//returns false
Now, this is actually a good thing. Why? Well, how do you know in the above that we want to consider a as equal to b? We might, but there are plenty of good reasons for not doing so in other cases too.
What's more, it's much easier to redefine equality from identity-based to value-based than from value-based to identity-based. Finally, there is more than one value-based definition of equality for many objects (the classic case being the different views on what makes a string equal), so there isn't even a one-and-only definition that works. For example:
ArrayList c = new ArrayList();
for(short i = 0; i != 10; ++i)
{
c.Add(i);
}
If we considered a == b above, should we consider a == c also? The answer depends on just what we care about in the definition of equality we are using, so the framework couldn't know what the right answer is for all cases, since all cases don't agree.
Now, if we do care about value-based equality in a given case we have two very easy options. The first is to subclass and over-ride equality:
public class ValueEqualList : ArrayList, IEquatable<ValueEqualList>
{
/*.. most methods left out ..*/
public bool Equals(ValueEqualList other)//optional but a good idea almost always when we redefine equality
{
if(other == null)
return false;
if(ReferenceEquals(this, other))//identity still entails equality, so this is a good shortcut
return true;
if(Count != other.Count)
return false;
for(int i = 0; i != Count; ++i)
if(!object.Equals(this[i], other[i]))//items are typed as object, so != would compare boxed values by reference
return false;
return true;
}
public override bool Equals(object other)
{
return Equals(other as ValueEqualList);
}
public override int GetHashCode()
{
int res = 0x2D2816FE;
foreach(var item in this)
{
res = res * 31 + (item == null ? 0 : item.GetHashCode());
}
return res;
}
}
This assumes that we will always want to treat such lists this way. We can also implement an IEqualityComparer for a given case:
public class ArrayListEqComp : IEqualityComparer<ArrayList>
{//we might also implement the non-generic IEqualityComparer, omitted for brevity
public bool Equals(ArrayList x, ArrayList y)
{
if(ReferenceEquals(x, y))
return true;
if(x == null || y == null || x.Count != y.Count)
return false;
for(int i = 0; i != x.Count; ++i)
if(!object.Equals(x[i], y[i]))//items are typed as object, so != would compare boxed values by reference
return false;
return true;
}
public int GetHashCode(ArrayList obj)
{
int res = 0x2D2816FE;
foreach(var item in obj)
{
res = res * 31 + (item == null ? 0 : item.GetHashCode());
}
return res;
}
}
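A comparer like this is passed to the collection's constructor. The sketch below shows the idea with the generic List<int> rather than ArrayList (a swapped-in type chosen to avoid boxing); `IntListComparer` and `ComparerDemo` are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical comparer giving List<int> value-based, order-sensitive equality.
public sealed class IntListComparer : IEqualityComparer<List<int>>
{
    public bool Equals(List<int> x, List<int> y)
    {
        if (object.ReferenceEquals(x, y)) return true;
        if (x == null || y == null) return false;
        return x.SequenceEqual(y);
    }

    public int GetHashCode(List<int> obj)
    {
        if (obj == null) return 0;
        unchecked // overflow simply wraps
        {
            int res = 17;
            foreach (int item in obj)
            {
                res = res * 31 + item;
            }
            return res;
        }
    }
}

public static class ComparerDemo
{
    // Returns true when the set finds an equal list but not a reordered one.
    public static bool ValueLookup()
    {
        var set = new HashSet<List<int>>(new IntListComparer());
        set.Add(new List<int> { 1, 2, 3 });

        return set.Contains(new List<int> { 1, 2, 3 })
            && !set.Contains(new List<int> { 3, 2, 1 });
    }
}
```

The lists themselves keep their identity-based equality everywhere else; only this one set treats them by value, and the lists must not be mutated while they are stored in it.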
In summary:
The default equality definition of a reference type is dependent upon identity alone.
Most of the time, we want that.
When the person defining the class decides that this isn't what is wanted, they can override this behaviour.
When the person using the class wants a different definition of equality again, they can use IEqualityComparer<T> and IEqualityComparer so that their dictionaries, hashmaps, hashsets, etc. use their concept of equality.
It's disastrous to mutate an object while it is the key to a hash-based structure. Immutability can be used to ensure this doesn't happen, but is not compulsory, nor always desirable.
All in all, the framework gives us nice defaults and detailed override possibilities.
*There is a bug in the case of a decimal within a struct. A short-cut is used for structs in some cases when it is safe and not at other times, and while a struct containing a decimal is one case where the short-cut is not safe, it is incorrectly identified as a case where it is safe.
Yes, you are wrong. In both Java and C#, being equal implies having the same hash-code, but the converse is not (necessarily) true.
See GetHashCode for more information.
It is not possible for a hashcode to be unique across all variations of most non-trivial classes. In C# the concept of List equality is not the same as in Java (see here), so the hash code implementation is also not the same - it mirrors the C# List equality.
You're only partly wrong. You're definitely wrong when you think that equal hashcodes means equal objects, but equal objects must have equal hashcodes, which means that if the hashcodes differ, so do the objects.
The core reasons are performance and human nature - people tend to think about hashes as something fast but it normally requires traversing all elements of an object at least once.
Example: If you use a string as a key in a hash table every query has complexity O(|s|) - use 2x longer strings and it will cost you at least twice as much. Imagine that it was a full blown tree (just a list of lists) - oops :-)
If full, deep hash calculation was a standard operation on a collection, an enormous percentage of programmers would just use it unwittingly and then blame the framework and the virtual machine for being slow. For something as expensive as full traversal it is crucial that a programmer is aware of the complexity. The only way to achieve that is to make sure that you have to write your own. It's a good deterrent as well :-)
Another reason is updating tactics. Calculating and updating a hash on the fly vs. doing the full calculation every time requires a judgement call depending on concrete case in hand.
Immutability is just an academic cop out - people do hashes as a way of detecting a change faster (file hashes for example) and also use hashes for complex structures which change all the time. Hashes have many more uses beyond the 101 basics. The key is again that what to use for a hash of a complex object has to be a judgement call on a case by case basis.
Using object's address (actually a handle so it doesn't change after GC) as a hash is actually the case where the hash value remains the same for arbitrary mutable object :-) The reason C# does it is that it's cheap and again nudges people to calculate their own.
"Why" is too philosophical. Create a helper method (maybe an extension method) and calculate the hashcode as you like, perhaps by XORing the elements' hashcodes.
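Such a helper could look like the sketch below. `HashHelpers`, `XorHash` and `OrderedHash` are hypothetical names. Note that plain XOR ignores element order and cancels out repeated elements, so an order-sensitive variant is shown alongside it.

```csharp
using System;
using System.Collections.Generic;

public static class HashHelpers
{
    // XOR of the element hash codes: cheap, but order-insensitive,
    // and two identical elements cancel each other out.
    public static int XorHash<T>(this IEnumerable<T> items)
    {
        int hash = 0;
        foreach (T item in items)
        {
            hash ^= item == null ? 0 : item.GetHashCode();
        }
        return hash;
    }

    // Multiply-and-add variant: the position of each element affects the result.
    public static int OrderedHash<T>(this IEnumerable<T> items)
    {
        unchecked // overflow simply wraps
        {
            int hash = 17;
            foreach (T item in items)
            {
                hash = hash * 31 + (item == null ? 0 : item.GetHashCode());
            }
            return hash;
        }
    }
}
```

Which variant fits depends on the equality being modelled: XorHash suits set-like equality where order does not matter, while OrderedHash suits list-like equality.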
