What does .NET Dictionary<T,T> use to hash a reference? - c#

So I'm thinking of using a reference type as a key to a .NET Dictionary...
Example:
class MyObj
{
private int mID;
public MyObj(int id)
{
this.mID = id;
}
}
// whatever code here
static void Main(string[] args)
{
Dictionary<MyObj, string> dictionary = new Dictionary<MyObj, string>();
}
My question is, how is the hash generated for custom objects (ie not int, string, bool etc)? I ask because the objects I'm using as keys may change before I need to look up stuff in the Dictionary again. If the hash is generated from the object's address, then I'm probably fine... but if it is generated from some combination of the object's member variables then I'm in trouble.
EDIT:
I should've originally made it clear that I don't care about the equality of the objects in this case... I was merely looking for a fast lookup (I wanted to do a 1-1 association without changing the code of the classes involved).
Thanks

The default implementation of GetHashCode/Equals basically deals with identity. You'll always get the same hash back from the same object, and it'll probably be different to other objects (very high probability!).
In other words, if you just want reference identity, you're fine. If you want to use the dictionary treating the keys as values (i.e. using the data within the object, rather than just the object reference itself, to determine the notion of equality) then it's a bad idea to mutate any of the equality-sensitive data within the key after adding it to the dictionary.
The MSDN documentation for object.GetHashCode is a little bit overly scary - basically you shouldn't use it for persistent hashes (i.e. saved between process invocations) but it will be consistent for the same object, which is all that's required for it to be a valid hash for a dictionary. While it's not guaranteed to be unique, I don't think you'll run into enough collisions to cause a problem.
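A minimal sketch of the identity behaviour described above. PlainKey and IdentityKeyDemo are hypothetical names invented for this illustration; the only assumption is that the key type overrides neither Equals nor GetHashCode.

```csharp
using System.Collections.Generic;

// Hypothetical key type: no Equals/GetHashCode overrides, so identity semantics apply.
public class PlainKey
{
    public int Id;   // deliberately mutable
    public PlainKey(int id) { Id = id; }
}

public static class IdentityKeyDemo
{
    // True if the same instance is still found after one of its fields changes.
    public static bool FoundAfterMutation()
    {
        var key = new PlainKey(1);
        var map = new Dictionary<PlainKey, string>();
        map[key] = "first";
        int before = key.GetHashCode();
        key.Id = 42;   // field changes do not affect the default hash
        return before == key.GetHashCode() && map.ContainsKey(key);
    }

    // True if a different instance with the same field value is treated as the same key.
    public static bool LookAlikeFound()
    {
        var map = new Dictionary<PlainKey, string>();
        map[new PlainKey(1)] = "first";
        return map.ContainsKey(new PlainKey(1));   // default Equals compares references
    }
}
```

FoundAfterMutation() should return true and LookAlikeFound() should return false, which is exactly the 1-1 association by reference that the question asks for.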

The hash used is the return value of the GetHashCode method on the object. By default this is essentially a value representing the reference. It is not guaranteed to be unique for an object, and in fact likely won't be in many situations. But the value for a particular reference will not change over the lifetime of the object even if you mutate it. So for this particular sample you will be OK.
In general though, it is a very bad idea to use objects which are not immutable as keys to a Dictionary. It's way too easy to fall into a trap where you override Equals and GetHashCode on an object and break code where the type was formerly used as a key in a Dictionary.

The dictionary will use the GetHashCode method defined on System.Object, which will not change over the object's lifetime regardless of field changes etc. So you won't encounter problems in the scenario you describe.
You can override GetHashCode, and should do so if you override Equals so that objects which are equal also return the same hash code. However, if the type is mutable then you must be aware that if you use the object as a key of a dictionary you will not be able to find it again if it is subsequently altered.

The default implementation of the GetHashCode method does not guarantee unique return values for different objects. Furthermore, the .NET Framework does not guarantee the default implementation of the GetHashCode method, and the value it returns will be the same between different versions of the .NET Framework. Consequently, the default implementation of this method must not be used as a unique object identifier for hashing purposes.
http://msdn.microsoft.com/en-us/library/system.object.gethashcode.aspx

For a custom object (derived from Object, with nothing overridden), the hash code is fixed for the instance and remains the same throughout its lifetime. Furthermore, the hash code does not change when the values of internal fields or properties change.
Hashtable/Dictionary does not use GetHashCode as a unique identifier; it only uses it to choose a "hash bucket". For example, the strings "aaa123" and "aaa456" might both hash to "aaa", in which case all objects with the hash "aaa" are stored in one bucket. Whenever you insert or retrieve an object, the Dictionary calls GetHashCode to determine the bucket and then compares the individual objects within that bucket (by reference, unless Equals is overridden).
A custom object used as a Dictionary key should be thought of this way: the Dictionary only stores its reference (address or memory pointer) and knows nothing about its contents. The contents of an object can change, but its reference never does. This also means that if two objects are exact replicas of each other but are different instances in memory, your hashtable will not consider them the same, because their references are different.
If that is a problem and you need two instances with the same ID to be treated as the same key, override Equals, together with GetHashCode so that equal objects return equal hash codes, as follows.
class MyObj
{
    private readonly int mID;
    public MyObj(int id)
    {
        this.mID = id;
    }
    public override bool Equals(object obj)
    {
        MyObj mobj = obj as MyObj;
        if (mobj == null)
            return false;
        return this.mID == mobj.mID;
    }
    // Equal objects must return equal hash codes.
    public override int GetHashCode()
    {
        return this.mID;
    }
}
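To show what value-based keys buy you, here is a small self-contained sketch. IdKey and ValueKeyDemo are hypothetical names; the assumption is that Equals and GetHashCode are both overridden and both depend only on a read-only field.

```csharp
using System.Collections.Generic;

// Hypothetical key type compared by value rather than by reference.
public sealed class IdKey
{
    private readonly int id;
    public IdKey(int id) { this.id = id; }

    public override bool Equals(object obj)
    {
        var other = obj as IdKey;
        return other != null && other.id == id;
    }

    // Must agree with Equals: equal ids give equal hash codes.
    public override int GetHashCode() { return id; }
}

public static class ValueKeyDemo
{
    // Looks the entry up through a second, distinct instance with the same id.
    public static string Lookup()
    {
        var map = new Dictionary<IdKey, string>();
        map.Add(new IdKey(7), "seven");
        return map[new IdKey(7)];
    }
}
```

Lookup() finds the entry even though the key passed to the indexer is a different object from the one that was added. Without the GetHashCode override the two instances would most likely land in different buckets and the lookup would throw.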

Related

Is GetHashCode guaranteed to be the same for the lifetime of an object?

I hate to beat a dead horse. In Eric Lippert's blog, he states:
the hash value of an object is the same for its entire lifetime
Then follows up with:
However, this is only an ideal-situation guideline
So, my question is this.
For a POCO (nothing overridden) or a framework (like FileInfo or Process) object, is the return value of the GetHashCode() method guaranteed to be the same during its lifetime?
P.S. I am talking about pre-allocated objects. var foo = new Bar(); Will foo.GetHashCode() always return the same value?
If you look at the MSDN documentation you will find the following remarks about the default behavior of the GetHashCode method:
If GetHashCode is not overridden, hash codes for reference types are computed by calling the Object.GetHashCode method of the base class, which computes a hash code based on an object's reference; for more information, see RuntimeHelpers.GetHashCode. In other words, two objects for which the ReferenceEquals method returns true have identical hash codes. If value types do not override GetHashCode, the ValueType.GetHashCode method of the base class uses reflection to compute the hash code based on the values of the type's fields. In other words, value types whose fields have equal values have equal hash codes
Based on my understanding we can assume that:
for a reference type (which doesn't override Object.GetHashCode) the value of the hash code of a given instance is guaranteed to be the same for the entire lifetime of the instance (because the memory address at which the object is stored won't change during its lifetime)
for a value type (which doesn't override Object.GetHashCode) it depends: if the value type is immutable then the hash code won't change during its lifetime. If, otherwise, the value of its fields can be changed after its creation then its hash code will change too. Please note that value types are generally immutable.
IMPORTANT EDIT
As pointed out in one comment above the .NET garbage collector can decide to move the physical location of an object in memory during the object lifetime, in other words an object can be "relocated" inside the managed memory.
This makes sense because the garbage collector is in charge of managing the memory allocated when objects are created.
After some searching, and according to this Stack Overflow question (read the comments provided by the user supercat), it seems that this relocation does not change the hash code of an object instance during its lifetime, because the hash code is calculated once (the first time that its value is requested) and the computed value is saved and reused later (when the hash code value is requested again).
To summarize, based on my understanding, the only thing you can assume is that given two references pointing to the same object in memory, their hash codes will always be identical. In other words, if Object.ReferenceEquals(a, b) then a.GetHashCode() == b.GetHashCode(). Furthermore, it seems that the hash code of a given object instance stays the same for its entire lifetime, even if the physical memory address of the object is changed by the garbage collector.
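The summary above can be checked with a short sketch. StableHashDemo is a hypothetical name; GC.Collect only requests a collection and does not prove that the object was actually relocated, so treat this as an illustration rather than a proof.

```csharp
using System;
using System.Runtime.CompilerServices;

public static class StableHashDemo
{
    // True if the default hash code is unchanged after a requested garbage collection
    // and matches what RuntimeHelpers.GetHashCode reports for the same instance.
    public static bool HashSurvivesGc()
    {
        var o = new object();            // GetHashCode is not overridden here
        int before = o.GetHashCode();

        GC.Collect();                    // may or may not move the object
        GC.WaitForPendingFinalizers();
        GC.Collect();

        int after = o.GetHashCode();
        bool same = before == after && after == RuntimeHelpers.GetHashCode(o);
        GC.KeepAlive(o);
        return same;
    }
}
```

HashSurvivesGc() is expected to return true: the value is computed on first request and then reused, independent of where the object currently lives in memory.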
SIDENOTE ON HASH CODES USAGE
It is important to always remember that the hash code was introduced in the .NET framework for the sole purpose of handling the hash table data structure.
In order to determine the bucket to be used for a given value, the corresponding key is taken and its hash code is computed (to be precise, the bucket index is obtained by applying some normalizations on the value returned by the GetHashCode call, but the details are not important for this discussion). Put another way, the hash function used in the .NET implementation of hash tables is based on the computation of the hash code of the key.
This means that the only safe usage for a hash code is balancing a hash table, as pointed out by Eric Lippert here, so don't write code which depends on hash code values for any other purpose.
There are three cases.
A class which does not override GetHashCode
A struct which does not override GetHashCode
A class or struct which does override GetHashCode
If a class does not override GetHashCode, then the return value of the helper function RuntimeHelpers.GetHashCode is used. This will return the same value each time it's called for the same object, so an object will always have the same hash code. Note that this hash code is specific to a single AppDomain - restarting your application, or creating another AppDomain, will probably result in your object getting a different hash code.
If a struct does not override GetHashCode, then the hash code is generated based on the hash code of one of its members. Of course, if your struct is mutable, then that member can change over time, and so the hash code can change over time. Even if the struct is immutable, that member could itself be mutated, and could return different hash codes.
If a class or struct does override GetHashCode, then all bets are off. Someone could implement GetHashCode by returning a random number - that's a bit of a silly thing to do, but it's perfectly possible. More likely, the object could be mutable, and its hash code could be based off its members, both of which can change over time.
It's generally a bad idea to implement GetHashCode for objects which are mutable, or in a way where the hash code can change over time (in a given AppDomain). Many of the assumptions made by classes like Dictionary<TKey, TValue> break down in this case, and you will probably see strange behaviour.
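For the struct case, a hedged sketch follows. PointValue and StructHashDemo are hypothetical names. The exact fields that feed the default struct hash are an implementation detail, so the code only relies on the documented contract (values that are Equals share a hash code) and does not assert that the hash changes after a mutation, even though it typically can.

```csharp
// Hypothetical mutable struct that does not override Equals or GetHashCode.
public struct PointValue
{
    public int X;
    public int Y;
}

public static class StructHashDemo
{
    // Two values with equal fields are equal and therefore share a hash code.
    public static bool EqualValuesShareHash()
    {
        var a = new PointValue { X = 1, Y = 2 };
        var b = new PointValue { X = 1, Y = 2 };
        return a.Equals(b) && a.GetHashCode() == b.GetHashCode();
    }

    // After a field changes, the copy is no longer equal to the original,
    // so nothing requires its hash code to stay the same.
    public static bool MutationBreaksEquality()
    {
        var a = new PointValue { X = 1, Y = 2 };
        var b = a;      // structs are copied on assignment
        b.X = 99;
        return !a.Equals(b);
    }
}
```

Both methods should return true. The practical consequence is the one stated above: a mutable struct (or class) whose hash follows its fields should not be changed while it is stored as a key.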

Should GetHashCode be implemented for IEquatable<T> on mutable types?

I'm implementing IEquatable<T>, and I am having difficulty finding consensus on the GetHashCode override on a mutable class.
The following resources all provide an implementation where GetHashCode would return different values during the object's lifetime if the object changes:
https://stackoverflow.com/a/13906125/197591
https://csharp.2000things.com/tag/iequatable/
http://broadcast.oreilly.com/2010/09/understanding-c-equality-iequa.html
However, this link states that GetHashCode should not be implemented for mutable types for the reason that it could cause undesirable behaviour if the object is part of a collection (and this has always been my understanding also).
Interestingly, the MSDN example implements the GetHashCode using only immutable properties which is in line with my understanding. But I'm confused as to why the other resources don't cover this. Are they simply wrong?
And if a type has no immutable properties at all, the compiler warns that GetHashCode is missing when I override Equals(object). In this case, should I implement it and just call base.GetHashCode() or just disable the compiler warning, or have I missed something and GetHashCode should always be overridden and implemented? In fact, if the advice is that GetHashCode should not be implemented for mutable types, why bother implementing for immutable types? Is it simply to reduce collisions compared to the default GetHashCode implementation, or does it actually add more tangible functionality?
To summarise my Question, my dilemma is that using GetHashCode on mutable objects means it can return different values during the lifetime of the object if properties on it change. But not using it means that the benefit of comparing objects that might be equivalent is lost because it will always return a unique value and thus collections will always fall back to using Equals for its operations.
Having typed this Question out, another Question popped up in the 'Similar Questions' box that seems to address the same topic. The answer there seems to be quite explicit in that only immutable properties should be used in a GetHashCode implementation. If there are none, then simply don't write one. Dictionary<TKey, TValue> will still function correctly albeit not at O(1) performance.
Mutable classes work quite badly with Dictionaries and other classes that rely on GetHashCode and Equals.
In the scenario you are describing, with mutable object, I suggest one of the following:
class ConstantHashCode : IEquatable<ConstantHashCode>
{
    public int SomeVariable;
    public virtual bool Equals(ConstantHashCode other)
    {
        return other != null && other.SomeVariable == SomeVariable;
    }
    // Every instance lands in the same bucket.
    public override int GetHashCode()
    {
        return 0;
    }
}
or
class ThrowHashCode : IEquatable<ThrowHashCode>
{
    public int SomeVariable;
    public virtual bool Equals(ThrowHashCode other)
    {
        return other != null && other.SomeVariable == SomeVariable;
    }
    // Fails fast if the type is ever used as a hash key.
    public override int GetHashCode()
    {
        throw new ApplicationException("this class does not support GetHashCode and should not be used as a key for a dictionary");
    }
}
With the first, Dictionary works (almost) as expected, with a performance penalty on lookup and insertion: in both cases, Equals is called for every element already in the dictionary until a comparison returns true. You are effectively reverting to the performance of a List.
The second is a way to tell the programmers who will use your class "no, you cannot use this within a dictionary".
Unfortunately, as far as I know there is no way to detect this at compile time, but it will fail the first time the code adds an element to the dictionary, very likely quite early during development. It is not the kind of bug that happens only in a production environment with an unpredicted set of inputs.
Last but not least, you can ignore the "mutable" problem and implement GetHashCode using member variables: now you have to be aware that you are not free to modify the object while it is used within a Dictionary. In some scenarios this can be acceptable, in others it is not.
It all depends on what kind of collection type you are talking about. For my answer I will assume you are talking about hash-table-based collections, and in particular I will address the .NET Dictionary and its key calculation.
The best way to find out what will happen if you modify a key (given that your key is a class which does a custom hash code calculation) is to look at the .NET source. There we can see that your key/value pair is wrapped in an Entry struct which carries the hash code that was calculated when the value was added. This means that if the key's hash code changes after the key was added, the dictionary will no longer be able to find the value.
Code to prove it:
static void Main()
{
var myKey = new MyKey { MyBusinessKey = "Ohai" };
var dic = new Dictionary<MyKey, int>();
dic.Add(myKey, 1);
Console.WriteLine(dic[myKey]);
myKey.MyBusinessKey = "Changing value";
Console.WriteLine(dic[myKey]); // throws KeyNotFoundException
}
public class MyKey
{
public string MyBusinessKey { get; set; }
public override int GetHashCode()
{
return MyBusinessKey.GetHashCode();
}
}
.NET source reference.
So, to answer your question: you want to base your hash code calculation on immutable values.
Another point: if you do not override GetHashCode, the hash code for a custom class is based on the object's reference. If you need objects with identical underlying values to return the same hash code, override GetHashCode and calculate the hash code from your business keys. For example, with two string properties you could concatenate the strings and call GetHashCode on the result. This guarantees the same hash code for the same underlying values of the object.
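As a sketch of hashing two business keys, the code below swaps the string concatenation described above for a multiply-and-add combination, which avoids allocating a temporary string and tends to give ("ab", "c") and ("a", "bc") different hash codes. PersonKey and the constants 17 and 31 are illustrative choices, not something taken from the answer.

```csharp
// Hypothetical immutable key built from two string business keys.
public sealed class PersonKey
{
    private readonly string first;
    private readonly string last;

    public PersonKey(string first, string last)
    {
        this.first = first;
        this.last = last;
    }

    public override bool Equals(object obj)
    {
        var o = obj as PersonKey;
        return o != null && first == o.first && last == o.last;
    }

    public override int GetHashCode()
    {
        unchecked   // integer overflow is fine for hash codes
        {
            int hash = 17;
            hash = hash * 31 + (first == null ? 0 : first.GetHashCode());
            hash = hash * 31 + (last == null ? 0 : last.GetHashCode());
            return hash;
        }
    }
}
```

Two PersonKey instances built from the same pair of strings compare equal and return the same hash code within a process. Because both fields are read-only, the hash cannot change while the key sits in a dictionary.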
After much discussion and reading other SO answers on the topic, it was eventually this ReSharper help page that summarised it very well for me:
MSDN documentation of the GetHashCode() method does not explicitly require that your override of this method returns a value that never changes during the object's lifetime. Specifically, it says:
The GetHashCode method for an object must consistently return the same hash code as long as there is no modification to the object state that determines the return value of the object's Equals method.
On the other hand, it says that the hash code should not change at least when your object is in a collection:
You can override GetHashCode for immutable reference types. In general, for mutable reference types, you should override GetHashCode only if:
You can compute the hash code from fields that are not mutable; or
You can ensure that the hash code of a mutable object does not change while the object is contained in a collection that relies on its hash code.
But why do you need to override GetHashCode() in the first place? Normally, you will do it if your object is going to be used in a Hashtable, as a key in a dictionary, etc., and it's quite hard to predict when your object will be added to a collection and how long it will be kept there.
With all that said, if you want to be on the safe side make sure that your override of GetHashCode() returns the same value during the object's lifetime. ReSharper will help you here by pointing at each non-readonly field or non-get-only property in your implementation of GetHashCode(). If possible, ReSharper will also suggest quick-fixes to make these members read-only/get-only.
Of course, it doesn't suggest what to do if the quick-fixes are not possible. However, it does indicate that those quick-fixes should only be used "if possible", which implies that the inspection could be suppressed. Gian Paolo's answer on this suggests throwing an exception, which would prevent the class from being used as a key and would show up early in development if it was inadvertently used as a key.
However, GetHashCode is used in other circumstances such as when an instance of your object is passed as a parameter to a mock method setup. Therefore, the only viable option is to implement GetHashCode using the mutable values and put the onus on the rest of the code to ensure the object is not mutated while it is being used as a key, or to not use it as a key at all.
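For types that do have at least one immutable member, the first option in the quoted guidance (compute the hash code from fields that are not mutable) can be sketched as below. Account and ImmutableHashDemo are hypothetical names. Equality still looks at the mutable property, which is allowed as long as equal objects always share a hash code.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical type: the hash depends only on the read-only id.
public sealed class Account : IEquatable<Account>
{
    private readonly int id;                   // immutable: the only input to the hash
    public string DisplayName { get; set; }    // mutable: takes part in Equals only

    public Account(int id, string displayName)
    {
        this.id = id;
        DisplayName = displayName;
    }

    public bool Equals(Account other)
    {
        return other != null && id == other.id && DisplayName == other.DisplayName;
    }

    public override bool Equals(object obj) { return Equals(obj as Account); }

    // Equal accounts have equal ids, so they also have equal hash codes.
    public override int GetHashCode() { return id; }
}

public static class ImmutableHashDemo
{
    // True if the instance is still found in the set after its mutable property changes.
    public static bool StillFoundAfterRename()
    {
        var acct = new Account(1, "old");
        var set = new HashSet<Account> { acct };
        acct.DisplayName = "new";
        return set.Contains(acct);
    }
}
```

StillFoundAfterRename() should return true because the bucket is chosen from the id alone. The trade-off is more collisions when many objects share an id, and uniqueness within the set can still be broken by mutating DisplayName.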

When implementing IEqualityComparer<T>.GetHashCode(T obj), can I use the current instance's state, or do I have to use obj?

How come when I implement IEqualityComparer, it has a parameter for GetHashCode(T obj)? It's not a static object of course, so why can't I just use the current instance's state to generate the hash code? Is this == obj?
I'm curious because I'm trying to do this:
public abstract class BaseClass : IEqualityComparer<BaseClass>
{
public abstract int GetHashCode(BaseClass obj);
}
public class DerivedClass : BaseClass
{
public int MyData;
public override int GetHashCode(BaseClass obj)
{
return MyData.GetHashCode();
// Or do I have to do this:
// return ((DerivedClass)obj).MyData.GetHashCode();
}
}
I'm trying to prevent doing the cast, since it's being used in really high-performance code.
I think the main issue here is that you're confusing IEqualityComparer<T> with IEquatable<T>.
IEquatable<T> defines a method for determining if the current instance (this) is equal to an instance of the same type. In other words it's used for testing objA.Equals(objB). When implementing this interface, it is recommended that you also override the GetHashCode() instance method.
IEqualityComparer<T> defines methods for testing whether two objects of the given type are equal; in other words, it's for testing comparer.Equals(objA, objB). Hence the necessity to provide an object as a parameter to GetHashCode (which, remember, is different from the GetHashCode that it inherits from object).
You can think of IEquatable<T> as your object's way of saying, "this is how I know if I am equal to something else," and IEqualityComparer<T> as your object's way of saying, "this is how I know if two other things are equal".
For some good examples of how these two interfaces are used in the framework see:
String which implements IEquatable<string>
StringComparer which implements IEqualityComparer<string>
Should you use the current state of an IEqualityComparer<T> to determine the hash code? If the state is at all mutable, then no! Anywhere where the hash is used (e.g. HashSet<T> or Dictionary<T, V>) the hash code will be cached and used for efficient lookup. If that hash code can change because the state of the comparer changes, that would totally destroy the usefulness of the data structure storing the hash. Now, if the state is not mutable (i.e. it's set only when creating the comparer and cannot be modified throughout the lifetime of the comparer), then yes, you can, but I would still recommend against it, unless you have a really good reason.
Finally, you mentioned performance. Honestly, this sounds like premature optimization. I'd recommend not worrying so much about performance until you can be sure that this particular line of code is causing a problem.
If you are not using information from the passed-in obj argument, your hash code will not vary for different incoming objects and will not be useful. The comparer is not an instance of the object you want to get a hash code for or compare to.
You can indeed use local fields of the comparer in GetHashCode, and can even return MyData as the hash code as shown in your sample - it will still satisfy the GetHashCode requirement to "return the same value for the same object". But in your sample all hash codes will be the same for a given comparer instance, and hence using it for a Dictionary will essentially turn the dictionary into a list.
The same applies to Equals call - indeed you can return true all the time, but how useful it will be?
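A minimal sketch of a comparer that is separate from the objects it compares and derives the hash from the obj argument. Item, ItemByDataComparer and ComparerDemo are hypothetical names standing in for the classes in the question.

```csharp
using System.Collections.Generic;

// Hypothetical data class; it knows nothing about equality.
public sealed class Item
{
    public int MyData;
    public Item(int myData) { MyData = myData; }
}

// The comparer is stateless and works only with the objects passed to it.
public sealed class ItemByDataComparer : IEqualityComparer<Item>
{
    public bool Equals(Item x, Item y)
    {
        if (ReferenceEquals(x, y)) return true;
        if (x == null || y == null) return false;
        return x.MyData == y.MyData;
    }

    public int GetHashCode(Item obj)
    {
        return obj == null ? 0 : obj.MyData.GetHashCode();
    }
}

public static class ComparerDemo
{
    // Adds three items, two of which carry the same data, and returns the set size.
    public static int DistinctCount()
    {
        var set = new HashSet<Item>(new ItemByDataComparer());
        set.Add(new Item(1));
        set.Add(new Item(1));   // rejected: equal to the first item per the comparer
        set.Add(new Item(2));
        return set.Count;
    }
}
```

DistinctCount() returns 2. Because the comparer takes an Item directly, no cast is needed in GetHashCode, which addresses the performance concern in the question.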

How to write a good GetHashCode() implementation for a class that is compared by value?

Let's say we have such a class:
class MyClass
{
public string SomeValue { get; set; }
// ...
}
Now, let's say two MyClass instances are equal when their SomeValue property is equal. Thus, I override the Object.Equals() and the Object.GetHashCode() methods to represent that. Object.GetHashCode() returns SomeValue.GetHashCode(). But at the same time I need to follow these rules:
If two instances of an object are equal, they should return the same hash code.
The hash code should not change throughout the runtime.
But apparently, SomeValue can change, and the hash code we did get before may turn to be invalid.
I can only think of making the class immutable, but I'd like to know what others do in this case.
What do you do in such cases? Does having such a class point to a subtler problem in the design decisions?
The general contract says that if A.Equals(B) is true, then their hash codes must be the same. If SomeValue changes in A in such a way that A.Equals(B) is no longer true, then A.GetHashCode() can return a different value than before. Mutable objects cannot cache the result of GetHashCode(); it must be calculated every time the method is called.
This article has detailed guidelines for GetHashCode and mutability:
http://ericlippert.com/2011/02/28/guidelines-and-rules-for-gethashcode/
If your GetHashCode() depends on some mutable value you have to change your hash whenever your value changes. Otherwise you break the equals law.
The rule that a hash should never change once somebody has asked for it matters when you put your object into a HashSet or use it as a key in a Dictionary. In these cases you have to ensure that the hash code does not change for as long as the object is stored in such a container. You can ensure this manually, by simply taking care when you program, or you could give your object a Freeze() method. Once it is called, any subsequent attempt to set a property leads to some kind of exception (you should then also provide a Defrost() method). Additionally, you call Freeze() from within your GetHashCode() implementation, so you can be quite sure that nobody alters a frozen object by mistake.
And just one last tip: if you need to alter an object within such a container, simply remove it, alter it (don't forget to defrost it) and add it back.
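The remove, alter, re-add tip can be sketched as follows. Label, RehashDemo and SafeRename are hypothetical names, and the Freeze()/Defrost() mechanism is left out to keep the example short.

```csharp
using System.Collections.Generic;

// Hypothetical mutable type whose hash code follows its Text property.
public sealed class Label
{
    public string Text { get; set; }
    public Label(string text) { Text = text; }

    public override bool Equals(object obj)
    {
        var o = obj as Label;
        return o != null && Text == o.Text;
    }

    public override int GetHashCode() { return Text == null ? 0 : Text.GetHashCode(); }
}

public static class RehashDemo
{
    // Changes item.Text without leaving the set in an inconsistent state.
    public static bool SafeRename(HashSet<Label> set, Label item, string newText)
    {
        bool wasPresent = set.Remove(item);   // located via the old hash code
        item.Text = newText;                  // mutate while outside the set
        if (wasPresent) set.Add(item);        // stored again under the new hash code
        return set.Contains(item);
    }
}
```

Removing first matters because Remove has to locate the entry through the old hash code. If the property were changed first, the stale entry would usually stay in the set and could no longer be found.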
You sort of need to choose between mutability and GetHashCode returning the same value for 'equal' objects. Often when you think you want to implement 'equal' for mutable objects, you end up later deciding that you have "shades of equal" and really didn't mean Object.Equals equality.
Having a mutable object as the 'key' in any sort of data structure is a big red flag to me. For example:
MyObj a = new MyObj("alpha");
MyObj b = new MyObj("beta");
HashSet<MyObj> objs = new HashSet<MyObj>();
objs.Add(a);
objs.Add(b);
// objs.Count == 2
b.SomeValue = "alpha";
// objs.Distinct().Count() == 1, objs.Count == 2
We've badly violated the contract of HashSet<T>. This is an obvious example, there are subtle ones.

HashSets don't keep the elements unique if you mutate their identity

When working with HashSets in C#, I recently came across an annoying problem: HashSets don't guarantee uniqueness of the elements; they are not Sets. What they do guarantee is that when Add(T item) is called, the item is not added if item.Equals(that) is true for any item already in the set. This no longer holds if you manipulate items already in the set. A small program that demonstrates this (copy-pasted from my LINQPad):
void Main()
{
HashSet<Tester> testset = new HashSet<Tester>();
testset.Add(new Tester(1));
testset.Add(new Tester(2));
foreach(Tester tester in testset){
tester.Dump();
}
foreach(Tester tester in testset){
tester.myint = 3;
}
foreach(Tester tester in testset){
tester.Dump();
}
HashSet<Tester> secondhashset = new HashSet<Tester>(testset);
foreach(Tester tester in secondhashset){
tester.Dump();
}
}
class Tester{
public int myint;
public Tester(int i){
this.myint = i;
}
public override bool Equals(object o){
if (o== null) return false;
Tester that = o as Tester;
if (that == null) return false;
return (this.myint == that.myint);
}
public override int GetHashCode(){
return this.myint;
}
public override string ToString(){
return this.myint.ToString();
}
}
It will happily manipulate the items in the collection to be equal, only filtering them out when a new HashSet is built. What is advisable when I want to work with sets where I need to know the entries are unique? Roll my own, where Add(T item) adds a copy of the item, and the enumerator enumerates over copies of the contained items? This presents the challenge that every contained element must be deep-copyable, at least in the members that influence its equality.
Another solution would be to roll my own set that only accepts elements that implement INotifyPropertyChanged, and acts on the event to re-check for equality, but this seems severely limiting, not to mention a whole lot of work and performance loss under the hood.
Yet another possible solution I thought of is making sure that all fields are readonly or const and set only in the constructor. All solutions seem to have very large drawbacks. Do I have any other options?
You're really talking about object identity. If you're going to hash items they need to have some kind of identity so they can be compared.
If that changes, it is not a valid identity method. You currently have public int myint. It really should be readonly, and only set in the constructor.
If two objects are conceptually different (i.e. you want to treat them as different in your specific design) then their hash code should be different.
If you have two objects with the same content (i.e. two value objects that have the same field values) then they should have the same hash codes and should be equal.
If your data model says that you can have two objects with the same content but they can't be equal, you should use a surrogate id, not hash the contents.
Perhaps your objects should be immutable value types so the object can't change
If they are mutable types, you should assign a surrogate ID (i.e. one that is introduced externally, like an increasing counter id or using the object's hashcode) that never changes for the given object
This is a problem with your Tester objects, not the set. You need to think hard about how you define identity. It's not an easy problem.
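The surrogate ID suggestion above might look like the sketch below. Tracked, SurrogateIdDemo and the static counter are assumptions made for illustration; the counter is only unique within one process.

```csharp
using System.Collections.Generic;
using System.Threading;

// Hypothetical mutable type whose identity is a counter-assigned id, not its content.
public sealed class Tracked
{
    private static int nextId;
    private readonly int id;
    public int Value;   // freely mutable; not part of identity

    public Tracked(int value)
    {
        id = Interlocked.Increment(ref nextId);   // thread-safe increasing counter
        Value = value;
    }

    public override bool Equals(object obj)
    {
        var o = obj as Tracked;
        return o != null && o.id == id;
    }

    public override int GetHashCode() { return id; }
}

public static class SurrogateIdDemo
{
    // Mutates one element to match another and reports the resulting set size,
    // or -1 if either element can no longer be found.
    public static int CountAfterMutation()
    {
        var a = new Tracked(1);
        var b = new Tracked(2);
        var set = new HashSet<Tracked> { a, b };
        b.Value = 1;   // same content as a, but still a different identity
        return (set.Contains(a) && set.Contains(b)) ? set.Count : -1;
    }
}
```

CountAfterMutation() returns 2: both elements stay findable and distinct because identity never depends on Value. This only fits data models where two objects with the same content are allowed to be different.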
When I need a 1-dimensional collection of guaranteed unique items I usually go with Dictionary<TKey, TValue>: you cannot add elements with the same Key, plus I usually need to attach some properties to the items and the Value comes in handy (my go-to value type is Tuple<> for many values...).
Of course, it's not the most performant nor the least memory-hungry solution, but I don't usually have performance/memory concerns.
You should implement your own IEqualityComparer and pass it to the constructor of the HashSet to ensure you get the desired equality comparer.
And as Joe said, if you want the collection to remain unique even beyond .Add(T item) you need to use ValueObjects that are created by the constructor and have no publicly visible set attributes.
i.e.
