I've made the assumption that the (generic) Dictionary class in .NET uses the GetHashCode() method on its keys to produce hashes. I have two questions leading on from it:
Object has an overridable GetHashCode() method. For a user defined
reference type object, will this method produce a hash based on the
referenced data? e.g. If I have a class OneString which contains
only one String instance variable - will two separate instances of
this class with matching strings always produce the same hash code?
Or does the GetHashCode() method of OneString need to be overridden
to achieve this functionality?
Presumably the hash function implemented in the String class is different to the hash function implemented in a different reference type (e.g. BitmapImage). Are the hash functions implemented in the most common classes publicly available?
No.
object.GetHashCode() returns a value based on that object's identity alone.
It is not designed to return the same value for two distinct but equivalent objects; it is completely unaware of the type or meaning of the object.
Classes that represent values (such as String) override GetHashCode() to return a hash based on the value represented.
The algorithm used is up to the class designer; GetHashCode() is written like any other method.
However, GetHashCode() is supposed to return equal values whenever Equals() returns true; if your class does not do this, it is wrong.
Object has an overridable GetHashCode() method. For a user defined
reference type object, will this method produce a hash based on the
referenced data?
No, the default GetHashCode method doesn't attempt to use the data in the class; it bases the hash only on the reference. Two separate instances with identical content will generally have different hash codes.
If I have a class OneString which contains only one String instance
variable - will two separate instances of this class with matching
strings always produce the same hash code? Or does the GetHashCode()
method of OneString need to be overridden to achieve this
functionality?
You have to override it.
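As a minimal sketch of what that override could look like, the OneString class from the question might be written as follows. The field name and the choice to also implement IEquatable<OneString> are assumptions for illustration, not something the question specifies.

```csharp
using System;

// Hypothetical OneString class from the question, given value-based equality.
public sealed class OneString : IEquatable<OneString>
{
    private readonly string _value;

    public OneString(string value)
    {
        _value = value ?? throw new ArgumentNullException(nameof(value));
    }

    // Two instances are equal when their strings match.
    public bool Equals(OneString other)
    {
        return other != null && _value == other._value;
    }

    public override bool Equals(object obj)
    {
        return Equals(obj as OneString);
    }

    // Equal objects must return equal hash codes, so delegate to the string.
    public override int GetHashCode()
    {
        return _value.GetHashCode();
    }
}
```

With these overrides, new OneString("abc") and a second new OneString("abc") compare equal and return the same hash code, so either instance can be used to look up the other in a Dictionary or HashSet.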
Presumably the hash function implemented in the String class is
different to the hash function implemented in a different reference
type (e.g. SqlCommand). Are the hash functions implemented in the most
common classes publicly available?
Yes, GetHashCode for strings and the common value types is implemented to produce a working hash code from the values.
1) Within a single run of an application, different string instances with the same contents will always produce the same hash code. (see: http://msdn.microsoft.com/en-us/library/system.string.gethashcode.aspx)
2) GetHashCode() is a method of the base Object class, from which all types derive. So, there is always an implementation of this method for any type.
Related
I hate to beat a dead horse. In Eric Lippert's blog, he states:
the hash value of an object is the same for its entire lifetime
Then follows up with:
However, this is only an ideal-situation guideline
So, my question is this.
For a POCO (nothing overridden) or a framework (like FileInfo or Process) object, is the return value of the GetHashCode() method guaranteed to be the same during its lifetime?
P.S. I am talking about pre-allocated objects. var foo = new Bar(); Will foo.GetHashCode() always return the same value?
If you look at the MSDN documentation you will find the following remarks about the default behavior of the GetHashCode method:
If GetHashCode is not overridden, hash codes for reference types are
computed by calling the Object.GetHashCode method of the base class,
which computes a hash code based on an object's reference; for more
information, see RuntimeHelpers.GetHashCode. In other words, two
objects for which the ReferenceEquals method returns true have
identical hash codes. If value types do not override GetHashCode, the
ValueType.GetHashCode method of the base class uses reflection to
compute the hash code based on the values of the type's fields. In
other words, value types whose fields have equal values have equal
hash codes
Based on my understanding we can assume that:
for a reference type (which doesn't override Object.GetHashCode) the
value of the hash code of a given instance is guaranteed to be the
same for the entire lifetime of the instance (because the memory
address at which the object is stored won't change during its
lifetime)
for a value type (which doesn't override Object.GetHashCode) it depends: if the value type is immutable then the hash code won't
change during its lifetime. If, on the other hand, the values of its fields
can be changed after its creation then its hash code can change too.
Note that value types are generally designed to be immutable.
IMPORTANT EDIT
As pointed out in one comment above the .NET garbage collector can decide to move the physical location of an object in memory during the object lifetime, in other words an object can be "relocated" inside the managed memory.
This makes sense because the garbage collector is in charge of managing the memory allocated when objects are created.
After some searching, and according to this Stack Overflow question (read the comments provided by the user supercat), it seems that this relocation does not change the hash code of an object instance during its lifetime, because the hash code is calculated once (the first time that its value is requested) and the computed value is saved and reused later (when the hash code value is requested again).
To summarize, based on my understanding, the only thing you can assume is that given two references pointing to the same object in memory, their hash codes will always be identical. In other words, if Object.ReferenceEquals(a, b) then a.GetHashCode() == b.GetHashCode(). Furthermore, it seems that the hash code of a given object instance will stay the same for its entire lifetime, even if the physical memory address of the object is changed by the garbage collector.
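The behaviour described above can be checked with a small sketch like the following. The Plain class and the DefaultHashDemo helper are hypothetical names introduced only for this illustration, and forcing a collection with GC.Collect() does not guarantee that the object is actually relocated.

```csharp
using System;
using System.Runtime.CompilerServices;

// Hypothetical class that does not override GetHashCode.
public sealed class Plain
{
    public int Value;
}

public static class DefaultHashDemo
{
    // Returns true when the default hash code is unchanged after a field
    // mutation and a forced garbage collection, and matches the value
    // reported by RuntimeHelpers.GetHashCode.
    public static bool IsStable()
    {
        var p = new Plain();
        int first = p.GetHashCode();

        p.Value = 42;   // mutating a field does not affect the default hash
        GC.Collect();   // the collector may (or may not) move the object

        int second = p.GetHashCode();
        return first == second && first == RuntimeHelpers.GetHashCode(p);
    }
}
```

Calling DefaultHashDemo.IsStable() should return true, which is consistent with the hash code being computed once and then remembered for the instance.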
SIDENOTE ON HASH CODES USAGE
It is important to always remember that hash codes were introduced in the .NET Framework for the sole purpose of supporting the hash table data structure.
In order to determine the bucket to be used for a given value, the corresponding key is taken and its hash code is computed (to be precise, the bucket index is obtained by applying some normalizations on the value returned by the GetHashCode call, but the details are not important for this discussion). Put another way, the hash function used in the .NET implementation of hash tables is based on the computation of the hash code of the key.
This means that the only safe usage for a hash code is balancing a hash table, as pointed out by Eric Lippert here, so don't write code which depends on hash code values for any other purpose.
There are three cases.
A class which does not override GetHashCode
A struct which does not override GetHashCode
A class or struct which does override GetHashCode
If a class does not override GetHashCode, then the return value of the helper function RuntimeHelpers.GetHashCode is used. This will return the same value each time it's called for the same object, so an object will always have the same hash code. Note that this hash code is specific to a single AppDomain - restarting your application, or creating another AppDomain, will probably result in your object getting a different hash code.
If a struct does not override GetHashCode, then the hash code is generated based on the hash code of one of its members. Of course, if your struct is mutable, then that member can change over time, and so the hash code can change over time. Even if the struct is immutable, that member could itself be mutated, and could return different hash codes.
If a class or struct does override GetHashCode, then all bets are off. Someone could implement GetHashCode by returning a random number - that's a bit of a silly thing to do, but it's perfectly possible. More likely, the object could be mutable, and its hash code could be based off its members, both of which can change over time.
It's generally a bad idea to implement GetHashCode for objects which are mutable, or in a way where the hash code can change over time (in a given AppDomain). Many of the assumptions made by classes like Dictionary<TKey, TValue> break down in this case, and you will probably see strange behaviour.
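To make the "strange behaviour" concrete, here is a small sketch with a hypothetical MutableKey class whose hash code depends on a settable property. The class and method names are assumptions made up for this example.

```csharp
using System.Collections.Generic;

// Hypothetical key type whose hash code depends on mutable state.
public sealed class MutableKey
{
    public int Id { get; set; }

    public override bool Equals(object obj)
    {
        var other = obj as MutableKey;
        return other != null && other.Id == Id;
    }

    public override int GetHashCode()
    {
        return Id;
    }
}

public static class MutableKeyDemo
{
    // Adds a key, mutates it, then tries to find the very same object again.
    public static bool LookupAfterMutation()
    {
        var key = new MutableKey { Id = 1 };
        var dict = new Dictionary<MutableKey, string> { [key] = "value" };

        key.Id = 2;   // the hash code changes while the key is stored

        return dict.ContainsKey(key);
    }
}
```

LookupAfterMutation() returns false: the entry is still in the dictionary, but it was filed under the old hash code, so a lookup using the new hash code no longer finds it.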
Attempt #3 to simplify this question:
A generic List<T> can contain any type - value or reference. When checking to see if a list contains an object, .Contains() uses the default EqualityComparer<T> for type T, and calls .Equals() (is my understanding). If no EqualityComparer has been defined, the default comparer will call .Equals(). By default, .Equals() calls .ReferenceEquals(), so .Contains() will only return true if the list contains the exact same object.
Until you need to override .Equals() to implement value equality, at which point the default comparer says two objects are the same if they have the same values. I can't think of a single case where that would be desirable for a reference type.
What I'm hearing from @Enigmativity is that implementing IEqualityComparer<StagingDataRow> will give my typed DataRow a default equality comparer that will be used instead of the default comparer for Object – allowing me to implement value equality logic in StagingDataRow.Equals().
Questions:
Am I understanding that correctly?
Am I guaranteed that everything in the .NET framework will call EqualityComparer<StagingDataRow>.Equals() instead of StagingDataRow.Equals()?
What should IEqualityComparer<StagingDataRow>.GetHashCode(StagingDataRow obj) hash against, and should it return the same value as StagingDataRow.GetHashCode()?
What is passed to IEqualityComparer<StagingDataRow>.GetHashCode(StagingDataRow obj)? The object I'm looking for or the object in the list? Both? It would be strange to have an instance method accept itself as a parameter...
In general, how does one separate value equality from reference equality when overriding .Equals()?
The original line of code spurring this question:
// For each ID, a collection of matching rows
Dictionary<string, List<StagingDataRow>> stagingTableDictionary;
StagingTableMatches.AddRange(stagingTableDictionary[perNr].Where(row => !StagingTableMatches.Contains(row)));
Ok, let's handle a few misconceptions first:
By default, .Equals() calls .ReferenceEquals(), so .Contains() will only return true if the list contains the exact same object.
This is true, but only for reference types. Value types will implement a very slow reflection-based Equals function by default, so it's in your best interest to override that.
I can't think of a single case where that would be desirable for a reference type.
Oh I'm sure you can... String is a reference type for instance :)
What I'm hearing from @Enigmativity is that implementing IEqualityComparer<StagingDataRow> will give my typed DataRow a default equality comparer that will be used instead of the default comparer for Object – allowing me to implement value equality logic in StagingDataRow.Equals().
Err... No.
IEqualityComparer<T> is an interface which lets you delegate equality comparison to a different object. If you want a different default behavior for your class, you implement IEquatable<T>, and also delegate object.Equals to that for consistency. Actually, overriding object.Equals and object.GetHashCode is sufficient to change the default equality comparison behavior, but also implementing IEquatable<T> has additional benefits:
It makes it more obvious that your type has custom equality comparison logic - think self documenting code.
It improves performance for value types, since it avoids unnecessary boxing (which happens with object.Equals)
So, for your actual questions:
Am I understanding that correctly?
You still seem a bit confused about this, but don't worry :)
Enigmativity actually suggested that you create a different type which implements IEqualityComparer<T>. Looks like you misunderstood that part.
Am I guaranteed that everything in the .NET framework will call EqualityComparer<StagingDataRow>.Equals() instead of StagingDataRow.Equals()
By default, the (properly written) framework data structures will delegate equality comparison to EqualityComparer<StagingDataRow>.Default, which will in turn delegate to StagingDataRow.Equals.
What should IEqualityComparer<StagingDataRow>.GetHashCode(StagingDataRow obj) hash against, and should it return the same value as StagingDataRow.GetHashCode()
Not necessarily. It should be self-consistent: if myEqualityComparer.Equals(a, b) then you must ensure that myEqualityComparer.GetHashCode(a) == myEqualityComparer.GetHashCode(b).
It can be the same implementation as StagingDataRow.GetHashCode, but it doesn't have to be.
What is passed to IEqualityComparer<StagingDataRow>.GetHashCode(StagingDataRow obj)? The object I'm looking for or the object in the list? Both? It would be strange to have an instance method accept itself as a parameter...
Well, by now I hope you've understood that the object which implements IEqualityComparer<T> is a different object, so this should make sense.
Please read my answer on Using of IEqualityComparer interface and EqualityComparer class in C# for more in-depth information.
Am I understanding that correctly?
Partially - the "default" IEqualityComparer will use either (in order):
The implementation of IEquatable<T>
An overridden Equals(object)
the base object.Equals(object), which is reference equality for reference types.
I think you are confusing two different methods of defining "equality" in a custom type. One is by implementing IEquatable<T>, which allows an instance of a type to determine if it's "equal" to another instance of the same type.
The other is IEqualityComparer<T>, which is an independent interface that determines if two instances of that type are equal.
So if your definition of Equals should apply whenever you are comparing two instances, then implement IEquatable, as well as overriding Equals (which is usually trivial after implementing IEquatable) and GetHashCode.
If your definition of "equal" only applies in a particular use case, then create a different class that implements IEqualityComparer<T>, then pass an instance of it to whatever class or method you want that definition to apply to.
Am I guaranteed that everything in the .NET framework will call EqualityComparer<StagingDataRow>.Equals() instead of StagingDataRow.Equals()?
No - only types and methods that accept an instance of IEqualityComparer as a parameter will use it.
What should IEqualityComparer<StagingDataRow>.GetHashCode(StagingDataRow obj) hash against, and should it return the same value as StagingDataRow.GetHashCode()?
It will compute the hash code for the object that's passed in. It doesn't "compare" the hash code to anything. It does not necessarily have to return the same value as the overridden GetHashCode, but it must follow the rules for GetHashCode, particularly that two "equal" objects must return the same hash code.
It would be strange to have an instance method accept itself as a parameter...
Which is why IEqualityComparer is generally implemented on a different class. Note that IEquatable<T> doesn't have a GetHashCode() method, because it doesn't need one. It assumes that GetHashCode is overridden to match the override of object.Equals, which should match the strongly-typed implementation of IEquatable<T>.
Bottom Line
If you want your definition of "equal" to be the default for that type, implement IEquatable<T> and override Equals and GetHashCode. If you want a definition of "equal" that is just for a specific use case, then create a different class that implements IEqualityComparer<T> and pass an instance of it to whatever types or methods need to use that definition.
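As a rough sketch of the second option, a use-case-specific comparer might look like the following. Row is a hypothetical stand-in for the asker's StagingDataRow, and the assumption that rows match when their PerNr values match (and that PerNr is never null) is made up for illustration.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical row type; it does not override Equals or GetHashCode.
public sealed class Row
{
    public string PerNr { get; }
    public string Name { get; }

    public Row(string perNr, string name)
    {
        PerNr = perNr;
        Name = name;
    }
}

// A separate class that defines "equal" for one particular use case.
public sealed class RowByPerNrComparer : IEqualityComparer<Row>
{
    public bool Equals(Row x, Row y)
    {
        if (ReferenceEquals(x, y)) return true;
        if (x == null || y == null) return false;
        return x.PerNr == y.PerNr;
    }

    // Must agree with Equals: rows with equal PerNr get equal hash codes.
    public int GetHashCode(Row obj)
    {
        return obj.PerNr.GetHashCode();
    }
}

public static class ComparerDemo
{
    // Returns { result with the default comparer, result with the custom one }.
    public static bool[] Run()
    {
        var rows = new List<Row> { new Row("100", "Ann") };
        var probe = new Row("100", "Bob");

        bool byDefault = rows.Contains(probe);                            // reference equality
        bool byComparer = rows.Contains(probe, new RowByPerNrComparer()); // LINQ overload
        return new[] { byDefault, byComparer };
    }
}
```

ComparerDemo.Run() returns false for the default comparison and true for the custom one, which shows that the custom definition only applies where the comparer instance is explicitly passed in.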
Also, I would note that you very rarely call these methods directly (except Equals). They are usually called by the methods that use them (like Contains) to determine if two objects are "equal" or to get the hash code for an item.
I don't understand how a compiler can be smart enough to construct an O(1) lookup for MyObject where I can put anything inside
public class MyObject
{
// ...
}
I understand how this can be done for a limited number of non-primitives such as
public class MyObject
{
int i { get; set; }
char c { get; set; }
}
but how can it possibly know how to do this for any implementation of MyObject?
Get the hash-code
Modulo that down to produce an index into an array.
Look there. If an item is present see if it is equal.
So far, perfectly O(1). It falls down if a lot of items end up having hash-codes that modulo-down to the same index. This happening a bit is expected and dealt with, but if it happens all the time you end up with O(n) behaviour (and with really bad constant costs).
All objects by default have a GetHashCode() and an Equals() based on reference identity (that is, they are only equal to themselves). Overriding those changes the concept of equality it has, and hence you must always change GetHashCode() when you change Equals() (all objects that are equal must have equal hash codes). You can also enforce the use of a different concept of equality by using an IEqualityComparer<T> implementation which provides a different GetHashCode() and Equals() to use.
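The steps above can be sketched as a tiny hash set. This is a simplified illustration under assumed names (TinyHashSet, a fixed bucket count, list-based buckets) and is not how Dictionary<TKey, TValue> or HashSet<T> are actually laid out internally.

```csharp
using System.Collections.Generic;

// Simplified illustration of hash-based lookup; no resizing, no removal.
public sealed class TinyHashSet<T>
{
    private readonly List<T>[] _buckets;
    private readonly IEqualityComparer<T> _comparer;

    public TinyHashSet(int bucketCount = 16, IEqualityComparer<T> comparer = null)
    {
        _buckets = new List<T>[bucketCount];
        _comparer = comparer ?? EqualityComparer<T>.Default;
    }

    // Steps 1 and 2: get the hash code, then modulo it down to an array index.
    private int IndexOf(T item)
    {
        int hash = _comparer.GetHashCode(item) & 0x7FFFFFFF; // clear the sign bit
        return hash % _buckets.Length;
    }

    // Step 3: look in that bucket and compare with Equals.
    public bool Contains(T item)
    {
        List<T> bucket = _buckets[IndexOf(item)];
        if (bucket == null) return false;
        foreach (T existing in bucket)
        {
            if (_comparer.Equals(existing, item)) return true;
        }
        return false;
    }

    // Returns false if an equal item is already present.
    public bool Add(T item)
    {
        if (Contains(item)) return false;
        int index = IndexOf(item);
        if (_buckets[index] == null) _buckets[index] = new List<T>();
        _buckets[index].Add(item);
        return true;
    }
}
```

Nothing here needs to know what is inside T: the type (or the supplied comparer) provides GetHashCode and Equals, and the collection only does arithmetic on the resulting number. If many items land in the same bucket, Contains degrades towards a linear scan, which is the O(n) case mentioned above.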
Each object has a Hash Code associated with it. There is a method GetHashCode (defined as virtual in the base object class) that has to be overridden in the class so that HashSet can work properly.
A hash code is a numeric value that is used to insert and identify
an object in a hash-based collection such as the
Dictionary class, the Hashtable class, or a type derived
from the DictionaryBase class. The GetHashCode method provides this
hash code for algorithms that need quick checks of object equality.
With your current class, it will not work properly (since GetHashCode is not overridden). The comparison for equality will be done on the basis of reference instead of actual values.
Hey all, I've been reading up on the best way to implement the GetHashCode() override for objects in .NET, and most answers I run across involve somehow munging numbers together from members that are numeric types to come up with a method. Problem is, I have an object that uses an alphanumeric string as its key, and I'm wondering if there's something fundamentally wrong with just using an internal ID for objects with strings as keys, something like the following?
// Override GetHashCode() to return a permanent, unique identifier for
// this object.
static private int m_next_hash_id = 1;
private int m_hash_code = 0;
public override int GetHashCode() {
if (this.m_hash_code == 0)
this.m_hash_code = <type>.m_next_hash_id++;
return this.m_hash_code;
}
Is there a better way to come up with a unique hash code for an object that uses an alphanumeric string as its key? (And no, the numeric parts of the alphanumeric strings aren't unique; some of these strings don't actually have numbers in them at all.) Any thoughts would be appreciated!
You can call GetHashCode() on the non-numeric values that you use in your object.
private string m_foo;
public override int GetHashCode()
{
return m_foo.GetHashCode();
}
This is not a good pattern for generating hashes for an object.
It's important to understand the purpose of GetHashCode() - it's a way to generate a numeric representation of the identifying properties of an object. Hash codes are used to allow an object to serve as a key in a dictionary and in some cases accelerate comparisons between complex types.
If you simply generate a random value and call it a hash code, you have no repeatability. Another instance with the same key fields will have a different hash code, and will violate the behavior expected by classes like HashSet, Dictionary, etc.
If you already have an identifying string member in your object, just return its hash code.
The documentation on MSDN for implementers of GetHashCode() is a must read for anyone that plans on overriding that method:
Notes to Implementers
A hash function
is used to quickly generate a number
(hash code) that corresponds to the
value of an object. Hash functions are
usually specific to each Type and, for
uniqueness, must use at least one of
the instance fields as input.
A hash function must have the
following properties:
If two objects compare as equal, the
GetHashCode method for each object
must return the same value. However,
if two objects do not compare as
equal, the GetHashCode methods for the
two object do not have to return
different values.
The GetHashCode method for an object
must consistently return the same hash
code as long as there is no
modification to the object state that
determines the return value of the
object's Equals method. Note that this
is true only for the current execution
of an application, and that a
different hash code can be returned if
the application is run again.
For the best performance, a hash
function must generate a random
distribution for all input.
For example, the implementation of the
GetHashCode method provided by the
String class returns identical hash
codes for identical string values.
Therefore, two String objects return
the same hash code if they represent
the same string value. Also, the
method uses all the characters in the
string to generate reasonably randomly
distributed output, even when the
input is clustered in certain ranges
(for example, many users might have
strings that contain only the lower
128 ASCII characters, even though a
string can contain any of the 65,535
Unicode characters).
Hash codes don't have to be unique. Provided your Equals implementation is correct, it's OK to return the same hash code for two instances. The m_next_hash_id logic is broken, since it allows two objects to have different hash codes even if they compare equal.
MSDN gives a good set of instructions on how to implement Equals and GetHashCode. Several of the examples here implement GetHashCode in terms of the hash codes of an object's fields.
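A sketch of that field-based approach, for a hypothetical Part type identified by a string code and a revision number, might look like this. The type, its members, and the multiplier constants 17 and 31 are assumptions chosen for illustration; they are one common convention rather than anything prescribed by the documentation.

```csharp
using System;

// Hypothetical type whose identity is defined by two fields.
public sealed class Part : IEquatable<Part>
{
    public string Code { get; }
    public int Revision { get; }

    public Part(string code, int revision)
    {
        Code = code;
        Revision = revision;
    }

    public bool Equals(Part other)
    {
        return other != null && Code == other.Code && Revision == other.Revision;
    }

    public override bool Equals(object obj)
    {
        return Equals(obj as Part);
    }

    // Built from the same fields that Equals compares, so equal objects
    // always produce equal hash codes.
    public override int GetHashCode()
    {
        unchecked // integer overflow is fine here; just let it wrap
        {
            int hash = 17;
            hash = hash * 31 + (Code == null ? 0 : Code.GetHashCode());
            hash = hash * 31 + Revision;
            return hash;
        }
    }
}
```

Unlike the incrementing-counter approach in the question, two Part instances created separately with the same code and revision return the same hash code here, which is what Dictionary and HashSet rely on.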
Yes, a better way would be to use the hash code of the string you already have. If the alphanumeric string defines the identity of the object you have, its hash code will do quite nicely for the hash code of your object.
The idea of incrementing a static field and using it as the hashcode, is a bad one. The hash code should have an even distribution across the space of possible values. This ensures, amongst other things, that it will perform well when used as the key in a hashtable.
I believe you generally want GetHashCode() to return something that identifies the object by its value, rather than its instance. If I'm understanding the idea here, your method would ensure that GetHashCode() on two different objects with equivalent values returns different hashes just because they're different instances.
GetHashCode() is meant to return a value that lets you compare two objects values, not their references.
So I'm thinking of using a reference type as a key to a .NET Dictionary...
Example:
class MyObj
{
private int mID;
public MyObj(int id)
{
this.mID = id;
}
}
// whatever code here
static void Main(string[] args)
{
Dictionary<MyObj, string> dictionary = new Dictionary<MyObj, string>();
}
My question is, how is the hash generated for custom objects (ie not int, string, bool etc)? I ask because the objects I'm using as keys may change before I need to look up stuff in the Dictionary again. If the hash is generated from the object's address, then I'm probably fine... but if it is generated from some combination of the object's member variables then I'm in trouble.
EDIT:
I should've originally made it clear that I don't care about the equality of the objects in this case... I was merely looking for a fast lookup (I wanted to do a 1-1 association without changing the code of the classes involved).
Thanks
The default implementation of GetHashCode/Equals basically deals with identity. You'll always get the same hash back from the same object, and it'll probably be different to other objects (very high probability!).
In other words, if you just want reference identity, you're fine. If you want to use the dictionary treating the keys as values (i.e. using the data within the object, rather than just the object reference itself, to determine the notion of equality) then it's a bad idea to mutate any of the equality-sensitive data within the key after adding it to the dictionary.
The MSDN documentation for object.GetHashCode is a little bit overly scary - basically you shouldn't use it for persistent hashes (i.e. saved between process invocations) but it will be consistent for the same object, which is all that's required for it to be a valid hash for a dictionary. While it's not guaranteed to be unique, I don't think you'll run into enough collisions to cause a problem.
The hash used is the return value of the GetHashCode method on the object. By default this is essentially a value representing the reference. It is not guaranteed to be unique for an object, and in fact likely won't be in many situations. But the value for a particular reference will not change over the lifetime of the object even if you mutate it. So for this particular sample you will be OK.
In general though, it is a very bad idea to use objects which are not immutable as keys to a Dictionary. It's way too easy to fall into a trap where you override Equals and GetHashcode on an object and break code where the type was formerly used as a key in a Dictionary.
The dictionary will use the GetHashCode method defined on System.Object, which will not change over the object's lifetime regardless of field changes etc. So you won't encounter problems in the scenario you describe.
You can override GetHashCode, and should do so if you override Equals so that objects which are equal also return the same hash code. However, if the type is mutable then you must be aware that if you use the object as a key of a dictionary you will not be able to find it again if it is subsequently altered.
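For the scenario in the question, where Equals and GetHashCode are not overridden, a short sketch shows that mutating the key does not disturb the lookup. MyObj here is the class from the question with a hypothetical settable Id property added so that it can be mutated, and IdentityKeyDemo is a name made up for the example.

```csharp
using System.Collections.Generic;

// MyObj from the question, with a settable property (an assumption) so the
// key can be changed after it has been added to the dictionary.
public class MyObj
{
    public int Id { get; set; }

    public MyObj(int id)
    {
        Id = id;
    }
}

public static class IdentityKeyDemo
{
    // Returns { found via the same reference, found via a look-alike instance }.
    public static bool[] Run()
    {
        var key = new MyObj(1);
        var dictionary = new Dictionary<MyObj, string> { [key] = "payload" };

        key.Id = 99;   // the default hash code does not depend on fields

        bool sameReference = dictionary.ContainsKey(key);
        bool lookAlike = dictionary.ContainsKey(new MyObj(99));
        return new[] { sameReference, lookAlike };
    }
}
```

Run() returns true for the original reference and false for the look-alike: with the default implementations, the dictionary associates the value with that specific object rather than with its contents.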
The default implementation of the
GetHashCode method does not guarantee
unique return values for different
objects. Furthermore, the .NET
Framework does not guarantee the
default implementation of the
GetHashCode method, and the value it
returns will be the same between
different versions of the .NET
Framework. Consequently, the default
implementation of this method must not
be used as a unique object identifier
for hashing purposes.
http://msdn.microsoft.com/en-us/library/system.object.gethashcode.aspx
For a custom object derived from Object, the hash code is generated once and remains the same throughout the life of the instance. Furthermore, the hash code will not change when the values of internal fields/properties change.
Hashtable/Dictionary does not use GetHashCode as a unique identifier; it only uses it to pick a "hash bucket". For example, the strings "aaa123" and "aaa456" may both have the hash "aaa", and all objects having the same hash "aaa" will be stored in one bucket. Whenever you insert or retrieve an object, the Dictionary calls GetHashCode to determine the bucket and then compares the individual objects within it.
A custom object used as a Dictionary key should be thought of as if the Dictionary only stores its reference (address or memory pointer). The Dictionary doesn't know the object's contents; the contents may change, but the reference never changes. This also means that if two objects are exact replicas of each other but are different instances in memory, your hashtable will not consider them the same, because their references are different.
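If that is a problem and you want keys with the same ID to be treated as equal, override the Equals method, together with GetHashCode, as follows.
class MyObj
{
    private int mID;
    public MyObj(int id)
    {
        this.mID = id;
    }
    // Two MyObj instances are equal when their IDs match.
    public override bool Equals(object obj)
    {
        MyObj mobj = obj as MyObj;
        if (mobj == null)
            return false;
        return this.mID == mobj.mID;
    }
    // Equal objects must return equal hash codes, so override this as well.
    public override int GetHashCode()
    {
        return this.mID;
    }
}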