What is the use of GetHashCode()? Can I trace object identity using GetHashCode()? If so, could you provide an example?
Hash codes aren't about identity, they're about equality. In fact, you could say they're about non-equality:
If two objects have the same hash code, they may be equal
If two objects have different hash codes, they're not equal
Hash codes are not unique, nor do they guarantee equality (two objects may have the same hash but still be unequal).
As for their uses: they're almost always used to quickly select possibly equal objects to then test for actual equality, usually in a key/value map (e.g. Dictionary<TKey, TValue>) or a set (e.g. HashSet<T>).
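To illustrate that "select possibly equal, then test for actual equality" behavior, here is a minimal sketch. The Badge type and its deliberately constant hash code are hypothetical; a Dictionary<TKey, TValue> uses the hash only to pick a bucket and then falls back on Equals:

```csharp
using System;
using System.Collections.Generic;

// Hypothetical type: every instance returns the same hash code.
class Badge
{
    public string Name { get; }
    public Badge(string name) => Name = name;

    public override bool Equals(object obj) =>
        obj is Badge other && other.Name == Name;

    // Legal but slow: all Badges land in the same bucket.
    public override int GetHashCode() => 42;
}

class Demo
{
    static void Main()
    {
        var map = new Dictionary<Badge, int>();
        map[new Badge("alice")] = 1;
        map[new Badge("bob")] = 2;

        // Same hash code for both keys, but Equals tells them apart.
        Console.WriteLine(map.Count);               // 2
        Console.WriteLine(map[new Badge("alice")]); // 1
    }
}
```

Even with total hash collision the dictionary works correctly; it just degrades toward a linear scan of the bucket.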
No, a hash code is not guaranteed to be unique. But you already have references to your objects, and those are perfect for tracking identity using object.ReferenceEquals().
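For example (a minimal sketch), object.ReferenceEquals answers the identity question directly, independent of any hash code:

```csharp
using System;

class Demo
{
    static void Main()
    {
        var a = new object();
        var b = a;              // second reference to the same instance
        var c = new object();   // a distinct instance

        Console.WriteLine(object.ReferenceEquals(a, b)); // True
        Console.WriteLine(object.ReferenceEquals(a, c)); // False
        // Two distinct objects may even share a hash code;
        // ReferenceEquals is the reliable identity test.
    }
}
```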
The value itself is used in hashing algorithms, such as hashtables.
In its default implementation, GetHashCode does not guarantee the uniqueness of an object, so for .NET objects it should not be used that way.
In your own classes, it is generally good practice to override GetHashCode so that it produces a well-distributed value based on the fields that define equality for your object.
It's used for algorithms\data structures that require hashing (such as a hash table). A hash code cannot on its own be used to track object identity since two objects with the same hash are not necessarily equal. However, two equal objects should have the same hash code (which is why C# emits a warning if you override one without overriding the other).
Related
I hate to beat a dead horse. In Eric Lippert's blog, he states:
the hash value of an object is the same for its entire lifetime
Then follows up with:
However, this is only an ideal-situation guideline
So, my question is this.
For a POCO (nothing overridden) or a framework object (like FileInfo or Process), is the return value of the GetHashCode() method guaranteed to be the same during its lifetime?
P.S. I am talking about pre-allocated objects: var foo = new Bar(); Will foo.GetHashCode() always return the same value?
If you look at the MSDN documentation you will find the following remarks about the default behavior of the GetHashCode method:
If GetHashCode is not overridden, hash codes for reference types are computed by calling the Object.GetHashCode method of the base class, which computes a hash code based on an object's reference; for more information, see RuntimeHelpers.GetHashCode. In other words, two objects for which the ReferenceEquals method returns true have identical hash codes. If value types do not override GetHashCode, the ValueType.GetHashCode method of the base class uses reflection to compute the hash code based on the values of the type's fields. In other words, value types whose fields have equal values have equal hash codes.
Based on my understanding we can assume that:
for a reference type (which doesn't override Object.GetHashCode), the value of the hash code of a given instance is guaranteed to be the same for the entire lifetime of the instance (because the memory address at which the object is stored won't change during its lifetime)
for a value type (which doesn't override GetHashCode), it depends: if the value type is immutable, then the hash code won't change during its lifetime; if, otherwise, the values of its fields can be changed after its creation, then its hash code will change too.
Please notice that value types are generally immutable.
IMPORTANT EDIT
As pointed out in one comment above the .NET garbage collector can decide to move the physical location of an object in memory during the object lifetime, in other words an object can be "relocated" inside the managed memory.
This makes sense because the garbage collector is in charge of managing the memory allocated when objects are created.
After some searching, and according to this Stack Overflow question (read the comments provided by the user #supercat), it seems that this relocation does not change the hash code of an object instance during its lifetime: the hash code is calculated once (the first time its value is requested) and the computed value is saved and reused later (when the hash code is requested again).
To summarize, based on my understanding, the only thing you can assume is that, given two references pointing to the same object in memory, their hash codes will always be identical. In other words, if Object.ReferenceEquals(a, b) then a.GetHashCode() == b.GetHashCode(). Furthermore, it seems that a given object instance's hash code will stay the same for its entire lifetime, even if the physical memory address of the object is changed by the garbage collector.
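That guarantee can be sketched in a few lines. The Node type here is hypothetical and deliberately does not override GetHashCode, so the default reference-based implementation applies:

```csharp
using System;

// Hypothetical type with no GetHashCode override:
// the default reference-based hash code is used.
class Node { }

class Demo
{
    static void Main()
    {
        var a = new Node();
        var b = a;  // second reference to the same object

        // Same object => same hash code, every time it's requested.
        Console.WriteLine(a.GetHashCode() == b.GetHashCode()); // True
        Console.WriteLine(a.GetHashCode() == a.GetHashCode()); // True
    }
}
```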
SIDENOTE ON HASH CODES USAGE
It is important to always remember that the hash code was introduced into the .NET framework for the sole purpose of supporting the hash table data structure.
In order to determine the bucket to be used for a given value, the corresponding key is taken and its hash code is computed (to be precise, the bucket index is obtained by applying some normalizations on the value returned by the GetHashCode call, but the details are not important for this discussion). Put another way, the hash function used in the .NET implementation of hash tables is based on the computation of the hash code of the key.
This means that the only safe usage of a hash code is balancing a hash table, as pointed out by Eric Lippert here, so don't write code that depends on hash code values for any other purpose.
There are three cases.
A class which does not override GetHashCode
A struct which does not override GetHashCode
A class or struct which does override GetHashCode
If a class does not override GetHashCode, then the return value of the helper function RuntimeHelpers.GetHashCode is used. This will return the same value each time it's called for the same object, so an object will always have the same hash code. Note that this hash code is specific to a single AppDomain - restarting your application, or creating another AppDomain, will probably result in your object getting a different hash code.
If a struct does not override GetHashCode, then the hash code is generated based on the hash code of one of its members. Of course, if your struct is mutable, then that member can change over time, and so the hash code can change over time. Even if the struct is immutable, that member could itself be mutated and could return different hash codes.
If a class or struct does override GetHashCode, then all bets are off. Someone could implement GetHashCode by returning a random number - that's a bit of a silly thing to do, but it's perfectly possible. More likely, the object could be mutable, and its hash code could be based off its members, both of which can change over time.
It's generally a bad idea to implement GetHashCode for objects which are mutable, or in a way where the hash code can change over time (in a given AppDomain). Many of the assumptions made by classes like Dictionary<TKey, TValue> break down in this case, and you will probably see strange behaviour.
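A minimal sketch of that breakage, using a hypothetical, deliberately mutable MutablePoint type whose hash depends on its fields:

```csharp
using System;
using System.Collections.Generic;

// Hypothetical mutable type: a risky dictionary key.
class MutablePoint
{
    public int X;
    public int Y;

    public override bool Equals(object obj) =>
        obj is MutablePoint p && p.X == X && p.Y == Y;

    // The hash depends on mutable fields.
    public override int GetHashCode() => X * 31 + Y;
}

class Demo
{
    static void Main()
    {
        var key = new MutablePoint { X = 1, Y = 2 };
        var map = new Dictionary<MutablePoint, string> { [key] = "here" };

        key.X = 99;  // the key's hash code changes after insertion

        // The entry was stored under the old hash; the lookup computes
        // the new one, so the very same reference can no longer be found.
        Console.WriteLine(map.ContainsKey(key)); // False
    }
}
```

The entry still exists and still consumes memory; it has simply become unreachable through normal lookups, which is exactly the "strange behaviour" described above.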
I have a CustomObject object which overrides GetHashCode().
I have a HashSet, and I am able to call Add with two distinct objects having the same hash code.
Both get added, and later on I end up with some database insertion troubles (primary key duplicates)... The purpose of using a HashSet was connected to these database insertions (avoiding key collisions).
Am I possibly missing out on some properties of HashSet? Even when I try checking (.Contains) before adding (.Add), I end up adding hash code duplicates...
Because HashSet<T> membership is based on object equality, not hash code equality. It's perfectly legal for every member of a HashSet<T> to have the same hash code as long as the members are different according to Equals. The role that hash codes play in HashSet<T> is for rapid testing of membership. If you have an object and its hash code is not in the HashSet<T>, then you know that the object is not in the HashSet<T>. If you have an object and its hash code is in the HashSet<T>, then you have to walk the chain of objects with that same hash code testing for equality using Equals to see if the object is actually in the HashSet<T> or not. That's why a balanced hash code distribution is important. But it is not the case that unique hash codes are necessary.
Overriding GetHashCode is not enough. You need to override Equals function as well.
Do not use hashsets to try to avoid duplicate values. Use them to balance hash tables!
I'm considering implementing my own custom hashcode for a given object... and use this as a key for my dictionary. Since it's possible (likely) that 2 objects will have the same hashcode, what additional operators should I override, and what should that override (conceptually) look like?
myDictionary.Add(myObj.GetHashCode(),myObj);
vs
myDictionary.Add(myObj,myObj);
In other words, does a Dictionary use a combination of the following in order to determine uniqueness and which bucket to place an object in?
Which are more important than others?
HashCode
Equals
==
CompareTo()
Is CompareTo only needed for SortedDictionary?
What is GetHashCode used for?
It is by design useful for only one thing: putting an object in a hash table. Hence the name.
GetHashCode is designed to do only one thing: balance a hash table. Do not use it for anything else. In particular:
It does not provide a unique key for an object; probability of collision is extremely high.
It is not of cryptographic strength, so do not use it as part of a digital signature or as a password equivalent
It does not necessarily have the error-detection properties needed for checksums.
and so on.
Eric Lippert
http://ericlippert.com/2011/02/28/guidelines-and-rules-for-gethashcode/
It's not the buckets that cause the problem - it is actually finding the right object instance once you have determined the bucket using the hash code. Since all objects in a bucket share the same hash code, object equality (Equals) is used to find the right one. The rule is that if two objects are considered equal, they should produce the same hash code - but two objects producing the same hash codes might not be equal.
Not sure whether it is sensible reopen my earlier thread on Hashing URL.
Nonetheless, I am still curious know how this work undercover.
Assumption: we have a hashtable with n elements (where n < infinity) whose asymptotic lookup complexity is O(1); we (the CLR) have achieved this by applying some hashing function (one of several hash functions H1…Hn, where n > 1).
Question: Can someone explain to me how the CLR maps a key to its hash code when we seek (retrieve) an element, if different hashing functions are used? How does the CLR track (if it does) the hashing function of any live object (hash table)?
Thanks in advance.
Conceptually, there are two hash functions. The first hash function, as you probably have guessed, is the key object's GetHashCode method. The second hash function is a hash of the key returned by the first hash function.
So, imagine a hash table that has a capacity of 1,024 items, and you're going to insert two keys: K1 and K2.
K1.GetHashCode() returns 1,023. K2.GetHashCode() returns 65,535
The code then divides the returned hash code by the hash table size and takes the remainder. So both of the keys map to position 1,023 in the hash table.
K1 is added to the table. When it comes time to add K2, there is a collision. So the code resorts to the second hash function. That second hash function is probably a "bit mixer" (often the last stage in calculating a hash code) of some sort that randomizes the bits in the returned key. Conceptually, the code would look something like this:
int hashCode = K2.GetHashCode();
int slot = hashCode % 1024;
if (table[slot] != null)
{
    // Collision: derive an alternative slot from a mixed hash.
    int secondHashCode = BitMixer(hashCode);
    slot = secondHashCode % 1024;
}
The point here is that the code doesn't have to keep track of multiple hash functions for the different keys. It knows that it can call Key.GetHashCode() to get the object's hash code. From there, it can call its own bit mixer function or functions to generate additional hash codes.
A hash code does not uniquely identify an object. It's just used to quickly put that object into a bucket. The elements in one bucket may but need not be equal, but elements in different buckets must be unequal.
Conceptually you can think of the default GetHashCode() implementation on reference types as using a field in every instance containing a random value for the hashcode which gets initialized on object creation. The actual implementation is a bit more complex but that doesn't matter here.
Since there are only about four billion different hash codes (the range of a 32-bit int), the O(1) runtime of most hash table implementations will break down if you have more elements than that. And of course the distribution must be good, i.e. there must not be too many hash collisions, but having a few is no big problem.
For types with value semantics you override both Equals and GetHashCode consistently to use the fields which determine equality.
Not sure if I understand your question well, but every object in .NET implements the GetHashCode function, which returns a hash code usable (and used) in dictionaries/hashtables, so the object itself is responsible for generating a good hash code.
Of course, there may (and will) be conflicts, as the hash code is an int. The conflicts are handled/resolved by the dictionary/hashtable.
Every object implements the GetHashCode() function and the Equals() function.
The default implementations of these are based on object references. For example, a.Equals(b) returns the same as object.ReferenceEquals(a, b). This means that if two object references are equal, so are their hash codes.
There are cases where you need to provide different semantics for the Equals() function. In these cases you must maintain the contract that if a.Equals(b), then a.GetHashCode() == b.GetHashCode().
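A common way to maintain that contract is to derive both members from the same fields. A minimal sketch follows; the Person type is hypothetical, and HashCode.Combine assumes .NET Core 2.1+ / .NET Standard 2.1:

```csharp
using System;

// Hypothetical type with value semantics.
class Person
{
    public string Name { get; }
    public int Age { get; }
    public Person(string name, int age) { Name = name; Age = age; }

    // Equality and the hash code are derived from the same fields,
    // so a.Equals(b) implies a.GetHashCode() == b.GetHashCode().
    public override bool Equals(object obj) =>
        obj is Person p && p.Name == Name && p.Age == Age;

    public override int GetHashCode() => HashCode.Combine(Name, Age);
}

class Demo
{
    static void Main()
    {
        var a = new Person("Ann", 30);
        var b = new Person("Ann", 30);
        Console.WriteLine(a.Equals(b));                        // True
        Console.WriteLine(a.GetHashCode() == b.GetHashCode()); // True
    }
}
```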
There are many hashing functions in use, each with its own advantages and disadvantages. There is a useful explanation here. The actual function used is not something you should worry about; what matters most for keeping the average O(1) lookup time in the Hashtable is (ideally) ensuring that the GetHashCode() results of the objects to be inserted are distributed as close to uniformly as possible.
Let's assume I have two objects called K and M
if(K.Equals(M))
{
}
If that's true, do K and M always have the same hash code?
Or does it depend on the programming language?
The contract for GetHashCode() requires it, but since anyone can make their own implementation it is never guaranteed.
Many classes (especially hashtables) require it in order to behave correctly.
If you are implementing a class, you should always make sure that two equal objects have the same hashcode.
If you are implementing an utility method/class, you can assume that two equal objects have the same hashcode (if not, it is the other class, not yours, that is buggy).
If you are implementing something with security implications, you cannot assume it.
If that's true, do K and M always have the same hash code?
Yes.
Or rather, it should be the case. Consumers of hash codes (e.g. containers) can assume that equal objects have equal hash codes, or rather that unequal hash codes mean the objects are unequal. (Unequal objects can have the same hash code: there are more possible objects than hash codes, so this has to be allowed.)
Or does it depend on the programming language?
No.
If that's true, do K and M always have the same hash code?
Yes. Unless they have a wickedly overridden Equals method. But that would be considered broken.
But note that the reverse is not true:
if K and M have the same hash code, it could still be that K.Equals(M) == false
Yes, it should return the same hash code.
I'd say it's language independent. But there's no guarantee, as other programmers may not have implemented it correctly.
GetHashCode returns a value based on the current instance that is suited for hashing algorithms and data structures such as a hash table. Two objects that are the same type and are equal must return the same hash code to ensure that instances of System.Collections.Hashtable and System.Collections.Generic.Dictionary work correctly.
Note that in your application a hash code does not uniquely identify an instance of an object; two distinct instances may share a hash code. This behavior is part of the .NET platform, so hash codes work the same regardless of which .NET language you are authoring in.
GetHashCode() can return the same hash for different objects. You should use Equals() to compare objects, not GetHashCode(); in cases where GetHashCode() returns the same value, the implementation of Equals() should perform the additional equality checks on the other object.
Hash data structures are able to handle such cases by using collision resolution algorithms.
From wikipedia:
Hash collisions are practically unavoidable when hashing a random subset of a large set of possible keys. For example, if 2,500 keys are hashed into a million buckets, even with a perfectly uniform random distribution, according to the birthday problem there is a 95% chance of at least two of the keys being hashed to the same slot. Therefore, most hash table implementations have some collision resolution strategy to handle such events. Some common strategies are described below. All these methods require that the keys (or pointers to them) be stored in the table, together with the associated values.
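One of those strategies, separate chaining, can be sketched as a toy table (this is an illustration, not how the real .NET Dictionary is implemented):

```csharp
using System;
using System.Collections.Generic;

// Toy hash table using separate chaining:
// each bucket holds a list of (key, value) entries.
class ChainedTable<TKey, TValue>
{
    private readonly List<(TKey Key, TValue Value)>[] buckets =
        new List<(TKey, TValue)>[16];

    // Clear the sign bit so negative hash codes still index correctly.
    private int Slot(TKey key) =>
        (key.GetHashCode() & int.MaxValue) % buckets.Length;

    public void Add(TKey key, TValue value)
    {
        int slot = Slot(key);
        buckets[slot] ??= new List<(TKey, TValue)>();
        buckets[slot].Add((key, value));
    }

    public bool TryGet(TKey key, out TValue value)
    {
        var chain = buckets[Slot(key)];
        if (chain != null)
            foreach (var entry in chain)
                if (entry.Key.Equals(key))  // hash picks the chain, Equals decides
                {
                    value = entry.Value;
                    return true;
                }
        value = default;
        return false;
    }
}

class Demo
{
    static void Main()
    {
        var table = new ChainedTable<string, int>();
        table.Add("one", 1);
        table.Add("two", 2);
        Console.WriteLine(table.TryGet("two", out var v) && v == 2); // True
    }
}
```

Colliding keys simply share a chain; lookups walk that chain and use Equals, which mirrors the behavior of the framework collections described throughout this thread.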
It depends on the Equals implementation of the object. It may use GetHashCode under the hood, but it doesn't have to. So basically, if you have an object with a custom Equals implementation, the hash codes of two equal objects may still differ (although that would violate the GetHashCode contract).