Not sure whether it is sensible to reopen my earlier thread on hashing, but I am still curious how this works under the covers.
Assumption: we have a hash table with n elements (where n is finite) and an asymptotic lookup time of O(1); the CLR achieves this by applying some hashing function (or several hash functions H1 through Hn-1, where n > 1).
Question: can someone explain to me how the CLR maps a key to its hash code when we seek (retrieve) an element, if different hashing functions are used? How does the CLR keep track (if it does) of which hashing function a live hash table used?
Thanks in advance.
Conceptually, there are two hash functions. The first hash function, as you probably have guessed, is the key object's GetHashCode method. The second hash function is a hash of the key returned by the first hash function.
So, imagine a hash table that has a capacity of 1,024 items, and you're going to insert two keys: K1 and K2.
K1.GetHashCode() returns 1,023, and K2.GetHashCode() returns 65,535.
The code then divides the returned hash code by the hash table size and takes the remainder. Since 65,535 % 1,024 is also 1,023, both keys map to position 1,023 in the hash table.
K1 is added to the table. When it comes time to add K2, there is a collision, so the code resorts to the second hash function. That second hash function is probably a "bit mixer" of some sort (often the last stage in calculating a hash code) that scrambles the bits of the returned hash code. Conceptually, the code would look something like this:
int hashCode = K2.GetHashCode();
int slot = hashCode % 1024;   // first hash function: hash code mod table size
if (table[slot] != null)
{
    // collision: fall back to the second hash function
    int secondHashCode = BitMixer(hashCode);
    slot = secondHashCode % 1024;
}
The point here is that the code doesn't have to keep track of multiple hash functions for the different keys. It knows that it can call Key.GetHashCode() to get the object's hash code. From there, it can call its own bit mixer function or functions to generate additional hash codes.
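The BitMixer call above is left undefined; purely as an illustration (this is not the CLR's actual code), such a mixer might look like the 32-bit finalizer step from MurmurHash3:

static int BitMixer(int hashCode)
{
    unchecked
    {
        // Scramble the bits so that similar inputs produce very different outputs.
        uint h = (uint)hashCode;
        h ^= h >> 16;
        h *= 0x85EBCA6B;
        h ^= h >> 13;
        h *= 0xC2B2AE35;
        h ^= h >> 16;
        return (int)h;
    }
}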
A hash code does not uniquely identify an object; it's just used to quickly put that object into a bucket. The elements within one bucket may or may not be equal to each other, but elements in different buckets must be unequal.
Conceptually, you can think of the default GetHashCode() implementation for reference types as using a hidden field in every instance that holds a random value for the hash code, initialized when the object is created. The actual implementation is a bit more complex, but that doesn't matter here.
Since there are only 2^32 (roughly four billion) different hash codes, the O(1) runtime of most hash table implementations will break down if you have more elements than that. And of course the distribution must be good, i.e. there must not be too many hash collisions, but having a few is no big problem.
For types with value semantics you override both Equals and GetHashCode consistently to use the fields which determine equality.
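For example, here is a minimal sketch of such a type (Point2D is an invented name; HashCode.Combine is available on newer frameworks, and on older ones you can combine the field hashes by hand):

using System;

// Illustrative type with value semantics: Equals and GetHashCode are
// overridden consistently, over exactly the same fields.
public sealed class Point2D : IEquatable<Point2D>
{
    public int X { get; }
    public int Y { get; }

    public Point2D(int x, int y) { X = x; Y = y; }

    public bool Equals(Point2D other) =>
        other != null && X == other.X && Y == other.Y;

    public override bool Equals(object obj) => Equals(obj as Point2D);

    // Combine only the fields that participate in Equals.
    public override int GetHashCode() => HashCode.Combine(X, Y);
}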
Not sure if I understand your question well, but every object in .NET implements the GetHashCode function, which returns a hash code usable (and used) in dictionaries and hash tables, so the object itself is responsible for generating a good hash code.
Of course, there may (and will) be conflicts, since the hash code is only an int. The conflicts are handled and resolved by the dictionary or hash table.
Every object implements the GetHashCode() and Equals() functions.
The default implementations of these are based on object references. For example, a.Equals(b) returns the same as object.ReferenceEquals(a, b). This means that if two object references are equal, so are their hash codes.
There are cases where you need to give the Equals() function different semantics. In those cases you must maintain the contract that if a.Equals(b), then a.GetHashCode() == b.GetHashCode().
Many hashing functions are in use, each with its own advantages and disadvantages; there is a useful explanation here. The actual function used is not something you should worry about. What matters most for keeping the average O(1) lookup time in the Hashtable is (ideally) ensuring that the objects you insert produce GetHashCode() results that are as close to uniformly distributed as possible.
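To see why the Equals/GetHashCode contract above matters in practice, here is a hedged sketch (BrokenName is an invented type) of what happens when Equals is overridden but GetHashCode is not; note that the C# compiler warns about this combination:

using System;
using System.Collections.Generic;

// Broken contract: Equals treats instances with the same Value as equal,
// but GetHashCode is still the default reference-based implementation.
class BrokenName
{
    public string Value;

    public override bool Equals(object obj) =>
        obj is BrokenName other && other.Value == Value;
    // GetHashCode deliberately not overridden -- this is the bug.
}

class ContractDemo
{
    static void Main()
    {
        var dict = new Dictionary<BrokenName, int>();
        dict.Add(new BrokenName { Value = "a" }, 1);

        // An "equal" key hashes into a different bucket, so the lookup
        // will almost certainly fail.
        Console.WriteLine(dict.ContainsKey(new BrokenName { Value = "a" })); // False
    }
}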
I hate to beat a dead horse. In Eric Lippert's blog, he states:
the hash value of an object is the same for its entire lifetime
Then follows up with:
However, this is only an ideal-situation guideline
So, my question is this.
For a POCO (nothing overridden) or a framework type (like FileInfo or Process), is the return value of the GetHashCode() method guaranteed to be the same during the object's lifetime?
P.S. I am talking about objects that have already been allocated: var foo = new Bar(); Will foo.GetHashCode() always return the same value?
If you look at the MSDN documentation you will find the following remarks about the default behavior of the GetHashCode method:
If GetHashCode is not overridden, hash codes for reference types are computed by calling the Object.GetHashCode method of the base class, which computes a hash code based on an object's reference; for more information, see RuntimeHelpers.GetHashCode. In other words, two objects for which the ReferenceEquals method returns true have identical hash codes. If value types do not override GetHashCode, the ValueType.GetHashCode method of the base class uses reflection to compute the hash code based on the values of the type's fields. In other words, value types whose fields have equal values have equal hash codes.
Based on my understanding we can assume that:
for a reference type (which doesn't override Object.GetHashCode), the hash code of a given instance is guaranteed to be the same for the entire lifetime of the instance (because the memory address at which the object is stored won't change during its lifetime);
for a value type (which doesn't override Object.GetHashCode), it depends: if the value type is immutable, then its hash code won't change during its lifetime; if, instead, the values of its fields can be changed after creation, then its hash code will change too. Please notice that value types should generally be immutable.
IMPORTANT EDIT
As pointed out in a comment above, the .NET garbage collector can decide to move the physical location of an object in memory during the object's lifetime; in other words, an object can be "relocated" inside the managed heap.
This makes sense, because the garbage collector is in charge of managing the memory allocated when objects are created.
After some searching, and according to this Stack Overflow question (read the comments provided by the user supercat), it seems that this relocation does not change the hash code of an object instance during its lifetime: the hash code is calculated once (the first time its value is requested) and the computed value is saved and reused later (when the hash code is requested again).
To summarize, based on my understanding, the only thing you can assume is that given two references pointing to the same object in memory, their hash codes will always be identical; in other words, if Object.ReferenceEquals(a, b), then a.GetHashCode() == b.GetHashCode(). Furthermore, it seems that a given object instance's hash code stays the same for its entire lifetime, even if the physical memory address of the object is changed by the garbage collector.
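A small sketch illustrating that summary (the Plain class is an arbitrary example that does not override GetHashCode):

using System;
using System.Runtime.CompilerServices;

class Plain { } // does not override GetHashCode

class LifetimeDemo
{
    static void Main()
    {
        var a = new Plain();
        var b = a;   // second reference to the same instance

        Console.WriteLine(a.GetHashCode() == b.GetHashCode()); // True
        Console.WriteLine(a.GetHashCode() == a.GetHashCode()); // True: stable for the lifetime of the instance

        // For a type that does not override GetHashCode, the value matches
        // what RuntimeHelpers.GetHashCode reports for the same instance.
        Console.WriteLine(a.GetHashCode() == RuntimeHelpers.GetHashCode(a)); // True
    }
}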
SIDE NOTE ON HASH CODE USAGE
It is important to always remember that the hash code was introduced in the .NET Framework for the sole purpose of supporting the hash table data structure.
In order to determine the bucket to be used for a given value, the corresponding key is taken and its hash code is computed (to be precise, the bucket index is obtained by applying some normalization to the value returned by the GetHashCode call, but the details are not important for this discussion). Put another way, the hash function used in the .NET implementation of hash tables is based on the hash code of the key.
This means that the only safe usage of a hash code is balancing a hash table, as pointed out by Eric Lippert here, so don't write code that depends on hash code values for any other purpose.
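As a rough sketch of the normalization mentioned above (real implementations differ in the details), a bucket index might be derived like this:

// Hypothetical normalization: clear the sign bit so the value is
// non-negative, then reduce it modulo the number of buckets.
static int GetBucketIndex(object key, int bucketCount)
{
    int hash = key.GetHashCode() & 0x7FFFFFFF;
    return hash % bucketCount;
}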
There are three cases.
A class which does not override GetHashCode
A struct which does not override GetHashCode
A class or struct which does override GetHashCode
If a class does not override GetHashCode, then the return value of the helper function RuntimeHelpers.GetHashCode is used. This will return the same value each time it's called for the same object, so an object will always have the same hash code. Note that this hash code is specific to a single AppDomain - restarting your application, or creating another AppDomain, will probably result in your object getting a different hash code.
If a struct does not override GetHashCode, then the hash code is generated based on the hash code of one of its members. Of course, if your struct is mutable, then that member can change over time, and so the hash code can change over time. Even if the struct itself is immutable, that member might be a reference to a mutable object, which could be mutated and return different hash codes.
If a class or struct does override GetHashCode, then all bets are off. Someone could implement GetHashCode by returning a random number - that's a bit of a silly thing to do, but it's perfectly possible. More likely, the object could be mutable, and its hash code could be based off its members, both of which can change over time.
It's generally a bad idea to implement GetHashCode for objects which are mutable, or in a way where the hash code can change over time (in a given AppDomain). Many of the assumptions made by classes like Dictionary<TKey, TValue> break down in this case, and you will probably see strange behaviour.
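As a hedged illustration of that strange behaviour (MutableKey is an invented type whose hash code depends on a mutable field):

using System;
using System.Collections.Generic;

// The key's hash code depends on a field that can change after insertion.
class MutableKey
{
    public int Id;

    public override bool Equals(object obj) =>
        obj is MutableKey other && other.Id == Id;

    public override int GetHashCode() => Id;
}

class MutationDemo
{
    static void Main()
    {
        var key = new MutableKey { Id = 1 };
        var dict = new Dictionary<MutableKey, string> { [key] = "value" };

        key.Id = 2; // mutate the key after it has been inserted

        // The dictionary now computes a different hash code and looks in
        // the wrong bucket, so the entry appears to be lost.
        Console.WriteLine(dict.ContainsKey(key)); // False
    }
}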
In .NET, whenever we override the Equals() method for a class, it is normal practice to override the GetHashCode() method as well. Doing so ensures better performance when the object is used in Hashtables and Dictionaries. Two keys are considered equal in a Hashtable only if their GetHashCode() values are the same. My question is: why can't Hashtables use only the Equals() method to compare keys? That would remove the burden of overriding the GetHashCode() method.
Hashtables and Dictionaries use Equals in case of a collision (when two hash codes are the same).
Why don't they use only Equals?
Because that would require a lot more processing than comparing an integer value (the hash code). Since hash codes are used as an index, finding the right bucket is O(1).
A HashSet (or Hashtable, or Dictionary) uses an array of buckets to distribute the items. Those buckets are indexed by the object's hash code (which should be immutable), so finding the bucket an item belongs in is O(1).
It then uses Equals within that bucket to find the exact match if there's more than one item with the same hash code: that's O(N), since it needs to iterate over all items within that bucket to find the match.
If a hash set used only Equals, finding an item would be O(N) and you could just as well be using a list or an array.
That's also why two equal items must have the same hashcode, but two items with the same hashcode don't necessarily need to be equal.
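A conceptual sketch of that lookup (this is not the actual BCL code; EqualityComparer<T>.Default stands in for the hash and equality logic):

using System.Collections.Generic;

static class BucketLookup
{
    // O(1) to pick the bucket from the hash code, then Equals against the
    // (hopefully few) candidates inside that bucket.
    public static bool Contains<T>(List<T>[] buckets, T item)
    {
        var comparer = EqualityComparer<T>.Default;
        int index = (comparer.GetHashCode(item) & 0x7FFFFFFF) % buckets.Length;

        List<T> bucket = buckets[index];
        if (bucket == null)
            return false;

        foreach (T candidate in bucket)
            if (comparer.Equals(candidate, item))
                return true;

        return false;
    }
}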
Two object instances that compare as equal must always have identical hash codes. If this doesn't hold, hash-based data structures will not work correctly. It's not a matter of performance.
Two object instances that don't compare as equal should ideally have different hash codes. If this doesn't hold, hash-based data structures will have degraded performance, but at least they'll still work.
Thus, for a given object instance, GetHashCode needs to reflect the logic of Equals, to some extent.
Now if you're overriding the Equals method, you're providing custom comparison logic. As an example, let's say your custom comparison logic involves only one particular data member of the instance. For the non-overridden GetHashCode method to be useful, it would have to be general enough to understand your custom Equals logic and come up with a matching hash function (one that only involves your chosen data member) on the spot.
It's not easy to write such a sophisticated GetHashCode, and it's not worth the trouble either, when the user can simply provide a custom one-liner that honors the original requirement.
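For instance, a hedged sketch in which equality is decided by a single member (Book and Isbn are invented for illustration); the matching GetHashCode is then a one-liner over that same member:

using System;

// Equality is based solely on Isbn, so the hash code is based on it too.
public sealed class Book : IEquatable<Book>
{
    public string Isbn { get; }
    public string Title { get; }   // not part of equality

    public Book(string isbn, string title) { Isbn = isbn; Title = title; }

    public bool Equals(Book other) => other != null && Isbn == other.Isbn;

    public override bool Equals(object obj) => Equals(obj as Book);

    // The custom one-liner: hash only the member that Equals uses.
    public override int GetHashCode() => Isbn?.GetHashCode() ?? 0;
}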
I'm considering implementing my own custom hash code for a given object and using it as the key for my dictionary. Since it's possible (even likely) that two objects will have the same hash code, which additional operators or methods should I override, and what should those overrides (conceptually) look like?
myDictionary.Add(myObj.GetHashCode(),myObj);
vs
myDictionary.Add(myObj,myObj);
In other words, does a Dictionary use a combination of the following in order to determine uniqueness and which bucket to place an object in?
Are some more important than others?
HashCode
Equals
==
CompareTo()
Is CompareTo() only needed for a SortedDictionary?
What is GetHashCode used for?
It is by design useful for only one thing: putting an object in a hash table. Hence the name.
GetHashCode is designed to do only one thing: balance a hash table. Do not use it for anything else. In particular:
It does not provide a unique key for an object; probability of collision is extremely high.
It is not of cryptographic strength, so do not use it as part of a digital signature or as a password equivalent
It does not necessarily have the error-detection properties needed for checksums.
and so on.
Eric Lippert
http://ericlippert.com/2011/02/28/guidelines-and-rules-for-gethashcode/
It's not the buckets that cause the problem; it is actually finding the right object instance once you have determined the bucket using the hash code. Since several objects can end up in the same bucket, object equality (Equals) is used to find the right one. The rule is that if two objects are considered equal, they should produce the same hash code, but two objects producing the same hash code might not be equal.
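A tiny sketch of that rule (Collider is an invented type that deliberately gives every instance the same hash code, so everything lands in one bucket and Equals does the real work):

using System;
using System.Collections.Generic;

// Every instance reports the same hash code, so they all land in the same
// bucket; Equals (here, the default reference equality) still tells them apart.
class Collider
{
    public override int GetHashCode() => 42;
}

class CollisionDemo
{
    static void Main()
    {
        var set = new HashSet<Collider> { new Collider(), new Collider() };
        Console.WriteLine(set.Count); // 2: same hash code, but not equal
    }
}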
Let's assume I have two objects called K and M
if(K.Equals(M))
{
}
If that's true, do K and M always have the same hash code?
Or does it depend on the programming language?
The contract for GetHashCode() requires it, but since anyone can make their own implementation it is never guaranteed.
Many classes (especially hashtables) require it in order to behave correctly.
If you are implementing a class, you should always make sure that two equal objects have the same hashcode.
If you are implementing a utility method/class, you can assume that two equal objects have the same hash code (if not, it is the other class, not yours, that is buggy).
If you are implementing something with security implications, you cannot assume it.
If that's true, do K and M always have the same hash code?
Yes.
Or rather, it should be the case. Consumers of hash codes (e.g. containers) can assume that equal objects have equal hash codes, or equivalently, that unequal hash codes mean the objects are unequal. (Unequal objects can have the same hash code: there are more possible objects than hash codes, so this has to be allowed.)
Or does it depend on the programming language?
No.
If that's true, do K and M always have the same hash code?
Yes. Unless they have a wickedly overridden Equals method. But that would be considered broken.
But note that the reverse is not true: if K and M have the same hash code, it could still be that K.Equals(M) == false.
Yes, it should return the same hash code.
I'd say it's language independent, but there's no guarantee that other programmers have implemented it correctly.
GetHashCode returns a value based on the current instance that is suited for hashing algorithms and data structures such as a hash table. Two objects that are the same type and are equal must return the same hash code to ensure that instances of System.Collections.Hashtable and System.Collections.Generic.Dictionary work correctly.
In your application the hash code does not (and generally cannot) uniquely identify an instance of an object; it only narrows down where the instance is stored. This behaviour is part of the .NET platform, so hash codes work the same way regardless of which .NET language you are authoring in.
GetHashCode() can return the same hash for different objects. You should use Equals() to compare objects, not GetHashCode(); in the case where GetHashCode() returns the same value for two objects, the implementation of Equals() should perform the additional equality checks.
Hash data structures are able to handle such cases by using collision resolution algorithms.
From Wikipedia:
Hash collisions are practically unavoidable when hashing a random subset of a large set of possible keys. For example, if 2,500 keys are hashed into a million buckets, even with a perfectly uniform random distribution, according to the birthday problem there is a 95% chance of at least two of the keys being hashed to the same slot. Therefore, most hash table implementations have some collision resolution strategy to handle such events. Some common strategies are described below. All these methods require that the keys (or pointers to them) be stored in the table, together with the associated values.
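As a rough illustration of one such strategy, here is a minimal separate-chaining sketch in C# (a teaching example, not production code: it never resizes and does not check for duplicate keys):

using System.Collections.Generic;

// Minimal separate chaining: each bucket is a list of key/value pairs,
// and keys whose hash codes collide simply share a bucket.
class ChainedMap<TKey, TValue>
{
    private readonly List<KeyValuePair<TKey, TValue>>[] _buckets =
        new List<KeyValuePair<TKey, TValue>>[16];

    private int BucketOf(TKey key) =>
        (EqualityComparer<TKey>.Default.GetHashCode(key) & 0x7FFFFFFF) % _buckets.Length;

    public void Add(TKey key, TValue value)
    {
        int i = BucketOf(key);
        if (_buckets[i] == null)
            _buckets[i] = new List<KeyValuePair<TKey, TValue>>();
        _buckets[i].Add(new KeyValuePair<TKey, TValue>(key, value));
    }

    public bool TryGetValue(TKey key, out TValue value)
    {
        var bucket = _buckets[BucketOf(key)];
        if (bucket != null)
        {
            foreach (var pair in bucket)
            {
                if (EqualityComparer<TKey>.Default.Equals(pair.Key, key))
                {
                    value = pair.Value;
                    return true;
                }
            }
        }

        value = default(TValue);
        return false;
    }
}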
It depends on the Equals implementation of the object. Equals may use GetHashCode under the hood, but it doesn't have to. So if an object has a custom Equals implementation but doesn't override GetHashCode consistently, two objects that compare equal may still have different hash codes (which breaks the expected contract).
What is the use of GetHashCode()? Can I trace object identity using GetHashCode()? If so, could you provide an example?
Hash codes aren't about identity, they're about equality. In fact, you could say they're about non-equality:
If two objects have the same hash code, they may be equal
If two objects have different hash codes, they're not equal
Hash codes are not unique, nor do they guarantee equality (two objects may have the same hash but still be unequal).
As for their uses: they're almost always used to quickly select possibly equal objects to then test for actual equality, usually in a key/value map (e.g. Dictionary<TKey, TValue>) or a set (e.g. HashSet<T>).
No, a hash code is not guaranteed to be unique. But you already have references to your objects; they are perfect for tracking identity, using object.ReferenceEquals().
The value itself is used in hashing algorithms and data structures such as hash tables.
In its default implementation, GetHashCode does not guarantee uniqueness, so it should not be used to identify .NET objects.
In your own classes, it is generally good practice to override GetHashCode so that it returns a well-distributed value based on the fields that determine equality.
It's used for algorithms/data structures that require hashing (such as a hash table). A hash code cannot on its own be used to track object identity since two objects with the same hash are not necessarily equal. However, two equal objects should have the same hash code (which is why C# emits a warning if you override one without overriding the other).