Hi, I have a class with 6 string properties. A unique object will have a different value for at least one of these fields.
To implement IEqualityComparer's GetHashCode function, I am concatenating all 6 properties and calling GetHashCode on the resulting string.
I have the following questions:
Is it necessary to call GetHashCode on a unique value?
Will the concatenation operation on the six properties make the comparison slow?
Should I use some other approach?
If your string fields are named a-f and known not to be null, this is ReSharper's proposal for your GetHashCode():
public override int GetHashCode() {
    unchecked {
        int result = a.GetHashCode();
        result = (result * 397) ^ b.GetHashCode();
        result = (result * 397) ^ c.GetHashCode();
        result = (result * 397) ^ d.GetHashCode();
        result = (result * 397) ^ e.GetHashCode();
        result = (result * 397) ^ f.GetHashCode();
        return result;
    }
}
GetHashCode does not need to return unequal values for "unequal" objects. It only needs to return equal values for equal objects (it also must return the same value for the lifetime of the object).
This means that:
If two objects compare as equal with Equals, then their GetHashCode must return the same value.
If some of the 6 string properties are not strictly read-only, they cannot take part in the GetHashCode implementation.
If you cannot satisfy both points at the same time, you should re-evaluate your design because anything else will leave the door open for bugs.
Finally, you could probably make GetHashCode faster by calling GetHashCode on each of the 6 strings and then integrating all 6 results in one value using some bitwise operations.
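Since the question is about IEqualityComparer, here is a minimal sketch of that idea as a comparer rather than an override. Record and RecordComparer are hypothetical names, and the null handling (hashing a null string to 0) is an assumption rather than something the question specifies:

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for the asker's class with six string properties.
public sealed class Record
{
    public Record(string a, string b, string c, string d, string e, string f)
    {
        A = a; B = b; C = c; D = d; E = e; F = f;
    }

    public string A { get; }
    public string B { get; }
    public string C { get; }
    public string D { get; }
    public string E { get; }
    public string F { get; }
}

public sealed class RecordComparer : IEqualityComparer<Record>
{
    // Equality compares all six fields.
    public bool Equals(Record x, Record y)
    {
        if (ReferenceEquals(x, y)) return true;
        if (x == null || y == null) return false;
        return x.A == y.A && x.B == y.B && x.C == y.C
            && x.D == y.D && x.E == y.E && x.F == y.F;
    }

    // Combines the six per-string hash codes without concatenating the strings.
    public int GetHashCode(Record r)
    {
        if (r == null) return 0;
        unchecked
        {
            int result = Hash(r.A);
            result = (result * 397) ^ Hash(r.B);
            result = (result * 397) ^ Hash(r.C);
            result = (result * 397) ^ Hash(r.D);
            result = (result * 397) ^ Hash(r.E);
            result = (result * 397) ^ Hash(r.F);
            return result;
        }
    }

    // Assumption: a null string hashes to 0.
    private static int Hash(string s)
    {
        return s == null ? 0 : s.GetHashCode();
    }
}
```

It would be passed to the collection, for example new HashSet<Record>(new RecordComparer()).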
GetHashCode() should return the same hash code for all objects that return true if you call Equals() on those objects. This means, for example, that you can return zero as the hash code regardless of what the field values are. But that would make your object very inefficient when stored in data structures such as hash tables.
Combining the strings is one option, but note that you could, for example, combine just two of the strings for the hash code (while still comparing all the strings in Equals!).
You can also combine the hashes of the six separate strings, rather than computing a single hash for a combined string. See for example
Quick and Simple Hash Code Combinations
I'm not sure if this will be significantly faster than concatenating the string.
You can use the behavior from:
http://moh-abed.com/2011/07/13/entities-and-value-objects/
Related
In .NET, whenever we override the Equals() method for a class, it is normal practice to override the GetHashCode() method as well. Doing so ensures better performance when the object is used in Hashtables and Dictionaries. Two keys are considered equal in a Hashtable only if their GetHashCode() values are the same. My question is: why can't Hashtables use the Equals() method to compare the keys? That would remove the burden of overriding GetHashCode().
Hashtables and Dictionaries use Equals in case of a collision (when two hash codes are the same).
Why don't they use only Equals?
Because that would require a lot more processing than comparing integer values (hash codes). Since hash codes are used as an index, the lookup has O(1) complexity.
A HashSet (or Hashtable, or Dictionary) uses an array of buckets to distribute the items. Those buckets are indexed by the object's hash code (which should be immutable), so finding the bucket an item is in is O(1).
Then it uses Equals within that bucket to find the exact match if there's more than one item with the same hash code: that's O(N), since it needs to iterate over all items within that bucket to find the match.
If a hash set used only Equals, finding an item would be O(N) and you could as well be using a list or an array.
That's also why two equal items must have the same hashcode, but two items with the same hashcode don't necessarily need to be equal.
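The two-step lookup described above can be sketched with a toy collection. This is an illustration only, not how the real HashSet<T> is implemented; ToyHashSet and its fixed bucket count are made up for the example:

```csharp
using System;
using System.Collections.Generic;

// Toy hash set: a fixed array of buckets, no resizing.
public sealed class ToyHashSet<T>
{
    private readonly List<T>[] _buckets;

    public ToyHashSet(int bucketCount = 16)
    {
        _buckets = new List<T>[bucketCount];
        for (int i = 0; i < bucketCount; i++)
            _buckets[i] = new List<T>();
    }

    // Step 1: the hash code selects one bucket in constant time.
    private List<T> BucketFor(T item)
    {
        int hash = item == null ? 0 : item.GetHashCode();
        int index = (hash & int.MaxValue) % _buckets.Length; // clear the sign bit first
        return _buckets[index];
    }

    // Step 2: Equals is only called on the items that share that bucket.
    public bool Contains(T item)
    {
        foreach (T candidate in BucketFor(item))
        {
            if (EqualityComparer<T>.Default.Equals(candidate, item))
                return true;
        }
        return false;
    }

    // Returns false if an equal item is already present.
    public bool Add(T item)
    {
        if (Contains(item)) return false;
        BucketFor(item).Add(item);
        return true;
    }
}
```

If two equal items returned different hash codes, step 1 could pick different buckets for them and step 2 would never get the chance to compare them.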
Two object instances that compare as equal must always have identical hash codes. If this doesn't hold, hash-based data structures will not work correctly. It's not a matter of performance.
Two object instances that don't compare as equal should ideally have different hash codes. If this doesn't hold, hash-based data structures will have degraded performance, but at least they'll still work.
Thus, for a given object instance, GetHashCode needs to reflect the logic of Equals, to some extent.
Now if you're overriding the Equals method, you're providing custom comparison logic. As an example, let's say your custom comparison logic involves only one particular data member of the instance. For a non-virtual GetHashCode method to be useful, it would have to be general enough to understand your custom Equals logic and be able to come up with a custom hash code function (one that only involves your chosen data member) on the spot.
It's not that easy to write such a sophisticated GetHashCode and it's not worth the trouble either, when the user can simply provide a custom one-liner that honors the initial requirement.
Let's assume I have two objects called K and M
if(K.Equals(M))
{
}
If that's true, do K and M always have the same hash code?
Or does it depend on the programming language?
The contract for GetHashCode() requires it, but since anyone can make their own implementation it is never guaranteed.
Many classes (especially hashtables) require it in order to behave correctly.
If you are implementing a class, you should always make sure that two equal objects have the same hashcode.
If you are implementing a utility method/class, you can assume that two equal objects have the same hashcode (if not, it is the other class, not yours, that is buggy).
If you are implementing something with security implications, you cannot assume it.
If that's true, do K and M always have the same hash code?
Yes.
Or rather it should be the case. Consumers of hash codes (eg. containers) can assume that equal objects have equal hash codes, or rather unequal hash codes means the objects are unequal. (Unequal objects can have the same hash code: there are more possible objects than hash codes so this has to be allowed.)
Or does it depend on the programming language?
No
If that's true, do K and M always have the same hash code?
Yes. Unless they have a wickedly overridden Equals method. But that would be considered broken.
But note that the reverse is not true,
if K and M have the same HashCode it could still be that K.Equals(M) == false
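For a concrete case of the reverse direction: there are far more long values than int hash codes, so some unequal longs must share a hash code. A small sketch follows; CollisionDemo is a hypothetical helper, and the specific pair used below relies on how Int64.GetHashCode currently folds the upper and lower 32 bits together, which is an implementation detail rather than a documented guarantee:

```csharp
using System;

public static class CollisionDemo
{
    // True when the two values share a hash code but are not equal.
    public static bool SameHashButNotEqual(long k, long m)
    {
        return k.GetHashCode() == m.GetHashCode() && !k.Equals(m);
    }
}
```

On the runtimes I am aware of, CollisionDemo.SameHashButNotEqual(1L, 1L << 32) is true: both values hash to the same int, yet Equals correctly reports them as different.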
Yes, it should return the same hash code.
I'd say it's language independent. But there's no guarantee that other programmers have implemented it correctly.
GetHashCode returns a value based on the current instance that is
suited for hashing algorithms and data structures such as a hash
table. Two objects that are the same type and are equal must return
the same hash code to ensure that instances of
System.Collections.HashTable and
System.Collections.Generic.Dictionary work correctly.
In your application, equal instances have to produce the same hash code (the hash code does not have to be unique per instance). This is part of the .NET platform, so the hash code should work the same regardless of which .NET language you are authoring in.
GetHashCode() could return the same hash for different objects. You should use Equals() to compare objects, not GetHashCode(). When GetHashCode() returns the same value for two objects, the Equals() implementation must perform the remaining equality checks.
Hash data structures are able to handle such cases by using collision resolution algorithms.
From wikipedia:
Hash collisions are practically unavoidable when hashing a random
subset of a large set of possible keys. For example, if 2,500 keys are
hashed into a million buckets, even with a perfectly uniform random
distribution, according to the birthday problem there is a 95% chance
of at least two of the keys being hashed to the same slot.
Therefore, most hash table implementations have some collision
resolution strategy to handle such events. Some common strategies are
described below. All these methods require that the keys (or pointers
to them) be stored in the table, together with the associated values.
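As a rough check of the figure in the quote, the probability can be computed directly. This is a minimal sketch that assumes uniformly random, independent keys; BirthdayEstimate and CollisionProbability are names made up for the example:

```csharp
using System;

public static class BirthdayEstimate
{
    // Probability that at least two of `keys` uniformly random keys
    // land in the same one of `buckets` slots.
    public static double CollisionProbability(int keys, int buckets)
    {
        double noCollision = 1.0;
        for (int i = 0; i < keys; i++)
        {
            // The (i+1)-th key must avoid the i slots already taken.
            noCollision *= 1.0 - (double)i / buckets;
        }
        return 1.0 - noCollision;
    }
}
```

BirthdayEstimate.CollisionProbability(2500, 1000000) comes out a little over 95%, in line with the quoted number.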
It depends on the Equals implementation of the object. It may use GetHashCode under the hood, but it doesn't have to. So basically, if you have an object with a custom Equals implementation, the hash codes of the two objects may differ.
Hey all, I've been reading up on the best way to implement the GetHashCode() override for objects in .NET, and most answers I run across involve somehow munging numbers together from members that are numeric types to come up with a method. Problem is, I have an object that uses an alphanumeric string as its key, and I'm wondering if there's something fundamentally wrong with just using an internal ID for objects with strings as keys, something like the following?
// Override GetHashCode() to return a permanent, unique identifier for
// this object.
static private int m_next_hash_id = 1;
private int m_hash_code = 0;
public override int GetHashCode() {
    if (this.m_hash_code == 0)
        this.m_hash_code = <type>.m_next_hash_id++;
    return this.m_hash_code;
}
Is there a better way to come up with a unique hash code for an object that uses an alphanumeric string as its key? (And no, the numeric parts of the alphanumeric string aren't unique; some of these strings don't actually have numbers in them at all.) Any thoughts would be appreciated!
You can call GetHashCode() on the non-numeric values that you use in your object.
private string m_foo;
public override int GetHashCode()
{
    return m_foo.GetHashCode();
}
This is not a good pattern for generating hashes for an object.
It's important to understand the purpose of GetHashCode(): it's a way to generate a numeric representation of the identifying properties of an object. Hash codes are used to allow an object to serve as a key in a dictionary and, in some cases, to accelerate comparisons between complex types.
If you simply generate a random value and call it a hash code, you have no repeatability. Another instance with the same key fields will have a different hash code, and will violate the behavior expected by classes like HashSet, Dictionary, etc.
If you already have an identifying string member in your object, just return its hash code.
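A minimal sketch of that advice, with Equals and GetHashCode both driven by the same key. Part is a hypothetical class name, and the sketch assumes the key is never null and never changes after construction:

```csharp
using System;
using System.Collections.Generic;

// Hypothetical type identified by an alphanumeric key.
public sealed class Part : IEquatable<Part>
{
    private readonly string m_key;

    public Part(string key)
    {
        if (key == null) throw new ArgumentNullException(nameof(key));
        m_key = key;
    }

    // Two parts are equal when their keys are equal.
    public bool Equals(Part other)
    {
        return other != null && m_key == other.m_key;
    }

    public override bool Equals(object obj)
    {
        return Equals(obj as Part);
    }

    // Same key, same hash code, regardless of which instance is asked.
    public override int GetHashCode()
    {
        return m_key.GetHashCode();
    }
}
```

With this, a second instance built from the same key is found in a HashSet<Part> or Dictionary<Part, TValue> that contains the first one, which the counter-based approach cannot offer.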
The documentation on MSDN for implementers of GetHashCode() is a must read for anyone that plans on overriding that method:
Notes to Implementers
A hash function
is used to quickly generate a number
(hash code) that corresponds to the
value of an object. Hash functions are
usually specific to each Type and, for
uniqueness, must use at least one of
the instance fields as input.
A hash function must have the
following properties:
If two objects compare as equal, the
GetHashCode method for each object
must return the same value. However,
if two objects do not compare as
equal, the GetHashCode methods for the
two object do not have to return
different values.
The GetHashCode method for an object
must consistently return the same hash
code as long as there is no
modification to the object state that
determines the return value of the
object's Equals method. Note that this
is true only for the current execution
of an application, and that a
different hash code can be returned if
the application is run again.
For the best performance, a hash
function must generate a random
distribution for all input.
For example, the implementation of the
GetHashCode method provided by the
String class returns identical hash
codes for identical string values.
Therefore, two String objects return
the same hash code if they represent
the same string value. Also, the
method uses all the characters in the
string to generate reasonably randomly
distributed output, even when the
input is clustered in certain ranges
(for example, many users might have
strings that contain only the lower
128 ASCII characters, even though a
string can contain any of the 65,535
Unicode characters).
Hash codes don't have to be unique. Provided your Equals implementation is correct, it's OK to return the same hash code for two instances. The m_next_hash_id logic is broken, since it allows two objects to have different hash codes even if they compare equal.
MSDN gives a good set of instructions on how to implement Equals and GetHashCode. Several of the examples there implement GetHashCode in terms of the hash codes of an object's fields.
Yes, a better way would be to use the hashcode of the string you already have. If the alphanumeric string defines the identity of the object you have, its hashcode will do quite nicely for the hashcode of your object.
The idea of incrementing a static field and using it as the hashcode is a bad one. The hash code should have an even distribution across the space of possible values. This ensures, amongst other things, that it will perform well when used as the key in a hashtable.
I believe you generally want GetHashCode() to return something that identifies the object by its value rather than by its instance. If I'm understanding the idea here, your method would make GetHashCode() return different hashes for two different objects with equivalent values, just because they're different instances.
GetHashCode() is meant to return a value that lets you compare two objects values, not their references.
According to MSDN, a hash function must have the following properties:
If two objects compare as equal, the GetHashCode method for each object must return the same value. However, if two objects do not compare as equal, the GetHashCode methods for the two object do not have to return different values.
The GetHashCode method for an object must consistently return the same hash code as long as there is no modification to the object state that determines the return value of the object's Equals method. Note that this is true only for the current execution of an application, and that a different hash code can be returned if the application is run again.
For the best performance, a hash function must generate a random distribution for all input.
I keep finding myself in the following scenario: I have created a class, implemented IEquatable<T> and overridden object.Equals(object). MSDN states that:
Types that override Equals must also override GetHashCode ; otherwise, Hashtable might not work correctly.
And then I usually get a bit stuck. Because how do you properly override object.GetHashCode()? I never really know where to start, and there seem to be a lot of pitfalls.
Here at StackOverflow, there are quite a few questions related to GetHashCode overriding, but most of them seem to be about quite particular cases and specific issues. Therefore I would like to get a good compilation here: an overview with general advice and guidelines. What to do, what not to do, common pitfalls, where to start, etc.
I would like it to be especially directed at C#, but I would think it will work kind of the same way for other .NET languages as well(?).
I think maybe the best way is to create one answer per topic with a quick and short answer first (close to a one-liner if at all possible), then maybe some more information, and end with related questions, discussions, blog posts, etc., if there are any. I can then create one post as the accepted answer (to get it on top) with just a "table of contents". Try to keep it short and concise. And don't just link to other questions and blog posts. Try to take the essence of them and then link to the source (especially since the source could disappear). Also, please try to edit and improve answers instead of creating lots of very similar ones.
I am not a very good technical writer, but I will at least try to format answers so they look alike, create the table of contents, etc. I will also try to search up some of the related questions here at SO that answer parts of these and maybe pull out the essence of the ones I can manage. But since I am not very strong on this topic, I will try to stay away for the most part :p
Table of contents
When do I override object.GetHashCode?
Why do I have to override object.GetHashCode()?
What are those magic numbers seen in GetHashCode implementations?
Things that I would like to be covered, but haven't been yet:
How to create the integer (How to "convert" an object into an int wasn't very obvious to me anyways).
What fields to base the hash code upon.
If it should only be on immutable fields, what if there are only mutable ones?
How to generate a good random distribution. (MSDN Property #3)
Part of this seems to be choosing a good magic prime number (I have seen 17, 23 and 397 used), but how do you choose it, and what is it for exactly?
How to make sure the hash code stays the same all through the object lifetime. (MSDN Property #2)
Especially when the equality is based upon mutable fields. (MSDN Property #1)
How to deal with fields that are complex types (not among the built-in C# types).
Complex objects and structs, arrays, collections, lists, dictionaries, generic types, etc.
For example, even though the list or dictionary might be readonly, that doesn't mean the contents of it are.
How to deal with inherited classes.
Should you somehow incorporate base.GetHashCode() into your hash code?
Could you technically just be lazy and return 0? Would heavily break MSDN guideline number #3, but would at least make sure #1 and #2 were always true :P
Common pitfalls and gotchas.
What are those magic numbers often seen in GetHashCode implementations?
They are prime numbers. Prime numbers are used for creating hash codes because prime numbers maximize the usage of the hash code space.
Specifically, start with the small prime number 3, and consider only the low-order three bits of the results:
3 * 1 = 3 = 3(mod 8) = 011
3 * 2 = 6 = 6(mod 8) = 110
3 * 3 = 9 = 1(mod 8) = 001
3 * 4 = 12 = 4(mod 8) = 100
3 * 5 = 15 = 7(mod 8) = 111
3 * 6 = 18 = 2(mod 8) = 010
3 * 7 = 21 = 5(mod 8) = 101
3 * 8 = 24 = 0(mod 8) = 000
3 * 9 = 27 = 3(mod 8) = 011
And we start over. But you'll notice that successive multiples of our prime generated every possible combination of the three bits before starting to repeat. We can get the same effect with any odd prime and any number of bits, which makes prime numbers optimal for generating near-random hash codes. The reason we usually see larger primes instead of small primes like 3 in the example above is that, for greater numbers of bits in our hash code, the results obtained from using a small prime are not even pseudo-random: they're simply an increasing sequence until an overflow is encountered. For optimal randomness, a prime number that results in overflow for fairly small coefficients should be used, unless you can guarantee that your coefficients will not be small.
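The table above can be reproduced with a few lines of code. This is just a sketch for experimenting; MultiplierDemo and DistinctResidues are names invented here, and the helper assumes small inputs so that the multiplication does not overflow:

```csharp
using System;
using System.Collections.Generic;

public static class MultiplierDemo
{
    // Counts how many distinct values (i * multiplier) mod 2^bits takes
    // for i = 1 .. 2^bits.
    public static int DistinctResidues(int multiplier, int bits)
    {
        int modulus = 1 << bits;
        var seen = new HashSet<int>();
        for (int i = 1; i <= modulus; i++)
        {
            seen.Add((i * multiplier) % modulus);
        }
        return seen.Count;
    }
}
```

MultiplierDemo.DistinctResidues(3, 3) gives 8, so all eight 3-bit patterns are reached. An even multiplier such as 2 reaches only half of them, because every product is even.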
Related links:
Why is ‘397’ used for ReSharper GetHashCode override?
Why do I have to override object.GetHashCode()?
Overriding this method is important because the following property must always remain true:
If two objects compare as equal, the GetHashCode method for each object must return the same value.
The reason, as stated by JaredPar in a blog post on implementing equality, is that
Many classes use the hash code to classify an object. In particular hash tables and dictionaries tend to place objects in buckets based on their hash code. When checking if an object is already in the hash table it will first look for it in a bucket. If two objects are equal but have different hash codes they may be put into different buckets and the dictionary would fail to lookup the object.
Related links:
Do I HAVE to override GetHashCode and Equals in new Classes?
Do I need to override GetHashCode() on reference types?
Override Equals and GetHashCode Question
Why is it important to override GetHashCode when Equals method is overriden in C#?
Properly Implementing Equality in VB
Check out Guidelines and rules for GetHashCode by Eric Lippert
You should override it whenever you have a meaningful measure of equality for objects of that type (i.e. you override Equals). If you knew the object wasn't going to be hashed for any reason you could leave it, but it's unlikely you could know this in advance.
The hash should be based only on the properties of the object that are used to define equality since two objects that are considered equal should have the same hash code. In general you would usually do something like:
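public override int GetHashCode()
{
    unchecked // let the multiplication wrap around instead of throwing on overflow
    {
        const int mc = 397; // magic constant, usually some prime (397 is only an example)
        return mc * prop1.GetHashCode() * prop2.GetHashCode() * ... * propN.GetHashCode();
    }
}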
I usually assume multiplying the values together will produce a fairly uniform distribution, assuming each property's hashcode function does the same, although this may well be wrong. Using this method, if the objects equality-defining properties change, then the hash code is also likely to change, which is acceptable given definition #2 in your question. It also deals with all types in a uniform way.
You could return the same value for all instances, although this will make any algorithms that use hashing (such as dictionaries) very slow: essentially all instances will be hashed to the same bucket, and lookup will then become O(n) instead of the expected O(1). This of course negates any benefits of using such structures for lookup.
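To see the effect for yourself, here is a small sketch of a comparer that hashes every string to 0 and counts how often Equals gets called. ConstantHashComparer is a made-up name, and the exact number of Equals calls depends on the dictionary's internals, so treat the counter as an indication only:

```csharp
using System;
using System.Collections.Generic;

// Comparer that sends every key to the same hash code.
public sealed class ConstantHashComparer : IEqualityComparer<string>
{
    // Number of times the dictionary had to fall back on Equals.
    public int EqualsCalls { get; private set; }

    public bool Equals(string x, string y)
    {
        EqualsCalls++;
        return string.Equals(x, y);
    }

    // Legal (equal strings get equal hash codes) but useless for distribution.
    public int GetHashCode(string s)
    {
        return 0;
    }
}
```

A Dictionary<string, int> built with this comparer still returns correct results, but each insert and lookup has to walk the keys already stored, so EqualsCalls grows much faster than the number of operations.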
A) You must override both Equals and GetHashCode if you want to employ value equality instead of the default reference equality. With the latter, two object references compare as equal if they both refer to the same object instance. With the former, they compare as equal if their value is the same, even if they refer to different objects. For example, you probably want to employ value equality for Date, Money, and Point objects.
B) In order to implement value equality you must override Equals and GetHashCode. Both should depend on the fields of the object that encapsulate the value. For example, Date.Year, Date.Month and Date.Day; or Money.Currency and Money.Amount; or Point.X, Point.Y and Point.Z. You should also consider overriding operator ==, operator !=, operator <, and operator >.
C) The hashcode doesn't have to stay constant all through the object lifetime. However, it must not change while the object participates as a key in a hash table. From the MSDN documentation for Dictionary: "As long as an object is used as a key in the Dictionary<TKey, TValue>, it must not change in any way that affects its hash value." If you must change the value of a key, remove the entry from the dictionary, change the key value, and add the entry back.
D) IMO, you will simplify your life if your value objects are themselves immutable.
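The remove, change, re-add sequence from point C could look roughly like this. MutableKey and KeyUpdate.Rename are hypothetical names used only for illustration:

```csharp
using System;
using System.Collections.Generic;

// Hypothetical mutable key type with value equality on Name.
public sealed class MutableKey
{
    public MutableKey(string name) { Name = name; }

    public string Name { get; set; }

    public override bool Equals(object obj)
    {
        var other = obj as MutableKey;
        return other != null && Name == other.Name;
    }

    public override int GetHashCode()
    {
        return Name == null ? 0 : Name.GetHashCode();
    }
}

public static class KeyUpdate
{
    // Takes the entry out, changes the key, then puts the entry back,
    // so the dictionary never holds a key whose hash code has changed.
    public static void Rename<TValue>(
        Dictionary<MutableKey, TValue> map, MutableKey key, string newName)
    {
        TValue value;
        if (!map.TryGetValue(key, out value))
            throw new KeyNotFoundException();
        map.Remove(key);
        key.Name = newName;
        map.Add(key, value);
    }
}
```

Setting key.Name directly while the key is still in the dictionary would leave the entry filed under the old hash code, and later lookups would typically fail to find it.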
When do I override object.GetHashCode()?
As MSDN states:
Types that override Equals must also override GetHashCode ; otherwise, Hashtable might not work correctly.
Related links:
When to override GetHashCode()?
How to generate GetHashCode() and Equals()
Visual Studio 2017
https://learn.microsoft.com/en-us/visualstudio/ide/reference/generate-equals-gethashcode-methods?view=vs-2017
ReSharper
https://www.jetbrains.com/help/resharper/Code_Generation__Equality_Members.html
What fields to base the hash code upon? If it should only be on immutable fields, what if there are only mutable ones?
It doesn't need to be based only on immutable fields. I would base it on the fields that determine the outcome of the equals method.
How to make sure the hash code stays the same all through the object lifetime. (MSDN Property #2) Especially when the equality is based upon mutable fields. (MSDN Property #1)
You seem to misunderstand Property #2. The hashcode doesn't need to stay the same throughout the object's lifetime. It just needs to stay the same as long as the values that determine the outcome of the Equals method are not changed. So logically, you base the hashcode on those values only. Then there shouldn't be a problem.
public override int GetHashCode()
{
    return IntProp1 ^ IntProp2 ^ StrProp3.GetHashCode() ^ StrProp4.GetHashCode() ^ CustomClassProp.GetHashCode();
}
Do the same in the custom class's GetHashCode method. Works like a charm.
I've got multiple classes that, for certain reasons, do not follow the official Equals contract. In the overridden GetHashCode() these classes simply return 0 so they can be used in a Hashmap.
Some of these classes implement the same interface and there are Hashmaps using this interface as key. So I figured that every class should at least return a different (but still constant) value in GetHashCode().
The question is how to select this value. Should I simply let the first class return 1, the next class 2 and so on? Or should I try something like
class SomeClass : SomeInterface {
    public override int GetHashCode() {
        return "SomeClass".GetHashCode();
    }
}
so the hash is distributed more evenly? (Do I have to cache the returned value myself or is Microsoft's compiler able to optimize this?)
Update: It is not possible to return an individual hashcode for each object, because Equals violates the contract. Specifically, I'm referring to this problem.
If it "violates the Equals contract", then I'm not sure you should be using it as a key.
If something is using that as a key, you really need to get the hashing right... it is very unclear what the Equals logic is, but two values that are considered equal must have the same hash-code. It is not required that two values with the same hash-code are equal.
Using a constant string won't really help much - you'll get the values split evenly over the types, but that is about it...
I'm curious what the reasoning would be for overriding GetHashCode() and returning a constant value. Why violate the idea of a hash rather than just violating the "contract" and not overriding the GetHashCode() function at all and leave the default implementation from Object?
Edit
If you've done that so your objects match based on their contents rather than their reference, then what you propose, having different classes simply use different constants, can WORK but is highly inefficient. What you want to do is come up with a hashing algorithm that can take the contents of your class and produce a value that balances speed with even distribution (that's hashing 101).
I guess I'm not sure what you're looking for...there isn't a "good" scheme for choosing constant numbers for this paradigm. One is not any better than the other. Try to improve your objects so that you're creating a real hash.
I ran into this exact problem when writing a vector class. I wanted to compare vectors for equality, but float operations give rounding errors, so I wanted approximate equality. Long story short, overriding equals is a bad idea unless your implementation is symmetric, reflexive, and transitive.
Other classes are going to assume equals has those properties, and so will classes using those classes, and so you can end up in weird cases. For example a list might enforce uniqueness, but end up with two elements which evaluate as equal to some element B.
A hash table is the perfect example of unpredictable behavior when you break equality. For example:
// Assume a == b, b == c, but a != c
var T = new Dictionary<YourType, int>();
T[a] = 0;
T[c] = 1;
return T[b]; // 0 or 1? Who knows!
Another example would be a Set:
// Assume a == b, b == c, but a != c
var T = new HashSet<YourType>();
T.Add(a);
T.Add(c);
if (T.Contains(b)) T.Remove(b);
// Surely T can't contain b anymore! I sure hope no one breaks the properties of equality!
if (T.Contains(b)) throw new Exception();
I suggest using another method, with a name like ApproxEquals. You might also consider overloading the == operator, because it isn't virtual and therefore won't be used accidentally by other classes like Equals could be.
If you really can't use reference equality for the hash table, don't ruin the performance of cases where you can. Add an IApproxEquals interface, implement it in your class, and add an extension method GetApprox to Dictionary which enumerates the keys looking for an approximately equal one, and returns the associated value. You could also write a custom dictionary especially for 3-dimensional vectors, or whatever you need.
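A rough sketch of that suggestion follows. Everything here is hypothetical: the generic IApproxEquals<T> shape, the Vector3 class, the tolerance value, and the Try-style TryGetApprox variant of the GetApprox idea are choices made for this example only:

```csharp
using System;
using System.Collections.Generic;

// Hypothetical interface for approximate comparison, kept separate from Equals.
public interface IApproxEquals<T>
{
    bool ApproxEquals(T other);
}

// Keeps the default reference-based Equals/GetHashCode on purpose.
public sealed class Vector3 : IApproxEquals<Vector3>
{
    public const double Tolerance = 1e-9; // arbitrary illustrative tolerance

    public Vector3(double x, double y, double z) { X = x; Y = y; Z = z; }

    public double X { get; }
    public double Y { get; }
    public double Z { get; }

    public bool ApproxEquals(Vector3 other)
    {
        return other != null
            && Math.Abs(X - other.X) <= Tolerance
            && Math.Abs(Y - other.Y) <= Tolerance
            && Math.Abs(Z - other.Z) <= Tolerance;
    }
}

public static class ApproxDictionaryExtensions
{
    // Linear scan over the keys: returns the value of the first key
    // that is approximately equal to the probe.
    public static bool TryGetApprox<TKey, TValue>(
        this Dictionary<TKey, TValue> map, TKey probe, out TValue value)
        where TKey : IApproxEquals<TKey>
    {
        foreach (KeyValuePair<TKey, TValue> pair in map)
        {
            if (pair.Key.ApproxEquals(probe))
            {
                value = pair.Value;
                return true;
            }
        }
        value = default(TValue);
        return false;
    }
}
```

Normal lookups keep their usual speed because the dictionary's own hashing is untouched; only callers that explicitly ask for an approximate match pay for the O(n) scan.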
When hash collisions occur, the HashTable/Dictionary calls Equals to find the key you're looking for. Using a constant hash code removes the speed advantages of using a hash in the first place - it becomes a linear search.
You're saying the Equals method hasn't been implemented according to the contract. What exactly do you mean by this? Depending on the kind of violation, the HashTable or Dictionary will merely be slow (linear search) or not work at all.