C# Immutability and Equality - c#

I'm trying to create and use only immutable classes where all fields are readonly immutable types, though there may be additional fields which are mutable and not considered to be part of the object's state (mainly a cached hashcode).
When implementing IEquatable I do the same as I would for non immutable objects
Ie,
public bool Equals(MyImmutableType o) =>
object.Equals(this.x, o.x) && object.Equals(this.y, o.y);
Now being immutable this seems inefficient, the object will never change, if I could calculate and store some unique fingerprint of it I could simply compare fingerprints instead of whole fields (which may call their own Equals etc).
I am wondering what can be a good solution for this ? will BinaryFormatter + MD5 be worth exploring ?

Since you've already overridden Equals, you are required to also overload GetHashCode. Remember, the fundamental rule of GetHashCode is equal objects have equal hashes.
Therefore, you have overridden GetHashCode.
Since equal objects are required to have equal hash codes, you can implement Equals as:
public static bool Equals(M a, M b)
{
if (object.ReferenceEquals(a, b)) return true;
// If both of them are null, we're done, but maybe one is.
if (object.ReferenceEquals(null, a)) return false;
if (object.ReferenceEquals(null, b)) return false;
// Both are not null.
if (a.GetHashCode() != b.GetHashCode()) return false;
if (!object.Equals(a.x, b.x)) return false;
if (!object.Equals(a.y, b.y)) return false;
return true;
}
And now you can implement as many instance versions of Equals as you like by calling the static helper. Also overload == and != while you're at it.
That implementation takes as many early outs as possible. Of course, the worst-performing case is the case where we have value equality but not reference equality, but that's also the rarest case! In practice, most objects are unequal to each other, and most objects that are equal to each other are reference equal. In those 99% cases we get the right answer in four or fewer highly efficient comparisons.
If you are in a scenario where it is extremely common for there to be objects that are value equal but not reference equal, then solve the problem in the factory; memoize the factory!

Related

Why should I *not* override GetHashCode()?

My search for a helper to correctly combine constituent hashcodes for GetHashCode() seemed to garner some hostility. I got the impression from the comments that some C# developers don't think you should override GetHashCode() often - certainly some commenters seemed to think that a library for helping get the behaviour right would be useless. Such functionality was considered useful enough in Java for the Java community to ask for it to be added to the JDK, and it's now in JDK 7.
Is there some fundamental reason that in C# you don't need to - or should definitely not - override GetHashCode() (and correspondingly, Equals()) as often as in Java? I find myself doing this often with Java, for example whenever I create a type that I know I want to keep in a HashSet or use as a key in a HashMap (equivalently, .net Dictionary).
C# has built-in value types which provide value equality, whereas Java does not. So writing your own hashcode in Java may be a necessity, whereas doing it in C# may be a premature optimisation.
It's common to write a type to use as a composite key to use in a Dictionary/HashMap. Often on such types you need value equality (equivalence) as opposed to reference equality(identity), for example:
IDictionary<Person, IList<Movie> > moviesByActor; // e.g. initialised from DB
// elsewhere...
Person p = new Person("Chuck", "Norris");
IList<Movie> chuckNorrisMovies = moviesByActor[p];
Here, if I need to create a new instance of Person to do the lookup, I need Person to implement value equality otherwise it won't match existing entries in the Dictionary as they have a different identity.
To get value equality, you need an overridden Equals() and GetHashCode(), in both languages.
C#'s structs (value types) implement value equality for you (albeit a potentially inefficient one), and provide a consistent implementation of GetHashCode. This may suffice for many people's needs and they won't go further to implement their own improved version unless performance problems dictate otherwise.
Java has no such built-in language feature. If you want to create a type with value equality semantics to use as a composite key, you must implement equals() and correspondingly hashCode() yourself. (There are third-party helpers and libraries to help you do this, but nothing built into the language itself).
I've described C# value types as 'potentially inefficient' for use in a Dictionary because:
The implementation of ValueType.Equals itself can sometimes be slow. This is used in Dictionary lookups.
The implementation of ValueType.GetHashCode, whilst correct, can yield many collisions leading to very poor Dictionary performance also. Have a look at this answer to a Q by Jon Skeet, which demonstrates that KeyValuePair<ushort, uint> appears to always yield the same hashCode!
If your object represents a value or type, then you SHOULD override the GetHashCode() along with Equals. I never override hash codes for control classes, like "App". Though I see no reason why even overriding GetHashCode() in those circumstances would be a problem as they will never be put in a position to interfere with collection indexing or comparisons.
Example:
public class ePoint : eViewModel, IEquatable<ePoint>
{
public double X;
public double Y;
// Methods
#region IEquatable Overrides
public override bool Equals(object obj)
{
if (Object.ReferenceEquals(obj, null)) { return false; }
if (Object.ReferenceEquals(this, obj)) { return true; }
if (!(obj is ePoint)) { return false; }
return Equals((ePoint)obj);
}
public bool Equals(ePoint other)
{
return X == other.X && Y == other.Y;
}
public override int GetHashCode()
{
return (int)Math.Pow(X,Y);
}
#endregion
I wrote a helper class to implement GetHashCode(), Equals(), and CompareTo() using value semantics from an array of properties.

How to check if two dictionaries contain the same values?

I've got this, but it's so short I'm nearly sure I'm missing something:
public static bool ValueEquals<TKey, TValue>
(this IDictionary<TKey, TValue> source, IDictionary<TKey, TValue> toCheck)
{
if (object.ReferenceEquals(source, toCheck))
return true;
if (source == null || toCheck == null || source.Count != toCheck.Count)
return false;
return source.OrderBy(t => t.Key).SequenceEqual(toCheck.OrderBy(t => t.Key));
}
So basically, if they have an equal reference, return true. If either of them are null or their counts are different, return false. Then return if the sequences (ordered by their keys and then their values) are the same. I must be missing something, as it's far too short to be good enough.
Yes, your code will work, so long as the keys all implement IComparable and both the keys and values have an Equals method that compares what you want it to compare. If either the keys or the values don't have appropriate implementations of those methods then this won't work.
Your method also doesn't provide functionality for custom IComparer or IEqualityComparer objects to be passed in to account for the cases where the object doesn't have a desirable implementation of one of those methods. Whether this is an issue in your particular case we can't say.
Your solution also needs to sort all of the values, which is somewhat less efficient than other possible implementations of a set equals, but it's not dramatically worse, so if you don't have particularly large collections that shouldn't be a huge issue.
A method with comparable functionality to yours but improved speed would be (keeping the first two checks you have):
return !source.Except(toCheck).Any();
Since this method doesn't rely on sorting it also provides the benefit of not needing TKey to implement IComparable.
A key reason that both this method and your method works is due to the fact that KeyValuePair overrides it's definition of Equals and GetHashCode to to be based on it's own reference, but rather on the key and value it wraps. Two KeyValuePairs are equal if both the keys and values are equal, and the hash code incorporates the hash code of both the key and the value.

Overriding GetHashCode()

In this article, Jon Skeet mentioned that he usually uses this kind of algorithm for overriding GetHashCode().
public override int GetHashCode()
{
unchecked // Overflow is fine, just wrap
{
int hash = 17;
// Suitable nullity checks etc, of course :)
hash = hash * 23 + Id.GetHashCode();
return hash;
}
}
Now, I've tried using this, but Resharper tells me that the method GetHashCode() should be hashing using only read-only fields (it compiles fine, though). What would be a good practice, because right now I can't really have my fields to be read-only?
I tried generating this method by Resharper, here's the result.
public override int GetHashCode()
{
return base.GetHashCode();
}
This doesn't contribute much, to be honest...
If all your fields are mutable and you have to implement GetHashCode method, I am afraid this is the implementation you would need to have.
public override int GetHashCode()
{
return 1;
}
Yes, this is inefficient but this is at least correct.
The problem is that GetHashCode is being used by Dictionary and HashSet collections to place each item in a bucket. If hashcode is calculated based on some mutable fields and the fields are really changed after the object is placed into the HashSet or Dictionary, the object can no longer be found from the HashSet or Dictionary.
Note that with all the objects returning the same HashCode 1, this basically means all the objects are being put in the same bucket in the HashSet or Dictionary. So, there is always only one single bucket in the HashSet or Dictionary. When trying to lookup the object, it will do a equality check on each of the objects inside the only bucket. This is like doing a search in a linked list.
Somebody may argue that implementing the hashcode based on mutable fields can be fine if we can make sure fields are never changed after the objects added to HashCode or Dictionary collection. My personal view is that this is error-prone. Somebody taking over your code two years later might not be aware of this and breaks the code accidentally.
Please note that your GetHashCode must go hand in hand with your Equals method. And if you can just use reference equality (when you'd never have two different instances of your class that can be equal) then you can safely use Equals and GetHashCode that are inherited from Object. This would work much better than simply return 1 from GetHashCode.
I personally tend to return a different numeric value for each implementation of GetHashCode() in a class which has no immutable fields. This means if I have a dictionary containing different implementing types, there is a chance the different instances of different types will be put in different buckets.
For example
public class A
{
// TODO Equals override
public override int GetHashCode()
{
return 21313;
}
}
public class B
{
// TODO Equals override
public override int GetHashCode()
{
return 35507;
}
}
Then if I have a Dictionary<object, TValue> containing instances of A, B and other types , the performance of the lookup will be better than if all the implementations of GetHashCode returned the same numeric value.
It should also be noted that I make use of prime numbers to get a better distribution.
As per the comments I have provided a LINQPad sample here which demonstrates the performance difference between using return 1 for different types and returning a different value for each type.

Why does C# not implement GetHashCode for Collections?

I am porting something from Java to C#. In Java the hashcode of a ArrayList depends on the items in it. In C# I always get the same hashcode from a List...
Why is this?
For some of my objects the hashcode needs to be different because the objects in their list property make the objects non-equal. I would expect that a hashcode is always unique for the object's state and only equals another hashcode when the object is equal. Am I wrong?
In order to work correctly, hashcodes must be immutable – an object's hash code must never change.
If an object's hashcode does change, any dictionaries containing the object will stop working.
Since collections are not immutable, they cannot implement GetHashCode.
Instead, they inherit the default GetHashCode, which returns a (hopefully) unique value for each instance of an object. (Typically based on a memory address)
Hashcodes must depend upon the definition of equality being used so that if A == B then A.GetHashCode() == B.GetHashCode() (but not necessarily the inverse; A.GetHashCode() == B.GetHashCode() does not entail A == B).
By default, the equality definition of a value type is based on its value, and of a reference type is based on it's identity (that is, by default an instance of a reference type is only equal to itself), hence the default hashcode for a value type is such that it depends on the values of the fields it contains* and for reference types it depends on the identity. Indeed, since we ideally want the hashcodes for non-equal objects to be different particularly in the low-order bits (most likely to affect the value of a re-hashing), we generally want two equivalent but non-equal objects to have different hashes.
Since an object will remain equal to itself, it should also be clear that this default implementation of GetHashCode() will continue to have the same value, even when the object is mutated (identity does not mutate even for a mutable object).
Now, in some cases reference types (or value types) re-define equality. An example of this is string, where for example "ABC" == "AB" + "C". Though there are two different instances of string compared, they are considered equal. In this case GetHashCode() must be overridden so that the value relates to the state upon which equality is defined (in this case, the sequence of characters contained).
While it is more common to do this with types that also are immutable, for a variety of reasons, GetHashCode() does not depend upon immutability. Rather, GetHashCode() must remain consistent in the face of mutability - change a value that we use in determining the hash, and the hash must change accordingly. Note though, that this is a problem if we are using this mutable object as a key into a structure using the hash, as mutating the object changes the position in which it should be stored, without moving it to that position (it's also true of any other case where the position of an object within a collection depends on its value - e.g. if we sort a list and then mutate one of the items in the list, the list is no longer sorted). However, this doesn't mean that we must only use immutable objects in dictionaries and hashsets. Rather it means that we must not mutate an object that is in such a structure, and making it immutable is a clear way to guarantee this.
Indeed, there are quite a few cases where storing mutable objects in such structures is desirable, and as long as we don't mutate them during this time, this is fine. Since we don't have the guarantee immutability brings, we then want to provide it another way (spending a short time in the collection and being accessible from only one thread, for example).
Hence immutability of key values is one of those cases where something is possible, but generally a idea. To the person defining the hashcode algorithm though, it's not for them to assume any such case will always be a bad idea (they don't even know the mutation happened while the object was stored in such a structure); it's for them to implement a hashcode defined on the current state of the object, whether calling it in a given point is good or not. Hence for example, a hashcode should not be memoised on a mutable object unless the memoisation is cleared on every mutate. (It's generally a waste to memoise hashes anyway, as structures that hit the same objects hashcode repeatedly will have their own memoisation of it).
Now, in the case in hand, ArrayList operates on the default case of equality being based on identity, e.g.:
ArrayList a = new ArrayList();
ArrayList b = new ArrayList();
for(int i = 0; i != 10; ++i)
{
a.Add(i);
b.Add(i);
}
return a == b;//returns false
Now, this is actually a good thing. Why? Well, how do you know in the above that we want to consider a as equal to b? We might, but there are plenty of good reasons for not doing so in other cases too.
What's more, it's much easier to redefine equality from identity-based to value-based, than from value-based to identity-based. Finally, there are more than one value-based definitions of equality for many objects (classic case being the different views on what makes a string equal), so there isn't even a one-and-only definition that works. For example:
ArrayList c = new ArrayList();
for(short i = 0; i != 10; ++i)
{
c.Add(i);
}
If we considered a == b above, should we consider a == c aslo? The answer depends on just what we care about in the definition of equality we are using, so the framework could't know what the right answer is for all cases, since all cases don't agree.
Now, if we do care about value-based equality in a given case we have two very easy options. The first is to subclass and over-ride equality:
public class ValueEqualList : ArrayList, IEquatable<ValueEqualList>
{
/*.. most methods left out ..*/
public Equals(ValueEqualList other)//optional but a good idea almost always when we redefine equality
{
if(other == null)
return false;
if(ReferenceEquals(this, other))//identity still entails equality, so this is a good shortcut
return true;
if(Count != other.Count)
return false;
for(int i = 0; i != Count; ++i)
if(this[i] != other[i])
return false;
return true;
}
public override bool Equals(object other)
{
return Equals(other as ValueEqualList);
}
public override int GetHashCode()
{
int res = 0x2D2816FE;
foreach(var item in this)
{
res = res * 31 + (item == null ? 0 : item.GetHashCode());
}
return res;
}
}
This assumes that we will always want to treat such lists this way. We can also implement an IEqualityComparer for a given case:
public class ArrayListEqComp : IEqualityComparer<ArrayList>
{//we might also implement the non-generic IEqualityComparer, omitted for brevity
public bool Equals(ArrayList x, ArrayList y)
{
if(ReferenceEquals(x, y))
return true;
if(x == null || y == null || x.Count != y.Count)
return false;
for(int i = 0; i != x.Count; ++i)
if(x[i] != y[i])
return false;
return true;
}
public int GetHashCode(ArrayList obj)
{
int res = 0x2D2816FE;
foreach(var item in obj)
{
res = res * 31 + (item == null ? 0 : item.GetHashCode());
}
return res;
}
}
In summary:
The default equality definition of a reference type is dependant upon identity alone.
Most of the time, we want that.
When the person defining the class decides that this isn't what is wanted, they can override this behaviour.
When the person using the class wants a different definition of equality again, they can use IEqualityComparer<T> and IEqualityComparer so their that dictionaries, hashmaps, hashsets, etc. use their concept of equality.
It's disastrous to mutate an object while it is the key to a hash-based structure. Immutability can be used of ensure this doesn't happen, but is not compulsory, nor always desirable.
All in all, the framework gives us nice defaults and detailed override possibilities.
*There is a bug in the case of a decimal within a struct, because there is a short-cut used in some cases with stucts when it is safe and not othertimes, but while a struct containing a decimal is one case when the short-cut is not safe, it is incorrectly identified as a case where it is safe.
Yes, you are wrong. In both Java and C#, being equal implies having the same hash-code, but the converse is not (necessarily) true.
See GetHashCode for more information.
It is not possible for a hashcode to be unique across all variations of most non-trivial classes. In C# the concept of List equality is not the same as in Java (see here), so the hash code implementation is also not the same - it mirrors the C# List equality.
You're only partly wrong. You're definitely wrong when you think that equal hashcodes means equal objects, but equal objects must have equal hashcodes, which means that if the hashcodes differ, so do the objects.
The core reasons are performance and human nature - people tend to think about hashes as something fast but it normally requires traversing all elements of an object at least once.
Example: If you use a string as a key in a hash table every query has complexity O(|s|) - use 2x longer strings and it will cost you at least twice as much. Imagine that it was a full blown tree (just a list of lists) - oops :-)
If full, deep hash calculation was a standard operation on a collection, enormous percentage of progammers would just use it unwittingly and then blame the framework and the virtual machine for being slow. For something as expensive as full traversal it is crucial that a programmer has to be aware of the complexity. The only was to achieve that is to make sure that you have to write your own. It's a good deterrent as well :-)
Another reason is updating tactics. Calculating and updating a hash on the fly vs. doing the full calculation every time requires a judgement call depending on concrete case in hand.
Immutabilty is just an academic cop out - people do hashes as a way of detecting a change faster (file hashes for example) and also use hashes for complex structures which change all the time. Hash has many more uses beyong the 101 basics. The key is again that what to use for a hash of a complex object has to be a judgement call on a case by case basis.
Using object's address (actually a handle so it doesn't change after GC) as a hash is actually the case where the hash value remains the same for arbitrary mutable object :-) The reason C# does it is that it's cheap and again nudges people to calculate their own.
Why is too philosophical. Create helper method (may be extension method) and calculate hashcode as you like. May be XOR elements' hashcodes

C# How to select a Hashcode for a class that violates the Equals contract?

I've got multiple classes that, for certain reasons, do not follow the official Equals contract. In the overwritten GetHashCode() these classes simply return 0 so they can be used in a Hashmap.
Some of these classes implement the same interface and there are Hashmaps using this interface as key. So I figured that every class should at least return a different (but still constant) value in GetHashCode().
The question is how to select this value. Should I simply let the first class return 1, the next class 2 and so on? Or should I try something like
class SomeClass : SomeInterface {
public overwrite int GetHashCode() {
return "SomeClass".GetHashCode();
}
}
so the hash is distributed more evenly? (Do I have to cache the returned value myself or is Microsoft's compiler able to optimize this?)
Update: It is not possible to return an individual hashcode for each object, because Equals violates the contract. Specifially, I'm refering to this problem.
If it "violates the Equals contract", then I'm not sure you should be using it as a key.
It something is using that as a key, you really need to get the hashing right... it is very unclear what the Equals logic is, but two values that are considered equal must have the same hash-code. It is not required that two values with the same hash-code are equal.
Using a constant string won't really help much - you'll get the values split evenly over the types, but that is about it...
I'm curious what the reasoning would be for overriding GetHashCode() and returning a constant value. Why violate the idea of a hash rather than just violating the "contract" and not overriding the GetHashCode() function at all and leave the default implementation from Object?
Edit
If what you've done is that so you can have your objects match based on their contents rather than their reference then what you propose with having different classes simply use different constants can WORK, but is highly inefficient. What you want to do is come up with a hashing algorithm that can take the contents of your class and produce a value that balances speed with even distribution (that's hashing 101).
I guess I'm not sure what you're looking for...there isn't a "good" scheme for choosing constant numbers for this paradigm. One is not any better than the other. Try to improve your objects so that you're creating a real hash.
I ran into this exact problem when writing a vector class. I wanted to compare vectors for equality, but float operations give rounding errors, so I wanted approximate equality. Long story short, overriding equals is a bad idea unless your implementation is symmetric, reflexive, and transitive.
Other classes are going to assume equals has those properties, and so will classes using those classes, and so you can end up in weird cases. For example a list might enforce uniqueness, but end up with two elements which evaluate as equal to some element B.
A hash table is the perfect example of unpredictable behavior when you break equality. For example:
//Assume a == b, b == c, but a != c
var T = new Dictionary<YourType, int>()
T[a] = 0
T[c] = 1
return T[b] //0 or 1? who knows!
Another example would be a Set:
//Assume a == b, b == c, but a != c
var T = new HashSet<YourType>()
T.Add(a)
T.Add(c)
if (T.contains(b)) then T.remove(b)
//surely T can't contain b anymore! I sure hope no one breaks the properties of equality!
if (T.contains(b)) then throw new Exception()
I suggest using another method, with a name like ApproxEquals. You might also consider overriding the == operator, because it isn't virtual and therefore won't be used accidentally by other classes like Equals could be.
If you really can't use reference equality for the hash table, don't ruin the performance of cases where you can. Add an IApproxEquals interface, implement it in your class, and add an extension method GetApprox to Dictionary which enumerates the keys looking for an approximately equal one, and returns the associated value. You could also write a custom dictionary especially for 3-dimensional vectors, or whatever you need.
When hash collisions occur, the HashTable/Dictionary calls Equals to find the key you're looking for. Using a constant hash code removes the speed advantages of using a hash in the first place - it becomes a linear search.
You're saying the Equals method hasn't been implemented according to the contract. What exactly do you mean with this? Depending on the kind of violation, the HashTable or Dictionary will merely be slow (linear search) or not work at all.

Categories