Is it a good practice to use your domain objects as keys in a Dictionary ?
I have a scenario where I populate my domain object using NHibernate.
For performing business logic I need to look-up Dictionary. I can make use of
Either
IDictionary<int, ValueFortheObject>
or
Dictionary<DomainObject, ValueFortheObject>
The second option seems better to me as
I can write easier test cases and can use real domain objects in test cases as well rather than using Mock<DomainObject> (if I go with the first option) since setter on the Id is private on ALL domain objects.
The code is more readable as Dictionary<Hobbit,List<Adventures>> is more readable to me than Dictionary<int,List<Adventures>> with comments suggesting int is the hobbitId especially when passing as parameters
My questions are :
What are advantages of using the first option over the second (which I might be blindly missing) ?
Will there be any performance issues using the second approach?
Update 01:
My domain models implement Equals and GetHashCode, and they DO NOT get mutated while performing the operations.
Will there be issues with performance using them as keys ? or am I completely missing the point here and performance / memory are not related to what keys are being used?
Update 02:
My question is
Will there be any issues with performance or memory if I use Objects as keys instead of primitive types and WHY / HOW ?
The biggest issue you will run into is that you must not mutate your key object while it is performing its role as a key.
Your key object must implement Equals and GetHashCode to be used as a key in a dictionary. When I say "mutate", I mean that anything you do to the object while it is being used as a key must not change the value of GetHashCode, nor cause Equals to evaluate to true with any other key in the collection.
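To make the failure mode concrete, here is a minimal sketch. The Counter type and the MutatedKeyDemo helper are hypothetical names made up for illustration; the hash code is simply the mutable Value, so changing Value changes the hash while the object is still a key.

```csharp
using System.Collections.Generic;

// Hypothetical key type whose equality and hash both depend on a mutable property.
public sealed class Counter
{
    public int Value { get; set; }

    public Counter(int value) { Value = value; }

    public override bool Equals(object obj)
    {
        var other = obj as Counter;
        return other != null && Value == other.Value;
    }

    public override int GetHashCode() { return Value; }
}

public static class MutatedKeyDemo
{
    // Returns true if the key is still found after being mutated in place.
    public static bool FoundAfterMutation()
    {
        var key = new Counter(1);
        var lookup = new Dictionary<Counter, string>();
        lookup[key] = "stored under hash 1";

        key.Value = 2; // the hash code is now 2, but the entry was filed under 1

        return lookup.ContainsKey(key);
    }
}
```

FoundAfterMutation() returns false: the entry is still in the dictionary, but it can no longer be reached through the very object that was used to add it.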
@ScottChamberlain gives you the overall issue, but your use cases could argue either way. Some questions you should ask yourself are: What does it mean for two business objects to be equal? Is this the same or different if they are being used as a key in a dictionary or if I'm comparing them elsewhere? If I change an object, should its value as a key change or remain the same? If you use overrides for GetHashCode() and Equals(), what is the cost of computing those functions?
Generally I am in favor of using simple types for keys as there is a lot of room for misunderstanding with respect to object equality. You could always create a custom dictionary (wrapper around Dictionary<Key,Value>) with appropriate method if readability is your highest concern. You could then write the methods in terms of the objects, then use whatever (appropriate) property you want as the key internally.
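A rough sketch of such a wrapper, under some assumptions: Hobbit here is a stand-in for the domain object (with a constructor that sets Id purely for illustration, which a real NHibernate entity would not have), and HobbitLookup is a hypothetical name. Callers work with Hobbit instances, while the int Id is the actual key.

```csharp
using System.Collections.Generic;

// Stand-in domain object; only the Id takes part in the lookup.
public sealed class Hobbit
{
    public int Id { get; private set; }
    public string Name { get; set; }

    public Hobbit(int id, string name) { Id = id; Name = name; }
}

// Wrapper exposing an object-based API while keying on the int Id internally.
public sealed class HobbitLookup<TValue>
{
    private readonly Dictionary<int, TValue> _byId = new Dictionary<int, TValue>();

    public void Set(Hobbit hobbit, TValue value)
    {
        _byId[hobbit.Id] = value;
    }

    public bool TryGet(Hobbit hobbit, out TValue value)
    {
        return _byId.TryGetValue(hobbit.Id, out value);
    }
}
```

Because the key is the Id, changing other properties of the object (Name, in this sketch) has no effect on lookups, and the signatures still read in terms of Hobbit rather than int.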
Related
Ok, before you get all mad because there are hundreds of similar sounding questions posted on the internet, I can assure you that I have just spent the last few hours reading all of them and have not found the answer to my question.
Background:
Basically, one of my large scale applications had been suffering from a situation where some Bindings on the ListBox.SelectedItem property would stop working or the program would crash after an edit had been made to the currently selected item. I initially asked the 'An item with the same key has already been added' Exception on selecting a ListBoxItem from code question here, but got no answers.
I hadn't had time to address that problem until this week, when I was given a number of days to sort it out. Now to cut a long story short, I found out the reason for the problem. It was because my data type classes had overridden the Equals method and therefore the GetHashCode method as well.
Now for those of you that are unaware of this issue, I discovered that you can only implement the GetHashCode method using immutable fields/properties. Using an excerpt from Harvey Kwok's answer to the Overriding GetHashCode() post to explain this:
The problem is that GetHashCode is being used by Dictionary and HashSet collections to place each item in a bucket. If hashcode is calculated based on some mutable fields and the fields are really changed after the object is placed into the HashSet or Dictionary, the object can no longer be found from the HashSet or Dictionary.
So the actual problem was caused because I had used mutable properties in the GetHashCode methods. When users changed these property values in the UI, the associated hash code values of the objects changed and then items could no longer be found in their collections.
Question:
So, my question is what is the best way of handling the situation where I need to implement the GetHashCode method in classes with no immutable fields? Sorry, let me be more specific, as that question has been asked before.
The answers in the Overriding GetHashCode() post suggest that in these situations, it is better to simply return a constant value... some suggest returning the value 1, while others suggest returning a prime number. Personally, I can't see any difference between these suggestions because I would have thought that there would only be one bucket used for either of them.
Furthermore, the Guidelines and rules for GetHashCode article in Eric Lippert's Blog has a section titled Guideline: the distribution of hash codes must be "random" which highlights the pitfalls of using an algorithm that results in not enough buckets being used. He warns of algorithms that decrease the number of buckets used and cause a performance problem when the bucket gets really big. Surely, returning a constant falls into this category.
I had an idea of adding an extra Guid field to all of my data type classes (just in C#, not the database) specifically to be used in and only in the GetHashCode method. So I suppose at the end of this long intro, my actual question is which implementation is better? To summarise:
Summary:
When overriding Object.GetHashCode() in classes with no immutable fields, is it better to return a constant from the GetHashCode method, or to create an additional readonly field for each class, solely to be used in the GetHashCode method? If I should add a new field, what type should it be and shouldn't I then include it in the Equals method?
While I am happy to receive answers from anyone, I am really hoping to receive answers from advanced developers with a sound knowledge on this subject.
Go back to basics. You read my article; read it again. The two ironclad rules that are relevant to your situation are:
if x equals y then the hash code of x must equal the hash code of y. Equivalently: if the hash code of x does not equal the hash code of y then x and y must be unequal.
the hash code of x must remain stable while x is in a hash table.
Those are requirements for correctness. If you can't guarantee those two simple things then your program will not be correct.
You propose two solutions.
Your first solution is that you always return a constant. That meets the requirement of both rules, but you are then reduced to linear searches in your hash table. You might as well use a list.
The other solution you propose is to somehow produce a hash code for each object and store it in the object. That is perfectly legal provided that equal items have equal hash codes. If you do that then you are restricted such that x equals y must be false if the hash codes differ. This seems to make value equality basically impossible. Since you wouldn't be overriding Equals in the first place if you wanted reference equality, this seems like a really bad idea, but it is legal provided that equals is consistent.
I propose a third solution, which is: never put your object in a hash table, because a hash table is the wrong data structure in the first place. The point of a hash table is to quickly answer the question "is this given value in this set of immutable values?" and you don't have a set of immutable values, so don't use a hash table. Use the right tool for the job. Use a list, and live with the pain of doing linear searches.
A fourth solution is: hash on the mutable fields used for equality, remove the object from all hash tables it is in just before every time you mutate it, and put it back in afterwards. This meets both requirements: the hash code agrees with equality, and hashes of objects in hash tables are stable, and you still get fast lookups.
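The fourth solution could be sketched roughly as follows. Contact and HashSetMutation.Mutate are hypothetical names, and the helper only handles a single HashSet<T>; an object that lives in several hash tables would have to be removed from and re-added to every one of them.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical mutable type that hashes on the same field Equals compares.
public sealed class Contact
{
    public string Name { get; set; }

    public Contact(string name) { Name = name; }

    public override bool Equals(object obj)
    {
        var other = obj as Contact;
        return other != null && Name == other.Name;
    }

    public override int GetHashCode()
    {
        return Name == null ? 0 : Name.GetHashCode();
    }
}

public static class HashSetMutation
{
    // Takes the item out of the set, applies the change, then puts it back,
    // so the set never holds an item whose hash code has gone stale.
    public static void Mutate<T>(HashSet<T> set, T item, Action<T> change)
    {
        bool wasPresent = set.Remove(item);
        change(item);
        if (wasPresent)
        {
            set.Add(item);
        }
    }
}
```

A call would look like HashSetMutation.Mutate(contacts, contact, c => c.Name = "Bilbo"); afterwards the set can still find the object under its new name.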
I would either create an additional readonly field or else throw NotSupportedException. In my view the other option is meaningless. Let's see why.
Distinct (fixed) hash codes
Providing distinct hash codes is easy, e.g.:
class Sample
{
    private static int counter;
    private readonly int hashCode;

    public Sample() { this.hashCode = counter++; }

    public override int GetHashCode()
    {
        return this.hashCode;
    }

    public override bool Equals(object other)
    {
        return object.ReferenceEquals(this, other);
    }
}
Technically you have to look out for creating too many objects and overflowing the counter here, but in practice I think that's not going to be an issue for anyone.
The problem with this approach is that instances will never compare equal. However, that's perfectly fine if you only want to use instances of Sample as indexes into a collection of some other type.
Constant hash codes
If there is any scenario in which distinct instances should compare equal then at first glance you have no other choice than returning a constant. But where does that leave you?
Locating an instance inside a container will always degenerate to the equivalent of a linear search. So in effect by returning a constant you allow the user to make a keyed container for your class, but that container will exhibit the performance characteristics of a LinkedList<T>. This might be obvious to someone familiar with your class, but personally I see it as letting people shoot themselves in the foot. If you know beforehand that a Dictionary won't behave as one might expect, then why let the user create one? In my view, better to throw NotSupportedException.
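One way to see the degradation is to count how often Equals gets called. This is only an illustrative sketch; ConstantHashItem and ConstantHashDemo are hypothetical names, and the static counter exists purely for the measurement.

```csharp
using System.Collections.Generic;

// Hypothetical type that returns the same hash code for every instance.
public sealed class ConstantHashItem
{
    public static int EqualsCalls; // counts comparisons, for illustration only

    public int Value { get; private set; }

    public ConstantHashItem(int value) { Value = value; }

    public override bool Equals(object obj)
    {
        EqualsCalls++;
        var other = obj as ConstantHashItem;
        return other != null && Value == other.Value;
    }

    public override int GetHashCode() { return 1; }
}

public static class ConstantHashDemo
{
    // Adds itemCount distinct items to a HashSet and reports how many
    // Equals calls that took.
    public static int CountEqualsCalls(int itemCount)
    {
        ConstantHashItem.EqualsCalls = 0;
        var set = new HashSet<ConstantHashItem>();
        for (int i = 0; i < itemCount; i++)
        {
            set.Add(new ConstantHashItem(i));
        }
        return ConstantHashItem.EqualsCalls;
    }
}
```

Since every item shares one hash code, each Add has to be compared against the items already stored, so the number of comparisons grows roughly quadratically with the number of items instead of staying close to constant per insertion.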
But throwing is what you must not do!
Some people will disagree with the above, and when those people are smarter than oneself then one should pay attention. First of all, this code analysis warning states that GetHashCode should not throw. That's something to think about, but let's not be dogmatic. Sometimes you have to break the rules for a reason.
However, that is not all. In his blog post on the subject, Eric Lippert says that if you throw from inside GetHashCode then
your object cannot be a result in many LINQ-to-objects queries that use hash tables
internally for performance reasons.
Losing LINQ is certainly a bummer, but fortunately the road does not end here. Many (all?) LINQ methods that use hash tables have overloads that accept an IEqualityComparer<T> to be used when hashing. So you can in fact use LINQ, but it's going to be less convenient.
In the end you will have to weigh the options yourself. My opinion is that it's better to operate with a whitelist strategy (provide an IEqualityComparer<T> whenever needed) as long as it is technically feasible because that makes the code explicit: if someone tries to use the class naively they get an exception that helpfully tells them what's going on and the equality comparer is visible in the code wherever it is used, making the extraordinary behavior of the class immediately clear.
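A minimal sketch of that whitelist strategy, assuming a hypothetical mutable Track type and a hypothetical TrackTitleComparer: the class itself refuses to be hashed, and any hash-based use has to name a comparer explicitly.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical mutable type that refuses to be hashed on its own.
public sealed class Track
{
    public string Title { get; set; }

    public Track(string title) { Title = title; }

    public override bool Equals(object obj)
    {
        var other = obj as Track;
        return other != null && Title == other.Title;
    }

    public override int GetHashCode()
    {
        throw new NotSupportedException(
            "Track is mutable; pass a TrackTitleComparer to hash-based collections.");
    }
}

// Explicit opt-in: callers that need hashing state which comparer they use.
public sealed class TrackTitleComparer : IEqualityComparer<Track>
{
    public bool Equals(Track x, Track y)
    {
        if (ReferenceEquals(x, y)) return true;
        if (x == null || y == null) return false;
        return x.Title == y.Title;
    }

    public int GetHashCode(Track track)
    {
        return track == null || track.Title == null ? 0 : track.Title.GetHashCode();
    }
}
```

With this in place, tracks.Distinct(new TrackTitleComparer()) works, while a plain tracks.Distinct() throws NotSupportedException as soon as it is enumerated. The comparer still hashes on a mutable field, so the usual rule applies: don't change Title while the object sits in a hash-based collection.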
Where I want to override Equals, but there is no sensible immutable "key" for an object (and for whatever reason it doesn't make sense to make the whole object immutable), in my opinion there is only one "correct" choice:
Implement GetHashCode to hash the same fields as Equals uses. (This might be all the fields.)
Document that these fields must not be altered while in a dictionary.
Trust that users either don't put these objects in dictionaries, or obey the second rule.
(Returning a constant value compromises dictionary performance. Throwing an exception disallows too many useful cases where objects are cached but not modified. Any other implementation for GetHashCode would be wrong.)
Where this runs the user into trouble anyway, it's probably their fault. (Specifically: using a dictionary where they shouldn't, or using a model type in a context where they should be using a view-model type that uses reference equality instead.)
Or perhaps I shouldn't be overriding Equals in the first place.
If the classes truly contain nothing constant on which a hash value can be calculated then I would use something simpler than a GUID. Just use a random number persisted in the class (or in a wrapper class).
A simple approach is to store the hashCode in a private member and generate it on the first use. If your entity doesn't change often, and you're not going to be using two different objects that are Equal (where your Equals method returns true) as keys in your dictionary, then this should be fine:
private int? _hashCode;
public override int GetHashCode()
{
    // Compute once on first use, from the same properties your Equals method compares.
    if (!_hashCode.HasValue)
        _hashCode = Property1.GetHashCode() ^ Property2.GetHashCode(); // etc.
    return _hashCode.Value;
}
However, suppose you have object a and object b, where a.Equals(b) == true, and you store an entry in your dictionary using a as the key (dictionary[a] = value).
If a does not change, then dictionary[b] will return value; however, if you change a after storing the entry in the dictionary, then dictionary[b] will most likely fail.
The only workaround to this is to rehash the dictionary when any of the keys change.
By creating a Dictionary<int,int> or List<KeyValuePair<int,int>> I can create a list of related ids.
By calling collection[key] I can return the corresponding value stored against it.
I also want to be able to return the key by passing in a value - which I know is possible using some LINQ, however it doesn't seem very efficient.
In my case along with each key being unique, each value is too. Does this fact make it possible to use another approach which will provide better performance?
It sounds like you need a bi-directional dictionary. There are no framework classes that support this, but you can implement your own:
Bidirectional 1 to 1 Dictionary in C#
You could encapsulate two dictionaries, one with your "keys" storing your values and the other keyed with your "values" storing your keys.
Then manage access to them through a few methods. Fast and the added memory overhead shouldn't make a huge difference.
Edit: just noticed this is essentially the same as the previous answer :-/
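A minimal sketch of the two-dictionary approach, assuming (as in the question) that both keys and values are unique. BiDictionary and its member names are hypothetical, not a framework type.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical one-to-one map backed by two dictionaries kept in sync.
public sealed class BiDictionary<TFirst, TSecond>
{
    private readonly Dictionary<TFirst, TSecond> _forward = new Dictionary<TFirst, TSecond>();
    private readonly Dictionary<TSecond, TFirst> _reverse = new Dictionary<TSecond, TFirst>();

    public void Add(TFirst first, TSecond second)
    {
        // Check both sides first so a failed Add leaves the two maps consistent.
        if (_forward.ContainsKey(first) || _reverse.ContainsKey(second))
        {
            throw new ArgumentException("Both the key and the value must be unique.");
        }
        _forward.Add(first, second);
        _reverse.Add(second, first);
    }

    public bool TryGetByFirst(TFirst first, out TSecond second)
    {
        return _forward.TryGetValue(first, out second);
    }

    public bool TryGetBySecond(TSecond second, out TFirst first)
    {
        return _reverse.TryGetValue(second, out first);
    }
}
```

Lookups in either direction are then ordinary dictionary lookups, at the cost of storing each pair twice.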
I was attempting to use ObjectIDGenerator in C# to generate a unique ID during serialization, however, this class is not available in the XBox360 or Windows Phone 7 .NET frameworks (they use a compact version of .NET). I implemented a version using a dictionary of Object to Int64 and was able to get a fully working version up, however, the performance is unsatisfactory. I'm serializing on the order of tens of thousands of objects, and currently this is the greatest bottleneck in save/load performance. Using the actual .NET implementation on PC it takes about 0.3 seconds to serialize about 20,000 objects. Using my implementation, it takes about 6 seconds.
In profiling, I found that the heavy hitters were .TryGetValue and .Add on the dictionary (which makes sense since it's both indexing and adding to the hash map). More importantly, the virtual equality operator was being called instead of simply comparing references, so I implemented an IEqualityComparer that only used ReferenceEquals (this resulted in a speed increase).
Does anyone have an insight into a better implementation of ObjectIDGenerator? Thanks for your help!
My Implementation: http://pastebin.com/H1skZwNK
[Edit]
Another note: the results of profiling say that the object comparison / ReferenceEquals is still the bottleneck, with a hit count of 43,000,000. I'm wondering if there's a way to store data alongside this object without having to look it up in a hash map...
Is it possible to use an Int32 Id property / handle for each object rather than Object? That may help things. It looks like you're assigning an Id type number to each object anyway, only you're then looking up based on the Object reference instead of the Id. Can you persist the object id (your Int64) within each object and make your dictionary into Dictionary<Int64, Object> instead?
You might also want to see if SortedDictionary<TKey, TValue> or SortedList<TKey, TValue> perform better or worse. But if your main bottleneck is in your IEqualityComparer, these might not help very much.
UPDATE
After looking at the ObjectIDGenerator class API, I can see why you can't do what I advised at first; you're creating the ids!
ObjectIDGenerator seems to be manually implementing its own hash table (it allocates an object[] and a parallel long[] and resizes them as objects are added). It also uses RuntimeHelpers.GetHashCode(Object) to calculate its hash rather than an IEqualityComparer, which may be a big boost to your perf, as it's always calling Object.GetHashCode() and not doing the virtual call on the derived type (or the interface call in your case with IEqualityComparer).
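If you stay with a Dictionary, a comparer along these lines at least avoids the virtual GetHashCode/Equals calls on the derived type. This is only a sketch, not the framework's implementation: ReferenceComparer and SimpleIdGenerator are hypothetical names, and I'm assuming RuntimeHelpers.GetHashCode is available on your target platform, which I haven't verified for the compact frameworks.

```csharp
using System.Collections.Generic;
using System.Runtime.CompilerServices;

// Compares by reference only and hashes by object identity.
public sealed class ReferenceComparer : IEqualityComparer<object>
{
    bool IEqualityComparer<object>.Equals(object x, object y)
    {
        return ReferenceEquals(x, y);
    }

    int IEqualityComparer<object>.GetHashCode(object obj)
    {
        // Identity-based hash that ignores any GetHashCode override.
        return RuntimeHelpers.GetHashCode(obj);
    }
}

// Hypothetical stand-in for ObjectIDGenerator built on a Dictionary.
public sealed class SimpleIdGenerator
{
    private readonly Dictionary<object, long> _ids =
        new Dictionary<object, long>(new ReferenceComparer());
    private long _next = 1;

    public long GetId(object obj, out bool firstTime)
    {
        long id;
        firstTime = !_ids.TryGetValue(obj, out id);
        if (firstTime)
        {
            id = _next++;
            _ids.Add(obj, id);
        }
        return id;
    }
}
```

This still goes through the interface call on the comparer, so it will probably not match a hand-rolled table like the one described above, but it keeps the identity semantics even for types that override Equals.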
You can see the source for yourself via the Microsoft Shared Source Initiative:
I've got a complex class in my C# project on which I want to be able to do equality tests. It is not a trivial class; it contains a variety of scalar properties as well as references to other objects and collections (e.g. IDictionary). For what it's worth, my class is sealed.
To enable a performance optimization elsewhere in my system (an optimization that avoids a costly network round-trip), I need to be able to compare instances of these objects to each other for equality – other than the built-in reference equality – and so I'm overriding the Object.Equals() instance method. However, now that I've done that, Visual Studio 2008's Code Analysis a.k.a. FxCop, which I keep enabled by default, is raising the following warning:
warning : CA2218 : Microsoft.Usage : Since 'MySuperDuperClass'
redefines Equals, it should also redefine GetHashCode.
I think I understand the rationale for this warning: If I am going to be using such objects as the key in a collection, the hash code is important. i.e. see this question. However, I am not going to be using these objects as the key in a collection. Ever.
Feeling justified to suppress the warning, I looked up code CA2218 in the MSDN documentation to get the full name of the warning so I could apply a SuppressMessage attribute to my class as follows:
[SuppressMessage("Microsoft.Usage",
    "CA2218:OverrideGetHashCodeOnOverridingEquals",
    Justification="This class is not to be used as key in a hashtable.")]
However, while reading further, I noticed the following:
How to Fix Violations
To fix a violation of this rule,
provide an implementation of
GetHashCode. For a pair of objects of
the same type, you must ensure that
the implementation returns the same
value if your implementation of Equals
returns true for the pair.
When to Suppress Warnings
-----> Do not suppress a warning from this
rule. [arrow & emphasis mine]
So, I'd like to know: Why shouldn't I suppress this warning as I was planning to? Doesn't my case warrant suppression? I don't want to code up an implementation of GetHashCode() for this object that will never get called, since my object will never be the key in a collection. If I wanted to be pedantic, instead of suppressing, would it be more reasonable for me to override GetHashCode() with an implementation that throws a NotImplementedException?
Update: I just looked this subject up again in Bill Wagner's good book Effective C#, and he states in "Item 10: Understand the Pitfalls of GetHashCode()":
If you're defining a type that won't
ever be used as the key in a
container, this won't matter. Types
that represent window controls, web
page controls, or database connections
are unlikely to be used as keys in a
collection. In those cases, do
nothing. All reference types will
have a hash code that is correct, even
if it is very inefficient. [...] In
most types that you create, the best
approach is to avoid the existence of
GetHashCode() entirely.
... that's where I originally got this idea that I need not be concerned about GetHashCode() always.
If you are reallio-trulio absosmurfly positive that you'll never use the thing as a key to a hash table then your proposal is reasonable. Override GetHashCode; make it throw an exception.
Note that hash tables hide in unlikely places. Plenty of LINQ sequence operators use hash table implementations internally to speed things up. By rejecting the implementation of GetHashCode you are also rejecting being able to use your type in a variety of LINQ queries. I like to build algorithms that use memoization for speed increases; memoizers usually use hash tables. You are therefore also rejecting the ability to memoize method calls that take your type as a parameter.
Alternatively, if you don't want to be that harsh: Override GetHashCode; make it always return zero. That meets the semantic requirements of GetHashCode; that two equal objects always have the same hash code. If it is ever used as a key in a dictionary performance is going to be terrible, but you can deal with that problem when it arises, which you claim it never will.
All that said: come on. You've probably spent more time typing up the question than it would take to correctly implement it. Just do it.
You should not suppress it. Look at how your equals method is implemented. I'm sure it compares one or more members on the class to determine equality. One of these members is oftentimes enough to distinguish one object from another, and therefore you could implement GetHashCode by returning membername.GetHashCode();.
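For instance, a sketch along these lines, where Invoice and its members are hypothetical: Equals compares two members, and GetHashCode hashes only the one that usually tells instances apart. That is valid because equal objects necessarily share that member.

```csharp
// Hypothetical class: Equals compares two members, GetHashCode uses one of them.
public sealed class Invoice
{
    private readonly int _number;
    private readonly string _customer;

    public Invoice(int number, string customer)
    {
        _number = number;
        _customer = customer;
    }

    public override bool Equals(object obj)
    {
        var other = obj as Invoice;
        return other != null
            && _number == other._number
            && _customer == other._customer;
    }

    public override int GetHashCode()
    {
        // Equal invoices always share _number, so they always share this hash.
        return _number.GetHashCode();
    }
}
```

Two invoices with the same number but different customers hash alike and are still unequal, which is allowed; it only costs an extra Equals call when both land in the same bucket.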
My $0.10 worth? Implement GetHashCode.
As much as you say you'll never, ever need it, you may change your mind, or someone else may have other ideas on how to use the code. A working GetHashCode isn't hard to make, and guarantees that there won't be any problems in the future.
As soon as you forget, or another developer who isn't aware uses this, someone is going to have a painful bug to track down. I'd recommend simply implementing GetHashCode correctly and then you won't have to worry about it. Or just don't use Equals for your special equality comparison case.
The GetHashCode and Equals methods work together to provide value-based equality semantics for your type - you ought to implement them together.
For more information on this topic please see these articles:
All types are not compared equally
All types are not compared equally (part 2)
Shameless plug: These articles were written by me.
I'm using a HashSet<T> to store a collection of objects. These objects already have a unique ID of System.Guid, so I'd rather the HashSet<> just use that existing ID rather than trying to figure out itself how to hash the object. How do I override the built-in hashing and force my program to use the built-in ID value as the hash value?
Also say I know the Guid of an object in my HashSet<>, is there a way to get an object from a HashSet<T> based on this Guid alone? Or should I use a dictionary instead.
A HashSet<> is not based on a key/value pair, and provides no "by key" access - it is just a set of unique values, using the hash to check containment very quickly.
To use a key/value pair (to fetch out by Guid later) the simplest option would be a Dictionary<Guid,SomeType>. The existing hash code on Guid should be fine (although if you needed to, which you don't here, you can provide an IEqualityComparer<T> to use for hashing).
Override the GetHashCode() method for your object.
Of course, there's a slight wrinkle here... GUIDs are larger than int32s, which .NET uses for hashcodes.
Why do you need to override this? Seems like perhaps a premature optimization.
Yeah, just use a dictionary. Once you develop your application, go through a performance tuning phase where you measure the performance of all your code. If and only if this hashing function shows as being your largest drain should you consider a more performant data structure (if there is one anyways) :-)
Try looking into System.Collections.ObjectModel.KeyedCollection<TKey, TItem>. It allows you to embed the knowledge of the key field into your collection implementation.
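A minimal sketch of that, assuming a hypothetical Widget type whose Id is the Guid you already have; WidgetCollection is a made-up name too.

```csharp
using System;
using System.Collections.ObjectModel;

// Hypothetical item type that carries its own Guid key.
public sealed class Widget
{
    public Guid Id { get; private set; }
    public string Name { get; set; }

    public Widget(Guid id, string name)
    {
        Id = id;
        Name = name;
    }
}

// The collection knows how to pull the key out of each item.
public sealed class WidgetCollection : KeyedCollection<Guid, Widget>
{
    protected override Guid GetKeyForItem(Widget item)
    {
        return item.Id;
    }
}
```

Items are added with widgets.Add(widget) and fetched with widgets[id] or tested with widgets.Contains(id), so the key never has to be passed separately from the object that owns it.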