Adding complex keys to a Dictionary: does it affect performance? - c#

I am writing a program (C#, .NET 4.5) that needs a Dictionary declared as
Dictionary<List<Button>, String> Label_Group = new Dictionary<List<Button>, String>();
A friend suggests using a string as the key instead. Does this make any difference to performance when doing lookups?
I am just curious how it works.

In fact, the lookup based on a List<Button> will be faster than based on a string, because List<T> doesn't override Equals. Your keys will just be compared by reference effectively - which is blazingly cheap.
Compare that with using a string as a key:
The hash code needs to be computed, which is non-trivial
Each string comparison performs an equality check, which has a short cut for equal references (and probably different lengths), but will otherwise need to compare each character until it finds a difference or reaches the end
(Taking the hash code of an object of a type which doesn't override GetHashCode may require allocation of a SyncBlock - I think it used to. That may be more expensive than hashing very short strings...)
It's rarely a good idea to use a collection as a dictionary key though - and if you need anything other than reference equality for key comparisons, you'll need to write your own IEqualityComparer<>.
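The difference between the default reference comparison and a custom IEqualityComparer<> can be sketched as follows. This is only an illustration: List<int> stands in for List<Button> so the sample runs without a UI framework, and ListComparer<T> and ListKeyDemo are hypothetical names, not part of the framework.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical comparer: two lists are equal when their elements match in order.
public sealed class ListComparer<T> : IEqualityComparer<List<T>>
{
    public bool Equals(List<T> x, List<T> y)
    {
        if (ReferenceEquals(x, y)) return true;
        if (x == null || y == null) return false;
        return x.SequenceEqual(y);
    }

    public int GetHashCode(List<T> list)
    {
        if (list == null) return 0;
        unchecked
        {
            // Walks every element, so hashing is O(n) per lookup.
            int hash = 17;
            foreach (T item in list)
                hash = hash * 31 + (item == null ? 0 : item.GetHashCode());
            return hash;
        }
    }
}

public static class ListKeyDemo
{
    public static void Run()
    {
        var a = new List<int> { 1, 2 };
        var b = new List<int> { 1, 2 }; // same contents, different instance

        // Default comparer: List<T> does not override Equals, so keys match by reference only.
        var byReference = new Dictionary<List<int>, string> { [a] = "group" };
        Console.WriteLine(byReference.ContainsKey(b)); // False

        // Custom comparer: keys match by content, at the cost of walking the list.
        var byContent = new Dictionary<List<int>, string>(new ListComparer<int>()) { [a] = "group" };
        Console.WriteLine(byContent.ContainsKey(b)); // True
    }
}
```

Note that a content-based comparer brings back the usual caveat: mutating a list while it is a key changes its hash and makes the entry unreachable.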

As far as I know, List<T> does not override GetHashCode, so its use as a key would have similar performance to using an object.

Related

Why don't Hashtables and dictionaries use the Equals() method instead of GetHashCode for key comparison in .NET?

In .NET, whenever we override the Equals() method for a class, it is normal practice to override the GetHashCode() method as well. Doing so ensures better performance when the object is used in Hashtables and Dictionaries. Two keys are considered equal in a Hashtable only if their GetHashCode() values are the same. My question is: why can't Hashtables use the Equals() method to compare the keys? That would remove the burden of overriding GetHashCode().
Hashtable and Dictionary use Equals in case of a collision (when two hash codes are the same).
Why don't they use only Equals?
Because that would require a lot more processing than comparing an integer value (the hash code). Since hash codes are used as an index, locating the right bucket is O(1).
A HashSet (or HashTable, or Dictionary) uses an array of buckets to distribute the items, those buckets are indexed by the object's hash code (which should be immutable), so the search of the bucket the item is in is O(1).
Then it uses Equals within that bucket to find the exact match if there's more than one item with the same hashcode: that's O(N) since it needs to iterate over all items within that bucket to find the match.
If a hash set used only Equals, finding an item would be O(N) and you might as well be using a list or an array.
That's also why two equal items must have the same hashcode, but two items with the same hashcode don't necessarily need to be equal.
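The two-step lookup described above can be sketched with a toy hash set. This is a simplified illustration, not how HashSet<T> is actually implemented: ToyHashSet<T> is a hypothetical name, the bucket count is fixed, and there is no resizing.

```csharp
using System;
using System.Collections.Generic;

// Toy hash set for illustration only: fixed bucket count, no resizing, no removal.
public sealed class ToyHashSet<T>
{
    private readonly List<T>[] buckets;

    public ToyHashSet(int bucketCount = 16)
    {
        buckets = new List<T>[bucketCount];
        for (int i = 0; i < bucketCount; i++) buckets[i] = new List<T>();
    }

    // Step 1: the hash code selects one bucket in O(1).
    private List<T> BucketFor(T item)
    {
        int hash = item == null ? 0 : item.GetHashCode();
        return buckets[(hash & int.MaxValue) % buckets.Length];
    }

    // Step 2: Equals is called only on the items that share that bucket.
    public bool Contains(T item)
    {
        foreach (T candidate in BucketFor(item))
            if (EqualityComparer<T>.Default.Equals(candidate, item)) return true;
        return false;
    }

    // Returns false when an equal item is already present.
    public bool Add(T item)
    {
        if (Contains(item)) return false;
        BucketFor(item).Add(item);
        return true;
    }
}
```

If two equal items produced different hash codes, step 1 could send them to different buckets and step 2 would never compare them, which is why the rule about equal items sharing a hash code is a correctness requirement.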
Two object instances that compare as equal must always have identical hash codes. If this doesn't hold, hash-based data structures will not work correctly. It's not a matter of performance.
Two object instances that don't compare as equal should ideally have different hash codes. If this doesn't hold, hash-based data structures will have degraded performance, but at least they'll still work.
Thus, for a given object instance, GetHashCode needs to reflect the logic of Equals, to some extent.
Now if you're overriding the Equals method, you're providing custom comparison logic. As an example, let's say your custom comparison logic involves only one particular data member of the instance. For a non-virtual GetHashCode method to be useful, it would have to be general enough to understand your custom Equals logic and be able to come up with a custom hash code function (one that only involves your chosen data member) on the spot.
It's not that easy to write such a sophisticated GetHashCode and it's not worth the trouble either, when the user can simply provide a custom one-liner that honors the initial requirement.
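As a minimal sketch of that "custom one-liner", assume a hypothetical Customer class whose equality involves only its Id member. The hash code then only needs to involve the same member.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical type: equality is defined by Id alone, so the hash uses Id alone.
public sealed class Customer
{
    public Customer(int id, string name) { Id = id; Name = name; }

    public int Id { get; }
    public string Name { get; set; } // not part of equality, so safe to change

    public override bool Equals(object obj) => obj is Customer other && other.Id == Id;

    // The one-liner that mirrors the Equals logic.
    public override int GetHashCode() => Id.GetHashCode();
}

public static class CustomerDemo
{
    public static void Run()
    {
        var orders = new Dictionary<Customer, int> { [new Customer(1, "Ann")] = 3 };

        // A different instance with the same Id is equal and hashes identically.
        Console.WriteLine(orders.ContainsKey(new Customer(1, "Ann B."))); // True
        Console.WriteLine(orders.ContainsKey(new Customer(2, "Ann")));    // False
    }
}
```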

What to return when overriding Object.GetHashCode() in classes with no immutable fields?

Ok, before you get all mad because there are hundreds of similar sounding questions posted on the internet, I can assure you that I have just spent the last few hours reading all of them and have not found the answer to my question.
Background:
Basically, one of my large scale applications had been suffering from a situation where some Bindings on the ListBox.SelectedItem property would stop working or the program would crash after an edit had been made to the currently selected item. I initially asked the 'An item with the same key has already been added' Exception on selecting a ListBoxItem from code question here, but got no answers.
I hadn't had time to address that problem until this week, when I was given a number of days to sort it out. Now to cut a long story short, I found out the reason for the problem. It was because my data type classes had overridden the Equals method and therefore the GetHashCode method as well.
Now for those of you who are unaware of this issue, I discovered that you can only implement the GetHashCode method using immutable fields/properties. Using an excerpt from Harvey Kwok's answer to the Overriding GetHashCode() post to explain this:
The problem is that GetHashCode is being used by Dictionary and HashSet collections to place each item in a bucket. If hashcode is calculated based on some mutable fields and the fields are really changed after the object is placed into the HashSet or Dictionary, the object can no longer be found from the HashSet or Dictionary.
So the actual problem was caused because I had used mutable properties in the GetHashCode methods. When users changed these property values in the UI, the associated hash code values of the objects changed and then items could no longer be found in their collections.
Question:
So, my question is what is the best way of handling the situation where I need to implement the GetHashCode method in classes with no immutable fields? Sorry, let me be more specific, as that question has been asked before.
The answers in the Overriding GetHashCode() post suggest that in these situations, it is better to simply return a constant value... some suggest returning the value 1, while others suggest returning a prime number. Personally, I can't see any difference between these suggestions because I would have thought that there would only be one bucket used for either of them.
Furthermore, the Guidelines and rules for GetHashCode article in Eric Lippert's Blog has a section titled Guideline: the distribution of hash codes must be "random" which highlights the pitfalls of using an algorithm that results in not enough buckets being used. He warns of algorithms that decrease the number of buckets used and cause a performance problem when the bucket gets really big. Surely, returning a constant falls into this category.
I had an idea of adding an extra Guid field to all of my data type classes (just in C#, not the database) specifically to be used in and only in the GetHashCode method. So I suppose at the end of this long intro, my actual question is which implementation is better? To summarise:
Summary:
When overriding Object.GetHashCode() in classes with no immutable fields, is it better to return a constant from the GetHashCode method, or to create an additional readonly field for each class, solely to be used in the GetHashCode method? If I should add a new field, what type should it be and shouldn't I then include it in the Equals method?
While I am happy to receive answers from anyone, I am really hoping to receive answers from advanced developers with a sound knowledge on this subject.
Go back to basics. You read my article; read it again. The two ironclad rules that are relevant to your situation are:
if x equals y then the hash code of x must equal the hash code of y. Equivalently: if the hash code of x does not equal the hash code of y then x and y must be unequal.
the hash code of x must remain stable while x is in a hash table.
Those are requirements for correctness. If you can't guarantee those two simple things then your program will not be correct.
You propose two solutions.
Your first solution is that you always return a constant. That meets the requirement of both rules, but you are then reduced to linear searches in your hash table. You might as well use a list.
The other solution you propose is to somehow produce a hash code for each object and store it in the object. That is perfectly legal provided that equal items have equal hash codes. If you do that then you are restricted such that x equals y must be false if the hash codes differ. This seems to make value equality basically impossible. Since you wouldn't be overriding Equals in the first place if you wanted reference equality, this seems like a really bad idea, but it is legal provided that equals is consistent.
I propose a third solution, which is: never put your object in a hash table, because a hash table is the wrong data structure in the first place. The point of a hash table is to quickly answer the question "is this given value in this set of immutable values?" and you don't have a set of immutable values, so don't use a hash table. Use the right tool for the job. Use a list, and live with the pain of doing linear searches.
A fourth solution is: hash on the mutable fields used for equality, remove the object from all hash tables it is in just before every time you mutate it, and put it back in afterwards. This meets both requirements: the hash code agrees with equality, and hashes of objects in hash tables are stable, and you still get fast lookups.
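The fourth solution can be sketched as a small helper that takes the object out of the set, mutates it, and puts it back. Point and MutateDemo.MoveTo are hypothetical names used only for this illustration, and the sketch assumes the object lives in a single HashSet<T>; with several collections, each one would need the same treatment.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical mutable type that hashes on the same fields Equals compares.
public sealed class Point
{
    public int X { get; set; }
    public int Y { get; set; }

    public override bool Equals(object obj) => obj is Point p && p.X == X && p.Y == Y;

    public override int GetHashCode()
    {
        unchecked { return X * 31 + Y; }
    }
}

public static class MutateDemo
{
    // Remove while the hash still matches, mutate, then re-insert under the new hash.
    public static void MoveTo(HashSet<Point> set, Point p, int x, int y)
    {
        bool wasPresent = set.Remove(p);
        p.X = x;
        p.Y = y;
        if (wasPresent) set.Add(p);
    }
}
```

Mutating p directly while it sits in the set would leave it filed under its old hash code, which is exactly the failure described in the question.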
I would either create an additional readonly field or else throw NotSupportedException. In my view the other option is meaningless. Let's see why.
Distinct (fixed) hash codes
Providing distinct hash codes is easy, e.g.:
class Sample
{
    private static int counter;
    private readonly int hashCode;

    public Sample() { this.hashCode = counter++; }

    public override int GetHashCode()
    {
        return this.hashCode;
    }

    public override bool Equals(object other)
    {
        return object.ReferenceEquals(this, other);
    }
}
Technically you have to look out for creating too many objects and overflowing the counter here, but in practice I think that's not going to be an issue for anyone.
The problem with this approach is that instances will never compare equal. However, that's perfectly fine if you only want to use instances of Sample as indexes into a collection of some other type.
Constant hash codes
If there is any scenario in which distinct instances should compare equal then at first glance you have no other choice than returning a constant. But where does that leave you?
Locating an instance inside a container will always degenerate to the equivalent of a linear search. So in effect, by returning a constant you allow the user to make a keyed container for your class, but that container will exhibit the performance characteristics of a LinkedList<T>. This might be obvious to someone familiar with your class, but personally I see it as letting people shoot themselves in the foot. If you know beforehand that a Dictionary won't behave as one might expect, then why let the user create one? In my view, better to throw NotSupportedException.
But throwing is what you must not do!
Some people will disagree with the above, and when those people are smarter than oneself then one should pay attention. First of all, this code analysis warning states that GetHashCode should not throw. That's something to think about, but let's not be dogmatic. Sometimes you have to break the rules for a reason.
However, that is not all. In his blog post on the subject, Eric Lippert says that if you throw from inside GetHashCode then
your object cannot be a result in many LINQ-to-objects queries that use hash tables internally for performance reasons.
Losing LINQ is certainly a bummer, but fortunately the road does not end here. Many (all?) LINQ methods that use hash tables have overloads that accept an IEqualityComparer<T> to be used when hashing. So you can in fact use LINQ, but it's going to be less convenient.
In the end you will have to weigh the options yourself. My opinion is that it's better to operate with a whitelist strategy (provide an IEqualityComparer<T> whenever needed) as long as it is technically feasible because that makes the code explicit: if someone tries to use the class naively they get an exception that helpfully tells them what's going on and the equality comparer is visible in the code wherever it is used, making the extraordinary behavior of the class immediately clear.
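A rough sketch of this whitelist strategy, under the assumption of a hypothetical Record class compared by Name: GetHashCode throws, and every hash-based operation has to pass an explicit comparer (here the hypothetical RecordByName) through the IEqualityComparer<T> overload.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical type that refuses to be hashed implicitly.
public sealed class Record
{
    public string Name { get; set; }

    public override bool Equals(object obj) => obj is Record r && r.Name == Name;

    public override int GetHashCode() =>
        throw new NotSupportedException("Pass an IEqualityComparer<Record> explicitly.");
}

// The explicit comparer makes the equality rule visible wherever it is used.
public sealed class RecordByName : IEqualityComparer<Record>
{
    public bool Equals(Record x, Record y) =>
        ReferenceEquals(x, y) || (x != null && y != null && x.Name == y.Name);

    public int GetHashCode(Record r) => r?.Name?.GetHashCode() ?? 0;
}

public static class LinqDemo
{
    // Distinct hashes through the supplied comparer, so Record.GetHashCode is never called.
    public static int CountDistinct(IEnumerable<Record> records) =>
        records.Distinct(new RecordByName()).Count();
}
```

Calling Distinct() without the comparer on the same sequence would hit the NotSupportedException, which is the "helpful exception" the answer has in mind.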
Where I want to override Equals, but there is no sensible immutable "key" for an object (and for whatever reason it doesn't make sense to make the whole object immutable), in my opinion there is only one "correct" choice:
Implement GetHashCode to hash the same fields as Equals uses. (This might be all the fields.)
Document that these fields must not be altered while in a dictionary.
Trust that users either don't put these objects in dictionaries, or obey the second rule.
(Returning a constant value compromises dictionary performance. Throwing an exception disallows too many useful cases where objects are cached but not modified. Any other implementation for GetHashCode would be wrong.)
Where this runs the user into trouble anyway, it's probably their fault. (Specifically: using a dictionary where they shouldn't, or using a model type in a context where they should be using a view-model type that uses reference equality instead.)
Or perhaps I shouldn't be overriding Equals in the first place.
If the classes truly contain nothing constant on which a hash value can be calculated then I would use something simpler than a GUID. Just use a random number persisted in the class (or in a wrapper class).
A simple approach is to store the hashCode in a private member and generate it on the first use. If your entity doesn't change often, and you're not going to be using two different objects that are Equal (where your Equals method returns true) as keys in your dictionary, then this should be fine:
private int? _hashCode;
public override int GetHashCode()
{
    if (!_hashCode.HasValue)
    {
        // Combine the hash codes of whichever properties your Equals method uses.
        _hashCode = Property1.GetHashCode() ^ Property2.GetHashCode();
    }
    return _hashCode.Value;
}
However, suppose you have objects a and b where a.Equals(b) == true, and you store an entry in your dictionary using a as the key (dictionary[a] = value). If a does not change, then dictionary[b] will return value; however, if you change a after storing the entry in the dictionary, then dictionary[b] will most likely fail.
The only workaround to this is to rehash the dictionary when any of the keys change.

Can a custom GetHashCode implementation cause problems with Dictionary or Hashtable's "buckets"?

I'm considering implementing my own custom hashcode for a given object... and use this as a key for my dictionary. Since it's possible (likely) that 2 objects will have the same hashcode, what additional operators should I override, and what should that override (conceptually) look like?
myDictionary.Add(myObj.GetHashCode(),myObj);
vs
myDictionary.Add(myObj,myObj);
In other words, does a Dictionary use a combination of the following in order to determine uniqueness and which bucket to place an object in?
Which are more important than others?
HashCode
Equals
==
CompareTo()
Is CompareTo() only needed for a SortedDictionary?
What is GetHashCode used for?
It is by design useful for only one thing: putting an object in a hash table. Hence the name.
GetHashCode is designed to do only one thing: balance a hash table. Do not use it for anything else. In particular:
It does not provide a unique key for an object; probability of collision is extremely high.
It is not of cryptographic strength, so do not use it as part of a digital signature or as a password equivalent
It does not necessarily have the error-detection properties needed for checksums.
and so on.
Eric Lippert
http://ericlippert.com/2011/02/28/guidelines-and-rules-for-gethashcode/
It's not the buckets that cause the problem - it is actually finding the right object instance once you have determined the bucket using the hash code. Since all objects in a bucket share the same hash code, object equality (Equals) is used to find the right one. The rule is that if two objects are considered equal, they should produce the same hash code - but two objects producing the same hash codes might not be equal.
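To make the difference between the two Add calls in the question concrete, here is a small sketch. Tag is a hypothetical class with a deliberately weak hash (the length of its text) so that a collision is easy to produce.

```csharp
using System;
using System.Collections.Generic;

public sealed class Tag
{
    public Tag(string text) { Text = text; }

    public string Text { get; }

    public override bool Equals(object obj) => obj is Tag other && other.Text == Text;

    // Deliberately weak hash, chosen only to force collisions in this demo.
    public override int GetHashCode() => Text.Length;
}

public static class KeyChoiceDemo
{
    public static void Run()
    {
        var ab = new Tag("ab");
        var cd = new Tag("cd"); // same hash code as ab, but not equal to it

        // Keyed by object: the collision is resolved with Equals, so both entries fit.
        var byObject = new Dictionary<Tag, Tag> { [ab] = ab, [cd] = cd };
        Console.WriteLine(byObject.Count); // 2

        // Keyed by hash code: the second Add sees a duplicate int key and throws.
        var byHash = new Dictionary<int, Tag>();
        byHash.Add(ab.GetHashCode(), ab);
        try { byHash.Add(cd.GetHashCode(), cd); }
        catch (ArgumentException) { Console.WriteLine("duplicate key"); }
    }
}
```

So for uniqueness the dictionary relies on GetHashCode to narrow the search and on Equals to decide; using the hash code itself as the key discards the Equals step.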

If I never ever use HashSet, should I still implement GetHashCode?

I never need to store objects in a hash table. The reason is twofold:
coming up with a good hash function is difficult and error prone.
an AVL tree is almost always fast enough, and it merely requires a strict order predicate, which is much easier to implement.
The Equals() operation on the other hand is a very frequently used function.
Therefore I wonder whether it is necessary to implement GetHashCode (which I never need) when implementing the Equals function (which I often need)?
My advice: if you don't want to use it, override it and throw new NotImplementedException(), so that you will see where you needed it.
I think you're quite wrong if you believe that implementing a strict order predicate is much easier to implement than a hash function - it needs to handle a large number of edge cases (null values, class hierarchies). And hash functions aren't that difficult, really.
An AVL tree will be much slower than a hashtable. If you are dealing with only a few items then it will not be much of an issue. Hashtables have O(1) inserts, deletes, and searches, but an AVL tree has O(log(n)) operations.
I would go ahead and override GetHashCode and Equals for two reasons.
It really is not that difficult to get a decent distribution by using a trivial XOR implementation (see footnote 1).
If your classes are part of a public API then someone else might want to store them in a hashtable.
Also, I have to question the choice of BST. AVL trees are a bit out of style these days. There are other, more modern BSTs that are easier to implement and work just as well (sometimes better). If you really do need a data structure that maintains ordering, then consider these alternatives.
Red Black Tree - Already implemented by SortedDictionary.
Skip List
Scapegoat Tree
Splay Tree
Footnote 1: The XOR strategy has a subtle commutativity problem that can cause collisions in some cases, since a^b = b^a. There is a solution from Effective Java that has achieved cult-like recognition and is fairly simple to implement as well.
If you use a Dictionary or Hashtable and override Equals, you need to have a matching hash function, else they will break. Equals is also used all over the place in the BCL, and if anyone else uses your objects they will expect GetHashCode to behave sensibly.
Note that a hash function doesn't have to be that complicated. A basic version is to take the hash of whatever member variables you're using for equality, multiply each one with a separate coprime number, and XOR them together.
You don't need to implement it. If you write your own Equals() method, though, I'd recommend using some GetHashCode implementation that doesn't break HashSet. You could for instance return a static value (typically 42). HashSet performance will degrade dramatically, but at least it will still work - you never know who will use/edit/maintain your code in the future. (edit: you may want to log a warning if such a class is used in a hashed structure, to spot performance problems early)
EDIT: don't only use XOR to combine hash codes of your properties
It has already been said by others that you may simply combine hash codes of all your properties. Instead of only using XOR I'd encourage multiplying results though. XOR may result in a 0 value if both values are equal (e.g. 0xA ^ 0xA == 0x0). This may be easily improved using 0xA * 0xA, 0xA * 31 + 0xA or 0xA ^ (0xA * 31).
Still, the intent of my answer is that any hash function is better than one that isn't consistent with Equals - even if it only returns a static value. Simply select any subset of properties (from none to all) you use for equality and throw the results together. When selecting properties for the hash code, prefer small subsets whose combinations are fairly unique (e.g. firstname, lastname, birthday - no need to add the whole address).
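The weaknesses of plain XOR and the multiply-based alternative can be compared in a short sketch. HashCombine is a hypothetical helper, and the constants 17 and 31 are a common convention rather than a requirement; int inputs are used because an int's hash code is the value itself, which keeps the results easy to check by hand.

```csharp
using System;

public static class HashCombine
{
    // Plain XOR: insensitive to order, and equal inputs cancel out to 0.
    public static int Xor(int a, int b) => a.GetHashCode() ^ b.GetHashCode();

    // Multiply-and-add: each step scales the running hash before mixing in the next value.
    public static int MultiplyAdd(int a, int b)
    {
        unchecked
        {
            int hash = 17;
            hash = hash * 31 + a.GetHashCode();
            hash = hash * 31 + b.GetHashCode();
            return hash;
        }
    }

    public static void Run()
    {
        Console.WriteLine(Xor(1, 2) == Xor(2, 1));                 // True: swapped fields collide
        Console.WriteLine(Xor(10, 10));                            // 0: equal fields cancel
        Console.WriteLine(MultiplyAdd(1, 2) == MultiplyAdd(2, 1)); // False: order matters
    }
}
```

On newer runtimes, System.HashCode.Combine offers a built-in way to mix several values; it is not available on the .NET Framework versions these answers were written against.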
Coming up with an adequate hash function is not difficult. Most often, a simple XOR of the results from GetHashCode() of all fields is sufficient.
If you override Equals, you should override GetHashCode() too. From MSDN: "It is recommended that any class that overrides Equals also override System.Object.GetHashCode." http://msdn.microsoft.com/en-us/library/ms173147.aspx
The two functions should match in the sense that if two objects are equal they should have the same hash value. That doesn't mean that if two objects have the same hash they should be equal. You don't need an overly complex hash algorithm but it should attempt to distribute well across the integer space.

How do I control how an object is hashed by a hashset

I'm using a HashSet<T> to store a collection of objects. These objects already have a unique ID of type System.Guid, so I'd rather the HashSet<> just use that existing ID rather than trying to figure out for itself how to hash the object. How do I override the built-in hashing and force my program to use the object's own ID value as the hash value?
Also, say I know the Guid of an object in my HashSet<>: is there a way to get an object from a HashSet<T> based on this Guid alone? Or should I use a dictionary instead?
A HashSet<> is not based on key/value pairs and provides no "by key" access - it is just a set of unique values, using the hash to check containment very quickly.
To use a key/value pair (to fetch by Guid later), the simplest option would be a Dictionary<Guid,SomeType>. The existing hash code on Guid should be fine, although if you needed to (you don't here) you could provide an IEqualityComparer<T> to use for hashing.
Override the GetHashCode() method for your object.
Of course, there's a slight wrinkle here... GUIDs are larger than int32s, which .NET uses for hashcodes.
Why do you need to override this? It seems like a premature optimization.
Yeah, just use a dictionary. Once you have developed your application, go through a performance-tuning phase where you measure the performance of all your code. If, and only if, this hashing function shows up as your largest drain should you consider a more performant data structure (if there is one anyway) :-)
Try looking into System.Collections.ObjectModel.KeyedCollection<TKey, TItem>. It allows you to embed the knowledge of the key field into your collection implementation.
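Both suggestions, the plain dictionary and the keyed collection, can be sketched side by side. Widget and WidgetCollection are hypothetical names for illustration; KeyedCollection<TKey, TItem> itself lives in System.Collections.ObjectModel and asks subclasses to override GetKeyForItem.

```csharp
using System;
using System.Collections.Generic;
using System.Collections.ObjectModel;

// Hypothetical item type that already carries a unique Guid.
public sealed class Widget
{
    public Widget(Guid id, string name) { Id = id; Name = name; }

    public Guid Id { get; }
    public string Name { get; }
}

// The collection knows how to extract the key from each item.
public sealed class WidgetCollection : KeyedCollection<Guid, Widget>
{
    protected override Guid GetKeyForItem(Widget item) => item.Id;
}

public static class GuidLookupDemo
{
    public static void Run()
    {
        var widget = new Widget(Guid.NewGuid(), "gear");

        // Option 1: a dictionary keyed by the Guid, using Guid's own hash code.
        var byId = new Dictionary<Guid, Widget> { [widget.Id] = widget };
        Console.WriteLine(byId[widget.Id].Name); // gear

        // Option 2: a keyed collection, so callers add items without repeating the key.
        var keyed = new WidgetCollection { widget };
        Console.WriteLine(keyed[widget.Id].Name); // gear
    }
}
```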
