How do I control how an object is hashed by a HashSet - C#

I'm using a HashSet<T> to store a collection of objects. These objects already have a unique ID of type System.Guid, so I'd rather the HashSet<T> just use that existing ID rather than trying to figure out itself how to hash the object. How do I override the built-in hashing and force my program to use the existing ID value as the hash value?
Also, say I know the Guid of an object in my HashSet<T>: is there a way to get an object from a HashSet<T> based on this Guid alone? Or should I use a dictionary instead?

A HashSet<T> is not based on key/value pairs and provides no "by key" access - it is just a set of unique values, using the hash to check containment very quickly.
To use a key/value pair (to fetch by Guid later), the simplest option would be a Dictionary<Guid,SomeType>. The existing hash code on Guid should be fine, although if you needed to (you don't here) you could provide an IEqualityComparer<T> to use for hashing.
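A minimal sketch of that Dictionary<Guid,SomeType> approach (SomeType and its Name property are placeholder names, not from the question):

```csharp
using System;
using System.Collections.Generic;

var item = new SomeType { Name = "example" };

// Key the dictionary on the object's existing Guid ID.
var byId = new Dictionary<Guid, SomeType> { [item.Id] = item };

// O(1) lookup by Guid - no scan over a set needed.
Console.WriteLine(byId[item.Id].Name); // example

class SomeType
{
    public Guid Id { get; } = Guid.NewGuid();
    public string Name { get; set; }
}
```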

Override the GetHashCode() method for your object.
Of course, there's a slight wrinkle here... GUIDs are larger than Int32s, which .NET uses for hash codes.
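If you do want the set to hash on the existing ID, one sketch (the Entity name and Id property are assumptions for illustration) is to delegate both GetHashCode and Equals to the Guid - Guid.GetHashCode() already folds the 128 bits down to an int, and the two members must stay consistent with each other:

```csharp
using System;

class Entity
{
    public Guid Id { get; init; }

    // Delegate hashing to the Guid; it compresses 128 bits to an Int32.
    public override int GetHashCode() => Id.GetHashCode();

    // Keep Equals consistent with GetHashCode: same Id => equal.
    public override bool Equals(object obj) =>
        obj is Entity other && Id.Equals(other.Id);
}
```

With this in place a HashSet<Entity> treats two instances with the same Id as the same element.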

Why do you need to override this? Seems like perhaps a premature optimization.
Yeah, just use a dictionary. Once you develop your application, go through a performance-tuning phase where you measure the performance of all your code. If and only if this hashing function shows up as your largest drain should you consider a more performant data structure (if there even is one). :-)

Try looking into KeyedCollection<TKey, TItem> (in System.Collections.ObjectModel). It allows you to embed the knowledge of the key field into your collection implementation.
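KeyedCollection<TKey, TItem> is abstract; you subclass it and tell it how to extract the key from an item. A sketch, assuming a hypothetical Entity class with a Guid Id:

```csharp
using System;
using System.Collections.ObjectModel;

class Entity
{
    public Guid Id { get; init; }
}

class EntityCollection : KeyedCollection<Guid, Entity>
{
    // Tells the base collection which property serves as the key.
    protected override Guid GetKeyForItem(Entity item) => item.Id;
}
```

After `collection.Add(entity)`, `collection[someGuid]` retrieves the item by its Guid, which is exactly the "get by ID" access the question asks for.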

Related

Adding Complex Keys in the Dictionary! Does it affect the performance

I am writing a program that requires a Dictionary declared as (in C#, .NET 4.5):
Dictionary<List<Button>, String> Label_Group = new Dictionary<List<Button>, String>();
My friend suggests using a string as the key. Does this make any difference in performance when doing a lookup?
I am just curious how it works.
In fact, the lookup based on a List<Button> will be faster than one based on a string, because List<T> doesn't override Equals. Your keys will effectively just be compared by reference - which is blazingly cheap.
Compare that with using a string as a key:
The hash code needs to be computed, which is non-trivial
Each string comparison performs an equality check, which has a short cut for equal references (and probably different lengths), but will otherwise need to compare each character until it finds a difference or reaches the end
(Taking the hash code of an object of a type which doesn't override GetHashCode may require allocation of a SyncBlock - I think it used to. That may be more expensive than hashing very short strings...)
It's rarely a good idea to use a collection as a dictionary key though - and if you need anything other than reference equality for key comparisons, you'll need to write your own IEqualityComparer<>.
As far as I know, List<T> does not override GetHashCode, so its use as a key would have similar performance to using an object.
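The reference-equality behaviour is easy to see directly: two lists with identical contents are different keys, while the same instance always finds its entry. A small sketch (using List<string> rather than List<Button> so it runs without a UI framework):

```csharp
using System;
using System.Collections.Generic;

var a = new List<string> { "ok" };
var b = new List<string> { "ok" }; // same contents, different instance

var map = new Dictionary<List<string>, string> { [a] = "group 1" };

Console.WriteLine(map.ContainsKey(a)); // True  (same reference)
Console.WriteLine(map.ContainsKey(b)); // False (contents are ignored)
```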

Using Objects as Keys in Dictionary

Is it a good practice to use your domain objects as keys in a Dictionary?
I have a scenario where I populate my domain object using NHibernate.
For performing business logic I need to look up a Dictionary. I can make use of
Either
IDictionary<int, ValueFortheObject>
or
Dictionary<DomainObject, ValueFortheObject>
The second option seems better to me as
I can write easier test cases and can use real domain objects in test cases as well, rather than using Mock<DomainObject> (as I would with the first option), since the setter on the Id is private on ALL domain objects.
The code is more readable as Dictionary<Hobbit,List<Adventures>> is more readable to me than Dictionary<int,List<Adventures>> with comments suggesting int is the hobbitId especially when passing as parameters
My questions are :
What are advantages of using the first option over the second (which I might be blindly missing) ?
Will there be any performance issues using the second approach?
Update 01:
My domain models implement those (Equals and GetHashCode) and they DO NOT get mutated while performing the operations.
Will there be issues with performance using them as keys ? or am I completely missing the point here and performance / memory are not related to what keys are being used?
Update 02:
My question is
Will there be any issues with performance or memory if I use Objects as keys instead of primitive types and WHY / HOW ?
The biggest issue you will run into is that you must not mutate your key object while it is performing its role as a key.
When I say "mutate", I mean this: to be used as a dictionary key, your object should implement Equals and GetHashCode, and anything you do to the object while it is being used as a key must not change the value of GetHashCode nor cause Equals to evaluate to true with any other key in the collection.
@ScottChamberlain gives you the overall issue, but your use cases could argue either way. Some questions you should ask yourself: What does it mean for two business objects to be equal? Is this the same or different when they are used as keys in a dictionary versus compared elsewhere? If I change an object, should its value as a key change or remain the same? If you override GetHashCode() and Equals(), what is the cost of computing those functions?
Generally I am in favor of using simple types for keys, as there is a lot of room for misunderstanding with respect to object equality. You could always create a custom dictionary (a wrapper around Dictionary<TKey,TValue>) with appropriate methods if readability is your highest concern. You could then write the methods in terms of the domain objects and use whatever (appropriate) property you want as the key internally.
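A sketch of that wrapper idea: the public API speaks in domain objects, while internally the dictionary is keyed by the primitive Id (Hobbit and Adventure are the question's placeholder names; the shape of the classes is assumed):

```csharp
using System.Collections.Generic;

class Hobbit
{
    public int Id { get; init; }
}

class Adventure { }

// Readable at the call site, primitive int key internally.
class AdventureLookup
{
    private readonly Dictionary<int, List<Adventure>> _byId = new();

    public void Add(Hobbit hobbit, Adventure adventure)
    {
        if (!_byId.TryGetValue(hobbit.Id, out var list))
            _byId[hobbit.Id] = list = new List<Adventure>();
        list.Add(adventure);
    }

    public List<Adventure> For(Hobbit hobbit) =>
        _byId.TryGetValue(hobbit.Id, out var list) ? list : new List<Adventure>();
}
```

Callers read `lookup.For(hobbit)` instead of `dict[hobbit.Id]`, but no domain object ever has to define equality semantics.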

A two way KeyValuePair collection in C#

By creating a Dictionary<int,int> or List<KeyValuePair<int,int>> I can create a list of related ids.
By calling collection[key] I can return the corresponding value stored against it.
I also want to be able to return the key by passing in a value - which I know is possible using some LINQ, however it doesn't seem very efficient.
In my case along with each key being unique, each value is too. Does this fact make it possible to use another approach which will provide better performance?
It sounds like you need a bi-directional dictionary. There are no framework classes that support this, but you can implement your own:
Bidirectional 1 to 1 Dictionary in C#
You could encapsulate two dictionaries, one with your "keys" storing your values and the other keyed with your "values" storing your keys.
Then manage access to them through a few methods. Fast and the added memory overhead shouldn't make a huge difference.
Edit: just noticed this is essentially the same as the previous answer :-/
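A minimal sketch of that two-dictionary approach (it assumes both keys and values are unique, as the question states; the class and method names are made up for illustration):

```csharp
using System.Collections.Generic;

class BiDictionary<TKey, TValue>
{
    private readonly Dictionary<TKey, TValue> _forward = new();
    private readonly Dictionary<TValue, TKey> _reverse = new();

    public void Add(TKey key, TValue value)
    {
        // Add to both maps, or to neither, so they stay in sync.
        _forward.Add(key, value);
        try { _reverse.Add(value, key); }
        catch { _forward.Remove(key); throw; }
    }

    public TValue GetByKey(TKey key) => _forward[key];     // O(1)
    public TKey GetByValue(TValue value) => _reverse[value]; // also O(1)
}
```

Both directions are hash lookups, avoiding the linear LINQ scan the question mentions, at the cost of storing each pair twice.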

Can a custom GetHashCode implementation cause problems with Dictionary or Hashtable's "buckets"

I'm considering implementing my own custom hashcode for a given object... and use this as a key for my dictionary. Since it's possible (likely) that 2 objects will have the same hashcode, what additional operators should I override, and what should that override (conceptually) look like?
myDictionary.Add(myObj.GetHashCode(),myObj);
vs
myDictionary.Add(myObj,myObj);
In other words, does a Dictionary use a combination of the following in order to determine uniqueness and which bucket to place an object in?
Which are more important than others?
HashCode
Equals
==
CompareTo()
Is CompareTo only needed in a SortedDictionary?
What is GetHashCode used for?
It is by design useful for only one thing: putting an object in a hash table. Hence the name.
GetHashCode is designed to do only one thing: balance a hash table. Do not use it for anything else. In particular:
It does not provide a unique key for an object; probability of collision is extremely high.
It is not of cryptographic strength, so do not use it as part of a digital signature or as a password equivalent.
It does not necessarily have the error-detection properties needed for checksums.
and so on.
Eric Lippert
http://ericlippert.com/2011/02/28/guidelines-and-rules-for-gethashcode/
It's not the buckets that cause the problem - it is actually finding the right object instance once you have determined the bucket using the hash code. Since all objects in a bucket share the same hash code, object equality (Equals) is used to find the right one. The rule is that if two objects are considered equal, they should produce the same hash code - but two objects producing the same hash codes might not be equal.
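That last rule - equal objects must hash equally, but equal hashes need not mean equal objects - can be shown with a small value type (Point is a made-up example, not from the question):

```csharp
using System;

readonly struct Point : IEquatable<Point>
{
    public int X { get; }
    public int Y { get; }
    public Point(int x, int y) { X = x; Y = y; }

    public bool Equals(Point other) => X == other.X && Y == other.Y;
    public override bool Equals(object obj) => obj is Point p && Equals(p);

    // Equal points MUST return the same hash; unequal points MAY collide,
    // and the dictionary resolves collisions inside the bucket via Equals.
    public override int GetHashCode() => HashCode.Combine(X, Y);
}
```

A Dictionary<Point, T> first uses GetHashCode to pick the bucket, then Equals to find the exact entry within it, so both members must agree.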

ObjectIDGenerator implementation

I was attempting to use ObjectIDGenerator in C# to generate a unique ID during serialization, however, this class is not available in the XBox360 or Windows Phone 7 .NET frameworks (they use a compact version of .NET). I implemented a version using a dictionary of Object to Int64 and was able to get a fully working version up, however, the performance is unsatisfactory. I'm serializing on the order of tens of thousands of objects, and currently this is the greatest bottleneck in save/load performance. Using the actual .NET implementation on PC it takes about 0.3 seconds to serialize about 20,000 objects. Using my implementation, it takes about 6 seconds.
In profiling, I found that the heavy hitters were .TryGetValue and .Add on the dictionary (which makes sense since it's both indexing and adding to the hash map). More importantly, the virtual equality operator was being called instead of simply comparing references, so I implemented an IEqualityComparer that only used ReferenceEquals (this resulted in a speed increase).
Does anyone have an insight into a better implementation of ObjectIDGenerator? Thanks for your help!
My Implementation: http://pastebin.com/H1skZwNK
[Edit]
Another note: profiling shows that the object comparison / ReferenceEquals is still the bottleneck, with a hit count of 43,000,000. I'm wondering if there's a way to store data alongside this object without having to look it up in a hash map...
Is it possible to use an Int32 Id property / handle for each object rather than Object? That may help things. It looks like you're assigning an Id type number to each object anyway, only you're then looking up based on the Object reference instead of the Id. Can you persist the object id (your Int64) within each object and make your dictionary into Dictionary<Int64, Object> instead?
You might also want to see if SortedDictionary<TKey, TValue> or SortedList<TKey, TValue> perform better or worse. But if your main bottleneck is in your IEqualityComparer, these might not help very much.
UPDATE
After looking at the ObjectIDGenerator class API, I can see why you can't do what I advised at first; you're creating the ids!
ObjectIDGenerator seems to implement its own hash table manually (it allocates an object[] and a parallel long[] and resizes them as objects are added). It also uses RuntimeHelpers.GetHashCode(Object) to calculate its hash rather than an IEqualityComparer, which may be a big boost to your performance: it always gets the base Object.GetHashCode() behaviour directly and never makes a virtual call on the derived type (or, in your case, the interface call through IEqualityComparer).
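The ReferenceEquals-based comparer the question describes might look something like this sketch, pairing ReferenceEquals with RuntimeHelpers.GetHashCode so that neither the virtual Equals nor the virtual GetHashCode of the stored objects is ever invoked (the class name is made up; newer .NET versions ship a similar built-in ReferenceEqualityComparer):

```csharp
using System.Collections.Generic;
using System.Runtime.CompilerServices;

sealed class ReferenceComparer : IEqualityComparer<object>
{
    // `new` hides the static object.Equals(object, object).
    public new bool Equals(object x, object y) => ReferenceEquals(x, y);

    // Reference-based hash, ignoring any GetHashCode override on obj.
    public int GetHashCode(object obj) => RuntimeHelpers.GetHashCode(obj);
}

// Usage sketch: var ids = new Dictionary<object, long>(new ReferenceComparer());
```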
You can see the source for yourself via the Microsoft Shared Source Initiative:
