using HashSet as underlying storage to replicate a dictionary - c#

I was asked today to re-implement a dictionary. My solution uses a HashSet as the storage and a class to represent the key/value pair. In this class, I override GetHashCode and Equals so that instances of the pair can be added to the HashSet.
I then read the source code for the C# Dictionary and found that it uses arrays for storage and loops through an array to find the matching key/value pair.
Is my approach correct? What is the advantage of the current Dictionary implementation in C#? Thanks in advance.
public class MyDictionary<K,V>
{
    private class KV
    {
        public K Key {get;set;}
        public V Value {get;set;}
        public override int GetHashCode()
        {
            return Key.GetHashCode();
        }
        public override bool Equals(object o)
        {
            var obj = ((KV)o).Key;
            return Key.Equals(obj);
        }
    }
    private readonly HashSet<KV> _store = new HashSet<KV>();
    public void Add(K key, V value)
    {
        _store.Add(new KV{Key = key, Value = value});
    }
    public V this[K key]
    {
        get
        {
            KV _kv;
            if (_store.TryGetValue(new KV{Key = key}, out _kv))
            {
                return _kv.Value;
            }
            else
            {
                return default(V);
            }
        }
        set
        {
            this.Add(key, value);
        }
    }
}

How do you think HashSet is implemented? The code that you're seeing in Dictionary is going to look very similar to the code inside HashSet. Both are backed by an array that stores a collection of all of the keyed items that share a hash; it's just that one stores a key and a value, and the other stores the key on its own.
If you're asking why the developers of Dictionary re-implemented code similar to what's in HashSet rather than using HashSet internally, we can only guess. They could have done so if they wanted to, in the sense that the results would be functionally identical from the perspective of an outside observer.

The reason to use Dictionary is because it is well written, well tested, is already done, and it works.
Your code has a problem when replacing the value associated with a key that's already been added. The following code:
dict["hi"]=10;
dict["hi"]=4;
Console.WriteLine(dict["hi"]);
will output 10 with your class, because HashSet.Add ignores an element that equals one already in the set. Dictionary will output (correctly) 4.
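One possible way to get replace-on-set behaviour while still storing pairs in a HashSet is sketched below. It is only an illustration: the class name SetBackedDictionary is made up, keys are assumed to be non-null, and the setter simply removes any pair with the same key before adding the new one.

```csharp
using System.Collections.Generic;

// Hypothetical name; a sketch, not a replacement for Dictionary<K, V>.
public class SetBackedDictionary<K, V>
{
    // Pair type whose equality and hash depend only on the key.
    private sealed class KV
    {
        public K Key;
        public V Value;
        public override int GetHashCode() => Key.GetHashCode();
        public override bool Equals(object o) => o is KV other && Key.Equals(other.Key);
    }

    private readonly HashSet<KV> _store = new HashSet<KV>();

    public V this[K key]
    {
        get
        {
            // TryGetValue hands back the stored pair that equals the probe.
            return _store.TryGetValue(new KV { Key = key }, out KV found)
                ? found.Value
                : default(V);
        }
        set
        {
            var pair = new KV { Key = key, Value = value };
            _store.Remove(pair); // drops any pair with the same key (no-op if absent)
            _store.Add(pair);    // the new value is now the one stored
        }
    }
}
```

With this setter, assigning dict["hi"] = 10 and then dict["hi"] = 4 leaves 4 as the stored value.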
As far as the use of arrays, both HashSet and Dictionary use them in their implementations.
HashSet
private int[] m_buckets;
private HashSet<T>.Slot[] m_slots;
Dictionary
private int[] buckets;
private Dictionary<TKey, TValue>.Entry[] entries;
HashSet and Dictionary do not loop through their arrays to find the key/value. They use a modulus of the hashcode value to directly index into the bucket array. The value in the bucket array points into the slots or entries array. Then, they loop over the list of keys that had identical hashcodes or colliding hashcodes (two different hashcodes that result in the same value after the modulus is applied). These little collision lists are in the slots or entries arrays, and are typically very small, usually with just a single element.
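The lookup described above can be sketched roughly as follows. This is a simplified illustration, not the framework source: TinyMap, its field names and the growth policy are all assumptions, removal is left out, and keys are assumed to be non-null.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical, simplified map using a bucket array that points into an entries array.
public class TinyMap<TKey, TValue>
{
    private struct Entry
    {
        public int HashCode; // cached non-negative hash of the key
        public int Next;     // index of the next entry in the same bucket, or -1
        public TKey Key;
        public TValue Value;
    }

    private int[] _buckets;   // bucket -> (index of first entry + 1); 0 means empty
    private Entry[] _entries;
    private int _count;

    public TinyMap(int capacity = 8) // capacity is assumed to be positive
    {
        _buckets = new int[capacity];
        _entries = new Entry[capacity];
    }

    public void Set(TKey key, TValue value)
    {
        int hash = key.GetHashCode() & 0x7FFFFFFF;
        int i = FindIndex(key, hash);
        if (i >= 0) { _entries[i].Value = value; return; } // replace existing value
        if (_count == _entries.Length) Grow();
        int bucket = hash % _buckets.Length;               // modulus picks the bucket
        _entries[_count] = new Entry
        {
            HashCode = hash, Next = _buckets[bucket] - 1, Key = key, Value = value
        };
        _buckets[bucket] = ++_count;                       // new entry heads the chain
    }

    public bool TryGet(TKey key, out TValue value)
    {
        int i = FindIndex(key, key.GetHashCode() & 0x7FFFFFFF);
        value = i >= 0 ? _entries[i].Value : default(TValue);
        return i >= 0;
    }

    // Walks only the short chain of entries that landed in the same bucket.
    private int FindIndex(TKey key, int hash)
    {
        for (int i = _buckets[hash % _buckets.Length] - 1; i >= 0; i = _entries[i].Next)
        {
            if (_entries[i].HashCode == hash &&
                EqualityComparer<TKey>.Default.Equals(_entries[i].Key, key))
                return i;
        }
        return -1;
    }

    // Doubles the arrays and rebuilds the bucket chains from the cached hashes.
    private void Grow()
    {
        int newSize = _entries.Length * 2;
        Array.Resize(ref _entries, newSize);
        _buckets = new int[newSize];
        for (int i = 0; i < _count; i++)
        {
            int bucket = _entries[i].HashCode % newSize;
            _entries[i].Next = _buckets[bucket] - 1;
            _buckets[bucket] = i + 1;
        }
    }
}
```

A set built the same way would keep the bucket and chain logic and simply drop the Value field.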
Why isn't Dictionary just implemented on top of HashSet? Because the two classes do two different things. HashSet is geared towards storing a set of unique keys. Dictionary is geared towards storing values associated with unique keys. You tried to use a HashSet to store a value by embedding it in the key (which is an object). But I pointed out why that fails to work: HashSet doesn't entertain the concept of a value. It cares only for the key. So it's not suited to being used as a dictionary. Now, you could use Dictionary to implement a HashSet, but that would be wasteful, as there is code and memory in Dictionary dedicated to handling the values. These are two classes, each made to fulfill a specific purpose. They are similar, but not the same.

What is advantage of ... us[ing] the array for storage, and loop[ing] through the array to find the matching keyvalues[?]
I can answer this from a Java perspective. I think it's very similar in C#.
The Big O time complexity of a get from a hashset is O(1), while an array is O(n). Naively, one might think the hashset would perform better. But it's not that simple. Computing a hash code is relatively expensive, and each class provides its own hashing algorithm, so the run time and quality of hash distribution can vary widely. (It is inefficient but perfectly legal for a class to return the same hash for every object. Hash based collections storing such objects will degenerate to array performance.)
The upshot of all this is that despite the theoretical performance difference, it turns out that for small collections, which are the vast majority of collections in a typical program, iterating over an array is faster than computing a hash. Google introduced an array based map as an alternative to hashmap in their Android API, and they suggest that the array based version performs better for collections up to around 10 to 100 elements. The uncertain range is because, as I mentioned, the cost of hashing varies.
Bottom line... if performance matters, forget Big O and trust your benchmarks.

The problem with your implementation is that a HashSet only stores a single entry for a given key, where your KV class decides what counts as the same key through GetHashCode and Equals. So if the caller adds two entries to your dictionary whose keys compare equal, only the first is stored and the second is ignored.
A dictionary is typically implemented as a list of entries that match the hash value, that way you can have multiple entries with the same hash value. This does make it more complicated because when adding/removing/looking up you need to handle the list.

Related

Use List or Dictionary?

I have a program where I need to store a list of variables.
Each variable has a name and a value, and I want to write a function that takes the name of a variable and returns its value:
object getValue(string name);
To do that I have two choices:
1: Store the variables in a Dictionary, and then the function getValue would just fetch the variable whose key is the name I am looking for:
object getValue (string name)
{
    return variablesDictionary[name].Value;
}
2: Store the variables in a list and then access the wanted variable through LINQ:
object getValue (string name)
{
    return variablesList.First(v => v.Name == name).Value;
}
Both are very simple, but the second one (LINQ) seems more compelling because it uses LINQ, and also because in the first method the same name is stored in two different places, which is redundant.
What is the best method with respect to best practices and performance?
Thanks
Using a dictionary is way faster than using a list, at least in any case when it matters.
The performance of a dictionary lookup is O(1), while a list search is O(n). That means that it takes the same time to find the item in the dictionary with few items as with many items, but finding them in the list takes longer the more items that you have.
For very small sets of variables the list may be slightly faster, but then they are both so fast that it doesn't matter. With many items the dictionary clearly outperforms the list.
The dictionary uses a bit more memory, but not so much. Remember that it will only store the reference to the name, it's not another copy of the string.
You should definitely use a Dictionary in this case. If you use a List and want to do a lookup, in the worst case the program has to loop over the entire list to find the right object. For a Dictionary, a lookup takes constant time on average, irrespective of its size.
By the way, 'uses LINQ' is not a good reason to prefer one method over the other.
What you're trying to do is exactly what dictionaries were designed for.
Technically, any dictionary that uses an object property as a key to that object will be "redundant" as you describe, but because string is a reference type, it's not like you'll be using up a huge amount of memory to store the "redundant" key.
At the cost of a few extra bytes, you get a huge performance increase. The thing that makes dictionaries so cool is that they're hash tables, so they'll always look up an element from a key quickly, no matter how big they are. But if you use a list and try to iterate over it with LINQ, you might have 10,000 items in the list and the one you're looking for is at the end, and it will take approximately 10,000 times longer than looking it up with a Dictionary. (For a more formal look at the math involved, try Googling "Big O notation" and "time complexity". It's a very useful bit of theory to know about when developing software!)
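To make the comparison concrete, here is a small sketch that keeps the same variables in both a Dictionary and a List. The Variable and VariableStore types are invented for the illustration, and names are assumed to be unique.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical variable type with a name and a value.
public class Variable
{
    public string Name { get; set; }
    public object Value { get; set; }
}

public class VariableStore
{
    private readonly Dictionary<string, Variable> _byName = new Dictionary<string, Variable>();
    private readonly List<Variable> _list = new List<Variable>();

    public void Add(Variable v)
    {
        _byName[v.Name] = v; // stores a reference to the same string, not a copy
        _list.Add(v);
    }

    // Hash lookup: constant time on average, regardless of how many variables exist.
    public object GetValueFromDictionary(string name)
    {
        return _byName.TryGetValue(name, out Variable v) ? v.Value : null;
    }

    // Linear scan: may have to visit every element before finding (or missing) the name.
    public object GetValueFromList(string name)
    {
        return _list.FirstOrDefault(v => v.Name == name)?.Value;
    }
}
```

Both methods return the same result; only the amount of work per lookup differs as the collection grows.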

Adding Complex Keys in the Dictionary! Does it affect the performance

I am writing a program that requires a Dictionary like this (in C#, .NET 4.5):
Dictionary<List<Button>, String> Label_Group = new Dictionary<List<Button>, String>();
My friend suggests using a string as the key. Does this make any difference in performance when doing the search?
I am just curious how it works.
In fact, the lookup based on a List<Button> will be faster than based on a string, because List<T> doesn't override Equals. Your keys will just be compared by reference effectively - which is blazingly cheap.
Compare that with using a string as a key:
The hash code needs to be computed, which is non-trivial
Each string comparison performs an equality check, which has a short cut for equal references (and probably different lengths), but will otherwise need to compare each character until it finds a difference or reaches the end
(Taking the hash code of an object of a type which doesn't override GetHashCode may require allocation of a SyncBlock - I think it used to. That may be more expensive than hashing very short strings...)
It's rarely a good idea to use a collection as a dictionary key though - and if you need anything other than reference equality for key comparisons, you'll need to write your own IEqualityComparer<>.
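If content-based comparison of list keys is what you need, a comparer along these lines could be passed to the Dictionary constructor. This is a sketch: ListComparer is a made-up name, the hash combination is one simple choice among many, and hashing walks the whole list, so lookups become more expensive than with reference equality.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical comparer that treats two lists as equal when their elements match in order.
public sealed class ListComparer<T> : IEqualityComparer<List<T>>
{
    public bool Equals(List<T> x, List<T> y)
    {
        if (ReferenceEquals(x, y)) return true;
        if (x is null || y is null) return false;
        return x.SequenceEqual(y);
    }

    public int GetHashCode(List<T> list)
    {
        unchecked
        {
            // Combine element hashes so that equal sequences produce equal hashes.
            int hash = 17;
            foreach (T item in list)
                hash = hash * 31 + (item == null ? 0 : item.GetHashCode());
            return hash;
        }
    }
}
```

Usage would look like new Dictionary<List<string>, string>(new ListComparer<string>()). The lists must not be modified while they serve as keys, because their hash would change.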
As far as I know, List<T> does not override GetHashCode, so its use as a key would have similar performance to using an object.

HashSets don't keep the elements unique if you mutate their identity

When working with HashSets in C#, I recently came across an annoying problem: HashSets don't guarantee uniqueness of the elements; they are not sets. What they do guarantee is that when Add(T item) is called, the item is not added if item.Equals(that) is true for any item already in the set. This no longer holds if you manipulate items already in the set. A small program that demonstrates this (copy-pasted from my LINQPad):
void Main()
{
    HashSet<Tester> testset = new HashSet<Tester>();
    testset.Add(new Tester(1));
    testset.Add(new Tester(2));
    foreach(Tester tester in testset){
        tester.Dump();
    }
    foreach(Tester tester in testset){
        tester.myint = 3;
    }
    foreach(Tester tester in testset){
        tester.Dump();
    }
    HashSet<Tester> secondhashset = new HashSet<Tester>(testset);
    foreach(Tester tester in secondhashset){
        tester.Dump();
    }
}
class Tester{
    public int myint;
    public Tester(int i){
        this.myint = i;
    }
    public override bool Equals(object o){
        if (o== null) return false;
        Tester that = o as Tester;
        if (that == null) return false;
        return (this.myint == that.myint);
    }
    public override int GetHashCode(){
        return this.myint;
    }
    public override string ToString(){
        return this.myint.ToString();
    }
}
It will happily manipulate the items in the collection to be equal, only filtering them out when a new HashSet is built. What is advisable when I want to work with sets where I need to know the entries are unique? Roll my own, where Add(T item) adds a copy of the item, and the enumerator enumerates over copies of the contained items? This presents the challenge that every contained element should be deep-copyable, at least in the members that influence its equality.
Another solution would be to roll my own set that only accepts elements that implement INotifyPropertyChanged, and to act on the event to re-check for equality, but this seems severely limiting, not to mention a whole lot of work and performance loss under the hood.
Yet another possible solution I thought of is making sure that all fields are readonly or const and set only in the constructor. All solutions seem to have very large drawbacks. Do I have any other options?
You're really talking about object identity. If you're going to hash items they need to have some kind of identity so they can be compared.
If that changes, it is not a valid identity method. You currently have public int myint. It really should be readonly, and only set in the constructor.
If two objects are conceptually different (i.e. you want to treat them as different in your specific design) then their hash code should be different.
If you have two objects with the same content (i.e. two value objects that have the same field values) then they should have the same hash codes and should be equal.
If your data model says that you can have two objects with the same content but they can't be equal, you should use a surrogate id, not hash the contents.
Perhaps your objects should be immutable value types so the object can't change
If they are mutable types, you should assign a surrogate ID (i.e. one that is introduced externally, like an increasing counter id or using the object's hashcode) that never changes for the given object
This is a problem with your Tester objects, not the set. You need to think hard about how you define identity. It's not an easy problem.
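As a minimal sketch of the readonly suggestion, the Tester class could be reshaped as below. ImmutableTester is a made-up name; the only point is that the value taking part in Equals and GetHashCode cannot change once the object has been constructed.

```csharp
using System.Collections.Generic;

// Hypothetical immutable variant of Tester.
public sealed class ImmutableTester
{
    public int MyInt { get; } // get-only: can be assigned only in the constructor

    public ImmutableTester(int value)
    {
        MyInt = value;
    }

    public override bool Equals(object o) => o is ImmutableTester that && MyInt == that.MyInt;

    public override int GetHashCode() => MyInt;

    public override string ToString() => MyInt.ToString();
}
```

Because the identity can no longer be mutated, two elements that were distinct when added to a HashSet stay distinct for as long as they are in it.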
When I need a 1-dimensional collection of guaranteed unique items I usually go with Dictionary<TKey, TValue>: you cannot add elements with the same Key, plus I usually need to attach some properties to the items and the Value comes in handy (my go-to value type is Tuple<> for many values...).
Of course, it's neither the most performant nor the least memory-hungry solution, but I don't usually have performance/memory concerns.
You should implement your own IEqualityComparer and pass it to the constructor of the HashSet to ensure you get the desired equality comparer.
And as Joe said, if you want the collection to remain unique even beyond .Add(T item) you need to use ValueObjects that are created by the constructor and have no publicly visible set attributes.
i.e.

When do we use HashSet<> [duplicate]

I have a small sample.
// Class
public class GetEntity
{
    public string name1 { get; set; }
    public string name2 { get; set; }
    public GetEntity() { }
}
and:
public void GetHash()
{
    HashSet<GetEntity> objHash = new HashSet<GetEntity>();
    GetEntity obj = new GetEntity();
    obj.name1 = "Ram";
    obj.name2 = "Shyam";
    objHash.Add(obj);
    foreach (GetEntity objEntity in objHash)
    {
        Label2.Text = objEntity.name1.ToString() + objEntity.name2.ToString();
    }
}
The code works fine. The same task can be done with a Dictionary or a List. But I want to know when to use HashSet<>, Dictionary<> or List<>. Is it only a matter of performance, or is there something else that I don't understand? Thanks.
i want to know when we use HashSet<> , Dictionary<> or <List>
They all have different purposes and are used in different scenarios.
HashSet
Is used when you want to have a collection with unique elements. HashSet stores list of unique elements and won't allow duplicates in it.
Dictionary
Is used when you want to have a value against a unique key. Each element in Dictionary has two parts a (unique) key and a value. You can store a unique key in it (just like Hashset) in addition you can store a value against that unique key.
List
Is just a simple collection of elements. You can have duplicates in it.
Set does not contain duplicated values.
I am not a C# guy myself, but the following should be the difference.
Please correct me if I am wrong.
HashSet will only take unique values; there is no access by index, but checking whether a value is present works in constant time on average.
Dictionary will take key/value pairs; values can be accessed randomly by key, and keys cannot be duplicated. This is also a very fast data structure and works in constant time on average.
List will take n values even if they are not unique; elements can be accessed by index in constant time, but searching for a particular value is O(n) in the worst case.
They are all called collections, which are usually in the namespace System.Collections.Generic.
When to use a certain data structure essentially requires understanding what operations they support. For HashSet, it's basically a Set in mathematics, and supports efficient Add, Remove, and quick judgement on whether an element Exists in the set. Given it's a set, the elements must be unique in Hashsets.
For Dictionary, it's basically a mapping structure, i.e. a set of Key-Value Pairs. Dictionary provides efficient Query on key-value pairs with a given key and also Add/Remove of the key-value pairs.
Lists are an ordered collection of elements. Unlike Hashsets, judging the existence of an element in a list is inefficient. And unlike Dictionaries, the internal data structure is not key-value pairs but simple objects. Another difference is you can use index (like list[3]) to efficiently access elements in Lists. (although it's not true for LinkedList)
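A short sketch of the operation each collection is good at. The CollectionChoice class and its method names are invented for the illustration.

```csharp
using System.Collections.Generic;

public static class CollectionChoice
{
    // HashSet: uniqueness and fast membership tests.
    public static bool SeenBefore(HashSet<string> seen, string name)
    {
        return !seen.Add(name); // Add returns false when the element is already present
    }

    // Dictionary: a value stored against a unique key.
    public static int CountOccurrence(Dictionary<string, int> counts, string name)
    {
        counts.TryGetValue(name, out int current); // current is 0 when the key is absent
        counts[name] = current + 1;
        return counts[name];
    }

    // List: ordered, duplicates allowed, efficient access by index.
    public static string Last(List<string> items)
    {
        return items[items.Count - 1];
    }
}
```

The choice is mostly about which of these operations the program needs, and only secondarily about raw performance.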

What does .NET Dictionary<T,T> use to hash a reference?

So I'm thinking of using a reference type as a key to a .NET Dictionary...
Example:
class MyObj
{
    private int mID;
    public MyObj(int id)
    {
        this.mID = id;
    }
}
// whatever code here
static void Main(string[] args)
{
    Dictionary<MyObj, string> dictionary = new Dictionary<MyObj, string>();
}
My question is, how is the hash generated for custom objects (ie not int, string, bool etc)? I ask because the objects I'm using as keys may change before I need to look up stuff in the Dictionary again. If the hash is generated from the object's address, then I'm probably fine... but if it is generated from some combination of the object's member variables then I'm in trouble.
EDIT:
I should've originally made it clear that I don't care about the equality of the objects in this case... I was merely looking for a fast lookup (I wanted to do a 1-1 association without changing the code of the classes involved).
Thanks
The default implementation of GetHashCode/Equals basically deals with identity. You'll always get the same hash back from the same object, and it'll probably be different to other objects (very high probability!).
In other words, if you just want reference identity, you're fine. If you want to use the dictionary treating the keys as values (i.e. using the data within the object, rather than just the object reference itself, to determine the notion of equality) then it's a bad idea to mutate any of the equality-sensitive data within the key after adding it to the dictionary.
The MSDN documentation for object.GetHashCode is a little bit overly scary - basically you shouldn't use it for persistent hashes (i.e. saved between process invocations) but it will be consistent for the same object, which is all that's required for it to be a valid hash for a dictionary. While it's not guaranteed to be unique, I don't think you'll run into enough collisions to cause a problem.
The hash used is the return value of the GetHashCode method on the object. By default this is essentially a value representing the reference. It is not guaranteed to be unique for an object, and in fact likely won't be in many situations. But the value for a particular reference will not change over the lifetime of the object even if you mutate it. So for this particular sample you will be OK.
In general though, it is a very bad idea to use objects which are not immutable as keys to a Dictionary. It's way too easy to fall into a trap where you override Equals and GetHashcode on an object and break code where the type was formerly used as a key in a Dictionary.
The dictionary will use the GetHashCode method defined on System.Object, which will not change over the object's lifetime regardless of field changes etc. So you won't encounter problems in the scenario you describe.
You can override GetHashCode, and should do so if you override Equals so that objects which are equal also return the same hash code. However, if the type is mutable then you must be aware that if you use the object as a key of a dictionary you will not be able to find it again if it is subsequently altered.
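A small sketch of the scenario from the question, assuming a key type that does not override Equals or GetHashCode. MutableKey and ReferenceKeyDemo are made-up names.

```csharp
using System.Collections.Generic;

// Hypothetical key type: mutable, and deliberately without Equals/GetHashCode overrides.
public class MutableKey
{
    public int Id;
}

public static class ReferenceKeyDemo
{
    // Returns true when the mutated key is still found and a look-alike object is not.
    public static bool StillFoundAfterMutation()
    {
        var key = new MutableKey { Id = 1 };
        var map = new Dictionary<MutableKey, string> { [key] = "payload" };

        key.Id = 99; // changing a field does not affect the default, reference-based hash

        bool sameObjectFound = map.ContainsKey(key);
        bool lookAlikeFound = map.ContainsKey(new MutableKey { Id = 99 });
        return sameObjectFound && !lookAlikeFound;
    }
}
```

The lookup succeeds only with the original object, which matches the 1-1 association the question asks for.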
The default implementation of the GetHashCode method does not guarantee unique return values for different objects. Furthermore, the .NET Framework does not guarantee the default implementation of the GetHashCode method, and the value it returns will be the same between different versions of the .NET Framework. Consequently, the default implementation of this method must not be used as a unique object identifier for hashing purposes.
http://msdn.microsoft.com/en-us/library/system.object.gethashcode.aspx
For a custom object, derived from Object, the hash code is generated at the beginning of the object's life and remains the same throughout the lifetime of the instance. Furthermore, the hash code does not change when the values of internal fields or properties change.
A Hashtable/Dictionary does not use GetHashCode as a unique identifier; it only uses it to select a "hash bucket". For example, the strings "aaa123" and "aaa456" may both have the hash "aaa", in which case all objects with the hash "aaa" are stored in one bucket. Whenever you insert or retrieve an object, the Dictionary calls GetHashCode to determine the bucket and then compares the individual objects within that bucket.
A custom object used as a Dictionary key should be thought of this way: the Dictionary only stores its reference (address or memory pointer) and doesn't know its contents; the contents of the object may change, but the reference never changes. This also means that if two objects are exact replicas of each other but are different objects in memory, your hashtable will not consider them the same, because their references are different.
If this causes you problems, the best way to guarantee equality based on the ID is to override Equals, together with GetHashCode so that equal objects land in the same bucket, as follows:
class MyObj
{
    private int mID;
    public MyObj(int id)
    {
        this.mID = id;
    }
    public override bool Equals(object obj)
    {
        MyObj mobj = obj as MyObj;
        if (mobj == null)
            return false;
        return this.mID == mobj.mID;
    }
    // Equal objects must return equal hash codes.
    public override int GetHashCode()
    {
        return mID;
    }
}
