When working with HashSets in C#, I recently came across an annoying problem: HashSets don't guarantee uniqueness of their elements; they are not true sets. What they do guarantee is that when Add(T item) is called, the item is not added if, for any existing element in the set, item.Equals(element) is true. This no longer holds if you mutate items that are already in the set. A small program that demonstrates the problem (pasted from LINQPad):
void Main()
{
    HashSet<Tester> testset = new HashSet<Tester>();
    testset.Add(new Tester(1));
    testset.Add(new Tester(2));
    foreach (Tester tester in testset)
    {
        tester.Dump();
    }
    foreach (Tester tester in testset)
    {
        tester.myint = 3;
    }
    foreach (Tester tester in testset)
    {
        tester.Dump();
    }
    HashSet<Tester> secondhashset = new HashSet<Tester>(testset);
    foreach (Tester tester in secondhashset)
    {
        tester.Dump();
    }
}
class Tester
{
    public int myint;

    public Tester(int i)
    {
        this.myint = i;
    }

    public override bool Equals(object o)
    {
        Tester that = o as Tester; // also handles null: "null as Tester" is null
        if (that == null) return false;
        return this.myint == that.myint;
    }

    public override int GetHashCode()
    {
        return this.myint;
    }

    public override string ToString()
    {
        return this.myint.ToString();
    }
}
It will happily let the items in the collection be mutated into equality, only filtering them out when a new HashSet is built. What is advisable when I want to work with sets where I need to know the entries are unique? Roll my own, where Add(T item) adds a copy of the item and the enumerator enumerates over copies of the contained items? This presents the challenge that every contained element must be deep-copyable, at least in the fields that influence its equality.
Another solution would be to roll my own that only accepts elements implementing INotifyPropertyChanged, and to re-check equality when the event fires, but this seems severely limiting, not to mention a whole lot of work and performance loss under the hood.
Yet another possible solution I thought of is making sure that all fields that participate in equality are readonly or const, set only in the constructor. All solutions seem to have very large drawbacks. Do I have any other options?
You're really talking about object identity. If you're going to hash items, they need to have some kind of identity so they can be compared.
If that identity can change, it is not a valid basis for hashing. You currently have public int myint; it really should be readonly, and set only in the constructor.
If two objects are conceptually different (i.e. you want to treat them as different in your specific design) then their hash code should be different.
If you have two objects with the same content (i.e. two value objects that have the same field values) then they should have the same hash codes and should be equal.
If your data model says that you can have two objects with the same content but they can't be equal, you should use a surrogate id, not hash the contents.
Perhaps your objects should be immutable value types, so an object can't change.
If they are mutable types, you should assign a surrogate ID (i.e. one that is introduced externally, like an increasing counter, or the object's default identity hash) that never changes for the given object.
This is a problem with your Tester objects, not the set. You need to think hard about how you define identity. It's not an easy problem.
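For example, a minimal sketch of an immutable version of the question's Tester class (this is my illustration, not the only way to do it):

class Tester
{
    public readonly int myint; // can only be assigned in the constructor

    public Tester(int i)
    {
        this.myint = i;
    }

    public override bool Equals(object o)
    {
        Tester that = o as Tester;
        return that != null && this.myint == that.myint;
    }

    public override int GetHashCode()
    {
        return this.myint; // safe: myint can never change after construction
    }
}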
When I need a one-dimensional collection of guaranteed-unique items I usually go with Dictionary<TKey, TValue>: you cannot add elements with the same Key, plus I usually need to attach some properties to the items, and the Value comes in handy (my go-to value type is Tuple<> when there are several values...).
Of course, it's not the most performant nor the least memory-hungry solution, but I don't usually have performance/memory concerns.
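For instance, a minimal sketch (the key is the unique item; the value carries whatever extra properties I need):

var seen = new Dictionary<string, Tuple<int, DateTime>>();
seen.Add("alpha", Tuple.Create(1, DateTime.UtcNow));
seen.Add("alpha", Tuple.Create(2, DateTime.UtcNow)); // throws ArgumentException: key already present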
You should implement your own IEqualityComparer<T> and pass it to the constructor of the HashSet to get the equality semantics you want.
And as Joe said, if you want the collection to remain unique even beyond Add(T item), you need to use value objects that are fully initialized by the constructor and expose no publicly visible setters.
i.e. something along these lines (a sketch; TesterComparer is a name I'm introducing for illustration):
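class TesterComparer : IEqualityComparer<Tester>
{
    public bool Equals(Tester x, Tester y)
    {
        if (ReferenceEquals(x, y)) return true;
        if (x == null || y == null) return false;
        return x.myint == y.myint;
    }

    public int GetHashCode(Tester obj)
    {
        return obj.myint;
    }
}

var set = new HashSet<Tester>(new TesterComparer());

Note that a comparer alone still cannot protect you from items being mutated after they are added; that is what the immutable value objects are for.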
For a .NET Core Web API GET method I need to calculate the ETag for a returned List<T>. T is a DTO in the form of a record that holds only primitive types.
I wanted to calculate a hash of the list, and went searching for information about how GetHashCode() is implemented, but couldn't find any. The documentation of object.GetHashCode() says nothing about lists or collections. Running the code, I observed that the same list data produces a different hash code on each run, so I concluded that GetHashCode() uses reference-based values for reference-type items.
A record's GetHashCode() computes the hash from its member values, so I created the list's hash code by looping over the list items:
List<GetGroupsDTO> dtoList = commandResult.Value;
int hash = 17;
foreach (GetGroupsDTO dto in dtoList)
{
    hash = hash * 23 + dto.GetHashCode(); // classic 17/23 hash combine; overflow wraps (unchecked by default)
}
string eTagPayload = hash.ToString().SurroundWithDoubleQuotes();
I don't want to do this for every List<T>, of course. I thought about overriding GetHashCode(), but I'm struggling with it: I don't know how to override it for the generic List. I could derive a new class DTOList and override GetHashCode() there, but this leads to more complexity in other places; since the result of an EF Core set query fills the list, I would need a custom converter and then a custom serializer to return the list from the Web API.
Therefore I wonder whether I should rather create an extension method for List, or just a function that takes a List as an argument. Is there any other option for calculating the ETag? How can I calculate the ETag for a list of DTO objects efficiently?
A little extension method and HashCode could help with this:
internal static class EnumerableExtensions {
public static int GetCombinedHashCode<T>(this IEnumerable<T> source) =>
source.Aggregate(typeof(T).GetHashCode(), (hash, t) => HashCode.Combine(hash, t));
}
Seeding the hash with typeof(T).GetHashCode is rather arbitrary, but it ensures that empty collections of different types do not all "look equal", since they would not normally compare equal either. Whether this matters or is even desirable will depend on your scenario.
Of course the result of this is only usable if T has a meaningful GetHashCode implementation, but that's true of hashes in general. For extra peace of mind a where T : IEquatable<T> constraint could be added, although that's not the standard approach for methods involving hashes. Adding the ability to use a custom IEqualityComparer<T> for the hash is left as an exercise.
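Usage for the ETag scenario might then look like this (a sketch; SurroundWithDoubleQuotes is the question's own helper):

List<GetGroupsDTO> dtoList = commandResult.Value;
string eTagPayload = dtoList.GetCombinedHashCode().ToString().SurroundWithDoubleQuotes();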
I was asked a question today to re-implement the dictionary. My solution is to use a HashSet as the storage, and a class to represent the KeyValue pair. In this class, I override the GetHashCode and Equals methods in order to add the KeyValue pair instance to the HashSet.
I then read the source code for the C# Dictionary and found it uses an array for storage and loops through that array to find matching key/values.
Is my approach correct? What is the advantage of the current Dictionary implementation in C#? Thanks in advance.
public class MyDictionary<K, V>
{
    private class KV
    {
        public K Key { get; set; }
        public V Value { get; set; }

        public override int GetHashCode()
        {
            return Key.GetHashCode();
        }

        public override bool Equals(object o)
        {
            var obj = ((KV)o).Key;
            return Key.Equals(obj);
        }
    }

    private readonly HashSet<KV> _store = new HashSet<KV>();

    public void Add(K key, V value)
    {
        _store.Add(new KV { Key = key, Value = value });
    }

    public V this[K key]
    {
        get
        {
            KV _kv;
            if (_store.TryGetValue(new KV { Key = key }, out _kv))
            {
                return _kv.Value;
            }
            else
            {
                return default(V);
            }
        }
        set
        {
            this.Add(key, value);
        }
    }
}
How do you think HashSet is implemented? The code that you're seeing in Dictionary is going to look very similar to the code that's internally in HashSet. Both are backed by arrays that group the stored items by hash; it's just that one stores a key and a value, and one stores the key on its own.
If you're just asking why the developer of Dictionary wrote code similar to what's in HashSet rather than using HashSet internally, we can only guess. They naturally could have, in the sense that functionally identical results could be produced from the perspective of an outside observer.
The reason to use Dictionary is because it is well written, well tested, is already done, and it works.
Your code has a problem when replacing the value associated with a key that's already been added. The following code:
dict["hi"]=10;
dict["hi"]=4;
Console.WriteLine(dict["hi"]);
will output 10 with your class. Dictionary will output (correctly) 4.
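A minimal fix, staying with your HashSet-backed approach, would be to evict any existing entry in the indexer's setter (sketch):

set
{
    _store.Remove(new KV { Key = key }); // relies on the key-based Equals/GetHashCode
    _store.Add(new KV { Key = key, Value = value });
}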
As far as the use of arrays, both HashSet and Dictionary use them in their implementations.
HashSet
private int[] m_buckets;
private HashSet<T>.Slot[] m_slots;
Dictionary
private int[] buckets;
private Dictionary<TKey, TValue>.Entry[] entries;
HashSet and Dictionary do not loop through their arrays to find the key/value. They use a modulus of the hashcode value to directly index into the bucket array. The value in the bucket array points into the slots or entries array. Then, they loop over the list of keys that had identical hashcodes or colliding hashcodes (two different hashcodes that result in the same value after the modulus is applied). These little collision lists are in the slots or entries arrays, and are typically very small, usually with just a single element.
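A toy snippet showing the bucketing idea (a sketch of the concept, not the actual BCL source):

int bucketCount = 7;                           // length of the bucket array
int hash = "hello".GetHashCode() & 0x7FFFFFFF; // clear the sign bit so the modulus is non-negative
int bucket = hash % bucketCount;               // direct index into the buckets, no scanning
Console.WriteLine(bucket);                     // some value in 0..6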
Why isn't Dictionary just implemented on top of HashSet? Because the two classes do two different things. HashSet is geared towards storing a set of unique keys. Dictionary is geared towards storing values associated with unique keys. You tried to use a HashSet to store a value by embedding it in the key (which is an object), but I pointed out why that fails: HashSet doesn't entertain the concept of a value. It cares only about the key, so it's not suited to being used as a dictionary. Now, you could use a Dictionary to implement a HashSet, but that would be wasteful, as there is code and memory in Dictionary dedicated to handling the values. There are two classes because each is made to fulfill a specific purpose. They are similar, but not the same.
What is advantage of ... us[ing] the array for storage, and loop[ing] through the array to find the matching keyvalues[?]
I can answer this from a Java perspective. I think it's very similar in C#.
The Big O time complexity of a get from a hashset is O(1), while an array is O(n). Naively, one might think the hashset would perform better. But it's not that simple. Computing a hash code is relatively expensive, and each class provides its own hashing algorithm, so the run time and quality of hash distribution can vary widely. (It is inefficient but perfectly legal for a class to return the same hash for every object. Hash based collections storing such objects will degenerate to array performance.)
The upshot of all this is that despite the theoretical performance difference, it turns out that for small collections, which are the vast majority of collections in a typical program, iterating over an array is faster than computing a hash. Google introduced an array based map as an alternative to hashmap in their Android API, and they suggest that the array based version performs better for collections up to around 10 to 100 elements. The uncertain range is because, as I mentioned, the cost of hashing varies.
Bottom line... if performance matters, forget Big O and trust your benchmarks.
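For example, a crude Stopwatch sketch (LINQPad-style; assumes System.Linq and System.Diagnostics; the numbers vary by machine and are only meant to illustrate "measure, don't guess"):

var array = Enumerable.Range(0, 10).ToArray();
var set = new HashSet<int>(array);
var sw = Stopwatch.StartNew();
for (int i = 0; i < 10000000; i++) array.Contains(7); // linear scan of 10 elements
Console.WriteLine(sw.ElapsedMilliseconds);
sw.Restart();
for (int i = 0; i < 10000000; i++) set.Contains(7);   // hash computation + bucket lookup
Console.WriteLine(sw.ElapsedMilliseconds);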
The problem with your implementation is that a HashSet stores only a single entry per distinct key, where "distinct" is determined by your Equals/GetHashCode. So if the caller adds two entries to your dictionary whose keys compare equal, only the first is stored; the second is silently ignored.
A dictionary is typically implemented so that each hash bucket holds a list of the entries whose hashes collide, letting multiple entries with the same hash value coexist. This does make it more complicated, because adding, removing, and looking up all need to handle that list.
I am porting something from Java to C#. In Java the hash code of an ArrayList depends on the items in it. In C# I always get the same hash code from a List...
Why is this?
For some of my objects the hash code needs to differ, because the objects in their list property make the objects non-equal. I would expect a hash code to always be unique to the object's state and to equal another hash code only when the objects are equal. Am I wrong?
In order to work correctly, hashcodes must be immutable – an object's hash code must never change.
If an object's hashcode does change, any dictionaries containing the object will stop working.
Since collections are mutable, they cannot safely base GetHashCode on their contents.
Instead, they inherit the default GetHashCode, which returns a (hopefully) unique value for each instance of an object, typically based on a memory address.
Hashcodes must depend upon the definition of equality being used so that if A == B then A.GetHashCode() == B.GetHashCode() (but not necessarily the inverse; A.GetHashCode() == B.GetHashCode() does not entail A == B).
By default, the equality definition of a value type is based on its value, and that of a reference type is based on its identity (that is, by default an instance of a reference type is equal only to itself). Hence the default hash code for a value type depends on the values of the fields it contains*, and for a reference type it depends on identity. Indeed, since we ideally want the hash codes of non-equal objects to differ, particularly in the low-order bits (most likely to affect the value after re-hashing), we generally want two equivalent but non-equal objects to have different hashes.
Since an object will remain equal to itself, it should also be clear that this default implementation of GetHashCode() will continue to have the same value, even when the object is mutated (identity does not mutate even for a mutable object).
Now, in some cases reference types (or value types) re-define equality. An example of this is string, where for example "ABC" == "AB" + "C". Though there are two different instances of string compared, they are considered equal. In this case GetHashCode() must be overridden so that the value relates to the state upon which equality is defined (in this case, the sequence of characters contained).
While it is more common to do this with types that are also immutable, for a variety of reasons, GetHashCode() does not depend upon immutability. Rather, GetHashCode() must remain consistent in the face of mutability: change a value that we use in determining the hash, and the hash must change accordingly.
Note, though, that this is a problem if we are using the mutable object as a key into a structure using the hash, as mutating the object changes the position in which it should be stored without moving it to that position (it's also true of any other case where the position of an object within a collection depends on its value - e.g. if we sort a list and then mutate one of the items, the list is no longer sorted). However, this doesn't mean that we must only use immutable objects in dictionaries and hashsets. Rather, it means that we must not mutate an object while it is in such a structure, and making it immutable is a clear way to guarantee this.
Indeed, there are quite a few cases where storing mutable objects in such structures is desirable, and as long as we don't mutate them during this time, this is fine. Since we don't have the guarantee immutability brings, we then want to provide it another way (spending a short time in the collection and being accessible from only one thread, for example).
Hence mutability of key values is one of those cases where something is possible but generally a bad idea. To the person defining the hash-code algorithm, though, it's not for them to assume any such case will always be a bad idea (they don't even know the mutation happened while the object was stored in such a structure); it's for them to implement a hash code defined on the current state of the object, whether calling it at a given point is good or not. Hence, for example, a hash code should not be memoised on a mutable object unless the memo is cleared on every mutation. (It's generally a waste to memoise hashes anyway, as structures that hit the same object's hash code repeatedly will have their own memoisation of it.)
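For example, a sketch of a memoised hash that is invalidated on mutation (usually not worth the bother, as noted):

class Memoised
{
    private int _value;
    private int? _cachedHash;

    public int Value
    {
        get { return _value; }
        set { _value = value; _cachedHash = null; } // every mutation clears the memo
    }

    public override bool Equals(object o)
    {
        var other = o as Memoised;
        return other != null && other._value == _value;
    }

    public override int GetHashCode()
    {
        if (_cachedHash == null)
            _cachedHash = _value; // stand-in for an expensive computation
        return _cachedHash.Value;
    }
}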
Now, in the case in hand, ArrayList operates on the default case of equality being based on identity, e.g.:
ArrayList a = new ArrayList();
ArrayList b = new ArrayList();
for (int i = 0; i != 10; ++i)
{
    a.Add(i);
    b.Add(i);
}
return a == b; // returns false
Now, this is actually a good thing. Why? Well, how do you know in the above that we want to consider a as equal to b? We might, but there are plenty of good reasons for not doing so in other cases too.
What's more, it's much easier to redefine equality from identity-based to value-based than from value-based to identity-based. Finally, there is more than one value-based definition of equality for many objects (the classic case being the different views on what makes strings equal), so there isn't even a single definition that works everywhere. For example:
ArrayList c = new ArrayList();
for (short i = 0; i != 10; ++i)
{
    c.Add(i);
}
If we considered a == b above, should we consider a == c also? The answer depends on just what we care about in the definition of equality we are using, so the framework couldn't know the right answer for all cases, since all cases don't agree.
Now, if we do care about value-based equality in a given case we have two very easy options. The first is to subclass and over-ride equality:
public class ValueEqualList : ArrayList, IEquatable<ValueEqualList>
{
    /*.. most methods left out ..*/

    public bool Equals(ValueEqualList other) // optional but almost always a good idea when we redefine equality
    {
        if (other == null)
            return false;
        if (ReferenceEquals(this, other)) // identity still entails equality, so this is a good shortcut
            return true;
        if (Count != other.Count)
            return false;
        for (int i = 0; i != Count; ++i)
            if (!Equals(this[i], other[i])) // object.Equals, not ==, so boxed values compare by value
                return false;
        return true;
    }

    public override bool Equals(object other)
    {
        return Equals(other as ValueEqualList);
    }

    public override int GetHashCode()
    {
        int res = 0x2D2816FE;
        foreach (var item in this)
        {
            res = res * 31 + (item == null ? 0 : item.GetHashCode());
        }
        return res;
    }
}
This assumes that we will always want to treat such lists this way. We can also implement an IEqualityComparer for a given case:
public class ArrayListEqComp : IEqualityComparer<ArrayList>
{
    // we might also implement the non-generic IEqualityComparer, omitted for brevity

    public bool Equals(ArrayList x, ArrayList y)
    {
        if (ReferenceEquals(x, y))
            return true;
        if (x == null || y == null || x.Count != y.Count)
            return false;
        for (int i = 0; i != x.Count; ++i)
            if (!Equals(x[i], y[i])) // object.Equals, not reference ==
                return false;
        return true;
    }

    public int GetHashCode(ArrayList obj)
    {
        int res = 0x2D2816FE;
        foreach (var item in obj)
        {
            res = res * 31 + (item == null ? 0 : item.GetHashCode());
        }
        return res;
    }
}
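Which can then be handed to any hash-based collection, e.g.:

var set = new HashSet<ArrayList>(new ArrayListEqComp());
var dict = new Dictionary<ArrayList, string>(new ArrayListEqComp());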
In summary:
The default equality definition of a reference type is dependent upon identity alone.
Most of the time, we want that.
When the person defining the class decides that this isn't what is wanted, they can override this behaviour.
When the person using the class wants a different definition of equality again, they can use IEqualityComparer<T> and IEqualityComparer so that dictionaries, hash sets, etc. use their concept of equality.
It's disastrous to mutate an object while it is the key to a hash-based structure. Immutability can be used to ensure this doesn't happen, but is not compulsory, nor always desirable.
All in all, the framework gives us nice defaults and detailed override possibilities.
*There is a bug in the case of a decimal within a struct: a short-cut is used with structs in some cases where it is safe and not in others, but a struct containing a decimal is a case where the short-cut is not safe, yet it is incorrectly identified as a case where it is.
Yes, you are wrong. In both Java and C#, being equal implies having the same hash-code, but the converse is not (necessarily) true.
See GetHashCode for more information.
It is not possible for a hash code to be unique across all variations of most non-trivial classes. In C# the concept of List equality is not the same as in Java (see here), so the hash code implementation is also not the same: it mirrors C# List equality.
You're only partly wrong. You're definitely wrong in thinking that equal hash codes mean equal objects, but equal objects must have equal hash codes, which means that if the hash codes differ, so do the objects.
The core reasons are performance and human nature: people tend to think of hashes as something fast, but a content-based hash normally requires traversing all elements of an object at least once.
Example: if you use a string as a key in a hash table, every query has complexity O(|s|): use strings twice as long and it will cost you at least twice as much. Imagine the key was a full-blown tree (just a list of lists) - oops :-)
If full, deep hash calculation were a standard operation on a collection, an enormous percentage of programmers would just use it unwittingly and then blame the framework and the virtual machine for being slow. For something as expensive as a full traversal, it is crucial that the programmer be aware of the complexity. The only way to achieve that is to make sure you have to write your own. It's a good deterrent as well :-)
Another reason is updating tactics. Calculating and updating a hash on the fly vs. doing the full calculation every time requires a judgement call depending on the concrete case in hand.
Immutability is just an academic cop-out: people use hashes as a way of detecting a change faster (file hashes, for example) and also use hashes for complex structures which change all the time. Hashing has many more uses beyond the 101 basics. The key, again, is that what to use for the hash of a complex object has to be a judgement call on a case-by-case basis.
Using the object's address (actually a handle, so it doesn't change after GC) as a hash is precisely the case where the hash value remains the same for an arbitrarily mutable object :-) The reason C# does it is that it's cheap, and again it nudges people to calculate their own.
"Why" is too philosophical. Create a helper method (maybe an extension method) and calculate the hash code as you like - maybe by XORing the elements' hash codes.
I'm looking for a way to sort a list of objects (of any type) so that, whatever happens to the objects, as long as they are not destroyed, the order stays the same (so the hash code isn't a good idea, because in some classes it changes over time). For that reason I was thinking of using the address of the object in memory, but I'm not sure that it always stays the same (can the address change during a garbage collection, for instance?). In short, I'm looking for properties of objects (of any type) that stay the same as long as the object isn't destroyed. Are there any? And if yes, what are they?
Yes, objects can be moved around in memory by the garbage collector, unless you specifically ask it not to (and it's generally recommended to let the GC do its thing).
What you need here is a side table: create a dictionary keyed by the objects themselves, and for the value put anything you like (the original hash code of the object, or even a random number). When you sort, sort by that side value. Now if, for example, object a has the value 1 in this dictionary, it will always sort first, regardless of what changes are made to a: you look in your side dictionary for the key, and the code for a doesn't know to go there and change it (and of course you are careful to keep that data immutable). You can use weak references to make sure your dictionary entries go away when there is no other reference to the object.
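A sketch of such a side table using ConditionalWeakTable, which holds its keys weakly so entries vanish along with the objects (the class and member names here are my own invention):

using System.Runtime.CompilerServices;
using System.Threading;

static class StableOrder
{
    private static readonly ConditionalWeakTable<object, object> ids =
        new ConditionalWeakTable<object, object>();
    private static int next;

    public static int IdOf(object o)
    {
        // GetValue creates the entry on first sight; the boxed int never changes afterwards
        return (int)ids.GetValue(o, key => (object)Interlocked.Increment(ref next));
    }
}

// usage: list.Sort((a, b) => StableOrder.IdOf(a).CompareTo(StableOrder.IdOf(b)));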
Updated given the detail now added to the question (comments); just take a copy of the list contents before you sort it...
No, the address is not fixed, and for arbitrary objects there is no sensible way of doing this. For your own objects you could add something common, like:
interface ISequence { int Order { get; } }

static class Sequence
{
    private static int next;
    public static int Next()
    {
        return Interlocked.Increment(ref next);
    }
}

class Foo : ISequence
{
    private readonly int sequence;
    int ISequence.Order { get { return sequence; } }
    public Foo()
    {
        sequence = Sequence.Next();
    }
}
A bit scrappy, but it should work, and could be used in a base-class. The Order is now non-changing and sequential. But only AppDomain-specific, and not all serialization APIs will respect it (you'd need to use serialization-callbacks to initialize the sequence in such cases).
Sorting by memory address would only be possible for reference types anyway, so not every type could be sorted this way; primitive types and structs could not.
The other way is to depend on an interface that requires each instance to return a Guid, created in the constructor and never changed:
public interface ISortable
{
    Guid SortId { get; }
}

class Foo : ISortable
{
    public Foo()
    {
        SortId = Guid.NewGuid();
    }

    public Guid SortId { get; private set; }
}
The advantage of a Guid is that it can be created independently in each class. You need no synchronization; you just give every instance an id.
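Sorting then becomes a one-liner (assuming System.Linq and a sequence of ISortable items):

var ordered = items.OrderBy(i => i.SortId).ToList();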
By the way: if you are using the objects as keys in a Dictionary, they must not change their hash code; they must be immutable. That could be a constraint you can depend on.
Edit: you could write a specialized list that is able to keep orderings.
Either you store the original order when the list is created from another list, and then you're able to restore that order at any point in time, with new items going at the end (are there new items anyway?).
Or you do something more sophisticated and store the order of every object your list class has ever seen in static memory. Then you can sort all the lists independently. But beware of the references you are holding, which will prevent the objects from being cleaned up by the GC; you'll need weak references, which do exist in C#.
Even better would be to put this logic into a sorting class, so it works for every list that has been sorted by your sorting class.
So I'm thinking of using a reference type as a key to a .NET Dictionary...
Example:
class MyObj
{
    private int mID;
    public MyObj(int id)
    {
        this.mID = id;
    }
}
// whatever code here
static void Main(string[] args)
{
    Dictionary<MyObj, string> dictionary = new Dictionary<MyObj, string>();
}
My question is, how is the hash generated for custom objects (i.e. not int, string, bool, etc.)? I ask because the objects I'm using as keys may change before I need to look things up in the Dictionary again. If the hash is generated from the object's address, then I'm probably fine... but if it is generated from some combination of the object's member variables then I'm in trouble.
EDIT:
I should've made clear originally that I don't care about equality of the objects in this case... I was merely looking for a fast lookup (I wanted a 1-1 association without changing the code of the classes involved).
Thanks
The default implementation of GetHashCode/Equals basically deals with identity. You'll always get the same hash back from the same object, and it'll probably be different to other objects (very high probability!).
In other words, if you just want reference identity, you're fine. If you want to use the dictionary treating the keys as values (i.e. using the data within the object, rather than just the object reference itself, to determine the notion of equality) then it's a bad idea to mutate any of the equality-sensitive data within the key after adding it to the dictionary.
The MSDN documentation for object.GetHashCode is a little overly scary: basically you shouldn't use it for persistent hashes (i.e. saved between process invocations), but it will be consistent for the same object, which is all that's required for it to be a valid hash for a dictionary. While it's not guaranteed to be unique, I don't think you'll run into enough collisions to cause a problem.
The hash used is the return value of the object's GetHashCode method. By default this is essentially a value representing the reference. It is not guaranteed to be unique for an object, and in fact likely won't be in many situations, but the value for a particular reference will not change over the lifetime of the object even if you mutate it. So for this particular sample you will be OK.
In general, though, it is a very bad idea to use objects which are not immutable as keys to a Dictionary. It's way too easy to fall into the trap where you override Equals and GetHashCode on a type and break code where the type was formerly used as a key in a Dictionary.
The dictionary will use the GetHashCode method defined on System.Object, which will not change over the object's lifetime regardless of field changes etc. So you won't encounter problems in the scenario you describe.
You can override GetHashCode, and should do so if you override Equals so that objects which are equal also return the same hash code. However, if the type is mutable then you must be aware that if you use the object as a key of a dictionary you will not be able to find it again if it is subsequently altered.
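A short sketch of that failure mode (Key is a hypothetical class with a content-based hash):

class Key
{
    public int Id;
    public override bool Equals(object o) { var k = o as Key; return k != null && k.Id == Id; }
    public override int GetHashCode() { return Id; }
}

var key = new Key { Id = 1 };
var map = new Dictionary<Key, string> { { key, "value" } };
key.Id = 2;                              // mutated after insertion
Console.WriteLine(map.ContainsKey(key)); // False: the key now hashes to a different bucket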
The default implementation of the GetHashCode method does not guarantee unique return values for different objects. Furthermore, the .NET Framework does not guarantee the default implementation of the GetHashCode method, and the value it returns may differ between versions of the .NET Framework. Consequently, the default implementation of this method must not be used as a unique object identifier for hashing purposes.
http://msdn.microsoft.com/en-us/library/system.object.gethashcode.aspx
For a custom object derived from Object, the hash code is generated once, at the beginning of the instance's life, and remains the same throughout it; it does not change as the values of internal fields and properties change.
Hashtable/Dictionary will not use GetHashCode as a unique identifier; rather it uses it only to choose "hash buckets". For example, the strings "aaa123" and "aaa456" might both hash to bucket "aaa", and all objects with the same bucket hash are stored together. Whenever you insert or retrieve an object, the Dictionary calls GetHashCode to determine the bucket, and then compares the individual objects within it for equality.
A custom object used as a Dictionary key should be thought of like this: the Dictionary only stores its reference (address or memory pointer); it doesn't know its contents, and the contents may change but the reference never does. This also means that if two objects are exact replicas of each other but live at different places in memory, the hashtable will not consider them the same, because their memory pointers differ.
The best way to guarantee value-based equality is to override Equals as follows, if you are having any problems:
class MyObj
{
    private int mID;

    public MyObj(int id)
    {
        this.mID = id;
    }

    public override bool Equals(object obj)
    {
        MyObj mobj = obj as MyObj;
        if (mobj == null)
            return false;
        return this.mID == mobj.mID;
    }

    public override int GetHashCode()
    {
        return mID; // must be overridden alongside Equals, or equal objects may land in different buckets
    }
}