GetHashCode() - immutable values?


As far as I know, GetHashCode() should use only readonly / immutable properties. But if I change, for example, the Id property that GetHashCode() uses, I get a new hash code. So why should it be immutable? I would see a problem if the hash code did not change, but it does change.
class Program
{
public class Point
{
public int Id { get; set; }
public override bool Equals(object obj)
{
return obj is Point point &&
Id == point.Id;
}
public override int GetHashCode()
{
return HashCode.Combine(Id);
}
}
static void Main(string[] args)
{
Point point = new Point();
point.Id = 5;
var r1 = point.GetHashCode(); //467047723
point.Id = 10;
var r2 = point.GetHashCode(); //1141379410
}
}

GetHashCode() exists mainly for one reason: retrieval of an object from a hash table. You are right that it is desirable to compute the hash code only from immutable fields, but think about the reason for this. Since the hash code is used to locate an object in a hash table, lookups will fail if the hash code changes while the object is stored in the table.
To put it more generally: the value returned by GetHashCode must stay stable for as long as some structure depends on it staying stable. For your example, this means you can change the Id property as long as the object is not currently stored in any such structure.
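To make the failure mode concrete, here is a minimal sketch. MutablePoint and MutableKeyDemo are hypothetical names, and the hash is simply Id.GetHashCode() rather than the question's HashCode.Combine(Id); the effect is the same.

```csharp
using System;
using System.Collections.Generic;

public class MutablePoint
{
    public int Id { get; set; }

    public override bool Equals(object obj) => obj is MutablePoint other && Id == other.Id;

    // Hash derived from mutable state, as in the question.
    public override int GetHashCode() => Id.GetHashCode();
}

public static class MutableKeyDemo
{
    // Returns (found before mutation, found after mutation).
    public static (bool Before, bool After) Run()
    {
        var point = new MutablePoint { Id = 5 };
        var set = new HashSet<MutablePoint> { point };

        bool before = set.Contains(point);  // looked up under the hash for Id = 5

        point.Id = 10;                      // hash changes while the object is stored
        bool after = set.Contains(point);   // lookup now uses the hash for Id = 10

        return (before, after);
    }
}
```

Calling MutableKeyDemo.Run() gives Before = true and After = false: the object is still in the set, but the set recorded it under the old hash code, so it can no longer be found.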

Exactly because of this: if the value is not immutable, the hash code changes every time the value changes.
"A hash code is a numeric value that is used to identify an object during equality testing. It can also serve as an index for an object in a collection."
So if it keeps changing, you can't use it for its purpose.

Related

Is my syncfunctions hashCode usage approach correct?

Please read my previous question first, because I am worried about getting collisions when using hash codes for strings.
Previous question
I have a database table with items in a repo, and an "incoming" function with items from a model that should be synced to the database table.
I'm using Intersect and Except to make this possible.
The class I use for my sync purpose:
private class syncItemModel
{
public override int GetHashCode()
{
return this.ItemLookupCode.GetHashCode();
}
public override bool Equals(object other)
{
if (other is syncItemModel)
return ((syncItemModel)other).ItemLookupCode == this.ItemLookupCode;
return false;
}
public string Description { get; set; }
public string ItemLookupCode { get; set; }
public int ItemID { get; set; }
}
Then i use this in my method:
1) Convert datatable items to syncmodel:
var DbItemsInCampaignDiscount_SyncModel =
DbItemsInCampaignDiscount(dbcampaignDiscount, datacontext)
.Select(i => new syncItemModel { Description = i.Description,
ItemLookupCode = i.ItemLookupCode,
ItemID = i.ID}).ToList();
2) Convert my incoming item model to syncmodel:
var ItemsInCampaignDiscountModel_SyncModel = modelItems
.Select(i => new syncItemModel { Description =
i.Description, ItemLookupCode = i.ItemLookUpCode, ItemID =0 }).ToList();
3) Make an intersect:
var CommonItemInDbAndModel =
ItemsInCampaignDiscountModel_SyncModel.Intersect(DbItemsInCampaignDiscount_SyncModel).ToList();
4) Take out the items to be deleted from the database (those that do not exist in the incoming model items):
var SyncModel_OnlyInDb =
DbItemsInCampaignDiscount_SyncModel.Except(CommonItemInDbAndModel).ToList();
5) Take out items to be added to database, items that exist in incoming model but not in db:
var SyncModel_OnlyInModel =
ItemsInCampaignDiscountModel_SyncModel.Except(CommonItemInDbAndModel).ToList();
My question is then: can there be a collision? Can two different ItemLookupCode values in my example be treated as the same ItemLookupCode, given that Intersect and Except use the hash code? Or will the Equals method "double check" it, so that this approach is safe to use? If a collision is possible, how big is that chance?

Yes, a hash collision is always possible, which is why equality must be confirmed by calling Equals(). Both GetHashCode() and Equals() must be implemented correctly.
Except() in LINQ to Objects internally uses a hash set; on a hash collision it calls Equals to confirm equality. As you are using a single property, it is fine to delegate to that property's GetHashCode and Equals methods.
Some comments on your implementation:
comparison with ==
Comparing strings with == is fine, but if the type is changed to another reference type that does not overload ==, you will get issues because object references will be compared instead of content. Delegate to Equals() instead of ==.
mutability of the object
Binding GetHashCode/Equals logic to mutable state is very error-prone. I'd strongly recommend encapsulating your state so that the object cannot be changed once it is created; make the setter private for the sake of safety.
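The claim that Equals resolves collisions can be checked with a deliberately bad hash. The sketch below uses a hypothetical immutable SyncItem class whose GetHashCode always returns 1, so every item collides; Except and Intersect still produce the right answer (just more slowly), because equality is decided by Equals.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public sealed class SyncItem : IEquatable<SyncItem>
{
    public SyncItem(string itemLookupCode) { ItemLookupCode = itemLookupCode; }

    // Get-only: the state that drives Equals/GetHashCode cannot change after construction.
    public string ItemLookupCode { get; }

    public bool Equals(SyncItem other) =>
        other != null && string.Equals(ItemLookupCode, other.ItemLookupCode);

    public override bool Equals(object obj) => Equals(obj as SyncItem);

    // Deliberately terrible for the demonstration: every instance collides.
    public override int GetHashCode() => 1;
}

public static class CollisionDemo
{
    private static readonly SyncItem[] Db =
        { new SyncItem("A"), new SyncItem("B"), new SyncItem("C") };
    private static readonly SyncItem[] Model =
        { new SyncItem("B"), new SyncItem("D") };

    // Items present in the database but not in the incoming model.
    public static List<string> OnlyInDb() =>
        Db.Except(Model).Select(i => i.ItemLookupCode).ToList();

    // Items present in both.
    public static List<string> Common() =>
        Db.Intersect(Model).Select(i => i.ItemLookupCode).ToList();
}
```

OnlyInDb() yields A and C, and Common() yields B, despite all hash codes being identical. In real code you would return ItemLookupCode's own hash code, as the question does; the constant is only there to force collisions.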

Unique ID for each class

I want a unique ID (preferably static, without computation) for each class implementation, but not for each instance. The most obvious way to do this is to hardcode a value in the class, but keeping the values unique then becomes a task for a human, which isn't ideal.
abstract class Base
{
public abstract int GetID();
}
class Foo: Base
{
public override int GetID() => 10;
}
class Bar: Base
{
public override int GetID() => 20;
}
Foo foo1 = new Foo();
Foo foo2 = new Foo();
Bar bar = new Bar();
foo1.GetID() == foo2.GetID();
foo1.GetID() != bar.GetID()
The class name would be an obvious unique identifier, but I need an int (or fixed length bytes). I pack the entire object into bytes, and use the id to know what class it is when I unpack it at the other end.
Hashing the class name every time I call GetID() seems needlessly process heavy just to get an ID number.
I could also make an enum as a lookup, but again I need to populate the enum manually.
EDIT: People have been asking important questions, so I'll put the info here.
Needs to be unique per class, not per instance (this is why the identified duplicate question doesn't answer this one).
ID value needs to be persistent between runs.
Value needs to be fixed length bytes or int. Variable length strings such as class name are not acceptable.
Needs to reduce CPU load wherever possible (caching results or using assembly based metadata instead of doing a hash each time).
Ideally, the ID can be retrieved from a static function. This means I can make a static lookup function that matches ID to class.
The number of different classes that need an ID isn't that big (<100), so collisions aren't a major concern.
EDIT2:
Some more colour since people are skeptical that this is really needed. I'm open to a different approach.
I'm writing some networking code for a game, and it's broken down into message objects. Each message type is a class that inherits from MessageBase and adds its own fields, which will be sent.
The MessageBase class has a method for packing itself into bytes, and it sticks a message identifier (the class ID) on the front. When it comes to unpacking it at the other end, I use the identifier to know how to unpack the bytes. This results in some easy to pack/unpack messages and very little overhead (few bytes for ID, then just class property values).
Currently I hard code an ID number in the classes, but it doesn't seem like the best way of doing things.
EDIT3: Here is my code after implementing the accepted answer.
public class MessageBase
{
public MessageID id { get { return GetID(); } }
private MessageID cacheId;
private MessageID GetID()
{
// Check if cacheID hasn't been intialised
if (cacheId == null)
{
// Hash the class name
MD5 md5 = MD5.Create();
byte[] md5Bytes = md5.ComputeHash(Encoding.UTF8.GetBytes(GetType().AssemblyQualifiedName));
// Convert the first few bytes into a uint32, and create the messageID from it and store in cache
cacheId = new MessageID(BitConverter.ToUInt32(md5Bytes, 0));
}
// Return the cacheId
return cacheId;
}
}
public class Protocol
{
private Dictionary<Type, MessageID> messageTypeToId = new Dictionary<Type, MessageID>();
private Dictionary<MessageID, Type> idToMessageType = new Dictionary<MessageID, Type>();
private Dictionary<MessageID, Action<MessageBase>> handlers = new Dictionary<MessageID, Action<MessageBase>>();
public Protocol()
{
// Create a list of all classes that are a subclass of MessageBase this namespace
IEnumerable<Type> messageClasses = from t in Assembly.GetExecutingAssembly().GetTypes()
where t.Namespace == GetType().Namespace && t.IsSubclassOf(typeof(MessageBase))
select t;
// Iterate through the list of message classes, and store their type and id in the dicts
foreach(Type messageClass in messageClasses)
{
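// "id" is an instance property, so GetField("id").GetValue(null) cannot read it.
// Create an instance instead (assumes each message class has a public parameterless constructor).
MessageID id = ((MessageBase)Activator.CreateInstance(messageClass)).id;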
messageTypeToId[messageClass] = id;
idToMessageType[id] = messageClass;
}
}
}

Given that you can get a Type by calling GetType on the instance, you can easily cache the results. That reduces the problem to working out how to generate an ID for each type. You'd then call something like:
int id = typeIdentifierCache.GetIdentifier(foo1.GetType());
... or make GetIdentifier accept object and it can call GetType(), leaving you with
int id = typeIdentifierCache.GetIdentifier(foo1);
At that point, the detail is all in the type identifier cache.
A simple option would be to take a hash (e.g. SHA-256, for stability and making it very unlikely that you'll encounter collisions) of the fully-qualified type name. To prove that you have no collisions, you could easily write a unit test that runs over all the type names in the assembly and hashes them, then checks there are no duplicates. (Even that might be overkill, given the nature of SHA-256.)
This is all assuming that the types are in a single assembly. If you need to cope with multiple assemblies, you may want to hash the assembly-qualified name instead.
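A minimal sketch of such a cache follows. The name TypeIdentifierCache comes from the snippet above, but its implementation here is an assumption: it hashes the full type name with SHA-256 and keeps only the first four bytes, which makes collisions more likely than with the full hash, so the suggested unit test becomes more important.

```csharp
using System;
using System.Collections.Concurrent;
using System.Security.Cryptography;
using System.Text;

public static class TypeIdentifierCache
{
    // One entry per type; the hash is computed at most a handful of times per type.
    private static readonly ConcurrentDictionary<Type, int> Cache =
        new ConcurrentDictionary<Type, int>();

    public static int GetIdentifier(Type type) => Cache.GetOrAdd(type, Compute);

    public static int GetIdentifier(object instance) => GetIdentifier(instance.GetType());

    private static int Compute(Type type)
    {
        using (SHA256 sha = SHA256.Create())
        {
            byte[] hash = sha.ComputeHash(Encoding.UTF8.GetBytes(type.FullName));
            // Assemble the int explicitly so the result does not depend on
            // the byte order of the machine running the code.
            return (hash[0] << 24) | (hash[1] << 16) | (hash[2] << 8) | hash[3];
        }
    }
}
```

Because the ID depends only on the type name, it is the same on every run and on every machine, but it changes if a class is renamed or moved to another namespace.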

Here is one suggestion. I have used a SHA-256 byte array, which is guaranteed to be a fixed size and astronomically unlikely to have a collision. That may well be overkill; you can easily substitute something smaller. You could also use the AssemblyQualifiedName rather than FullName if you need to worry about version differences or the same class name in multiple assemblies.
Firstly, here are all my usings
using System;
using System.Collections.Concurrent;
using System.Text;
using System.Security.Cryptography;
Next, a static cached type hasher object to remember the mapping between your types and the resulting byte arrays. You don't need the Console.WriteLines below, they are just there to demonstrate that you are not computing it over and over again.
public static class TypeHasher
{
private static ConcurrentDictionary<Type, byte[]> cache = new ConcurrentDictionary<Type, byte[]>();
public static byte[] GetHash(Type type)
{
byte[] result;
if (!cache.TryGetValue(type, out result))
{
Console.WriteLine("Computing Hash for {0}", type.FullName);
// SHA256.Create() replaces SHA256Managed, which is marked obsolete in recent .NET versions
using (SHA256 sha = SHA256.Create())
{
result = sha.ComputeHash(Encoding.UTF8.GetBytes(type.FullName));
}
cache.TryAdd(type, result);
}
else
{
// Not actually required, but shows that hashing only done once per type
Console.WriteLine("Using cached Hash for {0}", type.FullName);
}
return result;
}
}
Next, an extension method on object so that you can ask for anything's id. Of course if you have a more suitable base class, it doesn't need to go on object per se.
public static class IdExtension
{
public static byte[] GetId(this object obj)
{
return TypeHasher.GetHash(obj.GetType());
}
}
Next, here are some random classes
public class A
{
}
public class ChildOfA : A
{
}
public class B
{
}
And finally, here is everything put together.
public class Program
{
public static void Main()
{
A a1 = new A();
A a2 = new A();
B b1 = new B();
ChildOfA coa = new ChildOfA();
Console.WriteLine("a1 hash={0}", Convert.ToBase64String(a1.GetId()));
Console.WriteLine("b1 hash={0}", Convert.ToBase64String(b1.GetId()));
Console.WriteLine("a2 hash={0}", Convert.ToBase64String(a2.GetId()));
Console.WriteLine("coa hash={0}", Convert.ToBase64String(coa.GetId()));
}
}
Here is the console output
Computing Hash for A
a1 hash=VZrq0IJk1XldOQlxjN0Fq9SVcuhP5VWQ7vMaiKCP3/0=
Computing Hash for B
b1 hash=335w5QIVRPSDS77mSp43if68S+gUcN9inK1t2wMyClw=
Using cached Hash for A
a2 hash=VZrq0IJk1XldOQlxjN0Fq9SVcuhP5VWQ7vMaiKCP3/0=
Computing Hash for ChildOfA
coa hash=wSEbCG22Dyp/o/j1/9mIbUZTbZ82dcRkav4olILyZs4=
On the other side, you would use reflection to iterate all of the types in your library and store a reverse dictionary of hash to type.
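The reverse lookup could be sketched roughly as follows. TypeRegistry, PingMessage and PongMessage are hypothetical names. The dictionary is keyed by the Base64 string of the hash rather than by the byte[] itself, because arrays use reference equality and would never match as dictionary keys.

```csharp
using System;
using System.Collections.Generic;
using System.Reflection;
using System.Security.Cryptography;
using System.Text;

// Hypothetical message types, standing in for the real ones.
public class PingMessage { }
public class PongMessage { }

public static class TypeRegistry
{
    // Same hashing scheme as above, rendered as a string so it can be a dictionary key.
    public static string HashName(Type type)
    {
        using (SHA256 sha = SHA256.Create())
        {
            return Convert.ToBase64String(
                sha.ComputeHash(Encoding.UTF8.GetBytes(type.FullName)));
        }
    }

    // Builds the hash-to-type map for every type defined in the given assembly.
    public static Dictionary<string, Type> Build(Assembly assembly)
    {
        var map = new Dictionary<string, Type>();
        foreach (Type type in assembly.GetTypes())
        {
            if (type.FullName == null) continue;  // skip types without a usable name
            map[HashName(type)] = type;
        }
        return map;
    }
}
```

On the receiving side, something like TypeRegistry.Build(typeof(PingMessage).Assembly) would be built once at startup, and the hash read from the wire would be looked up in it.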

I have not seen you answer whether the same value needs to persist between different runs, but if all you need is a unique ID for a class, then use the built-in and simple GetHashCode method:
class BaseClass
{
// typeof(this) is not valid C#; GetType() returns the runtime type of the instance
public int ClassId() => GetType().GetHashCode();
}
If you are worried about performance of multiple calls to GetHashCode(), then first, don't, that is ridiculous micro-optimization, but if you insist, then store it.
GetHashCode() is fast, that is its entire purpose, as a fast way to compare values in a hash.
EDIT:
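After doing some tests, the same hash code was returned between different runs using this method. I did not test after altering the classes, though, and I am not aware of exactly how a Type is hashed. Note that the .NET documentation does not guarantee that hash codes are stable across processes or runtime versions, so this behaviour should not be relied on for IDs that must persist between runs.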

C# User class. GetHashCode implementation

I have a simple class with only public string properties.
public class SimpleClass
{
public string Field1 {get; set;}
public string Field2 {get; set;}
public string Field3 {get; set;}
public List<SimpleClass> Children {get; set;}
public bool Equals(SimpleClass simple)
{
if (simple == null)
{
return false;
}
return IsFieldsAreEquals(simple) && IsChildrenAreEquals(simple);
}
public override int GetHashCode()
{
return RuntimeHelpers.GetHashCode(this); //Bad idea!
}
}
This code doesn't return the same value for equal instances. But this class has no readonly fields from which to compute a hash.
How can I generate a correct hash in GetHashCode() if all my properties are mutable?

The contract for GetHashCode requires (emphasis mine):
The GetHashCode method for an object must consistently return the same hash code as long as there is no modification to the object state that determines the return value of the object's Equals method.
So basically, you should compute it based on all the used fields in Equals, even though they're mutable. However, the documentation also notes:
If you do choose to override GetHashCode for a mutable reference type, your documentation should make it clear that users of your type should not modify object values while the object is stored in a hash table.
If only some of your properties were mutable, you could potentially override GetHashCode to compute it based only on the immutable ones - but in this case everything is mutable, so you'd basically end up returning a constant, making it awful to be in a hash-based collection.
So I'd suggest one of three options:
Use the mutable fields, and document it carefully.
Abandon overriding equality/hashing operations
Abandon it being mutable
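As an illustration of the third option, here is a minimal sketch of an immutable variant. ImmutableSimple is a hypothetical name, and the Children list from the question is left out to keep the example short; it would need an immutable collection and a sequence comparison to be handled the same way.

```csharp
using System;

public sealed class ImmutableSimple : IEquatable<ImmutableSimple>
{
    public ImmutableSimple(string field1, string field2, string field3)
    {
        Field1 = field1;
        Field2 = field2;
        Field3 = field3;
    }

    // Get-only properties: the values can only be set in the constructor.
    public string Field1 { get; }
    public string Field2 { get; }
    public string Field3 { get; }

    public bool Equals(ImmutableSimple other) =>
        other != null &&
        Field1 == other.Field1 &&
        Field2 == other.Field2 &&
        Field3 == other.Field3;

    public override bool Equals(object obj) => Equals(obj as ImmutableSimple);

    // Uses exactly the fields that Equals compares, so equal objects hash equally.
    public override int GetHashCode() => HashCode.Combine(Field1, Field2, Field3);
}
```

Since nothing can change after construction, an instance can safely sit in a HashSet or serve as a dictionary key. The same GetHashCode body also works for the first option (mutable fields plus documentation); only the setters differ.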

why don't List<T>.GetHashCode and ObservableCollection<T>.GetHashCode evaluate their items?

I think it is strange that the GetHashCode function of these collections doesn't base its hash code on the items in the list.
I need this to work in order to provide dirty checking (you have unsaved data).
I've written a wrapping class that overrides the GetHashCode method but I find it weird that this is not the default implementation.
I guess this is a performance optimization?
class Program
{
static void Main(string[] args)
{
var x = new ObservableCollection<test>();
int hash = x.GetHashCode();
x.Add(new test("name"));
int hash2 = x.GetHashCode();
var z = new List<test>();
int hash3 = z.GetHashCode();
z.Add(new test("tets"));
int hash4 = z.GetHashCode();
var my = new CustomObservableCollection<test>();
int hash5 = my.GetHashCode();
var test = new test("name");
my.Add(test);
int hash6 = my.GetHashCode();
test.Name = "name2";
int hash7 = my.GetHashCode();
}
}
public class test
{
public test(string name)
{
Name = name;
}
public string Name { get; set; }
public override bool Equals(object obj)
{
if (obj is test)
{
var o = (test) obj;
return o.Name == this.Name;
}
return base.Equals(obj);
}
public override int GetHashCode()
{
return Name.GetHashCode();
}
}
public class CustomObservableCollection<T> : ObservableCollection<T>
{
public override int GetHashCode()
{
int collectionHash = base.GetHashCode();
foreach (var item in Items)
{
var itemHash = item.GetHashCode();
if (int.MaxValue - itemHash > collectionHash)
{
collectionHash = collectionHash * -1;
}
collectionHash += itemHash;
}
return collectionHash;
}
}

If it did, it would break a few of the guidelines for implementing GetHashCode. Namely:
the integer returned by GetHashCode should never change
Since the content of a list can change, then so would its hash code.
the implementation of GetHashCode must be extremely fast
Depending on the size of the list, you could risk slowing down the calculation of its hash code.
Also, I do not believe you should be using an object's hashcode to check if data is dirty. The probability of collision is higher than you think.

The Equals/GetHashCode of lists checks for reference equality, not content equality. The reason behind this is that lists are both mutable and reference types (not structs). If the hash code were based on the contents, it would change every time you changed the contents.
The common use case for hash codes is hash tables (for example Dictionary<K,V> or HashSet<T>), which place their items in buckets based on the hash when they are first inserted into the table. If the hash of an object which is already in the table changes, it may no longer be found, which leads to erratic behavior.

The key point of GetHashCode is to reflect the Equals() logic in a lightweight way.
List<T>.Equals() is inherited from Object.Equals(), and Object.Equals() compares by reference, so the hash code of a list is based not on its items but on the list object itself.

It would be helpful to have a couple types which behaved like List<T> and could generally be used interchangeably with it, but with GetHashCode and Equals methods which would define equivalence either in terms of the sequence of identities, or the Equals and GetHashCode behaviors of the items encapsulated therein. Making such methods to behave efficiently, however, would require that the class include code to cache its hash value but invalidate or update the cached hash value whenever the collection was modified (it would not be legitimate to modify a list while it was stored as a dictionary key, but it should be legitimate to remove a list, modify it, and re-add it, and it would be very desirable to avoid having such modification necessitate re-hashing the entire contents of the list). It was not considered worthwhile to have ordinary lists go through the effort of supporting such behavior at the cost of slowing down operations on lists that never get hashed; nor was it considered worthwhile to define multiple types of list, multiple types of dictionary, etc. based upon the kind of equivalence they should look for in their members or should expose to the outside world.
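When content-based equivalence is needed only in specific places, one option is to leave List<T> alone and pass a comparer to the collection that needs it. The sketch below is a hypothetical ListContentComparer<T>, not a framework type; it walks the whole list on every call and does none of the caching described above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public sealed class ListContentComparer<T> : IEqualityComparer<List<T>>
{
    // Two lists are equal when they hold equal items in the same order.
    public bool Equals(List<T> x, List<T> y)
    {
        if (ReferenceEquals(x, y)) return true;
        if (x == null || y == null) return false;
        return x.SequenceEqual(y);
    }

    // Combines the item hashes in order; cost grows with the length of the list.
    public int GetHashCode(List<T> list)
    {
        if (list == null) return 0;
        var hash = new HashCode();
        foreach (T item in list)
        {
            hash.Add(item);
        }
        return hash.ToHashCode();
    }
}
```

It could be used as new Dictionary<List<int>, string>(new ListContentComparer<int>()). The usual rule still applies: a list must not be modified while it is stored as a key.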

Weird dictionary ContainsKey issue

Before I start, I'd like to clarify that this is not like all the other somewhat "similar" questions out there. I've tried implementing each approach, but the phenomena I am getting here are really weird.
I have a dictionary where ContainsKey always returns false, even if their GetHashCode functions return the same output, and even if their Equals method returns true.
What could this mean? What am I doing wrong here?
Additional information
The two elements I am inserting are both of type Owner, with no GetHashCode or Equals method. These inherit from a type Storable, which then implements an interface, and also has GetHashCode and Equals defined.
Here's my Storable class. You are probably wondering if the two Guid properties are indeed equal - and yes, they are. I double-checked. See the sample code afterwards.
public abstract class Storable : IStorable
{
public override int GetHashCode()
{
return Id == default(Guid) ? 0 : Id.GetHashCode();
}
public override bool Equals(object obj)
{
var other = obj as Storable;
return other != null && (other.Id == Id || ReferenceEquals(obj, this));
}
public Guid Id { get; set; }
protected Storable()
{
Id = Guid.NewGuid();
}
}
Now, here's the relevant part of my code where the dictionary stuff occurs. It takes in a Supporter object which has a link to an Owner.
public class ChatSession : Storable, IChatSession
{
static ChatSession()
{
PendingSupportSessions = new Dictionary<IOwner, LinkedList<IChatSession>>();
}
private static readonly IDictionary<IOwner, LinkedList<IChatSession>> PendingSupportSessions;
public static ChatSession AssignSupporterForNextPendingSession(ISupporter supporter)
{
var owner = supporter.Owner;
if (!PendingSupportSessions.ContainsKey(owner)) //always returns false
{
var hashCode1 = owner.GetHashCode();
var hashCode2 = PendingSupportSessions.First().Key.GetHashCode();
var equals = owner.Equals(PendingSupportSessions.First().Key);
//here, equals is true, and the two hashcodes are identical,
//and there is only one element in the dictionary according to the debugger.
//however, calling two "Add" calls after eachother does indeed crash.
PendingSupportSessions.Add(owner, new LinkedList<IChatSession>());
PendingSupportSessions.Add(owner, new LinkedList<IChatSession>()); //crash
}
...
}
}
If you need additional information, let me know. I am not sure what kind of information would be sufficient, so it was hard for me to include more.

Guillaume was right. It appears that I was changing the value of one of my keys after it is added to the dictionary. Doh!
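If a key's identity really does have to change while it is in a dictionary, the entry has to be removed first and re-added afterwards. A rough sketch, using a hypothetical Owner class in place of the real one and a hypothetical helper named RekeyDemo.ChangeId:

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for the question's Owner/Storable types.
public class Owner
{
    public Guid Id { get; set; } = Guid.NewGuid();

    public override bool Equals(object obj) => obj is Owner other && other.Id == Id;

    public override int GetHashCode() => Id.GetHashCode();
}

public static class RekeyDemo
{
    // Changes key.Id without leaving a stale entry behind in the dictionary.
    public static void ChangeId<TValue>(Dictionary<Owner, TValue> map, Owner key, Guid newId)
    {
        if (map.TryGetValue(key, out TValue value))
        {
            map.Remove(key);      // remove while the old hash code is still valid
            key.Id = newId;       // now it is safe to change the identity
            map.Add(key, value);  // re-insert under the new hash code
        }
        else
        {
            key.Id = newId;       // not stored in this dictionary, nothing to fix up
        }
    }
}
```

After ChangeId, ContainsKey(key) keeps returning true. Assigning key.Id directly while the key is stored reproduces the behaviour from the question.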

Make sure you are passing the same object that is stored as the key in the dictionary. If you create a new object each time and look it up, assuming it will match a stored key because it holds similar values, ContainsKey returns false unless Equals and GetHashCode are overridden to compare those values. Object (reference) comparisons are different from value comparisons.
