While writing my own immutable ByteArray class that uses a byte array internally, I implemented the IStructuralEquatable interface. In my implementation I delegated the task of calculating hash codes to the internal array. While testing it, to my great surprise, I found that my two different arrays had the same structural hash code, i.e. they returned the same value from GetHashCode. To reproduce:
IStructuralEquatable array11 = new int[] { 1, 1 };
IStructuralEquatable array12 = new int[] { 1, 2 };
IStructuralEquatable array22 = new int[] { 2, 2 };
var comparer = EqualityComparer<int>.Default;
Console.WriteLine(array11.GetHashCode(comparer)); // 32
Console.WriteLine(array12.GetHashCode(comparer)); // 32
Console.WriteLine(array22.GetHashCode(comparer)); // 64
IStructuralEquatable is quite new and little known, but I read somewhere that it can be used to compare the contents of collections and arrays. Am I wrong, or is my .NET wrong?
Note that I am not talking about Object.GetHashCode!
Edit:
So, I am apparently wrong as unequal objects may have equal hash codes. But isn't GetHashCode returning a somewhat randomly distributed set of values a requirement? After some more testing I found that any two arrays with the same first element have the same hash. I still think this is strange behavior.
What you have described is not a bug. GetHashCode() does not guarantee unique hashes for nonequal objects.
From MSDN:
If two objects compare as equal, the GetHashCode method for each object must return the same value. However, if two objects do not compare as equal, the GetHashCode methods for the two objects do not have to return different values.
EDIT
While the MSFT .NET implementation of GetHashCode() for Array.IStructuralEquatable obeys the principles in the above MSDN documentation, it appears that the authors did not implement it as intended.
Here is the code from "Array.cs":
int IStructuralEquatable.GetHashCode(IEqualityComparer comparer) {
    if (comparer == null)
        throw new ArgumentNullException("comparer");
    Contract.EndContractBlock();
    int ret = 0;
    for (int i = (this.Length >= 8 ? this.Length - 8 : 0); i < this.Length; i++) {
        ret = CombineHashCodes(ret, comparer.GetHashCode(GetValue(0)));
    }
    return ret;
}
Notice in particular this line:
ret = CombineHashCodes(ret, comparer.GetHashCode(GetValue(0)));
Unless I am mistaken, that 0 was intended to be i. Because of this, GetHashCode() hashes only the first element, once per loop iteration, so two arrays with the same first element and the same number of hashed elements (min(n, 8), where n is the length of the array) always get the same hash code. This isn't wrong (it doesn't violate the documentation), but it is clearly not as good as it would be if 0 were replaced with i. Also, there would be no reason to loop if the code were only going to use a single value from the array.
This bug has been fixed, at least as of .NET 4.6.2. You can see it through Reference Source.
ret = CombineHashCodes(ret, comparer.GetHashCode(GetValue(i)));
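To see what the one-character fix changes, here is a stand-alone sketch rather than the framework code itself. It assumes that CombineHashCodes is ((h1 << 5) + h1) ^ h2, which reproduces the 32, 32 and 64 reported in the question; the class name ArrayHashSketch and the useIndex switch are made up for illustration.

```csharp
using System;

public static class ArrayHashSketch
{
    // Assumed combining step; it reproduces the values reported in the question.
    private static int CombineHashCodes(int h1, int h2)
    {
        return unchecked(((h1 << 5) + h1) ^ h2);
    }

    // Mirrors the loop quoted above. With useIndex == false it hashes
    // element 0 on every iteration (the old behavior); with true it hashes element i.
    public static int StructuralHash(int[] values, bool useIndex)
    {
        int ret = 0;
        for (int i = (values.Length >= 8 ? values.Length - 8 : 0); i < values.Length; i++)
        {
            int element = useIndex ? values[i] : values[0];
            ret = CombineHashCodes(ret, element.GetHashCode());
        }
        return ret;
    }

    public static void Main()
    {
        int[][] samples = { new[] { 1, 1 }, new[] { 1, 2 }, new[] { 2, 2 } };
        foreach (int[] sample in samples)
        {
            Console.WriteLine("{0} -> old {1}, fixed {2}",
                string.Join(",", sample),
                StructuralHash(sample, false),
                StructuralHash(sample, true));
        }
    }
}
```

With useIndex set to false this gives 32, 32 and 64 for the three arrays from the question; with the index in place, { 1, 2 } hashes to 35 and no longer collides with { 1, 1 }.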
GetHashCode is not guaranteed to return unique values for instances that are not equal. However, instances that are equal will always return the same hash code.
To quote from Object.GetHashCode method:
If two objects compare as equal, the GetHashCode method for each object must return the same value. However, if two objects do not compare as equal, the GetHashCode methods for the two objects do not have to return different values.
Your observations do not conflict with the documentation and there is no bug in the implementation.
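For context on what IStructuralEquatable is meant for, here is a minimal sketch of content-based comparison of two arrays. The class and method names are invented for the example; only the IStructuralEquatable members and EqualityComparer<int>.Default come from the framework.

```csharp
using System;
using System.Collections;
using System.Collections.Generic;

public static class StructuralEqualityDemo
{
    // Compares the arrays element by element instead of by reference.
    public static bool ContentEquals(int[] left, int[] right)
    {
        IStructuralEquatable structural = left;
        return structural.Equals(right, EqualityComparer<int>.Default);
    }

    // Hash code derived from the elements rather than from the array instance.
    public static int ContentHash(int[] values)
    {
        IStructuralEquatable structural = values;
        return structural.GetHashCode(EqualityComparer<int>.Default);
    }

    public static void Main()
    {
        var a = new[] { 1, 2, 3 };
        var b = new[] { 1, 2, 3 };

        Console.WriteLine(a.Equals(b));                          // False: reference equality
        Console.WriteLine(ContentEquals(a, b));                  // True: same contents
        Console.WriteLine(ContentHash(a) == ContentHash(b));     // True: equal contents, equal hash
    }
}
```

The last line is the only guarantee the contract gives: equal contents imply equal structural hash codes, never the other way round.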
Related
After executing this piece of code:
int a = 50;
float b = 50.0f;
Console.WriteLine(a.GetHashCode() == b.GetHashCode());
We get False, which is expected, since we are dealing with different objects, hence we should get different hashes.
However, if we execute this:
int a = 0;
float b = 0.0f;
Console.WriteLine(a.GetHashCode() == b.GetHashCode());
We get True. Both objects return the same hash code: 0.
Why does this happen? Aren't they supposed to return different hashes?
The GetHashCode of System.Int32 works like:
public override int GetHashCode()
{
    return this;
}
With this being 0, it will of course return 0.
The GetHashCode of System.Single (float is its C# alias) is:
public unsafe override int GetHashCode()
{
    float num = this;
    if (num == 0f)
    {
        return 0;
    }
    return *(int*)(&num);
}
As you can see, for 0f it returns 0.
The decompiler used is ILSpy.
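One plausible reason for the num == 0f special case is that 0.0f and -0.0f compare as equal but have different bit patterns, so returning the raw bits for both would break the "equal objects, equal hashes" rule. A small sketch to check this without unsafe code; the FloatHashDemo class and its helper methods are illustrative names, not framework members.

```csharp
using System;

public static class FloatHashDemo
{
    // Raw IEEE 754 bits of a float, obtained without unsafe code.
    public static int RawBits(float value)
    {
        return BitConverter.ToInt32(BitConverter.GetBytes(value), 0);
    }

    // Builds negative zero from its bit pattern (only the sign bit set).
    public static float NegativeZero()
    {
        return BitConverter.ToSingle(BitConverter.GetBytes(int.MinValue), 0);
    }

    public static void Main()
    {
        float positiveZero = 0.0f;
        float negativeZero = NegativeZero();

        Console.WriteLine(positiveZero == negativeZero);                              // True
        Console.WriteLine(RawBits(positiveZero) == RawBits(negativeZero));            // False
        Console.WriteLine(positiveZero.GetHashCode() == negativeZero.GetHashCode());  // True
        Console.WriteLine(0.GetHashCode() == positiveZero.GetHashCode());             // True
        Console.WriteLine("0x{0:X8}", RawBits(50.0f));                                // 0x42480000
    }
}
```

The bit pattern of 50.0f (0x42480000) is nowhere near the int 50, which is why the first comparison in the question prints False, while every kind of zero ends up with hash code 0.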
From MSDN Documentation:
Two objects that are equal return hash codes that are equal. However,
the reverse is not true: equal hash codes do not imply object
equality, because different (unequal) objects can have identical hash
codes.
Objects that are conceptually equal are obligated to return the same hashes. Objects that are different are not obligated to return different hashes. Unique hashes would only be possible if no more than 2^32 distinct objects could ever exist, and there are more than that. When different objects result in the same hash, it is called a "collision". A quality hash algorithm minimizes collisions as much as possible, but they can never be removed entirely.
Why should they? Hash codes are a finite set; as many as you can fit in an Int32. There are many many doubles that will have the same hash code as any given int or any other given double.
Hash codes basically have to follow two simple rules:
If two objects are equal, they should have the same hash code.
If an object does not mutate its internal state then the hash code should remain the same.
Nothing obliges two objects that are not equal to have different hash codes; once more than 2^32 distinct values exist, guaranteeing that is mathematically impossible.
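To make the two rules concrete, here is a sketch with a deliberately terrible hash function. CollidingPoint and CollisionDemo are hypothetical types written for this example; the point is only that a HashSet still behaves correctly when every instance collides, because Equals has the final say.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical type: every instance returns the same hash code on purpose.
public sealed class CollidingPoint : IEquatable<CollidingPoint>
{
    public int X { get; }
    public int Y { get; }

    public CollidingPoint(int x, int y) { X = x; Y = y; }

    public bool Equals(CollidingPoint other)
    {
        return !ReferenceEquals(other, null) && X == other.X && Y == other.Y;
    }

    public override bool Equals(object obj) { return Equals(obj as CollidingPoint); }

    // Legal under the contract (equal objects get equal hashes), just slow:
    // every lookup degrades to a linear scan through one bucket.
    public override int GetHashCode() { return 1; }
}

public static class CollisionDemo
{
    public static void Main()
    {
        var points = new HashSet<CollidingPoint>
        {
            new CollidingPoint(1, 2),
            new CollidingPoint(1, 2), // equal to the first, so it is not added again
            new CollidingPoint(3, 4)
        };

        Console.WriteLine(points.Count);                               // 2
        Console.WriteLine(points.Contains(new CollidingPoint(3, 4)));  // True
        Console.WriteLine(points.Contains(new CollidingPoint(5, 6)));  // False
    }
}
```

Collisions cost performance, not correctness, which is why the contract only demands the one direction.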
Why does HashSet<T>.GetHashCode() return different hash codes for sets that have the same elements?
For instance:
[Fact]
public void EqualSetsHaveSameHashCodes()
{
    var set1 = new HashSet<int>(new [] { 1, 2, 3 } );
    var set2 = new HashSet<int>(new [] { 1, 2, 3 } );
    Assert.Equal(set1.GetHashCode(), set2.GetHashCode());
}
This test fails. Why?
How can I get the result I need? "Equal sets give the same hashcode"
HashSet<T> by default does not have value equality semantics. It has reference equality semantics, so two distinct hash sets won't be equal or have the same hash code even if the contained elements are the same.
You need to use a special purpose IEqualityComparer<HashSet<int>> to get the behavior you want. You can roll your own or use the default one the framework provides for you:
var hashSetOfIntComparer = HashSet<int>.CreateSetComparer();
//will evaluate to true
var haveSameHash = hashSetOfIntComparer.GetHashCode(set1) ==
hashSetOfIntComparer.GetHashCode(set2);
So, to make a long story short:
How can I get the result I need? "Equal sets give the same hashcode"
You can't if you are planning on using the default implementation of HashSet<T>.GetHashCode(). You either use a special purpose comparer or you extend HashSet<T> and override Equals and GetHashCode to suit your needs.
By default (and unless otherwise specifically documented), reference types are only considered equal if they reference the same object. As a developer, you can override the Equals() and GetHashCode() methods so that objects that you consider equal return true for the Equals and the same int for GetHashCode.
Depending on which test framework you are using, there will be either CollectionAssert.AreEquivalent() or an overload of Assert.Equal that takes a comparer.
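Outside of a test framework, the same content-based semantics are available through SetEquals and the comparer returned by HashSet<T>.CreateSetComparer(). A short sketch, where SetComparerDemo and FindsEquivalentKey are names made up for the example:

```csharp
using System;
using System.Collections.Generic;

public static class SetComparerDemo
{
    // Stores one set as a dictionary key and looks it up with another set.
    // The lookup succeeds only if the comparer treats both sets as equal.
    public static bool FindsEquivalentKey(HashSet<int> stored, HashSet<int> probe)
    {
        IEqualityComparer<HashSet<int>> comparer = HashSet<int>.CreateSetComparer();
        var lookup = new Dictionary<HashSet<int>, string>(comparer);
        lookup[stored] = "found";
        return lookup.ContainsKey(probe);
    }

    public static void Main()
    {
        var set1 = new HashSet<int>(new[] { 1, 2, 3 });
        var set2 = new HashSet<int>(new[] { 3, 2, 1 });

        Console.WriteLine(set1.Equals(set2));               // False: reference equality
        Console.WriteLine(set1.SetEquals(set2));            // True: same elements
        Console.WriteLine(FindsEquivalentKey(set1, set2));  // True: comparer hashes by content
    }
}
```

Passing the comparer to the dictionary (or to a HashSet of sets) is usually less invasive than subclassing HashSet<T>.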
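You could implement a custom HashSet that overrides GetHashCode to build the hash code from all of its contents, as below. The combination has to be independent of enumeration order, because two sets with the same elements are not guaranteed to enumerate them in the same order, and Equals should be overridden to match:
using System.Collections.Generic;

public class HashSetWithGetHashCode<T> : HashSet<T>
{
    public override bool Equals(object obj)
    {
        // Equal when the other set holds exactly the same elements
        var other = obj as HashSetWithGetHashCode<T>;
        return other != null && SetEquals(other);
    }

    public override int GetHashCode()
    {
        unchecked // Overflow is fine, just wrap
        {
            int hash = 17;
            foreach (var item in this)
                hash += item == null ? 0 : item.GetHashCode(); // addition is order-independent
            return hash;
        }
    }
}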
I have the following situation:
class Custom
{
    public override int GetHashCode(){...calculation1}
}
public class MyComparer : IEqualityComparer<Custom>
{
    public bool Equals(Custom cus1, Custom cus2)
    {
        if (cus1 == null || cus2 == null)
            return false;
        return cus1.GetHashCode() == cus2.GetHashCode();
    }
    public int GetHashCode(Custom cus1)
    {
        return ...calculation2;
    }
}
static void Main()
{
    List<Custom> mine1 = new List<Custom>(){....};
    List<Custom> mine2 = new List<Custom>(){....};
    MyComparer myComparer = new MyComparer();
    List<Custom> result = mine1.Intersect(mine2, myComparer).ToList();
}
I just want to know which GetHashCode will be used by Intersect.
To answer your question, it will be GetHashCode from MyComparer.
But there is a very important reason why both a GetHashCode and an Equals method exist. GetHashCode() is an optimization: items are first compared by hash code only, and the Equals method is called only when the hash codes match. That second step guards against different objects that happen to share a hash code (for a well-distributed hash the chance is about one in 4 billion per pair, but it still happens; I have seen it first-hand). In the Equals() method you should compare all the relevant fields of one object to the other. Comparing objects by hash code in Equals is wrong and defeats the whole purpose of this method.
Hope that clarifies.
Why didn't you test it yourself? You already have the code...
MyComparer.GetHashCode will be used in your case. You can see the code here: http://referencesource.microsoft.com/#System.Core/System/Linq/Enumerable.cs#f4105a494115b366
Custom.GetHashCode would be used if you didn't specify comparer at Intersect call at all.
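If you do want to test it yourself, a sketch along these lines counts which methods Intersect actually calls. TracedCustom, TracingComparer and IntersectDemo are stand-ins invented for the example, since the original Custom and its calculations are not shown.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for the Custom class from the question.
public sealed class TracedCustom
{
    public static int ObjectHashCalls; // how often the object's own GetHashCode ran

    public int Id { get; }
    public TracedCustom(int id) { Id = id; }

    public override bool Equals(object obj)
    {
        var other = obj as TracedCustom;
        return other != null && other.Id == Id;
    }

    public override int GetHashCode()
    {
        ObjectHashCalls++;
        return Id;
    }
}

// Hypothetical comparer that counts how often LINQ calls into it.
public sealed class TracingComparer : IEqualityComparer<TracedCustom>
{
    public int HashCalls;
    public int EqualsCalls;

    public bool Equals(TracedCustom x, TracedCustom y)
    {
        EqualsCalls++;
        if (ReferenceEquals(x, y)) return true;
        if (x == null || y == null) return false;
        return x.Id == y.Id; // compare the relevant fields, not the hash codes
    }

    public int GetHashCode(TracedCustom obj)
    {
        HashCalls++;
        return obj == null ? 0 : obj.Id % 2; // deliberately coarse, so collisions occur
    }
}

public static class IntersectDemo
{
    // Intersects ids {1,2,3} with {2,3,4} using the given comparer.
    public static List<int> Run(TracingComparer comparer)
    {
        var mine1 = new List<TracedCustom> { new TracedCustom(1), new TracedCustom(2), new TracedCustom(3) };
        var mine2 = new List<TracedCustom> { new TracedCustom(2), new TracedCustom(3), new TracedCustom(4) };
        return mine1.Intersect(mine2, comparer).Select(c => c.Id).ToList();
    }

    public static void Main()
    {
        var comparer = new TracingComparer();
        Console.WriteLine(string.Join(",", Run(comparer)));          // 2,3
        Console.WriteLine(comparer.HashCalls > 0);                   // True
        Console.WriteLine(TracedCustom.ObjectHashCalls);             // 0
    }
}
```

The object's own GetHashCode is never touched when a comparer is supplied, and the comparer's Equals is consulted only for items whose comparer hash codes match.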
In general, hash codes and GetHashCode functions provide a good mechanism for fast comparison, but beware of collisions. Because a hash code has a limited range, it is very common for two different values to produce the same hash code, which breaks any comparison that relies on hash codes alone.
How would I do this? I am trying to count when both arrays have the same value of TRUE/1 at the same index. As you can see, my code has multiple bitarrays and is looping through each one and comparing them with a comparisonArray with another loop. It doesn't seem to be very efficient and I need it to be.
foreach (var bitArrayTuple in bitArrayList) {
    for (int i = 0; i < arrayLength; i++)
        if (bitArrayTuple.Item2[i] && comparisonArray[i])
            bitArrayTuple.Item1++;
}
where Item1 is the count and Item2 is a bitarray.
bool equals = new BitArray(ba1).Xor(ba2).OfType<bool>().All(e => !e); // copy first, because Xor modifies the instance it is called on
There's not much of a way to do this, because BitArray doesn't let its internal array leak, and because .NET doesn't have the C++ equivalent of const to prevent external modification. You might want to just create your own class from scratch, or, if you feel like hacking, use reflection to get the private field inside the BitArray.
Would this work?
http://msdn.microsoft.com/en-us/library/system.collections.bitarray.and%28v=VS.90%29.aspx
It's like the single & operator in C.
Depending on the number of elements, BitVector32 may be usable. That would simply be an Int32 comparison.
If not possible, you will need to get hold of the int[] located on the m_array private field of each BitArray. Then compare the int[] of each (which is a comparison of 32 bits at a time).
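For the counting asked about in the question, the same 32-bits-at-a-time idea can be sketched without reflection by copying each BitArray into an int[] with CopyTo, AND-ing the words and counting the set bits. BitArrayCounting, CountCommonSetBits and PopCount are hypothetical names; the last word is masked because the unused high bits of a BitArray's final int are not guaranteed to be zero on every runtime.

```csharp
using System;
using System.Collections;

public static class BitArrayCounting
{
    // Counts the positions where both arrays hold true.
    public static int CountCommonSetBits(BitArray first, BitArray second)
    {
        if (first == null) throw new ArgumentNullException("first");
        if (second == null) throw new ArgumentNullException("second");
        if (first.Length != second.Length)
            throw new ArgumentException("Both arrays must have the same length.");

        int wordCount = (int)(((long)first.Length + 31) / 32);
        int[] a = new int[wordCount];
        int[] b = new int[wordCount];
        first.CopyTo(a, 0);
        second.CopyTo(b, 0);

        int tailBits = first.Length % 32; // bits actually used in the last word
        int count = 0;
        for (int i = 0; i < wordCount; i++)
        {
            uint both = unchecked((uint)(a[i] & b[i]));
            if (i == wordCount - 1 && tailBits != 0)
                both &= (1u << tailBits) - 1; // ignore padding bits past Length
            count += PopCount(both);
        }
        return count;
    }

    // Classic branch-free population count for a 32-bit word.
    private static int PopCount(uint v)
    {
        unchecked
        {
            v = v - ((v >> 1) & 0x55555555u);
            v = (v & 0x33333333u) + ((v >> 2) & 0x33333333u);
            v = (v + (v >> 4)) & 0x0F0F0F0Fu;
            return (int)((v * 0x01010101u) >> 24);
        }
    }

    public static void Main()
    {
        var first = new BitArray(new[] { true, true, false, true });
        var second = new BitArray(new[] { true, false, false, true });
        Console.WriteLine(CountCommonSetBits(first, second)); // 2
    }
}
```

This still allocates two temporary int[] buffers per comparison; if comparisonArray is fixed, its buffer could be copied once and reused across the loop.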
I realize this is an old thread, but I've recently run into a need for this myself and have performed some benchmarks in order to determine which method is fastest:
Firstly, at the moment we can't use BitArray.Clone() because of a known bug in Microsoft's code that will not allow cloning of arrays that are larger than int.MaxValue / 32. We will need to avoid this method until they have fixed the bug.
With that in mind, I have run benchmarks against 5 different implementations, all using the largest BitArray I could construct (size of int.MaxValue) with alternating bits. I have run the tests with both equal and unequal arrays, and the resulting speed rankings are the same. Here are the implementations:
Implementation 1: Convert each BitArray into a byte[] and compare the arrays using the CompareTo() method.
Implementation 2: Convert each BitArray into a byte[] and compare each pair of bytes using an XOR operator (^).
Implementation 3: Convert each BitArray into an int[] and compare the arrays using the CompareTo() method.
Implementation 4: Convert each BitArray into an int[] and compare each pair of ints using an XOR operator (^).
Implementation 5: Use a for loop to iterate over each pair of bool values and compare them.
The winner surprised me: Implementation 3.
I would have expected Implementation 4 to be the fastest, but as it turns out 3 is significantly faster.
In terms of speed, here are the implementations ranked fastest first:
Implementation 3
Implementation 4
Implementation 2
Implementation 1
Implementation 5
Here's my code for implementation 3:
// Extension method: it must be declared inside a static class.
// Call it as a static method; first.Equals(second) would bind to Object.Equals instead.
public static bool Equals(this BitArray first, BitArray second)
{
    // Short-circuit if the arrays are not equal in size
    if (first.Length != second.Length)
        return false;
    // Convert the arrays to int[]s
    int[] firstInts = new int[(int)Math.Ceiling((decimal)first.Count / 32)];
    first.CopyTo(firstInts, 0);
    int[] secondInts = new int[(int)Math.Ceiling((decimal)second.Count / 32)];
    second.CopyTo(secondInts, 0);
    // Look for differences
    bool areDifferent = false;
    for (int i = 0; i < firstInts.Length && !areDifferent; i++)
        areDifferent = firstInts[i] != secondInts[i];
    return !areDifferent;
}