Create Unique Hashcode for the permutation of two Order Ids - c#

I have a collection that contains every permutation of pairs of orders, where OrderId is unique. So for Order1 (Id = 1) and Order2 (Id = 2) it contains both 12 and 21. While a routing algorithm runs, a few conditions are checked, and once a combination is included in the final result, its reverse must be ignored and need not be processed. Since the Id is an integer, I created the following logic:
private static int GetPairKey(int firstOrderId, int secondOrderId)
{
var orderCombinationType = (firstOrderId < secondOrderId)
? new {max = secondOrderId, min = firstOrderId}
: new { max = firstOrderId, min = secondOrderId };
return (orderCombinationType.min.GetHashCode() ^ orderCombinationType.max.GetHashCode());
}
In the logic, I create a Dictionary<int,int> whose key is produced by the GetPairKey method shown above. The method orders the two Ids so that a pair and its reverse give the same hash code, which can then be inserted into and looked up in the Dictionary. The value is a dummy and is ignored.
However, this logic seems to have a flaw: it does not work as expected in every case. What am I doing wrong, and should I create the hash code some other way? Would something like the following be a better choice?
Tuple.Create(minOrderId,maxOrderId).GetHashCode
The relevant code usage is:
foreach (var pair in localSavingPairs)
{
var firstOrder = pair.FirstOrder;
var secondOrder = pair.SecondOrder;
if (processedOrderDictionary.ContainsKey(GetPairKey(firstOrder.Id, secondOrder.Id))) continue;
The following code adds to the Dictionary:
processedOrderDictionary.Add(GetPairKey(firstOrder.Id, secondOrder.Id), 0); // the value 0 is a dummy and is not used

You need a key that can uniquely represent every possible pair of values.
That is different to a hash-code.
You could uniquely represent each value with a long or with a class or struct that contains all of the appropriate values. Since after a certain total size using long won't work any more, let's look at the other approach, which is more flexible and more extensible:
public class KeyPair : IEquatable<KeyPair>
{
public int Min { get; private set; }
public int Max { get; private set; }
public KeyPair(int first, int second)
{
if (first < second)
{
Min = first;
Max = second;
}
else
{
Min = second;
Max = first;
}
}
public bool Equals(KeyPair other)
{
return other != null && other.Min == Min && other.Max == Max;
}
public override bool Equals(object other)
{
return Equals(other as KeyPair);
}
public override int GetHashCode()
{
return unchecked(Max * 31 + Min);
}
}
Now, the GetHashCode() here will not be unique, but the KeyPair itself will be. Ideally the hashcodes will be very different to each other to better distribute these objects, but doing much better than the above depends on information about the actual values that will be seen in practice.
The dictionary will use that to find the item, but it will also use Equals to pick between those where the hash code is the same.
(You can experiment with this by having a version for which GetHashCode() always just returns 0. It will have very poor performance because collisions hurt performance and this will always collide, but it will still work).
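For the long alternative mentioned above, here is a minimal sketch. The helper name PairKey.Pack is hypothetical, and it assumes both ids are 32-bit ints, so the smaller id can sit in the upper 32 bits and the larger in the lower 32 bits. That makes the key unique per unordered pair rather than merely a hash:

```csharp
using System;
using System.Collections.Generic;

public static class PairKey
{
    // Hypothetical helper: packs an unordered pair of 32-bit ids into one 64-bit key.
    // Distinct unordered pairs always give distinct keys, so no Equals fallback is needed.
    public static long Pack(int a, int b)
    {
        int min = Math.Min(a, b);
        int max = Math.Max(a, b);
        // Upper 32 bits hold min, lower 32 bits hold max (as raw bits).
        return ((long)min << 32) | unchecked((uint)max);
    }
}
```

A HashSet<long> of such keys could then replace the Dictionary<int,int> with its dummy value: Add returns false when the pair, in either order, has already been seen.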

First, 42.GetHashCode() returns 42. Second, 1 ^ 2 is identical to 2 ^ 1, so there's really no point in sorting numbers. Third, your "hash" function is very weak and produces a lot of collisions, which is why you're observing the flaws.
There are two options I can think of right now:
Use a slightly "stronger" hash function
Replace your Dictionary<int, int> key with Dictionary<string, int> with keys being your two sorted numbers separated by whatever character you prefer -- e.g. 56-6472
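To make the collision problem concrete, here is a small sketch. The class and method names are made up for illustration. It puts the XOR key from the question next to the sorted string key from the second option:

```csharp
using System;

public static class KeyComparisonDemo
{
    // The key from the question: XOR of the two ids (int.GetHashCode() returns the int itself).
    public static int XorKey(int a, int b) => a ^ b;

    // The string alternative: the two ids in sorted order, separated by '-'.
    public static string StringKey(int a, int b) =>
        Math.Min(a, b) + "-" + Math.Max(a, b);
}
```

With these, XorKey(1, 2) and XorKey(5, 6) are both 3, so the second pair would wrongly be treated as already processed. StringKey gives "1-2" and "5-6", and StringKey(2, 1) is still "1-2".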

Given that XOR is commutative (so (a ^ b) will always be the same as (b ^ a)) it seems to me that your ordering is misguided... I'd just
(new {firstOrderId, secondOrderId}).GetHashCode()
.Net will fix you up a good well-distributed hashing implementation for anonymous types.
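One caveat: an anonymous type, like a tuple, compares its members position by position, so the two ids still need to be put into a canonical order if a pair and its reverse should count as the same key. Also, using the composite value itself as the key, rather than its hash code, lets the collection fall back on Equals when two hashes collide. A sketch under those assumptions, using C# 7 value tuples and a hypothetical Normalize helper:

```csharp
using System;
using System.Collections.Generic;

public static class OrderPairs
{
    // Hypothetical helper: returns the pair with the smaller id first.
    public static (int Min, int Max) Normalize(int a, int b) =>
        a < b ? (a, b) : (b, a);
}
```

A HashSet<(int, int)> filled with OrderPairs.Normalize(first, second) then reports the reversed pair as already present.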


Is there a .NET built-in structure similar with Tuple (or a recommended way to build one) that is order invariant regarding the hashcode?

Is there a .NET built-in structure similar with Tuple (or a recommended way to build one) that is order invariant regarding the equality and hashcode?
The code below has the expected ishashequal=false, the structure I am looking for would return true.
var dict = new Dictionary<Tuple<char, char>, int>();
var x = new Tuple<char, char>('a', 'b');
var y = new Tuple<char, char>('b', 'a');
dict.Add(x, 1);
bool isequal = dict.ContainsKey(y);
Is there a .NET built-in structure similar with Tuple (or a recommended way to build one) that is order invariant regarding the equality and hashcode?
No. It would be broken if there was.
There's an over-simplification often given of "a hash code of a key must not change", which is wrong. Really, "a key must not change while being used as a key" which the hash code not changing should reflect. If something changes in a way that changes how it compares to other items then the hash code must change to reflect that. This mustn't happen while it is used as a key, but that means the key must not change. Otherwise if you created an object, changed it and then used it as a key, it wouldn't work.
Okay, so to alter your question
Is there a .NET built-in structure similar with Tuple (or a recommended way to build one) that is order invariant regarding the equality and hashcode [when the component items are invariant]?
Yes, Tuple would be one example. As would ValueTuple and anonymous objects composed of the same parts.
The code below has the expected ishashequal=false, the structure I am looking for would return true
That's something different again. A tuple is a finite sequence of a given number of elements. Being a sequence means that order is significant. We don't expect the tuple (a, b) to hash to the same thing as (b, a) because (a, b) is a different tuple to (b, a) and so must not be considered equal and ideally (but not a strict requirement) would not have the same hash code.
Indeed Tuple<int, string> can't have the same elements in different orders at all.
Tuple represents tuples, but you are describing a use of a finite set. The finite set {a, b} is the same as the finite set {b, a} because order is not significant.
You need to do one of two things.
Create a structure to represent a finite set of two elements
public sealed class FiniteSet2<T> : IEquatable<FiniteSet2<T>>
{
public T First { get; }
public T Second { get; }
public FiniteSet2(T first, T second)
{
First = first;
Second = second;
}
public bool Equals(FiniteSet2<T> other)
{
if ((object)other == null)
{
return false;
}
// Test for same order.
if (EqualityComparer<T>.Default.Equals(First, other.First))
{
return EqualityComparer<T>.Default.Equals(Second, other.Second);
}
// Test for different order.
return EqualityComparer<T>.Default.Equals(First, other.Second)
&& EqualityComparer<T>.Default.Equals(Second, other.First);
}
public override bool Equals(object obj) => Equals(obj as FiniteSet2<T>);
// Deliberately matches elements in different order.
public override int GetHashCode() => First.GetHashCode() ^ Second.GetHashCode();
}
Or if you really do need to use tuples, define an appropriate comparer:
public sealed class CompareAsSetEqualityComparer<T> : IEqualityComparer<Tuple<T, T>>
{
public bool Equals(Tuple<T, T> x, Tuple<T, T> y)
{
if ((object)x == y)
{
return true;
}
if ((object)x == null | (object)y == null)
{
return false;
}
if (EqualityComparer<T>.Default.Equals(x.Item1, y.Item1))
{
return EqualityComparer<T>.Default.Equals(x.Item2, y.Item2);
}
return EqualityComparer<T>.Default.Equals(x.Item1, y.Item2)
&& EqualityComparer<T>.Default.Equals(x.Item2, y.Item1);
}
public int GetHashCode(Tuple<T, T> obj) =>
obj == null ? 0 : obj.Item1.GetHashCode() ^ obj.Item2.GetHashCode();
}
Of course, if the elements are reference types they could themselves be mutated, which would still mutate the otherwise-immutable set or tuple.
(Aside: A recent improvement in the jitter means that it no longer makes sense to memoise EqualityComparer<T>.Default as the sequence EqualityComparer<T>.Default.Equals(…) gets inlined).
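The same comparer idea can be applied to the dictionary from the question. The sketch below is an adaptation rather than the code above: it targets ValueTuple instead of Tuple, and the name UnorderedPairComparer is invented for the example.

```csharp
using System;
using System.Collections.Generic;

// Treats (a, b) and (b, a) as the same key.
public sealed class UnorderedPairComparer<T> : IEqualityComparer<(T, T)>
{
    public bool Equals((T, T) x, (T, T) y)
    {
        var c = EqualityComparer<T>.Default;
        // Same order, or swapped order.
        return (c.Equals(x.Item1, y.Item1) && c.Equals(x.Item2, y.Item2))
            || (c.Equals(x.Item1, y.Item2) && c.Equals(x.Item2, y.Item1));
    }

    public int GetHashCode((T, T) obj)
    {
        // XOR is symmetric, so both orders hash alike; null elements hash to 0.
        int h1 = obj.Item1 == null ? 0 : obj.Item1.GetHashCode();
        int h2 = obj.Item2 == null ? 0 : obj.Item2.GetHashCode();
        return h1 ^ h2;
    }
}
```

Passing an instance to the constructor, as in new Dictionary<(char, char), int>(new UnorderedPairComparer<char>()), makes ContainsKey(('b', 'a')) succeed after ('a', 'b') has been added.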

Near perfect hash for guid tostring as dictionary key

What I'm trying to solve: using a GUID string as a key for Dictionary<string, someObject>, and I want perfect hashing on the key.
I'm not sure if I'm missing something. When I run the following test with the dictionary constructor that only takes the size allocation, I get about 10 collisions each run. When I pass in the IEqualityComparer that just calls GetHashCode on the string, the test passes, over multiple runs with x = 10 iterations in some cases and y up to a million. I thought the dictionary adjusted the hashing function, especially when dealing with strings? I don't have Reflector on my machine, so I can't check tonight. If you comment out the alternating dictionary initialisations you'll see. The test runs relatively quickly on my i7.
[TestMethod]
public void NearPerfectHashingForGuidStrings()
{
int y = 100000;
int collisions = 0;
//Dictionary<string, string> list = new Dictionary<string, string>(y, new GuidStringHashing());
Dictionary<string, string> list = new Dictionary<string, string>(y);
for (int x = 0; x < 5; x++)
{
Enumerable.Range(1, y).ToList().ForEach((h) =>
{
list[Guid.NewGuid().ToString()] = h.ToString();
});
var hashDuplicates = list.Keys.GroupBy(h => h.GetHashCode())
.Where(group => group.Count() > 1)
.Select(group => group.Key).ToList();
hashDuplicates.ToList().ForEach(v => Debug.WriteLine( x + "--- " + v));
collisions += hashDuplicates.Count();
list.Clear();
}
Assert.AreEqual(0, collisions);
}
public class GuidStringHashing : IEqualityComparer<string>
{
public bool Equals(string x, string y)
{
return GetHashCode(x) == GetHashCode(y);
}
public int GetHashCode(string obj)
{
return obj.GetHashCode();
}
}
Your test is broken.
Because your equality comparer incorrectly reports that two different GUIDs that happen to have the same hash are equal, your dictionary never stores the collisions in the first place.
Due to the pigeonhole principle, it is fundamentally impossible to create a 32-bit perfect hash for more than 2^32 items.
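For contrast, a comparer that would not hide collisions has to compare the strings themselves and use the hash only as a hint. A minimal sketch, with a class name chosen for illustration:

```csharp
using System;
using System.Collections.Generic;

public sealed class OrdinalGuidStringComparer : IEqualityComparer<string>
{
    // Equality is decided by the characters, never by the hash code alone.
    public bool Equals(string x, string y) =>
        string.Equals(x, y, StringComparison.Ordinal);

    public int GetHashCode(string obj) =>
        obj == null ? 0 : obj.GetHashCode();
}
```

With this comparer, two different GUID strings that share a hash code are both stored, so the GroupBy check in the test would see them.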
It's impossible. You want a perfect hash function for an unknown set of keys. You can create perfect hash functions for specific set of keys. You can't create one perfect hash function that will work on all sets of keys.
The reason is the "Two Jesus Principle", as was so nicely put by Mark Knopfler: "Two men say they're Jesus, one of them must be wrong." (It's more widely known as the "pigeonhole principle".)
What do you mean by a perfect hash code?
Your code is somewhat confusing, especially because you post a class GuidStringHashing that is not used by your test method.
But your code demonstrates that when you make 100,000 GUIDs, convert them all to strings, and then take the hash code of the strings, then it happens quite often that not all hash codes are distinct. That might be surprising when there are more than 4 billion integers to choose between, and you only generate 100,000 strings.
You're using the GetHashCode() for general strings, but your strings are not too general, they're all something like
"2315c2a7-7d29-42b1-9696-fe6a9dd72ffd"
so maybe your hash code is not optimal. It's better to parse the strings h back to a GUID and use the hash code of that, as in (new Guid(h)).GetHashCode().
However this still gives collisions with 100,000 GUIDs. I think what you're seeing is just the birthday paradox.
Try this simpler code. Here I use GetHashCode() on the GUIDs, so we expect that the integers are quite random:
var set = new HashSet<int>();
for (int i = 1; true; ++i)
{
if (!set.Add(Guid.NewGuid().GetHashCode()))
Console.WriteLine("Collision, i is: " + i);
}
We see (by running the above code many times) that a collision almost always happens before 100,000 hash codes have been calculated.
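The birthday effect can also be estimated without running the experiment. The sketch below uses the standard approximation p ≈ 1 − exp(−n(n−1)/(2N)) for n values drawn from N possibilities. It assumes the hash codes are uniformly distributed over all 2^32 ints, which real hash functions only approximate, and the method name is hypothetical:

```csharp
using System;

public static class BirthdayEstimate
{
    // Approximate probability of at least one collision among n uniformly
    // random values drawn from `buckets` equally likely possibilities.
    public static double CollisionProbability(double n, double buckets) =>
        1.0 - Math.Exp(-n * (n - 1) / (2.0 * buckets));
}
```

For n = 100,000 and 2^32 buckets the estimate comes out well above one half, so seeing a collision in most runs of 100,000 GUIDs is what the model predicts.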

Hashcode implementation double precision

I've asked a question about this class before, but here is one again.
I've created a Complex class:
public class Complex
{
public double Real { get; set; }
public double Imaginary { get; set; }
}
I'm implementing the Equals and GetHashCode functions, and Equals takes a certain precision into account. I use the following logic for that:
public override bool Equals(object obj)
{
//Some default null checking etc. here; the next code is all that matters.
return Math.Abs(complex.Imaginary - Imaginary) <= 0.00001 &&
Math.Abs(complex.Real - Real) <= 0.00001;
}
This works: when the Imaginary and Real parts of two instances are very close to each other, it says they are the same.
Now I am trying to implement the GetHashCode function. I've used some examples Jon Skeet posted here, and currently I have the following.
public override int GetHashCode()
{
var hash = 17;
hash = hash*23 + Real.GetHashCode();
hash = hash*23 + Imaginary.GetHashCode();
return hash;
}
However, this does not take into account the precision I want to use. So basically the following two instances:
Complex1[Real = 1.123456; Imaginary = 1.123456]
Complex2[Real = 1.123457; Imaginary = 1.123457]
are equal but do not produce the same hash code. How can I achieve that?
First of all, your Equals() implementation is broken. Read here to see why.
Second, such a "fuzzy equals" breaks the contract of Equals() (it's not transitive, for one thing), so using it with Hashtable will not work, no matter how you implement GetHashCode().
For this kind of thing, you really need a spatial index such as an R-Tree.
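The transitivity problem can be seen with three numbers. The sketch below reuses the 0.00001 tolerance from the question; the helper name NearlyEqual is made up for the example:

```csharp
using System;

public static class FuzzyEqualityDemo
{
    public const double Tolerance = 0.00001;

    // The per-component test used by the Equals in the question.
    public static bool NearlyEqual(double x, double y) =>
        Math.Abs(x - y) <= Tolerance;
}
```

Here NearlyEqual(0.0, 0.000008) and NearlyEqual(0.000008, 0.000016) both hold, yet NearlyEqual(0.0, 0.000016) does not. A hash function consistent with this equality would have to give the first and second values the same code, and the second and third, and so on along the chain, which only a constant hash can do.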
Just drop precision when you calculate the hash value.
public override int GetHashCode()
{
var hash = 17;
hash = hash*23 + Math.Round(Real, 5).GetHashCode();
hash = hash*23 + Math.Round(Imaginary, 5).GetHashCode();
return hash;
}
where 5 is your precision value
I see two simple options:
Use Decimal instead of double
Instead of using Real.GetHashCode, use Real.RoundTo6Ciphers().GetHashCode().
Then you'll have the same hashcode.
I would create read-only properties that round Real and Imaginary to the nearest hundred-thousandth and then do equals and hashcode implementations on those getter properties.
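A sketch of that last suggestion follows. The class name RoundedComplex and its members are assumptions, not code from the question. Both Equals and GetHashCode work on the rounded values, so the two always agree. The trade-off is that equality now means "rounds to the same value" rather than "within 0.00001", so two numbers that are very close can still land on different sides of a rounding boundary.

```csharp
using System;

public sealed class RoundedComplex : IEquatable<RoundedComplex>
{
    private const int Digits = 5; // assumed precision

    public double Real { get; }
    public double Imaginary { get; }

    public RoundedComplex(double real, double imaginary)
    {
        Real = real;
        Imaginary = imaginary;
    }

    // Read-only views used for both equality and hashing.
    private double RoundedReal => Math.Round(Real, Digits);
    private double RoundedImaginary => Math.Round(Imaginary, Digits);

    public bool Equals(RoundedComplex other) =>
        other != null
        && RoundedReal.Equals(other.RoundedReal)
        && RoundedImaginary.Equals(other.RoundedImaginary);

    public override bool Equals(object obj) => Equals(obj as RoundedComplex);

    public override int GetHashCode()
    {
        unchecked
        {
            int hash = 17;
            hash = hash * 23 + RoundedReal.GetHashCode();
            hash = hash * 23 + RoundedImaginary.GetHashCode();
            return hash;
        }
    }
}
```

With this, the two example values from the question (1.123456 and 1.123457 in both parts) compare equal and share a hash code.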

High Runtime for Dictionary.Add for a large amount of items

I have a C#-Application that stores data from a TextFile in a Dictionary-Object. The amount of data to be stored can be rather large, so it takes a lot of time inserting the entries. With many items in the Dictionary it gets even worse, because of the resizing of internal array, that stores the data for the Dictionary.
So I initialized the Dictionary with the amount of items that will be added, but this has no impact on speed.
Here is my function:
private Dictionary<IdPair, Edge> AddEdgesToExistingNodes(HashSet<NodeConnection> connections)
{
Dictionary<IdPair, Edge> resultSet = new Dictionary<IdPair, Edge>(connections.Count);
foreach (NodeConnection con in connections)
{
...
resultSet.Add(nodeIdPair, newEdge);
}
return resultSet;
}
In my tests, I insert ~300k items.
I checked the running time with ANTS Performance Profiler and found, that the Average time for resultSet.Add(...) doesn't change when I initialize the Dictionary with the needed size. It is the same as when I initialize the Dictionary with new Dictionary(); (about 0.256 ms on average for each Add).
This is definitely caused by the amount of data in the Dictionary (ALTHOUGH I initialized it with the desired size). For the first 20k items, the average time for Add is 0.03 ms for each item.
Any idea, how to make the add-operation faster?
Thanks in advance,
Frank
Here is my IdPair-Struct:
public struct IdPair
{
public int id1;
public int id2;
public IdPair(int oneId, int anotherId)
{
if (oneId > anotherId)
{
id1 = anotherId;
id2 = oneId;
}
else if (anotherId > oneId)
{
id1 = oneId;
id2 = anotherId;
}
else
throw new ArgumentException("The two Ids of the IdPair can't have the same value.");
}
}
Since you have a struct, you get the default implementation of Equals() and GetHashCode(). As others have pointed out, this is not very efficient since it uses reflection, but I don't think the reflection is the issue.
My guess is that your hash codes get distributed unevenly by the default GetHashCode(), which could happen, for example, if the default implementation returns a simple XOR of all members (in which case hash(a, b) == hash(b, a)). I can't find any documentation of how ValueType.GetHashCode() is implemented, but try adding
public override int GetHashCode() {
return id1 << 16 | (id2 & 0xffff);
}
which might be better.
IdPair is a struct, and you haven't overridden Equals or GetHashCode. This means that the default implementation of those methods will be used.
For value-types the default implementation of Equals and GetHashCode uses reflection, which is likely to result in poor performance. Try providing your own implementation of the methods and see if that helps.
My suggested implementation, it might not be exactly what you need/want:
public struct IdPair : IEquatable<IdPair>
{
// ...
public override bool Equals(object obj)
{
if (obj is IdPair)
return Equals((IdPair)obj);
return false;
}
public bool Equals(IdPair other)
{
return id1.Equals(other.id1)
&& id2.Equals(other.id2);
}
public override int GetHashCode()
{
unchecked
{
int hash = 269;
hash = (hash * 19) + id1.GetHashCode();
hash = (hash * 19) + id2.GetHashCode();
return hash;
}
}
}
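Put together with the constructor from the question, a complete key type might look like the sketch below. It is renamed EdgeKey to mark it as an illustration, and readonly struct assumes C# 7.2 or later:

```csharp
using System;
using System.Collections.Generic;

public readonly struct EdgeKey : IEquatable<EdgeKey>
{
    public readonly int Id1; // always the smaller id
    public readonly int Id2; // always the larger id

    public EdgeKey(int oneId, int anotherId)
    {
        if (oneId == anotherId)
            throw new ArgumentException("The two Ids of the pair can't have the same value.");
        Id1 = Math.Min(oneId, anotherId);
        Id2 = Math.Max(oneId, anotherId);
    }

    // Strongly typed Equals avoids boxing and reflection on every lookup.
    public bool Equals(EdgeKey other) => Id1 == other.Id1 && Id2 == other.Id2;

    public override bool Equals(object obj) => obj is EdgeKey other && Equals(other);

    public override int GetHashCode()
    {
        unchecked
        {
            int hash = 269;
            hash = (hash * 19) + Id1;
            hash = (hash * 19) + Id2;
            return hash;
        }
    }
}
```

Because the struct implements IEquatable<EdgeKey>, a Dictionary<EdgeKey, TValue> created without an explicit comparer should pick up the typed Equals through EqualityComparer<EdgeKey>.Default.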

Question about Dictionary<T,T>

I have a class which looks like this:
public class NumericalRange:IEquatable<NumericalRange>
{
public double LowerLimit;
public double UpperLimit;
public NumericalRange(double lower, double upper)
{
LowerLimit = lower;
UpperLimit = upper;
}
public bool DoesLieInRange(double n)
{
if (LowerLimit <= n && n <= UpperLimit)
return true;
else
return false;
}
#region IEquatable<NumericalRange> Members
public bool Equals(NumericalRange other)
{
if (Double.IsNaN(this.LowerLimit)&& Double.IsNaN(other.LowerLimit))
{
if (Double.IsNaN(this.UpperLimit) && Double.IsNaN(other.UpperLimit))
{
return true;
}
}
if (this.LowerLimit == other.LowerLimit && this.UpperLimit == other.UpperLimit)
return true;
return false;
}
#endregion
}
This class holds a numerical range of values. This class should also be able to hold a default range, where both LowerLimit and UpperLimit are equal to Double.NaN.
Now this class goes into a Dictionary.
The Dictionary works fine for 'non-NaN' numerical range values, but when the Key is {NaN,NaN} NumericalRange Object, then the dictionary throws a KeyNotFoundException.
What am I doing wrong? Is there any other interface that I have to implement?
Based on your comment, you haven't implemented GetHashCode. I'm amazed that the class works at all in a dictionary, unless you're always requesting the identical key that you put in. I would suggest an implementation of something like:
public override int GetHashCode()
{
int hash = 17;
hash = hash * 23 + UpperLimit.GetHashCode();
hash = hash * 23 + LowerLimit.GetHashCode();
return hash;
}
That assumes Double.GetHashCode() gives a consistent value for NaN. There are many values of NaN of course, and you may want to special case it to make sure they all give the same hash.
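One way to special-case NaN is sketched below. The helper names are invented for illustration. Every NaN bit pattern is mapped to the same hash before combining, so the result does not depend on how Double.GetHashCode() treats different NaN payloads:

```csharp
using System;

public static class NaNAwareHash
{
    // All NaN values share one hash; other values use the normal double hash.
    public static int HashOf(double value) =>
        double.IsNaN(value) ? 0 : value.GetHashCode();

    // Same 17/23 combination as above, applied to the normalised hashes.
    public static int Combine(double lower, double upper)
    {
        unchecked
        {
            int hash = 17;
            hash = hash * 23 + HashOf(upper);
            hash = hash * 23 + HashOf(lower);
            return hash;
        }
    }
}
```

GetHashCode in NumericalRange could then return NaNAwareHash.Combine(LowerLimit, UpperLimit), which matches the Equals in the question, where any NaN pair equals any other NaN pair.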
You should also override the Equals method inherited from Object:
public override bool Equals(Object other)
{
return other != null &&
other.GetType() == GetType() &&
Equals((NumericalRange) other);
}
Note that the type check can be made more efficient by using as if you seal your class. Otherwise you'll get interesting asymmetries between x.Equals(y) and y.Equals(x) if someone derives another class from yours. Equality becomes tricky with inheritance.
You should also make your fields private, exposing them only as properties. If this is going to be used as a key in a dictionary, I strongly recommend that you make them readonly, too. Changing the contents of a key when it's used in a dictionary is likely to lead to it being "unfindable" later.
The default implementation of the GetHashCode method uses the reference of the object rather than the values in the object. You have to use the same instance of the object as you used to put the data in the dictionary for that to work.
An implementation of GetHashCode that works simply creates a code from the hash codes of its data members:
public override int GetHashCode() {
return LowerLimit.GetHashCode() ^ UpperLimit.GetHashCode();
}
(This is the same implementation that the Point structure uses.)
Any implementation of the method that always returns the same hash code for any given parameter values works when used in a Dictionary. Just returning the same hash code for all values actually also works, but then the performance of the Dictionary gets bad (looking up a key becomes an O(n) operation instead of an O(1) operation). To give the best performance, the method should distribute the hash codes evenly within the range.
If your data is strongly biased, the above implementation might not give the best performance. If you for example have a lot of ranges where the lower and upper limits are the same, they will all get the hash code zero. In that case something like this might work better:
public override int GetHashCode() {
return (LowerLimit.GetHashCode() * 251) ^ UpperLimit.GetHashCode();
}
You should consider making the class immutable, i.e. make its properties read-only and only set them in the constructor. If you change the properties of an object while it's in a Dictionary, its hash code will change and you will not be able to access the object any more.
