Currently I'm using the following class as my key for a Dictionary collection of objects that are unique by ColumnID and a nullable SubGroupID:
public class ColumnDataKey
{
    public int ColumnID { get; private set; }
    public int? SubGroupID { get; private set; }

    // ...

    public override int GetHashCode()
    {
        var hashKey = this.ColumnID + "_" +
            (this.SubGroupID.HasValue ? this.SubGroupID.Value.ToString() : "NULL");
        return hashKey.GetHashCode();
    }
}
I was thinking of somehow combining these into a 64-bit integer, but I'm not sure how to deal with null SubGroupIDs. This is as far as I got, but it isn't valid, as SubGroupID can be zero:
var hashKey = ((long)this.ColumnID << 32) +
    (this.SubGroupID.HasValue ? this.SubGroupID.Value : 0);
return hashKey.GetHashCode();
Any ideas?
Strictly speaking you won't be able to combine these perfectly because logically int? has 33 bits of information (32 bits for the integer and a further bit indicating whether the value is present or not). Your non-nullable int has a further 32 bits of information making 65 bits in total but a long only has 64 bits.
If you can safely restrict the value range of either of the ints to only 31 bits then you'd be able to pack them roughly as you're already doing. However, you won't gain anything by doing it that way - you might as well just calculate the hash code directly, like this (with thanks to ReSharper's boilerplate code generation):
public override int GetHashCode()
{
    unchecked
    {
        return (ColumnID * 397) ^ (SubGroupID.HasValue ? SubGroupID.Value : 0);
    }
}
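For completeness: if you do need a genuinely unique 64-bit key rather than a hash code, and SubGroupID is known to be non-negative (so it fits in 31 bits), a packing along these lines would work. This is only a sketch, using the spare bit as the HasValue flag:

long packed = ((long)ColumnID << 32)
            | ((SubGroupID.HasValue ? 1L : 0L) << 31)
            | (uint)(SubGroupID.GetValueOrDefault() & 0x7FFFFFFF);  // low 31 bits of the value

This way null is distinguishable from zero, because null leaves the flag bit clear.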
You seem to be thinking of GetHashCode as a unique key. It isn't. Hash codes are 32-bit integers, and are not meant to be unique, only well-distributed across the 32-bit space to minimise the probability of collision. Try this for your GetHashCode method of ColumnDataKey:
ColumnID * 397 ^ (SubGroupID.HasValue ? SubGroupID.Value : -11111111)
The magic numbers here are 397, a prime number which for reasons of voodoo magic is a good number to multiply by to mix up your bits (and happens to be the number the ReSharper team picked), and -11111111, a SubGroupID value which I assume is unlikely to arise in practice.
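As a complete override, this would look something like the following (a minimal sketch; the sentinel -11111111 is an assumption and must never occur as a real SubGroupID, or null would collide with it):

public override int GetHashCode()
{
    unchecked
    {
        // null maps to the sentinel, every real value maps to itself
        return ColumnID * 397 ^ (SubGroupID ?? -11111111);
    }
}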
Related
I have a collection holding the permutations of two unique orders, where OrderId is unique; for Order1 (Id = 1) and Order2 (Id = 2) it contains both the combination 1-2 and 2-1. Now, while processing a routing algorithm, a few conditions are checked, and once a combination is included in the final result its reverse has to be ignored and needn't be considered for processing. Since the Id is an integer, I have created the following logic:
private static int GetPairKey(int firstOrderId, int secondOrderId)
{
    var orderCombinationType = (firstOrderId < secondOrderId)
        ? new { max = secondOrderId, min = firstOrderId }
        : new { max = firstOrderId, min = secondOrderId };
    return (orderCombinationType.min.GetHashCode() ^ orderCombinationType.max.GetHashCode());
}
In the logic, I create a Dictionary<int, int> whose key is produced by the GetPairKey method shown above; there I ensure the two ids of a given combination are arranged consistently, so that both orderings produce the same value, which can then be inserted into and checked against the dictionary. The value is a dummy and is ignored.

However, the above logic seems to have a flaw, and it doesn't work as expected for all of the processing. What am I doing wrong, and should I try something different to create the key? Would something like Tuple.Create(minOrderId, maxOrderId).GetHashCode() be a better choice? Please suggest. The following is the relevant code usage:
foreach (var pair in localSavingPairs)
{
    var firstOrder = pair.FirstOrder;
    var secondOrder = pair.SecondOrder;

    if (processedOrderDictionary.ContainsKey(GetPairKey(firstOrder.Id, secondOrder.Id))) continue;
Adding to the dictionary is the following code (the value 0 is a dummy and is not used):

processedOrderDictionary.Add(GetPairKey(firstOrder.Id, secondOrder.Id), 0);
You need a value that can uniquely represent every possible value.
That is different to a hash-code.
You could uniquely represent each value with a long, or with a class or struct that contains all of the appropriate values. Since beyond a certain total size a long won't work any more, let's look at the other approach, which is more flexible and more extensible:
public class KeyPair : IEquatable<KeyPair>
{
    public int Min { get; private set; }
    public int Max { get; private set; }

    public KeyPair(int first, int second)
    {
        if (first < second)
        {
            Min = first;
            Max = second;
        }
        else
        {
            Min = second;
            Max = first;
        }
    }

    public bool Equals(KeyPair other)
    {
        return other != null && other.Min == Min && other.Max == Max;
    }

    public override bool Equals(object other)
    {
        return Equals(other as KeyPair);
    }

    public override int GetHashCode()
    {
        return unchecked(Max * 31 + Min);
    }
}
Now, the GetHashCode() here will not be unique, but the KeyPair itself will be. Ideally the hashcodes will be very different to each other to better distribute these objects, but doing much better than the above depends on information about the actual values that will be seen in practice.
The dictionary will use that to find the item, but it will also use Equals to pick between those where the hash code is the same.
(You can experiment with this by having a version for which GetHashCode() always just returns 0. It will perform very poorly, because everything will collide, but it will still work.)
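Based on the loop from the question, usage might look something like the sketch below. Since only membership matters and the dictionary value is a dummy, a HashSet<KeyPair> is arguably the better fit:

var processed = new HashSet<KeyPair>();
foreach (var pair in localSavingPairs)
{
    // Add returns false if an equal KeyPair (in either order) was already seen
    if (!processed.Add(new KeyPair(pair.FirstOrder.Id, pair.SecondOrder.Id)))
        continue;

    // ... process the pair ...
}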
First, 42.GetHashCode() returns 42. Second, 1 ^ 2 is identical to 2 ^ 1, so there's really no point in sorting numbers. Third, your "hash" function is very weak and produces a lot of collisions, which is why you're observing the flaws.
There are two options I can think of right now:
Use a slightly "stronger" hash function
Replace your Dictionary<int, int> with a Dictionary<string, int>, the keys being your two sorted numbers separated by whatever character you prefer, e.g. 56-6472 (sketched below)
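A sketch of the second option; because the sorted, delimited string is unique per pair, colliding hash codes can no longer conflate two different pairs:

private static string GetPairKey(int firstOrderId, int secondOrderId)
{
    // sort so that (1, 2) and (2, 1) produce the same key
    return firstOrderId < secondOrderId
        ? firstOrderId + "-" + secondOrderId
        : secondOrderId + "-" + firstOrderId;
}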
Given that XOR is commutative (so (a ^ b) will always be the same as (b ^ a)), it seems to me that your ordering is misguided for the XOR approach... I'd just use
(new {firstOrderId, secondOrderId}).GetHashCode()
.NET will fix you up with a good, well-distributed hashing implementation for anonymous types.
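Note, though, that the compiler-generated hash for an anonymous type is order-sensitive: new { a = 1, b = 2 } and new { a = 2, b = 1 } hash differently. If 1-2 and 2-1 must still match, the pair has to be normalised first; a sketch:

private static int GetPairKey(int firstOrderId, int secondOrderId)
{
    int min = Math.Min(firstOrderId, secondOrderId);
    int max = Math.Max(firstOrderId, secondOrderId);

    // the anonymous type combines both hashes in an order-sensitive way
    return new { min, max }.GetHashCode();
}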
I have a class with string properties and I need to override GetHashCode() method.
class A
{
    public string Prop1 { get; set; }
    public string Prop2 { get; set; }
    public string Prop3 { get; set; }
}
The first idea is to do something like this:
public override int GetHashCode()
{
    return Prop1.GetHashCode() ^ Prop2.GetHashCode() ^ Prop3.GetHashCode();
}
The second idea is:
public override int GetHashCode()
{
    return String.Join(";", new[] { Prop1, Prop2, Prop3 }).GetHashCode();
}
What is the best way?
You shouldn't just XOR them together, because this doesn't account for ordering. Imagine you have two objects:
"foo", "bar", "baz"
and
"bar", "foo", "baz"
With a simple XOR, both of these will have the same hash. Luckily it's pretty easy to work around. This is the code I use to combine hashes:
static int MultiHash(IEnumerable<object> items)
{
    Contract.Requires(items != null);

    int h = 0;
    foreach (object item in items)
    {
        h = Combine(h, item != null ? item.GetHashCode() : 0);
    }
    return h;
}

static int Combine(int x, int y)
{
    unchecked
    {
        // This isn't a particularly strong way to combine hashes, but it's
        // cheap, respects ordering, and should work for the majority of cases.
        // (+ binds tighter than ^, so the parentheses just make the grouping explicit.)
        return ((x << 5) + 3 + x) ^ y;
    }
}
There are a lot of ways to combine hashes, but usually something very simple like this will do. If for some reason it doesn't work for your situation, MurmurHash has pretty robust hash combining you can pull.
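For the class A from the question, a usage might look like this (a sketch, assuming MultiHash is accessible from the class):

public override int GetHashCode()
{
    return MultiHash(new object[] { Prop1, Prop2, Prop3 });
}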
Just XOR the hashes of each string together. It is cheaper (performance-wise) than the string concatenation, and as far as I can see it is not much more prone to collisions. Let's assume each string is 5 characters long and each character takes up 1 byte. In the first version you are hashing 15 bytes down to 4 bytes (an int). In the second version you are concatenating all 3 strings (an expensive operation) to end up with one 15-byte string, and then you are hashing it down to 4 bytes. Both transform 15 bytes into 4, so in theory both are quite similar in terms of collisions.
In reality there is a bit of a difference in the collision probabilities, but in practice it may not always matter; it depends on the data the strings will hold. Suppose all 3 strings hash to 0001 (a small number just for the sake of the example): XORing the first two gives you 0000, and XORing the third with that gets you back to 0001, so any two equal hashes cancel each other out entirely. Concatenating the strings avoids this, at the cost of some performance (in a performance-critical program I wouldn't concatenate strings in an inner loop).
So in the end, I haven't really given an answer after all, for the simple reason that there really isn't one. It all depends on where and how it will be used.
I've asked a question about this class before, but here is one again.
I've created a Complex class:
public class Complex
{
    public double Real { get; set; }
    public double Imaginary { get; set; }
}
And I'm implementing the Equals and GetHashCode functions, where the Equals function takes into account a certain precision. I use the following logic for that:
public override bool Equals(object obj)
{
    // Some default null checking etc. here; complex is obj cast to Complex.
    // The next code is all that matters:
    return Math.Abs(complex.Imaginary - Imaginary) <= 0.00001 &&
           Math.Abs(complex.Real - Real) <= 0.00001;
}
Well, this works: when the imaginary and the real parts are really close to each other, it says they are the same.
Now I was trying to implement the GetHashCode function. I've used some of the examples Jon Skeet has posted here; currently I have the following:
public override int GetHashCode()
{
    var hash = 17;
    hash = hash * 23 + Real.GetHashCode();
    hash = hash * 23 + Imaginary.GetHashCode();
    return hash;
}
However, this does not take into account the precision I want to use. So basically the following two instances:

Complex1 [Real = 1.123456; Imaginary = 1.123456]
Complex2 [Real = 1.123457; Imaginary = 1.123457]

are equal but do not produce the same hash code. How can I achieve that?
First of all, your Equals() implementation is broken. Read here to see why.
Second, such a "fuzzy equals" breaks the contract of Equals() (it's not transitive, for one thing), so using it with Hashtable will not work, no matter how you implement GetHashCode().
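To see the transitivity problem concretely, here is a sketch using the 0.00001 tolerance from the question:

var a = new Complex { Real = 1.000000, Imaginary = 0 };
var b = new Complex { Real = 1.000008, Imaginary = 0 };
var c = new Complex { Real = 1.000016, Imaginary = 0 };

// a.Equals(b) is true and b.Equals(c) is true (differences of 0.000008),
// but a.Equals(c) is false (difference of 0.000016), so Equals is not transitive.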
For this kind of thing, you really need a spatial index such as an R-Tree.
Just drop precision when you calculate the hash value.
public override int GetHashCode()
{
    var hash = 17;
    hash = hash * 23 + Math.Round(Real, 5).GetHashCode();
    hash = hash * 23 + Math.Round(Imaginary, 5).GetHashCode();
    return hash;
}
where 5 is your precision value.
I see two simple options:
Use Decimal instead of double
Instead of using Real.GetHashCode, use Real.RoundTo6Ciphers().GetHashCode().
Then you'll have the same hashcode.
I would create read-only properties that round Real and Imaginary to the nearest hundred-thousandth and then do equals and hashcode implementations on those getter properties.
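A sketch of that idea, rounding to five decimal places to match the 0.00001 tolerance (note that two values straddling a rounding boundary can still be "equal" yet hash differently):

private double RealKey { get { return Math.Round(Real, 5); } }
private double ImaginaryKey { get { return Math.Round(Imaginary, 5); } }

public override int GetHashCode()
{
    var hash = 17;
    hash = hash * 23 + RealKey.GetHashCode();
    hash = hash * 23 + ImaginaryKey.GetHashCode();
    return hash;
}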
I need to generate a fast hash code in GetHashCode for a BitArray. I have a Dictionary where the keys are BitArrays, and all the BitArrays are of the same length.
Does anyone know of a fast way to generate a good hash from a variable number of bits, as in this scenario?
UPDATE:
The approach I originally took was to access the internal array of ints directly through reflection (speed is more important than encapsulation in this case), then XOR those values. The XOR approach seems to work well, i.e. my Equals method isn't called excessively when searching in the Dictionary:
public int GetHashCode(BitArray array)
{
    int hash = 0;
    foreach (int value in array.GetInternalValues())
    {
        hash ^= value;
    }
    return hash;
}
However, the approach suggested by Mark Byers and seen elsewhere on StackOverflow was slightly better (16570 Equals calls vs 16608 for the XOR for my test data). Note that this approach fixes a bug in the previous one where bits beyond the end of the bit array could affect the hash value. This could happen if the bit array was reduced in length.
public int GetHashCode(BitArray array)
{
    UInt32 hash = 17;
    int bitsRemaining = array.Length;
    foreach (int value in array.GetInternalValues())
    {
        UInt32 cleanValue = (UInt32)value;
        if (bitsRemaining < 32)
        {
            // clear any bits that are beyond the end of the array
            int bitsToWipe = 32 - bitsRemaining;
            cleanValue <<= bitsToWipe;
            cleanValue >>= bitsToWipe;
        }

        hash = hash * 23 + cleanValue;
        bitsRemaining -= 32;
    }
    return (int)hash;
}
The GetInternalValues extension method is implemented like this:
public static class BitArrayExtensions
{
    static FieldInfo _internalArrayGetter = GetInternalArrayGetter();

    static FieldInfo GetInternalArrayGetter()
    {
        return typeof(BitArray).GetField("m_array", BindingFlags.NonPublic | BindingFlags.Instance);
    }

    static int[] GetInternalArray(BitArray array)
    {
        return (int[])_internalArrayGetter.GetValue(array);
    }

    public static IEnumerable<int> GetInternalValues(this BitArray array)
    {
        return GetInternalArray(array);
    }

    // ... more extension methods
}
Any suggestions for improvement are welcome!
BitArray is a terrible class to act as a key in a Dictionary. The only reasonable way to implement GetHashCode() is by using its CopyTo() method to copy the bits into a byte[]. That's not great; it creates a ton of garbage.
Beg, steal or borrow to use a BitVector32 instead. It has a good implementation for GetHashCode(). If you've got more than 32 bits then consider spinning your own class so you can get to the underlying array without having to copy.
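A sketch of the BitVector32 suggestion (it lives in System.Collections.Specialized and wraps a single int, so its Equals() and GetHashCode() are cheap and well-behaved):

int bit0 = BitVector32.CreateMask();       // mask for the first bit
int bit1 = BitVector32.CreateMask(bit0);   // mask for the next bit

var key = new BitVector32(0);
key[bit1] = true;

var dict = new Dictionary<BitVector32, string> { { key, "value" } };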
If the bit arrays are 32 bits or shorter then you just need to convert them to 32 bit integers (padding with zero bits if necessary).
If they can be longer then you can either convert them to a series of 32-bit integers and XOR them, or better: use the algorithm described in Effective Java.
public override int GetHashCode()
{
    int hash = 17;
    hash = hash * 23 + field1.GetHashCode();
    hash = hash * 23 + field2.GetHashCode();
    hash = hash * 23 + field3.GetHashCode();
    return hash;
}
Taken from here. field1, field2, etc. correspond to the first 32 bits, the second 32 bits, and so on.
I have a C# application that stores data from a text file in a Dictionary object. The amount of data to be stored can be rather large, so it takes a lot of time to insert the entries. With many items in the Dictionary it gets even worse, because of the resizing of the internal array that stores the Dictionary's data.
So I initialized the Dictionary with the amount of items that will be added, but this has no impact on speed.
Here is my function:
private Dictionary<IdPair, Edge> AddEdgesToExistingNodes(HashSet<NodeConnection> connections)
{
    Dictionary<IdPair, Edge> resultSet = new Dictionary<IdPair, Edge>(connections.Count);

    foreach (NodeConnection con in connections)
    {
        ...
        resultSet.Add(nodeIdPair, newEdge);
    }

    return resultSet;
}
In my tests, I insert ~300k items.
I checked the running time with ANTS Performance Profiler and found that the average time for resultSet.Add(...) doesn't change when I initialize the Dictionary with the needed size. It is the same as when I initialize the Dictionary with new Dictionary(); (about 0.256 ms on average for each Add).
This is definitely caused by the amount of data in the Dictionary (ALTHOUGH I initialized it with the desired size). For the first 20k items, the average time for Add is 0.03 ms for each item.
Any idea, how to make the add-operation faster?
Thanks in advance,
Frank
Here is my IdPair-Struct:
public struct IdPair
{
    public int id1;
    public int id2;

    public IdPair(int oneId, int anotherId)
    {
        if (oneId > anotherId)
        {
            id1 = anotherId;
            id2 = oneId;
        }
        else if (anotherId > oneId)
        {
            id1 = oneId;
            id2 = anotherId;
        }
        else
            throw new ArgumentException("The two Ids of the IdPair can't have the same value.");
    }
}
Since you have a struct, you get the default implementation of Equals() and GetHashCode(). As others have pointed out, this is not very efficient since it uses reflection, but I don't think the reflection is the issue.
My guess is that your hash codes get distributed unevenly by the default GetHashCode(), which could happen, for example, if the default implementation returns a simple XOR of all members (in which case hash(a, b) == hash(b, a)). I can't find any documentation of how ValueType.GetHashCode() is implemented, but try adding
public override int GetHashCode()
{
    // use the struct's fields; the constructor parameters oneId/anotherId
    // are not in scope here
    return id1 << 16 | (id2 & 0xffff);
}
which might be better.
IdPair is a struct, and you haven't overridden Equals or GetHashCode. This means that the default implementation of those methods will be used.
For value-types the default implementation of Equals and GetHashCode uses reflection, which is likely to result in poor performance. Try providing your own implementation of the methods and see if that helps.
My suggested implementation, it might not be exactly what you need/want:
public struct IdPair : IEquatable<IdPair>
{
    // ...

    public override bool Equals(object obj)
    {
        if (obj is IdPair)
            return Equals((IdPair)obj);

        return false;
    }

    public bool Equals(IdPair other)
    {
        return id1.Equals(other.id1)
            && id2.Equals(other.id2);
    }

    public override int GetHashCode()
    {
        unchecked
        {
            int hash = 269;
            hash = (hash * 19) + id1.GetHashCode();
            hash = (hash * 19) + id2.GetHashCode();
            return hash;
        }
    }
}