Near-perfect hash for Guid ToString as dictionary key - C#

What I'm trying to solve: using a GUID string as the key for a Dictionary<string, someObject>, and I want perfect hashing on the key...

Not sure if I'm missing something... When I run the following test with the dictionary constructor only passing in the size allocation, I get roughly 10 collisions each run. When I pass in the IEqualityComparer that just calls GetHashCode on the string, the test passes every time, with multiple runs, using x = 10 iterations in some cases and y up to a million! I thought the dictionary adjusted its hashing function, especially when dealing with strings? I don't have Reflector on my machine :( so I can't check tonight... If you comment out the alternating dictionary initialisations you'll see... The test runs relatively quickly on my i7.
[TestMethod]
public void NearPerfectHashingForGuidStrings()
{
    int y = 100000;
    int collisions = 0;

    //Dictionary<string, string> list = new Dictionary<string, string>(y, new GuidStringHashing());
    Dictionary<string, string> list = new Dictionary<string, string>(y);

    for (int x = 0; x < 5; x++)
    {
        Enumerable.Range(1, y).ToList().ForEach((h) =>
        {
            list[Guid.NewGuid().ToString()] = h.ToString();
        });

        var hashDuplicates = list.Keys.GroupBy(h => h.GetHashCode())
                                      .Where(group => group.Count() > 1)
                                      .Select(group => group.Key).ToList();

        hashDuplicates.ToList().ForEach(v => Debug.WriteLine(x + "--- " + v));
        collisions += hashDuplicates.Count();
        list.Clear();
    }

    Assert.AreEqual(0, collisions);
}
public class GuidStringHashing : IEqualityComparer<string>
{
    public bool Equals(string x, string y)
    {
        return GetHashCode(x) == GetHashCode(y);
    }

    public int GetHashCode(string obj)
    {
        return obj.GetHashCode();
    }
}

Your test is broken.
Because your equality comparer incorrectly reports that two different GUIDs that happen to have the same hash are equal, your dictionary never stores the collisions in the first place.
Due to the pigeonhole principle, it is fundamentally impossible to create a 32-bit perfect hash for more than 2^32 items.
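For reference, a comparer that keeps the custom hashing but compares the actual string values might look like the following sketch; with Equals fixed this way, colliding keys are stored as separate entries and the test's collision count behaves as the GroupBy reports.

public class GuidStringComparer : IEqualityComparer<string>
{
    public bool Equals(string x, string y)
    {
        // Compare the keys themselves, not their hash codes.
        return string.Equals(x, y, StringComparison.Ordinal);
    }

    public int GetHashCode(string obj)
    {
        return obj.GetHashCode();
    }
}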

It's impossible. You want a perfect hash function for an unknown set of keys. You can create perfect hash functions for a specific set of keys, but you can't create one perfect hash function that will work on all sets of keys.
The reason is the "Two Jesus Principle", as it was so nicely put by Mark Knopfler: "Two men say they're Jesus; one of them must be wrong." (It's more widely known as the "pigeonhole principle".)

What do you mean by a perfect hash code?
Your code is somewhat confusing, especially because you post a class GuidStringHashing that is not used by your test method.
But your code demonstrates that when you make 100,000 GUIDs, convert them all to strings, and then take the hash code of the strings, then it happens quite often that not all hash codes are distinct. That might be surprising when there are more than 4 billion integers to choose between, and you only generate 100,000 strings.
You're using the GetHashCode() for general strings, but your strings are not too general, they're all something like
"2315c2a7-7d29-42b1-9696-fe6a9dd72ffd"
so maybe your hash code is not optimal. It's better to parse the strings h back to a GUID and use the hash code of that, as in (new Guid(h)).GetHashCode().
However this still gives collisions with 100,000 GUIDs. I think what you're seeing is just the birthday paradox.
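As a rough back-of-the-envelope check (assuming the hash behaves like a uniform random 32-bit value, which is only an approximation): with n = 100,000 keys the expected number of colliding pairs is about n(n-1) / (2 * 2^32) ≈ 1.2, and the probability of seeing at least one collision is roughly 1 - exp(-1.2) ≈ 69%. So a handful of duplicate hash codes per batch is exactly what the birthday paradox predicts, not a sign of a broken hash function.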
Try this simpler code. Here I use GetHashCode() on the GUIDs themselves, so we expect the integers to be quite random:
var set = new HashSet<int>();
for (int i = 1; true; ++i)
{
    if (!set.Add(Guid.NewGuid().GetHashCode()))
        Console.WriteLine("Collision, i is: " + i);
}
We see (by running the above code many times) that a collision almost always happens before 100,000 hash codes have been calculated.

Related

how to force HashSet to rehash members?

In this situation where one member is edited to become equal to another, what is the proper way to force the HashSet to recalculate hashes and thereby purge itself of duplicates?
I knew better than to expect this to happen automatically, so I tried such things as intersecting the HashSet with itself, then reassigning it to a constructor call which refers to itself and the same EqualityComparer. I thought for sure the latter would work, but no.
One thing which does succeed is reconstructing the HashSet from its conversion to some other container type such as List, rather than directly from itself.
Class defs:
public class Test
{
    public int N;
    public override string ToString() { return this.N.ToString(); }
}

public class TestClassEquality : IEqualityComparer<Test>
{
    public bool Equals(Test x, Test y) { return x.N == y.N; }
    public int GetHashCode(Test obj) { return obj.N.GetHashCode(); }
}
Test code:
TestClassEquality eq = new TestClassEquality();
HashSet<Test> hs = new HashSet<Test>(eq);
Test a = new Test { N = 1 }, b = new Test { N = 2 };
hs.Add(a);
hs.Add(b);

b.N = 1;

string fmt = "Count = {0}; Values = {1}";
Console.WriteLine(fmt, hs.Count, string.Join(",", hs));

hs.IntersectWith(hs);
Console.WriteLine(fmt, hs.Count, string.Join(",", hs));

hs = new HashSet<Test>(hs, eq);
Console.WriteLine(fmt, hs.Count, string.Join(",", hs));

hs = new HashSet<Test>(new List<Test>(hs), eq);
Console.WriteLine(fmt, hs.Count, string.Join(",", hs));
Output:
"Count: 2; Values: 1,1"
"Count: 2; Values: 1,1"
"Count: 2; Values: 1,1"
"Count: 1; Values: 1"
Based on the final approach succeeding, I could probably create an extension method in which the HashSet dumps itself into a local List, clears itself, and then repopulates from said list.
Is that really necessary or is there some simpler way to do this?
Lasse's comment is correct: you are required by the contract of HashSet to not do this, so asking what to do when you do this is a non-starter. If it hurts when you do that, stop doing that. A mutable object must not be put into a hash set if a mutation will cause its hash value to change while it is in the set. You're in a cleft stick of your own making.
To get out of that cleft stick, you could:
Stop mutating the objects while they are in a hash set. Remove them before you mutate them, put them back in later.
Fix the implementation of equality and hashing on the object so that it is consistent across mutations.
When you create the hash set, provide a custom hashing/equality algorithm that does not change its opinions when the object is mutated (a sketch of this follows the list).
Implement your own "set" class that has whatever behaviour you like in this scenario. That is extremely difficult, so be careful. (There is a reason why this restriction was created in the first place!)
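For the third option, one possible sketch (assuming reference identity is acceptable for your scenario) is a comparer built on RuntimeHelpers.GetHashCode, whose result never changes no matter how the object's fields are mutated:

public class ReferenceComparer<T> : IEqualityComparer<T> where T : class
{
    public bool Equals(T x, T y)
    {
        return ReferenceEquals(x, y);
    }

    public int GetHashCode(T obj)
    {
        // Identity-based hash: stable for the lifetime of the object,
        // regardless of any mutation of its fields.
        return System.Runtime.CompilerServices.RuntimeHelpers.GetHashCode(obj);
    }
}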
There is no other way than recreating the HashSet<>. Sadly the HashSet<> constructor has an optimization: if it is created from another HashSet<>, it copies the hash codes... So we can cheat:
hs = new HashSet<Test>(hs.Skip(0), eq);
The hs.Skip(0) is an IEnumerable<>, not a HashSet<>. This defeats the HashSet<> check.
Note that there is no guarantee that in the future the Skip() won't implement a shortcircuit in case of 0, something like:
if (count == 0)
{
    return enu;
}
else
{
    return count elements;
}
(see Lippert's comment, false problem)
The "manual" method to do it is:
var hs2 = new HashSet<Test>(eq);
foreach (var value in hs)
{
    hs2.Add(value);
}
hs = hs2;
So enumerate "manually" and re-add.
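If you want the extension-method form the question mentions, a minimal sketch might look like this (the name Rehash is made up here):

public static class HashSetExtensions
{
    // Rebuilds the set so every element is rehashed with the given comparer;
    // elements that have become equal under the comparer collapse into one.
    public static HashSet<T> Rehash<T>(this HashSet<T> source, IEqualityComparer<T> comparer)
    {
        var rebuilt = new HashSet<T>(comparer);
        foreach (var item in source)
        {
            rebuilt.Add(item);
        }
        return rebuilt;
    }
}

Usage would then simply be hs = hs.Rehash(eq);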
As you saw, HashSets don't handle mutable objects well when modifying the object affects its hash code or equality to other objects. Just remove it, make the change, and re-add it:
hs.Remove(b);
b.N = 1;
hs.Add(b);

Create Unique Hashcode for the permutation of two Order Ids

I have a collection which is a permutation of two unique orders, where OrderId is unique. Thus it contains Order1 (Id = 1) and Order2 (Id = 2) as both 12 and 21. Now while processing a routing algorithm, a few conditions are checked, and when a combination is included in the final result, its reverse has to be ignored and needn't be considered for processing. Since the Id is an integer, I have created the following logic:
private static int GetPairKey(int firstOrderId, int secondOrderId)
{
    var orderCombinationType = (firstOrderId < secondOrderId)
        ? new { max = secondOrderId, min = firstOrderId }
        : new { max = firstOrderId, min = secondOrderId };

    return (orderCombinationType.min.GetHashCode() ^ orderCombinationType.max.GetHashCode());
}
In the logic, I create a Dictionary<int, int> whose keys are created using the GetPairKey method shown above, where I ensure that the two ids of a given combination are arranged consistently so that both orderings produce the same hash code; the key can then be inserted into and checked against the dictionary, while its value is a dummy and is ignored.
However, the above logic seems to have a flaw and doesn't work as expected for all of the processing. What am I doing wrong in this case? Shall I try something different to create the key? Is something like the following a better choice? Please suggest.
Tuple.Create(minOrderId, maxOrderId).GetHashCode(); the relevant usage is:
foreach (var pair in localSavingPairs)
{
    var firstOrder = pair.FirstOrder;
    var secondOrder = pair.SecondOrder;

    if (processedOrderDictionary.ContainsKey(GetPairKey(firstOrder.Id, secondOrder.Id))) continue;
Adding to the Dictionary is done with the following code (the value 0 is a dummy and is not used):
processedOrderDictionary.Add(GetPairKey(firstOrder.Id, secondOrder.Id), 0);
You need a value that can uniquely represent every possible value.
That is different to a hash-code.
You could uniquely represent each value with a long or with a class or struct that contains all of the appropriate values. Since after a certain total size using long won't work any more, let's look at the other approach, which is more flexible and more extensible:
public class KeyPair : IEquatable<KeyPair>
{
    public int Min { get; private set; }
    public int Max { get; private set; }

    public KeyPair(int first, int second)
    {
        if (first < second)
        {
            Min = first;
            Max = second;
        }
        else
        {
            Min = second;
            Max = first;
        }
    }

    public bool Equals(KeyPair other)
    {
        return other != null && other.Min == Min && other.Max == Max;
    }

    public override bool Equals(object other)
    {
        return Equals(other as KeyPair);
    }

    public override int GetHashCode()
    {
        return unchecked(Max * 31 + Min);
    }
}
Now, the GetHashCode() here will not be unique, but the KeyPair itself will be. Ideally the hashcodes will be very different to each other to better distribute these objects, but doing much better than the above depends on information about the actual values that will be seen in practice.
The dictionary will use that to find the item, but it will also use Equals to pick between those where the hash code is the same.
(You can experiment with this by having a version for which GetHashCode() always just returns 0. It will have very poor performance because collisions hurt performance and this will always collide, but it will still work).
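For completeness, a usage sketch in the spirit of the question's loop (localSavingPairs and the order properties are taken from the question; a HashSet<KeyPair> replaces the dummy-valued dictionary):

var processedOrders = new HashSet<KeyPair>();

foreach (var pair in localSavingPairs)
{
    var key = new KeyPair(pair.FirstOrder.Id, pair.SecondOrder.Id);

    // Add returns false if this pair (in either order) was already processed.
    if (!processedOrders.Add(key)) continue;

    // ... process the pair ...
}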
First, 42.GetHashCode() returns 42. Second, 1 ^ 2 is identical to 2 ^ 1, so there's really no point in sorting numbers. Third, your "hash" function is very weak and produces a lot of collisions, which is why you're observing the flaws.
There are two options I can think of right now:
Use a slightly "stronger" hash function
Replace your Dictionary<int, int> key with Dictionary<string, int>, with keys being your two sorted numbers separated by whatever character you prefer -- e.g. 56-6472
Given that XOR is commutative (so (a ^ b) will always be the same as (b ^ a)), it seems to me that your ordering is misguided... I'd just use
(new { firstOrderId, secondOrderId }).GetHashCode()
.NET will fix you up with a good, well-distributed hashing implementation for anonymous types.

Cache key construction based on the method name and argument values

I've decided to implement a caching facade in one of our applications - the purpose is to eventually reduce the network overhead and limit the amount of db hits. We are using Castle.Windsor as our IoC Container and we have decided to go with Interceptors to add the caching functionality on top of our services layer using the System.Runtime.Caching namespace.
At this moment I can't exactly figure out what's the best approach for constructing the cache key. The goal is to make a distinction between different methods and also include passed argument values - meaning that these two method calls should be cached under two different keys:
IEnumerable<MyObject> GetMyObjectByParam(56); // key1
IEnumerable<MyObject> GetMyObjectByParam(23); // key2
For now I can see two possible implementations:
Option 1:
assembly | class | method return type | method name | argument types | argument hash codes
"MyAssembly.MyClass IEnumerable<MyObject> GetMyObjectByParam(long) { 56 }";
Option 2:
MD5 or SHA-256 computed hash based on the method's fully-qualified name and passed argument values
string key = Convert.ToBase64String(new SHA256Managed().ComputeHash(Encoding.UTF8.GetBytes(name + args)));
I'm thinking about the first option as the second one requires more processing time - on the other hand the second option enforces exactly the same 'length' of all generated keys.
Is it safe to assume that the first option will generate a unique key for methods using complex argument types? Or maybe there is a completely different way of doing this?
Help and opinions will be highly appreciated!
Based on some very useful links that I've found here and here I've decided to implement it more-or-less like this:
public sealed class CacheKey : IEquatable<CacheKey>
{
    private readonly Type reflectedType;
    private readonly Type returnType;
    private readonly string name;
    private readonly Type[] parameterTypes;
    private readonly object[] arguments;

    public CacheKey(Type reflectedType, Type returnType, string name,
        Type[] parameterTypes, object[] arguments)
    {
        // check for null, incorrect values etc.
        this.reflectedType = reflectedType;
        this.returnType = returnType;
        this.name = name;
        this.parameterTypes = parameterTypes;
        this.arguments = arguments;
    }

    public override bool Equals(object obj)
    {
        return Equals(obj as CacheKey);
    }

    public bool Equals(CacheKey other)
    {
        if (other == null)
        {
            return false;
        }

        // keys built from different signatures can have arrays of different lengths
        if (parameterTypes.Length != other.parameterTypes.Length ||
            arguments.Length != other.arguments.Length)
        {
            return false;
        }

        for (int i = 0; i < parameterTypes.Length; i++)
        {
            if (!parameterTypes[i].Equals(other.parameterTypes[i]))
            {
                return false;
            }
        }

        for (int i = 0; i < arguments.Length; i++)
        {
            if (!arguments[i].Equals(other.arguments[i]))
            {
                return false;
            }
        }

        return reflectedType.Equals(other.reflectedType) &&
               returnType.Equals(other.returnType) &&
               name.Equals(other.name);
    }

    public override int GetHashCode()
    {
        unchecked
        {
            int hash = 17;
            hash = hash * 31 + reflectedType.GetHashCode();
            hash = hash * 31 + returnType.GetHashCode();
            hash = hash * 31 + name.GetHashCode();
            for (int i = 0; i < parameterTypes.Length; i++)
            {
                hash = hash * 31 + parameterTypes[i].GetHashCode();
            }
            for (int i = 0; i < arguments.Length; i++)
            {
                hash = hash * 31 + arguments[i].GetHashCode();
            }
            return hash;
        }
    }
}
Basically it's just a general idea - the above code can be easily rewritten to a more generic version with one collection of Fields - the same rules would have to be applied on each element of the collection. I can share the full code.
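To show how this is meant to be wired up, here is a rough sketch of a Castle DynamicProxy interceptor building the key; the plain Dictionary stands in for the real System.Runtime.Caching store, and the cache-population details are deliberately simplified:

using System;
using System.Collections.Generic;
using System.Linq;
using Castle.DynamicProxy;

public class CachingInterceptor : IInterceptor
{
    private readonly Dictionary<CacheKey, object> cache = new Dictionary<CacheKey, object>();

    public void Intercept(IInvocation invocation)
    {
        // Build the key from the intercepted call's metadata and argument values.
        var key = new CacheKey(
            invocation.TargetType,
            invocation.Method.ReturnType,
            invocation.Method.Name,
            invocation.Method.GetParameters().Select(p => p.ParameterType).ToArray(),
            invocation.Arguments);

        object cached;
        if (cache.TryGetValue(key, out cached))
        {
            invocation.ReturnValue = cached;
            return;
        }

        invocation.Proceed();
        cache[key] = invocation.ReturnValue;
    }
}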
An option you seem to have skipped is using the .NET built-in GetHashCode() function for the string. I'm fairly certain this is what goes on behind the scenes in a C# dictionary with a String as the TKey (I mention that because you've tagged the question with dictionary). I'm not sure how the .NET dictionary class relates to your Castle.Windsor or the System.Runtime.Caching interface you mention.
The reason you wouldn't want to use GetHashCode as a hash key is that Microsoft specifically documents that its behaviour may change between versions without warning (for example, to provide a more unique or faster-executing function). If this cache will live strictly in memory, then this is not a concern, because upgrading the .NET framework would necessitate a restart of your application, wiping the cache.
To clarify, just using the concatenated string (Option 1) should be sufficiently unique. It looks like you've added everything possible to uniquely qualify your methods.
If you end up feeding the String of an MD5 or Sha256 into a dictionary key, the program would probably rehash the string behind the scenes anyways. It's been a while since I read about the inner workings of the Dictionary class. If you leave it as a Dictionary<String, IEnumerable<MyObject>> (as opposed to calling GetHashCode() on the strings yourself using the int return value as the key) then the dictionary should handle collisions of the hash code itself.
Also note that (at least according to a benchmark program run on my machine), MD5 is around 10% faster than SHA1 and twice as fast as SHA256. String.GetHashCode() is around 20 times faster than MD5 (it's not cryptographically secure). Tests were taken for the total time to compute the hashes for the same 100,000 randomly generated strings of length between 32 and 1024 characters. But regardless of the exact numbers, using a cryptographically secure hash function as a key will only slow down your program.
I can post the source code for my comparisons if you like.

How to improve hashing for short strings to avoid collisions?

I am having a problem with hash collisions using short strings in .NET4.
EDIT: I am using the built-in string hashing function in .NET.
I'm implementing a cache using objects that store the direction of a conversion like this
public class MyClass
{
    private string _from;
    private string _to;

    // More code here....

    public MyClass(string from, string to)
    {
        this._from = from;
        this._to = to;
    }

    public override int GetHashCode()
    {
        return string.Concat(this._from, this._to).GetHashCode();
    }

    public bool Equals(MyClass other)
    {
        return this.To == other.To && this.From == other.From;
    }

    public override bool Equals(object obj)
    {
        if (obj == null) return false;
        if (this.GetType() != obj.GetType()) return false;
        return Equals(obj as MyClass);
    }
}
This is direction dependent and the from and to are represented by short strings like "AAB" and "ABA".
I am getting sparse hash collisions with these small strings, I have tried something simple like adding a salt (did not work).
The problem is that too many of my small strings, like "AABABA", collide with the hash of the reversed string "ABAAAB" (note that these are not real examples; I have no idea if "AAB" and "ABA" actually cause collisions!).
I have also gone heavy-duty and implemented MD5 (which works, but is MUCH slower).
I have also implemented the suggestion from Jon Skeet here:
Should I use a concatenation of my string fields as a hash code?
This works but I don't know how dependable it is with my various 3-character strings.
How can I improve and stabilize the hashing of small strings without adding too much overhead like MD5?
EDIT: In response to a few of the answers posted... the cache is implemented using concurrent dictionaries keyed from MyClass as stubbed out above. If I replace the GetHashCode in the code above with something simple like Jon Skeet's code from the link I posted:
int hash = 17;
hash = hash * 23 + this._from.GetHashCode();
hash = hash * 23 + this._to.GetHashCode();
return hash;
Everything functions as expected.
It's also worth noting that in this particular use-case the cache is not used in a multi-threaded environment so there is no race condition.
EDIT: I should also note that this misbehavior is platform dependent. It works as intended on my fully updated Win7 x64 machine but does not behave properly on a non-updated Win7 x64 machine. I don't know the extent of which updates are missing, but I know it doesn't have Win7 SP1... so I would assume there may also be a framework SP or update it's missing as well.
EDIT: As suggested, my issue was not caused by a problem with the hashing function. I had an elusive race condition, which is why it worked on some computers but not others and also why a "slower" hashing method made things work properly. The answer I selected was the most useful in understanding why my problem was not hash collisions in the dictionary.
Are you sure that collisions are what causes your problems? When you say
I finally discovered what was causing this bug
do you mean some slowness of your code, or something else? If not, I'm curious what kind of problem that is, because any hash function (except "perfect" hash functions on limited domains) will produce collisions.
I put together a quick piece of code to check for collisions for 3-letter words, and it doesn't report a single collision for them. You see what I mean? Looks like the built-in hash algorithm is not so bad.
Dictionary<int, bool> set = new Dictionary<int, bool>();
char[] buffer = new char[3];
int count = 0;

for (int c1 = (int)'A'; c1 <= (int)'z'; c1++)
{
    buffer[0] = (char)c1;
    for (int c2 = (int)'A'; c2 <= (int)'z'; c2++)
    {
        buffer[1] = (char)c2;
        for (int c3 = (int)'A'; c3 <= (int)'z'; c3++)
        {
            buffer[2] = (char)c3;
            string str = new string(buffer);
            count++;

            int hash = str.GetHashCode();
            if (set.ContainsKey(hash))
            {
                Console.WriteLine("Collision for {0}", str);
            }
            set[hash] = false;
        }
    }
}

Console.WriteLine("Generated {0} of {1} hashes", set.Count, count);
While you could pick almost any of the well-known hash functions (as David mentioned), or even choose a "perfect" hash since it looks like your domain is limited (like a minimal perfect hash)... it would be great to understand whether the source of the problems really is collisions.
Update
What I want to say is that the .NET built-in hash function for strings is not so bad. It doesn't give so many collisions that you would need to write your own algorithm in regular scenarios. And this doesn't depend on the length of the strings. If you have a lot of 6-symbol strings, that doesn't imply that your chances of seeing a collision are higher than with 1000-symbol strings. This is one of the basic properties of hash functions.
And again, another question is what kind of problems you experience because of collisions. All built-in hashtables and dictionaries support collision resolution. So I would say all you can see is just... probably some slowness. Is this your problem?
As for your code
return string.Concat(this._from, this._to).GetHashCode();
This can cause problems, because on every hash code calculation you create a new string. Maybe this is what causes your issues?
int hash = 17;
hash = hash * 23 + this._from.GetHashCode();
hash = hash * 23 + this._to.GetHashCode();
return hash;
This would be a much better approach, just because you don't create new objects on the heap. Actually that's one of the main points of this approach: getting a good hash code for an object with a complex "key" without creating new objects. So if you don't have a single-value key then this should work for you. BTW, this is not a new hash function, it is just a way to combine existing hash values without compromising the main properties of hash functions.
Any common hash function should be suitable for this purpose. If you're getting collisions on short strings like that, I'd say you're using an unusually bad hash function. You can use Jenkins or Knuth's with no issues.
Here's a very simple hash function that should be adequate. (Implemented in C, but should easily port to any similar language.)
unsigned int hash(const char *it)
{
    unsigned hval = 0;
    while (*it != 0)
    {
        hval += *it++;
        hval += (hval << 10);
        hval ^= (hval >> 6);
        hval += (hval << 3);
        hval ^= (hval >> 11);
        hval += (hval << 15);
    }
    return hval;
}
Note that if you want to trim the bits of the output of this function, you must use the least significant bits. You can also use mod to reduce the output range. The last character of the string tends to only affect the low-order bits. If you need a more even distribution, change return hval; to return hval * 2654435761U;.
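Since the question is about C#, a rough port of the same function might look like the sketch below; each char is treated as its UTF-16 code unit, and the arithmetic is done on uint so it wraps like the unsigned C version.

public static int Hash(string s)
{
    unchecked
    {
        uint hval = 0;
        foreach (char c in s)
        {
            hval += c;
            hval += hval << 10;
            hval ^= hval >> 6;
            hval += hval << 3;
            hval ^= hval >> 11;
            hval += hval << 15;
        }
        return (int)hval;
    }
}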
Update:
public override int GetHashCode()
{
    return string.Concat(this._from, this._to).GetHashCode();
}
This is broken. It treats from="foot",to="ar" as the same as from="foo",to="tar". Since your Equals function doesn't consider those equal, your hash function should not. Possible fixes include:
1) Form the string from + "XXX" + to and hash that. (This assumes the string "XXX" almost never appears in your input strings.)
2) Combine the hash of 'from' with the hash of 'to'. You'll have to use a clever combining function. For example, XOR or sum will cause from="foo",to="bar" to hash the same as from="bar",to="foo". Unfortunately, choosing the right combining function is not easy without knowing the internals of the hashing function. You can try:
int hc1=from.GetHashCode();
int hc2=to.GetHashCode();
return (hc1<<7)^(hc2>>25)^(hc1>>21)^(hc2<<11);

C#: How would you unit test GetHashCode?

Testing the Equals method is pretty much straightforward (as far as I know). But how on earth do you test the GetHashCode method?
Test that two distinct objects which are equal have the same hash code (for various values). Check that non-equal objects give different hash codes, varying one aspect/property at a time. While the hash codes don't have to be different, you'd be really unlucky to pick different values for properties which happen to give the same hash code unless you've got a bug.
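A minimal sketch of that idea in MSTest syntax (Person is a made-up type with value equality on its two properties):

[TestMethod]
public void EqualObjectsHaveEqualHashCodes()
{
    var a = new Person("Alice", 30);
    var b = new Person("Alice", 30);

    Assert.AreEqual(a, b);
    Assert.AreEqual(a.GetHashCode(), b.GetHashCode()); // required by the Equals/GetHashCode contract

    var c = new Person("Alice", 31); // vary one property
    // Not strictly required to differ, but an equal hash here usually points at a bug.
    Assert.AreNotEqual(a.GetHashCode(), c.GetHashCode());
}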
Gallio/MbUnit v3.2 comes with convenient contract verifiers which are able to test your implementation of GetHashCode() and IEquatable<T>. More specifically you may be interested in the EqualityContract and the HashCodeAcceptanceContract. See here, here and there for more details. For example, consider the following simple Spot class:
public class Spot
{
    private readonly int x;
    private readonly int y;

    public Spot(int x, int y)
    {
        this.x = x;
        this.y = y;
    }

    public override int GetHashCode()
    {
        int h = -2128831035;
        h = (h * 16777619) ^ x;
        h = (h * 16777619) ^ y;
        return h;
    }
}
Then you declare your contract verifier like this:
[TestFixture]
public class SpotTest
{
    [VerifyContract]
    public readonly IContract HashCodeAcceptanceTests = new HashCodeAcceptanceContract<Spot>()
    {
        CollisionProbabilityLimit = CollisionProbability.VeryLow,
        UniformDistributionQuality = UniformDistributionQuality.Excellent,
        DistinctInstances = DataGenerators.Join(Enumerable.Range(0, 1000), Enumerable.Range(0, 1000)).Select(o => new Spot(o.First, o.Second))
    };
}
It would be fairly similar to Equals(). You'd want to make sure two objects which were the "same" at least had the same hash code. That means if .Equals() returns true, the hash codes should be identical as well. As far as what the proper hashcode values are, that depends on how you're hashing.
From personal experience: aside from obvious things like the same objects giving you the same hash codes, you need to create a large enough array of unique objects and count the unique hash codes among them. If the unique hash codes make up less than, say, 50% of the overall object count, then you are in trouble, as your hash function is not good.
List<int> hashList = new List<int>(testObjectList.Count);
for (int i = 0; i < testObjectList.Count; i++)
{
    hashList.Add(testObjectList[i].GetHashCode());
}

hashList.Sort();

int differentValues = 0;
int curValue = hashList[0];
for (int i = 1; i < hashList.Count; i++)
{
    if (hashList[i] != curValue)
    {
        differentValues++;
        curValue = hashList[i];
    }
}

Assert.Greater(differentValues, hashList.Count / 2);
In addition to checking that object equality implies equality of hashcodes, and the distribution of hashes is fairly flat as suggested by Yann Trevin (if performance is a concern), you may also wish to consider what happens if you change a property of the object.
Suppose your object changes while it's in a dictionary/hashset. Do you want the Contains(object) to still be true? If so then your GetHashCode had better not depend on the mutable property that was changed.
I would pre-supply a known/expected hash and compare what the result of GetHashCode is.
You create separate instances with the same value and check that the GetHashCode for the instances returns the same value, and that repeated calls on the same instance returns the same value.
That is the only requirement for a hash code to work. To work well the hash codes should of course have a good distribution, but testing for that requires a lot of testing...
