How can I generate a unique hashcode for a string - c#

Is there any function, that gives me the same hashcode for the same string?
I'm having trouble when creating 2 different strings (but with the same content), their hashcode is different and therefore is not correctly used in a Dictionary.
I would like to know what GetHashCode() function the Dictionary uses when the key is a string.
I'm building mine like this:
public override int GetHashCode()
{
String str = "Equip" + Equipment.ToString() + "Destiny" + Destiny.ToString();
return str.GetHashCode();
}
But it's producing different results for every instance that uses this code, despite the content of the string being the same.

Your title asks for one thing (unique hash codes), but your body asks for something different (consistent hash codes).
You claim:
I'm having trouble when creating 2 different strings (but with the same content), their hashcode is different and therefore is not correctly used in a Dictionary.
If the strings genuinely have the same content, that simply won't occur. Your diagnostics are wrong somehow. Check for non-printable characters in your strings, e.g. trailing Unicode "null" characters:
string text1 = "Hello";
string text2 = "Hello\0";
Here text1 and text2 may print the same way in some contexts, but I'd hope they'd have different hash codes.
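One way to check for such hidden characters is to print every UTF-16 code unit of both strings and compare the output. This is only a minimal sketch; `StringDiagnostics` and `Describe` are names invented for this illustration.

```csharp
using System;
using System.Linq;

public static class StringDiagnostics
{
    // Renders every UTF-16 code unit as U+XXXX so invisible characters show up.
    public static string Describe(string s)
    {
        return string.Join(" ", s.Select(c => "U+" + ((int)c).ToString("X4")));
    }
}
```

Calling `StringDiagnostics.Describe("Hello\0")` produces a sequence that ends in `U+0000`, which the plain `"Hello"` does not have.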
Note that hash codes are not guaranteed to be unique and can't be... there are only 2^32 possible hash codes returned from GetHashCode, but more than 2^32 possible different strings.
Also note that the same content is not guaranteed to produce the same hash code on different runs, even of the same executable - you should not be persisting a hash code anywhere. For example, I believe the 32-bit .NET 4 and 64-bit .NET 4 CLRs produce different hash codes for strings. However, your claim that the values aren't being stored correctly in a Dictionary suggests that this is within a single process - where everything should be consistent.
As noted in comments, it's entirely possible that you're overriding Equals incorrectly. I'd also suggest that your approach to building a hash code isn't great. We don't know what the types of Equipment and Destiny are, but I'd suggest you should use something like:
public override int GetHashCode()
{
int hash = 23;
hash = hash * 31 + Equipment.GetHashCode();
hash = hash * 31 + Destiny.GetHashCode();
return hash;
}
That's the approach I usually use for hash codes. Equals would then look something like:
public override bool Equals(object other)
{
// Reference equality check
if (this == other)
{
return true;
}
if (other == null)
{
return false;
}
// Details of this might change depending on your situation; we'd
// need more information
if (other.GetType() != GetType())
{
return false;
}
// Adjust for your type...
Foo otherFoo = (Foo) other;
// You may want to change the equality used here based on the
// types of Equipment and Destiny
return this.Destiny == otherFoo.Destiny &&
this.Equipment == otherFoo.Equipment;
}
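Putting the two methods together, here is a small self-contained sketch. It assumes Equipment and Destiny are strings (the question does not say), and `Route` and `RouteDemo` are hypothetical names used only for illustration.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical type: Equipment and Destiny are assumed to be strings here.
public sealed class Route
{
    public string Equipment { get; }
    public string Destiny { get; }

    public Route(string equipment, string destiny)
    {
        Equipment = equipment ?? throw new ArgumentNullException(nameof(equipment));
        Destiny = destiny ?? throw new ArgumentNullException(nameof(destiny));
    }

    public override bool Equals(object other)
    {
        return other is Route r && Equipment == r.Equipment && Destiny == r.Destiny;
    }

    public override int GetHashCode()
    {
        unchecked // overflow is harmless when computing hash codes
        {
            int hash = 23;
            hash = hash * 31 + Equipment.GetHashCode();
            hash = hash * 31 + Destiny.GetHashCode();
            return hash;
        }
    }
}

public static class RouteDemo
{
    // Looks up a second, separately constructed but equal key.
    public static bool FindsEqualKey()
    {
        var map = new Dictionary<Route, int> { [new Route("E1", "D1")] = 42 };
        return map.TryGetValue(new Route("E1", "D1"), out int value) && value == 42;
    }
}
```

`RouteDemo.FindsEqualKey()` returns true because both instances are equal and therefore produce the same hash code within the process.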

Related

C# ContainsKey does not find Key in dictionary although it is present [duplicate]

After executing this piece of code:
int a = 50;
float b = 50.0f;
Console.WriteLine(a.GetHashCode() == b.GetHashCode());
We get False, which is expected, since we are dealing with different objects, hence we should get different hashes.
However, if we execute this:
int a = 0;
float b = 0.0f;
Console.WriteLine(a.GetHashCode() == b.GetHashCode());
We get True. Both objects return the same hash code: 0.
Why does this happen? Aren't they supposed to return different hashes?
The GetHashCode of System.Int32 works like:
public override int GetHashCode()
{
return this;
}
With this being 0, it returns 0.
System.Single's (float is its C# alias) GetHashCode is:
public unsafe override int GetHashCode()
{
float num = this;
if (num == 0f)
{
return 0;
}
return *(int*)(&num);
}
As you can see, for 0f it returns 0.
The code above was decompiled with ILSpy.
From MSDN Documentation:
Two objects that are equal return hash codes that are equal. However,
the reverse is not true: equal hash codes do not imply object
equality, because different (unequal) objects can have identical hash
codes.
Objects that are conceptually equal are obligated to return the same hashes. Objects that are different are not obligated to return different hashes. That would only be possible if there were fewer than 2^32 objects that could ever possibly exist. There are more than that. When objects that are different result in the same hash it is called a "collision". A quality hash algorithm minimizes collisions as much as possible, but they can never be removed entirely.
Why should they? Hash codes are a finite set; as many as you can fit in an Int32. There are many many doubles that will have the same hash code as any given int or any other given double.
Hash codes basically have to follow two simple rules:
If two objects are equal, they should have the same hash code.
If an object does not mutate its internal state then the hash code should remain the same.
Nothing obliges two objects that are not equal to have different hash codes; it is mathematically impossible.
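To illustrate that collisions are legal and that Equals settles them, here is a deliberately extreme sketch in which every instance returns the same hash code. `CollidingKey` and `CollisionDemo` are hypothetical names made up for this example.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical key type whose hash code is deliberately the same for every instance.
public sealed class CollidingKey
{
    public string Name { get; }
    public CollidingKey(string name) { Name = name; }

    public override bool Equals(object obj) => obj is CollidingKey k && k.Name == Name;
    public override int GetHashCode() => 1; // legal, but every key collides
}

public static class CollisionDemo
{
    public static int CountDistinct()
    {
        var set = new HashSet<CollidingKey>
        {
            new CollidingKey("a"),
            new CollidingKey("b"),
            new CollidingKey("a") // equal to the first, so it is not added again
        };
        return set.Count;
    }
}
```

`CollisionDemo.CountDistinct()` returns 2: the set stays correct, it just has to call Equals on every stored element, so lookups become linear instead of close to constant time.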

C# Hash Function for Dictionary Lookup [duplicate]

Given the following class
public class Foo
{
public int FooId { get; set; }
public string FooName { get; set; }
public override bool Equals(object obj)
{
Foo fooItem = obj as Foo;
if (fooItem == null)
{
return false;
}
return fooItem.FooId == this.FooId;
}
public override int GetHashCode()
{
// Which is preferred?
return base.GetHashCode();
//return this.FooId.GetHashCode();
}
}
I have overridden the Equals method because Foo represent a row for the Foos table. Which is the preferred method for overriding the GetHashCode?
Why is it important to override GetHashCode?
Yes, it is important if your item will be used as a key in a dictionary, or HashSet<T>, etc - since this is used (in the absence of a custom IEqualityComparer<T>) to group items into buckets. If the hash-code for two items does not match, they may never be considered equal (Equals will simply never be called).
The GetHashCode() method should reflect the Equals logic; the rules are:
if two things are equal (Equals(...) == true) then they must return the same value for GetHashCode()
if the GetHashCode() is equal, it is not necessary for them to be the same; this is a collision, and Equals will be called to see if it is a real equality or not.
In this case, it looks like "return FooId;" is a suitable GetHashCode() implementation. If you are testing multiple properties, it is common to combine them using code like below, to reduce diagonal collisions (i.e. so that new Foo(3,5) has a different hash-code to new Foo(5,3)):
In modern frameworks, the HashCode type has methods to help you create a hashcode from multiple values; on older frameworks, you'd need to go without, so something like:
unchecked // only needed if you're compiling with arithmetic checks enabled
{ // (the default compiler behaviour is *disabled*, so most folks won't need this)
int hash = 13;
hash = (hash * 7) + field1.GetHashCode();
hash = (hash * 7) + field2.GetHashCode();
...
return hash;
}
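For the modern-framework route mentioned above, a minimal sketch using `System.HashCode` might look like this. `Pair` and its members are hypothetical, and the type is assumed to target .NET Core 2.1 or later, where `HashCode` is built in.

```csharp
using System;

// Hypothetical two-field type used only for illustration.
public sealed class Pair
{
    public int First { get; }
    public string Second { get; }
    public Pair(int first, string second) { First = first; Second = second; }

    public override bool Equals(object obj) =>
        obj is Pair p && p.First == First && p.Second == Second;

    // HashCode.Combine mixes the field hashes and tolerates a null Second.
    public override int GetHashCode() => HashCode.Combine(First, Second);
}
```

The values produced by `HashCode.Combine` are only stable within one process, so they are suitable for dictionaries and sets but not for persistence.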
Oh - for convenience, you might also consider providing == and != operators when overriding Equals and GetHashCode.
A demonstration of what happens when you get this wrong is here.
It's actually very hard to implement GetHashCode() correctly because, in addition to the rules Marc already mentioned, the hash code should not change during the lifetime of an object. Therefore the fields which are used to calculate the hash code must be immutable.
I finally found a solution to this problem when I was working with NHibernate.
My approach is to calculate the hash code from the ID of the object. The ID can only be set through the constructor so if you want to change the ID, which is very unlikely, you have to create a new object which has a new ID and therefore a new hash code. This approach works best with GUIDs because you can provide a parameterless constructor which randomly generates an ID.
By overriding Equals you're basically stating that you know better how to compare two instances of a given type.
Below you can see an example of how ReSharper writes a GetHashCode() function for you. Note that this snippet is meant to be tweaked by the programmer:
public override int GetHashCode()
{
unchecked
{
var result = 0;
result = (result * 397) ^ m_someVar1;
result = (result * 397) ^ m_someVar2;
result = (result * 397) ^ m_someVar3;
result = (result * 397) ^ m_someVar4;
return result;
}
}
As you can see it just tries to guess a good hash code based on all the fields in the class, but if you know your object's domain or value ranges you could still provide a better one.
Please don't forget to check the obj parameter against null when overriding Equals().
And also compare the type.
public override bool Equals(object obj)
{
Foo fooItem = obj as Foo;
if (fooItem == null)
{
return false;
}
return fooItem.FooId == this.FooId;
}
The reason for this is: Equals must return false on comparison to null. See also http://msdn.microsoft.com/en-us/library/bsc2ak47.aspx
How about:
public override int GetHashCode()
{
return string.Format("{0}_{1}_{2}", prop1, prop2, prop3).GetHashCode();
}
Assuming performance is not an issue :)
As of .NET 4.7 the preferred method of overriding GetHashCode() is shown below. If targeting older .NET versions, include the System.ValueTuple nuget package.
// C# 7.0+
public override int GetHashCode() => (FooId, FooName).GetHashCode();
In terms of performance, this method will outperform most composite hash code implementations. The ValueTuple is a struct so there won't be any garbage, and the underlying algorithm is as fast as it gets.
Just to add on above answers:
If you don't override Equals then the default behavior is that references of the objects are compared. The same applies to the hash code: the default implementation is based on the object's identity, not on its content.
Because you did override Equals it means the correct behavior is to compare whatever you implemented on Equals and not the references, so you should do the same for the hashcode.
Clients of your class will expect the hash code to follow the same logic as the Equals method. For example, LINQ methods that use an IEqualityComparer first compare the hash codes, and only if those are equal do they call Equals(), which might be more expensive to run. If you don't implement the hash code, equal objects will probably have different hash codes (because they are different instances) and will wrongly be treated as not equal (Equals() won't even be called).
Besides the problem that you might not be able to find your object in a dictionary (it was inserted under one hash code, and when you look it up the default hash code will probably be different, so again Equals() won't even be called, as Marc Gravell explains in his answer), you also violate the concept of a dictionary or hash set, which should not allow identical keys.
You already declared that those objects are essentially the same when you overrode Equals, so you don't want both of them as different keys in a data structure that is supposed to have unique keys. But because they have different hash codes, the "same" key will be inserted as a different one.
It is because the framework requires that two objects that are the same must have the same hashcode. If you override the equals method to do a special comparison of two objects and the two objects are considered the same by the method, then the hash code of the two objects must also be the same. (Dictionaries and Hashtables rely on this principle).
We have two problems to cope with.
1) You cannot provide a sensible GetHashCode() if any field in the object can be changed. Also, often an object will NEVER be used in a collection that depends on GetHashCode(). So the cost of implementing GetHashCode() is often not worth it, or it is not possible.
2) If someone puts your object in a collection that calls GetHashCode() and you have overridden Equals() without also making GetHashCode() behave correctly, that person may spend days tracking down the problem.
Therefore, by default I do this:
public class Foo
{
public int FooId { get; set; }
public string FooName { get; set; }
public override bool Equals(object obj)
{
Foo fooItem = obj as Foo;
if (fooItem == null)
{
return false;
}
return fooItem.FooId == this.FooId;
}
public override int GetHashCode()
{
// Some comment to explain if there is a real problem with providing GetHashCode()
// or if I just don't see a need for it for the given class
throw new Exception("Sorry I don't know what GetHashCode should do for this class");
}
}
Hash code is used for hash-based collections like Dictionary, Hashtable, HashSet etc. The purpose of this code is to very quickly pre-sort specific object by putting it into specific group (bucket). This pre-sorting helps tremendously in finding this object when you need to retrieve it back from hash-collection because code has to search for your object in just one bucket instead of in all objects it contains. The better distribution of hash codes (better uniqueness) the faster retrieval. In ideal situation where each object has a unique hash code, finding it is an O(1) operation. In most cases it approaches O(1).
It's not necessarily important; it depends on the size of your collections and your performance requirements and whether your class will be used in a library where you may not know the performance requirements. I frequently know my collection sizes are not very large and my time is more valuable than a few microseconds of performance gained by creating a perfect hash code; so (to get rid of the annoying warning by the compiler) I simply use:
public override int GetHashCode()
{
return base.GetHashCode();
}
(Of course I could use a #pragma to turn off the warning as well but I prefer this way.)
When you are in the position that you do need the performance, then all of the issues mentioned by others here apply, of course. Most important, because otherwise you will get wrong results when retrieving items from a hash set or dictionary: the hash code must not vary during the lifetime of an object (more accurately, during the time the hash code is needed, such as while the object is a key in a dictionary). For example, the following is wrong because Value is public and can be changed from outside the class during the lifetime of the instance, so you must not use it as the basis for the hash code:
class A
{
public int Value;
public override int GetHashCode()
{
return Value.GetHashCode(); //WRONG! Value is not constant during the instance's life time
}
}
On the other hand, if Value can't be changed it's ok to use:
class A
{
public readonly int Value;
public override int GetHashCode()
{
return Value.GetHashCode(); //OK Value is read-only and can't be changed during the instance's life time
}
}
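The effect of the mutable variant can be reproduced in a few lines. This is a sketch under the assumption of the standard `HashSet<T>` implementation, which stores each element's hash code at insertion time; `MutableKey` and `MutableKeyDemo` are hypothetical names.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical type mirroring the "WRONG" example: the hash depends on a mutable field.
public sealed class MutableKey
{
    public int Value;
    public override bool Equals(object obj) => obj is MutableKey k && k.Value == Value;
    public override int GetHashCode() => Value.GetHashCode();
}

public static class MutableKeyDemo
{
    public static bool StillFoundAfterMutation()
    {
        var key = new MutableKey { Value = 1 };
        var set = new HashSet<MutableKey> { key };
        key.Value = 2;            // the hash code changes while the key is stored
        return set.Contains(key); // the set still remembers the old hash code
    }
}
```

Here `StillFoundAfterMutation()` returns false even though the very same instance is still inside the set, which is exactly the kind of bug the read-only version avoids.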
You should always guarantee that if two objects are equal, as defined by Equals(), they return the same hash code. As some of the other comments state, in theory this is not mandatory if the object will never be used in a hash-based container like HashSet or Dictionary. I would advise you to always follow this rule though. The reason is simply that it is way too easy for someone to change a collection from one type to another with the good intention of actually improving the performance or just conveying the code semantics in a better way.
For example, suppose we keep some objects in a List. Sometime later someone actually realizes that a HashSet is a much better alternative because of the better search characteristics for example. This is when we can get into trouble. List would internally use the default equality comparer for the type which means Equals in your case while HashSet makes use of GetHashCode(). If the two behave differently, so will your program. And bear in mind that such issues are not the easiest to troubleshoot.
I've summarized this behavior with some other GetHashCode() pitfalls in a blog post where you can find further examples and explanations.
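The List versus HashSet difference described above can be sketched as follows. `BrokenId` is a hypothetical type that overrides Equals but intentionally leaves GetHashCode alone; the compiler warning for that (CS0659) is suppressed on purpose.

```csharp
using System;
using System.Collections.Generic;

#pragma warning disable CS0659 // Equals overridden without GetHashCode, on purpose
// Hypothetical type that breaks the rule described above.
public sealed class BrokenId
{
    public int Id { get; }
    public BrokenId(int id) { Id = id; }
    public override bool Equals(object obj) => obj is BrokenId b && b.Id == Id;
}
#pragma warning restore CS0659

public static class ContainerSwapDemo
{
    public static bool ListFinds()
    {
        var list = new List<BrokenId> { new BrokenId(7) };
        return list.Contains(new BrokenId(7)); // List<T> only calls Equals
    }

    public static bool SetFinds()
    {
        var set = new HashSet<BrokenId> { new BrokenId(7) };
        // Almost always false: the two instances get unrelated default hash codes.
        return set.Contains(new BrokenId(7));
    }
}
```

`ListFinds()` returns true, while `SetFinds()` will in practice return false, so swapping the container silently changes the program's behavior.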
As of C# 9 (.NET 5), you may want to use records, as they provide value-based equality by default.
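A minimal sketch of the record approach, assuming C# 9 or later; `FooRecord` and `RecordDemo` are hypothetical names chosen to mirror the Foo class from the question.

```csharp
using System;

// Hypothetical positional record; the compiler generates Equals and GetHashCode.
public record FooRecord(int FooId, string FooName);

public static class RecordDemo
{
    public static bool EqualRecordsShareHash()
    {
        var a = new FooRecord(1, "first");
        var b = new FooRecord(1, "first");
        return a.Equals(b) && a.GetHashCode() == b.GetHashCode();
    }
}
```

Note that the generated equality compares all members, so two records with the same FooId but different FooName are not equal. That differs from the hand-written Equals in the question, which compares only FooId.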
It's my understanding that the original GetHashCode() returns the memory address of the object, so it's essential to override it if you wish to compare two different objects.
EDITED:
That was incorrect, the original GetHashCode() method cannot assure the equality of 2 values. Though objects that are equal return the same hash code.
Using reflection over the public properties, as below, seems to me a better option, because you don't have to worry about properties being added or removed (although that is not such a common scenario). I also found this to perform better (timed with System.Diagnostics.Stopwatch).
public override int GetHashCode()
{
PropertyInfo[] theProperties = this.GetType().GetProperties();
int hash = 31;
foreach (PropertyInfo info in theProperties)
{
if (info != null)
{
var value = info.GetValue(this,null);
if(value != null)
unchecked
{
hash = 29 * hash ^ value.GetHashCode();
}
}
}
return hash;
}

GetHashCode() Returning Different Values For Identical Object Values

I was attempting to use the GetHashCode() value to determine if an object has changed after it's been validated via ajax calls in an ASP.NET MVC application. However, I noticed that this did not work because the hash code value when returned during validation would be different than the hash code generated when the object was created again from the model binding with the same values in another request after the validation request. I was able to solve this problem by creating a SHA hash instead, but I'm curious on why I was seeing this behavior.
I know that hash codes generated from GetHashCode() should not be persisted and can differ on different platforms and over time. I thought that the time period was short enough when I first came up with this idea since these two calls were made in milliseconds of each other and when debugging I confirmed that the model contained the exact same values, but still produced a different hash code.
I'm curious about why this behavior is exhibited. Why would this happen even though this is a single run of the application, albeit a web application? Does this have to do with the ASP.NET life cycle?
In case needed here's the class & GetHashCode implementation I was using:
class DispositionSubmission
{
[Display(Name = "Client")]
[Required(AllowEmptyStrings = false, ErrorMessage = "Client is required.")]
public string ClientId { get; set; }
public string Carrier { get; set; }
public Dictionary<string, string> DispositionInfo { get; set; }
public DispositionType Type { get; set; } //int based enum
...
public override int GetHashCode()
{
unchecked
{
int hash = (int)15485863;
int bigPrime = (int)15485867;
hash = hash * bigPrime ^ ClientId.GetHashCode();
hash = hash * bigPrime ^ (Carrier ?? "").GetHashCode();
hash = hash * bigPrime ^ DispositionInfo.GetHashCode();
hash = hash * bigPrime ^ Type.GetHashCode();
return hash;
}
}
}
DispositionInfo does not have a type that overrides GetHashCode(). Two identical dictionaries with the same objects in them will have different hash codes.
You will need to adjust your GetHashCode() to either not include the dictionary or make it more complex to get the hash code of each key and value in the dictionary and add them up.
GetHashCode will return the same result for the exact same object. If the object has been reallocated, it doesn't matter if you have identical values in all fields, you will get a different result. This is because what you're really using is Object.GetHashCode(), which knows nothing about its other fields anyway.
This behavior is important because if you're using the hash as a way to refer to the object, changing any of its values would make it impossible to reference again.
If you want to have behavior where objects with identical fields have the same hash code, you'll need to implement it yourself.
Edit: To clarify: DispositionInfo, the dictionary, specifically exhibits this behavior. The other fields do not, because they are designed to be immutable (string, int, etc.) Consider getting the hash a different way, or overriding GetHashCode with a custom class that inherits from Dictionary<string, string>.
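One possible way to hash the dictionary by content, as suggested above, is to combine a per-entry hash for every key/value pair with addition, so the result does not depend on enumeration order. This is only a sketch; `DictionaryHashing` and `ContentHash` are hypothetical names, and the result is still only stable within a single process.

```csharp
using System;
using System.Collections.Generic;

public static class DictionaryHashing
{
    // Order-independent hash over the entries of a string-to-string dictionary.
    public static int ContentHash(IDictionary<string, string> dictionary)
    {
        if (dictionary == null)
        {
            return 0;
        }
        unchecked // overflow is harmless when computing hash codes
        {
            int hash = 0;
            foreach (KeyValuePair<string, string> pair in dictionary)
            {
                int entry = 17;
                entry = entry * 31 + pair.Key.GetHashCode();
                entry = entry * 31 + (pair.Value == null ? 0 : pair.Value.GetHashCode());
                hash += entry; // addition is commutative, so order does not matter
            }
            return hash;
        }
    }
}
```

In the class from the question, `DispositionInfo.GetHashCode()` could then be replaced by `DictionaryHashing.ContentHash(DispositionInfo)`, with a matching content comparison in Equals.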

Cache key construction based on the method name and argument values

I've decided to implement a caching facade in one of our applications - the purpose is to eventually reduce the network overhead and limit the amount of db hits. We are using Castle.Windsor as our IoC Container and we have decided to go with Interceptors to add the caching functionality on top of our services layer using the System.Runtime.Caching namespace.
At this moment I can't exactly figure out what's the best approach for constructing the cache key. The goal is to make a distinction between different methods and also include passed argument values - meaning that these two method calls should be cached under two different keys:
IEnumerable<MyObject> GetMyObjectByParam(56); // key1
IEnumerable<MyObject> GetMyObjectByParam(23); // key2
For now I can see two possible implementations:
Option 1:
assembly | class | method return type | method name | argument types | argument hash codes
"MyAssembly.MyClass IEnumerable<MyObject> GetMyObjectByParam(long) { 56 }";
Option 2:
MD5 or SHA-256 computed hash based on the method's fully-qualified name and passed argument values
string key = Convert.ToBase64String(new SHA256Managed().ComputeHash(Encoding.UTF8.GetBytes(name + args)));
I'm thinking about the first option as the second one requires more processing time - on the other hand the second option enforces exactly the same 'length' of all generated keys.
Is it safe to assume that the first option will generate a unique key for methods using complex argument types? Or maybe there is a completely different way of doing this?
Help and opinions will be highly appreciated!
Based on some very useful links that I've found here and here I've decided to implement it more-or-less like this:
public sealed class CacheKey : IEquatable<CacheKey>
{
private readonly Type reflectedType;
private readonly Type returnType;
private readonly string name;
private readonly Type[] parameterTypes;
private readonly object[] arguments;
public CacheKey(Type reflectedType, Type returnType, string name,
Type[] parameterTypes, object[] arguments)
{
// check for null, incorrect values etc.
this.reflectedType = reflectedType;
this.returnType = returnType;
this.name = name;
this.parameterTypes = parameterTypes;
this.arguments = arguments;
}
public override bool Equals(object obj)
{
return Equals(obj as CacheKey);
}
public bool Equals(CacheKey other)
{
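if (other == null)
{
return false;
}
// Arrays of different lengths can never match; this also guards the indexing below
if (parameterTypes.Length != other.parameterTypes.Length ||
arguments.Length != other.arguments.Length)
{
return false;
}
for (int i = 0; i < parameterTypes.Length; i++)
{
if (!parameterTypes[i].Equals(other.parameterTypes[i]))
{
return false;
}
}
for (int i = 0; i < arguments.Length; i++)
{
if (!arguments[i].Equals(other.arguments[i]))
{
return false;
}
}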
return reflectedType.Equals(other.reflectedType) &&
returnType.Equals(other.returnType) &&
name.Equals(other.name);
}
public override int GetHashCode()
{
unchecked
{
int hash = 17;
hash = hash * 31 + reflectedType.GetHashCode();
hash = hash * 31 + returnType.GetHashCode();
hash = hash * 31 + name.GetHashCode();
for (int i = 0; i < parameterTypes.Length; i++)
{
hash = hash * 31 + parameterTypes[i].GetHashCode();
}
for (int i = 0; i < arguments.Length; i++)
{
hash = hash * 31 + arguments[i].GetHashCode();
}
return hash;
}
}
}
Basically it's just a general idea - the above code can be easily rewritten to a more generic version with one collection of Fields - the same rules would have to be applied on each element of the collection. I can share the full code.
An option you seem to have skipped is using the .NET built in GetHashCode() function for the string. I'm fairly certain this is what would go on behind the scenes in a C# dictionary with a String as the <TKey> (I mention that because you've tagged the question with dictionary). I'm not sure how the .NET dictionary class relates to your Castle.Windsor or the system.runtime.caching interface you mention.
The reason you wouldn't want to persist GetHashCode values as keys is that Microsoft explicitly reserves the right to change the algorithm between versions without warning (for example, to provide a better-distributed or faster function). If this cache will live strictly in memory, then this is not a concern, because upgrading the .NET Framework would necessitate a restart of your application, wiping the cache.
To clarify, just using the concatenated string (Option 1) should be sufficiently unique. It looks like you've added everything possible to uniquely qualify your methods.
If you end up feeding the String of an MD5 or Sha256 into a dictionary key, the program would probably rehash the string behind the scenes anyways. It's been a while since I read about the inner workings of the Dictionary class. If you leave it as a Dictionary<String, IEnumerable<MyObject>> (as opposed to calling GetHashCode() on the strings yourself using the int return value as the key) then the dictionary should handle collisions of the hash code itself.
Also note that (at least according to a benchmark program run on my machine), MD5 is around 10% faster than SHA1 and twice as fast as SHA256. String.GetHashCode() is around 20 times faster than MD5 (it's not cryptographically secure). Tests were taken for the total time to compute the hashes for the same 100,000 randomly generated strings of length between 32 and 1024 characters. But regardless of the exact numbers, using a cryptographically secure hash function as a key will only slow down your program.
I can post the source code for my comparisons if you like.
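The two options from the question can be sketched side by side. `CacheKeys`, `Readable` and `Hashed` are hypothetical names, and the sketch assumes every argument has a ToString() that distinguishes its values, which complex argument types would have to provide themselves.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

public static class CacheKeys
{
    // Option 1: a readable key made of the method name and the argument values.
    public static string Readable(string qualifiedMethodName, params object[] args)
    {
        return qualifiedMethodName + "(" + string.Join(", ", args) + ")";
    }

    // Option 2: a fixed-length key, the SHA-256 digest of the readable key in hex.
    public static string Hashed(string qualifiedMethodName, params object[] args)
    {
        byte[] bytes = Encoding.UTF8.GetBytes(Readable(qualifiedMethodName, args));
        using (SHA256 sha = SHA256.Create())
        {
            return BitConverter.ToString(sha.ComputeHash(bytes)).Replace("-", "");
        }
    }
}
```

`CacheKeys.Readable("MyClass.GetMyObjectByParam", 56)` and the same call with 23 give two different keys, and `Hashed` always returns 64 hexadecimal characters, which matches the fixed-length property mentioned in the question.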

How to improve hashing for short strings to avoid collisions?

I am having a problem with hash collisions using short strings in .NET4.
EDIT: I am using the built-in string hashing function in .NET.
I'm implementing a cache using objects that store the direction of a conversion like this
public class MyClass
{
private string _from;
private string _to;
// More code here....
public MyClass(string from, string to)
{
this._from = from;
this._to = to;
}
public override int GetHashCode()
{
return string.Concat(this._from, this._to).GetHashCode();
}
public bool Equals(MyClass other)
{
return this.To == other.To && this.From == other.From;
}
public override bool Equals(object obj)
{
if (obj == null) return false;
if (this.GetType() != obj.GetType()) return false;
return Equals(obj as MyClass);
}
}
This is direction dependent and the from and to are represented by short strings like "AAB" and "ABA".
I am getting sparse hash collisions with these small strings, I have tried something simple like adding a salt (did not work).
The problem is that too many of my small strings, like "AABABA", collide with the hash of the reverse direction, "ABAAAB". (Note that these are not real examples; I have no idea if AAB and ABA actually cause collisions!)
and I have gone heavy duty like implementing MD5 (which works, but is MUCH slower)
I have also implemented the suggestion from Jon Skeet here:
Should I use a concatenation of my string fields as a hash code?
This works but I don't know how dependable it is with my various 3-character strings.
How can I improve and stabilize the hashing of small strings without adding too much overhead like MD5?
EDIT: In response to a few of the answers posted... the cache is implemented using concurrent dictionaries keyed from MyClass as stubbed out above. If I replace the GetHashCode in the code above with something simple like @JonSkeet's code from the link I posted:
int hash = 17;
hash = hash * 23 + this._from.GetHashCode();
hash = hash * 23 + this._to.GetHashCode();
return hash;
Everything functions as expected.
It's also worth noting that in this particular use-case the cache is not used in a multi-threaded environment so there is no race condition.
EDIT: I should also note that this misbehavior is platform dependent. It works as intended on my fully updated Win7x64 machine but does not behave properly on a non-updated Win7x64 machine. I don't know the extent of the missing updates, but I know it doesn't have Win7 SP1... so I would assume there may also be a framework SP or update it's missing as well.
EDIT: As suggested, my issue was not caused by a problem with the hashing function. I had an elusive race condition, which is why it worked on some computers but not others and also why a "slower" hashing method made things work properly. The answer I selected was the most useful in understanding why my problem was not hash collisions in the dictionary.
Are you sure that collisions are what causes the problems? When you say
I finally discovered what was causing this bug
do you mean some slowness of your code, or something else? If not, I'm curious what kind of problem it is, because any hash function (except "perfect" hash functions on limited domains) will cause collisions.
I put together a quick piece of code to check for collisions among 3-letter words, and it doesn't report a single collision. You see what I mean? It looks like the built-in hash algorithm is not so bad.
Dictionary<int, bool> set = new Dictionary<int, bool>();
char[] buffer = new char[3];
int count = 0;
for (int c1 = (int)'A'; c1 <= (int)'z'; c1++)
{
buffer[0] = (char)c1;
for (int c2 = (int)'A'; c2 <= (int)'z'; c2++)
{
buffer[1] = (char)c2;
for (int c3 = (int)'A'; c3 <= (int)'z'; c3++)
{
buffer[2] = (char)c3;
string str = new string(buffer);
count++;
int hash = str.GetHashCode();
if (set.ContainsKey(hash))
{
Console.WriteLine("Collision for {0}", str);
}
set[hash] = false;
}
}
}
Console.WriteLine("Generated {0} of {1} hashes", set.Count, count);
While you could pick almost any of well-known hash functions (as David mentioned) or even choose a "perfect" hash since it looks like your domain is limited (like minimum perfect hash)... It would be great to understand if the source of problems are really collisions.
Update
What I want to say is that the .NET built-in hash function for strings is not so bad. It doesn't produce so many collisions that you would need to write your own algorithm in regular scenarios. And this doesn't depend on the length of the strings. Having a lot of 6-character strings doesn't imply that your chances of seeing a collision are higher than with 1000-character strings. This is one of the basic properties of hash functions.
And again, another question is what kind of problems you experience because of collisions. All built-in hashtables and dictionaries support collision resolution. So I would say all you can see is just... probably some slowness. Is this your problem?
As for your code
return string.Concat(this._from, this._to).GetHashCode();
This can cause problems. Because on every hash code calculation you create a new string. Maybe this is what causes your issues?
int hash = 17;
hash = hash * 23 + this._from.GetHashCode();
hash = hash * 23 + this._to.GetHashCode();
return hash;
This would be much better approach - just because you don't create new objects on the heap. Actually it's one of the main points of this approach - get a good hash code of an object with a complex "key" without creating new objects. So if you don't have a single value key then this should work for you. BTW, this is not a new hash function, this is just a way to combine existing hash values without compromising main properties of hash functions.
Any common hash function should be suitable for this purpose. If you're getting collisions on short strings like that, I'd say you're using an unusually bad hash function. You can use Jenkins or Knuth's with no issues.
Here's a very simple hash function that should be adequate. (Implemented in C, but should easily port to any similar language.)
unsigned int hash(const char *it)
{
unsigned hval=0;
while(*it!=0)
{
hval+=*it++;
hval+=(hval<<10);
hval^=(hval>>6);
hval+=(hval<<3);
hval^=(hval>>11);
hval+=(hval<<15);
}
return hval;
}
Note that if you want to trim the bits of the output of this function, you must use the least significant bits. You can also use mod to reduce the output range. The last character of the string tends to only affect the low-order bits. If you need a more even distribution, change return hval; to return hval * 2654435761U;.
Update:
public override int GetHashCode()
{
return string.Concat(this._from, this._to).GetHashCode();
}
This is broken. It treats from="foot",to="ar" as the same as from="foo",to="tar". Since your Equals function doesn't consider those equal, your hash function should not. Possible fixes include:
1) Form the string from,"XXX",to and hash that. (This assumes the string "XXX" almost never appears in your input strings.)
2) Combine the hash of 'from' with the hash of 'to'. You'll have to use a clever combining function. For example, XOR or sum will cause from="foo",to="bar" to hash the same as from="bar",to="foo". Unfortunately, choosing the right combining function is not easy without knowing the internals of the hashing function. You can try:
int hc1=from.GetHashCode();
int hc2=to.GetHashCode();
return (hc1<<7)^(hc2>>25)^(hc1>>21)^(hc2<<11);
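The ambiguity of the concatenation approach is easy to reproduce. In this sketch, `ConcatAmbiguity` and its method names are hypothetical, and the combining function is the multiply-and-add pattern quoted in the question rather than the shift-and-XOR variant above.

```csharp
using System;

public static class ConcatAmbiguity
{
    // Naive approach from the question: hash of the concatenated fields.
    public static int ConcatHash(string from, string to)
    {
        return string.Concat(from, to).GetHashCode();
    }

    // Combines the two field hashes instead, so the field boundary matters.
    public static int CombinedHash(string from, string to)
    {
        unchecked // overflow is harmless when computing hash codes
        {
            int hash = 17;
            hash = hash * 23 + from.GetHashCode();
            hash = hash * 23 + to.GetHashCode();
            return hash;
        }
    }
}
```

`ConcatHash("foot", "ar")` and `ConcatHash("foo", "tar")` are always equal because both hash the string "footar", whereas `CombinedHash` will normally tell the two pairs apart (a collision is still possible, just no longer guaranteed).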
