Can I use GetHashCode() for all string compares? - c#

I want to cache some search results based on the object to search and some search settings.
However, this creates quite a long cache key, so I thought I'd create a shortcut for it using GetHashCode().
So I was wondering: does GetHashCode() always generate a different number, even when the strings are very long or differ only by something like 'ä' instead of 'a'?
I tried some strings and the answer seemed to be yes, but since I don't understand how GetHashCode() behaves, I can't be confident that I am right.
And because this is one of those things that will pop up when you're not prepared (the client is looking at cached results for the wrong search), I want to be sure...
EDIT: if MD5 would work, I can of course change my code not to use GetHashCode. The goal is to get a shorter string than the original (> 1000 chars).

You CANNOT count on GetHashCode() being unique.
There is an excellent article which investigates the likelihood of collisions available at http://kenneththorman.blogspot.com/2010/09/c-net-equals-and-gethashcode.html . The findings were that "The smallest number of calls to GetHashCode() to return the same hashcode for a different string was after 565 iterations and the highest number of iterations before getting a hashcode collision was 296390 iterations. "
So that you can understand the contract for GetHashCode implementations, the following is an excerpt from MSDN documentation for Object.GetHashCode():
A hash function must have the following properties:
If two objects compare as equal, the GetHashCode method for each object must return the same value. However, if two objects do not compare as equal, the GetHashCode methods for the two object do not have to return different values.
The GetHashCode method for an object must consistently return the same hash code as long as there is no modification to the object state that determines the return value of the object's Equals method. Note that this is true only for the current execution of an application, and that a different hash code can be returned if the application is run again.
For the best performance, a hash function must generate a random distribution for all input.
Eric Lippert of the C# compiler team explains the rationale for the GetHashCode implementation rules on his blog at http://ericlippert.com/2011/02/28/guidelines-and-rules-for-gethashcode/ .

Logically, GetHashCode cannot be unique, since there are only 2^32 ints and an infinite number of strings (see the pigeonhole principle).
As @Henk pointed out in the comments, even though there are an infinite number of strings, there are only a finite number of System.Strings. However, the pigeonhole principle still applies, as the latter is much larger than int.MaxValue.
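To see the pigeonhole argument in practice, here is a minimal sketch that keeps hashing generated strings until two different ones share a hash code. The class name CollisionDemo, the "s" + i naming scheme and the maxAttempts limit are made up for illustration; which pair collides, and after how many attempts, depends on the runtime and can change between runs.

```csharp
using System;
using System.Collections.Generic;

public static class CollisionDemo
{
    // Hashes "s0", "s1", ... until two different strings share a hash code.
    // maxAttempts is only a safety limit for this sketch.
    public static (string First, string Second)? FindCollision(int maxAttempts = 5_000_000)
    {
        var seen = new Dictionary<int, string>();
        for (int i = 0; i < maxAttempts; i++)
        {
            string candidate = "s" + i;
            int hash = candidate.GetHashCode();
            if (seen.TryGetValue(hash, out string earlier))
            {
                return (earlier, candidate); // different strings, same hash code
            }
            seen[hash] = candidate;
        }
        return null; // no collision found within the limit
    }

    public static void Main()
    {
        var pair = FindCollision();
        if (pair.HasValue)
        {
            Console.WriteLine(
                $"\"{pair.Value.First}\" and \"{pair.Value.Second}\" share hash code {pair.Value.First.GetHashCode()}");
        }
    }
}
```

With a reasonably distributed 32-bit hash, a collision typically shows up after tens of thousands of strings, in line with the iteration counts quoted from the article above.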

If one were to store the hash code of each string along with the string itself, one could compare the hash codes of strings as a "first step" to comparing them for equality. If two strings have different hash codes, they're not equal, and one needn't bother doing anything else. If one expects to be comparing many pairs of strings which are of the same length, and which are "almost" but not quite equal, checking the hash codes before checking the content may be a useful performance optimization. Note that this "optimization" would not be worthwhile if one did not have cached hash codes, since computing the hash codes of two strings would almost certainly be slower than comparing them. If, however, one has had to compute and cache the hash codes for some other purpose, checking hash codes as a first step to comparing strings may be useful.
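A rough sketch of that idea, assuming a hypothetical wrapper type called HashedString (not part of .NET) that computes the hash code once and consults it before falling back to a full comparison:

```csharp
using System;

// Hypothetical helper: pairs a string with its hash code, computed once.
public readonly struct HashedString : IEquatable<HashedString>
{
    public string Value { get; }
    public int Hash { get; }

    public HashedString(string value)
    {
        Value = value ?? throw new ArgumentNullException(nameof(value));
        Hash = value.GetHashCode(); // cached, so comparisons never recompute it
    }

    public bool Equals(HashedString other)
    {
        // Different hash codes prove the strings differ: skip the character comparison.
        if (Hash != other.Hash) return false;
        // Equal hash codes prove nothing, so the full comparison is still required.
        return string.Equals(Value, other.Value, StringComparison.Ordinal);
    }

    public override bool Equals(object obj) => obj is HashedString other && Equals(other);
    public override int GetHashCode() => Hash;
}

public static class HashedStringDemo
{
    public static void Main()
    {
        var a = new HashedString(new string('x', 1000) + "a");
        var b = new HashedString(new string('x', 1000) + "b");
        var c = new HashedString(new string('x', 1000) + "a");
        Console.WriteLine(a.Equals(b)); // False
        Console.WriteLine(a.Equals(c)); // True
    }
}
```

The cached value is only meaningful within the current process, so it should not be persisted or compared across runs.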

You always risk collisions when using GetHashCode() because you are operating within a limited number space, Int32. This is exacerbated by the fact that hashing algorithms do not distribute perfectly within this space.
If you look at the implementation of Hashtable or Dictionary, you will see that GetHashCode is used to assign keys to buckets, which cuts down the number of comparisons required. However, equality comparisons are still necessary if there are multiple items in the same bucket.
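One way to convince yourself that the equality check does the real work is to force every key to collide. The sketch below uses a deliberately bad comparer (ConstantHashComparer is a made-up name) that returns 0 for every key; the Dictionary still gives correct answers, it just has to compare against everything that landed in the same bucket.

```csharp
using System;
using System.Collections.Generic;

// Deliberately bad comparer for illustration: every key gets hash code 0.
public sealed class ConstantHashComparer : IEqualityComparer<string>
{
    public bool Equals(string x, string y) => string.Equals(x, y, StringComparison.Ordinal);
    public int GetHashCode(string obj) => 0;
}

public static class BucketDemo
{
    public static Dictionary<string, int> Build()
    {
        // All three keys collide, so lookups rely entirely on Equals.
        var map = new Dictionary<string, int>(new ConstantHashComparer());
        map["alpha"] = 1;
        map["beta"] = 2;
        map["gamma"] = 3;
        return map;
    }

    public static void Main()
    {
        var map = Build();
        Console.WriteLine(map["beta"]);             // 2
        Console.WriteLine(map.ContainsKey("delta")); // False
    }
}
```

Lookups degrade to a linear scan in this situation, which is why a well-distributed hash matters for speed but never for correctness.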

No. GetHashCode just provides a hash code. There will be collisions. Having different hashes means the strings are different, but having the same hash does not mean the strings are the same.
Read these guidelines by Eric Lippert for correct use of GetHashCode; they are quite instructive.
If you want to compare strings, just do so! stringA == stringB works fine.
If you want to ensure a string is unique in a large set, using the power of hash code to do so, use a HashSet<string>.
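For the shorter cache key the question's EDIT asks about, a cryptographic digest is the usual substitute for GetHashCode(). The following is a minimal sketch using SHA-256 rather than MD5 (practical collisions are known for MD5); the class and method names are invented for the example. Collisions remain possible in principle, since the output is still finite, but none are known for SHA-256, and unlike GetHashCode() the result is the same on every run.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

public static class CacheKeys
{
    // Turns an arbitrarily long key into a 64-character hex string.
    public static string Shorten(string longKey)
    {
        using (SHA256 sha = SHA256.Create())
        {
            byte[] digest = sha.ComputeHash(Encoding.UTF8.GetBytes(longKey));
            return BitConverter.ToString(digest).Replace("-", "");
        }
    }

    public static void Main()
    {
        // Stand-ins for the long (> 1000 chars) keys from the question.
        string key = new string('a', 1500) + "|sort=date|page=3";
        string other = new string('a', 1499) + "ä" + "|sort=date|page=3";

        Console.WriteLine(Shorten(key).Length);            // 64
        Console.WriteLine(Shorten(key) == Shorten(other)); // False
    }
}
```

If even a theoretical collision is unacceptable, the cache can store the full original key next to the result and compare it on every hit.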

Related

Hashing performance in a HashSet<int> against a List<int> with Contains

I am looking for a comparison/performance considerations between a list of integers and a hash set of integers. This is the topic of "What is the difference between HashSet<T> and List<T>?", but specifically for T as integer.
I will have up to several thousand integers, and I want to find out, for individual integers, whether they are contained in this set.
Now of course this screams for a hash set, but I wonder whether hashing is beneficial here, since they are just integers to start with. Would hashing them first not add unnecessary overhead here?
Or in other words: Is using a hash set beneficial, even for sets of integers?
Hashing an integer is very cheap, as you can see in the source code of the Int32.GetHashCode method:
// The absolute value of the int contained.
public override int GetHashCode()
{
    return m_value;
}
The hash of the number is the number itself. It can't get any cheaper than that. So there is no reason to be concerned about the overhead. Put your numbers in a HashSet, and enjoy searching with O(1) computational complexity.
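If you want to measure this for your own data, a rough sketch along these lines may help. The sizes and the CountHits helper are arbitrary choices for the example, and Stopwatch timings from a single run are only indicative, so no numbers are quoted here.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;

public static class ContainsDemo
{
    // Counts how many of the integers 0..probes-1 are present in the collection.
    public static int CountHits(ICollection<int> collection, int probes)
    {
        int hits = 0;
        for (int i = 0; i < probes; i++)
        {
            if (collection.Contains(i)) hits++;
        }
        return hits;
    }

    public static void Main()
    {
        // 5000 integers: 0, 3, 6, ..., 14997.
        List<int> list = Enumerable.Range(0, 5000).Select(i => i * 3).ToList();
        var set = new HashSet<int>(list);

        var sw = Stopwatch.StartNew();
        int listHits = CountHits(list, 15000); // linear scan per lookup
        Console.WriteLine($"List:    {listHits} hits in {sw.Elapsed.TotalMilliseconds} ms");

        sw.Restart();
        int setHits = CountHits(set, 15000);   // hash lookup per call
        Console.WriteLine($"HashSet: {setHits} hits in {sw.Elapsed.TotalMilliseconds} ms");
    }
}
```

Both collections report the same hits; only the time taken per lookup differs.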
Whatever T is, there is a simple but effective rule of thumb:
The collection is mainly used for adding and iterating, with very few searches => use List
The collection is heavily used for searching => use HashSet

C# GetHashCode with two Int16, also returns only up to Int32?

Sorry to combine two questions into one, they are related.
HashCodes for HashSets and the such. As I understand it, they must be unique, not change, and represent any configuration of an object as a single number.
My first question is that for my object, containing the two Int16s a and b, is it safe for my GetHashCode to return something like a * n + b where n is a large number, I think perhaps Math.Pow(2, 16)?
Also GetHashCode appears to inflexibly return specifically the type Int32.
32bits can just about store, for example, two Int16s, a single unicode character or 16 N, S, E, W compass directions, it's not much, even something like a small few node graph would probably be too much for it. Does this represent a limit of C# Hash collections?
As I understand it, they must be unique
Nope. They can't possibly be unique for most types, which can have more than 2^32 possible values. Ideally, if two objects have the same hash code then they're likely to be equal - but you should never assume that they are equal. The important point is that if they have different hash codes, they should definitely be unequal.
My first question is that for my object, containing the two Int16s a and b, is it safe for my GetHashCode to return something like a * n + b where n is a large number, I think perhaps Math.Pow(2, 16).
If it only contains two Int16 values, it would be simplest to use:
return (a << 16) | (ushort) b;
Then the value will be unique. Hoorah!
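Put into a complete type, that could look roughly like the sketch below. ShortPair is a hypothetical name for the asker's object; the cast to ushort matters because a negative b would otherwise sign-extend and overwrite the bits that came from a.

```csharp
using System;

// Hypothetical type holding the two Int16 values from the question.
public readonly struct ShortPair : IEquatable<ShortPair>
{
    public short A { get; }
    public short B { get; }

    public ShortPair(short a, short b)
    {
        A = a;
        B = b;
    }

    public bool Equals(ShortPair other) => A == other.A && B == other.B;
    public override bool Equals(object obj) => obj is ShortPair other && Equals(other);

    // A fills the high 16 bits, B the low 16 bits, so distinct pairs never collide.
    public override int GetHashCode() => (A << 16) | (ushort)B;
}

public static class ShortPairDemo
{
    public static void Main()
    {
        Console.WriteLine(new ShortPair(1, -1).GetHashCode()); // 131071 (0x0001FFFF)
        Console.WriteLine(new ShortPair(-1, 1).GetHashCode()); // -65535 (0xFFFF0001)
    }
}
```

This only works because the whole state fits into 32 bits; add a third field and collisions become unavoidable again.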
Also GetHashCode appears to inflexibly return specifically the type Int32.
Yes. Types such as Dictionary and HashSet need to be able to use the fixed size so they can work with it to put values into buckets.
32bits can just about store, for example, two Int16s, a single unicode character or 16 N, S, E, W compass directions, it's not much, even something like a small few node graph would probably be too much for it. Does this represent a limit of C# Hash collections?
If it were a limitation, it would be a .NET limitation rather than a C# limitation - but no, it's just a misunderstanding of what hash codes are meant to represent.
Eric Lippert has an excellent (obviously) blog post about GetHashCode which you should read for more information.
GetHashCode is not (and cannot be) unique for every instance of an object. Take Int64, for example; even if the hash function is perfectly distributed, there will be about four billion (2^32) Int64s that hash to every value, since the hash code is, as you mentioned, only an Int32.
However, this is not a limitation on collections using hash codes; they simply use buckets for elements which hash to the same value. So a lookup into a hash table isn't guaranteed to be a single operation. Getting the correct bucket is a single operation, but there may be multiple items in that bucket.
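A small sketch of the Int64 case: as far as I know, current .NET implementations compute Int64.GetHashCode by XORing the upper and lower 32 bits, which would make 0 and 2^32 + 1 collide. Treat that as an assumption about the runtime rather than a documented guarantee; the part that holds regardless is that a HashSet<long> keeps both values apart.

```csharp
using System;
using System.Collections.Generic;

public static class LongHashDemo
{
    public static HashSet<long> BuildSet()
    {
        long x = 0L;
        long y = 4294967297L; // 2^32 + 1: upper and lower 32-bit halves are both 1
        return new HashSet<long> { x, y };
    }

    public static void Main()
    {
        long x = 0L;
        long y = 4294967297L;

        // Whether these match depends on the runtime's Int64.GetHashCode implementation.
        Console.WriteLine(x.GetHashCode() == y.GetHashCode());

        HashSet<long> set = BuildSet();
        Console.WriteLine(set.Count);       // 2
        Console.WriteLine(set.Contains(y)); // True
    }
}
```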

get data from generated hashcode()

I have a string (call it str) and I generate a hash code (call it H) from it.
I want to recover the original string str from the received hash code H.
The short answer is you can't.
Creating a hash code is a one-way operation; there is no reverse operation. The reason for this is that there are (for all practical purposes) infinitely many strings, but only finitely many hash codes (the number of possible hash codes is bounded by the range of an int). Each hash code could have been generated from any one of the infinitely many strings that give that hash code, and there's no way to know which.
You can try to do it through a brute-force attack or with the help of rainbow tables.
Anyway, even if you succeeded in finding something with those methods, you would only find a string having the same hash code as the original, and you would ABSOLUTELY not be sure that it is the original string, because hash codes are not unique.
Mmh, maybe absolutely is even a bit restrictive, because probability says you 99.999999999999... % won't find the same string :D
Hashing is generating a short fixed size value from a usually larger input. It is in general not reversible.
Mathematically impossible. There are only 2^32 different ints, but almost infinitely many strings, so it follows from the pigeonhole principle that you can't restore the string.
You can find a string that matches the HashCode pretty easily, but it probably won't be the string that was originally hashed.
GetHashCode() is designed for use in hash tables and as such is just a performance trick. It allows quick sorting of input values into buckets, and nothing more. Its value is implementation-defined, so another .NET version, or even another instance of the same application, might return a different value. return 0; is a valid (but not recommended) implementation of GetHashCode, and would not yield any information about the original string.
many of us would like to be able to do that :=)

Does the length of key affect Dictionary performance?

I will use a Dictionary in a .NET project to store a large number of objects. Therefore I decided to use a GUID-string as a key, to ensure unique keys for each object.
Does a large key such as a GUID (or even larger ones) decrease the performance of a Dictionary, e.g. for retrieving an object via its key?
Thanks,
Andrej
I would recommend using an actual Guid rather than the string representation of the Guid. Yes, when comparing strings the length does affect the number of operations required, since it has to compare the strings character-by-character (at a bare minimum; this is barring any special options like IgnoreCase). The actual Guid will give you only 16 bytes to compare rather than the minimum of 32 in the string.
That being said, you are very likely not going to notice any difference...premature optimization and all that. I would simply go for the Guid key since that's what the data is.
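As a minimal sketch of the Guid-keyed variant (the stored value and the names are placeholders): the dictionary is keyed on Guid directly, and a string that arrives from elsewhere is parsed back into a Guid before the lookup.

```csharp
using System;
using System.Collections.Generic;

public static class GuidKeyDemo
{
    // Stores a value under a Guid key, then looks it up via the key's string form.
    public static string RoundTrip(string value)
    {
        var byId = new Dictionary<Guid, string>();
        Guid id = Guid.NewGuid();
        byId[id] = value;

        // A Guid parsed from its string representation equals the original.
        Guid parsed = Guid.Parse(id.ToString());
        return byId[parsed];
    }

    public static void Main()
    {
        Console.WriteLine(RoundTrip("some object")); // some object
    }
}
```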
The actual size of an object with respect to retrieving values is irrelevant. The speed of lookup of values is much more dependent on the speed of two methods on the passed-in IEqualityComparer<T> instance:
GetHashCode()
Equals()
EDIT
A lot of people are using String as a justification for saying that larger object size decreases lookup performance. This must be taken with a grain of salt for several reasons.
The performance of the above methods for String does decrease as the size of the string increases with the default comparer, but just because that's true for System.String does not mean it is true in general.
You could just as easily write a different IEqualityComparer<String> in such a way that string length was irrelevant.
Yes and no. Larger strings increase the memory size of a dictionary. And larger sizes mean slightly longer times to calculate hash codes.
But worrying about those things is probably premature optimization. While it will be slower, it's not anything that you will probably actually notice.
Apparently it does. Here is a good test: Dictionary String Key Test
I did a quick Google search and found this article.
http://dotnetperls.com/dictionary-string-key
It confirms that generally shorter keys perform better than longer ones.
See "Performance - using Guid object or Guid string as Key" for a similar question. You could test it out with an alternative key.

C# getting unique hash from all objects

I want to be able to get a unique hash from all objects. What's more,
in case of
Dictionary<string, MyObject> foo
I want the unique keys for:
string
MyObject
Properties in MyObject
foo[someKey]
foo
etc..
object.GetHashCode() does not guarantee unique return values for different objects.
That's what I need.
Any idea? Thank you
"Unique hash" is generally a contradiction in terms, even in general terms (and it's more obviously impossible if you're trying to use an Int32 as the hash value). From the wikipedia entry:
A hash function is any well-defined procedure or mathematical function that converts a large, possibly variable-sized amount of data into a small datum, usually a single integer that may serve as an index to an array. The values returned by a hash function are called hash values, hash codes, hash sums, or simply hashes.
Note the "small datum" bit - in other words, there will be more possible objects than there are possible hash values, so you can't possibly have uniqueness.
Now, it sounds like you actually want the hash to be a string... which means it won't be of a fixed size (but will have to be under 2GB or whatever the limit is). The simplest way of producing this "unique hash" would be to serialize the object and convert the result into a string, e.g. using Base64 if it's a binary serialization format, or just the text if it's a text-based one such as JSON. However, that's not what anyone else would really recognise as "hashing".
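A sketch of the serialization approach, using JSON text directly as the key. It assumes System.Text.Json is available (it ships with .NET Core 3.0 and later; older frameworks need the NuGet package), and SearchSettings is a hypothetical type standing in for whatever object needs a key.

```csharp
using System;
using System.Text.Json;

// Hypothetical type standing in for the object that needs a key.
public sealed class SearchSettings
{
    public string Query { get; set; }
    public int Page { get; set; }
}

public static class SerializedKeyDemo
{
    // Uses the JSON form of the object's public properties as its key.
    public static string ToKey(SearchSettings settings)
    {
        return JsonSerializer.Serialize(settings);
    }

    public static void Main()
    {
        var a = new SearchSettings { Query = "hash", Page = 1 };
        var b = new SearchSettings { Query = "hash", Page = 1 };
        var c = new SearchSettings { Query = "hash", Page = 2 };

        Console.WriteLine(ToKey(a) == ToKey(b)); // True: same state, same key
        Console.WriteLine(ToKey(a) == ToKey(c)); // False: different state, different key
    }
}
```

The key is only as unique as the serialized state: anything the serializer skips (private fields, for instance) cannot distinguish two objects.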
Simply put this is not possible. The GetHashCode function returns a signed integer which contains 2^32 possible unique values. On a 64 bit platform you can have many more than 2^32 different objects running around and hence they cannot all have unique hash codes.
The only way to approach this is to create a different hashing function which returns a type with the capacity greater than or equal to the number of values that could be created in the running system.
A unique hash code is impossible without constraints on your input space. This is because Object.GetHashCode returns an int, which has only 2^32 possible values. If you have more than 2^32 objects then at least two of them must map to the same hash code (by the pigeonhole principle).
Define a custom type with restrained input (i.e., the number of possible different objects up to equality is at most 2^32) and then, and only then, is it possible to produce a unique hash code. That's not saying it will be easy, just possible.
Alternatively, don't use the Object.GetHashCode mechanism but instead some other way of representing hashes and you might be able to do what you want. We need clear details on what you want and are using it for to be able to help you here.
As others have stated, hash code will never be unique, that's not the point.
The point is to help your Dictionary<string, MyObject> foo to find the exact instance faster. It will use the hash code to narrow the search to a smaller set of objects, and then check them for equality.
You can use the Guid class to get unique strings, if what you need is a unique key. But that is not a hash code.
