C# getting unique hash from all objects

I want to be able to get a unique hash from all objects. What's more, in the case of
Dictionary<string, MyObject> foo
I want the unique keys for:
string
MyObject
Properties in MyObject
foo[someKey]
foo
etc..
object.GetHashCode() does not guarantee unique return values for different objects.
That's what I need.
Any idea? Thank you

"Unique hash" is generally a contradiction in terms, even in general terms (and it's more obviously impossible if you're trying to use an Int32 as the hash value). From the wikipedia entry:
"A hash function is any well-defined procedure or mathematical function that converts a large, possibly variable-sized amount of data into a small datum, usually a single integer that may serve as an index to an array. The values returned by a hash function are called hash values, hash codes, hash sums, or simply hashes."
Note the "small datum" bit - in other words, there will be more possible objects than there are possible hash values, so you can't possibly have uniqueness.
Now, it sounds like you actually want the hash to be a string... which means it won't be of a fixed size (but will have to be under 2GB or whatever the limit is). The simplest way of producing this "unique hash" would be to serialize the object and convert the result into a string, e.g. using Base64 if it's a binary serialization format, or just the text if it's a text-based one such as JSON. However, that's not what anyone else would really recognise as "hashing".
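To make the serialization idea concrete, here is a minimal sketch assuming System.Text.Json is available (.NET Core 3.0 or later). The MyObject members Name and Count and the helper SerializedKey.Of are invented for illustration; they are not from the question. Note that a key built from a whole dictionary depends on its enumeration order, which is not guaranteed.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

// Hypothetical type used only for illustration.
public sealed class MyObject
{
    public string Name { get; set; } = "";
    public int Count { get; set; }
}

public static class SerializedKey
{
    // Serializes the public properties to JSON. Objects with the same
    // property values give the same string; the string grows with the data.
    public static string Of<T>(T value) => JsonSerializer.Serialize(value);
}

public static class SerializedKeyDemo
{
    public static void Main()
    {
        var foo = new Dictionary<string, MyObject>
        {
            ["someKey"] = new MyObject { Name = "x", Count = 1 }
        };
        Console.WriteLine(SerializedKey.Of(foo["someKey"])); // key for one value
        Console.WriteLine(SerializedKey.Of(foo));            // key for the whole dictionary
    }
}
```

The result is an identity string rather than a hash: it is as large as the data it describes, which is exactly why it can be unique.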

Simply put, this is not possible. The GetHashCode function returns a signed 32-bit integer, which has only 2^32 possible values. On a 64-bit platform you can have many more than 2^32 different objects running around, and hence they cannot all have unique hash codes.
The only way to approach this is to create a different hashing function which returns a type with the capacity greater than or equal to the number of values that could be created in the running system.

A unique hash code is impossible without constraints on your input space. This is because Object.GetHashCode returns an int, which has only 2^32 possible values. If you have more distinct objects than that, at least two of them must map to the same hash code (by the pigeonhole principle).
Define a custom type with constrained input (i.e., the number of possible different objects up to equality is at most 2^32) and then, and only then, is it possible to produce a unique hash code. That's not saying it will be easy, just possible.
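As a sketch of such a constrained type: the hypothetical GridCell struct below (not from the question) has two 16-bit fields, so it has exactly 2^32 possible values and each one can be packed into its own int.

```csharp
using System;

// Hypothetical type: two 16-bit fields give exactly 2^32 possible values,
// so every distinct value can be packed into its own 32-bit hash code.
public readonly struct GridCell : IEquatable<GridCell>
{
    public ushort Row { get; }
    public ushort Column { get; }

    public GridCell(ushort row, ushort column)
    {
        Row = row;
        Column = column;
    }

    public bool Equals(GridCell other) => Row == other.Row && Column == other.Column;

    public override bool Equals(object obj) => obj is GridCell other && Equals(other);

    // Row goes in the high 16 bits and Column in the low 16 bits,
    // so no two distinct cells share a hash code.
    public override int GetHashCode() => (Row << 16) | Column;
}
```

Add one more field and the guarantee is gone again, since the type would then have more possible values than there are ints.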
Alternatively, don't use the Object.GetHashCode mechanism but instead some other way of representing hashes and you might be able to do what you want. We need clear details on what you want and are using it for to be able to help you here.

As others have stated, a hash code is never guaranteed to be unique; that's not the point.
The point is to help your Dictionary<string, MyObject> foo to find the exact instance faster. It will use the hash code to narrow the search to a smaller set of objects, and then check them for equality.
You can use the Guid class to get unique strings, if what you need is a unique key. But that is not a hash code.
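A minimal sketch of the Guid approach, with an invented helper name NewKey. The key is generated fresh each time and is not derived from the object's state, so it identifies an entry rather than describing its content. Collisions are not mathematically excluded, only negligibly unlikely in practice.

```csharp
using System;
using System.Collections.Generic;

public static class GuidKeyDemo
{
    // Returns a fresh identifier as 32 hex digits ("N" format).
    public static string NewKey() => Guid.NewGuid().ToString("N");

    public static void Main()
    {
        var foo = new Dictionary<string, string>();
        string key = NewKey();
        foo[key] = "some value";
        Console.WriteLine($"{key} -> {foo[key]}");
    }
}
```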

Related

Which collection type should I use to store a bunch of hashes?

I have a bunch of long strings which I have to manipulate. They can occur again and again and I want to ignore them if they appear twice. I figured the best way to do this would be to hash the string and store the list of hashes in some sort of ordered list with a fast lookup time so that I can compare whenever my data set hands me a new string.
Requirements:
Be able to add items (hashes) to my collection
Be able to (quickly) check whether a particular hash is already in the collection.
Not too memory intensive. I might end up with ~100,000 of these hashes.
I don't need to go backwards (key -> value) if that makes any difference.
Any suggestions on which .NET data type would be most efficient?

I figured the best way to do this would be to hash the string and store the list of hashes in some sort of ordered list with a fast lookup time so that I can compare whenever my data set hands me a new string.
No, don't do that. Two reasons:
Hashes only tell you if two values might be the same; they don't tell you if they are the same.
You'd be doing a lot of work which has already been done for you.
Basically, you should just keep a HashSet<String>. That should be fine, have a quick lookup, and you don't need to implement it yourself.
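For illustration, a small sketch of the HashSet<string> approach. The Deduplicator name is made up; the only library behaviour relied on is that HashSet<T>.Add returns false when the item is already present.

```csharp
using System;
using System.Collections.Generic;

public static class Deduplicator
{
    // Returns the inputs in their original order, skipping any string already seen.
    public static List<string> Distinct(IEnumerable<string> inputs)
    {
        var seen = new HashSet<string>();
        var result = new List<string>();
        foreach (string s in inputs)
        {
            // Add both checks for membership and inserts in one call.
            if (seen.Add(s))
            {
                result.Add(s);
            }
        }
        return result;
    }
}
```

The set compares the actual strings on a hash match, so there are no false positives.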
The downside is that you will end up keeping all the strings in memory. If that's a problem then you'll need to work out an alternative strategy... which may indeed end up keeping just the hashes in memory. The exact details will probably depend on where the strings come from, and what sort of problem it would cause if you got a false positive. For example, you could keep an MD5 hash of each string, as a "better than just hashCode" hash - but that would still allow an attacker to present you with another string with the same hash. Is that a problem? If so, a more secure hash algorithm (e.g. SHA-256) might help. It still won't guarantee that you end up with different hashes for different strings though.
If you really want to be sure, you'd need to keep the hashes in memory but persist the actual string data (to disk or a database) - then when you've got a possible match (because you've seen the same hash before) you'd need to compare the stored string with the fresh one.
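A rough sketch of that "hash in memory, verify against stored data" idea. The DigestIndex class is hypothetical, and the in-memory list named _store is only a stand-in for a disk or database store, which is an assumption of this sketch. SHA-256 is used as the digest.

```csharp
using System;
using System.Collections.Generic;
using System.Security.Cryptography;
using System.Text;

public sealed class DigestIndex
{
    // Digest -> ids of stored strings with that digest (the part kept in memory).
    private readonly Dictionary<string, List<int>> _idsByDigest = new Dictionary<string, List<int>>();

    // Stand-in for a disk or database store; an assumption of this sketch.
    private readonly List<string> _store = new List<string>();

    private static string Digest(string s)
    {
        using (SHA256 sha = SHA256.Create())
        {
            return Convert.ToBase64String(sha.ComputeHash(Encoding.UTF8.GetBytes(s)));
        }
    }

    // Returns true if the string was new and has been stored.
    public bool TryAdd(string s)
    {
        string digest = Digest(s);
        if (_idsByDigest.TryGetValue(digest, out List<int> ids))
        {
            // Possible match: confirm by comparing with the stored strings.
            foreach (int id in ids)
            {
                if (_store[id] == s) return false;
            }
        }
        else
        {
            ids = new List<int>();
            _idsByDigest[digest] = ids;
        }
        _store.Add(s);
        ids.Add(_store.Count - 1);
        return true;
    }
}
```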
If you're storing the hashes in memory, the best approach will depend on the size of hash you're using. For example, for just a 64-bit hash you could use a Long per hash and keep it in a HashSet<Long>. For longer hashes, you'd need an object which can easily be compared etc. At that point, I suggest you look at Guava and its HashCode class, along with the factory methods in HashCodes (Deprecated since Guava v16).

Use a set.
ISet<T> interface is implemented by e.g. HashSet<T>
Add and Contains are expected O(1), unless you have a really poor hashing function, then the worst case is O(n).

Can i use GetHashCode() for all string compares?

I want to cache some search results based on the object to search and some search settings.
However, this creates quite a long cache key, so I thought I'd create a shortcut for it using GetHashCode().
So I was wondering: does GetHashCode() always generate a different number, even when I have very long strings, or strings that differ only by something like 'ä' instead of 'a'?
I tried some strings and it seemed the answer is yes, but not understanding the GetHashCode() behaviour doesn't give me real confidence that I am right.
And because it is one of those things which will pop up when you're not prepared (the client is looking at cached results for the wrong search), I want to be sure...
EDIT: if MD5 would work, I can of course change my code not to use GetHashCode. The goal is to get a shorter string than the original (> 1000 chars).

You CANNOT count on GetHashCode() being unique.
There is an excellent article which investigates the likelihood of collisions available at http://kenneththorman.blogspot.com/2010/09/c-net-equals-and-gethashcode.html . The findings were that "The smallest number of calls to GetHashCode() to return the same hashcode for a different string was after 565 iterations and the highest number of iterations before getting a hashcode collision was 296390 iterations. "
So that you can understand the contract for GetHashCode implementations, the following is an excerpt from MSDN documentation for Object.GetHashCode():
A hash function must have the following properties:
If two objects compare as equal, the GetHashCode method for each object must return the same value. However, if two objects do not compare as equal, the GetHashCode methods for the two object do not have to return different values.
The GetHashCode method for an object must consistently return the same hash code as long as there is no modification to the object state that determines the return value of the object's Equals method. Note that this is true only for the current execution of an application, and that a different hash code can be returned if the application is run again.
For the best performance, a hash function must generate a random distribution for all input.
Eric Lippert of the C# compiler team explains the rationale for the GetHashCode implementation rules on his blog at http://ericlippert.com/2011/02/28/guidelines-and-rules-for-gethashcode/ .
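If you want to see a collision for yourself, here is a small sketch (the CollisionSearch name is made up) that hashes the strings "0", "1", "2", ... until two different strings share a hash code. On a 32-bit hash the birthday effect means this usually happens within roughly a hundred thousand strings; the exact pair found varies by runtime and, on .NET Core and later, from run to run.

```csharp
using System;
using System.Collections.Generic;

public static class CollisionSearch
{
    // Hashes "0", "1", "2", ... until two different strings share a hash code.
    public static (string First, string Second) FindCollision()
    {
        var seen = new Dictionary<int, string>();
        for (long i = 0; ; i++)
        {
            string candidate = i.ToString();
            int hash = candidate.GetHashCode();
            if (seen.TryGetValue(hash, out string earlier))
            {
                // Same hash code, but every candidate is a different string.
                return (earlier, candidate);
            }
            seen[hash] = candidate;
        }
    }

    public static void Main()
    {
        var (first, second) = FindCollision();
        Console.WriteLine($"\"{first}\" and \"{second}\" share hash code {first.GetHashCode()}");
    }
}
```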

Logically, GetHashCode cannot be unique since there are only 2^32 ints and an infinite number of strings (see the pigeonhole principle).
As @Henk pointed out in the comments, even though there are an infinite number of strings, there are a finite number of System.Strings. However, the pigeonhole principle still stands, as the latter is much larger than int.MaxValue.

If one were to store the hash code of each string along with the string itself, one could compare the hash codes of strings as a "first step" to comparing them for equality. If two strings have different hash codes, they're not equal, and one needn't bother doing anything else. If one expects to be comparing many pairs of strings which are of the same length, and which are "almost" but not quite equal, checking the hash codes before checking the content may be a useful performance optimization. Note that this "optimization" would not be worthwhile if one did not have cached hash codes, since computing the hash codes of two strings would almost certainly be slower than comparing them. If, however, one has had to compute and cache the hash codes for some other purpose, checking hash codes as a first step to comparing strings may be useful.
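A minimal sketch of that pre-check, using a hypothetical HashedString wrapper that computes the hash code once in its constructor and reuses it on every comparison.

```csharp
using System;

// Hypothetical wrapper that computes the hash code once and reuses it.
public sealed class HashedString : IEquatable<HashedString>
{
    public string Value { get; }
    private readonly int _hash;

    public HashedString(string value)
    {
        Value = value ?? throw new ArgumentNullException(nameof(value));
        _hash = value.GetHashCode();
    }

    public bool Equals(HashedString other)
    {
        if (other is null) return false;
        // Different hash codes prove the strings differ: skip the character comparison.
        if (_hash != other._hash) return false;
        // Equal hash codes prove nothing, so compare the content.
        return string.Equals(Value, other.Value, StringComparison.Ordinal);
    }

    public override bool Equals(object obj) => Equals(obj as HashedString);

    public override int GetHashCode() => _hash;
}
```

The cached value is only meaningful within one process, since string hash codes may differ between runs.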

You always risk collisions when using GetHashCode() because you are operating within a limited number space, Int32, and this will also be exacerbated by the fact that hashing algorithms will not perfectly distribute within this space.
If you look at the implementation of HashTable or Dictionary you will see that GetHashCode is used to assign the keys into buckets to cut down the number of comparisons required, however, the equality comparisons are still necessary if there are multiple items in the same bucket.

No. GetHashCode just provides a hash code. There will be collisions. Having different hashes means the strings are different, but having the same hash does not mean the strings are the same.
Read these guidelines by Eric Lippert for correct use of GetHashCode; they are quite instructive.
If you want to compare strings, just do so! stringA == stringB works fine.
If you want to ensure a string is unique in a large set, using the power of hash code to do so, use a HashSet<string>.

get data from generated hashcode()

I have a string (call it str) and I generate a hash code (call it H) from it.
I want to recover the original string (str) from the received hash code (H).

The short answer is you can't.
Creating a hashcode is one way operation - there is no reverse operation. The reason for this is that there are (for all practical purposes) infinitely many strings, but only finitely many hash codes (the number of possible hashcodes is bounded by the range of an int). Each hashcode could have been generated from any one of the infinitely many strings that give that hash code and there's no way to know which.

You can try to do it through a brute-force attack or with the help of rainbow tables.
Anyway, even if you succeeded in finding something with those methods, you would only find a string having the same hash code as the original, but you could ABSOLUTELY not be sure that it is the original string, because hash codes are not unique.
Hmm, maybe "absolutely" is even a bit of an understatement, because probability says that 99.999999999999...% of the time you won't find the same string :D

Hashing is generating a short fixed size value from a usually larger input. It is in general not reversible.
Mathematically impossible. There are only 2^32 different ints, but almost infinitely many strings, so from the pigeon hole principle follows that you can't restore the string.
You can find a string that matches the HashCode pretty easily, but it probably won't be the string that was originally hashed.
GetHashCode() is designed for use in hash tables and as such is just a performance trick. It allows quick sorting of input values into buckets, and nothing more. Its value is implementation-defined, so another .NET version, or even another instance of the same application, might return a different value. return 0; is a valid (but not recommended) implementation of GetHashCode, and would not yield any information about the original string.
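To illustrate the last point, a sketch with a hypothetical ConstantHashKey type whose GetHashCode always returns 0. A Dictionary keyed on it still gives correct answers, because Equals does the real work; it just degrades to a linear scan of one bucket.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical key type whose hash code carries no information at all.
public sealed class ConstantHashKey : IEquatable<ConstantHashKey>
{
    public string Text { get; }

    public ConstantHashKey(string text)
    {
        Text = text;
    }

    public bool Equals(ConstantHashKey other) => other != null && Text == other.Text;

    public override bool Equals(object obj) => Equals(obj as ConstantHashKey);

    // Legal, because equal objects still return equal hash codes,
    // but every key lands in the same bucket.
    public override int GetHashCode() => 0;
}

public static class ConstantHashDemo
{
    public static void Main()
    {
        var lookup = new Dictionary<ConstantHashKey, int>
        {
            [new ConstantHashKey("a")] = 1,
            [new ConstantHashKey("b")] = 2
        };
        Console.WriteLine(lookup[new ConstantHashKey("b")]);
    }
}
```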

many of us would like to be able to do that :=)

Does the length of key affect Dictionary performance?

I will use a Dictionary in a .NET project to store a large number of objects. Therefore I decided to use a GUID-string as a key, to ensure unique keys for each object.
Does a large key such as a GUID (or even larger ones) decrease the performance of a Dictionary, e.g. for retrieving an object via its key?
Thanks,
Andrej

I would recommend using an actual Guid rather than the string representation of the Guid. Yes, when comparing strings the length does affect the number of operations required, since it has to compare the strings character-by-character (at a bare minimum; this is barring any special options like IgnoreCase). The actual Guid will give you only 16 bytes to compare rather than the minimum of 32 in the string.
That being said, you are very likely not going to notice any difference...premature optimization and all that. I would simply go for the Guid key since that's what the data is.

The actual size of an object is irrelevant with respect to retrieving values. The speed of lookup depends much more on the speed of two methods on the passed-in IEqualityComparer<T> instance:
GetHashCode()
Equals()
EDIT
A lot of people are using String as a justification for saying that larger object size decreases lookup performance. This must be taken with a grain of salt for several reasons.
The performance of those methods does decrease as the size of the string increases for the default String comparer. But just because that's true for System.String does not mean it is true in general.
You could just as easily write a different IEqualityComparer<String> in such a way that string length was irrelevant.
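One possible comparer of that kind, as a sketch only: the hypothetical ReferenceStringComparer below treats two strings as equal only when they are the same instance, so neither method ever reads the characters. The trade-off is significant, since a lookup succeeds only with the very same string object that was used as the key.

```csharp
using System;
using System.Collections.Generic;
using System.Runtime.CompilerServices;

// Treats two strings as equal only when they are the same instance,
// so the cost of both methods is independent of string length.
public sealed class ReferenceStringComparer : IEqualityComparer<string>
{
    public bool Equals(string x, string y) => ReferenceEquals(x, y);

    // Identity-based hash code; does not look at the characters.
    public int GetHashCode(string obj) => RuntimeHelpers.GetHashCode(obj);
}
```

It would be passed to the dictionary constructor, e.g. new Dictionary<string, int>(new ReferenceStringComparer()).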

Yes and no. Larger strings increase the memory size of a dictionary, and larger sizes mean slightly longer times to calculate hash codes.
But worrying about those things is probably premature optimization. While it will be slower, it's not anything that you will probably actually notice.

Apparently it does. Here is a good test: Dictionary String Key Test

I did a quick Google search and found this article.
http://dotnetperls.com/dictionary-string-key
It confirms that generally shorter keys perform better than longer ones.

see Performance - using Guid object or Guid string as Key for a similar question. You could test it out with an alternative key.

Why we use Hash Code in HashTable instead of an Index?

How is that integer hash generated by the GetHashCode() function? Is it a random value which is not unique?
In string, it is overridden to make sure that there exists only one hash code for a particular string.
How is that done?
How is searching for a specific key in a hash table sped up using the hash code?
What are the advantages of using a hash code over using an index directly in the collection (like in arrays)?
Can someone help?

Basically, hash functions use some generic function to digest data and generate a fingerprint (an integer number here) for that data. Unlike an index, this fingerprint depends ONLY on the data, and should be free of any predictable ordering based on the data. Any change to a single bit of the data should also change the fingerprint considerably.
Notice that nowhere does this guarantee that different data won't give the same hash. In fact, quite the opposite: it does happen, and is called a collision. But with a well-distributed 32-bit integer hash, the probability that two given items collide is roughly 1 in 4 billion (1 in 2^32). If a collision happens, you just compare the actual objects you are hashing to see if they match.
This fingerprint can then be used as an index to an array (or arraylist) of stored values. Because the fingerprint is dependent only on the data, you can compute a hash for something and just check the array element for that hash value to see if it has been stored already. Otherwise, you'd have to go through the whole array checking if it matches an item.
You can also VERY quickly do associative arrays by using 2 arrays, one with Key values (indexed by hash), and a second with values mapped to those keys. If you use a hash, you just need to know the key's hash to find the matching value for the key. This is much faster than doing a binary search on a sorted key list, or a scan of the whole array to find matching keys.
There are MANY ways to generate a hash, and all of them have various merits, but few are simple. I suggest consulting the wikipedia page on hash functions for more info.

A hash code IS an index, and a hash table, at its very lowest level, IS an array. But for a given key value, we determine the index into a hash table differently, to make for much faster data retrieval.
Example: You have 1,000 words and their definitions. You want to store them so that you can retrieve the definition for a word very, very quickly -- faster than a binary search, which is what you would have to do with an array.
So you create a hash table. You start with an array substantially bigger than 1,000 entries -- say 5,000 (the bigger, the more time-efficient).
The way you'll use your table is, you take the word to look up, and convert it to a number between 0 and 4,999. You choose the algorithm for doing this; that's the hashing algorithm. But you could doubtless write something that would be very fast.
Then you use the converted number as an index into your 5,000-element array, and insert/find your definition at that index. There's no searching at all: you've created the index directly from the search word.
All of the operations I've described are constant time; none of them takes longer when we increase the number of entries. We just need to make sure that there is sufficient space in the hash to minimize the chance of "collisions", that is, the chance that two different words will convert to the same integer index. Because that can happen with any hashing algorithm, we need to add checks to see if there is a collision, and do something special (if "hello" and "world" both hash to 1,234 and "hello" is already in the table, what will we do with "world"? Simplest is to put it in 1,235, and adjust our lookup logic to allow for this possibility.)
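The word table described above could be sketched like this. WordTable and SlotFor are invented names, the multiply-by-31 conversion is just one possible choice of hashing algorithm, and the sketch does no resizing or removal. A collision is handled as described: try the next slot.

```csharp
using System;

// Fixed-size table with linear probing; no resizing or removal (sketch only).
public sealed class WordTable
{
    private readonly string[] _words;
    private readonly string[] _definitions;

    public WordTable(int capacity = 5000)
    {
        _words = new string[capacity];
        _definitions = new string[capacity];
    }

    // Converts a word to a slot number between 0 and capacity - 1.
    private int SlotFor(string word)
    {
        int hash = 17;
        foreach (char c in word) hash = unchecked(hash * 31 + c);
        return (hash & 0x7FFFFFFF) % _words.Length;
    }

    public void Add(string word, string definition)
    {
        int slot = SlotFor(word);
        for (int probes = 0; probes < _words.Length; probes++)
        {
            if (_words[slot] == null || _words[slot] == word)
            {
                _words[slot] = word;
                _definitions[slot] = definition;
                return;
            }
            slot = (slot + 1) % _words.Length; // collision: try the next slot
        }
        throw new InvalidOperationException("Table is full.");
    }

    // Returns the definition, or null if the word is not in the table.
    public string Find(string word)
    {
        int slot = SlotFor(word);
        for (int probes = 0; probes < _words.Length; probes++)
        {
            if (_words[slot] == null) return null; // empty slot: word is absent
            if (_words[slot] == word) return _definitions[slot];
            slot = (slot + 1) % _words.Length;
        }
        return null;
    }
}
```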
Edit: after re-reading your post: a hashing algorithm is most definitely not random, it must be deterministic. The index generated for "hello" in my example must be 1,234 every single time; that's the only way the lookup can work.

Answering each one of your questions directly:
"How is that integer hash generated by the GetHashCode() function? Is it a random value which is not unique?"
An integer hash is generated by whatever method is appropriate for the object.
The generation method is not random but must follow consistent rules, ensuring that a hash generated for one particular object will equal the hash generated for an equivalent object. As an example, a hash function for an integer would be to simply return that integer.
"In string, it is overridden to make sure that there exists only one hash code for a particular string. How is that done?"
There are many ways this can be done. Here's an example I'm thinking of on the spot:
int hash = 0;
for (int i = 0; i < theString.Length; ++i)
{
    hash ^= theString[i];
}
This is a valid hash algorithm, because the same sequence of characters will always produce the same hash number. It's not a good hash algorithm (an extreme understatement), because many strings will produce the same hash. A valid hash algorithm doesn't have to guarantee uniqueness. A good hash algorithm will make a chance of two differing objects producing the same number extremely unlikely.
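To see the difference between "valid" and "good", here is a sketch comparing the XOR hash with a common multiply-and-add scheme. The StringHashes class and the constants 17 and 31 are illustrative choices, not what System.String actually uses. The XOR version ignores character order, so anagrams such as "ab" and "ba" always collide.

```csharp
using System;

public static class StringHashes
{
    // The XOR hash from the answer: order-insensitive, so anagrams collide.
    public static int XorHash(string s)
    {
        int hash = 0;
        foreach (char c in s) hash ^= c;
        return hash;
    }

    // A common multiply-and-add scheme; character order now affects the result.
    public static int MultiplicativeHash(string s)
    {
        int hash = 17;
        foreach (char c in s) hash = unchecked(hash * 31 + c);
        return hash;
    }
}
```

Both are deterministic, which is all validity requires. Only the second spreads similar strings across different values, and even it offers no uniqueness guarantee.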
How is searching for a specific key in a hash table sped up using the hash code?
What are the advantages of using hash code over using an index directly in the collection (like in arrays)?
A hash code is typically used in hash tables. A hash table is an array, but each entry in the array is a "bucket" of items, not just one item. If you have an object and you want to know which bucket it belongs in, calculate
hash_value MOD hash_table_size.
Then you simply have to compare the object with every item in the bucket. So a hash table lookup will most likely have a search time of O(1), as opposed to O(log(N)) for a sorted list or O(N) for an unsorted list.
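The bucket scheme can be sketched as follows. BucketSet is a made-up, fixed-size illustration and not how Dictionary or HashSet are actually implemented. The sign bit is cleared before taking the remainder because GetHashCode may return a negative number.

```csharp
using System;
using System.Collections.Generic;

// Minimal chained hash set; no resizing (sketch only).
public sealed class BucketSet
{
    private readonly List<string>[] _buckets;

    public BucketSet(int size = 16)
    {
        _buckets = new List<string>[size];
        for (int i = 0; i < size; i++) _buckets[i] = new List<string>();
    }

    // hash_value MOD hash_table_size, with the sign bit cleared
    // so that the index is never negative.
    private int BucketIndex(string item) =>
        (item.GetHashCode() & 0x7FFFFFFF) % _buckets.Length;

    // Returns false if the item was already present.
    public bool Add(string item)
    {
        List<string> bucket = _buckets[BucketIndex(item)];
        if (bucket.Contains(item)) return false; // compare only within one bucket
        bucket.Add(item);
        return true;
    }

    public bool Contains(string item) => _buckets[BucketIndex(item)].Contains(item);
}
```

Lookups stay close to O(1) only while the buckets stay short, which is why real implementations grow the table as items are added.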
