i have a string(name str) and i generate hashcode(name H) from that ,
i want recieve orginal string(name str) from recieved hashcode(name H)
The short answer is you can't.
Creating a hashcode is one way operation - there is no reverse operation. The reason for this is that there are (for all practical purposes) infinitely many strings, but only finitely many hash codes (the number of possible hashcodes is bounded by the range of an int). Each hashcode could have been generated from any one of the infinitely many strings that give that hash code and there's no way to know which.
You can try to do it through a Brute Force Attack or with the help of Rainbow tables
Anyway, (even if you succeeded in finding something) with those methods, you would only find a string having the same hascode of the original, but you're ABSOLUTELY not sure that would be the original string, because hascodes are not unique.
Mmh, maybe absolutely is even a bit restrictive, because probability says you 99.999999999999... % won't find the same string :D
Hashing is generating a short fixed size value from a usually larger input. It is in general not reversible.
Mathematically impossible. There are only 2^32 different ints, but almost infinitely many strings, so from the pigeon hole principle follows that you can't restore the string.
You can find a string that matches the HashCode pretty easily, but it probably won't be the string that was originally hashed.
GetHashCode() is designed for use in hashtables and as thus is just a performance trick. It allows quick sorting of the input value into buckets, and nothing more. Its value is implementation defined. So another .net version, or even another instance of the same application might return a different value. return 0; is a valid(but not recommended) implementation of GetHashCode, and would not yield any information about the original string.
many of us would like to be able to do that :=)
Related
I have a bunch of long strings which I have to manipulate. They can occur again and again and I want to ignore them if they appear twice. I figured the best way to do this would be to hash the string and store the list of hashes in some sort of ordered list with a fast lookup time so that I can compare whenever my data set hands me a new string.
Requirements:
Be able to add items (hashes) to my collection
Be able to (quickly) check whether a particular hash is already in the collection.
Not too memory intensive. I might end up with ~100,000 of these hashes.
I don't need to go backwards (key -> value) if that makes any difference.
Any suggestions on which .NET data type would be most efficient?
I figured the best way to do this would be to hash the string and store the list of hashes in some sort of ordered list with a fast lookup time so that I can compare whenever my data set hands me a new string.
No, don't do that. Two reasons:
Hashes only tell you if two values might be the same; they don't tell you if they are the same.
You'd be doing a lot of work which has already been done for you.
Basically, you should just keep a HashSet<String>. That should be fine, have a quick lookup, and you don't need to implement it yourself.
The downside is that you will end up keeping all the strings in memory. If that's a problem then you'll need to work out an alternative strategy... which may indeed end up keeping just the hashes in memory. The exact details will probably depend on where the strings come from, and what sort of problem it would cause if you got a false positive. For example, you could keep an MD5 hash of each string, as a "better than just hashCode" hash - but that would still allow an attacker to present you with another string with the same hash. Is that a problem? If so, a more secure hash algorithm (e.g. SHA-256) might help. It still won't guarantee that you end up with different hashes for different strings though.
If you really want to be sure, you'd need to keep the hashes in memory but persist the actual string data (to disk or a database) - then when you've got a possible match (because you've seen the same hash before) you'd need to compare the stored string with the fresh one.
If you're storing the hashes in memory, the best approach will depend on the size of hash you're using. For example, for just a 64-bit hash you could use a Long per hash and keep it in a HashSet<Long>. For longer hashes, you'd need an object which can easily be compared etc. At that point, I suggest you look at Guava and its HashCode class, along with the factory methods in HashCodes (Deprecated since Guava v16).
Use a set.
ISet<T> interface is implemented by e.g. HashSet<T>
Add and Contains are expected O(1), unless you have a really poor hashing function, then the worst case is O(n).
i want to cache some search results based on the object to search and some search settings.
However: this creates quite a long cache key, and i thought i'd create a shortcut for it, and i thought i'd use GetHashCode() for it.
So i was wondering, does GetHashCode() always generate a different number, even when i have very long strings or differ only by this: 'รค' in stead of 'a'
I tried some strings and it seemed the answer is yes, but not understanding the GetHashCode() behaviour doesn't give me the true feeling i am right.
And because it is one of those things which will pop up when you're not prepared (the client is looking at cached results for the wrong search) i want to be sure...
EDIT: if MD5 would work, i can change my code not to use the GetHashCode ofcourse, the goals is to get a short(er) string than the original (> 1000 chars)
You CANNOT count on GetHashCode() being unique.
There is an excellent article which investigates the likelihood of collisions available at http://kenneththorman.blogspot.com/2010/09/c-net-equals-and-gethashcode.html . The findings were that "The smallest number of calls to GetHashCode() to return the same hashcode for a different string was after 565 iterations and the highest number of iterations before getting a hashcode collision was 296390 iterations. "
So that you can understand the contract for GetHashCode implementations, the following is an excerpt from MSDN documentation for Object.GetHashCode():
A hash function must have the following properties:
If two objects compare as equal, the GetHashCode method for each object must return the same value. However, if two objects do not compare as equal, the GetHashCode methods for the two object do not have to return different values.
The GetHashCode method for an object must consistently return the same hash code as long as there is no modification to the object state that determines the return value of the object's Equals method. Note that this is true only for the current execution of an application, and that a different hash code can be returned if the application is run again.
For the best performance, a hash function must generate a random distribution for all input.
Eric Lippert of the C# compiler team explains the rationale for the GetHashCode implementation rules on his blog at http://ericlippert.com/2011/02/28/guidelines-and-rules-for-gethashcode/ .
Logically GetHashCode cannot be unique since there are only 2^32 ints and an infinite number of strings (see the pigeon hole principle).
As #Henk pointed out in the comment even though there are an infinite number of strings there are a finite number of System.Strings. However the pigeon hole principle still stands as the later is much larger than int.MaxValue.
If one were store the hash code of each string along with the string itself, one could compare the hashcodes of strings as a "first step" to comparing them for equality. If two strings have different hashcodes, they're not equal, and one needn't bother doing anything else. If one expects to be comparing many pairs of strings which are of the same length, and which are "almost" but not quite equal, checking the hashcodes before checking the content may be a useful performance optimization. Note that this "optimization" would not be worthwhile if one did not have cached hashcodes, since computing the hashcodes of two strings would almost certainly be slower than comparing them. If, however, one has had to compute and cache the hashcodes for some other purpose, checking hash codes as a first step to comparing strings may be useful.
You always risk collisions when using GetHashCode() because you are operating within a limited number space, Int32, and this will also be exacerbated by the fact that hashing algorithms will not perfectly distribute within this space.
If you look at the implementation of HashTable or Dictionary you will see that GetHashCode is used to assign the keys into buckets to cut down the number of comparisons required, however, the equality comparisons are still necessary if there are multiple items in the same bucket.
No. GetHasCode just provides a hash code. There will be collisions. Having different hashes means the strings are different, but having the same hash does not mean the strings are the same.
Read these guidlelines by Eric Lippert for correct use of GetHashCode, they are quite instructing.
If you want to compare strings, just do so! stringA == stringB works fine.
If you want to ensure a string is unique in a large set, using the power of hash code to do so, use a HashSet<string>.
Sorry to combine two questions into one, they are related.
HashCodes for HashSets and the such. As I understand it, they must be unique, not change, and represent any configuration of an object as a single number.
My first question is that for my object, containing the two Int16s a and b, is it safe for my GetHashCode to return something like a * n + b where n is a large number, I think perhaps Math.Pow(2, 16)?
Also GetHashCode appears to inflexibly return specifically the type Int32.
32bits can just about store, for example, two Int16s, a single unicode character or 16 N, S, E, W compass directions, it's not much, even something like a small few node graph would probably be too much for it. Does this represent a limit of C# Hash collections?
As I understand it, they must be unique
Nope. They can't possibly be unique for most types, which can have more than 232 possible values. Ideally, if two objects have the same hash code then they're unlikely to be equal - but you should never assume that they are equal. The important point is that if they have different hash codes, they should definitely be unequal.
My first question is that for my object, containing the two Int16s a and b, is it safe for my GetHashCode to return something like a * n + b where n is a large number, I think perhaps Math.Pow(2, 16).
If it only contains two Int16 values, it would be simplest to use:
return (a << 16) | (ushort) b;
Then the value will be unique. Hoorah!
Also GetHashCode appears to inflexibly return specifically the type Int32.
Yes. Types such as Dictionary and HashSet need to be able to use the fixed size so they can work with it to put values into buckets.
32bits can just about store, for example, two Int16s, a single unicode character or 16 N, S, E, W compass directions, it's not much, even something like a small few node graph would probably be too much for it. Does this represent a limit of C# Hash collections?
If it were a limitation, it would be a .NET limitation rather than a C# limitation - but no, it's just a misunderstanding of what hash codes are meant to represent.
Eric Lippert has an excellent (obviously) blog post about GetHashCode which you should read for more information.
GetHashCode is not (and cannot be) unique for every instance of an object. Take Int64, for example; even if the hash function is perfectly distributed, there will be two four billion Int64s that hash to every value, since the hash code is, as you mentioned, only an Int32.
However this is not a limitation on collections using hash codes; they are simply use buckets for elements which hash to the same value. So a lookup into a hash table isn't guaranteed to be a single operation. Getting the correct bucket is a single operation, but there may be multiple items in that bucket.
I will use a Dictionary in a .NET project to store a large number of objects. Therefore I decided to use a GUID-string as a key, to ensure unique keys for each object.
Does a large key such as a GUID (or even larger ones) decrease the performance of a Dictionary, e.g. for retrieving an object via its key?
Thanks,
Andrej
I would recommend using an actual Guid rather than the string representation of the Guid. Yes, when comparing strings the length does affect the number of operations required, since it has to compare the strings character-by-character (at a bare minimum; this is barring any special options like IgnoreCase). The actual Guid will give you only 16 bytes to compare rather than the minimum of 32 in the string.
That being said, you are very likely not going to notice any difference...premature optimization and all that. I would simply go for the Guid key since that's what the data is.
The actual size of an object with respect to retrieving values is irrelevant. The speed of lookup of values is much more dependent on the speed of two methods on the passed in IEqualityComparer<T> instance
GetHashcode()
Equals()
EDIT
A lot of people are using String as a justification for saying that larger object size decreases lookup performance. This must be taken with a grain of salt for several reasons.
The performance of the above said methods for String decrease in performance as the size of the string increases for the default comparer. Just because it's true for System.String does not mean it is true in general
You could just as easily write a different IEqualityComparer<String> in such a way that string length was irrelevant.
Yes and no. Larger strings increase the memory size of a dictionary. And larger sizes mean slightly longer times to calculate hash sizes.
But worrying about those things is probably premature optimization. While it will be slower, it's not anything that you will probably actually notice.
Apparently it does. Here is a good test: Dictionary String Key Test
I did a quick Google search and found this article.
http://dotnetperls.com/dictionary-string-key
It confirms that generally shorter keys perform better than longer ones.
see Performance - using Guid object or Guid string as Key for a similar question. You could test it out with an alternative key.
I want to be able to get a uniqe hash from all objects. What more,
in case of
Dictionary<string, MyObject> foo
I want the unique keys for:
string
MyObject
Properties in MyObject
foo[someKey]
foo
etc..
object.GetHashCode() does not guarantee unique return values for different objects.
That's what I need.
Any idea? Thank you
"Unique hash" is generally a contradiction in terms, even in general terms (and it's more obviously impossible if you're trying to use an Int32 as the hash value). From the wikipedia entry:
A hash function is any well-defined
procedure or mathematical function
that converts a large, possibly
variable-sized amount of data into a
small datum, usually a single integer
that may serve as an index to an
array. The values returned by a hash
function are called hash values, hash
codes, hash sums, or simply hashes.
Note the "small datum" bit - in other words, there will be more possible objects than there are possible hash values, so you can't possibly have uniqueness.
Now, it sounds like you actually want the hash to be a string... which means it won't be of a fixed size (but will have to be under 2GB or whatever the limit is). The simplest way of producing this "unique hash" would be to serialize the object and convert the result into a string, e.g. using Base64 if it's a binary serialization format, or just the text if it's a text-based one such as JSON. However, that's not what anyone else would really recognise as "hashing".
Simply put this is not possible. The GetHashCode function returns a signed integer which contains 2^32 possible unique values. On a 64 bit platform you can have many more than 2^32 different objects running around and hence they cannot all have unique hash codes.
The only way to approach this is to create a different hashing function which returns a type with the capacity greater than or equal to the number of values that could be created in the running system.
A unique hash code is impossible without constraints on your input space. This is because Object.GetHashCode is an int. If you have more than Int32.MaxValue objects then at least two of them must map to same hash code (by the pigeonhole principle).
Define a custom type with restrained input (i.e., the number of possible different objects up to equality is less than Int32.MaxValue) and then, and only then, is it possible to produce a unique hash code. That's not saying it will be easy, just possible.
Alternatively, don't use the Object.GetHashCode mechanism but instead some other way of representing hashes and you might be able to do what you want. We need clear details on what you want and are using it for to be able to help you here.
As others have stated, hash code will never be unique, that's not the point.
The point is to help your Dictionary<string, MyObject> foo to find the exact instance faster. It will use the hash code to narrow the search to a smaller set of objects, and then check them for equality.
You can use the Guid class to get unique strings, if what you need is a unique key. But that is not a hash code.