Generate integer based on any given string (without GetHashCode)

I'm attempting to write a method to generate an integer based on any given string. When calling this method on 2 identical strings, I need the method to generate the same exact integer both times.
I tried using .GetHashCode(), but this is unreliable once I move the project to another machine, because GetHashCode() returns different values for the same string.
It is also important that the collision rate be VERY low. Custom methods I have written thus far produce collisions after just a few hundred thousand records.
The hash value MUST be an integer. A string hash value (like md5) would cripple my project in terms of speed and loading overhead.
The integer hashes are used to perform extremely rapid text searches, which I have working beautifully. However, the search currently relies on .GetHashCode() and doesn't work when multiple machines are involved.
Any insight at all would be greatly appreciated.

MD5 hashing returns a byte array which could be converted to an integer:
var mystring = "abcd";
MD5 md5Hasher = MD5.Create();
var hashed = md5Hasher.ComputeHash(Encoding.UTF8.GetBytes(mystring));
var ivalue = BitConverter.ToInt32(hashed, 0);
Of course, you are converting from a 128 bit hash to a 32 bit int, so some information is being lost which will increase the possibility of collisions. You could try adjusting the second parameter to ToInt32 to see if any specific ranges of the MD5 hash produce fewer collisions than others for your data.
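Since the question is about getting the same integer on every machine, here is a minimal sketch of the same idea with the encoding and byte order pinned down. The class and method names are hypothetical, and it assumes System.Buffers.Binary.BinaryPrimitives is available (.NET Core 2.1 or later, or the System.Memory package); BitConverter.ToInt32 follows the byte order of the machine it runs on.

```csharp
using System;
using System.Buffers.Binary;
using System.Security.Cryptography;
using System.Text;

public static class StableStringHash
{
    // Hypothetical helper: derives a 32-bit integer from the MD5 digest of a string.
    // UTF-8 and little-endian are fixed explicitly so the result does not depend on
    // the machine's default encoding or native byte order.
    public static int ToInt32(string text)
    {
        using (MD5 md5 = MD5.Create())
        {
            byte[] digest = md5.ComputeHash(Encoding.UTF8.GetBytes(text));
            // Reads the first four bytes of the 16-byte digest.
            return BinaryPrimitives.ReadInt32LittleEndian(digest);
        }
    }
}
```

Calling StableStringHash.ToInt32("abcd") twice, or on two machines, should give the same value. The collision behaviour is still that of any 32-bit hash.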

If your hash code creates duplicates "after a few hundred thousand records," you have a pretty good hash code implementation.
If you do the math, you'll find that a 32-bit hash code has a 50% chance of creating a duplicate after about 77,000 records. The probability of generating a duplicate after a million records is so close to certainty as not to matter.
As a rule of thumb, the likelihood of generating a duplicate hash code is 50% when the number of records hashed is equal to the square root of the number of possible values. So with a 32-bit hash code that has 2^32 possible values, the chance of generating a duplicate is 50% after approximately 2^16 (65,536) values. The actual number is slightly larger (about 1.18 times the square root, or roughly 77,000), but the rule of thumb gets you in the ballpark.
Another rule of thumb is that the chance of generating a duplicate is nearly 100% when the number of items hashed is four times the square root. So with a 32-bit hash code you're almost guaranteed to get a collision after only 2^18 (262,144) records hashed.
That's not going to change if you use the MD5 and convert it from 128 bits to 32 bits.
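The rules of thumb above can be checked with the usual birthday-problem approximation, p ≈ 1 - exp(-n(n-1) / (2N)), where n is the number of records and N the number of possible hash values. This is a small sketch that assumes the hashes are uniformly distributed; the class name is made up for illustration.

```csharp
using System;

public static class BirthdayEstimate
{
    // Approximate probability that n uniformly distributed hashes, drawn from
    // `space` possible values, contain at least one duplicate.
    public static double CollisionProbability(double n, double space)
    {
        return 1.0 - Math.Exp(-n * (n - 1.0) / (2.0 * space));
    }
}
```

With space = 2^32 (4294967296.0), n = 65,536 gives a probability of about 0.39, n = 77,163 gives about 0.50, and n = 262,144 gives about 0.9997.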

This code maps any string to an int between 0 and 99:
int x = "ali".Sum(c => c) % 100; // requires using System.Linq;

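BigInteger bigInteger;
using (MD5 md5 = MD5.Create())
{
    // UTF-8 rather than Encoding.Default, which can differ between machines
    bigInteger = new BigInteger(md5.ComputeHash(Encoding.UTF8.GetBytes(myString)));
}
The BigInteger used here comes from Org.BouncyCastle.Math. The framework's System.Numerics.BigInteger also accepts a byte array, but it reads the bytes in little-endian order, so the two produce different numbers.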

Related

Understanding Hash Codes in .NET

What I've gathered up till now is that hash codes are integers that help finding data from an array faster. Look at this code:
string x = "Run the program to find this string's hash code!";
int hashCode = x.GetHashCode();
Random random = new Random(hashCode);
for(int i = 0; i<100; i++)
{
// Always generates the same set of random integers 60, 23, 67, 80, 89, 44, 44 and so on...
int randomNumber = random.Next(0, 100);
Console.WriteLine("Hash Code is: {0}", hashCode);
Console.WriteLine("The random number it generates is: {0}", randomNumber);
Console.ReadKey();
}
As you can see, I used the hash code of string x as the seed for the random number generator. This code gives me 100 random integers, but every time I run the program, it gives me the SAME set of random numbers! My question is: Why does it give me a different random number every time it iterates through the loop? Why does the hash code for x keep changing even though the string isn't changed? What are hash codes exactly, and how are they generated (if necessary)?

It's vitally important for the hash code to remain the same for a given object throughout the lifetime of that program's execution. The hash code of a given object should not be relied on to remain the same across multiple executions of the program, which is what you're doing. Many implementations will happen to remain the same in different program invocations, but the .NET string implementation does not.

What I've gathered up till now is that hash codes are integers that help finding data from an array faster
No, they help find data in a hash based collection faster. An array is just a sequence of items; there is no reliance on, or benefit from using, hash codes in a normal array.
What are Hash Codes exactly
It is a 32-bit integer that is used to insert and identify an object in a hash-based collection like a Hashtable or Dictionary
and how are they generated (if necessary)?
There is not one algorithm that all objects use to generate a hash code. The only restrictions are that 1) two "equal" objects must generate the same hash code, and 2) an object's hash code must not change over the life of that object. There is no restriction that two "equal" objects in different programs return the same hash code.
The default implementation uses the location of the object in memory. Classes such as string that define "equality" as something other than "a reference to the same object in memory" override this default behavior to honor rule 1 above.
If you want a hash code that can be persisted and is guaranteed to be the same each time you ask for it, then use a standard hashing algorithm like SHA1 or MD5.
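As a rough illustration of that last point, the seed in the question's example could be derived from a SHA-1 digest instead of GetHashCode(). This is only a sketch: the class name is hypothetical, UTF-8 and big-endian byte order are arbitrary choices that just need to stay fixed, and BinaryPrimitives is assumed to be available (.NET Core 2.1 or later, or the System.Memory package).

```csharp
using System;
using System.Buffers.Binary;
using System.Security.Cryptography;
using System.Text;

public static class PersistentSeed
{
    // Hypothetical helper: takes the first four bytes of the SHA-1 digest of the
    // string and interprets them as a big-endian 32-bit integer. The value depends
    // only on the characters of the string, not on the process or machine.
    public static int FromString(string text)
    {
        using (SHA1 sha1 = SHA1.Create())
        {
            byte[] digest = sha1.ComputeHash(Encoding.UTF8.GetBytes(text));
            return BinaryPrimitives.ReadInt32BigEndian(digest);
        }
    }
}
```

new Random(PersistentSeed.FromString(x)) would then receive the same seed on every run, and the value can be stored and compared later.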

Create a 5 chars unique identifier from a 40 characters string

I have a list of 10 to at most 300 string codes (each 40 uppercase word characters) that need to be stored inside an OAuth2 access token (claims-based authorization).
I have to keep the token as small as I can (header size problem), so I'm looking for a way to create a small unique identifier that represents the original string inside the token.
I would then create a lookup table holding the uid and the original string.
When the token is sent by the client, I will use the uid and the lookup table to get the original string back.
I've read that it is possible to truncate the first bytes of a hash (MD5, SHA1) and I would like to know if I can follow this path safely.
Is it possible to safely (collision wise) create a list of hashes (unique) of these strings where each hash would be 4/5 bytes max?
Edit:
I can't pre-generate a random string as an index (or just use a list index, for example) because this list could change and grow (for example, when the server application is deployed and new codes are added to the list), so I have to be sure that when I get the token back from the client, the uid is bound to the correct code.

Yes, any of those hash algorithms give a uniform hash code where each bit isn't supposed to carry more information than any other. You can just take any 4-5 bytes of it (as long as you take the same bytes from each code) and use as a smaller hash code.
Naturally the collision risk gets higher the shorter the hash code is, but you will still get the lowest possible collision risk for that hash code length.
Edit:
Since the question changed: no, you can't create unique identifiers using a hash code. With a long enough hash code you can make collisions rare enough that the hash code can be used as a unique identifier for almost any practical application. A 32-bit hash code doesn't do that, but a 128-bit hash code would.
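Because the list in the question is small and known on the server, one way to use truncated hashes safely is to build the lookup table up front and fail if two codes ever share a uid. The sketch below assumes SHA-1 over UTF-8, a hex-encoded uid, and a hypothetical ShortIdTable helper; none of these come from the question.

```csharp
using System;
using System.Collections.Generic;
using System.Security.Cryptography;
using System.Text;

public static class ShortIdTable
{
    // Hypothetical helper: maps each code to the first `byteCount` bytes of its
    // SHA-1 digest (hex-encoded) and throws if two different codes collide.
    public static Dictionary<string, string> Build(IEnumerable<string> codes, int byteCount = 5)
    {
        var table = new Dictionary<string, string>();
        using (SHA1 sha1 = SHA1.Create())
        {
            foreach (string code in codes)
            {
                byte[] digest = sha1.ComputeHash(Encoding.UTF8.GetBytes(code));
                string uid = BitConverter.ToString(digest, 0, byteCount).Replace("-", "");
                if (table.TryGetValue(uid, out string existing) && existing != code)
                {
                    throw new InvalidOperationException(
                        "Collision: '" + code + "' and '" + existing + "' share uid " + uid);
                }
                table[uid] = code; // uid -> original code
            }
        }
        return table;
    }
}
```

The uid of an existing code never changes when new codes are added, since it depends only on the code itself. A collision, however unlikely with a few hundred entries and 40 bits, is detected at build time rather than silently mapping a token to the wrong code.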

Calculate a checksum for a string

I have a string of arbitrary length (let's say 5 to 2000 characters) for which I would like to calculate a checksum.
Requirements
The same checksum must be returned each time a calculation is done for a string
The checksum must be unique (no collisions)
I can not store previous IDs to check for collisions
Which algorithm should I use?
Update:
Is there an approach that is reasonably unique, i.e. where the likelihood of a collision is very small?
The checksum should be alphanumeric
The strings are unicode
The strings are actually texts that should be translated and the checksum is stored with each translation (so a translated text can be matched back to the original text).
The length of the checksum is not important for me (the shorter, the better)
Update2
Let's say that I have the following string: "Welcome to this website. Navigate using the flashy but useless menu above".
The string is used in a view in a similar way to gettext on Linux, i.e. the user just writes (in a Razor view)
@T("Welcome to this website. Navigate using the flashy but useless menu above")
Now I need a way to identify that string so that I can fetch it from a data source (there are several implementations of the data source). Having to use the entire string as a key seems a bit inefficient, so I'm looking for a way to generate a key from it.

That's not possible.
If you can't store previous values, it's not possible to create a unique checksum that is smaller than the information in the string.
Update:
The term "reasonably unique" doesn't make sense, either it's unique or it's not.
To get a reasonably low risk of hash collisions, you can use a reasonably large hash code.
The MD5 algorithm for example produces a 16 byte hash code. Convert the string to a byte array using some encoding that preserves all characters, for example UTF-8, calculate the hash code using the MD5 class, then convert the hash code byte array into a string using the BitConverter class:
string theString = "asdf";
string hash;
using (System.Security.Cryptography.MD5 md5 = System.Security.Cryptography.MD5.Create()) {
hash = BitConverter.ToString(
md5.ComputeHash(Encoding.UTF8.GetBytes(theString))
).Replace("-", String.Empty);
}
Console.WriteLine(hash);
Output:
912EC803B2CE49E4A541068D495AB570

You can use cryptographic hash functions for this. Most of them are available in .NET.
For example:
var sha1 = System.Security.Cryptography.SHA1.Create();
byte[] buf = System.Text.Encoding.UTF8.GetBytes("test");
byte[] hash= sha1.ComputeHash(buf, 0, buf.Length);
//var hashstr = Convert.ToBase64String(hash);
var hashstr = System.BitConverter.ToString(hash).Replace("-", "");

Note: This is an answer to the original question.
Assuming you want the checksum to be stored in a variable of fixed size (i.e. an integer), you cannot satisfy your second constraint.
The checksum must be unique (no collisions)
You cannot avoid collisions because there will be more distinct strings than there are possible checksum values.

I realize this post is practically ancient, but I stumbled upon it and have run into an almost identical issue in the past. We had an nvarchar(8000) field that we needed to look up against.
Our solution was to create a persisted computed column using CHECKSUM of the nasty lookup field. We had an auto-incrementing ID field and keyed on (checksum, id).
When reading from the table, we wrote a proc that took the lookup text, computed its checksum, and then selected the rows where both the checksum and the text were equal.
You could easily perform the checksum portions at the application level based on the answer above and store them manually instead of using our DB-centric solution. But the point is to get a reasonably sized key for indexing so that your text comparison runs against a bucket of collisions instead of the entire dataset.
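An application-level version of that idea might look like the sketch below. The ChecksumIndex class is hypothetical, and the checksum function is passed in as a delegate because any deterministic function will do; a collision only makes one bucket slightly longer.

```csharp
using System;
using System.Collections.Generic;

public sealed class ChecksumIndex
{
    private readonly Func<string, uint> checksum;
    private readonly Dictionary<uint, List<KeyValuePair<string, string>>> buckets =
        new Dictionary<uint, List<KeyValuePair<string, string>>>();

    public ChecksumIndex(Func<string, uint> checksum)
    {
        this.checksum = checksum;
    }

    // Stores the value under the checksum of the (possibly very long) text.
    public void Add(string text, string value)
    {
        uint key = checksum(text);
        if (!buckets.TryGetValue(key, out var bucket))
        {
            bucket = new List<KeyValuePair<string, string>>();
            buckets[key] = bucket;
        }
        bucket.Add(new KeyValuePair<string, string>(text, value));
    }

    // Narrows the search to one bucket by checksum, then compares the full text.
    public bool TryFind(string text, out string value)
    {
        if (buckets.TryGetValue(checksum(text), out var bucket))
        {
            foreach (var entry in bucket)
            {
                if (entry.Key == text)
                {
                    value = entry.Value;
                    return true;
                }
            }
        }
        value = null;
        return false;
    }
}
```

The full text comparison runs only against the few entries that share a checksum, which mirrors the (checksum, id) key plus text comparison described above.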
Good luck!

To guarantee uniqueness for strings of almost unlimited size, treat the variable-length string as a sequence of concatenated substrings, each x characters long. Your hash function then only needs to guarantee uniqueness up to that maximum substring length, and the result is a series of checksum numbers, one per substring. Think of it as the equivalent of a network IP address made up of a set of checksum numbers.
Your issue with collisions is the assumption that a collision forces a slower search method to resolve each collision. If there is an insignificant number of possible collisions compared to the number of hashed objects, then overall the extra overhead becomes nil. A collision is due to the table being sized smaller than the maximum number of objects. This doesn't have to be the case, because the table may have "holes" and each entry in the table may keep a reference count of the objects at that position. Only if this count is greater than 1 has a collision occurred, or there are multiple instances of the same substring.

A simple, repeatable hash from an UInt32 to a UInt16

I have a little problem where I need to hash a number of about 10 digits into a number of 6 digits. The hash needs to be deterministic.
It's more important that the hash is not resource intensive.
For example, say that I have some number, x, like 123456789
I want to write a hash function that gives me back a number, y, like 987654.
I'd then like to have a function that takes the x and y as parameters, re-applies the hash on x, and checks that the result is y.
It should be difficult to compute possible input values given the hash.
My first idea of multiplying pairs of digits led to a lot of duplicate hashed values.
I have the feeling that this sort of problem has some kind of elegant solution, but I just can't think of it myself.
Can anyone help me out here? Thanks in advance :)

What you need is called "hashing".
Try CRC16.

Your problem as stated is not solvable.
You say that you want the system to be "somewhat hard to break", by which I assume you mean that it is "somewhat hard" for an attacker to take a known digest and produce from it a possible input which hashes to the given digest. Since there are only 4 billion possible inputs and only 65536 possible hashes in the system you propose, it is utterly trivial to find a message that corresponds to a given hash, no matter what the hash algorithm is. On average, the attacker will have about 65000 possible messages to choose from, and can therefore cherry-pick the message that best serves his nefarious scheme.
I would expect a "somewhat hard" problem in the hash-breaking space to require dedicating, say, a few million dollars' worth of supercomputer time to break. Your proposal can be broken by inexperienced high school students writing JavaScript programs that take a couple of minutes to write and maybe a minute to run, tops; this is not even vaguely close to "somewhat hard".
Why are you choosing such tiny limits on your algorithm, limits which will by their very nature make it trivial to break the hashing? And for that matter, what's the value in hashing such a tiny amount of data as a 32 bit integer?

((X >> 16) ^ X) & 0xFFFF

What you want to do is to try to distribute the hash values as evenly as possible over the range. Some of the built in hashing methods are fairly good at this, so you could perhaps try something like getting the hash code of the string representation, and simply throw away half of the bits:
ushort code = (ushort)value.ToString().GetHashCode();
However, it also depends on what you are going to use the hash code for. The built in hash codes are not intended to be stored permanently. The algorithms for calculating the hash codes can change with any new version of the framework, so if you store the hash codes in the database they may become useless in the future. In that case you would instead have to create the hashing algorithm yourself from scratch, or use some hashing algorithm that was designed for permanent storage.
One simple algorithm that is used for hash codes for some values in the framework is to use exclusive or to make all bits in the value matter when the hash code is smaller than the data:
byte[] b = BitConverter.GetBytes(value);
ushort code = (ushort)(BitConverter.ToUInt16(b, 0) ^ BitConverter.ToUInt16(b, 2));
or the more efficient but less obvious way to do the same:
ushort code = (ushort)((value >> 16) ^ value);
This of course has no obfuscating properties for small values, so you might want to throw in some "random" bits to make the hash code significantly different from the value:
ushort code = (ushort)(0x56D4 ^ (value >> 16) ^ value);
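Put together with the check the question asks for, the XOR fold could be wrapped as below. This is only a sketch with a hypothetical class name. It returns a 16-bit value (0 to 65,535) rather than a six-digit number, and, as another answer points out, it offers no real protection against finding inputs for a given hash.

```csharp
using System;

public static class Fold16
{
    // Arbitrary constant taken from the answer above; it only scrambles small values.
    private const ushort Salt = 0x56D4;

    // Folds the upper and lower 16 bits of the value together with XOR.
    public static ushort Hash(uint value)
    {
        return (ushort)(Salt ^ (value >> 16) ^ value);
    }

    // Re-applies the hash to x and checks that the result is y.
    public static bool Verify(uint x, ushort y)
    {
        return Hash(x) == y;
    }
}
```

The computation is a shift and two XORs, so it is about as cheap as a hash can be.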

How about just discarding the lower 16 bits or last 4 digits?
1234567890 --> 123456
Easily done by just doing an integer division by 10000.

Why we use Hash Code in HashTable instead of an Index?

How that integer hash is generated by the GetHashCode() function? Is it a random value which is not unique?
In string, it is overridden to make sure that there exists only one hash code for a particular string.
How to do that?
How searching for specific key in a hash table is speeded up using hash code?
What are the advantages of using hash code over using an index directly in the collection (like in arrays)?
Can someone help?

Basically, hash functions use some generic function to digest data and generate a fingerprint (an integer here) for that data. Unlike an index, this fingerprint depends ONLY on the data, and should be free of any predictable ordering based on the data. Any change to a single bit of the data should also change the fingerprint considerably.
Notice that nowhere does this guarantee that different data won't give the same hash. In fact, quite the opposite: it can happen often in large collections, and is called a collision. But with a 32-bit integer, the probability that two given different items collide is roughly 1 in 4 billion (1 in 2^32). If a collision happens, you just compare the actual objects you are hashing to see if they match.
This fingerprint can then be used as an index to an array (or arraylist) of stored values. Because the fingerprint is dependent only on the data, you can compute a hash for something and just check the array element for that hash value to see if it has been stored already. Otherwise, you'd have to go through the whole array checking if it matches an item.
You can also VERY quickly do associative arrays by using 2 arrays, one with Key values (indexed by hash), and a second with values mapped to those keys. If you use a hash, you just need to know the key's hash to find the matching value for the key. This is much faster than doing a binary search on a sorted key list, or a scan of the whole array to find matching keys.
There are MANY ways to generate a hash, and all of them have various merits, but few are simple. I suggest consulting the Wikipedia page on hash functions for more info.

A hash code IS an index, and a hash table, at its very lowest level, IS an array. But for a given key value, we determine the index into a hash table differently, to make for much faster data retrieval.
Example: You have 1,000 words and their definitions. You want to store them so that you can retrieve the definition for a word very, very quickly -- faster than a binary search, which is what you would have to do with an array.
So you create a hash table. You start with an array substantially bigger than 1,000 entries -- say 5,000 (the bigger, the more time-efficient).
The way you'll use your table is, you take the word to look up, and convert it to a number between 0 and 4,999. You choose the algorithm for doing this; that's the hashing algorithm. But you could doubtless write something that would be very fast.
Then you use the converted number as an index into your 5,000-element array, and insert/find your definition at that index. There's no searching at all: you've created the index directly from the search word.
All of the operations I've described are constant time; none of them takes longer when we increase the number of entries. We just need to make sure that there is sufficient space in the hash to minimize the chance of "collisions", that is, the chance that two different words will convert to the same integer index. Because that can happen with any hashing algorithm, we need to add checks to see if there is a collision, and do something special (if "hello" and "world" both hash to 1,234 and "hello" is already in the table, what will we do with "world"? Simplest is to put it in 1,235, and adjust our lookup logic to allow for this possibility.)
Edit: after re-reading your post: a hashing algorithm is most definitely not random, it must be deterministic. The index generated for "hello" in my example must be 1,234 every single time; that's the only way the lookup can work.
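The scheme described above (hash the word to an index, and on a collision move to the next slot) can be sketched as follows. The class name and the toy hash function are assumptions made for the example; it does not support removal or resizing.

```csharp
using System;

public sealed class ProbingTable
{
    private readonly string[] keys;
    private readonly string[] values;

    public ProbingTable(int capacity)
    {
        keys = new string[capacity];
        values = new string[capacity];
    }

    // Deterministic toy hash (an assumption, not the framework's algorithm),
    // reduced to an index between 0 and capacity - 1.
    private int IndexFor(string key)
    {
        uint h = 0;
        foreach (char c in key)
        {
            h = unchecked(h * 31 + c);
        }
        return (int)(h % (uint)keys.Length);
    }

    public void Put(string key, string value)
    {
        int i = IndexFor(key);
        for (int probes = 0; probes < keys.Length; probes++)
        {
            if (keys[i] == null || keys[i] == key)
            {
                keys[i] = key;
                values[i] = value;
                return;
            }
            i = (i + 1) % keys.Length; // slot taken by another key: try the next one
        }
        throw new InvalidOperationException("Table is full.");
    }

    public bool TryGet(string key, out string value)
    {
        int i = IndexFor(key);
        for (int probes = 0; probes < keys.Length; probes++)
        {
            if (keys[i] == null) break; // empty slot: the key was never stored
            if (keys[i] == key)
            {
                value = values[i];
                return true;
            }
            i = (i + 1) % keys.Length;
        }
        value = null;
        return false;
    }
}
```

With a table much larger than the number of entries, most lookups hit the right slot on the first probe, which is where the constant-time behaviour comes from.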

Answering each one of your questions directly:
How that integer hash is generated by the GetHashCode() function? Is it a random value which is not unique?
An integer hash is generated by whatever method is appropriate for the object.
The generation method is not random but must follow consistent rules, ensuring that a hash generated for one particular object will equal the hash generated for an equivalent object. As an example, a hash function for an integer would be to simply return that integer.
In string, it is overridden to make sure that there exists only one hash code for a particular string. How to do that?
There are many ways this can be done. Here's an example I'm thinking of on the spot:
int hash = 0;
for(int i = 0; i < theString.Length; ++i)
{
hash ^= theString[i];
}
This is a valid hash algorithm, because the same sequence of characters will always produce the same hash number. It's not a good hash algorithm (an extreme understatement), because many strings will produce the same hash. A valid hash algorithm doesn't have to guarantee uniqueness. A good hash algorithm will make a chance of two differing objects producing the same number extremely unlikely.
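To make the difference between "valid" and "good" concrete, the sketch below puts the XOR hash next to 32-bit FNV-1a, a well-known simple hash. FNV-1a is swapped in here purely for comparison; it is not what string.GetHashCode() uses, and running it over the UTF-8 bytes is an arbitrary choice for the example.

```csharp
using System;
using System.Text;

public static class HashComparison
{
    // The XOR hash from the answer above: character order does not affect the result.
    public static int XorHash(string s)
    {
        int hash = 0;
        foreach (char c in s)
        {
            hash ^= c;
        }
        return hash;
    }

    // 32-bit FNV-1a over the UTF-8 bytes of the string.
    public static uint Fnv1a(string s)
    {
        uint hash = 2166136261; // FNV offset basis
        foreach (byte b in Encoding.UTF8.GetBytes(s))
        {
            hash ^= b;
            hash = unchecked(hash * 16777619); // FNV prime
        }
        return hash;
    }
}
```

XorHash gives the same value for "ab" and "ba" (and for any two anagrams), while Fnv1a gives different values for that pair. Both are deterministic, so both are valid.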
How searching for specific key in a hash table is speeded up using hash code?
What are the advantages of using hash code over using an index directly in the collection (like in arrays)?
A hash code is typically used in hash tables. A hash table is an array, but each entry in the array is a "bucket" of items, not just one item. If you have an object and you want to know which bucket it belongs in, calculate
hash_value MOD hash_table_size.
Then you simply have to compare the object with every item in the bucket. So a hash table lookup will most likely have a search time of O(1), as opposed to O(log(N)) for a sorted list or O(N) for an unsorted list.
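One practical detail when computing hash_value MOD hash_table_size in C#: hash codes can be negative, and the % operator then returns a negative remainder, which is not a valid array index. A small sketch of one common way around this, with a hypothetical helper name, is to treat the hash as unsigned first.

```csharp
using System;

public static class Buckets
{
    // Maps any 32-bit hash code (including negative ones) to a bucket index
    // in the range 0 .. bucketCount - 1 by reinterpreting it as unsigned.
    public static int IndexFor(int hashCode, int bucketCount)
    {
        return (int)((uint)hashCode % (uint)bucketCount);
    }
}
```

For example, Buckets.IndexFor(1234, 1000) is 234, and Buckets.IndexFor(-1, 10) is 5 rather than -1.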
