Create a 5-character unique identifier from a 40-character string - C#

I have a list of 10 to 300 string codes (each 40 upper-case word characters) that need to be stored inside an OAuth2 access token (claims-based authorization).
I have to keep the token as small as I can (header size problem), so I'm looking for a way to create a small unique identifier that represents the original string inside the token.
I would then create a lookup table holding the uid and the original string.
When the client sends the token, I will use the uid and the lookup table to get the original string back.
I've read that it is possible to truncate a hash (MD5, SHA1) to its first bytes, and I would like to know if I can follow this path safely.
Is it possible to safely (collision-wise) create a list of unique hashes of these strings where each hash is 4 or 5 bytes at most?
Edit:
I can't pre-generate a random string as an index (or just use a list index, for example) because this list could change and grow (for example, when the server application is deployed and new codes are added). I have to be sure that when I get the token back from the client, the uid is still bound to the correct code.

Yes, any of those hash algorithms gives a uniformly distributed hash code, where no bit is supposed to carry more information than any other. You can take any 4-5 bytes of it (as long as you take the same bytes for every code) and use them as a smaller hash code.
Naturally the collision risk gets higher the shorter the hash code is, but you will still get the lowest possible collision risk for that hash code length.
Edit:
As the question changed: no, you can't create guaranteed-unique identifiers using a hash code. With a long enough hash code you can make collisions rare enough that the hash code can be used as a unique identifier for almost any practical application. A 32-bit hash code doesn't do that; a 128-bit hash code would.
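Because the full list of codes is known on the server, one way to combine both points above is to truncate the hash and verify uniqueness when the lookup table is built. The following is a minimal sketch under that assumption; the class and method names (ShortCodeTable, ShortId, Build) are hypothetical, and SHA-1 with a 5-byte prefix is just one possible choice.

```csharp
using System;
using System.Collections.Generic;
using System.Security.Cryptography;
using System.Text;

public static class ShortCodeTable
{
    // Hypothetical helper: the first byteCount bytes of the SHA-1 hash, as upper-case hex.
    public static string ShortId(string code, int byteCount = 5)
    {
        using (SHA1 sha1 = SHA1.Create())
        {
            byte[] hash = sha1.ComputeHash(Encoding.UTF8.GetBytes(code));
            return BitConverter.ToString(hash, 0, byteCount).Replace("-", "");
        }
    }

    // Builds the uid -> code lookup table and fails loudly on a collision,
    // because a truncated hash cannot guarantee uniqueness by itself.
    public static Dictionary<string, string> Build(IEnumerable<string> codes, int byteCount = 5)
    {
        var table = new Dictionary<string, string>();
        foreach (string code in codes)
        {
            string id = ShortId(code, byteCount);
            string existing;
            if (table.TryGetValue(id, out existing) && existing != code)
                throw new InvalidOperationException(
                    "Collision between '" + existing + "' and '" + code + "'");
            table[id] = code;
        }
        return table;
    }
}
```

The same code always maps to the same uid, so the table can be rebuilt after a deployment without invalidating old tokens. By the birthday approximation n(n-1)/(2 * 2^40), 300 codes with a 5-byte prefix give a collision chance of roughly 4 in 100 million. Hex output makes the uid 10 characters long; a denser text encoding would shorten it.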

Related

Generate integer based on any given string (without GetHashCode)

I'm attempting to write a method to generate an integer based on any given string. When calling this method on 2 identical strings, I need the method to generate the same exact integer both times.
I tried using .GetHashCode(), however this is very unreliable once I move the project to another machine, as GetHashCode() returns different values for the same string.
It is also important that the collision rate be VERY low. Custom methods I have written thus far produce collisions after just a few hundred thousand records.
The hash value MUST be an integer. A string hash value (like md5) would cripple my project in terms of speed and loading overhead.
The integer hashes are being used to perform extremely rapid text searches, which I have working beautifully; however, it currently relies on .GetHashCode() and doesn't work when multiple machines get involved.
Any insight at all would be greatly appreciated.
MD5 hashing returns a byte array which could be converted to an integer:
var mystring = "abcd";
MD5 md5Hasher = MD5.Create();
var hashed = md5Hasher.ComputeHash(Encoding.UTF8.GetBytes(mystring));
var ivalue = BitConverter.ToInt32(hashed, 0);
Of course, you are converting from a 128 bit hash to a 32 bit int, so some information is being lost which will increase the possibility of collisions. You could try adjusting the second parameter to ToInt32 to see if any specific ranges of the MD5 hash produce fewer collisions than others for your data.
If your hash code creates duplicates "after a few hundred thousand records," you have a pretty good hash code implementation.
If you do the math, you'll find that a 32-bit hash code has a 50% chance of creating a duplicate after about 77,000 records. The probability of generating a duplicate after a million records is so close to certainty as not to matter.
As a rule of thumb, the likelihood of generating a duplicate hash code approaches 50% when the number of records hashed is equal to the square root of the number of possible values. So with a 32-bit hash code that has 2^32 possible values, the chance of generating a duplicate is roughly 50% after approximately 2^16 (65,536) values. The actual 50% point is somewhat larger, closer to 77,000 (about 1.18 times the square root), but the rule of thumb gets you in the ballpark.
Another rule of thumb is that the chance of generating a duplicate is nearly 100% when the number of items hashed is four times the square root. So with a 32-bit hash code you're almost guaranteed to get a collision after only 2^18 (262,144) records hashed.
That's not going to change if you use the MD5 and convert it from 128 bits to 32 bits.
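These figures can be checked with the standard birthday approximation, p = 1 - exp(-n(n-1) / (2N)), where N is the number of possible hash values. The sketch below is only an illustration; the class and method names are made up for this example.

```csharp
using System;

public static class Birthday
{
    // Approximate probability of at least one collision among n random values
    // drawn uniformly from `space` possible values.
    public static double CollisionProbability(double n, double space)
    {
        return 1.0 - Math.Exp(-n * (n - 1.0) / (2.0 * space));
    }

    // Approximate number of values at which the collision probability reaches p.
    public static double ItemsForProbability(double p, double space)
    {
        return Math.Sqrt(2.0 * space * Math.Log(1.0 / (1.0 - p)));
    }
}
```

With space = 4294967296 (2^32), ItemsForProbability(0.5, space) comes out at about 77,000, and CollisionProbability(262144, space) is above 99.9%, which matches the "four times the square root" rule.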
This code maps any string to an int between 0 and 99 (requires System.Linq):
int x = "ali".Sum(c => (int)c) % 100;
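using (MD5 md5 = MD5.Create())
{
    // bigInteger is assumed to be declared outside this block
    bigInteger = new BigInteger(md5.ComputeHash(Encoding.Default.GetBytes(myString)));
}
The BigInteger used here comes from Org.BouncyCastle.Math; System.Numerics.BigInteger (.NET 4.0 and later) can also be constructed from a byte array.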

Creating a short unique string for each unique long string

I'm trying to create a URL shortener in C# and ASP.NET MVC. I know about hash tables and I know how to create a redirect system, etc. The problem is indexing long URLs in the database. Some URLs may be up to 4,000 characters long, and it seems a bad idea to index strings of that kind. The question is: how can I create a unique short string for each URL? For example, can MD5 help me? Is MD5 really unique for each string?
NOTE: I see that Gravatar uses MD5 for emails, so if each email address is unique, then its MD5 hash is unique. Is that right? Can I use the same solution for URLs?
You can use MD5 or SHA1 for purposes such as the one you described.
Hashes aren't completely unique. For example, a 4000-byte array has 256^4000 possible values, while MD5 has only 256^16 possible outputs. So collisions are possible. However, for all practical purposes (except cryptography), you never need to worry about collisions.
If you are interested in reading about the collision vulnerability of MD5 (related to cryptographic use), you can do it here
A perfect hash function is one that guarantees no collisions. Since your application cannot accommodate hash chains, a perfect hash is the way to go.
The hashing approaches already mentioned will work fine for creating short strings that will probably uniquely identify your URLs. However, I'd like to propose an alternative approach.
Create a database table with two columns, ID (an integer) and URL (a string). Create a row in the table for each URL you wish to track. Then, refer to each URL by its ID. Make the ID auto-incrementing; this will ensure uniqueness.
This addresses the problem of how to translate from the shortened version to the longer version: simply join on the table in the database. With hashing, this would become a problem because hashing is one-way. The resulting page identifiers will also be shorter than MD5 hashes, and will only contain digits so they will be easy to include in URL query strings, etc.
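As a rough in-memory illustration of the two-column table, assuming nothing about the real database schema, the auto-incrementing ID can be modelled with a counter. The UrlTable class and its members are hypothetical names for this sketch.

```csharp
using System;
using System.Collections.Generic;

public class UrlTable
{
    // Stands in for the (ID, URL) table; the key plays the role of the ID column.
    private readonly Dictionary<int, string> urlById = new Dictionary<int, string>();
    private int nextId = 1; // stands in for the auto-incrementing column

    // Inserts a row and returns its ID, which serves as the short identifier.
    public int Add(string url)
    {
        int id = nextId++;
        urlById[id] = url;
        return id;
    }

    // Translates the short identifier back to the long URL, or null if unknown.
    public string Resolve(int id)
    {
        string url;
        return urlById.TryGetValue(id, out url) ? url : null;
    }
}
```

Uniqueness comes from the counter rather than from the URL content, so no collision handling is needed. Unlike a hash, the same URL added twice gets two different IDs unless the caller checks for an existing row first.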
I think you could convert the URL string to a byte array (each char can be a byte) and then encode it (Base64, for example, or an encoding you create yourself if you want to go that far). To decode, you just apply Base64 decoding and turn the bytes in the array back into chars. I am not sure whether this will be a long string or not, but I am pretty sure it will be unique.
(PS: you should of course apply some logic first, like always removing http:// and adding it again later when decoding.)

Calculate a checksum for a string

I have a string of arbitrary length (let's say 5 to 2000 characters) for which I would like to calculate a checksum.
Requirements
The same checksum must be returned each time a calculation is done for a string
The checksum must be unique (no collisions)
I can not store previous IDs to check for collisions
Which algorithm should I use?
Update:
Is there an approach which is reasonably unique, i.e. where the likelihood of a collision is very small?
The checksum should be alphanumeric
The strings are unicode
The strings are actually texts that should be translated and the checksum is stored with each translation (so a translated text can be matched back to the original text).
The length of the checksum is not important for me (the shorter, the better)
Update2
Let's say that I got the following string "Welcome to this website. Navigate using the flashy but useless menu above".
The string is used in a view in a similar way to gettext in Linux, i.e. the user just writes (in a Razor view)
@T("Welcome to this website. Navigate using the flashy but useless menu above")
Now I need a way to identify that string so that I can fetch it from a data source (there are several implementations of the data source). Having to use the entire string as a key seems a bit inefficient and I'm therefore looking for a way to generate a key out of it.
That's not possible.
If you can't store previous values, it's not possible to create a unique checksum that is smaller than the information in the string.
Update:
The term "reasonably unique" doesn't make sense; either it's unique or it's not.
To get a reasonably low risk of hash collisions, you can use a reasonably large hash code.
The MD5 algorithm for example produces a 16 byte hash code. Convert the string to a byte array using some encoding that preserves all characters, for example UTF-8, calculate the hash code using the MD5 class, then convert the hash code byte array into a string using the BitConverter class:
string theString = "asdf";
string hash;
using (System.Security.Cryptography.MD5 md5 = System.Security.Cryptography.MD5.Create()) {
    hash = BitConverter.ToString(
        md5.ComputeHash(Encoding.UTF8.GetBytes(theString))
    ).Replace("-", String.Empty);
}
Console.WriteLine(hash);
Output:
912EC803B2CE49E4A541068D495AB570
You can use cryptographic hash functions for this. Most of them are available in .NET.
For example:
var sha1 = System.Security.Cryptography.SHA1.Create();
byte[] buf = System.Text.Encoding.UTF8.GetBytes("test");
byte[] hash= sha1.ComputeHash(buf, 0, buf.Length);
//var hashstr = Convert.ToBase64String(hash);
var hashstr = System.BitConverter.ToString(hash).Replace("-", "");
Note: This is an answer to the original question.
Assuming you want the checksum to be stored in a variable of fixed size (i.e. an integer), you cannot satisfy your second constraint.
The checksum must be unique (no collisions)
You cannot avoid collisions because there will be more distinct strings than there are possible checksum values.
I realize this post is practically ancient, but I stumbled upon it and have run into an almost identical issue in the past. We had an nvarchar(8000) field that we needed to look up against.
Our solution was to create a persisted computed column using CHECKSUM of the nasty lookup field. We had an auto-incrementing ID field and keyed on (checksum, id).
When reading from the table, we wrote a proc that took the lookup text, computed the checksum, and then selected the rows where the checksums were equal and the text was equal.
You could easily perform the checksum portions at the application level based on the answer above and store them manually instead of using our DB-centric solution. But the point is to get a reasonably sized key for indexing so that your text comparison runs against a bucket of collisions instead of the entire dataset.
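At the application level, the same idea might look like the sketch below: a 32-bit checksum selects a small bucket, and the full text comparison runs only inside that bucket. The checksum here is an FNV-1a style hash over UTF-16 code units, chosen as a stand-in because it gives the same value on every machine; it is an assumption for this example and not SQL Server's CHECKSUM. The ChecksumIndex name is hypothetical.

```csharp
using System;
using System.Collections.Generic;

public class ChecksumIndex
{
    // checksum -> all texts sharing that checksum (the "bucket of collisions")
    private readonly Dictionary<int, List<string>> buckets =
        new Dictionary<int, List<string>>();

    // Stand-in checksum: FNV-1a style, 32-bit, over UTF-16 code units.
    public static int Checksum(string text)
    {
        unchecked
        {
            uint hash = 2166136261u;
            foreach (char c in text)
            {
                hash ^= c;
                hash *= 16777619u;
            }
            return (int)hash;
        }
    }

    public void Add(string text)
    {
        int key = Checksum(text);
        List<string> bucket;
        if (!buckets.TryGetValue(key, out bucket))
        {
            bucket = new List<string>();
            buckets[key] = bucket;
        }
        bucket.Add(text);
    }

    // Narrow by checksum first, then confirm with a full text comparison.
    public bool Contains(string text)
    {
        List<string> bucket;
        return buckets.TryGetValue(Checksum(text), out bucket) && bucket.Contains(text);
    }
}
```

Collisions are tolerated rather than prevented: two different texts may share a checksum, but the final text comparison keeps the lookup correct.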
Good luck!
To guarantee uniqueness for strings of almost unlimited size, treat the variable-length string as a series of concatenated substrings, each "x characters in length". Your hash function then only needs to determine uniqueness up to that maximum substring length, and it generates a series of checksum numbers, one per substring. Think of it as the equivalent of a network IP address made up of a set of checksum numbers.
Your issue with collisions is the assumption that a collision forces a slower search method to resolve each collision. If there is an insignificant number of possible collisions compared to the number of hashed objects, then overall the extra overhead becomes nil. A collision is due to sizing the table smaller than the maximum number of objects. This doesn't have to be the case, because the table may have "holes" and each entry within the table may keep a reference count of the objects at that position. Only if this count is greater than 1 has a collision occurred, or there are multiple instances of the same substring.

Generate a short code based on a unique string in C#

I'm just about to launch the beta of a new online service. Beta subscribers will be sent a unique "access code" that allows them to register for the service.
Rather than storing a list of access codes, I thought I would just generate a code based on their email, since this itself is unique.
My initial thought was to combine the email with a unique string and then Base64 encode it. However, I was looking for codes that are a bit shorter, say 5 digits long.
If the access code itself needs to be unique, it will be difficult to ensure against collisions. If you can tolerate a case where two users might, by coincidence, share the same access code, it becomes significantly easier.
Taking the Base64 encoding of the e-mail address concatenated with a known string, as proposed, could introduce a security vulnerability: the user could simply decode the access code and derive the algorithm used to generate it.
One option is to take the HMAC-SHA1 hash (System.Security.Cryptography.HMACSHA1) of the e-mail address with a known secret key. The output of the hash is a 20-byte sequence. You could then truncate the hash deterministically. For instance, in the following, GetCodeForEmail("test@example.org") gives a code of 'PE2WEG':
// define characters allowed in passcode. set length so divisible into 256
static char[] ValidChars = {'2','3','4','5','6','7','8','9',
                            'A','B','C','D','E','F','G','H',
                            'J','K','L','M','N','P','Q',
                            'R','S','T','U','V','W','X','Y','Z'}; // len=32
const string hashkey = "password"; // key for HMAC function -- change!
const int codelength = 6; // length of passcode
string GetCodeForEmail(string address)
{
    byte[] hash;
    using (HMACSHA1 sha1 = new HMACSHA1(ASCIIEncoding.ASCII.GetBytes(hashkey)))
        hash = sha1.ComputeHash(UTF8Encoding.UTF8.GetBytes(address));
    // pick a start offset from the last hash byte so the window fits in the hash
    int startpos = hash[hash.Length - 1] % (hash.Length - codelength);
    StringBuilder passbuilder = new StringBuilder();
    for (int i = startpos; i < startpos + codelength; i++)
        passbuilder.Append(ValidChars[hash[i] % ValidChars.Length]);
    return passbuilder.ToString();
}
You may create a special hash from their email, which is less than 6 chars, but it wouldn't really make that "unique", there will always be collisions in such a small space. I'd rather go with a longer key, or storing pre-generated codes in a table anyway.
So, it sounds like what you want to do here is to create a hash function specifically for emails, as @can poyragzoglu pointed out. A very simple one might look something like this:
(a C# sketch; the prime 7919 and the modulus 99991 are arbitrary choices)
long total = 0;
foreach (char c in email)
    total += 7919L * c;
long code = total % 99991; // value in the range 0..99990
As he pointed out though, this will not be unique unless you had an excellent hash function. You're likely to have collisions. Not sure if that matters.
What seems easier to me, is if you already know the valid emails, just check the user's email against your list of valid ones upon registration? Why bother with a code at all?
If you really want a unique identifier though, the easiest way to do this is probably to just use what's called a GUID. C# natively supports this. You could store this in your Users table. Though, it would be far too long for a user to ever remember/type out, it would almost certainly be unique for each one if that's what you're trying to do.
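For reference, generating such an identifier takes one call to Guid.NewGuid(). The wrapper below is only a sketch; the AccessCodes name is hypothetical, and the "N" format is used to drop the dashes.

```csharp
using System;

public static class AccessCodes
{
    // Returns a new random GUID as 32 hexadecimal digits without dashes.
    public static string NewCode()
    {
        return Guid.NewGuid().ToString("N");
    }
}
```

Unlike the hash-based codes above, a GUID is not derived from the e-mail address, so it has to be stored alongside the user to be checked later.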

Why we use Hash Code in HashTable instead of an Index?

How is that integer hash generated by the GetHashCode() function? Is it a random value which is not unique?
In string, it is overridden to make sure that there exists only one hash code for a particular string.
How is that done?
How is searching for a specific key in a hash table sped up using the hash code?
What are the advantages of using hash code over using an index directly in the collection (like in arrays)?
Can someone help?
Basically, hash functions use some generic function to digest data and generate a fingerprint (an integer here) for that data. Unlike an index, this fingerprint depends ONLY on the data, and should be free of any predictable ordering based on the data. Any change to a single bit of the data should also change the fingerprint considerably.
Notice that nothing here guarantees that different data won't give the same hash. In fact, quite the opposite: it does happen, and is called a collision. But with a 32-bit integer, the probability that two given values collide is roughly 1 in 4 billion (1 in 2^32). If a collision happens, you just compare the actual objects you are hashing to see if they match.
This fingerprint can then be used as an index to an array (or arraylist) of stored values. Because the fingerprint is dependent only on the data, you can compute a hash for something and just check the array element for that hash value to see if it has been stored already. Otherwise, you'd have to go through the whole array checking if it matches an item.
You can also VERY quickly do associative arrays by using 2 arrays, one with Key values (indexed by hash), and a second with values mapped to those keys. If you use a hash, you just need to know the key's hash to find the matching value for the key. This is much faster than doing a binary search on a sorted key list, or a scan of the whole array to find matching keys.
There are MANY ways to generate a hash, and all of them have various merits, but few are simple. I suggest consulting the wikipedia page on hash functions for more info.
A hash code IS an index, and a hash table, at its very lowest level, IS an array. But for a given key value, we determine the index into a hash table differently, to make for much faster data retrieval.
Example: You have 1,000 words and their definitions. You want to store them so that you can retrieve the definition for a word very, very quickly -- faster than a binary search, which is what you would have to do with an array.
So you create a hash table. You start with an array substantially bigger than 1,000 entries -- say 5,000 (the bigger, the more time-efficient).
The way you'll use your table is, you take the word to look up, and convert it to a number between 0 and 4,999. You choose the algorithm for doing this; that's the hashing algorithm. But you could doubtless write something that would be very fast.
Then you use the converted number as an index into your 5,000-element array, and insert/find your definition at that index. There's no searching at all: you've created the index directly from the search word.
All of the operations I've described are constant time; none of them takes longer when we increase the number of entries. We just need to make sure that there is sufficient space in the hash to minimize the chance of "collisions", that is, the chance that two different words will convert to the same integer index. Because that can happen with any hashing algorithm, we need to add checks to see if there is a collision, and do something special (if "hello" and "world" both hash to 1,234 and "hello" is already in the table, what will we do with "world"? Simplest is to put it in 1,235, and adjust our lookup logic to allow for this possibility.)
Edit: after re-reading your post: a hashing algorithm is most definitely not random, it must be deterministic. The index generated for "hello" in my example must be 1,234 every single time; that's the only way the lookup can work.
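The "put it in the next slot" strategy described above is usually called linear probing. The sketch below illustrates it under simplifying assumptions: a fixed-size table, string keys and values, no deletion, and a toy hash function. The ProbingTable name and its members are made up for this example.

```csharp
using System;

public class ProbingTable
{
    private readonly string[] keys;
    private readonly string[] values;

    public ProbingTable(int capacity)
    {
        keys = new string[capacity];
        values = new string[capacity];
    }

    // Deterministic toy hash mapping a word to 0..capacity-1 (illustrative only).
    private int IndexOf(string key)
    {
        unchecked
        {
            uint h = 0;
            foreach (char c in key) h = h * 31 + c;
            return (int)(h % (uint)keys.Length);
        }
    }

    public void Put(string key, string value)
    {
        int i = IndexOf(key);
        for (int probes = 0; probes < keys.Length; probes++)
        {
            if (keys[i] == null || keys[i] == key)
            {
                keys[i] = key;
                values[i] = value;
                return;
            }
            i = (i + 1) % keys.Length; // collision: try the next slot
        }
        throw new InvalidOperationException("Table is full");
    }

    public string Get(string key)
    {
        int i = IndexOf(key);
        for (int probes = 0; probes < keys.Length; probes++)
        {
            if (keys[i] == null) return null; // empty slot: key is absent
            if (keys[i] == key) return values[i];
            i = (i + 1) % keys.Length;
        }
        return null;
    }
}
```

As long as the table stays sparsely filled, most lookups succeed on the first slot, which is where the constant-time behaviour comes from.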
Answering each one of your questions directly:
How is that integer hash generated by the GetHashCode() function? Is it a random value which is not unique?
An integer hash is generated by whatever method is appropriate for the object.
The generation method is not random but must follow consistent rules, ensuring that a hash generated for one particular object will equal the hash generated for an equivalent object. As an example, a hash function for an integer would be to simply return that integer.
In string, it is overridden to make sure that there exists only one hash code for a particular string. How is that done?
There are many ways this can be done. Here's an example I'm thinking of on the spot:
int hash = 0;
for (int i = 0; i < theString.Length; ++i)
{
    hash ^= theString[i];
}
This is a valid hash algorithm, because the same sequence of characters will always produce the same hash number. It's not a good hash algorithm (an extreme understatement), because many strings will produce the same hash. A valid hash algorithm doesn't have to guarantee uniqueness. A good hash algorithm makes it extremely unlikely that two different objects produce the same number.
How is searching for a specific key in a hash table sped up using the hash code?
What are the advantages of using hash code over using an index directly in the collection (like in arrays)?
A hash code is typically used in hash tables. A hash table is an array, but each entry in the array is a "bucket" of items, not just one item. If you have an object and you want to know which bucket it belongs in, calculate
hash_value MOD hash_table_size.
Then you simply have to compare the object with every item in the bucket. So a hash table lookup will most likely have a search time of O(1), as opposed to O(log(N)) for a sorted list or O(N) for an unsorted list.
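The bucket scheme can be sketched as follows. This is a simplified illustration and not how the framework's own hash table classes are implemented; BucketTable and its members are hypothetical names. Because GetHashCode() may return a negative number, the sign bit is cleared before taking the remainder.

```csharp
using System;
using System.Collections.Generic;

public class BucketTable
{
    private readonly List<KeyValuePair<string, string>>[] buckets;

    public BucketTable(int size)
    {
        buckets = new List<KeyValuePair<string, string>>[size];
        for (int i = 0; i < size; i++)
            buckets[i] = new List<KeyValuePair<string, string>>();
    }

    // hash_value MOD hash_table_size, with the sign bit cleared first.
    public int BucketIndex(string key)
    {
        return (key.GetHashCode() & 0x7FFFFFFF) % buckets.Length;
    }

    // Appends to the bucket; duplicate keys are not checked in this sketch.
    public void Add(string key, string value)
    {
        buckets[BucketIndex(key)].Add(new KeyValuePair<string, string>(key, value));
    }

    // Only the items in one bucket are compared, not the whole collection.
    public string Find(string key)
    {
        foreach (KeyValuePair<string, string> entry in buckets[BucketIndex(key)])
            if (entry.Key == key) return entry.Value;
        return null;
    }
}
```

With enough buckets relative to the number of items, each bucket holds only a handful of entries, so the comparison loop stays short regardless of the total size.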
