I have a string of arbitrary length (let's say 5 to 2000 characters) for which I would like to calculate a checksum.
Requirements
The same checksum must be returned each time a calculation is done for a string
The checksum must be unique (no collisions)
I cannot store previous IDs to check for collisions
Which algorithm should I use?
Update:
Is there an approach which is reasonably unique? i.e. one where the likelihood of a collision is very small.
The checksum should be alphanumeric
The strings are unicode
The strings are actually texts that should be translated and the checksum is stored with each translation (so a translated text can be matched back to the original text).
The length of the checksum is not important for me (the shorter, the better)
Update 2:
Let's say that I got the following string "Welcome to this website. Navigate using the flashy but useless menu above".
The string is used in a view in a similar way to gettext on Linux, i.e. the user just writes (in a Razor view):
#T("Welcome to this website. Navigate using the flashy but useless menu above")
Now I need a way to identify that string so that I can fetch it from a data source (there are several implementations of the data source). Having to use the entire string as a key seems a bit inefficient, and I'm therefore looking for a way to generate a key out of it.
That's not possible.
If you can't store previous values, it's not possible to create a unique checksum that is smaller than the information in the string.
Update:
The term "reasonably unique" doesn't make sense, either it's unique or it's not.
To get a reasonably low risk of hash collisions, you can use a reasonably large hash code.
The MD5 algorithm for example produces a 16 byte hash code. Convert the string to a byte array using some encoding that preserves all characters, for example UTF-8, calculate the hash code using the MD5 class, then convert the hash code byte array into a string using the BitConverter class:
string theString = "asdf";
string hash;
using (System.Security.Cryptography.MD5 md5 = System.Security.Cryptography.MD5.Create()) {
hash = BitConverter.ToString(
md5.ComputeHash(Encoding.UTF8.GetBytes(theString))
).Replace("-", String.Empty);
}
Console.WriteLine(hash);
Output:
912EC803B2CE49E4A541068D495AB570
You can use cryptographic hash functions for this. Most of them are available in .NET.
For example:
var sha1 = System.Security.Cryptography.SHA1.Create();
byte[] buf = System.Text.Encoding.UTF8.GetBytes("test");
byte[] hash = sha1.ComputeHash(buf, 0, buf.Length);
//var hashstr = Convert.ToBase64String(hash);
var hashstr = System.BitConverter.ToString(hash).Replace("-", "");
Note: This is an answer to the original question.
Assuming you want the checksum to be stored in a variable of fixed size (i.e. an integer), you cannot satisfy your second constraint.
The checksum must be unique (no collisions)
You cannot avoid collisions because there will be more distinct strings than there are possible checksum values.
I realize this post is practically ancient, but I stumbled upon it and have run into an almost identical issue in the past. We had an nvarchar(8000) field that we needed to look up against.
Our solution was to create a persisted computed column using CHECKSUM of the nasty lookup field. We had an auto-incrementing ID field and keyed on (checksum, id).
When reading from the table, we wrote a proc that took the lookup text, computed the checksum, and then filtered on rows where both the checksum and the text were equal.
You could easily perform the checksum portions at the application level based on the answer above and store them manually instead of using our DB-centric solution. But the point is to get a reasonably sized key for indexing so that your text comparison runs against a bucket of collisions instead of the entire dataset.
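For illustration, here is a minimal application-level sketch of the same idea in C# (the Checksum and Text column names are hypothetical): compute a small key from the text, index on it, and always compare the full text as well to resolve collisions.
// Hypothetical example: derive a small, stable lookup key from the text
// (first 4 bytes of an MD5 hash here; any stable hash would do).
// Collisions are possible, so the query must also compare the full text,
// e.g. SELECT ... WHERE Checksum = @key AND Text = @text
using System;
using System.Security.Cryptography;
using System.Text;

static class LookupKey
{
    public static int Compute(string text)
    {
        using (MD5 md5 = MD5.Create())
        {
            byte[] hash = md5.ComputeHash(Encoding.UTF8.GetBytes(text));
            return BitConverter.ToInt32(hash, 0);
        }
    }
}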
Good luck!
To guarantee uniqueness for strings of almost unlimited size, treat the variable-length string as a set of concatenated substrings, each x characters in length. Your hash function then only needs to guarantee uniqueness for a substring of that maximum length, and you generate a series of checksum values, one per substring. Think of it as the equivalent of a network IP address: a set of checksum numbers.
Your issue with collisions is the assumption that a collision forces a slower search method to resolve it. If there is an insignificant number of possible collisions compared to the number of hashed objects, then the extra overhead is negligible overall. A collision is due to sizing the table smaller than the maximum number of objects. This doesn't have to be the case, because the table may have "holes", and each object within the table may carry a reference count of the objects at that slot. Only if this count is greater than 1 does a collision occur, or multiple instances of the same substring exist.
Related
Could somebody help me understand what the most significant byte of a 160-bit (SHA-1) hash is?
I have some C# code which calls the cryptography library to calculate a hash code from a data stream. In the result I get a 20-byte C# array. Then I calculate another hash code from another data stream, and then I need to place the hash codes in ascending order.
Now, I'm trying to understand how to compare them correctly. Apparently I need to subtract one from another and then check whether the result is negative, positive or zero. Technically, I have two 20-byte arrays which, from the memory perspective, have the least significant byte at the beginning (lower memory address) and the most significant byte at the end (higher memory address). On the other hand, from the human-reading perspective, the most significant byte is at the beginning and the least significant is at the end, and if I'm not mistaken this order is used for comparing GUIDs. Of course, the two approaches give different orderings. Which way is considered the right or conventional one for comparing hash codes? It is especially important in our case because we are thinking about implementing a distributed hash table which should be compatible with existing ones.
You should think of the initial hash as just bytes, not a number. If you're trying to order them for indexed lookup, use whatever ordering is simplest to implement - there's no general purpose "right" or "conventional" here, really.
If you've got some specific hash table you want to be "compatible" with (not even sure what that would mean) you should see what approach to ordering that hash table takes, assuming it's even relevant. If you've got multiple tables you need to be compatible with, you may find you need to use different ordering for different tables.
Given the comments, you're trying to work with Kademlia, which based on this document treats the hashes as big-endian numbers:
Kademlia follows Pastry in interpreting keys (including nodeIDs) as bigendian numbers. This means that the low order byte in the byte array representing the key is the most significant byte and so if two keys are close together then the low order bytes in the distance array will be zero.
That's just an arbitrary interpretation of the bytes - so long as everyone uses the same interpretation, it will work... but it would work just as well if everyone decided to interpret them as little-endian numbers.
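If you do need an ordering, here is a rough sketch of the big-endian interpretation (assuming two hashes of equal length, e.g. 20-byte SHA-1 values): comparing byte by byte from index 0 gives the same order as treating the arrays as big-endian unsigned numbers.
// Compares two equal-length hashes as big-endian unsigned numbers:
// the byte at index 0 is treated as the most significant.
static int CompareBigEndian(byte[] a, byte[] b)
{
    for (int i = 0; i < a.Length; i++)
    {
        if (a[i] != b[i])
            return a[i] < b[i] ? -1 : 1;
    }
    return 0;
}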
You can use SequenceEqual to compare byte arrays; check the following links for further details:
How to compare two arrays of bytes
Comparing two byte arrays in .NET
I have a list of 10 to at most 300 string codes (40 uppercase word characters each) that need to be stored inside an OAuth2 access token (claims-based authorization).
I have to keep the token as small as I can (header size problem), so I'm looking for a way to create a small unique identifier representing the original string inside the token.
I would then create a lookup table where I would put the uid and the original string.
When the token is sent back by the client, I would use the uid and the lookup table to get the original string back.
I've read that it is possible to truncate the first bytes of a hash (MD5, SHA1) and I would like to know if I can follow this path safely.
Is it possible to safely (collision wise) create a list of hashes (unique) of these strings where each hash would be 4/5 bytes max?
Edit:
I can't pre-generate a random string as an index (or just use a list index, for example) because this list could change and grow (for example when the server application is redeployed and new codes are added), so I have to be sure that when I get the token back from the client, the uid is still bound to the correct code.
Yes, any of those hash algorithms gives a uniformly distributed hash code where no bit carries more information than any other. You can just take any 4-5 bytes of it (as long as you take the same bytes from each code) and use them as a smaller hash code.
Naturally the collision risk gets higher the shorter the hash code is, but you will still get the lowest possible collision risk for that hash code length.
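As a sketch of that idea (assuming, for example, the first 4 bytes of an MD5 hash, hex-encoded so the identifier stays alphanumeric):
using System;
using System.Security.Cryptography;
using System.Text;

static string ShortId(string code, int numBytes = 4)
{
    using (MD5 md5 = MD5.Create())
    {
        byte[] hash = md5.ComputeHash(Encoding.UTF8.GetBytes(code));
        // Take the first N bytes; every byte carries equal information,
        // so any fixed slice works as long as the same slice is used everywhere.
        return BitConverter.ToString(hash, 0, numBytes).Replace("-", "");
    }
}
With 4 bytes (32 bits) and a few hundred codes a collision is unlikely, but not impossible, which is what the edit below is about.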
Edit:
As the question changed: No, you can't create unique identifiers using a hash code. With a long enough hash code you can make collisions rare enough that the hash code can be used as a unique identifier for almost any practical application, but a 32-bit hash code doesn't achieve that; a 128-bit hash code would.
I'm attempting to write a method to generate an integer based on any given string. When calling this method on 2 identical strings, I need the method to generate the same exact integer both times.
I tried using .GetHashCode(); however, this is very unreliable once I move the project to another machine, as GetHashCode() returns different values for the same string.
It is also important that the collision rate be VERY low. Custom methods I have written thus far produce collisions after just a few hundred thousand records.
The hash value MUST be an integer. A string hash value (like MD5) would cripple my project in terms of speed and loading overhead.
The integer hashes are being used to perform extremely rapid text searches, which I have working beautifully; however, it currently relies on .GetHashCode() and doesn't work when multiple machines get involved.
Any insight at all would be greatly appreciated.
MD5 hashing returns a byte array which could be converted to an integer:
var mystring = "abcd";
MD5 md5Hasher = MD5.Create();
var hashed = md5Hasher.ComputeHash(Encoding.UTF8.GetBytes(mystring));
var ivalue = BitConverter.ToInt32(hashed, 0);
Of course, you are converting from a 128 bit hash to a 32 bit int, so some information is being lost which will increase the possibility of collisions. You could try adjusting the second parameter to ToInt32 to see if any specific ranges of the MD5 hash produce fewer collisions than others for your data.
If your hash code creates duplicates "after a few hundred thousand records," you have a pretty good hash code implementation.
If you do the math, you'll find that a 32-bit hash code has a 50% chance of creating a duplicate after about 70,000 records. The probability of generating a duplicate after a million records is so close to certainty as not to matter.
As a rule of thumb, the likelihood of generating a duplicate hash code is 50% when the number of records hashed is equal to the square root of the number of possible values. So with a 32 bit hash code that has 2^32 possible values, the chance of generating a duplicate is 50% after approximately 2^16 (65,536) values. The actual number is slightly larger--closer to 70,000--but the rule of thumb gets you in the ballpark.
Another rule of thumb is that the chance of generating a duplicate is nearly 100% when the number of items hashed is four times the square root. So with a 32-bit hash code you're almost guaranteed to get a collision after only 2^18 (262,144) records hashed.
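For reference, the rule-of-thumb numbers come from the standard birthday-problem approximation; a quick sketch:
// Approximate probability of at least one collision after hashing n items
// into a space of N possible values: 1 - e^(-n^2 / (2N)).
static double CollisionProbability(double n, double N)
{
    return 1 - Math.Exp(-(n * n) / (2 * N));
}

// CollisionProbability(65536, Math.Pow(2, 32))  ~= 0.39 (the square-root rule of thumb)
// CollisionProbability(262144, Math.Pow(2, 32)) ~= 0.9997 (practically certain)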
That's not going to change if you use the MD5 and convert it from 128 bits to 32 bits.
This code maps any string to an int between 0 and 99:
// Requires: using System.Linq;
int x = "ali".ToCharArray().Sum(c => c) % 100;
using (MD5 md5 = MD5.Create())
{
    bigInteger = new BigInteger(md5.ComputeHash(Encoding.Default.GetBytes(myString)));
}
The BigInteger class here is the one from Org.BouncyCastle.Math.
I need to generate a unique ID for files of up to 200-300 MB in size. The condition is that the algorithm should be quick; it should not take much time. I am selecting the files from a desktop and calculating a hash value like this:
HMACSHA256 myhmacsha256 = new HMACSHA256(key);
byte[] hashValue = myhmacsha256.ComputeHash(fileStream);
fileStream is a handle to the file to read its content from. This method is going to take a lot of time for obvious reasons.
Does Windows generate a key for a file for its own bookkeeping that I could use directly?
Is there any other way to identify whether a file is the same, other than matching the file name, which is not very foolproof?
MD5.Create().ComputeHash(fileStream);
Alternatively, I'd suggest looking at this rather similar question.
How about generating a hash from the info that's readily available from the file itself? i.e. concatenate :
File Name
File Size
Created Date
Last Modified Date
and create your own?
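A rough sketch of that suggestion (note that the same content copied to a different name or with different timestamps would produce a different value, so this identifies that particular file rather than its content):
using System;
using System.IO;
using System.Security.Cryptography;
using System.Text;

static string MetadataId(string path)
{
    var info = new FileInfo(path);
    // Concatenate readily available metadata instead of reading the content.
    string combined = info.Name + "|" + info.Length + "|" +
                      info.CreationTimeUtc.Ticks + "|" + info.LastWriteTimeUtc.Ticks;
    using (var md5 = MD5.Create())
    {
        return BitConverter.ToString(md5.ComputeHash(Encoding.UTF8.GetBytes(combined)))
                           .Replace("-", "");
    }
}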
Computing and comparing hashes requires reading both files completely. My suggestion is to first check whether the file sizes are identical, and only then go through the files byte by byte.
If you want a "quick and dirty" check, I would suggest looking at CRC-32. It is extremely fast (the algorithm simply involves doing XOR with table lookups), and if you aren't too concerned about collision resistance, a combination of the file size and the CRC-32 checksum over the file data should be adequate. 28.5 bits are required to represent the file size (that gets you to 379M bytes), which means you get a checksum value of effectively just over 60 bits. I would use a 64-bit quantity to store the file size, for future proofing, but 32 bits would work too in your scenario.
If collision resistance is a consideration, then you pretty much have to use one of the tried-and-true-yet-unbroken cryptographic hash algorithms. I would still concur with what Devils child wrote and also include the file size as a separate (readily accessible) part of the hash, however; if the sizes don't match, there is no chance that the file content can be the same, so in that case the computationally intensive hash calculation can be skipped.
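Here is a sketch of the size-plus-CRC idea (this assumes the System.IO.Hashing NuGet package for the CRC-32 implementation; any CRC-32 routine would do), packing the size and the checksum into one 64-bit value:
using System;
using System.IO;
using System.IO.Hashing; // assumed: the System.IO.Hashing NuGet package

static ulong QuickFileChecksum(string path)
{
    var info = new FileInfo(path);
    var crc = new Crc32();
    using (FileStream stream = File.OpenRead(path))
    {
        byte[] buffer = new byte[81920];
        int read;
        while ((read = stream.Read(buffer, 0, buffer.Length)) > 0)
            crc.Append(buffer.AsSpan(0, read));
    }
    uint crcValue = BitConverter.ToUInt32(crc.GetCurrentHash(), 0);
    // Pack the low 32 bits of the file size and the 32-bit CRC into one value.
    // Compare sizes first and skip the CRC entirely when the sizes differ.
    return ((ulong)(uint)info.Length << 32) | crcValue;
}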
How is that integer hash generated by the GetHashCode() function? Is it a random value which is not unique?
In string, it is overridden to make sure that there exists only one hash code for a particular string.
How is that done?
How is searching for a specific key in a hash table sped up using a hash code?
What are the advantages of using a hash code over using an index directly into the collection (like in arrays)?
Can someone help?
Basically, hash functions use some generic function to digest data and generate a fingerprint (an integer here) for that data. Unlike an index, this fingerprint depends ONLY on the data, and should be free of any predictable ordering based on the data. Any change to a single bit of the data should also change the fingerprint considerably.
Notice that nowhere does this guarantee that different data won't give the same hash. In fact, quite the opposite: this happens very often, and is called a collision. But with an integer hash, the chance of any two given items colliding is roughly 1 in 4 billion (1 in 2^32). If a collision happens, you just compare the actual objects you are hashing to see if they match.
This fingerprint can then be used as an index to an array (or arraylist) of stored values. Because the fingerprint is dependent only on the data, you can compute a hash for something and just check the array element for that hash value to see if it has been stored already. Otherwise, you'd have to go through the whole array checking if it matches an item.
You can also VERY quickly do associative arrays by using 2 arrays, one with Key values (indexed by hash), and a second with values mapped to those keys. If you use a hash, you just need to know the key's hash to find the matching value for the key. This is much faster than doing a binary search on a sorted key list, or a scan of the whole array to find matching keys.
There are MANY ways to generate a hash, and all of them have various merits, but few are simple. I suggest consulting the Wikipedia page on hash functions for more info.
A hash code IS an index, and a hash table, at its very lowest level, IS an array. But for a given key value, we determine the index into a hash table differently, to make for much faster data retrieval.
Example: You have 1,000 words and their definitions. You want to store them so that you can retrieve the definition for a word very, very quickly -- faster than a binary search, which is what you would have to do with an array.
So you create a hash table. You start with an array substantially bigger than 1,000 entries -- say 5,000 (the bigger, the more time-efficient).
The way you'll use your table is, you take the word to look up, and convert it to a number between 0 and 4,999. You choose the algorithm for doing this; that's the hashing algorithm. But you could doubtless write something that would be very fast.
Then you use the converted number as an index into your 5,000-element array, and insert/find your definition at that index. There's no searching at all: you've created the index directly from the search word.
All of the operations I've described are constant time; none of them takes longer when we increase the number of entries. We just need to make sure that there is sufficient space in the hash to minimize the chance of "collisions", that is, the chance that two different words will convert to the same integer index. Because that can happen with any hashing algorithm, we need to add checks to see if there is a collision, and do something special (if "hello" and "world" both hash to 1,234 and "hello" is already in the table, what will we do with "world"? Simplest is to put it in 1,235, and adjust our lookup logic to allow for this possibility.)
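A compact sketch of that word-to-definition table (illustrative only: a toy table using the built-in string hash and linear probing for collisions):
class ToyDictionary
{
    // 5,000 slots, substantially bigger than the ~1,000 words to be stored.
    private readonly string[] keys = new string[5000];
    private readonly string[] defs = new string[5000];

    private int Slot(string word) => (word.GetHashCode() & 0x7FFFFFFF) % keys.Length;

    public void Add(string word, string definition)
    {
        int i = Slot(word);
        while (keys[i] != null && keys[i] != word)  // collision: probe the next slot
            i = (i + 1) % keys.Length;
        keys[i] = word;
        defs[i] = definition;
    }

    public string Find(string word)
    {
        int i = Slot(word);
        while (keys[i] != null)
        {
            if (keys[i] == word) return defs[i];
            i = (i + 1) % keys.Length;  // keep probing past collisions
        }
        return null;  // not present
    }
}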
Edit: after re-reading your post: a hashing algorithm is most definitely not random, it must be deterministic. The index generated for "hello" in my example must be 1,234 every single time; that's the only way the lookup can work.
Answering each one of your questions directly:
How is that integer hash generated by the GetHashCode() function? Is it a random value which is not unique?
An integer hash is generated by whatever method is appropriate for the object.
The generation method is not random but must follow consistent rules, ensuring that a hash generated for one particular object will equal the hash generated for an equivalent object. As an example, a hash function for an integer would be to simply return that integer.
In string, it is overridden to make sure that there exists only one hash code for a particular string. How is that done?
There are many ways this can be done. Here's an example I'm thinking of on the spot:
int hash = 0;
for (int i = 0; i < theString.Length; ++i)
{
    hash ^= theString[i];
}
This is a valid hash algorithm, because the same sequence of characters will always produce the same hash number. It's not a good hash algorithm (an extreme understatement), because many strings will produce the same hash. A valid hash algorithm doesn't have to guarantee uniqueness. A good hash algorithm will make the chance of two differing objects producing the same number extremely small.
How is searching for a specific key in a hash table sped up using a hash code?
What are the advantages of using a hash code over using an index directly into the collection (like in arrays)?
A hash code is typically used in hash tables. A hash table is an array, but each entry in the array is a "bucket" of items, not just one item. If you have an object and you want to know which bucket it belongs in, calculate
hash_value MOD hash_table_size.
Then you simply have to compare the object with every item in the bucket. So a hash table lookup will most likely have a search time of O(1), as opposed to O(log(N)) for a sorted list or O(N) for an unsorted list.
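For instance, here is a minimal sketch of the bucket lookup (the bucket array and its contents are assumed to have been built already):
// Requires: using System.Collections.Generic;
// Pick the bucket for a key, then compare only against the items in that bucket.
static bool Contains(List<string>[] buckets, string key)
{
    int index = (key.GetHashCode() & 0x7FFFFFFF) % buckets.Length; // hash_value MOD hash_table_size
    foreach (string item in buckets[index])
    {
        if (item == key)
            return true; // found without scanning the whole collection
    }
    return false;
}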