Any 32 or 64 bit hash function? - C#

I want to create a method in C# which will return a string of at most 10-12 characters. I have tried SHA1 and MD5, but at 160 and 128 bits respectively they produce 40- and 32-character hex strings, which doesn't fulfill my requirements. Security is not the issue; I just need a small string that will remain unique.

You can truncate the string (the hash) to the length you want. You'll only make it weaker (as an extreme example, if you truncate it to one byte, you'll probably have a collision after about 16 elements are hashed, thanks to the birthday problem). Every part of a good hash is as "good" as every other part, so take the first x characters/bytes and live happy. See for example a discussion about this on the Security site; there is an explanation there of how secure a truncated hash will be.
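As a minimal sketch of the truncation idea (SHA-256 and the method name are my illustrative choices; the question's MD5 would work the same way), hash the input and keep only the first 6 bytes, giving a 12-character hex string:

using System;
using System.Security.Cryptography;
using System.Text;

static string ShortHash(string input, int bytesToKeep = 6)
{
    using (var sha = SHA256.Create())
    {
        byte[] full = sha.ComputeHash(Encoding.UTF8.GetBytes(input));
        var sb = new StringBuilder();
        for (int i = 0; i < bytesToKeep; i++)
            sb.Append(full[i].ToString("x2")); // two hex characters per byte
        return sb.ToString(); // 12 characters for 6 bytes (48 bits)
    }
}

At 48 bits, the birthday bound puts a 50% collision chance at around 2^24 (roughly 16 million) hashed values.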

Related

C# Creating your own hash algorithm - 99 documents, 0.0001 collision?

Looking at Wolfram (collision probability vs. hash bits, graphed for 99 documents), I'd need a 25.5-bit hash to have a 0.0001 chance of a collision.
I looked at CRC-24 and wondered whether it could be improved to use even fewer characters. I have a big list of characters that can be used for the hash: basically all Unicode characters except for 4 or 5 of them.
Now, how do you create your own hash algorithm based on a set of usable characters in C#?
EDIT:
Let me state the issue more precisely: I have 99 strings. I want to cut them to a maximum length of 64 characters. This can create duplicates, but they need to be unique while maintaining their meaning. The idea is to create a hash as small as possible and replace the last characters of the truncated string with the hash of the original string. The hash should of course have a low probability of collision and be as short as possible. As I understand it, the more symbols the hash can use (a-z0-9, a-zA-Z0-9, or even all Unicode characters), the fewer characters it needs before collisions appear. I looked at SHA-1 (just trimming it) and CRC-32, but they don't use the "full potential" of Unicode characters.
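A hedged sketch of that "bigger alphabet" idea (the helper name and the alphabet parameter are mine): take 32 bits of any hash, comfortably above the ~25.5 bits computed above, and rewrite them in base alphabet.Length:

using System;
using System.Security.Cryptography;
using System.Text;

static string HashToAlphabet(string input, string alphabet)
{
    byte[] digest;
    using (var sha = SHA256.Create())
        digest = sha.ComputeHash(Encoding.UTF8.GetBytes(input));

    uint value = BitConverter.ToUInt32(digest, 0); // 32 bits of the hash

    // Repeated div/mod rewrites the number in base alphabet.Length.
    var sb = new StringBuilder();
    do
    {
        sb.Append(alphabet[(int)(value % (uint)alphabet.Length)]);
        value /= (uint)alphabet.Length;
    } while (value > 0);
    return sb.ToString();
}

With an alphabet of, say, 1000 usable characters, 32 bits fit into ceil(32 / log2(1000)) = 4 characters.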

Which part of a GUID is most worth keeping?

I need to generate a unique ID and was considering Guid.NewGuid to do this, which generates something of the form:
0fe66778-c4a8-4f93-9bda-366224df6f11
This is a little long for the string-type database column that it will end up residing in, so I was planning on truncating it.
The question is: Is one end of a GUID more preferable than the rest in terms of uniqueness? Should I be lopping off the start, the end, or removing parts from the middle? Or does it just not matter?
You can save space by using a base64 string instead:
var g = Guid.NewGuid();
var s = Convert.ToBase64String(g.ToByteArray());
Console.WriteLine(g); // 36 characters with hyphens
Console.WriteLine(s); // 24 characters (22 plus "==" padding)
This will save you 12 characters (8 if you weren't using the hyphens).
Keep all of it.
From the above link:
* Four bits to encode the computer number,
* 56 bits for the timestamp, and
* four bits as a uniquifier.
You can redefine the GUID to right-size it to your needs.
If the GUID were simply a random number, you could keep an arbitrary subset of the bits and suffer a certain percent chance of collision that you can calculate with the "birthday algorithm":
double numBirthdays = 365; // set to e.g. 18446744073709551616d for 64 bits
double numPeople = 23; // set to the maximum number of GUIDs you intend to store
double probability = 1; // that all birthdays are different
for (int x = 1; x < numPeople; x++)
probability *= (double)(numBirthdays - x) / numBirthdays;
Console.WriteLine("Probability that two people have the same birthday:");
Console.WriteLine((1 - probability).ToString());
However, often the probability of a collision is higher because, as a matter of fact, GUIDs are in general NOT random. According to Wikipedia's GUID article there are five types of GUIDs. The 13th hex digit specifies which kind of GUID you have, so it tends not to vary, and the top two bits of the 17th digit are always fixed at 10 (so that digit is always 8, 9, A, or B).
For each type of GUID you'll get different degrees of randomness. Version 4 (13th digit = 4) is entirely random except for digits 13 and 17; versions 3 and 5 are effectively random, as they are cryptographic hashes; while versions 1 and 2 are mostly NOT random but certain parts are fairly random in practical cases. A "gotcha" for version 1 and 2 GUIDs is that many GUIDs could come from the same machine and in that case will have a large number of identical bits (in particular, the last 48 bits and many of the time bits will be identical). Or, if many GUIDs were created at the same time on different machines, you could have collisions between the time bits. So, good luck safely truncating that.
I had a situation where my software only supported 64 bits for unique IDs so I couldn't use GUIDs directly. Luckily all of the GUIDs were type 4, so I could get 64 bits that were random or nearly random. I had two million records to store, and the birthday algorithm indicated that the probability of a collision was 1.08420141198273 x 10^-07 for 64 bits and 0.007 (0.7%) for 48 bits. This should be assumed to be the best-case scenario, since a decrease in randomness will usually increase the probability of collision.
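A hedged sketch of that extraction (the helper name is mine; only valid for version 4 GUIDs): skip the bytes that carry the version and variant fields and keep 64 fully random bits:

using System;

static long GuidTo64Bits(Guid guid)
{
    byte[] b = guid.ToByteArray();
    // In ToByteArray() order, byte 7 holds the version nibble and byte 8
    // the variant bits, so take bytes 0-5 plus bytes 9-10 instead.
    byte[] random = { b[0], b[1], b[2], b[3], b[4], b[5], b[9], b[10] };
    return BitConverter.ToInt64(random, 0);
}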
I suppose that in theory, more GUID types could exist in the future than are defined now, so a future-proof truncation algorithm is not possible.
I agree with Rob - Keep all of it.
But since you said you're going into a database, I thought I'd point out that just using GUIDs doesn't necessarily mean that they will index well in a database. For that reason, the NHibernate developers created a Guid.Comb algorithm that's more DB friendly.
See NHibernate POID Generators revealed and documentation on the Guid Algorithms for more information.
NOTE: Guid.Comb is designed to improve performance on MsSQL
Truncating a GUID is a bad idea, please see this article for why.
You should consider generating a shorter GUID, as Google reveals some solutions for this. Those solutions seem to involve taking a GUID and re-encoding it in a much larger character set (close to the full 8-bit range), so the same 128 bits take fewer characters.

Locally unique identifier

Question: When you have a .NET GUID for inserting in a database, its structure is like this:
60 bits of timestamp,
48 bits of computer identifier,
14 bits of uniquifier, and
6 bits are fixed,
----
128 bits total
Now I have a problem with a GUID, because it's a 128-bit number, and some of the DBs I'm using only support 64-bit numbers.
I don't want to solve the dilemma by using an autoincrement bigint value, since I want to be able to do offline replication.
So I got the idea of creating a locally unique identifier class, which is basically a GUID downsized to a 64 bit value.
I came up with this:
day 9 bit (12*31=372 d)
year 8 bit (2266-2010 = 256 y)
seconds 17 bit (24*60*60=86400 s)
hostname 12 bit (2^12=4096)
random 18 bit (2^18=262144)
------------------------
64 bits total
The timestamp is pretty much fixed at 34 bits, leaving me with 64 - 34 = 30 bits for the hostname hash + random number.
Now my questions:
1) Would you rather increase the hostname-hash bit size and decrease the random bit size, or increase the random bit size and decrease the hostname-hash bit size?
2) Is there a hash algorithm that reduces every string to n bits, with n ideally being 12 or as near as possible?
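On question 2: any hash can be cut down to n bits by masking. A minimal sketch (MD5 and the helper name are illustrative choices; n must be below 32 here):

using System;
using System.Security.Cryptography;
using System.Text;

static int HashToNBits(string s, int n)
{
    using (var md5 = MD5.Create())
    {
        byte[] h = md5.ComputeHash(Encoding.UTF8.GetBytes(s));
        return BitConverter.ToInt32(h, 0) & ((1 << n) - 1); // keep the low n bits
    }
}

For example, HashToNBits(Environment.MachineName, 12) would fill the 12-bit hostname slot.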
Actually, .NET-generated GUIDs are 6 fixed bits and 122 bits of randomness.
You could consider just using 64 bits of randomness, with an increased chance of collision due to the smaller bit length. It would work better than a hash.
If space isn't a concern, why not just use two columns that are 64 bits wide? Split the GUID in half, using 8 bytes for each, convert those to your 64-bit numbers, and store them in the two columns. If you ever need to upsize to another system, you'll still be unique; you'll just need to factor in rejoining the two columns.
Why write your own? Why not just generate a uniformly random number? It will do the job nicely. Just grab the first X digits where X is whatever size you want... say 64-bits.
See here for info about RAND() vs. NEWID() in SQL Server, which is really just an indictment of GUIDs vs. random number generators. Also, see here if you need something more random than System.Random.
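A minimal sketch of that approach (the helper name is mine), using a cryptographic RNG rather than System.Random, per the links above:

using System;
using System.Security.Cryptography;

static long RandomId64()
{
    var bytes = new byte[8];
    using (var rng = RandomNumberGenerator.Create())
        rng.GetBytes(bytes); // fill all 8 bytes with crypto-quality randomness
    return BitConverter.ToInt64(bytes, 0);
}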

Predicting the length of an encrypted string

I am using this for encryption: http://msdn.microsoft.com/en-us/library/system.security.cryptography.rijndaelmanaged.aspx
Is there a way I can predict what the encrypted text will look like? I am converting the encrypted output to text so I can store it in the db.
I just want to make sure the size of the database column is large enough.
I am limiting the text input to be 20 characters.
Are you using SQL Server 2005 or above? If so you could just use VARCHAR(MAX) or NVARCHAR(MAX) for the column type.
If you want to be a bit more precise...
The maximum block size for RijndaelManaged is 256 bits (32 bytes).
Your maximum input size is 20 characters, so even if we assume a worst-case scenario of 4 bytes per character, that'll only amount to 80 bytes, which will then be padded up to a maximum of 96 bytes for the encryption process.
If you use Base64 encoding on the encrypted output that will create 128 characters from the 96 encrypted bytes. If you use hex encoding then that will create 192 characters from the 96 encrypted bytes (plus maybe a couple of extra characters if you're prefixing the hex string with "0x"). In either case a column width of 200 characters should give you more than enough headroom.
(NB: These are just off-the-top-of-my-head calculations. I haven't verified that they're actually correct!)
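Those encoded lengths are easy to verify; the byte values don't matter, only the count:

byte[] worstCase = new byte[96]; // the padded worst case from above
Console.WriteLine(Convert.ToBase64String(worstCase).Length);                  // 128
Console.WriteLine(BitConverter.ToString(worstCase).Replace("-", "").Length); // 192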
For an unknown encryption algorithm with no information to be found online, I would write a little test program that encrypts a random set of strings of maximum length, find the longest output, then multiply it by a safety factor based on how likely the length of the input is to change and how accurate the test program's result was.
Really generally speaking though, you're probably going to be in the 1.5x - 2x input length range.
For this specific algorithm, the length of the ciphertext will be
((length + 16) / 16) * 16
(using integer division, with the default 16-byte block size). This is to meet the block-size and padding requirements.
I suggest you also prepend a random IV to the ciphertext; that will take another 16 bytes.
However, if you want to store this as char in the database, you have to encode it, which will increase the size even more.
For Base64, multiply it by 4/3. For hex, double it.
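Putting that together, a rough sketch of the arithmetic (assuming a 16-byte block, PKCS#7-style padding, a prepended 16-byte IV, and Base64 storage; the helper name is mine):

static int PredictStoredLength(int plaintextBytes)
{
    int padded = ((plaintextBytes + 16) / 16) * 16; // integer division, as above
    int withIv = padded + 16;                       // room for the random IV
    return ((withIv + 2) / 3) * 4;                  // Base64 uses 4 chars per 3 bytes
}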
Encryption will never increase the size of data beyond the minimum padding required.
If it does 'expand' the data, it is probably not a very good encryption algorithm.

Hashtable/Dictionary collisions

Using the standard English letters and underscore only, how many characters can be used at a maximum without causing a potential collision in a hashtable/dictionary?
So strings like:
blur
Blur
b
Blur_The_Shades_Slightly_With_A_Tint_Of_Blue
...
There's no guarantee that you won't get a collision between single letters.
You probably won't, but the algorithm used in string.GetHashCode isn't specified, and could change. (In particular it changed between .NET 1.1 and .NET 2.0, which burned people who assumed it wouldn't change.)
Note that hash code collisions won't stop well-designed hashtables from working - you should still be able to get the right values out, it'll just potentially need to check more than one key using equality if they've got the same hash code.
Any dictionary which relies on hash codes being unique is missing important information about hash codes, IMO :) (Unless it's operating under very specific conditions where it absolutely knows they'll be unique, i.e. it's using a perfect hash function.)
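To see that in action, a small sketch with a contrived key type whose hash codes always collide; lookups still work because equality resolves the collision:

using System;
using System.Collections.Generic;

class BadHashKey
{
    public string Name;
    public override int GetHashCode() { return 42; } // force every key to collide
    public override bool Equals(object o)
    {
        var k = o as BadHashKey;
        return k != null && k.Name == Name;
    }
}

class Demo
{
    static void Main()
    {
        var d = new Dictionary<BadHashKey, int>();
        d[new BadHashKey { Name = "blur" }] = 1;
        d[new BadHashKey { Name = "Blur" }] = 2;
        Console.WriteLine(d[new BadHashKey { Name = "blur" }]); // prints 1
    }
}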
Given a perfect hashing function (which you're not typically going to have, as others have mentioned), you can find the maximum possible number of characters that guarantees no two strings will produce a collision, as follows:
No. of unique hash codes available = 2 ^ 32 = 4294967296 (assuming a 32-bit integer is used for hash codes)
Size of character set = 2 * 26 + 1 = 53 (26 lower-case plus 26 upper-case letters in the Latin alphabet, plus the underscore)
Then you must consider that a string of length l (or less) has a total of 54 ^ l representations. Note that the base is 54 rather than 53 because the string can terminate after any character, adding an extra possibility per character - not that it greatly affects the result.
Taking the no. of unique hash codes as your maximum number of string representations, you get the following simple equation:
54 ^ l = 2 ^ 32
And solving it:
log2 (54 ^ l) = 32
l * log2 54 = 32
l = 32 / log2 54 = 5.56
(Where log2 is the logarithm function of base 2.)
Since string lengths clearly can't be fractional, you take the integral part to give a maximum length of just 5. Very short indeed, but observe that this restriction would prevent even the remotest chance of a collision given a perfect hash function.
This is largely theoretical, however, as I've mentioned, and I'm not sure how much use it might be in the design of anything. That said, hopefully it helps you understand the matter from a theoretical viewpoint, on top of which you can add the practical considerations (e.g. non-perfect hash functions, non-uniformity of distribution).
Universal Hashing
To calculate the probability of collisions for S strings of length L, with W bits per character, hashed to H bits, assuming an optimal universal hash, you can work out the collision probability based on a hash table of size (number of buckets) N.
First things first, we can assume an ideal hashtable implementation that splits the H bits of the hash perfectly across the available buckets N. This means H becomes meaningless except as a limit on N.
W and L are simply the basis for an upper bound on S. For simpler maths, assume that strings of length less than L are padded to L with a special null character. If we are interested in the worst case, this is 54^L (26*2 + '_' + null) - plainly a ludicrous number. The actual number of entries is more useful than the character set and the length, so we will simply treat S as a variable in its own right.
We are left trying to put S items into N buckets.
This then becomes a very well known problem: the birthday paradox.
Solving this for various probabilities and numbers of buckets is instructive, but assuming we have 1 billion buckets (so about 4 GB of memory in a 32-bit system), we would need only about 37K entries before we hit a 50% chance of there being at least one collision. Given that, trying to avoid any collisions in a hashtable becomes plainly absurd.
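That 37K figure is just the birthday approximation sqrt(2 * N * ln 2) for a 50% collision chance:

Console.WriteLine(Math.Sqrt(2 * 1e9 * Math.Log(2))); // ≈ 37233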
All this does not mean that we should not care about the behaviour of our hash functions. Clearly these numbers assume ideal implementations; they are an upper bound on how good we can get. A poor hash function can give far worse collisions in some areas, or waste some of the possible 'space' by never or rarely using it, all of which can make the hash less than optimal and even degrade to performance that looks like a list, but with much worse constant factors.
The .NET framework's implementation of the string hash function is not great (in that it could be better), but it is probably acceptable for the vast majority of users and is reasonably efficient to calculate.
An Alternative Approach: Perfect Hashing
If you wish, you can generate what are known as perfect hashes. This requires full knowledge of the input values in advance, however, so it is not often useful. In a similar vein to the maths above, we can show that even perfect hashing has its limits:
Recall the limit of 54^L strings of length L. However, we only have H bits (we shall assume 32), which is about 4 billion different numbers. So if you can have truly any string, and any number of them, then you have to satisfy:
54 ^ L <= 2 ^ 32
And solving it:
log2 (54 ^ L) <= 32
L * log2 54 <= 32
L <= 32 / log2 54 <= 5.56
Since string lengths clearly can't be fractional, you are left with a maximum length of just 5. Very short indeed.
If you know that you will only ever have a set of strings well below 4 billion in size, then perfect hashing would let you handle any value of L. But restricting the set of values can be very hard in practice; you must know them all in advance, or degrade to what amounts to a database of string -> hash and add to it as new strings are encountered.
For this exercise the universal hash is optimal, as we wish to reduce the probability of any collision; i.e., for any input, the probability of it having output x from a set of possibilities R is 1/|R|.
Note that doing an optimal job on the hashing (and the internal bucketing) is quite hard but that you should expect the built in types to be reasonable if not always ideal.
In this example I have avoided the question of closed vs. open addressing. This does have some bearing on the probabilities involved, but not significantly.
A hash algorithm isn't supposed to guarantee uniqueness. Given that there are far more potential strings (26^n for length n, even ignoring special characters, spaces, capitalization, non-English characters, etc.) than there are places in your hashtable, there's no way such a guarantee could be fulfilled. It's only supposed to guarantee a good distribution.
If your key is a string (e.g., in a Dictionary<string, TValue>) then its GetHashCode() will be used. That's a 32-bit integer. Hashtable defaults to a load factor of 1.0 (one key per bucket) and increases the number of buckets to maintain that load factor. So if you do see collisions, they should tend to occur around reallocation boundaries (and decrease shortly after a reallocation).
