I have a unique situation where I need to produce hashes on the fly. Here is my situation. This question is related to here. I need to store many URLs in a database, and they need to be indexed. A URL can be over 2000 characters long. The database complains that a string over 900 bytes cannot be indexed. My solution is to hash the URL using MD5 or SHA256. I am not sure which hashing algorithm to use. Here are my requirements:
Shortest character length with minimal collision
Needs to be very fast. I will be hashing the referrer URL on every page request
Collisions need to be minimized since I may have millions of urls in the database
I am not worried about security. I am worried about character length, speed, and collisions. Anyone know of a good algorithm for this?
In your case, I wouldn't use any of the cryptographic hash functions (e.g. MD5, SHA), since they were designed with security in mind: they mainly aim to make it as hard as possible to find two different strings with the same hash. I don't think that matters in your case. (The possibility of random collisions is inherent to hashing, of course.)
I'd strongly advise against using String.GetHashCode(), since the implementation is not known and MSDN says that it might vary between different versions of the framework. Even the results between the x86 and x64 versions may be different. So you'll get into trouble when trying to access the same database using a newer (or different) version of the .NET Framework.
I found the algorithm for the Java implementation of hashCode on Wikipedia (here); it seems quite easy to implement. Even a straightforward implementation would be faster than an implementation of MD5 or SHA, in my opinion. You could also accumulate into long values, which reduces the probability of collisions.
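For illustration, a minimal sketch of that polynomial hash in C# (the multiplier 31 is the one the Java algorithm uses; the long variant is just the same loop widened, as suggested above):

static int JavaStyleHash(string s)
{
    unchecked
    {
        int h = 0;
        foreach (char c in s)
            h = 31 * h + c;  // h = s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]
        return h;
    }
}

static long JavaStyleHash64(string s)
{
    unchecked
    {
        long h = 0;
        foreach (char c in s)
            h = 31 * h + c;  // same polynomial, 64-bit accumulator
        return h;
    }
}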
There is also a short analysis of the .NET GetHashCode implementation here (not the algorithm itself, but some implementation details); you could use that one too, I guess (or try to implement the Java version in a similar way...).
A quick one:
URLString.GetHashCode().ToString("x")
While both MD5 and SHA1 have been shown to be weak where deliberate collision resistance is essential, I suspect either would be sufficient for your application. I don't know for sure, but I suspect MD5 would be the simpler and quicker of the two algorithms.
I would suggest the System.Security.Cryptography.SHA1Cng class. The hash is 160 bits (20 bytes) long, so it should definitely be small enough. If you need it as a string, it will only require 40 characters, so that should suit your needs well. It should also be fast enough, and as far as I know, no collisions have yet been found.
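A sketch of what that could look like (SHA1Cng lives in System.Core in the full .NET Framework; the hex formatting is one option among several):

using System.Security.Cryptography;
using System.Text;

static string HashUrl(string url)
{
    using (var sha1 = new SHA1Cng())
    {
        byte[] hash = sha1.ComputeHash(Encoding.UTF8.GetBytes(url));  // 20 bytes
        var sb = new StringBuilder(hash.Length * 2);
        foreach (byte b in hash)
            sb.Append(b.ToString("x2"));  // 40 hex characters total
        return sb.ToString();
    }
}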
I'd personally use String.GetHashCode(). This is the basic hash function. I honestly have no idea how it performs compared to other implementations, but it should be fine.
Either of the two hashing functions that you name should be quick enough that you won't notice much difference between them. Unless this site requires ultra-high performance I would not worry too much about it. I'd personally probably go for MD5. Its 128-bit digest can be formatted as a hexadecimal string in 32 characters or as a base64 string in 24 characters.
The reason I'd go for MD5 is that you are very unlikely to run into collisions, and even if you do, you can structure your queries as "where urlhash = @hash and url = @url". The database engine should work out that one is indexed and the other isn't, and use that information to do a sensible search.
If there are collisions, the indexed scan on urlhash will return a handful of results which are easy to text-compare against to find the right one. This is unlikely to be relevant very often, though; you have pretty low chances of getting collisions this way.
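A sketch of that lookup (the table, column, and helper names here are made up for illustration):

using System.Data.SqlClient;

static object FindUrlId(SqlConnection connection, string url, string urlHash)
{
    using (var cmd = new SqlCommand(
        "SELECT id FROM Urls WHERE urlhash = @hash AND url = @url", connection))
    {
        // the engine seeks on the indexed urlhash column and only
        // text-compares url against the handful of candidate rows
        cmd.Parameters.AddWithValue("@hash", urlHash);
        cmd.Parameters.AddWithValue("@url", url);
        return cmd.ExecuteScalar();  // null if no row matches
    }
}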
Reflected source code of GetHashCode function in .net 4.0
public override unsafe int GetHashCode()
{
    fixed (char* str = ((char*) this))
    {
        char* chPtr = str;
        int num = 0x15051505;
        int num2 = num;
        int* numPtr = (int*) chPtr;
        // Walk the string as 32-bit chunks (two chars each), mixing two
        // interleaved accumulators; four chars are consumed per iteration.
        for (int i = this.Length; i > 0; i -= 4)
        {
            num = (((num << 5) + num) + (num >> 0x1b)) ^ numPtr[0];
            if (i <= 2)
            {
                break;
            }
            num2 = (((num2 << 5) + num2) + (num2 >> 0x1b)) ^ numPtr[1];
            numPtr += 2;
        }
        return (num + (num2 * 0x5d588b65));
    }
}
There are O(n) simple operations (+, <<, ^) and a single multiplication at the end, so this is very fast.
I've tested this function on a database of 3 million strings with lengths up to 256 characters, and about 97% of the strings had no collision (at most 5 strings shared the same hash).
You may want to look at the following project:
CMPH - C Minimal Perfect Hashing Library
And check out the following hot topics listing for perfect hashes:
Hottest 'perfect-hash' Answers - Stack Overflow
You could also consider using a full text index in SQL rather than hashing:
CREATE FULLTEXT INDEX (Transact-SQL)
I noticed that the hash codes I got from other objects were different when I built for either x86 or x64.
Up until now I have implemented most of my own hashing functions like this:
int someIntValueA;
int someIntValueB;
const int SHORT_MASK = 0xFFFF;
public override int GetHashCode()
{
    return (someIntValueA & SHORT_MASK) + ((someIntValueB & SHORT_MASK) << 16);
}
Will storing the values in a long and getting the hashcode from that give me a wider range as well on 64-bit systems, or is this a bad idea?
public override int GetHashCode()
{
    // the cast to long is needed; without it, << 32 on an int is a no-op
    long maybeBiggerSpectrumPossible = someIntValueA + ((long)someIntValueB << 32);
    return maybeBiggerSpectrumPossible.GetHashCode();
}
No, that will be far worse.
Suppose your int values are typically in the range of a short: between -30000 and +30000. And suppose further that most of them are near the middle, say, between 0 and 1000. That's pretty typical. With your first hash code you get all the bits of both ints into the hash code and they don't interfere with each other; the number of collisions is zero under typical conditions.
But when you do your trick with a long, you rely on what the long implementation of GetHashCode does, which is to xor the upper 32 bits with the lower 32 bits. So your new implementation is just a slow way of writing int1 ^ int2, which in the typical scenario has almost all zero bits, and hence collisions all over the place.
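To see the collapse concretely (assuming the usual Int64.GetHashCode implementation, which xors the two 32-bit halves together):

int a = 400, b = 700;                     // typical small values
long packed = ((long)b << 32) | (uint)a;  // the intended packing
int viaLong = packed.GetHashCode();       // folds the halves together...
int direct = a ^ b;                       // ...so this gives the same value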
The approach you suggest won't make anything any better (quite the opposite).
However…
SpookyHash, for example, is designed to work particularly quickly on 64-bit systems, because the author was thinking about what would be fast on a 64-bit system when working out the math. xxHash has 32-bit and 64-bit variants that are designed to give comparable hash quality at better speed for 32-bit and 64-bit computation respectively.
The general idea of exploiting the differing performance of arithmetic operations on different machines is a valid one.
And your general idea of using larger intermediary storage in the hash calculation is also a valid one, as long as those extra bits make their way into subsequent operations.
So at a very general level, the answer is yes, even if your particular implementation fails to come through with that.
Now, in practice, when you're sitting down to write a hashcode implementation should you worry about this?
Well, it depends. For a while I was very bullish about algorithms like SpookyHash, and it does very well (even on 32-bit systems) when the hash is based on a large amount of source data. But on the other hand it can be better, especially when used with smaller hash-based sets and dictionaries, to be crappy really fast than fantastic slowly. So there isn't a one-size-fits-all answer. With just two input integers, your initial solution is likely to beat a strongly avalanching algorithm like xxHash or SpookyHash for many uses. You could perhaps do better if you rotated by 16 rather than shifted (fun fact, some jitters are optimised for that pattern; see the sketch below), but we're not touching on 64- vs 32-bit versions in that at all.
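A sketch of that rotate idea (the rotation amount and the xor combine are illustrative choices on my part, not a published recipe):

public override int GetHashCode()
{
    unchecked
    {
        uint h = (uint)someIntValueA;
        h = (h << 16) | (h >> 16);  // rotate left 16; some jitters emit a
                                    // single rotate instruction for this
        return (int)(h ^ (uint)someIntValueB);
    }
}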
The cases where you do find a big possible improvement with taking a different approach in 64- and 32-bit are where there's a large amount of data to mix in, especially if it's in a blittable form (like string or byte[]) that you can access via a long* or int* depending on framework.
So, generally you can ignore the question of bitness, but if you find yourself thinking "this hashcode has to go through so much stuff to get an answer; can I make it better?" then maybe it's time to consider such matters.
I'm looking for an alternative to the BigInteger type of C#, which was introduced with .NET 4.x.
The mathematical operations with this type are terribly slow. I guess this is caused by the fact that the arithmetic is done at a higher level than the primitive types, or it is badly optimized, whatever.
Int64/long/ulong or other 64-bit numbers are way too small and won't calculate correctly. I'm talking about a 64-bit integer raised to the power of a 64-bit integer.
Hopefully someone can suggest something. Thanks in advance.
Honestly, if you have extremely large numbers and need to do heavy computations with them and the BigInteger library still isn't cutting it for you, why not offload it onto an external process using whatever language or toolkit you know of that does it best? Are you truly constrained to write whatever it is you're trying to accomplish entirely in C#?
For example, you can offload to MATLAB in C#.
BigInteger is indeed very slow. One of the reasons is its immutability.
If you do a = a - b, you will get a new copy of a. Normally this is fast, but with BigInteger and, say, an integer of 2048 bits, it will need to allocate an extra 256 bytes.
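To illustrate the point (sizes approximate):

using System.Numerics;

BigInteger a = BigInteger.Pow(2, 2048);  // magnitude takes roughly 256 bytes
BigInteger b = 1;
a = a - b;  // allocates a brand-new ~256-byte BigInteger;
            // nothing is mutated in place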
It should also have different multiplication algorithms depending on integer size (I assume it is not that sophisticated). What I mean is that for very, very large integers an algorithm based on Fourier transforms works best, while for smaller integers you break the work down into smaller multiplies (a divide-and-conquer approach). See http://en.wikipedia.org/wiki/Multiplication_algorithm for more.
Either way, there are alternatives, none of which I have used or tested. They might be slower than the .NET built-in for all I know (making a test case and doing some valid measurement is your friend).
Google 'C# large integer multiplication' for a lot of homemade BigInteger implementations (usually from before C# 4.0, when BigInteger was introduced):
https://github.com/devoyster/IntXLib
http://gmplib.org/ (there are C# wrappers)
http://www.extremeoptimization.com/ (commercial)
http://mathnetnumerics.codeplex.com/ (nice opensource, but not much onboard for very large integers)
public static int PowerBySquaring(int baseNumber, int exponent)
{
    int result = 1;
    while (exponent != 0)
    {
        // multiply in the current power of the base whenever the
        // corresponding bit of the exponent is set
        if ((exponent & 1) == 1)
        {
            result *= baseNumber;
        }
        exponent >>= 1;
        baseNumber *= baseNumber;
    }
    return result;
}
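The same square-and-multiply loop sketched with BigInteger, so that large bases don't overflow the int type (this assumes a non-negative exponent; note that BigInteger.Pow and BigInteger.ModPow already cover this ground):

using System.Numerics;

public static BigInteger PowerBySquaring(BigInteger baseNumber, long exponent)
{
    BigInteger result = 1;
    while (exponent != 0)
    {
        if ((exponent & 1) == 1)
        {
            result *= baseNumber;
        }
        exponent >>= 1;
        baseNumber *= baseNumber;  // each step allocates a fresh BigInteger
    }
    return result;
}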
EDIT: Sorry, I forgot to mention: I'm not using the built-in SHA-512 implementation directly because, as far as I can tell, it doesn't involve a salt value or a specified number of rounds to compute the hash with.
Okay, so I'm implementing the SHA-512 crypt in C#, and I'm following the steps found here...
http://people.redhat.com/drepper/SHA-crypt.txt
This is my first time doing anything encryption-related, so I want to make sure I'm understanding the steps correctly... I don't understand C code well enough to do a direct translation from C to C#. :/
I have assumed finishing a digest is the same as computing the hash. In this case, I've also assumed that when the steps refer to a finished digest, they are referring to the computed hash rather than the pre-hash computed digest bytes. Correct me if I'm wrong, please!
Assuming everything has been done correctly for steps 1-8, my doubts start at step 9
9. For each block of 32 or 64 bytes in the password string (excluding
the terminating NUL in the C representation), add digest B to digest A
Since I'm using SHA-512, I have block sizes of 64 bytes.
Would the following code produce the desired result?
//FYI, temp = digestA from steps 1-3 (before expanding digestA for step 9)
//alt_result = computed digestB hash (64-byte hash)
int i = 0;  // declared outside the loop so the output offset actually advances
for (cnt = key.Length; cnt > 64; cnt -= 64) //9
{
    ctx.TransformBlock(alt_result, 0, 64, digestA, temp.Length + 64 * i);
    i++;
}
If anyone can clarify that what I've stated is correct, I would appreciate it. Thanks!
Salting is as simple as appending a fixed byte string to the end of your input string, essentially providing a known "homegrown" transform of your input.
About the algorithm itself: you seem to be starting at a disadvantage. As a neophyte, you're making a lot of "assumptions" about basic crypto terminology that need clarification. If the CLR implementation won't work for you, I think your time would be better spent finding a good C implementation and figuring out how to integrate with that. Figuring out the interop (extern) calls will be far easier than diving into the intricacies of crypto, the results will be more efficient, and the knowledge you gain about native interop will be far more useful/reusable.
I'll add some important clarification for others who might come across this later.
First:
SHA512 and SHA512Crypt are two distinct algorithms for two different purposes. SHA512 is a general purpose hashing algorithm (see this). SHA512Crypt is a password storage or password based key derivation algorithm that uses SHA512 (hash) internally (see this). SHA512Crypt is based on the earlier Crypt function that used MD5 instead of SHA512.
The password storage/key derivation algorithms have been specifically created to make brute forcing orders of magnitude more expensive. The typical way this is done is by iterating over the underlying hash algorithm in some fashion. However, you don't want to do this yourself... which brings us to...
Second:
Do NOT write your own cryptography methods. (see this) There are tons of ways to screw it up, even if you know exactly what you are doing.
If you don't want to use the built-in Rfc2898DeriveBytes because it is based on SHA-1, you could look at bcrypt or some other public, reviewed implementation of a known cryptographic algorithm.
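For reference, a sketch of the built-in PBKDF2 usage (the salt size and iteration count here are illustrative values, not recommendations):

using System.Security.Cryptography;

byte[] salt = new byte[16];
using (var rng = RandomNumberGenerator.Create())
    rng.GetBytes(salt);  // a fresh random salt per password

using (var kdf = new Rfc2898DeriveBytes("p@ssw0rd", salt, 10000))
{
    byte[] key = kdf.GetBytes(32);  // 256-bit derived key
}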
I have a string (call it str) and I generate a hash code (call it H) from it.
I want to recover the original string (str) from the received hash code (H).
The short answer is you can't.
Creating a hash code is a one-way operation: there is no reverse operation. The reason for this is that there are (for all practical purposes) infinitely many strings, but only finitely many hash codes (the number of possible hash codes is bounded by the range of an int). Each hash code could have been generated from any one of the infinitely many strings that produce it, and there's no way to know which.
You can try to do it through a brute-force attack or with the help of rainbow tables.
Anyway, even if you succeeded in finding something with those methods, you would only find a string with the same hash code as the original; you could ABSOLUTELY not be sure it was the original string, because hash codes are not unique.
Mmh, maybe "absolutely" is even a bit restrictive, because probability says you 99.999999999999...% won't find the same string. :D
Hashing is generating a short fixed size value from a usually larger input. It is in general not reversible.
Mathematically impossible. There are only 2^32 different ints, but almost infinitely many strings, so it follows from the pigeonhole principle that you can't restore the string.
You can find a string that matches the HashCode pretty easily, but it probably won't be the string that was originally hashed.
GetHashCode() is designed for use in hash tables and as such is just a performance trick. It allows quick sorting of the input value into buckets, and nothing more. Its value is implementation-defined, so another .NET version, or even another instance of the same application, might return a different value. return 0; is a valid (but not recommended) implementation of GetHashCode, and it would yield no information about the original string.
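That degenerate-but-legal implementation, for emphasis:

public override int GetHashCode()
{
    // every instance lands in the same bucket: hash tables still work
    // correctly, just slowly - and nothing about the string leaks out
    return 0;
}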
many of us would like to be able to do that :=)
I have a little problem where I need to hash a number of about 10 digits into a number of 6 digits. The hash needs to be deterministic.
It's more important that the hash is not resource intensive.
For example, say that I have some number, x, like 123456789
I want to write a hash function that gives me back a number, y, like 987654.
I'd then like to have a function that takes x and y as parameters, re-applies the hash to x, and checks that the result is y.
It should be difficult to compute possible input values given the hash.
My first idea of multiplying pairs of digits led to a lot of duplicate hashed values.
I have the feeling that this sort of problem has some kind of elegant solution, but I just can't think of it myself.
Can anyone help me out here? Thanks in advance :)
What you need is called "hashing".
Try CRC16.
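A sketch of one common CRC-16 variant (the reflected 0xA001 polynomial); its 16-bit result is at most 65535, which fits your 6-digit budget:

static ushort Crc16(byte[] data)
{
    ushort crc = 0;
    foreach (byte b in data)
    {
        crc ^= b;
        for (int i = 0; i < 8; i++)
            crc = (ushort)((crc & 1) != 0 ? (crc >> 1) ^ 0xA001 : crc >> 1);
    }
    return crc;
}

// e.g. Crc16(BitConverter.GetBytes(1234567890L))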
Your problem as stated is not solvable.
You say that you want the system to be "somewhat hard to break", by which I assume you mean that it is "somewhat hard" for an attacker to take a known digest and produce from it a possible input which hashes to the given digest. Since there are only 4 billion possible inputs and only 65536 possible hashes in the system you propose, it is utterly trivial to find a message that corresponds to a given hash, no matter what the hash algorithm is. On average, the attacker will have about 65000 possible messages to choose from, and can therefore cherry-pick the message that best serves his nefarious scheme.
I would expect a "somewhat hard" problem in the hash-breaking space to require dedicating, say, a few million dollars' worth of supercomputer time to break. Your proposal can be broken by inexperienced high-school students writing JavaScript programs that take a couple of minutes to write and maybe a minute to run, tops; this is not even vaguely close to "somewhat hard".
Why are you choosing such tiny limits on your algorithm, limits which will by their very nature make it trivial to break the hashing? And for that matter, what's the value in hashing such a tiny amount of data as a 32 bit integer?
((X >> 16) ^ X) & 0xFFFF
What you want to do is to try to distribute the hash values as evenly as possible over the range. Some of the built-in hashing methods are fairly good at this, so you could perhaps try something like getting the hash code of the string representation and simply throwing away half of the bits:
ushort code = (ushort)value.ToString().GetHashCode();
However, it also depends on what you are going to use the hash code for. The built-in hash codes are not intended to be stored permanently; the algorithms for calculating them can change with any new version of the framework, so if you store the hash codes in the database they may become useless in the future. In that case you would instead have to create the hashing algorithm yourself from scratch, or use some hashing algorithm that was designed for permanent storage.
One simple algorithm that is used for hash codes for some values in the framework is to use exclusive or to make all bits in the value matter when the hash code is smaller than the data:
byte[] b = BitConverter.GetBytes(value);
ushort code = (ushort)(BitConverter.ToUInt16(b, 0) ^ BitConverter.ToUInt16(b, 2));
or the more efficient but less obvious way to do the same:
ushort code = (ushort)((value >> 16) ^ value);
This of course has no obfuscating properties for small values, so you might want to throw in some "random" bits to make the hash code significantly different from the value:
ushort code = (ushort)(0x56D4 ^ (value >> 16) ^ value);
How about just discarding the lower 16 bits or last 4 digits?
1234567890 --> 123456
Easily done by just doing an integer division by 10000.