I have a bunch of IDs, that are in the String form, like "enemy1", "enemy2".
I want to save a progress, depends on how many of each enemies I killed. For that goal I use a dictionary like { { "enemy1", 0 }, { "enemy2", 1 } }.
Then I want to share player's save between few machines he can play into (like PC and laptop) via network (serialize it in JSON file first). For size decreasing and perfomance inreasing, i use hashes instead full string, using that alg (becouse MDSN said, that default hash alg can be different on different machines):
int hash_ = 0;
public override int GetHashCode()
{
if(hash_ == 0)
{
hash_ = 5381;
foreach(var ch in id_)
hash_ = ((hash_ << 5) + hash_) ^ ch;
}
return hash_;
}
So, the question is: is that alg in C# will return the same results in any machine player will use.
UPD: in comments i note that the main part of question was unclear.
So. If i can guarantee that all files before deserialization will be in the same encoding, is char representation on every machine that player can use will be the same and operation ^ ch will give same result? I mean WinX64/WinX32/Mac/Linux/... machines
Yes, that code will give the same result on every platform, for the same input. A char is a UTF-16 code unit, regardless of platform, and any given char will convert to the same int value on every platform. As normal with hash codes computed like this, you shouldn't assume that equal hash codes implies equal original values. (It's unclear how you're intending to use the hash, to be honest.)
I would point out that your code isn't thread-safe though - if two threads call GetHashCode at basically the same time, one may see a value of 0 (and therefore start hashing) whereas the second may see an interim result (as computed by the first thread) and assume that's the final hash. If you really believe caching is important here (and I'd test that first) you should compute the complete hash using a local variable, then copy it to the field only when you're done.
Related
I'm working in C#. I have an unsigned 32-bit integer i that is incremented gradually in response to an outside user controlled event. The number is displayed in hexadecimal as a unique ID for the user to be able to enter and look up later. I need i to display a very different 8 character string if it is incremented or two integers are otherwise close together in value (say, distance < 256). So for example, if i = 5 and j = 6 then:
string a = Encoded(i); // = "AF293E5B"
string b = Encoded(j); // = "CD2429A4"
The limitations on this are:
I don't want an obvious pattern in how the string changes in each increment.
The process needs to be reversible, so if given the string I can generate the original number.
Each generated string needs to be unique for the entire range of a 32-bit unsigned integers, so that two numbers don't ever produce the same string.
The algorithm to produce the string should be fairly easy to implement and maintain for both encoding and decoding (maybe 30 lines each or less).
However:
The algorithm does not need to be cryptographically secure. The goal is obfuscation more than encryption. The number itself is not secret, it just needs to not obviously be an incrementing number.
It is alright if looking at a large list of incremented numbers a human can discern a pattern in how the strings are changing. I just don't want it to be obvious if they are "close".
I recognize that a Minimal Perfect Hash Function meets these requirements, but I haven't been able to find one that will do what I need or learn how to derive one that will.
I have seen this question, and while it is along similar lines, I believe my question is more specific and precise in its requirements. The answer given for that question (as of this writing) references 3 links for possible implementations, but not being familiar with Ruby I'm not sure how to get at the code for the "obfuscate_id" (first link), Skipjack feels like overkill for what I need (2nd link), and Base64 does not use the character set I'm interested in (hex).
y = p * x mod q is reversible if p and q are co-primes. In particular, mod 2^32 is easy, and any odd number is a co-prime of 2^32. Now 17,34,51,... is a bit too easy, but the pattern is less obvious for 2^31 < p < 2^32-2^30 (0x8000001-0xBFFFFFFF).
I would like to generate coupon codes , e.g. AYB4ZZ2. However, I would also like to be able to mark the used coupons and limit their global number, let's say N. The naive approach would be something like "generate N unique alphanumeric codes, put them into database and perform a db search on every coupon operation."
However, as far as I realize, we can also attempt to find a function MakeCoupon(n), which converts the given number into a coupon-like string with predefined length.
As far as I understand, MakeCoupon should fullfill the following requirements:
Be bijective. It's inverse MakeNumber(coupon) should be effectively computable.
Output for MakeCoupon(n) should be alphanumeric and should have small and constant length - so that it could be called human readable. E.g. SHA1 digest wouldn't pass this requirement.
Practical uniqueness. Results of MakeCoupon(n) for every natural n <= N should be totally unique or unique in the same terms as, for example, MD5 is unique (with the same extremely small collision probability).
(this one is tricky to define) It shouldn't be obvious how to enumerate all remaining coupons from a single coupon code - let's say MakeCoupon(n) and MakeCoupon(n + 1) should visually differ.
E.g. MakeCoupon(n), which simply outputs n padded with zeroes would fail this requirement, because 000001 and 000002 don't actually differ "visually".
Q:
Does any function or function generator, which fullfills the following requirements, exist? My search attempts only lead me to [CPAN] CouponCode, but it does not fullfill the requirement of the corresponding function being bijective.
Basically you can split your operation into to parts:
Somehow "encrypt" your initial number n, so that two consecutive numbers yield (very) different results
Construct your "human-readable" code from the result of step 1
For step 1 I'd suggest to use a simple block cipher (e.g. a Feistel cipher with a round function of your choice). See also this question.
Feistel ciphers work in several rounds. During each round, some round function is applied to one half of the input, the result is xored with the other half and the two halves are swapped. The nice thing about Feistel ciphers is that the round function hasn't to be two-way (the input to the round function is retained unmodified after each round, so the result of the round function can be reconstructed during decryption). Therefore you can choose whatever crazy operation(s) you like :). Also Feistel ciphers are symmetric, which fulfills your first requirement.
A short example in C#
const int BITCOUNT = 30;
const int BITMASK = (1 << BITCOUNT/2) - 1;
static uint roundFunction(uint number) {
return (((number ^ 47894) + 25) << 1) & BITMASK;
}
static uint crypt(uint number) {
uint left = number >> (BITCOUNT/2);
uint right = number & BITMASK;
for (int round = 0; round < 10; ++round) {
left = left ^ roundFunction(right);
uint temp = left; left = right; right = temp;
}
return left | (right << (BITCOUNT/2));
}
(Note that after the last round there is no swapping, in the code the swapping is simply undone in the construction of the result)
Apart from fulfilling your requirements 3 and 4 (the function is total, so for different inputs you get different outputs and the input is "totally scrambled" according to your informal definition) it is also it's own inverse (thus implicitely fulfilling requirement 1), i.e. crypt(crypt(x))==x for each x in the input domain (0..2^30-1 in this implementation). Also it's cheap in terms of performance requirements.
For step 2 just encode the result to some base of your choice. For instance, to encode a 30-bit number, you could use 6 "digits" of an alphabet of 32 characters (so you can encode 6*5=30 bits).
An example for this step in C#:
const string ALPHABET= "AG8FOLE2WVTCPY5ZH3NIUDBXSMQK7946";
static string couponCode(uint number) {
StringBuilder b = new StringBuilder();
for (int i=0; i<6; ++i) {
b.Append(ALPHABET[(int)number&((1 << 5)-1)]);
number = number >> 5;
}
return b.ToString();
}
static uint codeFromCoupon(string coupon) {
uint n = 0;
for (int i = 0; i < 6; ++i)
n = n | (((uint)ALPHABET.IndexOf(coupon[i])) << (5 * i));
return n;
}
For inputs 0 - 9 this yields the following coupon codes
0 => 5VZNKB
1 => HL766Z
2 => TMGSEY
3 => P28L4W
4 => EM5EWD
5 => WIACCZ
6 => 8DEPDA
7 => OQE33A
8 => 4SEQ5A
9 => AVAXS5
Note, that this approach has two different internal "secrets": First, the round function together with the number of rounds used and second, the alphabet you use for encoding the encyrpted result. But also note, that the shown implementation is in no way secure in a cryptographical sense!
Also note, that the shown function is a total bijective function, in the sense, that every possible 6-character code (with characters out of your alphabet) will yield a unique number. To prevent anyone from entering just some random code, you should define some kind of restictions on the input number. E.g. only issue coupons for the first 10.000 numbers. Then, the probability of some random coupon code to be valid would be 10000/2^30=0.00001 (it would require about 50000 attempts to find a correct coupon code). If you need more "security", you can just increase the bit size/coupon code length (see below).
EDIT: Change Coupon code length
Changing the length of the resulting coupon code requires some math: The first (encrypting) step only works on a bit string with even bit count (this is required for the Feistel cipher to work).
In the the second step, the number of bits that can be encoded using a given alphabet depends on the "size" of chosen alphabet and the length of the coupon code. This "entropy", given in bits, is, in general, not an integer number, far less an even integer number. For example:
A 5-digit code using a 30 character alphabet results in 30^5 possible codes which means ld(30^5)=24.53 bits/Coupon code.
For a four-digit code, there is a simple solution: Given a 32-Character alphabet you can encode *ld(32^4)=5*4=20* Bits. So you can just set the BITCOUNT to 20 and change the for loop in the second part of the code to run until 4 (instead of 6)
Generating a five-digit code is a bit trickier and somhow "weakens" the algorithm: You can set the BITCOUNT to 24 and just generate a 5-digit code from an alphabet of 30 characters (remove two characters from the ALPHABET string and let the for loop run until 5).
But this will not generate all possible 5-digit-codes: with 24 bits you can only get 16,777,216 possible values from the encryption stage, the 5 digit codes could encode 24,300,000 possible numbers, so some possible codes will never be generated. More specifically, the last position of the code will never contain some characters of the alphabet. This can be seen as a drawback, because it narrows down the set of valid codes in an obvious way.
When decoding a coupon code, you'll first have to run the codeFromCoupon function and then check, if bit 25 of the result is set. This would mark an invalid code that you can immediately reject. Note that, in practise, this might even be an advantage, since it allows a quick check (e.g. on the client side) of the validity of a code without giving away all internals of the algorithm.
If bit 25 is not set you'll call the crypt function and get the original number.
Though I may get docked for this answer I feel like I need to respond - I really hope that you hear what I'm saying as it comes from a lot of painful experience.
While this task is very academically challenging, and software engineers tend to challenge their intelect vs. solving problems, I need to provide you with some direction on this if I may. There is no retail store in the world, that has any kind of success anyway, that doesn't keep very good track of each and every entity that is generated; from each piece of inventory to every single coupon or gift card they send out those doors. It's just not being a good steward if you are, because it's not if people are going to cheat you, it's when, and so if you have every possible item in your arsenal you'll be ready.
Now, let's talk about the process by which the coupon is used in your scenario.
When the customer redeems the coupon there is going to be some kind of POS system in front right? And that may even be an online business where they are then able to just enter their coupon code vs. a register where the cashier scans a barcode right (I'm assuming that's what we're dealing with here)? And so now, as the vendor, you're saying that if you have a valid coupon code I'm going to give you some kind of discount and because our goal was to generate coupon codes that were reversable we don't need a database to verify that code, we can just reverse it right! I mean it's just math right? Well, yes and no.
Yes, you're right, it's just math. In fact, that's also the problem because so is cracking SSL. But, I'm going to assume that we all realize the math used in SSL is just a bit more complex than anything used here and the key is substantially larger.
It does not behoove you, nor is it wise for you to try and come up with some kind of scheme that you're just sure nobody cares enough to break, especially when it comes to money. You are making your life very difficult trying to solve a problem you really shouldn't be trying to solve because you need to be protecting yourself from those using the coupon codes.
Therefore, this problem is unnecessarily complicated and could be solved like this.
// insert a record into the database for the coupon
// thus generating an auto-incrementing key
var id = [some code to insert into database and get back the key]
// base64 encode the resulting key value
var couponCode = Convert.ToBase64String(id);
// truncate the coupon code if you like
// update the database with the coupon code
Create a coupon table that has an auto-incrementing key.
Insert into that table and get the auto-incrementing key back.
Base64 encode that id into a coupon code.
Truncate that string if you want.
Store that string back in the database with the coupon just inserted.
What you want is called Format-preserving encryption.
Without loss of generality, by encoding in base 36 we can assume that we are talking about integers in 0..M-1 rather than strings of symbols. M should probably be a power of 2.
After choosing a secret key and specifying M, FPE gives you a pseudo-random permutation of 0..M-1 encrypt along with its inverse decrypt.
string GenerateCoupon(int n) {
Debug.Assert(0 <= n && n < N);
return Base36.Encode(encrypt(n));
}
boolean IsCoupon(string code) {
return decrypt(Base36.Decode(code)) < N;
}
If your FPE is secure, this scheme is secure: no attacker can generate other coupon codes with probability higher than O(N/M) given knowledge of arbitrarily many coupons, even if he manages to guess the number associated with each coupon that he knows.
This is still a relatively new field, so there are few implementations of such encryption schemes. This crypto.SE question only mentions Botan, a C++ library with Perl/Python bindings, but not C#.
Word of caution: in addition to the fact that there are no well-accepted standards for FPE yet, you must consider the possibility of a bug in the implementation. If there is a lot of money on the line, you need to weigh that risk against the relatively small benefit of avoiding a database.
You can use a base-36 number system. Assume that you want 6 characters in the coupen output.
pseudo code for MakeCoupon
MakeCoupon(n)
{
Have an byte array of fixed size, say 6. Initialize all the values to 0.
convert the number to base - 36 and store the 'digits' in an array
(using integer division and mod operations)
Now, for each 'digit' find the corresponding ascii code assuming the
digits to start from 0..9,A..Z
With this convension output six digits as a string.
}
Now the calculating the number back is the reverse of this operation.
This would work with very large numbers (35^6) with 6 allowed characters.
Choose a cryptographic function c. There are a few requirements on c, but for now let us take SHA1.
choose a secret key k.
Your coupon code generating function could be, for number n:
concatenate n and k as "n"+"k" (this is known as salting in password management)
compute c("n"+"k")
the result of SHA1 is 160bits, encode them (for instance with base64) as an ASCII string
if the result is too long (as you said it is the case for SHA1), truncate it to keep only the first 10 letters and name this string s
your coupon code is printf "%09d%s" n s, i.e. the concatenation of zero-padded n and the truncated hash s.
Yes, it is trivial to guess n the number of the coupon code (but see below). But it is hard to generate another valid code.
Your requirements are satisfied:
To compute the reverse function, just read the first 9 digits of the code
The length is always 19 (9 digits of n, plus 10 letters of hash)
It is unique, since the first 9 digits are unique. The last 10 chars are too, with high probability.
It is not obvious how to generate the hash, even if one guesses that you used SHA1.
Some comments:
If you're worried that reading n is too obvious, you can obfuscate it lightly, like base64 encoding, and alternating in the code the characters of n and s.
I am assuming that you won't need more than a billion codes, thus the printing of n on 9 digits, but you can of course adjust the parameters 9 and 10 to your desired coupon code length.
SHA1 is just an option, you could use another cryptographic function like private key encryption, but you need to check that this function remains strong when truncated and when the clear text is provided.
This is not optimal in code length, but has the advantage of simplicity and widely available libraries.
EDIT: 64 or 128 bit would also work. My brain just jumped to 32bit for some reason, thinking it would be sufficient.
I have a struct that is composed of mostly numeric values (int, decimal), and 3 strings that are never more than 12 alpha-characters each. I'm trying to create an integer value that will work as a hash code, and trying to create it quickly. Some of the numeric values are also nullable.
It seems like BitVector32 or BitArray would be useful entities for use in this endevor, but I'm just not sure how to bend them to my will in this task. My struct contains 3 strings, 12 decimals (7 of which are nullable), and 4 ints.
To simplify my use case, lets say you have the following struct:
public struct Foo
{
public decimal MyDecimal;
public int? MyInt;
public string Text;
}
I know I can get numeric identifiers for each value. MyDecimal and MyInt are of course unique, from a numerical standpoint. And the string has a GetHashCode() function which will return a usually-unique value.
So, with a numeric identifier for each, is it possible to generate a hash code that uniquely identifies this structure? e.g. I can compare 2 different Foo's containing the same values, and get the same Hash Code, every time (regardless of app domain, restarting the app, time of day, alignment of Jupiters moons, etc).
The hash would be sparse, so I don't anticipate collisions from my use cases.
Any ideas? My first run at it I converted everything to a string representation, concated it, and used the built-in GetHashCode() but that seems terribly ... inefficient.
EDIT: A bit more background information. The structure data is being delivered to a webclient, and the client does a lot of computation of included values, string construction, etc to re-render the page. The aforementioned 19 field structure represent a single unit of information, each page could have many of units. I'd like to do some client-side caching of the rendered result, so I can quickly re-render a unit without recomputing on the client side if I see the same hash identifier from the server. JavaScript numeric values are all 64 bit, so I suppose my 32bit constraint is artificial and limiting. 64 bit would work, or I suppose even 128 bit if I can break it into two 64 bit values on the server.
Well, even in a sparse table one should better be prepared for collisions, depending on what "sparse" means.
You would need to be able to make very specific assumptions about the data you will be hashing at the same time to beat this graph with 32 bits.
Go with SHA256. Your hashes will not depend on CLR version and you will have no collisions. Well, you will still have some, but less frequently than meteorite impacts, so you can afford not anticipating any.
Two things I suggest you take a look at here and here. I don't think you'll be able to GUARANTEE no collisions with just 32 bits.
Hash codes by definition of a hash function are not meant to be unique. They are only meant to be as evenly distributed across all result values as possible. Getting a hash code for an object is meant to be a quick way to check if two objects are different. If hash codes for two objects are different then those objects are different. But if hash codes are the same you have to deeply compare the objects to be be sure. Hash codes main usage is in all hash-based collections where they make it possible for nearly O(1) retrieval speed.
So in this light, your GetHashCode does not have to be complex and in fact it shouldn't. It must be balanced between being very quick and producing evenly distributed values. If it takes too long to get a hash code it makes it pointless because advantage over deep compare is gone. If on the other extreme end, hash code would always be 1 for example (lighting fast) it would lead to deep compare in every case which makes this hash code pointless too.
So get the balance right and don't try to come up with a perfect hash code. Call GetHashCode on all (or most) of your members and combine the results using Xor operator maybe with a bitwise shift operator << or >>. Framework types have GetHashCode quite optimized although they are not guaranteed to be the same in each application run. There is no guarantee but they also do not have to change and a lot of them don't. Use a reflector to make sure or create your own versions based on the reflected code.
In your particular case deciding if you have already processed a structure by just looking at its hash code is a bit risky. The better the hash the smaller the risk but still. The ultimate and only unique hash code is... the data itself. When working with hash codes you must also override Object.Equals for your code to be truly reliable.
I believe the usual method in .NET is to call GetHashCode on each member of the structure and xor the results.
However, I don't think GetHashCode claims to produce the same hash for the same value in different app domains.
Could you give a bit more information in your question about why you want this hash value and why it needs to be stable over time, different app domains etc.
What goal are you after? If it is performance then you should use a class since a struct will be copied by value whenever you pass it as a function parameter.
3 strings, 12 decimals (7 of which are nullable), and 4 ints.
On a 64 bit machine a pointer will be 8 bytes in size a decimal takes 16 bytes and an int 4 bytes. Ignoring padding your struct will use 232 bytes per instance. This is much bigger compared to the recommened maximum of 16 bytes which makes sense perf wise (classes take up at least 16 bytes due to its object header, ...)
If you need a fingerprint of the value you can use a cryptographically grade hash algo like SHA256 which will produce a 16 byte fingerprint. This is still not uniqe but at least unique enough. But this will cost quite some performance as well.
Edit1:
After you made clear that you need the hash code to identify the object in a Java Script web client cache I am confused. Why does the server send the same data again? Would it not be simpler to make the server smarter to send only data the client has not yet received?
A SHA hash algo could be ok in your case to create some object instance tag.
Why do you need a hash code at all? If your goal is to store the values in a memory efficient manner you can create a FooList which uses dictionaries to store identical values only once and uses and int as lookup key.
using System;
using System.Collections.Generic;
namespace MemoryEfficientFoo
{
class Foo // This is our data structure
{
public int A;
public string B;
public Decimal C;
}
/// <summary>
/// List which does store Foos with much less memory if many values are equal. You can cut memory consumption by factor 3 or if all values
/// are different you consume 5 times as much memory as if you would store them in a plain list! So beware that this trick
/// might not help in your case. Only if many values are repeated it will save memory.
/// </summary>
class FooList : IEnumerable<Foo>
{
Dictionary<int, string> Index2B = new Dictionary<int, string>();
Dictionary<string, int> B2Index = new Dictionary<string, int>();
Dictionary<int, Decimal> Index2C = new Dictionary<int, decimal>();
Dictionary<Decimal,int> C2Index = new Dictionary<decimal,int>();
struct FooIndex
{
public int A;
public int BIndex;
public int CIndex;
}
// List of foos which do contain only the index values to the dictionaries to lookup the data later.
List<FooIndex> FooValues = new List<FooIndex>();
public void Add(Foo foo)
{
int bIndex;
if(!B2Index.TryGetValue(foo.B, out bIndex))
{
bIndex = B2Index.Count;
B2Index[foo.B] = bIndex;
Index2B[bIndex] = foo.B;
}
int cIndex;
if (!C2Index.TryGetValue(foo.C, out cIndex))
{
cIndex = C2Index.Count;
C2Index[foo.C] = cIndex;
Index2C[cIndex] = cIndex;
}
FooIndex idx = new FooIndex
{
A = foo.A,
BIndex = bIndex,
CIndex = cIndex
};
FooValues.Add(idx);
}
public Foo GetAt(int pos)
{
var idx = FooValues[pos];
return new Foo
{
A = idx.A,
B = Index2B[idx.BIndex],
C = Index2C[idx.CIndex]
};
}
public IEnumerator<Foo> GetEnumerator()
{
for (int i = 0; i < FooValues.Count; i++)
{
yield return GetAt(i);
}
}
System.Collections.IEnumerator System.Collections.IEnumerable.GetEnumerator()
{
return GetEnumerator();
}
}
class Program
{
static void Main(string[] args)
{
FooList list = new FooList();
List<Foo> fooList = new List<Foo>();
long before = GC.GetTotalMemory(true);
for (int i = 0; i < 1000 * 1000; i++)
{
list
//fooList
.Add(new Foo
{
A = i,
B = "Hi",
C = i
});
}
long after = GC.GetTotalMemory(true);
Console.WriteLine("Did consume {0:N0}bytes", after - before);
}
}
}
A similar memory conserving list can be found here
I have a unique situation where I need to produce hashes on the fly. Here is my situation. This question is related to here. I need to store a many urls in the database which need to be indexed. A URL can be over 2000 characters long. The database complains that a string over 900 bytes cannot be indexed. My solution is to hash the URL using MD5 or SHA256. I am not sure which hashing algorithm to use. Here are my requirements
Shortest character length with minimal collision
Needs to be very fast. I will be hashing the referurl on every page request
Collisions need to be minimized since I may have millions of urls in the database
I am not worried about security. I am worried about character length, speed, and collisions. Anyone know of a good algorithm for this?
In your case, I wouldn't use any of the cryptographic hash functions (i.e. MD5, SHA), since they were designed with security in mind: They mainly want to make it as hard as possible to finde two different strings with the same hash. I think this wouldn't be a problem in your case. (the possibility of random collisions is inherent to hashing, of course)
I'd strongly not suggest to use String.GetHashCode(), since the implementation is not known and MSDN says that it might vary between different versions of the framework. Even the results between x86 and x64 versions may be different. So you'll get into troubles when trying to access the same database using a newer (or different) version of the .NET framework.
I found the algorithm for the Java implementation of hashCode on Wikipedia (here), it seems quite easy to implement. Even a straightforward implementation would be faster than an implementation of MD5 or SHA imo. You could also use long values which reduces the probability of collisions.
There is also a short analysis of the .NET GetHashCode implementation here (not the algorithm itself but some implementation details), you could also use this one I guess. (or try to implement the Java version in a similar way ...)
a quick one :
URLString.GetHashCode().ToString("x")
While both MD5 and SHA1 have been proved ineffective where collision prevention is essential I suspect for your application either would be sufficient. I don't know for sure but I suspect that MD5 would be the simpler and quicker of the two algorithms.
Use the System.Security.Cryptography.SHA1Cng class, I would suggest. It's 160 bits or 20 bytes long, so that should definitely be small enough. If you need it to be a string, it will only require 40 characters, so that should suit your needs well. It should also be fast enough, and as far as I know, no collisions have yet been found.
I'd personally use String.GetHashCode(). This is the basic hash function. I honestly have no idea how it performs compared to other implementations but it should be fine.
Either of the two hashing functions that you name should be quick enough that you won't notice much difference between them. Unless this site requires ultra-high performance I would not worry too much about them. I'd personally probably go for MD5. This can be formatted as a string as hexdecimal in 64 characters or as a base 64 string in 44 characters.
The reason I'd go for MD5 is because you are very unlikely to run into collisions and even if you do you can structure your queries with "where urlhash = #hash and url = #url". The database engine should work out that one is indexed and the other isn't and use that information to do a sensible search.
If there are colisions the indexed scan on urlhash will return a handful of results which will be easy to do text comparisons on to get the right one. This is unlikely to be relevant very often though. You've pretty low chances of getting collisions this way.
Reflected source code of GetHashCode function in .net 4.0
public override unsafe int GetHashCode()
{
fixed (char* str = ((char*) this))
{
char* chPtr = str;
int num = 0x15051505;
int num2 = num;
int* numPtr = (int*) chPtr;
for (int i = this.Length; i > 0; i -= 4)
{
num = (((num << 5) + num) + (num >> 0x1b)) ^ numPtr[0];
if (i <= 2)
{
break;
}
num2 = (((num2 << 5) + num2) + (num2 >> 0x1b)) ^ numPtr[1];
numPtr += 2;
}
return (num + (num2 * 0x5d588b65));
}
}
There was O(n) simple operations(+, <<, ^) and one multiplication. So this is very fast.
I've tested this function on 3 mln DB contains strings lengths up to 256 characters and about 97% of strings has no collision. (Maximum 5 strings have the same hash)
You may want to look at the following project:
CMPH - C Minimal Perfect Hashing Library
And check out the following hot topics listing for perfect hashes:
Hottest 'perfect-hash' Answers - Stack Overflow
You could also consider using a full text index in SQL rather than hashing:
CREATE FULLTEXT INDEX (Transact-SQL)
I'm just about to launch the beta of a new online service. Beta subscribers will be sent a unique "access code" that allows them to register for the service.
Rather than storing a list of access codes, I thought I would just generate a code based on their email, since this itself is unique.
My initial thought was to combine the email with a unique string and then Base64 encode it. However, I was looking for codes that are a bit shorter, say 5 digits long.
If the access code itself needs to be unique, it will be difficult to ensure against collisions. If you can tolerate a case where two users might, by coincidence, share the same access code, it becomes significantly easier.
Taking the base-64 encoding of the e-mail address concatenated with a known string, as proposed, could introduce a security vulnerability. If you used the base64 output of the e-mail address concatenated with a known word, the user could just unencode the access code and derive the algorithm used to generate the code.
One option is to take the SHA-1-HMAC hash (System.Cryptography.HMACSHA1) of the e-mail address with a known secret key. The output of the hash is a 20-byte sequence. You could then truncate the hash deterministically. For instance, in the following, GetCodeForEmail("test#example.org") gives a code of 'PE2WEG' :
// define characters allowed in passcode. set length so divisible into 256
static char[] ValidChars = {'2','3','4','5','6','7','8','9',
'A','B','C','D','E','F','G','H',
'J','K','L','M','N','P','Q',
'R','S','T','U','V','W','X','Y','Z'}; // len=32
const string hashkey = "password"; //key for HMAC function -- change!
const int codelength = 6; // lenth of passcode
string GetCodeForEmail(string address)
{
byte[] hash;
using (HMACSHA1 sha1 = new HMACSHA1(ASCIIEncoding.ASCII.GetBytes(hashkey)))
hash = sha1.ComputeHash(UTF8Encoding.UTF8.GetBytes(address));
int startpos = hash[hash.Length -1] % (hash.Length - codelength);
StringBuilder passbuilder = new StringBuilder();
for (int i = startpos; i < startpos + codelength; i++)
passbuilder.Append(ValidChars[hash[i] % ValidChars.Length]);
return passbuilder.ToString();
}
You may create a special hash from their email, which is less than 6 chars, but it wouldn't really make that "unique", there will always be collisions in such a small space. I'd rather go with a longer key, or storing pre-generated codes in a table anyway.
So, it sounds like what you want to do here is to create a hash function specifically for emails as #can poyragzoglu pointed out. A very simple one might look something like this:
(pseudo code)
foreach char c in email:
running total += [large prime] * [unicode value]
then do running total % large 5 digit number
As he pointed out though, this will not be unique unless you had an excellent hash function. You're likely to have collisions. Not sure if that matters.
What seems easier to me, is if you already know the valid emails, just check the user's email against your list of valid ones upon registration? Why bother with a code at all?
If you really want a unique identifier though, the easiest way to do this is probably to just use what's called a GUID. C# natively supports this. You could store this in your Users table. Though, it would be far too long for a user to ever remember/type out, it would almost certainly be unique for each one if that's what you're trying to do.