Generate Reusable Unique Code for String Value


I have a SQL table that stores unique nvarchar data set to 60 characters max.
I now need to output each value to a file on a daily basis. This file is then fed into a 3rd party system.
However, this 3rd party system requires the value to be limited to 10 characters. The values do not have to be what is in the table. They just need to be unique and 10 characters max. They must also be consistent, in that the same unique id is used each day for a given table value.
I cannot truncate the string as it could then lose its uniqueness.
Looking at my options, I could:
Use GetHashCode()
Convert to Hexadecimal
With GetHashCode, this looks like a simple, straightforward option and I get the same value each time it is run. However, Microsoft documentation recommends against using it for my purpose...
https://learn.microsoft.com/en-us/dotnet/api/system.string.gethashcode?redirectedfrom=MSDN&view=netframework-4.8#System_String_GetHashCode
As a result, hash codes should never be used outside of the application domain in which they were created, they should never be used as key fields in a collection, and they should never be persisted.
With Hexadecimal conversion, it may also lose uniqueness when trimmed to 10 characters.
I have also looked at this example but again I'm not sure how reliable it is with uniqueness: A fast hash function for string in C#
static UInt64 CalculateHash(string read)
{
    UInt64 hashedValue = 3074457345618258791ul;
    for(int i=0; i<read.Length; i++)
    {
        hashedValue += read[i];
        hashedValue *= 3074457345618258799ul;
    }
    return hashedValue;
}
Are there any other options available to me?

Add a unique identity key to your table and let SQL Server manage the incrementing for you. This can be seeded with a large number if needed.

Related

Guid.NewGuid() VS a random string generator from Random.Next()

My colleague and I are debating which of these methods to use for auto generating user ID's and post ID's for identification in the database:
One option uses a single instance of Random, and takes some useful parameters so it can be reused for all sorts of string-gen cases (i.e. from 4 digit numeric pins to 20 digit alphanumeric ids). Here's the code:
// This is created once for the lifetime of the server instance
class RandomStringGenerator
{
    public const string ALPHANUMERIC_CAPS = "ABCDEFGHIJKLMNOPQRSTUVWXYZ1234567890";
    public const string ALPHA_CAPS = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
    public const string NUMERIC = "1234567890";
    Random rand = new Random();
    public string GetRandomString(int length, params char[] chars)
    {
        string s = "";
        for (int i = 0; i < length; i++)
            s += chars[rand.Next() % chars.Length];
        return s;
    }
}
and the other option is simply to use:
Guid.NewGuid();
see Guid.NewGuid on MSDN
We're both aware that Guid.NewGuid() would work for our needs, but I would rather use the custom method. It does the same thing but with more control.
My colleague thinks that because the custom method has been cooked up ourselves, it's more likely to generate collisions. I'll admit I'm not fully aware of the implementation of Random, but I presume it is just as random as Guid.NewGuid(). A typical usage of the custom method might be:
RandomStringGenerator stringGen = new RandomStringGenerator();
string id = stringGen.GetRandomString(20, RandomStringGenerator.ALPHANUMERIC_CAPS.ToCharArray());
Edit 1:
We are using Azure Tables which doesn't have an auto increment (or similar) feature for generating keys.
Some answers here just tell me to use NewGuid() "because that's what it's made for". I'm looking for a more in depth reason as to why the cooked up method may be more likely to generate collisions given the same degrees of freedom as a Guid.
Edit 2:
We were also using the cooked up method to generate post ID's which, unlike session tokens, need to look pretty for display in the url of our website (like http://mywebsite.com/14983336), so guids are not an option here, however collisions are still to be avoided.

I am looking for a more in depth reason as to why the cooked up method may be more likely to generate collisions given the same degrees of freedom as a Guid.
First, as others have noted, Random is not thread-safe; using it from multiple threads can cause it to corrupt its internal data structures so that it always produces the same sequence.
Second, Random is seeded based on the current time. Two instances of Random created within the same millisecond (recall that a millisecond is several million processor cycles on modern hardware) will have the same seed, and therefore will produce the same sequence.
Third, I lied. Random is not seeded based on the current time; it is seeded based on the amount of time the machine has been active. The seed is a 32 bit number, and since the granularity is in milliseconds, that's only a few weeks until it wraps around. But that's not the problem; the problem is: the time period in which you create that instance of Random is highly likely to be within a few minutes of the machine booting up. Every time you power-cycle a machine, or bring a new machine online in a cluster, there is a small window in which instances of Random are created, and the more that happens, the greater the odds are that you'll get a seed that you had before.
(UPDATE: Newer versions of the .NET framework have mitigated some of these problems; in those versions you no longer have every Random created within the same millisecond have the same seed. However there are still many problems with Random; always remember that it is only pseudo-random, not crypto-strength random. Random is actually very predictable, so if you are relying on unpredictability, it is not suitable.)
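To make the seeding point concrete, here is a minimal sketch showing that two Random instances built from the same seed produce exactly the same sequence. The class name SeedDemo and its methods are made up for illustration; only the Random(int) constructor and Next() come from the framework.

```csharp
using System;

public static class SeedDemo
{
    // Returns the first `count` values produced by a Random created with `seed`.
    public static int[] FirstValues(int seed, int count)
    {
        var rng = new Random(seed);
        var values = new int[count];
        for (int i = 0; i < count; i++)
            values[i] = rng.Next();
        return values;
    }

    // True when two generators that share a seed agree on every value.
    public static bool SameSeedSameSequence(int seed, int count)
    {
        int[] a = FirstValues(seed, count);
        int[] b = FirstValues(seed, count);
        for (int i = 0; i < count; i++)
            if (a[i] != b[i]) return false;
        return true;
    }
}
```

However long the generated strings are, the number of distinct sequences can never exceed the number of distinct seeds.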
As others have said: if you want a primary key for your database then have the database generate you a primary key; let the database do its job. If you want a globally unique identifier then use a guid; that's what they're for.
And finally, if you are interested in learning more about the uses and abuses of guids then you might want to read my "guid guide" series; part one is here:
https://ericlippert.com/2012/04/24/guid-guide-part-one/

As written in other answers, my implementation had a few severe problems:
Thread safety: Random is not thread safe.
Predictability: the method couldn't be used for security critical identifiers like session tokens due to the nature of the Random class.
Collisions: Even though the method created 20 'random' characters, the number of distinct outputs is not (number of possible chars)^20, because the seed value is only 31 bits and comes from a bad source. Given the same seed, a sequence of any length will be the same.
Guid.NewGuid() would be fine, except we don't want to use ugly GUIDs in urls, and .NET's NewGuid() algorithm is not known to be cryptographically secure for use in session tokens - it might give predictable results if a little information is known.
Here is the code we're using now, it is secure, flexible and as far as I know it's very unlikely to create collisions if given enough length and character choice:
class RandomStringGenerator
{
    RNGCryptoServiceProvider rand = new RNGCryptoServiceProvider();
    public string GetRandomString(int length, params char[] chars)
    {
        string s = "";
        for (int i = 0; i < length; i++)
        {
            byte[] intBytes = new byte[4];
            rand.GetBytes(intBytes);
            uint randomInt = BitConverter.ToUInt32(intBytes, 0);
            s += chars[randomInt % chars.Length];
        }
        return s;
    }
}
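For readers on a newer runtime, the same idea can be sketched with RandomNumberGenerator.GetInt32, which I believe is available from .NET Core 3.0 onwards. This is an assumption about the target framework, not what the code above uses, and the class name SecureStringGenerator is hypothetical. It also sidesteps the slight modulo bias of randomInt % chars.Length (2^32 is not a multiple of 36).

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

public static class SecureStringGenerator
{
    public const string AlphanumericCaps = "ABCDEFGHIJKLMNOPQRSTUVWXYZ1234567890";

    // Builds a string of `length` characters drawn uniformly from `alphabet`.
    public static string GetRandomString(int length, string alphabet)
    {
        if (length < 0)
            throw new ArgumentOutOfRangeException(nameof(length));
        if (string.IsNullOrEmpty(alphabet))
            throw new ArgumentException("Alphabet must not be empty.", nameof(alphabet));

        var sb = new StringBuilder(length);
        for (int i = 0; i < length; i++)
        {
            // GetInt32 returns a uniformly distributed value in [0, alphabet.Length).
            sb.Append(alphabet[RandomNumberGenerator.GetInt32(alphabet.Length)]);
        }
        return sb.ToString();
    }
}
```

Usage would look like SecureStringGenerator.GetRandomString(20, SecureStringGenerator.AlphanumericCaps). Uniqueness is still only probabilistic, so length and alphabet size need to be chosen with the expected number of ids in mind.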

"Auto generating user ids and post ids for identification in the database"...why not use a database sequence or identity to generate keys?
To me your question is really, "What is the best way to generate a primary key in my database?" If that is the case, you should use the conventional tool of the database which will either be a sequence or identity. These have benefits over generated strings.
Sequences/identity index better. There are numerous articles and blog posts that explain why GUIDs and so forth make poor indexes.
They are guaranteed to be unique within the table
They can be safely generated by concurrent inserts without collision
They are simple to implement
I guess my next question is, what reasons are you considering GUID's or generated strings? Will you be integrating across distributed databases? If not, you should ask yourself if you are solving a problem that doesn't exist.

Your custom method has two problems:
It uses a global instance of Random, but doesn't use locking. => Multi threaded access can corrupt its state. After which the output will suck even more than it already does.
It uses a predictable 31 bit seed. This has two consequences:
You can't use it for anything security related where unguessability is important
The small seed (31 bits) can reduce the quality of your numbers. For example, if you create multiple instances of Random at the same time (measured since system startup), they'll probably create the same sequence of random numbers.
This means you cannot rely on the output of Random being unique, no matter how long it is.
I recommend using a CSPRNG (RNGCryptoServiceProvider) even if you don't need security. Its performance is still acceptable for most uses, and I'd trust the quality of its random numbers over Random. If you want uniqueness, I recommend getting numbers with around 128 bits.
To generate random strings using RNGCryptoServiceProvider you can take a look at my answer to How can I generate random 8 character, alphanumeric strings in C#?.
Nowadays GUIDs returned by Guid.NewGuid() are version 4 GUIDs. They are generated from a PRNG, so they have pretty similar properties to generating a random 122 bit number (the remaining 6 bits are fixed). Its entropy source has much higher quality than what Random uses, but it's not guaranteed to be cryptographically secure.
But the generation algorithm can change at any time, so you can't rely on that. For example in the past the Windows GUID generation algorithm changed from v1 (based on MAC + timestamp) to v4 (random).

Use System.Guid as it:
...can be used across all computers and networks wherever a unique identifier is required.
Note that Random is a pseudo-random number generator. It is not truly random, nor unique. It has only 32-bits of value to work with, compared to the 128-bit GUID.
However, even GUIDs can have collisions (although the chances are really slim), so you should use the database's own features to give you a unique identifier (e.g. the autoincrement ID column). Also, you cannot easily turn a GUID into a 4 or 20 (alpha)numeric number.

Contrary to what some people have said in the comments, a GUID generated by Guid.NewGuid() is NOT dependent on any machine-specific identifier (only type 1 GUIDs are; Guid.NewGuid() returns a type 4 GUID, which is mostly random).
As long as you don't need cryptographic security, the Random class should be good enough, but if you want to be extra safe, use System.Security.Cryptography.RandomNumberGenerator. For the Guid approach, note that not all digits in a GUID are random. Quote from wikipedia:
In the canonical representation, xxxxxxxx-xxxx-Mxxx-Nxxx-xxxxxxxxxxxx, the most significant bits of N indicates the variant (depending on the variant; one, two or three bits are used). The variant covered by the UUID specification is indicated by the two most significant bits of N being 1 0 (i.e. the hexadecimal N will always be 8, 9, A, or B).
In the variant covered by the UUID specification, there are five versions. For this variant, the four bits of M indicates the UUID version (i.e. the hexadecimal M will either be 1, 2, 3, 4, or 5).
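The fixed digits described in the quote can be read straight off the canonical string form. The sketch below is only an illustration: GuidInspector is a hypothetical helper, and the expectation that Guid.NewGuid() reports version 4 is based on the behaviour described in this thread rather than on a documented guarantee.

```csharp
using System;

public static class GuidInspector
{
    // Canonical "D" format: xxxxxxxx-xxxx-Mxxx-Nxxx-xxxxxxxxxxxx
    // M sits at index 14 and N at index 19 (hyphens included).

    // Version digit M.
    public static char VersionDigit(Guid g) => g.ToString("D")[14];

    // Variant digit N (8, 9, a or b for the variant covered by the UUID specification).
    public static char VariantDigit(Guid g) => g.ToString("D")[19];
}
```

For example, GuidInspector.VersionDigit(Guid.NewGuid()) should come back as '4' on current runtimes, which leaves 122 of the 128 bits to the random generator.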

Regarding your edit, here is one reason to prefer a GUID over a generated string:
The native storage for a GUID (uniqueidentifier) in SQL Server is 16 bytes. To store an equivalent-length varchar (string), where each "digit" in the id is stored as a character, would require somewhere between 32 and 38 bytes, depending on formatting.
Because of its storage, SQL Server is also able to index a uniqueidentifier column more efficiently than a varchar column as well.

Coupon code generation

I would like to generate coupon codes , e.g. AYB4ZZ2. However, I would also like to be able to mark the used coupons and limit their global number, let's say N. The naive approach would be something like "generate N unique alphanumeric codes, put them into database and perform a db search on every coupon operation."
However, as far as I realize, we can also attempt to find a function MakeCoupon(n), which converts the given number into a coupon-like string with predefined length.
As far as I understand, MakeCoupon should fulfill the following requirements:
Be bijective. Its inverse MakeNumber(coupon) should be effectively computable.
Output for MakeCoupon(n) should be alphanumeric and should have small and constant length - so that it could be called human readable. E.g. SHA1 digest wouldn't pass this requirement.
Practical uniqueness. Results of MakeCoupon(n) for every natural n <= N should be totally unique or unique in the same terms as, for example, MD5 is unique (with the same extremely small collision probability).
(this one is tricky to define) It shouldn't be obvious how to enumerate all remaining coupons from a single coupon code - let's say MakeCoupon(n) and MakeCoupon(n + 1) should visually differ.
E.g. MakeCoupon(n), which simply outputs n padded with zeroes would fail this requirement, because 000001 and 000002 don't actually differ "visually".
Q:
Does any function or function generator that fulfills the requirements above exist? My search attempts only led me to [CPAN] CouponCode, but it does not fulfill the requirement of the corresponding function being bijective.

Basically you can split your operation into two parts:
Somehow "encrypt" your initial number n, so that two consecutive numbers yield (very) different results
Construct your "human-readable" code from the result of step 1
For step 1 I'd suggest to use a simple block cipher (e.g. a Feistel cipher with a round function of your choice). See also this question.
Feistel ciphers work in several rounds. During each round, some round function is applied to one half of the input, the result is xored with the other half and the two halves are swapped. The nice thing about Feistel ciphers is that the round function doesn't have to be invertible (the input to the round function is retained unmodified after each round, so the result of the round function can be recomputed during decryption). Therefore you can choose whatever crazy operation(s) you like :). Also Feistel ciphers are symmetric, which fulfills your first requirement.
A short example in C#
const int BITCOUNT = 30;
const int BITMASK = (1 << BITCOUNT/2) - 1;
static uint roundFunction(uint number) {
    return (((number ^ 47894) + 25) << 1) & BITMASK;
}
static uint crypt(uint number) {
    uint left = number >> (BITCOUNT/2);
    uint right = number & BITMASK;
    for (int round = 0; round < 10; ++round) {
        left = left ^ roundFunction(right);
        uint temp = left; left = right; right = temp;
    }
    return left | (right << (BITCOUNT/2));
}
(Note that after the last round there is no swapping, in the code the swapping is simply undone in the construction of the result)
Apart from fulfilling your requirements 3 and 4 (the function is total, so for different inputs you get different outputs and the input is "totally scrambled" according to your informal definition) it is also its own inverse (thus implicitly fulfilling requirement 1), i.e. crypt(crypt(x))==x for each x in the input domain (0..2^30-1 in this implementation). Also it's cheap in terms of performance requirements.
For step 2 just encode the result to some base of your choice. For instance, to encode a 30-bit number, you could use 6 "digits" of an alphabet of 32 characters (so you can encode 6*5=30 bits).
An example for this step in C#:
const string ALPHABET= "AG8FOLE2WVTCPY5ZH3NIUDBXSMQK7946";
static string couponCode(uint number) {
    StringBuilder b = new StringBuilder();
    for (int i=0; i<6; ++i) {
        b.Append(ALPHABET[(int)number&((1 << 5)-1)]);
        number = number >> 5;
    }
    return b.ToString();
}
static uint codeFromCoupon(string coupon) {
    uint n = 0;
    for (int i = 0; i < 6; ++i)
        n = n | (((uint)ALPHABET.IndexOf(coupon[i])) << (5 * i));
    return n;
}
For inputs 0 - 9 this yields the following coupon codes
0 => 5VZNKB
1 => HL766Z
2 => TMGSEY
3 => P28L4W
4 => EM5EWD
5 => WIACCZ
6 => 8DEPDA
7 => OQE33A
8 => 4SEQ5A
9 => AVAXS5
Note that this approach has two different internal "secrets": first, the round function together with the number of rounds used, and second, the alphabet you use for encoding the encrypted result. But also note that the shown implementation is in no way secure in a cryptographic sense!
Also note that the shown function is a total bijective function, in the sense that every possible 6-character code (with characters from your alphabet) will yield a unique number. To prevent anyone from entering just some random code, you should define some kind of restriction on the input number, e.g. only issue coupons for the first 10,000 numbers. Then the probability of some random coupon code being valid would be 10000/2^30 ≈ 0.00001 (it would take about 100,000 attempts on average to find a correct coupon code). If you need more "security", you can just increase the bit size/coupon code length (see below).
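Putting both steps and the "first 10,000 numbers" restriction together, a round trip can be sketched as follows. This is a self-contained restatement of the snippets above under my own names (CouponDemo, MakeCoupon, TryRedeem and the issuedCount parameter are hypothetical), not a hardened implementation.

```csharp
using System;
using System.Text;

public static class CouponDemo
{
    const int BitCount = 30;
    const int HalfBits = BitCount / 2;
    const uint HalfMask = (1u << HalfBits) - 1;
    const string Alphabet = "AG8FOLE2WVTCPY5ZH3NIUDBXSMQK7946";

    static uint RoundFunction(uint value) => (((value ^ 47894u) + 25u) << 1) & HalfMask;

    // Same Feistel network as above; applying it twice returns the input.
    public static uint Crypt(uint number)
    {
        uint left = number >> HalfBits;
        uint right = number & HalfMask;
        for (int round = 0; round < 10; ++round)
        {
            left ^= RoundFunction(right);
            uint temp = left; left = right; right = temp;
        }
        return left | (right << HalfBits);
    }

    // Scrambles n (must be below 2^30) and encodes it as 6 characters, 5 bits each.
    public static string MakeCoupon(uint n)
    {
        if (n >= (1u << BitCount)) throw new ArgumentOutOfRangeException(nameof(n));
        uint scrambled = Crypt(n);
        var sb = new StringBuilder(6);
        for (int i = 0; i < 6; ++i)
        {
            sb.Append(Alphabet[(int)(scrambled & 31)]);
            scrambled >>= 5;
        }
        return sb.ToString();
    }

    // Hypothetical validity rule: only numbers below issuedCount were ever handed out.
    public static bool TryRedeem(string coupon, uint issuedCount, out uint n)
    {
        n = 0;
        if (coupon == null || coupon.Length != 6) return false;
        uint scrambled = 0;
        for (int i = 0; i < 6; ++i)
        {
            int digit = Alphabet.IndexOf(coupon[i]);
            if (digit < 0) return false;          // character outside the alphabet
            scrambled |= (uint)digit << (5 * i);
        }
        n = Crypt(scrambled);
        return n < issuedCount;
    }
}
```

TryRedeem(MakeCoupon(n), 10000, out m) should succeed with m == n for every n below 10,000, while most other 6-character strings decode to a number outside that range and are rejected.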
EDIT: Change Coupon code length
Changing the length of the resulting coupon code requires some math: The first (encrypting) step only works on a bit string with even bit count (this is required for the Feistel cipher to work).
In the second step, the number of bits that can be encoded using a given alphabet depends on the "size" of the chosen alphabet and the length of the coupon code. This "entropy", given in bits, is, in general, not an integer, let alone an even integer. For example:
A 5-digit code using a 30 character alphabet results in 30^5 possible codes which means ld(30^5)=24.53 bits/Coupon code.
For a four-digit code, there is a simple solution: given a 32-character alphabet you can encode ld(32^4)=5*4=20 bits. So you can just set the BITCOUNT to 20 and change the for loop in the second part of the code to run until 4 (instead of 6).
Generating a five-digit code is a bit trickier and somehow "weakens" the algorithm: you can set the BITCOUNT to 24 and just generate a 5-digit code from an alphabet of 30 characters (remove two characters from the ALPHABET string and let the for loop run until 5).
But this will not generate all possible 5-digit-codes: with 24 bits you can only get 16,777,216 possible values from the encryption stage, the 5 digit codes could encode 24,300,000 possible numbers, so some possible codes will never be generated. More specifically, the last position of the code will never contain some characters of the alphabet. This can be seen as a drawback, because it narrows down the set of valid codes in an obvious way.
When decoding a coupon code, you'll first have to run the codeFromCoupon function and then check whether bit 25 of the result is set. This would mark an invalid code that you can immediately reject. Note that, in practice, this might even be an advantage, since it allows a quick check (e.g. on the client side) of the validity of a code without giving away all internals of the algorithm.
If bit 25 is not set you'll call the crypt function and get the original number.

Though I may get docked for this answer I feel like I need to respond - I really hope that you hear what I'm saying as it comes from a lot of painful experience.
While this task is very academically challenging, and software engineers tend to challenge their intellect rather than solve problems, I need to provide you with some direction on this if I may. There is no retail store in the world with any kind of success that doesn't keep very good track of each and every entity that is generated, from each piece of inventory to every single coupon or gift card they send out those doors. You're just not being a good steward if you don't, because it's not if people are going to cheat you, it's when, and so if you have every possible item in your arsenal you'll be ready.
Now, let's talk about the process by which the coupon is used in your scenario.
When the customer redeems the coupon there is going to be some kind of POS system in front, right? And that may even be an online business where they are then able to just enter their coupon code vs. a register where the cashier scans a barcode, right (I'm assuming that's what we're dealing with here)? And so now, as the vendor, you're saying that if you have a valid coupon code I'm going to give you some kind of discount, and because our goal was to generate coupon codes that were reversible we don't need a database to verify that code, we can just reverse it, right! I mean it's just math, right? Well, yes and no.
Yes, you're right, it's just math. In fact, that's also the problem because so is cracking SSL. But, I'm going to assume that we all realize the math used in SSL is just a bit more complex than anything used here and the key is substantially larger.
It does not behoove you, nor is it wise for you to try and come up with some kind of scheme that you're just sure nobody cares enough to break, especially when it comes to money. You are making your life very difficult trying to solve a problem you really shouldn't be trying to solve because you need to be protecting yourself from those using the coupon codes.
Therefore, this problem is unnecessarily complicated and could be solved like this.
// insert a record into the database for the coupon
// thus generating an auto-incrementing key
var id = [some code to insert into database and get back the key]
// base64 encode the bytes of the resulting key value (ToBase64String expects a byte[])
var couponCode = Convert.ToBase64String(BitConverter.GetBytes(id));
// truncate the coupon code if you like
// update the database with the coupon code
Create a coupon table that has an auto-incrementing key.
Insert into that table and get the auto-incrementing key back.
Base64 encode that id into a coupon code.
Truncate that string if you want.
Store that string back in the database with the coupon just inserted.

What you want is called Format-preserving encryption.
Without loss of generality, by encoding in base 36 we can assume that we are talking about integers in 0..M-1 rather than strings of symbols. M should probably be a power of 2.
After choosing a secret key and specifying M, FPE gives you a pseudo-random permutation of 0..M-1 encrypt along with its inverse decrypt.
string GenerateCoupon(int n) {
    Debug.Assert(0 <= n && n < N);
    return Base36.Encode(encrypt(n));
}
bool IsCoupon(string code) {
    return decrypt(Base36.Decode(code)) < N;
}
If your FPE is secure, this scheme is secure: no attacker can generate other coupon codes with probability higher than O(N/M) given knowledge of arbitrarily many coupons, even if he manages to guess the number associated with each coupon that he knows.
This is still a relatively new field, so there are few implementations of such encryption schemes. This crypto.SE question only mentions Botan, a C++ library with Perl/Python bindings, but not C#.
Word of caution: in addition to the fact that there are no well-accepted standards for FPE yet, you must consider the possibility of a bug in the implementation. If there is a lot of money on the line, you need to weigh that risk against the relatively small benefit of avoiding a database.

You can use a base-36 number system. Assume that you want 6 characters in the coupon output.
pseudo code for MakeCoupon
MakeCoupon(n)
{
    Have a byte array of fixed size, say 6. Initialize all the values to 0.
    Convert the number to base 36 and store the 'digits' in the array
    (using integer division and mod operations).
    Now, for each 'digit', find the corresponding ASCII code, assuming the
    digits run 0..9,A..Z.
    With this convention, output the six digits as a string.
}
Calculating the number back is the reverse of this operation.
This would work for numbers up to 36^6 - 1 (about 2.18 billion) with 6 allowed characters.
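The pseudo code above could be turned into C# roughly like this. It is a minimal sketch: Base36Codec is a made-up name, and the choice to throw when the value does not fit in the requested length is mine.

```csharp
using System;

public static class Base36Codec
{
    const string Digits = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ";

    // Encodes a non-negative value as exactly `length` base-36 digits, zero padded.
    public static string Encode(long value, int length)
    {
        if (value < 0) throw new ArgumentOutOfRangeException(nameof(value));
        var buffer = new char[length];
        for (int i = length - 1; i >= 0; i--)
        {
            buffer[i] = Digits[(int)(value % 36)];   // least significant digit goes last
            value /= 36;
        }
        if (value != 0)
            throw new ArgumentOutOfRangeException(nameof(value), "Value does not fit in the requested length.");
        return new string(buffer);
    }

    // Reverses Encode; accepts lower-case letters as well.
    public static long Decode(string code)
    {
        long value = 0;
        foreach (char c in code)
        {
            int digit = Digits.IndexOf(char.ToUpperInvariant(c));
            if (digit < 0) throw new FormatException("Invalid base-36 digit: " + c);
            value = value * 36 + digit;
        }
        return value;
    }
}
```

Note that consecutive numbers produce consecutive codes here (000001, 000002, ...), so on its own this does not meet the "visually differ" requirement from the question.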

Choose a cryptographic function c. There are a few requirements on c, but for now let us take SHA1.
choose a secret key k.
Your coupon code generating function could be, for number n:
concatenate n and k as "n"+"k" (this is known as salting in password management)
compute c("n"+"k")
the result of SHA1 is 160bits, encode them (for instance with base64) as an ASCII string
if the result is too long (as you said it is the case for SHA1), truncate it to keep only the first 10 letters and name this string s
your coupon code is printf "%09d%s" n s, i.e. the concatenation of zero-padded n and the truncated hash s.
Yes, it is trivial to guess n the number of the coupon code (but see below). But it is hard to generate another valid code.
Your requirements are satisfied:
To compute the reverse function, just read the first 9 digits of the code
The length is always 19 (9 digits of n, plus 10 letters of hash)
It is unique, since the first 9 digits are unique. The last 10 chars are too, with high probability.
It is not obvious how to generate the hash, even if one guesses that you used SHA1.
Some comments:
If you're worried that reading n is too obvious, you can obfuscate it lightly, like base64 encoding, and alternating in the code the characters of n and s.
I am assuming that you won't need more than a billion codes, thus the printing of n on 9 digits, but you can of course adjust the parameters 9 and 10 to your desired coupon code length.
SHA1 is just an option, you could use another cryptographic function like private key encryption, but you need to check that this function remains strong when truncated and when the clear text is provided.
This is not optimal in code length, but has the advantage of simplicity and widely available libraries.
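The steps listed above could look like this in C#. It is a sketch under the stated parameters (9 digits of n, 10 Base64 characters of SHA1); HashedCoupon, Make and TryRead are hypothetical names.

```csharp
using System;
using System.Globalization;
using System.Security.Cryptography;
using System.Text;

public static class HashedCoupon
{
    // Coupon = zero-padded 9-digit n + first 10 Base64 characters of SHA1(n + key).
    public static string Make(int n, string secretKey)
    {
        if (n < 0 || n > 999999999) throw new ArgumentOutOfRangeException(nameof(n));
        return n.ToString("D9", CultureInfo.InvariantCulture) + Tag(n, secretKey);
    }

    // Recovers n from the first 9 digits and recomputes the hash suffix to validate it.
    public static bool TryRead(string coupon, string secretKey, out int n)
    {
        n = 0;
        if (coupon == null || coupon.Length != 19) return false;
        if (!int.TryParse(coupon.Substring(0, 9), NumberStyles.None,
                          CultureInfo.InvariantCulture, out n))
            return false;
        return coupon.Substring(9) == Tag(n, secretKey);
    }

    static string Tag(int n, string secretKey)
    {
        byte[] input = Encoding.UTF8.GetBytes(n.ToString(CultureInfo.InvariantCulture) + secretKey);
        using (SHA1 sha1 = SHA1.Create())
        {
            // 20 hash bytes become 28 Base64 characters; keep only the first 10.
            return Convert.ToBase64String(sha1.ComputeHash(input)).Substring(0, 10);
        }
    }
}
```

Base64 output can contain '+' and '/', so a real coupon would probably want a friendlier alphabet. A keyed construction such as HMAC is the more conventional way to mix a secret into a hash, if changing the scheme is acceptable.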

Calculate a checksum for a string

I have a string of arbitrary length (let's say 5 to 2000 characters) which I would like to calculate a checksum for.
Requirements
The same checksum must be returned each time a calculation is done for a string
The checksum must be unique (no collisions)
I can not store previous IDs to check for collisions
Which algorithm should I use?
Update:
Is there an approach that is reasonably unique, i.e. where the likelihood of a collision is very small?
The checksum should be alphanumeric
The strings are unicode
The strings are actually texts that should be translated and the checksum is stored with each translation (so a translated text can be matched back to the original text).
The length of the checksum is not important for me (the shorter, the better)
Update2
Let's say that I got the following string "Welcome to this website. Navigate using the flashy but useless menu above".
The string is used in a view in a similar way to gettext in linux. i.e. the user just writes (in a razor view)
@T("Welcome to this website. Navigate using the flashy but useless menu above")
Now I need a way to identify that string so that I can fetch it from a data source (there are several implementations of the data source). Having to use the entire string as a key seems a bit inefficient and I'm therefore looking for a way to generate a key out of it.

That's not possible.
If you can't store previous values, it's not possible to create a unique checksum that is smaller than the information in the string.
Update:
The term "reasonably unique" doesn't make sense, either it's unique or it's not.
To get a reasonably low risk of hash collisions, you can use a reasonably large hash code.
The MD5 algorithm for example produces a 16 byte hash code. Convert the string to a byte array using some encoding that preserves all characters, for example UTF-8, calculate the hash code using the MD5 class, then convert the hash code byte array into a string using the BitConverter class:
string theString = "asdf";
string hash;
using (System.Security.Cryptography.MD5 md5 = System.Security.Cryptography.MD5.Create()) {
    hash = BitConverter.ToString(
        md5.ComputeHash(Encoding.UTF8.GetBytes(theString))
    ).Replace("-", String.Empty);
}
Console.WriteLine(hash);
Output:
912EC803B2CE49E4A541068D495AB570

You can use cryptographic Hash functions for this. Most of them are available in .Net
For example:
var sha1 = System.Security.Cryptography.SHA1.Create();
byte[] buf = System.Text.Encoding.UTF8.GetBytes("test");
byte[] hash= sha1.ComputeHash(buf, 0, buf.Length);
//var hashstr = Convert.ToBase64String(hash);
var hashstr = System.BitConverter.ToString(hash).Replace("-", "");

Note: This is an answer to the original question.
Assuming you want the checksum to be stored in a variable of fixed size (i.e. an integer), you cannot satisfy your second constraint.
The checksum must be unique (no collisions)
You cannot avoid collisions because there will be more distinct strings than there are possible checksum values.

I realize this post is practically ancient, but I stumbled upon it and have run into an almost identical issue in the past. We had an nvarchar(8000) field that we needed to lookup against.
Our solution was to create a persisted computed column using CHECKSUM of the nasty lookup field. We had an auto-incrementing ID field and keyed on (checksum, id)
When reading from the table, we wrote a proc that took the lookup text, computed the checksum and then took where the checksums were equal and the text was equal.
You could easily perform the checksum portions at the application level based on the answer above and store them manually instead of using our DB-centric solution. But the point is to get a reasonably sized key for indexing so that your text comparison runs against a bucket of collisions instead of the entire dataset.
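At the application level, the "bucket of collisions" idea might be sketched like this. The names are hypothetical, and the 32-bit checksum taken from the first four bytes of a SHA-256 hash is my stand-in for SQL Server's CHECKSUM, not the same function.

```csharp
using System;
using System.Collections.Generic;
using System.Security.Cryptography;
using System.Text;

public sealed class TranslationStore
{
    // checksum -> every (original text, translation) pair that shares that checksum
    private readonly Dictionary<int, List<KeyValuePair<string, string>>> buckets =
        new Dictionary<int, List<KeyValuePair<string, string>>>();

    // Small, stable key derived from the text; collisions are allowed.
    public static int Checksum(string text)
    {
        using (SHA256 sha = SHA256.Create())
        {
            byte[] hash = sha.ComputeHash(Encoding.UTF8.GetBytes(text));
            return BitConverter.ToInt32(hash, 0);
        }
    }

    public void Add(string original, string translation)
    {
        int key = Checksum(original);
        if (!buckets.TryGetValue(key, out var bucket))
        {
            bucket = new List<KeyValuePair<string, string>>();
            buckets[key] = bucket;
        }
        bucket.Add(new KeyValuePair<string, string>(original, translation));
    }

    // Narrows the search by checksum, then confirms with a full text comparison.
    public bool TryGet(string original, out string translation)
    {
        translation = null;
        if (!buckets.TryGetValue(Checksum(original), out var bucket)) return false;
        foreach (var pair in bucket)
        {
            if (string.Equals(pair.Key, original, StringComparison.Ordinal))
            {
                translation = pair.Value;
                return true;
            }
        }
        return false;
    }
}
```

Because the final comparison is on the full text, a checksum collision only costs a slightly longer bucket scan and never returns the wrong translation.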
Good luck!

To guarantee uniqueness for strings of almost unlimited size, treat the variable-length string as a set of concatenated substrings, each "x characters in length". Your hash function then only needs to determine uniqueness up to a maximum substring length, and generates a series of checksum values, one per substring. Think of it as the equivalent of a network IP address made up of a set of checksum numbers.
Your issue with collisions is the assumption that a collision forces a slower search method to resolve each collision. If there is an insignificant number of possible collisions compared to the number of hashed objects, then as a whole the extra overhead becomes nil. A collision is due to the table being sized smaller than the maximum number of objects. This doesn't have to be the case, because the table may have "holes" and each slot in the table may keep a reference count of the objects at that slot. Only if this count is greater than 1 has a collision occurred, or there are multiple instances of the same substring.

User ID obfuscation

I expect this has been asked before, but I haven't really found an appropriate answer here and also don't have the time to come up with my own solution...
If we have a user table with an int identity primary key, then our users get consecutive IDs as they register on the site.
Then we have a user public profile page at the site URL:
www.somesite.com/user/1234
where 1234 is the actual user ID. There is nothing vulnerable about exposing a user's ID per se, but it does give anyone the ability to check how many users are registered on my site... Manually increasing the number eventually gets me to an invalid profile.
This is the main reason why I want a reversible ID mapping to a seemingly random number with fixed length:
www.somesite.com/user/6123978458176573
Can you point me to a simple class that does this mapping? It is of course important that this mapping is simply reversible otherwise I'd have to save the mapping along with other user's data.
I want to avoid GUIDs
GUIDs are slower to search in an index because they're not consecutive, so SQL has to scan the whole index to match a particular GUID instead of just a particular calculated index page...
If I'd have ID + GUID then I would always need to fetch original user ID to do any meaningful data manipulation which is again speed degradation...
A mathematical reversible integer permutation seems the fastest solution...

I would 100% go with the "Add a GUID column to the table" approach. It will take seconds to generate one for each current user, and update your insert procedure to generate one for each new user. This is the best solution.
However, if you really don't want to take that approach, there are any number of obfuscation techniques you could use.
Simply Base64 encoding the string representation of your number is one (bad) way to do it.
    static public string EncodeTo64(string toEncode)
    {
        byte[] toEncodeAsBytes
            = System.Text.ASCIIEncoding.ASCII.GetBytes(toEncode);
        string returnValue
            = System.Convert.ToBase64String(toEncodeAsBytes);
        return returnValue;
    }

    static public string DecodeFrom64(string encodedData)
    {
        byte[] encodedDataAsBytes
            = System.Convert.FromBase64String(encodedData);
        string returnValue =
            System.Text.ASCIIEncoding.ASCII.GetString(encodedDataAsBytes);
        return returnValue;
    }
Bad because anyone with half an ounce of technical knowledge (hackers/scriptkiddies tend to have that in abundance) will instantly recognise the result as Base64 and easily reverse-engineer.
Edit: This blogpost Obfuscating IDs in URLs with Rails provides quite a workable example. Converting to C# gives you something like:
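    // Prime * PrimeInverse == 1 (mod 2^31), so DecodeId undoes EncodeId
    // for any input between 0 and int.MaxValue.
    static readonly int Prime = 1580030173;
    static readonly int PrimeInverse = 59260789;

    public static int EncodeId(int input)
    {
        // unchecked: the multiplication is meant to overflow;
        // the mask keeps the low 31 bits.
        return unchecked(input * Prime) & int.MaxValue;
    }

    public static int DecodeId(int input)
    {
        return unchecked(input * PrimeInverse) & int.MaxValue;
    }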
Input --> Output
1234 --> 1989564746
5678 --> 1372124598
5679 --> 804671123
This follow-up post by another author explains how to secure this a little bit more with a random XOR, as well as how to calculate Prime and PrimeInverse. I've just used the pre-canned ones from the original blog for the demo.
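As a rough illustration of those two ideas (not taken from the linked post), the sketch below computes the inverse of an odd multiplier with a Newton iteration and adds an XOR step. The class name IdObfuscator and the XorKey value are hypothetical; only Prime comes from the answer above.

```csharp
using System;

public static class IdObfuscator
{
    const uint Mask = 0x7FFFFFFF;       // int.MaxValue: keep results in 31 bits
    const uint Prime = 1580030173;      // multiplier from the answer above (must be odd)
    const uint XorKey = 0x2A5F3C71;     // hypothetical secret, any value below 2^31

    // Inverse of an odd number modulo 2^32 by Newton iteration.
    public static uint ModInverse(uint a)
    {
        unchecked
        {
            uint x = a;                 // a * a == 1 (mod 8), so 3 bits are already correct
            for (int i = 0; i < 5; i++)
                x *= 2u - a * x;        // each pass doubles the number of correct bits
            return x;
        }
    }

    // An inverse modulo 2^32 is also an inverse modulo 2^31 once masked.
    static readonly uint PrimeInverse = ModInverse(Prime) & Mask;

    // Valid for ids between 0 and int.MaxValue.
    public static int Encode(int id)
    {
        unchecked { return (int)((((uint)id * Prime) & Mask) ^ XorKey); }
    }

    public static int Decode(int code)
    {
        unchecked { return (int)((((uint)code ^ XorKey) * PrimeInverse) & Mask); }
    }
}
```

IdObfuscator.ModInverse(1580030173) returns 59260789, the PrimeInverse quoted above, and IdObfuscator.Decode(IdObfuscator.Encode(1234)) gives back 1234. The XOR only makes the constants slightly harder to guess; it does not turn this into encryption.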

Use UUIDs
Make another column in the user table for, e.g. 64 bit integers, and fill it with a random number (each time a new user registered - generate it and check it's unique). A number looks better than UUID, however a bit more coding required.
Use maths. ;) You could generate a pair of numbers X, Y such that X*Y = 1 (mod M). E.g. X=10000000019L, Y=1255114267L, and M=2^30. Then you will have two simple functions:

    // M is a power of two, so (M - 1) is the bitmask that reduces modulo M.
    long encode(long id)
    { return (id * X) & (M - 1); }

    long decode(long encodedId)
    { return (encodedId * Y) & (M - 1); }

This works for ids below M and produces nearly random-looking encoded ids. It's easy, but hackable: if someone bothers to attack it, they will be able to guess your numbers and decode the values too. I am not completely sure how hard that is, but as I remember it's not very easy.
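The second option above (a separate column filled with a random number that is checked for uniqueness) might look something like the sketch below. PublicIdGenerator and the taken set are hypothetical names; the set stands in for a lookup against a unique index on that column.

```csharp
using System;
using System.Collections.Generic;
using System.Security.Cryptography;

public static class PublicIdGenerator
{
    // Draws random non-negative 64-bit values until one is not already in use.
    // `taken` is a stand-in for the unique column in the user table.
    public static long NextUnique(ISet<long> taken)
    {
        var buffer = new byte[8];
        using (var rng = RandomNumberGenerator.Create())
        {
            while (true)
            {
                rng.GetBytes(buffer);
                // Clear the sign bit so the id is never negative.
                long candidate = BitConverter.ToInt64(buffer, 0) & long.MaxValue;
                if (taken.Add(candidate))   // Add returns false when the value already exists
                    return candidate;
            }
        }
    }
}
```

With 63 random bits a retry is extremely unlikely, but in a real database the unique constraint should still be the final arbiter, since two registrations can race each other.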

May I suggest that you use a UUID instead. This could be indexable and generated within a stored procedure when you add a new user to the database. This would mean either adding a new column to the database table or a new table containing UUIDs but with the User ID as related key.
edit
If you really want to avoid GUIDs, then why not use the user's "username" when they access their profile page? After all, I imagine that you don't assign a user an ID until they have entered valid information and that data has been saved into the database.

Generate a short code based on a unique string in C#

I'm just about to launch the beta of a new online service. Beta subscribers will be sent a unique "access code" that allows them to register for the service.
Rather than storing a list of access codes, I thought I would just generate a code based on their email, since this itself is unique.
My initial thought was to combine the email with a unique string and then Base64 encode it. However, I was looking for codes that are a bit shorter, say 5 digits long.

If the access code itself needs to be unique, it will be difficult to ensure against collisions. If you can tolerate a case where two users might, by coincidence, share the same access code, it becomes significantly easier.
Taking the base-64 encoding of the e-mail address concatenated with a known string, as proposed, could introduce a security vulnerability: the user could simply decode the access code and derive the algorithm used to generate it.
One option is to take the HMAC-SHA1 hash (System.Security.Cryptography.HMACSHA1) of the e-mail address with a known secret key. The output of the hash is a 20-byte sequence. You could then truncate the hash deterministically. For instance, in the following, GetCodeForEmail("test@example.org") gives a code of 'PE2WEG':
    using System.Security.Cryptography;
    using System.Text;

    // Characters allowed in the passcode. The length (32) divides 256 evenly,
    // so "hash byte % length" below picks every character with equal probability.
    static char[] ValidChars = {'2','3','4','5','6','7','8','9',
                                'A','B','C','D','E','F','G','H',
                                'J','K','L','M','N','P','Q',
                                'R','S','T','U','V','W','X','Y','Z'}; // len=32
    const string hashkey = "password"; // key for HMAC function -- change!
    const int codelength = 6;          // length of passcode

    string GetCodeForEmail(string address)
    {
        byte[] hash;
        using (HMACSHA1 sha1 = new HMACSHA1(Encoding.ASCII.GetBytes(hashkey)))
        {
            hash = sha1.ComputeHash(Encoding.UTF8.GetBytes(address));
        }
        // Use the last hash byte to choose where the 6-byte window starts.
        int startpos = hash[hash.Length - 1] % (hash.Length - codelength);
        StringBuilder passbuilder = new StringBuilder();
        for (int i = startpos; i < startpos + codelength; i++)
            passbuilder.Append(ValidChars[hash[i] % ValidChars.Length]);
        return passbuilder.ToString();
    }

You may create a special hash from their email which is less than 6 chars, but that wouldn't really make it "unique": there will always be collisions in such a small space. I'd rather go with a longer key, or store pre-generated codes in a table anyway.
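To put a rough number on "such a small space", the sketch below uses the standard birthday approximation, 1 - exp(-n(n-1)/(2N)), for n users and N possible codes. The class and method names are made up for illustration, and the formula assumes codes are spread uniformly.

```csharp
using System;

public static class CollisionEstimate
{
    // Approximate probability that at least two of `users` codes coincide
    // when each code is drawn uniformly from `codeSpace` possible values.
    public static double AnyCollision(long users, double codeSpace)
    {
        double pairs = (double)users * (users - 1) / 2.0;
        return 1.0 - Math.Exp(-pairs / codeSpace);
    }
}
```

For 5-digit codes (100,000 values) and 1,000 users, CollisionEstimate.AnyCollision(1000, 100000) comes out above 99%. For 6 characters from a 32-character alphabet (32^6 values) and 10,000 users it is still roughly 4.5%.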

So, it sounds like what you want to do here is to create a hash function specifically for emails, as @can poyragzoglu pointed out. A very simple one might look something like this:
(pseudo code)

    foreach char c in email:
        running total += [large prime] * [unicode value of c]
    code = running total % [large 5 digit number]

As he pointed out though, this will not be unique: with at most 100,000 possible 5-digit values, even an excellent hash function is likely to produce collisions. Not sure if that matters.
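A runnable C# variant of that idea is sketched below. It swaps the running sum for the 64-bit FNV-1a hash, which is position-sensitive (a plain sum gives the same total for any reordering of the same characters) and, unlike string.GetHashCode(), returns the same value on every run. The class name EmailCode and the trim/lower-case normalisation are assumptions.

```csharp
using System;
using System.Text;

public static class EmailCode
{
    const ulong FnvOffsetBasis = 0xCBF29CE484222325UL; // FNV-1a 64-bit offset basis
    const ulong FnvPrime = 0x100000001B3UL;            // FNV-1a 64-bit prime

    public static string For(string email)
    {
        // Assumption: addresses are compared ignoring case and outer whitespace.
        byte[] bytes = Encoding.UTF8.GetBytes(email.Trim().ToLowerInvariant());
        ulong hash = FnvOffsetBasis;
        unchecked
        {
            foreach (byte b in bytes)
            {
                hash ^= b;          // mix in the next byte
                hash *= FnvPrime;   // wraps around modulo 2^64 by design
            }
        }
        // Reduce to five decimal digits, zero-padded.
        return (hash % 100000UL).ToString("D5");
    }
}
```

There is no secret involved here, so anyone who knows the algorithm can compute the code for any address; the collision caveat above applies unchanged.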
What seems easier to me, is if you already know the valid emails, just check the user's email against your list of valid ones upon registration? Why bother with a code at all?
If you really want a unique identifier though, the easiest way to do this is probably to just use what's called a GUID. C# natively supports this. You could store this in your Users table. Though, it would be far too long for a user to ever remember/type out, it would almost certainly be unique for each one if that's what you're trying to do.
