Generate unique code from primary key - c#

I have a table of orders and I want to give users a unique code for an order whilst hiding the incrementing identity integer primary key because I don't want to give away how many orders have been made.
One easy way of making sure the codes are unique is to use the primary key to determine the code.
So how can I transform an integer into a friendly code of, say, eight alphanumeric characters such that every code is unique?

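The easiest way (if you want an alphanumeric code) is to convert the integer primary key to hexadecimal, as below. You can use `PadLeft()` to make sure the string has 8 characters. A 32-bit int never needs more than 8 hex digits, so 8 characters only stop being enough if the key becomes a 64-bit long (up to 16 digits). Note that hex is trivially reversible, so it disguises the order count rather than hides it.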
var uniqueCode = intPrimaryKey.ToString("X").PadLeft(8, '0');
Or, you can create an offset of your primary key, before converting it to HEX, like below:
var uniqueCode = (intPrimaryKey + 999).ToString("X").PadLeft(8, '0');

Assuming the total number of orders being created isn't going to get anywhere near the total number of identifiers in your pool, a reasonably effective technique is to simply generate a random identifier and see if it is used already; continue generating new identifiers until you find one not previously used.
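A minimal sketch of that generate-and-check loop, assuming an in-memory set stands in for the "is it used already?" lookup (in a real application this would more likely be a unique index on the orders table). The class name, the alphabet and the `usedCodes` parameter are illustrative choices, not part of the answer above.

```csharp
using System;
using System.Collections.Generic;
using System.Security.Cryptography;

public static class OrderCodes
{
    // 32 symbols; 0/O and 1/I are left out so codes are easier to read aloud.
    private const string Alphabet = "ABCDEFGHJKLMNPQRSTUVWXYZ23456789";

    // Keeps drawing random codes until one is not in usedCodes yet,
    // records it there, and returns it.
    public static string NewCode(ISet<string> usedCodes, int length = 8)
    {
        while (true)
        {
            var chars = new char[length];
            for (int i = 0; i < length; i++)
                chars[i] = Alphabet[RandomNumberGenerator.GetInt32(Alphabet.Length)];

            string code = new string(chars);
            if (usedCodes.Add(code))   // Add returns false when the code already exists
                return code;
        }
    }
}
```

With 32 symbols and 8 characters there are 32^8 (about 1.1 trillion) possible codes, so retries stay rare until a very large share of them is in use.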

A quick and easy way to do this is to have a guid column that has a default value of
left(newid(),8)
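This will generally give you a unique value for each row, but uniqueness is not guaranteed: 8 hex characters carry only 32 bits, so by the birthday bound a collision becomes likely after some tens of thousands of orders. If you expect that many, add a unique constraint and retry on conflict, or store the full newid() value instead.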

I would just use MD5 for this. MD5 offers enough "uniqueness" for a small subset of integers that represent your customer orders.
For an example see this answer. You will need to adjust input parameter from string to int (or alternatively just call ToString on your number and use the code as-is).

If you would like something that would be difficult to trace and you don't mind it being 16 characters, you could use something like this, which includes some random numbers and mixes the byte positions of the original input with them. (EDITED to make it a bit more untraceable by XOR-ing with the generated random numbers.)
using System;
using System.IO;   // InvalidDataException
using System.Linq; // Select
public static class OrderIdRandomizer
{
private static readonly Random _rnd = new Random();
public static string GenerateFor(int orderId)
{
var rndBytes = new byte[4];
_rnd.NextBytes(rndBytes);
var bytes = new byte[]
{
(byte)rndBytes[0],
(byte)(((byte)(orderId >> 8)) ^ rndBytes[0]),
(byte)(((byte)(orderId >> 24)) ^ rndBytes[1]),
(byte)rndBytes[1],
(byte)(((byte)(orderId >> 16)) ^ rndBytes[2]),
(byte)rndBytes[2],
(byte)(((byte)(orderId)) ^ rndBytes[3]),
(byte)rndBytes[3],
};
return string.Concat(bytes.Select(b => b.ToString("X2")));
}
public static int ReconstructFrom(string generatedId)
{
if (generatedId == null || generatedId.Length != 16)
throw new InvalidDataException("Invalid generated order id");
var bytes = new byte[8];
for (int i = 0; i < 8; i++)
bytes[i] = byte.Parse(generatedId.Substring(i * 2, 2), System.Globalization.NumberStyles.HexNumber);
return (int)(
((bytes[2] ^ bytes[3]) << 24) |
((bytes[4] ^ bytes[5]) << 16) |
((bytes[1] ^ bytes[0]) << 8) |
((bytes[6] ^ bytes[7])));
}
}
Usage:
var obfuscatedId = OrderIdRandomizer.GenerateFor(123456);
Console.WriteLine(obfuscatedId);
Console.WriteLine(OrderIdRandomizer.ReconstructFrom(obfuscatedId));
Disadvantage: If the algorithm is known, it is obviously easy to break.
Advantage: It is completely custom, i.e. not an established algorithm like MD5 that someone might guess and try first. Keep in mind that this is obscurity rather than real security.

Related

Generate a unique, random friend code from a user ID algorithmically

I am looking for a way to generate a random, unique 9 digit friend code for a user from a sequential user ID. The idea behind this is so people can't enumerate users by searching the friend codes one by one. If there are 1000 possible codes and 100 registered users, searching a random code should have a 10% chance of finding a user.
A possible way to do this is to generate a code randomly, check if the code is already in use, and if it is, try again. I am looking for an approach (mostly out of curiosity) where the friend code is generated algorithmically and is guaranteed to be unique for that user ID on the first try.
Specifically, given a range of numbers (1 to 999,999,999), running the function on this number should return another number in the same range, which is paired and unique to the input number. This pairing should only differ if the range changes and/or an input seed to the randomness changes.
An individual should ideally not be able to easily reverse engineer the user ID from the friend ID without knowing the seed and algorithm (or having a very large pool of samples and a lot of time - this does not need to be cryptographically secure), so simply subtracting the user ID from the maximum range is not a valid solution.
Here is some C# code that accomplishes what I am after by generating the entire range of numbers, shuffling the list, then retrieving a friend ID by treating the user ID as the list index:
int start = 1; // Starting number (inclusive)
int end = 999999999; // End number (inclusive)
Random random = new Random(23094823); // Random with a given seed
var friendCodeList = new List<int>();
friendCodeList.AddRange(Enumerable.Range(start, end + 1)); // Populate list
int n = friendCodeList.Count;
// Shuffle the list, this should be the same for a given start, end and seed
while (n > 1)
{
n--;
int k = random.Next(n + 1);
int value = friendCodeList[k];
friendCodeList[k] = friendCodeList[n];
friendCodeList[n] = value;
}
// Retrieve friend codes from the list
var userId = 1;
Console.WriteLine($"User ID {userId}: {friendCodeList[userId]:000,000,000}");
userId = 99999999;
Console.WriteLine($"User ID {userId}: {friendCodeList[userId]:000,000,000}");
userId = 123456;
Console.WriteLine($"User ID {userId}: {friendCodeList[userId]:000,000,000}");
User ID 1: 054,677,867
User ID 99999999: 237,969,637
User ID 123456: 822,632,399
Unfortunately, this is unsuitable for large ranges: this program takes 8GB of RAM to run, and with a 10 or 12 digit friend code it would not be feasible to pre-generate the list either in memory or in a database. I am looking for a solution that does not require this pre-generation step.
I am interested in solutions that use either a seeded random number generator or bitwise trickery to achieve this, if it is possible. The above function is reversible (by searching the values of the list) but the solution does not need to be.
Quick mathematics lesson!
You're thinking of developing a way to map one integer value (the original "secret" UserId value) to another (the (encrypted) "public" value) and back again. This is exactly what a block-cipher does (except each "block" is usually 16 bytes big instead of being a single character or integer value). So in other words, you want to create your own cryptosystem.
(Note that even if you're thinking of converting UserId 123 into a string instead of an integer, for example a YouTube video Id like "dQw4w9WgXcQ", it's still an integer, because every scalar value stored in a computer, including strings, can be represented as an integer; hence the "illegal primes" problem back in the late 1990s.)
And the biggest, most important take-away from any undergraduate-level computer-science class on cryptography is: never create your own cryptosystem!
With that out of the way...
Provided that security is not a top-concern...
...and you're only concerned with preventing disclosure of incrementing integer Id values (e.g. so your visitors and users don't see how many database records you really have) then use a Hashids library: https://hashids.org/
For .NET, use this NuGet package: https://www.nuget.org/packages/Hashids.net/
Overview for .NET: https://hashids.org/net/
Project page: https://github.com/ullmark/hashids.net
In your code, construct a single Hashids object (I'd use a public static readonly field or property - or better yet: a singleton injectable service) and use the .Encode method to convert any integer int/Int32 value into a string value.
To convert the string value back to the original int/Int32, use the .Decode method.
As an aside, I don't like how the library is called "Hashids" when hashes are meant to be one-way functions - because the values are still reversible - albeit by using a secret "salt" value (why isn't it called a "key"?) it isn't really a hash, imo.
If security really matters...
Then you need to treat each integer value as a discrete block in a block cipher (not a stream-cipher, because each value needs to be encrypted and decrypted independently by itself).
For the purposes of practicality, you need to use a symmetric block cipher with a small block-size. Unfortunately many block ciphers with small block sizes aren't very good (TripleDES has a block size of 64-bits - but it's weak today), so let's stick with AES.
AES has a block-size of 128 bits (16 bytes) - that's the same as two Int64 integers concatenated with each other. Assuming you use base64url encoding on a 16-byte value then your output will be 22 characters long (as Base64 uses 6 bits per character). If you're comfortable with strings of this length then you're all set. The shortest URL-safe string you can generate from a 128-bit value is 21 (hardly an improvement at all) because Base-73 is the most you can safely use in a URL that will survive all modern URL-transmission systems (never automatically assume Unicode is supported anywhere when dealing with plaintext).
It is possible to adapt AES to generate smaller output block-sizes, but it won't work in this case because using techniques like CTR Mode mean that the generated output needs to include extra state information (IV, counter, etc) which will end-up taking up the same amount of space as was gained.
Very important notes:
This uses CBC mode with a fixed IV and no padding, which means the same input always results in the same output (that's required by design!). Because every value is exactly one 16-byte block, each value is encrypted independently of the others.
This re-uses the same IV. That is intentional and actually desirable for this application, but generally speaking do not reuse IVs when using AES for any other purpose, and make sure you understand what you're doing.
Here's the code:
// Requires: using System; using System.Security.Cryptography;
private static readonly Byte[] _key = new Byte[] { }; // Fill in your own secret key. Must be 128, 192 or 256 bits (16, 24, or 32 bytes) in length.
private static readonly Byte[] _iv = new Byte[16]; // AES needs a 16-byte IV. You could use the default all-zeroes.
// Note that these methods work with Int64 arguments.
private static Byte[] ProcessBlock( Byte[] inputBlock, Boolean encrypt )
{
    using( Aes aes = Aes.Create() )
    {
        aes.Mode = CipherMode.CBC;
        aes.Padding = PaddingMode.None; // The input is always exactly one block, so the output stays at 16 bytes.
        aes.Key = _key;
        aes.IV = _iv;
        using( ICryptoTransform xform = encrypt ? aes.CreateEncryptor() : aes.CreateDecryptor() )
        {
            return xform.TransformFinalBlock( inputBlock, 0, inputBlock.Length );
        }
    }
}
public static Byte[] EncryptInteger( Int64 value )
{
Byte[] inputBlock = new Byte[16];
inputBlock[0] = (Byte)(value >> 0 & 0xFF);
inputBlock[1] = (Byte)(value >> 8 & 0xFF);
inputBlock[2] = (Byte)(value >> 16 & 0xFF);
inputBlock[3] = (Byte)(value >> 24 & 0xFF);
inputBlock[4] = (Byte)(value >> 32 & 0xFF);
inputBlock[5] = (Byte)(value >> 40 & 0xFF);
inputBlock[6] = (Byte)(value >> 48 & 0xFF);
inputBlock[7] = (Byte)(value >> 56 & 0xFF);
return ProcessBlock( inputBlock, encrypt: true );
}
public static Int64 DecryptInteger( Byte[] block )
{
    Byte[] outputBlock = ProcessBlock( block, encrypt: false );
    return
        (Int64)outputBlock[0] << 0 |
        (Int64)outputBlock[1] << 8 |
        (Int64)outputBlock[2] << 16 |
        (Int64)outputBlock[3] << 24 |
        (Int64)outputBlock[4] << 32 |
        (Int64)outputBlock[5] << 40 |
        (Int64)outputBlock[6] << 48 |
        (Int64)outputBlock[7] << 56;
}
public static String EncryptIntegerToString( Int64 value ) => Convert.ToBase64String( EncryptInteger( value ) ).Replace( '+', '-' ).Replace( '/', '_' );
public static Int64 DecryptIntegerFromString( String base64Url )
{
if( String.IsNullOrWhiteSpace( base64Url ) ) throw new ArgumentException( message: "Invalid string.", paramName: nameof(base64Url) );
// Convert Base64Url to Base64:
String base64 = base64Url.Replace( '-', '+' ).Replace( '_', '/' );
Byte[] block = Convert.FromBase64String( base64 );
return DecryptInteger( block );
}
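If a 22-character string is too long and real security is not required, the block-cipher idea can also be applied directly to the 1 to 999,999,999 range from the question. The following is only a sketch under stated assumptions: it uses a small Feistel network over 30 bits plus "cycle walking" (re-applying the permutation until the result falls back into range). The class name, round function and round keys are hypothetical, and this construction is an obfuscation, not a vetted cipher.

```csharp
using System;

public static class FriendCodes
{
    private const uint Range = 999_999_999;            // valid ids and codes are 1..Range
    private const int HalfBits = 15;                   // two 15-bit halves = 30 bits, enough for Range
    private const uint HalfMask = (1u << HalfBits) - 1;

    // Hypothetical round keys; in practice derive them from your secret seed.
    private static readonly uint[] RoundKeys = { 0x5A17, 0x1C3D, 0x7E29, 0x2B6F };

    // Round function: any deterministic function works, it need not be invertible.
    private static uint Round(uint half, uint key)
    {
        unchecked
        {
            uint x = (half ^ key) * 0x9E3779B1u;
            x ^= x >> 15;
            return x & HalfMask;
        }
    }

    // One pass of the Feistel network over a 30-bit value, forwards or backwards.
    private static uint Permute(uint v, bool forward)
    {
        uint left = v >> HalfBits, right = v & HalfMask;
        for (int i = 0; i < RoundKeys.Length; i++)
        {
            if (forward)
            {
                uint next = left ^ Round(right, RoundKeys[i]);
                left = right;
                right = next;
            }
            else
            {
                uint prev = right ^ Round(left, RoundKeys[RoundKeys.Length - 1 - i]);
                right = left;
                left = prev;
            }
        }
        return (left << HalfBits) | right;
    }

    private static uint Walk(uint id, bool forward)
    {
        if (id < 1 || id > Range) throw new ArgumentOutOfRangeException(nameof(id));
        uint v = id - 1;
        do { v = Permute(v, forward); } while (v >= Range);   // cycle-walk back into range
        return v + 1;
    }

    public static uint Encode(uint userId) => Walk(userId, forward: true);
    public static uint Decode(uint code) => Walk(code, forward: false);
}
```

Because a Feistel network is a permutation of all 30-bit values whatever the round function does, and cycle walking restricts a permutation to a sub-range without losing that property, every user ID maps to exactly one code and back, with no pre-generated list.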
A simple method like this can produce a long sequence of numbers provided you get the constants right.
ulong Next(ulong current)
{
unchecked
{
return (999_999_937L * current + 383_565_383L) % 999_999_999L;
}
}
From memory, this kind of function can produce a sequence of 999_999_999 numbers (the full period) if the values in the function are chosen correctly.
My test code shows that this method, with the constants above, produces 500_499 numbers before repeating.
My computer can produce the entire sequence in just under 9 milliseconds so it is a fairly fast algorithm.
The first ten elements of this sequence (with leading '0's padded) are:
383565383, 602511613, 027845340, 657154301, 639998680, 703647183, 757439993, 422285770, 201847617, 869013116
5_960_464 * current + 383_565_383L gives a sequence length of 1_000_998 before repetition.
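What "chosen correctly" means is spelled out by the Hull-Dobell theorem, which can be checked mechanically. The sketch below is an illustration rather than part of the answer above; the class and method names are made up, and the multiplier 37_037_038 mentioned afterwards is simply one value that happens to satisfy the conditions.

```csharp
using System;

public static class LcgPeriod
{
    private static ulong Gcd(ulong a, ulong b)
    {
        while (b != 0) { (a, b) = (b, a % b); }
        return a;
    }

    // Hull-Dobell theorem: x -> (a * x + c) mod m visits all m values iff
    //   1. c and m are coprime,
    //   2. a - 1 is divisible by every prime factor of m,
    //   3. a - 1 is divisible by 4 if m is divisible by 4.
    public static bool HasFullPeriod(ulong a, ulong c, ulong m)
    {
        if (Gcd(c, m) != 1) return false;

        ulong aMinus1 = (a % m + m - 1) % m;          // (a - 1) reduced mod m
        if (m % 4 == 0 && aMinus1 % 4 != 0) return false;

        // Trial division: check each prime factor p of m against a - 1.
        ulong rest = m;
        for (ulong p = 2; p * p <= rest; p++)
        {
            if (rest % p != 0) continue;
            if (aMinus1 % p != 0) return false;
            while (rest % p == 0) rest /= p;
        }
        // Whatever remains is either 1 or the last prime factor.
        return rest == 1 || aMinus1 % rest == 0;
    }
}
```

Since 999_999_999 = 3^4 * 37 * 333_667, the multiplier 999_999_937 fails the second condition (999_999_936 is not divisible by 37), which is consistent with the short cycle reported above. A multiplier such as 37_037_038 passes all three conditions, but a full period says nothing about how random the sequence looks, so this alone does not make it a good scrambler.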

Defining a bit[] array in C#

Currently I'm working on a solution for a prime-number calculator/checker. The algorithm is already working and very efficient (0.359 seconds for the first 9012330 primes). Here is a part of the upper region where everything is declared:
const uint anz = 50000000;
uint a = 3, b = 4, c = 3, d = 13, e = 12, f = 13, g = 28, h = 32;
bool[,] prim = new bool[8, anz / 10];
uint max = 3 * (uint)(anz / (Math.Log(anz) - 1.08366));
uint[] p = new uint[max];
Now I wanted to go to the next level and use ulongs instead of uints to cover a larger range (you can see that already), which is where I ran into my problem: the bool array.
As is well known, a bool occupies a whole byte, which takes a lot of memory when creating the array. So I'm searching for a more resource-friendly way to do that.
My first idea was a bit array (bits, not bytes!) to store the bools, but I haven't figured out how to do that yet. So if someone has ever done something like this, I would appreciate any kind of tips and solutions. Thanks in advance :)
You can use BitArray collection:
http://msdn.microsoft.com/en-us/library/system.collections.bitarray(v=vs.110).aspx
MSDN Description:
Manages a compact array of bit values, which are represented as Booleans, where true indicates that the bit is on (1) and false indicates the bit is off (0).
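As a rough illustration of how BitArray could replace the bool array in a sieve, here is a small, self-contained sketch. It is not the asker's algorithm, and the class and method names are made up. Note that BitArray takes an int length, so it tops out at roughly two billion bits.

```csharp
using System;
using System.Collections;

public static class BitArraySieve
{
    // Sieve of Eratosthenes over [0, limit], using one bit per number
    // instead of one byte per bool. Assumes limit >= 0.
    public static BitArray Sieve(int limit)
    {
        var isPrime = new BitArray(limit + 1, true);   // start with every bit set
        isPrime[0] = false;
        if (limit >= 1) isPrime[1] = false;

        for (long p = 2; p * p <= limit; p++)
        {
            if (!isPrime[(int)p]) continue;
            for (long m = p * p; m <= limit; m += p)   // long avoids overflow in p * p
                isPrime[(int)m] = false;
        }
        return isPrime;
    }
}
```

Usage: `BitArraySieve.Sieve(100)[97]` is true and `BitArraySieve.Sieve(100)[91]` is false (91 = 7 * 13).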
You can (and should) use well tested and well known libraries.
But if you're looking to learn something (as it seems to be the case) you can do it yourself.
Another reason you may want to use a custom bit array is to use the hard drive to store the array, which comes in handy when calculating primes. To do this you'd need to further split addr, for example lowest 3 bits for the mask, next 28 bits for 256MB of in-memory storage, and from there on - a file name for a buffer file.
Yet another reason for a custom bit array is to compress the memory use when specifically searching for primes. After all, half of your bits will be 'false' simply because the numbers corresponding to them are even, so you can both speed up your calculation AND reduce memory requirements if you don't store the even bits at all. You can do that by changing the way addr is interpreted. Furthermore, you can also exclude numbers divisible by 3 (only 2 out of every 6 numbers have a chance of being prime), thus reducing memory requirements by about 67% (two thirds) compared to a plain bit array.
Notice the use of shift and logical operators to make the code a bit more efficient.
byte mask = (byte)(1 << (int)(addr & 7)); for example can be written as
byte mask = (byte)(1 << (int)(addr % 8));
and addr >> 3 can be written as addr / 8
Testing shift/logical operators vs division shows 2.6s vs 4.8s in favor of shift/logical for 200000000 operations.
Here's the code:
void Main()
{
var barr = new BitArray(10);
barr[4] = true;
Console.WriteLine("Is it "+barr[4]);
Console.WriteLine("Is it Not "+barr[5]);
}
public class BitArray{
    private readonly byte[] _buffer;
    public bool this[long addr]{
        get{
            byte mask = (byte)(1 << (int)(addr & 7));
            byte val = _buffer[addr >> 3];
            return (val & mask) != 0;
        }
        set{
            byte mask = (byte)(1 << (int)(addr & 7));
            long offs = addr >> 3; // shift first: casting addr to int before shifting would truncate large addresses
            if (value)
                _buffer[offs] = (byte)(_buffer[offs] | mask);  // set the bit
            else
                _buffer[offs] = (byte)(_buffer[offs] & ~mask); // clear the bit (OR-ing alone can never reset it)
        }
    }
    public BitArray(long size){
        _buffer = new byte[size/8 + 1]; // define a byte buffer sized to hold 8 bools per byte. The spare +1 is to avoid dealing with rounding.
    }
}
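The "don't store the even bits" idea mentioned above could look like the sketch below, where index i stands for the odd number 2 * i + 1. It uses System.Collections.BitArray for brevity; the class and method names are illustrative, and excluding multiples of 3 as well is left out to keep the indexing readable.

```csharp
using System;
using System.Collections;

public static class OddSieve
{
    // Returns the number of primes <= limit while storing flags for odd numbers only,
    // which halves the memory of a plain bit array.
    public static int CountPrimes(int limit)
    {
        if (limit < 2) return 0;

        int slots = (limit + 1) / 2;            // index i stands for the odd number 2 * i + 1
        var composite = new BitArray(slots);    // all false initially
        int count = 1;                          // the prime 2, which has no slot

        for (int i = 1; i < slots; i++)         // i = 0 is the number 1, not a prime
        {
            if (composite[i]) continue;
            count++;
            long p = 2L * i + 1;
            // Only odd multiples need marking: p*p, p*p + 2p, p*p + 4p, ...
            for (long m = p * p; m <= limit; m += 2 * p)
                composite[(int)(m / 2)] = true; // m is odd, so m / 2 equals (m - 1) / 2
        }
        return count;
    }
}
```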

How can I best generate a static array of random number on demand?

An application I'm working on requires a matrix of random numbers. The matrix can grow in any direction at any time, and isn't always full. (I'll probably end up re-implementing it with a quad tree or something else, rather than a matrix with a lot of null objects.)
I need a way to generate the same matrix, given the same seed, no matter in which order I calculate the matrix.
LazyRandomMatrix rndMtx1 = new LazyRandomMatrix(1234) // Seed new object
float X = rndMtx1[0,0] // Lazily generate random numbers on demand
float Y = rndMtx1[3,16]
float Z = rndMtx1[23,-5]
Debug.Assert(X == rndMtx1[0,0])
Debug.Assert(Y == rndMtx1[3,16])
Debug.Assert(Z == rndMtx1[23,-5])
LazyRandomMatrix rndMtx2 = new LazyRandomMatrix(1234) // Seed second object
Debug.Assert(Y == rndMtx2[3,16]) // Lazily generate the same random numbers
Debug.Assert(Z == rndMtx2[23,-5]) // on demand in a different order
Debug.Assert(X == rndMtx2[0,0])
Yes, if I knew the dimensions of the array, the best way would be to generate the entire array, and just return values, but they need to be generated independently and on demand.
My first idea was to initialize a new random number generator for each call to a new coordinate, seeding it with some hash of the overall matrix's seed and the coordinates used in calling, but this seems like a terrible hack, as it would require creating a ton of new Random objects.
What you're talking about is typically called "Perlin Noise", here's a link for you: http://freespace.virgin.net/hugo.elias/models/m_perlin.htm
The most important thing in that article is the noise function in 2D:
function Noise1(integer x, integer y)
n = x + y * 57
n = (n<<13) ^ n;
return ( 1.0 - ( (n * (n * n * 15731 + 789221) + 1376312589) & 7fffffff) / 1073741824.0);
end function
It returns a number between -1.0 and +1.0 based on the x and y coordinates alone (and a hard-coded seed that you can change randomly at the start of your app or just leave as it is).
The rest of the article is about interpolating these numbers, but depending on how random you want these numbers, you can just leave them as they are. Note that these numbers will be utterly random. If you instead apply a Cosine Interpolator and use the generated noise every 5-6 indexes, interpolating in between, you get heightmap data (which is what I used it for). Skip it for totally random data.
A standard random generator is usually a sequence generator, where each element is built from the previous one. So to generate rndMtx1[3,16] you have to generate all previous elements in the sequence.
Actually you need something different from a random generator, because you need only one value, not the sequence. You have to build your own "generator" which uses the seed and the indexes as input to a formula that produces a single random value. You can invent many ways to do so. One of the simplest is to take the random value as hash(seed + index) (I guess the idea of the hashes used in passwords and signing is to produce some stable "random" value out of input data).
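For reference, a possible C# port of the Noise1 pseudocode is sketched below. The `unchecked` block reproduces the 32-bit wrap-around the original relies on. The `seed` parameter and the way it is mixed in are assumptions added for illustration and are not part of the article.

```csharp
using System;

public static class HashNoise
{
    // Port of the Noise1 pseudocode; returns a value in (-1.0, 1.0].
    // The seed parameter (and its multiplier 131) is a hypothetical addition.
    public static double Noise(int x, int y, int seed = 0)
    {
        unchecked   // integer overflow must wrap around silently
        {
            int n = x + y * 57 + seed * 131;
            n = (n << 13) ^ n;
            int t = (n * (n * n * 15731 + 789221) + 1376312589) & 0x7fffffff;
            return 1.0 - t / 1073741824.0;
        }
    }
}
```

One limitation inherited from the formula: x and y are folded into a single number with x + y * 57, so coordinates such as (57, 0) and (0, 1) produce the same value. For a matrix wider than 57 cells, a larger multiplier or a proper hash of both coordinates would be needed.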
P.S. You can improve your approach with independent generators (Random(seed + index)) by making lazy blocks of matrix.
I think your first idea of instantiating a new Random object seeded by some deterministic hash of (x-coordinate, y-coordinate, LazyRandomMatrix seed) is probably reasonable for most situations. In general, creating lots of small objects on the managed heap is something the CLR is very good at handling efficiently. And I don't think Random.ctor() is terribly expensive. You can easily measure the performance if it's a concern.
A very similar solution which may be easier than creating a good deterministic hash is to use two Random objects each time. Something like:
public int this[int x, int y]
{
get
{
// Caveat: _seed * x is 0 whenever x == 0, and r2 does not depend on the seed at all.
Random r1 = new Random(_seed * x);
Random r2 = new Random(y);
return (r1.Next() ^ r2.Next());
}
}
Here is a solution based on a SHA1 hash. Basically this takes the bytes for the X, Y and Seed values and packs them into a byte array. It then computes a hash of the byte array and uses the first 4 bytes of the hash to generate an int. This should be pretty random.
public class LazyRandomMatrix
{
private int _seed;
private readonly SHA1 _hashProvider = SHA1.Create(); // SHA1CryptoServiceProvider is marked obsolete in current .NET
public LazyRandomMatrix(int seed)
{
_seed = seed;
}
public int this[int x, int y]
{
get
{
byte[] data = new byte[12];
Buffer.BlockCopy(BitConverter.GetBytes(x), 0, data, 0, 4);
Buffer.BlockCopy(BitConverter.GetBytes(y), 0, data, 4, 4);
Buffer.BlockCopy(BitConverter.GetBytes(_seed), 0, data, 8, 4);
byte[] hash = _hashProvider.ComputeHash(data);
return BitConverter.ToInt32(hash, 0);
}
}
}
PRNGs can be built out of hash functions.
This is what e.g. MS Research did with parallelizing random number generation with MD5 or others with TEA on a GPU.
(In fact, PRNGs can be thought of as a hash function from (seed, state) => nextnumber.)
Generating massive amounts of random numbers on a GPU brings up similar problems.
(E.g., to make it parallel, there should not be a single shared state.)
Although it is more common in the crypto world, using a crypto hash, I have taken the liberty to use MurmurHash 2.0 for sake of speed and simplicity. It has very good statistical properties as a hash function. A related, but not identical test shows that it gives good results as a PRNG. (unless I have fsc#kd up something in the C# code, that is.:) Feel free to use any other suitable hash function; crypto ones (MD5, TEA, SHA) as well - though crypto hashes tend to be much slower.
public class LazyRandomMatrix
{
private uint seed;
public LazyRandomMatrix(int seed)
{
this.seed = (uint)seed;
}
public int this[int x, int y]
{
get
{
return MurmurHash2((uint)x, (uint)y, seed);
}
}
static int MurmurHash2(uint key1, uint key2, uint seed)
{
// 'm' and 'r' are mixing constants generated offline.
// They're not really 'magic', they just happen to work well.
const uint m = 0x5bd1e995;
const int r = 24;
// Initialize the hash to a 'random' value
uint h = seed ^ 8;
// Mix 4 bytes at a time into the hash
key1 *= m;
key1 ^= key1 >> r;
key1 *= m;
h *= m;
h ^= key1;
key2 *= m;
key2 ^= key2 >> r;
key2 *= m;
h *= m;
h ^= key2;
// Do a few final mixes of the hash to ensure the last few
// bytes are well-incorporated.
h ^= h >> 13;
h *= m;
h ^= h >> 15;
return (int)h;
}
}
A pseudo-random number generator is essentially a function that deterministically calculates a successor for a given value.
You could invent a simple algorithm that calculates a value from its neighbours. If a neighbour doesn't have a value yet, calculate its value from its respective neighbours first.
Something like this:
value(0,0) = seed
value(x+1,0) = successor(value(x,0))
value(x,y+1) = successor(value(x,y))
Example with successor(n) = n+1 to calculate value(2,4):
  \ x    0     1     2
 y  +------------------
 0  |  627   628   629
 1  |              630
 2  |              631
 3  |              632
 4  |              633
This example algorithm is obviously not very good, but you get the idea.
You want a random number generator with random access to the elements, instead of sequential access. (Note that you can reduce your two coordinates into a single index i.e. by computing i = x + (y << 16).)
A cool example of such a generator is Blum Blum Shub, which is a cryptographically secure PRNG with easy random-access. Unfortunately, it is very slow.
A more practical example is the well-known linear congruential generator. You can easily modify one to allow random access. Consider the definition:
X(0) = S
X(n) = B + X(n-1)*A (mod M)
Evaluating this directly would take O(n) time (that's pseudo linear, not linear), but you can convert to a non-recursive form:
//Expand a few times to see the pattern:
X(n) = B + X(n-1)*A (mod M)
X(n) = B + (B + X(n-2)*A)*A (mod M)
X(n) = B + (B + (B + X(n-3)*A)*A)*A (mod M)
//Aha! I see it now, and I can reduce it to a closed form:
X(n) = B + B*A + B*A*A + ... + B*A^(n-1) + S*A^n (mod M)
X(n) = S*A^n + B*SUM[i:0..n-1](A^i) (mod M)
X(n) = S*A^n + B*(A^n - 1)/(A - 1) (mod M)
That last equation can be computed relatively quickly, although the second part of it is a bit tricky to get right (because division doesn't distribute over mod the same way addition and multiplication do).
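One way to sidestep the division entirely is to treat a single LCG step as the affine map x -> A*x + B and compose that map with itself by repeated squaring, which gives X(n) in O(log n) steps. This is a sketch of that alternative rather than the closed form above; the class and method names are made up, and it assumes M < 2^32 so that intermediate products fit in a ulong.

```csharp
using System;

public static class LcgJump
{
    // Returns X(n) for X(0) = s, X(k) = (a * X(k-1) + b) mod m.
    // Assumes m < 2^32 so that every product below fits in a ulong.
    public static ulong At(ulong s, ulong a, ulong b, ulong m, ulong n)
    {
        ulong accA = 1, accB = 0;             // accumulated map, starts as identity x -> 1*x + 0
        ulong curA = a % m, curB = b % m;     // the step applied 2^k times, as x -> curA*x + curB

        while (n > 0)
        {
            if ((n & 1) == 1)
            {
                // acc = cur after acc: x -> curA*(accA*x + accB) + curB
                accA = curA * accA % m;
                accB = (curA * accB + curB) % m;
            }
            // cur = cur after cur: x -> curA*(curA*x + curB) + curB
            curB = (curA * curB + curB) % m;
            curA = curA * curA % m;
            n >>= 1;
        }
        return (accA * (s % m) + accB) % m;
    }
}
```

The two coordinates can then be folded into n (for example with the i = x + (y << 16) trick mentioned above) to read any matrix cell directly.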
As far as I see, there are 2 basic algorithms possible here:
Generate a new random number based on func(seed, coord) for each coord
Generate a single random number from seed, and then transform it for the coord (something like rotate(x) + translate(y) or whatever)
In the first case, you have the problem of always generating new random numbers, although this may not be as expensive as you fear.
In the second case, the problem is that you may lose randomness during your transformation operations. However, in either case the result is reproducible.

Quick and Simple Hash Code Combinations

Can people recommend quick and simple ways to combine the hash codes of two objects? I am not too worried about collisions since I have a hash table which will handle that efficiently; I just want something that generates a code as quickly as possible.
Reading around SO and the web there seem to be a few main candidates:
XORing
XORing with Prime Multiplication
Simple numeric operations like multiplication/division (with overflow checking or wrapping around)
Building a String and then using the String classes Hash Code method
What would people recommend and why?
I would personally avoid XOR - it means that any two equal values will result in 0 - so hash(1, 1) == hash(2, 2) == hash(3, 3) etc. Also hash(5, 0) == hash(0, 5) etc which may come up occasionally. I have deliberately used it for set hashing - if you want to hash a sequence of items and you don't care about the ordering, it's nice.
I usually use:
unchecked
{
int hash = 17;
hash = hash * 31 + firstField.GetHashCode();
hash = hash * 31 + secondField.GetHashCode();
return hash;
}
That's the form that Josh Bloch suggests in Effective Java. Last time I answered a similar question I managed to find an article where this was discussed in detail - IIRC, no-one really knows why it works well, but it does. It's also easy to remember, easy to implement, and easy to extend to any number of fields.
If you are using .NET Core 2.1 or later or .NET Framework 4.6.1 or later, consider using the System.HashCode struct to help with producing composite hash codes. It has two modes of operation: Add and Combine.
An example using Combine, which is usually simpler and works for up to eight items:
public override int GetHashCode()
{
return HashCode.Combine(object1, object2);
}
An example of using Add:
public override int GetHashCode()
{
var hash = new HashCode();
hash.Add(this.object1);
hash.Add(this.object2);
return hash.ToHashCode();
}
Pros:
Part of .NET itself, as of .NET Core 2.1/.NET Standard 2.1 (though, see con below)
For .NET Framework 4.6.1 and later, the Microsoft.Bcl.HashCode NuGet package can be used to backport this type.
Looks to have good performance and mixing characteristics, based on the work the author and the reviewers did before merging this into the corefx repo
Handles nulls automatically
Overloads that take IEqualityComparer instances
Cons:
Not available on .NET Framework before .NET 4.6.1. HashCode is part of .NET Standard 2.1. As of September 2019, the .NET team has no plans to support .NET Standard 2.1 on the .NET Framework, as .NET Core/.NET 5 is the future of .NET.
General purpose, so it won't handle super-specific cases as well as hand-crafted code
While the template outlined in Jon Skeet's answer works well in general as a hash function family, the choice of the constants is important and the seed of 17 and factor of 31 as noted in the answer do not work well at all for common use cases. In most use cases, the hashed values are much closer to zero than int.MaxValue, and the number of items being jointly hashed are a few dozen or less.
For hashing an integer tuple {x, y} where -1000 <= x <= 1000 and -1000 <= y <= 1000, it has an abysmal collision rate of almost 98.5%. For example, {1, 0} -> {0, 31}, {1, 1} -> {0, 32}, etc. If we expand the coverage to also include n-tuples where 3 <= n <= 25, it does less terrible with a collision rate of about 38%. But we can do much better.
public static int CustomHash(int seed, int factor, params int[] vals)
{
int hash = seed;
foreach (int i in vals)
{
hash = (hash * factor) + i;
}
return hash;
}
I wrote a Monte Carlo sampling search loop that tested the method above with various values for seed and factor over various random n-tuples of random integers i. Allowed ranges were 2 <= n <= 25 (where n was random but biased toward the lower end of the range) and -1000 <= i <= 1000. At least 12 million unique collision tests were performed for each seed and factor pair.
After about 7 hours running, the best pair found (where the seed and factor were both limited to 4 digits or less) was: seed = 1009, factor = 9176, with a collision rate of 0.1131%. In the 5- and 6-digit areas, even better options exist. But I selected the top 4-digit performer for brevity, and it performs quite well in all common int and char hashing scenarios. It also seems to work fine with integers of much greater magnitudes.
It is worth noting that "being prime" did not seem to be a general prerequisite for good performance as a seed and/or factor although it likely helps. 1009 noted above is in fact prime, but 9176 is not. I explicitly tested variations on this where I changed factor to various primes near 9176 (while leaving seed = 1009) and they all performed worse than the above solution.
Lastly, I also compared against the generic ReSharper recommendation function family of hash = (hash * factor) ^ i; and the original CustomHash() as noted above seriously outperforms it. The ReSharper XOR style seems to have collision rates in the 20-30% range for common use case assumptions and should not be used in my opinion.
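The 2-tuple part of these figures is small enough to check exhaustively rather than by sampling. The sketch below is not the author's Monte Carlo harness; it simply counts distinct hash values over every pair {x, y} with -1000 <= x, y <= 1000, and the class and method names are made up.

```csharp
using System;
using System.Collections.Generic;

public static class PairCollisions
{
    // Same formula as CustomHash above, with the wrap-around made explicit.
    public static int CustomHash(int seed, int factor, params int[] vals)
    {
        int hash = seed;
        foreach (int v in vals)
            hash = unchecked(hash * factor + v);
        return hash;
    }

    // Fraction of pairs that fail to get a hash value of their own.
    public static double CollisionRate(int seed, int factor, int bound = 1000)
    {
        var seen = new HashSet<int>();
        long total = 0;
        for (int x = -bound; x <= bound; x++)
        {
            for (int y = -bound; y <= bound; y++)
            {
                seen.Add(CustomHash(seed, factor, x, y));
                total++;
            }
        }
        return 1.0 - (double)seen.Count / total;
    }
}
```

With seed 17 and factor 31 the hash reduces to a constant plus 31 * x + y, which can take only 64,001 distinct values across the 4,004,001 pairs, a rate of roughly 98.4% and in line with the figure quoted above. With seed 1009 and factor 9176 the factor is larger than the whole range of y, so no two pairs on this grid collide.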
Use the combination logic built into tuples. The example uses C# 7 value tuples.
(field1, field2).GetHashCode();
I presume that .NET Framework team did a decent job in testing their System.String.GetHashCode() implementation, so I would use it:
// System.String.GetHashCode(): http://referencesource.microsoft.com/#mscorlib/system/string.cs,0a17bbac4851d0d4
// System.Web.Util.StringUtil.GetStringHashCode(System.String): http://referencesource.microsoft.com/#System.Web/Util/StringUtil.cs,c97063570b4e791a
public static int CombineHashCodes(IEnumerable<int> hashCodes)
{
int hash1 = (5381 << 16) + 5381;
int hash2 = hash1;
int i = 0;
foreach (var hashCode in hashCodes)
{
if (i % 2 == 0)
hash1 = ((hash1 << 5) + hash1 + (hash1 >> 27)) ^ hashCode;
else
hash2 = ((hash2 << 5) + hash2 + (hash2 >> 27)) ^ hashCode;
++i;
}
return hash1 + (hash2 * 1566083941);
}
Another implementation is from System.Web.Util.HashCodeCombiner.CombineHashCodes(System.Int32, System.Int32) and System.Array.CombineHashCodes(System.Int32, System.Int32) methods. This one is simpler, but probably doesn't have such a good distribution as the method above:
// System.Web.Util.HashCodeCombiner.CombineHashCodes(System.Int32, System.Int32): http://referencesource.microsoft.com/#System.Web/Util/HashCodeCombiner.cs,21fb74ad8bb43f6b
// System.Array.CombineHashCodes(System.Int32, System.Int32): http://referencesource.microsoft.com/#mscorlib/system/array.cs,87d117c8cc772cca
public static int CombineHashCodes(IEnumerable<int> hashCodes)
{
int hash = 5381;
foreach (var hashCode in hashCodes)
hash = ((hash << 5) + hash) ^ hashCode;
return hash;
}
This is a repackaging of Special Sauce's brilliantly researched solution.
It makes use of Value Tuples (ITuple).
This allows defaults for the parameters seed and factor.
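// ITuple lives in System.Runtime.CompilerServices; extension methods must be declared in a static class.
public static int CombineHashes(this ITuple tupled, int seed = 1009, int factor = 9176)
{
    var hash = seed;
    for (var i = 0; i < tupled.Length; i++)
    {
        unchecked
        {
            // Treat null elements as 0 instead of throwing.
            hash = hash * factor + (tupled[i]?.GetHashCode() ?? 0);
        }
    }
    return hash;
}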
Usage:
var hash1 = ("Foo", "Bar", 42).CombineHashes();
var hash2 = ("Jon", "Skeet", "Constants").CombineHashes(seed: 17, factor: 31);
If your input hashes are the same size, evenly distributed and not related to each other then an XOR should be OK. Plus it's fast.
The situation I'm suggesting this for is where you want to do
H = hash(A) ^ hash(B); // A and B are different types, so there's no way A == B.
Of course, if A and B can be expected to hash to the same value with a non-negligible probability, then you should not use XOR in this way.
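To make that caveat concrete, here is a small sketch. XorDemo and XorCombine are hypothetical names, not part of any library. It only shows two properties of plain XOR: equal inputs cancel to zero, and the order of the inputs does not matter.

```csharp
public static class XorDemo
{
    // Plain XOR combination, as in H = hash(A) ^ hash(B).
    public static int XorCombine(int hashA, int hashB) => hashA ^ hashB;
}
```

XorDemo.XorCombine(7, 7) is 0, and XorDemo.XorCombine(3, 5) and XorDemo.XorCombine(5, 3) are both 6, so pairs with equal or swapped hashes collide.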
If you're looking for speed and don't have too many collisions, then XOR is fastest. To prevent a clustering around zero, you could do something like this:
finalHash = hash1 ^ hash2;
return finalHash != 0 ? finalHash : hash1;
Of course, some prototyping ought to give you an idea of performance and clustering.
Assuming you have a relevant toString() function (where your different fields shall appear), I would just return its hashcode:
this.toString().hashCode();
This is not very fast, but it should avoid collisions quite well.
I would recommend using the built-in hash functions in System.Security.Cryptography rather than rolling your own.
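A rough sketch of what that could look like, assuming SHA-256 and assuming you only need a 32-bit result. CryptoHashCombiner and Combine are hypothetical names. The choice of SHA-256 and of keeping the first four digest bytes is mine, not something the recommendation above specifies.

```csharp
using System;
using System.Security.Cryptography;

public static class CryptoHashCombiner
{
    // Hypothetical helper: packs the input hash codes into a byte buffer,
    // hashes it with SHA-256, and keeps the first four digest bytes as an int.
    public static int Combine(params int[] hashCodes)
    {
        var buffer = new byte[hashCodes.Length * sizeof(int)];
        Buffer.BlockCopy(hashCodes, 0, buffer, 0, buffer.Length);

        using (var sha = SHA256.Create())
        {
            byte[] digest = sha.ComputeHash(buffer);
            return BitConverter.ToInt32(digest, 0);
        }
    }
}
```

This is far slower than the arithmetic combiners above, so it is probably only worth it when distribution matters much more than speed.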

Simple Pseudo-Random Algorithm

I need a pseudo-random generator which takes a number as input and returns another number which is reproducible and seems to be random.
Each input number should map to exactly one output number and vice versa.
The same input number should always result in the same output number.
Sequential input numbers that are close together (e.g. 1 and 2) should produce completely different output numbers (e.g. 1 => 9783526, 2 => 283).
It need not be perfect; it's just to create random but reproducible test data.
I use C#.
I wrote this funny piece of code some time ago which produced something random.
public static long Scramble(long number, long max)
{
    // some random values
    long[] scramblers = { 3, 5, 7, 31, 343, 2348, 89897 };
    number += (max / 7) + 6;
    number %= max;
    // shuffle according to divisibility
    foreach (long scrambler in scramblers)
    {
        if (scrambler >= max / 3) break;
        number = ((number * scrambler) % max)
            + ((number * scrambler) / max);
    }
    return number % max;
}
I would like to have something better and more reliable that works with any size of number (no max argument).
Could this perhaps be solved using a CRC algorithm, or some bit shuffling?
I removed the Microsoft code from this answer. The GNU code file is a lot longer, but basically it contains this, from http://cs.uccs.edu/~cs591/bufferOverflow/glibc-2.2.4/stdlib/random_r.c :
int32_t val = state[0];
val = ((state[0] * 1103515245) + 12345) & 0x7fffffff;
state[0] = val;
*result = val;
For your purpose, the seed is state[0], so it would look more like:
int getRand(int val)
{
    return ((val * 1103515245) + 12345) & 0x7fffffff;
}
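Since the question asks for C#, here is a sketch of the same step ported over. LcgScrambler and GetRand are hypothetical names. The constants are the ones from the glibc snippet, and the unchecked block is my addition so that the multiplication wraps even if the project is compiled with overflow checking.

```csharp
public static class LcgScrambler
{
    // One linear congruential step, masked to 31 bits.
    public static int GetRand(int val)
    {
        unchecked
        {
            return ((val * 1103515245) + 12345) & 0x7fffffff;
        }
    }
}
```

For example, LcgScrambler.GetRand(0) is 12345, GetRand(1) is 1103527590 and GetRand(2) is 59559187.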
You (maybe) can do this easily in C# using the Random class:
public int GetPseudoRandomNumber(int input)
{
    Random random = new Random(input);
    return random.Next();
}
Since you're explicitly seeding Random with the input, you will get the same output every time given the same input value.
A Tausworthe generator is simple to implement and pretty fast. The following pseudocode implementation has full cycle (2**31 - 1, because zero is a fixed point):
def tausworthe(seed)
    seed ^= seed >> 13
    seed ^= seed << 18
    return seed & 0x7fffffff
I don't know C#, but I'm assuming it has XOR (^) and bit shift (<<, >>) operators as in C.
Set an initial seed value, and invoke with seed = tausworthe(seed).
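A possible C# translation of the pseudocode, as a sketch only. The Tausworthe class and Next method are hypothetical names, and I am assuming uint so that >> is a logical shift and the seed stays non-negative.

```csharp
public static class Tausworthe
{
    // Same two shift-and-XOR steps as the pseudocode, masked to 31 bits.
    public static uint Next(uint seed)
    {
        seed ^= seed >> 13;
        seed ^= seed << 18;
        return seed & 0x7fffffffu;
    }
}
```

Starting from a seed of 1, Tausworthe.Next(1) gives 262145, and feeding that back in gives 8388641.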
The first two rules suggest a fixed or input-seeded permutation of the input, but the third rule requires a further transform.
Is there any further restriction on what the outputs should be, to guide that transform? - e.g. is there an input set of output values to choose from?
If the only guide is "no max", I'd use the following...
Apply a hash algorithm to the whole input to get the first output item. A CRC might work, but for more "random" results, use a crypto hash algorithm such as MD5.
Use a next permutation algorithm (plenty of links on Google) on the input.
Repeat the hash-then-next-permutation until all required outputs are found.
The next permutation may be overkill, though. You could probably just increment the first input (and maybe, on overflow, increment the second and so on) before redoing the hash.
For crypto-style hashing, you'll need a key - just derive something from the input before you start.
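For the single-number case in the question, the hash step might look roughly like this. HashScrambler and Scramble are hypothetical names, and taking the first eight bytes of the MD5 digest is my assumption. Note that this gives reproducible, scattered outputs but does not guarantee that every input maps to a distinct output.

```csharp
using System;
using System.Security.Cryptography;

public static class HashScrambler
{
    // Hashes the bytes of the input with MD5 and keeps the first
    // eight digest bytes as the scrambled output.
    public static long Scramble(long input)
    {
        using (var md5 = MD5.Create())
        {
            byte[] digest = md5.ComputeHash(BitConverter.GetBytes(input));
            return BitConverter.ToInt64(digest, 0);
        }
    }
}
```

Calling HashScrambler.Scramble with the same input always returns the same value, which is the property the test data needs.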
