I want to generate 1M random-looking unique alphanumeric keys and store them in a database. Each key will be 8 characters long, drawn only from the subset "abcdefghijknpqrstuvxyz" plus 0-9.
The letters l, m, o and w are ditched. "m" and "w" are left out because of limited printing space: each key is printed on a product in a very small area, and dropping them let us increase the letter size by 2pt, improving readability. "l" and "o" were dropped because they are easily mixed up with 1, i and 0 at the current printing size. In our testing, the characters 1, i and 0 were always read correctly, while l and o produced too many mistakes. Capitals were left out for the same reason as "m" and "w".
So why not a sequence? A few reasons. Security: the keys can be registered afterwards, and we do not want anyone guessing the next key in the sequence and registering somebody else's key. Appearance: we don't want customers or competitors to know we only shipped a few thousand keys.
Is there a practical way to generate the keys, ensure the uniqueness of each key and store them in a database? Thanks!
Edit: @CodeInChaos pointed out a problem: System.Random isn't very secure, and its sequence could be reproduced without a great deal of difficulty. I've replaced Random with a secure generator here:
var possibilities = "abcdefghijknpqrstuvxyz0123456789".ToCharArray();
int goal = 1000000;
int codeLength = 8;
var codes = new HashSet<string>();
var random = new RNGCryptoServiceProvider();
while (codes.Count < goal)
{
    var newCode = new char[codeLength];
    for (int i = 0; i < codeLength; i++)
        newCode[i] = possibilities[random.Next(possibilities.Length)];
    codes.Add(new string(newCode));
}
// now write codes to database

static class Extensions
{
    // Rejection sampling: keep drawing bytes until one falls below maximum,
    // which keeps the distribution uniform. Assumes maximum <= 256.
    public static int Next(this RNGCryptoServiceProvider provider, int maximum)
    {
        var b = new byte[1];
        while (true)
        {
            provider.GetBytes(b);
            if (b[0] < maximum)
                return b[0];
        }
    }
}
(the Next method isn't very fast, but might be good enough for your purposes)
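Since the alphabet has exactly 32 characters (a power of two), the rejection loop can be avoided entirely by masking the low five bits of each random byte. This is just a sketch, and the helper name is my own:

```csharp
using System;
using System.Security.Cryptography;

public static class MaskDemo
{
    // For an alphabet whose size is a power of two (here 32 = 2^5),
    // masking the low bits of a random byte is uniform and never rejects.
    public static char NextChar(RandomNumberGenerator rng, char[] alphabet)
    {
        var b = new byte[1];
        rng.GetBytes(b);
        return alphabet[b[0] & (alphabet.Length - 1)];
    }

    public static void Main()
    {
        var alphabet = "abcdefghijknpqrstuvxyz0123456789".ToCharArray();
        using (var rng = RandomNumberGenerator.Create())
        {
            for (int i = 0; i < 8; i++)
                Console.Write(NextChar(rng, alphabet));
            Console.WriteLine();
        }
    }
}
```

This only works because 32 divides 256 evenly; for alphabet sizes that are not powers of two, rejection sampling as above is still the safe choice.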
1 million isn't much these days and you can probably do that on a single machine fairly quickly. It's a one-time operation after all.
Take a hashtable (or hashset)
Generate random keys and put them into it as keys (or directly if a set) until the count is 1 million
Write them to the database
My quick and dirty testing code looked like this:
function new-key {-join'abcdefghijknpqrstuvxyz0123456789'[(0..7|%{random 32})]}
$keys = @{}
while ($keys.Count -lt 1000000) { $keys[(new-key)] = 1 }
But PowerShell is slow, so I'd expect C++ or C# to do very well here.
Is there a practical way to generate the keys, ensure the uniqueness
of each key and store them in a database?
Since this is a single operation you could simply do the following:
1) Generate a single key.
2) Verify that the generated key does not exist in the database.
3) If it does exist, discard it and generate a new key.
4) If it does not exist, write it to the database.
5) Go back to step 1 until you have enough keys.
There are other choices of course, in the end it boils down to generating a key, and making sure it does not exist in the database.
You could in theory generate 10 million keys (in order to save processing power) and write them to a file. Once the keys are generated, just look at each one and see if it already exists in the database. You could likely program a tool that does this in less than 48 hours.
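A runnable C# sketch of that generate-and-check loop (a HashSet stands in for the database existence check here; in production this would be a unique index plus an INSERT):

```csharp
using System;
using System.Collections.Generic;
using System.Security.Cryptography;

public static class KeyLoop
{
    const string Alphabet = "abcdefghijknpqrstuvxyz0123456789";

    public static string NewKey(RandomNumberGenerator rng)
    {
        var bytes = new byte[8];
        rng.GetBytes(bytes);
        var chars = new char[8];
        for (int i = 0; i < 8; i++)
            chars[i] = Alphabet[bytes[i] % Alphabet.Length]; // 32 divides 256, so no modulo bias
        return new string(chars);
    }

    public static HashSet<string> Generate(int count)
    {
        // The HashSet stands in for "does this key already exist in the database".
        var seen = new HashSet<string>();
        using (var rng = RandomNumberGenerator.Create())
        {
            while (seen.Count < count)
                seen.Add(NewKey(rng)); // Add is a no-op on duplicates: "go back to step 1"
        }
        return seen;
    }

    public static void Main()
    {
        Console.WriteLine(Generate(10000).Count); // 10000
    }
}
```

With 32^8 (about 1.1 trillion) possible keys, collisions among 1 million draws are rare, so the loop makes very few extra passes.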
I encountered a similar problem once. What I did was create a unique sequence (YYYY/MM/DD/HH/MM/SS/millis/nano) and take its hash code. After that I used the hash as the key. Your clients and your competitors won't be able to guess the next value. It might not be foolproof, but in my case it was enough!
To actually get the random string, you can use code similar to this:
// new Random() seeds from the clock; new DateTime().Millisecond is always 0,
// which would produce the same sequence on every run
Random rand = new Random();
// note: "l" is excluded, per the question
String[] possibilities = {"a","b","c","d","e","f","g","h","i","j","k",
    "n","p","q","r","s","t","u","v","x","y","z","0","1","2","3","4",
    "5","6","7","8","9"};
for (int i = 0; i < 1000000; ++i)
{
    System.Text.StringBuilder sb = new System.Text.StringBuilder();
    for (int j = 0; j < 8; ++j)
    {
        sb.Append(possibilities[rand.Next(possibilities.Length)]);
    }
    if (!databaseContains(sb.ToString()))
        databaseAdd(sb.ToString());
    else
        --i;
}
I tried to sort Guids generated by UuidCreateSequential, but I see the results are not correct. Am I missing something? Here is the code:
private class NativeMethods
{
    [DllImport("rpcrt4.dll", SetLastError = true)]
    public static extern int UuidCreateSequential(out Guid guid);
}

public static Guid CreateSequentialGuid()
{
    const int RPC_S_OK = 0;
    Guid guid;
    int result = NativeMethods.UuidCreateSequential(out guid);
    if (result == RPC_S_OK)
        return guid;
    else
        throw new Exception("could not generate unique sequential guid");
}
static void TestSortedSequentialGuid(int length)
{
    Guid[] guids = new Guid[length];
    int[] ids = new int[length];
    for (int i = 0; i < length; i++)
    {
        guids[i] = CreateSequentialGuid();
        ids[i] = i;
        Thread.Sleep(60000);
    }
    Array.Sort(guids, ids);
    for (int i = 0; i < length - 1; i++)
    {
        if (ids[i] > ids[i + 1])
        {
            Console.WriteLine("sorting using guids failed!");
            return;
        }
    }
    Console.WriteLine("sorting using guids succeeded!");
}
EDIT1:
Just to make my question clear: why is the Guid struct not sortable using the default comparer?
EDIT 2:
Also, here are some sequential guids I've generated; they do not appear to be sorted ascending when presented as hex strings:
"53cd98f2504a11e682838cdcd43024a7",
"7178df9d504a11e682838cdcd43024a7",
"800b5b69504a11e682838cdcd43024a7",
"9796eb73504a11e682838cdcd43024a7",
"c14c5778504a11e682838cdcd43024a7",
"c14c5779504a11e682838cdcd43024a7",
"d2324e9f504a11e682838cdcd43024a7",
"d2324ea0504a11e682838cdcd43024a7",
"da3d4460504a11e682838cdcd43024a7",
"e149ff28504a11e682838cdcd43024a7",
"f2309d56504a11e682838cdcd43024a7",
"f2309d57504a11e682838cdcd43024a7",
"fa901efd504a11e682838cdcd43024a7",
"fa901efe504a11e682838cdcd43024a7",
"036340af504b11e682838cdcd43024a7",
"11768c0b504b11e682838cdcd43024a7",
"2f57689d504b11e682838cdcd43024a7"
First off, let's re-state the observation: when creating sequential GUIDs with a huge time delay -- 60 billion nanoseconds -- between creations, the resulting GUIDs are not sequential.
am I missing something?
You know every fact you need to know to figure out what is going on. You're just not putting them together.
You have a service that provides numbers which are both sequential and unique across all computers in the universe. Think for a moment about how that is possible. It's not a magic box; someone had to write that code.
Imagine if you didn't have to do it using computers, but instead had to do it by hand. You advertise a service: you provide sequential globally unique numbers to anyone who asks at any time.
Now, suppose I ask you for three such numbers and you hand out 20, 21, and 22. Then sixty years later I ask you for three more and surprise, you give me 13510985, 13510986 and 13510987. "Wait just a minute here", I say, "I wanted six sequential numbers, but you gave me three sequential numbers and then three more. What gives?"
Well, what do you suppose happened in that intervening 60 years? Remember, you provide this service to anyone who asks, at any time. Under what circumstances could you give me 23, 24 and 25? Only if no one else asked within that 60 years.
Now is it clear why your program is behaving exactly as it ought to?
In practice, the sequential GUID generator uses the current time as part of its strategy to enforce the globally unique property. Current time and current location is a reasonable starting point for creating a unique number, since presumably there is only one computer on your desk at any one time.
Now, I caution you that this is only a starting point; suppose you have twenty virtual machines all in the same real machine and all trying to generate sequential GUIDs at the same time? In these scenarios collisions become much more likely. You can probably think of techniques you might use to mitigate collisions in these scenarios.
After researching, I found I can't sort the guids using the default sort, or even using the default string representation from Guid.ToString(), as the byte order is different.
To sort the guids generated by UuidCreateSequential, I need to convert them to either a BigInteger or my own string representation (i.e. a 32-character hex string), putting the bytes in most-significant to least-significant order as follows:
static void TestSortedSequentialGuid(int length)
{
    Guid[] guids = new Guid[length];
    int[] ids = new int[length];
    for (int i = 0; i < length; i++)
    {
        guids[i] = CreateSequentialGuid();
        ids[i] = i;
        // this simulates the delay between guid creations;
        // the guids will not be consecutive, as the sleep interrupts the
        // generator (it uses the time internally), but they should still be
        // in increasing order and hence sortable, which was the goal
        Thread.Sleep(60000);
    }
    var sortedGuidStrings = guids.Select(x =>
    {
        var bytes = x.ToByteArray();
        // reverse the high bytes that represent the sequential part (time)
        string high = BitConverter.ToString(bytes.Take(10).Reverse().ToArray());
        // the last 6 bytes are just the node (MAC address); take them as-is
        return high + BitConverter.ToString(bytes.Skip(10).ToArray());
    }).ToArray();
    // sort ids using the generated sortedGuidStrings
    Array.Sort(sortedGuidStrings, ids);
    for (int i = 0; i < length - 1; i++)
    {
        if (ids[i] > ids[i + 1])
        {
            Console.WriteLine("sorting using sortedGuidStrings failed!");
            return;
        }
    }
    Console.WriteLine("sorting using sortedGuidStrings succeeded!");
}
Hopefully I understood your question correctly. It seems you are trying to sort the HEX representation of your Guids. That really means that you are sorting them alphabetically and not numerically.
Guids will be indexed by their byte value in the database. Here is a console app to prove that your Guids are numerically sequential:
using System;
using System.Linq;
using System.Numerics;

class Program
{
    static void Main(string[] args)
    {
        // These are the sequential guids you provided.
        Guid[] guids = new[]
        {
            "53cd98f2504a11e682838cdcd43024a7",
            "7178df9d504a11e682838cdcd43024a7",
            "800b5b69504a11e682838cdcd43024a7",
            "9796eb73504a11e682838cdcd43024a7",
            "c14c5778504a11e682838cdcd43024a7",
            "c14c5779504a11e682838cdcd43024a7",
            "d2324e9f504a11e682838cdcd43024a7",
            "d2324ea0504a11e682838cdcd43024a7",
            "da3d4460504a11e682838cdcd43024a7",
            "e149ff28504a11e682838cdcd43024a7",
            "f2309d56504a11e682838cdcd43024a7",
            "f2309d57504a11e682838cdcd43024a7",
            "fa901efd504a11e682838cdcd43024a7",
            "fa901efe504a11e682838cdcd43024a7",
            "036340af504b11e682838cdcd43024a7",
            "11768c0b504b11e682838cdcd43024a7",
            "2f57689d504b11e682838cdcd43024a7"
        }.Select(l => Guid.Parse(l)).ToArray();
        // Convert to BigIntegers to get the numeric value of the Guid bytes, then sort them.
        BigInteger[] values = guids.Select(l => new BigInteger(l.ToByteArray())).OrderBy(l => l).ToArray();
        for (int i = 0; i < guids.Length; i++)
        {
            // Convert back to a guid.
            Guid sortedGuid = new Guid(values[i].ToByteArray());
            // Compare the guids. The guids array should be sequential.
            if (!sortedGuid.Equals(guids[i]))
                throw new Exception("Not sequential!");
        }
        Console.WriteLine("All good!");
        Console.ReadKey();
    }
}
I have around 5,000,000 objects stored in a Dictionary<MyKey, MyValue>.
MyKey is a struct that packs each component of my key (5 different numbers) in the right-most 44 bits of an Int64 (ulong).
Since the ulong will always start with 20 zero-bits, my gut feeling is that returning the native Int64.GetHashCode() implementation is likely to collide more often, than if the hash code implementation only considers the 44 bits that are actually in use (although mathematically, I wouldn't know where to begin to prove that theory).
This increases the number of calls to .Equals() and makes dictionary lookups slower.
The .NET implementation of Int64.GetHashCode() looks like this:
public override int GetHashCode()
{
    return (int)this ^ (int)(this >> 32);
}
How would I best implement GetHashCode()?
I couldn't begin to suggest a "best" way to hash 44-bit numbers. But, I can suggest a way to compare it to the 64-bit hash algorithm.
One way to do this is to simply check how many collisions you get for a set of numbers (as suggested by McKenzie et al. in Selecting a Hashing Algorithm). Unless you're going to test all possible values of your set, you'll need to judge whether the number of collisions you get is acceptable. This could be done in code with something like:
var rand = new Random(42);
var dict64 = new Dictionary<int, int>();
var dict44 = new Dictionary<int, int>();
for (int i = 0; i < 100000; ++i)
{
    // get value between 0 and 0xfffffffffff (max 44-bit value)
    var value44 = (ulong)(rand.NextDouble() * 0x0FFFFFFFFFFF);
    var value64 = (ulong)(rand.NextDouble() * ulong.MaxValue);
    var hash64 = value64.GetHashCode();
    var hash44 = (int)value44 ^ (int)(value44 >> 32);
    // key == value here, so ContainsKey is equivalent to ContainsValue
    // but O(1) instead of O(n)
    if (!dict64.ContainsKey(hash64))
    {
        dict64.Add(hash64, hash64);
    }
    if (!dict44.ContainsKey(hash44))
    {
        dict44.Add(hash44, hash44);
    }
}
Trace.WriteLine(string.Format("64-bit hash: {0}, 64-bit hash with 44-bit numbers {1}", dict64.Count, dict44.Count));
In other words, consistently generate 100,000 random 64-bit values and 100,000 random 44-bit values, perform a hash on each and keep track of unique values.
In my test this generated 99998 unique values for 44-bit numbers and 99997 unique values for 64-bit numbers. So, that's one less collision for 44-bit numbers than for 64-bit numbers. I would expect fewer collisions with 44-bit numbers simply because there are fewer possible inputs.
I'm not going to tell you the 64-bit hash method is "best" for 44-bit; you'll have to decide if these results mean it's good for your circumstances.
Ideally you should be testing with realistic values that your application is likely to generate. Given those will all be 44-bit values, it's hard to compare that to the collisions ulong.GetHashCode() produces (i.e. you'd have identical results). If random values based on a constant seed isn't good enough, modify the code with something better.
While things might not "feel" right, science suggests there's no point in changing something without reproducible tests that prove a change is necessary.
Here's my attempt to answer this question, which I'm posting despite the fact that the answer is the opposite of what I was expecting. (Although I may have made a mistake somewhere - I almost hope so, and am open to criticism regarding my test technique.)
// Number of Dictionary hash buckets found here:
// http://stackoverflow.com/questions/24366444/how-many-hash-buckets-does-a-net-dictionary-use
const int CNumberHashBuckets = 4999559;

static void Main(string[] args)
{
    Random randomNumberGenerator = new Random();
    int[] dictionaryBuckets1 = new int[CNumberHashBuckets];
    int[] dictionaryBuckets2 = new int[CNumberHashBuckets];
    for (int i = 0; i < 5000000; i++)
    {
        ulong randomKey = (ulong)(randomNumberGenerator.NextDouble() * 0x0FFFFFFFFFFF);
        int simpleHash = randomKey.GetHashCode();
        BumpHashBucket(dictionaryBuckets1, simpleHash);
        int superHash = ((int)(randomKey >> 12)).GetHashCode() ^ ((int)randomKey).GetHashCode();
        BumpHashBucket(dictionaryBuckets2, superHash);
    }
    int collisions1 = ComputeCollisions(dictionaryBuckets1);
    int collisions2 = ComputeCollisions(dictionaryBuckets2);
}

private static void BumpHashBucket(int[] dictionaryBuckets, int hashedKey)
{
    int bucketIndex = (int)((uint)hashedKey % CNumberHashBuckets);
    dictionaryBuckets[bucketIndex]++;
}

private static int ComputeCollisions(int[] dictionaryBuckets)
{
    int collisions = 0;
    foreach (int dictionaryBucket in dictionaryBuckets)
        collisions += Math.Max(dictionaryBucket - 1, 0);
    return collisions;
}
I try to simulate how the processing done by Dictionary will work. The OP says he has "around 5,000,000" objects in a Dictionary, and according to the referenced source there will be either 4999559 or 5999471 "buckets" in the Dictionary.
Then I generate 5,000,000 random 44-bit keys to simulate the OP's Dictionary entries, and for each key I hash it two different ways: the simple ulong.GetHashCode() and an alternative way that I suggested in a comment. Then I turn each hash code into a bucket index using modulo - I assume that's how it is done by Dictionary. This is used to increment the pseudo buckets as a way of computing the number of collisions.
Unfortunately (for me) the results are not as I was hoping. With 4999559 buckets the simulation typically indicates around 1.8 million collisions, with my "super hash" technique actually having a few (around 0.01%) MORE collisions. With 5999471 buckets there are typically around 1.6 million collisions, and my so-called super hash gives maybe 0.1% fewer collisions.
So my "gut feeling" was wrong, and there seems to be no justification for trying to find a better hash code technique.
This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
Random number generator only generating one random number
I've distilled the behavior observed in a larger system into this code sequence:
for (int i = 0; i < 100; i++)
{
    Random globalRand = new Random(0x3039 + i);
    globalRand.Next();
    globalRand.Next();
    int newSeed = globalRand.Next();
    Random rand = new Random(newSeed);
    int value = rand.Next(1, 511);
    Console.WriteLine(value);
}
Running this from Visual Studio 2012 targeting .NET 4.5 will output either 316 or 315. Extending this beyond 100 iterations and you'll see the value slowly decrement (314, 313...) but it still isn't what I'd imagine anyone would consider "random".
EDIT
I am aware that there are several questions on StackOverflow already that ask why their random numbers aren't random. However, those questions have issues regarding either a) not passing in a seed (or passing in the same seed) to their Random object instance, or b) doing something like Next(0, 1) and not realizing that the second parameter is an exclusive bound. Neither of those issues applies to this question.
It's a pseudo-random generator, which basically creates a long (infinite) list of numbers. The list is deterministic, but in most practical scenarios its order can be treated as random. The order is, however, determined by the seed.
The most random behaviour you can achieve (without fancy pants tricks that are hard to get right all the time) is to reuse the same object over and over.
If you change your code to the version below, you get more random behaviour:
Random globalRand = new Random();
for (int i = 0; i < 100; i++)
{
    globalRand.Next();
    globalRand.Next();
    int newSeed = globalRand.Next();
    Random rand = new Random(newSeed);
    int value = rand.Next(1, 511);
    Console.WriteLine(value);
}
The reason is that the math behind pseudo-random generators basically just creates an infinite list of numbers.
The distribution of these numbers is almost random, but the order in which they come is not. Computers are deterministic and as such incapable of producing truly random numbers (without aid), so to circumvent this, mathematicians have produced functions capable of generating these lists of numbers, which have a lot of randomness about them but where the seed determines the order.
Given the same seed, the function always produces the same order. Given two seeds close to each other (where "close" depends on which function is used), the lists will be almost the same for the first several numbers, but will eventually diverge completely.
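The determinism is easy to demonstrate: two Random instances created with the same seed produce identical streams.

```csharp
using System;

class SeedDemo
{
    static void Main()
    {
        // Identical seeds give identical pseudo-random sequences.
        var a = new Random(12345);
        var b = new Random(12345);
        for (int i = 0; i < 5; i++)
            Console.WriteLine(a.Next() == b.Next()); // prints True five times
    }
}
```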
Using the first random number to generate the second doesn't make it any more random. As suggested, try the following; also as suggested, there's no need to create the Random object inside the loop:
Random globalRand = new Random();
for (int i = 0; i < 100; i++)
{
    int value = globalRand.Next(1, 511);
    Console.WriteLine(value);
}
I believe the parameterless Random() is time-based, so if your process is running really fast you will get the same answer if you keep creating new instances of Random() in your loop. Try creating the Random() outside of the loop and just use .Next() to get your answer, like:
Random rand = new Random();
for (int i = 0; i < 100; i++)
{
    int value = rand.Next();
    Console.WriteLine(value);
}
Without parameters, the Random constructor takes the current date and time as the seed, and you can generally execute a fair amount of code before the internal timer works out that the current date and time has changed. Therefore you are using the same seed repeatedly, and getting the same results repeatedly.
Source : http://csharpindepth.com/Articles/Chapter12/Random.aspx
Note: This is part 2 of a 2 part question.
Part 1 here
I want to learn more about sorting algorithms, and what better way to learn than to code! So I figure I need some data to work with.
My approach to creating some "standard" data will be as follows: create a set number of items. I'm not sure how large to make the set, but I want to have fun and make my computer groan a little bit :D
Once I have that list, I'll push it into a text file and just read off that to run my algorithms against. I should have a total of 4 text files filled with the same data but just sorted differently to run my algorithms against (see below).
Correct me if I'm wrong but I believe I need 4 different types of scenarios to profile my algorithms.
Randomly sorted data (for this I'm going to use the knuth shuffle)
Reversed data (easy enough)
Nearly sorted (not sure how to implement this)
Few unique (once again not sure how to approach this)
This question is for generating a list with a few unique items of data.
Which approach is best to generate a dataset with a few unique items.
Answering my own question here. I don't know if this is the best way, but it works.
private static readonly Random _random = new Random();

public static int[] FewUnique(int uniqueCount, int returnSize)
{
    Random r = _random;
    int[] values = new int[uniqueCount];
    for (int i = 0; i < uniqueCount; i++)
    {
        values[i] = i;
    }
    int[] array = new int[returnSize];
    for (int i = 0; i < returnSize; i++)
    {
        array[i] = values[r.Next(0, values.Length)];
    }
    return array;
}
It might be worth having a look at NBuilder. It's a framework designed to generate objects for testing, and it sounds like just what you need.
You could deal with the "few unique" items with some code like this:
var products = Builder<YourClass>.CreateListOfSize(1000)
    .WhereAll().AreConstructedWith("some fixed value")
    .WhereRandom(20).AreConstructedWith("some other fixed value")
    .Build();
There's plenty of other variations you can use as well to get the data like you want it. Have a look at some of the samples on the site for more ideas.
http://pages.cs.wisc.edu/~bart/fuzz/
This is all about fuzz testing, which focuses on semi-random data. It should be straightforward to adapt this approach to your problem.
I guess your solution is OK. I would only modify it slightly:
private static readonly Random _random = new Random();

public static int[] FewUnique(int uniqueCount, int low, int high, int returnSize)
{
    Random r = _random;
    int[] values = new int[uniqueCount];
    for (int i = 0; i < uniqueCount; i++)
    {
        values[i] = r.Next(low, high);
    }
    int[] array = new int[returnSize];
    for (int i = 0; i < returnSize; i++)
    {
        array[i] = values[r.Next(0, values.Length)];
    }
    return array;
}
For some algorithms this might make a difference.
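For the "nearly sorted" case the question also asks about, one simple approach (my own suggestion, not from the question) is to start with sorted data and swap a small number of random pairs:

```csharp
using System;
using System.Linq;

public static class NearlySorted
{
    // Start with fully sorted data, then perturb it with a few random swaps.
    // More swaps means less sorted; swaps << size keeps it "nearly sorted".
    public static int[] Generate(int size, int swaps, Random rand)
    {
        var data = Enumerable.Range(0, size).ToArray();
        for (int i = 0; i < swaps; i++)
        {
            int a = rand.Next(size);
            int b = rand.Next(size);
            (data[a], data[b]) = (data[b], data[a]);
        }
        return data;
    }

    public static void Main()
    {
        var data = Generate(1000, 10, new Random(1));
        // With only 10 swaps, at most 20 positions can be out of place.
        int inPlace = data.Where((v, i) => v == i).Count();
        Console.WriteLine(inPlace >= 980); // True
    }
}
```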
I was using this code to generate a random sequence of numbers:
var sequence = Enumerable.Range(0, 9).OrderBy(n => n * n * (new Random()).Next());
Everything was OK until I needed more than one sequence. In this code I call the routine 10 times, and the results are my problem: all the sequences are equal.
int i = 0;
while (i < 10)
{
    Console.Write("{0}:", i);
    var sequence = Enumerable.Range(0, 9).OrderBy(n => n * n * (new Random()).Next());
    sequence.ToList().ForEach(x => Console.Write(x));
    i++;
    Console.WriteLine();
}
Can someone give me a hint on how to actually generate different sequences, hopefully using LINQ?
The problem is that you're creating a new instance of Random on each iteration. Each instance will be taking its initial seed from the current time, which changes relatively infrequently compared to how often your delegate is getting executed. You could create a single instance of Random and use it repeatedly. See my article on randomness for more details.
However, I would also suggest that you use a Fisher-Yates shuffle instead of OrderBy in order to shuffle the values (there are plenty of examples on Stack Overflow, such as this one)... although it looks like you're trying to bias the randomness somewhat. If you could give more details as to exactly what you're trying to do, we may be able to help more.
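For reference, a minimal Fisher-Yates shuffle (unbiased, unlike sorting by a random key) might look like this:

```csharp
using System;

public static class Shuffler
{
    // Fisher-Yates: walk the array from the end, swapping each element
    // with a uniformly chosen element at or before it.
    public static void Shuffle<T>(T[] array, Random rand)
    {
        for (int i = array.Length - 1; i > 0; i--)
        {
            int j = rand.Next(i + 1); // 0 <= j <= i
            (array[i], array[j]) = (array[j], array[i]);
        }
    }

    public static void Main()
    {
        var rand = new Random(); // one instance, reused for every shuffle
        var values = new[] { 0, 1, 2, 3, 4, 5, 6, 7, 8 };
        Shuffle(values, rand);
        Console.WriteLine(string.Join("", values));
    }
}
```

Reusing the single Random instance across calls avoids the identical-sequence problem from the question entirely.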
You are creating 10 instances of Random in quick succession, and pulling the first pseudo-random number from each of them. I'm not surprised they're all the same.
Try this:
Random r = new Random();
var sequence = Enumerable.Range(0, 9).OrderBy(n => n * n * r.Next());
In my code I use this static method I wrote many years ago, and it still shows good randomization:
using System.Security.Cryptography;
...
public static int GenerateRandomInt(int from, int to)
{
    byte[] salt = new byte[4];
    RandomNumberGenerator rng = RandomNumberGenerator.Create();
    rng.GetBytes(salt);
    // num is the sum of four bytes (0..1020), so the distribution is
    // bell-shaped rather than uniform, and only small ranges are covered
    int num = 0;
    for (int i = 0; i < 4; i++)
    {
        num += salt[i];
    }
    return num % (to + 1 - from) + from;
}
I can't explain this example in detail; I'd need to bring myself back in time to remember, but I don't care ;)
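For what it's worth, on modern .NET (Core 3.0 and later) the same goal can be achieved without the bias of summing bytes, using the built-in RandomNumberGenerator.GetInt32, which performs unbiased rejection sampling internally:

```csharp
using System;
using System.Security.Cryptography;

public static class CryptoRange
{
    // Inclusive from, inclusive to, matching the signature above.
    public static int GenerateRandomInt(int from, int to)
    {
        // GetInt32 takes an inclusive lower bound and an exclusive upper bound.
        return RandomNumberGenerator.GetInt32(from, to + 1);
    }

    public static void Main()
    {
        for (int i = 0; i < 5; i++)
            Console.WriteLine(GenerateRandomInt(1, 6)); // fair die roll
    }
}
```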