Generating a random, non-repeating sequence of all integers in .NET - c#

Is there a way in .NET to generate a sequence of all the 32-bit integers (Int32) in random order, without repetitions, and in a memory-efficient manner? Memory-efficient would mean using a maximum of just a few hundred megabytes of main memory.
Ideally the sequence should be something like an IEnumerable<int>, and it lazily returns the next number in sequence, only when requested.
I did some quick research and I found some partial solutions to this:
Using a maximal linear feedback shift register - if I understood correctly, it only generates one fixed sequence, and it does not cover the entire range (the all-zero value is never produced)
Using the Fisher-Yates or other shuffling algorithms over collections - this would violate the memory restrictions given the large range
Maintaining a set-like collection and generating random integers (perhaps using Random) until one doesn't repeat, i.e. isn't already in the set - apart from possibly failing to satisfy the memory requirements, this would get ridiculously slow when generating the last numbers in the sequence.
Random permutations over 32 bits - however, I can't think of a way to ensure non-repeatability.
Is there another way to look at this problem - perhaps taking advantage of the fixed range of values - that would give a solution satisfying the memory requirements? Maybe the .NET class libraries come with something useful?
UPDATE 1
Thanks everyone for your insights and creative suggestions for a solution. I'll try to implement and test soon (both for correctness and memory efficiency) the 2 or 3 most promising solutions proposed here, post the results and then pick a "winner".
UPDATE 2
I tried implementing hvd's suggestion in the comment below. I tried using both the BitArray in .NET and my custom implementation, since the .NET one is limited to int.MaxValue entries, which is not enough to cover the entire range of integers.
I liked the simplicity of the idea and I was willing to "sacrifice" those 512 MB of memory if it worked fine. Unfortunately, the run time is quite slow, spending up to tens of seconds to generate the next random number on my machine, which has a 3.5 GHz Core i7 CPU. So unfortunately this is not acceptable if you request many, many random numbers to be generated. I guess it's predictable though, it's an O(M x N) algorithm if I'm not mistaken, where N is 2^32 and M is the number of requested integers, so all those iterations take their toll.
Ideally I'd like to generate the next random number in O(1) time, while still meeting the memory requirements; maybe the next algorithms suggested here are suited for this. I'll give them a try as soon as I can.
UPDATE 3
I just tested the Linear Congruential Generator and I can say I'm quite pleased with the results. It looks like a strong contender for the winner position in this thread.
Correctness: all integers generated exactly once (I used a bit vector to check this).
Randomness: fairly good.
Memory usage: Excellent, just a few bytes.
Run time: Generates the next random integer very fast, as you can expect from an O(1) algorithm. Generating every integer took a total of approx. 11 seconds on my machine.
All in all I'd say it's a very appropriate technique if you're not looking for highly randomized sequences.
UPDATE 4
The modular multiplicative inverse technique described below behaves quite similarly to the LCG technique - not surprising, since both are based on modular arithmetic - although I found it a bit less straightforward to implement in order to yield satisfactorily random sequences.
One interesting difference I found is that this technique seems faster than LCG: generating the entire sequence took about 8 seconds, versus 11 seconds for LCG. Other than this, all other remarks about memory efficiency, correctness and randomness are the same.
UPDATE 5
Looks like user TomTom deleted their answer about the Mersenne Twister without notice, after I pointed out in a comment that it generates duplicate numbers sooner than required. So I guess this rules out the Mersenne Twister entirely.
UPDATE 6
I tested another suggested technique that looks promising, Skip32, and while I really liked the quality of the random numbers, the algorithm is not suitable for generating the entire range of integers in an acceptable amount of time. So unfortunately it falls short when compared to the other techniques that were able to finish the process. I used the implementation in C# from here, by the way - I changed the code to reduce the number of rounds to 1 but it still can't finish in a timely manner.
After all, judging by the results described above, my personal choice for the solution goes to the modular multiplicative inverses technique, followed very closely by the linear congruential generator. Some may argue that this is inferior in certain aspects to other techniques, but given my original constraints I'd say it fits them best.

If you don't need the random numbers to be cryptographically secure, you can use a Linear Congruential Generator.
An LCG is a recurrence of the form X_{n+1} = (a * X_n + c) mod m; it needs constant memory and constant time for every generated number.
If proper values for the LCG are chosen, it will have a full period length, meaning it will output every number between 0 and your chosen modulus.
An LCG has a full period if and only if:
The modulus and the increment are relatively prime, i.e. GCD(m, c) = 1
a - 1 is divisible by all prime factors of m
If m is divisible by 4, a - 1 must be divisible by 4.
Our modulus is 2^32, meaning a must be a number of the form 4k + 1 where k is an arbitrary integer, and c must not be divisible by 2.
While this is a C# question I've coded a small C++ program to test the speed of this solution, since I'm more comfortable in that language:
#include <iostream>
#include <cstdlib>
#include <ctime>
#include <cstdint>

class lcg {
private:
    unsigned a, c, val;
public:
    lcg(unsigned seed = 0) : lcg(seed, (unsigned)rand() * 4 + 1, (unsigned)rand() * 2 + 1) {}
    lcg(unsigned seed, unsigned a, unsigned c) {
        val = seed;
        this->a = a;
        this->c = c;
        std::cout << "Initiated LCG with seed " << seed << "; a = " << a << "; c = " << c << std::endl;
    }
    unsigned next() {
        this->val = a * this->val + c; // unsigned wraparound supplies the mod 2^32
        return this->val;
    }
};

int main() {
    srand(time(NULL));
    unsigned seed = rand();
    int dummy = 0;
    lcg gen(seed);
    time_t t = time(NULL);
    for (uint64_t i = 0; i < 0x100000000ULL; i++) {
        if (gen.next() < 1000) dummy++; // Avoid optimizing this out with -O2
    }
    std::cout << "Finished cycling through. Took " << (time(NULL) - t) << " seconds." << std::endl;
    if (dummy > 0) return 0;
    return 1;
}
You may notice I am not using the modulus operation anywhere in the lcg class; that's because unsigned 32-bit integer overflow performs the modulus operation for us.
This produces all values in the range [0, 4294967295] inclusive.
I also had to add a dummy variable for the compiler not to optimize everything out.
With no optimization this solution finishes in about 15 seconds, while with -O2 (moderate optimization) it finishes in under 5 seconds.
If "true" randomness is not an issue, this is a very fast solution.

Is there a way in .NET
Actually, this can be done in most any language
to generate a sequence of all the 32-bit integers (Int32)
Yes.
in random order,
Here we need to agree on terminology since "random" is not what most people think it is. More on this in a moment.
without repetitions,
Yes.
and in a memory-efficient manner?
Yes.
Memory-efficient would mean using a maximum of just a few hundred mega bytes of main memory.
Ok, so would using almost no memory be acceptable? ;-)
Before getting to the suggestion, we need to clear up the matter of "randomness". Something that is truly random has no discernible pattern. Hence, running the algorithm millions of times in a row could theoretically return the same value across all iterations. If you throw in the concept of "must be different from the prior iteration", then it is no longer random. However, looking at all of the requirements together, it seems that all that is really being asked for is "differing patterns of distribution of the integers". And this is doable.
So how to do this efficiently? Make use of Modular multiplicative inverses. I used this to answer the following Question which had a similar requirement to generate non-repeating, pseudo-random, sample data within certain bounds:
Generate different random time in the given interval
I first learned about this concept here ( generate seemingly random unique numeric ID in SQL Server ) and you can use either of the following online calculators to determine your "Integer" and "Modular Multiplicative Inverses (MMI)" values:
http://planetcalc.com/3311/
http://www.cs.princeton.edu/~dsri/modular-inversion-answer.php
Applying that concept here, you would use Int32.MaxValue as the Modulo value.
This would give a definite appearance of random distribution with no chance for collisions and no memory needed to store already used values.
The only initial problem is that the pattern of distribution is always the same given the same "Integer" and "MMI" values. So, you could come up with differing patterns by either adding a "randomly" generated Int to the starting value (as I believe I did in my answer about generating the sample data in SQL Server) or you can pre-generate several combinations of "Integer" and corresponding "MMI" values, store those in a config file / dictionary, and use a .NET random function to pick one at the start of each run. Even if you store 100 combinations, that is almost no memory use (assuming it is not in a config file). In fact, if storing both as Int and the dictionary uses Int as an index, then 1000 values is approximately 12k?
UPDATE
Notes:
There is a pattern in the results, but it is not discernible unless you have enough of them at any given moment to look at in total. For most use-cases, this is acceptable since no recipient of the values would have a large collection of them, or know that they were assigned in sequence without any gaps (and that knowledge is required in order to determine if there is a pattern).
Only 1 of the two variable values -- "Integer" and "Modular Multiplicative Inverse (MMI)" -- is needed in the formula for a particular run. Hence:
each pair gives two distinct sequences
if maintaining a set in memory, only a simple array is needed, and assuming that the array index is merely an offset in memory from the base address of the array, then the memory required should only be 4 bytes * capacity (i.e. 1024 options is only 4k, right?)
Here is some test code. It is written in T-SQL for Microsoft SQL Server since that is where I work primarily, and it also has the advantage of making it real easy-like to test for uniqueness, min and max values, etc, without needing to compile anything. The syntax will work in SQL Server 2008 or newer. For SQL Server 2005, initialization of variables had not been introduced yet, so each DECLARE that contains an = would merely need to be separated into the DECLARE by itself and a SET @Variable = ... for however that variable is being initialized. And the SET @Index += 1; would need to become SET @Index = @Index + 1;.
The test code will error if you supply values that produce any duplicates. And the final query indicates if there are any gaps since it can be inferred that if the table variable population did not error (hence no duplicates), and the total number of values is the expected number, then there could only be gaps (i.e. missing values) IF either or both of the actual MIN and MAX values are outside of the expected values.
PLEASE NOTE that this test code does not imply that any of the values are pre-generated or need to be stored. The code only stores the values in order to test for uniqueness and min / max values. In practice, all that is needed is the simple formula, and all that is needed to pass into it is:
the capacity (though that could also be hard-coded in this case)
the MMI / Integer value
the current "index"
So you only need to maintain 2 - 3 simple values.
DECLARE @TotalCapacity INT = 30; -- Modulo; -5 to +4 = 10 OR Int32.MinValue
                                 -- to Int32.MaxValue = (UInt32.MaxValue + 1)
DECLARE @MMI INT = 7; -- Modular Multiplicative Inverse (MMI) or
                      -- Integer (derived from @TotalCapacity)
DECLARE @Offset INT = 0; -- needs to stay at 0 if min and max values are hard-set
-----------
DECLARE @Index INT = (1 + @Offset); -- start
DECLARE @EnsureUnique TABLE ([OrderNum] INT NOT NULL IDENTITY(1, 1),
                             [Value] INT NOT NULL UNIQUE);
SET NOCOUNT ON;

BEGIN TRY
    WHILE (@Index < (@TotalCapacity + 1 + @Offset)) -- range + 1
    BEGIN
        INSERT INTO @EnsureUnique ([Value]) VALUES (
            ((@Index * @MMI) % @TotalCapacity) - (@TotalCapacity / 2) + @Offset
        );
        SET @Index += 1;
    END;
END TRY
BEGIN CATCH
    DECLARE @Error NVARCHAR(4000) = ERROR_MESSAGE();
    RAISERROR(@Error, 16, 1);
    RETURN;
END CATCH;

SELECT * FROM @EnsureUnique ORDER BY [OrderNum] ASC;

SELECT COUNT(*) AS [TotalValues],
       @TotalCapacity AS [ExpectedCapacity],
       MIN([Value]) AS [MinValue],
       (@TotalCapacity / -2) AS [ExpectedMinValue],
       MAX([Value]) AS [MaxValue],
       (@TotalCapacity / 2) - 1 AS [ExpectedMaxValue]
FROM @EnsureUnique;
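For the OP's language, here is a hedged C# sketch of the same formula over the full 2^32 range; the multiplier constant is an illustrative assumption (any odd value is coprime to 2^32, which makes index * multiplier mod 2^32 a bijection):
static IEnumerable<int> MmiSequence(ulong multiplier = 2654435761UL)
{
    const ulong modulus = 1UL << 32; // the full capacity: UInt32.MaxValue + 1
    for (ulong index = 0; index < modulus; index++)
        yield return unchecked((int)((index * multiplier) % modulus)); // bijective for odd multipliers
}
As with the T-SQL test above, picking different multipliers (or offsets) gives differing distribution patterns at essentially no memory cost.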

A 32 bit PRP in CTR mode seems like the only viable approach to me (your 4th variant).
You can either
Use a dedicated 32 bit block cipher.
Skip32, the 32-bit variant of Skipjack, is a popular choice.
As a tradeoff between quality/security and performance you can adjust the number of rounds to your needs. More rounds are slower but more secure.
Length-preserving-encryption (a special case of format-preserving-encryption)
FFX mode is the typical recommendation. But in its typical instantiations (e.g. using AES as underlying cipher) it'll be much slower than dedicated 32 bit block ciphers.
Note that many of these constructions have a significant flaw: they are even permutations. That means that once you have seen 2^32 - 2 outputs, you'll be able to predict the second-to-last output with certainty, instead of with only 50% probability. I think Rogaway's AEZ paper mentions a way to fix this flaw.
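For illustration, here is a hedged C# sketch of the PRP-in-CTR idea using a toy 32-bit Feistel network; this is not Skip32 itself, and the round function and keys are arbitrary assumptions, but any Feistel construction is a permutation of its block space, so feeding it counters 0 through 2^32-1 yields every 32-bit value exactly once:
static uint ToyFeistel(uint block, uint[] roundKeys)
{
    ushort left = (ushort)(block >> 16);
    ushort right = (ushort)block;
    foreach (uint key in roundKeys)
    {
        // arbitrary round function; quality/security comes from a better F and more rounds
        ushort f = (ushort)(((uint)right * 40503u + key) ^ ((uint)right >> 7));
        ushort newRight = (ushort)(left ^ f);
        left = right;
        right = newRight;
    }
    return ((uint)left << 16) | right;
}
Enumerating ToyFeistel(i, keys) for i = 0 to uint.MaxValue is the CTR-mode construction described above; swapping in a real Skip32 implementation changes only the permutation's quality, not the uniqueness guarantee.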

I'm going to preface this answer by saying I realize that some of the other answers are infinitely more elegant, and probably fit your needs better than this one. This is most certainly a brute-force approach to this problem.
If getting something truly random* (or pseudo-random* enough for cryptographic purposes) is important, you could generate a list of all integers ahead of time and store them all on disk in random order. At the run time of your program, you then read those numbers from the disk.
Below is the basic outline of the algorithm I'm proposing to generate these numbers. All 32-bit integers can be stored in ~16 GiB of disk space (32 bits = 4 bytes, 4 bytes / integer * 2^32 integers = 2^34 bytes = 16 GiB, plus whatever overhead the OS/filesystem needs), and I've taken "a few hundred megabytes" to mean that you want to read in a file of no more than 256 MiB at a time.
Generate 16 GiB / 256 MiB = 64 ASCII text files with 256 MiB of "null" characters (all bits set to 0) each. Name each text file "0.txt" through "63.txt"
Loop sequentially from Int32.MinValue to Int32.MaxValue, skipping 0. This is the value of the integer you're currently storing.
On each iteration, generate a random integer from 0 to UInt32.MaxValue from the source of randomness of your choice (hardware true random generator, pseudo-random algorithm, whatever). This is the index of the value you're currently storing.
Split the index into two integers: the 6 most significant bits, and the remaining 26. Use the upper bits to load the corresponding text file.
Multiply the lower 26 bits by 4 and use that as an index in the opened file. If the four bytes following that index are all still the "null" character, encode the current value into four ASCII characters, and store those characters in that position. If they are not all the "null" character, go back to step 3.
Repeat until all integers have been stored.
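A small hedged C# sketch of the index split in steps 4-5 above (names are illustrative):
static (int file, long byteOffset) Locate(uint index)
{
    int file = (int)(index >> 26);               // top 6 bits select one of the 64 files
    long byteOffset = (index & 0x3FFFFFFL) * 4L; // low 26 bits times 4 bytes per slot = 256 MiB per file
    return (file, byteOffset);
}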
This would ensure that the numbers are from a known source of randomness but are still unique, rather than having the limitations of some of the other proposed solutions. It would take a long time to "compile" (especially using the relatively naive algorithm above), but it meets the runtime efficiency requirements.
At runtime, you can now generate a random starting index, then read the bytes in the files sequentially to obtain a unique, random*, non-repeating sequence of integers. Assuming that you're using a relatively small number of integers at once, you could even index randomly into the files, storing which indices you've used and ensuring a number is not repeated that way.
(* I understand that the randomness of any source is lessened by imposing the "uniqueness" constraint, but this approach should produce numbers relatively close in randomness to the original source)
TL;DR - Shuffle the integers ahead of time, store all of them on disk in a number of smaller files, then read from the files as needed at runtime.

Nice puzzle. A few things come to mind:
We need to store which items have been used. If approximate is good enough, you might want to use a bloom filter for this. But since you specifically state that you want all numbers, there's only one data structure for this: a bit vector.
You probably want to use a pseudo random generator algorithm with a long period.
And the solution probably involves using multiple algorithms.
My first attempt was to figure out how good pseudo random number generation works with a simple bit vector. I accept collisions (and therefore a slowdown), but definitely not too many collisions. This simple algorithm will generate about half the numbers for you in a limited amount of time.
static ulong xorshift64star(ulong x)
{
    x ^= x >> 12; // a
    x ^= x << 25; // b
    x ^= x >> 27; // c
    return x * 2685821657736338717ul;
}

static void Main(string[] args)
{
    byte[] buf = new byte[512 * 1024 * 1024]; // bit vector: 2^32 bits = 512 MB
    Random rnd = new Random();
    ulong value = (uint)rnd.Next(int.MinValue, int.MaxValue);
    long collisions = 0;
    Stopwatch sw = Stopwatch.StartNew();
    for (long i = 0; i < uint.MaxValue; ++i)
    {
        if ((i % 1000000) == 0)
        {
            Console.WriteLine("{0} random in {1:0.00}s (c={2})", i, sw.Elapsed.TotalSeconds, collisions - 1000000);
            collisions = 0;
        }
        uint randomValue; // result will be stored here
        bool collision;
        do
        {
            value = xorshift64star(value);
            randomValue = (uint)value;
            // >> 3 because each byte of the bit vector holds 8 bits
            collision = (buf[randomValue >> 3] & (1 << (int)(randomValue & 7))) != 0;
            ++collisions;
        }
        while (collision);
        buf[randomValue >> 3] |= (byte)(1 << (int)(randomValue & 7));
    }
    Console.ReadLine();
}
After about 1.9 billion random numbers, the algorithm will start to come to a grinding halt.
1953000000 random in 283.74s (c=10005932)
[...]
2108000000 random in 430.66s (c=52837678)
So, let's for the sake of argument say that you're going to use this algorithm for the first +/- 2 billion numbers.
Next, you need a solution for the rest, which is basically the problem that the OP described. For that, I'd sample random numbers into a buffer and combine the buffer with the Knuth shuffle algorithm. You can also use this right from the start if you like.
This is what I came up with (probably still buggy so do test...):
static void Main(string[] args)
{
    Random rnd = new Random();
    byte[] bloom = new byte[512 * 1024 * 1024];
    uint[] randomBuffer = new uint[1024 * 1024];
    ulong value = (uint)rnd.Next(int.MinValue, int.MaxValue);
    long collisions = 0;
    Stopwatch sw = Stopwatch.StartNew();
    int n = 0;
    for (long i = 0; i < uint.MaxValue; i += n)
    {
        // Rebuild the buffer. We know that we have uint.MaxValue-i entries left and that we have a
        // buffer of 1M size. Let's calculate the chance that you want any available number in your
        // buffer, which is now:
        double total = uint.MaxValue - i;
        double prob = ((double)randomBuffer.Length) / total;
        if (i >= uint.MaxValue - randomBuffer.Length)
        {
            prob = 1; // always a match.
        }
        uint threshold = (uint)(prob * uint.MaxValue);

        n = 0;
        for (long j = 0; j < uint.MaxValue && n < randomBuffer.Length; ++j)
        {
            // is it available? Let's shift so we get '0' (unavailable) or '1' (available)
            // (>> 3 because each byte of the bit vector holds 8 bits)
            int available = 1 ^ ((bloom[j >> 3] >> (int)(j & 7)) & 1);
            // use the xorshift algorithm to generate a random value:
            value = xorshift64star(value);
            // roll a die for this number. If we match the probability check, add it.
            if (((uint)value) <= threshold * available)
            {
                // Store this in the buffer
                randomBuffer[n++] = (uint)j;
                // Ensure we don't encounter this thing again in the future
                bloom[j >> 3] |= (byte)(1 << (int)(j & 7));
            }
        }

        // Our buffer now has N random values, ready to be emitted. However, it's
        // still sorted, which is something we don't want.
        for (int j = 0; j < n; ++j)
        {
            // Grab index to swap. We can do this with Xorshift, but I didn't bother.
            int index = rnd.Next(j, n);
            // Swap
            var tmp = randomBuffer[j];
            randomBuffer[j] = randomBuffer[index];
            randomBuffer[index] = tmp;
        }

        for (int j = 0; j < n; ++j)
        {
            uint randomNumber = randomBuffer[j];
            // Do something with random number buffer[i]
        }

        Console.WriteLine("{0} random in {1:0.00}s", i, sw.Elapsed.TotalSeconds);
    }
    Console.ReadLine();
}
Back to the requirements:
Is there a way in .NET to generate a sequence of all the 32-bit integers (Int32) in random order, without repetitions, and in a memory-efficient manner? Memory-efficient would mean using a maximum of just a few hundred mega bytes of main memory.
Cost: 512 MB + 4 MB.
Repetitions: none.
It's pretty fast. It just isn't 'uniformly' fast. Every 1 million numbers, you have to recalculate the buffer.
What's also nice: both algorithms can work together, so you can first generate the first -say- 2 billion numbers very fast, and then use the second algorithm for the rest.

One of the easiest solutions is to use a block encryption algorithm like AES in counter mode. You need a seed, which acts as the AES key. Next you need a counter which is incremented for each new random value. The random value is the result of encrypting the counter with the key. Since the mapping between cleartext (counter) and random number (ciphertext) is bijective, by the pigeonhole principle the random numbers are unique (within one block size).
Memory efficiency: you only need to store the seed and the counter.
The only limitation is that AES has a 128-bit block size instead of your 32 bits. So you might need to increase to 128 bits or find a block cipher with a 32-bit block size.
For your IEnumerable you can write a wrapper. The index is the counter.
Disclaimer: you are asking for non-repeating/unique values, which disqualifies this as truly random, because normally you should see collisions in random numbers. Therefore you should not use it for a long sequence. See also https://crypto.stackexchange.com/questions/25759/how-can-a-block-cipher-in-counter-mode-be-a-reasonable-prng-when-its-a-prp
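A hedged C# sketch of this counter-mode construction using .NET's AES on a hand-rolled 128-bit counter block (as noted above, the unique outputs are 128-bit values, not 32-bit; error handling omitted):
using var aes = Aes.Create();  // aes.Key is generated automatically and plays the role of the seed
aes.Mode = CipherMode.ECB;     // we encrypt one counter block at a time ourselves
aes.Padding = PaddingMode.None;
using ICryptoTransform enc = aes.CreateEncryptor();

byte[] counter = new byte[16];
byte[] output = new byte[16];
for (ulong i = 0; i < 4; i++)  // each distinct counter encrypts to a distinct ciphertext
{
    BitConverter.GetBytes(i).CopyTo(counter, 0);
    enc.TransformBlock(counter, 0, counter.Length, output, 0);
    Console.WriteLine(BitConverter.ToString(output));
}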

As your numbers are, per your definition, supposed to be random, there is by definition no other way than to store all of them, as the numbers have no intrinsic relation to each other.
So this would mean that you have to store all values you used in order to prevent them from being used again.
However, in computing, the pattern just needs to be not "noticeable". Usually the system calculates a random number by performing multiplication operations with huge predetermined values and timer values in such a way that they overflow and thus appear randomly selected. So either you use your third option, or you have to think about generating these pseudo-random numbers in a way that lets you reproduce the sequence of every number generated and check whether something is recurring. This obviously would be extremely computationally expensive, but you asked for memory efficiency.
So you could store the number you seeded your random generator with and the number of elements you generated. Each time you need a new number, reseed the generator and iterate through the number of elements you generated + 1. This is your new number. Now reseed and iterate through the sequence again to check if it occurred before.
So something like this:
int seed = 123;
Int64 counter = 0;
Random rnd = new Random(seed);

int GetUniqueRandom()
{
    int newNumber = rnd.Next();
    Random rndCheck = new Random(seed);
    counter++;
    // replay the first (counter - 1) numbers of the stream; the counter-th is newNumber itself
    for (int j = 0; j < counter - 1; j++)
    {
        int checkNumber = rndCheck.Next();
        if (checkNumber == newNumber)
            return GetUniqueRandom();
    }
    return newNumber;
}
EDIT: It was pointed out that counter will reach a huge value and there's no telling if it will overflow before you got all of the 4 billion values or not.

You could try this homebrew block-cipher:
public static uint Random(uint[] seed, uint m)
{
    for (int i = 0; i < seed.Length; i++)
    {
        m *= 0x6a09e667;
        m ^= seed[i];
        m += m << 16;
        m ^= m >> 16;
    }
    return m;
}
Test code:
const int seedSize = 3; // larger values result in higher quality but are slower
var seed = new uint[seedSize];
var seedBytes = new byte[4 * seed.Length];
new RNGCryptoServiceProvider().GetBytes(seedBytes);
Buffer.BlockCopy(seedBytes, 0, seed, 0, seedBytes.Length);

for (uint i = 0; i < uint.MaxValue; i++)
{
    Random(seed, i);
}
I haven't checked the quality of its outputs yet. Runs in 19 sec on my computer for seedSize = 3.

Related

Randomly ordered long sequence

From a start and an end, where both are of data type long, I'd like to produce a randomly sorted list of the values between them.
At the moment, I'm using a for loop to populate a list:
for (var i = idStart; i < idEnd; i++) { list.Add(i); }
Then I'm shuffling the list using an extension method. However, when the difference between start and end is large (millions), the for loop causes out of memory exceptions.
Is there a more efficient, sleeker method for producing a randomly sequenced list of longs, where each number only appears once?
Is there a more efficient, sleeker method for producing a randomly sequenced list of longs, where each number only appears once?
Yes, if you eliminate the requirement that the sequence be truly random. Use the following technique.
Without loss of generality let us suppose that you wish to generate numbers from 0 through n-1 for some n. Clearly you can see how to generate numbers between x and y; just generate numbers from 0 through y-x and then add x to each.
Find a randomly generated number z that is coprime to n. Doing so is left as an exercise to the reader. It will help if the number is pretty large modulo n; the pattern will be easy to notice if z is small modulo n.
Find a randomly generated number m that is between 0 and n-1.
Now generate the sequence (m) * z % n, (m + 1) * z % n, (m + 2) * z % n, and so on. The sequence repeats at (m + n) * z % n; it does not repeat before that. Again, determining why it does not repeat is left as an exercise.
It is easy to see that this is not a true shuffle because there are fewer than n squared possible sequences generated, not the n factorial sequences that are possible with a true shuffle. But it might be good enough for your purposes; if you are using something like System.Random to do randomization you are already abandoning a true shuffle.
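A hedged C# sketch of steps 1-3 above (z and m are assumptions chosen per those steps; n * z must stay within long range to avoid overflow):
static IEnumerable<long> MultiplierSequence(long n, long z, long m)
{
    // z must be coprime to n; m is any start in [0, n)
    for (long i = 0; i < n; i++)
        yield return (m + i) % n * z % n;
}
For instance, MultiplierSequence(100, 13, 7) visits each of 0..99 exactly once, since gcd(13, 100) = 1.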
I note also that many of the comments suggest that there should be no problem with a large allocation. These comments forget that (1) the relevant measure is not amount of RAM in the box but rather size of the largest contiguous user mode address space block, and that can easily be less than a hundred million bytes in a 32 bit process, (2) that the list data structure intentionally over-allocates, that (3) when the list gets full a copy of the underlying array must be allocated to copy the old list into the new list, which more than doubles the actual memory load of the list, temporarily, and that (4) a user who naively attempts to allocate one hundred-million-byte structure may well attempt to allocate a dozen of them throughout the program. You should always avoid such large allocations; if you have data structures that require large amounts of storage then put them on disk.

c# format preserving encryption for integers

I have a requirement for generating numeric codes that will be used as redemption codes for vouchers or similar. The requirement is that the codes are numeric and relatively short for speed on data entry for till operators. Around 6 characters long and numeric. We know that's a small number so we have a process in place so that the codes can expire and be re-used.
We started off by just using a sequential integer generator, which is working well in terms of generating a unique code. The issue with this is that the codes generated are sequential, and so predictable, which means customers could guess codes that we generate and redeem a voucher not meant for them.
I've been reading up on Format Preserving Encryption which seems like it might work well for us. We don't need to decrypt the code back at any point as the code itself is arbitrary we just need to ensure it's not predictable (by everyday people). It's not crucial for security it's just to keep honest people honest.
There are various ciphers referenced in the wikipedia article but I have very basic cryptographic and mathematical skills and am not capable of writing my own code to achieve this based on the ciphers.
I guess my question is: does anyone know of a C# implementation of this that will encrypt an integer into another integer and maintain the same length?
FPE seems to be used well for encrypting a 16-digit credit card number into another 16-digit number. We need the same sort of thing, but not necessarily fixed to a length, as long as the plain value's length matches the encrypted value's length.
So the following four integers would be encrypted
from
123456
123457
123458
123459
to something non-sequential like this
521482
265012
961450
346582
I'm open to any other suggestions to achieve this FPE just seemed like a good option.
EDIT
Thanks for the suggestions around just generating a unique code and storing them and checking for duplicates. for now we've avoided doing this because we don't want to have to check storage when we generate. This is why we use a sequential integer generator so we don't need to check if the code is unique or not. I'll re-investigate doing this but for now still looking for ways to avoid having to go to storage each time we generate a code.
I wonder if this will not be off base also, but let me give it a try. This solution will require no storage but will require processing power (a tiny amount, but it would not be pencil-and-paper easy). It is essentially a homemade PRNG but may have characteristics more suitable to what you want to do than the built-in ones do.
To make your number generator, make a polynomial with prime coefficients and a prime modulus. For example, let X represent the Nth voucher you issued. Then:
Voucher Number = (23x^4+19x^3+5x^2+29x+3)%65537. This is of course just an example; you could use any number of terms, any primes you want for the coefficients, and you can make the modulus as large as you like. In fact, the modulus does not need to be prime at all. It only sets the maximum voucher number. Having the coefficients be prime helps cut down on collisions.
In this case, vouchers #100, 101, and 102 would have numbers 26158, 12076, and 6949, respectively. Consider it a sort of toy encryption where the coefficients are your key. Not super secure, but nothing with an output space as small as you are asking for would be secure against a strong adversary. But this should stop the everyday fraudster.
Confirming a valid voucher would take the computer (but calculation only, not storage). It would iterate through a few thousand or tens of thousands of input X looking for the output Y that matches the voucher presented to you. When it found the match, it could signal a valid voucher.
Alternatively, you could issue the vouchers with the serial number and the calculation concatenated together, like a value and checksum. Then you could run the calculation on the value by hand using your secret coefficients to confirm validity.
As long as you do not reveal the coefficients to anyone, it is very hard to identify a pattern in the outputs. I am not sure if this is even close to as secure as what you were looking for, but posting the idea just in case.
I miscalculated the output for 100 (did it by hand and failed). Corrected it just now. Let me add some code to illustrate how I'd check for a valid voucher:
using System;
using System.Numerics;

namespace Vouchers
{
    class Program
    {
        static void Main(string[] args)
        {
            Console.Write("Enter voucher number: ");
            BigInteger input = BigInteger.Parse(Console.ReadLine());
            for (BigInteger i = 0; i < 10000000; i++)
            {
                BigInteger testValue = (23 * i * i * i * i + 19 * i * i * i + 5 * i * i + 29 * i + 3) % 65537;
                if (testValue == input)
                {
                    Console.WriteLine("That is voucher # " + i.ToString());
                    break;
                }
                if (i == 100) Console.WriteLine(testValue);
            }
            Console.ReadKey();
        }
    }
}
One option is to build an in-place random permutation of the numbers. Consider this code:
private static readonly Random random = new Random((int)DateTime.UtcNow.Ticks);

private static int GetRandomPermutation(int input)
{
    // Shuffles the decimal digits of the input. Note: if a zero digit lands in the
    // leading position, the parsed result is shorter than the input, so the output
    // length is not strictly preserved.
    char[] chars = input.ToString().ToCharArray();
    for (int i = 0; i < chars.Length; i++)
    {
        int j = random.Next(chars.Length);
        if (j != i)
        {
            char temp = chars[i];
            chars[i] = chars[j];
            chars[j] = temp;
        }
    }
    return int.Parse(new string(chars));
}
You mentioned running into performance issues with some other techniques. This method does a lot of work, so it may not meet your performance requirements. It's a neat academic exercise, anyway.
Thanks for the help from the comments to my original post on this, from Blogbeard and lc. It turns out we needed to hit storage when generating the codes anyway, so this meant implementing a PRNG was a better option for us rather than messing around with encryption.
This is what we ended up doing
Continue to use our sequential number generator to generate integers
Create an instance of C# Random class (a PRNG) using the sequential number as a seed.
Generate a random number within the range of the minimum and maximum number we want.
Check for duplicates and regenerate until we find a unique one
It turns out that using C# Random with a seed makes the random numbers actually quite predictable when using the sequential number as a seed for each generation.
For example, with a range between 1 and 999999 and a sequential seed, I tested generating 500000 values without a single collision.

Fast Algorithm for computing percentiles to remove outliers

I have a program that needs to repeatedly compute the approximate percentile (order statistic) of a dataset in order to remove outliers before further processing. I'm currently doing so by sorting the array of values and picking the appropriate element; this is doable, but it's a noticeable blip on the profiles despite being a fairly minor part of the program.
More info:
The data set contains on the order of up to 100000 floating point numbers, and is assumed to be "reasonably" distributed - there are unlikely to be duplicates or huge spikes in density near particular values; and if for some odd reason the distribution is odd, it's OK for an approximation to be less accurate since the data is probably messed up anyhow and further processing dubious. However, the data isn't necessarily uniformly or normally distributed; it's just very unlikely to be degenerate.
An approximate solution would be fine, but I do need to understand how the approximation introduces error to ensure it's valid.
Since the aim is to remove outliers, I'm computing two percentiles over the same data at all times: e.g. one at 95% and one at 5%.
The app is in C# with bits of heavy lifting in C++; pseudocode or a preexisting library in either would be fine.
An entirely different way of removing outliers would be fine too, as long as it's reasonable.
Update: It seems I'm looking for an approximate selection algorithm.
Although this is all done in a loop, the data is (slightly) different every time, so it's not easy to reuse a datastructure as was done for this question.
Implemented Solution
Using the wikipedia selection algorithm as suggested by Gronim reduced this part of the run-time by about a factor of 20.
Since I couldn't find a C# implementation, here's what I came up with. It's faster even for small inputs than Array.Sort; and at 1000 elements it's 25 times faster.
public static double QuickSelect(double[] list, int k) {
    return QuickSelect(list, k, 0, list.Length);
}
public static double QuickSelect(double[] list, int k, int startI, int endI) {
    while (true) {
        // Assume startI <= k < endI
        int pivotI = (startI + endI) / 2; // arbitrary, but good if sorted
        int splitI = partition(list, startI, endI, pivotI);
        if (k < splitI)
            endI = splitI;
        else if (k > splitI)
            startI = splitI + 1;
        else // if (k == splitI)
            return list[k];
    }
    // when this returns, all elements of list[i] <= list[k] iff i <= k
}
static int partition(double[] list, int startI, int endI, int pivotI) {
    double pivotValue = list[pivotI];
    list[pivotI] = list[startI];
    list[startI] = pivotValue;
    int storeI = startI + 1; // no need to store # pivot item, it's good already.
    // Invariant: startI < storeI <= endI
    while (storeI < endI && list[storeI] <= pivotValue) ++storeI; // fast if sorted
    // now storeI == endI || list[storeI] > pivotValue
    // so elem #storeI is either irrelevant or too large.
    for (int i = storeI + 1; i < endI; ++i)
        if (list[i] <= pivotValue) {
            list.swap_elems(i, storeI);
            ++storeI;
        }
    int newPivotI = storeI - 1;
    list[startI] = list[newPivotI];
    list[newPivotI] = pivotValue;
    // now [startI, newPivotI] are <= to pivotValue && list[newPivotI] == pivotValue.
    return newPivotI;
}
static void swap_elems(this double[] list, int i, int j) {
    double tmp = list[i];
    list[i] = list[j];
    list[j] = tmp;
}
Thanks, Gronim, for pointing me in the right direction!
The histogram solution from Henrik will work. You can also use a selection algorithm to efficiently find the k largest or smallest elements in an array of n elements in O(n). To use this for the 95th percentile set k=0.05n and find the k largest elements.
Reference:
http://en.wikipedia.org/wiki/Selection_algorithm#Selecting_k_smallest_or_largest_elements
According to its creator, a SoftHeap can be used to "compute exact or approximate medians and percentiles optimally. It is also useful for approximate sorting..."
I used to identify outliers by calculating the standard deviation. Everything at a distance of more than 2 (or 3) times the standard deviation from the average is an outlier; 2 times covers about 95%.
Since you are already calculating the average, the standard deviation is also very easy and fast to calculate.
You could also use only a subset of your data to calculate the numbers.
You could estimate your percentiles from just a part of your dataset, like the first few thousand points.
The Glivenko–Cantelli theorem ensures that this would be a fairly good estimate, if you can assume your data points to be independent.
Divide the interval between minimum and maximum of your data into (say) 1000 bins and calculate a histogram. Then build partial sums and see where they first exceed 5000 or 95000.
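A hedged C# sketch of this histogram approach (requires System.Linq; assumes max > min; the bin count and the 5%/95% cutoffs follow the suggestion above):
static (double p5, double p95) HistogramPercentiles(double[] data)
{
    const int bins = 1000;
    double min = data.Min(), max = data.Max();
    double width = (max - min) / bins;
    var counts = new int[bins];
    foreach (double v in data)
        counts[Math.Min((int)((v - min) / width), bins - 1)]++; // clamp the max value into the last bin

    double p5 = max, p95 = max;
    bool found5 = false, found95 = false;
    int running = 0; // partial sums
    for (int b = 0; b < bins && !found95; b++)
    {
        running += counts[b];
        if (!found5 && running >= 0.05 * data.Length) { p5 = min + (b + 1) * width; found5 = true; }
        if (!found95 && running >= 0.95 * data.Length) { p95 = min + (b + 1) * width; found95 = true; }
    }
    return (p5, p95);
}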
There are a couple basic approaches I can think of. First is to compute the range (by finding the highest and lowest values), project each element to a percentile ((x - min) / range) and throw out any that evaluate to lower than .05 or higher than .95.
The second is to compute the mean and standard deviation. A span of 2 standard deviations from the mean (in both directions) will enclose 95% of a normally-distributed sample space, meaning your outliers would be in the <2.5 and >97.5 percentiles. Calculating the mean of a series is linear, as is the standard dev (square root of the sum of the difference of each element and the mean). Then, subtract 2 sigmas from the mean, and add 2 sigmas to the mean, and you've got your outlier limits.
Both of these will compute in roughly linear time; the first one requires two passes, the second one takes three (once you have your limits you still have to discard the outliers). Since this is a list-based operation, I do not think you will find anything with logarithmic or constant complexity; any further performance gains would require either optimizing the iteration and calculation, or introducing error by performing the calculations on a sub-sample (such as every third element).
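A hedged sketch of the second (mean and sigma) approach, again assuming System.Linq:
static (double lower, double upper) SigmaBounds(double[] data)
{
    double mean = data.Average();                                                    // pass 1
    double sigma = Math.Sqrt(data.Sum(x => (x - mean) * (x - mean)) / data.Length);  // pass 2
    return (mean - 2 * sigma, mean + 2 * sigma); // encloses ~95% of a normally-distributed sample
}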
A good general answer to your problem seems to be RANSAC.
Given a model, and some noisy data, the algorithm efficiently recovers the parameters of the model.
You will have to choose a simple model that can map your data. Anything smooth should be fine, say a mixture of a few Gaussians. RANSAC will set the parameters of your model and estimate a set of inliers at the same time. Then throw away whatever doesn't fit the model properly.
You could filter out points beyond 2 or 3 standard deviations even if the data is not normally distributed; at least it will be done in a consistent manner, which should be important.
As you remove the outliers, the std dev will change, so you could do this in a loop until the change in std dev is minimal. Whether or not you want to do this depends upon why you are manipulating the data this way. There are major reservations by some statisticians against removing outliers. But some remove the outliers to prove that the data is fairly normally distributed.
Not an expert, but my memory suggests:
to determine percentile points exactly you need to sort and count
taking a sample from the data and calculating the percentile values sounds like a good plan for decent approximation if you can get a good sample
if not, as suggested by Henrik, you can avoid the full sort if you do the buckets and count them
One set of data of 100k elements takes almost no time to sort, so I assume you have to do this repeatedly. If the data set is the same set just updated slightly, you're best off building a tree (O(N log N)) and then removing and adding new points as they come in (O(K log N) where K is the number of points changed). Otherwise, the kth largest element solution already mentioned gives you O(N) for each dataset.

smart way to generate unique random number

I want to generate a sequence of unique random numbers in the range of 00000001 to 99999999.
So the first one might be 00001010, the second 40002928 etc.
The easy way is to generate a random number and store it in the database, and every next time do it again and check in the database if the number already exists and, if so, generate a new one, check it again, etc.
But that doesn't look right; I could be regenerating a number maybe 100 times if the number of generated items gets large.
Is there a smarter way?
EDIT
As always I forgot to say WHY I wanted this, and it will probably make things clearer and maybe get an alternative, and it is:
we want to generate an order number for a booking, so we could just use 000001, 000002 etc. But we don't want to give the competitors a clue of how many orders are created (because it's not a high volume market, and we don't want them to know if we are on order 30 after 2 months or at order 100). So we want to have an order number which is random (yet unique).
You can use either an Linear Congruential Generator (LCG) or Linear Feedback Shift Register (LFSR). Google or wikipedia for more info.
Both can, with the right parameters, operate on a 'full-cycle' (or 'full period') basis so that they will generate a 'pseudo-random number' only once in a single period, and generate all numbers within the range. Both are 'weak' generators, so no good for cryptography, but perhaps 'good enough' for apparent randomness. You may have to constrain the period to work within your 'decimal' maximum, as the natural periods are 'binary'.
Update: I should add that it is not necessary to pre-calculate or pre-store previous values in any way, you only need to keep the previous seed-value (single int) and calculate 'on-demand' the next number in the sequence. Of course you can save a chain of pre-calculated numbers to your DB if desired, but it isn't necessary.
How about creating a set all of possible numbers and simply randomising the order? You could then just pick the next number from the tail.
Each number appears only once in the set, and when you want a new one it has already been generated, so the overhead is tiny at the point at which you want one. You could do this in memory or the database of your choice. You'll just need a sensible locking strategy for pulling the next available number.
You could build a table with all the possible numbers in it, give the record a 'used' field.
Select all records that have not been 'used'
Pick a random number (r) between 1 and record count
Take record number r
Get your 'random value' from the record
Set the 'used' flag and update the db.
That should be more efficient than picking random numbers, querying the database and repeating until not found, as that's just begging for an eternity for the last few values.
Use Pseudo-random Number Generators.
For example - Linear Congruential Random Number Generator
(if increment and n are coprime, then the code will generate all numbers from 0 to n-1):
int seed = 1, increment = 3;
int n = 10;
int x = seed;
for (int i = 0; i < n; i++)
{
    x = (x + increment) % n;
    Console.WriteLine(x);
}
Output:
4
7
0
3
6
9
2
5
8
1
Basic Random Number Generators
Mersenne Twister
Using this algorithm might be suitable, though it's memory consuming:
http://en.wikipedia.org/wiki/Fisher%E2%80%93Yates_shuffle
Put the numbers in the array from 1 to 99999999 and do the shuffle.
For the extremely limited size of your numbers, no, you cannot expect uniqueness from any type of random generation.
You are generating a 32-bit integer, whereas to reach uniqueness you need a much larger number, somewhere around 128 bits, which is the size GUIDs use; GUIDs are guaranteed to always be globally unique.
In case you happen to have access to a library and you want to dig into and understand the issue well, take a look at
The Art of Computer Programming, Volume 2: Seminumerical Algorithms
by Donald E. Knuth. Chapter 3 is all about random numbers.
You could just place your numbers in a set. If the size of the set after generation of your N numbers is too small, generate some more.
Do some trial runs. How many numbers do you have to generate on average? Try to find out an optimal solution to the tradeoff "generate too many numbers" / "check too often for duplicates". This optimal is a number M, so that after generating M numbers, your set will likely hold N unique numbers.
Oh, and M can also be calculated: If you need an extra number (your set contains N-1), then the chance of a random number already being in the set is (N-1)/R, with R being the range. I'm going crosseyed here, so you'll have to figure this out yourself (but this kinda stuff is what makes programming fun, no?).
You could put a unique constraint on the column that contains the random number, then handle any constraint violations by regenerating the number. I think this normally indexes the column as well, so this would be faster.
You've tagged the question with C#, so I'm guessing you're using C# to generate the random number. Maybe think about getting the database to generate the random number in a stored proc, and return it.
You could try writing usernames by using a starting number and an incremental number. You start at a number (say, 12000); then, for each account created, the number goes up by the incremental value.
id = startValue + (totalNumberOfAccounts * incrementalNumber)
If incrementalNumber is a prime value, you should be able to loop around the max account value and not hit another value. This creates the illusion of a random id, but should also have very few conflicts. In the case of a conflict, you could add a number that increases when there's a conflict, so the above code becomes the following. We want to handle this case since, if we encounter one account value that is identical, we would bump into another conflict when we increment again.
id = startValue + (totalNumberOfAccounts * incrementalNumber) + totalConflicts
The following lines get us e.g. 6 non-repeating random numbers for a range of e.g. 1 to 100.
var randomNumbers = Enumerable.Range(1, 100)
    .OrderBy(n => Guid.NewGuid())
    .Take(6)
    .OrderBy(n => n);
I've had to do something like this before (create a "random looking" number for part of a URL). What I did was create a list of randomly generated keys. Each time it needed a new number, it simply randomly selected an index below keys.Count, XORed that key with the given sequence number, then output the XORed value (in base 62) prefixed with the key's index (in base 62).
I also check the output to ensure it does not contain any naughty words. If it does, simply take the next key and have a second go.
Decrypting the number is equally simple (the first digit is the index of the key to use; a simple XOR and you are done).
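A hedged C# sketch of that scheme (simplified to a decimal encoding instead of base 62; key values are assumed non-negative and small enough that the masked value stays a manageable width):
static string Encode(int sequenceNumber, int[] keys, Random rng)
{
    int keyIndex = rng.Next(keys.Length);         // pick a key at random
    int masked = sequenceNumber ^ keys[keyIndex];
    return $"{keyIndex:D2}{masked:D8}";           // prefix identifies which key was used
}

static int Decode(string code, int[] keys)
{
    int keyIndex = int.Parse(code.Substring(0, 2)); // leading digits select the key
    return int.Parse(code.Substring(2)) ^ keys[keyIndex];
}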
I like andora's answer if you are generating new numbers and might have used it had I known. However if I was to do this again I would have simply used UUIDs. Most (if not every) platform has a method for generating them and the length is just not an issue for URLs.
You could try shuffling the set of possible values then using them sequentially.
I like Lazarus's solution, but if you want to avoid effectively pre-allocating the space for every possible number, just store the used numbers in the table, but build an "unused numbers" list in memory by adding all possible numbers to a collection then deleting every one that's present in the database. Then select one of the remaining numbers and use that, adding it to the list in the database, obviously.
But, like I say, I like Lazarus's solution - I think that's your best bet for most scenarios.
function getShuffledNumbers(count) {
    var shuffledNumbers = new Array();
    var choices = new Array();
    for (var i = 0; i < count; i++) {
        // choose a number between 1 and amount of numbers remaining
        choices[i] = selectedNumber = Math.ceil(Math.random() * (99999999 - i));
        // Now to figure out the number based on this selection, work backwards until
        // you figure out which choice this number WOULD have been on the first step
        for (var j = 0; j < i; j++) {
            if (choices[i - 1 - j] >= selectedNumber) {
                // This basically says "it was choice number (selectedNumber) on the last step,
                // but if it's greater than or equal to this, it must have been choice number
                // (selectedNumber + 1) on THIS step."
                selectedNumber++;
            }
        }
        shuffledNumbers[i] = selectedNumber;
    }
    return shuffledNumbers;
}
This is as fast a way as I could think of, and it only uses memory as it needs it; however, if you run it all the way through, it will use twice as much memory because it keeps two arrays, choices and shuffledNumbers.
Running a linear congruential generator once to generate each number is apt to produce rather feeble results. Running it through a number of iterations which is relatively prime to your base (100,000,000 in this case) will improve it considerably. If before reporting each output from the generator, you run it through one or more additional permutation functions, the final output will still be a duplicate-free permutation of as many numbers as you want (up to 100,000,000) but if the proper functions are chosen the result can be cryptographically strong.
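A hedged C# sketch of that suggestion over the 2^32 range (the LCG constants are illustrative; any odd step count is coprime to the 2^32 period, mirroring the 'relatively prime to your base' requirement):
static IEnumerable<uint> SteppedLcg(uint seed, int stepsPerOutput = 3)
{
    // stepping an odd number of times through a full-period 2^32 cycle
    // still visits every state exactly once, just in a different order
    uint x = seed;
    for (long i = 0; i < 4294967296L; i++)
    {
        for (int s = 0; s < stepsPerOutput; s++)
            x = unchecked(1664525u * x + 1013904223u);
        yield return x;
    }
}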
Create and store in the db two shuffled versions (SHUFFLE_1 and SHUFFLE_2) of the interval [0..N), where N = 10,000;
whenever a new order is created, you assign its id like this:
ORDER_FAKE_INDEX = N*SHUFFLE_1[ORDER_REAL_INDEX / N] + SHUFFLE_2[ORDER_REAL_INDEX % N]
I also came across the same kind of problem, but in C#. I finally solved it; hope it works for you also.
Suppose I need a random number between 0 and some MaxValue, and I have a Random object called random.
int n = 0;
while (n < MaxValue)
{
    int i = random.Next(n, MaxValue);
    n++;
    Console.Write(i.ToString());
}
The stupid way: build a table as a record and store all the numbers first; then, every time a number is used, flag it as "used".
System.Random rnd = new System.Random();
IEnumerable<int> numbers = Enumerable.Range(0, 99999999).OrderBy(r => rnd.Next());
This gives a randomly shuffled collection of ints in your range. You can then iterate through the collection in order.
The nice part about this is that you're not actually creating the entire collection in memory.
See comments below - this will generate the entire collection in memory when you iterate to the first element.
You can generate numbers like below if you are OK with the memory consumption.
import java.util.ArrayList;
import java.util.Collections;

public class UniqueRandomNumbers {
    public static void main(String[] args) {
        ArrayList<Integer> list = new ArrayList<Integer>();
        for (int i = 1; i < 11; i++) {
            list.add(i);
        }
        Collections.shuffle(list);
        for (int i = 0; i < list.size(); i++) { // iterate over the list's actual size
            System.out.println(list.get(i));
        }
    }
}

Radix sort for strings of arbitrary lengths

I need to sort a huge list of text strings of arbitrary length. I suppose radix sort is the best option here. The list is really huge, so padding the strings to the same length is completely impossible.
Is there any ready-made implementation for this task, preferably in C#?
Depending on what you need, you might find inserting all the strings into some form of Trie to be the best solution. Even a basic Ternary Search Trie will have a smaller memory footprint than an array/list of strings and will store the strings in sorted order.
Insertion, lookup and removal are all O(k * log(a)), where a is the size of your alphabet (the number of possible values for a character). Since a is constant, so is log(a), so you end up with an O(n * k) algorithm for sorting.
Edit: In case you are unfamiliar with Tries, they are basically n-ary trees where each edge represents a single character of the key. When inserting, you check if the root node contains an edge (or child, whatever) that matches the first character of your key. If so, you follow that path and repeat with the second character and so on. If not, you add a new edge. In a Ternary Search Trie, the edges/children are stored in a binary tree so the characters are in sorted order and can be searched in log(a) time. If you want to waste memory you can store the edges/children in an array of size a which gives you constant lookup at each step.
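To make this concrete, here is a hedged C# sketch of a simple trie-based sort; it uses a plain trie with a SortedDictionary per node standing in for the binary tree of edges a real Ternary Search Trie would use:
class TrieNode
{
    public int Count; // how many times this exact string was inserted
    public readonly SortedDictionary<char, TrieNode> Children = new SortedDictionary<char, TrieNode>();
}

static void Insert(TrieNode root, string key)
{
    TrieNode node = root;
    foreach (char c in key)
    {
        if (!node.Children.TryGetValue(c, out TrieNode next))
            node.Children[c] = next = new TrieNode(); // add a new edge for this character
        node = next;
    }
    node.Count++;
}

static IEnumerable<string> InSortedOrder(TrieNode node, string prefix = "")
{
    for (int i = 0; i < node.Count; i++)
        yield return prefix;                // shorter strings sort before their extensions
    foreach (var edge in node.Children)     // SortedDictionary iterates edges in character order
        foreach (string s in InSortedOrder(edge.Value, prefix + edge.Key))
            yield return s;
}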
See this thread on radix sort, or this one on a radix sort implementation.
How many are many - one million?
The built-in List<string>.Sort() takes O(n * log(n)) on average.
log2(10^6) ≈ 20, so that is not very much slower than O(n) for 10^6 elements. If your strings are more than 20 characters long, radix sort's O(n * k) will be "slower".
I doubt a radix sort will be significantly faster than the built in sort. But it would be fun to measure and compare.
Edit: there is a point to these statements I made previously, but the point is wrong overall.
Radix sort is the wrong sort to use on large numbers of strings. For things like
I really like squirrels. Yay, yay, yay!
I really like blue jays. Yay, yay, yay!
I really like registers. Yay, yay, yay!
you will have a bunch of entries falling in the same bucket. You could avoid this by hashing, but what use is sorting a hash code?
Use quicksort or mergesort or the like. (Quicksort generally performs better and takes less memory, but many examples have worst-case performance of O(N^2) which almost never occurs in practice; Mergesort doesn't perform quite as well but is usually implemented to be stable, and it's easy to do part in memory and part on disk.) That is, use the built-in sort function.
Edit: Well, it turns out that at least on very large files with long repeats at the beginning (e.g. source code) and with many lines exactly the same (100x repeats, in fact), radix sort does start becoming competitive with quicksort. I'm surprised! But, anyway, here is the code I used to implement radix sort. It's in Scala, not C#, but I've written it in fairly iterative style so it should be reasonably obvious how things work. The only two tricky bits are that (a(i)(ch) & 0xFF) is to extract a 0-255 byte from an array of arrays of bytes (bytes are signed), that counts.scanLeft(0)(_ + _) forms a cumulative sum of the counts, starting from zero (and then indices.clone.take(257) takes all but the last one), and that Scala allows multiple parameter lists (so I split up the always-provided argument from the arguments that have defaults that are used in recursion). Here it is:
def radixSort(a: Array[Array[Byte]])(i0: Int = 0, i1: Int = a.length, ch: Int = 0) {
  val counts = new Array[Int](257)
  var i = i0
  while (i < i1) {
    if (a(i).length <= ch) counts(0) += 1
    else { counts((a(i)(ch) & 0xFF) + 1) += 1 }
    i += 1
  }
  val indices = counts.scanLeft(0)(_ + _)
  val starts = indices.clone.take(257)
  i = i0
  while (i < i1) {
    val bucket = if (a(i).length <= ch) 0 else (a(i)(ch) & 0xFF) + 1
    if (starts(bucket) + i0 <= i && i < starts(bucket) + i0 + counts(bucket)) {
      if (indices(bucket) <= i) indices(bucket) = i + 1
      i += 1
    }
    else {
      val temp = a(indices(bucket) + i0)
      a(indices(bucket) + i0) = a(i)
      a(i) = temp
      indices(bucket) += 1
    }
  }
  i = 1
  while (i < counts.length) {
    if (counts(i) > 1) {
      radixSort(a)(i0 + starts(i), i0 + starts(i) + counts(i), ch + 1)
    }
    i += 1
  }
}
And the timings are that with 7M lines of source code (100x duplication of 70k lines), the radix sort ties the built-in library sort, and wins thereafter.
String.Compare() overloads already implement such string comparison. All you need is to feed this to your sort algorithm.
UPDATE
This is the implementation:
[MethodImpl(MethodImplOptions.InternalCall)]
internal static extern int nativeCompareString(int lcid, string string1, int offset1, int length1, string string2, int offset2, int length2, int flags);
Hard to beat this native unmanaged implementation with your own implementation.
