The .NET reference source shows the implementation of NextBytes() as:
for (int i = 0; i < buffer.Length; i++)
{
    buffer[i] = (byte)(InternalSample() % (Byte.MaxValue + 1));
}
InternalSample provides a value in [0, int.MaxValue), as evidenced by its doc comment and the fact that Next(), which is documented to return this range, simply calls InternalSample.
My concern is that, since InternalSample can produce int.MaxValue different values, and since that number is not evenly divisible by 256, we should see some slight bias in the resulting bytes, with some values (in this case just 255) occurring less frequently than others.
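To make the counting argument concrete, here is a quick sanity check, assuming InternalSample is uniform on [0, int.MaxValue) as its doc comment says:

using System;

long total = int.MaxValue;       // 2^31 - 1 possible sample values: 0 .. 2^31 - 2
long perByte = total / 256;      // every byte value maps from at least this many samples
long leftover = total % 256;     // the first 255 byte values each get one extra sample
Console.WriteLine($"bytes 0..{leftover - 1}: {perByte + 1} source values each");  // 8,388,608
Console.WriteLine($"bytes {leftover}..255: {perByte} source values each");        // 8,388,607

So 255 maps from one fewer source value than every other byte, which is exactly the bias described above.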
My question is:
Is this analysis correct or is the method in fact unbiased?
If the bias exists, is it strong enough to matter for any real application?
FYI I know Random should not be used for cryptographic purposes; I'm thinking about its valid use cases (e.g. simulations).
Your analysis is indeed correct. But the defect is about one part in two billion, i.e. 1/2^31, so it is fairly negligible.
The question one should ask is: is it even detectable? For example, how many samples N does one need to establish the bias with, say, 99% certainty? From what I know, N > s^2 z^2 / epsilon^2, with
z = 2.58,
epsilon = 1 / 2^32 and
s^2 = p - p^2
p = 1/2^8 - 1/2^31
This would require about 4.77x10^17 samples, a number so large that this bias will hardly be the most obvious defect.
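Plugging those figures in, as a quick check of the arithmetic (the values are the ones assumed above):

using System;

double z = 2.58;                            // 99% confidence
double p = 1.0 / 256 - 1.0 / 2147483648.0;  // 1/2^8 - 1/2^31
double s2 = p - p * p;
double epsilon = 1.0 / 4294967296.0;        // 1/2^32
double n = s2 * z * z / (epsilon * epsilon);
Console.WriteLine(n);                       // about 4.8e17, the order of magnitude quoted above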
Refer to Knuth vol. 2, 3.2.1.1 Choice of Modulus. You actually want a modulus that is not equal to 256; with 256, the lower 4 bits of the resulting byte are considerably less random than those obtained using 257 (p. 12).
257 is also prime, which is convenient to help reduce bias and lengthen the pseudo-random sequence.
Any pseudo-random sequence is, by definition, not truly random. For non-cryptographic applications, what is unbiased enough? If in doubt, my recommendation is to sample the generated numbers the way your application is going to draw them and do some statistical analysis. The off-the-shelf random number generators are good enough for many applications, but not necessarily good enough for yours.
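For completeness, if the bias ever did matter, the standard fix is rejection sampling: throw away samples from the uneven tail before taking the remainder. A minimal sketch (this is not what Random.NextBytes does):

using System;

static class UnbiasedBytes
{
    // Largest multiple of 256 that fits in [0, int.MaxValue).
    private const int Limit = int.MaxValue - int.MaxValue % 256;   // 2,147,483,392

    public static byte Next(Random r)
    {
        int sample;
        do
        {
            sample = r.Next();          // uniform on [0, int.MaxValue)
        } while (sample >= Limit);      // reject the 255 values in the uneven tail
        return (byte)(sample % 256);    // every byte value is now equally likely
    }
}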
Related
I have a requirement for generating numeric codes that will be used as redemption codes for vouchers or similar. The requirement is that the codes are numeric and relatively short for speed of data entry by till operators: around 6 characters long. We know that's a small space, so we have a process in place so that the codes can expire and be re-used.
We started off by just using a sequential integer generator, which is working well in terms of generating a unique code. The issue with this is that the codes generated are sequential and therefore predictable, which means customers could guess codes that we generate and redeem a voucher not meant for them.
I've been reading up on Format Preserving Encryption, which seems like it might work well for us. We don't need to decrypt the code back at any point, as the code itself is arbitrary; we just need to ensure it's not predictable (by everyday people). It's not crucial for security; it's just to keep honest people honest.
There are various ciphers referenced in the Wikipedia article, but I have very basic cryptographic and mathematical skills and am not capable of writing my own code to achieve this based on the ciphers.
I guess my question is: does anyone know of a C# implementation of this that will encrypt an integer into another integer and maintain the same length?
FPE seems to be used for encrypting a 16-digit credit card number into another 16-digit number. We need the same sort of thing, not necessarily fixed to one length, as long as the plain value's length matches the encrypted value's length.
So the following four integers would be encrypted
from
123456
123457
123458
123459
to something non-sequential like this
521482
265012
961450
346582
I'm open to any other suggestions to achieve this; FPE just seemed like a good option.
EDIT
Thanks for the suggestions around just generating a unique code, storing it, and checking for duplicates. For now we've avoided doing this because we don't want to have to check storage when we generate. This is why we use a sequential integer generator, so we don't need to check if the code is unique or not. I'll re-investigate doing this, but for now I'm still looking for ways to avoid having to go to storage each time we generate a code.
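For reference, here is a minimal sketch of what a format-preserving permutation over a 6-digit domain can look like: a toy 4-round Feistel network with an HMAC-based round function. This is an illustration only, not a vetted FPE scheme such as FF1/FF3, and the key handling and round function are arbitrary choices:

using System;
using System.Security.Cryptography;
using System.Text;

static class TinyFpe
{
    // Keyed round function: derive a value in 0..999 from (round, right half)
    // using HMAC-SHA256. The exact derivation is an arbitrary illustrative choice.
    static int RoundFunction(byte[] key, int round, int half)
    {
        using (var hmac = new HMACSHA256(key))
        {
            byte[] hash = hmac.ComputeHash(Encoding.ASCII.GetBytes(round + ":" + half));
            return (BitConverter.ToInt32(hash, 0) & 0x7FFFFFFF) % 1000;
        }
    }

    // 4-round balanced Feistel permutation over 0..999999, treating the code as
    // two base-1000 halves. Each round is invertible, so the whole mapping is a
    // bijection on the 6-digit space: same length in, same length out.
    public static int Encrypt(byte[] key, int value)
    {
        int left = value / 1000, right = value % 1000;
        for (int round = 0; round < 4; round++)
        {
            int next = (left + RoundFunction(key, round, right)) % 1000;
            left = right;
            right = next;
        }
        return left * 1000 + right;
    }
}

With a fixed secret key, sequential inputs such as 123456, 123457, ... map to scattered 6-digit outputs (pad with leading zeros when displaying), and the mapping is deterministic, so no storage or duplicate check is needed.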
This may also be off base, but let me give it a try. This solution requires no storage but does require processing power (a tiny amount, but it would not be pencil-and-paper easy). It is essentially a homemade PRNG, but it may have characteristics more suitable to what you want to do than the built-in ones do.
To make your number generator, make a polynomial with prime coefficients and a prime modulus. For example, let x represent the Nth voucher you issued. Then:
Voucher Number = (23x^4+19x^3+5x^2+29x+3)%65537. This is of course just an example; you could use any number of terms, any primes you want for the coefficients, and you can make the modulus as large as you like. In fact, the modulus does not need to be prime at all. It only sets the maximum voucher number. Having the coefficients be prime helps cut down on collisions.
In this case, vouchers #100, 101, and 102 would have numbers 26158, 12076, and 6949, respectively. Consider it a sort of toy encryption where the coefficients are your key. Not super secure, but nothing with an output space as small as you are asking for would be secure against a strong adversary. But this should stop the everyday fraudster.
Confirming a valid voucher would take the computer (calculation only, though, not storage). It would iterate through a few thousand or tens of thousands of input values x looking for the output y that matches the voucher presented to you. When it found the match, it could signal a valid voucher.
Alternatively, you could issue the vouchers with the serial number and the calculation concatenated together, like a value and checksum. Then you could run the calculation on the value by hand using your secret coefficients to confirm validity.
As long as you do not reveal the coefficients to anyone, it is very hard to identify a pattern in the outputs. I am not sure if this is even close to as secure as what you were looking for, but posting the idea just in case.
I miscalculated the output for 100 (did it by hand and failed). Corrected it just now. Let me add some code to illustrate how I'd check for a valid voucher:
using System;
using System.Numerics;

namespace Vouchers
{
    class Program
    {
        static void Main(string[] args)
        {
            Console.Write("Enter voucher number: ");
            BigInteger input = BigInteger.Parse(Console.ReadLine());

            // Brute-force search for the serial number whose voucher value matches the input.
            for (BigInteger i = 0; i < 10000000; i++)
            {
                BigInteger testValue = (23 * i * i * i * i + 19 * i * i * i + 5 * i * i + 29 * i + 3) % 65537;
                if (testValue == input)
                {
                    Console.WriteLine("That is voucher # " + i.ToString());
                    break;
                }
                if (i == 100) Console.WriteLine(testValue);  // debug output for voucher #100
            }
            Console.ReadKey();
        }
    }
}
One option is to build an in-place random permutation of the numbers. Consider this code:
private static readonly Random random = new Random((int)DateTime.UtcNow.Ticks);

private static int GetRandomPermutation(int input)
{
    // Shuffle the digits of the input, then reassemble the number.
    char[] chars = input.ToString().ToCharArray();
    for (int i = 0; i < chars.Length; i++)
    {
        int j = random.Next(chars.Length);
        if (j != i)
        {
            char temp = chars[i];
            chars[i] = chars[j];
            chars[j] = temp;
        }
    }
    return int.Parse(new string(chars));
}
You mentioned running into performance issues with some other techniques. This method does a lot of work, so it may not meet your performance requirements. It's a neat academic exercise, anyway.
Thanks for the help in the comments on my original post from Blogbeard and lc. It turns out we needed to hit storage when generating the codes anyway, which meant a PRNG was a better option for us than messing around with encryption.
This is what we ended up doing
Continue to use our sequential number generator to generate integers
Create an instance of C# Random class (a PRNG) using the sequential number as a seed.
Generate a random number within the range of the minimum and maximum number we want.
Check for duplicates and regenerate until we find a unique one
It turns out that using C# Random with a seed makes the random numbers actually quite predictable when using the sequential number as a seed for each generation.
For example, with a range between 1 and 999999 and a sequential seed, I tested generating 500000 values without a single collision.
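For illustration, this is roughly what that ends up looking like as a sketch; the in-memory HashSet here stands in for our real storage check:

using System;
using System.Collections.Generic;

class VoucherCodeGenerator
{
    private readonly HashSet<int> issued = new HashSet<int>();  // stand-in for real storage
    private int sequence = 0;

    public int NextCode(int min, int max)
    {
        while (true)
        {
            int seed = ++sequence;              // 1. sequential number
            var rng = new Random(seed);         // 2. seed the PRNG with it
            int code = rng.Next(min, max + 1);  // 3. value in [min, max]; upper bound is exclusive
            if (issued.Add(code))               // 4. regenerate on duplicates
                return code;
        }
    }
}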
I suspect the answer is 'Because of Math', but I was hoping someone could give a little more insight at a basic level...
I was poking around in the BCL source code today, having a look at how some of the classes I've used before were actually implemented. I'd never thought about how to generate (pseudo) random numbers before, so I decided to see how it was done.
Full source here: http://referencesource.microsoft.com/#mscorlib/system/random.cs#29
private const int MSEED = 161803398;
This MSEED value is used every time a Random() class is seeded.
Anyway, I saw this 'magic number' - 161803398 - and I don't have the foggiest idea of why that number was selected. It's not a prime number or a power of 2. It's not 'half way' to a number that seemed more significant. I looked at it in binary and hex and well, it just looked like a number to me.
I tried searching for the number in Google, but I found nothing.
No, but it's based on Phi (the "golden ratio").
161803398 = 1.61803398 * 10^8 ≈ φ * 10^8
More about the golden ratio here.
And a really good read for the casual mathematician here.
And I found a research paper on random number generators that agrees with this assertion. (See page 53.)
This number is taken from the golden ratio: 1.61803398 * 10^8. Matt gave a nice answer about what this number is, so I will just explain a little about the algorithm.
This is not a special number for this algorithm. The algorithm is Knuth's subtractive random number generator algorithm and the main points of it are:
store a circular list of 56 random numbers
initialization fills the list, then randomizes those values with a specific deterministic algorithm
two indices are kept which are 31 apart
the new random number is the difference of the two values at the two indices
the new random number is stored back in the list
The generator is based on the following recurrence: X(n) = (X(n-55) - X(n-24)) mod m, where n ≥ 0. This is a special case of a lagged Fibonacci generator: X(n) = (X(n-j) # X(n-k)) mod m, where 0 < k < j and # is a binary operation (subtraction, addition, xor).
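A rough sketch of that structure in C#; the constants and especially the initialization are simplified here (it just borrows Random to fill the table) and do not reproduce Knuth's or .NET's exact seeding procedure:

using System;

class SubtractiveGenerator
{
    private const int Mbig = int.MaxValue;
    private readonly int[] state = new int[56];   // circular list; slots 1..55 are used
    private int i = 0, j = 31;                    // two indices kept 31 apart -> lags 55 and 24

    public SubtractiveGenerator(int seed)
    {
        var init = new Random(seed);              // simplified fill, not Knuth's scheme
        for (int k = 1; k <= 55; k++)
            state[k] = init.Next();
    }

    public int Next()
    {
        if (++i > 55) i = 1;
        if (++j > 55) j = 1;
        int value = state[i] - state[j];          // X(n) = X(n-55) - X(n-24)
        if (value < 0) value += Mbig;             // mod m without going negative
        state[i] = value;                         // store the new number back in the list
        return value;
    }
}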
There are several implementations of this generator. Knuth offers a FORTRAN implementation in his book. I found the following code, with the following comment:
PARAMETER (MBIG=1000000000,MSEED=161803398,MZ=0,FAC=1.E-9)
According to Knuth, any large MBIG, and any smaller (but still large) MSEED can be substituted for the above values.
A little bit more can be found here. Note that this is not actually a research paper (as stated by Matt); it is just a master's thesis.
People in cryptography like to use irrational numbers (pi, e, sqrt(5)) because there is a conjecture that the digits of such numbers appear with equal frequency and thus have high entropy. You can find this related question on Security Stack Exchange to learn more about such numbers. Here is a quote:
"If the constants are chosen at random, then with high probability, no
attacker will be able to break it." But cryptographers, being a
paranoid lot, are skeptical when someone says, "Let's use this set of
constants. I picked them at random, I swear." So as a compromise,
they'll use constants like, say, the binary expansion of π. While we
no longer have the mathematical benefit of having chosen them at
random from some large pool of numbers, we can at least be more
confident there was no sabotage.
I want to write a lottery draw program which needs to randomly choose 20000 numbers from 1-2000000 range. The code is as below:
Random r = new Random(seed); // seed is a 6-digit number, e.g. 123456
int i = 0;
while (true)
{
    r.Next(2000000);
    i++;
    if (i >= 20000)
        break;
}
My questions are:
Can it ensure that every number from 1 to 2000000 has the same probability?
Is the upper bound 2000000 included by r.Next()?
Any suggestions?
The .NET Random class does a fairly good job of generating random numbers. However be aware that if you seed it with the same number you'll get the same "random" numbers each time. If you don't want this behavior don't provide a seed.
If you're after a much more random number generator than the built-in .NET one, then take a look at random.org. It's one of the best sites out there for getting true random numbers; I believe there's an API. Here's a quote from their site:
RANDOM.ORG offers true random numbers to anyone on the Internet. The randomness comes from atmospheric noise, which for many purposes is better than the pseudo-random number algorithms typically used in computer programs. People use RANDOM.ORG for holding drawings, lotteries and sweepstakes, to drive games and gambling sites, for scientific applications and for art and music. The service has existed since 1998 and was built by Dr Mads Haahr of the School of Computer Science and Statistics at Trinity College, Dublin in Ireland. Today, RANDOM.ORG is operated by Randomness and Integrity Services Ltd.
Finally, the upper bound of Random.Next() is exclusive, so the upper value you supply will never be returned. You may need to adjust your code appropriately if you want 2000000 to be included.
It includes the minValue but does not include the maxValue. Therefore if you want to generate numbers from 1 to 2000000 use:
r.Next(1,2000001)
I believe the answer to your question is implementation dependent.
The naïve method of generating a random integer in a range is to generate a random 32-bit word and then normalise it across your range.
The larger the range you're normalising over, the more the probabilities of the individual values fluctuate.
In your situation, you're normalising about 4.3 billion (2^32) inputs over 2 million outputs, so each output corresponds to either 2147 or 2148 inputs. This means the probabilities of the individual numbers in your range will differ by up to about 1 in 2000 (or 0.05%). If this slight difference in probabilities is okay for you, then go ahead.
Upper bound included?
No, the upper bound is exclusive, so you'll have to use 2000001 to include 2000000.
Any suggestion?
Let me take the liberty of suggesting not to use a while(true) / break. Simply put the condition of the if in your while statement:
Random r = new Random(seed); // seed is a 6-digit number, e.g. 123456
int i = 0;
while (i++ < 20000)
{
    r.Next(1, 2000001);
}
I know this is nitpicking, but it is a suggestion... :)
This is a mathematical problem, not programming for anything practically useful!
I want to compute factorials of very big numbers (10^n where n > 6).
I've got arbitrary precision working, which is very helpful for tasks like 1000!. But it obviously dies (StackOverflowException :)) at much higher values. I'm not looking for a direct answer, just some clues on how to proceed further.
static BigInteger factorial(BigInteger i)
{
    if (i < 1)
        return 1;
    else
        return i * factorial(i - 1);
}
static void Main(string[] args)
{
    long z = (long)Math.Pow(10, 12);
    Console.WriteLine(factorial(z));
    Console.Read();
}
Would I have to give up System.Numerics.BigInteger? I was thinking of some way of storing the necessary data in files, since RAM will obviously run out. Optimization is very important at this point. So what would you recommend?
Also, I need the values to be as precise as possible. I forgot to mention that I don't need all of these numbers, just the last 20 or so.
As other answers have shown, the recursion is easily removed. Now the question is: can you store the result in a BigInteger, or are you going to have to go to some sort of external storage?
The number of bits you need to store n! is roughly proportional to n log n. (This is a weak form of Stirling's Approximation.) So let's look at some sizes: (Note that I made some arithmetic errors in an earlier version of this post, which I am correcting here.)
(10^6)! takes order of 2 x 10^6 bytes = a few megabytes
(10^12)! takes order of 3 x 10^12 bytes = a few terabytes
(10^21)! takes order of 10^22 bytes = ten billion terabytes
A few megs will fit into memory. A few terabytes is easily within your grasp but you'll need to write a memory manager probably. Ten billion terabytes will take the combined resources of all the technology companies in the world, but it is doable.
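As a back-of-the-envelope check of those figures, here is a small sketch using the Stirling-based estimate log10(n!) ≈ n*log10(n) - n/ln(10); it only aims to reproduce the orders of magnitude above:

using System;

class FactorialSize
{
    // Approximate number of bytes needed to store n! exactly.
    static double ApproxBytes(double n)
    {
        double digits = n * Math.Log10(n) - n / Math.Log(10);  // decimal digits of n!
        double bits = digits / Math.Log10(2);                  // ~3.32 bits per decimal digit
        return bits / 8;
    }

    static void Main()
    {
        Console.WriteLine(ApproxBytes(1e6));    // ~2.3e6 bytes: a few megabytes
        Console.WriteLine(ApproxBytes(1e12));   // ~4.8e12 bytes: a few terabytes
        Console.WriteLine(ApproxBytes(1e21));   // ~8.5e21 bytes: around 10^22, ten billion terabytes
    }
}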
Now consider the computation time. Suppose we can perform a million multiplications per second per machine and that we can parallelize the work out to multiple machines somehow.
(10^6)! takes order of one second on one machine
(10^12)! takes order of 10^6 seconds on one machine =
10 days on one machine =
a few minutes on a thousand machines.
(10^21)! takes order of 10^15 seconds on one machine =
30 million years on one machine =
3 years on 10 million machines
1 day on 10 billion machines (each with a TB drive.)
So (10^6)! is within your grasp. (10^12)! you are going to have to write your own memory manager and math library, and it will take you some time to get an answer. (10^21)! you will need to organize all the resources of the world to solve this problem, but it is doable.
Or you could find another approach.
The solution is easy: Calculate the factorials without using recursion, and you won't blow out your stack.
I.e. you're not getting this error because the numbers are too large, but because you have too many levels of function calls. And fortunately, for factorials there's no reason to calculate them recursively.
Once you've solved your stack problem, you can worry about whether your number format can handle your "very big" factorials. Since you don't need the exact values, use one of the many efficient numeric approximations (which you can count on to get all of the most significant digits right). The most common one is Stirling's approximation:
n! ≈ n^n e^(-n) sqrt(2πn)
The formula is from this page, where you'll find discussion and a second, more accurate formula (although "in most cases the difference is quite small", they say). Of course this number is still too large for you to store, but now you can work with logarithms and drop the unimportant digits before you extract the number. Or use the Wikipedia version of the approximation, which is already expressed as a logarithm.
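A small sketch of working in logarithms as suggested: compute log10(n!) with Stirling's formula and report the result as mantissa x 10^exponent, so nothing huge ever has to be stored:

using System;

class StirlingDemo
{
    // log10(n!) ≈ n*log10(n/e) + 0.5*log10(2*pi*n)
    static void PrintApproxFactorial(double n)
    {
        double log10Fact = n * Math.Log10(n / Math.E) + 0.5 * Math.Log10(2 * Math.PI * n);
        double exponent = Math.Floor(log10Fact);
        double mantissa = Math.Pow(10, log10Fact - exponent);
        Console.WriteLine($"{n}! ~ {mantissa:F4} x 10^{exponent}");
    }

    static void Main()
    {
        PrintApproxFactorial(1000);   // roughly 4.02 x 10^2567 (actual: 4.0239 x 10^2567)
        PrintApproxFactorial(1e6);    // a number with about 5.6 million digits
    }
}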
Unroll the recursion:
static BigInteger factorial(BigInteger n)
{
    BigInteger res = 1;
    for (BigInteger i = 2; i <= n; ++i)
        res *= i;
    return res;
}
Since computers cannot pick random numbers (can they?), how is this random number actually generated? For example, in C# we say:
Random.Next()
What happens inside?
You may check out this article. According to the documentation, the specific implementation used in .NET is based on Donald E. Knuth's subtractive random number generator algorithm. For more information, see D. E. Knuth. "The Art of Computer Programming, volume 2: Seminumerical Algorithms". Addison-Wesley, Reading, MA, second edition, 1981.
Since computers cannot pick random numbers (can they?)
As others have noted, "Random" is actually pseudo-random. To answer your parenthetical question: yes, computers can pick truly random numbers. Doing so is much more expensive than the simple integer arithmetic of a pseudo-random number generator, and usually not required. However, there are applications where you must have non-predictable true randomness: cryptography and online poker immediately come to mind. If either uses a predictable source of pseudo-randomness, then attackers can decrypt/forge messages much more easily, and cheaters can figure out who has what in their hands.
The .NET crypto classes have methods that give random numbers suitable for cryptography or games where money is on the line. As for how they work: the literature on crypto-strength randomness is extensive; consult any good university undergrad textbook on cryptography for details.
Specialty hardware also exists to get random bits. If you need random numbers that are drawn from atmospheric noise, see www.random.org.
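To make the earlier point concrete, here is a minimal sketch of drawing a value from the .NET cryptographic RNG (RandomNumberGenerator in System.Security.Cryptography):

using System;
using System.Security.Cryptography;

class CryptoRandomDemo
{
    static void Main()
    {
        byte[] buffer = new byte[4];
        using (var rng = RandomNumberGenerator.Create())
        {
            rng.GetBytes(buffer);                     // fill with cryptographically strong bytes
        }
        int value = BitConverter.ToInt32(buffer, 0);  // naive conversion; may be negative
        Console.WriteLine(value);
    }
}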
Knuth covers the topic of randomness very well.
We don't really understand randomness well. How can something predictable be random? And yet pseudo-random sequences can appear perfectly random by statistical tests.
There are three categories of random number generators, to expand on the comment above.
First, you have pseudo random number generators where if you know the current random number, it's easy to compute the next one. This makes it easy to reverse engineer other numbers if you find out a few.
Then, there are cryptographic algorithms that make this much harder. I believe they still are pseudo random sequences (contrary to what the comment above implies), but with the very important property that knowing a few numbers in the sequence does NOT make it obvious how to compute the rest. The way it works is that crypto routines tend to hash up the number, so that if one bit changes, every bit is equally likely to change as a result.
Consider a simple modulo generator (similar to some implementations of C's rand()):
int rand() {
    return seed = seed * m + a;
}
if m=0 and a=0, this is a lousy generator with period 1: 0, 0, 0, 0, ....
if m=1 and a=1, it's also not very random looking: 0, 1, 2, 3, 4, 5, 6, ...
But if you pick m and a to be prime numbers around 2^16, this will jump around nicely, looking very random if you inspect it casually. But because both numbers are odd, you would see that the low bit toggles, i.e. the number is alternately odd and even. Not a great random number generator. And since there are only 2^32 values in a 32-bit number, by definition after at most 2^32 iterations you will repeat the sequence, making it obvious that the generator is NOT random.
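A quick sketch of that low-bit problem; m and a here are arbitrary odd constants near 2^16, chosen only for illustration:

using System;

class LowBitDemo
{
    static void Main()
    {
        // With odd m and a, the least significant bit of the state simply alternates.
        uint seed = 12345;
        const uint m = 65537, a = 65539;
        for (int n = 0; n < 8; n++)
        {
            seed = seed * m + a;             // wraps around, i.e. mod 2^32
            Console.Write((seed & 1) + " "); // prints 0 1 0 1 0 1 0 1
        }
        Console.WriteLine();
    }
}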
If you think of the middle bits as nice and scrambled, while the lower ones aren't as random, then you can construct a better random number generator out of a few of these, with the various bits XORed together so that all the bits are covered well. Something like:
(rand1() >> 8) ^ rand2() ^ (rand3() >> 5) ...
Still, every number is flipping in sync, which makes this predictable. And if you get two sequential values they are correlated, so that if you plot them you will get lines on your screen. Now imagine you have rules combining the generators, so that sequential values are not the next ones.
For example
v1 = rand1() >> 8 ^ rand2() ...
v2 = rand2() >> 8 ^ rand5() ..
and imagine that the seeds don't always advance. Now you're starting to make something that's much harder to predict based on reverse engineering, and the sequence is longer.
For example, if you compute rand1() every time, but only advance the seed in rand2() every 3rd time, a generator combining them might not repeat for far longer than the period of either one.
Now imagine that you pump your (fairly predictable) modulo-type random number generator through DES or some other encryption algorithm. That will scramble up the bits.
Obviously, there are better algorithms, but this gives you an idea. Numerical Recipes has a lot of algorithms implemented in code and explained. One very good trick: generate not one but a block of random values in a table. Then use an independent random number generator to pick one of the generated numbers, generate a new one and replace it. This breaks up any correlation between adjacent pairs of numbers.
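A small sketch of that table trick (often called a Bays-Durham shuffle); the two Random instances here just stand in for two independent generators:

using System;

class ShuffledGenerator
{
    private readonly Random source = new Random(1);   // fills the table
    private readonly Random picker = new Random(2);   // chooses which slot to return
    private readonly int[] table = new int[64];

    public ShuffledGenerator()
    {
        for (int i = 0; i < table.Length; i++)
            table[i] = source.Next();                 // pre-generate a block of values
    }

    public int Next()
    {
        int slot = picker.Next(table.Length);         // pick a slot at random
        int value = table[slot];                      // return its current value
        table[slot] = source.Next();                  // refill the slot with a fresh value
        return value;
    }
}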
The third category is actual hardware-based random number generators, for example based on atmospheric noise
http://www.random.org/randomness/
This is, according to current science, truly random. Perhaps someday we will discover that it obeys some underlying rule, but currently, we cannot predict these values, and they are "truly" random as far as we are concerned.
The Boost library has excellent C++ implementations of lagged Fibonacci generators, the reigning kings of pseudo-random sequences, if you want to see some source code.
I'll just add an answer to the first part of the question (the "can they?" part).
Computers can generate (well, "generate" may not be an entirely accurate word) random numbers (as in, not pseudo-random). Specifically, they do so by using environmental randomness obtained through specialized hardware devices (which generate randomness from noise, for example) or by using environmental inputs (e.g. hard disk timings, user input event timings).
However, that has no bearing on the second question (which was how Random.Next() works).
The Random class is a pseudo-random number generator.
It is basically an extremely long but deterministic repeating sequence. The "randomness" comes from starting at different positions. Specifying where to start is done by choosing a seed for the random number generator and can for example be done by using the system time or by getting a random seed from another random source. The default Random constructor uses the system time as a seed.
The actual algorithm used to generate the sequence of numbers is documented in MSDN:
The current implementation of the Random class is based on Donald E. Knuth's subtractive random number generator algorithm. For more information, see D. E. Knuth. "The Art of Computer Programming, volume 2: Seminumerical Algorithms". Addison-Wesley, Reading, MA, second edition, 1981.
Computers use pseudorandom number generators. Essentially, they work by starting with a seed number and iterating it through an algorithm each time a new pseudorandom number is required.
The process is of course entirely deterministic, so a given seed will generate exactly the same sequence of numbers every time it is used, but the numbers generated form a statistically uniform distribution (approximately), and this is fine, since in most scenarios all you need is stochastic randomness.
The usual practice is to use the current system time as a seed, though if more security is required, "entropy" may be gathered from a physical source such as disk latency in order to generate a seed that is more difficult to predict. In this case, you'd also want to use a cryptographically strong random number generator such as this.
I don't know the details, but what I know is that a seed is used to generate the random numbers; a new number is then obtained from some algorithm that uses that seed.
If you generate random numbers from the same seed, you will get the same numbers each time.
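A tiny demonstration of that determinism: two Random instances created with the same seed produce identical sequences.

using System;

class SeedDemo
{
    static void Main()
    {
        var a = new Random(42);
        var b = new Random(42);
        for (int i = 0; i < 5; i++)
            Console.WriteLine($"{a.Next()} == {b.Next()}");  // the two values always match
    }
}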