The following LINQPad code generates a random sequence of unique integers from 0 to N and calculates the length of the cycle for every integer starting from 0. To calculate the cycle length for a given integer, it reads the value from the boxes array at the index equal to that integer, then reads from the index equal to that value, and so on. The process stops when the value read from the array is equal to the original integer we started with. The number of iterations spent calculating the length of every cycle is saved into a Dictionary.
const int count = 100;
var random = new Random();
var boxes = Enumerable.Range(0, count).OrderBy(x => random.Next(0, count - 1)).ToArray();
string.Join(", ", boxes.Select(x => x.ToString())).Dump("Boxes");
var stats = Enumerable.Range(0, count).ToDictionary(x => x, x => {
    var iterations = 0;
    var ind = x;
    while (boxes[ind] != x)
    {
        ind = boxes[ind];
        iterations++;
    }
    return iterations;
});
stats.GroupBy(x => x.Value).Select(x => new {x.Key, Count = x.Count()}).OrderBy(x => x.Key).Dump("Stats");
stats.Sum(x => x.Value).Dump("Total Iterations");
A typical result looks as follows:
The results I am getting seem weird to me:
The lengths of all cycles can be grouped into only a few buckets (usually 3 to 7). I was hoping to see more distinct buckets.
The number of elements in every bucket usually grows with the bucket value it belongs to. I was hoping it would be more random.
I have tried several different randomization functions, such as .NET's Random and RandomNumberGenerator classes, as well as random data generated from random.org. All of them seem to produce similar results.
Am I doing something wrong? Are these results expected from a mathematical point of view? Or, perhaps, does the pseudo-random nature of the functions I used have side effects?
What you are doing is generating a random permutation of size count. Then you check the properties of the permutation. If your random number generator is good, then you should observe the statistics of random permutations.
The average number of cycles of length k is 1/k, for k <= count. On average, there is 1 fixed point, 1/2 cycles of length 2, 1/3 cycles of length 3, etc. The average number of cycles of any length is therefore 1 + 1/2 + 1/3 + ... + 1/count ~ ln(count) + gamma. There are a lot of neat properties of the distribution of the number of cycles. Very occasionally there are many cycles, but the average value of 2^(number of cycles) is count + 1.
Your buckets correspond to the number of different cycle lengths, which is at most the number of cycles, but might be lower because of repeated cycle lengths. On average, few cycle lengths are repeated. Even as the count increases to infinity, and the average number of cycles increases to infinity, the average number of repeated cycle lengths stays finite.
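If you want to check these cycle statistics directly, a small LINQPad-style sketch (my own illustration, not from the question) could build a uniform permutation with a Fisher-Yates shuffle and tally the cycle lengths:

const int count = 100;
var random = new Random();

// Fisher-Yates shuffle gives a uniformly distributed permutation.
var perm = Enumerable.Range(0, count).ToArray();
for (int i = count - 1; i > 0; i--)
{
    int j = random.Next(i + 1);
    (perm[i], perm[j]) = (perm[j], perm[i]);
}

// Walk each unvisited element around its cycle and record the cycle length.
var cycleLengths = new List<int>();
var visited = new bool[count];
for (int start = 0; start < count; start++)
{
    if (visited[start]) continue;
    int length = 0, ind = start;
    do { visited[ind] = true; ind = perm[ind]; length++; } while (ind != start);
    cycleLengths.Add(length);
}

cycleLengths.GroupBy(x => x)
            .OrderBy(g => g.Key)
            .Select(g => new { Length = g.Key, Cycles = g.Count() })
            .Dump("Cycle lengths");

Averaging the per-length counts over many runs should approach the 1/k figures above.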
A permutation test in statistics, usually given as an example of bootstrapping, analyzes some types of data by viewing them as a permutation. For example, you might observe two quantities, x_i and y_i. You get a permutation by sorting the xs and ys and taking the index of the y value paired with the kth x value. Then you compare statistics of this permutation with the properties of random permutations. This doesn't assume much about the underlying distributions, but it can still detect when x and y seem to be related. So it's useful to know what to expect from random permutations.
Related
I have a large set of n integers. I need to choose k elements from that set, examining candidate subsets in order of their sum, from largest to smallest. Every subset chosen has to be valid according to some rule (the rule itself doesn't matter). I want to find the largest-summed subset that is also valid.
For instance, using a small set of numbers (10, 7, 5, 3, 0), a subset size of 3, and a simple rule (sum must be prime), the code would look at:
10, 7, 5 = 22 -> NOT PRIME, KEEP GOING
10, 7, 3 = 20 -> NOT PRIME, KEEP GOING
10, 5, 3 = 18 -> NOT PRIME, KEEP GOING
10, 7, 0 = 17 -> PRIME, STOP
I know I could just put EVERY combination in a list, order it descending, and then work my way down until a sum passes the test, but that seems hugely inefficient in both space and time, especially if I have a set of size like 100 and a subset size of 8. That's like 186 billion combinations that I'd have to calculate.
Is there a way to just do this in a simple loop where I start at the biggest sum, check it for validity, then calculate the next largest possible sum and check that for validity, and so on? Something like:
// Assuming set is ordered, this is the largest possible sum given the subset_size
int sum = set.Take(subset_size).Sum();
while (!IsValid(sum))
{
    sum = NextLargest(set, subset_size, sum);
}
bool IsValid (int sum)
{
    return sum % 2 == 0;
}
int NextLargest (int[] set, int subset_size, int current_sum)
{
    // Find the next largest sum here as efficiently as possible
}
You don't need to look at every combination, only the ones that sum to a larger number.
Iterate over the set in descending order and check the sum. Keep track of the largest valid sum found so far. When a larger sum is impossible, break out of the loop. For example, given subset size 5, you found a valid sum 53. At some point, you are considering a subset that starts with 10. Since the numbers are in descending order, the largest sum you can get at this point is 50. So this path can be abandoned. This should significantly trim down your solution space.
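A rough sketch of that pruning (my own code; IsValid is a placeholder for whatever rule you actually have, and the set is assumed to fit in memory as an int array) could look like this:

int? BestValidSum(int[] set, int subsetSize)
{
    Array.Sort(set, (a, b) => b.CompareTo(a));   // descending order
    int? best = null;
    Search(set, subsetSize, 0, 0, 0, ref best);
    return best;
}

void Search(int[] set, int subsetSize, int start, int taken, int sum, ref int? best)
{
    if (taken == subsetSize)
    {
        if (IsValid(sum) && (best == null || sum > best)) best = sum;
        return;
    }
    for (int i = start; i <= set.Length - (subsetSize - taken); i++)
    {
        // Upper bound: current sum plus the largest values still available from here.
        int bound = sum;
        for (int j = 0; j < subsetSize - taken; j++) bound += set[i + j];
        if (best != null && bound <= best) break;   // a larger valid sum is impossible on this path

        Search(set, subsetSize, i + 1, taken + 1, sum + set[i], ref best);
    }
}

bool IsValid(int sum) => sum % 2 == 0;   // placeholder rule, as in the question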
I got 9 numbers which I want to divide into two lists, and both lists need to reach a certain amount when summed up. For example, I got a list of ints:
List<int> test = new List<int>
{
    1963000, 1963000, 393000, 86000,
    393000, 393000, 176000, 420000,
    3193000
};
And I want to have 2 lists of numbers that when you sum them up, they both reach over 4 million.
It doesn't matter if the 2 lists don't have the same amount of numbers. If one list only needs 2 numbers to reach 4 million, and the other 7 numbers together reach 7 million, that is fine.
As long as both lists summed up are equal to 4 million or higher.
Is this certain sum low enough to be reached easily?
If yes, then your algorithm may be as simple as: iterate i from 1 to the number of items and sum up the first i numbers. If the sum is higher than your certain sum (e.g. 4 million), you are finished (the remaining numbers form the second list); otherwise increment i.
BUT: if your certain sums are high and the partition is not so trivial to find, then you have the famous Partition Problem (https://en.wikipedia.org/wiki/Partition_problem). This is not that simple, but there are algorithms for it. Read the Wikipedia article or try to google "Partition problem solution" or similar.
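For the easy case, a minimal sketch of that simple approach (my own code, reusing the example list and the 4 million target from the question) might be:

List<int> test = new List<int>
{
    1963000, 1963000, 393000, 86000,
    393000, 393000, 176000, 420000,
    3193000
};
long target = 4000000;

// Fill the first list until its sum passes the target; the rest goes into the second list.
var first = new List<int>();
var second = new List<int>();
long sum = 0;
foreach (int n in test)
{
    if (sum < target) { first.Add(n); sum += n; }
    else second.Add(n);
}

// Both lists must reach the target, otherwise this simple approach is not enough.
bool ok = sum >= target && second.Sum(x => (long)x) >= target;

With the example numbers, the first list ends up at 4,319,000 and the second at 4,661,000, so ok is true.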
I want to generate a random number from a to b. The problem is, the number has to be given with exponential distribution.
Here's my code:
public double getDouble(double low, double high)
{
    double r;
    (..some stuff..)
    r = rand.NextDouble();
    if (r == 0) r += 0.00001;
    return (1 / -0.9) * Math.Log(1 - r) * (high - low) + low;
}
The problem is that (1 / -0.9) * Math.Log(1 - r) is not between 0 and 1, so the result won't be between a and b. Can someone help? Thanks in advance!
I misunderstood your question in the first answer :) You are already using inversion sampling.
To map a range into another range, there is a typical mathematical approach:
f(x) = (b-a)(x - min)/(max-min) + a
where
b = upper bound of target
a = lower bound of target
min = lower bound of source
max = upper bound of source
x = the value to map
(this is linear scaling, so the distribution would be preserved)
(You can verify: If you put in min for x, it results in a, if you put in max for x, you'll get b.)
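For illustration only, the scaling formula as a small helper (my own code, not part of the answer):

// Linear rescale of x from [min, max] onto [a, b].
static double Map(double x, double min, double max, double a, double b)
    => (b - a) * (x - min) / (max - min) + a;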
The problem now: the exponential distribution has a maximum value of inf. So you cannot use this equation, because the result would always be whatever / inf + 0, i.e. 0. (Which makes sense mathematically, but of course does not fit your needs.)
So the ONLY correct answer is: there is no exponential distribution possible between two fixed numbers, because you can't map [0,inf] -> [a,b].
Therefore you need some sort of trade-off, to make your result as exponential as possible.
I wrapped my head around different possibilities out of curiosity and found that you simply can't beat the maths on this :P
However, I did some tests with Excel and 1.4 million random records:
I picked a random number as the "limit" (10) and rounded the computed result to 1 decimal place (0, 0.1, 0.2 and so on). This number I used to perform the linear transformation with a maximum of 10, ignoring any result greater than 1.
Out of 1.4 million computations (regenerated 10-20 times), only 7-10 random numbers greater than 1 were generated:
(Probability density function after mapping the values: column 100 := 1, column 0 := 0)
So:
Map the values to [0,1] using the linear approach mentioned above, assuming a maximum of 10 for the transformation.
If you encounter a value > 1 after the transformation - just draw another random number, until the value is < 1.
With only 7-10 occurrences out of 1.4 million tests, this should be close enough, since the re-drawn number will again be pseudo-exponentially distributed.
If you want to build a spaceship whose navigation depends on perfectly exponentially distributed numbers between 0 and 1 - don't do this; otherwise you should be good.
(If you want to cheat a bit: if you encounter a number > 1, just find the value that has the biggest variance from its expected value (i.e. max(occurrences < expected occurrences)) and assume that value :P )
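A rough C# sketch of the redraw idea above (my own code; the cutoff of 10 mirrors the limit assumed in the Excel experiment, and a rate of 1 is assumed):

static readonly Random Rng = new Random();

static double PseudoExponential(double low, double high, double cutoff = 10.0)
{
    double mapped;
    do
    {
        double r = Rng.NextDouble();
        double e = -Math.Log(1 - r);   // standard exponential draw (rate 1)
        mapped = e / cutoff;           // linear map of [0, cutoff] onto [0, 1]
    } while (mapped > 1);              // redraw the rare values beyond the cutoff
    return low + (high - low) * mapped;
}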
Since the support for the exponential distribution is 0 to infinity, regardless of the rate, I'm going to assume that you're asking for an exponential that's truncated below a and above b. Another way of expressing this would be an exponential random variable X conditioned on a <= X <= b.
You can derive the inversion algorithm for this by calculating the cumulative distribution function (CDF) of the truncated distribution as the integral from a to x of the density for your exponential. Scale the result by the area between a and b (which is F(b) - F(a) where F(x) is the CDF of the original exponential distribution) to make it a valid distribution with an area of 1. Set the derived CDF to U, a uniform(0,1) random number, and solve for X to get the inversion.
I don't program C#, but here's the result expressed in Ruby. It should translate pretty transparently.
def exp_in_range(a, b, rate = 1.0)
  exp_rate_a = Math.exp(-rate * a)
  return -Math.log(exp_rate_a - rand * (exp_rate_a - Math.exp(-rate * b))) / rate
end
I put a default rate of 1.0 since you didn't specify, but clearly you can override that. rand is Ruby's built-in uniform generator. I think the rest is pretty self-explanatory. I cranked out several test sets of 100k observations for a variety of (a,b) values, loaded the results into my favorite stats package, and the results are as expected.
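For a C# version, a fairly direct translation of the Ruby function (names are my own choice) could be:

static readonly Random Rng = new Random();

// Inverse CDF of an exponential distribution truncated to [a, b].
static double ExpInRange(double a, double b, double rate = 1.0)
{
    double expRateA = Math.Exp(-rate * a);
    double expRateB = Math.Exp(-rate * b);
    return -Math.Log(expRateA - Rng.NextDouble() * (expRateA - expRateB)) / rate;
}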
The exponential distribution is not limited on the positive side, so values can go from 0 to inf. There are many ways to scale [0,infinity] to some finite interval, but the result would not be exponentially distributed.
If you just want a slice of the exponential distribution between a and b, you could simply draw r from [ra, rb] such that -log(1-ra) = a and -log(1-rb) = b, i.e.
r=rand.NextDouble(); // assume this is between 0 and 1
ra=1-Math.Exp(-a);
rb=1-Math.Exp(-b);
rbound=ra+(rb-ra)*r;
return -Math.Log(1 - rbound);
Why check for r==0? I think you would want to check that the argument of the log is >0, so check for r (or rbound in this case) ==1.
It is also not clear why the (1 / -0.9) factor is there.
I've been thinking about how to implement something that, frankly, is beyond my mathematical skills. So here goes: feel free to point me in the right direction rather than give complete code solutions; any help would be gratefully received.
So, imagine I've done an analysis of text and generated a table of the frequencies of different two-character combinations. I've stored these in a 26x26 array.
eg.
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
A 1 15 (frequency of AA, then frequency of AB etc.)
B 12 0 (freq of BA, BB etc..)
... etc.
So I want to randomly choose these two-character combinations, but I'd like to 'weight' my choice based on the frequency, i.e. the AB from above should be 15 times 'more likely' than AA. And, obviously, the selection should never return something like BB (i.e. a frequency of 0 - in this example only; obviously BB does occur in words like Bubble!! :-) ). For the 0 case I realise I could loop until I get a non-zero frequency, but that's just not elegant, because I have a feeling/intuition that there is a way to skew my average.
I was thinking that to choose the first char of my pair - i.e. the row - (I'm ultimately generating a 4-pair sequence) I could just use the system random function (Random.Next), and then use the 'weighted' random algorithm to pick the second char.
Any ideas?
Given your example sample, I would first create a cumulative series of all of the numbers (1, 15, 12, 0 => 1, 16, 28, 28).
Then I would produce a random number between 0 and 27 (let's say 19).
Then I would calculate that 19 was >=16 but <28, giving me bucket 3 (BA).
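A small C# sketch of that cumulative-bucket lookup (my own code, using the four example weights from above):

var random = new Random();
int[] weights = { 1, 15, 12, 0 };              // e.g. AA, AB, BA, BB from the example

int total = weights.Sum();
int r = random.Next(total);                    // uniform in [0, total - 1]

int index = 0, cumulative = 0;
while (r >= cumulative + weights[index])       // zero-weight buckets are skipped automatically
{
    cumulative += weights[index];
    index++;
}
// 'index' is now the selected bucket; here 0 => AA, 1 => AB, 2 => BA, 3 => BB.

With r = 19, the loop stops at index 2 (BA), matching the worked example above.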
There are some good suggestions in the other answers for your specific problem. To solve the general problem of "I have a source of random numbers conforming to a uniform probability distribution, but I would like it to conform to a given nonuniform probability distribution", then you can work out the quantile function, which is the function that performs that transformation. I give a gentle introduction that explains why the quantile function is the function you want here:
Generating Random Non-Uniform Data In C#
How about summing all the frequencies and using one random number over that whole range, from AA to ZZ, to generate your pair.
Let's say you have the total frequency of all pairs: if the rnd returns 0 you get AA, if it returns 1-15 then it's AB, etc.
Use your frequency matrix to generate a complete set of values. Order the set by Random.Next(). Store the randomized set in an array. Then you can just select an element out of that array based on Random.Next(randomarray.Length).
If there is a mathematical way to calculate the frequency you could do that as well. But creating a precompiled and cached set will reduce the calculation time if this is called repeatedly.
As a note, depending on the max frequency this could require a good amount of storage. You would also want to create the instance of Random before the loop that builds the set, so that you don't reseed the random generator.
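A rough sketch of that precomputed, cached set (my own code; GetFrequencies() is a hypothetical helper returning the 26x26 frequency matrix):

var random = new Random();                 // created once, outside any loops
int[,] freq = GetFrequencies();            // hypothetical 26x26 frequency matrix

// Every pair appears in the expanded list as many times as its frequency,
// so a uniform index pick is automatically weighted (zero-frequency pairs never appear).
var expanded = new List<string>();
for (int row = 0; row < 26; row++)
    for (int col = 0; col < 26; col++)
        for (int n = 0; n < freq[row, col]; n++)
            expanded.Add($"{(char)('A' + row)}{(char)('A' + col)}");

var cached = expanded.OrderBy(_ => random.Next()).ToArray();   // shuffled once, then reused
string pair = cached[random.Next(cached.Length)];              // weighted pick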
...
Another way (similar to what you suggested at the end of your question) would be to do this in two passes, with the first selecting the row and the second using your weighted frequency to select the column. That would just be the sum of the row frequencies bounded over ranges. The first suggestion should give a more even distribution based on weight.
Take the sum of the probabilities. Take a random number between zero and that sum. Add up the probabilities until the running total is greater than your random number, then use the item you're on.
Eg pseudocode:
b = getProbabilities()
s = sum(b)
r = randomInt() % s
i = 0
acc = b[0]
while (acc <= r) {
    i++
    acc += b[i]
}
return i
If efficiency is not a problem, you could create a key->value hash instead of an array. An upside of this would be that (if you format it well in the text) it would be very easy to update the values should the need arise. Something like
{
AA => 5, AB => 2, AC => 4,
BA => 6, BB => 5, BC => 9,
CA => 2, CB => 7, CC => 8
}
With this, you could easily retrieve the value for the sequence you want, and quickly find the entry to update. If the table is automatically generated and extremely large, it could help to get/be familiar with vim's use of regular expressions.
Looking at another question of mine I realized that, technically, there is nothing preventing this algorithm from running for an infinite period of time (i.e. it never returns), because rand.Next(1, 100000) could theoretically keep generating values that are already in the list.
Out of curiosity; how would I calculate the probability of this happening? I assume it would be very small?
Code from other question:
Random rand = new Random();
List<Int32> result = new List<Int32>();
for (Int32 i = 0; i < 300; i++)
{
    Int32 curValue = rand.Next(1, 100000);
    while (result.Exists(value => value == curValue))
    {
        curValue = rand.Next(1, 100000);
    }
    result.Add(curValue);
}
On ONE given draw of a random number, the probability of repeating a value already in the result list is
P(Collision) = i * 1/100000 where i is the number of values in the list.
That is because all 100,000 possible numbers are assumed to have the same probability of being drawn (assumption of a uniform distribution) and the drawing of any number is independent from that of drawing any other number.
The probability of experiencing such a "collision" with the numbers from the list several times in a row is
P(n Collisions) = P(Collision) ^ n
where n is the number of times a collision happens
That is because the drawings are independent.
Numerically...
when the list is half full, i = 150 and
P(Collision) = 0.15% = 0.0015 and
P(2 Collisions) = 0.00000225
P(3 Collisions) = 0.000000003375
P(4 Collisions) = 0.0000000000050625
when the list is all full but for the last one, i = 299 and
P(Collision) = 0.299% = 0.00299 and
P(2 Collisions) = 0.0000089401 (approx)
P(3 Collisions) = 0.00000002673 (approx)
P(4 Collisions) = 0.000000000079925 (approx)
You are therefore right to assume that the probability of having to draw multiple times for finding the next suitable value to add to the array is very small, and should therefore not impact the overall performance of the snippet. Beware that there will be a few retries (statistically speaking), but the total number of retries will be small compared to 300.
If however the total number of items desired in the list were to increase much, or if the range of random numbers sought were to be reduced, P(Collision) would not be so small and hence the number of "retries" needed would grow accordingly. That is why other algorithms exist for drawing multiple values without replacement; most are based on the idea of using the random number as an index into an array of all the remaining values.
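For example, a minimal sketch of that draw-without-replacement idea (my own code, matched to the 300-values setup in the question; rand.Next(1, 100000) yields values 1..99999):

Random rand = new Random();
int[] pool = Enumerable.Range(1, 99999).ToArray();   // all candidate values
List<int> result = new List<int>();

for (int i = 0; i < 300; i++)
{
    int j = rand.Next(i, pool.Length);         // pick an index among the values not taken yet
    (pool[i], pool[j]) = (pool[j], pool[i]);   // swap it into the "taken" prefix
    result.Add(pool[i]);
}

This is a partial Fisher-Yates shuffle, so no retries are ever needed.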
Assuming a uniform distribution (not a bad assumption, I believe) the chance of getting the number n times in a row is (0.00001)^n.
It's quite possible for a PRNG to generate the same number in a limited range in consecutive calls. The probability would be a function of the bit-size of the raw PRNG and the method used to reduce that size to the numeric range you want (in this case 1 - 100000).
To answer your question exactly: no, it isn't just very small; the probability of it going on for an infinite period of time "is" 0. I say "is" because it actually tends to 0 as the number of iterations tends to infinity.
As bdares said, it will tend to 0 as (1/range)^n, with n being the number of iterations, if we can assume a uniform distribution (this says we kinda can).
This program will not halt if:
A random number is picked that is in the result set
That number generates a cycle (i.e. a loop) in the random number generator's algorithm (they all do)
All numbers in the loop are already in the result set
All random number generators eventually loop back on themselves, due to the limited number of integers possible ==> for 32-bit, only 2^32 possible values.
"Good" generators have very large loops. "Poor" algorithms yield short loops for certain values. Consult Knuth's The Art of Computer Programming for random number generators. It is a fascinating read.
Now, assume there is a cycle of (n) numbers. For your program, which loops 300 times, that means (n) <= 300. Also, the number of attempts you make before you hit a number in this cycle, plus the length of the cycle, must not be greater than 300. Therefore, if you hit the cycle on the first try, the cycle can be 300 long; if you hit it on the second try, it can only be 299 long.
Assuming that most random number generation algorithms have reasonably-flat probability distribution, the probability of hitting a 300-cycle the first time is (300/2^32), multiplied by the probability of having a 300-cycle (this depends on the rand algorithm), plus the probability of hitting a 299-cycle the first time (299/2^32) x probability of having a 299-cycle, etc. And so on and so forth. Then add up the second try, third try, all the way up to the 300-th try (which can only be a 1-cycle).
Now this is assuming that any number can take on the full 2^32 generator space. If you are limiting it to 100000 only, then in essence you increase the chance of having much shorter cycles, because multiple numbers (in the 2^32 space) can map to the same number in "real" 100000 space.
In reality, most random generator algorithms have minimum cycle lengths of > 300. A random generator implementation based on the simplest LCG (linear congruential generator, wikipedia) can have a "full period" (i.e. 2^32) with the correct choice of parameters. So it is safe to say that minimum cycle lengths are definitely > 300. If this is the case, then it depends on the mapping algorithm of the generator to map 2^32 numbers into 100000 numbers. Good mappers will not create 300-cycles, poor mappers may create short cycles.