I've been thinking about how to implement something that, frankly, is beyond my mathematical skills. So here goes - feel free to point me in the right direction rather than give complete code solutions; I'd be grateful for any help.
So, imagine I've done an analysis of text and generated a table of the frequencies of different two-character combinations. I've stored these in a 26x26 array.
eg.
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
A 1 15 (frequency of AA, then frequency of AB etc.)
B 12 0 (freq of BA, BB etc..)
... etc.
So I want to randomly choose these two-character combinations but I'd like to 'weight' my choice based on the frequency. ie. the AB from above should be 15 times 'more likely' than AA. And, obviously, the selection should never return something like BB (ie. a frequency of 0 - in this example, obviously BB does occur in words like Bubble!! :-) ). For the 0 case I realise I could loop until I get a non-0 frequency but that's just not elegant because I have a feeling/intuition that there is a way to skew my average.
I was thinking that to choose the first char of my pair - ie. the row - (I'm generating a 4-pair sequence ultimately) I could just use the system random function (the Random class's Next method), then use the 'weighted' random algorithm to pick the second char.
Any ideas?
Given your example sample, I would first create a cumulative series of all of the numbers (1, 15, 12, 0 => 1, 16, 28, 28).
Then I would produce a random number between 0 and 27 (let's say 19).
Then I would calculate that 19 was >=16 but <28, giving me bucket 3 (BA).
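In C#, a minimal sketch of that cumulative-bucket idea might look like this (the method name and the assumption that one row of the table is passed in as an int[] are mine):

static int PickWeighted(int[] frequencies, Random rand)
{
    int total = 0;
    foreach (int f in frequencies) total += f;   // e.g. 1 + 15 + 12 + 0 = 28

    int roll = rand.Next(total);                 // uniform in [0, total)
    int cumulative = 0;
    for (int i = 0; i < frequencies.Length; i++)
    {
        cumulative += frequencies[i];
        if (roll < cumulative)
            return i;                            // zero-frequency buckets can never win
    }
    throw new InvalidOperationException("All frequencies are zero.");
}

With the sample row above, a roll of 19 falls in the third bucket (BA), exactly as described.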
There are some good suggestions in the other answers for your specific problem. To solve the general problem of "I have a source of random numbers conforming to a uniform probability distribution, but I would like it to conform to a given nonuniform probability distribution", you can work out the quantile function, which is the function that performs that transformation. I give a gentle introduction that explains why the quantile function is the function you want here:
Generating Random Non-Uniform Data In C#
How about summing all the frequencies from AA to ZZ and using that total to generate your pair?
Let's say you have the total frequency of all pairs: if the random number returns 0 you get AA, if it returns 1-15 then it's AB, and so on.
Use your frequency matrix to generate a complete set of values. Order the set by Random.Next(). Store the randomized set in an array. Then you can just select an element out of that array based on Random.Next(randomarray.Length).
If there is a mathematical way to calculate the frequency you could do that as well, but creating a precomputed and cached set will reduce the calculation time if this is called repeatedly.
As a note, depending on the max frequency this could require a good amount of storage. You would also want to create the instance of random before you loop to build the set. This is so you don't reseed the random generator.
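A small sketch of that precomputed-and-shuffled set, assuming the matrix is a 26x26 int[,] called freq (requires System.Linq for OrderBy):

var rand = new Random();                         // created once, so it isn't reseeded
var pool = new List<string>();

// Expand the 26x26 matrix: each pair appears as many times as its frequency.
for (int row = 0; row < 26; row++)
    for (int col = 0; col < 26; col++)
        for (int n = 0; n < freq[row, col]; n++)
            pool.Add($"{(char)('A' + row)}{(char)('A' + col)}");

// Shuffle once and cache; zero-frequency pairs simply never appear.
string[] randomized = pool.OrderBy(_ => rand.Next()).ToArray();

string pair = randomized[rand.Next(randomized.Length)];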
...
Another way (similar to what you suggested at the end of your question) would be to do this in two passes, with the first pass selecting the row and the second using your weighted frequencies to select the column. The row pick would just use the row-total frequencies as the weights. The first suggestion should give a more even distribution based on weight.
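The two-pass variant might look roughly like this, reusing a weighted pick like the PickWeighted sketch above (freq and rand as assumed before):

// First pass: pick the row (first char), weighted by each row's total frequency.
int[] rowTotals = new int[26];
for (int row = 0; row < 26; row++)
    for (int col = 0; col < 26; col++)
        rowTotals[row] += freq[row, col];
int firstChar = PickWeighted(rowTotals, rand);

// Second pass: pick the column (second char), weighted within the chosen row.
int[] rowFrequencies = new int[26];
for (int col = 0; col < 26; col++)
    rowFrequencies[col] = freq[firstChar, col];
int secondChar = PickWeighted(rowFrequencies, rand);

string pair = $"{(char)('A' + firstChar)}{(char)('A' + secondChar)}";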
Take the sum of the probabilities. Take a random number between zero and that sum. Add up the probabilities until the running total is greater than your random number. Then use the item you're on.
Eg pseudocode:
b = getProbabilites()
s = sum(b)
r = randomInt() % s
i = 0
acc = b[0]
while (acc <= r) {
    i++
    acc += b[i]
}
return i
If efficiency is not a problem, you could create a key->value hash instead of an array. An upside of this would be that (if you format it well in the text) it would be very easy to update the values should the need arise. Something like
{
AA => 5, AB => 2, AC => 4,
BA => 6, BB => 5, BC => 9,
CA => 2, CB => 7, CC => 8
}
With this, you could easily retrieve the value for the sequence you want, and quickly find the entry to update. If the table is automatically generated and extremely large, it could help to be familiar with vim's regular expressions.
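In C#, the same structure could be a Dictionary; a minimal sketch using the example values above:

var frequencies = new Dictionary<string, int>
{
    ["AA"] = 5, ["AB"] = 2, ["AC"] = 4,
    ["BA"] = 6, ["BB"] = 5, ["BC"] = 9,
    ["CA"] = 2, ["CB"] = 7, ["CC"] = 8,
};

int abCount = frequencies["AB"];    // quick lookup for a given pair
frequencies["AB"] = abCount + 1;    // and just as easy to update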
Related
I have a large set of n integers. I need to examine the k-element subsets of that set in order from largest sum to smallest. Every subset chosen has to be checked against some validity rule (the rule itself doesn't matter). I want to find the largest-summed subset that is also valid.
For instance, using a small set of numbers (10, 7, 5, 3, 0), a subset size of 3, and a simple rule (sum must be prime), the code would look at:
10, 7, 5 = 22 -> NOT PRIME, KEEP GOING
10, 7, 3 = 20 -> NOT PRIME, KEEP GOING
10, 5, 3 = 18 -> NOT PRIME, KEEP GOING
10, 7, 0 = 17 -> PRIME, STOP
I know I could just put EVERY combination in a list, order it descending, and then work my way down until a sum passes the test, but that seems hugely inefficient in both space and time, especially if I have a set of size like 100 and a subset size of 8. That's like 186 billion combinations that I'd have to calculate.
Is there a way to just do this in a simple loop where I start at the biggest sum check for validity, and then calculate and go to the next largest possible sum and check for validity, etc.? Something like:
// Assuming set is ordered descending, this is the largest possible sum given the subset_size
int sum = set.Take(subset_size).Sum();
while (!IsValid(sum))
{
    sum = NextLargest(set, subset_size, sum);
}

bool IsValid (int sum)
{
    // Placeholder rule; the real rule doesn't matter here
    return sum % 2 == 0;
}

int NextLargest (int[] set, int subset_size, int current_sum)
{
    // Find the next largest sum here as efficiently as possible
}
You don't need to look at every combination, only the ones whose sum could still beat the best valid sum found so far.
Iterate over the set in descending order and check the sum. Keep track of the largest valid sum found so far. When a larger sum is impossible, break out of the loop. For example, given subset size 5, you found a valid sum 53. At some point, you are considering a subset that starts with 10. Since the numbers are in descending order, the largest sum you can get at this point is 50. So this path can be abandoned. This should significantly trim down your solution space.
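A rough recursive sketch of that pruning idea (the method names are mine, and IsValid stands in for whatever the real rule is):

// Depth-first over k-element subsets of a descending-sorted set, pruning any branch
// whose best possible remaining sum cannot beat the best valid sum found so far.
static int? FindBestValidSum(int[] sortedDesc, int k)
{
    int? best = null;

    void Search(int start, int taken, int sumSoFar)
    {
        if (taken == k)
        {
            if (IsValid(sumSoFar) && (best == null || sumSoFar > best))
                best = sumSoFar;
            return;
        }
        for (int i = start; i <= sortedDesc.Length - (k - taken); i++)
        {
            // Upper bound: even taking this (largest remaining) value for every open slot
            // cannot beat the best valid sum, so nothing further along this row can either.
            int bound = sumSoFar + sortedDesc[i] * (k - taken);
            if (best != null && bound <= best)
                break;
            Search(i + 1, taken + 1, sumSoFar + sortedDesc[i]);
        }
    }

    Search(0, 0, 0);
    return best;
}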
I have a list of entities, and for the purpose of analysis, an entity can be in one of three states. Of course I wish it was only two states, then I could represent that with a bool.
In most cases there will be a list of entities where the size of the list is usually 100 < n < 500.
I am working on analyzing the effects of the combinations of the entities and the states.
So if I have 1 entity, then I can have 3 combinations. If I have two entities, I can have nine combinations, and so on.
Because of the amount of combinations, brute forcing this will be impractical (it needs to run on a single system). My task is to find good-but-not-necessarily-optimal solutions that could work. I don't need to test all possible permutations, I just need to find one that works. That is an implementation detail.
What I do need to do is to register the combinations possible for my current data set - this is basically to avoid duplicating the work of analyzing each combination. Every time a process arrives at a certain configuration of combinations, it needs to check if that combo is already being worked on or if it was resolved in the past.
So if I have x amount of tri-state values, what is an efficient way of storing and comparing this in memory? I realize there will be limitations here. Just trying to be as efficient as possible.
I can't think of a more effective unit of storage than two bits, where one of the four "bit states" is not used. But I don't know how to make this efficient. Do I need to make a choice between optimizing for storage size or for performance?
How can something like this be modeled in C# in a way that wastes the least amount of resources and still performs relatively well when a process needs to ask "Has this particular combination of tri-state values already been tested?"?
Edit: As an example, say I have just 3 entities, and the state is represented by a simple integer, 1, 2 or 3. We would then have this list of combinations:
111 112 113 121 122 123 131 132 133
211 212 213 221 222 223 231 232 233
311 312 313 321 322 323 331 332 333
I think you can break this down as follows:
You have a set of N entities, each of which can have one of three different states.
Given one particular permutation of states for those N entities, you
want to remember that you have processed that permutation.
It therefore seems that you can treat the N entities as a base-3 number with N digits.
When considering one particular set of states for the N entities, you can store that as an array of N bytes where each byte can have the value 0, 1 or 2, corresponding to the three possible states.
That isn't a memory-efficient way of storing the states for one particular permutation, but that's OK because you don't need to store that array. You just need to store a single bit somewhere corresponding to that permutation.
So what you can do is to convert the byte array into a base 10 number that you can use as an index into a BitArray. You then use the BitArray to remember whether a particular permutation of states has been processed.
To convert a byte array representing a base three number to a decimal number, you can use this code:
public static int ToBase10(byte[] entityStates) // Each state can be 0, 1 or 2.
{
    int result = 0;
    for (int i = 0, n = 1; i < entityStates.Length; n *= 3, ++i)
        result += n * entityStates[i];
    return result;
}
Given that you have numEntities different entities, you can then create a BitArray like so:
int numEntities = 4;
int numPerms = (int)Math.Pow(3, numEntities); // 3 states per entity => 3^numEntities permutations
BitArray states = new BitArray(numPerms);
Then states can store a bit for each possible permutation of states for all the entities.
Let's suppose that you have 4 entities A, B, C and D, and you have a permutation of states (which will be 0, 1 or 2) as follows: A2 B1 C0 D1. That is, entity A has state 2, B has state 1, C has state 0 and D has state 1.
You would represent that as a byte array like so:
byte[] permutation = { 2, 1, 0, 1 };
Then you can convert that to a base 10 number like so:
int permAsBase10 = ToBase10(permutation);
Then you can check if that permutation has been processed like so:
if (!states[permAsBase10])
{
    // Not processed, so process it.
    process(permutation);
    states[permAsBase10] = true; // Remember that we processed it.
}
Without getting overly fancy with algorithms and data structures, and assuming your tri-state values can be represented as strings and don't have an easily determined fixed maximum count, i.e. "111", "112", etc. (or even "1:1:1", "1:1:2"), then a simple SortedSet may end up being fairly efficient.
As a bonus, it doesn't care about the number of values in your set.
SortedSet<string> alreadyTried = new SortedSet<string>();

if(!HasSetBeenTried("1:1:1")){
    // do whatever
}

if(!HasSetBeenTried("500:212:100")){
    // do whatever
}

public bool HasSetBeenTried(string set){
    if(alreadyTried.Contains(set)) return true; // seen before
    alreadyTried.Add(set);                      // first time: remember it
    return false;
}
Simple maths says:
3 entities in 3 states makes 27 combinations.
So you need exactly log(27)/log(2) = ~ 4.75 bits to store that information.
Because a PC can only make use of whole bits, you need to "waste" ~0.25 bits and use 5 bits per combination.
The more data you gather, the better you can pack that information, but in the end, maybe a compression algorithm could help even more.
Again: you only asked for memory efficiency, not performance.
In general you can calculate the bits you need with Math.Ceiling(Math.Log(noCombinations, 2)).
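As a sketch of that calculation, plus packing the states as one base-3 number (method names are mine; each state is assumed to arrive as a byte with value 0, 1 or 2):

// Bits needed to distinguish all combinations of n tri-state values: ceil(n * log2(3)),
// the same as Math.Ceiling(Math.Log(Math.Pow(3, n), 2)) but without the overflow risk.
static int BitsNeeded(int numEntities) =>
    (int)Math.Ceiling(numEntities * Math.Log(3, 2));

// Pack an array of values in {0, 1, 2} into one base-3 number (BigInteger lives in System.Numerics).
static BigInteger Pack(byte[] states)
{
    BigInteger result = 0;
    for (int i = states.Length - 1; i >= 0; --i)
        result = result * 3 + states[i];
    return result;
}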
I want to generate a random number from a to b. The problem is, the number has to be given with exponential distribution.
Here's my code:
public double getDouble(double low, double high)
{
    double r;
    // (..some stuff..)
    r = rand.NextDouble();
    if (r == 0) r += 0.00001;
    return (1 / -0.9) * Math.Log(1 - r) * (high - low) + low;
}
The problem is that (1 / -0.9) * Math.Log(1 - r) is not between 0 and 1, so the result won't be between a and b. Can someone help? Thanks in advance!
I misunderstood your question in my first answer :) You are already using inversion sampling.
To map a range into another range, there is a typical mathematical approach:
f(x) = (b-a)(x - min)/(max-min) + a
where
b = upper bound of target
a = lower bound of target
min = lower bound of source
max = upper bound of source
x = the value to map
(this is linear scaling, so the distribution would be preserved)
(You can verify: If you put in min for x, it results in a, if you put in max for x, you'll get b.)
The problem now: the exponential distribution has a maximum value of inf. So, you cannot use this equation, because the result would always be whatever / inf + 0 - so 0. (Which makes sense mathematically, but of course does not fit your needs.)
So, the ONLY correct answer is: there is no exponential distribution possible between two fixed numbers, because you can't map [0,inf] -> [a,b].
Therefore you need some sort of trade-off to make your result as exponential as possible.
I wrapped my head around different possibilities out of curiosity and I found that you simply can't beat the maths on this :P
However, I did some tests with Excel and 1.4 million random records:
I picked an arbitrary number as the "limit" (10) and rounded the computed result to 1 decimal place (0, 0.1, 0.2 and so on). This number I used to perform the linear transformation with a maximum of 10, ignoring any result greater than 1.
Out of 1.4 million computations (generated 10-20 times), only 7-10 random numbers greater than 1 were generated.
(Probability density function after mapping the values; column 100 := 1, column 0 := 0.)
So:
Map the values to [0,1] using the linear approach mentioned above, assuming a maximum of 10 for the transformation.
If you encounter a value > 1 after the transformation, just draw another random number until the value is < 1.
With only 7-10 occurrences out of 1.4 million tests, this should be close enough, since the re-drawn number will again be pseudo-exponentially distributed.
If you want to build a spaceship, where navigation depends on perfectly exponential distributed numbers between 0 and 1 - don't do it, else you should be good.
(If you want to cheat a bit: if you encounter a number > 1, just find the record that has the biggest variance (i.e. Max(occurrences < expected occurrences)) from its expected value - then assume that value :P )
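To make that concrete, a minimal C# sketch of the draw-and-redraw idea; the rate of 1.0 and the cap of 10 are placeholder values of mine:

static readonly Random rand = new Random();

// Draw an exponential variate, linearly map [0, cap] onto [low, high],
// and redraw on the rare occasions the raw value exceeds the cap.
static double NextMappedExponential(double low, double high, double rate = 1.0, double cap = 10.0)
{
    while (true)
    {
        double x = -Math.Log(1 - rand.NextDouble()) / rate;  // inversion sampling, as in the question
        if (x <= cap)
            return low + (x / cap) * (high - low);           // linear scaling into [low, high]
        // x > cap: draw again, as suggested above
    }
}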
Since the support for the exponential distribution is 0 to infinity, regardless of the rate, I'm going to assume that you're asking for an exponential that's truncated below a and above b. Another way of expressing this would be an exponential random variable X conditioned on a <= X <= b.
You can derive the inversion algorithm for this by calculating the cumulative distribution function (CDF) of the truncated distribution as the integral from a to x of the density for your exponential. Scale the result by the area between a and b (which is F(b) - F(a) where F(x) is the CDF of the original exponential distribution) to make it a valid distribution with an area of 1. Set the derived CDF to U, a uniform(0,1) random number, and solve for X to get the inversion.
I don't program C#, but here's the result expressed in Ruby. It should translate pretty transparently.
def exp_in_range(a, b, rate = 1.0)
  exp_rate_a = Math.exp(-rate * a)
  return -Math.log(exp_rate_a - rand * (exp_rate_a - Math.exp(-rate * b))) / rate
end
I put a default rate of 1.0 since you didn't specify, but clearly you can override that. rand is Ruby's built-in uniform generator. I think the rest is pretty self-explanatory. I cranked out several test sets of 100k observations for a variety of (a,b) values, loaded the results into my favorite stats package, and the results are as expected.
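Since the question is about C#, a fairly direct translation might look like this (a sketch; rand is assumed to be a shared System.Random instance passed in):

// Exponential variate truncated to [a, b], via inversion of the conditional CDF.
public static double ExpInRange(Random rand, double a, double b, double rate = 1.0)
{
    double expRateA = Math.Exp(-rate * a);
    double expRateB = Math.Exp(-rate * b);
    return -Math.Log(expRateA - rand.NextDouble() * (expRateA - expRateB)) / rate;
}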
The exponential distribution is not limited on the positive side, so values can go from 0 to inf. There are many ways to scale [0,infinity] to some finite interval, but the result would not be exponential distributed.
If you just want a slice of the exponential distribution between a and b, you could simply draw r from [ra, rb] such that -log(1-ra) = a and -log(1-rb) = b, i.e.
r = rand.NextDouble(); // assume this is between 0 and 1
ra = 1 - Math.Exp(-a);
rb = 1 - Math.Exp(-b);
rbound = ra + (rb - ra) * r;
return -Math.Log(1 - rbound);
Why check for r==0? I think you would want to check that the argument of the log is > 0, so check for r (or rbound in this case) == 1.
Also, it's not clear why the (1/-0.9) factor is there.
This is a bit of a weird requirement, so bear with me.
I have two values, X and Y, that come from three possible range lists. One list is a range for X values, one list is a range for Y values and the final list is ranges for X and Y values combined.
What I need to do is to find all possible solutions.
An example will probably make things clearer?
X Ranges
A: 11-13;
B: 14-16;
C: 17-19;
D: 20-22.
Y Ranges
A: 11-28;
B: 29-46;
C: 47-64.
X, Y Ranges
A: 6-7, 6-14;
B: 8-9, 15-23;
C: 10-11, 24-32.
X = 25, Y = 67
So one solution would be:
X Range = B, X = 14
Y Range = B, Y = 35
X, Y Range = C, X, Y = 11, 32
Another solution would be:
X Range = B, X = 15
Y Range = B, Y = 43
X, Y Range = C, X, Y = 10, 24
My current solution is to put the three ranges into lists and then to use brute force to try EVERY single combination, rejecting any that don't give the correct answer. This means that the vast majority of the work is spent generating invalid results.
This actually performs not too badly considering the wasted effort. The problem is I need to get this working for much longer lists than my example and potentially across multiple variables.
My brute force solution will work for the more complicated requirement but it starts to get really slow.
Is there a more elegant way to approach this problem?
1. The problem specification
I am not sure I follow your solutions, so for clarification:
I assume X=25, Y=67 is the input value,
but neither X nor Y is inside any of your input ranges.
Are you finding any X, Y in <0 - max value provided as input>
that match some X, Y and X,Y range at the same time?
2. So you need to speed up finding overlapping intervals
To speed up searching a list you can use sorted lists.
For each range list create 2 index lists:
one sorted by start value ascending,
the second sorted by stop value descending.
After this you can also use binary search on them.
example:
range list: A(1-10), B(5-7), C(22-30), D(5-8)
sorted list1: A, B, D, C
sorted list2: C, A, D, B
now I want to find all intervals which cover the value 10
so make 2 lists of candidates, one from each sorted list:
start <= 10: A,B,D
stop >= 10: A,C
now the solution is to select just the ranges found in both lists
solution: A
you can adapt this to your problem
At first look this seems like a slower, more complicated approach,
but thanks to binary search the lookups can be quick,
and you do not really have to keep 2 solution lists in memory; they are really just 2 start indexes,
one for each sorted index list.
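A rough C# sketch of the two-sorted-lists idea (the NamedRange record and names are mine, and the linear TakeWhile scans stand in for the binary searches):

record NamedRange(string Name, int Start, int Stop);

// Find all ranges covering 'value' using the two sorted lists described above.
static string[] Covering(NamedRange[] ranges, int value)
{
    var byStart = ranges.OrderBy(r => r.Start).ToArray();              // e.g. A, B, D, C
    var byStopDesc = ranges.OrderByDescending(r => r.Stop).ToArray();  // e.g. C, A, D, B

    // A binary search could replace these scans; TakeWhile keeps the sketch short.
    var startOk = byStart.TakeWhile(r => r.Start <= value).Select(r => r.Name);
    var stopOk = byStopDesc.TakeWhile(r => r.Stop >= value).Select(r => r.Name);

    return startOk.Intersect(stopOk).ToArray();                        // ranges present in both lists
}

// var ranges = new[] { new NamedRange("A", 1, 10), new NamedRange("B", 5, 7),
//                      new NamedRange("C", 22, 30), new NamedRange("D", 5, 8) };
// Covering(ranges, 10);   // -> { "A" }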
I have been trying to search for an answer to this, but all the discussions I have found are either in a language I don't understand or rely on having a collection where each element has its own weight.
I want to basically just get a random number between 0 and 10 which is "middle-weighted", so that 5 comes up more often than 0 and 10. Basically I have been trying to figure out an algorithm where I can pick any number between my defined min and max values to be the "weighted number", and all the numbers generated would be weighted appropriately. I know that this may sound like "I don't want to think about this, I'll just sit back and wait for someone else to do it", but I have been thinking and searching about this for like an hour and I'm really lost :|
So in the end, I want to be able to call (via an extension method)
random.NextWeighted(MIN, MAX, WEIGHT);
You have an inverse normal distribution method available.
Scale your random number so that it's a double between zero and one.
Pass it to InverseNormalDistribution.
Scale the returned value based on the weight. (For example, divide by weight over 100.)
Calculate [ (MIN + MAX) / 2 ] + [ ScaledValue * (MAX - MIN) ]
If that's less than MIN, return MIN. If it's more than MAX, return MAX. Otherwise, return this value.
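A sketch of those steps as an extension method; Box-Muller stands in for the InverseNormalDistribution step, and the way weight maps to the spread is my own guess (larger weight means values cluster more tightly around the middle):

static class RandomExtensions
{
    public static int NextWeighted(this Random random, int min, int max, int weight)
    {
        double u1 = 1.0 - random.NextDouble();       // in (0, 1], safe for Log
        double u2 = random.NextDouble();
        double standardNormal = Math.Sqrt(-2.0 * Math.Log(u1)) * Math.Cos(2.0 * Math.PI * u2);

        double mid = (min + max) / 2.0;
        double spread = (max - min) / (double)Math.Max(1, weight);   // assumed weight-to-spread mapping
        double value = mid + standardNormal * spread;

        return (int)Math.Round(Math.Max(min, Math.Min(max, value))); // clamp to [min, max]
    }
}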
I don't know how much more often you want 5 to appear than the other numbers between 0-10 but you could create an array with the distribution you want.
Something like
var dist = new []{0,1,2,3,4,5,6,7,8,9,10,5,5,5};
Then if you pick a random position between 0 and 13 you will get numbers between 0 and 10, but a 5 four times more often than the others. Pretty fast, but not very practical if you want numbers between 0 and a billion though.
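For completeness, a usage sketch (rand is an assumed shared Random instance):

var rand = new Random();
var dist = new[] { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 5, 5, 5 };
int value = dist[rand.Next(dist.Length)];   // 5 is four times as likely as any other number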