So you've got a list of items that you want to draw from at random (with replacement). But you don't want a uniform distribution where each item is equally likely. You want weighted probabilities where, say, "Silver" is to be drawn five times for every three times that "Gold" is drawn, and "Bronze" should come up something like eleven times in the same number of draws.
What you could do is build a list with duplicates (three Golds, five Silvers, and eleven Bronzes) — 19 entries in all — and draw uniformly from that. But that does seem a bit wasteful, especially if the list is in a database. Or consider the case where there should be one Gold for every hundred Silver, and one Silver for every hundred Bronze. That list of duplicates will have 10,101 entries — for only three different values.
Instead, what we'll do is use conditional probabilities. First of all, we decide whether or not to pick the first element ("Silver" gets picked with probability 5/19). If we decide to pick the first element, then we're done. If not, we randomly decide whether to pick the second element. But to work out the probability we have to take into account the fact that we have decided not to pick the first. ("Gold" gets picked, not with probability 3/19 ≈ 15.79%, but with probability 3/(19-5) = 3/14 ≈ 21.43%). If we decide not to pick any other elements before the last one, then by the time we get to it we have no choice in the matter any more - we have to pick the last element because it's the only one left to pick. ("Bronze" gets picked with probability 11/(19-5-3) = 11/11 = 1 = 100%). The conditional_probabilities
function calculates, for each element, the probability to use when deciding whether or not to pick that element given that we have not picked any of the previous ones.
Actually picking an element is done by the get_variate
function, which works as described above: for each element of the array, decide at random whether or not to pick it based on the conditional probabilities calculated earlier, and keep going until one is picked.
There's not much to using the two functions; the example pretty much covers the interface (I mean, two functions with one parameter each, how much interface is there to cover?). As you can see, the input to conditional_probabilities
is an array where the values to be selected are array keys, and their relative frequencies the corresponding values (["Silver" => 5, "Gold" => 3, "Bronze" => 11]
); get_variate
returns one of those keys.
function conditional_probabilities($weights)
{
$probs = $weights;
foreach($probs as &$weight)
{
$weight/= array_sum($weights);
array_shift($weights);
}
return $probs;
}
function get_variate($probs)
{
foreach($probs as $k => $p)
{
if(mt_rand() <= mt_getrandmax() * $p)
{
return $k;
}
}
}
// Example of use
$biases = [
'foo' => 1,
'bar' => 7,
'baz' => 3,
'blug' => 5,
'wibble' => 9,
'womble' => 4,
'splunge' => 7,
'plink' => 2,
'plunk' => 9,
'plonk' => 4,
'fnord' => 0,
'zit' => 3,
'zot' => 7];
$probs = conditional_probabilities($biases);
$sample = [];
for($j = 0; $j < 100000; ++$j)
$sample[] = get_variate($probs);
// Let's see if these variates have the distribution they should:
$histogram = array_count_values($sample);
$sum = array_sum($biases) / 100000;
foreach($histogram as $key => $p)
{
echo $key,"\t", $p * $sum, "\n";
}
baz 2.9646
plonk 4.07846
womble 4.03393
splunge 6.95644
zot 6.94607
blug 4.93429
plunk 8.85842
wibble 9.06277
zit 3.01645
bar 7.08149
foo 1.02968
plink 2.0374
Clearly it's not going to preserve the order of the input weights; they're actually listed in the order they were first picked. So I'll just sort them by hand here, along with the intended weight for comparison, and also the totals for the two columns:
foo 1.02968 1
bar 7.08149 7
baz 2.9646 3
blug 4.93429 5
wibble 9.06277 9
womble 4.03393 4
splunge 6.95644 7
plink 2.0374 2
plunk 8.85842 9
plonk 4.07846 4
zit 3.01645 3
zot 6.94607 7
61.00000 61
It should go without saying that since fnord
has zero chance of getting picked, it wasn't and so didn't show up to be counted.