Over in my previous post, I’m giving away 3 copies of my video course on ggplot2 and shiny. To win a copy, you just need to leave a comment and I will select 3 winners among the n participants at random after a deadline.
But how do I pick 3 winners such that:
 all players are equally likely to win.
 noone can contest the fairness of the selection.
The first thing that comes to mind is to run sample(n,3, replace = FALSE)
and report the winners. But how do you know I actually ran this code? I could have decided on the winners well in advance and just pretended to run the code.
A way to approach this issue could be to set the random seed to some value so that anyone suspecting foul play can run the code themselves and get the same answer as me:
I see at least two problems with it: 1) I could still have selected a seed that gives me the sample I eventually want, and 2) even using a function (e.g. of n the number of participants) as the seed doesn’t guarantee a uniform distribution for each player.
I came up with a plan which I think addresses both the uniform distribution over the players, and the incontestability of the selection.
First, I simplify the problem of selecting 3 winners among n participants to selecting 1 integer from a uniform distribution. This is easy: instead of choosing 3 items among n, I’m selecting 1 of the choose(n,3)
possible combinations. Once I’ve sampled 1 number i, I simply use combn(n,3)
to generate all the combinations and pick the ith item:
combn(n,3, simplify=FALSE)[[i]]
.
Second, I have a way to pick a number from a uniform distribution that’s completely transparent. To do this, I pick up a number uniformly at random in a much bigger set (1:N, with N>>choose(n,3)
) and project this number back to the interval I’m interested in (1:choose(n,3)
). That is, once I have a number j between 1 and N, I use
to find a random value uniformly distributed from 1 to choose(n,3)
Ideally you’d want N to be a multiple of choose(n,3) for every outcome to be exactly equally likely, but if N is much larger than choose(n,3), the slight difference in probability for each outcome is negligible (of the order of 1/N
).
Now, how do I pick up a number in a bigger set in a transparent manner? We saw that using the seed is fraught with difficulty, so I need something else. I’m going to use something which neither I nor the players have any control over: the UK national lottery, which is a combination of 7 integers from the set {1,…,49}. More precisely, I’m doing this:
 declare in advance which future lottery draw I’m using.
 use the lexicographic index of the combination drawn as my number j; j comes from a uniform distribution between 1 and
N=choose(49,7)
, which is a relatively big number.
And this is it: this way, I pick up 3 winners among n and there is no way for me (or the players) to rig the selection. Here is the code I’ll run:
1 2 3 4 5 6  lotteryResult < c( , , , , , , ) # to be filled in by the actual lottery draw nPlayers < 200 # to be updated with the number of participants nWinners < 3 index < ceiling(choose(nPlayers, nWinners) * lexicographicIndex(lotteryResult, 49) / choose(49,7)) winners < combn(nPlayers,nWinners,simplify = FALSE)[[index]] cat("\n The winners are:", winners) 
The deadline for the competition is Wednesday 09th July at midnight UK time. The next lottery (UK lotto) draw after that is on Saturday 12th July, and I’ll use that draw to decide on the winners, using the code I’ve presented here.
What do you think? Can you poke holes in my strategy? Can you come up with something simpler?
Note about the lexicographic index
It is not terribly difficult to find the index of a combination without generating them all. All you need to do is to count the number of combinations that appeared before. For example, if the combination starts with 3, you know it comes after all the combinations that start with 1 and 2. Here is the code I wrote to go from a combination to its lexicographic index. There’s also a test function after it.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43  # gives the index of the combination if sorted in lexicographic order, starting at 1 # lexicographicIndex(c(2,3,4), 5) = 7 # because [2,3,4] is the 7th item in the lexicographically ordered combinations of length 3 using 5 letters: # 1 2 3 # 1 2 4 # 1 2 5 # 1 3 4 # 1 3 5 # 1 4 5 # 2 3 4 # 2 3 5 # 2 4 5 # 3 4 5 # C. Ladroue # combination is a sequence of unique integers between 1 and alphabetSize lexicographicIndex < function(combination, alphabetSize){ combination < sort(combination) combinationLength < length(combination) index < 1 for(p in 1:combinationLength){ starting < ifelse(p == 1, 1 , combination[p1] + 1) finishing < combination[p]  1 if(starting <= finishing) index < index + sum(sapply(starting:finishing, function(j) choose(alphabetSize  j, combinationLength  p))) } index } lexicographicIndexTest < function(){ alphabetSize < 10 combinationLength < 3 x < combn(alphabetSize, combinationLength, simplify = FALSE) cat("\n test all combinations with alphabet size = ",alphabetSize,"and combination length = ",combinationLength,": ", all(sapply(1:length(x), function(index) lexicographicIndex(x[[index]], alphabetSize) == index )) ) cat("\n") } 
Why couldn’t you have simply used the lottery number as the seed in the original sample() code? I may not fully understand the implication of your second complaint about random seeds… any one seed doesn’t give uniform likelihood but a randomly selected seed should, from the perspective of the probability given that you don’t know the seed. Yes?
My issue with the seed is that I cannot guarantee that the sampling will be uniform even if I know the distribution of the seed. Unless there is some theoretical result I’m not aware of.
seed
inset.seed(seed)
is an integer, or a value interpreted as such. We need to decide how to map the lottery numbers to one integer. An obvious choice is the sum of the lottery numbers, or the lexicographic index of the whole combination.Using the sum is disastrous: the distribution of the sum is a binomial and some seeds are selected many more times than others, which results in a subset of players having a much bigger chance of winning.
Using the lexicographic index as the seed, we have to resort to simulation:
Admittedly, in this case the sampling is pretty good and uniform:
However, I don’t know whether it’s always the case. Maybe it works for 100 players, maybe it doesn’t with 50.
If I use the projection instead of playing with the seed:
In this case,
And with the guarantee that it will work for a very wide range of values (pretty much as large as
choose(49,7)
, or 800 players.).You could just string the lottery numbers together, e.g., if they’re
22, 4, 9, 47, 17, 41, 13
then the seed could be 224947174113.
But using some public number that is unknowable until some future time is a good step towards transparency. However, say you have 25 candidates and your routine selects 7, 19, 4. You’re still the only person who knows the names associated with #7, #19, and #4. So you have to also publish the list of all the candidates to be fully transparent.
Which is overkill, in my mind. You’re voluntarily giving some stuff away for free, which is certainly generous, but it’s not so valuable that anyone (reasonable) is going to insist on an audit of the methodology used by the awards committee, so to speak.
—
You comment: “Using the sum is disastrous: the distribution of the sum is a binomial and some seeds are selected many more times than others, which results in a subset of players having a much bigger chance of winning.”
This is a onetime event, so just one seed will be selected just once. Should be random enough for most people. (For this purpose, anyway.)
I’m not sure what the distribution would be like with the seed you suggest, but that’s a possibility.

Good point about the list of players. It’s something I’ve forgotten to add in my post: the players will be ordered by the time of their comment, which is publicly available.

About the using the sum: even for a onetime event, it’s still unfair for some players to have a probability of winning 10 times lower than others. And a player could possibly choose to comment at the right time to maximise their chances: the binomial distribution is fixed in advance.

I agree all this is overkill in this case but I found it quite interesting as an exercise.
No matter what you do with your own code, there’s no guarantee of transparency. Your subjects have to see the source code, approve it, compile it themselves with compilers on operating systems they trust, and on it goes.
Now, using the lottery, or the 3 leastsignificant digits of tomorrow’s DowJones closing value (which in fact is, or was, used by bookies), will work. Just divide up the possible range by the number of applicants and pick any set of criteria, in advance, as to which quantiles are defined as winners. But again, you have to convince the players that the lottery (or the DJIA) is uniformly distributed.