How to pick up 3 numbers from a uniform distribution in a transparent manner?

Over in my previous post, I’m giving away 3 copies of my video course on ggplot2 and shiny. To win a copy, you just need to leave a comment and I will select 3 winners among the n participants at random after a deadline.

But how do I pick 3 winners such that:

  • all players are equally likely to win.
  • no-one can contest the fairness of the selection.

The first thing that comes to mind is to run sample(n,3, replace = FALSE) and report the winners. But how do you know I actually ran this code? I could have decided on the winners well in advance and just pretended to run the code.

A way to approach this issue could be to set the random seed to some value so that anyone suspecting foul play can run the code themselves and get the same answer as me:

Select All Code:
1
2
set.seed(someSeed)
sample(n, 3, replace=FALSE)

I see at least two problems with it: 1) I could still have selected a seed that gives me the sample I eventually want, and 2) even using a function (e.g. of n the number of participants) as the seed doesn’t guarantee a uniform distribution for each player.

I came up with a plan which I think addresses both the uniform distribution over the players, and the incontestability of the selection.

First, I simplify the problem of selecting 3 winners among n participants to selecting 1 integer from a uniform distribution. This is easy: instead of choosing 3 items among n, I’m selecting 1 of the choose(n,3) possible combinations. Once I’ve sampled 1 number i, I simply use combn(n,3) to generate all the combinations and pick the ith item:
combn(n,3, simplify=FALSE)[[i]].

Second, I have a way to pick a number from a uniform distribution that’s completely transparent. To do this, I pick up a number uniformly at random in a much bigger set (1:N, with N>>choose(n,3)) and project this number back to the interval I’m interested in (1:choose(n,3)). That is, once I have a number j between 1 and N, I use

Select All Code:
1
i <- ceiling(choose(n,3) * j /N)

to find a random value uniformly distributed from 1 to choose(n,3)

Ideally you’d want N to be a multiple of choose(n,3) for every outcome to be exactly equally likely, but if N is much larger than choose(n,3), the slight difference in probability for each outcome is negligible (of the order of 1/N).

Now, how do I pick up a number in a bigger set in a transparent manner? We saw that using the seed is fraught with difficulty, so I need something else. I’m going to use something which neither I nor the players have any control over: the UK national lottery, which is a combination of 7 integers from the set {1,…,49}. More precisely, I’m doing this:

  • declare in advance which future lottery draw I’m using.
  • use the lexicographic index of the combination drawn as my number j; j comes from a uniform distribution between 1 and N=choose(49,7), which is a relatively big number.

And this is it: this way, I pick up 3 winners among n and there is no way for me (or the players) to rig the selection. Here is the code I’ll run:

Select All Code:
1
2
3
4
5
6
lotteryResult <- c( , , , , , , ) # to be filled in by the actual lottery draw
nPlayers <- 200 # to be updated with the number of participants
nWinners <- 3
index <- ceiling(choose(nPlayers, nWinners) * lexicographicIndex(lotteryResult, 49) / choose(49,7))
winners <- combn(nPlayers,nWinners,simplify = FALSE)[[index]]
cat("\n The winners are:", winners)

The deadline for the competition is Wednesday 09th July at midnight UK time. The next lottery (UK lotto) draw after that is on Saturday 12th July, and I’ll use that draw to decide on the winners, using the code I’ve presented here.

What do you think? Can you poke holes in my strategy? Can you come up with something simpler?

Note about the lexicographic index
It is not terribly difficult to find the index of a combination without generating them all. All you need to do is to count the number of combinations that appeared before. For example, if the combination starts with 3, you know it comes after all the combinations that start with 1 and 2. Here is the code I wrote to go from a combination to its lexicographic index. There’s also a test function after it.

Select All Code:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
# gives the index of the combination if sorted in lexicographic order, starting at 1
# lexicographicIndex(c(2,3,4), 5) = 7
# because [2,3,4] is the 7th item in the lexicographically ordered combinations of length 3 using 5 letters: 
# 1 2 3
# 1 2 4
# 1 2 5
# 1 3 4
# 1 3 5
# 1 4 5
# 2 3 4
# 2 3 5
# 2 4 5
# 3 4 5
# C. Ladroue
# combination is a sequence of unique integers between 1 and alphabetSize
 
lexicographicIndex <- function(combination, alphabetSize){
  combination <- sort(combination)
 
  combinationLength <- length(combination)
 
  index <- 1 
  for(p in 1:combinationLength){
 
    starting  <- ifelse(p == 1, 1 , combination[p-1] + 1)
    finishing <- combination[p] - 1
 
    if(starting <= finishing)
      index <- index + sum(sapply(starting:finishing, function(j) choose(alphabetSize - j, combinationLength - p)))
  }
    index 
}
 
 
lexicographicIndexTest <- function(){
  alphabetSize <- 10
  combinationLength <- 3
  x <- combn(alphabetSize, combinationLength, simplify = FALSE)
  cat("\n test all combinations with alphabet size = ",alphabetSize,"and combination length = ",combinationLength,": ",
      all(sapply(1:length(x), function(index) lexicographicIndex(x[[index]], alphabetSize) == index ))
      )
  cat("\n")
}
This entry was posted in Uncategorized and tagged , , , , . Bookmark the permalink.

5 Responses to How to pick up 3 numbers from a uniform distribution in a transparent manner?

  1. Chip Lynch says:

    Why couldn’t you have simply used the lottery number as the seed in the original sample() code? I may not fully understand the implication of your second complaint about random seeds… any one seed doesn’t give uniform likelihood but a randomly selected seed should, from the perspective of the probability given that you don’t know the seed. Yes?

    • CL says:

      My issue with the seed is that I cannot guarantee that the sampling will be uniform even if I know the distribution of the seed. Unless there is some theoretical result I’m not aware of.

      seed in set.seed(seed) is an integer, or a value interpreted as such. We need to decide how to map the lottery numbers to one integer. An obvious choice is the sum of the lottery numbers, or the lexicographic index of the whole combination.

      Using the sum is disastrous: the distribution of the sum is a binomial and some seeds are selected many more times than others, which results in a subset of players having a much bigger chance of winning.

      Using the lexicographic index as the seed, we have to resort to simulation:

      Select All Code:
      # using index as seed
      nIndex <- choose(49,7)
      nPlayers <- 100
      nWinners <- 3
       
      counts <- rep(0L,nPlayers)
      for(s in 1:nIndex) {
        set.seed(s) 
        winners <- sample.int(nPlayers,3, replace = FALSE) 
        counts[winners] <- counts[winners] +1
      }

      Admittedly, in this case the sampling is pretty good and uniform:

      > max(counts)/min(counts)
      [1] 1.002674

      However, I don’t know whether it’s always the case. Maybe it works for 100 players, maybe it doesn’t with 50.

      If I use the projection instead of playing with the seed:

      Select All Code:
      # using index with projection
      nIndex <- choose(49,7)
      nPlayers <- 100
      nWinners <- 3
      nCombination <- choose(nPlayers,nWinners)
      allCombinations <- combn(nPlayers,nWinners)
       
      counts <- rep(0L,nPlayers)
      for(index in ceiling(nCombination * (1:nIndex)/nIndex)) {
        winners <- allCombinations[,index] 
        counts[winners] <- counts[winners] +1
      }

      In this case,

      > max(counts)/min(counts)
      [1] 1.000046
      

      And with the guarantee that it will work for a very wide range of values (pretty much as large as choose(49,7), or 800 players.).

  2. Doug Dame says:

    You could just string the lottery numbers together, e.g., if they’re
    22, 4, 9, 47, 17, 41, 13

    then the seed could be 224947174113.

    But using some public number that is unknowable until some future time is a good step towards transparency. However, say you have 25 candidates and your routine selects 7, 19, 4. You’re still the only person who knows the names associated with #7, #19, and #4. So you have to also publish the list of all the candidates to be fully transparent.

    Which is overkill, in my mind. You’re voluntarily giving some stuff away for free, which is certainly generous, but it’s not so valuable that anyone (reasonable) is going to insist on an audit of the methodology used by the awards committee, so to speak.

    —-
    You comment: “Using the sum is disastrous: the distribution of the sum is a binomial and some seeds are selected many more times than others, which results in a subset of players having a much bigger chance of winning.”

    This is a one-time event, so just one seed will be selected just once. Should be random enough for most people. (For this purpose, anyway.)

    • CL says:

      I’m not sure what the distribution would be like with the seed you suggest, but that’s a possibility.
      -
      Good point about the list of players. It’s something I’ve forgotten to add in my post: the players will be ordered by the time of their comment, which is publicly available.
      -
      About the using the sum: even for a one-time event, it’s still unfair for some players to have a probability of winning 10 times lower than others. And a player could possibly choose to comment at the right time to maximise their chances: the binomial distribution is fixed in advance.
      -
      :) I agree all this is overkill in this case but I found it quite interesting as an exercise.

  3. Carl Witthoft says:

    No matter what you do with your own code, there’s no guarantee of transparency. Your subjects have to see the source code, approve it, compile it themselves with compilers on operating systems they trust, and on it goes.

    Now, using the lottery, or the 3 least-significant digits of tomorrow’s DowJones closing value (which in fact is, or was, used by bookies), will work. Just divide up the possible range by the number of applicants and pick any set of criteria, in advance, as to which quantiles are defined as winners. But again, you have to convince the players that the lottery (or the DJIA) is uniformly distributed.

Leave a Reply

Your email address will not be published. Required fields are marked *

*


eight × 6 =

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong> <pre lang="" line="" escaped="">