Non-Uniform Coupon Collector’s Problem

The Coupon Collector’s Problem is a neat little problem in probability, and I first heard about it recently on the statistics subreddit. You, like me, might be familiar with it if you’ve ever tried to solve the expected number of boxes of cereal to buy to get all the toys. Not that I have that problem right now, but it shows up on probability quizzes and the like.

The problem’s solution hinges on two things. One, there is replacement (sampling from a seemingly infinite population of items that are in some proportion). Two, all items are equally likely. What happens when they aren’t equally likely? We turn back to Absorbing Markov Chains (AMC), because apparently that has to be 50% of what I talk about on here!

The Problem

Based on this Reddit post, the OP asked about the problem of uniform dice rolling first, followed by a question about blood types, which don’t have uniform probabilities. What we want to find out is the expected number of dice rolls (or blood tests) that would need to be run. We also want to get an idea of the distribution, and perhaps get a confidence interval of some kind.

Uniform Coupons

The dice problem gives us an easy, uniform setup to the problem before moving on to something maybe a bit more complex. Since we are using an AMC, the first step is to build the transition matrix. That means defining all the states and linking them with probabilities. I do that with the following python code:

import numpy as np
from itertools import combinations
n = 6
state_probs = {k:1/6 for k in range(1,n+1)}
n_states = 2**n
states = ['start']
T = np.zeros((n_states,n_states))

for k in range(1,n):
    for comb in combinations(range(1,n+1),k):
        states.append(comb)
        curr_ind = len(states) - 1
        if len(comb) == 1:
            prev_ind = 0
            T[prev_ind,curr_ind] += state_probs[comb[0]]
        else:
            for rem in comb:
                comb_prev = list(comb)
                comb_prev.remove(rem)
                idx = states.index(tuple(comb_prev))
                T[idx,curr_ind] += state_probs[rem]
        T[curr_ind,curr_ind] = sum(state_probs[x] for x in comb)
states.append((1,2,3,4,5,6))
T[-1,-1] = 1
comb = (1,2,3,4,5,6)
for a in states[-7:-1]:
    idx = states.index(a)
    comb_prev = list(comb)
    v = list(set(comb) - set(a))[0]
    T[idx,-1] = state_probs[v]

All the code does is loop over the various combinations of different lengths of dice roll sets and set up transition probabilities based on all the possible previous states. The transition matrix looks pretty, too:

MarkovCoupon_1

This next block does the AMC math to get the expected number of steps to the final state, and the variance in the number of steps.


Q = T[:-1,:-1]
nt = Q.shape[0]
R = T[:-1,-1]
Ir = T[-1,-1]
It = np.eye(nt)

N = np.linalg.inv(It - Q)
t = np.dot(N,np.ones(nt))
tsq = t**2
t_var = np.dot(2*N - It,t) - tsq

# expected number of rolls is t[0]

The expected number of dice rolls is 14.7, with a variance of about 39. This doesn’t tell us anything about the PMF of the dice rolls. To get that, we first simulate the AMC to get the probability of being in the final state for a range of dice roll counts. Going from 1 roll to 40 rolls gives the following CDF:

UniformCouponCDF

The CDF will go on to infinity, but I stopped calculating it because it was close enough to 1 for our purposes. CDFs are easily turned into PMFs, too:

UniformCouponPMF

I’ve filled in the 95% interval of the PMF that starts at the smallest roll. This tells us that 95% (roughly, since the PMF is obviously discrete) of the time we will roll 28 or fewer times. We will roll 29 or more times only 5% of the time.

Non-Uniform Coupons (Blood Types)

Using some Red Cross data on blood type ratios in the population, we can repeat the above analysis in the exact same way. Using the same code as above, but with a new set of items:


items = ['A+','A-','B+','B-','AB+','AB-','O+','O-']
state_probs = {'A+':.33,'A-':.07,'B+':.09,'B-':.02,'AB+':.03,'AB-':.01,'O+':.37,'O-':.08}

we get a much larger transition matrix, but one with the same kind of pattern:

NonUniformCouponMatrixThe expected number of people to test before finding all blood types is 122.45. The variance is 8453, so the std. dev is about 92. That’s a lot of people to test just to get at least 1 of each (so go donate blood)!

The following two plots show the CDF and PMF (as a continuous line). I stopped running the AMC simulation forward when the absorbing probability reached 0.95. That means that the PMF shows the left-edge 95% confidence interval.

NonUniformCouponCDF

CDF of Blood Type Coupon Problem

NonUniformCouponPDF

PMF of Blood Type Coupon Problem

Notice how the maximum likelihood point is much smaller than the mean value. We expect that because this distribution can go on for infinity, which greatly skews the mean to the right tail.

Final Thoughts

Besides showing how to get the distribution of steps in a uniform coupon problem, we also used the same method to solve a non-uniform coupon problem. As one would expected, the distribution of steps has a very long right-tail. You might be searching a very long time for that last coupon!

The really easy thing about these particular problems is that you only need a set of population proportions to make them work. In the future I’ll probably re-do this for the McDonald’s Monopoly game, since it is just a coupon collecting game. Hopefully I’ll be able to pull out some data on how much you might be expected to win on the way to trying for big prizes!

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s