Bayesian Modeling with PyMC: Dirichlet and a Custom Stochastic

There was a question asked on Reddit’s r/statistics by user nomm_ in this post. It sounded like the perfect problem for some Bayesian modeling, so I dusted off the PyMC Python library to tackle it. This will also serve, I hope, as a guide to others who are trying to do things like custom stochastics in PyMC that are also observed values.

The question involves estimating the probabilities of selecting a particular action several times in a row (the actions involved upgrading items in a video game). There are nodes that need to be linked, and each action decides how many to link. Several actions can take place for each result, and the likelihood of each action changes based on the actions taken before it. Let’s see how to tackle this!

Problem Description

Make sure to read the original post first. To generalize the problem, we could define it with the following rules:

  • We are making a series of 5 integers, each of which is either 0 or 1
  • The available actions are to decide how many 1s to place in a row, or to place a 0, and we make those decisions from left to right
  • We can decide to place a single 0, or 1 to 5 1s in a row
  • If a series of 1s is placed in a row, it must be followed by a 0
  • Ignore that last rule if the series of 1s ends the set of 5 integers

There is a question as to whether or not one of the following two rules applies:

  1. You cannot decide to place more 1s than there is room left in the series
  2. If you decide to place more 1s than there is room for, just stop at 5

I’ll give an example. Let’s say that the actions you picked were:

Pick a 0: [0, X, X, X, X] (the X’s mean “not decided yet”)

Pick a 1: [0, 1, 0, X, X]  (notice the padding 0!)

If we were using rule (1), we would only be allowed to pick either a 0, or 1 to 2 1s. If we were using rule (2), we could pick any number of 1s. If rule (2) were used, we might expect more 1s near the end, since there are more long series we could choose from. For rule (1), we don’t have as many options.

The most critical part of the description is the probabilities for picking those series. They are just a list of probabilities, kind of like the parameters of a multinomial distribution, giving the probability of selecting x from [0, 1, 2, 3, 4, 5]. I define those as:

p_i = Pr(x = i)

If p_0 = 0.5, we’d expect to pick 0 from that list 50% of the time.

Conditionality of the Probabilities

For rule (1), we have to decide how to deal with the limited set of choices. In the example above, we can only choose from [0, 1, 2]. We then state that their probabilities are just normalized among themselves, and used to select the action. For example, if all the probabilities were:

P = [0.2, 0.3, 0.2, 0.15, 0.1, 0.05]

Then we’d normalize the first three to get:

P_{0:2} = [0.286, 0.429, 0.286] (roughly).

This lets us work with only one set of probabilities the whole time. As long as we calculate the likelihood right, we’ll be fine.
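Just to make the bookkeeping concrete, here is that renormalization as a few lines of NumPy (an illustrative snippet, not part of the original model code):

import numpy as np

P = np.array([0.2, 0.3, 0.2, 0.15, 0.1, 0.05])

# only the choices 0, 1, and 2 are feasible, so renormalize the first three entries
P_feasible = P[:3] / P[:3].sum()
print(P_feasible)   # roughly [0.286, 0.429, 0.286]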

The Model

The Bayesian model that we want to use is pretty simple. We want to estimate a vector of probabilities, and that vector is used to generate some random data according to a rule set. Remembering Bayes’ Theorem:

P(A|B) \propto P(B|A) P(A)

And defining our probability vector as p, our data as D, and any hyperparameters as \theta, we get:

P(p|D,\theta) \propto P(D|p,\theta) P(p|\theta)

The data follow very specific rules, so there’s no analytical distribution to use for the likelihood (that’s why we’re using PyMC!). We’ll build that in a minute. The prior on p is chosen to be a Dirichlet distribution. The Dirichlet is an extension of the Beta distribution, and it models the distribution of a set of parameters that must each lie between 0 and 1, and must sum to 1. It’s the perfect distribution for modeling a list of probabilities for mutually exclusive events (it’s also the conjugate prior to the multinomial, which is great for plain categorical modeling).

The Dirichlet hyperparameters, \theta, are a vector of values that affect the mean and variance of the resulting probabilities. We could initialize them to some value (based on an analysis of the data, perhaps), or stay objective. I chose to stay objective and use the Jeffreys prior setting of 1/2 for all the hyperparameters. Since they are now constant, I’ll drop the term from the equation.

So now we have:

p \sim Dirichlet (\alpha_i = 1/2)
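To get a feel for what this prior implies, here is a quick sketch (my own illustration, separate from the model) drawing a few samples from a Dirichlet with every concentration parameter set to 1/2; each draw is a valid probability vector over the six choices:

import numpy as np

# each row is a length-6 vector of non-negative values that sums to 1
draws = np.random.dirichlet([0.5] * 6, size=3)
print(draws)
print(draws.sum(axis=1))   # -> [ 1.  1.  1.]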

The Likelihood

The likelihood function just needs to return the probability of the actions in the outcome being selected (and take the log of it, for PyMC). We have to distinguish by rules, so I made a general function that takes a data point (a vector of 0’s and 1’s) and returns the x_i values chosen, along with how many slots were available for each choice. Here is the function to do that:

def scan(value):
    # break a 5-element result into (number of 1s picked, slots available) pairs
    groups = []
    prev = False   # are we currently inside a run of 1s?
    s = 0          # length of the current run of 1s
    for i in xrange(5):
        if value[i] == 0:
            if prev:
                # a run of 1s just hit its forced trailing 0; the run started
                # at index i-s, so 5-(i-s) slots were available at that decision
                groups.append((s,5-(i-s)))
                prev = False
                s = 0
            else:
                # a standalone 0 was picked with 5-i slots remaining
                groups.append((0,5-i))
        else:
            prev = True
            s += 1
    if prev:
        # a run of 1s ends the series, so the number picked equals
        # the number of slots that were available (this works out to (s, s))
        groups.append((s,4-(i-s)))
    return groups

Here is an example usage:

scan([0,1,0,0,1])
> [(0, 5), (1, 4), (0, 2), (1, 1)]

The output tells us that we picked x_i = 0 when we could have picked up to 5, picked 1 when we could have picked up to 4, picked 0 when up to 2 were available, and then picked 1 when only 1 was available. Those results are then used by the rule (1) likelihood function:

def like1(v,p):
    # rule (1): probability of the unique action sequence behind outcome v
    l = 1
    groups = scan(v)
    for n, s in groups:
        # probability of picking n, renormalized over the feasible choices 0..s
        l *= p[n]/sum(p[:s+1])
    return l

All we’re doing is multiplying the probabilities of picking x_i from the feasible set. As a double-check, I ran that function over all possible data points (there are only 32 possible from the rule set), and the total probability summed to 1! That means we are in business. The likelihood function for rule (2) is:

def like2(v,p):
    # rule (2): truncation means several action choices can give the same outcome
    l = 1
    groups = scan(v)
    for n, s in groups:
        if n != s:
            # the run didn't fill the remaining slots, so exactly n was picked
            l *= p[n]
        else:
            # the run filled the remaining slots: picking n or anything larger
            # would have been truncated to this same outcome
            l *= sum(p[n:])
    return l

Any time the number of 1s placed equals the number of slots that were available, it means that we could have picked that many or more, with the series truncated at 5.
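Extending the earlier double-check (the post only mentions doing this for rule (1)), both likelihood functions should sum to 1 over all 32 possible outcomes for any valid probability vector. Here is a quick sketch using the example probabilities from earlier:

from itertools import product

p = [0.2, 0.3, 0.2, 0.15, 0.1, 0.05]
# sum each likelihood over every possible 5-element outcome
print(sum(like1(v, p) for v in product([0, 1], repeat=5)))   # -> 1.0 (up to floating point)
print(sum(like2(v, p) for v in product([0, 1], repeat=5)))   # -> 1.0 (up to floating point)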

PyMC Model Setup

To set up the PyMC model, we first get the data. The data was posted to the Reddit thread, but you can grab it at the bottom of the post, too. The data is pairs of (result, count). Since there are only 32 unique results, it’s easiest to just calculate the likelihood once, and then multiply. Here’s reading and processing the data:

import pymc as pm
import numpy as np
from collections import Counter

# aggregate the observations into a Counter keyed by the outcome tuple
data = Counter()
with open("./Data.txt","rb") as f:
    for line in f.readlines():
        d,n = line.split("\t")               # e.g. "01010\t31"
        dl = tuple([int(x) for x in d])      # -> (0, 1, 0, 1, 0)
        data[dl] += int(n)
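As a quick peek at what ends up in the Counter (illustrative, assuming the tab-delimited file from the appendix):

print(data.most_common(3))
# -> [((1, 1, 1, 0, 1), 107), ((1, 1, 0, 1, 0), 99), ((1, 0, 1, 1, 0), 89)]
print(sum(data.values()))
# -> 959 total observations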

Now we can define the model itself:

# define the Dirichlet prior, with the Jeffreys hyperparameter
# (note 0.5 rather than 1/2, which is integer division in Python 2)
probs = pm.Dirichlet(name="probs",theta=[0.5]*6)
# for bookkeeping, PyMC gives us a deterministic that includes all terms of the Dirichlet;
# by default, PyMC's Dirichlet doesn't track the last entry, because it is determined by the others
probs_full = pm.CompletedDirichlet("full_probs",probs)

# define our custom stochastic; the decorated function returns the log-likelihood
@pm.stochastic(dtype=tuple,observed=True,name="output")
def getlike(value=data,probs=probs):
    all_l = 0
    # optionally, use probs_full here.
    # I have it this way to help remind myself that the base Dirichlet
    # leaves off the last probability
    ps = np.hstack((probs,1-probs.sum()))
    # sum the log-likelihood over the unique outcomes, weighted by their counts
    for v,n in value.iteritems():
        l = like1(v,ps)
        all_l += np.log(l)*n
    return all_l

The custom stochastic wraps around the likelihood function we defined, and returns a sum of the log likelihoods to PyMC. From here, all we have to do is run the model. I didn’t do the usual MAP run beforehand, and everything still ended up fine. Here’s the command to run it:

model = pm.Model([probs,getlike,probs_full])
M = pm.MCMC(model)
M.use_step_method(pm.AdaptiveMetropolis,probs)
M.sample(iter=150000, burn=50000, thin=15)
> [-----------------100%-----------------] 150000 of 150000 complete in 108.9 sec

I found that the Adaptive Metropolis sampling gave some better diagnostics, so I went with it. The differences in results were not significant, though.
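For reference, the summary table and the trace/histogram figures shown below can be produced with PyMC's built-in helpers. This is a sketch; the reshape assumes the sample shape that CompletedDirichlet traces typically have:

# print posterior summaries for all tracked nodes (this produces the tables below)
M.summary()

# trace and histogram plots for every node, saved as PNGs in the working directory
pm.Matplot.plot(M)

# pull the raw samples of the completed probability vector and compute posterior means
samples = M.trace("full_probs")[:]
print(samples.reshape(-1, 6).mean(axis=0))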

Results

The summary from PyMC, for rule (1), is:

full_probs:
 
	Mean             SD               MC Error        95% HPD interval
	------------------------------------------------------------------
	0.211            0.008            0.0              [ 0.197  0.227]
	0.317            0.01             0.001            [ 0.298  0.338]
	0.269            0.01             0.001            [ 0.251  0.288]
	0.194            0.01             0.001            [ 0.173  0.212]
	0.009            0.003            0.0              [ 0.005  0.015]
	0.001            0.001            0.0              [ 0.     0.002]

The trace and histogram images are:

[Figures: trace and posterior histogram plots for full_probs under rule (1)]

Doing the same thing, but for rule (2), gives the following:

full_probs:
 
	Mean             SD               MC Error        95% HPD interval
	------------------------------------------------------------------
	0.258            0.009            0.0              [ 0.241  0.276]
	0.323            0.01             0.0              [ 0.303  0.342]
	0.244            0.01             0.0              [ 0.226  0.266]
	0.167            0.009            0.0              [ 0.151  0.185]
	0.008            0.003            0.0              [ 0.004  0.013]
	0.0              0.001            0.0              [ 0.     0.002]

[Figures: trace and posterior histogram plots for full_probs under rule (2)]

Wrap-Up

This analysis doesn’t give us evidence for whether rule (1) or rule (2) is the correct one. We do get to see how the probabilities of each action being selected differ between the two rules, though. I’ve plotted histograms of the different traces (the posterior distributions of the probabilities) on top of each other to better visualize the differences.

[Figures: overlaid posterior histograms of each p_i under rule (1) and rule (2)]
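Here is a sketch of how one might build that overlay with matplotlib, assuming trace_rule1 and trace_rule2 hold the full_probs samples (reshaped to (n_samples, 6)) from two separate fits, one per rule:

import matplotlib.pyplot as plt

fig, axes = plt.subplots(2, 3, figsize=(12, 6))
for i, ax in enumerate(axes.flat):
    # overlay the posterior histograms of p_i under the two rule sets
    ax.hist(trace_rule1[:, i], bins=50, alpha=0.5, label="rule (1)")
    ax.hist(trace_rule2[:, i], bins=50, alpha=0.5, label="rule (2)")
    ax.set_title("p_%d" % i)
    ax.legend()
plt.tight_layout()
plt.show()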

The two rule sets imply very different probabilities of selecting a 0. There’s a lot of overlap for selecting 1, 4, and 5, though. Again, since the model finds fairly tight posteriors under both rule sets, this comparison doesn’t tell us which rule is actually being used.

I hope you found this helpful or interesting!

Appendix: Data

I just used this in a tab-delimited file, and a quick copy/paste to a text file should have you up and running.

00000 3
00001 1
00010 4
00011 1
00100 6
00101 10
00110 11
00111 5
01000 6
01001 8
01010 31
01011 15
01100 26
01101 34
01110 32
01111 1
10000 8
10001 12
10010 21
10011 12
10100 43
10101 57
10110 89
10111 71
11000 33
11001 40
11010 99
11011 88
11100 76
11101 107
11110 9
11111 0