There was a question asked on Reddit’s r/statistics by user nomm_ in this post. It sounded like the perfect problem for some Bayesian modeling, so I dusted off the PyMC Python library to tackle it. This will also serve, I hope, as a guide to others who are trying to do things like custom stochastics in PyMC that are also observed values.
The question involves estimating the probabilities of selecting a particular action several times in a row (the action involved upgrading items in a video game). There are nodes that need to be linked, and the actions decide how many to link. Several actions can take place for each result, and the actions change likelihood based on the previous actions. Let’s see how to tackle this!
Make sure to read the original post first. To generalize the problem, we could define it with the following rules:
- We are making a series of 5 integers, each is either 0 or 1
- Actions that can be taken are to decide how many 1s to place in a row, or to place a 0, and we make those decisions from left to right
- We can decide to place a single 0, or 1 to 5 1s in a row
- If a series of 1s are placed in a row, they must be followed by a 0
- Ignore that last rule if the series of 1s ends the set of 5 integers
There is a question as to whether or not one of the following two rules applies:
- You cannot decide to place more 1s than there is room left in the series
- If you decide to place more 1s than there is room for, just stop at 5
I’ll give an example. Let’s say that the actions you picked were:
Pick a 0: [0, X, X, X, X] (The x’s means not decided yet)
Pick a 1: [0, 1, 0, X, X] (notice the padding 0!)
If we were using rule (1), we would only be allowed to pick either a 0, or 1 to 2 1s. If we were using rule (2), we could pick any number of 1s. If rule (2) were used, we might expect more 1s near the end, since there are more long series we could choose from. For rule (1), we don’t have as many options.
The most critical part of the description is what the probabilities are for picking those series. The probabilities are just a list of probabilities, kind of like for a multinomial distribution, that give the probability of selecting from [0, 1, 2, 3, 4, 5]. I define those as:
If , we’d expect to pick 0 from that list 50% of the time.
Conditionality of the Probabiltiies
For rule (1), we have to decide how to deal with the limited set of choices. In the example above, we can only choose from [0, 1, 2]. We then state that their probabilities are just normalized among themselves, and used to select the rule. For example, if all the probabilities were:
Then we’d normalize the first three to get:
This lets us work with only one set of probabilities the whole time. As long as we calculate the likelihood right, we’ll be fine.
The Bayesian model that we want to use is pretty simple. We want to estimate a vector of probabilties, and that vector is used to generate some random data according to a rule set. Remembering Bayes’ Theorem:
And defining our probability vector as $latex p&s=2$, our data as , and any hyperparameters as , we get:
The data follow very specific rules, so there’s no analytical distribution to use for the likelihood (that’s why we’re using PyMC!). We’ll make that in a minute. The prior on $latex p&s=2$ is chosen to be a Dirichlet distribution. The Dirichlet is an extension of the Beta distribution, and it models the distribution of a series of parameters that must all be less than 1, and must all sum to 1. It’s the perfect distribution for modeling a list of probabilities for mutually exclusive events (it’s also the conjugate prior to the multinomial, which is great for plain categorical modeling).
The Dirichlet hyperparameters, , are a vector of values that affect the mean and variance of the resulting probabilities. We could initialize them to some value (based on an analysis of the data, perhaps), or stay objective. I chose to stay objective and use the Jeffrey’s Prior setting of 1/2 for all the hyperparameters. Since they are now constant, I’ll drop the term from the equation.
So now we have:
The likelihood function just needs to return the probability of the actions in the outcome being selected (and take the log of it, for PyMC). We have to distinguish by rules, so I made a general function that takes a data point (a vector of 0’s and 1’s) and returns the values, along with the limits to those values. Here is the function to do that:
def scan(value): groups =  prev = False s = 0 for i in xrange(5): if value[i] == 0: if prev: groups.append((s,5-(i-s))) prev = False s = 0 else: groups.append((0,5-i)) else: prev = True s += 1 if prev: groups.append((s,4-(i-s))) return groups
Here is an example usage:
scan([0,1,0,0,1]) > [(0, 5), (1, 4), (0, 2), (1, 1)]
The output tells us that we picked when we could have picked up to 5, picked 1 when we could have picked up to 4, picked 0 up to 2, and then picked 1 up to 1. Those results are then used by the rule (1) likelihood function:
def like1(v,p): l = 1 groups = scan(v) for n, s in groups: l *= p[n]/sum(p[:s+1]) return l
All we’re doing is multiplying the probabilities of picking from the feasible set. As a double-check, I ran that function over all possible data points (there are only 32 possible from the rule set), and the total probability summed to 1! That means we are in business. The likelihood function for rule (2) is:
def like2(v,p): l = 1 groups = scan(v) for n, s in groups: if n != s: l *= p[n] else: l *= sum(p[n:]) return l
Any time the number of 1’s picked is the same as the max number it could pick, it means that we could have picked more (and truncated down to 5).
PyMC Model Setup
To set up the PyMC model, we first get the data. The data was posted to the Reddit thread, but you can grab it at the bottom of the post, too. The data is pairs of (result, count). Since there are only 32 unique results, it’s easiest to just calculate the likelihood once, and then multiply. Here’s reading and processing the data:
import pymc as pm import numpy as np from collections import Counter data = Counter() with open("./Data.txt","rb") as f: for line in f.readlines(): d,n = line.split("\t") dl = tuple([int(x) for x in d]) data[dl] += int(n)
Now we can define the model itself:
# define the dirichlet prior, with Jeffrey's hyperparameter probs = pm.Dirichlet(name="probs",theta=[1/2]*6) # for bookkeeping, PyMC gives us a distribution that includes all terms of the Dirichlet # by default, PyMC's Dirichlet doesn't track the last entry, because it is deterministic probs_full = pm.CompletedDirichlet("full_probs",probs) # define our custom stochastic @pm.stochastic(dtype=tuple,observed=True,name="output") def getlike(value=data,probs=probs): all_l = 0 # optionally, use the probs_full here. # I have it this way to help remind myself that the base dirichlet # leaves off the last probability ps = np.hstack((probs,1-probs.sum())) for v,n in value.iteritems(): l = like1(v,ps) all_l += np.log(l)*n return all_l
The custom stochastic wraps around the likelihood function we defined, and returns a sum of the log likelihoods to PyMC. From here, all we have to do is run the model. I didn’t do the usual MAP run beforehand, and everything still ended up fine. Here’s the command to run it:
model = pm.Model([probs,getlike,probs_full]) M = pm.MCMC(model) M.use_step_method(pm.AdaptiveMetropolis,probs) M.sample(iter=150000, burn=50000, thin=15) > [-----------------100%-----------------] 150000 of 150000 complete in 108.9 sec
I found that the Adaptive Metropolis sampling gave some better diagnostics, so I went with it. The differences in results were not significant, though.
The summary from PyMC, for rule (1), is:
full_probs: Mean SD MC Error 95% HPD interval ------------------------------------------------------------------ 0.211 0.008 0.0 [ 0.197 0.227] 0.317 0.01 0.001 [ 0.298 0.338] 0.269 0.01 0.001 [ 0.251 0.288] 0.194 0.01 0.001 [ 0.173 0.212] 0.009 0.003 0.0 [ 0.005 0.015] 0.001 0.001 0.0 [ 0. 0.002]
The trace and histogram images are:
Doing the same thing, but for rule (2), gives the following:
full_probs: Mean SD MC Error 95% HPD interval ------------------------------------------------------------------ 0.258 0.009 0.0 [ 0.241 0.276] 0.323 0.01 0.0 [ 0.303 0.342] 0.244 0.01 0.0 [ 0.226 0.266] 0.167 0.009 0.0 [ 0.151 0.185] 0.008 0.003 0.0 [ 0.004 0.013] 0.0 0.001 0.0 [ 0. 0.002]
Evidence for whether or not rule (1) or (2) is the correct one isn’t to be found in this analysis. We do get to see how the probabilities of each action being selected are different for each rule, though. I’ve made a histogram of the different traces (the posterior distributions of the probabilities) on top of each other to better visualize the differences.
The different rule sets have very different probabilities of selecting a 0. There’s a bunch of overlap for selecting 1, 4, and 5, though. Again, since the model finds fairly confident results for both rule sets, I don’t see how this could inform knowledge of which rule is being used.
I hope you found this helpful or interesting!
I just used this in a tab-delimited file, and a quick copy/paste to a text file should have you up and running.
00000 3 00001 1 00010 4 00011 1 00100 6 00101 10 00110 11 00111 5 01000 6 01001 8 01010 31 01011 15 01100 26 01101 34 01110 32 01111 1 10000 8 10001 12 10010 21 10011 12 10100 43 10101 57 10110 89 10111 71 11000 33 11001 40 11010 99 11011 88 11100 76 11101 107 11110 9 11111 0