First off, here are the previous posts in my Bayesian sampling series:
- Bayesian Simple Linear Regression with Gibbs Sampling in R
- Blocked Gibbs Sampling in R for Bayesian Multiple Linear Regression
In the first post, I illustrated Gibbs Sampling – an algorithm for getting draws from a posterior when conditional posteriors are known. In the second post, I showed that if we can vectorize, then drawing a whole “block” per iteration will increase the speed of the sampler.
For many models, like logistics models, there are no conjugate priors – so Gibbs is not applicable. And as we saw in the first post, the brute force grid method is much too slow to scale to real-world settings.
This post shows how we can use Metropolis-Hastings (MH) to sample from non-conjugate conditional posteriors within each blocked Gibbs iteration – a much better alternative than the grid method.
I’ll illustrate the algorithm, give some R code results (all code posted on my GitHub), and then profile the R code to identify the bottlenecks in the MH algorithm.
The simulated data for this example is a cross-sectional dataset with patients. There is one binary outcome, , a binary treatment variable, , and one confounder, age. Age is a categorical variable with 3 levels. I control for it using 2 dummies (with one category as reference). I model this with a bayesian logistic regression:
Above, is assumed known. Note I’m using to denote the 1000×4 model matrix and to denote the 4×1 parameter vector. I also place an inverse gamma prior on with known hyper-parameters. This is a fairly realistic motivating example for the Metropolis-in-Gibbs:
- We have a binary outcome for which we employ a non-linear link function.
- We have a confounder for which we need to adjust.
- We are estimating more parameters that we care about. In this setting, we really care about the estimate of the treatment effect , so the other coefficients are in some sense nuisance parameters. I wouldn’t say this is a “high-dimensional” setting, but it’s definitely going to strain the sampler.
Unnormalized Conditional Posteriors
Let’s look at the (unnormalized) conditional posteriors of this model. I won’t go through the derivation, but it follows the same procedure used in my previous posts.
Recall we are modeling . Here denotes the row of the model matrix, , and indicates the patient.
There is no conjugacy here! This conditional distribution is not a known distribution so we can’t simply sample from it using Gibbs. Rather, within each gibbs iteration we need another sampling step to draw from this conditional posterior. This second sampler will be the MH sampler. If you need a refresher on Gibbs, see the previous posts linked above.
Side note: the conditional posterior of is conjugate. So within each Gibbs iteration we can use a standard function to sample from the inverse gamma. No need for a second sampling step here. Phew!
The goal is to sample from . Note this is a 4-dimensional density. For simplicity of explanation, assume we only have one and that it’s conditional density is unimodal. The MH sampler works as follow:
- Make an initial “proposal” of to get the sampling started.
- Draw from a distribution centered around . This is called a proposal distribution and, in the standard case, must be symmetric around . E.g. we could use a . For now let’s assume we set the variance of the proposal distribution to some constant.
- Now we take a new draw from the proposal distribution . We then calculate the ratio of the unnormalized densities evaluated at the previous draw, and the current proposal, :
- If this ratio is greater than 1, then the density at the current proposal is higher than the density at the previous value. So we “accept” the proposal and set . Then we repeat steps 2-4 using a proposal distribution centered around and then generating a new proposal. If the ratio is less than 1, then the density at the current proposed value is lower than the previous proposal. In this case, we accept with probability .
So, proposals that yield a higher conditional posterior evaluation are always accepted. However, proposals with lower density evaluations are only sometimes accepted – the lower the relative density evaluation of the proposal, the lower the probability of its acceptance (intuitive!). Over many iterations, draws from the posterior’s high density areas are accepted and the sequence of accepted proposals “climbs” to the high density area. Once the sequence arrives in this high-density area, it tends to remain there. So, in many ways you can view MH as a stochastic “hill-climbing” algorithm. My CompSci friend tells me it is also similar to something called simulated annealing.
The notation extends easily to our 4-dimensional example: the proposal distribution is now a 4-dimensional multivariate Gaussian. Instead of a scalar variance parameter , we have a covariance matrix. Our proposal is therefore a vector of coefficients. In this sense, we are running a blocked Gibbs – using MH to draw a whole block of coefficients per iteration.
- The variance of the jumping distribution is an important parameter. If the variance is too low, the current proposal will likely be very close to the last value and so will be close to 1. We will therefore accept very frequently, but because the accepted values are so close to each other we climb to the high density area slowly over many iterations. If the variance is too high, the sequence may fail to remain in the high density area once it arrives there. Literature suggests that in high dimensions (more than 5 parameters), the optimal acceptance rate is about 24%.
- Many of the “adaptive” MH methods are variants of the basic algorithm described here, but include a tuning period to find the jumping distribution variance that yields the optimal acceptance rate.
- The most computationally intensive part of MH is the density evaluations. For each Gibbs iteration, we have to evaluate the 4-dimensional density twice: once at the current proposal and once at the previously accepted proposal.
- Although the notation extends to high dimensions easily, the performance itself worsens in higher dimensions. The reasons for this is quite technical but super interesting. This paper by Michael Betancourt explains the shortfalls of Gibbs and MH in higher dimensions and outlines how Hamiltonian Monte Carlo (HMC) overcomes these difficulties. As I understand it: in higher dimensions, density does not equal volume. Since getting to the high-volume regions is really what we want, and since standard MH gets to high-density regions, in high dimensions MH fails to explore high-volume areas efficiently. In low dimensions, density approximates volume well so it’s not an issue.
All (commented!) code producing these results is available on my Github. So here are the MCMC chains of our 4 parameters of interest. The red lines indicate true values.
There’s some room for improvement:
- The acceptance rate is only 18%, I could have tuned the jumping distribution covariance matrix to have a more optimal rate.
- I think more iterations would definitely help here. These chains look okay, but still fairly autocorrelated.
The nice thing about about the Bayesian paradigm is that all inference is done using the posterior distribution. The coefficient estimates now are on the log scale, but if we wanted odds ratios, we just exponentiate the posterior draws. If we want an interval estimate of the odds ratio, then we could just grab the 2.5 and 97.5 percentiles of the exponentiated posterior draws. It’s as simple as that – no delta-method junk.
I mentioned before that MH is costly because the log posterior must be evaluated twice per Gibbs iteration. Below is a profile analysis using the R package profviz showing this. The for-loop runs a Gibbs iteration. Within each Gibbs iteration, I call the function rcond_post_beta_mh() which uses MH to produce a draw from the conditional posterior of the parameter vector. Notice that it takes up the bulk of the runtime.
Diving into the rcond_post_beta_mh(), we see that the subroutine log_cond_post_beta() is the bottleneck in the MH run. This function is the log conditional posterior density of the beta vector, which is evaluated twice.
That’s all for now. Feel free to leave comments if you see errors (I type these up fairly quickly so I apologize in advance).
I may do another post diving deeper into Hamiltonian Monte Carlo, which I alluded to earlier in this post. More likely, this will be my last post on sampling methods. I’d like to move on to some Bayesian nonparametrics in the future.