Metropolis-in-Gibbs Sampling and Runtime Analysis with Profviz

First off, here are the previous posts in my Bayesian sampling series:

In the first post, I illustrated Gibbs Sampling – an algorithm for getting draws from a posterior when conditional posteriors are known. In the second post, I showed that if we can vectorize, then drawing a whole “block” per iteration will increase the speed of the sampler.

For many models, like logistics models, there are no conjugate priors – so Gibbs is not applicable. And as we saw in the first post, the brute force grid method is much too slow to scale to real-world settings.

This post shows how we can use Metropolis-Hastings (MH) to sample from non-conjugate conditional posteriors within each blocked Gibbs iteration – a much better alternative than the grid method.

I’ll illustrate the algorithm, give some R code results (all code posted on my GitHub), and then profile the R code to identify the bottlenecks in the MH algorithm.

The Model

The simulated data for this example is a cross-sectional dataset with N=1000 patients. There is one binary outcome, Y , a binary treatment variable, A, and one confounder, age. Age is a categorical variable with 3 levels. I control for it using 2 dummies (with one category as reference). I model this with a bayesian logistic regression:

logit(Y) = \beta_0 + \beta_1A + \beta_2age_1 + \beta_3age_2 = X\vec \beta

\beta_0, \beta_1, \beta_2, \beta_3 \sim N(\lambda, \phi)

Above, \lambda is assumed known. Note I’m using X to denote the 1000×4 model matrix and \vec \beta to denote the 4×1 parameter vector. I also place an inverse gamma prior on \phi with known hyper-parameters. This is a fairly realistic motivating example for the Metropolis-in-Gibbs:

  1. We have a binary outcome for which we employ a non-linear link function.
  2. We have a confounder for which we need to adjust.
  3. We are estimating more parameters that we care about. In this setting, we really care about the estimate of the treatment effect \beta_1, so the other coefficients are in some sense nuisance parameters. I wouldn’t say this is a “high-dimensional” setting, but it’s definitely going to strain the sampler.

Unnormalized Conditional Posteriors

Let’s look at the (unnormalized) conditional posteriors of this model. I won’t go through the derivation, but it follows the same procedure used in my previous posts.
log f( \vec \beta | \phi, X, Y) \propto \sum_{i=1}^n y_i\cdot log(p_i) + (1 - y_i) \cdot log(1 - p_i) - \frac{1}{2\phi}(\vec \beta - \vec \lambda)'(\vec \beta - \vec \lambda)
Recall we are modeling p_i = expit(x_i' \vec \beta). Here x_i denotes the i^{ith} row of the model matrix, X, and i indicates the patient.

There is no conjugacy here! This conditional distribution is not a known distribution so we can’t simply sample from it using Gibbs. Rather, within each gibbs iteration we need another sampling step to draw from this conditional posterior. This second sampler will be the MH sampler. If you need a refresher on Gibbs, see the previous posts linked above.

Side note: the conditional posterior of \phi is conjugate. So within each Gibbs iteration we can use a standard function to sample from the inverse gamma. No need for a second sampling step here. Phew!

Metropolis-Hastings

The goal is to sample from log f(\vec \beta | \phi, X, Y). Note this is a 4-dimensional density. For simplicity of explanation, assume we only have one \beta and that it’s conditional density is unimodal. The MH sampler works as follow:

  1. Make an initial “proposal” of \beta^{(0)} to get the sampling started.
  2. Draw from a distribution centered around \beta^{(0)}. This is called a proposal distribution and, in the standard case, must be symmetric around \beta^{(0)}. E.g. we could use a N(\beta^{(0)}, \sigma^2) . For now let’s assume we set the variance of the proposal distribution to some constant.
  3. Now we take a new draw from the proposal distribution \beta^* . We then calculate the ratio of the unnormalized densities evaluated at the previous draw, \beta^{(0)} and the current proposal, \beta^* :   r = \frac{f(\beta^*)}{f(\beta^{(0)})}
  4. If this ratio is greater than 1, then the density at the current proposal is higher than the density at the previous value. So we “accept” the proposal and set \beta^{(1)} = \beta^*. Then we repeat steps 2-4 using a proposal distribution centered around \beta^{(1)} and then generating a new proposal. If the ratio is less than 1, then the density at the current proposed value is lower than the previous proposal. In this case, we accept \beta^* with probability r.

So, proposals that yield a higher conditional posterior evaluation are always accepted. However, proposals with lower density evaluations are only sometimes accepted – the lower the relative density evaluation of the proposal, the lower the probability of its acceptance (intuitive!). Over many iterations, draws from the posterior’s high density areas are accepted and the sequence of accepted proposals “climbs” to the high density area. Once the sequence arrives in this high-density area, it tends to remain there. So, in many ways you can view MH as a stochastic “hill-climbing” algorithm. My CompSci friend tells me it is also similar to something called simulated annealing.

The notation extends easily to our 4-dimensional example: the proposal distribution is now a 4-dimensional multivariate Gaussian. Instead of a scalar variance parameter \sigma^2, we have a covariance matrix. Our proposal is therefore a vector of coefficients. In this sense, we are running a blocked Gibbs – using MH to draw a whole block of coefficients per iteration.

Some comments:

  • The variance of the jumping distribution is an important parameter. If the variance is too low, the current proposal will likely be very close to the last value and so r will be close to 1. We will therefore accept very frequently, but because the accepted values are so close to each other we climb to the high density area slowly over many iterations. If the variance is too high, the sequence may fail to remain in the high density area once it arrives there. Literature suggests that in high dimensions (more than 5 parameters), the optimal acceptance rate is about 24%.
  • Many of the “adaptive” MH methods are variants of the basic algorithm described here, but include a tuning period to find the jumping distribution variance that yields the optimal acceptance rate.
  • The most computationally intensive part of MH is the density evaluations. For each Gibbs iteration, we have to evaluate the 4-dimensional density twice: once at the current proposal and once at the previously accepted proposal.
  • Although the notation extends to high dimensions easily, the performance itself worsens in higher dimensions. The reasons for this is quite technical but super interesting. This paper by Michael Betancourt explains the shortfalls of Gibbs and MH in higher dimensions and outlines how Hamiltonian Monte Carlo (HMC) overcomes these difficulties. As I understand it: in higher dimensions, density does not equal volume. Since getting to the high-volume regions is really what we want, and since standard MH gets to high-density regions, in high dimensions MH fails to explore high-volume areas efficiently. In low dimensions, density approximates volume well so it’s not an issue.

Results

All (commented!) code producing these results is available on my Github. So here are the MCMC chains of our 4 parameters of interest. The red lines indicate true values.

LogitMCMCChains.png

There’s some room for improvement:

  • The acceptance rate is only 18%, I could have tuned the jumping distribution covariance matrix to have a more optimal rate.
  • I think more iterations would definitely help here. These chains look okay, but still fairly autocorrelated.

The nice thing about about the Bayesian paradigm is that all inference is done using the posterior distribution. The coefficient estimates now are on the log scale, but if we wanted odds ratios, we just exponentiate the posterior draws. If we want an interval estimate of the odds ratio, then we could just grab the 2.5 and 97.5 percentiles of the exponentiated posterior draws. It’s as simple as that – no delta-method junk.

I mentioned before that MH is costly because the log posterior must be evaluated twice per Gibbs iteration. Below is a profile analysis using the R package profviz showing this. The for-loop runs a Gibbs iteration. Within each Gibbs iteration, I call the function rcond_post_beta_mh() which uses MH to produce a draw from the conditional posterior of the parameter vector. Notice that it takes up the bulk of the runtime.

Screen Shot 2017-11-06 at 9.37.11 PM

Diving into the rcond_post_beta_mh(), we see that the subroutine log_cond_post_beta() is the bottleneck in the MH run. This function is the log conditional posterior density of the beta vector, which is evaluated twice.

Screen Shot 2017-11-06 at 9.41.55 PM.png

That’s all for now. Feel free to leave comments if you see errors (I type these up fairly quickly so I apologize in advance).

I may do another post diving deeper into Hamiltonian Monte Carlo, which I alluded to earlier in this post. More likely, this will be my last post on sampling methods. I’d like to move on to some Bayesian nonparametrics in the future.

13 thoughts on “Metropolis-in-Gibbs Sampling and Runtime Analysis with Profviz

  1. 1. Gibbs can still work without conjugate prior. Try slice sampling.
    2. Gibbs is simulating true distribution each iteration, but MH is not. to its core, it is multivariate normal with rate of rejection to move. This makes MH longer to converge, and behave like localized and correlated multivariate normal rather than posterior. Usually you would take every 10 simulated results to avoid MH’s correlation and normality.

    1. I should clarify that when I say Gibbs isn’t applicable I am referring to Conjugate Gibbs, which I used in the previous two posts. If you don’t have conjugate you need an additional layer of sampling within each Gibbs step.

      I think you’re saying that an alternative to doing MH within each Gibbs step is to slice within each Gibbs step? Let me know if I misunderstood.

      Great points re MH being fundamentally Gaussian and an approximation.

      1. Thx for the reply. Yes I do mean adding slice at each Gibbs step. Conjugate prior is very restrictive and can be a bit problematic to generalize.

  2. I may well be confused here myself, but I struggle to understand why sampling phi is necessary at all. In my understanding, we cannot identify an error variance in a logit (as well as probit) model anyhow, so that we are effectively simulating the distribution of an unidentified parameter.

    As far as the prior variance for the beta is concerned, is there (“hierarchical Bayes”?) a benefit to specifying further hyperparameters for phi (i.e., alpha, gamma) rather than to set a prior variance phi directly?

    In fact, if I replace

    mh_draw <- rcond_post_beta_mh(gibbs_res[i-1,1:p], gibbs_res[i,p+1], lambda, X, Y, mh_trials=5, jump_v=.01)

    with

    mh_draw <- rcond_post_beta_mh(gibbs_res[i-1,1:p], 1, lambda, X, Y, mh_trials=5, jump_v=.01)

    I obtain results that, when looking at the posterior mean (I did not look at anything else) are, if anything, closer to the true parameter values.

    Also, I could not, again in terms of the posterior mean (there seems to be more autocorrelation), discern a major effect of setting mh_trials=5 rather than mh_trials=1.

    Finally, is the initialization phi <- 10000 used anywhere?

    1. Hi Christoph. Thanks for the comment. I’ll try to address all your points. Re necessity of phi: we definitely need phi since it appears in the conditional posterior for beta. We can either sample it or assume it to be a known constant. Setting phi=1 is doing the latter. Essentially you’re saying that a priori you think reasonable values of beta are between 3*(-1) + lambda and 3*(1) + lambda. In this case, that’s actually true! All of the true betas are in that range. So essentially this is a highly informative prior around the true values. It doesn’t surprise me that you get better answers with this method. Let’s say the true beta for treatment was not 1.1, but 10. You’ll need an a lot of data for the sampler to overcome your highly informative prior and move towards 10. By placing a prior inverse gamma on phi which allows for very large values of phi, I accomplish two things: 1) large plausible values of phi –> large plausible values for beta a priori. So it avoids the case described above. 2) I’m letting each beta have it’s own value of phi, rather than saying a priori that all betas have the same phi. That is, I’m allowing for different prior uncertainties for each parameters. Of course this increased flexibility comes at a cost: it takes longer for the sampler to converge.

      Re mh_trials. So this is controlling how many MH iterations we want to run within each Gibbs. “T” who commented above noted that MH is an approximation to the true posterior. In well-behaved problems, you don’t need to do many metropolis iterations – you can get away with just doing 1. But in ill-conditioned problems you may need more. So it’s problem specific. Bayesian software that do this typically include tuning periods that automatically set this and the jumping distribution variance to “good” values.

      The initialization of phi <-1000 doesn't do anything. Thanks for catching this. It's a hold over from when I was sampling beta first and so needed initial phi values. Now I sample phi first so I need initial beta values. And all results are stored in a single matrix, gibbs_res, as opposed to separate vectors.

      1. Thanks, Arman!

        Regarding bottlenecks in the code, would it not be much faster to compute the likelihood as

        xb <- X%*%beta
        p <- invlogit(xb)
        lik <- sum(Y*log(p) + (1-Y)*log(1-p))

        ?

      2. Yeah it could be faster! The bulk of the computational efficiency with my code comes from updating lik n times with the for loop. The problem is not really the for-loop itself. Say if I preallocated memory and made lik a vector of length n, then filled in lik with each iteration of the for-loop. Then calculated sum(lik). I think this would be as fast as the matrix multiplication you propose. Because both methods require 2pn (p=length of beta, n = # rows of x) operations when computing Xb, whether or not you run it in a for-loop. So the real inefficiency comes from not pre-allocating memory.

        The broader point I wanted to make with the profviz is that the MH algo has the fundamental bottleneck of needing to evaluate the log posterior twice within each iteration. You can optimize this evaluation, as you suggest, but you can’t get rid of it. Other sampling algorithms don’t have this drawback, but they have other trade-offs.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s