Efficient MCMC with Caching

This post is part of a running series on Bayesian MCMC tutorials. For updates, follow @StableMarkets.

Metropolis Review

Metropolis-Hastings is an MCMC algorithm for drawing samples from a distribution known up to a constant of proportionality, p(\theta | y) \propto p(y|\theta)p(\theta). Very briefly, the algorithm works by starting with some initial draw \theta^{(0)} then running through the following process for t=1, \dots, T times:

  1. Propose a draw from a given proposal distribution \theta^* \sim p^*(\theta).
  2. Evaluate posterior up to a proportionality constant under this draw: T(\theta^*) = p(y|\theta^*)p(\theta^*).
  3. Evaluate the posterior up to a proportionality constant under the previous draw: T(\theta^{(t-1)}) = p(y|\theta^{(t-1)})p(\theta^{(t-1)} )
  4. If T(\theta^*) \geq T(\theta^{(t-1)}), then we accept \theta^* as a draw from p(\theta | y) and set \theta^{(t)} \leftarrow  \theta^*
  5. If T(\theta^*) < T(\theta^{(t-1)}), then we accept \theta^* with probability T(\theta^*)/T(\theta^{(t-1)}) and set \theta^{(t)} \leftarrow  \theta^* or else we reject it and retain the previous estimate, \theta^{(t)} \leftarrow \theta^{(t-1)}.


These terms “accept” and “reject” give the impression that we either “keep” or “toss” the results from each iteration. But it’s important to note that while we can toss out \theta^* if it is rejected, the evaluations T(\theta^*) and T(\theta^{(t-1)}) should not be tossed. They should be cached and used for future iterations.

If accepted, then in the next iteration we will need to compare T(\cdot ) under the new proposal with T(\cdot ) from this accepted proposal – which we have from the previous iteration. It’s what we used to accept this proposal in the first place.

If rejected, then in the next iteration we will need to compare T(\cdot ) under a new proposal against T(\cdot ) from \theta^{(t-1)} – which we have from the previous iterations.

Logistic Regression

Caching saves us one T(\cdot) evaluation per iteration t – cutting complexity by half. To give you a sense of how much computation this saves, consider the case of a logistic regression with binary n \times 1 vector Y, n \times p covariate matrix X, and p\times 1 parameter vector \beta.

Y | \beta \sim Ber\Big( expit( X\beta  ) \Big)

Where expit() is applied element-wise to yield an n\times 1 vector p=(p_1, p_2, \dots, p_n)'=expit( X\beta  ). For simplicity, let’s say we have an improper, uniform prior. Then,

\log T(\beta) = p(Y|\beta) =  \log \sum_{i=1}^n y_i \log[ p_i] + (1 - y_i) \log[ 1 - p_i]

This is a costly evaluation. Computing T(\cdot) requires matrix multiplication – \mathcal{O}(np) operation. Without caching, we’d have to do this multiplication twice – once under the proposal and once under the current draw.

Example in R

I wrote a short code implementing both a vanilla Metropolis algorithm (no caching) and a Metropolis with caching for a logistic regression with 50,000 observations and four parameters. I used microbenchmark package to compare run time. The vanilla algorithm’s runtime is about twice as long as the algorithm using caching – as expected.

The relative runtime is actually slightly less than 2 because we still need to do the work of storing the evaluation at each step, but still a huge efficiency gain.

# full code at
# https://github.com/stablemarkets/BayesianTutorials/blob/master/MH_with_caching/
# Author: Arman Oganisian
source("HelperFunctions.R") # contains mh_vanilla and mh_cache
# hyper-parameters and true values
true_beta <- matrix(c(0,2,1,2),ncol=1)
# simulate covariates
X1 <- rnorm(N)
X2 <- rnorm(N)
X3 <- rnorm(N)
X <- model.matrix(~ X1 + X2 + X3)
# simulate outcome
Y <- rbinom(n = N, size = 1, prob = invlogit( X %*% true_beta ) )
## Run benchmark
# Run Vanialla Metropolis
MH_vanilla = mh_vanilla(beta_0 = c(0,0,0,0), # initial value
p=4, # number of parameters
phi = phi, lambda = lambda, #hyperparameters
X = X, Y = Y, # Data
#iterations and proposal variance
mh_trials=1000, jump_v=.2 ),
# Run Metropolis with Cache
MH_cache = mh_cache(beta_0 = c(0,0,0,0),
phi = phi, lambda = lambda, X = X, Y = Y,
mh_trials=1000, jump_v=.2, p=4),
times = 10)

5 thoughts on “Efficient MCMC with Caching

  1. There are more efficient ways to make use of the computations of $T(\theta^*)$ and $T(\theta^{(t-1)})$, like Rao-Blackwellisation (Gelfand & Smith, 1990; Casella & Robert, 1996; Douc & Robert, 2011) and prefetching (Banterlé et al. (2015).

    1. I’ve come across Rao-Blackwellization, but not prefetching. Thank you for the reference. By the way, I’ve been following your blog for quite some time and really enjoy your content!

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s