Metropolis-Hastings is an MCMC algorithm for drawing samples from a distribution known up to a constant of proportionality, $f(\theta) \propto p(\theta \mid y)$. Very briefly, the algorithm works by starting with some initial draw $\theta^{(0)}$ and then running through the following process for $t = 1, \dots, T$:
- Propose a draw $\theta^*$ from a given proposal distribution $q(\theta^* \mid \theta^{(t-1)})$ (assumed symmetric here, as in the original Metropolis algorithm).
- Evaluate the posterior up to a proportionality constant under this draw: $f(\theta^*)$.
- Evaluate the posterior up to a proportionality constant under the previous draw: $f(\theta^{(t-1)})$.
- If $f(\theta^*) \geq f(\theta^{(t-1)})$, then we accept $\theta^*$ as a draw from $p(\theta \mid y)$ and set $\theta^{(t)} = \theta^*$.
- If $f(\theta^*) < f(\theta^{(t-1)})$, then we accept $\theta^*$ with probability $f(\theta^*) / f(\theta^{(t-1)})$ and set $\theta^{(t)} = \theta^*$, or else we reject it and retain the previous draw, $\theta^{(t)} = \theta^{(t-1)}$.
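The steps above can be sketched in R. This is a minimal illustration, not the post's actual script: it assumes a symmetric random-walk proposal and uses a toy unnormalized target (a standard normal kernel with the normalizing constant dropped).

```r
set.seed(1)

# Toy target, known only up to a constant: we drop 1/sqrt(2*pi).
f <- function(theta) exp(-theta^2 / 2)

# One Metropolis iteration with a symmetric random-walk proposal.
metropolis_step <- function(theta_curr) {
  theta_prop <- theta_curr + rnorm(1, sd = 1)   # propose theta*
  r <- f(theta_prop) / f(theta_curr)            # acceptance ratio f(theta*) / f(theta^(t-1))
  if (runif(1) < r) theta_prop else theta_curr  # accept with probability min(1, r)
}

# Run the chain for n_iter iterations from an initial draw of 0.
n_iter <- 5000
draws <- numeric(n_iter)
draws[1] <- 0
for (t in 2:n_iter) draws[t] <- metropolis_step(draws[t - 1])
```

Note that `metropolis_step` calls `f` twice per iteration, once for the proposal and once for the current draw; that redundancy is exactly what caching removes.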
These terms “accept” and “reject” give the impression that we either “keep” or “toss” the results from each iteration. But it’s important to note that while we can toss out $\theta^*$ if it is rejected, the evaluations $f(\theta^*)$ and $f(\theta^{(t-1)})$ should not be tossed. The evaluation of whichever draw we retain should be cached and used in future iterations.
If accepted, then in the next iteration we will need to compare $f$ under the new proposal with $f(\theta^{(t)}) = f(\theta^*)$ from this accepted proposal – which we already have from the previous iteration. It’s what we used to accept this proposal in the first place.

If rejected, then in the next iteration we will need to compare $f$ under a new proposal against $f(\theta^{(t)}) = f(\theta^{(t-1)})$ – which we already have from previous iterations.
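In code, the natural way to implement this is to carry the cached evaluation along with the draw as part of the chain's state. A sketch, again with a toy standard-normal kernel as the target (an assumption for illustration):

```r
set.seed(2)

# Toy unnormalized target (standard normal kernel).
f <- function(theta) exp(-theta^2 / 2)

# Cached Metropolis step: the state is the draw plus its cached target value,
# so f() is evaluated only once per iteration (on the proposal).
metropolis_step_cached <- function(state) {
  theta_prop <- state$theta + rnorm(1, sd = 1)
  f_prop <- f(theta_prop)                 # the only evaluation this iteration
  if (runif(1) < f_prop / state$f) {
    list(theta = theta_prop, f = f_prop)  # accepted: cache f(theta*)
  } else {
    state                                 # rejected: keep the cached f(theta^(t-1))
  }
}

state <- list(theta = 0, f = f(0))
for (t in 1:1000) state <- metropolis_step_cached(state)
```

By construction, the cached value always matches a fresh evaluation of the current draw, so the chain is identical to the uncached version; only the redundant computation is gone.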
Caching saves us one evaluation of $f$ per iteration – cutting the number of costly posterior evaluations in half. To give you a sense of how much computation this saves, consider the case of a logistic regression with binary vector $y$ ($n \times 1$), covariate matrix $X$ ($n \times p$), and parameter vector $\theta$ ($p \times 1$):

$$p(y \mid \theta) = \prod_{i=1}^{n} \sigma(x_i^\top \theta)^{y_i} \bigl(1 - \sigma(x_i^\top \theta)\bigr)^{1 - y_i},$$

where $\sigma(z) = 1/(1 + e^{-z})$ is applied element-wise to $X\theta$ to yield an $n \times 1$ vector $\sigma(X\theta)$. For simplicity, let’s say we have an improper, uniform prior. Then

$$f(\theta) = p(y \mid \theta) \propto p(\theta \mid y).$$

This is a costly evaluation. Computing $X\theta$ requires a matrix–vector multiplication – an $O(np)$ operation. Without caching, we’d have to do this multiplication twice per iteration – once under the proposal and once under the current draw.
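A sketch of this evaluation in R, on hypothetical simulated data. In practice one works with the log-likelihood rather than the likelihood itself for numerical stability, which changes nothing about the caching argument:

```r
set.seed(3)

# Hypothetical data, dimensions chosen for illustration.
n <- 1e5; p <- 4
X <- cbind(1, matrix(rnorm(n * (p - 1)), n, p - 1))
theta_star <- c(-0.5, 1, -1, 0.5)
y <- rbinom(n, 1, plogis(X %*% theta_star))

# Log-likelihood of a logistic regression. The O(np) matrix-vector product
# X %*% theta dominates the cost of each evaluation.
loglik <- function(theta) {
  eta <- as.vector(X %*% theta)     # the O(np) multiplication
  sum(y * eta - log1p(exp(eta)))    # Bernoulli log-likelihood with sigma folded in
}
```

The simplification `sum(y * eta - log1p(exp(eta)))` follows from expanding $\sum_i y_i \log \sigma(\eta_i) + (1 - y_i)\log(1 - \sigma(\eta_i))$.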
Example in R
I wrote a short script implementing both a vanilla Metropolis algorithm (no caching) and a Metropolis algorithm with caching for a logistic regression with 50,000 observations and four parameters, and used the microbenchmark package to compare run times. The vanilla algorithm’s runtime is about twice as long as that of the caching algorithm – as expected.
The relative runtime is actually slightly less than 2 because the caching algorithm still does the work of storing the evaluation at each step, but it is still a huge efficiency gain.
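This is not the original script, but the saving can be made exact by counting likelihood evaluations instead of timing them. The sketch below uses hypothetical simulated data at the post's dimensions and works on the log scale; the evaluation counts are deterministic regardless of which proposals get accepted:

```r
set.seed(4)

# Hypothetical data at the post's dimensions: 50,000 observations, four parameters.
n <- 5e4; p <- 4
X <- cbind(1, matrix(rnorm(n * (p - 1)), n, p - 1))
y <- rbinom(n, 1, plogis(X %*% c(-0.5, 1, -1, 0.5)))

n_evals <- 0
loglik <- function(theta) {
  n_evals <<- n_evals + 1           # count the costly evaluations
  eta <- as.vector(X %*% theta)     # the O(np) multiplication
  sum(y * eta - log1p(exp(eta)))
}

n_iter <- 200

# Vanilla: evaluate both the proposal and the current draw every iteration.
theta <- rep(0, p); n_evals <- 0
for (t in 1:n_iter) {
  prop <- theta + rnorm(p, sd = 0.05)
  if (log(runif(1)) < loglik(prop) - loglik(theta)) theta <- prop
}
vanilla_evals <- n_evals            # 2 * n_iter = 400

# Cached: evaluate only the proposal; reuse the stored value for the current draw.
theta <- rep(0, p); n_evals <- 0
ll_curr <- loglik(theta)            # one evaluation up front
for (t in 1:n_iter) {
  prop <- theta + rnorm(p, sd = 0.05)
  ll_prop <- loglik(prop)
  if (log(runif(1)) < ll_prop - ll_curr) { theta <- prop; ll_curr <- ll_prop }
}
cached_evals <- n_evals             # n_iter + 1 = 201
```

The vanilla chain performs $2T$ evaluations over $T$ iterations while the cached chain performs $T + 1$, which is where the roughly two-to-one runtime ratio comes from.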