# Bayesian Inference with Backfitting MCMC


Recently I’ve been reading about Bayesian additive regression trees (BART). For those interested, the paper is here. I use similar notation when describing the backfitting procedure in this post. I believe this MCMC method was first described here.

BART is a Bayesian nonparametric model that can be fit using backfitting MCMC – which is cool in itself. I thought I'd write a post describing the intuition behind backfitting, along with a toy example.

All code for this post can be found in this GitHub repo. I don’t use any packages … the MCMC is implemented from scratch.

Consider data on a covariate $x$ and an outcome $y$ for $n$ subjects, $(y_i, x_i)_{1:n}$. In this toy example, the data looks something like this:

[scatter plot of the simulated data]

We might consider the following probabilistic model:

$$y_i \mid x_i, \beta, \phi \sim N(\mu_i, \phi),$$

where

$$\mu_i = \sum_{j=0}^3 x_i^j \beta_j.$$

Essentially, we model the conditional mean using a third-degree polynomial. Note that this is a special case of the more general additive model

$$\mu_i = \sum_{j=0}^k f_j(x_i)\,\beta_j,$$

where in this case $f_j(x_i) = x_i^j$ and $k=3$. The model is completed with a flat prior, $p(\beta_j) \propto 1$, on each element of the parameter vector $\beta$, and an inverse-gamma prior on the variance parameter, $\phi \sim \text{InvGamma}(0.5, 10000)$, with shape $0.5$ and rate $10000$.
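To make the setup concrete, here is a minimal simulation sketch. The true parameter values, the seed, and the use of NumPy are my assumptions for illustration – the post's repo implements everything from scratch and may use different settings:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical simulation settings for the toy example.
n = 100
true_beta = np.array([1.0, 2.0, -1.0, 0.5])    # beta_0, ..., beta_3
true_phi = 1.0                                  # outcome variance

x = rng.uniform(-2, 2, size=n)
X = np.column_stack([x**j for j in range(4)])   # columns f_j(x) = x^j, j = 0..3
y = X @ true_beta + rng.normal(0.0, np.sqrt(true_phi), size=n)
```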

The conditional posterior for each $\beta_j$ is Gaussian (because of conjugacy). We can sample from it using conjugate Gibbs or Metropolis updates, and we could even sample the whole parameter vector $\beta$ as a block. In this post, though, we'll stick to backfitting – which is itself a Gibbs sampler. We still sample from the conditional posterior of each $\beta_j$ given the other betas, denoted $\beta_{-j}$. The key insight is that this conditional posterior depends on $\beta_{-j}$ only through the residual

$$R_{ij} = y_i - \sum_{k \neq j} x_i^k \beta_k.$$

Intuitively, $R_{ij}$ is the portion of the mean of $y_i$ that remains after the portions explained by the other terms (those with index $k \neq j$) are subtracted off. It is also normally distributed:

$$R_{ij} \sim N(x_i^j\beta_j, \phi).$$

With the flat prior on $\beta_j$ (the limiting case of a normal prior as its variance grows), the posterior can be computed via conjugacy:

$$\beta_j \mid R_j, \phi \sim N\left( \frac{\sum_{i=1}^n R_{ij}\, x_i^j}{\sum_{i=1}^n [x_i^j]^2},\; \frac{\phi}{\sum_{i=1}^n [x_i^j]^2} \right).$$
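As a sketch of this update (hypothetical function and variable names, assuming the design matrix `X` from the simulation snippet above):

```python
def sample_beta_j(j, y, X, beta, phi, rng):
    """Draw beta_j from its conditional posterior, given the other betas.

    Implements the flat-prior conjugate update above using the partial
    residual R_ij = y_i - sum_{k != j} x_i^k * beta_k.
    """
    # Remove the fit of every term except the j-th to form the residual.
    R_j = y - X @ beta + X[:, j] * beta[j]
    sum_sq = np.sum(X[:, j] ** 2)
    post_mean = np.sum(R_j * X[:, j]) / sum_sq
    post_var = phi / sum_sq
    return rng.normal(post_mean, np.sqrt(post_var))
```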

The backfitting MCMC proceeds as follows. First, initialize $\phi$ and all of the betas except $\beta_1$. (The choice to update $\beta_1$ first is totally arbitrary – you can start with any parameter.) Then, within each Gibbs iteration:

1. Compute $R_1$ using the values of $\beta_{-1}$ at the current iteration. Then sample from the posterior, $\beta_1 \sim p(\beta_1 \mid R_1, \phi)$, conditional on the current draw of $\phi$.
2. Compute $R_2$ using the values of $\beta_{-2}$ at the current iteration – for $\beta_1$, use the value just drawn in step 1. Then sample from the posterior, $\beta_2 \sim p(\beta_2 \mid R_2, \phi)$.
3. Continue this procedure for all remaining beta parameters.
4. Once all $\beta$ parameters are drawn, sample $\phi \sim p(\phi \mid y, \beta)$. This posterior is another inverse gamma; its exact form is given below.
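For completeness, with the $\text{InvGamma}(0.5, 10000)$ prior above (rate parameterization), standard normal–inverse-gamma conjugacy gives

$$\phi \mid y, \beta \sim \text{InvGamma}\left(0.5 + \frac{n}{2},\; 10000 + \frac{1}{2}\sum_{i=1}^n (y_i - \mu_i)^2\right).$$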

The term backfitting seems appropriate because, at each iteration, we’re “backing out” the distribution of the $\beta_j$ that we want to sample using the other betas.
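Putting steps 1–4 together, here is a minimal sketch of the full sampler, reusing the hypothetical `sample_beta_j` from above (again, illustrative rather than the repo's actual implementation):

```python
def backfitting_gibbs(y, X, n_iter=5000, rng=None):
    """Backfitting Gibbs sampler for the polynomial regression model."""
    rng = rng or np.random.default_rng()
    n, p = X.shape
    beta = np.zeros(p)    # initial values; the first beta updated never uses its start
    phi = 1.0             # initial variance
    beta_draws = np.empty((n_iter, p))
    phi_draws = np.empty(n_iter)

    for it in range(n_iter):
        # Steps 1-3: cycle through the betas, each drawn conditional on the
        # most recent values of the others via the partial residual.
        for j in range(p):
            beta[j] = sample_beta_j(j, y, X, beta, phi, rng)

        # Step 4: draw phi from its inverse-gamma conditional posterior.
        resid = y - X @ beta
        shape = 0.5 + n / 2.0
        rate = 10_000 + 0.5 * np.sum(resid ** 2)
        phi = 1.0 / rng.gamma(shape, 1.0 / rate)  # InvGamma draw via 1/Gamma

        beta_draws[it] = beta
        phi_draws[it] = phi

    return beta_draws, phi_draws
```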

To get a fitted regression line, we need to sample from the posterior predictive distribution. We do this after step 4 in each Gibbs iteration by drawing $n$ values

$$\hat{y}_i \mid x_i, \beta^{[k]}, \phi^{[k]} \sim N\left(\mu_i^{[k]}, \phi^{[k]}\right), \quad i \in \{1, 2, \dots, n\},$$

where the superscript $[k]$ denotes parameter values from the $k^{th}$ Gibbs iteration.
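Continuing the sketch above (hypothetical names, including the burn-in choice), the predictive draws and fitted line might be computed as:

```python
beta_draws, phi_draws = backfitting_gibbs(y, X, n_iter=5000, rng=rng)

burn = 1000  # hypothetical burn-in
# One row per retained iteration k: n draws from N(mu_i^[k], phi^[k]).
y_pred = np.array([
    rng.normal(X @ b, np.sqrt(ph))
    for b, ph in zip(beta_draws[burn:], phi_draws[burn:])
])
fitted = y_pred.mean(axis=0)  # posterior predictive mean at each x_i
```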

Here are the results from the simulated example. Again, all code can be found on my GitHub repo.

[plot of results from the simulated example]

## 3 thoughts on “Bayesian Inference with Backfitting MCMC”

1. Dang Le says:

Can you please elaborate further on how you were able to calculate the posterior via conjugacy?

1. AO says:

We know that the posterior is proportional to the likelihood times the prior: $p(\beta_j \mid R_j, \phi) \propto p(R_j \mid \beta_j, \phi)\, p(\beta_j)$. Now, $p(\beta_j)$ is a normal density (the post's flat prior is the limiting case) and $p(R_j \mid \beta_j, \phi)$ is a normal density. Just multiply them together and do a lot of algebra to identify the mean and variance of another normal.
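To spell out the algebra for the flat-prior case used in the post: completing the square in $\beta_j$ gives

$$p(\beta_j \mid R_j, \phi) \propto \exp\left(-\frac{1}{2\phi}\sum_{i=1}^n \big(R_{ij} - x_i^j\beta_j\big)^2\right) \propto \exp\left(-\frac{\sum_{i=1}^n [x_i^j]^2}{2\phi}\left(\beta_j - \frac{\sum_{i=1}^n R_{ij}\, x_i^j}{\sum_{i=1}^n [x_i^j]^2}\right)^2\right),$$

which is the kernel of the normal posterior stated in the post. With a proper normal prior, the prior's quadratic term simply joins the sum before completing the square.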