# Blocked Gibbs Sampling in R for Bayesian Multiple Linear Regression

In a previous post, I derived and coded a Gibbs sampler in R for estimating a simple linear regression.

In this post, I will do the same for multivariate linear regression. I will derive the conditional posterior distributions necessary for the blocked Gibbs sampler. I will then code the sampler and test it using simulated data.

R code for simulating data and implementing the blocked Gibbs is in by GitHub repo.

# A Bayesian Model

Suppose we have a sample size of $n$ subjects. We observe an $n \times 1$ outcome vector $y$. The Bayesian multivariate regression assumes that this vector is drawn from a multivariate normal distribution where the mean vector is $X\beta$ and covariance matrix $\phi I$. Here, $X$ is an observed $n \times p$ matrix of covariates. Note, the first column of this matrix is identity. The parameter vector $\beta$ is $p \times 1$,  $\phi$ is a common variance parameter, and $I$ is the $n \times n$ identity matrix. By using the identity matrix, we are assuming independent observations. Formally,

$y \sim N_n(X\beta, \phi I)$

So far, this is identical to the multivariate normal regression seen in the frequentist setting. Assuming $X$ is full column rank, maximizing the likelihood yields the solution:

$\hat{\beta} = (X'X)^{-1}X'y$

A Bayesian model is obtained by specifying a prior distribution for $\beta$ and $\phi$. For this example, I will use a flat, improper prior for $\beta$ and a inverse gamma prior for $\phi$:

$\beta \sim f(\beta) \propto 1$

$\phi \sim IG(\alpha, \gamma)$

We assume the hyperparamters are known for simplicity.

# Joint Posterior Distribution

The joint posterior distribution is proportional to

$p(\beta, \phi | y, X) \propto p(y | \beta, \phi, X) p(\beta|X) p(\phi| X)$

We can write this because we assumed prior independence. That is,

$p(\beta,\phi|X) = p(\beta|X) p(\phi|X)$.

Substituting the distributions,

$p(\beta, \phi | y, X) \propto \phi^{-n/2} e^{-\frac{1}{2\phi}(y - X\beta)'(y - X\beta) } \phi^{-(\alpha + 1)}e^{-\frac{\gamma}{\phi}}$

# Blocked Gibbs Sampler

Before coding the sampler, we need to derive the components of the Gibbs sampler – the posterior conditional distributions of each parameter.

The conditional posterior of $\phi$is found by dropping factors independent of $\phi$ from the joint posterior and rearranging. This case is easy since there’s nothing to drop:

\begin{aligned} p(\phi | \beta, \alpha, \gamma, y, X) & \propto \phi^{-n/2} e^{-\frac{1}{2\phi}(y - X\beta)'(y - X\beta) } \phi^{-(\alpha + 1)}e^{-\frac{\gamma}{\phi}} \\ & \propto \phi^{-(\alpha + \frac{n}{2} + 1)} e^{-\frac{1}{\phi}\Big[ \frac{1}{2}(y - X\beta)'(y - X\beta) + \gamma \Big] } \\ & = IG(shape=\alpha + \frac{n}{2}, rate=\frac{1}{2}(y - X\beta)'(y - X\beta) + \gamma) \end{aligned}

The conditional posterior of $\beta$ takes some more linear algebra.

\begin{aligned} p(\beta | \phi, \alpha, \gamma, y, X ) & \propto e^{-\frac{1}{2\phi}(y - X\beta)'(y - X\beta) } \\ & \propto e^{-\frac{1}{2\phi}( y'y - 2\beta'X'y + \beta'X'X\beta ) } \\ & \propto e^{-\frac{1}{2\phi}(\beta'X'X\beta - 2\beta'X'y ) } \\ & \propto e^{-\frac{1}{2\phi}(\beta'X'X\beta - 2\beta'(X'X)(X'X)^{-1}X'y ) } \\ & \propto e^{-\frac{1}{2\phi}(\beta - (X'X)^{-1}X'y )'(X'X)(\beta - (X'X)^{-1}X'y )} \\ & = N_p((X'X)^{-1}X'y, \phi (X'X)^{-1} )\end{aligned}

This is a really nifty and intuitive result. Because we use a flat prior on the parameter vector, the conditional posterior of the parameter vector is centered around the maximum likelihood estimate $\hat{\beta} = (X'X)^{-1}X'y$. The covariance matrix of the conditional posterior is the frequentist estimate of the covariance matrix, $Cov(\hat \beta) = \phi (X'X)^{-1}$

Also note that the conditional posterior is a multivariate distribution since $\beta$ is a vector. So in each iteration of the Gibbs sampler, we draw a whole vector, or “block”, from the posterior. This is much more efficient than drawing from the conditional distribution of each parameter conditional on the others, one at a time. This is why the procedure is called a “blocked” sampler.

# Simulation

All code referencing in the following can be found in my GitHub repo.

I simulate a $50 \times 1$ outcome vector from $y \sim N_{50}(X\beta, 10000 \cdot I)$. Here $I$ is the $n \times n$ identity matrix and $X$ is a $50 \times 4$ model matrix. The true parameter vector, $\beta$ is

$\beta = (1000, 50, -50, 10) \ '$

Running the blocked Gibbs sampler (the block_gibbs() function) produces estimates of the true coefficient and variance parameters. 500,000 iterations were run. A burn-in period of 100,000 with a trimming of 10 iterations.

Below are the plots of the MCMC chains with true values indicated with a red line.

Here are the posterior distributions of the parameters after applying burn-in and trimming:

Seems like we’re able to get reasonable posterior estimates of these parameters. The distributions are not exactly centered around the truth because our data set is only one realization of the truth. To make sure the Bayesian estimator is working properly, I repeat this exercise for 1,000 simulated datasets.

This will yield a 1,000 sets of posterior means and 1,000 sets of 95% credible intervals. On average, these 1,000 posterior means should be centered around the truth. And on average the true parameter values should be within the credible interval 95% of the time.

Below is a summary of these evaluations.

The “Estimator Means” column is the average posterior mean across all 1,000 simulations. Pretty good. The percent bias is all less than 5%. The coverage of the 95% CI is around 95% for all parameters.

# Extensions

There are many extensions we can make to this model. For example, other distributions aside from the Normal could be used in order to accommodate different types of outcomes. For example, if we had binary data, we could model it as:

$y \sim Ber(p)$

$logit(p) = X\beta$

And then place a prior distribution on $\beta$. This idea generalizes Bayesian linear regression to Bayesian GLM.

In the linear case outlined in this post, it’s possible to have modeled the covariance matrix more flexibly. Instead, it is assumed that the covariance matrix is diagonal with a single common variance. This is the homoskedasticity assumption made in multiple linear regression. If the data is clustered (e.g., multiple observations per subject), we could use an inverse-Wishart distribution to model the whole covariance matrix.