Bayesian Simple Linear Regression with Gibbs Sampling in R

Many introductions to Bayesian analysis use relatively simple didactic examples (e.g. making inference about the probability of success given Bernoulli data). While this makes for a good introduction to Bayesian principles, the extension of these principles to regression is not straightforward.

This post will sketch out how these principles extend to simple linear regression. Along the way, I will derive the posterior conditional distributions of the parameters of interest, present R code for implementing a Gibbs sampler, and present the so-called grid point method.

I’ve had trouble with R code snippets in wordpress before, so I will not present code in the post. Instead, I’ll host the code on GitHub. You can follow along with the code as you go through the post.

A Bayesian Model

Suppose we observe data (y_i, x_i) for i \in \{1, \dots, n \} . Our model for y is

y_i \sim N(\beta_0 + \beta_1 x_i, \phi )

Of interest is making inference about \beta_0 and \beta_1 . If we place normal priors on the coefficients and an inverse gamma prior on the variance term, the full Bayesian model for this data can be written as:

y_i | \beta_0, \beta_1, \phi \sim N(\beta_0 + \beta_1 x_i, \phi )
\beta_0 | \mu_0, \tau_0 \sim N(\mu_0, \tau_0)
\beta_1 | \mu_1, \tau_1 \sim N(\mu_1, \tau_1)
\phi | \alpha, \gamma \sim IG(\alpha, \gamma)

Assuming the hyperparameters \mu_0, \tau_0, \mu_1, \tau_1, \alpha and \gamma are known, the posterior can be written, up to a constant of proportionality, as

p(\beta_0, \beta_1, \phi | \vec y) \propto \Big [\prod_{i=1}^n p(y_i | \beta_0, \beta_1, \phi )\Big] p( \beta_0 | \mu_0, \tau_0 ) p(\beta_1 | \mu_1, \tau_1) p( \phi | \alpha, \gamma)

The term in the brackets is the joint distribution of the data, or likelihood. The other terms comprise the joint prior distribution of the parameters (since we implicitly assumed prior independence, the joint prior factors).

This is recognized as the familiar expression:

posterior \propto likelihood \times prior.

Part 0 of the accompanying R code generates data from this model for specified “true” parameters. We will later estimate a Bayesian regression model with this data to check that we can recover these true parameters.

The Gibbs Sampler

To draw from this posterior distribution, we can use the Gibbs sampling algorithm. Gibbs sampling is an iterative algorithm that produces samples from the posterior distribution of each parameter of interest. It does so by sequentially drawing from the conditional posterior of each parameter in the following way:

[Figure: outline of the Gibbs sampling algorithm – at each iteration, draw \beta_0, \beta_1, and \phi in turn from their conditional posteriors, given the current values of the other parameters.]

It can be shown that, after a suitable burn-in period, the remaining draws are samples from the posterior distribution. These samples are not independent: the sequence of draws is a random walk through the posterior space, and each step depends on the previous position. Typically a thinning interval is also used (though not here). A thinning of 10 means we keep every 10th draw, the idea being that each draw may be highly dependent on the previous draw, but much less dependent on the draw 10 steps back.
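To make the sequential structure concrete, here is a toy Gibbs sampler (my own illustrative example, not the regression model above): sampling from a standard bivariate normal with correlation rho, where each full conditional is a univariate normal.

```r
set.seed(1)

rho    <- 0.6    # correlation between the two components
n_iter <- 1000   # total draws
burn   <- 100    # burn-in draws to discard

draws <- matrix(NA_real_, nrow = n_iter, ncol = 2)
x <- 0; z <- 0   # arbitrary starting values

for (t in 1:n_iter) {
  # draw x conditional on the current z, then z conditional on the new x
  x <- rnorm(1, mean = rho * z, sd = sqrt(1 - rho^2))
  z <- rnorm(1, mean = rho * x, sd = sqrt(1 - rho^2))
  draws[t, ] <- c(x, z)
}

draws <- draws[-(1:burn), ]    # discard burn-in
cor(draws[, 1], draws[, 2])    # should be close to rho
```

The same loop structure carries over to the regression model: each iteration draws \beta_0, \beta_1, and \phi in turn, conditioning on the most recent values of the other two.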

Conditional Posterior Distributions

To use Gibbs, we need to identify the conditional posterior of each parameter.

It helps to start with the full unnormalized posterior:

p(\beta_0, \beta_1, \phi | \vec y) \propto \phi^{-n/2} e^{-\frac{1}{2\phi} \sum_{i=1}^{n}(y_i - (\beta_0 + \beta_1x_i) )^2 } e^{-\frac{1}{2\tau_0} (\beta_0 - \mu_0 )^2 } e^{-\frac{1}{2\tau_1} (\beta_1 - \mu_1 )^2 } \phi^{-(\alpha + 1)} e^{-\frac{\gamma}{\phi}}

To find the conditional posterior of a parameter, we simply drop all terms from the joint posterior that do not include that parameter. For example, the intercept \beta_0 has conditional posterior:

p(\beta_0 | \phi,\beta_1, \mu_0, \tau_0, \vec y) \propto e^{-\frac{1}{2\phi} \sum_{i=1}^n (y_i - (\beta_0 + \beta_1x_i) )^2 } e^{-\frac{1}{2\tau_0} (\beta_0 - \mu_0 )^2 }


Similarly, the slope \beta_1 has conditional posterior:

p(\beta_1 | \beta_0, \phi, \mu_1, \tau_1, \vec y) \propto e^{-\frac{1}{2\phi} \sum_{i=1}^{n}(y_i - (\beta_0 + \beta_1x_i) )^2 } e^{-\frac{1}{2\tau_1} (\beta_1 - \mu_1 )^2 }

The conditional posterior of \phi can be recognized as an inverse gamma distribution after some algebraic manipulation:

\begin{aligned} p(\phi | \beta_0, \beta_1, \alpha, \gamma, \vec y) & \propto \phi^{-n/2} \phi^{-(\alpha + 1)} e^{-\frac{\gamma}{\phi}} e^{-\frac{1}{2\phi} \sum_{i=1}^{n}(y_i - (\beta_0 + \beta_1x_i) )^2} \\ & = \phi^{-(\alpha + \frac{n}{2} + 1)} \exp\Big\{ -\frac{1}{\phi} \Big[ \frac{1}{2} \sum_{i=1}^{n}(y_i - (\beta_0 + \beta_1x_i) )^2 + \gamma \Big] \Big\} \end{aligned}

which is the kernel of an IG(shape = \alpha + \frac{n}{2}, \ rate = \frac{1}{2} \sum_{i=1}^{n}(y_i - (\beta_0 + \beta_1x_i) )^2 + \gamma) distribution.
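Because 1/\phi \sim Gamma(shape, rate) when \phi \sim IG(shape, rate), we can draw from this conditional in base R via rgamma(). A minimal sketch, with simulated data and made-up current values for the other parameters:

```r
set.seed(5)

# hypothetical data and current parameter values for one Gibbs step
n <- 50
x <- rnorm(n)
y <- 2 + 3 * x + rnorm(n)
beta0 <- 2; beta1 <- 3    # current draws of the coefficients
alpha <- 2; gamma <- 1    # IG hyperparameters

# shape and rate of the conditional posterior of phi
shape <- alpha + n / 2
rate  <- sum((y - (beta0 + beta1 * x))^2) / 2 + gamma

# if phi ~ IG(shape, rate), then 1/phi ~ Gamma(shape, rate)
phi_draw <- 1 / rgamma(1, shape = shape, rate = rate)
phi_draw
```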

The conditional posteriors of \beta_0 and \beta_1 aren’t as easy to recognize. But we don’t really need to go through any algebra if we’re willing to use the grid method.

The Grid Method

Consider \beta_0. The grid method is a very brute force way (in my opinion) to sample from its conditional posterior distribution. This conditional distribution is just a function of \beta_0. So we can evaluate the density for certain \beta_0 values. In R notation, this can be grid=seq(-10, 10, by=.001), say. This sequence is the “grid” of points.

Then the conditional posterior distribution evaluated at each grid point tells us the relative likelihood of that draw.

We can then use the sample() function in R to draw from this grid of points, with sampling probabilities proportional to the density evaluations at the grid points.

This is implemented in functions rb0cond() and rb1cond() in part 1 of the accompanying R code.

It is common to encounter numerical issues when using the grid method. Since we are evaluating an unnormalized posterior on the grid, the results can get quite large or small. This may yield Inf and -Inf values in R.

In functions rb0cond() and rb1cond(), for example, I actually evaluate the log of the conditional posterior distributions derived above. I then normalize by subtracting the maximum of all the evaluations from each evaluation before exponentiating back from the log scale. This tends to handle such numerical issues.
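As a sketch of that log-scale trick (a stand-in for the repo’s rb0cond(), with simulated data and made-up hyperparameters and current values of the other parameters):

```r
set.seed(10)

# hypothetical data and current values of the other parameters
n <- 50
x <- rnorm(n)
y <- 0 + 1 * x + rnorm(n)
beta1 <- 1; phi <- 1      # conditioned-on current values
mu0 <- 0; tau0 <- 100     # prior mean and variance for beta0

grid <- seq(-10, 10, by = .001)

# log of the conditional posterior of beta0, evaluated at each grid point
log_dens <- sapply(grid, function(b0) {
  -sum((y - (b0 + beta1 * x))^2) / (2 * phi) - (b0 - mu0)^2 / (2 * tau0)
})

# stabilize by subtracting the maximum before exponentiating,
# then sample a grid point with probability proportional to the density
p <- exp(log_dens - max(log_dens))
b0_draw <- sample(grid, size = 1, prob = p)
b0_draw
```

sample() normalizes the probability weights internally, so we never have to compute the normalizing constant of the conditional posterior.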

We don’t need to use the grid method to draw from the conditional posterior of \phi since it is from a known distribution.

Note that this grid method has some drawbacks.

First, it’s computationally expensive. It would be more computationally efficient to go through the algebra and hopefully get a known posterior distribution to draw from, as we did with \phi.

Second, the grid method requires specifying a region of grid points. What if the conditional posterior had significant density outside our specified grid interval of [-10,10]? In this case, we would not get an accurate sample from the conditional posterior. It’s important to keep this in mind and experiment with wide grid intervals. Third, as noted earlier, we need to be careful about handling numerical issues, since unnormalized density evaluations can overflow to Inf or underflow to -Inf in R.

Simulation Results

Now that we have a way to sample from each parameter’s conditional posterior, we can implement the Gibbs sampler. This is done in part 2 of the accompanying R code. It codes the same algorithm outlined above in R.

The results are good. The plot below shows the sequence of 1000 Gibbs samples (with burn-in draws removed and no thinning implemented). The red lines indicate the true parameter values under which we simulate the data. The fourth plot shows the joint posterior of the intercept and slope terms, with red lines indicating contours.

[Figure: trace plots of the Gibbs draws for \beta_0, \beta_1, and \phi with true values in red, plus the joint posterior of the intercept and slope with red contour lines.]

To sum things up, we first derived an expression for the joint posterior distribution of the parameters. Then we outlined the Gibbs algorithm for drawing samples from this posterior. In the process, we recognized that the Gibbs method relies on sequential draws from the conditional posterior distribution of each parameter. For \phi, this was an easily recognized, known distribution. For the slope and intercept terms, we decided to circumvent the algebra by using the grid method. The benefit of doing this is that we sidestep a lot of algebra. The costs are increased computational burden, some trial and error when choosing an appropriate range for the grid, and numerical issues.

I’ll try to follow up with an extension to a Bayesian multivariate linear regression model in the near future.

Fixed Effects, Random Effects, and First Differencing

I came across a stackoverflow post the other day touching on first differencing and decided to write a quick review of the topic as well as related random effects and fixed effects methods.

In the end we’ll see that random effects, fixed effects, and first differencing are primarily used to handle unobserved heterogeneity within a repeated measures context. One annoying thing about these topics is that there are many synonyms and they vary across (sometimes within) disciplines. Throughout the post I’ll be using a lot of “i.e.”, “aka”, and “/”s to indicate synonymous concepts/ideas. Redundant terminology can be a huge barrier to seeing the underlying concepts.

Before getting into first differencing, I’ll lay out the context by example. Suppose we want to compare the effect of two blood pressure medications (Drug A and Drug B) on reducing systolic blood pressure (BP), and we’ve enrolled 100 patients in the study. We randomize each patient to receive Drug A or Drug B as their first treatment. After a sufficient washout period, we give each subject the remaining drug. After each treatment, we record BP. Because of the washout period, the second measurement should be independent of the first. In case it is not, randomizing the treatment order ensures that any carryover error is balanced across Drug A and Drug B – so we’re still comparing apples to apples.

This experiment yields two repeated measurements of BP per person, one after taking Drug A and another after taking Drug B. Repeated measures data is also referred to as hierarchical data/multilevel data.

BP_{s,d}; \ s \in \{1, \dots, 100\}, \ d \in \{0,1\}

Above, d=1 marks the measurement with Drug A and d=0 marks the measurement with Drug B.

Ordinary Least Squares

Now we are interested in the mean difference in BP between Drug A and Drug B and write the following linear model for subject s and drug d:

BP_{s,d} = \beta_0 + \beta_1 \times d_{s,d} + a_s + \epsilon_{s,d}

In the above, \beta_1 represents the difference in average BP between Drug A and Drug B – the average treatment effect. Using this coefficient we can test the null hypothesis that Drug A and Drug B have the same effect on BP:

H_0: \beta_1=0 \\ H_a: \beta_1 \neq 0

So we could estimate  \beta_1 using OLS and do the corresponding hypothesis test (which, by the way, will yield an identical p-value to an unpaired two-sample t-test). However, we would be ignoring a crucial piece of the model, namely a_s. This is the unobserved subject-specific, time-invariant heterogeneity (e.g. race, sex, baseline BMI, etc) and it must be dealt with.

We have two options: we either assume these unobserved characteristics are uncorrelated with treatment, or assume they are correlated.

Random Effects

If we believe these unobserved factors are uncorrelated with treatment, then leaving them out of the model won’t bias our estimate of the treatment effect.

The catch is that a_s will be absorbed into the error term. This induces autocorrelation on the error term within each patient.

BP_{s,d} = \beta_0 + \beta_1 \times d_{s,d} + (a_s + \epsilon_{s,d})

For a given subject s we can define a composite error term, c_{s,1} = a_s + \epsilon_{s,1} and c_{s,0} = a_s + \epsilon_{s,0}. The errors are correlated insofar as both errors contain a_s.

In order for OLS estimates to be efficient, the error terms must be independent and identically distributed. We have just shown that they are not independent. The consequence of this is that our estimate of \beta_1, while still unbiased, is inefficient. The OLS standard errors are compromised.

Random effects (aka hierarchical modeling) is built to handle the fact that our measurements are clustered/nested within patients. Hierarchical models can be used to handle additional clustering/nesting. For example, consider the same experiment but in this case we know that some patients come from the same family. So now we have BP measurements nested within patients and patients nested within families.
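As a minimal sketch of a random-intercept analysis on simulated data (all data-generating values below are made up): base R’s aov() with an Error() stratum fits a random subject intercept without extra packages. With the lme4 package installed, the analogous call would be lmer(BP ~ drug + (1 | subject)).

```r
set.seed(3)

n_sub   <- 100
subject <- factor(rep(1:n_sub, each = 2))
drug    <- factor(rep(c("A", "B"), times = n_sub))

a  <- rep(rnorm(n_sub, sd = 2), each = 2)               # unobserved subject effect
BP <- 120 + ifelse(drug == "A", -5, 0) + a + rnorm(2 * n_sub)

# random subject intercept via an Error() stratum
fit <- aov(BP ~ drug + Error(subject))
summary(fit)
```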

Fixed Effects

If we assume that the unobserved factors are indeed correlated with treatment, then our estimate of the treatment effect will be biased. One way to solve this issue is through the fixed effects estimator, which is derived from the original model:

BP_{s,d} = \beta_0 + \beta_1 \times d_{s,d} + a_s + \epsilon_{s,d}

Taking the average of each variable across each subject’s measurements, we obtain the average model:

\bar{BP_{s}} = \beta_0 + \beta_1 \times \bar{d_{s}} + \bar{a_s} + \bar{\epsilon_{s}}

Subtracting the latter from the former, we get what is typically called the “fixed effects estimator”

BP_{s,d}- \bar{BP_s} =  \beta_1 \times (d_{s,d} - \bar{d_s} ) + (a_s-\bar{a_s}) + (\epsilon_{s,d}-\bar{\epsilon_{s}})

Since a_s is time-invariant, a_s = \bar{a_s}, so the heterogeneity term drops out:

BP_{s,d}- \bar{BP_s} =  \beta_1 \times (d_{s,d} - \bar{d_s} ) + (\epsilon_{s,d}-\bar{\epsilon_{s}})

This process of de-meaning the variables is called the “within” transformation – an appropriate name because it captures BP variability within each patient. Most importantly, this process eliminates the unobserved heterogeneity from the model. The math aside, the within transformation exploits the repeated measures structure of the experiment…essentially using each patient as their own control. The first difference approach uses the same strategy.

This estimate of the treatment effect is identical to an OLS estimate of the first model with subject dummies (aka the least squares dummy variable – LSDV – model). The coefficient and standard error of the treatment effect will be identical to LSDV. From what I understand, the fixed effects estimator was shown to be equivalent to LSDV in the late 1970s and gained popularity because it’s much less computationally intensive to perform the within transformation than it is to estimate N-1 subject dummies.
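The equivalence of the within transformation and LSDV is easy to verify on simulated data (a sketch with made-up parameter values):

```r
set.seed(2)

n_sub   <- 100
subject <- factor(rep(1:n_sub, each = 2))
d <- rep(c(1, 0), times = n_sub)            # drug indicator
a <- rep(rnorm(n_sub, sd = 2), each = 2)    # unobserved subject effect
BP <- 120 - 5 * d + a + rnorm(2 * n_sub)

# within transformation: de-mean each variable within subject
BP_w <- BP - ave(BP, subject)
d_w  <- d  - ave(d,  subject)

fe   <- lm(BP_w ~ d_w - 1)       # fixed effects (within) estimator
lsdv <- lm(BP ~ d + subject)     # least squares dummy variable model

c(within = coef(fe)[["d_w"]], lsdv = coef(lsdv)[["d"]])  # identical point estimates
```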

First Difference

First differencing is an alternative to fixed effects. It achieves the same goal: eliminating a_s from the model. If a_s is correlated with treatment, first differencing will still yield an unbiased estimate of the effect.

The original equation admits a separate equation for each drug.

BP_{s,d} = \beta_0 + \beta_1 \times d_{s,d} + a_s + \epsilon_{s,d}

Drug A: BP_{s,1} = \beta_0 + \beta_1 + a_s + \epsilon_{s,1}

Drug B: BP_{s,0} = \beta_0 + a_s + \epsilon_{s,0}

Subtracting the latter from the former yields the first difference estimate of the treatment effect:

BP_{s,1}-BP_{s,0} = \beta_1 + (\epsilon_{s,1}-\epsilon_{s,0})

In this example, all we need to do is regress the difference in BP on a constant. This is equivalent to a one-sample t-test of the null hypothesis that the mean difference in BP is zero, which is exactly a paired t-test. Most importantly, it’s also equivalent to the fixed effects estimator in the case where we have two measurements per subject. For more than two measurements, there are efficiency differences between the two methods – but that’s a subject for a different post.
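These equivalences are easy to confirm on simulated data (a sketch with made-up values): the intercept-only regression on the differences and the paired t-test return the same p-value.

```r
set.seed(4)

n_sub <- 100
a   <- rnorm(n_sub, sd = 2)            # unobserved subject effect
BP1 <- 120 - 5 + a + rnorm(n_sub)      # BP on Drug A (d = 1)
BP0 <- 120     + a + rnorm(n_sub)      # BP on Drug B (d = 0)

diffBP <- BP1 - BP0                    # first difference removes a_s

fd <- lm(diffBP ~ 1)                   # regress the difference on a constant
pt <- t.test(BP1, BP0, paired = TRUE)  # paired t-test

c(fd = summary(fd)$coefficients[1, "Pr(>|t|)"], paired = pt$p.value)
```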

Exploring P-values with Simulations in R

The recent flare-up in discussions on p-values inspired me to conduct a brief simulation study.

In particular, I wanted to illustrate just how p-values vary with different effect and sample sizes.
Here are the details of the simulation. I simulated n draws of my independent variable X :

X_n \sim N(100, 400)


n \in \{5,6,...,25\}

For each X_n , I define a Y_n as

Y_n := 10+\beta X_n +\epsilon


\epsilon \sim N(0,1)
\beta \in \{.05, .06,..., .25 \}

In other words, for each effect size, \beta , the simulation draws X and Y with some error \epsilon . The following regression model is estimated and the p-value of \beta is observed.

Y_n = \beta_0 + \beta X_n + \epsilon

The drawing and the regression are repeated 1,000 times, so that each effect size – sample size combination yields 1,000 p-values. The average of these 1,000 p-values for each effect size and sample size combination is plotted below.
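A condensed sketch of the simulation, using a reduced grid of sample and effect sizes (and 500 rather than 1,000 repetitions) so it runs quickly:

```r
set.seed(6)

ns    <- c(5, 15, 25)          # subset of the sample sizes
betas <- c(0.05, 0.15, 0.25)   # subset of the effect sizes
reps  <- 500

mean_p <- sapply(betas, function(b)
  sapply(ns, function(n)
    mean(replicate(reps, {
      x <- rnorm(n, mean = 100, sd = 20)   # X ~ N(100, 400), so sd = 20
      y <- 10 + b * x + rnorm(n)           # var(epsilon) = 1
      summary(lm(y ~ x))$coefficients["x", "Pr(>|t|)"]
    }))))

dimnames(mean_p) <- list(n = as.character(ns), beta = as.character(betas))
mean_p   # average p-values fall as either n or beta grows
```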

Note, these results are for a fixed var(\epsilon)=1. Higher sampling error would typically shift these curves upward, meaning that for a given effect size, the same sample size would yield a weaker signal (a higher average p-value).


There are many take-aways from this plot.

First, for a given sample size, larger effect sizes are “detected” more easily. By detected, I mean found to be statistically significant using the .05 threshold. It’s possible to detect larger effect sizes (e.g. .25) with relatively low sample sizes (in this case <10). By contrast, if the effect size is small (e.g. .05), then a larger sample is needed to detect the effect (>10).

Second, this figure illustrates an oft-heard warning about p-values: always interpret them within the context of sample size. Lack of statistical significance does not imply lack of an effect. An effect may exist, but the sample size may be insufficient to detect it (or the variability in the data set is too high). On the other hand, just because a p-value signals statistical significance does not mean that the effect is actually meaningful. Consider an effect size of .00000001 (effectively 0). According to the chart, even the p-value of this effect size tends to 0 as the sample size increases, eventually crossing the statistical significance threshold.

Code is available on GitHub.

Stop and Frisk: Spatial Analysis of Racial Differences

Stops in 2014. Red lines indicate high white stop density areas and blue shades indicate high black stop density areas.
Notice that high white stop density areas are very different from high black stop density areas.
The star in Brooklyn marks the location of officers Liu’s and Ramos’ deaths. The star on Staten Island marks the location of Eric Garner’s death.

In my last post, I compiled and cleaned publicly available data on over 4.5 million stops over the past 11 years.

I also presented preliminary summary statistics showing that blacks had been consistently stopped 3-6 times more than whites over the last decade in NYC.

Since the last post, I managed to clean and reformat the coordinates marking the location of the stops. While I compiled data from 2003-2014, coordinates were available for year 2004 and years 2007-2014. All the code can be found in my GitHub repository.

My goals were to:

  • See if blacks and whites were being stopped at the same locations
  • Identify areas with especially high amounts of stops and see how these areas changed over time.

Killing two birds with one stone, I made density plots to identify areas with high and low stop densities. Snapshots were taken in 2 year intervals from 2007-2013. Stops of whites are indicated in red contour lines and stops of blacks are indicated in blue shades.
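The contour/shade overlay can be sketched with 2-D kernel density estimates. A toy example with made-up coordinates (the actual maps are produced by the code in the GitHub repository):

```r
library(MASS)   # for kde2d(); ships with base R

set.seed(7)
# hypothetical stop coordinates for the two groups
white <- cbind(lon = rnorm(500, -73.95, 0.02), lat = rnorm(500, 40.70, 0.02))
black <- cbind(lon = rnorm(500, -73.94, 0.03), lat = rnorm(500, 40.68, 0.03))

dens_w <- kde2d(white[, "lon"], white[, "lat"], n = 50)
dens_b <- kde2d(black[, "lon"], black[, "lat"], n = 50)

# blue shades for one group, red contour lines for the other
image(dens_b, col = hcl.colors(20, "Blues", rev = TRUE))
contour(dens_w, col = "red", add = TRUE)
```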



Stop and Frisk: Blacks stopped 3-6 times more than Whites over 10 years

The NYPD provides publicly available data on stop and frisks with data dictionaries, located here. The data, ranging from 2003 to 2014, contains information on over 4.5 million stops. Several variables such as the age, sex, and race of the person stopped are included.

I wrote some R code to clean and compile the data into a single .RData file. The code and clean data set are available in my GitHub repository. The purpose of this post is simply to make this clean, compiled dataset available to others to combine with their own datasets and reach interesting/meaningful conclusions.

Here are some preliminary (unadjusted) descriptive statistics:



Modeling Ebola Contagion Using Airline Networks in R

I first became interested in networks when reading Matthew O. Jackson’s 2010 paper describing their application to economics. During the 2014 ebola outbreak, there was a lot of concern over the disease spreading to the U.S. I was caught up with work/classes at the time, but decided to use airline flight data to at least explore the question.

The source for the data can be found in a previous post on spatial data visualization.

I assumed that the disease had a single origin (Liberia) and wanted to explore the question of how the disease could travel to the U.S. through a network of airline flights.

A visualization of the network can be seen below. Each node is a country and each edge represents an existing airline route from one country to another. Flights that take off and land in the same country are omitted to avoid clutter.

Each node is a country and each edge represents an existing airline route between two countries. Flights beginning and ending in the same country are not represented for clarity.


NYC Motor Vehicle Collisions – Street-Level Heat Map

In this post I will extend a previous analysis that created a borough-level heat map of NYC motor vehicle collisions. The data is from NYC Open Data. In particular, I will go from borough-level to street-level collisions. The processing code is very similar to that of the previous analysis, with a few more functions that map streets to colors. Below, I load the ggmap package and the data, and keep only collisions with longitude and latitude information.



#### 1. Clean Data ####
# get long and lat coordinates from concatenated "location" var

# create year variable

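As a sketch of those two steps, using a couple of hypothetical rows shaped like the dataset’s concatenated location and date fields:

```r
# hypothetical rows mimicking the collisions data
dat <- data.frame(
  DATE     = c("01/15/2013", "07/04/2014"),
  LOCATION = c("(40.7127, -74.0059)", "(40.6782, -73.9442)"),
  stringsAsFactors = FALSE
)

# 1. get long and lat coordinates from the concatenated "location" var
coords <- strsplit(gsub("[()]", "", dat$LOCATION), ",")
dat$latitude  <- as.numeric(sapply(coords, `[`, 1))
dat$longitude <- as.numeric(sapply(coords, `[`, 2))

# 2. create year variable
dat$year <- format(as.Date(dat$DATE, format = "%m/%d/%Y"), "%Y")
```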