Tag Archives: statistics

Fixed Effects, Random Effects, and First Differencing

I came across a stackoverflow post the other day touching on first differencing and decided to write a quick review of the topic as well as related random effects and fixed effects methods.

In the end we’ll see that random effects, fixed effects, and first differencing are primarily used to handle unobserved heterogeneity within a repeated measures context. One annoying thing about these topics is that there are many synonyms and they vary across (sometimes within) disciplines. Throughout the post I’ll be using a lot of “i.e.”, “aka”, and “/”s to indicate synonymous concepts/idea. Redundant terminology can be a huge barrier to seeing the underlying concepts.

Before getting into first differencing, I’ll lay out the context by example. Consider that we want to compare the effect of two blood pressure medications (Drug A and B) on reducing systolic blood pressure (BP). Suppose you’ve enrolled 100 patients in the study. You randomize each patient to receiving drug A or B as their first treatment. After a sufficient washout period, you give each subject the remaining drug. After each treatment, you record BP. Because of the washout period, the second treatment should be independent of the first. In case it is not, the randomization will ensure that the error is randomized across Drug A and B – so we’re still comparing apples to apples.

This experiment yields two repeated measurements of BP per person, one after having Drug A and another after having Drug B. Repeated measures data is also be referred to as hierarchical data/multilevel data.

BP_{s,d}; \ s \in 0,1,..., 100, d \in \{0,1\}

Above, d=1 marks the measurement with Drug A and d=0 marks the measurement with Drug B.

Ordinary Least Squares

Now we are interested in the mean difference in BP between Drug A and Drug B and write the following linear model for subject s and drug d:

BP_{s,d} = \beta_0 + \beta_1 \times d_{s,d} + a_s + \epsilon_{s,d}

In the above, \beta_1 represents the difference in average BP between Drug A and drug B – the average treatment effect. Using this coefficient we can text the null hypothesis that Drug A and Drug B have the same effect on BP:

H_0: \beta_1=0 \\ H_a: \beta_1 \neq 0

So we could estimate  \beta_1 using OLS and do the corresponding hypothesis test (which, by the way, will yield an identical p-value to an unpaired two-sample t-test). However, we would be ignoring a crucial piece of the model, namely a_s. This is the unobserved subject-specific, time-invariant heterogeneity (e.g. race, sex, baseline BMI, etc) and it must be dealt with.

We have two options: we either assume the treatment effect is not correlated with unobserved characteristics, or assume it is correlated.

Random Effects

If we believe these unobserved factors are uncorrelated with the treatment effect, then leaving them out of the model won’t bias our estimate of the treatment effect.

The catch is that a_s will be absorbed into the error term. This induces autocorrelation on the error term within each patient.

BP_{s,d} = \beta_0 + \beta_1 \times d_{s,d} + (a_s + \epsilon_{s,d})

For a given subject s we can define a composite error term, c_{s,1} = a_s + \epsilon_{s,1} and c_{s,0} = a_s + \epsilon_{s,0}. The errors are correlated insofar as both errors contain a_s.

In order for OLS estimates to be efficient, the error term must be independent and identically distributed. We have just shown that they are not independent. The consequences of this is that our estimate of $\beta_1$, while still unbiased, is inefficient. The OLS standard errors are compromised.

Random effects (aka hierarchical modeling) is built to handle the fact that our measurements are clustered/nested within patients. Hierarchical models can be used to handle additional clustering/nesting. For example, consider the same experiment but in this case we know that some patients come from the same family. So now we have BP measurements nested within patients and patients nested within families.

Fixed Effects

If we assume that the unobserved factors are indeed correlated with the treatment effect, then our estimate of the treatment effect will be biased. One way to solve this issue through the fixed effects estimator, which is derived from the original model:

BP_{s,d} = \beta_0 + \beta_1 \times d_{s,d} + a_s + \epsilon_{s,d}

Taking the average of each variable across each subject’s measurements, we obtain the average model:

\bar{BP_{s}} = \beta_0 + \beta_1 \times \bar{d_{s}} + \bar{a_s} + \bar{\epsilon_{s}}

Subtracting the latter from the former, we get what is typically called the “fixed effects estimator”

BP_{s,d}- \bar{BP_s} =  \beta_1 \times (d_{s,d} - \bar{d_s} ) + (a_s-\bar{a_s}) + (\epsilon_{s,d}-\bar{\epsilon_{s}})

BP_{s,d}- \bar{BP_s} =  \beta_1 \times (d_{s,d} - \bar{d_s} ) + (\epsilon_{s,d}-\bar{\epsilon_{s}})

This process of de-meaning the variables is called the “within” transformation – an appropriate name because it calculate BP variability within each patient. Most importantly, this process eliminates unobserved heterogeneity from the model. The math aside, the within transformation exploits the repeated measures structure of the experiment…essentially using each patient as their own control. The first difference approach will also use the same strategy.

This estimate of the treatment effect is identical to an OLS estimate of the first model with subject dummies (aka least squares dummy variable – LSDV – model). The coefficient and standard error of the treatment effect will be identical to LSDV. From what I understand the fixed effects estimator was shown to be equivalent to LSDV in the late 1970s and gained popularity because it’s must less computationally intensive to perform the within transformation than it is to estimate N-1 subject dummies.

First Difference

First differencing is an alternative to fixed effects. It also achieves the same goal: to eliminate a_s from the model. If  a_s is related to the the treatment effect, first differencing will also yield an unbiased estimate of the effect.

The original equation admits a separate equations for each drug.

BP_{s,d} = \beta_0 + \beta_1 \times d_{s,d} + a_s + \epsilon_{s,d}

Drug A: BP_{s,1} = \beta_0 + \beta_1 + a_s + \epsilon_{s,1}

Drug B: BP_{s,0} = \beta_0 + a_s + \epsilon_{s,0}

Subtracting the latter from the former yields the first difference estimate of the treatment effect:

BP_{s,1}-BP_{s,0} = \beta_1 + (\epsilon_{s,1}-\epsilon_{s,0})

In this example, all we need to do is regression the difference in BP on a constant. This is equivalent to a one-sample t-test of the null hypothesis that the mean difference in BP is zero. It is also equivalent to a paired t-test. Most importantly, it’s also equivalent to the fixed effects estimator in the case where we have two measurements per subject. For more than two measurements, there are efficiency differences between the two methods – but that’s a subject for a different post.

Exploring P-values with Simulations in R

The recent flare-up in discussions on p-values inspired me to conduct a brief simulation study.

In particularly, I wanted to illustrate just how p-values vary with different effect and sample sizes.
Here are the details of the simulation. I simulated n draws of my independent variable X :

X_n \sim N(100, 400)


n \in \{5,6,...,25\}

For each X_n , I define a Y_n as

Y_n := 10+\beta X_n +\epsilon


\epsilon \sim N(0,1)
\beta \in \{.05, .06,..., .25 \}

In other words, for each effect size, \beta , the simulation draws X and Y with some error \epsilon . The following regression model is estimated and the p-value of \beta is observed.

Y_n = \beta_0 + \beta X_n

The drawing and the regression is done 1,000 times so that for each effect size – sample size combination, the simulation yields 1,000 p-values. The average of these 1,000 p-values for each effect size and sample size combination is plotted below.

Note, these results are for a fixed var(\epsilon)=1. Higher sampling error would typically shift these curves upward, meaning that for each effect size, the same sample would yield a lower signal.


There are many take-aways from this plot.

First, for a given sample size, larger effect sizes are “detected” more easily. By detected, I mean found to be statistically significant using the .05 threshold. It’s possible to detect larger effect sizes (e.g. .25) with relatively low sample sizes (in this case <10). By contrast, if the effect size is small (e.g. .05), then a larger sample is needed to detect the effect (>10).

Second, this figure illustrates an oft-heard warning about p-values: always interpret them within the context of sample size. Lack of statistical significance does not imply lack of an effect. An effect may exist, but the sample size may be insufficient to detect it (or the variability in the data set is too high). On the other hand, just because a p-value signals statistical significance does not mean that the effect is actually meaningful. Consider an effect size of .00000001 (effectively 0). According to the chart, even the p-value of this effect size tends to 0 as the sample size increases, eventually crossing the statistical significance threshold.

Code is available on GitHub.

Simulating Endogeneity

Introduction The topic in this post is endogeneity, which can severely bias regression estimates. I will specifically simulate endogeneity caused by an omitted variable. In future posts in this series, I’ll simulate other specification issues such as heteroskedasticity, multicollinearity, and collider bias.

The Data-Generating Process

Consider the data-generating process (DGP) of some outcome variable Y :

Y = a+\beta x+c z + \epsilon_1

\epsilon_1 \sim N(0,\sigma^{2})

For the simulation, I set parameter values for a , \beta , and c and simulate positively correlated independent variables, x and z (N=500).

# simulation parameters
ss=500; trials=5000;
a=50; b=.5; c=.01;
d=25; h=.9;

# generate two independent variables

The Simulation

The simulation will estimate the two models below. The first model is correct in the sense that it includes all terms in the actual DGP. However, the second model omits a variable that is present in the DGP. Instead, the variable is obsorbed into the error term \epsilon_1 .

(1)\thinspace Y = a+\beta x+c z + \epsilon_1

(2) \thinspace Y = a+\beta x + \epsilon_1

This second model will yield a biased estimator of \beta . The variance will also be biased. This is because x is endogenous, which is a fancy way of saying it is correlated with the error term, \epsilon_1 . Since cor(x,z)>0 and \epsilon_1=\epsilon + cz , then cor(x,\epsilon_1)>0 . To show this, I run the simulation below with 5,000 iterations. For each iteration, I construct the outcome variable, Y using the DGP. I then run a regression to estimate \beta , first with model 1 and then with model 2.

 # assume normal error with constant variance to start
 # Select data generation process
 if(endog==TRUE){ fit=lm(y~x) }else{ fit=lm(y~x+z)}

# run simulation - with and wihtout endogeneity

Simulation Results This simulation yields two different sampling distributions for \beta . Note that I have set the true value to \beta=.5 . When z is not omitted, the simulation yields the green sampling distribution, centered around the true value. The average value across all simulations is 0.4998. When z is omitted, the simulation yields the red sampling distribution, centered around 0.5895. It’s biased upward from the true value of .5 by .0895. Moreover, the variance of the biased sampling distribution is much smaller than the true variance around \beta . This compromises the ability to perform any meaningful inferences about the true parameter. End2Bias Analysis The bias in \beta can be derived analytically. Consider that in model 1 (presented above), x and z are related by:

(3)\thinspace z = d+hx+\epsilon_2

Substituting z in equation 1 with equation 3 and re-ordering:

 Y = a+\beta x+c (d+hx+\epsilon_2) + \epsilon_1

 (4)\thinspace Y = (a+cd)+(\beta+ch) x + (\epsilon_1+c\epsilon_2)

When omitting variable z , it is actually equation 4 that is estimated. It can be seen that \beta is biased by the quantity ch . In this case, since x and z are positively correlated by construction and their slope coefficients are positive, the bias will be positive. According to the parameters of the simulation, the “true” bias should be  ch=.09. Here is the distribution of the bias, it is centered around .0895, very close to the true bias value. bias2 The derivation above also lets us determine the direction of bias from knowing the correlation of x and z as well as the sign of c (the true partial effect of z on y ). If both are the same sign, then the estimate of \beta will be biased upward. If the signs differ, then the estimate of \beta will be biased downward. Conclusion The case above was pretty general, but has particular applications. For example, if we believe that an individual’s income is a function of years of education and year of work experience, then omitting one variable will bias the slope estimate of the other.

Iterative OLS Regression Using Gauss-Seidel

Figure shows 10 iterations of Gauss-Seidel's OLS estimates. Estimates get successively closer to the true line, shown in green.
Figure shows 10 iterations of Gauss-Seidel’s OLS estimates. Estimates get successively closer to the true line, shown in green.

I just finished covering a few numerical techniques for solving systems of equations, which can be applied to find best-fit lines through a give set of data points.

The four points \{(0,0), (1,3), (2,3), (5,6)\} are arranged into an inconsistent system of four equations and two unknowns:

b+a(0) = 0 \\  b+a(1) = 3 \\  b+a(2) = 3 \\  b+a(5) = 6

The system can be represented in matrix form:

\begin{bmatrix} 1 & 0 \\ 1 & 1 \\ 1 & 2 \\ 1 & 5 \end{bmatrix}  \begin{bmatrix} b \\ a \end{bmatrix}  =  \begin{bmatrix} 1 \\ 3 \\ 3 \\ 6 \end{bmatrix}

The least-squares solution vector can be found by solving the normal equations


Solving for A^TA and A^T \vec{b} yields the following matrix system:

\begin{bmatrix} 4 & 8 \\ 8 & 30 \end{bmatrix}  \begin{bmatrix} b \\ a \end{bmatrix}  =  \begin{bmatrix} 12 \\ 39 \end{bmatrix}

The system can be solved using Gauss-Seidel method.

Below, I run 10 iterations of Gauss-Seidel (visualized in the figure above). The estimated line gets successively closer to the true solution in green. The estimates are shown in blue – each iteration is shown in a darker shade than the next (see highlighted lines).

library(MASS) # package needed for ginv() function 

A=cbind(c(4,8),c(8,30)) # coefficient matrix
b<-t(t(c(12,39))) # b-vector
x0<-t(t(c(0,0)))  # starting vector
iter<-10          # number of Gauss-Seidel iterations to run

L<-lower.tri(A)*A  # lower triang. A
U<-upper.tri(A)*A  # upper triang. A
D<-diag(diag(A))   # diag of A

# plot points 
title(main="OLS Estimates")
       c("Data","Estimates","True Line"),

# create color palette - lines will get darker with each iter
pal<-colorRampPalette(c("#f2f2f2", "Blue"))
colors<-pal(iter) # creates color palette of length(iter)

# plot true line

  # Gauss-Seidel formula
  # plot estimated line  
  abline(a=as.numeric(x0[2,1]), # slope of estimated line
         b=as.numeric(x0[1,1]), # y-intercept of estimated line
         col=colors[n]) # pick nth color in palette

Using Markov Chains to Model Mortgage Defaults in R

The goal of this post is to blend the material I’ve been learning in my night class with my day-job and R.

If we have some object that switches between states over time according to fixed probabilities, we can model the long-term behavior of this object using Markov chains*.

A good example is a mortgage. At any given point in time, a loan has a probability of defaulting, stay current on payments, or getting paid-off in full. Collectively, we call these “transition probabilities.” These probabilities are assumed to be fixed over the life of the loan**.

As an example, we’ll look at conventional fixed-rate, 30-year mortgages. Let’s suppose that every current loan in time T has a 75% chance of staying current, 10% chance of defaulting, and 15% chance of being paid off in time T+1. These transition probabilities are outlined in the graphic above. Obviously, once a loan defaults or gets paid-off, it stays defaulted or paid-off. We call such states “absorbing states.”

Since we know the transition probabilities, all we need*** is an initial distribution of loans and we can predict the percentage of loans in each state at any given point in the 30-year period. Suppose we start off at T=0 with 100 current loans, and 0 defaulted and paid off loans. In time T+1, we know (according to our transition probabilities) that 75 of these 100 will remain current on their payments. However, 15 loans will be paid off and 10 loans will default. Since we assume that the transition probabilities are constant through the loans’ lives, we can use them to find the amount of current loans in time t=2. Of the 75 loans that were current in T+1, 56.25 loans will remain current in T+2 (since 75*.75=56.25).

If we repeat the process 28 more time (which is done in the posted code) and plot the points, we get the time series plotted above. After 30 years, there are no current loans (since they were all 30-year loans). They have all either paid off or defaulted, with more loans paid-off than defaulted.

*There are many drawbacks to using Markov Chains to model mortgages. This model assumes that the transition probabilities are the same for all 100 loans I use in my example. In reality, loans are not identical (e.g. the borrow of one loan may have much higher credit score than another. This difference would give the former a much lower chance of default) and transition probabilities are not constant throughout the lives of the loans (e.g. if interest rates plummet halfway through the loan’s life, the probability that the loan will be paid off skyrockets since the borrower can refinance at a lower rate) . In short, no one actually uses this model because it is too primitive. Interestingly enough, however, I did compare the curves in my plot against empirical data I have at work and the results are strikingly similar.

In industry, Survival Analysis is used most frequently to model loans (usually implemented using logistic regression with panel data or a proportional hazards models). This is an interesting blend of biostatistics and economics. It’s particularly funny when people apply the biostatistics terminology to mortgages to discuss single-monthly mortality (monthly probability of prepayment), hazards, or survival functions (i.e. the blue line in my chart.).

**In this case, these probabilities can be though of as “hazard rates.” The hazard of default, for example, is the probability that a loan defaults in time T+1 given that it has survived in time T. This is different from a probability of default. The former is a conditional probability whereas the latter is not.

***We don’t technically need an initial condition in this case for mathematical reasons which I won’t get into because time is a scarce resource.