Exploring P-values with Simulations in R

The recent flare-up in discussions on p-values inspired me to conduct a brief simulation study.

In particularly, I wanted to illustrate just how p-values vary with different effect and sample sizes.
Here are the details of the simulation. I simulated $n$ draws of my independent variable $X$:

$X_n \sim N(100, 400)$

where

$n \in \{5,6,...,25\}$

For each $X_n$, I define a $Y_n$ as

$Y_n := 10+\beta X_n +\epsilon$

where

$\epsilon \sim N(0,1)$
$\beta \in \{.05, .06,..., .25 \}$

In other words, for each effect size, $\beta$, the simulation draws $X$ and $Y$ with some error $\epsilon$. The following regression model is estimated and the p-value of $\beta$ is observed.

$Y_n = \beta_0 + \beta X_n$

The drawing and the regression is done 1,000 times so that for each effect size – sample size combination, the simulation yields 1,000 p-values. The average of these 1,000 p-values for each effect size and sample size combination is plotted below.

Note, these results are for a fixed $var(\epsilon)=1$. Higher sampling error would typically shift these curves upward, meaning that for each effect size, the same sample would yield a lower signal.

There are many take-aways from this plot.

First, for a given sample size, larger effect sizes are “detected” more easily. By detected, I mean found to be statistically significant using the .05 threshold. It’s possible to detect larger effect sizes (e.g. .25) with relatively low sample sizes (in this case <10). By contrast, if the effect size is small (e.g. .05), then a larger sample is needed to detect the effect (>10).

Second, this figure illustrates an oft-heard warning about p-values: always interpret them within the context of sample size. Lack of statistical significance does not imply lack of an effect. An effect may exist, but the sample size may be insufficient to detect it (or the variability in the data set is too high). On the other hand, just because a p-value signals statistical significance does not mean that the effect is actually meaningful. Consider an effect size of .00000001 (effectively 0). According to the chart, even the p-value of this effect size tends to 0 as the sample size increases, eventually crossing the statistical significance threshold.

Code is available on GitHub.

Stop and Frisk: Blacks stopped 3-6 times more than Whites over 10 years

The NYPD provides publicly available data on stop and frisks with data dictionaries, located here. The data, ranging from 2003 to 2014, contains information on over 4.5 million stops. Several variables such as the age, sex, and race of the person stopped are included.

I wrote some R code to clean and compile the data into a single .RData file. The code and clean data set are available in my Github repository. The purpose of this post is simply to make this clean, compiled dataset available to others to combine with their own datasets and reach interesting/meaningful conclusions.

Here are some preliminary (unadjusted) descriptive statistics:

Simulating Endogeneity

Introduction The topic in this post is endogeneity, which can severely bias regression estimates. I will specifically simulate endogeneity caused by an omitted variable. In future posts in this series, I’ll simulate other specification issues such as heteroskedasticity, multicollinearity, and collider bias.

The Data-Generating Process

Consider the data-generating process (DGP) of some outcome variable $Y$:

$Y = a+\beta x+c z + \epsilon_1$

$\epsilon_1 \sim N(0,\sigma^{2})$

For the simulation, I set parameter values for $a$, $\beta$, and $c$ and simulate positively correlated independent variables, $x$ and $z$ (N=500).

# simulation parameters
set.seed(144);
ss=500; trials=5000;
a=50; b=.5; c=.01;
d=25; h=.9;

# generate two independent variables
x=rnorm(n=ss,mean=1000,sd=50);
z=d+h*x+rnorm(ss,0,10)


The Simulation

The simulation will estimate the two models below. The first model is correct in the sense that it includes all terms in the actual DGP. However, the second model omits a variable that is present in the DGP. Instead, the variable is obsorbed into the error term $\epsilon_1$.

$(1)\thinspace Y = a+\beta x+c z + \epsilon_1$

$(2) \thinspace Y = a+\beta x + \epsilon_1$

This second model will yield a biased estimator of $\beta$. The variance will also be biased. This is because $x$ is endogenous, which is a fancy way of saying it is correlated with the error term, $\epsilon_1$. Since $cor(x,z)>0$ and $\epsilon_1=\epsilon + cz$, then $cor(x,\epsilon_1)>0$. To show this, I run the simulation below with 5,000 iterations. For each iteration, I construct the outcome variable, $Y$ using the DGP. I then run a regression to estimate $\beta$, first with model 1 and then with model 2.

sim=function(endog){
# assume normal error with constant variance to start
e=rnorm(n=ss,mean=0,sd=10)
y=a+b*x+c*z+e
# Select data generation process
if(endog==TRUE){ fit=lm(y~x) }else{ fit=lm(y~x+z)}
return(fit\$coefficients)
}

# run simulation - with and wihtout endogeneity
sim_results=t(replicate(trials,sim(endog=FALSE)))
sim_results_endog=t(replicate(trials,sim(endog=TRUE)))


Simulation Results This simulation yields two different sampling distributions for $\beta$. Note that I have set the true value to $\beta=.5$. When $z$ is not omitted, the simulation yields the green sampling distribution, centered around the true value. The average value across all simulations is 0.4998. When $z$ is omitted, the simulation yields the red sampling distribution, centered around 0.5895. It’s biased upward from the true value of .5 by .0895. Moreover, the variance of the biased sampling distribution is much smaller than the true variance around $\beta$. This compromises the ability to perform any meaningful inferences about the true parameter. Bias Analysis The bias in $\beta$ can be derived analytically. Consider that in model 1 (presented above), $x$ and $z$ are related by:

$(3)\thinspace z = d+hx+\epsilon_2$

Substituting $z$ in equation 1 with equation 3 and re-ordering:

$Y = a+\beta x+c (d+hx+\epsilon_2) + \epsilon_1$

$(4)\thinspace Y = (a+cd)+(\beta+ch) x + (\epsilon_1+c\epsilon_2)$

When omitting variable $z$, it is actually equation 4 that is estimated. It can be seen that $\beta$ is biased by the quantity $ch$. In this case, since $x$ and $z$ are positively correlated by construction and their slope coefficients are positive, the bias will be positive. According to the parameters of the simulation, the “true” bias should be  $ch=.09$. Here is the distribution of the bias, it is centered around .0895, very close to the true bias value. The derivation above also lets us determine the direction of bias from knowing the correlation of $x$ and $z$ as well as the sign of $c$ (the true partial effect of $z$ on $y$). If both are the same sign, then the estimate of $\beta$ will be biased upward. If the signs differ, then the estimate of $\beta$ will be biased downward. Conclusion The case above was pretty general, but has particular applications. For example, if we believe that an individual’s income is a function of years of education and year of work experience, then omitting one variable will bias the slope estimate of the other.

Using Markov Chains to Model Mortgage Defaults in R

The goal of this post is to blend the material I’ve been learning in my night class with my day-job and R.

If we have some object that switches between states over time according to fixed probabilities, we can model the long-term behavior of this object using Markov chains*.

A good example is a mortgage. At any given point in time, a loan has a probability of defaulting, stay current on payments, or getting paid-off in full. Collectively, we call these “transition probabilities.” These probabilities are assumed to be fixed over the life of the loan**.

As an example, we’ll look at conventional fixed-rate, 30-year mortgages. Let’s suppose that every current loan in time T has a 75% chance of staying current, 10% chance of defaulting, and 15% chance of being paid off in time T+1. These transition probabilities are outlined in the graphic above. Obviously, once a loan defaults or gets paid-off, it stays defaulted or paid-off. We call such states “absorbing states.”

Since we know the transition probabilities, all we need*** is an initial distribution of loans and we can predict the percentage of loans in each state at any given point in the 30-year period. Suppose we start off at T=0 with 100 current loans, and 0 defaulted and paid off loans. In time T+1, we know (according to our transition probabilities) that 75 of these 100 will remain current on their payments. However, 15 loans will be paid off and 10 loans will default. Since we assume that the transition probabilities are constant through the loans’ lives, we can use them to find the amount of current loans in time t=2. Of the 75 loans that were current in T+1, 56.25 loans will remain current in T+2 (since 75*.75=56.25).

If we repeat the process 28 more time (which is done in the posted code) and plot the points, we get the time series plotted above. After 30 years, there are no current loans (since they were all 30-year loans). They have all either paid off or defaulted, with more loans paid-off than defaulted.

*There are many drawbacks to using Markov Chains to model mortgages. This model assumes that the transition probabilities are the same for all 100 loans I use in my example. In reality, loans are not identical (e.g. the borrow of one loan may have much higher credit score than another. This difference would give the former a much lower chance of default) and transition probabilities are not constant throughout the lives of the loans (e.g. if interest rates plummet halfway through the loan’s life, the probability that the loan will be paid off skyrockets since the borrower can refinance at a lower rate) . In short, no one actually uses this model because it is too primitive. Interestingly enough, however, I did compare the curves in my plot against empirical data I have at work and the results are strikingly similar.

In industry, Survival Analysis is used most frequently to model loans (usually implemented using logistic regression with panel data or a proportional hazards models). This is an interesting blend of biostatistics and economics. It’s particularly funny when people apply the biostatistics terminology to mortgages to discuss single-monthly mortality (monthly probability of prepayment), hazards, or survival functions (i.e. the blue line in my chart.).

**In this case, these probabilities can be though of as “hazard rates.” The hazard of default, for example, is the probability that a loan defaults in time T+1 given that it has survived in time T. This is different from a probability of default. The former is a conditional probability whereas the latter is not.

***We don’t technically need an initial condition in this case for mathematical reasons which I won’t get into because time is a scarce resource.

Using Stochastic Process Simulations to Forecast Stocks

I good alternative to using historical volatility to forecast 35 weeks ahead may be to use an implied volatility from the options market. The market price of the option contract that expires 35 weeks later can be used to “reverse-engineer” the market’s expected volatility over the forecast period. This may be the subject of my next post!