# Exploring P-values with Simulations in R

The recent flare-up in discussions on p-values inspired me to conduct a brief simulation study.

In particular, I wanted to illustrate just how p-values vary with different effect sizes and sample sizes.
Here are the details of the simulation. I simulated $n$ draws of my independent variable $X$:

$X_n \sim N(100, 400)$

where

$n \in \{5,6,...,25\}$

For each $X_n$, I define a $Y_n$ as

$Y_n := 10+\beta X_n +\epsilon$

where

$\epsilon \sim N(0,1)$
$\beta \in \{.05, .06,..., .25 \}$

In other words, for each effect size, $\beta$, the simulation draws $X$ and $Y$ with some error $\epsilon$. The following regression model is estimated and the p-value of $\beta$ is observed.

$Y_n = \beta_0 + \beta X_n + \epsilon$

The drawing and the regression are repeated 1,000 times, so that for each effect size – sample size combination, the simulation yields 1,000 p-values. The average of these 1,000 p-values for each effect size and sample size combination is plotted below.
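The procedure above can be sketched in a few lines of R. This is my own minimal reconstruction for a single effect size – sample size cell (here $\beta=.15$, $n=15$), not the exact simulation code:

```r
# One cell of the simulation grid: fix an effect size and a sample size,
# draw X and Y 1,000 times, and record the p-value on beta each time.
set.seed(1)
n    <- 15    # sample size (the grid runs 5..25)
beta <- .15   # effect size (the grid runs .05...25)

pvals <- replicate(1000, {
  x <- rnorm(n, mean = 100, sd = 20)   # N(100, 400): treating 400 as the variance, sd = 20
  y <- 10 + beta * x + rnorm(n, 0, 1)  # DGP with epsilon ~ N(0, 1)
  summary(lm(y ~ x))$coefficients["x", "Pr(>|t|)"]
})

mean(pvals)  # average p-value for this cell
```

Looping this over both grids produces the surface that is plotted.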

Note that these results are for a fixed $var(\epsilon)=1$. Higher sampling error would shift these curves upward, meaning that for a given effect size, the same sample size would yield a weaker signal.

There are many takeaways from this plot.

First, for a given sample size, larger effect sizes are “detected” more easily. By detected, I mean found to be statistically significant using the .05 threshold. It’s possible to detect larger effect sizes (e.g. .25) with relatively low sample sizes (in this case <10). By contrast, if the effect size is small (e.g. .05), then a larger sample is needed to detect the effect (>10).

Second, this figure illustrates an oft-heard warning about p-values: always interpret them in the context of sample size. Lack of statistical significance does not imply lack of an effect; an effect may exist, but the sample may be too small to detect it (or the variability in the data too high). On the other hand, just because a p-value signals statistical significance does not mean that the effect is actually meaningful. Consider an effect size of .00000001 (effectively 0). Extrapolating the pattern in the chart, even the p-value for this effect size tends to 0 as the sample size increases, eventually crossing the statistical significance threshold.
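The second point is easy to check directly. In this small example of my own (not part of the original simulation), a practically negligible slope still comes out "significant" once the sample is large enough:

```r
# A practically negligible effect becomes statistically significant
# at a large enough sample size.
set.seed(2)
n <- 1e6
x <- rnorm(n, mean = 100, sd = 20)
y <- 10 + 1e-3 * x + rnorm(n, 0, 1)   # slope of .001: effectively zero in practice

p <- summary(lm(y ~ x))$coefficients["x", "Pr(>|t|)"]
p < .05   # TRUE: "significant", yet the effect is tiny
```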

Code is available on GitHub.

# Stop and Frisk: Spatial Analysis of Racial Differences

In my last post, I compiled and cleaned publicly available data on over 4.5 million stops over the past 11 years.

I also presented preliminary summary statistics showing that blacks had been consistently stopped 3-6 times more often than whites over the last decade in NYC.

Since the last post, I managed to clean and reformat the coordinates marking the location of the stops. While I compiled data from 2003-2014, coordinates were available only for 2004 and 2007-2014. All the code can be found in my GitHub repository.

My goals were to:

• See whether blacks and whites were being stopped at the same locations.
• Identify areas with especially high numbers of stops and see how these areas changed over time.

Killing two birds with one stone, I made density plots to identify areas with high and low stop densities, taking snapshots at two-year intervals from 2007 to 2013. Stops of whites are indicated by red contour lines and stops of blacks by blue shades.
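The published plots were built from the actual stop coordinates; as a rough illustration of the approach, here is a sketch using `MASS::kde2d` on made-up coordinates (the data and column names below are hypothetical stand-ins for the real stop data):

```r
library(MASS)  # ships with R; kde2d computes a 2D kernel density estimate

# hypothetical stop coordinates (stand-ins for the real data)
set.seed(4)
stops_black <- cbind(long = rnorm(500, -73.95, .03), lat = rnorm(500, 40.80, .03))
stops_white <- cbind(long = rnorm(300, -73.98, .03), lat = rnorm(300, 40.77, .03))

dens_black <- kde2d(stops_black[, "long"], stops_black[, "lat"], n = 50)
dens_white <- kde2d(stops_white[, "long"], stops_white[, "lat"], n = 50)

# blue shades for stops of blacks, red contour lines for stops of whites
image(dens_black, col = hcl.colors(20, "Blues", rev = TRUE))
contour(dens_white, col = "red", add = TRUE)
```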

# Stop and Frisk: Blacks stopped 3-6 times more than Whites over 10 years

The NYPD provides publicly available data on stop and frisks with data dictionaries, located here. The data, ranging from 2003 to 2014, contains information on over 4.5 million stops. Several variables such as the age, sex, and race of the person stopped are included.

I wrote some R code to clean and compile the data into a single .RData file. The code and clean data set are available in my GitHub repository. The purpose of this post is simply to make this clean, compiled dataset available for others to combine with their own datasets and reach interesting, meaningful conclusions.

Here are some preliminary (unadjusted) descriptive statistics:

# Modeling Ebola Contagion Using Airline Networks in R

I first became interested in networks when reading Matthew O. Jackson's 2010 paper describing their application to economics. During the 2014 Ebola outbreak, there was a lot of concern over the disease spreading to the U.S. I was caught up with work and classes at the time, but decided to use airline flight data to at least explore the question.

The source for the data can be found in a previous post on spatial data visualization.

I assumed that the disease had a single origin (Liberia) and wanted to explore the question of how the disease could travel to the U.S. through a network of airline flights.

A visualization of the network can be seen below. Each node is a country and each edge represents an existing airline route from one country to another. Flights that take off and land in the same country are omitted to avoid clutter.
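The spread question amounts to reachability in this directed graph: is there a chain of routes from the origin country to the U.S.? A toy sketch in base R (the edge list below is made up for illustration; the real one comes from the airline data):

```r
# hypothetical country-to-country routes (source, destination)
routes <- rbind(c("Liberia", "Ghana"),
                c("Ghana",   "UK"),
                c("UK",      "USA"),
                c("France",  "USA"))

# breadth-first expansion over the edge list: all countries reachable from `from`
reachable <- function(edges, from) {
  seen <- from
  repeat {
    nxt <- edges[edges[, 1] %in% seen & !(edges[, 2] %in% seen), 2]
    if (length(nxt) == 0) break
    seen <- c(seen, unique(nxt))
  }
  seen
}

reachable(routes, "Liberia")  # "Liberia" "Ghana" "UK" "USA"
```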

# NYC Motor Vehicle Collisions – Street-Level Heat Map

In this post I extend a previous analysis that created a borough-level heat map of NYC motor vehicle collisions, going from borough-level to street-level. The data is from NYC Open Data. The code is very similar to the previous analysis, with a few additional functions that map streets to colors. Below, I load the ggmap package and the data, and keep only collisions with longitude and latitude information.

library(ggmap)

#### 1. Clean Data ####
# keep collisions with location info
d_clean=d[which(regexpr(',',d$LOCATION)!=-1),]
# get long and lat coordinates from concatenated "LOCATION" var
comm=regexpr(',',d_clean$LOCATION)
d_clean$loc=as.character(d_clean$LOCATION)
d_clean$lat=as.numeric(substr(d_clean$loc,2,comm-1))
d_clean$long=as.numeric(substr(d_clean$loc,comm+1,nchar(d_clean$loc)-1))
# create year variable
d_clean$year=substr(d_clean$DATE,7,10)

# Simulating Endogeneity

## Introduction

The topic of this post is endogeneity, which can severely bias regression estimates. I will specifically simulate endogeneity caused by an omitted variable. In future posts in this series, I'll simulate other specification issues such as heteroskedasticity, multicollinearity, and collider bias.

## The Data-Generating Process

Consider the data-generating process (DGP) of some outcome variable $Y$:

$Y = a+\beta x+c z + \epsilon_1$

$\epsilon_1 \sim N(0,\sigma^{2})$

For the simulation, I set parameter values for $a$, $\beta$, and $c$ and simulate positively correlated independent variables, $x$ and $z$ (N=500).

# simulation parameters
set.seed(144)
ss=500; trials=5000
a=50; b=.5; c=.1; d=25; h=.9

# generate two positively correlated independent variables
x=rnorm(n=ss,mean=1000,sd=50)
z=d+h*x+rnorm(ss,0,10)

## The Simulation

The simulation estimates the two models below. The first model is correct in the sense that it includes all terms in the actual DGP. The second model omits a variable that is present in the DGP; that variable is instead absorbed into the error term, $u$.

$(1)\thinspace Y = a+\beta x+c z + \epsilon_1$

$(2)\thinspace Y = a+\beta x + u, \quad u = \epsilon_1 + cz$

The second model yields a biased estimator of $\beta$, and its estimated variance is biased as well. This is because $x$ is endogenous, which is a fancy way of saying it is correlated with the error term: since $cor(x,z)>0$ and $u=\epsilon_1 + cz$, it follows that $cor(x,u)>0$.

To show this, I run the simulation below with 5,000 iterations. For each iteration, I construct the outcome variable $Y$ using the DGP, then run a regression to estimate $\beta$, first with model 1 and then with model 2.
sim=function(endog){
  # assume normal error with constant variance to start
  e=rnorm(n=ss,mean=0,sd=10)
  y=a+b*x+c*z+e
  # select the model specification
  if(endog==TRUE){
    fit=lm(y~x)     # omits z: x is endogenous
  }else{
    fit=lm(y~x+z)   # correct specification
  }
  return(fit$coefficients)
}

# run simulation - with and without endogeneity
sim_results=t(replicate(trials,sim(endog=FALSE)))
sim_results_endog=t(replicate(trials,sim(endog=TRUE)))


## Simulation Results

This simulation yields two different sampling distributions for $\beta$. Note that I have set the true value to $\beta=.5$. When $z$ is not omitted, the simulation yields the green sampling distribution, centered around the true value; the average across all simulations is 0.4998. When $z$ is omitted, the simulation yields the red sampling distribution, centered around 0.5895, biased upward from the true value of .5 by .0895. Moreover, the variance of the biased sampling distribution is much smaller than the true sampling variance around $\beta$, which compromises the ability to perform any meaningful inference about the true parameter.

## Bias Analysis

The bias in $\beta$ can be derived analytically. Consider that in model 1 (presented above), $x$ and $z$ are related by:

$(3)\thinspace z = d+hx+\epsilon_2$

Substituting $z$ in equation 1 with equation 3 and re-ordering:

$Y = a+\beta x+c (d+hx+\epsilon_2) + \epsilon_1$

$(4)\thinspace Y = (a+cd)+(\beta+ch) x + (\epsilon_1+c\epsilon_2)$
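Equation 4 can be checked numerically. Below is my own condensed re-run of the simulation, using $c=.1$ and $h=.9$ so that the analytic bias is $ch=.09$ as in the discussion (I use `c_` rather than `c` to avoid masking R's `c()` function):

```r
# Empirical bias of the omitted-variable estimator vs. the analytic value c*h
set.seed(3)
ss <- 500; a <- 50; b <- .5; c_ <- .1; d <- 25; h <- .9

bias <- replicate(2000, {
  x <- rnorm(ss, 1000, 50)
  z <- d + h * x + rnorm(ss, 0, 10)
  y <- a + b * x + c_ * z + rnorm(ss, 0, 10)
  coef(lm(y ~ x))["x"] - b   # slope from the misspecified model, minus the truth
})

mean(bias)   # close to c_ * h = .09
```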

When variable $z$ is omitted, it is actually equation 4 that is estimated, so $\beta$ is biased by the quantity $ch$. In this case, since $x$ and $z$ are positively correlated by construction and their slope coefficients are positive, the bias is positive. Given the parameters of the simulation, the "true" bias should be $ch=.09$. The simulated distribution of the bias is centered around .0895, very close to this value.

The derivation above also lets us determine the direction of the bias from the correlation of $x$ and $z$ and the sign of $c$ (the true partial effect of $z$ on $Y$). If both have the same sign, the estimate of $\beta$ is biased upward; if the signs differ, it is biased downward.

## Conclusion

The case above was fairly general, but it has concrete applications. For example, if an individual's income is a function of both years of education and years of work experience, then omitting one variable will bias the slope estimate of the other.

# Motor Vehicle Collision Density in NYC

In a previous post, I visualized crime density in Boston using R’s ggmap package. In this post, I use ggmap to visualize vehicle accidents in New York City. R code and data are included in this post.

The data comes from NYC Open Data; my cut ranges from 2012 to 2015. The data tracks the type of vehicle, the name of the street on which the accident occurred, and the longitude and latitude coordinates of the accident. Both coordinates are saved together in a single character variable called "LOCATION".

Below, I load the ggmap and gridExtra packages, load the data, drop all accidents without location coordinates, and parse the LOCATION variable to get the longitude and latitude coordinates. I also parse the date variable to create a year variable, and use it to create two data sets: one with all vehicle accidents in 2013 and another with all vehicle accidents in 2014.


library(ggmap)
library(gridExtra)

# keep accidents with location coordinates
d_clean=d[which(regexpr(',',d$LOCATION)!=-1),]
# parse the "LOCATION" variable into longitude and latitude
comm=regexpr(',',d_clean$LOCATION)
d_clean$loc=as.character(d_clean$LOCATION)
d_clean$lat=as.numeric(substr(d_clean$loc,2,comm-1))
d_clean$long=as.numeric(substr(d_clean$loc,comm+1,nchar(d_clean$loc)-1))
# create year variable and subset to 2013 and 2014
d_clean$year=substr(d_clean$DATE,7,10)
d_2013=d_clean[which(d_clean$year=='2013'),c('long','lat')]
d_2014=d_clean[which(d_clean$year=='2014'),c('long','lat')]