Stop and Frisk: Spatial Analysis of Racial Differences

In my last post, I compiled and cleaned publicly available data on over 4.5 million stops over the past 11 years.

I also presented preliminary summary statistics showing that blacks had been consistently stopped 3-6 times more than whites over the last decade in NYC.

Since the last post, I managed to clean and reformat the coordinates marking the locations of the stops. While I compiled data from 2003-2014, coordinates were available only for 2004 and 2007-2014. All the code can be found in my GitHub repository.

My goals were to:

• See if blacks and whites were being stopped at the same locations
• Identify areas with especially high numbers of stops and see how those areas changed over time.

Killing two birds with one stone, I made density plots that identify areas of high and low stop density. Snapshots were taken at two-year intervals from 2007 to 2013. Stops of whites are shown as red contour lines and stops of blacks as blue shaded regions.
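Density contours of this kind can be sketched with a 2D kernel density estimate. Here is a minimal example using MASS::kde2d on simulated coordinates; the variable names, bandwidths, and fake data below are my assumptions for illustration, not the post's actual code:

```r
library(MASS)  # for kde2d

set.seed(1)
# fake stop coordinates standing in for the cleaned NYPD data
lon <- rnorm(1000, mean = -73.95, sd = 0.05)
lat <- rnorm(1000, mean = 40.75, sd = 0.05)

# 2D kernel density estimate on a 100 x 100 grid
dens <- kde2d(lon, lat, n = 100)

# shaded density surface with contour lines overlaid
image(dens, col = hcl.colors(20, "Blues", rev = TRUE),
      xlab = "Longitude", ylab = "Latitude")
contour(dens, add = TRUE, col = "red")
```

Overlaying one group's contours on another group's shaded density, as in the plots above, just repeats the `kde2d` call on each subset of stops.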

Stop and Frisk: Blacks stopped 3-6 times more than Whites over 10 years

The NYPD provides publicly available data on stop and frisks with data dictionaries, located here. The data, ranging from 2003 to 2014, contains information on over 4.5 million stops. Several variables such as the age, sex, and race of the person stopped are included.

I wrote some R code to clean and compile the data into a single .RData file. The code and clean data set are available in my GitHub repository. The purpose of this post is simply to make this clean, compiled dataset available to others to combine with their own datasets and reach interesting/meaningful conclusions.

Here are some preliminary (unadjusted) descriptive statistics:

Simulating Endogeneity

Introduction

The topic in this post is endogeneity, which can severely bias regression estimates. I will specifically simulate endogeneity caused by an omitted variable. In future posts in this series, I’ll simulate other specification issues such as heteroskedasticity, multicollinearity, and collider bias.

The Data-Generating Process

Consider the data-generating process (DGP) of some outcome variable $Y$:

$Y = a+\beta x+c z + \epsilon_1$

$\epsilon_1 \sim N(0,\sigma^{2})$

For the simulation, I set parameter values for $a$, $\beta$, and $c$ and simulate positively correlated independent variables, $x$ and $z$ (N=500).

# simulation parameters
set.seed(144);
ss=500; trials=5000;
a=50; b=.5; c=.1;
d=25; h=.9;

# generate two independent variables
x=rnorm(n=ss,mean=1000,sd=50);
z=d+h*x+rnorm(ss,0,10)


The Simulation

The simulation will estimate the two models below. The first model is correct in the sense that it includes all terms in the actual DGP. However, the second model omits a variable that is present in the DGP; the omitted term $cz$ is instead absorbed into the error term.

$(1)\thinspace Y = a+\beta x+c z + \epsilon_1$

$(2) \thinspace Y = a+\beta x + \epsilon_1$

This second model will yield a biased estimator of $\beta$, and its variance will be biased as well. This is because $x$ is endogenous, which is a fancy way of saying it is correlated with the error term. With $z$ omitted, the composite error becomes $\epsilon_1+cz$, and since $cor(x,z)>0$, it follows that $cor(x,\epsilon_1+cz)>0$. To show this, I run the simulation below with 5,000 iterations. For each iteration, I construct the outcome variable $Y$ using the DGP. I then run a regression to estimate $\beta$, first with model 1 and then with model 2.

sim=function(endog){
# assume normal error with constant variance to start
e=rnorm(n=ss,mean=0,sd=10)
y=a+b*x+c*z+e
# select model specification: omit z (endogenous) or include it
if(endog==TRUE){ fit=lm(y~x) }else{ fit=lm(y~x+z)}
return(fit$coefficients)
}

# run simulation - with and without endogeneity
sim_results=t(replicate(trials,sim(endog=FALSE)))
sim_results_endog=t(replicate(trials,sim(endog=TRUE)))


Simulation Results

This simulation yields two different sampling distributions for $\beta$. Note that I have set the true value to $\beta=.5$. When $z$ is not omitted, the simulation yields the green sampling distribution, centered around the true value; the average estimate across all iterations is 0.4998. When $z$ is omitted, the simulation yields the red sampling distribution, centered around 0.5895. It is biased upward from the true value of .5 by .0895. Moreover, the variance of the biased sampling distribution is much smaller than the true sampling variance of $\beta$. This compromises the ability to perform any meaningful inference about the true parameter.

Bias Analysis

The bias in $\beta$ can be derived analytically. Consider that in model 1 (presented above), $x$ and $z$ are related by:

$(3)\thinspace z = d+hx+\epsilon_2$

Substituting equation 3 for $z$ in equation 1 and re-ordering:

$Y = a+\beta x+c(d+hx+\epsilon_2) + \epsilon_1$

$(4)\thinspace Y = (a+cd)+(\beta+ch)x + (\epsilon_1+c\epsilon_2)$

When variable $z$ is omitted, it is actually equation 4 that is estimated, so the estimate of $\beta$ is biased by the quantity $ch$. In this case, since $x$ and $z$ are positively correlated by construction and both slope coefficients are positive, the bias will be positive. According to the parameters of the simulation, the “true” bias should be $ch=.09$. The distribution of the simulated bias is centered around .0895, very close to that value.

The derivation above also lets us determine the direction of the bias from the correlation of $x$ and $z$ and the sign of $c$ (the true partial effect of $z$ on $Y$). If both have the same sign, the estimate of $\beta$ will be biased upward. If the signs differ, it will be biased downward.
Conclusion

The case above was fairly general, but it has concrete applications. For example, if we believe that an individual’s income is a function of years of education and years of work experience, then omitting either variable will bias the slope estimate of the other.


Interactive Cobb-Douglas Web App with R

I used Shiny to make an interactive Cobb-Douglas production surface in R. It reacts to the user’s choice of labor and capital shares and lets the user rotate the surface. The contour plot (isoquants) is also dynamic.

Shiny works using two R scripts stored in the same folder. One script handles the user interface (UI) side and the other works on the server side. On the UI side, I take user inputs for figure rotation and capital/labor shares via sliders and output a plot of the surface and isoquants.

library(shiny)

shinyUI(pageWithSidebar(
headerPanel("Cobb-Douglas Production Function"),
sidebarPanel(
sliderInput("L","Share of Labor:", min=0, max=1, value=.5, step=.1),
sliderInput("C","Share of Capital:", min=0, max=1, value=.5, step=.1),
sliderInput("p1","Rotate Horizontal:", min=0, max=180, value=40, step=10),
sliderInput("p2","Rotate Vertical:", min=0, max=180, value=20, step=10)
),
mainPanel(
plotOutput("p"),
plotOutput("con"))
))

The manipulation of the inputs is done on the server side:

library(shiny)
options(shiny.reactlog=TRUE)

shinyServer(function(input,output){
observe({
# create x, y ranges for plots
x=seq(0,200,5)
y=seq(0,200,5)
# Cobb-Douglas model
model<-function(a,b){ (a**(as.numeric(input$L)))*(b**(as.numeric(input$C))) }
# generate surface values at given x-y points
z<-outer(x,y,model)
# create color gradient for surface
pal<-colorRampPalette(c("#f2f2f2","Blue"))
colors<-pal(max(z))
colors2<-pal(max(y))
# plot functions
output$p<-renderPlot({persp(x,y,z,
theta=as.numeric(input$p1), phi=as.numeric(input$p2),
col=colors[z],
xlab="Share of Labor",
ylab="Share of Capital",
zlab="Output"
)})
output$con<-renderPlot({contour(x,y,z,
col=colors2[y],
xlab="Share of Labor",
ylab="Share of Capital"
)})
})
})


Okun’s Law

Relative to previous recoveries, the current recovery has seen unemployment fall without a corresponding acceleration in GDP. For any given quarterly decrease in unemployment, the accompanying quarterly increase in GDP has been below the historical average.
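Okun’s law itself is often stated as a simple rule of thumb. A common textbook form (the coefficients below are widely cited approximations for the postwar US, not estimates from this post) is:

$\Delta u \approx -0.5\,(g - 3)$

where $\Delta u$ is the percentage-point change in the unemployment rate and $g$ is real GDP growth in percent, with roughly 3% standing in for trend growth. The observation above amounts to saying that recent quarters have fallen below this fitted relationship.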

Cobb-Douglas Visualisation – I

I used Excel to construct a visualization of the Cobb-Douglas production function (written out explicitly in the graph titles).

The production function expresses the output of any given firm as a function of two inputs (labor and capital) and two parameters (alpha and beta). When alpha and beta sum to 1, it can be shown that they represent labor’s and capital’s shares of output, respectively.

This condition also means that the firm is operating with constant returns to scale: when a firm expands all of its inputs by a certain percentage, output increases by the same percentage.
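Constant returns to scale can be verified directly from the functional form. Writing the production function as $Q = L^{\alpha}K^{\beta}$ and scaling both inputs by a factor $\lambda$:

$(\lambda L)^{\alpha}(\lambda K)^{\beta} = \lambda^{\alpha+\beta}\,L^{\alpha}K^{\beta} = \lambda Q \quad \text{when } \alpha+\beta=1$

so, for example, doubling both inputs exactly doubles output.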

We can plot each amount of labor, capital, and output in x-y-z space if we specify alpha and beta.

We do this for labor and capital each ranging from 1 to 100, with alpha = beta = .5. The result is the Cobb-Douglas production surface in which labor and capital each account for a 50% share of output.

Notice that the lines separating the differently colored bands are equally spaced. This is a property of constant returns to scale.

When labor and capital expand together, output rises proportionately.

It is also useful to view the surface from above.

Those bowed curves are called isoquants; for this production function they are rectangular hyperbolas. They represent the different combinations of labor and capital that produce the same (“iso”) quantity of output (“quant”). For example, L=4 and K=4, L=16 and K=1, and L=1 and K=16 all produce the Q=4 level of output. The isoquant for Q=4 simply connects these points. As the curves move northeast, they plot the combinations for higher levels of output.
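With alpha = beta = .5, the isoquants can be written explicitly. Setting $Q = L^{.5}K^{.5}$ and solving for $K$:

$K = \frac{Q^2}{L}$

For $Q=4$ this gives $K = 16/L$, which passes through (1,16), (4,4), and (16,1), the combinations listed above.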

A particular firm’s efficient combination depends on the price of its inputs, but that’s for another day.

Unstable Market

The necessary and sufficient condition for convergence is that the absolute value of the slope of the supply curve be greater than the absolute value of the slope of the demand curve. If the supply curve is instead flatter than the demand curve, price and quantity diverge from equilibrium over time.
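This adjustment process can be made concrete with a minimal linear cobweb sketch (the notation here is mine, not from the original post). Suppose demand is $Q_t^d = a - bP_t$ and supply responds to last period’s price, $Q_t^s = c + dP_{t-1}$, with $b,d>0$. Market clearing gives the price dynamics:

$P_t = \frac{a-c}{b} - \frac{d}{b}\,P_{t-1}$

which converge to equilibrium if and only if $d/b<1$. In a standard price-quantity diagram the demand curve’s slope is $-1/b$ and the supply curve’s slope is $1/d$, so $d<b$ is exactly the condition that the supply curve be steeper than the demand curve.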