Tag Archives: economics

Stop and Frisk: Spatial Analysis of Racial Differences

Stops in 2014. Red lines indicate high white stop density areas and blue shades indicate high black stop density areas.
Notice that high white stop density areas are very different from high black stop density areas.
The star in Brooklyn marks the location of officers Liu’s and Ramos’ deaths. The star on Staten Island marks the location of Eric Garner’s death.

In my last post, I compiled and cleaned publicly available data on over 4.5 million stops over the past 11 years.

I also presented preliminary summary statistics showing that blacks had been consistently stopped 3-6 times more than whites over the last decade in NYC.

Since the last post, I managed to clean and reformat the coordinates marking the location of the stops. While I compiled data from 2003-2014, coordinates were available for year 2004 and years 2007-2014. All the code can be found in my GitHub repository.

My goals were to:

  • See if blacks and whites were being stopped at the same locations
  • Identify areas with especially high numbers of stops and see how these areas changed over time.

Killing two birds with one stone, I made density plots to identify areas with high and low stop densities. Snapshots were taken at two-year intervals from 2007 to 2013. Stops of whites are indicated by red contour lines and stops of blacks by blue shades.
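A minimal sketch of this kind of overlay, using simulated coordinates rather than the actual stop data. The data frames and column names are hypothetical, and `MASS::kde2d` is only one possible density estimator, not necessarily the one used for the plots in this post.

```r
library(MASS)  # kde2d: two-dimensional kernel density estimate

set.seed(1)
# hypothetical coordinates; a real analysis would use the cleaned stop locations
black_stops <- data.frame(x = rnorm(2000, 0, 1), y = rnorm(2000, 0, 1))
white_stops <- data.frame(x = rnorm(2000, 2, 1), y = rnorm(2000, 1.5, 1))

# evaluate both densities on a common grid so they can be overlaid
lims <- c(range(c(black_stops$x, white_stops$x)),
          range(c(black_stops$y, white_stops$y)))
dens_black <- kde2d(black_stops$x, black_stops$y, n = 100, lims = lims)
dens_white <- kde2d(white_stops$x, white_stops$y, n = 100, lims = lims)

# blue shades for one group, red contour lines for the other
image(dens_black, col = colorRampPalette(c("white", "blue"))(20),
      xlab = "x coordinate", ylab = "y coordinate")
contour(dens_white, add = TRUE, col = "red", drawlabels = FALSE)
```

Using the same grid limits for both groups is what makes the two layers directly comparable on one map.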



Stop and Frisk: Blacks stopped 3-6 times more than Whites over 10 years

The NYPD provides publicly available data on stop and frisks with data dictionaries, located here. The data, ranging from 2003 to 2014, contains information on over 4.5 million stops. Several variables such as the age, sex, and race of the person stopped are included.

I wrote some R code to clean and compile the data into a single .RData file. The code and clean data set are available in my GitHub repository. The purpose of this post is simply to make this clean, compiled dataset available to others to combine with their own datasets and reach interesting/meaningful conclusions.

Here are some preliminary (unadjusted) descriptive statistics:



Simulating Endogeneity

Introduction

The topic in this post is endogeneity, which can severely bias regression estimates. I will specifically simulate endogeneity caused by an omitted variable. In future posts in this series, I’ll simulate other specification issues such as heteroskedasticity, multicollinearity, and collider bias.

The Data-Generating Process

Consider the data-generating process (DGP) of some outcome variable Y :

Y = a+\beta x+c z + \epsilon_1

\epsilon_1 \sim N(0,\sigma^{2})

For the simulation, I set parameter values for a , \beta , and c and simulate positively correlated independent variables, x and z (N=500).

# simulation parameters
ss=500; trials=5000;
a=50; b=.5; c=.1;
d=25; h=.9;

# generate two independent variables
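The excerpt cuts off before the variables are actually generated. One way this step could look, assuming x is drawn from a normal distribution and z is built from x using d and h, as in equation 3 further down. The mean and standard deviations are illustrative assumptions, not values from the post.

```r
set.seed(42)
ss <- 500           # sample size, as in the parameters above
d  <- 25; h <- .9   # intercept and slope linking z to x

x <- rnorm(ss, mean = 50, sd = 10)            # assumed distribution for x
z <- d + h * x + rnorm(ss, mean = 0, sd = 5)  # z depends on x plus noise (assumed sd)

cor(x, z)  # positive by construction because h > 0
```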

The Simulation

The simulation will estimate the two models below. The first model is correct in the sense that it includes all terms in the actual DGP. The second model omits z , a variable that is present in the DGP. The omitted term is instead absorbed into the error term, u .

(1)\thinspace Y = a+\beta x+c z + \epsilon_1

(2) \thinspace Y = a+\beta x + u, \quad u = cz + \epsilon_1

This second model yields a biased estimator of \beta , and its estimated variance will be biased as well. This is because x is endogenous, which is a fancy way of saying it is correlated with the error term, u . Since cor(x,z)>0 , c>0 , and u=cz+\epsilon_1 , it follows that cor(x,u)>0 . To show this, I run the simulation below with 5,000 iterations. For each iteration, I construct the outcome variable, Y , using the DGP. I then run a regression to estimate \beta , first with model 1 and then with model 2.

 # assume normal error with constant variance to start
 # Select data generation process
 if(endog==TRUE){ fit=lm(y~x) }else{ fit=lm(y~x+z)}

# run simulation - with and without endogeneity
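Putting the pieces together, a self-contained sketch of the whole simulation might look like the following. The function name `sim_beta`, the distributions of x and the error terms, and the reduced number of trials are assumptions made for illustration; c is set to 0.1 here.

```r
set.seed(123)
ss <- 500; trials <- 1000   # fewer trials than the post's 5,000, for speed
a <- 50; b <- .5; c <- .1   # DGP parameters (c assumed to be 0.1)
d <- 25; h <- .9            # parameters linking z to x

# one trial: draw data from the DGP, fit the chosen model, return the slope on x
sim_beta <- function(endog) {
  x <- rnorm(ss, 50, 10)                    # assumed distribution of x
  z <- d + h * x + rnorm(ss, 0, 5)          # z correlated with x
  y <- a + b * x + c * z + rnorm(ss, 0, 1)  # outcome from the full DGP
  fit <- if (endog) lm(y ~ x) else lm(y ~ x + z)
  unname(coef(fit)["x"])
}

beta_full    <- replicate(trials, sim_beta(endog = FALSE))
beta_omitted <- replicate(trials, sim_beta(endog = TRUE))

mean(beta_full)     # should be close to b
mean(beta_omitted)  # should be close to b + c * h
```

Comparing `hist(beta_full)` with `hist(beta_omitted)` gives the two sampling distributions discussed in the results.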

Simulation Results

This simulation yields two different sampling distributions for \beta . Note that I have set the true value to \beta=.5 . When z is not omitted, the simulation yields the green sampling distribution, centered around the true value. The average value across all simulations is 0.4998. When z is omitted, the simulation yields the red sampling distribution, centered around 0.5895. It’s biased upward from the true value of .5 by .0895. Moreover, the variance of the biased sampling distribution is much smaller than the true variance around \beta . This compromises the ability to perform any meaningful inferences about the true parameter.

End2

Bias Analysis

The bias in \beta can be derived analytically. Consider that in model 1 (presented above), x and z are related by:

(3)\thinspace z = d+hx+\epsilon_2

Substituting z in equation 1 with equation 3 and re-ordering:

 Y = a+\beta x+c (d+hx+\epsilon_2) + \epsilon_1

 (4)\thinspace Y = (a+cd)+(\beta+ch) x + (\epsilon_1+c\epsilon_2)

When omitting variable z , it is actually equation 4 that is estimated. It can be seen that the estimate of \beta is biased by the quantity ch . In this case, since x and z are positively correlated by construction ( h>0 ) and c is positive, the bias will be positive. According to the parameters of the simulation, the “true” bias should be ch=.09. Here is the distribution of the bias; it is centered around .0895, very close to the true bias value.

bias2

The derivation above also lets us determine the direction of bias from knowing the correlation of x and z as well as the sign of c (the true partial effect of z on y ). If both are the same sign, then the estimate of \beta will be biased upward. If the signs differ, then the estimate of \beta will be biased downward.

Conclusion

The case above was pretty general, but has particular applications. For example, if we believe that an individual’s income is a function of years of education and years of work experience, then omitting one variable will bias the slope estimate of the other whenever the two are correlated.
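Returning to the bias derivation, the same decomposition holds exactly within any single sample: the slope from the regression that omits z equals the slope on x from the full regression plus the coefficient on z times the slope from regressing z on x. A quick check, with parameter values and distributions that are illustrative assumptions rather than the post's actual settings:

```r
set.seed(7)
n <- 500
x <- rnorm(n, 50, 10)
z <- 25 + .9 * x + rnorm(n, 0, 5)
y <- 50 + .5 * x + .1 * z + rnorm(n)

long  <- coef(lm(y ~ x + z))  # full model
short <- coef(lm(y ~ x))      # model with z omitted
aux   <- coef(lm(z ~ x))      # auxiliary regression of z on x

# slope from the short regression vs. the decomposition
short["x"]
long["x"] + long["z"] * aux["x"]
```

The two printed values agree up to floating-point error, which is the sample analogue of the \beta + ch term in equation 4.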

Interactive Cobb-Douglas Web App with R

Screen Shot 2014-11-04 at 7.24.02 PM

I used Shiny to make an interactive Cobb-Douglas production surface in R. It reacts to the user’s chosen shares of labor and capital and allows the user to rotate the surface. The contour plot (isoquants) is also dynamic.

Shiny works using two R scripts stored in the same folder (conventionally ui.R and server.R). One script handles the user interface (UI) side and the other handles the server side.

On the UI side, I take user inputs for figure rotations and capital/labor shares via sliders and output a plot of the surface and isoquants.


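library(shiny)

# the wrapper calls (shinyUI, pageWithSidebar, sidebarPanel) are assumed;
# the excerpt shows only their arguments
shinyUI(pageWithSidebar(
  headerPanel("Cobb-Douglas Production Function"),
  sidebarPanel(
    sliderInput("L","Share of Labor:",
                min=0, max=1, value=.5, step=.1),
    sliderInput("C","Share of Capital:",
                min=0, max=1, value=.5, step=.1),
    sliderInput("p1","Rotate Horizontal:",
                min=0, max=180, value=40, step=10),
    sliderInput("p2","Rotate Vertical:",
                min=0, max=180, value=20, step=10)
  ),
  # any further outputs are cut off in the excerpt
  mainPanel( plotOutput("p") )
))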

The manipulation of the inputs is done on the server side:


  # create x, y ranges for plots
  # Cobb-Douglas Model
  model<-  function(a,b){

  # generate surface values at given x-y points

  # create gradient for surface
  pal<-colorRampPalette(c("#f2f2f2", "Blue"))
  # plot functions
                              xlab="Share of Labor",
                              ylab="Share of Capital",
                              xlab="Share of Labor",
                              ylab="Share of Capital"
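Because the server code above is heavily truncated, here is a stand-alone sketch of the plotting logic outside Shiny. It assumes the production function has the form Q = L^a K^b and an input range of 1 to 100. The fixed values of `a`, `b`, `theta`, and `phi` stand in for the slider inputs, and the function and variable names are hypothetical.

```r
# Cobb-Douglas output for labor l and capital k with exponents a and b
cobb_douglas <- function(l, k, a, b) l^a * k^b

a <- .5; b <- .5        # stand-ins for the "L" and "C" sliders
theta <- 40; phi <- 20  # stand-ins for the rotation sliders

l <- seq(1, 100, length.out = 40)
k <- seq(1, 100, length.out = 40)
q <- outer(l, k, cobb_douglas, a = a, b = b)  # surface values on the grid

pal <- colorRampPalette(c("#f2f2f2", "blue"))

par(mfrow = c(1, 2))
# production surface, rotated by theta (horizontal) and phi (vertical)
persp(l, k, q, theta = theta, phi = phi, col = "lightblue",
      xlab = "Labor", ylab = "Capital", zlab = "Output")
# isoquants: contour lines of equal output
contour(l, k, q, col = pal(10), xlab = "Labor", ylab = "Capital")
```

In a Shiny app, the plotting calls would sit inside `renderPlot()` and read `a`, `b`, `theta`, and `phi` from `input` instead of fixed values.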

Cobb-Douglas Visualisation – I

I used Excel to construct a visualization of the Cobb-Douglas production function (explicitly presented in the graph titles).

The production function expresses the output of any given firm as a function of two inputs (labor and capital) and parameters (alpha and beta). When the sum of alpha and beta equals 1, it can be shown that they represent labor’s and capital’s share of output, respectively.

This condition also means that the firm is operating with constant returns to scale. When a firm expands all of its inputs by a certain percent, output increases by the same percent.
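To see why, assume the production function has the form Q = F(L, K) = L^\alpha K^\beta (the form is not written out in this post, so treat it as an assumption) and scale both inputs by a factor t > 0:

```latex
F(tL, tK) = (tL)^{\alpha} (tK)^{\beta}
          = t^{\alpha+\beta} L^{\alpha} K^{\beta}
          = t^{\alpha+\beta} F(L, K)
```

With \alpha+\beta=1 the factor is exactly t , so output scales in proportion to the inputs. A sum above 1 would give increasing returns to scale, and a sum below 1 decreasing returns.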

We can plot each amount of labor, capital, and output in x-y-z space if we specify alpha and beta.

We do this for labor and capital ranging from 1 to 100 and alpha=beta=.5. The result is the Cobb-Douglas production surface with capital and labor each comprising 50% of the input.

Screen Shot 2013-06-06 at 8.45.31 PM

Notice that the lines that separate the differently colored areas are equally spaced. This is a property of constant returns to scale.

When labor and capital expand together, the level of output rises proportionately.

It is also useful to view the surface from above.

Screen Shot 2013-06-06 at 8.39.47 PM


Those roughly L-shaped curves are called isoquants; with alpha=beta they are rectangular hyperbolas. They represent the different combinations of labor and capital that produce the same (“iso”) quantity of output (“quant”). For example, L=4 and K=4, L=16 and K=1, and L=1 and K=16 all produce a Q=4 level of output. The curve simply connects these points for Q=4. As the curves move away from the origin, they plot the combinations for higher values of output.
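As a quick numerical check of this example, assuming the form Q = L^alpha K^beta with alpha=beta=.5 (the helper names below are hypothetical):

```r
alpha <- .5; beta <- .5
output <- function(l, k) l^alpha * k^beta

# capital needed to reach output q with labor l (solves q = l^alpha * k^beta for k)
isoquant_k <- function(l, q) (q / l^alpha)^(1 / beta)

output(c(4, 16, 1), c(4, 1, 16))      # all three combinations give 4
isoquant_k(c(1, 2, 4, 8, 16), q = 4)  # 16, 8, 4, 2, 1
```

Each halving of capital along the Q=4 isoquant requires a doubling of labor, which is the hyperbola L*K=16.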

A particular firm’s efficient combination depends on the price of its inputs, but that’s for another day.