Stop and Frisk: Blacks stopped 3-6 times more than Whites over 10 years

The NYPD publishes stop and frisk data, along with data dictionaries, located here. The data, ranging from 2003 to 2014, contains information on over 4.5 million stops. Several variables, such as the age, sex, and race of the person stopped, are included.

I wrote some R code to clean and compile the data into a single .RData file. The code and clean data set are available in my GitHub repository. The purpose of this post is simply to make this clean, compiled dataset available to others to combine with their own datasets and reach interesting and meaningful conclusions.

Here are some preliminary (unadjusted) descriptive statistics:

[Figure: StopAndFrisk]

[Figure: Age Distribution of Stopped Persons]

The data shows some interesting trends:

  • Stops increased steadily from 2003 to 2012 and have fallen since.
  • The percentage of stopped persons who were black was consistently 3.5-6.5 times higher than the percentage of stopped persons who were white. Note that this is unadjusted for the proportions of blacks and whites who live in the area. Combining the stop and frisk data with census data may make such an adjustment possible.
  • The data indicates whether officers explained the reason for the stop to the stopped person. According to the data, police gave an explanation about 98-99% of the time. Of course, this requires a certain level of trust, since the data itself is recorded by police. This statistic does not differ across race or sex.
  • The median age of stopped persons was 24. The distribution was roughly the same across race and sex.
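
As a rough illustration of the second bullet, the sketch below computes the black-to-white ratio from a toy data frame. The column names (`year`, `race`) and the values are made up for illustration and are not the actual NYPD field names. It also checks that the ratio of the two percentages equals the ratio of the two raw counts, since the total number of stops cancels out.

```r
# Toy data: one row per stop. Column names and values are hypothetical.
stops <- data.frame(
  year = c(2011, 2011, 2011, 2011, 2011, 2012, 2012, 2012, 2012),
  race = c("black", "black", "black", "white", "other",
           "black", "black", "white", "other")
)

# Count stops by year and race (rows = year, columns = race)
counts <- table(stops$year, stops$race)

# Share of each year's stops by race (each row sums to 1)
shares <- prop.table(counts, margin = 1)

# Ratio of percentages and ratio of raw counts, per year
ratio_pct   <- shares[, "black"] / shares[, "white"]
ratio_count <- counts[, "black"] / counts[, "white"]

print(ratio_pct)
print(all.equal(ratio_pct, ratio_count))
```

In this toy data the ratio is 3 for 2011 and 2 for 2012. With the real data, the same steps would apply after mapping the actual race codes from the data dictionary.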

A few notes on the data:

  • The raw data is saved as CSV files, one file for each year. However, the same variables are not tracked in each year. The .RData file on GitHub only contains select variables.
  • The importing and cleaning code can take about 15 minutes to run.
  • All stops in all years have coordinates marking the location of the stop; however, I have not yet been able to make sense of them. I plan to publish another post with some spatial analyses.
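
Because the yearly files do not share the same variables, one way to stack them is to keep only the columns that appear in every year. The following is a minimal sketch of that idea using two toy data frames; the column names are hypothetical, and this is not necessarily how the code on GitHub selects its variables.

```r
# Toy stand-ins for two yearly files; column names are hypothetical.
d2013 <- data.frame(year = 2013, age = c(19, 24), sex = c("M", "F"),
                    extra13 = c(1, 2))
d2014 <- data.frame(year = 2014, age = c(31, 22), sex = c("M", "M"),
                    extra14 = c("a", "b"))

yearly <- list(d2013, d2014)

# Variables present in every year
common <- Reduce(intersect, lapply(yearly, names))

# Keep only the shared variables, then stack the years
combined <- do.call(rbind, lapply(yearly, function(d) d[, common, drop = FALSE]))

print(common)
print(dim(combined))
```

Variables that were renamed between years would need to be harmonized by hand before this step, since `intersect` only matches identical names.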

The coding for this was particularly interesting because I had never used R to download ZIP files from the web. I reproduced this portion of the code below. It produces one dataset for each year from 2013 to 2014.

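for(i in 2013:2014){
 # tempfile() returns a new temporary file path on every call
 temp <- tempfile()
 url <- paste("http://www.nyc.gov/html/nypd/downloads/zip/analysis_and_planning/",i,"_sqf_csv.zip",sep='')
 # download the year's ZIP archive to the temporary file
 download.file(url,temp)
 # read <year>.csv straight out of the archive and store it as d<year>
 assign(paste("d",i,sep=''),read.csv(unz(temp, paste(i,".csv",sep=''))))
}
# removes the temporary file created in the last iteration
unlink(temp)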
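
Since `tempfile()` returns a new path on each call, a loop like the one above creates one temporary file per year. A minimal sketch of a variant that deletes each temporary file as soon as it has been read, and collects the results in a named list rather than using `assign()`, might look like this. The helper names `sqf_url` and `read_sqf_year` are hypothetical, and the URL pattern is simply copied from the loop above, so it may no longer resolve.

```r
# Build the download URL for a given year (same pattern as the loop above)
sqf_url <- function(year) {
  paste0("http://www.nyc.gov/html/nypd/downloads/zip/analysis_and_planning/",
         year, "_sqf_csv.zip")
}

# Download one year's ZIP, read <year>.csv from it, and clean up.
# read_sqf_year is a hypothetical helper name, not part of the original code.
read_sqf_year <- function(year) {
  temp <- tempfile(fileext = ".zip")
  on.exit(unlink(temp))  # remove this year's temp file, even after an error
  download.file(sqf_url(year), temp, mode = "wb")
  read.csv(unz(temp, paste0(year, ".csv")))
}

# Not run here, since it needs network access and the URL may have changed:
# sqf <- lapply(2013:2014, read_sqf_year)
# names(sqf) <- paste0("d", 2013:2014)
```

Keeping the years in a list makes it straightforward to apply the same cleaning step to each of them with `lapply`.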

15 thoughts on “Stop and Frisk: Blacks stopped 3-6 times more than Whites over 10 years”

  1. Do we need to put the last statement “unlink(temp)” inside the loop, like after assign? I thought your code will leave the data for 2014 in the tempdir.

    1. Hi Fei,

      I don’t think we need to unlink every iteration. It’s my understanding that every time you link, it overrides the previous unlink(), so we need to only unlink after the last iteration.

      I verified that 2014 is successfully loaded into the R environment when I run the code, which is available on GitHub at the provided link.

      Let me know if this doesn’t help.

      Thanks!
      Arman

  2. Is the second facet mislabelled? Are you reporting the ratio of the percent of blacks stopped to the percent of white stopped? Or are you actually just reporting the ratio of blacks stopped to whites stopped?

    1. Hi Asrs,
      No, that’s not mislabeled. It’s the ratio of the percentages. They are equivalent, though. Since for a given month, %black= (#black/#stops), %white=(#white/#stops), then Ratio = (%black)/(%white) = (#black)/(#white).

      The reason I used percentages was because I calculated those first. However, the percentages were very different in magnitude, which made plotting the two time series on a single plot impractical. So I just took the ratio of the percentages.

  3. The black vs. white percentages will be a lot more informative once the percentages of blacks vs. whites residing in the patrolled areas are included in the analysis. I hope you are able to dive deeper in this way; it seems the current conclusions are based on the assumption that the population is 50% white and 50% black (or have I misunderstood this?).

    1. Hi Theresa,

      Thanks for reading and for the comment. I agree. The numbers are unadjusted for the proportions in the underlying population and other sources of geographic heterogeneity. I touched on this in a follow-up spatial analysis, where we can clearly see that blacks and whites are being stopped at different locations (consistently over time). This is probably due to the fact that blacks and whites live in different areas, and not due to overt racism.

      The focus of this post was to prepare a clean, multi-year, publicly available, and easily downloadable dataset (see my GitHub) for others to combine with their own analyses and reach more meaningful conclusions.

      1. Thank you for taking the time to do this analysis. Especially the more recent geographic post. This is really interesting and I hope it can propel future unbiased research.

  4. Thanks for this good post. A question I have. How did you determine from the given dataset which variables are of interest or which variables are important for further analysis?

    1. Hi Ashish,

      I picked a few variables I personally found interesting. The focus of this effort was to produce a clean, publicly available, and easily downloadable dataset that others can use and combine with their own data. The focus was not really the summary statistics themselves.

  5. Obviously an inflammatory click bait headline. A useless analysis without knowing, at a minimum, the racial distributions of where (i.e. neighborhoods) the stops occurred. If your goal was to demonstrate a programming concept/feature, it could have been done in a more responsible manner.

    1. Hi Kenny,

      The analysis is “useless” in the sense that, as you state, it does not adjust for the proportions in the underlying population and other sources of geographic heterogeneity. I touched on this in a follow-up spatial analysis post, where we clearly see blacks and whites stopped at different locations. This is likely due to the fact that blacks and whites live/hang out in different areas and not due to racism. In this sense, I go in the direction you suggest.

      However, the goal of this post was not to present useful statistics. It was to present a compiled, clean, publicly available, and easily downloadable dataset (see my GitHub) that others can download and combine with their own data (perhaps census info?) and come to more meaningful conclusions.

      The focus of this post was to prepare a clean, multi-year, publicly available, and easily downloadable dataset for others to combine with their own analyses and reach more meaningful conclusions.

  6. Three tips for beginners to understand this line:

    assign(paste("d",i,sep=''),read.csv(unz(temp, paste(i,".csv",sep=''))))

    1. This line is part of a multi-line “for” loop (loops are discouraged in R)
    2. Read the nested parentheses from the inside out
    3. At the R command line, use the ? (help) function for documentation:
    ?unz

    unz reads (only) single files within zip files, in binary mode. The description is the full path to the zip file, with ‘.zip’ extension if required.

    ?download.file

    The function download.file can be used to download a single file as described by url from the internet and store it in destfile. The url must start with a scheme such as http://, ftp:// or file://.

    Nesting the unz function inside the read.csv function means that you never have to unzip and store the complete csv files (1). Instead, the data is read directly from the zip file into R and then (later in the program, see the GitHub repository) subsets are written out to an .RData file. If desired, the subsetted data could have been written out to .csv files or to a database such as SQLite3 (using RSQLite). Cool.

    (1) This comment (in the GitHub repository) says the files are written out again, but that no longer seems to be the case. The working directory is changed, but the code to write the .csv files out seems to have been deleted, as it is no longer needed; it may have been needed during development or testing:

    #### 2. Save each of the downloaded files as csv…one csv for each year ####
    setwd('…/Stable Markets/Stop and Frisk/data')
