Data Analytics and Computational Social Science: HW 3

April Merleaux

knitr::opts_chunk$set(echo = TRUE)

Overview

As part of a larger research project, I am interested in identifying agricultural counties in the 1980s that have an elevated rate of arrest for marijuana sale/manufacture. My research is still exploratory, but broadly I’m interested in farmers who grew marijuana in an attempt to avoid farm foreclosure during the farm mortgage crisis of the 1980s. While there are news reports, I’m interested in whether I can detect any broader patterns in the data. The FBI Uniform Crime Reporting Program Data Series is available for download from the ICPSR. I previously downloaded County Level Detailed Arrest and Crime reports from the IPCSR website for Adults in 1987. I chose that year because there is also Agricultural Census data available, which I hope to use later. Because this particular product was only available for download as an ASCI file, with either SAS or SPSS Setup, I used a web interface for SAS’s JMP product to convert the data to a .xlsx The FBI data is here: https://doi.org/10.3886/ICPSR09252.v1

library(tidyverse)
library(readxl)
crime1987 <- read_excel("~/DATA/crime data/da09252.xlsx",
col_types = c("numeric", "numeric", "numeric",
"numeric", "text", "text", "numeric",
"numeric", "numeric", "numeric",
"numeric", "numeric", "numeric",
"numeric", "numeric", "numeric",
"numeric", "numeric", "numeric",
"numeric", "numeric", "numeric",
"numeric", "numeric", "numeric",
"numeric", "numeric", "numeric",
"numeric", "numeric", "numeric",
"numeric", "numeric", "numeric",
"numeric", "numeric", "numeric",
"numeric", "numeric", "numeric",
"numeric", "numeric", "numeric",
"numeric", "numeric", "numeric",
"numeric", "numeric", "numeric",
"numeric", "numeric", "numeric",
"numeric"))
View(crime1987)

Looking at the Data

On first glance, I see that there are 3132 rows and 53 columns. The variables I am interested in are the county, population, the number of arrests for MARIJUANA SALE/MANUF and MARIJUANA POSSESSION. Other than county, these are continuous numeric variables.
One immediate problem is with the FIPS codes. This data identifies a county by its county and state FIPS codes, in separate columns. The codebook identifies the counties with the state and county merged and separated by a “.”. For the county codes that have fewer than 3 digits, I need to add 0s to make them 3 digits. I am going to start by shortening the names of those two columns. I already converted them to chr in the read in stageto make it easier to combine.

crime1987 <- crime1987 %>%
  rename(FIPS_state = `FIPS STATE CODE - SEE CODEBOOK`,
         FIPS_county = `FIPS COUNTY CODE - SEE CODEBOOK`)

#To add the leading zeros to the county codes, I am going to use str_pad from the stringr package  
crime1987$FIPS_county <- str_pad(crime1987$FIPS_county, 3, pad = "0")

#unite() combines the state and county into a single variable, FIPS. However, it does not preserve the original two variables.  
crime1987 <- unite(crime1987, FIPS, FIPS_state, FIPS_county, sep = ".")
show(crime1987)

# A tibble: 3,132 x 52
   `ICPSR STUDY NU~ `ICPSR EDITION ~ `ICPSR PART NUM~ `ICPSR SEQ ID N~
              <dbl>            <dbl>            <dbl>            <dbl>
 1             9252                1                5                1
 2             9252                1                5                2
 3             9252                1                5                3
 4             9252                1                5                4
 5             9252                1                5                5
 6             9252                1                5                6
 7             9252                1                5                7
 8             9252                1                5                8
 9             9252                1                5                9
10             9252                1                5               10
# ... with 3,122 more rows, and 48 more variables: FIPS <chr>,
#   POPULATION <dbl>, GRAND TOTAL <dbl>, PART 1 TOTAL <dbl>,
#   VIOLENT CRIMES <dbl>, PROPERTY CRIMES <dbl>, MURDERS <dbl>,
#   RAPES <dbl>, ROBBERIES <dbl>, AGGRAVATED ASSAULTS <dbl>,
#   BURGLARIES <dbl>, LARCENIES <dbl>, AUTO THEFTS <dbl>,
#   ARSONS <dbl>, OTHER ASSAULTS <dbl>, FORGERY/COUNTERFEITING <dbl>,
#   FRAUD <dbl>, EMBEZZLEMENT <dbl>, HAVE STOLEN PROPERTY <dbl>,
#   VANDALISM <dbl>, WEAPONS VIOLATIONS <dbl>,
#   PROSTITUTION/COMM VICE <dbl>, SEX OFFENSES <dbl>,
#   DRUG ABUSE VIOL. TOTAL <dbl>, DRUG ABUSE SALE/MANUFAC <dbl>,
#   OPIUM/COCAINE SALE/MANUF <dbl>, MARIJUANA SALE/MANUF <dbl>,
#   SYNTHETIC DRUG SALE/MANU <dbl>, OTHER:DANG NON-NARCOTICS <dbl>,
#   DRUG POSSESSION SUBTOTAL <dbl>, OPIUM/COCAINE POSSESSION <dbl>,
#   MARIJUANA POSSESSION <dbl>, SYNTH NARCOTICS POSSESSN <dbl>,
#   DRUG POSSESSION: OTHER <dbl>, GAMBLING: TOTAL <dbl>,
#   BOOKMAKING (HORSE-SPORT) <dbl>, NUMBERS & LOTTERY <dbl>,
#   GAMBLING: ALL OTHER <dbl>, OFFEN AGAINST FAM & CHIL <dbl>,
#   DRIVING UNDER INFLUENCE <dbl>, LIQUOR LAW VIOLATIONS <dbl>,
#   DRUNKENNESS <dbl>, DISORDERLY CONDUCT <dbl>, VAGRANCY <dbl>,
#   ALL OTH OFFS EX TRAFFIC <dbl>, SUSPICION <dbl>,
#   CURFEW, LOITERING LAWS <dbl>, RUN-AWAYS (JUVENILES) <dbl>

I now have the FIPS code in a form that matches the county names in the code book. On the other hand, this format for the FIPS code is not standard. When I try to combine this data with other data sources, I may need to mess around with the FIPS codes again. In the meantime, I also will have to match these codes with the counties, which I have in another csv. Not going to try to do that right now.

There are some counties that have no data. These arrived as 0s in the data. This isn’t helpful though because 0 is a legitimate value for many of the variables. It seems like the easiest thing is to apply a filter for the population field. If the population is 0, I’m assuming the data is missing, so I’m filtering them out. Now I have 2925 rows.

crime1987 <- filter(crime1987, POPULATION != 0)

Narrowing the focus

I am going to add a variable for arrest rate for marijuana manufacture and marijuana possession, select only the relevant columns, and create some summary statistics.

#the operation wasn't working correctly until I renamed the columns
crime1987 <- crime1987 %>%
rename(MJ_manufact_arrests = "MARIJUANA SALE/MANUF", MJ_possession_arrests = "MARIJUANA POSSESSION") %>%
mutate(MJ_manufact_RATE = MJ_manufact_arrests / POPULATION * 100000,
       MJ_possession_RATE = MJ_possession_arrests / POPULATION * 100000)
#select the relevant columns and eliminate 0s
MJcrime1987 <- filter(crime1987, MJ_manufact_RATE > 0) %>%
  select(FIPS, POPULATION, MJ_manufact_arrests, MJ_manufact_RATE, MJ_possession_arrests, MJ_possession_RATE)
#find the mean and other summary stats for the arrest rate, after filtering out the 0s and selecting only the relevant columns
MJcrime1987 %>%
summarize(mean(MJ_manufact_RATE), mean(MJ_possession_RATE))

# A tibble: 1 x 2
  `mean(MJ_manufact_RATE)` `mean(MJ_possession_RATE)`
                     <dbl>                      <dbl>
1                     36.1                       115.

show(MJcrime1987)

# A tibble: 1,889 x 6
   FIPS  POPULATION MJ_manufact_arr~ MJ_manufact_RATE MJ_possession_a~
   <chr>      <dbl>            <dbl>            <dbl>            <dbl>
 1 1.003      92364               23             24.9              114
 2 1.005      25492                5             19.6               28
 3 1.009      39295                5             12.7               35
 4 1.015     124738               14             11.2              223
 5 1.017      40102               19             47.4               29
 6 1.019      19345               11             56.9               17
 7 1.021      31336                6             19.1               28
 8 1.023      17129                7             40.9                8
 9 1.031      40505               17             42.0               58
10 1.033      54913               14             25.5               76
# ... with 1,879 more rows, and 1 more variable:
#   MJ_possession_RATE <dbl>

#distribution of each variable
ggplot(MJcrime1987, aes(MJ_manufact_RATE)) + geom_histogram()

ggplot(MJcrime1987, aes(MJ_possession_RATE)) + geom_histogram()

#is there a correlation between manufacture and possession?
ggplot(data = MJcrime1987) +
geom_point(mapping = aes(x = MJ_possession_RATE, y = MJ_manufact_RATE))

#yes? I am not sure what to do next. Should I transform the data with a log function since it is so skewed?

Research questions

Even though there is clearly a linear correlation between the rate of possession arrests and the rate of manufacture/sale arrests, I am most interested in the situation where there is an inverse relation: high manufacturing arrest rate and comparatively low possession rate. This is because I’m interested in cannabis producers who were generally politically and socially conservative, from rural parts of the country where there was not a lot of cannabis use. I am going to try to filter for MJ_manufact_RATE > mean and MJ_possession_RATE < mean.

MJcrime1987 %>%
  filter(MJ_possession_RATE < 116 & MJ_manufact_RATE > 36)

# A tibble: 263 x 6
   FIPS  POPULATION MJ_manufact_arr~ MJ_manufact_RATE MJ_possession_a~
   <chr>      <dbl>            <dbl>            <dbl>            <dbl>
 1 1.017      40102               19             47.4               29
 2 1.019      19345               11             56.9               17
 3 1.023      17129                7             40.9                8
 4 1.059      28212               31            110.                32
 5 1.089     235470              119             50.5              115
 6 1.121      77080               36             46.7               58
 7 2.060       1180                1             84.7                1
 8 2.090      66461               54             81.3               76
 9 5.021      19535               21            107.                17
10 5.029      19334               19             98.3               10
# ... with 253 more rows, and 1 more variable:
#   MJ_possession_RATE <dbl>

That narrows things down to 263 counties, which is potentially useful. I’m not sure this is the most effective thing, though. For example, I checked on the record identified as 1.059, which is Franklin County, Alabama. Franklin County is non-Urban, predominantly manufacturing, and has a nearby National Forest. Proximity to a National Forest is going to have to be another criteria I introduce. I can also tell just from glancing at it that I might be interested in a larger spread between the two rates. This may be a job for standard deviations, eg I’m looking for counties with a manufacture rate more than 1 SD from the mean and possession rate less than 1 SD below the mean. Not sure if that’s right, or how to set it up.

If I could identify counties that met those criteria, I could use my old standard qualitative historical research methods to try to locate sources that would tell me the stories of some of those farmers. I am also interested in knowing if there are any common characteristics of counties like that. For example, do they have unusually high or low rates of farm foreclosure or farm debt? What is the typical farm size? Typical capital held by farms? Do they tend to specialize in one of the commodity crops whose price and markets were/are international and unstable in the 1980s? To answer some of those questions, I will need to add data from at least two other datasets. First, the USDA ERS has County Typology codes: https://www.ers.usda.gov/data-products/county-typology-codes/ and Rural-Urban Continuum Codes: https://www.ers.usda.gov/data-products/rural-urban-continuum-codes/ Second, the 1987 Census of Agriculture has detailed information about farm characteristics, crops, etc. I am not going to try to do that for this homework, though!

Comment on this article Share:

HW 3

Overview

Looking at the Data

Narrowing the focus

Research questions

Reuse

Citation