HW3
knitr::opts_chunk$set(echo = TRUE)
As part of a larger research project, I am interested in identifying agricultural counties in the 1980s that have an elevated rate of arrest for marijuana sale/manufacture. My research is still exploratory, but broadly I’m interested in farmers who grew marijuana in an attempt to avoid farm foreclosure during the farm mortgage crisis of the 1980s. While there are news reports, I’m interested in whether I can detect any broader patterns in the data. The FBI Uniform Crime Reporting Program Data Series is available for download from the ICPSR. I previously downloaded County Level Detailed Arrest and Crime reports from the IPCSR website for Adults in 1987. I chose that year because there is also Agricultural Census data available, which I hope to use later. Because this particular product was only available for download as an ASCI file, with either SAS or SPSS Setup, I used a web interface for SAS’s JMP product to convert the data to a .xlsx The FBI data is here: https://doi.org/10.3886/ICPSR09252.v1
library(tidyverse)
library(readxl)
crime1987 <- read_excel("~/DATA/crime data/da09252.xlsx",
col_types = c("numeric", "numeric", "numeric",
"numeric", "text", "text", "numeric",
"numeric", "numeric", "numeric",
"numeric", "numeric", "numeric",
"numeric", "numeric", "numeric",
"numeric", "numeric", "numeric",
"numeric", "numeric", "numeric",
"numeric", "numeric", "numeric",
"numeric", "numeric", "numeric",
"numeric", "numeric", "numeric",
"numeric", "numeric", "numeric",
"numeric", "numeric", "numeric",
"numeric", "numeric", "numeric",
"numeric", "numeric", "numeric",
"numeric", "numeric", "numeric",
"numeric", "numeric", "numeric",
"numeric", "numeric", "numeric",
"numeric"))
View(crime1987)
On first glance, I see that there are 3132 rows and 53 columns. The variables I am interested in are the county, population, the number of arrests for MARIJUANA SALE/MANUF and MARIJUANA POSSESSION. Other than county, these are continuous numeric variables.
One immediate problem is with the FIPS codes. This data identifies a county by its county and state FIPS codes, in separate columns. The codebook identifies the counties with the state and county merged and separated by a “.”. For the county codes that have fewer than 3 digits, I need to add 0s to make them 3 digits. I am going to start by shortening the names of those two columns. I already converted them to chr in the read in stageto make it easier to combine.
#To add the leading zeros to the county codes, I am going to use str_pad from the stringr package
crime1987$FIPS_county <- str_pad(crime1987$FIPS_county, 3, pad = "0")
#unite() combines the state and county into a single variable, FIPS. However, it does not preserve the original two variables.
crime1987 <- unite(crime1987, FIPS, FIPS_state, FIPS_county, sep = ".")
show(crime1987)
# A tibble: 3,132 x 52
`ICPSR STUDY NU~ `ICPSR EDITION ~ `ICPSR PART NUM~ `ICPSR SEQ ID N~
<dbl> <dbl> <dbl> <dbl>
1 9252 1 5 1
2 9252 1 5 2
3 9252 1 5 3
4 9252 1 5 4
5 9252 1 5 5
6 9252 1 5 6
7 9252 1 5 7
8 9252 1 5 8
9 9252 1 5 9
10 9252 1 5 10
# ... with 3,122 more rows, and 48 more variables: FIPS <chr>,
# POPULATION <dbl>, GRAND TOTAL <dbl>, PART 1 TOTAL <dbl>,
# VIOLENT CRIMES <dbl>, PROPERTY CRIMES <dbl>, MURDERS <dbl>,
# RAPES <dbl>, ROBBERIES <dbl>, AGGRAVATED ASSAULTS <dbl>,
# BURGLARIES <dbl>, LARCENIES <dbl>, AUTO THEFTS <dbl>,
# ARSONS <dbl>, OTHER ASSAULTS <dbl>, FORGERY/COUNTERFEITING <dbl>,
# FRAUD <dbl>, EMBEZZLEMENT <dbl>, HAVE STOLEN PROPERTY <dbl>,
# VANDALISM <dbl>, WEAPONS VIOLATIONS <dbl>,
# PROSTITUTION/COMM VICE <dbl>, SEX OFFENSES <dbl>,
# DRUG ABUSE VIOL. TOTAL <dbl>, DRUG ABUSE SALE/MANUFAC <dbl>,
# OPIUM/COCAINE SALE/MANUF <dbl>, MARIJUANA SALE/MANUF <dbl>,
# SYNTHETIC DRUG SALE/MANU <dbl>, OTHER:DANG NON-NARCOTICS <dbl>,
# DRUG POSSESSION SUBTOTAL <dbl>, OPIUM/COCAINE POSSESSION <dbl>,
# MARIJUANA POSSESSION <dbl>, SYNTH NARCOTICS POSSESSN <dbl>,
# DRUG POSSESSION: OTHER <dbl>, GAMBLING: TOTAL <dbl>,
# BOOKMAKING (HORSE-SPORT) <dbl>, NUMBERS & LOTTERY <dbl>,
# GAMBLING: ALL OTHER <dbl>, OFFEN AGAINST FAM & CHIL <dbl>,
# DRIVING UNDER INFLUENCE <dbl>, LIQUOR LAW VIOLATIONS <dbl>,
# DRUNKENNESS <dbl>, DISORDERLY CONDUCT <dbl>, VAGRANCY <dbl>,
# ALL OTH OFFS EX TRAFFIC <dbl>, SUSPICION <dbl>,
# CURFEW, LOITERING LAWS <dbl>, RUN-AWAYS (JUVENILES) <dbl>
I now have the FIPS code in a form that matches the county names in the code book. On the other hand, this format for the FIPS code is not standard. When I try to combine this data with other data sources, I may need to mess around with the FIPS codes again. In the meantime, I also will have to match these codes with the counties, which I have in another csv. Not going to try to do that right now.
There are some counties that have no data. These arrived as 0s in the data. This isn’t helpful though because 0 is a legitimate value for many of the variables. It seems like the easiest thing is to apply a filter for the population field. If the population is 0, I’m assuming the data is missing, so I’m filtering them out. Now I have 2925 rows.
crime1987 <- filter(crime1987, POPULATION != 0)
I am going to add a variable for arrest rate for marijuana manufacture and marijuana possession, select only the relevant columns, and create some summary statistics.
#the operation wasn't working correctly until I renamed the columns
crime1987 <- crime1987 %>%
rename(MJ_manufact_arrests = "MARIJUANA SALE/MANUF", MJ_possession_arrests = "MARIJUANA POSSESSION") %>%
mutate(MJ_manufact_RATE = MJ_manufact_arrests / POPULATION * 100000,
MJ_possession_RATE = MJ_possession_arrests / POPULATION * 100000)
#select the relevant columns and eliminate 0s
MJcrime1987 <- filter(crime1987, MJ_manufact_RATE > 0) %>%
select(FIPS, POPULATION, MJ_manufact_arrests, MJ_manufact_RATE, MJ_possession_arrests, MJ_possession_RATE)
#find the mean and other summary stats for the arrest rate, after filtering out the 0s and selecting only the relevant columns
MJcrime1987 %>%
summarize(mean(MJ_manufact_RATE), mean(MJ_possession_RATE))
# A tibble: 1 x 2
`mean(MJ_manufact_RATE)` `mean(MJ_possession_RATE)`
<dbl> <dbl>
1 36.1 115.
show(MJcrime1987)
# A tibble: 1,889 x 6
FIPS POPULATION MJ_manufact_arr~ MJ_manufact_RATE MJ_possession_a~
<chr> <dbl> <dbl> <dbl> <dbl>
1 1.003 92364 23 24.9 114
2 1.005 25492 5 19.6 28
3 1.009 39295 5 12.7 35
4 1.015 124738 14 11.2 223
5 1.017 40102 19 47.4 29
6 1.019 19345 11 56.9 17
7 1.021 31336 6 19.1 28
8 1.023 17129 7 40.9 8
9 1.031 40505 17 42.0 58
10 1.033 54913 14 25.5 76
# ... with 1,879 more rows, and 1 more variable:
# MJ_possession_RATE <dbl>
#distribution of each variable
ggplot(MJcrime1987, aes(MJ_manufact_RATE)) + geom_histogram()
ggplot(MJcrime1987, aes(MJ_possession_RATE)) + geom_histogram()
#is there a correlation between manufacture and possession?
ggplot(data = MJcrime1987) +
geom_point(mapping = aes(x = MJ_possession_RATE, y = MJ_manufact_RATE))
#yes? I am not sure what to do next. Should I transform the data with a log function since it is so skewed?
Even though there is clearly a linear correlation between the rate of possession arrests and the rate of manufacture/sale arrests, I am most interested in the situation where there is an inverse relation: high manufacturing arrest rate and comparatively low possession rate. This is because I’m interested in cannabis producers who were generally politically and socially conservative, from rural parts of the country where there was not a lot of cannabis use. I am going to try to filter for MJ_manufact_RATE > mean and MJ_possession_RATE < mean.
# A tibble: 263 x 6
FIPS POPULATION MJ_manufact_arr~ MJ_manufact_RATE MJ_possession_a~
<chr> <dbl> <dbl> <dbl> <dbl>
1 1.017 40102 19 47.4 29
2 1.019 19345 11 56.9 17
3 1.023 17129 7 40.9 8
4 1.059 28212 31 110. 32
5 1.089 235470 119 50.5 115
6 1.121 77080 36 46.7 58
7 2.060 1180 1 84.7 1
8 2.090 66461 54 81.3 76
9 5.021 19535 21 107. 17
10 5.029 19334 19 98.3 10
# ... with 253 more rows, and 1 more variable:
# MJ_possession_RATE <dbl>
That narrows things down to 263 counties, which is potentially useful. I’m not sure this is the most effective thing, though. For example, I checked on the record identified as 1.059, which is Franklin County, Alabama. Franklin County is non-Urban, predominantly manufacturing, and has a nearby National Forest. Proximity to a National Forest is going to have to be another criteria I introduce. I can also tell just from glancing at it that I might be interested in a larger spread between the two rates. This may be a job for standard deviations, eg I’m looking for counties with a manufacture rate more than 1 SD from the mean and possession rate less than 1 SD below the mean. Not sure if that’s right, or how to set it up.
If I could identify counties that met those criteria, I could use my old standard qualitative historical research methods to try to locate sources that would tell me the stories of some of those farmers. I am also interested in knowing if there are any common characteristics of counties like that. For example, do they have unusually high or low rates of farm foreclosure or farm debt? What is the typical farm size? Typical capital held by farms? Do they tend to specialize in one of the commodity crops whose price and markets were/are international and unstable in the 1980s? To answer some of those questions, I will need to add data from at least two other datasets. First, the USDA ERS has County Typology codes: https://www.ers.usda.gov/data-products/county-typology-codes/ and Rural-Urban Continuum Codes: https://www.ers.usda.gov/data-products/rural-urban-continuum-codes/ Second, the 1987 Census of Agriculture has detailed information about farm characteristics, crops, etc. I am not going to try to do that for this homework, though!
Text and figures are licensed under Creative Commons Attribution CC BY-NC 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Merleaux (2022, Feb. 27). Data Analytics and Computational Social Science: HW 3. Retrieved from https://github.com/DACSS/dacss_course_website/posts/httprpubscomamerleaux870762/
BibTeX citation
@misc{merleaux2022hw, author = {Merleaux, April}, title = {Data Analytics and Computational Social Science: HW 3}, url = {https://github.com/DACSS/dacss_course_website/posts/httprpubscomamerleaux870762/}, year = {2022} }