Data Analytics and Computational Social Science: HW5

April Merleaux

knitr::opts_chunk$set(echo = TRUE)

#read in and tidy 1987 crime data
#load libraries 
library(tidyverse)
library(readxl)
#read in the file
crime1987 <- read_excel("~/DATA/crime data/da09252.xlsx",
                        col_types = c("numeric", "numeric", "numeric",
                                      "numeric", "text", "text", "numeric",
                                      "numeric", "numeric", "numeric",
                                      "numeric", "numeric", "numeric",
                                      "numeric", "numeric", "numeric",
                                      "numeric", "numeric", "numeric",
                                      "numeric", "numeric", "numeric",
                                      "numeric", "numeric", "numeric",
                                      "numeric", "numeric", "numeric",
                                      "numeric", "numeric", "numeric",
                                      "numeric", "numeric", "numeric",
                                      "numeric", "numeric", "numeric",
                                      "numeric", "numeric", "numeric",
                                      "numeric", "numeric", "numeric",
                                      "numeric", "numeric", "numeric",
                                      "numeric", "numeric", "numeric",
                                      "numeric", "numeric", "numeric",
                                      "numeric"))
#tidy FIPS codes
#rename select columns as a first step to tidying the FIPS codescrime1987 <- mutate(crime1987, FIPS5 = paste("county", "state"))
crime1987 <- crime1987 %>%
  rename(FIPS_state = `FIPS STATE CODE - SEE CODEBOOK`,
         FIPS_county = `FIPS COUNTY CODE - SEE CODEBOOK`)
#create new state column with padded 0s. Doing it this way preserves FIPS_state as is, which I will combine with FIPS_county in a few steps to create the string that matches to the county name in this dataset. I need to preserve them separately because the FIPS codes are formatted differently in other datasets
crime1987 <- mutate(crime1987, state = FIPS_state)
crime1987$state <-str_pad(crime1987$state, 2, pad = "0")
#add leading 0s to  FIPS_county and duplicate it so that it still exists after I combine them in the next step
crime1987$FIPS_county <- str_pad(crime1987$FIPS_county, 3, pad = "0")
crime1987 <- mutate(crime1987, county = FIPS_county)
#unite() combines the state and county into a single variable separated by ".", FIPS. However, it does not preserve the original two variables, which is why I preserved them by duplicating them in the last steps.
crime1987 <- unite(crime1987, FIPS, FIPS_state, FIPS_county, sep = ".")
#create a new variable, FIPS5 that combines county and state into a single variable
crime1987 <- mutate(crime1987, FIPS5 = paste(state,county, sep = ""))
#create arrest rate variables, first renaming the columns to avoid confusion
crime1987 <- crime1987 %>%
  rename(MJ_manufact_arrests = "MARIJUANA SALE/MANUF", MJ_possession_arrests = "MARIJUANA POSSESSION") %>%
  mutate(MJ_manufact_RATE = MJ_manufact_arrests / POPULATION * 100000,
         MJ_possession_RATE = MJ_possession_arrests / POPULATION * 100000)
#create a new datatable, MJcrime1987 that narrows just to the columns  I'm interested in 
#select the relevant columns and eliminate counties with 0 population and 0 MJ_manufact_arrests
MJcrime1987 <- filter(crime1987, MJ_manufact_RATE > 0, POPULATION != 0) %>%
  select(FIPS, FIPS5, POPULATION, MJ_manufact_arrests, MJ_manufact_RATE, MJ_possession_arrests, MJ_possession_RATE)
# Add data about the counties    
#Add the Rural-Urban Continuum Codes: https://www.ers.usda.gov/data-products/rural-urban-continuum-codes/ and USDA ERS  County Typology codes: https://www.ers.usda.gov/data-products/county-typology-codes/    
#read in USDA rural urban continuum codes from the USDA
ruralurbancodes83_93 <- read_excel("~/DATA/USDA data originals and codebook/ruralurbancodes83_93.xls")  
#join crime1987 and MJcrime1987 with ruralurbancodes83_93 by FIPS5
crime1987_countytype <- inner_join(crime1987, ruralurbancodes83_93, by = "FIPS5")
MJcrime1987_countytype <- inner_join(MJcrime1987, ruralurbancodes83_93, by = "FIPS5")  
#These are county typologies that give information about the economic activity within counties. There are 6 mutually exclusive economic types: farming, mining, manufacturing, government, services, nonspecialized. A county with at least 20% of its economy dependent on a particular sector in 1989 will have a 1 in that column in the table. There are 5 additional categories that are not mutually exclusive: retirement-destination, Federal lands, persistent poverty, commuting and transfers-dependent.
countytypology89 <- read_excel("~/DATA/USDA data originals and codebook/typology89.xls")
#create new variables, econ_type and econ_char using the 11 columns from the USDA data
countytypology89 <- countytypology89 %>%
  mutate(econ_type = case_when(
    `FM` == 1 ~ "Farming Dependent", 
    `MI` == 1 ~ "Mining Dependent", 
    `MF` == 1 ~ "Manufacturing Dependent", 
    `GV` == 1 ~ "Government Dependent", 
    `TS` == 1 ~ "Services Dependent", 
    `NS` == 1 ~ "Nonspecialized"),
    econ_char = case_when(
    `RT` == 1 ~ "Retirement Destination", 
    `FL` == 1 ~ "Federal Lands", 
    `CM` == 1 ~ "Commuting", 
    `PV` == 1 ~ "Persistent Poverty", 
    `TP` == 1 ~ "Transfers-dependent")
  )  
# join with the other data  
MJcrime1987_allCountyinfo <- inner_join(MJcrime1987_countytype, countytypology89, by = "FIPS5")
crime1987_allCountyinfo <- inner_join(crime1987_countytype, countytypology89, by = "FIPS5")
#narrow down to the relevant variables  
tidy_MJ_Counties <- MJcrime1987_allCountyinfo %>%
  select("FIPS5", "POPULATION", "MJ_manufact_arrests", "MJ_manufact_RATE", "MJ_possession_arrests", "MJ_possession_RATE", "State.x", "County Name", "1993 Rural-urban Continuum Code","RuralUrban1993", "econ_type", "econ_char")
#narrow to rural counties, eg those with a rural-urban continuum code greater than 3
tidy_MJ_Counties <- filter(tidy_MJ_Counties, `1993 Rural-urban Continuum Code` > 3)
#end of reading in and tidying A tibble: 1,186 x 12

Visualizations

I did a facet wrap in the last homework, and here’s a facet grid with the same data.

#facet_grid
ggplot(data = tidy_MJ_Counties) +
geom_point(mapping = aes(x = MJ_manufact_RATE, y = MJ_possession_RATE)) +
facet_grid(econ_char ~ econ_type)

Here’s another visualization of econ_type in relation to arrest rate and econ_char [It turns out that this next one is the most useful of the visualizations]

#very useful visualization ggplot(data = tidy_MJ_Counties) +
tidy_MJ_Counties %>%
  drop_na() %>%
  ggplot(aes(x = econ_type, y = MJ_manufact_RATE, fill = econ_char)) + 
  geom_col() +
  labs(title = "Marijuana Manufacture Arrest Rate by County Economics", x = "County Economic Type", y = "Marijuana Manufacture Arrest Rate", fill = "County Economic Character") +
  coord_flip()

This next one is kindof cool. Manufacture arrest rate against possession with geom_smooth and color as econ_type Farming Dependent and government dependent are interesting. The curve goes down on possession as manufacture increases in Services dependent and counties without an econ_type listed. Retirement + Services seems like a recipe for marijuana arrests.

ggplot(tidy_MJ_Counties) +
geom_smooth(mapping = aes(x = MJ_manufact_RATE, y = MJ_possession_RATE, color = econ_type))

And here’s just Farming Dependent with both x and y transformed by log10

ggplot(data = filter(tidy_MJ_Counties, econ_type == "Farming Dependent"), mapping = aes(x = MJ_manufact_RATE, y = MJ_possession_RATE)) +
geom_point() +
geom_smooth () +
scale_y_log10() +
scale_x_log10()

I figured out how to make a bar graph counting rural counties of each econ type Either geom_bar or stat_count will work And one geom_col with econ_type against rate, because you can have x and y for col (only one of them for bar)

ggplot(data = tidy_MJ_Counties) +
geom_bar(mapping = aes(x = econ_type))+
  coord_flip()

ggplot(data = tidy_MJ_Counties) +
  geom_bar(mapping = aes(x = RuralUrban1993))+
  coord_flip()

ggplot(tidy_MJ_Counties) +
geom_col(mapping = aes(x = econ_type, y = MJ_manufact_RATE)) +
  coord_flip()

ggplot(tidy_MJ_Counties) +
geom_col(mapping = aes(x = econ_type, y = MJ_manufact_RATE)) +
  coord_flip() +
  scale_y_log10()

#when I transform the log the arrest rate, manufacturing appears to be a more significant category

Here are some experiments using stat_summary The first one is a bar chart, showing the median for each econ_type. The second one shows the mean and max and min, which seems useful. The last one is an attempt to show max, min, and median. It looks lousy because the mean is much higher than the median and so things aren’t so squished. The final one adds a point for the median.

#for these visualizations I want to remove the NAs first 

tidy_MJ_Counties %>%
  drop_na() %>%
  ggplot(aes(x = econ_type, y = MJ_manufact_RATE)) +
  stat_summary() +
  coord_flip() +
  labs(title = "Mean Arrest Rate for Marijuana Manufacture by County Economic Type")

tidy_MJ_Counties %>%
  drop_na() %>%
  ggplot(aes(x = econ_type, y = MJ_manufact_RATE)) +
  stat_summary(fun = "median", fun.max = max, fun.min = min) +
  coord_flip()+
  labs(title = "Median Arrest Rate for Marijuana Manufacture by County Economic Type")

tidy_MJ_Counties %>%
  drop_na() %>%
  ggplot(aes(x = econ_type, y = MJ_manufact_RATE)) +
  stat_summary() +
  stat_summary(fun = "median", color = "red") +
  coord_flip() +
  labs(title = "Median & Mean Arrest Rate for Marijuana Manufacture by County Economic Type")

tidy_MJ_Counties %>%
  drop_na() %>%
  ggplot() +
  geom_col(mapping = aes(x = econ_type, y = MJ_manufact_RATE, fill = econ_char), show.legend = FALSE, width = 1) +
  theme(aspect.ratio = 1) +
  labs(title = "Median Arrest Rate for Marijuana Manufacture by County Economic Type", x = NULL, y = "Marijuana Manufacture Arrest Rate") +
  coord_polar()

On to other fun things! Agricultural Data

The next step in this process is to start read in and tidy data from the agricultural census. This is available in tabular form on ICPSR: https://www.icpsr.umich.edu/web/ICPSR/studies/35206

I honestly have no idea how to approach such a large dataset (2,634 variables!)

load("~/DATA/USDA data originals and codebook/ICPSR_35206-V4/ICPSR_35206/DS0041/35206-0041-Data.rda")
#create lookup table from column attributes
attributes(da35206.0041)$variable.labels -> labels
USDA35206_labels <- data.frame(labels)
USDA35206_labels <- rename(USDA35206_labels, "name" = 1)
USDA35206_labels <- rownames_to_column(USDA35206_labels, "code")
USDA35206_labels <- USDA35206_labels %>%
  filter(str_detect(code, "ITEM"))

#tidy data
USDA1987 <- da35206.0041 
USDA1987 <- tibble(USDA1987)
USDA1987 <- USDA1987 %>%
  filter(COUNFIP != 0) 
USDA1987$FIPS <- str_pad(USDA1987$FIPS, 5, pad = "0")
USDA1987 <- rename(USDA1987, FIPS5 = FIPS)

# join to MJ data
MJ_Counties_joinedUSDA <- left_join(tidy_MJ_Counties, USDA1987, by = "FIPS5")

#remove fields called flag
MJ_Counties_joinedUSDA <- select(MJ_Counties_joinedUSDA, -starts_with("FLAG"))

#attempt to look at some variables
MJ_Counties_joinedUSDA %>%
filter(MJ_manufact_RATE > MJ_possession_RATE) %>%
summarize("Mean Farms (number), 1987" = mean(ITEM01001, na.rm = TRUE), "Mean Farms by size:  1,000 acres or more, 1987" = mean(ITEM01012, na.rm = TRUE), "Market value of ag products sold($1,000), 1987" = mean(ITEM01019, na.rm = TRUE), "Operators by principal occupation:  Farming, 1987" = mean(ITEM01030, na.rm = TRUE), "Operators by principal occupation:  Other, 1987" = mean(ITEM01031, na.rm = TRUE))

# A tibble: 1 x 5
  `Mean Farms (num~ `Mean Farms by ~ `Market value o~ `Operators by p~
              <dbl>            <dbl>            <dbl>            <dbl>
1              608.             61.0           42190.             353.
# ... with 1 more variable:
#   Operators by principal occupation:  Other, 1987 <dbl>

#rename columns to match the labels extracted in the first step

#names(MJ_Counties_joinedUSDA)[names(MJ_Counties_joinedUSDA) %in% USDA35206_labels$code] = USDA35206_labels$name[match(names(MJ_Counties_joinedUSDA)[names(MJ_Counties_joinedUSDA) %in% USDA35206_labels$code], USDA35206_labels$code)]

#but this creates problem of duplicate variable names so this is commented out for now

Homework 5 Questions

What is missing (if anything) in your analysis process so far?

I have now read in the USDA Ag Census data and joined it to the subset of crime data that includes only the counties with the rural-urban continuum code greater than 3 (eg not major metro areas). I have also filtered out counties with 0 manufacturing arrests. I have not What’s missing now is a strategy to investigate the 2000+ variables from the ag census in order to characterize the agricultural activities of the counties of interest. In HW3 I looked at the subset of counties that had higher than mean manufacture arrest rate and lower than mean possession rate. I have not checked to see if the same patterns of county economic types hold in this subset. I am thinking about creating a categorical variable out of the MJ_manufact_RATE (eg low, medium, high) so that I can compare different crops.

What conclusions can you make about your research questions at this point?

Possession arrests and manufacture arrests are clearly correlated. It seems likely that people were rarely charged with manufacture without also being charged with possession (why not pile on the charges?) The downside of this is that I probably can’t rely on the relationship between manufacture rate and possession rate to discern which counties are of interest. The two major kinds of counties of interest for my research are government dependent with federal lands and farming dependent with persistent poverty. Neither of these are surprising to me. I am a little confused by what happens when I transform the arrest rates by using log10. It changes the relative significance of county economic types.

What do you think a naive reader would need to fully understand your graphs?

I think I am a naive reader, and I don’t fully understand what I’m looking at!

Is there anything you want to answer with your dataset, but can’t?

So many things. I think that I might be able to better address the ag census data by breaking it into smaller chunks. I also think that I would be interested in doing a similar thing with crime data from other years in order to create a time series.

Comment on this article Share:

HW5

Visualizations

On to other fun things! Agricultural Data

Homework 5 Questions

Reuse

Citation