knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE)
library(readxl)
library(tidyverse)
library(ggplot2)
library(dplyr)
library(stringr)
library(googlesheets4)
Final Project Part 1
Research Question:
Does political partisanship correlate with COVID-19 death rates?
The COVID-19 pandemic became a political matter. Behaviors associated with COVID-19 prevention (masking, social distancing, and vaccine uptake) were adopted along partisan lines, and early in the pandemic mask mandates were protested in some communities. My research question is whether these behaviors have affected COVID-19 death rates along partisan lines. If so, public health interventions could target communities that may be at higher risk of COVID-19 deaths based on political partisanship.
I think death toll makes more sense to measure than infection rates, as infection rates are constantly changing (other studies have looked at infection rates over waves of the pandemic; see this study from the Pew Research Center (Jones 2022)). One way to measure partisanship is the 2020 county-level election results (% voting for Trump). In other words, my research looks at whether county-level Trump support correlates with COVID-19 death rates. Both variables can be found in county-level datasets, so I can join multiple datasets with county name (or FIPS code) as the “key”.
Other variables to consider at the county level (potential confounding variables): vaccine (and booster) uptake and average age of the population.
Hypothesis:
While I came up with this research idea on my own, other organizations such as NPR (Wood and Brumfiel 2021) and the Pew Research Center (Jones 2022) have already tested this. For this project, I will use the most recent data I can find. I was hoping to consider population density as a confounding variable: for instance, I am guessing that more urban populations tend to vote Democratic, but these more densely populated places may also have higher infection rates. However, I cannot find any county-level population density datasets, so I may use the “Urban Rural Description” variable in one of my datasets instead.
H0: B1 (and all other beta values) is zero. There is no correlation between partisanship and COVID-19 death rates.
Ha: B1 (or any other beta value) is not zero. There is a correlation between partisanship and COVID-19 death rates.
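As a rough illustration of how these hypotheses could be tested, the sketch below fits a simple linear regression and reads off the p-value for B1. It uses simulated toy data only: the column names `trump_share` and `death_rate`, the built-in slope, and every number are assumptions made for illustration, not values from the project data.

```r
# Simulated toy data: NOT real county data; names and values are assumptions
set.seed(42)
n <- 200
toy <- data.frame(trump_share = runif(n, min = 0.1, max = 0.9))

# Build in a true slope of 100 so there is an effect to detect
toy$death_rate <- 250 + 100 * toy$trump_share + rnorm(n, sd = 40)

# Fit death_rate = B0 + B1 * trump_share + error
fit <- lm(death_rate ~ trump_share, data = toy)

# Pull the estimate and p-value for B1 from the coefficient table
coefs <- summary(fit)$coefficients
b1 <- coefs["trump_share", "Estimate"]
p_value <- coefs["trump_share", "Pr(>|t|)"]

# Reject H0 (B1 = 0) at the 5% level when the p-value is below 0.05
reject_h0 <- p_value < 0.05
```

Confounders such as vaccine uptake, average age, or the urban/rural category could be added as extra terms on the right-hand side of the formula, in which case each added beta gets its own row in the coefficient table.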
Descriptive Statistics:
# Reading in the data from Google Sheets
gs4_deauth()
votedf <- read_sheet("https://docs.google.com/spreadsheets/d/1fmxoA_bibvsxsvgRdVPCgMA7DkmJNZfxiWgLgCLcsOY/edit#gid=937778872")
coviddf <- read_sheet("https://docs.google.com/spreadsheets/d/1Hy2O3HxhZGF_fhu6jgmoC2ibWwJTlI7pQOESBOd4hTU/edit#gid=787918384")
# Changing FIPS code to character format and adding leading zeros
coviddf$"FIPS Code" <- as.character(coviddf$"FIPS Code")
coviddf <- mutate(coviddf, FIPSNEW = str_pad(coviddf$"FIPS Code", 5, pad = "0"))
head(coviddf, 12)
# A tibble: 12 × 22
`Data as of` `Start Date` `End Date` State County Na…¹
<dttm> <dttm> <dttm> <chr> <chr>
1 2022-10-05 00:00:00 2020-01-01 00:00:00 2022-10-01 00:00:00 AK Anchorage …
2 2022-10-05 00:00:00 2020-01-01 00:00:00 2022-10-01 00:00:00 AK Anchorage …
3 2022-10-05 00:00:00 2020-01-01 00:00:00 2022-10-01 00:00:00 AK Anchorage …
4 2022-10-05 00:00:00 2020-01-01 00:00:00 2022-10-01 00:00:00 AK Fairbanks …
5 2022-10-05 00:00:00 2020-01-01 00:00:00 2022-10-01 00:00:00 AK Fairbanks …
6 2022-10-05 00:00:00 2020-01-01 00:00:00 2022-10-01 00:00:00 AK Fairbanks …
7 2022-10-05 00:00:00 2020-01-01 00:00:00 2022-10-01 00:00:00 AK Matanuska-…
8 2022-10-05 00:00:00 2020-01-01 00:00:00 2022-10-01 00:00:00 AK Matanuska-…
9 2022-10-05 00:00:00 2020-01-01 00:00:00 2022-10-01 00:00:00 AK Matanuska-…
10 2022-10-05 00:00:00 2020-01-01 00:00:00 2022-10-01 00:00:00 AL Autauga Co…
11 2022-10-05 00:00:00 2020-01-01 00:00:00 2022-10-01 00:00:00 AL Autauga Co…
12 2022-10-05 00:00:00 2020-01-01 00:00:00 2022-10-01 00:00:00 AL Autauga Co…
# … with 17 more variables: `Urban Rural Code` <dbl>, `FIPS State` <dbl>,
# `FIPS County` <dbl>, `FIPS Code` <chr>, Indicator <chr>,
# `Total deaths` <dbl>, `COVID-19 Deaths` <dbl>, `Non-Hispanic White` <dbl>,
# `Non-Hispanic Black` <dbl>,
# `Non-Hispanic American Indian or Alaska Native` <dbl>,
# `Non-Hispanic Asian` <dbl>,
# `Non-Hispanic Native Hawaiian or Other Pacific Islander` <dbl>, …
votedf$county_fips <- as.character(votedf$county_fips)
votedf <- mutate(votedf, county_fipsNEW = str_pad(votedf$county_fips, 5, pad = "0"))
head(votedf, 12)
# A tibble: 12 × 13
year state state_po county_…¹ count…² office candi…³ party candi…⁴ total…⁵
<dbl> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
1 2000 ALABAMA AL AUTAUGA 1001 US PR… AL GORE DEMO… 4942 17208
2 2000 ALABAMA AL AUTAUGA 1001 US PR… GEORGE… REPU… 11993 17208
3 2000 ALABAMA AL AUTAUGA 1001 US PR… RALPH … GREEN 160 17208
4 2000 ALABAMA AL AUTAUGA 1001 US PR… OTHER OTHER 113 17208
5 2000 ALABAMA AL BALDWIN 1003 US PR… AL GORE DEMO… 13997 56480
6 2000 ALABAMA AL BALDWIN 1003 US PR… GEORGE… REPU… 40872 56480
7 2000 ALABAMA AL BALDWIN 1003 US PR… RALPH … GREEN 1033 56480
8 2000 ALABAMA AL BALDWIN 1003 US PR… OTHER OTHER 578 56480
9 2000 ALABAMA AL BARBOUR 1005 US PR… AL GORE DEMO… 5188 10395
10 2000 ALABAMA AL BARBOUR 1005 US PR… GEORGE… REPU… 5096 10395
11 2000 ALABAMA AL BARBOUR 1005 US PR… RALPH … GREEN 46 10395
12 2000 ALABAMA AL BARBOUR 1005 US PR… OTHER OTHER 65 10395
# … with 3 more variables: version <dbl>, mode <chr>, county_fipsNEW <chr>, and
# abbreviated variable names ¹county_name, ²county_fips, ³candidate,
# ⁴candidatevotes, ⁵totalvotes
summary(votedf)
year state state_po county_name
Min. :2000 Length:72617 Length:72617 Length:72617
1st Qu.:2004 Class :character Class :character Class :character
Median :2012 Mode :character Mode :character Mode :character
Mean :2011
3rd Qu.:2020
Max. :2020
county_fips office candidate party
Length:72617 Length:72617 Length:72617 Length:72617
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
candidatevotes totalvotes version mode
Min. : 0 Min. : 0 Min. :20220315 Length:72617
1st Qu.: 115 1st Qu.: 5175 1st Qu.:20220315 Class :character
Median : 1278 Median : 11194 Median :20220315 Mode :character
Mean : 10782 Mean : 42514 Mean :20220315
3rd Qu.: 5848 3rd Qu.: 29855 3rd Qu.:20220315
Max. :3028885 Max. :4264365 Max. :20220315
county_fipsNEW
Length:72617
Class :character
Mode :character
summary(coviddf)
Data as of Start Date End Date
Min. :2022-10-05 Min. :2020-01-01 Min. :2022-10-01
1st Qu.:2022-10-05 1st Qu.:2020-01-01 1st Qu.:2022-10-01
Median :2022-10-05 Median :2020-01-01 Median :2022-10-01
Mean :2022-10-05 Mean :2020-01-01 Mean :2022-10-01
3rd Qu.:2022-10-05 3rd Qu.:2020-01-01 3rd Qu.:2022-10-01
Max. :2022-10-05 Max. :2020-01-01 Max. :2022-10-01
State County Name Urban Rural Code FIPS State
Length:3495 Length:3495 Min. :1.000 Min. : 1.00
Class :character Class :character 1st Qu.:2.000 1st Qu.:18.00
Mode :character Mode :character Median :4.000 Median :33.00
Mean :3.645 Mean :30.47
3rd Qu.:5.000 3rd Qu.:42.00
Max. :6.000 Max. :56.00
FIPS County FIPS Code Indicator Total deaths
Min. : 1.00 Length:3495 Length:3495 Min. : 621
1st Qu.: 31.00 Class :character Class :character 1st Qu.: 1690
Median : 71.00 Mode :character Mode :character Median : 3284
Mean : 99.37 Mean : 7163
3rd Qu.:121.00 3rd Qu.: 6990
Max. :840.00 Max. :220829
COVID-19 Deaths Non-Hispanic White Non-Hispanic Black
Min. : 101.0 Min. :0.0270 Min. :0.0010
1st Qu.: 176.0 1st Qu.:0.6677 1st Qu.:0.0230
Median : 364.0 Median :0.8300 Median :0.0690
Mean : 852.7 Mean :0.7742 Mean :0.1242
3rd Qu.: 844.0 3rd Qu.:0.9290 3rd Qu.:0.1800
Max. :31013.0 Max. :1.0000 Max. :0.7610
NA's :3 NA's :592
Non-Hispanic American Indian or Alaska Native Non-Hispanic Asian
Min. :0.0000 Min. :0.0010
1st Qu.:0.0020 1st Qu.:0.0070
Median :0.0040 Median :0.0130
Mean :0.0214 Mean :0.0261
3rd Qu.:0.0100 3rd Qu.:0.0280
Max. :0.8610 Max. :0.5170
NA's :1701 NA's :1360
Non-Hispanic Native Hawaiian or Other Pacific Islander Hispanic
Min. :0.0000 Min. :0.0030
1st Qu.:0.0000 1st Qu.:0.0220
Median :0.0010 Median :0.0480
Mean :0.0023 Mean :0.0987
3rd Qu.:0.0010 3rd Qu.:0.1090
Max. :0.2000 Max. :0.9870
NA's :2183 NA's :740
Other Urban Rural Description Footnote FIPSNEW
Min. :0.0010 Length:3495 Length:3495 Length:3495
1st Qu.:0.0090 Class :character Class :character Class :character
Median :0.0150 Mode :character Mode :character Mode :character
Mean :0.0174
3rd Qu.:0.0220
Max. :0.2410
NA's :1633
This data will require some tidying before merging. In coviddf, each county is listed three times (once per indicator), so I will likely keep only the indicator “Distribution of COVID-19 deaths (%)” so that each county is listed once. Similarly, votedf contains extra years. For my research, I am only concerned with 2016 data, so I will filter for the % voting for Trump in 2016 as a measure of political affiliation/partisanship. Then I will merge the two data frames based on county names (which will also require some data tidying).
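The tidying steps described here could look roughly like the sketch below, written in base R on small toy data frames so it runs without any packages (`dplyr::filter()` and `dplyr::inner_join()` would be the tidyverse equivalents). All values are made up, and the `covid_deaths` column name, the "Some other indicator" label, the candidate spellings, and `target_year` are assumptions for illustration. The sketch joins on the padded FIPS columns rather than county names, a swapped-in choice, since the two sources format names differently (for example "AUTAUGA" in votedf).

```r
# Toy stand-ins for coviddf and votedf: all values are made up
covid_toy <- data.frame(
  FIPSNEW = c("01001", "01001", "01003", "01003"),
  Indicator = c("Distribution of COVID-19 deaths (%)", "Some other indicator",
                "Distribution of COVID-19 deaths (%)", "Some other indicator"),
  covid_deaths = c(200, 200, 650, 650)  # hypothetical column name
)
vote_toy <- data.frame(
  year = c(2016, 2016, 2016, 2016, 2012, 2016),
  county_fipsNEW = c("01001", "01001", "01003", "01003", "01001", "01005"),
  candidate = c("DONALD TRUMP", "HILLARY CLINTON", "DONALD TRUMP",
                "HILLARY CLINTON", "MITT ROMNEY", "DONALD TRUMP"),
  candidatevotes = c(750, 250, 600, 400, 700, 500),
  totalvotes = c(1000, 1000, 1000, 1000, 1000, 1000)
)

target_year <- 2016  # assumption: the election year used to measure partisanship

# Step 1: keep one indicator so each county appears once
covid_one <- covid_toy[covid_toy$Indicator == "Distribution of COVID-19 deaths (%)", ]

# Step 2: keep Trump rows for the target year and compute his vote share
trump <- vote_toy[vote_toy$year == target_year &
                    vote_toy$candidate == "DONALD TRUMP", ]
trump$trump_share <- trump$candidatevotes / trump$totalvotes

# Step 3: inner join on padded FIPS (merge() drops unmatched rows by default)
merged <- merge(covid_one, trump[, c("county_fipsNEW", "trump_share")],
                by.x = "FIPSNEW", by.y = "county_fipsNEW")
```

County "01005" appears only in the toy vote data, so the inner join drops it and `merged` keeps just the two counties present in both data frames.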
The votedf was compiled by the MIT Election Data and Science Lab. It was first published in 2018 and has since been updated with the 2020 election. It contains county-level presidential election data from 2000 through 2020. The data has 12 columns and 72,617 rows, many of which I will filter out before conducting analysis. There are 1,892 distinct county names in the dataset.
The coviddf has only 857 unique county names. This may be because not all counties reported COVID-19 death counts. When I join the datasets, I will keep only the observations for which both data frames have information. The coviddf is provisional, meaning that it is regularly updated (I believe on a weekly basis) with current COVID-19 death toll data. It is likely compiled from counties and towns reporting these numbers to the CDC. This data has limitations: not all counties report deaths, and those that do may not always report accurately or attribute COVID-19 as the true cause of death. Using the summary function, we can see that the mean number of COVID-19 deaths is 852.7. However, this figure is not very meaningful yet, because it is computed over rows in which each county's count is repeated three times, and the median (364.0) is far lower than the mean, which points to a right-skewed distribution. Statistics provided by the summary function will be more meaningful once the data is tidied.
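A small sketch of counting counties and summarizing one row per county is given here, again on made-up toy data in base R. One caveat it illustrates: county names repeat across states (both Alabama and Florida have a Washington County), so counting distinct names can give a smaller number than counting distinct FIPS codes. The `covid_deaths` column name, the indicator labels, and all death counts are assumptions for illustration.

```r
# Toy data: three counties, each listed once per indicator; deaths are made up
covid_toy <- data.frame(
  State = rep(c("AL", "AL", "FL"), each = 3),
  County = rep(c("Autauga County", "Washington County", "Washington County"),
               each = 3),
  FIPSNEW = rep(c("01001", "01129", "12133"), each = 3),
  Indicator = rep(c("indicator_1", "indicator_2", "indicator_3"), times = 3),
  covid_deaths = rep(c(150, 400, 3000), each = 3)  # hypothetical column name
)

# Distinct names undercount here because "Washington County" appears in two states
n_names <- length(unique(covid_toy$County))
n_fips <- length(unique(covid_toy$FIPSNEW))

# Keep the first row for each FIPS code so every county is counted once
one_per_county <- covid_toy[!duplicated(covid_toy$FIPSNEW), ]

# With one large county, the mean sits well above the median (right skew)
mean_deaths <- mean(one_per_county$covid_deaths)
median_deaths <- median(one_per_county$covid_deaths)
```

In this toy example `n_names` is 2 while `n_fips` is 3, which is one reason to prefer FIPS codes over names when counting or joining counties.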
References
Jones, B. (2022). The Changing Political Geography of COVID-19 Over the Last Two Years. Pew Research Center. March 3, 2022. https://www.pewresearch.org/politics/2022/03/03/the-changing-political-geography-of-covid-19-over-the-last-two-years/
MIT Election Data and Science Lab. (2021). County Presidential Election Returns 2000-2020. Accessed from the Harvard Dataverse [October 11, 2022]. https://doi.org/10.7910/DVN/VOQCHQ
National Center for Health Statistics. (2022). Provisional COVID-19 Deaths by County, and Race and Hispanic Origin. Accessed from the Centers for Disease Control [October 11, 2022]. https://data.cdc.gov/d/k8wy-p9cg
Wood, D. and Brumfiel, G. (2021). Pro-Trump counties now have far higher COVID death rates. Misinformation is to blame. NPR. December 5, 2021. https://www.npr.org/sections/health-shots/2021/12/05/1059828993/data-vaccine-misinformation-trump-counties-covid-death-rate