Final Project, Air Quality and Lung/Heart Disease Deaths

final project

Joseph Vincent

Author

Joseph Vincent

Published

August 12, 2023

Code

library(tidyverse)
library(dplyr)
library(lubridate)
library(readxl)
library(car)
library(sandwich)
library(stargazer)

knitr::opts_chunk$set(echo = TRUE)

Introduction

Poor air quality has been shown to have a massively negative impact on human health. In particular, air pollution is known to worsen pulmonary and cardiovascular diseases.[1] Just this year, researchers out of Boston University showed that air pollution caused by oil and gas production in the US resulted in over 7500 excess deaths and $77 billion in total health impacts in a single year.[2]

The goal of my project is to assess whether there is a noticeable relationship between air quality trends and related causes of death in California for the decade of 2000-2010.

My findings are that, while inconclusive across all counties in California, in large metropolitan counties of California, days where PM2.5 is above the national average have a positive and statistically significant impact on deaths caused by heart and lung diseases.

Reading, cleaning and combining data sets

In order to accomplish this, I’ve had to combine data from a variety of sources and estimate and/or create some variables.

Air quality

SOURCE [3]

I’m using the CDC’s database for Air Quality Measures on the National Environmental Health Tracking Network. The most recent, complete historic data available contains per-county data on air quality between the years of 1999 and 2013, inclusive.

This data is collected by the Environmental Protection Agency’s network of more than 4,000 outdoor ambient air monitoring systems around the US, called the Air Quality System (AQS).

These systems collect two main types of data:

Fine Particulate Matter (PM2.5) Concentrations

This refers to tiny particulate matter in the air measuring two and a half microns or less in width that originates from a variety of pollutants. Exposure to PM2.5 particles can cause both short and long term health effects. PM2.5 sources include wildfires and fossil fuel combustion.

Ozone Concentrations

High concentrations of Ozone near the ground level can be harmful, causing irritation to the respiratory system and aggravating a variety of chronic lung diseases. This is often referred to as “smog” and is produced by things like car emissions and power plants.

While this data is of high quality, it is lacking in rural areas where monitors are not present. To account for this, the CDC and EPA developed statistical models to fill in gaps in data across regions and time. This allows for granular analysis of county air quality data.

Two important concepts used in my analysis are PM2.5 Days and Ozone Days.

A measure of “PM2.5 Days” for a particular county-year represents the number of days with PM2.5 levels over the NAAQS.

A measure of “Ozone Days” for a particular county-year represents the number of days with the maximum 8-hour average ozone concentration over the National Ambient Air Quality Standard (NAAQS).

Deaths by Cause

SOURCE [4]

In order to draw conclusions about air quality’s immediate impact on public health, I’m combining my air quality data with a data set from the state of California’s Department of Health and Human Services (HHS) on annual number of deaths by county, including primary cause of death. This data also spans the years 1999-2013, but I will be using 2000-2010.

While the data includes a wide variety of causes, for my analysis I will be focusing on chronic lower respiratory disease (CLD), and diseases of the heart (HTD), as these are the main relevant categories for air quality health impacts.

Population

SOURCE [5]

I’m also using per-county population data from the United States Census Bureau. While there are census packages out there for easy use within R, I required annual data by county, which only exists from 2005 onwards in most data sets. I’m instead using the County Intercensal Tables 2000-2010 for the state of California.

Power Plant Density

SOURCE [6]

I’m using power plant location data from the U.S. Energy Information Administration data in order to calculate density of power plants per county. This will be used as a proxy for social determinants of health and neighborhood quality.

RUCC

SOURCE [7]

In order to select subsets of my data, I’ve used data from the Economic Research Service of the US Department of Agriculture to obtain Rural Urban Continuum Codes (RUCC) per county. This is a classification scheme on a 8-1 scale that groups counties into categories based on the size of metro areas contained within. This will be used to take subsets of the data in order to make comparisons across similar counties.

See the code chunk below for details about reading, manipulating and combining the data sets:

Code

#reading in air quality data
airquality <- read_csv("Final Project Data/Air_Quality_Measures_on_the_National_Environmental_Health_Tracking_Network.csv")

Rows: 218635 Columns: 14
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (8): MeasureName, MeasureType, StratificationLevel, StateName, CountyNam...
dbl (6): MeasureId, StateFips, CountyFips, ReportYear, Value, MonitorOnly

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Code

#reading in deaths data set
calideaths <- read_csv("Final Project Data/2021-05-14_deaths_final_1999_2013_county_year_sup.csv") %>%
  #filtering for occurrence deaths (i.e. all deaths that occurred, disregarding residence)
  filter(Geography_Type == "Occurrence") %>%
  #de-selecting geography type as they are now all occurrence
  select(-Geography_Type)

Rows: 292320 Columns: 10
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (7): County, Geography_Type, Strata, Strata_Name, Cause, Cause_Desc, Ann...
dbl (3): Year, Count, Annotation_Code

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Code

#reading in population data
populations <- read_csv("Final Project Data/co-est00int-tot.csv")

Rows: 3194 Columns: 20
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (2): STNAME, CTYNAME
dbl (18): SUMLEV, REGION, DIVISION, STATE, COUNTY, ESTIMATESBASE2000, POPEST...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Code

#cleaning air quality data
airqualitycali <- airquality %>%
  #arranging by year
  arrange(`ReportYear`) %>%
  #filtering for california only and 2000-2010
  filter(StateName == "California" & `ReportYear` %in% c(2000:2010)) %>%
  #selecting only relevant columns
  select(CountyName, ReportYear, MeasureName, Value) %>%
  #renaming county and year to be consistent with deaths data set
  rename("County" = `CountyName`, "Year" = `ReportYear`) %>%
  #pivoting so that each row is a year-county for merging data
  pivot_wider(names_from = MeasureName, values_from = Value) %>%
  #renaming Air Quality columns
  rename("Ozone Days Delete" = `Number of days with maximum 8-hour average ozone concentration over the National Ambient Air Quality Standard`,
         "Ozone Person Days Delete" = `Number of person-days with maximum 8-hour average ozone concentration over the National Ambient Air Quality Standard`,
         "PM2.5 Percent of Days Delete" = `Percent of days with PM2.5 levels over the National Ambient Air Quality Standard (NAAQS)`,
         "PM2.5 Person Days Delete" = `Person-days with PM2.5 over the National Ambient Air Quality Standard`,
         "PM2.5 Annual Average Delete" = `Annual average ambient concentrations of PM2.5 in micrograms per cubic meter (based on seasonal averages and daily measurement)`,
         "Ozone Days" = `Number of days with maximum 8-hour average ozone concentration over the National Ambient Air Quality Standard (monitor and modeled data)`,
         "Ozone Person Days" = `Number of person-days with maximum 8-hour average ozone concentration over the National Ambient Air Quality Standard (monitor and modeled data)`,
         "PM2.5 Percent of Days" = `Percent of days with PM2.5 levels over the National Ambient Air Quality Standard (monitor and modeled data)`,
         "PM2.5 Person Days" = `Number of person-days with PM2.5 over the National Ambient Air Quality Standard (monitor and modeled data)`,
         "PM2.5 Annual Average" = `Annual average ambient concentrations of PM 2.5 in micrograms per cubic meter, based on seasonal averages and daily measurement (monitor and modeled data)`)

#filling in modeled data for first year, where there is no modeled data
airqualitycali <- airqualitycali %>%
  mutate(`Ozone Days` = case_when(
    `Year` == 2000 ~ `Ozone Days Delete`,
    TRUE ~ as.numeric(as.character(`Ozone Days`)))) %>%
  mutate(`Ozone Person Days` = case_when(
    `Year` == 2000 ~ `Ozone Person Days Delete`,
    TRUE ~ as.numeric(as.character(`Ozone Person Days`)))) %>%
  mutate(`PM2.5 Percent of Days` = case_when(
    `Year` == 2000  ~ `PM2.5 Percent of Days Delete`,
    TRUE ~ as.numeric(as.character(`PM2.5 Percent of Days`)))) %>%
  mutate(`PM2.5 Person Days` = case_when(
    `Year` == 2000 ~ `PM2.5 Person Days Delete`,
    TRUE ~ as.numeric(as.character(`PM2.5 Person Days`)))) %>%
  mutate(`PM2.5 Annual Average` = case_when(
    `Year` == 2000 ~ `PM2.5 Annual Average Delete`,
    TRUE ~ as.numeric(as.character(`PM2.5 Annual Average`)))) %>%
  select(!contains("Delete"))

#cleaning deaths data
calideathsclean <- calideaths %>%
  #filtering for 2000-2010
  filter(`Year` %in% c(2000:2010)) %>%
  #filling in suppressed data
  mutate(Count = case_when(
    Annotation_Code == 1 ~ 0,
    TRUE ~ as.numeric(as.character(Count)))) %>%
  #de-selecting some unused columns
  select(-Cause_Desc, -Annotation_Code, -Annotation_Desc) %>%
  #focusing on relevant conditions
  filter(Cause %in% c("ALL", "CLD", "HTD")) %>%
  rename("Deaths" = `Count`) %>%
  rename("Strata Name" = `Strata_Name`)

# cleaning population data
populationstidy <- populations %>%
  filter(STNAME == "California" & CTYNAME != "California") %>%
  separate(CTYNAME, into = c("County", "Delete"), sep = " County") %>%
  select(County, POPESTIMATE2000:POPESTIMATE2010, -CENSUS2010POP) %>%
  rename("2000" = POPESTIMATE2000,
         "2001" = POPESTIMATE2001,
         "2002" = POPESTIMATE2002,
         "2003" = POPESTIMATE2003,
         "2004" = POPESTIMATE2004,
         "2005" = POPESTIMATE2005,
         "2006" = POPESTIMATE2006,
         "2007" = POPESTIMATE2007,
         "2008" = POPESTIMATE2008,
         "2009" = POPESTIMATE2009,
         "2010" = POPESTIMATE2010) %>%
  pivot_longer(col = c(`2000`:`2010`), names_to = "Year", values_to = "Population") %>%
  mutate(`Year` = as.numeric(Year))

#merging data
airqualityanddeaths <- left_join(calideathsclean, airqualitycali, by = c("County", "Year"))
airqualityanddeaths <- left_join(airqualityanddeaths, populationstidy, by = c("County", "Year"))

#For the remaining missing air quality data not filled in by models in 2000, replacing NAs with zero, as they are all small rural counties that will skew means higher for 2000
airqualityanddeaths <- airqualityanddeaths %>%
  mutate(`Ozone Days` = replace_na(`Ozone Days`, 0),
         `Ozone Person Days` = replace_na(`Ozone Person Days`, 0),
         `PM2.5 Percent of Days` = replace_na(`PM2.5 Percent of Days`, 0),
         `PM2.5 Person Days` = replace_na(`PM2.5 Person Days`, 0),
         `PM2.5 Annual Average` = replace_na(`PM2.5 Annual Average`, 0)) %>%
# Creating standardized deaths per 100,000 column
  mutate("Deaths per 100,000" = `Deaths`/`Population`*100000) %>%
# Creating a raw PM2.5 Days column
  mutate("PM2.5 Days" = `PM2.5 Percent of Days`/100*365) %>%
# Re-arranging
  select(Year, County, Population, Strata, `Strata Name`, Cause, Deaths, `Deaths per 100,000`, `Ozone Days`, `Ozone Person Days`, `PM2.5 Days`, `PM2.5 Person Days`, `PM2.5 Percent of Days`, `PM2.5 Annual Average`) %>% 
# Turning Year into a date for ease of plotting time series'
  mutate(Year = make_date(year=Year)) %>%
# Filtering out Sierra county due to unreliable data
  filter(County != "Sierra") %>%
# Filtering out strata for this analysis
  filter(Strata == "Total Population") %>%
  select(-Strata, -`Strata Name`)

# adding square mile values to each county
airqualityanddeaths <- airqualityanddeaths %>%
  mutate(sqmiles = case_when(
    County == "San Bernardino" ~ 20057,
    County == "Inyo" ~ 10181,
    County == "Kern" ~ 8132,
    County == "Riverside" ~ 7206,
    County == "Siskiyou" ~ 6278,
    County == "Fresno" ~ 5958,
    County == "Tulare" ~ 4824,
    County == "Lassen" ~ 4541,
    County == "San Diego" ~ 4207,
    County == "Imperial" ~ 4177,
    County == "Los Angeles" ~ 4057,
    County == "Modoc" ~ 3918,
    County == "Shasta" ~ 3775,
    County == "Humboldt" ~ 3568,
    County == "Mendocino" ~ 3506,
    County == "San Luis Obispo" ~ 3299,
    County == "Monterey" ~ 3281,
    County == "Trinity" ~ 3179,
    County == "Mono" ~ 3049,
    County == "Tehama" ~ 2950,
    County == "Santa Barbara" ~ 2735,
    County == "Plumas" ~ 2553,
    County == "Tuolumne" ~ 2221,
    County == "Madera" ~ 2137,
    County == "Merced" ~ 1935,
    County == "Ventura" ~ 1843,
    County == "El Dorado" ~ 1708,
    County == "Butte" ~ 1636,
    County == "Sonoma" ~ 1576,
    County == "Stanislaus" ~ 1495,
    County == "Mariposa" ~ 1449,
    County == "Placer" ~ 1407,
    County == "San Joaquin" ~ 1391,
    County == "Kings" ~ 1389,
    County == "San Benito" ~ 1389,
    County == "Glenn" ~ 1314,
    County == "Santa Clara" ~ 1290,
    County == "Lake" ~ 1256,
    County == "Colusa" ~ 1151,
    County == "Calaveras" ~ 1020,
    County == "Yolo" ~ 1015,
    County == "Del Norte" ~ 1006,
    County == "Sacramento" ~ 965,
    County == "Nevada" ~ 958,
    County == "Solano" ~ 822,
    County == "Orange" ~ 791,
    County == "Napa" ~ 748,
    County == "Alameda" ~ 739,
    County == "Alpine" ~ 738,
    County == "Contra Costa" ~ 716,
    County == "Yuba" ~ 632,
    County == "Sutter" ~ 602,
    County == "Amador" ~ 595,
    County == "Marin" ~ 520,
    County == "San Mateo" ~ 448,
    County == "Santa Cruz" ~ 445,
    County == "San Francisco" ~ 47
  ))

airqualityanddeaths <- airqualityanddeaths %>%
  mutate(Popdensity = Population/sqmiles)

# source https://www.eia.gov/electricity/data/eia860/
plants <- read_excel("Final Project Data/CA Plants 2010.xlsx")
plantsbycounty <- as.data.frame(table(plants$COUNTY)) %>%
  rename("County" = `Var1`, "Plants" = `Freq`)
airqualityanddeaths <- left_join(airqualityanddeaths, plantsbycounty, by = "County")

unemployment <- read_excel("Final Project Data/Incomedata.xlsx")
unemployment <- unemployment %>%
  filter(State == "CA" & grepl("CA", Area_Name) & Area_Name != "Sierra, CA") %>%
  separate(col = Area_Name, into = c("County", "Delete"), sep = " County") %>%
  select(County, contains("Unemployment_rate")) %>%
  pivot_longer(cols = 2:24, names_to = "Yeardelete", values_to = "Unemploymentrate") %>%
  separate(col = Yeardelete, into = c("Delete", "Year"), sep = "rate_") %>%
  mutate(`Year` = as.numeric(Year)) %>%
  mutate(Year = make_date(year=Year)) %>%
  select(County, Year, Unemploymentrate) %>%
  filter(year(Year) %in% 2000:2010)

income <- read_excel("Final Project Data/Incomedata.xlsx")
income <- income %>%
  filter(State == "CA" & grepl("CA", Area_Name)) %>%
  separate(col = Area_Name, into = c("County", "Delete"), sep = " County") %>%
  select(County, Median_Household_Income_2021) %>%
  rename(`2021` = `Median_Household_Income_2021`) %>%
  #estimating 2000 raw income data based on income growth rates
  mutate(`2000raw` = `2021`*.6,
  #then calculating inflation adjusted values to normalized year 2010
  # inflation adjustment from here: https://www.usinflationcalculator.com/
  `2000` = `2000raw` + `2000raw`*0.266,
  #then estimating subsequent years based on real income growth rate
  # rates from here: https://www.multpl.com/us-median-real-income-growth/table/by-year
  `2001` = `2000` + `2000`*-0.022,
  `2002` = `2001` + `2001`*-0.013,
  `2003` = `2002` + `2002`*-0.001,
  `2004` = `2003` + `2003`*-0.004,
  `2005` = `2004` + `2004`*0.011,
  `2006` = `2005` + `2005`*0.008,
  `2007` = `2006` + `2006`*0.013,
  `2008` = `2007` + `2007`*-0.037,
  `2009` = `2008` + `2008`*-0.007,
  `2010` = `2009` + `2009`*-0.026) %>%
  select(-`2021`, -`2000raw`) %>%
  pivot_longer(cols = c(`2000`:`2010`), values_to = "medianrealincome", names_to = "Year") %>%
  mutate(`Year` = as.numeric(Year)) %>%
  mutate(Year = make_date(year=Year))
 
         
RUCC <- read_excel("Final Project Data/Incomedata.xlsx")
RUCC <- RUCC %>%
  filter(State == "CA" & grepl("CA", Area_Name)) %>%
  separate(col = Area_Name, into = c("County", "Delete"), sep = " County") %>%
  select(County, Rural_Urban_Continuum_Code_2013) %>%
  rename(RUCC = `Rural_Urban_Continuum_Code_2013`)

airqualityanddeaths <- left_join(airqualityanddeaths, unemployment, by = c("County", "Year"))

airqualityanddeaths <- left_join(airqualityanddeaths, income, by = c("County", "Year"))

airqualityanddeaths <- left_join(airqualityanddeaths, RUCC, by = "County")

airqualityanddeaths <- airqualityanddeaths %>%
    mutate(Plants = if_else(is.na(Plants), 0, Plants))

airqualityanddeaths <- airqualityanddeaths %>%
  mutate(Powerplantdensity = Plants/sqmiles*1000)

airqualityanddeaths <- airqualityanddeaths %>%
  select(-Deaths) %>%
  filter(Cause %in% c("CLD", "HTD")) %>%
  pivot_wider(names_from = Cause, values_from = `Deaths per 100,000`)  %>%
  mutate(Deathsper100kheartlung = HTD + CLD) %>%
  filter(!is.na(Deathsper100kheartlung)) %>%
  filter(Deathsper100kheartlung > 0)

The Issue with Small Counties

Before creating an optimal model for explaining heart and lung disease deaths, I want to explain why I’ve chosen to take a subset of the data for the final analysis.

My original data set contains 609 observations - inique county-years in California, from 2000-2010 inclusive. This includes all 57 counties (minus Sierra due to missing data).

Code

nrow(airqualityanddeaths)

[1] 609

Code

unique(airqualityanddeaths$County)

 [1] "Alameda"         "Amador"          "Butte"           "Calaveras"      
 [5] "Colusa"          "Contra Costa"    "Del Norte"       "El Dorado"      
 [9] "Fresno"          "Glenn"           "Humboldt"        "Imperial"       
[13] "Inyo"            "Kern"            "Kings"           "Lake"           
[17] "Lassen"          "Los Angeles"     "Madera"          "Marin"          
[21] "Mariposa"        "Mendocino"       "Merced"          "Modoc"          
[25] "Mono"            "Monterey"        "Napa"            "Nevada"         
[29] "Orange"          "Placer"          "Plumas"          "Riverside"      
[33] "Sacramento"      "San Benito"      "San Bernardino"  "San Diego"      
[37] "San Francisco"   "San Joaquin"     "San Luis Obispo" "San Mateo"      
[41] "Santa Barbara"   "Santa Clara"     "Santa Cruz"      "Shasta"         
[45] "Siskiyou"        "Solano"          "Sonoma"          "Stanislaus"     
[49] "Sutter"          "Tehama"          "Trinity"         "Tulare"         
[53] "Tuolumne"        "Ventura"         "Yolo"            "Yuba"

Performing a basic linear model of Air Quality on Death Rate using this data reveals that there is almost no signal between the two. In fact, there is a slightly negative beta coefficient for PM2.5 Days, and extremely low significance, with a P-value of nearly 60%, and an R squared of 0.0006.

Code

summary(lm(data = airqualityanddeaths, `Deathsper100kheartlung` ~ `PM2.5 Days`))


Call:
lm(formula = Deathsper100kheartlung ~ `PM2.5 Days`, data = airqualityanddeaths)

Residuals:
    Min      1Q  Median      3Q     Max 
-164.72  -49.89  -11.20   43.48  233.36 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  235.3998     3.7644  62.533   <2e-16 ***
`PM2.5 Days`  -0.1033     0.1770  -0.584    0.559    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 75.2 on 607 degrees of freedom
Multiple R-squared:  0.0005614, Adjusted R-squared:  -0.001085 
F-statistic: 0.341 on 1 and 607 DF,  p-value: 0.5595

Graphing a scatter plot and the linear model shows a bit more detail about what is going on. As you get down into the counties with low PM2.5 Days (i.e. better air quality), there is an extreme variance in the rate of deaths. As a result, though there does seem to be slight positive visual trend as you get into the county-years with higher PM2.5 Days, the overall trend is inconclusive.

Code

airqualityanddeaths %>% 
  ggplot(aes(x = `PM2.5 Days`, y = `Deathsper100kheartlung`)) +
  geom_point() +
  geom_smooth(method = "lm") +
  labs(title = "PM2.5 Days vs CLD and HTD Deaths per 100k",
       subtitle = "Across all County-Years, CA, 2000-2010",
       x = "PM 2.5 Days",
       y = "CLD and HTD Deaths per 100,000")

`geom_smooth()` using formula = 'y ~ x'

I believe that this is largely being driven including small rural counties in my model, which is doing two things:

1. The Law of Small Numbers

When the population for a county is small, small random changes in absolute deaths cause a large swing in the death rate for that county. There are plenty of instances where deaths by heart or lung disease are in the single or low double digits.

Selecting a Subset of Large, Metro Counties

Therefore, I have opted to only perform an analysis on large, metropolitan counties. By doing so, I can focus more on the impact of air quality directly on deaths by lung or heart disease instead of other hidden variables.

I’m conscious that this is not the ideal approach, as it opens the door to sample selection bias; however, it will allow me to draw more meaningful conclusions about the affect of air quality in metropolitan counties until I can measure excess deaths in a more sophisticated manner.

In my analysis, therefore, I can only make claims about large metropolitan counties in California, not US or CA counties in general.

Below, I have used RUCC to filter only for counties with a score of 1, which indicates that metropolitan areas of 1 million in population or more are at least partially within their boundaries.

This still leaves me with more than enough data to draw statistical inferences; 176 observations, including 16 unique counties across the 11 years.

See below for a list of unique counties included.

Code

largemetrocounties <- airqualityanddeaths %>%
  filter(RUCC <= 1)

nrow(largemetrocounties)

[1] 176

Code

unique(largemetrocounties$County)

 [1] "Alameda"        "Contra Costa"   "El Dorado"      "Los Angeles"   
 [5] "Marin"          "Orange"         "Placer"         "Riverside"     
 [9] "Sacramento"     "San Benito"     "San Bernardino" "San Diego"     
[13] "San Francisco"  "San Mateo"      "Santa Clara"    "Yolo"

Analysis

1. Identifying My Model

\[ Heart and Lung Death Rate = \beta_0 + PM2.5Days*\beta_1 + OzoneDays*\beta_2 + PowerPlantDensity*\beta_3 + \epsilon \]

My hypothesis is that air quality, specifically PM2.5 Days per year, will result in a higher rate of deaths by heart and lung causes in large metropolitan counties in California.

While I believe PM2.5 Days will have a larger impact, Ozone Days are also a factor that should be included in my analysis. While these two are strongly correlated, Ozone and PM2.5 may have varying health impacts.

Finally, I would like to try to account for Social Determinants of Health. These are the conditions of a person’s environment and lifestyle that impact their health outcomes. I want to select a measure of social determinants of health that reduced Omitted Variable Bias. To do so, it needs to not only have a potential impact on heart and lung disease deaths, but also be correlated with air quality. Therefore, I have chosen the density of power plants per 1,000 square miles. Power plants are a leading polluter and contributor to poor air quality. Simultaneously, power plants tend to be located in poorer, less affluent neighborhoods. Thus, including power plants will allow me to test for whether what I am actually measuring are social determinants of health’s impacts on rate of death by heart and lung disease.

2. Creating A Model Strategy to Ensure External and Internal Validity

a. Summary of Data and b. Custom Variables

My composite data set consists of a few different groups of variables. I will explore means and SDs

First are Year and County. Each observation in my data is a unique combination of the two.

Next are the air quality data types. These include PM2.5 Days and Ozone Days, but also some additional ways of measuring PM2.5 or Ozone which I will not explore further in this analysis.

I have calculated death rate totals for CLD and HTD types, and combined those to produce a Death per 100k by Heart and Lung Disease variable.

I’ve included county populations and square miles per county in order to create variables that are normalized. These include my death rates, population density, and power plant density.

Finally, I have a whole set of economic data (Median Income, Unemployment Rate, etc.) that I added to test various variables to determine the best social determinant of health. I will not explore these further here.

Code

summary(largemetrocounties)

      Year               County            Population        Ozone Days    
 Min.   :2000-01-01   Length:176         Min.   :  53781   Min.   :  0.00  
 1st Qu.:2002-01-01   Class :character   1st Qu.: 252440   1st Qu.:  2.75  
 Median :2005-01-01   Mode  :character   Median :1141384   Median : 10.00  
 Mean   :2004-12-31                      Mean   :1712229   Mean   : 31.80  
 3rd Qu.:2008-01-01                      3rd Qu.:1934820   3rd Qu.: 46.25  
 Max.   :2010-01-01                      Max.   :9830420   Max.   :125.00  
 Ozone Person Days     PM2.5 Days     PM2.5 Person Days   PM2.5 Percent of Days
 Min.   :0.000e+00   Min.   :  0.00   Min.   :        0   Min.   : 0.000       
 1st Qu.:5.953e+05   1st Qu.:  4.00   1st Qu.:  1745962   1st Qu.: 1.096       
 Median :9.023e+06   Median : 10.97   Median :  9124978   Median : 3.005       
 Mean   :9.349e+07   Mean   : 16.87   Mean   : 47249901   Mean   : 4.623       
 3rd Qu.:6.817e+07   3rd Qu.: 19.00   3rd Qu.: 34198875   3rd Qu.: 5.205       
 Max.   :1.084e+09   Max.   :117.00   Max.   :818212890   Max.   :32.055       
 PM2.5 Annual Average    sqmiles          Popdensity           Plants     
 Min.   : 0.00        Min.   :   47.0   Min.   :   38.72   Min.   :  0.0  
 1st Qu.:10.67        1st Qu.:  733.2   1st Qu.:  196.16   1st Qu.:  6.5  
 Median :11.96        Median : 1152.5   Median : 1006.42   Median : 18.0  
 Mean   :13.46        Mean   : 2910.1   Mean   : 2026.27   Mean   : 30.0  
 3rd Qu.:14.68        3rd Qu.: 2295.2   3rd Qu.: 1693.30   3rd Qu.: 38.5  
 Max.   :30.34        Max.   :20057.0   Max.   :17137.51   Max.   :103.0  
 Unemploymentrate medianrealincome      RUCC   Powerplantdensity
 Min.   : 2.800   Min.   : 52063   Min.   :1   Min.   :  0.000  
 1st Qu.: 4.700   1st Qu.: 58359   1st Qu.:1   1st Qu.:  6.062  
 Median : 5.400   Median : 71932   Median :1   Median : 14.002  
 Mean   : 6.424   Mean   : 73156   Mean   :1   Mean   : 22.923  
 3rd Qu.: 7.200   3rd Qu.: 83474   3rd Qu.:1   3rd Qu.: 22.466  
 Max.   :15.500   Max.   :107226   Max.   :1   Max.   :148.936  
      CLD             HTD         Deathsper100kheartlung
 Min.   : 0.00   Min.   : 69.54   Min.   : 69.54        
 1st Qu.:30.30   1st Qu.:149.67   1st Qu.:182.79        
 Median :35.05   Median :171.72   Median :206.04        
 Mean   :36.24   Mean   :169.31   Mean   :205.55        
 3rd Qu.:41.26   3rd Qu.:194.44   3rd Qu.:230.44        
 Max.   :73.95   Max.   :249.26   Max.   :306.13

c. Means and Standard Deviations of Regressors

Find the means and standard deviations of my regressors in the table below:

Code

meanandsd <- largemetrocounties %>%
  summarise(MeanDeathsper100k = mean(`Deathsper100kheartlung`),
            SDDeathsper100k = sd(`Deathsper100kheartlung`),
            MeanPM2.5Days = mean(`PM2.5 Days`),
            SDPM2.5Days = sd(`PM2.5 Days`),
            MeanOzoneDays = mean(`Ozone Days`),
            SDOzoneDays = sd(`Ozone Days`),
            MeanPowerplantdensity = mean(`Powerplantdensity`),
            SDPowerplantdensity = sd(`Powerplantdensity`))
meanandsd

# A tibble: 1 × 8
  MeanDeathsper100k SDDeathspe…¹ MeanP…² SDPM2…³ MeanO…⁴ SDOzo…⁵ MeanP…⁶ SDPow…⁷
              <dbl>        <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
1              206.         43.7    16.9    20.8    31.8    38.8    22.9    34.5
# … with abbreviated variable names ¹SDDeathsper100k, ²MeanPM2.5Days,
#   ³SDPM2.5Days, ⁴MeanOzoneDays, ⁵SDOzoneDays, ⁶MeanPowerplantdensity,
#   ⁷SDPowerplantdensity

d. Estimating a Simple Linear Model

My first basic linear model uses only PM2.5 Days as a regressor on Deaths per 100,000.

Seen in the summary below, this is of itself a statistically significant relationship.

The coefficient on PM2.5 Days is .95. Suggesting that for each additional day of PM2.5 levels over the national average, counties will experience an additional .95 deaths per 100,000 for that year.

The p-value is extremely small (e-10).

The adjusted r-squared is just under .2, but this alone is not meaningful until compared with other models.

Code

simplemod <- lm(data = largemetrocounties, `Deathsper100kheartlung` ~ `PM2.5 Days`)
summary(simplemod)


Call:
lm(formula = Deathsper100kheartlung ~ `PM2.5 Days`, data = largemetrocounties)

Residuals:
     Min       1Q   Median       3Q      Max 
-130.443  -17.874    0.427   19.526  110.114 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  189.5654     3.7996  49.891  < 2e-16 ***
`PM2.5 Days`   0.9471     0.1421   6.664 3.37e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 39.1 on 174 degrees of freedom
Multiple R-squared:  0.2033,    Adjusted R-squared:  0.1988 
F-statistic: 44.41 on 1 and 174 DF,  p-value: 3.373e-10

e. Estimating a Simple LinearLog Regression

I’ve estimated a simple linear-log regression below, to see if changes in PM2.5 Days measured as a percentage yield better results for a uni-variate model.

As you can see below, while still statistically significant, the adjusted R squared value has decreased to .13, suggesting it is a worse fit to my data than the linear model.

Code

simplelogmod <- lm(data = largemetrocounties, `Deathsper100kheartlung` ~ log(`PM2.5 Days`+.1))

summary(simplelogmod)


Call:
lm(formula = Deathsper100kheartlung ~ log(`PM2.5 Days` + 0.1), 
    data = largemetrocounties)

Residuals:
     Min       1Q   Median       3Q      Max 
-139.714  -17.141    1.317   19.800  114.558 

Coefficients:
                        Estimate Std. Error t value Pr(>|t|)    
(Intercept)              185.112      4.974  37.216  < 2e-16 ***
log(`PM2.5 Days` + 0.1)   10.030      1.921   5.222 5.01e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 40.73 on 174 degrees of freedom
Multiple R-squared:  0.1355,    Adjusted R-squared:  0.1305 
F-statistic: 27.27 on 1 and 174 DF,  p-value: 5.012e-07

g. Estimating a Simple Cubic Model

When treating PM2.5 Days as a cubic variable, the adjusted R squared increases, at first suggesting that it may be a better fit.

However, as seen below, the p-values on the coefficients, particularly for the squared and cubed orders have gotten much worse. PM2.5 Days squared has a 0.08 p-value, and PM2.5 Days cubed has a 0.13 p-value. Given that these are several magnitudes worse than for my linear model, this does not give me a lot of confidence that a cubic model is the best fit.

Code

simplecubicmod <- lm(data = largemetrocounties, `Deathsper100kheartlung` ~ I(`PM2.5 Days`) + I(`PM2.5 Days`^2) + I(`PM2.5 Days`^3))

summary(simplecubicmod)


Call:
lm(formula = Deathsper100kheartlung ~ I(`PM2.5 Days`) + I(`PM2.5 Days`^2) + 
    I(`PM2.5 Days`^3), data = largemetrocounties)

Residuals:
     Min       1Q   Median       3Q      Max 
-133.114  -17.442   -0.444   18.445  117.140 

Coefficients:
                    Estimate Std. Error t value Pr(>|t|)    
(Intercept)        1.812e+02  5.766e+00  31.432  < 2e-16 ***
I(`PM2.5 Days`)    2.295e+00  7.147e-01   3.211  0.00158 ** 
I(`PM2.5 Days`^2) -3.379e-02  1.922e-02  -1.758  0.08054 .  
I(`PM2.5 Days`^3)  2.025e-04  1.331e-04   1.522  0.12984    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 38.9 on 172 degrees of freedom
Multiple R-squared:  0.2203,    Adjusted R-squared:  0.2067 
F-statistic:  16.2 on 3 and 172 DF,  p-value: 2.554e-09

h. Plotting Simple Linear, Log and Cubic Models

I’ve plotted the three models below along with a scatter plot of PM2.5 Days against Deaths per 100,000.

The linear model is in blue, the log model is in green, and the cubic model is in red.

It’s clear from this plot that there is a large concentration of county-years with lower PM2.5 Days, particularly less than 30. There is a lot of variance in the data at this level and some heteroskedasticity, but this will be addressed further later on when I calculate robust standard errors.

Code

simplemod_predict <- cbind(largemetrocounties, predict(simplemod, interval = 'confidence')) %>%
  rename("fitlm" = fit, "lwrlm" = lwr, "uprlm" = upr)

simplelogmod_predict <- cbind(simplemod_predict, predict(simplelogmod, interval = 'confidence')) %>%
  rename("fitlog" = fit, "lwrlog" = lwr, "uprlog" = upr)

simplecubicmod_predict <- cbind(simplelogmod_predict, predict(simplecubicmod, interval = 'confidence')) %>%
  rename("fitcub" = fit, "lwrcub" = lwr, "uprcub" = upr)

simplecubicmod_predict %>%
  ggplot(aes(x = `PM2.5 Days`, y = `Deathsper100kheartlung`)) +
  geom_point() +
  geom_smooth(aes(x = `PM2.5 Days`, y = fitlm), col = "blue") +
  geom_smooth(aes(x = `PM2.5 Days`, y = fitlog), col = "green") +
  geom_smooth(aes(x = `PM2.5 Days`, y = fitcub), col = "red") +
  labs(title = "CLD and HTD Deaths per 100k ~ PM2.5 Days",
       subtitle = "Large Metro County-Years, CA, 2000-2010",
       x = "PM 2.5 Days",
       y = "CLD and HTD Deaths per 100,000")

`geom_smooth()` using method = 'loess' and formula = 'y ~ x'
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'

3. Visual Analysis of Models

The log model (green) seems do a pretty good job below 30 days, but is probably the worst visual fit because it plateaus too soon, not accounting for the distinct increase in deaths as the highest PM2.5 Day levels are reached.

When it comes to the linear (blue) and cubic (red) models, they are visually very closely matched. There is perhaps a slight edge to the cubic model, given the small ‘bump’ around 30 days. There is also a ‘dip’ towards the lower end of the data. It’s not clear to me whether this improves the model, due to the distinct spread of death rates at this point.

Overall, I don’t believe that the linear and cubic models are visually distinct enough to eliminate either entirely from contention.

I will still include all models in further analysis to investigate further.

4. Finding the Optimal Model

a. Estimating Specifications for Uni-variate and Multivariate of Each Type

I’ve created 3 models under each of the types previously identified (linear, log, and cubic).

In each, I’ve iteratively added the two additional regressors as previously identified: Ozone Days and Power Plant Density in an attempt to reduce OVB.

Linear models

Code

simplelm <- lm(data = largemetrocounties, `Deathsper100kheartlung` ~ `PM2.5 Days`)
multlm1 <- lm(data = largemetrocounties, `Deathsper100kheartlung` ~ `PM2.5 Days` + `Ozone Days`)
multlm2 <- lm(data = largemetrocounties, `Deathsper100kheartlung` ~ `PM2.5 Days` + `Ozone Days` + `Powerplantdensity`)

Log Models

Code

simplelog <- lm(data = largemetrocounties, `Deathsper100kheartlung` ~ log(`PM2.5 Days` + .1))
multlog1 <- lm(data = largemetrocounties, `Deathsper100kheartlung` ~ log(`PM2.5 Days` + .1) + `Ozone Days`)
multlog2 <- lm(data = largemetrocounties, `Deathsper100kheartlung` ~ log(`PM2.5 Days` + .1) + `Ozone Days` + `Powerplantdensity`)

Cubic Models

Code

simplecub <- lm(data = largemetrocounties, `Deathsper100kheartlung` ~ I(`PM2.5 Days`) + I(`PM2.5 Days`^2) + I(`PM2.5 Days`^3))
multcub1 <- lm(data = largemetrocounties, `Deathsper100kheartlung` ~ I(`PM2.5 Days`) + I(`PM2.5 Days`^2) + I(`PM2.5 Days`^3) + `Ozone Days`)
multcub2 <- lm(data = largemetrocounties, `Deathsper100kheartlung` ~ I(`PM2.5 Days`) + I(`PM2.5 Days`^2) + I(`PM2.5 Days`^3) + `Ozone Days` + `Powerplantdensity`)

Calculating Robust Standard Errors

I’ve calculated robust standard errors for each model. Doing so allows me to account for the heteroskedasticity apparent in the data, with their being a higher variance in death rate for county-years with lower PM2.5 Days. This will allow me to compare models while maintaining the OLS assumptions.

Code

rob_se <- list(sqrt(diag(vcovHC(simplelm, type = "HC1"))),
                sqrt(diag(vcovHC(multlm1, type = "HC1"))),
                sqrt(diag(vcovHC(multlm2, type = "HC1"))),
                sqrt(diag(vcovHC(simplelog, type = "HC1"))),
                sqrt(diag(vcovHC(multlog1, type = "HC1"))),
                sqrt(diag(vcovHC(multlog2, type = "HC1"))),
                sqrt(diag(vcovHC(simplecub, type = "HC1"))),
                sqrt(diag(vcovHC(multcub1, type = "HC1"))),
                sqrt(diag(vcovHC(multcub2, type = "HC1"))))

Code

modlist <- list(simplelm, multlm1, multlm2, simplelog, multlog1, multlog2, simplecub, multcub1, multcub2)

Reporting in Stargazer

See below for a detailed report of the models. I will discuss these results further in the conclusion section.

Code

suppressWarnings(stargazer(modlist,
          title = "Linear Regressions Using CA Air Quality and Deaths by Cause Data",
          type = "html",
          se = rob_se,
          digits = 3,
          header = FALSE,
          object.names = TRUE,
          model.numbers = FALSE,
          column.labels = c("simplelm", "multlm1", "multlm2", "simplelog", "multlog1", "multlog2", "simplecub", "multcub1", "multcub2"),
          model.names = FALSE))

**Linear Regressions Using CA Air Quality and Deaths by Cause Data**

	Dependent variable:

	Deathsper100kheartlung
	simplelm	multlm1	multlm2	simplelog	multlog1	multlog2	simplecub	multcub1	multcub2
	modlist	NA	NA	NA	NA	NA	NA	NA	NA

`PM2.5 Days`	0.947^***	0.649^***	0.592^***
	(0.102)	(0.127)	(0.127)

`Ozone Days`		0.249^***	0.339^***		0.356^***	0.444^***		0.252^***	0.343^***
		(0.068)	(0.072)		(0.072)	(0.076)		(0.071)	(0.075)

Powerplantdensity			0.333^***			0.332^***			0.330^***
			(0.072)			(0.070)			(0.069)

log(`PM2.5 Days` + 0.1)				10.030^***	5.959^***	5.044^**
				(2.026)	(2.131)	(2.158)

I(`PM2.5 Days`)							2.295^***	1.998^***	1.875^***
							(0.641)	(0.650)	(0.641)

I(`PM2.5 Days`2)							-0.034^**	-0.036^**	-0.035^**
							(0.015)	(0.015)	(0.015)

I(`PM2.5 Days`3)							0.0002^**	0.0002^***	0.0002^**
							(0.0001)	(0.0001)	(0.0001)

Constant	189.565^***	186.675^***	177.131^***	185.112^***	182.099^***	173.560^***	181.225^***	178.656^***	169.684^***
	(3.924)	(4.049)	(5.035)	(5.500)	(5.163)	(6.013)	(5.890)	(5.874)	(6.667)


Observations	176	176	176	176	176	176	176	176	176
R²	0.203	0.232	0.297	0.135	0.213	0.277	0.220	0.249	0.313
Adjusted R²	0.199	0.223	0.285	0.131	0.204	0.264	0.207	0.232	0.292
Residual Std. Error	39.099 (df = 174)	38.493 (df = 173)	36.941 (df = 172)	40.730 (df = 174)	38.969 (df = 173)	37.468 (df = 172)	38.905 (df = 172)	38.285 (df = 171)	36.741 (df = 170)
F Statistic	44.414^*** (df = 1; 174)	26.170^*** (df = 2; 173)	24.225^*** (df = 3; 172)	27.268^*** (df = 1; 174)	23.433^*** (df = 2; 173)	21.945^*** (df = 3; 172)	16.200^*** (df = 3; 172)	14.198^*** (df = 4; 171)	15.470^*** (df = 5; 170)

Note:	p<0.1; p<0.05; p<0.01

d. Linear Hypothesis Tests

Next, I will perform linear hypothesis tests for each model. This ensures that the relationship between air quality and death rate is statistically significant for each model.

As can be seen below, the F-statistic for all of the models are statistically significant, with p-values well below even 0.1.

Simple Linear Model

Code

linearHypothesis(simplelm, 
                   "`PM2.5 Days` = 0", 
                   vcov. = vcovHC(simplelm, type = "HC1"))

Linear hypothesis test

Hypothesis:
PM2.5 Days` = 0

Model 1: restricted model
Model 2: Deathsper100kheartlung ~ `PM2.5 Days`

Note: Coefficient covariance matrix supplied.

  Res.Df Df      F    Pr(>F)    
1    175                        
2    174  1 86.165 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Multiple Linear Model 1

Code

linearHypothesis(multlm1, 
                   c("`PM2.5 Days` = 0", "`Ozone Days` = 0"),
                   vcov. = vcovHC(multlm1, type = "HC1"))

Linear hypothesis test

Hypothesis:
PM2.5 Days` = 0
Ozone Days` = 0

Model 1: restricted model
Model 2: Deathsper100kheartlung ~ `PM2.5 Days` + `Ozone Days`

Note: Coefficient covariance matrix supplied.

  Res.Df Df      F    Pr(>F)    
1    175                        
2    173  2 50.717 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Multiple Linear Model 2

Code

linearHypothesis(multlm2, 
                   c("`PM2.5 Days` = 0", "`Ozone Days` = 0", "Powerplantdensity = 0"),
                   vcov. = vcovHC(multlm2, type = "HC1"))

Linear hypothesis test

Hypothesis:
PM2.5 Days` = 0
Ozone Days` = 0
Powerplantdensity = 0

Model 1: restricted model
Model 2: Deathsper100kheartlung ~ `PM2.5 Days` + `Ozone Days` + Powerplantdensity

Note: Coefficient covariance matrix supplied.

  Res.Df Df      F    Pr(>F)    
1    175                        
2    172  3 35.417 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Simple Log Model

Code

linearHypothesis(simplelog, 
                   "log(`PM2.5 Days` + 0.1) = 0",
                   vcov. = vcovHC(simplelog, type = "HC1"))

Linear hypothesis test

Hypothesis:
log(`PM2.5 Days` + 0.1) = 0

Model 1: restricted model
Model 2: Deathsper100kheartlung ~ log(`PM2.5 Days` + 0.1)

Note: Coefficient covariance matrix supplied.

  Res.Df Df      F    Pr(>F)    
1    175                        
2    174  1 24.502 1.744e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Multiple Log Model 1

Code

linearHypothesis(multlog1, 
                   c("log(`PM2.5 Days` + 0.1) = 0", "`Ozone Days` = 0"),
                   vcov. = vcovHC(multlog1, type = "HC1"))

Linear hypothesis test

Hypothesis:
log(`PM2.5 Days` + 0.1) = 0
Ozone Days` = 0

Model 1: restricted model
Model 2: Deathsper100kheartlung ~ log(`PM2.5 Days` + 0.1) + `Ozone Days`

Note: Coefficient covariance matrix supplied.

  Res.Df Df     F    Pr(>F)    
1    175                       
2    173  2 32.24 1.257e-12 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Multiple Log Model 2

Code

linearHypothesis(multlog2, 
                   c("log(`PM2.5 Days` + 0.1) = 0", "`Ozone Days` = 0", "Powerplantdensity = 0"),
                   vcov. = vcovHC(multlog2, type = "HC1"))

Linear hypothesis test

Hypothesis:
log(`PM2.5 Days` + 0.1) = 0
Ozone Days` = 0
Powerplantdensity = 0

Model 1: restricted model
Model 2: Deathsper100kheartlung ~ log(`PM2.5 Days` + 0.1) + `Ozone Days` + 
    Powerplantdensity

Note: Coefficient covariance matrix supplied.

  Res.Df Df     F    Pr(>F)    
1    175                       
2    172  3 24.01 4.986e-13 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Simple Cubic Model

Code

linearHypothesis(simplecub, 
                   c("I(`PM2.5 Days`)  = 0", "I(`PM2.5 Days`^2) = 0", "I(`PM2.5 Days`^3) = 0"),
                   vcov. = vcovHC(simplecub, type = "HC1"))

Linear hypothesis test

Hypothesis:
I(`PM2.5 Days`) = 0
I(`PM2.5 Days`^2) = 0
I(`PM2.5 Days`^3) = 0

Model 1: restricted model
Model 2: Deathsper100kheartlung ~ I(`PM2.5 Days`) + I(`PM2.5 Days`^2) + 
    I(`PM2.5 Days`^3)

Note: Coefficient covariance matrix supplied.

  Res.Df Df      F    Pr(>F)    
1    175                        
2    172  3 33.367 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Multiple Cubic Model 1

Code

linearHypothesis(multcub1, 
                   c("I(`PM2.5 Days`)  = 0", "I(`PM2.5 Days`^2) = 0", "I(`PM2.5 Days`^3) = 0", "`Ozone Days` = 0"),
                   vcov. = vcovHC(multcub1, type = "HC1"))

Linear hypothesis test

Hypothesis:
I(`PM2.5 Days`) = 0
I(`PM2.5 Days`^2) = 0
I(`PM2.5 Days`^3) = 0
Ozone Days` = 0

Model 1: restricted model
Model 2: Deathsper100kheartlung ~ I(`PM2.5 Days`) + I(`PM2.5 Days`^2) + 
    I(`PM2.5 Days`^3) + `Ozone Days`

Note: Coefficient covariance matrix supplied.

  Res.Df Df      F    Pr(>F)    
1    175                        
2    171  4 28.326 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Multiple Cubic Model 2

Code

linearHypothesis(multcub2, 
                   c("I(`PM2.5 Days`)  = 0", "I(`PM2.5 Days`^2) = 0", "I(`PM2.5 Days`^3) = 0", "`Ozone Days` = 0", "Powerplantdensity = 0"),
                   vcov. = vcovHC(multcub2, type = "HC1"))

Linear hypothesis test

Hypothesis:
I(`PM2.5 Days`) = 0
I(`PM2.5 Days`^2) = 0
I(`PM2.5 Days`^3) = 0
Ozone Days` = 0
Powerplantdensity = 0

Model 1: restricted model
Model 2: Deathsper100kheartlung ~ I(`PM2.5 Days`) + I(`PM2.5 Days`^2) + 
    I(`PM2.5 Days`^3) + `Ozone Days` + Powerplantdensity

Note: Coefficient covariance matrix supplied.

  Res.Df Df      F    Pr(>F)    
1    175                        
2    170  5 24.417 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

5. Selecting the Optimal Model

I believe that the multiple linear model, including PM2.5 Days, Ozone Days and Power Plant Density best explains the relationship between PM2.5 Days and the rate of deaths caused by heart and lung diseases.

\[ Heart and Lung Death Rate = \beta_0 + PM2.5Days*\beta_1 + OzoneDays*\beta_2 + PowerPlantDensity*\beta_3 + \epsilon \]

As seen from calculating the F-statistics, in each case, we can reject the null hypothesis that there is no relationship between PM2.5 Days and the rate of deaths of heart and lung diseases in large metropolitan counties.

In order to select the optimal model for explaining this relationship, I took a number of steps:

First, I used the visual analysis of simple models to identify that a linear or cubic model were the most likely candidates for explaining the relationship, as discussed in section 1.g.

The Multivariate Model Reduces OVB

Next, I began including additional variables in my models to see how they affected model fit and their impact on the Beta1 coefficient. I will use the case of the linear model below as an example:

In the simple linear model, PM2.5 Days had a coefficient of 0.95, with a p-value of well below 0.1%.
After adding Ozone Days as a regressor, the coefficient on PM2.5 Days decreased by around 30% to 0.65. Ozone Days had a coefficient of 0.25. Both remained statistically significant at the 0.1% level.
Finally, when adding Power Plant Density into the regression model, the coefficient on PM2.5 Days decreased by an additional 10% (0.59), with Ozone Days and Power Plant Density both at around 0.33. All of these were statistically significant at the 0.1% level.

Because adding these additional variables reduced the coefficient on PM2.5 Days by considerable margins, and all variables remained statistically significant, it’s safe to assume that adding these variables reduced omitted variable bias, and multi-colinearity was not present. Additionally, the adjusted R squared value increased in each case, suggesting that each subsequent model better fit the data.

The Multi-Linear Model is Extremely Significant and Easy to Interpret

Based on adjusted R squared alone, which indicates how well the model fits the data, the cubic model with additional regressors of Ozone Days and Power Plant Density (0.292) is slightly superior to the multivariate linear model (0.285) - but these are very close and do not tell the entire story.

When we look at the significance levels of the individual coefficients for the squared and cubic orders of PM2.5 Days, we see a marked decrease in p-values, both only signficant at the 10% level vs 0.1% level for all variables in the linear model.

Finally, the F-statistic, though significant, is considerably lower in the cubic model (15) than in the linear model (24).

Because of all of these factors, and because the linear model is easier to interpret, I have decided that the multi-linear model best explains this relationship.

In summary, for each additional day of PM2.5 levels over the NAAQS national average, an additional 0.59 deaths per 100,000 of heart and lung causes occur in large metropolitan counties, while holding Ozone Days and Social Determinants of Health constant. To put this into perspective, if Los Angeles county (9.8 million population) were to experience an additional 10 PM2.5 Days in a calendar year, my model estimates they could expect roughly an additional 60 deaths by lung and heart causes.

Bibliography

[1] Kurt, O. K., Zhang, J., & Pinkerton, K. E. (2016). Pulmonary health effects of air pollution. Current opinion in pulmonary medicine, 22(2), 138–143. https://doi.org/10.1097/MCP.0000000000000248

[2] Jonathan J Buonocore et al (2023) Air pollution and health impacts of oil & gas production in the United States Environ. Res.: Health 1 021006. https://iopscience.iop.org/article/10.1088/2752-5309/acc886

[3] Centers for Disease Control and Prevention (2018). Air Quality Measures on the National Environmental Health Tracking Network, 1999 - 2013, [https://data.cdc.gov/Environmental-Health-Toxicology/Air-Quality-Measures-on-the-National-Environmental/cjae-szjv].

[4] California Health and Human Services (2021). 1999-2013 Final Deaths by Year by County,[https://data.chhs.ca.gov/dataset/death-profiles-by-county/resource/e692e6a1-bddd-48ab-a0c8-fa0f1f43e9f4?i %20nner_span=True].

[5] United States Census Bureau (2021). County Intercensal Tables: 2000-2010,[https://www.census.gov/data/tables/time-series/demo/popest/intercensal-2000-2010-counties.html].

[6] U.S. Energy Information Administration (2023). Form EIA-860, [https://www.eia.gov/electricity/data/eia860/]

[7] Economic Research Service of the US Department of Agriculture (2023). [https://www.ers.usda.gov/]

[8] R Core Team (2022). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/.

[9] Wickham H, Averick M, Bryan J, Chang W, McGowan LD, François R, Grolemund G, Hayes A, Henry L, Hester J, Kuhn M, Pedersen TL, Miller E, Bache SM, Müller K, Ooms J, Robinson D, Seidel DP, Spinu V, Takahashi K, Vaughan D, Wilke C, Woo K, Yutani H (2019). “Welcome to the tidyverse.” _Journal of Open Source Software_, *4*(43), 1686. doi:10.21105/joss.01686 <https://doi.org/10.21105/joss.01686>.

[10] Fox J, Weisberg S (2019). _An R Companion to Applied Regression_, Third edition. Sage, Thousand Oaks CA. <https://socialsciences.mcmaster.ca/jfox/Books/Companion/>.

[11] Hlavac, Marek (2022). stargazer: Well-Formatted Regression and Summary Statistics Tables. R package version 5.2.3. https://CRAN.R-project.org/package=stargazer

Introduction

Reading, cleaning and combining data sets

The Issue with Small Counties

1. The Law of Small Numbers

2. Social Determinants of Health

Selecting a Subset of Large, Metro Counties

Analysis

1. Identifying My Model

2. Creating A Model Strategy to Ensure External and Internal Validity

a. Summary of Data and b. Custom Variables

c. Means and Standard Deviations of Regressors

d. Estimating a Simple Linear Model

e. Estimating a Simple LinearLog Regression

g. Estimating a Simple Cubic Model

h. Plotting Simple Linear, Log and Cubic Models

3. Visual Analysis of Models

4. Finding the Optimal Model

a. Estimating Specifications for Uni-variate and Multivariate of Each Type

Linear models

Log Models

Cubic Models

Calculating Robust Standard Errors

Reporting in Stargazer

d. Linear Hypothesis Tests

Simple Linear Model

Multiple Linear Model 1

Multiple Linear Model 2

Simple Log Model

Multiple Log Model 1

Multiple Log Model 2

Simple Cubic Model

Multiple Cubic Model 1

Multiple Cubic Model 2

5. Selecting the Optimal Model

The Multivariate Model Reduces OVB

The Multi-Linear Model is Extremely Significant and Easy to Interpret

Bibliography