finalpart2
Author

Caleb Hill

Published

November 8, 2022

Introduction

Multiple research reports state that there is a relationship between re-hospitalization rates and social characteristics, such as demographic and economic identifiers (Barnett, Hsu & McWilliams, 2015; Murray, Allen, Clark, Daly & Jacobs, 2021). Specifically, racial characteristics play a large role in predicting re-hospitalization in a population (Li, Cai & Glance, 2015). While some articles examine economic and health factors contributing to these disparities, very few dig deep into the environmental factors that influence this phenomenon (Spatz, Bernheim, Horwitz & Herrin, 2020). With estimates that your zip code can influence up to 60% of your health outcomes, this research is relevant to improving one of our most costly health expenditures: hospitalization.

Re-hospitalization is a substantially costlier expenditure, as readmitting a patient compounds the cost of the initial stay – especially if the original diagnosis was untreated, poorly treated, or incorrectly treated. Most inpatient episodes are characterized as a re-hospitalization when the patient is readmitted to the hospital within 60 days of discharge. If the cause of readmission differs from the original diagnosis, it is sometimes counted as a re-hospitalization and sometimes not (Bhosale et al., 2020).

Research Question

This paper aims to explore how different environmental variables impact re-hospitalization rates at the county level. Due to the nature of this project, we will not be controlling for racial, ethnic, or sex variables. These environmental factors include common environmental measures, such as heat index, average temperature, precipitation, and natural disasters, along with built-environment measures: mean travel time to work, renter burden, and population density. We will also stratify by rural/urban classification to determine whether counties above or below 250,000 urban population experience differences in re-hospitalization rates, dependent upon these explanatory variables.

The data-set chosen for this analysis comes from the Agency for Healthcare Research and Quality (AHRQ) Social Determinants of Health (SDOH) Database. It has over 300 variables covering the five SDOH domains: social context, economic context, education, healthcare, and the environment. We shall pull data from three of these five domains: social, economic, and environmental.

How re-hospitalization is measured is not clarified in this data-set’s codebook. However, the Centers for Medicare & Medicaid Services (CMS) 30-day Risk-Standardized Readmission Rate (RSRR) measures re-hospitalization as an unplanned readmission to inpatient services, stratified and specified by diagnosis. As AHRQ is a federal agency alongside CMS, it is likely that this measure is pulled from CMS and aggregates the various diagnoses into one county-level rate.

Hypothesis

The hypothesis for this research report is:

  • Environmental factors increase rates of re-hospitalization in the United States.

Therefore, the null hypothesis is:

  • Environmental factors do not increase rates of re-hospitalization in the United States.

Various regression analyses shall be employed to determine the relationship – or lack thereof – between these variables.

First I’ll import the relevant libraries.
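
The rendered document does not show the setup chunk, so here is a minimal sketch of the libraries the rest of the code appears to rely on (readxl for the Excel import, the tidyverse for wrangling and plotting, knitr for kable, and psych for describe); treat the exact list as an assumption.

Code
# Likely setup chunk: packages used throughout this analysis
library(readxl)     # read_excel()
library(tidyverse)  # dplyr, tidyr, ggplot2, and the pipe
library(knitr)      # kable()
library(psych)      # describe()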

Then I’ll import the dataset and view the first six rows.

Code
# Read the county-level SDOH file; the "Data" sheet holds the observations
df <- read_excel("_data/SDOH_2020_COUNTY_1_0.xlsx", 
                 sheet = "Data")
Warning: Expecting logical in OA1673 / R1673C391: got '46123'
Warning: Expecting logical in OA1765 / R1765C391: got '32510'
Warning: Expecting logical in OB1765 / R1765C392: got '41025'
Warning: Expecting logical in OC1765 / R1765C393: got '41037'
Warning: Expecting logical in OA2799 / R2799C391: got '49017'
Warning: Expecting logical in OB2799 / R2799C392: got '49019'
Warning: Expecting logical in OC2799 / R2799C393: got '49025'
Warning: Expecting logical in OD2799 / R2799C394: got '49055'
Warning: Expecting logical in OA2844 / R2844C391: got '51760'
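
These warnings most likely come from readxl guessing column types from an initial block of rows and then hitting county codes further down the sheet. A sketch of one way to avoid them, assuming the default guessing window is the culprit, is to widen `guess_max`:

Code
# Re-read with a larger type-guessing window so sparsely populated
# columns are not mis-typed as logical
df <- read_excel("_data/SDOH_2020_COUNTY_1_0.xlsx",
                 sheet = "Data",
                 guess_max = 10000)
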
Code
head(df)
# A tibble: 6 × 685
   YEAR COUNTYFIPS STATEFIPS STATE COUNTY REGION TERRI…¹ ACS_T…² ACS_T…³ ACS_T…⁴
  <dbl> <chr>      <chr>     <chr> <chr>  <chr>    <dbl>   <dbl>   <dbl>   <dbl>
1  2020 01001      01        Alab… Autau… South        0   55639   54929   52404
2  2020 01003      01        Alab… Baldw… South        0  218289  216518  206329
3  2020 01005      01        Alab… Barbo… South        0   25026   24792   23694
4  2020 01007      01        Alab… Bibb … South        0   22374   22073   21121
5  2020 01009      01        Alab… Bloun… South        0   57755   57164   54250
6  2020 01011      01        Alab… Bullo… South        0   10173   10143    9579
# … with 675 more variables: ACS_TOT_POP_ABOVE15 <dbl>,
#   ACS_TOT_POP_ABOVE16 <dbl>, ACS_TOT_POP_16_19 <dbl>,
#   ACS_TOT_POP_ABOVE25 <dbl>, ACS_TOT_CIVIL_POP_ABOVE18 <dbl>,
#   ACS_TOT_CIVIL_VET_POP_ABOVE25 <dbl>, ACS_TOT_OWN_CHILD_BELOW17 <dbl>,
#   ACS_TOT_WORKER_NWFH <dbl>, ACS_TOT_WORKER_HH <dbl>,
#   ACS_TOT_CIVILIAN_LABOR <dbl>, ACS_TOT_CIVIL_EMPLOY_POP <dbl>,
#   ACS_TOT_POP_POV <dbl>, ACS_TOT_CIVIL_NONINST_POP_POV <dbl>, …

Next I want to verify the class is a dataframe. Otherwise, I’ll need to transform the data to make it easier to work with.

Code
class(df)
[1] "tbl_df"     "tbl"        "data.frame"

All good here.

Now on to data transformation. We will need to select only the relevant columns for this analysis.

Code
df_new <- df %>%
  select(COUNTYFIPS,
         STATE,
         COUNTY,
         AHRF_USDA_RUCC_2013,
         CEN_POPDENSITY_COUNTY,
         NEPHTN_HEATIND_105,
         NOAAC_AVG_TEMP_YEARLY,
         NOAAC_PRECIPITATION_AVG_YEARLY,
         NOAAS_TOT_NATURAL_DISASTERS,
         SAIPE_MEDIAN_HH_INCOME,
         SAIPE_PCT_POV,
         ACS_PCT_COMMT_60MINUP,
         ACS_PCT_RENTER_HU_COST_50PCT,
         LTC_AVG_OBS_REHOSP_RATE) 
nrow(df_new)
[1] 3229
Code
head(df_new)
# A tibble: 6 × 14
  COUNTYF…¹ STATE COUNTY AHRF_…² CEN_P…³ NEPHT…⁴ NOAAC…⁵ NOAAC…⁶ NOAAS…⁷ SAIPE…⁸
  <chr>     <chr> <chr>  <chr>     <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
1 01001     Alab… Autau… 2          93.6       4    66.2    5.72      17   67565
2 01003     Alab… Baldw… 3         137.        0    68.9    5.27      58   71135
3 01005     Alab… Barbo… 6          28.3       3    66.4    5.75      24   38866
4 01007     Alab… Bibb … 1          35.9       0    64.1    5.61      23   50907
5 01009     Alab… Bloun… 1          89.6       0    62.7    5.96      45   55203
6 01011     Alab… Bullo… 6          16.3       2    66.2    5.40      12   33124
# … with 4 more variables: SAIPE_PCT_POV <dbl>, ACS_PCT_COMMT_60MINUP <dbl>,
#   ACS_PCT_RENTER_HU_COST_50PCT <dbl>, LTC_AVG_OBS_REHOSP_RATE <dbl>, and
#   abbreviated variable names ¹​COUNTYFIPS, ²​AHRF_USDA_RUCC_2013,
#   ³​CEN_POPDENSITY_COUNTY, ⁴​NEPHTN_HEATIND_105, ⁵​NOAAC_AVG_TEMP_YEARLY,
#   ⁶​NOAAC_PRECIPITATION_AVG_YEARLY, ⁷​NOAAS_TOT_NATURAL_DISASTERS,
#   ⁸​SAIPE_MEDIAN_HH_INCOME

Out of the 685 columns in this file, we’ve whittled things down to 14. Of those 14, four (4) are unique identifiers (FIPS, State, County, and Rural-Urban Continuum Code), four (4) are environmental, two (2) economic, one (1) housing, two (2) built environment, and one (1) a healthcare outcome.

Before we launch into exploring these variables via descriptive statistics, first we need to determine where the NAs are and see if any of the variables will have a substantial amount of missing data.

Code
kable(colSums(is.na(df_new)))
x
COUNTYFIPS 0
STATE 0
COUNTY 0
AHRF_USDA_RUCC_2013 9
CEN_POPDENSITY_COUNTY 8
NEPHTN_HEATIND_105 121
NOAAC_AVG_TEMP_YEARLY 123
NOAAC_PRECIPITATION_AVG_YEARLY 123
NOAAS_TOT_NATURAL_DISASTERS 0
SAIPE_MEDIAN_HH_INCOME 87
SAIPE_PCT_POV 87
ACS_PCT_COMMT_60MINUP 8
ACS_PCT_RENTER_HU_COST_50PCT 8
LTC_AVG_OBS_REHOSP_RATE 410

Plenty of variables have missing data. Some gaps are minor, such as the population density, housing cost, and commute time variables with 8 missing values each. Some are concerning, such as Heat Index, Average Yearly Temperature, and Average Yearly Precipitation, each with 120+ missing values.

The most concerning is – of course – our outcome variable, Re-Hospitalization Rate, with 410 missing. This is not ideal. However, 410 / 3,229 (about 12.7%) is not bad. That still leaves us with plenty of counties to review.

Code
# Drop rows with any missing values, then print the cleaned tibble
df_new <- df_new %>%
  drop_na()

df_new
# A tibble: 2,814 × 14
   COUNTYFIPS STATE   COUNTY          AHRF_USDA_RUCC_2013 CEN_POPDENSITY_COUNTY
   <chr>      <chr>   <chr>           <chr>                               <dbl>
 1 01001      Alabama Autauga County  2                                    93.6
 2 01003      Alabama Baldwin County  3                                   137. 
 3 01005      Alabama Barbour County  6                                    28.3
 4 01007      Alabama Bibb County     1                                    35.9
 5 01009      Alabama Blount County   1                                    89.6
 6 01011      Alabama Bullock County  6                                    16.3
 7 01013      Alabama Butler County   6                                    25.4
 8 01015      Alabama Calhoun County  3                                   189. 
 9 01017      Alabama Chambers County 6                                    56.0
10 01019      Alabama Cherokee County 6                                    47.0
   NEPHTN_HEATIND_105 NOAAC_AVG_TEMP_YEARLY NOAAC_PRECIPITATION_AVG_YEARLY
                <dbl>                 <dbl>                          <dbl>
 1                  4                  66.2                           5.72
 2                  0                  68.9                           5.27
 3                  3                  66.4                           5.75
 4                  0                  64.1                           5.61
 5                  0                  62.7                           5.96
 6                  2                  66.2                           5.40
 7                  0                  67.1                           5.37
 8                  0                  63.2                           5.74
 9                  0                  63.3                           5.93
10                  0                  62.3                           5.85
   NOAAS_TOT_NATURAL_DISASTERS SAIPE_MEDIAN_HH_INCOME SAIPE_PCT_POV
                         <dbl>                  <dbl>         <dbl>
 1                          17                  67565          11.2
 2                          58                  71135           8.9
 3                          24                  38866          25.5
 4                          23                  50907          17.8
 5                          45                  55203          13.1
 6                          12                  33124          30.8
 7                          18                  42268          20.6
 8                          14                  50259          14.5
 9                          18                  39318          16.3
10                          38                  50388          14.7
   ACS_PCT_COMMT_60MINUP ACS_PCT_RENTER_HU_COST_50PCT LTC_AVG_OBS_REHOSP_RATE
                   <dbl>                        <dbl>                   <dbl>
 1                  6.06                         26.6                    0.14
 2                  7.53                         20.8                    0.14
 3                 11.8                          22.4                    0.22
 4                 10.4                          27.4                    0.16
 5                 18.6                          22.6                    0.14
 6                 12.7                          34                      0.05
 7                  9.14                         29.5                    0.11
 8                  7.21                         20.5                    0.13
 9                  5.93                         13.0                    0.15
10                 10.2                          12.2                    0.19
# … with 2,804 more rows

2,814 x 14 is a good place to start.

Descriptive Statistics

For our preliminary analysis, we’re going to provide summary statistics analyzing the 10 variables relevant to our research question, from Population Density to the end of the data-set, and a visualization for each.

Code
kable(describe(df_new))
vars n mean sd median trimmed mad min max range skew kurtosis se
COUNTYFIPS* 1 2814 1.407500e+03 8.124762e+02 1407.50000 1.407500e+03 1043.009100 1.000000e+00 2814.00000 2.813000e+03 0.0000000 -1.2012794 15.3161135
STATE* 2 2814 2.442928e+01 1.359471e+01 23.00000 2.441874e+01 16.308600 1.000000e+00 48.00000 4.700000e+01 0.0584273 -1.2662366 0.2562759
COUNTY* 3 2814 8.369058e+02 4.694846e+02 831.50000 8.339405e+02 584.885700 1.000000e+00 1674.00000 1.673000e+03 0.0526568 -1.1127220 8.8503264
AHRF_USDA_RUCC_2013* 4 2814 4.772210e+00 2.619826e+00 6.00000 4.717140e+00 2.965200 1.000000e+00 9.00000 8.000000e+00 0.0039932 -1.2985960 0.0493867
CEN_POPDENSITY_COUNTY 5 2814 2.893180e+02 1.877174e+03 50.74500 8.620448e+01 55.404762 5.000000e-01 71895.54000 7.189504e+04 25.8518141 850.3767977 35.3868910
NEPHTN_HEATIND_105 6 2814 4.098792e+00 8.186776e+00 0.00000 1.963144e+00 0.000000 0.000000e+00 59.00000 5.900000e+01 2.9358498 9.6132423 0.1543302
NOAAC_AVG_TEMP_YEARLY 7 2814 5.627572e+01 8.030342e+00 55.95833 5.617415e+01 8.784405 3.541667e+01 78.49167 4.307500e+01 0.1339838 -0.5932113 0.1513812
NOAAC_PRECIPITATION_AVG_YEARLY 8 2814 3.611196e+00 1.628908e+00 3.62375 3.638924e+00 1.887844 2.241667e-01 9.74500 9.520833e+00 -0.0899446 -0.7490480 0.0307068
NOAAS_TOT_NATURAL_DISASTERS 9 2814 3.678074e+01 4.541085e+01 25.00000 2.851510e+01 19.273800 0.000000e+00 662.00000 6.620000e+02 5.3277337 46.7140570 0.8560470
SAIPE_MEDIAN_HH_INCOME 10 2814 5.738628e+04 1.442985e+04 55107.00000 5.584910e+04 11735.520300 2.599700e+04 155362.00000 1.293650e+05 1.4172637 3.6789671 272.0193680
SAIPE_PCT_POV 11 2814 1.371606e+01 5.320367e+00 12.80000 1.323637e+01 4.744320 3.000000e+00 39.60000 3.660000e+01 1.0397109 1.6688697 0.1002951
ACS_PCT_COMMT_60MINUP 12 2814 8.155924e+00 4.864799e+00 6.85500 7.487016e+00 3.810282 0.000000e+00 35.91000 3.591000e+01 1.5015546 3.0016425 0.0917071
ACS_PCT_RENTER_HU_COST_50PCT 13 2814 2.060974e+01 6.649144e+00 20.78000 2.057608e+01 6.093486 0.000000e+00 49.26000 4.926000e+01 0.1136840 0.5492796 0.1253440
LTC_AVG_OBS_REHOSP_RATE 14 2814 1.449645e-01 8.432610e-02 0.14000 1.422425e-01 0.059304 0.000000e+00 1.00000 1.000000e+00 1.6404370 12.5443704 0.0015896

We should note for many of these analyses that the Urban / Rural Continuum Code runs from 1.00 to 9.00. Anything 4.00 or higher would be classified as Rural, with an urban population of less than 250,000.
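
Since the stratification is framed around that 4.00 cutoff, a small sketch of how a binary rural/urban flag could be derived is shown below; the `df_ur` object and `UR_CLASS` column name are my own, not part of the data-set.

Code
# Hypothetical helper: flag counties with RUCC of 4 or higher as Rural
df_ur <- df_new %>%
  mutate(UR_CLASS = if_else(as.numeric(AHRF_USDA_RUCC_2013) >= 4,
                            "Rural", "Urban"))

table(df_ur$UR_CLASS)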

Population Density

Code
pop_den <- df_new %>%
  filter(CEN_POPDENSITY_COUNTY < 5000)

ggplot(pop_den, aes(CEN_POPDENSITY_COUNTY)) +
  geom_histogram(binwidth = 50)

We’ve surely got some outliers. The mean is about 289, but the median is only about 51, and the maximum is nearly 72,000. We’ve filtered the extreme values out for this visualization and set the binwidth close to the median. A right-skewed variable is expected, as the majority of counties in the United States are classified as rural and therefore have low population densities.

Let’s plot a facet-grid on these codes.

Code
ggplot(pop_den, aes(CEN_POPDENSITY_COUNTY)) +
  geom_histogram(binwidth = 50) +
  facet_wrap('AHRF_USDA_RUCC_2013')

As expected, a large share of the distribution falls at codes 4.0 or higher. Code 1.0 also shows high variation in population density, which may cause issues with the regression.

Heat Index Over 105F

Code
ggplot(df_new, aes(NEPHTN_HEATIND_105)) +
  geom_boxplot()

Due to the wide range of climates in the United States, it’s not surprising that there are many outliers. The mean number of days per year a county experienced a heat index over 105F is about 4, while the median is 0. One county even reached 59 days (a Texas county!).

Code
ggplot(df_new, aes(NEPHTN_HEATIND_105)) +
  geom_histogram()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The distribution is heavily right-skewed, similar to Population Density. Most counties experience fewer than 10 days with a heat index over 105F.

Code
ggplot(df_new, aes(NEPHTN_HEATIND_105)) +
  geom_histogram() +
  facet_wrap('AHRF_USDA_RUCC_2013')
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

This distribution stays fairly constant, regardless of UR classification. 2.00 and 6.00 may have interesting insights, as their right tails are more pronounced, but that would be better suited to a map for quick reference. That is outside the scope of this project.

Average Yearly Temperature

Code
ggplot(df_new, aes(NOAAC_AVG_TEMP_YEARLY)) +
  geom_histogram()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

There’s a good distribution. The average annual temperature falls between 50F and 60F for most counties. The range (about 43F) is also fairly large and reflects the multiple climates within the country’s borders.

Average Yearly Precipitation

Code
ggplot(df_new, aes(NOAAC_PRECIPITATION_AVG_YEARLY)) +
  geom_boxplot()

Average monthly precipitation is fairly evenly distributed, with a mean of about 3.61 inches of rain per month. This variable will most likely provide less variation in the analysis than others, such as population density and heat index. That can be both a good and a bad thing, as variation in precipitation was one of the variables I was most interested in exploring for this project. Oh well.

Total Natural Disasters

Code
ggplot(df_new, aes(NOAAS_TOT_NATURAL_DISASTERS)) +
  geom_boxplot()

Many high outliers over 100, some even reaching over 600. Let’s plot a histogram to get a better look at the data’s distribution.

Code
ggplot(df_new, aes(NOAAS_TOT_NATURAL_DISASTERS)) +
  geom_histogram()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

A right-skewed variable, with observations dropping off dramatically once we reach roughly 50 total recorded natural disasters.

Code
ggplot(df_new, aes(NOAAS_TOT_NATURAL_DISASTERS)) +
  geom_histogram() +
  facet_wrap('AHRF_USDA_RUCC_2013')
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Little to no difference across UR classifications for natural-disaster outliers.

Median Household Income

Code
ggplot(df_new, aes(SAIPE_MEDIAN_HH_INCOME)) +
  geom_histogram()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Very close to a normal distribution, though slightly right-skewed. There are a handful of high outliers reaching well above $90,000 in median household income, but the mean holds at about $57,386.

Percent in Poverty

Code
ggplot(df_new, aes(SAIPE_PCT_POV)) +
  geom_histogram()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Another close-to-normal distribution, though with a right tail. Most counties have poverty rates between 10% and 20%. There are of course outliers at both ends, ranging from about 3% up to nearly 40%, but those are rare.

Code
ggplot(df_new, aes(SAIPE_PCT_POV)) +
  geom_histogram() +
  facet_wrap('AHRF_USDA_RUCC_2013')
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Highly urbanized counties (1.00) have substantially lower poverty rates compared to their rural counterparts. Codes 7.00, 8.00, and 9.00 have the widest spread, with some counties reaching 40% poverty rates! The urban codes (1.00 - 3.00) barely reach 30% poverty.

Percent Commuting Alone, Over 60 Minutes

Code
ggplot(df_new, aes(ACS_PCT_COMMT_60MINUP)) +   
  geom_histogram()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The majority of counties have fewer than 10% of their population commuting 60 minutes or more to work. Let’s do another facet grid to see if there’s a relationship with the UR codes.

Code
ggplot(df_new, aes(ACS_PCT_COMMT_60MINUP)) +
  geom_histogram() +
  facet_wrap('AHRF_USDA_RUCC_2013')
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Not particularly. The only codes that look different from the rest are 1.00 (highly urban, over 1 million population) and 8.00 (completely rural, fewer than 2,500 urban population). The pattern at 7.00 and above is surprising, as these counties have very little population and are often not adjacent to metro areas; their populations are most likely condensed around small “urban” centers for economic purposes.

Percent Renter Housing Costs Over 50 Percent of Income

Code
ggplot(df_new, aes(ACS_PCT_RENTER_HU_COST_50PCT)) +   
  geom_histogram()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

This is a startling figure. In the average county, roughly 21% of renter households spend 50% or more of their income on housing costs. This leaves little to no room for other expenses and drives economic instability. The distribution is close to normal with only a slight skew – but still an item to consider in further analysis.

Re-hospitalization Rate

Code
ggplot(df_new, aes(LTC_AVG_OBS_REHOSP_RATE)) +   
  geom_histogram()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Another right-skewed variable. Many counties sit at or near a 0.00 re-hospitalization rate, and few, if any, rise above 0.50. From a health perspective, this is good news! From a research perspective, it makes the analysis a little trickier. Still, the bulk of the distribution is concentrated and should be fairly easy to work with. While the response may need some transformation for a linear regression, we can test multiple models for each explanatory variable to determine which specification provides the most robust inference.

Analysis

Hypothesis Testing

Remember that the hypothesis for this research report is:

  • Environmental factors increase rates of re-hospitalization in the United States.

We have nine (9) explanatory variables to work with, so we can run different regressions to determine what variables influence re-hospitalization rates the most – if at all – and how they interact with other variables.

Reminder that the nine (9) explanatory variables are broken down into three domains: environmental, economic, and built environment.

Environmental entails:

  • Days with Heat Index over 105F

  • Average Annual Temperature

  • Average Annual Precipitation

  • Total Natural Disasters Per Year

Economic is:

  • Median Household Income

  • Percent Poverty

And the Built Environment includes:

  • Population Density

  • Percent Rental Housing Cost, over 50%

  • Percent Commuting Alone, over 60 minutes

We will run four models to test the hypothesis. They shall examine each environmental variable’s impact on the dependent variable, re-hospitalization rates. The control variables will be the economic and built environment variables, five (5) in total.

Code
model1 <- lm(LTC_AVG_OBS_REHOSP_RATE ~ 
               NEPHTN_HEATIND_105 +
               CEN_POPDENSITY_COUNTY +
               SAIPE_MEDIAN_HH_INCOME +
               SAIPE_PCT_POV +
               ACS_PCT_COMMT_60MINUP +
               ACS_PCT_RENTER_HU_COST_50PCT,
             df_new)

model2 <- lm(LTC_AVG_OBS_REHOSP_RATE ~ 
               NOAAC_AVG_TEMP_YEARLY +
               CEN_POPDENSITY_COUNTY +
               SAIPE_MEDIAN_HH_INCOME +
               SAIPE_PCT_POV +
               ACS_PCT_COMMT_60MINUP +
               ACS_PCT_RENTER_HU_COST_50PCT,
             df_new)
  
model3 <- lm(LTC_AVG_OBS_REHOSP_RATE ~ 
               NOAAC_PRECIPITATION_AVG_YEARLY +
               CEN_POPDENSITY_COUNTY +
               SAIPE_MEDIAN_HH_INCOME +
               SAIPE_PCT_POV +
               ACS_PCT_COMMT_60MINUP +
               ACS_PCT_RENTER_HU_COST_50PCT,
             df_new)
  
model4 <- lm(LTC_AVG_OBS_REHOSP_RATE ~ 
               NOAAS_TOT_NATURAL_DISASTERS +
               CEN_POPDENSITY_COUNTY +
               SAIPE_MEDIAN_HH_INCOME +
               SAIPE_PCT_POV +
               ACS_PCT_COMMT_60MINUP +
               ACS_PCT_RENTER_HU_COST_50PCT,
             df_new)

Let’s plot these relationships, dropping the control variables for a cleaner visualization. The response variable is log-transformed in each plot, and the average-temperature predictor in Model 2 is log-transformed as well, based on the distributions reviewed in the Descriptive Statistics section.

Code
ggplot(df_new, aes(NEPHTN_HEATIND_105,
       log(LTC_AVG_OBS_REHOSP_RATE))) +
  geom_point() +
  geom_smooth(method = lm,
              se = FALSE,
              fullrange = TRUE) +
  labs(title = "Model 1",
       x = "Heat Index Over 105F",
       y = "Re-Hospitalization Rates")
`geom_smooth()` using formula 'y ~ x'
Warning: Removed 235 rows containing non-finite values (stat_smooth).

Code
ggplot(df_new, aes(log(NOAAC_AVG_TEMP_YEARLY),
       log(LTC_AVG_OBS_REHOSP_RATE))) +
  geom_point() +
  geom_smooth(method = lm,
              se = FALSE,
              fullrange = TRUE) +
  labs(title = "Model 2",
       x = "Average Annual Temperature",
       y = "Re-Hospitalization Rates")
`geom_smooth()` using formula 'y ~ x'
Warning: Removed 235 rows containing non-finite values (stat_smooth).

Code
ggplot(df_new, aes(NOAAC_PRECIPITATION_AVG_YEARLY,
                   log(LTC_AVG_OBS_REHOSP_RATE))) +
  geom_point() +
  geom_smooth(method = lm,
              se = FALSE,
              fullrange = TRUE) +
  labs(title = "Model 3",
       x = "Average Annual Precipitation",
       y = "Re-Hospitalization Rates")
`geom_smooth()` using formula 'y ~ x'
Warning: Removed 235 rows containing non-finite values (stat_smooth).

Code
ggplot(df_new, aes(NOAAS_TOT_NATURAL_DISASTERS,
                   log(LTC_AVG_OBS_REHOSP_RATE))) +
  geom_point() +
  geom_smooth(method = lm,
              se = FALSE,
              fullrange = TRUE) +
  labs(title = "Model 4",
       x = "Total Natural Disasters",
       y = "Re-Hospitalization Rates")
`geom_smooth()` using formula 'y ~ x'
Warning: Removed 235 rows containing non-finite values (stat_smooth).

Of these four, it looks like temperature and precipitation have the best fit. All but Total Natural Disasters show a positive relationship with the response variable, which helps us decide whether to accept or reject the null hypothesis.

We will reject the null hypothesis. While the relationships look weak at first glance, there is at least some positive association. In the next two sections, we will dig deeper into each model, examining the p-values and R-squared values, to see how strong that relationship actually is.
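
Note that the log transformation above is applied only in the plots; the fitted models use the raw rate. A minimal sketch of carrying the transformation into the regression itself is below, where the +0.01 offset for the zero rates is my own assumption, not part of the original analysis.

Code
# Sketch only: Model 3 refit with a log-transformed response.
# Zero rates make log() undefined, so a small offset is added;
# the size of the offset is an assumption, not an author choice.
model3_log <- lm(log(LTC_AVG_OBS_REHOSP_RATE + 0.01) ~
                   NOAAC_PRECIPITATION_AVG_YEARLY +
                   CEN_POPDENSITY_COUNTY +
                   SAIPE_MEDIAN_HH_INCOME +
                   SAIPE_PCT_POV +
                   ACS_PCT_COMMT_60MINUP +
                   ACS_PCT_RENTER_HU_COST_50PCT,
                 data = df_new)

summary(model3_log)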

Model Comparisons

Now we will compare the four (4) models in more depth.

Code
summary(model1)

Call:
lm(formula = LTC_AVG_OBS_REHOSP_RATE ~ NEPHTN_HEATIND_105 + CEN_POPDENSITY_COUNTY + 
    SAIPE_MEDIAN_HH_INCOME + SAIPE_PCT_POV + ACS_PCT_COMMT_60MINUP + 
    ACS_PCT_RENTER_HU_COST_50PCT, data = df_new)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.24905 -0.04273 -0.00201  0.03850  0.86257 

Coefficients:
                               Estimate Std. Error t value Pr(>|t|)    
(Intercept)                   5.582e-02  1.553e-02   3.595 0.000330 ***
NEPHTN_HEATIND_105            6.909e-04  1.959e-04   3.526 0.000428 ***
CEN_POPDENSITY_COUNTY         6.250e-07  8.575e-07   0.729 0.466180    
SAIPE_MEDIAN_HH_INCOME        4.223e-07  1.825e-07   2.313 0.020777 *  
SAIPE_PCT_POV                 3.331e-03  5.166e-04   6.449 1.32e-10 ***
ACS_PCT_COMMT_60MINUP        -4.504e-05  3.321e-04  -0.136 0.892129    
ACS_PCT_RENTER_HU_COST_50PCT  8.043e-04  2.517e-04   3.195 0.001413 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.0824 on 2807 degrees of freedom
Multiple R-squared:  0.04725,   Adjusted R-squared:  0.04521 
F-statistic:  23.2 on 6 and 2807 DF,  p-value: < 2.2e-16
Code
summary(model2)

Call:
lm(formula = LTC_AVG_OBS_REHOSP_RATE ~ NOAAC_AVG_TEMP_YEARLY + 
    CEN_POPDENSITY_COUNTY + SAIPE_MEDIAN_HH_INCOME + SAIPE_PCT_POV + 
    ACS_PCT_COMMT_60MINUP + ACS_PCT_RENTER_HU_COST_50PCT, data = df_new)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.22744 -0.04216 -0.00227  0.03810  0.87918 

Coefficients:
                               Estimate Std. Error t value Pr(>|t|)    
(Intercept)                  -3.293e-03  1.691e-02  -0.195 0.845614    
NOAAC_AVG_TEMP_YEARLY         1.721e-03  2.236e-04   7.697 1.91e-14 ***
CEN_POPDENSITY_COUNTY         6.974e-07  8.498e-07   0.821 0.411866    
SAIPE_MEDIAN_HH_INCOME        2.163e-07  1.835e-07   1.178 0.238760    
SAIPE_PCT_POV                 2.082e-03  5.440e-04   3.827 0.000132 ***
ACS_PCT_COMMT_60MINUP        -4.280e-04  3.338e-04  -1.282 0.199842    
ACS_PCT_RENTER_HU_COST_50PCT  6.661e-04  2.498e-04   2.667 0.007701 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.08172 on 2807 degrees of freedom
Multiple R-squared:  0.06281,   Adjusted R-squared:  0.06081 
F-statistic: 31.35 on 6 and 2807 DF,  p-value: < 2.2e-16
Code
summary(model3)

Call:
lm(formula = LTC_AVG_OBS_REHOSP_RATE ~ NOAAC_PRECIPITATION_AVG_YEARLY + 
    CEN_POPDENSITY_COUNTY + SAIPE_MEDIAN_HH_INCOME + SAIPE_PCT_POV + 
    ACS_PCT_COMMT_60MINUP + ACS_PCT_RENTER_HU_COST_50PCT, data = df_new)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.24203 -0.04423 -0.00280  0.03765  0.86494 

Coefficients:
                                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)                     3.851e-02  1.539e-02   2.502  0.01240 *  
NOAAC_PRECIPITATION_AVG_YEARLY  8.861e-03  1.046e-03   8.470  < 2e-16 ***
CEN_POPDENSITY_COUNTY           4.913e-07  8.473e-07   0.580  0.56206    
SAIPE_MEDIAN_HH_INCOME          5.067e-07  1.801e-07   2.813  0.00495 ** 
SAIPE_PCT_POV                   2.954e-03  5.089e-04   5.805 7.15e-09 ***
ACS_PCT_COMMT_60MINUP          -5.921e-04  3.358e-04  -1.763  0.07794 .  
ACS_PCT_RENTER_HU_COST_50PCT    4.636e-04  2.514e-04   1.844  0.06534 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.08154 on 2807 degrees of freedom
Multiple R-squared:  0.06688,   Adjusted R-squared:  0.06489 
F-statistic: 33.53 on 6 and 2807 DF,  p-value: < 2.2e-16
Code
summary(model4)

Call:
lm(formula = LTC_AVG_OBS_REHOSP_RATE ~ NOAAS_TOT_NATURAL_DISASTERS + 
    CEN_POPDENSITY_COUNTY + SAIPE_MEDIAN_HH_INCOME + SAIPE_PCT_POV + 
    ACS_PCT_COMMT_60MINUP + ACS_PCT_RENTER_HU_COST_50PCT, data = df_new)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.24509 -0.04423 -0.00198  0.03901  0.85940 

Coefficients:
                               Estimate Std. Error t value Pr(>|t|)    
(Intercept)                   4.845e-02  1.564e-02   3.099  0.00196 ** 
NOAAS_TOT_NATURAL_DISASTERS  -5.117e-05  3.596e-05  -1.423  0.15490    
CEN_POPDENSITY_COUNTY         4.512e-07  8.578e-07   0.526  0.59891    
SAIPE_MEDIAN_HH_INCOME        5.359e-07  1.876e-07   2.856  0.00432 ** 
SAIPE_PCT_POV                 3.750e-03  5.097e-04   7.358 2.44e-13 ***
ACS_PCT_COMMT_60MINUP        -2.872e-05  3.331e-04  -0.086  0.93131    
ACS_PCT_RENTER_HU_COST_50PCT  7.913e-04  2.526e-04   3.132  0.00175 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.08255 on 2807 degrees of freedom
Multiple R-squared:  0.04372,   Adjusted R-squared:  0.04168 
F-statistic: 21.39 on 6 and 2807 DF,  p-value: < 2.2e-16

Three of the four environmental explanatory variables meet the 0.001 significance threshold; Total Natural Disasters Per Year does not. This helps in finalizing whether to accept or reject the null hypothesis. The smallest p-value among the environmental predictors belongs to precipitation (Model 3).

For model fit, the adjusted R-squared ranges from about 0.04 to 0.065. The highest is for precipitation, Model 3.

Model 3 also has the lowest residual standard error, at roughly 0.0815.
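
As a compact way to see these comparisons side by side, the fit statistics can be collected into one table; this is only a sketch that reuses the four model objects already fitted above.

Code
# Collect adjusted R-squared, residual standard error, and AIC
# for the four candidate models in a single data frame
models <- list(model1 = model1, model2 = model2,
               model3 = model3, model4 = model4)

data.frame(
  adj_r2 = sapply(models, function(m) summary(m)$adj.r.squared),
  sigma  = sapply(models, function(m) summary(m)$sigma),
  AIC    = sapply(models, AIC)
)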

Finally, we’re going to calculate the PRESS statistic (Predicted Residual Sum of Squares) to determine which model best predicts the response variable from the explanatory variables.

Code
# PRESS: sum of squared leave-one-out prediction residuals,
# computed from the ordinary residuals and the hat (leverage) values
PRESS <- function(model) {
  i <- residuals(model) / (1 - lm.influence(model)$hat)
  sum(i^2)
}

PRESS(model1)
[1] 19.17066
Code
PRESS(model2)
[1] 18.86294
Code
PRESS(model3)
[1] 18.77811
Code
PRESS(model4)
[1] 19.23593

Model 3 has the lowest PRESS score.

Due to its stronger p-value, PRESS score, and model fit compared to the other three models, Model 3 is chosen as the final model for the diagnostic exploration. While the adjusted R-squared value is not strong when controlling for economic and built-environment factors, there is still a statistically significant positive relationship, and therefore some association with the dependent variable.

Diagnostics

Finally, we’ll plot the diagnostics to best understand the model.

Code
par(mfrow = c(2,3));
plot(model3, 
     which = 1:6)

Of the six plots, Cook’s distance is the most striking and relevant, as it flags three influential outliers: rows 1679, 2189, and 2455.

1679 is New York County, New York.

2189 is Mellette County, South Dakota.

2455 is Real County, Texas.

Otherwise, the plot looks good.

The Normal Q-Q plot suggests a violation of the normality assumption, as points in the right tail do not generally fall along the line; this is most apparent for our three outliers. The remaining plots do not violate their tested assumptions and further support the model’s reliability.
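
One optional follow-up, sketched here rather than done in the original analysis, is to refit Model 3 without the three high-influence rows and check whether the precipitation coefficient is stable.

Code
# Sensitivity check: drop the rows flagged by Cook's distance and refit
df_trim <- df_new[-c(1679, 2189, 2455), ]

model3_trim <- update(model3, data = df_trim)
summary(model3_trim)$coefficients["NOAAC_PRECIPITATION_AVG_YEARLY", ]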

Conclusion

This paper explored the relationship between re-hospitalization rates and four environmental variables, when controlling for common variables that regularly influence the dependent variable. These four variables included days with a heat index over 105F, average annual temperature, average annual precipitation, and total natural disasters.

Four models were fit, one for each variable, to determine which measure best explained re-hospitalization rates. Three of the four variables were statistically significant, and two had a larger adjusted R-squared value than the others. Precipitation was selected as the variable with the best model fit for explaining re-hospitalization rates. While the adjusted R-squared value is small (0.06), there is a positive relationship that is statistically significant, suggesting that higher annual precipitation has some association with higher re-hospitalization rates.

I would have liked to tighten the analysis further, rather than including ten variables, by focusing on a few key measurements: precipitation as the explanatory variable, poverty as the control, and filtering by rurality to determine the relationship with re-hospitalization rates. I could then fit different models (simple linear, Poisson, polynomial, etc.) to see which worked best. I may not have time to do so for the poster presentation, nor would that necessarily be within the scope of this project. Either way, I have something for future classes, perhaps via time series analysis or machine learning (prediction over inference), and this project helped me better understand the robustness of linear regression models.
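
For reference, a rough sketch of what that future model comparison might look like; the quadratic term and the quasi-binomial family (used here in place of Poisson because the rate is bounded between 0 and 1) are my own choices, not part of this paper's analysis.

Code
# Hypothetical follow-up models: simple linear, quadratic, and a
# quasi-binomial GLM for the bounded re-hospitalization rate
fit_lin  <- lm(LTC_AVG_OBS_REHOSP_RATE ~ NOAAC_PRECIPITATION_AVG_YEARLY +
                 SAIPE_PCT_POV, data = df_new)
fit_poly <- lm(LTC_AVG_OBS_REHOSP_RATE ~ poly(NOAAC_PRECIPITATION_AVG_YEARLY, 2) +
                 SAIPE_PCT_POV, data = df_new)
fit_glm  <- glm(LTC_AVG_OBS_REHOSP_RATE ~ NOAAC_PRECIPITATION_AVG_YEARLY +
                  SAIPE_PCT_POV,
                family = quasibinomial(), data = df_new)

AIC(fit_lin, fit_poly)  # AIC is undefined for quasi-family fits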

References

Barnett, M., Hsu, J. & McWilliams, M. (2015). “Patient Characteristics and Differences in Hospital Readmission Rates.” JAMA Intern Med., 175(11): 1803-1812.

Bhosale, K., Nath, R., Pandit, N., Agarwal, P., Khairnar, S., Yadav, B. & Chandrakar, S. (2020). “Rate of Rehospitalization in 60 Days of Discharge and Its Determinants in Patients with Heart Failure with Reduced Ejection Fraction in a Tertiary Care Centre in India.” Int J Heart Fail, 2(2): 131-144.

Li, Y., Cai, X. & Glance, L. (2015). “Disparities in 30-day rehospitalization rates among Medicare skilled nursing facility residents by race and site of care.” Med Care, 53(12): 1058-1065.

Murray, F., Allen, M., Clark, C., Daly, C. & Jacobs, D. (2021). “Socio-demographic and -economic factors associated with 30-day readmission for conditions targeted by the hospital readmissions reduction program: a population-based study.” BMC Public Health, 21.

Spatz, E., Bernheim, S., Horwitz, L. & Herrin, J. (2020). “Community factors and hospital wide readmission rates: Does context matter?” PLoS One, 15(10).