Diagnostics
Author

Emma Narkewicz

Published

May 9, 2023

Code
library(tidyverse)
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.3.6      ✔ purrr   0.3.4 
✔ tibble  3.1.8      ✔ dplyr   1.0.10
✔ tidyr   1.2.0      ✔ stringr 1.4.1 
✔ readr   2.1.2      ✔ forcats 0.5.2 
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
Code
library(alr4)
Loading required package: car
Loading required package: carData

Attaching package: 'car'

The following object is masked from 'package:dplyr':

    recode

The following object is masked from 'package:purrr':

    some

Loading required package: effects
lattice theme set by effectsTheme()
See ?effectsTheme for details.
Code
library(smss)
library(sjPlot)
Learn more about sjPlot with 'browseVignettes("sjPlot")'.
Code
library(sjmisc)
Install package "strengejacke" from GitHub (`devtools::install_github("strengejacke/strengejacke")`) to load all sj-packages at once!

Attaching package: 'sjmisc'

The following object is masked from 'package:purrr':

    is_empty

The following object is masked from 'package:tidyr':

    replace_na

The following object is masked from 'package:tibble':

    add_case
Code
library(stargazer)

Please cite as: 

 Hlavac, Marek (2022). stargazer: Well-Formatted Regression and Summary Statistics Tables.
 R package version 5.2.3. https://CRAN.R-project.org/package=stargazer 
Code
library(broom)
library(qpcR)
Loading required package: MASS

Attaching package: 'MASS'

The following object is masked from 'package:dplyr':

    select

Loading required package: minpack.lm
Loading required package: rgl
Loading required package: robustbase

Attaching package: 'robustbase'

The following object is masked from 'package:alr4':

    cloud

Loading required package: Matrix

Attaching package: 'Matrix'

The following objects are masked from 'package:tidyr':

    expand, pack, unpack

Question 1

I started out by reading in the house.selling.price.2 data set & recreating the correlation matrix & regression output for home price.

Code
#Reading in Data
data("house.selling.price.2")
head(house.selling.price.2)
      P    S Be Ba New
1  48.5 1.10  3  1   0
2  55.0 1.01  3  2   0
3  68.0 1.45  3  2   0
4 137.0 2.40  3  3   0
5 309.4 3.30  4  3   1
6  17.5 0.40  1  1   0
Code
#Correlation matrix
cor(house.selling.price.2)
            P         S        Be        Ba       New
P   1.0000000 0.8988136 0.5902675 0.7136960 0.3565540
S   0.8988136 1.0000000 0.6691137 0.6624828 0.1762879
Be  0.5902675 0.6691137 1.0000000 0.3337966 0.2672091
Ba  0.7136960 0.6624828 0.3337966 1.0000000 0.1820651
New 0.3565540 0.1762879 0.2672091 0.1820651 1.0000000
Code
#Regression Output all variables explanatory
summary(lm(P ~ S + Be + Ba + New, data = house.selling.price.2))

Call:
lm(formula = P ~ S + Be + Ba + New, data = house.selling.price.2)

Residuals:
    Min      1Q  Median      3Q     Max 
-36.212  -9.546   1.277   9.406  71.953 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -41.795     12.104  -3.453 0.000855 ***
S             64.761      5.630  11.504  < 2e-16 ***
Be            -2.766      3.960  -0.698 0.486763    
Ba            19.203      5.650   3.399 0.001019 ** 
New           18.984      3.873   4.902  4.3e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 16.36 on 88 degrees of freedom
Multiple R-squared:  0.8689,    Adjusted R-squared:  0.8629 
F-statistic: 145.8 on 4 and 88 DF,  p-value: < 2.2e-16

A

For backward elimination, which variable would be deleted first? Why?

Beds would be the first variable to be deleted, as it has the largest p-value (0.4868) of the 4 explanatory variables and is the only explanatory variable that is not statistically significant at any conventional level (0.1, 0.05, 0.01, 0.001).
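
As a quick cross-check (my own sketch, not part of the original assignment output), base R's drop1() refits the full model leaving out one term at a time; the Be row should show the largest p-value, making it the first candidate for removal.

Code
#Sketch: F-tests for dropping each single term from the full model;
#the weakest term (Be) is the first candidate for backward elimination
full_fit <- lm(P ~ S + Be + Ba + New, data = house.selling.price.2)
drop1(full_fit, test = "F")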

B

For forward selection, which variable would be added first? Why?

It was difficult to determine from the regression output table on the HW5 assignment whether a p-value shown as 0 differed from one shown as 0.00000. Pulling up the regression summary in R gives more precise p-values for Size and New, which shows that Size is more statistically significant than New: < 2e-16 is smaller than 4.3e-06. Size is also the explanatory variable most correlated with Price (r = 0.899) of any variable in the model. Therefore, in forward selection, Size would be added first.
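
To illustrate forward selection's first step (again a sketch I added, not part of the original output), one can start from an intercept-only model and let add1() test each candidate variable; S should come out strongest.

Code
#Sketch: forward selection's first step from the intercept-only model;
#add1() tests adding each candidate, and S should give the smallest p-value
null_fit <- lm(P ~ 1, data = house.selling.price.2)
add1(null_fit, scope = ~ S + Be + Ba + New, test = "F")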

C

Why do you think that BEDS has such a large P-value in the multiple regression model, even though it has a substantial correlation with PRICE?

I think that BEDS has such a large p-value in the multiple regression model (making it not statistically significant), even though it has a substantial correlation with PRICE, because Size acts as a confounder. Size is the most statistically significant explanatory variable in the model, and larger houses tend to have more bedrooms, so once Size is included in the model, Beds adds little independent information about Price.
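
One way to check this numerically (a sketch of my own, assuming the usual collinearity diagnostics) is with variance inflation factors from the car package, which is already loaded via alr4; overlapping information between S and Be should show up as inflated VIFs for those terms.

Code
#Sketch: variance inflation factors for the full model;
#correlated predictors such as S and Be should show inflated VIFs
vif(lm(P ~ S + Be + Ba + New, data = house.selling.price.2))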

D

I created Models 1 & 2 using the backward elimination method and Model 3 using the forward selection method.

Model 1: all 4 explanatory variables (Size, Beds, Baths, & New)

Model 2: 3 explanatory variables (Size, Baths, & New)

Model 3: 2 explanatory variables (Size & New)

Code
#Model 1
model_1 <- lm(P ~ S + Be + Ba + New, data = house.selling.price.2)
summary(model_1)

Call:
lm(formula = P ~ S + Be + Ba + New, data = house.selling.price.2)

Residuals:
    Min      1Q  Median      3Q     Max 
-36.212  -9.546   1.277   9.406  71.953 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -41.795     12.104  -3.453 0.000855 ***
S             64.761      5.630  11.504  < 2e-16 ***
Be            -2.766      3.960  -0.698 0.486763    
Ba            19.203      5.650   3.399 0.001019 ** 
New           18.984      3.873   4.902  4.3e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 16.36 on 88 degrees of freedom
Multiple R-squared:  0.8689,    Adjusted R-squared:  0.8629 
F-statistic: 145.8 on 4 and 88 DF,  p-value: < 2.2e-16
Code
#Model 2
model_2 <- lm(P ~ S +  Ba + New, data = house.selling.price.2)
summary(model_2)

Call:
lm(formula = P ~ S + Ba + New, data = house.selling.price.2)

Residuals:
    Min      1Q  Median      3Q     Max 
-34.804  -9.496   0.917   7.931  73.338 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -47.992      8.209  -5.847 8.15e-08 ***
S             62.263      4.335  14.363  < 2e-16 ***
Ba            20.072      5.495   3.653 0.000438 ***
New           18.371      3.761   4.885 4.54e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 16.31 on 89 degrees of freedom
Multiple R-squared:  0.8681,    Adjusted R-squared:  0.8637 
F-statistic: 195.3 on 3 and 89 DF,  p-value: < 2.2e-16
Code
#Model 3
model_3 <- lm(P ~ S + New , data = house.selling.price.2) 
summary(model_3)

Call:
lm(formula = P ~ S + New, data = house.selling.price.2)

Residuals:
    Min      1Q  Median      3Q     Max 
-47.207  -9.763  -0.091   9.984  76.405 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -26.089      5.977  -4.365 3.39e-05 ***
S             72.575      3.508  20.690  < 2e-16 ***
New           19.587      3.995   4.903 4.16e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 17.4 on 90 degrees of freedom
Multiple R-squared:  0.8484,    Adjusted R-squared:  0.845 
F-statistic: 251.8 on 2 and 90 DF,  p-value: < 2.2e-16
    1. R-squared
    • Model 1 R-squared = 0.8689
    • Model 2 R-squared = 0.8681
    • Model 3 R-squared = 0.8484

Based on the R-squared, Model 1 would be selected due to having the largest R-squared of the 3 models. R-squared measures how well variation in the dependent variable is explained by variation in the independent variables; because it can never decrease when explanatory variables are added, it always favors the largest model.

    2. Adjusted R-squared
    • Model 1 Adjusted R-squared = 0.8629
    • Model 2 Adjusted R-squared = 0.8637
    • Model 3 Adjusted R-squared = 0.845

Based on the adjusted R-squared, Model 2 would be selected due to having the largest adjusted R-squared of the 3 models. Adjusted R-squared measures how well variation in the dependent variable is explained by variation in the independent variables, but penalizes the addition of explanatory variables.
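
As a quick check on what the penalty does (my own sketch, not part of the original output), the adjusted R-squared for Model 2 can be reproduced by hand from its R-squared, with n = 93 observations and p = 3 predictors.

Code
#Sketch: adjusted R-squared by hand for Model 2
#adj R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)
n <- nrow(house.selling.price.2)   # 93 observations (89 residual df + 4 coefficients)
1 - (1 - 0.8681) * (n - 1) / (n - 3 - 1)   # should come out to about 0.8637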

    3. PRESS
    • Model 1 PRESS = 28390.22
    • Model 2 PRESS = 27860.05
    • Model 3 PRESS = 31066

Based on the PRESS statistics, Model 2 would be selected due to having the smallest PRESS of the 3 models. PRESS (the predicted residual sum of squares) is built from leave-one-out prediction errors, so it penalizes models that add explanatory variables without improving prediction. The smaller the PRESS statistic, the better the predictive power of the model.

Code
#Calculating PRESS Statistics

pr <- resid(model_1)/(1 - lm.influence(model_1)$hat)
sum(pr^2)
[1] 28390.22
Code
pr <- resid(model_2)/(1 - lm.influence(model_2)$hat)
sum(pr^2)
[1] 27860.05
Code
pr <- resid(model_3)/(1 - lm.influence(model_3)$hat)
sum(pr^2)
[1] 31066
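The same leave-one-out calculation can be wrapped in a small helper so it can be reused for any lm fit (a sketch of my own; the function name press is just my label).

Code
#Sketch: reusable helper for the PRESS statistic of an lm fit
press <- function(fit) {
  pr <- resid(fit) / (1 - lm.influence(fit)$hat)
  sum(pr^2)
}
press(model_2)   # should match the value above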
    4. AIC
    • Model 1 AIC = 790.6225
    • Model 2 AIC = 789.1366
    • Model 3 AIC = 800.1262

Based on the AIC, Model 2 would be selected due to having the smallest AIC of the 3 models. The AIC also penalizes the addition of explanatory variables; the lower the value, the better the model.

Code
##Calculating AICs
glance(model_1)$AIC
[1] 790.6225
Code
glance(model_2)$AIC
[1] 789.1366
Code
glance(model_3)$AIC
[1] 800.1262
    5. BIC
    • Model 1 BIC = 805.8181
    • Model 2 BIC = 801.7996
    • Model 3 BIC = 810.2566

Based on the BIC, Model 2 would be selected due to having the smallest BIC of the 3 models. The BIC also penalizes the addition of explanatory variables (more heavily than AIC does); the lower the value, the better the model.

Code
glance(model_1)$BIC
[1] 805.8181
Code
glance(model_2)$BIC
[1] 801.7996
Code
glance(model_3)$BIC
[1] 810.2566
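
Base R's AIC() and BIC() also accept several models at once and return a small comparison table; this sketch (not part of the original output) should reproduce the glance() values above.

Code
#Sketch: comparing all three candidate models in one call each
AIC(model_1, model_2, model_3)
BIC(model_1, model_2, model_3)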

E

Based on the findings in the prior question, I prefer Model 2 of the 3 models because it has the lowest AIC, BIC, & PRESS statistics & the highest adjusted R-squared. Together these suggest that Model 2's explanatory variables (Size, Baths, & New) explain the most variance in the outcome variable home price (highest adjusted R-squared) without gratuitous addition of explanatory variables (lowest PRESS, AIC, & BIC). I also like that all 3 explanatory variables, Size (p-value < 2e-16), Baths (p-value = 0.000438), & New (p-value = 4.54e-06), are statistically significant in the model, which suggests they each genuinely help explain the outcome Price.
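
Since stargazer is already loaded, the three candidate models could also be summarized side by side in one text table (a sketch I added; output not shown here).

Code
#Sketch: side-by-side comparison of the three candidate models
stargazer(model_1, model_2, model_3, type = "text",
          column.labels = c("Model 1", "Model 2", "Model 3"))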

Question 2

First I read in the trees data set below, containing the girth (diameter), height, & volume of 31 black cherry trees that were cut down for lumber.

Code
data("trees")
head(trees)
  Girth Height Volume
1   8.3     70   10.3
2   8.6     65   10.3
3   8.8     63   10.2
4  10.5     72   16.4
5  10.7     81   18.8
6  10.8     83   19.7

A

Fit a multiple regression model with Volume as the outcome and Girth and Height as the explanatory variables.

Code
#lm volume as outcome, girth & height as explanatory
tree <- lm(Volume ~ Height + Girth, data = trees)
summary(tree)

Call:
lm(formula = Volume ~ Height + Girth, data = trees)

Residuals:
    Min      1Q  Median      3Q     Max 
-6.4065 -2.6493 -0.2876  2.2003  8.4847 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -57.9877     8.6382  -6.713 2.75e-07 ***
Height        0.3393     0.1302   2.607   0.0145 *  
Girth         4.7082     0.2643  17.816  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.882 on 28 degrees of freedom
Multiple R-squared:  0.948, Adjusted R-squared:  0.9442 
F-statistic:   255 on 2 and 28 DF,  p-value: < 2.2e-16

This linear regression model has a very high adjusted R-squared of 0.9442, suggesting that most of the variation in tree volume is explained by variation in tree height & girth. This isn’t surprising, as volume is a function of a tree’s dimensions, girth & height. Height is statistically significant at the 0.05 level with a p-value of 0.0145. Girth is statistically significant at the 0.001 level with a p-value of < 2e-16.

B

Run regression diagnostic plots on the model. Based on the plots, do you think any of the regression assumptions is violated?

Code
##Diagnostic Plots
par(mfrow = c(2,3))
plot(tree, which = 1:6)

Based on the diagnostic plots, I think the regression assumptions of linearity and constant variance are violated by the model for tree volume, and that there are influential observations. The curvature in the first plot, residuals vs. fitted values, shows a lack of linearity. The rise and fall in the scale-location plot suggests the linear regression model violates the assumption of constant variance. The final three diagnostic plots point to the presence of influential observations.

In Plot 4, the 31st observation has a Cook’s distance of 0.6, which is less than 1 but greater than 4/n = 4/31 ≈ 0.13. In the residuals vs. leverage plot, this same observation falls outside of the red curved line, further supporting the presence of an influential outlier. In the Cook’s distance vs. leverage plot, the 31st observation has a high Cook’s distance (0.6) and high leverage (0.2), again suggesting it is an influential observation.
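
To back up the reading of the plots (a sketch I added, not part of the original output), the Cook’s distances can be pulled directly and compared against the 4/n rule of thumb.

Code
#Sketch: flag observations whose Cook's distance exceeds 4/n
cd <- cooks.distance(tree)
cd[cd > 4 / nrow(trees)]   # observation 31 should be flagged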

Question 3

Code
#Read in the data
data("florida")
head(florida)
           Gore   Bush Buchanan
ALACHUA   47300  34062      262
BAKER      2392   5610       73
BAY       18850  38637      248
BRADFORD   3072   5413       65
BREVARD   97318 115185      570
BROWARD  386518 177279      789

A

Run a simple linear regression model where the Buchanan vote is the outcome and the Bush vote is the explanatory variable. Produce the regression diagnostic plots. Is Palm Beach County an outlier based on the diagnostic plots? Why or why not?

Code
flo <- lm(Buchanan ~ Bush, data = florida)
summary(flo)

Call:
lm(formula = Buchanan ~ Bush, data = florida)

Residuals:
    Min      1Q  Median      3Q     Max 
-907.50  -46.10  -29.19   12.26 2610.19 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 4.529e+01  5.448e+01   0.831    0.409    
Bush        4.917e-03  7.644e-04   6.432 1.73e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 353.9 on 65 degrees of freedom
Multiple R-squared:  0.3889,    Adjusted R-squared:  0.3795 
F-statistic: 41.37 on 1 and 65 DF,  p-value: 1.727e-08

The adjusted R-squared value of the model is 0.3795, which suggests that a moderate amount of the variance in the Buchanan vote is explained by variance in the Bush vote. The Bush vote is a statistically significant explanatory variable in the model, with a p-value of 1.73e-08.
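
Before turning to the plots, the counties with the largest standardized residuals can also be listed directly (my own sketch, not part of the original output); the county driving the extreme maximum residual of 2610 reported above should appear first.

Code
#Sketch: counties with the largest standardized residuals
sort(rstandard(flo), decreasing = TRUE)[1:3]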

Code
##Diagnostic Plots
par(mfrow = c(2,3))
plot(flo, which = 1:6)

Palm Beach County does appear to be an outlier based on the regression diagnostic plots. It has a Cook’s distance of 2.0, suggesting it has a high influence on the model. Palm Beach also falls outside the dotted line on the residuals vs. leverage plot, further supporting that it is an outlier. The final Cook’s distance vs. leverage plot shows that while Palm Beach County has a high Cook’s distance, it has low leverage (about 0.05), suggesting that its influence comes from its extreme residual rather than from sitting at an unusual value of the Bush vote.
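
This can be confirmed numerically (a sketch of my own): the county with the largest Cook’s distance, and how large that distance is, can be pulled out directly.

Code
#Sketch: identify the county with the largest Cook's distance and its value
which.max(cooks.distance(flo))
max(cooks.distance(flo))   # about 2, per the plot above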

B

Take the log of both variables (Bush vote and Buchanan vote) and repeat the analysis in (A). Do your findings change?

Code
#logs
flo_log <- lm(log(Buchanan) ~ log(Bush), data = florida)
summary(flo_log)

Call:
lm(formula = log(Buchanan) ~ log(Bush), data = florida)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.96075 -0.25949  0.01282  0.23826  1.66564 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -2.57712    0.38919  -6.622 8.04e-09 ***
log(Bush)    0.75772    0.03936  19.251  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4673 on 65 degrees of freedom
Multiple R-squared:  0.8508,    Adjusted R-squared:  0.8485 
F-statistic: 370.6 on 1 and 65 DF,  p-value: < 2.2e-16

The first model had an adjusted R-squared of 0.3795, while the model using the log of both variables has a much higher adjusted R-squared of 0.8485. log(Bush) is still highly statistically significant as an explanatory variable, with a p-value of < 2e-16.

Code
##Diagnostic Plots
par(mfrow = c(2,3))
plot(flo_log, which = 1:6)

Looking at the diagnostic plots for the model taking the log of both the Buchanan and Bush variables, Palm Beach’s Cook’s distance decreased from 2 to 0.35, so it is no longer an influential observation. Palm Beach no longer shows up as an outlier on the residuals vs. leverage plot, and it continues to have low leverage on the Cook’s distance vs. leverage plot. Repeating my analysis after taking the log of both the outcome (Buchanan) & explanatory (Bush) variables does change my findings: Palm Beach is no longer an outlier in this model.
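
As a final numerical check on the log-log model (a sketch I added), the maximum Cook’s distance can be computed directly and compared with the roughly 0.35 read off the plot.

Code
#Sketch: maximum Cook's distance in the log-log model
max(cooks.distance(flo_log))   # roughly 0.35, per the plot above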