hw5
model diagnostics
Homework 5 for DACSS 603 Spring 2023

Author: Justine Shakespeare
Published: May 1, 2023

Code
library(tidyverse)
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.4.0      ✔ purrr   1.0.1 
✔ tibble  3.1.8      ✔ dplyr   1.0.10
✔ tidyr   1.3.0      ✔ stringr 1.5.0 
✔ readr   2.1.3      ✔ forcats 0.5.2 
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
Code
library(alr4)
Loading required package: car
Loading required package: carData

Attaching package: 'car'

The following object is masked from 'package:dplyr':

    recode

The following object is masked from 'package:purrr':

    some

Loading required package: effects
lattice theme set by effectsTheme()
See ?effectsTheme for details.
Code
library(smss)
library(stargazer)

Please cite as: 

 Hlavac, Marek (2022). stargazer: Well-Formatted Regression and Summary Statistics Tables.
 R package version 5.2.3. https://CRAN.R-project.org/package=stargazer 

Question 1

Code
# load the data
data(house.selling.price.2)
glimpse(house.selling.price.2)
Rows: 93
Columns: 5
$ P   <dbl> 48.5, 55.0, 68.0, 137.0, 309.4, 17.5, 19.6, 24.5, 34.8, 32.0, 28.0…
$ S   <dbl> 1.10, 1.01, 1.45, 2.40, 3.30, 0.40, 1.28, 0.74, 0.78, 0.97, 0.84, …
$ Be  <int> 3, 3, 3, 3, 4, 1, 3, 3, 2, 3, 3, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, …
$ Ba  <int> 1, 2, 2, 3, 3, 1, 1, 1, 1, 1, 1, 2, 1, 2, 2, 1, 2, 2, 2, 2, 2, 1, …
$ New <int> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …

First we’ll recreate some of the calculations from the homework assignment.

Code
# rename the variables according to the homework 
house.selling.price.2 <- rename(house.selling.price.2, 
                                Price = P, 
                                Size = S, 
                                Beds = Be, 
                                Baths = Ba)

# take a look at the correlation matrix
cor(house.selling.price.2)
          Price      Size      Beds     Baths       New
Price 1.0000000 0.8988136 0.5902675 0.7136960 0.3565540
Size  0.8988136 1.0000000 0.6691137 0.6624828 0.1762879
Beds  0.5902675 0.6691137 1.0000000 0.3337966 0.2672091
Baths 0.7136960 0.6624828 0.3337966 1.0000000 0.1820651
New   0.3565540 0.1762879 0.2672091 0.1820651 1.0000000
Code
# run linear regression
summary(lm(formula = Price ~ Size + Beds + Baths + New, data = house.selling.price.2))

Call:
lm(formula = Price ~ Size + Beds + Baths + New, data = house.selling.price.2)

Residuals:
    Min      1Q  Median      3Q     Max 
-36.212  -9.546   1.277   9.406  71.953 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -41.795     12.104  -3.453 0.000855 ***
Size          64.761      5.630  11.504  < 2e-16 ***
Beds          -2.766      3.960  -0.698 0.486763    
Baths         19.203      5.650   3.399 0.001019 ** 
New           18.984      3.873   4.902  4.3e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 16.36 on 88 degrees of freedom
Multiple R-squared:  0.8689,    Adjusted R-squared:  0.8629 
F-statistic: 145.8 on 4 and 88 DF,  p-value: < 2.2e-16

A

For backward elimination, which variable would be deleted first? Why?

I would remove the Beds variable first because it is the only predictor that is not statistically significant, and it has the largest p-value (0.487).

B

For forward selection, which variable would be added first? Why?

I would add the Size variable first because, among all of the predictors, Size has the strongest correlation with Price (0.899). Consistent with this, Size also has the smallest p-value in the full regression output.
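As a quick check (a sketch, not part of the original output), we can rank the predictors by the absolute value of their correlation with Price:

Code
# rank predictors by absolute correlation with Price,
# dropping Price's correlation with itself
sort(abs(cor(house.selling.price.2)[-1, "Price"]), decreasing = TRUE)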

C

Why do you think that BEDS has such a large P-value in the multiple regression model, even though it has a substantial correlation with PRICE?

Because Beds is correlated with the other predictors (its correlation with Size, for example, is 0.67), once those variables are included in the model and thereby controlled for, the unique effect attributable to Beds alone is quite small. Variables such as Size account for most of the variation in Price that Beds would otherwise appear to explain, so the partial explanatory power of Beds is weak even though its raw correlation with Price is substantial.
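To illustrate the point (a sketch; neither check appears in the original analysis), Beds is a significant predictor of Price on its own, and variance inflation factors from the car package (already loaded via alr4) quantify how much it overlaps with the other predictors:

Code
# Beds alone is a significant predictor of Price...
summary(lm(Price ~ Beds, data = house.selling.price.2))

# ...but its VIF reflects its overlap with the other predictors,
# especially Size
car::vif(lm(Price ~ Size + Beds + Baths + New, data = house.selling.price.2))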

D

Using software with these four predictors, find the model that would be selected using each criterion:

To come up with a set of models to compare, I used backward elimination, removing the variable with the highest p-value at each step. That gave the following four models to compare.

Code
HPModel1 <- lm(formula = Price ~ Size + Beds + Baths + New, data = house.selling.price.2)

HPModel2 <- lm(formula = Price ~ Size + Baths + New, data = house.selling.price.2)

HPModel3 <- lm(formula = Price ~ Size + New, data = house.selling.price.2)

HPModel4 <- lm(formula = Price ~ Size, data = house.selling.price.2)

stargazer(HPModel1, HPModel2, HPModel3, HPModel4, type = "text")

===================================================================================================================
                                                          Dependent variable:                                      
                    -----------------------------------------------------------------------------------------------
                                                                 Price                                             
                              (1)                     (2)                     (3)                     (4)          
-------------------------------------------------------------------------------------------------------------------
Size                       64.761***               62.263***               72.575***               75.607***       
                            (5.630)                 (4.335)                 (3.508)                 (3.865)        
                                                                                                                   
Beds                        -2.766                                                                                 
                            (3.960)                                                                                
                                                                                                                   
Baths                      19.203***               20.072***                                                       
                            (5.650)                 (5.495)                                                        
                                                                                                                   
New                        18.984***               18.371***               19.587***                               
                            (3.873)                 (3.761)                 (3.995)                                
                                                                                                                   
Constant                  -41.795***              -47.992***              -26.089***              -25.194***       
                           (12.104)                 (8.209)                 (5.977)                 (6.688)        
                                                                                                                   
-------------------------------------------------------------------------------------------------------------------
Observations                  93                      93                      93                      93           
R2                           0.869                   0.868                   0.848                   0.808         
Adjusted R2                  0.863                   0.864                   0.845                   0.806         
Residual Std. Error    16.360 (df = 88)        16.313 (df = 89)        17.395 (df = 90)        19.473 (df = 91)    
F Statistic         145.763*** (df = 4; 88) 195.313*** (df = 3; 89) 251.775*** (df = 2; 90) 382.628*** (df = 1; 91)
===================================================================================================================
Note:                                                                                   *p<0.1; **p<0.05; ***p<0.01

HPModel1 and HPModel2 (the first with all variables included, the second with only Beds removed) have the highest R^2 and adjusted R^2 values, indicating these are the strongest models. HPModel1 has the highest R^2 at 0.869, but its adjusted R^2 is 0.001 smaller than HPModel2’s, because HPModel1 is penalized for including an extra variable. Let’s compare the AIC and BIC of those two models to determine which is stronger.

Code
broom::glance(HPModel1)
# A tibble: 1 × 12
  r.squared adj.r.squa…¹ sigma stati…²  p.value    df logLik   AIC   BIC devia…³
      <dbl>        <dbl> <dbl>   <dbl>    <dbl> <dbl>  <dbl> <dbl> <dbl>   <dbl>
1     0.869        0.863  16.4    146. 5.94e-38     4  -389.  791.  806.  23553.
# … with 2 more variables: df.residual <int>, nobs <int>, and abbreviated
#   variable names ¹​adj.r.squared, ²​statistic, ³​deviance
Code
broom::glance(HPModel2)
# A tibble: 1 × 12
  r.squared adj.r.squa…¹ sigma stati…²  p.value    df logLik   AIC   BIC devia…³
      <dbl>        <dbl> <dbl>   <dbl>    <dbl> <dbl>  <dbl> <dbl> <dbl>   <dbl>
1     0.868        0.864  16.3    195. 4.96e-39     3  -390.  789.  802.  23684.
# … with 2 more variables: df.residual <int>, nobs <int>, and abbreviated
#   variable names ¹​adj.r.squared, ²​statistic, ³​deviance

HPModel1 has an AIC of 790.6225 and a BIC of 805.8181, while HPModel2 has an AIC of 789.1366 and a BIC of 801.7996. Because both criteria are slightly smaller for HPModel2, it is the slightly stronger model.
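The glance() output rounds these criteria; the exact values quoted above can be pulled directly with base R’s AIC() and BIC():

Code
# exact information criteria for both candidate models
AIC(HPModel1, HPModel2)
BIC(HPModel1, HPModel2)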

Finally, we can calculate the PRESS (predicted residual sum of squares) statistic, which sums the squared leave-one-out prediction errors; a smaller value indicates better out-of-sample predictive performance.

Code
# calculate the residuals
res1 <- residuals(HPModel1)

# Calculate the PRESS statistic
pressStat1 <- sum((res1/(1-hatvalues(HPModel1)))^2)

# View the PRESS statistic for HPModel1
cat("HPModel1's PRESS statistics is", pressStat1, "\n")
HPModel1's PRESS statistics is 28390.22 
Code
# calculate the residuals
res2 <- residuals(HPModel2)

# Calculate the PRESS statistic
pressStat2 <- sum((res2/(1-hatvalues(HPModel2)))^2)

# View the PRESS statistic for HPModel2
cat("HPModel2's PRESS statistics is", pressStat2)
HPModel2's PRESS statistics is 27860.05

Again, HPModel2 has a slightly smaller PRESS statistic, indicating it is the stronger of the two models.
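Since the same computation was run twice, a small helper function (a sketch, not part of the original code) makes the PRESS comparison more compact:

Code
# PRESS: sum of squared leave-one-out prediction errors for an lm fit
press <- function(model) {
  sum((residuals(model) / (1 - hatvalues(model)))^2)
}

# same comparison as above
sapply(list(HPModel1 = HPModel1, HPModel2 = HPModel2), press)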

E

Given the adjusted R^2, AIC, BIC, and PRESS statistic, I would keep HPModel2, the model with every variable except Beds, since it was the strongest according to all of those criteria. (R^2 necessarily grows as variables are added, so adjusted R^2 is the more appropriate measure for this comparison.)
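As a cross-check (a sketch; automated selection was not part of the original analysis), base R’s step() performs backward elimination by AIC and should arrive at the same model, dropping only Beds:

Code
# automated backward elimination by AIC; prints each elimination
# step and returns the final fitted model
step(HPModel1, direction = "backward")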

Question 2

Code
data(trees)
glimpse(trees)
Rows: 31
Columns: 3
$ Girth  <dbl> 8.3, 8.6, 8.8, 10.5, 10.7, 10.8, 11.0, 11.0, 11.1, 11.2, 11.3, …
$ Height <dbl> 70, 65, 63, 72, 81, 83, 66, 75, 80, 75, 79, 76, 76, 69, 75, 74,…
$ Volume <dbl> 10.3, 10.3, 10.2, 16.4, 18.8, 19.7, 15.6, 18.2, 22.6, 19.9, 24.…

From the documentation: “This data set provides measurements of the diameter, height and volume of timber in 31 felled black cherry trees. Note that the diameter (in inches) is erroneously labeled Girth in the data. It is measured at 4 ft 6 in above the ground.”

Tree volume estimation is a big deal, especially in the lumber industry. Use the trees data to build a basic model of tree volume prediction. In particular,

A

Fit a multiple regression model with Volume as the outcome and Girth and Height as the explanatory variables.

Code
TreePred <- lm(formula = Volume ~ Girth + Height, data = trees)
summary(TreePred)

Call:
lm(formula = Volume ~ Girth + Height, data = trees)

Residuals:
    Min      1Q  Median      3Q     Max 
-6.4065 -2.6493 -0.2876  2.2003  8.4847 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -57.9877     8.6382  -6.713 2.75e-07 ***
Girth         4.7082     0.2643  17.816  < 2e-16 ***
Height        0.3393     0.1302   2.607   0.0145 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.882 on 28 degrees of freedom
Multiple R-squared:  0.948, Adjusted R-squared:  0.9442 
F-statistic:   255 on 2 and 28 DF,  p-value: < 2.2e-16

B

Run regression diagnostic plots on the model. Based on the plots, do you think any of the regression assumptions is violated?

Residuals vs Fitted plot

Code
plot(TreePred, which = 1)

The red line is curved rather than horizontal, indicating that the linearity assumption is violated. The residuals do not exactly “bounce randomly” around the zero line, but they also do not fan out in a funnel shape, so it is unclear from this plot alone whether the constant variance assumption is violated. A few residuals stand out from the rest, suggesting the presence of outliers.
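For a more formal check of the curvature (a supplementary sketch using the already-loaded car package), residualPlots() adds curvature tests, including Tukey’s test for non-additivity, to residual-versus-predictor plots:

Code
# residual plots against each predictor and the fitted values,
# with curvature tests; significant tests suggest a nonlinear term
car::residualPlots(TreePred)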

Normal Q-Q plot

Code
plot(TreePred, which = 2)

Most of the points are along the line, indicating that the assumption of normality has been met.

Scale-Location

Code
plot(TreePred, which = 3)

The red line is not horizontal; it dips noticeably and then trends upward. This suggests the model violates the constant variance assumption.
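To supplement the visual impression (a sketch; this test is not in the original analysis), the car package, already loaded via alr4, provides ncvTest(), a score test for non-constant error variance:

Code
# score test for non-constant variance; a small p-value is
# evidence against the constant variance assumption
car::ncvTest(TreePred)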

Cook’s Distance

Code
plot(TreePred, which = 4)

To check for overly influential observations, we compare each point’s Cook’s distance against the common thresholds of 1 and 4/n. With n = 31 observations, 4/n = 4/31 ≈ 0.13. About three points exceed that threshold, indicating the presence of influential observations.
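To confirm the visual count (a quick sketch, not part of the original output), we can list the observations whose Cook’s distance crosses the 4/n threshold:

Code
# observations whose Cook's distance exceeds the 4/n threshold
cd <- cooks.distance(TreePred)
cd[cd > 4 / nrow(trees)]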

Question 3

Code
data(florida)

glimpse(florida)
Rows: 67
Columns: 3
$ Gore     <int> 47300, 2392, 18850, 3072, 97318, 386518, 2155, 29641, 25501, …
$ Bush     <int> 34062, 5610, 38637, 5413, 115185, 177279, 2873, 35419, 29744,…
$ Buchanan <int> 262, 73, 248, 65, 570, 789, 90, 182, 270, 186, 122, 89, 561, …

(Data file: florida in alr R package)

In the 2000 election for U.S. president, the counting of votes in Florida was controversial. In Palm Beach County in south Florida, for example, voters used a so-called butterfly ballot. Some believe that the layout of the ballot caused some voters to cast votes for Buchanan when their intended choice was Gore.

The data has variables for the number of votes for each candidate—Gore, Bush, and Buchanan.

A

Run a simple linear regression model where the Buchanan vote is the outcome and the Bush vote is the explanatory variable. Produce the regression diagnostic plots. Is Palm Beach County an outlier based on the diagnostic plots? Why or why not?

Code
votes <- lm(formula = Buchanan ~ Bush, data = florida)
summary(votes)

Call:
lm(formula = Buchanan ~ Bush, data = florida)

Residuals:
    Min      1Q  Median      3Q     Max 
-907.50  -46.10  -29.19   12.26 2610.19 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 4.529e+01  5.448e+01   0.831    0.409    
Bush        4.917e-03  7.644e-04   6.432 1.73e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 353.9 on 65 degrees of freedom
Multiple R-squared:  0.3889,    Adjusted R-squared:  0.3795 
F-statistic: 41.37 on 1 and 65 DF,  p-value: 1.727e-08
Code
plot(votes, which = 1:4)

After running this regression and examining the diagnostic plots, Palm Beach County is clearly a major outlier. In the Residuals vs Fitted plot, its residual sits far from the red line and from the rest of the residuals. In the Normal Q-Q plot, Palm Beach again lies far from the rest of the data, which otherwise lines up fairly neatly along the line. In the Scale-Location plot, Palm Beach is once more far from the other residuals. Finally, in the Cook’s Distance plot, Palm Beach stands out as highly influential: its Cook’s distance exceeds both common thresholds, 1 and 4/n = 4/67 ≈ 0.06.

Dade County also shows up as an outlier in some of these plots, but since that county includes Miami, a city with a very different demographic makeup from the rest of the state, it makes sense that this difference would be reflected in its voting patterns.
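To confirm which counties these are programmatically (a sketch; this assumes the alr4 version of florida stores county names as its row names):

Code
# counties whose Cook's distance exceeds 4/n; the names printed
# with the result come from the data's row names, assumed here
# to be the county names
cdVotes <- cooks.distance(votes)
cdVotes[cdVotes > 4 / nrow(florida)]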

B

Take the log of both variables (Bush vote and Buchanan vote) and repeat the analysis in (A). Do your findings change?

Code
votesLogged <- lm(formula = log(Buchanan) ~ log(Bush), data = florida)
summary(votesLogged)

Call:
lm(formula = log(Buchanan) ~ log(Bush), data = florida)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.96075 -0.25949  0.01282  0.23826  1.66564 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -2.57712    0.38919  -6.622 8.04e-09 ***
log(Bush)    0.75772    0.03936  19.251  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4673 on 65 degrees of freedom
Multiple R-squared:  0.8508,    Adjusted R-squared:  0.8485 
F-statistic: 370.6 on 1 and 65 DF,  p-value: < 2.2e-16
Code
plot(votesLogged, which = 1:4)

If we log both variables, Palm Beach remains an outlier in each diagnostic plot, but the difference between that county and the rest of the data is much less stark.