Homework 5
Author

Steve O’Neill

Published

November 27, 2022

Code
library(smss)
library(tidyverse)
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.4.0      ✔ purrr   0.3.5 
✔ tibble  3.1.8      ✔ dplyr   1.0.10
✔ tidyr   1.2.1      ✔ stringr 1.4.1 
✔ readr   2.1.3      ✔ forcats 0.5.2 
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
Code
data(house.selling.price.2)
library(alr4)
Loading required package: car
Loading required package: carData

Attaching package: 'car'

The following object is masked from 'package:dplyr':

    recode

The following object is masked from 'package:purrr':

    some

Loading required package: effects
lattice theme set by effectsTheme()
See ?effectsTheme for details.
Code
data("florida")

For the house.selling.price.2 data, the tables below show a correlation matrix and a model fit using four predictors of selling price.

In this data, the variables are defined as:

P: selling price

S: size of the house

Be: number of bedrooms

Ba: number of bathrooms

New: whether new (1 = yes, 0 = no)

Here is the correlation matrix:

Code
cor(house.selling.price.2)
            P         S        Be        Ba       New
P   1.0000000 0.8988136 0.5902675 0.7136960 0.3565540
S   0.8988136 1.0000000 0.6691137 0.6624828 0.1762879
Be  0.5902675 0.6691137 1.0000000 0.3337966 0.2672091
Ba  0.7136960 0.6624828 0.3337966 1.0000000 0.1820651
New 0.3565540 0.1762879 0.2672091 0.1820651 1.0000000

And regression output:

Code
fit <- lm(P ~ ., data=house.selling.price.2)
summary(fit)

Call:
lm(formula = P ~ ., data = house.selling.price.2)

Residuals:
    Min      1Q  Median      3Q     Max 
-36.212  -9.546   1.277   9.406  71.953 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -41.795     12.104  -3.453 0.000855 ***
S             64.761      5.630  11.504  < 2e-16 ***
Be            -2.766      3.960  -0.698 0.486763    
Ba            19.203      5.650   3.399 0.001019 ** 
New           18.984      3.873   4.902  4.3e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 16.36 on 88 degrees of freedom
Multiple R-squared:  0.8689,    Adjusted R-squared:  0.8629 
F-statistic: 145.8 on 4 and 88 DF,  p-value: < 2.2e-16

Automated variable selection

For backward elimination, which variable would be deleted first? Why?

If I were doing backward elimination, I would pick a significance level (say, alpha = .05) and, at each stage, delete the variable with the largest p-value above that level, refitting after each deletion and stopping once every remaining variable is significant.

In this example, I would delete the Be (bedrooms) variable first, since its p-value (0.487) is by far the largest.

Code
summary(lm(P ~ . - Be, data=house.selling.price.2))

Call:
lm(formula = P ~ . - Be, data = house.selling.price.2)

Residuals:
    Min      1Q  Median      3Q     Max 
-34.804  -9.496   0.917   7.931  73.338 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -47.992      8.209  -5.847 8.15e-08 ***
S             62.263      4.335  14.363  < 2e-16 ***
Ba            20.072      5.495   3.653 0.000438 ***
New           18.371      3.761   4.885 4.54e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 16.31 on 89 degrees of freedom
Multiple R-squared:  0.8681,    Adjusted R-squared:  0.8637 
F-statistic: 195.3 on 3 and 89 DF,  p-value: < 2.2e-16
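The p-value-based elimination above can also be automated. Here is a minimal sketch using R's step() function, which performs backward elimination by AIC rather than by p-values; based on the AIC values shown in the forward-selection output later in this document, it should likewise drop only Be:

Code
# AIC-based backward elimination from the full four-predictor model;
# step() compares the AIC of the model with and without each term
step(fit, direction = "backward")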

For forward selection, which variable would be added first? Why?

As with backward elimination, I would predetermine a significance level (say, 5%), but here I would begin with no explanatory variables and, at each step, add the variable whose addition is most significant, stopping once no candidate variable meets the threshold.

The Size variable (S) would be added first in forward selection, since it has the strongest association with price. The step() output below uses AIC rather than p-values as its criterion, but it selects the same first variable.

Code
intercept_only <- lm(P ~ 1, data=house.selling.price.2)

step(intercept_only, direction = "forward", scope=~ S + Be + Ba + New)
Start:  AIC=705.63
P ~ 1

       Df Sum of Sq    RSS    AIC
+ S     1    145097  34508 554.22
+ Ba    1     91484  88121 641.41
+ Be    1     62578 117028 667.79
+ New   1     22833 156772 694.99
<none>              179606 705.63

Step:  AIC=554.22
P ~ S

       Df Sum of Sq   RSS    AIC
+ New   1    7274.7 27234 534.20
+ Ba    1    4475.6 30033 543.30
<none>              34508 554.22
+ Be    1      40.4 34468 556.11

Step:  AIC=534.2
P ~ S + New

       Df Sum of Sq   RSS    AIC
+ Ba    1    3550.1 23684 523.21
+ Be    1     588.8 26645 534.17
<none>              27234 534.20

Step:  AIC=523.21
P ~ S + New + Ba

       Df Sum of Sq   RSS    AIC
<none>              23684 523.21
+ Be    1    130.55 23553 524.70

Call:
lm(formula = P ~ S + New + Ba, data = house.selling.price.2)

Coefficients:
(Intercept)            S          New           Ba  
     -47.99        62.26        18.37        20.07  

Why do you think that BEDS has such a large P-value in the multiple regression model, even though it has a substantial correlation with PRICE?

As pointed out, Be does have a substantial correlation with P (0.59). However, its large p-value in the multiple regression indicates that, holding the other variables fixed, the number of bedrooms adds little additional explanatory power for price. Be is strongly correlated with size (r = 0.67), so once S is in the model, most of the information in Be is already accounted for.
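One way to check this overlap numerically is with variance inflation factors. This is a quick sketch using car::vif() (the car package is already loaded as a dependency of alr4), not part of the original assignment output:

Code
# variance inflation factors for the full model: how much each coefficient's
# variance is inflated by correlation with the other predictors
car::vif(fit)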

Using software with these four predictors, find the model that would be selected using each criterion:

I am not sure I read the question exactly as intended, so I will compare several models built from different combinations of the predictors.

Code
mod1 <- lm(P ~ S, data=house.selling.price.2)
mod2 <- lm(P ~ S + New, data=house.selling.price.2)
mod3 <- lm(P ~ S + New + Ba, data=house.selling.price.2)
mod4 <- lm(P ~ .,data=house.selling.price.2)

# A few models that add an S:New interaction term

mod5 <- lm(P ~ S + New + S*New, data=house.selling.price.2)
mod6 <- lm(P ~ S + New + Ba + S*New, data=house.selling.price.2)
mod7 <- lm(P ~ . + S * New, data=house.selling.price.2)

R-squared

Code
summary(mod1)$r.squared
[1] 0.807866
Code
summary(mod2)$r.squared
[1] 0.8483699
Code
summary(mod3)$r.squared
[1] 0.8681361
Code
summary(mod4)$r.squared
[1] 0.868863
Code
summary(mod5)$r.squared
[1] 0.8675196
Code
summary(mod6)$r.squared
[1] 0.8891938
Code
summary(mod7)$r.squared
[1] 0.8906193

Using ‘highest R-squared’ as the criterion, P ~ . + S * New is the winner. That model includes all of the predictors, with an interaction between size and newness.

Adjusted R-squared

Code
summary(mod1)$adj.r.squared
[1] 0.8057546
Code
summary(mod2)$adj.r.squared
[1] 0.8450003
Code
summary(mod3)$adj.r.squared
[1] 0.8636912
Code
summary(mod4)$adj.r.squared
[1] 0.8629022
Code
summary(mod5)$adj.r.squared
[1] 0.8630539
Code
summary(mod6)$adj.r.squared
[1] 0.8841571
Code
summary(mod7)$adj.r.squared
[1] 0.8843331

Adjusted R-squared penalizes the addition of explanatory variables. Even so, it is a virtual tie between P ~ S + New + Ba + S*New and P ~ . + S * New, with the latter winning narrowly.
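That penalty can be written down explicitly: adjusted R-squared equals 1 - (1 - R^2)(n - 1)/(n - p - 1), where n is the number of observations and p the number of estimated slopes. As a quick check (a sketch using mod6, with n = 93 and p = 4):

Code
# recompute adjusted R-squared for mod6 by hand
n <- nrow(house.selling.price.2)   # 93 observations
p <- 4                             # S, New, Ba, S:New
1 - (1 - summary(mod6)$r.squared) * (n - 1) / (n - p - 1)
# should match summary(mod6)$adj.r.squared above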

PRESS

This function, found on GitHub, computes the PRESS statistic (the sum of squared leave-one-out prediction errors) and requires no additional libraries.

Code
PRESS <- function(linear.model) {
  # leave-one-out (predictive) residuals: e_i / (1 - h_ii)
  pr <- residuals(linear.model) / (1 - lm.influence(linear.model)$hat)
  # PRESS is the sum of the squared predictive residuals
  PRESS <- sum(pr^2)
  return(PRESS)
}

Comparing the PRESS statistics, the best (lowest) value is 27501.78, for the model P ~ S + New + Ba + S*New.

Code
PRESS(mod1)
[1] 38203.29
Code
PRESS(mod2)
[1] 31066
Code
PRESS(mod3)
[1] 27860.05
Code
PRESS(mod4)
[1] 28390.22
Code
PRESS(mod5)
[1] 31899.8
Code
PRESS(mod6)
[1] 27501.78
Code
PRESS(mod7)
[1] 27665.14

AIC

According to AIC, the best (lowest) score is 774.9558, again for the model P ~ S + New + Ba + S*New.

Code
AIC(mod1)
[1] 820.1439
Code
AIC(mod2)
[1] 800.1262
Code
AIC(mod3)
[1] 789.1366
Code
AIC(mod4)
[1] 790.6225
Code
AIC(mod5)
[1] 789.5704
Code
AIC(mod6)
[1] 774.9558
Code
AIC(mod7)
[1] 775.7515

BIC

According to BIC, the best model is the same: P ~ S + New + Ba + S*New, with a BIC of 790.1514.

Code
BIC(mod1)
[1] 827.7417
Code
BIC(mod2)
[1] 810.2566
Code
BIC(mod3)
[1] 801.7996
Code
BIC(mod4)
[1] 805.8181
Code
BIC(mod5)
[1] 802.2334
Code
BIC(mod6)
[1] 790.1514
Code
BIC(mod7)
[1] 793.4797
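For easier comparison, all five criteria can be collected into a single table. This is a sketch reusing the seven models and the PRESS() helper defined above:

Code
# one row per candidate model, one column per selection criterion
models <- list(mod1 = mod1, mod2 = mod2, mod3 = mod3, mod4 = mod4,
               mod5 = mod5, mod6 = mod6, mod7 = mod7)
data.frame(
  r.squared     = sapply(models, function(m) summary(m)$r.squared),
  adj.r.squared = sapply(models, function(m) summary(m)$adj.r.squared),
  PRESS         = sapply(models, PRESS),
  AIC           = sapply(models, AIC),
  BIC           = sapply(models, BIC)
)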

Explain which model you prefer and why.

I prefer the model favored by PRESS, AIC, and BIC: P ~ S + New + Ba + S*New. Intuitively, it makes sense that the number of bedrooms is not a significant driver of price once size is accounted for, while the number of bathrooms is, and that size has a larger effect on price when the house is new. The diagnostic plots below compare the residual behavior of several of the candidate models.
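The interaction claim can be read off the fitted coefficients: if the S:New coefficient in mod6 is positive, the slope of price on size is steeper for new houses. A quick check, not part of the original output:

Code
# S:New gives the additional slope of price on size for new houses
coef(mod6)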

Code
par(mfrow = c(2,3)); plot(mod2, which = 1:6)

Code
par(mfrow = c(2,3)); plot(mod5, which = 1:6)

Code
par(mfrow = c(2,3)); plot(mod6, which = 1:6)

Code
par(mfrow = c(2,3)); plot(mod7, which = 1:6)

Question 2

Code
data(trees)
trees
   Girth Height Volume
1    8.3     70   10.3
2    8.6     65   10.3
3    8.8     63   10.2
4   10.5     72   16.4
5   10.7     81   18.8
6   10.8     83   19.7
7   11.0     66   15.6
8   11.0     75   18.2
9   11.1     80   22.6
10  11.2     75   19.9
11  11.3     79   24.2
12  11.4     76   21.0
13  11.4     76   21.4
14  11.7     69   21.3
15  12.0     75   19.1
16  12.9     74   22.2
17  12.9     85   33.8
18  13.3     86   27.4
19  13.7     71   25.7
20  13.8     64   24.9
21  14.0     78   34.5
22  14.2     80   31.7
23  14.5     74   36.3
24  16.0     72   38.3
25  16.3     77   42.6
26  17.3     81   55.4
27  17.5     82   55.7
28  17.9     80   58.3
29  18.0     80   51.5
30  18.0     80   51.0
31  20.6     87   77.0

From the documentation: “This data set provides measurements of the diameter, height and volume of timber in 31 felled black cherry trees. Note that the diameter (in inches) is erroneously labeled Girth in the data. It is measured at 4 ft 6 in above the ground.”

Tree volume estimation is a big deal, especially in the lumber industry. Use the trees data to build a basic model of tree volume prediction. In particular,

fit a multiple regression model with the Volume as the outcome and Girth and Height as the explanatory variables

Code
mod8 <- lm(Volume ~ Girth + Height, data = trees)

Run regression diagnostic plots on the model. Based on the plots, do you think any of the regression assumptions is violated?

The diagnostic plots below show several significant issues with the regression Volume ~ Girth + Height:

  1. The ‘Residuals vs Fitted’ graph is U-shaped, indicating an issue with linearity.

  2. The Normal Q-Q plot looks reasonable apart from one extreme outlier, which indicates a problem with the normality assumption.

  3. The Scale-Location graph should also be flat, but isn’t. This indicates an issue with homoscedasticity.

  4. The Cook’s distance plot flags one observation as highly influential. An observation with a Cook’s distance above a common benchmark (such as 1, or 4/n) could substantially change the fitted regression if it were removed.

  5. The Residuals vs Leverage plot and Cook’s dist vs Leverage plot also show the influence of this outlier.

Code
par(mfrow = c(2,3)); plot(mod8, which = 1:6)
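To identify which tree the Cook’s distance panel is flagging, the distances can be inspected directly (a quick numeric check, not required by the question):

Code
# Cook's distances for the Volume ~ Girth + Height fit; the largest value
# corresponds to the influential observation seen in the plots
cd <- cooks.distance(mod8)
which.max(cd)
max(cd)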

Question 3

In the 2000 election for U.S. president, the counting of votes in Florida was controversial. In Palm Beach County in south Florida, for example, voters used a so-called butterfly ballot. Some believe that the layout of the ballot caused some voters to cast votes for Buchanan when their intended choice was Gore.

Code
florida
               Gore   Bush Buchanan
ALACHUA       47300  34062      262
BAKER          2392   5610       73
BAY           18850  38637      248
BRADFORD       3072   5413       65
BREVARD       97318 115185      570
BROWARD      386518 177279      789
CALHOUN        2155   2873       90
CHARLOTTE     29641  35419      182
CITRUS        25501  29744      270
CLAY          14630  41745      186
COLLIER       29905  60426      122
COLUMBIA       7047  10964       89
DADE         328702 289456      561
DE SOTO        3322   4256       36
DIXIE          1825   2698       29
DUVAL        107680 152082      650
ESCAMBIA      40958  73029      504
FLAGLER       13891  12608       83
FRANKLIN       2042   2448       33
GADSDEN        9565   4750       39
GILCHRIST      1910   3300       29
GLADES         1420   1840        9
GULF           2389   3546       71
HAMILTON       1718   2153       24
HARDEE         2341   3764       30
HENDRY         3239   4743       22
HERNANDO      32644  30646      242
HIGHLANDS     14152  20196       99
HILLSBOROUGH 166581 176967      836
HOLMES         2154   4985       76
INDIAN RIVER  19769  28627      105
JACKSON        6868   9138      102
JEFFERSON      3038   2481       29
LAFAYETTE       788   1669       10
LAKE          36555  49963      289
LEE           73560 106141      305
LEON          61425  39053      282
LEVY           5403   6860       67
LIBERTY        1011   1316       39
MADISON        3011   3038       29
MANATEE       49169  57948      272
MARION        44648  55135      563
MARTIN        26619  33864      108
MONROE        16483  16059       47
NASSAU         6952  16404       90
OKALOOSA      16924  52043      267
OKEECHOBEE     4588   5058       43
ORANGE       140115 134476      446
OSCEOLA       28177  26216      145
PALM BEACH   268945 152846     3407
PASCO         69550  68581      570
PINELLAS     199660 184312     1010
POLK          74977  90101      538
PUTNAM        12091  13439      147
ST. JOHNS     19482  39497      229
ST. LUCIE     41559  34705      124
SANTA ROSA    12795  36248      311
SARASOTA      72854  83100      305
SEMINOLE      58888  75293      194
SUMTER         9634  12126      114
SUWANNEE       4084   8014      108
TAYLOR         2647   4051       27
UNION          1399   2326       26
VOLUSIA       97063  82214      396
WAKULLA        3835   4511       46
WALTON         5637  12176      120
WASHINGTON     2796   4983       88
Code
mod9 <- lm(Buchanan ~ Bush, data = florida)
summary(mod9)

Call:
lm(formula = Buchanan ~ Bush, data = florida)

Residuals:
    Min      1Q  Median      3Q     Max 
-907.50  -46.10  -29.19   12.26 2610.19 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 4.529e+01  5.448e+01   0.831    0.409    
Bush        4.917e-03  7.644e-04   6.432 1.73e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 353.9 on 65 degrees of freedom
Multiple R-squared:  0.3889,    Adjusted R-squared:  0.3795 
F-statistic: 41.37 on 1 and 65 DF,  p-value: 1.727e-08
Code
par(mfrow = c(2,3)); plot(mod9, which = 1:6)

The Normal Q-Q plot shows a strong outlier: Palm Beach County. Its Cook’s distance is very high (over 1), and it falls outside the outer Cook’s distance contour in the Residuals vs Leverage plot.

Code
mod10 <- lm(log(Buchanan) ~ log(Bush), data = florida)
summary(mod10)

Call:
lm(formula = log(Buchanan) ~ log(Bush), data = florida)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.96075 -0.25949  0.01282  0.23826  1.66564 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -2.57712    0.38919  -6.622 8.04e-09 ***
log(Bush)    0.75772    0.03936  19.251  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4673 on 65 degrees of freedom
Multiple R-squared:  0.8508,    Adjusted R-squared:  0.8485 
F-statistic: 370.6 on 1 and 65 DF,  p-value: < 2.2e-16

Logging both variables lowers the p-value of log(Bush) and raises the multiple R-squared from 0.39 to 0.85, so the relationship between the two vote counts is much closer to linear on the log scale. Palm Beach County still stands out as a large positive outlier (it produces the maximum residual of 1.67 on the log scale), so the suspicion raised by the first model is better supported.
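As an illustration, the log-log model can be used to predict the Buchanan vote in Palm Beach County from its Bush vote (152,846, shown in the table above) and compare the back-transformed prediction with the 3,407 votes actually recorded; the observed count comes out several times larger than predicted. This is only a sketch: the model was fit with Palm Beach County included, which, if anything, pulls the prediction toward the observed value.

Code
# back-transformed prediction of Buchanan votes for Palm Beach County
pred_log <- predict(mod10, newdata = data.frame(Bush = 152846))
exp(pred_log)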