Homework 5 Solution
Dane Shelton
December 9, 2022
a) For backward elimination, Beds would be removed first, as it has the highest p-value, indicating it is the least important predictor of Price.
b) For forward selection, New would be the first variable added to the model, as it has the lowest p-value, indicating it is the strongest predictor of Price among the candidates.
c) Beds likely has a high p-value because it is highly collinear with Size, which is itself highly correlated with the response Price.
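The three answers above can be checked mechanically. The sketch below is only an illustration on simulated data: the `toy` data frame and its column names are invented here and are not the homework's house data, so which variable gets dropped or added in this run says nothing about the real answers. It shows one backward step, one forward step, and a hand-computed variance inflation factor (VIF) for the collinearity argument.

```r
# Toy data only: invented columns standing in for the homework's house data.
set.seed(1)
n <- 93
size  <- runif(n, 0.5, 4)
beds  <- round(1 + size + rnorm(n, sd = 0.5))        # built to track size
baths <- round(1 + 0.5 * size + rnorm(n, sd = 0.4))
new   <- rbinom(n, 1, 0.15)
price <- -40 + 65 * size + 19 * baths + 19 * new + rnorm(n, sd = 16)
toy   <- data.frame(price, size, beds, baths, new)

# (a) Backward elimination, first step: drop the term with the largest p-value.
full  <- lm(price ~ size + baths + new + beds, data = toy)
pvals <- summary(full)$coefficients[-1, "Pr(>|t|)"]  # exclude the intercept
drop_first <- names(which.max(pvals))
reduced <- update(full, as.formula(paste(". ~ . -", drop_first)))

# (b) Forward selection, first step: add the term with the smallest p-value.
null <- lm(price ~ 1, data = toy)
cand <- add1(null, scope = ~ size + baths + new + beds, test = "F")
add_first <- rownames(cand)[which.min(cand[["Pr(>F)"]])]

# (c) Collinearity check for beds: correlation with size and a hand-computed VIF.
cor_beds_size <- cor(toy$beds, toy$size)
r2_beds  <- summary(lm(beds ~ size + baths + new, data = toy))$r.squared
vif_beds <- 1 / (1 - r2_beds)

c(drop_first = drop_first, add_first = add_first)
c(cor_beds_size = cor_beds_size, vif_beds = vif_beds)
```

A large VIF for a predictor means its standard error is inflated by overlap with the other predictors, which is one way a variable that matters on its own can end up with a high p-value in the full model.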
d)
(summary(fit1))
##
## Call:
## lm(formula = P ~ S + Ba + New + Be, data = house)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -36.212  -9.546   1.277   9.406  71.953
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  -41.795     12.104  -3.453 0.000855 ***
## S             64.761      5.630  11.504  < 2e-16 ***
## Ba            19.203      5.650   3.399 0.001019 **
## New           18.984      3.873   4.902  4.3e-06 ***
## Be            -2.766      3.960  -0.698 0.486763
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 16.36 on 88 degrees of freedom
## Multiple R-squared: 0.8689, Adjusted R-squared: 0.8629
## F-statistic: 145.8 on 4 and 88 DF, p-value: < 2.2e-16
(summary(fit2))
##
## Call:
## lm(formula = P ~ S + Ba + New, data = house)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -34.804  -9.496   0.917   7.931  73.338
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  -47.992      8.209  -5.847 8.15e-08 ***
## S             62.263      4.335  14.363  < 2e-16 ***
## Ba            20.072      5.495   3.653 0.000438 ***
## New           18.371      3.761   4.885 4.54e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 16.31 on 89 degrees of freedom
## Multiple R-squared: 0.8681, Adjusted R-squared: 0.8637
## F-statistic: 195.3 on 3 and 89 DF, p-value: < 2.2e-16
(summary(fit3))
##
## Call:
## lm(formula = P ~ S + New, data = house)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -47.207  -9.763  -0.091   9.984  76.405
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  -26.089      5.977  -4.365 3.39e-05 ***
## S             72.575      3.508  20.690  < 2e-16 ***
## New           19.587      3.995   4.903 4.16e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 17.4 on 90 degrees of freedom
## Multiple R-squared: 0.8484, Adjusted R-squared: 0.845
## F-statistic: 251.8 on 2 and 90 DF, p-value: < 2.2e-16
Models
fit1: Price ~ Size + Bath + New + Beds
fit2: Price ~ Size + Bath + New
fit3: Price ~ Size + New
Model Evaluation
R^2:
fit1: .8689
fit2: .8681
fit3: .8484
Using R^2 as the model selection criterion would select fit1, the full model, because it has the highest R^2 value. This is expected: R^2 never decreases when a predictor is added, so it always favors the largest model.
Adjusted R^2:
fit1: .8629
fit2: .8637
fit3: .845
With adjusted R^2 penalizing each additional variable, fit2 would be selected as the best-fitting model.
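As a check on the arithmetic, the adjusted values can be recomputed from the reported R^2 values. This is a small sketch, not part of the original solution; the sample size n = 93 is inferred from the output above (88 residual degrees of freedom plus 5 estimated coefficients in fit1), and `p` counts the predictors in each model.

```r
n  <- 88 + 5                       # residual df of fit1 + its 5 coefficients
r2 <- c(fit1 = 0.8689, fit2 = 0.8681, fit3 = 0.8484)
p  <- c(fit1 = 4,      fit2 = 3,      fit3 = 2)   # number of predictors

# Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

round(adj_r2, 4)
names(which.max(adj_r2))           # model preferred by adjusted R^2
```

Because the inputs are the rounded R^2 values from the summaries, the results can differ from R's own output in the last digit.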
PRESS
(PRESS(fit1)$stat)
## .........10.........20.........30.........40.........50
## .........60.........70.........80.........90...
## [1] 28390.22
(PRESS(fit2)$stat)
## .........10.........20.........30.........40.........50
## .........60.........70.........80.........90...
## [1] 27860.05
(PRESS(fit3)$stat)
## .........10.........20.........30.........40.........50
## .........60.........70.........80.........90...
## [1] 31066
Using the PRESS statistic as the model selection criterion would select fit2, the model with the lowest value.
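The `PRESS()` call above appears to come from an add-on package that the source does not name. For a linear model the same statistic can be computed in base R without refitting, since each leave-one-out residual equals the ordinary residual divided by one minus its leverage. The sketch below assumes that identity; the helper name `press` and the toy data are invented for illustration.

```r
# PRESS = sum of squared leave-one-out prediction errors.
# For lm fits: e_(i) = e_i / (1 - h_ii), with h_ii the leverage of row i.
press <- function(fit) {
  h <- hatvalues(fit)
  sum((residuals(fit) / (1 - h))^2)
}

# Toy data only, to show the helper runs.
set.seed(1)
toy <- data.frame(x = runif(30))
toy$y <- 2 + 3 * toy$x + rnorm(30)
fit <- lm(y ~ x, data = toy)

press(fit)
```

Applied to the homework's models, `press(fit1)`, `press(fit2)`, and `press(fit3)` should agree with the values printed above, provided the package in use defines PRESS the same way.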
AIC
Using AIC as the evaluation criterion, fit2 is the best model (lowest value).
BIC
Using BIC as the evaluation criterion, fit2 is the best model (lowest value).
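The AIC and BIC values themselves are not shown, but the ranking can be approximately reconstructed from the summaries above. The sketch below assumes the Gaussian log-likelihood that R uses for `lm` fits and rebuilds each residual sum of squares from the reported residual standard error and degrees of freedom; because those inputs are rounded, the numbers are approximate. With the fitted objects available, `AIC(fit1, fit2, fit3)` and `BIC(fit1, fit2, fit3)` would give the values directly.

```r
n   <- 93                                   # 88 residual df + 5 coefficients
rse <- c(fit1 = 16.36, fit2 = 16.31, fit3 = 17.4)  # residual standard errors
df  <- c(fit1 = 88,    fit2 = 89,    fit3 = 90)    # residual degrees of freedom

rss <- rse^2 * df                 # residual sum of squares
k   <- (n - df) + 1               # coefficients plus the error variance

# Maximized Gaussian log-likelihood of a linear model
loglik <- -n / 2 * (log(2 * pi) + log(rss / n) + 1)

aic <- -2 * loglik + 2 * k
bic <- -2 * loglik + log(n) * k

round(rbind(AIC = aic, BIC = bic), 1)
c(AIC = names(which.min(aic)), BIC = names(which.min(bic)))
```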
d: I prefer fit2. It is the model chosen by adjusted R^2, PRESS, AIC, and BIC, and it avoids extraneous variables. Because Beds is highly correlated with Size, it does not need to be included in the model, while Baths adds information beyond predicting Price from only the Size of a house and whether it is New.
In the Residuals vs Fitted Values plot, we see a clear pattern in the residuals. This indicates that the model violates the homoskedasticity assumption, which requires all residuals to share the same variance regardless of fitted value.
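A visual impression like this can be backed by a rough numeric check. The sketch below is an illustration on simulated data, not the homework's model: it draws the residuals vs fitted plot in base R and then regresses the squared residuals on the fitted values, in the spirit of a Breusch-Pagan test. A clearly positive slope suggests the residual variance grows with the fitted value.

```r
# Toy data only: the error spread is built to grow with x.
set.seed(2)
x <- runif(200, 1, 10)
y <- 5 + 2 * x + rnorm(200, sd = 0.5 * x)
fit <- lm(y ~ x)

# Residuals vs fitted values: a funnel shape signals non-constant variance.
plot(fitted(fit), resid(fit), xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)

# Auxiliary regression of squared residuals on fitted values.
aux <- lm(resid(fit)^2 ~ fitted(fit))
aux_slope <- unname(coef(aux)[2])
aux_p     <- summary(aux)$coefficients[2, "Pr(>|t|)"]
c(slope = aux_slope, p_value = aux_p)
```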
library(dplyr)      # mutate() and %>%
library(ggfortify)  # supplies the autoplot() method for lm objects
florida <- alr4::florida
# Buchanan votes regressed on Bush votes, by county
buch_bush <- lm(Buchanan ~ Bush, data = florida)
# All six lm diagnostic plots in a 3-column grid
diag2 <- autoplot(buch_bush, which = 1:6, ncol = 3)
diag2
# Refit after taking the natural log of both variables
log_buch_bush <- florida %>%
  mutate(Buchanan = log(Buchanan),
         Bush = log(Bush))
log_fit <- lm(Buchanan ~ Bush, data = log_buch_bush)
diag3 <- autoplot(log_fit, which = 1:6, ncol = 3)
diag3
a: Yes, Palm Beach County is identified as an outlier in all diagnostic plots, with Residuals vs Fitted and Cook's Distance providing the most telling evidence.
b: After taking the natural log of both variables, the diagnostic plots improve, but Palm Beach County is still identified as an outlier.
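The outlier can also be identified numerically rather than by eye. The sketch below uses simulated data with one planted outlier, since it is only meant to show the calls: `cooks.distance()` and `rstudent()` are base R functions, while the data and the planted row are invented here. On the real fits, the same calls on `buch_bush` and `log_fit` would show which county has the largest Cook's distance before and after the log transformation.

```r
# Toy data only: a clean linear trend with one planted outlier in row 50.
set.seed(3)
x <- runif(50, 0, 10)
y <- 1 + 0.5 * x + rnorm(50, sd = 0.5)
x[50] <- 9
y[50] <- 1 + 0.5 * 9 + 10          # 10 units above the true line

fit <- lm(y ~ x)

cd <- cooks.distance(fit)          # influence of each row on the fitted model
rs <- rstudent(fit)                # externally studentized residuals

worst <- unname(which.max(cd))     # row with the largest Cook's distance
c(row = worst, cooks_d = unname(cd[worst]), rstudent = unname(rs[worst]))
```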