Homework 5 Solution
Dane Shelton
December 9, 2022
a) For backward elimination, Beds would be removed first because it has the highest p-value, indicating that it contributes least to predicting Price once the other variables are in the model.
b) For forward selection, New would be the first variable added to the model because it has the lowest p-value, indicating that it is the strongest predictor of Price.
c) Beds likely has a high p-value because it is highly collinear with Size, which is itself strongly correlated with the response, Price.
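The collinearity argument in part c can be illustrated with a small simulation. This is only a sketch: the data are simulated, not the house data, and the names size, beds, and price are stand-ins chosen for the illustration. It computes the correlation between the two predictors and the variance inflation factor (VIF) for beds, which measures how much the variance of its coefficient is inflated by its overlap with size.

```r
# Simulated stand-in data (NOT the homework's house data).
set.seed(1)
n <- 93
size  <- rnorm(n, mean = 1.6, sd = 0.5)
beds  <- 1.5 * size + rnorm(n, sd = 0.3)   # beds is built to track size closely
price <- 60 * size + rnorm(n, sd = 16)     # price depends on size only

# Correlation between the two predictors.
r <- cor(size, beds)

# VIF for beds: 1 / (1 - R^2) from regressing beds on the other predictor(s).
vif_beds <- 1 / (1 - summary(lm(beds ~ size))$r.squared)

# Coefficient table for the joint model; compare the standard error and
# p-value of beds with those of size.
joint <- coef(summary(lm(price ~ size + beds)))

print(c(correlation = r, vif_beds = vif_beds))
print(joint)
```

With only two predictors, the VIF reduces to 1 / (1 - r^2), so a high correlation translates directly into a large standard error for the redundant variable.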
d)
Code
(summary(fit1))
##
## Call:
## lm(formula = P ~ S + Ba + New + Be, data = house)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -36.212  -9.546   1.277   9.406  71.953
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  -41.795     12.104  -3.453 0.000855 ***
## S             64.761      5.630  11.504  < 2e-16 ***
## Ba            19.203      5.650   3.399 0.001019 **
## New           18.984      3.873   4.902  4.3e-06 ***
## Be            -2.766      3.960  -0.698 0.486763
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 16.36 on 88 degrees of freedom
## Multiple R-squared: 0.8689, Adjusted R-squared: 0.8629
## F-statistic: 145.8 on 4 and 88 DF, p-value: < 2.2e-16
(summary(fit2))
##
## Call:
## lm(formula = P ~ S + Ba + New, data = house)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -34.804  -9.496   0.917   7.931  73.338
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  -47.992      8.209  -5.847 8.15e-08 ***
## S             62.263      4.335  14.363  < 2e-16 ***
## Ba            20.072      5.495   3.653 0.000438 ***
## New           18.371      3.761   4.885 4.54e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 16.31 on 89 degrees of freedom
## Multiple R-squared: 0.8681, Adjusted R-squared: 0.8637
## F-statistic: 195.3 on 3 and 89 DF, p-value: < 2.2e-16
(summary(fit3))
##
## Call:
## lm(formula = P ~ S + New, data = house)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -47.207  -9.763  -0.091   9.984  76.405
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  -26.089      5.977  -4.365 3.39e-05 ***
## S             72.575      3.508  20.690  < 2e-16 ***
## New           19.587      3.995   4.903 4.16e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 17.4 on 90 degrees of freedom
## Multiple R-squared: 0.8484, Adjusted R-squared: 0.845
## F-statistic: 251.8 on 2 and 90 DF, p-value: < 2.2e-16
Models
fit1: Price ~ Size + Bath + New + Beds
fit2: Price ~ Size + Bath + New
fit3: Price ~ Size + New
Model Evaluation
R2:
fit1: .8689
fit2: .8681
fit3: .8484
Using R2 as the model selection criterion would select fit1, the full model, because it has the highest R2 value. R2 never decreases when a variable is added, so it always favors the largest model.
Adjusted R2:
fit1: .8629
fit2: .8637
fit3: .845
Because Adjusted R2 penalizes a model for each additional variable, fit2 would be selected as the best-fitting model: it has the highest Adjusted R2.
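The adjusted values follow from R2, the sample size n, and the number of predictors p via 1 - (1 - R2)(n - 1)/(n - p - 1). As a quick check, the sketch below recomputes them from the R2 values reported in the summaries, taking n = 93 (the residual degrees of freedom plus the number of estimated coefficients in each summary). The inputs are rounded, so agreement is only expected to about four decimal places.

```r
# R2 values copied from the summary() output above.
n  <- 93
r2 <- c(fit1 = 0.8689, fit2 = 0.8681, fit3 = 0.8484)
p  <- c(fit1 = 4,      fit2 = 3,      fit3 = 2)   # predictors, excluding the intercept

# Adjusted R2 shrinks R2 according to the number of predictors used.
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

print(round(adj_r2, 4))
print(names(which.max(adj_r2)))   # model preferred by adjusted R2
```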
PRESS
Code
(PRESS(fit1)$stat)
## .........10.........20.........30.........40.........50
## .........60.........70.........80.........90...
## [1] 28390.22
(PRESS(fit2)$stat)
## .........10.........20.........30.........40.........50
## .........60.........70.........80.........90...
## [1] 27860.05
(PRESS(fit3)$stat)
## .........10.........20.........30.........40.........50
## .........60.........70.........80.........90...
## [1] 31066
Using the PRESS statistic as our model selection criterion would select fit2, the model with the lowest value.
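PRESS is the sum of squared leave-one-out prediction errors. For a linear model it does not require refitting n times, because each leave-one-out residual equals the ordinary residual divided by 1 minus its leverage. The sketch below checks that identity in base R. It uses the built-in mtcars data and an arbitrary formula purely as a stand-in, since the house data are not included here, and it does not depend on whichever package supplied PRESS() above.

```r
# Stand-in model on a built-in dataset (NOT the homework's house data).
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Shortcut: leave-one-out residual = e_i / (1 - h_ii).
press_shortcut <- sum((residuals(fit) / (1 - hatvalues(fit)))^2)

# Brute force: refit without observation i, then predict observation i.
press_loo <- sum(sapply(seq_len(nrow(mtcars)), function(i) {
  fit_i <- lm(mpg ~ wt + hp, data = mtcars[-i, ])
  (mtcars$mpg[i] - predict(fit_i, newdata = mtcars[i, ]))^2
}))

print(c(shortcut = press_shortcut, leave_one_out = press_loo))
```

Applied to fit1, fit2, and fit3, the shortcut line alone should give the same kind of statistic as the values printed above.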
AIC
Using AIC as the evaluation criterion, we select fit2 as the best model because it has the lowest value.
BIC
Using BIC as the evaluation criterion, we again select fit2 as the best model because it has the lowest value.
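The AIC and BIC values themselves are not printed in this solution. As a rough check, they can be approximated from the residual standard errors and residual degrees of freedom reported in the three summaries, using the Gaussian log-likelihood that R's AIC() and BIC() use for lm objects. This is a sketch from rounded inputs, so the values are approximate. With the original objects available, AIC(fit1, fit2, fit3) and BIC(fit1, fit2, fit3) would be the direct route.

```r
# Values copied from the summary() output above (rounded).
fits <- data.frame(
  model  = c("fit1", "fit2", "fit3"),
  rse    = c(16.36, 16.31, 17.4),   # residual standard error
  df_res = c(88, 89, 90),           # residual degrees of freedom
  n_coef = c(5, 4, 3)               # coefficients, including the intercept
)
n <- 93                             # df_res + n_coef for every model

rss <- fits$rse^2 * fits$df_res     # residual sum of squares
k   <- fits$n_coef + 1              # parameters: coefficients plus error variance

# -2 * log-likelihood of a Gaussian linear model, evaluated at the MLE.
neg2_loglik <- n * (log(2 * pi) + 1 + log(rss / n))

fits$aic <- neg2_loglik + 2 * k
fits$bic <- neg2_loglik + log(n) * k

print(fits[order(fits$aic), c("model", "aic", "bic")])
```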
d: I prefer fit2. It is the model selected by Adjusted R2, PRESS, AIC, and BIC, and it avoids extraneous variables. Because Beds is highly correlated with Size, it does not need to be included in the model, while Baths makes the model more complete than one that predicts Price from only the Size of a house and whether it is New.
Evaluating the Residuals vs Fitted Values plot, we see a clear pattern in the residuals. This indicates that the model violates the homoskedasticity assumption, which requires that all residuals share the same variance regardless of fitted value.
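A visual pattern can be backed up with a numeric check. The sketch below is an addition, not part of the original analysis: it simulates data whose error spread grows with the predictor, then regresses the squared residuals on the predictor. Multiplying the R2 of that auxiliary regression by n gives a statistic of the studentized Breusch-Pagan form, which is compared with a chi-squared distribution. All names and data here are hypothetical.

```r
# Simulated data with non-constant variance (NOT the homework data).
set.seed(2)
n <- 200
x <- runif(n, 1, 10)
y <- 3 + 2 * x + rnorm(n, sd = x)   # error standard deviation grows with x

fit <- lm(y ~ x)

# To view the funnel shape interactively:
# plot(fitted(fit), residuals(fit)); abline(h = 0, lty = 2)

# Auxiliary regression: do the squared residuals depend on x?
aux     <- lm(residuals(fit)^2 ~ x)
bp_stat <- n * summary(aux)$r.squared
bp_p    <- pchisq(bp_stat, df = 1, lower.tail = FALSE)

print(c(statistic = bp_stat, p_value = bp_p))
```

A small p-value here points to variance that changes with the predictor, which is what a funnel shape in the Residuals vs Fitted plot suggests.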
Code
library(dplyr)      # mutate() and %>%
library(ggfortify)  # supplies the autoplot() method for lm objects
florida <- alr4::florida
# Regress Buchanan votes on Bush votes and draw all six diagnostic plots
buch_bush <- lm(Buchanan ~ Bush, data = florida)
diag2 <- autoplot(buch_bush, which = 1:6, ncol = 3)
diag2
# Refit with both variables on the natural-log scale
log_buch_bush <- florida %>%
  mutate(Buchanan = log(Buchanan),
         Bush = log(Bush))
log_fit <- lm(Buchanan ~ Bush, data = log_buch_bush)
diag3 <- autoplot(log_fit, which = 1:6, ncol = 3)
diag3
a: Yes, Palm Beach County is identified as an outlier in all diagnostic plots, with the Residuals vs Fitted Values and Cook's distance plots providing the most telling evidence.
b: After taking the natural log of both variables, the diagnostic plots improve, but Palm Beach County is still identified as an outlier.
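Cook's distance can also be inspected numerically rather than read off a plot. The sketch below uses simulated data with one planted outlier, not the Florida data, and the variable names are hypothetical. It flags observations whose Cook's distance exceeds 4/n, a common rule of thumb rather than a formal test.

```r
# Simulated log-scale data with one planted outlier (NOT alr4::florida).
set.seed(3)
n <- 67
log_bush <- seq(8, 13, length.out = n)
log_buch <- -2 + 0.7 * log_bush + rnorm(n, sd = 0.3)
log_buch[60] <- log_buch[60] + 4     # shift one observation far above the line

fit <- lm(log_buch ~ log_bush)

# Cook's distance combines residual size and leverage for each observation.
cd <- cooks.distance(fit)
flagged <- which(cd > 4 / n)         # rule-of-thumb cutoff

print(unname(which.max(cd)))         # index of the most influential point
print(round(cd[flagged], 3))
```

On the real data, the same calls applied to buch_bush and log_fit would show whether Palm Beach County remains the most influential point after the log transformation.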