
Dane Shelton


December 9, 2022

a) For backward elimination, Beds would be removed first because it has the highest p-value, indicating it is the least important predictor of Price.

b) For forward selection, New would be the first variable added to the model because it has the lowest p-value, indicating it is the strongest predictor of Price.

c) Beds likely has a high p-value because it is highly collinear with Size, which is itself strongly correlated with the response, Price. Once Size is in the model, Beds adds little further information.
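
The steps in a) through c) can be sketched in R. This is only an illustration under stated assumptions: the data frame `houses` is simulated here (it is not the original house-price data), with Beds deliberately built to be collinear with Size, so the variables it picks need not match the answers above. `drop1()` and `add1()` give the p-values for the first backward and forward steps, and the variance inflation factor for Beds is computed by hand.

```r
# Simulated stand-in for the house data (hypothetical values, not the original dataset).
set.seed(1)
n     <- 100
Size  <- rnorm(n, mean = 1600, sd = 400)
Beds  <- round(Size / 550 + rnorm(n, sd = 0.5))      # built to be collinear with Size
Bath  <- round(1 + Size / 1200 + rnorm(n, sd = 0.4))
New   <- rbinom(n, 1, 0.2)
Price <- 50 + 0.08 * Size + 20 * New + 5 * Bath + rnorm(n, sd = 25)
houses <- data.frame(Price, Size, Beds, Bath, New)

full <- lm(Price ~ Size + Bath + New + Beds, data = houses)

# Backward elimination, first step: p-value for dropping each term from the full model.
back <- drop1(full, test = "F")
drop_first <- rownames(back)[which.max(back[["Pr(>F)"]])]

# Forward selection, first step: p-value for adding each term to the intercept-only model.
null <- lm(Price ~ 1, data = houses)
fwd  <- add1(null, scope = ~ Size + Bath + New + Beds, test = "F")
add_first <- rownames(fwd)[which.min(fwd[["Pr(>F)"]])]

# VIF for Beds: 1 / (1 - R^2) from regressing Beds on the other predictors.
r2_beds  <- summary(lm(Beds ~ Size + Bath + New, data = houses))$r.squared
vif_beds <- 1 / (1 - r2_beds)

print(c(drop_first = drop_first, add_first = add_first))
print(vif_beds)
```

A VIF well above 1 for Beds would be consistent with the collinearity explanation in c).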


fit1: Price ~ Size + Bath + New + Beds

fit2: Price ~ Size + Bath + New

fit3: Price ~ Size + New

Model Evaluation

R2: fit1: .8689

fit2: .8681

fit3: .8484

Using R2 as the model selection criterion would result in fit1, the full model, being selected, as it has the highest R2 value. This is expected: R2 never decreases when a variable is added.

Adjusted R2: fit1: .8629

fit2: .8637

fit3: .845

Because Adjusted R2 penalizes models for additional variables, fit2 would now be selected as the best-fitting model (highest value).


Using the PRESS statistic as our model selection criterion would select fit2, the model with the lowest value.


Selecting a model using AIC as our evaluation criterion, fit2 is determined to be the best model (lowest value).


Selecting a model using BIC as our evaluation criterion, fit2 is determined to be the best model (lowest value).
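
The five criteria above can be collected with one small helper. This is a minimal sketch, not the code used to produce the values reported here: `selection_criteria` is a hypothetical helper name, and the two models are fitted to R's built-in `trees` data as a stand-in for the house-price fits. PRESS uses the leave-one-out identity, in which each residual is divided by one minus its leverage.

```r
# Hypothetical helper: selection criteria for a fitted lm object.
selection_criteria <- function(fit) {
  s <- summary(fit)
  c(R2    = s$r.squared,
    adjR2 = s$adj.r.squared,
    # PRESS: sum of squared leave-one-out prediction errors, e_i / (1 - h_ii)
    PRESS = sum((residuals(fit) / (1 - hatvalues(fit)))^2),
    AIC   = AIC(fit),
    BIC   = BIC(fit))
}

# Stand-in models on the built-in trees data (not the house-price data).
m_full  <- lm(Volume ~ Girth + Height, data = trees)
m_small <- lm(Volume ~ Girth, data = trees)

# One column per model, one row per criterion.
crit <- sapply(list(full = m_full, small = m_small), selection_criteria)
print(round(crit, 3))
```

Higher is better for R2 and Adjusted R2; lower is better for PRESS, AIC, and BIC.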

d: I prefer fit2. It is selected by every criterion except unadjusted R2 (which always favors the full model) and it avoids extraneous variables. Because Beds is highly correlated with Size, it does not need to be included in the model, and Bath provides completeness rather than predicting Price from only the Size of a house and whether it is New.


library(ggfortify)  # provides autoplot() for lm objects

fit1 <- lm(Volume ~ Girth + Height, data = trees)

diag1 <- autoplot(fit1, which = 1:6, ncol = 3)
In the Residuals vs Fitted Values plot, the residuals show a clear pattern rather than a random scatter about zero. This indicates that the model violates the homoskedasticity assumption, under which all residuals share the same variance regardless of fitted value.
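
The same residuals-vs-fitted view can be rebuilt with base R, which makes it easier to inspect the numbers behind the plot. This is a rough sketch: the names `fit`, `res_df`, and `trend` are introduced here for illustration, and the lowess smoother is an assumed stand-in for the trend line that the diagnostic plot draws.

```r
# Refit the trees model so this block runs on its own.
fit <- lm(Volume ~ Girth + Height, data = trees)

# Fitted values and residuals, one row per tree.
res_df <- data.frame(fitted = fitted(fit), resid = residuals(fit))

# Smoothed trend of the residuals; a flat line near zero is what we hope to see.
trend <- lowess(res_df$fitted, res_df$resid)

plot(res_df$fitted, res_df$resid,
     xlab = "Fitted values", ylab = "Residuals",
     main = "Residuals vs Fitted")
abline(h = 0, lty = 2)      # reference line at zero
lines(trend, col = "red")   # smoothed trend in the residuals
```

If the assumptions held, the points would form a band of roughly constant width around the dashed line, with no systematic trend.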

florida <- alr4::florida

buch_bush <- lm(`Buchanan` ~ `Bush`, data=florida)

diag2 <- autoplot(buch_bush, 1:6, ncol=3)
library(dplyr)  # provides %>% and mutate()

log_buch_bush <- florida %>%
  mutate(Buchanan = log(Buchanan),
         Bush = log(Bush))

log_fit <- lm(`Buchanan` ~ `Bush`, log_buch_bush)

diag3 <- autoplot(log_fit, 1:6, ncol=3)
a: Yes, Palm Beach County is identified as an outlier in all diagnostic plots, with the Residuals vs Fitted Values and Cook's Distance plots providing the most telling evidence.

b: After taking the natural log of both variables, the diagnostic plots improve, but Palm Beach County is still identified as an outlier.
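
The Cook's distance evidence mentioned in a) can also be read off numerically. The sketch below uses hypothetical toy data (not the Florida returns) with one planted outlier, so that it runs without the alr4 package. The 4/n cutoff is a common rule of thumb and an assumption here, not something taken from the plots above.

```r
# Hypothetical toy data: a clean linear trend plus one shifted point.
x <- 1:20
y <- 3 + 2 * x + rep(c(-1, 1), 10)
y[20] <- y[20] + 40                    # plant a single outlier at observation 20
toy <- data.frame(x, y)

toy_fit <- lm(y ~ x, data = toy)
cd <- cooks.distance(toy_fit)          # one influence value per observation

most_influential <- unname(which.max(cd))
flagged <- which(cd > 4 / nrow(toy))   # rule-of-thumb cutoff (assumed)

print(most_influential)
print(flagged)
```

Applied to buch_bush or log_fit, the same calls would show whether Palm Beach County has the largest Cook's distance before and after the log transformation.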