hw5
shelton
Author

Dane Shelton

Published

December 9, 2022

a) In backward elimination, Beds would be removed first: it has the highest p-value, indicating it is the least important variable for predicting Price once the others are in the model.
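
A single backward-elimination step can be sketched with base R's drop1(). The house data is not reproduced here, so this sketch assumes the built-in mtcars data and an arbitrary set of predictors as stand-ins; the variable names are illustrative only.

```r
# Stand-in data: mtcars replaces the house data, which is not available here.
full <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)

# Single-term deletions: each row tests dropping that one term from `full`.
tab <- drop1(full, test = "F")

# The "<none>" row has an NA p-value; which.max() skips NAs.
worst <- rownames(tab)[which.max(tab[["Pr(>F)"]])]

# Refit without the least significant term (one backward-elimination step).
reduced <- update(full, as.formula(paste(". ~ . -", worst)))
```

For single-degree-of-freedom terms, the F-test p-values in `tab` match the t-test p-values printed by summary().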

b) In forward selection, New would be the first variable added to the model, as it has the lowest p-value, indicating it is the strongest predictor of Price in the model.
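
A single forward-selection step can be sketched with base R's add1(). As before, mtcars and the candidate predictors are assumed stand-ins, since the house data is not reproduced here.

```r
# Stand-in data: mtcars replaces the house data, which is not available here.
null <- lm(mpg ~ 1, data = mtcars)

# Single-term additions: each row tests adding that one candidate to `null`.
tab <- add1(null, scope = ~ wt + hp + qsec + drat, test = "F")

# The candidate with the smallest p-value enters first.
best <- rownames(tab)[which.min(tab[["Pr(>F)"]])]

# One forward-selection step.
step1 <- update(null, as.formula(paste(". ~ . +", best)))
```

In this sketch the p-values being compared come from one-predictor fits, because the starting model contains only an intercept.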

c) Beds likely has a high p-value because it is highly collinear with Size. Since Size is itself highly correlated with the response Price, Beds adds little information once Size is in the model.
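
The effect of collinearity on a coefficient's precision can be illustrated with a variance inflation factor computed by hand. This is a sketch on simulated data; the variable names, coefficients, and noise levels are all assumptions chosen to mimic the Size/Beds situation.

```r
set.seed(1)  # simulated data; all names and coefficients are illustrative
n     <- 100
size  <- rnorm(n)
beds  <- 2 * size + rnorm(n, sd = 0.3)   # beds is nearly a linear function of size
price <- 60 * size + rnorm(n, sd = 10)   # price depends on size only

fit <- lm(price ~ size + beds)

# Variance inflation factor for beds: 1 / (1 - R^2) from regressing beds on size.
r2_beds  <- summary(lm(beds ~ size))$r.squared
vif_beds <- 1 / (1 - r2_beds)

# The sampling variance of the beds coefficient is inflated by vif_beds
# relative to a design in which beds is uncorrelated with size.
se_beds <- coef(summary(fit))["beds", "Std. Error"]
```

A large `vif_beds` means the standard error of the beds coefficient is large, which pushes its t statistic toward zero and its p-value up.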

d)

Code
(summary(fit1))
## 
## Call:
## lm(formula = P ~ S + Ba + New + Be, data = house)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -36.212  -9.546   1.277   9.406  71.953 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -41.795     12.104  -3.453 0.000855 ***
## S             64.761      5.630  11.504  < 2e-16 ***
## Ba            19.203      5.650   3.399 0.001019 ** 
## New           18.984      3.873   4.902  4.3e-06 ***
## Be            -2.766      3.960  -0.698 0.486763    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 16.36 on 88 degrees of freedom
## Multiple R-squared:  0.8689, Adjusted R-squared:  0.8629 
## F-statistic: 145.8 on 4 and 88 DF,  p-value: < 2.2e-16

(summary(fit2))
## 
## Call:
## lm(formula = P ~ S + Ba + New, data = house)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -34.804  -9.496   0.917   7.931  73.338 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -47.992      8.209  -5.847 8.15e-08 ***
## S             62.263      4.335  14.363  < 2e-16 ***
## Ba            20.072      5.495   3.653 0.000438 ***
## New           18.371      3.761   4.885 4.54e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 16.31 on 89 degrees of freedom
## Multiple R-squared:  0.8681, Adjusted R-squared:  0.8637 
## F-statistic: 195.3 on 3 and 89 DF,  p-value: < 2.2e-16

(summary(fit3))
## 
## Call:
## lm(formula = P ~ S + New, data = house)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -47.207  -9.763  -0.091   9.984  76.405 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -26.089      5.977  -4.365 3.39e-05 ***
## S             72.575      3.508  20.690  < 2e-16 ***
## New           19.587      3.995   4.903 4.16e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 17.4 on 90 degrees of freedom
## Multiple R-squared:  0.8484, Adjusted R-squared:  0.845 
## F-statistic: 251.8 on 2 and 90 DF,  p-value: < 2.2e-16

Models

fit1: Price ~ Size + Bath + New + Beds

fit2: Price ~ Size + Bath + New

fit3: Price ~ Size + New

Model Evaluation

R2

fit1: .8689

fit2: .8681

fit3: .8484

Using R2 as the model selection criterion would result in fit1, the full model, being selected, as it has the highest R2 value. This is expected: R2 never decreases when a predictor is added.

Adjusted R2

fit1: .8629

fit2: .8637

fit3: .845

Because Adjusted R2 penalizes models for additional variables, fit2 would now be selected as the best-fitting model.
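
The penalty can be checked by recomputing Adjusted R2 from the reported R2 values. This is a small sketch; the sample size n = 93 is inferred from the residual degrees of freedom plus the number of coefficients in the summaries (88 + 5), and the helper name `adj_r2` is hypothetical.

```r
# Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)
adj_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)

n  <- 93                                               # 88 residual df + 5 coefficients
r2 <- c(fit1 = 0.8689, fit2 = 0.8681, fit3 = 0.8484)   # from the summary() output
p  <- c(fit1 = 4, fit2 = 3, fit3 = 2)                  # predictors per model

adj <- adj_r2(r2, n, p)
round(adj, 3)
```

The recomputed values agree with the reported ones up to rounding of the inputs, and fit2 has the largest.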

PRESS

Code
(PRESS(fit1)$stat)
## .........10.........20.........30.........40.........50
## .........60.........70.........80.........90...
## [1] 28390.22
(PRESS(fit2)$stat)
## .........10.........20.........30.........40.........50
## .........60.........70.........80.........90...
## [1] 27860.05
(PRESS(fit3)$stat)
## .........10.........20.........30.........40.........50
## .........60.........70.........80.........90...
## [1] 31066

Using the PRESS statistic as our model selection criterion would select fit2, the model with the lowest value.
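
PRESS is the sum of squared leave-one-out prediction errors, and for a linear model it can be obtained from the ordinary residuals and leverages without refitting. The sketch below checks that shortcut against brute-force refitting; it assumes the built-in trees data as a stand-in model, since the house data is not reproduced here.

```r
# Stand-in model on the built-in trees data (the house data is not available here).
fit <- lm(Volume ~ Girth + Height, data = trees)

# Shortcut: leave-one-out residual = ordinary residual / (1 - leverage).
press_hat <- sum((resid(fit) / (1 - hatvalues(fit)))^2)

# Brute force: refit without row i, predict row i, accumulate squared error.
press_loo <- sum(sapply(seq_len(nrow(trees)), function(i) {
  fit_i <- lm(Volume ~ Girth + Height, data = trees[-i, ])
  (trees$Volume[i] - predict(fit_i, newdata = trees[i, ]))^2
}))

all.equal(press_hat, press_loo)
```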

AIC

Code
(AIC(fit1))
## [1] 790.6225
(AIC(fit2))
## [1] 789.1366
(AIC(fit3))
## [1] 800.1262

Using AIC as our evaluation criterion, fit2 is determined to be the best model (lowest value).
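
For a Gaussian linear model, AIC can be rebuilt from quantities printed by summary(). The sketch below does this from the residual standard errors and degrees of freedom reported earlier; the helper name `aic_lm` is hypothetical, and agreement with AIC() is only up to rounding of the printed standard errors.

```r
# AIC for a Gaussian linear model, rebuilt from printed summary() values.
# k counts the regression coefficients plus one for the error variance.
aic_lm <- function(sigma, df_resid, n_coef) {
  n   <- df_resid + n_coef
  rss <- sigma^2 * df_resid
  k   <- n_coef + 1
  n * log(rss / n) + n * (1 + log(2 * pi)) + 2 * k
}

# Residual standard error, residual df, and coefficient count for each model.
aic <- c(
  fit1 = aic_lm(16.36, 88, 5),
  fit2 = aic_lm(16.31, 89, 4),
  fit3 = aic_lm(17.40, 90, 3)
)
round(aic, 1)
```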

BIC

Code
(BIC(fit1))
## [1] 805.8181
(BIC(fit2))
## [1] 801.7996
(BIC(fit3))
## [1] 810.2566

Using BIC as our evaluation criterion, fit2 is again determined to be the best model (lowest value).
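
BIC differs from AIC only in its penalty: log(n) per parameter instead of 2. As a quick consistency check on the reported values, the sketch below compares BIC minus AIC with k * (log(n) - 2), taking n = 93 from the summaries and counting the error variance as a parameter.

```r
n <- 93                                   # 88 residual df + 5 coefficients in fit1
k <- c(fit1 = 6, fit2 = 5, fit3 = 4)      # coefficients + 1 for the error variance

aic <- c(fit1 = 790.6225, fit2 = 789.1366, fit3 = 800.1262)  # reported AIC values
bic <- c(fit1 = 805.8181, fit2 = 801.7996, fit3 = 810.2566)  # reported BIC values

# BIC replaces AIC's penalty of 2 per parameter with log(n) per parameter,
# so BIC - AIC should equal k * (log(n) - 2).
penalty_gap <- k * (log(n) - 2)
cbind(reported = bic - aic, expected = penalty_gap)
```

Because log(93) is larger than 2, BIC penalizes each extra variable more heavily than AIC does here.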

d: I prefer fit2. It is the best model under every criterion that penalizes complexity (Adjusted R2, PRESS, AIC, and BIC) and avoids extraneous variables. Because Beds is highly correlated with Size, it does not need to be included in the model. Bath is worth keeping: it is significant in fit2 (p = 0.000438), and dropping it (fit3) worsens every criterion compared with predicting Price from only the Size of a house and whether it is New.

Code
#A
library(ggfortify)  # provides the autoplot() method for lm objects

data("trees")
#trees

fit1 <- lm(Volume ~ Girth + Height, data = trees)

#B
# Six diagnostic plots (residuals vs fitted, Q-Q, Cook's distance, ...) in a 3-column grid
diag1 <- autoplot(fit1, which = 1:6, ncol = 3)
diag1

Evaluating the Residuals vs Fitted Values plot, we see a clear pattern in the residuals. This indicates that the model violates the homoskedasticity assumption, which requires that all residuals share the same variance regardless of fitted value.
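
A visual impression of non-constant variance can be backed up with a formal check. The sketch below hand-rolls a studentized Breusch-Pagan style test in base R; it is an illustration only and not part of the original analysis, and it checks for non-constant variance only, not for other kinds of residual pattern.

```r
fit <- lm(Volume ~ Girth + Height, data = trees)

# Auxiliary regression: squared residuals on the original predictors.
e2  <- resid(fit)^2
aux <- lm(e2 ~ Girth + Height, data = trees)

# Studentized (Koenker) statistic: n * R^2 of the auxiliary regression,
# compared with a chi-squared distribution (df = number of predictors).
bp_stat <- nrow(trees) * summary(aux)$r.squared
bp_p    <- pchisq(bp_stat, df = 2, lower.tail = FALSE)

c(statistic = bp_stat, p.value = bp_p)
```

A small p-value would support the conclusion that the residual variance changes with the predictors.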

Code
library(dplyr)      # %>% and mutate()
library(ggfortify)  # provides the autoplot() method for lm objects

florida <- alr4::florida

buch_bush <- lm(Buchanan ~ Bush, data = florida)

diag2 <- autoplot(buch_bush, which = 1:6, ncol = 3)
diag2

# Log-transform both variables, then refit
log_buch_bush <- florida %>%
  mutate(Buchanan = log(Buchanan),
         Bush = log(Bush))

log_fit <- lm(Buchanan ~ Bush, data = log_buch_bush)

diag3 <- autoplot(log_fit, which = 1:6, ncol = 3)
diag3

a: Yes, Palm Beach County is identified as an outlier in all diagnostic plots, with the Residuals vs Fitted Values and Cook's Distance plots providing the most telling evidence.

b: After taking the natural log of both variables, the diagnostic plots improve, but Palm Beach County is still identified as an outlier.
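
The way Cook's distance singles out one observation can be sketched in base R. The data below is synthetic with one planted outlier (it is not the Florida data), and the 4/n cutoff is a common rule of thumb assumed here for illustration.

```r
# Synthetic data with one planted outlier (illustrative; not the Florida data).
x <- 1:20
y <- 2 + 3 * x + sin(x)   # deterministic wiggle instead of random noise
y[20] <- y[20] + 50       # plant an outlier at a high-leverage point

fit <- lm(y ~ x)
cd  <- cooks.distance(fit)

# Rule of thumb (an assumption here): flag points with D > 4 / n.
flagged          <- which(cd > 4 / length(cd))
most_influential <- which.max(cd)
```

Applied to the county-level fits, the same calls would report which rows carry the most influence on the regression line.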