---
title: "HW 5"
author: "Karen Detter"
description: "Homework 5"
date: "12/09/2022"
format:
html:
toc: true
code-fold: true
code-copy: true
code-tools: true
categories:
- hw5
---
```{r}
#| label: setup
#| warning: false
library(tidyverse)
knitr::opts_chunk$set(echo = TRUE)
```
# Q.1
## A.
Using backward elimination would result in *Beds* being removed first, because it has the highest p-value of all the predictors.
## B.
Under forward selection, *Size* would be added first, because it has the largest absolute t-statistic (equivalently, the smallest p-value), indicating the greatest improvement over the null model.
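As a cross-check on parts A and B, R's `step()` function can run these searches automatically. This is a minimal sketch, assuming the same `house.selling.price.2` data from the **smss** package used in part D below (where `P`, `S`, `Be`, `Ba`, and `New` abbreviate Price, Size, Beds, Baths, and New); note that `step()` ranks candidates by AIC rather than by p-values, so its path will not necessarily match a hand search based on t-statistics.

```{r}
#sketch: automated stepwise searches (assumes smss::house.selling.price.2)
library(smss)
data(house.selling.price.2)

null_mod <- lm(P ~ 1, data = house.selling.price.2)  #intercept-only model
full_mod <- lm(P ~ ., data = house.selling.price.2)  #all predictors

#backward elimination: start from the full model and drop one term at a time
back <- step(full_mod, direction = "backward", trace = FALSE)

#forward selection: start from the null model and add terms up to the full model
forw <- step(null_mod, scope = formula(full_mod),
             direction = "forward", trace = FALSE)

formula(back)
formula(forw)
```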
## C.
The fact that *Beds* has a substantial correlation with *Price* but a large p-value in the multiple regression model is an indication of small sample size, since the p-value depends on both the strength of the association and the number of observations.
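One way to see the role of sample size: for a simple (one-predictor) regression, the t statistic for the slope can be written in terms of the sample correlation $r$ and the sample size $n$,

$$t = \frac{r\sqrt{n-2}}{\sqrt{1-r^{2}}},$$

so the same correlation yields a larger t statistic, and hence a smaller p-value, when $n$ is large. (In the multiple regression, the analogous relationship holds with the partial correlation of *Beds* and *Price*.)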
## D.
```{r}
#fit the full model and the forward-selection candidate models
library(smss)
data(house.selling.price.2)
full <- lm(P ~ ., data = house.selling.price.2)             #all predictors
forw1 <- lm(P ~ S, data = house.selling.price.2)            #step 1: Size
forw2 <- lm(P ~ S + New, data = house.selling.price.2)      #step 2: add New
forw3 <- lm(P ~ S + New + Ba, data = house.selling.price.2) #step 3: add Baths
```
### a.
I used the forward selection method to fit the candidate models, adding one variable at a time in order of absolute t-statistic (highest to lowest).
$R^{2}$ is highest for the full model with all of the predictors. This is expected, since $R^{2}$ can never decrease when another variable is added.
### b.
Adjusted $R^{2}$ is highest for the model of *Price* as a function of *Size*, *Baths*, and *New*.
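As a quick sketch, the $R^{2}$ and adjusted $R^{2}$ values for the four models fit above can be tabulated side by side:

```{r}
#compare R-squared and adjusted R-squared across the candidate models
models <- list(full = full, forw1 = forw1, forw2 = forw2, forw3 = forw3)
data.frame(
  r.squared     = sapply(models, function(m) summary(m)$r.squared),
  adj.r.squared = sapply(models, function(m) summary(m)$adj.r.squared)
)
```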
### c.
```{r}
#calculate PRESS statistics
PRESS <- function(linear.model) {
  #leave-one-out (deleted) residuals: ordinary residuals divided by (1 - leverage)
  pr <- residuals(linear.model) / (1 - lm.influence(linear.model)$hat)
  #PRESS is the sum of squared leave-one-out residuals
  sum(pr^2)
}
PRESS(full)
PRESS(forw1)
PRESS(forw2)
PRESS(forw3)
```
The model with *Price* as a function of *Size*, *Baths*, and *New* has the lowest PRESS value.
### d.
```{r}
#calculate AIC values
AIC(full, k=2)
AIC(forw1, k=2)
AIC(forw2, k=2)
AIC(forw3, k=2)
```
The model with *Price* as a function of *Size*, *Baths*, and *New* has the lowest AIC value.
### e.
```{r}
#calculate BIC values
BIC(full)
BIC(forw1)
BIC(forw2)
BIC(forw3)
```
The model with *Price* as a function of *Size*, *Baths*, and *New* has the lowest BIC value.
## E.
Since the model with *Price* as a function of *Size*, *Baths*, and *New* has the highest adjusted $R^{2}$ and the lowest PRESS, AIC, and BIC values, I would choose it as the best model.
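As a single side-by-side summary of part D (a sketch reusing the models and the `PRESS()` function defined above):

```{r}
#collect all of the fit criteria for the four candidate models in one table
models <- list(full = full, forw1 = forw1, forw2 = forw2, forw3 = forw3)
data.frame(
  adj.r.squared = sapply(models, function(m) summary(m)$adj.r.squared),
  PRESS         = sapply(models, PRESS),
  AIC           = sapply(models, AIC),
  BIC           = sapply(models, BIC)
)
```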
# Q.2
## A.
```{r}
#fit multiple regression model
trees_full <- lm(Volume ~ Girth + Height, data = trees)
summary(trees_full)
```
## B.
```{r}
#run diagnostic plots
par(mfrow = c(2,3)); plot(trees_full, which = 1:6)
```
From the Residuals vs Fitted plot, the assumption of a linear relationship appears to be violated: the residuals cluster into a pattern instead of being randomly scattered around the zero line.
The assumption of constant variance also appears to be violated: the Scale-Location plot shows heteroskedasticity, with the spread of the standardized residuals changing noticeably across the range of fitted values.
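The visual impression from the Scale-Location plot can be checked with a formal test. This is a sketch using `car::ncvTest()` (the **car** package is loaded later as a dependency of **alr4**), which tests the null hypothesis of constant error variance; a small p-value would support the heteroskedasticity seen in the plot.

```{r}
#| message: false
#score test for non-constant error variance (null hypothesis: constant variance)
library(car)
ncvTest(trees_full)
```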
# Q.3
## a)
```{r}
#| message: false
#alr4 provides the florida vote data (and loads car and effects)
library(alr4)
data(florida)
#regress county-level Buchanan votes on Bush votes
vote <- lm(Buchanan ~ Bush, data = florida)
#produce diagnostic plots
par(mfrow = c(2,3)); plot(vote, which = 1:6)
```
Palm Beach County is a clear outlier in this model: it does not follow the pattern of the other counties in any of the diagnostic plots, and its residual is far larger than the rest.
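To quantify what the plots show, the studentized residuals can be examined directly. A sketch using `outlierTest()` from the **car** package (loaded with **alr4**), which reports the most extreme studentized residual with a Bonferroni-adjusted p-value:

```{r}
#largest studentized residual with a Bonferroni-corrected p-value
outlierTest(vote)

#county with the largest absolute studentized residual
which.max(abs(rstudent(vote)))
```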
## b)
```{r}
#log both variables in model
vote_log <- lm(log(Buchanan) ~ log(Bush), data = florida)
par(mfrow = c(2,3)); plot(vote_log, which = 1:6)
```
Palm Beach County is still at the high end of the data, but it no longer stands out as an outlier; after the log transformation it falls much more in line with the pattern and values of the other counties.
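A quick numerical check of the same point: comparing the largest Cook's distance in the two models fit above shows whether Palm Beach County's influence shrinks after the log transformation.

```{r}
#maximum Cook's distance for the raw-scale and log-log models
max(cooks.distance(vote))
max(cooks.distance(vote_log))

#county with the largest Cook's distance in each model
which.max(cooks.distance(vote))
which.max(cooks.distance(vote_log))
```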