Author

Karen Kimble

Published

December 9, 2022

Code
# Setup
# Load plyr before the tidyverse so that dplyr's verbs (mutate,
# summarise, count, etc.) are not masked, as plyr's own startup
# message recommends.
library(plyr)
library(tidyverse)
Code
library(alr4) # also loads car, carData, and effects; provides the florida data
Code
library(smss)
data(house.selling.price.2) # load the dataset used in Question 1

Question 1

Part A

The first variable to be deleted would be Beds because it has the largest p-value; backward elimination starts from the full model and removes the least significant variable (the one with the largest p-value) at each step.
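
As a quick check (a sketch assuming the smss data's variable names P, S, Be, Ba, and New), the full-model coefficient table shows which predictor has the largest p-value:

Code
# P-values from the full model; backward elimination removes the
# predictor with the largest one first
full <- lm(P ~ ., data = house.selling.price.2)
summary(full)$coefficients[, "Pr(>|t|)"]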

Part B

The first variable added in forward selection would be Size because, among the candidate predictors, it has the smallest p-value.
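
Forward selection's first step can be reproduced with stats::add1(), which evaluates each one-variable addition to the intercept-only model (a sketch, with the same variable-name assumption as above):

Code
# F-tests for adding each candidate predictor to the null model
null <- lm(P ~ 1, data = house.selling.price.2)
add1(null, scope = ~ S + Be + Ba + New, test = "F")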

Part C

Beds has such a large p-value despite its correlation with price because it is also strongly correlated with the other predictors. Once those variables are in the model, Beds contributes little unique information; this multicollinearity inflates its standard error and therefore its p-value.
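
Multicollinearity can be quantified with variance inflation factors from car::vif() (car is loaded via alr4); a minimal sketch:

Code
# VIFs near 1 indicate little collinearity; larger values mean a
# predictor is substantially explained by the other predictors
vif(lm(P ~ ., data = house.selling.price.2))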

Part D

Code
# Model 1: backward elimination from the full model
model <- lm(P ~ ., data = house.selling.price.2)
model1 <- step(model, direction = "backward")
Code
summary(model1)
Code
# Model 2: forward selection; step() starts from the full model here
# and, with no wider scope supplied, retains all four predictors
model2 <- step(model, direction = "forward")
Code
summary(model2)

Based on R-squared, we would pick Model 2 since it has the higher R-squared. This model includes all four predictor variables and was selected using forward selection; because R-squared never decreases when a predictor is added, it always favors the larger model. Judging by the adjusted R-squared criterion, however, we would pick Model 1, which was chosen using backward elimination and does not include the Beds variable.
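
Both statistics can be read directly off the model summaries (a sketch reusing model1 and model2 from above):

Code
# R-squared and adjusted R-squared for each model
c(R2.model1 = summary(model1)$r.squared, R2.model2 = summary(model2)$r.squared)
c(adjR2.model1 = summary(model1)$adj.r.squared, adjR2.model2 = summary(model2)$adj.r.squared)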

Code
# Calculating PRESS: the sum of squared leave-one-out prediction
# errors. For a linear model these can be obtained without refitting,
# via the identity e(i) = e_i / (1 - h_ii), where h_ii is the hat value.

PRESS <- function(linear.model) {
  # leave-one-out prediction errors
  pr <- residuals(linear.model) / (1 - lm.influence(linear.model)$hat)
  sum(pr^2)
}

PRESS(model1)
Code
PRESS(model2)

By the PRESS criterion (lower is better), we would pick Model 1, which was found using backward elimination and includes only New, Bath, and Size as predictor variables.
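
As a sanity check, the shortcut formula above can be verified against an explicit leave-one-out loop; PRESS_loo() is a hypothetical helper, not part of the assignment:

Code
# Hypothetical check: recompute PRESS by refitting the model n times,
# each time predicting the single held-out observation
PRESS_loo <- function(m) {
  d <- model.frame(m)
  sum(sapply(seq_len(nrow(d)), function(i) {
    fit <- update(m, data = d[-i, ])
    (d[[1]][i] - predict(fit, newdata = d[i, , drop = FALSE]))^2
  }))
}
# PRESS_loo(model1) should equal PRESS(model1) up to rounding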

Code
AIC(model1)
Code
AIC(model2)

Judging by the AIC, Model 1 is a better fit because it has a lower AIC.
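
As an aside, step() ranks candidate models with stats::extractAIC(), which omits constants from the log-likelihood, so its absolute values differ from AIC()'s; the differences between models, and hence the ranking, are the same. A quick sketch:

Code
# extractAIC() returns c(edf, AIC up to an additive constant);
# model-to-model differences match those from AIC()
extractAIC(model1)
extractAIC(model2)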

Code
BIC(model1)
Code
BIC(model2)

Model 1 is also the better fit by the BIC criterion because it has the lower BIC. Overall, Model 1 is the best choice, since it wins on four of the five criteria.
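
To keep the comparison in one place, the five criteria can be collected into a small table (a sketch reusing the objects and the PRESS() function defined above):

Code
# Side-by-side summary of all five selection criteria
data.frame(
  model = c("Model 1", "Model 2"),
  R2    = c(summary(model1)$r.squared, summary(model2)$r.squared),
  adjR2 = c(summary(model1)$adj.r.squared, summary(model2)$adj.r.squared),
  PRESS = c(PRESS(model1), PRESS(model2)),
  AIC   = c(AIC(model1), AIC(model2)),
  BIC   = c(BIC(model1), BIC(model2))
)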

Part E

As stated above, Model 1 performs better on four of the five criteria (adjusted R-squared, PRESS, AIC, and BIC). Thus, the model I would prefer overall is Model 1, which omits the Beds variable. Beds also had an extremely high p-value compared to the other variables, so it makes sense to fit the model without it.

Question 2

Part A

Code
model <- lm(Volume ~ Girth + Height, data = trees)
summary(model)

Call:
lm(formula = Volume ~ Girth + Height, data = trees)

Residuals:
    Min      1Q  Median      3Q     Max 
-6.4065 -2.6493 -0.2876  2.2003  8.4847 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -57.9877     8.6382  -6.713 2.75e-07 ***
Girth         4.7082     0.2643  17.816  < 2e-16 ***
Height        0.3393     0.1302   2.607   0.0145 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.882 on 28 degrees of freedom
Multiple R-squared:  0.948, Adjusted R-squared:  0.9442 
F-statistic:   255 on 2 and 28 DF,  p-value: < 2.2e-16

Part B

Code
plot(model)

The diagnostic plots suggest that some regression assumptions are violated. In the Residuals vs. Fitted plot, the smoothed line is clearly curved, indicating that the relationship is not linear in these predictors. The Scale-Location plot also shows a trend rather than a flat line, suggesting that the constant-variance assumption is violated. The other two plots, Normal Q-Q and Residuals vs. Leverage, look unremarkable.
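
These visual impressions can be checked formally; as a sketch, the car package (already loaded via alr4) provides a score test for non-constant variance and curvature tests for the residual plots:

Code
# Score test of the constant-variance assumption
ncvTest(model)
# Residual plots with Tukey curvature tests for each predictor
residualPlots(model)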

Question 3

Part A

Code
model <- lm(Buchanan ~ Bush, data = florida)
plot(model)

Palm Beach is an outlier based on the diagnostic plots for the model: while the other counties' points cluster fairly close together, the Palm Beach point sits far from the rest in every plot. Additionally, in the Residuals vs. Leverage plot, Palm Beach falls outside the Cook's distance contour, meaning it is an influential outlier with a large effect on the fitted model.
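
The same conclusion can be reached numerically (a sketch; the florida data's row names are assumed to identify the counties):

Code
# Rank observations by Cook's distance; the top value should be the
# Palm Beach point seen in the Residuals vs. Leverage plot
sort(cooks.distance(model), decreasing = TRUE)[1:3]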

Part B

Code
model <- lm(log(Buchanan) ~ log(Bush), data = florida)
plot(model)

The findings do change somewhat: in the log-log model, the Palm Beach point falls inside the Cook's distance contour, meaning it has much less influence on the fit and is less of an outlier. The gap between Palm Beach and the other counties is also reduced; it still appears to be a mild outlier, but far less so than in the previous model.
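
A quick numeric check (a sketch; m_raw and m_log are hypothetical names refitting the two models above) shows the drop in influence:

Code
# Compare the largest Cook's distance before and after the log
# transformation; in the raw fit the maximum is the Palm Beach point
m_raw <- lm(Buchanan ~ Bush, data = florida)
m_log <- lm(log(Buchanan) ~ log(Bush), data = florida)
max(cooks.distance(m_raw))
max(cooks.distance(m_log))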