hw5
challenge5
Author: Jerin Jacob

Published: May 13, 2023

Code
library(readxl)
library(dplyr)
library(magrittr)
library(alr4)
library(smss)
knitr::opts_chunk$set(echo = TRUE)

Question 1: A) For backward elimination, Beds would be deleted first because it has the largest p-value. A p-value of .487 is very high by any conventional level of significance.

B) For forward selection, I would add Size first: it has the largest correlation with the outcome, and it is statistically significant.

C) Once the other variables are controlled for, most of the explanation done by Beds is instead done by those variables. For example, houses with more beds also tend to be larger in size. When Beds is the only explanatory variable, the model attributes the effect of Size to Beds. However, once Size is also in the model and thus controlled for, the p-value of Beds gets larger, because Beds has little explanatory power for the variation in Price that remains after accounting for Size.
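The forward-selection starting point can be checked directly from correlations. A minimal sketch using base R's mtcars (a built-in dataset chosen only so the block is self-contained; the same idea applies to the house data):

```r
# Forward selection adds first the predictor most correlated with the outcome.
# Illustrated on base R's mtcars, with mpg as the outcome.
cors <- cor(mtcars[, -1], mtcars$mpg)          # each predictor's correlation with mpg
first_pick <- rownames(cors)[which.max(abs(cors))]
first_pick                                     # predictor with the largest |correlation|
```

Here weight (wt) has the largest absolute correlation with mpg, so it would enter the model first.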

Code
data("house.selling.price.2", package = "smss")
head(house.selling.price.2)
      P    S Be Ba New
1  48.5 1.10  3  1   0
2  55.0 1.01  3  2   0
3  68.0 1.45  3  2   0
4 137.0 2.40  3  3   0
5 309.4 3.30  4  3   1
6  17.5 0.40  1  1   0

Running all regression models

Code
full_model <- lm(P ~ ., data = house.selling.price.2)
model_noBeds <- lm(P ~ .-Be, data = house.selling.price.2)
model_noBeds_noBaths <- lm(P ~ S + New, data = house.selling.price.2)
model_size_only <- lm(P ~ S, data = house.selling.price.2)

Creating functions to get R-squared, Adjusted R-squared, and PRESS

Code
rsquared <- function(fit) summary(fit)$r.squared
adj_rsquared <- function(fit) summary(fit)$adj.r.squared
PRESS <- function(fit) {
  pr <- residuals(fit)/(1-lm.influence(fit)$hat)
  sum(pr^2)
}
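The hat-value shortcut inside PRESS() can be sanity-checked against brute-force leave-one-out residuals. A sketch using base R's trees data (any lm fit would do):

```r
fit <- lm(Volume ~ Girth + Height, data = trees)

# Shortcut: e_i / (1 - h_ii) is the leave-one-out (deleted) residual
press_hat <- sum((residuals(fit) / (1 - lm.influence(fit)$hat))^2)

# Brute force: drop each observation, refit, and predict it
loo <- sapply(seq_len(nrow(trees)), function(i) {
  f <- lm(Volume ~ Girth + Height, data = trees[-i, ])
  trees$Volume[i] - predict(f, newdata = trees[i, ])
})
press_loo <- sum(loo^2)

all.equal(press_hat, press_loo)  # the two computations agree
```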

For AIC and BIC, the built-in functions AIC() and BIC() can be used.
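For an lm fit, AIC() can also be reproduced by hand from the Gaussian log-likelihood, which makes the complexity penalty explicit. A sketch on base R's trees data (assuming an unweighted fit):

```r
fit <- lm(Volume ~ Girth + Height, data = trees)
n   <- nrow(trees)
rss <- sum(residuals(fit)^2)
k   <- length(coef(fit)) + 1   # parameters counted: coefficients plus the error variance

# Gaussian log-likelihood at the MLE gives AIC = -2*logLik + 2*k
aic_manual <- n * (log(2 * pi) + log(rss / n) + 1) + 2 * k
all.equal(aic_manual, AIC(fit))  # matches AIC()
```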

Now, applying the functions to model objects

Code
models <- list(full_model, model_noBeds, model_noBeds_noBaths, model_size_only)
data.frame(models = c('full_model', 'model_noBeds', 'model_noBeds&Baths', 'model_only_size'),
           rSquared = sapply(models, rsquared),
           adj_rSquared = sapply(models, adj_rsquared),
           PRESS = sapply(models, PRESS),
           AIC = sapply(models, AIC),
           BIC = sapply(models, BIC)) |>
  print()
              models  rSquared adj_rSquared    PRESS      AIC      BIC
1         full_model 0.8688630    0.8629022 28390.22 790.6225 805.8181
2       model_noBeds 0.8681361    0.8636912 27860.05 789.1366 801.7996
3 model_noBeds&Baths 0.8483699    0.8450003 31066.00 800.1262 810.2566
4    model_only_size 0.8078660    0.8057546 38203.29 820.1439 827.7417

For R-Squared and Adjusted R-Squared, larger values are preferred, while for PRESS, AIC, and BIC, smaller values are preferred.

Here, the model using all the variables (full_model) has the highest R-Squared, while the model without Beds (model_noBeds) has the highest Adjusted R-Squared.

model_noBeds also has the lowest PRESS, AIC, and BIC.

Since R-Squared always increases as new variables are added, it is not useful for comparing these models. Considering the other four criteria, we can select the model without Beds (model_noBeds), as its Adjusted R-Squared is the highest and its PRESS, AIC, and BIC are the lowest.
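This kind of elimination can also be automated with base R's step(), which drops terms as long as AIC improves. A sketch on the built-in mtcars data (used so the block runs without the smss package; the same call works on full_model):

```r
full    <- lm(mpg ~ ., data = mtcars)
reduced <- step(full, direction = "backward", trace = 0)  # AIC-guided backward elimination
formula(reduced)   # the surviving predictors
```

By construction, the reduced model's AIC is no worse than the full model's.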

Question 2:

Loading the data

Code
head(trees)
  Girth Height Volume
1   8.3     70   10.3
2   8.6     65   10.3
3   8.8     63   10.2
4  10.5     72   16.4
5  10.7     81   18.8
6  10.8     83   19.7
  A) Fitting a model with Volume as the outcome and Girth and Height as the explanatory variables
Code
tree_model <- lm(Volume ~ Girth + Height, data = trees)
summary(tree_model)

Call:
lm(formula = Volume ~ Girth + Height, data = trees)

Residuals:
    Min      1Q  Median      3Q     Max 
-6.4065 -2.6493 -0.2876  2.2003  8.4847 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -57.9877     8.6382  -6.713 2.75e-07 ***
Girth         4.7082     0.2643  17.816  < 2e-16 ***
Height        0.3393     0.1302   2.607   0.0145 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.882 on 28 degrees of freedom
Multiple R-squared:  0.948, Adjusted R-squared:  0.9442 
F-statistic:   255 on 2 and 28 DF,  p-value: < 2.2e-16
  B) Creating regression diagnostic plots
Code
par(mfrow=c(2,3))
plot(tree_model, which = 1:6)

The Residuals vs Fitted plot should bounce randomly around the 0 line. Here it shows a curve, which suggests a violation of the linearity assumption. The Scale-Location plot should be approximately horizontal, but here it is also curved, even though not strongly so; this suggests a violation of the constant-variance assumption. Cook's distance for observation 31 is clearly larger than the informal 4/n cutoff, 4/31 ≈ 0.129, so observation 31 is an influential observation that may distort the fit.
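The Cook's distance claim can be checked numerically rather than read off the plot; a short sketch:

```r
# Flag influential points using the informal 4/n rule of thumb
tree_model <- lm(Volume ~ Girth + Height, data = trees)
cd <- cooks.distance(tree_model)
which(cd > 4 / nrow(trees))   # observations above the 4/n cutoff, including 31
```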

Question 3: A)

Code
data('florida')

florida_model <- lm(Buchanan ~ Bush, data = florida)
par(mfrow = c(2,3))
plot(florida_model, which = 1:6)

From the plots, Palm Beach County is clearly an outlier.

Code
log_florida_model <- lm(log(Buchanan) ~ log(Bush), data = florida)
par(mfrow = c(2,3))
plot(log_florida_model, which = 1:6)

After log-transforming both variables, Palm Beach County becomes less of an outlier but still stands out.