Homework 5
Author

Rahul Somu

Published

May 16, 2023

Question 1

A. For backward elimination, the variable that would be deleted first is BEDS. Backward elimination starts from the full model and removes the predictor with the largest p-value, and BEDS has the highest p-value of the four variables.

B. For forward selection, the variable that would be added first is SIZE. Forward selection starts from the intercept-only model and adds the predictor with the smallest p-value in a one-predictor model, which is the predictor most strongly correlated with PRICE.
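The first step of each procedure can be sketched on a stand-in data set. This is a minimal sketch, assuming R's built-in mtcars data with mpg as the response and four arbitrarily chosen predictors (wt, hp, qsec, drat). None of these names come from the homework data, and the variable names `drop_first` and `add_first` are illustrative.

```r
# Stand-in data: built-in mtcars (NOT the homework's house price data).
response   <- "mpg"
predictors <- c("wt", "hp", "qsec", "drat")

# Backward elimination, step 1: fit the full model and find the predictor
# with the largest p-value.
full_fit   <- lm(reformulate(predictors, response), data = mtcars)
full_p     <- summary(full_fit)$coefficients[predictors, "Pr(>|t|)"]
drop_first <- names(which.max(full_p))

# Forward selection, step 1: fit each one-predictor model and find the
# predictor with the smallest p-value.
single_p <- sapply(predictors, function(v) {
  fit <- lm(reformulate(v, response), data = mtcars)
  summary(fit)$coefficients[v, "Pr(>|t|)"]
})
add_first <- names(which.min(single_p))

# In a one-predictor model the p-value depends only on |correlation|, so the
# same variable also has the largest absolute correlation with the response.
abs_cor <- abs(cor(mtcars[predictors], mtcars[[response]]))[, 1]

cat("Backward elimination drops first:", drop_first, "\n")
cat("Forward selection adds first:    ", add_first, "\n")
```

The two procedures need not agree on which variables matter: backward elimination judges each predictor given all the others, while the first forward step judges each predictor on its own.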

C. BEDS has such a large p-value in the multiple regression model because it is correlated with the other predictors in the model (multicollinearity). Once those predictors are included, BEDS explains little additional variation in PRICE, and its standard error is inflated, which produces a large p-value.
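The effect of correlated predictors on p-values can be illustrated with simulated data. This is a sketch under stated assumptions: the variables `x1`, `x2`, and `y`, the sample size, and the noise levels are all made up for illustration and have nothing to do with the house price data.

```r
set.seed(42)  # arbitrary seed, for reproducibility only
n  <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.3)   # x2 is strongly correlated with x1
y  <- 2 * x1 + rnorm(n)         # y depends on x1 only

# Alone, x2 looks like a strong predictor because it acts as a proxy for x1.
p_simple <- summary(lm(y ~ x2))$coefficients["x2", "Pr(>|t|)"]

# With x1 in the model, x2 adds little and its standard error grows.
fit_both   <- lm(y ~ x1 + x2)
p_multiple <- summary(fit_both)$coefficients["x2", "Pr(>|t|)"]

# Variance inflation factor for x2: 1 / (1 - R^2) from regressing x2 on x1.
vif_x2 <- 1 / (1 - summary(lm(x2 ~ x1))$r.squared)

print(c(p_simple = p_simple, p_multiple = p_multiple, vif_x2 = vif_x2))
```

A predictor can therefore have a sizeable correlation with the response and still show a large p-value in the multiple regression model.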

D. Using software with the four predictors, the models that would be selected using each criterion are:

R2: Size, Beds, Baths, New (R2 never decreases when a predictor is added, so it selects the full model)
Adjusted R2: Size, Baths, New
PRESS: Size, Baths, New
AIC: Size, Baths, New
BIC: Size, Baths, New

E. I prefer the model selected by the AIC or BIC criteria. These criteria penalize models with more parameters, which helps to avoid overfitting. R2 never decreases when a predictor is added, so it always favors the largest model and can lead to overfitting, and adjusted R2 penalizes extra parameters only weakly.
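The five criteria can be computed side by side for any pair of candidate models. The following is a minimal sketch, assuming R's built-in mtcars data as a stand-in. The helper name `model_criteria` and the two candidate formulas are hypothetical and are not taken from the homework.

```r
# Stand-in data: built-in mtcars (NOT the homework's house price data).
# Hypothetical helper: return the five selection criteria for a fitted lm.
model_criteria <- function(fit) {
  s <- summary(fit)
  # PRESS: sum of squared leave-one-out prediction errors, computed from the
  # ordinary residuals and the leverages without refitting the model.
  press <- sum((residuals(fit) / (1 - hatvalues(fit)))^2)
  c(R2 = s$r.squared, adjR2 = s$adj.r.squared,
    PRESS = press, AIC = AIC(fit), BIC = BIC(fit))
}

small <- lm(mpg ~ wt + hp, data = mtcars)
large <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)

criteria <- rbind(small = model_criteria(small),
                  large = model_criteria(large))
print(round(criteria, 3))
```

Larger values are better for R2 and adjusted R2, and smaller values are better for PRESS, AIC, and BIC. R2 for the larger model can never be below R2 for a model nested inside it, which is why R2 alone cannot guard against overfitting.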

Code
# house.selling.price.2 is assumed to come from the smss package
# (it is not part of the sm package, which is why data() could not find it)
library(smss)
library(MASS)  # provides stepAIC()
data(house.selling.price.2)
Code
# Backward elimination: start from the full model and inspect the p-values
model1 <- lm(P ~ S + Be + Ba + New, data = house.selling.price.2)
summary(model1)
Code
# Forward selection: start from the intercept-only model
model2 <- lm(P ~ 1, data = house.selling.price.2)
stepAIC(model2, direction = "forward", scope = formula(model1))
Code
# R2 (full model)
model3 <- lm(P ~ S + Be + Ba + New, data = house.selling.price.2)
summary(model3)$r.squared
Code
# Adjusted R2
model4 <- lm(P ~ S + Ba + New, data = house.selling.price.2)
summary(model4)$adj.r.squared
Code
# PRESS: sum of squared leave-one-out prediction errors
model5 <- lm(P ~ S + Ba + New, data = house.selling.price.2)
sum((residuals(model5) / (1 - hatvalues(model5)))^2)
Code
# AIC
model6 <- lm(P ~ S + Ba + New, data = house.selling.price.2)
AIC(model6)
Code
# BIC
model7 <- lm(P ~ S + Ba + New, data = house.selling.price.2)
BIC(model7)

Question 2

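A. The coefficients for Girth and Height are 4.7082 and 0.3393. Holding Height fixed, a 1-inch increase in Girth is associated with an increase in Volume of 4.7082 cubic feet. Holding Girth fixed, a 1-foot increase in Height is associated with an increase in Volume of 0.3393 cubic feet.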
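One way to check the interpretation of the Girth coefficient is to compare predictions for two trees that differ only in Girth. This is a small sketch using the built-in trees data. The values 12, 13, and 75 are arbitrary illustration values, not observations singled out in the homework.

```r
fit <- lm(Volume ~ Girth + Height, data = trees)

# Two hypothetical trees with the same Height, one inch apart in Girth.
new_trees <- data.frame(Girth = c(12, 13), Height = c(75, 75))
pred <- predict(fit, newdata = new_trees)

# The gap between the two predictions equals the Girth coefficient.
girth_effect <- unname(pred[2] - pred[1])
print(c(difference = girth_effect, coefficient = unname(coef(fit)["Girth"])))
```

The same comparison with Girth held fixed and Height changed by one foot recovers the Height coefficient.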

B. The diagnostic plots show that the residuals are approximately normally distributed, with constant variance. There are no obvious outliers or influential points. Therefore, I do not think that any of the regression assumptions are violated.

Code
data(trees)

model <- lm(Volume ~ Girth + Height, data = trees)
summary(model)

Call:
lm(formula = Volume ~ Girth + Height, data = trees)

Residuals:
    Min      1Q  Median      3Q     Max 
-6.4065 -2.6493 -0.2876  2.2003  8.4847 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -57.9877     8.6382  -6.713 2.75e-07 ***
Girth         4.7082     0.2643  17.816  < 2e-16 ***
Height        0.3393     0.1302   2.607   0.0145 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.882 on 28 degrees of freedom
Multiple R-squared:  0.948, Adjusted R-squared:  0.9442 
F-statistic:   255 on 2 and 28 DF,  p-value: < 2.2e-16
Code
par(mfrow = c(2, 2))
plot(model)
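The visual reading of the diagnostic plots can be complemented with a few numeric checks. This is a hedged sketch, not a definitive procedure: the choice of checks, the 4/n cutoff for Cook's distance (a common rule of thumb), and the variable names are assumptions added here for illustration.

```r
fit <- lm(Volume ~ Girth + Height, data = trees)

# Normality: Shapiro-Wilk test on the residuals
# (a small p-value suggests the residuals are not normal).
shapiro_p <- shapiro.test(residuals(fit))$p.value

# Linearity: add the squared fitted values as an extra term; a small p-value
# for that term suggests curvature that the linear model misses.
curve_fit <- lm(Volume ~ Girth + Height + I(fitted(fit)^2), data = trees)
curve_p   <- summary(curve_fit)$coefficients[4, "Pr(>|t|)"]

# Influence: Cook's distance, flagged against the 4/n rule of thumb.
cooks   <- cooks.distance(fit)
flagged <- names(cooks)[cooks > 4 / nrow(trees)]

print(c(shapiro_p = shapiro_p, curvature_p = curve_p))
cat("Rows with Cook's distance above 4/n:", flagged, "\n")
```

If any of these checks disagrees with the impression from the plots, the plots are worth a second look before concluding that no assumption is violated.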

Question 3

A. Based on the diagnostic plots, Palm Beach County is an outlier. Its large residual in the residual plot shows that the model fits this county poorly, and its high Cook’s distance in the Cook’s distance plot shows that it has a strong influence on the fitted model.

B. My conclusions remain unchanged. In the diagnostic plots for model2, which regresses log(Buchanan) on log(Bush), Palm Beach County is still an outlier: it has a large residual in the residual plot and a high Cook’s distance in the Cook’s distance plot, so it is still poorly fitted and still has a strong influence on the model.

Some more specifics on the diagnostic plots:

The residual plot shows the residuals (the differences between the observed and fitted values) against the fitted values.

In both models the diagnostic plots mark Palm Beach County as an outlier, which means the model does not describe that county well. It is likely that some voters in Palm Beach County cast votes for Buchanan when they intended to vote for Gore because of the butterfly ballot.
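How a single unusual county shows up in the diagnostics can be sketched with synthetic data. Everything below is simulated and hypothetical: the vote counts, the slope of 0.005, the noise level, and the size of the injected shift are made-up values, and only the number of rows (67) matches the Florida data.

```r
set.seed(2000)  # arbitrary seed, for reproducibility only
n <- 67                               # same number of rows as the Florida data
x <- runif(n, 1000, 300000)           # synthetic predictor ("Bush" votes)
y <- 0.005 * x + rnorm(n, sd = 50)    # synthetic response ("Buchanan" votes)

# Inject one outlier: inflate the response for the row with the median x.
outlier <- order(x)[34]
y[outlier] <- y[outlier] + 3000

fit <- lm(y ~ x)

# Standardized residuals beyond about 3 in absolute value are usually
# treated as outliers; Cook's distance measures influence on the fit.
std_res <- rstandard(fit)
cooks   <- cooks.distance(fit)

cat("Largest |standardized residual| at row:", which.max(abs(std_res)), "\n")
cat("Largest Cook's distance at row:        ", which.max(cooks), "\n")
cat("Injected outlier at row:               ", outlier, "\n")
```

Applied to the fitted Florida models, the same two calls, `rstandard()` and `cooks.distance()`, give numeric versions of what the residual and Cook's distance plots show.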

Code
getwd()
[1] "/Users/rahulsomu/Documents/DACSS_601/603_repo/posts"
Code
library(alr3)
Loading required package: car
Loading required package: carData

Attaching package: 'alr3'
The following object is masked from 'package:MASS':

    forbes
Code
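# data() loads the data frame `florida` into the workspace; its return value
# is only the data set name, so inspect the data frame itself
data("florida")
head(florida)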
Code
model <- lm(Buchanan ~ Bush, data = florida)
summary(model)

Call:
lm(formula = Buchanan ~ Bush, data = florida)

Residuals:
    Min      1Q  Median      3Q     Max 
-907.50  -46.10  -29.19   12.26 2610.19 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 4.529e+01  5.448e+01   0.831    0.409    
Bush        4.917e-03  7.644e-04   6.432 1.73e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 353.9 on 65 degrees of freedom
Multiple R-squared:  0.3889,    Adjusted R-squared:  0.3795 
F-statistic: 41.37 on 1 and 65 DF,  p-value: 1.727e-08
Code
par(mfrow = c(2, 3))
plot(model)

model2 <- lm(log(Buchanan) ~ log(Bush), data = florida)
summary(model2)

Call:
lm(formula = log(Buchanan) ~ log(Bush), data = florida)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.96075 -0.25949  0.01282  0.23826  1.66564 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -2.57712    0.38919  -6.622 8.04e-09 ***
log(Bush)    0.75772    0.03936  19.251  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4673 on 65 degrees of freedom
Multiple R-squared:  0.8508,    Adjusted R-squared:  0.8485 
F-statistic: 370.6 on 1 and 65 DF,  p-value: < 2.2e-16
Code
plot(model2)