Homework 5
Author

Sai Padma Pothula

Published

May 2, 2023

Code
library(smss)
Code
library(alr4)
Loading required package: car
Loading required package: carData
Loading required package: effects
lattice theme set by effectsTheme()
See ?effectsTheme for details.
Code
library(magrittr)
Code
data(house.selling.price.2, package="smss")
data1 <- house.selling.price.2
head(house.selling.price.2,10)
       P    S Be Ba New
1   48.5 1.10  3  1   0
2   55.0 1.01  3  2   0
3   68.0 1.45  3  2   0
4  137.0 2.40  3  3   0
5  309.4 3.30  4  3   1
6   17.5 0.40  1  1   0
7   19.6 1.28  3  1   0
8   24.5 0.74  3  1   0
9   34.8 0.78  2  1   0
10  32.0 0.97  3  1   0
Code
names(data1) <- c('Price', 'Size', 'Beds', 'Baths', 'New')

1A:

Code
house_price_2 <- house.selling.price.2

# Calculate the correlation matrix
correlation_matrix <- cor(house_price_2)

print(correlation_matrix)
            P         S        Be        Ba       New
P   1.0000000 0.8988136 0.5902675 0.7136960 0.3565540
S   0.8988136 1.0000000 0.6691137 0.6624828 0.1762879
Be  0.5902675 0.6691137 1.0000000 0.3337966 0.2672091
Ba  0.7136960 0.6624828 0.3337966 1.0000000 0.1820651
New 0.3565540 0.1762879 0.2672091 0.1820651 1.0000000
Code
# Full model with all predictors
fit <- lm(P ~ ., data = house_price_2)

# Backward elimination (AIC-based) with step()
backward <- step(fit, direction = "backward")
Start:  AIC=524.7
P ~ S + Be + Ba + New

       Df Sum of Sq   RSS    AIC
- Be    1       131 23684 523.21
<none>              23553 524.70
- Ba    1      3092 26645 534.17
- New   1      6432 29985 545.15
- S     1     35419 58972 608.06

Step:  AIC=523.21
P ~ S + Ba + New

       Df Sum of Sq   RSS    AIC
<none>              23684 523.21
- Ba    1      3550 27234 534.20
- New   1      6349 30033 543.30
- S     1     54898 78582 632.75
Code
summary(backward)

Call:
lm(formula = P ~ S + Ba + New, data = house_price_2)

Residuals:
    Min      1Q  Median      3Q     Max 
-34.804  -9.496   0.917   7.931  73.338 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -47.992      8.209  -5.847 8.15e-08 ***
S             62.263      4.335  14.363  < 2e-16 ***
Ba            20.072      5.495   3.653 0.000438 ***
New           18.371      3.761   4.885 4.54e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 16.31 on 89 degrees of freedom
Multiple R-squared:  0.8681,    Adjusted R-squared:  0.8637 
F-statistic: 195.3 on 3 and 89 DF,  p-value: < 2.2e-16

The variable to be deleted first in backward elimination is “Beds” (Be), since it has the highest p-value in the full model (0.487). This matches the step() output above: Be is dropped at the first step, lowering the AIC from 524.70 to 523.21, after which no further deletion improves the AIC.
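
As a cross-check, base R’s drop1() reports the same single-term deletion tests that backward elimination relies on; a minimal sketch (output not shown):

Code
# Hedged sketch: F-tests for removing each term from the full model.
# The term with the largest p-value (Be) would be dropped first.
drop1(fit, test = "F")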

1B:

Code
# Note: starting step() from the full model leaves forward selection
# nothing to add, so the full model is returned unchanged
forward_dir <- step(fit, direction = "forward")
Start:  AIC=524.7
P ~ S + Be + Ba + New
Code
summary(forward_dir)

Call:
lm(formula = P ~ S + Be + Ba + New, data = house_price_2)

Residuals:
    Min      1Q  Median      3Q     Max 
-36.212  -9.546   1.277   9.406  71.953 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -41.795     12.104  -3.453 0.000855 ***
S             64.761      5.630  11.504  < 2e-16 ***
Be            -2.766      3.960  -0.698 0.486763    
Ba            19.203      5.650   3.399 0.001019 ** 
New           18.984      3.873   4.902  4.3e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 16.36 on 88 degrees of freedom
Multiple R-squared:  0.8689,    Adjusted R-squared:  0.8629 
F-statistic: 145.8 on 4 and 88 DF,  p-value: < 2.2e-16

In forward selection, variables are added one at a time; at each step the candidate with the smallest p-value (strongest evidence of association with the response) enters the model.

Here, “S” is by far the most significant predictor (p < 2e-16 both on its own and in the full model), so it would be added first. “Ba” and “New” would enter in the following steps, since both remain highly significant once Size is in the model, while “Be” would be added last, if at all: its p-value (0.487 in the full model) never comes close to conventional significance levels.
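
Note that step(fit, direction = "forward") above starts from the full model, so there is nothing left to add. A minimal sketch of how forward selection is usually set up, starting from the intercept-only model with an explicit scope (output not shown):

Code
# Hedged sketch: forward selection from the intercept-only model
null_model <- lm(P ~ 1, data = house_price_2)
forward_dir2 <- step(null_model,
                     scope = ~ S + Be + Ba + New,
                     direction = "forward")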

1C: Once the other variables are included in the regression model, “Beds” adds little predictive power for “Price”. “Size” and “Baths” are strongly correlated with “Price” (0.90 and 0.71, respectively) and already account for much of the variation that “Beds” would otherwise explain. In addition, the fairly high correlation between “Size” and “Beds” (0.67) points to multicollinearity: these two variables carry largely redundant information.
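
To quantify this redundancy, variance inflation factors can be computed with the car package (already loaded as a dependency of alr4); a minimal sketch (output not shown):

Code
# Hedged sketch: variance inflation factors for the full model.
# Values well above 1 for Size and Beds would be consistent with the
# overlapping information described above.
car::vif(fit)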

1D:

Code
summary(lm(P ~ S, data = house_price_2))

Call:
lm(formula = P ~ S, data = house_price_2)

Residuals:
    Min      1Q  Median      3Q     Max 
-56.407 -10.656   2.126  11.412  85.091 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -25.194      6.688  -3.767 0.000293 ***
S             75.607      3.865  19.561  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 19.47 on 91 degrees of freedom
Multiple R-squared:  0.8079,    Adjusted R-squared:  0.8058 
F-statistic: 382.6 on 1 and 91 DF,  p-value: < 2.2e-16
Code
summary(lm(P ~ S + New, data = house_price_2))

Call:
lm(formula = P ~ S + New, data = house_price_2)

Residuals:
    Min      1Q  Median      3Q     Max 
-47.207  -9.763  -0.091   9.984  76.405 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -26.089      5.977  -4.365 3.39e-05 ***
S             72.575      3.508  20.690  < 2e-16 ***
New           19.587      3.995   4.903 4.16e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 17.4 on 90 degrees of freedom
Multiple R-squared:  0.8484,    Adjusted R-squared:  0.845 
F-statistic: 251.8 on 2 and 90 DF,  p-value: < 2.2e-16
Code
summary(lm(P ~ ., data = house_price_2))

Call:
lm(formula = P ~ ., data = house_price_2)

Residuals:
    Min      1Q  Median      3Q     Max 
-36.212  -9.546   1.277   9.406  71.953 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -41.795     12.104  -3.453 0.000855 ***
S             64.761      5.630  11.504  < 2e-16 ***
Be            -2.766      3.960  -0.698 0.486763    
Ba            19.203      5.650   3.399 0.001019 ** 
New           18.984      3.873   4.902  4.3e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 16.36 on 88 degrees of freedom
Multiple R-squared:  0.8689,    Adjusted R-squared:  0.8629 
F-statistic: 145.8 on 4 and 88 DF,  p-value: < 2.2e-16
Code
summary(lm(P ~ . - Be, data = house_price_2))

Call:
lm(formula = P ~ . - Be, data = house_price_2)

Residuals:
    Min      1Q  Median      3Q     Max 
-34.804  -9.496   0.917   7.931  73.338 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -47.992      8.209  -5.847 8.15e-08 ***
S             62.263      4.335  14.363  < 2e-16 ***
Ba            20.072      5.495   3.653 0.000438 ***
New           18.371      3.761   4.885 4.54e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 16.31 on 89 degrees of freedom
Multiple R-squared:  0.8681,    Adjusted R-squared:  0.8637 
F-statistic: 195.3 on 3 and 89 DF,  p-value: < 2.2e-16
Code
# Candidate models: full, without Beds, Size + New only, and Size only
full_model <- lm(P ~ ., data = house.selling.price.2)
model_noBeds <- lm(P ~ . - Be, data = house.selling.price.2)
model_noBeds_noBaths <- lm(P ~ S + New, data = house.selling.price.2)
model_size_only <- lm(P ~ S, data = house.selling.price.2)
Code
# Helper functions for comparing the candidate models
rsquared <- function(fit) summary(fit)$r.squared
adj_rsquared <- function(fit) summary(fit)$adj.r.squared

# PRESS: predicted residual sum of squares, based on the leave-one-out
# residuals e_i / (1 - h_i), where h_i is the observation's leverage
PRESS <- function(fit) {
  pr <- residuals(fit) / (1 - lm.influence(fit)$hat)
  sum(pr^2)
}
Code
models <- list(full_model, model_noBeds, model_noBeds_noBaths, model_size_only)
data.frame(models = c('full_model', 'model_noBeds', 'model_noBeds&Baths', 'model_only_size'),
           rSquared = sapply(models, rsquared),
           adj_rSquared = sapply(models, adj_rsquared),
           PRESS = sapply(models, PRESS),
           AIC = sapply(models, AIC),
           BIC = sapply(models, BIC)) |>
  print()
              models  rSquared adj_rSquared    PRESS      AIC      BIC
1         full_model 0.8688630    0.8629022 28390.22 790.6225 805.8181
2       model_noBeds 0.8681361    0.8636912 27860.05 789.1366 801.7996
3 model_noBeds&Baths 0.8483699    0.8450003 31066.00 800.1262 810.2566
4    model_only_size 0.8078660    0.8057546 38203.29 820.1439 827.7417

When evaluating the models, higher R-squared and adjusted R-squared values indicate a better fit, whereas lower PRESS, AIC, and BIC values indicate better expected predictive performance. By these criteria the model without “Beds” (P ~ S + Ba + New) comes out ahead: it has the highest adjusted R-squared and the lowest PRESS, AIC, and BIC.
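
As a complementary summary, PRESS can be converted into a predictive R-squared, 1 - PRESS / TSS; since the total sum of squares is the same for every model, the ranking matches the PRESS ranking above. A minimal sketch reusing the PRESS() helper defined earlier (output not shown):

Code
# Hedged sketch: predictive R-squared = 1 - PRESS / total sum of squares
pred_rsquared <- function(fit) {
  y <- model.response(model.frame(fit))
  1 - PRESS(fit) / sum((y - mean(y))^2)
}
sapply(models, pred_rsquared)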

1E: Based on the criteria in part D, I would select the model without the “Beds” variable, P ~ S + Ba + New.

2A:

Code
head(trees)
  Girth Height Volume
1   8.3     70   10.3
2   8.6     65   10.3
3   8.8     63   10.2
4  10.5     72   16.4
5  10.7     81   18.8
6  10.8     83   19.7
Code
tree_model <- lm(Volume ~ Girth + Height, data = trees)
summary(tree_model)

Call:
lm(formula = Volume ~ Girth + Height, data = trees)

Residuals:
    Min      1Q  Median      3Q     Max 
-6.4065 -2.6493 -0.2876  2.2003  8.4847 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -57.9877     8.6382  -6.713 2.75e-07 ***
Girth         4.7082     0.2643  17.816  < 2e-16 ***
Height        0.3393     0.1302   2.607   0.0145 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.882 on 28 degrees of freedom
Multiple R-squared:  0.948, Adjusted R-squared:  0.9442 
F-statistic:   255 on 2 and 28 DF,  p-value: < 2.2e-16

2B:

Code
par(mfrow = c(2,3))
plot(tree_model, which = 1:6)

The violation that stands out most is of the linearity assumption. In the residuals vs. fitted values plot, the red smooth should lie roughly flat along the horizontal axis, but here it follows a clear U-shaped pattern. This is not surprising: a tree’s volume is roughly proportional to the square of its girth times its height, while the fitted model enters Girth and Height only linearly and so cannot capture that curvature. One remedy, sketched below, is to include a quadratic term for Girth (or to fit the model on the log scale).
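
A minimal sketch of both options (output not shown):

Code
# Hedged sketch: two common fixes for the curvature in the residual plot
quad_model <- lm(Volume ~ Girth + I(Girth^2) + Height, data = trees)   # quadratic in Girth
log_model  <- lm(log(Volume) ~ log(Girth) + log(Height), data = trees) # log-log model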

3A:

Code
# Simple linear regression of Buchanan votes on Bush votes
model <- lm(Buchanan ~ Bush, data = florida)

# Regression diagnostic plots
par(mfrow = c(2, 2))
plot(model)

3B:

Code
florida$log_Bush <- log(florida$Bush)
florida$log_Buchanan <- log(florida$Buchanan)

# Perform simple linear regression with log-transformed variables
model <- lm(log_Buchanan ~ log_Bush, data = florida)

# Generate regression diagnostic plots
par(mfrow = c(2, 2))
plot(model)
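
The diagnostic plots label the most extreme points by row name; assuming the counties are stored as row names (as in the alr4 version of the data), the single most influential observation in the log-log fit can also be pulled out directly. A minimal sketch (output not shown):

Code
# Hedged sketch: flag the most influential observation in the log-log fit
# (assumes county names are the row names of the florida data frame)
rownames(florida)[which.max(cooks.distance(model))]
rownames(florida)[which.max(abs(rstandard(model)))]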