hw5
regression
Author

Donny Snyder

Published

December 8, 2022

Code
library(tidyverse)
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.4.0      ✔ purrr   0.3.5 
✔ tibble  3.1.8      ✔ dplyr   1.0.10
✔ tidyr   1.2.1      ✔ stringr 1.4.1 
✔ readr   2.1.3      ✔ forcats 0.5.2 
Warning: package 'ggplot2' was built under R version 4.2.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
Code
library(alr4)
Loading required package: car
Loading required package: carData

Attaching package: 'car'

The following object is masked from 'package:dplyr':

    recode

The following object is masked from 'package:purrr':

    some

Loading required package: effects
lattice theme set by effectsTheme()
See ?effectsTheme for details.
Code
library(smss)
Warning: package 'smss' was built under R version 4.2.2
Code
library(dplyr)
library(ggplot2)
library(GGally)
Warning: package 'GGally' was built under R version 4.2.2
Registered S3 method overwritten by 'GGally':
  method from   
  +.gg   ggplot2
Code
library(qpcR)
Error in library(qpcR): there is no package called 'qpcR'

Question 1

Code
data(house.selling.price.2)
summary(house.selling.price.2)
       P                S              Be              Ba       
 Min.   : 17.50   Min.   :0.40   Min.   :1.000   Min.   :1.000  
 1st Qu.: 72.90   1st Qu.:1.33   1st Qu.:3.000   1st Qu.:2.000  
 Median : 96.00   Median :1.57   Median :3.000   Median :2.000  
 Mean   : 99.53   Mean   :1.65   Mean   :3.183   Mean   :1.957  
 3rd Qu.:115.00   3rd Qu.:1.98   3rd Qu.:4.000   3rd Qu.:2.000  
 Max.   :309.40   Max.   :3.85   Max.   :5.000   Max.   :3.000  
      New        
 Min.   :0.0000  
 1st Qu.:0.0000  
 Median :0.0000  
 Mean   :0.3011  
 3rd Qu.:1.0000  
 Max.   :1.0000  
Code
data  <- house.selling.price.2

cor(data)
            P         S        Be        Ba       New
P   1.0000000 0.8988136 0.5902675 0.7136960 0.3565540
S   0.8988136 1.0000000 0.6691137 0.6624828 0.1762879
Be  0.5902675 0.6691137 1.0000000 0.3337966 0.2672091
Ba  0.7136960 0.6624828 0.3337966 1.0000000 0.1820651
New 0.3565540 0.1762879 0.2672091 0.1820651 1.0000000
Code
model1 <- lm(P ~ S + Ba + New + Be ,data = data)
backElim <- lm(P ~ S + Ba + New ,data = data)
forSel <- lm(P ~ S + New + Ba + Be,data = data)

summary(model1)

Call:
lm(formula = P ~ S + Ba + New + Be, data = data)

Residuals:
    Min      1Q  Median      3Q     Max 
-36.212  -9.546   1.277   9.406  71.953 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -41.795     12.104  -3.453 0.000855 ***
S             64.761      5.630  11.504  < 2e-16 ***
Ba            19.203      5.650   3.399 0.001019 ** 
New           18.984      3.873   4.902  4.3e-06 ***
Be            -2.766      3.960  -0.698 0.486763    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 16.36 on 88 degrees of freedom
Multiple R-squared:  0.8689,    Adjusted R-squared:  0.8629 
F-statistic: 145.8 on 4 and 88 DF,  p-value: < 2.2e-16
Code
summary(backElim)

Call:
lm(formula = P ~ S + Ba + New, data = data)

Residuals:
    Min      1Q  Median      3Q     Max 
-34.804  -9.496   0.917   7.931  73.338 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -47.992      8.209  -5.847 8.15e-08 ***
S             62.263      4.335  14.363  < 2e-16 ***
Ba            20.072      5.495   3.653 0.000438 ***
New           18.371      3.761   4.885 4.54e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 16.31 on 89 degrees of freedom
Multiple R-squared:  0.8681,    Adjusted R-squared:  0.8637 
F-statistic: 195.3 on 3 and 89 DF,  p-value: < 2.2e-16
Code
AIC(model1)
[1] 790.6225
Code
AIC(backElim)
[1] 789.1366
Code
BIC(model1)
[1] 805.8181
Code
BIC(backElim)
[1] 801.7996
Code
PRESS(model1)
Error in PRESS(model1): could not find function "PRESS"
Code
PRESS(backElim)
Error in PRESS(backElim): could not find function "PRESS"

#Question 1A The Beds variable should be deleted first for backward elimination because it has the highest p-value

#Question 1B The Size variable should be added first for forward selection, because it has the most explanatory value.

#Question 1C Beds likely has such a large p-value because its variance that it explains is almost completely absorbed by size, which it is most highly correlated with.

#Question 1D For R squared, I also prefer the backwards elimination method model, because it reduces the residuals.

For adjusted R squared, I prefer the backwards elimination method model without the Bed predictor, because it reduces the residuals.

For AIC I would also prefer the backwards elimination model created, because it reduces the AIC score, same with the BIC score.

The PRESS metric is the only one where the value is lower for the model with all of the variables, so I would choose that for PRESS.

#Question 2

Code
data2 <- trees

model2 <- lm(Volume ~ Girth + Height, data = data2)
summary(model2)

Call:
lm(formula = Volume ~ Girth + Height, data = data2)

Residuals:
    Min      1Q  Median      3Q     Max 
-6.4065 -2.6493 -0.2876  2.2003  8.4847 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -57.9877     8.6382  -6.713 2.75e-07 ***
Girth         4.7082     0.2643  17.816  < 2e-16 ***
Height        0.3393     0.1302   2.607   0.0145 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.882 on 28 degrees of freedom
Multiple R-squared:  0.948, Adjusted R-squared:  0.9442 
F-statistic:   255 on 2 and 28 DF,  p-value: < 2.2e-16
Code
plot(model2)

Question 2A

Ran model in code above.

Question 2B

It seems like some of the regression assumptions may be violated. There is something of a parabolic relationship in this model, which was left out of the residuals. THe QQ looks normal, but the spread-location plot also shows that the residuals aren’t necessarily spread evenly along the range of values. The residuals vs leverage looks reasonable. It seems that there is not homoscedasticity of errors, which could be an issue for regression.

Question 3

Code
data(florida)
data3 <- florida

model3 <- lm(Buchanan~Bush, data = florida)
summary(model3)

Call:
lm(formula = Buchanan ~ Bush, data = florida)

Residuals:
    Min      1Q  Median      3Q     Max 
-907.50  -46.10  -29.19   12.26 2610.19 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 4.529e+01  5.448e+01   0.831    0.409    
Bush        4.917e-03  7.644e-04   6.432 1.73e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 353.9 on 65 degrees of freedom
Multiple R-squared:  0.3889,    Adjusted R-squared:  0.3795 
F-statistic: 41.37 on 1 and 65 DF,  p-value: 1.727e-08
Code
plot(model3)

Code
model4 <- lm(log(Buchanan)~log(Bush), data = florida)
summary(model4)

Call:
lm(formula = log(Buchanan) ~ log(Bush), data = florida)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.96075 -0.25949  0.01282  0.23826  1.66564 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -2.57712    0.38919  -6.622 8.04e-09 ***
log(Bush)    0.75772    0.03936  19.251  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4673 on 65 degrees of freedom
Multiple R-squared:  0.8508,    Adjusted R-squared:  0.8485 
F-statistic: 370.6 on 1 and 65 DF,  p-value: < 2.2e-16
Code
plot(model4)

Question 3A

Palm beach county is an outlier based on the diagnostic plots, given all of the plots, as it’s value is by far the furthest from the fitted line.

Question 3B

The findings change somewhat, but not that much. Palm Beach is still an outlier.