Homework 5

Author: Guanhua Tan

Published: May 7, 2023

Code
library(tidyverse)
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.4.0      ✔ purrr   0.3.5 
✔ tibble  3.1.8      ✔ dplyr   1.0.10
✔ tidyr   1.2.1      ✔ stringr 1.4.1 
✔ readr   2.1.3      ✔ forcats 0.5.2 
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
Code
library(smss)
library(alr4)
Loading required package: car
Loading required package: carData

Attaching package: 'car'

The following object is masked from 'package:dplyr':

    recode

The following object is masked from 'package:purrr':

    some

Loading required package: effects
lattice theme set by effectsTheme()
See ?effectsTheme for details.
Code
library(ggplot2)

Question 1

A. For backward elimination, which variable would be deleted first? Why?

The variable Beds would be deleted first because, among the four predictors in the full model, it has the largest p-value.

B. For forward selection, which variable would be added first? Why?

The variable Size would be added first because it has the smallest p-value (i.e., the strongest association with Price).
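As a cross-check, both selection steps can be reproduced with drop1() and add1() F-tests. This is only a sketch (output not shown); it uses the house.selling.price.2 data from the already-loaded smss package, which is loaded again in part D below.

Code
data("house.selling.price.2")

# Backward elimination: fit the full model and look for the largest p-value
full_model <- lm(P ~ S + Be + Ba + New, data = house.selling.price.2)
drop1(full_model, test = "F")   # Beds should show the largest p-value, matching the answer above

# Forward selection: start from the intercept-only model and look for the smallest p-value
null_model <- lm(P ~ 1, data = house.selling.price.2)
add1(null_model, scope = ~ S + Be + Ba + New, test = "F")  # Size should show the smallest p-value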

C. Why do you think that BEDS has such a large P-value in the multiple regression model, even though it has a substantial correlation with PRICE?

Beds is strongly correlated with Size, so once Size is in the model Beds contributes little additional information; this multicollinearity inflates the standard error of the Beds coefficient and hence its p-value.
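One way to check this explanation is to look at the correlation between Be and S and at the variance inflation factors. A sketch using the already-loaded car package (output not shown):

Code
data("house.selling.price.2")

# Pairwise correlations among the quantitative predictors
cor(house.selling.price.2[, c("S", "Be", "Ba")])

# Variance inflation factors for the full model
car::vif(lm(P ~ S + Be + Ba + New, data = house.selling.price.2))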

D. Using software with these four predictors, find the model that would be selected using each criterion:

  1. R2
  2. Adjusted R2
  3. PRESS
  4. AIC
  5. BIC
Code
data("house.selling.price.2")
model_house <- lm(P~S+Be+Ba+New, data=house.selling.price.2)
model_house_no_Be <- lm(P~S+Ba+New, data=house.selling.price.2)

# PRESS: sum of squared leave-one-out (deleted) residuals, full model
pr <- resid(model_house)/(1 - lm.influence(model_house)$hat)
sum(pr^2)
[1] 28390.22
Code
pr <- resid(model_house_no_Be)/(1 - lm.influence(model_house_no_Be)$hat)
sum(pr^2)
[1] 27860.05
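The same leave-one-out computation can be wrapped in a small helper so it is easy to apply to any candidate model (a convenience sketch, not required by the assignment):

Code
# PRESS = sum of squared deleted (leave-one-out) residuals
press <- function(model) {
  pr <- resid(model) / (1 - lm.influence(model)$hat)
  sum(pr^2)
}
press(model_house)        # 28390.22
press(model_house_no_Be)  # 27860.05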
Code
# R-squared, adjusted R-squared, AIC and BIC via broom::glance()
broom::glance(model_house)
# A tibble: 1 × 12
  r.squared adj.r.squa…¹ sigma stati…²  p.value    df logLik   AIC   BIC devia…³
      <dbl>        <dbl> <dbl>   <dbl>    <dbl> <dbl>  <dbl> <dbl> <dbl>   <dbl>
1     0.869        0.863  16.4    146. 5.94e-38     4  -389.  791.  806.  23553.
# … with 2 more variables: df.residual <int>, nobs <int>, and abbreviated
#   variable names ¹​adj.r.squared, ²​statistic, ³​deviance
Code
broom::glance(model_house_no_Be)
# A tibble: 1 × 12
  r.squared adj.r.squa…¹ sigma stati…²  p.value    df logLik   AIC   BIC devia…³
      <dbl>        <dbl> <dbl>   <dbl>    <dbl> <dbl>  <dbl> <dbl> <dbl>   <dbl>
1     0.868        0.864  16.3    195. 4.96e-39     3  -390.  789.  802.  23684.
# … with 2 more variables: df.residual <int>, nobs <int>, and abbreviated
#   variable names ¹​adj.r.squared, ²​statistic, ³​deviance

R2: model_house is 0.869; model_house_no_Be is 0.868. R2 never decreases when a predictor is added, so it selects the full model.

Adjusted R2: model_house is 0.863; model_house_no_Be is 0.864, so it selects the model without Beds.

PRESS: model_house is 28390.22; model_house_no_Be is 27860.05, so it selects the model without Beds.

AIC: model_house is 790.6225; model_house_no_Be is 789.1366, so it selects the model without Beds.

BIC: model_house is 805.8181; model_house_no_Be is 801.7996, so it selects the model without Beds.

E. Explain which model you prefer and why.

I prefer model_house_no_Be because its PRESS, AIC, and BIC are all smaller than those of model_house, and its adjusted R2 is slightly higher.

Question 2

Code
data("trees")
head(trees)
  Girth Height Volume
1   8.3     70   10.3
2   8.6     65   10.3
3   8.8     63   10.2
4  10.5     72   16.4
5  10.7     81   18.8
6  10.8     83   19.7

A. Fit a multiple regression model with Volume as the outcome and Girth and Height as the explanatory variables.

Code
trees_model_1 <- lm(Volume~Girth+Height, data=trees)
summary(trees_model_1)

Call:
lm(formula = Volume ~ Girth + Height, data = trees)

Residuals:
    Min      1Q  Median      3Q     Max 
-6.4065 -2.6493 -0.2876  2.2003  8.4847 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -57.9877     8.6382  -6.713 2.75e-07 ***
Girth         4.7082     0.2643  17.816  < 2e-16 ***
Height        0.3393     0.1302   2.607   0.0145 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.882 on 28 degrees of freedom
Multiple R-squared:  0.948, Adjusted R-squared:  0.9442 
F-statistic:   255 on 2 and 28 DF,  p-value: < 2.2e-16

B. Run regression diagnostic plots on the model. Based on the plots, do you think any of the regression assumptions is violated?

Code
par(mfrow=c(2,3))
plot(trees_model_1, which= 1:6)

Several of the plots show clear patterns. The smoothed line in the Residuals vs Fitted plot is curved, which suggests the linearity assumption is violated, and a similar shape appears in the Scale-Location plot, pointing to non-constant error variance. In the Normal Q-Q plot, the points in the tails depart from the reference line, so the normality assumption is also questionable.
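The visual impression can be backed up with numerical tests. A sketch using the already-loaded car package plus a base-R normality test (output not shown):

Code
# Tukey curvature tests for each predictor and for the fitted values
car::residualPlots(trees_model_1)

# Score test for non-constant error variance
car::ncvTest(trees_model_1)

# Shapiro-Wilk test on the residuals as a rough normality check
shapiro.test(resid(trees_model_1))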

Question 3

Code
data("florida")
head(florida)
           Gore   Bush Buchanan
ALACHUA   47300  34062      262
BAKER      2392   5610       73
BAY       18850  38637      248
BRADFORD   3072   5413       65
BREVARD   97318 115185      570
BROWARD  386518 177279      789

A. Run a simple linear regression model where the Buchanan vote is the outcome and the Bush vote is the explanatory variable. Produce the regression diagnostic plots. Is Palm Beach County an outlier based on the diagnostic plots? Why or why not?

Code
votes_2000 <-lm(Buchanan~Bush, data=florida)
summary(votes_2000)

Call:
lm(formula = Buchanan ~ Bush, data = florida)

Residuals:
    Min      1Q  Median      3Q     Max 
-907.50  -46.10  -29.19   12.26 2610.19 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 4.529e+01  5.448e+01   0.831    0.409    
Bush        4.917e-03  7.644e-04   6.432 1.73e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 353.9 on 65 degrees of freedom
Multiple R-squared:  0.3889,    Adjusted R-squared:  0.3795 
F-statistic: 41.37 on 1 and 65 DF,  p-value: 1.727e-08
Code
par(mfrow=c(2,3))
plot(votes_2000, which= 1:6)

According to the Cook's distance plot, Palm Beach County is an outlier: its Buchanan vote is far larger than the fitted line predicts, giving it by far the largest residual and Cook's distance of any county.
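This can be confirmed numerically with Cook's distances and the Bonferroni outlier test from the already-loaded car package (a sketch; output not shown):

Code
# Largest Cook's distances; PALM BEACH should be far ahead of the rest
sort(cooks.distance(votes_2000), decreasing = TRUE)[1:3]

# Bonferroni-adjusted test on the most extreme studentized residual
car::outlierTest(votes_2000)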

B. Take the log of both variables (Bush vote and Buchanan vote) and repeat the analysis in (A). Do your findings change?

Code
votes_2000_trans <-lm(log(Buchanan)~log(Bush), data=florida)
par(mfrow=c(2,3))
plot(votes_2000_trans, which= 1:6)

According to the Cook's distance plot for the log-log model, the diagnostics flag a few additional counties, but Palm Beach County still stands out as the most extreme outlier, so the finding does not change.
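The same numerical check can be repeated on the transformed model (a sketch; output not shown):

Code
sort(cooks.distance(votes_2000_trans), decreasing = TRUE)[1:3]
car::outlierTest(votes_2000_trans)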