Homework Assignment 5 - Darron Bunt
Author: Darron Bunt

Published: May 7, 2023

Code
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.2     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.2     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Code
library(alr4)
Loading required package: car
Loading required package: carData

Attaching package: 'car'

The following object is masked from 'package:dplyr':

    recode

The following object is masked from 'package:purrr':

    some

Loading required package: effects
lattice theme set by effectsTheme()
See ?effectsTheme for details.
Code
library(smss)
library(ggplot2)
library(broom)

Question 1

(Data file: house.selling.price.2 from smss R package)

Code
# Load data file
data(house.selling.price.2)

For the house.selling.price.2 data, the tables below show a correlation matrix and a model fit using four predictors of selling price.

Code
HousePriceVars <- lm(P ~ S + Be + Ba + New, data = house.selling.price.2)
summary(HousePriceVars)

Call:
lm(formula = P ~ S + Be + Ba + New, data = house.selling.price.2)

Residuals:
    Min      1Q  Median      3Q     Max 
-36.212  -9.546   1.277   9.406  71.953 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -41.795     12.104  -3.453 0.000855 ***
S             64.761      5.630  11.504  < 2e-16 ***
Be            -2.766      3.960  -0.698 0.486763    
Ba            19.203      5.650   3.399 0.001019 ** 
New           18.984      3.873   4.902  4.3e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 16.36 on 88 degrees of freedom
Multiple R-squared:  0.8689,    Adjusted R-squared:  0.8629 
F-statistic: 145.8 on 4 and 88 DF,  p-value: < 2.2e-16

A

With these four predictors: For backward elimination, which variable would be deleted first? Why?

In backward elimination, we start with all variables in the model and a pre-determined significance level, then at each stage delete the variable with the largest p-value, continuing until all remaining variables are significant.

Looking at the regression output, Beds has the largest p-value (0.487) and would be deleted first.

B

With these four predictors: For forward selection, which variable would be added first? Why?

In forward selection, we start with no explanatory variables and a pre-determined significance level. At each step we add the most significant variable (i.e., the one with the smallest p-value) and stop when no remaining variable makes a significant partial contribution.

Looking at the regression output, Size has the smallest p-value (< 2e-16) and would be added first.
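As a side note (not part of the original question), base R’s step() function automates this kind of search; it uses AIC rather than p-values as its criterion, so it is a related but not identical procedure to the ones described above. A minimal sketch:

Code
# Automated backward search using AIC as the criterion
step(HousePriceVars, direction = "backward")

# Forward selection from an intercept-only model over the same four predictors
step(lm(P ~ 1, data = house.selling.price.2),
     scope = ~ S + Be + Ba + New, direction = "forward")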

C

Why do you think that BEDS has such a large P-value in the multiple regression model, even though it has a substantial correlation with PRICE?

Code
CorHousePriceVars <- cor(house.selling.price.2)
CorHousePriceVars
            P         S        Be        Ba       New
P   1.0000000 0.8988136 0.5902675 0.7136960 0.3565540
S   0.8988136 1.0000000 0.6691137 0.6624828 0.1762879
Be  0.5902675 0.6691137 1.0000000 0.3337966 0.2672091
Ba  0.7136960 0.6624828 0.3337966 1.0000000 0.1820651
New 0.3565540 0.1762879 0.2672091 0.1820651 1.0000000

I would suspect multicollinearity. Multicollinearity refers to a situation where two or more variables in a regression model are highly correlated with each other - that is, they are measuring the same or similar things.

Larger houses typically have more bedrooms and bathrooms (you can only fill so much space with giant living rooms!), and the number of bathrooms is also frequently correlated with the number of bedrooms (houses with 2-4 bedrooms commonly have 2 bathrooms, etc.). The correlation matrix bears this out: Size correlates with Beds at 0.67 and with Baths at 0.66. If size, bedrooms, and bathrooms are all measuring similar things, multicollinearity is likely a factor.
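One way to check this suspicion is to compute variance inflation factors with the car package (already loaded via alr4); this check is my addition, not part of the assignment. VIFs well above 1 indicate that a predictor is largely explained by the other predictors:

Code
# Variance inflation factors for the four-predictor model
car::vif(HousePriceVars)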

D

Using software with these four predictors, find the model that would be selected using each criterion:

I’ll create the models using backward elimination:

Code
# All four predictors: P ~ S + Be + Ba + New
summary(HousePriceVars)

Call:
lm(formula = P ~ S + Be + Ba + New, data = house.selling.price.2)

Residuals:
    Min      1Q  Median      3Q     Max 
-36.212  -9.546   1.277   9.406  71.953 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -41.795     12.104  -3.453 0.000855 ***
S             64.761      5.630  11.504  < 2e-16 ***
Be            -2.766      3.960  -0.698 0.486763    
Ba            19.203      5.650   3.399 0.001019 ** 
New           18.984      3.873   4.902  4.3e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 16.36 on 88 degrees of freedom
Multiple R-squared:  0.8689,    Adjusted R-squared:  0.8629 
F-statistic: 145.8 on 4 and 88 DF,  p-value: < 2.2e-16
Code
# eliminate Beds, which has the largest p-value - now three predictors
HousePriceNoBeds <- lm(P ~ S + Ba + New, data = house.selling.price.2)
summary(HousePriceNoBeds)

Call:
lm(formula = P ~ S + Ba + New, data = house.selling.price.2)

Residuals:
    Min      1Q  Median      3Q     Max 
-34.804  -9.496   0.917   7.931  73.338 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -47.992      8.209  -5.847 8.15e-08 ***
S             62.263      4.335  14.363  < 2e-16 ***
Ba            20.072      5.495   3.653 0.000438 ***
New           18.371      3.761   4.885 4.54e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 16.31 on 89 degrees of freedom
Multiple R-squared:  0.8681,    Adjusted R-squared:  0.8637 
F-statistic: 195.3 on 3 and 89 DF,  p-value: < 2.2e-16

All variables are now significant. For the sake of having another model to compare against, I’ll continue the backward elimination and remove Baths, which has the next-largest p-value.

Code
# eliminate BATHS, now have two predictors 
HousePriceNoBedsBath <- lm(P ~ S + New, data = house.selling.price.2)
summary(HousePriceNoBedsBath)

Call:
lm(formula = P ~ S + New, data = house.selling.price.2)

Residuals:
    Min      1Q  Median      3Q     Max 
-47.207  -9.763  -0.091   9.984  76.405 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -26.089      5.977  -4.365 3.39e-05 ***
S             72.575      3.508  20.690  < 2e-16 ***
New           19.587      3.995   4.903 4.16e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 17.4 on 90 degrees of freedom
Multiple R-squared:  0.8484,    Adjusted R-squared:  0.845 
F-statistic: 251.8 on 2 and 90 DF,  p-value: < 2.2e-16
Code
# Convert the model results to a data frame
HousePrice_results <- bind_rows(
  data.frame(model_name = "HousePriceVars", glance(HousePriceVars)),
  data.frame(model_name = "HousePriceNoBeds", glance(HousePriceNoBeds)),
  data.frame(model_name = "HousePriceNoBedsBath", glance(HousePriceNoBedsBath)))
HousePrice_results
            model_name r.squared adj.r.squared    sigma statistic      p.value
1       HousePriceVars 0.8688630     0.8629022 16.35994  145.7634 5.936558e-38
2     HousePriceNoBeds 0.8681361     0.8636912 16.31279  195.3127 4.964676e-39
3 HousePriceNoBedsBath 0.8483699     0.8450003 17.39529  251.7748 1.365665e-37
  df    logLik      AIC      BIC deviance df.residual nobs
1  4 -389.3113 790.6225 805.8181 23552.98          88   93
2  3 -389.5683 789.1366 801.7996 23683.53          89   93
3  2 -396.0631 800.1262 810.2566 27233.66          90   93

Find the model that would be selected using each criterion:

1. R2

HousePriceVars - the model with all four predictors (Size, Beds, Baths, New). R-squared can never decrease when a predictor is added, so the full model always scores highest on this criterion.

2. Adjusted R2

HousePriceNoBeds - the model with three predictors (Size, Baths, New). Adjusted R-squared penalizes additional predictors, and this model has the highest value (0.8637).

3. PRESS

Code
# PRESS of HousePriceVars
PRHousePriceVars <- resid(HousePriceVars)/(1 - lm.influence(HousePriceVars)$hat)
sum(PRHousePriceVars^2)
[1] 28390.22
Code
# PRESS of HousePriceNoBeds
PRHousePriceNoBeds <- resid(HousePriceNoBeds)/(1 - lm.influence(HousePriceNoBeds)$hat)
sum(PRHousePriceNoBeds^2)
[1] 27860.05
Code
# PRESS of HousePriceNoBedsBath
PRHousePriceNoBedsBath <- resid(HousePriceNoBedsBath)/(1 - lm.influence(HousePriceNoBedsBath)$hat)
sum(PRHousePriceNoBedsBath^2)
[1] 31066

HousePriceNoBeds - the model with three predictors (Size, Baths, New) - has the smallest PRESS statistic.
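As an aside, the three computations above repeat the same formula, so they can be wrapped in a small helper function (my own convenience wrapper, not required by the assignment):

Code
# PRESS = sum of squared leave-one-out (predicted) residuals
PRESS <- function(model) {
  pr <- residuals(model) / (1 - lm.influence(model)$hat)
  sum(pr^2)
}
sapply(list(HousePriceVars = HousePriceVars,
            HousePriceNoBeds = HousePriceNoBeds,
            HousePriceNoBedsBath = HousePriceNoBedsBath),
       PRESS)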

4. AIC

HousePriceNoBeds - the model with three predictors (Size, Baths, New) - has the smallest AIC statistic.

5. BIC

HousePriceNoBeds - the model with three predictors (Size, Baths, New) - has the smallest BIC statistic.
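The AIC and BIC values above come from the glance() table; as a quick sanity check (my addition), base R’s AIC() and BIC() return the same values directly:

Code
# Same AIC/BIC values as in the glance() table, computed directly
AIC(HousePriceVars, HousePriceNoBeds, HousePriceNoBedsBath)
BIC(HousePriceVars, HousePriceNoBeds, HousePriceNoBedsBath)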

E

Explain which model you prefer and why.

The model with Size, Baths, and New as the predictors of Price is the model that I prefer. It had the highest adjusted R-squared, the lowest PRESS, the lowest AIC, and the lowest BIC. It also had the lowest p-value.

It appears to be the best fitting model, and accordingly, the one that I prefer.

Question 2

(Data file: trees from base R)

Code
# Load data file
data(trees)

From the documentation: “This data set provides measurements of the diameter, height and volume of timber in 31 felled black cherry trees. Note that the diameter (in inches) is erroneously labeled Girth in the data. It is measured at 4 ft 6 in above the ground.”

Tree volume estimation is a big deal, especially in the lumber industry. Use the trees data to build a basic model of tree volume prediction. In particular,

A

Fit a multiple regression model with the Volume as the outcome and Girth and Height as the explanatory variables

Code
# Multiple regression model with Volume as the outcome and Girth and Height as the explanatory variables
TreeVolume <- lm(Volume ~ Girth + Height, data = trees)
summary(TreeVolume)

Call:
lm(formula = Volume ~ Girth + Height, data = trees)

Residuals:
    Min      1Q  Median      3Q     Max 
-6.4065 -2.6493 -0.2876  2.2003  8.4847 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -57.9877     8.6382  -6.713 2.75e-07 ***
Girth         4.7082     0.2643  17.816  < 2e-16 ***
Height        0.3393     0.1302   2.607   0.0145 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.882 on 28 degrees of freedom
Multiple R-squared:  0.948, Adjusted R-squared:  0.9442 
F-statistic:   255 on 2 and 28 DF,  p-value: < 2.2e-16
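Since the point of the model is prediction, here is a quick illustration of using it on a hypothetical tree (the girth and height values are made up for the example):

Code
# Predict volume for a hypothetical tree (Girth in inches, Height in feet),
# with a 95% prediction interval
new_tree <- data.frame(Girth = 15, Height = 75)
predict(TreeVolume, newdata = new_tree, interval = "prediction")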

B

Run regression diagnostic plots on the model. Based on the plots, do you think any of the regression assumptions is violated?

Code
# Plot the diagnostic plots
par(mfrow = c(2, 2))
plot(TreeVolume)

Based on the plots, do you think any of the regression assumptions is violated?

I think three of the four plots suggest that regression assumptions are violated.

On the Residuals vs. Fitted plot we want to see linearity and constant variance: the residuals should “bounce randomly” around the zero line, suggesting a linear relationship, and form a roughly horizontal band around it, suggesting equal error variances. This plot shows neither - there is a clear pattern, and a funnel shape.

The Normal Q-Q plot looks as it should - the points on the plot follow a straight line and the points do not deviate significantly from the line.

The Scale-Location plot does not show constant variance. The line is not approximately horizontal.

Finally, the Residuals vs. Leverage plot shows an outlier outside of Cook’s distance. This high-leverage point has a larger potential influence on the regression model.
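To put a number on that last plot (my own check, not part of the assignment), Cook’s distances can be computed directly; a common rule of thumb flags observations above 4/n:

Code
# Cook's distance for each observation, flagging points above the 4/n cutoff
cooks <- cooks.distance(TreeVolume)
which(cooks > 4 / nrow(trees))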

Question 3

(Data file: florida in alr R package)

Code
# Load florida data set
data(florida)

In the 2000 election for U.S. president, the counting of votes in Florida was controversial. In Palm Beach County in south Florida, for example, voters used a so-called butterfly ballot. Some believe that the layout of the ballot caused some voters to cast votes for Buchanan when their intended choice was Gore.

The data has variables for the number of votes for each candidate — Gore, Bush, and Buchanan.

A

Run a simple linear regression model where the Buchanan vote is the outcome and the Bush vote is the explanatory variable. Produce the regression diagnostic plots. Is Palm Beach County an outlier based on the diagnostic plots? Why or why not?

Code
# linear regression model where Buchanan vote is outcome and Bush vote is explanatory 
Buchanan <- lm(Buchanan ~ Bush, data = florida)
summary(Buchanan)

Call:
lm(formula = Buchanan ~ Bush, data = florida)

Residuals:
    Min      1Q  Median      3Q     Max 
-907.50  -46.10  -29.19   12.26 2610.19 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 4.529e+01  5.448e+01   0.831    0.409    
Bush        4.917e-03  7.644e-04   6.432 1.73e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 353.9 on 65 degrees of freedom
Multiple R-squared:  0.3889,    Adjusted R-squared:  0.3795 
F-statistic: 41.37 on 1 and 65 DF,  p-value: 1.727e-08
Code
# Produce the regression diagnostic plots
par(mfrow = c(2, 2))
plot(Buchanan)

Is Palm Beach County an outlier based on the diagnostic plots? Why or why not?

Yes, based on the diagnostic plots, Palm Beach County is an outlier.

On the Residuals vs. Fitted Values plot, the residuals largely “bounce randomly” around the zero line and form a roughly horizontal band around the zero line - with Dade and Palm Beach as significant outliers.

The Normal Q-Q plot follows a straight line and the points do not deviate significantly from the line - again, with two major exceptions: Dade and Palm Beach. The Scale-Location plot is roughly horizontal, but again has a major exception: Palm Beach.

Finally, the Residuals vs. Leverage plot shows outliers outside of Cook’s distance - Dade, and once again, Palm Beach.
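As a supplement to the plots (my addition), the car package loaded via alr4 provides a formal check: outlierTest() reports a Bonferroni-adjusted p-value for the largest studentized residual, and it should flag the same county the plots do:

Code
# Bonferroni test on the largest absolute studentized residual
car::outlierTest(Buchanan)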

B

Take the log of both variables (Bush vote and Buchanan vote) and repeat the analysis in (A). Do your findings change?

Code
# linear regression model where log(Buchanan) is outcome and log(Bush) is explanatory 
LogBuchanan <- lm(log(Buchanan) ~ log(Bush), data = florida)
summary(LogBuchanan)

Call:
lm(formula = log(Buchanan) ~ log(Bush), data = florida)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.96075 -0.25949  0.01282  0.23826  1.66564 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -2.57712    0.38919  -6.622 8.04e-09 ***
log(Bush)    0.75772    0.03936  19.251  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4673 on 65 degrees of freedom
Multiple R-squared:  0.8508,    Adjusted R-squared:  0.8485 
F-statistic: 370.6 on 1 and 65 DF,  p-value: < 2.2e-16
Code
# Produce the regression diagnostic plots
par(mfrow = c(2, 2))
plot(LogBuchanan)

Do your findings change?

The p-value has gotten smaller and the adjusted R-squared has gone from 0.3795 to 0.8485, suggesting that the log-log model is a better-fitting model. When it comes to the diagnostic plots, however, Palm Beach is still a notable outlier on the Residuals vs. Fitted, Normal Q-Q, and Scale-Location plots, although it now falls within Cook’s distance on the Residuals vs. Leverage plot.
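As a rough follow-up check (my addition, not part of the assignment), the log-log fit can be back-transformed to see which counties received far more Buchanan votes than the model predicts; Palm Beach should top the list:

Code
# Back-transform the fitted log values to the vote scale
# (a naive back-transformation; it ignores retransformation bias)
pred <- exp(fitted(LogBuchanan))
excess <- florida$Buchanan / pred
head(sort(excess, decreasing = TRUE), 3)  # counties with the largest excess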