Homework 5
Author

Asch Harwood

Published

May 9, 2023

Question 1

Code
data(house.selling.price.2)

A

Beds would be deleted first: it has the highest p-value, 0.487, and is not statistically significant.

B

The first fit would be an intercept-only model, which has no explanatory variables. This becomes our ‘baseline’ against which we can evaluate the model as we add explanatory variables.
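As a quick sketch of why the intercept-only model makes a sensible baseline (made-up numbers, not the housing data): its single estimated coefficient is just the mean of the response, so every fitted value is the sample mean.

```r
# Intercept-only model: the lone coefficient equals the mean of the response
# (illustrative numbers, not the housing data)
y <- c(10, 12, 15, 20, 23)
fit0 <- lm(y ~ 1)

coef(fit0)     # (Intercept) = mean(y) = 16
fitted(fit0)   # every fitted value is the sample mean
```

Any model with explanatory variables must beat this "predict the mean" benchmark to be worth keeping.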

C

Beds has a relatively strong relationship with size, which in turn has a strong relationship with price. This means the current model suffers from multicollinearity, which obscures the relationship between beds and price.
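A simulated sketch of how multicollinearity obscures a coefficient (the variables here are made up, not from house.selling.price.2): when a near-duplicate predictor enters the model, the standard error of the original predictor inflates, making it look insignificant even though it truly drives the response.

```r
# Simulated multicollinearity: x2 is nearly a copy of x1, so the model
# cannot cleanly separate their effects (illustrative data only)
set.seed(1)
n  <- 100
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.05)   # highly correlated with x1
y  <- 2 * x1 + rnorm(n)

se_alone    <- summary(lm(y ~ x1))$coefficients["x1", "Std. Error"]
se_together <- summary(lm(y ~ x1 + x2))$coefficients["x1", "Std. Error"]

cor(x1, x2)               # close to 1
se_together > se_alone    # standard error inflates with both predictors in
```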

D

Code
# Candidate models of decreasing complexity
fit_S   <- lm(P ~ S + Ba + New, data = house.selling.price.2)  # size, baths, new
fit_Be  <- lm(P ~ S + New, data = house.selling.price.2)       # size, new
fit_Ba  <- lm(P ~ New, data = house.selling.price.2)           # new only
fit_New <- lm(P ~ 1, data = house.selling.price.2)             # intercept only

R2

  • With an R2 of 0.87: S + Ba + New
Code
summary(fit_S)$r.squared
[1] 0.8681361
Code
summary(fit_Be)$r.squared
[1] 0.8483699
Code
summary(fit_Ba)$r.squared
[1] 0.1271307
Code
summary(fit_New)$r.squared
[1] 0

Adjusted R2

  • With an Adjusted R2 of 0.86: S + Ba + New
Code
summary(fit_S)$adj.r.squared
[1] 0.8636912
Code
summary(fit_Be)$adj.r.squared
[1] 0.8450003
Code
summary(fit_Ba)$adj.r.squared
[1] 0.1175388
Code
summary(fit_New)$adj.r.squared
[1] 0

PRESS

  • Again, S + Ba + New has the smallest PRESS, which means it has the best ‘predictive’ power compared to the other models.
Code
press_stat <- function(model) {
  # Calculate PRESS residuals
  pr <- resid(model) / (1 - lm.influence(model)$hat)
  
  # Compute the PRESS statistic
  press <- sum(pr^2)
  
  return(press)
}
Code
press_stat(fit_S)
[1] 27860.05
Code
press_stat(fit_Be)
[1] 31066
Code
press_stat(fit_Ba)
[1] 164039.3
Code
press_stat(fit_New)
[1] 183531.6
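The hat-value shortcut inside press_stat is algebraically identical to refitting the model with each observation held out and predicting it from the rest. A quick check on the built-in mtcars data (purely illustrative, not the housing data):

```r
# Verify the PRESS shortcut against an explicit leave-one-out loop
# (built-in mtcars data, for illustration only)
fit_press <- lm(mpg ~ wt + hp, data = mtcars)

# Shortcut: PRESS residual_i = ordinary residual_i / (1 - leverage_i)
press_fast <- sum((resid(fit_press) / (1 - lm.influence(fit_press)$hat))^2)

# Brute force: drop row i, refit, predict the held-out row
loo_err <- sapply(seq_len(nrow(mtcars)), function(i) {
  fit_i <- lm(mpg ~ wt + hp, data = mtcars[-i, ])
  mtcars$mpg[i] - predict(fit_i, newdata = mtcars[i, ])
})
press_slow <- sum(loo_err^2)

all.equal(press_fast, press_slow)  # TRUE: the shortcut is exact
```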

AIC

  • Again, S + Ba + New has the smallest AIC, which suggests it does a better job of fitting the data without overfitting.
Code
AIC(fit_S)
[1] 789.1366
Code
AIC(fit_Be)
[1] 800.1262
Code
AIC(fit_Ba)
[1] 960.908
Code
AIC(fit_New)
[1] 971.5532
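For reference, AIC is 2k - 2 times the log-likelihood, where k counts every estimated parameter (the coefficients plus the residual variance). A sketch on the built-in mtcars data (illustrative only):

```r
# AIC = 2k - 2*logLik, with k = coefficients + residual variance
# (built-in mtcars data, just to illustrate the formula)
fit_aic <- lm(mpg ~ wt, data = mtcars)

ll <- logLik(fit_aic)
k  <- attr(ll, "df")          # here 3: intercept, slope, sigma^2
aic_manual <- 2 * k - 2 * as.numeric(ll)

all.equal(aic_manual, AIC(fit_aic))  # TRUE
```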

BIC

  • Again, S + Ba + New has the smallest BIC, which suggests it does a better job of fitting the data without overfitting.
Code
BIC(fit_S)
[1] 801.7996
Code
BIC(fit_Be)
[1] 810.2566
Code
BIC(fit_Ba)
[1] 968.5058
Code
BIC(fit_New)
[1] 976.6184
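BIC replaces AIC's penalty of 2 per parameter with log(n), so it penalizes model complexity more heavily as the sample grows. A sketch on built-in data (illustrative only):

```r
# BIC = log(n)*k - 2*logLik (built-in mtcars data, to illustrate the formula)
fit_bic <- lm(mpg ~ wt, data = mtcars)

ll <- logLik(fit_bic)
k  <- attr(ll, "df")      # estimated parameters, including sigma^2
n  <- nobs(fit_bic)
bic_manual <- log(n) * k - 2 * as.numeric(ll)

all.equal(bic_manual, BIC(fit_bic))  # TRUE
```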

E

I prefer P ~ S + Ba + New. All of its coefficients are statistically significant, and it outperforms the ‘less complex’ models on every metric. It also makes sense that several different factors influence home price.

Question 2

Code
data(trees)
head(trees)
  Girth Height Volume
1   8.3     70   10.3
2   8.6     65   10.3
3   8.8     63   10.2
4  10.5     72   16.4
5  10.7     81   18.8
6  10.8     83   19.7
Code
str(trees)
'data.frame':   31 obs. of  3 variables:
 $ Girth : num  8.3 8.6 8.8 10.5 10.7 10.8 11 11 11.1 11.2 ...
 $ Height: num  70 65 63 72 81 83 66 75 80 75 ...
 $ Volume: num  10.3 10.3 10.2 16.4 18.8 19.7 15.6 18.2 22.6 19.9 ...

A

Code
fit <- lm(Volume ~ Girth + Height, data = trees)
summary(fit)

Call:
lm(formula = Volume ~ Girth + Height, data = trees)

Residuals:
    Min      1Q  Median      3Q     Max 
-6.4065 -2.6493 -0.2876  2.2003  8.4847 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -57.9877     8.6382  -6.713 2.75e-07 ***
Girth         4.7082     0.2643  17.816  < 2e-16 ***
Height        0.3393     0.1302   2.607   0.0145 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.882 on 28 degrees of freedom
Multiple R-squared:  0.948, Adjusted R-squared:  0.9442 
F-statistic:   255 on 2 and 28 DF,  p-value: < 2.2e-16

B

Residuals vs Fitted

The curved shape of the line suggests this model violates the linearity assumption, namely that the relationship between the independent and dependent variables is linear.

Scale-Location

The curved shape of the line suggests this model violates the assumption of constant variance (homoscedasticity), which undermines the reliability of the model’s standard errors and significance tests.

Cook’s Distance, Residuals vs Leverage, Cook’s dist vs Leverage

All three charts show a single, potentially influential point, which can affect whether our model meets the assumptions of linear regression and can disproportionately pull the regression coefficients.

Code
par(mfrow = c(2,3))
plot(fit, which = 1:6)
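To pin down which observation the influence plots are flagging, the Cook’s distances can be extracted directly; the 4/n cutoff below is a common rule of thumb, not a hard threshold.

```r
# Identify the flagged observation numerically (same model as above)
fit_trees <- lm(Volume ~ Girth + Height, data = trees)
cd <- cooks.distance(fit_trees)

which.max(cd)               # row index of the most influential tree
trees[which.max(cd), ]      # its measurements
max(cd) > 4 / nrow(trees)   # exceeds the common 4/n rule-of-thumb cutoff
```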

Question 3

Code
data("florida")

A

Palm Beach is clearly an outlier. For all other counties, there is a relatively weak but clear relationship between the number of Bush votes and Buchanan votes. The diagnostic plots show that the model largely obeys the relevant assumptions for linear regression of homoscedasticity, linearity, and normality of errors. However, they also highlight how the pattern observed in most Florida counties does not hold in Palm Beach.
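Since the florida data ships with a course package, here is a self-contained sketch on simulated data (all numbers made up) of how standardized residuals flag a Palm-Beach-style outlier, the same diagnostic the plots below rely on:

```r
# Simulated sketch: one planted outlier, flagged by its standardized residual
set.seed(42)
x <- rnorm(50, mean = 100, sd = 20)
y <- 0.005 * x + rnorm(50, sd = 0.1)
y[50] <- y[50] + 5                 # plant one Palm-Beach-style outlier

fit_sim <- lm(y ~ x)
rs <- rstandard(fit_sim)

which(abs(rs) > 3)                 # flags only observation 50
```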

Code
fit <- lm(Buchanan ~ Bush, data = florida)
summary(fit)

Call:
lm(formula = Buchanan ~ Bush, data = florida)

Residuals:
    Min      1Q  Median      3Q     Max 
-907.50  -46.10  -29.19   12.26 2610.19 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 4.529e+01  5.448e+01   0.831    0.409    
Bush        4.917e-03  7.644e-04   6.432 1.73e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 353.9 on 65 degrees of freedom
Multiple R-squared:  0.3889,    Adjusted R-squared:  0.3795 
F-statistic: 41.37 on 1 and 65 DF,  p-value: 1.727e-08
Code
par(mfrow = c(2,3))
plot(fit, which = 1:6)

B

While taking the log of the independent and dependent variables increases the R-squared and reduces the p-value, Palm Beach remains an outlier, which suggests there is something ‘different’ about that county compared to the other counties in Florida.

Code
fit <- lm(log(Buchanan) ~ log(Bush), data=florida)
summary(fit)

Call:
lm(formula = log(Buchanan) ~ log(Bush), data = florida)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.96075 -0.25949  0.01282  0.23826  1.66564 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -2.57712    0.38919  -6.622 8.04e-09 ***
log(Bush)    0.75772    0.03936  19.251  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4673 on 65 degrees of freedom
Multiple R-squared:  0.8508,    Adjusted R-squared:  0.8485 
F-statistic: 370.6 on 1 and 65 DF,  p-value: < 2.2e-16
Code
par(mfrow = c(2,3))
plot(fit, which = 1:6)