---
title: "Homework 5"
author: "Asch Harwood"
description: "Homework 5"
date: "5/9/2023"
format:
html:
toc: true
code-fold: true
code-copy: true
code-tools: true
categories:
- hw5
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(
echo = TRUE,
warning = FALSE,
message = FALSE
)
library(dplyr)
library(knitr)
library(ggplot2)
library(alr4)
library(smss)
```
# Question 1
```{r}
data(house.selling.price.2)
```
### A
Beds would be deleted first: it has the highest p-value (0.487) and is not statistically significant.
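As an illustration of the idea (simulated data, not the actual housing dataset; variable names here are invented), R's `step()` automates a similar backward search, though it drops terms by AIC rather than by p-value:

```{r}
# Backward elimination: repeatedly drop the weakest predictor.
# step() does this using AIC; manual elimination uses the largest p-value.
set.seed(5)
d <- data.frame(size = rnorm(80), beds = rnorm(80), baths = rnorm(80))
d$price <- 2 * d$size + d$baths + rnorm(80)   # 'beds' is pure noise here
fit_full <- lm(price ~ size + beds + baths, data = d)
fit_reduced <- step(fit_full, direction = "backward", trace = 0)
names(coef(fit_reduced))   # the genuine predictors survive the search
```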
### B
The first fit would be an intercept-only model, which has no explanatory variables. This becomes our 'baseline' against which we can evaluate the model as we add explanatory variables.
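A quick self-contained check (simulated data) of why the intercept-only fit is a sensible baseline: its single coefficient is just the sample mean of the response.

```{r}
# lm(y ~ 1) estimates exactly one parameter: the mean of y
set.seed(42)
y <- rnorm(20, mean = 10)
fit0 <- lm(y ~ 1)
all.equal(unname(coef(fit0)), mean(y))  # TRUE
```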
### C
Beds has a relatively strong relationship with size, which in turn has a strong relationship with price. This means the current model suffers from multicollinearity, which obscures the individual relationship between beds and price.
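One way to quantify this is the variance inflation factor (VIF). A minimal sketch with simulated collinear predictors (the helper `vif_manual` is illustrative, not from any package; `car::vif` provides the same diagnostic):

```{r}
# VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing predictor j
# on the remaining predictors. Values well above ~5 flag multicollinearity.
vif_manual <- function(fit) {
  X <- model.matrix(fit)[, -1, drop = FALSE]  # drop the intercept column
  sapply(seq_len(ncol(X)), function(j) {
    r2 <- summary(lm(X[, j] ~ X[, -j, drop = FALSE]))$r.squared
    1 / (1 - r2)
  })
}

set.seed(1)
size <- rnorm(100)
beds <- size + rnorm(100, sd = 0.1)   # nearly a copy of size
price <- 2 * size + rnorm(100)
vif_manual(lm(price ~ size + beds))   # both VIFs come out large
```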
### D
```{r}
fit_S <- lm(P ~ S + Ba + New, data = house.selling.price.2)
fit_Be <- lm(P ~ S + New, data = house.selling.price.2)
fit_Ba <- lm(P ~ New, data = house.selling.price.2)
fit_New <- lm(P ~ 1, data = house.selling.price.2)
```
#### R2
- With an R2 of 0.87: S + Ba + New
```{r}
summary(fit_S)$r.squared
summary(fit_Be)$r.squared
summary(fit_Ba)$r.squared
summary(fit_New)$r.squared
```
#### Adjusted R2
- With an Adjusted R2 of 0.86: S + Ba + New
```{r}
summary(fit_S)$adj.r.squared
summary(fit_Be)$adj.r.squared
summary(fit_Ba)$adj.r.squared
summary(fit_New)$adj.r.squared
```
#### PRESS
- Again, S + Ba + New has the smallest PRESS, which means it has the best 'predictive' power compared to the other models.
```{r}
press_stat <- function(model) {
# Calculate PRESS residuals
pr <- resid(model) / (1 - lm.influence(model)$hat)
# Compute the PRESS statistic
press <- sum(pr^2)
return(press)
}
```
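The division by `1 - hat` above relies on the standard leave-one-out identity e_(i) = e_i / (1 - h_ii). A self-contained sanity check against explicit refitting (simulated data, not the housing data):

```{r}
# PRESS via the hat-matrix shortcut equals explicit leave-one-out refitting
set.seed(3)
d <- data.frame(x = rnorm(15))
d$y <- 1 + 2 * d$x + rnorm(15)
fit_loo <- lm(y ~ x, data = d)

loo_resid <- sapply(seq_len(nrow(d)), function(i) {
  f <- lm(y ~ x, data = d[-i, ])                      # refit without row i
  d$y[i] - predict(f, newdata = d[i, , drop = FALSE]) # held-out residual
})

press_shortcut <- sum((resid(fit_loo) / (1 - lm.influence(fit_loo)$hat))^2)
all.equal(sum(loo_resid^2), press_shortcut)  # TRUE
```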
```{r}
press_stat(fit_S)
press_stat(fit_Be)
press_stat(fit_Ba)
press_stat(fit_New)
```
#### AIC
- Again, S + Ba + New has the smallest AIC, which suggests it does a better job of fitting the data without overfitting.
```{r}
AIC(fit_S)
AIC(fit_Be)
AIC(fit_Ba)
AIC(fit_New)
```
#### BIC
- Again, S + Ba + New has the smallest BIC, which suggests it does a better job of fitting the data without overfitting.
```{r}
BIC(fit_S)
BIC(fit_Be)
BIC(fit_Ba)
BIC(fit_New)
```
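For reference, both criteria decompose as fit plus a complexity penalty; a self-contained check (simulated data) that R's `AIC()`/`BIC()` match the textbook formulas:

```{r}
# AIC = -2*logLik + 2k ; BIC = -2*logLik + log(n)*k
# (k counts all estimated parameters, including the error variance)
set.seed(7)
y_sim <- rnorm(50)
x_sim <- rnorm(50)
fit_ic <- lm(y_sim ~ x_sim)
ll <- logLik(fit_ic)
k <- attr(ll, "df")
n <- nobs(fit_ic)
c(AIC(fit_ic), -2 * as.numeric(ll) + 2 * k)
c(BIC(fit_ic), -2 * as.numeric(ll) + log(n) * k)
```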
### E
I prefer P \~ S + Ba + New. All coefficients are statistically significant, and it outperforms the 'less complex' models on all metrics. It also makes sense that several different factors influence home price.
# Question 2
```{r}
data(trees)
head(trees)
str(trees)
```
### A
```{r}
fit <- lm(Volume ~ Girth + Height, data = trees)
summary(fit)
```
### B
**Residuals vs Fitted**
The curved shape of the line suggests this model violates the linearity assumption, i.e., that the relationship between the independent and dependent variables is linear.
**Scale-Location**
The curved shape of the line suggests this model violates our assumption of constant variance (homoscedasticity), which undermines the validity of the model's standard errors and significance tests.
**Cook's Distance, Residuals vs Leverage, Cook's dist vs Leverage**
All three charts show a single, potentially influential point, which can disproportionately affect our regression coefficients and whether the model meets the assumptions of linear regression.
```{r cache=TRUE}
par(mfrow = c(2,3))
plot(fit, which = 1:6)
```
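Beyond eyeballing the plots, the influential point can also be flagged numerically; a common rule of thumb (a heuristic, not a hard rule) is Cook's distance above 4/n:

```{r}
# Flag observations whose Cook's distance exceeds the 4/n rule of thumb
fit_trees <- lm(Volume ~ Girth + Height, data = trees)
cd <- cooks.distance(fit_trees)
which(cd > 4 / nrow(trees))
```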
# Question 3
```{r}
data("florida")
```
### A
Palm Beach is clearly an outlier. For all other counties, there is a relatively weak but clear relationship between the number of Bush votes and Buchanan votes. The diagnostic plots show that the model largely obeys the relevant assumptions for linear regression of homoscedasticity, linearity, and normality of errors. However, they also highlight that the pattern observed in most Florida counties does not hold in Palm Beach.
```{r}
fit <- lm(Buchanan ~ Bush, data = florida)
summary(fit)
```
```{r}
par(mfrow = c(2,3))
plot(fit, which = 1:6)
```
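Outlier status can also be confirmed numerically with externally studentized residuals, where |r| > 3 is a common flag. A self-contained sketch with simulated data and one planted extreme point (invented numbers, not the Florida data):

```{r}
# rstudent() gives externally studentized residuals; a planted extreme
# point stands out the way Palm Beach does in the real data
set.seed(9)
bush_sim <- rnorm(60, mean = 60000, sd = 20000)
buchanan_sim <- 0.005 * bush_sim + rnorm(60, sd = 50)
buchanan_sim[60] <- buchanan_sim[60] + 2000   # one county far off the trend
fit_sim <- lm(buchanan_sim ~ bush_sim)
which(abs(rstudent(fit_sim)) > 3)
```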
### B
While taking the log of both the independent and dependent variables increases the R-squared and reduces the p-value, Palm Beach remains an outlier, which suggests there is something 'different' about that county compared to the other counties in Florida.
```{r}
fit <- lm(log(Buchanan) ~ log(Bush), data=florida)
summary(fit)
```
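One interpretive bonus of the log-log specification: the slope is an elasticity, so here a 1% increase in Bush votes is associated with roughly a 0.76% increase in Buchanan votes. A self-contained check (simulated data with a known elasticity):

```{r}
# In log(y) ~ log(x), the slope estimates the elasticity of y w.r.t. x
set.seed(11)
x_el <- exp(rnorm(200))
y_el <- x_el^0.75 * exp(rnorm(200, sd = 0.3))   # true elasticity = 0.75
coef(lm(log(y_el) ~ log(x_el)))[["log(x_el)"]]  # close to 0.75
```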
```{r}
par(mfrow = c(2,3))
plot(fit, which = 1:6)
```