DACSS 603: Homework 5

hw5
Tory Bartelloni
Author

Tory Bartelloni

Published

December 11, 2022

Question 1

For the house.selling.price.2 data the tables below show a correlation matrix and a model fit using four predictors of selling price.

Code
library(smss)
data("house.selling.price.2")

1.A

For backward elimination, which variable would be deleted first? Why?

Answer: Beds would be the first variable eliminated due to it having the largest p-value.

1.B

For forward selection, which variable would be added first? Why?

Answer: The first variable to be added would be size due to it having the smallest p-value.

1.C

Why do you think that BEDS has such a large P-value in the multiple regression model, even though it has a substantial correlation with PRICE?

Answer: Beds has a large p-value in the multiple regression model likely because the model is holding other values constant, which is showing that Beds has a lower explanatory effect when we remove the effects of other variables. There are other variables that have similar properties, like Size, which when controlled for reduces the effect of Beds.

1.D

Using software with these four predictors, find the model that would be selected using each criterion:

First I’ll build the models.

Code
library(stargazer)
house_model <- lm(P ~ S + Be + Ba + New,
                  data=house.selling.price.2)
noSize_model <- lm(P ~ Be + Ba + New,
                  data=house.selling.price.2)
noBeds_model <- lm(P ~ S + Ba + New,
                  data=house.selling.price.2)
noBaths_model <- lm(P ~ S + Be + New,
                  data=house.selling.price.2)
noNew_model <- lm(P ~ S + Be + Ba,
                  data=house.selling.price.2)
sizeAndNew_model <- lm(P ~ S + New,
                  data=house.selling.price.2)
bathsAndNew_model <- lm(P ~ Ba + New,
                  data=house.selling.price.2)

models <- list(house_model, noSize_model, noBeds_model,
               noBaths_model, noNew_model,
               sizeAndNew_model, bathsAndNew_model)

#stargazer(house_model, type="text")

rsquared <- function(fit) summary(fit)$r.squared
adjrsquared <- function(fit) summary(fit)$adj.r.squared
PRESS <- function(fit) {
    pr <- residuals(fit)/(1 - lm.influence(fit)$hat)
    sum(pr^2)
}

model_fits <- data.frame(Model = c("house_model",
                                   "noSize_model",
                                   "noBeds_model",
                                   "noBaths_model",
                                   "noNew_model",
                                   "sizeAndNew_model",
                                   "bathsAndNew_model"),
                         RSquared = sapply(models, rsquared),
                         AdjRSquared = sapply(models, adjrsquared),
                         PRESS = sapply(models, PRESS),
                         AIC = sapply(models, AIC),
                         BIC = sapply(models, BIC))

knitr::kable(model_fits) 
Model RSquared AdjRSquared PRESS AIC BIC
house_model 0.8688630 0.8629022 28390.22 790.6225 805.8181
noSize_model 0.6716595 0.6605919 67942.88 873.9787 886.6417
noBeds_model 0.8681361 0.8636912 27860.05 789.1366 801.7996
noBaths_model 0.8516484 0.8466478 30836.38 800.0934 812.7564
noNew_model 0.8330530 0.8274256 35097.34 811.0758 823.7388
sizeAndNew_model 0.8483699 0.8450003 31066.00 800.1262 810.2566
bathsAndNew_model 0.5624769 0.5527542 87677.82 898.6770 908.8074

The R-Squared model favors the full model. It does not account for the addtional variables being included, which inflates the R-Squared value so the fact that there are other, close R-Squared values indicates this may not be the best model.

All other metrics favor the No Beds Model, which includes all variables other than the Beds variable.

1.E

Based on the results above, where all of our metrics other than R-Squared favor a single model, I prefer that model (No Beds Model).

Question 2

Code
data(trees)

Tree volume estimation is a big deal, especially in the lumber industry. Use the trees data to build a basic model of tree volume prediction. In particular…

2.A

…fit a multiple regression model with the Volume as the outcome and Girth and Height as the explanatory variables

Code
tree_model <- lm(Volume ~ Girth + Height, data=trees)

summary(tree_model)

Call:
lm(formula = Volume ~ Girth + Height, data = trees)

Residuals:
    Min      1Q  Median      3Q     Max 
-6.4065 -2.6493 -0.2876  2.2003  8.4847 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -57.9877     8.6382  -6.713 2.75e-07 ***
Girth         4.7082     0.2643  17.816  < 2e-16 ***
Height        0.3393     0.1302   2.607   0.0145 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.882 on 28 degrees of freedom
Multiple R-squared:  0.948, Adjusted R-squared:  0.9442 
F-statistic:   255 on 2 and 28 DF,  p-value: < 2.2e-16

2.B

Run regression diagnostic plots on the model. Based on the plots, do you think any of the regression assumptions is violated?

Code
library(ggResidpanel)

resid_panel(tree_model,
            plots = "all",
            smoother = TRUE)

Residual and index plot indicate possible violations of the linearity assumption.

Location-Scale plot and Residual plot indicate possible issues with homoskedasticity, but maybe not definitive.

Residual-Leverage plot also indicates possible issues with representativeness shwoing an outlier with large leverage and Cook’s distance.

Question 3

Code
library(alr4)
data("florida")

In the 2000 election for U.S. president, the counting of votes in Florida was controversial. In Palm Beach County in south Florida, for example, voters used a so-called butterfly ballot. Some believe that the layout of the ballot caused some voters to cast votes for Buchanan when their intended choice was Gore.

3.A

Run a simple linear regression model where the Buchanan vote is the outcome and the Bush vote is the explanatory variable. Produce the regression diagnostic plots. Is Palm Beach County an outlier based on the diagnostic plots? Why or why not?

Code
vote_model <- lm(Buchanan ~ Bush, data=florida)

resid_panel(vote_model, plots=c("resid","cookd","lev"))

Yes. In the three plots included here we see Palm Beach County clearly as an outlier. This is indicated by a large Cook’s Distance and the large residual compared to the other counties.

3.B

Take the log of both variables (Bush vote and Buchanan Vote) and repeat the analysis in (a). Does your findings change?

Code
vote_model <- lm(log(Buchanan) ~ log(Bush), data=florida)

resid_panel(vote_model, plots=c("resid","cookd","lev"))

The log-transformed model eliminates the large residual and significantly reduces the Cook’s Distance. When laid over the Leverage plot we see more clearly that it’s outlier status would be much less of a concern.