hw5
descriptive statistics
probability
Homework 5
Author

Caitlin Rowley

Published

May 9, 2023

Code
# load libraries
library(tidyr)
library(dplyr)

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
Code
library(magrittr)

Attaching package: 'magrittr'
The following object is masked from 'package:tidyr':

    extract
Code
library(ggplot2)
library(markdown)
library(ggtext)
Warning: package 'ggtext' was built under R version 4.2.2
Code
library(readxl)
Warning: package 'readxl' was built under R version 4.2.2
Code
library(alr4)
Warning: package 'alr4' was built under R version 4.2.3
Loading required package: car
Warning: package 'car' was built under R version 4.2.3
Loading required package: carData
Warning: package 'carData' was built under R version 4.2.3

Attaching package: 'car'
The following object is masked from 'package:dplyr':

    recode
Loading required package: effects
Warning: package 'effects' was built under R version 4.2.3
lattice theme set by effectsTheme()
See ?effectsTheme for details.
Code
library(smss)
Warning: package 'smss' was built under R version 4.2.3
Code
library(stargazer)

Please cite as: 
 Hlavac, Marek (2022). stargazer: Well-Formatted Regression and Summary Statistics Tables.
 R package version 5.2.3. https://CRAN.R-project.org/package=stargazer 
Code
library(MPV)
Warning: package 'MPV' was built under R version 4.2.3
Loading required package: lattice
Loading required package: KernSmooth
KernSmooth 2.23 loaded
Copyright M. P. Wand 1997-2009
Loading required package: randomForest
Warning: package 'randomForest' was built under R version 4.2.2
randomForest 4.7-1.1
Type rfNews() to see new features/changes/bug fixes.

Attaching package: 'randomForest'
The following object is masked from 'package:ggplot2':

    margin
The following object is masked from 'package:dplyr':

    combine
Code
library(AICcmodavg)
Warning: package 'AICcmodavg' was built under R version 4.2.3

Attaching package: 'AICcmodavg'
The following object is masked from 'package:randomForest':

    importance

Question 1

(Data file: house.selling.price.2 from smss R package)

Code
data("house.selling.price.2")
head(house.selling.price.2)
      P    S Be Ba New
1  48.5 1.10  3  1   0
2  55.0 1.01  3  2   0
3  68.0 1.45  3  2   0
4 137.0 2.40  3  3   0
5 309.4 3.30  4  3   1
6  17.5 0.40  1  1   0

For the house.selling.price.2 data the tables below show a correlation matrix and a model fit using
four predictors of selling price.

Regression Output (Outcome: House Price):

             Estimate Std. Error t value Pr(>|t|)
(Intercept)   -41.795     12.104  -3.453    0.001
Size           64.761      5.630  11.504    0.000
Beds           -2.766      3.960  -0.698    0.487
Baths          19.203      5.650   3.399    0.001
New            18.984      3.873   4.902    0.000

(Hint 1: You should be able to answer A, B, C just using the tables below, although you should
feel free to load the data in R and work with it if you so choose. They will be consistent with what
you see on the tables.

Hint 2: The p-value of a variable in a simple linear regression is the same p-value one would get
from a Pearson’s correlation (cor.test). The p-value is a function of the magnitude of the correlation
coefficient (the higher the coefficient, the lower the p-value) and of sample size (larger samples
lead to smaller p-values). For the correlations shown in the tables, they are between variables of
the same length.)

With these four predictors,

A. For backward elimination, which variable would be deleted first? Why?

  • Using backward elimination, BEDS would be deleted first because it has the largest p-value (0.487), making it the least statistically significant predictor in the model.

B. For forward selection, which variable would be added first? Why?

Code
reg <- lm(P ~ S + Ba + Be + New, data=house.selling.price.2)
summary(reg)

Call:
lm(formula = P ~ S + Ba + Be + New, data = house.selling.price.2)

Residuals:
    Min      1Q  Median      3Q     Max 
-36.212  -9.546   1.277   9.406  71.953 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -41.795     12.104  -3.453 0.000855 ***
S             64.761      5.630  11.504  < 2e-16 ***
Ba            19.203      5.650   3.399 0.001019 ** 
Be            -2.766      3.960  -0.698 0.486763    
New           18.984      3.873   4.902  4.3e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 16.36 on 88 degrees of freedom
Multiple R-squared:  0.8689,    Adjusted R-squared:  0.8629 
F-statistic: 145.8 on 4 and 88 DF,  p-value: < 2.2e-16
  • Using forward selection, SIZE would be added first because it has the strongest relationship with PRICE of the four predictors, with the smallest p-value (< 2e-16).

C. Why do you think that BEDS has such a large P-value in the multiple regression model, even though it has a substantial correlation with PRICE?

  • In the multiple regression model, the coefficient on BEDS estimates its effect on PRICE while holding SIZE, BATHS, and NEW constant. BEDS is itself strongly correlated with SIZE (larger houses tend to have more bedrooms), so once SIZE is in the model, BEDS carries little additional information about PRICE. This multicollinearity inflates the standard error on BEDS and produces the large p-value, even though its bivariate correlation with PRICE is substantial. A quick check of the relevant correlations is sketched below.
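To support this explanation, the bivariate correlations can be inspected directly; a minimal check (output not shown) using the already-loaded data:

Code
# correlation of BEDS with SIZE and with PRICE
with(house.selling.price.2, cor(Be, S))  # bedrooms vs. size
with(house.selling.price.2, cor(Be, P))  # bedrooms vs. price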

D. Using software with these four predictors, find the model that would be selected using each criterion:

  • reg_1 = SIZE, BEDS, BATHS, NEW (all four predictors)

  • reg_2 = SIZE, BATHS, NEW (backward elimination - remove BEDS)

  • reg_3 = SIZE (forward selection - add SIZE first)

1. R2

Code
reg_1 <- lm(P ~ S + Ba + Be + New, data=house.selling.price.2)

reg_2 <- lm(P ~ S + Ba + New, data=house.selling.price.2)
            
reg_3 <- lm(P ~ S, data=house.selling.price.2)

stargazer(reg_1, reg_2, reg_3, type='text')

===========================================================================================
                                              Dependent variable:                          
                    -----------------------------------------------------------------------
                                                       P                                   
                              (1)                     (2)                     (3)          
-------------------------------------------------------------------------------------------
S                          64.761***               62.263***               75.607***       
                            (5.630)                 (4.335)                 (3.865)        
                                                                                           
Ba                         19.203***               20.072***                               
                            (5.650)                 (5.495)                                
                                                                                           
Be                          -2.766                                                         
                            (3.960)                                                        
                                                                                           
New                        18.984***               18.371***                               
                            (3.873)                 (3.761)                                
                                                                                           
Constant                  -41.795***              -47.992***              -25.194***       
                           (12.104)                 (8.209)                 (6.688)        
                                                                                           
-------------------------------------------------------------------------------------------
Observations                  93                      93                      93           
R2                           0.869                   0.868                   0.808         
Adjusted R2                  0.863                   0.864                   0.806         
Residual Std. Error    16.360 (df = 88)        16.313 (df = 89)        19.473 (df = 91)    
F Statistic         145.763*** (df = 4; 88) 195.313*** (df = 3; 89) 382.628*** (df = 1; 91)
===========================================================================================
Note:                                                           *p<0.1; **p<0.05; ***p<0.01
  • Based on the stargazer table above, the highest R-squared value across the three models belongs to reg_1, which includes all four predictors (R-squared = 0.869). This indicates a good level of fit, but R-squared never decreases when predictors are added, so it mechanically favors the largest model and is a weak criterion for choosing among the three.

2. Adjusted R2

  • Based on the stargazer table above, the highest adjusted R-squared value across the three models belongs to reg_2 (0.864), which excludes BEDS as a predictor. Adjusted R-squared is a more reliable measure than R-squared here because it penalizes the model for each additional predictor.
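For reference, the adjustment is an explicit penalty on the number of predictors; a minimal sketch of the computation for reg_2 (k = 3 predictors), which should reproduce the value reported by stargazer:

Code
# Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - k - 1)
r2 <- summary(reg_2)$r.squared
n <- nobs(reg_2)  # 93 observations
k <- 3            # predictors in reg_2: S, Ba, New
1 - (1 - r2) * (n - 1) / (n - k - 1)  # should match summary(reg_2)$adj.r.squared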

3. PRESS

Code
PRESS(reg_1)
[1] 28390.22
Code
PRESS(reg_2)
[1] 27860.05
Code
PRESS(reg_3)
[1] 38203.29

We can see from the PRESS statistics above that reg_2 has the lowest value (27,860.05), which indicates that it has the best leave-one-out predictive performance of the three models.
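For linear models, PRESS can also be computed directly from the ordinary residuals and the hat values; a minimal sketch, which should reproduce the MPV::PRESS() value for reg_2:

Code
# PRESS = sum of squared leave-one-out prediction errors:
# PRESS = sum( (e_i / (1 - h_ii))^2 )
sum((resid(reg_2) / (1 - hatvalues(reg_2)))^2)  # should match PRESS(reg_2)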

4. AIC

Code
models <- list(reg_1, reg_2, reg_3)
mod.names <- c('reg_1', 'reg_2', 'reg_3')

aictab(cand.set = models, modnames = mod.names)

Model selection based on AICc:

      K   AICc Delta_AICc AICcWt Cum.Wt      LL
reg_2 5 789.83       0.00   0.71   0.71 -389.57
reg_1 6 791.60       1.77   0.29   1.00 -389.31
reg_3 3 820.41      30.59   0.00   1.00 -407.07

We can see from the AICc table above that reg_2 has the lowest value. Because AICc balances goodness of fit against the number of estimated parameters (with an additional small-sample correction), the model with the lowest AICc, reg_2, is the best fit that does not carry unnecessary parameters.
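AICc is base-R AIC plus a small-sample correction that vanishes as n grows relative to k; a minimal sketch for reg_2, which should reproduce the AICc reported above:

Code
# AICc = AIC + 2k(k + 1) / (n - k - 1), where k counts all estimated parameters
k <- attr(logLik(reg_2), "df")  # 4 coefficients + error variance = 5
n <- nobs(reg_2)
AIC(reg_2) + (2 * k * (k + 1)) / (n - k - 1)  # should match the table's 789.83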

5. BIC

Code
BIC(reg_1)
[1] 805.8181
Code
BIC(reg_2)
[1] 801.7996
Code
BIC(reg_3)
[1] 827.7417

We can see from the BIC values above, which penalize the addition of unnecessary variables even more heavily than AIC, that reg_2 again has the lowest value. Thus it is the best-fitting model by this criterion as well.
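BIC swaps AIC's 2k penalty for k·log(n), which is heavier for any n above about 7; a minimal manual check for reg_2:

Code
# BIC = k*log(n) - 2*logLik
k <- attr(logLik(reg_2), "df")
n <- nobs(reg_2)
k * log(n) - 2 * as.numeric(logLik(reg_2))  # should match BIC(reg_2)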

E. Explain which model you prefer and why.

I prefer reg_2, the model that excludes BEDS as a predictor, because every comparison criterion that accounts for model complexity (adjusted R-squared, PRESS, AICc, and BIC) selected it as the best fit. It keeps the predictors with real explanatory power (SIZE, BATHS, and NEW) while dropping the one that adds nothing once the others are included.

Question 2

(Data file: trees from base R)

From the documentation:
“This data set provides measurements of the diameter, height and volume of timber in 31 felled
black cherry trees. Note that the diameter (in inches) is erroneously labeled Girth in the data. It is
measured at 4 ft 6 in above the ground.”

Tree volume estimation is a big deal, especially in the lumber industry. Use the trees data to build
a basic model of tree volume prediction. In particular,

A. Fit a multiple regression model with the Volume as the outcome and Girth and Height as the explanatory variables.

Code
data(trees)

tree <- lm(Volume ~ Girth + Height, data=trees)
summary(tree)

Call:
lm(formula = Volume ~ Girth + Height, data = trees)

Residuals:
    Min      1Q  Median      3Q     Max 
-6.4065 -2.6493 -0.2876  2.2003  8.4847 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -57.9877     8.6382  -6.713 2.75e-07 ***
Girth         4.7082     0.2643  17.816  < 2e-16 ***
Height        0.3393     0.1302   2.607   0.0145 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.882 on 28 degrees of freedom
Multiple R-squared:  0.948, Adjusted R-squared:  0.9442 
F-statistic:   255 on 2 and 28 DF,  p-value: < 2.2e-16
  • The regression output shows that both girth (p-value < 2e-16) and height (p-value = 0.0145) are statistically significant predictors of volume. The adjusted R-squared value is also very close to 1 (0.9442), so the model fits the data well. A quick prediction example follows below.
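With the fitted model, predicting the volume of a new tree is straightforward; the girth and height values below are hypothetical, chosen only for illustration:

Code
# predicted volume (with a 95% prediction interval) for a hypothetical tree
predict(tree, newdata = data.frame(Girth = 12, Height = 75),
        interval = "prediction")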

B. Run regression diagnostic plots on the model. Based on the plots, do you think any of the regression assumptions is violated?

Code
par(mfrow=c(2,3))
plot(tree, which=1:6)

  • Residuals vs. Fitted: the red trend line is clearly curved and the residuals do not “bounce randomly” around the 0 line, which suggests the linearity assumption is violated.

  • Normal Q-Q: the points fall roughly along the reference line without curving away at the extremes, so the normality assumption appears to hold.

  • Scale-Location: the red line is not horizontal and the vertical spread of the points changes across the fitted values, indicating heteroscedasticity and a violation of the constant-variance assumption.

  • Cook’s Distance: a few observations stand out well above the rest on Cook’s distance, flagging them as potentially influential; whether any exceed the conventional cutoff of 1 can be checked numerically (see the sketch after this list).

  • Residuals vs. Leverage: no data points fall outside the dashed contours marked ‘0.5,’ so no observation combines high leverage with a large residual.

  • Cook’s Distance vs. Leverage: no clearly influential points stand out; this plot adds little information beyond the previous two.
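To put numbers behind the Cook’s Distance reading rather than relying on the plot alone, the influence measures can be extracted directly; a minimal sketch:

Code
# observations whose Cook's distance exceeds the conventional cutoff of 1
cd <- cooks.distance(tree)
cd[cd > 1]
which.max(cd)  # the single most influential observation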

Question 3

(Data file: florida in alr4 R package)

In the 2000 election for U.S. president, the counting of votes in Florida was controversial. In Palm
Beach County in south Florida, for example, voters used a so-called butterfly ballot. Some believe
that the layout of the ballot caused some voters to cast votes for Buchanan when their intended
choice was Gore.

The data has variables for the number of votes for each candidate—Gore, Bush, and Buchanan.

A. Run a simple linear regression model where the Buchanan vote is the outcome and the Bush vote is the explanatory variable. Produce the regression diagnostic plots. Is Palm Beach County an outlier based on the diagnostic plots? Why or why not?

Code
data(florida)

vote <- lm(Buchanan ~ Bush, data=florida)
summary(vote)

Call:
lm(formula = Buchanan ~ Bush, data = florida)

Residuals:
    Min      1Q  Median      3Q     Max 
-907.50  -46.10  -29.19   12.26 2610.19 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 4.529e+01  5.448e+01   0.831    0.409    
Bush        4.917e-03  7.644e-04   6.432 1.73e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 353.9 on 65 degrees of freedom
Multiple R-squared:  0.3889,    Adjusted R-squared:  0.3795 
F-statistic: 41.37 on 1 and 65 DF,  p-value: 1.727e-08
Code
par(mfrow=c(2,3))
plot(vote, which=1:6)

  • Based on the diagnostic plots above, Palm Beach County appears to be an outlier in every view. It violates the regression assumptions in all of the plots except Cook’s Distance vs. Leverage, which, as noted in Question 2, is not particularly informative. A numeric confirmation is sketched below.
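The diagnostic plots label extreme points by row, and in the alr4 florida data the rows are named by county (an assumption worth verifying with rownames(florida)); a quick numeric check:

Code
# county with the largest Cook's distance and largest standardized residual
rownames(florida)[which.max(cooks.distance(vote))]
rownames(florida)[which.max(abs(rstandard(vote)))]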

B. Take the log of both variables (Bush vote and Buchanan vote) and repeat the analysis in (A). Do your findings change?

Code
log_florida <- florida %>%
  transform(log_Bush=log(Bush)) %>%
  transform(log_Buchanan=log(Buchanan))

log_vote <- lm(log_Buchanan ~ log_Bush, data=log_florida)
summary(log_vote)

Call:
lm(formula = log_Buchanan ~ log_Bush, data = log_florida)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.96075 -0.25949  0.01282  0.23826  1.66564 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -2.57712    0.38919  -6.622 8.04e-09 ***
log_Bush     0.75772    0.03936  19.251  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4673 on 65 degrees of freedom
Multiple R-squared:  0.8508,    Adjusted R-squared:  0.8485 
F-statistic: 370.6 on 1 and 65 DF,  p-value: < 2.2e-16
Code
par(mfrow=c(2,3))
plot(log_vote, which=1:6)

After taking the log of both variables, we can see that Palm Beach County is still an outlier in each of the diagnostic plots, and the same regression assumption violations appear, though to a noticeably lesser degree. The overall fit also improves substantially (adjusted R-squared rises from 0.3795 to 0.8485), but the finding does not change: Palm Beach remains anomalous. The same numeric check as in part A, sketched below, can verify this.
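A minimal sketch, reusing the county row names assumed above:

Code
# largest standardized residual in the logged model
idx <- which.max(abs(rstandard(log_vote)))
rownames(log_florida)[idx]
rstandard(log_vote)[idx]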