---
title: "Blog Post #5"
author: "Alexis Gamez"
description: "DACSS 603 HW#5"
date: "05/13/2023"
format:
html:
toc: true
code-fold: true
code-copy: true
code-tools: true
categories:
- hw5
- correlation
  - regression analysis
- diagnostics
---
```{r, warning = F, message = F}
# setup
library(tidyverse)
library(alr4)
library(smss)
library(stargazer)
library(MPV)
knitr::opts_chunk$set(echo = TRUE)
```
# Question 1
```{r}
# loading data
data("house.selling.price.2")
```
*For the house.selling.price.2 data the tables below show a correlation matrix and a model fit using four predictors of selling price.*
## A)
**For backward elimination, which variable would be deleted first? Why?**
Using backward elimination, I would delete the `Beds` variable first. Backward elimination starts from a model containing all candidate predictors and removes, one at a time, the least significant variable until every remaining predictor is significant. In this case, `Beds` is the least significant predictor in the full model, with a p-value of 0.487.
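As a sketch, this procedure can be automated with base R's `step()`. Note that `step()` removes terms by AIC rather than by p-value, so it is an approximation of the rule described above, though here it also drops `Be` first:

```{r}
# sketch: automated backward elimination with step(), which removes the
# term whose deletion most improves AIC (a proxy for the p-value rule)
full <- lm(P ~ S + Be + Ba + New, data = house.selling.price.2)
step(full, direction = "backward", trace = 1)
```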
## B)
**For forward selection, which variable would be added first? Why?**
Forward selection works in the opposite direction: we start with an empty (intercept-only) model and add variables one at a time, each time choosing the most significant remaining candidate. Under these circumstances, I would add the `Size` variable first, as it is the most significant predictor, with a p-value that is effectively 0.
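The mirrored procedure can be sketched with `step()` as well, starting from the null model (again by AIC rather than p-values, as an approximation):

```{r}
# sketch: forward selection with step(), starting from the intercept-only
# model and adding, at each step, the candidate that most improves AIC
null <- lm(P ~ 1, data = house.selling.price.2)
step(null, scope = ~ S + Be + Ba + New, direction = "forward", trace = 1)
```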
## C)
**Why do you think that BEDS has such a large P-value in the multiple regression model, even though it has a substantial correlation with PRICE?**
While `Beds` has a substantial marginal correlation with `Price`, it is also correlated with the other predictors, most notably `Size`: larger houses tend to have more bedrooms. Once `Size`, `Baths`, and `New` are in the model, `Beds` carries little additional information about `Price`, so its coefficient is estimated imprecisely and its p-value is large. In other words, the effect of `Beds` on `Price` is largely absorbed by the other variables already in the model.
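The correlation matrix makes this visible: `Be` correlates with `P`, but also with the other predictors, so little of its information is unique.

```{r}
# Be is correlated with P, but also with the other predictors (notably S),
# so it adds little once they are in the model
round(cor(house.selling.price.2), 2)
```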
## D)
**Using software with these four predictors, find the model that would be selected using each criterion:**
```{r}
summary(house.selling.price.2)
```
For this question, I'll use `stargazer` to display the four candidate fits side by side; the resulting table answers parts 1 & 2.
```{r}
fit1 <- (lm(P ~ S, data= house.selling.price.2))
fit2 <- (lm(P ~ S + New, data= house.selling.price.2))
fit3 <- (lm(P ~ S + Ba + New, data= house.selling.price.2))
fit4 <- (lm(P ~ S + Be + Ba + New, data= house.selling.price.2))
stargazer(fit1, fit2, fit3, fit4, type = 'text')
```
### 1. R^2
The best model according to this criterion would be model 4, as its R^2 value is the highest (0.869). Note, though, that R^2 never decreases when a predictor is added, so this criterion always favors the largest model.
### 2. Adjusted R^2
The best model according to this criterion would be model 3, as its adjusted R^2 value is the highest (0.864).
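For reference, adjusted R^2 applies a degrees-of-freedom penalty to R^2. Recomputing it from the table's R^2 values (n = 93 observations, k predictors) reproduces the stargazer figures of 0.864 and 0.863:

```{r}
# adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - k - 1)
n <- 93
c(model3 = 1 - (1 - 0.868) * (n - 1) / (n - 3 - 1),  # k = 3 predictors
  model4 = 1 - (1 - 0.869) * (n - 1) / (n - 4 - 1))  # k = 4 predictors
```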
### 3. PRESS
Model 1
```{r}
PRESS(fit1)
```
Model 2
```{r}
PRESS(fit2)
```
Model 3
```{r}
PRESS(fit3)
```
Model 4
```{r}
PRESS(fit4)
```
Since a lower PRESS value indicates better out-of-sample predictive performance, I would select model 3 according to this criterion.
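For transparency, `MPV::PRESS()` can be reproduced by hand: the leave-one-out residual for observation i equals the ordinary residual divided by (1 - h_ii), so no refitting is needed.

```{r}
# PRESS = sum of squared leave-one-out residuals, computed from the
# ordinary residuals and the hat (leverage) values
press_by_hand <- function(fit) sum((residuals(fit) / (1 - hatvalues(fit)))^2)
press_by_hand(lm(P ~ S + Ba + New, data = house.selling.price.2))
```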
### 4. AIC
Model 1
```{r}
AIC(fit1)
```
Model 2
```{r}
AIC(fit2)
```
Model 3
```{r}
AIC(fit3)
```
Model 4
```{r}
AIC(fit4)
```
As with the PRESS criterion, the lower the AIC value, the better the model fits after accounting for model complexity. Again, under these circumstances I would select model 3 as the best fit.
### 5. BIC
Model 1
```{r}
BIC(fit1)
```
Model 2
```{r}
BIC(fit2)
```
Model 3
```{r}
BIC(fit3)
```
Model 4
```{r}
BIC(fit4)
```
As with AIC, the lowest BIC value indicates the best-fitting model, with BIC penalizing additional parameters more heavily. Once again, model 3 would be the best fit under this criterion.
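The per-model chunks above can also be collected into a single comparison table; a compact sketch:

```{r}
# gather AIC and BIC for all four candidate models in one data frame
mods <- list(
  fit1 = lm(P ~ S, data = house.selling.price.2),
  fit2 = lm(P ~ S + New, data = house.selling.price.2),
  fit3 = lm(P ~ S + Ba + New, data = house.selling.price.2),
  fit4 = lm(P ~ S + Be + Ba + New, data = house.selling.price.2)
)
data.frame(AIC = sapply(mods, AIC), BIC = sapply(mods, BIC))
```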
## E)
**Explain which model you prefer and why.**
Models 3 & 4 are the only ones favored by any of the criteria above. Even then, model 4 wins only on raw R^2, which always favors the larger model; adjusted R^2 is the more meaningful comparison. Thus, between these two models, I would select model 3 as the best fit, especially since PRESS, AIC, and BIC all point to it as well.
# Question 2
```{r}
# loading data
data("trees")
```
**Tree volume estimation is a big deal, especially in the lumber industry. Use the trees data to build a basic model of tree volume prediction. In particular,**
## A)
**Fit a multiple regression model with the Volume as the outcome and Girth and Height as the explanatory variables.**
```{r}
summary(trees)
```
With the data loaded, I'll fit the model below:
```{r}
fit_t <- lm(Volume ~ Girth + Height, data = trees)
summary(fit_t)
```
## B)
**Run regression diagnostic plots on the model. Based on the plots, do you think any of the regression assumptions is violated?**
```{r}
par(mfrow = c(2, 3)); plot(fit_t, which = 1:6)
```
Immediately, it's apparent that at least a couple of regression assumptions are violated across the 6 diagnostic plots. Most noteworthy are the Residuals vs Fitted, Scale-Location, and Cook's Distance plots. The curved pattern in the first plot indicates that the linearity assumption is violated. Similarly, the lack of a steady horizontal trend in the Scale-Location plot suggests a violation of the constant-variance assumption. Lastly, the Cook's distance plot shows that the 31st observation, which is also an extreme outlier, exerts far more influence on the fitted coefficients than any other observation, so the fit is being pulled disproportionately by a single tree.
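The visual impression about observation 31 can be checked numerically. A common rule of thumb (one of several in use) flags observations whose Cook's distance exceeds 4/n:

```{r}
# observations whose Cook's distance exceeds the 4/n rule of thumb
cd <- cooks.distance(lm(Volume ~ Girth + Height, data = trees))
which(cd > 4 / length(cd))
```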
# Question 3
```{r}
# loading data
data("florida")
```
**In the 2000 election for U.S. president, the counting of votes in Florida was controversial. In Palm Beach County in south Florida, for example, voters used a so-called butterfly ballot. Some believe that the layout of the ballot caused some voters to cast votes for Buchanan when their intended choice was Gore.**
## A)
**Run a simple linear regression model where the Buchanan vote is the outcome and the Bush vote is the explanatory variable. Produce the regression diagnostic plots. Is Palm Beach County an outlier based on the diagnostic plots? Why or why not?**
```{r}
summary(florida)
```
With the data loaded, I've fit & visualized the requested model below:
```{r}
fit_f <- lm(Buchanan ~ Bush, data = florida)
par(mfrow = c(2, 3)); plot(fit_f, which = 1:6)
```
With the model fit and the diagnostic plots created, it's extremely apparent that Palm Beach is indeed an outlier, and an extreme one at that. In every plot, Palm Beach deviates entirely from the trend followed by the other counties. For example, in the Residuals vs Fitted plot, the linearity assumption looks relatively sound until one takes Palm Beach and Dade into consideration. Both depart blatantly from that assumption, and would lead any reasonable analyst to conclude that something unusual affected the vote in those counties, consistent with the claim that the butterfly ballot inflated Buchanan's count.
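The outlier claim can also be backed with standardized residuals, where Palm Beach's value sits far outside the usual ±2 band:

```{r}
# the three largest absolute standardized residuals; Palm Beach dominates
fit_f <- lm(Buchanan ~ Bush, data = florida)
head(sort(abs(rstandard(fit_f)), decreasing = TRUE), 3)
```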
## B)
**Take the log of both variables (Bush vote and Buchanan vote) and repeat the analysis in (A). Do your findings change?**
```{r}
fit_logf <- lm(log(Buchanan) ~ log(Bush), data = florida)
par(mfrow = c(2, 3)); plot(fit_logf, which = 1:6)
```
While log-transforming both the `Buchanan` & `Bush` variables lessens the impact of the Palm Beach observation on the model, the change is not large enough to alter my findings: I would still consider Palm Beach an outlier.