DACSS 603
Author

Alexa Potter

Published

May 7, 2023

Code
library(tidyverse)
library(car)
library(smss)
data("house.selling.price.2", package = "smss")
df <- house.selling.price.2
library(alr4)
data("florida", package = "alr4")
df <- florida
library(flexmix)
library(ggplot2)
library(stargazer)

Question 1

(Data file: house.selling.price.2 from smss R package)

For the house.selling.price.2 data the tables below show a correlation matrix and a model fit using four predictors of selling price.

(Hint 1: You should be able to answer A, B, C just using the tables below, although you should feel free to load the data in R and work with it if you so choose. They will be consistent with what you see on the tables.

Hint 2: The p-value of a variable in a simple linear regression is the same p-value one would get from a Pearson’s correlation (cor.test). The p-value is a function of the magnitude of the correlation coefficient (the higher the coefficient, the lower the p-value) and of sample size (larger samples lead to smaller p-values). For the correlations shown in the tables, they are between variables of the same length.) With these four predictors,

A

For backward elimination, which variable would be deleted first? Why?

Beds would be eliminated first as it has the highest p-value.

B

For forward selection, which variable would be added first? Why?

Size would be added first because it is the most significant with the highest correlation to price.

C

Why do you think that BEDS has such a large P-value in the multiple regression model, even though it has a substantial correlation with PRICE?

It may be because with the influence of other variables, beds is not as important to price as new or size is. These then diminish the influence of beds on price when factored together and controlled for.

D

Using software with these four predictors, find the model that would be selected using each criterion:

Code
lm_1 <- (lm(P ~ S, data= house.selling.price.2))

lm_2 <- (lm(P ~ S + New, data= house.selling.price.2))

lm_3 <- (lm(P ~ S + Ba + New, data= house.selling.price.2))

lm_4 <- (lm(P ~ S + Be + Ba + New, data= house.selling.price.2))

stargazer(lm_1, lm_2, lm_3, lm_4, type = 'text')

===================================================================================================================
                                                          Dependent variable:                                      
                    -----------------------------------------------------------------------------------------------
                                                                   P                                               
                              (1)                     (2)                     (3)                     (4)          
-------------------------------------------------------------------------------------------------------------------
S                          75.607***               72.575***               62.263***               64.761***       
                            (3.865)                 (3.508)                 (4.335)                 (5.630)        
                                                                                                                   
Be                                                                                                  -2.766         
                                                                                                    (3.960)        
                                                                                                                   
Ba                                                                         20.072***               19.203***       
                                                                            (5.495)                 (5.650)        
                                                                                                                   
New                                                19.587***               18.371***               18.984***       
                                                    (3.995)                 (3.761)                 (3.873)        
                                                                                                                   
Constant                  -25.194***              -26.089***              -47.992***              -41.795***       
                            (6.688)                 (5.977)                 (8.209)                (12.104)        
                                                                                                                   
-------------------------------------------------------------------------------------------------------------------
Observations                  93                      93                      93                      93           
R2                           0.808                   0.848                   0.868                   0.869         
Adjusted R2                  0.806                   0.845                   0.864                   0.863         
Residual Std. Error    19.473 (df = 91)        17.395 (df = 90)        16.313 (df = 89)        16.360 (df = 88)    
F Statistic         382.628*** (df = 1; 91) 251.775*** (df = 2; 90) 195.313*** (df = 3; 89) 145.763*** (df = 4; 88)
===================================================================================================================
Note:                                                                                   *p<0.1; **p<0.05; ***p<0.01
  1. R2

Model 4

  1. Adjusted R2

Model 3

  1. PRESS
Code
PRESS <- function(model) {
    i <- residuals(model)/(1 - lm.influence(model)$hat)
    sum(i^2)
}
PRESS(lm_1)
[1] 38203.29
Code
PRESS(lm_2)
[1] 31066
Code
PRESS(lm_3)
[1] 27860.05
Code
PRESS(lm_4)
[1] 28390.22

Model 3

  1. AIC
Code
AIC(lm_1, k=2)
[1] 820.1439
Code
AIC(lm_2, k=2)
[1] 800.1262
Code
AIC(lm_3, k=2)
[1] 789.1366
Code
AIC(lm_4, k=2)
[1] 790.6225

Model 3

  1. BIC
Code
BIC(lm_1)
[1] 827.7417
Code
BIC(lm_2)
[1] 810.2566
Code
BIC(lm_3)
[1] 801.7996
Code
BIC(lm_4)
[1] 805.8181

Model 3

E

Explain which model you prefer and why.

The model diagnostics point to Model 3 (P ~ S + Ba + New) as being the best fit model. This is the best fit model for all the diagnostic tests except for R2. This model has all significant variables except for Beds, which was not significant.

Question 2

(Data file: trees from base R)

From the documentation:

“This data set provides measurements of the diameter, height and volume of timber in 31 felled black cherry trees. Note that the diameter (in inches) is erroneously labeled Girth in the data. It is measured at 4 ft 6 in above the ground.”

Tree volume estimation is a big deal, especially in the lumber industry. Use the trees data to build a basic model of tree volume prediction. In particular,

A

Fit a multiple regression model with the Volume as the outcome and Girth and Height as the explanatory variables

Code
trees
   Girth Height Volume
1    8.3     70   10.3
2    8.6     65   10.3
3    8.8     63   10.2
4   10.5     72   16.4
5   10.7     81   18.8
6   10.8     83   19.7
7   11.0     66   15.6
8   11.0     75   18.2
9   11.1     80   22.6
10  11.2     75   19.9
11  11.3     79   24.2
12  11.4     76   21.0
13  11.4     76   21.4
14  11.7     69   21.3
15  12.0     75   19.1
16  12.9     74   22.2
17  12.9     85   33.8
18  13.3     86   27.4
19  13.7     71   25.7
20  13.8     64   24.9
21  14.0     78   34.5
22  14.2     80   31.7
23  14.5     74   36.3
24  16.0     72   38.3
25  16.3     77   42.6
26  17.3     81   55.4
27  17.5     82   55.7
28  17.9     80   58.3
29  18.0     80   51.5
30  18.0     80   51.0
31  20.6     87   77.0
Code
lm_trees <- lm(Volume ~ Girth + Height, data = trees)
summary(lm_trees)

Call:
lm(formula = Volume ~ Girth + Height, data = trees)

Residuals:
    Min      1Q  Median      3Q     Max 
-6.4065 -2.6493 -0.2876  2.2003  8.4847 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -57.9877     8.6382  -6.713 2.75e-07 ***
Girth         4.7082     0.2643  17.816  < 2e-16 ***
Height        0.3393     0.1302   2.607   0.0145 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.882 on 28 degrees of freedom
Multiple R-squared:  0.948, Adjusted R-squared:  0.9442 
F-statistic:   255 on 2 and 28 DF,  p-value: < 2.2e-16

B

Run regression diagnostic plots on the model. Based on the plots, do you think any of the regression assumptions is violated?

Code
plot(lm_trees, which = 1)

Code
plot(lm_trees, which = 2)

Code
plot(lm_trees, which = 3)

Code
plot(lm_trees, which = 4)

Residuals vs. Fitted

The assumption of linearity seems to be violated here. The graph shows a curve rather than a straight line.

Normal Q-Q

The assumption of normality does appear to not be violated here. The dots generally fall along the line.

Scale-Location

The assumption of constant variance does not appear violated here. While the line does not appear “approximately horizontal”, the dots do not have an increasing or decreasing trend.

Cook’s Distance

To calculate the baseline here, it is 4/n. n= number of trees (31). ~ 0.13. The graph displays 3-4 points above this value, meaning the assumption of influential observation is violated.

Code
4/31
[1] 0.1290323

Question 3

(Data file: florida in alr R package)

In the 2000 election for U.S. president, the counting of votes in Florida was controversial. In Palm Beach County in south Florida, for example, voters used a so-called butterfly ballot. Some believe that the layout of the ballot caused some voters to cast votes for Buchanan when their intended choice was Gore.

The data has variables for the number of votes for each candidate—Gore, Bush, and Buchanan.

A

Run a simple linear regression model where the Buchanan vote is the outcome and the Bush vote is the explanatory variable. Produce the regression diagnostic plots. Is Palm Beach County an outlier based on the diagnostic plots? Why or why not?

Code
florida
               Gore   Bush Buchanan
ALACHUA       47300  34062      262
BAKER          2392   5610       73
BAY           18850  38637      248
BRADFORD       3072   5413       65
BREVARD       97318 115185      570
BROWARD      386518 177279      789
CALHOUN        2155   2873       90
CHARLOTTE     29641  35419      182
CITRUS        25501  29744      270
CLAY          14630  41745      186
COLLIER       29905  60426      122
COLUMBIA       7047  10964       89
DADE         328702 289456      561
DE SOTO        3322   4256       36
DIXIE          1825   2698       29
DUVAL        107680 152082      650
ESCAMBIA      40958  73029      504
FLAGLER       13891  12608       83
FRANKLIN       2042   2448       33
GADSDEN        9565   4750       39
GILCHRIST      1910   3300       29
GLADES         1420   1840        9
GULF           2389   3546       71
HAMILTON       1718   2153       24
HARDEE         2341   3764       30
HENDRY         3239   4743       22
HERNANDO      32644  30646      242
HIGHLANDS     14152  20196       99
HILLSBOROUGH 166581 176967      836
HOLMES         2154   4985       76
INDIAN RIVER  19769  28627      105
JACKSON        6868   9138      102
JEFFERSON      3038   2481       29
LAFAYETTE       788   1669       10
LAKE          36555  49963      289
LEE           73560 106141      305
LEON          61425  39053      282
LEVY           5403   6860       67
LIBERTY        1011   1316       39
MADISON        3011   3038       29
MANATEE       49169  57948      272
MARION        44648  55135      563
MARTIN        26619  33864      108
MONROE        16483  16059       47
NASSAU         6952  16404       90
OKALOOSA      16924  52043      267
OKEECHOBEE     4588   5058       43
ORANGE       140115 134476      446
OSCEOLA       28177  26216      145
PALM BEACH   268945 152846     3407
PASCO         69550  68581      570
PINELLAS     199660 184312     1010
POLK          74977  90101      538
PUTNAM        12091  13439      147
ST. JOHNS     19482  39497      229
ST. LUCIE     41559  34705      124
SANTA ROSA    12795  36248      311
SARASOTA      72854  83100      305
SEMINOLE      58888  75293      194
SUMTER         9634  12126      114
SUWANNEE       4084   8014      108
TAYLOR         2647   4051       27
UNION          1399   2326       26
VOLUSIA       97063  82214      396
WAKULLA        3835   4511       46
WALTON         5637  12176      120
WASHINGTON     2796   4983       88
Code
lm_florida <- lm(Buchanan ~ Bush, data = florida)
summary(lm_florida)

Call:
lm(formula = Buchanan ~ Bush, data = florida)

Residuals:
    Min      1Q  Median      3Q     Max 
-907.50  -46.10  -29.19   12.26 2610.19 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 4.529e+01  5.448e+01   0.831    0.409    
Bush        4.917e-03  7.644e-04   6.432 1.73e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 353.9 on 65 degrees of freedom
Multiple R-squared:  0.3889,    Adjusted R-squared:  0.3795 
F-statistic: 41.37 on 1 and 65 DF,  p-value: 1.727e-08
Code
plot(lm_florida,which=1)

Code
plot(lm_florida,which=2)

Code
plot(lm_florida,which=3)

Code
plot(lm_florida,which=4)

Palm Beach does appear to be an outlier in all the diagnostic plots as it is furthest away from the red baselines. In the Cook’s distance plot, the 4/n is 4/67 = 0.0597, which both Dade and Palm Beach exceed.

Code
4/67
[1] 0.05970149

B

Take the log of both variables (Bush vote and Buchanan Vote) and repeat the analysis in (A.) Does your findings change?

Code
lm_floridalog <- lm((log(Buchanan)) ~ (log(Bush)), data = florida)
summary(lm_floridalog)

Call:
lm(formula = (log(Buchanan)) ~ (log(Bush)), data = florida)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.96075 -0.25949  0.01282  0.23826  1.66564 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -2.57712    0.38919  -6.622 8.04e-09 ***
log(Bush)    0.75772    0.03936  19.251  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4673 on 65 degrees of freedom
Multiple R-squared:  0.8508,    Adjusted R-squared:  0.8485 
F-statistic: 370.6 on 1 and 65 DF,  p-value: < 2.2e-16
Code
plot(lm_floridalog,which=1)

Code
plot(lm_floridalog,which=2)

Code
plot(lm_floridalog,which=3)

Code
plot(lm_floridalog,which=4)

The results here are similar, but less drastic in difference when it comes to Palm Beach. In the Cook’s Distance plot, it’s also now apparent that other counties also violate the assumption.