Kristin Abijaoude_HW5

Hw5
kristin abijaoude
Published

May 9, 2023

Code
# load packages
packages <- c("readr", "readxl", "summarytools", "tidyverse", "dplyr", "smss", "alr4", "stargazer", "broom", "qpcR")
lapply(packages, require, character.only = TRUE)
Loading required package: readr
Loading required package: readxl
Loading required package: summarytools
Loading required package: tidyverse
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.4.0     ✔ dplyr   1.1.0
✔ tibble  3.1.8     ✔ stringr 1.5.0
✔ tidyr   1.3.0     ✔ forcats 0.5.2
✔ purrr   1.0.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
✖ tibble::view()  masks summarytools::view()
Loading required package: smss

Loading required package: alr4

Loading required package: car

Loading required package: carData


Attaching package: 'car'


The following object is masked from 'package:dplyr':

    recode


The following object is masked from 'package:purrr':

    some


Loading required package: effects

lattice theme set by effectsTheme()
See ?effectsTheme for details.

Loading required package: stargazer


Please cite as: 


 Hlavac, Marek (2022). stargazer: Well-Formatted Regression and Summary Statistics Tables.

 R package version 5.2.3. https://CRAN.R-project.org/package=stargazer 


Loading required package: broom

Loading required package: qpcR

Loading required package: MASS


Attaching package: 'MASS'


The following object is masked from 'package:dplyr':

    select


Loading required package: minpack.lm

Loading required package: rgl

Loading required package: robustbase


Attaching package: 'robustbase'


The following object is masked from 'package:alr4':

    cloud


Loading required package: Matrix


Attaching package: 'Matrix'


The following objects are masked from 'package:tidyr':

    expand, pack, unpack
[[1]]
[1] TRUE

[[2]]
[1] TRUE

[[3]]
[1] TRUE

[[4]]
[1] TRUE

[[5]]
[1] TRUE

[[6]]
[1] TRUE

[[7]]
[1] TRUE

[[8]]
[1] TRUE

[[9]]
[1] TRUE

[[10]]
[1] TRUE
Code
knitr::opts_chunk$set(echo = TRUE)

Question 1

Code
data(house.selling.price.2)
house.selling.price.2
       P    S Be Ba New
1   48.5 1.10  3  1   0
2   55.0 1.01  3  2   0
3   68.0 1.45  3  2   0
4  137.0 2.40  3  3   0
5  309.4 3.30  4  3   1
6   17.5 0.40  1  1   0
7   19.6 1.28  3  1   0
8   24.5 0.74  3  1   0
9   34.8 0.78  2  1   0
10  32.0 0.97  3  1   0
11  28.0 0.84  3  1   0
12  49.9 1.08  2  2   0
13  59.9 0.99  2  1   0
14  61.5 1.01  3  2   0
15  60.0 1.34  3  2   0
16  65.9 1.22  3  1   0
17  67.9 1.28  3  2   0
18  68.9 1.29  3  2   0
19  69.9 1.52  3  2   0
20  70.5 1.25  3  2   0
21  72.9 1.28  3  2   0
22  72.5 1.28  3  1   0
23  72.0 1.36  3  2   0
24  71.0 1.20  3  2   0
25  76.0 1.46  3  2   0
26  72.9 1.56  4  2   0
27  73.0 1.22  3  2   0
28  70.0 1.40  2  2   0
29  76.0 1.15  2  2   0
30  69.0 1.74  3  2   0
31  75.5 1.62  3  2   0
32  76.0 1.66  3  2   0
33  81.8 1.33  3  2   0
34  84.5 1.34  3  2   0
35  83.5 1.40  3  2   0
36  86.0 1.15  2  2   1
37  86.9 1.58  3  2   1
38  86.9 1.58  3  2   1
39  86.9 1.58  3  2   1
40  87.9 1.71  3  2   0
41  88.1 2.10  3  2   0
42  85.9 1.27  3  2   0
43  89.5 1.34  3  2   0
44  87.4 1.25  3  2   0
45  87.9 1.68  3  2   0
46  88.0 1.55  3  2   0
47  90.0 1.55  3  2   0
48  96.0 1.36  3  2   1
49  99.9 1.51  3  2   1
50  95.5 1.54  3  2   1
51  98.5 1.51  3  2   0
52 100.1 1.85  3  2   0
53  99.9 1.62  4  2   1
54 101.9 1.40  3  2   1
55 101.9 1.92  4  2   0
56 102.3 1.42  3  2   1
57 110.8 1.56  3  2   1
58 105.0 1.43  3  2   1
59  97.9 2.00  3  2   0
60 106.3 1.45  3  2   1
61 106.5 1.65  3  2   0
62 116.0 1.72  4  2   1
63 108.0 1.79  4  2   1
64 107.5 1.85  3  2   0
65 109.9 2.06  4  2   1
66 110.0 1.76  4  2   0
67 120.0 1.62  3  2   1
68 115.0 1.80  4  2   1
69 113.4 1.98  3  2   0
70 114.9 1.57  3  2   0
71 115.0 2.19  3  2   0
72 115.0 2.07  4  2   0
73 117.9 1.99  4  2   0
74 110.0 1.55  3  2   0
75 115.0 1.67  3  2   0
76 124.0 2.40  4  2   0
77 129.9 1.79  4  2   1
78 124.0 1.89  3  2   0
79 128.0 1.88  3  2   1
80 132.4 2.00  4  2   1
81 139.3 2.05  4  2   1
82 139.3 2.00  4  2   1
83 139.7 2.03  3  2   1
84 142.0 2.12  3  3   0
85 141.3 2.08  4  2   1
86 147.5 2.19  4  2   0
87 142.5 2.40  4  2   0
88 148.0 2.40  5  2   0
89 149.0 3.05  4  2   0
90 150.0 2.04  3  3   0
91 172.9 2.25  4  2   1
92 190.0 2.57  4  3   1
93 280.0 3.85  4  3   0
Code
lm(P ~ ., data = house.selling.price.2) |> summary()

Call:
lm(formula = P ~ ., data = house.selling.price.2)

Residuals:
    Min      1Q  Median      3Q     Max 
-36.212  -9.546   1.277   9.406  71.953 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -41.795     12.104  -3.453 0.000855 ***
S             64.761      5.630  11.504  < 2e-16 ***
Be            -2.766      3.960  -0.698 0.486763    
Ba            19.203      5.650   3.399 0.001019 ** 
New           18.984      3.873   4.902  4.3e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 16.36 on 88 degrees of freedom
Multiple R-squared:  0.8689,    Adjusted R-squared:  0.8629 
F-statistic: 145.8 on 4 and 88 DF,  p-value: < 2.2e-16

A

Variable Be would be eliminated first using the backwards elimination because it has the largest p-value.

B

Variable S would be added first using the forward selection because it has the smallest p-value.

C

Beds has a high p-value despite a substantial correlation with price because of multicollinearity between New and Size as they have way smaller p-values and are statistically significant.

D

Code
lm1 <- lm(P ~ ., data = house.selling.price.2)
lm2 <- lm(P ~ S, Be, Ba, data = house.selling.price.2)
lm3 <- lm(P ~ S, Be, data = house.selling.price.2)
lm4 <- lm(P ~ S, data = house.selling.price.2)

stargazer(lm1, lm2, lm3, lm4, type = "text")

===================================================================================================================
                                                          Dependent variable:                                      
                    -----------------------------------------------------------------------------------------------
                                                                   P                                               
                              (1)                     (2)                     (3)                     (4)          
-------------------------------------------------------------------------------------------------------------------
S                          64.761***               78.472***               77.207***               75.607***       
                            (5.630)                 (2.672)                 (2.583)                 (3.865)        
                                                                                                                   
Be                          -2.766                                                                                 
                            (3.960)                                                                                
                                                                                                                   
Ba                         19.203***                                                                               
                            (5.650)                                                                                
                                                                                                                   
New                        18.984***                                                                               
                            (3.873)                                                                                
                                                                                                                   
Constant                  -41.795***              -44.960***              -42.531***              -25.194***       
                           (12.104)                 (4.904)                 (4.490)                 (6.688)        
                                                                                                                   
-------------------------------------------------------------------------------------------------------------------
Observations                  93                      93                      93                      93           
R2                           0.869                   0.905                   0.908                   0.808         
Adjusted R2                  0.863                   0.904                   0.907                   0.806         
Residual Std. Error    16.360 (df = 88)        19.719 (df = 91)        11.794 (df = 91)        19.473 (df = 91)    
F Statistic         145.763*** (df = 4; 88) 862.682*** (df = 1; 91) 893.585*** (df = 1; 91) 382.628*** (df = 1; 91)
===================================================================================================================
Note:                                                                                   *p<0.1; **p<0.05; ***p<0.01

The R^2 and Adjusted R^2 can be found with this chart above.

Code
# create press model
PRESS <- function(model) {
    i <- residuals(model)/(1 - lm.influence(model)$hat)
    sum(i^2)
}
Code
#lm1 PRESS
PRESS(lm1)
[1] 28390.22
Code
#lm1 AIC
AIC(lm1)
[1] 790.6225
Code
#lm1 BIC
BIC(lm1)
[1] 805.8181
Code
#lm2 PRESS
PRESS(lm2)
[1] 16270.99
Code
#lm2 AIC
AIC(lm2)
[1] 748.9785
Code
#lm2 BIC
BIC(lm2)
[1] 756.5763
Code
#lm3 PRESS
PRESS(lm3)
[1] 16131.45
Code
#lm3 AIC
AIC(lm3)
[1] 726.8739
Code
#lm3 BIC
BIC(lm3)
[1] 734.4717
Code
#lm4 PRESS
PRESS(lm4)
[1] 38203.29
Code
#lm4 AIC
AIC(lm4)
[1] 820.1439
Code
#lm4 BIC
BIC(lm4)
[1] 827.7417

E

Based on the criterion PRESS, R^2, Adjusted R^2, AIC, and BIC, we will go with Model 3 lm3 since it has the highest R^2 and the lowest PRESS, AIC, and BIC values.

Question 2

Code
data(trees)
head(trees)
  Girth Height Volume
1   8.3     70   10.3
2   8.6     65   10.3
3   8.8     63   10.2
4  10.5     72   16.4
5  10.7     81   18.8
6  10.8     83   19.7

A

Code
tree_reg <- lm(Volume ~ Girth + Height, data = trees)
summary(tree_reg)

Call:
lm(formula = Volume ~ Girth + Height, data = trees)

Residuals:
    Min      1Q  Median      3Q     Max 
-6.4065 -2.6493 -0.2876  2.2003  8.4847 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -57.9877     8.6382  -6.713 2.75e-07 ***
Girth         4.7082     0.2643  17.816  < 2e-16 ***
Height        0.3393     0.1302   2.607   0.0145 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.882 on 28 degrees of freedom
Multiple R-squared:  0.948, Adjusted R-squared:  0.9442 
F-statistic:   255 on 2 and 28 DF,  p-value: < 2.2e-16

B

There is a violation of nonlinearity since the regression line is not straight or linear.

Code
plot(tree_reg)

Question 3

A

Code
data(florida)
head(florida)
           Gore   Bush Buchanan
ALACHUA   47300  34062      262
BAKER      2392   5610       73
BAY       18850  38637      248
BRADFORD   3072   5413       65
BREVARD   97318 115185      570
BROWARD  386518 177279      789
Code
gore <- lm(Buchanan ~ Gore, data=florida)
plot(gore)

From the graph, Palm Beach is an outlier because Buchanan received the most votes from that county.

B

Code
gore_log <- lm(log(Buchanan) ~ log(Gore), data=florida)
plot(gore_log)

Even when accounting for the log() function, nothing fundamentally changes when it comes to outliers.