Code
library(tidyverse)
library(car)
library(smss)
data("house.selling.price.2", package = "smss")
<- house.selling.price.2
df library(alr4)
data("florida", package = "alr4")
<- florida
df library(flexmix)
library(ggplot2)
library(stargazer)
Alexa Potter
May 7, 2023
(Data file: house.selling.price.2 from smss R package)
For the house.selling.price.2 data the tables below show a correlation matrix and a model fit using four predictors of selling price.
(Hint 1: You should be able to answer A, B, C just using the tables below, although you should feel free to load the data in R and work with it if you so choose. They will be consistent with what you see on the tables.
Hint 2: The p-value of a variable in a simple linear regression is the same p-value one would get from a Pearson’s correlation (cor.test). The p-value is a function of the magnitude of the correlation coefficient (the higher the coefficient, the lower the p-value) and of sample size (larger samples lead to smaller p-values). For the correlations shown in the tables, they are between variables of the same length.) With these four predictors,
For backward elimination, which variable would be deleted first? Why?
Beds would be eliminated first as it has the highest p-value.
For forward selection, which variable would be added first? Why?
Size would be added first because it is the most significant with the highest correlation to price.
Why do you think that BEDS has such a large P-value in the multiple regression model, even though it has a substantial correlation with PRICE?
It may be because with the influence of other variables, beds is not as important to price as new or size is. These then diminish the influence of beds on price when factored together and controlled for.
Using software with these four predictors, find the model that would be selected using each criterion:
===================================================================================================================
Dependent variable:
-----------------------------------------------------------------------------------------------
P
(1) (2) (3) (4)
-------------------------------------------------------------------------------------------------------------------
S 75.607*** 72.575*** 62.263*** 64.761***
(3.865) (3.508) (4.335) (5.630)
Be -2.766
(3.960)
Ba 20.072*** 19.203***
(5.495) (5.650)
New 19.587*** 18.371*** 18.984***
(3.995) (3.761) (3.873)
Constant -25.194*** -26.089*** -47.992*** -41.795***
(6.688) (5.977) (8.209) (12.104)
-------------------------------------------------------------------------------------------------------------------
Observations 93 93 93 93
R2 0.808 0.848 0.868 0.869
Adjusted R2 0.806 0.845 0.864 0.863
Residual Std. Error 19.473 (df = 91) 17.395 (df = 90) 16.313 (df = 89) 16.360 (df = 88)
F Statistic 382.628*** (df = 1; 91) 251.775*** (df = 2; 90) 195.313*** (df = 3; 89) 145.763*** (df = 4; 88)
===================================================================================================================
Note: *p<0.1; **p<0.05; ***p<0.01
Model 4
Model 3
[1] 38203.29
[1] 31066
[1] 27860.05
[1] 28390.22
Model 3
[1] 820.1439
[1] 800.1262
[1] 789.1366
[1] 790.6225
Model 3
[1] 827.7417
[1] 810.2566
[1] 801.7996
[1] 805.8181
Model 3
Explain which model you prefer and why.
The model diagnostics point to Model 3 (P ~ S + Ba + New) as being the best fit model. This is the best fit model for all the diagnostic tests except for R2. This model has all significant variables except for Beds, which was not significant.
(Data file: trees from base R)
From the documentation:
“This data set provides measurements of the diameter, height and volume of timber in 31 felled black cherry trees. Note that the diameter (in inches) is erroneously labeled Girth in the data. It is measured at 4 ft 6 in above the ground.”
Tree volume estimation is a big deal, especially in the lumber industry. Use the trees data to build a basic model of tree volume prediction. In particular,
Fit a multiple regression model with the Volume as the outcome and Girth and Height as the explanatory variables
Girth Height Volume
1 8.3 70 10.3
2 8.6 65 10.3
3 8.8 63 10.2
4 10.5 72 16.4
5 10.7 81 18.8
6 10.8 83 19.7
7 11.0 66 15.6
8 11.0 75 18.2
9 11.1 80 22.6
10 11.2 75 19.9
11 11.3 79 24.2
12 11.4 76 21.0
13 11.4 76 21.4
14 11.7 69 21.3
15 12.0 75 19.1
16 12.9 74 22.2
17 12.9 85 33.8
18 13.3 86 27.4
19 13.7 71 25.7
20 13.8 64 24.9
21 14.0 78 34.5
22 14.2 80 31.7
23 14.5 74 36.3
24 16.0 72 38.3
25 16.3 77 42.6
26 17.3 81 55.4
27 17.5 82 55.7
28 17.9 80 58.3
29 18.0 80 51.5
30 18.0 80 51.0
31 20.6 87 77.0
Call:
lm(formula = Volume ~ Girth + Height, data = trees)
Residuals:
Min 1Q Median 3Q Max
-6.4065 -2.6493 -0.2876 2.2003 8.4847
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -57.9877 8.6382 -6.713 2.75e-07 ***
Girth 4.7082 0.2643 17.816 < 2e-16 ***
Height 0.3393 0.1302 2.607 0.0145 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.882 on 28 degrees of freedom
Multiple R-squared: 0.948, Adjusted R-squared: 0.9442
F-statistic: 255 on 2 and 28 DF, p-value: < 2.2e-16
Run regression diagnostic plots on the model. Based on the plots, do you think any of the regression assumptions is violated?
Residuals vs. Fitted
The assumption of linearity seems to be violated here. The graph shows a curve rather than a straight line.
Normal Q-Q
The assumption of normality does appear to not be violated here. The dots generally fall along the line.
Scale-Location
The assumption of constant variance does not appear violated here. While the line does not appear “approximately horizontal”, the dots do not have an increasing or decreasing trend.
Cook’s Distance
To calculate the baseline here, it is 4/n. n= number of trees (31). ~ 0.13. The graph displays 3-4 points above this value, meaning the assumption of influential observation is violated.
(Data file: florida in alr R package)
In the 2000 election for U.S. president, the counting of votes in Florida was controversial. In Palm Beach County in south Florida, for example, voters used a so-called butterfly ballot. Some believe that the layout of the ballot caused some voters to cast votes for Buchanan when their intended choice was Gore.
The data has variables for the number of votes for each candidate—Gore, Bush, and Buchanan.
Run a simple linear regression model where the Buchanan vote is the outcome and the Bush vote is the explanatory variable. Produce the regression diagnostic plots. Is Palm Beach County an outlier based on the diagnostic plots? Why or why not?
Gore Bush Buchanan
ALACHUA 47300 34062 262
BAKER 2392 5610 73
BAY 18850 38637 248
BRADFORD 3072 5413 65
BREVARD 97318 115185 570
BROWARD 386518 177279 789
CALHOUN 2155 2873 90
CHARLOTTE 29641 35419 182
CITRUS 25501 29744 270
CLAY 14630 41745 186
COLLIER 29905 60426 122
COLUMBIA 7047 10964 89
DADE 328702 289456 561
DE SOTO 3322 4256 36
DIXIE 1825 2698 29
DUVAL 107680 152082 650
ESCAMBIA 40958 73029 504
FLAGLER 13891 12608 83
FRANKLIN 2042 2448 33
GADSDEN 9565 4750 39
GILCHRIST 1910 3300 29
GLADES 1420 1840 9
GULF 2389 3546 71
HAMILTON 1718 2153 24
HARDEE 2341 3764 30
HENDRY 3239 4743 22
HERNANDO 32644 30646 242
HIGHLANDS 14152 20196 99
HILLSBOROUGH 166581 176967 836
HOLMES 2154 4985 76
INDIAN RIVER 19769 28627 105
JACKSON 6868 9138 102
JEFFERSON 3038 2481 29
LAFAYETTE 788 1669 10
LAKE 36555 49963 289
LEE 73560 106141 305
LEON 61425 39053 282
LEVY 5403 6860 67
LIBERTY 1011 1316 39
MADISON 3011 3038 29
MANATEE 49169 57948 272
MARION 44648 55135 563
MARTIN 26619 33864 108
MONROE 16483 16059 47
NASSAU 6952 16404 90
OKALOOSA 16924 52043 267
OKEECHOBEE 4588 5058 43
ORANGE 140115 134476 446
OSCEOLA 28177 26216 145
PALM BEACH 268945 152846 3407
PASCO 69550 68581 570
PINELLAS 199660 184312 1010
POLK 74977 90101 538
PUTNAM 12091 13439 147
ST. JOHNS 19482 39497 229
ST. LUCIE 41559 34705 124
SANTA ROSA 12795 36248 311
SARASOTA 72854 83100 305
SEMINOLE 58888 75293 194
SUMTER 9634 12126 114
SUWANNEE 4084 8014 108
TAYLOR 2647 4051 27
UNION 1399 2326 26
VOLUSIA 97063 82214 396
WAKULLA 3835 4511 46
WALTON 5637 12176 120
WASHINGTON 2796 4983 88
Call:
lm(formula = Buchanan ~ Bush, data = florida)
Residuals:
Min 1Q Median 3Q Max
-907.50 -46.10 -29.19 12.26 2610.19
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.529e+01 5.448e+01 0.831 0.409
Bush 4.917e-03 7.644e-04 6.432 1.73e-08 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 353.9 on 65 degrees of freedom
Multiple R-squared: 0.3889, Adjusted R-squared: 0.3795
F-statistic: 41.37 on 1 and 65 DF, p-value: 1.727e-08
Palm Beach does appear to be an outlier in all the diagnostic plots as it is furthest away from the red baselines. In the Cook’s distance plot, the 4/n is 4/67 = 0.0597, which both Dade and Palm Beach exceed.
Take the log of both variables (Bush vote and Buchanan Vote) and repeat the analysis in (A.) Does your findings change?
Call:
lm(formula = (log(Buchanan)) ~ (log(Bush)), data = florida)
Residuals:
Min 1Q Median 3Q Max
-0.96075 -0.25949 0.01282 0.23826 1.66564
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.57712 0.38919 -6.622 8.04e-09 ***
log(Bush) 0.75772 0.03936 19.251 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.4673 on 65 degrees of freedom
Multiple R-squared: 0.8508, Adjusted R-squared: 0.8485
F-statistic: 370.6 on 1 and 65 DF, p-value: < 2.2e-16
The results here are similar, but less drastic in difference when it comes to Palm Beach. In the Cook’s Distance plot, it’s also now apparent that other counties also violate the assumption.
---
title: "Homework 5"
author: "Alexa Potter"
description: "DACSS 603"
date: "05/07/2023"
format:
html:
toc: true
code-fold: true
code-copy: true
code-tools: true
categories:
- hw5
---
```{r, message=FALSE}
library(tidyverse)
library(car)
library(smss)
data("house.selling.price.2", package = "smss")
df <- house.selling.price.2
library(alr4)
data("florida", package = "alr4")
df <- florida
library(flexmix)
library(ggplot2)
library(stargazer)
```
# Question 1
(Data file: house.selling.price.2 from smss R package)
For the house.selling.price.2 data the tables below show a correlation matrix and a model fit using four predictors of selling price.
(Hint 1: You should be able to answer A, B, C just using the tables below, although you should feel free to load the data in R and work with it if you so choose. They will be consistent with what you see on the tables.
Hint 2: The p-value of a variable in a simple linear regression is the same p-value one would get from a Pearson’s correlation (cor.test). The p-value is a function of the magnitude of the correlation coefficient (the higher the coefficient, the lower the p-value) and of sample size (larger samples lead to smaller p-values). For the correlations shown in the tables, they are between variables of the same length.)
With these four predictors,
## A
For backward elimination, which variable would be deleted first? Why?
Beds would be eliminated first as it has the highest p-value.
## B
For forward selection, which variable would be added first? Why?
Size would be added first because it is the most significant with the highest correlation to price.
## C
Why do you think that BEDS has such a large P-value in the multiple regression model, even though it has a substantial correlation with PRICE?
It may be because with the influence of other variables, beds is not as important to price as new or size is. These then diminish the influence of beds on price when factored together and controlled for.
## D
Using software with these four predictors, find the model that would be selected using each criterion:
```{r}
lm_1 <- (lm(P ~ S, data= house.selling.price.2))
lm_2 <- (lm(P ~ S + New, data= house.selling.price.2))
lm_3 <- (lm(P ~ S + Ba + New, data= house.selling.price.2))
lm_4 <- (lm(P ~ S + Be + Ba + New, data= house.selling.price.2))
stargazer(lm_1, lm_2, lm_3, lm_4, type = 'text')
```
1. R2
Model 4
2. Adjusted R2
Model 3
3. PRESS
```{r}
PRESS <- function(model) {
i <- residuals(model)/(1 - lm.influence(model)$hat)
sum(i^2)
}
PRESS(lm_1)
PRESS(lm_2)
PRESS(lm_3)
PRESS(lm_4)
```
Model 3
4. AIC
```{r}
AIC(lm_1, k=2)
AIC(lm_2, k=2)
AIC(lm_3, k=2)
AIC(lm_4, k=2)
```
Model 3
5. BIC
```{r}
BIC(lm_1)
BIC(lm_2)
BIC(lm_3)
BIC(lm_4)
```
Model 3
## E
Explain which model you prefer and why.
The model diagnostics point to Model 3 (P ~ S + Ba + New) as being the best fit model. This is the best fit model for all the diagnostic tests except for R2. This model has all significant variables except for Beds, which was not significant.
# Question 2
(Data file: trees from base R)
From the documentation:
“This data set provides measurements of the diameter, height and volume of timber in 31 felled black cherry trees. Note that the diameter (in inches) is erroneously labeled Girth in the data. It is measured at 4 ft 6 in above the ground.”
Tree volume estimation is a big deal, especially in the lumber industry. Use the trees data to build a basic model of tree volume prediction. In particular,
## A
Fit a multiple regression model with the Volume as the outcome and Girth and Height as the explanatory variables
```{r}
trees
lm_trees <- lm(Volume ~ Girth + Height, data = trees)
summary(lm_trees)
```
## B
Run regression diagnostic plots on the model. Based on the plots, do you think any of the regression assumptions is violated?
```{r}
plot(lm_trees, which = 1)
plot(lm_trees, which = 2)
plot(lm_trees, which = 3)
plot(lm_trees, which = 4)
```
Residuals vs. Fitted
The assumption of linearity seems to be violated here. The graph shows a curve rather than a straight line.
Normal Q-Q
The assumption of normality does appear to not be violated here. The dots generally fall along the line.
Scale-Location
The assumption of constant variance does not appear violated here. While the line does not appear "approximately horizontal", the dots do not have an increasing or decreasing trend.
Cook's Distance
To calculate the baseline here, it is 4/n. n= number of trees (31). ~ 0.13. The graph displays 3-4 points above this value, meaning the assumption of influential observation is violated.
```{r}
4/31
```
# Question 3
(Data file: florida in alr R package)
In the 2000 election for U.S. president, the counting of votes in Florida was controversial. In Palm Beach County in south Florida, for example, voters used a so-called butterfly ballot. Some believe that the layout of the ballot caused some voters to cast votes for Buchanan when their intended choice was Gore.
The data has variables for the number of votes for each candidate—Gore, Bush, and Buchanan.
## A
Run a simple linear regression model where the Buchanan vote is the outcome and the Bush vote is the explanatory variable. Produce the regression diagnostic plots. Is Palm Beach County an outlier based on the diagnostic plots? Why or why not?
```{r}
florida
lm_florida <- lm(Buchanan ~ Bush, data = florida)
summary(lm_florida)
plot(lm_florida,which=1)
plot(lm_florida,which=2)
plot(lm_florida,which=3)
plot(lm_florida,which=4)
```
Palm Beach does appear to be an outlier in all the diagnostic plots as it is furthest away from the red baselines. In the Cook's distance plot, the 4/n is 4/67 = 0.0597, which both Dade and Palm Beach exceed.
```{r}
4/67
```
## B
Take the log of both variables (Bush vote and Buchanan Vote) and repeat the analysis in (A.) Does your findings change?
```{r}
lm_floridalog <- lm((log(Buchanan)) ~ (log(Bush)), data = florida)
summary(lm_floridalog)
plot(lm_floridalog,which=1)
plot(lm_floridalog,which=2)
plot(lm_floridalog,which=3)
plot(lm_floridalog,which=4)
```
The results here are similar, but less drastic in difference when it comes to Palm Beach. In the Cook's Distance plot, it's also now apparent that other counties also violate the assumption.