Code
library(tidyverse)
library(MPV)
library(alr4)
library(smss)
::opts_chunk$set(echo = TRUE) knitr
Adithya Parupudi
May 8, 2023
P S Be Ba New
1 48.5 1.10 3 1 0
2 55.0 1.01 3 2 0
3 68.0 1.45 3 2 0
4 137.0 2.40 3 3 0
5 309.4 3.30 4 3 1
6 17.5 0.40 1 1 0
7 19.6 1.28 3 1 0
8 24.5 0.74 3 1 0
9 34.8 0.78 2 1 0
10 32.0 0.97 3 1 0
11 28.0 0.84 3 1 0
12 49.9 1.08 2 2 0
13 59.9 0.99 2 1 0
14 61.5 1.01 3 2 0
15 60.0 1.34 3 2 0
16 65.9 1.22 3 1 0
17 67.9 1.28 3 2 0
18 68.9 1.29 3 2 0
19 69.9 1.52 3 2 0
20 70.5 1.25 3 2 0
21 72.9 1.28 3 2 0
22 72.5 1.28 3 1 0
23 72.0 1.36 3 2 0
24 71.0 1.20 3 2 0
25 76.0 1.46 3 2 0
26 72.9 1.56 4 2 0
27 73.0 1.22 3 2 0
28 70.0 1.40 2 2 0
29 76.0 1.15 2 2 0
30 69.0 1.74 3 2 0
31 75.5 1.62 3 2 0
32 76.0 1.66 3 2 0
33 81.8 1.33 3 2 0
34 84.5 1.34 3 2 0
35 83.5 1.40 3 2 0
36 86.0 1.15 2 2 1
37 86.9 1.58 3 2 1
38 86.9 1.58 3 2 1
39 86.9 1.58 3 2 1
40 87.9 1.71 3 2 0
41 88.1 2.10 3 2 0
42 85.9 1.27 3 2 0
43 89.5 1.34 3 2 0
44 87.4 1.25 3 2 0
45 87.9 1.68 3 2 0
46 88.0 1.55 3 2 0
47 90.0 1.55 3 2 0
48 96.0 1.36 3 2 1
49 99.9 1.51 3 2 1
50 95.5 1.54 3 2 1
51 98.5 1.51 3 2 0
52 100.1 1.85 3 2 0
53 99.9 1.62 4 2 1
54 101.9 1.40 3 2 1
55 101.9 1.92 4 2 0
56 102.3 1.42 3 2 1
57 110.8 1.56 3 2 1
58 105.0 1.43 3 2 1
59 97.9 2.00 3 2 0
60 106.3 1.45 3 2 1
61 106.5 1.65 3 2 0
62 116.0 1.72 4 2 1
63 108.0 1.79 4 2 1
64 107.5 1.85 3 2 0
65 109.9 2.06 4 2 1
66 110.0 1.76 4 2 0
67 120.0 1.62 3 2 1
68 115.0 1.80 4 2 1
69 113.4 1.98 3 2 0
70 114.9 1.57 3 2 0
71 115.0 2.19 3 2 0
72 115.0 2.07 4 2 0
73 117.9 1.99 4 2 0
74 110.0 1.55 3 2 0
75 115.0 1.67 3 2 0
76 124.0 2.40 4 2 0
77 129.9 1.79 4 2 1
78 124.0 1.89 3 2 0
79 128.0 1.88 3 2 1
80 132.4 2.00 4 2 1
81 139.3 2.05 4 2 1
82 139.3 2.00 4 2 1
83 139.7 2.03 3 2 1
84 142.0 2.12 3 3 0
85 141.3 2.08 4 2 1
86 147.5 2.19 4 2 0
87 142.5 2.40 4 2 0
88 148.0 2.40 5 2 0
89 149.0 3.05 4 2 0
90 150.0 2.04 3 3 0
91 172.9 2.25 4 2 1
92 190.0 2.57 4 3 1
93 280.0 3.85 4 3 0
For backward elimination, you fit a model using all possible explanatory values to predict the output. Then one by one, you delete the least significant explanatory variable in the model, which would have the largest p-value. In this example, we would delete Beds
first, which has a p-value of 0.487.
With forward selection, you begin with no explanatory variables, then add one variable at a time to the model. The variable you add should be the most significant one, based on it having the lowest P-value of the group of possible explanatory variables. In this example, the first variable to add to the model is Size
, given its extremely small p-value < 2e-16.
While the variable Beds does have a strong correlation with price, when adding additional variables using a regression model, the relationship significantly diminishes, thus the other variables may act as a control on the bed variable.
Call:
lm(formula = P ~ S, data = house.selling.price.2)
Residuals:
Min 1Q Median 3Q Max
-56.407 -10.656 2.126 11.412 85.091
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -25.194 6.688 -3.767 0.000293 ***
S 75.607 3.865 19.561 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 19.47 on 91 degrees of freedom
Multiple R-squared: 0.8079, Adjusted R-squared: 0.8058
F-statistic: 382.6 on 1 and 91 DF, p-value: < 2.2e-16
Call:
lm(formula = P ~ S + New, data = house.selling.price.2)
Residuals:
Min 1Q Median 3Q Max
-47.207 -9.763 -0.091 9.984 76.405
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -26.089 5.977 -4.365 3.39e-05 ***
S 72.575 3.508 20.690 < 2e-16 ***
New 19.587 3.995 4.903 4.16e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 17.4 on 90 degrees of freedom
Multiple R-squared: 0.8484, Adjusted R-squared: 0.845
F-statistic: 251.8 on 2 and 90 DF, p-value: < 2.2e-16
Call:
lm(formula = P ~ ., data = house.selling.price.2)
Residuals:
Min 1Q Median 3Q Max
-36.212 -9.546 1.277 9.406 71.953
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -41.795 12.104 -3.453 0.000855 ***
S 64.761 5.630 11.504 < 2e-16 ***
Be -2.766 3.960 -0.698 0.486763
Ba 19.203 5.650 3.399 0.001019 **
New 18.984 3.873 4.902 4.3e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 16.36 on 88 degrees of freedom
Multiple R-squared: 0.8689, Adjusted R-squared: 0.8629
F-statistic: 145.8 on 4 and 88 DF, p-value: < 2.2e-16
Call:
lm(formula = P ~ . - Be, data = house.selling.price.2)
Residuals:
Min 1Q Median 3Q Max
-34.804 -9.496 0.917 7.931 73.338
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -47.992 8.209 -5.847 8.15e-08 ***
S 62.263 4.335 14.363 < 2e-16 ***
Ba 20.072 5.495 3.653 0.000438 ***
New 18.371 3.761 4.885 4.54e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 16.31 on 89 degrees of freedom
Multiple R-squared: 0.8681, Adjusted R-squared: 0.8637
F-statistic: 195.3 on 3 and 89 DF, p-value: < 2.2e-16
Call:
lm(formula = P ~ . - Be - Ba, data = house.selling.price.2)
Residuals:
Min 1Q Median 3Q Max
-47.207 -9.763 -0.091 9.984 76.405
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -26.089 5.977 -4.365 3.39e-05 ***
S 72.575 3.508 20.690 < 2e-16 ***
New 19.587 3.995 4.903 4.16e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 17.4 on 90 degrees of freedom
Multiple R-squared: 0.8484, Adjusted R-squared: 0.845
F-statistic: 251.8 on 2 and 90 DF, p-value: < 2.2e-16
As expected, the model with the most explanatory variables has the highest R-squared value at 0.8689. Therefore, if you were to select a model solely based on maximizing the R-squared value, it would be: ŷ = -41.79 + 64.76(Size) - 2.77(Beds) + 19.2(Baths) + 18.98(New).
However, if you were to select a model based on adjusted R-squared, the best model for predicting selling price would exclude Beds and use Size, Baths, and New as explanatory variables. The adjusted R-squared value see a slight increase when Beds is removed (from 0.8629 to 0.8637). The model would be: ŷ = -47.99 + 62.26(Size) + 20.07(Baths) + 18.37(New).
[1] 28390.22
[1] 27860.05
When considering PRESS, a smaller PRESS value indicates a better predictive model. Comparing the PRESS value of the model with all variables and the model excluding Bed, the PRESS values would lead us to select the model with Size, Baths, and New as variables for predicting selling price.
[1] 790.6225
[1] 789.1366
When considering the AIC for both models, the value is slightly lower for the model that excludes Bed as a variable. Therefore, the AIC would lead us to use the model with Size, Baths, and New as explanatory variables to predicting selling price.
[1] 805.8181
[1] 801.7996
Lastly, like AIC, the BIC value is lower for the model that excludes Bed as a variable. Once again, we’d select the model that uses Size, Baths, and New as explanatory variables to predict selling price.
Given the results from the various criteria above, the model I would prefer to use to predict selling price is that which excludes Bed and includes Size, Bath, and New as variables: ŷ = -41.79 + 64.76(Size) - 2.77(Beds) + 19.2(Baths) + 18.98(New). This is because each of the criterion indicate this model as slightly stronger in its predictive power than the model that includes all variables except R-squared, which cannot be used alone to determine model strength.
Girth Height Volume
1 8.3 70 10.3
2 8.6 65 10.3
3 8.8 63 10.2
4 10.5 72 16.4
5 10.7 81 18.8
6 10.8 83 19.7
7 11.0 66 15.6
8 11.0 75 18.2
9 11.1 80 22.6
10 11.2 75 19.9
11 11.3 79 24.2
12 11.4 76 21.0
13 11.4 76 21.4
14 11.7 69 21.3
15 12.0 75 19.1
16 12.9 74 22.2
17 12.9 85 33.8
18 13.3 86 27.4
19 13.7 71 25.7
20 13.8 64 24.9
21 14.0 78 34.5
22 14.2 80 31.7
23 14.5 74 36.3
24 16.0 72 38.3
25 16.3 77 42.6
26 17.3 81 55.4
27 17.5 82 55.7
28 17.9 80 58.3
29 18.0 80 51.5
30 18.0 80 51.0
31 20.6 87 77.0
Call:
lm(formula = Volume ~ Girth + Height, data = trees)
Residuals:
Min 1Q Median 3Q Max
-6.4065 -2.6493 -0.2876 2.2003 8.4847
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -57.9877 8.6382 -6.713 2.75e-07 ***
Girth 4.7082 0.2643 17.816 < 2e-16 ***
Height 0.3393 0.1302 2.607 0.0145 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.882 on 28 degrees of freedom
Multiple R-squared: 0.948, Adjusted R-squared: 0.9442
F-statistic: 255 on 2 and 28 DF, p-value: < 2.2e-16
Based on the residuals vs. fitted values plot, the central points appear to roughly bounce randomly above and below 0, but the lowest and highest point appear to be very influential residuals. The red line should be flat along 0 horizontally, but it is U-shaped. This curvature may suggest a violation in the linearity assumption. With the normal Q-Q plot, it’s difficult to confidently say that the assumption of normality appears to be violated. The points generally run along the trend-line, but they do deviate above the line for the higher points. It’s a noteworthy deviation, but it’s difficult to make a certain decision based on the plot. In the scale-location plot, the line is not horizontal, thus suggesting a violation in the assumption of constant variance. Cook’s distance suggests that the 31st observation is above the threshold, meaning it is too influential as one observation.
Gore Bush Buchanan
ALACHUA 47300 34062 262
BAKER 2392 5610 73
BAY 18850 38637 248
BRADFORD 3072 5413 65
BREVARD 97318 115185 570
BROWARD 386518 177279 789
CALHOUN 2155 2873 90
CHARLOTTE 29641 35419 182
CITRUS 25501 29744 270
CLAY 14630 41745 186
COLLIER 29905 60426 122
COLUMBIA 7047 10964 89
DADE 328702 289456 561
DE SOTO 3322 4256 36
DIXIE 1825 2698 29
DUVAL 107680 152082 650
ESCAMBIA 40958 73029 504
FLAGLER 13891 12608 83
FRANKLIN 2042 2448 33
GADSDEN 9565 4750 39
GILCHRIST 1910 3300 29
GLADES 1420 1840 9
GULF 2389 3546 71
HAMILTON 1718 2153 24
HARDEE 2341 3764 30
HENDRY 3239 4743 22
HERNANDO 32644 30646 242
HIGHLANDS 14152 20196 99
HILLSBOROUGH 166581 176967 836
HOLMES 2154 4985 76
INDIAN RIVER 19769 28627 105
JACKSON 6868 9138 102
JEFFERSON 3038 2481 29
LAFAYETTE 788 1669 10
LAKE 36555 49963 289
LEE 73560 106141 305
LEON 61425 39053 282
LEVY 5403 6860 67
LIBERTY 1011 1316 39
MADISON 3011 3038 29
MANATEE 49169 57948 272
MARION 44648 55135 563
MARTIN 26619 33864 108
MONROE 16483 16059 47
NASSAU 6952 16404 90
OKALOOSA 16924 52043 267
OKEECHOBEE 4588 5058 43
ORANGE 140115 134476 446
OSCEOLA 28177 26216 145
PALM BEACH 268945 152846 3407
PASCO 69550 68581 570
PINELLAS 199660 184312 1010
POLK 74977 90101 538
PUTNAM 12091 13439 147
ST. JOHNS 19482 39497 229
ST. LUCIE 41559 34705 124
SANTA ROSA 12795 36248 311
SARASOTA 72854 83100 305
SEMINOLE 58888 75293 194
SUMTER 9634 12126 114
SUWANNEE 4084 8014 108
TAYLOR 2647 4051 27
UNION 1399 2326 26
VOLUSIA 97063 82214 396
WAKULLA 3835 4511 46
WALTON 5637 12176 120
WASHINGTON 2796 4983 88
Call:
lm(formula = Buchanan ~ Bush, data = florida)
Residuals:
Min 1Q Median 3Q Max
-907.50 -46.10 -29.19 12.26 2610.19
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.529e+01 5.448e+01 0.831 0.409
Bush 4.917e-03 7.644e-04 6.432 1.73e-08 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 353.9 on 65 degrees of freedom
Multiple R-squared: 0.3889, Adjusted R-squared: 0.3795
F-statistic: 41.37 on 1 and 65 DF, p-value: 1.727e-08
Based on the diagnostic plots, Palm Beach County is an outlier. First, when looking at the residuals vs fitted plot, the Palm Beach County residual is very large. When referring to the summary of the simple regression model, the third quartile for residuals is 12.26, yet the max is 2610.19. This is a significant jump and indicative of the value being an outlier. The normal Q-Q plot also indicates that the residuals for the model are generally normal except for the Palm Beach County residual, as it greatly deviates from the line in the plot. The Cook’s distance plot shows two points that may be of concern as outliers if you follow the metric of observations scoring over 1, which are DADE and Palm Beach at about 2. The residuals and leverages plot shows the Palm Beach County standardized residual value beyond the dashed line indicating Cook’s distance. This also suggests that the observation is an outlier and the observation has the potential to influence the regression model.
Call:
lm(formula = log(Buchanan) ~ log(Bush), data = florida)
Residuals:
Min 1Q Median 3Q Max
-0.96075 -0.25949 0.01282 0.23826 1.66564
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.57712 0.38919 -6.622 8.04e-09 ***
log(Bush) 0.75772 0.03936 19.251 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.4673 on 65 degrees of freedom
Multiple R-squared: 0.8508, Adjusted R-squared: 0.8485
F-statistic: 370.6 on 1 and 65 DF, p-value: < 2.2e-16
Based on the diagnostic plots, Palm Beach County is still an outlier. First, when looking at the residuals vs fitted plot, the Palm Beach County residual is still very large. The normal Q-Q plot also indicates that the residuals for the model are generally normal except for the Palm Beach County residual, as it greatly deviates from the line in the plot. The Cook’s distance plot shows that may be of concern as outlier if you follow the metric of observations scoring over 0.2, which is Palm Beach at about 0.3. The residuals and leverages plot shows the Palm Beach County standardized residual value beyond the dashed line indicating Cook’s distance. This also suggests that the observation is an outlier and the observation has the potential to influence the regression model.
---
title: "Homework 5"
author: "Adithya Parupudi"
description: "my homework 5"
date: "05/08/2023"
format:
html:
toc: true
code-fold: true
code-copy: true
code-tools: true
categories:
- hw5
- Adithya Parupudi
---
```{r}
#| label: setup
#| warning: false
library(tidyverse)
library(MPV)
library(alr4)
library(smss)
knitr::opts_chunk$set(echo = TRUE)
```
# Question 1
```{r}
data(house.selling.price.2)
house.selling.price.2
```
## A
For backward elimination, you fit a model using all possible explanatory values to predict the output. Then
one by one, you delete the least significant explanatory variable in the model, which would have the largest
p-value. In this example, we would delete `Beds` first, which has a p-value of 0.487.
## B
With forward selection, you begin with no explanatory variables, then add one variable at a time to the model.
The variable you add should be the most significant one, based on it having the lowest P-value of the group of
possible explanatory variables. In this example, the first variable to add to the model is `Size`, given its
extremely small p-value < 2e-16.
## C
While the variable Beds does have a strong correlation with price, when adding additional variables using a
regression model, the relationship significantly diminishes, thus the other variables may act as a control on
the bed variable.
## D
```{r}
summary(lm(P ~ S, data = house.selling.price.2))
summary(lm(P ~ S + New, data = house.selling.price.2))
summary(lm(P ~ ., data = house.selling.price.2))
summary(lm(P ~ . - Be, data = house.selling.price.2))
summary(lm(P ~ . - Be - Ba, data = house.selling.price.2))
```
### a. R^2
As expected, the model with the most explanatory variables has the highest R-squared value at 0.8689. Therefore, if you were to select a model solely based on maximizing the R-squared value, it would be: ŷ = -41.79 + 64.76(Size) - 2.77(Beds) + 19.2(Baths) + 18.98(New).
### b. Adjusted R^2
However, if you were to select a model based on adjusted R-squared, the best model for predicting selling price would exclude Beds and use Size, Baths, and New as explanatory variables. The adjusted R-squared value see a slight increase when Beds is removed (from 0.8629 to 0.8637). The model would be: ŷ = -47.99 + 62.26(Size) + 20.07(Baths) + 18.37(New).
### c. PRESS
```{r}
PRESS(lm(P ~ ., data = house.selling.price.2))
PRESS(lm(P ~ . -Be, data = house.selling.price.2))
```
When considering PRESS, a smaller PRESS value indicates a better predictive model. Comparing the PRESS value of the model with all variables and the model excluding Bed, the PRESS values would lead us to select the model with Size, Baths, and New as variables for predicting selling price.
### d. AIC
```{r}
AIC(lm(P ~ ., data = house.selling.price.2))
AIC(lm(P ~ . -Be, data = house.selling.price.2))
```
When considering the AIC for both models, the value is slightly lower for the model that excludes Bed as a variable. Therefore, the AIC would lead us to use the model with Size, Baths, and New as explanatory variables to predicting selling price.
### e. BIC
```{r}
BIC(lm(P ~ ., data = house.selling.price.2))
BIC(lm(P ~ . -Be, data = house.selling.price.2))
```
Lastly, like AIC, the BIC value is lower for the model that excludes Bed as a variable. Once again, we’d select the model that uses Size, Baths, and New as explanatory variables to predict selling price.
## E
Given the results from the various criteria above, the model I would prefer to use to predict selling price is that which excludes Bed and includes Size, Bath, and New as variables: ŷ = -41.79 + 64.76(Size) - 2.77(Beds) + 19.2(Baths) + 18.98(New). This is because each of the criterion indicate this model as slightly stronger in its predictive power than the model that includes all variables except R-squared, which cannot be used alone to determine model strength.
# Question 2
```{r}
data("trees")
trees
```
## A
```{r}
model <- lm(Volume ~ Girth + Height, data = trees)
summary(model)
```
## B
```{r}
par(mfrow = c(2, 3)); plot(model, which = 1:6)
```
Based on the residuals vs. fitted values plot, the central points appear to roughly bounce randomly above and below 0, but the lowest and highest point appear to be very influential residuals. The red line should be flat along 0 horizontally, but it is U-shaped. This curvature may suggest a violation in the linearity assumption. With the normal Q-Q plot, it’s difficult to confidently say that the assumption of normality appears to be violated. The points generally run along the trend-line, but they do deviate above the line for the higher points. It’s a noteworthy deviation, but it’s difficult to make a certain decision based on the plot. In the scale-location plot, the line is not horizontal, thus suggesting a violation in the assumption of constant variance. Cook’s distance suggests that the 31st observation is above the threshold, meaning it is too influential as one observation.
# Question 3
```{r}
data("florida")
florida
```
## A
```{r}
model <- lm(formula = Buchanan ~ Bush, data = florida)
summary(model)
```
```{r}
par(mfrow = c(2, 3)); plot(model, which = 1:6)
```
Based on the diagnostic plots, Palm Beach County is an outlier. First, when looking at the residuals vs fitted plot, the Palm Beach County residual is very large. When referring to the summary of the simple regression model, the third quartile for residuals is 12.26, yet the max is 2610.19. This is a significant jump and indicative of the value being an outlier. The normal Q-Q plot also indicates that the residuals for the model are generally normal except for the Palm Beach County residual, as it greatly deviates from the line in the plot. The Cook’s distance plot shows two points that may be of concern as outliers if you follow the metric of observations scoring over 1, which are DADE and Palm Beach at about 2. The residuals and leverages plot shows the Palm Beach County standardized residual value beyond the dashed line indicating Cook’s distance. This also suggests that the observation is an outlier and the observation has the potential to influence the regression model.
## B
```{r}
model <- lm(formula = log(Buchanan) ~ log(Bush), data = florida)
summary(model)
```
```{r}
par(mfrow = c(2, 3)); plot(model, which = 1:6)
```
Based on the diagnostic plots, Palm Beach County is still an outlier. First, when looking at the residuals vs fitted plot, the Palm Beach County residual is still very large. The normal Q-Q plot also indicates that the residuals for the model are generally normal except for the Palm Beach County residual, as it greatly deviates from the line in the plot. The Cook’s distance plot shows that may be of concern as outlier if you follow the metric of observations scoring over 0.2, which is Palm Beach at about 0.3. The residuals and leverages plot shows the Palm Beach County standardized residual value beyond the dashed line indicating Cook’s distance. This also suggests that the observation is an outlier and the observation has the potential to influence the regression model.