---
title: "Homework 5"
author: "Young Soo Choi"
description: "hw5"
date: "05/08/2023"
format:
html:
toc: true
code-fold: true
code-copy: true
code-tools: true
categories:
- hw5
---
# Question 1
```{r}
library(smss)
data(house.selling.price.2)
cor_mat <- cor(house.selling.price.2)              # correlation matrix
reg_tot <- lm(P ~ ., data = house.selling.price.2) # full model
```
```{r}
cor_mat
summary(reg_tot)
```
## (A)
The backward elimination method starts from a model that includes all of the variables and, at each step, removes the variable with the highest p-value. By this rule, "Beds" (Be), the variable with the highest p-value in the regression model above, is the first variable to be deleted.
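To make this first step concrete, `drop1()` reports the F-test p-value for removing each term from the full model (a minimal sketch; the conclusions above rest on the summary output, not on this call):
```{r}
# F-test for dropping each single term from the full model; the largest
# p-value (Be) marks the first variable backward elimination removes.
drop1(reg_tot, test = "F")
```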
## (B)
The forward selection method starts with a model that contains no variables, fits a simple regression of the dependent variable on each candidate variable, and at each step adds the variable with the lowest p-value to the model.
To that end, I obtained a p-value for each candidate through a Pearson correlation test between each independent variable and the dependent variable P.
```{r}
lapply(house.selling.price.2[, -1], function(x) cor.test(house.selling.price.2$P, x))
```
As a result, the independent variable size (S) has the smallest p-value (< 2.2e-16). Therefore, the first variable to be added is size.
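The same ranking can be obtained with `add1()`, which reports the F-test p-value for adding each candidate to the intercept-only model (a sketch; for a single added variable, these F-test p-values coincide with the correlation tests above):
```{r}
# F-tests for adding each candidate to the intercept-only model; the
# smallest p-value (S) is the first variable forward selection enters.
add1(lm(P ~ 1, data = house.selling.price.2),
     scope = ~ S + Be + Ba + New, test = "F")
```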
## (C)
The Beds variable may suffer from high multicollinearity, contain outliers, or be distorted by the small sample.
Taking these one by one: according to the correlation matrix above, Beds has a correlation coefficient of .6 or more with some variables, but that is hard to read as serious multicollinearity.
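To check this more directly, the variance inflation factors can be inspected (a sketch; assumes the car package, which Question 3 later loads via alr4, is installed):
```{r}
# Variance inflation factors for the full model; values well below the
# common cutoff of 5 would indicate only mild multicollinearity.
library(car)
vif(reg_tot)
```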
Next, I checked for outliers.
```{r}
summary(house.selling.price.2$Be)
table(house.selling.price.2$Be)
```
Beds ranges from a minimum of 1 to a maximum of 5, and more than half of the observations equal 3. So this is not an outlier problem.
The multicollinearity of the Beds variable is not high, and no unusual values are present.
I therefore think this result is due to bias from the small sample size.
## (D)
### forward selection
I selected a model under each criterion using forward selection.
First, the four one-variable models were fitted and compared on each criterion.
```{r}
# 1st step: fitting 4 models with one variable
reg_S<-lm(P~S, house.selling.price.2)
reg_Be<-lm(P~Be, house.selling.price.2)
reg_Ba<-lm(P~Ba, house.selling.price.2)
reg_New<-lm(P~New, house.selling.price.2)
# model with S
summary(reg_S)
price<-house.selling.price.2$P
# in-sample sum of squared prediction errors (the SSE) for each model
sum((price - predict(reg_S, newdata = house.selling.price.2))^2)
AIC(reg_S)
BIC(reg_S)
# model with Be
summary(reg_Be)
sum((price - predict(reg_Be, newdata = house.selling.price.2))^2)
AIC(reg_Be)
BIC(reg_Be)
# model with Ba
summary(reg_Ba)
sum((price - predict(reg_Ba, newdata = house.selling.price.2))^2)
AIC(reg_Ba)
BIC(reg_Ba)
# model with New
summary(reg_New)
sum((price - predict(reg_New, newdata = house.selling.price.2))^2)
AIC(reg_New)
BIC(reg_New)
```
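One caveat: because `predict()` here is evaluated on the same data the model was fitted to, the sums above are the in-sample SSE. The classical leave-one-out PRESS could be computed with a small helper (a sketch for reference; it is not the source of the numbers reported in this document):
```{r}
# Leave-one-out PRESS via the hat-matrix shortcut: each deleted residual
# equals e_i / (1 - h_i), so no refitting is needed.
press <- function(model) sum((residuals(model) / (1 - hatvalues(model)))^2)
press(reg_S)
```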
The model with S performs best on every criterion, so in step 1, S enters the model.
Second, each remaining variable was added, one at a time, to the model containing S (size), and the two-variable models were compared on each criterion.
```{r}
# 2nd step: fitting 3 models with two variables
reg_S_Be<-lm(P~S+Be, house.selling.price.2)
reg_S_Ba<-lm(P~S+Ba, house.selling.price.2)
reg_S_New<-lm(P~S+New, house.selling.price.2)
# model with S and Be
summary(reg_S_Be)
sum((price - predict(reg_S_Be, newdata = house.selling.price.2))^2)
AIC(reg_S_Be)
BIC(reg_S_Be)
# model with S and Ba
summary(reg_S_Ba)
sum((price - predict(reg_S_Ba, newdata = house.selling.price.2))^2)
AIC(reg_S_Ba)
BIC(reg_S_Ba)
# model with S and New
summary(reg_S_New)
sum((price - predict(reg_S_New, newdata = house.selling.price.2))^2)
AIC(reg_S_New)
BIC(reg_S_New)
```
The model that adds "New" to "S" performs best on all criteria, so "New" enters the model at this stage.
Third, the remaining variables "Be" and "Ba" were each added to the model containing "S" and "New", and the two resulting models were compared.
```{r}
# 3rd step: fitting 2 models with three variables
reg_S_N_Be<-lm(P~S+New+Be, house.selling.price.2)
reg_S_N_Ba<-lm(P~S+New+Ba, house.selling.price.2)
# model with S and New and Be
summary(reg_S_N_Be)
sum((price - predict(reg_S_N_Be, newdata = house.selling.price.2))^2)
AIC(reg_S_N_Be)
BIC(reg_S_N_Be)
# model with S and New and Ba
summary(reg_S_N_Ba)
sum((price - predict(reg_S_N_Ba, newdata = house.selling.price.2))^2)
AIC(reg_S_N_Ba)
BIC(reg_S_N_Ba)
```
This time, the model with "Ba" beats the model with "Be" on every criterion, so "Ba" enters the model at this step.
Finally, I compared this three-variable model with the model that also includes "Be", i.e., the full model.
```{r}
# 4th step: fitting the model with all variables
summary(reg_tot)
sum((price - predict(reg_tot, newdata = house.selling.price.2))^2)
AIC(reg_tot)
BIC(reg_tot)
```
All analysis is now complete. The model selected under each criterion is as follows:
1. R-squared: The model including all variables has the highest R-squared, so by this criterion the full model is selected.
2. Adj. R-squared: The three-variable model (excluding "Be") has the highest adjusted R-squared, so by this criterion the model including "S", "Ba", and "New" is selected.
3. PRESS: The model including all variables has the lowest PRESS value, so by this criterion the full model is selected.
4. AIC: The three-variable model (excluding "Be") has the lowest AIC, so by this criterion the model including "S", "Ba", and "New" is selected.
5. BIC: The three-variable model (excluding "Be") has the lowest BIC, so by this criterion the model including "S", "Ba", and "New" is selected.
Overall, the full model is selected under the R-squared and PRESS criteria, and the model including "S", "New", and "Ba" is selected under adjusted R-squared, AIC, and BIC.
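For the AIC criterion specifically, base R's `step()` automates this forward search (a sketch; it uses AIC only, not the other criteria):
```{r}
# Automated AIC-based forward search; it should mirror the manual result
# above and stop at S, New, and Ba.
step(lm(P ~ 1, data = house.selling.price.2),
     scope = ~ S + Be + Ba + New, direction = "forward")
```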
### backward method
In the same way, I carried out model selection using the backward method.
Four models, each omitting one variable from the full model, were fitted and compared.
```{r}
# 1st step: fitting 4 models, each omitting one variable
reg_no_S<-lm(P~.-S, house.selling.price.2)
reg_no_Be<-lm(P~.-Be, house.selling.price.2)
reg_no_Ba<-lm(P~.-Ba, house.selling.price.2)
reg_no_New<-lm(P~.-New, house.selling.price.2)
# model without S
summary(reg_no_S)
sum((price - predict(reg_no_S, newdata = house.selling.price.2))^2)
AIC(reg_no_S)
BIC(reg_no_S)
# model without Be
summary(reg_no_Be)
sum((price - predict(reg_no_Be, newdata = house.selling.price.2))^2)
AIC(reg_no_Be)
BIC(reg_no_Be)
# model without Ba
summary(reg_no_Ba)
sum((price - predict(reg_no_Ba, newdata = house.selling.price.2))^2)
AIC(reg_no_Ba)
BIC(reg_no_Ba)
# model without New
summary(reg_no_New)
sum((price - predict(reg_no_New, newdata = house.selling.price.2))^2)
AIC(reg_no_New)
BIC(reg_no_New)
```
The model without "Be" performs best among these on all criteria. However, by the R-squared and PRESS criteria it still falls short of the model with all variables.
```{r}
# 2nd step: fitting 3 models, each omitting two variables
reg_no_Be_S<-lm(P~.-Be-S, house.selling.price.2)
reg_no_Be_Ba<-lm(P~.-Be-Ba, house.selling.price.2)
reg_no_Be_New<-lm(P~.-Be-New, house.selling.price.2)
# model without Be and S
summary(reg_no_Be_S)
sum((price - predict(reg_no_Be_S, newdata = house.selling.price.2))^2)
AIC(reg_no_Be_S)
BIC(reg_no_Be_S)
# model without Be and Ba
summary(reg_no_Be_Ba)
sum((price - predict(reg_no_Be_Ba, newdata = house.selling.price.2))^2)
AIC(reg_no_Be_Ba)
BIC(reg_no_Be_Ba)
# model without Be and New
summary(reg_no_Be_New)
sum((price - predict(reg_no_Be_New, newdata = house.selling.price.2))^2)
AIC(reg_no_Be_New)
BIC(reg_no_Be_New)
```
No model improves on the existing four- or three-variable models, so there are no more variables to exclude.
In summary, the results of model selection using the backward method are as follows:
1. R-squared: The model including all variables has the highest R-squared, so by this criterion the full model is selected.
2. Adj. R-squared: The three-variable model (excluding "Be") has the highest adjusted R-squared, so by this criterion the model including "S", "Ba", and "New" is selected.
3. PRESS: The model including all variables has the lowest PRESS value, so by this criterion the full model is selected.
4. AIC: The three-variable model (excluding "Be") has the lowest AIC, so by this criterion the model including "S", "Ba", and "New" is selected.
5. BIC: The three-variable model (excluding "Be") has the lowest BIC, so by this criterion the model including "S", "Ba", and "New" is selected.
Overall, the results are the same as with the forward method.
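The backward search can likewise be automated for the AIC criterion (a sketch):
```{r}
# Automated AIC-based backward elimination from the full model; it should
# drop Be and then stop, matching the manual search.
step(reg_tot, direction = "backward")
```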
## (E)
It is natural that a model's explanatory power increases as more variables are included. However, as the number of variables grows, the model becomes harder to interpret, and overfitting can occur, which inflates errors on new data. Criteria such as adjusted R-squared, AIC, and BIC were therefore developed to correct for the explanatory power gained simply by adding variables. From this point of view, I prefer the model including "S", "Ba", and "New", which these criteria select.
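For reference, these criteria penalize model size explicitly:
$$
\text{adj. } R^2 = 1 - (1 - R^2)\,\frac{n-1}{n-p-1}, \qquad
\mathrm{AIC} = 2k - 2\log\hat{L}, \qquad
\mathrm{BIC} = k\log n - 2\log\hat{L},
$$
where $p$ is the number of predictors, $k$ the number of estimated parameters, $n$ the sample size, and $\hat{L}$ the maximized likelihood; a larger model is rewarded only if its fit improves enough to offset the penalty.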
# Question 2
```{r}
data(trees)
names(trees)
# rename "Girth" (the trunk diameter) to "Diameter"
names(trees)[1] <- "Diameter"
names(trees)
nrow(trees)
```
## (A)
```{r}
tree_reg<-lm(Volume~., trees)
summary(tree_reg)
```
$$\widehat{\text{Volume}} = -57.9877 + 4.7082 \cdot \text{Diameter} + 0.3393 \cdot \text{Height}$$
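As a usage check, the fitted equation can be evaluated with `predict()` (the tree dimensions below are hypothetical values chosen for illustration):
```{r}
# Predicted volume for a hypothetical tree (10 in. diameter, 70 ft tall),
# i.e., the fitted equation evaluated at those values.
predict(tree_reg, newdata = data.frame(Diameter = 10, Height = 70))
```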
## (B)
```{r}
par(mfrow = c(2, 3))
plot(tree_reg, which = 1:6)
```
Since the "Residuals vs Fitted" plot is curved, the linearity assumption of this model is violated. In addition, the trend in the "Scale-Location" plot changes sharply across the fitted values, indicating that the homoskedasticity assumption is violated. Normality looks satisfied according to the "Normal Q-Q" plot. According to the remaining plots, a few observations have high influence on the model, but they do not seem problematic (3 out of 31 observations).
Next, I examined a plot of the residuals in observation order to check the independence of the errors.
```{r}
# residuals in data order (equivalent to residuals(tree_reg))
res <- trees$Volume - predict(tree_reg, newdata = trees)
plot(res)
```
Since no unusual pattern appears, the assumption of independent errors is satisfied.
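A formal check is the Durbin-Watson test (a sketch; assumes the car package is installed and that the row order of `trees` is meaningful):
```{r}
# Durbin-Watson test for serial correlation among residuals; a statistic
# near 2 is consistent with independent errors.
library(car)
durbinWatsonTest(tree_reg)
```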
# Question 3
```{r}
library(alr4)
data("florida")
florida["PALM BEACH",]
```
## (A)
```{r}
reg_vote<-lm(Buchanan~Bush, florida)
summary(reg_vote)
par(mfrow = c(2, 3))
plot(reg_vote, which = 1:6)
```
According to these plots, Palm Beach is an outlier: in every plot, the Palm Beach point lies far from the trend followed by the other counties.
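The anomaly can also be quantified directly (a sketch) via the studentized residual for Palm Beach:
```{r}
# Studentized residual for Palm Beach; magnitudes far beyond 3 indicate
# a clear outlier.
rstudent(reg_vote)["PALM BEACH"]
```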
## (B)
```{r}
# log-transform both variables
florida$log_Buchanan<-log(florida$Buchanan)
florida$log_Bush<-log(florida$Bush)
# fitting
log_reg<-lm(log_Buchanan~log_Bush, florida)
summary(log_reg)
# diagnostic plots
par(mfrow = c(2, 3))
plot(log_reg, which = 1:6)
```
Although the log transformation reduced the spread of the data, Palm Beach still has high leverage and its Cook's distance still exceeds the 4/n threshold, so I still consider it an outlier.
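To make the 4/n comparison explicit (a sketch):
```{r}
# Counties whose Cook's distance exceeds the 4/n rule of thumb in the
# log-log model; Palm Beach should appear among them.
n <- nrow(florida)
cooks.distance(log_reg)[cooks.distance(log_reg) > 4 / n]
```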