Backward elimination begins with the full model and iteratively removes variables that do not contribute substantially to it. Because Be (Beds) has the highest p-value (0.487), it is excluded first, and the model is refit without it.
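Below is a minimal sketch of that refit, assuming the house.selling.price.2 data frame is already loaded; it matches the Call in the output that follows:
summary(lm(P ~ . - Be, data = house.selling.price.2))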
Call:
lm(formula = P ~ . - Be, data = house.selling.price.2)
Residuals:
Min 1Q Median 3Q Max
-34.804 -9.496 0.917 7.931 73.338
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -47.992 8.209 -5.847 8.15e-08 ***
S 62.263 4.335 14.363 < 2e-16 ***
Ba 20.072 5.495 3.653 0.000438 ***
New 18.371 3.761 4.885 4.54e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 16.31 on 89 degrees of freedom
Multiple R-squared: 0.8681, Adjusted R-squared: 0.8637
F-statistic: 195.3 on 3 and 89 DF, p-value: < 2.2e-16
Every variable remaining in the model has a p-value below 0.05, indicating statistical significance. With the Be (Beds) variable eliminated, the significance of the remaining variables S, Ba, and New has increased.
For verification, we can use the step() function from the stats package for backward stepwise selection, which arrives at a model containing exactly the statistically significant predictors of the response: S, Ba, and New.
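A hedged sketch of a backward step() call on the model without Be (the exact construction of the starting model object may differ, but the Start line below shows it already excludes Be):
step(lm(P ~ . - Be, data = house.selling.price.2), direction = "backward")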
Start: AIC=523.21
P ~ (S + Be + Ba + New) - Be
Df Sum of Sq RSS AIC
<none> 23684 523.21
- Ba 1 3550 27234 534.20
- New 1 6349 30033 543.30
- S 1 54898 78582 632.75
Call:
lm(formula = P ~ (S + Be + Ba + New) - Be, data = house.selling.price.2)
Coefficients:
(Intercept) S Ba New
-47.99 62.26 20.07 18.37
B
We begin with the intercept-only model and gradually add variables, starting with the most statistically significant. As the trace below shows, S is added first, followed by New (p-value 4.54e-06) and Ba.
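A minimal sketch of the forward-selection call, assuming the intercept-only starting model (step_forward is the object printed further below):
step_forward <- step(lm(P ~ 1, data = house.selling.price.2),
                     scope = ~ S + Be + Ba + New,
                     direction = "forward")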
Start: AIC=705.63
P ~ 1
Df Sum of Sq RSS AIC
+ S 1 145097 34508 554.22
+ Ba 1 91484 88121 641.41
+ Be 1 62578 117028 667.79
+ New 1 22833 156772 694.99
<none> 179606 705.63
Step: AIC=554.22
P ~ S
Df Sum of Sq RSS AIC
+ New 1 7274.7 27234 534.20
+ Ba 1 4475.6 30033 543.30
<none> 34508 554.22
+ Be 1 40.4 34468 556.11
Step: AIC=534.2
P ~ S + New
Df Sum of Sq RSS AIC
+ Ba 1 3550.1 23684 523.21
+ Be 1 588.8 26645 534.17
<none> 27234 534.20
Step: AIC=523.21
P ~ S + New + Ba
Df Sum of Sq RSS AIC
<none> 23684 523.21
+ Be 1 130.55 23553 524.70
step_forward
Call:
lm(formula = P ~ S + New + Ba, data = house.selling.price.2)
Coefficients:
(Intercept) S New Ba
-47.99 62.26 18.37 20.07
Notably, backward and forward selection arrive at the same model with identical coefficients.
C
BEDS has a large p-value, indicating that it is statistically insignificant, so we fail to reject the null hypothesis H0. Furthermore, BEDS appears to be strongly correlated with the variable Size. This makes logical sense: the more bedrooms a house has, the larger it tends to be, so the information in BEDS may already be captured by Size. Also worth noting is the modest sample size (n = 93), which may have contributed to the high p-value and the strong association.
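A minimal sketch of the bidirectional stepwise call (direction = "both") that produces the trace below; stepwise_housing is the object summarized afterwards:
stepwise_housing <- step(lm(P ~ 1, data = house.selling.price.2),
                         scope = ~ S + Be + Ba + New,
                         direction = "both")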
Start: AIC=705.63
P ~ 1
Df Sum of Sq RSS AIC
+ S 1 145097 34508 554.22
+ Ba 1 91484 88121 641.41
+ Be 1 62578 117028 667.79
+ New 1 22833 156772 694.99
<none> 179606 705.63
Step: AIC=554.22
P ~ S
Df Sum of Sq RSS AIC
+ New 1 7275 27234 534.20
+ Ba 1 4476 30033 543.30
<none> 34508 554.22
+ Be 1 40 34468 556.11
- S 1 145097 179606 705.63
Step: AIC=534.2
P ~ S + New
Df Sum of Sq RSS AIC
+ Ba 1 3550 23684 523.21
+ Be 1 589 26645 534.17
<none> 27234 534.20
- New 1 7275 34508 554.22
- S 1 129539 156772 694.99
Step: AIC=523.21
P ~ S + New + Ba
Df Sum of Sq RSS AIC
<none> 23684 523.21
+ Be 1 131 23553 524.70
- Ba 1 3550 27234 534.20
- New 1 6349 30033 543.30
- S 1 54898 78582 632.75
summary(stepwise_housing)
Call:
lm(formula = P ~ S + New + Ba, data = house.selling.price.2)
Residuals:
Min 1Q Median 3Q Max
-34.804 -9.496 0.917 7.931 73.338
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -47.992 8.209 -5.847 8.15e-08 ***
S 62.263 4.335 14.363 < 2e-16 ***
New 18.371 3.761 4.885 4.54e-06 ***
Ba 20.072 5.495 3.653 0.000438 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 16.31 on 89 degrees of freedom
Multiple R-squared: 0.8681, Adjusted R-squared: 0.8637
F-statistic: 195.3 on 3 and 89 DF, p-value: < 2.2e-16
Stepwise regression identifies the model's most helpful variables. We begin with Model_1, which includes all of the variables in the housing data. To extract concise summaries from each fitted model object, we use the glance() function from the broom package.
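A minimal sketch of the comparison, assuming the four model objects named in the PRESS and AIC code further below:
library(broom)
# Collect R^2 and adjusted R^2 for the four candidate models
models <- list(m1 = all_housing_price_m1, m2 = housing_price_nobed_m2,
               m3 = housing_price_nobed_noba_m3, m4 = housing_price_new_m4)
sapply(models, function(m) unlist(glance(m)[c("r.squared", "adj.r.squared")]))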
Our R^2 for Model_1 with all predictors is 0.8689.
The Beds variable is eliminated in Model_2 since, as shown above, it is strongly correlated with Size and has a high p-value, rendering it statistically insignificant.
Comparing the R^2 of all four models, Model_1 has the highest R^2 (0.8689) and is the best model by that criterion. Comparing adjusted R^2, however, Model_2 has the highest value (0.8637) and is the best model by that criterion.
# PRESS: predicted residual sum of squares, computed from the leave-one-out residuals
PRESS <- function(model) {
  i <- residuals(model) / (1 - lm.influence(model)$hat)
  sum(i^2)
}
PRESS(all_housing_price_m1) #model 1 - all variables
[1] 28390.22
PRESS(housing_price_nobed_m2) #model 2 - excludes variable Bed (includes size)
[1] 27860.05
PRESS(housing_price_nobed_noba_m3) #model 3 - excludes variable Bed and Bath (includes size)
[1] 27690.82
PRESS(housing_price_new_m4) #model 4 - excludes every variable except New (inc.size)
[1] 31066
Given several regression models, the one with the lowest PRESS should be chosen as the best performer. Accordingly, Model_3 (PRESS = 27690.82), which removes the variables Bed and Bath, and Model_2 (PRESS = 27860.05), which excludes only Bed, should both be considered.
AIC
Comparing the AIC of all four models, and since the aim is to identify the model with the lowest AIC, Model_3, with Bed and Bath excluded, performs best at 765.8876. Model_2, which omits only Bed, would also be a reasonable choice at 789.1366, since it is improbable that a house would be built without bedrooms or bathrooms.
#AIC for all 4 models
AIC(all_housing_price_m1) #model 1
[1] 790.6225
AIC(housing_price_nobed_m2) #model 2
[1] 789.1366
AIC(housing_price_nobed_noba_m3) #model 3
[1] 765.8876
AIC(housing_price_new_m4) #model 4
[1] 800.1262
#BIC for all 4 models
BIC(all_housing_price_m1) #model 1
[1] 805.8181
BIC(housing_price_nobed_m2) #model 2
[1] 801.7996
BIC(housing_price_nobed_noba_m3) #model 3
[1] 778.3866
BIC(housing_price_new_m4) #model 4
[1] 810.2566
Comparing all four models as we did with AIC, Model_3, which excludes Bed and Bath while including Size and New, has the lowest BIC at 778.3866, compared with Model_2 (Size, New, and Bath, omitting Bed) at 801.7996.
E
Of all the models tested for best fit, my preference is Model_2, which eliminates only the Bed variable. It consistently delivered strong predictive performance with low AIC, BIC, and PRESS. Although Model_3 had lower AIC, BIC, and PRESS values, its adjusted R^2 of 0.8607 falls below Model_2's 0.8637.
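We now turn to the trees dataset. The summary below can be produced from the built-in trees data; a minimal sketch:
summary(trees)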
Girth Height Volume
Min. : 8.30 Min. :63 Min. :10.20
1st Qu.:11.05 1st Qu.:72 1st Qu.:19.40
Median :12.90 Median :76 Median :24.20
Mean :13.25 Mean :76 Mean :30.17
3rd Qu.:15.25 3rd Qu.:80 3rd Qu.:37.30
Max. :20.60 Max. :87 Max. :77.00
We rename the variable Girth to Diameter, since the instructions state that Girth is labelled erroneously.
library(dplyr)
Attaching package: 'dplyr'
The following object is masked from 'package:car':
recode
The following objects are masked from 'package:stats':
filter, lag
The following objects are masked from 'package:base':
intersect, setdiff, setequal, union
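A minimal sketch of the rename, assuming the built-in trees data (trees_d is the data frame used in the model fit below):
trees_d <- rename(trees, Diameter = Girth)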
The model is now fitted using Volume as the outcome and Diameter and Height as explanatory variables.
#Fitting a multiple regression model with the Volume as outcome
Tree_Vol <- lm(Volume ~ Diameter + Height, data = trees_d)
summary(Tree_Vol)
Call:
lm(formula = Volume ~ Diameter + Height, data = trees_d)
Residuals:
Min 1Q Median 3Q Max
-6.4065 -2.6493 -0.2876 2.2003 8.4847
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -57.9877 8.6382 -6.713 2.75e-07 ***
Diameter 4.7082 0.2643 17.816 < 2e-16 ***
Height 0.3393 0.1302 2.607 0.0145 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.882 on 28 degrees of freedom
Multiple R-squared: 0.948, Adjusted R-squared: 0.9442
F-statistic: 255 on 2 and 28 DF, p-value: < 2.2e-16
Diameter and Height both have small p-values, indicating statistical significance. This makes sense, as both should predict tree volume. The model also fits the data very well: R^2 is 0.948 and adjusted R^2 is 0.9442, both close to 1.00.
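The diagnostic plots discussed below can be generated with the standard plot() method for lm objects; a minimal sketch:
# 1: Residuals vs Fitted, 2: Normal Q-Q, 3: Scale-Location,
# 4: Cook's Distance, 5: Residuals vs Leverage, 6: Cook's Distance vs Leverage
par(mfrow = c(2, 3))
plot(Tree_Vol, which = 1:6)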
Residuals vs Fitted
This plot is used to identify whether the residuals exhibit non-linear patterns. If the red line across the center of the plot is nearly horizontal, the residuals can be assumed to follow a linear pattern. For the trees data, the red line deviates from a horizontal line and the residuals follow a mildly non-linear (curved) pattern, suggesting that a non-linear model may be more suitable for this dataset and that the linearity assumption of the regression is violated.
Normal Q-Q
This figure is used to assess whether the regression model's residuals are normally distributed. We may infer that the residuals are normally distributed if the points in this plot lie approximately along a straight diagonal line. Here they do; a handful of observations deviate somewhat from the line towards the top, but not enough to declare the residuals non-normally distributed.
Scale-Location
This figure checks the assumption of equal variance ("homoscedasticity") among the residuals of our regression model. The assumption is likely satisfied if the red line is roughly horizontal across the plot. In this case the red line is not horizontal, implying that the equal-variance assumption may be violated.
Cook's Distance
The Cook's Distance plot estimates the influence of each data point, taking into account both its leverage and its residual. Cook's distance summarizes how much the regression model changes when the ith observation is removed. Here, Cook's distance is computed for each of the n = 31 cases. The most influential observations are 31 (0.605), 18 (0.177), and 3 (0.167), as demonstrated below with the cooks.distance() function. Certain observations in the dataset are therefore highly influential.
Using cooks.distance() to identify the influential observations:
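A minimal sketch, using the fitted Tree_Vol model:
cooks_d <- cooks.distance(Tree_Vol)
sort(cooks_d, decreasing = TRUE)[1:3] # observations 31, 18, and 3, per the values above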
Residuals vs Leverage
This plot highlights influential observations, i.e., those with high leverage. If any points fall outside Cook's distance (the dashed lines), they are influential observations. That is the case here, as evidenced by the highly leveraged observations.
Cook's Distance versus Leverage
Like the Cook's Distance plot, this plot points to highly influential data points, particularly observations 3, 18, and 31.
Gore Bush Buchanan
Min. : 788 Min. : 1316 Min. : 9.0
1st Qu.: 3055 1st Qu.: 4746 1st Qu.: 46.5
Median : 14152 Median : 20196 Median : 114.0
Mean : 43341 Mean : 43356 Mean : 258.5
3rd Qu.: 45974 3rd Qu.: 56542 3rd Qu.: 285.5
Max. :386518 Max. :289456 Max. :3407.0
str(florida)
'data.frame': 67 obs. of 3 variables:
$ Gore : int 47300 2392 18850 3072 97318 386518 2155 29641 25501 14630 ...
$ Bush : int 34062 5610 38637 5413 115185 177279 2873 35419 29744 41745 ...
$ Buchanan: int 262 73 248 65 570 789 90 182 270 186 ...
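A minimal sketch of the simple regression of Buchanan's votes on Bush's, matching the Call below:
summary(lm(Buchanan ~ Bush, data = florida))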
Call:
lm(formula = Buchanan ~ Bush, data = florida)
Residuals:
Min 1Q Median 3Q Max
-907.50 -46.10 -29.19 12.26 2610.19
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.529e+01 5.448e+01 0.831 0.409
Bush 4.917e-03 7.644e-04 6.432 1.73e-08 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 353.9 on 65 degrees of freedom
Multiple R-squared: 0.3889, Adjusted R-squared: 0.3795
F-statistic: 41.37 on 1 and 65 DF, p-value: 1.727e-08
We first observe that Bush votes are a statistically significant predictor, with a p-value of 1.73e-08. However, both the R^2 (0.3889) and the adjusted R^2 (0.3795) are low, indicating a less-than-ideal fit.