Homework_5_Saisrinivas_Ambatipudi

Author

Saisrinivas Ambatipudi

Published

May 8, 2023

Question 1

A

Backward elimination begins with the full model and removes, one at a time, the variables that do not contribute substantially to it. Because the variable Be (Beds) has the highest p-value, 0.487, it is excluded first. We then use the step() function to confirm that no further variables should be removed.

library(smss) #house.selling.price.2 comes from the smss package
data(house.selling.price.2)
fit_Backward <- lm(P ~ . - Be, data = house.selling.price.2)
summary(fit_Backward)

Call:
lm(formula = P ~ . - Be, data = house.selling.price.2)

Residuals:
    Min      1Q  Median      3Q     Max 
-34.804  -9.496   0.917   7.931  73.338 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -47.992      8.209  -5.847 8.15e-08 ***
S             62.263      4.335  14.363  < 2e-16 ***
Ba            20.072      5.495   3.653 0.000438 ***
New           18.371      3.761   4.885 4.54e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 16.31 on 89 degrees of freedom
Multiple R-squared:  0.8681,    Adjusted R-squared:  0.8637 
F-statistic: 195.3 on 3 and 89 DF,  p-value: < 2.2e-16

Every variable remaining in the model has a p-value below 0.05, indicating that it is statistically significant. With the Be (Beds) variable eliminated, the significance of the remaining variables S, Ba, and New has increased.

For verification, we can use the stats package’s step() function for backward stepwise selection, which confirms a regression model that retains the statistically significant predictors of the response: S, Ba, and New.

step(object = fit_Backward, direction = "backward",
     scope = fit_Backward) #backward elimination
Start:  AIC=523.21
P ~ (S + Be + Ba + New) - Be

       Df Sum of Sq   RSS    AIC
<none>              23684 523.21
- Ba    1      3550 27234 534.20
- New   1      6349 30033 543.30
- S     1     54898 78582 632.75

Call:
lm(formula = P ~ (S + Be + Ba + New) - Be, data = house.selling.price.2)

Coefficients:
(Intercept)            S           Ba          New  
     -47.99        62.26        20.07        18.37  

B

Forward selection begins with the intercept-only model and adds variables one at a time, starting with the one that improves the model most. As the step() output below shows, S (Size) is added first since it yields the largest drop in AIC, followed by New and then Ba; Be is never added.

fit_Forward<-lm(P~1,data = house.selling.price.2)
summary(fit_Forward)

Call:
lm(formula = P ~ 1, data = house.selling.price.2)

Residuals:
    Min      1Q  Median      3Q     Max 
-82.033 -26.633  -3.533  15.467 209.867 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   99.533      4.582   21.72   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 44.18 on 92 degrees of freedom
step_forward<-step(object=fit_Forward,direction = "forward",
                      scope = P~S+Be+Ba+New)
Start:  AIC=705.63
P ~ 1

       Df Sum of Sq    RSS    AIC
+ S     1    145097  34508 554.22
+ Ba    1     91484  88121 641.41
+ Be    1     62578 117028 667.79
+ New   1     22833 156772 694.99
<none>              179606 705.63

Step:  AIC=554.22
P ~ S

       Df Sum of Sq   RSS    AIC
+ New   1    7274.7 27234 534.20
+ Ba    1    4475.6 30033 543.30
<none>              34508 554.22
+ Be    1      40.4 34468 556.11

Step:  AIC=534.2
P ~ S + New

       Df Sum of Sq   RSS    AIC
+ Ba    1    3550.1 23684 523.21
+ Be    1     588.8 26645 534.17
<none>              27234 534.20

Step:  AIC=523.21
P ~ S + New + Ba

       Df Sum of Sq   RSS    AIC
<none>              23684 523.21
+ Be    1    130.55 23553 524.70
step_forward

Call:
lm(formula = P ~ S + New + Ba, data = house.selling.price.2)

Coefficients:
(Intercept)            S          New           Ba  
     -47.99        62.26        18.37        20.07  

Notably, backward and forward selection arrive at the same model, so the estimated coefficients are identical. A quick check appears below.
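
As a sanity check (a sketch whose output is not shown), the two fitted coefficient vectors can be compared directly:

#Sketch: verify that forward and backward selection agree
all.equal(coef(fit_Backward)[names(coef(step_forward))], coef(step_forward))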

C

Beds has a large p-value (0.487), indicating that it is statistically insignificant, so we fail to reject the null hypothesis H0 that its coefficient is zero. Furthermore, Beds appears to be strongly correlated with the variable Size. This makes sense: the more bedrooms a house has, the larger it tends to be, so the information in Beds is largely already captured by Size. The modest sample size (n = 93) may also contribute to the high p-value. A quick correlation check follows.
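
As a quick check of the claim that Beds and Size are related (a sketch, not part of the original output), the correlation can be computed directly:

#Sketch: correlation between Beds and Size
cor(house.selling.price.2$Be, house.selling.price.2$S)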

D

housing_price <- lm(P ~ 1, data = house.selling.price.2)
all_housing_price_m1 <- lm(P ~ ., data = house.selling.price.2)
stepwise_housing <- step(object = housing_price, direction = "both",
                         scope = formula(all_housing_price_m1))
Start:  AIC=705.63
P ~ 1

       Df Sum of Sq    RSS    AIC
+ S     1    145097  34508 554.22
+ Ba    1     91484  88121 641.41
+ Be    1     62578 117028 667.79
+ New   1     22833 156772 694.99
<none>              179606 705.63

Step:  AIC=554.22
P ~ S

       Df Sum of Sq    RSS    AIC
+ New   1      7275  27234 534.20
+ Ba    1      4476  30033 543.30
<none>               34508 554.22
+ Be    1        40  34468 556.11
- S     1    145097 179606 705.63

Step:  AIC=534.2
P ~ S + New

       Df Sum of Sq    RSS    AIC
+ Ba    1      3550  23684 523.21
+ Be    1       589  26645 534.17
<none>               27234 534.20
- New   1      7275  34508 554.22
- S     1    129539 156772 694.99

Step:  AIC=523.21
P ~ S + New + Ba

       Df Sum of Sq   RSS    AIC
<none>              23684 523.21
+ Be    1       131 23553 524.70
- Ba    1      3550 27234 534.20
- New   1      6349 30033 543.30
- S     1     54898 78582 632.75
summary(stepwise_housing)

Call:
lm(formula = P ~ S + New + Ba, data = house.selling.price.2)

Residuals:
    Min      1Q  Median      3Q     Max 
-34.804  -9.496   0.917   7.931  73.338 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -47.992      8.209  -5.847 8.15e-08 ***
S             62.263      4.335  14.363  < 2e-16 ***
New           18.371      3.761   4.885 4.54e-06 ***
Ba            20.072      5.495   3.653 0.000438 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 16.31 on 89 degrees of freedom
Multiple R-squared:  0.8681,    Adjusted R-squared:  0.8637 
F-statistic: 195.3 on 3 and 89 DF,  p-value: < 2.2e-16

Stepwise regression identifies the model’s most useful variables. We begin with Model_1, which includes all of the variables in the housing data. The broom package is used to extract concise summary information from a model object; specifically, we will use its glance() function.

#model 1
all_housing_price_m1<-lm(P~.,data = house.selling.price.2)
summary(all_housing_price_m1)

Call:
lm(formula = P ~ ., data = house.selling.price.2)

Residuals:
    Min      1Q  Median      3Q     Max 
-36.212  -9.546   1.277   9.406  71.953 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -41.795     12.104  -3.453 0.000855 ***
S             64.761      5.630  11.504  < 2e-16 ***
Be            -2.766      3.960  -0.698 0.486763    
Ba            19.203      5.650   3.399 0.001019 ** 
New           18.984      3.873   4.902  4.3e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 16.36 on 88 degrees of freedom
Multiple R-squared:  0.8689,    Adjusted R-squared:  0.8629 
F-statistic: 145.8 on 4 and 88 DF,  p-value: < 2.2e-16
library(broom)
Warning: package 'broom' was built under R version 4.2.3
glance(all_housing_price_m1)
# A tibble: 1 × 12
  r.squared adj.r.squa…¹ sigma stati…²  p.value    df logLik   AIC   BIC devia…³
      <dbl>        <dbl> <dbl>   <dbl>    <dbl> <dbl>  <dbl> <dbl> <dbl>   <dbl>
1     0.869        0.863  16.4    146. 5.94e-38     4  -389.  791.  806.  23553.
# … with 2 more variables: df.residual <int>, nobs <int>, and abbreviated
#   variable names ¹​adj.r.squared, ²​statistic, ³​deviance

Our R^2 for Model_1 with all predictors is 0.8689.

The Beds variable is eliminated in Model_2 since, as discussed above, it is strongly correlated with Size and has a high p-value, making it statistically insignificant.

#model 2
housing_price_nobed_m2<-lm(P~.-Be,data = house.selling.price.2)
summary(housing_price_nobed_m2)

Call:
lm(formula = P ~ . - Be, data = house.selling.price.2)

Residuals:
    Min      1Q  Median      3Q     Max 
-34.804  -9.496   0.917   7.931  73.338 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -47.992      8.209  -5.847 8.15e-08 ***
S             62.263      4.335  14.363  < 2e-16 ***
Ba            20.072      5.495   3.653 0.000438 ***
New           18.371      3.761   4.885 4.54e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 16.31 on 89 degrees of freedom
Multiple R-squared:  0.8681,    Adjusted R-squared:  0.8637 
F-statistic: 195.3 on 3 and 89 DF,  p-value: < 2.2e-16
glance(housing_price_nobed_m2)
# A tibble: 1 × 12
  r.squared adj.r.squa…¹ sigma stati…²  p.value    df logLik   AIC   BIC devia…³
      <dbl>        <dbl> <dbl>   <dbl>    <dbl> <dbl>  <dbl> <dbl> <dbl>   <dbl>
1     0.868        0.864  16.3    195. 4.96e-39     3  -390.  789.  802.  23684.
# … with 2 more variables: df.residual <int>, nobs <int>, and abbreviated
#   variable names ¹​adj.r.squared, ²​statistic, ³​deviance

R^2 for Model_2 is 0.8681.

#model 3
#Note: the `-Ba` below is matched to lm()'s `subset` argument, so it drops
#three observations rather than removing the Ba variable; Ba remains in the
#model (see the Call and coefficients below). The intended formula was
#P ~ . - Be - Ba.
housing_price_nobed_noba_m3<-lm(P~.-Be,-Ba,data = house.selling.price.2)
summary(housing_price_nobed_noba_m3)

Call:
lm(formula = P ~ . - Be, data = house.selling.price.2, subset = -Ba)

Residuals:
    Min      1Q  Median      3Q     Max 
-34.787  -9.785   0.897   8.108  73.084 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -49.016      8.588  -5.707 1.60e-07 ***
S             61.863      4.473  13.832  < 2e-16 ***
Ba            20.996      5.760   3.645 0.000457 ***
New           18.197      3.821   4.763 7.68e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 16.5 on 86 degrees of freedom
Multiple R-squared:  0.8654,    Adjusted R-squared:  0.8607 
F-statistic: 184.3 on 3 and 86 DF,  p-value: < 2.2e-16
glance(housing_price_nobed_noba_m3)
# A tibble: 1 × 12
  r.squared adj.r.squa…¹ sigma stati…²  p.value    df logLik   AIC   BIC devia…³
      <dbl>        <dbl> <dbl>   <dbl>    <dbl> <dbl>  <dbl> <dbl> <dbl>   <dbl>
1     0.865        0.861  16.5    184. 2.48e-37     3  -378.  766.  778.  23405.
# … with 2 more variables: df.residual <int>, nobs <int>, and abbreviated
#   variable names ¹​adj.r.squared, ²​statistic, ³​deviance

We see that R^2 for Model_3 is 0.8654; note that, because of the subset issue flagged above, this model is fit to 90 rather than 93 observations.

#model 4
housing_price_new_m4<-lm(P~S+New,data = house.selling.price.2)
summary(housing_price_new_m4)

Call:
lm(formula = P ~ S + New, data = house.selling.price.2)

Residuals:
    Min      1Q  Median      3Q     Max 
-47.207  -9.763  -0.091   9.984  76.405 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -26.089      5.977  -4.365 3.39e-05 ***
S             72.575      3.508  20.690  < 2e-16 ***
New           19.587      3.995   4.903 4.16e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 17.4 on 90 degrees of freedom
Multiple R-squared:  0.8484,    Adjusted R-squared:  0.845 
F-statistic: 251.8 on 2 and 90 DF,  p-value: < 2.2e-16
glance(housing_price_new_m4)
# A tibble: 1 × 12
  r.squared adj.r.squa…¹ sigma stati…²  p.value    df logLik   AIC   BIC devia…³
      <dbl>        <dbl> <dbl>   <dbl>    <dbl> <dbl>  <dbl> <dbl> <dbl>   <dbl>
1     0.848        0.845  17.4    252. 1.37e-37     2  -396.  800.  810.  27234.
# … with 2 more variables: df.residual <int>, nobs <int>, and abbreviated
#   variable names ¹​adj.r.squared, ²​statistic, ³​deviance

R^2 for Model_4 is 0.8484.

Comparing the four models on R^2, Model_1 has the highest value at 0.8689. This is expected: R^2 can never decrease when predictors are added, so the full model always wins on raw R^2. On adjusted R^2, which penalizes extra predictors, Model_2 is best at 0.8637. A table gathering these statistics follows.
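
For convenience, the fit statistics of all four models can be collected into a single table with broom’s glance() (a sketch, output not shown; the nobs column would also reveal Model_3’s reduced sample):

#Sketch: gather the four models' fit statistics side by side
library(broom)
library(dplyr)
model_list <- list(m1 = all_housing_price_m1, m2 = housing_price_nobed_m2,
                   m3 = housing_price_nobed_noba_m3, m4 = housing_price_new_m4)
bind_rows(lapply(model_list, glance), .id = "model") %>%
  select(model, r.squared, adj.r.squared, AIC, BIC, nobs)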

#PRESS (predicted residual sum of squares): the sum of squared leave-one-out
#prediction errors, computed from the ordinary residuals and the hat values
PRESS <- function(model) {
    pr <- residuals(model)/(1 - lm.influence(model)$hat)
    sum(pr^2)
}

PRESS(all_housing_price_m1)        #model 1 - all variables
[1] 28390.22
PRESS(housing_price_nobed_m2)      #model 2 - excludes variable Bed (includes size)
[1] 27860.05
PRESS(housing_price_nobed_noba_m3) #model 3 - intended to drop Bed and Bath, but retains Bath (see note above)
[1] 27690.82
PRESS(housing_price_new_m4)        #model 4 - only Size and New
[1] 31066

Given several regression models, the one with the lowest PRESS should be chosen as the best performer. Model_3 has the lowest PRESS at 27690.82, but it is fit to only 90 observations because of the subset issue, so its PRESS is not directly comparable to the others. Among the models fit to the full data, Model_2, which excludes only Bed, performs best with a PRESS of 27860.05. The leave-one-out check below illustrates what PRESS computes.
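
Because PRESS is exactly the sum of squared leave-one-out prediction errors, it can be verified by brute force (a sketch; the result should match PRESS(housing_price_nobed_m2) above, about 27860):

#Sketch: PRESS as explicit leave-one-out cross-validation for Model_2
loo_sq_err <- sapply(seq_len(nrow(house.selling.price.2)), function(i) {
  fit_i <- lm(P ~ . - Be, data = house.selling.price.2[-i, ])
  (house.selling.price.2$P[i] - predict(fit_i, house.selling.price.2[i, ]))^2
})
sum(loo_sq_err)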

AIC and BIC

Comparing the four models on AIC, where the aim is the lowest value, Model_3 appears best at 765.89; again, though, it is fit to a different subset of the data (90 observations), so its AIC cannot fairly be compared to the rest. Among the models fit to all 93 observations, Model_2, which omits only the Bed variable, is best at 789.14.
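
As a reminder of what AIC measures, AIC = 2k - 2*logLik, where k counts all estimated parameters (including the residual variance); BIC replaces the 2k penalty with k*log(n). A minimal sketch checking this against Model_2 (the result should match AIC(housing_price_nobed_m2) below, about 789.14):

#Sketch: manual AIC for Model_2
ll <- logLik(housing_price_nobed_m2)
k <- attr(ll, "df") #5 parameters: intercept, S, Ba, New, and sigma^2
2 * k - 2 * as.numeric(ll)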

#AIC for all 4 models
AIC(all_housing_price_m1)           #model 1
[1] 790.6225
AIC(housing_price_nobed_m2)         #model 2
[1] 789.1366
AIC(housing_price_nobed_noba_m3)    #model 3
[1] 765.8876
AIC(housing_price_new_m4)           #model 4
[1] 800.1262
#BIC for all 4 models
BIC(all_housing_price_m1)           #model 1
[1] 805.8181
BIC(housing_price_nobed_m2)         #model 2
[1] 801.7996
BIC(housing_price_nobed_noba_m3)    #model 3
[1] 778.3866
BIC(housing_price_new_m4)           #model 4
[1] 810.2566

Comparing all four models on BIC, as we did with AIC, Model_3 again has the lowest value at 778.39, but it is computed on the reduced subset and is not directly comparable. Among the full-data models, Model_2, which includes Size, New, and Bath while omitting Bed, has the lowest BIC at 801.80.

E

Of all the models tested, my preferred choice for best fit is Model_2, which eliminates the Bed variable. Among the models fit to the full data, it consistently delivers the lowest AIC, BIC, and PRESS, and it has the highest adjusted R^2 at 0.8637. Model_3’s nominally lower AIC, BIC, and PRESS values are artifacts of the subset issue noted above, and its adjusted R^2 of 0.8607 is lower than Model_2’s.

Question 2

A

data("trees")
head(trees,10)
   Girth Height Volume
1    8.3     70   10.3
2    8.6     65   10.3
3    8.8     63   10.2
4   10.5     72   16.4
5   10.7     81   18.8
6   10.8     83   19.7
7   11.0     66   15.6
8   11.0     75   18.2
9   11.1     80   22.6
10  11.2     75   19.9
summary(trees)
     Girth           Height       Volume     
 Min.   : 8.30   Min.   :63   Min.   :10.20  
 1st Qu.:11.05   1st Qu.:72   1st Qu.:19.40  
 Median :12.90   Median :76   Median :24.20  
 Mean   :13.25   Mean   :76   Mean   :30.17  
 3rd Qu.:15.25   3rd Qu.:80   3rd Qu.:37.30  
 Max.   :20.60   Max.   :87   Max.   :77.00  

We rename the variable Girth to Diameter, since the instructions state that Girth is labeled erroneously.

library(dplyr)

Attaching package: 'dplyr'
The following object is masked from 'package:car':

    recode
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
trees_d<-trees %>% rename(Diameter=Girth)
head(trees_d)
  Diameter Height Volume
1      8.3     70   10.3
2      8.6     65   10.3
3      8.8     63   10.2
4     10.5     72   16.4
5     10.7     81   18.8
6     10.8     83   19.7

The model is now fitted using Volume as the outcome and Diameter and Height as explanatory variables.

#Fitting a multiple regression model with the Volume as outcome
Tree_Vol<-lm(Volume~Diameter+Height,data=trees_d)
summary(Tree_Vol)

Call:
lm(formula = Volume ~ Diameter + Height, data = trees_d)

Residuals:
    Min      1Q  Median      3Q     Max 
-6.4065 -2.6493 -0.2876  2.2003  8.4847 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -57.9877     8.6382  -6.713 2.75e-07 ***
Diameter      4.7082     0.2643  17.816  < 2e-16 ***
Height        0.3393     0.1302   2.607   0.0145 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.882 on 28 degrees of freedom
Multiple R-squared:  0.948, Adjusted R-squared:  0.9442 
F-statistic:   255 on 2 and 28 DF,  p-value: < 2.2e-16

Diameter and Height both have small p-values, indicating statistical significance. This is reasonable, as both should predict tree volume. The model also fits very well: R^2 is close to 1 at 0.948, and the adjusted R^2 is likewise high at 0.9442.

B

par(mfrow = c(2,2))
plot(Tree_Vol,pch=18,col="blue",which = 1:6)

Fitted vs. Residuals

This plot is used to identify whether or not the residuals exhibit non-linear patterns. If the red line across the center of the plot is approximately horizontal, the residuals are consistent with a linear relationship. For the trees data, the red line deviates from a perfectly horizontal line, curving across the plot. The residuals therefore follow a non-linear pattern, suggesting that the linearity assumption is violated and that a non-linear model may be more appropriate for this dataset.

Normal Q-Q

This plot is used to assess whether the regression model’s residuals are normally distributed. If the points lie approximately along the straight diagonal line, we may infer that the residuals are normally distributed. Here the points fall nearly along the line; a handful of observations deviate somewhat toward the top, but not enough to declare the residuals non-normal.

Scale-Location

This plot checks the assumption of equal variance, often called “homoscedasticity,” among the residuals of our regression model. The assumption is likely satisfied if the red line is roughly horizontal across the plot. In this case the red line is not horizontal, suggesting that the equal-variance assumption may be violated. A formal check appears below.
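
As an optional formal complement to the visual check (a sketch, output not shown; car is already attached, as the dplyr masking message above indicates), the score test for non-constant variance can be run:

#Sketch: formal test of constant variance
ncvTest(Tree_Vol) #a small p-value indicates heteroscedasticity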

The Cook’s Distance

The Cook’s Distance plot estimates the influence of each data point, taking into account both its leverage and its residual. Cook’s Distance summarizes how much the fitted regression model changes when the i-th observation is removed. Here it is computed for each of the n = 31 cases. The most influential observations are 31 (0.605), 18 (0.178), and 3 (0.167), as demonstrated below with the cooks.distance() function. Several observations in the data set are therefore highly influential.

Using cooks.distance() to identify the influential observations (flagging values greater than three times the mean Cook’s Distance, a common rule of thumb):

cooks_D <- cooks.distance(Tree_Vol)
influential <- cooks_D[(cooks_D > (3 * mean(cooks_D, na.rm = TRUE)))]
influential
        3        18        31 
0.1673192 0.1775359 0.6052326 

Leverage vs. Residuals

This plot is intended to highlight influential observations, that is, those with high leverage. If any points fall outside the dashed Cook’s distance contours, they are influential observations. Influence is evident in this case, as shown by several highly leveraged observations; the hat-value check below lists them directly.
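
High-leverage points can also be listed directly from the hat values (a sketch using the common 2*mean rule of thumb; output not shown):

#Sketch: flag high-leverage observations
h <- hatvalues(Tree_Vol)
h[h > 2 * mean(h)]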

Leverage versus Cook’s Distance

Like the Cook’s Distance plot, this plot points to highly influential data points, particularly observations 3, 18, and 31.

Question 3

A

library(alr4) #the florida dataset comes from the alr4 package
data(florida)
head(florida,10)
            Gore   Bush Buchanan
ALACHUA    47300  34062      262
BAKER       2392   5610       73
BAY        18850  38637      248
BRADFORD    3072   5413       65
BREVARD    97318 115185      570
BROWARD   386518 177279      789
CALHOUN     2155   2873       90
CHARLOTTE  29641  35419      182
CITRUS     25501  29744      270
CLAY       14630  41745      186
summary(florida)
      Gore             Bush           Buchanan     
 Min.   :   788   Min.   :  1316   Min.   :   9.0  
 1st Qu.:  3055   1st Qu.:  4746   1st Qu.:  46.5  
 Median : 14152   Median : 20196   Median : 114.0  
 Mean   : 43341   Mean   : 43356   Mean   : 258.5  
 3rd Qu.: 45974   3rd Qu.: 56542   3rd Qu.: 285.5  
 Max.   :386518   Max.   :289456   Max.   :3407.0  
str(florida)
'data.frame':   67 obs. of  3 variables:
 $ Gore    : int  47300 2392 18850 3072 97318 386518 2155 29641 25501 14630 ...
 $ Bush    : int  34062 5610 38637 5413 115185 177279 2873 35419 29744 41745 ...
 $ Buchanan: int  262 73 248 65 570 789 90 182 270 186 ...
florida_vote <- lm(Buchanan ~ Bush, data = florida)
summary(florida_vote)

Call:
lm(formula = Buchanan ~ Bush, data = florida)

Residuals:
    Min      1Q  Median      3Q     Max 
-907.50  -46.10  -29.19   12.26 2610.19 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 4.529e+01  5.448e+01   0.831    0.409    
Bush        4.917e-03  7.644e-04   6.432 1.73e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 353.9 on 65 degrees of freedom
Multiple R-squared:  0.3889,    Adjusted R-squared:  0.3795 
F-statistic: 41.37 on 1 and 65 DF,  p-value: 1.727e-08

We first observe that Bush votes are a statistically significant predictor, with a p-value of 1.73e-08. We also see that the R^2 (0.389) and adjusted R^2 (0.380) are low, indicating a less-than-ideal fit.

library(alr4)
data(florida)
fl_voting_model <- lm(Buchanan~Bush,data=florida)
par(mfrow=c(2,3))
plot(fl_voting_model,which=1:6)

Palm Beach County is an obvious outlier. Next, let us see whether, with a better model, Palm Beach County becomes less of an outlier.
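
One way to confirm which county produces the extreme residual (a sketch, output not shown; the row labels are county names, as in head(florida) above):

#Sketch: the counties with the largest standardized residuals
head(sort(rstandard(fl_voting_model), decreasing = TRUE), 3)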

B

data(florida)
fl_voting_model_log <- lm(log(Buchanan)~log(Bush),data=florida)
par(mfrow=c(2,3))
plot(fl_voting_model_log,which=1:6)

Here, after log-transforming both variables, Palm Beach County has become less of an outlier.
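
To quantify the improvement (a sketch, output not shown; it assumes the row label is "PALM BEACH"), the standardized residual for Palm Beach County can be compared across the two models:

#Sketch: Palm Beach's standardized residual before and after the log transform
rstandard(fl_voting_model)["PALM BEACH"]
rstandard(fl_voting_model_log)["PALM BEACH"]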