Homework5_Kaushika Potluri

Published

November 9, 2022

library(tidyverse)
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.4.0      ✔ purrr   1.0.0 
✔ tibble  3.1.8      ✔ dplyr   1.0.10
✔ tidyr   1.2.1      ✔ stringr 1.5.0 
✔ readr   2.1.3      ✔ forcats 0.5.2 
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
library(MPV)
Error in library(MPV): there is no package called 'MPV'
library(alr4)
Loading required package: car
Loading required package: carData

Attaching package: 'car'

The following object is masked from 'package:dplyr':

    recode

The following object is masked from 'package:purrr':

    some

Loading required package: effects
lattice theme set by effectsTheme()
See ?effectsTheme for details.
library(smss)
knitr::opts_chunk$set(echo = TRUE)
data(house.selling.price.2)
house.selling.price.2
       P    S Be Ba New
1   48.5 1.10  3  1   0
2   55.0 1.01  3  2   0
3   68.0 1.45  3  2   0
4  137.0 2.40  3  3   0
5  309.4 3.30  4  3   1
6   17.5 0.40  1  1   0
7   19.6 1.28  3  1   0
8   24.5 0.74  3  1   0
9   34.8 0.78  2  1   0
10  32.0 0.97  3  1   0
11  28.0 0.84  3  1   0
12  49.9 1.08  2  2   0
13  59.9 0.99  2  1   0
14  61.5 1.01  3  2   0
15  60.0 1.34  3  2   0
16  65.9 1.22  3  1   0
17  67.9 1.28  3  2   0
18  68.9 1.29  3  2   0
19  69.9 1.52  3  2   0
20  70.5 1.25  3  2   0
21  72.9 1.28  3  2   0
22  72.5 1.28  3  1   0
23  72.0 1.36  3  2   0
24  71.0 1.20  3  2   0
25  76.0 1.46  3  2   0
26  72.9 1.56  4  2   0
27  73.0 1.22  3  2   0
28  70.0 1.40  2  2   0
29  76.0 1.15  2  2   0
30  69.0 1.74  3  2   0
31  75.5 1.62  3  2   0
32  76.0 1.66  3  2   0
33  81.8 1.33  3  2   0
34  84.5 1.34  3  2   0
35  83.5 1.40  3  2   0
36  86.0 1.15  2  2   1
37  86.9 1.58  3  2   1
38  86.9 1.58  3  2   1
39  86.9 1.58  3  2   1
40  87.9 1.71  3  2   0
41  88.1 2.10  3  2   0
42  85.9 1.27  3  2   0
43  89.5 1.34  3  2   0
44  87.4 1.25  3  2   0
45  87.9 1.68  3  2   0
46  88.0 1.55  3  2   0
47  90.0 1.55  3  2   0
48  96.0 1.36  3  2   1
49  99.9 1.51  3  2   1
50  95.5 1.54  3  2   1
51  98.5 1.51  3  2   0
52 100.1 1.85  3  2   0
53  99.9 1.62  4  2   1
54 101.9 1.40  3  2   1
55 101.9 1.92  4  2   0
56 102.3 1.42  3  2   1
57 110.8 1.56  3  2   1
58 105.0 1.43  3  2   1
59  97.9 2.00  3  2   0
60 106.3 1.45  3  2   1
61 106.5 1.65  3  2   0
62 116.0 1.72  4  2   1
63 108.0 1.79  4  2   1
64 107.5 1.85  3  2   0
65 109.9 2.06  4  2   1
66 110.0 1.76  4  2   0
67 120.0 1.62  3  2   1
68 115.0 1.80  4  2   1
69 113.4 1.98  3  2   0
70 114.9 1.57  3  2   0
71 115.0 2.19  3  2   0
72 115.0 2.07  4  2   0
73 117.9 1.99  4  2   0
74 110.0 1.55  3  2   0
75 115.0 1.67  3  2   0
76 124.0 2.40  4  2   0
77 129.9 1.79  4  2   1
78 124.0 1.89  3  2   0
79 128.0 1.88  3  2   1
80 132.4 2.00  4  2   1
81 139.3 2.05  4  2   1
82 139.3 2.00  4  2   1
83 139.7 2.03  3  2   1
84 142.0 2.12  3  3   0
85 141.3 2.08  4  2   1
86 147.5 2.19  4  2   0
87 142.5 2.40  4  2   0
88 148.0 2.40  5  2   0
89 149.0 3.05  4  2   0
90 150.0 2.04  3  3   0
91 172.9 2.25  4  2   1
92 190.0 2.57  4  3   1
93 280.0 3.85  4  3   0

A

For backward elimination, you fit a model using all candidate explanatory variables to predict the response. Then, one at a time, you delete the least significant explanatory variable in the model (the one with the largest p-value) and refit, stopping once every remaining variable is significant. In this example, we would delete Beds first, which has a p-value of 0.487.
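The first backward-elimination step can be sketched with `drop1()` from base R. This is a minimal sketch, assuming the smss package is installed; the object names `full`, `step1`, `pvals`, and `weakest` are illustrative and not part of the original analysis.

```r
# Assumes the smss package (source of house.selling.price.2) is installed.
library(smss)
data(house.selling.price.2)

# Start from the full model with every candidate predictor.
full <- lm(P ~ S + Be + Ba + New, data = house.selling.price.2)

# drop1() removes each term in turn and reports an F test for that term.
# For single-degree-of-freedom terms this matches the t test in summary().
step1 <- drop1(full, test = "F")
print(step1)

# Term with the largest p-value; the "<none>" row is NA and is skipped.
pvals <- setNames(step1[["Pr(>F)"]], rownames(step1))
weakest <- names(which.max(pvals))
weakest
```

Repeating the same call on the refitted model gives the next candidate for removal, if any term is still non-significant.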

B

With forward selection, you begin with no explanatory variables, then add one variable at a time to the model. At each step, the variable you add is the most significant of the remaining candidates, that is, the one with the lowest p-value. In this example, the first variable to add to the model is Size, given its extremely small p-value (< 2e-16).
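The first forward-selection step can be sketched with `add1()` from base R, which tests each candidate against the intercept-only model. This is an illustrative sketch, assuming the smss package is installed; `null`, `step1`, `fstats`, and `first_in` are hypothetical names.

```r
# Assumes the smss package (source of house.selling.price.2) is installed.
library(smss)
data(house.selling.price.2)

# Start from the intercept-only model.
null <- lm(P ~ 1, data = house.selling.price.2)

# add1() adds each candidate in turn and reports an F test for it.
step1 <- add1(null, scope = ~ S + Be + Ba + New, test = "F")
print(step1)

# Largest F statistic, equivalently the smallest p-value; "<none>" is NA and skipped.
fstats <- setNames(step1[["F value"]], rownames(step1))
first_in <- names(which.max(fstats))
first_in
```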

C

Although Beds is strongly correlated with price on its own, the relationship diminishes and is no longer significant (p = 0.487) once the other variables are included in the regression model. The other variables, which are likely related to Beds themselves, act as controls, so Beds adds little information beyond what they already explain.
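One way to check this reasoning is to compare the Beds coefficient with and without the other predictors, alongside the pairwise correlations. The sketch below assumes the smss package is installed; `hp`, `beds_only`, and `beds_adjusted` are illustrative names.

```r
# Assumes the smss package (source of house.selling.price.2) is installed.
library(smss)
data(house.selling.price.2)
hp <- house.selling.price.2

# Pairwise correlations: how strongly Beds tracks price and the other predictors.
round(cor(hp), 3)

# Coefficient row for Beds alone versus Beds adjusted for Size, Baths, and New.
beds_only     <- coef(summary(lm(P ~ Be, data = hp)))["Be", ]
beds_adjusted <- coef(summary(lm(P ~ S + Be + Ba + New, data = hp)))["Be", ]
rbind(beds_only, beds_adjusted)
```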

D

summary(lm(P ~ S, data = house.selling.price.2))

Call:
lm(formula = P ~ S, data = house.selling.price.2)

Residuals:
    Min      1Q  Median      3Q     Max 
-56.407 -10.656   2.126  11.412  85.091 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -25.194      6.688  -3.767 0.000293 ***
S             75.607      3.865  19.561  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 19.47 on 91 degrees of freedom
Multiple R-squared:  0.8079,    Adjusted R-squared:  0.8058 
F-statistic: 382.6 on 1 and 91 DF,  p-value: < 2.2e-16
summary(lm(P ~ S+New, data = house.selling.price.2))

Call:
lm(formula = P ~ S + New, data = house.selling.price.2)

Residuals:
    Min      1Q  Median      3Q     Max 
-47.207  -9.763  -0.091   9.984  76.405 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -26.089      5.977  -4.365 3.39e-05 ***
S             72.575      3.508  20.690  < 2e-16 ***
New           19.587      3.995   4.903 4.16e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 17.4 on 90 degrees of freedom
Multiple R-squared:  0.8484,    Adjusted R-squared:  0.845 
F-statistic: 251.8 on 2 and 90 DF,  p-value: < 2.2e-16
summary(lm(P ~ ., data = house.selling.price.2))

Call:
lm(formula = P ~ ., data = house.selling.price.2)

Residuals:
    Min      1Q  Median      3Q     Max 
-36.212  -9.546   1.277   9.406  71.953 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -41.795     12.104  -3.453 0.000855 ***
S             64.761      5.630  11.504  < 2e-16 ***
Be            -2.766      3.960  -0.698 0.486763    
Ba            19.203      5.650   3.399 0.001019 ** 
New           18.984      3.873   4.902  4.3e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 16.36 on 88 degrees of freedom
Multiple R-squared:  0.8689,    Adjusted R-squared:  0.8629 
F-statistic: 145.8 on 4 and 88 DF,  p-value: < 2.2e-16
summary(lm(P ~ . -Be, data = house.selling.price.2))

Call:
lm(formula = P ~ . - Be, data = house.selling.price.2)

Residuals:
    Min      1Q  Median      3Q     Max 
-34.804  -9.496   0.917   7.931  73.338 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -47.992      8.209  -5.847 8.15e-08 ***
S             62.263      4.335  14.363  < 2e-16 ***
Ba            20.072      5.495   3.653 0.000438 ***
New           18.371      3.761   4.885 4.54e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 16.31 on 89 degrees of freedom
Multiple R-squared:  0.8681,    Adjusted R-squared:  0.8637 
F-statistic: 195.3 on 3 and 89 DF,  p-value: < 2.2e-16
summary(lm(P ~ . -Be -Ba, data = house.selling.price.2))

Call:
lm(formula = P ~ . - Be - Ba, data = house.selling.price.2)

Residuals:
    Min      1Q  Median      3Q     Max 
-47.207  -9.763  -0.091   9.984  76.405 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -26.089      5.977  -4.365 3.39e-05 ***
S             72.575      3.508  20.690  < 2e-16 ***
New           19.587      3.995   4.903 4.16e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 17.4 on 90 degrees of freedom
Multiple R-squared:  0.8484,    Adjusted R-squared:  0.845 
F-statistic: 251.8 on 2 and 90 DF,  p-value: < 2.2e-16

a. R^2

As expected, the model with the most explanatory variables has the highest R-squared value at 0.8689. Therefore, if you were to select a model solely based on maximizing the R-squared value, it would be: ŷ = -41.79 + 64.76(Size) - 2.77(Beds) + 19.2(Baths) + 18.98(New).

b. Adjusted R^2

However, if you were to select a model based on adjusted R-squared, the best model for predicting selling price would exclude Beds and use Size, Baths, and New as explanatory variables. The adjusted R-squared value sees a slight increase when Beds is removed (from 0.8629 to 0.8637). The model would be: ŷ = -47.99 + 62.26(Size) + 20.07(Baths) + 18.37(New).
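The adjusted values can be reproduced from the reported R-squared values with the standard formula 1 - (1 - R²)(n - 1)/(n - p - 1), where n = 93 observations and p is the number of predictors. The helper name `adj_r2` below is hypothetical and used only for illustration.

```r
# Adjusted R-squared from R-squared, sample size n, and number of predictors p.
adj_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

# Inputs are the Multiple R-squared values printed in the summaries above.
full_adj   <- adj_r2(0.8689, n = 93, p = 4)  # Size, Beds, Baths, New
no_bed_adj <- adj_r2(0.8681, n = 93, p = 3)  # Size, Baths, New

round(c(full = full_adj, no_bed = no_bed_adj), 4)
```

Dropping Beds lowers R-squared slightly, but the smaller penalty for three predictors instead of four more than offsets that loss.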

c. PRESS

PRESS(lm(P ~ ., data = house.selling.price.2))
Error in PRESS(lm(P ~ ., data = house.selling.price.2)): could not find function "PRESS"
PRESS(lm(P ~ . -Be, data = house.selling.price.2))
Error in PRESS(lm(P ~ . - Be, data = house.selling.price.2)): could not find function "PRESS"

When considering PRESS, a smaller value indicates a better predictive model. The calls above failed because the MPV package, which provides PRESS(), was not installed, so the values themselves are not shown here. Comparing the model with all variables against the model excluding Beds, PRESS would lead us to select the model with Size, Baths, and New as variables for predicting selling price.
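PRESS can also be computed in base R without MPV, using the identity that each leave-one-out prediction error of a linear model equals the ordinary residual divided by one minus its leverage. The sketch below assumes the smss package is installed; `press_stat`, `full`, and `no_bed` are hypothetical names, not functions from any package.

```r
# Assumes the smss package (source of house.selling.price.2) is installed.
library(smss)
data(house.selling.price.2)

# PRESS = sum of squared leave-one-out prediction errors.
# For lm fits, each such error is e_i / (1 - h_ii), so no refitting is needed.
press_stat <- function(fit) {
  sum((residuals(fit) / (1 - hatvalues(fit)))^2)
}

full   <- lm(P ~ S + Be + Ba + New, data = house.selling.price.2)
no_bed <- lm(P ~ S + Ba + New, data = house.selling.price.2)

# The model with the smaller value predicts held-out observations better.
c(full = press_stat(full), no_bed = press_stat(no_bed))
```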

d. AIC
AIC(lm(P ~ ., data = house.selling.price.2))
[1] 790.6225
AIC(lm(P ~ . -Be, data = house.selling.price.2))
[1] 789.1366

When considering the AIC for both models, the value is slightly lower for the model that excludes Beds as a variable. Therefore, the AIC would lead us to use the model with Size, Baths, and New as explanatory variables to predict selling price.

e. BIC
BIC(lm(P ~ ., data = house.selling.price.2))
[1] 805.8181
BIC(lm(P ~ . -Be, data = house.selling.price.2))
[1] 801.7996

Lastly, like AIC, the BIC value is lower for the model that excludes Beds as a variable. Once again, we’d select the model that uses Size, Baths, and New as explanatory variables to predict selling price.
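The two information criteria can be placed side by side in one small table. BIC is the same quantity as AIC with the per-parameter penalty changed from 2 to log(n), which the last column checks. This is a sketch assuming the smss package is installed; `full`, `no_bed`, and `crit` are illustrative names.

```r
# Assumes the smss package (source of house.selling.price.2) is installed.
library(smss)
data(house.selling.price.2)

full   <- lm(P ~ S + Be + Ba + New, data = house.selling.price.2)
no_bed <- lm(P ~ S + Ba + New, data = house.selling.price.2)
n <- nobs(full)

# Lower is better for both criteria; BIC penalizes each parameter by log(n).
crit <- data.frame(
  model       = c("full", "no_bed"),
  AIC         = c(AIC(full), AIC(no_bed)),
  BIC         = c(BIC(full), BIC(no_bed)),
  BIC_via_AIC = c(AIC(full, k = log(n)), AIC(no_bed, k = log(n)))
)
crit
```

Because log(93) is larger than 2, BIC penalizes the extra Beds term more heavily than AIC does, which is why the gap between the two models is wider under BIC.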

E

Given the results from the various criteria above, the model I would prefer to use to predict selling price is the one that excludes Beds and includes Size, Baths, and New as variables: ŷ = -47.99 + 62.26(Size) + 20.07(Baths) + 18.37(New). Every criterion except R-squared indicates that this model is slightly stronger in its predictive power than the model that includes all variables, and R-squared cannot be used alone to determine model strength because it never decreases when a variable is added.

Question 2

data("trees")
trees
   Girth Height Volume
1    8.3     70   10.3
2    8.6     65   10.3
3    8.8     63   10.2
4   10.5     72   16.4
5   10.7     81   18.8
6   10.8     83   19.7
7   11.0     66   15.6
8   11.0     75   18.2
9   11.1     80   22.6
10  11.2     75   19.9
11  11.3     79   24.2
12  11.4     76   21.0
13  11.4     76   21.4
14  11.7     69   21.3
15  12.0     75   19.1
16  12.9     74   22.2
17  12.9     85   33.8
18  13.3     86   27.4
19  13.7     71   25.7
20  13.8     64   24.9
21  14.0     78   34.5
22  14.2     80   31.7
23  14.5     74   36.3
24  16.0     72   38.3
25  16.3     77   42.6
26  17.3     81   55.4
27  17.5     82   55.7
28  17.9     80   58.3
29  18.0     80   51.5
30  18.0     80   51.0
31  20.6     87   77.0

A

model <- lm(Volume ~ Girth + Height, data = trees)
summary(model)

Call:
lm(formula = Volume ~ Girth + Height, data = trees)

Residuals:
    Min      1Q  Median      3Q     Max 
-6.4065 -2.6493 -0.2876  2.2003  8.4847 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -57.9877     8.6382  -6.713 2.75e-07 ***
Girth         4.7082     0.2643  17.816  < 2e-16 ***
Height        0.3393     0.1302   2.607   0.0145 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.882 on 28 degrees of freedom
Multiple R-squared:  0.948, Adjusted R-squared:  0.9442 
F-statistic:   255 on 2 and 28 DF,  p-value: < 2.2e-16

B

par(mfrow = c(2, 3)); plot(model, which = 1:6)

Based on the residuals vs. fitted values plot, the central points appear to bounce roughly at random above and below 0, but the points at the lowest and highest fitted values have large residuals. The red line should lie flat along 0, but it is U-shaped. This curvature suggests a violation of the linearity assumption. From the normal Q-Q plot, it is difficult to say confidently that the normality assumption is violated. The points generally follow the reference line but rise above it at the upper end. The deviation is noteworthy, but the plot alone does not support a firm conclusion. In the scale-location plot, the line is not horizontal, which suggests a violation of the constant variance assumption. In the Cook’s distance plot, the 31st observation is above the threshold, meaning that this single observation has too much influence on the fit.
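The reading of the Cook’s distance plot can be checked numerically. The sketch below uses the built-in `trees` data, so it needs no extra packages. The 4/n cutoff is an assumption made for illustration, since the text does not name the threshold it used, and other conventions exist.

```r
# trees ships with base R (datasets package).
model <- lm(Volume ~ Girth + Height, data = trees)

# Cook's distance for every observation.
cd <- cooks.distance(model)

# Rule-of-thumb cutoff of 4/n; an assumption, not taken from the original text.
cutoff <- 4 / nobs(model)
flagged <- cd[cd > cutoff]
sort(flagged, decreasing = TRUE)

# Row label of the single most influential observation.
most_influential <- names(which.max(cd))
most_influential
```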

Question 3

data("florida")
florida
               Gore   Bush Buchanan
ALACHUA       47300  34062      262
BAKER          2392   5610       73
BAY           18850  38637      248
BRADFORD       3072   5413       65
BREVARD       97318 115185      570
BROWARD      386518 177279      789
CALHOUN        2155   2873       90
CHARLOTTE     29641  35419      182
CITRUS        25501  29744      270
CLAY          14630  41745      186
COLLIER       29905  60426      122
COLUMBIA       7047  10964       89
DADE         328702 289456      561
DE SOTO        3322   4256       36
DIXIE          1825   2698       29
DUVAL        107680 152082      650
ESCAMBIA      40958  73029      504
FLAGLER       13891  12608       83
FRANKLIN       2042   2448       33
GADSDEN        9565   4750       39
GILCHRIST      1910   3300       29
GLADES         1420   1840        9
GULF           2389   3546       71
HAMILTON       1718   2153       24
HARDEE         2341   3764       30
HENDRY         3239   4743       22
HERNANDO      32644  30646      242
HIGHLANDS     14152  20196       99
HILLSBOROUGH 166581 176967      836
HOLMES         2154   4985       76
INDIAN RIVER  19769  28627      105
JACKSON        6868   9138      102
JEFFERSON      3038   2481       29
LAFAYETTE       788   1669       10
LAKE          36555  49963      289
LEE           73560 106141      305
LEON          61425  39053      282
LEVY           5403   6860       67
LIBERTY        1011   1316       39
MADISON        3011   3038       29
MANATEE       49169  57948      272
MARION        44648  55135      563
MARTIN        26619  33864      108
MONROE        16483  16059       47
NASSAU         6952  16404       90
OKALOOSA      16924  52043      267
OKEECHOBEE     4588   5058       43
ORANGE       140115 134476      446
OSCEOLA       28177  26216      145
PALM BEACH   268945 152846     3407
PASCO         69550  68581      570
PINELLAS     199660 184312     1010
POLK          74977  90101      538
PUTNAM        12091  13439      147
ST. JOHNS     19482  39497      229
ST. LUCIE     41559  34705      124
SANTA ROSA    12795  36248      311
SARASOTA      72854  83100      305
SEMINOLE      58888  75293      194
SUMTER         9634  12126      114
SUWANNEE       4084   8014      108
TAYLOR         2647   4051       27
UNION          1399   2326       26
VOLUSIA       97063  82214      396
WAKULLA        3835   4511       46
WALTON         5637  12176      120
WASHINGTON     2796   4983       88

A

model <- lm(formula = Buchanan ~ Bush, data = florida)
summary(model)

Call:
lm(formula = Buchanan ~ Bush, data = florida)

Residuals:
    Min      1Q  Median      3Q     Max 
-907.50  -46.10  -29.19   12.26 2610.19 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 4.529e+01  5.448e+01   0.831    0.409    
Bush        4.917e-03  7.644e-04   6.432 1.73e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 353.9 on 65 degrees of freedom
Multiple R-squared:  0.3889,    Adjusted R-squared:  0.3795 
F-statistic: 41.37 on 1 and 65 DF,  p-value: 1.727e-08
par(mfrow = c(2, 3)); plot(model, which = 1:6)

Based on the diagnostic plots, Palm Beach County is an outlier. First, in the residuals vs. fitted plot, the Palm Beach County residual is very large. In the summary of the simple regression model, the third quartile of the residuals is 12.26, yet the maximum is 2610.19. A jump this large is indicative of an outlier. The normal Q-Q plot also indicates that the residuals are generally normal except for the Palm Beach County residual, which deviates greatly from the line. In the Cook’s distance plot, two points may be of concern if you follow the rule of flagging observations that score over 1: DADE and Palm Beach, at about 2. The residuals vs. leverage plot shows the Palm Beach County standardized residual beyond the dashed line marking Cook’s distance. This also suggests that the observation is an outlier with the potential to influence the regression model.
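The same diagnostics can be tabulated instead of read off the plots. This is a minimal sketch, assuming the alr4 package (source of `florida`) is installed; `fit` and `diag_tbl` are hypothetical names.

```r
# Assumes the alr4 package (source of the florida data) is installed.
library(alr4)
data(florida)

fit <- lm(Buchanan ~ Bush, data = florida)

# One row per county: standardized residual, leverage, and Cook's distance.
diag_tbl <- data.frame(
  std_resid = rstandard(fit),
  leverage  = hatvalues(fit),
  cooks_d   = cooks.distance(fit)
)

# Counties with the largest Cook's distance first.
head(diag_tbl[order(-diag_tbl$cooks_d), ], 3)

# County with the largest standardized residual in absolute value.
worst <- rownames(diag_tbl)[which.max(abs(diag_tbl$std_resid))]
worst
```

A county can rank high on Cook’s distance through a large residual, high leverage, or both, so the three columns are best read together.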

B

model <- lm(formula = log(Buchanan) ~ log(Bush), data = florida)
summary(model)

Call:
lm(formula = log(Buchanan) ~ log(Bush), data = florida)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.96075 -0.25949  0.01282  0.23826  1.66564 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -2.57712    0.38919  -6.622 8.04e-09 ***
log(Bush)    0.75772    0.03936  19.251  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4673 on 65 degrees of freedom
Multiple R-squared:  0.8508,    Adjusted R-squared:  0.8485 
F-statistic: 370.6 on 1 and 65 DF,  p-value: < 2.2e-16
par(mfrow = c(2, 3)); plot(model, which = 1:6)

Based on the diagnostic plots, Palm Beach County is still an outlier. First, in the residuals vs. fitted plot, the Palm Beach County residual is still very large. The normal Q-Q plot also indicates that the residuals are generally normal except for the Palm Beach County residual, which deviates greatly from the line. The Cook’s distance plot shows one point that may be of concern as an outlier if you follow the rule of flagging observations that score over 0.2: Palm Beach, at about 0.3. The residuals vs. leverage plot shows the Palm Beach County standardized residual beyond the dashed line marking Cook’s distance. This also suggests that the observation is an outlier with the potential to influence the regression model.
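The size of the Palm Beach residual on the log scale can be worked out directly from the printed coefficients and the county’s row in the data (Bush = 152846, Buchanan = 3407). The variable names below are illustrative, and the coefficients are rounded as printed, so the result matches the summary only to a few decimal places.

```r
# Coefficients as printed in the log-log model summary above.
b0 <- -2.57712   # intercept
b1 <-  0.75772   # slope on log(Bush)

# Palm Beach County row from the florida data.
bush_pb     <- 152846
buchanan_pb <- 3407

# Fitted value and residual on the log scale.
pred_log  <- b0 + b1 * log(bush_pb)
resid_log <- log(buchanan_pb) - pred_log
resid_log

# Back-transformed prediction and ratio of observed to predicted votes.
pred_votes <- exp(pred_log)
c(predicted = pred_votes, ratio = buchanan_pb / pred_votes)
```

The residual here corresponds to the maximum residual in the summary (1.66564), and exponentiating it gives the factor by which the observed Buchanan vote exceeds the model’s prediction.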