hw4

Categories

regression, prediction
Author

Caleb Hill

Published

November 12, 2022

Question 1

First, let’s load the relevant libraries and set all the graph themes to minimal.

Code
library(readxl)
library(tidyverse)
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.3.6      ✔ purrr   0.3.4 
✔ tibble  3.1.8      ✔ dplyr   1.0.10
✔ tidyr   1.2.1      ✔ stringr 1.4.1 
✔ readr   2.1.3      ✔ forcats 0.5.2 
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
Code
library(dplyr)
library(alr4)
Loading required package: car
Loading required package: carData

Attaching package: 'car'

The following object is masked from 'package:dplyr':

    recode

The following object is masked from 'package:purrr':

    some

Loading required package: effects
lattice theme set by effectsTheme()
See ?effectsTheme for details.
Code
library(smss)
theme_set(theme_minimal())

A

The prediction equation for the following three subsections is: ŷ = −10,536 + 53.8x₁ + 2.84x₂, where x₁ is home size (in square feet) and x₂ is lot size (in square feet).

Code
x <- -10536 + (53.8*1240) + (2.84*18000)  # predicted selling price
y <- 145000                               # actual selling price

y - x                                     # residual
[1] 37704

The predicted sale price is $107,296. That is a difference (residual) of $37,704: the house sold for $37,704 more than the model predicts.
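To make the same arithmetic reusable for other homes, the equation can be wrapped in a small helper (a sketch; the name predict_price is mine, and the coefficients come from the prediction equation above):

Code
# y-hat = -10536 + 53.8*(home sq ft) + 2.84*(lot sq ft)
predict_price <- function(home_size, lot_size) {
  -10536 + 53.8 * home_size + 2.84 * lot_size
}
predict_price(1240, 18000)           # predicted price: 107296
145000 - predict_price(1240, 18000)  # residual: 37704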

B

Code
x <- sum(-10536 + (53.8*500) + (2.84*1000))
y <- sum(-10536 + (53.8*501) + (2.84*1000))
y - x
[1] 53.8

For a fixed lot size, the predicted selling price increases by $53.80 for each one-square-foot increase in home size, because home size (in square feet) is multiplied by the coefficient 53.8.

C

Code
53.8/2.84
[1] 18.94366

Lot size would have to increase by almost 19 square feet to have the same impact as a one-square-foot increase in home size.
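As a sanity check, multiplying that ratio by the lot-size coefficient should recover the home-size coefficient, i.e. the same predicted-price impact:

Code
b1 <- 53.8      # home size coefficient
b2 <- 2.84      # lot size coefficient
b1 / b2         # lot-size increase equivalent to one sq ft of home size
b2 * (b1 / b2)  # = 53.8, matching a one sq ft increase in home size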

Question 2

Code
data(salary)
head(salary)
   degree rank    sex year ysdeg salary
1 Masters Prof   Male   25    35  36350
2 Masters Prof   Male   13    22  35350
3 Masters Prof   Male   10    23  28200
4 Masters Prof Female    7    27  26775
5     PhD Prof   Male   19    30  33696
6 Masters Prof   Male   16    21  28516

A

Code
t.test(salary ~ sex, salary)

    Welch Two Sample t-test

data:  salary by sex
t = 1.7744, df = 21.591, p-value = 0.09009
alternative hypothesis: true difference in means between group Male and group Female is not equal to 0
95 percent confidence interval:
 -567.8539 7247.1471
sample estimates:
  mean in group Male mean in group Female 
            24696.79             21357.14 

While the sample means differ by about $3,340 (men higher), the p-value (0.09) does not meet the 0.05 threshold for statistical significance, so we fail to reject the null hypothesis of no difference in mean salary by sex.
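As a cross-check, the same comparison can be run as a one-predictor regression; the coefficient on sex is the difference in group means (about $3,340). Note that lm() assumes equal variances, so its p-value differs slightly from the Welch test above:

Code
# salary ~ sex as a regression; the sex coefficient is the
# male-female difference in mean salary
summary(lm(salary ~ sex, data = salary))$coefficients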

B

Code
salary$sex <- relevel(salary$sex, ref = 1)
model1 <- lm(salary ~ sex + rank + degree + year + ysdeg, data = salary)
confint(model1, level = 0.95)
                 2.5 %      97.5 %
(Intercept) 14134.4059 17357.68946
sexFemale    -697.8183  3030.56452
rankAssoc    2985.4107  7599.31080
rankProf     8396.1546 13841.37340
degreePhD    -663.2482  3440.47485
year          285.1433   667.47476
ysdeg        -280.6397    31.49105

The 95% CI for the effect of female sex on salary runs from −697.82 to 3030.56.

Code
salary$sex <- relevel(salary$sex, ref = 2)

model_relevel_sex <- lm(salary ~ sex + rank + degree + year + ysdeg, data = salary)
confint(model_relevel_sex, level = 0.95)
                 2.5 %      97.5 %
(Intercept) 15268.0220 18556.81956
sexMale     -3030.5645   697.81832
rankAssoc    2985.4107  7599.31080
rankProf     8396.1546 13841.37340
degreePhD    -663.2482  3440.47485
year          285.1433   667.47476
ysdeg        -280.6397    31.49105

The 95% CI for the effect of male sex on salary runs from −3030.56 to 697.82.
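The two intervals are mirror images of each other, which can be confirmed by negating and reversing one of them:

Code
# negating the sexFemale interval reproduces the sexMale interval
-rev(confint(model1, level = 0.95)["sexFemale", ])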

C

Code
summary(model1)

Call:
lm(formula = salary ~ sex + rank + degree + year + ysdeg, data = salary)

Residuals:
    Min      1Q  Median      3Q     Max 
-4045.2 -1094.7  -361.5   813.2  9193.1 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 15746.05     800.18  19.678  < 2e-16 ***
sexFemale    1166.37     925.57   1.260    0.214    
rankAssoc    5292.36    1145.40   4.621 3.22e-05 ***
rankProf    11118.76    1351.77   8.225 1.62e-10 ***
degreePhD    1388.61    1018.75   1.363    0.180    
year          476.31      94.91   5.018 8.65e-06 ***
ysdeg        -124.57      77.49  -1.608    0.115    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2398 on 45 degrees of freedom
Multiple R-squared:  0.855, Adjusted R-squared:  0.8357 
F-statistic: 44.24 on 6 and 45 DF,  p-value: < 2.2e-16

Three predictors reach statistical significance: rankAssoc, rankProf, and year. sexFemale does not.

Year has the smallest estimate and standard error, roughly 5 to 10 times smaller than those of the two rank coefficients, yet its t-value is still the second highest after rankProf.

D

Code
salary$rank <- relevel(salary$rank, ref = 3)

model_relevel_rank <- lm(salary ~ rank + degree + sex + year + ysdeg, data = salary)
summary(model_relevel_rank)

Call:
lm(formula = salary ~ rank + degree + sex + year + ysdeg, data = salary)

Residuals:
    Min      1Q  Median      3Q     Max 
-4045.2 -1094.7  -361.5   813.2  9193.1 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  28031.18    1677.06  16.715  < 2e-16 ***
rankAsst    -11118.76    1351.77  -8.225 1.62e-10 ***
rankAssoc    -5826.40    1012.93  -5.752 7.28e-07 ***
degreePhD     1388.61    1018.75   1.363    0.180    
sexMale      -1166.37     925.57  -1.260    0.214    
year           476.31      94.91   5.018 8.65e-06 ***
ysdeg         -124.57      77.49  -1.608    0.115    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2398 on 45 degrees of freedom
Multiple R-squared:  0.855, Adjusted R-squared:  0.8357 
F-statistic: 44.24 on 6 and 45 DF,  p-value: < 2.2e-16

Releveling rank makes Prof the reference category, so the coefficients now show Asst and Assoc relative to Prof. Both are negative and statistically significant at the 0.001 level: assistant professors earn about $11,119 less and associate professors about $5,826 less than full professors, holding the other variables fixed.

E

Code
model2 <- lm(salary ~ degree + sex + year + ysdeg, data = salary)
summary(model2)

Call:
lm(formula = salary ~ degree + sex + year + ysdeg, data = salary)

Residuals:
    Min      1Q  Median      3Q     Max 
-8146.9 -2186.9  -491.5  2279.1 11186.6 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 15897.03    1259.87  12.618  < 2e-16 ***
degreePhD   -3299.35    1302.52  -2.533 0.014704 *  
sexMale      1286.54    1313.09   0.980 0.332209    
year          351.97     142.48   2.470 0.017185 *  
ysdeg         339.40      80.62   4.210 0.000114 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3744 on 47 degrees of freedom
Multiple R-squared:  0.6312,    Adjusted R-squared:  0.5998 
F-statistic: 20.11 on 4 and 47 DF,  p-value: 1.048e-09

Excluding rank, ysdeg becomes the strongest predictor: it has a small standard error and is the only coefficient significant at the 0.001 level. Notice also that its sign flips from negative to positive and degreePhD turns negative, which suggests rank was absorbing much of this variation.
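How much rank contributes can be quantified with a nested-model F-test, since model2 is simply model1 without rank (a sketch using the two models fitted above):

Code
# F-test for the joint contribution of the rank dummies
anova(model2, model1)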

F

Code
# ysdeg >= 15: hired before the new dean arrived (compare numerically, not as strings)
salary$dean <- ifelse(salary$ysdeg >= 15, "Old Dean", "New Dean")
table(salary$dean)

New Dean Old Dean 
      11       41 
Code
model3 <- lm(salary ~ dean, data = salary)
summary(model3)

Call:
lm(formula = salary ~ dean, data = salary)

Residuals:
    Min      1Q  Median      3Q     Max 
-9311.1 -4185.9  -573.6  3931.8 13383.9 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)     20580       1727  11.913 3.23e-16 ***
deanOld Dean     4082       1945   2.098    0.041 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5729 on 50 degrees of freedom
Multiple R-squared:  0.08091,   Adjusted R-squared:  0.06253 
F-statistic: 4.402 on 1 and 50 DF,  p-value: 0.04097

Those hired before the new dean arrived do show a statistically significant salary difference, though only at the 0.05 level. What is interesting is that the relationship is positive, which contradicts the belief that the new dean has been making more generous offers to new hires. We should add control variables, but carefully: rank, ysdeg, and year are all entangled with dean (longer-serving faculty were hired earlier, hold higher rank, and have more years since degree), so including them would introduce multicollinearity.
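Since the car package is already loaded, variance inflation factors offer a quick screen for that collinearity. As an illustration, deliberately adding ysdeg (the variable dean is derived from) alongside dean should inflate the VIFs (model_check here is a throwaway model for this check only):

Code
# dean is a function of ysdeg, so their VIFs should be elevated
model_check <- lm(salary ~ dean + ysdeg + degree + sex, data = salary)
car::vif(model_check)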

Code
model4 <- lm(salary ~ dean + degree + sex, data = salary)
summary(model4)

Call:
lm(formula = salary ~ dean + degree + sex, data = salary)

Residuals:
    Min      1Q  Median      3Q     Max 
-8919.1 -4457.4  -344.1  3386.5 15921.8 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)     18088       2159   8.379 5.92e-11 ***
deanOld Dean     4035       1921   2.100   0.0410 *  
degreePhD         345       1654   0.209   0.8357    
sexMale          3296       1768   1.864   0.0685 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5640 on 48 degrees of freedom
Multiple R-squared:  0.145, Adjusted R-squared:  0.09157 
F-statistic: 2.714 on 3 and 48 DF,  p-value: 0.05514

The residual standard error drops slightly, but everything else remains very similar.

Question 3

A

Code
data(house.selling.price)

model5 <- lm(Price ~ Size + New, house.selling.price)
summary(model5)

Call:
lm(formula = Price ~ Size + New, data = house.selling.price)

Residuals:
    Min      1Q  Median      3Q     Max 
-205102  -34374   -5778   18929  163866 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -40230.867  14696.140  -2.738  0.00737 ** 
Size           116.132      8.795  13.204  < 2e-16 ***
New          57736.283  18653.041   3.095  0.00257 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 53880 on 97 degrees of freedom
Multiple R-squared:  0.7226,    Adjusted R-squared:  0.7169 
F-statistic: 126.3 on 2 and 97 DF,  p-value: < 2.2e-16

Size is statistically significant at the 0.001 level, while New is significant at the 0.01 level. Size has a small standard error and a large t-value; New has a larger estimate and standard error and a lower t-value. Converting New to a factor (below) leaves the fit unchanged; the coefficient is simply relabeled New1.

Code
house.selling.price$New <- relevel(factor(house.selling.price$New), ref = 1)

model_relevel_new <- lm(Price ~ Size + New, house.selling.price)
summary(model_relevel_new)

Call:
lm(formula = Price ~ Size + New, data = house.selling.price)

Residuals:
    Min      1Q  Median      3Q     Max 
-205102  -34374   -5778   18929  163866 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -40230.867  14696.140  -2.738  0.00737 ** 
Size           116.132      8.795  13.204  < 2e-16 ***
New1         57736.283  18653.041   3.095  0.00257 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 53880 on 97 degrees of freedom
Multiple R-squared:  0.7226,    Adjusted R-squared:  0.7169 
F-statistic: 126.3 on 2 and 97 DF,  p-value: < 2.2e-16

B

The prediction equations, obtained by setting New = 1 or New = 0, are:

New: Predicted Price = −40,230 + 116.13(Size) + 57,736 = 17,505 + 116.13(Size)

Old: Predicted Price = −40,230 + 116.13(Size)
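The two intercepts can be read directly off the fitted model:

Code
b <- coef(model5)
b["(Intercept)"] + b["New"]  # intercept for new homes (about 17505)
b["(Intercept)"]             # intercept for old homes (about -40231)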

C

Code
-40230 + (116.13*3000) + 57736
[1] 365896
Code
-40230 + (116.13*3000)
[1] 308160

For a new house, the predicted selling price is $365,896; for an old house, it is $308,160. The difference is exactly the New coefficient, $57,736. (Being old does not subtract the coefficient; the New term simply drops out when New = 0.)
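The hand arithmetic can be cross-checked with predict(); model5 was fit while New was still a numeric 0/1 variable, so newdata uses numbers (small rounding differences from the hand calculation are expected):

Code
# predicted price for a 3000 sq ft house, new (New = 1) and old (New = 0)
predict(model5, newdata = data.frame(Size = 3000, New = c(1, 0)))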

D

Code
model6 <- lm(Price ~ Taxes + Size + New, house.selling.price)
summary(model6)

Call:
lm(formula = Price ~ Taxes + Size + New, data = house.selling.price)

Residuals:
    Min      1Q  Median      3Q     Max 
-165501  -25426    1449   20536  168747 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -21353.776  13311.487  -1.604  0.11196    
Taxes           37.231      6.735   5.528 2.78e-07 ***
Size            61.704     12.499   4.937 3.35e-06 ***
New1         46373.703  16459.019   2.818  0.00588 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 47170 on 96 degrees of freedom
Multiple R-squared:  0.7896,    Adjusted R-squared:  0.783 
F-statistic: 120.1 on 3 and 96 DF,  p-value: < 2.2e-16

E

Setting New = 1 or New = 0 in the model with taxes gives:

New: Predicted Price = −21,353 + 37.23(Taxes) + 61.70(Size) + 46,373.70

Old: Predicted Price = −21,353 + 37.23(Taxes) + 61.70(Size)

F

Code
-21353 + (37.23*1) + (61.70*3000) + 46373.70
[1] 210157.9
Code
-21353 + (37.23*1) + (61.70*3000)
[1] 163784.2

With Taxes held at $1, the predicted selling price for a new 3,000-square-foot house is $210,157.93 and for an old house $163,784.23; the gap equals the New coefficient, $46,373.70.
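This too can be cross-checked with predict(); note that New was converted to a factor before model6 was fit, so newdata must supply matching factor levels:

Code
# predicted prices for a new and an old house at Taxes = 1, Size = 3000
nd <- data.frame(Taxes = 1, Size = 3000,
                 New = factor(c(1, 0), levels = levels(house.selling.price$New)))
predict(model6, newdata = nd)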

G

Code
-21353 + (37.23*1) + (61.70*1500) + 46373.70
[1] 117607.9
Code
-21353 + (37.23*1) + (61.70*1500)
[1] 71234.23

At 1,500 square feet, the prediction is $117,607.93 for a new house and $71,234.23 for an old one. Each additional square foot of size adds $61.70 to the predicted price, for new and old houses alike (this model includes no Size-by-New interaction).

H

I prefer the model that includes Taxes. Adding it reduces both the residuals and the standard errors of the original variables, Size and New; smaller standard errors mean more precise predictions of a house's potential sale price. The adjusted R-squared is also higher (0.783 versus 0.717). For these reasons, I would include Taxes in the model.
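Beyond adjusted R-squared, an information criterion gives a compact summary of the trade-off between fit and model complexity (lower is better):

Code
# compare the two models; the one with the lower AIC is preferred
AIC(model5, model6)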