library(tidyverse)
library(alr4)
library(smss)
library(interactions)
::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE) knitr
Homework 4
Qn 1A
# predicted price.
<- -10536 + 53.8*1240 + 2.84*18000
yhat
# residual.
145000-yhat
[1] 37704
The model under-predicts by $37,704.
Qn 1B
It is predicted to increase by ~$53.80. This is because the effects of x1
and x2
on y
are independent of one another, and there is no interaction between them.
Qn 1C
53.8/2.84
[1] 18.94366
It would need to increase by ~19x.
Qn 2A
# load dataset.
data(salary)
# run model.
summary(lm(salary ~ sex, salary))
Call:
lm(formula = salary ~ sex, data = salary)
Residuals:
Min 1Q Median 3Q Max
-8602.8 -4296.6 -100.8 3513.1 16687.9
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 24697 938 26.330 <2e-16 ***
sexFemale -3340 1808 -1.847 0.0706 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 5782 on 50 degrees of freedom
Multiple R-squared: 0.0639, Adjusted R-squared: 0.04518
F-statistic: 3.413 on 1 and 50 DF, p-value: 0.0706
The mean salary for men and women does not differ, p = 0.07 (i.e., it is the same).
Qn 2B
# run model.
<- lm(salary ~ ., salary)
model2B summary(model2B)
Call:
lm(formula = salary ~ ., data = salary)
Residuals:
Min 1Q Median 3Q Max
-4045.2 -1094.7 -361.5 813.2 9193.1
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 15746.05 800.18 19.678 < 2e-16 ***
degreePhD 1388.61 1018.75 1.363 0.180
rankAssoc 5292.36 1145.40 4.621 3.22e-05 ***
rankProf 11118.76 1351.77 8.225 1.62e-10 ***
sexFemale 1166.37 925.57 1.260 0.214
year 476.31 94.91 5.018 8.65e-06 ***
ysdeg -124.57 77.49 -1.608 0.115
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2398 on 45 degrees of freedom
Multiple R-squared: 0.855, Adjusted R-squared: 0.8357
F-statistic: 44.24 on 6 and 45 DF, p-value: < 2.2e-16
# get confidence interval.
confint(model2B)
2.5 % 97.5 %
(Intercept) 14134.4059 17357.68946
degreePhD -663.2482 3440.47485
rankAssoc 2985.4107 7599.31080
rankProf 8396.1546 13841.37340
sexFemale -697.8183 3030.56452
year 285.1433 667.47476
ysdeg -280.6397 31.49105
Sex is still not a significant predictor of salary. The 95% CI for difference in salary between males and females is approx. -698 to 3031.
Qn 2C
rank
and year
are significant predictors of salary, while all others were not.
Both rank
and year
positively predict salary: Associate Professors and full Professors were likely to earn quite a bit more than Assistant Professors, while professors with more years in their current rank also earned more.
Looking at the magnitude of the coefficients, rank
has a greater impact on salary than year
does.
Qn 2D
# change baseline category.
$rank <- relevel(salary$rank, ref = 'Assoc')
salary
# re-run model.
<- lm(salary ~ ., salary)
model2D summary(model2D)
Call:
lm(formula = salary ~ ., data = salary)
Residuals:
Min 1Q Median 3Q Max
-4045.2 -1094.7 -361.5 813.2 9193.1
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 21038.41 1109.12 18.969 < 2e-16 ***
degreePhD 1388.61 1018.75 1.363 0.180
rankAsst -5292.36 1145.40 -4.621 3.22e-05 ***
rankProf 5826.40 1012.93 5.752 7.28e-07 ***
sexFemale 1166.37 925.57 1.260 0.214
year 476.31 94.91 5.018 8.65e-06 ***
ysdeg -124.57 77.49 -1.608 0.115
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2398 on 45 degrees of freedom
Multiple R-squared: 0.855, Adjusted R-squared: 0.8357
F-statistic: 44.24 on 6 and 45 DF, p-value: < 2.2e-16
We can see that rank
now both negatively and positively predicts salary
. Specifically, Assistant Professors earn less than Associate Professors do, at the same magnitude that was observed in model1
(approx. ~$5292).
Qn 2E
<- lm(salary ~ . - rank, salary)
model2E summary(model2E)
Call:
lm(formula = salary ~ . - rank, data = salary)
Residuals:
Min 1Q Median 3Q Max
-8146.9 -2186.9 -491.5 2279.1 11186.6
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 17183.57 1147.94 14.969 < 2e-16 ***
degreePhD -3299.35 1302.52 -2.533 0.014704 *
sexFemale -1286.54 1313.09 -0.980 0.332209
year 351.97 142.48 2.470 0.017185 *
ysdeg 339.40 80.62 4.210 0.000114 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3744 on 47 degrees of freedom
Multiple R-squared: 0.6312, Adjusted R-squared: 0.5998
F-statistic: 20.11 on 4 and 47 DF, p-value: 1.048e-09
Excluding rank
, both adjusted R2 and the overall F-statistic drop. degree
and ysdeg
are now significant as well. Curiously, those with a PhD earn less than those with an MS. As years since highest degree increases, so does salary.
Qn 2F
# create new variable.
<- mutate(salary, HireTime = case_when(ysdeg <= 15 ~ "1", ysdeg > 15 ~ "0"))
salary $HireTime <- as.numeric(salary$HireTime)
salary
# run correlation matrix to check for multicollinearity. it's important to do so especially because HireTime is derived from ysdeg, so we would expect them to be highly correlated.
<- salary %>% select(year, ysdeg, HireTime)
subset <- cor(subset, use="complete.obs") matrix
I would remove ysdeg
from the regression model since it is highly correlated with HireTime
, r = 0.84.
<- lm(salary ~ . - ysdeg, salary)
model2F summary(model2F)
Call:
lm(formula = salary ~ . - ysdeg, data = salary)
Residuals:
Min 1Q Median 3Q Max
-3403.3 -1387.0 -167.0 528.2 9233.8
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 18301.04 1301.36 14.063 < 2e-16 ***
degreePhD 818.93 797.48 1.027 0.3100
rankAsst -4972.66 997.17 -4.987 9.61e-06 ***
rankProf 6124.28 1028.58 5.954 3.65e-07 ***
sexFemale 907.14 840.54 1.079 0.2862
year 434.85 78.89 5.512 1.65e-06 ***
HireTime 2163.46 1072.04 2.018 0.0496 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2362 on 45 degrees of freedom
Multiple R-squared: 0.8594, Adjusted R-squared: 0.8407
F-statistic: 45.86 on 6 and 45 DF, p-value: < 2.2e-16
Yes, those hired by the new Dean are making more, p = .05.
Qn 3A
# load dataset.
data(house.selling.price)
# run model.
<- lm(Price ~ Size + New, house.selling.price)
model3A summary(model3A)
Call:
lm(formula = Price ~ Size + New, data = house.selling.price)
Residuals:
Min 1Q Median 3Q Max
-205102 -34374 -5778 18929 163866
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -40230.867 14696.140 -2.738 0.00737 **
Size 116.132 8.795 13.204 < 2e-16 ***
New 57736.283 18653.041 3.095 0.00257 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 53880 on 97 degrees of freedom
Multiple R-squared: 0.7226, Adjusted R-squared: 0.7169
F-statistic: 126.3 on 2 and 97 DF, p-value: < 2.2e-16
Both Size
and New
significantly positively predict selling price. As each predictor goes up by 1 unit, selling price rises by $116 and $57,736 respectively.
Qn 3B
# new homes.
<- house.selling.price %>% filter(New == 1)
new <- lm(Price ~ Size, new)
model3B1 summary(model3B1)
Call:
lm(formula = Price ~ Size, data = new)
Residuals:
Min 1Q Median 3Q Max
-78606 -16092 -987 20068 76140
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -100755.31 42513.73 -2.370 0.0419 *
Size 166.35 17.09 9.735 4.47e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 45500 on 9 degrees of freedom
Multiple R-squared: 0.9133, Adjusted R-squared: 0.9036
F-statistic: 94.76 on 1 and 9 DF, p-value: 4.474e-06
# old homes.
<- house.selling.price %>% filter(New == 0)
old <- lm(Price ~ Size, old)
model3B2 summary(model3B2)
Call:
lm(formula = Price ~ Size, data = old)
Residuals:
Min 1Q Median 3Q Max
-175748 -29155 -7297 14159 192519
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -22227.808 15708.186 -1.415 0.161
Size 104.438 9.538 10.950 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 52620 on 87 degrees of freedom
Multiple R-squared: 0.5795, Adjusted R-squared: 0.5747
F-statistic: 119.9 on 1 and 87 DF, p-value: < 2.2e-16
Size significantly positively predicts price for both new and old houses, but by a greater magnitude for new houses:
New: E(Price) = 166*Size - 100,755
- Adjusted R2 for the model is also much higher (0.90 vs. 0.57).
Old: E(Price) = 104*Size - 22,228
Qn 3C
New: E(Price) = 166*3000 - 100,755 = $397,245
Old: E(Price) = 104*3000 - 22,228 = $289,772
Qn 3D
<- lm(Price ~ Size*New, house.selling.price)
model3D summary(model3D)
Call:
lm(formula = Price ~ Size * New, data = house.selling.price)
Residuals:
Min 1Q Median 3Q Max
-175748 -28979 -6260 14693 192519
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -22227.808 15521.110 -1.432 0.15536
Size 104.438 9.424 11.082 < 2e-16 ***
New -78527.502 51007.642 -1.540 0.12697
Size:New 61.916 21.686 2.855 0.00527 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 52000 on 96 degrees of freedom
Multiple R-squared: 0.7443, Adjusted R-squared: 0.7363
F-statistic: 93.15 on 3 and 96 DF, p-value: < 2.2e-16
interact_plot(model3D, pred = Size, modx = New)
Size
and New
positively interact to affect Price
, such that New
amplifies the positive relationship between Size
and Price
.
Qn 3E
Referring to model3D
above:
New: -22,228 + 104*Size - 78,528 + 62*Size = 166*Size - 100,756
Old: 104*Size - 22,228
Qn 3F
New: 166*3000 -100,756 = $397,244
Old: 104*3000 - 22,228 = $289,772
Qn 3G
New: 166*1500 - 100,756 = $148,244
Old: 104*1500 - 22,228 = $133,772
As size of home goes up, the difference in predicted selling prices between old and new homes becomes larger.
Qn 3H
Questions 3F and 3G demonstrate that size
affects the relationship between new
and price
, which reveals some type of dependency between the predictors. Hence, the model with interaction represents the relationships between size
, new
and price
better.