Homework 4
Please check your answers against the solutions.
Load the necessary packages:
library(alr4)
library(smss)
library(magrittr)
Question 1
A
Let’s write a function that expresses the outcome as a function of \(x_1\) and \(x_2\):
y_hat <- function(x1, x2) {-10536 + 53.8*x1 + 2.84*x2}
The predicted selling price for a home of 1240 square feet on a lot of 18,000 square feet is:
predicted <- y_hat(x1 = 1240, x2 = 18000)
cat('$', predicted, sep = '')
$107296
The actual price is $145,000. Thus, the residual is:
cat(145000 - predicted)
37704
The residual is positive: the model underpredicts the selling price of this house by $37704, a substantial amount.
B
For a fixed lot size, the selling price is predicted to increase by $53.80 for each additional square foot of home size, because that is the coefficient of home size when lot size is also in the model, thus controlling for (holding fixed) the latter.
C
A one-square-foot increase in home size is associated with a price increase of $53.80, while a one-square-foot increase in lot size is associated with an increase of $2.84. To match the impact of a one-square-foot increase in home size (i.e., $53.80), lot size would therefore have to increase by 53.8/2.84 ≈ 18.94 square feet.
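As a quick arithmetic check in R:
53.8 / 2.84  # ~18.94 square feet of lot per square foot of home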
Question 2
Load the data:
data(salary)
A
Because the question says “without regard to any other variable but sex,” we can run a simple linear regression with sex as the only explanatory variable. This is equivalent to a two-sample t-test comparing mean salary for the Male and Female groups.
summary(lm(salary ~ sex, data = salary))
Call:
lm(formula = salary ~ sex, data = salary)
Residuals:
Min 1Q Median 3Q Max
-8602.8 -4296.6 -100.8 3513.1 16687.9
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 24697 938 26.330 <2e-16 ***
sexFemale -3340 1808 -1.847 0.0706 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 5782 on 50 degrees of freedom
Multiple R-squared: 0.0639, Adjusted R-squared: 0.04518
F-statistic: 3.413 on 1 and 50 DF, p-value: 0.0706
The coefficient of -3340 for the sexFemale variable suggests that female faculty make, on average, $3340 less than their male colleagues at this college; in other words, the difference in mean salary is $3340. The variable is statistically significant at the 10% level, but not at the more widely accepted 5% level.
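As a cross-check, the pooled two-sample t-test (equal variances, matching the regression’s assumption) should reproduce the same $3340 difference in means and the same p-value of 0.0706:
# pooled-variance t-test; equivalent to the simple regression above
t.test(salary ~ sex, data = salary, var.equal = TRUE)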
B
For the model with all predictors, here is what the confidence intervals look like:
lm(salary ~ ., data = salary) |>
confint()
2.5 % 97.5 %
(Intercept) 14134.4059 17357.68946
degreePhD -663.2482 3440.47485
rankAssoc 2985.4107 7599.31080
rankProf 8396.1546 13841.37340
sexFemale -697.8183 3030.56452
year 285.1433 667.47476
ysdeg -280.6397 31.49105
The 95% confidence interval for the sexFemale variable is approximately (-698, 3031). This suggests that, controlling for the other variables, female faculty earn somewhere between $698 less and $3031 more than male faculty.
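If only the sexFemale interval is needed, the parm argument of confint() selects it directly:
# extract just the row for the sexFemale dummy
confint(lm(salary ~ ., data = salary), parm = 'sexFemale')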
C
summary(lm(salary ~ ., data = salary))
Call:
lm(formula = salary ~ ., data = salary)
Residuals:
Min 1Q Median 3Q Max
-4045.2 -1094.7 -361.5 813.2 9193.1
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 15746.05 800.18 19.678 < 2e-16 ***
degreePhD 1388.61 1018.75 1.363 0.180
rankAssoc 5292.36 1145.40 4.621 3.22e-05 ***
rankProf 11118.76 1351.77 8.225 1.62e-10 ***
sexFemale 1166.37 925.57 1.260 0.214
year 476.31 94.91 5.018 8.65e-06 ***
ysdeg -124.57 77.49 -1.608 0.115
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2398 on 45 degrees of freedom
Multiple R-squared: 0.855, Adjusted R-squared: 0.8357
F-statistic: 44.24 on 6 and 45 DF, p-value: < 2.2e-16
degree: The degree variable is not statistically significant at any conventionally accepted level. It is a dummy variable, so the coefficient suggests that faculty with PhDs make $1389 more than those with master’s degrees, controlling for everything else.
rank: The model treats rank as a categorical variable, ignoring its order. The most common practice for ordered categorical variables like rank is to treat them either as a regular categorical variable or as a numeric variable. The latter is most defensible when the variable has many levels and/or the distance between levels can reasonably be thought of as equal. In this case, because there are only three levels (one more than a dummy variable has), it makes sense to treat rank as a regular categorical variable; the numeric alternative is sketched below for contrast.
Given this, the rankAssoc coefficient suggests that Associate Professors make $5292 more than Assistant Professors, the base (reference) level, and rankProf suggests that full Professors make $11119 more than Assistant Professors. Both variables are statistically significant.
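For contrast, here is what the numeric coding mentioned above would look like; as.numeric(rank) uses the underlying level codes (Asst = 1, Assoc = 2, Prof = 3) and so assumes equal spacing between ranks, which is why it is not used in the rest of this solution:
# hypothetical alternative: rank as a numeric variable with equal spacing
summary(lm(salary ~ degree + as.numeric(rank) + sex + year + ysdeg,
           data = salary))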
If we wanted to test for the statistical significance of the rank variable as a whole, rather than for the individual dummy variables, we would need to do a partial F-test to compare the model with all the variables to the one without any rank dummies. The easiest way to do this is:
fit1 <- lm(salary ~ ., data = salary)
fit2 <- lm(salary ~ . - rank, data = salary)
# see here: https://www.youtube.com/watch?v=G_obrpV70QQ
anova(fit1, fit2)
Analysis of Variance Table
Model 1: salary ~ degree + rank + sex + year + ysdeg
Model 2: salary ~ (degree + rank + sex + year + ysdeg) - rank
Res.Df RSS Df Sum of Sq F Pr(>F)
1 45 258858365
2 47 658649047 -2 -399790682 34.75 7.485e-10 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The very small p-value (Pr(>F)) suggests that the rank variable is significant as a whole.
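Alternatively, since alr4 loads the car package, car::Anova() reports a Type II F-test for each term; its rank row should match the partial F-test above:
# Type II F-tests for each term in the full model
car::Anova(fit1)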
sex: As we saw with the confidence intervals, this variable is no longer statistically significant at conventional levels. The coefficient suggests female faculty make $1166 more once everything else is controlled for, but interpreting the coefficient of an insignificant effect is not very meaningful.
year: This variable is statistically significant. It suggests that every additional year in current rank is associated with $476 more salary.
ysdeg: The variable is insignificant. The coefficient would suggest that every additional year since degree is associated with about $125 less salary.
D
salary$rank <- relevel(salary$rank, ref = 'Prof')
summary(lm(salary ~ ., data = salary))
Call:
lm(formula = salary ~ ., data = salary)
Residuals:
Min 1Q Median 3Q Max
-4045.2 -1094.7 -361.5 813.2 9193.1
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 26864.81 1375.29 19.534 < 2e-16 ***
degreePhD 1388.61 1018.75 1.363 0.180
rankAsst -11118.76 1351.77 -8.225 1.62e-10 ***
rankAssoc -5826.40 1012.93 -5.752 7.28e-07 ***
sexFemale 1166.37 925.57 1.260 0.214
year 476.31 94.91 5.018 8.65e-06 ***
ysdeg -124.57 77.49 -1.608 0.115
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2398 on 45 degrees of freedom
Multiple R-squared: 0.855, Adjusted R-squared: 0.8357
F-statistic: 44.24 on 6 and 45 DF, p-value: < 2.2e-16
The coefficients are now for the two dummy variables representing the two categories other than the reference category, which is now Prof. rankAsst being -11119 means Assistant Professors make $11119 less than full Professors, controlling for other variables; rankAssoc being -5826 means Associate Professors make $5826 less than full Professors, controlling for other variables. This is the same information as in the previous model, presented in a different way.
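A quick way to confirm the new reference level:
levels(salary$rank)  # 'Prof' now comes first, i.e., it is the reference level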
E
summary(lm(salary ~ . -rank, data = salary))
Call:
lm(formula = salary ~ . - rank, data = salary)
Residuals:
Min 1Q Median 3Q Max
-8146.9 -2186.9 -491.5 2279.1 11186.6
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 17183.57 1147.94 14.969 < 2e-16 ***
degreePhD -3299.35 1302.52 -2.533 0.014704 *
sexFemale -1286.54 1313.09 -0.980 0.332209
year 351.97 142.48 2.470 0.017185 *
ysdeg 339.40 80.62 4.210 0.000114 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3744 on 47 degrees of freedom
Multiple R-squared: 0.6312, Adjusted R-squared: 0.5998
F-statistic: 20.11 on 4 and 47 DF, p-value: 1.048e-09
Excluding rank does cause the sign of the sex variable to flip to negative, but it is still not statistically significant.
F
Based on the description, we need to create a new dummy variable, which I’ll call new_dean, from the ysdeg variable. To avoid multicollinearity, it would be a good idea not to include highly correlated variables together. Because we are creating one variable from another, it is likely that they are highly correlated. Let’s see if that’s the case.
salary$new_dean <- ifelse(salary$ysdeg <= 15, 1, 0)
cor.test(salary$new_dean, salary$ysdeg)
Pearson's product-moment correlation
data: salary$new_dean and salary$ysdeg
t = -11.101, df = 50, p-value = 4.263e-15
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.9074548 -0.7411040
sample estimates:
cor
-0.8434239
Yes, they are very highly (negatively) correlated. So, we’ll exclude ysdeg and include new_dean in its place, alongside the other control variables.
summary(lm(salary ~ . -ysdeg, data = salary))
Call:
lm(formula = salary ~ . - ysdeg, data = salary)
Residuals:
Min 1Q Median 3Q Max
-3403.3 -1387.0 -167.0 528.2 9233.8
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 24425.32 1107.52 22.054 < 2e-16 ***
degreePhD 818.93 797.48 1.027 0.3100
rankAsst -11096.95 1191.00 -9.317 4.54e-12 ***
rankAssoc -6124.28 1028.58 -5.954 3.65e-07 ***
sexFemale 907.14 840.54 1.079 0.2862
year 434.85 78.89 5.512 1.65e-06 ***
new_dean 2163.46 1072.04 2.018 0.0496 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2362 on 45 degrees of freedom
Multiple R-squared: 0.8594, Adjusted R-squared: 0.8407
F-statistic: 45.86 on 6 and 45 DF, p-value: < 2.2e-16
The model above does suggest that the new dean has been making more generous offers, since the faculty appointed under the new dean make about $2163 more, controlling for other variables. The variable is statistically significant at the 5% level.
Let’s see what would have happened if we included both variables:
summary(lm(salary ~ . , data = salary))
Call:
lm(formula = salary ~ ., data = salary)
Residuals:
Min 1Q Median 3Q Max
-3621.2 -1336.8 -271.6 530.1 9247.6
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 25179.14 1901.59 13.241 < 2e-16 ***
degreePhD 1135.00 1031.16 1.101 0.277
rankAsst -11411.45 1362.02 -8.378 1.16e-10 ***
rankAssoc -6177.44 1043.04 -5.923 4.39e-07 ***
sexFemale 1084.09 921.49 1.176 0.246
year 460.35 95.09 4.841 1.63e-05 ***
ysdeg -47.86 97.71 -0.490 0.627
new_dean 1749.09 1372.83 1.274 0.209
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2382 on 44 degrees of freedom
Multiple R-squared: 0.8602, Adjusted R-squared: 0.838
F-statistic: 38.68 on 7 and 44 DF, p-value: < 2.2e-16
Now neither ysdeg nor new_dean is statistically significant, because of the multicollinearity between them.
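A variance inflation factor check makes this visible; car::vif() comes with the car package loaded by alr4, and for factors like rank it reports generalized VIFs:
# expect inflated values for the highly correlated ysdeg and new_dean
car::vif(lm(salary ~ ., data = salary))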
Question 3
Load the data:
data(house.selling.price)
A
fit <- lm(Price ~ Size + New, data = house.selling.price)
summary(fit)
Call:
lm(formula = Price ~ Size + New, data = house.selling.price)
Residuals:
Min 1Q Median 3Q Max
-205102 -34374 -5778 18929 163866
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -40230.867 14696.140 -2.738 0.00737 **
Size 116.132 8.795 13.204 < 2e-16 ***
New 57736.283 18653.041 3.095 0.00257 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 53880 on 97 degrees of freedom
Multiple R-squared: 0.7226, Adjusted R-squared: 0.7169
F-statistic: 126.3 on 2 and 97 DF, p-value: < 2.2e-16
In the model, both the Size and the New variables are positively associated with Price, and both are statistically significant at the 5% level. A one-square-foot increase in the size of the house is associated with a $116 increase in price, controlling for whether the house is new. New houses are on average $57736 more expensive than old houses, controlling for size.
B
The interpretation part is a bit redundant, since it was done in part A.
The prediction equation is:
\[\mathbf{E}[\textrm{Price}] = -40230.867 + 116.132 * \textrm{size} + 57736.283 * \textrm{new}\]
For new homes, the New variable takes the value 1; for old homes, it takes 0.
So, for new homes, the equation is:
\[\mathbf{E}[\textrm{Price}] = -40230.867 + 116.132 * \textrm{size} + 57736.283 * 1 = 17505.42 + 116.132 * \textrm{size} \]
For old homes, the equation is:
\[\mathbf{E}[\textrm{Price}] = -40230.867 + 116.132 * \textrm{size} + 57736.283 * 0 = -40230.867 + 116.132 * \textrm{size}\]
C
We can create a data frame and use the predict() function to do the prediction.
data.frame(Size = c(3000, 3000), New = c(1, 0)) %>%
predict(fit, .)
1 2
365900.2 308163.9
A new home of 3000 square feet has a predicted price of $365900.2. An old home of 3000 square feet has a predicted price of $308163.9. (Note that the difference, $57736.3, is the coefficient of New.)
D
Model with interaction term:
fit_ia <- lm(Price ~ Size + New + Size * New, data = house.selling.price)
summary(fit_ia)
Call:
lm(formula = Price ~ Size + New + Size * New, data = house.selling.price)
Residuals:
Min 1Q Median 3Q Max
-175748 -28979 -6260 14693 192519
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -22227.808 15521.110 -1.432 0.15536
Size 104.438 9.424 11.082 < 2e-16 ***
New -78527.502 51007.642 -1.540 0.12697
Size:New 61.916 21.686 2.855 0.00527 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 52000 on 96 degrees of freedom
Multiple R-squared: 0.7443, Adjusted R-squared: 0.7363
F-statistic: 93.15 on 3 and 96 DF, p-value: < 2.2e-16
E
\[\mathbf{E}[\textrm{Price}] = -22227.808 + 104.438 * \textrm{Size} - 78527.502 * \textrm{New} + 61.916 * \textrm{Size} * \textrm{New}\]
Again, we follow the same logic from B, replacing New with 1 and 0:
For new homes:
\[\mathbf{E}[\textrm{Price}] = -22227.808 + 104.438 * \textrm{Size} - 78527.502 * 1 + 61.916 * \textrm{Size} * 1 = -100755.3 + 166.354 * \textrm{Size}\]
For old homes:
\[\mathbf{E}[\textrm{Price}] = -22227.808 + 104.438 * \textrm{Size} - 78527.502 * 0 + 61.916 * \textrm{Size} * 0 = -22227.808 + 104.438 * \textrm{Size}\]
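These new-home intercepts and slopes can be recovered directly from the fitted coefficients; a quick check:
b <- coef(fit_ia)
# new homes shift the intercept by the New term and the slope by Size:New
c(intercept_new = unname(b['(Intercept)'] + b['New']),
  slope_new     = unname(b['Size'] + b['Size:New']))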
F
Using predict() again:
data.frame(Size = c(3000, 3000), New = c(1, 0)) %>%
predict(fit_ia, .)
1 2
398307.5 291087.4
A new home of 3000 square feet has a predicted price of $398307.5. An old home of 3000 square feet has a predicted price of $291087.4. The difference is $107220.1.
G
data.frame(Size = c(1500, 1500), New = c(1, 0)) %>%
predict(fit_ia, .)
1 2
148776.1 134429.8
A new home of 1500 square feet has a predicted price of $148776.1. An old home of 1500 square feet has a predicted price of $134429.8. The difference is $14346.3.
The difference between new and old home prices is much larger when the home itself is larger: for 3000 sq ft homes the difference is $107220.1, as opposed to $14346.3 for 1500 sq ft homes. This is consistent with the positive coefficient on the interaction term.
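Equivalently, the new-vs-old gap at a given size is the New coefficient plus the interaction coefficient times Size, which reproduces both differences:
b <- coef(fit_ia)
b['New'] + b['Size:New'] * c(1500, 3000)  # ~14346.3 and ~107220.1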
H
I prefer the model with the interaction term, because (1) the interaction term is significant, and (2) the Adjusted R-squared is higher in the model with the interaction (i.e., despite the penalty for the additional term, we get a better fit). We could look into cross-validation / PRESS (which also end up favoring the interaction model; a minimal PRESS sketch follows), but those are not this homework’s topic.
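For the curious, PRESS can be computed without refitting via the standard leverage shortcut; the press() helper below is ad hoc, not from a package, and a lower PRESS indicates better leave-one-out performance:
# PRESS = sum of squared leave-one-out residuals, via the hat values
press <- function(model) sum((residuals(model) / (1 - hatvalues(model)))^2)
press(fit)     # model without the interaction
press(fit_ia)  # model with the interaction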
summary(fit)
Call:
lm(formula = Price ~ Size + New, data = house.selling.price)
Residuals:
Min 1Q Median 3Q Max
-205102 -34374 -5778 18929 163866
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -40230.867 14696.140 -2.738 0.00737 **
Size 116.132 8.795 13.204 < 2e-16 ***
New 57736.283 18653.041 3.095 0.00257 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 53880 on 97 degrees of freedom
Multiple R-squared: 0.7226, Adjusted R-squared: 0.7169
F-statistic: 126.3 on 2 and 97 DF, p-value: < 2.2e-16
summary(fit_ia)
Call:
lm(formula = Price ~ Size + New + Size * New, data = house.selling.price)
Residuals:
Min 1Q Median 3Q Max
-175748 -28979 -6260 14693 192519
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -22227.808 15521.110 -1.432 0.15536
Size 104.438 9.424 11.082 < 2e-16 ***
New -78527.502 51007.642 -1.540 0.12697
Size:New 61.916 21.686 2.855 0.00527 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 52000 on 96 degrees of freedom
Multiple R-squared: 0.7443, Adjusted R-squared: 0.7363
F-statistic: 93.15 on 3 and 96 DF, p-value: < 2.2e-16