Homework 4

Solutions to the fourth homework
Author: Rosemary Pang

Published: April 27, 2023

Please check your answers against the solutions.

Load the necessary packages:

library(alr4)
library(smss)
library(magrittr)

Question 1

A

Let’s write a function that expresses the outcome as a function of \(x_1\) and \(x_2\):

y_hat <- function(x1, x2) {-10536 + 53.8*x1 + 2.84*x2}

The predicted selling price for a home of 1240 square feet on a lot of 18,000 square feet is:

predicted <- y_hat(x1 = 1240, x2 = 18000)
cat('$', predicted, sep = '')
$107296

The actual price is $145,000. Thus, the residual is:

cat(145000 - predicted)
37704

The residual is positive. The model underpredicts the selling price for this house, arguably by a lot.

B

For a fixed lot size, the selling price is predicted to increase by $53.8 for each additional square foot of home size, because 53.8 is the coefficient on home size when lot size is also in the model, so lot size is controlled for (held fixed).
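We can double-check this with the y_hat() function from part A: with lot size held fixed, increasing home size by one square foot changes the prediction by exactly the coefficient.

# Difference in predictions when home size goes up by 1 sq ft at a fixed lot size
y_hat(x1 = 1241, x2 = 18000) - y_hat(x1 = 1240, x2 = 18000)

This returns 53.8, as expected.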

C

A one-square-foot increase in home size is associated with a $53.8 increase in price, and a one-square-foot increase in lot size is associated with a $2.84 increase. To match the impact of a one-square-foot increase in home size, which is $53.8, lot size would have to increase by 53.8/2.84, or about 18.94 square feet.
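The ratio can be computed directly:

# Lot-size increase needed to match the effect of one extra sq ft of home size
53.8 / 2.84

which comes out to about 18.94.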

Question 2

Load the data:

data(salary)

A

Because the question says “without regard to any other variable but sex,” we can run a simple linear regression model with sex as the only explanatory variable. This is equivalent to a two-sample t-test comparing mean salary between the Male and Female groups.

summary(lm(salary ~ sex, data = salary))

Call:
lm(formula = salary ~ sex, data = salary)

Residuals:
    Min      1Q  Median      3Q     Max 
-8602.8 -4296.6  -100.8  3513.1 16687.9 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)    24697        938  26.330   <2e-16 ***
sexFemale      -3340       1808  -1.847   0.0706 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5782 on 50 degrees of freedom
Multiple R-squared:  0.0639,    Adjusted R-squared:  0.04518 
F-statistic: 3.413 on 1 and 50 DF,  p-value: 0.0706

The coefficient of -3340 for the sexFemale variable suggests that female faculty make on average $3340 less than their male colleagues at this college. In other words, the difference in group means is $3340. The variable is statistically significant at the 10% level, but not at the more widely accepted 5% level.
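As a check on the equivalence claim above, here is a sketch of the pooled two-sample t-test (not part of the original output); its t statistic and p-value match the regression’s, up to the sign of t.

# Pooled (equal-variance) two-sample t-test of salary by sex;
# |t| = 1.847 and p = 0.0706, matching the lm() output above
t.test(salary ~ sex, data = salary, var.equal = TRUE)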

B

For the model with all predictors, here is what the confidence intervals look like:

lm(salary ~ ., data = salary) |>
  confint() 
                 2.5 %      97.5 %
(Intercept) 14134.4059 17357.68946
degreePhD    -663.2482  3440.47485
rankAssoc    2985.4107  7599.31080
rankProf     8396.1546 13841.37340
sexFemale    -697.8183  3030.56452
year          285.1433   667.47476
ysdeg        -280.6397    31.49105

The 95% confidence interval for the sexFemale variable is (-697, 3031). This suggests anywhere from $697 less to $3031 more salary for female faculty relative to male faculty, controlling for the other variables.
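If we only care about that one row, we can extract it directly (a minor convenience, not shown in the original output):

# 95% confidence interval for the sexFemale coefficient only
confint(lm(salary ~ ., data = salary))["sexFemale", ]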

C

summary(lm(salary ~ ., data = salary))

Call:
lm(formula = salary ~ ., data = salary)

Residuals:
    Min      1Q  Median      3Q     Max 
-4045.2 -1094.7  -361.5   813.2  9193.1 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 15746.05     800.18  19.678  < 2e-16 ***
degreePhD    1388.61    1018.75   1.363    0.180    
rankAssoc    5292.36    1145.40   4.621 3.22e-05 ***
rankProf    11118.76    1351.77   8.225 1.62e-10 ***
sexFemale    1166.37     925.57   1.260    0.214    
year          476.31      94.91   5.018 8.65e-06 ***
ysdeg        -124.57      77.49  -1.608    0.115    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2398 on 45 degrees of freedom
Multiple R-squared:  0.855, Adjusted R-squared:  0.8357 
F-statistic: 44.24 on 6 and 45 DF,  p-value: < 2.2e-16
  • degree: The degree variable is not statistically significant at any conventionally accepted level. It is a dummy variable, so the coefficient suggests that PhDs make $1389 more than those with Master’s degrees, controlling for everything else.

  • rank: The model treats rank as a categorical variable, ignoring its order. The most common practice for ordered categorical variables like rank is to treat them either as a regular categorical variable or as a numeric variable. The latter is most defensible when the variable has many levels and/or the distance between adjacent levels can reasonably be thought of as equal. In this case, because there are only three levels (one more than a dummy variable has), it makes sense to treat rank as a regular categorical variable.

Given this, the rankAssoc category suggests that Associate Professors make $5292 more than Assistant Professors, the base (reference) level. rankProf suggests full professors make $11118 more than Assistant Professors. Both variables are statistically significant.

If we wanted to test for the statistical significance of the rank variable as a whole, rather than for the individual dummy variables, we would need to do a partial F-test to compare the model with all the variables to the one without any rank dummies. The easiest way to do this is:

fit1 <- lm(salary ~ ., data = salary)
fit2 <- lm(salary ~ . -rank, data = salary)
# see here: https://www.youtube.com/watch?v=G_obrpV70QQ
anova(fit1, fit2)
Analysis of Variance Table

Model 1: salary ~ degree + rank + sex + year + ysdeg
Model 2: salary ~ (degree + rank + sex + year + ysdeg) - rank
  Res.Df       RSS Df  Sum of Sq     F    Pr(>F)    
1     45 258858365                                  
2     47 658649047 -2 -399790682 34.75 7.485e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The very small p-value (Pr(>F)) suggests that the rank variable is significant as a whole.
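For reference, the F statistic can be reproduced by hand from the residual sums of squares in the table above:

\[F = \frac{(658649047 - 258858365)/2}{258858365/45} \approx 34.75\]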

  • sex: As we saw with the confidence intervals, this variable is now not statistically significant at conventional levels. The coefficient suggests female faculty make $1166 more, controlling for everything else, but interpreting a coefficient when the effect is insignificant is not very meaningful.

  • year: This variable is statistically significant. It suggests that every additional year in current rank is associated with $476 more salary.

  • ysdeg: The variable is insignificant. The coefficient would suggest that every additional year since the highest degree is associated with $124 less salary.

D

salary$rank <- relevel(salary$rank, ref = 'Prof')
summary(lm(salary ~ ., data = salary))

Call:
lm(formula = salary ~ ., data = salary)

Residuals:
    Min      1Q  Median      3Q     Max 
-4045.2 -1094.7  -361.5   813.2  9193.1 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  26864.81    1375.29  19.534  < 2e-16 ***
degreePhD     1388.61    1018.75   1.363    0.180    
rankAsst    -11118.76    1351.77  -8.225 1.62e-10 ***
rankAssoc    -5826.40    1012.93  -5.752 7.28e-07 ***
sexFemale     1166.37     925.57   1.260    0.214    
year           476.31      94.91   5.018 8.65e-06 ***
ysdeg         -124.57      77.49  -1.608    0.115    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2398 on 45 degrees of freedom
Multiple R-squared:  0.855, Adjusted R-squared:  0.8357 
F-statistic: 44.24 on 6 and 45 DF,  p-value: < 2.2e-16

The coefficients are now for the two dummy variables that represent the two categories other than the reference category, which is now Prof. rankAsst being -11118 means Assistant Professors make $11118 less than Full Professors, controlling for other variables; rankAssoc being -5826 means Associate Professors make $5826 less than Full Professors, controlling for other variables. The same information as in the previous model is presented in a different way.
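As a quick consistency check, the new rankAssoc coefficient is just the difference between the two rank coefficients from part C:

\[5292.36 - 11118.76 = -5826.40\]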

E

summary(lm(salary ~ . -rank, data = salary))

Call:
lm(formula = salary ~ . - rank, data = salary)

Residuals:
    Min      1Q  Median      3Q     Max 
-8146.9 -2186.9  -491.5  2279.1 11186.6 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 17183.57    1147.94  14.969  < 2e-16 ***
degreePhD   -3299.35    1302.52  -2.533 0.014704 *  
sexFemale   -1286.54    1313.09  -0.980 0.332209    
year          351.97     142.48   2.470 0.017185 *  
ysdeg         339.40      80.62   4.210 0.000114 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3744 on 47 degrees of freedom
Multiple R-squared:  0.6312,    Adjusted R-squared:  0.5998 
F-statistic: 20.11 on 4 and 47 DF,  p-value: 1.048e-09

Excluding rank does cause the sign of the sex variable to flip to negative, but it is still not statistically significant.

F

Based on the description, we would need to create a new dummy variable, which I’ll call new_dean, from the ysdeg variable. To avoid multicollinearity, it would be a good idea to not include highly correlated variables. Because we are creating one variable from another, it is likely that they are highly correlated. Let’s see if that’s the case.

salary$new_dean <- ifelse(salary$ysdeg <= 15, 1, 0)
cor.test(salary$new_dean, salary$ysdeg)

    Pearson's product-moment correlation

data:  salary$new_dean and salary$ysdeg
t = -11.101, df = 50, p-value = 4.263e-15
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.9074548 -0.7411040
sample estimates:
       cor 
-0.8434239 

Yes, they are very highly (negatively) correlated. So, we’ll exclude ysdeg and only include new_dean in its place, alongside other control variables.

summary(lm(salary ~ . -ysdeg, data = salary))

Call:
lm(formula = salary ~ . - ysdeg, data = salary)

Residuals:
    Min      1Q  Median      3Q     Max 
-3403.3 -1387.0  -167.0   528.2  9233.8 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  24425.32    1107.52  22.054  < 2e-16 ***
degreePhD      818.93     797.48   1.027   0.3100    
rankAsst    -11096.95    1191.00  -9.317 4.54e-12 ***
rankAssoc    -6124.28    1028.58  -5.954 3.65e-07 ***
sexFemale      907.14     840.54   1.079   0.2862    
year           434.85      78.89   5.512 1.65e-06 ***
new_dean      2163.46    1072.04   2.018   0.0496 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2362 on 45 degrees of freedom
Multiple R-squared:  0.8594,    Adjusted R-squared:  0.8407 
F-statistic: 45.86 on 6 and 45 DF,  p-value: < 2.2e-16

The model above does suggest that the new dean has been making more generous offers, since the faculty appointed under the new dean make about $2163 more, controlling for other variables. The variable is statistically significant at the 5% level.

Let’s see what would have happened if we included both variables:

summary(lm(salary ~ ., data = salary))

Call:
lm(formula = salary ~ ., data = salary)

Residuals:
    Min      1Q  Median      3Q     Max 
-3621.2 -1336.8  -271.6   530.1  9247.6 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  25179.14    1901.59  13.241  < 2e-16 ***
degreePhD     1135.00    1031.16   1.101    0.277    
rankAsst    -11411.45    1362.02  -8.378 1.16e-10 ***
rankAssoc    -6177.44    1043.04  -5.923 4.39e-07 ***
sexFemale     1084.09     921.49   1.176    0.246    
year           460.35      95.09   4.841 1.63e-05 ***
ysdeg          -47.86      97.71  -0.490    0.627    
new_dean      1749.09    1372.83   1.274    0.209    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2382 on 44 degrees of freedom
Multiple R-squared:  0.8602,    Adjusted R-squared:  0.838 
F-statistic: 38.68 on 7 and 44 DF,  p-value: < 2.2e-16

Now neither ysdeg nor new_dean is significant, because of the multicollinearity between them.
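One way to quantify this multicollinearity (a sketch, assuming the car package is available; alr4 loads it) is with variance inflation factors:

# Generalized VIFs; ysdeg and new_dean should show inflated values
# because each is largely predictable from the other
car::vif(lm(salary ~ ., data = salary))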

Question 3

Load the data:

data(house.selling.price)

A

fit <- lm(Price ~ Size + New, data = house.selling.price)
summary(fit)

Call:
lm(formula = Price ~ Size + New, data = house.selling.price)

Residuals:
    Min      1Q  Median      3Q     Max 
-205102  -34374   -5778   18929  163866 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -40230.867  14696.140  -2.738  0.00737 ** 
Size           116.132      8.795  13.204  < 2e-16 ***
New          57736.283  18653.041   3.095  0.00257 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 53880 on 97 degrees of freedom
Multiple R-squared:  0.7226,    Adjusted R-squared:  0.7169 
F-statistic: 126.3 on 2 and 97 DF,  p-value: < 2.2e-16

In the model, both the Size and New variables are positively associated with Price, and both are statistically significant at the 5% level. A 1 square foot increase in the size of the house is associated with a $116 increase in price, controlling for whether the house is new. New houses are on average $57736 more expensive than old houses, controlling for size.

B

The interpretation part is a bit redundant, since it was done in part A.

The prediction equation is:

\[\mathbf{E}[\textrm{Price}] = -40230.867 + 116.132 * \textrm{size} + 57736.283 * \textrm{new}\]

For new homes, the new variable takes the value 1; for old homes, it takes 0.

So, for new homes, the equation is:

\[\mathbf{E}[\textrm{Price}] = -40230.867 + 116.132 * \textrm{size} + 57736.283 * 1 = 17505.42 + 116.132 * \textrm{size} \]

For old homes, the equation is:

\[\mathbf{E}[\textrm{Price}] = -40230.867 + 116.132 * \textrm{size} + 57736.283 * 0 = -40230.867 + 116.132 * \textrm{size}\]

C

We can create a data frame and use the predict() function to do the prediction.

data.frame(Size = c(3000, 3000), New = c(1, 0)) %>%
  predict(fit, .) 
       1        2 
365900.2 308163.9 

A new home of 3000 square feet has a predicted price of $365900.2. An old home of 3000 square feet has a predicted price of $308163.9. (Note that the difference, $57736.3, is the coefficient of New.)
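The same numbers can be reproduced by hand from the coefficient vector, whose order is (Intercept), Size, New:

# Manual predictions: intercept + 3000 * Size coefficient + New coefficient
sum(coef(fit) * c(1, 3000, 1))  # new home
sum(coef(fit) * c(1, 3000, 0))  # old home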

D

Model with interaction term:

fit_ia <- lm(Price ~ Size + New + Size * New, data = house.selling.price)
summary(fit_ia)

Call:
lm(formula = Price ~ Size + New + Size * New, data = house.selling.price)

Residuals:
    Min      1Q  Median      3Q     Max 
-175748  -28979   -6260   14693  192519 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -22227.808  15521.110  -1.432  0.15536    
Size           104.438      9.424  11.082  < 2e-16 ***
New         -78527.502  51007.642  -1.540  0.12697    
Size:New        61.916     21.686   2.855  0.00527 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 52000 on 96 degrees of freedom
Multiple R-squared:  0.7443,    Adjusted R-squared:  0.7363 
F-statistic: 93.15 on 3 and 96 DF,  p-value: < 2.2e-16

E

\[\mathbf{E}[\textrm{Price}] = -22227.808 + 104.438 * \textrm{Size} - 78527.502 * \textrm{New} + 61.916 * \textrm{Size} * \textrm{New}\]

Again, we follow the same logic from B, replacing New with 1 and 0:

For new homes:

\[\mathbf{E}[\textrm{Price}] = -22227.808 + 104.438 * \textrm{Size} - 78527.502 * 1 + 61.916 * \textrm{Size} * 1 = -100755.3 + 166.354 * \textrm{Size}\]

For old homes:

\[\mathbf{E}[\textrm{Price}] = -22227.808 + 104.438 * \textrm{Size} - 78527.502 * 0 + 61.916 * \textrm{Size} * 0 = -22227.808 + 104.438 * \textrm{Size}\]
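As a quick check (not part of the original output), the new-home slope of 166.354 is the sum of the Size main effect and the interaction coefficient:

# Slope of Size for new homes: main effect plus Size:New interaction
sum(coef(fit_ia)[c("Size", "Size:New")])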

F

Using predict() again:

data.frame(Size = c(3000, 3000), New = c(1, 0)) %>%
  predict(fit_ia, .) 
       1        2 
398307.5 291087.4 

A new home of 3000 square feet has a predicted price of $398307.5. An old home of 3000 square feet has a predicted price of $291087.4. The difference is $107220.1.

G

data.frame(Size = c(1500, 1500), New = c(1, 0)) %>%
  predict(fit_ia, .) 
       1        2 
148776.1 134429.8 

A new home of 1500 square feet has a predicted price of $148776.1. An old home of 1500 square feet has a predicted price of $134429.8. The difference is $14346.3.

The difference between new and old home prices is much larger when the home is larger: for 3000 sq ft homes the difference is $107220.1, as opposed to $14346.3 for 1500 sq ft homes. This is consistent with the positive coefficient on the interaction term.
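This pattern follows directly from the prediction equations: at a given size, the gap between new and old homes is

\[\textrm{Gap} = -78527.502 + 61.916 * \textrm{Size},\]

which gives about $107220 at 3000 square feet and about $14346 at 1500 square feet (the small differences from the predicted values above reflect rounding in the printed coefficients).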

H

I prefer the model with the interaction term, because (1) the interaction term is significant, and (2) the adjusted R-squared is higher in the model with the interaction (i.e., despite the penalty for the additional term, we have a better fit). We could look into cross-validation or PRESS (which also end up showing the interaction model to be superior), but those are not this homework’s topic.

summary(fit)

Call:
lm(formula = Price ~ Size + New, data = house.selling.price)

Residuals:
    Min      1Q  Median      3Q     Max 
-205102  -34374   -5778   18929  163866 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -40230.867  14696.140  -2.738  0.00737 ** 
Size           116.132      8.795  13.204  < 2e-16 ***
New          57736.283  18653.041   3.095  0.00257 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 53880 on 97 degrees of freedom
Multiple R-squared:  0.7226,    Adjusted R-squared:  0.7169 
F-statistic: 126.3 on 2 and 97 DF,  p-value: < 2.2e-16
summary(fit_ia)

Call:
lm(formula = Price ~ Size + New + Size * New, data = house.selling.price)

Residuals:
    Min      1Q  Median      3Q     Max 
-175748  -28979   -6260   14693  192519 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -22227.808  15521.110  -1.432  0.15536    
Size           104.438      9.424  11.082  < 2e-16 ***
New         -78527.502  51007.642  -1.540  0.12697    
Size:New        61.916     21.686   2.855  0.00527 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 52000 on 96 degrees of freedom
Multiple R-squared:  0.7443,    Adjusted R-squared:  0.7363 
F-statistic: 93.15 on 3 and 96 DF,  p-value: < 2.2e-16
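As one more check (not part of the original output), a nested-model F-test comparing the two fits reproduces the significance of the interaction term; with a single added term, F equals the squared t statistic for Size:New, so the p-value matches the 0.00527 above.

# Partial F-test of the interaction model against the main-effects model
anova(fit, fit_ia)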