hw4: Multiple Linear Regression
Homework 4 for DACSS 603 Spring 2023

Author: Justine Shakespeare
Published: April 24, 2023

Code
library(tidyverse)
library(alr4)
library(smss)
library(stargazer)

Question 1

ŷ = −10,536 + 53.8x1 + 2.84x2

x1 = size of home

x2 = size of lot

A

A particular home of 1240 square feet on a lot of 18,000 square feet sold for $145,000. Find the predicted selling price and the residual, and interpret.

x1 = 1240 x2 = 18000 y = 145000

Code
x1 = 1240
x2 = 18000

-10536 + (53.8*x1) + (2.84*x2)
[1] 107296

The predicted selling price for this house is $107,296. Given that the actual selling price is $145,000, the model underpredicts the value of this house.

The residual for this data point is the difference between the observed value y and the predicted value ŷ. In this case, that is $145,000 - $107,296 = $37,704: this house sold for $37,704 more than the predicted value.

Code
145000 - 107296
[1] 37704
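
As a quick sanity check, the prediction equation and residual can be wrapped in a small helper function (a minimal sketch; the function name predict_price is my own):

Code
# prediction equation from Question 1
predict_price <- function(home_size, lot_size) {
  -10536 + 53.8 * home_size + 2.84 * lot_size
}

y_hat <- predict_price(home_size = 1240, lot_size = 18000)
y_hat           # 107296, the predicted price from above
145000 - y_hat  # 37704, the residual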

B

For fixed lot size, how much is the house selling price predicted to increase for each square-foot increase in home size? Why?

The house selling price is predicted to increase by $53.80 for each one-square-foot increase in home size, holding lot size fixed; 53.8 is the slope, or coefficient, on home size.

ŷ = −10,536 + 53.8x1 + 2.84x2

C

According to this prediction equation, for fixed home size, how much would lot size need to increase to have the same impact as a one-square-foot increase in home size?

A one-square-foot increase in the size of the home is associated with a $53.80 increase in the predicted sales price, and a one-square-foot increase in the size of the lot with a $2.84 increase. To find how much the lot size would need to increase to have the same impact as a one-square-foot increase in home size, we can set up the equation 53.8 = 2.84x and solve for x.

Code
53.8/2.84
[1] 18.94366

The lot would need to increase by about 18.94 square feet to raise the predicted sales price as much as a one-square-foot increase in home size.

Question 2

Code
data(salary)

glimpse(salary)
Rows: 52
Columns: 6
$ degree <fct> Masters, Masters, Masters, Masters, PhD, Masters, PhD, Masters,…
$ rank   <fct> Prof, Prof, Prof, Prof, Prof, Prof, Prof, Prof, Prof, Prof, Pro…
$ sex    <fct> Male, Male, Male, Female, Male, Male, Female, Male, Male, Male,…
$ year   <int> 25, 13, 10, 7, 19, 16, 0, 16, 13, 13, 12, 15, 9, 9, 9, 7, 13, 1…
$ ysdeg  <int> 35, 22, 23, 27, 30, 21, 32, 18, 30, 31, 22, 19, 17, 27, 24, 15,…
$ salary <int> 36350, 35350, 28200, 26775, 33696, 28516, 24900, 31909, 31850, …

A

Test the hypothesis that the mean salary for men and women is the same, without regard to any other variable but sex. Explain your findings.

To test the hypothesis that the mean salary for men and women is the same, we’ll run a two-sample t-test (R defaults to the Welch version, which does not assume equal variances).

Code
t.test(formula = salary ~ sex, data = salary)

    Welch Two Sample t-test

data:  salary by sex
t = 1.7744, df = 21.591, p-value = 0.09009
alternative hypothesis: true difference in means between group Male and group Female is not equal to 0
95 percent confidence interval:
 -567.8539 7247.1471
sample estimates:
  mean in group Male mean in group Female 
            24696.79             21357.14 

The p-value of this test (0.09) is larger than 0.05, so if we choose that as our alpha level we fail to reject the null hypothesis that the mean salary for men and women is the same. The 95% confidence interval, -567.85 to 7,247.15, also includes 0, which is consistent with no significant difference between the means.
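
As a sanity check on the t-test, we can compute the group means directly with dplyr (note that the data frame and its outcome column are both named salary):

Code
salary %>% 
  group_by(sex) %>% 
  summarise(mean_salary = mean(salary))
# should match the sample estimates above: 24696.79 (Male) and 21357.14 (Female)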

B

Run a multiple linear regression with salary as the outcome variable and everything else as predictors, including sex. Assuming no interactions between sex and the other predictors, obtain a 95% confidence interval for the difference in salary between males and females.

Code
QbLM <- lm(salary ~ degree + rank + sex + year + ysdeg, data = salary)

# passing the data frame to stargazer() as well produces the
# summary-statistics table printed after the regression table
stargazer(QbLM, data = salary, type = 'text')

===============================================
                        Dependent variable:    
                    ---------------------------
                              salary           
-----------------------------------------------
degreePhD                    1,388.613         
                            (1,018.747)        
                                               
rankAssoc                  5,292.361***        
                            (1,145.398)        
                                               
rankProf                   11,118.760***       
                            (1,351.772)        
                                               
sexFemale                    1,166.373         
                             (925.569)         
                                               
year                        476.309***         
                             (94.914)          
                                               
ysdeg                        -124.574          
                             (77.486)          
                                               
Constant                   15,746.050***       
                             (800.178)         
                                               
-----------------------------------------------
Observations                    52             
R2                             0.855           
Adjusted R2                    0.836           
Residual Std. Error     2,398.418 (df = 45)    
F Statistic           44.239*** (df = 6; 45)   
===============================================
Note:               *p<0.1; **p<0.05; ***p<0.01

===============================================
Statistic N     Mean    St. Dev.   Min    Max  
-----------------------------------------------
year      52   7.481      5.508     0      25  
ysdeg     52   16.115    10.222     1      35  
salary    52 23,797.650 5,917.289 15,000 38,045
-----------------------------------------------

After running the multiple linear regression and displaying it with stargazer(), we see that the coefficient on sexFemale is 1,166.373 with a standard error of 925.569. We can also see that there are 52 observations and 7 estimated coefficients, leaving 45 residual degrees of freedom. With these figures we can calculate the 95% confidence interval.

Code
# for a regression coefficient: CI = estimate ± (t × SE)

t_score <- qt(.025, df = 45)  # note: this t quantile is negative

1166.373 + (t_score * 925.569)  # lower bound
[1] -697.8187
Code
1166.373 - (t_score * 925.569)  # upper bound
[1] 3030.565

Or we can use the confint() function in R:

Code
confint(QbLM)
                 2.5 %      97.5 %
(Intercept) 14134.4059 17357.68946
degreePhD    -663.2482  3440.47485
rankAssoc    2985.4107  7599.31080
rankProf     8396.1546 13841.37340
sexFemale    -697.8183  3030.56452
year          285.1433   667.47476
ysdeg        -280.6397    31.49105
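
Since confint() returns a matrix with named rows, we can also pull out just the interval for the sex coefficient:

Code
confint(QbLM)["sexFemale", ]  # matches the manual calculation above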

C

Interpret your finding for each predictor variable; discuss (a) statistical significance, (b) interpretation of the coefficient / slope in relation to the outcome variable and other variables

In this regression analysis the sex, ysdeg, and degree variables are not statistically significant, so I will not interpret their coefficients.

The rank and year variables are statistically significant at the 0.01 (1%) level. The coefficients indicate that a person with the “Assoc” rank is predicted to make $5,292.361 more than a person with the “Asst” rank (since “Asst” does not appear in the regression output, we know it is the baseline category), and a person with the “Prof” rank is predicted to make $11,118.760 more than a person with the “Asst” rank, holding the other variables constant.

With each one-unit increase in the year variable, a person’s predicted salary increases by $476.309, holding the other variables constant.

D

Change the baseline category for the rank variable. Interpret the coefficients related to rank again.

Code
salary$rank <- relevel(salary$rank, ref = "Prof")

summary(lm(formula = salary ~ degree + rank + sex + year + ysdeg, data = salary))

Call:
lm(formula = salary ~ degree + rank + sex + year + ysdeg, data = salary)

Residuals:
    Min      1Q  Median      3Q     Max 
-4045.2 -1094.7  -361.5   813.2  9193.1 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  26864.81    1375.29  19.534  < 2e-16 ***
degreePhD     1388.61    1018.75   1.363    0.180    
rankAsst    -11118.76    1351.77  -8.225 1.62e-10 ***
rankAssoc    -5826.40    1012.93  -5.752 7.28e-07 ***
sexFemale     1166.37     925.57   1.260    0.214    
year           476.31      94.91   5.018 8.65e-06 ***
ysdeg         -124.57      77.49  -1.608    0.115    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2398 on 45 degrees of freedom
Multiple R-squared:  0.855, Adjusted R-squared:  0.8357 
F-statistic: 44.24 on 6 and 45 DF,  p-value: < 2.2e-16

I used the relevel() function to change the baseline level of the rank variable to “Prof”. The results now tell us that people with the “Asst” rank make $11,118.76 less than people with the “Prof” rank, and people with the “Assoc” rank make $5,826.40 less than those with the “Prof” rank. Both of these findings are statistically significant.
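
Changing the baseline only reparameterizes the model; the fitted values, R^2, and predictions are unchanged. A quick sketch to confirm this (the model names here are my own):

Code
# fit with the "Prof" baseline (as releveled above)
fit_prof <- lm(salary ~ degree + rank + sex + year + ysdeg, data = salary)

# refit with the original "Asst" baseline on a copy of the data
salary_asst <- salary
salary_asst$rank <- relevel(salary_asst$rank, ref = "Asst")
fit_asst <- lm(salary ~ degree + rank + sex + year + ysdeg, data = salary_asst)

all.equal(fitted(fit_prof), fitted(fit_asst))  # TRUE: identical predictions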

E

Exclude the variable rank, refit, and summarize how your findings changed, if they did.

Code
QbLM <- lm(salary ~ degree + rank + sex + year + ysdeg, data = salary)

QbLM_noRank <- lm(salary ~ degree + sex + year + ysdeg, data = salary)

stargazer(QbLM, QbLM_noRank, data = salary, type = 'text')

=================================================================
                                 Dependent variable:             
                    ---------------------------------------------
                                       salary                    
                             (1)                    (2)          
-----------------------------------------------------------------
degreePhD                 1,388.613             -3,299.349**     
                         (1,018.747)            (1,302.520)      
                                                                 
rankAsst                -11,118.760***                           
                         (1,351.772)                             
                                                                 
rankAssoc               -5,826.403***                            
                         (1,012.933)                             
                                                                 
sexFemale                 1,166.373              -1,286.544      
                          (925.569)             (1,313.089)      
                                                                 
year                      476.309***             351.969**       
                           (94.914)              (142.481)       
                                                                 
ysdeg                      -124.574              339.399***      
                           (77.486)               (80.621)       
                                                                 
Constant                26,864.810***          17,183.570***     
                         (1,375.288)            (1,147.942)      
                                                                 
-----------------------------------------------------------------
Observations                  52                     52          
R2                          0.855                  0.631         
Adjusted R2                 0.836                  0.600         
Residual Std. Error  2,398.418 (df = 45)    3,743.502 (df = 47)  
F Statistic         44.239*** (df = 6; 45) 20.107*** (df = 4; 47)
=================================================================
Note:                                 *p<0.1; **p<0.05; ***p<0.01

===============================================
Statistic N     Mean    St. Dev.   Min    Max  
-----------------------------------------------
year      52   7.481      5.508     0      25  
ysdeg     52   16.115    10.222     1      35  
salary    52 23,797.650 5,917.289 15,000 38,045
-----------------------------------------------

Removing the rank variable from the model changes the significance of several predictors. With rank included, degree and ysdeg were not significant and year was significant at the 0.01 (1%) level. With rank removed, degree becomes significant at the 0.05 (5%) level and ysdeg at the 0.01 (1%) level, while year drops from the 0.01 (1%) level to the 0.05 (5%) level.

The signs of the ysdeg, sexFemale, and degreePhD coefficients also flip.
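
Because the model without rank is nested in the full model, we can also test whether the rank terms jointly add explanatory power with a partial F-test:

Code
anova(QbLM_noRank, QbLM)  # F-test for jointly adding the rank dummies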

F

Everyone in this dataset was hired the year they earned their highest degree. It is also known that a new Dean was appointed 15 years ago, and everyone in the dataset who earned their highest degree 15 years ago or less than that has been hired by the new Dean. Some people have argued that the new Dean has been making offers that are a lot more generous to newly hired faculty than the previous one and that this might explain some of the variation in Salary.

Create a new variable that would allow you to test this hypothesis and run another multiple regression model to test this. Select variables carefully to make sure there is no multicollinearity. Explain why multicollinearity would be a concern in this case and how you avoided it. Do you find support for the hypothesis that the people hired by the new Dean are making higher than those that were not?

Code
Q2F <- salary %>% 
  mutate(new_dean = case_when(
    ysdeg <= 15 ~ 1,  # hired by the new Dean
    ysdeg > 15 ~ 0))  # hired by the previous Dean

summary(lm(salary ~ new_dean + degree + rank + sex + year, data = Q2F))

Call:
lm(formula = salary ~ new_dean + degree + rank + sex + year, 
    data = Q2F)

Residuals:
    Min      1Q  Median      3Q     Max 
-3403.3 -1387.0  -167.0   528.2  9233.8 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  24425.32    1107.52  22.054  < 2e-16 ***
new_dean      2163.46    1072.04   2.018   0.0496 *  
degreePhD      818.93     797.48   1.027   0.3100    
rankAsst    -11096.95    1191.00  -9.317 4.54e-12 ***
rankAssoc    -6124.28    1028.58  -5.954 3.65e-07 ***
sexFemale      907.14     840.54   1.079   0.2862    
year           434.85      78.89   5.512 1.65e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2362 on 45 degrees of freedom
Multiple R-squared:  0.8594,    Adjusted R-squared:  0.8407 
F-statistic: 45.86 on 6 and 45 DF,  p-value: < 2.2e-16

Since the new variable, new_dean, is built directly from the ysdeg variable, including both in the model would introduce multicollinearity; to avoid this, I left ysdeg out of the regression. We can confirm that these two variables would likely cause multicollinearity by checking that their correlation is high.

Code
cor.test(Q2F$new_dean, Q2F$ysdeg)

    Pearson's product-moment correlation

data:  Q2F$new_dean and Q2F$ysdeg
t = -11.101, df = 50, p-value = 4.263e-15
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.9074548 -0.7411040
sample estimates:
       cor 
-0.8434239 

These two variables have a strong negative correlation (r = -0.84), as expected.
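
Had we included both variables anyway, variance inflation factors would flag the problem. A sketch using car::vif() (the car package is a dependency of alr4, so it should already be loaded):

Code
both <- lm(salary ~ new_dean + ysdeg + degree + rank + sex + year, data = Q2F)
car::vif(both)  # new_dean and ysdeg should show elevated (G)VIF values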

The output of the regression with the new_dean variable shows that a person hired by the new Dean is predicted to earn $2,163.46 more in salary than a comparable person hired by the previous Dean, holding the other variables constant. This finding is significant at the 0.05 (5%) level, so we do find support for the hypothesis.

Question 3

A

Using the house.selling.price data, run and report regression results modeling y = selling price (in dollars) in terms of size of home (in square feet) and whether the home is new (1 = yes; 0 = no). In particular, for each variable; discuss statistical significance and interpret the meaning of the coefficient.

Code
data(house.selling.price)
glimpse(house.selling.price)
Rows: 100
Columns: 7
$ case  <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 1…
$ Taxes <int> 3104, 1173, 3076, 1608, 1454, 2997, 4054, 3002, 6627, 320, 630, …
$ Beds  <int> 4, 2, 4, 3, 3, 3, 3, 3, 5, 3, 3, 3, 3, 3, 3, 2, 3, 3, 2, 3, 2, 4…
$ Baths <int> 2, 1, 2, 2, 3, 2, 2, 2, 4, 2, 2, 2, 2, 2, 1, 1, 2, 1, 2, 2, 1, 3…
$ New   <int> 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1…
$ Price <int> 279900, 146500, 237700, 200000, 159900, 499900, 265500, 289900, …
$ Size  <int> 2048, 912, 1654, 2068, 1477, 3153, 1355, 2075, 3990, 1160, 1220,…
Code
houseData <- lm(Price ~ Size + New, data = house.selling.price)
summary(houseData)

Call:
lm(formula = Price ~ Size + New, data = house.selling.price)

Residuals:
    Min      1Q  Median      3Q     Max 
-205102  -34374   -5778   18929  163866 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -40230.867  14696.140  -2.738  0.00737 ** 
Size           116.132      8.795  13.204  < 2e-16 ***
New          57736.283  18653.041   3.095  0.00257 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 53880 on 97 degrees of freedom
Multiple R-squared:  0.7226,    Adjusted R-squared:  0.7169 
F-statistic: 126.3 on 2 and 97 DF,  p-value: < 2.2e-16

The regression output indicates that both the Size and New variables are significant (Size at the 0.001 (0.1%) level and New at the 0.01 (1%) level). The coefficients tell us that each one-square-foot increase in size is associated with a $116.132 increase in predicted price, and that new houses are predicted to sell for $57,736.283 more than comparable older houses.

B

Report and interpret the prediction equation, and form separate equations relating selling price to size for new and for not new homes.

E(price) = -40,230.867 + 116.132 * Size + 57736.283 * New

New home: setting New = 1 gives E(price) = -40,230.867 + 116.132 * Size + 57,736.283 * 1, which simplifies to:

E(price) = 17,505.416 + 116.132 * Size

Not-new home: setting New = 0 gives E(price) = -40,230.867 + 116.132 * Size + 57,736.283 * 0, which simplifies to:

E(price) = -40,230.867 + 116.132 * Size
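
These two lines can also be read straight off the fitted model from part A, which avoids copying coefficients by hand:

Code
b <- coef(houseData)
c(new_intercept = unname(b["(Intercept)"] + b["New"]),  # 17,505.416
  old_intercept = unname(b["(Intercept)"]),             # -40,230.867
  slope         = unname(b["Size"]))                    # 116.132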

C

Find the predicted selling price for a home of 3000 square feet that is (i) new, (ii) not new.

Code
# Equation: -40,230.867 + 116.132 * size + 57736.283 * new

# new home
size <- 3000
new <- 1
-40230.867 + 116.132 * size + 57736.283 * new
[1] 365901.4
Code
# old home
size <- 3000
new <- 0
-40230.867 + 116.132 * size + 57736.283 * new
[1] 308165.1

The predicted price of a new 3,000 square foot home is $365,901.40 and the predicted price of an old 3,000 square foot home is $308,165.10.
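
The same predictions come out of predict() with a newdata data frame, which avoids rounding errors from retyping coefficients:

Code
predict(houseData, newdata = data.frame(Size = 3000, New = c(1, 0)))
# should match the values above: ~365,901 (new) and ~308,165 (not new)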

D

Fit another model, this time with an interaction term allowing interaction between size and new, and report the regression results

Code
summary(lm(Price ~ Size + New + Size*New, data = house.selling.price))

Call:
lm(formula = Price ~ Size + New + Size * New, data = house.selling.price)

Residuals:
    Min      1Q  Median      3Q     Max 
-175748  -28979   -6260   14693  192519 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -22227.808  15521.110  -1.432  0.15536    
Size           104.438      9.424  11.082  < 2e-16 ***
New         -78527.502  51007.642  -1.540  0.12697    
Size:New        61.916     21.686   2.855  0.00527 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 52000 on 96 degrees of freedom
Multiple R-squared:  0.7443,    Adjusted R-squared:  0.7363 
F-statistic: 93.15 on 3 and 96 DF,  p-value: < 2.2e-16

The regression results indicate that the Size variable is still significant at the 0.001 (0.1%) level, but the New variable is no longer significant on its own. The interaction term Size:New is significant at the 0.01 (1%) level.

E

Report the lines relating the predicted selling price to the size for homes that are (i) new, (ii) not new.

I assume by “report the lines” the question means to write out the separate prediction equations and interpret the coefficients.

Equation: E(price) = -22,227.808 + 104.438 * size - 78,527.502 * new + 61.916 * (size * new)

New house: we replace “new” with 1: E(price) = -22,227.808 + 104.438 * size - 78,527.502 + 61.916 * size, which simplifies to:

E(price) = -100,755.31 + 166.354 * size

Not-new house: we replace “new” with 0, and the new and interaction terms drop out:

E(price) = -22,227.808 + 104.438 * size
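
As in part B, these lines can be derived from the fitted coefficients rather than by hand (a sketch; houseData_ia is my own name for the interaction model, and Size * New expands to Size + New + Size:New in R's formula syntax):

Code
houseData_ia <- lm(Price ~ Size * New, data = house.selling.price)
b <- coef(houseData_ia)
c(new_intercept = unname(b["(Intercept)"] + b["New"]),  # -100,755.31
  new_slope     = unname(b["Size"] + b["Size:New"]),    # 166.354
  old_intercept = unname(b["(Intercept)"]),             # -22,227.808
  old_slope     = unname(b["Size"]))                    # 104.438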

F

Find the predicted selling price for a home of 3000 square feet that is (i) new, (ii) not new.

Code
# new house
size <- 3000
new <- 1
-22227.808 + 104.438 * size - 78527.502 * new + 61.916*(size*new)
[1] 398306.7
Code
# old house
new <- 0
-22227.808 + 104.438 * size - 78527.502 * new + 61.916*(size*new)
[1] 291086.2
  1. The predicted price for a new house that is 3,000 square feet is $398,306.70.

  2. The predicted price for an old house of 3,000 square feet is $291,086.20.

G

Find the predicted selling price for a home of 1500 square feet that is (i) new, (ii) not new. Comparing to (F), explain how the difference in predicted selling prices changes as the size of home increases.

Code
# new house
size <- 1500
new <- 1
-22227.808 + 104.438 * size - 78527.502 * new + 61.916*(size*new)
[1] 148775.7
Code
# old house
new <- 0
-22227.808 + 104.438 * size - 78527.502 * new + 61.916*(size*new)
[1] 134429.2
  1. The predicted price for a new house that is 1,500 square feet is $148,775.70.

  2. The predicted price for an old house of 1,500 square feet is $134,429.20.

Code
# F
398306.7 - 291086.2
[1] 107220.5
Code
# G
148775.7 - 134429.2
[1] 14346.5

The price difference between new and not-new houses is greater for larger houses ($107,220.50 at 3,000 square feet) than for smaller houses ($14,346.50 at 1,500 square feet). This widening gap is exactly what the positive interaction term captures: each additional square foot adds $61.916 more to the price of a new house than to the price of an older one.
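
Using houseData_ia from the sketch in part E, this gap can be computed for any size with predict(), which makes the comparison easy to extend:

Code
# hypothetical helper: difference in predicted price, new minus not new
gap <- function(size) {
  p <- predict(houseData_ia, newdata = data.frame(Size = size, New = c(1, 0)))
  unname(p[1] - p[2])
}
gap(3000)  # ~107,220.5 (part F)
gap(1500)  # ~14,346.5 (part G)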

H

Do you think the model with interaction or the one without it represents the relationship of size and new to the outcome price? What makes you prefer one model over another?

Code
HSPmodel <- lm(Price ~ Size + New, data = house.selling.price)

HSPmodel_ia <- lm(Price ~ Size + New + Size*New, data = house.selling.price)

stargazer(HSPmodel, HSPmodel_ia, type = 'text')

==================================================================
                                 Dependent variable:              
                    ----------------------------------------------
                                        Price                     
                              (1)                    (2)          
------------------------------------------------------------------
Size                      116.132***              104.438***      
                            (8.795)                (9.424)        
                                                                  
New                      57,736.280***           -78,527.500      
                         (18,653.040)            (51,007.640)     
                                                                  
Size:New                                          61.916***       
                                                   (21.686)       
                                                                  
Constant                -40,230.870***           -22,227.810      
                         (14,696.140)            (15,521.110)     
                                                                  
------------------------------------------------------------------
Observations                  100                    100          
R2                           0.723                  0.744         
Adjusted R2                  0.717                  0.736         
Residual Std. Error  53,880.950 (df = 97)    51,998.110 (df = 96) 
F Statistic         126.335*** (df = 2; 97) 93.151*** (df = 3; 96)
==================================================================
Note:                                  *p<0.1; **p<0.05; ***p<0.01

I prefer the model with the interaction term: the interaction is statistically significant, both the R^2 and adjusted R^2 figures are higher than in the model without it, and the residual standard error is smaller.
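
A formal way to choose between the nested models is a partial F-test on the interaction term:

Code
anova(HSPmodel, HSPmodel_ia)
# with one added term, this F-test is equivalent to the t-test on
# Size:New above (p = 0.00527), so it should also favor the interaction model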