DACSS 603 Homework 4

hw4
regression
Homework 4 for DACSS 603
Author

Laura Collazo

Published

April 18, 2023

Code
library(tidyverse)
library(alr4)
library(smss)

Question 1

For this question the prediction equation is Price = -10,536 + 53.8HomeSize + 2.84LotSize.

a

When HomeSize = 1240 and LotSize= 18,000, the predicted Price is:

Code
sum(-10,536 + (53.8*1240) + (2.84*1800))
[1] 72350

Since this home actually sold for $145,000, the residual is:

Code
sum(72350-145000)
[1] -72650

b

When the lot size remains fixed, the price is predicted to increase $53.80 for every one-square foot increase in size.

c

Given this same equation, if home size remains fixed, the lot size would need to increase by the below in order to have the same impact on price as a one-square foot increase in home size:

Code
sum(53.8/2.84)
[1] 18.94366

Question 2

This question uses the “salary” data from the alr4 package to examine salary and characteristics of faculty in the early 1980s at a small Mid-West college.

Code
data("salary")

a

The below tests the hypothesis that mean salary for men and women is the same.

Code
fit_2a <- lm(salary ~ sex, data = salary)

summary(fit_2a)

Call:
lm(formula = salary ~ sex, data = salary)

Residuals:
    Min      1Q  Median      3Q     Max 
-8602.8 -4296.6  -100.8  3513.1 16687.9 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)    24697        938  26.330   <2e-16 ***
sexFemale      -3340       1808  -1.847   0.0706 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5782 on 50 degrees of freedom
Multiple R-squared:  0.0639,    Adjusted R-squared:  0.04518 
F-statistic: 3.413 on 1 and 50 DF,  p-value: 0.0706

This model does not allow the null hypothesis to be rejected as the p-value is not less than 0.05. The adjusted R-squared is also low and indicates this model explains only 4.52% of the variation between salary and sex. This model also shows that being female results in being paid $3,340 less per year than male faculty.

b

The below model adds in degree, rank, year, and ysdeg as additional predictors to the regression model.

Code
fit_2b <- lm(salary ~ sex + degree + rank + year + ysdeg, data = salary)

summary(fit_2b)

Call:
lm(formula = salary ~ sex + degree + rank + year + ysdeg, data = salary)

Residuals:
    Min      1Q  Median      3Q     Max 
-4045.2 -1094.7  -361.5   813.2  9193.1 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 15746.05     800.18  19.678  < 2e-16 ***
sexFemale    1166.37     925.57   1.260    0.214    
degreePhD    1388.61    1018.75   1.363    0.180    
rankAssoc    5292.36    1145.40   4.621 3.22e-05 ***
rankProf    11118.76    1351.77   8.225 1.62e-10 ***
year          476.31      94.91   5.018 8.65e-06 ***
ysdeg        -124.57      77.49  -1.608    0.115    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2398 on 45 degrees of freedom
Multiple R-squared:  0.855, Adjusted R-squared:  0.8357 
F-statistic: 44.24 on 6 and 45 DF,  p-value: < 2.2e-16

The 95% confidence interval for the difference in salary between males and females is below.

Code
confint(fit_2b)
                 2.5 %      97.5 %
(Intercept) 14134.4059 17357.68946
sexFemale    -697.8183  3030.56452
degreePhD    -663.2482  3440.47485
rankAssoc    2985.4107  7599.31080
rankProf     8396.1546 13841.37340
year          285.1433   667.47476
ysdeg        -280.6397    31.49105

c

This section interprets the findings for each predictor variable in the above model.

sex

The variable sex is not statistically significant in this model. It indicates that when all predictors are held constant, females are paid $1,166.37 more per year more than males.

degree

The variable degree is not statistically significant in this model. It indicates that when all predictors are held constant, those with a PhD earn $1,388.61 more than those with a Master’s.

rank

The variable rank is statistically significant in this model. It indicates that when all predictors are held constant, faculty with a title of Assoc earn $5,292.36 more per year than those with a title of Asst and those with a title of Prof earn $11,118.76 more per year than those with a title of Prof.

year

The variable year is statistically significant in this model. It indicates that when all predictors are held constant, for every year increase in faculty’s current rank, salary increases by $476.31.

ysdeg

The variable ysdeg is not statistically significant in this model. It indicates that when all predictors are held constant, for every year increase since the highest degree was earned, salary decreases by -$124.57.

d

Below the model is updated so that the baseline category for rank is “Prof.”

Code
salary$rank <- relevel(salary$rank, ref = "Prof")

fit_2d <- lm(salary ~ sex + degree + rank + year + ysdeg, data = salary)

summary(fit_2d)

Call:
lm(formula = salary ~ sex + degree + rank + year + ysdeg, data = salary)

Residuals:
    Min      1Q  Median      3Q     Max 
-4045.2 -1094.7  -361.5   813.2  9193.1 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  26864.81    1375.29  19.534  < 2e-16 ***
sexFemale     1166.37     925.57   1.260    0.214    
degreePhD     1388.61    1018.75   1.363    0.180    
rankAsst    -11118.76    1351.77  -8.225 1.62e-10 ***
rankAssoc    -5826.40    1012.93  -5.752 7.28e-07 ***
year           476.31      94.91   5.018 8.65e-06 ***
ysdeg         -124.57      77.49  -1.608    0.115    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2398 on 45 degrees of freedom
Multiple R-squared:  0.855, Adjusted R-squared:  0.8357 
F-statistic: 44.24 on 6 and 45 DF,  p-value: < 2.2e-16

This change to the model does not affect its fit or the coefficients. Notice that rankAsst shows a decrease of -$11,118.76 in salary whereas in the previous model rankProf was an increase of $11,118.76 in salary.

e

This next model removes the variable rank from the model.

Code
fit_2e <- lm(salary ~ sex + degree + year + ysdeg, data = salary)

summary(fit_2e)

Call:
lm(formula = salary ~ sex + degree + year + ysdeg, data = salary)

Residuals:
    Min      1Q  Median      3Q     Max 
-8146.9 -2186.9  -491.5  2279.1 11186.6 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 17183.57    1147.94  14.969  < 2e-16 ***
sexFemale   -1286.54    1313.09  -0.980 0.332209    
degreePhD   -3299.35    1302.52  -2.533 0.014704 *  
year          351.97     142.48   2.470 0.017185 *  
ysdeg         339.40      80.62   4.210 0.000114 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3744 on 47 degrees of freedom
Multiple R-squared:  0.6312,    Adjusted R-squared:  0.5998 
F-statistic: 20.11 on 4 and 47 DF,  p-value: 1.048e-09

Removing rank leads to ysdeg and degree now being statistically significant in the model. However, the adjusted R-square is lower than in the previous 2 models and the residual standard error is greater making this model not the best fit.

f

This final model creates a new variable new_hire using ysdeg. Those who were hired within 15 years or less were coded as 1 and everyone else 0. To avoid multicollinearity, the variable year was removed from the model. This is because it’s possible the years in faculty’s current rank are the same as years since the highest degree was earned (or in other words, years since hired).

Code
salary$new_hire <- ifelse(salary$ysdeg <= 15, 1, 0)

fit_2f <- lm(salary ~ sex + degree + new_hire, data = salary)

summary(fit_2f)

Call:
lm(formula = salary ~ sex + degree + new_hire, data = salary)

Residuals:
    Min      1Q  Median      3Q     Max 
-8260.4 -3557.7  -462.6  3563.2 12098.5 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)    28663       1155  24.821  < 2e-16 ***
sexFemale      -2716       1433  -1.896    0.064 .  
degreePhD      -1227       1372  -0.895    0.375    
new_hire       -7418       1306  -5.679 7.74e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4558 on 48 degrees of freedom
Multiple R-squared:  0.4416,    Adjusted R-squared:  0.4067 
F-statistic: 12.65 on 3 and 48 DF,  p-value: 3.231e-06

This model shows the null hypothesis should be rejected and indicates that faculty hired by the new dean are actually making a lower salary than those who were hired more than 15 years ago.

Question 3

Code
data("house.selling.price")

This questions uses the dataset house.selling.price from the package smss.

a

Code
fit_3a <- lm(Price ~ Size + New, data = house.selling.price)

summary(fit_3a)

Call:
lm(formula = Price ~ Size + New, data = house.selling.price)

Residuals:
    Min      1Q  Median      3Q     Max 
-205102  -34374   -5778   18929  163866 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -40230.867  14696.140  -2.738  0.00737 ** 
Size           116.132      8.795  13.204  < 2e-16 ***
New          57736.283  18653.041   3.095  0.00257 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 53880 on 97 degrees of freedom
Multiple R-squared:  0.7226,    Adjusted R-squared:  0.7169 
F-statistic: 126.3 on 2 and 97 DF,  p-value: < 2.2e-16

This first model examines how the size of a house and being new or not influences price. It reveals the variables are statistically significant. Furthermore, we learn that a 1 unit change in size leads to a $116.13 increase in price and when a house is new, it will cost $57,736.28 more than a house that is old when the size is held constant.

b

The equation for the predicted selling price when the home is new is: price = -40230.867 + 116.132Size + 57736.283New

c

The predicted selling price for a home of 3000 square feed that is new is below.

Code
df_new <- data.frame(Size = 3000, New = 1)

predict(fit_3a, newdata = df_new)
       1 
365900.2 

The predicted selling price for a home of 3000 square feed that is not new is below.

Code
df_not_new <- data.frame(Size = 3000, New = 0)

predict(fit_3a, newdata = df_not_new)
       1 
308163.9 

d

The next model includes an interaction term between size and new.

Code
fit_3d <- lm(Price ~ Size * New, data = house.selling.price)

summary(fit_3d)

Call:
lm(formula = Price ~ Size * New, data = house.selling.price)

Residuals:
    Min      1Q  Median      3Q     Max 
-175748  -28979   -6260   14693  192519 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -22227.808  15521.110  -1.432  0.15536    
Size           104.438      9.424  11.082  < 2e-16 ***
New         -78527.502  51007.642  -1.540  0.12697    
Size:New        61.916     21.686   2.855  0.00527 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 52000 on 96 degrees of freedom
Multiple R-squared:  0.7443,    Adjusted R-squared:  0.7363 
F-statistic: 93.15 on 3 and 96 DF,  p-value: < 2.2e-16

e

Code
ggplot(house.selling.price,aes(y=Price,x=Size,color=factor(New)))+
  geom_point()+
  stat_smooth(method="lm",se=TRUE)
`geom_smooth()` using formula = 'y ~ x'

f

The predicted selling price, using the model with interaction terms, for a home of 3000 square feed that is new is below.

Code
predict(fit_3d, newdata = df_new)
       1 
398307.5 

The predicted selling price, using the model with interaction terms, for a home of 3000 square feed that is not new is below.

Code
predict(fit_3d, newdata = df_not_new)
       1 
291087.4 

g

The predicted selling price, using the model with interaction terms, for a home of 1500 square feed that is new is below.

Code
df_new <- data.frame(Size = 1500, New = 1)

predict(fit_3d, newdata = df_new)
       1 
148776.1 

The predicted selling price, using the model with interaction terms, for a home of 1500 square feed that is not new is below.

Code
df_not_new <- data.frame(Size = 1500, New = 0)

predict(fit_3d, newdata = df_not_new)
       1 
134429.8 

In comparing the predictions for part F and G, it can be observed that the difference in selling price between a new and not new home increases as the the size of the home increases.

h

I believe the model with the interaction term best represents the relationship of size and new to the outcome price. I’ve come to this conclusion as the model with the interaction term has a higher adjusted R-squared and lower residual standard error than the model without the interaction term.