Author

Karen Kimble

Published

November 14, 2022

Code
# Setup
library(tidyverse)
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.3.6      ✔ purrr   0.3.5 
✔ tibble  3.1.8      ✔ dplyr   1.0.10
✔ tidyr   1.2.1      ✔ stringr 1.4.1 
✔ readr   2.1.3      ✔ forcats 0.5.2 
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
Code
library(dplyr)
library(alr4)
Loading required package: car
Loading required package: carData

Attaching package: 'car'

The following object is masked from 'package:dplyr':

    recode

The following object is masked from 'package:purrr':

    some

Loading required package: effects
lattice theme set by effectsTheme()
See ?effectsTheme for details.
Code
library(smss)
Warning: package 'smss' was built under R version 4.2.2

Question 1

Part A

Code
# Predicted

y = -10536 + (53.8 * 1240) + (2.84 * 18000)

y
[1] 107296
Code
145000 - y
[1] 37704

The predicted selling price is $107,296, but the actual selling price was $145,000, giving a residual of $37,704. This means that the prediction model underestimated the selling price by more than $37,000.
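The same calculation can be wrapped in a small function so the prediction and residual are easy to reproduce for other inputs. This is only a minimal sketch: the names `b` and `predict_price` are illustrative and not part of the assignment, and the coefficients are the ones used in the calculation above.

```r
# Coefficients from the prediction equation used above
b <- c(intercept = -10536, size = 53.8, lot = 2.84)

# Hypothetical helper: predicted selling price in dollars
predict_price <- function(size, lot) {
  unname(b["intercept"] + b["size"] * size + b["lot"] * lot)
}

predicted <- predict_price(size = 1240, lot = 18000)
residual  <- 145000 - predicted  # observed minus predicted

c(predicted = predicted, residual = residual)
```

A positive residual means the observed price sits above the fitted value, which is the underestimate described above.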

Part B

For a fixed lot size, the house selling price is predicted to increase by $53.80 for each additional square foot of home size. This is because 53.8 is the coefficient for the home size variable: it is the estimated change in price for a one-unit increase in that variable while lot size is held constant.

Part C

Code
53.8/2.84
[1] 18.94366

For a fixed home size, the lot size would need to increase by about 18.94 square feet in order to have an equivalent impact as an additional square foot of home size.
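As a quick check on that ratio, the sketch below confirms that adding about 18.94 square feet of lot moves the predicted price by the same amount as one extra square foot of home. The variable names are illustrative; the two coefficients are the ones given in the question.

```r
b_size <- 53.8   # dollars per extra square foot of home size
b_lot  <- 2.84   # dollars per extra square foot of lot size

# Lot-size increase that matches one extra square foot of home size
lot_equiv <- b_size / b_lot

# Both changes shift the predicted price by the same amount
effect_home <- b_size * 1
effect_lot  <- b_lot * lot_equiv

c(lot_equiv = lot_equiv, effect_home = effect_home, effect_lot = effect_lot)
```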

Question 2

Part A

Code
t.test(salary ~ sex, data = salary)

    Welch Two Sample t-test

data:  salary by sex
t = 1.7744, df = 21.591, p-value = 0.09009
alternative hypothesis: true difference in means between group Male and group Female is not equal to 0
95 percent confidence interval:
 -567.8539 7247.1471
sample estimates:
  mean in group Male mean in group Female 
            24696.79             21357.14 

The p-value from the t-test is 0.09, which is greater than the 0.05 significance level. This indicates that there is not statistically significant evidence to reject the hypothesis that the mean salaries for men and women are the same.

Part B

Code
model <- lm(salary ~ sex + degree + rank + year + ysdeg, data = salary)
confint(model)
                 2.5 %      97.5 %
(Intercept) 14134.4059 17357.68946
sexFemale    -697.8183  3030.56452
degreePhD    -663.2482  3440.47485
rankAssoc    2985.4107  7599.31080
rankProf     8396.1546 13841.37340
year          285.1433   667.47476
ysdeg        -280.6397    31.49105

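The confidence interval for sex means that we can be 95% confident that, holding degree, rank, year, and ysdeg constant, the true difference in mean salary between women and men lies between -697.82 and 3,030.56. Because the interval contains 0, a difference of zero cannot be ruled out at the 0.05 level.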
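For intuition, the interval for sexFemale can be approximately reproduced by hand as estimate ± critical t value × standard error. The sketch below uses the rounded estimate (1166.37) and standard error (925.57) that summary(model) reports for this model, with its 45 residual degrees of freedom, so the result can differ from confint() in the last decimals.

```r
est <- 1166.37   # sexFemale estimate reported by summary(model)
se  <- 925.57    # its standard error
df  <- 45        # residual degrees of freedom

t_crit <- qt(0.975, df)              # two-sided 95% critical value
ci <- est + c(-1, 1) * t_crit * se   # lower and upper bounds

round(ci, 2)
```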

Part C

Code
summary(model)

Call:
lm(formula = salary ~ sex + degree + rank + year + ysdeg, data = salary)

Residuals:
    Min      1Q  Median      3Q     Max 
-4045.2 -1094.7  -361.5   813.2  9193.1 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 15746.05     800.18  19.678  < 2e-16 ***
sexFemale    1166.37     925.57   1.260    0.214    
degreePhD    1388.61    1018.75   1.363    0.180    
rankAssoc    5292.36    1145.40   4.621 3.22e-05 ***
rankProf    11118.76    1351.77   8.225 1.62e-10 ***
year          476.31      94.91   5.018 8.65e-06 ***
ysdeg        -124.57      77.49  -1.608    0.115    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2398 on 45 degrees of freedom
Multiple R-squared:  0.855, Adjusted R-squared:  0.8357 
F-statistic: 44.24 on 6 and 45 DF,  p-value: < 2.2e-16

The above results show that the p-value for the variable of sex is larger than the significance level of 0.05, meaning there is still no statistically significant evidence to reject the hypothesis that the mean salaries for men and women are the same. When the individual is female, the model predicts the salary increases by 1,166.37.

For the degree level, there is also not statistically significant evidence to reject the hypothesis that the mean salaries for those with a master’s degree and a PhD are the same, since the p-value is larger than 0.05. The model predicts that when an individual has a PhD, their predicted salary increases by 1,388.61.

For the rank variable, there is statistically significant evidence to reject the hypothesis that the salaries for ranks Assistant, Associate, and Professor are the same. The p-values for both Associate and Professor rankings are extremely small and less than the significance level of 0.05. The model predicts that, relative to faculty with an Assistant ranking, faculty with an Associate ranking have a salary that is higher by 5,292.36, and faculty with a Professor ranking have a salary that is higher by 11,118.76.

The p-value for the variable of the amount of years in the current rank is also extremely small and less than the 0.05 significance level, meaning that there is statistically significant evidence to reject the hypothesis that the amount of years does not affect salary amount. For each additional year spent in the current rank, the model predicts a salary increase of 476.31.

Lastly, the p-value for the number of years since the highest degree is larger than the significance level of 0.05. There is no statistically significant evidence to reject the hypothesis that the number of years since the highest degree has no impact on salary. The model predicts that for each additional year since the highest degree, the salary decreases by 124.57.
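The t values and p-values in the coefficient table follow directly from the estimates and standard errors. As a rough sketch, the code below recomputes them from the rounded numbers printed in the summary above, so small rounding differences from the table are expected; the vector names are illustrative.

```r
# Estimates and standard errors copied from the summary(model) table
est <- c(sexFemale = 1166.37, degreePhD = 1388.61, rankAssoc = 5292.36,
         rankProf = 11118.76, year = 476.31, ysdeg = -124.57)
se  <- c(925.57, 1018.75, 1145.40, 1351.77, 94.91, 77.49)
df  <- 45  # residual degrees of freedom

t_value <- est / se                    # test statistic for H0: coefficient = 0
p_value <- 2 * pt(-abs(t_value), df)   # two-sided p-value
significant <- p_value < 0.05

data.frame(t_value = round(t_value, 3),
           p_value = signif(p_value, 3),
           significant = significant)
```

Only the two rank indicators and year fall below 0.05, which matches the interpretation given for each variable.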

Part D

Code
salary$rank <- relevel(salary$rank, ref = 'Prof')
model <- lm(salary ~ sex + degree + rank + year + ysdeg, data = salary)
summary(model)

Call:
lm(formula = salary ~ sex + degree + rank + year + ysdeg, data = salary)

Residuals:
    Min      1Q  Median      3Q     Max 
-4045.2 -1094.7  -361.5   813.2  9193.1 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  26864.81    1375.29  19.534  < 2e-16 ***
sexFemale     1166.37     925.57   1.260    0.214    
degreePhD     1388.61    1018.75   1.363    0.180    
rankAsst    -11118.76    1351.77  -8.225 1.62e-10 ***
rankAssoc    -5826.40    1012.93  -5.752 7.28e-07 ***
year           476.31      94.91   5.018 8.65e-06 ***
ysdeg         -124.57      77.49  -1.608    0.115    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2398 on 45 degrees of freedom
Multiple R-squared:  0.855, Adjusted R-squared:  0.8357 
F-statistic: 44.24 on 6 and 45 DF,  p-value: < 2.2e-16

After changing the baseline category to Professor, the p-values for the ranks Assistant and Associate are still extremely small, showing significant evidence to reject the hypothesis that the salaries for ranks Assistant, Associate, and Professor are the same. The model indicates that for those in the rank Assistant, the predicted salary is lower than for Professors by 11,118.76. With Professor as the baseline category, the model predicts a salary decrease of 5,826.40 when the faculty member is ranked Associate.
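Releveling only changes which rank the others are compared against; the fitted model itself is the same, which is why the residuals and R-squared are unchanged. As a small sketch, the Professor-baseline coefficients can be recovered by arithmetic from the Assistant-baseline coefficients reported in Part C. The variable names are illustrative.

```r
# Coefficients reported in Part C, where Assistant is the baseline rank
int_asst   <- 15746.05
rank_assoc <- 5292.36    # Associate vs. Assistant
rank_prof  <- 11118.76   # Professor vs. Assistant

# Implied coefficients when Professor is the baseline
int_prof      <- int_asst + rank_prof    # intercept moves to the Professor level
asst_vs_prof  <- -rank_prof              # Assistant vs. Professor
assoc_vs_prof <- rank_assoc - rank_prof  # Associate vs. Professor

c(intercept = int_prof, rankAsst = asst_vs_prof, rankAssoc = assoc_vs_prof)
```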

Part E

Code
model <- lm(salary ~ sex + degree + year + ysdeg, data = salary)
summary(model)

Call:
lm(formula = salary ~ sex + degree + year + ysdeg, data = salary)

Residuals:
    Min      1Q  Median      3Q     Max 
-8146.9 -2186.9  -491.5  2279.1 11186.6 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 17183.57    1147.94  14.969  < 2e-16 ***
sexFemale   -1286.54    1313.09  -0.980 0.332209    
degreePhD   -3299.35    1302.52  -2.533 0.014704 *  
year          351.97     142.48   2.470 0.017185 *  
ysdeg         339.40      80.62   4.210 0.000114 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3744 on 47 degrees of freedom
Multiple R-squared:  0.6312,    Adjusted R-squared:  0.5998 
F-statistic: 20.11 on 4 and 47 DF,  p-value: 1.048e-09

After excluding rank, all variables except for sex have statistically significant p-values. Though the p-value for the year variable increased, it still remained under the 0.05 significance level. The variables degree and ysdeg now have p-values below the 0.05 significance level, whereas they were much higher in the previous model. Removing the rank variable also resulted in new coefficients for all variables.

Part F

Code
salary$appointed <- ifelse(salary$year > 15, c("0"), c("1"))
model <- lm(salary ~ sex + degree + appointed, data = salary)
summary(model)

Call:
lm(formula = salary ~ sex + degree + appointed, data = salary)

Residuals:
     Min       1Q   Median       3Q      Max 
-11079.0  -4093.5   -333.7   3348.9  16842.5 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  29712.6     2593.2  11.458 2.45e-15 ***
sexFemale    -2504.7     1793.8  -1.396   0.1691    
degreePhD      541.4     1640.5   0.330   0.7428    
appointed1   -6005.5     2692.8  -2.230   0.0304 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5610 on 48 degrees of freedom
Multiple R-squared:  0.1541,    Adjusted R-squared:  0.1012 
F-statistic: 2.915 on 3 and 48 DF,  p-value: 0.04371

Multicollinearity would be a concern in this case because multiple variables are related due to the appointment of the new Dean. If the Dean only hired those who recently got their degree, then the variables year and years since highest degree are related: only those hired within the past 15 years would also have gotten their degree within 15 years. Thus, I omitted these two variables and created the variable “appointed”, with 1 indicating that the faculty member was appointed by the new Dean and 0 indicating that they were not. The results from the model don’t support the hypothesis that the new Dean’s appointees are making higher salaries than those who were not appointed by the new Dean. The model predicts a salary decrease of 6,005 when the faculty member is appointed by the new Dean. If the people hired by the new Dean were making more money, this prediction would be an increase.
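To make the recoding rule concrete, the sketch below applies the same ifelse() call to a few made-up year values (they are not taken from the salary data). Values of 15 or less are coded "1" and anything above 15 is coded "0". The second form is just an equivalent way to write the same rule.

```r
# Made-up values, used only to illustrate the recoding rule
year <- c(2, 15, 16, 30)

# Same rule as in the code above: more than 15 years -> "0", otherwise "1"
appointed <- ifelse(year > 15, "0", "1")
appointed

# Equivalent form built from the logical comparison
appointed_alt <- as.character(as.integer(year <= 15))
identical(appointed, appointed_alt)
```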

Question 3

Part A

Code
# Load the housing data that ships with the smss package before fitting
data(house.selling.price)
model <- lm(Price ~ Size + New, data = house.selling.price)
summary(model)

For the variable size, the p-value is extremely small and less than the significance level of 0.05, meaning that there is statistically significant evidence to reject the hypothesis that the size of the house does not affect price. The coefficient of size indicates that for each additional square foot, the model predicts the price to increase by 116.132.

The p-value for the variable new is also smaller than the significance level of 0.05, so there is statistically significant evidence to reject the hypothesis that new houses have the same mean price as old houses. The coefficient of new means that for a house that is new, the price is predicted to increase by 57,736.283.

Part B

The prediction equation is: y = -40,230.867 + 116.132x1 + 57,736.283x2 (with x1 being the square feet of the house and x2 being whether the house is old or new)

This means that when the house has 0 square feet and is not new, the price would be predicted to be -40,230.867 (or the y-intercept).

The equation for not new homes: y = -40,230.867 + 116.132x1

The last term of the prediction equation is omitted because x2 equals 0 for homes that are not new, which cancels it out.

The equation for new homes: y = -40,230.867 + 116.132x1 + 57,736.283(1) = 17,505.416 + 116.132x1

Part C

Code
# New

-40230.867 + (116.132 * 3000) + (57736.283 * 1)
[1] 365901.4

The predicted price of a new home with 3,000 square feet is $365,901.40.

Code
# Not new

-40230.867 + (116.132 * 3000)
[1] 308165.1

The predicted price of a not new home with 3,000 square feet is $308,165.10.
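One property of this model is worth checking in code: because it has no interaction term, the predicted gap between a new and a not new home is the same at every size and equals the coefficient for New. The sketch below uses the coefficients from the prediction equation in Part B; `predict_additive` is a hypothetical helper name.

```r
# Coefficients from the prediction equation in Part B
b0     <- -40230.867
b_size <- 116.132
b_new  <- 57736.283

# Hypothetical helper: new is 1 for a new home, 0 otherwise
predict_additive <- function(size, new) {
  b0 + b_size * size + b_new * new
}

# The new vs. not-new gap does not depend on size in this model
gap_3000 <- predict_additive(3000, 1) - predict_additive(3000, 0)
gap_1500 <- predict_additive(1500, 1) - predict_additive(1500, 0)
c(gap_3000 = gap_3000, gap_1500 = gap_1500)
```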

Part D

Code
# Load the housing data from the smss package before fitting
data(house.selling.price)
# Size * New expands to Size + New + Size:New
model <- lm(Price ~ Size * New, data = house.selling.price)
summary(model)

The coefficients changed for both variables with the new model. The coefficient for size indicates that for each additional square foot, the price of a home that is not new is predicted to increase by 104.438. The coefficient for new is now -78,527.502, so the relationship is negative where in the previous model it was positive; with the interaction term in the model, this coefficient is the predicted difference between new and not new homes at a size of 0 square feet. The interaction coefficient is 61.916, meaning that for a new house each additional square foot is predicted to add a further 61.916 to the price, for a total of 166.354 per square foot. The size variable’s p-value is still below the 0.05 significance level. However, the p-value for the variable new is above the 0.05 significance level, so there is no longer statistically significant evidence that the mean prices of new and not new houses differ at a size of 0. The interaction term’s p-value is smaller than the 0.05 significance level, meaning that there is statistically significant evidence to reject the hypothesis that the effect of size on price is the same for new and not new houses.

Part E

Equation for a new house (x2 = 1): y = -22,227.808 + 104.438x1 - 78,527.502(1) + 61.916(x1)(1) = -100,755.310 + 166.354x1

Equation for a not new house (x2 = 0): y = -22,227.808 + 104.438x1

Again, I removed the last two terms of this equation because x2 equals 0 for a house that is not new, which cancels them out.
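As a sanity check, the sketch below folds the New and interaction terms into a single intercept and slope for new houses and confirms that the shortened line gives the same predictions as the full four-term equation. The coefficients are the ones quoted in Part D; the function and variable names are illustrative.

```r
# Coefficients quoted for the interaction model
b0     <- -22227.808
b_size <- 104.438
b_new  <- -78527.502
b_int  <- 61.916

# Full equation: new is 1 for a new house, 0 otherwise
predict_full <- function(size, new) {
  b0 + b_size * size + b_new * new + b_int * size * new
}

# Setting new = 1 folds the extra terms into one intercept and one slope
new_intercept <- b0 + b_new       # intercept for new houses
new_slope     <- b_size + b_int   # price per square foot for new houses

sizes <- c(1500, 3000)
all.equal(predict_full(sizes, 1), new_intercept + new_slope * sizes)
```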

Part F

Code
# New

-22227.808 + (104.438 * 3000) - (78527.502 * 1) + (61.916 * 3000 * 1)
[1] 398306.7

The predicted price for a new house with 3,000 square feet with the new equation is $398,306.70.

Code
# Not New

-22227.808 + (104.438 * 3000)
[1] 291086.2

The predicted price for a not new house with 3,000 square feet with the new equation is $291,086.19.

Part G

Code
# New

-22227.808 + (104.438 * 1500) - (78527.502 * 1) + (61.916 * 1500 * 1)
[1] 148775.7

The predicted price for a new house with 1,500 square feet is $148,775.70.

Code
# Not new

-22227.808 + (104.438 * 1500)
[1] 134429.2

The predicted price for a not new house with 1,500 square feet is $134,429.20.

Part H

I think the model with the interaction term better represents the relationship between size and price, both from the results in the summary and from my own limited knowledge about housing. The interaction term was statistically significant, which is evidence that the effect of size on price depends on whether or not the house is new. Also, I think when people buy homes they care both about size and about whether or not the house is new. Additionally, the interaction model’s predictions at 1,500 and 3,000 square feet show a dramatic difference: the additional 1,500 square feet raises the predicted price of a new house by about $249,531, compared with about $156,657 for a house that is not new.
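The comparison behind this argument can be laid out in one small table. The sketch below uses the interaction-model coefficients quoted in Part D to tabulate the predictions from Parts F and G and the new vs. not-new gap at each size; `predict_interaction` is a hypothetical helper name.

```r
# Interaction-model prediction; new is 1 for a new house, 0 otherwise
predict_interaction <- function(size, new) {
  -22227.808 + 104.438 * size - 78527.502 * new + 61.916 * size * new
}

sizes     <- c(1500, 3000)
new_price <- predict_interaction(sizes, 1)
old_price <- predict_interaction(sizes, 0)

comparison <- data.frame(
  size    = sizes,
  new     = new_price,
  not_new = old_price,
  gap     = new_price - old_price  # premium for a new house at each size
)
comparison
```

Unlike the model without the interaction term, the gap here grows with size, which is the pattern the interaction term is meant to capture.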