Homework 4 - Emily Duryea

The fourth homework assignment for DACSS 603
Author: Emily Duryea

Published: November 14, 2022

Homework 4

Question 1

For recent data in Jacksonville, Florida, on y = selling price of home (in dollars), x1 = size of home (in square feet), and x2 = lot size (in square feet), the prediction equation is 

ŷ = −10,536 + 53.8x1 + 2.84x2.

Code
# Importing needed libraries
library(readxl)
library(tidyverse)
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.3.6      ✔ purrr   0.3.5 
✔ tibble  3.1.8      ✔ dplyr   1.0.10
✔ tidyr   1.2.1      ✔ stringr 1.4.1 
✔ readr   2.1.3      ✔ forcats 0.5.2 
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
Code
library(ggplot2)
library(dplyr)
library(stringr)
library(alr4)
Loading required package: car
Loading required package: carData

Attaching package: 'car'

The following object is masked from 'package:dplyr':

    recode

The following object is masked from 'package:purrr':

    some

Loading required package: effects
lattice theme set by effectsTheme()
See ?effectsTheme for details.
Code
library(smss)
Warning: package 'smss' was built under R version 4.2.2

Part A

A particular home of 1240 square feet on a lot of 18,000 square feet sold for $145,000. Find the predicted selling price and the residual, and interpret.

Code
# Plugging given values into the equation to solve
-10536 + 53.8*1240 + 2.84*18000
[1] 107296
Code
# Calculating the residual
145000-107296
[1] 37704

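The predicted selling price of the house is $107,296. The residual is $145,000 - $107,296 = $37,704, meaning the house sold for $37,704 more than the model predicted; the prediction equation underestimates the price of this particular home.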
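The same calculation can be wrapped in a small helper, which makes the residual (observed minus predicted) explicit. This is only a minimal sketch: the function name predict_price_q1 is made up for illustration, and the coefficients are copied from the prediction equation given in the question.

```r
# Hypothetical helper: the Question 1 prediction equation as a function
predict_price_q1 <- function(home_sqft, lot_sqft) {
  -10536 + 53.8 * home_sqft + 2.84 * lot_sqft
}

predicted <- predict_price_q1(home_sqft = 1240, lot_sqft = 18000)
residual  <- 145000 - predicted  # observed price minus predicted price

predicted
residual
```

A positive residual means the observed price is above the fitted value; a negative one would mean the model overpredicted.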

Part B

For fixed lot size, how much is the house selling price predicted to increase for each square-foot increase in home size? Why?

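For a fixed lot size, the house selling price is predicted to increase by $53.80 for each one-square-foot increase in home size. This is the coefficient on x1 in the prediction equation, which gives the change in predicted price per unit change in home size while lot size is held constant.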

Part C

According to this prediction equation, for fixed home size, how much would lot size need to increase to have the same impact as a one-square-foot increase in home size?

Code
# Dividing the home size coefficient by the lot size coefficient
53.8/2.84
[1] 18.94366

For a fixed home size, the lot size would have to increase by about 18.94 square feet to have the same impact on the predicted price as a one-square-foot increase in home size.
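As a quick check of this ratio, the sketch below multiplies the lot size coefficient by the computed lot increase and confirms that it reproduces the home size coefficient. The variable names are illustrative; the two coefficients come from the prediction equation.

```r
b_home <- 53.8   # dollars per square foot of home size
b_lot  <- 2.84   # dollars per square foot of lot size

lot_equivalent <- b_home / b_lot          # lot increase matching 1 sq ft of home
lot_effect     <- b_lot * lot_equivalent  # predicted price change from that lot increase

lot_equivalent
lot_effect
```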

Question 2

(Data file: salary in alr4 R package). The data file concerns salary and other characteristics of all faculty in a small Midwestern college collected in the early 1980s for presentation in legal proceedings for which discrimination against women in salary was at issue. All persons in the data hold tenured or tenure track positions; temporary faculty are not included. The variables include degree, a factor with levels PhD and MS; rank, a factor with levels Asst, Assoc, and Prof; sex, a factor with levels Male and Female; Year, years in current rank; ysdeg, years since highest degree, and salary, academic year salary in dollars.

Part A

Test the hypothesis that the mean salary for men and women is the same, without regard to any other variable but sex. Explain your findings.

Code
# Importing the data
data(salary)

# Conducting a t-test to test hypothesis
t.test(salary ~ sex, data=salary)

    Welch Two Sample t-test

data:  salary by sex
t = 1.7744, df = 21.591, p-value = 0.09009
alternative hypothesis: true difference in means between group Male and group Female is not equal to 0
95 percent confidence interval:
 -567.8539 7247.1471
sample estimates:
  mean in group Male mean in group Female 
            24696.79             21357.14 

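Based on the results of the Welch t-test, the difference in mean salary between male and female faculty is not statistically significant at the 0.05 level, because the p-value (0.09009) is greater than 0.05. The sample means differ by about $3,340 (men higher), but the 95% confidence interval for the difference includes 0.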
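To see where the p-value comes from, it can be recomputed from the t statistic and degrees of freedom printed by t.test(). This is a sketch that uses only the rounded numbers from the output above, so it should agree with the reported 0.09009 only approximately; the variable names are illustrative.

```r
t_stat   <- 1.7744   # t statistic reported by t.test()
df_welch <- 21.591   # Welch degrees of freedom reported by t.test()

# Two-sided p-value: probability of a t statistic at least this extreme
p_value <- 2 * pt(-abs(t_stat), df_welch)

# Difference in sample means (male minus female)
mean_difference <- 24696.79 - 21357.14

p_value
mean_difference
```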

Part B

Run a multiple linear regression with salary as the outcome variable and everything else as predictors, including sex. Assuming no interactions between sex and the other predictors, obtain a 95% confidence interval for the difference in salary between males and females.

Code
# Creating the model
summary(lm(salary ~ degree + rank + sex + year + ysdeg, data=salary))

Call:
lm(formula = salary ~ degree + rank + sex + year + ysdeg, data = salary)

Residuals:
    Min      1Q  Median      3Q     Max 
-4045.2 -1094.7  -361.5   813.2  9193.1 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 15746.05     800.18  19.678  < 2e-16 ***
degreePhD    1388.61    1018.75   1.363    0.180    
rankAssoc    5292.36    1145.40   4.621 3.22e-05 ***
rankProf    11118.76    1351.77   8.225 1.62e-10 ***
sexFemale    1166.37     925.57   1.260    0.214    
year          476.31      94.91   5.018 8.65e-06 ***
ysdeg        -124.57      77.49  -1.608    0.115    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2398 on 45 degrees of freedom
Multiple R-squared:  0.855, Adjusted R-squared:  0.8357 
F-statistic: 44.24 on 6 and 45 DF,  p-value: < 2.2e-16
Code
# Putting model data into a variable
professor_salary <- lm(salary ~ degree + rank + sex + year + ysdeg, data=salary)

#Creating a confidence interval for the variables in the model
confint(professor_salary)
                 2.5 %      97.5 %
(Intercept) 14134.4059 17357.68946
degreePhD    -663.2482  3440.47485
rankAssoc    2985.4107  7599.31080
rankProf     8396.1546 13841.37340
sexFemale    -697.8183  3030.56452
year          285.1433   667.47476
ysdeg        -280.6397    31.49105

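The 95% confidence interval for the sexFemale coefficient is (-697.82, 3030.56). Holding degree, rank, year, and ysdeg constant, the difference in salary for females relative to males is estimated to lie between -$697.82 and $3,030.56. Because the interval contains 0, the difference is not statistically significant.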
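The interval reported by confint() can be rebuilt by hand as estimate ± t critical value × standard error. The sketch below uses the rounded estimate and standard error from the summary() output, so it matches the confint() bounds only up to rounding; the variable names are illustrative.

```r
estimate  <- 1166.37  # sexFemale coefficient from summary()
std_error <- 925.57   # its standard error
df_resid  <- 45       # residual degrees of freedom

# 95% confidence interval: estimate +/- t critical value * standard error
ci <- estimate + c(-1, 1) * qt(0.975, df_resid) * std_error
ci
```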

Part C

Interpret your finding for each predictor variable; discuss (a) statistical significance, (b) interpretation of the coefficient / slope in relation to the outcome variable and other variables

Degree

The coefficient is not statistically significant, as the p-value (0.180) is greater than 0.05. Holding the other predictors constant, faculty with a PhD are estimated to make $1,388.61 more than faculty whose highest degree is an MS.

rankAssoc

The coefficient is statistically significant, as the p-value is less than 0.05. Holding the other predictors constant, associate professors are estimated to make $5,292.36 more than assistant professors.

rankProf

The coefficient is statistically significant. Holding the other predictors constant, full professors are estimated to make $11,118.76 more than assistant professors.

Sex

The coefficient is not statistically significant (p = 0.214). Holding the other predictors constant, female faculty are estimated to make $1,166.37 more than male faculty.

Year

The coefficient is statistically significant. Holding the other predictors constant, each additional year in current rank is associated with a salary increase of $476.31.

Ysdeg

The coefficient is not statistically significant (p = 0.115). Holding the other predictors constant, each additional year since the highest degree is associated with a salary decrease of $124.57.

Part D

Change the baseline category for the rank variable. Interpret the coefficients related to rank again.

Code
# Changing the baseline category for the rank variable
salary$rank<- relevel(salary$rank, ref = 'Assoc')

# Redoing the model from Part B
summary(lm(salary ~ degree + rank + sex + year + ysdeg, data=salary))

Call:
lm(formula = salary ~ degree + rank + sex + year + ysdeg, data = salary)

Residuals:
    Min      1Q  Median      3Q     Max 
-4045.2 -1094.7  -361.5   813.2  9193.1 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 21038.41    1109.12  18.969  < 2e-16 ***
degreePhD    1388.61    1018.75   1.363    0.180    
rankAsst    -5292.36    1145.40  -4.621 3.22e-05 ***
rankProf     5826.40    1012.93   5.752 7.28e-07 ***
sexFemale    1166.37     925.57   1.260    0.214    
year          476.31      94.91   5.018 8.65e-06 ***
ysdeg        -124.57      77.49  -1.608    0.115    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2398 on 45 degrees of freedom
Multiple R-squared:  0.855, Adjusted R-squared:  0.8357 
F-statistic: 44.24 on 6 and 45 DF,  p-value: < 2.2e-16

The baseline category has been changed to “Assoc.” Holding the other predictors constant, assistant professors are estimated to make $5,292.36 less than associate professors, and full professors are estimated to make $5,826.40 more than associate professors. Both coefficients are still statistically significant (p-value < 0.05). The fitted model itself is unchanged; only the reference group for the rank comparisons differs.
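Because releveling only changes the reference group, the new rank coefficients and intercept can be derived from the Part B estimates by simple arithmetic. The sketch below uses the rounded Part B coefficients; the variable names are illustrative.

```r
# Coefficients from the Part B model (baseline rank: Asst)
intercept_asst <- 15746.05
assoc_vs_asst  <- 5292.36
prof_vs_asst   <- 11118.76

# Implied coefficients with Assoc as the baseline
intercept_assoc <- intercept_asst + assoc_vs_asst
asst_vs_assoc   <- -assoc_vs_asst
prof_vs_assoc   <- prof_vs_asst - assoc_vs_asst

c(intercept = intercept_assoc, rankAsst = asst_vs_assoc, rankProf = prof_vs_assoc)
```

These values agree with the (Intercept), rankAsst, and rankProf estimates in the releveled output.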

Part E

Finkelstein (1980), in a discussion of the use of regression in discrimination cases, wrote, "[a] variable may reflect a position or status bestowed by the employer, in which case if there is discrimination in the award of the position or status, the variable may be 'tainted.' " Thus, for example, if discrimination is at work in promotion of faculty to higher ranks, using rank to adjust salaries before comparing the sexes may not be acceptable to the courts. Exclude the variable rank, refit, and summarize how your findings changed, if they did.

Code
# Creating a model without rank
summary(lm(salary ~ degree + sex + year + ysdeg, data=salary))

Call:
lm(formula = salary ~ degree + sex + year + ysdeg, data = salary)

Residuals:
    Min      1Q  Median      3Q     Max 
-8146.9 -2186.9  -491.5  2279.1 11186.6 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 17183.57    1147.94  14.969  < 2e-16 ***
degreePhD   -3299.35    1302.52  -2.533 0.014704 *  
sexFemale   -1286.54    1313.09  -0.980 0.332209    
year          351.97     142.48   2.470 0.017185 *  
ysdeg         339.40      80.62   4.210 0.000114 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3744 on 47 degrees of freedom
Multiple R-squared:  0.6312,    Adjusted R-squared:  0.5998 
F-statistic: 20.11 on 4 and 47 DF,  p-value: 1.048e-09

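After removing rank from the model, the sign of the sex coefficient flips: females are estimated to make $1,286.54 less than males, holding degree, year, and ysdeg constant. However, this difference is not statistically significant (p = 0.332). The degree and ysdeg coefficients also change sign and become statistically significant, and the adjusted R-squared falls from 0.8357 to 0.5998.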

Part F

Everyone in this dataset was hired the year they earned their highest degree. It is also known that a new Dean was appointed 15 years ago, and everyone in the dataset who earned their highest degree 15 years ago or less than that has been hired by the new Dean. Some people have argued that the new Dean has been making offers that are a lot more generous to newly hired faculty than the previous one and that this might explain some of the variation in Salary.

Create a new variable that would allow you to test this hypothesis and run another multiple regression model to test this. Select variables carefully to make sure there is no multicollinearity. Explain why multicollinearity would be a concern in this case and how you avoided it. Do you find support for the hypothesis that the people hired by the new Dean are making higher than those that were not?

Code
# Creating a dummy variable for the hiring dean: "new" if ysdeg < 15, "old" if ysdeg >= 15
# (note: the question's wording would also count ysdeg == 15 as hired by the new Dean)
salary <- mutate(salary, dean = case_when(ysdeg < 15 ~ "new",
                                          ysdeg >= 15 ~ "old"))

# Rerunning the model with new variable
summary(lm(salary ~ dean + degree + sex + rank +year, data=salary))

Call:
lm(formula = salary ~ dean + degree + sex + rank + year, data = salary)

Residuals:
    Min      1Q  Median      3Q     Max 
-3588.0 -1532.2  -232.2   565.7  9132.5 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  20468.7      951.7  21.507  < 2e-16 ***
deanold      -2421.6     1187.9  -2.038   0.0474 *  
degreePhD     1073.5      843.3   1.273   0.2096    
sexFemale     1046.7      858.0   1.220   0.2289    
rankAsst     -5012.5     1002.3  -5.001 9.16e-06 ***
rankProf      6213.3     1045.0   5.946 3.76e-07 ***
year           450.7       81.5   5.530 1.55e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2360 on 45 degrees of freedom
Multiple R-squared:  0.8597,    Adjusted R-squared:  0.841 
F-statistic: 45.95 on 6 and 45 DF,  p-value: < 2.2e-16

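Because dean is derived directly from ysdeg, the two variables are highly correlated, so including both would create multicollinearity; ysdeg is therefore left out of this model. Based on the new variable and controlling for the other predictors, faculty hired by the old dean are estimated to make $2,421.60 less than faculty hired by the new dean. This difference is statistically significant (p = 0.0474), which supports the hypothesis that people hired by the new Dean are paid more.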
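The multicollinearity concern can be illustrated with a toy example: a dummy created by thresholding a numeric variable is strongly correlated with that variable. The data below are synthetic (evenly spaced values, not the salary data), and the names ending in _example are made up for illustration.

```r
# Synthetic illustration only: NOT the salary data
ysdeg_example    <- 0:35                                 # hypothetical years since degree
dean_old_example <- as.integer(ysdeg_example >= 15)      # 1 = hired by the old dean

# Correlation between the numeric variable and the dummy derived from it
cor_example <- cor(ysdeg_example, dean_old_example)
cor_example
```

With the real data, the same check could be done by comparing ysdeg and dean directly, or by computing variance inflation factors for a model that contains both.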

Question 3

(Data file: house.selling.price in smss R package)

Part A

Using the house.selling.price data, run and report regression results modeling y = selling price (in dollars) in terms of size of home (in square feet) and whether the home is new (1 = yes; 0 = no). In particular, for each variable, discuss statistical significance and interpret the meaning of the coefficient.

Code
# Importing needed data
data(house.selling.price)

# Creating model based on size and age status
summary(lm(Price ~ Size + New, data= house.selling.price))

Call:
lm(formula = Price ~ Size + New, data = house.selling.price)

Residuals:
    Min      1Q  Median      3Q     Max 
-205102  -34374   -5778   18929  163866 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -40230.867  14696.140  -2.738  0.00737 ** 
Size           116.132      8.795  13.204  < 2e-16 ***
New          57736.283  18653.041   3.095  0.00257 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 53880 on 97 degrees of freedom
Multiple R-squared:  0.7226,    Adjusted R-squared:  0.7169 
F-statistic: 126.3 on 2 and 97 DF,  p-value: < 2.2e-16

Based on the Size coefficient, holding new status constant, the predicted house price increases by about $116.13 for each additional square foot. This finding is statistically significant (p < 0.001).

Based on the New coefficient, holding size constant, a new house is predicted to sell for about $57,736.28 more than a house that is not new. This finding is also statistically significant (p = 0.00257).

Part B

Report and interpret the prediction equation, and form separate equations relating selling price to size for new and for not new homes.

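Prediction equation:

y = -40230.867 + 116.132*x1 + 57736.283*x2

x1 = size of the house (in square feet)

x2 = 1 if the house is new, 0 if it is not

Prediction equation for a new house (x2 = 1):

y = -40230.867 + 116.132*x1 + 57736.283 = 17505.416 + 116.132*x1

Prediction equation for an old house (x2 = 0):

y = -40230.867 + 116.132*x1

Both lines have the same slope, so a new house is predicted to sell for $57,736.28 more than an old house at every size.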

Part C

Find the predicted selling price for a home of 3000 square feet that is (i) new, (ii) not new.

Code
# Using the prediction model to predict the cost of a new house
-40230.867 + 116.132*3000 +  57736.283
[1] 365901.4
Code
# Using the prediction model to predict the cost of an old house
-40230.867 + 116.132*3000
[1] 308165.1

Based on the prediction model, a new house of 3,000 square feet would cost $365,901.40, while an old house of the same size would cost $308,165.10.
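Rather than retyping the equation for each case, the additive model can be written once as a function of size and the new indicator. This is a minimal sketch with coefficients copied from the regression output in Part A; the function name predict_price_additive is made up for illustration.

```r
# Hypothetical helper: additive model (no interaction) from Part A
predict_price_additive <- function(size, new) {
  -40230.867 + 116.132 * size + 57736.283 * new
}

# Predictions for a 3000 sq ft home: new (1) and not new (0)
additive_3000 <- predict_price_additive(size = 3000, new = c(1, 0))
additive_3000
```

With the fitted lm object available, predict() with a newdata data frame would give the same predictions without copying coefficients by hand.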

Part D

Fit another model, this time with an interaction term allowing interaction between size and new, and report the regression results

Code
# Creating a model for an interaction between size and new
summary(lm(Price ~ Size + New + Size*New, data=house.selling.price))

Call:
lm(formula = Price ~ Size + New + Size * New, data = house.selling.price)

Residuals:
    Min      1Q  Median      3Q     Max 
-175748  -28979   -6260   14693  192519 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -22227.808  15521.110  -1.432  0.15536    
Size           104.438      9.424  11.082  < 2e-16 ***
New         -78527.502  51007.642  -1.540  0.12697    
Size:New        61.916     21.686   2.855  0.00527 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 52000 on 96 degrees of freedom
Multiple R-squared:  0.7443,    Adjusted R-squared:  0.7363 
F-statistic: 93.15 on 3 and 96 DF,  p-value: < 2.2e-16

Based on this model, the prediction equation is y = -22227.808 + 104.438*x1 - 78527.502*x2 + 61.916*x1*x2, where x1 is the size of the house in square feet, x2 indicates whether the house is new (1 = yes; 0 = no), and x1*x2 is the interaction between size and new. Both size and the interaction between size and new are statistically significant. However, the main effect of new is no longer statistically significant (p = 0.127).

Part E

Report the lines relating the predicted selling price to the size for homes that are (i) new, (ii) not new.

Prediction equation for a new house:

y = -22227.808 + 166.354*x1 - 78527.502 = -100755.31 + 166.354*x1

Prediction equation for an old house:

y = -22227.808 + 104.438*x1
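The two lines follow directly from the coefficient table in Part D: for a new home, the New coefficient is added to the intercept and the Size:New coefficient is added to the slope. The sketch below does that arithmetic with the reported estimates (intercept -22227.808); the variable names are illustrative.

```r
# Coefficients from the interaction model output in Part D
b0     <- -22227.808  # (Intercept)
b_size <- 104.438     # Size
b_new  <- -78527.502  # New
b_int  <- 61.916      # Size:New

# Intercept and slope of the price-vs-size line for each group
line_not_new <- c(intercept = b0, slope = b_size)
line_new     <- c(intercept = b0 + b_new, slope = b_size + b_int)

rbind(not_new = line_not_new, new = line_new)
```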

Part F

Find the predicted selling price for a home of 3000 square feet that is (i) new, (ii) not new.

Code
# Using the prediction model to predict the cost of a new house
-22227.808 + 166.354*3000 - 78527.502
[1] 398306.7
Code
# Using the prediction model to predict the cost of an old house
-22227.808 + 104.438*3000
[1] 291086.2

Based on the new prediction model, a new house of 3,000 square feet would cost $398,306.70, while an old house of the same size would cost $291,086.20.

Part G

Find the predicted selling price for a home of 1500 square feet that is (i) new, (ii) not new. Comparing to (F), explain how the difference in predicted selling prices changes as the size of home increases.

Code
# Using the prediction model to predict the cost of a new house
-22227.808 + 166.354*1500 - 78527.502
[1] 148775.7
Code
# Using the prediction model to predict the cost of an old house
-22227.808 + 104.438*1500
[1] 134429.2

Based on the new prediction model, a new house of 1,500 square feet would cost $148,775.70, while an old house of the same size would cost $134,429.20. The predicted gap between new and old houses is about $14,346 at 1,500 square feet, compared with about $107,220 at 3,000 square feet. The difference grows by $61.92 (the interaction coefficient) for each additional square foot, so new vs. old matters much more for larger houses than for smaller ones.
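The new-minus-old difference in the interaction model depends only on the New and Size:New coefficients, so it can be written as a function of size. The sketch below uses the estimates from the Part D output; the names price_gap and break_even_size are made up for illustration.

```r
# Predicted price of a new home minus a not-new home of the same size,
# using the New and Size:New coefficients from the interaction model
price_gap <- function(size) {
  -78527.502 + 61.916 * size
}

gaps <- price_gap(c(1500, 3000))
gaps

# Size at which the model predicts the same price for new and not-new homes
break_even_size <- 78527.502 / 61.916
break_even_size
```

Below the break-even size the model predicts that a new home sells for less than a comparable old one, which is worth keeping in mind when judging predictions for small houses.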

Part H

Do you think the model with interaction or the one without it represents the relationship of size and new to the outcome price? What makes you prefer one model over another?

Code
# Calculating predicted house cost of a large new house with first model
-40230.867 + 116.132*5000 +  57736.283
[1] 598165.4
Code
# Calculating predicted house cost of a large old house with first model
-40230.867 + 116.132*5000
[1] 540429.1
Code
# Calculating predicted house cost of a small new house with first model
-40230.867 + 116.132*1000 +  57736.283
[1] 133637.4
Code
# Calculating predicted house cost of a small old house with the first model
-40230.867 + 116.132*1000
[1] 75901.13
Code
# Calculating predicted house cost of a large new house with second model
-22227.808 + 166.354*5000 - 78527.502
[1] 731014.7
Code
# Calculating predicted house cost of a large old house with second model
-22227.808 + 104.438*5000
[1] 499962.2
Code
# Calculating predicted house cost of a small new house with second model
-22227.808 + 166.354*1000 - 78527.502
[1] 65598.69
Code
# Calculating predicted house cost of a small old house with the second model
-22227.808 + 104.438*1000
[1] 82210.19

I think that it depends. For large houses, I would use the second model, because it captures the larger price gap between new and old houses. The second model also has a statistically significant interaction term and a slightly higher adjusted R-squared (0.7363 vs. 0.7169). However, for smaller houses the second model is questionable, since it predicts that a small new house would be cheaper than a small old house. I think the first model would be a better predictor for small houses.
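One more way to compare the two models is a partial F-test for the added interaction term. The sketch below approximates it from the residual standard errors and degrees of freedom printed in the two summaries; because those standard errors are rounded, the result is only approximate and should land close to the square of the interaction t value (2.855). The variable names are illustrative.

```r
# Residual standard errors and degrees of freedom from the two summaries
rse_additive    <- 53880; df_additive    <- 97
rse_interaction <- 52000; df_interaction <- 96

# Residual sums of squares implied by the (rounded) residual standard errors
rss_additive    <- rse_additive^2 * df_additive
rss_interaction <- rse_interaction^2 * df_interaction

# Partial F statistic for adding the Size:New term
f_stat <- ((rss_additive - rss_interaction) / (df_additive - df_interaction)) /
  (rss_interaction / df_interaction)
p_value_f <- pf(f_stat, df_additive - df_interaction, df_interaction,
                lower.tail = FALSE)

f_stat
p_value_f
```

With both fitted lm objects saved, anova() on the two models would give the exact version of this test.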