Homework Assignment 4 - Darron Bunt
Author

Darron Bunt

Published

May 7, 2023

Code
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.2     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.2     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Code
library(alr4)
Loading required package: car
Loading required package: carData

Attaching package: 'car'

The following object is masked from 'package:dplyr':

    recode

The following object is masked from 'package:purrr':

    some

Loading required package: effects
lattice theme set by effectsTheme()
See ?effectsTheme for details.
Code
library(smss)
library(ggplot2)

Question 1

For recent data in Jacksonville, Florida, on y = selling price of home (in dollars), x1 = size of home (in square feet), and x2 = lot size (in square feet), the prediction equation is ŷ = −10,536 + 53.8x1 + 2.84x2.

A

A particular home of 1240 square feet on a lot of 18,000 square feet sold for $145,000. Find the predicted selling price and the residual, and interpret.

Code
# Find the predicted selling price and the residual
# ŷ = −10,536 + 53.8x1 + 2.84x2, where x1 = 1,240 and x2 = 18,000
PredPrice <- -10536 + (53.8*1240) + (2.84*18000)
PredPrice
[1] 107296
Code
# Calculate the residual - actual selling price - predicted selling price
ActualPrice <- 145000
Residual <- ActualPrice - PredPrice
Residual 
[1] 37704

The predicted selling price is $107,296 and the residual is $37,704.

The residual is positive, so this home sold for $37,704 more than the model predicts for a home with its size and lot size.
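As a minimal sketch, the prediction equation can be wrapped in a small function so the same calculation can be reused for other homes. The function name `predict_price` and its argument names are illustrative assumptions, not part of the assignment.

```r
# Hypothetical helper: the Question 1 prediction equation as a function
predict_price <- function(home_sqft, lot_sqft) {
  -10536 + 53.8 * home_sqft + 2.84 * lot_sqft
}

predicted <- predict_price(1240, 18000)  # predicted selling price
residual  <- 145000 - predicted          # observed price minus predicted price
c(predicted = predicted, residual = residual)
```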

B

For fixed lot size, how much is the house selling price predicted to increase for each square-foot increase in home size? Why?

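For a fixed lot size, each one-square-foot increase in home size is predicted to increase the sale price by $53.80, which is the coefficient of x1. Because lot size is held constant, the 2.84x2 term does not change, so the only change in ŷ comes from the 53.8x1 term.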

If we increase the square footage of the example house in question by one square foot, we can observe this difference in the predicted price.

Code
# Calculate predicted house price for a 1,241 square foot home
House2 <- -10536 + (53.8*1241) + (2.84*18000)
House2
[1] 107349.8
Code
# Calculate the difference in predicted price for a 1,241 square foot home vs. a 1,240 square foot home
House2Diff <- House2 - PredPrice
House2Diff
[1] 53.8

C

According to this prediction equation, for fixed home size, how much would lot size need to increase to have the same impact as a one-square-foot increase in home size?

Based on this prediction equation, every square-foot increase in home size increases the predicted price by $53.80 (the coefficient of x1) and each square-foot increase in lot size increases the predicted price by $2.84 (the coefficient of x2).

In order to calculate how much lot size would need to increase in order to have the same impact as a one-square-foot increase in home size, we need to see how many times $2.84 goes into $53.80.

Code
# Divide $53.80 by $2.84
HowManyFeet <- 53.80 / 2.84
HowManyFeet
[1] 18.94366

Lot size would need to increase 18.94 square feet in order to have the same impact on predicted selling price as a one square foot increase in home size.
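One way to sanity-check this result is to multiply the lot-size increase back by the lot-size coefficient and confirm that it reproduces the home-size coefficient. The variable names below are illustrative assumptions.

```r
# Coefficients from the Question 1 prediction equation
b_home <- 53.8   # dollars per square foot of home size
b_lot  <- 2.84   # dollars per square foot of lot size

# Lot-size increase that matches one extra square foot of home size
lot_increase <- b_home / b_lot

# Predicted price change from that lot-size increase
effect_of_lot_increase <- b_lot * lot_increase
all.equal(effect_of_lot_increase, b_home)
```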

Question 2

(Data file: salary in alr4 R package).

The data file concerns salary and other characteristics of all faculty in a small Midwestern college collected in the early 1980s for presentation in legal proceedings for which discrimination against women in salary was at issue. All persons in the data hold tenured or tenure track positions; temporary faculty are not included. The variables include degree, a factor with levels PhD and MS; rank, a factor with levels Asst, Assoc, and Prof; sex, a factor with levels Male and Female; Year, years in current rank; ysdeg, years since highest degree, and salary, academic year salary in dollars.

Code
# Load salary data file
data(salary)

A

Test the hypothesis that the mean salary for men and women is the same, without regard to any other variable but sex. Explain your findings.

Null hypothesis: That the mean salary for men and women, without regard to any other variable but sex, is the same.

Alternative hypothesis: That the mean salary for men and women, without regard to any other variable but sex, is not the same.

In order to test the hypothesis I am going to run a t-test to determine whether the difference in means is statistically significant.

Code
# Run a two-sample t-test
t.test(formula = salary ~ sex, data = salary)

    Welch Two Sample t-test

data:  salary by sex
t = 1.7744, df = 21.591, p-value = 0.09009
alternative hypothesis: true difference in means between group Male and group Female is not equal to 0
95 percent confidence interval:
 -567.8539 7247.1471
sample estimates:
  mean in group Male mean in group Female 
            24696.79             21357.14 

The t-test shows a difference in sample means between men ($24,696.79) and women ($21,357.14), but with a p-value of 0.09009 we fail to reject the null hypothesis at the 5% significance level. That is to say, we do not have the evidence that we need to reject the hypothesis that the mean salary for men and women, without regard to any other variable but sex, is the same. Consistent with this, the 95% confidence interval for the difference in means (-567.85, 7,247.15) contains 0.
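The reported p-value can be reproduced from the test statistic and the Welch degrees of freedom printed in the output. This is a sketch that uses the rounded values from the output, so the result only matches to a few decimal places; the variable names are illustrative.

```r
# Values copied from the Welch two-sample t-test output
t_stat   <- 1.7744
df_welch <- 21.591

# Two-sided p-value: probability of a t statistic at least this extreme
p_two_sided <- 2 * pt(-abs(t_stat), df = df_welch)

# Difference in sample means (male minus female)
mean_diff <- 24696.79 - 21357.14

c(p_value = p_two_sided, mean_diff = mean_diff)
```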

B

Run a multiple linear regression with salary as the outcome variable and everything else as predictors, including sex. Assuming no interactions between sex and the other predictors, obtain a 95% confidence interval for the difference in salary between males and females.

Code
# Run multiple linear regression with salary as outcome variable and everything else (including sex) as predictors
lm_salary <- lm(salary ~ ., data = salary)
summary(lm_salary)

Call:
lm(formula = salary ~ ., data = salary)

Residuals:
    Min      1Q  Median      3Q     Max 
-4045.2 -1094.7  -361.5   813.2  9193.1 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 15746.05     800.18  19.678  < 2e-16 ***
degreePhD    1388.61    1018.75   1.363    0.180    
rankAssoc    5292.36    1145.40   4.621 3.22e-05 ***
rankProf    11118.76    1351.77   8.225 1.62e-10 ***
sexFemale    1166.37     925.57   1.260    0.214    
year          476.31      94.91   5.018 8.65e-06 ***
ysdeg        -124.57      77.49  -1.608    0.115    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2398 on 45 degrees of freedom
Multiple R-squared:  0.855, Adjusted R-squared:  0.8357 
F-statistic: 44.24 on 6 and 45 DF,  p-value: < 2.2e-16
Code
# Obtain 95% confidence interval for the difference in salary between males and females.
confint(lm_salary, "sexFemale", level = 0.95)
              2.5 %   97.5 %
sexFemale -697.8183 3030.565

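The 95% confidence interval for the sexFemale coefficient, that is, the difference in salary between females and males after adjusting for the other predictors, is (-697.82, 3,030.57). Because the interval contains 0, the adjusted difference is not statistically significant at the 5% level.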
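The interval returned by `confint()` is the estimate plus or minus a t critical value times the standard error. A minimal sketch using the rounded estimate, standard error, and residual degrees of freedom from the regression summary (variable names are illustrative):

```r
# Values copied from the summary(lm_salary) output for sexFemale
est      <- 1166.37
se       <- 925.57
df_resid <- 45

# Critical value for a two-sided 95% interval
t_crit <- qt(0.975, df = df_resid)

# Estimate +/- critical value * standard error
ci <- est + c(-1, 1) * t_crit * se
ci
```

Because the inputs are rounded, the result agrees with the `confint()` output only to within a few cents.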

C

Interpret your finding for each predictor variable; discuss (a) statistical significance, (b) interpretation of the coefficient / slope in relation to the outcome variable and other variables

  • Degree
    • Controlling for everything else, a professor with a PhD makes $1,388.61 more than a professor with only a master’s. The p-value is 0.180, so the result is not statistically significant.
  • Rank
    • The p-values for rankAssoc and rankProf are both less than 0.05, indicating a statistically significant relationship between salary and rank. Controlling for everything else, a professor at the Associate rank makes $5,292.36 more than one at the rank of Assistant, and a full professor makes $11,118.76 more.
  • Sex
    • Controlling for everything else, a female professor makes $1,166 more than a male professor. With a p-value of 0.214, this is not a statistically significant result.
  • Year
    • Year, the years a professor has in their current rank, also had a statistically significant impact on salary (p < 0.001). Controlling for everything else, each additional year in current rank is associated with $476.31 more in salary.
  • Years Since Highest Degree
    • The result here was not statistically significant but suggests that, when controlling for all other factors, each year that has passed since one obtained their degree results in a $124.57 decrease in salary.

Overall, of the predictor variables considered, rank and year have a statistically significant impact on salary.

D

Change the baseline category for the rank variable. Interpret the coefficients related to rank again.

Code
# Change the baseline category for rank to "Prof"
salary$rank <- relevel(salary$rank, ref = "Prof")
summary(lm(salary ~ ., data = salary))

Call:
lm(formula = salary ~ ., data = salary)

Residuals:
    Min      1Q  Median      3Q     Max 
-4045.2 -1094.7  -361.5   813.2  9193.1 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  26864.81    1375.29  19.534  < 2e-16 ***
degreePhD     1388.61    1018.75   1.363    0.180    
rankAsst    -11118.76    1351.77  -8.225 1.62e-10 ***
rankAssoc    -5826.40    1012.93  -5.752 7.28e-07 ***
sexFemale     1166.37     925.57   1.260    0.214    
year           476.31      94.91   5.018 8.65e-06 ***
ysdeg         -124.57      77.49  -1.608    0.115    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2398 on 45 degrees of freedom
Multiple R-squared:  0.855, Adjusted R-squared:  0.8357 
F-statistic: 44.24 on 6 and 45 DF,  p-value: < 2.2e-16
Code
# Change the baseline category for rank to "Assoc"
salary$rank <- relevel(salary$rank, ref = "Assoc")
summary(lm(salary ~ ., data = salary))

Call:
lm(formula = salary ~ ., data = salary)

Residuals:
    Min      1Q  Median      3Q     Max 
-4045.2 -1094.7  -361.5   813.2  9193.1 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 21038.41    1109.12  18.969  < 2e-16 ***
degreePhD    1388.61    1018.75   1.363    0.180    
rankProf     5826.40    1012.93   5.752 7.28e-07 ***
rankAsst    -5292.36    1145.40  -4.621 3.22e-05 ***
sexFemale    1166.37     925.57   1.260    0.214    
year          476.31      94.91   5.018 8.65e-06 ***
ysdeg        -124.57      77.49  -1.608    0.115    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2398 on 45 degrees of freedom
Multiple R-squared:  0.855, Adjusted R-squared:  0.8357 
F-statistic: 44.24 on 6 and 45 DF,  p-value: < 2.2e-16

Interpret the coefficients related to rank again.

No matter which category is the baseline for rank, the results remain statistically significant.

  • With professor as the baseline, we see that assistant professors are expected to make $11,118.76 less and associate professors $5,826.40 less.

  • With associate professor as the baseline, we see that full professors are expected to make $5,826.40 more, while assistant professors are expected to make $5,292.36 less.

  • And, as we calculated previously, with assistant as the baseline, a professor at the Associate rank makes $5,292.36 more than one at the rank of Assistant, and a full professor makes $11,118.76 more.

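The estimated differences between any two ranks always remain the same; it’s the same information, just presented in slightly different ways depending on which rank is acting as the baseline category. Only the intercept and the signs of the rank coefficients change, and the other coefficients and the overall model fit are unaffected.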
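The link between the three parameterizations can be checked with simple arithmetic on the coefficients from the original (Asst baseline) model. This sketch uses the rounded values from the output; the variable names are illustrative.

```r
# Coefficients from the model with Asst as the baseline
intercept_asst <- 15746.05
assoc_vs_asst  <- 5292.36
prof_vs_asst   <- 11118.76

# Prof vs. Assoc is the difference of the two contrasts against Asst
prof_vs_assoc <- prof_vs_asst - assoc_vs_asst

# Releveling moves the intercept to the new baseline category
intercept_prof  <- intercept_asst + prof_vs_asst
intercept_assoc <- intercept_asst + assoc_vs_asst

c(prof_vs_assoc = prof_vs_assoc,
  intercept_prof = intercept_prof,
  intercept_assoc = intercept_assoc)
```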

E

Finkelstein (1980), in a discussion of the use of regression in discrimination cases, wrote, “a variable may reflect a position or status bestowed by the employer, in which case if there is discrimination in the award of the position or status, the variable may be ‘tainted.’” Thus, for example, if discrimination is at work in promotion of faculty to higher ranks, using rank to adjust salaries before comparing the sexes may not be acceptable to the courts.

Exclude the variable rank, refit, and summarize how your findings changed, if they did.

Code
# Run multiple linear regression again, but exclude the variable "rank"
lm_salary_NR <- lm(salary ~ degree + sex + year + ysdeg, data = salary)
summary(lm_salary_NR)

Call:
lm(formula = salary ~ degree + sex + year + ysdeg, data = salary)

Residuals:
    Min      1Q  Median      3Q     Max 
-8146.9 -2186.9  -491.5  2279.1 11186.6 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 17183.57    1147.94  14.969  < 2e-16 ***
degreePhD   -3299.35    1302.52  -2.533 0.014704 *  
sexFemale   -1286.54    1313.09  -0.980 0.332209    
year          351.97     142.48   2.470 0.017185 *  
ysdeg         339.40      80.62   4.210 0.000114 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3744 on 47 degrees of freedom
Multiple R-squared:  0.6312,    Adjusted R-squared:  0.5998 
F-statistic: 20.11 on 4 and 47 DF,  p-value: 1.048e-09

Summarize how your findings changed, if they did.

  • With rank included as a predictor variable, rank and year had a statistically significant impact on salary.

  • With rank excluded, degree, year, and ysdeg had a statistically significant impact on salary. The model also fits less well: adjusted R-squared drops from 0.8357 to 0.5998.

    • Degree
      • Without the “rank” variable, “degreePhD” becomes a statistically significant predictor variable. When controlling for all other factors, a PhD (vs. a master’s) now results in a $3,299.35 decrease in salary (much different than with rank included, where there was no statistical significance, but a PhD increased salary by $1,388.61).
    • Years Since Highest Degree
      • Without the “rank” variable, “ysdeg” (years since receiving one’s highest degree) becomes a statistically significant predictor variable. When controlling for all other factors, each year that has passed now results in a $339.40 increase in salary (much different than with rank included, where there was no statistical significance and a decrease).
    • Year
      • “year” remains statistically significant, though the p-value has increased. When controlling for all other factors, each additional year in current rank adds $351.97 in salary.
    • Sex
      • The result for sex was still not statistically significant, but its estimated effect on salary did change from a positive one ($1,166.37) to a negative one (-$1,286.54).
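Although the assignment does not ask for it, the two models are nested, so the contribution of rank can be gauged with a partial F-test. On the actual data this would normally be done by passing both fitted models to `anova()`; the sketch below instead approximates the test from the rounded residual standard errors and degrees of freedom in the two summaries, so the result is approximate. Variable names are illustrative.

```r
# Residual sums of squares recovered from residual standard error^2 * df
rss_full    <- 2398^2 * 45   # model with rank
rss_reduced <- 3744^2 * 47   # model without rank

# rank contributes two coefficients, hence 2 numerator degrees of freedom
df_diff <- 47 - 45

# Partial F statistic and its upper-tail p-value
f_stat  <- ((rss_reduced - rss_full) / df_diff) / (rss_full / 45)
p_value <- pf(f_stat, df_diff, 45, lower.tail = FALSE)

c(F = f_stat, p_value = p_value)
```

A large F statistic here would indicate that rank explains a substantial share of salary variation beyond the other predictors, which is consistent with the drop in R-squared when it is removed.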

F

Everyone in this dataset was hired the year they earned their highest degree. It is also known that a new Dean was appointed 15 years ago, and everyone in the dataset who earned their highest degree 15 years ago or less than that has been hired by the new Dean. Some people have argued that the new Dean has been making offers that are a lot more generous to newly hired faculty than the previous one and that this might explain some of the variation in Salary.

Create a new variable that would allow you to test this hypothesis and run another multiple regression model to test this. Select variables carefully to make sure there is no multicollinearity. Explain why multicollinearity would be a concern in this case and how you avoided it. Do you find support for the hypothesis that the people hired by the new Dean are making higher than those that were not?

Code
# Create a new variable that indicates whether the professor was hired by the new Dean or not
# ysdeg <= 15: hired by the new Dean; ysdeg >= 16: hired by the previous Dean
salary$newDean <- ifelse(salary$ysdeg <= 15, 0, 1)
# 0 = WAS hired by the new Dean; 1 = was NOT hired by the new Dean

Multicollinearity refers to a situation where two or more predictor variables in a regression model are highly correlated with each other - that is, they are measuring either the same or similar things. If variables that are highly correlated with each other are included in the same regression model, the coefficient estimates become unstable and their standard errors inflate, which can lead to unreliable results.

In our case, newDean is highly correlated with ysdeg - the numerical data from ysdeg was used to create the binary variable newDean (0 if ysdeg is 15 or less; 1 if ysdeg is 16 or more). Accordingly, when we run our regression analysis, we need to include newDean, but exclude ysdeg.
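To illustrate why a dichotomized variable is strongly correlated with its source, the sketch below applies the same cutoff to a hypothetical, evenly spaced set of ysdeg values. These are not the salary data (the vector `ysdeg_grid` is an assumption made purely for illustration), so the exact correlation in the real data will differ.

```r
# Hypothetical years-since-degree values, NOT the actual salary data
ysdeg_grid <- 1:35

# Same coding as in the assignment: 0 = hired by new Dean, 1 = not
new_dean_flag <- ifelse(ysdeg_grid <= 15, 0, 1)

# Correlation between the numeric variable and its binary version
r <- cor(ysdeg_grid, new_dean_flag)
r
```

With the real data, the same check would be a correlation between `salary$ysdeg` and `salary$newDean`.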

Code
# Run regression analysis with newDean included as a predictor variable
lm_salary_NewDean <- lm(salary ~ degree + rank + sex + year + newDean, data = salary)
summary(lm_salary_NewDean)

Call:
lm(formula = salary ~ degree + rank + sex + year + newDean, data = salary)

Residuals:
    Min      1Q  Median      3Q     Max 
-3403.3 -1387.0  -167.0   528.2  9233.8 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 20464.50     952.41  21.487  < 2e-16 ***
degreePhD     818.93     797.48   1.027   0.3100    
rankProf     6124.28    1028.58   5.954 3.65e-07 ***
rankAsst    -4972.66     997.17  -4.987 9.61e-06 ***
sexFemale     907.14     840.54   1.079   0.2862    
year          434.85      78.89   5.512 1.65e-06 ***
newDean     -2163.46    1072.04  -2.018   0.0496 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2362 on 45 degrees of freedom
Multiple R-squared:  0.8594,    Adjusted R-squared:  0.8407 
F-statistic: 45.86 on 6 and 45 DF,  p-value: < 2.2e-16

Do you find support for the hypothesis that the people hired by the new Dean are making higher than those that were not?

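Yes, there is some support for the hypothesis that people hired by the new Dean are making more than those who were not. Because newDean is coded 1 for faculty who were NOT hired by the new Dean, the coefficient of -2,163.46 means that, controlling for degree, rank, sex, and year, faculty hired by the previous Dean are predicted to make $2,163.46 less than those hired by the new Dean. The p-value (0.0496) is significant at the 5% level, though only barely, so the evidence is not strong.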

Question 3

(Data file: house.selling.price in smss R package)

Code
# Load data file
data("house.selling.price")

A

Using the house.selling.price data, run and report regression results modeling y = selling price (in dollars) in terms of size of home (in square feet) and whether the home is new (1 = yes; 0 = no). In particular, for each variable; discuss statistical significance and interpret the meaning of the coefficient.

Code
# Regression results modeling Size, New as predictor variables for Price
Price_SzNew <- lm(Price ~ Size + New, data = house.selling.price)
summary(Price_SzNew)

Call:
lm(formula = Price ~ Size + New, data = house.selling.price)

Residuals:
    Min      1Q  Median      3Q     Max 
-205102  -34374   -5778   18929  163866 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -40230.867  14696.140  -2.738  0.00737 ** 
Size           116.132      8.795  13.204  < 2e-16 ***
New          57736.283  18653.041   3.095  0.00257 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 53880 on 97 degrees of freedom
Multiple R-squared:  0.7226,    Adjusted R-squared:  0.7169 
F-statistic: 126.3 on 2 and 97 DF,  p-value: < 2.2e-16

For each variable, discuss statistical significance and interpret the meaning of the coefficient.

  • Size
    • Size had a statistically significant impact on selling price; controlling for newness, each additional square foot added $116.13 to the sale price.
  • New
    • New also had a statistically significant impact on selling price; controlling for square footage, a new house sold for $57,736.28 more than houses that were not new.

B

Report and interpret the prediction equation, and form separate equations relating selling price to size for new and for not new homes.

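The prediction equation is ŷ = −40,230.867 + 116.132(Size) + 57,736.283(New).

If a house is new (New = 1): ŷ = −40,230.867 + 116.132(Size) + 57,736.283 = 17,505.416 + 116.132(Size)

If a house is not new (New = 0): ŷ = −40,230.867 + 116.132(Size)

New is a binary (0/1) variable, so the two lines have the same slope and differ only in their intercepts: a new home is predicted to sell for $57,736.28 more than a home of the same size that is not new.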
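A short sketch of the no-interaction model as a function makes the parallel-lines property concrete: the predicted gap between new and not new homes is the same at every size. The function name `price_no_interaction` and the example sizes are illustrative assumptions.

```r
# Hypothetical helper: prediction equation from the model without interaction
price_no_interaction <- function(size, new) {
  -40230.867 + 116.132 * size + 57736.283 * new
}

sizes <- c(1500, 3000)  # example home sizes in square feet

# Predicted price gap (new minus not new) at each size
new_minus_not_new <- price_no_interaction(sizes, 1) - price_no_interaction(sizes, 0)
new_minus_not_new
```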

C

Find the predicted selling price for a home of 3000 square feet that is (i) new, (ii) not new.

New: ŷ = −40,230.867 + (116.132 * 3000) + 57,736.283

Not new: ŷ = −40,230.867 + (116.132 * 3000)

Code
New3000 <- -40230.867 + (116.132*3000) + 57736.283
NotNew3000 <- -40230.867 + (116.132*3000)
New3000
[1] 365901.4
Code
NotNew3000
[1] 308165.1

The predicted selling price for a new 3,000 square foot home is $365,901.40. The predicted selling price for a 3,000 square foot home that is not new is $308,165.10.

D

Fit another model, this time with an interaction term allowing interaction between size and new, and report the regression results

Code
# Fit a model allowing interaction between size and new
Price_SzNewInt <- lm(Price ~ Size + New + Size:New, data = house.selling.price)
summary(Price_SzNewInt)

Call:
lm(formula = Price ~ Size + New + Size:New, data = house.selling.price)

Residuals:
    Min      1Q  Median      3Q     Max 
-175748  -28979   -6260   14693  192519 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -22227.808  15521.110  -1.432  0.15536    
Size           104.438      9.424  11.082  < 2e-16 ***
New         -78527.502  51007.642  -1.540  0.12697    
Size:New        61.916     21.686   2.855  0.00527 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 52000 on 96 degrees of freedom
Multiple R-squared:  0.7443,    Adjusted R-squared:  0.7363 
F-statistic: 93.15 on 3 and 96 DF,  p-value: < 2.2e-16

E

Report the lines relating the predicted selling price to the size for homes that are (i) new, (ii) not new.

ŷ = −22,227.808 + 104.438(Size) − 78,527.502(New) + 61.916(Size*New)

New (New = 1): ŷ = −22,227.808 − 78,527.502 + (104.438 + 61.916)(Size) = −100,755.310 + 166.354(Size)

Not new (New = 0): ŷ = −22,227.808 + 104.438(Size)
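The intercept and slope of each line follow from setting New to 1 or 0 in the fitted equation. A minimal sketch using the rounded coefficients from the output (the vector `b` and the names of its elements are illustrative):

```r
# Coefficients from the interaction model
b <- c(intercept = -22227.808, size = 104.438, new = -78527.502, size_new = 61.916)

# New = 1: the New term shifts the intercept, the interaction term shifts the slope
new_line <- c(intercept = unname(b["intercept"] + b["new"]),
              slope     = unname(b["size"] + b["size_new"]))

# New = 0: both New terms drop out
not_new_line <- c(intercept = unname(b["intercept"]),
                  slope     = unname(b["size"]))

rbind(new = new_line, not_new = not_new_line)
```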

F

Find the predicted selling price for a home of 3000 square feet that is (i) new, (ii) not new.

Code
NewInt3000 <- -22227.808 + (104.438*3000) -78527.502 + (61.916*3000)
NotNewInt3000 <- -22227.808 + (104.438*3000)
NewInt3000
[1] 398306.7
Code
NotNewInt3000
[1] 291086.2

The predicted selling price for a new, 3000 square-foot home is $398,306.70. The predicted selling price for a 3000 square-foot home that is not new is $291,086.20.

G

Find the predicted selling price for a home of 1500 square feet that is (i) new, (ii) not new. Comparing to (F), explain how the difference in predicted selling prices changes as the size of home increases.

Code
NewInt1500 <- -22227.808 + (104.438*1500) -78527.502 + (61.916*1500)
NotNewInt1500 <- -22227.808 + (104.438*1500)
NewInt1500
[1] 148775.7
Code
NotNewInt1500
[1] 134429.2

Comparing to (F), explain how the difference in predicted selling prices changes as the size of home increases.

  • For 3000 square foot homes:
    • New: $398,306.70
    • Not New: $291,086.20
  • For 1500 square foot homes:
    • New: $148,775.70
    • Not New: $134,429.20

There is a $107,220.50 difference in predicted sale price between the new and not new 3000 square foot homes, but only a $14,346.50 difference between the new and not new 1500 square foot homes. Given the structure of the prediction equation, this makes sense: the predicted gap between a new and a not new home of the same size is −78,527.502 + 61.916(Size), so it grows by about $61.92 for every additional square foot. Below roughly 1,268 square feet the gap is negative (a new home is predicted to sell for less than a not new home of the same size), and above that point new homes carry a premium that keeps increasing with size. Each additional square foot adds $166.354 to the predicted price of a new home, compared with $104.438 for a home that is not new.
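The new versus not new gap under the interaction model depends only on the New and Size:New coefficients, which can be sketched as a function of size. The names `price_gap` and `break_even` are illustrative assumptions.

```r
# Predicted price gap (new minus not new) as a function of size
price_gap <- function(size) {
  -78527.502 + 61.916 * size
}

# Gap at the two sizes considered in parts F and G
gaps <- price_gap(c(1500, 3000))

# Size at which new and not new homes have the same predicted price
break_even <- 78527.502 / 61.916

list(gaps = gaps, break_even = break_even)
```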

H

Do you think the model with interaction or the one without it represents the relationship of size and new to the outcome price? What makes you prefer one model over another?

I think the model with the interaction more closely represents the relationship of size and newness to the outcome price. The interaction term (Size:New) is statistically significant (p = 0.00527), which indicates that the effect of size on price differs between new and not new homes; further, the adjusted R-squared for the model with interaction (0.7363) is higher than that for the model without (0.7169), and its residual standard error is lower (52,000 vs. 53,880).
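The adjusted R-squared values compared above can be reproduced from each model's multiple R-squared, the sample size (100 homes, from the residual degrees of freedom plus the number of estimated coefficients), and the number of predictors. This sketch uses the rounded R-squared values from the output, so the results agree only approximately; `adj_r2` is an illustrative helper name.

```r
# Adjusted R-squared penalizes R-squared for the number of predictors p
adj_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

adj_no_interaction <- adj_r2(0.7226, n = 100, p = 2)  # Size + New
adj_interaction    <- adj_r2(0.7443, n = 100, p = 3)  # Size + New + Size:New

c(no_interaction = adj_no_interaction, interaction = adj_interaction)
```

Because adjusted R-squared rises even after the penalty for the extra term, the interaction is adding explanatory value rather than just flexibility.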