hw4
linear regression
multiple linear regression
Multiple linear regression for DACSS 603.
Author

Miguel Curiel

Published

April 25, 2023

Code
# load necessary packages
library(tidyverse)
library(alr4)
library(smss)

Question 1

For recent data in Jacksonville, Florida, on y = selling price of home (in dollars), x1 = size of home (in square feet), and x2 = lot size (in square feet), the prediction equation is

ŷ = −10,536 + 53.8x1 + 2.84x2.

  1. A particular home of 1240 square feet on a lot of 18,000 square feet sold for $145,000. Find the predicted selling price and the residual, and interpret.

    1. If ŷ = −10,536 + 53.8x1 + 2.84x2, then substituting the given values gives ŷ = -10536 + (53.8*1240) + (2.84*18000) = $107,296. The residual is 145,000 − 107,296 = $37,704, meaning the home sold for $37,704 more than the model predicts (the arithmetic is verified in the short R check after this list).
  2. For fixed lot size, how much is the house selling price predicted to increase for each square-foot increase in home size? Why?

    1. Selling price is predicted to increase by $53.80 for each square-foot increase in home size, because 53.8 is the coefficient on x1: it captures the effect of home size on selling price while holding lot size constant.
  3. According to this prediction equation, for fixed home size, how much would lot size need to increase to have the same impact as a one-square-foot increase in home size?

    1. Lot size would need to increase by about 18.94 square feet to have the same impact as a one-square-foot increase in home size. This is found by dividing the coefficient of x1 by the coefficient of x2: 53.8 / 2.84 = 18.94366.
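
These calculations can be verified directly in R. A minimal sketch, assuming only the coefficients of the prediction equation above; the expected results are shown as comments.

Code
# coefficients from the prediction equation
b0 <- -10536
b1 <- 53.8  # per square foot of home size
b2 <- 2.84  # per square foot of lot size

# part 1: predicted price and residual for a 1240 sq ft home on an 18,000 sq ft lot
predicted <- b0 + b1 * 1240 + b2 * 18000
residual <- 145000 - predicted
c(predicted = predicted, residual = residual)  # 107296, 37704

# part 3: square feet of lot equivalent to one square foot of home
b1 / b2  # 18.94366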

Question 2

(Data file: salary in alr4 R package). The data file concerns salary and other characteristics of all faculty in a small Midwestern college collected in the early 1980s for presentation in legal proceedings for which discrimination against women in salary was at issue. All persons in the data hold tenured or tenure track positions; temporary faculty are not included. The variables include degree, a factor with levels PhD and MS; rank, a factor with levels Asst, Assoc, and Prof; sex, a factor with levels Male and Female; Year, years in current rank; ysdeg, years since highest degree, and salary, academic year salary in dollars.

Code
data("salary", package = "alr4")
head(salary)
   degree rank    sex year ysdeg salary
1 Masters Prof   Male   25    35  36350
2 Masters Prof   Male   13    22  35350
3 Masters Prof   Male   10    23  28200
4 Masters Prof Female    7    27  26775
5     PhD Prof   Male   19    30  33696
6 Masters Prof   Male   16    21  28516
  1. Test the hypothesis that the mean salary for men and women is the same, without regard to any other variable but sex. Explain your findings.

    1. Since we are comparing the mean of a numeric variable (salary) between two groups (male and female), we can run a two-sample t-test. As the results below show, the mean salary for men is higher, but the p-value (0.09) exceeds the conventional 0.05 threshold, so we fail to reject the null hypothesis. In other words, we do not have enough evidence to conclude that mean salaries differ between men and women.
Code
# extract the salary data and sex variable
salary <- alr4::salary
sex <- salary$sex

# calculate the mean salaries for men and women
male_salaries <- salary$salary[sex == "Male"]
female_salaries <- salary$salary[sex == "Female"]
mean_male_salary <- mean(male_salaries)
mean_female_salary <- mean(female_salaries)

# perform the t-test
t_test <- t.test(male_salaries, female_salaries)

# print the results
cat("Mean salary for men:", mean_male_salary, "\n")
Mean salary for men: 24696.79 
Code
cat("Mean salary for women:", mean_female_salary, "\n")
Mean salary for women: 21357.14 
Code
cat("p-value:", t_test$p.value, "\n")
p-value: 0.09009406 
  2. Run a multiple linear regression with salary as the outcome variable and everything else as predictors, including sex. Assuming no interactions between sex and the other predictors, obtain a 95% confidence interval for the difference in salary between males and females.
Code
# fit a multiple linear regression model
model <- lm(salary ~ ., data = salary)

# print the model summary
summary(model)

Call:
lm(formula = salary ~ ., data = salary)

Residuals:
    Min      1Q  Median      3Q     Max 
-4045.2 -1094.7  -361.5   813.2  9193.1 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 15746.05     800.18  19.678  < 2e-16 ***
degreePhD    1388.61    1018.75   1.363    0.180    
rankAssoc    5292.36    1145.40   4.621 3.22e-05 ***
rankProf    11118.76    1351.77   8.225 1.62e-10 ***
sexFemale    1166.37     925.57   1.260    0.214    
year          476.31      94.91   5.018 8.65e-06 ***
ysdeg        -124.57      77.49  -1.608    0.115    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2398 on 45 degrees of freedom
Multiple R-squared:  0.855, Adjusted R-squared:  0.8357 
F-statistic: 44.24 on 6 and 45 DF,  p-value: < 2.2e-16
Code
# obtain a 95% confidence interval for the difference in salary between males and females
confint(model, "sexFemale", level = 0.95)
              2.5 %   97.5 %
sexFemale -697.8183 3030.565
  3. Interpret your finding for each predictor variable; discuss (a) statistical significance, (b) interpretation of the coefficient / slope in relation to the outcome variable and other variables.

    1. Statistical significance of each variable. Several variables are statistically significant, in particular rank and year. This makes sense: people in higher-ranking positions generally earn more, and more years in rank should correlate with higher salary. We had already determined that sex does not play a significant role, but it is interesting that neither degree level nor years since highest degree is significant either.

    2. Coefficient / slope of each predictor variable in relation to the outcome variable and other variables. Two variables draw immediate attention: rank and years since highest degree. The rank coefficients are large and positive, meaning rank plays an important role in increasing salary. ysdeg, on the other hand, has a negative coefficient, meaning that, holding rank and years in rank constant, each additional year since the highest degree is associated with slightly lower salary. The remaining variables (sex, degree, and years in rank) have positive coefficients, though none as large as rank’s.

  4. Change the baseline category for the rank variable. Interpret the coefficients related to rank again.

    1. After experimenting with the three possible baselines, the overall fit does not change (same R-squared and residuals), but the rank coefficients do: they now measure differences relative to the new baseline. For example, with “Prof” as the baseline, Asst and Assoc have negative coefficients, meaning assistants and associates are predicted to earn less than professors, holding the other variables constant; the check after the model output below shows how the two sets of coefficients relate.
Code
# change the baseline category for the rank variable to "Prof"
salary$rank <- relevel(salary$rank, ref = "Prof")

# fit a new multiple linear regression model with a different baseline category for rank
model2 <- lm(salary ~ ., data = salary)

# print the model summary
summary(model2)

Call:
lm(formula = salary ~ ., data = salary)

Residuals:
    Min      1Q  Median      3Q     Max 
-4045.2 -1094.7  -361.5   813.2  9193.1 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  26864.81    1375.29  19.534  < 2e-16 ***
degreePhD     1388.61    1018.75   1.363    0.180    
rankAsst    -11118.76    1351.77  -8.225 1.62e-10 ***
rankAssoc    -5826.40    1012.93  -5.752 7.28e-07 ***
sexFemale     1166.37     925.57   1.260    0.214    
year           476.31      94.91   5.018 8.65e-06 ***
ysdeg         -124.57      77.49  -1.608    0.115    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2398 on 45 degrees of freedom
Multiple R-squared:  0.855, Adjusted R-squared:  0.8357 
F-statistic: 44.24 on 6 and 45 DF,  p-value: < 2.2e-16
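
As a check on the interpretation in part 4, the releveled coefficients are simple combinations of the original ones. A minimal sketch using the two fitted models; the expected values, taken from the summaries above, are shown as comments.

Code
# Asst vs Prof is the negative of Prof vs Asst
-coef(model)["rankProf"]                            # -11118.76, matches rankAsst in model2
# Assoc vs Prof = (Assoc vs Asst) - (Prof vs Asst)
coef(model)["rankAssoc"] - coef(model)["rankProf"]  # -5826.40, matches rankAssoc in model2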
  5. Finkelstein (1980), in a discussion of the use of regression in discrimination cases, wrote, “[a] variable may reflect a position or status bestowed by the employer, in which case if there is discrimination in the award of the position or status, the variable may be ‘tainted.’” Thus, for example, if discrimination is at work in promotion of faculty to higher ranks, using rank to adjust salaries before comparing the sexes may not be acceptable to the courts. (Exclude the variable rank, refit, and summarize how your findings changed, if they did.)

    1. This results in significant changes: in the new model, every included variable is statistically significant. Being female (p = 0.032) and holding a PhD (p = 0.002) are both associated with lower salary, while years since highest degree is the only variable with a positive influence on salary.
Code
# fit a new multiple linear regression model without rank
model3 <- lm(salary ~ sex + degree + ysdeg, data = salary)

# print the model summary
summary(model3)

Call:
lm(formula = salary ~ sex + degree + ysdeg, data = salary)

Residuals:
    Min      1Q  Median      3Q     Max 
-8328.5 -2621.9  -864.5  2987.3 11025.5 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  18325.0     1105.3  16.580  < 2e-16 ***
sexFemale    -2730.2     1236.8  -2.207  0.03210 *  
degreePhD    -4228.4     1311.7  -3.224  0.00228 ** 
ysdeg          476.0       61.7   7.716 5.94e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3937 on 48 degrees of freedom
Multiple R-squared:  0.5833,    Adjusted R-squared:  0.5572 
F-statistic: 22.39 on 3 and 48 DF,  p-value: 3.272e-09
  6. Everyone in this dataset was hired the year they earned their highest degree. It is also known that a new Dean was appointed 15 years ago, and everyone in the dataset who earned their highest degree 15 years ago or less than that has been hired by the new Dean. Some people have argued that the new Dean has been making offers that are a lot more generous to newly hired faculty than the previous one and that this might explain some of the variation in Salary. (Create a new variable that would allow you to test this hypothesis and run another multiple regression model to test this. Select variables carefully to make sure there is no multicollinearity. Explain why multicollinearity would be a concern in this case and how you avoided it. Do you find support for the hypothesis that the people hired by the new Dean are making more than those that were not?)

    1. I created a new binary variable distinguishing people hired by the new Dean from those hired before. To check for multicollinearity, I computed a correlation matrix, and, as expected, years since highest degree is correlated with the new variable; this makes sense because people with more recent degrees are exactly the ones hired by the new Dean, so including both predictors would make their coefficients unstable. I therefore removed “ysdeg” from the model (see the VIF sketch after the model output for a complementary check). After fitting, most coefficients remain roughly the same, and the newDean indicator is not statistically significant, so there is no strong support for the hypothesis that people hired by the new Dean earn more.
Code
# Create a new variable indicating whether the person was hired by the new Dean;
# everyone was hired the year of their highest degree, so ysdeg <= 15 means
# hired within the new Dean's 15-year tenure
salary$newDean <- ifelse(salary$ysdeg <= 15, 1, 0)
Code
# Calculate the correlation matrix
cor(salary[, c("salary", "ysdeg", "newDean")])
            salary      ysdeg    newDean
salary   1.0000000  0.6748542 -0.3246869
ysdeg    0.6748542  1.0000000 -0.2931746
newDean -0.3246869 -0.2931746  1.0000000
Code
# Fit the multiple regression model
model4 <- lm(salary ~ sex + rank + degree + year + newDean, data=salary)

# Check the summary of the model
summary(model4)

Call:
lm(formula = salary ~ sex + rank + degree + year + newDean, data = salary)

Residuals:
    Min      1Q  Median      3Q     Max 
-3514.8 -1641.7  -263.6   895.5  8867.9 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  22870.5     2218.1  10.311 1.98e-13 ***
sexFemale      550.0      838.5   0.656    0.515    
rankAsst     -9198.9      948.3  -9.700 1.34e-12 ***
rankAssoc    -5121.3      963.7  -5.314 3.21e-06 ***
degreePhD      163.7      785.8   0.208    0.836    
year           478.2      106.0   4.512 4.58e-05 ***
newDean       1931.1     1514.0   1.275    0.209    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2423 on 45 degrees of freedom
Multiple R-squared:  0.8521,    Adjusted R-squared:  0.8323 
F-statistic:  43.2 on 6 and 45 DF,  p-value: < 2.2e-16
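
As a complementary multicollinearity check, to supplement the correlation matrix above, variance inflation factors can be computed with car::vif (the car package is a dependency of alr4 and is loaded with it). A minimal sketch; note that for factor predictors such as rank, vif() reports generalized VIFs.

Code
# variance inflation factors for model4; values near 1 indicate little
# collinearity, while values above roughly 5 would be a concern
car::vif(model4)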

Question 3

(Data file: house.selling.price in smss R package)

Code
data("house.selling.price", package = "smss")
prices <- house.selling.price
head(prices)
  case Taxes Beds Baths New  Price Size
1    1  3104    4     2   0 279900 2048
2    2  1173    2     1   0 146500  912
3    3  3076    4     2   0 237700 1654
4    4  1608    3     2   0 200000 2068
5    5  1454    3     3   0 159900 1477
6    6  2997    3     2   1 499900 3153
  1. Using the house.selling.price data, run and report regression results modeling y = selling price (in dollars) in terms of size of home (in square feet) and whether the home is new (1 = yes; 0 = no). In particular, for each variable; discuss statistical significance and interpret the meaning of the coefficient.

    1. After running the model, both variables are statistically significant, although the evidence is much stronger for size than for new status (p < 2e-16 vs. p = 0.00257). The coefficient, however, is much larger for new status than for size (57,736 vs. 116): each additional square foot adds about $116 to the predicted price, while being new adds about $57,736 at any given size.
Code
# Fit the multiple regression model
model5 <- lm(Price ~ Size + New, data=prices)

# Check the summary of the model
summary(model5)

Call:
lm(formula = Price ~ Size + New, data = prices)

Residuals:
    Min      1Q  Median      3Q     Max 
-205102  -34374   -5778   18929  163866 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -40230.867  14696.140  -2.738  0.00737 ** 
Size           116.132      8.795  13.204  < 2e-16 ***
New          57736.283  18653.041   3.095  0.00257 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 53880 on 97 degrees of freedom
Multiple R-squared:  0.7226,    Adjusted R-squared:  0.7169 
F-statistic: 126.3 on 2 and 97 DF,  p-value: < 2.2e-16
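
A minimal sketch using predict() on model5 to produce the fitted values used in the next two parts; the expected results, matching the hand calculations below, are shown as comments.

Code
# predicted prices for a 3000 sq ft home that is new (1) and not new (0)
newdata <- data.frame(Size = c(3000, 3000), New = c(1, 0))
predict(model5, newdata = newdata)  # 365901.4, 308165.1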
  2. Report and interpret the prediction equation, and form separate equations relating selling price to size for new and for not new homes.

    1. The prediction equation is ŷ = -40230.867 + 116.132*size + 57736.283*new. Setting new = 1 and new = 0 gives two parallel lines: for new homes, ŷ = 17505.416 + 116.132*size, and for not-new homes, ŷ = -40230.867 + 116.132*size. The slopes are identical, so a new home is predicted to sell for $57,736 more than a not-new home of the same size.
  3. Find the predicted selling price for a home of 3000 square feet that is (i) new, (ii) not new.

    1. Substituting the given values into the equation above: ŷ(new) = -40230.867 + 116.132*3000 + 57736.283*1 and ŷ(not new) = -40230.867 + 116.132*3000 + 57736.283*0. This yields a predicted selling price of $365,901.40 for the new home and $308,165.10 for the not-new home, matching the predict() check above.
  4. Fit another model, this time with an interaction term allowing interaction between size and new, and report the regression results.

Code
# Fit the multiple regression model
model6 <- lm(Price ~ Size * New, data=prices)

# Check the summary of the model
summary(model6)

Call:
lm(formula = Price ~ Size * New, data = prices)

Residuals:
    Min      1Q  Median      3Q     Max 
-175748  -28979   -6260   14693  192519 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -22227.808  15521.110  -1.432  0.15536    
Size           104.438      9.424  11.082  < 2e-16 ***
New         -78527.502  51007.642  -1.540  0.12697    
Size:New        61.916     21.686   2.855  0.00527 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 52000 on 96 degrees of freedom
Multiple R-squared:  0.7443,    Adjusted R-squared:  0.7363 
F-statistic: 93.15 on 3 and 96 DF,  p-value: < 2.2e-16
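
The same predict() check for the interaction model, covering the 3000- and 1500-square-foot predictions worked out in the following parts; expected values are shown as comments.

Code
# predicted prices for new and not-new homes of 3000 and 1500 sq ft
newdata <- data.frame(Size = c(3000, 3000, 1500, 1500), New = c(1, 0, 1, 0))
predict(model6, newdata = newdata)  # 398306.7, 291086.2, 148775.7, 134429.2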
  5. Report the lines relating the predicted selling price to the size for homes that are (i) new (ii) not new.

    1. The new equation is ŷ = -22227.808 + 104.438*size - 78527.502*new + 61.916*size*new. Setting new = 1 and collecting terms gives the line for new homes, ŷ = -100755.31 + 166.354*size; setting new = 0 gives the line for not-new homes, ŷ = -22227.808 + 104.438*size. With the interaction, the two lines differ in slope as well as intercept.
  6. Find the predicted selling price for a home of 3000 square feet that is (i) new, (ii) not new.

    1. Replacing the given values in the previous equation, we would have the two following formulas: ŷ(new)= -22227.808 + 104.438*3000 - 78527.502*1 + 61.916*3000*1 and ŷ(not new)= -22227.808 + 104.438*3000 - 78527.502*0 + 61.916*3000*0. This results in a selling price of $398,306.7 for the new house and $291,086.2 for one that is not new.
  7. Find the predicted selling price for a home of 1500 square feet that is (i) new, (ii) not new. Comparing to part 6, explain how the difference in predicted selling prices changes as the size of home increases.

    1. Following the same steps: ŷ(new) = -22227.808 + 104.438*1500 - 78527.502*1 + 61.916*1500*1 = $148,775.70 and ŷ(not new) = -22227.808 + 104.438*1500 = $134,429.20. Compared to the prices in part 6, the gap between new and not-new homes widens as size increases: about $14,346 at 1,500 square feet versus about $107,221 at 3,000 square feet. This is because the interaction term adds $61.92 per square foot to the slope for new homes, so the difference grows linearly with size (difference = -78527.502 + 61.916*size).
  8. Do you think the model with interaction or the one without it better represents the relationship of size and new to the outcome price? What makes you prefer one model over another?

    1. The model with interaction has the higher R-squared, but R-squared alone is not a reliable basis for choosing, since it can only increase as terms are added. Comparing across other metrics below (root mean squared error, mean absolute error, and the Akaike information criterion), the model with interaction performs better on all of them. The previous parts also show that it captures a real feature of the data: the premium for a new home grows with size. Therefore, I would choose the model with interaction.
Code
# Create a function to calculate the evaluation metrics
eval_metrics <- function(model) {
  # Calculate the R-squared value
  rsq <- summary(model)$r.squared
  
  # Calculate the RMSE value
  predicted <- predict(model, newdata = prices)
  actual <- prices$Price
  rmse <- sqrt(mean((predicted - actual)^2))
  
  # Calculate the MAE value
  mae <- mean(abs(predicted - actual))
  
  # Calculate the AIC value
  aic <- AIC(model)
  
  # Return a data frame with the evaluation metrics
  data.frame(R2 = rsq, RMSE = rmse, MAE = mae, AIC = aic)
}

# Calculate the evaluation metrics for each model
metrics1 <- eval_metrics(model5)
metrics2 <- eval_metrics(model6)

# Combine the metrics into a table
metrics_table <- rbind(metrics1, metrics2)
rownames(metrics_table) <- c("Model without Interaction"
                             , "Model with Interaction")

# Print the table
metrics_table
                                 R2     RMSE      MAE      AIC
Model without Interaction 0.7225963 53066.58 38015.03 2467.648
Model with Interaction    0.7443085 50947.53 35200.58 2461.498