Code
library(tidyverse)
library(alr4)
library(smss)
Laura Collazo
April 18, 2023
For this question the prediction equation is Price = -10,536 + 53.8HomeSize + 2.84LotSize
.
When HomeSize = 1240 and LotSize= 18,000, the predicted Price is:
Since this home actually sold for $145,000, the residual is:
When the lot size remains fixed, the price is predicted to increase $53.80 for every one-square foot increase in size.
Given this same equation, if home size remains fixed, the lot size would need to increase by the below in order to have the same impact on price as a one-square foot increase in home size:
This question uses the “salary” data from the alr4 package to examine salary and characteristics of faculty in the early 1980s at a small Mid-West college.
The below tests the hypothesis that mean salary for men and women is the same.
Call:
lm(formula = salary ~ sex, data = salary)
Residuals:
Min 1Q Median 3Q Max
-8602.8 -4296.6 -100.8 3513.1 16687.9
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 24697 938 26.330 <2e-16 ***
sexFemale -3340 1808 -1.847 0.0706 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 5782 on 50 degrees of freedom
Multiple R-squared: 0.0639, Adjusted R-squared: 0.04518
F-statistic: 3.413 on 1 and 50 DF, p-value: 0.0706
This model does not allow the null hypothesis to be rejected as the p-value is not less than 0.05. The adjusted R-squared is also low and indicates this model explains only 4.52% of the variation between salary
and sex
. This model also shows that being female results in being paid $3,340 less per year than male faculty.
The below model adds in degree
, rank
, year
, and ysdeg
as additional predictors to the regression model.
Call:
lm(formula = salary ~ sex + degree + rank + year + ysdeg, data = salary)
Residuals:
Min 1Q Median 3Q Max
-4045.2 -1094.7 -361.5 813.2 9193.1
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 15746.05 800.18 19.678 < 2e-16 ***
sexFemale 1166.37 925.57 1.260 0.214
degreePhD 1388.61 1018.75 1.363 0.180
rankAssoc 5292.36 1145.40 4.621 3.22e-05 ***
rankProf 11118.76 1351.77 8.225 1.62e-10 ***
year 476.31 94.91 5.018 8.65e-06 ***
ysdeg -124.57 77.49 -1.608 0.115
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2398 on 45 degrees of freedom
Multiple R-squared: 0.855, Adjusted R-squared: 0.8357
F-statistic: 44.24 on 6 and 45 DF, p-value: < 2.2e-16
The 95% confidence interval for the difference in salary between males and females is below.
This section interprets the findings for each predictor variable in the above model.
The variable sex
is not statistically significant in this model. It indicates that when all predictors are held constant, females are paid $1,166.37 more per year more than males.
The variable degree
is not statistically significant in this model. It indicates that when all predictors are held constant, those with a PhD earn $1,388.61 more than those with a Master’s.
The variable rank
is statistically significant in this model. It indicates that when all predictors are held constant, faculty with a title of Assoc earn $5,292.36 more per year than those with a title of Asst and those with a title of Prof earn $11,118.76 more per year than those with a title of Prof.
The variable year
is statistically significant in this model. It indicates that when all predictors are held constant, for every year increase in faculty’s current rank, salary increases by $476.31.
The variable ysdeg
is not statistically significant in this model. It indicates that when all predictors are held constant, for every year increase since the highest degree was earned, salary decreases by -$124.57.
Below the model is updated so that the baseline category for rank
is “Prof.”
Call:
lm(formula = salary ~ sex + degree + rank + year + ysdeg, data = salary)
Residuals:
Min 1Q Median 3Q Max
-4045.2 -1094.7 -361.5 813.2 9193.1
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 26864.81 1375.29 19.534 < 2e-16 ***
sexFemale 1166.37 925.57 1.260 0.214
degreePhD 1388.61 1018.75 1.363 0.180
rankAsst -11118.76 1351.77 -8.225 1.62e-10 ***
rankAssoc -5826.40 1012.93 -5.752 7.28e-07 ***
year 476.31 94.91 5.018 8.65e-06 ***
ysdeg -124.57 77.49 -1.608 0.115
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2398 on 45 degrees of freedom
Multiple R-squared: 0.855, Adjusted R-squared: 0.8357
F-statistic: 44.24 on 6 and 45 DF, p-value: < 2.2e-16
This change to the model does not affect its fit or the coefficients. Notice that rankAsst shows a decrease of -$11,118.76 in salary whereas in the previous model rankProf was an increase of $11,118.76 in salary.
This next model removes the variable rank
from the model.
Call:
lm(formula = salary ~ sex + degree + year + ysdeg, data = salary)
Residuals:
Min 1Q Median 3Q Max
-8146.9 -2186.9 -491.5 2279.1 11186.6
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 17183.57 1147.94 14.969 < 2e-16 ***
sexFemale -1286.54 1313.09 -0.980 0.332209
degreePhD -3299.35 1302.52 -2.533 0.014704 *
year 351.97 142.48 2.470 0.017185 *
ysdeg 339.40 80.62 4.210 0.000114 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3744 on 47 degrees of freedom
Multiple R-squared: 0.6312, Adjusted R-squared: 0.5998
F-statistic: 20.11 on 4 and 47 DF, p-value: 1.048e-09
Removing rank
leads to ysdeg
and degree
now being statistically significant in the model. However, the adjusted R-square is lower than in the previous 2 models and the residual standard error is greater making this model not the best fit.
This final model creates a new variable new_hire
using ysdeg
. Those who were hired within 15 years or less were coded as 1 and everyone else 0. To avoid multicollinearity, the variable year was removed from the model. This is because it’s possible the years in faculty’s current rank are the same as years since the highest degree was earned (or in other words, years since hired).
Call:
lm(formula = salary ~ sex + degree + new_hire, data = salary)
Residuals:
Min 1Q Median 3Q Max
-8260.4 -3557.7 -462.6 3563.2 12098.5
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 28663 1155 24.821 < 2e-16 ***
sexFemale -2716 1433 -1.896 0.064 .
degreePhD -1227 1372 -0.895 0.375
new_hire -7418 1306 -5.679 7.74e-07 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4558 on 48 degrees of freedom
Multiple R-squared: 0.4416, Adjusted R-squared: 0.4067
F-statistic: 12.65 on 3 and 48 DF, p-value: 3.231e-06
This model shows the null hypothesis should be rejected and indicates that faculty hired by the new dean are actually making a lower salary than those who were hired more than 15 years ago.
This questions uses the dataset house.selling.price from the package smss.
Call:
lm(formula = Price ~ Size + New, data = house.selling.price)
Residuals:
Min 1Q Median 3Q Max
-205102 -34374 -5778 18929 163866
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -40230.867 14696.140 -2.738 0.00737 **
Size 116.132 8.795 13.204 < 2e-16 ***
New 57736.283 18653.041 3.095 0.00257 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 53880 on 97 degrees of freedom
Multiple R-squared: 0.7226, Adjusted R-squared: 0.7169
F-statistic: 126.3 on 2 and 97 DF, p-value: < 2.2e-16
This first model examines how the size of a house and being new or not influences price. It reveals the variables are statistically significant. Furthermore, we learn that a 1 unit change in size leads to a $116.13 increase in price and when a house is new, it will cost $57,736.28 more than a house that is old when the size is held constant.
The equation for the predicted selling price when the home is new is: price = -40230.867 + 116.132Size + 57736.283New
The predicted selling price for a home of 3000 square feed that is new is below.
The predicted selling price for a home of 3000 square feed that is not new is below.
The next model includes an interaction term between size and new.
Call:
lm(formula = Price ~ Size * New, data = house.selling.price)
Residuals:
Min 1Q Median 3Q Max
-175748 -28979 -6260 14693 192519
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -22227.808 15521.110 -1.432 0.15536
Size 104.438 9.424 11.082 < 2e-16 ***
New -78527.502 51007.642 -1.540 0.12697
Size:New 61.916 21.686 2.855 0.00527 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 52000 on 96 degrees of freedom
Multiple R-squared: 0.7443, Adjusted R-squared: 0.7363
F-statistic: 93.15 on 3 and 96 DF, p-value: < 2.2e-16
The predicted selling price, using the model with interaction terms, for a home of 3000 square feed that is new is below.
The predicted selling price, using the model with interaction terms, for a home of 3000 square feed that is not new is below.
The predicted selling price, using the model with interaction terms, for a home of 1500 square feed that is new is below.
The predicted selling price, using the model with interaction terms, for a home of 1500 square feed that is not new is below.
1
134429.8
In comparing the predictions for part F and G, it can be observed that the difference in selling price between a new and not new home increases as the the size of the home increases.
I believe the model with the interaction term best represents the relationship of size and new to the outcome price. I’ve come to this conclusion as the model with the interaction term has a higher adjusted R-squared and lower residual standard error than the model without the interaction term.
---
title: "DACSS 603 Homework 4"
author: "Laura Collazo"
description: "Homework 4 for DACSS 603"
date: "04/18/2023"
format:
html:
toc: true
code-fold: true
code-copy: true
code-tools: true
categories:
- hw4
- regression
---
```{r, echo=T, warning = FALSE, message = FALSE}
library(tidyverse)
library(alr4)
library(smss)
```
# Question 1
For this question the prediction equation is `Price = -10,536 + 53.8HomeSize + 2.84LotSize`.
## a
When HomeSize = 1240 and LotSize= 18,000, the predicted Price is:
```{r, echo=T}
sum(-10,536 + (53.8*1240) + (2.84*1800))
```
Since this home actually sold for $145,000, the residual is:
```{r, echo=T}
sum(72350-145000)
```
## b
When the lot size remains fixed, the price is predicted to increase $53.80 for every one-square foot increase in size.
## c
Given this same equation, if home size remains fixed, the lot size would need to increase by the below in order to have the same impact on price as a one-square foot increase in home size:
```{r, echo=T}
sum(53.8/2.84)
```
# Question 2
This question uses the "salary" data from the alr4 package to examine salary and characteristics of faculty in the early 1980s at a small Mid-West college.
```{r, echo=T}
data("salary")
```
## a
The below tests the hypothesis that mean salary for men and women is the same.
```{r, echo=T}
fit_2a <- lm(salary ~ sex, data = salary)
summary(fit_2a)
```
This model does not allow the null hypothesis to be rejected as the p-value is not less than 0.05. The adjusted R-squared is also low and indicates this model explains only 4.52% of the variation between `salary` and `sex`. This model also shows that being female results in being paid $3,340 less per year than male faculty.
## b
The below model adds in `degree`, `rank`, `year`, and `ysdeg` as additional predictors to the regression model.
```{r, echo=T}
fit_2b <- lm(salary ~ sex + degree + rank + year + ysdeg, data = salary)
summary(fit_2b)
```
The 95% confidence interval for the difference in salary between males and females is below.
```{r}
confint(fit_2b)
```
## c
This section interprets the findings for each predictor variable in the above model.
### sex
The variable `sex` is not statistically significant in this model. It indicates that when all predictors are held constant, females are paid $1,166.37 more per year more than males.
### degree
The variable `degree` is not statistically significant in this model. It indicates that when all predictors are held constant, those with a PhD earn $1,388.61 more than those with a Master's.
### rank
The variable `rank` is statistically significant in this model. It indicates that when all predictors are held constant, faculty with a title of Assoc earn $5,292.36 more per year than those with a title of Asst and those with a title of Prof earn $11,118.76 more per year than those with a title of Prof.
### year
The variable `year` is statistically significant in this model. It indicates that when all predictors are held constant, for every year increase in faculty's current rank, salary increases by $476.31.
### ysdeg
The variable `ysdeg` is not statistically significant in this model. It indicates that when all predictors are held constant, for every year increase since the highest degree was earned, salary decreases by -$124.57.
## d
Below the model is updated so that the baseline category for `rank` is "Prof."
```{r}
salary$rank <- relevel(salary$rank, ref = "Prof")
fit_2d <- lm(salary ~ sex + degree + rank + year + ysdeg, data = salary)
summary(fit_2d)
```
This change to the model does not affect its fit or the coefficients. Notice that rankAsst shows a decrease of -$11,118.76 in salary whereas in the previous model rankProf was an increase of $11,118.76 in salary.
## e
This next model removes the variable `rank` from the model.
```{r}
fit_2e <- lm(salary ~ sex + degree + year + ysdeg, data = salary)
summary(fit_2e)
```
Removing `rank` leads to `ysdeg` and `degree` now being statistically significant in the model. However, the adjusted R-square is lower than in the previous 2 models and the residual standard error is greater making this model not the best fit.
## f
This final model creates a new variable `new_hire` using `ysdeg`. Those who were hired within 15 years or less were coded as 1 and everyone else 0. To avoid multicollinearity, the variable year was removed from the model. This is because it's possible the years in faculty's current rank are the same as years since the highest degree was earned (or in other words, years since hired).
```{r}
salary$new_hire <- ifelse(salary$ysdeg <= 15, 1, 0)
fit_2f <- lm(salary ~ sex + degree + new_hire, data = salary)
summary(fit_2f)
```
This model shows the null hypothesis should be rejected and indicates that faculty hired by the new dean are actually making a lower salary than those who were hired more than 15 years ago.
# Question 3
```{r}
data("house.selling.price")
```
This questions uses the dataset house.selling.price from the package smss.
## a
```{r}
fit_3a <- lm(Price ~ Size + New, data = house.selling.price)
summary(fit_3a)
```
This first model examines how the size of a house and being new or not influences price. It reveals the variables are statistically significant. Furthermore, we learn that a 1 unit change in size leads to a $116.13 increase in price and when a house is new, it will cost $57,736.28 more than a house that is old when the size is held constant.
## b
The equation for the predicted selling price when the home is new is: `price = -40230.867 + 116.132Size + 57736.283New`
```{r}
```
## c
The predicted selling price for a home of 3000 square feed that is new is below.
```{r}
df_new <- data.frame(Size = 3000, New = 1)
predict(fit_3a, newdata = df_new)
```
The predicted selling price for a home of 3000 square feed that is not new is below.
```{r}
df_not_new <- data.frame(Size = 3000, New = 0)
predict(fit_3a, newdata = df_not_new)
```
## d
The next model includes an interaction term between size and new.
```{r}
fit_3d <- lm(Price ~ Size * New, data = house.selling.price)
summary(fit_3d)
```
## e
```{r}
ggplot(house.selling.price,aes(y=Price,x=Size,color=factor(New)))+
geom_point()+
stat_smooth(method="lm",se=TRUE)
```
## f
The predicted selling price, using the model with interaction terms, for a home of 3000 square feed that is new is below.
```{r}
predict(fit_3d, newdata = df_new)
```
The predicted selling price, using the model with interaction terms, for a home of 3000 square feed that is not new is below.
```{r}
predict(fit_3d, newdata = df_not_new)
```
## g
The predicted selling price, using the model with interaction terms, for a home of 1500 square feed that is new is below.
```{r}
df_new <- data.frame(Size = 1500, New = 1)
predict(fit_3d, newdata = df_new)
```
The predicted selling price, using the model with interaction terms, for a home of 1500 square feed that is not new is below.
```{r}
df_not_new <- data.frame(Size = 1500, New = 0)
predict(fit_3d, newdata = df_not_new)
```
In comparing the predictions for part F and G, it can be observed that the difference in selling price between a new and not new home increases as the the size of the home increases.
## h
I believe the model with the interaction term best represents the relationship of size and new to the outcome price. I've come to this conclusion as the model with the interaction term has a higher adjusted R-squared and lower residual standard error than the model without the interaction term.