---
title: "Homework 4"
author: "Felix Betanourt"
description: "DACSS 603 HW4"
date: "04/25/2023"
format:
html:
toc: true
code-fold: true
code-copy: true
code-tools: true
categories:
- hw4
- simple regression
- multiple regression
- non-linear function
- p-value
- model fitting
editor:
markdown:
wrap: 72
---
```{r}
#| label: setup
#| warning: false
knitr::opts_chunk$set(echo = TRUE, warning = FALSE)
```
## Homework 4
DACSS 603, Spring 2023
```{r}
# Load packages (the tidyverse attaches dplyr and ggplot2)
suppressPackageStartupMessages(library(tidyverse))
suppressPackageStartupMessages(library(formattable))
suppressPackageStartupMessages(library(kableExtra))
suppressPackageStartupMessages(library(alr4))
suppressPackageStartupMessages(library(smss))
```
Some of the questions use data from the alr4 and smss R packages. You would need to call in those packages in R (no need for an install.packages() call in your .qmd file, though—just use library()) and load the data using the data() function.
## Question 1
For recent data in Jacksonville, Florida, on y = selling price of home (in dollars), x1 = size of home (in square feet), and x2 = lot size (in square feet), the prediction equation is ŷ = −10,536 + 53.8x1 + 2.84x2.
A. A particular home of 1240 square feet on a lot of 18,000 square feet sold for $145,000. Find the predicted selling price and the residual, and interpret.
Predicted value:
```{r}
# Plug x1 = 1240 sq ft and x2 = 18000 sq ft into the prediction equation
predicted <- -10536 + 53.8*1240 + 2.84*18000
predicted
```
Residual:
```{r}
# Residual = observed selling price minus predicted price
residual <- 145000 - predicted
residual
```
The predicted value is $107,296, so the residual is $145,000 − $107,296 = $37,704. The residual is positive, meaning the model under-predicts this home's selling price: it sold for $37,704 more than predicted. It would be worth checking the model against other observed sales and adjusting the regression for a better fit.
B. For fixed lot size, how much is the house selling price predicted to increase for each square foot increase in home size? Why?
The slope for home size is 53.8, so for a fixed lot size the selling price is predicted to increase by $53.80 for each additional square foot of home size. This is because when x2 (lot size) is held constant, only the 53.8x1 term changes as home size increases.
C. According to this prediction equation, for fixed home size, how much would lot size need to increase to have the same impact as a one-square-foot increase in home size?
```{r}
# Ratio of the home-size slope to the lot-size slope
x2 <- (53.8/2.84)
x2
```
Lot size would need to increase by about 19 square feet (53.8 / 2.84 ≈ 18.94) to have the same impact on the predicted price as a one-square-foot increase in home size.
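As a sanity check, multiplying that lot-size increase by the lot-size slope recovers the home-size slope:

```{r}
# 18.94 extra sq ft of lot at $2.84/sq ft recovers $53.80, the home-size slope
2.84 * (53.8 / 2.84)
```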
## Question 2
(Data file: salary in alr4 R package).
```{r}
data(salary)  # load the dataset from alr4
str(salary)
```
The data file concerns salary and other characteristics of all faculty in a small Midwestern college collected in the early 1980s for presentation in legal proceedings for which discrimination against women in salary was at issue.
All persons in the data hold tenured or tenure track positions; temporary faculty are not included. The variables include degree, a factor with levels PhD and MS; rank, a factor with levels Asst, Assoc, and Prof; sex, a factor with levels Male and Female; Year, years in current rank; ysdeg, years since highest degree, and salary, academic year salary in dollars.
A. Test the hypothesis that the mean salary for men and women is the same, without regard to any other variable but sex. Explain your findings.
```{r}
t.test(salary ~ sex, data = salary)
```
Since the p-value (0.09) is greater than 0.05, we fail to reject the null hypothesis: there is no statistically significant difference in mean salary between males and females.
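For context, a grouped summary (a sketch using dplyr, already loaded via the tidyverse) shows the sample sizes and spread behind the test:

```{r}
# Sample size, mean, and SD of salary by sex
salary %>%
  group_by(sex) %>%
  summarise(n = n(), mean_salary = mean(salary), sd_salary = sd(salary))
```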
B. Run a multiple linear regression with salary as the outcome variable and everything else as predictors, including sex. Assuming no interactions between sex and the other predictors, obtain a 95% confidence interval for the difference in salary between males and females.
```{r}
summary(model1 <- lm(salary ~ ysdeg + year + rank + sex + degree, data = salary))
# 95% CI for the sexFemale coefficient: the sex difference adjusted for the other predictors
confint(model1, "sexFemale", level = 0.95)
```
Based on the fitted coefficient (1166.37) and its standard error (925.57, on 45 df), the 95% confidence interval for the female−male salary difference is roughly −$698 to $3,031. Since the interval includes zero, the adjusted difference is not statistically significant.
C. Interpret your finding for each predictor variable; discuss (a) statistical significance, (b) interpretation of the coefficient / slope in relation to the outcome variable and other variables.
a) The variables that significantly predict salary are years in current rank (year) and rank, as those are the predictors with p-values < 0.05.
b) In terms of the coefficients (a tidy summary table follows this list):
- Years in current rank: each additional year in the current rank is associated with a $476 increase in salary, holding the other predictors constant.
- Rank: the baseline is Assistant, so being an Associate is associated with a salary $5,292 higher than an Assistant's, and being a Professor with a salary $11,119 higher.
- Sex: females are estimated to make $1,166 more than males, but the coefficient is not statistically significant.
- Degree: holding a PhD is associated with $1,389 more than holding a Masters; also not statistically significant.
- Years since highest degree earned: more years is associated with lower salary, specifically about $125 less per year since the highest degree; this coefficient is not statistically significant either.
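For a compact overview of all coefficients with confidence intervals, here is a sketch (assuming the broom package is installed, as it ships with the tidyverse):

```{r}
# Tidy coefficient table with 95% CIs for model1
library(broom)
tidy(model1, conf.int = TRUE)
```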
D. Change the baseline category for the rank variable. Interpret the coefficients related to rank again.
```{r}
salary$rank <- relevel(salary$rank, ref = "Assoc")  # make Associate the baseline
summary(model2 <- lm(salary ~ ysdeg + year + rank + sex + degree, data = salary))
```
With Associate as the baseline, being an Assistant is associated with $5,292 less in salary than an Associate, while being a Professor is associated with $5,826 more than an Associate.
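Releveling only changes which category the coefficients are measured against; the underlying fit is identical. A quick sketch to confirm:

```{r}
# Same fitted values before and after releveling
all.equal(unname(fitted(model1)), unname(fitted(model2)))
```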
E. Finkelstein (1980), in a discussion of the use of regression in discrimination cases, wrote, “[a] variable may reflect a position or status bestowed by the employer, in which case if there is discrimination in the award of the position or status, the variable may be ‘tainted.’” Thus, for example, if discrimination is at work in promotion of faculty to higher ranks, using rank to adjust salaries before comparing the sexes may not be acceptable to the courts.
Exclude the variable rank, refit, and summarize how your findings changed, if they did.
```{r}
summary(model3 <- lm(salary ~ ysdeg + year + sex + degree, data = salary))
```
By excluding rank, we have these changes in the model:
- R-squared drops from 0.85 to 0.63, which means rank carries substantial weight in explaining salary (see the F-test sketch after this list).
- In this model, years since highest degree (ysdeg) and degree both have significant relationships with salary; neither was significant when rank was included. Notably, the ysdeg coefficient flips sign, from −124.57 to +339.40.
- Years in current rank becomes less significant, moving from a p-value < 0.001 to a p-value < 0.05.
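A nested-model F-test makes rank's contribution explicit; this sketch compares the model without rank (model3) to the full model (model1):

```{r}
# Does adding rank significantly improve the model without it?
anova(model3, model1)
```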
F. Everyone in this dataset was hired the year they earned their highest degree. It is also known that a new Dean was appointed 15 years ago, and everyone in the dataset who earned their highest degree 15 years ago or less than that has been hired by the new Dean.
Some people have argued that the new Dean has been making offers that are a lot more generous to newly hired faculty than the previous one and that this might explain some of the variation in Salary.
Create a new variable that would allow you to test this hypothesis and run another multiple regression model to test it. Select variables carefully to make sure there is no multicollinearity. Explain why multicollinearity would be a concern in this case and how you avoided it. Do you find support for the hypothesis that the people hired by the new Dean are making more than those who were not?
```{r}
# Hire-cohort indicator: ysdeg <= 15 means the faculty member was hired by the new Dean
salary$tenure <- ifelse(salary$ysdeg <= 15, "15 or less", "More than 15")
salary$tenure <- as.factor(salary$tenure)

# Numeric recodes so every predictor can enter a correlation matrix
salary2 <- salary %>%
  mutate(degree_o = case_when(
    degree == "Masters" ~ 0,
    degree == "PhD" ~ 1
  )) %>%
  mutate(rank_o = case_when(
    rank == "Asst" ~ 1,
    rank == "Assoc" ~ 2,
    rank == "Prof" ~ 3
  )) %>%
  mutate(sex_o = case_when(
    sex == "Male" ~ 0,
    sex == "Female" ~ 1
  )) %>%
  mutate(tenure_o = case_when(
    tenure == "15 or less" ~ 1,
    tenure == "More than 15" ~ 0
  ))

# Unadjusted comparison of the two hiring cohorts
t.test(salary ~ tenure_o, data = salary2)

# Pairwise correlations among the candidate predictors
cor_matrix <- cor(salary2[, c("ysdeg", "rank_o", "year", "sex_o", "tenure_o", "degree_o")], use = "complete.obs")
round(cor_matrix, 2)
```
Years since highest degree earned (ysdeg) has a strong correlation (−0.84) with tenure, the new variable splitting faculty by when they were hired. This was expected, because tenure was created directly from ysdeg (everyone in this dataset was hired the year they earned their highest degree).
Including both variables would introduce multicollinearity, which inflates standard errors and makes the individual coefficients unstable, so I exclude years since highest degree from the model and keep tenure.
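To make the concern concrete, here is a sketch that deliberately fits an overparameterized model with both ysdeg and tenure_o and checks variance inflation factors; vif() comes from the car package, which alr4 loads as a dependency:

```{r}
# Hypothetical check: with both ysdeg and tenure_o included, their VIFs are elevated
collinear_fit <- lm(salary ~ ysdeg + year + tenure_o + rank + sex_o + degree_o, data = salary2)
car::vif(collinear_fit)
```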
```{r}
summary(model4 <- lm(salary ~ year + tenure_o + rank + sex_o + degree_o, data = salary2))
```
With a p-value just under 0.05, being hired by the new Dean (15 or fewer years since highest degree) is associated with a higher salary than being hired earlier. Specifically, faculty in the "15 or less" group make about $2,163 more than the other group, holding the other predictors constant, which supports the hypothesis.
## Question 3
(Data file: house.selling.price in smss R package)
```{r}
data("house.selling.price")
house <- house.selling.price
str(house)
```
A. Using the house.selling.price data, run and report regression results modeling y = selling price (in dollars) in terms of size of home (in square feet) and whether the home is new (1 = yes; 0 = no). In particular, for each variable; discuss statistical significance and interpret the meaning of the coefficient.
```{r}
summary(model5 <- lm(Price ~ Size + New, data = house))
```
Both predictors (Size and New) are statistically significant for predicting home price. Specifically, for every additional square foot the predicted price increases by $116.13, and a new home is predicted to sell for $57,736 more than an otherwise comparable home that is not new.
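Before interpreting the equations further, a quick plot (a sketch using ggplot2, already loaded) shows the two groups with per-group least-squares lines:

```{r}
# Price vs. size, colored by new status; per-group lm lines preview the interaction in part D
ggplot(house, aes(x = Size, y = Price, color = factor(New))) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(color = "New (1 = yes)")
```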
B. Report and interpret the prediction equation, and form separate equations relating selling price to size for new and for not new homes.
The prediction equation, with y = selling price of home (in dollars), x1 = size of home (in square feet), and x2 = home is new (1 = yes; 0 = no), is:

ŷ = −40230.867 + 116.132x1 + 57736.283x2

The slope on size is the same for both groups; the New indicator only shifts the intercept.

For new homes (x2 = 1):

ŷ = (−40230.867 + 57736.283) + 116.132x1 = 17505.416 + 116.132x1

For homes that are not new (x2 = 0), the indicator term drops out:

ŷ = −40230.867 + 116.132x1

```{r}
# Intercepts of the two lines (the slope on Size is shared)
intercept_new <- coef(model5)[1] + coef(model5)[3]
intercept_new
intercept_notnew <- coef(model5)[1]
intercept_notnew
```
C. Find the predicted selling price for a home of 3000 square feet that is (i) new
```{r}
predicted_new_house <- -40230.867 + 116.132*3000 + 57736.283  # x2 = 1
predicted_new_house
```
(ii) not new
```{r}
predicted_nonew_house <- -40230.867 + 116.132*3000  # x2 = 0, so the New term drops out
predicted_nonew_house
```
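The same predictions can be obtained with predict(), which avoids hand-copying coefficients; a quick cross-check:

```{r}
# Cross-check: predictions for a 3000 sq ft home, new (1) and not new (0)
predict(model5, newdata = data.frame(Size = 3000, New = c(1, 0)))
```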
D. Fit another model, this time with an interaction term allowing interaction between size and new, and report the regression results
```{r}
summary(model7 <- lm(Price ~ Size * New, data = house))  # Size * New expands to Size + New + Size:New
```
E. Report the lines relating the predicted selling price to the size for homes that are (i) new, (ii) not new.
The prediction equation for this model is:

ŷ = −22227.808 + 104.438·size − 78527.502·new + 61.916·(size × new)

Setting new to 1 or 0 gives the two lines:

(i) New homes: ŷ = (−22227.808 − 78527.502) + (104.438 + 61.916)·size = −100755.31 + 166.354·size

(ii) Homes that are not new: ŷ = −22227.808 + 104.438·size
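These intercepts and slopes can also be pulled straight from the fitted coefficients; a small sketch:

```{r}
# Build the two lines from model7's coefficients
b <- coef(model7)
c(intercept_new = unname(b["(Intercept)"] + b["New"]),
  slope_new = unname(b["Size"] + b["Size:New"]),
  intercept_notnew = unname(b["(Intercept)"]),
  slope_notnew = unname(b["Size"]))
```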
F. Find the predicted selling price for a home of 3000 square feet that is (i) new,
```{r}
newdata <- data.frame(
New = 1,
Size = 3000
)
predicted_price <- predict(model7, newdata)
predicted_price
```
(ii) not new.
```{r}
newdata <- data.frame(
New = 0,
Size = 3000
)
predicted_price2 <- predict(model7, newdata)
predicted_price2
```
G. Find the predicted selling price for a home of 1500 square feet that is (i) new,
```{r}
newdata <- data.frame(
New = 1,
Size = 1500
)
predicted_price3 <- predict(model7, newdata)
predicted_price3
```
(ii) not new.
```{r}
newdata2 <- data.frame(
New = 0,
Size = 1500
)
predicted_price4 <- predict(model7, newdata2)
predicted_price4
```
Comparing to (F), explain how the difference in predicted selling prices changes as the size of home increases.
```{r}
# Gap between new and not-new predictions at 3,000 sq ft (part F)
diff_f <- predicted_price - predicted_price2
diff_f
# Gap between new and not-new predictions at 1,500 sq ft (part G)
diff_g <- predicted_price3 - predicted_price4
diff_g
```
The difference in predicted price between new and not-new homes grows as size increases: at 3,000 square feet a new home is predicted to sell for about $107,220 more than one that is not new, while at 1,500 square feet the gap is only about $14,346.
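This widening gap follows directly from the interaction term: the new-minus-not-new difference at any size is the New coefficient plus the Size:New coefficient times size. A small sketch:

```{r}
# New-minus-not-new price gap as a function of home size
gap <- function(size) unname(coef(model7)["New"] + coef(model7)["Size:New"] * size)
gap(c(1500, 3000))
```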
H. Do you think the model with interaction or the one without it represents the relationship of size and new to the outcome price? What makes you prefer one model over another?
Both models predict price well: the overall F-statistic is highly significant in each. The F value itself is larger for the no-interaction model, but that largely reflects its smaller number of parameters rather than a better fit; the interaction term itself is significant (p ≈ 0.005), and the adjusted R-squared is slightly higher with the interaction (0.736 vs. 0.717).
The R-squared values are similar in both models, so the interaction adds only a modest amount of explained variation in home price.
On the other hand, the interaction model provides an additional insight: the price difference between new and not-new houses is moderated by size. This moderation matters for understanding price variation and is not captured by the no-interaction model.
For those reasons, I would rather use the model with the interaction.
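A formal nested comparison supports this preference; as a sketch, note that the F-test below is equivalent to the t-test on the Size:New coefficient:

```{r}
# Does adding the interaction significantly improve fit?
anova(model5, model7)
# Information criteria (lower is better) give another view
AIC(model5, model7)
```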