---
title: "Homework 4"
author: "Justine Shakespeare"
description: "Homework 4 for DACSS 603 Spring 2023"
date: "04/24/2023"
format:
html:
toc: true
code-fold: true
code-copy: true
code-tools: true
categories:
- hw4
- multiple linear regression
- Justine Shakespeare
---
```{r, results=FALSE, message=FALSE}
library(tidyverse)
library(alr4)
library(smss)
library(stargazer)
```
# Question 1
ŷ = −10,536 + 53.8x1 + 2.84x2
x1 = size of home
x2 = size of lot
## A
*A particular home of 1240 square feet on a lot of 18,000 square feet sold for $145,000. Find the predicted selling price and the residual, and interpret.*
x1 = 1240
x2 = 18000
y = 145000
```{r}
# predicted selling price from the given equation
x1 <- 1240
x2 <- 18000
-10536 + (53.8 * x1) + (2.84 * x2)
```
The predicted selling price for this house is \$107,296. Given that the actual selling price is \$145,000, the model underpredicts the value of the house.
The residual for this data point is the difference between the observed value y and the predicted value ŷ. In this case, that is \$145,000 - \$107,296, which is \$37,704. This house sold for \$37,704 more than the predicted value.
```{r}
# residual = observed price - predicted price
145000 - 107296
```
## B
*For fixed lot size, how much is the house selling price predicted to increase for each square-foot increase in home size? Why?*
The house selling price is predicted to increase \$53.80 for each square-foot increase, controlling for lot size. 53.80 is the slope or coefficient of the house size.
ŷ = −10,536 + 53.8x1 + 2.84x2
## C
*According to this prediction equation, for fixed home size, how much would lot size need to increase to have the same impact as a one-square-foot increase in home size?*
A one-square-foot increase in the size of the home would lead to a \$53.80 increase in the sales price, and a one-square-foot increase in the size of the lot would lead to a \$2.84 increase in the sales price. To find how much the lot size would need to increase to have the same impact as a one-square-foot increase in home size, we can set up the equation 53.8 = 2.84x and solve for x.
```{r}
# square feet of lot size equivalent to one square foot of home size
53.8 / 2.84
```
The lot size would need to increase by about 18.94 square feet to increase the sales price as much as a one-square-foot increase in home size.
# Question 2
```{r}
data(salary)
glimpse(salary)
```
## A
*Test the hypothesis that the mean salary for men and women is the same, without regard to
any other variable but sex. Explain your findings.*
To test the hypothesis that the mean salary for men and women is the same, we'll run a two-sample t-test.
```{r}
t.test(formula = salary ~ sex, data = salary)
```
The p-value of this test (0.09) is larger than 0.05, so if we choose that as our alpha value we fail to reject the null hypothesis that the mean salary for men and women is the same. The 95% confidence interval, which spans from -567.85 to 7,247.15, also includes 0, indicating that the means are not significantly different from one another.
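As a quick cross-check (a sketch, not something the question requires), the same comparison can be run as a bivariate regression; note that unlike the Welch test above, `lm()` assumes equal variances in the two groups.
```{r}
# the same male/female salary comparison as a simple regression
summary(lm(salary ~ sex, data = salary))
```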
## B
*Run a multiple linear regression with salary as the outcome variable and everything else as
predictors, including sex. Assuming no interactions between sex and the other predictors,
obtain a 95% confidence interval for the difference in salary between males and females.*
```{r}
QbLM <- lm(formula = salary ~ degree + rank + sex + year + ysdeg, data = salary)
stargazer(QbLM, type = "text")
```
After running the multiple linear regression and using the `stargazer()` function, we see in the results that the coefficient on `sexFemale` is 1,166.373 with a standard error of 925.569. We can also see that there are 52 observations; with 7 coefficients estimated, that leaves 52 - 7 = 45 residual degrees of freedom. With these figures we can calculate the 95% confidence interval.
```{r}
# CI = estimate ± (t × se), using the t-distribution with 45 df
t_score <- qt(.025, df = 45) # lower-tail 2.5% quantile, so t_score is negative
1166.373 + (t_score * 925.569) # lower bound
1166.373 - (t_score * 925.569) # upper bound
```
Or we can use the `confint()` function in R:
```{r}
confint(QbLM)
```
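`confint()` can also return just the row for a single term if we pass its name:
```{r}
# 95% CI for the sexFemale coefficient alone
confint(QbLM, "sexFemale", level = 0.95)
```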
## C
*Interpret your finding for each predictor variable; discuss (a) statistical significance, (b)
interpretation of the coefficient / slope in relation to the outcome variable and other variables*
In this regression analysis the `sex`, `ysdeg`, and `degree` variables are not statistically significant, so I will not interpret their coefficients.
The `rank` and `year` variables were found to be statistically significant at the 0.01 (1%) level. The coefficients indicate that a person with the "Assoc" rank will make \$5,292.36 more than a person with the "Asst" rank (since the "Asst" rank does not appear in the regression output, we know it is the baseline). A person with the "Prof" rank will make \$11,118.76 more than a person with the "Asst" rank.
With each unit increase of the `year` variable (one more year in the current rank), a person is predicted to make \$476.31 more in salary, holding the other variables constant.
## D
*Change the baseline category for the rank variable. Interpret the coefficients related to rank
again.*
```{r}
salary$rank <- relevel(salary$rank, ref = "Prof")
summary(lm(formula = salary ~ degree + rank + sex + year + ysdeg, data = salary))
```
I used the `relevel()` function to change the baseline category for the `rank` variable to "Prof". Now the results tell us that people with an "Asst" rank make \$11,118.76 less than people with a "Prof" rank, and people with an "Assoc" rank make \$5,826.40 less than those with a "Prof" rank. Both of these findings are statistically significant.
## E
...
*Exclude the variable rank, refit, and summarize how your findings changed, if they did.*
```{r}
# refit the full model (rank now has "Prof" as its baseline) and a model without rank
QbLM <- lm(formula = salary ~ degree + rank + sex + year + ysdeg, data = salary)
QbLM_noRank <- lm(formula = salary ~ degree + sex + year + ysdeg, data = salary)
stargazer(QbLM, QbLM_noRank, type = "text")
```
Removing the `rank` variable from the analysis changes the significance of some variables. With the `rank` variable included in the model the variables `degree` and `ysdeg` were not significant and the variable `year` was significant at the 0.01 (1%) level. With the `rank` variable removed both the `degree` and `ysdeg` variables are now statistically significant, with `degree` at the 0.05 (5%) level and `ysdeg` at the 0.01 (1%) level. The `year` variable also dropped one significance level, from 0.01 (1%) to 0.05 (5%).
The sign also flipped for the variables `ysdeg`, `sexFemale`, and `degreePhD`.
## F
*Everyone in this dataset was hired the year they earned their highest degree. It is also
known that a new Dean was appointed 15 years ago, and everyone in the dataset who
earned their highest degree 15 years ago or less than that has been hired by the new Dean.
Some people have argued that the new Dean has been making offers that are a lot more
generous to newly hired faculty than the previous one and that this might explain some of
the variation in Salary.*
*Create a new variable that would allow you to test this hypothesis and run another multiple
regression model to test this. Select variables carefully to make sure there is no
multicollinearity. Explain why multicollinearity would be a concern in this case and how
you avoided it. Do you find support for the hypothesis that the people hired by the new
Dean are making higher than those that were not?*
```{r}
# new_dean = 1 if hired by the new Dean (highest degree earned 15 or fewer years ago)
Q2F <- salary %>%
  mutate(new_dean = case_when(
    ysdeg <= 15 ~ 1,
    ysdeg > 15 ~ 0))
summary(lm(salary ~ new_dean + degree + rank + sex + year, data = Q2F))
```
Since the new variable, `new_dean`, was built directly from the `ysdeg` variable, I did not include `ysdeg` in the regression model once `new_dean` was included. Multicollinearity is a concern here because two predictors that carry nearly the same information inflate the standard errors of each other's coefficients and make the estimates unstable. We can confirm that these two variables would likely cause multicollinearity by checking that their correlation is high.
```{r}
cor.test(Q2F$new_dean, Q2F$ysdeg)
```
It appears that these two variables have a high negative correlation.
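As an additional diagnostic (a sketch; `vif()` comes from the `car` package, which loads as a dependency of `alr4`), variance inflation factors for a model that mistakenly included both variables would make the redundancy visible:
```{r}
# VIFs for a model including both new_dean and ysdeg; large values flag collinearity
car::vif(lm(salary ~ new_dean + ysdeg + degree + rank + sex + year, data = Q2F))
```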
The output of the regression with the `new_dean` variable shows that a person hired by the new dean is predicted to earn \$2,163.46 more in salary than a person hired by the old dean, holding the other variables constant. This finding is significant at the 0.05 (5%) level, so we do find support for the hypothesis that people hired by the new Dean are paid more.
# Question 3
## A
*Using the house.selling.price data, run and report regression results modeling y = selling price (in dollars) in terms of size of home (in square feet) and whether the home is new (1 = yes; 0 = no). In particular, for each variable; discuss statistical significance and interpret the
meaning of the coefficient.*
```{r}
data(house.selling.price)
glimpse(house.selling.price)
```
```{r}
houseData <- lm(Price ~ Size + New, data = house.selling.price)
summary(houseData)
```
The regression output indicates that both the `Size` and the `New` variables are significant (`Size` at the 0.1% level and `New` at the 1% level). The coefficients tell us that with each one-square-foot increase in size, the predicted price increases by \$116.13, and that the predicted price for new houses is \$57,736.28 more than for old houses of the same size.
## B
*Report and interpret the prediction equation, and form separate equations relating selling price to size for new and for not new homes.*
E(price) = -40,230.867 + 116.132 * Size + 57,736.283 * New
*New home:*
E(price) = -40,230.867 + 116.132 * Size + 57,736.283 * 1
Substituting New = 1 for a new home gives the following equation:
E(price) = 17,505.42 + 116.132 * Size
*Old home:*
E(price) = -40,230.867 + 116.132 * Size + 57,736.283 * 0
Substituting New = 0 for an old home gives the following equation:
E(price) = -40,230.867 + 116.132 * Size
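The new-home intercept can also be recovered directly from the fitted coefficients (a quick sketch using the `houseData` object from Part A):
```{r}
# new-home intercept = overall intercept + New coefficient
coef(houseData)["(Intercept)"] + coef(houseData)["New"]
# old-home intercept is just the overall intercept
coef(houseData)["(Intercept)"]
```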
## C
*Find the predicted selling price for a home of 3000 square feet that is (i) new, (ii) not new.*
```{r}
# Equation: -40,230.867 + 116.132 * size + 57736.283 * new
# new home
size <- 3000
new <- 1
-40230.867 + 116.132 * size + 57736.283 * new
# old home
size <- 3000
new <- 0
-40230.867 + 116.132 * size + 57736.283 * new
```
The predicted price of a new 3,000 square foot home is \$365,901.40 and the predicted price of an old 3,000 square foot home is \$308,165.10.
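The same predictions can be obtained from the fitted model object (a quick sketch, again using `houseData` from Part A):
```{r}
# predicted prices at 3000 sq ft for a new (New = 1) and an old (New = 0) home
predict(houseData, newdata = data.frame(Size = 3000, New = c(1, 0)))
```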
## D
*Fit another model, this time with an interaction term allowing interaction between size and new, and report the regression results*
```{r}
# Size * New expands to Size + New + Size:New
summary(lm(Price ~ Size * New, data = house.selling.price))
```
The regression results indicate that the `Size` variable is still significant at the 0.001 (0.1%) level, but the `New` variable is no longer significant. The interaction term `Size:New` is significant at the 0.01 (1%) level.
## E
*Report the lines relating the predicted selling price to the size for homes that are (i) new, (ii) not new.*
I assume by "report the lines" we mean interpret the coefficients here...
Equation: E(price) = -22,227.808 + 104.438 * size - 78,527.502 * new + 61.916 * (size * new)
*New house:*
We replace "new" with 1:
E(price) = -22,227.808 + 104.438 * size - 78,527.502 * 1 + 61.916 * (size * 1)
E(price) = -100,755.31 + 166.354 * size
*Old house:*
We replace "new" with 0:
E(price) = -22,227.808 + 104.438 * size - 78,527.502 * 0 + 61.916 * (size * 0)
E(price) = -22,227.808 + 104.438 * size
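These lines can also be derived from the fitted coefficients (a sketch; `HSP_ia` is a name introduced here for the interaction fit):
```{r}
# refit the interaction model and derive the new-home line from its coefficients
HSP_ia <- lm(Price ~ Size * New, data = house.selling.price)
coef(HSP_ia)["(Intercept)"] + coef(HSP_ia)["New"] # new-home intercept
coef(HSP_ia)["Size"] + coef(HSP_ia)["Size:New"]   # new-home slope
```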
## F
*Find the predicted selling price for a home of 3000 square feet that is (i) new, (ii) not new.*
```{r}
# new house
size <- 3000
new <- 1
-22227.808 + 104.438 * size - 78527.502 * new + 61.916*(size*new)
# old house
new <- 0
-22227.808 + 104.438 * size - 78527.502 * new + 61.916*(size*new)
```
(i)
The predicted price for a new house that is 3,000 square feet is \$398,306.70.
(ii)
The predicted price for an old house of 3,000 square feet is \$291,086.20.
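Equivalently, with `predict()` and the `HSP_ia` fit from the sketch in Part E:
```{r}
# predicted prices at 3000 sq ft for new and old homes from the interaction model
predict(HSP_ia, newdata = data.frame(Size = 3000, New = c(1, 0)))
```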
## G
*Find the predicted selling price for a home of 1500 square feet that is (i) new, (ii) not new. Comparing to (F), explain how the difference in predicted selling prices changes as the size of home increases.*
```{r}
# new house
size <- 1500
new <- 1
-22227.808 + 104.438 * size - 78527.502 * new + 61.916*(size*new)
# old house
new <- 0
-22227.808 + 104.438 * size - 78527.502 * new + 61.916*(size*new)
```
(i)
The predicted price for a new house that is 1,500 square feet is \$148,775.70.
(ii)
The predicted price for an old house of 1,500 square feet is \$134,429.20.
```{r}
# F
398306.7 - 291086.2
# G
148775.7 - 134429.2
```
The price difference between old and new houses is greater for larger houses (\$107,220.50) than for smaller houses (\$14,346.50). This is captured by the interaction term: each additional square foot adds \$61.92 more to a new home's predicted price than to an old home's.
## H
*Do you think the model with interaction or the one without it represents the relationship of size and new to the outcome price? What makes you prefer one model over another?*
```{r}
HSPmodel <- lm(Price ~ Size + New, data = house.selling.price)
HSPmodel_ia <- lm(Price ~ Size * New, data = house.selling.price)
stargazer(HSPmodel, HSPmodel_ia, type = "text")
```
I prefer the model with the interaction term because the interaction term is significant and both the R^2 and adjusted R^2 figures are higher than for the model without it. The residual standard error is also smaller for the model with the interaction term.
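Information criteria point the same way (a quick sketch; lower values are better):
```{r}
# compare the two fits by AIC and BIC (lower is better)
AIC(HSPmodel, HSPmodel_ia)
BIC(HSPmodel, HSPmodel_ia)
```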